Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-24 Thread Joel Fernandes (Google)
Hi Patrick,

On Thu, Mar 23, 2017 at 3:32 AM, Patrick Bellasi
 wrote:
[..]
>> > which can be used to defined tunable root constraints when CGroups are
>> > not available, and becomes RO when CGroups are.
>> >
>> > Can this be eventually an acceptable option?
>> >
>> > In any case I think that this feature will be mainly targeting CGroup
>> > based systems. Indeed, one of the main goals is to collect
>> > "application specific" information from "informed run-times". Being
>> > "application specific" means that we need a way to classify
>> > applications depending on the runtime context... and that capability
>> > in Linux is ultimately provided via the CGroup interface.
>>
>> I think the concern raised is more about whether CGroups is the right
>> interface to use for attaching capacity constraints to task or groups
>> of tasks, or is there a better way to attach such constraints?
>
> Notice that CGroups based classification allows to easily enforce
> the concept of "delegation containment". I think this feature should
> be nice to have whatever interface we choose.
>
> However, potentially we can define a proper per-task API; are you
> thinking to something specifically?
>

I was thinking: how about adding per-task capacity constraints to the
resource limits API, if that makes sense? There are already RLIMIT_CPU
and RLIMIT_NICE. An informed run-time could then modify the limits of
tasks using prlimit.
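
For illustration, a rough userspace sketch of what that could look like,
assuming a hypothetical RLIMIT_CAPACITY_MIN resource (no such resource
exists today); prlimit() itself is the existing syscall already used for
RLIMIT_CPU/RLIMIT_NICE:

#define _GNU_SOURCE
#include <sys/resource.h>
#include <sys/types.h>
#include <stdio.h>

/* Made-up resource number, purely for illustration. */
#ifndef RLIMIT_CAPACITY_MIN
#define RLIMIT_CAPACITY_MIN 16
#endif

/* An "informed run-time" would call this on behalf of a task. */
int set_task_min_capacity(pid_t pid, unsigned long min_cap)
{
	struct rlimit lim = {
		.rlim_cur = min_cap,	/* e.g. 200 out of 1024 */
		.rlim_max = 1024,
	};

	if (prlimit(pid, RLIMIT_CAPACITY_MIN, &lim, NULL) == -1) {
		perror("prlimit");
		return -1;
	}
	return 0;
}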

>> The other advantage of such interface is we don't have to
>> create a separate CGroup for every new constraint limit and can have
>> several tasks with different unique constraints.
>
> That's still possible using CGroups and IMO it will not be the "most
> common case".
> Don't you think that in general we will need to set constraints at
> applications level, thus group of tasks?

Some applications could be a single task, and not all tasks in an
application may need constraints, right?

> As a general rule we should probably go for an interface which makes
> easy the most common case.

I agree.

Thanks,
Joel


Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-24 Thread Joel Fernandes (Google)
Hi Tejun,

>> That's also why the proposed interface has now been defined as a extension of
>> the CPU controller in such a way to keep a consistent view.
>>
>> This controller is already used by run-times like Android to "scope" apps by
>> constraining the amount of CPUs resource they are getting.
>> Is that not a legitimate usage of the cpu controller?
>>
>> What we are doing here is just extending it a bit in such a way that, while:
>>
>>   {cfs,rt}_{period,runtime}_us limits the amount of TIME we can use a CPU
>>
>> we can also use:
>>
>>   capacity_{min,max} to limit the actual COMPUTATIONAL BANDWIDTH we can use
>>  during that time.
>
> Yes, we do have bandwidth restriction as a cgroup only feature, which
> is different from how we handle nice levels and weights.  Given the
> nature of bandwidth limits, if necessary, it is straight-forward to
> expose per-task interface.
>
> capacity min/max isn't the same thing.  It isn't a limit on countable
> units of a specific resource and that's why the interface you
> suggested for .min is different.  It's restricting attribute set which
> can be picked in the subhierarchy rather than controlling distribution
> of atoms of the resource.
>
> That's also why we're gonna have problem if we later decide we need a
> thread based API for it.  Once we make cgroup the primary owner of the
> attribute, it's not straight forward to add another owner.

Sorry, I don't immediately see why it is not straightforward to add a
per-task API later, once the CGroup interface is in place. If you don't
mind giving an example, that would help.

I can start with an example, say you have a single level hierarchy
(Top-app in Android terms is the set of tasks that are user facing and
we'd like to enforce some capacity minimums, background on the other
hand is the opposite):

                ROOT (min = 0, max = 1024)
                /                        \
               /                          \
  TOP-APP (min = 200, max = 1024)   BACKGROUND (min = 0, max = 500)

If in the future we want a per-task API to configure individual tasks
with these limits, it seems it would be straightforward to implement,
IMO.

As Patrick mentioned, all of the use cases needing this right now
involve an informed runtime placing a task in a group of tasks, not
setting attributes for each individual task. We already place tasks in
individual CGroups in Android based on the information the runtime has,
so adding in the capacity constraints will fit naturally while leaving
the door open for any future per-task API additions, IMO.
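
As a concrete sketch of the hierarchy above, assuming the constraints
end up exposed as cpu.capacity_min/cpu.capacity_max files of the cpu
controller (per this RFC) mounted at /sys/fs/cgroup/cpu; the paths and
group names are illustrative:

#include <stdio.h>

static int write_knob(const char *path, unsigned long val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fprintf(f, "%lu\n", val);
	return fclose(f);
}

int main(void)
{
	/* TOP-APP: min = 200, max = 1024 */
	write_knob("/sys/fs/cgroup/cpu/top-app/cpu.capacity_min", 200);
	write_knob("/sys/fs/cgroup/cpu/top-app/cpu.capacity_max", 1024);
	/* BACKGROUND: min = 0, max = 500 */
	write_knob("/sys/fs/cgroup/cpu/background/cpu.capacity_min", 0);
	write_knob("/sys/fs/cgroup/cpu/background/cpu.capacity_max", 500);
	return 0;
}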

Thanks,

Joel


Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-22 Thread Joel Fernandes (Google)
Hi,

On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasi
 wrote:
> On 20-Mar 13:15, Tejun Heo wrote:
>> Hello,
>>
>> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote:
[..]
>> > These attributes:
>> > a) are tunable at all hierarchy levels, i.e. root group too
>>
>> This usually is problematic because there should be a non-cgroup way
>> of configuring the feature in case cgroup isn't configured or used,
>> and it becomes awkward to have two separate mechanisms configuring the
>> same thing.  Maybe the feature is cgroup specific enough that it makes
>> sense here but this needs more explanation / justification.
>
> In the previous proposal I used to expose global tunables under
> procfs, e.g.:
>
>  /proc/sys/kernel/sched_capacity_min
>  /proc/sys/kernel/sched_capacity_max
>

But then we would lose out on being able to attach capacity
constraints to specific tasks or groups of tasks?

> which can be used to defined tunable root constraints when CGroups are
> not available, and becomes RO when CGroups are.
>
> Can this be eventually an acceptable option?
>
> In any case I think that this feature will be mainly targeting CGroup
> based systems. Indeed, one of the main goals is to collect
> "application specific" information from "informed run-times". Being
> "application specific" means that we need a way to classify
> applications depending on the runtime context... and that capability
> in Linux is ultimately provided via the CGroup interface.

I think the concern raised is more about whether CGroups is the right
interface to use for attaching capacity constraints to task or groups
of tasks, or is there a better way to attach such constraints?

I am actually looking at a workload where it's desirable to attach such
constraints to only one thread or task. In this case it would be a bit
of an overkill to use CGroups to attach such a property to just one task
with specific constraints, and it would be beneficial if, along with the
CGroup interface, there were also an interface to attach constraints to
individual tasks. The other advantage of such an interface is that we
don't have to create a separate CGroup for every new constraint limit
and can have several tasks with different, unique constraints.

Regards,
Joel


Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller

2017-03-13 Thread Joel Fernandes (Google)
On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
 wrote:
> The CPU CGroup controller allows to assign a specified (maximum)
> bandwidth to tasks within a group, however it does not enforce any
> constraint on how such bandwidth can be consumed.
> With the integration of schedutil, the scheduler has now the proper
> information about a task to select  the most suitable frequency to
> satisfy tasks needs.
[..]

> +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
> +struct cftype *cft)
> +{
> +   struct task_group *tg;
> +   u64 min_capacity;
> +
> +   rcu_read_lock();
> +   tg = css_tg(css);
> +   min_capacity = tg->cap_clamp[CAP_CLAMP_MIN];

Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in
the write path) to avoid load-tearing?
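
For reference, a minimal sketch of the suggested pattern; the function
and fields are from the RFC quoted above, while the closing
rcu_read_unlock()/return lines are assumed since the quote is truncated:

static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css,
				     struct cftype *cft)
{
	struct task_group *tg;
	u64 min_capacity;

	rcu_read_lock();
	tg = css_tg(css);
	/* Pairs with a WRITE_ONCE() in the write path to avoid tearing. */
	min_capacity = READ_ONCE(tg->cap_clamp[CAP_CLAMP_MIN]);
	rcu_read_unlock();

	return min_capacity;
}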

Thanks,
Joel


Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks

2017-03-13 Thread Joel Fernandes (Google)
Hi Patrick,

On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasi
 wrote:
> Currently schedutil enforce a maximum OPP when RT/DL tasks are RUNNABLE.
> Such a mandatory policy can be made more tunable from userspace thus
> allowing for example to define a reasonable max capacity (i.e.
> frequency) which is required for the execution of a specific RT/DL
> workload. This will contribute to make the RT class more "friendly" for
> power/energy sensible applications.
>
> This patch extends the usage of capacity_{min,max} to the RT/DL classes.
> Whenever a task in these classes is RUNNABLE, the capacity required is
> defined by the constraints of the control group that task belongs to.
>

We briefly discussed at Linaro Connect that this works well for
sporadic RT tasks that run briefly and then sleep for long periods of
time - so this patch is certainly good, but it's only a partial
solution to the problem of frequent short-sleepers, and something is
needed to keep the boost active across short non-RUNNABLE periods as
well. With many periodic RT tasks, the behavior is that they sleep for
short intervals and run for short intervals, periodically. In this case,
removing the clamp (or the boost, as in schedtune v2) on dequeue
essentially means that, during a narrow window, cpufreq can drop the
frequency only to make it go back up again.

Currently for schedtune v2, I am working on prototyping something like
the following for Android (a rough sketch follows after the list):
- If an RT task is enqueued, apply the boost.
- When the task is dequeued, start a timer for a "minimum deboost delay
time" before taking out the boost.
- If the task is enqueued again before the timer fires, cancel the timer.
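
A minimal sketch of that idea, assuming a hypothetical per-task boost
object and an hrtimer for the deboost delay; none of these symbols exist
in schedtune or schedutil, they are purely illustrative:

#include <linux/hrtimer.h>
#include <linux/ktime.h>

/* Hypothetical per-task boost state; not part of any posted patch. */
struct st_boost {
	struct hrtimer	deboost_timer;
	ktime_t		deboost_delay;	/* tunable "minimum deboost delay" */
	bool		active;
};

static enum hrtimer_restart st_deboost_fn(struct hrtimer *t)
{
	struct st_boost *b = container_of(t, struct st_boost, deboost_timer);

	/* Delay expired without a re-enqueue: finally drop the boost. */
	b->active = false;
	return HRTIMER_NORESTART;
}

static void st_boost_init(struct st_boost *b, u64 delay_ns)
{
	hrtimer_init(&b->deboost_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
	b->deboost_timer.function = st_deboost_fn;
	b->deboost_delay = ns_to_ktime(delay_ns);
	b->active = false;
}

static void st_enqueue_boost(struct st_boost *b)
{
	/* Re-enqueued before the timer fired: keep the boost, drop the timer. */
	hrtimer_cancel(&b->deboost_timer);
	b->active = true;
}

static void st_dequeue_boost(struct st_boost *b)
{
	/* Don't drop the boost immediately; arm the deboost delay instead. */
	hrtimer_start(&b->deboost_timer, b->deboost_delay, HRTIMER_MODE_REL);
}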

I don't think any "fix" for this particular issue should go into the
schedutil governor; it should be sorted out before going to cpufreq
itself (that is, before making the request). What do you think about
this?

Thanks,
Joel


Re: [Eas-dev] [PATCH V4 1/3] sched: cpufreq: Allow remote cpufreq callbacks

2017-07-27 Thread Joel Fernandes (Google)
On Wed, Jul 26, 2017 at 10:50 PM, Viresh Kumar <viresh.ku...@linaro.org> wrote:
> On 26-07-17, 22:34, Joel Fernandes (Google) wrote:
>> On Wed, Jul 26, 2017 at 2:22 AM, Viresh Kumar <viresh.ku...@linaro.org> 
>> wrote:
>> > @@ -221,7 +226,7 @@ static void sugov_update_single(struct 
>> > update_util_data *hook, u64 time,
>> > sugov_set_iowait_boost(sg_cpu, time, flags);
>> > sg_cpu->last_update = time;
>> >
>> > -   if (!sugov_should_update_freq(sg_policy, time))
>> > +   if (!sugov_should_update_freq(sg_policy, time, hook->cpu))
>> > return;
>>
>> Since with the remote callbacks now possible, isn't it unsafe to
>> modify sg_cpu and sg_policy structures without a lock in
>> sugov_update_single?
>>
>> Unlike sugov_update_shared, we don't acquire any lock in
>> sugov_update_single before updating these structures. Did I miss
>> something?
>
> As Peter already mentioned it earlier, the callbacks are called with
> rq locks held and so sugov_update_single() wouldn't get called in
> parallel for a target CPU.

Ah ok, I have to catch up with that discussion since I missed the
whole thing. Now that you will have me on CC, that shouldn't happen,
thanks and sorry about the noise.

> That's the only race you were worried about ?

Yes. So in that case, does it make sense to move the raw_spin_lock in
sugov_update_shared further down? (Just discussing; this point is
independent of your patch.) Something like:

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 622eed1b7658..9a6c12fb2c16 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -295,8 +295,6 @@ static void sugov_update_shared(struct
update_util_data *hook, u64 time,

sugov_get_util(&util, &max);

-   raw_spin_lock(&sg_policy->update_lock);
-
sg_cpu->util = util;
sg_cpu->max = max;
sg_cpu->flags = flags;
@@ -304,6 +302,8 @@ static void sugov_update_shared(struct
update_util_data *hook, u64 time,
sugov_set_iowait_boost(sg_cpu, time, flags);
sg_cpu->last_update = time;

+   raw_spin_lock(&sg_policy->update_lock);
+
if (sugov_should_update_freq(sg_policy, time)) {
if (flags & SCHED_CPUFREQ_RT_DL)
next_f = sg_policy->policy->cpuinfo.max_freq;



thanks,

-Joel


Re: [Eas-dev] [PATCH V4 0/3] sched: cpufreq: Allow remote callbacks

2017-07-27 Thread Joel Fernandes (Google)
Hi Viresh,

On Wed, Jul 26, 2017 at 10:46 PM, Viresh Kumar <viresh.ku...@linaro.org> wrote:
> On 26-07-17, 22:14, Joel Fernandes (Google) wrote:

>> Also one more comment about this usecase:
>>
>> You mentioned in our discussion at [2] sometime back, about the
>> question of initial utilization,
>>
>> "We don't have any such configurable way possible right
>> now, but there were discussions on how much utilization should a new
>> task be assigned when it first comes up."
>
> We still initialize it to a value, just that it isn't configurable.
> See below..
>
>> But, then in your cover letter above, you mentioned "This is verified
>> using ftrace". So my question is how has this been verified with
>> ftrace if the new initial utilization as you said in [2] is currently
>> still under discussion? Basically how could you verify with ftrace
>> that the target CPU frequency isn't increasing immediately on spawning
>> of a new task remotely, if the initial utilization of a new task isn't
>> something we set/configure with current code? Am I missing something?
>>
>> [2] https://lists.linaro.org/pipermail/eas-dev/2017-January/000785.html
>
> The statement "new tasks should receive maximum demand initially" is
> used to represent tasks which have high demand every time they run.
> For example scrolling of a web page or gallery on our phones. Yes,
> maybe I can use the work "migrated" (as you suggested later) as the
> history of its utilization will move with it then to the new CPU.
>
> But even without that, if you see the routine
> init_entity_runnable_average() in fair.c, the new tasks are
> initialized in a way that they are seen as heavy tasks. And so even
> for the first time they run, freq should normally increase on the
> target CPU (at least with above application).i

Ok, but the "heavy" in init_entity_runnable_average applies to the
load, not to util_avg. The util_avg is what's used for frequency scaling
IIUC, and it is set to 0 in that function, no?

>
> The application was written by Steve (all credit goes to him) before
> he left Linaro, but I did test it with ftrace. What I saw with ftrace
> was that the freq isn't reevaluated for almost an entire tick many
> times because we enqueued the task remotely. And that changes with
> this series.
>
>> > The reason being that this patchset only targets a corner case, where
>> > following are required to be true to improve performance and that
>> > doesn't happen too often with these tests:
>> >
>> > - Task is migrated to another CPU.
>> > - The task has maximum demand initially, and should take the CPU to
>>
>> Just to make the cover-letter more clear and also confirming with you
>> I understand the above usecase, maybe in the future this can reworded
>> from "initially" to "before the migration" and "take the CPU" to "take
>> the target CPU of the migration" ?
>
> I can reword it a bit, but the test case wasn't really migrating
> anything and was looking only at the initial loads.

Ok, I wasn't talking about the synthetic test in the second part of my
email above, but about the explanation you gave for the Galleryfling
improvement (that the migration of a task with high utilization doesn't
update the target frequency), which makes sense to me - so we are on the
same page about that.

thanks,

-Joel


Re: [Eas-dev] [PATCH V4 0/3] sched: cpufreq: Allow remote callbacks

2017-07-26 Thread Joel Fernandes (Google)
Hi Viresh,

On Wed, Jul 26, 2017 at 2:22 AM, Viresh Kumar  wrote:

>
> With Android UI and benchmarks the latency of cpufreq response to
> certain scheduling events can become very critical. Currently, callbacks
> into schedutil are only made from the scheduler if the target CPU of the
> event is the same as the current CPU. This means there are certain
> situations where a target CPU may not run schedutil for some time.
>
> One testcase to show this behavior is where a task starts running on
> CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
> system is configured such that new tasks should receive maximum demand
> initially, this should result in CPU0 increasing frequency immediately.
> Because of the above mentioned limitation though this does not occur.
> This is verified using ftrace with the sample [1] application.

I think you dropped [1] from your cover letter. Maybe you meant to add
it at the end of the cover letter?

I noticed from your v2 that it's:
https://pastebin.com/7LkMSRxE

Also one more comment about this usecase:

You mentioned in our discussion at [2] sometime back, about the
question of initial utilization,

"We don't have any such configurable way possible right
now, but there were discussions on how much utilization should a new
task be assigned when it first comes up."

But, then in your cover letter above, you mentioned "This is verified
using ftrace". So my question is how has this been verified with
ftrace if the new initial utilization as you said in [2] is currently
still under discussion? Basically how could you verify with ftrace
that the target CPU frequency isn't increasing immediately on spawning
of a new task remotely, if the initial utilization of a new task isn't
something we set/configure with current code? Am I missing something?

[2] https://lists.linaro.org/pipermail/eas-dev/2017-January/000785.html

>
> Maybe the ideal solution is to always allow remote callbacks but that
> has its own challenges:
>
> o There is no protection required for single CPU per policy case today,
>   and adding any kind of locking there, to supply remote callbacks,
>   isn't really a good idea.
>
> o If is local CPU isn't part of the same cpufreq policy as the target
>   CPU, then we wouldn't be able to do fast switching at all and have to
>   use some kind of bottom half to schedule work on the target CPU to do
>   real switching. That may be overkill as well.
>
>
> And so this series only allows remote callbacks for target CPUs that
> share the cpufreq policy with the local CPU.
>
> This series is tested with couple of usecases (Android: hackbench,
> recentfling, galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey
> board (64 bit octa-core, single policy). Only galleryfling showed minor
> improvements, while others didn't had much deviation.
>
> The reason being that this patchset only targets a corner case, where
> following are required to be true to improve performance and that
> doesn't happen too often with these tests:
>
> - Task is migrated to another CPU.
> - The task has maximum demand initially, and should take the CPU to

Just to make the cover letter clearer, and also to confirm that I
understand the above usecase: maybe in the future this could be reworded
from "initially" to "before the migration", and "take the CPU" to "take
the target CPU of the migration"?

>   higher OPPs.
> - And the target CPU doesn't call into schedutil until the next tick.

I found this usecase to be more plausible and can see this patch
series being useful there.

Could you also keep me in CC on these patches (at joe...@google.com)?
I'm interested in this series.

thanks!

-Joel


Re: [Eas-dev] [PATCH V4 1/3] sched: cpufreq: Allow remote cpufreq callbacks

2017-07-26 Thread Joel Fernandes (Google)
Hi Viresh,

On Wed, Jul 26, 2017 at 2:22 AM, Viresh Kumar  wrote:
> We do not call cpufreq callbacks from scheduler core for remote
> (non-local) CPUs currently. But there are cases where such remote
> callbacks are useful, specially in the case of shared cpufreq policies.
>
> This patch updates the scheduler core to call the cpufreq callbacks for
> remote CPUs as well.
>
> For now, all the registered utilization update callbacks are updated to
> return early if remote callback is detected. That is, this patch just
> moves the decision making down in the hierarchy.
>
> Later patches would enable remote callbacks for shared policies.
>
> Based on initial work from Steve Muckle.
>
> Signed-off-by: Viresh Kumar 

> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -72,10 +72,15 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>
>  / Governor internals ***/
>
> -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 
> time)
> +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 
> time,
> +int target_cpu)
>  {
> s64 delta_ns;
>
> +   /* Don't allow remote callbacks */
> +   if (smp_processor_id() != target_cpu)
> +   return false;
> +
> if (sg_policy->work_in_progress)
> return false;
>
> @@ -221,7 +226,7 @@ static void sugov_update_single(struct update_util_data 
> *hook, u64 time,
> sugov_set_iowait_boost(sg_cpu, time, flags);
> sg_cpu->last_update = time;
>
> -   if (!sugov_should_update_freq(sg_policy, time))
> +   if (!sugov_should_update_freq(sg_policy, time, hook->cpu))
> return;

Since remote callbacks are now possible, isn't it unsafe to modify the
sg_cpu and sg_policy structures without a lock in sugov_update_single?

Unlike sugov_update_shared, we don't acquire any lock in
sugov_update_single before updating these structures. Did I miss
something?


thanks,

-Joel


Re: [Eas-dev] [PATCH V4 2/3] cpufreq: schedutil: Process remote callback for shared policies

2017-07-26 Thread Joel Fernandes (Google)
On Wed, Jul 26, 2017 at 2:22 AM, Viresh Kumar  wrote:
> This patch updates the schedutil governor to process cpufreq utilization
> update hooks called for remote CPUs where the remote CPU is managed by
> the cpufreq policy of the local CPU.
>
> Based on initial work from Steve Muckle.
>
> Signed-off-by: Viresh Kumar 

Reviewed-by: Joel Fernandes 


thanks,

-Joel


> ---
>  kernel/sched/cpufreq_schedutil.c | 21 ++---
>  1 file changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c 
> b/kernel/sched/cpufreq_schedutil.c
> index bb834747e49b..c3baf70d360c 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -72,13 +72,12 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu);
>
>  / Governor internals ***/
>
> -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 
> time,
> -int target_cpu)
> +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 
> time)
>  {
> s64 delta_ns;
>
> -   /* Don't allow remote callbacks */
> -   if (smp_processor_id() != target_cpu)
> +   /* Allow remote callbacks only on the CPUs sharing cpufreq policy */
> +   if (!cpumask_test_cpu(smp_processor_id(), sg_policy->policy->cpus))
> return false;
>
> if (sg_policy->work_in_progress)
> @@ -159,12 +158,12 @@ static unsigned int get_next_freq(struct sugov_policy 
> *sg_policy,
> return cpufreq_driver_resolve_freq(policy, freq);
>  }
>
> -static void sugov_get_util(unsigned long *util, unsigned long *max)
> +static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu)
>  {
> -   struct rq *rq = this_rq();
> +   struct rq *rq = cpu_rq(cpu);
> unsigned long cfs_max;
>
> -   cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id());
> +   cfs_max = arch_scale_cpu_capacity(NULL, cpu);
>
> *util = min(rq->cfs.avg.util_avg, cfs_max);
> *max = cfs_max;
> @@ -226,7 +225,7 @@ static void sugov_update_single(struct update_util_data 
> *hook, u64 time,
> sugov_set_iowait_boost(sg_cpu, time, flags);
> sg_cpu->last_update = time;
>
> -   if (!sugov_should_update_freq(sg_policy, time, hook->cpu))
> +   if (!sugov_should_update_freq(sg_policy, time))
> return;
>
> busy = sugov_cpu_is_busy(sg_cpu);
> @@ -234,7 +233,7 @@ static void sugov_update_single(struct update_util_data 
> *hook, u64 time,
> if (flags & SCHED_CPUFREQ_RT_DL) {
> next_f = policy->cpuinfo.max_freq;
> } else {
> -   sugov_get_util(&util, &max);
> +   sugov_get_util(&util, &max, hook->cpu);
> sugov_iowait_boost(sg_cpu, &util, &max);
> next_f = get_next_freq(sg_policy, util, max);
> /*
> @@ -295,7 +294,7 @@ static void sugov_update_shared(struct update_util_data 
> *hook, u64 time,
> unsigned long util, max;
> unsigned int next_f;
>
> -   sugov_get_util(&util, &max);
> +   sugov_get_util(&util, &max, hook->cpu);
>
> raw_spin_lock(&sg_policy->update_lock);
>
> @@ -306,7 +305,7 @@ static void sugov_update_shared(struct update_util_data 
> *hook, u64 time,
> sugov_set_iowait_boost(sg_cpu, time, flags);
> sg_cpu->last_update = time;
>
> -   if (sugov_should_update_freq(sg_policy, time, hook->cpu)) {
> +   if (sugov_should_update_freq(sg_policy, time)) {
> if (flags & SCHED_CPUFREQ_RT_DL)
> next_f = sg_policy->policy->cpuinfo.max_freq;


Re: [PATCH 00/32] tracing: Inter-event (e.g. latency) support

2017-07-01 Thread Joel Fernandes (Google)
Hi Tom,
Nice series and nice ELC talk as well. Thanks.

On Mon, Jun 26, 2017 at 3:49 PM, Tom Zanussi
 wrote:
> This patchset adds support for 'inter-event' quantities to the trace
> event subsystem.  The most important example of inter-event quantities
> are latencies, or the time differences between two events.

I tried your patches out and they are working fine for me. I will test
them out more.

I think for the wakeup latency in your examples, it's better / more
accurate to use the sched_waking tracepoint instead of sched_wakeup? [1]

Also, one other comment: it would be nice to suppress the output of the
individual trace events that make up a synthetic event into the trace
buffer. Otherwise I feel the value of it is slightly diminished, because
one can simply post-process the individual non-synthetic trace events
themselves to get the wake-up latencies that the synthetic event is
trying to calculate in the first place. In order to conserve space, if a
user doesn't care about the individual events but only about the
synthetic events, then the individual ones shouldn't be written to the
trace buffer IMO.

-Joel

[1] commit fbd705a0c6184580d0e2fbcbd47a37b6e5822511 (sched: Introduce
the 'trace_sched_waking' tracepoint)


Re: [PATCH 04/32] ring-buffer: Redefine the unimplemented RINGBUF_TIME_TIME_STAMP

2017-07-02 Thread Joel Fernandes (Google)
On Mon, Jun 26, 2017 at 3:49 PM, Tom Zanussi
 wrote:
> RINGBUF_TYPE_TIME_STAMP is defined but not used, and from what I can
> gather was reserved for something like an absolute timestamp feature
> for the ring buffer, if not a complete replacement of the current
> time_delta scheme.
>
> This code redefines RINGBUF_TYPE_TIME_STAMP to implement absolute time
> stamps.  Another way to look at it is that it essentially forces
> extended time_deltas for all events.
>
> The motivation for doing this is to enable time_deltas that aren't
> dependent on previous events in the ring buffer, making it feasible to
> use the ring_buffer_event timetamps in a more random-access way, for
> purposes other than serial event printing.
>
> To set/reset this mode, use tracing_set_timestamp_abs() from the
> previous interface patch.
>
> Signed-off-by: Tom Zanussi 
> ---
>  include/linux/ring_buffer.h |  12 ++---
>  kernel/trace/ring_buffer.c  | 107 
> +++-
>  2 files changed, 83 insertions(+), 36 deletions(-)
>
> diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
> index 28e3472..74bc276 100644
> --- a/include/linux/ring_buffer.h
> +++ b/include/linux/ring_buffer.h
> @@ -36,10 +36,12 @@ struct ring_buffer_event {
>   *  array[0] = time delta (28 .. 59)
>   *  size = 8 bytes
>   *
> - * @RINGBUF_TYPE_TIME_STAMP:   Sync time stamp with external clock
> - *  array[0]= tv_nsec
> - *  array[1..2] = tv_sec
> - *  size = 16 bytes
> + * @RINGBUF_TYPE_TIME_STAMP:   Absolute timestamp
> + *  Same format as TIME_EXTEND except that the
> + *  value is an absolute timestamp, not a delta
> + *  event.time_delta contains bottom 27 bits
> + *  array[0] = top (28 .. 59) bits
> + *  size = 8 bytes
>   *

Let's also change it here in ring_buffer.c?

enum {
	RB_LEN_TIME_EXTEND = 8,
	RB_LEN_TIME_STAMP = 16,    <- change to 8
};


>   * <= @RINGBUF_TYPE_DATA_TYPE_LEN_MAX:
>   * Data record
> @@ -56,12 +58,12 @@ enum ring_buffer_type {
> RINGBUF_TYPE_DATA_TYPE_LEN_MAX = 28,
> RINGBUF_TYPE_PADDING,
> RINGBUF_TYPE_TIME_EXTEND,
> -   /* FIXME: RINGBUF_TYPE_TIME_STAMP not implemented */
> RINGBUF_TYPE_TIME_STAMP,
>  };
>
>  unsigned ring_buffer_event_length(struct ring_buffer_event *event);
>  void *ring_buffer_event_data(struct ring_buffer_event *event);
> +u64 ring_buffer_event_time_stamp(struct ring_buffer_event *event);
>
>  /*
>   * ring_buffer_discard_commit will remove an event that has not
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index f544738..df848f0 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -42,6 +42,8 @@ int ring_buffer_print_entry_header(struct trace_seq *s)
>  RINGBUF_TYPE_PADDING);
> trace_seq_printf(s, "\ttime_extend : type == %d\n",
>  RINGBUF_TYPE_TIME_EXTEND);
> +   trace_seq_printf(s, "\ttime_stamp : type == %d\n",
> +RINGBUF_TYPE_TIME_STAMP);
> trace_seq_printf(s, "\tdata max type_len  == %d\n",
>  RINGBUF_TYPE_DATA_TYPE_LEN_MAX);
>
> @@ -147,6 +149,9 @@ enum {
>  #define skip_time_extend(event) \
> ((struct ring_buffer_event *)((char *)event + RB_LEN_TIME_EXTEND))
>
> +#define extended_time(event) \
> +   (event->type_len >= RINGBUF_TYPE_TIME_EXTEND)
> +
>  static inline int rb_null_event(struct ring_buffer_event *event)
>  {
> return event->type_len == RINGBUF_TYPE_PADDING && !event->time_delta;
> @@ -187,10 +192,8 @@ static void rb_event_set_padding(struct 
> ring_buffer_event *event)
> return  event->array[0] + RB_EVNT_HDR_SIZE;
>
> case RINGBUF_TYPE_TIME_EXTEND:
> -   return RB_LEN_TIME_EXTEND;
> -
> case RINGBUF_TYPE_TIME_STAMP:
> -   return RB_LEN_TIME_STAMP;
> +   return RB_LEN_TIME_EXTEND;

And then this change can be dropped. Or a new single enum can be
defined for both TIME_EXTEND and TIME_STAMP and then used.
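
For example, something along these lines (the name is illustrative only,
not from the posted series):

enum {
	/* TIME_EXTEND and TIME_STAMP now share the same 8-byte layout. */
	RB_LEN_TIME_EXT_OR_STAMP = 8,
};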

thanks,

-Joel


Re: [Eas-dev] [PATCH V4 1/3] sched: cpufreq: Allow remote cpufreq callbacks

2017-07-27 Thread Joel Fernandes (Google)
On Thu, Jul 27, 2017 at 12:14 AM, Viresh Kumar <viresh.ku...@linaro.org> wrote:
> On 26-07-17, 23:13, Joel Fernandes (Google) wrote:
>> On Wed, Jul 26, 2017 at 10:50 PM, Viresh Kumar <viresh.ku...@linaro.org> 
>> wrote:
>> > On 26-07-17, 22:34, Joel Fernandes (Google) wrote:
>> >> On Wed, Jul 26, 2017 at 2:22 AM, Viresh Kumar <viresh.ku...@linaro.org> 
>> >> wrote:
>> >> > @@ -221,7 +226,7 @@ static void sugov_update_single(struct 
>> >> > update_util_data *hook, u64 time,
>> >> > sugov_set_iowait_boost(sg_cpu, time, flags);
>> >> > sg_cpu->last_update = time;
>> >> >
>> >> > -   if (!sugov_should_update_freq(sg_policy, time))
>> >> > +   if (!sugov_should_update_freq(sg_policy, time, hook->cpu))
>> >> > return;
>> >>
>> >> Since with the remote callbacks now possible, isn't it unsafe to
>> >> modify sg_cpu and sg_policy structures without a lock in
>> >> sugov_update_single?
>> >>
>> >> Unlike sugov_update_shared, we don't acquire any lock in
>> >> sugov_update_single before updating these structures. Did I miss
>> >> something?
>> >
>> > As Peter already mentioned it earlier, the callbacks are called with
>> > rq locks held and so sugov_update_single() wouldn't get called in
>> > parallel for a target CPU.
>>
>> Ah ok, I have to catch up with that discussion since I missed the
>> whole thing. Now that you will have me on CC, that shouldn't happen,
>> thanks and sorry about the noise.
>>
>> > That's the only race you were worried about ?
>>
>> Yes. So then in that case, makes sense to move raw_spin_lock in
>> sugov_update_shared further down? (Just discussing, this point is
>> independent of your patch), Something like:
>
> Even that was discussed tomorrow with Peter :)
>
> No it wouldn't work because sg_cpu->util we are updating here may be
> getting read from some other cpu that shares policy with sg_cpu.
>

Ok. yes you are right :) thank you Viresh and Peter for the clarification.

thanks,

-Joel


Re: [Eas-dev] [PATCH V4 0/3] sched: cpufreq: Allow remote callbacks

2017-07-27 Thread Joel Fernandes (Google)
On Thu, Jul 27, 2017 at 12:21 AM, Juri Lelli  wrote:
[..]
>> >
>> > But even without that, if you see the routine
>> > init_entity_runnable_average() in fair.c, the new tasks are
>> > initialized in a way that they are seen as heavy tasks. And so even
>> > for the first time they run, freq should normally increase on the
>> > target CPU (at least with above application).i
>>
>> Ok, but the "heavy" in init_entity_runnable_average means for load,
>> not the util_avg. The util_avg is what's used for frequency scaling
>> IIUC and is set to 0 in that function no?
>>
>
> True for init_entity_runnable_average(), but for new task post_init_
> entity_util_avg() is then also called (from wake_up_new_task()), which
> modifies the initial util_avg value (depending on current rq {util,
> load}_avg.

Got it. That makes sense, thank you!

-Joel


Re: [Eas-dev] [PATCH V3 1/3] sched: cpufreq: Allow remote cpufreq callbacks

2017-07-27 Thread Joel Fernandes (Google)
On Thu, Jul 27, 2017 at 12:55 PM, Saravana Kannan
 wrote:
> On 07/26/2017 08:30 PM, Viresh Kumar wrote:
>>
>> On 26-07-17, 14:00, Saravana Kannan wrote:
>>>
>>> No, the alternative is to pass it on to the CPU freq driver and let it
>>> decide what it wants to do. That's the whole point if having a CPU freq
>>> driver -- so that the generic code doesn't need to care about HW specific
>>> details. Which is the point I was making in an earlier email to Viresh's
>>> patch -- we shouldn't be doing any CPU check for the call backs at the
>>> scheduler or ever governor level.
>>>
>>> That would simplify this whole thing by deleting a bunch of code. And
>>> having
>>> much simpler checks in those drivers that actually have to deal with
>>> their
>>> HW specific details.
>>
>>
>> So what you are saying is that we go and update (almost) every cpufreq
>> driver we have today and make their ->target() callbacks return early
>> if they don't support switching frequency remotely ? Is that really
>> simplifying anything?
>
>
> Yes. Simplifying isn't always about number of lines of code. It's also about
> abstraction. Having generic scheduler code care about HW details doesn't
> seem nice.
>
> It'll literally one simple check (cpu == smp_processor_id()) or (cpu "in"
> policy->cpus).
>

I think we can have both approaches? I.e. we could query the driver
somewhere around sugov_should_update_freq (with a new driver callback?)
and ask whether it has any say over the default behavior of "can't
update a remote CPU if I'm not part of its policy", and use that instead
of the default when the driver defines it in its struct cpufreq_driver
(a rough sketch of the idea follows below).

I think this also avoids the concern of "updating every driver": we can
just stick to the sane default of "no" for drivers that haven't defined
it. Probably Viresh has already thought about this, but I thought of
bringing it up anyway. I also think it's fine to handle this case after
this series gets in, but that's just my opinion.
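
A hypothetical sketch of the idea; the callback does NOT exist in struct
cpufreq_driver today and the names are made up, only the policy->cpus
check mirrors the default discussed in the thread:

#include <linux/cpufreq.h>
#include <linux/smp.h>

/* Imagined extension; the real struct cpufreq_driver has no such member. */
struct cpufreq_driver_ext {
	bool (*can_update_remote)(struct cpufreq_policy *policy, int target_cpu);
};

static bool remote_update_allowed(struct cpufreq_policy *policy,
				  struct cpufreq_driver_ext *drv,
				  int target_cpu)
{
	/* Default behavior: only CPUs sharing the policy may update each other. */
	if (cpumask_test_cpu(smp_processor_id(), policy->cpus))
		return true;

	/* Driver override, e.g. for hardware that can switch any CPU remotely. */
	if (drv && drv->can_update_remote)
		return drv->can_update_remote(policy, target_cpu);

	return false;
}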

thanks!

-Joel


Re: [PATCH v7 1/2] sched/deadline: Add support for SD_PREFER_SIBLING on find_later_rq()

2017-08-18 Thread Joel Fernandes (Google)
Hi Byungchul,

On Thu, Aug 17, 2017 at 11:05 PM, Byungchul Park  wrote:
> It would be better to avoid pushing tasks to other cpu within
> a SD_PREFER_SIBLING domain, instead, get more chances to check other
> siblings.
>
> Signed-off-by: Byungchul Park 
> ---
>  kernel/sched/deadline.c | 55 
> ++---
>  1 file changed, 52 insertions(+), 3 deletions(-)
>
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 0223694..8f69a37 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -1319,12 +1319,35 @@ static struct task_struct 
> *pick_earliest_pushable_dl_task(struct rq *rq, int cpu
>
>  static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl);
>
> +/*
> + * Find the first cpu in: mask & sd & ~prefer
> + */
> +static int find_cpu(const struct cpumask *mask,
> +   const struct sched_domain *sd,
> +   const struct sched_domain *prefer)
> +{
> +   const struct cpumask *sds = sched_domain_span(sd);
> +   const struct cpumask *ps  = prefer ? sched_domain_span(prefer) : NULL;
> +   int cpu = -1;
> +
> +   while ((cpu = cpumask_next(cpu, mask)) < nr_cpu_ids) {

This can be simplified to for_each_cpu(cpu, mask) ?
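
i.e. something like this, with the loop body unchanged from the patch:

int cpu;

for_each_cpu(cpu, mask) {
	/* ... same body: skip CPUs outside sd's span or inside prefer's
	 * span, otherwise return this cpu ... */
}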


thanks,

-Joel


Re: [PATCH v6 0/2] Make find_later_rq() choose a closer cpu in topology

2017-08-17 Thread Joel Fernandes (Google)
On Thu, Aug 17, 2017 at 6:25 PM, Byungchul Park  wrote:
> On Mon, Aug 07, 2017 at 12:50:32PM +0900, Byungchul Park wrote:
>> When cpudl_find() returns any among free_cpus, the cpu might not be
>> closer than others, considering sched domain. For example:
>>
>>this_cpu: 15
>>free_cpus: 0, 1,..., 14 (== later_mask)
>>best_cpu: 0
>>
>>topology:
>>
>>   0 --+
>>       +--+
>>   1 --+  |
>>          +-- ... --+
>>   2 --+  |         |
>>       +--+         |
>>   3 --+            |
>>
>>   ...              ...
>>
>>  12 --+            |
>>       +--+         |
>>  13 --+  |         |
>>          +-- ... --+
>>  14 --+  |
>>       +--+
>>  15 --+
>>
>> In this case, it would be best to select 14 since it's a free cpu and
>> closest to 15(this_cpu). However, currently the code select 0(best_cpu)
>> even though that's just any among free_cpus. Fix it.
>
> Could you let me know your opinions about this?

Patch looks good to me. I would also add a comment on top of
fallback_cpu (I think Steve mentioned a similar thing at [1]):

/*
 * fallback is the closest CPU in the closest SD in case
 * all domains are PREFER_SIBLING
 */
if (fallback_cpu == -1)
	fallback_cpu = best_cpu;

And clarify this in the commit message.


thanks,

-Joel

[1] https://patchwork.kernel.org/patch/9884383/


>
>> Change from v5
>>-. exclude two patches already picked up by peterz
>>   (sched/deadline: Make find_later_rq() choose a closer cpu in topology)
>>   (sched/deadline: Change return value of cpudl_find())
>>-. apply what peterz fixed for 'prefer sibling', into deadline and rt
>>
>> Change from v4
>>-. remove a patch that might cause huge lock contention
>>   (by spin lock() in a hot path of scheduler)
>>
>> Change from v3
>>-. rename closest_cpu to best_cpu so that it align with rt
>>-. protect referring cpudl.elements with cpudl.lock
>>-. change return value of cpudl_find() to bool
>>
>> Change from v2
>>-. add support for SD_PREFER_SIBLING
>>
>> Change from v1
>>-. clean up the patch
>>
>> Byungchul Park (2):
>>   sched/deadline: Add support for SD_PREFER_SIBLING on find_later_rq()
>>   sched/rt: Add support for SD_PREFER_SIBLING on find_lowest_rq()
>>
>>  kernel/sched/deadline.c | 46 +++---
>>  kernel/sched/rt.c   | 47 ---
>>  2 files changed, 87 insertions(+), 6 deletions(-)
>>
>> --
>> 1.9.1


Re: [for-next][PATCH 15/16] ftrace: Add freeing algorithm to free ftrace_mod_maps

2017-10-07 Thread Joel Fernandes (Google)
Hi Steve,

On Sat, Oct 7, 2017 at 6:32 AM, Steven Rostedt <rost...@goodmis.org> wrote:
> On Fri, 6 Oct 2017 23:41:25 -0700
> "Joel Fernandes (Google)" <joel.open...@gmail.com> wrote:
>
>> Hi Steve,
>>
>> On Fri, Oct 6, 2017 at 11:07 AM, Steven Rostedt <rost...@goodmis.org> wrote:
>> > From: "Steven Rostedt (VMware)" <rost...@goodmis.org>
>> >
>> > The ftrace_mod_map is a descriptor to save module init function names in
>> > case they were traced, and the trace output needs to reference the function
>> > name from the function address. But after the function is unloaded, it
>> > the maps should be freed, as the rest of the function names are as well.
>>
>> Just checking for my understanding of this patch - wouldn't this also
>> mean that if there were any look ups of the init functions that may be
>> needed at trace output time, then those look ups wont be possible any
>> more after module is unloaded?
>
> Yes. That's true for all functions in the module. When a module is
> unloaded, all references to it in kallsyms is also freed. Try it on a
> current kernel. Trace a function in a module, then unload that module.
> The trace data will just show the ip hex address of the module function
> after that.
>

I tried your core branch and I see some weirdness with the filters:

Here's the code for the module:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>

void bar(void)
{
printk(KERN_INFO "bar!\n");
}

void foo(void)
{
printk(KERN_INFO "foo!\n");
bar();
}

static int __init hello_init(void)
{
printk(KERN_INFO "Hello world!\n");
foo();
return 0;
}

static void __exit hello_cleanup(void)
{
printk(KERN_INFO "Cleaning up module.\n");
}

module_init(hello_init);
module_exit(hello_cleanup);


Following is a run with function tracing. During the first run I see
that foo and bar are set up in the filters. After that, when I unload
and reload the module, I see that only "bar" is in the filters:

bash-4.3# echo '*:mod:test' > /d/tracing/set_ftrace_filter
bash-4.3# echo function > /d/tracing/current_tracer
bash-4.3# cat /d/tracing/set_ftrace_filter
*:mod:test
bash-4.3# modprobe test
bash-4.3# cat /d/tracing/trace
# tracer: function
#
#  _-=> irqs-off
# / _=> need-resched
#| / _---=> hardirq/softirq
#|| / _--=> preempt-depth
#||| / delay
#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
#  | |   |      | |
modprobe-1050  [003]  1.653000: hello_init <-do_one_initcall
bash-4.3# cat /d/tracing/set_ftrace_filter
bar [test]
foo [test]
bash-4.3# rmmod test
bash-4.3# cat /d/tracing/set_ftrace_filter
bash-4.3# sleep 2
bash-4.3# modprobe test
bash-4.3# cat /d/tracing/trace
# tracer: function
#
#  _-=> irqs-off
# / _=> need-resched
#| / _---=> hardirq/softirq
#|| / _--=> preempt-depth
#||| / delay
#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
#  | |   |      | |
modprobe-1050  [003]  1.653000: bar <-do_one_initcall
bash-4.3# cat /d/tracing/set_ftrace_filter
bar [test]                   <--- bar is still in the filter list
bash-4.3#

Interestingly, this also shows that the symbol conversion can be
unreliable if another module is loaded at the same address: in this case
'bar' was loaded at the same address as 'hello_init' the second time, so
the second cat of the trace file shows a different symbol name.

Also, could you let me know what the correct behavior of the filters is
after a module being traced is unloaded? Are the filters supposed to be
set up again after the module is unloaded, before the module can be
loaded again?

Also, I wasn't able to see the calls from hello_init -> foo and foo ->
bar, so I'm not fully sure whether that's a bug or I did something
wrong. I'm happy to try out anything you suggest.

>> I guess having a reference somehow on the ftrace_mod_map descriptor if
>> there are any entries in the trace buffer that need it can help
>> prevent that but it could be too expensive for not much return since
>> most likely the user wouldn't unload modules before trace collection
>> in normal usage.
>
> Right, I have thought about this, and I haven't come up with an
> inexpensive way to do this. As this has been the default operation of
> all module functions, and I haven't heard m

Re: [for-next][PATCH 15/16] ftrace: Add freeing algorithm to free ftrace_mod_maps

2017-10-07 Thread Joel Fernandes (Google)
Hi Steve,

On Fri, Oct 6, 2017 at 11:07 AM, Steven Rostedt  wrote:
> From: "Steven Rostedt (VMware)" 
>
> The ftrace_mod_map is a descriptor to save module init function names in
> case they were traced, and the trace output needs to reference the function
> name from the function address. But after the function is unloaded, it
> the maps should be freed, as the rest of the function names are as well.

Just checking for my understanding of this patch - wouldn't this also
mean that if there are any lookups of the init functions needed at trace
output time, those lookups won't be possible any more after the module
is unloaded?

I guess somehow holding a reference on the ftrace_mod_map descriptor
while there are entries in the trace buffer that need it could help
prevent that, but it may be too expensive for not much return, since
most likely the user wouldn't unload modules before trace collection in
normal usage.

thanks,

- Joel


[PATCH] rcu: trace: Remove Startedleaf from trace events comment

2018-05-11 Thread Joel Fernandes (Google)
As part of the gp_seq clean up, the Startedleaf condition doesn't occur
anymore. Remove it from the comment in the trace event file.

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 include/trace/events/rcu.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index ce9d1a1cac78..6d8dd04912d2 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -91,7 +91,6 @@ TRACE_EVENT(rcu_grace_period,
  *
  * "Startleaf": Request a grace period based on leaf-node data.
  * "Prestarted": Someone beat us to the request
- * "Startedleaf": Leaf-node start proved sufficient.
  * "Startedleafroot": Leaf-node start proved sufficient after checking root.
  * "Startedroot": Requested a nocb grace period based on root-node data.
  * "NoGPkthread": The RCU grace-period kthread has not yet started.
-- 
2.17.0.441.gb46fe60e1d-goog


[PATCH v2] rcu: Add comment documenting how rcu_seq_snap works

2018-05-11 Thread Joel Fernandes (Google)
rcu_seq_snap may be tricky for someone looking at it for the first time.
Lets document how it works with an example to make it easier.

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
v2 changes: Corrections as suggested by Randy.

 kernel/rcu/rcu.h | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 003671825d62..533bc1087371 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -91,7 +91,29 @@ static inline void rcu_seq_end(unsigned long *sp)
WRITE_ONCE(*sp, rcu_seq_endval(sp));
 }
 
-/* Take a snapshot of the update side's sequence number. */
+/*
+ * Take a snapshot of the update side's sequence number.
+ *
+ * This function predicts what the grace period number will be the next
+ * time an RCU callback will be executed, given the current grace period's
+ * number. This can be gp+1 if RCU is idle, or gp+2 if a grace period is
+ * already in progress.
+ *
+ * We do this with a single addition and masking.
+ * For example, if RCU_SEQ_STATE_MASK=1 and the least significant bit (LSB) of
+ * the seq is used to track if a GP is in progress or not, its sufficient if we
+ * add (2+1) and mask with ~1. Lets see why with an example:
+ *
+ * Say the current seq is 6 which is 0b110 (gp is 3 and state bit is 0).
+ * To get the next GP number, we have to at least add 0b10 to this (0x1 << 1)
+ * to account for the state bit. However, if the current seq is 7 (gp is 3 and
+ * state bit is 1), then it means the current grace period is already in
+ * progress so the next time the callback will run is at the end of grace
+ * period number gp+2. To account for the extra +1, we just overflow the LSB by
+ * adding another 0x1 and masking with ~0x1. In case no GP was in progress (RCU
+ * is idle), then the addition of the extra 0x1 and masking will have no
+ * effect. This is calculated as below.
+ */
 static inline unsigned long rcu_seq_snap(unsigned long *sp)
 {
unsigned long s;
-- 
2.17.0.441.gb46fe60e1d-goog
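
As a quick numeric check of the example in the comment (userspace,
assuming RCU_SEQ_STATE_MASK == 1 as stated there):

#include <stdio.h>

int main(void)
{
	unsigned long idle = 6;	/* 0b110: gp 3, no GP in progress */
	unsigned long busy = 7;	/* 0b111: gp 3, GP in progress */

	/* rcu_seq_snap() in this example boils down to (s + 2 + 1) & ~1UL */
	printf("%lu\n", (idle + 3) & ~1UL);	/* prints 8,  i.e. gp 4 == gp+1 */
	printf("%lu\n", (busy + 3) & ~1UL);	/* prints 10, i.e. gp 5 == gp+2 */
	return 0;
}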


[PATCH] rcu: Add comment documenting how rcu_seq_snap works

2018-05-11 Thread Joel Fernandes (Google)
rcu_seq_snap may be tricky for someone looking at it for the first time.
Lets document how it works with an example to make it easier.

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 kernel/rcu/rcu.h | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 003671825d62..004ace3d22c2 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -91,7 +91,28 @@ static inline void rcu_seq_end(unsigned long *sp)
WRITE_ONCE(*sp, rcu_seq_endval(sp));
 }
 
-/* Take a snapshot of the update side's sequence number. */
+/*
+ * Take a snapshot of the update side's sequence number.
+ *
+ * This function predicts what the grace period number will be the next
+ * time an RCU callback will be executed, given the current grace period's
+ * number. This can be gp+1 if RCU is idle, or gp+2 if a grace period is
+ * already in progress.
+ *
+ * We do this with a single addition and masking.
+ * For example, if RCU_SEQ_STATE_MASK=1 and the least significant bit (LSB) of
+ * the seq is used to track if a GP is in progress or not, its sufficient if we
+ * add (2+1) and mask with ~1. Lets see why with an example:
+ *
+ * Say the current seq is 6 which is 0x110 (gp is 3 and state bit is 0).
+ * To get the next GP number, we have to atleast add 0x10 to this (0x1 << 1) to
+ * account for the state bit. However, if the current seq is 7 (GP num is 3
+ * and state bit is 1), then it means the current grace period is already
+ * in progress so the next the callback will run is at gp+2. To account for
+ * the extra +1, we just overflow the LSB by adding another 0x1 and masking
+ * with ~0x1. Incase no GP was in progress (RCU is idle), then the adding
+ * by 0x1 and masking will have no effect. This is calculated as below.
+ */
 static inline unsigned long rcu_seq_snap(unsigned long *sp)
 {
unsigned long s;
-- 
2.17.0.441.gb46fe60e1d-goog


[PATCH RFC 1/8] rcu: Add comment documenting how rcu_seq_snap works

2018-05-13 Thread Joel Fernandes (Google)
rcu_seq_snap may be tricky for someone looking at it for the first time.
Lets document how it works with an example to make it easier.

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 kernel/rcu/rcu.h | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 003671825d62..fc3170914ac7 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -91,7 +91,29 @@ static inline void rcu_seq_end(unsigned long *sp)
WRITE_ONCE(*sp, rcu_seq_endval(sp));
 }
 
-/* Take a snapshot of the update side's sequence number. */
+/*
+ * Take a snapshot of the update side's sequence number.
+ *
+ * This function predicts what the grace period number will be the next
+ * time an RCU callback will be executed, given the current grace period's
+ * number. This can be gp+1 if RCU is idle, or gp+2 if a grace period is
+ * already in progress.
+ *
+ * We do this with a single addition and masking.
+ * For example, if RCU_SEQ_STATE_MASK=1 and the least significant bit (LSB) of
+ * the seq is used to track if a GP is in progress or not, its sufficient if we
+ * add (2+1) and mask with ~1. Let's see why with an example:
+ *
+ * Say the current seq is 6 which is 0b110 (gp is 3 and state bit is 0).
+ * To get the next GP number, we have to at least add 0b10 to this (0x1 << 1)
+ * to account for the state bit. However, if the current seq is 7 (gp is 3 and
+ * state bit is 1), then it means the current grace period is already in
+ * progress so the next time the callback will run is at the end of grace
+ * period number gp+2. To account for the extra +1, we just overflow the LSB by
+ * adding another 0x1 and masking with ~0x1. In case no GP was in progress (RCU
+ * is idle), then the addition of the extra 0x1 and masking will have no
+ * effect. This is calculated as below.
+ */
 static inline unsigned long rcu_seq_snap(unsigned long *sp)
 {
unsigned long s;
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC 7/8] rcu: trace CleanupMore condition only if needed

2018-05-13 Thread Joel Fernandes (Google)
Currently the tree RCU clean up code records a CleanupMore trace event
even if the GP was already in progress. This makes CleanupMore show up
twice for no reason. Avoid it.

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8401a253e7de..25c44328d071 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2083,7 +2083,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
rsp->gp_state = RCU_GP_IDLE;
/* Check for GP requests since above loop. */
rdp = this_cpu_ptr(rsp->rda);
-   if (ULONG_CMP_LT(rnp->gp_seq, rnp->gp_seq_needed)) {
+   if (!needgp && ULONG_CMP_LT(rnp->gp_seq, rnp->gp_seq_needed)) {
trace_rcu_this_gp(rnp, rdp, rnp->gp_seq_needed,
  TPS("CleanupMore"));
needgp = true;
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC 4/8] rcu: Get rid of old c variable from places in tree RCU

2018-05-13 Thread Joel Fernandes (Google)
The 'c' variable was previously used to store the grace period that is
being requested. However, it is not very meaningful to a code reader, so
this patch replaces it with gp_seq_start, indicating that this is the
grace period that was requested. The tracing is also updated with the
new name.

Just a clean up patch, no logical change.

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 include/trace/events/rcu.h | 15 ++--
 kernel/rcu/tree.c  | 47 ++
 2 files changed, 35 insertions(+), 27 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index ce9d1a1cac78..539900a9f8c7 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -103,15 +103,16 @@ TRACE_EVENT(rcu_grace_period,
  */
 TRACE_EVENT(rcu_future_grace_period,
 
-   TP_PROTO(const char *rcuname, unsigned long gp_seq, unsigned long c,
-u8 level, int grplo, int grphi, const char *gpevent),
+   TP_PROTO(const char *rcuname, unsigned long gp_seq,
+unsigned long gp_seq_start, u8 level, int grplo, int grphi,
+const char *gpevent),
 
-   TP_ARGS(rcuname, gp_seq, c, level, grplo, grphi, gpevent),
+   TP_ARGS(rcuname, gp_seq, gp_seq_start, level, grplo, grphi, gpevent),
 
TP_STRUCT__entry(
__field(const char *, rcuname)
__field(unsigned long, gp_seq)
-   __field(unsigned long, c)
+   __field(unsigned long, gp_seq_start)
__field(u8, level)
__field(int, grplo)
__field(int, grphi)
@@ -121,7 +122,7 @@ TRACE_EVENT(rcu_future_grace_period,
TP_fast_assign(
__entry->rcuname = rcuname;
__entry->gp_seq = gp_seq;
-   __entry->c = c;
+   __entry->gp_seq_start = gp_seq_start;
__entry->level = level;
__entry->grplo = grplo;
__entry->grphi = grphi;
@@ -129,7 +130,7 @@ TRACE_EVENT(rcu_future_grace_period,
),
 
TP_printk("%s %lu %lu %u %d %d %s",
- __entry->rcuname, __entry->gp_seq, __entry->c, __entry->level,
+ __entry->rcuname, __entry->gp_seq, __entry->gp_seq_start, 
__entry->level,
  __entry->grplo, __entry->grphi, __entry->gpevent)
 );
 
@@ -751,7 +752,7 @@ TRACE_EVENT(rcu_barrier,
 #else /* #ifdef CONFIG_RCU_TRACE */
 
 #define trace_rcu_grace_period(rcuname, gp_seq, gpevent) do { } while (0)
-#define trace_rcu_future_grace_period(rcuname, gp_seq, c, \
+#define trace_rcu_future_grace_period(rcuname, gp_seq, gp_seq_start, \
  level, grplo, grphi, event) \
  do { } while (0)
 #define trace_rcu_grace_period_init(rcuname, gp_seq, level, grplo, grphi, \
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 29ccc60bdbfc..9f5679ba413b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1541,13 +1541,18 @@ void rcu_cpu_stall_reset(void)
 
 /* Trace-event wrapper function for trace_rcu_future_grace_period.  */
 static void trace_rcu_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
- unsigned long c, const char *s)
+ unsigned long gp_seq_start, const char *s)
 {
-   trace_rcu_future_grace_period(rdp->rsp->name, rnp->gp_seq, c,
+   trace_rcu_future_grace_period(rdp->rsp->name, rnp->gp_seq, gp_seq_start,
  rnp->level, rnp->grplo, rnp->grphi, s);
 }
 
 /*
+ * rcu_start_this_gp - Request the start of a particular grace period
+ * @rnp: The leaf node of the CPU from which to start.
+ * @rdp: The rcu_data corresponding to the CPU from which to start.
+ * @gp_seq_start: The gp_seq of the grace period to start.
+ *
  * Start the specified grace period, as needed to handle newly arrived
  * callbacks.  The required future grace periods are recorded in each
  * rcu_node structure's ->gp_seq_needed field.  Returns true if there
@@ -1555,9 +1560,11 @@ static void trace_rcu_this_gp(struct rcu_node *rnp, 
struct rcu_data *rdp,
  *
  * The caller must hold the specified rcu_node structure's ->lock, which
  * is why the caller is responsible for waking the grace-period kthread.
+ *
+ * Returns true if the GP thread needs to be awakened else false.
  */
 static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
- unsigned long c)
+ unsigned long gp_seq_start)
 {
bool ret = false;
struct rcu_state *rsp = rdp->rsp;
@@ -1573,18 +1580,19 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, 
struct rcu_data *rdp,
 * not be released.
 */
raw_lockdep_assert_held_rcu_node(rnp);
-   trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf"));
+   

[PATCH RFC 8/8] rcu: Fix cpustart tracepoint gp_seq number

2018-05-13 Thread Joel Fernandes (Google)
cpustart shows a stale gp_seq. This is because rdp->gp_seq is updated
only at the end of the __note_gp_changes function. For this reason, use
rnp->gp_seq instead. I believe we can't update rdp->gp_seq any earlier,
so let's just use the gp_seq from rnp.

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 25c44328d071..58d2b68f8b98 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1807,7 +1807,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, 
struct rcu_node *rnp,
 * set up to detect a quiescent state, otherwise don't
 * go looking for one.
 */
-   trace_rcu_grace_period(rsp->name, rdp->gp_seq, TPS("cpustart"));
+   trace_rcu_grace_period(rsp->name, rnp->gp_seq, TPS("cpustart"));
need_gp = !!(rnp->qsmask & rdp->grpmask);
rdp->cpu_no_qs.b.norm = need_gp;
rdp->rcu_qs_ctr_snap = __this_cpu_read(rcu_dynticks.rcu_qs_ctr);
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC 6/8] rcu: Add back the Startedleaf tracepoint

2018-05-13 Thread Joel Fernandes (Google)
In a recent discussion [1], the check for whether a leaf believes RCU is
not idle is being added back to the funnel locking code to avoid more
locking. In this case we mark the leaf node for a future grace period
and bail out, since a GP is currently in progress. However, the
tracepoint for this is missing. Let's add it back.

Also add a small comment about why we do this check (basically, the point
is to avoid locking intermediate nodes unnecessarily) and clarify the
comments in the trace event header now that we traverse one or more
intermediate nodes.

[1] http://lkml.kernel.org/r/20180513190906.gl26...@linux.vnet.ibm.com

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 include/trace/events/rcu.h |  4 ++--
 kernel/rcu/tree.c  | 11 ++-
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 539900a9f8c7..dc0bd11739c7 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -91,8 +91,8 @@ TRACE_EVENT(rcu_grace_period,
  *
  * "Startleaf": Request a grace period based on leaf-node data.
  * "Prestarted": Someone beat us to the request
- * "Startedleaf": Leaf-node start proved sufficient.
- * "Startedleafroot": Leaf-node start proved sufficient after checking root.
+ * "Startedleaf": Leaf and one or more non-root nodes marked for future start.
+ * "Startedleafroot": All non-root nodes from leaf to root marked for future start.
  * "Startedroot": Requested a nocb grace period based on root-node data.
  * "NoGPkthread": The RCU grace-period kthread has not yet started.
  * "StartWait": Start waiting for the requested grace period.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 40670047d22c..8401a253e7de 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1593,8 +1593,17 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, 
struct rcu_data *rdp,
goto unlock_out;
}
rnp_node->gp_seq_needed = gp_seq_start;
-   if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))
+
+   /*
+* Check if the leaf believes a GP is in progress; if so, we can
+* bail and avoid more locking. We have already marked the leaf.
+*/
+   if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq))) {
+   trace_rcu_this_gp(rnp_node, rdp, gp_seq_start,
+ TPS("Startedleaf"));
goto unlock_out;
+   }
+
if (rnp_node != rnp && rnp_node->parent != NULL)
raw_spin_unlock_rcu_node(rnp_node);
if (!rnp_node->parent) {
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC 0/8] rcu fixes, clean ups for rcu/dev

2018-05-13 Thread Joel Fernandes (Google)
Hi,
Here are some fixes, clean-ups, and code-comment changes, mostly for the
new funnel locking, the gp_seq changes, and some tracing. It's based on
the latest rcu/dev branch.

thanks,

- Joel

Joel Fernandes (Google) (8):
  rcu: Add comment documenting how rcu_seq_snap works
  rcu: Clarify usage of cond_resched for tasks-RCU
  rcu: Add back the cpuend tracepoint
  rcu: Get rid of old c variable from places in tree RCU
  rcu: Use rcu_node as temporary variable in funnel locking loop
  rcu: Add back the Startedleaf tracepoint
  rcu: trace CleanupMore condition only if needed
  rcu: Fix cpustart tracepoint gp_seq number

 include/linux/rcupdate.h   | 11 +++--
 include/trace/events/rcu.h | 19 
 kernel/rcu/rcu.h   | 24 +-
 kernel/rcu/tree.c  | 92 +++---
 4 files changed, 97 insertions(+), 49 deletions(-)

-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC 2/8] rcu: Clarify usage of cond_resched for tasks-RCU

2018-05-13 Thread Joel Fernandes (Google)
Recently we had a discussion about cond_resched unconditionally
recording a voluntary context switch [1].

Let's add a comment clarifying how this API is to be used.

[1] 
https://lkml.kernel.org/r/1526027434-21237-1-git-send-email-byungchul.p...@lge.com
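
For illustration, a minimal sketch of the intended usage (the loop below is
hypothetical; only cond_resched_tasks_rcu_qs() itself is the kernel API):

#include <linux/list.h>
#include <linux/rcupdate.h>

/* Illustrative only: a long-running kernel loop that reports an
 * RCU-tasks quiescent state on each pass, even when cond_resched()
 * itself ends up doing nothing. */
static void example_scan(struct list_head *head)
{
	struct list_head *p;

	list_for_each(p, head) {
		/* ... do per-entry work here ... */
		cond_resched_tasks_rcu_qs();  /* report QS to RCU-tasks */
	}
}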

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 include/linux/rcupdate.h | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 743226176350..a9881007ece6 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -159,8 +159,12 @@ static inline void rcu_init_nohz(void) { }
} while (0)
 
 /*
- * Note a voluntary context switch for RCU-tasks benefit.  This is a
- * macro rather than an inline function to avoid #include hell.
+ * Note an attempt to perform a voluntary context switch for RCU-tasks benefit.
+ *
+ * This is called even in situations where a context switch didn't really
+ * happen even though it was requested. The caller uses it to indicate
+ * traversal of an RCU-tasks quiescent state. This is a macro rather than an
+ * inline function to avoid #include hell.
  */
 #ifdef CONFIG_TASKS_RCU
 #define rcu_note_voluntary_context_switch_lite(t) \
@@ -187,7 +191,8 @@ static inline void exit_tasks_rcu_finish(void) { }
 #endif /* #else #ifdef CONFIG_TASKS_RCU */
 
 /**
- * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU
+ * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU.
+ * The quiescent state report is made even if cond_resched() did nothing.
  *
  * This macro resembles cond_resched(), except that it is defined to
  * report potential quiescent states to RCU-tasks even if the cond_resched()
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC 5/8] rcu: Use rcu_node as temporary variable in funnel locking loop

2018-05-13 Thread Joel Fernandes (Google)
The funnel locking loop in rcu_start_this_gp() uses rnp_root as a
temporary variable while walking the combining tree. This forces a code
reader to keep reminding themselves that rnp_root may not actually be
the root. Let's just call it rnp_node, and assign rnp_root only at the
end, once rnp_node really is the root.

Just a clean-up patch, no logical change.

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 kernel/rcu/tree.c | 34 ++
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9f5679ba413b..40670047d22c 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1568,7 +1568,7 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, 
struct rcu_data *rdp,
 {
bool ret = false;
struct rcu_state *rsp = rdp->rsp;
-   struct rcu_node *rnp_root;
+   struct rcu_node *rnp_node, *rnp_root = NULL;
 
/*
 * Use funnel locking to either acquire the root rcu_node
@@ -1581,24 +1581,26 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, 
struct rcu_data *rdp,
 */
raw_lockdep_assert_held_rcu_node(rnp);
trace_rcu_this_gp(rnp, rdp, gp_seq_start, TPS("Startleaf"));
-   for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) {
-   if (rnp_root != rnp)
-   raw_spin_lock_rcu_node(rnp_root);
-   if (ULONG_CMP_GE(rnp_root->gp_seq_needed, gp_seq_start) ||
-   rcu_seq_done(&rnp_root->gp_seq, gp_seq_start) ||
-   (rnp != rnp_root &&
-rcu_seq_state(rcu_seq_current(&rnp_root->gp_seq)))) {
-   trace_rcu_this_gp(rnp_root, rdp, gp_seq_start,
+   for (rnp_node = rnp; 1; rnp_node = rnp_node->parent) {
+   if (rnp_node != rnp)
+   raw_spin_lock_rcu_node(rnp_node);
+   if (ULONG_CMP_GE(rnp_node->gp_seq_needed, gp_seq_start) ||
+   rcu_seq_done(&rnp_node->gp_seq, gp_seq_start) ||
+   (rnp != rnp_node &&
+rcu_seq_state(rcu_seq_current(&rnp_node->gp_seq)))) {
+   trace_rcu_this_gp(rnp_node, rdp, gp_seq_start,
  TPS("Prestarted"));
goto unlock_out;
}
-   rnp_root->gp_seq_needed = gp_seq_start;
+   rnp_node->gp_seq_needed = gp_seq_start;
if (rcu_seq_state(rcu_seq_current(&rnp->gp_seq)))
goto unlock_out;
-   if (rnp_root != rnp && rnp_root->parent != NULL)
-   raw_spin_unlock_rcu_node(rnp_root);
-   if (!rnp_root->parent)
+   if (rnp_node != rnp && rnp_node->parent != NULL)
+   raw_spin_unlock_rcu_node(rnp_node);
+   if (!rnp_node->parent) {
+   rnp_root = rnp_node;
break;  /* At root, and perhaps also leaf. */
+   }
}
 
/* If GP already in progress, just leave, otherwise start one. */
@@ -1616,10 +1618,10 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, 
struct rcu_data *rdp,
trace_rcu_grace_period(rsp->name, READ_ONCE(rsp->gp_seq), 
TPS("newreq"));
ret = true;  /* Caller must wake GP kthread. */
 unlock_out:
-   if (rnp != rnp_root)
-   raw_spin_unlock_rcu_node(rnp_root);
+   if (rnp != rnp_node)
+   raw_spin_unlock_rcu_node(rnp_node);
/* Push furthest requested GP to leaf node and rcu_data structure. */
-   if (ULONG_CMP_GE(rnp_root->gp_seq_needed, gp_seq_start)) {
+   if (ULONG_CMP_GE(rnp_node->gp_seq_needed, gp_seq_start)) {
rnp->gp_seq_needed = gp_seq_start;
rdp->gp_seq_needed = gp_seq_start;
}
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC 3/8] rcu: Add back the cpuend tracepoint

2018-05-13 Thread Joel Fernandes (Google)
Commit be4b8beed87d ("rcu: Move RCU's grace-period-change code to ->gp_seq")
removed the cpuend grace period trace point. This patch adds it back.

Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 kernel/rcu/tree.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 9ad931bff409..29ccc60bdbfc 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1774,10 +1774,12 @@ static bool __note_gp_changes(struct rcu_state *rsp, 
struct rcu_node *rnp,
 
/* Handle the ends of any preceding grace periods first. */
if (rcu_seq_completed_gp(rdp->gp_seq, rnp->gp_seq) ||
-   unlikely(READ_ONCE(rdp->gpwrap)))
+   unlikely(READ_ONCE(rdp->gpwrap))) {
ret = rcu_advance_cbs(rsp, rnp, rdp); /* Advance callbacks. */
-   else
+   trace_rcu_grace_period(rsp->name, rdp->gp_seq, TPS("cpuend"));
+   } else {
ret = rcu_accelerate_cbs(rsp, rnp, rdp); /* Recent callbacks. */
+   }
 
/* Now handle the beginnings of any new-to-this-CPU grace periods. */
if (rcu_seq_new_gp(rdp->gp_seq, rnp->gp_seq) ||
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC] schedutil: Allow cpufreq requests to be made even when kthread kicked

2018-05-16 Thread Joel Fernandes (Google)
Currently there is a chance of a schedutil cpufreq update request to be
dropped if there is a pending update request. This pending request can
be delayed if there is a scheduling delay of the irq_work and the wake
up of the schedutil governor kthread.

A particularly bad scenario is when a schedutil request has just been made,
such as one to reduce the CPU frequency, and then a newer request to increase
the CPU frequency (even an urgent sched-deadline frequency-increase request)
is dropped, even though the rate limits suggest that it is OK to process a
request. This happens because of the way the work_in_progress flag is used.

This patch improves the situation by allowing new requests to be made even
though the old one is still being processed. Note that in this approach, if
an irq_work has already been issued, we just update next_freq and don't
bother to queue another request, so there is no extra work done to make
this happen.
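
A condensed view of the slow-switch commit path after this patch (an
illustrative pseudo-flow, simplified from the diff below; not literal
kernel code):

/* sugov_update_commit(), slow-switch case, roughly: */
sg_policy->next_freq = next_freq;          /* always record the newest request */
if (!sg_policy->work_in_progress) {        /* kick the kthread only once       */
	sg_policy->work_in_progress = true;
	irq_work_queue(&sg_policy->irq_work);
}
/* sugov_work() later reads next_freq under update_lock and clears the flag. */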

I had brought up this issue at the OSPM conference and Claudio had a
discussion RFC with an alternate approach [1]. I prefer the approach as
done in the patch below since it doesn't need any new flags and doesn't
cause any other extra overhead.

[1] https://patchwork.kernel.org/patch/10384261/

CC: Viresh Kumar <viresh.ku...@linaro.org>
CC: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Patrick Bellasi <patrick.bell...@arm.com>
CC: Juri Lelli <juri.le...@redhat.com>
Cc: Luca Abeni <luca.ab...@santannapisa.it>
CC: Joel Fernandes <joe...@google.com>
CC: linux...@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
Claudio,
Could you also test this patch for your usecase?

 kernel/sched/cpufreq_schedutil.c | 36 +---
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e13df951aca7..a87fc281893d 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy 
*sg_policy, u64 time)
!cpufreq_can_do_remote_dvfs(sg_policy->policy))
return false;
 
-   if (sg_policy->work_in_progress)
-   return false;
-
if (unlikely(sg_policy->need_freq_update)) {
sg_policy->need_freq_update = false;
/*
@@ -129,8 +126,11 @@ static void sugov_update_commit(struct sugov_policy 
*sg_policy, u64 time,
policy->cur = next_freq;
trace_cpu_frequency(next_freq, smp_processor_id());
} else {
-   sg_policy->work_in_progress = true;
-   irq_work_queue(&sg_policy->irq_work);
+   /* Don't queue request if one was already queued */
+   if (!sg_policy->work_in_progress) {
+   sg_policy->work_in_progress = true;
+   irq_work_queue(&sg_policy->irq_work);
+   }
}
 }
 
@@ -291,6 +291,15 @@ static void sugov_update_single(struct update_util_data 
*hook, u64 time,
 
ignore_dl_rate_limit(sg_cpu, sg_policy);
 
+   /*
+* For slow-switch systems, single policy requests can't run at the
+* moment if the governor thread is already processing a pending
+* frequency switch request. This could be fixed by acquiring update_lock
+* while updating next_freq and work_in_progress, but we prefer not to.
+*/
+   if (sg_policy->work_in_progress)
+   return;
+
if (!sugov_should_update_freq(sg_policy, time))
return;
 
@@ -382,13 +391,24 @@ sugov_update_shared(struct update_util_data *hook, u64 
time, unsigned int flags)
 static void sugov_work(struct kthread_work *work)
 {
struct sugov_policy *sg_policy = container_of(work, struct 
sugov_policy, work);
+   unsigned int freq;
+   unsigned long flags;
+
+   /*
+* Hold sg_policy->update_lock briefly to handle the case where
+* sg_policy->next_freq is read here and then updated by
+* sugov_update_shared() just before work_in_progress is set to false
+* here; otherwise we may miss queueing the new update.
+*/
+   raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
+   freq = sg_policy->next_freq;
+   sg_policy->work_in_progress = false;
+   raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
 
mutex_lock(&sg_policy->work_lock);
-   __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
+   __cpufreq_driver_target(sg_policy->policy, freq,
CPUFREQ_RELATION_L);
mutex_unlock(&sg_policy->work_lock);
-
-   sg_policy->work_in_progress = false;
 }
 
 static void sugov_irq_work(struct irq_work *irq_work)
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC] schedutil: Address the r/w ordering race in kthread

2018-05-22 Thread Joel Fernandes (Google)
Currently there is a race in the schedutil code on slow-switch single-CPU
systems. Fix it by enforcing that the read of next_freq happens before the
write that clears work_in_progress.

Kthread   Sched update

sugov_work()  sugov_update_single()

  lock();
  // The CPU is free to rearrange below
  // two in any order, so it may clear
  // the flag first and then read next
  // freq. Lets assume it does.
  work_in_progress = false

   if (work_in_progress)
 return;

   sg_policy->next_freq = 0;
  freq = sg_policy->next_freq;
   sg_policy->next_freq = real-freq;
  unlock();
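
In other words, the kthread must finish reading next_freq before it
publishes work_in_progress = false. A condensed sketch of the ordering the
fix below enforces (simplified; not literal kernel code):

/* sugov_work(), kthread side, simplified: */
freq = sg_policy->next_freq;          /* 1. consume the pending request       */
smp_mb();                             /* 2. order the read before the store   */
sg_policy->work_in_progress = false;  /* 3. let the update side make progress */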

Reported-by: Viresh Kumar <viresh.ku...@linaro.org>
CC: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Patrick Bellasi <patrick.bell...@arm.com>
CC: Juri Lelli <juri.le...@redhat.com>
Cc: Luca Abeni <luca.ab...@santannapisa.it>
CC: Todd Kjos <tk...@google.com>
CC: clau...@evidence.eu.com
CC: kernel-t...@android.com
CC: linux...@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
I split this into separate patch, because this race can also happen in
mainline.

 kernel/sched/cpufreq_schedutil.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 5c482ec38610..ce7749da7a44 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -401,6 +401,13 @@ static void sugov_work(struct kthread_work *work)
 */
raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
freq = sg_policy->next_freq;
+
+   /*
+* sugov_update_single can access work_in_progress without update_lock,
+* make sure next_freq is read before work_in_progress is set.
+*/
+   smp_mb();
+
sg_policy->work_in_progress = false;
raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
 
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH 2/2] kselftests: ftrace: Add tests for the preemptoff and irqsoff tracers

2018-05-17 Thread Joel Fernandes (Google)
Here we add unit tests for the preemptoff and irqsoff tracer by using a
kernel module introduced previously to trigger atomic sections in the
kernel.

Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Peter Zilstra <pet...@infradead.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Cc: Tom Zanussi <tom.zanu...@linux.intel.com>
Cc: Namhyung Kim <namhy...@kernel.org>
Cc: Thomas Glexiner <t...@linutronix.de>
Cc: Boqun Feng <boqun.f...@gmail.com>
Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
Cc: Masami Hiramatsu <mhira...@kernel.org>
Cc: Todd Kjos <tk...@google.com>
Cc: Erick Reyes <erickre...@google.com>
Cc: Julia Cartwright <ju...@ni.com>
Cc: kernel-t...@android.com
Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 tools/testing/selftests/ftrace/config |  3 +
 .../test.d/preemptirq/irqsoff_tracer.tc   | 74 +++
 2 files changed, 77 insertions(+)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/preemptirq/irqsoff_tracer.tc

diff --git a/tools/testing/selftests/ftrace/config 
b/tools/testing/selftests/ftrace/config
index b01924c71c09..29588b328345 100644
--- a/tools/testing/selftests/ftrace/config
+++ b/tools/testing/selftests/ftrace/config
@@ -4,3 +4,6 @@ CONFIG_FUNCTION_PROFILER=y
 CONFIG_TRACER_SNAPSHOT=y
 CONFIG_STACK_TRACER=y
 CONFIG_HIST_TRIGGERS=y
+CONFIG_PREEMPT_TRACER=y
+CONFIG_IRQSOFF_TRACER=y
+CONFIG_TEST_ATOMIC_SECTIONS=m
diff --git a/tools/testing/selftests/ftrace/test.d/preemptirq/irqsoff_tracer.tc 
b/tools/testing/selftests/ftrace/test.d/preemptirq/irqsoff_tracer.tc
new file mode 100644
index ..b76d781c5645
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/preemptirq/irqsoff_tracer.tc
@@ -0,0 +1,74 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: test for the preemptirqsoff tracer
+
+MOD=test_atomic_sections
+
+fail() {
+reset_tracer
+rmmod $MOD || true
+exit_fail
+}
+
+unsup() { #msg
+reset_tracer
+rmmod $MOD || true
+echo $1
+exit_unsupported
+}
+
+modprobe $MOD || unsup "$MOD module not available"
+rmmod $MOD
+
+grep "preemptoff" available_tracers || unsup "preemptoff tracer not enabled"
+grep "irqsoff" available_tracers || unsup "irqsoff tracer not enabled"
+
+reset_tracer
+
+# Simulate preemptoff section for half a second couple of times
+echo preemptoff > current_tracer
+sleep 1
+modprobe test_atomic_sections atomic_mode=preempt atomic_time=50 || fail
+rmmod test_atomic_sections || fail
+modprobe test_atomic_sections atomic_mode=preempt atomic_time=50 || fail
+rmmod test_atomic_sections || fail
+modprobe test_atomic_sections atomic_mode=preempt atomic_time=50 || fail
+rmmod test_atomic_sections || fail
+
+cat trace
+
+# Confirm which tracer
+grep "tracer: preemptoff" trace || fail
+
+# Check the end of the section
+egrep "5.us : " trace || fail
+
+# Check for 500ms of latency
+egrep "latency: 5. us" trace || fail
+
+reset_tracer
+
+# Simulate irqsoff section for half a second couple of times
+echo irqsoff > current_tracer
+sleep 1
+modprobe test_atomic_sections atomic_mode=irq atomic_time=50 || fail
+rmmod test_atomic_sections || fail
+modprobe test_atomic_sections atomic_mode=irq atomic_time=50 || fail
+rmmod test_atomic_sections || fail
+modprobe test_atomic_sections atomic_mode=irq atomic_time=50 || fail
+rmmod test_atomic_sections || fail
+
+cat trace
+
+# Confirm which tracer
+grep "tracer: irqsoff" trace || fail
+
+# Check the end of the section
+egrep "5.us : " trace || fail
+
+# Check for 500ms of latency
+egrep "latency: 5. us" trace || fail
+
+reset_tracer
+exit 0
+
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH v2] schedutil: Allow cpufreq requests to be made even when kthread kicked

2018-05-18 Thread Joel Fernandes (Google.)
From: "Joel Fernandes (Google)" <j...@joelfernandes.org>

Currently there is a chance of a schedutil cpufreq update request to be
dropped if there is a pending update request. This pending request can
be delayed if there is a scheduling delay of the irq_work and the wake
up of the schedutil governor kthread.

A particularly bad scenario is when a schedutil request has just been made,
such as one to reduce the CPU frequency, and then a newer request to increase
the CPU frequency (even an urgent sched-deadline frequency-increase request)
is dropped, even though the rate limits suggest that it is OK to process a
request. This happens because of the way the work_in_progress flag is used.

This patch improves the situation by allowing new requests to be made even
though the old one is still being processed. Note that in this approach, if
an irq_work has already been issued, we just update next_freq and don't
bother to queue another request, so there is no extra work done to make
this happen.

I had brought up this issue at the OSPM conference and Claudio had a
discussion RFC with an alternate approach [1]. I prefer the approach as
done in the patch below since it doesn't need any new flags and doesn't
cause any other extra overhead.

[1] https://patchwork.kernel.org/patch/10384261/

LGTMed-by: Viresh Kumar <viresh.ku...@linaro.org>
LGTMed-by: Juri Lelli <juri.le...@redhat.com>
CC: Viresh Kumar <viresh.ku...@linaro.org>
CC: Rafael J. Wysocki <rafael.j.wyso...@intel.com>
CC: Peter Zijlstra <pet...@infradead.org>
CC: Ingo Molnar <mi...@redhat.com>
CC: Patrick Bellasi <patrick.bell...@arm.com>
CC: Juri Lelli <juri.le...@redhat.com>
Cc: Luca Abeni <luca.ab...@santannapisa.it>
CC: Joel Fernandes <joe...@google.com>
CC: Todd Kjos <tk...@google.com>
CC: clau...@evidence.eu.com
CC: kernel-t...@android.com
CC: linux...@vger.kernel.org
Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
v1 -> v2: Minor style related changes.

 kernel/sched/cpufreq_schedutil.c | 34 
 1 file changed, 26 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index e13df951aca7..5c482ec38610 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy 
*sg_policy, u64 time)
!cpufreq_can_do_remote_dvfs(sg_policy->policy))
return false;
 
-   if (sg_policy->work_in_progress)
-   return false;
-
if (unlikely(sg_policy->need_freq_update)) {
sg_policy->need_freq_update = false;
/*
@@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy 
*sg_policy, u64 time,
 
policy->cur = next_freq;
trace_cpu_frequency(next_freq, smp_processor_id());
-   } else {
+   } else if (!sg_policy->work_in_progress) {
sg_policy->work_in_progress = true;
irq_work_queue(&sg_policy->irq_work);
}
@@ -291,6 +288,13 @@ static void sugov_update_single(struct update_util_data 
*hook, u64 time,
 
ignore_dl_rate_limit(sg_cpu, sg_policy);
 
+   /*
+* For slow-switch systems, single policy requests can't run at the
+* moment if update is in progress, unless we acquire update_lock.
+*/
+   if (sg_policy->work_in_progress)
+   return;
+
if (!sugov_should_update_freq(sg_policy, time))
return;
 
@@ -382,13 +386,27 @@ sugov_update_shared(struct update_util_data *hook, u64 
time, unsigned int flags)
 static void sugov_work(struct kthread_work *work)
 {
struct sugov_policy *sg_policy = container_of(work, struct 
sugov_policy, work);
+   unsigned int freq;
+   unsigned long flags;
+
+   /*
+* Hold sg_policy->update_lock briefly to handle the case where
+* sg_policy->next_freq is read here and then updated by
+* sugov_update_shared() just before work_in_progress is set to false
+* here; otherwise we may miss queueing the new update.
+*
+* Note: If a work item was queued after the update_lock is released,
+* sugov_work() will just be called again by the kthread_work code; and
+* the request will be processed before the sugov thread sleeps.
+*/
+   raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
+   freq = sg_policy->next_freq;
+   sg_policy->work_in_progress = false;
+   raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);
 
mutex_lock(&sg_policy->work_lock);
-   __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq,
-   CPUFREQ_RELATION_L);
+   __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L);
mutex_unlock(&sg_policy->work_lock);
-
-   sg_policy->work_in_progress = false;
 }
 
 static void sugov_irq_work(struct irq_work *irq_work)
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH v7 4/6] trace/irqsoff: Split reset into separate functions

2018-05-15 Thread Joel Fernandes (Google)
Split the reset function into separate per-tracer functions in preparation
for future patches that need to do tracer-specific reset.

Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Peter Zilstra <pet...@infradead.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Cc: Tom Zanussi <tom.zanu...@linux.intel.com>
Cc: Namhyung Kim <namhy...@kernel.org>
Cc: Thomas Glexiner <t...@linutronix.de>
Cc: Boqun Feng <boqun.f...@gmail.com>
Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: Randy Dunlap <rdun...@infradead.org>
Cc: Masami Hiramatsu <mhira...@kernel.org>
Cc: Fenguang Wu <fengguang...@intel.com>
Cc: Baohong Liu <baohong@intel.com>
Cc: Vedang Patel <vedang.pa...@intel.com>
Cc: kernel-t...@android.com
Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 kernel/trace/trace_irqsoff.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 03ecb4465ee4..f8daa754cce2 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -634,7 +634,7 @@ static int __irqsoff_tracer_init(struct trace_array *tr)
return 0;
 }
 
-static void irqsoff_tracer_reset(struct trace_array *tr)
+static void __irqsoff_tracer_reset(struct trace_array *tr)
 {
int lat_flag = save_flags & TRACE_ITER_LATENCY_FMT;
int overwrite_flag = save_flags & TRACE_ITER_OVERWRITE;
@@ -665,6 +665,12 @@ static int irqsoff_tracer_init(struct trace_array *tr)
 
return __irqsoff_tracer_init(tr);
 }
+
+static void irqsoff_tracer_reset(struct trace_array *tr)
+{
+   __irqsoff_tracer_reset(tr);
+}
+
 static struct tracer irqsoff_tracer __read_mostly =
 {
.name   = "irqsoff",
@@ -697,11 +703,16 @@ static int preemptoff_tracer_init(struct trace_array *tr)
return __irqsoff_tracer_init(tr);
 }
 
+static void preemptoff_tracer_reset(struct trace_array *tr)
+{
+   __irqsoff_tracer_reset(tr);
+}
+
 static struct tracer preemptoff_tracer __read_mostly =
 {
.name   = "preemptoff",
.init   = preemptoff_tracer_init,
-   .reset  = irqsoff_tracer_reset,
+   .reset  = preemptoff_tracer_reset,
.start  = irqsoff_tracer_start,
.stop   = irqsoff_tracer_stop,
.print_max  = true,
@@ -731,11 +742,16 @@ static int preemptirqsoff_tracer_init(struct trace_array 
*tr)
return __irqsoff_tracer_init(tr);
 }
 
+static void preemptirqsoff_tracer_reset(struct trace_array *tr)
+{
+   __irqsoff_tracer_reset(tr);
+}
+
 static struct tracer preemptirqsoff_tracer __read_mostly =
 {
.name   = "preemptirqsoff",
.init   = preemptirqsoff_tracer_init,
-   .reset  = irqsoff_tracer_reset,
+   .reset  = preemptirqsoff_tracer_reset,
.start  = irqsoff_tracer_start,
.stop   = irqsoff_tracer_stop,
.print_max  = true,
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH v7 1/6] softirq: reorder trace_softirqs_on to prevent lockdep splat

2018-05-15 Thread Joel Fernandes (Google)
I'm able to reproduce a lockdep splat with config options:
CONFIG_PROVE_LOCKING=y,
CONFIG_DEBUG_LOCK_ALLOC=y and
CONFIG_PREEMPTIRQ_EVENTS=y

$ echo 1 > /d/tracing/events/preemptirq/preempt_enable/enable
---
 kernel/softirq.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 177de3640c78..8a040bcaa033 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -139,9 +139,13 @@ static void __local_bh_enable(unsigned int cnt)
 {
lockdep_assert_irqs_disabled();
 
+   if (preempt_count() == cnt)
+   trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
+
if (softirq_count() == (cnt & SOFTIRQ_MASK))
trace_softirqs_on(_RET_IP_);
-   preempt_count_sub(cnt);
+
+   __preempt_count_sub(cnt);
 }
 
 /*
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH v7 2/6] srcu: Add notrace variants of srcu_read_{lock,unlock}

2018-05-15 Thread Joel Fernandes (Google)
From: Paul McKenney <paul...@linux.vnet.ibm.com>

This is needed for a future tracepoint patch that uses srcu, and to make
sure it doesn't call into lockdep.

The tracepoint code already calls the notrace variants for
rcu_read_lock_sched, so this patch does the same for SRCU, which will be
used in a later patch. This keeps it consistent with rcu-sched.

[Joel: Added commit message]
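
As a rough usage sketch of the new helpers (illustrative only; the caller
and the tracepoint_srcu domain here stand in for what a later tracepoint
patch introduces):

#include <linux/srcu.h>

extern struct srcu_struct tracepoint_srcu;	/* assumed to exist */

/* Illustrative only: a tracing-path reader that must not recurse into
 * lockdep or the function tracer uses the notrace variants. */
static void example_traced_path(void)
{
	int idx;

	idx = srcu_read_lock_notrace(&tracepoint_srcu);
	/* ... dereference and invoke probe functions ... */
	srcu_read_unlock_notrace(&tracepoint_srcu, idx);
}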

Cc: Steven Rostedt <rost...@goodmis.org>
Cc: Peter Zilstra <pet...@infradead.org>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoy...@efficios.com>
Cc: Tom Zanussi <tom.zanu...@linux.intel.com>
Cc: Namhyung Kim <namhy...@kernel.org>
Cc: Thomas Glexiner <t...@linutronix.de>
Cc: Boqun Feng <boqun.f...@gmail.com>
Cc: Paul McKenney <paul...@linux.vnet.ibm.com>
Cc: Frederic Weisbecker <fweis...@gmail.com>
Cc: Randy Dunlap <rdun...@infradead.org>
Cc: Masami Hiramatsu <mhira...@kernel.org>
Cc: Fenguang Wu <fengguang...@intel.com>
Cc: Baohong Liu <baohong@intel.com>
Cc: Vedang Patel <vedang.pa...@intel.com>
Cc: kernel-t...@android.com
Reviewed-by: Steven Rostedt (VMware) <rost...@goodmis.org>
Signed-off-by: Paul McKenney <paul...@linux.vnet.ibm.com>
Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
 include/linux/srcu.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 33c1c698df09..2ec618979b20 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -161,6 +161,16 @@ static inline int srcu_read_lock(struct srcu_struct *sp) 
__acquires(sp)
return retval;
 }
 
+/* Used by tracing, cannot be traced and cannot invoke lockdep. */
+static inline notrace int
+srcu_read_lock_notrace(struct srcu_struct *sp) __acquires(sp)
+{
+   int retval;
+
+   retval = __srcu_read_lock(sp);
+   return retval;
+}
+
 /**
  * srcu_read_unlock - unregister a old reader from an SRCU-protected structure.
  * @sp: srcu_struct in which to unregister the old reader.
@@ -175,6 +185,13 @@ static inline void srcu_read_unlock(struct srcu_struct 
*sp, int idx)
__srcu_read_unlock(sp, idx);
 }
 
+/* Used by tracing, cannot be traced and cannot call lockdep. */
+static inline notrace void
+srcu_read_unlock_notrace(struct srcu_struct *sp, int idx) __releases(sp)
+{
+   __srcu_read_unlock(sp, idx);
+}
+
 /**
  * smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock
  *
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH v2] sched: Remove obscure comment from select_task_rq_fair

2018-05-27 Thread Joel Fernandes (Google)
I was playing with cpusets and the sched_load_balance flag and noticed that
the fast path (select_idle_sibling) can also be attempted for exec-balance,
not just wake-balance, if the waker CPU's cpuset has sched_load_balance = 0.
This patch removes the obscure comment which said this path can be entered
only for wake-balance.

To trigger this, I just do:
mkdir /cpuset
mount -t cpuset none /cpuset
echo 0 > sched_load_balance

Then did some random activity and dumped the stack from 'if (!sd)' for
the non wake-balance cases.

Following is one of the stacks:
dump_stack+0x46/0x5b
select_task_rq_fair+0x101d/0x1030
sched_exec+0x4f/0xc0
do_execveat_common.isra.41+0x1e3/0x7c0
__x64_sys_execve+0x2d/0x40
do_syscall_64+0x43/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9

It turns out the same case also occurs during boot, when kthreadd tries
to create threads before domains are attached, so let's fix the comment.

Cc: Dietmar Eggemann <dietmar.eggem...@arm.com>
Cc: Morten Rasmussen <morten.rasmus...@arm.com>
Cc: Ingo Molnar <mi...@redhat.com>
Cc: Peter Zijlstra <pet...@infradead.org>
Cc: Juri Lelli <juri.le...@redhat.com>
Cc: Vincent Guittot <vincent.guit...@linaro.org>
Cc: Patrick Bellasi <patrick.bell...@arm.com>
Cc: Rohit Jain <rohit.k.j...@oracle.com>
Cc: kernel-t...@android.com
Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org>
---
v1->v2: Resending without "XXX" in subject since otherwise LKML thinks
its junk.

 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 54dc31e7ab9b..dd07794141d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6665,7 +6665,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, 
int sd_flag, int wake_f
 
if (!sd) {
 pick_cpu:
-   if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
+   if (sd_flag & SD_BALANCE_WAKE) {
new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
if (want_affine)
-- 
2.17.0.921.gf22659ad46-goog



[PATCH] trace: Use rcu_dereference_raw for hooks from trace-event subsystem

2018-08-02 Thread Joel Fernandes (Google)
Since we switched to using SRCU for tracepoints used in the idle path,
we can no longer use rcu_dereference_sched for dereferencing pointers in
trace-event hooks.

Since tracepoints can now use either SRCU or sched-RCU, just use
rcu_dereference_raw for trace events, just like we do when
dereferencing the tracepoint table.

This prevents an RCU warning reported by Masami:

[  282.060593] WARNING: can't dereference registers at f3c7f62b
[  282.063200] =
[  282.064082] WARNING: suspicious RCU usage
[  282.064963] 4.18.0-rc6+ #15 Tainted: GW
[  282.066048] -
[  282.066923] /home/mhiramat/ksrc/linux/kernel/trace/trace_events.c:242
suspicious rcu_dereference_check() usage!
[  282.068974] 
[  282.068974] other info that might help us debug this:
[  282.068974] 
[  282.070770] 
[  282.070770] RCU used illegally from idle CPU!
[  282.070770] rcu_scheduler_active = 2, debug_locks = 1
[  282.072938] RCU used illegally from extended quiescent state!
[  282.074183] no locks held by swapper/0/0.
[  282.075071] 
[  282.075071] stack backtrace:
[  282.076121] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW
[  282.077782] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996)
[  282.079604] Call Trace:
[  282.080212]  
[  282.080755]  dump_stack+0x85/0xcb
[  282.081523]  trace_event_ignore_this_pid+0x66/0x70
[  282.082541]  trace_event_raw_event_preemptirq_template+0xa2/0xb0
[  282.083774]  ? interrupt_entry+0xc4/0xe0
[  282.084665]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  282.085669]  trace_hardirqs_off_caller+0x90/0xd0
[  282.086597]  trace_hardirqs_off_thunk+0x1a/0x1c
[  282.087433]  ? call_function_interrupt+0xa/0x20
[  282.088201]  interrupt_entry+0xc4/0xe0
[  282.088848]  ? call_function_interrupt+0xa/0x20
[  282.089579]  
[  282.090029]  ? native_safe_halt+0x2/0x10
[  282.090695]  ? default_idle+0x1f/0x160
[  282.091330]  ? default_idle_call+0x24/0x40
[  282.091997]  ? do_idle+0x210/0x250
[  282.092658]  ? cpu_startup_entry+0x6f/0x80
[  282.093338]  ? start_kernel+0x49d/0x4bd
[  282.093987]  ? secondary_startup_64+0xa5/0xb0

Reported-by: Masami Hiramatsu 
Fixes: e6753f23d961 ("tracepoint: Make rcuidle tracepoint callers use SRCU")
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/trace/trace_events.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c
index 14ff4ff3caab..7b508ce8ac44 100644
--- a/kernel/trace/trace_events.c
+++ b/kernel/trace/trace_events.c
@@ -239,7 +239,7 @@ bool trace_event_ignore_this_pid(struct trace_event_file 
*trace_file)
struct trace_array_cpu *data;
struct trace_pid_list *pid_list;
 
-   pid_list = rcu_dereference_sched(tr->filtered_pids);
+   pid_list = rcu_dereference_raw(tr->filtered_pids);
if (!pid_list)
return false;
 
@@ -512,7 +512,7 @@ event_filter_pid_sched_process_exit(void *data, struct 
task_struct *task)
struct trace_pid_list *pid_list;
struct trace_array *tr = data;
 
-   pid_list = rcu_dereference_sched(tr->filtered_pids);
+   pid_list = rcu_dereference_raw(tr->filtered_pids);
trace_filter_add_remove_task(pid_list, NULL, task);
 }
 
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH ftrace/core] tracing: irqsoff: Account for additional preempt_disable

2018-08-05 Thread Joel Fernandes (Google)
Recently we tried to make the preemptirqsoff tracer use the irqsoff
tracepoint probes. However, this causes issues as reported by Masami:

[2.271078] Testing tracer preemptirqsoff: .. no entries found ..FAILED!
[2.381015] WARNING: CPU: 0 PID: 1 at /home/mhiramat/ksrc/linux/kernel/
trace/trace.c:1512 run_tracer_selftest+0xf3/0x154

This is due to the tracepoint code increasing the preempt nesting count
by calling an additional preempt_disable before calling into the
preemptoff tracer, which messes up the preempt_count() check in
tracer_hardirqs_off().

To fix this, make the irqsoff tracer probes balance the additional outer
preempt_disable with a preempt_enable_notrace.

The other way to fix this is to just use SRCU for all tracepoints.
However we can't do that because we can't use NMIs from RCU context.

Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints
  and unify their usage")
Fixes: e6753f23d961 ("tracepoint: Make rcuidle tracepoint callers use SRCU")
Reported-by: Masami Hiramatsu 
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/trace/trace_irqsoff.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 770cd30cda40..ffbf1505d5bc 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -603,14 +603,40 @@ static void irqsoff_tracer_stop(struct trace_array *tr)
  */
 static void tracer_hardirqs_on(void *none, unsigned long a0, unsigned long a1)
 {
+   /*
+* Tracepoint probes are expected to be called with preempt disabled.
+* We don't care about being called with preempt disabled, but we need
+* to know in the future if that changes so we can remove the next
+* preempt_enable.
+*/
+   WARN_ON_ONCE(!preempt_count());
+
+   /* Tracepoint probes disable preemption at least once, account for that */
+   preempt_enable_notrace();
+
if (!preempt_trace() && irq_trace())
stop_critical_timing(a0, a1);
+
+   preempt_disable_notrace();
 }
 
 static void tracer_hardirqs_off(void *none, unsigned long a0, unsigned long a1)
 {
+   /*
+* Tracepoint probes are expected to be called with preempt disabled.
+* We don't care about being called with preempt disabled, but we need
+* to know in the future if that changes so we can remove the next
+* preempt_enable.
+*/
+   WARN_ON_ONCE(!preempt_count());
+
+   /* Tracepoint probes disable preemption at least once, account for that */
+   preempt_enable_notrace();
+
if (!preempt_trace() && irq_trace())
start_critical_timing(a0, a1);
+
+   preempt_disable_notrace();
 }
 
 static int irqsoff_tracer_init(struct trace_array *tr)
-- 
2.18.0.597.ga71716f1ad-goog



[RFC] tracepoint: Run tracepoints even after CPU is offline

2018-08-05 Thread Joel Fernandes (Google)
Commit f37755490fe9 ("tracepoints: Do not trace when cpu is offline")
causes a problem for lockdep probes that use tracepoint code. Once a CPU
is offline, tracepoints do not get called; however, this is a big problem
for lockdep probes that need to run so that IRQ annotations are marked
correctly.

An issue is possible where while the CPU is going offline, an interrupt
can come in and then a lockdep assert causes an annotation warning:

[  106.551354] IRQs not enabled as expected
[  106.551785] WARNING: CPU: 1 PID: 0 at kernel/time/tick-sched.c:982
 tick_nohz_idle_enter+0x99/0xb0
[  106.552964] Modules linked in:
[  106.553299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW

We need tracepoints to run as late as possible. This commit tries to fix
the issue by removing the cpu_online check in tracepoint code that was
introduced in the mentioned commit; however, now we run the risk of
dereferencing probes that aren't RCU protected, which gives an RCU
warning like the following on boot:
[0.030159] x86: Booting SMP configuration:
[0.030169]  node  #0, CPUs:  #1
[0.001000]
[0.001000] =
[0.001000] WARNING: suspicious RCU usage
[0.001000] 4.18.0-rc6+ #42 Not tainted
[0.001000] -
[0.001000] ./include/trace/events/timer.h:38 suspicious
rcu_dereference_check() usage!
[0.001000]
[0.001000] other info that might help us debug this:
[0.001000]
[0.001000]
[0.001000] RCU used illegally from offline CPU!
[0.001000] rcu_scheduler_active = 1, debug_locks = 1
[0.001000] no locks held by swapper/1/0.
[0.001000]

Any ideas on how we can fix this? Basically we need RCU to work here
even after !cpu_online. I thought of just using SRCU for all tracepoints;
however, that may mean we can't use tracepoints from NMI.

Tries-to-Fix: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and
 unify their usage")
Reported-by: Masami Hiramatsu 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/tracepoint.h | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index d9a084c72541..020885714a0f 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -365,19 +365,17 @@ extern void syscall_unregfunc(void);
  * "void *__data, proto" as the callback prototype.
  */
 #define DECLARE_TRACE_NOARGS(name) \
-   __DECLARE_TRACE(name, void, ,   \
-   cpu_online(raw_smp_processor_id()), \
+   __DECLARE_TRACE(name, void, , 1,\
void *__data, __data)
 
 #define DECLARE_TRACE(name, proto, args)   \
-   __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),  \
-   cpu_online(raw_smp_processor_id()), \
+   __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), 1,   \
PARAMS(void *__data, proto),\
PARAMS(__data, args))
 
 #define DECLARE_TRACE_CONDITION(name, proto, args, cond)   \
__DECLARE_TRACE(name, PARAMS(proto), PARAMS(args),  \
-   cpu_online(raw_smp_processor_id()) && (PARAMS(cond)), \
+   PARAMS(cond),   \
PARAMS(void *__data, proto),\
PARAMS(__data, args))
 
-- 
2.18.0.597.ga71716f1ad-goog



[PATCH RFC ftrace/core] tracing: Directly call tracer and lockdep probes

2018-08-08 Thread Joel Fernandes (Google)
From: Steven Rostedt 

Due to various problems with using tracepoints as probes for IRQ
enable/disable and preempt enable/disable (such as using SRCU in NMI
context and other issues with the irqsoff/preemptoff tracer), directly
call the lockdep and irqsoff tracer probes for now. This keeps all the
other benefits and clean-ups of the patch being partially reverted.

Based on Steve's original Partial revert patch. I added directly calling
irqsoff tracer probes as well.

Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify 
their usage")
Signed-off-by: Joel Fernandes (Google) 
---
Steve, I expanded your partial revert such that it fixes the remaining
issues as well while keeping the clean-ups and benefits of my original
patch. It's based on the latest ftrace/core. I'm still testing it, but it
works so far. What do you think?  I feel like it could still be a step
forward for this merge window.
Also you could drop "tracepoint: Make rcuidle tracepoint callers use
SRCU" since its not needed anymore for this merge window.

Tests passing so far:
- ftrace startup tests
- locking api self-tests
- manually tested preemptoff and irqsoff tracers
- manually tested preempt/irq disable and enable events
- boot tested without warnings

 include/linux/irqflags.h|  4 ++
 include/linux/lockdep.h |  2 -
 init/main.c |  2 -
 kernel/locking/lockdep.c| 14 +--
 kernel/trace/trace.h| 16 
 kernel/trace/trace_irqsoff.c| 26 ++---
 kernel/trace/trace_preemptirq.c | 66 -
 7 files changed, 75 insertions(+), 55 deletions(-)

diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h
index 50edb9cbbd26..19a90ac7c87d 100644
--- a/include/linux/irqflags.h
+++ b/include/linux/irqflags.h
@@ -19,9 +19,13 @@
 #ifdef CONFIG_PROVE_LOCKING
   extern void trace_softirqs_on(unsigned long ip);
   extern void trace_softirqs_off(unsigned long ip);
+  extern void lockdep_hardirqs_on(unsigned long ip);
+  extern void lockdep_hardirqs_off(unsigned long ip);
 #else
 # define trace_softirqs_on(ip) do { } while (0)
 # define trace_softirqs_off(ip)do { } while (0)
+# define lockdep_hardirqs_on(ip) do { } while (0)
+# define lockdep_hardirqs_off(ip) do { } while (0)
 #endif
 
 #ifdef CONFIG_TRACE_IRQFLAGS
diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index a8113357ceeb..b0d0b51c4d85 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -267,7 +267,6 @@ struct held_lock {
  * Initialization, self-test and debugging-output methods:
  */
 extern void lockdep_init(void);
-extern void lockdep_init_early(void);
 extern void lockdep_reset(void);
 extern void lockdep_reset_lock(struct lockdep_map *lock);
 extern void lockdep_free_key_range(void *start, unsigned long size);
@@ -408,7 +407,6 @@ static inline void lockdep_on(void)
 # define lock_set_class(l, n, k, s, i) do { } while (0)
 # define lock_set_subclass(l, s, i)do { } while (0)
 # define lockdep_init()do { } while (0)
-# define lockdep_init_early()  do { } while (0)
 # define lockdep_init_map(lock, name, key, sub) \
do { (void)(name); (void)(key); } while (0)
 # define lockdep_set_class(lock, key)  do { (void)(key); } while (0)
diff --git a/init/main.c b/init/main.c
index 44fe43be84c1..5d42e577643a 100644
--- a/init/main.c
+++ b/init/main.c
@@ -649,8 +649,6 @@ asmlinkage __visible void __init start_kernel(void)
call_function_init();
WARN(!irqs_disabled(), "Interrupts were enabled early\n");
 
-   lockdep_init_early();
-
early_boot_irqs_disabled = false;
local_irq_enable();
 
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 03bfaeb9f4e6..e406c5fdb41e 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2840,8 +2840,7 @@ static void __trace_hardirqs_on_caller(unsigned long ip)
debug_atomic_inc(hardirqs_on_events);
 }
 
-static void lockdep_hardirqs_on(void *none, unsigned long ignore,
-   unsigned long ip)
+void lockdep_hardirqs_on(unsigned long ip)
 {
if (unlikely(!debug_locks || current->lockdep_recursion))
return;
@@ -2885,8 +2884,7 @@ static void lockdep_hardirqs_on(void *none, unsigned long 
ignore,
 /*
  * Hardirqs were disabled:
  */
-static void lockdep_hardirqs_off(void *none, unsigned long ignore,
-unsigned long ip)
+void lockdep_hardirqs_off(unsigned long ip)
 {
struct task_struct *curr = current;
 
@@ -4315,14 +4313,6 @@ void lockdep_reset_lock(struct lockdep_map *lock)
raw_local_irq_restore(flags);
 }
 
-void __init lockdep_init_early(void)
-{
-#ifdef CONFIG_PROVE_LOCKING
-   register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX);
-   register_trace_prio_irq_enable(lockdep_hardirqs_on, NULL, INT_MAX);

[PATCH RFC] tracepoint: Use SRCU for all tracepoint invocations

2018-08-07 Thread Joel Fernandes (Google)
Steven reported that the rcuidle && in_nmi() condition can occur, which
creates a problem for SRCU usage, since we can't use the same SRCU node
from both NMI context and other contexts (an NMI can come in while the
SRCU read lock is in the process of being taken).

This patch switches to using a separate SRCU node for tracepoints called
from in_nmi(). This is needed to also make tracepoints work while CPU is
offline.

Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and
  unify their usage")
Reported-by: Masami Hiramatsu 
Signed-off-by: Joel Fernandes (Google) 
---
Dropped the "CPU offline" changes, and only keeping the SRCU changes.

 include/linux/tracepoint.h | 19 ---
 kernel/tracepoint.c| 10 ++
 2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index d9a084c72541..1ceee17a38dc 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -35,6 +35,7 @@ struct trace_eval_map {
 #define TRACEPOINT_DEFAULT_PRIO10
 
 extern struct srcu_struct tracepoint_srcu;
+extern struct srcu_struct tracepoint_srcu_nmi;
 
 extern int
 tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data);
@@ -144,13 +145,11 @@ extern void syscall_unregfunc(void);
void *it_func;  \
void *__data;   \
int __maybe_unused idx = 0; \
+   struct srcu_struct *ss; \
\
if (!(cond))\
return; \
\
-   /* srcu can't be used from NMI */   \
-   WARN_ON_ONCE(rcuidle && in_nmi());  \
-   \
/* keep srcu and sched-rcu usage consistent */  \
preempt_disable_notrace();  \
\
@@ -159,7 +158,11 @@ extern void syscall_unregfunc(void);
 * doesn't work from the idle path. \
 */ \
if (rcuidle)\
-   idx = srcu_read_lock_notrace(&tracepoint_srcu); \
+   ss = &tracepoint_srcu_nmi;  \
+   else\
+   ss = &tracepoint_srcu;  \
+   \
+   idx = srcu_read_lock_notrace(ss);   \
\
it_func_ptr = rcu_dereference_raw((tp)->funcs); \
\
@@ -171,8 +174,7 @@ extern void syscall_unregfunc(void);
} while ((++it_func_ptr)->func);\
}   \
\
-   if (rcuidle)\
-   srcu_read_unlock_notrace(&tracepoint_srcu, idx);\
+   srcu_read_unlock_notrace(ss, idx);  \
\
preempt_enable_notrace();   \
} while (0)
@@ -212,11 +214,6 @@ extern void syscall_unregfunc(void);
TP_PROTO(data_proto),   \
TP_ARGS(data_args), \
TP_CONDITION(cond), 0); \
-   if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \
-   rcu_read_lock_sched_notrace();  \
-   rcu_dereference_sched(__tracepoint_##name.funcs);\
-   rcu_read_unlock_sched_notrace();\
-   }   \
}   \
__DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args),  \
PARAMS(cond), PARAMS(data_proto), PARAMS(data_args))\
diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
index 955148d91b74..769d74b2f90e 100644
--- a/kernel/tracepoint.c
+++ b/kernel/tracepoint.c

[PATCH] mm: shmem: Correctly annotate new inodes

2018-08-14 Thread Joel Fernandes (Google)
Directories and inodes don't necessarily need to be in the same
lockdep class. For example, hugetlbfs splits them out too, to prevent
false positives in lockdep. Annotate the lockdep class correctly after
new inode creation: if it's a directory inode, it will be put into a
different class.
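
For reference, lockdep_annotate_inode_mutex_key() keys the inode lock by
whether the inode is a directory; conceptually something like the sketch
below (a simplified illustration, not the exact fs/inode.c code):

/* Simplified sketch: directories get the per-filesystem "dir" key, so
 * they no longer share a lockdep class with regular file inodes. */
if (S_ISDIR(inode->i_mode))
	lockdep_set_class(&inode->i_rwsem,
			  &inode->i_sb->s_type->i_mutex_dir_key);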

This should fix a lockdep splat reported by syzbot:

> ==
> WARNING: possible circular locking dependency detected
> 4.18.0-rc8-next-20180810+ #36 Not tainted
> --
> syz-executor900/4483 is trying to acquire lock:
> d2bfc8fe (&sb->s_type->i_mutex_key#9){}, at: inode_lock
> include/linux/fs.h:765 [inline]
> d2bfc8fe (&sb->s_type->i_mutex_key#9){}, at:
> shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
>
> but task is already holding lock:
> 25208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630
> drivers/staging/android/ashmem.c:448
>
> which lock already depends on the new lock.
>
> -> #2 (ashmem_mutex){+.+.}:
>__mutex_lock_common kernel/locking/mutex.c:925 [inline]
>__mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073
>mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088
>ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361
>call_mmap include/linux/fs.h:1844 [inline]
>mmap_region+0xf27/0x1c50 mm/mmap.c:1762
>do_mmap+0xa10/0x1220 mm/mmap.c:1535
>do_mmap_pgoff include/linux/mm.h:2298 [inline]
>vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357
>ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585
>__do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
>__se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline]
>__x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91
>do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #1 (&mm->mmap_sem){}:
>__might_fault+0x155/0x1e0 mm/memory.c:4568
>_copy_to_user+0x30/0x110 lib/usercopy.c:25
>copy_to_user include/linux/uaccess.h:155 [inline]
>filldir+0x1ea/0x3a0 fs/readdir.c:196
>dir_emit_dot include/linux/fs.h:3464 [inline]
>dir_emit_dots include/linux/fs.h:3475 [inline]
>dcache_readdir+0x13a/0x620 fs/libfs.c:193
>iterate_dir+0x48b/0x5d0 fs/readdir.c:51
>__do_sys_getdents fs/readdir.c:231 [inline]
>__se_sys_getdents fs/readdir.c:212 [inline]
>__x64_sys_getdents+0x29f/0x510 fs/readdir.c:212
>do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> -> #0 (&sb->s_type->i_mutex_key#9){}:
>lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924
>down_write+0x8f/0x130 kernel/locking/rwsem.c:70
>inode_lock include/linux/fs.h:765 [inline]
>shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602
>ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455
>ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797
>vfs_ioctl fs/ioctl.c:46 [inline]
>file_ioctl fs/ioctl.c:501 [inline]
>do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685
>ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702
>__do_sys_ioctl fs/ioctl.c:709 [inline]
>__se_sys_ioctl fs/ioctl.c:707 [inline]
>__x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707
>do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> other info that might help us debug this:
>
> Chain exists of:
>   &sb->s_type->i_mutex_key#9 --> &mm->mmap_sem --> ashmem_mutex
>
>  Possible unsafe locking scenario:
>
>CPU0CPU1
>
>   lock(ashmem_mutex);
>lock(&mm->mmap_sem);
>lock(ashmem_mutex);
>   lock(&sb->s_type->i_mutex_key#9);
>
>  *** DEADLOCK ***
>
> 1 lock held by syz-executor900/4483:
>  #0: 25208078 (ashmem_mutex){+.+.}, at:
> ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448

Reported-by: syzbot 
Cc: wi...@infradead.org
Cc: sta...@vger.kernel.org
Cc: pet...@infradead.org
Suggested-by: Neil Brown 
Signed-off-by: Joel Fernandes (Google) 
---
 mm/shmem.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/shmem.c b/mm/shmem.c
index 2cab84403055..4429a8fd932d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2225,6 +2225,8 @@ static struct inode *shmem_get_inode(struct super_block 
*sb, const struct inode
mpol_shared_policy_init(&info->policy, NULL);
break;
}
+
+   lockdep_annotate_inode_mutex_key(inode);
} else
shmem_free_inode(sb);
return inode;
-- 
2.18.0.597.ga71716f1ad-goog



Re: [PATCH] sched: support dynamiQ cluster

2018-04-13 Thread Joel Fernandes (Google)
On Fri, Apr 6, 2018 at 5:58 AM, Morten Rasmussen
 wrote:
> On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On 5 April 2018 at 17:46, Morten Rasmussen  wrote:
>> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote:
>> >> On 4 April 2018 at 12:44, Valentin Schneider  
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > On 03/04/18 13:17, Vincent Guittot wrote:
>> >> >> Hi Valentin,
>> >> >>
>> >> > [...]
>> >> >>>
>> >> >>> I believe ASYM_PACKING behaves better here because the workload is 
>> >> >>> only
>> >> >>> sysbench threads. As stated above, since task utilization is 
>> >> >>> disregarded, I
>> >> >>
>> >> >> It behaves better because it doesn't wait for the task's utilization
>> >> >> to reach a level before assuming the task needs high compute capacity.
>> >> >> The utilization gives an idea of the running time of the task not the
>> >> >> performance level that is needed
>> >> >>
>> >> >
>> >> > [
>> >> > That's my point actually. ASYM_PACKING disregards utilization and moves 
>> >> > those
>> >> > threads to the big cores ASAP, which is good here because it's just 
>> >> > sysbench
>> >> > threads.
>> >> >
>> >> > What I meant was that if the task composition changes, IOW we mix 
>> >> > "small"
>> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive 
>> >> > stuff like
>> >> > sysbench threads), we shouldn't assume all of those require to run on a 
>> >> > big
>> >> > CPU. The thing is, ASYM_PACKING can't make the difference between 
>> >> > those, so
>> > [Morten]
>> >>
>> >> That's the 1st point where I tend to disagree: why big cores are only
>> >> for long running task and periodic stuff can't need to run on big
>> >> cores to get max compute capacity ?
>> >> You make the assumption that only long running tasks need high compute
>> >> capacity. This patch wants to always provide max compute capacity to
>> >> the system and not only long running task
>> >
>> > There is no way we can tell if a periodic or short-running tasks
>> > requires the compute capacity of a big core or not based on utilization
>> > alone. The utilization can only tell us if a task could potentially use
>> > more compute capacity, i.e. the utilization approaches the compute
>> > capacity of its current cpu.
>> >
>> > How we handle low utilization tasks comes down to how we define
>> > "performance" and if we care about the cost of "performance" (e.g.
>> > energy consumption).
>> >
>> > Placing a low utilization task on a little cpu should always be fine
>> > from _throughput_ point of view. As long as the cpu has spare cycles it
>>
>> [Vincent]
>> I disagree, throughput is not only a matter of spare cycle it's also a
>> matter of how fast you compute the work like with IO activity as an
>> example
>
> [Morten]
> From a cpu centric point of view it is, but I agree that from a
> application/user point of view completion time might impact throughput
> too. For example of if your throughput depends on how fast you can
> offload work to some peripheral device (GPU for example).
>
> However, as I said in the beginning we don't know what the task does.

[Joel]
Just wanted to add to Vincent's point about the throughput of I/O
loads - remembering from when I was playing with the iowait boost
stuff: say you have a little task that does some I/O, blocks, and does
so periodically. In that scenario the task runs for very little time
and looks like a little task if you go by utilization alone. However,
if we were to run it on the big CPUs, the overall throughput of the
I/O activity would be higher.

For this case, it seems impossible to specify the "default" behavior
correctly - do we care more about performance or about energy? This
looks more like a policy decision for userspace and not something the
scheduler should necessarily have to decide, for example when the I/O
activity is background work that does not affect the user experience.

thanks,

- Joel


Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs

2018-04-19 Thread Joel Fernandes (Google)
Hi,

 Or maintain array of registered irqs and iterate over them only.
>>> Right, we can allocate a bitmap of used irqs to do that.
>>>
 I have another idea.

 perf record shows mutex_lock/mutex_unlock at the top.
 Most of them are irq mutex not seqfile mutex as there are many more
 interrupts than reads. Take it once.

>>> How many cpus are in your test system? In that skylake server, it was
>>> the per-cpu summing operation of the irq counts that was consuming most
>>> of the time for reading /proc/stat. I think we can certainly try to
>>> optimize the lock taking.
>> It's 16x(NR_IRQS: 4352, nr_irqs: 960, preallocated irqs: 16)
>> Given that irq registering is rare operation, maintaining sorted array
>> of irq should be the best option.
>>> For the time being, I think I am going to have a clone /proc/stat2 as
>>> suggested in my earlier email. Alternatively, I can put that somewhere
>>> in sysfs if you have a good idea of where I can put it.
>> sysfs is strictly one-value-per-file.
>>
>>> I will also look into ways to optimize the current per-IRQ stats
>>> handling, but it will come later.
>> There is always a time-honored way of ioctl(2) switching irq info off;
>> /proc supports that.
>>
>> There are many options.
>
> OK, it is good to know. Do you have any existing code snippet in the
> kernel that I can use as reference on how to use ioctl(2) switching?
>
> I will look into how to optimize the existing per-IRQ stats code first
> before venturing into cloning /proc/stat.

Can we not just remove the per-IRQ stats from /proc/stat (since I gather
from this discussion that they aren't scalable), and just have
applications that need per-IRQ stats use /proc/interrupts?
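
To illustrate what I mean (just a userspace sketch, not part of any
patch here), an application that wants per-IRQ totals could parse
/proc/interrupts directly, roughly like:

/* Sketch only: sum the per-CPU columns for each /proc/interrupts line.
 * A real tool would treat the trailing chip/handler columns and rows
 * like ERR/MIS more carefully.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/interrupts", "r");
	char line[4096];

	if (!f)
		return 1;

	/* The first line is the CPU0 CPU1 ... header, skip it. */
	if (!fgets(line, sizeof(line), f))
		return 1;

	while (fgets(line, sizeof(line), f)) {
		char *p = strchr(line, ':');
		unsigned long long total = 0;

		if (!p)
			continue;
		*p++ = '\0';

		/* Sum numeric columns until we hit the chip/handler name. */
		while (*p) {
			char *end;
			unsigned long long v = strtoull(p, &end, 10);

			if (end == p)
				break;
			total += v;
			p = end;
		}
		printf("%s: %llu\n", line, total);
	}

	fclose(f);
	return 0;
}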

thanks,

- Joel


Re: [RFC PATCH] tracepoint: Provide tracepoint_kernel_find_by_name

2018-03-26 Thread Joel Fernandes (Google)
Hi Mathieu,

On Mon, Mar 26, 2018 at 12:10 PM, Mathieu Desnoyers
 wrote:
> Provide an API allowing eBPF to lookup core kernel tracepoints by name.
>
> Given that a lookup by name explicitly requires tracepoint definitions
> to be unique for a given name (no duplicate keys), include a
> WARN_ON_ONCE() check that only a single match is encountered at runtime.
> This should always be the case, given that a DEFINE_TRACE emits a
> __tracepoint_##name symbol, which would cause a link-time error if more
> than one instance is found. Nevertheless, check this at runtime with
> WARN_ON_ONCE() to stay on the safe side.
>
> Signed-off-by: Mathieu Desnoyers 
> CC: Steven Rostedt 
> CC: Alexei Starovoitov 
> CC: Peter Zijlstra 
> CC: Ingo Molnar 
> ---
>  include/linux/tracepoint.h |  1 +
>  kernel/tracepoint.c| 35 +++
>  2 files changed, 36 insertions(+)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index c94f466d57ef..1b4ae64b7c6a 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -43,6 +43,7 @@ tracepoint_probe_unregister(struct tracepoint *tp, void 
> *probe, void *data);
>  extern void
>  for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
> void *priv);
> +extern struct tracepoint *tracepoint_kernel_find_by_name(const char *name);
>
>  #ifdef CONFIG_MODULES
>  struct tp_module {
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 671b13457387..0a59f988055a 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -60,6 +60,11 @@ struct tp_probes {
> struct tracepoint_func probes[0];
>  };
>
> +struct tp_find_args {
> +   const char *name;
> +   struct tracepoint *tp;
> +};
> +
>  static inline void *allocate_probes(int count)
>  {
> struct tp_probes *p  = kmalloc(count * sizeof(struct tracepoint_func)
> @@ -528,6 +533,36 @@ void for_each_kernel_tracepoint(void (*fct)(struct 
> tracepoint *tp, void *priv),
>  }
>  EXPORT_SYMBOL_GPL(for_each_kernel_tracepoint);
>
> +static void find_tp(struct tracepoint *tp, void *priv)
> +{
> +   struct tp_find_args *args = priv;
> +
> +   if (!strcmp(tp->name, args->name)) {
> +   WARN_ON_ONCE(args->tp);
> +   args->tp = tp;

I think this runtime check is not needed if it really can't happen
(the linker already verifies that, as you mentioned), although I'm not
opposed if you still want to keep it; there's no way to break out of
the outer loop anyway, so I guess it's your call. I would just do:

if (args->tp)
return;

if find_tp already found the tracepoint.
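
i.e. something like this (just a sketch on top of your patch, with only
the early return added to the find_tp() from this patch):

static void find_tp(struct tracepoint *tp, void *priv)
{
	struct tp_find_args *args = priv;

	if (args->tp)
		return;	/* a match was already found, skip the rest */

	if (!strcmp(tp->name, args->name))
		args->tp = tp;
}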

Tried to also create a duplicate tracepoint and I get:
  CC  init/version.o
  AR  init/built-in.o
  AR  built-in.o
  LD  vmlinux.o
block/blk-core.o:(__tracepoints+0x440): multiple definition of
`__tracepoint_sched_switch'
kernel/sched/core.o:(__tracepoints+0x440): first defined here
Makefile:1032: recipe for target 'vmlinux' failed
make: *** [vmlinux] Error 1

For this senseless diff:
 diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 81b43f5bdf23..2b855c05197d 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -49,6 +49,13 @@ DEFINE_EVENT(block_buffer, block_touch_buffer,
TP_ARGS(bh)
 );

+DEFINE_EVENT(block_buffer, sched_switch,
+
+   TP_PROTO(struct buffer_head *bh),
+
+   TP_ARGS(bh)
+);
+
 /**
  * block_dirty_buffer - mark a buffer dirty
  * @bh: buffer_head being dirtied


thanks,

- Joel


Re: [PATCH 0/3] [RFC] init, tracing: Add initcall trace events

2018-03-26 Thread Joel Fernandes (Google)
Hi Steve,

On Fri, Mar 23, 2018 at 8:02 AM, Steven Rostedt  wrote:
> A while ago we had a boot tracer. But it was eventually removed:
>  commit 30dbb20e68e6f ("tracing: Remove boot tracer").
>
> The rational was because there is already a initcall_debug boot option
> that causes printk()s of all the initcall functions.
>
> The problem with the initcall_debug option is that printk() is awfully slow,
> and makes it difficult to see the real impact of initcalls. Mainly because
> a single printk() is usually slower than most initcall functions.
>
> Instead of bringing back the boot tracer, adding trace events around the
> initcall functions, and even one to denote which level the initcall
> functions are being called from, adds the necessary information to
> analyze the initcalls without the high overhead of printk()s, that
> can substantially slow down the boot process.
>
> Another positive, is that the console initcall functions themselves
> can also be traced. The timestamps are not operational at that time
> but you can see which consoles are being registered. I saw this on
> one of my test boxes:
>
> -0 [000] ...1 0.00: initcall_level: level=console
> -0 [000] ...1 0.00: initcall_start: func=con_init+0x0/0x224
> -0 [000] ...1 0.00: initcall_finish: 
> func=con_init+0x0/0x224 ret=0
> -0 [000] ...1 0.00: initcall_start: 
> func=hvc_console_init+0x0/0x19
> -0 [000] ...1 0.00: initcall_finish: 
> func=hvc_console_init+0x0/0x19 ret=0
> -0 [000] ...1 0.00: initcall_start: 
> func=xen_cons_init+0x0/0x60
> -0 [000] ...1 0.00: initcall_finish: 
> func=xen_cons_init+0x0/0x60 ret=0
> -0 [000] ...1 0.00: initcall_start: 
> func=univ8250_console_init+0x0/0x2d
> -0 [000] ...1 0.00: initcall_finish: 
> func=univ8250_console_init+0x0/0x2d ret=0

Will this make initcall_debug not work if CONFIG_TRACEPOINTS is turned
off? It still builds, but I think the initcall_debug feature will then
silently fail; maybe CONFIG_TRACEPOINTS should be selected somewhere?

I recently ran into some issues like this for my preemptirq
tracepoints patch (which I will post again soon :D) where lockdep
needed the tracepoints and I had to select it.

thanks,

- Joel


Re: [RFC PATCH] tracepoint: Provide tracepoint_kernel_find_by_name

2018-03-27 Thread Joel Fernandes (Google)
On Tue, Mar 27, 2018 at 6:27 AM, Mathieu Desnoyers
 wrote:

>>> +static void find_tp(struct tracepoint *tp, void *priv)
>>> +{
>>> +   struct tp_find_args *args = priv;
>>> +
>>> +   if (!strcmp(tp->name, args->name)) {
>>> +   WARN_ON_ONCE(args->tp);
>>> +   args->tp = tp;
>>
>> I think this runtime check is not needed if it really can't happen
>> (linker verifies that already as you mentioned) although I'm not
>> opposed if you still want to keep it, because there's no way to break
>> out of the outer loop anyway so I guess your call..
>
> We can change the outer loop and break from it if needed, that's not
> an issue.
>
>> I would just do:
>>
>> if (args->tp)
>>return;
>>
>> if find_tp already found the tracepoint.
>>
>> Tried to also create a duplicate tracepoint and I get:
>>  CC  init/version.o
>>  AR  init/built-in.o
>>  AR  built-in.o
>>  LD  vmlinux.o
>> block/blk-core.o:(__tracepoints+0x440): multiple definition of
>> `__tracepoint_sched_switch'
>> kernel/sched/core.o:(__tracepoints+0x440): first defined here
>> Makefile:1032: recipe for target 'vmlinux' failed
>> make: *** [vmlinux] Error 1
>
> Yeah, as I state in my changelog, I'm very well aware that the linker
> is able to catch those. This was the purpose of emitting a
> __tracepoint_##name symbol in the tracepoint definition macro. This
> WARN_ON_ONCE() is a redundant check for an invariant that we expect
> is provided by the linker. Given that it's not a fast path, I would
> prefer to keep this redundant check in place, given that an average
> factor 2 speedup on a slow path should really not be something we
> need to optimize for. IMHO, for slow paths, robustness is more important
> than speed (unless the slow path becomes so slow that it really affects
> the user).
>
> I envision that a way to break this invariant would be to compile a
> kernel object with gcc -fvisibility=hidden or such. I admit this is
> particular scenario is really far fetched, but it illustrates why I
> prefer to keep an extra safety net at runtime for linker-based
> validations when possible.
>
> If a faster tracepoint lookup function is needed, we should consider
> perfect hashing algorithms done post-build, but so far nobody has
> complained about speed of this lookup operation. Anyhow a factor 2 in
> the speed of this lookup should really not matter, right ?

Yes, I agree with all the great reasons. So lets do it your way then.

thanks,

- Joel


Re: [PATCH] staging: android: ashmem: Fix possible deadlock in ashmem_ioctl

2018-03-19 Thread Joel Fernandes (Google)
On Tue, Feb 27, 2018 at 10:59 PM, Yisheng Xie  wrote:
> ashmem_mutex may create a chain of dependencies like:
>
> CPU0CPU1
>  mmap syscall   ioctl syscall
>  -> mmap_sem (acquired) -> ashmem_ioctl
>  -> ashmem_mmap-> ashmem_mutex (acquired)
> -> ashmem_mutex (try to acquire)   -> copy_from_user
>   -> mmap_sem (try to acquire)
>
> There is a lock ordering problem between mmap_sem and ashmem_mutex causing
> a lockdep splat[1] during a syzcaller test. This patch fixes the problem
> by moving copy_from_user out of ashmem_mutex.
>
> [1] https://www.spinics.net/lists/kernel/msg2733200.html
>
> Fixes: ce8a3a9e76d0 (staging: android: ashmem: Fix a race condition in pin 
> ioctls)
> Reported-by: syzbot+d7a918a7a8e1c952b...@syzkaller.appspotmail.com
> Signed-off-by: Yisheng Xie 

Greg,
Could you take this patch for the stable trees? I do see it in staging
already. I couldn't find it in stable so wanted to bring it to your
attention. If you are already aware of it, please ignore my note.

Thanks,
- Joel


[PATCH RFC] mm: Add an fs-write seal to memfd

2018-10-05 Thread Joel Fernandes (Google)
Android uses ashmem for sharing memory regions. We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it. Note staging drivers
are also not ABI and generally can be removed at anytime.

One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then drop its protection for "future" writes
while keeping the existing already mmap'ed writeable-region active.
This allows us to implement a usecase where receivers of the shared
memory buffer can get a read-only view, while the sender continues to
write to the buffer. See CursorWindow in Android for more details:
https://developer.android.com/reference/android/database/CursorWindow

This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which
prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active. The following program shows the seal
working in action:

int main() {
int ret, fd;
void *addr, *addr2, *addr3, *addr1;
ret = memfd_create_region("test_region", REGION_SIZE);
printf("ret=%d\n", ret);
fd = ret;

// Create map
addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr == MAP_FAILED)
printf("map 0 failed\n");
else
printf("map 0 passed\n");

if ((ret = write(fd, "test", 4)) != 4)
printf("write failed even though no fs-write seal "
   "(ret=%d errno =%d)\n", ret, errno);
else
printf("write passed\n");

addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr1 == MAP_FAILED)
perror("map 1 prot-write failed even though no seal\n");
else
printf("map 1 prot-write passed as expected\n");

ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FS_WRITE);
if (ret == -1)
printf("fcntl failed, errno: %d\n", errno);
else
printf("fs-write seal now active\n");

if ((ret = write(fd, "test", 4)) != 4)
printf("write failed as expected due to fs-write seal\n");
else
printf("write passed (unexpected)\n");

addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr2 == MAP_FAILED)
perror("map 2 prot-write failed as expected due to seal\n");
else
printf("map 2 passed\n");

addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0);
if (addr3 == MAP_FAILED)
perror("map 3 failed\n");
else
printf("map 3 prot-read passed as expected\n");
}

The output of running this program is as follows:
ret=3
map 0 passed
write passed
map 1 prot-write passed as expected
fs-write seal now active
write failed as expected due to fs-write seal
map 2 prot-write failed as expected due to seal
: Permission denied
map 3 prot-read passed as expected

Cc: jr...@google.com
Cc: john.stu...@linaro.org
Cc: tk...@google.com
Cc: gre...@linuxfoundation.org
Signed-off-by: Joel Fernandes (Google) 
---
 include/uapi/linux/fcntl.h | 1 +
 mm/memfd.c | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..bafd1233b6a8 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -41,6 +41,7 @@
 #define F_SEAL_SHRINK  0x0002  /* prevent file from shrinking */
 #define F_SEAL_GROW0x0004  /* prevent file from growing */
 #define F_SEAL_WRITE   0x0008  /* prevent writes */
+#define F_SEAL_FS_WRITE0x0010  /* prevent all write-related syscalls */
 /* (1U << 31) is reserved for signed error codes */
 
 /*
diff --git a/mm/memfd.c b/mm/memfd.c
index 2bb5e257080e..4b09f446c2d2 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 #define F_ALL_SEALS (F_SEAL_SEAL | \
 F_SEAL_SHRINK | \
 F_SEAL_GROW | \
-F_SEAL_WRITE)
+F_SEAL_WRITE | \
+F_SEAL_FS_WRITE)
 
 static int memfd_add_seals(struct file *file, unsigned int seals)
 {
@@ -219,6 +220,9 @@ static int memfd_add_seals(struct file *file, unsigned int 
seals)
}
}
 
+   if ((seals & F_SEAL_FS_WRITE) && !(*file_seals & F_SEAL_FS_WRITE))
+   file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
+
*file_seals |= seals;
error = 0;
 
-- 
2.19.0.605.g01d371f741-goog



[PATCH] doc: rcu: Fix code listing in performance and scalability requirements

2018-10-14 Thread Joel Fernandes (Google)
The code listing under this section has a quick quiz that says line 19
uses rcu_access_pointer, but the code listing itself does not. Fix this.
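
For context, that listing only checks the pointer for NULL while holding
gp_lock and never dereferences it, which is exactly the case
rcu_access_pointer() is meant for. A condensed sketch of the pattern
(reusing the gp, gp_lock and remove_gp_cb names from that section of the
document, not quoting the listing verbatim):

	spin_lock(&gp_lock);
	p = rcu_access_pointer(gp);	/* NULL check only, no dereference */
	if (!p) {
		spin_unlock(&gp_lock);
		return false;
	}
	rcu_assign_pointer(gp, NULL);
	call_rcu(&p->rh, remove_gp_cb);	/* free deferred until readers finish */
	spin_unlock(&gp_lock);
	return true;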

Signed-off-by: Joel Fernandes (Google) 
---
 .../RCU/Design/Requirements/Requirements.html|  2 +-
 kernel/sys.c | 16 
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/Documentation/RCU/Design/Requirements/Requirements.html 
b/Documentation/RCU/Design/Requirements/Requirements.html
index 4fae55056c1d..f74a2233865c 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -1596,7 +1596,7 @@ used in place of synchronize_rcu() as follows:
 16   struct foo *p;
 17
 18   spin_lock(gp_lock);
-19   p = rcu_dereference(gp);
+19   p = rcu_access_pointer(gp);
 20   if (!p) {
 21 spin_unlock(gp_lock);
 22 return false;
-- 
2.19.0.605.g01d371f741-goog



[PATCH for-stable] dmaengine: stm32-dma: fix incomplete configuration in cyclic mode

2018-10-16 Thread Joel Fernandes (Google)
From: Pierre Yves MORDRET 

commit e57cb3b3f10d005410f09d4598cc6d62b833f2b0 upstream.

When in cyclic mode, the configuration is updated after having started the
DMA hardware (STM32_DMA_SCR_EN) leading to incomplete configuration of
SMxAR registers.

Signed-off-by: Pierre-Yves MORDRET 
Signed-off-by: Hugues Fruchet 
Signed-off-by: Vinod Koul 
---
 drivers/dma/stm32-dma.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c
index 4099948b6914..fae7de54f00a 100644
--- a/drivers/dma/stm32-dma.c
+++ b/drivers/dma/stm32-dma.c
@@ -441,6 +441,8 @@ static void stm32_dma_dump_reg(struct stm32_dma_chan *chan)
dev_dbg(chan2dev(chan), "SFCR:  0x%08x\n", sfcr);
 }
 
+static void stm32_dma_configure_next_sg(struct stm32_dma_chan *chan);
+
 static void stm32_dma_start_transfer(struct stm32_dma_chan *chan)
 {
struct stm32_dma_device *dmadev = stm32_dma_get_dev(chan);
@@ -483,6 +485,9 @@ static void stm32_dma_start_transfer(struct stm32_dma_chan 
*chan)
if (status)
stm32_dma_irq_clear(chan, status);
 
+   if (chan->desc->cyclic)
+   stm32_dma_configure_next_sg(chan);
+
stm32_dma_dump_reg(chan);
 
/* Start DMA */
@@ -576,8 +581,7 @@ static void stm32_dma_issue_pending(struct dma_chan *c)
if (vchan_issue_pending(&chan->vchan) && !chan->desc && !chan->busy) {
dev_dbg(chan2dev(chan), "vchan %p: issued\n", &chan->vchan);
stm32_dma_start_transfer(chan);
-   if (chan->desc->cyclic)
-   stm32_dma_configure_next_sg(chan);
+
}
spin_unlock_irqrestore(&chan->vchan.lock, flags);
 }
-- 
2.19.0.605.g01d371f741-goog



[PATCH v3 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-10-18 Thread Joel Fernandes (Google)
Android uses ashmem for sharing memory regions. We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it. Note staging drivers
are also not ABI and generally can be removed at anytime.

One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then add protection against making any
"future" writes while keeping the existing already mmap'ed
writeable-region active.  This allows us to implement a usecase where
receivers of the shared memory buffer can get a read-only view, while
the sender continues to write to the buffer.
See CursorWindow documentation in Android for more details:
https://developer.android.com/reference/android/database/CursorWindow

This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active. The following program shows the seal
working in action:

 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #define F_SEAL_FUTURE_WRITE 0x0010
 #define REGION_SIZE (5 * 1024 * 1024)

int memfd_create_region(const char *name, size_t size)
{
int ret;
int fd = syscall(__NR_memfd_create, name, MFD_ALLOW_SEALING);
if (fd < 0) return fd;
ret = ftruncate(fd, size);
if (ret < 0) { close(fd); return ret; }
return fd;
}

int main() {
int ret, fd;
void *addr, *addr2, *addr3, *addr1;
ret = memfd_create_region("test_region", REGION_SIZE);
printf("ret=%d\n", ret);
fd = ret;

// Create map
addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr == MAP_FAILED)
printf("map 0 failed\n");
else
printf("map 0 passed\n");

if ((ret = write(fd, "test", 4)) != 4)
printf("write failed even though no future-write seal "
   "(ret=%d errno =%d)\n", ret, errno);
else
printf("write passed\n");

addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr1 == MAP_FAILED)
perror("map 1 prot-write failed even though no seal\n");
else
printf("map 1 prot-write passed as expected\n");

ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE |
 F_SEAL_GROW |
 F_SEAL_SHRINK);
if (ret == -1)
printf("fcntl failed, errno: %d\n", errno);
else
printf("future-write seal now active\n");

if ((ret = write(fd, "test", 4)) != 4)
printf("write failed as expected due to future-write seal\n");
else
printf("write passed (unexpected)\n");

addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr2 == MAP_FAILED)
perror("map 2 prot-write failed as expected due to seal\n");
else
printf("map 2 passed\n");

addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0);
if (addr3 == MAP_FAILED)
perror("map 3 failed\n");
else
printf("map 3 prot-read passed as expected\n");
}

The output of running this program is as follows:
ret=3
map 0 passed
write passed
map 1 prot-write passed as expected
future-write seal now active
write failed as expected due to future-write seal
map 2 prot-write failed as expected due to seal
: Permission denied
map 3 prot-read passed as expected

Cc: jr...@google.com
Cc: john.stu...@linaro.org
Cc: tk...@google.com
Cc: gre...@linuxfoundation.org
Cc: h...@infradead.org
Reviewed-by: John Stultz 
Signed-off-by: Joel Fernandes (Google) 
---
v1->v2: No change, just added selftests to the series. manpages are
ready and I'll submit them once the patches are accepted.

v2->v3: Updated commit message to have more support code (John Stultz)
Renamed seal from F_SEAL_FS_WRITE to F_SEAL_FUTURE_WRITE
(Christoph Hellwig)
Allow for this seal only if grow/shrink seals are also
either previous set, or are requested along with this seal.
(Christoph Hellwig)
Added locking to synchronize access to file->f_mode.
(Christoph Hellwig)

 include/uapi/linux/fcntl.h |  1 +
 mm/memfd.c | 22 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..a2f8658f1c55 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -41,6 +41,7
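
In short, the mm/memfd.c side of this patch does roughly the following
inside memfd_add_seals() (a sketch based on the changelog above, not the
literal diff):

	if ((seals & F_SEAL_FUTURE_WRITE) &&
	    !(*file_seals & F_SEAL_FUTURE_WRITE)) {
		/*
		 * Require F_SEAL_GROW and F_SEAL_SHRINK to be already set,
		 * or to be requested together with this seal.
		 */
		if (((seals | *file_seals) &
		     (F_SEAL_GROW | F_SEAL_SHRINK)) !=
		    (F_SEAL_GROW | F_SEAL_SHRINK)) {
			error = -EINVAL;
			goto unlock;	/* memfd_add_seals()'s error path */
		}

		/* Block future write(2)/pwrite(2) on this file. */
		spin_lock(&file->f_lock);
		file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
		spin_unlock(&file->f_lock);
	}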

[PATCH v3 2/2] selftests/memfd: Add tests for F_SEAL_FUTURE_WRITE seal

2018-10-18 Thread Joel Fernandes (Google)
Add tests to verify that sealing memfds with the F_SEAL_FUTURE_WRITE
seal works as expected.

Cc: dan...@google.com
Cc: minc...@kernel.org
Reviewed-by: John Stultz 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/memfd/memfd_test.c | 74 ++
 1 file changed, 74 insertions(+)

diff --git a/tools/testing/selftests/memfd/memfd_test.c 
b/tools/testing/selftests/memfd/memfd_test.c
index 10baa1652fc2..32b207ca7372 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -692,6 +692,79 @@ static void test_seal_write(void)
close(fd);
 }
 
+/*
+ * Test SEAL_FUTURE_WRITE
+ * Test whether SEAL_FUTURE_WRITE actually prevents modifications.
+ */
+static void test_seal_future_write(void)
+{
+   int fd;
+   void *p;
+
+   printf("%s SEAL-FUTURE-WRITE\n", memfd_str);
+
+   fd = mfd_assert_new("kern_memfd_seal_future_write",
+   mfd_def_size,
+   MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+   p = mfd_assert_mmap_shared(fd);
+
+   mfd_assert_has_seals(fd, 0);
+   /* Not adding grow/shrink seals makes the future write
+* seal fail to get added
+*/
+   mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE);
+
+   mfd_assert_add_seals(fd, F_SEAL_GROW);
+   mfd_assert_has_seals(fd, F_SEAL_GROW);
+
+   /* Should still fail since shrink seal has
+* not yet been added
+*/
+   mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE);
+
+   mfd_assert_add_seals(fd, F_SEAL_SHRINK);
+   mfd_assert_has_seals(fd, F_SEAL_GROW |
+F_SEAL_SHRINK);
+
+   /* Now should succeed, also verifies that the seal
+* could be added with an existing writable mmap
+*/
+   mfd_assert_add_seals(fd, F_SEAL_FUTURE_WRITE);
+   mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+F_SEAL_GROW |
+F_SEAL_FUTURE_WRITE);
+
+   /* read should pass, writes should fail */
+   mfd_assert_read(fd);
+   mfd_fail_write(fd);
+
+   munmap(p, mfd_def_size);
+   close(fd);
+
+   /* Test adding all seals (grow, shrink, future write) at once */
+   fd = mfd_assert_new("kern_memfd_seal_future_write2",
+   mfd_def_size,
+   MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+   p = mfd_assert_mmap_shared(fd);
+
+   mfd_assert_has_seals(fd, 0);
+   mfd_assert_add_seals(fd, F_SEAL_SHRINK |
+F_SEAL_GROW |
+F_SEAL_FUTURE_WRITE);
+   mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+F_SEAL_GROW |
+F_SEAL_FUTURE_WRITE);
+
+   /* read should pass, writes should fail */
+   mfd_assert_read(fd);
+   mfd_fail_write(fd);
+
+   munmap(p, mfd_def_size);
+   close(fd);
+}
+
 /*
  * Test SEAL_SHRINK
  * Test whether SEAL_SHRINK actually prevents shrinking
@@ -945,6 +1018,7 @@ int main(int argc, char **argv)
test_basic();
 
test_seal_write();
+   test_seal_future_write();
test_seal_shrink();
test_seal_grow();
test_seal_resize();
-- 
2.19.1.331.ge82ca0e54c-goog



[PATCH] doc: correct parameter in stallwarn

2018-10-29 Thread Joel Fernandes (Google)
The stallwarn document incorrectly mentions 'fps=' instead of 'fqs='.
Correct that.

Signed-off-by: Joel Fernandes (Google) 
---
 Documentation/RCU/stallwarn.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index b01bcafc64aa..073dbc12d1ea 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -205,7 +205,7 @@ handlers are no longer able to execute on this CPU.  This 
can happen if
 the stalled CPU is spinning with interrupts are disabled, or, in -rt
 kernels, if a high-priority process is starving RCU's softirq handler.
 
-The "fps=" shows the number of force-quiescent-state idle/offline
+The "fqs=" shows the number of force-quiescent-state idle/offline
 detection passes that the grace-period kthread has made across this
 CPU since the last time that this CPU noted the beginning of a grace
 period.
-- 
2.19.1.568.g152ad8e336-goog



[RFC 1/6] pstore: map pstore types to names

2018-10-26 Thread Joel Fernandes (Google)
In later patches we will need to map types to names, so create a table
for that which can also be used and reused in different parts of old and
new code. Also use it to save the type in the PRZ which will be useful
in later patches.
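
As a rough sketch, the mapping looks something like this (the real table
added by this patch lives in include/linux/pstore.h and covers every
pstore_type_id; the entries here are abbreviated):

static const char * const pstore_names[] = {
	[PSTORE_TYPE_DMESG]	= "dmesg",
	[PSTORE_TYPE_CONSOLE]	= "console",
	[PSTORE_TYPE_FTRACE]	= "ftrace",
	[PSTORE_TYPE_PMSG]	= "pmsg",
	/* ... one entry per enum pstore_type_id ... */
};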

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/inode.c  | 44 --
 fs/pstore/ram.c|  4 +++-
 include/linux/pstore.h | 29 +
 include/linux/pstore_ram.h |  2 ++
 4 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index 5fcb845b9fec..43757049d384 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -304,6 +304,7 @@ int pstore_mkfile(struct dentry *root, struct pstore_record 
*record)
struct dentry   *dentry;
struct inode*inode;
int rc = 0;
+   enum pstore_type_id type;
charname[PSTORE_NAMELEN];
struct pstore_private   *private, *pos;
unsigned long   flags;
@@ -335,43 +336,44 @@ int pstore_mkfile(struct dentry *root, struct 
pstore_record *record)
goto fail_alloc;
private->record = record;
 
-   switch (record->type) {
+   type = record->type;
+   switch (type) {
case PSTORE_TYPE_DMESG:
-   scnprintf(name, sizeof(name), "dmesg-%s-%llu%s",
- record->psi->name, record->id,
- record->compressed ? ".enc.z" : "");
+   scnprintf(name, sizeof(name), "%s-%s-%llu%s",
+pstore_names[type], record->psi->name, record->id,
+record->compressed ? ".enc.z" : "");
break;
case PSTORE_TYPE_CONSOLE:
-   scnprintf(name, sizeof(name), "console-%s-%llu",
- record->psi->name, record->id);
+   scnprintf(name, sizeof(name), "%s-%s-%llu",
+ pstore_names[type], record->psi->name, record->id);
break;
case PSTORE_TYPE_FTRACE:
-   scnprintf(name, sizeof(name), "ftrace-%s-%llu",
- record->psi->name, record->id);
+   scnprintf(name, sizeof(name), "%s-%s-%llu",
+ pstore_names[type], record->psi->name, record->id);
break;
case PSTORE_TYPE_MCE:
-   scnprintf(name, sizeof(name), "mce-%s-%llu",
- record->psi->name, record->id);
+   scnprintf(name, sizeof(name), "%s-%s-%llu",
+ pstore_names[type], record->psi->name, record->id);
break;
case PSTORE_TYPE_PPC_RTAS:
-   scnprintf(name, sizeof(name), "rtas-%s-%llu",
- record->psi->name, record->id);
+   scnprintf(name, sizeof(name), "%s-%s-%llu",
+ pstore_names[type], record->psi->name, record->id);
break;
case PSTORE_TYPE_PPC_OF:
-   scnprintf(name, sizeof(name), "powerpc-ofw-%s-%llu",
- record->psi->name, record->id);
+   scnprintf(name, sizeof(name), "%s-%s-%llu",
+ pstore_names[type], record->psi->name, record->id);
break;
case PSTORE_TYPE_PPC_COMMON:
-   scnprintf(name, sizeof(name), "powerpc-common-%s-%llu",
- record->psi->name, record->id);
+   scnprintf(name, sizeof(name), "%s-%s-%llu",
+ pstore_names[type], record->psi->name, record->id);
break;
case PSTORE_TYPE_PMSG:
-   scnprintf(name, sizeof(name), "pmsg-%s-%llu",
- record->psi->name, record->id);
+   scnprintf(name, sizeof(name), "%s-%s-%llu",
+ pstore_names[type], record->psi->name, record->id);
break;
case PSTORE_TYPE_PPC_OPAL:
-   scnprintf(name, sizeof(name), "powerpc-opal-%s-%llu",
- record->psi->name, record->id);
+   scnprintf(name, sizeof(name), "%s-%s-%llu",
+ pstore_names[type], record->psi->name, record->id);
break;
case PSTORE_TYPE_UNKNOWN:
scnprintf(name, sizeof(name), "unknown-%s-%llu",
@@ -379,7 +381,7 @@ int pstore_mkfile(struct dentry *root, struct pstore_record 
*record)
break;
default:
scnprintf(name, sizeof(name), "type%d-

[RFC 5/6] pstore: donot treat empty buffers as valid

2018-10-26 Thread Joel Fernandes (Google)
pstore currently calls persistent_ram_save_old even if a buffer is
empty. While this appears to work, it is simply not the right thing to
do and could lead to bugs, so let's avoid that. It also prevents
misleading prints in the logs which claim the buffer is valid.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index 0792595ebcfb..1299aa3ea734 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -495,7 +495,7 @@ static int persistent_ram_post_init(struct 
persistent_ram_zone *prz, u32 sig,
 
sig ^= PERSISTENT_RAM_SIG;
 
-   if (prz->buffer->sig == sig) {
+   if (prz->buffer->sig == sig && buffer_size(prz)) {
if (buffer_size(prz) > prz->buffer_size ||
buffer_start(prz) > buffer_size(prz))
pr_info("found existing invalid buffer, size %zu, start 
%zu\n",
-- 
2.19.1.568.g152ad8e336-goog



[RFC 2/6] pstore: remove type argument from ramoops_get_next_prz

2018-10-26 Thread Joel Fernandes (Google)
Since we store the type of the prz when we initialize it, we no longer
need to pass it again to ramoops_get_next_prz since we can just use that
to set up the pstore record. So let's remove it from the argument list.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram.c | 20 +++-
 1 file changed, 7 insertions(+), 13 deletions(-)

diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index c7cd858adce7..cbfdf4b8e89d 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -125,9 +125,7 @@ static int ramoops_pstore_open(struct pstore_info *psi)
 
 static struct persistent_ram_zone *
 ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max,
-u64 *id,
-enum pstore_type_id *typep, enum pstore_type_id type,
-bool update)
+u64 *id, enum pstore_type_id *typep, bool update)
 {
struct persistent_ram_zone *prz;
int i = (*c)++;
@@ -147,7 +145,7 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], 
uint *c, uint max,
if (!persistent_ram_old_size(prz))
return NULL;
 
-   *typep = type;
+   *typep = prz->type;
*id = i;
 
return prz;
@@ -257,8 +255,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) {
prz = ramoops_get_next_prz(cxt->dprzs, >dump_read_cnt,
   cxt->max_dump_cnt, >id,
-  >type,
-  PSTORE_TYPE_DMESG, 1);
+  >type, 1);
if (!prz_ok(prz))
continue;
header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz),
@@ -274,20 +271,18 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(>cprz, >console_read_cnt,
-  1, >id, >type,
-  PSTORE_TYPE_CONSOLE, 0);
+  1, >id, >type, 0);
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(>mprz, >pmsg_read_cnt,
-  1, >id, >type,
-  PSTORE_TYPE_PMSG, 0);
+  1, >id, >type, 0);
 
/* ftrace is last since it may want to dynamically allocate memory. */
if (!prz_ok(prz)) {
if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) {
prz = ramoops_get_next_prz(cxt->fprzs,
>ftrace_read_cnt, 1, >id,
-   >type, PSTORE_TYPE_FTRACE, 0);
+   >type, 0);
} else {
/*
 * Build a new dummy record which combines all the
@@ -306,8 +301,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
>ftrace_read_cnt,
cxt->max_ftrace_cnt,
>id,
-   >type,
-   PSTORE_TYPE_FTRACE, 0);
+   >type, 0);
 
if (!prz_ok(prz_next))
continue;
-- 
2.19.1.568.g152ad8e336-goog



[RFC 4/6] pstore: further reduce ramoops_get_next_prz arguments by passing record

2018-10-26 Thread Joel Fernandes (Google)
Both the id and type fields of a pstore_record are set by
ramoops_get_next_prz, so we can just pass a pointer to the pstore_record
instead of passing the individual elements. This results in cleaner, more
readable code and fewer lines.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index 3055e05acab1..710c3d30bac0 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -125,7 +125,7 @@ static int ramoops_pstore_open(struct pstore_info *psi)
 
 static struct persistent_ram_zone *
 ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c,
-u64 *id, enum pstore_type_id *typep, bool update)
+struct pstore_record *record, bool update)
 {
struct persistent_ram_zone *prz;
int i = (*c)++;
@@ -145,8 +145,8 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], 
uint *c,
if (!persistent_ram_old_size(prz))
return NULL;
 
-   *typep = prz->type;
-   *id = i;
+   record->type = prz->type;
+   record->id = i;
 
return prz;
 }
@@ -254,7 +254,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
/* Find the next valid persistent_ram_zone for DMESG */
while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) {
prz = ramoops_get_next_prz(cxt->dprzs, >dump_read_cnt,
-  >id, >type, 1);
+  record, 1);
if (!prz_ok(prz))
continue;
header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz),
@@ -270,18 +270,17 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(>cprz, >console_read_cnt,
-  >id, >type, 0);
+  record, 0);
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(>mprz, >pmsg_read_cnt,
-  >id, >type, 0);
+  record, 0);
 
/* ftrace is last since it may want to dynamically allocate memory. */
if (!prz_ok(prz)) {
if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) {
prz = ramoops_get_next_prz(cxt->fprzs,
-   >ftrace_read_cnt, >id,
-   >type, 0);
+   >ftrace_read_cnt, record, 0);
} else {
/*
 * Build a new dummy record which combines all the
@@ -298,8 +297,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
while (cxt->ftrace_read_cnt < cxt->max_ftrace_cnt) {
prz_next = ramoops_get_next_prz(cxt->fprzs,
>ftrace_read_cnt,
-   >id,
-   >type, 0);
+   record, 0);
 
if (!prz_ok(prz_next))
continue;
-- 
2.19.1.568.g152ad8e336-goog



[RFC 3/6] pstore: remove max argument from ramoops_get_next_prz

2018-10-26 Thread Joel Fernandes (Google)
From the code flow, the 'max' checks are already being done on the prz
passed to ramoops_get_next_prz. Let's remove it to simplify this function
and reduce its arguments.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram.c | 14 ++
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index cbfdf4b8e89d..3055e05acab1 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -124,14 +124,14 @@ static int ramoops_pstore_open(struct pstore_info *psi)
 }
 
 static struct persistent_ram_zone *
-ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max,
+ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c,
 u64 *id, enum pstore_type_id *typep, bool update)
 {
struct persistent_ram_zone *prz;
int i = (*c)++;
 
/* Give up if we never existed or have hit the end. */
-   if (!przs || i >= max)
+   if (!przs)
return NULL;
 
prz = przs[i];
@@ -254,8 +254,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
/* Find the next valid persistent_ram_zone for DMESG */
while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) {
prz = ramoops_get_next_prz(cxt->dprzs, >dump_read_cnt,
-  cxt->max_dump_cnt, >id,
-  >type, 1);
+  >id, >type, 1);
if (!prz_ok(prz))
continue;
header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz),
@@ -271,17 +270,17 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(>cprz, >console_read_cnt,
-  1, >id, >type, 0);
+  >id, >type, 0);
 
if (!prz_ok(prz))
prz = ramoops_get_next_prz(>mprz, >pmsg_read_cnt,
-  1, >id, >type, 0);
+  >id, >type, 0);
 
/* ftrace is last since it may want to dynamically allocate memory. */
if (!prz_ok(prz)) {
if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) {
prz = ramoops_get_next_prz(cxt->fprzs,
-   >ftrace_read_cnt, 1, >id,
+   >ftrace_read_cnt, >id,
>type, 0);
} else {
/*
@@ -299,7 +298,6 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
while (cxt->ftrace_read_cnt < cxt->max_ftrace_cnt) {
prz_next = ramoops_get_next_prz(cxt->fprzs,
>ftrace_read_cnt,
-   cxt->max_ftrace_cnt,
>id,
>type, 0);
 
-- 
2.19.1.568.g152ad8e336-goog



[RFC 6/6] Revert "pstore/ram_core: Do not reset restored zone's position and size"

2018-10-26 Thread Joel Fernandes (Google)
This reverts commit 25b63da64708212985c06c7f8b089d356efdd9cf.

Due to the commit which is being reverted here, it is not possible to
know whether pstore's messages are from the previous boot or from very
old boots. This creates an awkward situation where it's unclear if crash
or other logs belong to the previous boot or to much older ones. Also,
typically we dump the pstore buffers after one reboot and are interested
only in the previous boot's crash, so let us reset the buffers after we
save them.

Lastly, if we don't zap them, then I think it is possible that part of
the buffer will be from this boot and other parts will be from previous
boots. So this revert fixes all of this by always calling
persistent_ram_zap.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram_core.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index 1299aa3ea734..67d74dd97da1 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -504,7 +504,6 @@ static int persistent_ram_post_init(struct 
persistent_ram_zone *prz, u32 sig,
pr_debug("found existing buffer, size %zu, start %zu\n",
 buffer_size(prz), buffer_start(prz));
persistent_ram_save_old(prz);
-   return 0;
}
} else {
pr_debug("no valid data in buffer (sig = 0x%08x)\n",
-- 
2.19.1.568.g152ad8e336-goog



[RFC] rcu: doc: update example about stale data

2018-10-27 Thread Joel Fernandes (Google)
The RCU example for 'rejecting stale data' in system-call auditing
stops iterating through the rules if a deleted one is found. It makes
more sense to continue looking at the other rules once a deleted one is
rejected. Although the original example is fine, this makes it more
meaningful.

Signed-off-by: Joel Fernandes (Google) 
---
 Documentation/RCU/listRCU.txt | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt
index adb5a3782846..09e9a4fc723e 100644
--- a/Documentation/RCU/listRCU.txt
+++ b/Documentation/RCU/listRCU.txt
@@ -250,8 +250,7 @@ as follows:
spin_lock(&e->lock);
if (e->deleted) {
spin_unlock(&e->lock);
-   rcu_read_unlock();
-   return AUDIT_BUILD_CONTEXT;
+   continue;
}
rcu_read_unlock();
return state;
-- 
2.19.1.568.g152ad8e336-goog



[PATCH RFC v2 0/3] cleanups for pstore and ramoops

2018-11-03 Thread Joel Fernandes (Google)
Here are some simple cleanups and fixes for ramoops in pstore. Let me know
what you think, thanks.

Joel Fernandes (Google) (3):
pstore: map pstore types to names
pstore: simplify ramoops_get_next_prz arguments
pstore: donot treat empty buffers as valid

fs/pstore/inode.c  | 53 +-
fs/pstore/ram.c| 52 +++--
fs/pstore/ram_core.c   |  2 +-
include/linux/pstore.h | 37 ++
include/linux/pstore_ram.h |  2 ++
5 files changed, 67 insertions(+), 79 deletions(-)

--
2.19.1.930.g4563a0d9d0-goog



[PATCH RFC v2 1/3] pstore: map pstore types to names

2018-11-03 Thread Joel Fernandes (Google)
In later patches we will need to map types to names, so create a table
for that which can also be used and reused in different parts of old and
new code. Also use it to save the type in the PRZ which will be useful
in later patches.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/inode.c  | 53 +-
 fs/pstore/ram.c|  4 ++-
 include/linux/pstore.h | 37 ++
 include/linux/pstore_ram.h |  2 ++
 4 files changed, 48 insertions(+), 48 deletions(-)

diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index 8cf2218b46a7..c5c6b8b4b70a 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -304,6 +304,7 @@ int pstore_mkfile(struct dentry *root, struct pstore_record 
*record)
struct dentry   *dentry;
struct inode*inode;
int rc = 0;
+   enum pstore_type_id type;
charname[PSTORE_NAMELEN];
struct pstore_private   *private, *pos;
unsigned long   flags;
@@ -335,53 +336,11 @@ int pstore_mkfile(struct dentry *root, struct 
pstore_record *record)
goto fail_alloc;
private->record = record;
 
-   switch (record->type) {
-   case PSTORE_TYPE_DMESG:
-   scnprintf(name, sizeof(name), "dmesg-%s-%llu%s",
- record->psi->name, record->id,
- record->compressed ? ".enc.z" : "");
-   break;
-   case PSTORE_TYPE_CONSOLE:
-   scnprintf(name, sizeof(name), "console-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_FTRACE:
-   scnprintf(name, sizeof(name), "ftrace-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_MCE:
-   scnprintf(name, sizeof(name), "mce-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_RTAS:
-   scnprintf(name, sizeof(name), "rtas-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_OF:
-   scnprintf(name, sizeof(name), "powerpc-ofw-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_COMMON:
-   scnprintf(name, sizeof(name), "powerpc-common-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PMSG:
-   scnprintf(name, sizeof(name), "pmsg-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_PPC_OPAL:
-   scnprintf(name, sizeof(name), "powerpc-opal-%s-%llu",
- record->psi->name, record->id);
-   break;
-   case PSTORE_TYPE_UNKNOWN:
-   scnprintf(name, sizeof(name), "unknown-%s-%llu",
- record->psi->name, record->id);
-   break;
-   default:
-   scnprintf(name, sizeof(name), "type%d-%s-%llu",
- record->type, record->psi->name, record->id);
-   break;
-   }
+   scnprintf(name, sizeof(name), "%s-%s-%llu%s",
+   pstore_type_to_name(record->type),
+   record->psi->name, record->id,
+   (record->type == PSTORE_TYPE_DMESG
+&& record->compressed) ? ".enc.z" : "");
 
dentry = d_alloc_name(root, name);
if (!dentry)
diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index 10ac4d23c423..b174d0fc009f 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -611,6 +611,7 @@ static int ramoops_init_przs(const char *name,
goto fail;
}
*paddr += zone_sz;
+   prz_ar[i]->type = pstore_name_to_type(name);
}
 
*przs = prz_ar;
@@ -650,6 +651,7 @@ static int ramoops_init_prz(const char *name,
}
 
*paddr += sz;
+   (*prz)->type = pstore_name_to_type(name);
 
return 0;
 }
@@ -785,7 +787,7 @@ static int ramoops_probe(struct platform_device *pdev)
 
dump_mem_sz = cxt->size - cxt->console_size - cxt->ftrace_size
- cxt->pmsg_size;
-   err = ramoops_init_przs("dump", dev, cxt, >dprzs, ,
+   err = ramoops_init_przs("dmesg", dev, cxt, >dprzs, ,
dump_mem_sz, cxt->record_size,
>

[PATCH RFC v2 3/3] pstore: donot treat empty buffers as valid

2018-11-03 Thread Joel Fernandes (Google)
pstore currently calls persistent_ram_save_old even if a buffer is
empty. While this appears to work, it does not seem like the right
thing to do and could lead to future bugs, so let's avoid that. It also
prevents misleading prints in the logs which claim the buffer is valid.

I got something like:
found existing buffer, size 0, start 0

When I was expecting:
no valid data in buffer (sig = ...)

Signed-off-by: Joel Fernandes (Google) 
---
Note that if you feel this patch is not necessary, then feel free to
drop it. I would say it is harmless and is a good clean up.

 fs/pstore/ram_core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c
index e6375439c5ac..196e4fd7ba8c 100644
--- a/fs/pstore/ram_core.c
+++ b/fs/pstore/ram_core.c
@@ -510,7 +510,7 @@ static int persistent_ram_post_init(struct 
persistent_ram_zone *prz, u32 sig,
 
sig ^= PERSISTENT_RAM_SIG;
 
-   if (prz->buffer->sig == sig) {
+   if (prz->buffer->sig == sig && buffer_size(prz)) {
if (buffer_size(prz) > prz->buffer_size ||
buffer_start(prz) > buffer_size(prz)) {
pr_info("found existing invalid buffer, size %zu, start 
%zu\n",
-- 
2.19.1.930.g4563a0d9d0-goog


[PATCH RFC v2 2/3] pstore: simplify ramoops_get_next_prz arguments

2018-11-03 Thread Joel Fernandes (Google)
(1) remove type argument from ramoops_get_next_prz

Since we store the type of the prz when we initialize it, we no longer
need to pass it again to ramoops_get_next_prz since we can just use that
to set up the pstore record. So let's remove it from the argument list.

(2) remove max argument from ramoops_get_next_prz

From the code flow, the 'max' checks are already being done on the prz
passed to ramoops_get_next_prz. Let's remove it to simplify this function
and reduce its arguments.

(3) further reduce ramoops_get_next_prz arguments by passing record

Both the id and type fields of a pstore_record are set by
ramoops_get_next_prz. So we can just pass a pointer to the pstore_record
instead of passing individual elements. This results in cleaner, more
readable code and fewer lines.

In addition, let's also remove the 'update' argument since we can detect
that. Changes are squashed into a single patch to reduce fixup conflicts.

Signed-off-by: Joel Fernandes (Google) 
---
 fs/pstore/ram.c | 48 ++--
 1 file changed, 18 insertions(+), 30 deletions(-)

diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c
index b174d0fc009f..202eaa82bcc6 100644
--- a/fs/pstore/ram.c
+++ b/fs/pstore/ram.c
@@ -124,19 +124,17 @@ static int ramoops_pstore_open(struct pstore_info *psi)
 }
 
 static struct persistent_ram_zone *
-ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max,
-u64 *id,
-enum pstore_type_id *typep, enum pstore_type_id type,
-bool update)
+ramoops_get_next_prz(struct persistent_ram_zone *przs[], int id,
+struct pstore_record *record)
 {
struct persistent_ram_zone *prz;
-   int i = (*c)++;
+   bool update = (record->type == PSTORE_TYPE_DMESG);
 
/* Give up if we never existed or have hit the end. */
-   if (!przs || i >= max)
+   if (!przs)
return NULL;
 
-   prz = przs[i];
+   prz = przs[id];
if (!prz)
return NULL;
 
@@ -147,8 +145,8 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], 
uint *c, uint max,
if (!persistent_ram_old_size(prz))
return NULL;
 
-   *typep = type;
-   *id = i;
+   record->type = prz->type;
+   record->id = id;
 
return prz;
 }
@@ -255,10 +253,8 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
 
/* Find the next valid persistent_ram_zone for DMESG */
while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) {
-   prz = ramoops_get_next_prz(cxt->dprzs, &cxt->dump_read_cnt,
-  cxt->max_dump_cnt, &record->id,
-  &record->type,
-  PSTORE_TYPE_DMESG, 1);
+   prz = ramoops_get_next_prz(cxt->dprzs, cxt->dump_read_cnt++,
+  record);
if (!prz_ok(prz))
continue;
header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz),
@@ -272,22 +268,18 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
}
}
 
-   if (!prz_ok(prz))
-   prz = ramoops_get_next_prz(&cxt->cprz, &cxt->console_read_cnt,
-  1, &record->id, &record->type,
-  PSTORE_TYPE_CONSOLE, 0);
+   if (!prz_ok(prz) && !cxt->console_read_cnt++)
+   prz = ramoops_get_next_prz(&cxt->cprz, 0 /* single */, record);
 
-   if (!prz_ok(prz))
-   prz = ramoops_get_next_prz(&cxt->mprz, &cxt->pmsg_read_cnt,
-  1, &record->id, &record->type,
-  PSTORE_TYPE_PMSG, 0);
+   if (!prz_ok(prz) && !cxt->pmsg_read_cnt++)
+   prz = ramoops_get_next_prz(&cxt->mprz, 0 /* single */, record);
 
/* ftrace is last since it may want to dynamically allocate memory. */
if (!prz_ok(prz)) {
-   if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) {
-   prz = ramoops_get_next_prz(cxt->fprzs,
-   &cxt->ftrace_read_cnt, 1, &record->id,
-   &record->type, PSTORE_TYPE_FTRACE, 0);
+   if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU) &&
+   !cxt->ftrace_read_cnt++) {
+   prz = ramoops_get_next_prz(cxt->fprzs, 0 /* single */,
+  record);
} else {
/*
 * Build a new dummy record which combines all the
@@ -303,11 +295,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record 
*record)
 
while (cxt->ftrace_read_cnt < cxt->m

[RFC] doc: rcu: remove note on smp_mb during synchronize_rcu

2018-10-27 Thread Joel Fernandes (Google)
As per this thread [1], it seems this smp_mb isn't needed anymore:
"So the smp_mb() that I was trying to add doesn't need to be there."

So let us remove this part from the memory ordering documentation.

[1] https://lkml.org/lkml/2017/10/6/707

Signed-off-by: Joel Fernandes (Google) 
---
 .../Tree-RCU-Memory-Ordering.html | 32 +--
 1 file changed, 1 insertion(+), 31 deletions(-)

diff --git 
a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html 
b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
index a346ce0116eb..0fb1511763d4 100644
--- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
@@ -77,7 +77,7 @@ The key point is that the lock-acquisition functions, 
including
 smp_mb__after_unlock_lock() immediately after successful
 acquisition of the lock.
 
-Therefore, for any given rcu_node struction, any access
+Therefore, for any given rcu_node structure, any access
 happening before one of the above lock-release functions will be seen
 by all CPUs as happening before any access happening after a later
 one of the above lock-acquisition functions.
@@ -162,36 +162,6 @@ an atomic_add_return() of zero) to detect idle 
CPUs.
 
 
 
-The approach must be extended to handle one final case, that
-of waking a task blocked in synchronize_rcu().
-This task might be affinitied to a CPU that is not yet aware that
-the grace period has ended, and thus might not yet be subject to
-the grace period's memory ordering.
-Therefore, there is an smp_mb() after the return from
-wait_for_completion() in the synchronize_rcu()
-code path.
-
-
-
-Quick Quiz:
-
-   What?  Where???
-   I don't see any smp_mb() after the return from
-   wait_for_completion()!!!
-
-Answer:
-
-   That would be because I spotted the need for that
-   smp_mb() during the creation of this documentation,
-   and it is therefore unlikely to hit mainline before v4.14.
-   Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
-   Jonathan Cameron for asking questions that sensitized me
-   to the rather elaborate sequence of events that demonstrate
-   the need for this memory barrier.
-
-
-
-
 Tree RCU's grace--period memory-ordering guarantees rely most
 heavily on the rcu_node structure's -lock
 field, so much so that it is necessary to abbreviate this pattern
-- 
2.19.1.568.g152ad8e336-goog



[PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd

2018-11-07 Thread Joel Fernandes (Google)
Android uses ashmem for sharing memory regions. We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it. Note staging drivers
are also not ABI and generally can be removed at anytime.

One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then add protection against making any
"future" writes while keeping the existing already mmap'ed
writeable-region active.  This allows us to implement a usecase where
receivers of the shared memory buffer can get a read-only view, while
the sender continues to write to the buffer.
See CursorWindow documentation in Android for more details:
https://developer.android.com/reference/android/database/CursorWindow

This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal
which prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active. The following program shows the seal
working in action:

 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #include 
 #define F_SEAL_FUTURE_WRITE 0x0010
 #define REGION_SIZE (5 * 1024 * 1024)

int memfd_create_region(const char *name, size_t size)
{
int ret;
int fd = syscall(__NR_memfd_create, name, MFD_ALLOW_SEALING);
if (fd < 0) return fd;
ret = ftruncate(fd, size);
if (ret < 0) { close(fd); return ret; }
return fd;
}

int main() {
int ret, fd;
void *addr, *addr2, *addr3, *addr1;
ret = memfd_create_region("test_region", REGION_SIZE);
printf("ret=%d\n", ret);
fd = ret;

// Create map
addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr == MAP_FAILED)
printf("map 0 failed\n");
else
printf("map 0 passed\n");

if ((ret = write(fd, "test", 4)) != 4)
printf("write failed even though no future-write seal "
   "(ret=%d errno =%d)\n", ret, errno);
else
printf("write passed\n");

addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr1 == MAP_FAILED)
perror("map 1 prot-write failed even though no seal\n");
else
printf("map 1 prot-write passed as expected\n");

ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE |
 F_SEAL_GROW |
 F_SEAL_SHRINK);
if (ret == -1)
printf("fcntl failed, errno: %d\n", errno);
else
printf("future-write seal now active\n");

if ((ret = write(fd, "test", 4)) != 4)
printf("write failed as expected due to future-write seal\n");
else
printf("write passed (unexpected)\n");

addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr2 == MAP_FAILED)
perror("map 2 prot-write failed as expected due to seal\n");
else
printf("map 2 passed\n");

addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0);
if (addr3 == MAP_FAILED)
perror("map 3 failed\n");
else
printf("map 3 prot-read passed as expected\n");
}

The output of running this program is as follows:
ret=3
map 0 passed
write passed
map 1 prot-write passed as expected
future-write seal now active
write failed as expected due to future-write seal
map 2 prot-write failed as expected due to seal
: Permission denied
map 3 prot-read passed as expected

Cc: jr...@google.com
Cc: john.stu...@linaro.org
Cc: tk...@google.com
Cc: gre...@linuxfoundation.org
Cc: h...@infradead.org
Reviewed-by: John Stultz 
Signed-off-by: Joel Fernandes (Google) 
---
v1->v2: No change, just added selftests to the series. manpages are
ready and I'll submit them once the patches are accepted.

v2->v3: Updated commit message to have more support code (John Stultz)
Renamed seal from F_SEAL_FS_WRITE to F_SEAL_FUTURE_WRITE
(Christoph Hellwig)
Allow for this seal only if grow/shrink seals are also
either previously set, or requested along with this seal.
(Christoph Hellwig)
Added locking to synchronize access to file->f_mode.
(Christoph Hellwig)

 include/uapi/linux/fcntl.h |  1 +
 mm/memfd.c | 22 +-
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 6448cdd9a350..a2f8658f1c55 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -41,6 +41,7

[PATCH v3 resend 2/2] selftests/memfd: Add tests for F_SEAL_FUTURE_WRITE seal

2018-11-07 Thread Joel Fernandes (Google)
Add tests to verify sealing memfds with the F_SEAL_FUTURE_WRITE works as
expected.

Cc: dan...@google.com
Cc: minc...@kernel.org
Reviewed-by: John Stultz 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/memfd/memfd_test.c | 74 ++
 1 file changed, 74 insertions(+)

diff --git a/tools/testing/selftests/memfd/memfd_test.c 
b/tools/testing/selftests/memfd/memfd_test.c
index 10baa1652fc2..32b207ca7372 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -692,6 +692,79 @@ static void test_seal_write(void)
close(fd);
 }
 
+/*
+ * Test SEAL_FUTURE_WRITE
+ * Test whether SEAL_FUTURE_WRITE actually prevents modifications.
+ */
+static void test_seal_future_write(void)
+{
+   int fd;
+   void *p;
+
+   printf("%s SEAL-FUTURE-WRITE\n", memfd_str);
+
+   fd = mfd_assert_new("kern_memfd_seal_future_write",
+   mfd_def_size,
+   MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+   p = mfd_assert_mmap_shared(fd);
+
+   mfd_assert_has_seals(fd, 0);
+   /* Not adding grow/shrink seals makes the future write
+* seal fail to get added
+*/
+   mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE);
+
+   mfd_assert_add_seals(fd, F_SEAL_GROW);
+   mfd_assert_has_seals(fd, F_SEAL_GROW);
+
+   /* Should still fail since shrink seal has
+* not yet been added
+*/
+   mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE);
+
+   mfd_assert_add_seals(fd, F_SEAL_SHRINK);
+   mfd_assert_has_seals(fd, F_SEAL_GROW |
+F_SEAL_SHRINK);
+
+   /* Now should succeed, also verifies that the seal
+* could be added with an existing writable mmap
+*/
+   mfd_assert_add_seals(fd, F_SEAL_FUTURE_WRITE);
+   mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+F_SEAL_GROW |
+F_SEAL_FUTURE_WRITE);
+
+   /* read should pass, writes should fail */
+   mfd_assert_read(fd);
+   mfd_fail_write(fd);
+
+   munmap(p, mfd_def_size);
+   close(fd);
+
+   /* Test adding all seals (grow, shrink, future write) at once */
+   fd = mfd_assert_new("kern_memfd_seal_future_write2",
+   mfd_def_size,
+   MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+   p = mfd_assert_mmap_shared(fd);
+
+   mfd_assert_has_seals(fd, 0);
+   mfd_assert_add_seals(fd, F_SEAL_SHRINK |
+F_SEAL_GROW |
+F_SEAL_FUTURE_WRITE);
+   mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+F_SEAL_GROW |
+F_SEAL_FUTURE_WRITE);
+
+   /* read should pass, writes should fail */
+   mfd_assert_read(fd);
+   mfd_fail_write(fd);
+
+   munmap(p, mfd_def_size);
+   close(fd);
+}
+
 /*
  * Test SEAL_SHRINK
  * Test whether SEAL_SHRINK actually prevents shrinking
@@ -945,6 +1018,7 @@ int main(int argc, char **argv)
test_basic();
 
test_seal_write();
+   test_seal_future_write();
test_seal_shrink();
test_seal_grow();
test_seal_resize();
-- 
2.19.1.930.g4563a0d9d0-goog



[PATCH] MAINTAINERS: Add me to Android drivers

2018-10-04 Thread Joel Fernandes (Google)
I am one of the main engineers working on ashmem. I have been fixing
bugs in the driver and have been involved in the memfd conversion
discussions and sending patches about that. I also have an understanding
of the binder driver and was involved with some of the development on
finer grained locking. So I would like to be added to the MAINTAINERS
file for android drivers for review and maintenance of ashmem and other
Android drivers.

Cc: tk...@google.com
Cc: gre...@linuxfoundation.org
Signed-off-by: Joel Fernandes (Google) 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 544cac829cf4..d639c4d04438 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -894,6 +894,7 @@ M:  Greg Kroah-Hartman 
 M: Arve Hjønnevåg 
 M: Todd Kjos 
 M: Martijn Coenen 
+M: Joel Fernandes 
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git
 L: de...@driverdev.osuosl.org
 S: Supported
-- 
2.19.0.605.g01d371f741-goog



[PATCH 2/7] dmaengine: stm32-dma: fix incomplete configuration in cyclic mode

2018-10-08 Thread Joel Fernandes (Google)
From: Pierre Yves MORDRET 

When in cyclic mode, the configuration is updated after having started the
DMA hardware (STM32_DMA_SCR_EN) leading to incomplete configuration of
SMxAR registers.

Signed-off-by: Pierre-Yves MORDRET 
Signed-off-by: Hugues Fruchet 
Signed-off-by: Vinod Koul 
---
 drivers/dma/stm32-dma.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c
index 4099948b6914..fae7de54f00a 100644
--- a/drivers/dma/stm32-dma.c
+++ b/drivers/dma/stm32-dma.c
@@ -441,6 +441,8 @@ static void stm32_dma_dump_reg(struct stm32_dma_chan *chan)
dev_dbg(chan2dev(chan), "SFCR:  0x%08x\n", sfcr);
 }
 
+static void stm32_dma_configure_next_sg(struct stm32_dma_chan *chan);
+
 static void stm32_dma_start_transfer(struct stm32_dma_chan *chan)
 {
struct stm32_dma_device *dmadev = stm32_dma_get_dev(chan);
@@ -483,6 +485,9 @@ static void stm32_dma_start_transfer(struct stm32_dma_chan 
*chan)
if (status)
stm32_dma_irq_clear(chan, status);
 
+   if (chan->desc->cyclic)
+   stm32_dma_configure_next_sg(chan);
+
stm32_dma_dump_reg(chan);
 
/* Start DMA */
@@ -576,8 +581,7 @@ static void stm32_dma_issue_pending(struct dma_chan *c)
if (vchan_issue_pending(&chan->vchan) && !chan->desc && !chan->busy) {
dev_dbg(chan2dev(chan), "vchan %p: issued\n", &chan->vchan);
stm32_dma_start_transfer(chan);
-   if (chan->desc->cyclic)
-   stm32_dma_configure_next_sg(chan);
+
}
spin_unlock_irqrestore(&chan->vchan.lock, flags);
 }
-- 
2.19.0.605.g01d371f741-goog



[PATCH 0/7] NULL pointer deref fix for stm32-dma

2018-10-08 Thread Joel Fernandes (Google)
Hi Greg,

While looking at android-4.14, I found a NULL pointer deref with
stm32-dma driver using Coccicheck errors. I found that upstream had a
bunch of patches on stm32-dma that have fixed this and other issues, I
applied these patches cleanly onto Android 4.14. I believe these should
go to stable and flow into Android 4.14 from there, but I haven't tested
this since I have no hardware to do so.

At least I can say that the coccicheck error below goes away when running:
make coccicheck MODE=report
./drivers/dma/stm32-dma.c:567:18-24: ERROR: chan -> desc is NULL but 
dereferenced.

Anyway, please consider this series for 4.14 stable, I have CC'd the
author and others, thanks.

Pierre Yves MORDRET (7):
  dmaengine: stm32-dma: threshold manages with bitfield feature
  dmaengine: stm32-dma: fix incomplete configuration in cyclic mode
  dmaengine: stm32-dma: fix typo and reported checkpatch warnings
  dmaengine: stm32-dma: Improve memory burst management
  dmaengine: stm32-dma: fix DMA IRQ status handling
  dmaengine: stm32-dma: fix max items per transfer
  dmaengine: stm32-dma: properly mask irq bits

 drivers/dma/stm32-dma.c | 287 +---
 1 file changed, 240 insertions(+), 47 deletions(-)

-- 
2.19.0.605.g01d371f741-goog


[PATCH 4/7] dmaengine: stm32-dma: Improve memory burst management

2018-10-08 Thread Joel Fernandes (Google)
From: Pierre Yves MORDRET 

This patch improves memory burst capability using best burst size
according to transferred buffer size from/to memory.

From now on, the memory burst is not necessarily the same as the peripheral
burst, and the FIFO threshold is directly managed by this driver in order
to fit the computed memory burst.
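
(Not part of the patch: a rough user-space transcription of the burst-selection
rule described above, assuming the 16-byte FIFO and a full-FIFO threshold, just
to see which burst a given buffer length ends up with. Constant names are copied
for readability; this is a sketch, not the driver code.)

#include <stdio.h>
#include <stdint.h>

#define FIFO_SIZE	16
#define MIN_BURST	4
#define MAX_BURST	16
#define THRESHOLD_FULL	0x03

static int fifo_threshold_is_allowed(uint32_t burst, uint32_t threshold,
				     uint32_t width)
{
	if (!burst)
		return 1;
	/* allowed only if the FIFO content maps onto whole bursts */
	return (((FIFO_SIZE / width) * (threshold + 1) / 4) % burst) == 0;
}

static uint32_t get_best_burst(uint32_t buf_len, uint32_t max_burst,
			       uint32_t threshold, uint32_t width)
{
	uint32_t best = max_burst;

	/* with a full-FIFO threshold a burst only makes sense for >= 16 bytes */
	if (best == 1 || buf_len < MAX_BURST)
		return 0;

	while ((buf_len < best * width && best > 1) ||
	       !fifo_threshold_is_allowed(best, threshold, width))
		best = (best > MIN_BURST) ? best >> 1 : 0;

	return best;
}

int main(void)
{
	uint32_t lens[] = { 8, 16, 64, 1024 };
	unsigned int i;

	for (i = 0; i < sizeof(lens) / sizeof(lens[0]); i++)
		printf("buf_len %4u -> best memory burst %u beats\n",
		       (unsigned int)lens[i],
		       (unsigned int)get_best_burst(lens[i], MAX_BURST,
						    THRESHOLD_FULL, 4));
	return 0;
}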

Signed-off-by: M'boumba Cedric Madianga 
Signed-off-by: Pierre-Yves MORDRET 
Signed-off-by: Vinod Koul 
---
 drivers/dma/stm32-dma.c | 204 ++--
 1 file changed, 175 insertions(+), 29 deletions(-)

diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c
index b64e14a83dec..21ad359a5a59 100644
--- a/drivers/dma/stm32-dma.c
+++ b/drivers/dma/stm32-dma.c
@@ -5,6 +5,7 @@
  *
  * Copyright (C) M'boumba Cedric Madianga 2015
  * Author: M'boumba Cedric Madianga 
+ * Pierre-Yves Mordret 
  *
  * License terms:  GNU General Public License (GPL), version 2
  */
@@ -115,6 +116,8 @@
 #define STM32_DMA_MAX_CHANNELS 0x08
 #define STM32_DMA_MAX_REQUEST_ID   0x08
 #define STM32_DMA_MAX_DATA_PARAM   0x03
+#define STM32_DMA_FIFO_SIZE16  /* FIFO is 16 bytes */
+#define STM32_DMA_MIN_BURST4
 #define STM32_DMA_MAX_BURST16
 
 /* DMA Features */
@@ -184,6 +187,8 @@ struct stm32_dma_chan {
struct dma_slave_config dma_sconfig;
struct stm32_dma_chan_reg chan_reg;
u32 threshold;
+   u32 mem_burst;
+   u32 mem_width;
 };
 
 struct stm32_dma_device {
@@ -248,6 +253,85 @@ static int stm32_dma_get_width(struct stm32_dma_chan *chan,
}
 }
 
+static enum dma_slave_buswidth stm32_dma_get_max_width(u32 buf_len,
+  u32 threshold)
+{
+   enum dma_slave_buswidth max_width;
+
+   if (threshold == STM32_DMA_FIFO_THRESHOLD_FULL)
+   max_width = DMA_SLAVE_BUSWIDTH_4_BYTES;
+   else
+   max_width = DMA_SLAVE_BUSWIDTH_2_BYTES;
+
+   while ((buf_len < max_width  || buf_len % max_width) &&
+  max_width > DMA_SLAVE_BUSWIDTH_1_BYTE)
+   max_width = max_width >> 1;
+
+   return max_width;
+}
+
+static bool stm32_dma_fifo_threshold_is_allowed(u32 burst, u32 threshold,
+   enum dma_slave_buswidth width)
+{
+   u32 remaining;
+
+   if (width != DMA_SLAVE_BUSWIDTH_UNDEFINED) {
+   if (burst != 0) {
+   /*
+* If number of beats fit in several whole bursts
+* this configuration is allowed.
+*/
+   remaining = ((STM32_DMA_FIFO_SIZE / width) *
+(threshold + 1) / 4) % burst;
+
+   if (remaining == 0)
+   return true;
+   } else {
+   return true;
+   }
+   }
+
+   return false;
+}
+
+static bool stm32_dma_is_burst_possible(u32 buf_len, u32 threshold)
+{
+   switch (threshold) {
+   case STM32_DMA_FIFO_THRESHOLD_FULL:
+   if (buf_len >= STM32_DMA_MAX_BURST)
+   return true;
+   else
+   return false;
+   case STM32_DMA_FIFO_THRESHOLD_HALFFULL:
+   if (buf_len >= STM32_DMA_MAX_BURST / 2)
+   return true;
+   else
+   return false;
+   default:
+   return false;
+   }
+}
+
+static u32 stm32_dma_get_best_burst(u32 buf_len, u32 max_burst, u32 threshold,
+   enum dma_slave_buswidth width)
+{
+   u32 best_burst = max_burst;
+
+   if (best_burst == 1 || !stm32_dma_is_burst_possible(buf_len, threshold))
+   return 0;
+
+   while ((buf_len < best_burst * width && best_burst > 1) ||
+  !stm32_dma_fifo_threshold_is_allowed(best_burst, threshold,
+   width)) {
+   if (best_burst > STM32_DMA_MIN_BURST)
+   best_burst = best_burst >> 1;
+   else
+   best_burst = 0;
+   }
+
+   return best_burst;
+}
+
 static int stm32_dma_get_burst(struct stm32_dma_chan *chan, u32 maxburst)
 {
switch (maxburst) {
@@ -267,12 +351,12 @@ static int stm32_dma_get_burst(struct stm32_dma_chan 
*chan, u32 maxburst)
 }
 
 static void stm32_dma_set_fifo_config(struct stm32_dma_chan *chan,
- u32 src_maxburst, u32 dst_maxburst)
+ u32 src_burst, u32 dst_burst)
 {
chan->chan_reg.dma_sfcr &= ~STM32_DMA_SFCR_MASK;
chan->chan_reg.dma_scr &= ~STM32_DMA_SCR_DMEIE;
 
-   if ((!src_maxburst) && (!dst_maxburst)) {
+   if (!src_burst && !dst_burst) {
/* Using direct mode */
chan->chan_reg.dma_scr |= STM32_DMA_SCR_DMEIE;
} else {
@@ -589,37 

[PATCH 1/7] dmaengine: stm32-dma: threshold manages with bitfield feature

2018-10-08 Thread Joel Fernandes (Google)
From: Pierre Yves MORDRET 

From now on, a DMA bitfield is used to manage the DMA FIFO threshold.

Signed-off-by: Pierre-Yves MORDRET 
Signed-off-by: Vinod Koul 
---
 drivers/dma/stm32-dma.c | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c
index 786fc8fcc38e..4099948b6914 100644
--- a/drivers/dma/stm32-dma.c
+++ b/drivers/dma/stm32-dma.c
@@ -116,6 +116,10 @@
 #define STM32_DMA_MAX_DATA_PARAM   0x03
 #define STM32_DMA_MAX_BURST16
 
+/* DMA Features */
+#define STM32_DMA_THRESHOLD_FTR_MASK   GENMASK(1, 0)
+#define STM32_DMA_THRESHOLD_FTR_GET(n) ((n) & STM32_DMA_THRESHOLD_FTR_MASK)
+
 enum stm32_dma_width {
STM32_DMA_BYTE,
STM32_DMA_HALF_WORD,
@@ -129,11 +133,18 @@ enum stm32_dma_burst_size {
STM32_DMA_BURST_INCR16,
 };
 
+/**
+ * struct stm32_dma_cfg - STM32 DMA custom configuration
+ * @channel_id: channel ID
+ * @request_line: DMA request
+ * @stream_config: 32bit mask specifying the DMA channel configuration
+ * @features: 32bit mask specifying the DMA Feature list
+ */
 struct stm32_dma_cfg {
u32 channel_id;
u32 request_line;
u32 stream_config;
-   u32 threshold;
+   u32 features;
 };
 
 struct stm32_dma_chan_reg {
@@ -171,6 +182,7 @@ struct stm32_dma_chan {
u32 next_sg;
struct dma_slave_config dma_sconfig;
struct stm32_dma_chan_reg chan_reg;
+   u32 threshold;
 };
 
 struct stm32_dma_device {
@@ -976,7 +988,8 @@ static void stm32_dma_set_config(struct stm32_dma_chan 
*chan,
/* Enable Interrupts  */
chan->chan_reg.dma_scr |= STM32_DMA_SCR_TEIE | STM32_DMA_SCR_TCIE;
 
-   chan->chan_reg.dma_sfcr = cfg->threshold & STM32_DMA_SFCR_FTH_MASK;
+   chan->threshold = STM32_DMA_THRESHOLD_FTR_GET(cfg->features);
+   chan->chan_reg.dma_sfcr = STM32_DMA_SFCR_FTH(chan->threshold);
 }
 
 static struct dma_chan *stm32_dma_of_xlate(struct of_phandle_args *dma_spec,
@@ -996,7 +1009,7 @@ static struct dma_chan *stm32_dma_of_xlate(struct 
of_phandle_args *dma_spec,
cfg.channel_id = dma_spec->args[0];
cfg.request_line = dma_spec->args[1];
cfg.stream_config = dma_spec->args[2];
-   cfg.threshold = dma_spec->args[3];
+   cfg.features = dma_spec->args[3];
 
if ((cfg.channel_id >= STM32_DMA_MAX_CHANNELS) ||
(cfg.request_line >= STM32_DMA_MAX_REQUEST_ID)) {
-- 
2.19.0.605.g01d371f741-goog



[PATCH 3/7] dmaengine: stm32-dma: fix typo and reported checkpatch warnings

2018-10-08 Thread Joel Fernandes (Google)
From: Pierre Yves MORDRET 

Fix a typo in a comment and resolve reported checkpatch warnings.

Signed-off-by: Pierre-Yves MORDRET 
Signed-off-by: Vinod Koul 
---
 drivers/dma/stm32-dma.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c
index fae7de54f00a..b64e14a83dec 100644
--- a/drivers/dma/stm32-dma.c
+++ b/drivers/dma/stm32-dma.c
@@ -60,7 +60,8 @@
 #define STM32_DMA_SCR_PINC BIT(9) /* Peripheral increment mode */
 #define STM32_DMA_SCR_CIRC BIT(8) /* Circular mode */
 #define STM32_DMA_SCR_PFCTRL   BIT(5) /* Peripheral Flow Controller */
-#define STM32_DMA_SCR_TCIE BIT(4) /* Transfer Cplete Int Enable*/
+#define STM32_DMA_SCR_TCIE BIT(4) /* Transfer Complete Int Enable
+   */
 #define STM32_DMA_SCR_TEIE BIT(2) /* Transfer Error Int Enable */
 #define STM32_DMA_SCR_DMEIEBIT(1) /* Direct Mode Err Int Enable */
 #define STM32_DMA_SCR_EN   BIT(0) /* Stream Enable */
@@ -918,7 +919,7 @@ static enum dma_status stm32_dma_tx_status(struct dma_chan 
*c,
u32 residue = 0;
 
status = dma_cookie_status(c, cookie, state);
-   if ((status == DMA_COMPLETE) || (!state))
+   if (status == DMA_COMPLETE || !state)
return status;
 
spin_lock_irqsave(&chan->vchan.lock, flags);
@@ -982,7 +983,7 @@ static void stm32_dma_desc_free(struct virt_dma_desc *vdesc)
 }
 
 static void stm32_dma_set_config(struct stm32_dma_chan *chan,
- struct stm32_dma_cfg *cfg)
+struct stm32_dma_cfg *cfg)
 {
stm32_dma_clear_reg(&chan->chan_reg);
 
@@ -1015,8 +1016,8 @@ static struct dma_chan *stm32_dma_of_xlate(struct 
of_phandle_args *dma_spec,
cfg.stream_config = dma_spec->args[2];
cfg.features = dma_spec->args[3];
 
-   if ((cfg.channel_id >= STM32_DMA_MAX_CHANNELS) ||
-   (cfg.request_line >= STM32_DMA_MAX_REQUEST_ID)) {
+   if (cfg.channel_id >= STM32_DMA_MAX_CHANNELS ||
+   cfg.request_line >= STM32_DMA_MAX_REQUEST_ID) {
dev_err(dev, "Bad channel and/or request id\n");
return NULL;
}
-- 
2.19.0.605.g01d371f741-goog



[PATCH 5/7] dmaengine: stm32-dma: fix DMA IRQ status handling

2018-10-08 Thread Joel Fernandes (Google)
From: Pierre Yves MORDRET 

Update the way the Transfer Complete and Half Transfer Complete status are
acknowledged. Even if HTI is not enabled, its status is shown when reading the
registers, so the driver has to clear it quietly and not raise an error.

Signed-off-by: Pierre-Yves MORDRET 
Signed-off-by: Vinod Koul 
---
 drivers/dma/stm32-dma.c | 29 +
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c
index 21ad359a5a59..b40486454a2c 100644
--- a/drivers/dma/stm32-dma.c
+++ b/drivers/dma/stm32-dma.c
@@ -34,9 +34,14 @@
 #define STM32_DMA_LIFCR0x0008 /* DMA Low Int Flag 
Clear Reg */
 #define STM32_DMA_HIFCR0x000c /* DMA High Int Flag 
Clear Reg */
 #define STM32_DMA_TCI  BIT(5) /* Transfer Complete Interrupt */
+#define STM32_DMA_HTI  BIT(4) /* Half Transfer Interrupt */
 #define STM32_DMA_TEI  BIT(3) /* Transfer Error Interrupt */
 #define STM32_DMA_DMEI BIT(2) /* Direct Mode Error Interrupt */
 #define STM32_DMA_FEI  BIT(0) /* FIFO Error Interrupt */
+#define STM32_DMA_MASKI(STM32_DMA_TCI \
+| STM32_DMA_TEI \
+| STM32_DMA_DMEI \
+| STM32_DMA_FEI)
 
 /* DMA Stream x Configuration Register */
 #define STM32_DMA_SCR(x)   (0x0010 + 0x18 * (x)) /* x = 0..7 */
@@ -643,13 +648,29 @@ static irqreturn_t stm32_dma_chan_irq(int irq, void 
*devid)
status = stm32_dma_irq_status(chan);
scr = stm32_dma_read(dmadev, STM32_DMA_SCR(chan->id));
 
-   if ((status & STM32_DMA_TCI) && (scr & STM32_DMA_SCR_TCIE)) {
+   if (status & STM32_DMA_TCI) {
stm32_dma_irq_clear(chan, STM32_DMA_TCI);
-   stm32_dma_handle_chan_done(chan);
-
-   } else {
+   if (scr & STM32_DMA_SCR_TCIE)
+   stm32_dma_handle_chan_done(chan);
+   status &= ~STM32_DMA_TCI;
+   }
+   if (status & STM32_DMA_HTI) {
+   stm32_dma_irq_clear(chan, STM32_DMA_HTI);
+   status &= ~STM32_DMA_HTI;
+   }
+   if (status & STM32_DMA_FEI) {
+   stm32_dma_irq_clear(chan, STM32_DMA_FEI);
+   status &= ~STM32_DMA_FEI;
+   if (!(scr & STM32_DMA_SCR_EN))
+   dev_err(chan2dev(chan), "FIFO Error\n");
+   else
+   dev_dbg(chan2dev(chan), "FIFO over/underrun\n");
+   }
+   if (status) {
stm32_dma_irq_clear(chan, status);
dev_err(chan2dev(chan), "DMA error: status=0x%08x\n", status);
+   if (!(scr & STM32_DMA_SCR_EN))
+   dev_err(chan2dev(chan), "chan disabled by HW\n");
}
 
spin_unlock(&chan->vchan.lock);
-- 
2.19.0.605.g01d371f741-goog



[PATCH 7/7] dmaengine: stm32-dma: properly mask irq bits

2018-10-08 Thread Joel Fernandes (Google)
From: Pierre Yves MORDRET 

A single register of the controller holds the information for four dma
channels.
The function stm32_dma_irq_status() doesn't mask the relevant bits after
the shift, so the adjacent channel's status is also reported in the returned
value.
Fixed by masking the value before returning it.

Similarly, the function stm32_dma_irq_clear() doesn't mask the input value
before shifting it, thus an incorrect input value could disable the
interrupts of adjacent channels.
Fixed by masking the input value before using it.
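
(For illustration only: a tiny user-space sketch of the bit layout being masked
here, assuming the usual STM32 ISR packing of four channels per register at bit
offsets 0, 6, 16 and 22. It mirrors the shift in stm32_dma_irq_status() but is
not the driver code.)

#include <stdio.h>
#include <stdint.h>

#define STM32_DMA_MASKI	0x2d	/* TCI | TEI | DMEI | FEI */

int main(void)
{
	uint32_t dma_isr = 0xffffffff;	/* pretend every flag of every channel is set */
	int id;

	for (id = 0; id < 4; id++) {
		/* same shift as stm32_dma_irq_status() for channels 0..3 */
		uint32_t shift = ((id & 2) << 3) | ((id & 1) * 6);
		uint32_t flags = dma_isr >> shift;

		printf("chan %d: shift %2u raw 0x%08x masked 0x%02x\n",
		       id, shift, flags, flags & STM32_DMA_MASKI);
	}
	return 0;
}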

Signed-off-by: Pierre-Yves MORDRET 
Signed-off-by: Antonio Borneo 
Signed-off-by: Vinod Koul 
---
 drivers/dma/stm32-dma.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c
index 05a2974cd2c0..8c5807362a25 100644
--- a/drivers/dma/stm32-dma.c
+++ b/drivers/dma/stm32-dma.c
@@ -38,6 +38,10 @@
 #define STM32_DMA_TEI  BIT(3) /* Transfer Error Interrupt */
 #define STM32_DMA_DMEI BIT(2) /* Direct Mode Error Interrupt */
 #define STM32_DMA_FEI  BIT(0) /* FIFO Error Interrupt */
+#define STM32_DMA_MASKI(STM32_DMA_TCI \
+| STM32_DMA_TEI \
+| STM32_DMA_DMEI \
+| STM32_DMA_FEI)
 
 /* DMA Stream x Configuration Register */
 #define STM32_DMA_SCR(x)   (0x0010 + 0x18 * (x)) /* x = 0..7 */
@@ -405,7 +409,7 @@ static u32 stm32_dma_irq_status(struct stm32_dma_chan *chan)
 
flags = dma_isr >> (((chan->id & 2) << 3) | ((chan->id & 1) * 6));
 
-   return flags;
+   return flags & STM32_DMA_MASKI;
 }
 
 static void stm32_dma_irq_clear(struct stm32_dma_chan *chan, u32 flags)
@@ -420,6 +424,7 @@ static void stm32_dma_irq_clear(struct stm32_dma_chan 
*chan, u32 flags)
 * If (ch % 4) is 2 or 3, left shift the mask by 16 bits.
 * If (ch % 4) is 1 or 3, additionally left shift the mask by 6 bits.
 */
+   flags &= STM32_DMA_MASKI;
dma_ifcr = flags << (((chan->id & 2) << 3) | ((chan->id & 1) * 6));
 
if (chan->id & 4)
-- 
2.19.0.605.g01d371f741-goog



[PATCH 6/7] dmaengine: stm32-dma: fix max items per transfer

2018-10-08 Thread Joel Fernandes (Google)
From: Pierre Yves MORDRET 

Having 0 in the item counter register is valid and stands for a "no or ended
transfer". Therefore a valid transfer runs from @+0 to @+0xFFFE, leading to
unaligned scatter-gather at the boundary. Thus it is safer to round this
value down to the FIFO size (16 bytes).
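
(For illustration: a minimal user-space sketch of that rounding, assuming the
usual power-of-two ALIGN_DOWN definition; it is not the driver code.)

#include <stdio.h>

#define STM32_DMA_MAX_DATA_ITEMS	0xffff
#define ALIGN_DOWN(x, a)		((x) & ~((a) - 1))	/* power-of-two 'a' only */

int main(void)
{
	/* 0xffff rounded down to the 16-byte FIFO size gives 0xfff0 */
	printf("aligned max data items: 0x%x\n",
	       (unsigned int)ALIGN_DOWN(STM32_DMA_MAX_DATA_ITEMS, 16));
	return 0;
}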

Signed-off-by: Pierre-Yves MORDRET 
Signed-off-by: Vinod Koul 
---
 drivers/dma/stm32-dma.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c
index b40486454a2c..05a2974cd2c0 100644
--- a/drivers/dma/stm32-dma.c
+++ b/drivers/dma/stm32-dma.c
@@ -38,10 +38,6 @@
 #define STM32_DMA_TEI  BIT(3) /* Transfer Error Interrupt */
 #define STM32_DMA_DMEI BIT(2) /* Direct Mode Error Interrupt */
 #define STM32_DMA_FEI  BIT(0) /* FIFO Error Interrupt */
-#define STM32_DMA_MASKI(STM32_DMA_TCI \
-| STM32_DMA_TEI \
-| STM32_DMA_DMEI \
-| STM32_DMA_FEI)
 
 /* DMA Stream x Configuration Register */
 #define STM32_DMA_SCR(x)   (0x0010 + 0x18 * (x)) /* x = 0..7 */
@@ -118,6 +114,13 @@
 #define STM32_DMA_FIFO_THRESHOLD_FULL  0x03
 
 #define STM32_DMA_MAX_DATA_ITEMS   0xffff
+/*
+ * Valid transfer starts from @0 to @0xFFFE leading to unaligned scatter
+ * gather at boundary. Thus it's safer to round down this value on FIFO
+ * size (16 Bytes)
+ */
+#define STM32_DMA_ALIGNED_MAX_DATA_ITEMS   \
+   ALIGN_DOWN(STM32_DMA_MAX_DATA_ITEMS, 16)
 #define STM32_DMA_MAX_CHANNELS 0x08
 #define STM32_DMA_MAX_REQUEST_ID   0x08
 #define STM32_DMA_MAX_DATA_PARAM   0x03
@@ -869,7 +872,7 @@ static struct dma_async_tx_descriptor 
*stm32_dma_prep_slave_sg(
desc->sg_req[i].len = sg_dma_len(sg);
 
nb_data_items = desc->sg_req[i].len / buswidth;
-   if (nb_data_items > STM32_DMA_MAX_DATA_ITEMS) {
+   if (nb_data_items > STM32_DMA_ALIGNED_MAX_DATA_ITEMS) {
dev_err(chan2dev(chan), "nb items not supported\n");
goto err;
}
@@ -935,7 +938,7 @@ static struct dma_async_tx_descriptor 
*stm32_dma_prep_dma_cyclic(
return NULL;
 
nb_data_items = period_len / buswidth;
-   if (nb_data_items > STM32_DMA_MAX_DATA_ITEMS) {
+   if (nb_data_items > STM32_DMA_ALIGNED_MAX_DATA_ITEMS) {
dev_err(chan2dev(chan), "number of items not supported\n");
return NULL;
}
@@ -985,7 +988,7 @@ static struct dma_async_tx_descriptor 
*stm32_dma_prep_dma_memcpy(
u32 num_sgs, best_burst, dma_burst, threshold;
int i;
 
-   num_sgs = DIV_ROUND_UP(len, STM32_DMA_MAX_DATA_ITEMS);
+   num_sgs = DIV_ROUND_UP(len, STM32_DMA_ALIGNED_MAX_DATA_ITEMS);
desc = stm32_dma_alloc_desc(num_sgs);
if (!desc)
return NULL;
@@ -994,7 +997,7 @@ static struct dma_async_tx_descriptor 
*stm32_dma_prep_dma_memcpy(
 
for (offset = 0, i = 0; offset < len; offset += xfer_count, i++) {
xfer_count = min_t(size_t, len - offset,
-  STM32_DMA_MAX_DATA_ITEMS);
+  STM32_DMA_ALIGNED_MAX_DATA_ITEMS);
 
/* Compute best burst size */
max_width = DMA_SLAVE_BUSWIDTH_1_BYTE;
-- 
2.19.0.605.g01d371f741-goog



[PATCH v2 1/2] mm: Add an F_SEAL_FS_WRITE seal to memfd

2018-10-09 Thread Joel Fernandes (Google)
Android uses ashmem for sharing memory regions. We are looking forward
to migrating all usecases of ashmem to memfd so that we can possibly
remove the ashmem driver in the future from staging while also
benefiting from using memfd and contributing to it. Note staging drivers
are also not ABI and generally can be removed at any time.

One of the main usecases Android has is the ability to create a region
and mmap it as writeable, then drop its protection for "future" writes
while keeping the existing already mmap'ed writeable-region active.
This allows us to implement a usecase where receivers of the shared
memory buffer can get a read-only view, while the sender continues to
write to the buffer. See CursorWindow in Android for more details:
https://developer.android.com/reference/android/database/CursorWindow

This usecase cannot be implemented with the existing F_SEAL_WRITE seal.
To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which
prevents any future mmap and write syscalls from succeeding while
keeping the existing mmap active. The following program shows the seal
working in action:

int main() {
int ret, fd;
void *addr, *addr2, *addr3, *addr1;
ret = memfd_create_region("test_region", REGION_SIZE);
printf("ret=%d\n", ret);
fd = ret;

// Create map
addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr == MAP_FAILED)
printf("map 0 failed\n");
else
printf("map 0 passed\n");

if ((ret = write(fd, "test", 4)) != 4)
printf("write failed even though no fs-write seal "
   "(ret=%d errno =%d)\n", ret, errno);
else
printf("write passed\n");

addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr1 == MAP_FAILED)
perror("map 1 prot-write failed even though no seal\n");
else
printf("map 1 prot-write passed as expected\n");

ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FS_WRITE);
if (ret == -1)
printf("fcntl failed, errno: %d\n", errno);
else
printf("fs-write seal now active\n");

if ((ret = write(fd, "test", 4)) != 4)
printf("write failed as expected due to fs-write seal\n");
else
printf("write passed (unexpected)\n");

addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if (addr2 == MAP_FAILED)
perror("map 2 prot-write failed as expected due to seal\n");
else
printf("map 2 passed\n");

addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0);
if (addr3 == MAP_FAILED)
perror("map 3 failed\n");
else
printf("map 3 prot-read passed as expected\n");
}

The output of running this program is as follows:
ret=3
map 0 passed
write passed
map 1 prot-write passed as expected
fs-write seal now active
write failed as expected due to fs-write seal
map 2 prot-write failed as expected due to seal
: Permission denied
map 3 prot-read passed as expected

Note: This seal will also prevent growing and shrinking of the memfd.
This is not something we do in Android so it does not affect us, however
I have mentioned this behavior of the seal in the manpage.

Cc: jr...@google.com
Cc: john.stu...@linaro.org
Cc: tk...@google.com
Cc: gre...@linuxfoundation.org
Signed-off-by: Joel Fernandes (Google) 
---
v1->v2: No change, just added selftests to the series. manpages are
ready and I'll submit them once the patches are accepted.

 include/uapi/linux/fcntl.h | 1 +
 mm/memfd.c | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index c98312fa78a5..fe44a2035edf 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -41,6 +41,7 @@
 #define F_SEAL_SHRINK  0x0002  /* prevent file from shrinking */
 #define F_SEAL_GROW0x0004  /* prevent file from growing */
 #define F_SEAL_WRITE   0x0008  /* prevent writes */
+#define F_SEAL_FS_WRITE0x0010  /* prevent all write-related syscalls */
 /* (1U << 31) is reserved for signed error codes */
 
 /*
diff --git a/mm/memfd.c b/mm/memfd.c
index 27069518e3c5..9b8855b80de9 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
 #define F_ALL_SEALS (F_SEAL_SEAL | \
 F_SEAL_SHRINK | \
 F_SEAL_GROW | \
-F_SEAL_WRITE)
+F_SEAL_WRITE | \
+F_SEAL_FS_WRITE)
 
 static int memfd_add_seals(struct file *file, unsigned int seals)
 {
@@ -219,6 +220,9 @@ static int memfd_add_seals(struct file *file, unsigned int 
seals)
}
}
 
+   if ((seals & F_SEAL_FS_WRITE) && !(*file_seals & F_SEAL_FS_WRITE))
+   file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
+
*file_seals |= seals;
error = 0;
 
-- 
2.19.0.605.g01d371f741-goog



[PATCH v2 2/2] selftests/memfd: Add tests for F_SEAL_FS_WRITE seal

2018-10-09 Thread Joel Fernandes (Google)
Add tests to verify sealing memfds with the F_SEAL_FS_WRITE works as
expected.

Cc: dan...@google.com
Cc: minc...@google.com
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/memfd/memfd_test.c | 51 +-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/memfd/memfd_test.c 
b/tools/testing/selftests/memfd/memfd_test.c
index 10baa1652fc2..4bd2b6c87bb4 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -27,7 +27,7 @@
 
 #define MFD_DEF_SIZE 8192
 #define STACK_SIZE 65536
-
+#define F_SEAL_FS_WRITE 0x0010
 /*
  * Default is not to test hugetlbfs
  */
@@ -170,6 +170,24 @@ static void *mfd_assert_mmap_shared(int fd)
return p;
 }
 
+static void *mfd_fail_mmap_shared(int fd)
+{
+   void *p;
+
+   p = mmap(NULL,
+mfd_def_size,
+PROT_READ | PROT_WRITE,
+MAP_SHARED,
+fd,
+0);
+   if (p != MAP_FAILED) {
+   printf("mmap() didn't fail as expected\n");
+   abort();
+   }
+
+   return p;
+}
+
 static void *mfd_assert_mmap_private(int fd)
 {
void *p;
@@ -692,6 +710,36 @@ static void test_seal_write(void)
close(fd);
 }
 
+/*
+ * Test SEAL_FS_WRITE
+ * Test whether SEAL_FS_WRITE actually prevents modifications.
+ */
+static void test_seal_fs_write(void)
+{
+   int fd;
+   void *p;
+
+   printf("%s SEAL-FS-WRITE\n", memfd_str);
+
+   fd = mfd_assert_new("kern_memfd_seal_fs_write",
+   mfd_def_size,
+   MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+   p = mfd_assert_mmap_shared(fd);
+
+   /* FS_WRITE seal can be added even with existing
+* writeable mappings */
+   mfd_assert_has_seals(fd, 0);
+   mfd_assert_add_seals(fd, F_SEAL_FS_WRITE);
+   mfd_assert_has_seals(fd, F_SEAL_FS_WRITE);
+
+   mfd_assert_read(fd);
+   mfd_fail_write(fd);
+
+   munmap(p, mfd_def_size);
+   close(fd);
+}
+
 /*
  * Test SEAL_SHRINK
  * Test whether SEAL_SHRINK actually prevents shrinking
@@ -945,6 +993,7 @@ int main(int argc, char **argv)
test_basic();
 
test_seal_write();
+   test_seal_fs_write();
test_seal_shrink();
test_seal_grow();
test_seal_resize();
-- 
2.19.0.605.g01d371f741-goog



[PATCH] mm: Speed up mremap on large regions

2018-10-09 Thread Joel Fernandes (Google)
Android needs to mremap large regions of memory during memory management
related operations. The mremap system call can be really slow if THP is
not enabled. The bottleneck is move_page_tables, which copies one
pte at a time and can be really slow across a large map. Turning on THP
may not be a viable option, and is not for us. This patch speeds up the
performance for non-THP systems by copying at the PMD level when possible.

The speedup is three orders of magnitude. On a 1GB mremap, the mremap
completion time drops from 160-250 milliseconds to 380-400 microseconds.

Before:
Total mremap time for 1GB data: 242321014 nanoseconds.
Total mremap time for 1GB data: 196842467 nanoseconds.
Total mremap time for 1GB data: 167051162 nanoseconds.

After:
Total mremap time for 1GB data: 385781 nanoseconds.
Total mremap time for 1GB data: 388959 nanoseconds.
Total mremap time for 1GB data: 402813 nanoseconds.

In case THP is enabled, the optimization is skipped. I also flush the
TLB every time we do this optimization since I couldn't find a way to
determine if the low-level PTEs are dirty. The cost of doing so is small
compared to the improvement, on both x86-64 and arm64.
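
(The exact benchmark is not included in this mail; the numbers above can be
approximated with a rough user-space sketch along these lines, which forces the
whole 1GB mapping to move so move_page_tables() does the work. Treat it as an
assumption about the methodology, not the actual test program.)

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define SZ (1UL << 30)	/* 1GB */

int main(void)
{
	struct timespec a, b;
	void *src, *dst, *moved;

	src = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* reserve a destination so the remap is forced to move */
	dst = mmap(NULL, SZ, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED || dst == MAP_FAILED)
		return 1;
	memset(src, 1, SZ);	/* populate the page tables */

	clock_gettime(CLOCK_MONOTONIC, &a);
	moved = mremap(src, SZ, SZ, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	clock_gettime(CLOCK_MONOTONIC, &b);
	if (moved == MAP_FAILED)
		return 1;

	printf("Total mremap time for 1GB data: %ld nanoseconds.\n",
	       (long)(b.tv_sec - a.tv_sec) * 1000000000L +
	       (b.tv_nsec - a.tv_nsec));
	return 0;
}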

Cc: minc...@google.com
Cc: hu...@google.com
Cc: lokeshgi...@google.com
Cc: kernel-t...@android.com
Signed-off-by: Joel Fernandes (Google) 
---
 mm/mremap.c | 62 +
 1 file changed, 62 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index 5c2e18505f75..68ddc9e9dfde 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t 
*old_pmd,
drop_rmap_locks(vma);
 }
 
+bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr,
+ unsigned long new_addr, unsigned long old_end,
+ pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush)
+{
+   spinlock_t *old_ptl, *new_ptl;
+   struct mm_struct *mm = vma->vm_mm;
+
+   if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK)
+   || old_end - old_addr < PMD_SIZE)
+   return false;
+
+   /*
+* The destination pmd shouldn't be established, free_pgtables()
+* should have release it.
+*/
+   if (WARN_ON(!pmd_none(*new_pmd)))
+   return false;
+
+   /*
+* We don't have to worry about the ordering of src and dst
+* ptlocks because exclusive mmap_sem prevents deadlock.
+*/
+   old_ptl = pmd_lock(vma->vm_mm, old_pmd);
+   if (old_ptl) {
+   pmd_t pmd;
+
+   new_ptl = pmd_lockptr(mm, new_pmd);
+   if (new_ptl != old_ptl)
+   spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
+
+   /* Clear the pmd */
+   pmd = *old_pmd;
+   pmd_clear(old_pmd);
+
+   VM_BUG_ON(!pmd_none(*new_pmd));
+
+   /* Set the new pmd */
+   set_pmd_at(mm, new_addr, new_pmd, pmd);
+   if (new_ptl != old_ptl)
+   spin_unlock(new_ptl);
+   spin_unlock(old_ptl);
+
+   *need_flush = true;
+   return true;
+   }
+   return false;
+}
+
 unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
@@ -239,7 +287,21 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
split_huge_pmd(vma, old_pmd, old_addr);
if (pmd_trans_unstable(old_pmd))
continue;
+   } else if (extent == PMD_SIZE) {
+   bool moved;
+
+   /* See comment in move_ptes() */
+   if (need_rmap_locks)
+   take_rmap_locks(vma);
+   moved = move_normal_pmd(vma, old_addr, new_addr,
+   old_end, old_pmd, new_pmd,
   &need_flush);
+   if (need_rmap_locks)
+   drop_rmap_locks(vma);
+   if (moved)
+   continue;
}
+
if (pte_alloc(new_vma->vm_mm, new_pmd, new_addr))
break;
next = (new_addr + PMD_SIZE) & PMD_MASK;
-- 
2.19.0.605.g01d371f741-goog



[PATCH -next 1/2] mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust

2018-11-19 Thread Joel Fernandes (Google)
A better way to implement the F_SEAL_FUTURE_WRITE seal was discussed [1] last
week, where we don't need to modify core VFS structures to get the same
behavior from the seal. This solves several side effects pointed out by
Andy [2].

[1] https://lore.kernel.org/lkml/2018173650.ga256...@google.com/
[2] 
https://lore.kernel.org/lkml/69ce06cc-e47c-4992-848a-66eb23ee6...@amacapital.net/

Suggested-by: Andy Lutomirski 
Fixes: 5e653c2923fd ("mm: Add an F_SEAL_FUTURE_WRITE seal to memfd")
Signed-off-by: Joel Fernandes (Google) 
---
 fs/hugetlbfs/inode.c |  2 +-
 mm/memfd.c   | 19 ---
 mm/shmem.c   | 24 +---
 3 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 762028994f47..5b54bf893a67 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -558,7 +558,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, 
loff_t offset, loff_t len)
inode_lock(inode);
 
/* protected by i_mutex */
-   if (info->seals & F_SEAL_WRITE) {
+   if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
inode_unlock(inode);
return -EPERM;
}
diff --git a/mm/memfd.c b/mm/memfd.c
index 63fff5e77114..650e65a46b9c 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -201,25 +201,6 @@ static int memfd_add_seals(struct file *file, unsigned int 
seals)
}
}
 
-   if ((seals & F_SEAL_FUTURE_WRITE) &&
-   !(*file_seals & F_SEAL_FUTURE_WRITE)) {
-   /*
-* The FUTURE_WRITE seal also prevents growing and shrinking
-* so we need them to be already set, or requested now.
-*/
-   int test_seals = (seals | *file_seals) &
-(F_SEAL_GROW | F_SEAL_SHRINK);
-
-   if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) {
-   error = -EINVAL;
-   goto unlock;
-   }
-
-   spin_lock(&file->f_lock);
-   file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE);
-   spin_unlock(&file->f_lock);
-   }
-
*file_seals |= seals;
error = 0;
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 32eb29bd72c6..cee9878c87f1 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2121,6 +2121,23 @@ int shmem_lock(struct file *file, int lock, struct 
user_struct *user)
 
 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
+   struct shmem_inode_info *info = SHMEM_I(file_inode(file));
+
+   /*
+* New PROT_WRITE and MAP_SHARED mmaps are not allowed when "future
+* write" seal active.
+*/
+   if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE) &&
+   (info->seals & F_SEAL_FUTURE_WRITE))
+   return -EPERM;
+
+   /*
+* Since the F_SEAL_FUTURE_WRITE seal allows for a MAP_SHARED read-only
+* mapping, take care to not allow mprotect to revert protections.
+*/
+   if (info->seals & F_SEAL_FUTURE_WRITE)
+   vma->vm_flags &= ~(VM_MAYWRITE);
+
file_accessed(file);
vma->vm_ops = _vm_ops;
if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
@@ -2346,8 +2363,9 @@ shmem_write_begin(struct file *file, struct address_space 
*mapping,
pgoff_t index = pos >> PAGE_SHIFT;
 
/* i_mutex is held by caller */
-   if (unlikely(info->seals & (F_SEAL_WRITE | F_SEAL_GROW))) {
-   if (info->seals & F_SEAL_WRITE)
+   if (unlikely(info->seals & (F_SEAL_GROW |
+  F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+   if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))
return -EPERM;
if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
return -EPERM;
@@ -2610,7 +2628,7 @@ static long shmem_fallocate(struct file *file, int mode, 
loff_t offset,
DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq);
 
/* protected by i_mutex */
-   if (info->seals & F_SEAL_WRITE) {
+   if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) {
error = -EPERM;
goto out;
}
-- 
2.19.1.1215.g8438c0b245-goog



[PATCH -next 2/2] selftests/memfd: modify tests for F_SEAL_FUTURE_WRITE seal

2018-11-19 Thread Joel Fernandes (Google)
Modify the tests for F_SEAL_FUTURE_WRITE based on the changes
introduced in the previous patch.

Also add a test to make sure the reopen issue pointed out by Jann Horn [1]
is fixed.

[1] 
https://lore.kernel.org/lkml/CAG48ez1h=v-JYnDw81HaYJzOfrNhwYksxmc2r=cjvdqvgym...@mail.gmail.com/

Cc: Jann Horn 
Signed-off-by: Joel Fernandes (Google) 
---
 tools/testing/selftests/memfd/memfd_test.c | 88 +++---
 1 file changed, 44 insertions(+), 44 deletions(-)

diff --git a/tools/testing/selftests/memfd/memfd_test.c 
b/tools/testing/selftests/memfd/memfd_test.c
index 32b207ca7372..c67d32eeb668 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -54,6 +54,22 @@ static int mfd_assert_new(const char *name, loff_t sz, 
unsigned int flags)
return fd;
 }
 
+static int mfd_assert_reopen_fd(int fd_in)
+{
+   int r, fd;
+   char path[100];
+
+   sprintf(path, "/proc/self/fd/%d", fd_in);
+
+   fd = open(path, O_RDWR);
+   if (fd < 0) {
+   printf("re-open of existing fd %d failed\n", fd_in);
+   abort();
+   }
+
+   return fd;
+}
+
 static void mfd_fail_new(const char *name, unsigned int flags)
 {
int r;
@@ -255,6 +271,25 @@ static void mfd_assert_read(int fd)
munmap(p, mfd_def_size);
 }
 
+/* Test that PROT_READ + MAP_SHARED mappings work. */
+static void mfd_assert_read_shared(int fd)
+{
+   void *p;
+
+   /* verify PROT_READ and MAP_SHARED *is* allowed */
+   p = mmap(NULL,
+mfd_def_size,
+PROT_READ,
+MAP_SHARED,
+fd,
+0);
+   if (p == MAP_FAILED) {
+   printf("mmap() failed: %m\n");
+   abort();
+   }
+   munmap(p, mfd_def_size);
+}
+
 static void mfd_assert_write(int fd)
 {
ssize_t l;
@@ -698,7 +733,7 @@ static void test_seal_write(void)
  */
 static void test_seal_future_write(void)
 {
-   int fd;
+   int fd, fd2;
void *p;
 
printf("%s SEAL-FUTURE-WRITE\n", memfd_str);
@@ -710,58 +745,23 @@ static void test_seal_future_write(void)
p = mfd_assert_mmap_shared(fd);
 
mfd_assert_has_seals(fd, 0);
-   /* Not adding grow/shrink seals makes the future write
-* seal fail to get added
-*/
-   mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE);
-
-   mfd_assert_add_seals(fd, F_SEAL_GROW);
-   mfd_assert_has_seals(fd, F_SEAL_GROW);
-
-   /* Should still fail since shrink seal has
-* not yet been added
-*/
-   mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE);
-
-   mfd_assert_add_seals(fd, F_SEAL_SHRINK);
-   mfd_assert_has_seals(fd, F_SEAL_GROW |
-F_SEAL_SHRINK);
 
-   /* Now should succeed, also verifies that the seal
-* could be added with an existing writable mmap
-*/
mfd_assert_add_seals(fd, F_SEAL_FUTURE_WRITE);
-   mfd_assert_has_seals(fd, F_SEAL_SHRINK |
-F_SEAL_GROW |
-F_SEAL_FUTURE_WRITE);
+   mfd_assert_has_seals(fd, F_SEAL_FUTURE_WRITE);
 
/* read should pass, writes should fail */
mfd_assert_read(fd);
+   mfd_assert_read_shared(fd);
mfd_fail_write(fd);
 
-   munmap(p, mfd_def_size);
-   close(fd);
-
-   /* Test adding all seals (grow, shrink, future write) at once */
-   fd = mfd_assert_new("kern_memfd_seal_future_write2",
-   mfd_def_size,
-   MFD_CLOEXEC | MFD_ALLOW_SEALING);
-
-   p = mfd_assert_mmap_shared(fd);
-
-   mfd_assert_has_seals(fd, 0);
-   mfd_assert_add_seals(fd, F_SEAL_SHRINK |
-F_SEAL_GROW |
-F_SEAL_FUTURE_WRITE);
-   mfd_assert_has_seals(fd, F_SEAL_SHRINK |
-F_SEAL_GROW |
-F_SEAL_FUTURE_WRITE);
-
-   /* read should pass, writes should fail */
-   mfd_assert_read(fd);
-   mfd_fail_write(fd);
+   fd2 = mfd_assert_reopen_fd(fd);
+   /* read should pass, writes should still fail */
+   mfd_assert_read(fd2);
+   mfd_assert_read_shared(fd2);
+   mfd_fail_write(fd2);
 
munmap(p, mfd_def_size);
+   close(fd2);
close(fd);
 }
 
-- 
2.19.1.1215.g8438c0b245-goog



[PATCH -manpage 1/2] fcntl.2: Update manpage with new memfd F_SEAL_FUTURE_WRITE seal

2018-11-19 Thread Joel Fernandes (Google)
More details of the seal can be found in the LKML patch:
https://lore.kernel.org/lkml/20181120052137.74317-1-j...@joelfernandes.org/T/#t

Signed-off-by: Joel Fernandes (Google) 
---
 man2/fcntl.2 | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index 03533d65b49d..54772f94964c 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -1525,6 +1525,21 @@ Furthermore, if there are any asynchronous I/O operations
 .RB ( io_submit (2))
 pending on the file,
 all outstanding writes will be discarded.
+.TP
+.BR F_SEAL_FUTURE_WRITE
+If this seal is set, the contents of the file can be modified only from
+existing writeable mappings that were created prior to the seal being set.
+Any attempt to create a new writeable mapping on the memfd via
+.BR mmap (2)
+will fail with
+.BR EPERM.
+Also any attempts to write to the memfd via
+.BR write (2)
+will fail with
+.BR EPERM.
+This is useful in situations where existing writable mapped regions need to be
+kept intact while preventing any future writes. For example, to share a
read-only memory buffer with other processes that only the sender can write to.
 .\"
 .SS File read/write hints
 Write lifetime hints can be used to inform the kernel about the relative
-- 
2.19.1.1215.g8438c0b245-goog



[PATCH -manpage 2/2] memfd_create.2: Update manpage with new memfd F_SEAL_FUTURE_WRITE seal

2018-11-19 Thread Joel Fernandes (Google)
More details of the seal can be found in the LKML patch:
https://lore.kernel.org/lkml/20181120052137.74317-1-j...@joelfernandes.org/T/#t

Signed-off-by: Joel Fernandes (Google) 
---
 man2/memfd_create.2 | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/man2/memfd_create.2 b/man2/memfd_create.2
index 3cd392d1b4d9..fce2bf8d0fff 100644
--- a/man2/memfd_create.2
+++ b/man2/memfd_create.2
@@ -280,7 +280,15 @@ in order to restrict further modifications on the file.
 (If placing the seal
 .BR F_SEAL_WRITE ,
 then it will be necessary to first unmap the shared writable mapping
-created in the previous step.)
+created in the previous step. Otherwise, behavior similar to
+.BR F_SEAL_WRITE
can be achieved by using
+.BR F_SEAL_FUTURE_WRITE
+which will prevent future writes via
+.BR mmap (2)
+and
+.BR write (2)
+from succeeding, while keeping existing shared writable mappings).
 .IP 4.
 A second process obtains a file descriptor for the
 .BR tmpfs (5)
@@ -425,6 +433,7 @@ main(int argc, char *argv[])
 fprintf(stderr, "\\t\\tg \- F_SEAL_GROW\\n");
 fprintf(stderr, "\\t\\ts \- F_SEAL_SHRINK\\n");
 fprintf(stderr, "\\t\\tw \- F_SEAL_WRITE\\n");
+fprintf(stderr, "\\t\\tW \- F_SEAL_FUTURE_WRITE\\n");
 fprintf(stderr, "\\t\\tS \- F_SEAL_SEAL\\n");
 exit(EXIT_FAILURE);
 }
@@ -463,6 +472,8 @@ main(int argc, char *argv[])
 seals |= F_SEAL_SHRINK;
 if (strchr(seals_arg, \(aqw\(aq) != NULL)
 seals |= F_SEAL_WRITE;
+if (strchr(seals_arg, \(aqW\(aq) != NULL)
+seals |= F_SEAL_FUTURE_WRITE;
 if (strchr(seals_arg, \(aqS\(aq) != NULL)
 seals |= F_SEAL_SEAL;
 
@@ -518,6 +529,8 @@ main(int argc, char *argv[])
 printf(" GROW");
 if (seals & F_SEAL_WRITE)
 printf(" WRITE");
+if (seals & F_SEAL_FUTURE_WRITE)
+printf(" FUTURE_WRITE");
 if (seals & F_SEAL_SHRINK)
 printf(" SHRINK");
 printf("\\n");
-- 
2.19.1.1215.g8438c0b245-goog



Re: [RFC PATCH] tracepoint: Provide tracepoint_kernel_find_by_name

2018-03-26 Thread Joel Fernandes (Google)
Hi Mathieu,

On Mon, Mar 26, 2018 at 12:10 PM, Mathieu Desnoyers
 wrote:
> Provide an API allowing eBPF to lookup core kernel tracepoints by name.
>
> Given that a lookup by name explicitly requires tracepoint definitions
> to be unique for a given name (no duplicate keys), include a
> WARN_ON_ONCE() check that only a single match is encountered at runtime.
> This should always be the case, given that a DEFINE_TRACE emits a
> __tracepoint_##name symbol, which would cause a link-time error if more
> than one instance is found. Nevertheless, check this at runtime with
> WARN_ON_ONCE() to stay on the safe side.
>
> Signed-off-by: Mathieu Desnoyers 
> CC: Steven Rostedt 
> CC: Alexei Starovoitov 
> CC: Peter Zijlstra 
> CC: Ingo Molnar 
> ---
>  include/linux/tracepoint.h |  1 +
>  kernel/tracepoint.c| 35 +++
>  2 files changed, 36 insertions(+)
>
> diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
> index c94f466d57ef..1b4ae64b7c6a 100644
> --- a/include/linux/tracepoint.h
> +++ b/include/linux/tracepoint.h
> @@ -43,6 +43,7 @@ tracepoint_probe_unregister(struct tracepoint *tp, void 
> *probe, void *data);
>  extern void
>  for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
> void *priv);
> +extern struct tracepoint *tracepoint_kernel_find_by_name(const char *name);
>
>  #ifdef CONFIG_MODULES
>  struct tp_module {
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 671b13457387..0a59f988055a 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -60,6 +60,11 @@ struct tp_probes {
> struct tracepoint_func probes[0];
>  };
>
> +struct tp_find_args {
> +   const char *name;
> +   struct tracepoint *tp;
> +};
> +
>  static inline void *allocate_probes(int count)
>  {
> struct tp_probes *p  = kmalloc(count * sizeof(struct tracepoint_func)
> @@ -528,6 +533,36 @@ void for_each_kernel_tracepoint(void (*fct)(struct 
> tracepoint *tp, void *priv),
>  }
>  EXPORT_SYMBOL_GPL(for_each_kernel_tracepoint);
>
> +static void find_tp(struct tracepoint *tp, void *priv)
> +{
> +   struct tp_find_args *args = priv;
> +
> +   if (!strcmp(tp->name, args->name)) {
> +   WARN_ON_ONCE(args->tp);
> +   args->tp = tp;

I think this runtime check is not needed if it really can't happen
(linker verifies that already as you mentioned) although I'm not
opposed if you still want to keep it, because there's no way to break
out of the outer loop anyway, so I guess it's your call. I would just do:

if (args->tp)
return;

if find_tp already found the tracepoint.

Tried to also create a duplicate tracepoint and I get:
  CC  init/version.o
  AR  init/built-in.o
  AR  built-in.o
  LD  vmlinux.o
block/blk-core.o:(__tracepoints+0x440): multiple definition of
`__tracepoint_sched_switch'
kernel/sched/core.o:(__tracepoints+0x440): first defined here
Makefile:1032: recipe for target 'vmlinux' failed
make: *** [vmlinux] Error 1

For this senseless diff:
 diff --git a/include/trace/events/block.h b/include/trace/events/block.h
index 81b43f5bdf23..2b855c05197d 100644
--- a/include/trace/events/block.h
+++ b/include/trace/events/block.h
@@ -49,6 +49,13 @@ DEFINE_EVENT(block_buffer, block_touch_buffer,
TP_ARGS(bh)
 );

+DEFINE_EVENT(block_buffer, sched_switch,
+
+   TP_PROTO(struct buffer_head *bh),
+
+   TP_ARGS(bh)
+);
+
 /**
  * block_dirty_buffer - mark a buffer dirty
  * @bh: buffer_head being dirtied


thanks,

- Joel


Re: [PATCH 0/3] [RFC] init, tracing: Add initcall trace events

2018-03-26 Thread Joel Fernandes (Google)
Hi Steve,

On Fri, Mar 23, 2018 at 8:02 AM, Steven Rostedt  wrote:
> A while ago we had a boot tracer. But it was eventually removed:
>  commit 30dbb20e68e6f ("tracing: Remove boot tracer").
>
> The rational was because there is already a initcall_debug boot option
> that causes printk()s of all the initcall functions.
>
> The problem with the initcall_debug option is that printk() is awfully slow,
> and makes it difficult to see the real impact of initcalls. Mainly because
> a single printk() is usually slower than most initcall functions.
>
> Instead of bringing back the boot tracer, adding trace events around the
> initcall functions, and even one to denote which level the initcall
> functions are being called from, adds the necessary information to
> analyze the initcalls without the high overhead of printk()s, that
> can substantially slow down the boot process.
>
> Another positive, is that the console initcall functions themselves
> can also be traced. The timestamps are not operational at that time
> but you can see which consoles are being registered. I saw this on
> one of my test boxes:
>
> -0 [000] ...1 0.00: initcall_level: level=console
> -0 [000] ...1 0.00: initcall_start: func=con_init+0x0/0x224
> -0 [000] ...1 0.00: initcall_finish: 
> func=con_init+0x0/0x224 ret=0
> -0 [000] ...1 0.00: initcall_start: 
> func=hvc_console_init+0x0/0x19
> -0 [000] ...1 0.00: initcall_finish: 
> func=hvc_console_init+0x0/0x19 ret=0
> -0 [000] ...1 0.00: initcall_start: 
> func=xen_cons_init+0x0/0x60
> -0 [000] ...1 0.00: initcall_finish: 
> func=xen_cons_init+0x0/0x60 ret=0
> -0 [000] ...1 0.00: initcall_start: 
> func=univ8250_console_init+0x0/0x2d
> -0 [000] ...1 0.00: initcall_finish: 
> func=univ8250_console_init+0x0/0x2d ret=0

Will this make initcall_debug not work if CONFIG_TRACEPOINTS is turned
off? It builds, but I think the initcall_debug feature will
fail; maybe CONFIG_TRACEPOINTS should be selected somewhere?

I recently ran into some issues like this for my preemptirq
tracepoints patch (which I will post again soon :D) where lockdep
needed the tracepoints and I had to select it.

thanks,

- Joel


Re: [RFC PATCH] tracepoint: Provide tracepoint_kernel_find_by_name

2018-03-27 Thread Joel Fernandes (Google)
On Tue, Mar 27, 2018 at 6:27 AM, Mathieu Desnoyers
 wrote:

>>> +static void find_tp(struct tracepoint *tp, void *priv)
>>> +{
>>> +   struct tp_find_args *args = priv;
>>> +
>>> +   if (!strcmp(tp->name, args->name)) {
>>> +   WARN_ON_ONCE(args->tp);
>>> +   args->tp = tp;
>>
>> I think this runtime check is not needed if it really can't happen
>> (linker verifies that already as you mentioned) although I'm not
>> opposed if you still want to keep it, because there's no way to break
>> out of the outer loop anyway so I guess your call..
>
> We can change the outer loop and break from it if needed, that's not
> an issue.
>
>> I would just do:
>>
>> if (args->tp)
>>return;
>>
>> if find_tp already found the tracepoint.
>>
>> Tried to also create a duplicate tracepoint and I get:
>>  CC  init/version.o
>>  AR  init/built-in.o
>>  AR  built-in.o
>>  LD  vmlinux.o
>> block/blk-core.o:(__tracepoints+0x440): multiple definition of
>> `__tracepoint_sched_switch'
>> kernel/sched/core.o:(__tracepoints+0x440): first defined here
>> Makefile:1032: recipe for target 'vmlinux' failed
>> make: *** [vmlinux] Error 1
>
> Yeah, as I state in my changelog, I'm very well aware that the linker
> is able to catch those. This was the purpose of emitting a
> __tracepoint_##name symbol in the tracepoint definition macro. This
> WARN_ON_ONCE() is a redundant check for an invariant that we expect
> is provided by the linker. Given that it's not a fast path, I would
> prefer to keep this redundant check in place, given that an average
> factor 2 speedup on a slow path should really not be something we
> need to optimize for. IMHO, for slow paths, robustness is more important
> than speed (unless the slow path becomes so slow that it really affects
> the user).
>
> I envision that a way to break this invariant would be to compile a
> kernel object with gcc -fvisibility=hidden or such. I admit this is
> particular scenario is really far fetched, but it illustrates why I
> prefer to keep an extra safety net at runtime for linker-based
> validations when possible.
>
> If a faster tracepoint lookup function is needed, we should consider
> perfect hashing algorithms done post-build, but so far nobody has
> complained about speed of this lookup operation. Anyhow a factor 2 in
> the speed of this lookup should really not matter, right ?

Yes, I agree with all the great reasons. So let's do it your way then.

thanks,

- Joel


Re: [PATCH] staging: android: ashmem: Fix possible deadlock in ashmem_ioctl

2018-03-19 Thread Joel Fernandes (Google)
On Tue, Feb 27, 2018 at 10:59 PM, Yisheng Xie  wrote:
> ashmem_mutex may create a chain of dependencies like:
>
> CPU0CPU1
>  mmap syscall   ioctl syscall
>  -> mmap_sem (acquired) -> ashmem_ioctl
>  -> ashmem_mmap-> ashmem_mutex (acquired)
> -> ashmem_mutex (try to acquire)   -> copy_from_user
>   -> mmap_sem (try to acquire)
>
> There is a lock ordering problem between mmap_sem and ashmem_mutex causing
> a lockdep splat[1] during a syzcaller test. This patch fixes the problem
> by moving copy_from_user out of ashmem_mutex.
>
> [1] https://www.spinics.net/lists/kernel/msg2733200.html
>
> Fixes: ce8a3a9e76d0 (staging: android: ashmem: Fix a race condition in pin 
> ioctls)
> Reported-by: syzbot+d7a918a7a8e1c952b...@syzkaller.appspotmail.com
> Signed-off-by: Yisheng Xie 

Greg,
Could you take this patch for the stable trees? I do see it in staging
already. I couldn't find it in stable so wanted to bring it to your
attention. If you already aware of it, please ignore my note.

Thanks,
- Joel


[PATCH v7 4/6] trace/irqsoff: Split reset into separate functions

2018-05-15 Thread Joel Fernandes (Google)
Split the reset functions into separate functions in preparation
for future patches that need to do tracer-specific reset.

Cc: Steven Rostedt 
Cc: Peter Zilstra 
Cc: Ingo Molnar 
Cc: Mathieu Desnoyers 
Cc: Tom Zanussi 
Cc: Namhyung Kim 
Cc: Thomas Glexiner 
Cc: Boqun Feng 
Cc: Paul McKenney 
Cc: Frederic Weisbecker 
Cc: Randy Dunlap 
Cc: Masami Hiramatsu 
Cc: Fenguang Wu 
Cc: Baohong Liu 
Cc: Vedang Patel 
Cc: kernel-t...@android.com
Signed-off-by: Joel Fernandes (Google) 
---
 kernel/trace/trace_irqsoff.c | 22 +++---
 1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c
index 03ecb4465ee4..f8daa754cce2 100644
--- a/kernel/trace/trace_irqsoff.c
+++ b/kernel/trace/trace_irqsoff.c
@@ -634,7 +634,7 @@ static int __irqsoff_tracer_init(struct trace_array *tr)
return 0;
 }
 
-static void irqsoff_tracer_reset(struct trace_array *tr)
+static void __irqsoff_tracer_reset(struct trace_array *tr)
 {
int lat_flag = save_flags & TRACE_ITER_LATENCY_FMT;
int overwrite_flag = save_flags & TRACE_ITER_OVERWRITE;
@@ -665,6 +665,12 @@ static int irqsoff_tracer_init(struct trace_array *tr)
 
return __irqsoff_tracer_init(tr);
 }
+
+static void irqsoff_tracer_reset(struct trace_array *tr)
+{
+   __irqsoff_tracer_reset(tr);
+}
+
 static struct tracer irqsoff_tracer __read_mostly =
 {
.name   = "irqsoff",
@@ -697,11 +703,16 @@ static int preemptoff_tracer_init(struct trace_array *tr)
return __irqsoff_tracer_init(tr);
 }
 
+static void preemptoff_tracer_reset(struct trace_array *tr)
+{
+   __irqsoff_tracer_reset(tr);
+}
+
 static struct tracer preemptoff_tracer __read_mostly =
 {
.name   = "preemptoff",
.init   = preemptoff_tracer_init,
-   .reset  = irqsoff_tracer_reset,
+   .reset  = preemptoff_tracer_reset,
.start  = irqsoff_tracer_start,
.stop   = irqsoff_tracer_stop,
.print_max  = true,
@@ -731,11 +742,16 @@ static int preemptirqsoff_tracer_init(struct trace_array 
*tr)
return __irqsoff_tracer_init(tr);
 }
 
+static void preemptirqsoff_tracer_reset(struct trace_array *tr)
+{
+   __irqsoff_tracer_reset(tr);
+}
+
 static struct tracer preemptirqsoff_tracer __read_mostly =
 {
.name   = "preemptirqsoff",
.init   = preemptirqsoff_tracer_init,
-   .reset  = irqsoff_tracer_reset,
+   .reset  = preemptirqsoff_tracer_reset,
.start  = irqsoff_tracer_start,
.stop   = irqsoff_tracer_stop,
.print_max  = true,
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH v7 2/6] srcu: Add notrace variants of srcu_read_{lock,unlock}

2018-05-15 Thread Joel Fernandes (Google)
From: Paul McKenney 

This is needed for a future tracepoint patch that uses srcu, and to make
sure it doesn't call into lockdep.

Tracepoint code already calls the notrace variants of rcu_read_lock_sched,
so this patch does the same for SRCU, which will be used in a later
patch. This keeps it consistent with rcu-sched.

[Joel: Added commit message]

Cc: Steven Rostedt 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Mathieu Desnoyers 
Cc: Tom Zanussi 
Cc: Namhyung Kim 
Cc: Thomas Gleixner 
Cc: Boqun Feng 
Cc: Paul McKenney 
Cc: Frederic Weisbecker 
Cc: Randy Dunlap 
Cc: Masami Hiramatsu 
Cc: Fengguang Wu 
Cc: Baohong Liu 
Cc: Vedang Patel 
Cc: kernel-t...@android.com
Reviewed-by: Steven Rostedt (VMware) 
Signed-off-by: Paul McKenney 
Signed-off-by: Joel Fernandes (Google) 
---
 include/linux/srcu.h | 17 +
 1 file changed, 17 insertions(+)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index 33c1c698df09..2ec618979b20 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -161,6 +161,16 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp)
return retval;
 }
 
+/* Used by tracing, cannot be traced and cannot invoke lockdep. */
+static inline notrace int
+srcu_read_lock_notrace(struct srcu_struct *sp) __acquires(sp)
+{
+   int retval;
+
+   retval = __srcu_read_lock(sp);
+   return retval;
+}
+
 /**
  * srcu_read_unlock - unregister a old reader from an SRCU-protected structure.
  * @sp: srcu_struct in which to unregister the old reader.
@@ -175,6 +185,13 @@ static inline void srcu_read_unlock(struct srcu_struct *sp, int idx)
__srcu_read_unlock(sp, idx);
 }
 
+/* Used by tracing, cannot be traced and cannot call lockdep. */
+static inline notrace void
+srcu_read_unlock_notrace(struct srcu_struct *sp, int idx) __releases(sp)
+{
+   __srcu_read_unlock(sp, idx);
+}
+
 /**
  * smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock
  *
-- 
2.17.0.441.gb46fe60e1d-goog
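As a usage illustration, here is a minimal sketch of how the notrace variants might be called from code that must itself stay invisible to tracing and lockdep; the srcu_struct and the probe loop are hypothetical, not the actual tracepoint implementation:

#include <linux/srcu.h>

DEFINE_STATIC_SRCU(example_srcu);	/* hypothetical srcu_struct */

static void notrace example_call_probes(void (**funcs)(void))
{
	int idx;

	/* The notrace variants avoid function tracing and lockdep, so this
	 * can safely run from within the tracing infrastructure itself. */
	idx = srcu_read_lock_notrace(&example_srcu);

	for (; funcs && *funcs; funcs++)
		(*funcs)();

	srcu_read_unlock_notrace(&example_srcu, idx);
}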



[PATCH v7 1/6] softirq: reorder trace_softirqs_on to prevent lockdep splat

2018-05-15 Thread Joel Fernandes (Google)
I'm able to reproduce a lockdep splat with config options:
CONFIG_PROVE_LOCKING=y,
CONFIG_DEBUG_LOCK_ALLOC=y and
CONFIG_PREEMPTIRQ_EVENTS=y

$ echo 1 > /d/tracing/events/preemptirq/preempt_enable/enable
---
 kernel/softirq.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 177de3640c78..8a040bcaa033 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -139,9 +139,13 @@ static void __local_bh_enable(unsigned int cnt)
 {
lockdep_assert_irqs_disabled();
 
+   if (preempt_count() == cnt)
+   trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip());
+
if (softirq_count() == (cnt & SOFTIRQ_MASK))
trace_softirqs_on(_RET_IP_);
-   preempt_count_sub(cnt);
+
+   __preempt_count_sub(cnt);
 }
 
 /*
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH] rcu: Add comment documenting how rcu_seq_snap works

2018-05-11 Thread Joel Fernandes (Google)
rcu_seq_snap may be tricky for someone looking at it for the first time.
Let's document how it works with an example to make it easier to follow.

Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/rcu.h | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 003671825d62..004ace3d22c2 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -91,7 +91,28 @@ static inline void rcu_seq_end(unsigned long *sp)
WRITE_ONCE(*sp, rcu_seq_endval(sp));
 }
 
-/* Take a snapshot of the update side's sequence number. */
+/*
+ * Take a snapshot of the update side's sequence number.
+ *
+ * This function predicts what the grace period number will be the next
+ * time an RCU callback will be executed, given the current grace period's
+ * number. This can be gp+1 if RCU is idle, or gp+2 if a grace period is
+ * already in progress.
+ *
+ * We do this with a single addition and masking.
+ * For example, if RCU_SEQ_STATE_MASK=1 and the least significant bit (LSB) of
+ * the seq is used to track if a GP is in progress or not, its sufficient if we
+ * add (2+1) and mask with ~1. Lets see why with an example:
+ *
+ * Say the current seq is 6 which is 0x110 (gp is 3 and state bit is 0).
+ * To get the next GP number, we have to atleast add 0x10 to this (0x1 << 1) to
+ * account for the state bit. However, if the current seq is 7 (GP num is 3
+ * and state bit is 1), then it means the current grace period is already
+ * in progress so the next the callback will run is at gp+2. To account for
+ * the extra +1, we just overflow the LSB by adding another 0x1 and masking
+ * with ~0x1. Incase no GP was in progress (RCU is idle), then the adding
+ * by 0x1 and masking will have no effect. This is calculated as below.
+ */
 static inline unsigned long rcu_seq_snap(unsigned long *sp)
 {
unsigned long s;
-- 
2.17.0.441.gb46fe60e1d-goog


[PATCH] rcu: trace: Remove Startedleaf from trace events comment

2018-05-11 Thread Joel Fernandes (Google)
As part of the gp_seq cleanup, the Startedleaf condition doesn't occur
anymore. Remove it from the comment in the trace events file.

Signed-off-by: Joel Fernandes (Google) 
---
 include/trace/events/rcu.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index ce9d1a1cac78..6d8dd04912d2 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -91,7 +91,6 @@ TRACE_EVENT(rcu_grace_period,
  *
  * "Startleaf": Request a grace period based on leaf-node data.
  * "Prestarted": Someone beat us to the request
- * "Startedleaf": Leaf-node start proved sufficient.
  * "Startedleafroot": Leaf-node start proved sufficient after checking root.
  * "Startedroot": Requested a nocb grace period based on root-node data.
  * "NoGPkthread": The RCU grace-period kthread has not yet started.
-- 
2.17.0.441.gb46fe60e1d-goog


[PATCH v2] rcu: Add comment documenting how rcu_seq_snap works

2018-05-11 Thread Joel Fernandes (Google)
rcu_seq_snap may be tricky for someone looking at it for the first time.
Let's document how it works with an example to make it easier to follow.

Signed-off-by: Joel Fernandes (Google) 
---
v2 changes: Corrections as suggested by Randy.

 kernel/rcu/rcu.h | 24 +++-
 1 file changed, 23 insertions(+), 1 deletion(-)

diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h
index 003671825d62..533bc1087371 100644
--- a/kernel/rcu/rcu.h
+++ b/kernel/rcu/rcu.h
@@ -91,7 +91,29 @@ static inline void rcu_seq_end(unsigned long *sp)
WRITE_ONCE(*sp, rcu_seq_endval(sp));
 }
 
-/* Take a snapshot of the update side's sequence number. */
+/*
+ * Take a snapshot of the update side's sequence number.
+ *
+ * This function predicts what the grace period number will be the next
+ * time an RCU callback will be executed, given the current grace period's
+ * number. This can be gp+1 if RCU is idle, or gp+2 if a grace period is
+ * already in progress.
+ *
+ * We do this with a single addition and masking.
+ * For example, if RCU_SEQ_STATE_MASK=1 and the least significant bit (LSB) of
+ * the seq is used to track if a GP is in progress or not, it's sufficient if we
+ * add (2+1) and mask with ~1. Let's see why with an example:
+ *
+ * Say the current seq is 6 which is 0b110 (gp is 3 and state bit is 0).
+ * To get the next GP number, we have to at least add 0b10 to this (0x1 << 1)
+ * to account for the state bit. However, if the current seq is 7 (gp is 3 and
+ * state bit is 1), then it means the current grace period is already in
+ * progress so the next time the callback will run is at the end of grace
+ * period number gp+2. To account for the extra +1, we just overflow the LSB by
+ * adding another 0x1 and masking with ~0x1. In case no GP was in progress (RCU
+ * is idle), then the addition of the extra 0x1 and masking will have no
+ * effect. This is calculated as below.
+ */
 static inline unsigned long rcu_seq_snap(unsigned long *sp)
 {
unsigned long s;
-- 
2.17.0.441.gb46fe60e1d-goog
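To make the add-and-mask step above concrete, here is a small user-space sketch using the single-state-bit example from the comment (the mask value comes from that example, not necessarily from the kernel's real RCU_SEQ_STATE_MASK):

#include <stdio.h>

#define EXAMPLE_SEQ_STATE_MASK 0x1UL	/* one state bit, as in the comment's example */

/* Mirrors the "add (2+1) and mask with ~1" step described above. */
static unsigned long example_seq_snap(unsigned long seq)
{
	return (seq + 2 * EXAMPLE_SEQ_STATE_MASK + 1) & ~EXAMPLE_SEQ_STATE_MASK;
}

int main(void)
{
	/* seq 6 = 0b110: gp 3, RCU idle -> callback runs after gp 4 (seq 8). */
	printf("snap(6) = %lu\n", example_seq_snap(6));
	/* seq 7 = 0b111: gp 3 in progress -> callback runs after gp 5 (seq 10). */
	printf("snap(7) = %lu\n", example_seq_snap(7));
	return 0;
}

This prints snap(6) = 8 and snap(7) = 10, matching the gp+1 and gp+2 cases described in the comment.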


[PATCH RFC 8/8] rcu: Fix cpustart tracepoint gp_seq number

2018-05-13 Thread Joel Fernandes (Google)
The cpustart tracepoint shows a stale gp_seq. This is because rdp->gp_seq
is updated only at the end of the __note_gp_changes() function. For this
reason, use rnp->gp_seq instead. I believe we can't update rdp->gp_seq any
earlier, so let's just use the gp_seq from rnp instead.

Signed-off-by: Joel Fernandes (Google) 
---
 kernel/rcu/tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 25c44328d071..58d2b68f8b98 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1807,7 +1807,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp,
 * set up to detect a quiescent state, otherwise don't
 * go looking for one.
 */
-   trace_rcu_grace_period(rsp->name, rdp->gp_seq, TPS("cpustart"));
+   trace_rcu_grace_period(rsp->name, rnp->gp_seq, TPS("cpustart"));
need_gp = !!(rnp->qsmask & rdp->grpmask);
rdp->cpu_no_qs.b.norm = need_gp;
rdp->rcu_qs_ctr_snap = __this_cpu_read(rcu_dynticks.rcu_qs_ctr);
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC 6/8] rcu: Add back the Startedleaf tracepoint

2018-05-13 Thread Joel Fernandes (Google)
In a recent discussion [1], the check for whether a leaf believes RCU is
not idle is being added back to the funnel locking code to avoid more
locking. In this case we mark the leaf node for a future grace period and
bail out since a GP is currently in progress. However, the tracepoint is
missing. Let's add it back.

Also add a small comment about why we do this check (basically, the point
is to avoid locking intermediate nodes unnecessarily) and clarify the
comments in the trace event header now that we may traverse one or more
intermediate nodes.

[1] http://lkml.kernel.org/r/20180513190906.gl26...@linux.vnet.ibm.com

Signed-off-by: Joel Fernandes (Google) 
---
 include/trace/events/rcu.h |  4 ++--
 kernel/rcu/tree.c  | 11 ++-
 2 files changed, 12 insertions(+), 3 deletions(-)

diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h
index 539900a9f8c7..dc0bd11739c7 100644
--- a/include/trace/events/rcu.h
+++ b/include/trace/events/rcu.h
@@ -91,8 +91,8 @@ TRACE_EVENT(rcu_grace_period,
  *
  * "Startleaf": Request a grace period based on leaf-node data.
  * "Prestarted": Someone beat us to the request
- * "Startedleaf": Leaf-node start proved sufficient.
- * "Startedleafroot": Leaf-node start proved sufficient after checking root.
+ * "Startedleaf": Leaf and one or more non-root nodes marked for future start.
+ * "Startedleafroot": all non-root nodes from leaf to root marked for future 
start.
  * "Startedroot": Requested a nocb grace period based on root-node data.
  * "NoGPkthread": The RCU grace-period kthread has not yet started.
  * "StartWait": Start waiting for the requested grace period.
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 40670047d22c..8401a253e7de 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1593,8 +1593,17 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp,
goto unlock_out;
}
rnp_node->gp_seq_needed = gp_seq_start;
-   if (rcu_seq_state(rcu_seq_current(&rnp_node->gp_seq)))
+
+   /*
+    * Check if leaf believes a GP is in progress, if yes we can
+    * bail and avoid more locking. We have already marked the leaf.
+    */
+   if (rcu_seq_state(rcu_seq_current(&rnp_node->gp_seq))) {
+   trace_rcu_this_gp(rnp_node, rdp, gp_seq_start,
+ TPS("Startedleaf"));
goto unlock_out;
+   }
+
if (rnp_node != rnp && rnp_node->parent != NULL)
raw_spin_unlock_rcu_node(rnp_node);
if (!rnp_node->parent) {
-- 
2.17.0.441.gb46fe60e1d-goog



[PATCH RFC 0/8] rcu fixes, clean ups for rcu/dev

2018-05-13 Thread Joel Fernandes (Google)
Hi,
Here are some fixes, clean-ups, and code comment changes, mostly for the
new funnel locking, the gp_seq changes, and some tracing. It's based on
the latest rcu/dev branch.

thanks,

- Joel

Joel Fernandes (Google) (8):
  rcu: Add comment documenting how rcu_seq_snap works
  rcu: Clarify usage of cond_resched for tasks-RCU
  rcu: Add back the cpuend tracepoint
  rcu: Get rid of old c variable from places in tree RCU
  rcu: Use rcu_node as temporary variable in funnel locking loop
  rcu: Add back the Startedleaf tracepoint
  rcu: trace CleanupMore condition only if needed
  rcu: Fix cpustart tracepoint gp_seq number

 include/linux/rcupdate.h   | 11 +++--
 include/trace/events/rcu.h | 19 
 kernel/rcu/rcu.h   | 24 +-
 kernel/rcu/tree.c  | 92 +++---
 4 files changed, 97 insertions(+), 49 deletions(-)

-- 
2.17.0.441.gb46fe60e1d-goog


