Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
Hi Patrick, On Thu, Mar 23, 2017 at 3:32 AM, Patrick Bellasiwrote: [..] >> > which can be used to defined tunable root constraints when CGroups are >> > not available, and becomes RO when CGroups are. >> > >> > Can this be eventually an acceptable option? >> > >> > In any case I think that this feature will be mainly targeting CGroup >> > based systems. Indeed, one of the main goals is to collect >> > "application specific" information from "informed run-times". Being >> > "application specific" means that we need a way to classify >> > applications depending on the runtime context... and that capability >> > in Linux is ultimately provided via the CGroup interface. >> >> I think the concern raised is more about whether CGroups is the right >> interface to use for attaching capacity constraints to task or groups >> of tasks, or is there a better way to attach such constraints? > > Notice that CGroups based classification allows to easily enforce > the concept of "delegation containment". I think this feature should > be nice to have whatever interface we choose. > > However, potentially we can define a proper per-task API; are you > thinking to something specifically? > I was thinking how about adding per-task constraints to the resource limits API if it makes sense to? There's already RLIMIT_CPU and RLIMIT_NICE. An informed-runtime could then modify the limits of tasks using prlimit. >> The other advantage of such interface is we don't have to >> create a separate CGroup for every new constraint limit and can have >> several tasks with different unique constraints. > > That's still possible using CGroups and IMO it will not be the "most > common case". > Don't you think that in general we will need to set constraints at > applications level, thus group of tasks? Some applications could be a single task, also not all tasks in an application may need constraints right? > As a general rule we should probably go for an interface which makes > easy the most common case. I agree. Thanks, Joel
Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
Hi Tejun, >> That's also why the proposed interface has now been defined as a extension of >> the CPU controller in such a way to keep a consistent view. >> >> This controller is already used by run-times like Android to "scope" apps by >> constraining the amount of CPUs resource they are getting. >> Is that not a legitimate usage of the cpu controller? >> >> What we are doing here is just extending it a bit in such a way that, while: >> >> {cfs,rt}_{period,runtime}_us limits the amount of TIME we can use a CPU >> >> we can also use: >> >> capacity_{min,max} to limit the actual COMPUTATIONAL BANDWIDTH we can use >> during that time. > > Yes, we do have bandwidth restriction as a cgroup only feature, which > is different from how we handle nice levels and weights. Given the > nature of bandwidth limits, if necessary, it is straight-forward to > expose per-task interface. > > capacity min/max isn't the same thing. It isn't a limit on countable > units of a specific resource and that's why the interface you > suggested for .min is different. It's restricting attribute set which > can be picked in the subhierarchy rather than controlling distribution > of atoms of the resource. > > That's also why we're gonna have problem if we later decide we need a > thread based API for it. Once we make cgroup the primary owner of the > attribute, it's not straight forward to add another owner. Sorry I don't immediately see why it is not straight forward to have a per-task API later once CGroup interface is added? Maybe if you don't mind giving an example that will help? I can start with an example, say you have a single level hierarchy (Top-app in Android terms is the set of tasks that are user facing and we'd like to enforce some capacity minimums, background on the other hand is the opposite): ROOT (min = 0, max = 1024) / \ / \ TOP-APP (min = 200, max = 1024) BACKGROUND (min = 0, max = 500) If in the future, if we want to have a per-task API to individually configure the task with these limits, it seems it will be straight forward to implement IMO. As Patrick mentioned, all of the usecases needing this right now is an informed runtime placing a task in a group of tasks and not needing to set attributes for each individual task. We are already placing tasks in individual CGroups in Android based on the information the runtime has so adding in the capacity constraints will make it fit naturally while leaving the door open for any future per-task API additions IMO. Thanks, Joel
Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
Hi, On Mon, Mar 20, 2017 at 11:08 AM, Patrick Bellasiwrote: > On 20-Mar 13:15, Tejun Heo wrote: >> Hello, >> >> On Tue, Feb 28, 2017 at 02:38:38PM +, Patrick Bellasi wrote: [..] >> > These attributes: >> > a) are tunable at all hierarchy levels, i.e. root group too >> >> This usually is problematic because there should be a non-cgroup way >> of configuring the feature in case cgroup isn't configured or used, >> and it becomes awkward to have two separate mechanisms configuring the >> same thing. Maybe the feature is cgroup specific enough that it makes >> sense here but this needs more explanation / justification. > > In the previous proposal I used to expose global tunables under > procfs, e.g.: > > /proc/sys/kernel/sched_capacity_min > /proc/sys/kernel/sched_capacity_max > But then we would lose out on being able to attach capacity constraints to specific tasks or groups of tasks? > which can be used to defined tunable root constraints when CGroups are > not available, and becomes RO when CGroups are. > > Can this be eventually an acceptable option? > > In any case I think that this feature will be mainly targeting CGroup > based systems. Indeed, one of the main goals is to collect > "application specific" information from "informed run-times". Being > "application specific" means that we need a way to classify > applications depending on the runtime context... and that capability > in Linux is ultimately provided via the CGroup interface. I think the concern raised is more about whether CGroups is the right interface to use for attaching capacity constraints to task or groups of tasks, or is there a better way to attach such constraints? I am actually looking at a workload where its desirable to attach such constraints to only 1 thread or task, in this case it would be a bit overkill to use CGroups to attach such property just for 1 task with specific constraints and it would be beneficial that along with the CGroup interface, there's also an interface to attach it to individual tasks. The other advantage of such interface is we don't have to create a separate CGroup for every new constraint limit and can have several tasks with different unique constraints. Regards, Joel
Re: [RFC v3 1/5] sched/core: add capacity constraints to CPU controller
On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasiwrote: > The CPU CGroup controller allows to assign a specified (maximum) > bandwidth to tasks within a group, however it does not enforce any > constraint on how such bandwidth can be consumed. > With the integration of schedutil, the scheduler has now the proper > information about a task to select the most suitable frequency to > satisfy tasks needs. [..] > +static u64 cpu_capacity_min_read_u64(struct cgroup_subsys_state *css, > +struct cftype *cft) > +{ > + struct task_group *tg; > + u64 min_capacity; > + > + rcu_read_lock(); > + tg = css_tg(css); > + min_capacity = tg->cap_clamp[CAP_CLAMP_MIN]; Shouldn't the cap_clamp be accessed with READ_ONCE (and WRITE_ONCE in the write path) to avoid load-tearing? Thanks, Joel
Re: [RFC v3 5/5] sched/{core,cpufreq_schedutil}: add capacity clamping for RT/DL tasks
Hi Patrick, On Tue, Feb 28, 2017 at 6:38 AM, Patrick Bellasiwrote: > Currently schedutil enforce a maximum OPP when RT/DL tasks are RUNNABLE. > Such a mandatory policy can be made more tunable from userspace thus > allowing for example to define a reasonable max capacity (i.e. > frequency) which is required for the execution of a specific RT/DL > workload. This will contribute to make the RT class more "friendly" for > power/energy sensible applications. > > This patch extends the usage of capacity_{min,max} to the RT/DL classes. > Whenever a task in these classes is RUNNABLE, the capacity required is > defined by the constraints of the control group that task belongs to. > We briefly discussed this at Linaro Connect that this works well for sporadic RT tasks that run briefly and then sleep for long periods of time - so certainly this patch is good, but its only a partial solution to the problem of frequent and short-sleepers and something is required to keep the boost active for short non-RUNNABLE as well. The behavior with many periodic RT tasks is that they will sleep for short intervals and run for short intervals periodically. In this case removing the clamp (or the boost as in schedtune v2) on a dequeue will essentially mean during a narrow window cpufreq can drop the frequency and only to make it go back up again. Currently for schedtune v2, I am working on prototyping something like the following for Android: - if RT task is enqueue, introduce the boost. - When task is dequeued, start a timer for a "minimum deboost delay time" before taking out the boost. - If task is enqueued again before the timer fires, then cancel the timer. I don't think any "fix" to this particular issue should be to the schedutil governor and should be sorted before going to cpufreq itself (that is before making the request). What do you think about this? Thanks, Joel
Re: [Eas-dev] [PATCH V4 1/3] sched: cpufreq: Allow remote cpufreq callbacks
On Wed, Jul 26, 2017 at 10:50 PM, Viresh Kumar <viresh.ku...@linaro.org> wrote: > On 26-07-17, 22:34, Joel Fernandes (Google) wrote: >> On Wed, Jul 26, 2017 at 2:22 AM, Viresh Kumar <viresh.ku...@linaro.org> >> wrote: >> > @@ -221,7 +226,7 @@ static void sugov_update_single(struct >> > update_util_data *hook, u64 time, >> > sugov_set_iowait_boost(sg_cpu, time, flags); >> > sg_cpu->last_update = time; >> > >> > - if (!sugov_should_update_freq(sg_policy, time)) >> > + if (!sugov_should_update_freq(sg_policy, time, hook->cpu)) >> > return; >> >> Since with the remote callbacks now possible, isn't it unsafe to >> modify sg_cpu and sg_policy structures without a lock in >> sugov_update_single? >> >> Unlike sugov_update_shared, we don't acquire any lock in >> sugov_update_single before updating these structures. Did I miss >> something? > > As Peter already mentioned it earlier, the callbacks are called with > rq locks held and so sugov_update_single() wouldn't get called in > parallel for a target CPU. Ah ok, I have to catch up with that discussion since I missed the whole thing. Now that you will have me on CC, that shouldn't happen, thanks and sorry about the noise. > That's the only race you were worried about ? Yes. So then in that case, makes sense to move raw_spin_lock in sugov_update_shared further down? (Just discussing, this point is independent of your patch), Something like: diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 622eed1b7658..9a6c12fb2c16 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -295,8 +295,6 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time, sugov_get_util(, ); - raw_spin_lock(_policy->update_lock); - sg_cpu->util = util; sg_cpu->max = max; sg_cpu->flags = flags; @@ -304,6 +302,8 @@ static void sugov_update_shared(struct update_util_data *hook, u64 time, sugov_set_iowait_boost(sg_cpu, time, flags); sg_cpu->last_update = time; + raw_spin_lock(_policy->update_lock); + if (sugov_should_update_freq(sg_policy, time)) { if (flags & SCHED_CPUFREQ_RT_DL) next_f = sg_policy->policy->cpuinfo.max_freq; thanks, -Joel
Re: [Eas-dev] [PATCH V4 0/3] sched: cpufreq: Allow remote callbacks
Hi Viresh, On Wed, Jul 26, 2017 at 10:46 PM, Viresh Kumar <viresh.ku...@linaro.org> wrote: > On 26-07-17, 22:14, Joel Fernandes (Google) wrote: >> Also one more comment about this usecase: >> >> You mentioned in our discussion at [2] sometime back, about the >> question of initial utilization, >> >> "We don't have any such configurable way possible right >> now, but there were discussions on how much utilization should a new >> task be assigned when it first comes up." > > We still initialize it to a value, just that it isn't configurable. > See below.. > >> But, then in your cover letter above, you mentioned "This is verified >> using ftrace". So my question is how has this been verified with >> ftrace if the new initial utilization as you said in [2] is currently >> still under discussion? Basically how could you verify with ftrace >> that the target CPU frequency isn't increasing immediately on spawning >> of a new task remotely, if the initial utilization of a new task isn't >> something we set/configure with current code? Am I missing something? >> >> [2] https://lists.linaro.org/pipermail/eas-dev/2017-January/000785.html > > The statement "new tasks should receive maximum demand initially" is > used to represent tasks which have high demand every time they run. > For example scrolling of a web page or gallery on our phones. Yes, > maybe I can use the work "migrated" (as you suggested later) as the > history of its utilization will move with it then to the new CPU. > > But even without that, if you see the routine > init_entity_runnable_average() in fair.c, the new tasks are > initialized in a way that they are seen as heavy tasks. And so even > for the first time they run, freq should normally increase on the > target CPU (at least with above application).i Ok, but the "heavy" in init_entity_runnable_average means for load, not the util_avg. The util_avg is what's used for frequency scaling IIUC and is set to 0 in that function no? > > The application was written by Steve (all credit goes to him) before > he left Linaro, but I did test it with ftrace. What I saw with ftrace > was that the freq isn't reevaluated for almost an entire tick many > times because we enqueued the task remotely. And that changes with > this series. > >> > The reason being that this patchset only targets a corner case, where >> > following are required to be true to improve performance and that >> > doesn't happen too often with these tests: >> > >> > - Task is migrated to another CPU. >> > - The task has maximum demand initially, and should take the CPU to >> >> Just to make the cover-letter more clear and also confirming with you >> I understand the above usecase, maybe in the future this can reworded >> from "initially" to "before the migration" and "take the CPU" to "take >> the target CPU of the migration" ? > > I can reword it a bit, but the test case wasn't really migrating > anything and was looking only at the initial loads. Ok, I wasn't talking about the synthetic test in the second part of my email above but about the explanation you gave about Galleryfling improvement (that the migration of a task with high utilization doesn't update the target frequency) which makes sense to me so we are on the same page about that. thanks, -Joel
Re: [Eas-dev] [PATCH V4 0/3] sched: cpufreq: Allow remote callbacks
Hi Viresh, On Wed, Jul 26, 2017 at 2:22 AM, Viresh Kumarwrote: > > With Android UI and benchmarks the latency of cpufreq response to > certain scheduling events can become very critical. Currently, callbacks > into schedutil are only made from the scheduler if the target CPU of the > event is the same as the current CPU. This means there are certain > situations where a target CPU may not run schedutil for some time. > > One testcase to show this behavior is where a task starts running on > CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the > system is configured such that new tasks should receive maximum demand > initially, this should result in CPU0 increasing frequency immediately. > Because of the above mentioned limitation though this does not occur. > This is verified using ftrace with the sample [1] application. I think you dropped [1] in your cover-letter. May be you meant to add it at the end of the cover letter? I noticed from your v2 that its: https://pastebin.com/7LkMSRxE Also one more comment about this usecase: You mentioned in our discussion at [2] sometime back, about the question of initial utilization, "We don't have any such configurable way possible right now, but there were discussions on how much utilization should a new task be assigned when it first comes up." But, then in your cover letter above, you mentioned "This is verified using ftrace". So my question is how has this been verified with ftrace if the new initial utilization as you said in [2] is currently still under discussion? Basically how could you verify with ftrace that the target CPU frequency isn't increasing immediately on spawning of a new task remotely, if the initial utilization of a new task isn't something we set/configure with current code? Am I missing something? [2] https://lists.linaro.org/pipermail/eas-dev/2017-January/000785.html > > Maybe the ideal solution is to always allow remote callbacks but that > has its own challenges: > > o There is no protection required for single CPU per policy case today, > and adding any kind of locking there, to supply remote callbacks, > isn't really a good idea. > > o If is local CPU isn't part of the same cpufreq policy as the target > CPU, then we wouldn't be able to do fast switching at all and have to > use some kind of bottom half to schedule work on the target CPU to do > real switching. That may be overkill as well. > > > And so this series only allows remote callbacks for target CPUs that > share the cpufreq policy with the local CPU. > > This series is tested with couple of usecases (Android: hackbench, > recentfling, galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey > board (64 bit octa-core, single policy). Only galleryfling showed minor > improvements, while others didn't had much deviation. > > The reason being that this patchset only targets a corner case, where > following are required to be true to improve performance and that > doesn't happen too often with these tests: > > - Task is migrated to another CPU. > - The task has maximum demand initially, and should take the CPU to Just to make the cover-letter more clear and also confirming with you I understand the above usecase, maybe in the future this can reworded from "initially" to "before the migration" and "take the CPU" to "take the target CPU of the migration" ? > higher OPPs. > - And the target CPU doesn't call into schedutil until the next tick. I found this usecase to be more plausible and can see this patch series being useful there. Could you also keep me in CC on these patches (at joe...@google.com)? I'm interested in this series. thanks! -Joel
Re: [Eas-dev] [PATCH V4 1/3] sched: cpufreq: Allow remote cpufreq callbacks
Hi Viresh, On Wed, Jul 26, 2017 at 2:22 AM, Viresh Kumarwrote: > We do not call cpufreq callbacks from scheduler core for remote > (non-local) CPUs currently. But there are cases where such remote > callbacks are useful, specially in the case of shared cpufreq policies. > > This patch updates the scheduler core to call the cpufreq callbacks for > remote CPUs as well. > > For now, all the registered utilization update callbacks are updated to > return early if remote callback is detected. That is, this patch just > moves the decision making down in the hierarchy. > > Later patches would enable remote callbacks for shared policies. > > Based on initial work from Steve Muckle. > > Signed-off-by: Viresh Kumar > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -72,10 +72,15 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > > / Governor internals ***/ > > -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 > time) > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 > time, > +int target_cpu) > { > s64 delta_ns; > > + /* Don't allow remote callbacks */ > + if (smp_processor_id() != target_cpu) > + return false; > + > if (sg_policy->work_in_progress) > return false; > > @@ -221,7 +226,7 @@ static void sugov_update_single(struct update_util_data > *hook, u64 time, > sugov_set_iowait_boost(sg_cpu, time, flags); > sg_cpu->last_update = time; > > - if (!sugov_should_update_freq(sg_policy, time)) > + if (!sugov_should_update_freq(sg_policy, time, hook->cpu)) > return; Since with the remote callbacks now possible, isn't it unsafe to modify sg_cpu and sg_policy structures without a lock in sugov_update_single? Unlike sugov_update_shared, we don't acquire any lock in sugov_update_single before updating these structures. Did I miss something? thanks, -Joel
Re: [Eas-dev] [PATCH V4 2/3] cpufreq: schedutil: Process remote callback for shared policies
On Wed, Jul 26, 2017 at 2:22 AM, Viresh Kumarwrote: > This patch updates the schedutil governor to process cpufreq utilization > update hooks called for remote CPUs where the remote CPU is managed by > the cpufreq policy of the local CPU. > > Based on initial work from Steve Muckle. > > Signed-off-by: Viresh Kumar Reviewed-by: Joel Fernandes thanks, -Joel > --- > kernel/sched/cpufreq_schedutil.c | 21 ++--- > 1 file changed, 10 insertions(+), 11 deletions(-) > > diff --git a/kernel/sched/cpufreq_schedutil.c > b/kernel/sched/cpufreq_schedutil.c > index bb834747e49b..c3baf70d360c 100644 > --- a/kernel/sched/cpufreq_schedutil.c > +++ b/kernel/sched/cpufreq_schedutil.c > @@ -72,13 +72,12 @@ static DEFINE_PER_CPU(struct sugov_cpu, sugov_cpu); > > / Governor internals ***/ > > -static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 > time, > -int target_cpu) > +static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 > time) > { > s64 delta_ns; > > - /* Don't allow remote callbacks */ > - if (smp_processor_id() != target_cpu) > + /* Allow remote callbacks only on the CPUs sharing cpufreq policy */ > + if (!cpumask_test_cpu(smp_processor_id(), sg_policy->policy->cpus)) > return false; > > if (sg_policy->work_in_progress) > @@ -159,12 +158,12 @@ static unsigned int get_next_freq(struct sugov_policy > *sg_policy, > return cpufreq_driver_resolve_freq(policy, freq); > } > > -static void sugov_get_util(unsigned long *util, unsigned long *max) > +static void sugov_get_util(unsigned long *util, unsigned long *max, int cpu) > { > - struct rq *rq = this_rq(); > + struct rq *rq = cpu_rq(cpu); > unsigned long cfs_max; > > - cfs_max = arch_scale_cpu_capacity(NULL, smp_processor_id()); > + cfs_max = arch_scale_cpu_capacity(NULL, cpu); > > *util = min(rq->cfs.avg.util_avg, cfs_max); > *max = cfs_max; > @@ -226,7 +225,7 @@ static void sugov_update_single(struct update_util_data > *hook, u64 time, > sugov_set_iowait_boost(sg_cpu, time, flags); > sg_cpu->last_update = time; > > - if (!sugov_should_update_freq(sg_policy, time, hook->cpu)) > + if (!sugov_should_update_freq(sg_policy, time)) > return; > > busy = sugov_cpu_is_busy(sg_cpu); > @@ -234,7 +233,7 @@ static void sugov_update_single(struct update_util_data > *hook, u64 time, > if (flags & SCHED_CPUFREQ_RT_DL) { > next_f = policy->cpuinfo.max_freq; > } else { > - sugov_get_util(, ); > + sugov_get_util(, , hook->cpu); > sugov_iowait_boost(sg_cpu, , ); > next_f = get_next_freq(sg_policy, util, max); > /* > @@ -295,7 +294,7 @@ static void sugov_update_shared(struct update_util_data > *hook, u64 time, > unsigned long util, max; > unsigned int next_f; > > - sugov_get_util(, ); > + sugov_get_util(, , hook->cpu); > > raw_spin_lock(_policy->update_lock); > > @@ -306,7 +305,7 @@ static void sugov_update_shared(struct update_util_data > *hook, u64 time, > sugov_set_iowait_boost(sg_cpu, time, flags); > sg_cpu->last_update = time; > > - if (sugov_should_update_freq(sg_policy, time, hook->cpu)) { > + if (sugov_should_update_freq(sg_policy, time)) { > if (flags & SCHED_CPUFREQ_RT_DL) > next_f = sg_policy->policy->cpuinfo.max_freq;
Re: [PATCH 00/32] tracing: Inter-event (e.g. latency) support
Hi Tom, Nice series and nice ELC talk as well. Thanks. On Mon, Jun 26, 2017 at 3:49 PM, Tom Zanussiwrote: > This patchset adds support for 'inter-event' quantities to the trace > event subsystem. The most important example of inter-event quantities > are latencies, or the time differences between two events. I tried your patches out and they are working fine for me. I will test them out more. I think for the wakeup latency in your examples, its better / more accurate to use sched_waking instead of sched_wakeup tracepoint? [1] Also, one other comment I had is, it would be nice to suppress the output of individual trace events that are part of the synthetic event into the trace buffer. Otherwise I feel the value of it is slightly diminished - because one can simply post-process the individual non-synthetic trace events themselves to get wake up latencies which the synthetic events is trying to calculate in the first place. Inorder to conserve space, if a user doesn't care about individual events, and just the synthetic events then the individual ones shouldn't be written to the trace buffer IMO. -Joel [1] commit fbd705a0c6184580d0e2fbcbd47a37b6e5822511 (sched: Introduce the 'trace_sched_waking' tracepoint)
Re: [PATCH 04/32] ring-buffer: Redefine the unimplemented RINGBUF_TIME_TIME_STAMP
On Mon, Jun 26, 2017 at 3:49 PM, Tom Zanussiwrote: > RINGBUF_TYPE_TIME_STAMP is defined but not used, and from what I can > gather was reserved for something like an absolute timestamp feature > for the ring buffer, if not a complete replacement of the current > time_delta scheme. > > This code redefines RINGBUF_TYPE_TIME_STAMP to implement absolute time > stamps. Another way to look at it is that it essentially forces > extended time_deltas for all events. > > The motivation for doing this is to enable time_deltas that aren't > dependent on previous events in the ring buffer, making it feasible to > use the ring_buffer_event timetamps in a more random-access way, for > purposes other than serial event printing. > > To set/reset this mode, use tracing_set_timestamp_abs() from the > previous interface patch. > > Signed-off-by: Tom Zanussi > --- > include/linux/ring_buffer.h | 12 ++--- > kernel/trace/ring_buffer.c | 107 > +++- > 2 files changed, 83 insertions(+), 36 deletions(-) > > diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h > index 28e3472..74bc276 100644 > --- a/include/linux/ring_buffer.h > +++ b/include/linux/ring_buffer.h > @@ -36,10 +36,12 @@ struct ring_buffer_event { > * array[0] = time delta (28 .. 59) > * size = 8 bytes > * > - * @RINGBUF_TYPE_TIME_STAMP: Sync time stamp with external clock > - * array[0]= tv_nsec > - * array[1..2] = tv_sec > - * size = 16 bytes > + * @RINGBUF_TYPE_TIME_STAMP: Absolute timestamp > + * Same format as TIME_EXTEND except that the > + * value is an absolute timestamp, not a delta > + * event.time_delta contains bottom 27 bits > + * array[0] = top (28 .. 59) bits > + * size = 8 bytes > * Let's also change it here in ring_buffer.c? enum { RB_LEN_TIME_EXTEND = 8, RB_LEN_TIME_STAMP = 16,<- change to 8 }; > * <= @RINGBUF_TYPE_DATA_TYPE_LEN_MAX: > * Data record > @@ -56,12 +58,12 @@ enum ring_buffer_type { > RINGBUF_TYPE_DATA_TYPE_LEN_MAX = 28, > RINGBUF_TYPE_PADDING, > RINGBUF_TYPE_TIME_EXTEND, > - /* FIXME: RINGBUF_TYPE_TIME_STAMP not implemented */ > RINGBUF_TYPE_TIME_STAMP, > }; > > unsigned ring_buffer_event_length(struct ring_buffer_event *event); > void *ring_buffer_event_data(struct ring_buffer_event *event); > +u64 ring_buffer_event_time_stamp(struct ring_buffer_event *event); > > /* > * ring_buffer_discard_commit will remove an event that has not > diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c > index f544738..df848f0 100644 > --- a/kernel/trace/ring_buffer.c > +++ b/kernel/trace/ring_buffer.c > @@ -42,6 +42,8 @@ int ring_buffer_print_entry_header(struct trace_seq *s) > RINGBUF_TYPE_PADDING); > trace_seq_printf(s, "\ttime_extend : type == %d\n", > RINGBUF_TYPE_TIME_EXTEND); > + trace_seq_printf(s, "\ttime_stamp : type == %d\n", > +RINGBUF_TYPE_TIME_STAMP); > trace_seq_printf(s, "\tdata max type_len == %d\n", > RINGBUF_TYPE_DATA_TYPE_LEN_MAX); > > @@ -147,6 +149,9 @@ enum { > #define skip_time_extend(event) \ > ((struct ring_buffer_event *)((char *)event + RB_LEN_TIME_EXTEND)) > > +#define extended_time(event) \ > + (event->type_len >= RINGBUF_TYPE_TIME_EXTEND) > + > static inline int rb_null_event(struct ring_buffer_event *event) > { > return event->type_len == RINGBUF_TYPE_PADDING && !event->time_delta; > @@ -187,10 +192,8 @@ static void rb_event_set_padding(struct > ring_buffer_event *event) > return event->array[0] + RB_EVNT_HDR_SIZE; > > case RINGBUF_TYPE_TIME_EXTEND: > - return RB_LEN_TIME_EXTEND; > - > case RINGBUF_TYPE_TIME_STAMP: > - return RB_LEN_TIME_STAMP; > + return RB_LEN_TIME_EXTEND; And then this change can be dropped. Or a new single enum can be defined for both TIME_EXTEND and TIME_STAMP and then used. thanks, -Joel
Re: [Eas-dev] [PATCH V4 1/3] sched: cpufreq: Allow remote cpufreq callbacks
On Thu, Jul 27, 2017 at 12:14 AM, Viresh Kumar <viresh.ku...@linaro.org> wrote: > On 26-07-17, 23:13, Joel Fernandes (Google) wrote: >> On Wed, Jul 26, 2017 at 10:50 PM, Viresh Kumar <viresh.ku...@linaro.org> >> wrote: >> > On 26-07-17, 22:34, Joel Fernandes (Google) wrote: >> >> On Wed, Jul 26, 2017 at 2:22 AM, Viresh Kumar <viresh.ku...@linaro.org> >> >> wrote: >> >> > @@ -221,7 +226,7 @@ static void sugov_update_single(struct >> >> > update_util_data *hook, u64 time, >> >> > sugov_set_iowait_boost(sg_cpu, time, flags); >> >> > sg_cpu->last_update = time; >> >> > >> >> > - if (!sugov_should_update_freq(sg_policy, time)) >> >> > + if (!sugov_should_update_freq(sg_policy, time, hook->cpu)) >> >> > return; >> >> >> >> Since with the remote callbacks now possible, isn't it unsafe to >> >> modify sg_cpu and sg_policy structures without a lock in >> >> sugov_update_single? >> >> >> >> Unlike sugov_update_shared, we don't acquire any lock in >> >> sugov_update_single before updating these structures. Did I miss >> >> something? >> > >> > As Peter already mentioned it earlier, the callbacks are called with >> > rq locks held and so sugov_update_single() wouldn't get called in >> > parallel for a target CPU. >> >> Ah ok, I have to catch up with that discussion since I missed the >> whole thing. Now that you will have me on CC, that shouldn't happen, >> thanks and sorry about the noise. >> >> > That's the only race you were worried about ? >> >> Yes. So then in that case, makes sense to move raw_spin_lock in >> sugov_update_shared further down? (Just discussing, this point is >> independent of your patch), Something like: > > Even that was discussed tomorrow with Peter :) > > No it wouldn't work because sg_cpu->util we are updating here may be > getting read from some other cpu that shares policy with sg_cpu. > Ok. yes you are right :) thank you Viresh and Peter for the clarification. thanks, -Joel
Re: [Eas-dev] [PATCH V4 0/3] sched: cpufreq: Allow remote callbacks
On Thu, Jul 27, 2017 at 12:21 AM, Juri Lelliwrote: [..] >> > >> > But even without that, if you see the routine >> > init_entity_runnable_average() in fair.c, the new tasks are >> > initialized in a way that they are seen as heavy tasks. And so even >> > for the first time they run, freq should normally increase on the >> > target CPU (at least with above application).i >> >> Ok, but the "heavy" in init_entity_runnable_average means for load, >> not the util_avg. The util_avg is what's used for frequency scaling >> IIUC and is set to 0 in that function no? >> > > True for init_entity_runnable_average(), but for new task post_init_ > entity_util_avg() is then also called (from wake_up_new_task()), which > modifies the initial util_avg value (depending on current rq {util, > load}_avg. Got it. That makes sense, thank you! -Joel
Re: [Eas-dev] [PATCH V3 1/3] sched: cpufreq: Allow remote cpufreq callbacks
On Thu, Jul 27, 2017 at 12:55 PM, Saravana Kannanwrote: > On 07/26/2017 08:30 PM, Viresh Kumar wrote: >> >> On 26-07-17, 14:00, Saravana Kannan wrote: >>> >>> No, the alternative is to pass it on to the CPU freq driver and let it >>> decide what it wants to do. That's the whole point if having a CPU freq >>> driver -- so that the generic code doesn't need to care about HW specific >>> details. Which is the point I was making in an earlier email to Viresh's >>> patch -- we shouldn't be doing any CPU check for the call backs at the >>> scheduler or ever governor level. >>> >>> That would simplify this whole thing by deleting a bunch of code. And >>> having >>> much simpler checks in those drivers that actually have to deal with >>> their >>> HW specific details. >> >> >> So what you are saying is that we go and update (almost) every cpufreq >> driver we have today and make their ->target() callbacks return early >> if they don't support switching frequency remotely ? Is that really >> simplifying anything? > > > Yes. Simplifying isn't always about number of lines of code. It's also about > abstraction. Having generic scheduler code care about HW details doesn't > seem nice. > > It'll literally one simple check (cpu == smp_processor_id()) or (cpu "in" > policy->cpus). > I think we can have both approaches? So we query the driver some time around sugov_should_update_freq (with a new driver callback?) and ask it if it has any say over the default behavior of "can't update remote CPU if I'm not a part of its policy" and use that over the default if it hasn't defined it in their struct cpufreq_driver. I think this will also not have the concern of "updating every driver", then we can just stick to the sane default of "no" for drivers that haven't defined it. Probably Viresh has already thought about this, but I just thought of bringing it up anyway. I also think its fine to handle this case after this series gets in, but that's just my opinion. thanks! -Joel
Re: [PATCH v7 1/2] sched/deadline: Add support for SD_PREFER_SIBLING on find_later_rq()
Hi Byungchul, On Thu, Aug 17, 2017 at 11:05 PM, Byungchul Parkwrote: > It would be better to avoid pushing tasks to other cpu within > a SD_PREFER_SIBLING domain, instead, get more chances to check other > siblings. > > Signed-off-by: Byungchul Park > --- > kernel/sched/deadline.c | 55 > ++--- > 1 file changed, 52 insertions(+), 3 deletions(-) > > diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c > index 0223694..8f69a37 100644 > --- a/kernel/sched/deadline.c > +++ b/kernel/sched/deadline.c > @@ -1319,12 +1319,35 @@ static struct task_struct > *pick_earliest_pushable_dl_task(struct rq *rq, int cpu > > static DEFINE_PER_CPU(cpumask_var_t, local_cpu_mask_dl); > > +/* > + * Find the first cpu in: mask & sd & ~prefer > + */ > +static int find_cpu(const struct cpumask *mask, > + const struct sched_domain *sd, > + const struct sched_domain *prefer) > +{ > + const struct cpumask *sds = sched_domain_span(sd); > + const struct cpumask *ps = prefer ? sched_domain_span(prefer) : NULL; > + int cpu = -1; > + > + while ((cpu = cpumask_next(cpu, mask)) < nr_cpu_ids) { This can be simplified to for_each_cpu(cpu, mask) ? thanks, -Joel
Re: [PATCH v6 0/2] Make find_later_rq() choose a closer cpu in topology
On Thu, Aug 17, 2017 at 6:25 PM, Byungchul Parkwrote: > On Mon, Aug 07, 2017 at 12:50:32PM +0900, Byungchul Park wrote: >> When cpudl_find() returns any among free_cpus, the cpu might not be >> closer than others, considering sched domain. For example: >> >>this_cpu: 15 >>free_cpus: 0, 1,..., 14 (== later_mask) >>best_cpu: 0 >> >>topology: >> >>0 --+ >>+--+ >>1 --+ | >> +-- ... --+ >>2 --+ | | >>+--+ | >>3 --+| >> >>... ... >> >>12 --+ | >> +--+| >>13 --+ || >>+-- ... -+ >>14 --+ | >> +--+ >>15 --+ >> >> In this case, it would be best to select 14 since it's a free cpu and >> closest to 15(this_cpu). However, currently the code select 0(best_cpu) >> even though that's just any among free_cpus. Fix it. > > Could you let me know your opinions about this? Patch looks good to me, I would also add a comment ontop of fallback_cpu (I think Steve mentioned similar thing at [1]) /* * fallback is the closest CPU in the closest SD incase * all domains are PREFER_SIBLING */ if (fallback_cpu == -1) fallback_cpu = best_cpu; And clarify this in the commit message. thanks, -Joel [1] https://patchwork.kernel.org/patch/9884383/ > >> Change from v5 >>-. exclude two patches already picked up by peterz >> (sched/deadline: Make find_later_rq() choose a closer cpu in topology) >> (sched/deadline: Change return value of cpudl_find()) >>-. apply what peterz fixed for 'prefer sibling', into deadline and rt >> >> Change from v4 >>-. remove a patch that might cause huge lock contention >> (by spin lock() in a hot path of scheduler) >> >> Change from v3 >>-. rename closest_cpu to best_cpu so that it align with rt >>-. protect referring cpudl.elements with cpudl.lock >>-. change return value of cpudl_find() to bool >> >> Change from v2 >>-. add support for SD_PREFER_SIBLING >> >> Change from v1 >>-. clean up the patch >> >> Byungchul Park (2): >> sched/deadline: Add support for SD_PREFER_SIBLING on find_later_rq() >> sched/rt: Add support for SD_PREFER_SIBLING on find_lowest_rq() >> >> kernel/sched/deadline.c | 46 +++--- >> kernel/sched/rt.c | 47 --- >> 2 files changed, 87 insertions(+), 6 deletions(-) >> >> -- >> 1.9.1
Re: [for-next][PATCH 15/16] ftrace: Add freeing algorithm to free ftrace_mod_maps
Hi Steve, On Sat, Oct 7, 2017 at 6:32 AM, Steven Rostedt <rost...@goodmis.org> wrote: > On Fri, 6 Oct 2017 23:41:25 -0700 > "Joel Fernandes (Google)" <joel.open...@gmail.com> wrote: > >> Hi Steve, >> >> On Fri, Oct 6, 2017 at 11:07 AM, Steven Rostedt <rost...@goodmis.org> wrote: >> > From: "Steven Rostedt (VMware)" <rost...@goodmis.org> >> > >> > The ftrace_mod_map is a descriptor to save module init function names in >> > case they were traced, and the trace output needs to reference the function >> > name from the function address. But after the function is unloaded, it >> > the maps should be freed, as the rest of the function names are as well. >> >> Just checking for my understanding of this patch - wouldn't this also >> mean that if there were any look ups of the init functions that may be >> needed at trace output time, then those look ups wont be possible any >> more after module is unloaded? > > Yes. That's true for all functions in the module. When a module is > unloaded, all references to it in kallsyms is also freed. Try it on a > current kernel. Trace a function in a module, then unload that module. > The trace data will just show the ip hex address of the module function > after that. > I tried your core branch and I see some weirdness with the filters: Here's the code for the module: #include #include #include void bar(void) { printk(KERN_INFO "bar!\n"); } void foo(void) { printk(KERN_INFO "foo!\n"); bar(); } static int __init hello_init(void) { printk(KERN_INFO "Hello world!\n"); foo(); return 0; } static void __exit hello_cleanup(void) { printk(KERN_INFO "Cleaning up module.\n"); } module_init(hello_init); module_exit(hello_cleanup); Following is a run with function tracing, during the first run I see foo and bar are setup in the filters. After that when I unload and load the module, I see that only "bar" is in the filters: bash-4.3# echo '*:mod:test' > /d/tracing/set_ftrace_filter bash-4.3# echo function > /d/tracing/current_tracer bash-4.3# cat /d/tracing/set_ftrace_filter *:mod:test bash-4.3# modprobe test bash-4.3# cat /d/tracing/trace # tracer: function # # _-=> irqs-off # / _=> need-resched #| / _---=> hardirq/softirq #|| / _--=> preempt-depth #||| / delay # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | modprobe-1050 [003] 1.653000: hello_init <-do_one_initcall bash-4.3# cat /d/tracing/set_ftrace_filter bar [test] foo [test] bash-4.3# rmmod test bash-4.3# cat /d/tracing/set_ftrace_filter bash-4.3# sleep 2 bash-4.3# modprobe test bash-4.3# cat /d/tracing/trace # tracer: function # # _-=> irqs-off # / _=> need-resched #| / _---=> hardirq/softirq #|| / _--=> preempt-depth #||| / delay # TASK-PID CPU# TIMESTAMP FUNCTION # | | | | | modprobe-1050 [003] 1.653000: bar <-do_one_initcall bash-4.3# cat /d/tracing/set_ftrace_filter bar [test] <--- bar is still in the filter list bash-4.3# Interestingly this also shows that the symbol conversion can be unreliable if another module was loaded into the same address, in this case 'bar' was loaded at the same address as 'hello_init' the second time so during the second cat of the trace file, it shows a different symbol name. Also could you let me know what is the correct behavior of the filters after a module being traced is unloaded, are the filters supposed to be setup again after the module is unloaded, before the module can be loaded again? Also I wasn't able to see the call from hello_init -> bar and bar -> foo so I'm not fully sure if that's a bug or I did something wrong. I'm happy to try out anything you suggest. >> I guess having a reference somehow on the ftrace_mod_map descriptor if >> there are any entries in the trace buffer that need it can help >> prevent that but it could be too expensive for not much return since >> most likely the user wouldn't unload modules before trace collection >> in normal usage. > > Right, I have thought about this, and I haven't come up with an > inexpensive way to do this. As this has been the default operation of > all module functions, and I haven't heard m
Re: [for-next][PATCH 15/16] ftrace: Add freeing algorithm to free ftrace_mod_maps
Hi Steve, On Fri, Oct 6, 2017 at 11:07 AM, Steven Rostedtwrote: > From: "Steven Rostedt (VMware)" > > The ftrace_mod_map is a descriptor to save module init function names in > case they were traced, and the trace output needs to reference the function > name from the function address. But after the function is unloaded, it > the maps should be freed, as the rest of the function names are as well. Just checking for my understanding of this patch - wouldn't this also mean that if there were any look ups of the init functions that may be needed at trace output time, then those look ups wont be possible any more after module is unloaded? I guess having a reference somehow on the ftrace_mod_map descriptor if there are any entries in the trace buffer that need it can help prevent that but it could be too expensive for not much return since most likely the user wouldn't unload modules before trace collection in normal usage. thanks, - Joel
[PATCH] rcu: trace: Remove Startedleaf from trace events comment
As part of the gp_seq clean up, the Startleaf condition doesn't occur anymore. Remove it from the comment in the trace event file. Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- include/trace/events/rcu.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index ce9d1a1cac78..6d8dd04912d2 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -91,7 +91,6 @@ TRACE_EVENT(rcu_grace_period, * * "Startleaf": Request a grace period based on leaf-node data. * "Prestarted": Someone beat us to the request - * "Startedleaf": Leaf-node start proved sufficient. * "Startedleafroot": Leaf-node start proved sufficient after checking root. * "Startedroot": Requested a nocb grace period based on root-node data. * "NoGPkthread": The RCU grace-period kthread has not yet started. -- 2.17.0.441.gb46fe60e1d-goog
[PATCH v2] rcu: Add comment documenting how rcu_seq_snap works
rcu_seq_snap may be tricky for someone looking at it for the first time. Lets document how it works with an example to make it easier. Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- v2 changes: Corrections as suggested by Randy. kernel/rcu/rcu.h | 24 +++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 003671825d62..533bc1087371 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -91,7 +91,29 @@ static inline void rcu_seq_end(unsigned long *sp) WRITE_ONCE(*sp, rcu_seq_endval(sp)); } -/* Take a snapshot of the update side's sequence number. */ +/* + * Take a snapshot of the update side's sequence number. + * + * This function predicts what the grace period number will be the next + * time an RCU callback will be executed, given the current grace period's + * number. This can be gp+1 if RCU is idle, or gp+2 if a grace period is + * already in progress. + * + * We do this with a single addition and masking. + * For example, if RCU_SEQ_STATE_MASK=1 and the least significant bit (LSB) of + * the seq is used to track if a GP is in progress or not, its sufficient if we + * add (2+1) and mask with ~1. Lets see why with an example: + * + * Say the current seq is 6 which is 0b110 (gp is 3 and state bit is 0). + * To get the next GP number, we have to at least add 0b10 to this (0x1 << 1) + * to account for the state bit. However, if the current seq is 7 (gp is 3 and + * state bit is 1), then it means the current grace period is already in + * progress so the next time the callback will run is at the end of grace + * period number gp+2. To account for the extra +1, we just overflow the LSB by + * adding another 0x1 and masking with ~0x1. In case no GP was in progress (RCU + * is idle), then the addition of the extra 0x1 and masking will have no + * effect. This is calculated as below. + */ static inline unsigned long rcu_seq_snap(unsigned long *sp) { unsigned long s; -- 2.17.0.441.gb46fe60e1d-goog
[PATCH] rcu: Add comment documenting how rcu_seq_snap works
rcu_seq_snap may be tricky for someone looking at it for the first time. Lets document how it works with an example to make it easier. Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- kernel/rcu/rcu.h | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 003671825d62..004ace3d22c2 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -91,7 +91,28 @@ static inline void rcu_seq_end(unsigned long *sp) WRITE_ONCE(*sp, rcu_seq_endval(sp)); } -/* Take a snapshot of the update side's sequence number. */ +/* + * Take a snapshot of the update side's sequence number. + * + * This function predicts what the grace period number will be the next + * time an RCU callback will be executed, given the current grace period's + * number. This can be gp+1 if RCU is idle, or gp+2 if a grace period is + * already in progress. + * + * We do this with a single addition and masking. + * For example, if RCU_SEQ_STATE_MASK=1 and the least significant bit (LSB) of + * the seq is used to track if a GP is in progress or not, its sufficient if we + * add (2+1) and mask with ~1. Lets see why with an example: + * + * Say the current seq is 6 which is 0x110 (gp is 3 and state bit is 0). + * To get the next GP number, we have to atleast add 0x10 to this (0x1 << 1) to + * account for the state bit. However, if the current seq is 7 (GP num is 3 + * and state bit is 1), then it means the current grace period is already + * in progress so the next the callback will run is at gp+2. To account for + * the extra +1, we just overflow the LSB by adding another 0x1 and masking + * with ~0x1. Incase no GP was in progress (RCU is idle), then the adding + * by 0x1 and masking will have no effect. This is calculated as below. + */ static inline unsigned long rcu_seq_snap(unsigned long *sp) { unsigned long s; -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 1/8] rcu: Add comment documenting how rcu_seq_snap works
rcu_seq_snap may be tricky for someone looking at it for the first time. Lets document how it works with an example to make it easier. Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- kernel/rcu/rcu.h | 24 +++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 003671825d62..fc3170914ac7 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -91,7 +91,29 @@ static inline void rcu_seq_end(unsigned long *sp) WRITE_ONCE(*sp, rcu_seq_endval(sp)); } -/* Take a snapshot of the update side's sequence number. */ +/* + * Take a snapshot of the update side's sequence number. + * + * This function predicts what the grace period number will be the next + * time an RCU callback will be executed, given the current grace period's + * number. This can be gp+1 if RCU is idle, or gp+2 if a grace period is + * already in progress. + * + * We do this with a single addition and masking. + * For example, if RCU_SEQ_STATE_MASK=1 and the least significant bit (LSB) of + * the seq is used to track if a GP is in progress or not, its sufficient if we + * add (2+1) and mask with ~1. Let's see why with an example: + * + * Say the current seq is 6 which is 0b110 (gp is 3 and state bit is 0). + * To get the next GP number, we have to at least add 0b10 to this (0x1 << 1) + * to account for the state bit. However, if the current seq is 7 (gp is 3 and + * state bit is 1), then it means the current grace period is already in + * progress so the next time the callback will run is at the end of grace + * period number gp+2. To account for the extra +1, we just overflow the LSB by + * adding another 0x1 and masking with ~0x1. In case no GP was in progress (RCU + * is idle), then the addition of the extra 0x1 and masking will have no + * effect. This is calculated as below. + */ static inline unsigned long rcu_seq_snap(unsigned long *sp) { unsigned long s; -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 7/8] rcu: trace CleanupMore condition only if needed
Currently the tree RCU clean up code records a CleanupMore trace event even if the GP was already in progress. This makes CleanupMore show up twice for no reason. Avoid it. Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 8401a253e7de..25c44328d071 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -2083,7 +2083,7 @@ static void rcu_gp_cleanup(struct rcu_state *rsp) rsp->gp_state = RCU_GP_IDLE; /* Check for GP requests since above loop. */ rdp = this_cpu_ptr(rsp->rda); - if (ULONG_CMP_LT(rnp->gp_seq, rnp->gp_seq_needed)) { + if (!needgp && ULONG_CMP_LT(rnp->gp_seq, rnp->gp_seq_needed)) { trace_rcu_this_gp(rnp, rdp, rnp->gp_seq_needed, TPS("CleanupMore")); needgp = true; -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 4/8] rcu: Get rid of old c variable from places in tree RCU
The 'c' variable was used previously to store the grace period that is being requested. However it is not very meaningful for a code reader, this patch replaces it with gp_seq_start indicating that this is the grace period that was requested. Also updating tracing with the new name. Just a clean up patch, no logical change. Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- include/trace/events/rcu.h | 15 ++-- kernel/rcu/tree.c | 47 ++ 2 files changed, 35 insertions(+), 27 deletions(-) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index ce9d1a1cac78..539900a9f8c7 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -103,15 +103,16 @@ TRACE_EVENT(rcu_grace_period, */ TRACE_EVENT(rcu_future_grace_period, - TP_PROTO(const char *rcuname, unsigned long gp_seq, unsigned long c, -u8 level, int grplo, int grphi, const char *gpevent), + TP_PROTO(const char *rcuname, unsigned long gp_seq, +unsigned long gp_seq_start, u8 level, int grplo, int grphi, +const char *gpevent), - TP_ARGS(rcuname, gp_seq, c, level, grplo, grphi, gpevent), + TP_ARGS(rcuname, gp_seq, gp_seq_start, level, grplo, grphi, gpevent), TP_STRUCT__entry( __field(const char *, rcuname) __field(unsigned long, gp_seq) - __field(unsigned long, c) + __field(unsigned long, gp_seq_start) __field(u8, level) __field(int, grplo) __field(int, grphi) @@ -121,7 +122,7 @@ TRACE_EVENT(rcu_future_grace_period, TP_fast_assign( __entry->rcuname = rcuname; __entry->gp_seq = gp_seq; - __entry->c = c; + __entry->gp_seq_start = gp_seq_start; __entry->level = level; __entry->grplo = grplo; __entry->grphi = grphi; @@ -129,7 +130,7 @@ TRACE_EVENT(rcu_future_grace_period, ), TP_printk("%s %lu %lu %u %d %d %s", - __entry->rcuname, __entry->gp_seq, __entry->c, __entry->level, + __entry->rcuname, __entry->gp_seq, __entry->gp_seq_start, __entry->level, __entry->grplo, __entry->grphi, __entry->gpevent) ); @@ -751,7 +752,7 @@ TRACE_EVENT(rcu_barrier, #else /* #ifdef CONFIG_RCU_TRACE */ #define trace_rcu_grace_period(rcuname, gp_seq, gpevent) do { } while (0) -#define trace_rcu_future_grace_period(rcuname, gp_seq, c, \ +#define trace_rcu_future_grace_period(rcuname, gp_seq, gp_seq_start, \ level, grplo, grphi, event) \ do { } while (0) #define trace_rcu_grace_period_init(rcuname, gp_seq, level, grplo, grphi, \ diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 29ccc60bdbfc..9f5679ba413b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1541,13 +1541,18 @@ void rcu_cpu_stall_reset(void) /* Trace-event wrapper function for trace_rcu_future_grace_period. */ static void trace_rcu_this_gp(struct rcu_node *rnp, struct rcu_data *rdp, - unsigned long c, const char *s) + unsigned long gp_seq_start, const char *s) { - trace_rcu_future_grace_period(rdp->rsp->name, rnp->gp_seq, c, + trace_rcu_future_grace_period(rdp->rsp->name, rnp->gp_seq, gp_seq_start, rnp->level, rnp->grplo, rnp->grphi, s); } /* + * rcu_start_this_gp - Request the start of a particular grace period + * @rnp: The leaf node of the CPU from which to start. + * @rdp: The rcu_data corresponding to the CPU from which to start. + * @gp_seq_start: The gp_seq of the grace period to start. + * * Start the specified grace period, as needed to handle newly arrived * callbacks. The required future grace periods are recorded in each * rcu_node structure's ->gp_seq_needed field. Returns true if there @@ -1555,9 +1560,11 @@ static void trace_rcu_this_gp(struct rcu_node *rnp, struct rcu_data *rdp, * * The caller must hold the specified rcu_node structure's ->lock, which * is why the caller is responsible for waking the grace-period kthread. + * + * Returns true if the GP thread needs to be awakened else false. */ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp, - unsigned long c) + unsigned long gp_seq_start) { bool ret = false; struct rcu_state *rsp = rdp->rsp; @@ -1573,18 +1580,19 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp, * not be released. */ raw_lockdep_assert_held_rcu_node(rnp); - trace_rcu_this_gp(rnp, rdp, c, TPS("Startleaf")); +
[PATCH RFC 8/8] rcu: Fix cpustart tracepoint gp_seq number
cpustart shows a stale gp_seq. This is because rdp->gp_seq is updated only at the end of the __note_gp_changes function. For this reason, use rnp->gp_seq instead. I believe we can't update rdp->gp_seq too early so lets just use the gp_seq from rnp instead. Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 25c44328d071..58d2b68f8b98 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1807,7 +1807,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp, * set up to detect a quiescent state, otherwise don't * go looking for one. */ - trace_rcu_grace_period(rsp->name, rdp->gp_seq, TPS("cpustart")); + trace_rcu_grace_period(rsp->name, rnp->gp_seq, TPS("cpustart")); need_gp = !!(rnp->qsmask & rdp->grpmask); rdp->cpu_no_qs.b.norm = need_gp; rdp->rcu_qs_ctr_snap = __this_cpu_read(rcu_dynticks.rcu_qs_ctr); -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 6/8] rcu: Add back the Startedleaf tracepoint
In recent discussion [1], the check for whether a leaf believes RCU is not idle, is being added back to funnel locking code, to avoid more locking. In this we are marking the leaf node for a future grace-period and bailing out since a GP is currently in progress. However the tracepoint is missing. Lets add it back. Also add a small comment about why we do this check (basically the point is to avoid locking intermediate nodes unnecessarily) and clarify the comments in the trace event header now that we are doing traversal of one or more intermediate nodes. [1] http://lkml.kernel.org/r/20180513190906.gl26...@linux.vnet.ibm.com Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- include/trace/events/rcu.h | 4 ++-- kernel/rcu/tree.c | 11 ++- 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 539900a9f8c7..dc0bd11739c7 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -91,8 +91,8 @@ TRACE_EVENT(rcu_grace_period, * * "Startleaf": Request a grace period based on leaf-node data. * "Prestarted": Someone beat us to the request - * "Startedleaf": Leaf-node start proved sufficient. - * "Startedleafroot": Leaf-node start proved sufficient after checking root. + * "Startedleaf": Leaf and one or more non-root nodes marked for future start. + * "Startedleafroot": all non-root nodes from leaf to root marked for future start. * "Startedroot": Requested a nocb grace period based on root-node data. * "NoGPkthread": The RCU grace-period kthread has not yet started. * "StartWait": Start waiting for the requested grace period. diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 40670047d22c..8401a253e7de 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1593,8 +1593,17 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp, goto unlock_out; } rnp_node->gp_seq_needed = gp_seq_start; - if (rcu_seq_state(rcu_seq_current(>gp_seq))) + + /* +* Check if leaf believes a GP is in progress, if yes we can +* bail and avoid more locking. We have already marked the leaf. +*/ + if (rcu_seq_state(rcu_seq_current(>gp_seq))) { + trace_rcu_this_gp(rnp_node, rdp, gp_seq_start, + TPS("Startedleaf")); goto unlock_out; + } + if (rnp_node != rnp && rnp_node->parent != NULL) raw_spin_unlock_rcu_node(rnp_node); if (!rnp_node->parent) { -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 0/8] rcu fixes, clean ups for rcu/dev
Hi, Here are some fixes, clean ups and some code comments changes mostly for the new funnel locking, gp_seq changes and some tracing. Its based on latest rcu/dev branch. thanks, - Joel Joel Fernandes (Google) (8): rcu: Add comment documenting how rcu_seq_snap works rcu: Clarify usage of cond_resched for tasks-RCU rcu: Add back the cpuend tracepoint rcu: Get rid of old c variable from places in tree RCU rcu: Use rcu_node as temporary variable in funnel locking loop rcu: Add back the Startedleaf tracepoint rcu: trace CleanupMore condition only if needed rcu: Fix cpustart tracepoint gp_seq number include/linux/rcupdate.h | 11 +++-- include/trace/events/rcu.h | 19 kernel/rcu/rcu.h | 24 +- kernel/rcu/tree.c | 92 +++--- 4 files changed, 97 insertions(+), 49 deletions(-) -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 2/8] rcu: Clarify usage of cond_resched for tasks-RCU
Recently we had a discussion about cond_resched unconditionally recording a voluntary context switch [1]. Lets add a comment clarifying that how this API is to be used. [1] https://lkml.kernel.org/r/1526027434-21237-1-git-send-email-byungchul.p...@lge.com Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- include/linux/rcupdate.h | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h index 743226176350..a9881007ece6 100644 --- a/include/linux/rcupdate.h +++ b/include/linux/rcupdate.h @@ -159,8 +159,12 @@ static inline void rcu_init_nohz(void) { } } while (0) /* - * Note a voluntary context switch for RCU-tasks benefit. This is a - * macro rather than an inline function to avoid #include hell. + * Note an attempt to perform a voluntary context switch for RCU-tasks benefit. + * + * This is called even in situations where a context switch didn't really + * happen even though it was requested. The caller uses it to indicate + * traversal of an RCU-tasks quiescent state. This is a macro rather than an + * inline function to avoid #include hell. */ #ifdef CONFIG_TASKS_RCU #define rcu_note_voluntary_context_switch_lite(t) \ @@ -187,7 +191,8 @@ static inline void exit_tasks_rcu_finish(void) { } #endif /* #else #ifdef CONFIG_TASKS_RCU */ /** - * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU + * cond_resched_tasks_rcu_qs - Report potential quiescent states to RCU. + * The quiescent state report is made even if cond_resched() did nothing. * * This macro resembles cond_resched(), except that it is defined to * report potential quiescent states to RCU-tasks even if the cond_resched() -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 5/8] rcu: Use rcu_node as temporary variable in funnel locking loop
The funnel locking loop in rcu_start_this_gp uses rcu_root as a temporary variable while walking the combining tree. This causes a tiresome exercise of a code reader reminding themselves that rcu_root may not be root. Lets just call it rcu_node, and then finally when rcu_node is the rcu_root, lets assign it at that time. Just a clean up patch, no logical change. Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- kernel/rcu/tree.c | 34 ++ 1 file changed, 18 insertions(+), 16 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 9f5679ba413b..40670047d22c 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1568,7 +1568,7 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp, { bool ret = false; struct rcu_state *rsp = rdp->rsp; - struct rcu_node *rnp_root; + struct rcu_node *rnp_node, *rnp_root = NULL; /* * Use funnel locking to either acquire the root rcu_node @@ -1581,24 +1581,26 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp, */ raw_lockdep_assert_held_rcu_node(rnp); trace_rcu_this_gp(rnp, rdp, gp_seq_start, TPS("Startleaf")); - for (rnp_root = rnp; 1; rnp_root = rnp_root->parent) { - if (rnp_root != rnp) - raw_spin_lock_rcu_node(rnp_root); - if (ULONG_CMP_GE(rnp_root->gp_seq_needed, gp_seq_start) || - rcu_seq_done(_root->gp_seq, gp_seq_start) || - (rnp != rnp_root && -rcu_seq_state(rcu_seq_current(_root->gp_seq { - trace_rcu_this_gp(rnp_root, rdp, gp_seq_start, + for (rnp_node = rnp; 1; rnp_node = rnp_node->parent) { + if (rnp_node != rnp) + raw_spin_lock_rcu_node(rnp_node); + if (ULONG_CMP_GE(rnp_node->gp_seq_needed, gp_seq_start) || + rcu_seq_done(_node->gp_seq, gp_seq_start) || + (rnp != rnp_node && +rcu_seq_state(rcu_seq_current(_node->gp_seq { + trace_rcu_this_gp(rnp_node, rdp, gp_seq_start, TPS("Prestarted")); goto unlock_out; } - rnp_root->gp_seq_needed = gp_seq_start; + rnp_node->gp_seq_needed = gp_seq_start; if (rcu_seq_state(rcu_seq_current(>gp_seq))) goto unlock_out; - if (rnp_root != rnp && rnp_root->parent != NULL) - raw_spin_unlock_rcu_node(rnp_root); - if (!rnp_root->parent) + if (rnp_node != rnp && rnp_node->parent != NULL) + raw_spin_unlock_rcu_node(rnp_node); + if (!rnp_node->parent) { + rnp_root = rnp_node; break; /* At root, and perhaps also leaf. */ + } } /* If GP already in progress, just leave, otherwise start one. */ @@ -1616,10 +1618,10 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp, trace_rcu_grace_period(rsp->name, READ_ONCE(rsp->gp_seq), TPS("newreq")); ret = true; /* Caller must wake GP kthread. */ unlock_out: - if (rnp != rnp_root) - raw_spin_unlock_rcu_node(rnp_root); + if (rnp != rnp_node) + raw_spin_unlock_rcu_node(rnp_node); /* Push furthest requested GP to leaf node and rcu_data structure. */ - if (ULONG_CMP_GE(rnp_root->gp_seq_needed, gp_seq_start)) { + if (ULONG_CMP_GE(rnp_node->gp_seq_needed, gp_seq_start)) { rnp->gp_seq_needed = gp_seq_start; rdp->gp_seq_needed = gp_seq_start; } -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 3/8] rcu: Add back the cpuend tracepoint
Commit be4b8beed87d ("rcu: Move RCU's grace-period-change code to ->gp_seq") removed the cpuend grace period trace point. This patch adds it back. Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- kernel/rcu/tree.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 9ad931bff409..29ccc60bdbfc 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1774,10 +1774,12 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp, /* Handle the ends of any preceding grace periods first. */ if (rcu_seq_completed_gp(rdp->gp_seq, rnp->gp_seq) || - unlikely(READ_ONCE(rdp->gpwrap))) + unlikely(READ_ONCE(rdp->gpwrap))) { ret = rcu_advance_cbs(rsp, rnp, rdp); /* Advance callbacks. */ - else + trace_rcu_grace_period(rsp->name, rdp->gp_seq, TPS("cpuend")); + } else { ret = rcu_accelerate_cbs(rsp, rnp, rdp); /* Recent callbacks. */ + } /* Now handle the beginnings of any new-to-this-CPU grace periods. */ if (rcu_seq_new_gp(rdp->gp_seq, rnp->gp_seq) || -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC] schedutil: Allow cpufreq requests to be made even when kthread kicked
Currently there is a chance of a schedutil cpufreq update request to be dropped if there is a pending update request. This pending request can be delayed if there is a scheduling delay of the irq_work and the wake up of the schedutil governor kthread. A very bad scenario is when a schedutil request was already just made, such as to reduce the CPU frequency, then a newer request to increase CPU frequency (even sched deadline urgent frequency increase requests) can be dropped, even though the rate limits suggest that its Ok to process a request. This is because of the way the work_in_progress flag is used. This patch improves the situation by allowing new requests to happen even though the old one is still being processed. Note that in this approach, if an irq_work was already issued, we just update next_freq and don't bother to queue another request so there's no extra work being done to make this happen. I had brought up this issue at the OSPM conference and Claudio had a discussion RFC with an alternate approach [1]. I prefer the approach as done in the patch below since it doesn't need any new flags and doesn't cause any other extra overhead. [1] https://patchwork.kernel.org/patch/10384261/ CC: Viresh Kumar <viresh.ku...@linaro.org> CC: Rafael J. Wysocki <rafael.j.wyso...@intel.com> CC: Peter Zijlstra <pet...@infradead.org> CC: Ingo Molnar <mi...@redhat.com> CC: Patrick Bellasi <patrick.bell...@arm.com> CC: Juri Lelli <juri.le...@redhat.com> Cc: Luca Abeni <luca.ab...@santannapisa.it> CC: Joel Fernandes <joe...@google.com> CC: linux...@vger.kernel.org Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- Claudio, Could you also test this patch for your usecase? kernel/sched/cpufreq_schedutil.c | 36 +--- 1 file changed, 28 insertions(+), 8 deletions(-) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index e13df951aca7..a87fc281893d 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) !cpufreq_can_do_remote_dvfs(sg_policy->policy)) return false; - if (sg_policy->work_in_progress) - return false; - if (unlikely(sg_policy->need_freq_update)) { sg_policy->need_freq_update = false; /* @@ -129,8 +126,11 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, policy->cur = next_freq; trace_cpu_frequency(next_freq, smp_processor_id()); } else { - sg_policy->work_in_progress = true; - irq_work_queue(_policy->irq_work); + /* Don't queue request if one was already queued */ + if (!sg_policy->work_in_progress) { + sg_policy->work_in_progress = true; + irq_work_queue(_policy->irq_work); + } } } @@ -291,6 +291,15 @@ static void sugov_update_single(struct update_util_data *hook, u64 time, ignore_dl_rate_limit(sg_cpu, sg_policy); + /* +* For slow-switch systems, single policy requests can't run at the +* moment if the governor thread is already processing a pending +* frequency switch request, this can be fixed by acquiring update_lock +* while updating next_freq and work_in_progress but we prefer not to. +*/ + if (sg_policy->work_in_progress) + return; + if (!sugov_should_update_freq(sg_policy, time)) return; @@ -382,13 +391,24 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags) static void sugov_work(struct kthread_work *work) { struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work); + unsigned int freq; + unsigned long flags; + + /* +* Hold sg_policy->update_lock shortly to handle the case where: +* incase sg_policy->next_freq is read here, and then updated by +* sugov_update_shared just before work_in_progress is set to false +* here, we may miss queueing the new update. +*/ + raw_spin_lock_irqsave(_policy->update_lock, flags); + freq = sg_policy->next_freq; + sg_policy->work_in_progress = false; + raw_spin_unlock_irqrestore(_policy->update_lock, flags); mutex_lock(_policy->work_lock); - __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, + __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L); mutex_unlock(_policy->work_lock); - - sg_policy->work_in_progress = false; } static void sugov_irq_work(struct irq_work *irq_work) -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC] schedutil: Address the r/w ordering race in kthread
Currently there is a race in schedutil code for slow-switch single-CPU systems. Fix it by enforcing ordering the write to work_in_progress to happen before the read of next_freq. Kthread Sched update sugov_work() sugov_update_single() lock(); // The CPU is free to rearrange below // two in any order, so it may clear // the flag first and then read next // freq. Lets assume it does. work_in_progress = false if (work_in_progress) return; sg_policy->next_freq = 0; freq = sg_policy->next_freq; sg_policy->next_freq = real-freq; unlock(); Reported-by: Viresh Kumar <viresh.ku...@linaro.org> CC: Rafael J. Wysocki <rafael.j.wyso...@intel.com> CC: Peter Zijlstra <pet...@infradead.org> CC: Ingo Molnar <mi...@redhat.com> CC: Patrick Bellasi <patrick.bell...@arm.com> CC: Juri Lelli <juri.le...@redhat.com> Cc: Luca Abeni <luca.ab...@santannapisa.it> CC: Todd Kjos <tk...@google.com> CC: clau...@evidence.eu.com CC: kernel-t...@android.com CC: linux...@vger.kernel.org Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- I split this into separate patch, because this race can also happen in mainline. kernel/sched/cpufreq_schedutil.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index 5c482ec38610..ce7749da7a44 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -401,6 +401,13 @@ static void sugov_work(struct kthread_work *work) */ raw_spin_lock_irqsave(_policy->update_lock, flags); freq = sg_policy->next_freq; + + /* +* sugov_update_single can access work_in_progress without update_lock, +* make sure next_freq is read before work_in_progress is set. +*/ + smp_mb(); + sg_policy->work_in_progress = false; raw_spin_unlock_irqrestore(_policy->update_lock, flags); -- 2.17.0.441.gb46fe60e1d-goog
[PATCH 2/2] kselftests: ftrace: Add tests for the preemptoff and irqsoff tracers
Here we add unit tests for the preemptoff and irqsoff tracer by using a kernel module introduced previously to trigger atomic sections in the kernel. Cc: Steven Rostedt <rost...@goodmis.org> Cc: Peter Zilstra <pet...@infradead.org> Cc: Ingo Molnar <mi...@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoy...@efficios.com> Cc: Tom Zanussi <tom.zanu...@linux.intel.com> Cc: Namhyung Kim <namhy...@kernel.org> Cc: Thomas Glexiner <t...@linutronix.de> Cc: Boqun Feng <boqun.f...@gmail.com> Cc: Paul McKenney <paul...@linux.vnet.ibm.com> Cc: Masami Hiramatsu <mhira...@kernel.org> Cc: Todd Kjos <tk...@google.com> Cc: Erick Reyes <erickre...@google.com> Cc: Julia Cartwright <ju...@ni.com> Cc: kernel-t...@android.com Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- tools/testing/selftests/ftrace/config | 3 + .../test.d/preemptirq/irqsoff_tracer.tc | 74 +++ 2 files changed, 77 insertions(+) create mode 100644 tools/testing/selftests/ftrace/test.d/preemptirq/irqsoff_tracer.tc diff --git a/tools/testing/selftests/ftrace/config b/tools/testing/selftests/ftrace/config index b01924c71c09..29588b328345 100644 --- a/tools/testing/selftests/ftrace/config +++ b/tools/testing/selftests/ftrace/config @@ -4,3 +4,6 @@ CONFIG_FUNCTION_PROFILER=y CONFIG_TRACER_SNAPSHOT=y CONFIG_STACK_TRACER=y CONFIG_HIST_TRIGGERS=y +CONFIG_PREEMPT_TRACER=y +CONFIG_IRQSOFF_TRACER=y +CONFIG_TEST_ATOMIC_SECTIONS=m diff --git a/tools/testing/selftests/ftrace/test.d/preemptirq/irqsoff_tracer.tc b/tools/testing/selftests/ftrace/test.d/preemptirq/irqsoff_tracer.tc new file mode 100644 index ..b76d781c5645 --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/preemptirq/irqsoff_tracer.tc @@ -0,0 +1,74 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: test for the preemptirqsoff tracer + +MOD=test_atomic_sections + +fail() { +reset_tracer +rmmod $MOD || true +exit_fail +} + +unsup() { #msg +reset_tracer +rmmod $MOD || true +echo $1 +exit_unsupported +} + +modprobe $MOD || unsup "$MOD module not available" +rmmod $MOD + +grep "preemptoff" available_tracers || unsup "preemptoff tracer not enabled" +grep "irqsoff" available_tracers || unsup "irqsoff tracer not enabled" + +reset_tracer + +# Simulate preemptoff section for half a second couple of times +echo preemptoff > current_tracer +sleep 1 +modprobe test_atomic_sections atomic_mode=preempt atomic_time=50 || fail +rmmod test_atomic_sections || fail +modprobe test_atomic_sections atomic_mode=preempt atomic_time=50 || fail +rmmod test_atomic_sections || fail +modprobe test_atomic_sections atomic_mode=preempt atomic_time=50 || fail +rmmod test_atomic_sections || fail + +cat trace + +# Confirm which tracer +grep "tracer: preemptoff" trace || fail + +# Check the end of the section +egrep "5.us : " trace || fail + +# Check for 500ms of latency +egrep "latency: 5. us" trace || fail + +reset_tracer + +# Simulate irqsoff section for half a second couple of times +echo irqsoff > current_tracer +sleep 1 +modprobe test_atomic_sections atomic_mode=irq atomic_time=50 || fail +rmmod test_atomic_sections || fail +modprobe test_atomic_sections atomic_mode=irq atomic_time=50 || fail +rmmod test_atomic_sections || fail +modprobe test_atomic_sections atomic_mode=irq atomic_time=50 || fail +rmmod test_atomic_sections || fail + +cat trace + +# Confirm which tracer +grep "tracer: irqsoff" trace || fail + +# Check the end of the section +egrep "5.us : " trace || fail + +# Check for 500ms of latency +egrep "latency: 5. us" trace || fail + +reset_tracer +exit 0 + -- 2.17.0.441.gb46fe60e1d-goog
[PATCH v2] schedutil: Allow cpufreq requests to be made even when kthread kicked
From: "Joel Fernandes (Google)" <j...@joelfernandes.org> Currently there is a chance of a schedutil cpufreq update request to be dropped if there is a pending update request. This pending request can be delayed if there is a scheduling delay of the irq_work and the wake up of the schedutil governor kthread. A very bad scenario is when a schedutil request was already just made, such as to reduce the CPU frequency, then a newer request to increase CPU frequency (even sched deadline urgent frequency increase requests) can be dropped, even though the rate limits suggest that its Ok to process a request. This is because of the way the work_in_progress flag is used. This patch improves the situation by allowing new requests to happen even though the old one is still being processed. Note that in this approach, if an irq_work was already issued, we just update next_freq and don't bother to queue another request so there's no extra work being done to make this happen. I had brought up this issue at the OSPM conference and Claudio had a discussion RFC with an alternate approach [1]. I prefer the approach as done in the patch below since it doesn't need any new flags and doesn't cause any other extra overhead. [1] https://patchwork.kernel.org/patch/10384261/ LGTMed-by: Viresh Kumar <viresh.ku...@linaro.org> LGTMed-by: Juri Lelli <juri.le...@redhat.com> CC: Viresh Kumar <viresh.ku...@linaro.org> CC: Rafael J. Wysocki <rafael.j.wyso...@intel.com> CC: Peter Zijlstra <pet...@infradead.org> CC: Ingo Molnar <mi...@redhat.com> CC: Patrick Bellasi <patrick.bell...@arm.com> CC: Juri Lelli <juri.le...@redhat.com> Cc: Luca Abeni <luca.ab...@santannapisa.it> CC: Joel Fernandes <joe...@google.com> CC: Todd Kjos <tk...@google.com> CC: clau...@evidence.eu.com CC: kernel-t...@android.com CC: linux...@vger.kernel.org Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- v1 -> v2: Minor style related changes. kernel/sched/cpufreq_schedutil.c | 34 1 file changed, 26 insertions(+), 8 deletions(-) diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c index e13df951aca7..5c482ec38610 100644 --- a/kernel/sched/cpufreq_schedutil.c +++ b/kernel/sched/cpufreq_schedutil.c @@ -92,9 +92,6 @@ static bool sugov_should_update_freq(struct sugov_policy *sg_policy, u64 time) !cpufreq_can_do_remote_dvfs(sg_policy->policy)) return false; - if (sg_policy->work_in_progress) - return false; - if (unlikely(sg_policy->need_freq_update)) { sg_policy->need_freq_update = false; /* @@ -128,7 +125,7 @@ static void sugov_update_commit(struct sugov_policy *sg_policy, u64 time, policy->cur = next_freq; trace_cpu_frequency(next_freq, smp_processor_id()); - } else { + } else if (!sg_policy->work_in_progress) { sg_policy->work_in_progress = true; irq_work_queue(_policy->irq_work); } @@ -291,6 +288,13 @@ static void sugov_update_single(struct update_util_data *hook, u64 time, ignore_dl_rate_limit(sg_cpu, sg_policy); + /* +* For slow-switch systems, single policy requests can't run at the +* moment if update is in progress, unless we acquire update_lock. +*/ + if (sg_policy->work_in_progress) + return; + if (!sugov_should_update_freq(sg_policy, time)) return; @@ -382,13 +386,27 @@ sugov_update_shared(struct update_util_data *hook, u64 time, unsigned int flags) static void sugov_work(struct kthread_work *work) { struct sugov_policy *sg_policy = container_of(work, struct sugov_policy, work); + unsigned int freq; + unsigned long flags; + + /* +* Hold sg_policy->update_lock shortly to handle the case where: +* incase sg_policy->next_freq is read here, and then updated by +* sugov_update_shared just before work_in_progress is set to false +* here, we may miss queueing the new update. +* +* Note: If a work was queued after the update_lock is released, +* sugov_work will just be called again by kthread_work code; and the +* request will be proceed before the sugov thread sleeps. +*/ + raw_spin_lock_irqsave(_policy->update_lock, flags); + freq = sg_policy->next_freq; + sg_policy->work_in_progress = false; + raw_spin_unlock_irqrestore(_policy->update_lock, flags); mutex_lock(_policy->work_lock); - __cpufreq_driver_target(sg_policy->policy, sg_policy->next_freq, - CPUFREQ_RELATION_L); + __cpufreq_driver_target(sg_policy->policy, freq, CPUFREQ_RELATION_L); mutex_unlock(_policy->work_lock); - - sg_policy->work_in_progress = false; } static void sugov_irq_work(struct irq_work *irq_work) -- 2.17.0.441.gb46fe60e1d-goog
[PATCH v7 4/6] trace/irqsoff: Split reset into separate functions
Split reset functions into seperate functions in preparation of future patches that need to do tracer specific reset. Cc: Steven Rostedt <rost...@goodmis.org> Cc: Peter Zilstra <pet...@infradead.org> Cc: Ingo Molnar <mi...@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoy...@efficios.com> Cc: Tom Zanussi <tom.zanu...@linux.intel.com> Cc: Namhyung Kim <namhy...@kernel.org> Cc: Thomas Glexiner <t...@linutronix.de> Cc: Boqun Feng <boqun.f...@gmail.com> Cc: Paul McKenney <paul...@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweis...@gmail.com> Cc: Randy Dunlap <rdun...@infradead.org> Cc: Masami Hiramatsu <mhira...@kernel.org> Cc: Fenguang Wu <fengguang...@intel.com> Cc: Baohong Liu <baohong@intel.com> Cc: Vedang Patel <vedang.pa...@intel.com> Cc: kernel-t...@android.com Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- kernel/trace/trace_irqsoff.c | 22 +++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c index 03ecb4465ee4..f8daa754cce2 100644 --- a/kernel/trace/trace_irqsoff.c +++ b/kernel/trace/trace_irqsoff.c @@ -634,7 +634,7 @@ static int __irqsoff_tracer_init(struct trace_array *tr) return 0; } -static void irqsoff_tracer_reset(struct trace_array *tr) +static void __irqsoff_tracer_reset(struct trace_array *tr) { int lat_flag = save_flags & TRACE_ITER_LATENCY_FMT; int overwrite_flag = save_flags & TRACE_ITER_OVERWRITE; @@ -665,6 +665,12 @@ static int irqsoff_tracer_init(struct trace_array *tr) return __irqsoff_tracer_init(tr); } + +static void irqsoff_tracer_reset(struct trace_array *tr) +{ + __irqsoff_tracer_reset(tr); +} + static struct tracer irqsoff_tracer __read_mostly = { .name = "irqsoff", @@ -697,11 +703,16 @@ static int preemptoff_tracer_init(struct trace_array *tr) return __irqsoff_tracer_init(tr); } +static void preemptoff_tracer_reset(struct trace_array *tr) +{ + __irqsoff_tracer_reset(tr); +} + static struct tracer preemptoff_tracer __read_mostly = { .name = "preemptoff", .init = preemptoff_tracer_init, - .reset = irqsoff_tracer_reset, + .reset = preemptoff_tracer_reset, .start = irqsoff_tracer_start, .stop = irqsoff_tracer_stop, .print_max = true, @@ -731,11 +742,16 @@ static int preemptirqsoff_tracer_init(struct trace_array *tr) return __irqsoff_tracer_init(tr); } +static void preemptirqsoff_tracer_reset(struct trace_array *tr) +{ + __irqsoff_tracer_reset(tr); +} + static struct tracer preemptirqsoff_tracer __read_mostly = { .name = "preemptirqsoff", .init = preemptirqsoff_tracer_init, - .reset = irqsoff_tracer_reset, + .reset = preemptirqsoff_tracer_reset, .start = irqsoff_tracer_start, .stop = irqsoff_tracer_stop, .print_max = true, -- 2.17.0.441.gb46fe60e1d-goog
[PATCH v7 1/6] softirq: reorder trace_softirqs_on to prevent lockdep splat
I'm able to reproduce a lockdep splat with config options: CONFIG_PROVE_LOCKING=y, CONFIG_DEBUG_LOCK_ALLOC=y and CONFIG_PREEMPTIRQ_EVENTS=y $ echo 1 > /d/tracing/events/preemptirq/preempt_enable/enable --- kernel/softirq.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/kernel/softirq.c b/kernel/softirq.c index 177de3640c78..8a040bcaa033 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -139,9 +139,13 @@ static void __local_bh_enable(unsigned int cnt) { lockdep_assert_irqs_disabled(); + if (preempt_count() == cnt) + trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip()); + if (softirq_count() == (cnt & SOFTIRQ_MASK)) trace_softirqs_on(_RET_IP_); - preempt_count_sub(cnt); + + __preempt_count_sub(cnt); } /* -- 2.17.0.441.gb46fe60e1d-goog
[PATCH v7 2/6] srcu: Add notrace variants of srcu_read_{lock,unlock}
From: Paul McKenney <paul...@linux.vnet.ibm.com> This is needed for a future tracepoint patch that uses srcu, and to make sure it doesn't call into lockdep. tracepoint code already calls notrace variants for rcu_read_lock_sched so this patch does the same for srcu which will be used in a later patch. Keeps it consistent with rcu-sched. [Joel: Added commit message] Cc: Steven Rostedt <rost...@goodmis.org> Cc: Peter Zilstra <pet...@infradead.org> Cc: Ingo Molnar <mi...@redhat.com> Cc: Mathieu Desnoyers <mathieu.desnoy...@efficios.com> Cc: Tom Zanussi <tom.zanu...@linux.intel.com> Cc: Namhyung Kim <namhy...@kernel.org> Cc: Thomas Glexiner <t...@linutronix.de> Cc: Boqun Feng <boqun.f...@gmail.com> Cc: Paul McKenney <paul...@linux.vnet.ibm.com> Cc: Frederic Weisbecker <fweis...@gmail.com> Cc: Randy Dunlap <rdun...@infradead.org> Cc: Masami Hiramatsu <mhira...@kernel.org> Cc: Fenguang Wu <fengguang...@intel.com> Cc: Baohong Liu <baohong@intel.com> Cc: Vedang Patel <vedang.pa...@intel.com> Cc: kernel-t...@android.com Reviewed-by: Steven Rostedt (VMware) <rost...@goodmis.org> Signed-off-by: Paul McKenney <paul...@linux.vnet.ibm.com> Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- include/linux/srcu.h | 17 + 1 file changed, 17 insertions(+) diff --git a/include/linux/srcu.h b/include/linux/srcu.h index 33c1c698df09..2ec618979b20 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -161,6 +161,16 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp) return retval; } +/* Used by tracing, cannot be traced and cannot invoke lockdep. */ +static inline notrace int +srcu_read_lock_notrace(struct srcu_struct *sp) __acquires(sp) +{ + int retval; + + retval = __srcu_read_lock(sp); + return retval; +} + /** * srcu_read_unlock - unregister a old reader from an SRCU-protected structure. * @sp: srcu_struct in which to unregister the old reader. @@ -175,6 +185,13 @@ static inline void srcu_read_unlock(struct srcu_struct *sp, int idx) __srcu_read_unlock(sp, idx); } +/* Used by tracing, cannot be traced and cannot call lockdep. */ +static inline notrace void +srcu_read_unlock_notrace(struct srcu_struct *sp, int idx) __releases(sp) +{ + __srcu_read_unlock(sp, idx); +} + /** * smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock * -- 2.17.0.441.gb46fe60e1d-goog
[PATCH v2] sched: Remove obscure comment from select_task_rq_fair
I was playing with cpusets and sched_load_balance flag and notice that the fast-path (select_idle_sibling) can also be attempted for exec-balance, not just wake-balance if the waker cpu's cpuset has sched_load_balance = 0. This patch removes the obscure comment which was saying this path can be entered only for wake-balance. To trigger this, I just do: mkdir /cpuset mount -t cpuset none /cpuset echo 0 > sched_load_balance Then did some random activity and dumped the stack from 'if (!sd)' for the non wake-balance cases. Following is one of the stacks: dump_stack+0x46/0x5b select_task_rq_fair+0x101d/0x1030 sched_exec+0x4f/0xc0 do_execveat_common.isra.41+0x1e3/0x7c0 __x64_sys_execve+0x2d/0x40 do_syscall_64+0x43/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Turns out the same case occurs also during boot up when kthreadd tries to create threads before domains are attached so lets fix the comment. Cc: Dietmar Eggemann <dietmar.eggem...@arm.com> Cc: Morten Ramussen <morten.rasmus...@arm.com> Cc: Ingo Molnar <mi...@redhat.com> Cc: Peter Zijlstra <pet...@infradead.org> Cc: Juri Lelli <juri.le...@redhat.com> Cc: Vincent Guittot <vincent.guit...@linaro.org> Cc: Patrick Bellasi <patrick.bell...@arm.com> Cc: Rohit Jain <rohit.k.j...@oracle.com> Cc: kernel-t...@android.com Signed-off-by: Joel Fernandes (Google) <j...@joelfernandes.org> --- v1->v2: Resending without "XXX" in subject since otherwise LKML thinks its junk. kernel/sched/fair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 54dc31e7ab9b..dd07794141d0 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -6665,7 +6665,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f if (!sd) { pick_cpu: - if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */ + if (sd_flag & SD_BALANCE_WAKE) { new_cpu = select_idle_sibling(p, prev_cpu, new_cpu); if (want_affine) -- 2.17.0.921.gf22659ad46-goog
[PATCH] trace: Use rcu_dereference_raw for hooks from trace-event subsystem
Since we switched to using SRCU for tracepoints used in the idle path, we can no longer use rcu_dereference_sched for dereferencing points in trace-event hooks. Since tracepoints can now use either SRCU or sched-RCU, just use rcu_dereference_raw for traceevents just like we're doing when dereferencing the tracepoint table. This prevents an RCU warning reported by Masami: [ 282.060593] WARNING: can't dereference registers at f3c7f62b [ 282.063200] = [ 282.064082] WARNING: suspicious RCU usage [ 282.064963] 4.18.0-rc6+ #15 Tainted: GW [ 282.066048] - [ 282.066923] /home/mhiramat/ksrc/linux/kernel/trace/trace_events.c:242 suspicious rcu_dereference_check() usage! [ 282.068974] [ 282.068974] other info that might help us debug this: [ 282.068974] [ 282.070770] [ 282.070770] RCU used illegally from idle CPU! [ 282.070770] rcu_scheduler_active = 2, debug_locks = 1 [ 282.072938] RCU used illegally from extended quiescent state! [ 282.074183] no locks held by swapper/0/0. [ 282.075071] [ 282.075071] stack backtrace: [ 282.076121] CPU: 0 PID: 0 Comm: swapper/0 Tainted: GW [ 282.077782] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) [ 282.079604] Call Trace: [ 282.080212] [ 282.080755] dump_stack+0x85/0xcb [ 282.081523] trace_event_ignore_this_pid+0x66/0x70 [ 282.082541] trace_event_raw_event_preemptirq_template+0xa2/0xb0 [ 282.083774] ? interrupt_entry+0xc4/0xe0 [ 282.084665] ? trace_hardirqs_off_thunk+0x1a/0x1c [ 282.085669] trace_hardirqs_off_caller+0x90/0xd0 [ 282.086597] trace_hardirqs_off_thunk+0x1a/0x1c [ 282.087433] ? call_function_interrupt+0xa/0x20 [ 282.088201] interrupt_entry+0xc4/0xe0 [ 282.088848] ? call_function_interrupt+0xa/0x20 [ 282.089579] [ 282.090029] ? native_safe_halt+0x2/0x10 [ 282.090695] ? default_idle+0x1f/0x160 [ 282.091330] ? default_idle_call+0x24/0x40 [ 282.091997] ? do_idle+0x210/0x250 [ 282.092658] ? cpu_startup_entry+0x6f/0x80 [ 282.093338] ? start_kernel+0x49d/0x4bd [ 282.093987] ? secondary_startup_64+0xa5/0xb0 Reported-by: Masami Hiramatsu Fixes: e6753f23d961 ("tracepoint: Make rcuidle tracepoint callers use SRCU") Signed-off-by: Joel Fernandes (Google) --- kernel/trace/trace_events.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/trace/trace_events.c b/kernel/trace/trace_events.c index 14ff4ff3caab..7b508ce8ac44 100644 --- a/kernel/trace/trace_events.c +++ b/kernel/trace/trace_events.c @@ -239,7 +239,7 @@ bool trace_event_ignore_this_pid(struct trace_event_file *trace_file) struct trace_array_cpu *data; struct trace_pid_list *pid_list; - pid_list = rcu_dereference_sched(tr->filtered_pids); + pid_list = rcu_dereference_raw(tr->filtered_pids); if (!pid_list) return false; @@ -512,7 +512,7 @@ event_filter_pid_sched_process_exit(void *data, struct task_struct *task) struct trace_pid_list *pid_list; struct trace_array *tr = data; - pid_list = rcu_dereference_sched(tr->filtered_pids); + pid_list = rcu_dereference_raw(tr->filtered_pids); trace_filter_add_remove_task(pid_list, NULL, task); } -- 2.18.0.597.ga71716f1ad-goog
[PATCH ftrace/core] tracing: irqsoff: Account for additional preempt_disable
Recently we tried to make the preemptirqsoff tracer to use irqsoff tracepoint probes. However this causes issues as reported by Masami: [2.271078] Testing tracer preemptirqsoff: .. no entries found ..FAILED! [2.381015] WARNING: CPU: 0 PID: 1 at /home/mhiramat/ksrc/linux/kernel/ trace/trace.c:1512 run_tracer_selftest+0xf3/0x154 This is due to the tracepoint code increasing the preempt nesting count by calling an additional preempt_disable before calling into the preemptoff tracer which messes up the preempt_count() check in tracer_hardirqs_off. To fix this, make the irqsoff tracer probes balance the additional outer preempt_disable with a preempt_enable_notrace. The other way to fix this is to just use SRCU for all tracepoints. However we can't do that because we can't use NMIs from RCU context. Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") Fixes: e6753f23d961 ("tracepoint: Make rcuidle tracepoint callers use SRCU") Reported-by: Masami Hiramatsu Signed-off-by: Joel Fernandes (Google) --- kernel/trace/trace_irqsoff.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c index 770cd30cda40..ffbf1505d5bc 100644 --- a/kernel/trace/trace_irqsoff.c +++ b/kernel/trace/trace_irqsoff.c @@ -603,14 +603,40 @@ static void irqsoff_tracer_stop(struct trace_array *tr) */ static void tracer_hardirqs_on(void *none, unsigned long a0, unsigned long a1) { + /* +* Tracepoint probes are expected to be called with preempt disabled, +* We don't care about being called with preempt disabled but we need +* to know in the future if that changes so we can remove the next +* preempt_enable. +*/ + WARN_ON_ONCE(!preempt_count()); + + /* Tracepoint probes disable preemption atleast once, account for that */ + preempt_enable_notrace(); + if (!preempt_trace() && irq_trace()) stop_critical_timing(a0, a1); + + preempt_disable_notrace(); } static void tracer_hardirqs_off(void *none, unsigned long a0, unsigned long a1) { + /* +* Tracepoint probes are expected to be called with preempt disabled, +* We don't care about being called with preempt disabled but we need +* to know in the future if that changes so we can remove the next +* preempt_enable. +*/ + WARN_ON_ONCE(!preempt_count()); + + /* Tracepoint probes disable preemption atleast once, account for that */ + preempt_enable_notrace(); + if (!preempt_trace() && irq_trace()) start_critical_timing(a0, a1); + + preempt_disable_notrace(); } static int irqsoff_tracer_init(struct trace_array *tr) -- 2.18.0.597.ga71716f1ad-goog
[RFC] tracepoint: Run tracepoints even after CPU is offline
Commit f37755490fe9 ("tracepoints: Do not trace when cpu is offline") causes a problem for lockdep using tracepoint code. Once a CPU is offline, tracepoints donot get called, however this causes a big problem for lockdep probes that need to run so that IRQ annotations are marked correctly. An issue is possible where while the CPU is going offline, an interrupt can come in and then a lockdep assert causes an annotation warning: [ 106.551354] IRQs not enabled as expected [ 106.551785] WARNING: CPU: 1 PID: 0 at kernel/time/tick-sched.c:982 tick_nohz_idle_enter+0x99/0xb0 [ 106.552964] Modules linked in: [ 106.553299] CPU: 1 PID: 0 Comm: swapper/1 Tainted: GW We need tracepoints to run as late as possible. This commit tries to fix the issue by removing the cpu_online check in tracepoint code that was introduced in the mentioned commit, however now we run the risk of running dereferencing probes that aren't RCU protected, which gives an RCU warning like so on boot up: [0.030159] x86: Booting SMP configuration: [0.030169] node #0, CPUs: #1 [0.001000] [0.001000] = [0.001000] WARNING: suspicious RCU usage [0.001000] 4.18.0-rc6+ #42 Not tainted [0.001000] - [0.001000] ./include/trace/events/timer.h:38 suspicious rcu_dereference_check() usage! [0.001000] [0.001000] other info that might help us debug this: [0.001000] [0.001000] [0.001000] RCU used illegally from offline CPU! [0.001000] rcu_scheduler_active = 1, debug_locks = 1 [0.001000] no locks held by swapper/1/0. [0.001000] Any ideas on how we can fix this? Basically we need RCU to work here even after !cpu_online. I thought of just using SRCU for all tracepoints however that may mean we can't use tracepoints from NMI.. Tries-to-Fix: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") Reported-by: Masami Hiramatsu Signed-off-by: Joel Fernandes (Google) --- include/linux/tracepoint.h | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index d9a084c72541..020885714a0f 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -365,19 +365,17 @@ extern void syscall_unregfunc(void); * "void *__data, proto" as the callback prototype. */ #define DECLARE_TRACE_NOARGS(name) \ - __DECLARE_TRACE(name, void, , \ - cpu_online(raw_smp_processor_id()), \ + __DECLARE_TRACE(name, void, , 1,\ void *__data, __data) #define DECLARE_TRACE(name, proto, args) \ - __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \ - cpu_online(raw_smp_processor_id()), \ + __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), 1, \ PARAMS(void *__data, proto),\ PARAMS(__data, args)) #define DECLARE_TRACE_CONDITION(name, proto, args, cond) \ __DECLARE_TRACE(name, PARAMS(proto), PARAMS(args), \ - cpu_online(raw_smp_processor_id()) && (PARAMS(cond)), \ + PARAMS(cond), \ PARAMS(void *__data, proto),\ PARAMS(__data, args)) -- 2.18.0.597.ga71716f1ad-goog
[PATCH RFC ftrace/core] tracing: Directly call tracer and lockdep probes
From: Steven Rostedt Due to various problems with using tracepoints as probes for IRQ enable/disable and preempt enable/disable: such as using SRCU in NMI context and other issues with the irqsoff/preemptoff tracer, directly call the lockdep and irqsoff tracer probes for now. This keeps all the other benefits and clean ups of the patch being partially reverted. Based on Steve's original Partial revert patch. I added directly calling irqsoff tracer probes as well. Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") Signed-off-by: Joel Fernandes (Google) --- Steve, I expanded your partial revert such that it fixes the remaining issues as well while keeping the clean ups and benefits of my original patch. Its based on the latest ftrace/core. I'm still testing it, but it works so far. What do you think? I feel like it could still be a step forward for this merge window. Also you could drop "tracepoint: Make rcuidle tracepoint callers use SRCU" since its not needed anymore for this merge window. Tests passing so far: - ftrace startup tests - locking api self-tests - manually tested preemptoff and irqsoff tracers - manually tested preempt/irq disable and enable events - boot tested without warnings include/linux/irqflags.h| 4 ++ include/linux/lockdep.h | 2 - init/main.c | 2 - kernel/locking/lockdep.c| 14 +-- kernel/trace/trace.h| 16 kernel/trace/trace_irqsoff.c| 26 ++--- kernel/trace/trace_preemptirq.c | 66 - 7 files changed, 75 insertions(+), 55 deletions(-) diff --git a/include/linux/irqflags.h b/include/linux/irqflags.h index 50edb9cbbd26..19a90ac7c87d 100644 --- a/include/linux/irqflags.h +++ b/include/linux/irqflags.h @@ -19,9 +19,13 @@ #ifdef CONFIG_PROVE_LOCKING extern void trace_softirqs_on(unsigned long ip); extern void trace_softirqs_off(unsigned long ip); + extern void lockdep_hardirqs_on(unsigned long ip); + extern void lockdep_hardirqs_off(unsigned long ip); #else # define trace_softirqs_on(ip) do { } while (0) # define trace_softirqs_off(ip)do { } while (0) +# define lockdep_hardirqs_on(ip) do { } while (0) +# define lockdep_hardirqs_off(ip) do { } while (0) #endif #ifdef CONFIG_TRACE_IRQFLAGS diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h index a8113357ceeb..b0d0b51c4d85 100644 --- a/include/linux/lockdep.h +++ b/include/linux/lockdep.h @@ -267,7 +267,6 @@ struct held_lock { * Initialization, self-test and debugging-output methods: */ extern void lockdep_init(void); -extern void lockdep_init_early(void); extern void lockdep_reset(void); extern void lockdep_reset_lock(struct lockdep_map *lock); extern void lockdep_free_key_range(void *start, unsigned long size); @@ -408,7 +407,6 @@ static inline void lockdep_on(void) # define lock_set_class(l, n, k, s, i) do { } while (0) # define lock_set_subclass(l, s, i)do { } while (0) # define lockdep_init()do { } while (0) -# define lockdep_init_early() do { } while (0) # define lockdep_init_map(lock, name, key, sub) \ do { (void)(name); (void)(key); } while (0) # define lockdep_set_class(lock, key) do { (void)(key); } while (0) diff --git a/init/main.c b/init/main.c index 44fe43be84c1..5d42e577643a 100644 --- a/init/main.c +++ b/init/main.c @@ -649,8 +649,6 @@ asmlinkage __visible void __init start_kernel(void) call_function_init(); WARN(!irqs_disabled(), "Interrupts were enabled early\n"); - lockdep_init_early(); - early_boot_irqs_disabled = false; local_irq_enable(); diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c index 03bfaeb9f4e6..e406c5fdb41e 100644 --- a/kernel/locking/lockdep.c +++ b/kernel/locking/lockdep.c @@ -2840,8 +2840,7 @@ static void __trace_hardirqs_on_caller(unsigned long ip) debug_atomic_inc(hardirqs_on_events); } -static void lockdep_hardirqs_on(void *none, unsigned long ignore, - unsigned long ip) +void lockdep_hardirqs_on(unsigned long ip) { if (unlikely(!debug_locks || current->lockdep_recursion)) return; @@ -2885,8 +2884,7 @@ static void lockdep_hardirqs_on(void *none, unsigned long ignore, /* * Hardirqs were disabled: */ -static void lockdep_hardirqs_off(void *none, unsigned long ignore, -unsigned long ip) +void lockdep_hardirqs_off(unsigned long ip) { struct task_struct *curr = current; @@ -4315,14 +4313,6 @@ void lockdep_reset_lock(struct lockdep_map *lock) raw_local_irq_restore(flags); } -void __init lockdep_init_early(void) -{ -#ifdef CONFIG_PROVE_LOCKING - register_trace_prio_irq_disable(lockdep_hardirqs_off, NULL, INT_MAX); - register_trace_prio_irq_enable(lockdep_hardirqs_on, NULL, IN
[PATCH RFC] tracepoint: Use SRCU for all tracepoint invocations
Steven reported that rcuidle && in_nmi() condition can occur which creates a problem for SRCU usage, since we can't use the SRCU node from both NMI context and other contexts (NMI can come in while the SRCU read lock is in the process of being held). This patch switches to using a separate SRCU node for tracepoints called from in_nmi(). This is needed to also make tracepoints work while CPU is offline. Fixes: c3bc8fd637a9 ("tracing: Centralize preemptirq tracepoints and unify their usage") Reported-by: Masami Hiramatsu Signed-off-by: Joel Fernandes (Google) --- Dropped the "CPU offline" changes, and only keeping the SRCU changes. include/linux/tracepoint.h | 19 --- kernel/tracepoint.c| 10 ++ 2 files changed, 14 insertions(+), 15 deletions(-) diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h index d9a084c72541..1ceee17a38dc 100644 --- a/include/linux/tracepoint.h +++ b/include/linux/tracepoint.h @@ -35,6 +35,7 @@ struct trace_eval_map { #define TRACEPOINT_DEFAULT_PRIO10 extern struct srcu_struct tracepoint_srcu; +extern struct srcu_struct tracepoint_srcu_nmi; extern int tracepoint_probe_register(struct tracepoint *tp, void *probe, void *data); @@ -144,13 +145,11 @@ extern void syscall_unregfunc(void); void *it_func; \ void *__data; \ int __maybe_unused idx = 0; \ + struct srcu_struct *ss; \ \ if (!(cond))\ return; \ \ - /* srcu can't be used from NMI */ \ - WARN_ON_ONCE(rcuidle && in_nmi()); \ - \ /* keep srcu and sched-rcu usage consistent */ \ preempt_disable_notrace(); \ \ @@ -159,7 +158,11 @@ extern void syscall_unregfunc(void); * doesn't work from the idle path. \ */ \ if (rcuidle)\ - idx = srcu_read_lock_notrace(_srcu); \ + ss = _srcu_nmi; \ + else\ + ss = _srcu; \ + \ + idx = srcu_read_lock_notrace(ss); \ \ it_func_ptr = rcu_dereference_raw((tp)->funcs); \ \ @@ -171,8 +174,7 @@ extern void syscall_unregfunc(void); } while ((++it_func_ptr)->func);\ } \ \ - if (rcuidle)\ - srcu_read_unlock_notrace(_srcu, idx);\ + srcu_read_unlock_notrace(ss, idx); \ \ preempt_enable_notrace(); \ } while (0) @@ -212,11 +214,6 @@ extern void syscall_unregfunc(void); TP_PROTO(data_proto), \ TP_ARGS(data_args), \ TP_CONDITION(cond), 0); \ - if (IS_ENABLED(CONFIG_LOCKDEP) && (cond)) { \ - rcu_read_lock_sched_notrace(); \ - rcu_dereference_sched(__tracepoint_##name.funcs);\ - rcu_read_unlock_sched_notrace();\ - } \ } \ __DECLARE_TRACE_RCU(name, PARAMS(proto), PARAMS(args), \ PARAMS(cond), PARAMS(data_proto), PARAMS(data_args))\ diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c index 955148d91b74..769d74b2f90e 100644 --- a/kernel/tracepoint.c +++ b/kernel/tracepoint.c
[PATCH] mm: shmem: Correctly annotate new inodes
Directories and inodes don't necessarily need to be in the same lockdep class. For ex, hugetlbfs splits them out too to prevent false positives in lockdep. Annotate correctly after new inode creation. If its a directory inode, it will be put into a different class. This should fix a lockdep splat reported by syzbot: > == > WARNING: possible circular locking dependency detected > 4.18.0-rc8-next-20180810+ #36 Not tainted > -- > syz-executor900/4483 is trying to acquire lock: > d2bfc8fe (>s_type->i_mutex_key#9){}, at: inode_lock > include/linux/fs.h:765 [inline] > d2bfc8fe (>s_type->i_mutex_key#9){}, at: > shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602 > > but task is already holding lock: > 25208078 (ashmem_mutex){+.+.}, at: ashmem_shrink_scan+0xb4/0x630 > drivers/staging/android/ashmem.c:448 > > which lock already depends on the new lock. > > -> #2 (ashmem_mutex){+.+.}: >__mutex_lock_common kernel/locking/mutex.c:925 [inline] >__mutex_lock+0x171/0x1700 kernel/locking/mutex.c:1073 >mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:1088 >ashmem_mmap+0x55/0x520 drivers/staging/android/ashmem.c:361 >call_mmap include/linux/fs.h:1844 [inline] >mmap_region+0xf27/0x1c50 mm/mmap.c:1762 >do_mmap+0xa10/0x1220 mm/mmap.c:1535 >do_mmap_pgoff include/linux/mm.h:2298 [inline] >vm_mmap_pgoff+0x213/0x2c0 mm/util.c:357 >ksys_mmap_pgoff+0x4da/0x660 mm/mmap.c:1585 >__do_sys_mmap arch/x86/kernel/sys_x86_64.c:100 [inline] >__se_sys_mmap arch/x86/kernel/sys_x86_64.c:91 [inline] >__x64_sys_mmap+0xe9/0x1b0 arch/x86/kernel/sys_x86_64.c:91 >do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 >entry_SYSCALL_64_after_hwframe+0x49/0xbe > > -> #1 (>mmap_sem){}: >__might_fault+0x155/0x1e0 mm/memory.c:4568 >_copy_to_user+0x30/0x110 lib/usercopy.c:25 >copy_to_user include/linux/uaccess.h:155 [inline] >filldir+0x1ea/0x3a0 fs/readdir.c:196 >dir_emit_dot include/linux/fs.h:3464 [inline] >dir_emit_dots include/linux/fs.h:3475 [inline] >dcache_readdir+0x13a/0x620 fs/libfs.c:193 >iterate_dir+0x48b/0x5d0 fs/readdir.c:51 >__do_sys_getdents fs/readdir.c:231 [inline] >__se_sys_getdents fs/readdir.c:212 [inline] >__x64_sys_getdents+0x29f/0x510 fs/readdir.c:212 >do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 >entry_SYSCALL_64_after_hwframe+0x49/0xbe > > -> #0 (>s_type->i_mutex_key#9){}: >lock_acquire+0x1e4/0x540 kernel/locking/lockdep.c:3924 >down_write+0x8f/0x130 kernel/locking/rwsem.c:70 >inode_lock include/linux/fs.h:765 [inline] >shmem_fallocate+0x18b/0x12e0 mm/shmem.c:2602 >ashmem_shrink_scan+0x236/0x630 drivers/staging/android/ashmem.c:455 >ashmem_ioctl+0x3ae/0x13a0 drivers/staging/android/ashmem.c:797 >vfs_ioctl fs/ioctl.c:46 [inline] >file_ioctl fs/ioctl.c:501 [inline] >do_vfs_ioctl+0x1de/0x1720 fs/ioctl.c:685 >ksys_ioctl+0xa9/0xd0 fs/ioctl.c:702 >__do_sys_ioctl fs/ioctl.c:709 [inline] >__se_sys_ioctl fs/ioctl.c:707 [inline] >__x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:707 >do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290 >entry_SYSCALL_64_after_hwframe+0x49/0xbe > > other info that might help us debug this: > > Chain exists of: > >s_type->i_mutex_key#9 --> >mmap_sem --> ashmem_mutex > > Possible unsafe locking scenario: > >CPU0CPU1 > > lock(ashmem_mutex); >lock(>mmap_sem); >lock(ashmem_mutex); > lock(>s_type->i_mutex_key#9); > > *** DEADLOCK *** > > 1 lock held by syz-executor900/4483: > #0: 25208078 (ashmem_mutex){+.+.}, at: > ashmem_shrink_scan+0xb4/0x630 drivers/staging/android/ashmem.c:448 Reported-by: syzbot Cc: wi...@infradead.org Cc: sta...@vger.kernel.org Cc: pet...@infradead.org Suggested-by: Neil Brown Signed-off-by: Joel Fernandes (Google) --- mm/shmem.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/shmem.c b/mm/shmem.c index 2cab84403055..4429a8fd932d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2225,6 +2225,8 @@ static struct inode *shmem_get_inode(struct super_block *sb, const struct inode mpol_shared_policy_init(>policy, NULL); break; } + + lockdep_annotate_inode_mutex_key(inode); } else shmem_free_inode(sb); return inode; -- 2.18.0.597.ga71716f1ad-goog
Re: [PATCH] sched: support dynamiQ cluster
On Fri, Apr 6, 2018 at 5:58 AM, Morten Rasmussenwrote: > On Thu, Apr 05, 2018 at 06:22:48PM +0200, Vincent Guittot wrote: >> Hi Morten, >> >> On 5 April 2018 at 17:46, Morten Rasmussen wrote: >> > On Wed, Apr 04, 2018 at 03:43:17PM +0200, Vincent Guittot wrote: >> >> On 4 April 2018 at 12:44, Valentin Schneider >> >> wrote: >> >> > Hi, >> >> > >> >> > On 03/04/18 13:17, Vincent Guittot wrote: >> >> >> Hi Valentin, >> >> >> >> >> > [...] >> >> >>> >> >> >>> I believe ASYM_PACKING behaves better here because the workload is >> >> >>> only >> >> >>> sysbench threads. As stated above, since task utilization is >> >> >>> disregarded, I >> >> >> >> >> >> It behaves better because it doesn't wait for the task's utilization >> >> >> to reach a level before assuming the task needs high compute capacity. >> >> >> The utilization gives an idea of the running time of the task not the >> >> >> performance level that is needed >> >> >> >> >> > >> >> > [ >> >> > That's my point actually. ASYM_PACKING disregards utilization and moves >> >> > those >> >> > threads to the big cores ASAP, which is good here because it's just >> >> > sysbench >> >> > threads. >> >> > >> >> > What I meant was that if the task composition changes, IOW we mix >> >> > "small" >> >> > tasks (e.g. periodic stuff) and "big" tasks (performance-sensitive >> >> > stuff like >> >> > sysbench threads), we shouldn't assume all of those require to run on a >> >> > big >> >> > CPU. The thing is, ASYM_PACKING can't make the difference between >> >> > those, so >> > [Morten] >> >> >> >> That's the 1st point where I tend to disagree: why big cores are only >> >> for long running task and periodic stuff can't need to run on big >> >> cores to get max compute capacity ? >> >> You make the assumption that only long running tasks need high compute >> >> capacity. This patch wants to always provide max compute capacity to >> >> the system and not only long running task >> > >> > There is no way we can tell if a periodic or short-running tasks >> > requires the compute capacity of a big core or not based on utilization >> > alone. The utilization can only tell us if a task could potentially use >> > more compute capacity, i.e. the utilization approaches the compute >> > capacity of its current cpu. >> > >> > How we handle low utilization tasks comes down to how we define >> > "performance" and if we care about the cost of "performance" (e.g. >> > energy consumption). >> > >> > Placing a low utilization task on a little cpu should always be fine >> > from _throughput_ point of view. As long as the cpu has spare cycles it >> >> [Vincent] >> I disagree, throughput is not only a matter of spare cycle it's also a >> matter of how fast you compute the work like with IO activity as an >> example > > [Morten] > From a cpu centric point of view it is, but I agree that from a > application/user point of view completion time might impact throughput > too. For example of if your throughput depends on how fast you can > offload work to some peripheral device (GPU for example). > > However, as I said in the beginning we don't know what the task does. [Joel] Just wanted to say about Vincent point of IO loads throughput - remembering from when I was playing with the iowait boost stuff, that - say you have a little task that does some IO and blocks and does so periodically. In the scenario the task will run for little time and is a little task by way of looking at utilization. However, if we were to run it on the BIG CPUs, the overall throughput of the I/O activity would be higher. For this case, it seems its impossible to specify the "default" behavior correctly. Like, do we care about performance or energy more? This seems more like a policy-decision from userspace and not something the scheduler should necessarily have to decide. Like if I/O activity is background and not affecting the user experience. thanks, - Joel
Re: [PATCH] proc/stat: Separate out individual irq counts into /proc/stat_irqs
Hi, Or maintain array of registered irqs and iterate over them only. >>> Right, we can allocate a bitmap of used irqs to do that. >>> I have another idea. perf record shows mutex_lock/mutex_unlock at the top. Most of them are irq mutex not seqfile mutex as there are many more interrupts than reads. Take it once. >>> How many cpus are in your test system? In that skylake server, it was >>> the per-cpu summing operation of the irq counts that was consuming most >>> of the time for reading /proc/stat. I think we can certainly try to >>> optimize the lock taking. >> It's 16x(NR_IRQS: 4352, nr_irqs: 960, preallocated irqs: 16) >> Given that irq registering is rare operation, maintaining sorted array >> of irq should be the best option. >>> For the time being, I think I am going to have a clone /proc/stat2 as >>> suggested in my earlier email. Alternatively, I can put that somewhere >>> in sysfs if you have a good idea of where I can put it. >> sysfs is strictly one-value-per-file. >> >>> I will also look into ways to optimize the current per-IRQ stats >>> handling, but it will come later. >> There is always a time-honored way of ioctl(2) switching irq info off >> /proc supports that. >> >> There are many options. > > OK, it is good to know. Do you have any existing code snippet in the > kernel that I can use as reference on how to use ioctl(2) switching? > > I will look into how to optimize the existing per-IRQ stats code first > before venturing into cloning /proc/stat. Can we not just remove per-IRQ stats from /proc/stat (since I gather from this discussion it isn't scalable), and just have applications that need per-IRQ stats use /proc/interrupts ? thanks, - Joel
Re: [RFC PATCH] tracepoint: Provide tracepoint_kernel_find_by_name
Hi Mathieu, On Mon, Mar 26, 2018 at 12:10 PM, Mathieu Desnoyerswrote: > Provide an API allowing eBPF to lookup core kernel tracepoints by name. > > Given that a lookup by name explicitly requires tracepoint definitions > to be unique for a given name (no duplicate keys), include a > WARN_ON_ONCE() check that only a single match is encountered at runtime. > This should always be the case, given that a DEFINE_TRACE emits a > __tracepoint_##name symbol, which would cause a link-time error if more > than one instance is found. Nevertheless, check this at runtime with > WARN_ON_ONCE() to stay on the safe side. > > Signed-off-by: Mathieu Desnoyers > CC: Steven Rostedt > CC: Alexei Starovoitov > CC: Peter Zijlstra > CC: Ingo Molnar > --- > include/linux/tracepoint.h | 1 + > kernel/tracepoint.c| 35 +++ > 2 files changed, 36 insertions(+) > > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h > index c94f466d57ef..1b4ae64b7c6a 100644 > --- a/include/linux/tracepoint.h > +++ b/include/linux/tracepoint.h > @@ -43,6 +43,7 @@ tracepoint_probe_unregister(struct tracepoint *tp, void > *probe, void *data); > extern void > for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv), > void *priv); > +extern struct tracepoint *tracepoint_kernel_find_by_name(const char *name); > > #ifdef CONFIG_MODULES > struct tp_module { > diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c > index 671b13457387..0a59f988055a 100644 > --- a/kernel/tracepoint.c > +++ b/kernel/tracepoint.c > @@ -60,6 +60,11 @@ struct tp_probes { > struct tracepoint_func probes[0]; > }; > > +struct tp_find_args { > + const char *name; > + struct tracepoint *tp; > +}; > + > static inline void *allocate_probes(int count) > { > struct tp_probes *p = kmalloc(count * sizeof(struct tracepoint_func) > @@ -528,6 +533,36 @@ void for_each_kernel_tracepoint(void (*fct)(struct > tracepoint *tp, void *priv), > } > EXPORT_SYMBOL_GPL(for_each_kernel_tracepoint); > > +static void find_tp(struct tracepoint *tp, void *priv) > +{ > + struct tp_find_args *args = priv; > + > + if (!strcmp(tp->name, args->name)) { > + WARN_ON_ONCE(args->tp); > + args->tp = tp; I think this runtime check is not needed if it really can't happen (linker verifies that already as you mentioned) although I'm not opposed if you still want to keep it, because there's no way to break out of the outer loop anyway so I guess your call.. I would just do: if (args->tp) return; if find_tp already found the tracepoint. Tried to also create a duplicate tracepoint and I get: CC init/version.o AR init/built-in.o AR built-in.o LD vmlinux.o block/blk-core.o:(__tracepoints+0x440): multiple definition of `__tracepoint_sched_switch' kernel/sched/core.o:(__tracepoints+0x440): first defined here Makefile:1032: recipe for target 'vmlinux' failed make: *** [vmlinux] Error 1 For this senseless diff: diff --git a/include/trace/events/block.h b/include/trace/events/block.h index 81b43f5bdf23..2b855c05197d 100644 --- a/include/trace/events/block.h +++ b/include/trace/events/block.h @@ -49,6 +49,13 @@ DEFINE_EVENT(block_buffer, block_touch_buffer, TP_ARGS(bh) ); +DEFINE_EVENT(block_buffer, sched_switch, + + TP_PROTO(struct buffer_head *bh), + + TP_ARGS(bh) +); + /** * block_dirty_buffer - mark a buffer dirty * @bh: buffer_head being dirtied thanks, - Joel
Re: [PATCH 0/3] [RFC] init, tracing: Add initcall trace events
Hi Steve, On Fri, Mar 23, 2018 at 8:02 AM, Steven Rostedtwrote: > A while ago we had a boot tracer. But it was eventually removed: > commit 30dbb20e68e6f ("tracing: Remove boot tracer"). > > The rational was because there is already a initcall_debug boot option > that causes printk()s of all the initcall functions. > > The problem with the initcall_debug option is that printk() is awfully slow, > and makes it difficult to see the real impact of initcalls. Mainly because > a single printk() is usually slower than most initcall functions. > > Instead of bringing back the boot tracer, adding trace events around the > initcall functions, and even one to denote which level the initcall > functions are being called from, adds the necessary information to > analyze the initcalls without the high overhead of printk()s, that > can substantially slow down the boot process. > > Another positive, is that the console initcall functions themselves > can also be traced. The timestamps are not operational at that time > but you can see which consoles are being registered. I saw this on > one of my test boxes: > > -0 [000] ...1 0.00: initcall_level: level=console > -0 [000] ...1 0.00: initcall_start: func=con_init+0x0/0x224 > -0 [000] ...1 0.00: initcall_finish: > func=con_init+0x0/0x224 ret=0 > -0 [000] ...1 0.00: initcall_start: > func=hvc_console_init+0x0/0x19 > -0 [000] ...1 0.00: initcall_finish: > func=hvc_console_init+0x0/0x19 ret=0 > -0 [000] ...1 0.00: initcall_start: > func=xen_cons_init+0x0/0x60 > -0 [000] ...1 0.00: initcall_finish: > func=xen_cons_init+0x0/0x60 ret=0 > -0 [000] ...1 0.00: initcall_start: > func=univ8250_console_init+0x0/0x2d > -0 [000] ...1 0.00: initcall_finish: > func=univ8250_console_init+0x0/0x2d ret=0 Will this make initcall_debug not work if CONFIG_TRACEPOINTS is turned off? Although it builds but I think this initcall_debug feature will fail, maybe CONFIG_TRACEPOINTS should be selected somewhere? I recently ran into some issues like this for my preemptirq tracepoints patch (which I will post again soon :D) where lockdep needed the tracepoints and I had to select it. thanks, - Joel
Re: [RFC PATCH] tracepoint: Provide tracepoint_kernel_find_by_name
On Tue, Mar 27, 2018 at 6:27 AM, Mathieu Desnoyerswrote: >>> +static void find_tp(struct tracepoint *tp, void *priv) >>> +{ >>> + struct tp_find_args *args = priv; >>> + >>> + if (!strcmp(tp->name, args->name)) { >>> + WARN_ON_ONCE(args->tp); >>> + args->tp = tp; >> >> I think this runtime check is not needed if it really can't happen >> (linker verifies that already as you mentioned) although I'm not >> opposed if you still want to keep it, because there's no way to break >> out of the outer loop anyway so I guess your call.. > > We can change the outer loop and break from it if needed, that's not > an issue. > >> I would just do: >> >> if (args->tp) >>return; >> >> if find_tp already found the tracepoint. >> >> Tried to also create a duplicate tracepoint and I get: >> CC init/version.o >> AR init/built-in.o >> AR built-in.o >> LD vmlinux.o >> block/blk-core.o:(__tracepoints+0x440): multiple definition of >> `__tracepoint_sched_switch' >> kernel/sched/core.o:(__tracepoints+0x440): first defined here >> Makefile:1032: recipe for target 'vmlinux' failed >> make: *** [vmlinux] Error 1 > > Yeah, as I state in my changelog, I'm very well aware that the linker > is able to catch those. This was the purpose of emitting a > __tracepoint_##name symbol in the tracepoint definition macro. This > WARN_ON_ONCE() is a redundant check for an invariant that we expect > is provided by the linker. Given that it's not a fast path, I would > prefer to keep this redundant check in place, given that an average > factor 2 speedup on a slow path should really not be something we > need to optimize for. IMHO, for slow paths, robustness is more important > than speed (unless the slow path becomes so slow that it really affects > the user). > > I envision that a way to break this invariant would be to compile a > kernel object with gcc -fvisibility=hidden or such. I admit this is > particular scenario is really far fetched, but it illustrates why I > prefer to keep an extra safety net at runtime for linker-based > validations when possible. > > If a faster tracepoint lookup function is needed, we should consider > perfect hashing algorithms done post-build, but so far nobody has > complained about speed of this lookup operation. Anyhow a factor 2 in > the speed of this lookup should really not matter, right ? Yes, I agree with all the great reasons. So lets do it your way then. thanks, - Joel
Re: [PATCH] staging: android: ashmem: Fix possible deadlock in ashmem_ioctl
On Tue, Feb 27, 2018 at 10:59 PM, Yisheng Xiewrote: > ashmem_mutex may create a chain of dependencies like: > > CPU0CPU1 > mmap syscall ioctl syscall > -> mmap_sem (acquired) -> ashmem_ioctl > -> ashmem_mmap-> ashmem_mutex (acquired) > -> ashmem_mutex (try to acquire) -> copy_from_user > -> mmap_sem (try to acquire) > > There is a lock odering problem between mmap_sem and ashmem_mutex causing > a lockdep splat[1] during a syzcaller test. This patch fixes the problem > by move copy_from_user out of ashmem_mutex. > > [1] https://www.spinics.net/lists/kernel/msg2733200.html > > Fixes: ce8a3a9e76d0 (staging: android: ashmem: Fix a race condition in pin > ioctls) > Reported-by: syzbot+d7a918a7a8e1c952b...@syzkaller.appspotmail.com > Signed-off-by: Yisheng Xie Greg, Could you take this patch for the stable trees? I do see it in staging already. I couldn't find it in stable so wanted to bring it to your attention. If you already aware of it, please ignore my note. Thanks, - Joel
[PATCH RFC] mm: Add an fs-write seal to memfd
Android uses ashmem for sharing memory regions. We are looking forward to migrating all usecases of ashmem to memfd so that we can possibly remove the ashmem driver in the future from staging while also benefiting from using memfd and contributing to it. Note staging drivers are also not ABI and generally can be removed at anytime. One of the main usecases Android has is the ability to create a region and mmap it as writeable, then drop its protection for "future" writes while keeping the existing already mmap'ed writeable-region active. This allows us to implement a usecase where receivers of the shared memory buffer can get a read-only view, while the sender continues to write to the buffer. See CursorWindow in Android for more details: https://developer.android.com/reference/android/database/CursorWindow This usecase cannot be implemented with the existing F_SEAL_WRITE seal. To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which prevents any future mmap and write syscalls from succeeding while keeping the existing mmap active. The following program shows the seal working in action: int main() { int ret, fd; void *addr, *addr2, *addr3, *addr1; ret = memfd_create_region("test_region", REGION_SIZE); printf("ret=%d\n", ret); fd = ret; // Create map addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr == MAP_FAILED) printf("map 0 failed\n"); else printf("map 0 passed\n"); if ((ret = write(fd, "test", 4)) != 4) printf("write failed even though no fs-write seal " "(ret=%d errno =%d)\n", ret, errno); else printf("write passed\n"); addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr1 == MAP_FAILED) perror("map 1 prot-write failed even though no seal\n"); else printf("map 1 prot-write passed as expected\n"); ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FS_WRITE); if (ret == -1) printf("fcntl failed, errno: %d\n", errno); else printf("fs-write seal now active\n"); if ((ret = write(fd, "test", 4)) != 4) printf("write failed as expected due to fs-write seal\n"); else printf("write passed (unexpected)\n"); addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr2 == MAP_FAILED) perror("map 2 prot-write failed as expected due to seal\n"); else printf("map 2 passed\n"); addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0); if (addr3 == MAP_FAILED) perror("map 3 failed\n"); else printf("map 3 prot-read passed as expected\n"); } The output of running this program is as follows: ret=3 map 0 passed write passed map 1 prot-write passed as expected fs-write seal now active write failed as expected due to fs-write seal map 2 prot-write failed as expected due to seal : Permission denied map 3 prot-read passed as expected Cc: jr...@google.com Cc: john.stu...@linaro.org Cc: tk...@google.com Cc: gre...@linuxfoundation.org Signed-off-by: Joel Fernandes (Google) --- include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 6 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 6448cdd9a350..bafd1233b6a8 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -41,6 +41,7 @@ #define F_SEAL_SHRINK 0x0002 /* prevent file from shrinking */ #define F_SEAL_GROW0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ +#define F_SEAL_FS_WRITE0x0010 /* prevent all write-related syscalls */ /* (1U << 31) is reserved for signed error codes */ /* diff --git a/mm/memfd.c b/mm/memfd.c index 2bb5e257080e..4b09f446c2d2 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) #define F_ALL_SEALS (F_SEAL_SEAL | \ F_SEAL_SHRINK | \ F_SEAL_GROW | \ -F_SEAL_WRITE) +F_SEAL_WRITE | \ +F_SEAL_FS_WRITE) static int memfd_add_seals(struct file *file, unsigned int seals) { @@ -219,6 +220,9 @@ static int memfd_add_seals(struct file *file, unsigned int seals) } } + if ((seals & F_SEAL_FS_WRITE) && !(*file_seals & F_SEAL_FS_WRITE)) + file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE); + *file_seals |= seals; error = 0; -- 2.19.0.605.g01d371f741-goog
[PATCH] doc: rcu: Fix code listing in performance and scalability requirements
The code listing under this section has a quick quiz that says line 19 uses rcu_access_pointer, but the code listing itself does not. Fix this. Signed-off-by: Joel Fernandes (Google) --- .../RCU/Design/Requirements/Requirements.html| 2 +- kernel/sys.c | 16 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 4fae55056c1d..f74a2233865c 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -1596,7 +1596,7 @@ used in place of synchronize_rcu() as follows: 16 struct foo *p; 17 18 spin_lock(gp_lock); -19 p = rcu_dereference(gp); +19 p = rcu_access_pointer(gp); 20 if (!p) { 21 spin_unlock(gp_lock); 22 return false; -- 2.19.0.605.g01d371f741-goog
[PATCH for-stable] dmaengine: stm32-dma: fix incomplete configuration in cyclic mode
From: Pierre Yves MORDRET commit e57cb3b3f10d005410f09d4598cc6d62b833f2b0 upstream. When in cyclic mode, the configuration is updated after having started the DMA hardware (STM32_DMA_SCR_EN) leading to incomplete configuration of SMxAR registers. Signed-off-by: Pierre-Yves MORDRET Signed-off-by: Hugues Fruchet Signed-off-by: Vinod Koul --- drivers/dma/stm32-dma.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c index 4099948b6914..fae7de54f00a 100644 --- a/drivers/dma/stm32-dma.c +++ b/drivers/dma/stm32-dma.c @@ -441,6 +441,8 @@ static void stm32_dma_dump_reg(struct stm32_dma_chan *chan) dev_dbg(chan2dev(chan), "SFCR: 0x%08x\n", sfcr); } +static void stm32_dma_configure_next_sg(struct stm32_dma_chan *chan); + static void stm32_dma_start_transfer(struct stm32_dma_chan *chan) { struct stm32_dma_device *dmadev = stm32_dma_get_dev(chan); @@ -483,6 +485,9 @@ static void stm32_dma_start_transfer(struct stm32_dma_chan *chan) if (status) stm32_dma_irq_clear(chan, status); + if (chan->desc->cyclic) + stm32_dma_configure_next_sg(chan); + stm32_dma_dump_reg(chan); /* Start DMA */ @@ -576,8 +581,7 @@ static void stm32_dma_issue_pending(struct dma_chan *c) if (vchan_issue_pending(>vchan) && !chan->desc && !chan->busy) { dev_dbg(chan2dev(chan), "vchan %p: issued\n", >vchan); stm32_dma_start_transfer(chan); - if (chan->desc->cyclic) - stm32_dma_configure_next_sg(chan); + } spin_unlock_irqrestore(>vchan.lock, flags); } -- 2.19.0.605.g01d371f741-goog
[PATCH v3 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
Android uses ashmem for sharing memory regions. We are looking forward to migrating all usecases of ashmem to memfd so that we can possibly remove the ashmem driver in the future from staging while also benefiting from using memfd and contributing to it. Note staging drivers are also not ABI and generally can be removed at anytime. One of the main usecases Android has is the ability to create a region and mmap it as writeable, then add protection against making any "future" writes while keeping the existing already mmap'ed writeable-region active. This allows us to implement a usecase where receivers of the shared memory buffer can get a read-only view, while the sender continues to write to the buffer. See CursorWindow documentation in Android for more details: https://developer.android.com/reference/android/database/CursorWindow This usecase cannot be implemented with the existing F_SEAL_WRITE seal. To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal which prevents any future mmap and write syscalls from succeeding while keeping the existing mmap active. The following program shows the seal working in action: #include #include #include #include #include #include #include #define F_SEAL_FUTURE_WRITE 0x0010 #define REGION_SIZE (5 * 1024 * 1024) int memfd_create_region(const char *name, size_t size) { int ret; int fd = syscall(__NR_memfd_create, name, MFD_ALLOW_SEALING); if (fd < 0) return fd; ret = ftruncate(fd, size); if (ret < 0) { close(fd); return ret; } return fd; } int main() { int ret, fd; void *addr, *addr2, *addr3, *addr1; ret = memfd_create_region("test_region", REGION_SIZE); printf("ret=%d\n", ret); fd = ret; // Create map addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr == MAP_FAILED) printf("map 0 failed\n"); else printf("map 0 passed\n"); if ((ret = write(fd, "test", 4)) != 4) printf("write failed even though no future-write seal " "(ret=%d errno =%d)\n", ret, errno); else printf("write passed\n"); addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr1 == MAP_FAILED) perror("map 1 prot-write failed even though no seal\n"); else printf("map 1 prot-write passed as expected\n"); ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE | F_SEAL_GROW | F_SEAL_SHRINK); if (ret == -1) printf("fcntl failed, errno: %d\n", errno); else printf("future-write seal now active\n"); if ((ret = write(fd, "test", 4)) != 4) printf("write failed as expected due to future-write seal\n"); else printf("write passed (unexpected)\n"); addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr2 == MAP_FAILED) perror("map 2 prot-write failed as expected due to seal\n"); else printf("map 2 passed\n"); addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0); if (addr3 == MAP_FAILED) perror("map 3 failed\n"); else printf("map 3 prot-read passed as expected\n"); } The output of running this program is as follows: ret=3 map 0 passed write passed map 1 prot-write passed as expected future-write seal now active write failed as expected due to future-write seal map 2 prot-write failed as expected due to seal : Permission denied map 3 prot-read passed as expected Cc: jr...@google.com Cc: john.stu...@linaro.org Cc: tk...@google.com Cc: gre...@linuxfoundation.org Cc: h...@infradead.org Reviewed-by: John Stultz Signed-off-by: Joel Fernandes (Google) --- v1->v2: No change, just added selftests to the series. manpages are ready and I'll submit them once the patches are accepted. v2->v3: Updated commit message to have more support code (John Stultz) Renamed seal from F_SEAL_FS_WRITE to F_SEAL_FUTURE_WRITE (Christoph Hellwig) Allow for this seal only if grow/shrink seals are also either previous set, or are requested along with this seal. (Christoph Hellwig) Added locking to synchronize access to file->f_mode. (Christoph Hellwig) include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 22 +- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 6448cdd9a350..a2f8658f1c55 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -41,6 +41,7
[PATCH v3 2/2] selftests/memfd: Add tests for F_SEAL_FS_WRITE seal
Add tests to verify sealing memfds with the F_SEAL_FS_WRITE works as expected. Cc: dan...@google.com Cc: minc...@kernel.org Reviewed-by: John Stultz Signed-off-by: Joel Fernandes (Google) --- tools/testing/selftests/memfd/memfd_test.c | 74 ++ 1 file changed, 74 insertions(+) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 10baa1652fc2..32b207ca7372 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -692,6 +692,79 @@ static void test_seal_write(void) close(fd); } +/* + * Test SEAL_FUTURE_WRITE + * Test whether SEAL_FUTURE_WRITE actually prevents modifications. + */ +static void test_seal_future_write(void) +{ + int fd; + void *p; + + printf("%s SEAL-FUTURE-WRITE\n", memfd_str); + + fd = mfd_assert_new("kern_memfd_seal_future_write", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + p = mfd_assert_mmap_shared(fd); + + mfd_assert_has_seals(fd, 0); + /* Not adding grow/shrink seals makes the future write +* seal fail to get added +*/ + mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE); + + mfd_assert_add_seals(fd, F_SEAL_GROW); + mfd_assert_has_seals(fd, F_SEAL_GROW); + + /* Should still fail since shrink seal has +* not yet been added +*/ + mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE); + + mfd_assert_add_seals(fd, F_SEAL_SHRINK); + mfd_assert_has_seals(fd, F_SEAL_GROW | +F_SEAL_SHRINK); + + /* Now should succeed, also verifies that the seal +* could be added with an existing writable mmap +*/ + mfd_assert_add_seals(fd, F_SEAL_FUTURE_WRITE); + mfd_assert_has_seals(fd, F_SEAL_SHRINK | +F_SEAL_GROW | +F_SEAL_FUTURE_WRITE); + + /* read should pass, writes should fail */ + mfd_assert_read(fd); + mfd_fail_write(fd); + + munmap(p, mfd_def_size); + close(fd); + + /* Test adding all seals (grow, shrink, future write) at once */ + fd = mfd_assert_new("kern_memfd_seal_future_write2", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + p = mfd_assert_mmap_shared(fd); + + mfd_assert_has_seals(fd, 0); + mfd_assert_add_seals(fd, F_SEAL_SHRINK | +F_SEAL_GROW | +F_SEAL_FUTURE_WRITE); + mfd_assert_has_seals(fd, F_SEAL_SHRINK | +F_SEAL_GROW | +F_SEAL_FUTURE_WRITE); + + /* read should pass, writes should fail */ + mfd_assert_read(fd); + mfd_fail_write(fd); + + munmap(p, mfd_def_size); + close(fd); +} + /* * Test SEAL_SHRINK * Test whether SEAL_SHRINK actually prevents shrinking @@ -945,6 +1018,7 @@ int main(int argc, char **argv) test_basic(); test_seal_write(); + test_seal_future_write(); test_seal_shrink(); test_seal_grow(); test_seal_resize(); -- 2.19.1.331.ge82ca0e54c-goog
[PATCH] doc: correct parameter in stallwarn
The stallwarn document incorrectly mentions 'fps=' instead of 'fqs='. Correct that. Signed-off-by: Joel Fernandes (Google) --- Documentation/RCU/stallwarn.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index b01bcafc64aa..073dbc12d1ea 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt @@ -205,7 +205,7 @@ handlers are no longer able to execute on this CPU. This can happen if the stalled CPU is spinning with interrupts are disabled, or, in -rt kernels, if a high-priority process is starving RCU's softirq handler. -The "fps=" shows the number of force-quiescent-state idle/offline +The "fqs=" shows the number of force-quiescent-state idle/offline detection passes that the grace-period kthread has made across this CPU since the last time that this CPU noted the beginning of a grace period. -- 2.19.1.568.g152ad8e336-goog
[RFC 1/6] pstore: map pstore types to names
In later patches we will need to map types to names, so create a table for that which can also be used and reused in different parts of old and new code. Also use it to save the type in the PRZ which will be useful in later patches. Signed-off-by: Joel Fernandes (Google) --- fs/pstore/inode.c | 44 -- fs/pstore/ram.c| 4 +++- include/linux/pstore.h | 29 + include/linux/pstore_ram.h | 2 ++ 4 files changed, 57 insertions(+), 22 deletions(-) diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c index 5fcb845b9fec..43757049d384 100644 --- a/fs/pstore/inode.c +++ b/fs/pstore/inode.c @@ -304,6 +304,7 @@ int pstore_mkfile(struct dentry *root, struct pstore_record *record) struct dentry *dentry; struct inode*inode; int rc = 0; + enum pstore_type_id type; charname[PSTORE_NAMELEN]; struct pstore_private *private, *pos; unsigned long flags; @@ -335,43 +336,44 @@ int pstore_mkfile(struct dentry *root, struct pstore_record *record) goto fail_alloc; private->record = record; - switch (record->type) { + type = record->type; + switch (type) { case PSTORE_TYPE_DMESG: - scnprintf(name, sizeof(name), "dmesg-%s-%llu%s", - record->psi->name, record->id, - record->compressed ? ".enc.z" : ""); + scnprintf(name, sizeof(name), "%s-%s-%llu%s", +pstore_names[type], record->psi->name, record->id, +record->compressed ? ".enc.z" : ""); break; case PSTORE_TYPE_CONSOLE: - scnprintf(name, sizeof(name), "console-%s-%llu", - record->psi->name, record->id); + scnprintf(name, sizeof(name), "%s-%s-%llu", + pstore_names[type], record->psi->name, record->id); break; case PSTORE_TYPE_FTRACE: - scnprintf(name, sizeof(name), "ftrace-%s-%llu", - record->psi->name, record->id); + scnprintf(name, sizeof(name), "%s-%s-%llu", + pstore_names[type], record->psi->name, record->id); break; case PSTORE_TYPE_MCE: - scnprintf(name, sizeof(name), "mce-%s-%llu", - record->psi->name, record->id); + scnprintf(name, sizeof(name), "%s-%s-%llu", + pstore_names[type], record->psi->name, record->id); break; case PSTORE_TYPE_PPC_RTAS: - scnprintf(name, sizeof(name), "rtas-%s-%llu", - record->psi->name, record->id); + scnprintf(name, sizeof(name), "%s-%s-%llu", + pstore_names[type], record->psi->name, record->id); break; case PSTORE_TYPE_PPC_OF: - scnprintf(name, sizeof(name), "powerpc-ofw-%s-%llu", - record->psi->name, record->id); + scnprintf(name, sizeof(name), "%s-%s-%llu", + pstore_names[type], record->psi->name, record->id); break; case PSTORE_TYPE_PPC_COMMON: - scnprintf(name, sizeof(name), "powerpc-common-%s-%llu", - record->psi->name, record->id); + scnprintf(name, sizeof(name), "%s-%s-%llu", + pstore_names[type], record->psi->name, record->id); break; case PSTORE_TYPE_PMSG: - scnprintf(name, sizeof(name), "pmsg-%s-%llu", - record->psi->name, record->id); + scnprintf(name, sizeof(name), "%s-%s-%llu", + pstore_names[type], record->psi->name, record->id); break; case PSTORE_TYPE_PPC_OPAL: - scnprintf(name, sizeof(name), "powerpc-opal-%s-%llu", - record->psi->name, record->id); + scnprintf(name, sizeof(name), "%s-%s-%llu", + pstore_names[type], record->psi->name, record->id); break; case PSTORE_TYPE_UNKNOWN: scnprintf(name, sizeof(name), "unknown-%s-%llu", @@ -379,7 +381,7 @@ int pstore_mkfile(struct dentry *root, struct pstore_record *record) break; default: scnprintf(name, sizeof(name), "type%d-
[RFC 5/6] pstore: donot treat empty buffers as valid
pstore currently calls persistent_ram_save_old even if a buffer is empty. While this appears to work, it is simply not the right thing to do and could lead to bugs so lets avoid that. It also prevent misleading prints in the logs which claim the buffer is valid. Signed-off-by: Joel Fernandes (Google) --- fs/pstore/ram_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c index 0792595ebcfb..1299aa3ea734 100644 --- a/fs/pstore/ram_core.c +++ b/fs/pstore/ram_core.c @@ -495,7 +495,7 @@ static int persistent_ram_post_init(struct persistent_ram_zone *prz, u32 sig, sig ^= PERSISTENT_RAM_SIG; - if (prz->buffer->sig == sig) { + if (prz->buffer->sig == sig && buffer_size(prz)) { if (buffer_size(prz) > prz->buffer_size || buffer_start(prz) > buffer_size(prz)) pr_info("found existing invalid buffer, size %zu, start %zu\n", -- 2.19.1.568.g152ad8e336-goog
[RFC 2/6] pstore: remove type argument from ramoops_get_next_prz
Since we store the type of the prz when we initialize it, we no longer need to pass it again in ramoops_get_next_prz since we can just use that to setup the pstore record. So lets remove it from the argument list. Signed-off-by: Joel Fernandes (Google) --- fs/pstore/ram.c | 20 +++- 1 file changed, 7 insertions(+), 13 deletions(-) diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c index c7cd858adce7..cbfdf4b8e89d 100644 --- a/fs/pstore/ram.c +++ b/fs/pstore/ram.c @@ -125,9 +125,7 @@ static int ramoops_pstore_open(struct pstore_info *psi) static struct persistent_ram_zone * ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max, -u64 *id, -enum pstore_type_id *typep, enum pstore_type_id type, -bool update) +u64 *id, enum pstore_type_id *typep, bool update) { struct persistent_ram_zone *prz; int i = (*c)++; @@ -147,7 +145,7 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max, if (!persistent_ram_old_size(prz)) return NULL; - *typep = type; + *typep = prz->type; *id = i; return prz; @@ -257,8 +255,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) { prz = ramoops_get_next_prz(cxt->dprzs, >dump_read_cnt, cxt->max_dump_cnt, >id, - >type, - PSTORE_TYPE_DMESG, 1); + >type, 1); if (!prz_ok(prz)) continue; header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz), @@ -274,20 +271,18 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) if (!prz_ok(prz)) prz = ramoops_get_next_prz(>cprz, >console_read_cnt, - 1, >id, >type, - PSTORE_TYPE_CONSOLE, 0); + 1, >id, >type, 0); if (!prz_ok(prz)) prz = ramoops_get_next_prz(>mprz, >pmsg_read_cnt, - 1, >id, >type, - PSTORE_TYPE_PMSG, 0); + 1, >id, >type, 0); /* ftrace is last since it may want to dynamically allocate memory. */ if (!prz_ok(prz)) { if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) { prz = ramoops_get_next_prz(cxt->fprzs, >ftrace_read_cnt, 1, >id, - >type, PSTORE_TYPE_FTRACE, 0); + >type, 0); } else { /* * Build a new dummy record which combines all the @@ -306,8 +301,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) >ftrace_read_cnt, cxt->max_ftrace_cnt, >id, - >type, - PSTORE_TYPE_FTRACE, 0); + >type, 0); if (!prz_ok(prz_next)) continue; -- 2.19.1.568.g152ad8e336-goog
[RFC 4/6] pstore: further reduce ramoops_get_next_prz arguments by passing record
Both the id and type fields of a pstore_record are set by ramoops_get_next_prz. So we can just pass a pointer to the pstore_record instead of passing individual elements. This results in cleaner more readable code and fewer lines. Signed-off-by: Joel Fernandes (Google) --- fs/pstore/ram.c | 18 -- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c index 3055e05acab1..710c3d30bac0 100644 --- a/fs/pstore/ram.c +++ b/fs/pstore/ram.c @@ -125,7 +125,7 @@ static int ramoops_pstore_open(struct pstore_info *psi) static struct persistent_ram_zone * ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, -u64 *id, enum pstore_type_id *typep, bool update) +struct pstore_record *record, bool update) { struct persistent_ram_zone *prz; int i = (*c)++; @@ -145,8 +145,8 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, if (!persistent_ram_old_size(prz)) return NULL; - *typep = prz->type; - *id = i; + record->type = prz->type; + record->id = i; return prz; } @@ -254,7 +254,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) /* Find the next valid persistent_ram_zone for DMESG */ while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) { prz = ramoops_get_next_prz(cxt->dprzs, >dump_read_cnt, - >id, >type, 1); + record, 1); if (!prz_ok(prz)) continue; header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz), @@ -270,18 +270,17 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) if (!prz_ok(prz)) prz = ramoops_get_next_prz(>cprz, >console_read_cnt, - >id, >type, 0); + record, 0); if (!prz_ok(prz)) prz = ramoops_get_next_prz(>mprz, >pmsg_read_cnt, - >id, >type, 0); + record, 0); /* ftrace is last since it may want to dynamically allocate memory. */ if (!prz_ok(prz)) { if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) { prz = ramoops_get_next_prz(cxt->fprzs, - >ftrace_read_cnt, >id, - >type, 0); + >ftrace_read_cnt, record, 0); } else { /* * Build a new dummy record which combines all the @@ -298,8 +297,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) while (cxt->ftrace_read_cnt < cxt->max_ftrace_cnt) { prz_next = ramoops_get_next_prz(cxt->fprzs, >ftrace_read_cnt, - >id, - >type, 0); + record, 0); if (!prz_ok(prz_next)) continue; -- 2.19.1.568.g152ad8e336-goog
[RFC 3/6] pstore: remove max argument from ramoops_get_next_prz
>From the code flow, the 'max' checks are already being done on the prz passed to ramoops_get_next_prz. Lets remove it to simplify this function and reduce its arguments. Signed-off-by: Joel Fernandes (Google) --- fs/pstore/ram.c | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c index cbfdf4b8e89d..3055e05acab1 100644 --- a/fs/pstore/ram.c +++ b/fs/pstore/ram.c @@ -124,14 +124,14 @@ static int ramoops_pstore_open(struct pstore_info *psi) } static struct persistent_ram_zone * -ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max, +ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, u64 *id, enum pstore_type_id *typep, bool update) { struct persistent_ram_zone *prz; int i = (*c)++; /* Give up if we never existed or have hit the end. */ - if (!przs || i >= max) + if (!przs) return NULL; prz = przs[i]; @@ -254,8 +254,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) /* Find the next valid persistent_ram_zone for DMESG */ while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) { prz = ramoops_get_next_prz(cxt->dprzs, >dump_read_cnt, - cxt->max_dump_cnt, >id, - >type, 1); + >id, >type, 1); if (!prz_ok(prz)) continue; header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz), @@ -271,17 +270,17 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) if (!prz_ok(prz)) prz = ramoops_get_next_prz(>cprz, >console_read_cnt, - 1, >id, >type, 0); + >id, >type, 0); if (!prz_ok(prz)) prz = ramoops_get_next_prz(>mprz, >pmsg_read_cnt, - 1, >id, >type, 0); + >id, >type, 0); /* ftrace is last since it may want to dynamically allocate memory. */ if (!prz_ok(prz)) { if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) { prz = ramoops_get_next_prz(cxt->fprzs, - >ftrace_read_cnt, 1, >id, + >ftrace_read_cnt, >id, >type, 0); } else { /* @@ -299,7 +298,6 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) while (cxt->ftrace_read_cnt < cxt->max_ftrace_cnt) { prz_next = ramoops_get_next_prz(cxt->fprzs, >ftrace_read_cnt, - cxt->max_ftrace_cnt, >id, >type, 0); -- 2.19.1.568.g152ad8e336-goog
[RFC 6/6] Revert "pstore/ram_core: Do not reset restored zone's position and size"
This reverts commit 25b63da64708212985c06c7f8b089d356efdd9cf. Due to the commit which is being reverted here, it is not possible to know if pstore's messages were from a previous boot, or from very old boots. This creates an awkard situation where its unclear if crash or other logs are from the previous boot or from very old boots. Also typically we dump the pstore buffers after one reboot and are interested in only the previous boot's crash so let us reset the buffer after we save them. Lastly, if we don't zap them, then I think it is possible that part of the buffer will be from this boot and the other parts will be from previous boots. So this revert fixes all of this by calling persistent_ram_zap always. Signed-off-by: Joel Fernandes (Google) --- fs/pstore/ram_core.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c index 1299aa3ea734..67d74dd97da1 100644 --- a/fs/pstore/ram_core.c +++ b/fs/pstore/ram_core.c @@ -504,7 +504,6 @@ static int persistent_ram_post_init(struct persistent_ram_zone *prz, u32 sig, pr_debug("found existing buffer, size %zu, start %zu\n", buffer_size(prz), buffer_start(prz)); persistent_ram_save_old(prz); - return 0; } } else { pr_debug("no valid data in buffer (sig = 0x%08x)\n", -- 2.19.1.568.g152ad8e336-goog
[RFC] rcu: doc: update example about stale data
The RCU example for 'rejecting stale data' on system-call auditting stops iterating through the rules if a deleted one is found. It makes more sense to continue looking at other rules once a deleted one is rejected. Although the original example is fine, this makes it more meaningful. Signed-off-by: Joel Fernandes (Google) --- Documentation/RCU/listRCU.txt | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt index adb5a3782846..09e9a4fc723e 100644 --- a/Documentation/RCU/listRCU.txt +++ b/Documentation/RCU/listRCU.txt @@ -250,8 +250,7 @@ as follows: spin_lock(>lock); if (e->deleted) { spin_unlock(>lock); - rcu_read_unlock(); - return AUDIT_BUILD_CONTEXT; + continue; } rcu_read_unlock(); return state; -- 2.19.1.568.g152ad8e336-goog
[PATCH RFC v2 0/3] cleanups for pstore and ramoops
Here are some simple cleanups and fixes for ramoops in pstore. Let me know what you think, thanks. Joel Fernandes (Google) (3): pstore: map pstore types to names pstore: simplify ramoops_get_next_prz arguments pstore: donot treat empty buffers as valid fs/pstore/inode.c | 53 +- fs/pstore/ram.c| 52 +++-- fs/pstore/ram_core.c | 2 +- include/linux/pstore.h | 37 ++ include/linux/pstore_ram.h | 2 ++ 5 files changed, 67 insertions(+), 79 deletions(-) -- 2.19.1.930.g4563a0d9d0-goog
[PATCH RFC v2 1/3] pstore: map pstore types to names
In later patches we will need to map types to names, so create a table for that which can also be used and reused in different parts of old and new code. Also use it to save the type in the PRZ which will be useful in later patches. Signed-off-by: Joel Fernandes (Google) --- fs/pstore/inode.c | 53 +- fs/pstore/ram.c| 4 ++- include/linux/pstore.h | 37 ++ include/linux/pstore_ram.h | 2 ++ 4 files changed, 48 insertions(+), 48 deletions(-) diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c index 8cf2218b46a7..c5c6b8b4b70a 100644 --- a/fs/pstore/inode.c +++ b/fs/pstore/inode.c @@ -304,6 +304,7 @@ int pstore_mkfile(struct dentry *root, struct pstore_record *record) struct dentry *dentry; struct inode*inode; int rc = 0; + enum pstore_type_id type; charname[PSTORE_NAMELEN]; struct pstore_private *private, *pos; unsigned long flags; @@ -335,53 +336,11 @@ int pstore_mkfile(struct dentry *root, struct pstore_record *record) goto fail_alloc; private->record = record; - switch (record->type) { - case PSTORE_TYPE_DMESG: - scnprintf(name, sizeof(name), "dmesg-%s-%llu%s", - record->psi->name, record->id, - record->compressed ? ".enc.z" : ""); - break; - case PSTORE_TYPE_CONSOLE: - scnprintf(name, sizeof(name), "console-%s-%llu", - record->psi->name, record->id); - break; - case PSTORE_TYPE_FTRACE: - scnprintf(name, sizeof(name), "ftrace-%s-%llu", - record->psi->name, record->id); - break; - case PSTORE_TYPE_MCE: - scnprintf(name, sizeof(name), "mce-%s-%llu", - record->psi->name, record->id); - break; - case PSTORE_TYPE_PPC_RTAS: - scnprintf(name, sizeof(name), "rtas-%s-%llu", - record->psi->name, record->id); - break; - case PSTORE_TYPE_PPC_OF: - scnprintf(name, sizeof(name), "powerpc-ofw-%s-%llu", - record->psi->name, record->id); - break; - case PSTORE_TYPE_PPC_COMMON: - scnprintf(name, sizeof(name), "powerpc-common-%s-%llu", - record->psi->name, record->id); - break; - case PSTORE_TYPE_PMSG: - scnprintf(name, sizeof(name), "pmsg-%s-%llu", - record->psi->name, record->id); - break; - case PSTORE_TYPE_PPC_OPAL: - scnprintf(name, sizeof(name), "powerpc-opal-%s-%llu", - record->psi->name, record->id); - break; - case PSTORE_TYPE_UNKNOWN: - scnprintf(name, sizeof(name), "unknown-%s-%llu", - record->psi->name, record->id); - break; - default: - scnprintf(name, sizeof(name), "type%d-%s-%llu", - record->type, record->psi->name, record->id); - break; - } + scnprintf(name, sizeof(name), "%s-%s-%llu%s", + pstore_type_to_name(record->type), + record->psi->name, record->id, + (record->type == PSTORE_TYPE_DMESG +&& record->compressed) ? ".enc.z" : ""); dentry = d_alloc_name(root, name); if (!dentry) diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c index 10ac4d23c423..b174d0fc009f 100644 --- a/fs/pstore/ram.c +++ b/fs/pstore/ram.c @@ -611,6 +611,7 @@ static int ramoops_init_przs(const char *name, goto fail; } *paddr += zone_sz; + prz_ar[i]->type = pstore_name_to_type(name); } *przs = prz_ar; @@ -650,6 +651,7 @@ static int ramoops_init_prz(const char *name, } *paddr += sz; + (*prz)->type = pstore_name_to_type(name); return 0; } @@ -785,7 +787,7 @@ static int ramoops_probe(struct platform_device *pdev) dump_mem_sz = cxt->size - cxt->console_size - cxt->ftrace_size - cxt->pmsg_size; - err = ramoops_init_przs("dump", dev, cxt, >dprzs, , + err = ramoops_init_przs("dmesg", dev, cxt, >dprzs, , dump_mem_sz, cxt->record_size, >
[PATCH RFC v2 3/3] pstore: donot treat empty buffers as valid
pstore currently calls persistent_ram_save_old even if a buffer is empty. While this appears to work, it is does not seem like the right thing to do and could lead to future bugs so lets avoid that. It also prevent misleading prints in the logs which claim the buffer is valid. I got something like: found existing buffer, size 0, start 0 When I was expecting: no valid data in buffer (sig = ...) Signed-off-by: Joel Fernandes (Google) --- Note that if you feel this patch is not necessary, then feel free to drop it. I would say it is harmless and is a good clean up. fs/pstore/ram_core.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/pstore/ram_core.c b/fs/pstore/ram_core.c index e6375439c5ac..196e4fd7ba8c 100644 --- a/fs/pstore/ram_core.c +++ b/fs/pstore/ram_core.c @@ -510,7 +510,7 @@ static int persistent_ram_post_init(struct persistent_ram_zone *prz, u32 sig, sig ^= PERSISTENT_RAM_SIG; - if (prz->buffer->sig == sig) { + if (prz->buffer->sig == sig && buffer_size(prz)) { if (buffer_size(prz) > prz->buffer_size || buffer_start(prz) > buffer_size(prz)) { pr_info("found existing invalid buffer, size %zu, start %zu\n", -- 2.19.1.930.g4563a0d9d0-goog
[PATCH RFC v2 2/3] pstore: simplify ramoops_get_next_prz arguments
(1) remove type argument from ramoops_get_next_prz Since we store the type of the prz when we initialize it, we no longer need to pass it again in ramoops_get_next_prz since we can just use that to setup the pstore record. So lets remove it from the argument list. (2) remove max argument from ramoops_get_next_prz >From the code flow, the 'max' checks are already being done on the prz passed to ramoops_get_next_prz. Lets remove it to simplify this function and reduce its arguments. (3) further reduce ramoops_get_next_prz arguments by passing record Both the id and type fields of a pstore_record are set by ramoops_get_next_prz. So we can just pass a pointer to the pstore_record instead of passing individual elements. This results in cleaner more readable code and fewer lines. In addition lets also remove the 'update' argument since we can detect that. Changes are squashed into a single patch to reduce fixup conflicts. Signed-off-by: Joel Fernandes (Google) --- fs/pstore/ram.c | 48 ++-- 1 file changed, 18 insertions(+), 30 deletions(-) diff --git a/fs/pstore/ram.c b/fs/pstore/ram.c index b174d0fc009f..202eaa82bcc6 100644 --- a/fs/pstore/ram.c +++ b/fs/pstore/ram.c @@ -124,19 +124,17 @@ static int ramoops_pstore_open(struct pstore_info *psi) } static struct persistent_ram_zone * -ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max, -u64 *id, -enum pstore_type_id *typep, enum pstore_type_id type, -bool update) +ramoops_get_next_prz(struct persistent_ram_zone *przs[], int id, +struct pstore_record *record) { struct persistent_ram_zone *prz; - int i = (*c)++; + bool update = (record->type == PSTORE_TYPE_DMESG); /* Give up if we never existed or have hit the end. */ - if (!przs || i >= max) + if (!przs) return NULL; - prz = przs[i]; + prz = przs[id]; if (!prz) return NULL; @@ -147,8 +145,8 @@ ramoops_get_next_prz(struct persistent_ram_zone *przs[], uint *c, uint max, if (!persistent_ram_old_size(prz)) return NULL; - *typep = type; - *id = i; + record->type = prz->type; + record->id = id; return prz; } @@ -255,10 +253,8 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) /* Find the next valid persistent_ram_zone for DMESG */ while (cxt->dump_read_cnt < cxt->max_dump_cnt && !prz) { - prz = ramoops_get_next_prz(cxt->dprzs, >dump_read_cnt, - cxt->max_dump_cnt, >id, - >type, - PSTORE_TYPE_DMESG, 1); + prz = ramoops_get_next_prz(cxt->dprzs, cxt->dump_read_cnt++, + record); if (!prz_ok(prz)) continue; header_length = ramoops_read_kmsg_hdr(persistent_ram_old(prz), @@ -272,22 +268,18 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) } } - if (!prz_ok(prz)) - prz = ramoops_get_next_prz(>cprz, >console_read_cnt, - 1, >id, >type, - PSTORE_TYPE_CONSOLE, 0); + if (!prz_ok(prz) && !cxt->console_read_cnt++) + prz = ramoops_get_next_prz(>cprz, 0 /* single */, record); - if (!prz_ok(prz)) - prz = ramoops_get_next_prz(>mprz, >pmsg_read_cnt, - 1, >id, >type, - PSTORE_TYPE_PMSG, 0); + if (!prz_ok(prz) && !cxt->pmsg_read_cnt++) + prz = ramoops_get_next_prz(>mprz, 0 /* single */, record); /* ftrace is last since it may want to dynamically allocate memory. */ if (!prz_ok(prz)) { - if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU)) { - prz = ramoops_get_next_prz(cxt->fprzs, - >ftrace_read_cnt, 1, >id, - >type, PSTORE_TYPE_FTRACE, 0); + if (!(cxt->flags & RAMOOPS_FLAG_FTRACE_PER_CPU) && + !cxt->ftrace_read_cnt++) { + prz = ramoops_get_next_prz(cxt->fprzs, 0 /* single */, + record); } else { /* * Build a new dummy record which combines all the @@ -303,11 +295,7 @@ static ssize_t ramoops_pstore_read(struct pstore_record *record) while (cxt->ftrace_read_cnt < cxt->m
[RFC] doc: rcu: remove note on smp_mb during synchronize_rcu
As per this thread [1], it seems this smp_mb isn't needed anymore: "So the smp_mb() that I was trying to add doesn't need to be there." So let us remove this part from the memory ordering documentation. [1] https://lkml.org/lkml/2017/10/6/707 Signed-off-by: Joel Fernandes (Google) --- .../Tree-RCU-Memory-Ordering.html | 32 +-- 1 file changed, 1 insertion(+), 31 deletions(-) diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html index a346ce0116eb..0fb1511763d4 100644 --- a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html @@ -77,7 +77,7 @@ The key point is that the lock-acquisition functions, including smp_mb__after_unlock_lock() immediately after successful acquisition of the lock. -Therefore, for any given rcu_node struction, any access +Therefore, for any given rcu_node structure, any access happening before one of the above lock-release functions will be seen by all CPUs as happening before any access happening after a later one of the above lock-acquisition functions. @@ -162,36 +162,6 @@ an atomic_add_return() of zero) to detect idle CPUs. -The approach must be extended to handle one final case, that -of waking a task blocked in synchronize_rcu(). -This task might be affinitied to a CPU that is not yet aware that -the grace period has ended, and thus might not yet be subject to -the grace period's memory ordering. -Therefore, there is an smp_mb() after the return from -wait_for_completion() in the synchronize_rcu() -code path. - - - -Quick Quiz: - - What? Where??? - I don't see any smp_mb() after the return from - wait_for_completion()!!! - -Answer: - - That would be because I spotted the need for that - smp_mb() during the creation of this documentation, - and it is therefore unlikely to hit mainline before v4.14. - Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and - Jonathan Cameron for asking questions that sensitized me - to the rather elaborate sequence of events that demonstrate - the need for this memory barrier. - - - - Tree RCU's grace--period memory-ordering guarantees rely most heavily on the rcu_node structure's -lock field, so much so that it is necessary to abbreviate this pattern -- 2.19.1.568.g152ad8e336-goog
[PATCH v3 resend 1/2] mm: Add an F_SEAL_FUTURE_WRITE seal to memfd
Android uses ashmem for sharing memory regions. We are looking forward to migrating all usecases of ashmem to memfd so that we can possibly remove the ashmem driver in the future from staging while also benefiting from using memfd and contributing to it. Note staging drivers are also not ABI and generally can be removed at anytime. One of the main usecases Android has is the ability to create a region and mmap it as writeable, then add protection against making any "future" writes while keeping the existing already mmap'ed writeable-region active. This allows us to implement a usecase where receivers of the shared memory buffer can get a read-only view, while the sender continues to write to the buffer. See CursorWindow documentation in Android for more details: https://developer.android.com/reference/android/database/CursorWindow This usecase cannot be implemented with the existing F_SEAL_WRITE seal. To support the usecase, this patch adds a new F_SEAL_FUTURE_WRITE seal which prevents any future mmap and write syscalls from succeeding while keeping the existing mmap active. The following program shows the seal working in action: #include #include #include #include #include #include #include #define F_SEAL_FUTURE_WRITE 0x0010 #define REGION_SIZE (5 * 1024 * 1024) int memfd_create_region(const char *name, size_t size) { int ret; int fd = syscall(__NR_memfd_create, name, MFD_ALLOW_SEALING); if (fd < 0) return fd; ret = ftruncate(fd, size); if (ret < 0) { close(fd); return ret; } return fd; } int main() { int ret, fd; void *addr, *addr2, *addr3, *addr1; ret = memfd_create_region("test_region", REGION_SIZE); printf("ret=%d\n", ret); fd = ret; // Create map addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr == MAP_FAILED) printf("map 0 failed\n"); else printf("map 0 passed\n"); if ((ret = write(fd, "test", 4)) != 4) printf("write failed even though no future-write seal " "(ret=%d errno =%d)\n", ret, errno); else printf("write passed\n"); addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr1 == MAP_FAILED) perror("map 1 prot-write failed even though no seal\n"); else printf("map 1 prot-write passed as expected\n"); ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE | F_SEAL_GROW | F_SEAL_SHRINK); if (ret == -1) printf("fcntl failed, errno: %d\n", errno); else printf("future-write seal now active\n"); if ((ret = write(fd, "test", 4)) != 4) printf("write failed as expected due to future-write seal\n"); else printf("write passed (unexpected)\n"); addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr2 == MAP_FAILED) perror("map 2 prot-write failed as expected due to seal\n"); else printf("map 2 passed\n"); addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0); if (addr3 == MAP_FAILED) perror("map 3 failed\n"); else printf("map 3 prot-read passed as expected\n"); } The output of running this program is as follows: ret=3 map 0 passed write passed map 1 prot-write passed as expected future-write seal now active write failed as expected due to future-write seal map 2 prot-write failed as expected due to seal : Permission denied map 3 prot-read passed as expected Cc: jr...@google.com Cc: john.stu...@linaro.org Cc: tk...@google.com Cc: gre...@linuxfoundation.org Cc: h...@infradead.org Reviewed-by: John Stultz Signed-off-by: Joel Fernandes (Google) --- v1->v2: No change, just added selftests to the series. manpages are ready and I'll submit them once the patches are accepted. v2->v3: Updated commit message to have more support code (John Stultz) Renamed seal from F_SEAL_FS_WRITE to F_SEAL_FUTURE_WRITE (Christoph Hellwig) Allow for this seal only if grow/shrink seals are also either previous set, or are requested along with this seal. (Christoph Hellwig) Added locking to synchronize access to file->f_mode. (Christoph Hellwig) include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 22 +- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 6448cdd9a350..a2f8658f1c55 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -41,6 +41,7
[PATCH v3 resend 2/2] selftests/memfd: Add tests for F_SEAL_FUTURE_WRITE seal
Add tests to verify sealing memfds with the F_SEAL_FUTURE_WRITE works as expected. Cc: dan...@google.com Cc: minc...@kernel.org Reviewed-by: John Stultz Signed-off-by: Joel Fernandes (Google) --- tools/testing/selftests/memfd/memfd_test.c | 74 ++ 1 file changed, 74 insertions(+) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 10baa1652fc2..32b207ca7372 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -692,6 +692,79 @@ static void test_seal_write(void) close(fd); } +/* + * Test SEAL_FUTURE_WRITE + * Test whether SEAL_FUTURE_WRITE actually prevents modifications. + */ +static void test_seal_future_write(void) +{ + int fd; + void *p; + + printf("%s SEAL-FUTURE-WRITE\n", memfd_str); + + fd = mfd_assert_new("kern_memfd_seal_future_write", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + p = mfd_assert_mmap_shared(fd); + + mfd_assert_has_seals(fd, 0); + /* Not adding grow/shrink seals makes the future write +* seal fail to get added +*/ + mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE); + + mfd_assert_add_seals(fd, F_SEAL_GROW); + mfd_assert_has_seals(fd, F_SEAL_GROW); + + /* Should still fail since shrink seal has +* not yet been added +*/ + mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE); + + mfd_assert_add_seals(fd, F_SEAL_SHRINK); + mfd_assert_has_seals(fd, F_SEAL_GROW | +F_SEAL_SHRINK); + + /* Now should succeed, also verifies that the seal +* could be added with an existing writable mmap +*/ + mfd_assert_add_seals(fd, F_SEAL_FUTURE_WRITE); + mfd_assert_has_seals(fd, F_SEAL_SHRINK | +F_SEAL_GROW | +F_SEAL_FUTURE_WRITE); + + /* read should pass, writes should fail */ + mfd_assert_read(fd); + mfd_fail_write(fd); + + munmap(p, mfd_def_size); + close(fd); + + /* Test adding all seals (grow, shrink, future write) at once */ + fd = mfd_assert_new("kern_memfd_seal_future_write2", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + p = mfd_assert_mmap_shared(fd); + + mfd_assert_has_seals(fd, 0); + mfd_assert_add_seals(fd, F_SEAL_SHRINK | +F_SEAL_GROW | +F_SEAL_FUTURE_WRITE); + mfd_assert_has_seals(fd, F_SEAL_SHRINK | +F_SEAL_GROW | +F_SEAL_FUTURE_WRITE); + + /* read should pass, writes should fail */ + mfd_assert_read(fd); + mfd_fail_write(fd); + + munmap(p, mfd_def_size); + close(fd); +} + /* * Test SEAL_SHRINK * Test whether SEAL_SHRINK actually prevents shrinking @@ -945,6 +1018,7 @@ int main(int argc, char **argv) test_basic(); test_seal_write(); + test_seal_future_write(); test_seal_shrink(); test_seal_grow(); test_seal_resize(); -- 2.19.1.930.g4563a0d9d0-goog
[PATCH] MAINTAINERS: Add me to Android drivers
I am one of the main engineers working on ashmem. I have been fixing bugs in the driver and have been involved in the memfd conversion discussions and sending patches about that. I also have an understanding of the binder driver and was involved with some of the development on finer grained locking. So I would like to be added to the MAINTAINERS file for android drivers for review and maintenance of ashmem and other Android drivers. Cc: tk...@google.com Cc: gre...@linuxfoundation.org Signed-off-by: Joel Fernandes (Google) --- MAINTAINERS | 1 + 1 file changed, 1 insertion(+) diff --git a/MAINTAINERS b/MAINTAINERS index 544cac829cf4..d639c4d04438 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -894,6 +894,7 @@ M: Greg Kroah-Hartman M: Arve Hjønnevåg M: Todd Kjos M: Martijn Coenen +M: Joel Fernandes T: git git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git L: de...@driverdev.osuosl.org S: Supported -- 2.19.0.605.g01d371f741-goog
[PATCH 2/7] dmaengine: stm32-dma: fix incomplete configuration in cyclic mode
From: Pierre Yves MORDRET When in cyclic mode, the configuration is updated after having started the DMA hardware (STM32_DMA_SCR_EN) leading to incomplete configuration of SMxAR registers. Signed-off-by: Pierre-Yves MORDRET Signed-off-by: Hugues Fruchet Signed-off-by: Vinod Koul --- drivers/dma/stm32-dma.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c index 4099948b6914..fae7de54f00a 100644 --- a/drivers/dma/stm32-dma.c +++ b/drivers/dma/stm32-dma.c @@ -441,6 +441,8 @@ static void stm32_dma_dump_reg(struct stm32_dma_chan *chan) dev_dbg(chan2dev(chan), "SFCR: 0x%08x\n", sfcr); } +static void stm32_dma_configure_next_sg(struct stm32_dma_chan *chan); + static void stm32_dma_start_transfer(struct stm32_dma_chan *chan) { struct stm32_dma_device *dmadev = stm32_dma_get_dev(chan); @@ -483,6 +485,9 @@ static void stm32_dma_start_transfer(struct stm32_dma_chan *chan) if (status) stm32_dma_irq_clear(chan, status); + if (chan->desc->cyclic) + stm32_dma_configure_next_sg(chan); + stm32_dma_dump_reg(chan); /* Start DMA */ @@ -576,8 +581,7 @@ static void stm32_dma_issue_pending(struct dma_chan *c) if (vchan_issue_pending(>vchan) && !chan->desc && !chan->busy) { dev_dbg(chan2dev(chan), "vchan %p: issued\n", >vchan); stm32_dma_start_transfer(chan); - if (chan->desc->cyclic) - stm32_dma_configure_next_sg(chan); + } spin_unlock_irqrestore(>vchan.lock, flags); } -- 2.19.0.605.g01d371f741-goog
[PATCH 0/7] NULL pointer deref fix for stm32-dma
Hi Greg, While looking at android-4.14, I found a NULL pointer deref with stm32-dma driver using Coccicheck errors. I found that upstream had a bunch of patches on stm32-dma that have fixed this and other issues, I applied these patches cleanly onto Android 4.14. I believe these should goto stable and flow into Android 4.14 from there, but I haven't tested this since I have no hardware to do so. Atleast I can say that the coccicheck error below goes away when running: make coccicheck MODE=report ./drivers/dma/stm32-dma.c:567:18-24: ERROR: chan -> desc is NULL but dereferenced. Anyway, please consider this series for 4.14 stable, I have CC'd the author and others, thanks. Pierre Yves MORDRET (7): dmaengine: stm32-dma: threshold manages with bitfield feature dmaengine: stm32-dma: fix incomplete configuration in cyclic mode dmaengine: stm32-dma: fix typo and reported checkpatch warnings dmaengine: stm32-dma: Improve memory burst management dmaengine: stm32-dma: fix DMA IRQ status handling dmaengine: stm32-dma: fix max items per transfer dmaengine: stm32-dma: properly mask irq bits drivers/dma/stm32-dma.c | 287 +--- 1 file changed, 240 insertions(+), 47 deletions(-) -- 2.19.0.605.g01d371f741-goog
[PATCH 4/7] dmaengine: stm32-dma: Improve memory burst management
From: Pierre Yves MORDRET This patch improves memory burst capability using best burst size according to transferred buffer size from/to memory. >From now on, memory burst is not necessarily same as with peripheral burst one and fifo threshold is directly managed by this driver in order to fit with computed memory burst. Signed-off-by: M'boumba Cedric Madianga Signed-off-by: Pierre-Yves MORDRET Signed-off-by: Vinod Koul --- drivers/dma/stm32-dma.c | 204 ++-- 1 file changed, 175 insertions(+), 29 deletions(-) diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c index b64e14a83dec..21ad359a5a59 100644 --- a/drivers/dma/stm32-dma.c +++ b/drivers/dma/stm32-dma.c @@ -5,6 +5,7 @@ * * Copyright (C) M'boumba Cedric Madianga 2015 * Author: M'boumba Cedric Madianga + * Pierre-Yves Mordret * * License terms: GNU General Public License (GPL), version 2 */ @@ -115,6 +116,8 @@ #define STM32_DMA_MAX_CHANNELS 0x08 #define STM32_DMA_MAX_REQUEST_ID 0x08 #define STM32_DMA_MAX_DATA_PARAM 0x03 +#define STM32_DMA_FIFO_SIZE16 /* FIFO is 16 bytes */ +#define STM32_DMA_MIN_BURST4 #define STM32_DMA_MAX_BURST16 /* DMA Features */ @@ -184,6 +187,8 @@ struct stm32_dma_chan { struct dma_slave_config dma_sconfig; struct stm32_dma_chan_reg chan_reg; u32 threshold; + u32 mem_burst; + u32 mem_width; }; struct stm32_dma_device { @@ -248,6 +253,85 @@ static int stm32_dma_get_width(struct stm32_dma_chan *chan, } } +static enum dma_slave_buswidth stm32_dma_get_max_width(u32 buf_len, + u32 threshold) +{ + enum dma_slave_buswidth max_width; + + if (threshold == STM32_DMA_FIFO_THRESHOLD_FULL) + max_width = DMA_SLAVE_BUSWIDTH_4_BYTES; + else + max_width = DMA_SLAVE_BUSWIDTH_2_BYTES; + + while ((buf_len < max_width || buf_len % max_width) && + max_width > DMA_SLAVE_BUSWIDTH_1_BYTE) + max_width = max_width >> 1; + + return max_width; +} + +static bool stm32_dma_fifo_threshold_is_allowed(u32 burst, u32 threshold, + enum dma_slave_buswidth width) +{ + u32 remaining; + + if (width != DMA_SLAVE_BUSWIDTH_UNDEFINED) { + if (burst != 0) { + /* +* If number of beats fit in several whole bursts +* this configuration is allowed. +*/ + remaining = ((STM32_DMA_FIFO_SIZE / width) * +(threshold + 1) / 4) % burst; + + if (remaining == 0) + return true; + } else { + return true; + } + } + + return false; +} + +static bool stm32_dma_is_burst_possible(u32 buf_len, u32 threshold) +{ + switch (threshold) { + case STM32_DMA_FIFO_THRESHOLD_FULL: + if (buf_len >= STM32_DMA_MAX_BURST) + return true; + else + return false; + case STM32_DMA_FIFO_THRESHOLD_HALFFULL: + if (buf_len >= STM32_DMA_MAX_BURST / 2) + return true; + else + return false; + default: + return false; + } +} + +static u32 stm32_dma_get_best_burst(u32 buf_len, u32 max_burst, u32 threshold, + enum dma_slave_buswidth width) +{ + u32 best_burst = max_burst; + + if (best_burst == 1 || !stm32_dma_is_burst_possible(buf_len, threshold)) + return 0; + + while ((buf_len < best_burst * width && best_burst > 1) || + !stm32_dma_fifo_threshold_is_allowed(best_burst, threshold, + width)) { + if (best_burst > STM32_DMA_MIN_BURST) + best_burst = best_burst >> 1; + else + best_burst = 0; + } + + return best_burst; +} + static int stm32_dma_get_burst(struct stm32_dma_chan *chan, u32 maxburst) { switch (maxburst) { @@ -267,12 +351,12 @@ static int stm32_dma_get_burst(struct stm32_dma_chan *chan, u32 maxburst) } static void stm32_dma_set_fifo_config(struct stm32_dma_chan *chan, - u32 src_maxburst, u32 dst_maxburst) + u32 src_burst, u32 dst_burst) { chan->chan_reg.dma_sfcr &= ~STM32_DMA_SFCR_MASK; chan->chan_reg.dma_scr &= ~STM32_DMA_SCR_DMEIE; - if ((!src_maxburst) && (!dst_maxburst)) { + if (!src_burst && !dst_burst) { /* Using direct mode */ chan->chan_reg.dma_scr |= STM32_DMA_SCR_DMEIE; } else { @@ -589,37
[PATCH 1/7] dmaengine: stm32-dma: threshold manages with bitfield feature
From: Pierre Yves MORDRET >From now on, DMA bitfield is to manage DMA FIFO Threshold. Signed-off-by: Pierre-Yves MORDRET Signed-off-by: Vinod Koul --- drivers/dma/stm32-dma.c | 19 --- 1 file changed, 16 insertions(+), 3 deletions(-) diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c index 786fc8fcc38e..4099948b6914 100644 --- a/drivers/dma/stm32-dma.c +++ b/drivers/dma/stm32-dma.c @@ -116,6 +116,10 @@ #define STM32_DMA_MAX_DATA_PARAM 0x03 #define STM32_DMA_MAX_BURST16 +/* DMA Features */ +#define STM32_DMA_THRESHOLD_FTR_MASK GENMASK(1, 0) +#define STM32_DMA_THRESHOLD_FTR_GET(n) ((n) & STM32_DMA_THRESHOLD_FTR_MASK) + enum stm32_dma_width { STM32_DMA_BYTE, STM32_DMA_HALF_WORD, @@ -129,11 +133,18 @@ enum stm32_dma_burst_size { STM32_DMA_BURST_INCR16, }; +/** + * struct stm32_dma_cfg - STM32 DMA custom configuration + * @channel_id: channel ID + * @request_line: DMA request + * @stream_config: 32bit mask specifying the DMA channel configuration + * @features: 32bit mask specifying the DMA Feature list + */ struct stm32_dma_cfg { u32 channel_id; u32 request_line; u32 stream_config; - u32 threshold; + u32 features; }; struct stm32_dma_chan_reg { @@ -171,6 +182,7 @@ struct stm32_dma_chan { u32 next_sg; struct dma_slave_config dma_sconfig; struct stm32_dma_chan_reg chan_reg; + u32 threshold; }; struct stm32_dma_device { @@ -976,7 +988,8 @@ static void stm32_dma_set_config(struct stm32_dma_chan *chan, /* Enable Interrupts */ chan->chan_reg.dma_scr |= STM32_DMA_SCR_TEIE | STM32_DMA_SCR_TCIE; - chan->chan_reg.dma_sfcr = cfg->threshold & STM32_DMA_SFCR_FTH_MASK; + chan->threshold = STM32_DMA_THRESHOLD_FTR_GET(cfg->features); + chan->chan_reg.dma_sfcr = STM32_DMA_SFCR_FTH(chan->threshold); } static struct dma_chan *stm32_dma_of_xlate(struct of_phandle_args *dma_spec, @@ -996,7 +1009,7 @@ static struct dma_chan *stm32_dma_of_xlate(struct of_phandle_args *dma_spec, cfg.channel_id = dma_spec->args[0]; cfg.request_line = dma_spec->args[1]; cfg.stream_config = dma_spec->args[2]; - cfg.threshold = dma_spec->args[3]; + cfg.features = dma_spec->args[3]; if ((cfg.channel_id >= STM32_DMA_MAX_CHANNELS) || (cfg.request_line >= STM32_DMA_MAX_REQUEST_ID)) { -- 2.19.0.605.g01d371f741-goog
[PATCH 3/7] dmaengine: stm32-dma: fix typo and reported checkpatch warnings
From: Pierre Yves MORDRET Fix typo in a comment and solved reported checkpatch warnings. Signed-off-by: Pierre-Yves MORDRET Signed-off-by: Vinod Koul --- drivers/dma/stm32-dma.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c index fae7de54f00a..b64e14a83dec 100644 --- a/drivers/dma/stm32-dma.c +++ b/drivers/dma/stm32-dma.c @@ -60,7 +60,8 @@ #define STM32_DMA_SCR_PINC BIT(9) /* Peripheral increment mode */ #define STM32_DMA_SCR_CIRC BIT(8) /* Circular mode */ #define STM32_DMA_SCR_PFCTRL BIT(5) /* Peripheral Flow Controller */ -#define STM32_DMA_SCR_TCIE BIT(4) /* Transfer Cplete Int Enable*/ +#define STM32_DMA_SCR_TCIE BIT(4) /* Transfer Complete Int Enable + */ #define STM32_DMA_SCR_TEIE BIT(2) /* Transfer Error Int Enable */ #define STM32_DMA_SCR_DMEIEBIT(1) /* Direct Mode Err Int Enable */ #define STM32_DMA_SCR_EN BIT(0) /* Stream Enable */ @@ -918,7 +919,7 @@ static enum dma_status stm32_dma_tx_status(struct dma_chan *c, u32 residue = 0; status = dma_cookie_status(c, cookie, state); - if ((status == DMA_COMPLETE) || (!state)) + if (status == DMA_COMPLETE || !state) return status; spin_lock_irqsave(>vchan.lock, flags); @@ -982,7 +983,7 @@ static void stm32_dma_desc_free(struct virt_dma_desc *vdesc) } static void stm32_dma_set_config(struct stm32_dma_chan *chan, - struct stm32_dma_cfg *cfg) +struct stm32_dma_cfg *cfg) { stm32_dma_clear_reg(>chan_reg); @@ -1015,8 +1016,8 @@ static struct dma_chan *stm32_dma_of_xlate(struct of_phandle_args *dma_spec, cfg.stream_config = dma_spec->args[2]; cfg.features = dma_spec->args[3]; - if ((cfg.channel_id >= STM32_DMA_MAX_CHANNELS) || - (cfg.request_line >= STM32_DMA_MAX_REQUEST_ID)) { + if (cfg.channel_id >= STM32_DMA_MAX_CHANNELS || + cfg.request_line >= STM32_DMA_MAX_REQUEST_ID) { dev_err(dev, "Bad channel and/or request id\n"); return NULL; } -- 2.19.0.605.g01d371f741-goog
[PATCH 5/7] dmaengine: stm32-dma: fix DMA IRQ status handling
From: Pierre Yves MORDRET Update the way Transfer Complete and Half Transfer Complete status are acknowledge. Even if HTI is not enabled its status is shown when reading registers, driver has to clear it gently and not raise an error. Signed-off-by: Pierre-Yves MORDRET Signed-off-by: Vinod Koul --- drivers/dma/stm32-dma.c | 29 + 1 file changed, 25 insertions(+), 4 deletions(-) diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c index 21ad359a5a59..b40486454a2c 100644 --- a/drivers/dma/stm32-dma.c +++ b/drivers/dma/stm32-dma.c @@ -34,9 +34,14 @@ #define STM32_DMA_LIFCR0x0008 /* DMA Low Int Flag Clear Reg */ #define STM32_DMA_HIFCR0x000c /* DMA High Int Flag Clear Reg */ #define STM32_DMA_TCI BIT(5) /* Transfer Complete Interrupt */ +#define STM32_DMA_HTI BIT(4) /* Half Transfer Interrupt */ #define STM32_DMA_TEI BIT(3) /* Transfer Error Interrupt */ #define STM32_DMA_DMEI BIT(2) /* Direct Mode Error Interrupt */ #define STM32_DMA_FEI BIT(0) /* FIFO Error Interrupt */ +#define STM32_DMA_MASKI(STM32_DMA_TCI \ +| STM32_DMA_TEI \ +| STM32_DMA_DMEI \ +| STM32_DMA_FEI) /* DMA Stream x Configuration Register */ #define STM32_DMA_SCR(x) (0x0010 + 0x18 * (x)) /* x = 0..7 */ @@ -643,13 +648,29 @@ static irqreturn_t stm32_dma_chan_irq(int irq, void *devid) status = stm32_dma_irq_status(chan); scr = stm32_dma_read(dmadev, STM32_DMA_SCR(chan->id)); - if ((status & STM32_DMA_TCI) && (scr & STM32_DMA_SCR_TCIE)) { + if (status & STM32_DMA_TCI) { stm32_dma_irq_clear(chan, STM32_DMA_TCI); - stm32_dma_handle_chan_done(chan); - - } else { + if (scr & STM32_DMA_SCR_TCIE) + stm32_dma_handle_chan_done(chan); + status &= ~STM32_DMA_TCI; + } + if (status & STM32_DMA_HTI) { + stm32_dma_irq_clear(chan, STM32_DMA_HTI); + status &= ~STM32_DMA_HTI; + } + if (status & STM32_DMA_FEI) { + stm32_dma_irq_clear(chan, STM32_DMA_FEI); + status &= ~STM32_DMA_FEI; + if (!(scr & STM32_DMA_SCR_EN)) + dev_err(chan2dev(chan), "FIFO Error\n"); + else + dev_dbg(chan2dev(chan), "FIFO over/underrun\n"); + } + if (status) { stm32_dma_irq_clear(chan, status); dev_err(chan2dev(chan), "DMA error: status=0x%08x\n", status); + if (!(scr & STM32_DMA_SCR_EN)) + dev_err(chan2dev(chan), "chan disabled by HW\n"); } spin_unlock(>vchan.lock); -- 2.19.0.605.g01d371f741-goog
[PATCH 7/7] dmaengine: stm32-dma: properly mask irq bits
From: Pierre Yves MORDRET A single register of the controller holds the information for four dma channels. The functions stm32_dma_irq_status() don't mask the relevant bits after the shift, thus adjacent channel's status is also reported in the returned value. Fixed by masking the value before returning it. Similarly, the function stm32_dma_irq_clear() don't mask the input value before shifting it, thus an incorrect input value could disable the interrupts of adjacent channels. Fixed by masking the input value before using it. Signed-off-by: Pierre-Yves MORDRET Signed-off-by: Antonio Borneo Signed-off-by: Vinod Koul --- drivers/dma/stm32-dma.c | 7 ++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c index 05a2974cd2c0..8c5807362a25 100644 --- a/drivers/dma/stm32-dma.c +++ b/drivers/dma/stm32-dma.c @@ -38,6 +38,10 @@ #define STM32_DMA_TEI BIT(3) /* Transfer Error Interrupt */ #define STM32_DMA_DMEI BIT(2) /* Direct Mode Error Interrupt */ #define STM32_DMA_FEI BIT(0) /* FIFO Error Interrupt */ +#define STM32_DMA_MASKI(STM32_DMA_TCI \ +| STM32_DMA_TEI \ +| STM32_DMA_DMEI \ +| STM32_DMA_FEI) /* DMA Stream x Configuration Register */ #define STM32_DMA_SCR(x) (0x0010 + 0x18 * (x)) /* x = 0..7 */ @@ -405,7 +409,7 @@ static u32 stm32_dma_irq_status(struct stm32_dma_chan *chan) flags = dma_isr >> (((chan->id & 2) << 3) | ((chan->id & 1) * 6)); - return flags; + return flags & STM32_DMA_MASKI; } static void stm32_dma_irq_clear(struct stm32_dma_chan *chan, u32 flags) @@ -420,6 +424,7 @@ static void stm32_dma_irq_clear(struct stm32_dma_chan *chan, u32 flags) * If (ch % 4) is 2 or 3, left shift the mask by 16 bits. * If (ch % 4) is 1 or 3, additionally left shift the mask by 6 bits. */ + flags &= STM32_DMA_MASKI; dma_ifcr = flags << (((chan->id & 2) << 3) | ((chan->id & 1) * 6)); if (chan->id & 4) -- 2.19.0.605.g01d371f741-goog
[PATCH 6/7] dmaengine: stm32-dma: fix max items per transfer
From: Pierre Yves MORDRET Having 0 in item counter register is valid and stands for a "No or Ended transfer". Therefore valid transfer starts from @+0 to @+0xFFFE leading to unaligned scatter gather at boundary. Thus it's safer to round down this value on its FIFO size (16 Bytes). Signed-off-by: Pierre-Yves MORDRET Signed-off-by: Vinod Koul --- drivers/dma/stm32-dma.c | 19 +++ 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/drivers/dma/stm32-dma.c b/drivers/dma/stm32-dma.c index b40486454a2c..05a2974cd2c0 100644 --- a/drivers/dma/stm32-dma.c +++ b/drivers/dma/stm32-dma.c @@ -38,10 +38,6 @@ #define STM32_DMA_TEI BIT(3) /* Transfer Error Interrupt */ #define STM32_DMA_DMEI BIT(2) /* Direct Mode Error Interrupt */ #define STM32_DMA_FEI BIT(0) /* FIFO Error Interrupt */ -#define STM32_DMA_MASKI(STM32_DMA_TCI \ -| STM32_DMA_TEI \ -| STM32_DMA_DMEI \ -| STM32_DMA_FEI) /* DMA Stream x Configuration Register */ #define STM32_DMA_SCR(x) (0x0010 + 0x18 * (x)) /* x = 0..7 */ @@ -118,6 +114,13 @@ #define STM32_DMA_FIFO_THRESHOLD_FULL 0x03 #define STM32_DMA_MAX_DATA_ITEMS 0x +/* + * Valid transfer starts from @0 to @0xFFFE leading to unaligned scatter + * gather at boundary. Thus it's safer to round down this value on FIFO + * size (16 Bytes) + */ +#define STM32_DMA_ALIGNED_MAX_DATA_ITEMS \ + ALIGN_DOWN(STM32_DMA_MAX_DATA_ITEMS, 16) #define STM32_DMA_MAX_CHANNELS 0x08 #define STM32_DMA_MAX_REQUEST_ID 0x08 #define STM32_DMA_MAX_DATA_PARAM 0x03 @@ -869,7 +872,7 @@ static struct dma_async_tx_descriptor *stm32_dma_prep_slave_sg( desc->sg_req[i].len = sg_dma_len(sg); nb_data_items = desc->sg_req[i].len / buswidth; - if (nb_data_items > STM32_DMA_MAX_DATA_ITEMS) { + if (nb_data_items > STM32_DMA_ALIGNED_MAX_DATA_ITEMS) { dev_err(chan2dev(chan), "nb items not supported\n"); goto err; } @@ -935,7 +938,7 @@ static struct dma_async_tx_descriptor *stm32_dma_prep_dma_cyclic( return NULL; nb_data_items = period_len / buswidth; - if (nb_data_items > STM32_DMA_MAX_DATA_ITEMS) { + if (nb_data_items > STM32_DMA_ALIGNED_MAX_DATA_ITEMS) { dev_err(chan2dev(chan), "number of items not supported\n"); return NULL; } @@ -985,7 +988,7 @@ static struct dma_async_tx_descriptor *stm32_dma_prep_dma_memcpy( u32 num_sgs, best_burst, dma_burst, threshold; int i; - num_sgs = DIV_ROUND_UP(len, STM32_DMA_MAX_DATA_ITEMS); + num_sgs = DIV_ROUND_UP(len, STM32_DMA_ALIGNED_MAX_DATA_ITEMS); desc = stm32_dma_alloc_desc(num_sgs); if (!desc) return NULL; @@ -994,7 +997,7 @@ static struct dma_async_tx_descriptor *stm32_dma_prep_dma_memcpy( for (offset = 0, i = 0; offset < len; offset += xfer_count, i++) { xfer_count = min_t(size_t, len - offset, - STM32_DMA_MAX_DATA_ITEMS); + STM32_DMA_ALIGNED_MAX_DATA_ITEMS); /* Compute best burst size */ max_width = DMA_SLAVE_BUSWIDTH_1_BYTE; -- 2.19.0.605.g01d371f741-goog
[PATCH v2 1/2] mm: Add an F_SEAL_FS_WRITE seal to memfd
Android uses ashmem for sharing memory regions. We are looking forward to migrating all usecases of ashmem to memfd so that we can possibly remove the ashmem driver in the future from staging while also benefiting from using memfd and contributing to it. Note staging drivers are also not ABI and generally can be removed at anytime. One of the main usecases Android has is the ability to create a region and mmap it as writeable, then drop its protection for "future" writes while keeping the existing already mmap'ed writeable-region active. This allows us to implement a usecase where receivers of the shared memory buffer can get a read-only view, while the sender continues to write to the buffer. See CursorWindow in Android for more details: https://developer.android.com/reference/android/database/CursorWindow This usecase cannot be implemented with the existing F_SEAL_WRITE seal. To support the usecase, this patch adds a new F_SEAL_FS_WRITE seal which prevents any future mmap and write syscalls from succeeding while keeping the existing mmap active. The following program shows the seal working in action: int main() { int ret, fd; void *addr, *addr2, *addr3, *addr1; ret = memfd_create_region("test_region", REGION_SIZE); printf("ret=%d\n", ret); fd = ret; // Create map addr = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr == MAP_FAILED) printf("map 0 failed\n"); else printf("map 0 passed\n"); if ((ret = write(fd, "test", 4)) != 4) printf("write failed even though no fs-write seal " "(ret=%d errno =%d)\n", ret, errno); else printf("write passed\n"); addr1 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr1 == MAP_FAILED) perror("map 1 prot-write failed even though no seal\n"); else printf("map 1 prot-write passed as expected\n"); ret = fcntl(fd, F_ADD_SEALS, F_SEAL_FS_WRITE); if (ret == -1) printf("fcntl failed, errno: %d\n", errno); else printf("fs-write seal now active\n"); if ((ret = write(fd, "test", 4)) != 4) printf("write failed as expected due to fs-write seal\n"); else printf("write passed (unexpected)\n"); addr2 = mmap(0, REGION_SIZE, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); if (addr2 == MAP_FAILED) perror("map 2 prot-write failed as expected due to seal\n"); else printf("map 2 passed\n"); addr3 = mmap(0, REGION_SIZE, PROT_READ, MAP_SHARED, fd, 0); if (addr3 == MAP_FAILED) perror("map 3 failed\n"); else printf("map 3 prot-read passed as expected\n"); } The output of running this program is as follows: ret=3 map 0 passed write passed map 1 prot-write passed as expected fs-write seal now active write failed as expected due to fs-write seal map 2 prot-write failed as expected due to seal : Permission denied map 3 prot-read passed as expected Note: This seal will also prevent growing and shrinking of the memfd. This is not something we do in Android so it does not affect us, however I have mentioned this behavior of the seal in the manpage. Cc: jr...@google.com Cc: john.stu...@linaro.org Cc: tk...@google.com Cc: gre...@linuxfoundation.org Signed-off-by: Joel Fernandes (Google) --- v1->v2: No change, just added selftests to the series. manpages are ready and I'll submit them once the patches are accepted. include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 6 +- 2 files changed, 6 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index c98312fa78a5..fe44a2035edf 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -41,6 +41,7 @@ #define F_SEAL_SHRINK 0x0002 /* prevent file from shrinking */ #define F_SEAL_GROW0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ +#define F_SEAL_FS_WRITE0x0010 /* prevent all write-related syscalls */ /* (1U << 31) is reserved for signed error codes */ /* diff --git a/mm/memfd.c b/mm/memfd.c index 27069518e3c5..9b8855b80de9 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) #define F_ALL_SEALS (F_SEAL_SEAL | \ F_SEAL_SHRINK | \ F_SEAL_GROW | \ -F_SEAL_WRITE) +F_SEAL_WRITE | \ +F_SEAL_FS_WRITE) static int memfd_add_seals(struct file *file, unsigned int seals) { @@ -219,6 +220,9 @@ static int memfd_add_seals(struct file *file, unsigned int seals) } } + if ((seals & F_SEAL_FS_WRITE) && !(*file_seals & F_SEAL_FS_WRITE)) + file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE); + *file_seals |= seals; error = 0; -- 2.19.0.605.g01d371f741-goog
[PATCH v2 2/2] selftests/memfd: Add tests for F_SEAL_FS_WRITE seal
Add tests to verify sealing memfds with the F_SEAL_FS_WRITE works as expected. Cc: dan...@google.com Cc: minc...@google.com Signed-off-by: Joel Fernandes (Google) --- tools/testing/selftests/memfd/memfd_test.c | 51 +- 1 file changed, 50 insertions(+), 1 deletion(-) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 10baa1652fc2..4bd2b6c87bb4 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -27,7 +27,7 @@ #define MFD_DEF_SIZE 8192 #define STACK_SIZE 65536 - +#define F_SEAL_FS_WRITE 0x0010 /* * Default is not to test hugetlbfs */ @@ -170,6 +170,24 @@ static void *mfd_assert_mmap_shared(int fd) return p; } +static void *mfd_fail_mmap_shared(int fd) +{ + void *p; + + p = mmap(NULL, +mfd_def_size, +PROT_READ | PROT_WRITE, +MAP_SHARED, +fd, +0); + if (p != MAP_FAILED) { + printf("mmap() didn't fail as expected\n"); + abort(); + } + + return p; +} + static void *mfd_assert_mmap_private(int fd) { void *p; @@ -692,6 +710,36 @@ static void test_seal_write(void) close(fd); } +/* + * Test SEAL_WRITE + * Test whether SEAL_WRITE actually prevents modifications. + */ +static void test_seal_fs_write(void) +{ + int fd; + void *p; + + printf("%s SEAL-FS-WRITE\n", memfd_str); + + fd = mfd_assert_new("kern_memfd_seal_fs_write", + mfd_def_size, + MFD_CLOEXEC | MFD_ALLOW_SEALING); + + p = mfd_assert_mmap_shared(fd); + + /* FS_WRITE seal can be added even with existing +* writeable mappings */ + mfd_assert_has_seals(fd, 0); + mfd_assert_add_seals(fd, F_SEAL_FS_WRITE); + mfd_assert_has_seals(fd, F_SEAL_FS_WRITE); + + mfd_assert_read(fd); + mfd_fail_write(fd); + + munmap(p, mfd_def_size); + close(fd); +} + /* * Test SEAL_SHRINK * Test whether SEAL_SHRINK actually prevents shrinking @@ -945,6 +993,7 @@ int main(int argc, char **argv) test_basic(); test_seal_write(); + test_seal_fs_write(); test_seal_shrink(); test_seal_grow(); test_seal_resize(); -- 2.19.0.605.g01d371f741-goog
[PATCH] mm: Speed up mremap on large regions
Android needs to mremap large regions of memory during memory management related operations. The mremap system call can be really slow if THP is not enabled. The bottleneck is move_page_tables, which is copying each pte at a time, and can be really slow across a large map. Turning on THP may not be a viable option, and is not for us. This patch speeds up the performance for non-THP system by copying at the PMD level when possible. The speed up is three orders of magnitude. On a 1GB mremap, the mremap completion times drops from 160-250 millesconds to 380-400 microseconds. Before: Total mremap time for 1GB data: 242321014 nanoseconds. Total mremap time for 1GB data: 196842467 nanoseconds. Total mremap time for 1GB data: 167051162 nanoseconds. After: Total mremap time for 1GB data: 385781 nanoseconds. Total mremap time for 1GB data: 388959 nanoseconds. Total mremap time for 1GB data: 402813 nanoseconds. Incase THP is enabled, the optimization is skipped. I also flush the tlb every time we do this optimization since I couldn't find a way to determine if the low-level PTEs are dirty. It is seen that the cost of doing so is not much compared the improvement, on both x86-64 and arm64. Cc: minc...@google.com Cc: hu...@google.com Cc: lokeshgi...@google.com Cc: kernel-t...@android.com Signed-off-by: Joel Fernandes (Google) --- mm/mremap.c | 62 + 1 file changed, 62 insertions(+) diff --git a/mm/mremap.c b/mm/mremap.c index 5c2e18505f75..68ddc9e9dfde 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -191,6 +191,54 @@ static void move_ptes(struct vm_area_struct *vma, pmd_t *old_pmd, drop_rmap_locks(vma); } +bool move_normal_pmd(struct vm_area_struct *vma, unsigned long old_addr, + unsigned long new_addr, unsigned long old_end, + pmd_t *old_pmd, pmd_t *new_pmd, bool *need_flush) +{ + spinlock_t *old_ptl, *new_ptl; + struct mm_struct *mm = vma->vm_mm; + + if ((old_addr & ~PMD_MASK) || (new_addr & ~PMD_MASK) + || old_end - old_addr < PMD_SIZE) + return false; + + /* +* The destination pmd shouldn't be established, free_pgtables() +* should have release it. +*/ + if (WARN_ON(!pmd_none(*new_pmd))) + return false; + + /* +* We don't have to worry about the ordering of src and dst +* ptlocks because exclusive mmap_sem prevents deadlock. +*/ + old_ptl = pmd_lock(vma->vm_mm, old_pmd); + if (old_ptl) { + pmd_t pmd; + + new_ptl = pmd_lockptr(mm, new_pmd); + if (new_ptl != old_ptl) + spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING); + + /* Clear the pmd */ + pmd = *old_pmd; + pmd_clear(old_pmd); + + VM_BUG_ON(!pmd_none(*new_pmd)); + + /* Set the new pmd */ + set_pmd_at(mm, new_addr, new_pmd, pmd); + if (new_ptl != old_ptl) + spin_unlock(new_ptl); + spin_unlock(old_ptl); + + *need_flush = true; + return true; + } + return false; +} + unsigned long move_page_tables(struct vm_area_struct *vma, unsigned long old_addr, struct vm_area_struct *new_vma, unsigned long new_addr, unsigned long len, @@ -239,7 +287,21 @@ unsigned long move_page_tables(struct vm_area_struct *vma, split_huge_pmd(vma, old_pmd, old_addr); if (pmd_trans_unstable(old_pmd)) continue; + } else if (extent == PMD_SIZE) { + bool moved; + + /* See comment in move_ptes() */ + if (need_rmap_locks) + take_rmap_locks(vma); + moved = move_normal_pmd(vma, old_addr, new_addr, + old_end, old_pmd, new_pmd, + _flush); + if (need_rmap_locks) + drop_rmap_locks(vma); + if (moved) + continue; } + if (pte_alloc(new_vma->vm_mm, new_pmd, new_addr)) break; next = (new_addr + PMD_SIZE) & PMD_MASK; -- 2.19.0.605.g01d371f741-goog
[PATCH -next 1/2] mm/memfd: make F_SEAL_FUTURE_WRITE seal more robust
A better way to do F_SEAL_FUTURE_WRITE seal was discussed [1] last week where we don't need to modify core VFS structures to get the same behavior of the seal. This solves several side-effects pointed out by Andy [2]. [1] https://lore.kernel.org/lkml/2018173650.ga256...@google.com/ [2] https://lore.kernel.org/lkml/69ce06cc-e47c-4992-848a-66eb23ee6...@amacapital.net/ Suggested-by: Andy Lutomirski Fixes: 5e653c2923fd ("mm: Add an F_SEAL_FUTURE_WRITE seal to memfd") Signed-off-by: Joel Fernandes (Google) --- fs/hugetlbfs/inode.c | 2 +- mm/memfd.c | 19 --- mm/shmem.c | 24 +--- 3 files changed, 22 insertions(+), 23 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 762028994f47..5b54bf893a67 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -558,7 +558,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len) inode_lock(inode); /* protected by i_mutex */ - if (info->seals & F_SEAL_WRITE) { + if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) { inode_unlock(inode); return -EPERM; } diff --git a/mm/memfd.c b/mm/memfd.c index 63fff5e77114..650e65a46b9c 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -201,25 +201,6 @@ static int memfd_add_seals(struct file *file, unsigned int seals) } } - if ((seals & F_SEAL_FUTURE_WRITE) && - !(*file_seals & F_SEAL_FUTURE_WRITE)) { - /* -* The FUTURE_WRITE seal also prevents growing and shrinking -* so we need them to be already set, or requested now. -*/ - int test_seals = (seals | *file_seals) & -(F_SEAL_GROW | F_SEAL_SHRINK); - - if (test_seals != (F_SEAL_GROW | F_SEAL_SHRINK)) { - error = -EINVAL; - goto unlock; - } - - spin_lock(>f_lock); - file->f_mode &= ~(FMODE_WRITE | FMODE_PWRITE); - spin_unlock(>f_lock); - } - *file_seals |= seals; error = 0; diff --git a/mm/shmem.c b/mm/shmem.c index 32eb29bd72c6..cee9878c87f1 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2121,6 +2121,23 @@ int shmem_lock(struct file *file, int lock, struct user_struct *user) static int shmem_mmap(struct file *file, struct vm_area_struct *vma) { + struct shmem_inode_info *info = SHMEM_I(file_inode(file)); + + /* +* New PROT_READ and MAP_SHARED mmaps are not allowed when "future +* write" seal active. +*/ + if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_WRITE) && + (info->seals & F_SEAL_FUTURE_WRITE)) + return -EPERM; + + /* +* Since the F_SEAL_FUTURE_WRITE seals allow for a MAP_SHARED read-only +* mapping, take care to not allow mprotect to revert protections. +*/ + if (info->seals & F_SEAL_FUTURE_WRITE) + vma->vm_flags &= ~(VM_MAYWRITE); + file_accessed(file); vma->vm_ops = _vm_ops; if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) && @@ -2346,8 +2363,9 @@ shmem_write_begin(struct file *file, struct address_space *mapping, pgoff_t index = pos >> PAGE_SHIFT; /* i_mutex is held by caller */ - if (unlikely(info->seals & (F_SEAL_WRITE | F_SEAL_GROW))) { - if (info->seals & F_SEAL_WRITE) + if (unlikely(info->seals & (F_SEAL_GROW | + F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) { + if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) return -EPERM; if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size) return -EPERM; @@ -2610,7 +2628,7 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset, DECLARE_WAIT_QUEUE_HEAD_ONSTACK(shmem_falloc_waitq); /* protected by i_mutex */ - if (info->seals & F_SEAL_WRITE) { + if (info->seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE)) { error = -EPERM; goto out; } -- 2.19.1.1215.g8438c0b245-goog
[PATCH -next 2/2] selftests/memfd: modify tests for F_SEAL_FUTURE_WRITE seal
Modify the tests for F_SEAL_FUTURE_WRITE based on the changes introduced in previous patch. Also add a test to make sure the reopen issue pointed by Jann Horn [1] is fixed. [1] https://lore.kernel.org/lkml/CAG48ez1h=v-JYnDw81HaYJzOfrNhwYksxmc2r=cjvdqvgym...@mail.gmail.com/ Cc: Jann Horn Signed-off-by: Joel Fernandes (Google) --- tools/testing/selftests/memfd/memfd_test.c | 88 +++--- 1 file changed, 44 insertions(+), 44 deletions(-) diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c index 32b207ca7372..c67d32eeb668 100644 --- a/tools/testing/selftests/memfd/memfd_test.c +++ b/tools/testing/selftests/memfd/memfd_test.c @@ -54,6 +54,22 @@ static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags) return fd; } +static int mfd_assert_reopen_fd(int fd_in) +{ + int r, fd; + char path[100]; + + sprintf(path, "/proc/self/fd/%d", fd_in); + + fd = open(path, O_RDWR); + if (fd < 0) { + printf("re-open of existing fd %d failed\n", fd_in); + abort(); + } + + return fd; +} + static void mfd_fail_new(const char *name, unsigned int flags) { int r; @@ -255,6 +271,25 @@ static void mfd_assert_read(int fd) munmap(p, mfd_def_size); } +/* Test that PROT_READ + MAP_SHARED mappings work. */ +static void mfd_assert_read_shared(int fd) +{ + void *p; + + /* verify PROT_READ and MAP_SHARED *is* allowed */ + p = mmap(NULL, +mfd_def_size, +PROT_READ, +MAP_SHARED, +fd, +0); + if (p == MAP_FAILED) { + printf("mmap() failed: %m\n"); + abort(); + } + munmap(p, mfd_def_size); +} + static void mfd_assert_write(int fd) { ssize_t l; @@ -698,7 +733,7 @@ static void test_seal_write(void) */ static void test_seal_future_write(void) { - int fd; + int fd, fd2; void *p; printf("%s SEAL-FUTURE-WRITE\n", memfd_str); @@ -710,58 +745,23 @@ static void test_seal_future_write(void) p = mfd_assert_mmap_shared(fd); mfd_assert_has_seals(fd, 0); - /* Not adding grow/shrink seals makes the future write -* seal fail to get added -*/ - mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE); - - mfd_assert_add_seals(fd, F_SEAL_GROW); - mfd_assert_has_seals(fd, F_SEAL_GROW); - - /* Should still fail since shrink seal has -* not yet been added -*/ - mfd_fail_add_seals(fd, F_SEAL_FUTURE_WRITE); - - mfd_assert_add_seals(fd, F_SEAL_SHRINK); - mfd_assert_has_seals(fd, F_SEAL_GROW | -F_SEAL_SHRINK); - /* Now should succeed, also verifies that the seal -* could be added with an existing writable mmap -*/ mfd_assert_add_seals(fd, F_SEAL_FUTURE_WRITE); - mfd_assert_has_seals(fd, F_SEAL_SHRINK | -F_SEAL_GROW | -F_SEAL_FUTURE_WRITE); + mfd_assert_has_seals(fd, F_SEAL_FUTURE_WRITE); /* read should pass, writes should fail */ mfd_assert_read(fd); + mfd_assert_read_shared(fd); mfd_fail_write(fd); - munmap(p, mfd_def_size); - close(fd); - - /* Test adding all seals (grow, shrink, future write) at once */ - fd = mfd_assert_new("kern_memfd_seal_future_write2", - mfd_def_size, - MFD_CLOEXEC | MFD_ALLOW_SEALING); - - p = mfd_assert_mmap_shared(fd); - - mfd_assert_has_seals(fd, 0); - mfd_assert_add_seals(fd, F_SEAL_SHRINK | -F_SEAL_GROW | -F_SEAL_FUTURE_WRITE); - mfd_assert_has_seals(fd, F_SEAL_SHRINK | -F_SEAL_GROW | -F_SEAL_FUTURE_WRITE); - - /* read should pass, writes should fail */ - mfd_assert_read(fd); - mfd_fail_write(fd); + fd2 = mfd_assert_reopen_fd(fd); + /* read should pass, writes should still fail */ + mfd_assert_read(fd2); + mfd_assert_read_shared(fd2); + mfd_fail_write(fd2); munmap(p, mfd_def_size); + close(fd2); close(fd); } -- 2.19.1.1215.g8438c0b245-goog
[PATCH -manpage 1/2] fcntl.2: Update manpage with new memfd F_SEAL_FUTURE_WRITE seal
More details of the seal can be found in the LKML patch: https://lore.kernel.org/lkml/20181120052137.74317-1-j...@joelfernandes.org/T/#t Signed-off-by: Joel Fernandes (Google) --- man2/fcntl.2 | 15 +++ 1 file changed, 15 insertions(+) diff --git a/man2/fcntl.2 b/man2/fcntl.2 index 03533d65b49d..54772f94964c 100644 --- a/man2/fcntl.2 +++ b/man2/fcntl.2 @@ -1525,6 +1525,21 @@ Furthermore, if there are any asynchronous I/O operations .RB ( io_submit (2)) pending on the file, all outstanding writes will be discarded. +.TP +.BR F_SEAL_FUTURE_WRITE +If this seal is set, the contents of the file can be modified only from +existing writeable mappings that were created prior to the seal being set. +Any attempt to create a new writeable mapping on the memfd via +.BR mmap (2) +will fail with +.BR EPERM. +Also any attempts to write to the memfd via +.BR write (2) +will fail with +.BR EPERM. +This is useful in situations where existing writable mapped regions need to be +kept intact while preventing any future writes. For example, to share a +read-only memory buffer to other processes that only the sender can write to. .\" .SS File read/write hints Write lifetime hints can be used to inform the kernel about the relative -- 2.19.1.1215.g8438c0b245-goog
[PATCH -manpage 2/2] memfd_create.2: Update manpage with new memfd F_SEAL_FUTURE_WRITE seal
More details of the seal can be found in the LKML patch: https://lore.kernel.org/lkml/20181120052137.74317-1-j...@joelfernandes.org/T/#t Signed-off-by: Joel Fernandes (Google) --- man2/memfd_create.2 | 15 ++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/man2/memfd_create.2 b/man2/memfd_create.2 index 3cd392d1b4d9..fce2bf8d0fff 100644 --- a/man2/memfd_create.2 +++ b/man2/memfd_create.2 @@ -280,7 +280,15 @@ in order to restrict further modifications on the file. (If placing the seal .BR F_SEAL_WRITE , then it will be necessary to first unmap the shared writable mapping -created in the previous step.) +created in the previous step. Otherwise, behavior similar to +.BR F_SEAL_WRITE +can be achieved, by using +.BR F_SEAL_FUTURE_WRITE +which will prevent future writes via +.BR mmap (2) +and +.BR write (2) +from succeeding, while keeping existing shared writable mappings). .IP 4. A second process obtains a file descriptor for the .BR tmpfs (5) @@ -425,6 +433,7 @@ main(int argc, char *argv[]) fprintf(stderr, "\\t\\tg \- F_SEAL_GROW\\n"); fprintf(stderr, "\\t\\ts \- F_SEAL_SHRINK\\n"); fprintf(stderr, "\\t\\tw \- F_SEAL_WRITE\\n"); +fprintf(stderr, "\\t\\tW \- F_SEAL_FUTURE_WRITE\\n"); fprintf(stderr, "\\t\\tS \- F_SEAL_SEAL\\n"); exit(EXIT_FAILURE); } @@ -463,6 +472,8 @@ main(int argc, char *argv[]) seals |= F_SEAL_SHRINK; if (strchr(seals_arg, \(aqw\(aq) != NULL) seals |= F_SEAL_WRITE; +if (strchr(seals_arg, \(aqW\(aq) != NULL) +seals |= F_SEAL_FUTURE_WRITE; if (strchr(seals_arg, \(aqS\(aq) != NULL) seals |= F_SEAL_SEAL; @@ -518,6 +529,8 @@ main(int argc, char *argv[]) printf(" GROW"); if (seals & F_SEAL_WRITE) printf(" WRITE"); +if (seals & F_SEAL_FUTURE_WRITE) +printf(" FUTURE_WRITE"); if (seals & F_SEAL_SHRINK) printf(" SHRINK"); printf("\\n"); -- 2.19.1.1215.g8438c0b245-goog
Re: [RFC PATCH] tracepoint: Provide tracepoint_kernel_find_by_name
Hi Mathieu, On Mon, Mar 26, 2018 at 12:10 PM, Mathieu Desnoyers wrote: > Provide an API allowing eBPF to lookup core kernel tracepoints by name. > > Given that a lookup by name explicitly requires tracepoint definitions > to be unique for a given name (no duplicate keys), include a > WARN_ON_ONCE() check that only a single match is encountered at runtime. > This should always be the case, given that a DEFINE_TRACE emits a > __tracepoint_##name symbol, which would cause a link-time error if more > than one instance is found. Nevertheless, check this at runtime with > WARN_ON_ONCE() to stay on the safe side. > > Signed-off-by: Mathieu Desnoyers > CC: Steven Rostedt > CC: Alexei Starovoitov > CC: Peter Zijlstra > CC: Ingo Molnar > --- > include/linux/tracepoint.h | 1 + > kernel/tracepoint.c| 35 +++ > 2 files changed, 36 insertions(+) > > diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h > index c94f466d57ef..1b4ae64b7c6a 100644 > --- a/include/linux/tracepoint.h > +++ b/include/linux/tracepoint.h > @@ -43,6 +43,7 @@ tracepoint_probe_unregister(struct tracepoint *tp, void > *probe, void *data); > extern void > for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv), > void *priv); > +extern struct tracepoint *tracepoint_kernel_find_by_name(const char *name); > > #ifdef CONFIG_MODULES > struct tp_module { > diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c > index 671b13457387..0a59f988055a 100644 > --- a/kernel/tracepoint.c > +++ b/kernel/tracepoint.c > @@ -60,6 +60,11 @@ struct tp_probes { > struct tracepoint_func probes[0]; > }; > > +struct tp_find_args { > + const char *name; > + struct tracepoint *tp; > +}; > + > static inline void *allocate_probes(int count) > { > struct tp_probes *p = kmalloc(count * sizeof(struct tracepoint_func) > @@ -528,6 +533,36 @@ void for_each_kernel_tracepoint(void (*fct)(struct > tracepoint *tp, void *priv), > } > EXPORT_SYMBOL_GPL(for_each_kernel_tracepoint); > > +static void find_tp(struct tracepoint *tp, void *priv) > +{ > + struct tp_find_args *args = priv; > + > + if (!strcmp(tp->name, args->name)) { > + WARN_ON_ONCE(args->tp); > + args->tp = tp; I think this runtime check is not needed if it really can't happen (linker verifies that already as you mentioned) although I'm not opposed if you still want to keep it, because there's no way to break out of the outer loop anyway so I guess your call.. I would just do: if (args->tp) return; if find_tp already found the tracepoint. Tried to also create a duplicate tracepoint and I get: CC init/version.o AR init/built-in.o AR built-in.o LD vmlinux.o block/blk-core.o:(__tracepoints+0x440): multiple definition of `__tracepoint_sched_switch' kernel/sched/core.o:(__tracepoints+0x440): first defined here Makefile:1032: recipe for target 'vmlinux' failed make: *** [vmlinux] Error 1 For this senseless diff: diff --git a/include/trace/events/block.h b/include/trace/events/block.h index 81b43f5bdf23..2b855c05197d 100644 --- a/include/trace/events/block.h +++ b/include/trace/events/block.h @@ -49,6 +49,13 @@ DEFINE_EVENT(block_buffer, block_touch_buffer, TP_ARGS(bh) ); +DEFINE_EVENT(block_buffer, sched_switch, + + TP_PROTO(struct buffer_head *bh), + + TP_ARGS(bh) +); + /** * block_dirty_buffer - mark a buffer dirty * @bh: buffer_head being dirtied thanks, - Joel
Re: [PATCH 0/3] [RFC] init, tracing: Add initcall trace events
Hi Steve, On Fri, Mar 23, 2018 at 8:02 AM, Steven Rostedt wrote: > A while ago we had a boot tracer. But it was eventually removed: > commit 30dbb20e68e6f ("tracing: Remove boot tracer"). > > The rational was because there is already a initcall_debug boot option > that causes printk()s of all the initcall functions. > > The problem with the initcall_debug option is that printk() is awfully slow, > and makes it difficult to see the real impact of initcalls. Mainly because > a single printk() is usually slower than most initcall functions. > > Instead of bringing back the boot tracer, adding trace events around the > initcall functions, and even one to denote which level the initcall > functions are being called from, adds the necessary information to > analyze the initcalls without the high overhead of printk()s, that > can substantially slow down the boot process. > > Another positive, is that the console initcall functions themselves > can also be traced. The timestamps are not operational at that time > but you can see which consoles are being registered. I saw this on > one of my test boxes: > > -0 [000] ...1 0.00: initcall_level: level=console > -0 [000] ...1 0.00: initcall_start: func=con_init+0x0/0x224 > -0 [000] ...1 0.00: initcall_finish: > func=con_init+0x0/0x224 ret=0 > -0 [000] ...1 0.00: initcall_start: > func=hvc_console_init+0x0/0x19 > -0 [000] ...1 0.00: initcall_finish: > func=hvc_console_init+0x0/0x19 ret=0 > -0 [000] ...1 0.00: initcall_start: > func=xen_cons_init+0x0/0x60 > -0 [000] ...1 0.00: initcall_finish: > func=xen_cons_init+0x0/0x60 ret=0 > -0 [000] ...1 0.00: initcall_start: > func=univ8250_console_init+0x0/0x2d > -0 [000] ...1 0.00: initcall_finish: > func=univ8250_console_init+0x0/0x2d ret=0 Will this make initcall_debug not work if CONFIG_TRACEPOINTS is turned off? Although it builds but I think this initcall_debug feature will fail, maybe CONFIG_TRACEPOINTS should be selected somewhere? I recently ran into some issues like this for my preemptirq tracepoints patch (which I will post again soon :D) where lockdep needed the tracepoints and I had to select it. thanks, - Joel
Re: [RFC PATCH] tracepoint: Provide tracepoint_kernel_find_by_name
On Tue, Mar 27, 2018 at 6:27 AM, Mathieu Desnoyers wrote: >>> +static void find_tp(struct tracepoint *tp, void *priv) >>> +{ >>> + struct tp_find_args *args = priv; >>> + >>> + if (!strcmp(tp->name, args->name)) { >>> + WARN_ON_ONCE(args->tp); >>> + args->tp = tp; >> >> I think this runtime check is not needed if it really can't happen >> (linker verifies that already as you mentioned) although I'm not >> opposed if you still want to keep it, because there's no way to break >> out of the outer loop anyway so I guess your call.. > > We can change the outer loop and break from it if needed, that's not > an issue. > >> I would just do: >> >> if (args->tp) >>return; >> >> if find_tp already found the tracepoint. >> >> Tried to also create a duplicate tracepoint and I get: >> CC init/version.o >> AR init/built-in.o >> AR built-in.o >> LD vmlinux.o >> block/blk-core.o:(__tracepoints+0x440): multiple definition of >> `__tracepoint_sched_switch' >> kernel/sched/core.o:(__tracepoints+0x440): first defined here >> Makefile:1032: recipe for target 'vmlinux' failed >> make: *** [vmlinux] Error 1 > > Yeah, as I state in my changelog, I'm very well aware that the linker > is able to catch those. This was the purpose of emitting a > __tracepoint_##name symbol in the tracepoint definition macro. This > WARN_ON_ONCE() is a redundant check for an invariant that we expect > is provided by the linker. Given that it's not a fast path, I would > prefer to keep this redundant check in place, given that an average > factor 2 speedup on a slow path should really not be something we > need to optimize for. IMHO, for slow paths, robustness is more important > than speed (unless the slow path becomes so slow that it really affects > the user). > > I envision that a way to break this invariant would be to compile a > kernel object with gcc -fvisibility=hidden or such. I admit this is > particular scenario is really far fetched, but it illustrates why I > prefer to keep an extra safety net at runtime for linker-based > validations when possible. > > If a faster tracepoint lookup function is needed, we should consider > perfect hashing algorithms done post-build, but so far nobody has > complained about speed of this lookup operation. Anyhow a factor 2 in > the speed of this lookup should really not matter, right ? Yes, I agree with all the great reasons. So lets do it your way then. thanks, - Joel
Re: [PATCH] staging: android: ashmem: Fix possible deadlock in ashmem_ioctl
On Tue, Feb 27, 2018 at 10:59 PM, Yisheng Xie wrote: > ashmem_mutex may create a chain of dependencies like: > > CPU0CPU1 > mmap syscall ioctl syscall > -> mmap_sem (acquired) -> ashmem_ioctl > -> ashmem_mmap-> ashmem_mutex (acquired) > -> ashmem_mutex (try to acquire) -> copy_from_user > -> mmap_sem (try to acquire) > > There is a lock odering problem between mmap_sem and ashmem_mutex causing > a lockdep splat[1] during a syzcaller test. This patch fixes the problem > by move copy_from_user out of ashmem_mutex. > > [1] https://www.spinics.net/lists/kernel/msg2733200.html > > Fixes: ce8a3a9e76d0 (staging: android: ashmem: Fix a race condition in pin > ioctls) > Reported-by: syzbot+d7a918a7a8e1c952b...@syzkaller.appspotmail.com > Signed-off-by: Yisheng Xie Greg, Could you take this patch for the stable trees? I do see it in staging already. I couldn't find it in stable so wanted to bring it to your attention. If you already aware of it, please ignore my note. Thanks, - Joel
[PATCH v7 4/6] trace/irqsoff: Split reset into separate functions
Split reset functions into seperate functions in preparation of future patches that need to do tracer specific reset. Cc: Steven Rostedt Cc: Peter Zilstra Cc: Ingo Molnar Cc: Mathieu Desnoyers Cc: Tom Zanussi Cc: Namhyung Kim Cc: Thomas Glexiner Cc: Boqun Feng Cc: Paul McKenney Cc: Frederic Weisbecker Cc: Randy Dunlap Cc: Masami Hiramatsu Cc: Fenguang Wu Cc: Baohong Liu Cc: Vedang Patel Cc: kernel-t...@android.com Signed-off-by: Joel Fernandes (Google) --- kernel/trace/trace_irqsoff.c | 22 +++--- 1 file changed, 19 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_irqsoff.c b/kernel/trace/trace_irqsoff.c index 03ecb4465ee4..f8daa754cce2 100644 --- a/kernel/trace/trace_irqsoff.c +++ b/kernel/trace/trace_irqsoff.c @@ -634,7 +634,7 @@ static int __irqsoff_tracer_init(struct trace_array *tr) return 0; } -static void irqsoff_tracer_reset(struct trace_array *tr) +static void __irqsoff_tracer_reset(struct trace_array *tr) { int lat_flag = save_flags & TRACE_ITER_LATENCY_FMT; int overwrite_flag = save_flags & TRACE_ITER_OVERWRITE; @@ -665,6 +665,12 @@ static int irqsoff_tracer_init(struct trace_array *tr) return __irqsoff_tracer_init(tr); } + +static void irqsoff_tracer_reset(struct trace_array *tr) +{ + __irqsoff_tracer_reset(tr); +} + static struct tracer irqsoff_tracer __read_mostly = { .name = "irqsoff", @@ -697,11 +703,16 @@ static int preemptoff_tracer_init(struct trace_array *tr) return __irqsoff_tracer_init(tr); } +static void preemptoff_tracer_reset(struct trace_array *tr) +{ + __irqsoff_tracer_reset(tr); +} + static struct tracer preemptoff_tracer __read_mostly = { .name = "preemptoff", .init = preemptoff_tracer_init, - .reset = irqsoff_tracer_reset, + .reset = preemptoff_tracer_reset, .start = irqsoff_tracer_start, .stop = irqsoff_tracer_stop, .print_max = true, @@ -731,11 +742,16 @@ static int preemptirqsoff_tracer_init(struct trace_array *tr) return __irqsoff_tracer_init(tr); } +static void preemptirqsoff_tracer_reset(struct trace_array *tr) +{ + __irqsoff_tracer_reset(tr); +} + static struct tracer preemptirqsoff_tracer __read_mostly = { .name = "preemptirqsoff", .init = preemptirqsoff_tracer_init, - .reset = irqsoff_tracer_reset, + .reset = preemptirqsoff_tracer_reset, .start = irqsoff_tracer_start, .stop = irqsoff_tracer_stop, .print_max = true, -- 2.17.0.441.gb46fe60e1d-goog
[PATCH v7 2/6] srcu: Add notrace variants of srcu_read_{lock,unlock}
From: Paul McKenney This is needed for a future tracepoint patch that uses srcu, and to make sure it doesn't call into lockdep. tracepoint code already calls notrace variants for rcu_read_lock_sched so this patch does the same for srcu which will be used in a later patch. Keeps it consistent with rcu-sched. [Joel: Added commit message] Cc: Steven Rostedt Cc: Peter Zilstra Cc: Ingo Molnar Cc: Mathieu Desnoyers Cc: Tom Zanussi Cc: Namhyung Kim Cc: Thomas Glexiner Cc: Boqun Feng Cc: Paul McKenney Cc: Frederic Weisbecker Cc: Randy Dunlap Cc: Masami Hiramatsu Cc: Fenguang Wu Cc: Baohong Liu Cc: Vedang Patel Cc: kernel-t...@android.com Reviewed-by: Steven Rostedt (VMware) Signed-off-by: Paul McKenney Signed-off-by: Joel Fernandes (Google) --- include/linux/srcu.h | 17 + 1 file changed, 17 insertions(+) diff --git a/include/linux/srcu.h b/include/linux/srcu.h index 33c1c698df09..2ec618979b20 100644 --- a/include/linux/srcu.h +++ b/include/linux/srcu.h @@ -161,6 +161,16 @@ static inline int srcu_read_lock(struct srcu_struct *sp) __acquires(sp) return retval; } +/* Used by tracing, cannot be traced and cannot invoke lockdep. */ +static inline notrace int +srcu_read_lock_notrace(struct srcu_struct *sp) __acquires(sp) +{ + int retval; + + retval = __srcu_read_lock(sp); + return retval; +} + /** * srcu_read_unlock - unregister a old reader from an SRCU-protected structure. * @sp: srcu_struct in which to unregister the old reader. @@ -175,6 +185,13 @@ static inline void srcu_read_unlock(struct srcu_struct *sp, int idx) __srcu_read_unlock(sp, idx); } +/* Used by tracing, cannot be traced and cannot call lockdep. */ +static inline notrace void +srcu_read_unlock_notrace(struct srcu_struct *sp, int idx) __releases(sp) +{ + __srcu_read_unlock(sp, idx); +} + /** * smp_mb__after_srcu_read_unlock - ensure full ordering after srcu_read_unlock * -- 2.17.0.441.gb46fe60e1d-goog
[PATCH v7 1/6] softirq: reorder trace_softirqs_on to prevent lockdep splat
I'm able to reproduce a lockdep splat with config options: CONFIG_PROVE_LOCKING=y, CONFIG_DEBUG_LOCK_ALLOC=y and CONFIG_PREEMPTIRQ_EVENTS=y $ echo 1 > /d/tracing/events/preemptirq/preempt_enable/enable --- kernel/softirq.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/kernel/softirq.c b/kernel/softirq.c index 177de3640c78..8a040bcaa033 100644 --- a/kernel/softirq.c +++ b/kernel/softirq.c @@ -139,9 +139,13 @@ static void __local_bh_enable(unsigned int cnt) { lockdep_assert_irqs_disabled(); + if (preempt_count() == cnt) + trace_preempt_on(CALLER_ADDR0, get_lock_parent_ip()); + if (softirq_count() == (cnt & SOFTIRQ_MASK)) trace_softirqs_on(_RET_IP_); - preempt_count_sub(cnt); + + __preempt_count_sub(cnt); } /* -- 2.17.0.441.gb46fe60e1d-goog
[PATCH] rcu: Add comment documenting how rcu_seq_snap works
rcu_seq_snap may be tricky for someone looking at it for the first time. Lets document how it works with an example to make it easier. Signed-off-by: Joel Fernandes (Google) --- kernel/rcu/rcu.h | 23 ++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 003671825d62..004ace3d22c2 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -91,7 +91,28 @@ static inline void rcu_seq_end(unsigned long *sp) WRITE_ONCE(*sp, rcu_seq_endval(sp)); } -/* Take a snapshot of the update side's sequence number. */ +/* + * Take a snapshot of the update side's sequence number. + * + * This function predicts what the grace period number will be the next + * time an RCU callback will be executed, given the current grace period's + * number. This can be gp+1 if RCU is idle, or gp+2 if a grace period is + * already in progress. + * + * We do this with a single addition and masking. + * For example, if RCU_SEQ_STATE_MASK=1 and the least significant bit (LSB) of + * the seq is used to track if a GP is in progress or not, its sufficient if we + * add (2+1) and mask with ~1. Lets see why with an example: + * + * Say the current seq is 6 which is 0x110 (gp is 3 and state bit is 0). + * To get the next GP number, we have to atleast add 0x10 to this (0x1 << 1) to + * account for the state bit. However, if the current seq is 7 (GP num is 3 + * and state bit is 1), then it means the current grace period is already + * in progress so the next the callback will run is at gp+2. To account for + * the extra +1, we just overflow the LSB by adding another 0x1 and masking + * with ~0x1. Incase no GP was in progress (RCU is idle), then the adding + * by 0x1 and masking will have no effect. This is calculated as below. + */ static inline unsigned long rcu_seq_snap(unsigned long *sp) { unsigned long s; -- 2.17.0.441.gb46fe60e1d-goog
[PATCH] rcu: trace: Remove Startedleaf from trace events comment
As part of the gp_seq clean up, the Startleaf condition doesn't occur anymore. Remove it from the comment in the trace event file. Signed-off-by: Joel Fernandes (Google) --- include/trace/events/rcu.h | 1 - 1 file changed, 1 deletion(-) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index ce9d1a1cac78..6d8dd04912d2 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -91,7 +91,6 @@ TRACE_EVENT(rcu_grace_period, * * "Startleaf": Request a grace period based on leaf-node data. * "Prestarted": Someone beat us to the request - * "Startedleaf": Leaf-node start proved sufficient. * "Startedleafroot": Leaf-node start proved sufficient after checking root. * "Startedroot": Requested a nocb grace period based on root-node data. * "NoGPkthread": The RCU grace-period kthread has not yet started. -- 2.17.0.441.gb46fe60e1d-goog
[PATCH v2] rcu: Add comment documenting how rcu_seq_snap works
rcu_seq_snap may be tricky for someone looking at it for the first time. Lets document how it works with an example to make it easier. Signed-off-by: Joel Fernandes (Google) --- v2 changes: Corrections as suggested by Randy. kernel/rcu/rcu.h | 24 +++- 1 file changed, 23 insertions(+), 1 deletion(-) diff --git a/kernel/rcu/rcu.h b/kernel/rcu/rcu.h index 003671825d62..533bc1087371 100644 --- a/kernel/rcu/rcu.h +++ b/kernel/rcu/rcu.h @@ -91,7 +91,29 @@ static inline void rcu_seq_end(unsigned long *sp) WRITE_ONCE(*sp, rcu_seq_endval(sp)); } -/* Take a snapshot of the update side's sequence number. */ +/* + * Take a snapshot of the update side's sequence number. + * + * This function predicts what the grace period number will be the next + * time an RCU callback will be executed, given the current grace period's + * number. This can be gp+1 if RCU is idle, or gp+2 if a grace period is + * already in progress. + * + * We do this with a single addition and masking. + * For example, if RCU_SEQ_STATE_MASK=1 and the least significant bit (LSB) of + * the seq is used to track if a GP is in progress or not, its sufficient if we + * add (2+1) and mask with ~1. Lets see why with an example: + * + * Say the current seq is 6 which is 0b110 (gp is 3 and state bit is 0). + * To get the next GP number, we have to at least add 0b10 to this (0x1 << 1) + * to account for the state bit. However, if the current seq is 7 (gp is 3 and + * state bit is 1), then it means the current grace period is already in + * progress so the next time the callback will run is at the end of grace + * period number gp+2. To account for the extra +1, we just overflow the LSB by + * adding another 0x1 and masking with ~0x1. In case no GP was in progress (RCU + * is idle), then the addition of the extra 0x1 and masking will have no + * effect. This is calculated as below. + */ static inline unsigned long rcu_seq_snap(unsigned long *sp) { unsigned long s; -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 8/8] rcu: Fix cpustart tracepoint gp_seq number
cpustart shows a stale gp_seq. This is because rdp->gp_seq is updated only at the end of the __note_gp_changes function. For this reason, use rnp->gp_seq instead. I believe we can't update rdp->gp_seq too early so lets just use the gp_seq from rnp instead. Signed-off-by: Joel Fernandes (Google) --- kernel/rcu/tree.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 25c44328d071..58d2b68f8b98 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1807,7 +1807,7 @@ static bool __note_gp_changes(struct rcu_state *rsp, struct rcu_node *rnp, * set up to detect a quiescent state, otherwise don't * go looking for one. */ - trace_rcu_grace_period(rsp->name, rdp->gp_seq, TPS("cpustart")); + trace_rcu_grace_period(rsp->name, rnp->gp_seq, TPS("cpustart")); need_gp = !!(rnp->qsmask & rdp->grpmask); rdp->cpu_no_qs.b.norm = need_gp; rdp->rcu_qs_ctr_snap = __this_cpu_read(rcu_dynticks.rcu_qs_ctr); -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 6/8] rcu: Add back the Startedleaf tracepoint
In recent discussion [1], the check for whether a leaf believes RCU is not idle, is being added back to funnel locking code, to avoid more locking. In this we are marking the leaf node for a future grace-period and bailing out since a GP is currently in progress. However the tracepoint is missing. Lets add it back. Also add a small comment about why we do this check (basically the point is to avoid locking intermediate nodes unnecessarily) and clarify the comments in the trace event header now that we are doing traversal of one or more intermediate nodes. [1] http://lkml.kernel.org/r/20180513190906.gl26...@linux.vnet.ibm.com Signed-off-by: Joel Fernandes (Google) --- include/trace/events/rcu.h | 4 ++-- kernel/rcu/tree.c | 11 ++- 2 files changed, 12 insertions(+), 3 deletions(-) diff --git a/include/trace/events/rcu.h b/include/trace/events/rcu.h index 539900a9f8c7..dc0bd11739c7 100644 --- a/include/trace/events/rcu.h +++ b/include/trace/events/rcu.h @@ -91,8 +91,8 @@ TRACE_EVENT(rcu_grace_period, * * "Startleaf": Request a grace period based on leaf-node data. * "Prestarted": Someone beat us to the request - * "Startedleaf": Leaf-node start proved sufficient. - * "Startedleafroot": Leaf-node start proved sufficient after checking root. + * "Startedleaf": Leaf and one or more non-root nodes marked for future start. + * "Startedleafroot": all non-root nodes from leaf to root marked for future start. * "Startedroot": Requested a nocb grace period based on root-node data. * "NoGPkthread": The RCU grace-period kthread has not yet started. * "StartWait": Start waiting for the requested grace period. diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 40670047d22c..8401a253e7de 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -1593,8 +1593,17 @@ static bool rcu_start_this_gp(struct rcu_node *rnp, struct rcu_data *rdp, goto unlock_out; } rnp_node->gp_seq_needed = gp_seq_start; - if (rcu_seq_state(rcu_seq_current(>gp_seq))) + + /* +* Check if leaf believes a GP is in progress, if yes we can +* bail and avoid more locking. We have already marked the leaf. +*/ + if (rcu_seq_state(rcu_seq_current(>gp_seq))) { + trace_rcu_this_gp(rnp_node, rdp, gp_seq_start, + TPS("Startedleaf")); goto unlock_out; + } + if (rnp_node != rnp && rnp_node->parent != NULL) raw_spin_unlock_rcu_node(rnp_node); if (!rnp_node->parent) { -- 2.17.0.441.gb46fe60e1d-goog
[PATCH RFC 0/8] rcu fixes, clean ups for rcu/dev
Hi, Here are some fixes, clean ups and some code comments changes mostly for the new funnel locking, gp_seq changes and some tracing. Its based on latest rcu/dev branch. thanks, - Joel Joel Fernandes (Google) (8): rcu: Add comment documenting how rcu_seq_snap works rcu: Clarify usage of cond_resched for tasks-RCU rcu: Add back the cpuend tracepoint rcu: Get rid of old c variable from places in tree RCU rcu: Use rcu_node as temporary variable in funnel locking loop rcu: Add back the Startedleaf tracepoint rcu: trace CleanupMore condition only if needed rcu: Fix cpustart tracepoint gp_seq number include/linux/rcupdate.h | 11 +++-- include/trace/events/rcu.h | 19 kernel/rcu/rcu.h | 24 +- kernel/rcu/tree.c | 92 +++--- 4 files changed, 97 insertions(+), 49 deletions(-) -- 2.17.0.441.gb46fe60e1d-goog