Re: [PATCH v7 9/9] sparc64: Add support for ADI (Application Data Integrity)

2017-08-15 Thread David Miller
From: Khalid Aziz 
Date: Wed,  9 Aug 2017 15:26:02 -0600

> +void adi_restore_tags(struct mm_struct *mm, struct vm_area_struct *vma,
> +   unsigned long addr, pte_t pte)
> +{
 ...
> + tag = tag_start(addr, tag_desc);
> + paddr = pte_val(pte) & _PAGE_PADDR_4V;
> + for (tmp = paddr; tmp < (paddr+PAGE_SIZE); tmp += adi_blksize()) {
> + version1 = (*tag) >> 4;
> + version2 = (*tag) & 0x0f;
> + *tag++ = 0;
> + asm volatile("stxa %0, [%1] %2\n\t"
> + :
> + : "r" (version1), "r" (tmp),
> +   "i" (ASI_MCD_REAL));
> + tmp += adi_blksize();
> + asm volatile("stxa %0, [%1] %2\n\t"
> + :
> + : "r" (version2), "r" (tmp),
> +   "i" (ASI_MCD_REAL));
> + }
> + asm volatile("membar #Sync\n\t");

You do a membar here.

> + for (i = pfrom; i < (pfrom + PAGE_SIZE); i += adi_blksize()) {
> + asm volatile("ldxa [%1] %2, %0\n\t"
> + : "=r" (adi_tag)
> + :  "r" (i), "i" (ASI_MCD_REAL));
> + asm volatile("stxa %0, [%1] %2\n\t"
> + :
> + : "r" (adi_tag), "r" (pto),
> +   "i" (ASI_MCD_REAL));

But not here.

Is this OK?  I suspect you need to add a membar to this second piece
of MCD tag storing code.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] x86/idle: add halt poll for halt idle

2017-08-15 Thread Michael S. Tsirkin
On Thu, Jun 22, 2017 at 11:22:13AM +, root wrote:
> From: Yang Zhang 
> 
> This patch introduce a new mechanism to poll for a while before
> entering idle state.
> 
> David has a topic in KVM forum to describe the problem on current KVM VM
> when running some message passing workload in KVM forum. Also, there
> are some work to improve the performance in KVM, like halt polling in KVM.
> But we still has 4 MSR wirtes and HLT vmexit when going into halt idle
> which introduce lot of latency.
> 
> Halt polling in KVM provide the capbility to not schedule out VCPU when
> it is the only task in this pCPU. Unlike it, this patch will let VCPU polls
> for a while if there is no work inside VCPU to elimiate heavy vmexit during
> in/out idle. The potential impact is it will cost more CPU cycle since we
> are doing polling and may impact other task which waiting on the same
> physical CPU in host.

I wonder whether you considered doing this in an idle driver.
I have a prototype patch combining this with mwait within guest -
I can post it if you are interested.


> Here is the data i get when running benchmark contextswitch
> (https://github.com/tsuna/contextswitch)
> 
> before patch:
> 2000000 process context switches in 4822613801ns (2411.3ns/ctxsw)
> 
> after patch:
> 2000000 process context switches in 3584098241ns (1792.0ns/ctxsw)
> 
> Signed-off-by: Yang Zhang 
> ---
>  Documentation/sysctl/kernel.txt | 10 ++
>  arch/x86/kernel/process.c   | 21 +
>  include/linux/kernel.h  |  3 +++
>  kernel/sched/idle.c |  3 +++
>  kernel/sysctl.c |  9 +
>  5 files changed, 46 insertions(+)
> 
> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
> index bac23c1..4e71bfe 100644
> --- a/Documentation/sysctl/kernel.txt
> +++ b/Documentation/sysctl/kernel.txt
> @@ -63,6 +63,7 @@ show up in /proc/sys/kernel:
>  - perf_event_max_stack
>  - perf_event_max_contexts_per_stack
>  - pid_max
> +- poll_threshold_ns[ X86 only ]
>  - powersave-nap   [ PPC only ]
>  - printk
>  - printk_delay
> @@ -702,6 +703,15 @@ kernel tries to allocate a number starting from this one.
>  
>  ==
>  
> +poll_threshold_ns: (X86 only)
> +
> +This parameter used to control the max wait time to poll before going
> +into real idle state. By default, the values is 0 means don't poll.
> +It is recommended to change the value to non-zero if running latency-bound
> +workloads in VM.
> +
> +==
> +
>  powersave-nap: (PPC only)
>  
>  If set, Linux-PPC will use the 'nap' mode of powersaving,
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index 0bb8842..6361783 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -39,6 +39,10 @@
>  #include 
>  #include 
>  
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +unsigned long poll_threshold_ns;
> +#endif
> +
>  /*
>   * per-CPU TSS segments. Threads are completely 'soft' on Linux,
>   * no more per-task TSS's. The TSS size is kept cacheline-aligned
> @@ -313,6 +317,23 @@ static inline void play_dead(void)
>  }
>  #endif
>  
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +void arch_cpu_idle_poll(void)
> +{
> + ktime_t start, cur, stop;
> +
> + if (poll_threshold_ns) {
> + start = cur = ktime_get();
> + stop = ktime_add_ns(ktime_get(), poll_threshold_ns);
> + do {
> + if (need_resched())
> + break;
> + cur = ktime_get();
> + } while (ktime_before(cur, stop));
> + }
> +}
> +#endif
> +
>  void arch_cpu_idle_enter(void)
>  {
>   tsc_verify_tsc_adjust(false);
> diff --git a/include/linux/kernel.h b/include/linux/kernel.h
> index 13bc08a..04cf774 100644
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -460,6 +460,9 @@ extern __scanf(2, 0)
>  extern int sysctl_panic_on_stackoverflow;
>  
>  extern bool crash_kexec_post_notifiers;
> +#ifdef CONFIG_HYPERVISOR_GUEST
> +extern unsigned long poll_threshold_ns;
> +#endif
>  
>  /*
>   * panic_cpu is used for synchronizing panic() and crash_kexec() execution. 
> It
> diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> index 2a25a9e..e789f99 100644
> --- a/kernel/sched/idle.c
> +++ b/kernel/sched/idle.c
> @@ -74,6 +74,7 @@ static noinline int __cpuidle cpu_idle_poll(void)
>  }
>  
>  /* Weak implementations for optional arch specific functions */
> +void __weak arch_cpu_idle_poll(void) { }
>  void __weak arch_cpu_idle_prepare(void) { }
>  void __weak arch_cpu_idle_enter(void) { }
>  void __weak arch_cpu_idle_exit(void) { }
> @@ -219,6 +220,8 @@ static void do_idle(void)
>*/
>  
>   __current_set_polling();
> + arch_cpu_idle_poll();
> +
>   tick_nohz_idle_enter();
>  
>   while (!need_resched()) {

Re: [PATCH] cpuset: Allow v2 behavior in v1 cgroup

2017-08-15 Thread Zefan Li
On 2017/8/16 1:27, Waiman Long wrote:
> Cpuset v2 has some valuable attributes that are not present in
> v1 because of backward-compatibility concerns. One of them is the
> restoration of the original cpu and memory node mask after a hot
> removal and addition event sequence.
> 
> This patch adds a new kernel command line option "cpuset_v2_mode=" to
> allow cpuset to have v2 behavior in a v1 cgroup when using a non-zero
> value. This enables users to optionally enable cpuset v2 behavior if
> they choose to.
> 
> Signed-off-by: Waiman Long 

I have no strong objection. I have received quite a few requests for
this in the past, and some were from other departments in my company.



Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-15 Thread David Rientjes
On Tue, 15 Aug 2017, Roman Gushchin wrote:

> > I'm curious about the decision made in this conditional and how 
> > oom_kill_memcg_member() ignores task->signal->oom_score_adj.  It means 
> > that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it 
> > should otherwise be disabled.
> > 
> > It's undocumented in the changelog, but I'm questioning whether it's the 
> > right decision.  Doesn't it make sense to kill all tasks that are not oom 
> > disabled, and allow the user to still protect certain processes by their 
> > /proc/pid/oom_score_adj setting?  Otherwise, there's no way to do that 
> > protection without a sibling memcg and its own reservation of memory.  I'm 
> > thinking about a process that governs jobs inside the memcg and if there 
> > is an oom kill, it wants to do logging and any cleanup necessary before 
> > exiting itself.  It seems like a powerful combination if coupled with oom 
> > notification.
> 
> Good question!
> I think, that an ability to override any oom_score_adj value and get all tasks
> killed is more important, than an ability to kill all processes with some
> exceptions.
> 

I'm disagreeing because getting all tasks killed is not necessarily 
something that only the kernel can do.  If all processes are oom disabled, 
that's a configuration issue done by sysadmin and the kernel should decide 
to kill the next largest memory cgroup or lower priority memory cgroup.  
It's not killing things like sshd that intentionally oom disable 
themselves.

You could argue that having an oom disabled process attached to these 
memcgs in the first place is also a configuration issue, but the problem 
is that in cgroup v2, with the restriction that processes may only be 
attached at leaf cgroups, there is no competition for memory in this case.  
I must assign memory resources to that sshd, or "Activity Manager" 
described by the cgroup v1 documentation, just to prevent it from being 
killed.

I think the latter of what you describe, killing all processes with some 
exceptions, is actually quite powerful.  I can guarantee that processes 
that set themselves to oom disabled are really oom disabled and I don't 
need to work around that in the cgroup hierarchy only because of this 
caveat.  I can also oom disable my Activity Manager that wants to wait on 
oom notification and collect the oom kill logs, raise notifications, and 
perhaps restart the process that it manages.

> In your example someone still needs to look after the remaining process,
> and kill it after some timeout, if it will not quit by itself, right?
> 

No, it can restart the process that was oom killed; or it can be sshd and 
I can still ssh into my machine.

> The special treatment of the -1000 value (without oom_kill_all_tasks)
> is required only to not to break the existing setups.
> 

I think as a general principle that allowing an oom disabled process to be 
oom killed is incorrect and if you really do want these to be killed, then 
(1) either your oom_score_adj is already wrong or (2) you can wait on oom 
notification and exit.


Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

2017-08-15 Thread David Rientjes
On Tue, 15 Aug 2017, Roman Gushchin wrote:

> > > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > > index dec5afdaa36d..22108f31e09d 100644
> > > --- a/Documentation/cgroup-v2.txt
> > > +++ b/Documentation/cgroup-v2.txt
> > > @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
> > > 5-2-1. Memory Interface Files
> > > 5-2-2. Usage Guidelines
> > > 5-2-3. Memory Ownership
> > > +   5-2-4. Cgroup-aware OOM Killer
> > 
> > Random curiousness, why cgroup-aware oom killer and not memcg-aware oom 
> > killer?
> 
> I don't think we use the term "memcg" somewhere in v2 docs.
> Do you think that "Memory cgroup-aware OOM killer" is better?
> 

I think it would be better to not describe it as its own entity, but 
rather a part of how the memory cgroup works, so simply describing it in 
section 5-2, perhaps as its own subsection, as how the oom killer works 
when using the memory cgroup is sufficient.  I wouldn't separate it out as 
a distinct cgroup feature in the documentation.

> > > + cgroups.  The default is "0".
> > > +
> > > + Defines whether the OOM killer should treat the cgroup
> > > + as a single entity during the victim selection.
> > 
> > Isn't this true independent of the memory.oom_kill_all_tasks setting?  
> > The cgroup aware oom killer will consider memcg's as logical units when 
> > deciding what to kill with or without memory.oom_kill_all_tasks, right?
> > 
> > I think you cover this fact in the cgroup aware oom killer section below 
> > so this might result in confusion if described alongside a setting of
> > memory.oom_kill_all_tasks.
> > 

I assume this is fixed so that it's documented that memory cgroups are 
considered logical units by the oom killer and that 
memory.oom_kill_all_tasks is separate?  The former defines the policy on 
how a memory cgroup is targeted and the latter defines the mechanism it 
uses to free memory.

> > > + If set, OOM killer will kill all belonging tasks in
> > > + corresponding cgroup is selected as an OOM victim.
> > 
> > Maybe
> > 
> > "If set, the OOM killer will kill all threads attached to the memcg if 
> > selected as an OOM victim."
> > 
> > is better?
> 
> Fixed to the following (to conform with core v2 concepts):
>   If set, OOM killer will kill all processes attached to the cgroup
>   if selected as an OOM victim.
> 

Thanks.

> > > +Cgroup-aware OOM Killer
> > > +~~~
> > > +
> > > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > > +It means that it treats memory cgroups as first class OOM entities.
> > > +
> > > +Under OOM conditions the memory controller tries to make the best
> > > +choise of a victim, hierarchically looking for the largest memory
> > > +consumer. By default, it will look for the biggest task in the
> > > +biggest leaf cgroup.
> > > +
> > > +Be default, all cgroups have oom_priority 0, and OOM killer will
> > > +chose the largest cgroup recursively on each level. For non-root
> > > +cgroups it's possible to change the oom_priority, and it will cause
> > > +the OOM killer to look athe the priority value first, and compare
> > > +sizes only of cgroups with equal priority.
> > 
> > Maybe some description of "largest" would be helpful here?  I think you 
> > could briefly describe what is accounted for in the decisionmaking.
> 
> I'm afraid that it's too implementation-defined to be described.
> Do you have an idea, how to describe it without going too much into details?
> 

The point is that "largest cgroup" is ambiguous here: largest in what 
sense?  The cgroup with the largest number of processes attached?  Using 
the largest amount of memory?

I think the documentation should clearly define that the oom killer 
selects the memory cgroup that has the most memory managed at each level.


Re: [PATCH v3 3/3] Documentation: hwmon: Document the IBM CFF power supply

2017-08-15 Thread Eddie James



On 08/14/2017 05:37 PM, Guenter Roeck wrote:

On Mon, Aug 14, 2017 at 02:26:20PM -0500, Eddie James wrote:


On 08/14/2017 01:53 PM, Guenter Roeck wrote:

On Mon, Aug 14, 2017 at 10:26:30AM -0500, Eddie James wrote:

From: "Edward A. James" 

Signed-off-by: Edward A. James 
---
  Documentation/hwmon/ibm-cffps | 54 +++
  1 file changed, 54 insertions(+)
  create mode 100644 Documentation/hwmon/ibm-cffps

diff --git a/Documentation/hwmon/ibm-cffps b/Documentation/hwmon/ibm-cffps
new file mode 100644
index 000..e091ff2
--- /dev/null
+++ b/Documentation/hwmon/ibm-cffps
@@ -0,0 +1,54 @@
+Kernel driver ibm-cffps
+===
+
+Supported chips:
+  * IBM Common Form Factor power supply
+
+Author: Eddie James 
+
+Description
+---
+
+This driver supports IBM Common Form Factor (CFF) power supplies. This driver
+is a client to the core PMBus driver.
+
+Usage Notes
+---
+
+This driver does not auto-detect devices. You will have to instantiate the
+devices explicitly. Please see Documentation/i2c/instantiating-devices for
+details.
+
+Sysfs entries
+-
+
+The following attributes are supported:
+
+curr1_alarmOutput current over-current fault.
+curr1_inputMeasured output current in mA.
+curr1_label"iout1"
+
+fan1_alarm Fan 1 warning.
+fan1_fault Fan 1 fault.
+fan1_input Fan 1 speed in RPM.
+fan2_alarm Fan 2 warning.
+fan2_fault Fan 2 fault.
+fan2_input Fan 2 speed in RPM.
+
+in1_alarm  Input voltage under-voltage fault.

Just noticed. Are you sure you mean 'fault' here and below?
'alarm' attributes normally report an over- or under- condition,
but not a fault. Faults should be reported with 'fault' attributes.
In PMBus lingo (which doesn't distinguish a real 'fault' from
a critical over- or under- condition), the "FAULT" condition
usually maps to the 'crit_alarm' or 'lcrit_alarm' attributes.
Also, under-voltages would normally be reported as min_alarm
or lcrit_alarm, not in_alarm.

Thanks, I better change this doc to "alarm." The spec reports all these as
"faults" but many of them are merely over-temp or over-voltage, etc, and
should be "alarm" to be consistent with PMBus.

The problem with this power supply is that it doesn't report any "limits."
So unless I set up my read_byte function to return some limits, we can't get
any lower or upper limits and therefore won't get the crit_alarm,
lcrit_alarm, etc. Do you think I should "fake" the limits in the driver?


Good question. Are the limits documented ? If yes, that would make sense.
I am quite sure that limits are word registers, though.


No, no documentation on any limits... I'll leave it as is, as it's 
meeting our requirements for now. I'll just change "fault" to "alarm" in 
the doc here.


Thanks,
Eddie



Guenter


+in1_input  Measured input voltage in mV.
+in1_label  "vin"
+in2_alarm  Output voltage over-voltage fault.
+in2_input  Measured output voltage in mV.
+in2_label  "vout1"
+
+power1_alarm   Input fault.

Another example; this maps to PMBUS_PIN_OP_WARN_LIMIT which is an
input power alarm, not an indication of a fault condition.

Hm, with my latest changes to look at the higher byte of STATUS_WORD, it
looks like we now have the same name for both the pin generic alarm
attribute and the pin_limit_attr... So in this device's case, it would map
to PB_STATUS_INPUT bit of STATUS_WORD. Didn't think about that... any
suggestions? Can't really change the name of the limit one without breaking
people's code...


+power1_input   Measured input power in uW.
+power1_label   "pin"
+
+temp1_alarmPSU inlet ambient temperature over-temperature fault.
+temp1_inputMeasured PSU inlet ambient temp in millidegrees C.
+temp2_alarmSecondary rectifier temp over-temperature fault.

Interestingly, PMBus does not distinguish between a critical temperature
alarm and an actual "fault". Makes me wonder if the IBM PS reports
CFFPS_MFR_THERMAL_FAULT if there is an actual fault (chip or sensor failure),
or if it has the same meaning as PB_TEMP_OT_FAULT, ie an excessively high
temperature.

Will change these to "alarm" in the doc too.


If it is a real fault (a detected sensor failure), we should possibly
consider adding a respective "virtual" temperature status flag. The same
is true for other status bits reported in the manufacturer status
register if any of those reflect a "real" fault, ie a chip failure.

Yea, that would probably be helpful. The CFFPS_MFR_THERMAL_FAULT bit is a
fault (so the spec says), but I'm not sure what is triggering it.

Thanks,
Eddie


+temp2_inputMeasured secondary rectifier temp in millidegrees C.
+temp3_alarmORing FET temperature over-temperature 

[PATCH] cpuset: Allow v2 behavior in v1 cgroup

2017-08-15 Thread Waiman Long
Cpuset v2 has some valuable attributes that are not present in
v1 because of backward-compatibility concerns. One of them is the
restoration of the original cpu and memory node mask after a hot
removal and addition event sequence.

This patch adds a new kernel command line option "cpuset_v2_mode=" to
allow cpuset to have v2 behavior in a v1 cgroup when using a non-zero
value. This enables users to optionally enable cpuset v2 behavior if
they choose to.

Signed-off-by: Waiman Long 
---
 Documentation/admin-guide/kernel-parameters.txt |  7 
 kernel/cgroup/cpuset.c  | 46 ++---
 2 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 372cc66..7abca11 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -668,6 +668,13 @@
on every CPU online, such as boot, and resume from 
suspend.
Default: 1
 
+   cpuset_v2_mode= [KNL] Enable cpuset v2 behavior in cpuset v1 cgroups.
+   In v2 mode, the cpus and mems can be restored to
+   their original values after a removal-addition
+   event sequence.
+   0: default value, cpuset v1 keeps legacy behavior.
+   1: cpuset v1 behaves like cpuset v2.
+
cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
Format:
<first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 9ed6a05..7e56383 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -300,6 +300,17 @@ static inline int is_spread_slab(const struct cpuset *cs)
 static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
 
 /*
+ * Flag to make cpuset v1 behaves like v2 so that cpus and mems can be
+ * restored back to their original values after an offline-online sequence.
+ */
+static bool cpuset_v2_mode __read_mostly;
+
+static inline bool is_in_v2_mode(void)
+{
+   return cgroup_subsys_on_dfl(cpuset_cgrp_subsys) || cpuset_v2_mode;
+}
+
+/*
  * This is ugly, but preserves the userspace API for existing cpuset
  * users. If someone tries to mount the "cpuset" filesystem, we
  * silently switch it to mount "cgroup" instead
@@ -489,8 +500,7 @@ static int validate_change(struct cpuset *cur, struct 
cpuset *trial)
 
/* On legacy hiearchy, we must be a subset of our parent cpuset. */
ret = -EACCES;
-   if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
-   !is_cpuset_subset(trial, par))
+   if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
goto out;
 
/*
@@ -903,8 +913,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct 
cpumask *new_cpus)
 * If it becomes empty, inherit the effective mask of the
 * parent, which is guaranteed to have some CPUs.
 */
-   if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
-   cpumask_empty(new_cpus))
+   if (is_in_v2_mode() && cpumask_empty(new_cpus))
cpumask_copy(new_cpus, parent->effective_cpus);
 
/* Skip the whole subtree if the cpumask remains the same. */
@@ -921,7 +930,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct 
cpumask *new_cpus)
cpumask_copy(cp->effective_cpus, new_cpus);
spin_unlock_irq(&callback_lock);
 
-   WARN_ON(!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+   WARN_ON(!is_in_v2_mode() &&
!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
 
update_tasks_cpumask(cp);
@@ -1157,8 +1166,7 @@ static void update_nodemasks_hier(struct cpuset *cs, 
nodemask_t *new_mems)
 * If it becomes empty, inherit the effective mask of the
 * parent, which is guaranteed to have some MEMs.
 */
-   if (cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
-   nodes_empty(*new_mems))
+   if (is_in_v2_mode() && nodes_empty(*new_mems))
*new_mems = parent->effective_mems;
 
/* Skip the whole subtree if the nodemask remains the same. */
@@ -1175,7 +1183,7 @@ static void update_nodemasks_hier(struct cpuset *cs, 
nodemask_t *new_mems)
cp->effective_mems = *new_mems;
spin_unlock_irq(&callback_lock);
 
-   WARN_ON(!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
+   WARN_ON(!is_in_v2_mode() &&
!nodes_equal(cp->mems_allowed, cp->effective_mems));
 
update_tasks_nodemask(cp);
@@ -1467,7 +1475,7 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
 
/* allow moving tasks into an empty cpuset if on default hierarchy */

Re: [PATCH v3 0/5] fs/dcache: Limit # of negative dentries

2017-08-15 Thread Waiman Long
On 07/28/2017 02:34 PM, Waiman Long wrote:
>  v2->v3:
>   - Add a faster pruning rate when the free pool is closed to depletion.
>   - As suggested by James Bottomley, add an artificial delay waiting
> loop before killing a negative dentry and properly clear the
> DCACHE_KILL_NEGATIVE flag if killing doesn't happen.
>   - Add a new patch to track number of negative dentries that are
> forcifully killed.
>
>  v1->v2:
>   - Move the new nr_negative field to the end of dentry_stat_t structure
> as suggested by Matthew Wilcox.
>   - With the help of Miklos Szeredi, fix incorrect locking order in
> dentry_kill() by using lock_parent() instead of locking the parent's
> d_lock directly.
>   - Correctly account for positive to negative dentry transitions.
>   - Automatic pruning of negative dentries will now ignore the reference
> bit in negative dentries but not the regular shrinking.
>
> A rogue application can potentially create a large number of negative
> dentries in the system consuming most of the memory available. This
> can impact performance of other applications running on the system.
>
> This patchset introduces changes to the dcache subsystem to limit
> the number of negative dentries allowed to be created thus limiting
> the amount of memory that can be consumed by negative dentries.
>
> Patch 1 tracks the number of negative dentries used and disallow
> the creation of more when the limit is reached.
>
> Patch 2 enables /proc/sys/fs/dentry-state to report the number of
> negative dentries in the system.
>
> Patch 3 enables automatic pruning of negative dentries when it is
> close to the limit so that we won't end up killing recently used
> negative dentries.
>
> Patch 4 prevents racing between negative dentry pruning and umount
> operation.
>
> Patch 5 shows the number of forced negative dentry killings in
> /proc/sys/fs/dentry-state. End users can then tune the neg_dentry_pc=
> kernel boot parameter if they want to reduce forced negative dentry
> killings.
>
> Waiman Long (5):
>   fs/dcache: Limit numbers of negative dentries
>   fs/dcache: Report negative dentry number in dentry-state
>   fs/dcache: Enable automatic pruning of negative dentries
>   fs/dcache: Protect negative dentry pruning from racing with umount
>   fs/dcache: Track count of negative dentries forcibly killed
>
>  Documentation/admin-guide/kernel-parameters.txt |   7 +
>  fs/dcache.c | 451 
> ++--
>  include/linux/dcache.h  |   8 +-
>  include/linux/list_lru.h|   1 +
>  mm/list_lru.c   |   4 +-
>  5 files changed, 435 insertions(+), 36 deletions(-)
>
I haven't received any comment on this v3 patch for over 2 weeks. Is
there anything I can do to make it more ready to be merged?

Thanks,
Longman



RE: [PATCH] New Chapter on CodingStyle .

2017-08-15 Thread David Laight
From: Stephen Hemminger
> Sent: 15 August 2017 17:21
> On Tue, 15 Aug 2017 10:42:39 +
> David Laight  wrote:
> 
> > From: Jonathan Corbet
> > > Sent: 12 August 2017 15:55
> > ...
> > > > +   Chapter 20: Put values on initialisers without exception
> > > > +
> > > > +When declaring variables on functions must put values:
> > >
> > > Thanks for sending a patch for the kernel's documentation.
> > > Unfortunately, I can't accept this patch for a couple of reasons:
> > ...
> > > - The coding style document is there to describe the community's
> > >   standards for kernel code.  It is *not* a mechanism for imposing new
> > >   standards.  If you really think that the kernel community should adopt
> > >   this rule, you will need to argue for it on the mailing lists.  I will
> > >   say, though, that I do not expect that this effort would be successful.
> >
> > I'd even go as far as suggesting almost the opposite.
> > Declarations should only have initialisers if the value is constant.
> 
> Yup. This new rule sounds like something taught to people in coding schools.
> But initializing everything defeats the compiler's detection of
> uninitialized variables, which is more useful for catching errors.

You'll also get:
Values being read the wrong side of locks.
Values being read early so requiring spilling to stack.

Next someone will be suggesting that all pointers should be checked
against NULL every time they are used.

David



Re: [PATCH v4 resend 1/2] rtmutex: update rt-mutex-design

2017-08-15 Thread Steven Rostedt
On Fri, 11 Aug 2017 15:47:14 +0800
Alex Shi  wrote:

> On 08/08/2017 04:30 AM, Jonathan Corbet wrote:
> > On Mon, 31 Jul 2017 09:53:01 +0800
> > Alex Shi  wrote:
> >   
> >> On 07/31/2017 09:50 AM, Alex Shi wrote:  
> >>> -Reviewers:  Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy 
> >>> Dunlap
> >>> +Original Reviewers:  Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and
> >>> +  Randy Dunlap
> >>> +Update (7/6/2017) Reviewers: Steven Rostedt and Sebastian Siewior
> >>>  
> >>
> >> Hi Peter,
> >>
> >> This patch had been reviewed and acked by Sebastian and Steven. Do you
> >> still need a official 'reviewed-by' from them?  
> > 
> > Given this, I've been assuming that these changes aren't meant to go
> > through the docs tree; let me know if I should pick them up instead.
> >   
> 
> Hi Jon,
> 
> Sorry for skipping you here. Yes, if you'd like to pick them up, please go ahead!
> 
> 

Agreed, Jon's tree is the best path.

-- Steve


Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

2017-08-15 Thread Roman Gushchin
On Mon, Aug 14, 2017 at 03:52:26PM -0700, David Rientjes wrote:
> On Mon, 14 Aug 2017, Roman Gushchin wrote:
> 
> > diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> > index dec5afdaa36d..22108f31e09d 100644
> > --- a/Documentation/cgroup-v2.txt
> > +++ b/Documentation/cgroup-v2.txt
> > @@ -48,6 +48,7 @@ v1 is available under Documentation/cgroup-v1/.
> > 5-2-1. Memory Interface Files
> > 5-2-2. Usage Guidelines
> > 5-2-3. Memory Ownership
> > +   5-2-4. Cgroup-aware OOM Killer
> 
> Random curiousness, why cgroup-aware oom killer and not memcg-aware oom 
> killer?

I don't think we use the term "memcg" somewhere in v2 docs.
Do you think that "Memory cgroup-aware OOM killer" is better?

> 
> >   5-3. IO
> > 5-3-1. IO Interface Files
> > 5-3-2. Writeback
> > @@ -1002,6 +1003,37 @@ PAGE_SIZE multiple when read back.
> > high limit is used and monitored properly, this limit's
> > utility is limited to providing the final safety net.
> >  
> > +  memory.oom_kill_all_tasks
> > +
> > +   A read-write single value file which exits on non-root
> 
> s/exits/exists/

Fixed. Thanks!

> 
> > +   cgroups.  The default is "0".
> > +
> > +   Defines whether the OOM killer should treat the cgroup
> > +   as a single entity during the victim selection.
> 
> Isn't this true independent of the memory.oom_kill_all_tasks setting?  
> The cgroup aware oom killer will consider memcg's as logical units when 
> deciding what to kill with or without memory.oom_kill_all_tasks, right?
> 
> I think you cover this fact in the cgroup aware oom killer section below 
> so this might result in confusion if described alongside a setting of
> memory.oom_kill_all_tasks.
> 
> > +
> > +   If set, OOM killer will kill all belonging tasks in
> > +   corresponding cgroup is selected as an OOM victim.
> 
> Maybe
> 
> "If set, the OOM killer will kill all threads attached to the memcg if 
> selected as an OOM victim."
> 
> is better?

Fixed to the following (to conform with core v2 concepts):
  If set, OOM killer will kill all processes attached to the cgroup
  if selected as an OOM victim.

> 
> > +
> > +   Be default, OOM killer respect /proc/pid/oom_score_adj value
> > +   -1000, and will never kill the task, unless oom_kill_all_tasks
> > +   is set.
> > +
> > +  memory.oom_priority
> > +
> > +   A read-write single value file which exits on non-root
> 
> s/exits/exists/

Fixed.

> 
> > +   cgroups.  The default is "0".
> > +
> > +   An integer number within the [-1, 1] range,
> > +   which defines the order in which the OOM killer selects victim
> > +   memory cgroups.
> > +
> > +   OOM killer prefers memory cgroups with larger priority if they
> > +   are populated with elegible tasks.
> 
> s/elegible/eligible/

Fixed.

> 
> > +
> > +   The oom_priority value is compared within sibling cgroups.
> > +
> > +   The root cgroup has the oom_priority 0, which cannot be changed.
> > +
> >memory.events
> > A read-only flat-keyed file which exists on non-root cgroups.
> > The following entries are defined.  Unless specified
> > @@ -1206,6 +1238,36 @@ POSIX_FADV_DONTNEED to relinquish the ownership of 
> > memory areas
> >  belonging to the affected files to ensure correct memory ownership.
> >  
> >  
> > +Cgroup-aware OOM Killer
> > +~~~
> > +
> > +Cgroup v2 memory controller implements a cgroup-aware OOM killer.
> > +It means that it treats memory cgroups as first class OOM entities.
> > +
> > +Under OOM conditions the memory controller tries to make the best
> > +choise of a victim, hierarchically looking for the largest memory
> > +consumer. By default, it will look for the biggest task in the
> > +biggest leaf cgroup.
> > +
> > +Be default, all cgroups have oom_priority 0, and OOM killer will
> > +chose the largest cgroup recursively on each level. For non-root
> > +cgroups it's possible to change the oom_priority, and it will cause
> > +the OOM killer to look athe the priority value first, and compare
> > +sizes only of cgroups with equal priority.
> 
> Maybe some description of "largest" would be helpful here?  I think you 
> could briefly describe what is accounted for in the decisionmaking.

I'm afraid that it's too implementation-defined to be described.
Do you have an idea how to describe it without going into too much detail?

> s/athe/at the/

Fixed.

> 
> Reading through this, it makes me wonder if doing s/cgroup/memcg/ over 
> most of it would be better.

I don't think memcg is a good user-facing term, but I agree that it's
necessary to highlight the fact that a user should enable the memory
controller to get this functionality.
Added a corresponding note.

Thanks!

Roman


Re: [PATCH v4 5/6] perf: hisi: Add support for HiSilicon SoC DDRC PMU driver

2017-08-15 Thread Mark Rutland
On Tue, Jul 25, 2017 at 08:10:41PM +0800, Shaokun Zhang wrote:
> This patch adds support for DDRC PMU driver in HiSilicon SoC chip, Each
> DDRC has own control, counter and interrupt registers and is an separate
> PMU. For each DDRC PMU, it has 8-fixed-purpose counters which have been
> mapped to 8-events by hardware, it assumes that counter index is equal
> to event code (0 - 7) in DDRC PMU driver. Interrupt is supported to
> handle counter (32-bits) overflow.
> 
> Reviewed-by: Jonathan Cameron 
> Signed-off-by: Shaokun Zhang 
> Signed-off-by: Anurup M 
> ---
>  drivers/perf/hisilicon/Makefile   |   2 +-
>  drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c | 420 ++
>  2 files changed, 421 insertions(+), 1 deletion(-)
>  create mode 100644 drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c

My comments for the HHA PMU driver all apply here, too.

Thanks,
Mark.

> diff --git a/drivers/perf/hisilicon/Makefile b/drivers/perf/hisilicon/Makefile
> index a72afe8..2621d51 100644
> --- a/drivers/perf/hisilicon/Makefile
> +++ b/drivers/perf/hisilicon/Makefile
> @@ -1 +1 @@
> -obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o hisi_uncore_hha_pmu.o
> +obj-$(CONFIG_HISI_PMU) += hisi_uncore_pmu.o hisi_uncore_l3c_pmu.o hisi_uncore_hha_pmu.o hisi_uncore_ddrc_pmu.o
> diff --git a/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c 
> b/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c
> new file mode 100644
> index 000..e178a09
> --- /dev/null
> +++ b/drivers/perf/hisilicon/hisi_uncore_ddrc_pmu.c
> @@ -0,0 +1,420 @@
> +/*
> + * HiSilicon SoC DDRC uncore Hardware event counters support
> + *
> + * Copyright (C) 2017 Hisilicon Limited
> + * Author: Shaokun Zhang 
> + * Anurup M 
> + *
> + * This code is based on the uncore PMUs like arm-cci and arm-ccn.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include "hisi_uncore_pmu.h"
> +
> +/* DDRC register definition */
> +#define DDRC_PERF_CTRL   0x010
> +#define DDRC_FLUX_WR 0x380
> +#define DDRC_FLUX_RD 0x384
> +#define DDRC_FLUX_WCMD  0x388
> +#define DDRC_FLUX_RCMD  0x38c
> +#define DDRC_PRE_CMD0x3c0
> +#define DDRC_ACT_CMD0x3c4
> +#define DDRC_BNK_CHG0x3c8
> +#define DDRC_RNK_CHG0x3cc
> +#define DDRC_EVENT_CTRL 0x6C0
> +#define DDRC_INT_MASK0x6c8
> +#define DDRC_INT_STATUS  0x6cc
> +#define DDRC_INT_CLEAR   0x6d0
> +
> +/* DDRC supports 8-events and counter is fixed-purpose */
> +#define DDRC_NR_COUNTERS 0x8
> +#define DDRC_NR_EVENTS   DDRC_NR_COUNTERS
> +
> +#define DDRC_PERF_CTRL_EN0x2
> +
> +/*
> + * For DDRC PMU, there are eight-events and every event has been mapped
> + * to fixed-purpose counters which register offset is not consistent.
> + * Therefore there is no write event type and we assume that event
> + * code (0 to 7) is equal to counter index in PMU driver.
> + */
> +#define GET_DDRC_EVENTID(hwc)(hwc->config_base & 0x7)
> +
> +static const u32 ddrc_reg_off[] = {
> + DDRC_FLUX_WR, DDRC_FLUX_RD, DDRC_FLUX_WCMD, DDRC_FLUX_RCMD,
> + DDRC_PRE_CMD, DDRC_ACT_CMD, DDRC_BNK_CHG, DDRC_RNK_CHG
> +};
> +
> +/*
> + * Select the counter register offset using the counter index.
> + * In DDRC there are no programmable counter, the count
> + * is readed form the statistics counter register itself.
> + */
> +static u32 get_counter_reg_off(int cntr_idx)
> +{
> + return ddrc_reg_off[cntr_idx];
> +}
> +
> +static u64 hisi_ddrc_pmu_read_counter(struct hisi_pmu *ddrc_pmu,
> +   struct hw_perf_event *hwc)
> +{
> + /* Use event code as counter index */
> + u32 idx = GET_DDRC_EVENTID(hwc);
> + u32 reg;
> +
> + if (!hisi_uncore_pmu_counter_valid(ddrc_pmu, idx)) {
> + dev_err(ddrc_pmu->dev, "Unsupported event index:%d!\n", idx);
> + return 0;
> + }
> +
> + reg = get_counter_reg_off(idx);
> +
> + return readl(ddrc_pmu->base + reg);
> +}
> +
> +static void hisi_ddrc_pmu_write_counter(struct hisi_pmu *ddrc_pmu,
> + struct hw_perf_event *hwc, u64 val)
> +{
> + u32 idx = GET_DDRC_EVENTID(hwc);
> + u32 reg;
> +
> + if (!hisi_uncore_pmu_counter_valid(ddrc_pmu, idx)) {
> + dev_err(ddrc_pmu->dev, "Unsupported event index:%d!\n", idx);
> + return;
> + }
> +
> + reg = get_counter_reg_off(idx);
> + writel((u32)val, ddrc_pmu->base + reg);
> +}
> +
> +static void hisi_ddrc_pmu_start_counters(struct hisi_pmu *ddrc_pmu)
> +{
> + u32 val;
> 

Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-15 Thread Roman Gushchin
On Tue, Aug 15, 2017 at 10:20:18PM +1000, Aleksa Sarai wrote:
> On 08/15/2017 10:15 PM, Roman Gushchin wrote:
> > Generally, oom_score_adj should have a meaning only on a cgroup level,
> > so extending it to the system level doesn't sound as a good idea.
> 
> But wasn't the original purpose of oom_score (and oom_score_adj) to work on
> a system level, aka "normal" OOM? Is there some peculiarity about memcg OOM
> that I'm missing?

I'm sorry if it wasn't clear from my message: it's not about
the system-wide OOM vs the memcg-wide OOM, it's about the isolation.

In general, the decision is made on the memcg level first (based on
oom_priority and size), and only then on the task level (based on size
and oom_score_adj).

Oom_score_adj affects which task inside the cgroup will be killed,
but we never compare tasks from different cgroups. This is what I mean
when I say that oom_score_adj should not have a system-wide meaning.

Thanks!


Re: [PATCH v5] printk: Add monotonic, boottime, and realtime timestamps

2017-08-15 Thread Thomas Gleixner
On Thu, 10 Aug 2017, Prarit Bhargava wrote:
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index cedafa008de5..1ddf04201047 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 

That's needed because?

>  #include "tick-internal.h"
>  #include "ntp_internal.h"
> @@ -60,8 +61,39 @@ struct tk_fast {
>   struct tk_read_base base[2];
>  };
>  
> -static struct tk_fast tk_fast_mono cacheline_aligned;
> -static struct tk_fast tk_fast_raw  cacheline_aligned;
> +/* Suspend-time cycles value for halted fast timekeeper. */
> +static u64 cycles_at_suspend;
> +
> +static u64 dummy_clock_read(struct clocksource *cs)
> +{
> + return cycles_at_suspend;
> +}
> +
> +static struct clocksource dummy_clock = {
> + .read = dummy_clock_read,
> +};
> +
> +static struct tk_fast tk_fast_mono cacheline_aligned = {
> + .base = {
> + (struct tk_read_base){

Eew.

> + .clock = &dummy_clock,
> + },

.base[0] = {
.clock = &dummy_clock,
},

.base[1] = {
.clock = &dummy_clock,
},

Hmm?

> -static struct clocksource dummy_clock = {
> - .read = dummy_clock_read,
> -};
> -

Can we please have that timekeeping change as a seperate patch?

Thanks,

tglx


Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-15 Thread Aleksa Sarai

On 08/15/2017 10:15 PM, Roman Gushchin wrote:

Generally, oom_score_adj should have a meaning only on a cgroup level,
so extending it to the system level doesn't sound as a good idea.


But wasn't the original purpose of oom_score (and oom_score_adj) to work 
on a system level, aka "normal" OOM? Is there some peculiarity about 
memcg OOM that I'm missing?


--
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-15 Thread Roman Gushchin
On Mon, Aug 14, 2017 at 03:42:54PM -0700, David Rientjes wrote:
> On Mon, 14 Aug 2017, Roman Gushchin wrote:
> > +
> > +static long oom_evaluate_memcg(struct mem_cgroup *memcg,
> > +  const nodemask_t *nodemask)
> > +{
> > +   struct css_task_iter it;
> > +   struct task_struct *task;
> > +   int elegible = 0;
> > +
> > +   css_task_iter_start(&memcg->css, 0, &it);
> > +   while ((task = css_task_iter_next(&it))) {
> > +   /*
> > +* If there are no tasks, or all tasks have oom_score_adj set
> > +* to OOM_SCORE_ADJ_MIN and oom_kill_all_tasks is not set,
> > +* don't select this memory cgroup.
> > +*/
> > +   if (!elegible &&
> > +   (memcg->oom_kill_all_tasks ||
> > +task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN))
> > +   elegible = 1;
> 
> I'm curious about the decision made in this conditional and how 
> oom_kill_memcg_member() ignores task->signal->oom_score_adj.  It means 
> that memory.oom_kill_all_tasks overrides /proc/pid/oom_score_adj if it 
> should otherwise be disabled.
> 
> It's undocumented in the changelog, but I'm questioning whether it's the 
> right decision.  Doesn't it make sense to kill all tasks that are not oom 
> disabled, and allow the user to still protect certain processes by their 
> /proc/pid/oom_score_adj setting?  Otherwise, there's no way to do that 
> protection without a sibling memcg and its own reservation of memory.  I'm 
> thinking about a process that governs jobs inside the memcg and if there 
> is an oom kill, it wants to do logging and any cleanup necessary before 
> exiting itself.  It seems like a powerful combination if coupled with oom 
> notification.

Good question!
I think that the ability to override any oom_score_adj value and get all
tasks killed is more important than the ability to kill all processes with
some exceptions.

In your example someone still needs to look after the remaining process,
and kill it after some timeout, if it will not quit by itself, right?

The special treatment of the -1000 value (without oom_kill_all_tasks)
is required only to avoid breaking existing setups.

Generally, oom_score_adj should have a meaning only on a cgroup level,
so extending it to the system level doesn't sound as a good idea.

> 
> Also, s/elegible/eligible/

Shame on me :)
Will fix, thanks!

> 
> Otherwise, looks good!

Great!
Thank you for the reviewing and testing.

Roman


Re: [PATCH v4 4/6] perf: hisi: Add support for HiSilicon SoC HHA PMU driver

2017-08-15 Thread Mark Rutland
On Tue, Jul 25, 2017 at 08:10:40PM +0800, Shaokun Zhang wrote:
> +/* HHA register definition */
> +#define HHA_INT_MASK 0x0804
> +#define HHA_INT_STATUS   0x0808
> +#define HHA_INT_CLEAR0x080C
> +#define HHA_PERF_CTRL0x1E00
> +#define HHA_EVENT_CTRL   0x1E04
> +#define HHA_EVENT_TYPE0  0x1E80
> +#define HHA_CNT0_LOWER   0x1F00
> +
> +/* HHA has 16-counters and supports 0x50 events */

As with the L3C PMU, what exactly does this mean?

Does this mean event IDs 0-0x4f are valid?

[...]

> +static irqreturn_t hisi_hha_pmu_isr(int irq, void *dev_id)
> +{
> + struct hisi_pmu *hha_pmu = dev_id;
> + struct perf_event *event;
> + unsigned long overflown;
> + u32 status;
> + int idx;
> +
> + /* Read HHA_INT_STATUS register */
> + status = readl(hha_pmu->base + HHA_INT_STATUS);
> + if (!status)
> + return IRQ_NONE;
> + overflown = status;

No need for the u32 temporary here.

[...]

> +static int hisi_hha_pmu_dev_probe(struct platform_device *pdev,
> +   struct hisi_pmu *hha_pmu)
> +{
> + struct device *dev = &pdev->dev;
> + int ret;
> +
> + ret = hisi_hha_pmu_init_data(pdev, hha_pmu);
> + if (ret)
> + return ret;
> +
> + /* Pick one core to use for cpumask attributes */
> + cpumask_set_cpu(smp_processor_id(), &hha_pmu->cpus);
> +

Why does this not have the usual event migration callbacks, across CPUs
in the same SCCL?

> + ret = hisi_hha_pmu_init_irq(hha_pmu, pdev);
> + if (ret)
> + return ret;
> +
> + hha_pmu->name = devm_kasprintf(dev, GFP_KERNEL, "hisi_hha%u_%u",
> +hha_pmu->hha_uid, hha_pmu->sccl_id);

As on the doc patch, this should be hierarchical.

Thanks,
Mark


RE: [PATCH] New Chapter on CodingStyle .

2017-08-15 Thread David Laight
From: Jonathan Corbet
> Sent: 12 August 2017 15:55
...
> > +   Chapter 20: Put values on initialisers without exception
> > +
> > +When declaring variables on functions must put values:
> 
> Thanks for sending a patch for the kernel's documentation.
> Unfortunately, I can't accept this patch for a couple of reasons:
...
> - The coding style document is there to describe the community's
>   standards for kernel code.  It is *not* a mechanism for imposing new
>   standards.  If you really think that the kernel community should adopt
>   this rule, you will need to argue for it on the mailing lists.  I will
>   say, though, that I do not expect that this effort would be successful.

I'd even go as far as suggesting almost the opposite.
Declarations should only have initialisers if the value is constant.

David



Re: [PATCH v4 3/6] perf: hisi: Add support for HiSilicon SoC L3C PMU driver

2017-08-15 Thread Mark Rutland
On Tue, Jul 25, 2017 at 08:10:39PM +0800, Shaokun Zhang wrote:
> This patch adds support for L3C PMU driver in HiSilicon SoC chip, Each
> L3C has own control, counter and interrupt registers and is an separate
> PMU. For each L3C PMU, it has 8-programable counters and supports 0x60
> events, event code is 8-bits and every counter is free-running.
> Interrupt is supported to handle counter (48-bits) overflow.

[...]

> +/* L3C register definition */
> +#define L3C_PERF_CTRL0x0408
> +#define L3C_INT_MASK 0x0800
> +#define L3C_INT_STATUS   0x0808
> +#define L3C_INT_CLEAR0x080c
> +#define L3C_EVENT_CTRL   0x1c00
> +#define L3C_EVENT_TYPE0  0x1d00
> +#define L3C_CNTR0_LOWER  0x1e00

Why does this have a _LOWER suffix?

How big is this regsiter? is it part of a pair of registers?

> +
> +/* L3C has 8-counters and supports 0x60 events */
> +#define L3C_NR_COUNTERS  0x8
> +#define L3C_NR_EVENTS0x60

What exactly is meant by "supports 0x60 events"?

e.g. does that mean event IDs 0-0x5f are valid?

> +static irqreturn_t hisi_l3c_pmu_isr(int irq, void *dev_id)
> +{
> + struct hisi_pmu *l3c_pmu = dev_id;
> + struct perf_event *event;
> + unsigned long overflown;
> + u32 status;
> + int idx;
> +
> + /* Read L3C_INT_STATUS register */
> + status = readl(l3c_pmu->base + L3C_INT_STATUS);
> + if (!status)
> + return IRQ_NONE;
> + overflown = status;

You don't need the temporary u32 value here; you can assign directly to
an unsigned long and do all the manipulation on that.

[...]

> +/* Check if the CPU belongs to the SCCL and CCL of PMU */
> +static bool hisi_l3c_is_cpu_in_ccl(struct hisi_pmu *l3c_pmu)
> +{
> + u32 ccl_id, sccl_id;
> +
> + hisi_read_sccl_and_ccl_id(&sccl_id, &ccl_id);
> +
> + if (sccl_id != l3c_pmu->sccl_id)
> + return false;
> +
> + if (ccl_id != l3c_pmu->ccl_id)
> + return false;
> +
> + /* Return true if matched */
> + return true;
> +}

The conditionals would be simpler as:

return (sccl_id == l3c_pmu->sccl_id &&
ccl_id == l3c_pmu->ccl_id);

> +
> +static int hisi_l3c_pmu_online_cpu(unsigned int cpu, struct hlist_node *node)
> +{
> + struct hisi_pmu *l3c_pmu;
> +
> + l3c_pmu = hlist_entry_safe(node, struct hisi_pmu, node);
> +
> + /* Proceed only when CPU belongs to the same SCCL and CCL */
> + if (!hisi_l3c_is_cpu_in_ccl(l3c_pmu))
> + return 0;

Surely you have a mask of CPUs that you can check instead? You'll need
that for the offline path.

[...]

> +static int hisi_l3c_pmu_offline_cpu(unsigned int cpu, struct hlist_node 
> *node)
> +{
> + struct hisi_pmu *l3c_pmu;
> + cpumask_t ccl_online_cpus;
> + unsigned int target;
> +
> + l3c_pmu = hlist_entry_safe(node, struct hisi_pmu, node);
> +
> + /* Proceed only when CPU belongs to the same SCCL and CCL */
> + if (!hisi_l3c_is_cpu_in_ccl(l3c_pmu))
> + return 0;

Again, surely you can check a pre-computed mask?

> +
> + /* Proceed if this CPU is used for event counting */
> + if (!cpumask_test_cpu(cpu, &l3c_pmu->cpus))
> + return 0;

You need to set up the CPU state regardless of whether there are active
events currently. Otherwise the cpumask can be stale, pointing at an
offline CPU, leaving the PMU unusable.

> +
> + /* Give up ownership of CCL */
> + cpumask_test_and_clear_cpu(cpu, &l3c_pmu->cpus);
> +
> + /* Any other CPU for this CCL which is still online */
> + cpumask_and(&ccl_online_cpus, cpu_coregroup_mask(cpu),
> + cpu_online_mask);

What is cpu_coregroup_mask? I do not think you can rely on that
happening to align with the physical CCL mask.

Instead, please:

* Keep track of which CPU(s) this PMU can be used from, with a cpumask.
  Either initialise that at probe time, or add CPUs to that in the
  hotplug callback.

* Use only that mask to determine which CPU to move the PMU context to.

* Use an int to hold the current CPU; there's no need to use a mask to
  hold what should be a single CPU ID.

[...]

> + /* Get the L3C SCCL ID */
> + if (device_property_read_u32(dev, "hisilicon,scl-id",
> +  &l3c_pmu->sccl_id)) {
> + dev_err(dev, "Can not read l3c sccl-id!\n");
> + return -EINVAL;
> + }
> +
> + /* Get the L3C CPU cluster(CCL) ID */
> + if (device_property_read_u32(dev, "hisilicon,ccl-id",
> +  &l3c_pmu->ccl_id)) {
> + dev_err(dev, "Can not read l3c ccl-id!\n");
> + return -EINVAL;
> + }

Previously, you expect that these happen to match particular bits of the
MPIDR, which vary for multi-threaded cores. Please document this.

> +static int hisi_l3c_pmu_dev_probe(struct platform_device *pdev,
> +   struct hisi_pmu *l3c_pmu)
> +{
> + struct device *dev = &pdev->dev;

Re: [PATCH v4 2/6] perf: hisi: Add support for HiSilicon SoC uncore PMU driver

2017-08-15 Thread Mark Rutland
Hi,

On Tue, Jul 25, 2017 at 08:10:38PM +0800, Shaokun Zhang wrote:
> +/* Read Super CPU cluster and CPU cluster ID from MPIDR_EL1 */
> +void hisi_read_sccl_and_ccl_id(u32 *sccl_id, u32 *ccl_id)
> +{
> + u64 mpidr;
> +
> + mpidr = read_cpuid_mpidr();
> + if (mpidr & MPIDR_MT_BITMASK) {
> + if (sccl_id)
> + *sccl_id = MPIDR_AFFINITY_LEVEL(mpidr, 3);
> + if (ccl_id)
> + *ccl_id = MPIDR_AFFINITY_LEVEL(mpidr, 2);
> + } else {
> + if (sccl_id)
> + *sccl_id = MPIDR_AFFINITY_LEVEL(mpidr, 2);
> + if (ccl_id)
> + *ccl_id = MPIDR_AFFINITY_LEVEL(mpidr, 1);
> + }
> +}

How exactly are SCCLs organised w.r.t. MPIDRS?

Is this guaranteed to be correct for future SoCs?

It would be nicer if this were described explicitly by FW rather than
guessed at based on the MPIDR.

> +static bool hisi_validate_event_group(struct perf_event *event)
> +{
> + struct perf_event *sibling, *leader = event->group_leader;
> + struct hisi_pmu *hisi_pmu = to_hisi_pmu(event->pmu);
> + /* Include count for the event */
> + int counters = 1;
> +
> + /*
> +  * We must NOT create groups containing mixed PMUs, although
> +  * software events are acceptable
> +  */
> + if (leader->pmu != event->pmu && !is_software_event(leader))
> + return false;
> +
> + /* Increment counter for the leader */
> + counters++;

If this event is the leader, you account for it twice.

I guess you get away with that assuming you have at least two counters,
but it's less than ideal.

> +
> + list_for_each_entry(sibling, &event->group_leader->sibling_list,
> + group_entry) {
> + if (is_software_event(sibling))
> + continue;
> + if (sibling->pmu != event->pmu)
> + return false;
> + /* Increment counter for each sibling */
> + counters++;
> + }
> +
> + /* The group can not count events more than the counters in the HW */
> + return counters <= hisi_pmu->num_counters;
> +}

[...]

> +/*
> + * Set the counter to count the event that we're interested in,
> + * and enable counter and interrupt.
> + */
> +static void hisi_uncore_pmu_enable_event(struct perf_event *event)
> +{
> + struct hisi_pmu *hisi_pmu = to_hisi_pmu(event->pmu);
> + struct hw_perf_event *hwc = &event->hw;
> +
> + /*
> +  * Write event code in event select registers(for DDRC PMU,
> +  * event has been mapped to fixed-purpose counter, there is
> +  * no need to write event type).
> +  */
> + if (hisi_pmu->ops->write_evtype)
> + hisi_pmu->ops->write_evtype(hisi_pmu, hwc->idx,
> + HISI_GET_EVENTID(event));

It looks like this is the only op which might be NULL. It would be
cleaner for the DDRC PMU code to provide an empty callback.

[...]

> +struct hisi_pmu *hisi_pmu_alloc(struct device *dev, u32 num_cntrs)
> +{
> + struct hisi_pmu *hisi_pmu;
> + struct hisi_pmu_hwevents *pmu_events;
> +
> + hisi_pmu = devm_kzalloc(dev, sizeof(*hisi_pmu), GFP_KERNEL);
> + if (!hisi_pmu)
> + return ERR_PTR(-ENOMEM);
> +
> + pmu_events = &hisi_pmu->pmu_events;
> + pmu_events->hw_events = devm_kcalloc(dev,
> +  num_cntrs,
> +  sizeof(*pmu_events->hw_events),
> +  GFP_KERNEL);
> + if (!pmu_events->hw_events)
> + return ERR_PTR(-ENOMEM);
> +
> + pmu_events->used_mask = devm_kcalloc(dev,
> +  BITS_TO_LONGS(num_cntrs),
> +  sizeof(*pmu_events->used_mask),
> +  GFP_KERNEL);

How big can num_counters be?

Assuming it's not too big, it would be nicer to embed these within the
hisi_pmu_hwevents.

[...]

> +
> +/* Generic pmu struct for different pmu types */
> +struct hisi_pmu {
> + const char *name;
> + struct pmu pmu;

struct pmu has a name field. Why do we need another?

> + union {
> + u32 ddrc_chn_id;
> + u32 l3c_tag_id;
> + u32 hha_uid;
> + };

This would be simpler as a `u32 id` rather than a union.

> + int num_counters;
> + int num_events;

Subsequent patches initialise num_events, but it is never used. Was it
supposed to be checked at event_init time? Or is it unnecessary?

Thanks,
Mark.


Re: [PATCH v4 1/6] Documentation: perf: hisi: Documentation for HiSilicon SoC PMU driver

2017-08-15 Thread Mark Rutland
Hi,

On Tue, Jul 25, 2017 at 08:10:37PM +0800, Shaokun Zhang wrote:
> This patch adds documentation for the uncore PMUs on HiSilicon SoC.
> 
> Reviewed-by: Jonathan Cameron 
> Signed-off-by: Shaokun Zhang 
> Signed-off-by: Anurup M 
> ---
>  Documentation/perf/hisi-pmu.txt | 52 +
>  1 file changed, 52 insertions(+)
>  create mode 100644 Documentation/perf/hisi-pmu.txt
> 
> diff --git a/Documentation/perf/hisi-pmu.txt b/Documentation/perf/hisi-pmu.txt
> new file mode 100644
> index 000..f45a03d
> --- /dev/null
> +++ b/Documentation/perf/hisi-pmu.txt
> @@ -0,0 +1,52 @@
> +HiSilicon SoC uncore Performance Monitoring Unit (PMU)
> +==
> +The HiSilicon SoC chip comprehends various independent system device PMUs

Nit: s/comprehends/comprises/ would be easier to read.

> +such as L3 cache (L3C), Hydra Home Agent (HHA) and DDRC. These PMUs are
> +independent and have hardware logic to gather statistics and performance
> +information.
> +
> +HiSilicon SoC encapsulates multiple CPU and IO dies. Each CPU cluster

Nit: The Hisilicon SoC

> +(CCL) is made up of 4 cpu cores sharing one L3 cache; Each CPU die is

Nit: s/Each/each/

> +called Super CPU cluster (SCCL) and is made up of 6 CCLs. Each SCCL has
> +two HHAs (0 - 1) and four DDRCs (0 - 3), respectively.
> +
> +HiSilicon SoC uncore PMU driver
> +---
> +Each device PMU has separate registers for event counting, control and
> +interrupt, and the PMU driver shall register perf PMU drivers like L3C,
> +HHA and DDRC etc. The available events and configuration options shall
> +be described in the sysfs, see /sys/devices/hisi_* 

What exactly is exposed under /sys/devices/hisi_* ?

> or /sys/bus/
> +event_source/devices/hisi_*.

Please don't wrap paths; keep this on one line.

> +The "perf list" command shall list the available events from sysfs.
> +
> +Each L3C, HHA and DDRC in one SCCL are registered as an separate PMU with perf.
> +The PMU name will appear in event listing as hisi_module<index-id>_<sccl-id>.
> +where "index-id" is the index of module and "sccl-id" is the identifier of
> +the SCCL.
> +e.g. hisi_l3c0_1/rd_hit_cpipe is READ_HIT_CPIPE event of L3C index #0 and SCCL
> +ID #1.
> +e.g. hisi_hha0_1/rx_operations is RX_OPERATIONS event of HHA index #0 and SCCL
> +ID #1.

It would make more sense for this to be hierarchical, e.g. hisi_sccl{X}_l3c{Y}.

Other than the above nits, this documentation is very useful. Thanks for
putting this together.

Thanks,
Mark.


Re: [PATCH v5 04/19] crypto: marvell/cesa: remove redundant backlog checks on EBUSY

2017-08-15 Thread Boris Brezillon
Le Mon, 14 Aug 2017 18:21:14 +0300,
Gilad Ben-Yossef  a écrit :

> Now that -EBUSY return code only indicates backlog queueing
> we can safely remove the now redundant check for the
> CRYPTO_TFM_REQ_MAY_BACKLOG flag when -EBUSY is returned.
> 
> Signed-off-by: Gilad Ben-Yossef 

Acked-by: Boris Brezillon 

> ---
>  drivers/crypto/marvell/cesa.c | 3 +--
>  drivers/crypto/marvell/cesa.h | 2 +-
>  2 files changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/crypto/marvell/cesa.c b/drivers/crypto/marvell/cesa.c
> index 6e7a5c7..269737f 100644
> --- a/drivers/crypto/marvell/cesa.c
> +++ b/drivers/crypto/marvell/cesa.c
> @@ -183,8 +183,7 @@ int mv_cesa_queue_req(struct crypto_async_request *req,
>   spin_lock_bh(&engine->lock);
>   ret = crypto_enqueue_request(&engine->queue, req);
>   if ((mv_cesa_req_get_type(creq) == CESA_DMA_REQ) &&
> - (ret == -EINPROGRESS ||
> - (ret == -EBUSY && req->flags & CRYPTO_TFM_REQ_MAY_BACKLOG)))
> + (ret == -EINPROGRESS || ret == -EBUSY))
>   mv_cesa_tdma_chain(engine, creq);
>   spin_unlock_bh(&engine->lock);
>  
> diff --git a/drivers/crypto/marvell/cesa.h b/drivers/crypto/marvell/cesa.h
> index b7872f6..63c8457 100644
> --- a/drivers/crypto/marvell/cesa.h
> +++ b/drivers/crypto/marvell/cesa.h
> @@ -763,7 +763,7 @@ static inline int mv_cesa_req_needs_cleanup(struct 
> crypto_async_request *req,
>* the backlog and will be processed later. There's no need to
>* clean it up.
>*/
> - if (ret == -EBUSY && req->flags & CRYPTO_TFM_REQ_MAY_BACKLOG)
> + if (ret == -EBUSY)
>   return false;
>  
>   /* Request wasn't queued, we need to clean it up */
