[RFC V2 0/6] cpufreq: transition-latency cleanups

2017-07-12 Thread Viresh Kumar
Hi Rafael,

Here is V2; I am sending it as an RFC this time.

This series tries to clean up the code around transition-latency and its
users.  Some of the old legacy code, which may not make much sense now,
is dropped as well.

Some code consolidation is also done across governors.

Based on: pm/linux-next
Tested on: ARM64 Hikey board.

V1->V2:
- We still get rid of the 10 ms limitation for using
  ondemand/conservative, but we preserve the earlier behavior where a
  transition latency set to CPUFREQ_ETERNAL does not allow use of the
  ondemand/conservative governors. Thanks to Dominik for his feedback on
  that.

--
viresh

Viresh Kumar (6):
  cpufreq: Replace "max_transition_latency" with "dynamic_switching"
  cpufreq: schedutil: Set dynamic_switching to true
  cpufreq: governor: Drop min_sampling_rate
  cpufreq: Use transition_delay_us for legacy governors as well
  cpufreq: Cap the default transition delay value to 10 ms
  cpufreq: arm_big_little: Make ->get_transition_latency() mandatory

 Documentation/admin-guide/pm/cpufreq.rst |  8 ---
 drivers/cpufreq/arm_big_little.c | 10 -
 drivers/cpufreq/cpufreq.c|  8 +++
 drivers/cpufreq/cpufreq_conservative.c   |  6 --
 drivers/cpufreq/cpufreq_governor.c   | 17 ++-
 drivers/cpufreq/cpufreq_governor.h   |  3 +--
 drivers/cpufreq/cpufreq_ondemand.c   | 12 ---
 include/linux/cpufreq.h  | 36 
 kernel/sched/cpufreq_schedutil.c | 12 ++-
 9 files changed, 40 insertions(+), 72 deletions(-)

-- 
2.13.0.71.gd7076ec9c9cb

--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC V2 3/6] cpufreq: governor: Drop min_sampling_rate

2017-07-12 Thread Viresh Kumar
The cpufreq core and governors aren't supposed to set a limit on how
fast we want to try changing the frequency. This is currently done for
the legacy governors with the help of min_sampling_rate.

At worst, we may end up setting the sampling rate to a value lower than
the rate at which the frequency can be changed, and then one of the CPUs
in the policy will do nothing but change the frequency forever.

But that is something for the user to decide and there is no need to
have special handling for such cases in the core. Leave it for the user
to figure out.
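
For illustration only (not part of the patch): a minimal user-space sketch
of the behavior change in the sampling_rate store path. The struct and the
two function names are hypothetical stand-ins, not the kernel's
store_sampling_rate(); the only point is that the old path clamps to a
floor and the new one takes the user's value as is.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the relevant part of struct dbs_data. */
struct dbs_tunables {
	unsigned int sampling_rate;     /* microseconds */
	unsigned int min_sampling_rate; /* only consulted by the old path */
};

/* Old behavior: a request below the floor is silently raised to it. */
static int store_sampling_rate_old(struct dbs_tunables *t, const char *buf)
{
	unsigned int rate;

	assert(t && buf);
	if (sscanf(buf, "%u", &rate) != 1)
		return -1;

	t->sampling_rate = rate > t->min_sampling_rate ?
			   rate : t->min_sampling_rate;
	return 0;
}

/* New behavior: whatever the user writes is stored unchanged. */
static int store_sampling_rate_new(struct dbs_tunables *t, const char *buf)
{
	unsigned int rate;

	assert(t && buf);
	if (sscanf(buf, "%u", &rate) != 1)
		return -1;

	t->sampling_rate = rate;
	return 0;
}
```

With a floor of 10000 us, writing "500" used to leave sampling_rate at
10000; with the floor gone it ends up as 500, and it is up to the user to
pick a value the hardware can keep up with.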

Signed-off-by: Viresh Kumar 
---
 Documentation/admin-guide/pm/cpufreq.rst |  8 
 drivers/cpufreq/cpufreq_conservative.c   |  6 --
 drivers/cpufreq/cpufreq_governor.c   | 10 ++
 drivers/cpufreq/cpufreq_governor.h   |  1 -
 drivers/cpufreq/cpufreq_ondemand.c   | 12 
 include/linux/cpufreq.h  |  2 --
 6 files changed, 2 insertions(+), 37 deletions(-)

diff --git a/Documentation/admin-guide/pm/cpufreq.rst 
b/Documentation/admin-guide/pm/cpufreq.rst
index 463cf7e73db8..2eb3bf62393e 100644
--- a/Documentation/admin-guide/pm/cpufreq.rst
+++ b/Documentation/admin-guide/pm/cpufreq.rst
@@ -471,14 +471,6 @@ it is allowed to use (the ``scaling_max_freq`` policy 
limit).
 
# echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > 
ondemand/sampling_rate
 
-
-``min_sampling_rate``
-   The minimum value of ``sampling_rate``.
-
-   Equal to 10000 (10 ms) if :c:macro:`CONFIG_NO_HZ_COMMON` and
-   :c:data:`tick_nohz_active` are both set or to 20 times the value of
-   :c:data:`jiffies` in microseconds otherwise.
-
 ``up_threshold``
If the estimated CPU load is above this value (in percent), the governor
will set the frequency to the maximum value allowed for the policy.
diff --git a/drivers/cpufreq/cpufreq_conservative.c 
b/drivers/cpufreq/cpufreq_conservative.c
index 88220ff3e1c2..f20f20a77d4d 100644
--- a/drivers/cpufreq/cpufreq_conservative.c
+++ b/drivers/cpufreq/cpufreq_conservative.c
@@ -246,7 +246,6 @@ gov_show_one_common(sampling_rate);
 gov_show_one_common(sampling_down_factor);
 gov_show_one_common(up_threshold);
 gov_show_one_common(ignore_nice_load);
-gov_show_one_common(min_sampling_rate);
 gov_show_one(cs, down_threshold);
 gov_show_one(cs, freq_step);
 
@@ -254,12 +253,10 @@ gov_attr_rw(sampling_rate);
 gov_attr_rw(sampling_down_factor);
 gov_attr_rw(up_threshold);
 gov_attr_rw(ignore_nice_load);
-gov_attr_ro(min_sampling_rate);
 gov_attr_rw(down_threshold);
 gov_attr_rw(freq_step);
 
 static struct attribute *cs_attributes[] = {
-   &min_sampling_rate.attr,
    &sampling_rate.attr,
    &sampling_down_factor.attr,
    &up_threshold.attr,
@@ -297,10 +294,7 @@ static int cs_init(struct dbs_data *dbs_data)
dbs_data->up_threshold = DEF_FREQUENCY_UP_THRESHOLD;
dbs_data->sampling_down_factor = DEF_SAMPLING_DOWN_FACTOR;
dbs_data->ignore_nice_load = 0;
-
dbs_data->tuners = tuners;
-   dbs_data->min_sampling_rate = MIN_SAMPLING_RATE_RATIO *
-   jiffies_to_usecs(10);
 
return 0;
 }
diff --git a/drivers/cpufreq/cpufreq_governor.c 
b/drivers/cpufreq/cpufreq_governor.c
index 47e24b5384b3..858081f9c3d7 100644
--- a/drivers/cpufreq/cpufreq_governor.c
+++ b/drivers/cpufreq/cpufreq_governor.c
@@ -47,14 +47,11 @@ ssize_t store_sampling_rate(struct gov_attr_set *attr_set, 
const char *buf,
 {
struct dbs_data *dbs_data = to_dbs_data(attr_set);
struct policy_dbs_info *policy_dbs;
-   unsigned int rate;
int ret;
-   ret = sscanf(buf, "%u", &rate);
+   ret = sscanf(buf, "%u", &dbs_data->sampling_rate);
if (ret != 1)
return -EINVAL;
 
-   dbs_data->sampling_rate = max(rate, dbs_data->min_sampling_rate);
-
/*
 * We are operating under dbs_data->mutex and so the list and its
 * entries can't be freed concurrently.
@@ -437,10 +434,7 @@ int cpufreq_dbs_governor_init(struct cpufreq_policy 
*policy)
latency = 1;
 
/* Bring kernel and HW constraints together */
-   dbs_data->min_sampling_rate = max(dbs_data->min_sampling_rate,
- MIN_LATENCY_MULTIPLIER * latency);
-   dbs_data->sampling_rate = max(dbs_data->min_sampling_rate,
- LATENCY_MULTIPLIER * latency);
+   dbs_data->sampling_rate = LATENCY_MULTIPLIER * latency;
 
if (!have_governor_per_policy())
gov->gdbs_data = dbs_data;
diff --git a/drivers/cpufreq/cpufreq_governor.h 
b/drivers/cpufreq/cpufreq_governor.h
index 7b7839c45fba..8463f5def0f5 100644
--- a/drivers/cpufreq/cpufreq_governor.h
+++ b/drivers/cpufreq/cpufreq_governor.h
@@ -41,7 +41,6 @@ enum {OD_NORMAL_SAMPLE, OD_SUB_SAMPLE};
 struct dbs_data {
struct gov_attr_set attr_set;
void *tuners;
-   unsigned int min_sampling_rate;
unsigned int ignore_nice_load;
unsigned int 

Re: [PATCH 1/2] memory-barriers.txt: Fix broken link to atomic_ops.txt

2017-07-12 Thread Jonathan Corbet
On Fri,  7 Jul 2017 03:21:17 +0900
SeongJae Park  wrote:

> Few obsolete links to atomic_ops.txt exist in memory-barriers.txt though
> the file has moved to core-api/atomic_ops.rst.  This commit fixes the
> obsolete links.

Both have been applied, thanks.

jon


Re: [RFC v5 00/38] powerpc: Memory Protection Keys

2017-07-12 Thread Benjamin Herrenschmidt
On Wed, 2017-07-12 at 09:23 +0200, Michal Hocko wrote:
> 
> > 
> > Ideally the MMU looks at the PTE for keys, in order to enforce
> > protection. This is the case with x86 and is the case with power9 Radix
> > page table. Hence the keys have to be programmed into the PTE.
> 
> But x86 doesn't update ptes for PKEYs, that would be just too expensive.
> You could use standard mprotect to do the same...

What do you mean ? x86 ends up in mprotect_fixup -> change_protection()
which will update the PTEs just the same as we do.

Changing the key for a page is a form of mprotect. Changing the access
permissions for keys is different, for us it's a special register
(AMR).

I don't understand why you think we are doing any differently than x86
here.

> > However with HPT on power, these keys do not necessarily have to be
> > programmed into the PTE. We could bypass the Linux Page Table Entry(PTE)
> > and instead just program them into the Hash Page Table(HPTE), since
> > the MMU does not refer to the PTE but refers to the HPTE. The last version
> > of the page attempted to do that.   It worked as follows:
> > 
> > a) when an address range is requested to be associated with a key; by the
> >application through key_mprotect() system call, the kernel
> >stores that key in the vmas corresponding to that address
> >range.
> > 
> > b) Whenever there is a hash page fault for that address, the fault
> >handler reads the key from the VMA and programs the key into the
> >HPTE. __hash_page() is the function that does that.
> 
> What causes the fault here?

The hardware. With the hash MMU, the HW walks a hash table which is
effectively a large in-memory TLB extension. When a page isn't found
there, a  "hash fault" is generated allowing Linux to populate that
hash table with the content of the corresponding PTE. 

> > c) Once the hpte is programmed, the MMU can sense key violations and
> >generate key-faults.
> > 
> > The problem is with step (b).  This step is really a very critical
> > path which is performance sensitive. We don't want to add any delays.
> > However if we want to access the key from the vma, we will have to
> > hold the vma semaphore, and that is a big NO-NO. As a result, this
> > design had to be dropped.
> > 
> > 
> > 
> > I reverted back to the old design i.e the design in v4 version. In this
> > version we do the following:
> > 
> > a) when an address range is requested to be associated with a key; by the
> >    application through key_mprotect() system call, the kernel
> >    stores that key in the vmas corresponding to that address
> >    range. Also the kernel programs the key into Linux PTE corresponding
> >    to all the pages associated with the address range.
> 
> OK, so how is this any different from the regular mprotect then?

It takes the key argument. This is nothing new. This was done for x86
already, we are just re-using the infrastructure. Look at
do_mprotect_pkey() in mm/mprotect.c today. It's all the same code,
pkey_mprotect() is just mprotect with an added key argument.

> > b) Whenever there is a hash page fault for that address, the fault
> >handler reads the key from the Linux PTE and programs the key into 
> >the HPTE.
> > 
> > c) Once the HPTE is programmed, the MMU can sense key violations and
> >generate key-faults.
> > 
> > 
> > Since step (b) in this case has easy access to the Linux PTE, and hence
> > to the key, it is fast to access it and program the HPTE. Thus we avoid
> > taking any performance hit on this critical path.
> > 
> > Hope this explains the rationale,
> > 
> > 
> > As promised here is the high level design:
> 
> I will read through that later
> [...]
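
For illustration only: a small user-space simulation of steps (a) to (c) of
the v4/v5 design quoted above, where the key lives in the Linux PTE and is
copied into the HPTE on a hash fault. Every name here (sim_pte, sim_hpte,
sim_pkey_mprotect() and so on) is hypothetical, the denied_mask argument is
only a stand-in for the AMR, and invalidating the hash entry when the key
changes is an assumption of this sketch rather than something stated in the
thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_PAGES 8

struct sim_pte  { bool present; uint8_t pkey; }; /* Linux page table entry */
struct sim_hpte { bool valid;   uint8_t pkey; }; /* hash page table entry  */

static struct sim_pte  linux_pt[NR_PAGES];
static struct sim_hpte hash_pt[NR_PAGES];

/* Step (a): store the key in the Linux PTE of every page in the range. */
static void sim_pkey_mprotect(unsigned int first, unsigned int count,
			      uint8_t pkey)
{
	assert(pkey < 32);
	for (unsigned int i = first; i < first + count && i < NR_PAGES; i++) {
		linux_pt[i].present = true;
		linux_pt[i].pkey = pkey;
		hash_pt[i].valid = false; /* assumption: force a new hash fault */
	}
}

/* Step (b): the hash fault handler copies the key from the PTE to the HPTE.
 * No VMA lookup (and hence no VMA lock) is needed on this path. */
static void sim_hash_fault(unsigned int page)
{
	assert(page < NR_PAGES && linux_pt[page].present);
	hash_pt[page].valid = true;
	hash_pt[page].pkey = linux_pt[page].pkey;
}

/* Step (c): with the HPTE programmed, a key violation can be detected.
 * denied_mask has one bit per key. */
static bool sim_access_ok(unsigned int page, uint32_t denied_mask)
{
	assert(page < NR_PAGES);
	if (!hash_pt[page].valid)
		sim_hash_fault(page);
	return !(denied_mask & (UINT32_C(1) << hash_pt[page].pkey));
}
```

The earlier, dropped design would have read the key from the VMA inside
sim_hash_fault(), which is exactly the lookup the critical path cannot
afford.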


Re: [RFC v5 11/38] mm: introduce an additional vma bit for powerpc pkey

2017-07-12 Thread Benjamin Herrenschmidt
On Wed, 2017-07-12 at 15:23 -0700, Ram Pai wrote:
> Just copying over makes checkpatch.pl unhappy. It exceeds 80 columns.

Which is fine to ignore in a case like that where you remain consistent
with the existing code.

Ben.



Re: [RFC v5 11/38] mm: introduce an additional vma bit for powerpc pkey

2017-07-12 Thread Ram Pai
On Tue, Jul 11, 2017 at 11:10:46AM -0700, Dave Hansen wrote:
> On 07/05/2017 02:21 PM, Ram Pai wrote:
> > Currently there are only 4bits in the vma flags to support 16 keys
> > on x86.  powerpc supports 32 keys, which needs 5bits. This patch
> > introduces an additional bit in the vma flags.
> > 
> > Signed-off-by: Ram Pai 
> > ---
> >  fs/proc/task_mmu.c |6 +-
> >  include/linux/mm.h |   18 +-
> >  2 files changed, 18 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index f0c8b33..2ddc298 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -666,12 +666,16 @@ static void show_smap_vma_flags(struct seq_file *m, 
> > struct vm_area_struct *vma)
> > [ilog2(VM_MERGEABLE)]   = "mg",
> > [ilog2(VM_UFFD_MISSING)]= "um",
> > [ilog2(VM_UFFD_WP)] = "uw",
> > -#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
> > +#ifdef CONFIG_ARCH_HAS_PKEYS
> > /* These come out via ProtectionKey: */
> > [ilog2(VM_PKEY_BIT0)]   = "",
> > [ilog2(VM_PKEY_BIT1)]   = "",
> > [ilog2(VM_PKEY_BIT2)]   = "",
> > [ilog2(VM_PKEY_BIT3)]   = "",
> > +#endif /* CONFIG_ARCH_HAS_PKEYS */
> > +#ifdef CONFIG_PPC64_MEMORY_PROTECTION_KEYS
> > +   /* Additional bit in ProtectionKey: */
> > +   [ilog2(VM_PKEY_BIT4)]   = "",
> >  #endif
> 
> I'd probably just leave the #ifdef out and eat the byte or whatever of
> storage that this costs us on x86.

fine with me.

> 
> > };
> > size_t i;
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 7cb17c6..3d35bcc 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -208,21 +208,29 @@ extern int overcommit_kbytes_handler(struct ctl_table 
> > *, int, void __user *,
> >  #define VM_HIGH_ARCH_BIT_1 33  /* bit only usable on 64-bit 
> > architectures */
> >  #define VM_HIGH_ARCH_BIT_2 34  /* bit only usable on 64-bit 
> > architectures */
> >  #define VM_HIGH_ARCH_BIT_3 35  /* bit only usable on 64-bit 
> > architectures */
> > +#define VM_HIGH_ARCH_BIT_4 36  /* bit only usable on 64-bit arch */
> 
> Please just copy the above lines.

Just copying over makes checkpatch.pl unhappy. It exceeds 80 columns.

> 
> >  #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0)
> >  #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1)
> >  #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2)
> >  #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3)
> > +#define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4)
> >  #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
> >  
> > -#if defined(CONFIG_X86)
> > -# define VM_PAT VM_ARCH_1   /* PAT reserves whole VMA at once (x86) */
> > -#if defined (CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)
> > +#ifdef CONFIG_ARCH_HAS_PKEYS
> >  # define VM_PKEY_SHIFT VM_HIGH_ARCH_BIT_0
> > -# define VM_PKEY_BIT0  VM_HIGH_ARCH_0  /* A protection key is a 4-bit 
> > value */
> > +# define VM_PKEY_BIT0  VM_HIGH_ARCH_0
> >  # define VM_PKEY_BIT1  VM_HIGH_ARCH_1
> >  # define VM_PKEY_BIT2  VM_HIGH_ARCH_2
> >  # define VM_PKEY_BIT3  VM_HIGH_ARCH_3
> > -#endif
> > +#endif /* CONFIG_ARCH_HAS_PKEYS */
> 
> We have the space here, so can we just say that it's 4-bits on x86 and 5
> on ppc?

sure.

> 
> > +#if defined(CONFIG_PPC64_MEMORY_PROTECTION_KEYS)
> > +# define VM_PKEY_BIT4  VM_HIGH_ARCH_4 /* additional key bit used on 
> > ppc64 */
> > +#endif /* CONFIG_PPC64_MEMORY_PROTECTION_KEYS */
> 
> Why bother #ifdef'ing a #define?

ok. 

RP
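
For illustration only: a user-space sketch of how a 4-bit (x86) or 5-bit
(ppc64) protection key can sit in the high vma flag bits discussed above.
VM_HIGH_ARCH_BIT_0 = 32 is inferred from the quoted defines (bit 1 is 33,
bit 4 is 36); the helper names and the PKEY_BITS_* constants are
hypothetical and are not the kernel's own accessors.

```c
#include <assert.h>
#include <stdint.h>

#define VM_HIGH_ARCH_BIT_0 32
#define VM_PKEY_SHIFT      VM_HIGH_ARCH_BIT_0
#define PKEY_BITS_X86      4 /* 16 keys */
#define PKEY_BITS_PPC64    5 /* 32 keys */

/* Mask covering the nbits key bits starting at VM_PKEY_SHIFT. */
static uint64_t pkey_mask(unsigned int nbits)
{
	return ((UINT64_C(1) << nbits) - 1) << VM_PKEY_SHIFT;
}

/* Replace the key field in vm_flags, leaving all other flags untouched. */
static uint64_t vma_set_pkey(uint64_t vm_flags, unsigned int pkey,
			     unsigned int nbits)
{
	assert(pkey < (1u << nbits));
	return (vm_flags & ~pkey_mask(nbits)) |
	       ((uint64_t)pkey << VM_PKEY_SHIFT);
}

/* Extract the key, looking only at the nbits the architecture uses. */
static unsigned int vma_get_pkey(uint64_t vm_flags, unsigned int nbits)
{
	return (unsigned int)((vm_flags & pkey_mask(nbits)) >> VM_PKEY_SHIFT);
}
```

Key 31 needs all five bits; read back through a 4-bit mask it would show up
as 15, which is why the extra bit has to be part of the mask on ppc64.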



Re: [RFC v5 36/38] selftest: PowerPC specific test updates to memory protection keys

2017-07-12 Thread Ram Pai
On Tue, Jul 11, 2017 at 10:33:09AM -0700, Dave Hansen wrote:
> On 07/05/2017 02:22 PM, Ram Pai wrote:
> > Abstracted out the arch specific code into the header file, and
> > added powerpc specific changes.
> > 
> > a) added 4k-backed hpte, memory allocator, powerpc specific.
> > b) added three test case where the key is associated after the page is
> > accessed/allocated/mapped.
> > c) cleaned up the code to make checkpatch.pl happy
> 
> There's a *lot* of churn here.  If it breaks, I'm going to have a heck
> of a time figuring out which hunk broke.  Is there any way to break this
> up into a series of things that we have a chance at bisecting?

Just finished breaking down the changes into 20 gradual increments.
I have pushed it to my github tree at

https://github.com/rampai/memorykeys.git
branch is memkey.v6-rc3

See if it works for you. I am sure I have broken something on
x86, since I don't have an x86 platform to test on.

Let me know, Thanks,
RP



Re: [PATCH net 0/2] minor net kernel-doc fixes

2017-07-12 Thread David Miller
From: Stephen Hemminger 
Date: Wed, 12 Jul 2017 09:29:05 -0700

> Fix a couple of small errors in kernel-doc for networking

Series applied, thanks Stephen.


Re: [v3 2/6] mm, oom: cgroup-aware OOM killer

2017-07-12 Thread David Rientjes
On Wed, 12 Jul 2017, Roman Gushchin wrote:

> > It's a no-op if nobody sets up priorities or the system-wide sysctl is 
> > disabled.  Presumably, as in our model, the Activity Manager sets the 
> > sysctl and is responsible for configuring the priorities if present.  All 
> > memcgs at the sibling level or subcontainer level remain the default if 
> > not defined by the chown'd user, so this falls back to an rss model for 
> > backwards compatibility.
> 
> Hm, this is interesting...
> 
> What I'm thinking about, is that we can introduce the following model:
> each memory cgroup has an integer oom priority value, 0 by default.
> Root cgroup priority is always 0, other cgroups can have both positive
> or negative priorities.
> 

For our purposes we use a range of [0, 10000] for the per-process oom 
priority; 10000 implies the process is not oom killable, 5000 is the 
default.  We use a range of [0, ] for the per-memcg oom priority since 
memcgs cannot disable themselves from oom killing (although they could oom 
disable all attached processes).  We can obviously remap our priorities to 
whatever we decide here, but I think we should give ourselves more room 
and provide 10000 priorities at the minimum (we have 5000 true priorities 
plus overlimit bias).  I'm not sure that negative priorities make sense in 
this model, is there a strong reason to prefer [-5000, 5000] over 
[0, 10000]?

And, yes, the root memcg remains a constant oom priority and is never 
actually checked.

> During OOM victim selection we compare cgroups on each hierarchy level
> based on priority and size, if there are several cgroups with equal priority.
> Per-task oom_score_adj will affect task selection inside a cgroup if
> oom_kill_all_tasks is not set. The special value -1000 will also completely
> protect a task from being killed, but only if oom_kill_all_tasks is not set.
> 

If there are several cgroups of equal priority, we prefer the one that was 
created the most recently just to avoid losing work that has been done for 
a long period of time.  But the key in this proposal is that we _always_ 
continue to iterate the memcg hierarchy until we find a process attached 
to a memcg with the lowest priority relative to sibling cgroups, if any.

To adapt your model to this proposal, memory.oom_kill_all_tasks would only 
be effective if there are no descendant memcgs.  In that case, iteration 
stops anyway and in my model we kill the process with the lowest 
per-process priority.  This could trivially check 
memory.oom_kill_all_tasks and kill everything, and I'm happy to support 
that feature since we have had a need for it in the past as well.
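
For illustration only: a user-space sketch of the memcg walk described
above. At each level the sibling with the lowest priority is chosen, ties go
to the most recently created one, and the walk continues until a memcg
without descendants is reached. struct sim_memcg and both function names are
hypothetical; this is not the code from either proposal.

```c
#include <assert.h>
#include <stddef.h>

struct sim_memcg {
	const char *name;
	unsigned int oom_priority; /* lowest value is selected first */
	unsigned long created;     /* larger means created more recently */
	struct sim_memcg *children[4];
	size_t nr_children;
};

/* Pick the lowest-priority child; on a tie prefer the newest one. */
static struct sim_memcg *pick_child(struct sim_memcg *parent)
{
	struct sim_memcg *victim = NULL;

	for (size_t i = 0; i < parent->nr_children; i++) {
		struct sim_memcg *c = parent->children[i];

		if (!victim ||
		    c->oom_priority < victim->oom_priority ||
		    (c->oom_priority == victim->oom_priority &&
		     c->created > victim->created))
			victim = c;
	}
	return victim;
}

/* Always descend until a memcg with no descendants is found. The root's
 * own priority is never compared against anything. */
static struct sim_memcg *select_victim_memcg(struct sim_memcg *root)
{
	struct sim_memcg *cur = root;

	assert(root);
	while (cur->nr_children > 0)
		cur = pick_child(cur);
	return cur;
}
```

The leaf returned here is where the per-process priority comparison, or a
memory.oom_kill_all_tasks check, would take over.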

We should talk about when this priority-based scoring becomes effective.  
We enable it by default in our kernel, but it could be guarded with a VM 
sysctl if necessary to enact a system-wide policy.


Re: [PATCH 3/3] rtmutex: remove unnecessary adjust prio

2017-07-12 Thread Peter Zijlstra
On Wed, Jul 12, 2017 at 10:14:49AM -0400, Steven Rostedt wrote:
> On Tue, 11 Jul 2017 22:39:24 +0800
> Alex Shi  wrote:
> 
> > Any comments for this little change? It's passed on 0day testing.
> 
> I think the problem was that this was a third patch after two
> documentation patches, and people put documentation review at the
> bottom of their priority list.
> 
> This should have been sent as a separate patch on its own.

My problem was the sparse changelog, which forces me to think hard and
thus it landed on the 'later' queue, which moves at glacial speeds.


[PATCH net 1/2] socket: add documentation for missing elements

2017-07-12 Thread Stephen Hemminger
Fill in missing kernel-doc for missing elements in struct sock.

Signed-off-by: Stephen Hemminger 
---
 include/net/sock.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 8c85791fc196..f69c8c2782df 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -246,6 +246,7 @@ struct sock_common {
   *@sk_policy: flow policy
   *@sk_receive_queue: incoming packets
   *@sk_wmem_alloc: transmit queue bytes committed
+  *@sk_tsq_flags: TCP Small Queues flags
   *@sk_write_queue: Packet sending queue
   *@sk_omem_alloc: "o" is "option" or "other"
   *@sk_wmem_queued: persistent queue size
@@ -257,6 +258,7 @@ struct sock_common {
   *@sk_pacing_status: Pacing status (requested, handled by sch_fq)
   *@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
   *@sk_sndbuf: size of send buffer in bytes
+  *@__sk_flags_offset: empty field used to determine location of bitfield
   *@sk_padding: unused element for alignment
   *@sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
   *@sk_no_check_rx: allow zero checksum in RX packets
@@ -277,6 +279,7 @@ struct sock_common {
   *@sk_drops: raw/udp drops counter
   *@sk_ack_backlog: current listen backlog
   *@sk_max_ack_backlog: listen backlog set in listen()
+  *@sk_uid: user id of owner
   *@sk_priority: %SO_PRIORITY setting
   *@sk_type: socket type (%SOCK_STREAM, etc)
   *@sk_protocol: which protocol this socket belongs in this network family
-- 
2.11.0



[PATCH net 2/2] datagram: fix kernel-doc comments

2017-07-12 Thread Stephen Hemminger
An underscore in the kernel-doc comment section has special meaning
and misuse generates errors.

./net/core/datagram.c:207: ERROR: Unknown target name: "msg".
./net/core/datagram.c:379: ERROR: Unknown target name: "msg".
./net/core/datagram.c:816: ERROR: Unknown target name: "t".
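
For illustration only: in reStructuredText a word ending in an underscore
is a hyperlink reference, so "MSG_" is looked up as a target named "msg" and
"_don't_" ends in a reference to "t". A kernel-doc comment written to avoid
both could look like the sketch below; the function and the flag are
hypothetical and do not exist in net/core/datagram.c.

```c
#include <assert.h>

#define DEMO_MSG_PEEK 0x2 /* hypothetical flag, for this example only */

/**
 * demo_recv_flags_valid - Check a receive flags word
 * @flags: MSG\_ flags
 *
 * The trailing underscore is escaped so that it is not parsed as a
 * reference, and emphasis uses asterisks, as in: this helper does
 * *not* dequeue anything.
 *
 * Return: 1 if only known flags are set, 0 otherwise.
 */
static int demo_recv_flags_valid(unsigned int flags)
{
	return (flags & ~(unsigned int)DEMO_MSG_PEEK) == 0;
}
```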

Signed-off-by: Stephen Hemminger 
---
 net/core/datagram.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 6877c43cc92d..ee5647bd91b3 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -203,7 +203,7 @@ struct sk_buff *__skb_try_recv_from_queue(struct sock *sk,
 /**
  * __skb_try_recv_datagram - Receive a datagram skbuff
  * @sk: socket
- * @flags: MSG_ flags
+ * @flags: MSG\_ flags
  * @destructor: invoked under the receive lock on successful dequeue
  * @peeked: returns non-zero if this packet has been seen before
  * @off: an offset in bytes to peek skb from. Returns an offset
@@ -375,7 +375,7 @@ EXPORT_SYMBOL(__sk_queue_drop_skb);
  * skb_kill_datagram - Free a datagram skbuff forcibly
  * @sk: socket
  * @skb: datagram skbuff
- * @flags: MSG_ flags
+ * @flags: MSG\_ flags
  *
  * This function frees a datagram skbuff that was received by
  * skb_recv_datagram.  The flags argument must match the one
@@ -809,7 +809,7 @@ EXPORT_SYMBOL(skb_copy_and_csum_datagram_msg);
  * sequenced packet sockets providing the socket receive queue
  * is only ever holding data ready to receive.
  *
- * Note: when you _don't_ use this routine for this protocol,
+ * Note: when you *don't* use this routine for this protocol,
  * and you use a different write policy from sock_writeable()
  * then please supply your own write_space callback.
  */
-- 
2.11.0



[PATCH net 0/2] minor net kernel-doc fixes

2017-07-12 Thread Stephen Hemminger
Fix a couple of small errors in kernel-doc for networking

Stephen Hemminger (2):
  socket: add documentation for missing elements
  datagram: fix kernel-doc comments

 include/net/sock.h  | 3 +++
 net/core/datagram.c | 6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

-- 
2.11.0



Re: [PATCH 3/3] rtmutex: remove unnecessary adjust prio

2017-07-12 Thread Steven Rostedt
On Tue, 11 Jul 2017 22:39:24 +0800
Alex Shi  wrote:

> Any comments for this little change? It's passed on 0day testing.

I think the problem was that this was a third patch after two
documentation patches, and people put documentation review at the
bottom of their priority list.

This should have been sent as a separate patch on its own.

> 
> Thanks
> Alex
> 
> On 07/07/2017 10:52 AM, Alex Shi wrote:
> > We don't need to adjust prio before adding a new pi_waiter. The prio
> > only needs updating after a pi_waiter change or a task priority change.
> > 
> > Signed-off-by: Alex Shi 
> > Cc: Steven Rostedt 
> > Cc: Sebastian Siewior 
> > Cc: Mathieu Poirier 
> > Cc: Juri Lelli 
> > Cc: Thomas Gleixner 
> > To: linux-ker...@vger.kernel.org
> > To: Ingo Molnar 
> > To: Peter Zijlstra 
> > ---
> >  kernel/locking/rtmutex.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/kernel/locking/rtmutex.c b/kernel/locking/rtmutex.c
> > index 28cd09e..d1fe41f 100644
> > --- a/kernel/locking/rtmutex.c
> > +++ b/kernel/locking/rtmutex.c
> > @@ -963,7 +963,6 @@ static int task_blocks_on_rt_mutex(struct rt_mutex 
> > *lock,
> > return -EDEADLK;
> >  
> > raw_spin_lock(&task->pi_lock);
> > -   rt_mutex_adjust_prio(task);

Interesting, I did some git mining and this was added with the original
entry of the rtmutex.c (23f78d4a0). Looking at even that version, I
don't see the purpose of adjusting the task prio here. It is done
before anything changes in the task.

Reviewed-by: Steven Rostedt (VMware) 

-- Steve


> > waiter->task = task;
> > waiter->lock = lock;
> > waiter->prio = task->prio;
> >   



[PATCH 1/2] tee.txt: standardize document format

2017-07-12 Thread Mauro Carvalho Chehab
Each text file under Documentation follows a different format. Some
don't even have titles!

Change its representation to follow the adopted standard,
using ReST markups for it to be parseable by Sphinx:

- adjust indentation of titles;
- mark ascii artwork as a literal block;
- adjust references.

Signed-off-by: Mauro Carvalho Chehab 
---
 Documentation/tee.txt | 51 ++-
 1 file changed, 30 insertions(+), 21 deletions(-)

diff --git a/Documentation/tee.txt b/Documentation/tee.txt
index 718599357596..56ea85ffebf2 100644
--- a/Documentation/tee.txt
+++ b/Documentation/tee.txt
@@ -1,4 +1,7 @@
+=============
 TEE subsystem
+=============
+
 This document describes the TEE subsystem in Linux.
 
 A TEE (Trusted Execution Environment) is a trusted OS running in some
@@ -80,27 +83,27 @@ The GlobalPlatform TEE Client API [5] is implemented on top 
of the generic
 TEE API.
 
 Picture of the relationship between the different components in the
-OP-TEE architecture.
+OP-TEE architecture::
 
-User space  Kernel   Secure world
-~~  ~~   
- ++ +-+
- | Client | | Trusted |
- ++ | Application |
-/\  +-+
-|| +--+   /\
-|| |tee-  |   ||
-|| |supplicant|   \/
-|| +--+ +-+
-\/  /\  | TEE Internal|
- +---+  ||  | API |
- + TEE   |  ||+++   +-+
- | Client|  ||| TEE| OP-TEE |   | OP-TEE  |
- | API   |  \/| subsys | driver |   | Trusted OS  |
- +---+++---++---+-+
- |  Generic TEE API|   | OP-TEE MSG   |
- |  IOCTL (TEE_IOC_*)  |   | SMCCC (OPTEE_SMC_CALL_*) |
- +-+   +--+
+  User space  Kernel   Secure world
+  ~~  ~~   
+   ++ +-+
+   | Client | | Trusted |
+   ++ | Application |
+  /\  +-+
+  || +--+   /\
+  || |tee-  |   ||
+  || |supplicant|   \/
+  || +--+ +-+
+  \/  /\  | TEE Internal|
+   +---+  ||  | API |
+   + TEE   |  ||+++   +-+
+   | Client|  ||| TEE| OP-TEE |   | OP-TEE  |
+   | API   |  \/| subsys | driver |   | Trusted OS  |
+   +---+++---++---+-+
+   |  Generic TEE API|   | OP-TEE MSG   |
+   |  IOCTL (TEE_IOC_*)  |   | SMCCC (OPTEE_SMC_CALL_*) |
+   +-+   +--+
 
 RPC (Remote Procedure Call) are requests from secure world to kernel driver
 or tee-supplicant. An RPC is identified by a special range of SMCCC return
@@ -109,10 +112,16 @@ kernel are handled by the kernel driver. Other RPC 
messages will be forwarded to
 tee-supplicant without further involvement of the driver, except switching
 shared memory buffer representation.
 
-References:
+References
+==========
+
 [1] https://github.com/OP-TEE/optee_os
+
 [2] http://infocenter.arm.com/help/topic/com.arm.doc.den0028a/index.html
+
 [3] drivers/tee/optee/optee_smc.h
+
 [4] drivers/tee/optee/optee_msg.h
+
 [5] http://www.globalplatform.org/specificationsdevice.asp look for
 "TEE Client API Specification v1.0" and click download.
-- 
2.13.0



[PATCH 2/2] kernel-doc-nano-HOWTO.txt: standardize document format

2017-07-12 Thread Mauro Carvalho Chehab
Each text file under Documentation follows a different format. Some
don't even have titles!

Change its representation to follow the adopted standard,
using ReST markups for it to be parseable by Sphinx:

- adjust titles;
- use :Author: for authorship;
- mark literal blocks and adjust indentation;
- escape literals.

Signed-off-by: Mauro Carvalho Chehab 
---
 Documentation/kernel-doc-nano-HOWTO.txt | 209 
 1 file changed, 106 insertions(+), 103 deletions(-)

diff --git a/Documentation/kernel-doc-nano-HOWTO.txt 
b/Documentation/kernel-doc-nano-HOWTO.txt
index c23e2c5ab80d..b1093a9a7e7d 100644
--- a/Documentation/kernel-doc-nano-HOWTO.txt
+++ b/Documentation/kernel-doc-nano-HOWTO.txt
@@ -1,9 +1,14 @@
-NOTE: this document is outdated and will eventually be removed.  See
-Documentation/doc-guide/ for current information.
-
+=====================
 kernel-doc nano-HOWTO
 =====================
 
+:Author: Tim 
+
+.. note::
+
+   This document is outdated and will eventually be removed.  See
+   Documentation/doc-guide/ for current information.
+
 How to format kernel-doc comments
 ---------------------------------
 
@@ -41,35 +46,35 @@ discretion of the MAINTAINER of that kernel source file.
 Data structures visible in kernel include files should also be
 documented using kernel-doc formatted comments.
 
-The opening comment mark "/**" is reserved for kernel-doc comments.
+The opening comment mark ``/**`` is reserved for kernel-doc comments.
 Only comments so marked will be considered by the kernel-doc scripts,
 and any comment so marked must be in kernel-doc format.  Do not use
-"/**" to be begin a comment block unless the comment block contains
+``/**`` to be begin a comment block unless the comment block contains
 kernel-doc formatted comments.  The closing comment marker for
-kernel-doc comments can be either "*/" or "**/", but "*/" is
+kernel-doc comments can be either ``*/`` or ``**/``, but ``*/`` is
 preferred in the Linux kernel tree.
 
 Kernel-doc comments should be placed just before the function
 or data structure being described.
 
-Example kernel-doc function comment:
+Example kernel-doc function comment::
 
-/**
- * foobar() - short function description of foobar
- * @arg1:  Describe the first argument to foobar.
- * @arg2:  Describe the second argument to foobar.
- * One can provide multiple line descriptions
- * for arguments.
- *
- * A longer description, with more discussion of the function foobar()
- * that might be useful to those using or modifying it.  Begins with
- * empty comment line, and may include additional embedded empty
- * comment lines.
- *
- * The longer description can have multiple paragraphs.
- *
- * Return: Describe the return value of foobar.
- */
+ /**
+  * foobar() - short function description of foobar
+  * @arg1: Describe the first argument to foobar.
+  * @arg2: Describe the second argument to foobar.
+  *        One can provide multiple line descriptions
+  *        for arguments.
+  *
+  * A longer description, with more discussion of the function foobar()
+  * that might be useful to those using or modifying it.  Begins with
+  * empty comment line, and may include additional embedded empty
+  * comment lines.
+  *
+  * The longer description can have multiple paragraphs.
+  *
+  * Return: Describe the return value of foobar.
+  */
 
 The short description following the subject can span multiple lines
 and ends with an @argument description, an empty line or the end of
@@ -80,22 +85,23 @@ this opening short function description line, with no intervening
 empty comment lines.
 
 If a function parameter is "..." (varargs), it should be listed in
-kernel-doc notation as:
+kernel-doc notation as::
+
  * @...: description
 
 The return value, if any, should be described in a dedicated section
 named "Return".
 
-Example kernel-doc data structure comment.
+Example kernel-doc data structure comment::
 
-/**
- * struct blah - the basic blah structure
- * @mem1:  describe the first member of struct blah
- * @mem2:  describe the second member of struct blah,
- * perhaps with more lines and words.
- *
- * Longer description of this structure.
- */
+ /**
+  * struct blah - the basic blah structure
+  * @mem1: describe the first member of struct blah
+  * @mem2: describe the second member of struct blah,
+  *        perhaps with more lines and words.
+  *
+  * Longer description of this structure.
+  */
 
 The kernel-doc function comments describe each parameter to the
 function, in order, with the @name lines.
@@ -150,64 +156,64 @@ If you just want to read the ready-made books on the various
 subsystems, just type 'make epubdocs', or 'make pdfdocs', or 'make htmldocs',
 depending on your preference.  If you would rather read a different format,
 you can type 'make xmldocs' and then use DocBook tools to convert
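
The comment layout described in this HOWTO can be illustrated on a small
varargs function. This is only a sketch: sum_ints() is a hypothetical helper
invented for the example and does not exist in the kernel tree. It shows the
short description line, the @name parameter lines, the "@...:" notation for
varargs and the dedicated "Return" section.

```c
#include <assert.h>
#include <stdarg.h>

/**
 * sum_ints() - add up a variable number of integers
 * @count: Number of integers that follow. Must not be negative.
 * @...: The integers to add.
 *
 * Hypothetical helper, used here only to show the kernel-doc layout.
 * The longer description starts after an empty comment line.
 *
 * Return: The sum of the @count integers, or 0 when @count is 0.
 */
static int sum_ints(int count, ...)
{
	va_list ap;
	int total = 0;
	int i;

	assert(count >= 0);

	va_start(ap, count);
	for (i = 0; i < count; i++)
		total += va_arg(ap, int);
	va_end(ap);

	return total;
}
```

A call such as sum_ints(3, 1, 2, 3) adds the three trailing arguments.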

Re: [v3 2/6] mm, oom: cgroup-aware OOM killer

2017-07-12 Thread Roman Gushchin
On Tue, Jul 11, 2017 at 01:56:30PM -0700, David Rientjes wrote:
> On Tue, 11 Jul 2017, Roman Gushchin wrote:
> 
> > > Yes, the original motivation was to limit killing to a single process, if 
> > > possible.  To do that, we kill the process with the largest rss to free 
> > > the most memory and rely on the user to configure /proc/pid/oom_score_adj 
> > > if something else should be prioritized.
> > > 
> > > With containerization and overcommit of system memory, we concur that 
> > > killing the single largest process isn't always preferable and neglects 
> > > the priority of its memcg.  Your motivation seems to be to provide 
> > > fairness between one memcg with a large process and one memcg with a
> > > large number of small processes; I'm curious if you are concerned about
> > > the priority of a memcg hierarchy (how important that "job" is) or
> > > whether you are strictly concerned with "largeness" of memcgs relative
> > > to each other.
> > 
> > I'm pretty sure we should provide some way to prioritize some cgroups
> > over others (in terms of oom killer preferences), but I'm not 100% sure
> > yet what's the best way to do it. I've suggested something similar to the
> > existing oom_score_adj for tasks, mostly to follow the existing design.
> > 
> > One of the questions to answer in priority-based model is
> > how to compare tasks in the root cgroup with cgroups?
> > 
> 
> We do this with an alternate scoring mechanism, that is purely priority 
> based and tiebreaks based on largest rss.  An additional tunable is added 
> for each process, under /proc/pid, and also to the memcg hierarchy, and is 
> enabled via a system-wide sysctl.  A way to mesh the two scoring 
> mechanisms together would be helpful, but for our purposes we don't use 
> oom_score_adj at all, other than converting OOM_SCORE_ADJ_MIN to still be 
> oom disabled when written by third party apps.
> 
> For memcg oom conditions, iteration of the hierarchy begins at the oom 
> memcg.  For system oom conditions, this is the root memcg.
> 
> All processes attached to the oom memcg have their priority based value 
> and this is compared to all child memcg's priority value at that level.  
> If a process has the lowest priority, it is killed and we're done; we 
> could implement a "kill all" mechanism for this memcg that is checked 
> before the process is killed.
> 
> If a memcg has the lowest priority compared to attached processes, it is 
> iterated as well, and so on throughout the memcg hierarchy until we find 
> the lowest priority process in the lowest priority leaf memcg.  This way, 
> we can fully control which process is killed for both system and memcg oom 
> conditions.  I can easily post patches for this, we have used it for 
> years.
> 
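
The selection walk quoted above could be sketched roughly as follows. This is
a minimal user-space model, not the patches mentioned in the thread: the
struct names, the "lower value is killed first" convention and the choice to
prefer a task over a child memcg of equal priority are all assumptions made
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified types; these are not kernel structures. */
struct task {
	int prio;		/* assumption: lower value is killed first */
	unsigned long rss;	/* tiebreak: larger rss is preferred */
};

struct memcg {
	int prio;
	struct task *tasks;
	size_t nr_tasks;
	struct memcg **children;
	size_t nr_children;
};

/*
 * Start at the OOM memcg. At each level compare attached tasks with child
 * memcgs by priority. If a task wins it is the victim; if a child memcg
 * wins, descend into it and repeat.
 */
static const struct task *select_victim(const struct memcg *cg)
{
	while (cg) {
		const struct task *best_task = NULL;
		const struct memcg *best_child = NULL;
		size_t i;

		assert(cg->nr_tasks == 0 || cg->tasks != NULL);
		assert(cg->nr_children == 0 || cg->children != NULL);

		for (i = 0; i < cg->nr_tasks; i++) {
			const struct task *t = &cg->tasks[i];

			if (!best_task || t->prio < best_task->prio ||
			    (t->prio == best_task->prio && t->rss > best_task->rss))
				best_task = t;
		}
		for (i = 0; i < cg->nr_children; i++) {
			const struct memcg *c = cg->children[i];

			if (!best_child || c->prio < best_child->prio)
				best_child = c;
		}

		/* A task is chosen unless a child memcg has strictly lower priority. */
		if (best_task && (!best_child || best_task->prio <= best_child->prio))
			return best_task;

		cg = best_child;	/* NULL when nothing is left below this level */
	}
	return NULL;
}
```

In this model an empty lowest-priority memcg yields no victim; a real
implementation would need a fallback for that case.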
> > > These are two different things, right?  We can adjust how the system oom
> > > killer chooses victims when memcg hierarchies overcommit the system to
> > > not strictly prefer the single process with the largest rss without
> > > killing everything attached to the memcg.
> > 
> > They are different, and I thought about providing two independent knobs.
> > But after all I haven't found enough real life examples, where it can be
> > useful. Can you provide something here?
> > 
> 
> Yes, we have users who we chown their memcg hierarchy to and have full 
> control over setting up their hierarchy however we want.  Our "Activity 
> Manager", using Documentation/cgroup-v1/memory.txt terminology, only is 
> aware of the top level memcg that was chown'd to the user.  That user runs 
> a series of batch jobs that are submitted to it and each job is 
> represented as a subcontainer to enforce strict limits on the amount of 
> memory that job can use.  When it becomes oom, we have found that it is 
> preferable to oom kill the entire batch job rather than leave it in an 
> inconsistent state, so enabling such a knob here would be helpful.
> 
> Other top-level jobs are fine with individual processes being oom killed.  
> It can be a low priority process for which they have full control over 
> defining the priority through the new per-process and per-memcg value 
> described above.  Easy example is scraping logs periodically or other 
> best-effort tasks like cleanup.  They can happily be oom killed and 
> rescheduled without taking down the entire first-class job.
> 
> > Also, they are different only for non-leaf cgroups; leaf cgroups
> > are always treated as indivisible memory consumers during victim selection.
> > 
> > I assume, that containerized systems will always set oom_kill_all_tasks for
> > top-level container memory cgroups. By default it's turned off
> > to provide backward compatibility with current behavior and avoid
> > excessive kills and support oom_score_adj==-1000 (I've added this to v4,
> > will post soon).
> > 
> 
> We certainly would not be enabling it for top-level memcgs, there would be 
> no way that we could because we have best-effort 

[PATCH] batman-adv: Convert batman-adv.txt to reStructuredText

2017-07-12 Thread Sven Eckelmann
Converting the freeform text to parsable reStructuredText allows the
integration in the sphinx based documentation system of the kernel. It will
therefore be accessible as hypertext under
https://www.kernel.org/doc/html/latest/

Signed-off-by: Sven Eckelmann 
---
 Documentation/networking/00-INDEX   |   2 -
 Documentation/networking/batman-adv.rst | 220 
 Documentation/networking/batman-adv.txt | 215 ---
 Documentation/networking/index.rst  |   1 +
 MAINTAINERS |   2 +-
 5 files changed, 222 insertions(+), 218 deletions(-)
 create mode 100644 Documentation/networking/batman-adv.rst
 delete mode 100644 Documentation/networking/batman-adv.txt

diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
index c6beb5f1637f..7a79b3587dd3 100644
--- a/Documentation/networking/00-INDEX
+++ b/Documentation/networking/00-INDEX
@@ -30,8 +30,6 @@ atm.txt
- info on where to get ATM programs and support for Linux.
 ax25.txt
- info on using AX.25 and NET/ROM code for Linux
-batman-adv.txt
-   - B.A.T.M.A.N routing protocol on top of layer 2 Ethernet Frames.
 baycom.txt
- info on the driver for Baycom style amateur radio modems
 bonding.txt
diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst
new file mode 100644
index 000000000000..a342b2cc3dc6
--- /dev/null
+++ b/Documentation/networking/batman-adv.rst
@@ -0,0 +1,220 @@
+==========
+batman-adv
+==========
+
+Batman advanced is a new approach to wireless networking which does no longer
+operate on the IP basis. Unlike the batman daemon, which exchanges information
+using UDP packets and sets routing tables, batman-advanced operates on ISO/OSI
+Layer 2 only and uses and routes (or better: bridges) Ethernet Frames. It
+emulates a virtual network switch of all nodes participating. Therefore all
+nodes appear to be link local, thus all higher operating protocols won't be
+affected by any changes within the network. You can run almost any protocol
+above batman advanced, prominent examples are: IPv4, IPv6, DHCP, IPX.
+
+Batman advanced was implemented as a Linux kernel driver to reduce the overhead
+to a minimum. It does not depend on any (other) network driver, and can be used
+on wifi as well as ethernet lan, vpn, etc ... (anything with ethernet-style
+layer 2).
+
+
+Configuration
+=============
+
+Load the batman-adv module into your kernel::
+
+  $ insmod batman-adv.ko
+
+The module is now waiting for activation. You must add some interfaces on which
+batman can operate. After loading the module batman advanced will scan your
+systems interfaces to search for compatible interfaces. Once found, it will
+create subfolders in the ``/sys`` directories of each supported interface,
+e.g.::
+
+  $ ls /sys/class/net/eth0/batman_adv/
+  elp_interval iface_status mesh_iface throughput_override
+
+If an interface does not have the ``batman_adv`` subfolder, it probably is not
+supported. Not supported interfaces are: loopback, non-ethernet and batman's
+own interfaces.
+
+Note: After the module was loaded it will continuously watch for new
+interfaces to verify the compatibility. There is no need to reload the module
+if you plug your USB wifi adapter into your machine after batman advanced was
+initially loaded.
+
+The batman-adv soft-interface can be created using the iproute2 tool ``ip``::
+
+  $ ip link add name bat0 type batadv
+
+To activate a given interface simply attach it to the ``bat0`` interface::
+
+  $ ip link set dev eth0 master bat0
+
+Repeat this step for all interfaces you wish to add. Now batman starts
+using/broadcasting on this/these interface(s).
+
+By reading the "iface_status" file you can check its status::
+
+  $ cat /sys/class/net/eth0/batman_adv/iface_status
+  active
+
+To deactivate an interface you have to detach it from the "bat0" interface::
+
+  $ ip link set dev eth0 nomaster
+
+
+All mesh wide settings can be found in batman's own interface folder::
+
+  $ ls /sys/class/net/bat0/mesh/
+  aggregated_ogms        fragmentation  isolation_mark  routing_algo
+  ap_isolation           gw_bandwidth   log_level       vlan0
+  bonding                gw_mode        multicast_mode
+  bridge_loop_avoidance  gw_sel_class   network_coding
+  distributed_arp_table  hop_penalty    orig_interval
+
+There is a special folder for debugging information::
+
+  $ ls /sys/kernel/debug/batman_adv/bat0/
+  bla_backbone_table  log          neighbors          transtable_local
+  bla_claim_table     mcast_flags  originators
+  dat_cache           nc           socket
+  gateways            nc_nodes     transtable_global
+
+Some of the files contain all sort of status information regarding the mesh
+network. For example, you can view the table of originators (mesh
+participants) with::
+
+  $ cat /sys/kernel/debug/batman_adv/bat0/originators
+
+Other files allow to change batman's 

[PATCH 2/2] Documentation: admin-guide: remove redundant first paragraph of README.rst

2017-07-12 Thread Martin Kepplinger
"These are the release notes for Linux version 4." is what the header above
says. There's no real need to say that again.

"Read them carefully," Why would we advise people to read one file carefully
but not the others?

"they tell you what this is all about, explain how to install the kernel, and
what to do if something goes wrong." That's just the chapters below. With
rendered docs, the reader doesn't even have to scroll down to have this
sentence's information :)

Therefore let's remove the first paragraph, including the "What is Linux?"
subheader, and start with telling what Linux is. This won't come as a surprise
to the reader.

Signed-off-by: Martin Kepplinger 
---
 Documentation/admin-guide/README.rst | 7 ---
 1 file changed, 7 deletions(-)

diff --git a/Documentation/admin-guide/README.rst b/Documentation/admin-guide/README.rst
index b5343c5aa224..4810ae94dce1 100644
--- a/Documentation/admin-guide/README.rst
+++ b/Documentation/admin-guide/README.rst
@@ -1,13 +1,6 @@
 Linux kernel release 4.x 
 ========================
 
-These are the release notes for Linux version 4.  Read them carefully,
-as they tell you what this is all about, explain how to install the
-kernel, and what to do if something goes wrong.
-
-What is Linux?
---------------
-
   Linux is a clone of the operating system Unix, written from scratch by
   Linus Torvalds with assistance from a loosely-knit team of hackers across
   the Net. It aims towards POSIX and Single UNIX Specification compliance.
-- 
2.11.0



[PATCH 1/2] README: revise the top level readme texts

2017-07-12 Thread Martin Kepplinger
This improves the top level README situation a little: Instead of starting
with historical information like "This file was moved to..." we add a short
introductory description and point the reader to the documentation in a direct
way, avoiding phrases like "Please notice that there are...".

Signed-off-by: Martin Kepplinger 
---
 README | 26 ++
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/README b/README
index b2ba4aaa3a71..b281213e7c6f 100644
--- a/README
+++ b/README
@@ -1,17 +1,27 @@
 Linux kernel
 
 
-This file was moved to Documentation/admin-guide/README.rst
+Linux is a computer operating system kernel first released by Linus Torvalds
+in September of 1991. It runs on a wide variety of hardware architectures and
+has all expected features like multitasking, memory management, multistack
+networking and so on.
 
-Please notice that there are several guides for kernel developers and users.
-These guides can be rendered in a number of formats, like HTML and PDF.
+Device drivers are an integral part too, supporting the use of countless
+devices. Documentation/process/howto.rst describes how to include a new one.
 
-In order to build the documentation, use ``make htmldocs`` or
-``make pdfdocs``.
+Documentation
+-------------
 
-There are various text files in the Documentation/ subdirectory,
-several of them using the Restructured Text markup notation.
-See Documentation/00-INDEX for a list of what is contained in each file.
+Linux is documented in the Documentation/ subdirectory. Several of the text
+files use the Restructured Text markup notation.
+
+Documentation/admin-guide/README.rst may be what you are looking for and
+Documentation/00-INDEX has a list of what is contained in each file.
+
+The included guides for kernel developers and users can be rendered in a number
+of formats, like HTML and PDF.
+
+In order to build the documentation, use ``make htmldocs`` or ``make pdfdocs``.
 
 Please read the Documentation/process/changes.rst file, as it contains the
 requirements for building and running the kernel, and information about
-- 
2.11.0



Re: [RFC v5 00/38] powerpc: Memory Protection Keys

2017-07-12 Thread Michal Hocko
On Wed 12-07-17 09:23:37, Michal Hocko wrote:
> On Tue 11-07-17 12:32:57, Ram Pai wrote:
[...]
> > Ideally the MMU looks at the PTE for keys, in order to enforce
> > protection. This is the case with x86 and is the case with power9 Radix
> > page table. Hence the keys have to be programmed into the PTE.
> 
> But x86 doesn't update ptes for PKEYs, that would be just too expensive.
> You could use standard mprotect to do the same...

OK, this seems to be a misunderstanding and confusion on my end.
do_mprotect_pkey does mprotect_fixup even for the pkey path which is
quite surprising to me. I guess my misunderstanding comes from
Documentation/x86/protection-keys.txt
"
Memory Protection Keys provides a mechanism for enforcing page-based
protections, but without requiring modification of the page tables
when an application changes protection domains.  It works by
dedicating 4 previously ignored bits in each page table entry to a
"protection key", giving 16 possible keys.
"

So please disregard my previous comments about page tables and sorry
about the confusion.
-- 
Michal Hocko
SUSE Labs


Re: [RFC v5 00/38] powerpc: Memory Protection Keys

2017-07-12 Thread Michal Hocko
On Tue 11-07-17 12:32:57, Ram Pai wrote:
> On Tue, Jul 11, 2017 at 04:52:46PM +0200, Michal Hocko wrote:
> > On Wed 05-07-17 14:21:37, Ram Pai wrote:
> > > Memory protection keys enable applications to protect its
> > > address space from inadvertent access or corruption from
> > > itself.
> > > 
> > > The overall idea:
> > > 
> > >  A process allocates a key and associates it with
> > >  an address range within its address space.
> > >  The process then can dynamically set read/write
> > >  permissions on the key without involving the
> > >  kernel. Any code that violates the permissions
> > >  of the address space, as defined by its associated
> > >  key, will receive a segmentation fault.
> > > 
> > > This patch series enables the feature on PPC64 HPTE
> > > platform.
> > > 
> > > ISA3.0 section 5.7.13 describes the detailed specifications.
> > 
> > Could you describe the high-level design of this feature in the cover
> > letter?
> 
> Yes it can be hard to understand without the big picture.  I will
> provide the high level design and the rationale behind the patch split
> towards the end.  Also I will have it in the cover letter for my next
> revision of the patchset.

Thanks!
 
> > I have tried to get some idea from the patchset but it was
> > really far from trivial. Patches are not very well split up (many
> > helpers are added without their users etc..). 
> 
> I see your point. Earlier, I had the patches split such a way that the
> users of the helpers were in the same patch as that of the helper.
> But then comments from others led to the current split.

It is not my call here, obviously. I cannot review arch specific parts
due to lack of familiarity but it is a general good practice to include
helpers along with their users to make the usage clear. Also, as much as
I like small patches because they are easier to review, having very many
of them can lead to a harder review in the end because you easily lose
a higher level overview.

> > > Testing:
> > >   This patch series has passed all the protection key
> > >   tests available in  the selftests directory.
> > >   The tests are updated to work on both x86 and powerpc.
> > > 
> > > version v5:
> > >   (1) reverted back to the old design -- store the 
> > >   key in the pte, instead of bypassing it.
> > >   The v4 design slowed down the hash page path.
> > 
> > This surprised me a lot but I couldn't find the respective code. Why do
> > you need to store anything in the pte? My understanding of PKEYs is that
> > the setup and teardown should be very cheap and so no page tables have
> > > to be updated. Or do I just misunderstand what you wrote here?
> 
> Ideally the MMU looks at the PTE for keys, in order to enforce
> protection. This is the case with x86 and is the case with power9 Radix
> page table. Hence the keys have to be programmed into the PTE.

But x86 doesn't update ptes for PKEYs, that would be just too expensive.
You could use standard mprotect to do the same...
 
> However with HPT on power, these keys do not necessarily have to be
> programmed into the PTE. We could bypass the Linux Page Table Entry(PTE)
> and instead just program them into the Hash Page Table(HPTE), since
> the MMU does not refer to the PTE but to the HPTE. The last version
> of the page attempted to do that.   It worked as follows:
> 
> a) when an address range is requested to be associated with a key, by the
>    application through key_mprotect() system call, the kernel
>    stores that key in the vmas corresponding to that address
>    range.
> 
> b) Whenever there is a hash page fault for that address, the fault
>handler reads the key from the VMA and programs the key into the
>HPTE. __hash_page() is the function that does that.

What causes the fault here?

> c) Once the hpte is programmed, the MMU can sense key violations and
>generate key-faults.
> 
> The problem is with step (b).  This step is really a very critical
> path which is performance sensitive. We don't want to add any delays.
> However if we want to access the key from the vma, we will have to
> hold the vma semaphore, and that is a big NO-NO. As a result, this
> design had to be dropped.
> 
> 
> 
> I reverted back to the old design i.e the design in v4 version. In this
> version we do the following:
> 
> a) when an address range is requested to be associated with a key, by the
>    application through key_mprotect() system call, the kernel
>    stores that key in the vmas corresponding to that address
>    range. Also the kernel programs the key into the Linux PTE
>    corresponding to all the pages associated with the address range.

OK, so how is this any different from the regular mprotect then?

> b) Whenever there is a hash page fault for that address, the fault
>handler reads the key from the Linux PTE and programs the key into 
>the HPTE.
> 
> c) Once the HPTE is programmed, the MMU can sense key violations and
>generate key-faults.
> 
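
Steps (a) to (c) of the PTE-based design quoted above can be modelled in a few
lines of user-space C. This is a toy sketch only: struct toy_mm, the function
names, the single-VMA address space and the "denied keys" bitmask are all
assumptions for illustration and do not correspond to the powerpc code in the
series. Invalidating the stale HPTE in step (a) is likewise an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_PAGES 8	/* size of the toy address space (assumption) */

/* Toy model: one VMA covering NR_PAGES pages. All names are hypothetical. */
struct toy_mm {
	int vma_key;			/* key recorded in the VMA */
	int linux_pte_key[NR_PAGES];	/* key kept in each Linux PTE */
	int hpte_key[NR_PAGES];		/* key programmed into each HPTE */
	bool hpte_valid[NR_PAGES];
};

/* Step (a): store the key in the VMA and in every Linux PTE of the range. */
static void toy_key_mprotect(struct toy_mm *mm, size_t first, size_t count,
			     int key)
{
	size_t i;

	assert(first + count <= NR_PAGES);
	assert(key >= 0 && key < 32);

	mm->vma_key = key;
	for (i = first; i < first + count; i++) {
		mm->linux_pte_key[i] = key;
		mm->hpte_valid[i] = false;	/* assumption: force a refault */
	}
}

/*
 * Step (b): the hash fault path copies the key from the Linux PTE into the
 * HPTE. It never looks at the VMA, so no VMA lock is needed here.
 */
static void toy_hash_fault(struct toy_mm *mm, size_t page)
{
	assert(page < NR_PAGES);
	mm->hpte_key[page] = mm->linux_pte_key[page];
	mm->hpte_valid[page] = true;
}

/*
 * Step (c): once the HPTE is programmed, an access is checked against a
 * per-thread mask of denied keys. Returns true if the access is allowed.
 */
static bool toy_access_ok(struct toy_mm *mm, size_t page,
			  unsigned int denied_keys)
{
	assert(page < NR_PAGES);
	if (!mm->hpte_valid[page])
		toy_hash_fault(mm, page);
	return !(denied_keys & (1u << mm->hpte_key[page]));
}
```

The point of the model is that toy_hash_fault() reads only the per-page PTE
copy of the key, which is what distinguishes this design from the one that
had to take the VMA lock in the fault path.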

[RFC v2 3/6] drivers: boot_constraint: Add boot_constraints_disable kernel parameter

2017-07-12 Thread Viresh Kumar
Users must be given an option to discard any constraints set by
bootloaders. For example, consider that a constraint is set for the LCD
controller's supply and the LCD driver isn't loaded by the kernel. If
the user doesn't need to use the LCD device, then he shouldn't be forced
to honour the constraint.

We can also think about finer control of such constraints with help of
some sysfs files, but a kernel parameter is fine to begin with.

Signed-off-by: Viresh Kumar 
---
 Documentation/admin-guide/kernel-parameters.txt |  3 +++
 drivers/base/boot_constraint.c  | 17 +
 2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 7737ab5d04b2..59ad24822d10 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -426,6 +426,9 @@
embedded devices based on command line input.
See Documentation/block/cmdline-partition.txt
 
+   boot_constraints_disable
+   Do not set any boot constraints for devices.
+
boot_delay= Milliseconds to delay each printk during boot.
Values larger than 10 seconds (10000) are changed to
no delay (0).
diff --git a/drivers/base/boot_constraint.c b/drivers/base/boot_constraint.c
index 4e4332751c5f..d372ddfe1264 100644
--- a/drivers/base/boot_constraint.c
+++ b/drivers/base/boot_constraint.c
@@ -47,6 +47,17 @@ static DEFINE_MUTEX(constraint_devices_mutex);
 static int constraint_supply_add(struct constraint *constraint, void *data);
 static void constraint_supply_remove(struct constraint *constraint);
 
+static bool boot_constraints_disabled;
+
+static int __init constraints_disable(char *str)
+{
+   boot_constraints_disabled = true;
+   pr_debug("disabled\n");
+
+   return 0;
+}
+early_param("boot_constraints_disable", constraints_disable);
+
 
 /* Boot constraints core */
 
@@ -154,6 +165,9 @@ int boot_constraint_add(struct device *dev, enum boot_constraint_type type,
struct constraint *constraint;
int ret;
 
+   if (boot_constraints_disabled)
+   return -ENODEV;
+
mutex_lock(&constraint_devices_mutex);
 
/* Find or add the cdev type first */
@@ -211,6 +225,9 @@ void boot_constraints_remove(struct device *dev)
struct constraint_dev *cdev;
struct constraint *constraint, *temp;
 
+   if (boot_constraints_disabled)
+   return;
+
mutex_lock(&constraint_devices_mutex);
 
cdev = constraint_device_find(dev);
-- 
2.13.0.71.gd7076ec9c9cb
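
The effect of the new parameter can be modelled in user space as below. This
is only a sketch under stated assumptions: cmdline_has_flag(),
toy_parse_cmdline() and toy_constraint_add() are hypothetical names, and the
exact-token match is a simplification of how the kernel parses early
parameters. It shows a bare flag on the command line switching
the add path to an early -ENODEV return, as the patch does in
boot_constraint_add().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

static bool toy_constraints_disabled;

/* Return true if the whitespace-separated command line contains `name`. */
static bool cmdline_has_flag(const char *cmdline, const char *name)
{
	size_t len;
	const char *p = cmdline;

	assert(cmdline != NULL && name != NULL);
	len = strlen(name);

	while (*p) {
		size_t tok;

		p += strspn(p, " \t");		/* skip separators */
		tok = strcspn(p, " \t");	/* length of the next token */
		if (tok == len && strncmp(p, name, len) == 0)
			return true;
		p += tok;
	}
	return false;
}

/* Stand-in for the early_param() handler in the patch. */
static void toy_parse_cmdline(const char *cmdline)
{
	toy_constraints_disabled =
		cmdline_has_flag(cmdline, "boot_constraints_disable");
}

/* Mirrors the early return added to boot_constraint_add(). */
static int toy_constraint_add(void)
{
	if (toy_constraints_disabled)
		return -ENODEV;
	return 0;	/* the real function would go on to add the constraint */
}
```

With "root=/dev/sda1 boot_constraints_disable quiet" the add call is refused;
without the flag it proceeds.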
