RE: [PATCH v3 0/5] fs/dcache: Limit # of negative dentries

2017-08-20 Thread Wangkai (Kevin,C)


> -----Original Message-----
> From: Waiman Long [mailto:long...@redhat.com]
> Sent: Friday, August 18, 2017 10:10 PM
> To: Wangkai (Kevin,C); Alexander Viro; Jonathan Corbet
> Cc: linux-ker...@vger.kernel.org; linux-doc@vger.kernel.org;
> linux-fsde...@vger.kernel.org; Paul E. McKenney; Andrew Morton; Ingo Molnar;
> Miklos Szeredi; Matthew Wilcox; Larry Woodman; James Bottomley
> Subject: Re: [PATCH v3 0/5] fs/dcache: Limit # of negative dentries
> 
> On 08/18/2017 05:59 AM, Wangkai (Kevin,C) wrote:
> >
> >>> In my patch the DCACHE_FILE_REMOVED flag was to distinguish the
> >>> removed file and the closed file. I found there was no difference of
> >>> a dentry between the removed file and the closed file; they are all on
> >>> the lru list.
> >> There is a difference between removed file and closed file. The type
> >> field of d_flags will be empty for a removed file, which indicates a
> >> negative dentry.
> >> Anything else is a positive dentry. Look at the inline function
> >> d_is_negative() [d_is_miss()] and you will see how it is done.
> > After the file was removed, the dentry flag was not MISS, the flag was:
> > DCACHE_REFERENCED | DCACHE_RCUACCESS | DCACHE_LRU_LIST |
> > DCACHE_REGULAR_TYPE. So the dentry is never freed until the kernel
> > reclaims the slab memory.
> 
> The dentry_unlink_inode() function will clear DCACHE_REGULAR_TYPE.
> 

Yes, I have added some trace info for the dentry state changes, with the
dentry flags and reference count:

File create:
[   42.636675] dentry [_1234] 0x880230be8180 flag 0x0 ref 1 ev dentry alloc
File close:
[   42.637421] dentry [_1234] 0x880230be8180 flag 0x4800c0 ref 0 ev dput called

Unlink lookup:
[  244.658086] dentry [_1234] 0x880230be8180 flag 0x4800c0 ref 1 ev d_lookup
Unlink d_delete:
[  244.658254] dentry [_1234] 0x880230be8180 flag 0x800c0 ref 1 ev d_lockref ref 1
Unlink dput:
[  244.658438] dentry [_1234] 0x880230be8180 flag 0x800c0 ref 0 ev dput called

In the end, the dentry's flags stay at 0x800c0, but the dentry is not freed;
it is kept by the dcache as unused. After tens of thousands of such dentries
accumulate, they slow down dentry lookup and kernel memory usage stays high.
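
To check the trace against the flag definitions, the two d_flags values can
be decoded in user space. The sketch below is only an illustration, not
kernel code: the constants are assumed to match include/linux/dcache.h in
4.13-era kernels and may differ in other versions, and flags_are_negative()
and describe_flags() are hypothetical helper names that mirror the type-field
test done by d_is_miss().

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed values, copied from include/linux/dcache.h of 4.13-era kernels. */
#define DCACHE_REFERENCED   0x00000040u
#define DCACHE_RCUACCESS    0x00000080u
#define DCACHE_LRU_LIST     0x00080000u
#define DCACHE_ENTRY_TYPE   0x00700000u /* mask covering the type field */
#define DCACHE_MISS_TYPE    0x00000000u /* empty type field: negative dentry */
#define DCACHE_REGULAR_TYPE 0x00400000u /* regular file */

/* User-space stand-in for the type-field test behind d_is_miss(). */
bool flags_are_negative(unsigned int d_flags)
{
    return (d_flags & DCACHE_ENTRY_TYPE) == DCACHE_MISS_TYPE;
}

/* Print the named bits found in a traced d_flags value. */
void describe_flags(unsigned int d_flags)
{
    printf("0x%06x:%s%s%s%s -> %s\n", d_flags,
           (d_flags & DCACHE_REFERENCED) ? " REFERENCED" : "",
           (d_flags & DCACHE_RCUACCESS) ? " RCUACCESS" : "",
           (d_flags & DCACHE_LRU_LIST) ? " LRU_LIST" : "",
           ((d_flags & DCACHE_ENTRY_TYPE) == DCACHE_REGULAR_TYPE)
               ? " REGULAR_TYPE" : "",
           flags_are_negative(d_flags) ? "negative" : "positive");
}
```

Under those assumed values, 0x4800c0 decodes to REFERENCED | RCUACCESS |
LRU_LIST | REGULAR_TYPE, while 0x800c0 has an empty type field, so the traced
dentry is negative after d_delete yet still sits on the LRU list.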

Regards,
Kevin


Re: [PATCH] switchdev: documentation: minor typo fixes

2017-08-20 Thread David Miller
From: Chris Packham 
Date: Mon, 21 Aug 2017 08:52:54 +1200

> Two typos in switchdev.txt
> 
> Signed-off-by: Chris Packham 

Applied.
--
To unsubscribe from this list: send the line "unsubscribe linux-doc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/2] PM: docs: Describe high-level PM strategies and sleep states

2017-08-20 Thread Rafael J. Wysocki
On Sun, Aug 20, 2017 at 8:23 PM, Lukas Wunner  wrote:
> On Sun, Aug 20, 2017 at 06:05:05PM +0200, Rafael J. Wysocki wrote:
>> From: Rafael J. Wysocki 
>>
>> Reorganize the power management part of admin-guide by adding a
>> description of major power management strategies supported by the
>> kernel (system-wide and working-state power management) to it and
>> dividing the rest of the material into the system-wide PM and
>> working-state PM chapters.
>>
>> On top of that, add a description of system sleep states to the
>> system-wide PM chapter.
>
> Found no typos and no factual inaccuracies, the only thing that
> irritated me a bit was the part about "working state" power
> management:
>
>> +The other strategy, referred to as the
>> +:doc:`working-state power management `, is based on adjusting the
>> +power states of individual hardware components of the system, as needed, in the
>> +working state.  In consequence, if this strategy is in use, the working state
>> +of the system usually does not correspond to any particular physical
>> +configuration of it, but can be treated as a metastate covering a range of
>> +different power states of the system in which the individual components of it
>> +can be either ``active`` (in use) or ``inactive`` (idle).  If they are active,
>> +they have to be in power states allowing them to process data and to be accessed
>> +by software.  In turn, if they are inactive, they are expected to be in
>> +low-power states in which they may not be accessible.
>> +
>> +If all of the system components are active, the system as a whole is regarded as
>> +``runtime active`` and that situation typically corresponds to the maximum power
>> +draw (or maximum energy usage) of it.  If all of them are inactive, the system
>> +as a whole is regarded as ``runtime idle`` which may be very close to a sleep
>
> The code uses the terms pm_runtime_active() and pm_runtime_suspended(),
> not "runtime idle".

Well, the phrase "runtime idle" in the document is (quite clearly)
used with respect to the system as a whole, so I'm not sure how this
is related to the runtime PM framework.

> Taking the ->runtime_idle callback as guidance,
> "runtime idle" would mean that a component is runtime active, but idling

Well, no.  At least that wasn't the intention and now it turns out to
be confusing ...

->runtime_idle is for cases when the device looks idle to a piece of
the kernel code and then it can use this callback to request the
device's driver/bus type and so on to take care of this situation.

Also the prefix "runtime" is there to distinguish the callback from
the other callbacks, related to system suspend and hibernation, in the
same data type.

And, of course, "suspended" implies "idle", but not the other way
around, in general.

> and could thus be transitioned to runtime suspended state.  However above
> it says that if it's idle, it's already "in low-power states and may
> not be accessible".

No, it doesn't say that.  It talks about expectations which may not be
what actually happens.

> For someone reading this it may be difficult to
> reconcile it with the terminology used in the code.

Anyway, the point here is to note the difference between sleep states
and a completely idle system in the working state.
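
The distinction drawn above between an idle device and a suspended one can be
pictured with a small user-space toy model. Everything in it is hypothetical
(the toy_* names are not the kernel's runtime PM API); it only assumes, as
described in this reply, that an idle notification asks a callback what to do
and that a device must be idle before it can be suspended.

```c
#include <stdbool.h>

/* Toy model only: none of these names exist in the kernel. */
enum toy_rpm_state { TOY_ACTIVE, TOY_SUSPENDED };

struct toy_dev {
    enum toy_rpm_state state;
    bool in_use;                              /* is anything using the device? */
    int (*runtime_idle)(struct toy_dev *dev); /* returns 0 to allow suspend */
};

/* "Idle" means "not in use", whatever power state the device is in. */
bool toy_dev_is_idle(const struct toy_dev *dev)
{
    return !dev->in_use;
}

/* Only an idle device may be suspended, so suspended implies idle. */
bool toy_suspend(struct toy_dev *dev)
{
    if (dev->in_use)
        return false;
    dev->state = TOY_SUSPENDED;
    return true;
}

/* Called by code that notices the device looks idle. */
void toy_idle_notify(struct toy_dev *dev)
{
    if (dev->state != TOY_ACTIVE || dev->in_use)
        return;
    /* The callback decides whether the device may go on to suspend. */
    if (!dev->runtime_idle || dev->runtime_idle(dev) == 0)
        toy_suspend(dev);
}

/* Example callbacks: one permits the suspend, the other vetoes it. */
int toy_allow_suspend(struct toy_dev *dev) { (void)dev; return 0; }
int toy_refuse_suspend(struct toy_dev *dev) { (void)dev; return -1; }
```

With toy_refuse_suspend installed, a device that is not in use stays active
after toy_idle_notify(): it is idle without being suspended, which is the
"not the other way around" case.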

>
> Otherwise,
> Reviewed-by: Lukas Wunner 

Thanks!


Re: [v5 2/4] mm, oom: cgroup-aware OOM killer

2017-08-20 Thread David Rientjes
On Wed, 16 Aug 2017, Roman Gushchin wrote:

> It's natural to expect that inside a container there are their own sshd,
> "activity manager" or some other stuff, which can play with oom_score_adj.
> If it can override the upper cgroup-level settings, the whole delegation model
> is broken.
> 

I don't think any delegation model related to core cgroups or memory 
cgroup is broken, I think it's based on how memory.oom_kill_all_tasks is 
defined.  It could very well behave as memory.oom_kill_all_eligible_tasks 
when enacted upon.

> You can think about the oom_kill_all_tasks like the panic_on_oom,
> but on a cgroup level. It should _guarantee_, that in case of oom
> the whole cgroup will be destroyed completely, and will not remain
> in a non-consistent state.
> 

Only CAP_SYS_ADMIN has this ability to set /proc/pid/oom_score_adj to 
OOM_SCORE_ADJ_MIN, so it preserves the ability to change that setting, if 
needed, when it sets memory.oom_kill_all_tasks.  If a user gains 
permissions to change memory.oom_kill_all_tasks, I disagree it should 
override the CAP_SYS_ADMIN setting of /proc/pid/oom_score_adj.
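
The two behaviours under discussion can be contrasted in a short user-space
sketch. It assumes OOM_SCORE_ADJ_MIN is -1000, as in
include/uapi/linux/oom.h; the enum and task_would_be_killed() are
hypothetical names that only model the disagreement and do not correspond to
any code in the patch set.

```c
#include <stdbool.h>
#include <stdio.h>

/* Assumed from include/uapi/linux/oom.h. */
#define OOM_SCORE_ADJ_MIN (-1000)

/* A task is "oom disabled" when its oom_score_adj is OOM_SCORE_ADJ_MIN. */
bool oom_disabled(int oom_score_adj)
{
    return oom_score_adj == OOM_SCORE_ADJ_MIN;
}

/* Read /proc/<pid>/oom_score_adj; returns 0 on success, -1 on failure. */
int read_oom_score_adj(long pid, int *adj)
{
    char path[64];
    FILE *f;
    int ok;

    snprintf(path, sizeof(path), "/proc/%ld/oom_score_adj", pid);
    f = fopen(path, "r");
    if (!f)
        return -1;
    ok = (fscanf(f, "%d", adj) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical model of the two semantics debated in this thread. */
enum kill_policy {
    KILL_ALL_TASKS,          /* the knob overrides oom_score_adj */
    KILL_ALL_ELIGIBLE_TASKS  /* oom disabled tasks are spared */
};

bool task_would_be_killed(enum kill_policy policy, int oom_score_adj)
{
    if (policy == KILL_ALL_ELIGIBLE_TASKS && oom_disabled(oom_score_adj))
        return false;
    return true;
}
```

The only case in which the two policies differ is a task whose
oom_score_adj is OOM_SCORE_ADJ_MIN, which is exactly the case being argued
over here.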

I would prefer not to exclude oom disabled processes to their own sibling 
cgroups because they would require their own reservation with cgroup v2 
and it makes the single hierarchy model much more difficult to arrange 
alongside cpusets, for example.

> The model you're describing is based on a trust given to these oom-unkillable
> processes on system level. But we can't really trust some unknown processes
> inside a cgroup that they will be able to do some useful work and finish
> in a reasonable time; especially in case of a global memory shortage.

Yes, we prefer to panic instead of sshd, for example, being oom killed.  
We trust sshd, as well as our own activity manager and security 
daemons, to do useful work, and we never want the kernel to oom kill 
them.  I'm not sure why you are describing processes that CAP_SYS_ADMIN 
has set to be oom disabled as unknown processes.

I'd be interested in hearing the opinions of others related to a per-memcg 
knob being allowed to override the setting of the sysadmin.


Re: [v5 4/4] mm, oom, docs: describe the cgroup-aware OOM killer

2017-08-20 Thread David Rientjes
On Thu, 17 Aug 2017, Roman Gushchin wrote:

> Hi David!
> 
> Please, find an updated version of docs patch below.
> 

Looks much better, thanks!  I think the only pending issue is discussing 
the relationship of memory.oom_kill_all_tasks with /proc/pid/oom_score_adj 
== OOM_SCORE_ADJ_MIN.


Re: [PATCH v3 3/3] gpio: mockup: use irq_sim

2017-08-20 Thread Linus Walleij
On Mon, Aug 14, 2017 at 1:20 PM, Bartosz Golaszewski  wrote:

> Shrink the driver by removing the code dealing with dummy interrupts
> and replacing it with calls to the irq_sim API.
>
> Signed-off-by: Bartosz Golaszewski 
> Acked-by: Jonathan Cameron 
> Reviewed-by: Linus Walleij 

I applied this to the GPIO tree after pulling the infrastructure
from Thomas' branch.

Yours,
Linus Walleij


Re: [PATCH v4 0/3] simulated interrupts

2017-08-20 Thread Linus Walleij
On Wed, Aug 16, 2017 at 4:44 PM, Thomas Gleixner  wrote:

> I merged the irq part (1+2) into a separate branch, which can be consumed
> by the gpio folks so the mockup driver patch can be merged as well.
>
>git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/for-gpio

Awesome, thanks Thomas. I pulled this into the GPIO devel branch.

Yours,
Linus Walleij


[PATCH] switchdev: documentation: minor typo fixes

2017-08-20 Thread Chris Packham
Two typos in switchdev.txt

Signed-off-by: Chris Packham 
---
 Documentation/networking/switchdev.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt
index 3e7b946dea27..5e40e1f68873 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.txt
@@ -228,7 +228,7 @@ Learning on the device port should be enabled, as well as learning_sync:
 	bridge link set dev DEV learning on self
 	bridge link set dev DEV learning_sync on self
 
-Learning_sync attribute enables syncing of the learned/forgotton FDB entry to
+Learning_sync attribute enables syncing of the learned/forgotten FDB entry to
 the bridge's FDB.  It's possible, but not optimal, to enable learning on the
 device port and on the bridge port, and disable learning_sync.
 
@@ -245,7 +245,7 @@ the responsibility of the port driver/device to age out these entries.  If the
 port device supports ageing, when the FDB entry expires, it will notify the
 driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL.  If the
 device does not support ageing, the driver can simulate ageing using a
-garbage collection timer to monitor FBD entries.  Expired entries will be
+garbage collection timer to monitor FDB entries.  Expired entries will be
 notified to the bridge using SWITCHDEV_FDB_DEL.  See rocker driver for
 example of driver running ageing timer.
 
-- 
2.14.1
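
For readers unfamiliar with the ageing paragraph touched by this patch, the
idea of simulating ageing with a garbage collection timer can be sketched in
user space as follows. This is a minimal sketch under stated assumptions, not
the rocker driver's code: the toy_* names, the table layout and the tick-based
clock are all hypothetical, and the callback merely stands in for notifying
the bridge with SWITCHDEV_FDB_DEL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified FDB entry; a real driver keys on MAC and VLAN. */
struct toy_fdb_entry {
    unsigned long last_seen; /* tick at which the entry was last refreshed */
    bool in_use;
};

/* Stand-in for notifying the bridge with SWITCHDEV_FDB_DEL. */
typedef void (*toy_fdb_del_fn)(size_t index, void *ctx);

/*
 * One garbage-collection pass, as a periodic timer might run it: expire
 * every entry not refreshed within ageing_time ticks and report it through
 * notify_del. Returns the number of entries expired.
 */
size_t toy_fdb_gc(struct toy_fdb_entry *tbl, size_t n, unsigned long now,
                  unsigned long ageing_time, toy_fdb_del_fn notify_del,
                  void *ctx)
{
    size_t expired = 0;

    for (size_t i = 0; i < n; i++) {
        if (!tbl[i].in_use)
            continue;
        /* Unsigned subtraction stays correct if the tick counter wraps. */
        if (now - tbl[i].last_seen < ageing_time)
            continue;
        tbl[i].in_use = false;
        if (notify_del)
            notify_del(i, ctx);
        expired++;
    }
    return expired;
}

/* Example callback: count how many deletions were reported. */
void toy_count_del(size_t index, void *ctx)
{
    (void)index;
    (*(size_t *)ctx)++;
}
```

A driver following this pattern would refresh last_seen whenever traffic from
the address is observed and run the pass from its timer at some fraction of
the ageing time.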



Re: [PATCH 1/2] PM: docs: Describe high-level PM strategies and sleep states

2017-08-20 Thread Lukas Wunner
On Sun, Aug 20, 2017 at 06:05:05PM +0200, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> Reorganize the power management part of admin-guide by adding a
> description of major power management strategies supported by the
> kernel (system-wide and working-state power management) to it and
> dividing the rest of the material into the system-wide PM and
> working-state PM chapters.
> 
> On top of that, add a description of system sleep states to the
> system-wide PM chapter.

Found no typos and no factual inaccuracies, the only thing that
irritated me a bit was the part about "working state" power
management:

> +The other strategy, referred to as the
> +:doc:`working-state power management `, is based on adjusting the
> +power states of individual hardware components of the system, as needed, in the
> +working state.  In consequence, if this strategy is in use, the working state
> +of the system usually does not correspond to any particular physical
> +configuration of it, but can be treated as a metastate covering a range of
> +different power states of the system in which the individual components of it
> +can be either ``active`` (in use) or ``inactive`` (idle).  If they are active,
> +they have to be in power states allowing them to process data and to be accessed
> +by software.  In turn, if they are inactive, they are expected to be in
> +low-power states in which they may not be accessible.
> +
> +If all of the system components are active, the system as a whole is regarded as
> +``runtime active`` and that situation typically corresponds to the maximum power
> +draw (or maximum energy usage) of it.  If all of them are inactive, the system
> +as a whole is regarded as ``runtime idle`` which may be very close to a sleep

The code uses the terms pm_runtime_active() and pm_runtime_suspended(),
not "runtime idle".  Taking the ->runtime_idle callback as guidance,
"runtime idle" would mean that a component is runtime active, but idling
and could thus be transitioned to runtime suspended state.  However above
it says that if it's idle, it's already "in low-power states and may
not be accessible".  For someone reading this it may be difficult to
reconcile it with the terminology used in the code.

Otherwise,
Reviewed-by: Lukas Wunner 

Thanks,

Lukas


[PATCH 1/2] PM: docs: Describe high-level PM strategies and sleep states

2017-08-20 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

Reorganize the power management part of admin-guide by adding a
description of major power management strategies supported by the
kernel (system-wide and working-state power management) to it and
dividing the rest of the material into the system-wide PM and
working-state PM chapters.

On top of that, add a description of system sleep states to the
system-wide PM chapter.

Signed-off-by: Rafael J. Wysocki 
---
 Documentation/admin-guide/pm/index.rst |5 
 Documentation/admin-guide/pm/sleep-states.rst  |  234 +
 Documentation/admin-guide/pm/strategies.rst|   52 +
 Documentation/admin-guide/pm/system-wide.rst   |   15 +
 Documentation/admin-guide/pm/working-state.rst |   16 +
 5 files changed, 320 insertions(+), 2 deletions(-)

Index: linux-pm/Documentation/admin-guide/pm/index.rst
===================================================================
--- linux-pm.orig/Documentation/admin-guide/pm/index.rst
+++ linux-pm/Documentation/admin-guide/pm/index.rst
@@ -5,8 +5,9 @@ Power Management
 .. toctree::
:maxdepth: 2
 
-   cpufreq
-   intel_pstate
+   strategies
+   system-wide
+   working-state
 
 .. only::  subproject and html
 
Index: linux-pm/Documentation/admin-guide/pm/sleep-states.rst
===================================================================
--- /dev/null
+++ linux-pm/Documentation/admin-guide/pm/sleep-states.rst
@@ -0,0 +1,234 @@
+===================
+System Sleep States
+===================
+
+::
+
+ Copyright (c) 2017 Intel Corp., Rafael J. Wysocki 
+
+Sleep states are global low-power states of the entire system in which user
+space code cannot be executed and the overall system activity is significantly
+reduced.
+
+
+Sleep States That Can Be Supported
+==================================
+
+Depending on its configuration and the capabilities of the platform it runs on,
+the Linux kernel can support up to four system sleep states, including
+hibernation and up to three variants of system suspend.  The sleep states that
+can be supported by the kernel are listed below.
+
+Suspend-to-Idle
+---------------
+
+This is a generic, pure software, light-weight variant of system suspend (also
+referred to as S2I or S2Idle).  It allows more energy to be saved relative to
+runtime idle by freezing user space, suspending the timekeeping and putting all
+I/O devices into low-power states (possibly lower-power than available in the
+working state), such that the processors can spend time in their deepest idle
+states while the system is suspended.
+
+The system is woken up from this state by in-band interrupts, so theoretically
+any devices that can cause interrupts to be generated in the working state can
+also be set up as wakeup devices for S2Idle.
+
+This state can be used on platforms without support for `Standby`_ or
+`Suspend-to-RAM`_, or it can be used in addition to any of the deeper system
+suspend variants to provide reduced resume latency.  It is always supported if
+the :c:macro:`CONFIG_SUSPEND` kernel configuration option is set.
+
+Standby
+-------
+
+This state, if supported, offers moderate, but real, energy savings, while
+providing a relatively straightforward transition back to the working state.  No
+operating state is lost (the system core logic retains power), so the system can
+go back to where it left off easily enough.
+
+In addition to freezing user space, suspending the timekeeping and putting all
+I/O devices into low-power states, which is done for `Suspend-to-Idle`_ too,
+nonboot CPUs are taken offline and all low-level system functions are suspended
+during transitions into this state.  For this reason, it should allow more
+energy to be saved relative to `Suspend-to-Idle`_, but the resume latency will
+generally be greater than for that state.
+
+The set of devices that can wake up the system from this state usually is
+reduced relative to `Suspend-to-Idle`_ and it may be necessary to rely on the
+platform for setting up the wakeup functionality as appropriate.
+
+This state is supported if the :c:macro:`CONFIG_SUSPEND` kernel configuration
+option is set and the support for it is registered by the platform with the
+core system suspend subsystem.  On ACPI-based systems this state is mapped to
+the S1 system state defined by ACPI.
+
+Suspend-to-RAM
+--------------
+
+This state (also referred to as STR or S2RAM), if supported, offers significant
+energy savings as everything in the system is put into a low-power state, except
+for memory, which should be placed into the self-refresh mode to retain its
+contents.  All of the steps carried out when entering `Standby`_ are also
+carried out during transitions to S2RAM.  Additional operations may take place
+depending on the platform capabilities.  In particular, on ACPI-based systems
+the kernel passes control to the platform firmware (BIOS) as the last step
+during