Re: [PATCH 0/9] sched: Core scheduling interfaces

2021-04-19 Thread Tejun Heo
Hello,

Sorry about late reply.

On Wed, Apr 07, 2021 at 08:34:24PM +0200, Peter Zijlstra wrote:
> > Given the existence of prctl and clone APIs, I don't see the reason to
> > have a separate cgroup-bound interface too (as argued by Tejun). 
> 
> IMO as long as cgroups have that tasks file, you get to support people
> using it. That means that tasks joining your cgroup need to 'inherit'

This argument doesn't really make sense to me. We don't add things just to
make interfaces orthogonal. Orthogonality can be a consideration, but it is
neither the only one nor among the most important. There are many cases
where we tell people not to shoot themselves in the foot, and many
interfaces which are fading away or in the process of being deprecated.

I'm not planning to deprecate the dynamic migration interfaces given the
history and usefulness in testing but the usage model of cgroup2 is clearly
defined and documented in this regard - whether the initial population of
the cgroup happens through CLONE_INTO_CGROUP or migration, for resource
tracking and control purposes, cgroup2 does not support dynamic migration
with the exception of migrations within threaded domains.

Anything is a possibility, but given how this requirement is intertwined
with achieving comprehensive resource control, a core goal of cgroup2, and
widely adopted by all the new fangled things as you put it, changing this
wouldn't be easy. Not just because some people, including me, are against
it, but because there are inherent technical challenges and overheads to
supporting dynamic charge migration for stateful controllers, and the
cost-benefit balance doesn't come out in favor.

So, the current "policy" is something like this:

* cgroupfs interface is for cgroup core features of organizing cgroups and
  processes and configuring resource configurations which preferably follow
  one of the control models defined in the doc.

* The hierarchical organization is semi-static in the sense that once a
  cgroup is populated tasks shouldn't be moved in or out of the cgroup with
  the exception of threaded domains.

* Non-resource control usages can hook into cgroup for identification /
  tagging purposes but should use the originating interface for
  interactions.

This has been applied consistently over the years now. There can of course
be exceptions, but I haven't seen anything outstanding in this round of
discussion, so I'm pretty skeptical. The actual use cases don't seem to
need it; the only argument for it is that it would be nice to have, and it
involves violating the usage model.

My suggestion is to go ahead with the per-process interface, keeping a
cgroup extension in mind in case actual use cases arise. Also, when
planning cgroup integration, putting dynamic migration front and center
likely isn't a good idea.

Thanks.

-- 
tejun


Re: [PATCH 0/2] workqueue: Have 'alloc_workqueue()' like macros accept a format specifier

2021-04-19 Thread Tejun Heo
Hello, Christophe.

On Sun, Apr 18, 2021 at 11:25:52PM +0200, Christophe JAILLET wrote:
> Improve 'create_workqueue', 'create_freezable_workqueue' and
> 'create_singlethread_workqueue' so that they accept a format
> specifier and a variable number of arguments.
> 
> This will put these macros more in line with 'alloc_ordered_workqueue' and
> the underlying 'alloc_workqueue()' function.

Those interfaces are deprecated and if you're doing anything with the users,
the right course of action would be converting them to use one of the
alloc_workqueue interfaces.
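For reference, such a conversion might look like the sketch below (kernel
code, not runnable standalone). The flags mirror what the legacy macros
historically expanded to, and "mydrv"/"instance_id" are made-up names;
double-check the current definitions in include/linux/workqueue.h before
relying on this.

```c
/* Before: deprecated helper, fixed name, no format arguments. */
wq = create_singlethread_workqueue("mydrv");

/* After: same ordered (one work item at a time) execution semantics,
 * still usable on memory-reclaim paths, and the name can be formatted. */
wq = alloc_ordered_workqueue("mydrv/%d", WQ_MEM_RECLAIM, instance_id);

/* Before: multithreaded legacy helper. */
wq = create_workqueue("mydrv_wq");

/* After: per-cpu workqueue, max one in-flight work item per CPU. */
wq = alloc_workqueue("mydrv_wq", WQ_MEM_RECLAIM, 1);
```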

Thanks.

-- 
tejun


Re: [PATCH 0/2] Relax cpuset validation for cgroup v2

2021-04-13 Thread Tejun Heo
On Tue, Apr 13, 2021 at 11:02:33AM +0200, Odin Ugedal wrote:
> Two small validation relaxations for cgroup v2, making it easier to
> manage the hierarchy without the current pain points. Both changes
> work for both mems and cpus (but I have no NUMA machine to test mems).
> 
> Hopefully the patches have an ok description of what they change and why
> they are helpful.

I'm generally in favor of removing configuration constraints but let's hear
what Li thinks.

Thanks.

-- 
tejun


Re: [PATCH -next] cgroup/cpuset: fix typos in comments

2021-04-12 Thread Tejun Heo
On Thu, Apr 08, 2021 at 04:03:46PM +0800, Lu Jialin wrote:
> Change hierachy to hierarchy and unrechable to unreachable,
> no functionality changed.
> 
> Signed-off-by: Lu Jialin 

Applied to cgroup/for-5.13.

Thanks.

-- 
tejun


Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Tejun Heo
Hello,

On Tue, Apr 06, 2021 at 08:57:15PM +0530, Pavan Kondeti wrote:
> Yeah. The workqueue attrs comes in handy to reduce the nice/prio of a
> background workqueue if we identify that it is cpu intensive. However, this
> needs case by case analysis, tweaking etc. If there is no other alternative,
> we might end up chasing the background workers and reduce their nice value.

There shouldn't be that many workqueues that consume a lot of cpu cycles.
The usual culprit is kswapd, IO related stuff (writeback, encryption), so it
shouldn't be a long list and we want them identified anyway.

> The problem at hand (which you might know already) is that, let's say
> we have 2 cgroups in our setup and we want to prioritize UX over background.
> IOW, reduce the cpu.shares of background cgroup. This helps in prioritizing
> Task-A and Task-B over Task-X and Task-Y. However, each individual kworker
> can potentially have CPU share equal to the entire UX cgroup.
> 
> - kworker/0:0
> - kworker/1:0
> - UX
>     - Task-A
>     - Task-B
> - background
>     - Task-X
>     - Task-Y
> Hence we want to move all kernel threads to another cgroup so that this cgroup
> will have CPU share equal to UX.
> 
> The patch presented here allows us to create the above setup. Any other
> alternative approaches to achieve the same without impacting any future
> designs/requirements?

Not quite the same but we already have
/sys/devices/virtual/workqueue/cpumask which affects all unbound workqueues,
so maybe a global default priority knob would help here?
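As an illustration of what driving such a knob would look like from
userspace, here is a small hedged sketch. The sysfs path below is the
existing global cpumask knob; the mask value "3" (CPUs 0-1) and any
hypothetical priority knob driven the same way are assumptions, and
writing the real file requires root.

```c
#include <stdio.h>

/* Write val (plus a newline) to a sysfs attribute file.
 * Returns 0 on success, -1 on error. */
static int write_sysfs(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");
	if (!f)
		return -1;
	int ret = fprintf(f, "%s\n", val) > 0 ? 0 : -1;
	if (fclose(f) != 0)
		ret = -1;
	return ret;
}

/* Example (needs root): confine all unbound workqueues to CPUs 0-1.
 *   write_sysfs("/sys/devices/virtual/workqueue/cpumask", "3");
 */
```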

Thanks.

-- 
tejun


Re: [PATCH 0/9] sched: Core scheduling interfaces

2021-04-06 Thread Tejun Heo
Hello,

On Tue, Apr 06, 2021 at 05:32:04PM +0200, Peter Zijlstra wrote:
> > I find it difficult to like the proposed interface from the name (the term
> > "core" is really confusing given how the word tends to be used internally)
> > to the semantics (it isn't like anything else) and even the functionality
> > (we're gonna have fixed processors at some point, right?).
> 
> Core is the topological name for the thing that hosts the SMT threads.
> Can't really help that.

I find the name pretty unfortunate given how overloaded the term is
generally and also in kernel but oh well...

> > Here are some preliminary thoughts:
> > 
> > * Are both prctl and cgroup based interfaces really necessary? I could be
> >   being naive but given that we're (hopefully) working around hardware
> >   deficiencies which will go away in time, I think there's a strong case for
> >   minimizing at least the interface to the bare minimum.
> 
> I'm not one for cgroups much, so I'll let others argue that case, except
> that per systemd and all the other new fangled shit, people seem to use
> cgroups a lot to group tasks. So it makes sense to also expose this
> through cgroups in some form.

All the new fangled things follow a certain usage pattern which makes
aligning parts of the process tree with the cgroup layout trivial. When
restrictions can be applied along the process tree like this and there
isn't a particular need for dynamic hierarchical control, there isn't much
need for a direct cgroup interface.

> That said; I've had requests from lots of non security folks about this
> feature to help mitigate the SMT interference.
> 
> Consider for example Real-Time. If you have an active SMT sibling, the
> CPU performance is much less than it would be when the SMT sibling is
> idle. Therefore, for the benefit of determinism, it would be very nice
> if RT tasks could force-idle their SMT siblings, and voila, this
> interface allows exactly that.
> 
> The same is true for other workloads that care about interference.

I see.

> >   Given how cgroups are set up (membership operations happening only for
> >   seeding, especially with the new clone interface), it isn't too difficult
> >   to synchronize process tree and cgroup hierarchy where it matters - ie.
> >   given the right per-process level interface, restricting configuration for
> >   a cgroup sub-hierarchy may not need any cgroup involvement at all. This
> >   also nicely gets rid of the interaction between prctl and cgroup bits.
> 
> I've no idea what you mean :/ The way I use cgroups (when I have to, for
> testing) is to echo the pid into /cgroup/foo/tasks. No clone or anything
> involved.

The usage pattern is creating a new cgroup, seeding it with a process
(either writing to tasks or using CLONE_INTO_CGROUP) and let it continue
only on that sub-hierarchy, so cgroup hierarchy usually partially overlays
process trees.
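The CLONE_INTO_CGROUP seeding pattern can be sketched roughly as follows
(userspace C, Linux 5.7+). The struct below mirrors the kernel's
clone_args ABI so the example builds even with older UAPI headers; verify
the layout against include/uapi/linux/sched.h. The cgroup path passed by
the caller is a placeholder.

```c
/* Seed a cgroup with clone3(CLONE_INTO_CGROUP): the child starts life in
 * the target cgroup, so no later migration is needed. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_clone3
#define SYS_clone3 435			/* same number on all architectures */
#endif
#define CLONE_INTO_CGROUP_FLAG 0x200000000ULL

/* Mirrors struct clone_args up through the 'cgroup' member. */
struct kclone_args {
	uint64_t flags;
	uint64_t pidfd;
	uint64_t child_tid;
	uint64_t parent_tid;
	uint64_t exit_signal;
	uint64_t stack;
	uint64_t stack_size;
	uint64_t tls;
	uint64_t set_tid;
	uint64_t set_tid_size;
	uint64_t cgroup;
};

/* Fork a child directly into the cgroup2 directory at cgrp_path.
 * Returns the child pid in the parent, 0 in the child, -1 on error. */
static pid_t spawn_into_cgroup(const char *cgrp_path)
{
	struct kclone_args args;
	pid_t pid;
	int cgrp_fd = open(cgrp_path, O_RDONLY | O_DIRECTORY);

	if (cgrp_fd < 0)
		return -1;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_INTO_CGROUP_FLAG;
	args.exit_signal = SIGCHLD;
	args.cgroup = (uint64_t)cgrp_fd;

	/* clone3() has no glibc wrapper; invoke the syscall directly. */
	pid = (pid_t)syscall(SYS_clone3, &args, sizeof(args));
	close(cgrp_fd);
	return pid;
}
```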

> None of my test machines come up with cgroupfs mounted, and any and all
> cgroup setup is under my control.
>
> > * If we *have* to have cgroup interface, I wonder whether this would fit a
> >   lot better as a part of cpuset. If you squint just right, this can be
> >   viewed as some dynamic form of cpuset. Implementation-wise, it probably
> >   won't integrate with the rest but I think the feature will be less jarring
> >   as a part of cpuset, which already is a bit of kitchensink anyway.
> 
> Not sure I agree, we do not change the affinity of things, we only
> control who's allowed to run concurrently on SMT siblings. There could
> be a cpuset partition split between the siblings and it would still work
> fine.

I see. Yeah, if we really need it, I'm not sure it fits in cgroup interface
proper. As I wrote elsewhere, these things are usually implemented on the
originating subsystem interface with cgroup ID as a parameter.

Thanks.

-- 
tejun


Re: [PATCH 0/9] sched: Core scheduling interfaces

2021-04-06 Thread Tejun Heo
Hello,

On Mon, Apr 05, 2021 at 02:46:09PM -0400, Joel Fernandes wrote:
> Yeah, its at http://lore.kernel.org/r/20200822030155.ga414...@google.com
> as mentioned above, let me know if you need any more details about
> usecase.

Except for the unspecified reason in usecase 4, I don't see why cgroup is
in the picture at all. This doesn't really have much to do with
hierarchical resource distribution. Besides, yes, you can use cgroup for
logical structuring and identification purposes, but in those cases the
interactions and interface should be with the original subsystem while
using cgroup IDs or paths as parameters - see tracing and bpf for examples.

Thanks.

-- 
tejun


Re: [PATCH] cgroup: Relax restrictions on kernel threads moving out of root cpu cgroup

2021-04-06 Thread Tejun Heo
Hello,

On Tue, Apr 06, 2021 at 06:34:21PM +0530, Pavankumar Kondeti wrote:
> In Android GKI, CONFIG_FAIR_GROUP_SCHED is enabled [1] to help prioritize
> important work. Given that CPU shares of root cgroup can't be changed,
> leaving the tasks inside root cgroup will give them higher share
> compared to the other tasks inside important cgroups. This is mitigated
> by moving all tasks inside root cgroup to a different cgroup after
> Android is booted. However, there are many kernel tasks stuck in the
> root cgroup after the boot.
> 
> We see all kworker threads are in the root cpu cgroup. This is because,
> tasks with PF_NO_SETAFFINITY flag set are forbidden from cgroup migration.
> This restriction is in place to avoid kworkers getting moved to a cpuset
> which conflicts with kworker affinity. Relax this restriction by explicitly
> checking if the task is moving out of a cpuset cgroup. This allows kworkers
> to be moved out root cpu cgroup when cpu and cpuset cgroup controllers
> are mounted on different hierarchies.
> 
> We also see kthreadd_task and any kernel thread created after the Android boot
> also stuck in the root cgroup. The current code prevents kthreadd_task moving
> out root cgroup to avoid the possibility of creating new RT kernel threads
> inside a cgroup with no RT runtime allocated. Apply this restriction when
> tasks are moving out of cpu cgroup under CONFIG_RT_GROUP_SCHED. This allows all
> kernel threads to be moved out of root cpu cgroup if the kernel does not
> enable RT group scheduling.

The fundamental reason why those kthreads are in the root cgroup is because
they're doing work on behalf of the entire system and their resource usages
can't be attributed to any specific cgroup. What we want to do is accounting
actual usages to the originating cgroups so that cpu cycles spent by kswapd
is charged to the originating cgroups, however well we can define them, and
then throttle the origin if the consumption is going over budget for that
cgroup's allocation. This is how we already handle shared IOs.

The problem with the proposed patch is that it breaks the logical
organization of resource hierarchy in a way which hinders proper future
solutions.

If all you want is deprioritizing certain kworkers, please use workqueue
attrs instead.
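A hedged kernel-side sketch of what that looks like: lower the priority of
an unbound workqueue by applying attrs with a higher nice value. The
function names match recent kernels (alloc_workqueue_attrs() lost its gfp
argument in 5.3, and apply_workqueue_attrs() only applies to WQ_UNBOUND
workqueues); verify against include/linux/workqueue.h for your tree.

```c
/* Deprioritize an unbound workqueue via workqueue attrs. */
static int deprioritize_wq(struct workqueue_struct *wq)
{
	struct workqueue_attrs *attrs;
	int ret;

	attrs = alloc_workqueue_attrs();
	if (!attrs)
		return -ENOMEM;
	attrs->nice = 10;		/* background-ish priority */
	ret = apply_workqueue_attrs(wq, attrs);
	free_workqueue_attrs(attrs);
	return ret;
}
```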

Thanks.

-- 
tejun


[GIT PULL] wq fixes for v5.12-rc6

2021-04-05 Thread Tejun Heo
Hello, Linus.

Two workqueue fixes. One is around debugobj and poses no risk. The other is
to prevent the stall watchdog from firing spuriously in certain conditions.
Not as trivial as debugobj change but is still fairly low risk.

Thanks.

The following changes since commit 2023a53bdf41b7646b1d384b6816af06309f73a5:

  Merge tag 'for-linus' of git://github.com/openrisc/linux (2021-04-03 15:42:45 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-5.12-fixes

for you to fetch changes up to 89e28ce60cb65971c73359c66d076aa20a395cd5:

  workqueue/watchdog: Make unbound workqueues aware of touch_softlockup_watchdog() (2021-04-04 13:26:49 -0400)


Wang Qing (1):
      workqueue/watchdog: Make unbound workqueues aware of touch_softlockup_watchdog()

Zqiang (1):
  workqueue: Move the position of debug_work_activate() in __queue_work()

 kernel/watchdog.c  |  5 +++--
 kernel/workqueue.c | 19 +++
 2 files changed, 10 insertions(+), 14 deletions(-)


Re: [PATCH 0/9] sched: Core scheduling interfaces

2021-04-04 Thread Tejun Heo
cc'ing Michal and Christian who've been spending some time on cgroup
interface issues recently and Li Zefan for cpuset.

On Thu, Apr 01, 2021 at 03:10:12PM +0200, Peter Zijlstra wrote:
> The cgroup interface now uses a 'core_sched' file, which still takes 0,1.
> It is however changed such that you can have nested tags. Then for any given
> task, the first parent with a cookie is the effective one. The rationale is
> that this way you can delegate subtrees and still allow them some control
> over grouping.

I find it difficult to like the proposed interface from the name (the term
"core" is really confusing given how the word tends to be used internally)
to the semantics (it isn't like anything else) and even the functionality
(we're gonna have fixed processors at some point, right?).

Here are some preliminary thoughts:

* Are both prctl and cgroup based interfaces really necessary? I could be
  being naive but given that we're (hopefully) working around hardware
  deficiencies which will go away in time, I think there's a strong case for
  minimizing at least the interface to the bare minimum.

  Given how cgroups are set up (membership operations happening only for
  seeding, especially with the new clone interface), it isn't too difficult
  to synchronize process tree and cgroup hierarchy where it matters - ie.
  given the right per-process level interface, restricting configuration for
  a cgroup sub-hierarchy may not need any cgroup involvement at all. This
  also nicely gets rid of the interaction between prctl and cgroup bits.

* If we *have* to have cgroup interface, I wonder whether this would fit a
  lot better as a part of cpuset. If you squint just right, this can be
  viewed as some dynamic form of cpuset. Implementation-wise, it probably
  won't integrate with the rest but I think the feature will be less jarring
  as a part of cpuset, which already is a bit of kitchensink anyway.

> The cgroup thing also '(ab)uses' cgroup_mutex for serialization because it
> needs to ensure continuity between ss->can_attach() and ss->attach() for the
> memory allocation. If the prctl() were allowed to interleave it might steal
> the memory.
> 
> Using cgroup_mutex feels icky, but is not without precedent,
> kernel/bpf/cgroup.c does the same thing afaict.
> 
> TJ, can you please have a look at this?

Yeah, using cgroup_mutex for stabilizing cgroup hierarchy for consecutive
operations is fine. It might be worthwhile to break that out into a proper
interface but that's the least of concerns here.

Can someone point me to a realistic and concrete usage scenario for this
feature?

Thanks.

-- 
tejun


Re: [PATCH v3] writeback: fix obtain a reference to a freeing memcg css

2021-04-04 Thread Tejun Heo
On Fri, Apr 02, 2021 at 05:11:45PM +0800, Muchun Song wrote:
> The caller of wb_get_create() should pin the memcg, because
> wb_get_create() relies on this guarantee. The rcu read lock
> only can guarantee that the memcg css returned by css_from_id()
> cannot be released, but the reference of the memcg can be zero.
> 
>   rcu_read_lock()
>   memcg_css = css_from_id()
>   wb_get_create(memcg_css)
>   cgwb_create(memcg_css)
>   // css_get can change the ref counter from 0 back to 1
>   css_get(memcg_css)
>   rcu_read_unlock()
> 
> Fix it by holding a reference to the css before calling
> wb_get_create(). This is not a problem I encountered in the
> real world. Just the result of a code review.
> 
> Fixes: 682aa8e1a6a1 ("writeback: implement unlocked_inode_to_wb transaction and use it for stat updates")
> Signed-off-by: Muchun Song 
> Acked-by: Michal Hocko 

Acked-by: Tejun Heo 
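For reference, the fixed calling pattern described above amounts to
something like the following kernel-side sketch: take a reference on the
memcg css under RCU before calling wb_get_create(), instead of relying on
RCU alone, since the css refcount may already have hit zero. Variable
names are illustrative; see the actual patch for the real code.

```c
struct cgroup_subsys_state *memcg_css;
struct bdi_writeback *wb = NULL;

rcu_read_lock();
memcg_css = css_from_id(memcg_css_id, &memory_cgrp_subsys);
/* css_tryget() fails if the refcount already dropped to zero. */
if (memcg_css && !css_tryget(memcg_css))
	memcg_css = NULL;
rcu_read_unlock();

if (memcg_css) {
	wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
	css_put(memcg_css);
}
```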

Thanks.

-- 
tejun


Re: [PATCH v4 0/3] cgroup: New misc cgroup controller

2021-04-04 Thread Tejun Heo
Applied to cgroup/for-5.13. If there are further issues, let's address them
incrementally.

Thanks.

-- 
tejun


Re: [PATCH V3,RESEND] workqueue/watchdog: Make unbound workqueues aware of touch_softlockup_watchdog()

2021-04-04 Thread Tejun Heo
Applied to wq/for-5.12-fixes w/ minor description update.

Thanks.

-- 
tejun


Re: [PATCH v3 00/14] bfq: introduce bfq.ioprio for cgroup

2021-04-04 Thread Tejun Heo
Hello,

On Thu, Mar 25, 2021 at 02:57:44PM +0800, brookxu wrote:
> INTERFACE:
> 
> The bfq.ioprio interface now is available for cgroup v1 and cgroup
> v2. Users can configure the ioprio for cgroup through this
> interface, as shown below:
> 
> echo "1 2" > blkio.bfq.ioprio
> 
> The above two values respectively represent the values of ioprio
> class and ioprio for cgroup.
> 
> EXPERIMENT:
> 
> The test process is as follows:
> # prepare data disk
> mount /dev/sdb /data1
> 
> # prepare IO scheduler
> echo bfq > /sys/block/sdb/queue/scheduler
> echo 0 > /sys/block/sdb/queue/iosched/low_latency
> echo 1 > /sys/block/sdb/queue/iosched/better_fairness
> 
> It is worth noting here that nr_requests limits the number of
> requests, and it does not perceive priority. If nr_requests is
> too small, it may cause a serious priority inversion problem.
> Therefore, we can increase the size of nr_requests based on
> the actual situation.
> 
> # create cgroup v1 hierarchy
> cd /sys/fs/cgroup/blkio
> mkdir rt be0 be1 be2 idle
> 
> # prepare cgroup
> echo "1 0" > rt/blkio.bfq.ioprio
> echo "2 0" > be0/blkio.bfq.ioprio
> echo "2 4" > be1/blkio.bfq.ioprio
> echo "2 7" > be2/blkio.bfq.ioprio
> echo "3 0" > idle/blkio.bfq.ioprio

Here are some concerns:

* The main benefit of bfq compared to cfq at least was that the behavior
  model was defined in a clearer way. It was possible to describe what the
  control model was in a way which makes semantic sense. The main problem I
  see with this proposal is that it's an interface which grew out of the
  current implementation specifics and I'm having a hard time understanding
  what the end results should be with different configuration combinations.

* While this might work around some scheduling latency issues, I have a
  hard time imagining it being able to address actual QoS issues. e.g. on a
  lot of SSDs, without absolute throttling, device side latencies can spike
  by multiple orders of magnitude and no prioritization on the scheduler
  side is gonna help once such a state is reached. Here, there are no robust
  mechanisms or measurement/control units defined to address that. In fact,
  the above direction of increasing the nr_requests limit will make priority
  inversions on the device and post-elevator side way more likely and
  severe.

So, maybe it helps with specific scenarios on some hardware, but given the
ad-hoc nature, I don't think it justifies all the extra interface additions.
My suggestion would be slimming it down to bare essentials and making the
user interface part as minimal as possible.

Thanks.

-- 
tejun


Re: [PATCH] workqueue: Switch to new kerneldoc syntax for named variable macro argument

2021-03-20 Thread Tejun Heo
On Wed, Mar 03, 2021 at 08:16:38PM +0100, Jonathan Neuschäfer wrote:
> The syntax without dots is available since commit 43756e347f21
> ("scripts/kernel-doc: Add support for named variable macro arguments").
> 
> The same HTML output is produced with and without this patch.
> 
> Signed-off-by: Jonathan Neuschäfer 

Acked-by: Tejun Heo 

Can you re-send with triv...@kernel.org cc'd?

Thanks.

-- 
tejun


Re: [PATCH V2] workqueue: watchdog: update wq_watchdog_touched for unbound lockup checking

2021-03-20 Thread Tejun Heo
On Fri, Mar 19, 2021 at 04:00:36PM +0800, Wang Qing wrote:
> When touch_softlockup_watchdog() is called, only wq_watchdog_touched_cpu is
> updated, while the unbound worker_pool running on its core uses
> wq_watchdog_touched to determine whether it has locked up. This may be mischecked.

Can you please elaborate here, preferably with a concrete scenario where the
new code is better?

Thanks.

-- 
tejun


Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

2021-03-15 Thread Tejun Heo
On Mon, Mar 15, 2021 at 06:30:30PM -0700, Jacob Pan wrote:
> I don't know if this is required. I thought utilities such as cgclassify
> need to be supported.
> " cgclassify - move running task(s) to given cgroups "
> If no such use case, I am fine with dropping the migration support. Just
> enforce limit on allocations.

Yeah, that's what all other controllers do. Please read the in-tree cgroup2
doc.

Thanks.

-- 
tejun


Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

2021-03-15 Thread Tejun Heo
Hello,

On Mon, Mar 15, 2021 at 04:40:12PM -0700, Jacob Pan wrote:
> 2. then we want to move/migrate Process1 to cg_B, so we need to uncharge 10
> from cg_A and charge 10 to cg_B

So, what I don't get is why this migration is necessary. This isn't
supported as a usage pattern and no one, at least in terms of wide-spread
usage, does this. Why is this a requirement for your use case?

Thanks.

-- 
tejun


Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

2021-03-15 Thread Tejun Heo
Hello,

On Mon, Mar 15, 2021 at 03:11:55PM -0700, Jacob Pan wrote:
> > Migration itself doesn't have restrictions but all resources are
> > distributed on the same hierarchy, so the controllers are supposed to
> > follow the same conventions that can be implemented by all controllers.
> > 
> Got it, I guess that is the behavior required by the unified hierarchy.
> Cgroup v1 would be ok? But I am guessing we are not extending on v1?

A new cgroup1-only controller is unlikely to be accepted.

> The IOASIDs are programmed into devices to generate DMA requests tagged
> with them. The IOMMU has a per device IOASID table with each entry has two
> pointers:
>  - the PGD of the guest process.
>  - the PGD of the host process
> 
> The result of this 2 stage/nested translation is that we can share virtual
> address (SVA) between guest process and DMA. The host process needs to
> allocate multiple IOASIDs since one IOASID is needed for each guest process
> who wants SVA.
> 
> The DMA binding among device-IOMMU-process is setup via a series of user
> APIs (e.g. via VFIO).
> 
> If a process calls fork(), the children does not inherit the IOASIDs and
> their bindings. Children who wish to use SVA has to call those APIs to
> establish the binding for themselves.
> 
> Therefore, if a host process allocates 10 IOASIDs then does a
> fork()/clone(), it cannot charge 10 IOASIDs in the new cgroup. i.e. the 10
> IOASIDs stays with the process wherever it goes.
> 
> I feel this fit in the domain model, true?

I still don't get where migration is coming into the picture. Who's
migrating where?

Thanks.

-- 
tejun


Re: [PATCH] vmscan: retry without cache trim mode if nothing scanned

2021-03-14 Thread Tejun Heo
Hello,

On Sun, Mar 14, 2021 at 01:58:33PM -0700, Shakeel Butt wrote:
> > If my understanding were correct, what Tejun suggested is to add a fast
> > read interface to rstat to be used in hot path.  And its accuracy is
> > similar as that of traditional per-CPU counter.  But if we can regularly
> > update the lruvec rstat with something like vmstat_update(), that should
> > be OK for the issue described in this patch.
> >
> 
> This is also my understanding. Tejun, please correct us if we misunderstood 
> you.

Yeah, that was what I had on mind. Instead of always flushing on read, have
a variant where flushing is explicit and trigger that periodically (or
whichever way appropriate).
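A rough kernel-side sketch of that scheme, with cgroup_rstat_flush() driven
from periodic delayed work so hot-path readers see recently aggregated
counters without flushing on every read. This is only an illustration of
the idea; target_cgrp, the function names around it, and the one-second
cadence are placeholders.

```c
static struct cgroup *target_cgrp;	/* cgroup whose stats we keep fresh */

static void rstat_flush_fn(struct work_struct *work);
static DECLARE_DELAYED_WORK(rstat_flush_work, rstat_flush_fn);

static void rstat_flush_fn(struct work_struct *work)
{
	/* Explicit flush; readers then use the cached aggregates. */
	cgroup_rstat_flush(target_cgrp);
	schedule_delayed_work(&rstat_flush_work, HZ);
}
```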

Thanks.

-- 
tejun


Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

2021-03-13 Thread Tejun Heo
Hello,

On Sat, Mar 13, 2021 at 08:57:01AM -0800, Jacob Pan wrote:
> Isn't PIDs controller doing the charge/uncharge? I was under the impression
> that each resource can be independently charged/uncharged, why it affects
> other resources? Sorry for the basic question.

Yeah, PID is an exception as we needed the initial migration to seed new
cgroups and it gets really confusing with other ways to observe the
processes - e.g. if you follow the original way of creating a cgroup,
forking and then moving the seed process into the target cgroup, if we don't
migrate the pid charge together, the numbers wouldn't agree and the seeder
cgroup may end up running out of pids if there are any restrictions.

> I also didn't quite get the limitation on cgroup v2 migration, this is much
> simpler than memcg. Could you give me some pointers?

Migration itself doesn't have restrictions but all resources are distributed
on the same hierarchy, so the controllers are supposed to follow the same
conventions that can be implemented by all controllers.

> BTW, since the IOASIDs are used to tag DMA and bound with guest process(mm)
> for shared virtual addressing. fork() cannot be supported, so I guess clone
> is not a solution here.

Can you please elaborate what wouldn't work? The new spawning into a new
cgroup w/ clone doesn't really change the usage model. It's just a neater
way to seed a new cgroup. If you're saying that the overall usage model
doesn't fit the needs of IOASIDs, it likely shouldn't be a cgroup
controller.

Thanks.

-- 
tejun


Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

2021-03-13 Thread Tejun Heo
On Fri, Mar 12, 2021 at 02:59:04PM -0800, Jacob Pan wrote:
> Our primary goal is to limit the amount of IOASIDs that VMs can allocate.
> If a VM is migrated to a different cgroup, I think we need to
> charge/uncharge the destination/source cgroup in order to enforce the limit. I
> am not an expert here, any feedback would be appreciated.

That simply isn't a supported usage model. None of other resources will get
tracked if you do that.

Thanks.

-- 
tejun


Re: [Patch v3 0/2] cgroup: New misc cgroup controller

2021-03-11 Thread Tejun Heo
Hello,

On Thu, Mar 11, 2021 at 07:58:19PM +0100, Michal Koutný wrote:
> > Michal, as you've been reviewing the series, can you please take
> > another look and ack them if you don't find anything objectionable?
> Honestly, I'm still sitting on the fence whether this needs a new
> controller and whether the miscontroller (:-p) is a good approach in the
> long term [1].

Yeah, it's a bit of a cop-out. My take is that the underlying hardware feature
isn't mature enough to have reasonable abstraction built on top of them.
Given time, maybe future iterations will get there or maybe it's a passing
fad and people will mostly forget about these.

In the meantime, keeping them out of cgroup is one direction, a relatively
high friction one but still viable. Or we can provide something of a halfway
house so that people who have immediate needs can still leverage the
existing infrastructure while controlling the amount of time, energy and
future lock-ins they take. So, that's misc controller.

I'm somewhat ambivalent but we've had multiple of these things popping up in
the past several years and containment seems to be a reasonable approach at
this point.

> [1] Currently, only one thing comes to my mind -- the delegation via
> cgroup.subtree_control. The miscontroller may add possibly further
> resources whose delegation granularity is bunched up under one entry.

Controller enabling and delegation in themselves aren't supposed to have
resource or security implications, so I don't think it's a practical
problem.

Thanks.

-- 
tejun


Re: [PATCH] cgroup-v2: Add taskstats counters in cgroup.stat

2021-03-11 Thread Tejun Heo
On Thu, Mar 11, 2021 at 02:17:52PM +0800, Chengming Zhou wrote:
> We have the netlink CGROUPSTATS_CMD_GET interface to get taskstats
> of the cgroup on v1, but don't have an equivalent interface on v2,
> making it difficult to calculate the per-cgroup cpu load in cadvisor
> or implement the cgroup proc interface in lxcfs, like /proc/loadavg.

So, this is what the PSI metrics are for and we've been using it for that
for quite a while now. I'd much prefer not adding something duplicate (and
incomplete).

Thanks.

-- 
tejun


Re: [Patch v3 0/2] cgroup: New misc cgroup controller

2021-03-07 Thread Tejun Heo
Hello,

On Thu, Mar 04, 2021 at 03:19:44PM -0800, Vipin Sharma wrote:
> This patch series is creating a new misc cgroup controller for limiting
> and tracking of resources which are not abstract like other cgroup
> controllers.
> 
> This controller was initially proposed as encryption_id but after
> the feedbacks, it is now changed to misc cgroup.
> https://lore.kernel.org/lkml/20210108012846.4134815-2-vipi...@google.com/

Vipin, thank you very much for your persistence and patience. The patchset
looks good to me. Michal, as you've been reviewing the series, can you
please take another look and ack them if you don't find anything
objectionable?

-- 
tejun


Re: [PATCH] blk-cgroup: Fix the recursive blkg rwstat

2021-03-07 Thread Tejun Heo
On Fri, Mar 05, 2021 at 11:32:05AM -0700, Jens Axboe wrote:
> On 3/5/21 1:13 AM, Xunlei Pang wrote:
> > The current blkio.throttle.io_service_bytes_recursive doesn't
> > work correctly.
> > 
> > As an example, for the following blkcg hierarchy:
> >  (Made 1GB READ in test1, 512MB READ in test2)
> >  test
> > /\
> >  test1   test2
> > 
> > $ head -n 1 test/test1/blkio.throttle.io_service_bytes_recursive
> > 8:0 Read 1073684480
> > $ head -n 1 test/test2/blkio.throttle.io_service_bytes_recursive
> > 8:0 Read 537448448
> > $ head -n 1 test/blkio.throttle.io_service_bytes_recursive
> > 8:0 Read 537448448
> > 
> > Clearly, above data of "test" reflects "test2" not "test1"+"test2".
> > 
> > Do the correct summary in blkg_rwstat_recursive_sum().
> 
> LGTM, Tejun?

Gees, that's horrible. Thanks for fixing this.

Acked-by: Tejun Heo 

-- 
tejun


Re: [RFC v2 1/2] cgroup: sev: Add misc cgroup controller

2021-03-04 Thread Tejun Heo
Hello,

On Wed, Mar 03, 2021 at 10:12:32PM -0800, Vipin Sharma wrote:
> Right now there is no API for the caller to know the total usage unless they
> keep their own tally, so I was thinking it would be useful to add one more API:
> 
> unsigned long misc_cg_res_total_usage(enum misc_res_type type)
> 
> It will return root_cg usage for "type" resource.
> Will it be fine?

Yeah, that sounds fine.

Thanks.

-- 
tejun


Re: [RFC v2 2/2] cgroup: sev: Miscellaneous cgroup documentation.

2021-03-04 Thread Tejun Heo
Hello,

On Wed, Mar 03, 2021 at 10:22:03PM -0800, Vipin Sharma wrote:
> > I am trying to see if IOASIDs cgroup can also fit in this misc controller
> > as yet another resource type.
> > https://lore.kernel.org/linux-iommu/20210303131726.7a8cb169@jacob-builder/T/#u
> > However, unlike sev IOASIDs need to be migrated if the process is moved to
> > another cgroup. i.e. charge the destination and uncharge the source.
> > 
> > Do you think this behavior can be achieved by differentiating resource
> > types? i.e. add attach callbacks for certain types. Having a single misc
> > interface seems cleaner than creating another controller.
> 
> I think it makes sense to add support for migration for the resources
> which need it. Resources like SEV and SEV-ES will not participate in
> migration and won't stop can_attach() from succeeding; other resources
> which need migration will allow or deny it based on their limits and
> capacity in the destination.

Please note that cgroup2 by and large doesn't really like or support charge
migration or even migrations themselves. We tried that w/ memcg on cgroup1
and it turned out horrible. The expected usage model as described in the doc
is using migration to seed a cgroup (or even better, use the new clone call
to start in the target cgroup) and then stay there until exit. All existing
controllers assume this usage model and I'm likely to nack deviations unless
there are some super strong justifications.

Thanks.

-- 
tejun


Re: [RFC PATCH 15/18] cgroup: Introduce ioasids controller

2021-03-03 Thread Tejun Heo
On Sat, Feb 27, 2021 at 02:01:23PM -0800, Jacob Pan wrote:
> IOASIDs are used to associate DMA requests with virtual address spaces.
> They are a system-wide limited resource made available to userspace
> applications, be it VMs or user-space device drivers.
> 
> This RFC patch introduces a cgroup controller to address the following
> problems:
> 1. Some user applications exhaust all the available IOASIDs, thus
> depriving others on the same host.
> 2. System admins need to provision VMs based on their needs for IOASIDs,
> e.g. the number of VMs with assigned devices that perform DMA requests
> with PASID.

Please take a look at the proposed misc controller:

 http://lkml.kernel.org/r/20210302081705.1990283-2-vipi...@google.com

Would that fit your bill?

Thanks.

-- 
tejun


Re: [RFC v2 1/2] cgroup: sev: Add misc cgroup controller

2021-03-03 Thread Tejun Heo
Hello,

On Tue, Mar 02, 2021 at 12:17:04AM -0800, Vipin Sharma wrote:
> +/**
> + * struct misc_res: Per cgroup per misc type resource
> + * @max: Maximum count of the resource.
> + * @usage: Current usage of the resource.
> + */
> +struct misc_res {
> + unsigned int max;
> + atomic_t usage;
> +};

Can we do 64bits so that something which counts memory can use this too?

> +/*
> + * Miscellaneous resources capacity for the entire machine. 0 capacity means
> + * resource is not initialized or not present in the host.
> + *
> + * root_cg.max and capacity are independent of each other. root_cg.max can be
> + * more than the actual capacity. We are using Limits resource distribution
> + * model of cgroup for miscellaneous controller. However, root_cg.current for a
> + * resource will never exceeds the resource capacity.
^ typo (s/exceeds/exceed/)

> +int misc_cg_set_capacity(enum misc_res_type type, unsigned int capacity)
> +{
> + if (!valid_type(type))
> + return -EINVAL;
> +
> + for (;;) {
> + int usage;
> + unsigned int old;
> +
> + /*
> +  * Update the capacity while making sure that it's not below
> +  * the concurrently-changing usage value.
> +  *
> +  * The xchg implies two full memory barriers before and after,
> +  * so the read-swap-read is ordered and ensures coherency with
> +  * misc_cg_try_charge(): that function modifies the usage
> +  * before checking the capacity, so if it sees the old
> +  * capacity, we see the modified usage and retry.
> +  */
> + usage = atomic_read(&root_cg.res[type].usage);
> +
> + if (usage > capacity)
> + return -EBUSY;

I'd rather go with allowing bringing down capacity below usage so that the
users can set it to a lower value to drain existing usages while denying new
ones. It's not like it's difficult to check the current total usage from the
caller side, so I'm not sure it's very useful to shift the condition check
here.
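The suggested semantics can be sketched in a small userspace model — lowering capacity below the current usage succeeds and simply denies new charges until existing usage drains. Names and the single-threaded model are illustrative, not the actual misc controller code:

```c
#include <assert.h>

/* Illustrative model: capacity may drop below usage; new charges are
 * denied until uncharges bring usage back under capacity. */
struct misc_res_model {
	long usage;
	long capacity;
};

static void set_capacity(struct misc_res_model *r, long capacity)
{
	/* No usage check: lowering capacity below usage is allowed and
	 * acts as a drain - existing usage stays, new charges fail. */
	r->capacity = capacity;
}

static int try_charge(struct misc_res_model *r, long amount)
{
	if (r->usage + amount > r->capacity)
		return -1;	/* -EBUSY in kernel terms */
	r->usage += amount;
	return 0;
}

static void uncharge(struct misc_res_model *r, long amount)
{
	r->usage -= amount;
}
```

The caller-side total-usage check mentioned above then stays outside this helper entirely.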

> +int misc_cg_try_charge(enum misc_res_type type, struct misc_cg *cg,
> +unsigned int amount)
> +{
...
> + for (i = cg; i; i = parent_misc(i)) {
> + res = &i->res[type];
> +
> + /*
> +  * The atomic_long_add_return() implies a full memory barrier
> +  * between incrementing the count and reading the capacity.
> +  * When racing with misc_cg_set_capacity(), we either see the
> +  * new capacity or the setter sees the counter has changed and
> +  * retries.
> +  */
> + new_usage = atomic_add_return(amount, &res->usage);
> + if (new_usage > res->max ||
> + new_usage > misc_res_capacity[type]) {
> + pr_info("cgroup: charge rejected by misc controller for 
> %s resource in ",
> + misc_res_name[type]);
> + pr_cont_cgroup_path(i->css.cgroup);
> + pr_cont("\n");

Should have commented on this in the priv thread but don't print something
on every rejection. This often becomes a nuisance and can make an easy DoS
vector at worst. If you wanna do it, print it once per cgroup or sth like
that.
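One way to realize "print it once per cgroup" is a per-cgroup flag tested atomically before logging, along the lines of this userspace sketch (the field name and helper are made up for illustration):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative per-cgroup one-shot warning latch. */
struct cg_model {
	atomic_flag charge_warned;	/* set once the rejection was logged */
};

/* Returns true only on the first call for a given cgroup, i.e. only the
 * first rejection gets printed; later ones stay silent, so repeated
 * failed charges can't be used to flood the log. */
static bool should_warn_once(struct cg_model *cg)
{
	return !atomic_flag_test_and_set(&cg->charge_warned);
}
```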

Otherwise, looks good to me.

Thanks.

-- 
tejun


Re: [PATCH v2] workqueue: Move the position of debug_work_activate() in __queue_work()

2021-03-03 Thread Tejun Heo
On Thu, Feb 18, 2021 at 11:16:49AM +0800, qiang.zh...@windriver.com wrote:
> From: Zqiang 
> 
> The debug_work_activate() is called on the premise that
> the work can be inserted, because if the wq is in WQ_DRAINING
> state, inserting the work may fail.
> 
> Fixes: e41e704bc4f4 ("workqueue: improve destroy_workqueue() debuggability")
> Signed-off-by: Zqiang 
> Reviewed-by: Lai Jiangshan 

Applied to wq/for-5.12-fixes.

Thanks.

-- 
tejun


Re: [PATCH V6] x86/mm: Tracking linear mapping split events

2021-03-01 Thread Tejun Heo
Hello,

On Thu, Feb 18, 2021 at 03:57:44PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic hugepage [direct mapped] split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
...
> Signed-off-by: Saravanan D 
> Acked-by: Tejun Heo 
> Acked-by: Johannes Weiner 
> Acked-by: Dave Hansen 

Andrew, do you mind picking this one up? It has enough acks and can go
through either mm or x86 tree.

Thank you.

-- 
tejun


[GIT PULL] workqueue changes for v5.12-rc1

2021-02-21 Thread Tejun Heo
Hello,

Tracepoint and comment updates only.

Thanks.

The following changes since commit 1e2a199f6ccdc15cf111d68d212e2fd4ce65682e:

  Merge tag 'spi-fix-v5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi (2021-01-18 11:23:05 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-5.12

for you to fetch changes up to e9ad2eb3d9ae05471c9b9fafcc0a31d8f565ca5b:

  workqueue: Use %s instead of function name (2021-01-27 09:42:48 -0500)


Menglong Dong (1):
  workqueue: fix annotation for WQ_SYSFS

Stephen Zhang (1):
  workqueue: Use %s instead of function name

Zqiang (1):
  workqueue: tracing the name of the workqueue instead of it's address

 include/linux/workqueue.h| 2 +-
 include/trace/events/workqueue.h | 6 +++---
 kernel/workqueue.c   | 4 ++--
 3 files changed, 6 insertions(+), 6 deletions(-)

-- 
tejun


[GIT PULL] cgroup changes for v5.12-rc1

2021-02-21 Thread Tejun Heo
Hello, Linus.

Nothing interesting. Just two minor patches.

Thanks.

The following changes since commit b5e56576e16236de3c035ca86cd3ef16591722fb:

  MAINTAINERS: Update my email address (2021-01-15 15:33:06 -0500)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-5.12

for you to fetch changes up to 415de5fdeb5ac28c2960df85b749700560dcd63c:

  cpuset: fix typos in comments (2021-01-15 15:36:41 -0500)


Aubrey Li (1):
  cpuset: fix typos in comments

Michal Koutný (1):
  cgroup: cgroup.{procs,threads} factor out common parts

 kernel/cgroup/cgroup.c | 55 +-
 kernel/cgroup/cpuset.c |  6 +++---
 2 files changed, 17 insertions(+), 44 deletions(-)

-- 
tejun


[GIT PULL] cgroup fixes for v5.11-rc7

2021-02-13 Thread Tejun Heo
Hello, Linus.

Two cgroup fixes: 1. Fix the NULL deref when trying to poll PSI in the root
cgroup and 2. fix confusing controller parsing corner case when mounting
cgroup v1 hierarchies. And doc / maintainer file updates.

Thanks.

The following changes since commit f4e087c666f54559cb4e530af1fbfc9967e14a15:

  Merge tag 'acpi-5.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm (2021-01-15 10:55:33 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-5.11-fixes

for you to fetch changes up to 74bdd45c85d02f695a1cd1c3dccf8b3960a86d8f:

  cgroup: update PSI file description in docs (2021-01-19 12:03:07 -0500)


Chen Zhou (1):
  cgroup-v1: add disabled controller check in cgroup1_parse_param()

Odin Ugedal (2):
  cgroup: fix psi monitor for root cgroup
  cgroup: update PSI file description in docs

Zefan Li (2):
  MAINTAINERS: Remove stale URLs for cpuset
  MAINTAINERS: Update my email address

 Documentation/admin-guide/cgroup-v2.rst | 6 +++---
 MAINTAINERS | 6 ++
 kernel/cgroup/cgroup-v1.c   | 3 +++
 kernel/cgroup/cgroup.c  | 4 +++-
 4 files changed, 11 insertions(+), 8 deletions(-)

-- 
tejun


Re: [PATCH v3 0/4] sched/fair: Burstable CFS bandwidth controller

2021-02-09 Thread Tejun Heo
Hello,

On Tue, Feb 09, 2021 at 02:17:19PM +0100, Odin Ugedal wrote:
> I am not that familiar with how cross-subsystem patches like these are
> handled, but I am still adding Tejun Heo (cgroup maintainer) as a CC. Should I
> maybe cc cgroup@ as well?

Yeah, that'd be great. Given that it's mostly a straightforward extension of
an existing interface, things look fine from the cgroup side; however, please
do add the cgroup2 interface and documentation. One thing which has been
bothersome about the bandwidth interface is that we're exposing
implementation details (window size and now burst size) instead of
usage-centric requirements but that boat has already sailed, so...

Thanks.

-- 
tejun


Re: [PATCH 7/8] mm: memcontrol: consolidate lruvec stat flushing

2021-02-08 Thread Tejun Heo
Hello,

On Mon, Feb 08, 2021 at 03:54:14PM -0500, Johannes Weiner wrote:
> We probably do need a better solution for the lruvecs as well, but in
> this case it just started holding up fixing the memory.stat issue for
> no reason and so I tabled it for another patch series.

rstat doesn't currently have a flush throttling mechanism cuz it doesn't
expect readers to be super hot but adding one should be pretty easy - e.g.
it can just keep track of the number of updates on this cpu since the last
flush and then flush iff it's above a certain threshold. Shouldn't be
too difficult to match or exceed the performance and error characteristics
of the existing code.
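The throttling idea above can be modeled in userspace — count updates since the last flush and only do a real flush once the count crosses a threshold; all names and the threshold value here are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

#define FLUSH_THRESHOLD 16	/* illustrative; tuned per workload in practice */

struct rstat_cpu_model {
	int updates_since_flush;
	int flushes;		/* how many real flushes happened */
};

/* Called on the write side: just bump the per-cpu update count. */
static void stat_updated(struct rstat_cpu_model *c)
{
	c->updates_since_flush++;
}

/* Called on the read side: flush only if enough updates accumulated,
 * otherwise let the reader use slightly stale (bounded-error) data. */
static bool maybe_flush(struct rstat_cpu_model *c)
{
	if (c->updates_since_flush < FLUSH_THRESHOLD)
		return false;
	c->updates_since_flush = 0;	/* the actual flush would go here */
	c->flushes++;
	return true;
}
```

The error is bounded by the threshold times the size of a single update, which is what makes matching the existing code's error characteristics plausible.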

Thanks.

-- 
tejun


Re: [PATCH 5/8] cgroup: rstat: punt root-level optimization to individual controllers

2021-02-08 Thread Tejun Heo
Hello,

On Mon, Feb 08, 2021 at 03:29:21PM -0500, Johannes Weiner wrote:
> > > @@ -789,6 +793,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> > >   u64_stats_update_end(&blkg->iostat.sync);
> > >  
> > >   /* propagate global delta to parent */
> > > + /* XXX: could skip this if parent is root */
> > >   if (parent) {
> > >   u64_stats_update_begin(>iostat.sync);
> > >   blkg_iostat_set(&delta, &blkg->iostat.cur);
> > 
> > Might as well update this similar to cgroup_base_stat_flush()?
> 
> I meant to revisit that, but I'm never 100% confident when it comes to
> the interaction and lifetime of css, blkcg and blkg_gq.

Yeah, it does get hairy.

> IIUC, the blkg_gq->parent linkage always matches the css parent
> linkage; it just exists as an optimization for ancestor walks, which
> would otherwise have to do radix lookups when going through the css.

But yes, at least this part is straight-forward.

> So with the cgroup_parent() check at the beginning of the function
> making sure we're looking at a non-root group, blkg_gq->parent should
> also never be NULL and I can do if (parent->parent) directly, right?

I think so.

> > >  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
> > >  {
> > > - struct cgroup *parent = cgroup_parent(cgrp);
> > >   struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
> > > + struct cgroup *parent = cgroup_parent(cgrp);
> > 
> > Is this chunk intentional?
> 
> Yeah, it puts the local variable declarations into reverse christmas
> tree ordering to make them a bit easier to read. It's a while-at-it
> cleanup, mostly a force of habit. I can drop it if it bothers you.

I don't mind either way. Was just wondering whether it was accidental.

Thanks.

-- 
tejun


Re: [PATCH 0/8] mm: memcontrol: switch to rstat v2

2021-02-05 Thread Tejun Heo
On Fri, Feb 05, 2021 at 01:27:58PM -0500, Johannes Weiner wrote:
> This is version 2 of the memcg rstat patches. Updates since v1:
> 
> - added cgroup selftest output (see test section below) (thanks Roman)
> - updated cgroup selftest to match new kernel implementation
> - added Fixes: tag to 'mm: memcontrol: fix cpuhotplug statistics flushing' (Shakeel)
> - collected review & ack tags
> - added rstat overview to 'mm: memcontrol: switch to rstat' changelog (Michal)
> - simplified memcg_flush_lruvec_page_state() and removed cpu==-1 case (Michal)

The whole series looks good to me. Please feel free to route the rstat
patches with the rest of the series.

Thanks.

-- 
tejun


Re: [PATCH 5/8] cgroup: rstat: punt root-level optimization to individual controllers

2021-02-05 Thread Tejun Heo
Hello,

On Fri, Feb 05, 2021 at 01:28:03PM -0500, Johannes Weiner wrote:
> Current users of the rstat code can source root-level statistics from
> the native counters of their respective subsystem, allowing them to
> forego aggregation at the root level. This optimization is currently
> implemented inside the generic rstat code, which doesn't track the
> root cgroup and doesn't invoke the subsystem flush callbacks on it.
> 
> However, the memory controller cannot do this optimization, because
> cgroup1 breaks out memory specifically for the local level, including
> at the root level. In preparation for the memory controller switching
> to rstat, move the optimization from rstat core to the controllers.
> 
> Afterwards, rstat will always track the root cgroup for changes and
> invoke the subsystem callbacks on it; and it's up to the subsystem to
> special-case and skip aggregation of the root cgroup if it can source
> this information through other, cheaper means.
> 
> The extra cost of tracking the root cgroup is negligible: on stat
> changes, we actually remove a branch that checks for the root. The
> queueing for a flush touches only per-cpu data, and only the first
> stat change since a flush requires a (per-cpu) lock.
> 
> Signed-off-by: Johannes Weiner 

Generally looks good to me.

Acked-by: Tejun Heo 

A couple nits below.

> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 02ce2058c14b..76725e1cad7f 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -766,6 +766,10 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>   struct blkcg *blkcg = css_to_blkcg(css);
>   struct blkcg_gq *blkg;
>  
> + /* Root-level stats are sourced from system-wide IO stats */
> + if (!cgroup_parent(css->cgroup))
> + return;
> +
>   rcu_read_lock();
>  
> >   hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> @@ -789,6 +793,7 @@ static void blkcg_rstat_flush(struct cgroup_subsys_state *css, int cpu)
> >   u64_stats_update_end(&blkg->iostat.sync);
>  
>   /* propagate global delta to parent */
> + /* XXX: could skip this if parent is root */
>   if (parent) {
>   u64_stats_update_begin(>iostat.sync);
> >   blkg_iostat_set(&delta, &blkg->iostat.cur);

Might as well update this similar to cgroup_base_stat_flush()?

> @@ -58,8 +53,16 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
>   if (rstatc->updated_next)
>   break;
>  
> + if (!parent) {

Maybe useful to note that the node is being marked busy but not added to the
non-existent parent.

> + rstatc->updated_next = cgrp;
> + break;
> + }
> +
> + prstatc = cgroup_rstat_cpu(parent, cpu);
>   rstatc->updated_next = prstatc->updated_children;
>   prstatc->updated_children = cgrp;
> +
> + cgrp = parent;
>   }
>  
>   raw_spin_unlock_irqrestore(cpu_lock, flags);
...
>  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
>  {
> - struct cgroup *parent = cgroup_parent(cgrp);
>   struct cgroup_rstat_cpu *rstatc = cgroup_rstat_cpu(cgrp, cpu);
> + struct cgroup *parent = cgroup_parent(cgrp);

Is this chunk intentional?

>   struct cgroup_base_stat cur, delta;
>   unsigned seq;
>  
> + /* Root-level stats are sourced from system-wide CPU stats */
> + if (!parent)
> + return;
> +
>   /* fetch the current per-cpu values */
>   do {
> >   seq = __u64_stats_fetch_begin(&rstatc->bsync);
> @@ -326,8 +336,8 @@ static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu)
> >   cgroup_base_stat_add(&cgrp->bstat, &delta);
> >   cgroup_base_stat_add(&rstatc->last_bstat, &delta);
>  
> - /* propagate global delta to parent */
> - if (parent) {
> + /* propagate global delta to parent (unless that's root) */
> + if (cgroup_parent(parent)) {

Yeah, this makes sense. Can you add a short while-at-it note in the patch
description?

Thanks.

-- 
tejun


Re: [PATCH 4/8] cgroup: rstat: support cgroup1

2021-02-05 Thread Tejun Heo
On Fri, Feb 05, 2021 at 01:28:02PM -0500, Johannes Weiner wrote:
> Rstat currently only supports the default hierarchy in cgroup2. In
> order to replace memcg's private stats infrastructure - used in both
> cgroup1 and cgroup2 - with rstat, the latter needs to support cgroup1.
> 
> The initialization and destruction callbacks for regular cgroups are
> already in place. Remove the cgroup_on_dfl() guards to handle cgroup1.
> 
> The initialization of the root cgroup is currently hardcoded to only
> handle cgrp_dfl_root.cgrp. Move those callbacks to cgroup_setup_root()
> and cgroup_destroy_root() to handle the default root as well as the
> various cgroup1 roots we may set up during mounting.
> 
> The linking of css to cgroups happens in code shared between cgroup1
> and cgroup2 as well. Simply remove the cgroup_on_dfl() guard.
> 
> Linkage of the root css to the root cgroup is a bit trickier: per
> default, the root css of a subsystem controller belongs to the default
> hierarchy (i.e. the cgroup2 root). When a controller is mounted in its
> cgroup1 version, the root css is stolen and moved to the cgroup1 root;
> on unmount, the css moves back to the default hierarchy. Annotate
> rebind_subsystems() to move the root css linkage along between roots.
> 
> Signed-off-by: Johannes Weiner 
> Reviewed-by: Roman Gushchin 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH V6] x86/mm: Tracking linear mapping split events

2021-01-28 Thread Tejun Heo
On Thu, Jan 28, 2021 at 03:34:30PM -0800, Saravanan D wrote:
> To help with debugging the sluggishness caused by TLB miss/reload,
> we introduce monotonic hugepage [direct mapped] split event counts since
> system state: SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
...
> Signed-off-by: Saravanan D 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH] kernfs: Remove redundant code

2021-01-28 Thread Tejun Heo
On Thu, Jan 28, 2021 at 04:25:32PM +0800, Abaci Team wrote:
> Fix the following coccicheck warnings:
> 
> ./fs/kernfs/file.c:647:1-3: WARNING: possible condition with no effect
> (if == else).
> 
> Reported-by: Abaci Robot 
> Suggested-by: Jiapeng Zhong 
> Signed-off-by: Abaci Team 
> ---
>  fs/kernfs/file.c | 6 +-
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index c757193..9a74c9d 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -644,11 +644,7 @@ static int kernfs_fop_open(struct inode *inode, struct file *file)
>* Both paths of the branch look the same.  They're supposed to
>* look that way and give @of->mutex different static lockdep keys.

Please read the comment right above.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: Use cond_resched() when destroy blkgs

2021-01-27 Thread Tejun Heo
On Thu, Jan 28, 2021 at 11:22:00AM +0800, Baolin Wang wrote:
> On !PREEMPT kernel, we can get below softlockup when doing stress
> testing with creating and destroying block cgroup repeatedly. The
> reason is it may take a long time to acquire the queue's lock in
> the loop of blkcg_destroy_blkgs(), or the system can accumulate a
> huge number of blkgs in pathological cases. We can add a need_resched()
> check on each loop and release locks and do cond_resched() if true
> to avoid this issue, since the blkcg_destroy_blkgs() is not called
> from atomic contexts.
> 
> [ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
> [ 4757.010698] Call trace:
> [ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
> [ 4757.010701]  cgwb_release_workfn+0x104/0x158
> [ 4757.010702]  process_one_work+0x1bc/0x3f0
> [ 4757.010704]  worker_thread+0x164/0x468
> [ 4757.010705]  kthread+0x108/0x138
> 
> Suggested-by: Tejun Heo 
> Signed-off-by: Baolin Wang 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH V2] x86/mm: Tracking linear mapping split events

2021-01-27 Thread Tejun Heo
Hello,

On Wed, Jan 27, 2021 at 01:32:03PM -0800, Dave Hansen wrote:
> >>  arch/x86/mm/pat/set_memory.c  | 117 ++
> >>  include/linux/vm_event_item.h |   8 +++
> >>  mm/vmstat.c   |   8 +++
> >>  3 files changed, 133 insertions(+)
> > 
> > So, now the majority of the added code is to add debugfs knobs which don't
> > provide anything that userland can't already do by simply reading the
> > monotonic counters.
> > 
> > Dave, are you still set on the resettable counters?
> 
> Not *really*.  But, you either need them to be resettable, or you need
> to expect your users to take snapshots and compare changes over time.
> Considering how much more code it is, though, I'm not super attached to it.

Saravanan, can you please drop the debugfs portion and repost?

Thanks.

-- 
tejun


Re: [PATCH V2] x86/mm: Tracking linear mapping split events

2021-01-27 Thread Tejun Heo
Hello,

On Wed, Jan 27, 2021 at 09:51:24AM -0800, Saravanan D wrote:
> Numerous hugepage splits in the linear mapping would give
> admins the signal to narrow down the sluggishness caused by TLB
> miss/reload.
> 
> To help with debugging, we introduce monotonic lifetime hugepage
> split event counts since SYSTEM_RUNNING to be displayed as part of
> /proc/vmstat in x86 servers
> 
> The lifetime split event information will be displayed at the bottom of
> /proc/vmstat
> 
> swap_ra 0
> swap_ra_hit 0
> direct_map_2M_splits 139
> direct_map_4M_splits 0
> direct_map_1G_splits 7
> nr_unstable 0
> 

This looks great to me.

> 
> Ancillary debugfs split event counts exported to userspace via read-write
> endpoints : /sys/kernel/debug/x86/direct_map_[2M|4M|1G]_split
> 
> dmesg log when user resets the debugfs split event count for
> debugging
> 
> [  232.470531] debugfs 2M Pages split event count(128) reset to 0
> 

I'm not convinced this part is necessary or even beneficial.

> One of the many lasting (as we don't coalesce back) sources for huge page
> splits is tracing as the granular page attribute/permission changes would
> force the kernel to split code segments mapped to huge pages to smaller
> ones thereby increasing the probability of TLB miss/reload even after
> tracing has been stopped.
> 
> Signed-off-by: Saravanan D 
> ---
>  arch/x86/mm/pat/set_memory.c  | 117 ++
>  include/linux/vm_event_item.h |   8 +++
>  mm/vmstat.c   |   8 +++
>  3 files changed, 133 insertions(+)

So, now the majority of the added code is to add debugfs knobs which don't
provide anything that userland can't already do by simply reading the
monotonic counters.

Dave, are you still set on the resettable counters?

Thanks.

-- 
tejun


Re: [PATCH] workqueue: Use %s instead of function name

2021-01-27 Thread Tejun Heo
On Sat, Jan 23, 2021 at 04:04:00PM +0800, Stephen Zhang wrote:
> It is better to replace the function name with %s, in case the function
> name changes.
> 
> Signed-off-by: Stephen Zhang 

Applied to wq/for-5.12.

Thanks.

-- 
tejun


Re: [PATCH] blk-cgroup: Use cond_resched() when destroy blkgs

2021-01-27 Thread Tejun Heo
Hello, Baolin.

On Tue, Jan 26, 2021 at 09:33:25PM +0800, Baolin Wang wrote:
> On !PREEMPT kernel, we can get below softlockup when doing stress
> testing with creating and destroying block cgroup repeatedly. The
> reason is it may take a long time to acquire the queue's lock in
> the loop of blkcg_destroy_blkgs(), thus we can use cond_resched()
> instead of cpu_relax() to avoid this issue, since the
> blkcg_destroy_blkgs() is not called from atomic contexts.
> 
> [ 4757.010308] watchdog: BUG: soft lockup - CPU#11 stuck for 94s!
> [ 4757.010698] Call trace:
> [ 4757.010700]  blkcg_destroy_blkgs+0x68/0x150
> [ 4757.010701]  cgwb_release_workfn+0x104/0x158
> [ 4757.010702]  process_one_work+0x1bc/0x3f0
> [ 4757.010704]  worker_thread+0x164/0x468
> [ 4757.010705]  kthread+0x108/0x138
> 
> Signed-off-by: Baolin Wang 

* Can you please add might_sleep() at the top of the function?

* Given that the system can accumulate a huge number of blkgs in
  pathological cases, I wonder whether a better way to go about it is
  explicitly testing need_resched() on each loop and release locks and do
  cond_resched() if true?
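The suggested loop shape can be modeled in userspace with a stand-in lock and a simulated need_resched() — every name here is illustrative, not the actual blkcg code:

```c
#include <assert.h>
#include <stdbool.h>

static int lock_depth;			/* stand-in for the queue lock */
static void lock(void)   { lock_depth++; }
static void unlock(void) { lock_depth--; }

/* Simulate need_resched() firing every 64th check. */
static bool fake_need_resched(void)
{
	static int calls;
	return ++calls % 64 == 0;
}

/* Destroy nr items, dropping the lock and "rescheduling" whenever
 * need_resched() is set, so a long run can't soft-lock the CPU. */
static int destroy_all(int nr)
{
	int reschedules = 0;
	int i;

	lock();
	for (i = 0; i < nr; i++) {
		/* destroy one blkg here */
		if (fake_need_resched()) {
			unlock();
			reschedules++;	/* cond_resched() would run here */
			lock();
		}
	}
	unlock();
	return reschedules;
}
```

The key property is that the lock is released before each voluntary reschedule and re-taken afterwards, bounding the lock hold time even with a pathological number of items.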

Thanks.

-- 
tejun


Re: [Patch v4 1/2] cgroup: svm: Add Encryption ID controller

2021-01-27 Thread Tejun Heo
Hello,

On Tue, Jan 26, 2021 at 05:11:59PM -0800, Vipin Sharma wrote:
> Sounds good, we can have a single top level stat file
> 
> misc.stat
>   Shows how many are supported on the host:
>   $ cat misc.stat
>   sev 500
>   sev_es 10
> 
> If total value of some resource is 0 then it will be considered inactive and
> won't show in misc.{stat, current, max}
> 
> We discussed earlier, instead of having "stat" file we should show
> "current" and "capacity" files in the root but I think we can just have stat
> at top showing total resources to keep it consistent with other cgroup
> files.

Let's do misc.capacity and show only the entries which have their resources
initialized.

Thanks.

-- 
tejun


Re: [Patch v4 1/2] cgroup: svm: Add Encryption ID controller

2021-01-26 Thread Tejun Heo
On Tue, Jan 26, 2021 at 05:01:04PM -0500, Tejun Heo wrote:
> * misc.max and misc.current: nested keyed files listing max and current
>   usage for the cgroup.

Keyed files, not nested keyed files.

Thanks.

-- 
tejun


Re: [Patch v4 1/2] cgroup: svm: Add Encryption ID controller

2021-01-26 Thread Tejun Heo
Hello,

On Tue, Jan 26, 2021 at 12:49:14PM -0800, David Rientjes wrote:
> > SEV-SNP, another incremental enhancement (on SEV-ES), further strengthens 
> > the
> > argument for SEV and SEV-* coexistence.  SEV-SNP and SEV-ES will share the
> > same ASID range, so the question is really, "do we expect to run SEV guests 
> > and
> > any flavor of SEV-* guests on the same platform".  And due to SEV-* not 
> > being
> > directly backward compatible with SEV, the answer will eventually be "yes", 
> > as
> > we'll want to keep running existing SEV guest while also spinning up new 
> > SEV-*
> > guests.
> > 
> 
> Agreed, cloud providers will most certainly want to run both SEV and SEV-* 
> guests on the same platform.

Am I correct in thinking that the reason why these IDs are limited is
because they need to be embedded into the page table entries? If so, we
aren't talking about that many IDs and having to divide the already small
pool into disjoint purposes doesn't seem like a particularly smart use of
those bits. It is what it is, I guess.

> I'm slightly concerned about extensibility if there is to be an 
> incremental enhancement atop SEV-* or TDX with yet another pool of 
> encryption ids.  (For example, when we only had hugepages, this name was 
> perfect; then we got 1GB pages which became "gigantic pages", so are 512GB 
> pages "enormous"? :)  I could argue (encryption_ids.basic.*,
> encryption_ids.enhanced.*) should map to 
> (encryption_ids.legacy.*, encryption_ids.*) but that's likely 
> bikeshedding.
> 
> Thomas: does encryption_ids.{basic,enhanced}.* make sense for ASID 
> partitioning?
> 
> Tejun: if this makes sense for legacy SEV and SEV-* per Thomas, and this 
> is now abstracted to be technology (vendor) neutral, does this make sense 
> to you?

The whole thing seems pretty immature to me and I agree with you that coming
up with an abstraction at this stage feels risky.

I'm leaning towards creating a misc controller to shove these things into:

* misc.max and misc.current: nested keyed files listing max and current
  usage for the cgroup.

* Have an API to activate or update a given resource with total resource
  count. I'd much prefer the resource list to be in the controller itself
  rather than being through some dynamic API just so that there is some
  review in what keys get added.

* Top level cgroup lists which resource is active and how many are
  available.

So, behavior-wise, not that different from the proposed code. Just made
generic into a misc controller. Would that work?
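Assuming names along the lines discussed here, the resulting interface could look something like this from userspace (file names and values are illustrative, not a settled design):

```
# cat /sys/fs/cgroup/misc.capacity
sev 500
sev_es 10

# cat /sys/fs/cgroup/child/misc.max
sev 100
sev_es max

# cat /sys/fs/cgroup/child/misc.current
sev 42
sev_es 3
```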

Thanks.

-- 
tejun


Re: [PATCH] x86/mm: Tracking linear mapping split events since boot

2021-01-26 Thread Tejun Heo
Hello, Dave.

On Mon, Jan 25, 2021 at 04:47:42PM -0800, Dave Hansen wrote:
> The patch here does not actually separate out pre-boot from post-boot,
> so it's pretty hard to tell if the splits came from something like
> tracing which is totally unnecessary or they were the result of
> something at boot that we can't do anything about.

Ah, right, didn't know they also included splits during boot. It'd be a lot
more useful if they were counting post-boot splits.

> This would be a lot more useful if you could reset the counters.  Then
> just reset them from userspace at boot.  Adding read-write debugfs
> exports for these should be pretty trivial.

While this would work for hands-on cases, I'm a bit worried that this might
be more challenging to gain confidence in large production environments.

Thanks.

-- 
tejun


Re: [PATCH] x86/mm: Tracking linear mapping split events since boot

2021-01-26 Thread Tejun Heo
Hello,

On Mon, Jan 25, 2021 at 05:04:00PM -0800, Dave Hansen wrote:
> Which part?  Large production environments don't trust data from
> debugfs?  Or don't trust it if it might have been reset?

When the last reset was. Not saying it's impossible or anything but in
general it's a lot better to have the counters to be monotonically
increasing with time/event stamped markers than the counters themselves
getting reset or modified in other ways because the ownership of a specific
counter might not be obvious to everyone and accidents and mistakes happen.

Note that the "time/event stamped markers" above don't need to and shouldn't
be in the kernel. It can be managed by whoever that wants to monitor a given
time period and there can be any number of them.
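With monotonic counters, any monitor can own its own baseline: snapshot the counter at the start of its window and subtract later readings. A small sketch parsing a /proc/vmstat-style snapshot — the counter name matches the proposed patch, the helper itself is illustrative:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Find "<key> <value>" in a vmstat-style snapshot string; returns the
 * value, or -1 if the key is absent. */
static long long vmstat_get(const char *snap, const char *key)
{
	const char *p = snap;
	size_t klen = strlen(key);

	while (p) {
		if (!strncmp(p, key, klen) && p[klen] == ' ')
			return atoll(p + klen + 1);
		p = strchr(p, '\n');	/* advance to the next line */
		if (p)
			p++;
	}
	return -1;
}
```

Each monitoring period is then just `vmstat_get(now, key) - vmstat_get(baseline, key)`, with no in-kernel reset and no ownership conflict between monitors.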

> You could stick the "reset" switch in debugfs, and dump something out in
> dmesg like we do for /proc/sys/vm/drop_caches so it's not a surprise
> that it happened.

Processing dmesgs can work too but isn't particularly reliable or scalable.

> BTW, counts of *events* don't really belong in meminfo.  These really do
> belong in /proc/vmstat if anything.

Oh yeah, I don't have a strong opinion on where the counters should go.

Thanks.

-- 
tejun


Re: [PATCH v2 4/7] rbtree, perf: Use new rbtree helpers

2021-01-25 Thread Tejun Heo
On Mon, Jan 25, 2021 at 04:09:57PM +0100, Peter Zijlstra wrote:
> Reduce rbtree boiler plate by using the new helpers.
> 
> One noteworthy change is unification of the various (partial) compare
> functions. We construct a subtree match by forcing the sub-order to
> always match, see __group_cmp().
> 
> Due to 'const' we had to touch cgroup_id().
> 
> Cc: Tejun Heo 
> Signed-off-by: Peter Zijlstra (Intel) 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH] x86/mm: Tracking linear mapping split events since boot

2021-01-25 Thread Tejun Heo
Hello,

On Mon, Jan 25, 2021 at 12:15:51PM -0800, Dave Hansen wrote:
> > DirectMap4k:        3505112 kB
> > DirectMap2M:       19464192 kB
> > DirectMap1G:       12582912 kB
> > DirectMap2MSplits:     1705
> > DirectMap1GSplits:       20
> 
> This seems much more like something we'd want in /proc/vmstat or as a
> tracepoint than meminfo.  A tracepoint would be especially nice because
> the trace buffer could actually be examined if an admin finds an
> excessive number of these.

Adding a TP sure can be helpful but I'm not sure how that'd make counters
unnecessary given that the accumulated number of events since boot is what
matters.

Thanks.

-- 
tejun


Re: [Patch v4 1/2] cgroup: svm: Add Encryption ID controller

2021-01-21 Thread Tejun Heo
Hello,

On Thu, Jan 21, 2021 at 08:55:07AM -0600, Tom Lendacky wrote:
> The hardware will allow any SEV capable ASID to be run as SEV-ES, however,
> the SEV firmware will not allow the activation of an SEV-ES VM to be
> assigned to an ASID greater than or equal to the SEV minimum ASID value. The
> reason for the latter is to prevent an !SEV-ES ASID starting out as an
> SEV-ES guest and then disabling the SEV-ES VMCB bit that is used by VMRUN.
> This would result in the downgrading of the security of the VM without the
> VM realizing it.
> 
> As a result, you have a range of ASIDs that can only run SEV-ES VMs and a
> range of ASIDs that can only run SEV VMs.

I see. That makes sense. What's the downside of SEV-ES compared to SEV w/o
ES? Are there noticeable performance / feature penalties or is the split
mostly for backward compatibility?

Thanks.

-- 
tejun


Re: [Patch v4 1/2] cgroup: svm: Add Encryption ID controller

2021-01-20 Thread Tejun Heo
Hello,

On Wed, Jan 20, 2021 at 03:18:29PM -0800, Vipin Sharma wrote:
> RDMA cgroup exposes hardware details to users. In the rdma.{max, current}
> interface files we can see actual hardware names. The only difference

No, what's shown is the device name followed by resources which are commonly
defined for all rdma devices. The format is the same as io controller
interface files.

> compared to the Encryption ID cgroup is that the latter is exposing that detail
> via file names.
> 
> Will you prefer that encryption ID cgroup do things similar to RDMA
> cgroup? It can have 3 files

I don't know how many times I have to repeat the same point to get it
across. For any question about actual abstraction, you haven't provided any
kind of actual research or analysis and just keep pushing the same thing
over and over again. Maybe the situation is such that it makes sense to
change the rule but that needs substantial justifications. I've been asking
to see whether there are such justifications but all I've been getting are
empty answers. Until such discussions take place, please consider the series
nacked and please excuse if I don't respond promptly in this thread.

> > Attaching the interface to kvm side, most likely, instead of exposing the
> > feature through cgroup.
> I am little confused, do you mean moving files from the kernel/cgroup/
> to kvm related directories or you are recommending not to use cgroup at
> all?  I hope it is the former :)
> 
> Only issue with this is that TDX is not limited to KVM, they have
> potential use cases for MKTME without KVM.

There are ways to integrate with cgroup through other interfaces - e.g. take
a look at how bpf works with cgroups. Here, it isn't ideal but may work out
if things actually require a lot of hardware dependent bits. There's also
RDT which exists outside of cgroup for similar reasons.

Thanks.

-- 
tejun


Re: [Patch v4 1/2] cgroup: svm: Add Encryption ID controller

2021-01-20 Thread Tejun Heo
Hello,

On Tue, Jan 19, 2021 at 11:13:51PM -0800, Vipin Sharma wrote:
> > Can you please elaborate? I skimmed through the amd manual and it seemed to
> > say that SEV-ES ASIDs are superset of SEV but !SEV-ES ASIDs. What's the use
> > case for mixing those two?
> 
> For example, customers can be given options for which kind of protection they
> want to choose for their workloads based on factors like data protection
> requirement, cost, speed, etc.

So, what I'm looking for is a bit more in-depth analysis than that. ie. What's
the downside of SEV && !SEV-ES and is the distinction something inherently
useful?

> In terms of features SEV-ES is a superset of SEV but that doesn't mean
> SEV-ES ASIDs are a superset of SEV ASIDs. SEV ASIDs cannot be used for SEV-ES VMs
> and similarly SEV-ES ASIDs cannot be used for SEV VMs. Once a system is
> booted, based on the BIOS settings each type will have their own
> capacity and that number cannot be changed until the next boot and BIOS
> changes.

Here's an excerpt from AMD's system programming manual, section 15.35.2:

  On some systems, there is a limitation on which ASID values can be used on
  SEV guests that are run with SEV-ES disabled. While SEV-ES may be enabled
  on any valid SEV ASID (as defined by CPUID Fn8000_001F[ECX]), there are
  restrictions on which ASIDs may be used for SEV guests with SEV-ES
  disabled. CPUID Fn8000_001F[EDX] indicates the minimum ASID value that
  must be used for an SEV-enabled, SEV-ES-disabled guest. For example, if
  CPUID Fn8000_001F[EDX] returns the value 5, then any VMs which use ASIDs
  1-4 and which enable SEV must also enable SEV-ES.

> We are not mixing the two types of ASIDs, they are separate and used
> separately.

Maybe in practice, the key management on the BIOS side is implemented in a
more restricted way but at least the processor manual says differently.

> > I'm very reluctant to ack vendor specific interfaces for a few reasons but
> > most importantly because they usually indicate abstraction and/or the
> > underlying feature not being sufficiently developed and they tend to become
> > baggages after a while. So, here are my suggestions:
> 
> My first patch was only for SEV, but soon we got comments that this can
> be abstracted and used by TDX and SEID for their use cases.
> 
> I see this patch as providing an abstraction for simple accounting of
> resources used for creating secure execution contexts. Here, secure
> execution is achieved through different means. SEID, TDX, and SEV
> provide security using different features and capabilities. I am not
> sure if we will reach a point where all three and other vendors will use
> the same approach and technology for this purpose.
> 
> Instead of each one coming up with their own resource tracking for their
> features, this patch is providing a common framework and cgroup for
> tracking these resources.

What's implemented is a shared place where similar things can be thrown in,
but from the user's perspective the underlying hardware feature isn't really
abstracted. It's just exposing whatever hardware knobs there are. If you
look at any other cgroup controllers, nothing is exposing this level of
hardware dependent details and I'd really like to keep it that way.

So, what I'm asking for is more in-depth analysis of the landscape and
inherent differences among different vendor implementations to see whether
there can be better approaches or we should just wait and see.

> > * If there can be a shared abstraction which hopefully makes intuitive
> >   sense, that'd be ideal. It doesn't have to be one knob but it shouldn't be
> >   something arbitrary to specific vendors.
> 
> I think we should see these as features provided on a host. Tasks can
> be executed securely on a host with the guarantees provided by the
> specific feature (SEV, SEV-ES, TDX, SEID) used by the task.
> 
> I don't think each H/W vendor can agree to a common set of security
> guarantees and approach.

Do TDX and SEID have multiple key types tho?

> > * If we aren't there yet and vendor-specific interface is a must, attach
> >   that part to an interface which is already vendor-aware.
> Sorry, I don't understand this approach. Can you please give more
> details about it?

Attaching the interface to kvm side, most likely, instead of exposing the
feature through cgroup.

Thanks.

-- 
tejun


Re: [PATCH v2 2/2] cgroup: update PSI file description in docs

2021-01-19 Thread Tejun Heo
On Tue, Jan 19, 2021 at 11:56:18AM -0500, Johannes Weiner wrote:
> On Sat, Jan 16, 2021 at 06:36:34PM +0100, Odin Ugedal wrote:
> > Update PSI file description in cgroup-v2 docs to reflect the current
> > implementation.
> > 
> > Signed-off-by: Odin Ugedal 
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst | 6 +++---
> >  1 file changed, 3 insertions(+), 3 deletions(-)
> > 
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst 
> > b/Documentation/admin-guide/cgroup-v2.rst
> > index 63521cd36ce5..f638c9d3d9f2 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1029,7 +1029,7 @@ All time durations are in microseconds.
> > one number is written, $MAX is updated.
> >  
> >cpu.pressure
> > -   A read-only nested-key file which exists on non-root cgroups.
> > +   A read-only nested-keyed file.
> 
> Could you please also change the 'read-only' to 'read-write'?
> 
> With that, please feel free to add:
> Acked-by: Johannes Weiner 

Applied w/ the suggested change.

Thanks.

-- 
tejun


Re: [PATCH v2 0/2] cgroup: fix psi monitor for root cgroup

2021-01-19 Thread Tejun Heo
On Sat, Jan 16, 2021 at 06:36:32PM +0100, Odin Ugedal wrote:
> This patchset fixes PSI monitors on the root cgroup, as they currently
> only works on the non-root cgroups. Reading works for all, since that
> was fixed when adding support for the root PSI files. It also contains
> a doc update to reflect the current implementation.

Applied to cgroup/for-5.11-fixes with acks and stable tag added.

Thanks.

-- 
tejun


Re: [PATCH] workqueue: keep unbound rescuer's cpumask to be default cpumask

2021-01-19 Thread Tejun Heo
On Sat, Jan 16, 2021 at 02:57:53PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan 
> 
> When we attach a rescuer to a pool, we will set the rescuer's cpumask
> to the pool's cpumask.  If there is hotplug ongoing, it may cause
> the rescuer to run on the dying CPU and cause a bug, or it may cause
> a warning about setting an online&!active cpumask.
> 
> So we have to find a reliable way to set cpumask when attaching
> rescuer.
> 
> When the pool is percpu pool, we have easy way to reliably
> set cpumask with the help of %POOL_DISASSOCIATED.
> 
> But when it is unbound rescuer, the problem becomes harder, because
> we can neither use get_online_cpus() here nor test cpu_active_mask
> without synchronization.
> 
> After investigation, and noticing the unbound nature of the unbound
> rescuer, we decided to make it use the wq's default pwq's cpumask so
> that we don't need to set the rescuer's cpumask when attaching.
> 
> To implement it, we have to set unbound rescuer's cpumask to the
> default pwq's cpumask when creation and maintain it when hotplug
> or the default pwq is changed.

Yeah, this approach makes sense to me. It doesn't look like all problems are
resolved but for the rescuer behavior part, please feel free to add

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH] workqueue: tracing the name of the workqueue instead of it's address

2021-01-19 Thread Tejun Heo
On Mon, Jan 04, 2021 at 08:43:06PM +0800, qiang.zh...@windriver.com wrote:
> From: Zqiang 
> 
> This patch traces the workqueue name instead of its address; the
> new format is as follows.
> 
> workqueue_queue_work: work struct=84e3df56 function=
> drm_fb_helper_dirty_work workqueue=events req_cpu=256 cpu=1
> 
> This tells us which workqueue our work is queued on.
> 
> Signed-off-by: Zqiang 

Applied to wq/for-5.12.

Thanks.

-- 
tejun


Re: [Patch v4 1/2] cgroup: svm: Add Encryption ID controller

2021-01-19 Thread Tejun Heo
Hello,

On Fri, Jan 15, 2021 at 08:32:19PM -0800, Vipin Sharma wrote:
> SEV-ES has stronger memory encryption guarantees compared to SEV; apart
> from encrypting the application memory it also encrypts register state
> among other things. In a single host ASIDs can be distributed between
> these two types by BIOS settings.
> 
> Currently, Google Cloud has a Confidential VM offering using SEV.
> ASIDs are not compatible between SEV and SEV-ES, so a VM running on SEV
> cannot run on SEV-ES and vice versa
> 
> There are use cases for both types of VMs getting used in future.

Can you please elaborate? I skimmed through the amd manual and it seemed to
say that SEV-ES ASIDs are superset of SEV but !SEV-ES ASIDs. What's the use
case for mixing those two?

> > > > > Other ID types can be easily added in the controller in the same way.
> > > > 
> > > > I'm not sure this is necessarily a good thing.
> > > 
> > > This is to just say that when Intel and PowerPC changes are ready it
> > > won't be difficult for them to add their controller.
> > 
> > I'm not really enthused about having per-hardware-type control knobs. None
> > of the other controllers behave that way. Unless it can be abstracted into
> > something common, I'm likely to object.
> 
> There was a discussion in Patch v1 and consensus was to have individual
> files because it makes kernel implementation extremely simple.
> 
> https://lore.kernel.org/lkml/alpine.deb.2.23.453.2011131615510.333...@chino.kir.corp.google.com/#t

I'm very reluctant to ack vendor specific interfaces for a few reasons but
most importantly because they usually indicate abstraction and/or the
underlying feature not being sufficiently developed and they tend to become
baggages after a while. So, here are my suggestions:

* If there can be a shared abstraction which hopefully makes intuitive
  sense, that'd be ideal. It doesn't have to be one knob but it shouldn't be
  something arbitrary to specific vendors.

* If we aren't there yet and vendor-specific interface is a must, attach
  that part to an interface which is already vendor-aware.

> This information is not available anywhere else in the system. The only
> other way to get this value is to use the CPUID instruction (0x801F) of
> the processor. That also has the disadvantage that if the sev module in the
> kernel doesn't use all of the available ASIDs for its work (right now it
> uses all of them), there will be a mismatch between what users get through
> their code and what is actually getting used in the kernel by sev.
> 
> In cgroup v2, I didn't see current files for other cgroups in root
> folder that is why I didn't show that file in root folder.
> 
> Will you be fine if I show two files in the root, something like:
> 
> encids.sev.capacity
> encids.sev.current
> 
> In non root folder, it will be:
> encids.sev.max
> encids.sev.current
> 
> I still prefer encids.sev.stat, as it won't repeat same information in
> each cgroup but let me know what you think.

Yeah, this will be a first and I was mostly wondering about the same number
appearing under different files / names on root and !root cgroups. I'm
leaning more towards capacity/current but let me think about it a bit more.

Thank you.

-- 
tejun


Re: [PATCH] workqueue: fix annotation for WQ_SYSFS

2021-01-19 Thread Tejun Heo
On Mon, Jan 18, 2021 at 12:04:55AM -0800, menglong8.d...@gmail.com wrote:
> From: Menglong Dong 
> 
> 'wq_sysfs_register()' in annotation for 'WQ_SYSFS' is unavailable,
> change it to 'workqueue_sysfs_register()'.
> 
> Signed-off-by: Menglong Dong 

Applied to wq/for-5.12.

Thanks.

-- 
tejun


Re: [Patch v4 1/2] cgroup: svm: Add Encryption ID controller

2021-01-15 Thread Tejun Heo
On Fri, Jan 15, 2021 at 02:18:40PM -0800, Vipin Sharma wrote:
> > * Why is .sev a separate namespace? Isn't the controller supposed to cover
> >   encryption ids across different implementations? It's not like multiple
> >   types of IDs can be in use on the same machine, right?
> > 
> 
> On the AMD platform we have two types, SEV and SEV-ES, which can exist
> simultaneously, and they have their own quotas.

Can you please give a brief explanation of the two and lay out a scenario
where the two are being used / allocated disjointly?

> > > Other ID types can be easily added in the controller in the same way.
> > 
> > I'm not sure this is necessarily a good thing.
> 
> This is to just say that when Intel and PowerPC changes are ready it
> won't be difficult for them to add their controller.

I'm not really enthused about having per-hardware-type control knobs. None
of the other controllers behave that way. Unless it can be abstracted into
something common, I'm likely to object.

> > > +static int enc_id_cg_stat_show(struct seq_file *sf, void *v)
> > > +{
> > > + unsigned long flags;
> > > + enum encryption_id_type type = seq_cft(sf)->private;
> > > +
> > > + spin_lock_irqsave(&enc_id_cg_lock, flags);
> > > +
> > > + seq_printf(sf, "total %u\n", enc_id_capacity[type]);
> > > + seq_printf(sf, "used %u\n", root_cg.res[type].usage);
> > 
> > Dup with .current and no need to show total on every cgroup, right?
> 
> This is for the stat file which will only be seen in the root cgroup
> directory.  It is to know overall picture for the resource, what is the
> total capacity and what is the current usage. ".current" file is not
> shown on the root cgroup.

Ah, missed the flags. It's odd for the usage to be presented in two
different ways tho. I think it'd make more sense w/ cgroup.current at root
level. Is the total number available somewhere else in the system?

Thanks.

-- 
tejun


Re: [Patch v4 2/2] cgroup: svm: Encryption IDs cgroup documentation.

2021-01-15 Thread Tejun Heo
On Thu, Jan 07, 2021 at 05:28:46PM -0800, Vipin Sharma wrote:
> Documentation for both cgroup versions, v1 and v2, of Encryption IDs
> controller. This new controller is used to track and limit usage of
> hardware memory encryption capabilities on the CPUs.
> 
> Signed-off-by: Vipin Sharma 
> Reviewed-by: David Rientjes 
> Reviewed-by: Dionna Glaze 
> ---
>  .../admin-guide/cgroup-v1/encryption_ids.rst  | 108 ++
>  Documentation/admin-guide/cgroup-v2.rst   |  78 -

Given how trivial it is, I'm not gonna object to adding new v1 interface but
maybe just point to v2 doc from v1?

Thanks.

-- 
tejun


Re: [Patch v4 1/2] cgroup: svm: Add Encryption ID controller

2021-01-15 Thread Tejun Heo
Hello,

On Thu, Jan 07, 2021 at 05:28:45PM -0800, Vipin Sharma wrote:
> 1. encryption_ids.sev.max
>   Sets the maximum usage of SEV IDs in the cgroup.
> 2. encryption_ids.sev.current
>   Current usage of SEV IDs in the cgroup and its children.
> 3. encryption_ids.sev.stat
>   Shown only at the root cgroup. Displays total SEV IDs available
>   on the platform and current usage count.

Sorry, should have raised these earlier:

* Can we shorten the name to encids?

* Why is .sev a separate namespace? Isn't the controller supposed to cover
  encryption ids across different implementations? It's not like multiple
  types of IDs can be in use on the same machine, right?

> Other ID types can be easily added in the controller in the same way.

I'm not sure this is necessarily a good thing.

> +/**
> + * enc_id_cg_uncharge_hierarchy() - Uncharge the encryption ID cgroup
> + * hierarchy.
> + * @start_cg: Starting cgroup.
> + * @stop_cg: cgroup at which uncharge stops.
> + * @type: type of encryption ID to uncharge.
> + * @amount: Charge amount.
> + *
> + * Uncharge the cgroup tree from the given start cgroup to the stop cgroup.
> + *
> + * Context: Any context. Expects enc_id_cg_lock to be held by the caller.
> + */
> +static void enc_id_cg_uncharge_hierarchy(struct encryption_id_cgroup 
> *start_cg,
> +  struct encryption_id_cgroup *stop_cg,
> +  enum encryption_id_type type,
> +  unsigned int amount)
> +{
> + struct encryption_id_cgroup *i;
> +
> + lockdep_assert_held(&enc_id_cg_lock);
> +
> + for (i = start_cg; i != stop_cg; i = parent_enc(i)) {
> + WARN_ON_ONCE(i->res[type].usage < amount);
> + i->res[type].usage -= amount;
> + }
> + css_put(&start_cg->css);

I'm curious whether this is necessary given that a css can't be destroyed
while tasks are attached. Are there cases where this wouldn't hold true? If
so, it'd be great to have some comments on how that can happen.

> +/**
> + * enc_id_cg_max_write() - Update the maximum limit of the cgroup.
> + * @of: Handler for the file.
> + * @buf: Data from the user. It should be either "max", 0, or a positive
> + *integer.
> + * @nbytes: Number of bytes of the data.
> + * @off: Offset in the file.
> + *
> + * Uses cft->private value to determine for which encryption ID type
> + * results should be shown.
> + *
> + * Context: Any context. Takes and releases enc_id_cg_lock.
> + * Return:
> + * * >= 0 - Number of bytes processed in the input.
> + * * -EINVAL - If buf is not valid.
> + * * -ERANGE - If number is bigger than unsigned int capacity.
> + * * -EBUSY - If usage can become more than max limit.

The aboves are stale, right?

> +static int enc_id_cg_stat_show(struct seq_file *sf, void *v)
> +{
> + unsigned long flags;
> + enum encryption_id_type type = seq_cft(sf)->private;
> +
> + spin_lock_irqsave(&enc_id_cg_lock, flags);
> +
> + seq_printf(sf, "total %u\n", enc_id_capacity[type]);
> + seq_printf(sf, "used %u\n", root_cg.res[type].usage);

Dup with .current and no need to show total on every cgroup, right?

Thanks.

-- 
tejun


Re: [PATCH] cpuset: fix typos in comments

2021-01-15 Thread Tejun Heo
On Wed, Jan 13, 2021 at 12:37:41PM +0800, Aubrey Li wrote:
> Change hierachy to hierarchy and congifured to configured, no functionality
> changed.
> 
> Signed-off-by: Aubrey Li 

Applied to cgroup/for-5.12.

Thanks.

-- 
tejun


Re: [PATCH v2 2/2] MAINTAINERS: Update my email address

2021-01-15 Thread Tejun Heo
On Wed, Jan 13, 2021 at 04:59:42PM +0800, Zefan Li wrote:
> Signed-off-by: Zefan Li 

Applied 1-2 to cgroup/for-5.11-fixes.

Thanks.

-- 
tejun


Re: [PATCH] cgroup: Remove unnecessary call to strstrip()

2021-01-15 Thread Tejun Heo
On Thu, Jan 14, 2021 at 01:44:27PM +0100, Michal Koutný wrote:
> Hello.
> 
> On Sun, Jan 03, 2021 at 02:50:01AM +, Hao Lee  
> wrote:
> > The string buf will be stripped in cgroup_procs_write_start() before it
> > is converted to int, so remove this unnecessary call to strstrip().
> Good catch, Hao.
> 
> Perhaps the code can then be simplified a bit
> 
> -- >8 --
> From: =?UTF-8?q?Michal=20Koutn=C3=BD?= 
> Date: Thu, 14 Jan 2021 13:23:39 +0100
> Subject: [PATCH] cgroup: cgroup.{procs,threads} factor out common parts
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> The functions cgroup_threads_start and cgroup_procs_start are almost
> identical. In order to reduce duplication, factor out the common code in
> similar fashion we already do for other threadgroup/task functions. No
> functional changes are intended.
> 
> Suggested-by: Hao Lee 
> Signed-off-by: Michal Koutný 

Applied to cgroup/for-5.12 w/ minor description update suggested by Daniel.

Thanks.

-- 
tejun


Re: [PATCH v3] cgroup-v1: add disabled controller check in cgroup1_parse_param()

2021-01-15 Thread Tejun Heo
On Fri, Jan 15, 2021 at 05:37:17PM +0800, Chen Zhou wrote:
> When mounting a cgroup hierarchy with disabled controller in cgroup v1,
> all available controllers will be attached.
> For example, boot with cgroup_no_v1=cpu or cgroup_disable=cpu, and then
> mount with "mount -t cgroup -ocpu cpu /sys/fs/cgroup/cpu", then all
> enabled controllers will be attached except cpu.
> 
> Fix this by adding disabled controller check in cgroup1_parse_param().
> If the specified controller is disabled, just return error with information
> "Disabled controller xx" rather than attaching all the other enabled
> controllers.
> 
> Fixes: f5dfb5315d34 ("cgroup: take options parsing into ->parse_monolithic()")
> Signed-off-by: Chen Zhou 
> Reviewed-by: Zefan Li 

Applied to cgroup/for-5.11-fixes.

Thanks.

-- 
tejun


Re: [PATCH v2] cgroup-v1: add disabled controller check in cgroup1_parse_param()

2021-01-14 Thread Tejun Heo
Hello,

On Fri, Jan 15, 2021 at 09:55:43AM +0800, chenzhou wrote:
> Yeah, this will select all enabled controllers, but that isn't the
> behavior we want.
> I think this case should return an error with the information "Disabled
> controller xx" rather than attaching all the other enabled controllers.
> 
> For example, boot with cgroup_no_v1=cpu, and then mount with
> "mount -t cgroup -o cpu cpu /sys/fs/cgroup/cpu", then all enabled controllers 
> will
> be attached expect cpu.

Okay, that explanation actually makes sense. Can you please update the
description to include what's broken and how it's being fixed? It really
isn't clear what the patch is trying to achieve from the current
description.

Thanks.

-- 
tejun


Re: [Patch v3 0/2] cgroup: KVM: New Encryption IDs cgroup controller

2021-01-05 Thread Tejun Heo
Happy new year!

On Wed, Dec 16, 2020 at 12:02:37PM -0800, Vipin Sharma wrote:
> I like the idea of having a separate controller to keep the code simple and
> easier for maintenance.

Yeah, the more I think about it, keeping it separate seems like the right
thing to do. What bothers me primarily is that the internal logic is
identical between the RDMA controller and this one. If you wanna try
factoring them out into library, great. If not, I don't think it should
block merging this controller. We can get to refactoring later.

Thanks.

-- 
tejun


[GIT PULL] workqueue changes for v5.11-rc1

2020-12-28 Thread Tejun Heo
Hello, again.

The same as the cgroup tree - one commit which was scheduled for the 5.11
merge window. All the commit does is avoid spurious worker wakeups from the
workqueue allocation / config change path to help cpuisol use cases.

Thank you.

The following changes since commit 127c501a03d5db8b833e953728d3bcf53c8832a9:

  Merge tag '5.10-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 
(2020-11-24 15:33:18 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git for-5.11

for you to fetch changes up to 01341fbd0d8d4e717fc1231cdffe00343088ce0b:

  workqueue: Kick a worker based on the actual activation of delayed works 
(2020-11-25 17:10:28 -0500)


Yunfeng Ye (1):
  workqueue: Kick a worker based on the actual activation of delayed works

 kernel/workqueue.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

-- 
tejun


[GIT PULL] cgroup changes for v5.11-rc1

2020-12-28 Thread Tejun Heo
Hello, Linus.

These three patches were scheduled for the merge window but I forgot to send
them out. Sorry about that. None of them are significant and they fit well
in a fix pull request too - two are cosmetic and one fixes a memory leak in
the mount option parsing path.

Thanks and happy holidays!

The following changes since commit 127c501a03d5db8b833e953728d3bcf53c8832a9:

  Merge tag '5.10-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6 
(2020-11-24 15:33:18 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-5.11

for you to fetch changes up to 2d18e54dd8662442ef5898c6bdadeaf90b3cebbc:

  cgroup: Fix memory leak when parsing multiple source parameters (2020-12-16 
10:10:32 -0500)


Bhaskar Chowdhury (1):
  kernel: cgroup: Mundane spelling fixes throughout the file

Hui Su (1):
  cgroup/cgroup.c: replace 'of->kn->priv' with of_cft()

Qinglang Miao (1):
  cgroup: Fix memory leak when parsing multiple source parameters

 kernel/cgroup/cgroup-v1.c |  2 ++
 kernel/cgroup/cgroup.c| 30 +++---
 2 files changed, 17 insertions(+), 15 deletions(-)

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-19 Thread Tejun Heo
Hello,

On Sat, Dec 19, 2020 at 03:08:13PM +0800, Ian Kent wrote:
> And looking further I see there's a race that kernfs can't do anything
> about between kernfs_refresh_inode() and fs/inode.c:update_times().

Do kernfs files end up calling into that path tho? Doesn't look like it to
me but if so yeah we'd need to override the update_time for kernfs.

> +static int kernfs_iop_update_time(struct inode *inode, struct timespec64 
> *time, int flags)
>  {
> - struct inode *inode = d_inode(path->dentry);
>   struct kernfs_node *kn = inode->i_private;
> + struct kernfs_iattrs *attrs;
>  
>   mutex_lock(&kernfs_mutex);
> + attrs = kernfs_iattrs(kn);
> + if (!attrs) {
> + mutex_unlock(&kernfs_mutex);
> + return -ENOMEM;
> + }
> +
> + /* Which kernfs node attributes should be updated from
> +  * time?
> +  */
> +
>   kernfs_refresh_inode(kn, inode);
>   mutex_unlock(&kernfs_mutex);

I don't see how this would reflect the changes from kernfs_setattr() into
the attached inode. This would actually make the attr updates obviously racy
- the userland visible attrs would be stale until the inode gets reclaimed
and then when it gets reinstantiated it'd show the latest information.

That said, if you wanna take the direction where attr updates are reflected
to the associated inode when the change occurs, which makes sense, the right
thing to do would be making kernfs_setattr() update the associated inode if
existent.

Thanks.

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-18 Thread Tejun Heo
Hello,

On Fri, Dec 18, 2020 at 03:36:21PM +0800, Ian Kent wrote:
> Sounds like your saying it would be ok to add a lock to the
> attrs structure, am I correct?

Yeah, adding a lock to attrs is a lot less of a problem and it looks like
it's gonna have to be either that or hashed locks, which might actually make
sense if we're worried about the size of attrs (I don't think we need to).

Thanks.

-- 
tejun


Re: [PATCH v2 0/6] kernfs: proposed locking and concurrency improvement

2020-12-17 Thread Tejun Heo
Hello,

On Thu, Dec 17, 2020 at 07:48:49PM +0800, Ian Kent wrote:
> > What could be done is to make the kernfs node attr_mutex
> > a pointer and dynamically allocate it but even that is too
> > costly a size addition to the kernfs node structure as
> > Tejun has said.
> 
> I guess the question to ask is, is there really a need to
> call kernfs_refresh_inode() from functions that are usually
> reading/checking functions.
> 
> Would it be sufficient to refresh the inode in the write/set
> operations in (if there's any) places where things like
> setattr_copy() is not already called?
> 
> Perhaps GKH or Tejun could comment on this?

My memory is a bit hazy but invalidation on reads is how the sysfs namespace is
implemented, so I don't think there's an easy way around that. The only thing I
can think of is embedding the lock into attrs and doing xchg dance when
attaching it.

Thanks.

-- 
tejun


Re: [Patch v3 0/2] cgroup: KVM: New Encryption IDs cgroup controller

2020-12-16 Thread Tejun Heo
Hello,

On Thu, Dec 10, 2020 at 03:44:35PM -0800, David Rientjes wrote:
> Concern with a single misc controller would be that any subsystem that 
> wants to use it has to exactly fit this support: current, max, stat, 
> nothing more.  The moment a controller needs some additional support, and 
> its controller is already implemented in a previous kernel version as a 
> part of "misc," we face a problem.

Yeah, that's a valid concern, but given the history, there doesn't seem to
be much need beyond that for these use cases and the limited need seems
inherent to the way the resources are defined and consumed. So yeah, it can
go either way.

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-iocost: Use alloc_percpu_gfp() to simplify the code

2020-12-16 Thread Tejun Heo
Hello,

On Fri, Dec 11, 2020 at 03:13:29PM +0800, Baolin Wang wrote:
> Thanks for teaching me this, at least I did not get this from the local_ops
> documentation before. Just out of curiosity, these local[64]_t variables are
> also allocated from the buddy allocator ultimately, so why can they not be
> initialized to zeros on some archs with __GFP_ZERO? Could you elaborate
> on this restriction? Thanks.

* It's highly unlikely but theoretically possible that some arch might need
  more than raw value to implement local semantics.

* People might wanna add debug annotations which may require more than just
  the raw value.

> By the way, seems the kyber-iosched has the same issue, since the 'struct
> kyber_cpu_latency' also contains an atomic_t variable.
> 
>   kqd->cpu_latency = alloc_percpu_gfp(struct kyber_cpu_latency,
>   GFP_KERNEL | __GFP_ZERO);
>   if (!kqd->cpu_latency)
>   goto err_kqd;

Yeah, the lines become blurry when all existing usages are fine with zeroing
and we do end up needing to clean those up when the need arises (e.g. we
used to zero some spinlocks too). It's just better form to stick with
initializers when they are provided.

Thanks.

-- 
tejun


Re: [PATCH v2] cgroup: Fix memory leak when parsing multiple source parameters

2020-12-16 Thread Tejun Heo
On Thu, Dec 10, 2020 at 09:29:43AM +0800, Qinglang Miao wrote:
> A memory leak is found in cgroup1_parse_param() when multiple source
> parameters overwrite fc->source in the fs_context struct without freeing
> the previous value.
> 
> unreferenced object 0x888100d930e0 (size 16):
>   comm "mount", pid 520, jiffies 4303326831 (age 152.783s)
>   hex dump (first 16 bytes):
> 74 65 73 74 6c 65 61 6b 00 00 00 00 00 00 00 00  testleak
>   backtrace:
> [<3e5023ec>] kmemdup_nul+0x2d/0xa0
> [<377dbdaa>] vfs_parse_fs_string+0xc0/0x150
> [] generic_parse_monolithic+0x15a/0x1d0
> [<0f750198>] path_mount+0xee1/0x1820
> [<04756de2>] do_mount+0xea/0x100
> [<94cafb0a>] __x64_sys_mount+0x14b/0x1f0
> 
> Fix this bug by permitting a single source parameter and rejecting with
> an error all subsequent ones.
> 
> Fixes: 8d2451f4994f ("cgroup1: switch to option-by-option parsing")
> Reported-by: Hulk Robot 
> Signed-off-by: Qinglang Miao 

Applied to cgroup/for-5.11.

Thanks.

-- 
tejun


Re: [PATCH 07/10] workqueue: Manually break affinity on hotplug for unbound pool

2020-12-16 Thread Tejun Heo
On Mon, Dec 14, 2020 at 11:54:54PM +0800, Lai Jiangshan wrote:
>   * An unbound pool may end up with a cpumask which doesn't have any online
> - * CPUs.  When a worker of such pool get scheduled, the scheduler resets
> - * its cpus_allowed.  If @cpu is in @pool's cpumask which didn't have any
> - * online CPU before, cpus_allowed of all its workers should be restored.
> + * CPUs.  We have to reset workers' cpus_allowed of such pool.  And we
> + * restore the workers' cpus_allowed when the pool's cpumask has online
> + * CPU at the first time after reset.
  ^
  for the first time

-- 
tejun


Re: [PATCH 04/10] workqueue: don't set the worker's cpumask when kthread_bind_mask()

2020-12-16 Thread Tejun Heo
On Mon, Dec 14, 2020 at 11:54:51PM +0800, Lai Jiangshan wrote:
> @@ -1945,7 +1945,15 @@ static struct worker *create_worker(struct worker_pool 
> *pool)
>   goto fail;
>  
>   set_user_nice(worker->task, pool->attrs->nice);
> - kthread_bind_mask(worker->task, pool->attrs->cpumask);
> +
> + /*
> +  * Set PF_NO_SETAFFINITY via kthread_bind_mask().  We use
> +  * cpu_possible_mask other than pool->attrs->cpumask, because
 ^
 instead of

> +  * there might be no online cpu in the pool->attrs->cpumask.
 ^
 might not be any

> +  * The cpumask of the worker will be set properly later in
> +  * worker_attach_to_pool().
> +  */
> + kthread_bind_mask(worker->task, cpu_possible_mask);

This is a bit ugly but not the end of the world. Maybe we can move it to the
start of worker_thread() but that'd require an extra handshake. Oh well...

Thanks.

-- 
tejun


Re: [PATCH 02/10] workqueue: use cpu_possible_mask instead of cpu_active_mask to break affinity

2020-12-16 Thread Tejun Heo
Hello,

On Mon, Dec 14, 2020 at 11:54:49PM +0800, Lai Jiangshan wrote:
> @@ -4909,8 +4909,9 @@ static void unbind_workers(int cpu)
>  
>   raw_spin_unlock_irq(&pool->lock);
>  
> + /* don't rely on the scheduler to force break affinity for us. 
> */

I'm not sure this comment is helpful. The comment may make sense right now
while the scheduler behavior is changing but down the line it's not gonna
make a whole lot of sense.

>   for_each_pool_worker(worker, pool)
> - WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, 
> cpu_active_mask) < 0);
> + WARN_ON_ONCE(set_cpus_allowed_ptr(worker->task, 
> cpu_possible_mask) < 0);
>  
>   mutex_unlock(&wq_pool_attach_mutex);

Thanks.

-- 
tejun


Re: [PATCH 00/10] workqueue: break affinity initiatively

2020-12-16 Thread Tejun Heo
On Mon, Dec 14, 2020 at 11:54:47PM +0800, Lai Jiangshan wrote:
> From: Lai Jiangshan 
> 
> 06249738a41a ("workqueue: Manually break affinity on hotplug")
> said that the scheduler will no longer force-break affinity for us.
> 
> But workqueue highly depends on the old behavior. Many parts of the code
> rely on it; 06249738a41a ("workqueue: Manually break affinity on hotplug")
> is not enough to change it, and the commit has flaws of its own too.
> 
> We need to thoroughly update the way workqueue handles affinity
> in cpu hot[un]plug, which is what this patchset intends to do; it
> replaces Valentin Schneider's patch [1].
> 
> Patch 1 fixes a flaw reported by Hillf Danton .
> I have to include this fix because later patches depend on it.
> 
> The patchset is based on tip/master rather than workqueue tree,
> because the patchset is a complement for 06249738a41a ("workqueue:
> Manually break affinity on hotplug") which is only in tip/master by now.
> 
> [1]: 
> https://lore.kernel.org/r/ff62e3ee994efb3620177bf7b19fab16f4866845.ca...@redhat.com

Generally looks good to me. Please feel free to add

 Acked-by: Tejun Heo 

and route the series through tip.

Thanks.

-- 
tejun


Re: [PATCH 1/2] blk-iocost: Add iocg idle state tracepoint

2020-12-10 Thread Tejun Heo
On Thu, Dec 10, 2020 at 06:56:44PM +0800, Baolin Wang wrote:
> It will be helpful to trace the iocg's whole state, including active and
> idle state. And we can easily expand the original iocost_iocg_activate
> trace event to support a state trace class, including active and idle
> state tracing.
> 
> Signed-off-by: Baolin Wang 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-iocost: Use alloc_percpu_gfp() to simplify the code

2020-12-10 Thread Tejun Heo
Hello,

On Thu, Dec 10, 2020 at 06:56:45PM +0800, Baolin Wang wrote:
> Use alloc_percpu_gfp() with __GFP_ZERO flag, which can remove
> some explicit initialization code.

__GFP_ZERO is implicit for percpu allocations and local[64]_t's initial
states aren't guaranteed to be all zeros on different archs.

Thanks.

-- 
tejun


Re: [Patch v3 0/2] cgroup: KVM: New Encryption IDs cgroup controller

2020-12-09 Thread Tejun Heo
Hello,

Rough take after skimming:

* I don't have an overall objection. In terms of behavior, the only thing
  which stood out was input rejection depending on the current usage. The
  preferred way of handling that is rejecting future allocations rather than
  failing configuration as that makes it impossible e.g. to lower limit and
  drain existing usages from outside the container.

* However, the boilerplate to usefulness ratio doesn't look too good and I
  wonder whether what we should do is adding a generic "misc" controller
  which can host this sort of static hierarchical counting. I'll think more
  on it.

Thanks.

-- 
tejun


Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-12-04 Thread Tejun Heo
Hello,

On Thu, Dec 03, 2020 at 04:51:42PM -0800, Josh Don wrote:
> > So to me the color thing reads like an end-run around the cgroup hierarchy.
> 
> Restructuring the cgroup resource hierarchy to incorporate the trust
> domains is not necessarily trivial (as is the case for us).  I agree
> though that that would be the ideal correct solution from the cgroup
> hierarchy perspective.

Yeah, it sure isn't. We (FB) had to go through a couple of iterations and it's
looking like we need another adjustment, so I fully agree that these are
painful, but at the same time I don't think it's difficult to see how
piling up workarounds in the lowest layer is not the right thing to do. The
workarounds you guys need would be different from what we or others would
need. The kernel can't be maintained in any sustainable manner if we keep
piling disjoint workarounds on it. Please consider the long-term trajectory
when proposing interface changes, which often boils down to identifying the
core features which must be supported by the interface.

Thanks.

-- 
tejun


Re: [RFC PATCH] workqueue: cut wq_rr_cpu_last

2020-12-03 Thread Tejun Heo
Hello,

On Thu, Dec 03, 2020 at 06:28:41PM +0800, Hillf Danton wrote:
> + new_cpu = cpumask_any_and_distribute(wq_unbound_cpumask, 
> cpu_online_mask);
> + if (new_cpu < nr_cpu_ids)
> + return new_cpu;
> + else
> + return cpu;
>  }
>  
>  static void __queue_work(int cpu, struct workqueue_struct *wq,
> @@ -1554,7 +1546,7 @@ static int workqueue_select_cpu_near(int
>   return cpu;
>  
>   /* Use "random" otherwise know as "first" online CPU of node */
> - cpu = cpumask_any_and(cpumask_of_node(node), cpu_online_mask);
> + cpu = cpumask_any_and_distribute(cpumask_of_node(node), 
> cpu_online_mask);

This looks generally okay but I think there's a real risk of different
cpumasks interfering with cpu selection. e.g. imagine a cpu issuing work
items to two unbound workqueues consecutively, one numa-bound, the other
not. The above change will basically confine the !numa one to the numa node.

I think the right thing to do here is expanding the
cpumask_any_and_distribute() so that the user can provide its own cursor
similar to what we do with ratelimits.

Thanks.

-- 
tejun


Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints

2020-12-02 Thread Tejun Heo
Hello,

On Wed, Dec 02, 2020 at 03:23:57PM -0800, Shakeel Butt wrote:
> > There've been some changes to cgroup ids recently and now cgroup id, ino and
> > its file_handle are all compatible. On 64bit ino machines, they're all the
> > same and won't be reused. On 32bit ino machines, the lower 32bit of full id
> > is used as ino. ino may be reused but not the full 64bit id.
> 
> __kernfs_new_node() is using idr_alloc_cyclic() which will return
> 32bit ID. If I am understanding this correctly, the full ID is
> generated similarly for 32bit and 64bit machines, but for 64bit
> machines the whole ID is the inode number while on 32bit machines the
> lower 32bits contain the inode number. Is that correct?

Yes, that's correct.

Thanks.

-- 
tejun


Re: [RFC PATCH] blk-iocost: Optimize the ioc_refreash_vrate() function

2020-12-02 Thread Tejun Heo
On Sun, Nov 29, 2020 at 10:37:18AM +0800, Baolin Wang wrote:
> The ioc_refreash_vrate() will only be called in ioc_timer_fn() after
> starting a new period or stopping the period.
> 
> So when starting a new period, the variable 'pleft' in ioc_refreash_vrate()
> is always the period's time, which means if the abs(ioc->vtime_err)
> is less than the period's time, the vcomp is 0, and we do not need
> compensate the vtime_rate in this case, just set it as the base vrate
> and return.
> 
> When stopping the period, the ioc->vtime_err will be cleared to 0,
> and we also do not need to compensate the vtime_rate, just set it as
> the base vrate and return.

Before, the function did something which is conceptually discrete and
describable. After, its operation is intertwined with when it's called. I
don't think this sort of micro optimization is beneficial in cold paths.

Thanks.

-- 
tejun


Re: [PATCH v2 5/5] blk-iocost: Factor out the base vrate change into a separate function

2020-12-02 Thread Tejun Heo
On Thu, Nov 26, 2020 at 04:16:15PM +0800, Baolin Wang wrote:
> Factor out the base vrate change code into a separate function
> to simplify the ioc_timer_fn().
> 
> No functional change.
> 
> Signed-off-by: Baolin Wang 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH v2 4/5] blk-iocost: Factor out the active iocgs' state check into a separate function

2020-12-02 Thread Tejun Heo
On Thu, Nov 26, 2020 at 04:16:14PM +0800, Baolin Wang wrote:
> Factor out the iocgs' state check into a separate function to
> simplify the ioc_timer_fn().
> 
> No functional change.
> 
> Signed-off-by: Baolin Wang 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH v2 3/5] blk-iocost: Move the usage ratio calculation to the correct place

2020-12-02 Thread Tejun Heo
On Thu, Nov 26, 2020 at 04:16:13PM +0800, Baolin Wang wrote:
> We only use the hweight based usage ratio to calculate the new
> hweight_inuse of the iocg to decide if this iocg can donate some
> surplus vtime.
> 
> Thus move the usage ratio calculation to the correct place to
> avoid unnecessary calculation for some vtime shortage iocgs.
> 
> Signed-off-by: Baolin Wang 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH] blk-throttle: don't check whether or not lower limit is valid if CONFIG_BLK_DEV_THROTTLING_LOW is off

2020-12-02 Thread Tejun Heo
On Thu, Nov 26, 2020 at 11:18:34AM +0800, Yu Kuai wrote:
> blk_throtl_update_limit_valid() will search for descendants to see if
> 'LIMIT_LOW' of bps/iops and READ/WRITE is nonzero. However, they're always
> zero if CONFIG_BLK_DEV_THROTTLING_LOW is not set, furthermore, a lot of
> time will be wasted to iterate descendants.
> 
> Thus do nothing in blk_throtl_update_limit_valid() in such situation.
> 
> Signed-off-by: Yu Kuai 

Acked-by: Tejun Heo 

Thanks.

-- 
tejun


Re: [PATCH] mm: mmap_lock: fix use-after-free race and css ref leak in tracepoints

2020-12-02 Thread Tejun Heo
Hello,

On Tue, Dec 01, 2020 at 12:53:46PM -0800, Shakeel Butt wrote:
> The writeback tracepoint in include/trace/events/writeback.h is
> already using the cgroup IDs. Actually it used to use cgroup_path but
> converted to cgroup_ino.
> 
> Tejun, how do you use these tracepoints?

There've been some changes to cgroup ids recently and now cgroup id, ino and
its file_handle are all compatible. On 64bit ino machines, they're all the
same and won't be reused. On 32bit ino machines, the lower 32bit of full id
is used as ino. ino may be reused but not the full 64bit id.

You can map back cgroup id to path from userspace using open_by_handle_at().
The following is an example program which does path -> cgrp id -> path
mappings.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <sys/types.h>

#ifndef FILEID_KERNFS
#define FILEID_KERNFS 0xfe
#endif

struct fh_store {
struct file_handle fh;
char stor[MAX_HANDLE_SZ];
};

uint64_t path_to_cgrp_id(const char *path)
{
struct fh_store fh_store;
struct file_handle *fh = &fh_store.fh;
int mnt_id;

fh->handle_bytes = MAX_HANDLE_SZ;

if (name_to_handle_at(AT_FDCWD, path, fh, &mnt_id, 0)) {
perror("name_to_handle_at");
abort();
}

if (fh->handle_type != FILEID_KERNFS) {
fprintf(stderr, "invalid handle_type 0x%x\n", fh->handle_type);
abort();
}

return *(uint64_t *)fh->f_handle;
}

void cgrp_id_to_path(uint64_t cgrp_id, char *path_buf)
{
struct fh_store fh_store;
struct file_handle *fh = &fh_store.fh;
char proc_path[PATH_MAX];
int mnt_fd, fd;

fh->handle_type = FILEID_KERNFS;
fh->handle_bytes = sizeof(uint64_t);
*(uint64_t *)fh->f_handle = cgrp_id;

mnt_fd = open("/sys/fs/cgroup", O_RDONLY);
if (mnt_fd < 0) {
perror("open(\"/sys/fs/cgroup\")");
abort();
}

fd = open_by_handle_at(mnt_fd, fh, O_RDONLY);
if (fd < 0) {
perror("open_by_handle_at");
abort();
}

snprintf(proc_path, PATH_MAX, "/proc/self/fd/%d", fd);
printf("proc_path=%s\n", proc_path);

if (readlink(proc_path, path_buf, PATH_MAX) < 0) {
perror("readlink");
abort();
}
}

int main(int argc, char **argv)
{
char path_buf[PATH_MAX + 1] = "";
uint64_t cgrp_id;

if (argc != 2) {
fprintf(stderr, "Usage: test-cgrp-id CGROUP_PATH\n");
return 1;
}

cgrp_id = path_to_cgrp_id(argv[1]);
printf("cgrp_id=%llu\n", (unsigned long long)cgrp_id);

cgrp_id_to_path(cgrp_id, path_buf);
printf("cgrp_path=%s\n", path_buf);

return 0;
}


Re: [PATCH -tip 26/32] sched: Add a second-level tag for nested CGroup usecase

2020-12-02 Thread Tejun Heo
Hello,

On Wed, Dec 02, 2020 at 09:02:11AM +0100, Peter Zijlstra wrote:
> > the user might only want subsets of {B, C, D, E} to share.  For
> > instance, the user might only want {B,C} and {D, E} to share.  One way
> > to solve this would be to allow the user to write the group cookie
> > directly.  However, this interface would need to be restricted to
> > privileged users, since otherwise the cookie could be configured to
> > share with any arbitrary cgroup.  The purpose of the 'color' field is
> > to expose a portion of the cookie that can be modified by a
> > non-privileged user in order to achieve this sharing goal.
> > 
> > If this doesn't seem like a useful case, I'm happy to drop this patch
> > from the series to unblock it.
> 
> Well, the traditional cgroup way of doing that would be to:
> 
>  A
>   / \
>T1 T2
>   / \
>  B   C
> 
> And tag T1 if you want B,C to share.
> 
> So to me the color thing reads like an end-run around the cgroup hierarchy.

+1

and please cc me on cgroup related changes.

Thanks.

-- 
tejun


Re: [PATCH 2/2] kernfs: remove mutex in kernfs_dop_revalidate

2020-12-02 Thread Tejun Heo
Hello,

On Wed, Dec 02, 2020 at 10:58:37PM +0800, Fox Chen wrote:
> There is a big mutex in kernfs_dop_revalidate which slows down the
> concurrent performance of kernfs.
> 
> Since kernfs_dop_revalidate only does some checks, the lock is
> largely unnecessary. Also, according to the kernel filesystem locking
> document:
> https://www.kernel.org/doc/html/latest/filesystems/locking.html
> locking is not part of the protocol for the d_revalidate operation.

That's just describing the rules seen from vfs side. It doesn't say anything
about locking rules internal to each file system implementation.

> This patch removes this mutex from kernfs_dop_revalidate,
> so kernfs_dop_revalidate can run concurrently.
> 
> Signed-off-by: Fox Chen 
> ---
>  fs/kernfs/dir.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 9aec80b9d7c6..c2267c93f546 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -26,7 +26,6 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */
>  
>  static bool kernfs_active(struct kernfs_node *kn)
>  {
> - lockdep_assert_held(&kernfs_mutex);
>   return atomic_read(>active) >= 0;
>  }
>  
> @@ -557,10 +556,9 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
> unsigned int flags)
>  
>   /* Always perform fresh lookup for negatives */
>   if (d_really_is_negative(dentry))
> - goto out_bad_unlocked;
> + goto out_bad;
>  
>   kn = kernfs_dentry_node(dentry);
> - mutex_lock(&kernfs_mutex);
>  
>   /* The kernfs node has been deactivated */
>   if (!kernfs_active(kn))
> @@ -579,11 +577,8 @@ static int kernfs_dop_revalidate(struct dentry *dentry, 
> unsigned int flags)
>   kernfs_info(dentry->d_sb)->ns != kn->ns)
>   goto out_bad;
>  
> - mutex_unlock(&kernfs_mutex);
>   return 1;
>  out_bad:
> - mutex_unlock(&kernfs_mutex);
> -out_bad_unlocked:
>   return 0;
>  }

I don't see how this can be safe. Nothing even protects the dentry from
turning negative in the middle and it may end up trying to deref NULL. I'm
sure we can make this not need kernfs_mutex but that'd have to be a lot more
careful.

Thanks.

-- 
tejun


Re: [PATCH 1/2] kernfs: replace the mutex in kernfs_iop_permission with a rwlock

2020-12-02 Thread Tejun Heo
On Wed, Dec 02, 2020 at 10:58:36PM +0800, Fox Chen wrote:
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 89f6a4214a70..545cdb39b34b 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -156,6 +156,7 @@ struct kernfs_node {
>   unsigned short  flags;
>   umode_t mode;
>   struct kernfs_iattrs*iattr;
> + rwlock_tiattr_rwlock;
>  };

Also, while this might not look like much, kernfs_node is very size
sensitive. There are systems with a huge number of these nodes, so I don't
think putting a per-node lock like this is a good idea. Either we can use a
shared lock protecting iattrs or play some cmpxchg games when allocating and
setting ->iattr and put the lock there.

Thanks.

-- 
tejun

