Re: [RFC PATCH] cgroup, netclassid: add a preemption point to write_classid

2018-10-23 Thread Tejun Heo
On Thu, Oct 18, 2018 at 10:56:17AM +0200, Michal Hocko wrote:
> From: Michal Hocko 
> 
> We have seen a customer complaining about soft lockups on a !PREEMPT
> kernel config with a 4.4 based kernel
...
> If a cgroup has many tasks with many open file descriptors then we would
> end up in a large loop without any rescheduling point throughout the
> operation. Add a cond_resched() once per task.
> 
> Signed-off-by: Michal Hocko 

Applied to cgroup/for-4.20.
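
For anyone backporting this, the change boils down to one cond_resched()
per iterated task, roughly like the following (sketch around the existing
write_classid() loop, not the literal patch):

	css_task_iter_start(css, &it);
	while ((p = css_task_iter_next(&it))) {
		task_lock(p);
		iterate_fd(p->files, 0, update_classid_sock,
			   (void *)(unsigned long)cs->classid);
		task_unlock(p);
		cond_resched();		/* one rescheduling point per task */
	}
	css_task_iter_end(&it);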

Thanks.

-- 
tejun


Re: Documentation: Add HOWTO Korean translation into BPF and XDP Reference Guide.

2018-09-24 Thread Tejun Heo
Hello,

First of all, thanks a lot for the contribution.  It'd be great to
have a Korean translation.  I glanced over it and there were often
areas where it was a bit challenging to understand without going back
to the English version.

I think this would still be helpful and a step in the right direction;
however, it'd be good to explicitly note, in Korean, that the
translation is still a work in progress and that readers are
recommended to refer back to the English copy when there's any doubt.

Thanks.

-- 
tejun


Re: [PATCH bpf-next 1/4] bpf: Introduce bpf_skb_ancestor_cgroup_id helper

2018-08-13 Thread Tejun Heo
Hello, Andrey.

On Fri, Aug 10, 2018 at 10:35:23PM -0700, Andrey Ignatov wrote:
> +static inline struct cgroup *cgroup_ancestor(struct cgroup *cgrp,
> +  int ancestor_level)
> +{
> + struct cgroup *ptr;
> +
> + if (cgrp->level < ancestor_level)
> + return NULL;
> +
> + for (ptr = cgrp;
> +  ptr && ptr->level > ancestor_level;
> +  ptr = cgroup_parent(ptr))
> + ;
> +
> + if (ptr && ptr->level == ancestor_level)
> + return ptr;
> +
> + return NULL;
> +}

I don't have any objections functionality-wise but can we do something
like the following instead?

static inline struct cgroup *cgroup_ancestor(struct cgroup *cgrp,
 int ancestor_level)
{
if (cgrp->level < ancestor_level)
return NULL;

while (cgrp->level > ancestor_level)
cgrp = cgroup_parent(cgrp);
return cgrp;
}

Thanks.

-- 
tejun


Re: [PATCH] percpu: add a schedule point in pcpu_balance_workfn()

2018-02-23 Thread Tejun Heo
On Fri, Feb 23, 2018 at 08:12:42AM -0800, Eric Dumazet wrote:
> From: Eric Dumazet 
> 
> When a large BPF percpu map is destroyed, I have seen
> pcpu_balance_workfn() holding cpu for hundreds of milliseconds.
> 
> On KASAN config and 112 hyperthreads, average time to destroy a chunk
> is ~4 ms.
> 
> [ 2489.841376] destroy chunk 1 in 4148689 ns
> ...
> [ 2490.093428] destroy chunk 32 in 4072718 ns
> 
> Signed-off-by: Eric Dumazet 

Applied to percpu/for-4.16-fixes.
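
For reference, the applied change is essentially one cond_resched() per
destroyed chunk in pcpu_balance_workfn(), something like the following
(simplified sketch, not the exact hunk):

	list_for_each_entry_safe(chunk, next, &to_free, list) {
		/* depopulate and unmap the chunk's pages ... */
		pcpu_destroy_chunk(chunk);
		cond_resched();		/* each destruction can take ~4ms */
	}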

Thanks, Eric.

-- 
tejun


Re: lost connection to test machine (4)

2018-02-12 Thread Tejun Heo
On Mon, Feb 12, 2018 at 09:03:25AM -0800, Tejun Heo wrote:
> Hello, Daniel.
> 
> On Mon, Feb 12, 2018 at 06:00:13PM +0100, Daniel Borkmann wrote:
> > [ +Dennis, +Tejun ]
> > 
> > Looks like we're stuck in percpu allocator with key/value size of 4 bytes
> > each and large number of entries (max_entries) in the reproducer in above
> > link.
> > 
> > Could we have some __GFP_NORETRY semantics and let allocations fail instead
> > of triggering OOM killer?
> 
> For some part, maybe, but not generally.  The virt area allocation
> goes down to page table allocation which is hard coded to use
> GFP_KERNEL in arch mm code.

So, the following should convert majority of allocations to use
__GFP_NORETRY.  It doesn't catch everything but should significantly
lower the probability of hitting this and put this on the same footing
as vmalloc.  Can you see whether this is enough?

Note that this patch isn't upstreamable.  We definitely want to
restrict this to the rebalance path, but it should be good enough for
testing.

Thanks.

diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 9158e5a..0b4739f 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -81,7 +81,7 @@ static void pcpu_free_pages(struct pcpu_chunk *chunk,
 static int pcpu_alloc_pages(struct pcpu_chunk *chunk,
struct page **pages, int page_start, int page_end)
 {
-   const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM;
+   const gfp_t gfp = GFP_KERNEL | __GFP_HIGHMEM | __GFP_NORETRY;
unsigned int cpu, tcpu;
int i;
 


Re: lost connection to test machine (4)

2018-02-12 Thread Tejun Heo
Hello, Daniel.

On Mon, Feb 12, 2018 at 06:00:13PM +0100, Daniel Borkmann wrote:
> [ +Dennis, +Tejun ]
> 
> Looks like we're stuck in percpu allocator with key/value size of 4 bytes
> each and large number of entries (max_entries) in the reproducer in above
> link.
> 
> Could we have some __GFP_NORETRY semantics and let allocations fail instead
> of triggering OOM killer?

For some part, maybe, but not generally.  The virt area allocation
goes down to page table allocation which is hard coded to use
GFP_KERNEL in arch mm code.

Thanks.

-- 
tejun


Re: [PATCH net 0/3] Fix for BPF devmap percpu allocation splat

2017-10-21 Thread Tejun Heo
Hello,

On Wed, Oct 18, 2017 at 04:45:08PM -0500, Dennis Zhou wrote:
> I'm not sure I see the reason we can't match the minimum allocation size
> with the unit size? It seems weird to arbitrate the maximum allocation
> size given a lower bound on the unit size.

idk, it seems weird for the maximum allowed allocation size to vary
widely depending on how the machine boots up.

Thanks.

-- 
tejun


Re: [PATCH net 0/3] Fix for BPF devmap percpu allocation splat

2017-10-18 Thread Tejun Heo
Hello, Daniel.

(cc'ing Dennis)

On Tue, Oct 17, 2017 at 04:55:51PM +0200, Daniel Borkmann wrote:
> The set fixes a splat in devmap percpu allocation when we alloc
> the flush bitmap. Patch 1 is a prerequisite for the fix in patch 2,
> patch 1 is rather small, so if this could be routed via -net, for
> example, with Tejun's Ack that would be good. Patch 3 gets rid of
> remaining PCPU_MIN_UNIT_SIZE checks, which are percpu allocator
> internals and should not be used.
> 
> Thanks!
> 
> Daniel Borkmann (3):
>   mm, percpu: add support for __GFP_NOWARN flag

This looks fine.

>   bpf: fix splat for illegal devmap percpu allocation
>   bpf: do not test for PCPU_MIN_UNIT_SIZE before percpu allocations

These look okay too but if it helps percpu allocator can expose the
maximum size / alignment supported to take out the guessing game too.

Also, the reason why PCPU_MIN_UNIT_SIZE is what it is is because
nobody needed anything bigger.  Increasing the size doesn't really
cost much at least on 64bit archs.  Is that something we want to be
considering?

Thanks.

-- 
tejun


Re: [PATCH net-next] ipv6: avoid zeroing per cpu data again

2017-10-09 Thread Tejun Heo
On Mon, Oct 09, 2017 at 06:01:37AM -0700, Eric Dumazet wrote:
> From: Eric Dumazet <eduma...@google.com>
> 
> per cpu allocations are already zeroed, no need to clear them again.
> 
> Fixes: d52d3997f843f ("ipv6: Create percpu rt6_info")
> Signed-off-by: Eric Dumazet <eduma...@google.com>
> Cc: Martin KaFai Lau <ka...@fb.com>
> Cc: Tejun Heo <t...@kernel.org>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: EBPF-triggered WARNING at mm/percpu.c:1361 in v4-14-rc2

2017-09-28 Thread Tejun Heo
Hello,

On Thu, Sep 28, 2017 at 03:45:38PM +0100, Mark Rutland wrote:
> > Perhaps the pr_warn() should be ratelimited; or could there be an
> > option where we only return NULL, not triggering a warn at all (which
> > would likely be what callers might do anyway when checking against
> > PCPU_MIN_UNIT_SIZE and then bailing out)?
> 
> Those both make sense to me; checking __GFP_NOWARN should be easy
> enough.

That also makes sense.
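
Something like the following in pcpu_alloc() would cover the
__GFP_NOWARN part (sketch only; exact message and placement up for
grabs):

	if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE ||
		     !is_power_of_2(align))) {
		WARN(!(gfp & __GFP_NOWARN),
		     "illegal size (%zu) or align (%zu) for percpu allocation\n",
		     size, align);
		return NULL;
	}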

> Just to check, do you think that dev_map_alloc() should explicitly test
> the size against PCPU_MIN_UNIT_SIZE, prior to calling pcpu_alloc()?

But let's please not do this.

Thanks.

-- 
tejun


Re: EBPF-triggered WARNING at mm/percpu.c:1361 in v4-14-rc2

2017-09-28 Thread Tejun Heo
Hello,

On Thu, Sep 28, 2017 at 12:27:28PM +0100, Mark Rutland wrote:
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 59d44d6..f731c45 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -1355,8 +1355,13 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
> bits = size >> PCPU_MIN_ALLOC_SHIFT;
> bit_align = align >> PCPU_MIN_ALLOC_SHIFT;
>  
> -   if (unlikely(!size || size > PCPU_MIN_UNIT_SIZE || align > PAGE_SIZE ||
> -!is_power_of_2(align))) {
> +   if (unlikely(size > PCPU_MIN_UNIT_SIZE)) {
> +   pr_warn("cannot allocate pcpu chunk of size %zu (max %zu)\n",
> +   size, PCPU_MIN_UNIT_SIZE);

WARN_ONCE() probably is the better choice here.  We wanna know who
tries to allocate larger than the supported size and increase the size
limit if warranted.
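
IOW, something along these lines (sketch):

	if (unlikely(size > PCPU_MIN_UNIT_SIZE)) {
		WARN_ONCE(1, "cannot allocate pcpu chunk of size %zu (max %zu)\n",
			  size, PCPU_MIN_UNIT_SIZE);
		return NULL;
	}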

Thanks.

-- 
tejun


Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters

2017-09-01 Thread Tejun Heo
Hello, Alexei.

On Thu, Aug 31, 2017 at 08:27:56PM -0700, Alexei Starovoitov wrote:
> > > The 2 flags are completely independent. The existing override logic is
> > > unchanged. If a program can not be overridden, then the new recursive
> > > flag is irrelevant.
> > 
> > I'm not sure all four combos of the two flags make sense.  Can't we
> > have something simpler like the following?
> > 
> > 1. None: No further bpf programs allowed in the subtree.
> > 
> > 2. Overridable: If a sub-cgroup installs the same bpf program, this
> >one yields to that one.
> > 
> > 3. Recursive: If a sub-cgroup installs the same bpf program, that
> >cgroup program gets run in addition to this one.
> > 
> > Note that we can have combinations of overridables and recursives -
> > both allow further programs in the sub-hierarchy and the only
> > distinction is how that specific program behaves when that
> > happens.
> 
> If I understand the proposal correctly in case of:
> A (with recurs) -> B (with override) -> C (with recurse) -> D (with override)
> when something happens in D, you propose to run D,C,A  ?

Yes, B gets overridden by C, so the effective programs are A, C and D.

> With the order of execution from children to parent?

Hmm... I'm not sure about the execution ordering.  How these programs
chain would be dependent on the type of the program, right?  Would we
be able to use the same chaining order for all types of programs?

> That's a bit a different then what I was proposing with 'multi-prog' flag,
> but the more I think about it the more I like it.

Great.

> In your case detach is sort of transparent to everything around.
> And you would also allow to say 'None' to one of the subtrees too, right?
> So something like:
> A (with recurs) -> B (with override) -> C (with recurse) -> D (None) -> E
> would mean that E cannot attach anything and events in E will
> call D->C->A, right?

Yeap.

Thanks.

-- 
tejun


Re: [PATCH v2 net-next 1/8] bpf: Add support for recursively running cgroup sock filters

2017-08-31 Thread Tejun Heo
Hello, David, Alexei.

Sorry about late reply.

On Sun, Aug 27, 2017 at 08:49:23AM -0600, David Ahern wrote:
> On 8/25/17 8:49 PM, Alexei Starovoitov wrote:
> > 
> >> +  if (prog && curr_recursive && !new_recursive)
> >> +  /* if a parent has recursive prog attached, only
> >> +   * allow recursive programs in descendent cgroup
> >> +   */
> >> +  return -EINVAL;
> >> +
> >>old_prog = cgrp->bpf.prog[type];
> > 
> > ... I'm struggling to completely understand how it interacts
> > with BPF_F_ALLOW_OVERRIDE.
> 
> The 2 flags are completely independent. The existing override logic is
> unchanged. If a program can not be overridden, then the new recursive
> flag is irrelevant.

I'm not sure all four combos of the two flags make sense.  Can't we
have something simpler like the following?

1. None: No further bpf programs allowed in the subtree.

2. Overridable: If a sub-cgroup installs the same bpf program, this
   one yields to that one.

3. Recursive: If a sub-cgroup installs the same bpf program, that
   cgroup program gets run in addition to this one.

Note that we can have combinations of overridables and recursives -
both allow further programs in the sub-hierarchy and the only
distinction is how that specific program behaves when that
happens.

Thanks.

-- 
tejun


Re: Lots of new warnings with gcc-7.1.1

2017-07-15 Thread Tejun Heo
Hello,

On Wed, Jul 12, 2017 at 03:31:02PM +0200, Arnd Bergmann wrote:
> > We also have about a bazillion
> >
> > warning: ‘*’ in boolean context, suggest ‘&&’ instead
> >
> > warnings in drivers/ata/libata-core.c, all due to a single macro that
> > uses a pattern that gcc-7.1.1 doesn't like. The warning looks a bit
> > debatable, but I suspect the macro could easily be changed too.
> >
> > Tejun, would you hate just moving the "multiply by 1000" part _into_
> > that EZ() macro? Something like the attached (UNTESTED!) patch?
> 
> Tejun applied an almost identical patch of mine a while ago, but it seems to
> have gotten lost in the meantime in some rebase:

Yeah, I was scratching my head remembering your patch.  Sorry about
that.  It should have been routed through for-4.12-fixes.

> https://patchwork.kernel.org/patch/9721397/
> https://patchwork.kernel.org/patch/9721399/
> 
> I guess I should have resubmitted the second patch with the suggested
> improvement.

The new one looks good to me.
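
For anyone else chasing the same warnings: the pattern gcc-7 objects to
is a multiplication used directly as the ?: condition via libata's EZ()
macro, roughly like this (illustrative, from memory; see the patchwork
links above for the actual patch):

	#define ENOUGH(v, unit)	(((v)-1)/(unit)+1)
	#define EZ(v, unit)	((v)?ENOUGH(v, unit):0)

	q->setup = EZ(t->setup * 1000, T);	/* "'*' in boolean context" */

	/* moving the "* 1000" into the macro keeps the condition a plain value */
	#define EZ(v, unit)	((v)?ENOUGH((v) * 1000, unit):0)
	q->setup = EZ(t->setup, T);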

Thanks.

-- 
tejun


Re: [PATCH 02/11] dma-mapping: replace dmam_alloc_noncoherent with dmam_alloc_attrs

2017-06-26 Thread Tejun Heo
On Mon, Jun 26, 2017 at 12:07:30AM -0700, Christoph Hellwig wrote:
> Tejun, does this look ok to you?

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH 01/11] dma-mapping: remove dmam_free_noncoherent

2017-06-26 Thread Tejun Heo
On Mon, Jun 26, 2017 at 12:07:15AM -0700, Christoph Hellwig wrote:
> Tejun, does this look ok to you?

Sure,

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


[PATCH v4.11] cgroup, net_cls: iterate the fds of only the tasks which are being migrated

2017-03-14 Thread Tejun Heo
The net_cls controller controls the classid field of each socket which
is associated with the cgroup.  Because the classid is per-socket
attribute, when a task migrates to another cgroup or the configured
classid of the cgroup changes, the controller needs to walk all
sockets and update the classid value, which was implemented by
3b13758f51de ("cgroups: Allow dynamically changing net_classid").

While the approach is not scalable, migrating tasks which have a lot
of fds attached to them is rare and the cost is borne by the ones
initiating the operations.  However, for simplicity, both the
migration and classid config change paths call update_classid() which
scans all fds of all tasks in the target css.  This is an overkill for
the migration path which only needs to cover a much smaller subset of
tasks which are actually getting migrated in.

On cgroup v1, this can lead to unexpected scalability issues when one
tries to migrate a task or process into a net_cls cgroup which already
contains a lot of fds.  Even if the migration target doesn't have many
to get scanned, update_classid() ends up scanning all fds in the
target cgroup which can be extremely numerous.

Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
even worse.  Before bfc2cf6f61fc ("cgroup: call subsys->*attach() only
for subsystems which are actually affected by migration"), cgroup core
would call the ->css_attach callback even for controllers which don't
see actual migration to a different css.

As net_cls is always disabled but still mounted on cgroup v2, whenever
a process is migrated on the cgroup v2 hierarchy, net_cls sees
identity migration from root to root and cgroup core used to call
->css_attach callback for those.  The net_cls ->css_attach ends up
calling update_classid() on the root net_cls css to which all
processes on the system belong, as the controller isn't used.  This
makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
which is horrible and easily leads to noticeable stalls triggering RCU
stall warnings and so on.

The worst symptom is already fixed in upstream by bfc2cf6f61fc
("cgroup: call subsys->*attach() only for subsystems which are
actually affected by migration"); however, backporting that commit is
too invasive and we want to avoid other cases too.

This patch updates net_cls's cgrp_attach() to iterate fds of only the
processes which are actually getting migrated.  This removes the
surprising migration cost which is dependent on the total number of
fds in the target cgroup.  As this leaves write_classid() the only
user of update_classid(), open-code the helper into write_classid().

Reported-by: David Goode <dgo...@fb.com>
Fixes: 3b13758f51de ("cgroups: Allow dynamically changing net_classid")
Cc: sta...@vger.kernel.org # v4.4+
Cc: Nina Schiff <nin...@fb.com>
Cc: David S. Miller <da...@davemloft.net>
Signed-off-by: Tejun Heo <t...@kernel.org>
---
Hello, Dave.

Can you please route this fix for v4.11?  I can also take it through
cgroup/for-4.11-fixes, if that's preferable.

Thanks.

 net/core/netclassid_cgroup.c |   32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -69,27 +69,17 @@ static int update_classid_sock(const voi
return 0;
 }
 
-static void update_classid(struct cgroup_subsys_state *css, void *v)
+static void cgrp_attach(struct cgroup_taskset *tset)
 {
-   struct css_task_iter it;
+   struct cgroup_subsys_state *css;
struct task_struct *p;
 
-   css_task_iter_start(css, &it);
-   while ((p = css_task_iter_next(&it))) {
+   cgroup_taskset_for_each(p, css, tset) {
task_lock(p);
-   iterate_fd(p->files, 0, update_classid_sock, v);
+   iterate_fd(p->files, 0, update_classid_sock,
+  (void *)(unsigned long)css_cls_state(css)->classid);
task_unlock(p);
}
-   css_task_iter_end(&it);
-}
-
-static void cgrp_attach(struct cgroup_taskset *tset)
-{
-   struct cgroup_subsys_state *css;
-
-   cgroup_taskset_first(tset, &css);
-   update_classid(css,
-  (void *)(unsigned long)css_cls_state(css)->classid);
 }
 
 static u64 read_classid(struct cgroup_subsys_state *css, struct cftype *cft)
@@ -101,12 +91,22 @@ static int write_classid(struct cgroup_s
 u64 value)
 {
struct cgroup_cls_state *cs = css_cls_state(css);
+   struct css_task_iter it;
+   struct task_struct *p;
 
cgroup_sk_alloc_disable();
 
cs->classid = (u32)value;
 
-   update_classid(css, (void *)(unsigned long)cs->classid);
+   css_task_iter_start(css, &it);
+   while ((p = css_task_iter_next(&it))) {
+   task_lock(p);
+   iterate_fd(p->files, 0, update_classid

Re: Fw: [Bug 193911] New: net_prio.ifpriomap is not aware of the network namespace, and discloses all network interface

2017-02-21 Thread Tejun Heo
Hello,

On Mon, Feb 13, 2017 at 08:04:57AM +1300, Eric W. Biederman wrote:
> > Yeah, the whole thing never considered netns or delegation.  Maybe the
> > read function itself should probably filter on the namespace of the
> > reader?  I'm not completely sure whether trying to fix it won't cause
> > some of existing use cases to break.  Eric, what do you think?
> 
> Apologies for the delay I just made it back from vacation.

Hope you enjoyed the vacation.

> There are cases where we do look at the reader/opener of the  file, and
> it is a pain, almost always the best policy is to have the context fixed
> at mount time.
> 
> I don't see an obvious answer of what better semantics for this file
> should be.  Perhaps Docker can mount over this file on older kernels?
> 
> The namespace primitives that people build containers out of were never
> guaranteed not to leak the fact that you are in a container.  So a small
> essentially harmless information leak is not something I panic about.
> It is the setting up of the container itself that must know what the
> primitives do to ensure that leaks don't happen, if you want to avoid leaks.
> 
> That said if this controller/file does not consider netns and delegation
> I suspect the right thing to do is put it under CONFIG_BROKEN or
> possibly
> CONFIG_I_REALLY_NEED_THIS_SILLY_CODE_FOR_BACKWARDS_COMPATIBILITY
> aka CONFIG_STAGING and let the code age out of the kernel there.

I see.  Yeah, it is broken in terms of ns support.  Marking it BROKEN
from the config seems a bit drastic tho.

> If someone actually cares about this code and wants to fix it to do the
> something reasonable and is willing to dig through all of the subtleties
> I can help with that.  I may be wrong but the code feels like something
> that just isn't interesting enough to make it worth fixing.

The network part of cgroup v2 can do the same thing, doesn't have
these issues and can be used in conjunction with cgroup v1, so we
already have a way out of this.

I don't think we need to take an active action on it at this point.
People who need namespace support can adopt cgroup v2 without breaking
other things and I'm not sure there's enough benefit in marking the v1
features BROKEN at this point.

Thanks.

-- 
tejun


Re: [PATCH v2 net] bpf: introduce BPF_F_ALLOW_OVERRIDE flag

2017-02-11 Thread Tejun Heo
On Fri, Feb 10, 2017 at 08:28:24PM -0800, Alexei Starovoitov wrote:
> If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
> to the given cgroup the descendent cgroup will be able to override
> effective bpf program that was inherited from this cgroup.
> By default it's not passed, therefore override is disallowed.
> 
> Examples:
> 1.
> prog X attached to /A with default
> prog Y fails to attach to /A/B and /A/B/C
> Everything under /A runs prog X
> 
> 2.
> prog X attached to /A with allow_override.
> prog Y fails to attach to /A/B with default (non-override)
> prog M attached to /A/B with allow_override.
> Everything under /A/B runs prog M only.
> 
> 3.
> prog X attached to /A with allow_override.
> prog Y fails to attach to /A with default.
> The user has to detach first to switch the mode.
> 
> In the future this behavior may be extended with a chain of
> non-overridable programs.
> 
> Also fix the bug where detach from cgroup where nothing is attached
> was not throwing error. Return ENOENT in such case.
> 
> Add several testcases and adjust libbpf.
> 
> Fixes: 3007098494be ("cgroup: add support for eBPF programs")
> Signed-off-by: Alexei Starovoitov <a...@kernel.org>

The cgroup part looks good to me.  Please feel free to add

Acked-by: Tejun Heo <t...@kernel.org>

One question tho.  Is there a specific reason to disallow attaching
!overridable program under an overridable one?  Isn't disallowing
attaching programs if the closest ancestor is !overridable enough?

Thanks.

-- 
tejun


Re: Fw: [Bug 193911] New: net_prio.ifpriomap is not aware of the network namespace, and discloses all network interface

2017-02-06 Thread Tejun Heo
Hello,

On Sun, Feb 05, 2017 at 11:05:36PM -0800, Cong Wang wrote:
> > To be more specific, the read operation of net_prio.ifpriomap is handled by 
> > the
> > function read_priomap. Tracing from this function, we can find it invokes
> > for_each_netdev_rcu and set the first parameter as the address of init_net. 
> > It
> > iterates all network devices of the host regardless of the network 
> > namespace.
> > Thus, from the view of a container, it can read the names of all network
> > devices of the host.
> 
> I think that is probably because cgroup files don't provide a net pointer
> for the context, if so we probably need some API similar to
> class_create_file_ns().

Yeah, the whole thing never considered netns or delegation.  Maybe the
read function itself should probably filter on the namespace of the
reader?  I'm not completely sure whether trying to fix it won't cause
some of existing use cases to break.  Eric, what do you think?

Thanks.

-- 
tejun


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-28 Thread Tejun Heo
Hello, Eric.

On Thu, Jan 26, 2017 at 01:45:07PM +1300, Eric W. Biederman wrote:
> > Eric, does this sound okay to you?  You're the authority on exposing
> > things like namespace ids to users.
> 
> *Boggle*  Things that run across all network namespaces break any kind
>  of sense I have about thinking about them.
> 
> Running across more than one network namespace by default seems very
> broken to me.

Can you explain why that is?  Other namespaces don't behave this way.
For example, a PID namespace doesn't hide the processes at the system
level.  It just gives additional nested names to the namespaced
processes and having objects visible at the system level is very
useful for monitoring and management.  Are there inherent reasons why
network namespace should be very different from other namespaces in
this regard?

Thanks.

-- 
tejun


Re: [PATCH v2] bpf: Restrict cgroup bpf hooks to the init netns

2017-01-24 Thread Tejun Heo
Hello, Andy.

On Tue, Jan 24, 2017 at 10:54:03AM -0800, Andy Lutomirski wrote:
> Tejun, I can see two basic ways that cgroup+bpf delegation could work
> down the road.  Both depend on unprivileged users being able to load
> BPF programs of the correct type, but that's mostly a matter of
> auditing the code and doesn't have any particularly interesting design
> challenges as far as I know.
> 
> Approach 1: Scope cgroup+bpf hooks to a netns and require
> ns_capable(CAP_NET_ADMIN), fs permissions, and (optionally) delegation
> to be enabled to install them.  This shouldn't have any particularly
> dangerous security implications because ns_capable(CAP_NET_ADMIN)
> means you already own the netns.
> 
> Approach 2: Keep cgroup+bpf hooks global.  This makes the delegation
> story much trickier.  There needs to be some mechanism to prevent some
> program that has delegated cgroup control from affecting the behavior
> of outside programs.  This isn't so easy.  Imagine that you've
> delegated /cgroup/foo to UID 1000.  A program with UID 1000 can now
> install a hook on /cgroup/foo, but there needs to be a mechanism to
> prevent UID 1000 from running in /cgroup/foo in the global netns and
> then running a tool like sudo.  This *might* be possible using just
> existing cgroup mechanisms, but it's going to be quite tricky to make
> sure it's done completely correctly and without dangerous races.
> 
> If cgroup+bpf stays global, then I think you're basically committing
> to approach 2.
> 
> I would suggest doing approach 1 (i.e. apply my patch) and then, if
> truly needed for some use case, add an option so that globally
> privileged programs can create hooks that affect all namespaces.

I thought about restricting according to namespaces and your ifindex
example.  While the ifindex issue seems broken and on the surface
seems to indicate that we need to segregate cgroup bpf programs per
netns as iptables does, the more I think about it, the less convinced
I'm that that is the only right way forward.

I'm not too familiar with netns but if you think about other
namespaces, !root namespaces nest inside the root one.  When you enter
a pid namespace, it adds its pid namespace on top of the global one.
It doesn't replace that, and that is an important characteristic which
allows the root namespace to maintain control over the nested ones.
This is the same with cgroupns, which basically is just nesting, or
mount namespace.  Namespaced objects don't disappear in the eyes of
the root namespace.  They just get additional scoped names.

Back to netns, if I'm understanding correctly, this doesn't seem to be
the case.  An interface belongs to a namespace and can be moved around
but it disappears from the root namespace once it enters a non-root
one.  I guess there are reasons, even if historical, for it but this
is what is breaking your ifindex example.  It's fine for an interface
to get a new scoped ifindex inside a netns, but that shouldn't
override the global one.  This, I think, should be fixable without
breaking existing APIs by introducing a new global index.

My opinion on this isn't very strong but what iptables is doing
actually seems wrong to me in that it actually precludes any way of
implementing global policies which encompasses the whole system.
Again, this is a fixable problem if ever necessary by introducing
properly global rule tables or even just adding a flag.

So, I don't think the situation is as clear as "ifindex is broken, so
cgroup bpf programs should be segregated per netns".

David, in another reply, you mentioned about fixing ifindex.  If I
understood correctly, we're talking about the same thing, right?  If
so, is this something easy to implement?

Thanks.

-- 
tejun


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-18 Thread Tejun Heo
Hello, Andy.

On Wed, Jan 18, 2017 at 04:18:04PM -0800, Andy Lutomirski wrote:
> To map cgroup -> hook, a simple array in each cgroup structure works.
> To map (cgroup, netns) -> hook function, the natural approach would be
> to have some kind of hash, and that would be slower.  This code is
> intended to be *fast*.

Updating these programs isn't a frequent operation.  We can easily
calculate a perfect (or acceptable) hash per-cgroup and rcu swap the
new hashtable.
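
Rough shape of what I mean, assuming a hypothetical per-cgroup table
pointer (cgrp->bpf_tbl and build_tbl() are made-up names here):

	struct bpf_tbl *new, *old;

	new = build_tbl(cgrp);				/* hypothetical */
	old = rcu_dereference_protected(cgrp->bpf_tbl,
					lockdep_is_held(&cgroup_mutex));
	rcu_assign_pointer(cgrp->bpf_tbl, new);		/* publish new table */
	synchronize_rcu();				/* wait out old readers */
	kfree(old);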

> I suppose you could have each cgroup structure contain an array where
> each element is (netns, hook function) and just skip the ones that are
> the wrong netns.  Perhaps it would be rare to have many of those.

Yeah, or maybe even that.  It just isn't a fundamentally difficult
problem.

> I think it's currently a broken cgroup feature.  I think it could be
> made into a properly working cgroup feature (which I favor) or a
> properly working net feature.  Can you articulate why you prefer
> having it be a net feature instead of a cgroup feature?  I tend to

I thought that's what I was doing in the previous message.

> favor making it be a working cgroup feature (by giving it a file or
> files in cgroupfs and maybe even a controller) because the result
> would have very obvious semantics.

I'm fine with both directions but one seems far out.

> But maybe this is just a false dichotomy.  Could this feature be a
> per-netns configuration where you can give instructions like "run this
> hook if you're in such-and-such netns and in a given cgroup or one of
> its descendents"?  This would prevent it from being a direct analogue
> to systemd's RestrictAddressFamilies= option, but that may be okay.
> This would side-step issues in the current code where a hook can't
> rely on ifindex meaning what the hook thinks it means.

How is this different from making the current code netns aware?

> Actually, I think I like that approach.  What if it we had a "socket"
> controller and files like "socket.create_socket", "socket.ingress" and
> "socket.egress"?  You could use the bpf() syscall to install a bpf
> filter into "socket.egress" and that filter would filter egress for
> the target cgroup and its descendents on the current netns.  As a
> first pass, the netns issue could be sidestepped by making it only
> work in the root netns (i.e. the bpf() call would fail if you're not
> in the root netns and the hooks wouldn't run if you're not in the root
> netns.)  It *might* even be possible to retrofit in the socket
> controller by saying that the default hierarchy is used if the socket
> controller isn't mounted.

I don't know.  You're throwing out too many ideas too fast and it's
difficult to keep up with what the differences are between those
ideas.  But if we're doing cgroup controllers, shouldn't cgroup ns
support be sufficient?  We can consider the product of cgroup and net
namespaces but that doesn't seem necessary given that people usually
set up these namespaces in conjunction.

> What *isn't* possible to cleanly fix after the fact is the current
> semantics that cgroup hooks override the hooks in their ancestors.
> IMO that is simply broken.  The example you gave (perf_event) is very
> careful not to have this problem.

That's like saying installing an iptables rule for a more specific
target is broken.  As a cgroup controller, it is not an acceptable
behavior given how delegation works.  As something similar to
iptables, it is completely fine.

> > * My impression with bpf is that while delegation is something
> >   theoretically possible it is not something which is gonna take place
> >   in any immediate time frame.  If I'm wrong on this, please feel free
> >   to correct me.
> 
> But the issue isn't *BPF* delegation.  It's cgroup delegation or netns
> creation, both of which exist today.

No, the issue is bpf delegation.  If bpf were fully delegatable in
practical sense, we could just do straight forward cgroup bpf
controller.  Well, we'll have to think about how to chain the programs
which would differ on program type but that shouldn't be too hard.

> > * There are a lot of use cases which can immediately take advantage of
> >   cgroup-aware bpf programs operating as proposed.  The model is
> >   semantically equivalent to iptables (let's address the netns part if
> >   that's an issue) which net people are familiar with.
> 
> That sounds like a great argument for those users to patch their
> kernels to gain experience with this feature.

I don't get why this would particularly point to out-of-tree usage.
Why is that?

> > * It isn't exclusive with adding cgroup bpf controller down the road
> >   if necessary and practical.  Sure, this isn't the perfect situation
> >   but it seems like an acceptable trade-off to me.  What ever is
> >   perfect?
> 
> I think it more or less is exclusive.  I can imagine davem accepting a
> patch to make the sock_cgroup_data think point at a new "socket"
> cgroup type.  I can't imagine him accepting a patch to 

Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-18 Thread Tejun Heo
Hello, Andy.

On Mon, Jan 16, 2017 at 09:18:36PM -0800, Andy Lutomirski wrote:
> Perhaps this is a net feature, though, not a cgroup feature.  This
> would IMO make a certain amount of sense.  Iptables cgroup matches,
> for example, logically are an iptables (i.e., net) feature.  The

Yeap.

> problem here is that network configuration (and this explicitly
> includes iptables) is done on a per-netns level, whereas these new
> hooks entirely ignore network namespaces.  I've pointed out that
> trying to enable them in a namespaced world is problematic (e.g.
> switching to ns_capable() will cause serious problems), and Alexei
> seems to think that this will never happen.  So I don't think we can
> really call this a net feature that works the way that other net
> features work.
> 
> (Suppose that this actually tried to be netns-enabled.  Then you'd
> have to map from (netns, cgroup) -> hook, and this would probably be
> slow and messy.)

I'm not sure how important netns support for these bpf programs
actually is but I have no objection to making it behave in an
equivalent manner to iptables where appropriate.  This is kinda
trivial to do too.  For now, we can simply not run the programs if the
associated socket belongs to non-root netns.  Later, if necessary, we
can add per-ns bpf programs.  And I can't think of a reason why
(netns, cgroup) -> function mapping would be slow and messy.  Why
would that be?

> So I think that leaves it being a cgroup feature.  And it definitely
> does *not* play by the rules of cgroups right now.

Because this in no way is a cgroup feature.  As you wrote above, it is
something similar to iptables with lacking netns considerations.
Let's address that and make it a proper network thing.

> > I'm sure we'll
> > eventually get there but from what I hear it isn't something we can
> > pull off in a restricted timeframe.
> 
> To me, this sounds like "the API isn't that great, we know how to do
> better, but we really really want this feature ASAP so we want to ship
> it with a sub-par API."  I think that's a bad idea.

I see your point but this isn't something which is black and white.
There are arguments for both directions.  I'm leaning the other
direction because

* My impression with bpf is that while delegation is something
  theoretically possible it is not something which is gonna take place
  in any immediate time frame.  If I'm wrong on this, please feel free
  to correct me.

* There are a lot of use cases which can immediately take advantage of
  cgroup-aware bpf programs operating as proposed.  The model is
  semantically equivalent to iptables (let's address the netns part if
  that's an issue) which net people are familiar with.

* It isn't exclusive with adding cgroup bpf controller down the road
  if necessary and practical.  Sure, this isn't the perfect situation
  but it seems like an acceptable trade-off to me.  What ever is
  perfect?

> > This also holds true for the perf controller.  While it is implemented
> > as a controller, it isn't visible to cgroup users in any way and the
> > only function it serves is providing the membership test to perf
> > subsystem.  perf is the one which decides whether and how it is to be
> > used.  cgroup providing membership test to other subsystems is
> > completely acceptable and established.
> 
> Unless I'm mistaken, "perf_event" is an actual cgroup controller, and

Yeah, it is implemented as a controller but in essence what it does is
just tagging the hierarchy to tell perf to use that hierarchy for
membership test purposes.

> it explicitly respects the cgroup hierarchy.  It shows up in
> /proc/cgroups, and I had no problem mounting a cgroupfs instance with
> perf_event enabled.  So I'm not sure what you mean.

That all it's doing is providing membership information.

Thanks.

-- 
tejun


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-18 Thread Tejun Heo
Hello, Michal.

On Tue, Jan 17, 2017 at 02:58:30PM +0100, Michal Hocko wrote:
> This would require using hierarchical cgroup iterators to iterate over

It does behave hierarchically.

> tasks. As per Andy's testing this doesn't seem to be the case. I haven't

That's not what Andy's testing showed.  What that showed was that
program in a child can override the one from its ancestor.

> checked the implementation closely but my understanding was that using
> only cgroup specific tasks was intentional.

Maybe I'm misreading you but if you're saying that a bpf program would
only execute for tasks on that specific cgroup, that is not true.

> I do agree that using hierarchy aware cgroup iterators is the right
> approach here and we wouldn't see any issue. But I am still not sure
> I've wrapped my head around this feature completely.

It's really simple.  You can install bpf programs with a cgroup path
associated with them.  When that particular type of bpf program needs
to be executed, the bpf program which is attached to the closest
ancestor (including self) is executed.
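
Conceptually (sketch only, task_cgrp and run_prog are placeholders and
the actual implementation differs):

	for (cgrp = task_cgrp; cgrp; cgrp = cgroup_parent(cgrp)) {
		if (cgrp->bpf.prog[type]) {
			run_prog(cgrp->bpf.prog[type], ...);	/* closest wins */
			break;
		}
	}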

Thanks.

-- 
tejun


Re: Potential issues (security and otherwise) with the current cgroup-bpf API

2017-01-15 Thread Tejun Heo
Hello,

Sorry about the delay.  Some fire fighting followed the holidays.

On Tue, Jan 03, 2017 at 11:25:59AM +0100, Michal Hocko wrote:
> > So from what I understand the proposed cgroup is not in fact
> > hierarchical at all.
> > 
> > @TJ, I thought you were enforcing all new cgroups to be properly
> > hierarchical, that would very much include this one.
> 
> I would be interested in that as well. We have made that mistake in
> memcg v1 where hierarchy could be disabled for performance reasons and
> that turned out to be major PITA in the end. Why do we want to repeat
> the same mistake here?

Across the different threads on this subject, there have been multiple
explanations but I'll try to sum it up more clearly.

The big issue here is whether this is a cgroup thing or a bpf thing.
I don't think there's anything inherently wrong with one approach or
the other.  Forget about the proposed cgroup bpf extensions but think
about how iptables does cgroups.  Whether it's the netcls/netprio in
v1 or direct membership matching in v2, it is the network side testing
for cgroup membership one way or the other.  The only part where
cgroup is involved in is answering that test.

This also holds true for the perf controller.  While it is implemented
as a controller, it isn't visible to cgroup users in any way and the
only function it serves is providing the membership test to perf
subsystem.  perf is the one which decides whether and how it is to be
used.  cgroup providing membership test to other subsystems is
completely acceptable and established.

Now coming back to bpf, the current implementation is just that.
Sure, cgroup hosts the rules in its data structures but that isn't
something conceptually relevant.  We might as well implement it as a
prefixed hash table from bpf side.  Having pointers in struct cgroup
is just a more efficient and easier way of achieving the same result.
In fact, IIUC, this whole thing was born out of discussions around
implementing scalable cgroup membership matching from bpf programs.

So, what's proposed is a proper part of bpf.  In terms of
implementation, cgroup helps by hosting the pointers but that doesn't
necessarily affect the conceptual structure of it.  Given that, I
don't think it'd be a good idea to add anything to cgroup interface
for this feature.  Introspection is great to have but this should be
introspectable together with other bpf programs using the same
mechanism.  That's where it belongs.

None of the issues that people have been raising here is actually an
issue if one thinks of it as a part of bpf.  Its security model is
exactly the same as any other bpf programs.  Recursive behavior is
exactly the same as how other external cgroup descendant membership
testing work.  There is no issue here whatsoever.

Now, I'm not claiming that a bpf mechanism which is a proper part of
cgrou isn't attractive.  It is, especially with delegation; however,
that is also where we don't quite know how to proceed.  This doesn't
have much to do with cgroup.  If something is delegatable to non-priv
users and scoped, cgroup's fine with it and if that's not possible it
simply isn't something which is delegatable and putting it on cgroup
doesn't change that.

I'm far from being a bpf expert, so I could be wrong here, but I don't
think there's anything fundamental which prevents bpf from being
delegatable but at the same time bpf is something which is extremely
flexible and nobody really thought about or worked that much on
delegating bpf. If there's enough need for it, I'm sure we'll
eventually get there but from what I hear it isn't something we can
pull off in a restricted timeframe.

There's nothing which makes the currently implemented mechanism
exclusive with a cgroup controller based one.  The hooks are the
expensive part but can be shared, the rest is just about which
programs to execute in what order and how they should be chained.

There are a lot of immediate use cases which can benefit from the
proposed cgroup bpf mechanism and they're all fine with it being a
part of bpf and behaving like any other network mechanism behaves in
terms of configuration and delegation.  I don't see a reason why we
would hold back on merging this.  All the raised issues are coming
from confusing this as a part of cgroup.  It isn't.  It is a part of
bpf.  If we want a bpf cgroup controller, great, but that is a
separate thing.

Thanks.

-- 
tejun


Re: [RESEND][PATCH v4] cgroup: Use CAP_SYS_RESOURCE to allow a process to migrate other tasks between cgroups

2016-12-09 Thread Tejun Heo
Hello, John.

On Thu, Dec 08, 2016 at 09:39:38PM -0800, John Stultz wrote:
> So just to clarify the discussion for my purposes and make sure I
> understood, per-cgroup CAP rules was not desired, and instead we
> should either utilize an existing cap (are there still objections to
> CAP_SYS_RESOURCE? - this isn't clear to me) or create a new one (ie,
> bring back the older CAP_CGROUP_MIGRATE patch).

Let's create a new one.  It looks to be a bit too different to share
with an existing one.

> Tejun: Do you have a more finished version of your patch that I should
> add my changes on top of?

Oh, just submit the patch on top of the current for-next.  I can queue
mine on top of yours.  They are mostly orthogonal.

Thanks.

-- 
tejun


Re: [RESEND][PATCH v4] cgroup: Use CAP_SYS_RESOURCE to allow a process to migrate other tasks between cgroups

2016-12-06 Thread Tejun Heo
Hello,

On Tue, Dec 06, 2016 at 10:13:53AM -0800, Andy Lutomirski wrote:
> > Delegation is an explicit operation and reflected in the ownership of
> > the subdirectories and cgroup interface files in them.  The
> > subhierarchy containment is achieved by requiring the user who's
> > trying to migrate a process to have write perm on cgroup.procs on the
> > common ancestor of the source and target in addition to the target.
> 
> OK, I see what you're doing.  That's interesting.

It's something born out of usage of cgroup v1.  People used it that
way (chowning files and directories) and combined with the uid checks
it yielded something which is useful sometimes, but it always had
issues with hierarchical behaviors, which files to chmod and the weird
combination of uid checks.  cgroup v2 has a clear delegation model but
the uid checks are still left in as not changing was the default.

It's not necessary and I'm thinking about queueing something like the
following in the next cycle.

As for the android CAP discussion, I think it'd be nice to share an
existing CAP but if we can't find a good one to share, let's create a
new one.

Thanks.

diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 4cc07ce..34b4b44 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -328,14 +328,12 @@ a process with a non-root euid to migrate a target process into a
 cgroup by writing its PID to the "cgroup.procs" file, the following
 conditions must be met.
 
-- The writer's euid must match either uid or suid of the target process.
-
 - The writer must have write access to the "cgroup.procs" file.
 
 - The writer must have write access to the "cgroup.procs" file of the
   common ancestor of the source and destination cgroups.
 
-The above three constraints ensure that while a delegatee may migrate
+The above two constraints ensure that while a delegatee may migrate
 processes around freely in the delegated sub-hierarchy it can't pull
 in from or push out to outside the sub-hierarchy.
 
@@ -350,10 +348,10 @@ all processes under C0 and C1 belong to U0.
 
 Let's also say U0 wants to write the PID of a process which is
 currently in C10 into "C00/cgroup.procs".  U0 has write access to the
-file and uid match on the process; however, the common ancestor of the
-source cgroup C10 and the destination cgroup C00 is above the points
-of delegation and U0 would not have write access to its "cgroup.procs"
-files and thus the write will be denied with -EACCES.
+file; however, the common ancestor of the source cgroup C10 and the
+destination cgroup C00 is above the points of delegation and U0 would
+not have write access to its "cgroup.procs" files and thus the write
+will be denied with -EACCES.
 
 
 2-6. Guidelines
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 85bc9be..49384ff 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2854,12 +2854,18 @@ static int cgroup_procs_write_permission(struct task_struct *task,
 * even if we're attaching all tasks in the thread group, we only
 * need to check permissions on one of them.
 */
-   if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
-   !uid_eq(cred->euid, tcred->uid) &&
-   !uid_eq(cred->euid, tcred->suid))
-   ret = -EACCES;
 
-   if (!ret && cgroup_on_dfl(dst_cgrp)) {
+   /* root is allowed to do anything */
+   if (uid_eq(cred->euid, GLOBAL_ROOT_UID))
+   goto out;
+
+   /*
+* On v2, follow the delegation model.  Inside a delegated subtree,
+* the delegatee can move around the processes however it sees fit.
+*
+* On v1, a process should match one of the target's uids.
+*/
+   if (cgroup_on_dfl(dst_cgrp)) {
struct super_block *sb = of->file->f_path.dentry->d_sb;
struct cgroup *cgrp;
struct inode *inode;
@@ -2877,8 +2883,11 @@ static int cgroup_procs_write_permission(struct task_struct *task,
ret = inode_permission(inode, MAY_WRITE);
iput(inode);
}
+   } else if (!uid_eq(cred->euid, tcred->uid) &&
+  !uid_eq(cred->euid, tcred->suid)) {
+   ret = -EACCES;
}
-
+out:
put_cred(tcred);
return ret;
 }


Re: [RESEND][PATCH v4] cgroup: Use CAP_SYS_RESOURCE to allow a process to migrate other tasks between cgroups

2016-12-06 Thread Tejun Heo
Hello,

On Tue, Dec 06, 2016 at 09:01:17AM -0800, Andy Lutomirski wrote:
> How would one be granted the right to move processes around in one's
> own subtree?

Through explicit delegation - chowning of the directory and
cgroup.procs file.

> Are you imagining that, if you're in /a/b and you want to move a
> process that's currently in /a/b/c to /a/b/d then you're allowed to
> because the target process is in your tree?  If so, I doubt this has
> the security properties you want -- namely, if you can cooperate with
> anyone in /, even if they're unprivileged, you can break it.

Delegation is an explicit operation and reflected in the ownership of
the subdirectories and cgroup interface files in them.  The
subhierarchy containment is achieved by requiring the user who's
trying to migrate a process to have write perm on cgroup.procs on the
common ancestor of the source and target in addition to the target.
So, a user who has a subhierarchy delegated to it can move processes
around inside but not out of or into it.  Also, if there are multiple
delegated disjoint subhierarchies, processes can't be moved across
them by the delegatee either.

Thanks.

-- 
tejun


Re: [RESEND][PATCH v4] cgroup: Use CAP_SYS_RESOURCE to allow a process to migrate other tasks between cgroups

2016-12-06 Thread Tejun Heo
Hello, Serge.

On Mon, Dec 05, 2016 at 08:00:11PM -0600, Serge E. Hallyn wrote:
> > I really don't know.  The cgroupfs interface is a bit unfortunate in
> > that it doesn't really express the constraints.  To safely migrate a
> > task, ISTM you ought to have some form of privilege over the task
> > *and* some form of privilege over the cgroup.
> 
> Agreed.  The problem is that the privilege required should depend on
> the controller (I guess).  For memory and cpuset, CAP_SYS_NICE seems
> right.  Perhaps CAP_SYS_RESOURCE would be needed for some..  but then,
> as I look through the lists (capabilities(7) and the list of controllers),
> it seems like CAP_SYS_NICE works for everything.  What else would we need?
> Maybe CAP_NET_ADMIN for net_cls and net_prio?  CAP_SYS_RESOURCE|CAP_SYS_ADMIN
> for pids?

Please see my other reply but I don't think it's a good idea to have
these extra checks on the side when there already is hierarchical
delegation mechanism which should be able to handle both resource
control and process management delegation.

Thanks.

-- 
tejun


Re: [RESEND][PATCH v4] cgroup: Use CAP_SYS_RESOURCE to allow a process to migrate other tasks between cgroups

2016-12-06 Thread Tejun Heo
Hello,

On Mon, Dec 05, 2016 at 04:36:51PM -0800, Andy Lutomirski wrote:
> I really don't know.  The cgroupfs interface is a bit unfortunate in
> that it doesn't really express the constraints.  To safely migrate a
> task, ISTM you ought to have some form of privilege over the task
> *and* some form of privilege over the cgroup.  cgroupfs only handles
> the latter.
> 
> CAP_CGROUP_MIGRATE ought to be okay.  Or maybe cgroupfs needs to gain
> a concept of "dangerous" cgroups and further restrict them and
> CAP_SYS_RESOURCE should be fine for non-dangerous cgroups?  I think I
> favor the latter, but it might be nice to hear from Tejun first.

If we can't do CAP_SYS_RESOURCE due to overlaps, let's go with a
separate CAP.  While for android and cgroup v1, it's nice to have a
finer grained CAP for security control, privilege isolation in cgroup
should also primarily done through hierarchical delegation.  It
doesn't make sense to have another system in parallel.

We can't do it properly on v1 because some controllers aren't properly
hierarchical and delegation model isn't well defined.  e.g. nothing
prevents a process from being pulled across different subtrees with
the same delegation, but v2 can do it properly.  All that's necessary
is to make the CAP test OR'd to other perm checks instead of AND'ing
so that the CAP just allows overriding restrictions expressed through
delegation but it's normally possible to move processes around in
one's own delegated subtree.
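
To make the OR'ing vs AND'ing concrete (sketch; has_delegation_perm
stands in for the delegation checks described above):

	/* AND'ing - the CAP is required on top of the delegation checks */
	if (!capable(CAP_CGROUP_MIGRATE) || !has_delegation_perm)
		return -EACCES;

	/* OR'ing - the CAP only overrides what delegation would deny */
	if (!capable(CAP_CGROUP_MIGRATE) && !has_delegation_perm)
		return -EACCES;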

Thanks.

-- 
tejun


Re: [PATCH trival -resend 2/2] lib: clean up put_cpu_var usage

2016-09-27 Thread Tejun Heo
On Tue, Sep 27, 2016 at 08:42:42AM -0700, Shaohua Li wrote:
> put_cpu_var takes the percpu data, not the data returned from
> get_cpu_var.
> 
> This doesn't change the behavior.
> 
> Cc: Tejun Heo <t...@kernel.org>
> Signed-off-by: Shaohua Li <s...@fb.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH trival -resend 1/2] bpf: clean up put_cpu_var usage

2016-09-27 Thread Tejun Heo
On Tue, Sep 27, 2016 at 08:42:41AM -0700, Shaohua Li wrote:
> put_cpu_var takes the percpu data, not the data returned from
> get_cpu_var.
> 
> This doesn't change the behavior.
> 
> Cc: Tejun Heo <t...@kernel.org>
> Cc: Alexei Starovoitov <a...@kernel.org>
> Signed-off-by: Shaohua Li <s...@fb.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-09-16 Thread Tejun Heo
Hello,

On Tue, Sep 13, 2016 at 08:14:40PM +0200, Jiri Slaby wrote:
> I assume Dmitry sees the same what I am still seeing, so I reported this
> some time ago:
> https://lkml.org/lkml/2016/3/21/492
> 
> This warning is trigerred there and still occurs with "HEAD":
>   (pwq != wq->dfl_pwq) && (pwq->refcnt > 1)
> and the state dump is in the log empty too:
> destroy_workqueue: name='hci0' pwq=88006b5c8f00
> wq->dfl_pwq=88006b5c9b00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
>   pwq 13:
>  cpus=2-3 node=1 flags=0x4 nice=-20 active=0/1
> in-flight: 2669:wq_barrier_func

Hmmm... I think it could be from rescuer holding reference on the pwq.
Both cases have WQ_MEM_RECLAIM and it could be that rescuer was still
in flight (even without work items pending) when the sanity checks
were done.  The following patch moves the sanity checks after rescuer
destruction.  Dmitry, Jiri, can you please see whether the warning
goes away with this patch?

Thanks.

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 984f6ff..e8046a1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4042,8 +4042,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
}
}
 
-   if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
-   WARN_ON(pwq->nr_active) ||
+   if (WARN_ON(pwq->nr_active) ||
WARN_ON(!list_empty(&pwq->delayed_works))) {
mutex_unlock(&wq->mutex);
show_workqueue_state();
@@ -4080,6 +4079,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
for_each_node(node) {
pwq = rcu_access_pointer(wq->numa_pwq_tbl[node]);
RCU_INIT_POINTER(wq->numa_pwq_tbl[node], NULL);
+   WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt != 1));
put_pwq_unlocked(pwq);
}
 
@@ -4089,6 +4089,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
 */
pwq = wq->dfl_pwq;
wq->dfl_pwq = NULL;
+   WARN_ON(pwq->refcnt != 1);
put_pwq_unlocked(pwq);
}
 }


Re: [PATCH 3/3] mm: memcontrol: consolidate cgroup socket tracking

2016-09-14 Thread Tejun Heo
On Wed, Sep 14, 2016 at 03:48:46PM -0400, Johannes Weiner wrote:
> The cgroup core and the memory controller need to track socket
> ownership for different purposes, but the tracking sites being
> entirely different is kind of ugly.
> 
> Be a better citizen and rename the memory controller callbacks to
> match the cgroup core callbacks, then move them to the same place.
> 
> Signed-off-by: Johannes Weiner <han...@cmpxchg.org>

For 1-3,

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-09-13 Thread Tejun Heo
Hello,

On Sat, Sep 10, 2016 at 11:33:48AM +0200, Dmitry Vyukov wrote:
> Hit the WARNING with the patch. It showed "Showing busy workqueues and
> worker pools:" after the WARNING, but then no queue info. Was it
> already destroyed and removed from the list?...

Hmm...  It either means that the work item which was in flight when
WARN_ON() ran finished by the time the debug printout got to it or
that it's something unrelated to busy work items.

> [ 198.113838] WARNING: CPU: 2 PID: 26691 at kernel/workqueue.c:4042
> destroy_workqueue+0x17b/0x630

I don't seem to have the same source code that you have.  Which exact
WARN_ON() is this?

Thanks.

-- 
tejun


Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-09-05 Thread Tejun Heo
Hello,

On Sat, Sep 03, 2016 at 12:58:33PM +0200, Dmitry Vyukov wrote:
> > I've seen it only several times in several months, so I don't think it will
> > be helpful.
> 
> 
> Bad news: I hit it again.
> On 0f98f121e1670eaa2a2fbb675e07d6ba7f0e146f of linux-next, so I have
> bf389cabb3b8079c23f9762e62b05f291e2d5e99.

Hmmm... we're not getting anywhere.  I've applied the following patch
to wq/for-4.8-fixes so that when this happens the next time we can
actually tell what the hell is going wrong.

Thanks.

-- 8< --
From 278930ada88c972d20025b0f20def27b1a09dff7 Mon Sep 17 00:00:00 2001
From: Tejun Heo <t...@kernel.org>
Date: Mon, 5 Sep 2016 08:54:06 -0400
Subject: [PATCH] workqueue: dump workqueue state on sanity check failures in
 destroy_workqueue()

destroy_workqueue() performs a number of sanity checks to ensure that
the workqueue is empty before proceeding with destruction.  However,
it's not always easy to tell what's going on just from the warning
message.  Let's dump workqueue state after sanity check failures to
help debugging.

Signed-off-by: Tejun Heo <t...@kernel.org>
Link: 
http://lkml.kernel.org/r/cact4y+zs6vkjho9qhb4treiz3s4+quvvvq9vwvj2mx6petg...@mail.gmail.com
Cc: Dmitry Vyukov <dvyu...@google.com>
---
 kernel/workqueue.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ef071ca..4eaec8b8 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4021,6 +4021,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
for (i = 0; i < WORK_NR_COLORS; i++) {
if (WARN_ON(pwq->nr_in_flight[i])) {
mutex_unlock(&wq->mutex);
+   show_workqueue_state();
return;
}
}
@@ -4029,6 +4030,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
WARN_ON(pwq->nr_active) ||
WARN_ON(!list_empty(&pwq->delayed_works))) {
mutex_unlock(&wq->mutex);
+   show_workqueue_state();
return;
}
}
-- 
2.7.4



Re: [PATCH v2] cfg80211: Remove deprecated create_singlethread_workqueue

2016-08-31 Thread Tejun Heo
On Wed, Aug 31, 2016 at 12:35:07AM +0530, Bhaktipriya Shridhar wrote:
> The workqueue "cfg80211_wq" is involved in cleanup, scan and event related
> works. It queues multiple work items &rdev->event_work,
> &rdev->dfs_update_channels_wk,
> &wiphy_to_rdev(request->wiphy)->scan_done_wk,
> &wiphy_to_rdev(wiphy)->sched_scan_results_wk, which require strict
> execution ordering.
> Hence, an ordered dedicated workqueue has been used.
> 
> Since it's a wireless driver, WQ_MEM_RECLAIM has been set to ensure
> forward progress under memory pressure.
> 
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH] net: pegasus: Remove deprecated create_singlethread_workqueue

2016-08-31 Thread Tejun Heo
On Tue, Aug 30, 2016 at 10:02:47PM +0530, Bhaktipriya Shridhar wrote:
> The workqueue "pegasus_workqueue" queues a single work item per pegasus
> instance and hence it doesn't require execution ordering. Hence,
> alloc_workqueue has been used to replace the deprecated
> create_singlethread_workqueue instance.
> 
> The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
> memory pressure since it's a network driver.
> 
> Since there are fixed number of work items, explicit concurrency
> limit is unnecessary here.
> 
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH] bonding: Remove deprecated create_singlethread_workqueue

2016-08-31 Thread Tejun Heo
On Tue, Aug 30, 2016 at 10:02:01PM +0530, Bhaktipriya Shridhar wrote:
> alloc_ordered_workqueue() with WQ_MEM_RECLAIM set, replaces
> deprecated create_singlethread_workqueue(). This is the identity
> conversion.
> 
> The workqueue "wq" queues multiple work items viz
> &bond->mcast_work, &nnw->work, &bond->mii_work, &bond->arp_work,
> &bond->alb_work, &bond->mii_work, &bond->ad_work, &bond->slave_arr_work
> which require strict execution ordering. Hence, an ordered dedicated
> workqueue has been used.
> 
> Since, it is a network driver, WQ_MEM_RECLAIM has been set to
> ensure forward progress under memory pressure.
> 
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [net-next RFC v2 4/9] bpf, security: Add Checmate security LSM and BPF program type

2016-08-29 Thread Tejun Heo
Hello, Sargun.

On Mon, Aug 29, 2016 at 11:49:07AM -0700, Sargun Dhillon wrote:
> It would be a separate hook per LSM hook. Why wouldn't we want a separate bpf 
> hook per lsm hook? I think if one program has to handle them all, the first 
> program would be looking up the hook program in a bpf prog array. If you 
> think 
> it's better to have this logic in the BPF program, that makes sense. 
> 
> I had a version of this patch that allowed you to attach a prog array 
> instead, 
> but I think that it's cleaner attaching a program per lsm hook. In addition, 
> there's a performance impact that comes from these hooks, so I wouldn't want 
> to 
> execute unnecessary code if it's avoidable.

Hmm... it doesn't really matter how the backend part looks like and if
we need to implement per-call hooks to lower runtime overhead, sure.
I was mostly worried about the approach propagating through the
userland visible interface.

> The prog array approach also makes stacking filters difficult. If people want 
> multiple filters per hook, the orchestrator would have to rewrite the 
> existing 
> filters to be cooperative.

I'm not really sure "stacking" in the kernel side is a good idea.
Please see below.

> > I'm not convinced about the approach.  It's an approach which pretty
> > much requires future extensions while being rigid.  Not a good
> > combination.
>
> Do you have an alternative recommendation? Maybe just a set of 5 u64s
> as the context object along with the hook ID?

cgroup fs doesn't seem like the right interface for this but if it
were I'd go for named hook IDs instead of opaque numbers.

> > Unless this is properly delegatable, IOW, it's safe to fully delegate
> > to a lesser security domain for all operations including program
> > loading and assignment (I can't see how that'd be the case), making it
> > an explicit controller doens't work in terms of userland interface.
> > It's fine for bpf / lsm / whatever to attach to cgroups by extending
> > struct cgroup itself or implementing an implicit controller but to be
> > visible as an explicit controller it must be able to follow cgroup
> > interface rules including delegation.  If not, it's best to come
> > through the interface which enforces the required permission checks
> > and then talk to cgroup from there.  This was also an issue with
> > network cgroup bpf programs that Daniel Mack is working on.  Please
> > chat with him.
>
> Program assignment is possible by lesser security domains. Program loading is 
> limited to CAP_SYS_ADMIN in init_user_ns. We could make it accessible to 
> CAP_SYS_ADMIN in any userns, but it the reasoning behind this is that Checmate
> BPF programs can leak kernel pointers. 

That doesn't make much sense to me.  Delegation doesn't mean much if a
delegatee can't load its own program (and I don't see how one can
delegate kernel pointer access to !root).  Also, unless there's
per-program fine control on who can load it, it seems pretty dangerous
to let anyone load any program.

> Could we potentially restrict it to only CAP_MAC_OVERRIDE, while still 
> meeting 
> cgroup delegation requirements?

Wouldn't it make far more sense to pass cgroup fd to bpf syscall so
that "load this program" and "attach this program to the cgroup
identified by this fd" go through the same interface and permission
checks?  cgroup participating in bpf operations is all fine but
splitting the userland interface across two domains seems like a bad
idea.
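
Roughly, the attach would then be one more bpf(2) command carrying both
fds.  A sketch only -- the field and constant names below follow the
general shape of the BPF_PROG_ATTACH proposals elsewhere in this thread
and are illustrative, not a committed interface:

	union bpf_attr attr = {
		.target_fd	= cgroup_fd,	/* fd of the target cgroup directory */
		.attach_bpf_fd	= prog_fd,	/* fd of the already-loaded program */
		.attach_type	= BPF_ATTACH_TYPE_CGROUP_INGRESS,
	};

	err = syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));

That way the program and the cgroup both go through the same permission
checks in one place.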

> Filters which are higher up in the hierarchy will still be enforced during 
> delegation. This was an explicit design, as the "Orchestrator in 
> Orchestrator" 
> use case needs to be supported.

Given that program loading is restricted to root, wouldn't it be a
lot more efficient approach to let userland multiplex multiple
programs?  Walking the tree executing bpf programs each time one of
these operations runs can be pretty expensive.  Imagine a tree like
the following.

A - B - C
      \ D

Let's say program is currently loaded on D.  If someone wants to add a
program on B, userland can load the program on B, combine B's and D's
program and compile them into a single program and load it on D.  The
only thing kernel would need to do in terms of hierarchy is finding
what's the closest program to execute.  In the above example, C would
need to use B's program and that can be determined on program
assignment time rather than on each operation.

Thanks.

-- 
tejun


Re: [net-next RFC v2 4/9] bpf, security: Add Checmate security LSM and BPF program type

2016-08-29 Thread Tejun Heo
Hello,

On Mon, Aug 29, 2016 at 04:47:07AM -0700, Sargun Dhillon wrote:
> This patch adds a minor LSM, Checmate. Checmate is a flexible programmable,
> extensible minor LSM that's coupled with cgroups and BPF. It is designed to
> enforce container-specific policies. It is also a cgroupv2 controller. By
> itself, it doesn't do very much, but in conjunction with an orchestrator
> complex policies can be installed on the cgroup hierarchy.
> 
> These cgroup programs are tied to the kernel ABI version. If one tries
> to load a BPF program compiled against a different kernel version,
> an error will be thrown.

First of all, please talk with people working on network cgroup bpf
and landlock.  I don't think it's a good idea to have N different ways
to implement cgroup-aware bpf mechanism.  There can be multiple
consumers but there gotta be a common mechanism instead of several
independent controllers.

> diff --git a/include/linux/checmate.h b/include/linux/checmate.h
> new file mode 100644
> index 000..4c4db4a
> --- /dev/null
> +++ b/include/linux/checmate.h
> @@ -0,0 +1,108 @@
> +#ifndef _LINUX_CHECMATE_H_
> +#define _LINUX_CHECMATE_H_ 1
> +#include 
> +
> +enum checmate_hook_num {
> + /* CONFIG_SECURITY_NET hooks */
> + CHECMATE_HOOK_UNIX_STREAM_CONNECT,
> + CHECMATE_HOOK_UNIX_MAY_SEND,
> + CHECMATE_HOOK_SOCKET_CREATE,
> + CHECMATE_HOOK_SOCKET_POST_CREATE,
> + CHECMATE_HOOK_SOCKET_BIND,
> + CHECMATE_HOOK_SOCKET_CONNECT,
> + CHECMATE_HOOK_SOCKET_LISTEN,
> + CHECMATE_HOOK_SOCKET_ACCEPT,
> + CHECMATE_HOOK_SOCKET_SENDMSG,
> + CHECMATE_HOOK_SOCKET_RECVMSG,
> + CHECMATE_HOOK_SOCKET_GETSOCKNAME,
> + CHECMATE_HOOK_SOCKET_GETPEERNAME,
> + CHECMATE_HOOK_SOCKET_GETSOCKOPT,
> + CHECMATE_HOOK_SOCKET_SETSOCKOPT,
> + CHECMATE_HOOK_SOCKET_SHUTDOWN,
> + CHECMATE_HOOK_SOCKET_SOCK_RCV_SKB,
> + CHECMATE_HOOK_SK_FREE_SECURITY,
> + __CHECMATE_HOOK_MAX,
> +};

Do we really want a separate hook for each call?  A logical extension
of this would be having a separate hook per syscall which feels kinda
horrible.

> +/* CONFIG_SECURITY_NET contexts */
> +struct checmate_unix_stream_connect_ctx {
> + struct sock *sock;
> + struct sock *other;
> + struct sock *newsk;
> +};
...
> +struct checmate_sk_free_security_ctx {
> + struct sock *sk;
> +};
...
> +struct checmate_ctx {
> + int hook;
> + union {
> +/* CONFIG_SECURITY_NET contexts */
> + struct checmate_unix_stream_connect_ctx unix_stream_connect;
> + struct checmate_unix_may_send_ctx   unix_may_send;
> + struct checmate_socket_create_ctx   socket_create;
> + struct checmate_socket_bind_ctx socket_bind;
> + struct checmate_socket_connect_ctx  socket_connect;
> + struct checmate_socket_listen_ctx   socket_listen;
> + struct checmate_socket_accept_ctx   socket_accept;
> + struct checmate_socket_sendmsg_ctx  socket_sendmsg;
> + struct checmate_socket_recvmsg_ctx  socket_recvmsg;
> + struct checmate_socket_sock_rcv_skb_ctx socket_sock_rcv_skb;
> + struct checmate_sk_free_security_ctxsk_free_security;
> + };
> +};

I'm not convinced about the approach.  It's an approach which pretty
much requires future extensions while being rigid.  Not a good
combination.

> +/*

Please use /** for function comments.

> + * checmate_instance_cleanup_rcu - Cleans up a Checmate program instance
> + * @rp: rcu_head pointer to a Checmate instance
> + */
> +static void checmate_instance_cleanup_rcu(struct rcu_head *rp)
> +{
> + struct checmate_instance *instance;
> +
> + instance = container_of(rp, struct checmate_instance, rcu);
> + bpf_prog_put(instance->prog);
> + kfree(instance);
> +}
> +static struct cftype checmate_files[] = {
> +#ifdef CONFIG_SECURITY_NETWORK
> + CHECMATE_CFTYPE(CHECMATE_HOOK_UNIX_STREAM_CONNECT,
> + "unix_stream_connect"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_UNIX_MAY_SEND,
> + "unix_may_send"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_CREATE, "socket_create"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_BIND, "socket_bind"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_CONNECT, "socket_connect"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_LISTEN, "socket_listen"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_ACCEPT, "socket_accept"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_SENDMSG, "socket_sendmsg"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_RECVMSG, "socket_recvmsg"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_SHUTDOWN, "socket_shutdown"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SOCKET_SOCK_RCV_SKB,
> + "socket_sock_rcv_skb"),
> + CHECMATE_CFTYPE(CHECMATE_HOOK_SK_FREE_SECURITY, "sk_free_security"),
> +#endif /* CONFIG_SECURITY_NETWORK */
> + {}
> +};

I don't think this is a good interface approach.

> +struct 

Re: [RFC v2 09/10] landlock: Handle cgroups

2016-08-26 Thread Tejun Heo
Hello,

On Fri, Aug 26, 2016 at 07:20:35AM -0700, Andy Lutomirski wrote:
> > This is simply the action of changing the owner of cgroup sysfs files to
> > allow an unprivileged user to handle them (cf. Documentation/cgroup-v2.txt)
> 
> As far as I can tell, Tejun and systemd both actively discourage doing
> this.  Maybe I misunderstand.  But in any event, the admin giving you

Please refer to "2-5. Delegation" of Documentation/cgroup-v2.txt.
Delegation on v1 is broken on both core and specific controller
behaviors and thus discouraged.  On v2, delegation should work just
fine.

I haven't looked in detail but in general I'm not too excited about
layering security mechanism on top of cgroup.  Maybe it makes some
sense when security domain coincides with resource domains but at any
rate please keep me in the loop.

Thanks.

-- 
tejun


Re: [RFC v2 09/10] landlock: Handle cgroups

2016-08-26 Thread Tejun Heo
Hello,

On Thu, Aug 25, 2016 at 04:44:13PM +0200, Mickaël Salaün wrote:
> I tested with cgroup-v2 but indeed, it seems a bit different with
> cgroup-v1 :)
> Does anyone know how to handle both cases?

If you wanna do cgroup membership test, just do cgroup v2 membership
test.  No need to introduce a new controller and possibly struct sock
association field for that.  That's what all new cgroup aware network
operations are using anyway and doesn't conflict with whether other
controllers are v1 or v2.

For examples of using cgroup v2 membership test, please take a look at
cgroup_mt_v1().

Thanks.

-- 
tejun


Re: [RFC PATCH 4/5] net: filter: run cgroup eBPF programs

2016-08-25 Thread Tejun Heo
Hello, Sargun.

On Sun, Aug 21, 2016 at 01:14:22PM -0700, Sargun Dhillon wrote:
> So, casually looking at this patch, it looks like you're relying on 
> sock_cgroup_data, which only points to the default hierarchy. If someone uses 
> net_prio or net_classid, cgroup_sk_alloc_disable is called, and this wont 
> work 
> anymore. 

The requirement there comes from the network side.  In short, davem
(rightfully) doesn't want further proliferation of cgroup association
fields in struct sock.  It makes sense for network control too as it's
schizophrenic to have different associations depending on the specific
controller.  Also, the v2 requirement shouldn't really get in the way
as it can be mounted as just another hierarchy along with other v1
hierarchies.

> Any ideas on how to work around that? Does it make sense to add another 
> pointer 
> to sock_cgroup_data, or at least a warning when allocation is disabled?

cgroup already warns when the association gets disabled due to usage
of netcls and netprio.  We probably want to update the warning message
to include bpf too.

Thanks.

-- 
tejun


Re: [PATCH 0/5] Networking cgroup controller

2016-08-25 Thread Tejun Heo
On Thu, Aug 25, 2016 at 12:09:20PM -0400, Tejun Heo wrote:
> ebpf approach does have its shortcomings for sure but mending them
> seems a lot more manageable and future-proof than going with fixed but
> constantly expanding set of operations.  e.g. We can add per-cgroup
> bpf programs which are called only on socket creation or other major
> events, or just let bpf programs which get called on bind(2), and add
^^
please ignore this part. half-assed edit.

-- 
tejun


Re: [PATCH 0/5] Networking cgroup controller

2016-08-25 Thread Tejun Heo
Hello, Mahesh.

On Thu, Aug 25, 2016 at 08:54:19AM -0700, Mahesh Bandewar (महेश बंडेवार) wrote:
> In short most of the associated problems are handled by the
> cgroup-infra / APIs while all that need separate solution in
> alternatives.  Tejun, feels like I'm advocating cgroup approach to you
> ;)

My concern here is that the proposed fixed mechanism isn't gonna be
enough.  Port range matching wouldn't scale, so we'd need some hashmap
style thing which may be too expensive for simple matches so either we
do something adaptive or have different interfaces for the two and so
on.  IOW, I think this approach is likely to replicate what iptables
have been doing with its extensions.  I don't doubt that it is one of
the workable approaches but hardly an attractive one especially at
this point.

ebpf approach does have its shortcomings for sure but mending them
seems a lot more manageable and future-proof than going with fixed but
constantly expanding set of operations.  e.g. We can add per-cgroup
bpf programs which are called only on socket creation or other major
events, or just let bpf programs which get called on bind(2), and add
some per-cgroup state variables which are maintained by cgroup code
which can be used from these bpf programs.

Thanks.

-- 
tejun


Re: [PATCH v2 3/6] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-08-24 Thread Tejun Heo
Hello,

On Wed, Aug 24, 2016 at 10:24:20PM +0200, Daniel Mack wrote:
>  SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, 
> size)
>  {
>   union bpf_attr attr = {};
> @@ -888,6 +957,16 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, 
> uattr, unsigned int, siz
>   case BPF_OBJ_GET:
>   err = bpf_obj_get(&attr);
>   break;
> +
> +#ifdef CONFIG_CGROUP_BPF
> + case BPF_PROG_ATTACH:
> + err = bpf_prog_attach(&attr);
> + break;
> + case BPF_PROG_DETACH:
> + err = bpf_prog_detach(&attr);
> + break;
> +#endif

So, this is one thing I haven't realized while pushing for "just embed
it in cgroup".  Breaking it out to a separate controller allows using
its own locking instead of having to piggyback on cgroup_mutex.  That
said, as long as cgroup_mutex is not nested inside some inner mutex,
this shouldn't be a problem.  I still think the embedding is fine and
whether we make it an implicit controller or not doesn't affect
userland API at all, so it's an implementation detail that we can
change later if necessary.

Thanks.

-- 
tejun


Re: [PATCH v2 2/6] cgroup: add support for eBPF programs

2016-08-24 Thread Tejun Heo
Hello, Daniel.

On Wed, Aug 24, 2016 at 10:24:19PM +0200, Daniel Mack wrote:
> +void cgroup_bpf_free(struct cgroup *cgrp)
> +{
> + unsigned int type;
> +
> + rcu_read_lock();
> +
> + for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++) {
> + if (!cgrp->bpf.prog[type])
> + continue;
> +
> + bpf_prog_put(cgrp->bpf.prog[type]);
> + static_branch_dec(&cgroup_bpf_enabled_key);
> + }
> +
> + rcu_read_unlock();

This RCU locking seems suspicious to me.  RCU locking on the writer side
is usually bogus.  We sometimes do it to work around locking
assertions in accessors but it's a better idea to make the assertions
better in those cases - e.g. sth like assert_mylock_or_rcu_locked().
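
For reference, the kind of assertion I mean looks like this (a sketch
along the lines of cgroup_assert_mutex_or_rcu_locked() in
kernel/cgroup.c; the name here is illustrative):

	#define assert_cgroup_mutex_or_rcu_locked()				\
		RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&			\
				 !lockdep_is_held(&cgroup_mutex),		\
				 "cgroup_mutex or RCU read lock required")

Accessors can then be called either under cgroup_mutex or inside an RCU
read-side critical section, without sprinkling rcu_read_lock() around
writers just to silence the checks.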

> +void cgroup_bpf_inherit(struct cgroup *cgrp, struct cgroup *parent)
> +{
> + unsigned int type;
> +
> + rcu_read_lock();

Ditto.

> + for (type = 0; type < __MAX_BPF_ATTACH_TYPE; type++)
> + rcu_assign_pointer(cgrp->bpf.prog_effective[type],
> + rcu_dereference(parent->bpf.prog_effective[type]));
> +
> + rcu_read_unlock();
> +}
...
> +void __cgroup_bpf_update(struct cgroup *cgrp,
> +  struct cgroup *parent,
> +  struct bpf_prog *prog,
> +  enum bpf_attach_type type)
> +{
> + struct bpf_prog *old_prog, *effective;
> + struct cgroup_subsys_state *pos;
> +
> + rcu_read_lock();

Ditto.

> + old_prog = xchg(cgrp->bpf.prog + type, prog);
> + if (old_prog) {
> + bpf_prog_put(old_prog);
> + static_branch_dec(&cgroup_bpf_enabled_key);
> + }
> +
> + if (prog)
> + static_branch_inc(&cgroup_bpf_enabled_key);

Minor but probably better to inc first and then dec so that you can
avoid unnecessary enabled -> disabled -> enabled sequence.
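
IOW, something like the following, reusing the names from the patch
above (sketch only):

	if (prog)
		static_branch_inc(&cgroup_bpf_enabled_key);

	old_prog = xchg(cgrp->bpf.prog + type, prog);
	if (old_prog) {
		bpf_prog_put(old_prog);
		static_branch_dec(&cgroup_bpf_enabled_key);
	}

so that replacing one program with another never bounces the key
through zero.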

> + effective = (!prog && parent) ?
> + rcu_dereference(parent->bpf.prog_effective[type]) : prog;

If this is what's triggering rcu warnings, there's an accessor to use
in these situations.

> + rcu_read_unlock();
> +
> + css_for_each_descendant_pre(pos, &cgrp->self) {

On the other hand, this walk actually requires rcu read locking unless
you're holding cgroup_mutex.

Thanks.

-- 
tejun


Re: [PATCH 0/5] Networking cgroup controller

2016-08-24 Thread Tejun Heo
Hello, Anoop.

On Wed, Aug 10, 2016 at 05:53:13PM -0700, Anoop Naravaram wrote:
> This patchset introduces a cgroup controller for the networking subsystem as a
> whole. As of now, this controller will be used for:
> 
> * Limiting the specific ports that a process in a cgroup is allowed to bind
>   to or listen on. For example, you can say that all the processes in a
>   cgroup can only bind to ports 1000-2000, and listen on ports 1000-1100, 
> which
>   guarantees that the remaining ports will be available for other processes.
> 
> * Restricting which DSCP values processes can use with their sockets. For
>   example, you can say that all the processes in a cgroup can only send
>   packets with a DSCP tag between 48 and 63 (corresponding to TOS values of
>   192 to 255).
> 
> * Limiting the total number of udp ports that can be used by a process in a
>   cgroup. For example, you can say that all the processes in one cgroup are
>   allowed to use a total of up to 100 udp ports. Since the total number of udp
>   ports that can be used by all processes is limited, this is useful for
>   rationing out the ports to different process groups.
> 
> In the future, more networking-related properties may be added to this
> controller.

Thanks for working on this; however, I share the sentiment expressed
by others that this looks like too piecemeal an approach.  If there
are no alternatives, we surely should consider this but it at least
*looks* like bpf should be able to cover the same functionalities
without having to revise and extend in-kernel capabilities constantly.

Thanks.

-- 
tejun


Re: [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers

2016-08-17 Thread Tejun Heo
Hello,

On Wed, Aug 17, 2016 at 10:50:40AM -0700, Alexei Starovoitov wrote:
> > +config CGROUP_BPF
> > +   bool "Enable eBPF programs in cgroups"
> > +   depends on BPF_SYSCALL
> > +   help
> > + This options allows cgroups to accommodate eBPF programs that
> > + can be used for network traffic filtering and accounting. See
> > + Documentation/networking/filter.txt for more information.
> > +
> 
> I think this extra config is unnecessary. It makes the code harder to follow.
> Anyone turning on bpf syscall and cgroups should be able to have this feature.
> Extra config is imo overkill.

Agreed, the added code is pretty small, especially in comparison to
both cgroup and bpf, and the only memory overhead would be four
pointers per cgroup, which shouldn't be noticeable at all.

Thanks.

-- 
tejun


Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-08-17 Thread Tejun Heo
Hello, again.

Just one addition.

On Wed, Aug 17, 2016 at 04:35:24PM +0200, Daniel Mack wrote:
> created. To bring this more in line with how cgroups usually work, I
> guess we should add code to copy over the bpf programs from the ancestor
> once a new cgroup is instantiated.

Please don't copy the actual configuration or program.  Every case
where a cgroup controller took this approach turned out pretty badly.
It gets really confusing.

Thanks.

-- 
tejun


Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-08-17 Thread Tejun Heo
Hello,

On Wed, Aug 17, 2016 at 04:35:24PM +0200, Daniel Mack wrote:
> > Wouldn't it make sense to update the descendants to point to the
> > programs directly so that there's no need to traverse the tree on each
> > packet?  It's more state to maintain but I think it would make total
> > sense given that it would reduce per-packet cost.
> 
> The idea is to only look at the actual v2 cgroup a task is in, and not
> traverse in any way, to keep the per-packet cost at a minimum. That of
> course means that in a deeper level of the hierarchy, the programs are
> no longer executed and need to be reinstalled after a new cgroup is
> created. To bring this more in line with how cgroups usually work, I
> guess we should add code to copy over the bpf programs from the ancestor
> once a new cgroup is instantiated.

So, yeah, just executing the program from the exact cgroup wouldn't
work because it'd be easy for a delegatee to escape enforced policies
by simply creating subcgroups.

What I'd suggest is keeping two sets of pointers, one set for the
programs explicitly attached to the cgroup and the other for programs
which are to be executed for the cgroup.  The two pointers in the
latter set will point to the closest non-null program in its ancestry.
Note that the pointers may point to programs belonging to different
ancestors.

A - B - C
      \ D - E

Let's say B has both ingress and egress programs and D only has
egress.  E's execution pointers should point to D's egress program and
B's ingress program.

The pointers would need to be updated

1. When a program is installed or removed.  For simplicity's sake, we
   can just walk the whole hierarchy.  If that's too expensive
   (shouldn't be), we can find out the closest ancestor where both
   programs can be determined and then walk from there.

2. When a new cgroup is created, it should inherit its execution table
   from its parent.
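
A rough sketch of the recompute in 1. and the inheritance in 2., with
made-up "*_effective" field names (each cgroup keeps the attached
pointers plus the resolved to-execute pointers):

	/* called on each cgroup in a pre-order walk from the update point */
	static void cgroup_bpf_refresh(struct cgroup *cgrp)
	{
		struct cgroup *parent = cgroup_parent(cgrp);

		cgrp->bpf_ingress_effective = cgrp->bpf_ingress ?:
			(parent ? parent->bpf_ingress_effective : NULL);
		cgrp->bpf_egress_effective = cgrp->bpf_egress ?:
			(parent ? parent->bpf_egress_effective : NULL);
	}

New cgroups just copy the parent's effective pointers on creation.  In
the example above, E then points at D's egress program and B's ingress
program without any per-packet tree walk.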

Thanks.

-- 
tejun


Re: [RFC PATCH 4/5] net: filter: run cgroup eBPF programs

2016-08-17 Thread Tejun Heo
On Wed, Aug 17, 2016 at 04:36:02PM +0200, Daniel Mack wrote:
> On 08/17/2016 04:23 PM, Tejun Heo wrote:
> > On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> >> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff 
> >> *skb, unsigned int cap)
> >>if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
> >>return -ENOMEM;
> >>  
> >> +#ifdef CONFIG_CGROUP_BPF
> >> +  err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> >> +  if (err)
> >> +  return err;
> >> +#endif
> > 
> > Hmm... where does egress hook into?
> 
> As stated in the cover letter, that's an open question on my list.

lol sorry about that.  :)

-- 
tejun


Re: [RFC PATCH 4/5] net: filter: run cgroup eBPF programs

2016-08-17 Thread Tejun Heo
On Wed, Aug 17, 2016 at 04:00:47PM +0200, Daniel Mack wrote:
> @@ -78,6 +116,12 @@ int sk_filter_trim_cap(struct sock *sk, struct sk_buff 
> *skb, unsigned int cap)
>   if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
>   return -ENOMEM;
>  
> +#ifdef CONFIG_CGROUP_BPF
> + err = sk_filter_cgroup_bpf(sk, skb, BPF_ATTACH_TYPE_CGROUP_INGRESS);
> + if (err)
> + return err;
> +#endif

Hmm... where does egress hook into?

Thanks.

-- 
tejun


Re: [RFC PATCH 3/5] bpf: add BPF_PROG_ATTACH and BPF_PROG_DETACH commands

2016-08-17 Thread Tejun Heo
Hello, Daniel.

On Wed, Aug 17, 2016 at 04:00:46PM +0200, Daniel Mack wrote:
> The current implementation walks the tree from the passed cgroup up
> to the root. If there is any program of the given type installed in
> any of the ancestors, the installation is rejected. This is because
> programs subject to restrictions should have no way of escaping if
> a higher-level cgroup has installed a program already. This restriction
> can be revisited at some later point in time.

This seems a bit too restrictive to me.  Given that an entity which
can install a bpf program can remove any programs anyway, wouldn't it
make more sense to just execute the program on the closest ancestor to
the cgroup?  That'd allow selectively overriding programs for specific
subhierarchies while applying generic policies to the rest.
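
IOW, resolve which program applies by walking towards the root until one
is found -- a sketch only, and whether this happens per packet or is
cached is a separate question (see below):

	static struct bpf_prog *closest_prog(struct cgroup *cgrp, bool ingress)
	{
		for (; cgrp; cgrp = cgroup_parent(cgrp)) {
			struct bpf_prog *prog = ingress ?
				cgrp->bpf_ingress : cgrp->bpf_egress;

			if (prog)
				return prog;
		}
		return NULL;
	}

A generic policy installed higher up still applies unless a closer
ancestor explicitly overrides it.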

> +static int bpf_prog_attach(const union bpf_attr *attr)
> +{
...
> + /* Reject installation of a program if any ancestor has one. */
> + for (pos = cgrp->self.parent; pos; pos = pos->parent) {
> + struct cgroup *parent;
> +
> + css_get(pos);

This refcnting pattern doesn't make sense.  If @pos is already
accessible (which it is in this case), getting an extra ref before
accessing it doesn't do anything.  This would be necessary iff the
pointer is to be used outside the scope which is protected by the ref
on @cgrp.

> + parent = container_of(pos, struct cgroup, self);
> +
> + if ((is_ingress  && parent->bpf_ingress) ||
> + (!is_ingress && parent->bpf_egress))
> + err = -EEXIST;
> +
> + css_put(pos);
> + }
> +
> + if (err < 0) {
> + bpf_prog_put(prog);
> + return err;
> + }
> +
> + progp = is_ingress ? &cgrp->bpf_ingress : &cgrp->bpf_egress;
> +
> + rcu_read_lock();
> + old_prog = rcu_dereference(*progp);
> + rcu_assign_pointer(*progp, prog);

Wouldn't it make sense to update the descendants to point to the
programs directly so that there's no need to traverse the tree on each
packet?  It's more state to maintain but I think it would make total
sense given that it would reduce per-packet cost.

Otherwise, looks good to me.

Thanks.

-- 
tejun


Re: [RFC PATCH 2/5] cgroup: add bpf_{e,in}gress pointers

2016-08-17 Thread Tejun Heo
Hello,

On Wed, Aug 17, 2016 at 04:00:45PM +0200, Daniel Mack wrote:
> @@ -5461,6 +5462,14 @@ static int cgroup_destroy_locked(struct cgroup *cgrp)
>   for_each_css(css, ssid, cgrp)
>   kill_css(css);
>  
> +#ifdef CONFIG_CGROUP_BPF
> + if (cgrp->bpf_ingress)
> + bpf_prog_put(cgrp->bpf_ingress);
> +
> + if (cgrp->bpf_egress)
> + bpf_prog_put(cgrp->bpf_egress);
> +#endif

This most likely isn't the right place as there still can be sockets
associated with the cgroup and packets flowing.  The cgroup release
path in css_release_work_fn() probably is the right place.

Thanks.

-- 
tejun


Re: [PATCH net-next v4 2/3] bpf: Add bpf_current_task_under_cgroup helper

2016-08-12 Thread Tejun Heo
Hello, Alexei.

On Fri, Aug 12, 2016 at 08:21:21AM -0700, Alexei Starovoitov wrote:
> > lol I should have read the whole thread before replying twice.  Sorry
> > about that.  Yeah, if we can still rename it, let's do "under".  It's
> > more intuitive and gives us the room to implement the real "in" test
> > if ever necessary in the future.
> 
> agree. Thanks for explaining 'in' vs 'under' terminology.
> since we can still rename skb_in_cgroup we should do it.

Sounds good to me.

> and since that was my only nit for this patch.
> Acked-by: Alexei Starovoitov <a...@kernel.org>

FWIW,

Acked-by: Tejun Heo <t...@kernel.org>

> All 3 patches should go via net-next and to avoid conflicts 1/3 can be
> in cgroup tree as well (if you think there will be conflicts).
> We did that in the past with tip and net-next and it worked out well.

Yeah, just route it through net-next.  If other changes ever need it,
I'll include the commit in cgroup tree.

Thanks.

-- 
tejun


Re: [PATCH net-next v4 2/3] bpf: Add bpf_current_task_under_cgroup helper

2016-08-12 Thread Tejun Heo
On Fri, Aug 12, 2016 at 09:21:39AM -0400, Tejun Heo wrote:
> On Thu, Aug 11, 2016 at 09:50:48PM -0700, Sargun Dhillon wrote:
> > I realize that in_cgroup is more consistent, but under_cgroup makes
> > far more sense to me. I think it's more intuitive.
> 
> So, I think in_cgroup should mean that the object is in that
> particular cgroup while under_cgroup in the subhierarchy of that
> cgroup.  Let's rename the other subhierarchy test to under too.  I
> think that'd be a lot less confusing going forward.

Ah, I suppose the bpf part is userland visible?  If so, there isn't
much we can do and probably best to stick with in_cgroup for that
part.  Bummer but no big deal.

Thanks.

-- 
tejun


Re: [PATCH net-next v4 2/3] bpf: Add bpf_current_task_under_cgroup helper

2016-08-12 Thread Tejun Heo
Hello,

On Fri, Aug 12, 2016 at 09:40:39AM +0200, Daniel Borkmann wrote:
> > I actually wish we could rename skb_in_cgroup to skb_under_cgroup. If we 
> > ever
> > introduced a check for absolute membership versus ancestral membership, what
> > would we call that?
> 
> That option is, by the way, still on the table for -net tree, since 4.8 is not
> released yet, so it could still be renamed into BPF_FUNC_skb_under_cgroup.
> 
> Then you could make this one here for -net-next as 
> "BPF_FUNC_current_under_cgroup".
> 
> Tejun, Alexei?

lol I should have read the whole thread before replying twice.  Sorry
about that.  Yeah, if we can still rename it, let's do "under".  It's
more intuitive and gives us the room to implement the real "in" test
if ever necessary in the future.

Thanks.

-- 
tejun


Re: [PATCH net-next v4 2/3] bpf: Add bpf_current_task_under_cgroup helper

2016-08-12 Thread Tejun Heo
On Thu, Aug 11, 2016 at 09:50:48PM -0700, Sargun Dhillon wrote:
> I realize that in_cgroup is more consistent, but under_cgroup makes
> far more sense to me. I think it's more intuitive.

So, I think in_cgroup should mean that the object is in that
particular cgroup while under_cgroup in the subhierarchy of that
cgroup.  Let's rename the other subhierarchy test to under too.  I
think that'd be a lot less confusing going forward.
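
In code terms, roughly (a sketch; the helper names are made up):

	/* "in": exact membership in that cgroup */
	static bool task_in_cgroup(struct task_struct *task, struct cgroup *cgrp)
	{
		return task_dfl_cgroup(task) == cgrp;
	}

	/* "under": membership anywhere in that cgroup's subtree */
	static bool task_under_cgroup(struct task_struct *task, struct cgroup *cgrp)
	{
		return cgroup_is_descendant(task_dfl_cgroup(task), cgrp);
	}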

Thanks.

-- 
tejun


Re: [PATCH net-next v4 1/3] cgroup: Add task_under_cgroup_hierarchy cgroup inline function to headers

2016-08-12 Thread Tejun Heo
On Thu, Aug 11, 2016 at 08:14:45PM -0700, Sargun Dhillon wrote:
> This commit adds an inline function to cgroup.h to check whether a given
> task is under a given cgroup hierarchy. This is to avoid having to put
> ifdefs in .c files to gate access to cgroups. When cgroups are disabled
> this always returns true.
> 
> Signed-off-by: Sargun Dhillon <sar...@sargun.me>
> Cc: Alexei Starovoitov <a...@kernel.org>
> Cc: Daniel Borkmann <dan...@iogearbox.net>
> Cc: Tejun Heo <t...@kernel.org>

Acked-by: Tejun Heo <t...@kernel.org>

Please feel free to route with other patches.  If it'd be better to
route this through the cgroup tree, please let me know.

Thanks.

-- 
tejun


Re: [PATCH net-next v3 1/2] bpf: Add bpf_current_task_in_cgroup helper

2016-08-11 Thread Tejun Heo
Hello, Sargun.

On Thu, Aug 11, 2016 at 11:51:42AM -0700, Sargun Dhillon wrote:
> This adds a bpf helper that's similar to the skb_in_cgroup helper to check
> whether the probe is currently executing in the context of a specific
> subset of the cgroupsv2 hierarchy. It does this based on membership test
> for a cgroup arraymap. It is invalid to call this in an interrupt, and
> it'll return an error. The helper is primarily to be used in debugging
> activities for containers, where you may have multiple programs running in
> a given top-level "container".
> 
> This patch also genericizes some of the arraymap fetching logic between the
> skb_in_cgroup helper and this new helper.

I think it'd be better to put the changes into separate patches.

> @@ -319,4 +319,26 @@ extern const struct bpf_func_proto bpf_get_stackid_proto;
>  void bpf_user_rnd_init_once(void);
>  u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
>  
> +/* Helper to fetch a cgroup pointer based on index.

/**
 *

> + * @map: a cgroup arraymap
> + * @idx: index of the item you want to fetch
> + *
> + * Returns pointer on success,
> + * Error code if item not found, or out-of-bounds access
> + */
> +static inline struct cgroup *fetch_arraymap_ptr(struct bpf_map *map, int idx)
...
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 984f73b..d4e173d 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -497,6 +497,23 @@ static inline bool cgroup_is_descendant(struct cgroup 
> *cgrp,
>   return cgrp->ancestor_ids[ancestor->level] == ancestor->id;
>  }
>  
> +/**
> + * task_in_cgroup_hierarchy - test task's membership of cgroup ancestry
> + * @task: the task to be tested
> + * @ancestor: possible ancestor of @task's cgroup
> + *
> + * Tests whether @task's default cgroup hierarchy is a descendant of 
> @ancestor.
> + * It follows all the same rules as cgroup_is_descendant, and only applies
> + * to the default hierarchy.
> + */
> +static inline bool task_in_cgroup_hierarchy(struct task_struct *task,
> + struct cgroup *ancestor)

Maybe task_is_under_cgroup() is a better name?

> @@ -574,6 +592,11 @@ static inline void cgroup_free(struct task_struct *p) {}
>  static inline int cgroup_init_early(void) { return 0; }
>  static inline int cgroup_init(void) { return 0; }
>  
> +static inline bool task_in_cgroup_hierarchy(struct task_struct *task,
> + struct cgroup *ancestor)
> +{
> + return false;
> +}

If cgroup is not enabled, all tasks are in the root cgroup, so the
test should always return true.
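
IOW, the !CONFIG_CGROUPS stub should be:

	static inline bool task_in_cgroup_hierarchy(struct task_struct *task,
						    struct cgroup *ancestor)
	{
		return true;
	}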

> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -375,6 +375,17 @@ enum bpf_func_id {
>*/
>   BPF_FUNC_probe_write_user,
>  
> + /**
> +  * bpf_current_task_in_cgroup(map, index) - Check cgroup2 membership of 
> current task
> +  * @map: pointer to bpf_map in BPF_MAP_TYPE_CGROUP_ARRAY type
> +  * @index: index of the cgroup in the bpf_map
> +  * Return:
> +  *   == 0 current failed the cgroup2 descendant test
> +  *   == 1 current succeeded the cgroup2 descendant test
> +  *< 0 error
> +  */
> + BPF_FUNC_current_task_in_cgroup,

I think "under" would be better than "in" here.

Thanks.

-- 
tejun


Re: [PATCH] net/mlx5_core/pagealloc: Remove deprecated create_singlethread_workqueue

2016-08-01 Thread Tejun Heo
Hello, Leon.

On Sun, Jul 31, 2016 at 09:35:13AM +0300, Leon Romanovsky wrote:
> > The conversion uses WQ_MEM_RECLAIM, which is standard for all
> > workqueues which can stall packet processing if stalled.  The
> > requirement comes from nfs or block devices over network.
> 
> The title stays "remove deprecated create_singlethread_workqueue" and if
> we put aside the word "deprecated", the code moves from ordered
> workqueue to unordered workqueue and changes max_active count which
> isn't expressed in commit message.

Maybe it was too compressed but the description says

"The workqueue has a single work item. Hence, alloc_workqueue() is
 used instead of alloc_ordered_workqueue() since ordering is
 unnecessary when there's only one work item."

There's only one work item so whether the workqueue is ordered and its
concurrency level don't make any difference.  The reason why people
used to use create_singlethread_workqueue() in these situations was
because per-cpu workqueues used to be very expensive which isn't true
anymore.

> For reclaim paths, I want to be extra caution and want to see
> justification for doing that along with understanding if it is tested or
> not.

WQ_MEM_RECLAIM maintains the behavior of the existing code and the
requirement comes from network interfaces being used for nfs and
transport for block devices and is universal to everything in the
network layer.  The only test we can do here is removing it, pushing it
really hard and seeing whether we can actually trigger a deadlock, which it
will under the right circumstances which may or may not be easy to
replicate in test environments.  I don't see many merits in doing this
especially when the behavior doesn't change from the existing code.

> Right now, I'm feeling that I'm participating in soapie where one sends
> patch for every line, waits and sends the same patch for another file.
> It is worth to send one patch set and let us to test it all in once.

Yeah, I guess it can be a bit annoying.  Bhaktipriya, can you please
group patches for Mellanox?

Thanks.

-- 
tejun


Re: [RFC] net/mlx5_core/en_main: Remove deprecated create_workqueue

2016-07-29 Thread Tejun Heo
Hello,

On Fri, Jul 29, 2016 at 01:30:05AM +0300, Saeed Mahameed wrote:
> > Are the workitems being used on a memory reclaim path?
> 
> do you mean they need to allocate memory ?

It's a bit convoluted.  A workqueue needs WQ_MEM_RECLAIM flag to be
guaranteed forward progress under memory pressure, so any workqueue
which may be depended upon during memory reclaim should have the flag
set; otherwise, the system can deadlock (try to reclaim memory, hits
the wq which can't make forward progress due to lack of memory).  For
network devices, the requirement comes from block-over-network or nfs
which can be involved in memory reclaim.
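
Concretely, that just means allocating the queue with a rescuer, e.g.
(illustrative only, not the actual mlx5 code):

	priv->wq = alloc_workqueue("mlx5e", WQ_MEM_RECLAIM, 0);
	if (!priv->wq)
		return -ENOMEM;

WQ_MEM_RECLAIM guarantees a rescuer thread so that at least one work
item can always make progress even when no new workers can be created
due to memory pressure.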

Thanks.

-- 
tejun


Re: [PATCH] net/mlx5_core/pagealloc: Remove deprecated create_singlethread_workqueue

2016-07-29 Thread Tejun Heo
Hello,

On Thu, Jul 28, 2016 at 12:37:35PM +0300, Leon Romanovsky wrote:
> Did you test this patch? Did you notice the memory reclaim path nature
> of this work?

The conversion uses WQ_MEM_RECLAIM, which is standard for all
workqueues which can stall packet processing if stalled.  The
requirement comes from nfs or block devices over network.

Thanks.

-- 
tejun


Re: [PATCH] caif-hsi: Remove deprecated create_singlethread_workqueue

2016-07-25 Thread Tejun Heo
On Mon, Jul 25, 2016 at 06:40:57PM +0530, Bhaktipriya Shridhar wrote:
> alloc_workqueue replaces deprecated create_singlethread_workqueue().
> 
> A dedicated workqueue has been used since the workitems are being used
> on a packet tx/rx path. Hence, WQ_MEM_RECLAIM has been set to guarantee
> forward progress under memory pressure.
> 
> An ordered workqueue has been used since workitems &cfhsi->wake_up_work
> and &cfhsi->wake_down_work cannot be run concurrently.
> 
> Calls to flush_workqueue() before destroy_workqueue() have been dropped
> since destroy_workqueue() itself calls drain_workqueue() which flushes
> repeatedly till the workqueue becomes empty.
> 
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH net-next v2 2/4] cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY

2016-06-23 Thread Tejun Heo
Hello,

On Thu, Jun 23, 2016 at 11:42:31AM +0200, Daniel Borkmann wrote:
> I presume it's a valid use case to pin a cgroup map, put fds into it and
> remove the pinned file expecting to continue to match on it, right? So
> lifetime is really until last prog using a cgroup map somewhere gets removed
> (even if not accessible from user space anymore, meaning no prog has fd and
> pinned file was removed).

Yeap, from what I can see, the cgroup will stay around (even if it
gets deleted) as long as the bpf rule using it is around and that's
completely fine from cgroup side.

Thanks.

-- 
tejun


Re: [PATCH net-next v2 1/4] cgroup: Add cgroup_get_from_fd

2016-06-23 Thread Tejun Heo
On Wed, Jun 22, 2016 at 02:17:29PM -0700, Martin KaFai Lau wrote:
> Add a helper function to get a cgroup2 from a fd.  It will be
> stored in a bpf array (BPF_MAP_TYPE_CGROUP_ARRAY) which will
> be introduced in the later patch.
> 
> Signed-off-by: Martin KaFai Lau <ka...@fb.com>
> Cc: Alexei Starovoitov <a...@fb.com>
> Cc: Daniel Borkmann <dan...@iogearbox.net>
> Cc: Tejun Heo <t...@kernel.org>

 Acked-by: Tejun Heo <t...@kernel.org>

Please feel free to route this patch with the rest of the series.  If
it's preferable to apply this to the cgroup branch, please let me
know.

Thanks!

-- 
tejun


Re: [PATCH -next 1/4] cgroup: Add cgroup_get_from_fd

2016-06-22 Thread Tejun Heo
Hello, Martin.

On Tue, Jun 21, 2016 at 05:23:19PM -0700, Martin KaFai Lau wrote:
> @@ -6205,6 +6206,31 @@ struct cgroup *cgroup_get_from_path(const char *path)
>  }
>  EXPORT_SYMBOL_GPL(cgroup_get_from_path);

Proper function comment would be nice.

> +struct cgroup *cgroup_get_from_fd(int fd)
> +{
> + struct cgroup_subsys_state *css;
> + struct cgroup *cgrp;
> + struct file *f;
> +
> + f = fget_raw(fd);
> + if (!f)
> + return NULL;

It returns NULL here.

> + css = css_tryget_online_from_dir(f->f_path.dentry, NULL);
> + fput(f);
> + if (IS_ERR(css))
> + return ERR_CAST(css);
> +
> + cgrp = css->cgroup;
> + if (!cgroup_on_dfl(cgrp)) {
> + cgroup_put(cgrp);
> + return ERR_PTR(-EINVAL);

But an ERR_PTR value here.  Is this intentional?  Also, wouldn't it
make more sense to return -EBADF here, given that that's what
css_tryget_online_from_dir() would return if the filesystem type is
wrong?
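
If the NULL return wasn't intentional, something like the following
would keep the error reporting consistent (a sketch):

	f = fget_raw(fd);
	if (!f)
		return ERR_PTR(-EBADF);

and then callers only ever need to check IS_ERR().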

Thanks!

-- 
tejun


Re: [PATCH net-next] nfnetlink_queue: enable PID info retrieval

2016-06-15 Thread Tejun Heo
Hello,

On Fri, Jun 10, 2016 at 08:40:34AM +0200, Daniel Wagner wrote:
> > [ Cc'ing John, Daniel, et al ]
> > 
> > Btw, while I just looked at scm_detach_fds(), I think commits ...
> > 
> >  * 48a87cc26c13 ("net: netprio: fd passed in SCM_RIGHTS datagram not set
> > correctly")
> >  * d84295067fc7 ("net: net_cls: fd passed in SCM_RIGHTS datagram not set
> > correctly")
> > 
> > ... might not be correct, maybe I'm missing something ...? Lets say
> > process A
> > has a socket fd that it sends via SCM_RIGHTS to process B. Process A was
> > the
> > one that called sk_alloc() originally. Now in scm_detach_fds() we
> > install a new
> > fd for process B pointing to the same sock (file's private_data) and
> > above commits
> > update the cached socket cgroup data for net_cls/net_prio to the new
> > process B.
> > So, if process A for example still sends data over that socket, skbs
> > will then
> > wrongly match on B's cgroup membership instead of A's, no?
> 
> I can't remember the details right now (need to read up again but I wont
> have time till Wednesday).
> 
> From your analysis I would say that is not the desired effect. A should
> match against its own cgroup and not the one of B.

We don't have a good answer for resources which are shared across
different cgroups.  It is often too expensive to track such sharing
accurately and crude approximation (creator-owned, last-used or
whatever) is used widely even outside cgroup.  Different cgroup
controllers tried different approaches but most are settling down for
creator ownership with exceptions for high impact cases.

I don't think there's a solution which satifies all cases here.  Given
that, doing the minimum amount of work (not worrying about SCM_RIGHTS
transfers) is the right thing to do, but we've had this re-labeling
since 2012, so leaving as-is is likely the best option at this point.

Thanks.

-- 
tejun


Re: [PATCH] libertas_tf: Remove create_workqueue

2016-06-11 Thread Tejun Heo
On Wed, Jun 08, 2016 at 01:38:53AM +0530, Bhaktipriya Shridhar wrote:
> alloc_workqueue replaces deprecated create_workqueue().
> 
> A dedicated workqueue has been used since the workitem (viz
> &priv->cmd_work per priv, which maps to lbtf_cmd_work) is involved in
> actual command processing and may be used on a memory reclaim path.
> The workitems require forward progress under memory pressure and hence,
> WQ_MEM_RECLAIM has been set. Since there are only a fixed number of work
> items, explicit concurrency limit is unnecessary here.
> 
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH] net: wireless: marvell: libertas: Remove create_workqueue

2016-06-05 Thread Tejun Heo
On Sat, Jun 04, 2016 at 07:29:01PM +0530, Bhaktipriya Shridhar wrote:
> alloc_workqueue replaces deprecated create_workqueue().
> 
> In if_sdio.c, the workqueue card->workqueue has workitem
> &card->packet_worker, which is mapped to if_sdio_host_to_card_worker.
> The workitem is involved in sending packets to firmware.
> Forward progress under memory pressure is a requirement here.
> 
> In if_spi.c, the workqueue card->workqueue has workitem
> &card->packet_worker, which is mapped to if_spi_host_to_card_worker.
> The workitem is involved in sending command packets from the host.
> Forward progress under memory pressure is a requirement here.
> 
> Dedicated workqueues have been used in both cases since the workitems
> on the workqueues are involved in normal device operation with
> WQ_MEM_RECLAIM set to guarantee forward progress under memory pressure.
> Since there are only a fixed number of work items, explicit concurrency
> limit is unnecessary.
> 
> flush_workqueue is unnecessary since destroy_workqueue() itself calls
> drain_workqueue() which flushes repeatedly till the workqueue
> becomes empty. Hence the calls to flush_workqueue() before
> destroy_workqueue() have been dropped.
> 
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH] net: ethernet: cavium: liquidio: response_manager: Remove create_workqueue

2016-06-05 Thread Tejun Heo
On Sat, Jun 04, 2016 at 08:21:40PM +0530, Bhaktipriya Shridhar wrote:
> alloc_workqueue replaces deprecated create_workqueue().
> 
> A dedicated workqueue has been used since the workitem viz
> (&cwq->wk.work which maps to oct_poll_req_completion) is involved
> in normal device operation. WQ_MEM_RECLAIM has been set to guarantee
> forward progress under memory pressure, which is a requirement here.
> Since there are only a fixed number of work items, explicit concurrency
> limit is unnecessary.
> 
> flush_workqueue is unnecessary since destroy_workqueue() itself calls
> drain_workqueue() which flushes repeatedly till the workqueue
> becomes empty. Hence the call to flush_workqueue() has been dropped.
> 
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>

As with the previous patch, the subject tag doesn't need to contain
the full hierarchy but other than that,

 Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH] net: ethernet: cavium: liquidio: request_manager: Remove create_workqueue

2016-06-05 Thread Tejun Heo
On Sat, Jun 04, 2016 at 08:54:00PM +0530, Bhaktipriya Shridhar wrote:
> diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c 
> b/drivers/net/ethernet/cavium/liquidio/request_manager.c
> index a2a2465..9313915 100644
> --- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
> +++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
> @@ -144,7 +144,9 @@ int octeon_init_instr_queue(struct octeon_device *oct,
> 
>   oct->fn_list.setup_iq_regs(oct, iq_no);
> 
> - oct->check_db_wq[iq_no].wq = create_workqueue("check_iq_db");
> + oct->check_db_wq[iq_no].wq = alloc_workqueue("check_iq_db",
> +  WQ_MEM_RECLAIM,
> +  0);

Why the new line between WQ_MEM_RECLAIM and 0?

Except for the subj tag and the above nit,

 Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH] net: fjes: fjes_main: Remove create_workqueue

2016-06-02 Thread Tejun Heo
On Thu, Jun 02, 2016 at 03:00:57PM +0530, Bhaktipriya Shridhar wrote:
> alloc_workqueue replaces deprecated create_workqueue().
> 
> The workqueue adapter->txrx_wq has workitem
> &adapter->raise_intr_rxdata_task per adapter. Extended Socket Network
> Device is shared memory based, so someone's transmission denotes other's
> reception.  raise_intr_rxdata_task raises interruption of receivers from
> the sender in order to notify receivers.
> 
> The workqueue adapter->control_wq has workitem
> &adapter->interrupt_watch_task per adapter. interrupt_watch_task is used
> to prevent delay of interrupts.
> 
> Dedicated workqueues have been used in both cases since the workitems
> on the workqueues are involved in normal device operation and require
> forward progress under memory pressure.
> 
> max_active has been set to 0 since there is no need for throttling
> the number of active work items.
> 
> Since network devices  may be used for memory reclaim,
> WQ_MEM_RECLAIM has been set to guarantee forward progress.

Patch looks good but ditto with the description.  I wish it were
clearer.

Thanks.

-- 
tejun


Re: [PATCH] net: ethernet: wiznet: Remove create_workqueue

2016-06-01 Thread Tejun Heo
On Wed, Jun 01, 2016 at 11:29:15PM +0530, Bhaktipriya Shridhar wrote:
> alloc_workqueue replaces deprecated create_workqueue().
> 
> A dedicated workqueue has been used since the workitems are involved
> in normal device operation. Workitems &priv->rx_work and &priv->tx_work,
> map to w5100_rx_work and w5100_tx_work respectively and are involved in
> receiving and transmitting packets. Forward progress under
> memory pressure is a requirement here.
> 
> create_workqueue has been replaced with alloc_workqueue with max_active
> as 0 since there is no need for throttling the number of active work
> items.
> 
> Since the driver may be used in memory reclaim path,
> WQ_MEM_RECLAIM has been set to guarantee forward progress.
> 
> flush_workqueue is unnecessary since destroy_workqueue() itself calls
> drain_workqueue() which flushes repeatedly till the workqueue
> becomes empty. Hence the call to flush_workqueue() has been dropped.
> 
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH] net: ethernet: intel: fm10k: Remove create_workqueue

2016-06-01 Thread Tejun Heo
On Wed, Jun 01, 2016 at 09:10:09PM +0530, Bhaktipriya Shridhar wrote:
> alloc_workqueue replaces deprecated create_workqueue().
> 
> A dedicated workqueue has been used since the workitem (viz
> fm10k_service_task, which manages and runs other subtasks) is involved in
> normal device operation and requires forward progress under memory
> pressure.
> 
> create_workqueue has been replaced with alloc_workqueue with max_active
> as 0 since there is no need for throttling the number of active work
> items.
> 
> Since network devices may be used in memory reclaim path,
> WQ_MEM_RECLAIM has been set to guarantee forward progress.
> 
> flush_workqueue is unnecessary since destroy_workqueue() itself calls
> drain_workqueue() which flushes repeatedly till the workqueue
> becomes empty. Hence the call to flush_workqueue() has been dropped.
> 
> Signed-off-by: Bhaktipriya Shridhar <bhaktipriy...@gmail.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH percpu/for-4.7-fixes 1/2] percpu: fix synchronization between chunk->map_extend_work and chunk destruction

2016-05-26 Thread Tejun Heo
Hello,

On Thu, May 26, 2016 at 11:19:06AM +0200, Vlastimil Babka wrote:
> > if (is_atomic) {
> > margin = 3;
> > 
> > if (chunk->map_alloc <
> > -   chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW &&
> > -   pcpu_async_enabled)
> > -   schedule_work(&chunk->map_extend_work);
> > +   chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) {
> > +   if (list_empty(&chunk->map_extend_list)) {

> So why this list_empty condition? Doesn't it deserve a comment then? And

Because doing list_add() twice corrupts the list.  I'm not sure that
deserves a comment.  We can do list_move() instead but that isn't
necessarily better.
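
For context, the queueing in question is along these lines; the
list_empty() test is what keeps a chunk from being list_add()'d a second
time while it's already queued:

	if (list_empty(&chunk->map_extend_list)) {
		list_add_tail(&chunk->map_extend_list,
			      &pcpu_map_extend_chunks);
		pcpu_schedule_balance_work();
	}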

> isn't using a list an overkill in that case?

That would require the rebalance work to scan all chunks whenever it's
scheduled, and if a lot of atomic allocations are taking place, that
could become expensive with a lot of chunks.

Thanks.

-- 
tejun


[PATCH percpu/for-4.7-fixes 2/2] percpu: fix synchronization between synchronous map extension and chunk destruction

2016-05-25 Thread Tejun Heo
For non-atomic allocations, pcpu_alloc() can try to extend the area
map synchronously after dropping pcpu_lock; however, the extension
wasn't synchronized against chunk destruction and the chunk might get
freed while extension is in progress.

This patch fixes the bug by putting most of non-atomic allocations
under pcpu_alloc_mutex to synchronize against pcpu_balance_work which
is responsible for async chunk management including destruction.

Signed-off-by: Tejun Heo <t...@kernel.org>
Reported-and-tested-by: Alexei Starovoitov <alexei.starovoi...@gmail.com>
Reported-by: Vlastimil Babka <vba...@suse.cz>
Reported-by: Sasha Levin <sasha.le...@oracle.com>
Cc: sta...@vger.kernel.org # v3.18+
Fixes: 1a4d76076cda ("percpu: implement asynchronous chunk population")
---
Hello,

I'll send both patches mainline in a couple days through the percpu
tree.

Thanks.

 mm/percpu.c |   16 
 1 file changed, 8 insertions(+), 8 deletions(-)

--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -162,7 +162,7 @@ static struct pcpu_chunk *pcpu_reserved_
 static int pcpu_reserved_chunk_limit;
 
 static DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */
-static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop */
+static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */
 
 static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */
 
@@ -444,6 +444,8 @@ static int pcpu_extend_area_map(struct p
size_t old_size = 0, new_size = new_alloc * sizeof(new[0]);
unsigned long flags;
 
+   lockdep_assert_held(&pcpu_alloc_mutex);
+
new = pcpu_mem_zalloc(new_size);
if (!new)
return -ENOMEM;
@@ -890,6 +892,9 @@ static void __percpu *pcpu_alloc(size_t
return NULL;
}
 
+   if (!is_atomic)
+   mutex_lock(&pcpu_alloc_mutex);
+
spin_lock_irqsave(&pcpu_lock, flags);
 
/* serve reserved allocations from the reserved chunk if available */
@@ -962,12 +967,9 @@ restart:
if (is_atomic)
goto fail;
 
-   mutex_lock(&pcpu_alloc_mutex);
-
if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
chunk = pcpu_create_chunk();
if (!chunk) {
-   mutex_unlock(&pcpu_alloc_mutex);
err = "failed to allocate new chunk";
goto fail;
}
@@ -978,7 +980,6 @@ restart:
spin_lock_irqsave(&pcpu_lock, flags);
}
 
-   mutex_unlock(&pcpu_alloc_mutex);
goto restart;
 
 area_found:
@@ -988,8 +989,6 @@ area_found:
if (!is_atomic) {
int page_start, page_end, rs, re;
 
-   mutex_lock(&pcpu_alloc_mutex);
-
page_start = PFN_DOWN(off);
page_end = PFN_UP(off + size);
 
@@ -1000,7 +999,6 @@ area_found:
 
spin_lock_irqsave(&pcpu_lock, flags);
if (ret) {
-   mutex_unlock(&pcpu_alloc_mutex);
pcpu_free_area(chunk, off, &occ_pages);
err = "failed to populate";
goto fail_unlock;
@@ -1040,6 +1038,8 @@ fail:
/* see the flag handling in pcpu_blance_workfn() */
pcpu_atomic_alloc_failed = true;
pcpu_schedule_balance_work();
+   } else {
+   mutex_unlock(&pcpu_alloc_mutex);
}
return NULL;
 }


[PATCH percpu/for-4.7-fixes 1/2] percpu: fix synchronization between chunk->map_extend_work and chunk destruction

2016-05-25 Thread Tejun Heo
Atomic allocations can trigger async map extensions which is serviced
by chunk->map_extend_work.  pcpu_balance_work which is responsible for
destroying idle chunks wasn't synchronizing properly against
chunk->map_extend_work and may end up freeing the chunk while the work
item is still in flight.

This patch fixes the bug by rolling async map extension operations
into pcpu_balance_work.

Signed-off-by: Tejun Heo <t...@kernel.org>
Reported-and-tested-by: Alexei Starovoitov <alexei.starovoi...@gmail.com>
Reported-by: Vlastimil Babka <vba...@suse.cz>
Reported-by: Sasha Levin <sasha.le...@oracle.com>
Cc: sta...@vger.kernel.org # v3.18+
Fixes: 9c824b6a172c ("percpu: make sure chunk->map array has available space")
---
 mm/percpu.c |   57 -
 1 file changed, 36 insertions(+), 21 deletions(-)

--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -112,7 +112,7 @@ struct pcpu_chunk {
int map_used;   /* # of map entries used before the sentry */
int map_alloc;  /* # of map entries allocated */
int *map;   /* allocation map */
-   struct work_struct  map_extend_work;    /* async ->map[] extension */
+   struct list_head    map_extend_list;    /* on pcpu_map_extend_chunks */
 
void*data;  /* chunk data */
int first_free; /* no free below this */
@@ -166,6 +166,9 @@ static DEFINE_MUTEX(pcpu_alloc_mutex);  /
 
 static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */
 
+/* chunks which need their map areas extended, protected by pcpu_lock */
+static LIST_HEAD(pcpu_map_extend_chunks);
+
 /*
  * The number of empty populated pages, protected by pcpu_lock.  The
  * reserved chunk doesn't contribute to the count.
@@ -395,13 +398,19 @@ static int pcpu_need_to_extend(struct pc
 {
int margin, new_alloc;
 
+   lockdep_assert_held(&pcpu_lock);
+
if (is_atomic) {
margin = 3;
 
if (chunk->map_alloc <
-   chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW &&
-   pcpu_async_enabled)
-   schedule_work(&chunk->map_extend_work);
+   chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) {
+   if (list_empty(&chunk->map_extend_list)) {
+   list_add_tail(&chunk->map_extend_list,
+ &pcpu_map_extend_chunks);
+   pcpu_schedule_balance_work();
+   }
+   }
} else {
margin = PCPU_ATOMIC_MAP_MARGIN_HIGH;
}
@@ -467,20 +476,6 @@ out_unlock:
return 0;
 }
 
-static void pcpu_map_extend_workfn(struct work_struct *work)
-{
-   struct pcpu_chunk *chunk = container_of(work, struct pcpu_chunk,
-   map_extend_work);
-   int new_alloc;
-
-   spin_lock_irq(&pcpu_lock);
-   new_alloc = pcpu_need_to_extend(chunk, false);
-   spin_unlock_irq(&pcpu_lock);
-
-   if (new_alloc)
-   pcpu_extend_area_map(chunk, new_alloc);
-}
-
 /**
  * pcpu_fit_in_area - try to fit the requested allocation in a candidate area
  * @chunk: chunk the candidate area belongs to
@@ -740,7 +735,7 @@ static struct pcpu_chunk *pcpu_alloc_chu
chunk->map_used = 1;
 
INIT_LIST_HEAD(&chunk->list);
-   INIT_WORK(&chunk->map_extend_work, pcpu_map_extend_workfn);
+   INIT_LIST_HEAD(&chunk->map_extend_list);
chunk->free_size = pcpu_unit_size;
chunk->contig_hint = pcpu_unit_size;
 
@@ -1129,6 +1124,7 @@ static void pcpu_balance_workfn(struct w
if (chunk == list_first_entry(free_head, struct pcpu_chunk, list))
continue;
 
+   list_del_init(&chunk->map_extend_list);
list_move(&chunk->list, &to_free);
}
 
@@ -1146,6 +1142,25 @@ static void pcpu_balance_workfn(struct w
pcpu_destroy_chunk(chunk);
}
 
+   /* service chunks which requested async area map extension */
+   do {
+   int new_alloc = 0;
+
+   spin_lock_irq(&pcpu_lock);
+
+   chunk = list_first_entry_or_null(&pcpu_map_extend_chunks,
+   struct pcpu_chunk, map_extend_list);
+   if (chunk) {
+   list_del_init(&chunk->map_extend_list);
+   new_alloc = pcpu_need_to_extend(chunk, false);
+   }
+
+   spin_unlock_irq(&pcpu_lock);
+
+   if (new_alloc)
+   pcpu_extend_area_map(chunk, new_alloc);
+   } while (chunk);
+
/*
 * Ensure there are certain number of free populated pages for
 * atomic allocs.  Fill up from the most packed so that atomic
@@ -1644,7 +1659,7 @@ int __init p

Re: bpf: use-after-free in array_map_alloc

2016-05-24 Thread Tejun Heo
Hello,

Alexei, can you please verify this patch?  Map extension got rolled
into balance work so that there are no sync issues between the two async
operations.

Thanks.

Index: work/mm/percpu.c
===
--- work.orig/mm/percpu.c
+++ work/mm/percpu.c
@@ -112,7 +112,7 @@ struct pcpu_chunk {
int map_used;   /* # of map entries used before the sentry */
int map_alloc;  /* # of map entries allocated */
int *map;   /* allocation map */
-   struct work_struct  map_extend_work;    /* async ->map[] extension */
+   struct list_head    map_extend_list;    /* on pcpu_map_extend_chunks */
 
void*data;  /* chunk data */
int first_free; /* no free below this */
@@ -162,10 +162,13 @@ static struct pcpu_chunk *pcpu_reserved_
 static int pcpu_reserved_chunk_limit;
 
 static DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */
-static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop */
+static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */
 
 static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */
 
+/* chunks which need their map areas extended, protected by pcpu_lock */
+static LIST_HEAD(pcpu_map_extend_chunks);
+
 /*
  * The number of empty populated pages, protected by pcpu_lock.  The
  * reserved chunk doesn't contribute to the count.
@@ -395,13 +398,19 @@ static int pcpu_need_to_extend(struct pc
 {
int margin, new_alloc;
 
+   lockdep_assert_held(&pcpu_lock);
+
if (is_atomic) {
margin = 3;
 
if (chunk->map_alloc <
-   chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW &&
-   pcpu_async_enabled)
-   schedule_work(&chunk->map_extend_work);
+   chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) {
+   if (list_empty(&chunk->map_extend_list)) {
+   list_add_tail(&chunk->map_extend_list,
+ &pcpu_map_extend_chunks);
+   pcpu_schedule_balance_work();
+   }
+   }
} else {
margin = PCPU_ATOMIC_MAP_MARGIN_HIGH;
}
@@ -435,6 +444,8 @@ static int pcpu_extend_area_map(struct p
size_t old_size = 0, new_size = new_alloc * sizeof(new[0]);
unsigned long flags;
 
+   lockdep_assert_held(&pcpu_alloc_mutex);
+
new = pcpu_mem_zalloc(new_size);
if (!new)
return -ENOMEM;
@@ -467,20 +478,6 @@ out_unlock:
return 0;
 }
 
-static void pcpu_map_extend_workfn(struct work_struct *work)
-{
-   struct pcpu_chunk *chunk = container_of(work, struct pcpu_chunk,
-   map_extend_work);
-   int new_alloc;
-
-   spin_lock_irq(&pcpu_lock);
-   new_alloc = pcpu_need_to_extend(chunk, false);
-   spin_unlock_irq(&pcpu_lock);
-
-   if (new_alloc)
-   pcpu_extend_area_map(chunk, new_alloc);
-}
-
 /**
  * pcpu_fit_in_area - try to fit the requested allocation in a candidate area
  * @chunk: chunk the candidate area belongs to
@@ -740,7 +737,7 @@ static struct pcpu_chunk *pcpu_alloc_chu
chunk->map_used = 1;
 
INIT_LIST_HEAD(&chunk->list);
-   INIT_WORK(&chunk->map_extend_work, pcpu_map_extend_workfn);
+   INIT_LIST_HEAD(&chunk->map_extend_list);
chunk->free_size = pcpu_unit_size;
chunk->contig_hint = pcpu_unit_size;
 
@@ -895,6 +892,9 @@ static void __percpu *pcpu_alloc(size_t
return NULL;
}
 
+   if (!is_atomic)
+   mutex_lock(&pcpu_alloc_mutex);
+
spin_lock_irqsave(&pcpu_lock, flags);
 
/* serve reserved allocations from the reserved chunk if available */
@@ -967,12 +967,9 @@ restart:
if (is_atomic)
goto fail;
 
-   mutex_lock(&pcpu_alloc_mutex);
-
if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
chunk = pcpu_create_chunk();
if (!chunk) {
-   mutex_unlock(&pcpu_alloc_mutex);
err = "failed to allocate new chunk";
goto fail;
}
@@ -983,7 +980,6 @@ restart:
spin_lock_irqsave(&pcpu_lock, flags);
}
 
-   mutex_unlock(&pcpu_alloc_mutex);
goto restart;
 
 area_found:
@@ -993,8 +989,6 @@ area_found:
if (!is_atomic) {
int page_start, page_end, rs, re;
 
-   mutex_lock(&pcpu_alloc_mutex);
-
page_start = PFN_DOWN(off);
page_end = PFN_UP(off + size);
 
@@ -1005,7 +999,6 @@ area_found:
 
spin_lock_irqsave(&pcpu_lock, flags);
if (ret) {
-   mutex_unlock(&pcpu_alloc_mutex);
pcpu_free_area(chunk, off, &occ_pages);
   

Re: bpf: use-after-free in array_map_alloc

2016-05-24 Thread Tejun Heo
Hello,

On Tue, May 24, 2016 at 10:40:54AM +0200, Vlastimil Babka wrote:
> [+CC Marco who reported the CVE, forgot that earlier]
> 
> On 05/23/2016 11:35 PM, Tejun Heo wrote:
> > Hello,
> > 
> > Can you please test whether this patch resolves the issue?  While
> > adding support for atomic allocations, I reduced alloc_mutex covered
> > region too much.
> > 
> > Thanks.
> 
> Ugh, this makes the code even more head-spinning than it was.

Locking-wise, it isn't complicated.  It used to be a single mutex
protecting everything.  Atomic alloc support required putting core
allocation parts under spinlock.  It is messy because the two paths
are mixed in the same function.  If we break out the core part to a
separate function and let the sleepable path call into that, it should
look okay, but that's for another patch.

Also, I think protecting chunk's lifetime w/ alloc_mutex is making it
a bit nasty.  Maybe we should do per-chunk "extending" completion and
let pcpu_alloc_mutex just protect populating chunks.

> > @@ -435,6 +435,8 @@ static int pcpu_extend_area_map(struct pcpu_chunk 
> > *chunk, int new_alloc)
> > size_t old_size = 0, new_size = new_alloc * sizeof(new[0]);
> > unsigned long flags;
> > 
> > +   lockdep_assert_held(&pcpu_alloc_mutex);
> 
> I don't see where the mutex gets locked when called via
> pcpu_map_extend_workfn? (except via the new cancel_work_sync() call below?)

Ah, right.

> Also what protects chunks with scheduled work items from being removed?

cancel_work_sync(), which now obviously should be called outside
alloc_mutex.

> > @@ -895,6 +897,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
> > align, bool reserved,
> > return NULL;
> > }
> > 
> > +   if (!is_atomic)
> > +   mutex_lock(&pcpu_alloc_mutex);
> 
> BTW I noticed that
>   bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
> 
> this is too pessimistic IMHO. Reclaim is possible even without __GFP_FS and
> __GFP_IO. Could you just use gfpflags_allow_blocking(gfp) here?

vmalloc hardcodes GFP_KERNEL, so getting more relaxed doesn't buy us
much.

Thanks.

-- 
tejun
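
The two checks being compared in the last exchange, side by side.  The
wrapper functions are made up for illustration; the flag test and
gfpflags_allow_blocking() themselves are the real kernel forms.

#include <linux/gfp.h>

/* current check in pcpu_alloc(): only a full GFP_KERNEL mask counts
 * as sleepable, everything else is treated as an atomic allocation */
static bool percpu_is_atomic_current(gfp_t gfp)
{
	return (gfp & GFP_KERNEL) != GFP_KERNEL;
}

/* suggested alternative: treat any allocation that may block as
 * sleepable, even without __GFP_FS / __GFP_IO */
static bool percpu_is_atomic_relaxed(gfp_t gfp)
{
	return !gfpflags_allow_blocking(gfp);
}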


Re: bpf: use-after-free in array_map_alloc

2016-05-23 Thread Tejun Heo
Hello,

Can you please test whether this patch resolves the issue?  While
adding support for atomic allocations, I reduced alloc_mutex covered
region too much.

Thanks.

diff --git a/mm/percpu.c b/mm/percpu.c
index 0c59684..bd2df70 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -162,7 +162,7 @@ static struct pcpu_chunk *pcpu_reserved_chunk;
 static int pcpu_reserved_chunk_limit;
 
 static DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */
-static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop */
+static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map extension */
 
 static struct list_head *pcpu_slot __read_mostly; /* chunk list slots */
 
@@ -435,6 +435,8 @@ static int pcpu_extend_area_map(struct pcpu_chunk *chunk, 
int new_alloc)
size_t old_size = 0, new_size = new_alloc * sizeof(new[0]);
unsigned long flags;
 
+   lockdep_assert_held(&pcpu_alloc_mutex);
+
new = pcpu_mem_zalloc(new_size);
if (!new)
return -ENOMEM;
@@ -895,6 +897,9 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, 
bool reserved,
return NULL;
}
 
+   if (!is_atomic)
+   mutex_lock(&pcpu_alloc_mutex);
+
spin_lock_irqsave(&pcpu_lock, flags);
 
/* serve reserved allocations from the reserved chunk if available */
@@ -967,12 +972,11 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
align, bool reserved,
if (is_atomic)
goto fail;
 
-   mutex_lock(&pcpu_alloc_mutex);
+   lockdep_assert_held(&pcpu_alloc_mutex);

if (list_empty(&pcpu_slot[pcpu_nr_slots - 1])) {
chunk = pcpu_create_chunk();
if (!chunk) {
-   mutex_unlock(&pcpu_alloc_mutex);
err = "failed to allocate new chunk";
goto fail;
}
@@ -983,7 +987,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, 
bool reserved,
spin_lock_irqsave(&pcpu_lock, flags);
}
 
-   mutex_unlock(&pcpu_alloc_mutex);
goto restart;
 
 area_found:
@@ -993,8 +996,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, 
bool reserved,
if (!is_atomic) {
int page_start, page_end, rs, re;
 
-   mutex_lock(&pcpu_alloc_mutex);
-
page_start = PFN_DOWN(off);
page_end = PFN_UP(off + size);
 
@@ -1005,7 +1006,6 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
align, bool reserved,
 
spin_lock_irqsave(&pcpu_lock, flags);
if (ret) {
-   mutex_unlock(&pcpu_alloc_mutex);
pcpu_free_area(chunk, off, &occ_pages);
err = "failed to populate";
goto fail_unlock;
@@ -1045,6 +1045,8 @@ static void __percpu *pcpu_alloc(size_t size, size_t 
align, bool reserved,
/* see the flag handling in pcpu_blance_workfn() */
pcpu_atomic_alloc_failed = true;
pcpu_schedule_balance_work();
+   } else {
+   mutex_unlock(&pcpu_alloc_mutex);
}
return NULL;
 }
@@ -1137,6 +1139,8 @@ static void pcpu_balance_workfn(struct work_struct *work)
list_for_each_entry_safe(chunk, next, _free, list) {
int rs, re;
 
+   cancel_work_sync(&chunk->map_extend_work);
+
pcpu_for_each_pop_region(chunk, rs, re, 0, pcpu_unit_pages) {
pcpu_depopulate_chunk(chunk, rs, re);
spin_lock_irq(&pcpu_lock);


Re: [PATCH] qlge: Replace create_singlethread_workqueue with alloc_ordered_workqueue

2016-04-14 Thread Tejun Heo
Hello, Manish.

On Thu, Apr 14, 2016 at 07:25:15AM +, Manish Chopra wrote:
> Just want to confirm that __WQ_LEGACY flag is not necessary here as this is 
> removed
> with this change ? 

Yeah, that should be fine.  That only affects locking dependency
tracking which can fire spuriously due to workqueues created with the
old interface having WQ_MEM_RECLAIM unconditionally.  In this case, we
actually want WQ_MEM_RECLAIM and thus we want the dependency tracking
too.

Thanks.

-- 
tejun


Re: [PATCH] qlge: Replace create_singlethread_workqueue with alloc_ordered_workqueue

2016-04-13 Thread Tejun Heo
On Sat, Apr 09, 2016 at 05:27:45PM +0530, Amitoj Kaur Chawla wrote:
> Replace deprecated create_singlethread_workqueue with
> alloc_ordered_workqueue.
> 
> Work items include getting tx/rx frame sizes, resetting MPI processor,
> setting asic recovery bit so ordering seems necessary as only one work
> item should be in queue/executing at any given time, hence the use of
> alloc_ordered_workqueue.
> 
> WQ_MEM_RECLAIM flag has been set since ethernet devices seem to sit in
> memory reclaim path, so to guarantee forward progress regardless of 
> memory pressure.
> 
> Signed-off-by: Amitoj Kaur Chawla <amitoj1...@gmail.com>
> Acked-by: Tejun Heo <t...@kernel.org>

Ping?

-- 
tejun


Re: [RFD] workqueue: WQ_MEM_RECLAIM usage in network drivers

2016-03-20 Thread Tejun Heo
Hello,

On Fri, Mar 18, 2016 at 05:24:05PM -0400, J. Bruce Fields wrote:
> > But does that actually work?  It's pointless to add WQ_MEM_RECLAIM to
> > workqueues unless all other things are also guaranteed to make forward
> > progress regardless of memory pressure.
> 
> It's supposed to work.
> 
> Also note there was a bunch of work done to allow swap on NFS: see
> a564b8f0 "nfs: enable swap on NFS".

Alright, WQ_MEM_RECLAIM for ethernet devices, then.

> I use NFS mounts over wifi at home.  I may just be odd.  I seem to
> recall some bug reports about suspend vs. NFS--were those also on
> laptops using NFS?
> 
> I wonder if home media centers might do writes over wifi to network
> storage?
> 
> Googling for "nfs wifi" turns up lots of individuals doing this.
> 
> My first impulse is to say that it's probably not perfect but that we
> shouldn't make it worse.

If everything else is working, I'd be happy to throw in WQ_MEM_RECLAIM
but I really don't want to add it if it doesn't actually achieve the
goal.  Can a wireless person chime in here?

Thanks.

-- 
tejun


Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-03-20 Thread Tejun Heo
Hello, Jiri.

On Thu, Mar 17, 2016 at 01:00:13PM +0100, Jiri Slaby wrote:
> >> I have not done that yet, but today, I see:
> >> destroy_workqueue: name='req_hci0' pwq=88002f590300
> >> wq->dfl_pwq=88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
> >>pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1
> >>  in-flight: 18568:wq_barrier_func
> > 
> > So, this means that there's flush_work() racing against workqueue
> > destruction, which can't be safe. :(
> 
> But I cannot trigger the WARN_ONs in the attached patch, so I am
> confused how this can happen :(. (While I am still seeing the destroy
> WARNINGs.)

So, no operations should be in progress when destroy_workqueue() is
called.  If somebody was flushing a work item, the flush call must
have returned before destroy_workqueue() was invoked, which doesn't
seem to be the case here.  Can you trigger BUG_ON() or sysrq-t when
the above triggers?  There must be a task which is flushing a work
item there and it shouldn't be difficult to pinpoint what's going on
from it.

Thanks.

-- 
tejun
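
The ordering rule above, restated as code.  This is a minimal sketch with
hypothetical structure and function names, not the hci code in question.

#include <linux/workqueue.h>

struct example_dev {
	struct workqueue_struct	*wq;
	struct work_struct	work;	/* hypothetical work item */
};

static void example_unregister(struct example_dev *dev)
{
	/*
	 * By this point nothing may queue or flush dev->work anymore
	 * (timers, interrupts and user-visible entry points are shut
	 * down), so cancel any pending instance ...
	 */
	cancel_work_sync(&dev->work);

	/* ... and only then tear the workqueue down. */
	destroy_workqueue(dev->wq);
}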


Re: [PATCH] iwlwifi: dvm: convert create_singlethread_workqueue() to alloc_workqueue()

2016-03-19 Thread Tejun Heo
Hello,

On Thu, Mar 17, 2016 at 01:43:22PM +0100, Johannes Berg wrote:
> On Thu, 2016-03-17 at 20:37 +0800, Eva Rachel Retuya wrote:
> > Use alloc_workqueue() to allocate the workqueue instead of
> > create_singlethread_workqueue() since the latter is deprecated and is
> > scheduled for removal.
> 
> Scheduled where?

They've been deprecated for years now.  I should note that in the
header.

> >  static void iwl_setup_deferred_work(struct iwl_priv *priv)
> >  {
> > -   priv->workqueue = create_singlethread_workqueue(DRV_NAME);
> > +   priv->workqueue = alloc_workqueue(DRV_NAME, WQ_HIGHPRI |
> > WQ_UNBOUND |
> > +     WQ_MEM_RECLAIM, 1);
> 
> Seems like you should use alloc_ordered_workqueue() though? That also
> gets you UNBOUND immediately, and the "1".

Right, this one should have been alloc_ordered_workqueue().

> I'm not really sure HIGHPRI is needed either.

So, no WQ_MEM_RECLAIM either then, I suppose?  What are the latency
requirements here - what happens if a thermal management work gets
delayed?

Thanks.

-- 
tejun


[RFD] workqueue: WQ_MEM_RECLAIM usage in network drivers

2016-03-19 Thread Tejun Heo
Hello,

Years ago, workqueue got reimplemented to use common worker pools
across different workqueues and a new set of more expressive workqueue
creation APIs, alloc_*workqueue() were introduced.  The old
create_*workqueue() became simple wrappers around alloc_*workqueue()
with the most conservative parameters.  The plan has always been to
examine each usage and convert to the new interface with parameters
actually required for the use case.

One important flag to decide upon is WQ_MEM_RECLAIM, which declares
that the workqueue may be depended upon during memory reclaim and thus
must be able to make forward-progress even when further memory can't
be allocated without reclaiming some.  Of the network drivers which
already use alloc_*workqueue() interface, some specify this flag and
I'm wondering what the guidelines should be here.

* Are network devices expected to be able to serve as a part of
  storage stack which is depended upon for memory reclamation?

* If so, are all the pieces in place for that to work for all (or at
  least most) network devices?  If it's only for a subset of NICs, how
  can one tell whether a given driver needs forward progress guarantee
  or not?

* I assume that wireless drivers aren't and can't be used in this
  fashion.  Is that a correct assumption?

Thanks.

-- 
tejun
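
In code terms, the per-driver decision the questions above are probing
looks like this.  The names are placeholders; only the flag usage is the
point.

#include <linux/workqueue.h>

static struct workqueue_struct *nic_wq;

static int example_nic_wq_init(void)
{
	/*
	 * If the device can sit in the memory-reclaim path (e.g. swap or
	 * writeback over the network), the workqueue must be able to make
	 * forward progress without allocating memory:
	 */
	nic_wq = alloc_workqueue("nic", WQ_MEM_RECLAIM, 0);

	/*
	 * If it cannot be depended upon during reclaim, the flag only
	 * wastes a rescuer thread and should be dropped:
	 *
	 *	nic_wq = alloc_workqueue("nic", 0, 0);
	 */
	return nic_wq ? 0 : -ENOMEM;
}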


Re: [RFD] workqueue: WQ_MEM_RECLAIM usage in network drivers

2016-03-19 Thread Tejun Heo
Hello, Jeff.

On Thu, Mar 17, 2016 at 09:32:16PM -0400, Jeff Layton wrote:
> > * Are network devices expected to be able to serve as a part of
> >   storage stack which is depended upon for memory reclamation?
> 
> I think they should be. Cached NFS pages can consume a lot of memory,
> and flushing them generally takes network device access.

But does that actually work?  It's pointless to add WQ_MEM_RECLAIM to
workqueues unless all other things are also guaranteed to make forward
progress regardless of memory pressure.

> > * If so, are all the pieces in place for that to work for all (or at
> >   least most) network devices?  If it's only for a subset of NICs, how
> >   can one tell whether a given driver needs forward progress guarantee
> >   or not?
> > 
> > * I assume that wireless drivers aren't and can't be used in this
> >   fashion.  Is that a correct assumption?
> > 
> 
> People do mount NFS over wireless interfaces. It's not terribly common
> though, in my experience.

Ditto, I'm very skeptical that this actually works in practice or that
people expect and depend on it.  I don't follow wireless development
closely but haven't heard anyone talking about reserving memory pools
or people complaining about wireless being the cause of OOM.

So, I really want to avoid spraying WQ_MEM_RECLAIM if it doesn't serve
actual purposes.  It's wasteful, sets a bad precedent, and confuses
future readers.

Thanks.

-- 
tejun


Re: [Outreachy kernel] [PATCH v2] iwlwifi: dvm: use alloc_ordered_workqueue()

2016-03-19 Thread Tejun Heo
On Fri, Mar 18, 2016 at 12:19:21AM +0800, Eva Rachel Retuya wrote:
> Use alloc_ordered_workqueue() to allocate the workqueue instead of
> create_singlethread_workqueue() since the latter is deprecated and is 
> scheduled
> for removal.
> 
> There are work items doing related operations that shouldn't be swapped when
> queued in a certain order hence preserve the strict execution ordering of a
> single threaded (ST) workqueue by switching to alloc_ordered_workqueue().
> 
> WQ_MEM_RECLAIM flag is not needed since the worker is not supposed to free
> memory.

I think "not depended during memory reclaim" probalby is a better way
to describe it.

> Signed-off-by: Eva Rachel Retuya <eraret...@gmail.com>

But other than that,

 Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun
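
Putting the points from this thread together, the allocation would end up
looking roughly like the following.  A sketch only: struct iwl_priv and
DRV_NAME are taken from the quoted driver code, the surrounding function
is illustrative.

#include <linux/workqueue.h>

static int example_iwl_setup_wq(struct iwl_priv *priv)
{
	/*
	 * Ordered: at most one work item in flight, preserving the strict
	 * execution order of the old single-threaded workqueue.  No
	 * WQ_MEM_RECLAIM since the queue is not depended upon during
	 * memory reclaim.
	 */
	priv->workqueue = alloc_ordered_workqueue(DRV_NAME, 0);
	if (!priv->workqueue)
		return -ENOMEM;
	return 0;
}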


Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-03-11 Thread Tejun Heo
Hello,

Sorry about the delay.

On Thu, Mar 03, 2016 at 10:12:01AM +0100, Jiri Slaby wrote:
> On 03/02/2016, 04:45 PM, Tejun Heo wrote:
> > On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote:
> >>> 1. didn't help, the problem persists. So I haven't applied the patch from 
> >>> 2.
> >>
> >> FWIW I dumped more info about the wq:
> >> wq->name='hci0' pwq=8800390d7600 wq->dfl_pwq=8800390d5200
> >> pwq->refcnt=2 pwq->nr_active=0 delayed_works: 
> > 
> > Can you please print out the same info for all pwq's during shutdown?
> > It looks like we're leaking pwq refcnt but I can't spot a place where
> > that could happen on an empty pwq.
> 
> I have not done that yet, but today, I see:
> destroy_workqueue: name='req_hci0' pwq=88002f590300
> wq->dfl_pwq=88002f591e00 pwq->refcnt=2 pwq->nr_active=0 delayed_works:
>pwq 12: cpus=0-1 node=0 flags=0x4 nice=-20 active=0/1
>  in-flight: 18568:wq_barrier_func

So, this means that there's flush_work() racing against workqueue
destruction, which can't be safe. :(

Thanks.

-- 
tejun


Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-03-02 Thread Tejun Heo
Hello, Jiri.

On Fri, Feb 19, 2016 at 01:10:00PM +0100, Jiri Slaby wrote:
> > 1. didn't help, the problem persists. So I haven't applied the patch from 2.
> 
> FWIW I dumped more info about the wq:
> wq->name='hci0' pwq=8800390d7600 wq->dfl_pwq=8800390d5200
> pwq->refcnt=2 pwq->nr_active=0 delayed_works: 

Can you please print out the same info for all pwq's during shutdown?
It looks like we're leaking pwq refcnt but I can't spot a place where
that could happen on an empty pwq.

Thanks.

-- 
tejun


Re: [RFC][PATCH] tags: Fix DEFINE_PER_CPU expansions

2016-03-01 Thread Tejun Heo
Hello,

On Tue, Mar 01, 2016 at 07:18:42PM +0100, Peter Zijlstra wrote:
> > Urgh... I really hate the fact that we're putting on arbitrary
> > formatting constraints to compensate for shortcomings in an
> > out-of-line utility.
> 
> Yes it does.
> 
> I'm not too bothered if you don't want to fix this, I just figured I'd
> have a stab at fixing all this, since I regularly run 'make tags' and
> got tired of seeing the warns.

I use cscope so haven't noticed them.

> > Can't it do multiline regex?
> 
> Apparently not:
> 
>   https://sourceforge.net/p/ctags/feature-requests/38/

That's unfortunate.  Ah well, it's only several spots and isn't a big
deal one way or the other.  Occasional >80 lines are fine by me.
Please feel free to add

 Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [RFC][PATCH] tags: Fix DEFINE_PER_CPU expansions

2016-03-01 Thread Tejun Heo
Hello, Peter.

On Tue, Mar 01, 2016 at 11:26:25AM +0100, Peter Zijlstra wrote:
> 
> $ make tags
>   GEN tags
> ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name 
> pattern "\1"
> ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name 
> pattern "\1"
> ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern 
> "\1"
> ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern 
> "\1"
> ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern 
> "\1"
> ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
> ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
> ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
> ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
> 
> Which are all the result of the DEFINE_PER_CPU pattern:
> 
> scripts/tags.sh:200:  '/\ scripts/tags.sh:201:  '/\ *\([[:alnum:]_]*\)/\1/v/'
> 
> The below cures them. All except the workqueue one are within reasonable
> distance of the 80 char limit. TJ do you have any preference on how to
> fix the wq one, or shall we just not care its too long?

Urgh... I really hate the fact that we're putting on arbitrary
formatting constraints to compensate for shortcomings in an
out-of-line utility.  Can't it do multiline regex?

Thanks.

-- 
tejun
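
For readers unfamiliar with the ctags warning being discussed, the pattern
in question looks like this.  The declarations are purely illustrative,
not the actual ones at the line numbers quoted above.

#include <linux/percpu.h>

/* split across lines: tags.sh's single-line regex finds no variable name
 * and ctags warns 'null expansion of name pattern "\1"' */
static DEFINE_PER_CPU(struct list_head,
		      example_split_list);

/* kept on one line: the regex can capture the variable name */
static DEFINE_PER_CPU(struct list_head, example_single_line_list);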


Re: net/bluetooth: workqueue destruction WARNING in hci_unregister_dev

2016-02-18 Thread Tejun Heo
Hello,

Can you please do the following?

1. Remove WQ_MEM_RECLAIM from the affected workqueue and see whether
   the problem is reproducible.  WQ_MEM_RECLAIM on anything bluetooth
   doesn't make sense btw.  Why is it there?

2. If WQ_MEM_RECLAIM makes the issue go away, see whether the attached
   patch works too.

Thanks.

-- 
tejun
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 7ff5dc7..9824d4f 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4012,7 +4012,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
/* drain it before proceeding with destruction */
drain_workqueue(wq);
 
-   /* sanity checks */
+   /* nothing should be in flight */
mutex_lock(&wq->mutex);
for_each_pwq(pwq, wq) {
int i;
@@ -4024,8 +4024,7 @@ void destroy_workqueue(struct workqueue_struct *wq)
}
}
 
-   if (WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1)) ||
-   WARN_ON(pwq->nr_active) ||
+   if (WARN_ON(pwq->nr_active) ||
WARN_ON(!list_empty(&pwq->delayed_works))) {
mutex_unlock(&wq->mutex);
return;
@@ -4046,6 +4045,13 @@ void destroy_workqueue(struct workqueue_struct *wq)
if (wq->rescuer)
kthread_stop(wq->rescuer->task);
 
+   /* rescuer is gone, everything should be quiescent now */
+   WARN_ON(!list_empty(&wq->maydays));
+   mutex_lock(&wq->mutex);
+   for_each_pwq(pwq, wq)
+   WARN_ON((pwq != wq->dfl_pwq) && (pwq->refcnt > 1));
+   mutex_unlock(&wq->mutex);
+
if (!(wq->flags & WQ_UNBOUND)) {
/*
 * The base ref is never dropped on per-cpu pwqs.  Directly


Re: [PATCH net-next] core: remove unneded headers for net cgroup controllers.

2016-02-15 Thread Tejun Heo
On Mon, Feb 15, 2016 at 02:39:43AM +0200, Rami Rosen wrote:
> commit 3ed80a6 (cgroup: drop module support) made including 
> module.h redundant in the net cgroup controllers, 
> netclassid_cgroup.c and netprio_cgroup.c. This patch 
> removes them.
> 
> Signed-off-by: Rami Rosen <rami.ro...@intel.com>

Acked-by: Tejun Heo <t...@kernel.org>

Thanks.

-- 
tejun


Re: [PATCH 8/8] netfilter: implement xt_cgroup cgroup2 path match

2016-02-12 Thread Tejun Heo
Hello,

On Fri, Feb 12, 2016 at 01:10:18AM +0100, Alban Crequy wrote:
> Is there any plans to implement a similar cgroup2 path match in a
> cgroup classifier in tc?

That'd be great.  I wanted to but couldn't figure out the heads and
tails after staring at the code for half an hour.

> I wonder if it will be possible to use cgroup to classify packets in
> tc without net_cls.class_id in cgroup2.

You can still match against the cgroup v2 path using xt_cgroup, mark the
packet, and then match that mark from tc, which isn't ideal but works.

Thanks.

-- 
tejun

