Re: [PATCH] x86/kvm: disable fast MMIO when running nested

2018-01-24 Thread Wanpeng Li
2018-01-24 23:12 GMT+08:00 Vitaly Kuznetsov :
> I was investigating an issue with seabios >= 1.10 which stopped working
> for nested KVM on Hyper-V. The problem appears to be in
> handle_ept_violation() function: when we do fast mmio we need to skip
> the instruction so we do kvm_skip_emulated_instruction(). This, however,
> depends on VM_EXIT_INSTRUCTION_LEN field being set correctly in VMCS.
> However, this is not the case.
>
> Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when
> EPT MISCONFIG occurs. While on real hardware it was observed to be set,
> some hypervisors follow the spec and don't set it; we end up advancing
> IP with some random value.
>
> I checked with Microsoft and they confirmed they don't fill
> VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG.
>
> Fix the issue by disabling fast mmio when running nested.
>
> Signed-off-by: Vitaly Kuznetsov 

Reviewed-by: Wanpeng Li 

> ---
>  arch/x86/kvm/vmx.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index c829d89e2e63..54afb446f38e 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6558,9 +6558,16 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
> /*
>  * A nested guest cannot optimize MMIO vmexits, because we have an
>  * nGPA here instead of the required GPA.
> +* Skipping instruction below depends on undefined behavior: Intel's
> +* manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set in VMCS
> +* when EPT MISCONFIG occurs and while on real hardware it was observed
> +* to be set, other hypervisors (namely Hyper-V) don't set it, we end
> +* up advancing IP with some random value. Disable fast mmio when
> +* running nested and keep it for real hardware in hope that
> +* VM_EXIT_INSTRUCTION_LEN will always be set correctly.
>  */
> gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> -   if (!is_guest_mode(vcpu) &&
> +   if (!static_cpu_has(X86_FEATURE_HYPERVISOR) && !is_guest_mode(vcpu) &&
> !kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> trace_kvm_fast_mmio(gpa);
> return kvm_skip_emulated_instruction(vcpu);
> --
> 2.14.3
>


Re: tip/master falls off NOP cliff with KPTI under KVM

2018-01-24 Thread David Woodhouse
On Wed, 2018-01-24 at 16:53 -0800, Dexuan-Linux Cui wrote:
> On Wed, Jan 10, 2018 at 2:53 PM, Woodhouse, David  wrote:
> > 
> > On Thu, 2018-01-11 at 01:34 +0300, Alexey Dobriyan wrote:
> > > 
> > > 
> > > Bisection points to
> > > 
> > > f3433c1010c6af61c9897f0f0447f81b991feac1 is the first bad commit
> > > commit f3433c1010c6af61c9897f0f0447f81b991feac1
> > > Author: David Woodhouse 
> > > Date:   Tue Jan 9 14:43:11 2018 +
> > > 
> > > x86/retpoline/entry: Convert entry assembler indirect jumps
> > Thanks. We've fixed the underlying problem with the alternatives
> > mechanism, *and* changed the retpoline code not to actually rely on
> > said fix.
> Hi David and all,
> It looks the latest upstream tree did fix the issue.
> Can you please specify the related commit(s) that fixed the issue?
> I need to cherry-pick the fix. I suppose a quick reply from you would save
>  me a lot of time. :-)

Hi Dexuan,

The above commit didn't ever make it into Linus' tree in that form; the
issues were fixed beforehand. So there isn't a subsequent commit that
fixes it. I think it might have just been removing some .align 16 from
nospec-branch.h?

The correct version has been backported to 4.9 and 4.4 releases
already; if you pulled in an early version directly from tip/x86/pti
then I'd recommend you drop it and pull in the real version instead.




Re: [PATCH 5/6] Documentation for Pmalloc

2018-01-24 Thread Igor Stoppa


On 24/01/18 21:14, Ralph Campbell wrote:
> 2 Minor typos inline below:

thanks for proof-reading, will fix accordingly.

--
igor


Re: [PATCH] block: blk-mq-sched: Replace GFP_ATOMIC with GFP_KERNEL in blk_mq_sched_assign_ioc

2018-01-24 Thread Jia-Ju Bai



On 2018/1/25 12:16, Al Viro wrote:

On Thu, Jan 25, 2018 at 11:13:56AM +0800, Jia-Ju Bai wrote:


I have checked the given call chain, and found that nvme_dev_disable in
nvme_timeout calls mutex_lock, which can sleep.
Thus, I suppose this call chain is not in atomic context.

... or it is broken.


Besides, how did you find that "function (nvme_timeout()) strongly suggests
that it *is* meant to be called from bloody atomic context"?
I checked the comments in nvme_timeout, and did not find a related description...

Anything that reads registers for controller state presumably won't be
happy if it can happen in parallel with other threads poking the same
hardware.  Not 100% guaranteed, but it's a fairly strong sign that there's
some kind of exclusion between whatever's submitting requests / handling
interrupts and the caller of that thing.  And such exclusion is likely
to be spin_lock_irqsave()-based.

Again, that does not _prove_ it's called from atomic contexts, but does
suggest such possibility.

Looking through the callers of that method, blk_abort_request() certainly
*is* called from under queue lock.  Different drivers, though.  No idea
if nvme_timeout() blocking case is broken - I'm not familiar with that
code.  Question should go to nvme maintainers...

However, digging through other call chains, there's this:
drivers/md/dm-mpath.c:530:  clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
in multipath_clone_and_map(), aka. ->clone_and_map_rq(), called at
drivers/md/dm-rq.c:480: r = ti->type->clone_and_map_rq(ti, rq, &tio->info, &clone);
in map_request(), which is called from dm_mq_queue_rq(), aka ->queue_rq(),
which is called from blk_mq_dispatch_rq_list(), called from
blk_mq_do_dispatch_sched(), called from blk_mq_sched_dispatch_requests(),
called under rcu_read_lock().  Not a context where you want GFP_KERNEL
allocations...


By the way, do you mean that I should add "My tool has proved that this
function is never called in atomic context" in the description?

I mean that proof itself should be at least outlined.  Crediting the tool
for finding the proof is fine *IF* it's done along with the proof itself.

You want to convince the people applying the patch that it is correct.
Leaving out something trivial to verify is fine - "foo_bar_baz() has
no callers" doesn't require grep output + quoting the area around each
instance to prove that all of them are in the comments, etc.; readers
can bloody well check the accuracy of that claim themselves.  This
kind of analysis, however, is decidedly *NOT* trivial to verify from
scratch.

Moreover, if what you've proven is that for each call chain leading
to that place there's a blocking operation nearby, there is still
a possibility that some of those *are* called while in non-blocking
context.  In that case you've found real bugs, and strictly speaking
your change doesn't break correct code.  However, it does not make
the change itself correct - if you have something like
enter non-blocking section
.
in very unusual cases grab a mutex (or panic, or...)
.
do GFP_ATOMIC allocation
.
leave non-blocking section
changing that to GFP_KERNEL will turn "we deadlock in very hard to
hit case" into "we deadlock easily"...

At the very least, I'd like to see those cutoffs - i.e. the places
that already could block on the callchains.  You might very well
have found actual bugs there.


Okay, thanks for your detailed explanation :)
I admit that my report here is not correct, and I will improve my tool.


Thanks,
Jia-Ju Bai


Re: [PATCH v2] fs: fsnotify: account fsnotify metadata to kmemcg

2018-01-24 Thread Amir Goldstein
On Thu, Jan 25, 2018 at 3:08 AM, Shakeel Butt  wrote:
> On Wed, Jan 24, 2018 at 3:12 AM, Amir Goldstein  wrote:
>> On Wed, Jan 24, 2018 at 12:34 PM, Jan Kara  wrote:
>>> On Mon 22-01-18 22:31:20, Amir Goldstein wrote:
 On Fri, Jan 19, 2018 at 5:02 PM, Shakeel Butt  wrote:
 > On Wed, Nov 15, 2017 at 1:31 AM, Jan Kara  wrote:
 >> On Wed 15-11-17 01:32:16, Yang Shi wrote:
 >>> On 11/14/17 1:39 AM, Michal Hocko wrote:
 >>> >On Tue 14-11-17 03:10:22, Yang Shi wrote:
 >>> >>
 >>> >>
 >>> >>On 11/9/17 5:54 AM, Michal Hocko wrote:
 >>> >>>[Sorry for the late reply]
 >>> >>>
 >>> >>>On Tue 31-10-17 11:12:38, Jan Kara wrote:
 >>> On Tue 31-10-17 00:39:58, Yang Shi wrote:
 >>> >>>[...]
 >>> >I do agree it is not fair and not neat to account to producer 
 >>> >rather than
 >>> >misbehaving consumer, but current memcg design looks not support 
 >>> >such use
 >>> >case. And, the other question is do we know who is the listener 
 >>> >if it
 >>> >doesn't read the events?
 >>> 
 >>> So you never know who will read from the notification file 
 >>> descriptor but
 >>> you can simply account that to the process that created the 
 >>> notification
 >>> group and that is IMO the right process to account to.
 >>> >>>
 >>> >>>Yes, if the creator is de-facto owner which defines the lifetime of
 >>> >>>those objects then this should be a target of the charge.
 >>> >>>
 >>> I agree that current SLAB memcg accounting does not allow to 
 >>> account to a
 >>> different memcg than the one of the running process. However I 
 >>> *think* it
 >>> should be possible to add such interface. Michal?
 >>> >>>
 >>> >>>We do have memcg_kmem_charge_memcg but that would require some 
 >>> >>>plumbing
 >>> >>>to hook it into the specific allocation path. I suspect it uses 
 >>> >>>kmalloc,
 >>> >>>right?
 >>> >>
 >>> >>Yes.
 >>> >>
 >>> >>I took a look at the implementation and the callsites of
 >>> >>memcg_kmem_charge_memcg(). It looks it is called by:
 >>> >>
 >>> >>* charge kmem to memcg, but it is charged to the allocator's memcg
 >>> >>* allocate new slab page, charge to memcg_params.memcg
 >>> >>
 >>> >>I think this is the plumbing you mentioned, right?
 >>> >
 >>> >Maybe I have misunderstood, but you are using slab allocator. So you
 >>> >would need to force it to use a different charging context than 
 >>> >current.
 >>>
 >>> Yes.
 >>>
 >>> >I haven't checked deeply but this doesn't look trivial to me.
 >>>
 >>> I agree. This is also what I explained to Jan and Amir in earlier
 >>> discussion.
 >>
 >> And I also agree. But the fact that it is not trivial does not mean 
 >> that it
 >> should not be done...
 >>
 >
 > I am currently working on directed or remote memcg charging for a
 > different usecase and I think that would be helpful here as well.
 >
 > I have two questions though:
 >
 > 1) Is fsnotify_group the right structure to hold the reference to
 > target mem_cgroup for charging?

 I think it is. The process who set up the group and determined the 
 unlimited
 events queue size and did not consume the events from the queue in a timely
 manner is the process to blame for the OOM situation.
>>>
>>> Agreed here.
>
> Please note that for fcntl(F_NOTIFY), a global group, dnotify_group,
> is used. The allocations from dnotify_struct_cache &
> dnotify_mark_cache happens in the fcntl(F_NOTIFY), so , I take that
> the memcg of the current process should be charged.

Correct. Note that dnotify_struct_cache is allocated when setting up the
watch. That is always done by the listener, which is the correct process to
charge. Handling the event does not allocate an event struct, so there is
no issue here.

>
>>>
 > 2) Remote charging can trigger an OOM in the target memcg. In this
 > usecase, I think, there should be security concerns if the events
 > producer can trigger OOM in the memcg of the monitor. We can either
 > change these allocations to use __GFP_NORETRY or some new gfp flag to
 > not trigger oom-killer. So, is this valid concern or am I
 > over-thinking?
>
> First, let me apologize, I think I might have led the discussion in
> wrong direction by giving one wrong information. The current upstream
> kernel, from the syscall context, does not invoke oom-killer when a
> memcg hits its limit and fails to reclaim memory, instead ENOMEM is
> returned. The memcg oom-killer is only invoked on page faults. However
> in a separate effort I do plan to converge the behavior, long
> discussion at .
>


So I guess it would be better if we limit the discussion on this
thread to which memcg 

Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns

2018-01-24 Thread vcaputo
On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> 2018-01-17 23:48 GMT+01:00  :
> > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> >> Hi Vito,
> >>
> >> 2017-12-01 22:33 GMT+01:00  :
> >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcap...@pengaru.com wrote:
> >> >> Hello,
> >> >>
> >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> >> `cmus` while doing big and churny `git checkout` commands in my linux 
> >> >> git
> >> >> tree.
> >> >>
> >> >> It's not something I've done much of over the last couple months so I
> >> >> hadn't noticed until yesterday, but didn't remember this being a 
> >> >> problem in
> >> >> recent history.
> >> >>
> >> >> As there's quite an accumulation of similarly configured and built 
> >> >> kernels
> >> >> in my grub menu, it was trivial to determine approximately when this 
> >> >> began:
> >> >>
> >> >> 4.11.0: no dropouts
> >> >> 4.12.0-rc7: dropouts
> >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >> >>
> >> >> Watching top while this is going on in the various kernel versions, it's
> >> >> apparent that the kworker behavior changed.  Both the priority and 
> >> >> quantity
> >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >> >>
> >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >> >>
> >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> >> Author: Tim Murray 
> >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >> >>
> >> >> dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >> >>
> >> >> Running dm-crypt with workqueues at the standard priority results 
> >> >> in IO
> >> >> competing for CPU time with standard user apps, which can lead to
> >> >> pipeline bubbles and seriously degraded performance.  Move to using
> >> >> WQ_HIGHPRI workqueues to protect against that.
> >> >>
> >> >> Signed-off-by: Tim Murray 
> >> >> Signed-off-by: Enric Balletbo i Serra 
> >> >> Signed-off-by: Mike Snitzer 
> >> >>
> >> >> ---
> >> >>
> >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> >> problem completely.
> >> >>
> >> >> Looking at the diff in that commit, it looks like the commit message 
> >> >> isn't
> >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> >> changed - they're also being made "CPU intensive" workqueues as well.
> >> >>
> >> >> This combination appears to result in both elevated scheduling priority 
> >> >> and
> >> >> greater quantity of participant worker threads effectively starving any
> >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >> >>
> >> >> I don't know what the right solution is here.  It seems to me we're 
> >> >> lacking
> >> >> the appropriate mechanism for charging CPU resources consumed on behalf 
> >> >> of
> >> >> user processes in kworker threads to the work-causing process.
> >> >>
> >> >> What effectively happens is my normal `git` user process is able to
> >> >> greatly amplify what share of CPU it takes from the system by 
> >> >> generating IO
> >> >> on what happens to be a high-priority CPU-intensive storage volume.
> >> >>
> >> >> It looks potentially complicated to fix properly, but I suspect at its 
> >> >> core
> >> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> >> asynchronous design.  Something that has been exacerbated substantially 
> >> >> by
> >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >> >>
> >> >> If we imagine the whole stack simplified, where all the IO was being 
> >> >> done
> >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> >> IO-causing process context, it would be getting charged to the calling
> >> >> process and scheduled accordingly.  The resource accounting and 
> >> >> scheduling
> >> >> problems all emerge with the page cache, buffered IO, and async 
> >> >> background
> >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> >> appears to me anyways...
> >> >>
> >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on 
> >> >> dmcrypt.
> >> >> The kernel .config is attached in case it's of interest.
> >> >>
> >> >> Thanks,
> >> >> Vito Caputo
> >> >
> >> >
> >> >
> >> > Ping...
> >> >
> >> > Could somebody please at least ACK receiving this so I'm not left 
> >> > wondering
> >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> >>
> >> Sorry I did not notice your email before you ping me directly. It's
> >> interesting that issue, though we didn't notice this problem. It's a
> >> bit far since I tested this patch but I'll setup the environment again
> >> and do more tests to understand better what is happening.

Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns

2018-01-24 Thread vcaputo
On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> 2018-01-17 23:48 GMT+01:00  :
> > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> >> Hi Vito,
> >>
> >> 2017-12-01 22:33 GMT+01:00  :
> >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcap...@pengaru.com wrote:
> >> >> Hello,
> >> >>
> >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> >> `cmus` while doing big and churny `git checkout` commands in my linux 
> >> >> git
> >> >> tree.
> >> >>
> >> >> It's not something I've done much of over the last couple months so I
> >> >> hadn't noticed until yesterday, but didn't remember this being a 
> >> >> problem in
> >> >> recent history.
> >> >>
> >> >> As there's quite an accumulation of similarly configured and built 
> >> >> kernels
> >> >> in my grub menu, it was trivial to determine approximately when this 
> >> >> began:
> >> >>
> >> >> 4.11.0: no dropouts
> >> >> 4.12.0-rc7: dropouts
> >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >> >>
> >> >> Watching top while this is going on in the various kernel versions, it's
> >> >> apparent that the kworker behavior changed.  Both the priority and 
> >> >> quantity
> >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >> >>
> >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >> >>
> >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> >> Author: Tim Murray 
> >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >> >>
> >> >> dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >> >>
> >> >> Running dm-crypt with workqueues at the standard priority results 
> >> >> in IO
> >> >> competing for CPU time with standard user apps, which can lead to
> >> >> pipeline bubbles and seriously degraded performance.  Move to using
> >> >> WQ_HIGHPRI workqueues to protect against that.
> >> >>
> >> >> Signed-off-by: Tim Murray 
> >> >> Signed-off-by: Enric Balletbo i Serra 
> >> >> Signed-off-by: Mike Snitzer 
> >> >>
> >> >> ---
> >> >>
> >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> >> problem completely.
> >> >>
> >> >> Looking at the diff in that commit, it looks like the commit message 
> >> >> isn't
> >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> >> changed - they're also being made "CPU intensive" workqueues as well.
> >> >>
> >> >> This combination appears to result in both elevated scheduling priority 
> >> >> and
> >> >> greater quantity of participant worker threads effectively starving any
> >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >> >>
> >> >> I don't know what the right solution is here.  It seems to me we're 
> >> >> lacking
> >> >> the appropriate mechanism for charging CPU resources consumed on behalf 
> >> >> of
> >> >> user processes in kworker threads to the work-causing process.
> >> >>
> >> >> What effectively happens is my normal `git` user process is able to
> >> >> greatly amplify what share of CPU it takes from the system by 
> >> >> generating IO
> >> >> on what happens to be a high-priority CPU-intensive storage volume.
> >> >>
> >> >> It looks potentially complicated to fix properly, but I suspect at its 
> >> >> core
> >> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> >> asynchronous design.  Something that has been exacerbated substantially 
> >> >> by
> >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >> >>
> >> >> If we imagine the whole stack simplified, where all the IO was being 
> >> >> done
> >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> >> IO-causing process context, it would be getting charged to the calling
> >> >> process and scheduled accordingly.  The resource accounting and 
> >> >> scheduling
> >> >> problems all emerge with the page cache, buffered IO, and async 
> >> >> background
> >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> >> appears to me anyways...
> >> >>
> >> >> The system used is an X61s ThinkPad (1.8GHz) with an 840 EVO SSD,
> >> >> LVM on dmcrypt.
> >> >> The kernel .config is attached in case it's of interest.
> >> >>
> >> >> Thanks,
> >> >> Vito Caputo
> >> >
> >> >
> >> >
> >> > Ping...
> >> >
> >> > Could somebody please at least ACK receiving this so I'm not left 
> >> > wondering
> >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> >>
> >> Sorry, I did not notice your email before you pinged me directly. It's
> >> an interesting issue, though we didn't notice this problem ourselves.
> >> It's been a while since I tested this patch, but I'll set up the
> >> environment again and do more tests to better understand what is
> >> happening.
> >>
> >
> > Any update on this?
> >
> 
> I did not reproduce the issue for now. Can you try what happens if you
> remove the 

Re: [PATCH] ARM: dts: use a correct at24 fallback for at91-nattis-2-natte-2

2018-01-24 Thread Bartosz Golaszewski
2018-01-24 23:12 GMT+01:00 Peter Rosin :
> Hi Bartosz,
>
> On 2018-01-24 22:34, Bartosz Golaszewski wrote:
>> We now require all at24 users to use the "atmel," fallback in
>> device tree for different manufacturers.
>
> I think my patch [3/4] from about a week ago was just a tiny bit
> better.
> https://lkml.org/lkml/2018/1/16/609
>
> Can we please pick that one directly instead of doing this partial
> step?
>
> Cheers,
> Peter

Hi Peter,

oh yes, sorry - I just grepped the entire code base from next for
invalid compatibles and didn't notice this was already fixed by you.
Let's drop it.

Thanks,
Bartosz



Re: [PATCH v2 4/4] RISC-V: Move to the new generic IRQ handler

2018-01-24 Thread Christoph Hellwig
On Wed, Jan 24, 2018 at 07:07:56PM -0800, Palmer Dabbelt wrote:
> The old mechanism for handling IRQs on RISC-V was pretty ugly: the arch
> code looked at the Kconfig entry for our first-level irqchip driver and
> called into it directly.
> 
> This patch uses the new generic IRQ handling infrastructure, which
> essentially just deletes a bunch of code.  This does add an additional
> load to the interrupt latency, but there's a lot of tuning left to be
> done there on RISC-V so I think it's OK for now.
> 
> Signed-off-by: Palmer Dabbelt 

Looks good,

Reviewed-by: Christoph Hellwig 



Re: [PATCH v2 1/4] arm: Make set_handle_irq and handle_arch_irq generic

2018-01-24 Thread Christoph Hellwig
On Wed, Jan 24, 2018 at 07:07:53PM -0800, Palmer Dabbelt wrote:
> It looks like this same irqchip registration mechanism has been copied
> into a handful of ports, including aarch64 and openrisc.  I want to use
> this in the RISC-V port, so I thought it would be good to make this
> generic instead.
> 
> This patch simply moves set_handle_irq and handle_arch_irq from arch/arm
> to kernel/irq/handle.c.

The two important changes here are that:

 a) the handle_arch_irq definition is moved from assembly to C code
 b) it is now marked __ro_after_init

Those should be prominently mentioned in the changelog, and for these
we probably need an explicit ACK from the ARM folks.



Re: [PATCH v2 3/4] openrisc: Use the new MULTI_IRQ_HANDLER

2018-01-24 Thread Christoph Hellwig
On Wed, Jan 24, 2018 at 07:07:55PM -0800, Palmer Dabbelt wrote:
> It appears that openrisc copied arm64's MULTI_IRQ_HANDLER code (which
> came from arm).  I wanted to make this generic so I could use it in the
> RISC-V port.  This patch converts the openrisc code to use the generic
> version.

Note that openrisc overrides previous handle_arch_irq assignments.
We'll need to know from the openrisc folks if that was intentional.

Otherwise this looks fine to me.



Re: [PATCH v2 2/4] arm64: Use the new MULTI_IRQ_HANDLER

2018-01-24 Thread Christoph Hellwig
Looks good,

Reviewed-by: Christoph Hellwig 



[PATCH] media: leds: as3645a: Add CONFIG_OF support

2018-01-24 Thread Akash Gajjar
From: Akash Gajjar 

With this change, the driver builds with CONFIG_OF support.

Signed-off-by: Akash Gajjar 
---
 drivers/media/i2c/as3645a.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/drivers/media/i2c/as3645a.c b/drivers/media/i2c/as3645a.c
index af5db71..24233fa 100644
--- a/drivers/media/i2c/as3645a.c
+++ b/drivers/media/i2c/as3645a.c
@@ -858,6 +858,14 @@ static int as3645a_remove(struct i2c_client *client)
 };
 MODULE_DEVICE_TABLE(i2c, as3645a_id_table);
 
+#if IS_ENABLED(CONFIG_OF)
+static const struct of_device_id as3645a_of_match[] = {
+   { .compatible = "ams,as3645a", },
+   { /* sentinel */ },
+};
+MODULE_DEVICE_TABLE(of, as3645a_of_match);
+#endif
+
 static const struct dev_pm_ops as3645a_pm_ops = {
.suspend = as3645a_suspend,
.resume = as3645a_resume,
@@ -867,6 +875,7 @@ static int as3645a_remove(struct i2c_client *client)
.driver = {
.name = AS3645A_NAME,
.pm   = _pm_ops,
+   .of_match_table = of_match_ptr(as3645a_of_match),
},
.probe  = as3645a_probe,
.remove = as3645a_remove,
-- 
1.9.1




[PATCH net-next] ptr_ring: fix integer overflow

2018-01-24 Thread Jason Wang
We try to allocate one more entry for lockless peeking. The addition
may overflow, which causes zero to be passed to kmalloc(). In this
case kmalloc() returns ZERO_SIZE_PTR, which ptr_ring does not notice.
Attempting to produce or consume on such a ring then leads to a NULL
pointer dereference. Fix this by detecting the overflow and failing
early.

Fixes: bcecb4bbf88a ("net: ptr_ring: otherwise safe empty checks can overrun 
array bounds")
Reported-by: syzbot+87678bcf753b44c39...@syzkaller.appspotmail.com
Cc: John Fastabend 
Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 9ca1726..3f99484 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -453,6 +453,8 @@ static inline int ptr_ring_consume_batched_bh(struct 
ptr_ring *r,
 
 static inline void **__ptr_ring_init_queue_alloc(unsigned int size, gfp_t gfp)
 {
+   if (unlikely(size + 1 == 0))
+   return NULL;
/* Allocate an extra dummy element at end of ring to avoid consumer head
 * or produce head access past the end of the array. Possible when
 * producer/consumer operations and __ptr_ring_peek operations run in
-- 
2.7.4





Re: [PATCH RFC 01/16] prcu: Add PRCU implementation

2018-01-24 Thread Boqun Feng
On Wed, Jan 24, 2018 at 10:16:18PM -0800, Paul E. McKenney wrote:
> On Tue, Jan 23, 2018 at 03:59:26PM +0800, liangli...@huawei.com wrote:
> > From: Heng Zhang 
> > 
> > This RCU implementation (PRCU) is based on a fast consensus protocol
> > published in the following paper:
> > 
> > Fast Consensus Using Bounded Staleness for Scalable Read-mostly 
> > Synchronization.
> > Haibo Chen, Heng Zhang, Ran Liu, Binyu Zang, and Haibing Guan.
> > IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016.
> > https://dl.acm.org/citation.cfm?id=3024114.3024143
> > 
> > Signed-off-by: Heng Zhang 
> > Signed-off-by: Lihao Liang 
> 
> A few comments and questions interspersed.
> 
>   Thanx, Paul
> 
> > ---
> >  include/linux/prcu.h |  37 +++
> >  kernel/rcu/Makefile  |   2 +-
> >  kernel/rcu/prcu.c| 125 
> > +++
> >  kernel/sched/core.c  |   2 +
> >  4 files changed, 165 insertions(+), 1 deletion(-)
> >  create mode 100644 include/linux/prcu.h
> >  create mode 100644 kernel/rcu/prcu.c
> > 
> > diff --git a/include/linux/prcu.h b/include/linux/prcu.h
> > new file mode 100644
> > index ..653b4633
> > --- /dev/null
> > +++ b/include/linux/prcu.h
> > @@ -0,0 +1,37 @@
> > +#ifndef __LINUX_PRCU_H
> > +#define __LINUX_PRCU_H
> > +
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define CONFIG_PRCU
> > +
> > +struct prcu_local_struct {
> > +   unsigned int locked;
> > +   unsigned int online;
> > +   unsigned long long version;
> > +};
> > +
> > +struct prcu_struct {
> > +   atomic64_t global_version;
> > +   atomic_t active_ctr;
> > +   struct mutex mtx;
> > +   wait_queue_head_t wait_q;
> > +};
> > +
> > +#ifdef CONFIG_PRCU
> > +void prcu_read_lock(void);
> > +void prcu_read_unlock(void);
> > +void synchronize_prcu(void);
> > +void prcu_note_context_switch(void);
> > +
> > +#else /* #ifdef CONFIG_PRCU */
> > +
> > +#define prcu_read_lock() do {} while (0)
> > +#define prcu_read_unlock() do {} while (0)
> > +#define synchronize_prcu() do {} while (0)
> > +#define prcu_note_context_switch() do {} while (0)
> 
> If CONFIG_PRCU=n and some code is built that uses PRCU, shouldn't you
> get a build error rather than an error-free but inoperative PRCU?
> 
> Of course, Peter's question about purpose of the patch set applies
> here as well.
> 
> > +
> > +#endif /* #ifdef CONFIG_PRCU */
> > +#endif /* __LINUX_PRCU_H */
> > diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> > index 23803c7d..8791419c 100644
> > --- a/kernel/rcu/Makefile
> > +++ b/kernel/rcu/Makefile
> > @@ -2,7 +2,7 @@
> >  # and is generally not a function of system call inputs.
> >  KCOV_INSTRUMENT := n
> > 
> > -obj-y += update.o sync.o
> > +obj-y += update.o sync.o prcu.o
> >  obj-$(CONFIG_CLASSIC_SRCU) += srcu.o
> >  obj-$(CONFIG_TREE_SRCU) += srcutree.o
> >  obj-$(CONFIG_TINY_SRCU) += srcutiny.o
> > diff --git a/kernel/rcu/prcu.c b/kernel/rcu/prcu.c
> > new file mode 100644
> > index ..a00b9420
> > --- /dev/null
> > +++ b/kernel/rcu/prcu.c
> > @@ -0,0 +1,125 @@
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#include 
> > +
> > +DEFINE_PER_CPU_SHARED_ALIGNED(struct prcu_local_struct, prcu_local);
> > +
> > +struct prcu_struct global_prcu = {
> > +   .global_version = ATOMIC64_INIT(0),
> > +   .active_ctr = ATOMIC_INIT(0),
> > +   .mtx = __MUTEX_INITIALIZER(global_prcu.mtx),
> > +   .wait_q = __WAIT_QUEUE_HEAD_INITIALIZER(global_prcu.wait_q)
> > +};
> > +struct prcu_struct *prcu = _prcu;
> > +
> > +static inline void prcu_report(struct prcu_local_struct *local)
> > +{
> > +   unsigned long long global_version;
> > +   unsigned long long local_version;
> > +
> > +   global_version = atomic64_read(>global_version);
> > +   local_version = local->version;
> > +   if (global_version > local_version)
> > +   cmpxchg(>version, local_version, global_version);
> > +}
> > +
> > +void prcu_read_lock(void)
> > +{
> > +   struct prcu_local_struct *local;
> > +
> > +   local = get_cpu_ptr(_local);
> > +   if (!local->online) {
> > +   WRITE_ONCE(local->online, 1);
> > +   smp_mb();
> > +   }
> > +
> > +   local->locked++;
> > +   put_cpu_ptr(_local);
> > +}
> > +EXPORT_SYMBOL(prcu_read_lock);
> > +
> > +void prcu_read_unlock(void)
> > +{
> > +   int locked;
> > +   struct prcu_local_struct *local;
> > +
> > +   barrier();
> > +   local = get_cpu_ptr(_local);
> > +   locked = local->locked;
> > +   if (locked) {
> > +   local->locked--;
> > +   if (locked == 1)
> > +   prcu_report(local);
> 
> Is ordering important here?  It looks to me that the compiler could
> rearrange some of the accesses within prcu_report() with the local->locked
> decrement.  There appears to be some potential for load and store tearing,
> though perhaps you have verified that your compiler avoids this on
> the architecture that you are using.

[PATCH v2 2/2] free_pcppages_bulk: prefetch buddy while not holding lock

2018-01-24 Thread Aaron Lu
When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge. This requires accessing buddy's
page structure and that access could take a long time if it's cache cold.

This patch adds a prefetch of the to-be-freed page's buddy outside of
zone->lock, in the hope that accessing the buddy's page structure later
under zone->lock will be faster. Since we *always* do buddy merging and check
an order-0 page's buddy to try to merge it when it goes into the main
allocator, the cacheline will always come in, i.e. the prefetched data
will never be unused.

In the meantime, there is the concern of a prefetch potentially evicting
existing cachelines. This can be true for L1D cache since it is not huge.
Considering the prefetch instruction used is prefetchnta, which will only
store the data in L2 for "Pentium 4 and Intel Xeon processors" according
to Intel's "Instruction Set Manual", it is not likely to cause
cache pollution. Other architectures may have this cache pollution problem
though.

There is also some additional instruction overhead, namely calculating
buddy pfn twice. Since the calculation is a XOR on two local variables,
it's expected in many cases that cycles spent will be offset by reduced
memory latency later. This is especially true for NUMA machines where multiple
CPUs are contending on zone->lock and the most time consuming part under
zone->lock is the wait of 'struct page' cacheline of the to-be-freed pages
and their buddies.

Test with will-it-scale/page_fault1 full load:

kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
v4.15-rc4   90373328000124   13642741   15728686
patch1/29608786 +6.3%  8368915 +4.6% 14042169 +2.9% 17433559 +10.8%
this patch 10462292 +8.9%  8602889 +2.8% 14802073 +5.4% 17624575 +1.1%

Note: this patch's performance improvement percent is against patch1/2.

Please also note the actual benefit of this patch will be workload/CPU
dependent.

[changelog stolen from Dave Hansen's and Mel Gorman's comments]
https://lkml.org/lkml/2018/1/24/551
Suggested-by: Ying Huang 
Signed-off-by: Aaron Lu 
---
v2:
update changelog according to Dave Hansen and Mel Gorman's comments.
Add more comments in code to explain why prefetch is done as requested
by Mel Gorman.

 mm/page_alloc.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c9e5ded39b16..6566a4b5b124 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1138,6 +1138,9 @@ static void free_pcppages_bulk(struct zone *zone, int 
count,
batch_free = count;
 
do {
+   unsigned long pfn, buddy_pfn;
+   struct page *buddy;
+
page = list_last_entry(list, struct page, lru);
/* must delete as __free_one_page list manipulates */
list_del(>lru);
@@ -1146,6 +1149,18 @@ static void free_pcppages_bulk(struct zone *zone, int 
count,
continue;
 
list_add_tail(>lru, );
+
+   /*
+* We are going to put the page back to the global
+* pool, prefetch its buddy to speed up later access
+* under zone->lock. It is believed the overhead of
+* calculating buddy_pfn here can be offset by reduced
+* memory latency later.
+*/
+   pfn = page_to_pfn(page);
+   buddy_pfn = __find_buddy_pfn(pfn, 0);
+   buddy = page + (buddy_pfn - pfn);
+   prefetch(buddy);
} while (--count && --batch_free && !list_empty(list));
}
 
-- 
2.14.3




linux-next: build failure after merge of the rdma tree

2018-01-24 Thread Stephen Rothwell
Hi all,

After merging the rdma tree, today's linux-next build (x86_64
allmodconfig) failed like this:

ERROR: "init_rcu_head" [drivers/infiniband/ulp/srpt/ib_srpt.ko] undefined!

Caused by commit

  a11253142e6d ("IB/srpt: Rework multi-channel support")

I have used the rdma tree from next-20180119 for today.

-- 
Cheers,
Stephen Rothwell



[PATCH v2 1/2] free_pcppages_bulk: do not hold lock when picking pages to free

2018-01-24 Thread Aaron Lu
When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
the zone->lock is held and then pages are chosen from PCP's migratetype
list. There is actually no need to do this 'choose' part under the
lock, since these are PCP pages: the only CPU that can touch them is
us, and irqs are also disabled.

Moving this part outside could reduce lock held time and improve
performance. Test with will-it-scale/page_fault1 full load:

kernel  Broadwell(2S)  Skylake(2S)   Broadwell(4S)  Skylake(4S)
v4.15-rc4   90373328000124   13642741   15728686
this patch  9608786 +6.3%  8368915 +4.6% 14042169 +2.9% 17433559 +10.8%

What the test does is: starts $nr_cpu processes and each will repeatedly
do the following for 5 minutes:
1 mmap 128M anonymous space;
2 write access to that space;
3 munmap.
The score is the aggregated iteration count.

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c

Acked-by: Mel Gorman 
Signed-off-by: Aaron Lu 
---
v2: use LIST_HEAD(head) as suggested by Mel Gorman.

 mm/page_alloc.c | 33 ++---
 1 file changed, 18 insertions(+), 15 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4093728f292e..c9e5ded39b16 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1113,12 +1113,10 @@ static void free_pcppages_bulk(struct zone *zone, int 
count,
int migratetype = 0;
int batch_free = 0;
bool isolated_pageblocks;
-
-   spin_lock(>lock);
-   isolated_pageblocks = has_isolate_pageblock(zone);
+   struct page *page, *tmp;
+   LIST_HEAD(head);
 
while (count) {
-   struct page *page;
struct list_head *list;
 
/*
@@ -1140,26 +1138,31 @@ static void free_pcppages_bulk(struct zone *zone, int count,
batch_free = count;
 
do {
-   int mt; /* migratetype of the to-be-freed page */
-
page = list_last_entry(list, struct page, lru);
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
 
-   mt = get_pcppage_migratetype(page);
-   /* MIGRATE_ISOLATE page should not go to pcplists */
-   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
-   /* Pageblock could have been isolated meanwhile */
-   if (unlikely(isolated_pageblocks))
-   mt = get_pageblock_migratetype(page);
-
if (bulkfree_pcp_prepare(page))
continue;
 
-   __free_one_page(page, page_to_pfn(page), zone, 0, mt);
-   trace_mm_page_pcpu_drain(page, 0, mt);
+   list_add_tail(&page->lru, &head);
} while (--count && --batch_free && !list_empty(list));
}
+
+   spin_lock(&zone->lock);
+   isolated_pageblocks = has_isolate_pageblock(zone);
+
+   list_for_each_entry_safe(page, tmp, &head, lru) {
+   int mt = get_pcppage_migratetype(page);
+   /* MIGRATE_ISOLATE page should not go to pcplists */
+   VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
+   /* Pageblock could have been isolated meanwhile */
+   if (unlikely(isolated_pageblocks))
+   mt = get_pageblock_migratetype(page);
+
+   __free_one_page(page, page_to_pfn(page), zone, 0, mt);
+   trace_mm_page_pcpu_drain(page, 0, mt);
+   }
spin_unlock(&zone->lock);
 }
 
-- 
2.14.3






Re: [PATCH v4 02/10] asm/nospec, array_ptr: sanitize speculative array de-references

2018-01-24 Thread Cyril Novikov

On 1/18/2018 4:01 PM, Dan Williams wrote:

'array_ptr' is proposed as a generic mechanism to mitigate against
Spectre-variant-1 attacks (i.e., an attack that bypasses boundary checks
via speculative execution). The 'array_ptr' implementation is expected
to be safe for current generation cpus across multiple architectures
(ARM, x86).


I'm an outside reviewer, not subscribed to the list, so forgive me if I 
do something not according to protocol. I have the following comments on 
this change:


After discarding the speculation barrier variant, is array_ptr() needed 
at all? You could have a simpler sanitizing macro, say


#define array_sanitize_idx(idx, sz) ((idx) & array_ptr_mask((idx), (sz)))

(adjusted to not evaluate idx twice). And use it as follows:

if (idx < array_size) {
idx = array_sanitize_idx(idx, array_size);
do_something(array[idx]);
}

If I understand the speculation stuff correctly, unlike array_ptr(), 
this "leaks" array[0] rather than nothing (*NULL) when executed 
speculatively. However, it's still much better than leaking an arbitrary 
location in memory. The attacker can likely get array[0] "leaked" by 
passing 0 as idx anyway.



+/*
+ * If idx is negative or if idx > size then bit 63 is set in the mask,
+ * and the value of ~(-1L) is zero. When the mask is zero, bounds check
+ * failed, array_ptr will return NULL.
+ */
+#ifndef array_ptr_mask
+static inline unsigned long array_ptr_mask(unsigned long idx, unsigned long sz)
+{
+   return ~(long)(idx | (sz - 1 - idx)) >> (BITS_PER_LONG - 1);
+}
+#endif


Why does this have to resort to the undefined behavior of shifting a 
negative number to the right? You can do without it:


return ((idx | (sz - 1 - idx)) >> (BITS_PER_LONG - 1)) - 1;

Of course, you could argue that subtracting 1 from 0 to get all ones is 
also an undefined behavior, but it's still much better than the shift, 
isn't it?



+#define array_ptr(base, idx, sz)   \
+({ \
+   union { typeof(*(base)) *_ptr; unsigned long _bit; } __u;   \
+   typeof(*(base)) *_arr = (base); \
+   unsigned long _i = (idx);   \
+   unsigned long _mask = array_ptr_mask(_i, (sz)); \
+   \
+   __u._ptr = _arr + (_i & _mask); \
+   __u._bit &= _mask;  \
+   __u._ptr;   \
+})


Call me paranoid, but I think this may actually create an exploitable 
bug on 32-bit systems due to casting the index to an unsigned long, if 
the index as it comes from userland is a 64-bit value. You have 
*replaced* the "if (idx < array_size)" check with checking if 
array_ptr() returns NULL. Well, it doesn't return NULL if the low 32 
bits of the index are in-bounds, but the high 32 bits are not zero. 
Apart from the return value pointing to the wrong place, the subsequent 
code may then assume that the 64-bit idx is actually valid and trip on 
it badly.


--
Cyril




Re: [RFC PATCH v2 1/1] of: introduce event tracepoints for dynamic device_node lifecyle

2018-01-24 Thread Frank Rowand
On 01/24/18 22:48, Frank Rowand wrote:
> On 01/21/18 06:31, Wolfram Sang wrote:
>> From: Tyrel Datwyler 
>>
>> This patch introduces event tracepoints for tracking a device_node's
>> reference cycle as well as reconfig notifications generated in response
>> to node/property manipulations.
>>
>> With the recent upstreaming of the refcount API several device_node
>> underflows and leaks have come to my attention in the pseries (DLPAR)
>> dynamic logical partitioning code (ie. POWER speak for hotplugging
>> virtual and physical resources at runtime such as cpus or IOAs). These
>> tracepoints provide an easy and quick mechanism for validating the
>> reference counting of device_nodes during their lifetime.
>>
>> Further, when pseries lpars are migrated to a different machine we
>> perform a live update of our device tree to bring it into alignment with
>> the configuration of the new machine. The of_reconfig_notify trace point
>> provides a mechanism that can be turned on for debugging the device tree
>> modifications without having to build a custom kernel to get at the
>> DEBUG code introduced by commit 00aa37206e1a54 ("of/reconfig: Add debug
>> output for OF_RECONFIG notifiers").
>>
>> The following trace events are provided: of_node_get, of_node_put,
>> of_node_release, and of_reconfig_notify. These trace points require a
> 
> Please add a note that the of_reconfig_notify trace event is not an
> added bit of debug info, but is instead replacing information that
> was previously available via pr_debug() when DEBUG was defined.

I got a little carried away; "when DEBUG was defined" is extra,
unneeded detail for the commit message.


> 
> 
>> kernel built with ftrace support to be enabled. In a typical environment
>> where debugfs is mounted at /sys/kernel/debug the entire set of
>> tracepoints can be set with the following:
>>
>>   echo "of:*" > /sys/kernel/debug/tracing/set_event
>>
>> or
>>
>>   echo 1 > /sys/kernel/debug/tracing/events/of/enable
>>
>> The following shows the trace point data from a DLPAR remove of a cpu
>> from a pseries lpar:
>>
>> cat /sys/kernel/debug/tracing/trace | grep "POWER8@10"
>>
>> cpuhp/23-147   [023]    128.324827:
>> of_node_put: refcount=5, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324829:
>> of_node_put: refcount=4, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324829:
>> of_node_put: refcount=3, dn->full_name=/cpus/PowerPC,POWER8@10
>> cpuhp/23-147   [023]    128.324831:
>> of_node_put: refcount=2, dn->full_name=/cpus/PowerPC,POWER8@10
>>drmgr-7284  [009]    128.439000:
>> of_node_put: refcount=1, dn->full_name=/cpus/PowerPC,POWER8@10
>>drmgr-7284  [009]    128.439002:
>> of_reconfig_notify: action=DETACH_NODE, 
>> dn->full_name=/cpus/PowerPC,POWER8@10,
>> prop->name=null, old_prop->name=null
>>drmgr-7284  [009]    128.439015:
>> of_node_put: refcount=0, dn->full_name=/cpus/PowerPC,POWER8@10
>>drmgr-7284  [009]    128.439016:
>> of_node_release: dn->full_name=/cpus/PowerPC,POWER8@10, dn->_flags=4
>>
>> Signed-off-by: Tyrel Datwyler 
> 
> The following belongs in a list of version 2 changes, below the "---" line:
> 
>> [wsa: fixed commit abbrev and one of the sysfs paths in commit desc,
>> removed trailing space and fixed pointer declaration in code]
> 
>> Signed-off-by: Wolfram Sang 
>> ---
>>  drivers/of/dynamic.c  | 32 ++--
>>  include/trace/events/of.h | 93 
>> +++
>>  2 files changed, 105 insertions(+), 20 deletions(-)
>>  create mode 100644 include/trace/events/of.h
> 
> mode looks incorrect.  Existing files in include/trace/events/ are -rw-rw
> 
> 
>> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
>> index ab988d88704da0..b0d6ab5a35b8c6 100644
>> --- a/drivers/of/dynamic.c
>> +++ b/drivers/of/dynamic.c
>> @@ -21,6 +21,9 @@ static struct device_node *kobj_to_device_node(struct 
>> kobject *kobj)
>>  return container_of(kobj, struct device_node, kobj);
>>  }
>>  
>> +#define CREATE_TRACE_POINTS
>> +#include 
>> +
>>  /**
>>   * of_node_get() - Increment refcount of a node
>>   * @node:   Node to inc refcount, NULL is supported to simplify writing of
>> @@ -30,8 +33,10 @@ static struct device_node *kobj_to_device_node(struct 
>> kobject *kobj)
>>   */
>>  struct device_node *of_node_get(struct device_node *node)
>>  {
>> -if (node)
>> +if (node) {
>>  kobject_get(&node->kobj);
>> +trace_of_node_get(refcount_read(&node->kobj.kref.refcount), 
>> node->full_name);
> 
> See the comment from Ron that I mentioned in my previous email.
   
   Rob, darn it.


> Also, the path has been removed from node->full_name.  Does using it here
still give all of the information that is desired?  Same for all other uses
> of full_name in this patch.
> 
> The trace 

[RFC PATCH V4 4/5] workqueue: convert ->nice to ->sched_attr

2018-01-24 Thread Wen Yang
The new /sys interface looks like this:
#cat /sys/devices/virtual/workqueue/writeback/sched_attr
policy=0 prio=0 nice=0

# echo "policy=0 prio=0 nice=-1" > 
/sys/devices/virtual/workqueue/writeback/sched_attr
# cat /sys/devices/virtual/workqueue/writeback/sched_attr
policy=0 prio=0 nice=-1

Also, the possibility of specifying more than just a priority
for the wq may be useful for a wide variety of applications.

Signed-off-by: Wen Yang 
Signed-off-by: Jiang Biao 
Signed-off-by: Tan Hu 
Suggested-by: Tejun Heo 
Cc: Tejun Heo 
Cc: Lai Jiangshan 
Cc: kernel test robot 
Cc: linux-kernel@vger.kernel.org
---
 include/linux/workqueue.h |   5 --
 kernel/workqueue.c| 130 --
 2 files changed, 79 insertions(+), 56 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 9faaade..d9d0f36 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -128,11 +128,6 @@ struct delayed_work {
  */
 struct workqueue_attrs {
/**
-* @nice: nice level
-*/
-   int nice;
-
-   /**
 * @sched_attr: kworker's scheduling parameters
 */
struct sched_attr sched_attr;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index e1613d0..8c5aba5 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1773,7 +1773,7 @@ static struct worker *create_worker(struct worker_pool *pool)
 
if (pool->cpu >= 0)
snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
-pool->attrs->nice < 0  ? "H" : "");
+pool->attrs->sched_attr.sched_nice < 0  ? "H" : "");
else
snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);
 
@@ -1782,7 +1782,7 @@ static struct worker *create_worker(struct worker_pool *pool)
if (IS_ERR(worker->task))
goto fail;
 
-   set_user_nice(worker->task, pool->attrs->nice);
+   set_user_nice(worker->task, pool->attrs->sched_attr.sched_nice);
kthread_bind_mask(worker->task, pool->attrs->cpumask);
 
/* successful, attach the worker to the pool */
@@ -3179,7 +3179,6 @@ static void copy_sched_attr(struct sched_attr *to,
 static void copy_workqueue_attrs(struct workqueue_attrs *to,
 const struct workqueue_attrs *from)
 {
-   to->nice = from->nice;
copy_sched_attr(&to->sched_attr, &from->sched_attr);
cpumask_copy(to->cpumask, from->cpumask);
/*
@@ -3195,17 +3194,29 @@ static u32 wqattrs_hash(const struct workqueue_attrs *attrs)
 {
u32 hash = 0;
 
-   hash = jhash_1word(attrs->nice, hash);
+   hash = jhash_1word(attrs->sched_attr.sched_nice, hash);
hash = jhash(cpumask_bits(attrs->cpumask),
 BITS_TO_LONGS(nr_cpumask_bits) * sizeof(long), hash);
return hash;
 }
 
+static bool sched_attr_equal(const struct sched_attr *a,
+   const struct sched_attr *b)
+{
+   if (a->sched_policy != b->sched_policy)
+   return false;
+   if (a->sched_priority != b->sched_priority)
+   return false;
+   if (a->sched_nice != b->sched_nice)
+   return false;
+   return true;
+}
+
 /* content equality test */
 static bool wqattrs_equal(const struct workqueue_attrs *a,
  const struct workqueue_attrs *b)
 {
-   if (a->nice != b->nice)
+   if (a->sched_attr.sched_nice != b->sched_attr.sched_nice)
return false;
if (!cpumask_equal(a->cpumask, b->cpumask))
return false;
@@ -3259,8 +3270,6 @@ static void rcu_free_wq(struct rcu_head *rcu)
 
if (!(wq->flags & WQ_UNBOUND))
free_percpu(wq->cpu_pwqs);
-   else
-   free_workqueue_attrs(wq->unbound_attrs);
free_workqueue_attrs(wq->attrs);
kfree(wq->rescuer);
kfree(wq);
@@ -4353,7 +4362,8 @@ static void pr_cont_pool_info(struct worker_pool *pool)
pr_cont(" cpus=%*pbl", nr_cpumask_bits, pool->attrs->cpumask);
if (pool->node != NUMA_NO_NODE)
pr_cont(" node=%d", pool->node);
-   pr_cont(" flags=0x%x nice=%d", pool->flags, pool->attrs->nice);
+   pr_cont(" flags=0x%x nice=%d", pool->flags,
+   pool->attrs->sched_attr.sched_nice);
 }
 
 static void pr_cont_work(bool comma, struct work_struct *work)
@@ -5074,7 +5084,64 @@ static ssize_t sched_attr_show(struct device *dev,
return written;
 }
 
-static DEVICE_ATTR_RO(sched_attr);
+static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq);
+
+static int wq_set_unbound_sched_attr(struct workqueue_struct *wq,
+   const struct sched_attr *new)
+{
+   struct workqueue_attrs *attrs;
+   int ret = -ENOMEM;
+
+   apply_wqattrs_lock();
+   attrs = wq_sysfs_prep_attrs(wq);
+   if (!attrs)
+   goto out_unlock;
+   

[RFC PATCH V4 5/5] workqueue: introduce a way to set workqueue's scheduler

2018-01-24 Thread Wen Yang
When pinning RT threads to specific cores using CPU affinity, the
kworkers on the same CPU can starve, which may lead to a kind of
priority inversion. In that case, the RT threads themselves also
suffer a heavy performance impact.

The priority inversion looks like,
CPU 0:  libvirtd acquired cgroup_mutex, and triggered
lru_add_drain_per_cpu, then waiting for all the kworkers to complete:
PID: 44145  TASK: 8807bec7b980  CPU: 0   COMMAND: "libvirtd"
#0 [8807f2cbb9d0] __schedule at 816410ed
#1 [8807f2cbba38] schedule at 81641789
#2 [8807f2cbba48] schedule_timeout at 8163f479
#3 [8807f2cbbaf8] wait_for_completion at 81641b56
#4 [8807f2cbbb58] flush_work at 8109efdc
#5 [8807f2cbbbd0] lru_add_drain_all at 81179002
#6 [8807f2cbbc08] migrate_prep at 811c77be
#7 [8807f2cbbc18] do_migrate_pages at 811b8010
#8 [8807f2cbbcf8] cpuset_migrate_mm at 810fea6c
#9 [8807f2cbbd10] cpuset_attach at 810ff91e
#10 [8807f2cbbd50] cgroup_attach_task at 810f9972
#11 [8807f2cbbe08] attach_task_by_pid at 810fa520
#12 [8807f2cbbe58] cgroup_tasks_write at 810fa593
#13 [8807f2cbbe68] cgroup_file_write at 810f8773
#14 [8807f2cbbef8] vfs_write at 811dfdfd
#15 [8807f2cbbf38] sys_write at 811e089f
#16 [8807f2cbbf80] system_call_fastpath at 8164c809

CPU 43: kworker/43 starved because of the RT threads:
CURRENT: PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
RT PRIO_ARRAY: 883fff3f4950
[ 79] PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
[ 79] PID: 21295  TASK: 88276d481700  COMMAND: "ovdk-ovsvswitch"
[ 79] PID: 21351  TASK: 8807be822280  COMMAND: "dispatcher"
[ 79] PID: 21129  TASK: 8807bef0f300  COMMAND: "ovdk-ovsvswitch"
[ 79] PID: 21337  TASK: 88276d482e00  COMMAND: "handler_3"
[ 79] PID: 21352  TASK: 8807be824500  COMMAND: "flow_dumper"
[ 79] PID: 21336  TASK: 88276d480b80  COMMAND: "handler_2"
[ 79] PID: 21342  TASK: 88276d484500  COMMAND: "handler_8"
[ 79] PID: 21341  TASK: 88276d482280  COMMAND: "handler_7"
[ 79] PID: 21338  TASK: 88276d483980  COMMAND: "handler_4"
[ 79] PID: 21339  TASK: 88276d48  COMMAND: "handler_5"
[ 79] PID: 21340  TASK: 88276d486780  COMMAND: "handler_6"
CFS RB_ROOT: 883fff3f4868
[120] PID: 37959  TASK: 88276e148000  COMMAND: "kworker/43:1"

CPU 28: Systemd(Victim) was blocked by cgroup_mutex:
PID: 1  TASK: 883fd2d4  CPU: 28  COMMAND: "systemd"
#0 [881fd317bd60] __schedule at 816410ed
#1 [881fd317bdc8] schedule_preempt_disabled at 81642869
#2 [881fd317bdd8] __mutex_lock_slowpath at 81640565
#3 [881fd317be38] mutex_lock at 8163f9cf
#4 [881fd317be50] proc_cgroup_show at 810fd256
#5 [881fd317be98] seq_read at 81203cda
#6 [881fd317bf08] vfs_read at 811dfc6c
#7 [881fd317bf38] sys_read at 811e07bf
#8 [881fd317bf80] system_call_fastpath at 81

The simplest way to fix that is to set the scheduler of the kworkers
to a higher RT priority, e.g.:
chrt --fifo -p 61 
However, that cannot prevent other WORK_CPU_BOUND worker threads from
running and starving.

This patch introduces a way to set the scheduler (policy and priority)
of a percpu worker_pool, so that users can set a proper scheduling
policy and priority for the worker_pool as needed, which applies to
all the WORK_CPU_BOUND workers on the same CPU. On the other hand,
/sys/devices/virtual/workqueue/cpumask can be used for
WORK_CPU_UNBOUND workers to prevent them from starving.

Tejun Heo suggested:
"* Add scheduler type to wq_attrs so that unbound workqueues can be
 configured.

* Rename system_wq's wq->name from "events" to "system_percpu", and
 similarly for the similarly named workqueues.

* Enable wq_attrs (only the applicable part should show up in the
 interface) for system_percpu and system_percpu_highpri, and use that
 to change the attributes of the percpu pools."

This patch implements the basic infrastructure and /sys interface,
such as:
# cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=0 prio=0 nice=0
# echo "policy=1 prio=1 nice=0" > 
/sys/devices/virtual/workqueue/system_percpu/sched_attr
# cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=1 prio=1 nice=0
# cat  /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
policy=0 prio=0 nice=-20
# echo "policy=1 prio=2 nice=0" > 

[RFC PATCH V4 5/5] workqueue: introduce a way to set workqueue's scheduler

2018-01-24 Thread Wen Yang
When pinning RT threads to specific cores using CPU affinity, the
kworkers on the same CPU would starve, which may lead to some kind
of priority inversion. In that case, the RT threads would also
suffer high performance impact.

The priority inversion looks like,
CPU 0:  libvirtd acquired cgroup_mutex, and triggered
lru_add_drain_per_cpu, then waiting for all the kworkers to complete:
PID: 44145  TASK: 8807bec7b980  CPU: 0   COMMAND: "libvirtd"
#0 [8807f2cbb9d0] __schedule at 816410ed
#1 [8807f2cbba38] schedule at 81641789
#2 [8807f2cbba48] schedule_timeout at 8163f479
#3 [8807f2cbbaf8] wait_for_completion at 81641b56
#4 [8807f2cbbb58] flush_work at 8109efdc
#5 [8807f2cbbbd0] lru_add_drain_all at 81179002
#6 [8807f2cbbc08] migrate_prep at 811c77be
#7 [8807f2cbbc18] do_migrate_pages at 811b8010
#8 [8807f2cbbcf8] cpuset_migrate_mm at 810fea6c
#9 [8807f2cbbd10] cpuset_attach at 810ff91e
#10 [8807f2cbbd50] cgroup_attach_task at 810f9972
#11 [8807f2cbbe08] attach_task_by_pid at 810fa520
#12 [8807f2cbbe58] cgroup_tasks_write at 810fa593
#13 [8807f2cbbe68] cgroup_file_write at 810f8773
#14 [8807f2cbbef8] vfs_write at 811dfdfd
#15 [8807f2cbbf38] sys_write at 811e089f
#16 [8807f2cbbf80] system_call_fastpath at 8164c809

CPU 43: kworker/43 starved because of the RT threads:
CURRENT: PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
RT PRIO_ARRAY: 883fff3f4950
[ 79] PID: 21294  TASK: 883fd2d45080  COMMAND: "lwip"
[ 79] PID: 21295  TASK: 88276d481700  COMMAND: "ovdk-ovsvswitch"
[ 79] PID: 21351  TASK: 8807be822280  COMMAND: "dispatcher"
[ 79] PID: 21129  TASK: 8807bef0f300  COMMAND: "ovdk-ovsvswitch"
[ 79] PID: 21337  TASK: 88276d482e00  COMMAND: "handler_3"
[ 79] PID: 21352  TASK: 8807be824500  COMMAND: "flow_dumper"
[ 79] PID: 21336  TASK: 88276d480b80  COMMAND: "handler_2"
[ 79] PID: 21342  TASK: 88276d484500  COMMAND: "handler_8"
[ 79] PID: 21341  TASK: 88276d482280  COMMAND: "handler_7"
[ 79] PID: 21338  TASK: 88276d483980  COMMAND: "handler_4"
[ 79] PID: 21339  TASK: 88276d48  COMMAND: "handler_5"
[ 79] PID: 21340  TASK: 88276d486780  COMMAND: "handler_6"
CFS RB_ROOT: 883fff3f4868
[120] PID: 37959  TASK: 88276e148000  COMMAND: "kworker/43:1"

CPU 28: Systemd(Victim) was blocked by cgroup_mutex:
PID: 1  TASK: 883fd2d4  CPU: 28  COMMAND: "systemd"
#0 [881fd317bd60] __schedule at 816410ed
#1 [881fd317bdc8] schedule_preempt_disabled at 81642869
#2 [881fd317bdd8] __mutex_lock_slowpath at 81640565
#3 [881fd317be38] mutex_lock at 8163f9cf
#4 [881fd317be50] proc_cgroup_show at 810fd256
#5 [881fd317be98] seq_read at 81203cda
#6 [881fd317bf08] vfs_read at 811dfc6c
#7 [881fd317bf38] sys_read at 811e07bf
#8 [881fd317bf80] system_call_fastpath at 81

The simplest way to fix that is to set the scheduler of kworkers to
higher RT priority, just like,
chrt --fifo -p 61 
However, that cannot prevent other WORK_CPU_BOUND worker threads from
running and starving.
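As a quick illustration of the workaround, the current policy of a kworker can be inspected with chrt before changing it (the PID selection and the priority 61 below are examples; raising a thread to SCHED_FIFO requires root):

```shell
# Inspect the scheduling policy of one kworker thread (no root needed).
pid=$(pgrep -o kworker)
chrt -p "$pid"
# The workaround described above would then be applied with (root required):
#   chrt --fifo -p 61 "$pid"
```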

This patch introduces a way to set the scheduler (policy and priority)
of a percpu worker_pool. In that way, the user can set a proper scheduler
policy and priority for the worker_pool as needed, which applies to all
the WORK_CPU_BOUND workers on the same CPU. On the other hand,
/sys/devices/virtual/workqueue/cpumask can be used for
WORK_CPU_UNBOUND workers to prevent them from starving.

Tejun Heo suggested:
"* Add scheduler type to wq_attrs so that unbound workqueues can be
 configured.

* Rename system_wq's wq->name from "events" to "system_percpu", and
 similarly for the similarly named workqueues.

* Enable wq_attrs (only the applicable part should show up in the
 interface) for system_percpu and system_percpu_highpri, and use that
 to change the attributes of the percpu pools."

This patch implements the basic infrastructure and /sys interface,
such as:
# cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=0 prio=0 nice=0
# echo "policy=1 prio=1 nice=0" > /sys/devices/virtual/workqueue/system_percpu/sched_attr
# cat  /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=1 prio=1 nice=0
# cat  /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
policy=0 prio=0 nice=-20
# echo "policy=1 prio=2 nice=0" > 

[RFC PATCH V4 3/5] workqueue: rename unbound_attrs to attrs

2018-01-24 Thread Wen Yang
Replace workqueue's unbound_attrs with attrs, so that both unbound
and bound workqueues can use it.

Signed-off-by: Wen Yang 
Signed-off-by: Jiang Biao 
Signed-off-by: Tan Hu 
Suggested-by: Tejun Heo 
Cc: Tejun Heo 
Cc: Lai Jiangshan 
Cc: kernel test robot 
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 24 
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 993f225..e1613d0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -255,7 +255,6 @@ struct workqueue_struct {
int nr_drainers;/* WQ: drain in progress */
int saved_max_active; /* WQ: saved pwq max_active */
 
-   struct workqueue_attrs  *unbound_attrs; /* PW: only for unbound wqs */
struct workqueue_attrs  *attrs;
struct pool_workqueue   *dfl_pwq;   /* PW: only for unbound wqs */
 
@@ -3737,7 +3736,7 @@ static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
/* all pwqs have been created successfully, let's install'em */
mutex_lock(&ctx->wq->mutex);
 
-   copy_workqueue_attrs(ctx->wq->unbound_attrs, ctx->attrs);
+   copy_workqueue_attrs(ctx->wq->attrs, ctx->attrs);
 
/* save the previous pwq and install the new one */
for_each_node(node)
@@ -3854,7 +3853,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
lockdep_assert_held(&wq_pool_mutex);
 
if (!wq_numa_enabled || !(wq->flags & WQ_UNBOUND) ||
-   wq->unbound_attrs->no_numa)
+   wq->attrs->no_numa)
return;
 
/*
@@ -3865,7 +3864,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu,
target_attrs = wq_update_unbound_numa_attrs_buf;
cpumask = target_attrs->cpumask;
 
-   copy_workqueue_attrs(target_attrs, wq->unbound_attrs);
+   copy_workqueue_attrs(target_attrs, wq->attrs);
pwq = unbound_pwq_by_node(wq, node);
 
/*
@@ -3985,12 +3984,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
if (!wq)
return NULL;
 
-   if (flags & WQ_UNBOUND) {
-   wq->unbound_attrs = alloc_workqueue_attrs(GFP_KERNEL);
-   if (!wq->unbound_attrs)
-   goto err_free_wq;
-   }
-
wq->attrs = alloc_workqueue_attrs(GFP_KERNEL);
if (!wq->attrs)
goto err_free_wq;
@@ -4069,7 +4062,6 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
return wq;
 
 err_free_wq:
-   free_workqueue_attrs(wq->unbound_attrs);
free_workqueue_attrs(wq->attrs);
kfree(wq);
return NULL;
@@ -4941,7 +4933,7 @@ static int workqueue_apply_unbound_cpumask(void)
if (wq->flags & __WQ_ORDERED)
continue;
 
-   ctx = apply_wqattrs_prepare(wq, wq->unbound_attrs);
+   ctx = apply_wqattrs_prepare(wq, wq->attrs);
if (!ctx) {
ret = -ENOMEM;
break;
@@ -5119,7 +5111,7 @@ static ssize_t wq_nice_show(struct device *dev, struct device_attribute *attr,
int written;
 
mutex_lock(&wq->mutex);
-   written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->unbound_attrs->nice);
+   written = scnprintf(buf, PAGE_SIZE, "%d\n", wq->attrs->nice);
mutex_unlock(&wq->mutex);
 
return written;
@@ -5136,7 +5128,7 @@ static struct workqueue_attrs *wq_sysfs_prep_attrs(struct workqueue_struct *wq)
if (!attrs)
return NULL;
 
-   copy_workqueue_attrs(attrs, wq->unbound_attrs);
+   copy_workqueue_attrs(attrs, wq->attrs);
return attrs;
 }
 
@@ -5173,7 +5165,7 @@ static ssize_t wq_cpumask_show(struct device *dev,
 
mutex_lock(&wq->mutex);
written = scnprintf(buf, PAGE_SIZE, "%*pb\n",
-   cpumask_pr_args(wq->unbound_attrs->cpumask));
+   cpumask_pr_args(wq->attrs->cpumask));
mutex_unlock(&wq->mutex);
return written;
 }
@@ -5210,7 +5202,7 @@ static ssize_t wq_numa_show(struct device *dev, struct device_attribute *attr,
 
mutex_lock(&wq->mutex);
written = scnprintf(buf, PAGE_SIZE, "%d\n",
-   !wq->unbound_attrs->no_numa);
+   !wq->attrs->no_numa);
mutex_unlock(&wq->mutex);
 
return written;
-- 
1.8.3.1




[RFC PATCH V4 1/5] workqueue: rename system workqueues

2018-01-24 Thread Wen Yang
Rename system_wq's wq->name from "events" to "system_percpu",
and rename the other similarly named system workqueues accordingly.

Signed-off-by: Wen Yang 
Signed-off-by: Jiang Biao 
Signed-off-by: Tan Hu 
Suggested-by: Tejun Heo 
Cc: Tejun Heo 
Cc: Lai Jiangshan 
Cc: linux-kernel@vger.kernel.org
---
 kernel/workqueue.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f699122..67b68bb 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -5601,16 +5601,18 @@ int __init workqueue_init_early(void)
ordered_wq_attrs[i] = attrs;
}
 
-   system_wq = alloc_workqueue("events", 0, 0);
-   system_highpri_wq = alloc_workqueue("events_highpri", WQ_HIGHPRI, 0);
-   system_long_wq = alloc_workqueue("events_long", 0, 0);
-   system_unbound_wq = alloc_workqueue("events_unbound", WQ_UNBOUND,
+   system_wq = alloc_workqueue("system_percpu", 0, 0);
+   system_highpri_wq = alloc_workqueue("system_percpu_highpri",
+   WQ_HIGHPRI, 0);
+   system_long_wq = alloc_workqueue("system_percpu_long", 0, 0);
+   system_unbound_wq = alloc_workqueue("system_unbound", WQ_UNBOUND,
WQ_UNBOUND_MAX_ACTIVE);
-   system_freezable_wq = alloc_workqueue("events_freezable",
+   system_freezable_wq = alloc_workqueue("system_percpu_freezable",
  WQ_FREEZABLE, 0);
-   system_power_efficient_wq = alloc_workqueue("events_power_efficient",
+   system_power_efficient_wq = alloc_workqueue("system_percpu_power_efficient",
  WQ_POWER_EFFICIENT, 0);
-   system_freezable_power_efficient_wq = alloc_workqueue("events_freezable_power_efficient",
+   system_freezable_power_efficient_wq = alloc_workqueue(
+   "system_percpu_freezable_power_efficient",
  WQ_FREEZABLE | WQ_POWER_EFFICIENT,
  0);
BUG_ON(!system_wq || !system_highpri_wq || !system_long_wq ||
-- 
1.8.3.1





[RFC PATCH V4 2/5] workqueue: expose attrs for system workqueues

2018-01-24 Thread Wen Yang
Expose sched_attr for system workqueues, such as:
# cat   /sys/devices/virtual/workqueue/system_percpu/sched_attr
policy=0 prio=0 nice=0
# cat   /sys/devices/virtual/workqueue/system_percpu_highpri/sched_attr
policy=0 prio=0 nice=-20

Signed-off-by: Wen Yang 
Signed-off-by: Jiang Biao 
Signed-off-by: Tan Hu 
Suggested-by: Tejun Heo 
Cc: Tejun Heo 
Cc: Lai Jiangshan 
Cc: kernel test robot 
Cc: linux-kernel@vger.kernel.org
---
 include/linux/workqueue.h |  6 ++
 kernel/workqueue.c| 50 +--
 2 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index 4a54ef9..9faaade 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 struct workqueue_struct;
 
@@ -132,6 +133,11 @@ struct workqueue_attrs {
int nice;
 
/**
+* @sched_attr: kworker's scheduling parameters
+*/
+   struct sched_attr sched_attr;
+
+   /**
 * @cpumask: allowed CPUs
 */
cpumask_var_t cpumask;
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 67b68bb..993f225 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -256,6 +256,7 @@ struct workqueue_struct {
int saved_max_active; /* WQ: saved pwq max_active */
 
struct workqueue_attrs  *unbound_attrs; /* PW: only for unbound wqs */
+   struct workqueue_attrs  *attrs;
struct pool_workqueue   *dfl_pwq;   /* PW: only for unbound wqs */
 
 #ifdef CONFIG_SYSFS
@@ -1699,6 +1700,7 @@ static void worker_attach_to_pool(struct worker *worker,
 * online CPUs.  It'll be re-applied when any of the CPUs come up.
 */
set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+   sched_setattr(worker->task, &pool->attrs->sched_attr);
 
/*
 * The pool->attach_mutex ensures %POOL_DISASSOCIATED remains
@@ -3166,10 +3168,20 @@ struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask)
return NULL;
 }
 
+static void copy_sched_attr(struct sched_attr *to,
+   const struct sched_attr *from)
+{
+   to->sched_policy = from->sched_policy;
+   to->sched_priority = from->sched_priority;
+   to->sched_nice = from->sched_nice;
+   to->sched_flags = from->sched_flags;
+}
+
 static void copy_workqueue_attrs(struct workqueue_attrs *to,
 const struct workqueue_attrs *from)
 {
to->nice = from->nice;
+   copy_sched_attr(&to->sched_attr, &from->sched_attr);
cpumask_copy(to->cpumask, from->cpumask);
/*
 * Unlike hash and equality test, this function doesn't ignore
@@ -3250,7 +3262,7 @@ static void rcu_free_wq(struct rcu_head *rcu)
free_percpu(wq->cpu_pwqs);
else
free_workqueue_attrs(wq->unbound_attrs);
-
+   free_workqueue_attrs(wq->attrs);
kfree(wq->rescuer);
kfree(wq);
 }
@@ -3979,6 +3991,10 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
goto err_free_wq;
}
 
+   wq->attrs = alloc_workqueue_attrs(GFP_KERNEL);
+   if (!wq->attrs)
+   goto err_free_wq;
+
va_start(args, lock_name);
vsnprintf(wq->name, sizeof(wq->name), fmt, args);
va_end(args);
@@ -3999,6 +4015,11 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
lockdep_init_map(&wq->lockdep_map, lock_name, key, 0);
INIT_LIST_HEAD(&wq->list);
 
+   wq->attrs->sched_attr.sched_policy = SCHED_NORMAL;
+   wq->attrs->sched_attr.sched_priority = 0;
+   wq->attrs->sched_attr.sched_nice = wq->flags & WQ_HIGHPRI ?
+   HIGHPRI_NICE_LEVEL : 0;
+
if (alloc_and_link_pwqs(wq) < 0)
goto err_free_wq;
 
@@ -4049,6 +4070,7 @@ struct workqueue_struct *__alloc_workqueue_key(const char *fmt,
 
 err_free_wq:
free_workqueue_attrs(wq->unbound_attrs);
+   free_workqueue_attrs(wq->attrs);
kfree(wq);
return NULL;
 err_destroy:
@@ -5043,9 +5065,29 @@ static ssize_t max_active_store(struct device *dev,
 }
 static DEVICE_ATTR_RW(max_active);
 
+static ssize_t sched_attr_show(struct device *dev,
+   struct device_attribute *attr, char *buf)
+{
+   size_t written;
+   struct workqueue_struct *wq = dev_to_wq(dev);
+
+   mutex_lock(&wq->mutex);
+   written = scnprintf(buf, PAGE_SIZE,
+   "policy=%u prio=%u nice=%d\n",
+   wq->attrs->sched_attr.sched_policy,
+   wq->attrs->sched_attr.sched_priority,
+   wq->attrs->sched_attr.sched_nice);
+   mutex_unlock(&wq->mutex);
+
+   return written;
+}
+
+static DEVICE_ATTR_RO(sched_attr);
+
 static struct attribute *wq_sysfs_attrs[] = {
&dev_attr_per_cpu.attr,
&dev_attr_max_active.attr,
+   &dev_attr_sched_attr.attr,

[PATCH v3] f2fs: support inode creation time

2018-01-24 Thread Chao Yu
This patch adds a creation time field in the inode layout to support
showing kstat.btime in ->statx.

Signed-off-by: Chao Yu 
---
v3:
- fix address alignment issue.
 fs/f2fs/f2fs.h  |  7 +++
 fs/f2fs/file.c  |  9 +
 fs/f2fs/inode.c | 16 
 fs/f2fs/namei.c |  3 ++-
 fs/f2fs/sysfs.c |  7 +++
 include/linux/f2fs_fs.h |  4 +++-
 6 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
index b7ba496af28f..6300ac5bcbe4 100644
--- a/fs/f2fs/f2fs.h
+++ b/fs/f2fs/f2fs.h
@@ -124,6 +124,7 @@ struct f2fs_mount_info {
 #define F2FS_FEATURE_INODE_CHKSUM  0x0020
 #define F2FS_FEATURE_FLEXIBLE_INLINE_XATTR 0x0040
 #define F2FS_FEATURE_QUOTA_INO 0x0080
+#define F2FS_FEATURE_INODE_CRTIME  0x0100
 
 #define F2FS_HAS_FEATURE(sb, mask) \
((F2FS_SB(sb)->raw_super->feature & cpu_to_le32(mask)) != 0)
@@ -635,6 +636,7 @@ struct f2fs_inode_info {
int i_extra_isize;  /* size of extra space located in 
i_addr */
kprojid_t i_projid; /* id for project quota */
int i_inline_xattr_size;/* inline xattr size */
+   struct timespec i_crtime;   /* inode creation time */
 };
 
 static inline void get_extent_info(struct extent_info *ext,
@@ -3205,6 +3207,11 @@ static inline int f2fs_sb_has_quota_ino(struct super_block *sb)
return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_QUOTA_INO);
 }
 
+static inline int f2fs_sb_has_inode_crtime(struct super_block *sb)
+{
+   return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_INODE_CRTIME);
+}
+
 #ifdef CONFIG_BLK_DEV_ZONED
 static inline int get_blkz_type(struct f2fs_sb_info *sbi,
struct block_device *bdev, block_t blkaddr)
diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 84306c718e68..672a542e5464 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -672,8 +672,17 @@ int f2fs_getattr(const struct path *path, struct kstat *stat,
 {
struct inode *inode = d_inode(path->dentry);
struct f2fs_inode_info *fi = F2FS_I(inode);
+   struct f2fs_inode *ri;
unsigned int flags;
 
+   if (f2fs_has_extra_attr(inode) &&
+   f2fs_sb_has_inode_crtime(inode->i_sb) &&
+   F2FS_FITS_IN_INODE(ri, fi->i_extra_isize, i_crtime)) {
+   stat->result_mask |= STATX_BTIME;
+   stat->btime.tv_sec = fi->i_crtime.tv_sec;
+   stat->btime.tv_nsec = fi->i_crtime.tv_nsec;
+   }
+
flags = fi->i_flags & (FS_FL_USER_VISIBLE | FS_PROJINHERIT_FL);
if (flags & FS_APPEND_FL)
stat->attributes |= STATX_ATTR_APPEND;
diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
index 1dc77a40d0ad..99ee72bff628 100644
--- a/fs/f2fs/inode.c
+++ b/fs/f2fs/inode.c
@@ -278,6 +278,13 @@ static int do_read_inode(struct inode *inode)
i_projid = F2FS_DEF_PROJID;
fi->i_projid = make_kprojid(&init_user_ns, i_projid);
 
+   if (f2fs_has_extra_attr(inode) && f2fs_sb_has_inode_crtime(sbi->sb) &&
+   F2FS_FITS_IN_INODE(ri, fi->i_extra_isize, i_crtime)) {
+   fi->i_crtime.tv_sec = le64_to_cpu(ri->i_crtime);
+   fi->i_crtime.tv_nsec = le32_to_cpu(ri->i_crtime_nsec);
+   }
+
+
f2fs_put_page(node_page, 1);
 
stat_inc_inline_xattr(inode);
@@ -421,6 +428,15 @@ void update_inode(struct inode *inode, struct page *node_page)
F2FS_I(inode)->i_projid);
ri->i_projid = cpu_to_le32(i_projid);
}
+
+   if (f2fs_sb_has_inode_crtime(F2FS_I_SB(inode)->sb) &&
+   F2FS_FITS_IN_INODE(ri, F2FS_I(inode)->i_extra_isize,
+   i_crtime)) {
+   ri->i_crtime =
+   cpu_to_le64(F2FS_I(inode)->i_crtime.tv_sec);
+   ri->i_crtime_nsec =
+   cpu_to_le32(F2FS_I(inode)->i_crtime.tv_nsec);
+   }
}
 
__set_inode_rdev(inode, ri);
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index 3ee97ba9d2d7..c4c94c7e9f4f 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -50,7 +50,8 @@ static struct inode *f2fs_new_inode(struct inode *dir, 
umode_t mode)
 
inode->i_ino = ino;
inode->i_blocks = 0;
-   inode->i_mtime = inode->i_atime = inode->i_ctime = current_time(inode);
+   inode->i_mtime = inode->i_atime = inode->i_ctime =
+   F2FS_I(inode)->i_crtime = current_time(inode);
inode->i_generation = sbi->s_next_generation++;
 
err = insert_inode_locked(inode);
diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index 41887e6ec1b3..d978c7b6ea04 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -113,6 +113,9 @@ static ssize_t features_show(struct f2fs_attr *a,
if (f2fs_sb_has_quota_ino(sb))
 

Re: [RFC PATCH v2 0/1] of: easier debugging for node life cycle issues

2018-01-24 Thread Frank Rowand
Hi Steve,

On 01/21/18 06:31, Wolfram Sang wrote:
> I got a bug report for a DT node refcounting problem in the I2C subsystem. 
> This
> patch was a huge help in validating the bug report and the proposed solution.
> So, I thought I'd bring it to attention again. Thanks, Tyrel, for the initial
> work!
> 
> Note that I did not test the dynamic updates, only of_node_{get|put} so far. I
> read that Tyrel checked dynamic updates extensively with this patch. And since
> DT overlays are also used within our Renesas dev team, this will help there, 
> as
> well.
> 
> Tested on a Renesas Salvator-XS board (R-Car H3).
> 
> Changes since RFC v1:
>   * rebased to v4.15-rc8
>   * fixed commit abbrev and one of the sysfs paths in commit desc
>   * removed trailing space and fixed pointer declaration in code
> 
> I consider all the remaining checkpatch issues irrelevant for this patch.
> 
> So what about applying it?
> 
> Kind regards,
> 
>Wolfram
> 
> 
> Tyrel Datwyler (1):
>   of: introduce event tracepoints for dynamic device_node lifecycle
> 
>  drivers/of/dynamic.c  | 32 ++--
>  include/trace/events/of.h | 93 
> +++
>  2 files changed, 105 insertions(+), 20 deletions(-)
>  create mode 100644 include/trace/events/of.h
> 

Off the top of your head, can you tell me how early in the boot
process a trace_event can be called and still successfully provide the
data to someone trying to debug early boot issues?

Also, way back when version 1 of this patch was being discussed,
a question came up about stacktrace triggers:

 >>> # echo stacktrace > /sys/kernel/debug/tracing/trace_options
 >>> # cat trace | grep -A6 "/pci@8002018"  
 >>
 >> Just to let you know that there are now stacktrace event triggers, where
 >> you don't need to stacktrace all events, you can pick and choose. And
 >> even filter the stack trace on specific fields of the event.  
 >
 > This is great, and I did figure that out this afternoon. One thing I was
 > still trying to determine, though, was whether it's possible to set these
 > triggers at boot? As far as I could tell I'm still limited to
 > "trace_options=stacktrace" as a kernel boot parameter to get the stack
 > for event tracepoints.

 No, not yet. But I'll add that to the todo list.

 Thanks,

 -- Steve

Is this still on your todo list, or is it now available?

Thanks,

Frank


Re: [PATCH v2] f2fs: support inode creation time

2018-01-24 Thread Chao Yu
On 2018/1/24 22:59, Chao Yu wrote:
> Hi Jaegeuk,
> 
> On 2018/1/24 13:38, Jaegeuk Kim wrote:
>> On 01/22, Chao Yu wrote:
>>> From: Chao Yu 
>>>
>>> This patch adds a creation time field in the inode layout to support showing
>>> kstat.btime in ->statx.
>>
>> Hi Chao,
>>
>> Could you please check this patch again? I reverted this due to a kernel panic.
> 
> I can't reproduce it, could you show me your call stack?
> 
> And, could you please have a try with kernel patch only?

Sigh, it's an address alignment issue, which will make sizeof(f2fs_node)
exceed 4096... let me resend the patch.

Thanks,

> 
> Thanks,
> 
>>
>> Thanks,
>>
>>>
>>> Signed-off-by: Chao Yu 
>>> ---
>>> v2:
>>> - add missing sysfs entry.
>>>  fs/f2fs/f2fs.h  |  7 +++
>>>  fs/f2fs/file.c  |  9 +
>>>  fs/f2fs/inode.c | 16 
>>>  fs/f2fs/namei.c |  3 ++-
>>>  fs/f2fs/sysfs.c |  7 +++
>>>  include/linux/f2fs_fs.h |  2 ++
>>>  6 files changed, 43 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h
>>> index b7ba496af28f..6300ac5bcbe4 100644
>>> --- a/fs/f2fs/f2fs.h
>>> +++ b/fs/f2fs/f2fs.h
>>> @@ -124,6 +124,7 @@ struct f2fs_mount_info {
>>>  #define F2FS_FEATURE_INODE_CHKSUM  0x0020
>>>  #define F2FS_FEATURE_FLEXIBLE_INLINE_XATTR 0x0040
>>>  #define F2FS_FEATURE_QUOTA_INO 0x0080
>>> +#define F2FS_FEATURE_INODE_CRTIME  0x0100
>>>  
>>>  #define F2FS_HAS_FEATURE(sb, mask) \
>>> ((F2FS_SB(sb)->raw_super->feature & cpu_to_le32(mask)) != 0)
>>> @@ -635,6 +636,7 @@ struct f2fs_inode_info {
>>> int i_extra_isize;  /* size of extra space located in 
>>> i_addr */
>>> kprojid_t i_projid; /* id for project quota */
>>> int i_inline_xattr_size;/* inline xattr size */
>>> +   struct timespec i_crtime;   /* inode creation time */
>>>  };
>>>  
>>>  static inline void get_extent_info(struct extent_info *ext,
>>> @@ -3205,6 +3207,11 @@ static inline int f2fs_sb_has_quota_ino(struct 
>>> super_block *sb)
>>> return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_QUOTA_INO);
>>>  }
>>>  
>>> +static inline int f2fs_sb_has_inode_crtime(struct super_block *sb)
>>> +{
>>> +   return F2FS_HAS_FEATURE(sb, F2FS_FEATURE_INODE_CRTIME);
>>> +}
>>> +
>>>  #ifdef CONFIG_BLK_DEV_ZONED
>>>  static inline int get_blkz_type(struct f2fs_sb_info *sbi,
>>> struct block_device *bdev, block_t blkaddr)
>>> diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
>>> index b9b1efe61d29..0873ba12ebc6 100644
>>> --- a/fs/f2fs/file.c
>>> +++ b/fs/f2fs/file.c
>>> @@ -672,8 +672,17 @@ int f2fs_getattr(const struct path *path, struct kstat 
>>> *stat,
>>>  {
>>> struct inode *inode = d_inode(path->dentry);
>>> struct f2fs_inode_info *fi = F2FS_I(inode);
>>> +   struct f2fs_inode *ri;
>>> unsigned int flags;
>>>  
>>> +   if (f2fs_has_extra_attr(inode) &&
>>> +   f2fs_sb_has_inode_crtime(inode->i_sb) &&
>>> +   F2FS_FITS_IN_INODE(ri, fi->i_extra_isize, i_crtime)) {
>>> +   stat->result_mask |= STATX_BTIME;
>>> +   stat->btime.tv_sec = fi->i_crtime.tv_sec;
>>> +   stat->btime.tv_nsec = fi->i_crtime.tv_nsec;
>>> +   }
>>> +
>>> flags = fi->i_flags & (FS_FL_USER_VISIBLE | FS_PROJINHERIT_FL);
>>> if (flags & FS_APPEND_FL)
>>> stat->attributes |= STATX_ATTR_APPEND;
>>> diff --git a/fs/f2fs/inode.c b/fs/f2fs/inode.c
>>> index 1dc77a40d0ad..99ee72bff628 100644
>>> --- a/fs/f2fs/inode.c
>>> +++ b/fs/f2fs/inode.c
>>> @@ -278,6 +278,13 @@ static int do_read_inode(struct inode *inode)
>>> i_projid = F2FS_DEF_PROJID;
>>> fi->i_projid = make_kprojid(&init_user_ns, i_projid);
>>>  
>>> +   if (f2fs_has_extra_attr(inode) && f2fs_sb_has_inode_crtime(sbi->sb) &&
>>> +   F2FS_FITS_IN_INODE(ri, fi->i_extra_isize, i_crtime)) {
>>> +   fi->i_crtime.tv_sec = le64_to_cpu(ri->i_crtime);
>>> +   fi->i_crtime.tv_nsec = le32_to_cpu(ri->i_crtime_nsec);
>>> +   }
>>> +
>>> +
>>> f2fs_put_page(node_page, 1);
>>>  
>>> stat_inc_inline_xattr(inode);
>>> @@ -421,6 +428,15 @@ void update_inode(struct inode *inode, struct page 
>>> *node_page)
>>> F2FS_I(inode)->i_projid);
>>> ri->i_projid = cpu_to_le32(i_projid);
>>> }
>>> +
>>> +   if (f2fs_sb_has_inode_crtime(F2FS_I_SB(inode)->sb) &&
>>> +   F2FS_FITS_IN_INODE(ri, F2FS_I(inode)->i_extra_isize,
>>> +   i_crtime)) {
>>> +   ri->i_crtime =
>>> +   cpu_to_le64(F2FS_I(inode)->i_crtime.tv_sec);
>>> +   ri->i_crtime_nsec =
>>> +   cpu_to_le32(F2FS_I(inode)->i_crtime.tv_nsec);
>>> +   }
>>> }
>>>  
>>> __set_inode_rdev(inode, ri);
>>> diff --git a/fs/f2fs/namei.c 

Re: backport Rewrite sync_core() to use IRET-to-self to stable kernels?

2018-01-24 Thread gre...@linuxfoundation.org

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

A: No.
Q: Should I include quotations after my reply?

http://daringfireball.net/2007/07/on_top

On Thu, Jan 25, 2018 at 02:22:57AM +, Zhang, Ning A wrote:
> When Linux runs as a guest OS, CPUID is a privileged instruction, so
> sync_core() will be very slow.
> 
> If this patch is applied, 200ms of kernel init time will be saved when
> Linux runs as a guest OS.

That's not a regression.  Why not just use a newer kernel release if you
want to get the feature of "faster boot"?  :)

thanks,

greg k-h


Re: [RFC PATCH v2 1/1] of: introduce event tracepoints for dynamic device_node lifecycle

2018-01-24 Thread Frank Rowand
On 01/21/18 06:31, Wolfram Sang wrote:
> From: Tyrel Datwyler 
> 
> This patch introduces event tracepoints for tracking a device_node's
> reference cycle as well as reconfig notifications generated in response
> to node/property manipulations.
> 
> With the recent upstreaming of the refcount API several device_node
> underflows and leaks have come to my attention in the pseries (DLPAR)
> dynamic logical partitioning code (ie. POWER speak for hotplugging
> virtual and physical resources at runtime such as cpus or IOAs). These
> tracepoints provide an easy and quick mechanism for validating the
> reference counting of device_nodes during their lifetime.
> 
> Further, when pseries lpars are migrated to a different machine we
> perform a live update of our device tree to bring it into alignment with
> the configuration of the new machine. The of_reconfig_notify trace point
> provides a mechanism that can be turned on for debugging the device tree
> modifications without having to build a custom kernel to get at the
> DEBUG code introduced by commit 00aa37206e1a54 ("of/reconfig: Add debug
> output for OF_RECONFIG notifiers").
> 
> The following trace events are provided: of_node_get, of_node_put,
> of_node_release, and of_reconfig_notify. These trace points require a

Please add a note that the of_reconfig_notify trace event is not an
added bit of debug info, but is instead replacing information that
was previously available via pr_debug() when DEBUG was defined.


> kernel built with ftrace support to be enabled. In a typical environment
> where debugfs is mounted at /sys/kernel/debug the entire set of
> tracepoints can be set with the following:
> 
>   echo "of:*" > /sys/kernel/debug/tracing/set_event
> 
> or
> 
>   echo 1 > /sys/kernel/debug/tracing/events/of/enable
> 
> The following shows the trace point data from a DLPAR remove of a cpu
> from a pseries lpar:
> 
> cat /sys/kernel/debug/tracing/trace | grep "POWER8@10"
> 
> cpuhp/23-147   [023]    128.324827:
> of_node_put: refcount=5, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324829:
> of_node_put: refcount=4, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324829:
> of_node_put: refcount=3, dn->full_name=/cpus/PowerPC,POWER8@10
> cpuhp/23-147   [023]    128.324831:
> of_node_put: refcount=2, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439000:
> of_node_put: refcount=1, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439002:
> of_reconfig_notify: action=DETACH_NODE, 
> dn->full_name=/cpus/PowerPC,POWER8@10,
> prop->name=null, old_prop->name=null
>drmgr-7284  [009]    128.439015:
> of_node_put: refcount=0, dn->full_name=/cpus/PowerPC,POWER8@10
>drmgr-7284  [009]    128.439016:
> of_node_release: dn->full_name=/cpus/PowerPC,POWER8@10, dn->_flags=4
> 
> Signed-off-by: Tyrel Datwyler 

The following belongs in a list of version 2 changes, below the "---" line:

> [wsa: fixed commit abbrev and one of the sysfs paths in commit desc,
> removed trailing space and fixed pointer declaration in code]

> Signed-off-by: Wolfram Sang 
> ---
>  drivers/of/dynamic.c  | 32 ++--
>  include/trace/events/of.h | 93 
> +++
>  2 files changed, 105 insertions(+), 20 deletions(-)
>  create mode 100644 include/trace/events/of.h

mode looks incorrect.  Existing files in include/trace/events/ are -rw-rw


> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index ab988d88704da0..b0d6ab5a35b8c6 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -21,6 +21,9 @@ static struct device_node *kobj_to_device_node(struct 
> kobject *kobj)
>   return container_of(kobj, struct device_node, kobj);
>  }
>  
> +#define CREATE_TRACE_POINTS
> +#include 
> +
>  /**
>   * of_node_get() - Increment refcount of a node
>   * @node:Node to inc refcount, NULL is supported to simplify writing of
> @@ -30,8 +33,10 @@ static struct device_node *kobj_to_device_node(struct 
> kobject *kobj)
>   */
>  struct device_node *of_node_get(struct device_node *node)
>  {
> - if (node)
> + if (node) {
>   kobject_get(&node->kobj);
> + trace_of_node_get(refcount_read(&node->kobj.kref.refcount),
> node->full_name);

See the comment from Rob that I mentioned in my previous email.

Also, the path has been removed from node->full_name.  Does using it here
still give all of the information that is desired?  Same for all other uses
of full_name in this patch.

The trace point should have a single argument, node.  Accessing the two
fields can be done in the tracepoint assignment.  Or is there some
reason that can't be done?  Same for the trace_of_node_put() tracepoint.


> + }
>   return node;
>  }
>  

Re: [RFC PATCH v2 0/1] of: easier debugging for node life cycle issues

2018-01-24 Thread Frank Rowand
On 01/21/18 06:31, Wolfram Sang wrote:
> I got a bug report for a DT node refcounting problem in the I2C subsystem. 
> This
> patch was a huge help in validating the bug report and the proposed solution.
> So, I thought I'd bring it to attention again. Thanks, Tyrel, for the initial
> work!
> 
> Note that I did not test the dynamic updates, only of_node_{get|put} so far. I
> read that Tyrel checked dynamic updates extensively with this patch. And since
> DT overlays are also used within our Renesas dev team, this will help there, 
> as
> well.

It's been nine months since version 1.  If you are going to include the
dynamic updates part of the patch then please test them.


> Tested on a Renesas Salvator-XS board (R-Car H3).
> 
> Changes since RFC v1:
>   * rebased to v4.15-rc8
>   * fixed commit abbrev and one of the sysfs paths in commit desc
>   * removed trailing space and fixed pointer declaration in code
> 

> I consider all the remaining checkpatch issues irrelevant for this patch.

I am OK with the line length warnings in this patch.

Why can't the macro error be fixed?

A file entry needs to be added to MAINTAINERS.


> 
> So what about applying it?
> 
> Kind regards,
> 
>Wolfram
> 
> 
> Tyrel Datwyler (1):
>   of: introduce event tracepoints for dynamic device_node lifecycle
> 
>  drivers/of/dynamic.c  | 32 ++--
>  include/trace/events/of.h | 93 
> +++
>  2 files changed, 105 insertions(+), 20 deletions(-)
>  create mode 100644 include/trace/events/of.h
> 



Re: [RFC PATCH v2 0/1] of: easier debugging for node life cycle issues

2018-01-24 Thread Frank Rowand
On 01/22/18 03:49, Wolfram Sang wrote:
> Hi Frank,
> 
>> Please go back and read the thread for version 1.  Simply resubmitting a
>> forward port is ignoring that whole conversation.
>>
>> There is a lot of good info in that thread.  I certainly learned stuff in it.
> 
> Yes, I did that and learned stuff, too. My summary of the discussion was:
> 
> - you mentioned some drawbacks you saw (like the mixture of trace output
>   and printk output)
> - most of them look to me like they were addressed? (e.g. Steven showed a
>   way to redirect printk to trace)
> - you posted your version (which was, however, marked as "not user friendly"
>   even by yourself)

Not exactly a fair quoting.  There were two things I said:

  "Here is a patch that I have used.  It is not as user friendly in terms
  of human readable stack traces (though a very small user space program
  should be able to fix that)."

 So easy to fix using existing userspace programs to convert kernel
 addresses to symbols.

  "FIXME: Currently using pr_err() so I don't need to set loglevel on boot.

  So obviously not a user friendly tool!!!
  The process is:
 - apply patch
 - configure, build, boot kernel
 - analyze data
 - remove patch"

 So not friendly because it uses pr_err() instead of pr_debug().  In
 a reply I said if I submitted my patches I would change it to use
 pr_debug() instead.  So not an issue.

 And not user friendly because it requires patching the kernel.
 Again a NOP if I submitted my patch, because the patch would
 already be in the kernel.

But whatever, let's ignore that - a poor quoting is not a reason to
reject this version of the patch.


> - The discussion stalled over having two approaches

Then you should have stated such when you resubmitted.


> So, I thought reposting would be a good way of finding out if your
> concerns were addressed in the discussion or not. If I overlooked

Then you should have stated that there were concerns raised in the
discussion and asked me if my concerns were addressed.


> something, I am sorry for that. Still, my intention is to continue the
> discussion, not to ignore it. Because as it stands, we don't have such a
> debugging mechanism in place currently, and with people working with DT
> overlays, I'd think it would be nice to have.
> 
> Kind regards,
> 
>Wolfram
> 


Rob suggested:

 >
 > @@ -25,8 +28,10 @@
 >   */
 >  struct device_node *of_node_get(struct device_node *node)
 >  {
 > -   if (node)
 > +   if (node) {
 > kobject_get(&node->kobj);
 > +   trace_of_node_get(refcount_read(&node->kobj.kref.refcount), node->full_name);

 Seems like there should be a kobj wrapper to read the refcount.

As far as I noticed, that was never addressed.  I don't know the answer, but
the question was asked.  And if there is no such function, then there is at
least kref_read(), which would improve the code a little bit.

I'll reply to the patch 0/1 and patch 1/1 emails with review comments.

-Frank


Re: [RFC PATCH v2 0/1] of: easier debugging for node life cycle issues

2018-01-24 Thread Frank Rowand
On 01/22/18 03:49, Wolfram Sang wrote:
> Hi Frank,
> 
>> Please go back and read the thread for version 1.  Simply resubmitting a
>> forward port is ignoring that whole conversation.
>>
>> There is a lot of good info in that thread.  I certainly learned stuff in it.
> 
> Yes, I did that and learned stuff, too. My summary of the discussion was:
> 
> - you mentioned some drawbacks you saw (like the mixture of trace output
>   and printk output)> - most of them look like addressed to me? (e.g. Steven 
> showed a way to redirect
>   printk to trace
> - you posted your version (which was, however, marked as "not user friendly"
>   even by yourself)

Not exactly a fair quoting.  There were two things I said:

  "Here is a patch that I have used.  It is not as user friendly in terms
  of human readable stack traces (though a very small user space program
  should be able to fix that)."

 So easy to fix using existing userspace programs to convert kernel
 addresses to symbols.

  "FIXME: Currently using pr_err() so I don't need to set loglevel on boot.

  So obviously not a user friendly tool!!!
  The process is:
 - apply patch
 - configure, build, boot kernel
 - analyze data
 - remove patch"

 So not friendly because it uses pr_err() instead of pr_debug().  In
 a reply I said if I submitted my patches I would change it to use
 pr_debug() instead.  So not an issue.

 And not user friendly because it requires patching the kernel.
 Again a NOP if I submitted my patch, because the patch would
 already be in the kernel.

But whatever, let's ignore that - a poor quoting is not a reason to
reject this version of the patch.


> - The discussion stalled over having two approaches

Then you should have stated such when you resubmitted.


> So, I thought reposting would be a good way of finding out if your
> concerns were addressed in the discussion or not. If I overlooked

Then you should have stated that there were concerns raised in the
discussion and asked me if my concerns were addressed.


> something, I am sorry for that. Still, my intention is to continue the
> discussion, not to ignore it. Because as it stands, we don't have such a
> debugging mechanism in place currently, and with people working with DT
> overlays, I'd think it would be nice to have.
> 
> Kind regards,
> 
>Wolfram
> 


Rob suggested:

 >
 > @@ -25,8 +28,10 @@
 >   */
 >  struct device_node *of_node_get(struct device_node *node)
 >  {
 > -   if (node)
 > +   if (node) {
 > kobject_get(>kobj);
 > +   
trace_of_node_get(refcount_read(>kobj.kref.refcount), node->full_name);

 Seems like there should be a kobj wrapper to read the refcount.

As far as I noticed, that was never addressed.  I don't know the answer, but
the question was asked.  And if there is no such function, then there is at
least kref_read(), which would improve the code a little bit.

I'll reply to the patch 0/1 and patch 1/1 emails with review comments.

-Frank


Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns

2018-01-24 Thread vcaputo
On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> 2018-01-17 23:48 GMT+01:00  :
> > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> >> Hi Vito,
> >>
> >> 2017-12-01 22:33 GMT+01:00  :
> >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcap...@pengaru.com wrote:
> >> >> Hello,
> >> >>
> >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> >> `cmus` while doing big and churny `git checkout` commands in my linux 
> >> >> git
> >> >> tree.
> >> >>
> >> >> It's not something I've done much of over the last couple months so I
> >> >> hadn't noticed until yesterday, but didn't remember this being a 
> >> >> problem in
> >> >> recent history.
> >> >>
> >> >> As there's quite an accumulation of similarly configured and built 
> >> >> kernels
> >> >> in my grub menu, it was trivial to determine approximately when this 
> >> >> began:
> >> >>
> >> >> 4.11.0: no dropouts
> >> >> 4.12.0-rc7: dropouts
> >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >> >>
> >> >> Watching top while this is going on in the various kernel versions, it's
> >> >> apparent that the kworker behavior changed.  Both the priority and 
> >> >> quantity
> >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >> >>
> >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >> >>
> >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> >> Author: Tim Murray 
> >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >> >>
> >> >> dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >> >>
> >> >> Running dm-crypt with workqueues at the standard priority results 
> >> >> in IO
> >> >> competing for CPU time with standard user apps, which can lead to
> >> >> pipeline bubbles and seriously degraded performance.  Move to using
> >> >> WQ_HIGHPRI workqueues to protect against that.
> >> >>
> >> >> Signed-off-by: Tim Murray 
> >> >> Signed-off-by: Enric Balletbo i Serra 
> >> >> Signed-off-by: Mike Snitzer 
> >> >>
> >> >> ---
> >> >>
> >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> >> problem completely.
> >> >>
> >> >> Looking at the diff in that commit, it looks like the commit message 
> >> >> isn't
> >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> >> changed - they're also being made "CPU intensive" workqueues as well.
> >> >>
> >> >> This combination appears to result in both elevated scheduling priority 
> >> >> and
> >> >> greater quantity of participant worker threads effectively starving any
> >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >> >>
> >> >> I don't know what the right solution is here.  It seems to me we're 
> >> >> lacking
> >> >> the appropriate mechanism for charging CPU resources consumed on behalf 
> >> >> of
> >> >> user processes in kworker threads to the work-causing process.
> >> >>
> >> >> What effectively happens is my normal `git` user process is able to
> >> >> greatly amplify what share of CPU it takes from the system by 
> >> >> generating IO
> >> >> on what happens to be a high-priority CPU-intensive storage volume.
> >> >>
> >> >> It looks potentially complicated to fix properly, but I suspect at its 
> >> >> core
> >> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> >> asynchronous design.  Something that has been exacerbated substantially 
> >> >> by
> >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >> >>
> >> >> If we imagine the whole stack simplified, where all the IO was being 
> >> >> done
> >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> >> IO-causing process context, it would be getting charged to the calling
> >> >> process and scheduled accordingly.  The resource accounting and 
> >> >> scheduling
> >> >> problems all emerge with the page cache, buffered IO, and async 
> >> >> background
> >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> >> appears to me anyways...
> >> >>
> >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on 
> >> >> dmcrypt.
> >> >> The kernel .config is attached in case it's of interest.
> >> >>
> >> >> Thanks,
> >> >> Vito Caputo
> >> >
> >> >
> >> >
> >> > Ping...
> >> >
> >> > Could somebody please at least ACK receiving this so I'm not left 
> >> > wondering
> >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> >>
> >> Sorry I did not notice your email before you ping me directly. It's
> >> interesting that issue, though we didn't notice this problem. It's a
> >> bit far since I tested this patch but I'll setup the environment again
> >> and do more tests to understand better what is happening.

Re: [REGRESSION] (>= v4.12) IO w/dmcrypt causing audio underruns

2018-01-24 Thread vcaputo
On Fri, Jan 19, 2018 at 11:57:32AM +0100, Enric Balletbo Serra wrote:
> Hi Vito,
> 
> 2018-01-17 23:48 GMT+01:00  :
> > On Mon, Dec 18, 2017 at 10:25:33AM +0100, Enric Balletbo Serra wrote:
> >> Hi Vito,
> >>
> >> 2017-12-01 22:33 GMT+01:00  :
> >> > On Wed, Nov 29, 2017 at 10:39:19AM -0800, vcap...@pengaru.com wrote:
> >> >> Hello,
> >> >>
> >> >> Recently I noticed substantial audio dropouts when listening to MP3s in
> >> >> `cmus` while doing big and churny `git checkout` commands in my linux 
> >> >> git
> >> >> tree.
> >> >>
> >> >> It's not something I've done much of over the last couple months so I
> >> >> hadn't noticed until yesterday, but didn't remember this being a 
> >> >> problem in
> >> >> recent history.
> >> >>
> >> >> As there's quite an accumulation of similarly configured and built 
> >> >> kernels
> >> >> in my grub menu, it was trivial to determine approximately when this 
> >> >> began:
> >> >>
> >> >> 4.11.0: no dropouts
> >> >> 4.12.0-rc7: dropouts
> >> >> 4.14.0-rc6: dropouts (seem more substantial as well, didn't investigate)
> >> >>
> >> >> Watching top while this is going on in the various kernel versions, it's
> >> >> apparent that the kworker behavior changed.  Both the priority and 
> >> >> quantity
> >> >> of running kworker threads is elevated in kernels experiencing dropouts.
> >> >>
> >> >> Searching through the commit history for v4.11..v4.12 uncovered:
> >> >>
> >> >> commit a1b89132dc4f61071bdeaab92ea958e0953380a1
> >> >> Author: Tim Murray 
> >> >> Date:   Fri Apr 21 11:11:36 2017 +0200
> >> >>
> >> >> dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues
> >> >>
> >> >> Running dm-crypt with workqueues at the standard priority results 
> >> >> in IO
> >> >> competing for CPU time with standard user apps, which can lead to
> >> >> pipeline bubbles and seriously degraded performance.  Move to using
> >> >> WQ_HIGHPRI workqueues to protect against that.
> >> >>
> >> >> Signed-off-by: Tim Murray 
> >> >> Signed-off-by: Enric Balletbo i Serra 
> >> >> Signed-off-by: Mike Snitzer 
> >> >>
> >> >> ---
> >> >>
> >> >> Reverting a1b8913 from 4.14.0-rc6, my current kernel, eliminates the
> >> >> problem completely.
> >> >>
> >> >> Looking at the diff in that commit, it looks like the commit message 
> >> >> isn't
> >> >> even accurate; not only is the priority of the dmcrypt workqueues being
> >> >> changed - they're also being made "CPU intensive" workqueues as well.
> >> >>
> >> >> This combination appears to result in both elevated scheduling priority 
> >> >> and
> >> >> greater quantity of participant worker threads effectively starving any
> >> >> normal priority user task under periods of heavy IO on dmcrypt volumes.
> >> >>
> >> >> I don't know what the right solution is here.  It seems to me we're 
> >> >> lacking
> >> >> the appropriate mechanism for charging CPU resources consumed on behalf 
> >> >> of
> >> >> user processes in kworker threads to the work-causing process.
> >> >>
> >> >> What effectively happens is my normal `git` user process is able to
> >> >> greatly amplify what share of CPU it takes from the system by 
> >> >> generating IO
> >> >> on what happens to be a high-priority CPU-intensive storage volume.
> >> >>
> >> >> It looks potentially complicated to fix properly, but I suspect at its 
> >> >> core
> >> >> this may be a fairly longstanding shortcoming of the page cache and its
> >> >> asynchronous design.  Something that has been exacerbated substantially 
> >> >> by
> >> >> the introduction of CPU-intensive storage subsystems like dmcrypt.
> >> >>
> >> >> If we imagine the whole stack simplified, where all the IO was being 
> >> >> done
> >> >> synchronously in-band, and the dmcrypt kernel code simply ran in the
> >> >> IO-causing process context, it would be getting charged to the calling
> >> >> process and scheduled accordingly.  The resource accounting and 
> >> >> scheduling
> >> >> problems all emerge with the page cache, buffered IO, and async 
> >> >> background
> >> >> writeback in a pool of unrelated worker threads, etc.  That's how it
> >> >> appears to me anyways...
> >> >>
> >> >> The system used is a X61s Thinkpad 1.8Ghz with 840 EVO SSD, lvm on 
> >> >> dmcrypt.
> >> >> The kernel .config is attached in case it's of interest.
> >> >>
> >> >> Thanks,
> >> >> Vito Caputo
> >> >
> >> >
> >> >
> >> > Ping...
> >> >
> >> > Could somebody please at least ACK receiving this so I'm not left 
> >> > wondering
> >> > if my mails to lkml are somehow winding up flagged as spam, thanks!
> >>
> >> Sorry I did not notice your email before you ping me directly. It's
> >> interesting that issue, though we didn't notice this problem. It's a
> >> bit far since I tested this patch but I'll setup the environment again
> >> and do more tests to understand better what is happening.
> >>
> >
> > Any update on this?
> >
> 
> I did not reproduce the issue for now. Can you try what happens if you
> remove the 

linux-next: manual merge of the net-next tree with the vfs tree

2018-01-24 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/tipc/socket.c

between commit:

  ade994f4f6c8 ("net: annotate ->poll() instances")

from the vfs tree and commit:

  60c253069632 ("tipc: fix race between poll() and setsockopt()")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/tipc/socket.c
index 2aa46e8cd8fe,473a096b6fba..
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@@ -715,8 -716,7 +716,7 @@@ static __poll_t tipc_poll(struct file *
  {
struct sock *sk = sock->sk;
struct tipc_sock *tsk = tipc_sk(sk);
-   struct tipc_group *grp = tsk->group;
 -  u32 revents = 0;
 +  __poll_t revents = 0;
  
sock_poll_wait(file, sk_sleep(sk), wait);
  


linux-next: manual merge of the net-next tree with the vfs tree

2018-01-24 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  net/tipc/socket.c

between commit:

  ade994f4f6c8 ("net: annotate ->poll() instances")

from the vfs tree and commit:

  60c253069632 ("tipc: fix race between poll() and setsockopt()")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc net/tipc/socket.c
index 2aa46e8cd8fe,473a096b6fba..
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@@ -715,8 -716,7 +716,7 @@@ static __poll_t tipc_poll(struct file *
  {
struct sock *sk = sock->sk;
struct tipc_sock *tsk = tipc_sk(sk);
-   struct tipc_group *grp = tsk->group;
 -  u32 revents = 0;
 +  __poll_t revents = 0;
  
sock_poll_wait(file, sk_sleep(sk), wait);
  


Re: [PATCH RFC 09/16] prcu: Implement prcu_barrier() API

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:34PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> This is PRCU's counterpart of RCU's rcu_barrier() API.
> 
> Reviewed-by: Heng Zhang 
> Signed-off-by: Lihao Liang 
> ---
>  include/linux/prcu.h |  7 ++
>  kernel/rcu/prcu.c| 63 
> 
>  2 files changed, 70 insertions(+)
> 
> diff --git a/include/linux/prcu.h b/include/linux/prcu.h
> index 4e7d5d65..cce967fd 100644
> --- a/include/linux/prcu.h
> +++ b/include/linux/prcu.h
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #define CONFIG_PRCU
> 
> @@ -32,6 +33,7 @@ struct prcu_local_struct {
>   unsigned int online;
>   unsigned long long version;
>   unsigned long long cb_version;
> + struct rcu_head barrier_head;
>   struct prcu_cblist cblist;
>  };
> 
> @@ -39,8 +41,11 @@ struct prcu_struct {
>   atomic64_t global_version;
>   atomic64_t cb_version;
>   atomic_t active_ctr;
> + atomic_t barrier_cpu_count;
>   struct mutex mtx;
> + struct mutex barrier_mtx;
>   wait_queue_head_t wait_q;
> + struct completion barrier_completion;
>  };
> 
>  #ifdef CONFIG_PRCU
> @@ -48,6 +53,7 @@ void prcu_read_lock(void);
>  void prcu_read_unlock(void);
>  void synchronize_prcu(void);
>  void call_prcu(struct rcu_head *head, rcu_callback_t func);
> +void prcu_barrier(void);
>  void prcu_init(void);
>  void prcu_note_context_switch(void);
>  int prcu_pending(void);
> @@ -60,6 +66,7 @@ void prcu_check_callbacks(void);
>  #define prcu_read_unlock() do {} while (0)
>  #define synchronize_prcu() do {} while (0)
>  #define call_prcu() do {} while (0)
> +#define prcu_barrier() do {} while (0)
>  #define prcu_init() do {} while (0)
>  #define prcu_note_context_switch() do {} while (0)
>  #define prcu_pending() 0
> diff --git a/kernel/rcu/prcu.c b/kernel/rcu/prcu.c
> index 373039c5..2664d091 100644
> --- a/kernel/rcu/prcu.c
> +++ b/kernel/rcu/prcu.c
> @@ -15,6 +15,7 @@ struct prcu_struct global_prcu = {
>   .cb_version = ATOMIC64_INIT(0),
>   .active_ctr = ATOMIC_INIT(0),
>   .mtx = __MUTEX_INITIALIZER(global_prcu.mtx),
> + .barrier_mtx = __MUTEX_INITIALIZER(global_prcu.barrier_mtx),
>   .wait_q = __WAIT_QUEUE_HEAD_INITIALIZER(global_prcu.wait_q)
>  };
>  struct prcu_struct *prcu = _prcu;
> @@ -250,6 +251,68 @@ static __latent_entropy void 
> prcu_process_callbacks(struct softirq_action *unuse
>   local_irq_restore(flags);
>  }
> 
> +/*
> + * PRCU callback function for prcu_barrier().
> + * If we are last, wake up the task executing prcu_barrier().
> + */
> +static void prcu_barrier_callback(struct rcu_head *rhp)
> +{
> + if (atomic_dec_and_test(>barrier_cpu_count))
> + complete(>barrier_completion);
> +}
> +
> +/*
> + * Called with preemption disabled, and from cross-cpu IRQ context.
> + */
> +static void prcu_barrier_func(void *info)
> +{
> + struct prcu_local_struct *local = this_cpu_ptr(_local);
> +
> + atomic_inc(>barrier_cpu_count);
> + call_prcu(>barrier_head, prcu_barrier_callback);
> +}
> +
> +/* Waiting for all PRCU callbacks to complete. */
> +void prcu_barrier(void)
> +{
> + int cpu;
> +
> + /* Take mutex to serialize concurrent prcu_barrier() requests. */
> + mutex_lock(>barrier_mtx);
> +
> + /*
> +  * Initialize the count to one rather than to zero in order to
> +  * avoid a too-soon return to zero in case of a short grace period
> +  * (or preemption of this task).
> +  */
> + init_completion(>barrier_completion);
> + atomic_set(>barrier_cpu_count, 1);
> +
> + /*
> +  * Register a new callback on each CPU using IPI to prevent races
> +  * with call_prcu(). When that callback is invoked, we will know
> +  * that all of the corresponding CPU's preceding callbacks have
> +  * been invoked.
> +  */
> + for_each_possible_cpu(cpu)
> + smp_call_function_single(cpu, prcu_barrier_func, NULL, 1);

This code seems to be assuming CONFIG_HOTPLUG_CPU=n.  This might explain
your rcutorture failure.

> + /* Decrement the count as we initialize it to one. */
> + if (atomic_dec_and_test(>barrier_cpu_count))
> + complete(>barrier_completion);
> +
> + /*
> +  * Now that we have an prcu_barrier_callback() callback on each
> +  * CPU, and thus each counted, remove the initial count.
> +  * Wait for all prcu_barrier_callback() callbacks to be invoked.
> +  */
> + wait_for_completion(>barrier_completion);
> +
> + /* Other rcu_barrier() invocations can now safely proceed. */
> + mutex_unlock(>barrier_mtx);
> +}
> +EXPORT_SYMBOL(prcu_barrier);
> +
>  void prcu_init_local_struct(int cpu)
>  {
>   struct prcu_local_struct *local;
> -- 
> 2.14.1.729.g59c0ea183
> 



Re: [PATCH RFC 09/16] prcu: Implement prcu_barrier() API

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:34PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> This is PRCU's counterpart of RCU's rcu_barrier() API.
> 
> Reviewed-by: Heng Zhang 
> Signed-off-by: Lihao Liang 
> ---
>  include/linux/prcu.h |  7 ++
>  kernel/rcu/prcu.c| 63 
> 
>  2 files changed, 70 insertions(+)
> 
> diff --git a/include/linux/prcu.h b/include/linux/prcu.h
> index 4e7d5d65..cce967fd 100644
> --- a/include/linux/prcu.h
> +++ b/include/linux/prcu.h
> @@ -5,6 +5,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
> 
>  #define CONFIG_PRCU
> 
> @@ -32,6 +33,7 @@ struct prcu_local_struct {
>   unsigned int online;
>   unsigned long long version;
>   unsigned long long cb_version;
> + struct rcu_head barrier_head;
>   struct prcu_cblist cblist;
>  };
> 
> @@ -39,8 +41,11 @@ struct prcu_struct {
>   atomic64_t global_version;
>   atomic64_t cb_version;
>   atomic_t active_ctr;
> + atomic_t barrier_cpu_count;
>   struct mutex mtx;
> + struct mutex barrier_mtx;
>   wait_queue_head_t wait_q;
> + struct completion barrier_completion;
>  };
> 
>  #ifdef CONFIG_PRCU
> @@ -48,6 +53,7 @@ void prcu_read_lock(void);
>  void prcu_read_unlock(void);
>  void synchronize_prcu(void);
>  void call_prcu(struct rcu_head *head, rcu_callback_t func);
> +void prcu_barrier(void);
>  void prcu_init(void);
>  void prcu_note_context_switch(void);
>  int prcu_pending(void);
> @@ -60,6 +66,7 @@ void prcu_check_callbacks(void);
>  #define prcu_read_unlock() do {} while (0)
>  #define synchronize_prcu() do {} while (0)
>  #define call_prcu() do {} while (0)
> +#define prcu_barrier() do {} while (0)
>  #define prcu_init() do {} while (0)
>  #define prcu_note_context_switch() do {} while (0)
>  #define prcu_pending() 0
> diff --git a/kernel/rcu/prcu.c b/kernel/rcu/prcu.c
> index 373039c5..2664d091 100644
> --- a/kernel/rcu/prcu.c
> +++ b/kernel/rcu/prcu.c
> @@ -15,6 +15,7 @@ struct prcu_struct global_prcu = {
>   .cb_version = ATOMIC64_INIT(0),
>   .active_ctr = ATOMIC_INIT(0),
>   .mtx = __MUTEX_INITIALIZER(global_prcu.mtx),
> + .barrier_mtx = __MUTEX_INITIALIZER(global_prcu.barrier_mtx),
>   .wait_q = __WAIT_QUEUE_HEAD_INITIALIZER(global_prcu.wait_q)
>  };
>  struct prcu_struct *prcu = _prcu;
> @@ -250,6 +251,68 @@ static __latent_entropy void 
> prcu_process_callbacks(struct softirq_action *unuse
>   local_irq_restore(flags);
>  }
> 
> +/*
> + * PRCU callback function for prcu_barrier().
> + * If we are last, wake up the task executing prcu_barrier().
> + */
> +static void prcu_barrier_callback(struct rcu_head *rhp)
> +{
> + if (atomic_dec_and_test(>barrier_cpu_count))
> + complete(>barrier_completion);
> +}
> +
> +/*
> + * Called with preemption disabled, and from cross-cpu IRQ context.
> + */
> +static void prcu_barrier_func(void *info)
> +{
> + struct prcu_local_struct *local = this_cpu_ptr(_local);
> +
> + atomic_inc(>barrier_cpu_count);
> + call_prcu(>barrier_head, prcu_barrier_callback);
> +}
> +
> +/* Waiting for all PRCU callbacks to complete. */
> +void prcu_barrier(void)
> +{
> + int cpu;
> +
> + /* Take mutex to serialize concurrent prcu_barrier() requests. */
> + mutex_lock(>barrier_mtx);
> +
> + /*
> +  * Initialize the count to one rather than to zero in order to
> +  * avoid a too-soon return to zero in case of a short grace period
> +  * (or preemption of this task).
> +  */
> + init_completion(>barrier_completion);
> + atomic_set(>barrier_cpu_count, 1);
> +
> + /*
> +  * Register a new callback on each CPU using IPI to prevent races
> +  * with call_prcu(). When that callback is invoked, we will know
> +  * that all of the corresponding CPU's preceding callbacks have
> +  * been invoked.
> +  */
> + for_each_possible_cpu(cpu)
> + smp_call_function_single(cpu, prcu_barrier_func, NULL, 1);

This code seems to be assuming CONFIG_HOTPLUG_CPU=n.  This might explain
your rcutorture failure.

> + /* Decrement the count as we initialize it to one. */
> + if (atomic_dec_and_test(>barrier_cpu_count))
> + complete(>barrier_completion);
> +
> + /*
> +  * Now that we have an prcu_barrier_callback() callback on each
> +  * CPU, and thus each counted, remove the initial count.
> +  * Wait for all prcu_barrier_callback() callbacks to be invoked.
> +  */
> + wait_for_completion(>barrier_completion);
> +
> + /* Other rcu_barrier() invocations can now safely proceed. */
> + mutex_unlock(>barrier_mtx);
> +}
> +EXPORT_SYMBOL(prcu_barrier);
> +
>  void prcu_init_local_struct(int cpu)
>  {
>   struct prcu_local_struct *local;
> -- 
> 2.14.1.729.g59c0ea183
> 



Re: [PATCH RFC 07/16] prcu: Implement call_prcu() API

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:32PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> This is PRCU's counterpart of RCU's call_rcu() API.
> 
> Reviewed-by: Heng Zhang 
> Signed-off-by: Lihao Liang 
> ---
>  include/linux/prcu.h | 25 
>  init/main.c  |  2 ++
>  kernel/rcu/prcu.c| 67 
> +---
>  3 files changed, 91 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/prcu.h b/include/linux/prcu.h
> index 653b4633..e5e09c9b 100644
> --- a/include/linux/prcu.h
> +++ b/include/linux/prcu.h
> @@ -2,15 +2,36 @@
>  #define __LINUX_PRCU_H
> 
>  #include 
> +#include 
>  #include 
>  #include 
> 
>  #define CONFIG_PRCU
> 
> +struct prcu_version_head {
> + unsigned long long version;
> + struct prcu_version_head *next;
> +};
> +
> +/* Simple unsegmented callback list for PRCU. */
> +struct prcu_cblist {
> + struct rcu_head *head;
> + struct rcu_head **tail;
> + struct prcu_version_head *version_head;
> + struct prcu_version_head **version_tail;
> + long len;
> +};
> +
> +#define PRCU_CBLIST_INITIALIZER(n) { \
> + .head = NULL, .tail = , \
> + .version_head = NULL, .version_tail = _head, \
> +}
> +
>  struct prcu_local_struct {
>   unsigned int locked;
>   unsigned int online;
>   unsigned long long version;
> + struct prcu_cblist cblist;
>  };
> 
>  struct prcu_struct {
> @@ -24,6 +45,8 @@ struct prcu_struct {
>  void prcu_read_lock(void);
>  void prcu_read_unlock(void);
>  void synchronize_prcu(void);
> +void call_prcu(struct rcu_head *head, rcu_callback_t func);
> +void prcu_init(void);
>  void prcu_note_context_switch(void);
> 
>  #else /* #ifdef CONFIG_PRCU */
> @@ -31,6 +54,8 @@ void prcu_note_context_switch(void);
>  #define prcu_read_lock() do {} while (0)
>  #define prcu_read_unlock() do {} while (0)
>  #define synchronize_prcu() do {} while (0)
> +#define call_prcu() do {} while (0)
> +#define prcu_init() do {} while (0)
>  #define prcu_note_context_switch() do {} while (0)
> 
>  #endif /* #ifdef CONFIG_PRCU */
> diff --git a/init/main.c b/init/main.c
> index f8665104..4925964e 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -38,6 +38,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -574,6 +575,7 @@ asmlinkage __visible void __init start_kernel(void)
>   workqueue_init_early();
> 
>   rcu_init();
> + prcu_init();
> 
>   /* Trace events are available after this */
>   trace_init();
> diff --git a/kernel/rcu/prcu.c b/kernel/rcu/prcu.c
> index a00b9420..f198285c 100644
> --- a/kernel/rcu/prcu.c
> +++ b/kernel/rcu/prcu.c
> @@ -1,11 +1,12 @@
>  #include 
> -#include 
>  #include 
> -#include 
> +#include 
>  #include 
> -
> +#include 
>  #include 
> 
> +#include "rcu.h"
> +
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct prcu_local_struct, prcu_local);
> 
>  struct prcu_struct global_prcu = {
> @@ -16,6 +17,16 @@ struct prcu_struct global_prcu = {
>  };
>  struct prcu_struct *prcu = _prcu;
> 
> +/* Initialize simple callback list. */
> +static void prcu_cblist_init(struct prcu_cblist *rclp)
> +{
> + rclp->head = NULL;
> + rclp->tail = >head;
> + rclp->version_head = NULL;
> + rclp->version_tail = >version_head;
> + rclp->len = 0;
> +}
> +
>  static inline void prcu_report(struct prcu_local_struct *local)
>  {
>   unsigned long long global_version;
> @@ -123,3 +134,53 @@ void prcu_note_context_switch(void)
>   prcu_report(local);
>   put_cpu_ptr(_local);
>  }
> +
> +void call_prcu(struct rcu_head *head, rcu_callback_t func)
> +{
> + unsigned long flags;
> + struct prcu_local_struct *local;
> + struct prcu_cblist *rclp;
> + struct prcu_version_head *vhp;
> +
> + debug_rcu_head_queue(head);
> +
> + /* Use GFP_ATOMIC with IRQs disabled */
> + vhp = kmalloc(sizeof(struct prcu_version_head), GFP_ATOMIC);
> + if (!vhp)
> + return;

Silently failing to post the callback can cause system hangs.  I suggest
finding some way to avoid allocating on the call_prcu() code path.

Thanx, Paul

> +
> + head->func = func;
> + head->next = NULL;
> + vhp->next = NULL;
> +
> + local_irq_save(flags);
> + local = this_cpu_ptr(_local);
> + vhp->version = local->version;
> + rclp = >cblist;
> + rclp->len++;
> + *rclp->tail = head;
> + rclp->tail = >next;
> + *rclp->version_tail = vhp;
> + rclp->version_tail = >next;
> + local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL(call_prcu);
> +
> +void prcu_init_local_struct(int cpu)
> +{
> + struct prcu_local_struct *local;
> +
> + local = per_cpu_ptr(_local, cpu);
> + local->locked = 0;
> + local->online = 0;
> + local->version = 0;
> + prcu_cblist_init(>cblist);
> +}
> +
> +void __init prcu_init(void)
> +{
> + 

Re: [PATCH RFC 06/16] rcuperf: Set gp_exp to true for tests to run

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:31PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> Signed-off-by: Lihao Liang 
> ---
>  kernel/rcu/rcuperf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
> index ea80fa3e..baccc123 100644
> --- a/kernel/rcu/rcuperf.c
> +++ b/kernel/rcu/rcuperf.c
> @@ -60,7 +60,7 @@ MODULE_AUTHOR("Paul E. McKenney 
> ");
>  #define VERBOSE_PERFOUT_ERRSTRING(s) \
>   do { if (verbose) pr_alert("%s" PERF_FLAG "!!! %s\n", perf_type, s); } 
> while (0)
> 
> -torture_param(bool, gp_exp, false, "Use expedited GP wait primitives");
> +torture_param(bool, gp_exp, true, "Use expedited GP wait primitives");

This is fine as a convenience for internal testing, but the usual way
to make this happen is using the rcuperf.gp_exp kernel boot parameter.
Or was that not working for you?

Thanx, Paul

>  torture_param(int, holdoff, 10, "Holdoff time before test start (s)");
>  torture_param(int, nreaders, -1, "Number of RCU reader threads");
>  torture_param(int, nwriters, -1, "Number of RCU updater threads");
> -- 
> 2.14.1.729.g59c0ea183
> 



Re: [PATCH RFC 07/16] prcu: Implement call_prcu() API

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:32PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> This is PRCU's counterpart of RCU's call_rcu() API.
> 
> Reviewed-by: Heng Zhang 
> Signed-off-by: Lihao Liang 
> ---
>  include/linux/prcu.h | 25 
>  init/main.c  |  2 ++
>  kernel/rcu/prcu.c| 67 
> +---
>  3 files changed, 91 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/prcu.h b/include/linux/prcu.h
> index 653b4633..e5e09c9b 100644
> --- a/include/linux/prcu.h
> +++ b/include/linux/prcu.h
> @@ -2,15 +2,36 @@
>  #define __LINUX_PRCU_H
> 
>  #include 
> +#include 
>  #include 
>  #include 
> 
>  #define CONFIG_PRCU
> 
> +struct prcu_version_head {
> + unsigned long long version;
> + struct prcu_version_head *next;
> +};
> +
> +/* Simple unsegmented callback list for PRCU. */
> +struct prcu_cblist {
> + struct rcu_head *head;
> + struct rcu_head **tail;
> + struct prcu_version_head *version_head;
> + struct prcu_version_head **version_tail;
> + long len;
> +};
> +
> +#define PRCU_CBLIST_INITIALIZER(n) { \
> + .head = NULL, .tail = &n.head, \
> + .version_head = NULL, .version_tail = &n.version_head, \
> +}
> +
>  struct prcu_local_struct {
>   unsigned int locked;
>   unsigned int online;
>   unsigned long long version;
> + struct prcu_cblist cblist;
>  };
> 
>  struct prcu_struct {
> @@ -24,6 +45,8 @@ struct prcu_struct {
>  void prcu_read_lock(void);
>  void prcu_read_unlock(void);
>  void synchronize_prcu(void);
> +void call_prcu(struct rcu_head *head, rcu_callback_t func);
> +void prcu_init(void);
>  void prcu_note_context_switch(void);
> 
>  #else /* #ifdef CONFIG_PRCU */
> @@ -31,6 +54,8 @@ void prcu_note_context_switch(void);
>  #define prcu_read_lock() do {} while (0)
>  #define prcu_read_unlock() do {} while (0)
>  #define synchronize_prcu() do {} while (0)
> +#define call_prcu() do {} while (0)
> +#define prcu_init() do {} while (0)
>  #define prcu_note_context_switch() do {} while (0)
> 
>  #endif /* #ifdef CONFIG_PRCU */
> diff --git a/init/main.c b/init/main.c
> index f8665104..4925964e 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -38,6 +38,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -574,6 +575,7 @@ asmlinkage __visible void __init start_kernel(void)
>   workqueue_init_early();
> 
>   rcu_init();
> + prcu_init();
> 
>   /* Trace events are available after this */
>   trace_init();
> diff --git a/kernel/rcu/prcu.c b/kernel/rcu/prcu.c
> index a00b9420..f198285c 100644
> --- a/kernel/rcu/prcu.c
> +++ b/kernel/rcu/prcu.c
> @@ -1,11 +1,12 @@
>  #include 
> -#include 
>  #include 
> -#include 
> +#include 
>  #include 
> -
> +#include 
>  #include 
> 
> +#include "rcu.h"
> +
>  DEFINE_PER_CPU_SHARED_ALIGNED(struct prcu_local_struct, prcu_local);
> 
>  struct prcu_struct global_prcu = {
> @@ -16,6 +17,16 @@ struct prcu_struct global_prcu = {
>  };
>  struct prcu_struct *prcu = &global_prcu;
> 
> +/* Initialize simple callback list. */
> +static void prcu_cblist_init(struct prcu_cblist *rclp)
> +{
> + rclp->head = NULL;
> + rclp->tail = &rclp->head;
> + rclp->version_head = NULL;
> + rclp->version_tail = &rclp->version_head;
> + rclp->len = 0;
> +}
> +
>  static inline void prcu_report(struct prcu_local_struct *local)
>  {
>   unsigned long long global_version;
> @@ -123,3 +134,53 @@ void prcu_note_context_switch(void)
>   prcu_report(local);
> + put_cpu_ptr(&prcu_local);
>  }
> +
> +void call_prcu(struct rcu_head *head, rcu_callback_t func)
> +{
> + unsigned long flags;
> + struct prcu_local_struct *local;
> + struct prcu_cblist *rclp;
> + struct prcu_version_head *vhp;
> +
> + debug_rcu_head_queue(head);
> +
> + /* Use GFP_ATOMIC with IRQs disabled */
> + vhp = kmalloc(sizeof(struct prcu_version_head), GFP_ATOMIC);
> + if (!vhp)
> + return;

Silently failing to post the callback can cause system hangs.  I suggest
finding some way to avoid allocating on the call_prcu() code path.

Thanx, Paul

> +
> + head->func = func;
> + head->next = NULL;
> + vhp->next = NULL;
> +
> + local_irq_save(flags);
> + local = this_cpu_ptr(&prcu_local);
> + vhp->version = local->version;
> + rclp = &local->cblist;
> + rclp->len++;
> + *rclp->tail = head;
> + rclp->tail = &head->next;
> + *rclp->version_tail = vhp;
> + rclp->version_tail = &vhp->next;
> + local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL(call_prcu);
> +
> +void prcu_init_local_struct(int cpu)
> +{
> + struct prcu_local_struct *local;
> +
> + local = per_cpu_ptr(&prcu_local, cpu);
> + local->locked = 0;
> + local->online = 0;
> + local->version = 0;
> + prcu_cblist_init(&local->cblist);
> +}
> +
> +void __init prcu_init(void)
> +{
> + int cpu;
> +
> + for_each_possible_cpu(cpu)
> + 

Re: [PATCH RFC 06/16] rcuperf: Set gp_exp to true for tests to run

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:31PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> Signed-off-by: Lihao Liang 
> ---
>  kernel/rcu/rcuperf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
> index ea80fa3e..baccc123 100644
> --- a/kernel/rcu/rcuperf.c
> +++ b/kernel/rcu/rcuperf.c
> @@ -60,7 +60,7 @@ MODULE_AUTHOR("Paul E. McKenney 
> ");
>  #define VERBOSE_PERFOUT_ERRSTRING(s) \
>   do { if (verbose) pr_alert("%s" PERF_FLAG "!!! %s\n", perf_type, s); } 
> while (0)
> 
> -torture_param(bool, gp_exp, false, "Use expedited GP wait primitives");
> +torture_param(bool, gp_exp, true, "Use expedited GP wait primitives");

This is fine as a convenience for internal testing, but the usual way
to make this happen is using the rcuperf.gp_exp kernel boot parameter.
Or was that not working for you?

Thanx, Paul

>  torture_param(int, holdoff, 10, "Holdoff time before test start (s)");
>  torture_param(int, nreaders, -1, "Number of RCU reader threads");
>  torture_param(int, nwriters, -1, "Number of RCU updater threads");
> -- 
> 2.14.1.729.g59c0ea183
> 



Re: [PATCH RFC 15/16] rcutorture: Add scripts to run experiments

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:40PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> Signed-off-by: Lihao Liang 
> ---
>  kvm.sh | 452 
> +
>  run-rcuperf.sh |  26 

The usual approach would be to add what you need to the existing kvm.sh...

Thanx, Paul

>  2 files changed, 478 insertions(+)
>  create mode 100755 kvm.sh
>  create mode 100755 run-rcuperf.sh
> 
> diff --git a/kvm.sh b/kvm.sh
> new file mode 100755
> index ..3b3c1b69
> --- /dev/null
> +++ b/kvm.sh
> @@ -0,0 +1,452 @@
> +#!/bin/bash
> +#
> +# Run a series of 14 tests under KVM.  These are not particularly
> +# well-selected or well-tuned, but are the current set.  Run from the
> +# top level of the source tree.
> +#
> +# Edit the definitions below to set the locations of the various directories,
> +# as well as the test duration.
> +#
> +# Usage: kvm.sh [ options ]
> +#
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, you can access it online at
> +# http://www.gnu.org/licenses/gpl-2.0.html.
> +#
> +# Copyright (C) IBM Corporation, 2011
> +#
> +# Authors: Paul E. McKenney 
> +
> +scriptname=$0
> +args="$*"
> +
> +T=/tmp/kvm.sh.$$
> +trap 'rm -rf $T' 0
> +mkdir $T
> +
> +dur=$((30*60))
> +dryrun=""
> +KVM="`pwd`/tools/testing/selftests/rcutorture"; export KVM
> +PATH=${KVM}/bin:$PATH; export PATH
> +TORTURE_DEFCONFIG=defconfig
> +TORTURE_BOOT_IMAGE=""
> +TORTURE_INITRD="$KVM/initrd"; export TORTURE_INITRD
> +TORTURE_KMAKE_ARG=""
> +TORTURE_SHUTDOWN_GRACE=180
> +TORTURE_SUITE=rcu
> +resdir=""
> +configs=""
> +cpus=0
> +ds=`date +%Y.%m.%d-%H:%M:%S`
> +jitter="-1"
> +
> +. functions.sh
> +
> +usage () {
> + echo "Usage: $scriptname optional arguments:"
> + echo "   --bootargs kernel-boot-arguments"
> + echo "   --bootimage relative-path-to-kernel-boot-image"
> + echo "   --buildonly"
> + echo "   --configs \"config-file list w/ repeat factor (3*TINY01)\""
> + echo "   --cpus N"
> + echo "   --datestamp string"
> + echo "   --defconfig string"
> + echo "   --dryrun sched|script"
> + echo "   --duration minutes"
> + echo "   --interactive"
> + echo "   --jitter N [ maxsleep (us) [ maxspin (us) ] ]"
> + echo "   --kmake-arg kernel-make-arguments"
> + echo "   --mac nn:nn:nn:nn:nn:nn"
> + echo "   --no-initrd"
> + echo "   --qemu-args qemu-system-..."
> + echo "   --qemu-cmd qemu-system-..."
> + echo "   --results absolute-pathname"
> + echo "   --torture rcu"
> + exit 1
> +}
> +
> +while test $# -gt 0
> +do
> + case "$1" in
> + --bootargs|--bootarg)
> + checkarg --bootargs "(list of kernel boot arguments)" "$#" "$2" 
> '.*' '^--'
> + TORTURE_BOOTARGS="$2"
> + shift
> + ;;
> + --bootimage)
> + checkarg --bootimage "(relative path to kernel boot image)" 
> "$#" "$2" '[a-zA-Z0-9][a-zA-Z0-9_]*' '^--'
> + TORTURE_BOOT_IMAGE="$2"
> + shift
> + ;;
> + --buildonly)
> + TORTURE_BUILDONLY=1
> + ;;
> + --configs|--config)
> + checkarg --configs "(list of config files)" "$#" "$2" '^[^/]*$' 
> '^--'
> + configs="$2"
> + shift
> + ;;
> + --cpus)
> + checkarg --cpus "(number)" "$#" "$2" '^[0-9]*$' '^--'
> + cpus=$2
> + shift
> + ;;
> + --datestamp)
> + checkarg --datestamp "(relative pathname)" "$#" "$2" '^[^/]*$' 
> '^--'
> + ds=$2
> + shift
> + ;;
> + --defconfig)
> + checkarg --defconfig "defconfigtype" "$#" "$2" '^[^/][^/]*$' 
> '^--'
> + TORTURE_DEFCONFIG=$2
> + shift
> + ;;
> + --dryrun)
> + checkarg --dryrun "sched|script" $# "$2" 'sched\|script' '^--'
> + dryrun=$2
> + shift
> + ;;
> + --duration)
> + checkarg --duration "(minutes)" $# "$2" '^[0-9]*$' '^error'
> + dur=$(($2*60))
> + shift
> + ;;
> + --interactive)
> + TORTURE_QEMU_INTERACTIVE=1; export 

Re: [PATCH RFC 00/16] A new RCU implementation based on a fast consensus protocol

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:25PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> Dear Paul,
> 
> This patch set implements a preemptive version of RCU (PRCU) based on the 
> following paper:
> 
> Fast Consensus Using Bounded Staleness for Scalable Read-mostly 
> Synchronization.
> Haibo Chen, Heng Zhang, Ran Liu, Binyu Zang, and Haibing Guan.
> IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016.
> https://dl.acm.org/citation.cfm?id=3024114.3024143
> 
> We have also added preliminary callback-handling support.  Thus, the current 
> version
> provides APIs prcu_read_lock(), prcu_read_unlock(), synchronize_prcu(), 
> call_prcu(),
> and prcu_barrier().
> 
> This is an experimental patch, so it would be good to have some feedback.
> 
> A known shortcoming is that the grace-period version is only incremented
> in synchronize_prcu().  If call_prcu() or prcu_barrier() is called but
> synchronize_prcu() is never invoked, callbacks cannot be invoked.  Later
> versions should address this issue, e.g. by adding a grace-period
> expedition mechanism.  Other planned improvements include using a
> hierarchical structure that takes the NUMA topology into account to
> send IPIs in synchronize_prcu().
> 
> We have tested the implementation using rcutorture on both an x86 and an
> ARM64 machine.  PRCU passed 1h and 3h tests on all the newly added config
> files, except that PRCU07 reported a BUG in a 1h run.
> 
> [ 1593.604201] ---[ end trace b3bae911bec86152 ]---
> [ 1594.629450] prcu-torture:torture_onoff task: offlining 14
> [ 1594.73] smpboot: CPU 14 is now offline
> [ 1594.757732] prcu-torture:torture_onoff task: offlined 14
> [ 1597.765149] prcu-torture:torture_onoff task: onlining 11
> [ 1597.766795] smpboot: Booting Node 0 Processor 11 APIC 0xb
> [ 1597.804102] prcu-torture:torture_onoff task: onlined 11
> [ 1599.365098] prcu-torture: rtc: b0277b90 ver: 66358 tfle: 0 rta: 
> 66358 rtaf: 0 
> rtf: 66349 rtmbe: 0 rtbe: 1 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 nt: 2233418 
> onoff: 191/191:199/199 34,199:59,5102 10403:0 (HZ=1000) barrier: 188/189:1 
> cbflood: 225
> [ 1599.367946] prcu-torture: !!!
> [ 1599.367966] [ cut here ]

The "rtbe: 1" indicates that your implementation of prcu_barrier()
failed to wait for all preceding call_prcu() callbacks to be invoked.

Does the immediately following "Reader Pipe:" list have any but the
first two numbers non-zero?

> We have also compared PRCU with TREE RCU using rcuperf with gp_exp set to
> true, that is, synchronize_rcu_expedited() was tested.
> 
> The rcuperf results are as follows (average grace-period duration in ms of 
> ten 10min runs):
> 
> 16*Intel Xeon CPU@2.4GHz, 16GB memory, Ubuntu Linux 3.13.0-47-generic
> 
> CPUs      2       4       8      12      15       16
> PRCU   0.14    1.07    4.15    8.02   10.79    15.16
> TREE  49.30  104.75  277.55  390.82  620.82  1381.54
> 
> 64*Cortex-A72 CPU@2.4GHz, 130GB memory, Ubuntu Linux 4.10.0-21.23-generic
> 
> CPUs        2        4        8       16       32        48        63        64
> PRCU     0.23    19.69    38.28    63.21    95.41    167.18    252.01   1841.44
> TREE   416.73   901.89  1060.86   743.00   920.66   1325.21   1646.20  23806.27

Well, at the very least, this is a bug report on either expedited RCU
grace-period latency or on rcuperf's measurements, and thank you for that.
I will look into this.  In the meantime, could you please let me know
exactly how you invoked rcuperf?

I have a few comments on some of your patches based on a quick scan
through them.

Thanx, Paul

> Best wishes,
> Lihao.
> 
> 
> Lihao Liang (15):
>   rcutorture: Add PRCU rcu_torture_ops
>   rcutorture: Add PRCU test config files
>   rcuperf: Add PRCU rcu_perf_ops
>   rcuperf: Add PRCU test config files
>   rcuperf: Set gp_exp to true for tests to run
>   prcu: Implement call_prcu() API
>   prcu: Implement PRCU callback processing
>   prcu: Implement prcu_barrier() API
>   rcutorture: Test call_prcu() and prcu_barrier()
>   rcutorture: Add basic ARM64 support to run scripts
>   prcu: Add PRCU Kconfig parameter
>   prcu: Comment source code
>   rcuperf: Add config files with various CONFIG_NR_CPUS
>   rcutorture: Add scripts to run experiments
>   Add GPLv2 license
> 
> Heng Zhang (1):
>   prcu: Add PRCU implementation
> 
>  include/linux/interrupt.h  |   3 +
>  include/linux/prcu.h   | 122 +
>  include/linux/rcupdate.h   |   1 +
>  init/Kconfig   |   7 +
>  init/main.c|   2 +
>  kernel/rcu/Makefile|   1 +
>  kernel/rcu/prcu.c  | 497 
> +
>  kernel/rcu/rcuperf.c   |  33 +-
>  kernel/rcu/rcutorture.c|  40 +-
>  kernel/rcu/tree.c  

Re: [PATCH RFC 03/16] rcutorture: Add PRCU test config files

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:28PM +0800, liangli...@huawei.com wrote:
> From: Lihao Liang 
> 
> Use the same config files as TREE02, TREE03, TREE06, TREE07, and TREE09.
> 
> Signed-off-by: Lihao Liang 
> ---
>  .../selftests/rcutorture/configs/rcu/CFLIST|  5 
>  .../selftests/rcutorture/configs/rcu/PRCU02| 27 
> ++
>  .../selftests/rcutorture/configs/rcu/PRCU02.boot   |  1 +
>  .../selftests/rcutorture/configs/rcu/PRCU03| 23 ++
>  .../selftests/rcutorture/configs/rcu/PRCU03.boot   |  2 ++
>  .../selftests/rcutorture/configs/rcu/PRCU06| 26 +
>  .../selftests/rcutorture/configs/rcu/PRCU06.boot   |  5 
>  .../selftests/rcutorture/configs/rcu/PRCU07| 25 
>  .../selftests/rcutorture/configs/rcu/PRCU07.boot   |  2 ++
>  .../selftests/rcutorture/configs/rcu/PRCU09| 19 +++
>  .../selftests/rcutorture/configs/rcu/PRCU09.boot   |  1 +
>  11 files changed, 136 insertions(+)
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU02
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU02.boot
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU03
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU03.boot
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU06
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU06.boot
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU07
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU07.boot
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU09
>  create mode 100644 tools/testing/selftests/rcutorture/configs/rcu/PRCU09.boot
> 
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST 
> b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
> index a3a1a05a..7359e194 100644
> --- a/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/CFLIST
> @@ -1,3 +1,8 @@
> +PRCU02
> +PRCU03
> +PRCU06
> +PRCU07
> +PRCU09
>  TREE01
>  TREE02
>  TREE03
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/PRCU02 
> b/tools/testing/selftests/rcutorture/configs/rcu/PRCU02
> new file mode 100644
> index ..5f532f05
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/PRCU02
> @@ -0,0 +1,27 @@
> +CONFIG_SMP=y
> +CONFIG_NR_CPUS=8
> +CONFIG_PREEMPT_NONE=n
> +CONFIG_PREEMPT_VOLUNTARY=n
> +CONFIG_PREEMPT=y
> +CONFIG_PRCU=y
> +#CHECK#CONFIG_PREEMPT_RCU=y
> +CONFIG_HZ_PERIODIC=n
> +CONFIG_NO_HZ_IDLE=y
> +CONFIG_NO_HZ_FULL=n
> +CONFIG_RCU_FAST_NO_HZ=n
> +CONFIG_RCU_TRACE=n
> +CONFIG_HOTPLUG_CPU=n
> +CONFIG_SUSPEND=n
> +CONFIG_HIBERNATION=n
> +CONFIG_RCU_FANOUT=3
> +CONFIG_RCU_FANOUT_LEAF=3
> +CONFIG_RCU_NOCB_CPU=n
> +CONFIG_DEBUG_LOCK_ALLOC=y
> +CONFIG_PROVE_LOCKING=n
> +CONFIG_RCU_BOOST=n
> +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> +CONFIG_RCU_EXPERT=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
> +CONFIG_DEBUG_OBJECTS_RCU_HEAD=y
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/PRCU02.boot 
> b/tools/testing/selftests/rcutorture/configs/rcu/PRCU02.boot
> new file mode 100644
> index ..6c5e626f
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/PRCU02.boot
> @@ -0,0 +1 @@
> +rcutorture.torture_type=prcu
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/PRCU03 
> b/tools/testing/selftests/rcutorture/configs/rcu/PRCU03
> new file mode 100644
> index ..869cadc8
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/PRCU03
> @@ -0,0 +1,23 @@
> +CONFIG_SMP=y
> +CONFIG_NR_CPUS=16
> +CONFIG_PREEMPT_NONE=n
> +CONFIG_PREEMPT_VOLUNTARY=n
> +CONFIG_PREEMPT=y
> +CONFIG_PRCU=y
> +#CHECK#CONFIG_PREEMPT_RCU=y
> +CONFIG_HZ_PERIODIC=y
> +CONFIG_NO_HZ_IDLE=n
> +CONFIG_NO_HZ_FULL=n
> +CONFIG_RCU_TRACE=y
> +CONFIG_HOTPLUG_CPU=y

And from what I can see, PRCU doesn't handle CPU hotplug.  I would not
be surprised to see rcutorture failures when running this scenario.

> +CONFIG_RCU_FANOUT=2
> +CONFIG_RCU_FANOUT_LEAF=2
> +CONFIG_RCU_NOCB_CPU=n
> +CONFIG_DEBUG_LOCK_ALLOC=n
> +CONFIG_RCU_BOOST=y
> +CONFIG_RCU_KTHREAD_PRIO=2
> +CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> +CONFIG_RCU_EXPERT=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
> +CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
> diff --git a/tools/testing/selftests/rcutorture/configs/rcu/PRCU03.boot 
> b/tools/testing/selftests/rcutorture/configs/rcu/PRCU03.boot
> new file mode 100644
> index ..0be10cba
> --- /dev/null
> +++ b/tools/testing/selftests/rcutorture/configs/rcu/PRCU03.boot
> @@ -0,0 +1,2 @@
> +rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
> +rcutorture.torture_type=prcu
> diff --git 

Re: [PATCH RFC 01/16] prcu: Add PRCU implementation

2018-01-24 Thread Paul E. McKenney
On Tue, Jan 23, 2018 at 03:59:26PM +0800, liangli...@huawei.com wrote:
> From: Heng Zhang 
> 
> This RCU implementation (PRCU) is based on a fast consensus protocol
> published in the following paper:
> 
> Fast Consensus Using Bounded Staleness for Scalable Read-mostly 
> Synchronization.
> Haibo Chen, Heng Zhang, Ran Liu, Binyu Zang, and Haibing Guan.
> IEEE Transactions on Parallel and Distributed Systems (TPDS), 2016.
> https://dl.acm.org/citation.cfm?id=3024114.3024143
> 
> Signed-off-by: Heng Zhang 
> Signed-off-by: Lihao Liang 

A few comments and questions interspersed.

Thanx, Paul

> ---
>  include/linux/prcu.h |  37 +++
>  kernel/rcu/Makefile  |   2 +-
>  kernel/rcu/prcu.c| 125 
> +++
>  kernel/sched/core.c  |   2 +
>  4 files changed, 165 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/prcu.h
>  create mode 100644 kernel/rcu/prcu.c
> 
> diff --git a/include/linux/prcu.h b/include/linux/prcu.h
> new file mode 100644
> index ..653b4633
> --- /dev/null
> +++ b/include/linux/prcu.h
> @@ -0,0 +1,37 @@
> +#ifndef __LINUX_PRCU_H
> +#define __LINUX_PRCU_H
> +
> +#include 
> +#include 
> +#include 
> +
> +#define CONFIG_PRCU
> +
> +struct prcu_local_struct {
> + unsigned int locked;
> + unsigned int online;
> + unsigned long long version;
> +};
> +
> +struct prcu_struct {
> + atomic64_t global_version;
> + atomic_t active_ctr;
> + struct mutex mtx;
> + wait_queue_head_t wait_q;
> +};
> +
> +#ifdef CONFIG_PRCU
> +void prcu_read_lock(void);
> +void prcu_read_unlock(void);
> +void synchronize_prcu(void);
> +void prcu_note_context_switch(void);
> +
> +#else /* #ifdef CONFIG_PRCU */
> +
> +#define prcu_read_lock() do {} while (0)
> +#define prcu_read_unlock() do {} while (0)
> +#define synchronize_prcu() do {} while (0)
> +#define prcu_note_context_switch() do {} while (0)

If CONFIG_PRCU=n and some code is built that uses PRCU, shouldn't you
get a build error rather than an error-free but inoperative PRCU?

Of course, Peter's question about purpose of the patch set applies
here as well.

> +
> +#endif /* #ifdef CONFIG_PRCU */
> +#endif /* __LINUX_PRCU_H */
> diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
> index 23803c7d..8791419c 100644
> --- a/kernel/rcu/Makefile
> +++ b/kernel/rcu/Makefile
> @@ -2,7 +2,7 @@
>  # and is generally not a function of system call inputs.
>  KCOV_INSTRUMENT := n
> 
> -obj-y += update.o sync.o
> +obj-y += update.o sync.o prcu.o
>  obj-$(CONFIG_CLASSIC_SRCU) += srcu.o
>  obj-$(CONFIG_TREE_SRCU) += srcutree.o
>  obj-$(CONFIG_TINY_SRCU) += srcutiny.o
> diff --git a/kernel/rcu/prcu.c b/kernel/rcu/prcu.c
> new file mode 100644
> index ..a00b9420
> --- /dev/null
> +++ b/kernel/rcu/prcu.c
> @@ -0,0 +1,125 @@
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +DEFINE_PER_CPU_SHARED_ALIGNED(struct prcu_local_struct, prcu_local);
> +
> +struct prcu_struct global_prcu = {
> + .global_version = ATOMIC64_INIT(0),
> + .active_ctr = ATOMIC_INIT(0),
> + .mtx = __MUTEX_INITIALIZER(global_prcu.mtx),
> + .wait_q = __WAIT_QUEUE_HEAD_INITIALIZER(global_prcu.wait_q)
> +};
> +struct prcu_struct *prcu = &global_prcu;
> +
> +static inline void prcu_report(struct prcu_local_struct *local)
> +{
> + unsigned long long global_version;
> + unsigned long long local_version;
> +
> + global_version = atomic64_read(&prcu->global_version);
> + local_version = local->version;
> + if (global_version > local_version)
> + cmpxchg(&local->version, local_version, global_version);
> +}
> +
> +void prcu_read_lock(void)
> +{
> + struct prcu_local_struct *local;
> +
> + local = get_cpu_ptr(&prcu_local);
> + if (!local->online) {
> + WRITE_ONCE(local->online, 1);
> + smp_mb();
> + }
> +
> + local->locked++;
> + put_cpu_ptr(&prcu_local);
> +}
> +EXPORT_SYMBOL(prcu_read_lock);
> +
> +void prcu_read_unlock(void)
> +{
> + int locked;
> + struct prcu_local_struct *local;
> +
> + barrier();
> + local = get_cpu_ptr(&prcu_local);
> + locked = local->locked;
> + if (locked) {
> + local->locked--;
> + if (locked == 1)
> + prcu_report(local);

Is ordering important here?  It looks to me that the compiler could
rearrange some of the accesses within prcu_report() with the local->locked
decrement.  There appears to be some potential for load and store tearing,
though perhaps you have verified that your compiler avoids this on
the architecture that you are using.

> + put_cpu_ptr(&prcu_local);
> + } else {

Hmmm...  We get here if the RCU read-side critical section was preempted.
If none of them are preempted, ->active_ctr remains zero.

> + put_cpu_ptr(&prcu_local);
> + if (!atomic_dec_return(&prcu->active_ctr))
> + wake_up(&prcu->wait_q);
> + }
> 

Re: [PATCH] net/mlx4_en: ensure rx_desc updating reaches HW before prod db updating

2018-01-24 Thread jianchao.wang
Hi Eric

Thanks for your kind response and suggestion.
It's really appreciated.

Jianchao

On 01/25/2018 11:55 AM, Eric Dumazet wrote:
> On Thu, 2018-01-25 at 11:27 +0800, jianchao.wang wrote:
>> Hi Tariq
>>
>> On 01/22/2018 10:12 AM, jianchao.wang wrote:
> On 19/01/2018 5:49 PM, Eric Dumazet wrote:
>> On Fri, 2018-01-19 at 23:16 +0800, jianchao.wang wrote:
>>> Hi Tariq
>>>
>>> Very sad that the crash was reproduced again after applied the patch.

 Memory barriers vary for different Archs, can you please share more 
 details regarding arch and repro steps?
>>> The hardware is HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 
>>> 12/27/2015
>>> Xen is installed. The crash occurred in Dom0.
>>> Regarding the repro steps, it is a customer's test that does heavy disk
>>> I/O over NFS storage, without any guest.
>>>
>>
>> What is the final suggestion on this?
>> If we use wmb() there, is performance pulled down?
> 
> Since 
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=dad42c3038a59d27fced28ee4ec1d4a891b28155
> 
> we batch allocations, so mlx4_en_refill_rx_buffers() is not called that often.
> 
> I doubt the additional wmb() will have serious impact there.
> 
> 


Re: [PATCH] powerpc: pseries: use irq_of_parse_and_map helper

2018-01-24 Thread Michael Ellerman
Rob Herring  writes:

> On Tue, Jan 23, 2018 at 12:53 AM, Michael Ellerman  
> wrote:
>> Rob Herring  writes:
>>
>>> Instead of calling both of_irq_parse_one and irq_create_of_mapping, call
>>> of_irq_parse_and_map instead which does the same thing. This gets us closer
>>> to making the former 2 functions static.
...
>> Are you trying to remove the low-level routines or is this just a
>> cleanup?
>
> The former, but I'm not sure that will happen. There's a handful of
> others left, but they aren't simply a call to of_irq_parse_one and
> then irq_create_of_mapping.
>
>> The patch below works, it loses the error handling if the interrupts
>> property is corrupt/empty, but that's probably overly paranoid anyway.
>
> Not quite. Previously, it was silent if parsing failed. Only the
> mapping would give an error which would mean the interrupt parent had
> some error.
>
> Actually, we could use of_irq_get here to preserve the error handling.
> It will return error codes from parsing, 0 on mapping failure, or the
> Linux irq number. It adds an irq_find_host call for deferred probe,
> but that should be harmless. I'll respin it.

OK thanks.

cheers



linux-next: build failure after merge of the pci tree

2018-01-24 Thread Stephen Rothwell
Hi Bjorn,

After merging the pci tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

arch/powerpc/kernel/pci-common.c: In function 'pcibios_setup_device':
arch/powerpc/kernel/pci-common.c:406:15: error: 'virq' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]
  pci_dev->irq = virq;
  ~^~
arch/powerpc/kernel/pci-common.c:365:15: note: 'virq' was declared here
  unsigned int virq;
   ^~~~

Caused by commit

  c5042ac60fe5 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")

I have applied the following patch for today:

From: Stephen Rothwell 
Date: Thu, 25 Jan 2018 16:44:19 +1100
Subject: [PATCH] powerpc/pci: fix for "Use of_irq_parse_and_map_pci() helper"

Fixes: c5042ac60fe5 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
Signed-off-by: Stephen Rothwell 
---
 arch/powerpc/kernel/pci-common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 6be3f2c22a9b..ae2ede4de6be 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -362,7 +362,7 @@ struct pci_controller* pci_find_hose_for_OF_device(struct 
device_node* node)
  */
 static int pci_read_irq_line(struct pci_dev *pci_dev)
 {
-   unsigned int virq;
+   unsigned int virq = 0;
 
pr_debug("PCI: Try to map irq for %s...\n", pci_name(pci_dev));
 
-- 
2.15.1

-- 
Cheers,
Stephen Rothwell



[PATCH 8/8] kprobes/s390: Fix %p uses in error messages

2018-01-24 Thread Masami Hiramatsu
Remove %p because the kprobe will be dumped in
dump_kprobe().

Signed-off-by: Masami Hiramatsu 
---
 arch/s390/kernel/kprobes.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index af3722c28fd9..df30e5b9a572 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -282,7 +282,7 @@ static void kprobe_reenter_check(struct kprobe_ctlblk *kcb, 
struct kprobe *p)
 * is a BUG. The code path resides in the .kprobes.text
 * section and is executed with interrupts disabled.
 */
-   printk(KERN_EMERG "Invalid kprobe detected at %p.\n", p->addr);
+   pr_err("Invalid kprobe detected.\n");
dump_kprobe(p);
BUG();
}




[PATCH 6/8] kprobes/arm64: Fix %p uses in error messages

2018-01-24 Thread Masami Hiramatsu
Fix %p uses in error messages by removing it because
those are redundant or meaningless.

Signed-off-by: Masami Hiramatsu 
---
 arch/arm64/kernel/probes/kprobes.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kernel/probes/kprobes.c 
b/arch/arm64/kernel/probes/kprobes.c
index d849d9804011..34f78d07a068 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -275,7 +275,7 @@ static int __kprobes reenter_kprobe(struct kprobe *p,
break;
case KPROBE_HIT_SS:
case KPROBE_REENTER:
-   pr_warn("Unrecoverable kprobe detected at %p.\n", p->addr);
+   pr_warn("Unrecoverable kprobe detected.\n");
dump_kprobe(p);
BUG();
break;
@@ -521,7 +521,7 @@ int __kprobes longjmp_break_handler(struct kprobe *p, 
struct pt_regs *regs)
(struct pt_regs *)kcb->jprobe_saved_regs.sp;
pr_err("current sp %lx does not match saved sp %lx\n",
   orig_sp, stack_addr);
-   pr_err("Saved registers for jprobe %p\n", jp);
+   pr_err("Saved registers for jprobe\n");
__show_regs(saved_regs);
pr_err("Current registers\n");
__show_regs(regs);




[PATCH 7/8] kprobes/MN10300: Fix %p uses in error messages

2018-01-24 Thread Masami Hiramatsu
Replace %p with %px because it is right before BUG().

Signed-off-by: Masami Hiramatsu 
---
 arch/mn10300/kernel/kprobes.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/mn10300/kernel/kprobes.c b/arch/mn10300/kernel/kprobes.c
index 0311a7fcea16..e539fac00321 100644
--- a/arch/mn10300/kernel/kprobes.c
+++ b/arch/mn10300/kernel/kprobes.c
@@ -632,9 +632,9 @@ int __kprobes longjmp_break_handler(struct kprobe *p, 
struct pt_regs *regs)
 
if (addr == (u8 *) jprobe_return_bp_addr) {
if (jprobe_saved_regs_location != regs) {
-   printk(KERN_ERR"JPROBE:"
-  " Current regs (%p) does not match saved regs"
-  " (%p).\n",
+   pr_err("JPROBE:"
+  " Current regs (%px) does not match saved regs"
+  " (%px).\n",
   regs, jprobe_saved_regs_location);
BUG();
}




Re: [PATCH] block: blk-mq-sched: Replace GFP_ATOMIC with GFP_KERNEL in blk_mq_sched_assign_ioc

2018-01-24 Thread Ming Lei
On Wed, Jan 24, 2018 at 08:34:14PM -0700, Jens Axboe wrote:
> On 1/24/18 7:46 PM, Jia-Ju Bai wrote:
> > The function ioc_create_icq here is not called in atomic context.
> > Thus GFP_ATOMIC is not necessary, and it can be replaced with GFP_KERNEL.
> > 
> > This is found by a static analysis tool named DCNS written by myself.
> 
> But it's running off the IO submission path, so by definition the GFP
> mask cannot include anything that will do IO. GFP_KERNEL will make
> it deadlock prone.
> 
> It could be GFP_NOIO, but that's also overlooking the fact that we can
> have preemption disabled here.

We have REQ_NOWAIT requests too, so GFP_NOIO isn't OK either.

-- 
Ming



[PATCH 5/8] kprobes/arm: Fix %p uses in error messages

2018-01-24 Thread Masami Hiramatsu
Fix %p uses in error messages by removing it and
using general dumper.

Signed-off-by: Masami Hiramatsu 
---
 arch/arm/probes/kprobes/core.c  |   10 +-
 arch/arm/probes/kprobes/test-core.c |1 -
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/arm/probes/kprobes/core.c b/arch/arm/probes/kprobes/core.c
index 52d1cd14fda4..8f37d505194f 100644
--- a/arch/arm/probes/kprobes/core.c
+++ b/arch/arm/probes/kprobes/core.c
@@ -291,8 +291,8 @@ void __kprobes kprobe_handler(struct pt_regs *regs)
break;
case KPROBE_REENTER:
/* A nested probe was hit in FIQ, it is a BUG */
-   pr_warn("Unrecoverable kprobe detected at 
%p.\n",
-   p->addr);
+   pr_warn("Unrecoverable kprobe detected.\n");
+   dump_kprobe(p);
/* fall through */
default:
/* impossible cases */
@@ -617,11 +617,11 @@ int __kprobes longjmp_break_handler(struct kprobe *p, 
struct pt_regs *regs)
if (orig_sp != stack_addr) {
struct pt_regs *saved_regs =
(struct pt_regs *)kcb->jprobe_saved_regs.ARM_sp;
-   printk("current sp %lx does not match saved sp %lx\n",
+   pr_err("current sp %lx does not match saved sp %lx\n",
   orig_sp, stack_addr);
-   printk("Saved registers for jprobe %p\n", jp);
+   pr_err("Saved registers for jprobe\n");
show_regs(saved_regs);
-   printk("Current registers\n");
+   pr_err("Current registers\n");
show_regs(regs);
BUG();
}
diff --git a/arch/arm/probes/kprobes/test-core.c 
b/arch/arm/probes/kprobes/test-core.c
index 9ed0129bed3c..b5c892e24244 100644
--- a/arch/arm/probes/kprobes/test-core.c
+++ b/arch/arm/probes/kprobes/test-core.c
@@ -1460,7 +1460,6 @@ static bool check_test_results(void)
print_registers(_regs);
 
if (mem) {
-   pr_err("current_stack=%p\n", current_stack);
pr_err("expected_memory:\n");
print_memory(expected_memory, mem_size);
pr_err("result_memory:\n");




[PATCH 4/8] kprobes/x86: Fix %p uses in error messages

2018-01-24 Thread Masami Hiramatsu
Fix %p uses in error messages in kprobes/x86.
- Some %p uses are not needed. Just remove it (or remove message).
- One %p use is right before the BUG() so replaced with %px.

Signed-off-by: Masami Hiramatsu 
---
 arch/x86/kernel/kprobes/core.c |   12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index bd36f3c33cd0..aea956aedad7 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -391,8 +391,6 @@ int __copy_instruction(u8 *dest, u8 *src, u8 *real, struct 
insn *insn)
  - (u8 *) real;
if ((s64) (s32) newdisp != newdisp) {
pr_err("Kprobes error: new displacement does not fit 
into s32 (%llx)\n", newdisp);
-   pr_err("\tSrc: %p, Dest: %p, old disp: %x\n",
-   src, real, insn->displacement.value);
return 0;
}
disp = (u8 *) dest + insn_offset_displacement(insn);
@@ -636,8 +634,7 @@ static int reenter_kprobe(struct kprobe *p, struct pt_regs 
*regs,
 * Raise a BUG or we'll continue in an endless reentering loop
 * and eventually a stack overflow.
 */
-   printk(KERN_WARNING "Unrecoverable kprobe detected at %p.\n",
-  p->addr);
+   pr_err("Unrecoverable kprobe detected.\n");
dump_kprobe(p);
BUG();
default:
@@ -1146,12 +1143,11 @@ int longjmp_break_handler(struct kprobe *p, struct 
pt_regs *regs)
(addr < (u8 *) jprobe_return_end)) {
if (stack_addr(regs) != saved_sp) {
> struct pt_regs *saved_regs = &kcb->jprobe_saved_regs;
-   printk(KERN_ERR
-  "current sp %p does not match saved sp %p\n",
+   pr_err("current sp %px does not match saved sp %px\n",
   stack_addr(regs), saved_sp);
-   printk(KERN_ERR "Saved registers for jprobe %p\n", jp);
+   pr_err("Saved registers for jprobe\n");
show_regs(saved_regs);
-   printk(KERN_ERR "Current registers\n");
+   pr_err("Current registers\n");
show_regs(regs);
BUG();
}




