Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage

2019-08-21 Thread Wanpeng Li
On Wed, 21 Aug 2019 at 13:41, Naoya Horiguchi  wrote:
>
> On Tue, Aug 20, 2019 at 03:03:55PM +0800, Wanpeng Li wrote:
> > Cc Mel Gorman, Kirill, Dave Hansen,
> > On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi  
> > wrote:
> > >
> > > On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > > > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > > > Cc Paolo,
> > > > > Hi all,
> > > > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz  
> > > > > wrote:
> > > > >>
> > > > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > > > >>> Andrew Morton  writes:
> > > > >>>
> > > > >>>> On Thu, 08 Feb 2018 12:30:45 + Punit Agrawal 
> > > > >>>>  wrote:
> > > > >>>>
> > > > >>>>>>
> > > > >>>>>> So I don't think that the above test result means that errors 
> > > > >>>>>> are properly
> > > > >>>>>> handled, and the proposed patch should help for arm64.
> > > > >>>>>
> > > > >>>>> Although, the deviation of pud_huge() avoids a kernel crash the 
> > > > >>>>> code
> > > > >>>>> would be easier to maintain and reason about if arm64 helpers are
> > > > >>>>> consistent with expectations by core code.
> > > > >>>>>
> > > > >>>>> I'll look to update the arm64 helpers once this patch gets 
> > > > >>>>> merged. But
> > > > >>>>> it would be helpful if there was a clear expression of semantics 
> > > > >>>>> for
> > > > >>>>> pud_huge() for various cases. Is there any version that can be 
> > > > >>>>> used as
> > > > >>>>> reference?
> > > > >>>>
> > > > >>>> Is that an ack or tested-by?
> > > > >>>>
> > > > >>>> Mike keeps plaintively asking the powerpc developers to take a 
> > > > >>>> look,
> > > > >>>> but they remain steadfastly in hiding.
> > > > >>>
> > > > >>> Cc'ing linuxppc-dev is always a good idea :)
> > > > >>>
> > > > >>
> > > > >> Thanks Michael,
> > > > >>
> > > > >> I was mostly concerned about use cases for soft/hard offline of huge 
> > > > >> pages
> > > > >> larger than PMD_SIZE on powerpc.  I know that powerpc supports 
> > > > >> PGD_SIZE
> > > > >> huge pages, and soft/hard offline support was specifically added for 
> > > > >> this.
> > > > >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB 
> > > > >> pages
> > > > >> at PGD level"
> > > > >>
> > > > >> This patch will disable that functionality.  So, at a minimum this 
> > > > >> is a
> > > > >> 'heads up'.  If there are actual use cases that depend on this, then 
> > > > >> more
> > > > >> work/discussions will need to happen.  From the e-mail thread on 
> > > > >> PGD_SIZE
> > > > >> support, I can not tell if there is a real use case or this is just a
> > > > >> 'nice to have'.
> > > > >
> > > > > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> > > > > encounter gup_pud_range() panic several times in product environment.
> > > > > Is there any plan to reenable and fix arch codes?
> > > >
> > > > I too am aware of slightly more interest in 1G huge pages.  Suspect 
> > > > that as
> > > > Intel MMU capacity increases to handle more TLB entries there will be 
> > > > more
> > > > and more interest.
> > > >
> > > > Personally, I am not looking at this issue.  Perhaps Naoya will comment 
> > > > as
> > > > he know most about this code.
> > >
> > > Thanks for forwarding this to me, I'm feeling that memory error handling
> > > on 1GB hugepage is demanded as real use case.
> > >
> > > >
> > > > > In addition, 
> > > > > https:

Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage

2019-08-20 Thread Wanpeng Li
Cc Mel Gorman, Kirill, Dave Hansen,
On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi  wrote:
>
> On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > Cc Paolo,
> > > Hi all,
> > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz  
> > > wrote:
> > >>
> > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > >>> Andrew Morton  writes:
> > >>>
> > >>>> On Thu, 08 Feb 2018 12:30:45 + Punit Agrawal 
> > >>>>  wrote:
> > >>>>
> > >>>>>>
> > >>>>>> So I don't think that the above test result means that errors are 
> > >>>>>> properly
> > >>>>>> handled, and the proposed patch should help for arm64.
> > >>>>>
> > >>>>> Although, the deviation of pud_huge() avoids a kernel crash the code
> > >>>>> would be easier to maintain and reason about if arm64 helpers are
> > >>>>> consistent with expectations by core code.
> > >>>>>
> > >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> > >>>>> it would be helpful if there was a clear expression of semantics for
> > >>>>> pud_huge() for various cases. Is there any version that can be used as
> > >>>>> reference?
> > >>>>
> > >>>> Is that an ack or tested-by?
> > >>>>
> > >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> > >>>> but they remain steadfastly in hiding.
> > >>>
> > >>> Cc'ing linuxppc-dev is always a good idea :)
> > >>>
> > >>
> > >> Thanks Michael,
> > >>
> > >> I was mostly concerned about use cases for soft/hard offline of huge 
> > >> pages
> > >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> > >> huge pages, and soft/hard offline support was specifically added for 
> > >> this.
> > >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB 
> > >> pages
> > >> at PGD level"
> > >>
> > >> This patch will disable that functionality.  So, at a minimum this is a
> > >> 'heads up'.  If there are actual use cases that depend on this, then more
> > >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> > >> support, I can not tell if there is a real use case or this is just a
> > >> 'nice to have'.
> > >
> > > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> > > encounter gup_pud_range() panic several times in product environment.
> > > Is there any plan to reenable and fix arch codes?
> >
> > I too am aware of slightly more interest in 1G huge pages.  Suspect that as
> > Intel MMU capacity increases to handle more TLB entries there will be more
> > and more interest.
> >
> > Personally, I am not looking at this issue.  Perhaps Naoya will comment as
> > he know most about this code.
>
> Thanks for forwarding this to me, I'm feeling that memory error handling
> on 1GB hugepage is demanded as real use case.
>
> >
> > > In addition, 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > > hwpoison entry. The guest will vmexit and retry endlessly when
> > > accessing any memory in the guest which is backed by this 1GB poisoned
> > > hugetlbfs page. We have a plan to split this 1GB hugetblfs page by 2MB
> > > hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> > > which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> > > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> > > will be delivered to VM at page fault next time for the offensive
> > > SPTE. Is this proposal acceptable?
> >
> > I am not sure of the error handling design, but this does sound reasonable.
>
> I agree that that's better.
>
> > That block of code which potentially dissolves a huge page on memory error
> > is hard to understand and I'm not sure if that is even the 'normal'
> > functionality.  Certainly, we would hate to waste/poison an entire 1G page
> > for an error on a small subsection.
>
> Yes, that's not practical, so we need at first establish the code base for
> 2MB hugetlb splitting and then extending it to 1GB next.

I found it is not easy to split: there is a single hugetlb page size
associated with a mounted hugetlbfs filesystem, and remapping the file at
2MB/4KB granularity would break that. How about hard offlining the 1GB
hugetlb page the same way soft offline already does: replace the corrupted
1GB page with a new 1GB page through page migration. The
offending/corrupted area of the original 1GB page does not need to be
copied into the new page; that area in the new page can stay zero-filled,
just as it is cleared during a hugetlb page fault, and the other sub-pages
of the original 1GB page can be freed back to the buddy system. A SIGBUS
is then sent to userspace with the offending/corrupted virtual address and
signal code, and userspace should take care of it.
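To make the last step concrete, here is a minimal userspace sketch of how a
process mapping the hugetlbfs file could consume such a SIGBUS; it is only
illustrative (not from this thread), and the recovery policy is up to the
application. The handler reads the faulting address from si_addr and the
memory-error code (BUS_MCEERR_AR/BUS_MCEERR_AO) from si_code:

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Illustrative handler for the hard-offline scheme described above.
 * A production handler should stick to async-signal-safe calls;
 * fprintf() is used here only to keep the sketch short. */
static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
{
	const char *kind = "unknown";

	if (info->si_code == BUS_MCEERR_AR)
		kind = "action required (poisoned page touched)";
	else if (info->si_code == BUS_MCEERR_AO)
		kind = "action optional (poison reported asynchronously)";

	fprintf(stderr, "SIGBUS at %p: %s, ~%d bytes affected\n",
		info->si_addr, kind, 1 << info->si_addr_lsb);
	_exit(EXIT_FAILURE);
}

int main(void)
{
	struct sigaction sa = {
		.sa_sigaction = sigbus_handler,
		.sa_flags = SA_SIGINFO,
	};

	sigaction(SIGBUS, &sa, NULL);
	/* ... mmap() the 1GB hugetlbfs file and do the real work here ... */
	pause();
	return 0;
}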

Regards,
Wanpeng Li


Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage

2019-06-11 Thread Wanpeng Li
On Tue, 11 Jun 2019 at 07:51, Naoya Horiguchi  wrote:
>
> On Wed, May 29, 2019 at 04:31:01PM -0700, Mike Kravetz wrote:
> > On 5/28/19 2:49 AM, Wanpeng Li wrote:
> > > Cc Paolo,
> > > Hi all,
> > > On Wed, 14 Feb 2018 at 06:34, Mike Kravetz  
> > > wrote:
> > >>
> > >> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > >>> Andrew Morton  writes:
> > >>>
> > >>>> On Thu, 08 Feb 2018 12:30:45 + Punit Agrawal 
> > >>>>  wrote:
> > >>>>
> > >>>>>>
> > >>>>>> So I don't think that the above test result means that errors are 
> > >>>>>> properly
> > >>>>>> handled, and the proposed patch should help for arm64.
> > >>>>>
> > >>>>> Although, the deviation of pud_huge() avoids a kernel crash the code
> > >>>>> would be easier to maintain and reason about if arm64 helpers are
> > >>>>> consistent with expectations by core code.
> > >>>>>
> > >>>>> I'll look to update the arm64 helpers once this patch gets merged. But
> > >>>>> it would be helpful if there was a clear expression of semantics for
> > >>>>> pud_huge() for various cases. Is there any version that can be used as
> > >>>>> reference?
> > >>>>
> > >>>> Is that an ack or tested-by?
> > >>>>
> > >>>> Mike keeps plaintively asking the powerpc developers to take a look,
> > >>>> but they remain steadfastly in hiding.
> > >>>
> > >>> Cc'ing linuxppc-dev is always a good idea :)
> > >>>
> > >>
> > >> Thanks Michael,
> > >>
> > >> I was mostly concerned about use cases for soft/hard offline of huge 
> > >> pages
> > >> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> > >> huge pages, and soft/hard offline support was specifically added for 
> > >> this.
> > >> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB 
> > >> pages
> > >> at PGD level"
> > >>
> > >> This patch will disable that functionality.  So, at a minimum this is a
> > >> 'heads up'.  If there are actual use cases that depend on this, then more
> > >> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> > >> support, I can not tell if there is a real use case or this is just a
> > >> 'nice to have'.
> > >
> > > 1GB hugetlbfs pages are used by DPDK and VMs in cloud deployment, we
> > > encounter gup_pud_range() panic several times in product environment.
> > > Is there any plan to reenable and fix arch codes?
> >
> > I too am aware of slightly more interest in 1G huge pages.  Suspect that as
> > Intel MMU capacity increases to handle more TLB entries there will be more
> > and more interest.
> >
> > Personally, I am not looking at this issue.  Perhaps Naoya will comment as
> > he know most about this code.
>
> Thanks for forwarding this to me, I'm feeling that memory error handling
> on 1GB hugepage is demanded as real use case.
>
> >
> > > In addition, 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
> > > The memory in guest can be 1GB/2MB/4K, though the host-backed memory
> > > are 1GB hugetlbfs pages, after above PUD panic is fixed,
> > > try_to_unmap() which is called in MCA recovery path will mark the PUD
> > > hwpoison entry. The guest will vmexit and retry endlessly when
> > > accessing any memory in the guest which is backed by this 1GB poisoned
> > > hugetlbfs page. We have a plan to split this 1GB hugetblfs page by 2MB
> > > hugetlbfs pages/4KB pages, maybe file remap to a virtual address range
> > > which is 2MB/4KB page granularity, also split the KVM MMU 1GB SPTE
> > > into 2MB/4KB and mark the offensive SPTE w/ a hwpoison flag, a sigbus
> > > will be delivered to VM at page fault next time for the offensive
> > > SPTE. Is this proposal acceptable?
> >
> > I am not sure of the error handling design, but this does sound reasonable.
>
> I agree that that's better.
>
> > That block of code which potentially dissolves a huge page on memory error
> > is hard to understand and I'm not sure if that is even the 'normal'
> > functionality.  Certainly, we would hate to waste/poison an entire 1G page
> > for an error on a small subsection.
>
> Yes, that's not practical, so we need at first establish the code base for
> 2MB hugetlb splitting and then extending it to 1GB next.

I'm working on this, thanks for the inputs.

Regards,
Wanpeng Li


Re: [PATCH v2] mm: hwpoison: disable memory error handling on 1GB hugepage

2019-05-28 Thread Wanpeng Li
Cc Paolo,
Hi all,
On Wed, 14 Feb 2018 at 06:34, Mike Kravetz  wrote:
>
> On 02/12/2018 06:48 PM, Michael Ellerman wrote:
> > Andrew Morton  writes:
> >
> >> On Thu, 08 Feb 2018 12:30:45 + Punit Agrawal  
> >> wrote:
> >>
> >>>>
> >>>> So I don't think that the above test result means that errors are 
> >>>> properly
> >>>> handled, and the proposed patch should help for arm64.
> >>>
> >>> Although, the deviation of pud_huge() avoids a kernel crash the code
> >>> would be easier to maintain and reason about if arm64 helpers are
> >>> consistent with expectations by core code.
> >>>
> >>> I'll look to update the arm64 helpers once this patch gets merged. But
> >>> it would be helpful if there was a clear expression of semantics for
> >>> pud_huge() for various cases. Is there any version that can be used as
> >>> reference?
> >>
> >> Is that an ack or tested-by?
> >>
> >> Mike keeps plaintively asking the powerpc developers to take a look,
> >> but they remain steadfastly in hiding.
> >
> > Cc'ing linuxppc-dev is always a good idea :)
> >
>
> Thanks Michael,
>
> I was mostly concerned about use cases for soft/hard offline of huge pages
> larger than PMD_SIZE on powerpc.  I know that powerpc supports PGD_SIZE
> huge pages, and soft/hard offline support was specifically added for this.
> See, 94310cbcaa3c "mm/madvise: enable (soft|hard) offline of HugeTLB pages
> at PGD level"
>
> This patch will disable that functionality.  So, at a minimum this is a
> 'heads up'.  If there are actual use cases that depend on this, then more
> work/discussions will need to happen.  From the e-mail thread on PGD_SIZE
> support, I can not tell if there is a real use case or this is just a
> 'nice to have'.

1GB hugetlbfs pages are used by DPDK and VMs in cloud deployments; we have
hit the gup_pud_range() panic several times in production environments.
Is there any plan to re-enable this and fix the arch code?

In addition, see
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/x86/kvm/mmu.c#n3213
The memory in the guest can be mapped as 1GB/2MB/4KB even though the host
backing memory is 1GB hugetlbfs pages. After the PUD panic above is fixed,
try_to_unmap(), which is called in the MCA recovery path, will install the
PUD hwpoison entry. The guest will then vmexit and retry endlessly when
accessing any guest memory backed by this poisoned 1GB hugetlbfs page. We
have a plan to split this 1GB hugetlbfs page into 2MB hugetlbfs pages/4KB
pages, perhaps by remapping the file to a virtual address range at 2MB/4KB
granularity, and also to split the KVM MMU 1GB SPTE into 2MB/4KB SPTEs and
mark the offending SPTE with a hwpoison flag; a SIGBUS will then be
delivered to the VM at the next page fault on the offending SPTE. Is this
proposal acceptable?
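For context, KVM already recognizes a hardware-poisoned host page when it
resolves a guest frame: instead of installing an SPTE it raises SIGBUS
against the thread running the vCPU. The sketch below is loosely modeled
on kvm_handle_bad_page() in arch/x86/kvm/mmu.c and is simplified, not the
exact upstream code; the proposal above would refine this path so that only
the affected 2MB/4KB range is poisoned rather than the whole 1GB mapping.

static int handle_poisoned_pfn(struct kvm_vcpu *vcpu, gfn_t gfn, kvm_pfn_t pfn)
{
	if (pfn == KVM_PFN_ERR_HWPOISON) {
		unsigned long hva = kvm_vcpu_gfn_to_hva(vcpu, gfn);

		/* SIGBUS/BUS_MCEERR_AR delivered to the vCPU task. */
		kvm_send_hwpoison_signal(hva, current);
		return RET_PF_RETRY;
	}

	return -EFAULT;
}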

Regards,
Wanpeng Li


Re: [tip:sched/core] sched/core: Add debugging code to catch missing update_rq_clock() calls

2017-02-22 Thread Wanpeng Li
2017-02-02 23:55 GMT+08:00 Peter Zijlstra <pet...@infradead.org>:
> On Tue, Jan 31, 2017 at 10:22:47AM -0700, Ross Zwisler wrote:
>> On Tue, Jan 31, 2017 at 4:48 AM, Mike Galbraith <efa...@gmx.de> wrote:
>> > On Tue, 2017-01-31 at 16:30 +0530, Sachin Sant wrote:
>
>
> Could some of you test this? It seems to cure things in my (very)
> limited testing.
>

Tested-by: Wanpeng Li <wanpeng...@hotmail.com>

> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 96e4ccc..b773821 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5609,7 +5609,7 @@ static void migrate_tasks(struct rq *dead_rq)
>  {
> struct rq *rq = dead_rq;
> struct task_struct *next, *stop = rq->stop;
> -   struct rq_flags rf, old_rf;
> +   struct rq_flags rf;
> int dest_cpu;
>
> /*
> @@ -5628,7 +5628,9 @@ static void migrate_tasks(struct rq *dead_rq)
>  * class method both need to have an up-to-date
>  * value of rq->clock[_task]
>  */
> +   rq_pin_lock(rq, &rf);
> update_rq_clock(rq);
> +   rq_unpin_lock(rq, &rf);
>
> for (;;) {
> /*
> @@ -5641,7 +5643,7 @@ static void migrate_tasks(struct rq *dead_rq)
> /*
>  * pick_next_task assumes pinned rq->lock.
>  */
> -   rq_pin_lock(rq, &rf);
> +   rq_repin_lock(rq, &rf);
> next = pick_next_task(rq, &fake_task, &rf);
> BUG_ON(!next);
> next->sched_class->put_prev_task(rq, next);
> @@ -5670,13 +5672,6 @@ static void migrate_tasks(struct rq *dead_rq)
> continue;
> }
>
> -   /*
> -* __migrate_task() may return with a different
> -* rq->lock held and a new cookie in 'rf', but we need
> -* to preserve rf::clock_update_flags for 'dead_rq'.
> -*/
> -   old_rf = rf;
> -
> /* Find suitable destination for @next, with force if needed. 
> */
> dest_cpu = select_fallback_rq(dead_rq->cpu, next);
>
> @@ -5685,7 +5680,6 @@ static void migrate_tasks(struct rq *dead_rq)
> raw_spin_unlock(&rq->lock);
> rq = dead_rq;
> raw_spin_lock(&rq->lock);
> -   rf = old_rf;
> }
> raw_spin_unlock(&next->pi_lock);
> }


Re: [PATCH 6/6] doc/kvm: Add halt polling documentation

2016-10-13 Thread Wanpeng Li
2016-10-14 8:53 GMT+08:00 Suraj Jitindar Singh <sjitindarsi...@gmail.com>:
> There is currently no documentation about the halt polling capabilities
> of the kvm module. Add some documentation describing the mechanism as well
> as the module parameters to allow better understanding of how halt polling
> should be used and the effect of tuning the module parameters.

How about replacing "halt-polling" with "adaptive halt-polling"? Btw,
thanks for your docs.

Regards,
Wanpeng Li

>
> Signed-off-by: Suraj Jitindar Singh <sjitindarsi...@gmail.com>
> ---
>  Documentation/virtual/kvm/00-INDEX |   2 +
>  Documentation/virtual/kvm/halt-polling.txt | 127 +
>  2 files changed, 129 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/halt-polling.txt
>
> diff --git a/Documentation/virtual/kvm/00-INDEX 
> b/Documentation/virtual/kvm/00-INDEX
> index fee9f2b..69fe1a8 100644
> --- a/Documentation/virtual/kvm/00-INDEX
> +++ b/Documentation/virtual/kvm/00-INDEX
> @@ -6,6 +6,8 @@ cpuid.txt
> - KVM-specific cpuid leaves (x86).
>  devices/
> - KVM_CAP_DEVICE_CTRL userspace API.
> +halt-polling.txt
> +   - notes on halt-polling
>  hypercalls.txt
> - KVM hypercalls.
>  locking.txt
> diff --git a/Documentation/virtual/kvm/halt-polling.txt 
> b/Documentation/virtual/kvm/halt-polling.txt
> new file mode 100644
> index 000..4a84183
> --- /dev/null
> +++ b/Documentation/virtual/kvm/halt-polling.txt
> @@ -0,0 +1,127 @@
> +The KVM halt polling system
> +===========================
> +
> +The KVM halt polling system provides a feature within KVM whereby the latency
> +of a guest can, under some circumstances, be reduced by polling in the host
> +for some time period after the guest has elected to no longer run by ceding.
> +That is, when a guest vcpu has ceded, or in the case of powerpc when all of 
> the
> +vcpus of a single vcore have ceded, the host kernel polls for wakeup 
> conditions
> +before giving up the cpu to the scheduler in order to let something else run.
> +
> +Polling provides a latency advantage in cases where the guest can be run 
> again
> +very quickly by at least saving us a trip through the scheduler, normally on
> +the order of a few micro-seconds, although performance benefits are workload
> +dependent. In the event that no wakeup source arrives during the polling
> +interval or some other task on the runqueue is runnable the scheduler is
> +invoked. Thus halt polling is especially useful on workloads with very short
> +wakeup periods where the time spent halt polling is minimised and the time
> +savings of not invoking the scheduler are distinguishable.
> +
> +The generic halt polling code is implemented in:
> +
> +   virt/kvm/kvm_main.c: kvm_vcpu_block()
> +
> +The powerpc kvm-hv specific case is implemented in:
> +
> +   arch/powerpc/kvm/book3s_hv.c: kvmppc_vcore_blocked()
> +
> +Halt Polling Interval
> +=====================
> +
> +The maximum time for which to poll before invoking the scheduler, referred to
> +as the halt polling interval, is increased and decreased based on the 
> perceived
> +effectiveness of the polling in an attempt to limit pointless polling.
> +This value is stored in either the vcpu struct:
> +
> +   kvm_vcpu->halt_poll_ns
> +
> +or in the case of powerpc kvm-hv, in the vcore struct:
> +
> +   kvmppc_vcore->halt_poll_ns
> +
> +Thus this is a per vcpu (or vcore) value.
> +
> +During polling if a wakeup source is received within the halt polling 
> interval,
> +the interval is left unchanged. In the event that a wakeup source isn't
> +received during the polling interval (and thus schedule is invoked) there are
> +two options, either the polling interval and total block time[0] were less 
> than
> +the global max polling interval (see module params below), or the total block
> +time was greater than the global max polling interval.
> +
> +In the event that both the polling interval and total block time were less 
> than
> +the global max polling interval then the polling interval can be increased in
> +the hope that next time during the longer polling interval the wake up source
> +will be received while the host is polling and the latency benefits will be
> +received. The polling interval is grown in the function grow_halt_poll_ns() 
> and
> +is multiplied by the module parameter halt_poll_ns_grow.
> +
> +In the event that the total block time was greater than the global max 
> polling
> +interval then the host will never poll for long enough (limited by the global
> +max) to wakeup during the polling interval so it may as well be shrunk in 
> ord
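For readers following the grow/shrink description quoted above, the
adaptive interval update boils down to something like the sketch below.
This is a simplified illustration of the generic code, not the exact
virt/kvm/kvm_main.c implementation; the 10us starting value and the
handling of a zero shrink factor are assumptions based on the defaults of
that era.

static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
{
	unsigned int val = vcpu->halt_poll_ns;

	if (val == 0)
		val = 10000;			/* assumed 10us starting point */
	else
		val *= halt_poll_ns_grow;	/* module parameter */

	if (val > halt_poll_ns)			/* global max (module parameter) */
		val = halt_poll_ns;

	vcpu->halt_poll_ns = val;
}

static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
{
	unsigned int shrink = halt_poll_ns_shrink;	/* module parameter */

	/* A shrink factor of 0 is taken to mean "stop polling entirely". */
	vcpu->halt_poll_ns = shrink ? vcpu->halt_poll_ns / shrink : 0;
}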

Re: [PATCH v2 0/4] implement vcpu preempted check

2016-07-07 Thread Wanpeng Li
2016-07-07 18:12 GMT+08:00 Wanpeng Li <kernel...@gmail.com>:
> 2016-07-07 17:42 GMT+08:00 Peter Zijlstra <pet...@infradead.org>:
>> On Thu, Jul 07, 2016 at 04:48:05PM +0800, Wanpeng Li wrote:
>>> 2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonz...@redhat.com>:
>>> > Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
>>> > indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
>>> > VCPU has been scheduled out since the last time the guest reset the bit.
>>> >  The guest can use an xchg to test-and-clear it.  The bit can be
>>> > accessed at any time, independent of the version field.
>>>
>>> If one vCPU is preempted, and guest check it several times before this
>>> vCPU is scheded in, then the first time we can get "vCPU is
>>> preempted", however, since the field is cleared, the second time we
>>> will get "vCPU is running".
>>>
>>> Do you mean we should call record_steal_time() in both kvm_sched_in()
>>> and kvm_sched_out() to record this field? Btw, if we should keep both
>>> vcpu->preempted and kvm_steal_time's "vCPU preempted" field present
>>> simultaneous?
>>
>> I suspect you want something like so; except this has holes in.
>>
>> We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
>> after enabling it, this means that if we get preempted in between, the
>> vcpu is reported as running even though it very much is not.
>
> Paolo also point out this to me offline yesterday: "Please change
> pad[12] to "__u32 preempted; __u32 pad[11];" too, and remember to
> update Documentation/virtual/kvm/msr.txt!". Btw, do this in preemption
> notifier means that the vCPU is real preempted on host, however,
> depends on vmexit is different semantic I think.

In addition, I see that Xen's vcpu_runstate_info::state is updated during
schedule, so I think I can do this similarly through the KVM preemption
notifier. IIUC, the Xen hypervisor implements the VCPUOP_get_runstate_info
hypercall, so the desired interface can be provided if a hypercall call
site is added in domU. I can add a similar hypercall to KVM.
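As an illustration of the Xen mechanism referred to here (only a sketch:
the per-CPU runstate variable name is illustrative, and real code has to
follow the runstate update protocol around state_entry_time):

static bool xen_vcpu_is_preempted(int cpu)
{
	struct vcpu_runstate_info *rs = &per_cpu(xen_runstate, cpu);

	/* The hypervisor updates this shared area at schedule time. */
	return READ_ONCE(rs->state) != RUNSTATE_running;
}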

Paolo, thoughts?

Regards,
Wanpeng Li

Re: [PATCH v2 0/4] implement vcpu preempted check

2016-07-07 Thread Wanpeng Li
2016-07-07 17:42 GMT+08:00 Peter Zijlstra <pet...@infradead.org>:
> On Thu, Jul 07, 2016 at 04:48:05PM +0800, Wanpeng Li wrote:
>> 2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonz...@redhat.com>:
>> > Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
>> > indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
>> > VCPU has been scheduled out since the last time the guest reset the bit.
>> >  The guest can use an xchg to test-and-clear it.  The bit can be
>> > accessed at any time, independent of the version field.
>>
>> If one vCPU is preempted, and guest check it several times before this
>> vCPU is scheded in, then the first time we can get "vCPU is
>> preempted", however, since the field is cleared, the second time we
>> will get "vCPU is running".
>>
>> Do you mean we should call record_steal_time() in both kvm_sched_in()
>> and kvm_sched_out() to record this field? Btw, if we should keep both
>> vcpu->preempted and kvm_steal_time's "vCPU preempted" field present
>> simultaneous?
>
> I suspect you want something like so; except this has holes in.
>
> We clear KVM_ST_PAD_PREEMPT before disabling preemption and we set it
> after enabling it, this means that if we get preempted in between, the
> vcpu is reported as running even though it very much is not.

Paolo also pointed this out to me offline yesterday: "Please change
pad[12] to "__u32 preempted; __u32 pad[11];" too, and remember to
update Documentation/virtual/kvm/msr.txt!". Btw, doing this in the
preemption notifier means the vCPU really is preempted on the host,
whereas deducing it from a vmexit has different semantics, I think.
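For reference, the layout change suggested above would look roughly like
this; it is a sketch against struct kvm_steal_time as defined in
arch/x86/include/uapi/asm/kvm_para.h at the time, and the field's name and
width were still under discussion in this thread:

struct kvm_steal_time {
	__u64 steal;
	__u32 version;
	__u32 flags;
	__u32 preempted;	/* was pad[0]: set by the host when the vCPU is
				 * scheduled out, cleared by the guest reader */
	__u32 pad[11];
};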

Regards,
Wanpeng Li

Re: [PATCH v2 0/4] implement vcpu preempted check

2016-07-07 Thread Wanpeng Li
2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonz...@redhat.com>:
>
>
> On 06/07/2016 14:08, Wanpeng Li wrote:
>> 2016-07-06 18:44 GMT+08:00 Paolo Bonzini <pbonz...@redhat.com>:
>>>
>>>
>>> On 06/07/2016 08:52, Peter Zijlstra wrote:
>>>> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>>>>> change fomr v1:
>>>>>  a simplier definition of default vcpu_is_preempted
>>>>>  skip mahcine type check on ppc, and add config. remove dedicated 
>>>>> macro.
>>>>>  add one patch to drop overload of rwsem_spin_on_owner and 
>>>>> mutex_spin_on_owner.
>>>>>  add more comments
>>>>>  thanks boqun and Peter's suggestion.
>>>>>
>>>>> This patch set aims to fix lock holder preemption issues.
>>>>>
>>>>> test-case:
>>>>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>>>>
>>>>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>>>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>>>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>>>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>>>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>>>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>>>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>>>>
>>>>> We introduce interface bool vcpu_is_preempted(int cpu) and use it in some 
>>>>> spin
>>>>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>>>>> These spin_on_onwer variant also cause rcu stall before we apply this 
>>>>> patch set
>>>>
>>>> Paolo, could you help out with an (x86) KVM interface for this?
>>>
>>> If it's just for spin loops, you can check if the version field in the
>>> steal time structure has changed.
>>
>> Steal time will not be updated until ahead of next vmentry except
>> wrmsr MSR_KVM_STEAL_TIME. So it can't represent it is preempted
>> currently, right?
>
> Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
> indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
> VCPU has been scheduled out since the last time the guest reset the bit.
>  The guest can use an xchg to test-and-clear it.  The bit can be
> accessed at any time, independent of the version field.

If one vCPU is preempted and the guest checks it several times before the
vCPU is scheduled in again, then the first check returns "vCPU is
preempted"; however, since the field is cleared by that check, the second
one returns "vCPU is running".

Do you mean we should call record_steal_time() in both kvm_sched_in()
and kvm_sched_out() to record this field? Btw, should we keep both
vcpu->preempted and kvm_steal_time's "vCPU preempted" field at the same
time?
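A guest-side sketch of the interface being discussed, purely illustrative
(the field follows the struct sketch earlier in the thread, and the xchg()
implements Paolo's test-and-clear suggestion, which is exactly what causes
the "second check reports running" behaviour described above):

static bool vcpu_is_preempted(int cpu)
{
	struct kvm_steal_time *st = &per_cpu(steal_time, cpu);

	/* Test-and-clear: only the first reader after a preemption sees it. */
	return xchg(&st->preempted, 0) != 0;
}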

Regards,
Wanpeng Li

Re: [PATCH v2 0/4] implement vcpu preempted check

2016-07-06 Thread Wanpeng Li
2016-07-06 20:28 GMT+08:00 Paolo Bonzini <pbonz...@redhat.com>:
>
>
> On 06/07/2016 14:08, Wanpeng Li wrote:
>> 2016-07-06 18:44 GMT+08:00 Paolo Bonzini <pbonz...@redhat.com>:
>>>
>>>
>>> On 06/07/2016 08:52, Peter Zijlstra wrote:
>>>> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>>>>> change fomr v1:
>>>>>  a simplier definition of default vcpu_is_preempted
>>>>>  skip mahcine type check on ppc, and add config. remove dedicated 
>>>>> macro.
>>>>>  add one patch to drop overload of rwsem_spin_on_owner and 
>>>>> mutex_spin_on_owner.
>>>>>  add more comments
>>>>>  thanks boqun and Peter's suggestion.
>>>>>
>>>>> This patch set aims to fix lock holder preemption issues.
>>>>>
>>>>> test-case:
>>>>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>>>>
>>>>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>>>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>>>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>>>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>>>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>>>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>>>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>>>>
>>>>> We introduce interface bool vcpu_is_preempted(int cpu) and use it in some 
>>>>> spin
>>>>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>>>>> These spin_on_onwer variant also cause rcu stall before we apply this 
>>>>> patch set
>>>>
>>>> Paolo, could you help out with an (x86) KVM interface for this?
>>>
>>> If it's just for spin loops, you can check if the version field in the
>>> steal time structure has changed.
>>
>> Steal time will not be updated until ahead of next vmentry except
>> wrmsr MSR_KVM_STEAL_TIME. So it can't represent it is preempted
>> currently, right?
>
> Hmm, you're right.  We can use bit 0 of struct kvm_steal_time's flags to
> indicate that pad[0] is a "VCPU preempted" field; if pad[0] is 1, the
> VCPU has been scheduled out since the last time the guest reset the bit.
>  The guest can use an xchg to test-and-clear it.  The bit can be
> accessed at any time, independent of the version field.

I will try to implement it tomorrow, thanks for your proposal. :)

Regards,
Wanpeng Li

Re: [PATCH v2 0/4] implement vcpu preempted check

2016-07-06 Thread Wanpeng Li
2016-07-06 18:44 GMT+08:00 Paolo Bonzini <pbonz...@redhat.com>:
>
>
> On 06/07/2016 08:52, Peter Zijlstra wrote:
>> On Tue, Jun 28, 2016 at 10:43:07AM -0400, Pan Xinhui wrote:
>>> change fomr v1:
>>>  a simplier definition of default vcpu_is_preempted
>>>  skip mahcine type check on ppc, and add config. remove dedicated macro.
>>>  add one patch to drop overload of rwsem_spin_on_owner and 
>>> mutex_spin_on_owner.
>>>  add more comments
>>>  thanks boqun and Peter's suggestion.
>>>
>>> This patch set aims to fix lock holder preemption issues.
>>>
>>> test-case:
>>> perf record -a perf bench sched messaging -g 400 -p && perf report
>>>
>>> 18.09%  sched-messaging  [kernel.vmlinux]  [k] osq_lock
>>> 12.28%  sched-messaging  [kernel.vmlinux]  [k] rwsem_spin_on_owner
>>>  5.27%  sched-messaging  [kernel.vmlinux]  [k] mutex_unlock
>>>  3.89%  sched-messaging  [kernel.vmlinux]  [k] wait_consider_task
>>>  3.64%  sched-messaging  [kernel.vmlinux]  [k] _raw_write_lock_irq
>>>  3.41%  sched-messaging  [kernel.vmlinux]  [k] mutex_spin_on_owner.is
>>>  2.49%  sched-messaging  [kernel.vmlinux]  [k] system_call
>>>
>>> We introduce interface bool vcpu_is_preempted(int cpu) and use it in some 
>>> spin
>>> loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
>>> These spin_on_onwer variant also cause rcu stall before we apply this patch 
>>> set
>>>
>>
>> Paolo, could you help out with an (x86) KVM interface for this?
>
> If it's just for spin loops, you can check if the version field in the
> steal time structure has changed.

Steal time is not updated until just before the next vmentry, except on a
wrmsr to MSR_KVM_STEAL_TIME. So it can't indicate whether the vCPU is
currently preempted, right?

Regards,
Wanpeng Li

Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check

2016-07-06 Thread Wanpeng Li
2016-07-06 15:58 GMT+08:00 Peter Zijlstra <pet...@infradead.org>:
> On Wed, Jul 06, 2016 at 02:46:34PM +0800, Wanpeng Li wrote:
>> > SO it's easy for ppc to implement such interface. Note that yield_count is
>> > set by powerVM/KVM.
>> > and only pSeries can run a guest for now. :)
>> >
>> > I also review x86 related code, looks like we need add one hyer-call to get
>> > such vcpu preemption info?
>>
>> There is no such stuff to record lock holder in x86 kvm, maybe we
>> don't need to depend on PLE handler algorithm to guess it if we can
>> know lock holder vCPU directly.
>
> x86/kvm has vcpu->preempted to indicate if a vcpu is currently preempted
> or not. I'm just not sure if that is visible to the guest or how to make
> it so.

Yeah, I missed it. I can volunteer to do it if there are any ideas;
ping Paolo. :)

Regards,
Wanpeng Li

Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check

2016-07-06 Thread Wanpeng Li
Cc Paolo, kvm ml
2016-07-06 12:58 GMT+08:00 xinhui <xinhui@linux.vnet.ibm.com>:
> Hi, wanpeng
>
> On 2016年07月05日 17:57, Wanpeng Li wrote:
>>
>> Hi Xinhui,
>> 2016-06-28 22:43 GMT+08:00 Pan Xinhui <xinhui@linux.vnet.ibm.com>:
>>>
>>> This is to fix some lock holder preemption issues. Some other locks
>>> implementation do a spin loop before acquiring the lock itself. Currently
>>> kernel has an interface of bool vcpu_is_preempted(int cpu). It take the
>>> cpu
>>> as parameter and return true if the cpu is preempted. Then kernel can
>>> break
>>> the spin loops upon on the retval of vcpu_is_preempted.
>>>
>>> As kernel has used this interface, So lets support it.
>>>
>>> Only pSeries need supoort it. And the fact is powerNV are built into same
>>> kernel image with pSeries. So we need return false if we are runnig as
>>> powerNV. The another fact is that lppaca->yiled_count keeps zero on
>>> powerNV. So we can just skip the machine type.
>>
>>
>> Lock holder vCPU preemption can be detected by hardware pSeries or
>> paravirt method?
>>
> There is one shared struct between the kernel and PowerVM/KVM, and we read the
> yield_count of this struct to detect whether a vcpu is running or not.
> So it's easy for ppc to implement such an interface. Note that yield_count is
> set by PowerVM/KVM,
> and only pSeries can run a guest for now. :)
>
> I also review x86 related code, looks like we need add one hyer-call to get
> such vcpu preemption info?

There is no mechanism to record the lock holder in x86 KVM; maybe we
don't need to depend on the PLE handler's algorithm to guess it if we can
identify the lock-holder vCPU directly.

Regards,
Wanpeng Li

Re: [PATCH v2 2/4] powerpc/spinlock: support vcpu preempted check

2016-07-05 Thread Wanpeng Li
Hi Xinhui,
2016-06-28 22:43 GMT+08:00 Pan Xinhui <xinhui@linux.vnet.ibm.com>:
> This is to fix some lock holder preemption issues. Some other lock
> implementations do a spin loop before acquiring the lock itself. Currently
> the kernel has an interface of bool vcpu_is_preempted(int cpu). It takes the
> cpu as a parameter and returns true if that cpu is preempted. Then the kernel
> can break out of the spin loops based on the retval of vcpu_is_preempted.
>
> As the kernel already uses this interface, let's support it.
>
> Only pSeries needs to support it. And the fact is that powerNV is built into
> the same kernel image as pSeries, so we need to return false if we are
> running as powerNV. Another fact is that lppaca->yield_count stays zero on
> powerNV, so we can just skip the machine type check.

Can lock-holder vCPU preemption be detected by pSeries hardware, or via a
paravirt method?
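For reference, the pSeries detection described elsewhere in this thread
(reading the yield_count that the hypervisor maintains in the shared
lppaca) boils down to roughly the following; this is simplified from the
series under discussion, not the exact final code:

static inline bool vcpu_is_preempted(int cpu)
{
	/*
	 * The hypervisor bumps yield_count on every dispatch and preemption
	 * of the virtual processor, so an odd value means the vCPU is
	 * currently scheduled out.  On powerNV the count stays zero, so this
	 * always reports "not preempted".
	 */
	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
}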

Regards,
Wanpeng Li

Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-23 Thread Wanpeng Li
Hi Christoph,
On Mon, Jan 20, 2014 at 04:13:30PM -0600, Christoph Lameter wrote:
On Mon, 20 Jan 2014, Wanpeng Li wrote:

 +   enum zone_type high_zoneidx = gfp_zone(flags);
 
 +   if (!node_present_pages(searchnode)) {
 +   zonelist = node_zonelist(searchnode, flags);
 +   for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 +   searchnode = zone_to_nid(zone);
 +   if (node_present_pages(searchnode))
 +   break;
 +   }
 +   }
 object = get_partial_node(s, get_node(s, searchnode), c, flags);
 if (object || node != NUMA_NO_NODE)
 return object;
 

 The patch fix the bug. However, the kernel crashed very quickly after running
 stress tests for a short while:

This is not a good way of fixing it. How about not asking for memory from
nodes that are memoryless? Use numa_mem_id() which gives you the next node
that has memory instead of numa_node_id() (gives you the current node
regardless if it has memory or not).

diff --git a/mm/slub.c b/mm/slub.c
index 545a170..a1c6040 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
flags, int node,
void *object;
int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

+   if (!node_present_pages(searchnode))
+   searchnode = numa_mem_id();
+
object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
return object;



Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-23 Thread Wanpeng Li
On Fri, Jan 24, 2014 at 11:09:07AM +0800, Wanpeng Li wrote:
Hi Christoph,
On Mon, Jan 20, 2014 at 04:13:30PM -0600, Christoph Lameter wrote:
On Mon, 20 Jan 2014, Wanpeng Li wrote:

 +   enum zone_type high_zoneidx = gfp_zone(flags);
 
 +   if (!node_present_pages(searchnode)) {
 +   zonelist = node_zonelist(searchnode, flags);
 +   for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 +   searchnode = zone_to_nid(zone);
 +   if (node_present_pages(searchnode))
 +   break;
 +   }
 +   }
 object = get_partial_node(s, get_node(s, searchnode), c, flags);
 if (object || node != NUMA_NO_NODE)
 return object;
 

 The patch fix the bug. However, the kernel crashed very quickly after 
 running
 stress tests for a short while:

This is not a good way of fixing it. How about not asking for memory from
nodes that are memoryless? Use numa_mem_id() which gives you the next node
that has memory instead of numa_node_id() (gives you the current node
regardless if it has memory or not).

diff --git a/mm/slub.c b/mm/slub.c
index 545a170..a1c6040 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1700,6 +1700,9 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
flags, int node,
   void *object;
   int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;

+  if (!node_present_pages(searchnode))
+  searchnode = numa_mem_id();
+
   object = get_partial_node(s, get_node(s, searchnode), c, flags);
   if (object || node != NUMA_NO_NODE)
   return object;


The bug still can't be fixed w/ this patch. 

Regards,
Wanpeng Li 




Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-20 Thread Wanpeng Li
Hi Joonsoo,
On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
[...]

-----------------8<-----------------
diff --git a/mm/slub.c b/mm/slub.c
index c3eb3d3..a1f6dfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
flags, int node,
 {
void *object;
int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+   struct zonelist *zonelist;
+   struct zoneref *z;
+   struct zone *zone;
+   enum zone_type high_zoneidx = gfp_zone(flags);

+   if (!node_present_pages(searchnode)) {
+   zonelist = node_zonelist(searchnode, flags);
+   for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+   searchnode = zone_to_nid(zone);
+   if (node_present_pages(searchnode))
+   break;
+   }
+   }
object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
return object;


The patch fixes the bug. However, the kernel crashed very quickly after
running stress tests for a short while:

[  287.464285] Unable to handle kernel paging request for data at address 
0x0001
[  287.464289] Faulting instruction address: 0xc0445af8
[  287.464294] Oops: Kernel access of bad area, sig: 11 [#1]
[  287.464296] SMP NR_CPUS=2048 NUMA pSeries
[  287.464301] Modules linked in: btrfs raid6_pq xor dm_service_time sg nfsv3 
arc4 md4 rpcsec_gss_krb5 nfsv4 nls_utf8 cifs nfs fscache dns_resolver 
nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6t_REJECT 
ipt_REJECT xt_conntrack ebtable_nat ebtable_broute bridge stp llc 
ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 
nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter 
ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter 
ip_tables ext4 mbcache jbd2 ibmvfc scsi_transport_fc ibmveth nx_crypto 
pseries_rng nfsd auth_rpcgss nfs_acl lockd binfmt_misc sunrpc uinput 
dm_multipath xfs libcrc32c sd_mod crc_t10dif crct10dif_common ibmvscsi 
scsi_transport_srp scsi_tgt dm_mirror dm_region_hash dm_log dm_mod
[  287.464374] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
3.10.0-71.el7.91831.ppc64 #1
[  287.464378] task: c0fde590 ti: c001fffd task.ti: 
c10a4000
[  287.464382] NIP: c0445af8 LR: c0445bcc CTR: c0445b90
[  287.464385] REGS: c001fffd38e0 TRAP: 0300   Not tainted  
(3.10.0-71.el7.91831.ppc64)
[  287.464388] MSR: 80009032 SF,EE,ME,IR,DR,RI  CR: 88002084  XER: 
0001
[  287.464397] SOFTE: 0
[  287.464398] CFAR: c000908c
[  287.464401] DAR: 0001, DSISR: 4000
[  287.464403]
GPR00: d3649a04 c001fffd3b60 c10a94d0 0003
GPR04: c0018d841048 c001fffd3bd0 0012 d364eff0
GPR08: c001fffd3bd0 0001 d364d688 c0445b90
GPR12: d364b960 c7e0 042ac510 0060
GPR16: 0020 fb19 c1122100 
GPR20: c0a94680 c1122180 c0a94680 000a
GPR24: 0100  0001 c001ef90
GPR28: c001d6c066f0 c001aea03520 c001bc9a2640 c0018d841680
[  287.464447] NIP [c0445af8] .__dev_printk+0x28/0xc0
[  287.464450] LR [c0445bcc] .dev_printk+0x3c/0x50
[  287.464453] PACATMSCRATCH [80009032]
[  287.464455] Call Trace:
[  287.464458] [c001fffd3b60] [c001fffd3c00] 0xc001fffd3c00 
(unreliable)
[  287.464467] [c001fffd3bf0] [d3649a04] 
.ibmvfc_scsi_done+0x334/0x3e0 [ibmvfc]
[  287.464474] [c001fffd3cb0] [d36495b8] 
.ibmvfc_handle_crq+0x2e8/0x320 [ibmvfc]
[  287.464488] [c001fffd3d30] [d3649fe4] .ibmvfc_tasklet+0xd4/0x250 
[ibmvfc]
[  287.464494] [c001fffd3de0] [c009b46c] .tasklet_action+0xcc/0x1b0
[  287.464498] [c001fffd3e90] [c009a668] .__do_softirq+0x148/0x360
[  287.464503] [c001fffd3f90] [c00218a8] .call_do_softirq+0x14/0x24
[  287.464507] [c001fffcfdf0] [c00107e0] .do_softirq+0xd0/0x100
[  287.464511] [c001fffcfe80] [c009aba8] .irq_exit+0x1b8/0x1d0
[  287.464514] [c001fffcff10] [c0010410] .__do_irq+0xc0/0x1e0
[  287.464518] [c001fffcff90] [c00218cc] .call_do_irq+0x14/0x24
[  287.464522] [c10a76d0] [c00105bc] .do_IRQ+0x8c/0x100
[  287.464527] --- Exception: 501 at 0x
[  287.464527] LR = .arch_local_irq_restore+0x74/0x90
[  287.464533] [c10a7770] [c0002494] 
hardware_interrupt_common+0x114/0x180 (unreliable)
[  287.464540] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
[  287.464540] LR = .check_and_cede_processor+0x24/0x40
[  287.464546] 

Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-20 Thread Wanpeng Li
On Mon, Jan 20, 2014 at 04:13:30PM -0600, Christoph Lameter wrote:
On Mon, 20 Jan 2014, Wanpeng Li wrote:

 +   enum zone_type high_zoneidx = gfp_zone(flags);
 
 +   if (!node_present_pages(searchnode)) {
 +   zonelist = node_zonelist(searchnode, flags);
 +   for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 +   searchnode = zone_to_nid(zone);
 +   if (node_present_pages(searchnode))
 +   break;
 +   }
 +   }
 object = get_partial_node(s, get_node(s, searchnode), c, flags);
 if (object || node != NUMA_NO_NODE)
 return object;
 

 The patch fix the bug. However, the kernel crashed very quickly after running
 stress tests for a short while:

This is not a good way of fixing it. How about not asking for memory from
nodes that are memoryless? Use numa_mem_id() which gives you the next node
that has memory instead of numa_node_id() (gives you the current node
regardless if it has memory or not).

Thanks for your pointing out, I will do it and retest it later.

Regards,
Wanpeng Li 



Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-07 Thread Wanpeng Li
Hi Joonsoo,
On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
 
[...]
Hello,

I think that we need more efforts to solve unbalanced node problem.

With this patch, even if node of current cpu slab is not favorable to
unbalanced node, allocation would proceed and we would get the unintended 
memory.


We have a machine:

[0.00] Node 0 Memory:
[0.00] Node 4 Memory: 0x0-0x1000 0x2000-0x6000 
0x8000-0xc000
[0.00] Node 6 Memory: 0x1000-0x2000 0x6000-0x8000
[0.00] Node 10 Memory: 0xc000-0x18000

[0.041486] Node 0 CPUs: 0-19
[0.041490] Node 4 CPUs:
[0.041492] Node 6 CPUs:
[0.041495] Node 10 CPUs:

The pages of current cpu slab should be allocated from fallback zones/nodes 
of the memoryless node in buddy system, how can not favorable happen? 

And there is one more problem. Even if we have some partial slabs on
compatible node, we would allocate new slab, because get_partial() cannot 
handle
this unbalance node case.

To fix this correctly, how about following patch?


So I think we should fold your two patches into one.

Regards,
Wanpeng Li 

Thanks.

-----------------8<-----------------
diff --git a/mm/slub.c b/mm/slub.c
index c3eb3d3..a1f6dfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
flags, int node,
 {
void *object;
int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+   struct zonelist *zonelist;
+   struct zoneref *z;
+   struct zone *zone;
+   enum zone_type high_zoneidx = gfp_zone(flags);

+   if (!node_present_pages(searchnode)) {
+   zonelist = node_zonelist(searchnode, flags);
+   for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+   searchnode = zone_to_nid(zone);
+   if (node_present_pages(searchnode))
+   break;
+   }
+   }
object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
return object;




Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-07 Thread Wanpeng Li
On Tue, Jan 07, 2014 at 06:10:16PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:
 Hi Joonsoo,
 On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
 On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
  
 [...]
 Hello,
 
 I think that we need more efforts to solve unbalanced node problem.
 
 With this patch, even if node of current cpu slab is not favorable to
 unbalanced node, allocation would proceed and we would get the unintended 
 memory.
 
 
 We have a machine:
 
 [0.00] Node 0 Memory:
 [0.00] Node 4 Memory: 0x0-0x1000 0x2000-0x6000 
 0x8000-0xc000
 [0.00] Node 6 Memory: 0x1000-0x2000 0x6000-0x8000
 [0.00] Node 10 Memory: 0xc000-0x18000
 
 [0.041486] Node 0 CPUs: 0-19
 [0.041490] Node 4 CPUs:
 [0.041492] Node 6 CPUs:
 [0.041495] Node 10 CPUs:
 
 The pages of current cpu slab should be allocated from fallback zones/nodes 
 of the memoryless node in buddy system, how can not favorable happen? 

Hi, Wanpeng.

IIRC, if we call kmem_cache_alloc_node() with certain node #, we try to
allocate the page in fallback zones/node of that node #. So fallback list isn't
related to fallback one of memoryless node #. Am I wrong?


Anton added a node_spanned_pages(node) check, so the current cpu slab
mentioned above is against a memoryless node. Am I missing something?

Regards,
Wanpeng Li 

Thanks.

 
 And there is one more problem. Even if we have some partial slabs on
 compatible node, we would allocate new slab, because get_partial() cannot 
 handle
 this unbalance node case.
 
 To fix this correctly, how about following patch?
 
 
 So I think we should fold both of your two patches to one.
 
 Regards,
 Wanpeng Li 
 
 Thanks.
 
 -----------------8<-----------------
 diff --git a/mm/slub.c b/mm/slub.c
 index c3eb3d3..a1f6dfa 100644
 --- a/mm/slub.c
 +++ b/mm/slub.c
 @@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t 
 flags, int node,
  {
 void *object;
 int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
 +   struct zonelist *zonelist;
 +   struct zoneref *z;
 +   struct zone *zone;
 +   enum zone_type high_zoneidx = gfp_zone(flags);
 
 +   if (!node_present_pages(searchnode)) {
 +   zonelist = node_zonelist(searchnode, flags);
 +   for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
 +   searchnode = zone_to_nid(zone);
 +   if (node_present_pages(searchnode))
 +   break;
 +   }
 +   }
 object = get_partial_node(s, get_node(s, searchnode), c, flags);
 if (object || node != NUMA_NO_NODE)
 return object;
 



Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-07 Thread Wanpeng Li
On Tue, Jan 07, 2014 at 06:31:56PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 05:21:45PM +0800, Wanpeng Li wrote:
 On Tue, Jan 07, 2014 at 06:10:16PM +0900, Joonsoo Kim wrote:
 On Tue, Jan 07, 2014 at 04:48:40PM +0800, Wanpeng Li wrote:
  Hi Joonsoo,
  On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
  On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
   
  [...]
  Hello,
  
  I think that we need more efforts to solve unbalanced node problem.
  
  With this patch, even if node of current cpu slab is not favorable to
  unbalanced node, allocation would proceed and we would get the 
  unintended memory.
  
  
  We have a machine:
  
  [0.00] Node 0 Memory:
  [0.00] Node 4 Memory: 0x0-0x1000 0x2000-0x6000 
  0x8000-0xc000
  [0.00] Node 6 Memory: 0x1000-0x2000 0x6000-0x8000
  [0.00] Node 10 Memory: 0xc000-0x18000
  
  [0.041486] Node 0 CPUs: 0-19
  [0.041490] Node 4 CPUs:
  [0.041492] Node 6 CPUs:
  [0.041495] Node 10 CPUs:
  
  The pages of current cpu slab should be allocated from fallback 
  zones/nodes 
  of the memoryless node in buddy system, how can not favorable happen? 
 
 Hi, Wanpeng.
 
 IIRC, if we call kmem_cache_alloc_node() with certain node #, we try to
 allocate the page in fallback zones/node of that node #. So fallback list 
 isn't
 related to fallback one of memoryless node #. Am I wrong?
 
 
 Anton add node_spanned_pages(node) check, so current cpu slab mentioned
 above is against memoryless node. If I miss something?

I thought following scenario.

memoryless node # : 1
1's fallback node # : 0

On node 1's cpu,

1. kmem_cache_alloc_node (node 2)
2. allocate the page on node 2 for the slab, now cpu slab is that one.
3. kmem_cache_alloc_node (local node, that is, node 1)
4. It check node_spanned_pages() and find it is memoryless node.
So return node 2's memory.

Is it impossible scenario?


Indeed, it can happen. 
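Spelled out as calls (illustrative only: cachep and the obj_* variables are
placeholders, node 1 is memoryless and node 0 is its nearest node with
memory):

	/* Steps 1-2: an explicit node-2 allocation installs a node-2 slab
	 * as this CPU's cpu slab (we are running on a CPU of node 1). */
	obj_a = kmem_cache_alloc_node(cachep, GFP_KERNEL, 2);

	/* Steps 3-4: a "local" allocation on memoryless node 1 then sees
	 * node_spanned_pages(1) == 0, skips the deactivate path and returns
	 * node-2 memory instead of memory from fallback node 0. */
	obj_b = kmem_cache_alloc_node(cachep, GFP_KERNEL, 1);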

Regards,
Wanpeng Li 

Thanks.



Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-07 Thread Wanpeng Li
On Tue, Jan 07, 2014 at 04:41:36PM +0900, Joonsoo Kim wrote:
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:
 
 We noticed a huge amount of slab memory consumed on a large ppc64 box:
 
 Slab:2094336 kB
 
 Almost 2GB. This box is not balanced and some nodes do not have local
 memory, causing slub to be very inefficient in its slab usage.
 
 Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
 sees it isn't node local, deactivates it and tries to allocate a new
 slab. On empty nodes we will allocate a new remote slab and use the
 first slot, but as explained above when we get called a second time
 we will just deactivate that slab and retry.
 
 As such we end up only using 1 entry in each slab:
 
 slabmem  objects
used   active
 
 kmalloc-16384   1404 MB4.90%
 task_struct  668 MB2.90%
 kmalloc-128  193 MB3.61%
 kmalloc-192  152 MB5.23%
 kmalloc-8192  72 MB   23.40%
 kmalloc-1664 MB7.43%
 kmalloc-512   33 MB   22.41%
 
 The patch below checks that a node is not empty before deactivating a
 slab and trying to allocate it again. With this patch applied we now
 use about 352MB:
 
 Slab: 360192 kB
 
 And our efficiency is much better:
 
 slabmem  objects
used   active
 
 kmalloc-16384 92 MB   74.27%
 task_struct   23 MB   83.46%
 idr_layer_cache   18 MB  100.00%
 pgtable-2^12  17 MB  100.00%
 kmalloc-65536 15 MB  100.00%
 inode_cache   14 MB  100.00%
 kmalloc-256   14 MB   97.81%
 kmalloc-8192  14 MB   85.71%
 
 Signed-off-by: Anton Blanchard an...@samba.org
 ---
 
 Thoughts? It seems like we could hit a similar situation if a machine
 is balanced but we run out of memory on a single node.
 
 Index: b/mm/slub.c
 ===
 --- a/mm/slub.c
 +++ b/mm/slub.c
 @@ -2278,10 +2278,17 @@ redo:
  
  if (unlikely(!node_match(page, node))) {
  stat(s, ALLOC_NODE_MISMATCH);
 -deactivate_slab(s, page, c->freelist);
 -c->page = NULL;
 -c->freelist = NULL;
 -goto new_slab;
 +
 +/*
 + * If the node contains no memory there is no point in trying
 + * to allocate a new node local slab
 + */
 +if (node_spanned_pages(node)) {
 +deactivate_slab(s, page, c->freelist);
 +c->page = NULL;
 +c->freelist = NULL;
 +goto new_slab;
 +}
  }
  
  /*

Hello,

I think that we need more effort to solve the unbalanced node problem.

With this patch, even if the node of the current cpu slab is not favorable
for the unbalanced node, allocation would proceed and we would get
unintended memory.

And there is one more problem. Even if we have some partial slabs on a
compatible node, we would allocate a new slab, because get_partial() cannot
handle this unbalanced node case.

To fix this correctly, how about the following patch?

Thanks.

------------8<------------
diff --git a/mm/slub.c b/mm/slub.c
index c3eb3d3..a1f6dfa 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1672,7 +1672,19 @@ static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
 {
void *object;
int searchnode = (node == NUMA_NO_NODE) ? numa_node_id() : node;
+   struct zonelist *zonelist;
+   struct zoneref *z;
+   struct zone *zone;
+   enum zone_type high_zoneidx = gfp_zone(flags);

+   if (!node_present_pages(searchnode)) {
+   zonelist = node_zonelist(searchnode, flags);
+   for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+   searchnode = zone_to_nid(zone);
+   if (node_present_pages(searchnode))
+   break;
+   }
+   }

Why change searchnode instead of depending on the fallback zones/nodes in
get_any_partial() to allocate partial slabs? (A sketch of that alternative
follows the quoted code below.)

Regards,
Wanpeng Li 

object = get_partial_node(s, get_node(s, searchnode), c, flags);
if (object || node != NUMA_NO_NODE)
return object;
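
For reference, the alternative being asked about would look roughly like this
(a paraphrased sketch against the mm/slub.c of that era, not a tested patch;
the node_present_pages() condition is my own wording of the idea): keep
searchnode untouched and simply let the request fall through to the existing
get_any_partial() zonelist walk when the requested node has no memory.

	object = get_partial_node(s, get_node(s, searchnode), c, flags);
	if (object || (node != NUMA_NO_NODE && node_present_pages(node)))
		return object;

	/*
	 * get_any_partial() already walks the local zonelist with
	 * for_each_zone_zonelist() and tries get_partial_node() on each
	 * candidate node, so the fallback logic is not duplicated here.
	 */
	return get_any_partial(s, flags, c);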




Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-07 Thread Wanpeng Li
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:

We noticed a huge amount of slab memory consumed on a large ppc64 box:

Slab:          2094336 kB

Almost 2GB. This box is not balanced and some nodes do not have local
memory, causing slub to be very inefficient in its slab usage.

Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
sees it isn't node local, deactivates it and tries to allocate a new
slab. On empty nodes we will allocate a new remote slab and use the
first slot, but as explained above when we get called a second time
we will just deactivate that slab and retry.


Deactivating the cpu slab doesn't always mean the slab is freed back to the
buddy system; the slab may be put back on the remote node's partial list if
it still has objects in use in this unbalanced situation. In that case, the
slub slow path can freeze that partial slab on the remote node again. So why
does the slab cache end up as fragmented as shown below?

Regards,
Wanpeng Li 
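
As a rough illustration of the point above, this is a paraphrased sketch (not
the exact mm/slub.c code of that era) of how a deactivated cpu slab is disposed
of: only a completely free slab can go back to the buddy allocator, while a
slab with live objects stays on its node's partial list.

static void deactivate_slab_tail(struct kmem_cache *s, struct page *page)
{
	struct kmem_cache_node *n = get_node(s, page_to_nid(page));

	if (page->inuse) {
		/* objects still allocated: keep it on the node's partial list */
		add_partial(n, page, DEACTIVATE_TO_TAIL);
	} else if (n->nr_partial >= s->min_partial) {
		/* empty and the node already has enough partial slabs: free it */
		discard_slab(s, page);
	} else {
		/* empty, but kept on the partial list to avoid churn */
		add_partial(n, page, DEACTIVATE_TO_HEAD);
	}
}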

As such we end up only using 1 entry in each slab:

slab                   mem    objects
                      used     active

kmalloc-16384       1404 MB      4.90%
task_struct          668 MB      2.90%
kmalloc-128          193 MB      3.61%
kmalloc-192          152 MB      5.23%
kmalloc-8192          72 MB     23.40%
kmalloc-16            64 MB      7.43%
kmalloc-512           33 MB     22.41%

The patch below checks that a node is not empty before deactivating a
slab and trying to allocate it again. With this patch applied we now
use about 352MB:

Slab: 360192 kB

And our efficiency is much better:

slab                   mem    objects
                      used     active

kmalloc-16384         92 MB     74.27%
task_struct           23 MB     83.46%
idr_layer_cache       18 MB    100.00%
pgtable-2^12          17 MB    100.00%
kmalloc-65536         15 MB    100.00%
inode_cache           14 MB    100.00%
kmalloc-256           14 MB     97.81%
kmalloc-8192          14 MB     85.71%

Signed-off-by: Anton Blanchard an...@samba.org
---

Thoughts? It seems like we could hit a similar situation if a machine
is balanced but we run out of memory on a single node.

Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2278,10 +2278,17 @@ redo:

 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+
+		/*
+		 * If the node contains no memory there is no point in trying
+		 * to allocate a new node local slab
+		 */
+		if (node_spanned_pages(node)) {
+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}

 	/*




Re: [PATCH] slub: Don't throw away partial remote slabs if there is no local memory

2014-01-06 Thread Wanpeng Li
On Tue, Jan 07, 2014 at 01:21:00PM +1100, Anton Blanchard wrote:

We noticed a huge amount of slab memory consumed on a large ppc64 box:

Slab:          2094336 kB

Almost 2GB. This box is not balanced and some nodes do not have local
memory, causing slub to be very inefficient in its slab usage.

Each time we call kmem_cache_alloc_node slub checks the per cpu slab,
sees it isn't node local, deactivates it and tries to allocate a new
slab. On empty nodes we will allocate a new remote slab and use the
first slot, but as explained above when we get called a second time
we will just deactivate that slab and retry.

As such we end up only using 1 entry in each slab:

slab                   mem    objects
                      used     active

kmalloc-16384       1404 MB      4.90%
task_struct          668 MB      2.90%
kmalloc-128          193 MB      3.61%
kmalloc-192          152 MB      5.23%
kmalloc-8192          72 MB     23.40%
kmalloc-16            64 MB      7.43%
kmalloc-512           33 MB     22.41%

The patch below checks that a node is not empty before deactivating a
slab and trying to allocate it again. With this patch applied we now
use about 352MB:

Slab: 360192 kB

And our efficiency is much better:

slab                   mem    objects
                      used     active

kmalloc-16384         92 MB     74.27%
task_struct           23 MB     83.46%
idr_layer_cache       18 MB    100.00%
pgtable-2^12          17 MB    100.00%
kmalloc-65536         15 MB    100.00%
inode_cache           14 MB    100.00%
kmalloc-256           14 MB     97.81%
kmalloc-8192          14 MB     85.71%

Signed-off-by: Anton Blanchard an...@samba.org

Reviewed-by: Wanpeng Li liw...@linux.vnet.ibm.com

---

Thoughts? It seems like we could hit a similar situation if a machine
is balanced but we run out of memory on a single node.

Index: b/mm/slub.c
===================================================================
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2278,10 +2278,17 @@ redo:

 	if (unlikely(!node_match(page, node))) {
 		stat(s, ALLOC_NODE_MISMATCH);
-		deactivate_slab(s, page, c->freelist);
-		c->page = NULL;
-		c->freelist = NULL;
-		goto new_slab;
+
+		/*
+		 * If the node contains no memory there is no point in trying
+		 * to allocate a new node local slab
+		 */
+		if (node_spanned_pages(node)) {

s/node_spanned_pages/node_present_pages (see the mmzone.h excerpt below the quoted patch)

+			deactivate_slab(s, page, c->freelist);
+			c->page = NULL;
+			c->freelist = NULL;
+			goto new_slab;
+		}
 	}

 	/*
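
For reference, the distinction behind the s/node_spanned_pages/node_present_pages/
remark above, abridged from include/linux/mmzone.h (field comments paraphrased
from that header): a node can span a physical range full of holes, so its
spanned page count may be non-zero even though no pages are actually present.

/* abridged sketch of include/linux/mmzone.h */
typedef struct pglist_data {
	/* ... */
	unsigned long node_present_pages; /* total number of physical pages */
	unsigned long node_spanned_pages; /* total size of physical page
					     range, including holes */
	/* ... */
} pg_data_t;

#define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
#define node_spanned_pages(nid)	(NODE_DATA(nid)->node_spanned_pages)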




[PATCH 1/7][TRIVIAL][resend] powerpc: cleanup kernel-doc warning

2012-06-15 Thread Wanpeng Li
From: Wanpeng Li l...@linux.vnet.ibm.com

Warning(arch/powerpc/kernel/pci_of_scan.c:210): Excess function parameter 
'node' description in 'of_scan_pci_bridge'
Warning(arch/powerpc/kernel/vio.c:636): No description found for parameter 
'desired'
Warning(arch/powerpc/kernel/vio.c:636): Excess function parameter 'new_desired' 
description in 'vio_cmo_set_dev_desired'
Warning(arch/powerpc/kernel/vio.c:1270): No description found for parameter 
'viodrv'
Warning(arch/powerpc/kernel/vio.c:1270): Excess function parameter 'drv' 
description in '__vio_register_driver'
Warning(arch/powerpc/kernel/vio.c:1289): No description found for parameter 
'viodrv'
Warning(arch/powerpc/kernel/vio.c:1289): Excess function parameter 'driver' 
description in 'vio_unregister_driver'


Signed-off-by: Wanpeng Li liwp.li...@gmail.com
---
 arch/powerpc/kernel/pci_of_scan.c |1 -
 arch/powerpc/kernel/vio.c |6 +++---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/pci_of_scan.c b/arch/powerpc/kernel/pci_of_scan.c
index 89dde17..d7dd42b 100644
--- a/arch/powerpc/kernel/pci_of_scan.c
+++ b/arch/powerpc/kernel/pci_of_scan.c
@@ -198,7 +198,6 @@ EXPORT_SYMBOL(of_create_pci_dev);
 
 /**
  * of_scan_pci_bridge - Set up a PCI bridge and scan for child nodes
- * @node: device tree node of bridge
  * @dev: pci_dev structure for the bridge
  *
  * of_scan_bus() calls this routine for each PCI bridge that it finds, and
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index cb87301..06cbc30 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -625,7 +625,7 @@ struct dma_map_ops vio_dma_mapping_ops = {
  * vio_cmo_set_dev_desired - Set desired entitlement for a device
  *
  * @viodev: struct vio_dev for device to alter
- * @new_desired: new desired entitlement level in bytes
+ * @desired: new desired entitlement level in bytes
  *
  * For use by devices to request a change to their entitlement at runtime or
  * through sysfs.  The desired entitlement level is changed and a balancing
@@ -1262,7 +1262,7 @@ static int vio_bus_remove(struct device *dev)
 
 /**
  * vio_register_driver: - Register a new vio driver
- * @drv:   The vio_driver structure to be registered.
+ * @viodrv:The vio_driver structure to be registered.
  */
 int __vio_register_driver(struct vio_driver *viodrv, struct module *owner,
  const char *mod_name)
@@ -1282,7 +1282,7 @@ EXPORT_SYMBOL(__vio_register_driver);
 
 /**
  * vio_unregister_driver - Remove registration of vio driver.
- * @driver:The vio_driver struct to be removed form registration
+ * @viodrv:The vio_driver struct to be removed form registration
  */
 void vio_unregister_driver(struct vio_driver *viodrv)
 {
-- 
1.7.9.5
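
For context on these warnings: scripts/kernel-doc matches each @name tag in a
/** ... */ block against the parameter names of the function that follows, so a
stale tag produces exactly the "Excess function parameter" / "No description
found" pair listed above. A minimal illustration using the vio example from
this patch (the prototype is reproduced from memory, so treat it as an
assumption):

/**
 * vio_cmo_set_dev_desired - Set desired entitlement for a device
 *
 * @viodev: struct vio_dev for device to alter
 * @new_desired: new desired entitlement level in bytes  <-- matches no parameter
 */
void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired);
/*
 * kernel-doc reports:
 *   No description found for parameter 'desired'
 *   Excess function parameter 'new_desired' description
 * Renaming the tag to @desired, as the patch does, silences both.
 */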



[PATCH 2/7][TRIVIAL][resend] x86/kernel: cleanup some kernel-doc warnings

2012-06-15 Thread Wanpeng Li
From: Wanpeng Li l...@linux.vnet.ibm.com

Warning(arch/x86/kernel/kgdb.c:465): No description found for parameter 
'e_vector'
Warning(arch/x86/kernel/kgdb.c:465): No description found for parameter 
'remcomInBuffer'
Warning(arch/x86/kernel/kgdb.c:465): No description found for parameter 
'remcomOutBuffer'
Warning(arch/x86/kernel/kgdb.c:465): No description found for parameter 
'linux_regs'
Warning(arch/x86/kernel/kgdb.c:465): Excess function parameter 'vector' 
description in 'kgdb_arch_handle_exception'
Warning(arch/x86/kernel/kgdb.c:465): Excess function parameter 
'remcom_in_buffer' description in 'kgdb_arch_handle_exception'
Warning(arch/x86/kernel/kgdb.c:465): Excess function parameter 
'remcom_out_buffer' description in 'kgdb_arch_handle_exception'
Warning(arch/x86/kernel/kgdb.c:465): Excess function parameter 'regs' 
description in 'kgdb_arch_handle_exception'
Warning(arch/x86/kernel/uprobes.c:416): No description found for parameter 
'auprobe'
Warning(arch/x86/kernel/uprobes.c:416): Excess function parameter 'arch_uprobe' 
description in 'arch_uprobe_analyze_insn'
Warning(arch/x86/lib/csum-wrappers_64.c:125): No description found for 
parameter 'sum'
Warning(arch/x86/lib/csum-wrappers_64.c:125): Excess function parameter 'isum' 
description in 'csum_partial_copy_nocheck'

Signed-off-by: Wanpeng Li liwp.li...@gmail.com

---
 arch/x86/kernel/kgdb.c|8 
 arch/x86/kernel/uprobes.c |2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 8bfb614..3f61904 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -444,12 +444,12 @@ void kgdb_roundup_cpus(unsigned long flags)
 
 /**
  * kgdb_arch_handle_exception - Handle architecture specific GDB packets.
- * @vector: The error vector of the exception that happened.
+ * @e_vector: The error vector of the exception that happened.
  * @signo: The signal number of the exception that happened.
  * @err_code: The error code of the exception that happened.
- * @remcom_in_buffer: The buffer of the packet we have read.
- * @remcom_out_buffer: The buffer of %BUFMAX bytes to write a packet into.
- * @regs: The struct pt_regs of the current process.
+ * @remcomInBuffer: The buffer of the packet we have read.
+ * @remcomOutBuffer: The buffer of %BUFMAX bytes to write a packet into.
+ * @linux_regs: The struct pt_regs of the current process.
  *
  * This function MUST handle the 'c' and 's' command packets,
  * as well packets to set / remove a hardware breakpoint, if used.
diff --git a/arch/x86/kernel/uprobes.c b/arch/x86/kernel/uprobes.c
index dc4e910..f785a06 100644
--- a/arch/x86/kernel/uprobes.c
+++ b/arch/x86/kernel/uprobes.c
@@ -408,7 +408,7 @@ static int validate_insn_bits(struct arch_uprobe *auprobe, struct mm_struct *mm,
 /**
  * arch_uprobe_analyze_insn - instruction analysis including validity and 
fixups.
  * @mm: the probed address space.
- * @arch_uprobe: the probepoint information.
+ * @auprobe: the probepoint information.
  * Return 0 on success or a -ve number on error.
  */
 int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe, struct mm_struct *mm)
-- 
1.7.9.5



[PATCH 3/7][TRIVIAL][resend] drivers/pci: cleanup kernel-doc warning

2012-06-15 Thread Wanpeng Li
From: Wanpeng Li l...@linux.vnet.ibm.com

Warning(drivers/pci/setup-bus.c:277): No description found for parameter 
'fail_head'
Warning(drivers/pci/setup-bus.c:277): Excess function parameter 'failed_list' 
description in 'assign_requested_resources_sorted'

Signed-off-by: Wanpeng Li liwp.li...@gmail.com
---
 drivers/pci/setup-bus.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 8fa2d4b..9165d25 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -265,7 +265,7 @@ out:
  * assign_requested_resources_sorted() - satisfy resource requests
  *
  * @head : head of the list tracking requests for resources
- * @failed_list : head of the list tracking requests that could
+ * @fail_head : head of the list tracking requests that could
  * not be allocated
  *
  * Satisfy resource requests of each element in the list. Add
-- 
1.7.9.5



[PATCH 4/7][TRIVIAL][resend] mm: cleanup on the comments of zone_reclaim_stat

2012-06-15 Thread Wanpeng Li
From: Wanpeng Li l...@linux.vnet.ibm.com

Signed-off-by: Wanpeng Li liwp.li...@gmail.com
Acked-by: Minchan Kim minc...@kernel.org

---
 include/linux/mmzone.h |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2427706..d6a5f83 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -188,7 +188,7 @@ static inline int is_unevictable_lru(enum lru_list lru)
 struct zone_reclaim_stat {
/*
 * The pageout code in vmscan.c keeps track of how many of the
-* mem/swap backed and file backed pages are refeferenced.
+* mem/swap backed and file backed pages are referenced.
 * The higher the rotated/scanned ratio, the more valuable
 * that cache is.
 *
-- 
1.7.9.5



[PATCH 5/7][TRIVIAL][resend] mm: cleanup kernel-doc warnings

2012-06-15 Thread Wanpeng Li
From: Wanpeng Li l...@linux.vnet.ibm.com

fix kernel-doc warnings just like this one:

Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description 
in 'swap_cgroup_record'

Signed-off-by: Wanpeng Li liwp.li...@gmail.com

---
 mm/memblock.c|   12 ++--
 mm/memcontrol.c  |4 ++--
 mm/oom_kill.c|2 +-
 mm/page_cgroup.c |4 ++--
 mm/pagewalk.c|1 -
 mm/percpu-vm.c   |1 -
 6 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/mm/memblock.c b/mm/memblock.c
index 952123e..b84f258 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -540,9 +540,9 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
  * __next_free_mem_range - next function for for_each_free_mem_range()
  * @idx: pointer to u64 loop variable
  * @nid: nid: node selector, %MAX_NUMNODES for all nodes
- * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
- * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
- * @p_nid: ptr to int for nid of the range, can be %NULL
+ * @out_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ * @out_nid: ptr to int for nid of the range, can be %NULL
  *
  * Find the first free area from *@idx which matches @nid, fill the out
  * parameters, and update *@idx for the next iteration.  The lower 32bit of
@@ -616,9 +616,9 @@ void __init_memblock __next_free_mem_range(u64 *idx, int nid,
  * __next_free_mem_range_rev - next function for for_each_free_mem_range_reverse()
  * @idx: pointer to u64 loop variable
  * @nid: nid: node selector, %MAX_NUMNODES for all nodes
- * @p_start: ptr to phys_addr_t for start address of the range, can be %NULL
- * @p_end: ptr to phys_addr_t for end address of the range, can be %NULL
- * @p_nid: ptr to int for nid of the range, can be %NULL
+ * @out_start: ptr to phys_addr_t for start address of the range, can be %NULL
+ * @out_end: ptr to phys_addr_t for end address of the range, can be %NULL
+ * @out_nid: ptr to int for nid of the range, can be %NULL
  *
  * Reverse of __next_free_mem_range().
  */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac35bcc..a9c3d01 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1234,7 +1234,7 @@ int mem_cgroup_inactive_file_is_low(struct lruvec *lruvec)
 
 /**
  * mem_cgroup_margin - calculate chargeable space of a memory cgroup
- * @mem: the memory cgroup
+ * @memcg: the memory cgroup
  *
  * Returns the maximum amount of memory @mem can be charged with, in
  * pages.
@@ -1508,7 +1508,7 @@ static unsigned long mem_cgroup_reclaim(struct mem_cgroup *memcg,
 
 /**
  * test_mem_cgroup_node_reclaimable
- * @mem: the target memcg
+ * @memcg: the target memcg
  * @nid: the node ID to be checked.
  * @noswap : specify true here if the user wants flle only information.
  *
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 416637f..c1956f1 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -366,7 +366,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
 
 /**
  * dump_tasks - dump current memory state of all system tasks
- * @mem: current's memory controller, if constrained
+ * @memcg: current's memory controller, if constrained
  * @nodemask: nodemask passed to page allocator for mempolicy ooms
  *
  * Dumps the current memory state of all eligible tasks.  Tasks not in the same
diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c
index 1ccbd71..eb750f8 100644
--- a/mm/page_cgroup.c
+++ b/mm/page_cgroup.c
@@ -392,7 +392,7 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
 
 /**
  * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
- * @end: swap entry to be cmpxchged
+ * @ent: swap entry to be cmpxchged
  * @old: old id
  * @new: new id
  *
@@ -422,7 +422,7 @@ unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
 /**
  * swap_cgroup_record - record mem_cgroup for this swp_entry.
  * @ent: swap entry to be recorded into
- * @mem: mem_cgroup to be recorded
+ * @id: mem_cgroup to be recorded
  *
  * Returns old value at success, 0 at failure.
  * (Of course, old value can be 0.)
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index aa9701e..6c118d0 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -162,7 +162,6 @@ static int walk_hugetlb_range(struct vm_area_struct *vma,
 
 /**
  * walk_page_range - walk a memory map's page tables with a callback
- * @mm: memory map to walk
  * @addr: starting address
  * @end: ending address
  * @walk: set of callbacks to invoke for each level of the tree
diff --git a/mm/percpu-vm.c b/mm/percpu-vm.c
index 405d331..3707c71 100644
--- a/mm/percpu-vm.c
+++ b/mm/percpu-vm.c
@@ -360,7 +360,6 @@ err_free:
  * @chunk: chunk to depopulate
  * @off: offset to the area to depopulate
  * @size: size of the area to depopulate in bytes
- * @flush: whether to flush cache and tlb

[PATCH 6/7][TRIVIAL][resend] mm: cleanup page reclaim comment error

2012-06-15 Thread Wanpeng Li
From: Wanpeng Li l...@linux.vnet.ibm.com

Since there are five lists in LRU cache, the array nr in get_scan_count
should be:

nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
nr[2] = file inactive pages to scan; nr[3] = file active pages to scan

Signed-off-by: Wanpeng Li liwp.li...@gmail.com
Acked-by: KOSAKI Motohiro kosaki.motoh...@jp.fujitsu.com
Acked-by: Minchan Kim minc...@kernel.org
Reviewed-by: Rik van Riel r...@redhat.com

---
 mm/vmscan.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeb3bc9..ed823df 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1567,7 +1567,8 @@ static int vmscan_swappiness(struct scan_control *sc)
  * by looking at the fraction of the pages scanned we did rotate back
  * onto the active list instead of evict.
  *
- * nr[0] = anon pages to scan; nr[1] = file pages to scan
+ * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
+ * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
  */
 static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
   unsigned long *nr)
-- 
1.7.9.5
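
For reference, the ordering that makes nr[0..3] line up this way comes from the
lru_list enum (abridged and written with explicit numbers here; treat the exact
spelling as an assumption against the mmzone.h of that era). The unevictable
list is the fifth one and is never scanned, which is why nr[] only has four
useful slots:

enum lru_list {
	LRU_INACTIVE_ANON = 0,	/* -> nr[0] */
	LRU_ACTIVE_ANON   = 1,	/* -> nr[1] */
	LRU_INACTIVE_FILE = 2,	/* -> nr[2] */
	LRU_ACTIVE_FILE   = 3,	/* -> nr[3] */
	LRU_UNEVICTABLE   = 4,	/* not scanned by get_scan_count() */
	NR_LRU_LISTS
};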



[PATCH 7/7] mm/memory.c : cleanup the coding style issue

2012-06-15 Thread Wanpeng Li
From: Wanpeng Li l...@linux.vnet.ibm.com

Signed-off-by: Wanpeng Li liwp.li...@gmail.com

---
 mm/memory.c |3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 1b7dc66..195d6e1 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2447,7 +2447,8 @@ static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
return same;
 }
 
-static inline void cow_user_page(struct page *dst, struct page *src, unsigned long va, struct vm_area_struct *vma)
+static inline void cow_user_page(struct page *dst, struct page *src,
+   unsigned long va, struct vm_area_struct *vma)
 {
/*
 * If the source page was a PFN mapping, we don't have
-- 
1.7.9.5



[PATCH 0/7][TRIVIAL][resend] trivial patches

2012-06-15 Thread Wanpeng Li
From: Wanpeng Li l...@linux.vnet.ibm.com

Since these patches were already sent more than a week ago and did not get
any response, I have collected them and sent them out as a patch set.

Wanpeng Li (7)

powerpc: cleanup some kernel doc warning 
x86/kernel: cleanup some kernel doc warning  
drivers/pci: cleanup some kernel doc warning
mm: cleanup on the comments of zone_reclaim_stat
mm: cleanup some kernel doc warning
mm: cleanup page reclaim comment error
mm/memory.c: cleanup coding style issue

Signed-off-by: Wanpeng Li liwp.li...@gmail.com
--
 arch/powerpc/kernel/pci_of_scan.c |1 -
 arch/powerpc/kernel/vio.c |6 +++---
 arch/x86/kernel/kgdb.c|8 
 arch/x86/kernel/uprobes.c |2 +-
 drivers/pci/setup-bus.c   |2 +-
 include/linux/mmzone.h|2 +-
 mm/memblock.c |   12 ++--
 mm/memcontrol.c   |4 ++--
 mm/memory.c   |3 ++-
 mm/oom_kill.c |2 +-
 mm/page_cgroup.c  |4 ++--
 mm/pagewalk.c |1 -
 mm/percpu-vm.c|1 -
 mm/vmscan.c   |3 ++-
 14 files changed, 25 insertions(+), 26 deletions(-)



[PATCH] arch/powerpc/: fix kernel-doc warning

2012-06-11 Thread Wanpeng Li
From: Wanpeng Li l...@linux.vnet.ibm.com

Warning(arch/powerpc/kernel/pci_of_scan.c:210): Excess function parameter 
'node' description in 'of_scan_pci_bridge'
Warning(arch/powerpc/kernel/vio.c:636): No description found for parameter 
'desired'
Warning(arch/powerpc/kernel/vio.c:636): Excess function parameter 'new_desired' 
description in 'vio_cmo_set_dev_desired'
Warning(arch/powerpc/kernel/vio.c:1270): No description found for parameter 
'viodrv'
Warning(arch/powerpc/kernel/vio.c:1270): Excess function parameter 'drv' 
description in '__vio_register_driver'
Warning(arch/powerpc/kernel/vio.c:1289): No description found for parameter 
'viodrv'
Warning(arch/powerpc/kernel/vio.c:1289): Excess function parameter 'driver' 
description in 'vio_unregister_driver'


Signed-off-by: Wanpeng Li l...@linux.vnet.ibm.com
---
 arch/powerpc/kernel/pci_of_scan.c |1 -
 arch/powerpc/kernel/vio.c |6 +++---
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/pci_of_scan.c b/arch/powerpc/kernel/pci_of_scan.c
index 89dde17..d7dd42b 100644
--- a/arch/powerpc/kernel/pci_of_scan.c
+++ b/arch/powerpc/kernel/pci_of_scan.c
@@ -198,7 +198,6 @@ EXPORT_SYMBOL(of_create_pci_dev);
 
 /**
  * of_scan_pci_bridge - Set up a PCI bridge and scan for child nodes
- * @node: device tree node of bridge
  * @dev: pci_dev structure for the bridge
  *
  * of_scan_bus() calls this routine for each PCI bridge that it finds, and
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index cb87301..06cbc30 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -625,7 +625,7 @@ struct dma_map_ops vio_dma_mapping_ops = {
  * vio_cmo_set_dev_desired - Set desired entitlement for a device
  *
  * @viodev: struct vio_dev for device to alter
- * @new_desired: new desired entitlement level in bytes
+ * @desired: new desired entitlement level in bytes
  *
  * For use by devices to request a change to their entitlement at runtime or
  * through sysfs.  The desired entitlement level is changed and a balancing
@@ -1262,7 +1262,7 @@ static int vio_bus_remove(struct device *dev)
 
 /**
  * vio_register_driver: - Register a new vio driver
- * @drv:   The vio_driver structure to be registered.
+ * @viodrv:The vio_driver structure to be registered.
  */
 int __vio_register_driver(struct vio_driver *viodrv, struct module *owner,
  const char *mod_name)
@@ -1282,7 +1282,7 @@ EXPORT_SYMBOL(__vio_register_driver);
 
 /**
  * vio_unregister_driver - Remove registration of vio driver.
- * @driver:The vio_driver struct to be removed form registration
+ * @viodrv:The vio_driver struct to be removed form registration
  */
 void vio_unregister_driver(struct vio_driver *viodrv)
 {
-- 
1.7.9.5
