from:"Christian Borntraeger"

Re: [PATCH 3/5] kvm: Directly account vtime to system on guest switch

2012-10-25 Thread Christian Borntraeger

On 25/10/12 02:51, Frederic Weisbecker wrote:
> Switching to or from guest context is done on ioctl context.
> So by the time we call kvm_guest_enter() or kvm_guest_exit()
> we know we are not running the idle task.
> 
> As a result, we can directly account the cputime using
> vtime_account_system_irqsafe().
> 
> There are two good reasons to do this:
> 
> * We avoid some useless checks on guest switch. It optimizes
> a bit this fast path.
> 
> * In the case of CONFIG_IRQ_TIME_ACCOUNTING, calling vtime_account()
> checks for irq time to account. This is pointless since we know
> we are not in an irq on guest switch. This is wasting cpu cycles
> for no good reason. vtime_account_system() OTOH is a no-op in
> this config option.
> 
> * s390 doesn't disable irqs in its implementation of vtime_account().
> If vtime_account() in kvm races with an irq, the pending time might
> be accounted twice. With vtime_account_system_irqsafe() we are protected.

We disable irqs before we call kvm_guest_enter/exit, see kvm-s390.c:


[...]
if (!kvm_is_ucontrol(vcpu->kvm))
kvm_s390_deliver_pending_interrupts(vcpu);
vcpu->arch.sie_block->icptcode = 0;
local_irq_disable();
kvm_guest_enter();
local_irq_enable();
[...]

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/5] kvm: Directly account vtime to system on guest switch

2012-10-25 Thread Christian Borntraeger

On 25/10/12 09:56, Frederic Weisbecker wrote:
[...]
>>> * s390 doesn't disable irqs in its implementation of vtime_account().
>>> If vtime_account() in kvm races with an irq, the pending time might
>>> be accounted twice. With vtime_account_system_irqsafe() we are protected.
>>
>> We disable irqs before we call kvm_guest_enter/exit, see kvm-s390.c:
>>
>>
>> [...]
>> if (!kvm_is_ucontrol(vcpu->kvm))
>> kvm_s390_deliver_pending_interrupts(vcpu);
>> vcpu->arch.sie_block->icptcode = 0;
>> local_irq_disable();
>> kvm_guest_enter();
>> local_irq_enable();
>> [...]
>>
> 
> Ah ok. Hmm I still need to keep it irqsafe for the other archs though,
> as it is currently with vtime_account(). So perhaps I can remove your
> local_irq_disable there and use vtime_account_system_irqsafe()
> instead?

Yes. We added the local_irq_disable code just for the vtime accounting.




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

"avoid overflows in kernel/time" breakes my ondemand scheduler

2008-02-13 Thread Christian Borntraeger

Hello H. Peter,

the patch "avoid overflows in kernel/time" makes the ondemand cpufreq 
scheduler unusable. It looks like the cpufreq scheduler takes minutes to 
react on load changes.

I looked at the patch but did not found an obvious problem. Reverting this 
patch seems to fix the problem.

Do you have any ideas?

Christian 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/4] PF: Add FAULT_FLAG_RETRY_NOWAIT for guest fault

2013-07-09 Thread Christian Borntraeger

On 09/07/13 17:23, Gleb Natapov wrote:
> On Tue, Jul 09, 2013 at 03:56:44PM +0200, Dominik Dingel wrote:
>> In case of a fault retry exit sie64() with gmap_fault indication for the
>> running thread set. This makes it possible to handle async page faults
>> without the need for mm notifiers.
>>
>> Based on a patch from Martin Schwidefsky.
>>
> For that we will obviously need Christian and Cornelia ACKs. Or it can
> go in via S390 tree.
> 

>> Signed-off-by: Dominik Dingel 
Acked-by: Christian Borntraeger 

Do you want me or Conny to apply these patches add a signoff and resend them?
Otherwise I will review the s390 specific patches and ack them individually.

Christian



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/4] PF: Provide additional direct page notification

2013-07-09 Thread Christian Borntraeger

On 09/07/13 15:56, Dominik Dingel wrote:
> By setting a Kconfig option, the architecture can control when
> guest notifications will be presented by the apf backend.
> So there is the default batch mechanism, working as before, where the vcpu 
> thread
> should pull in this information. On the other hand there is now the direct
> mechanism, this will directly push the information to the guest.
> 
> Still the vcpu thread should call check_completion to cleanup leftovers,
> that leaves most of the common code untouched.
> 
> Signed-off-by: Dominik Dingel 

Acked-by: Christian Borntraeger  
for the "why". We want to use the existing architectured interface.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/4] PF: Async page fault support on s390

2013-07-10 Thread Christian Borntraeger

On 09/07/13 15:56, Dominik Dingel wrote:
> This patch enables async page faults for s390 kvm guests.
> It provides the userspace API to enable, disable or get the status of this
> feature. Also it includes the diagnose code, called by the guest to enable
> async page faults.
> 
> The async page faults will use an already existing guest interface for this
> purpose, as described in "CP Programming Services (SC24-6084)".
> 
> Signed-off-by: Dominik Dingel 

Re-reading this patch again, found some thing that you should take 
care of (nothing major, just small details). Sorry for not seeing them
earlier.

[...]
> + case 1: /* CANCEL */
> + if (vcpu->run->s.regs.gprs[rx] & 7)
> + return kvm_s390_inject_program_int(vcpu, 
> PGM_ADDRESSING);
> +
> + vcpu->run->s.regs.gprs[ry] = 0;
> +
> + if (vcpu->arch.pfault_token == KVM_S390_PFAULT_TOKEN_INVALID)
> + vcpu->run->s.regs.gprs[ry] = 1;
> +
> + vcpu->arch.pfault_token = KVM_S390_PFAULT_TOKEN_INVALID;
> + rc = 0;
> + break;

Dont we need a kvm_clear_async_pf_completion_queue(vcpu) or similar here as 
well?
The cancel function is supposed to purge all outstanding requests (those were 
no 
completion signal was made pending yet)

[...]
> @@ -264,6 +292,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
>  {
>   VCPU_EVENT(vcpu, 3, "%s", "free cpu");
>   trace_kvm_s390_destroy_vcpu(vcpu->vcpu_id);
> + kvm_clear_async_pf_completion_queue(vcpu);
>   if (!kvm_is_ucontrol(vcpu->kvm)) {
>   clear_bit(63 - vcpu->vcpu_id,
> (unsigned long *) &vcpu->kvm->arch.sca->mcn);
> @@ -313,6 +342,9 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  /* Section: vcpu related */
>  int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
>  {
> + vcpu->arch.pfault_token = KVM_S390_PFAULT_TOKEN_INVALID;
> + kvm_clear_async_pf_completion_queue(vcpu);
> + kvm_async_pf_wakeup_all(vcpu);
>   if (kvm_is_ucontrol(vcpu->kvm)) {
>   vcpu->arch.gmap = gmap_alloc(current->mm);
>   if (!vcpu->arch.gmap)

We should also reset pfault handling for CPU reset, no?

> @@ -691,10 +723,75 @@ static void kvm_arch_fault_in_sync(struct kvm_vcpu 
> *vcpu)
>   up_read(&mm->mmap_sem);
>  }
> 
> +static void __kvm_inject_pfault_token(struct kvm_vcpu *vcpu, bool 
> start_token,
> +   unsigned long token)
> +{
> + struct kvm_s390_interrupt inti;
> + inti.type = start_token ? KVM_S390_INT_PFAULT_INIT :
> +   KVM_S390_INT_PFAULT_DONE;
> + inti.parm64 = token;
> + if (kvm_s390_inject_vcpu(vcpu, &inti))
> + WARN(1, "pfault interrupt injection failed");
> +}

The PFAULT_DONE is architectured as a floating interrupt (can happen
on other CPUs).

[...]
> --- a/arch/s390/kvm/sigp.c
> +++ b/arch/s390/kvm/sigp.c
> @@ -187,6 +187,8 @@ static int __sigp_set_arch(struct kvm_vcpu *vcpu, u32 
> parameter)
>  {
>   int rc;
> 
> + vcpu->arch.pfault_token = KVM_S390_PFAULT_TOKEN_INVALID;
> +

sigp set architecture affects all cpus, so we must reset pfault via
kvm_for_each_vcpu, I guess.

Otherwise patch looks good. I guess only one more iteration
Christian



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 3/4] PF: Provide additional direct page notification

2013-07-10 Thread Christian Borntraeger

On 10/07/13 12:39, Alexander Graf wrote:
> 
> On 09.07.2013, at 18:01, Christian Borntraeger wrote:
> 
>> On 09/07/13 15:56, Dominik Dingel wrote:
>>> By setting a Kconfig option, the architecture can control when
>>> guest notifications will be presented by the apf backend.
>>> So there is the default batch mechanism, working as before, where the vcpu 
>>> thread
>>> should pull in this information. On the other hand there is now the direct
>>> mechanism, this will directly push the information to the guest.
>>>
>>> Still the vcpu thread should call check_completion to cleanup leftovers,
>>> that leaves most of the common code untouched.
>>>
>>> Signed-off-by: Dominik Dingel 
>>
>> Acked-by: Christian Borntraeger  
>> for the "why". We want to use the existing architectured interface.
> 
> Shouldn't this be a runtime option?

This is an a) or b) depending on the architecture. So making this a kconfig
option is the most sane approach no?

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/4] PF: Async page fault support on s390

2013-07-11 Thread Christian Borntraeger

On 11/07/13 11:04, Gleb Natapov wrote:
> On Wed, Jul 10, 2013 at 02:59:55PM +0200, Dominik Dingel wrote:
>> This patch enables async page faults for s390 kvm guests.
>> It provides the userspace API to enable, disable or get the status of this
>> feature. Also it includes the diagnose code, called by the guest to enable
>> async page faults.
>>
>> The async page faults will use an already existing guest interface for this
>> purpose, as described in "CP Programming Services (SC24-6084)".
>>
>> Signed-off-by: Dominik Dingel 
> Christian, looks good now?

Looks good, but I just had a  discussion with Dominik about several other cases 
(guest driven reboot, qemu driven reboot, life migration). This patch should 
allow all these cases (independent from this patch we need an ioctl to flush the
list of pending interrupts to do so, but reboot is currently broken in that
regard anyway - patch is currently being looked at)

We are currently discussion if we should get rid of the APF_STATUS and let 
the kernel wait for outstanding page faults before returning from KVM_RUN
or if we go with this patch and let userspace wait for completion. 

Will discuss this with Dominik, Conny and Alex. So lets defer that till next
week, ok?


> 
>> ---
>>  Documentation/s390/kvm.txt   |  24 +
>>  arch/s390/include/asm/kvm_host.h |  22 
>>  arch/s390/include/uapi/asm/kvm.h |  10 
>>  arch/s390/kvm/Kconfig|   2 +
>>  arch/s390/kvm/Makefile   |   2 +-
>>  arch/s390/kvm/diag.c |  63 +++
>>  arch/s390/kvm/interrupt.c|  43 +---
>>  arch/s390/kvm/kvm-s390.c | 107 
>> ++-
>>  arch/s390/kvm/kvm-s390.h |   4 ++
>>  arch/s390/kvm/sigp.c |   6 +++
>>  include/uapi/linux/kvm.h |   2 +
>>  11 files changed, 276 insertions(+), 9 deletions(-)
>>
>> diff --git a/Documentation/s390/kvm.txt b/Documentation/s390/kvm.txt
>> index 85f3280..707b7e9 100644
>> --- a/Documentation/s390/kvm.txt
>> +++ b/Documentation/s390/kvm.txt
>> @@ -70,6 +70,30 @@ floating interrupts are:
>>  KVM_S390_INT_VIRTIO
>>  KVM_S390_INT_SERVICE
>>  
>> +ioctl:  KVM_S390_APF_ENABLE:
>> +args:   none
>> +This ioctl is used to enable the async page fault interface. So in a
>> +host page fault case the host can now submit pfault tokens to the guest.
>> +
>> +ioctl:  KVM_S390_APF_DISABLE:
>> +args:   none
>> +This ioctl is used to disable the async page fault interface. From this 
>> point
>> +on no new pfault tokens will be issued to the guest. Already existing async
>> +page faults are not covered by this and will be normally handled.
>> +
>> +ioctl:  KVM_S390_APF_STATUS:
>> +args:   none
>> +This ioctl allows the userspace to get the current status of the APF 
>> feature.
>> +The main purpose for this, is to ensure that no pfault tokens will be lost
>> +during live migration or similar management operations.
>> +The possible return values are:
>> +KVM_S390_APF_DISABLED_NON_PENDING
>> +KVM_S390_APF_DISABLED_PENDING
>> +KVM_S390_APF_ENABLED_NON_PENDING
>> +KVM_S390_APF_ENABLED_PENDING
>> +Caution: if KVM_S390_APF is enabled the PENDING status could be already 
>> changed
>> +as soon as the ioctl returns to userspace.
>> +
>>  3. ioctl calls to the kvm-vcpu file descriptor
>>  KVM does support the following ioctls on s390 that are common with other
>>  architectures and do behave the same:
>> diff --git a/arch/s390/include/asm/kvm_host.h 
>> b/arch/s390/include/asm/kvm_host.h
>> index cd30c3d..e8012fc 100644
>> --- a/arch/s390/include/asm/kvm_host.h
>> +++ b/arch/s390/include/asm/kvm_host.h
>> @@ -257,6 +257,10 @@ struct kvm_vcpu_arch {
>>  u64 stidp_data;
>>  };
>>  struct gmap *gmap;
>> +#define KVM_S390_PFAULT_TOKEN_INVALID   (-1UL)
>> +unsigned long pfault_token;
>> +unsigned long pfault_select;
>> +unsigned long pfault_compare;
>>  };
>>  
>>  struct kvm_vm_stat {
>> @@ -282,6 +286,24 @@ static inline bool kvm_is_error_hva(unsigned long addr)
>>  return addr == KVM_HVA_ERR_BAD;
>>  }
>>  
>> +#define ASYNC_PF_PER_VCPU   64
>> +struct kvm_vcpu;
>> +struct kvm_async_pf;
>> +struct kvm_arch_async_pf {
>> +unsigned long pfault_token;
>> +};
>> +
>> +bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu);
>> +
>> +void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu,
>> +   struct kvm_async_pf *work);
>> +
>> +void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
>> + struct kvm_async_pf *work);
>> +
>> +void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>> + struct kvm_async_pf *work);
>> +
>>  extern int sie64a(struct kvm_s390_sie_block *, u64 *);
>>  extern char sie_exit;
>>  #endif
>> diff --git a/arch/s390/include/uapi/asm/kvm.h 
>> b/arch/s390/include/uapi/asm/kvm.h
>> index d25da59..b6c83e0 100644
>> --- a/arch/s390/include/uapi

Re: [PATCH] ext4: fix a big-endian bug when an extent is zeroed out

2013-04-22 Thread Christian Borntraeger

On 20/04/13 17:19, Theodore Ts'o wrote:
> On Mon, Apr 08, 2013 at 11:05:11PM -0400, CAI Qian wrote:
>> I can help run xfstests for ext4 dev tree on x64, Power7, Z10 and
>> KVM platforms with back-storage like SAN/multipath, iSCSI and FCoE.
>> I plan to run this weekly and setup a wiki page to update the testing
>> status by every Friday.
> 
> Hi CAI,
> 
> Sorry for not getting back to you sooner; I was at Collaboration
> Summit and LSF/MM last week.
> 
> It would be great if you could help run xfstests on the ext4 dev tree
> on various platforms.  We don't have any coverage on Power7 or
> s390/Z10 at the moment, so that would be especially welcome.  Coverage
> on alternate storage backends can be interesting in finding timing
> problems so they would be valuable as well.

It is probably easier to get access to a Power system (gcc compile farm?).
Having said that, there is a service available to get access to a System z
for open source development:
http://www-03.ibm.com/systems/z/os/linux/support/community.html

When we did the valgrind port we used that a lot to gave access to 
non-IBMers.
Dont know if it is ok to install a private kernel, though. Will double
check.

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 4/4] PF: Async page fault support on s390

2013-07-18 Thread Christian Borntraeger

On 18/07/13 15:57, Paolo Bonzini wrote:
> Il 11/07/2013 12:41, Christian Borntraeger ha scritto:
>> On 11/07/13 11:04, Gleb Natapov wrote:
>>> On Wed, Jul 10, 2013 at 02:59:55PM +0200, Dominik Dingel wrote:
>>>> This patch enables async page faults for s390 kvm guests.
>>>> It provides the userspace API to enable, disable or get the status of this
>>>> feature. Also it includes the diagnose code, called by the guest to enable
>>>> async page faults.
>>>>
>>>> The async page faults will use an already existing guest interface for this
>>>> purpose, as described in "CP Programming Services (SC24-6084)".
>>>>
>>>> Signed-off-by: Dominik Dingel 
>>> Christian, looks good now?
>>
>> Looks good, but I just had a  discussion with Dominik about several other 
>> cases 
>> (guest driven reboot, qemu driven reboot, life migration). This patch should 
>> allow all these cases (independent from this patch we need an ioctl to flush 
>> the
>> list of pending interrupts to do so, but reboot is currently broken in that
>> regard anyway - patch is currently being looked at)
>>
>> We are currently discussion if we should get rid of the APF_STATUS and let 
>> the kernel wait for outstanding page faults before returning from KVM_RUN
>> or if we go with this patch and let userspace wait for completion. 
>>
>> Will discuss this with Dominik, Conny and Alex. So lets defer that till next
>> week, ok?
> 
> Let us know if we should wait for a v5. :)

Yes, there will be a v5


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC][PATCH 0/2] s390/kvm: add kvm support for guest page hinting v2

2013-07-25 Thread Christian Borntraeger

On 25/07/13 10:54, Martin Schwidefsky wrote:
> v1->v2:
>  - found a way to simplify the common code patch
> 
> Linux on s390 as a guest under z/VM has been using the guest page
> hinting interface (alias collaborative memory management) for a long
> time. The full version with volatile states has been deemed to be too
> complicated (see the old discussion about guest page hinting e.g. on
> http://marc.info/?l=linux-mm&m=123816662017742&w=2).
> What is currently implemented for the guest is the unused and stable
> states to mark unallocated pages as freely available to the host.
> This works just fine with z/VM as the host.
> 
> The two patches in this series implement the guest page hinting
> interface for the unused and stable states in the KVM host.
> Most of the code specific to s390 but there is a common memory
> management part as well, see patch #1.
> 
> The code is working stable now, from my point of view this is ready
> for prime-time.
> 
> Konstantin Weitz (2):
>   mm: add support for discard of unused ptes
>   s390/kvm: support collaborative memory management

Can you also add the patch from our tree that reset the usage states
on reboot (diag 308 subcode 3 and 4)?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] mm: add support for discard of unused ptes

2013-07-25 Thread Christian Borntraeger

On 25/07/13 10:54, Martin Schwidefsky wrote:
> From: Konstantin Weitz 
> 
> In a virtualized environment and given an appropriate interface the guest
> can mark pages as unused while they are free (for the s390 implementation
> see git commit 45e576b1c3d00206 "guest page hinting light"). For the host
> the unused state is a property of the pte.
> 
> This patch adds the primitive 'pte_unused' and code to the host swap out
> handler so that pages marked as unused by all mappers are not swapped out
> but discarded instead, thus saving one IO for swap out and potentially
> another one for swap in.
> 
> [ Martin Schwidefsky: patch reordering and simplification ]
> 
> Signed-off-by: Konstantin Weitz 
> Signed-off-by: Martin Schwidefsky 
Reviewed-by: Christian Borntraeger 

> ---
>  include/asm-generic/pgtable.h |   13 +
>  mm/rmap.c |   10 ++
>  2 files changed, 23 insertions(+)
> 
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index 2f47ade..ec540c5 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -193,6 +193,19 @@ static inline int pte_same(pte_t pte_a, pte_t pte_b)
>  }
>  #endif
> 
> +#ifndef __HAVE_ARCH_PTE_UNUSED
> +/*
> + * Some architectures provide facilities to virtualization guests
> + * so that they can flag allocated pages as unused. This allows the
> + * host to transparently reclaim unused pages. This function returns
> + * whether the pte's page is unused.
> + */
> +static inline int pte_unused(pte_t pte)
> +{
> + return 0;
> +}
> +#endif
> +
>  #ifndef __HAVE_ARCH_PMD_SAME
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index cd356df..2291f25 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1234,6 +1234,16 @@ int try_to_unmap_one(struct page *page, struct 
> vm_area_struct *vma,
>   }
>   set_pte_at(mm, address, pte,
>  swp_entry_to_pte(make_hwpoison_entry(page)));
> + } else if (pte_unused(pteval)) {
> + /*
> +  * The guest indicated that the page content is of no
> +  * interest anymore. Simply discard the pte, vmscan
> +  * will take care of the rest.
> +  */
> + if (PageAnon(page))
> + dec_mm_counter(mm, MM_ANONPAGES);
> + else
> + dec_mm_counter(mm, MM_FILEPAGES);
>   } else if (PageAnon(page)) {
>   swp_entry_t entry = { .val = page_private(page) };
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 2/2] s390/kvm: support collaborative memory management

2013-07-25 Thread Christian Borntraeger

On 25/07/13 10:54, Martin Schwidefsky wrote:
> From: Konstantin Weitz 
> 
> This patch enables Collaborative Memory Management (CMM) for kvm
> on s390. CMM allows the guest to inform the host about page usage
> (see arch/s390/mm/cmm.c). The host uses this information to avoid
> swapping in unused pages in the page fault handler. Further, a CPU
> provided list of unused invalid pages is processed to reclaim swap
> space of not yet accessed unused pages.
> 
> [ Martin Schwidefsky: patch reordering and cleanup ]
> 
> Signed-off-by: Konstantin Weitz 
> Signed-off-by: Martin Schwidefsky 

Two things to consider: life migration and reset

When we implement life migration, we need to add some additional magic for
userspace to query/set unused state. But this can be a followup patch, 
whenever this becomes necessary.

As of today it should be enough to add some code to the diag308 handler to
make reset save. For other kinds of reset (e.g. those for kdump) we need
to make this accessible to userspace. Again, this can be added later on
when we implement the other missing pieces for kdump and friends.

So

Reviewed-by: Christian Borntraeger 
Tested-by: Christian Borntraeger 



> ---
>  arch/s390/include/asm/kvm_host.h |5 ++-
>  arch/s390/include/asm/pgtable.h  |   24 
>  arch/s390/kvm/kvm-s390.c |   25 +
>  arch/s390/kvm/kvm-s390.h |2 +
>  arch/s390/kvm/priv.c |   41 
>  arch/s390/mm/pgtable.c   |   77 
> ++
>  6 files changed, 173 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/s390/include/asm/kvm_host.h 
> b/arch/s390/include/asm/kvm_host.h
> index 3238d40..de6450e 100644
> --- a/arch/s390/include/asm/kvm_host.h
> +++ b/arch/s390/include/asm/kvm_host.h
> @@ -113,7 +113,9 @@ struct kvm_s390_sie_block {
>   __u64   gbea;   /* 0x0180 */
>   __u8reserved188[24];/* 0x0188 */
>   __u32   fac;/* 0x01a0 */
> - __u8reserved1a4[92];/* 0x01a4 */
> + __u8reserved1a4[20];/* 0x01a4 */
> + __u64   cbrlo;  /* 0x01b8 */
> + __u8reserved1c0[64];/* 0x01c0 */
>  } __attribute__((packed));
> 
>  struct kvm_vcpu_stat {
> @@ -149,6 +151,7 @@ struct kvm_vcpu_stat {
>   u32 instruction_stsi;
>   u32 instruction_stfl;
>   u32 instruction_tprot;
> + u32 instruction_essa;
>   u32 instruction_sigp_sense;
>   u32 instruction_sigp_sense_running;
>   u32 instruction_sigp_external_call;
> diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
> index 75fb726..65d48b8 100644
> --- a/arch/s390/include/asm/pgtable.h
> +++ b/arch/s390/include/asm/pgtable.h
> @@ -227,6 +227,7 @@ extern unsigned long MODULES_END;
>  #define _PAGE_SWR0x008   /* SW pte referenced bit */
>  #define _PAGE_SWW0x010   /* SW pte write bit */
>  #define _PAGE_SPECIAL0x020   /* SW associated with special 
> page */
> +#define _PAGE_UNUSED 0x040   /* SW bit for ptep_clear_flush() */
>  #define __HAVE_ARCH_PTE_SPECIAL
> 
>  /* Set of bits not changed in pte_modify */
> @@ -375,6 +376,12 @@ extern unsigned long MODULES_END;
> 
>  #endif /* CONFIG_64BIT */
> 
> +/* Guest Page State used for virtualization */
> +#define _PGSTE_GPS_ZERO  0x8000UL
> +#define _PGSTE_GPS_USAGE_MASK0x0300UL
> +#define _PGSTE_GPS_USAGE_STABLE 0xUL
> +#define _PGSTE_GPS_USAGE_UNUSED 0x0100UL
> +
>  /*
>   * A user page table pointer has the space-switch-event bit, the
>   * private-space-control bit and the storage-alteration-event-control
> @@ -590,6 +597,12 @@ static inline int pte_file(pte_t pte)
>   return (pte_val(pte) & mask) == _PAGE_TYPE_FILE;
>  }
> 
> +static inline int pte_swap(pte_t pte)
> +{
> + unsigned long mask = _PAGE_RO | _PAGE_INVALID | _PAGE_SWT | _PAGE_SWX;
> + return (pte_val(pte) & mask) == _PAGE_TYPE_SWAP;
> +}
> +
>  static inline int pte_special(pte_t pte)
>  {
>   return (pte_val(pte) & _PAGE_SPECIAL);
> @@ -794,6 +807,7 @@ unsigned long gmap_translate(unsigned long address, 
> struct gmap *);
>  unsigned long __gmap_fault(unsigned long address, struct gmap *);
>  unsigned long gmap_fault(unsigned long address, struct gmap *);
>  void gmap_discard(unsigned long from, unsigned long to, struct gmap *);
> +void __gmap_zap(unsigned long address, struct gmap *);
> 
>  void gmap_register_ipte_notifier(struct gmap_notifier *);
>  void gmap_unregister_ipte_notifier(struct gmap_notifier *);
> @@ -825,6 +839,7 @@ static inline void set_pte_at(struct mm_struct *m

Re: [ 17/53] s390/kvm: Fix store status for ACRS/FPRS

2013-02-28 Thread Christian Borntraeger

On 28/02/13 23:26, Jiri Slaby wrote:
> On 02/27/2013 12:57 AM, Greg Kroah-Hartman wrote:
>> 3.0-stable review patch.  If anyone has any objections, please let me know.
>>
>> --
>>
>> From: Christian Borntraeger 
>>
>> commit 15bc8d8457875f495c59d933b05770ba88d1eacb upstream.
>>
>> On store status we need to copy the current state of registers
>> into a save area. Currently we might save stale versions:
>> The sie state descriptor doesnt have fields for guest ACRS,FPRS,
>> those registers are simply stored in the host registers. The host
>> program must copy these away if needed. We do that in vcpu_put/load.
>>
>> If we now do a store status in KVM code between vcpu_put/load, the
>> saved values are not up-to-date. Lets collect the ACRS/FPRS before
>> saving them.
>>
>> This also fixes some strange problems with hotplug and virtio-ccw,
>> since the low level machine check handler (on hotplug a machine check
>> will happen) will revalidate all registers with the content of the
>> save area.
>>
>> Signed-off-by: Christian Borntraeger 
>> Signed-off-by: Gleb Natapov 
>> Signed-off-by: Greg Kroah-Hartman 
>>
>> ---
>>  arch/s390/kvm/kvm-s390.c |8 
>>  1 file changed, 8 insertions(+)
>>
>> --- a/arch/s390/kvm/kvm-s390.c
>> +++ b/arch/s390/kvm/kvm-s390.c
>> @@ -584,6 +584,14 @@ int kvm_s390_vcpu_store_status(struct kv
>>  } else
>>  prefix = 0;
>>  
>> +/*
>> + * The guest FPRS and ACRS are in the host FPRS/ACRS due to the lazy
>> + * copying in vcpu load/put. Lets update our copies before we save
>> + * it into the save area
>> + */
>> +save_fp_regs(&vcpu->arch.guest_fpregs);
>> +save_access_regs(vcpu->run->s.regs.acrs);
> 
> kvm_run structure does not have kvm_sync_regs in it in 3.0 yet. So this
> fails with:
> arch/s390/kvm/kvm-s390.c: In function 'kvm_s390_vcpu_store_status':
> arch/s390/kvm/kvm-s390.c:593: error: 'struct kvm_run' has no member
> named 's'
> 
> I believe the fix is just to remove save_access_regs, right?

Before the sync reg changes, the ACRS were saved in the vcpu->arch.
So the fix would look like 

save_access_regs(vcpu->arch.guest_acrs);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC V5 0/3] kvm: Improving directed yield in PLE handler

2012-07-23 Thread Christian Borntraeger

On 22/07/12 14:43, Avi Kivity wrote:
> On 07/22/2012 03:34 PM, Raghavendra K T wrote:
>>
>> Thanks Marcelo for the review. Avi, Rik, Christian, please let me know
>> if this series looks good now.
>>
> 
> It looks fine to me.  Christian, is this okay for s390?
> 
Tested-by: Christian Borntraeger  # on s390x

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler

2012-07-11 Thread Christian Borntraeger

On 11/07/12 11:06, Avi Kivity wrote:
[...]
>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for 
>> spinlocks, though.
> 
> Perhaps x86 should copy this.

See arch/s390/lib/spinlock.c
The basic idea is using several heuristics:
- loop for a given amount of loops
- check if the lock holder is currently scheduled by the hypervisor
  (smp_vcpu_scheduled, which uses the sigp sense running instruction)
  Dont know if such thing is available for x86. It must be a lot cheaper
  than a guest exit to be useful
- if lock holder is not running and we looped for a while do a directed
  yield to that cpu.

> 
>> So there is no win here, but there are other cases were diag44 is used, e.g. 
>> cpu_relax.
>> I have to double check with others, if these cases are critical, but for 
>> now, it seems 
>> that your dummy implementation  for s390 is just fine. After all it is a 
>> no-op until 
>> we implement something.
> 
> Does the data structure make sense for you?  If so we can move it to
> common code (and manage it in kvm_vcpu_on_spin()).  We can guard it with
> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
> have to pay anything.

Ignoring the name, yes the data structure itself seems based on the algorithm
and not on arch specific things. That should work. If we move that to common 
code then s390 will use that scheme automatically for the cases were we call 
kvm_vcpu_on_spin(). All others archs as well.

So this would probably improve guests that uses cpu_relax, for example
stop_machine_run. I have no measurements, though.

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler

2012-07-11 Thread Christian Borntraeger

On 11/07/12 13:04, Avi Kivity wrote:
> On 07/11/2012 01:17 PM, Christian Borntraeger wrote:
>> On 11/07/12 11:06, Avi Kivity wrote:
>> [...]
>>>> Almost all s390 kernels use diag9c (directed yield to a given guest cpu) 
>>>> for spinlocks, though.
>>>
>>> Perhaps x86 should copy this.
>>
>> See arch/s390/lib/spinlock.c
>> The basic idea is using several heuristics:
>> - loop for a given amount of loops
>> - check if the lock holder is currently scheduled by the hypervisor
>>   (smp_vcpu_scheduled, which uses the sigp sense running instruction)
>>   Dont know if such thing is available for x86. It must be a lot cheaper
>>   than a guest exit to be useful
> 
> We could make it available via shared memory, updated using preempt
> notifiers.  Of course piling on more pv makes this less attractive.
> 
>> - if lock holder is not running and we looped for a while do a directed
>>   yield to that cpu.
>>
>>>
>>>> So there is no win here, but there are other cases were diag44 is used, 
>>>> e.g. cpu_relax.
>>>> I have to double check with others, if these cases are critical, but for 
>>>> now, it seems 
>>>> that your dummy implementation  for s390 is just fine. After all it is a 
>>>> no-op until 
>>>> we implement something.
>>>
>>> Does the data structure make sense for you?  If so we can move it to
>>> common code (and manage it in kvm_vcpu_on_spin()).  We can guard it with
>>> CONFIG_KVM_HAVE_CPU_RELAX_INTERCEPT or something, so other archs don't
>>> have to pay anything.
>>
>> Ignoring the name,
> 
> What name would you suggest?

maybe vcpu_no_progress instead of pause_loop_exited

> 
>> yes the data structure itself seems based on the algorithm
>> and not on arch specific things. That should work. If we move that to common 
>> code then s390 will use that scheme automatically for the cases were we call 
>> kvm_vcpu_on_spin(). All others archs as well.
> 
> ARM doesn't have an instruction for cpu_relax(), so it can't intercept
> it.  Given ppc's dislike of overcommit, and the way it implements
> cpu_relax() by adjusting hw thread priority, I'm guessing it doesn't
> intercept those either, but I'm copying the ppc people in case I'm
> wrong.  So it's s390 and x86.
> 
>> So this would probably improve guests that uses cpu_relax, for example
>> stop_machine_run. I have no measurements, though.
> 
> smp_call_function() too (though that can be converted to directed yield
> too).  It seems worthwhile.

Indeed. For those places where is is possible I would like to see an
architecture primitive for directed yield. That could be useful for 
other places as well (e.g. maybe lib/spinlock_debug.c, which has no 
yielding at all)

Christian



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler

2012-07-11 Thread Christian Borntraeger

On 11/07/12 13:51, Raghavendra K T wrote:
 Almost all s390 kernels use diag9c (directed yield to a given guest cpu) 
 for spinlocks, though.
>>>
>>> Perhaps x86 should copy this.
>>
>> See arch/s390/lib/spinlock.c
>> The basic idea is using several heuristics:
>> - loop for a given amount of loops
>> - check if the lock holder is currently scheduled by the hypervisor
>>(smp_vcpu_scheduled, which uses the sigp sense running instruction)
>>Dont know if such thing is available for x86. It must be a lot cheaper
>>than a guest exit to be useful
> 
> Unfortunately we do not have information on lock-holder.

That would be an independent patch and requires guest changes.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC V3 2/3] kvm: Note down when cpu relax intercepted or pause loop exited

2012-07-12 Thread Christian Borntraeger

On 12/07/12 21:18, Raghavendra K T wrote:
> +#ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
[...]
> + struct {
> + bool cpu_relax_intercepted;
> + bool dy_eligible;
> + } ple;
> +#endif
[...]
>   }
>   vcpu->run = page_address(page);
> + vcpu->ple.cpu_relax_intercepted = false;
> + vcpu->ple.dy_eligible = false;

This struct is only defined if CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT is set, but 
here it
is always accessed. Will break on !x86 && !s390.
> 
>   r = kvm_arch_vcpu_init(vcpu);
>   if (r < 0)
> @@ -1577,6 +1579,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>   int pass;
>   int i;
> 
> + me->ple.cpu_relax_intercepted = true;

dito
>   /*
>* We boost the priority of a VCPU that is runnable but not
>* currently running, because it got preempted by something
> @@ -1602,6 +1605,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>   }
>   }
>   }
> + me->ple.cpu_relax_intercepted = false;

again.

maybe define static inline access functions in kvm_host.h that are no-ops
if CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT is not set.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC V3 2/3] kvm: Note down when cpu relax intercepted or pause loop exited

2012-07-12 Thread Christian Borntraeger

On 13/07/12 05:35, Raghavendra K T wrote:
> Yes! I forgot about archs in init function.
> How about having
> #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> vcpu->ple.cpu_relax_intercepted = false;
> vcpu->ple.dy_eligible = false;
> #endif
> 
> This would solve all the problem.

No, you need to mask all places

> 
>>>
>>>   r = kvm_arch_vcpu_init(vcpu);
>>>   if (r<  0)
>>> @@ -1577,6 +1579,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>>>   int pass;
>>>   int i;
>>>
>>> +me->ple.cpu_relax_intercepted = true;
>>
>> dito
> 
> currently vcpu_on_spin is used only by x86 and s390. so if some other
> arch in future uses vcpu_on_spin, I believe they also have to enable
> CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> what do you think?

...because  this function is compiled no matter if called or not.
> 
>> maybe define static inline access functions in kvm_host.h that are no-ops
>> if CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT is not set.

As I already said, can you have a look at using access functions?




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler

2012-07-09 Thread Christian Borntraeger

On 09/07/12 08:20, Raghavendra K T wrote:
> Currently Pause Looop Exit (PLE) handler is doing directed yield to a
> random VCPU on PL exit. Though we already have filtering while choosing
> the candidate to yield_to, we can do better.
> 
> Problem is, for large vcpu guests, we have more probability of yielding
> to a bad vcpu. We are not able to prevent directed yield to same guy who
> has done PL exit recently, who perhaps spins again and wastes CPU.
> 
> Fix that by keeping track of who has done PL exit. So The Algorithm in series
> give chance to a VCPU which has:

We could do the same for s390. The appropriate exit would be diag44 (yield to 
hypervisor).

Almost all s390 kernels use diag9c (directed yield to a given guest cpu) for 
spinlocks, though.
So there is no win here, but there are other cases were diag44 is used, e.g. 
cpu_relax.
I have to double check with others, if these cases are critical, but for now, 
it seems 
that your dummy implementation  for s390 is just fine. After all it is a no-op 
until 
we implement something.

Thanks

Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] mm: fix XIP file writes

2007-12-10 Thread Christian Borntraeger

Hi Nick,

> Here we go. See, brd already found a bug ;)
> Can you apply the ext2 XIP patch too? And I'll resend the brd XIP patch.
[...]
> Writing to XIP files at a non-page-aligned offset results in data corruption
> because the writes were always sent to the start of the page.
[...]
> @@ -314,7 +314,7 @@ __xip_file_write(struct file *filp, cons
>   fault_in_pages_readable(buf, bytes);
>   kaddr = kmap_atomic(page, KM_USER0);
>   copied = bytes -
> - __copy_from_user_inatomic_nocache(kaddr, buf, bytes);
> + __copy_from_user_inatomic_nocache(kaddr + offset, buf, 
> bytes);
>   kunmap_atomic(kaddr, KM_USER0);
>   flush_dcache_page(page);

I asked myself why this problem never happened before. So I asked our testers
to reproduce this problem on 2.6.23 and service levels. As the testcase did
not trigger, I looked into the 2.6.23 code. This problem was introduced by
commit 4a9e5ef1f4f15205e477817a5cefc34bd3f65f55 (mm: write iovec cleanup from
Nick Piggin) during 2.6.24-rc:
snip---
-   copied = filemap_copy_from_user(page, offset, buf, bytes);
[...]
+   copied = bytes -
+   __copy_from_user_inatomic_nocache(kaddr, buf, bytes);
---

So yes, its good to have xip on brd. It even tests your changes ;-)
Good news is, that we dont need anything for stable.

Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RFC] Proposed new directory layout for kvm and virtualization

2007-12-11 Thread Christian Borntraeger

Am Dienstag, 11. Dezember 2007 schrieb Avi Kivity:
> KVM is due to receive support for multiple architectures (ppc, ia64, and 
> s390, in addition to the existing x86), hopefully in time for the 2.6.25 
> merge window.  It is awkward to place the new arch support in 
> drivers/kvm/, so I'd like to propose the following new layout:
> 
>   virt/ top-level directory for hypervisors
>   virt/kvm/ kvm common code
>   virt/lguest/  the other hypervisor
>   arch/*/kvm/   arch dependent kvm code
>   include/linux/kvm.h   arch independent interface
>   include/asm/kvm.h arch dependent interface
>   include/linux/kvm_para.h  arch independent guest interface
>   include/asm/kvm_para.harch dependent guest interface
> 
> The include/ hierarchy is already in place; I'm including it for 
> completeness.

For completeness, where do we want to put the drivers?
drivers/*, drivers/net/* etc. 
or
drivers/kvm/* drivers/xen/* etc.

Christian?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Decreasing stime running confuses top

2007-10-04 Thread Christian Borntraeger

Am Donnerstag, 4. Oktober 2007 schrieb Chuck Ebbert:
> Is CONFIG_VIRT_CPU_ACCOUNTING set?

This is s390 and powerpc only, so the answer is probably no ;-)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH for testing] Re: Decreasing stime running confuses top

2007-10-04 Thread Christian Borntraeger

Am Donnerstag, 4. Oktober 2007 schrieb Chuck Ebbert:
> On 10/04/2007 04:00 PM, Christian Borntraeger wrote:
> > Am Donnerstag, 4. Oktober 2007 schrieb Chuck Ebbert:
> >> Is CONFIG_VIRT_CPU_ACCOUNTING set?
> > 
> > This is s390 and powerpc only, so the answer is probably no ;-)
> > 
> 
> The code in fs/proc/array.c is... interesting.

Short history what I know about this file:
after 2.6.22 the accounting was changes to use sched_clock and to use stime 
and utime only for the split. This is where task_utime and task_stime were 
introduced. See the code in 2.6.23-rc1.
Unfortunately this broke the accouting on s390 which already has precise 
numbers in utime and stime. So the code was partially reverted to use stime 
and utime again where appropriate, which are of type  cputime_t. 

> 
> 1. task_stime() converts p->se.sum_exec_runtime to a clock_t

correct.

> 
> 2. it calls task_utime() which does the same thing (can it change
>between the two reads?), does some calculations that yield a 
>clock_t, turns the result into a cputime and returns that
> 
> 3. task_stime() then converts that result back into a clock_t and
>uses it!

These conversions can reduce the accuracy to jiffie resolution, but should not
make the value non-monotonic. I dont think that here is the problem. 

But: it seems that p->se.sum_exec_runtime can change indeed. That would 
explain stime going backward, if sum_exec_runtime was increases in the middle 
of the calculation. 
The logic of task_stime and task_utime was introduced by 
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa

Frans can you test this patch if this makes stime and utime monotic again?

It basically reverts the rest of  b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa and 
should restore the 2.6.22 behavior. The process time is used from tasks utime 
and stime instead of the scheduler clock. That means, in general after a long 
period of time, it is less accurate than the current time and behaves like 
2.6.22.

Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>
---
 array.c |   37 -
 1 file changed, 37 deletions(-)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index ee4814d..e42c76e 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -323,7 +323,6 @@ int proc_pid_status(struct task_struct *task, char 
*buffer)
 /*
  * Use precise platform statistics if available:
  */
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
 static cputime_t task_utime(struct task_struct *p)
 {
return p->utime;
@@ -333,42 +332,6 @@ static cputime_t task_stime(struct task_struct *p)
 {
return p->stime;
 }
-#else
-static cputime_t task_utime(struct task_struct *p)
-{
-   clock_t utime = cputime_to_clock_t(p->utime),
-   total = utime + cputime_to_clock_t(p->stime);
-   u64 temp;
-
-   /*
-* Use CFS's precise accounting:
-*/
-   temp = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
-
-   if (total) {
-   temp *= utime;
-   do_div(temp, total);
-   }
-   utime = (clock_t)temp;
-
-   return clock_t_to_cputime(utime);
-}
-
-static cputime_t task_stime(struct task_struct *p)
-{
-   clock_t stime;
-
-   /*
-* Use CFS's precise accounting. (we subtract utime from
-* the total, to make sure the total observed by userspace
-* grows monotonically - apps rely on that):
-*/
-   stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
-   cputime_to_clock_t(task_utime(p));
-
-   return clock_t_to_cputime(stime);
-}
-#endif

 static int do_task_stat(struct task_struct *task, char *buffer, int whole)
 {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for testing] Re: Decreasing stime running confuses top

2007-10-04 Thread Christian Borntraeger

Am Freitag, 5. Oktober 2007 schrieb Chuck Ebbert:
> On 10/04/2007 05:10 PM, Christian Borntraeger wrote:
> 
> > 
> 
> Alternative patch:
> 
> procfs: Don't read runtime twice when computing task's stime
> 
> Current code reads p->se.sum_exec_runtime twice and goes through
> multiple type conversions to calculate stime. Read it once and
> skip some of the conversions.
> 
> Signed-off-by: Chuck Ebbert <[EMAIL PROTECTED]>

Looks better and makes the code nicer. s390 and power should work as well as 
CONFIG_VIRT_CPU_ACCOUNTING is unaffected. 

If Frans successfully tests this patch, feel free to add

Acked-by: Christian Borntraeger <[EMAIL PROTECTED]>

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH for testing] Re: Decreasing stime running confuses top

2007-10-08 Thread Christian Borntraeger

Am Freitag, 5. Oktober 2007 schrieb Frans Pop:
> On Thursday 04 October 2007, you wrote:
> > Frans can you test this patch if this makes stime and utime monotic
> > again?
> >
> > It basically reverts the rest of 
> > b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa and should restore the 2.6.22
> > behavior. The process time is used from tasks utime and stime instead of
> > the scheduler clock. That means, in general after a long period of time,
> > it is less accurate than the current time and behaves like 2.6.22.
> >
> > Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>
> 
> Yes, this gives steady increases.
> For kontact it also again shows updates only once every minute. I really 
> wonder where all the other fluctuations for contact come from with the 
> alternative code.

Please correct me, if I am wrong, but here is my guess:

I think that the new code gives actually better numbers for kontact. Kontact 
is using the cpu for very short periods, right? The old code updates utime 
and stime via sampling at each timer tick. If kontact is scheduled based on 
the timer tick(lets say timeout and a low amount of other interrupts) it will 
start shortly after a tick. If kontact now manages to return the cpu before 
the next tick, the old code would not account anything for kontact.
The new code instead, should be correct in terms of overall runtime as it 
accounts the scheduled time in ns.

Why does it still shows numbers going backwards? I guess the sampled values 
for stime and utime change in flight between task_utime and task_stime are 
called. Lets say utime will be increased. Given the same sum_exec_runtime 
that means that the result of task_stime() will get smaller at this point. 

So Chucks patch only deals with sum_exec_runtime changing.

> 
> It seems to me that this patch would be the best option for 2.6.23.

Ingo, do you have any opinion about how to proceed?

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-09 Thread Christian Borntraeger

From: Christian Borntraeger <[EMAIL PROTECTED]>

We have seen ramdisk based install systems, where some pages of mapped 
libraries and programs were suddendly zeroed under memory pressure. This 
should not happen, as the ramdisk avoids freeing its pages by keeping them 
dirty all the time.

It turns out that there is a case, where the VM makes a ramdisk page clean, 
without telling the ramdisk driver.
On memory pressure shrink_zone runs and it starts to run shrink_active_list. 
There is a check for buffer_heads_over_limit, and if true, pagevec_strip is 
called. pagevec_strip calls try_to_release_page. If the mapping has no 
releasepage callback, try_to_free_buffers is called. try_to_free_buffers has 
now a special logic for some file systems to make a dirty page clean, if all 
buffers are clean. Thats what happened in our test case.

The solution is to provide a noop-releasepage callback for the ramdisk driver.
This avoids try_to_free_buffers for ramdisk pages. 

To trigger the problem, you have to make buffer_heads_over_limit true, which
means:
- lower max_buffer_heads 
or
- have a system with lots of high memory

Andrew, if there are no objections - please apply. The patch applies against
2.6.23-rc9.

Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>

---
 drivers/block/rd.c |   13 +
 1 files changed, 13 insertions(+)

Index: linux-2.6/drivers/block/rd.c
===
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -189,6 +189,18 @@ static int ramdisk_set_page_dirty(struct
return 0;
 }
 
+/*
+ * releasepage is called by pagevec_strip/try_to_release_page if
+ * buffers_heads_over_limit is true. Without a releasepage function
+ * try_to_free_buffers is called instead. That can unset the dirty
+ * bit of our ram disk pages, which will be eventually freed, even
+ * if the page is still in use.
+ */
+static int ramdisk_releasepage(struct page *page, gfp_t dummy)
+{
+   return 0;
+}
+
 static const struct address_space_operations ramdisk_aops = {
.readpage   = ramdisk_readpage,
.prepare_write  = ramdisk_prepare_write,
@@ -196,6 +208,7 @@ static const struct address_space_operat
.writepage  = ramdisk_writepage,
.set_page_dirty = ramdisk_set_page_dirty,
.writepages = ramdisk_writepages,
+   .releasepage= ramdisk_releasepage,
 };
 
 static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [s390] networking related oops during boot on Hercules (was: build failure)

2007-11-28 Thread Christian Borntraeger

Am Mittwoch, 28. November 2007 schrieb Frans Pop:
[...]
> During boot I get the following oops in the Hercules emulator.
> 2.6.22 runs fine on Hercules; I've not tried .23 on it.
[...]
>   cut here !
> kernel BUG at net/core/dev.c:852  
>  +
> illegal operation: 0001  #1!
> Modules linked in: ctc fsm tape_34xx cu3088 tape ccwgroup tape_class 
> dm_mirror d
> m_snapshot dm_mod dasd_eckd_mod dasd_mod
> CPU:1Not tainted
> Process hwup (pid: 990, task: 0f034c00, ksp: 0f305be0)
> Krnl PSW : 070c 8019d0da (dev_alloc_name+0x1e/0x58)
>R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:0 PM:0
> Krnl GPRS: 00132700  0f398800 0f398800
> 80132796 0f305d7a 0002
>0f1ada00 0f3c1f08 0f3c1f00 0f398800
>0f398800 8019d0c2 0f305ca0 0f305c30
> Krnl Code: 8019d0cc: f000bf1f2460   srp 3871(1,%r11),1120(%r2),0
>8019d0d2: a7740004   brc 7,8019d0da
>8019d0d6: a7f40001   brc 15,8019d0d8
>   >8019d0da: 5810d04e   l   %r1,78(%r13)
>8019d0de: 41a0f060   la  %r10,96(%r15)
>8019d0e2: 5820c460   l   %r2,1120(%r12)
>8019d0e6: 184a   lr  %r4,%r10
>8019d0e8: 0de1   basr%r14,%r1
> Call Trace:
> ( <>! _ehead+0xfffee000/0x80)
>   <0019d710>! register_netdev+0x34/0x6c
>   <108a561e>! ctc_new_device+0x3ee/0x590  ctc!
>   <10861398>! ccwgroup_online_store+0xb0/0x13c  ccwgroup!
>   <000c962a>! sysfs_write_file+0xca/0x130
>   <000840e6>! vfs_write+0x92/0x128
>  +
>   <000847e8>! sys_write+0x40/0x70
>   <0002098a>! sysc_do_restart+0x12/0x16
>   <77f0abaa>! 0x77f0abaa


This seems to be related to the new network namespace code by Eric
Biederman (CCed). Can you try the following (untested) patch? I also 
CCed Ursula and Peter as they know the ctc code better than me.


CC: Eric W. Biederman <[EMAIL PROTECTED]>
CC: Ursula Braun <[EMAIL PROTECTED]>
CC: Peter Tiedemann <[EMAIL PROTECTED]>
Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>

---
 drivers/s390/net/ctcmain.c |2 ++
 1 file changed, 2 insertions(+)

Index: linux-2.6/drivers/s390/net/ctcmain.c
===
--- linux-2.6.orig/drivers/s390/net/ctcmain.c
+++ linux-2.6/drivers/s390/net/ctcmain.c
@@ -56,6 +56,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -2823,6 +2824,7 @@ ctc_init_netdevice(struct net_device * d
dev->type = ARPHRD_SLIP;
dev->tx_queue_len = 100;
dev->flags = IFF_POINTOPOINT | IFF_NOARP;
+   dev->nd_net = &init_net;
return dev;
 }
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch] rewrite rd

2007-12-04 Thread Christian Borntraeger

Am Dienstag, 4. Dezember 2007 schrieb Nick Piggin:
[...]
> There is one slight downside -- direct block device access and filesystem
> metadata access goes through an extra copy and gets stored in RAM twice.
> However, this downside is only slight, because the real buffercache of the
> device is now reclaimable (because we're not playing crazy games with it),
> so under memory intensive situations, footprint should effectively be the
> same -- maybe even a slight advantage to the new driver because it can also
> reclaim buffer heads.

This is just an idea, I dont know if it is worth the trouble, but have you
though about implementing direct_access for brd? That would allow 
execute-in-place (xip) on brd eliminating the extra copy.

Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [s390] build failure

2007-11-28 Thread Christian Borntraeger

Am Mittwoch, 28. November 2007 schrieb Frans Pop:
> $ git describe
> v2.6.24-rc3-342-g8c27eba
> 
> arch/s390/kernel/built-in.o: In function `cleanup_io_leave_insn':
> diag.c:(.text+0xc29a): undefined reference to `preempt_schedule_irq'
> make[2]: *** [.tmp_vmlinux1] Error 1
> 
> 

We have a patch for that in our repository.
Martin will send that fix with the next bunch of fixes.


Subject: [PATCH] Fix compile error on 31bit without preemption

From: Christian Borntraeger <[EMAIL PROTECTED]>

Commit b8e7a54cd06b0b0174029ef3a7f5a1415a2c28f2 introduced a compile
error if CONFIG_PREEMPT is not set:

arch/s390/kernel/built-in.o: In function `cleanup_io_leave_insn':
/space/kvm/arch/s390/kernel/entry.S:(.text+0xbfce): undefined reference to 
`preempt_schedule_irq'

This patch hides preempt_schedule_irq if CONFIG_PREEMPT is not set.

Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>
Signed-off-by: Martin Schwidefsky <[EMAIL PROTECTED]>
---

 arch/s390/kernel/entry.S |2 ++
 1 file changed, 2 insertions(+)

diff -urpN linux-2.6/arch/s390/kernel/entry.S 
linux-2.6-patched/arch/s390/kernel/entry.S
--- linux-2.6/arch/s390/kernel/entry.S  2007-11-27 16:31:36.0 +0100
+++ linux-2.6-patched/arch/s390/kernel/entry.S  2007-11-27 16:31:45.0 
+0100
@@ -1079,8 +1079,10 @@ cleanup_io_leave_insn:
 .Lexecve_tail: .long   execve_tail
 .Ljump_table:  .long   pgm_check_table
 .Lschedule:.long   schedule
+#ifdef CONFIG_PREEMPT
 .Lpreempt_schedule_irq:
.long   preempt_schedule_irq
+#endif
 .Ltrace:   .long   syscall_trace
 .Lschedtail:   .long   schedule_tail
 .Lsysc_table:  .long   sys_call_table

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Linux 2.6.23.10

2007-12-16 Thread Christian Borntraeger

Am Freitag, 14. Dezember 2007 schrieb Greg Kroah-Hartman:
> Christian Borntraeger (1):
>   Future of Linux 2.6.22.y series

This should be: rd: fix data corruption on memory pressure.
Same for 2.6.22

Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 1/2] Thinkpad Suspend Powersave: Fix ACPI's GFP_KERNEL allocations in contexts that can sleep

2005-03-16 Thread Christian Borntraeger

Andrew Morton wrote:
> "Theodore Y. Ts'o" <[EMAIL PROTECTED]> wrote:
> > This fixes a problem originally reported by Christian Borntraeger where
> >  during the wakeup from a suspend-to-ram, several "sleeping function
> >  called from invalid context" warning messages are issued.  Unlike a
> >  previous patch which attempted to solve this problem, we avoid doing
> > an GFP_ATOMIC kmalloc() except when explicitly necessary.

Len,
you indicated that you are going to address the S3 sleeping function issue 
after the development for 2.6.12 has started with an invasive but correct 
fix. 
(http://marc.theaimsgroup.com/?l=acpi4linux&m=110978961501752&w=2)
Can you give an overview about your progress?

cheers

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [patch 05/12] acpi: sleep-while-atomic during S3 resume from ram

2005-02-24 Thread Christian Borntraeger

[EMAIL PROTECTED] wrote:
> From: Christian Borntraeger <[EMAIL PROTECTED]>
>
> During the wakeup from suspend-to-ram I get several warnings.
>
> Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>
> Signed-off-by: Andrew Morton <[EMAIL PROTECTED]>

Andrew,

Len told me that he is going to solve the issue in a different and better 
way.

cheers

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: how to capture kernel panics

2005-02-25 Thread Christian Borntraeger

shabanip wrote:
> is there any way to capture and log kernel panics on disk or ...?

In former times, the Linux kernel tried to sync in the panic function. (If 
the panic did not happen in interrupt context) Unfortunately this had 
severe side effects in cases where the panic was triggered by file system 
block device code or any other part which is necessary for syncing. In most 
cases the call trace never made it onto disk anyway. So currently the 
kernel does not support saving a panic.

Apart from using a serial console, you might have a look at several 
kexec/kdump/lkcd tools where people are working on being able to dump the 
memory of a paniced kernel.

cheers

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Invalid module format in Fedora core2

2005-02-25 Thread Christian Borntraeger

Payasam Manohar wrote:
>   Hai all,
>I tried to insert a sample module into Fedora core 2 , But it is
> giving an error message that " no module in the object"
> The same module was inserted successfully into Redhat linux 9.
>
> Is there any changes from RH 9 to Fedora Core 2 with respect to the
> module writing and insertion.
>
> Any small help  also welcome.

Well, the most important difference is probably the move from Linux kernel 
2.4 to Linux kernel 2.6.

See http://lwn.net/Articles/driver-porting/ for details how to change 
modules from 2.4 to 2.6 (If source code is available).

cheers 

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Disable cache disk

2005-03-31 Thread Christian Borntraeger

Yves Crespin wrote:
> Christian Bornträger wrote:
> >On Wednesday 30 March 2005 15:00, Yves Crespin wrote:
> >>1/ is-it possible to *really* be synchronize. I prefer to have a
> >> blocked write() than use cache and get swap!
> >Try to mount with the sync option.
> exactly async and noatime ?

No.  async is exactly the behaviour you dont want. Problem is, the "sync" 
mount option is not available for every file system. At least ext2, ext3, 
and ufs support this option. No idea about other filesystems.


> >>2/ is-it possible to disable cache disk ?
> >
> >your copy tool has to support/use O_DIRECT
>
> is O_DIRECT a POSIX option ?

No it is a Linux extension.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH resend2] rd: fix data corruption on memory pressure

2007-10-29 Thread Christian Borntraeger

Nick, Eric, Andrew,

we now have passed rc1. That means that Erics or Nicks rd rewrite is no longer
an option for 2.6.24. If I followed the last thread correctly all alternative
patches have one of the following issue
- too big for post rc1
- break reiserfs and maybe others
- call into vfs while being unrelated

So this is a resend of my patch, which is in my opinion the simplest fix for
the data corruption problem.
The patch was tested by our test department, thanks to Oliver Paukstadt and
Thorsten Diehl.

---

Subject: [PATCH] rd: fix data corruption on memory pressure
From: Christian Borntraeger <[EMAIL PROTECTED]>

We have seen ramdisk based install systems, where some pages of mapped 
libraries and programs were suddendly zeroed under memory pressure. This 
should not happen, as the ramdisk avoids freeing its pages by keeping them 
dirty all the time.

It turns out that there is a case, where the VM makes a ramdisk page clean, 
without telling the ramdisk driver.
On memory pressure shrink_zone runs and it starts to run shrink_active_list. 
There is a check for buffer_heads_over_limit, and if true, pagevec_strip is 
called. pagevec_strip calls try_to_release_page. If the mapping has no 
releasepage callback, try_to_free_buffers is called. try_to_free_buffers has 
now a special logic for some file systems to make a dirty page clean, if all 
buffers are clean. Thats what happened in our test case.

The simplest solution is to provide a noop-releasepage callback for the
ramdisk driver. This avoids try_to_free_buffers for ramdisk pages. 

Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>
---
 drivers/block/rd.c |   13 +
 1 file changed, 13 insertions(+)

Index: linux-2.6/drivers/block/rd.c
===
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -189,6 +189,18 @@ static int ramdisk_set_page_dirty(struct
return 0;
 }
 
+/*
+ * releasepage is called by pagevec_strip/try_to_release_page if
+ * buffers_heads_over_limit is true. Without a releasepage function
+ * try_to_free_buffers is called instead. That can unset the dirty
+ * bit of our ram disk pages, which will be eventually freed, even
+ * if the page is still in use.
+ */
+static int ramdisk_releasepage(struct page *page, gfp_t dummy)
+{
+   return 0;
+}
+
 static const struct address_space_operations ramdisk_aops = {
.readpage   = ramdisk_readpage,
.prepare_write  = ramdisk_prepare_write,
@@ -196,6 +208,7 @@ static const struct address_space_operat
.writepage  = ramdisk_writepage,
.set_page_dirty = ramdisk_set_page_dirty,
.writepages = ramdisk_writepages,
+   .releasepage= ramdisk_releasepage,
 };
 
 static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [stable] 2.6.23 regression: top displaying 9999% CPU usage

2007-10-29 Thread Christian Borntraeger

Am Montag, 29. Oktober 2007 schrieb Ingo Molnar:
> i've got a patch from Peter queued up. (see below) This should fix the 
> main issue.
[...]
> --- linux.orig/fs/proc/array.c
> +++ linux/fs/proc/array.c
> @@ -358,7 +358,8 @@ static cputime_t task_utime(struct task_
>   }
>   utime = (clock_t)temp;
> 
> - return clock_t_to_cputime(utime);
> + p->prev_utime = max(p->prev_utime, clock_t_to_cputime(utime));
> + return p->prev_utime;
>  }
[...]

I dont think it will work. It will make utime monotic, but stime can still 
decrease. For example let sum_exec_runtime increase by a tiny little bit while
utime will get a full additional tick. stime is sum-utime. So stime can still
go backwards. So I think that we need this kind of logic for stime as well, 
no?

Christian



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [stable] 2.6.23 regression: top displaying 9999% CPU usage

2007-10-29 Thread Christian Borntraeger

Am Montag, 29. Oktober 2007 schrieb Balbir Singh:
> We'll also need this additional patch (untested), but in the long run
> I think the approach needs to be
> 
> 1. Update stime and utime at the time of context switching -- keep it
>in sync with p->sum_exec_runtime
> 2. Keep track of system/user context at system call entry points

This is similar to what s390 and ppc64 do when CONFIG_VIRT_CPU_ACCOUNTING is 
set. Whenever you do a patch, please make sure to not break the already 
precise values for utime and stime on s390 and ppc. Feel free to CC me 
whenever you have a patch for this.

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Patch 09/16] Remove unnecessary kmalloc casts in the pci subsystem.

2007-07-31 Thread Christian Borntraeger

Am Dienstag, 31. Juli 2007 schrieb [EMAIL PROTECTED]:
> --- a/drivers/pci/rom.c
> +++ b/drivers/pci/rom.c
> @@ -185,7 +185,7 @@ void __iomem *pci_map_rom_copy(struct pc
> IORESOURCE_ROM_BIOS_COPY))
>   return rom;
> 
> - res->start = (unsigned long)kmalloc(*size, GFP_KERNEL);
> + res->start = kmalloc(*size, GFP_KERNEL);

This looks wrong. void * doesnt need a cast to a pointer, but res->start is an 
integer u32 type, and therefore we need a cast, no?



-- 
IBM Deutschland Entwicklung GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Herbert Kircher 
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Patch 08/16] Remove unnecessary kmalloc casts in the parisc drivers.

2007-07-31 Thread Christian Borntraeger

Am Dienstag, 31. Juli 2007 schrieb [EMAIL PROTECTED]:
> - a = (unsigned long)kmalloc(sizeof(struct irt_entry) * num_entries + 8, 
GFP_KERNEL);
> + a = kmalloc(sizeof(struct irt_entry) * num_entries + 8, GFP_KERNEL);

NAK.
Same as in some other patches.
"void *" -> "ulong" is not auto casted and will give you a "warning: 
assignment makes integer from pointer without a cast".

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Patch 02/16] Remove unnecessary kmalloc casts from the mips arch.

2007-07-31 Thread Christian Borntraeger

Am Dienstag, 31. Juli 2007 schrieb [EMAIL PROTECTED]:
> ===
> --- a/arch/mips/au1000/common/dbdma.c
> +++ b/arch/mips/au1000/common/dbdma.c
[...]
> - desc_base = (u32)kmalloc(entries * sizeof(au1x_ddma_desc_t),
> + desc_base = kmalloc(entries * sizeof(au1x_ddma_desc_t),
[...]
> - if ((desc_base = (u32)kmalloc(i, GFP_KERNEL|GFP_DMA)) == 0)
> + if ((desc_base = kmalloc(i, GFP_KERNEL|GFP_DMA)) == 0)

See the comments to the other patches. The cast is necessary.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Patch 11/16] Remove unnecessary kmalloc casts in drivers/net.

2007-07-31 Thread Christian Borntraeger

Am Dienstag, 31. Juli 2007 schrieb [EMAIL PROTECTED]:
> --- a/drivers/net/lance.c
> +++ b/drivers/net/lance.c
[...]
> - lp->rx_buffs = (unsigned long)kmalloc(PKT_BUF_SZ*RX_RING_SIZE,
> -   GFP_DMA | GFP_KERNEL);
> + lp->rx_buffs = kmalloc(PKT_BUF_SZ*RX_RING_SIZE, GFP_DMA | GFP_KERNEL);
[...]
> --- a/drivers/net/sgiseeq.c
> +++ b/drivers/net/sgiseeq.c
[...]
> - buffer = (unsigned long) kmalloc(PKT_BUF_SZ, 
> GFP_KERNEL);
> + buffer = kmalloc(PKT_BUF_SZ, GFP_KERNEL);
[...]
> - buffer = (unsigned long) kmalloc(PKT_BUF_SZ, 
> GFP_KERNEL);
> + buffer = kmalloc(PKT_BUF_SZ, GFP_KERNEL);

See the comments to the other patches. The cast is necessary.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [Patch 06/16] This removes unnecessary kmalloc casts and corrects a test for kmalloc failure.

2007-07-31 Thread Christian Borntraeger

Am Dienstag, 31. Juli 2007 schrieb [EMAIL PROTECTED]:
> --- a/drivers/media/video/zoran_driver.c
> +++ b/drivers/media/video/zoran_driver.c
[...]
> - mem =
> - (unsigned long) kmalloc(fh->jpg_buffers.
> - buffer_size,
> - GFP_KERNEL);
> + mem = kmalloc(fh->jpg_buffers.buffer_size, GFP_KERNEL);
[...]

NAK. See the comments to the other patches. The cast is necessary.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [kvm-devel] [PATCH 0/2][KVM] guest time accounting

2007-08-13 Thread Christian Borntraeger

Am Freitag, 10. August 2007 schrieb Laurent Vivier:
> The aim of these two patches is to measure the CPU time used by a virtual
> machine. All comments are welcome... I'm not sure it's the good way to do 
that.

I did something similar for or s390guest prototype, that Carsten posted in 
May.  I decided to account guest time to the user process instead of adding a 
new field to avoid hazzle with old top. As you can read in the patch comment, 
I personally prefer a new field if we can get one.

My implementation uses a similar mechanism like hard and softirq. So I have an 
sie_enter an sie_exit and a task_is_in_sie function - like irq_enter and 
irq_exit. The main difference is based on the fact, that s390 has precise 
accouting for irq, steal, user and system time, and therefore my patch is 
based on architecture specifc code using CONFIG_VIRT_CPU_ACCOUNT. 

In general my patch has the same idea as your patch, so I am going to review 
your patch and see if it would fit for s390.

For reference this is the (never posted) old patch for our virtualisation 
prototype. It wont work with kvm but it gives you the idea what we had in 
mind on s390.

--- snip old PATCH GPLv2 --

Subject: [PATCH/RFC] Fix system<->user misaccount of interpreted execution

From: Christian Borntraeger <[EMAIL PROTECTED]>

This patches fixes the accouting of guest cpu time. As sie is executed via a
system call, all guest operations were accounted as system time. To fix this
we define a per thread "sie context". Before issuing the sie instruction we
enter this context and leave the context afterwards. sie_enter and sie_exit
call account_system_vtime, which now checks for being in sie_context. We 
define the sie_context to be accounted as user time.

Possible future enhancement: We could add an additional field: "interpretion
time" to cpu stat and process time. Thus we could differentiate between user
time in the host and host user time spent for guests. The main challenge is
the necessary user space change. Therefore, we could export the interpretion
time with a new interface. To be defined.

Signed-off-By: Christian Borntraeger <[EMAIL PROTECTED]>
Signed-off-By: Carsten Otte <[EMAIL PROTECTED]>

---
 arch/s390/Kconfig  |1 +
 arch/s390/host/s390host.c  |   15 +++
 arch/s390/kernel/process.c |1 +
 arch/s390/kernel/vtime.c   |   12 +++-
 include/asm-s390/thread_info.h |2 ++
 5 files changed, 30 insertions(+), 1 deletion(-)

Index: linux-2.6.22/arch/s390/kernel/vtime.c
===
--- linux-2.6.22.orig/arch/s390/kernel/vtime.c
+++ linux-2.6.22/arch/s390/kernel/vtime.c
@@ -97,6 +97,11 @@ void account_vtime(struct task_struct *t
account_system_time(tsk, 0, cputime);
 }
 
+static inline int task_is_in_sie(struct thread_info *thread)
+{
+   return thread->in_sie;
+}
+
 /*
  * Update process times based on virtual cpu times stored by entry.S
  * to the lowcore fields user_timer, system_timer & steal_clock.
@@ -114,7 +119,12 @@ void account_system_vtime(struct task_st
cputime =  S390_lowcore.system_timer >> 12;
S390_lowcore.system_timer -= cputime << 12;
S390_lowcore.steal_clock -= cputime << 12;
-   account_system_time(tsk, 0, cputime);
+
+   if (task_is_in_sie(task_thread_info(tsk)) && !hardirq_count() &&
+   !softirq_count())
+   account_user_time(tsk, cputime);
+   else
+   account_system_time(tsk, 0, cputime);
 }
 
 static inline void set_vtimer(__u64 expires)
Index: linux-2.6.22/arch/s390/host/s390host.c
===
--- linux-2.6.22.orig/arch/s390/host/s390host.c
+++ linux-2.6.22/arch/s390/host/s390host.c
@@ -27,6 +27,19 @@ static int s390host_do_action(unsigned l
 
 static DEFINE_MUTEX(s390host_init_mutex);
 
+static void enter_sie(void)
+{
+   account_system_vtime(current);
+   current_thread_info()->in_sie = 1;
+}
+
+static void exit_sie(void)
+{
+   account_system_vtime(current);
+   current_thread_info()->in_sie = 0;
+}
+
+
 static void s390host_get_data(struct s390host_data *data)
 {
atomic_inc(&data->count);
@@ -297,7 +310,9 @@ again:
schedule();
 
sie_kernel->sie_block.icptcode = 0;
+   enter_sie();
ret = sie64a(sie_kernel);
+   exit_sie();
if (ret)
goto out;
 
Index: linux-2.6.22/include/asm-s390/thread_info.h
===
--- linux-2.6.22.orig/include/asm-s390/thread_info.h
+++ linux-2.6.22/include/asm-s390/thread_info.h
@@ -55,6 +55,7 @@ struct thread_info {
struct restart_blockrestart_block;
struct s390host_data*s390host_data; /* s390hos

Re: [kvm-devel] [PATCH 0/2][KVM] guest time accounting

2007-08-13 Thread Christian Borntraeger

Am Montag, 13. August 2007 schrieb Laurent Vivier:
> > [copying Ingo and Rusty]

@Avi, seems that sourceforge is mangling the cc list?

> > 
> > The patches look good.  A couple of comments:
> > 
> > - perhaps the new fields should be guarded by a #ifdef CONFIG_HYPERVISOR
> > (selected by CONFIG_KVM)?  that way the (minor) additional overhead is
> > only incurred if it can possibly be used.  I imagine that our canine
> > cousin will want to use this as well.
> 
> There is also a CONFIG_VIRTUALIZATION and a CONFIG_VIRT_CPU_ACCOUNTING (from
> s390 and powerpc) Which one to use ?

CONFIG_VIRT_CPU_ACCOUNTING is used for the precise accouting of user,system, 
steal and irq time on these platforms and is not what you want for the on/off 
decision. 
> 
> I'm wondering if we can have a more accurate accounting:
> 
> - For the moment we add all system time since the previous entering to the
> VCPU to the guest time (and I guess there is some real system time in
> it ???) 
> - Perhaps we can sum nanoseconds spent in the VCPU and add it to cpustat
> when these ns are greater than 1 ms ? (I'm trying to make something in this 
way)

If you look at the patch I have posted some minutes ago, I use a method 
similar to irq_enter and irq_exit to separate real system time from guest 
time. 

> > - I think that there is per-task accounting of user time and system
> > time; that should be extended as well.
> it should be easy to do too...

We have to make sure that userspace doesnt break, but yes we should have a 
guest time for processes as well. 

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [kvm-devel] [PATCH 0/2][KVM] guest time accounting

2007-08-13 Thread Christian Borntraeger

Am Montag, 13. August 2007 schrieb Avi Kivity:

> Laurent's patch gives the best of both worlds: on old 'top', you get 
> guest time accounted as user time, while on new 'top' it is accounted 
> separately.  This is done by reporting user time as the sum of the real 
> user time and guest time.  A newer 'top' can subtract guest time from 
> user time to get the correct statistic.

Yes that looks promising. If I recall correctly we had some strange top 
behaviours when we introduced the steal time. Old top added the steal time to 
idle. We should check that.

> 
> 
> > My implementation uses a similar mechanism like hard and softirq. So I 
have an 
> > sie_enter an sie_exit and a task_is_in_sie function - like irq_enter and 
> > irq_exit. The main difference is based on the fact, that s390 has precise 
> > accouting for irq, steal, user and system time, and therefore my patch is 
> > based on architecture specifc code using CONFIG_VIRT_CPU_ACCOUNT. 
> >   
> 
> Okay, so the code should be under that config option, and kvm should 
> select it.

No hurry..that was specific to our implementation, not KVM :-)
Besided that, Ingo changed the accouting with his CFS scheduler, and I still 
have to figure out how CONFIG_VIRT_CPU_ACCOUNTING can be properly integrated 
in CFS.

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [kvm-devel] [PATCH 0/2][KVM] guest time accounting

2007-08-13 Thread Christian Borntraeger

Am Montag, 13. August 2007 schrieb Laurent Vivier:
> >> As guest accounting is hw dependent, I think we should add a hook in the
> >> accounting functions.
> >>   
> > 
> > Isn't PF_VM exactly such a hook?  All the hypervisor needs to do is to
> > set/unset it correctly?
> 
> In fact, no.
> 
> PF_VM is used to know we have entered a virtual CPU (the hypervisor set it,
> the scheduler unset it on accounting)

Why not do something like the following. (This patch does not work as it 
relies on the no-existing var cputime_since_last_update, but it shows the 
idea)

---
 drivers/kvm/kvm.h |   13 +
 drivers/kvm/svm.c |3 ++-
 drivers/kvm/vmx.c |3 ++-
 kernel/sched.c|1 -
 4 files changed, 17 insertions(+), 3 deletions(-)

Index: kvm/drivers/kvm/kvm.h
===
--- kvm.orig/drivers/kvm/kvm.h
+++ kvm/drivers/kvm/kvm.h
@@ -8,6 +8,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -726,6 +727,18 @@ static inline u32 get_rdx_init_val(void)
return 0x600; /* P6 family */
 }
 
+static inline guest_enter(void)
+{
+   account_system_time(current, 0, cputime_since_last_update);
+   current->flags |= PF_VCPU;
+}
+
+static inline guest_exit(void)
+{
+   current->flags &= ~PF_VCPU;
+   account_system_time(current, 0, cputime_since_last_update);
+}
+
 #define ASM_VMX_VMCLEAR_RAX   ".byte 0x66, 0x0f, 0xc7, 0x30"
 #define ASM_VMX_VMLAUNCH  ".byte 0x0f, 0x01, 0xc2"
 #define ASM_VMX_VMRESUME  ".byte 0x0f, 0x01, 0xc3"
Index: kvm/kernel/sched.c
===
--- kvm.orig/kernel/sched.c
+++ kvm/kernel/sched.c
@@ -3251,7 +3251,6 @@ void account_system_time(struct task_str
p->utime = cputime_add(p->utime, cputime);
cpustat->user = cputime64_add(cpustat->user, tmp);
cpustat->guest = cputime64_add(cpustat->guest, tmp);
-   p->flags &= ~PF_VCPU;
return;
}
 
Index: kvm/drivers/kvm/svm.c
===
--- kvm.orig/drivers/kvm/svm.c
+++ kvm/drivers/kvm/svm.c
@@ -1404,7 +1404,7 @@ again:
clgi();
 
vcpu->guest_mode = 1;
-   current->flags |= PF_VCPU;
+   guest_enter();
if (vcpu->requests)
if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests))
svm_flush_tlb(vcpu);
@@ -1537,6 +1537,7 @@ again:
 #endif
: "cc", "memory" );
 
+   guest_exit();
vcpu->guest_mode = 0;
 
if (vcpu->fpu_active) {
Index: kvm/drivers/kvm/vmx.c
===
--- kvm.orig/drivers/kvm/vmx.c
+++ kvm/drivers/kvm/vmx.c
@@ -2078,7 +2078,7 @@ again:
local_irq_disable();
 
vcpu->guest_mode = 1;
-   current->flags |= PF_VCPU;
+   guest_enter();
if (vcpu->requests)
if (test_and_clear_bit(KVM_TLB_FLUSH, &vcpu->requests))
vmx_flush_tlb(vcpu);
@@ -2199,6 +2199,7 @@ again:
[cr2]"i"(offsetof(struct kvm_vcpu, cr2))
  : "cc", "memory" );
 
+   guest_exit();
vcpu->guest_mode = 0;
local_irq_enable();
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [kvm-devel] [PATCH 0/2][KVM] guest time accounting

2007-08-13 Thread Christian Borntraeger

Am Montag, 13. August 2007 schrieb Avi Kivity:
> Christian Borntraeger wrote:
> > Am Montag, 13. August 2007 schrieb Laurent Vivier:
> >   
> >>> [copying Ingo and Rusty]
> >>>   
> >
> > @Avi, seems that sourceforge is mangling the cc list?
> >
> >   
> 
> It's not configured to do so.  Can you be more specific?

I added Carsten and he was removed from the CC list in my copy I got from SF.
And you cced Ingo and Rusty but I dont have them on cc, either.

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [accounting regression since rc1] scheduler updates

2007-08-14 Thread Christian Borntraeger

This is a 2nd try with correct email address, sorry for the duplicates.

Am Sonntag, 12. August 2007 schrieb Ingo Molnar:
> Linus, please pull the latest scheduler git tree from:

Hello Ingo,

this is a followup to the discussion in 
http://lkml.org/lkml/2007/7/19/538

Since 2.6.12, s390 already does precise accouting for system and user time.  
Depending on CONFIG_VIRT_CPU_ACCOUNTING, we  use two 64bit hardware timers on 
s390: the first returns the wall clock time and is stepped even if the 
virtual cpu is not backed by a physical cpu. The second timer is only 
stepped, when the virtual cpu is backed by a physical cpu. The timers have a 
very high accurancy, and the architecture guarantees that bit 51 is increased 
by one/microsecond. We store both timers on each context switch, irq, 
syscall, and machinecheck in entry.S. The calculation are made in 
arch/s390/kernel/vtime.c in accouting_system_vtime and friends with 
microsecond accurracy. This is also used for irq accouting (see the 
definition of irq_enter). It basically boils down to precise numbers in the 
cpu stat and the utime/stime for processes as well as knowledge about time 
stolen by the hypervisor. 

With CFS the accounting was changed, and everything is now based on 
sum_exec_runtime.  There is now an accounting regression on s390 (and maybe 
ppc64), as the default jiffy implemenation does not know anything about 
virtual cpus.

While looking for a solution, I started with a very quick hack and reverted 
b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa for the procfs related changes. If I 
revert that commit, it seems that I get the old behaviour -  but of course 
this is just a hack.

I see some options now:

1. Jan could finish his sched_clock implementation for s390 and we would get 
close to the precise numbers. This would also let CFS make better decisions. 
Downside: its not as precise as before as we do some math on the numbers and 
it will burn cycles to compute numbers we already have 
(utime=sum*utime/stime). 
2. set sum_exec_runtime based on the precise utime and stime. Dont know enough 
about CFS if this would show different scheduling behaviour than 1
3. ifdef fs/proc/array.c depending on CONFIG_VIRT_CPU_ACCOUNTING. This will 
save some cycles, and the numbers are precise to a microsecond. Downside: the 
scheduler gets no information about virtual cpus and steal time so its 
probably not completely fair
4. implement sched_clock AND reuse the exisiting utime and stime numbers.
5. other clever solutions I cannot see

Any suggestions?

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [accounting regression since rc1] scheduler updates

2007-08-14 Thread Christian Borntraeger

Am Dienstag, 14. August 2007 schrieb Linus Torvalds:
> On Tue, 14 Aug 2007, Christian Borntraeger wrote:
> > Hello Ingo,
> Just a fyi: I think Ingo is on vacation this week without any internet 
> access, I think he'll be back next Monday.
> 
>   Linus

Well deserved.
As Martin and Jan are away as well I have a week to look into that problem 
myself :-)

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH][RFC] Re: accounting regression since rc1

2007-08-16 Thread Christian Borntraeger

Ingo,

this patch fixes the accounting regression for CONFIG_VIRT_CPU_ACCOUNTING. It 
reverts parts of commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa by converting 
fs/proc/array.c back to cputime_t. The new functions task_utime and 
task_stime now return cputime_t instead of clock_t. If 
CONFIG_VIRT_CPU_ACCOUTING is set, task->utime and task->stime are returned 
directly instead of using sum_exec_runtime. 

Patch is tested on s390x with and without VIRT_CPU_ACCOUTING as well as on 
i386. Not tested on ppc64 - Paul?

Feedback is welcome.

Signed-Off-By: Christian Borntraeger <[EMAIL PROTECTED]>

---
 fs/proc/array.c |   41 ++---
 1 file changed, 26 insertions(+), 15 deletions(-)

Index: linux-2.6/fs/proc/array.c
===
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -320,7 +320,18 @@ int proc_pid_status(struct task_struct *
return buffer - orig;
 }
 
-static clock_t task_utime(struct task_struct *p)
+#if defined(CONFIG_VIRT_CPU_ACCOUNTING)
+static cputime_t task_utime(struct task_struct *p)
+{
+   return p->utime;
+}
+
+static cputime_t task_stime(struct task_struct *p)
+{
+   return p->stime;
+}
+#else
+static cputime_t task_utime(struct task_struct *p)
 {
clock_t utime = cputime_to_clock_t(p->utime),
total = utime + cputime_to_clock_t(p->stime);
@@ -337,10 +348,10 @@ static clock_t task_utime(struct task_st
}
utime = (clock_t)temp;
 
-   return utime;
+   return clock_t_to_cputime(utime);
 }
 
-static clock_t task_stime(struct task_struct *p)
+static cputime_t task_stime(struct task_struct *p)
 {
clock_t stime;
 
@@ -349,10 +360,12 @@ static clock_t task_stime(struct task_st
 * the total, to make sure the total observed by userspace
 * grows monotonically - apps rely on that):
 */
-   stime = nsec_to_clock_t(p->se.sum_exec_runtime) - task_utime(p);
+   stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
+   cputime_to_clock_t(task_utime(p));
 
-   return stime;
+   return clock_t_to_cputime(stime);
 }
+#endif
 
 static int do_task_stat(struct task_struct *task, char *buffer, int whole)
 {
@@ -368,8 +381,7 @@ static int do_task_stat(struct task_stru
unsigned long long start_time;
unsigned long cmin_flt = 0, cmaj_flt = 0;
unsigned long  min_flt = 0,  maj_flt = 0;
-   cputime_t cutime, cstime;
-   clock_t utime, stime;
+   cputime_t cutime, cstime, utime, stime;
unsigned long rsslim = 0;
char tcomm[sizeof(task->comm)];
unsigned long flags;
@@ -387,8 +399,7 @@ static int do_task_stat(struct task_stru
 
sigemptyset(&sigign);
sigemptyset(&sigcatch);
-   cutime = cstime = cputime_zero;
-   utime = stime = 0;
+   cutime = cstime = utime = stime = cputime_zero;
 
rcu_read_lock();
if (lock_task_sighand(task, &flags)) {
@@ -414,15 +425,15 @@ static int do_task_stat(struct task_stru
do {
min_flt += t->min_flt;
maj_flt += t->maj_flt;
-   utime += task_utime(t);
-   stime += task_stime(t);
+   utime = cputime_add(utime, task_utime(t));
+   stime = cputime_add(stime, task_stime(t));
t = next_thread(t);
} while (t != task);
 
min_flt += sig->min_flt;
maj_flt += sig->maj_flt;
-   utime += cputime_to_clock_t(sig->utime);
-   stime += cputime_to_clock_t(sig->stime);
+   utime = cputime_add(utime, sig->utime);
+   stime = cputime_add(stime, sig->stime);
}
 
sid = signal_session(sig);
@@ -471,8 +482,8 @@ static int do_task_stat(struct task_stru
cmin_flt,
maj_flt,
cmaj_flt,
-   utime,
-   stime,
+   cputime_to_clock_t(utime),
+   cputime_to_clock_t(stime),
cputime_to_clock_t(cutime),
cputime_to_clock_t(cstime),
priority,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [kvm-devel] [PATCH/RFC 3/4]Introduce "account modifiers" mechanism

2007-08-17 Thread Christian Borntraeger

Am Freitag, 17. August 2007 schrieb Laurent Vivier:
> > The normal user/system accounting has the same issue, no?  Whereever we
> > happen to land (kernel or user) gets the whole tick.
> 
> Yes... but perhaps I should rewrite this too ;-)

If you look further, you will see, that this was actually rewritten in 2.6.12 
and thats why we have cputime_t. The infrastructure is currently only used by 
s390 and ppc64. On s390 we use our cpu timer to get the current time on each 
syscall/irq/context switch etc to get precise accounting data for 
system/user/irq/softirq/nice. We also get steal time (this is interesting for 
the guest: how much of my cpu was actually not available because the 
hypervisor scheduled me away). I dont know enough about other architectures  
to say which one could exploit this infrastructure as well.

The current git tree is somewhat broken for CONFIG_VIRT_CPU_ACCOUTING due to 
the accouting changes introduced by CFS - we work on this.

If you are interested in the cputime stuff, you can have a look at 
arch/s390/kernel/vtime.c (account_system_vtime, account_vtime) as well as:

http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=6be7071fdd321c206b1ee7a3e534782f25572830
for the first introduction and
http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=db59f519d37f80683f3611927f6b5c3bdfc0f9e6
for the s390 exploitation.

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Fix guest time accounting going faster than user time accounting

2007-10-18 Thread Christian Borntraeger

Seems I overlooked this type while reviewing Laurents patch.
cputime_add already adds, dont do it twice.

Avi. This should go to Linus before 2.6.24.

Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>

---
 fs/proc/array.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.23/fs/proc/array.c
===
--- linux-2.6.23.orig/fs/proc/array.c
+++ linux-2.6.23/fs/proc/array.c
@@ -446,7 +446,7 @@ static int do_task_stat(struct task_stru
maj_flt += sig->maj_flt;
utime = cputime_add(utime, sig->utime);
stime = cputime_add(stime, sig->stime);
-   gtime += cputime_add(gtime, sig->gtime);
+   gtime = cputime_add(gtime, sig->gtime);
}
 
sid = signal_session(sig);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: 2.6.23-mm1 s390 driver problem

2007-10-18 Thread Christian Borntraeger

Am Donnerstag, 18. Oktober 2007 schrieb Serge E. Hallyn:
> Sigh, well this turned out less informative than I'd liked.
> After bisecting 2.6.23 to 2.6.23-mm1, I found that
> git-s390.patch is the one breaking my s390 boot :(
> (Frown bc it's a conglomeration of patches0
> 
> Symptom is:
>   "Cannot open root device "dasdd2" or unknown-block(94,14)"
> even though dasdd2 appeared to be found earlier in the boot.  I also
> get

Can you post the full console output from IPL to the unsuccessful end?

Thanks

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [kvm-devel] severe bug in 2.6.23+ kvm.git

2007-10-19 Thread Christian Borntraeger

Am Freitag, 19. Oktober 2007 schrieb Luca Tettamanti:
> linus-git has at least one bug with SG chaining, but usually it just
> hangs the machine. Patch is here:
> 
> http://lkml.org/lkml/2007/10/17/269

Looks promising.
I pulled this fix by pulling the latest Linus-git into the kvm.git. I also 
enabled some debug options in the kernel hacking section. This resulting 
kernel seems to be stable so far. We will see in the next days if the problem
is really gone.

Thanks to all for your ideas. 

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [kvm-devel] severe bug in 2.6.23+ kvm.git

2007-10-19 Thread Christian Borntraeger

Am Freitag, 19. Oktober 2007 schrieb Jan Engelhardt:
> 
> On Oct 19 2007 15:44, Carsten Otte wrote:
> > Carsten Otte wrote:
> >> First thing we do, is figure whether or not 2.6.23.1 as released breaks our
> >> system too. This way, we can either focus on differences between Linus and
> >> Avi, or turn on the big red warning sign saying "regression".
> >
> > Looks like 2.6.23.1 works fine on that box. We'll leave it running over
> > the weekend with "while true; do make; make clean; done".
> 
> Well, do you happen to use sata_mv?

no, we have nvidia, so its sata_nv.


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [kvm-devel] severe bug in 2.6.23+ kvm.git

2007-10-19 Thread Christian Borntraeger

Am Freitag, 19. Oktober 2007 schrieb Laurent Vivier:
> Carsten Otte a écrit :
> > Laurent Vivier wrote:
> >> How do you know the problem has been introduced by kvm ?
> > I don't. In fact I think it has not been introduced by kvm. All I 
> > stated, is that we experienced the problem when running the kvm.git 
> > kernel after the 2.6.23 update that has not been present in the 
> > kvm.git -rc8 as of last thursday.
> 
> Perhaps 2.6.23.1 corrects this ?
> 
> http://lkml.org/lkml/2007/10/12/302

No, we dont have an marvel chipset. 

kvm:~# lspci
00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a4)
00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev b1)
00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a4)
00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f3)
00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address 
Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address 
Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM 
Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] 
Miscellaneous Control
01:00.0 VGA compatible controller: ATI Technologies Inc RV370 5B60 [Radeon X300 
(PCIE)]
01:00.1 Display controller: ATI Technologies Inc RV370 [Radeon X300SE]
02:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 
09)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit 
Ethernet PCI Express (rev 21)
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit 
Ethernet PCI Express (rev 21)

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [kvm-devel] severe bug in 2.6.23+ kvm.git

2007-10-19 Thread Christian Borntraeger

Am Freitag, 19. Oktober 2007 schrieb Laurent Vivier:
> Did you patch kvm.git with patch-2.6.23.1.bz2 or did you download 
> linux-2.6.23.1.tar.bz2 ?
> 
> 2.6.23.1 corrects nothing except sata_mv...

Yes I know. The question we tried to answer was: is the bug in 2.6.23 or was
it introduced after 2.6.23, as kvm.git  already contains lots of post 2. 6.23 
stuff.
Current state is:

kvm.git with tag 2.6.23-rc6 works for days without a problem.
2.6.23.1 vanilla has survived and is currently still under test.
kvm.git tag master killed our filesystem at least three times.since monday.

I will continue to bang on 2.6.23.1 to see if its really fine. After that,
maybe I will try to bisect on kvm.git, but this will take quite a long time,
given that we had to reinstall the system due to this error. 

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] rd: Use a private inode for backing storage

2007-10-21 Thread Christian Borntraeger

Am Sonntag, 21. Oktober 2007 schrieb Eric W. Biederman:
> Nick.  Reread the patch.  The only thing your arguments have
> established for me is that this patch is not obviously correct.  Which
> makes it ineligible for a back port.  Frankly I suspect the whole
> issue is to subtle and rare to make any backport make any sense.  My
> apologies Christian.

About being rare, when I force the VM to be more aggressive reclaiming buffer
by using the following patch:
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -3225,7 +3225,7 @@ void __init buffer_init(void)
 * Limit the bh occupancy to 10% of ZONE_NORMAL
 */
nrpages = (nr_free_buffer_pages() * 10) / 100;
-   max_buffer_heads = nrpages * (PAGE_SIZE / sizeof(struct buffer_head));
+   max_buffer_heads = 0;
hotcpu_notifier(buffer_cpu_notify, 0);
 }
 
I can actually cause data corruption within some seconds. So I think the
problem is real enough to be worth fixing.

I still dont fully understand what issues you have with my patch.
- it obviously fixes the problem
- I am not aware of any regression it introduces
- its small

One concern you had, was the fact that buffer heads are out of sync with 
struct pages. Testing your first patch revealed that this is actually needed
by reiserfs - and maybe others.
I can also see, that my patch looks a bit like a bandaid that cobbles the rd
pieces together.
Is there anything else, that makes my patch unmergeable in your opinion?


Christian






-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [git pull] lguest: paravirt boot code

2007-10-22 Thread Christian Borntraeger

Am Dienstag, 23. Oktober 2007 schrieb Rusty Russell:
>git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-lguest.git
> This should be fairly clean: 45 lguest patches including the more generic
> drivers (aimed to be used by KVM as well, and hopefully others) and the start
> of separation of arch-specific from general lguest code.

>  drivers/net/virtio_net.c  |  435 +
>  drivers/virtio/Kconfig|8 +
>  drivers/virtio/Makefile   |2 +
>  drivers/virtio/config.c   |   13 +
>  drivers/virtio/virtio.c   |  189 
>  drivers/virtio/virtio_ring.c  |  313 +++
>  drivers/block/virtio_blk.c|  308 +++
>  drivers/char/virtio_console.c |  225 +

Having virtio in Linus tree would make my life easier wiring it up for 
virtualization on s390.

For all virtio patches:

Acked-by: Christian Borntraeger <[EMAIL PROTECTED]>


That reminds me that I promised Rusty to look at one small issue in virtio
net driver. Nothing that should hold of the merge. I will sent the patch via
Rusty when its ready. 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH 06/10] [SG] Update net/ to use sg helpers

2007-10-23 Thread Christian Borntraeger

Fix sctp compile

sctp fails to compile with 
net/sctp/sm_make_chunk.c: In function 'sctp_pack_cookie':
net/sctp/sm_make_chunk.c:1516: error: implicit declaration of function 
'sg_init_table'
net/sctp/sm_make_chunk.c:1517: error: implicit declaration of function 
'sg_set_page'

use the proper include file.

SCTP maintainers Vlad Yasevich and Sridhar Samudrala  are CCed.

Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>
---
 net/sctp/sm_make_chunk.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/net/sctp/sm_make_chunk.c
===
--- linux-2.6.orig/net/sctp/sm_make_chunk.c
+++ linux-2.6/net/sctp/sm_make_chunk.c
@@ -56,7 +56,7 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
 #include 
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [stable] 2.6.23 regression: top displaying 9999% CPU usage

2007-10-14 Thread Christian Borntraeger

Am Samstag, 13. Oktober 2007 schrieb Frans Pop:
> > > Please consider this patch for 2.6.23.2
> > > http://lkml.org/lkml/2007/10/4/389
> > Is it already in Linus's tree?  If so, do you have a git commit id?  If
> > not, please let us (stable@) know when it is, and what the id is, and
> > then we can add it to our tree.
> 
> Not AFAICT.
> CCing Christian (as patch author) and Ingo (as author of the change that 
> caused the regression) so they can push it through the correct channels.
> 

I dont know how to proceed with this issue. The more I think about it, the
more I am convinced that using sum_exec_runtime together with sampled utime
and stime will never guarantee monotonicity for utime and stime in proc. 
Just imagine an process with 9 ticks for utime and 0 ticks for stime. If
we now sample one tick for stime (but having only a small increase in
sum_exec_runtime) the next utime value will only be 90% of the last value. 
So returning to the 2.6.22 model seems to be the safest solution until 
somebody else comes up with an idea that works proper.

Ingo, any opinion?

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

[PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Christian Borntraeger

Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit unmaintained,
so decided to sent the patch to you :-).
I have CCed Ted, who did work on the code in the 90s. I found no current
email address of Chad Page.

We have seen ramdisk based install systems, where some pages of mapped 
libraries and programs were suddendly zeroed under memory pressure. This 
should not happen, as the ramdisk avoids freeing its pages by keeping them 
dirty all the time.

It turns out that there is a case, where the VM makes a ramdisk page clean, 
without telling the ramdisk driver.
On memory pressure shrink_zone runs and it starts to run shrink_active_list. 
There is a check for buffer_heads_over_limit, and if true, pagevec_strip is 
called. pagevec_strip calls try_to_release_page. If the mapping has no 
releasepage callback, try_to_free_buffers is called. try_to_free_buffers has 
now a special logic for some file systems to make a dirty page clean, if all 
buffers are clean. Thats what happened in our test case.

The solution is to provide a noop-releasepage callback for the ramdisk driver.
This avoids try_to_free_buffers for ramdisk pages. 

To trigger the problem, you have to make buffer_heads_over_limit true, which
means:
- lower max_buffer_heads 
or
- have a system with lots of high memory

Andrew, if there are no objections - please apply. The patch applies against
todays git.

Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>

---
 drivers/block/rd.c |   13 +
 1 files changed, 13 insertions(+)

Index: linux-2.6/drivers/block/rd.c
===
--- linux-2.6.orig/drivers/block/rd.c
+++ linux-2.6/drivers/block/rd.c
@@ -189,6 +189,18 @@ static int ramdisk_set_page_dirty(struct
return 0;
 }
 
+/*
+ * releasepage is called by pagevec_strip/try_to_release_page if
+ * buffers_heads_over_limit is true. Without a releasepage function
+ * try_to_free_buffers is called instead. That can unset the dirty
+ * bit of our ram disk pages, which will be eventually freed, even
+ * if the page is still in use.
+ */
+static int ramdisk_releasepage(struct page *page, gfp_t dummy)
+{
+   return 0;
+}
+
 static const struct address_space_operations ramdisk_aops = {
.readpage   = ramdisk_readpage,
.prepare_write  = ramdisk_prepare_write,
@@ -196,6 +208,7 @@ static const struct address_space_operat
.writepage  = ramdisk_writepage,
.set_page_dirty = ramdisk_set_page_dirty,
.writepages = ramdisk_writepages,
+   .releasepage= ramdisk_releasepage,
 };
 
 static int rd_blkdev_pagecache_IO(int rw, struct bio_vec *vec, sector_t sector,
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH resend] ramdisk: fix zeroed ramdisk pages on memory pressure

2007-10-15 Thread Christian Borntraeger

Am Montag, 15. Oktober 2007 schrieb Nick Piggin:
> On Monday 15 October 2007 18:28, Christian Borntraeger wrote:
> > Andrew, this is a resend of a bugfix patch. Ramdisk seems a bit
> > unmaintained, so decided to sent the patch to you :-).
> > I have CCed Ted, who did work on the code in the 90s. I found no current
> > email address of Chad Page.
> 
> This really needs to be fixed...

I obviously agree ;-)
We have seen this problem happen several times. 

> I can't make up my mind between the approaches to fixing it.
> 
> On one hand, I would actually prefer to really mark the buffers
> dirty (as in: Eric's fix for this problem[*]) than this patch,
> and this seems a bit like a bandaid...

I have never seen these patches, so I cannot comment on them.
> 
> On the other hand, the wound being covered by the bandaid is
> actually the code in the buffer layer that does this latent
> "cleaning" of the page because it sadly doesn't really keep
> track of the pagecache state. But it *still* feels like we
> should be marking the rd page's buffers dirty which should
> avoid this problem anyway.

Yes, that would solve the problem as well. As long as we fix
the problem, I am happy. On the other hand, do you see any
obvious problem with this "bandaid"?

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RESEND 2][PATCH 0/4] Virtual Machine Time Accounting

2007-10-15 Thread Christian Borntraeger

Am Montag, 10. September 2007 schrieb Laurent Vivier:
> The aim of these four patches is to introduce Virtual Machine time accounting.

I would move this line
   if (p->flags & PF_VCPU) {
   account_guest_time(p, cputime);
-->    p->flags &= ~PF_VCPU;  <-
   return;
   }
into kvm_guest_exit. Otherwise a guest that is running very long in 
guest context would only get the first tick accounted as guest time, no?

Besides that, this looks good and should work for kvm on s390 as well. 
Thanks Laurent.

Christian


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RESEND 2][PATCH 4/4] Modify KVM to update guest time accounting.

2007-10-15 Thread Christian Borntraeger

forgot to CC everybody besides Laurent:

Am Montag, 15. Oktober 2007 schrieben Sie:
> No, It must not be cleared here because we can't enter in the accounting code
> between kvm_guest_enter(void) and kvm_guest_exit(void).
> 
> This is why the accounting code clears it.
> 
> I put a kvm_guest_exit() only for symmetry.

Why cant we enter the accounting code?
Dont know about x86, but on s390 we can get host interrupts and reschedules
even when we run a guest (if preemption is on).

If the timer tick happens  while the guest is running, we will run the
accounting code on x86 as well, no? 


Christian



-- 
IBM Deutschland Entwicklung GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschäftsführung: Herbert Kircher 
Sitz der Gesellschaft: Böblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RESEND 2][PATCH 4/4] Modify KVM to update guest time accounting.

2007-10-15 Thread Christian Borntraeger

Am Montag, 15. Oktober 2007 schrieb Avi Kivity:
> We can clear it a bit later, after local_irq_enable() in __vcpu_run().  
> However we need a nop instruction first because "sti" keeps interrupts 
> disabled for one more instruction.

Ah, I see. The host interrupt  behaves different and instead of running the
interrupt handler, it exits the vmrun on x86? The interrupt handler will be
called some cycles after the sti?
That is different to s390. We can run the guest code for a long time and the
host instruction pointer stays on the sie instruction. That means, we can 
interrupt sie and continue by simply setting the instrution pointer (PSW) 
back to the sie instruction.

Any idea how to make this proper on all architectures? I will have a look.

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [RESEND 2][PATCH 4/4] Modify KVM to update guest time accounting.

2007-10-15 Thread Christian Borntraeger

Am Montag, 15. Oktober 2007 schrieb Laurent Vivier:
> > Any idea how to make this proper on all architectures? I will have a look.
> 
> I think the solution is to have an arch dependent kvm_guest_exit(): empty for
> x86, clearing the bit for s390.

I think we can merge your patches, as the userspace interface seems fine. To 
make the accounting work for virtual cpu accounting found on ppc and s390 we
can later add an additional patch that also deals with interruptible guest 
contexts.
So something like this should work:

Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>

---
 drivers/kvm/kvm.h |8 
 kernel/sched.c|2 ++
 2 files changed, 10 insertions(+)

Index: kvm/drivers/kvm/kvm.h
===
--- kvm.orig/drivers/kvm/kvm.h
+++ kvm/drivers/kvm/kvm.h
@@ -18,6 +18,7 @@
 
 #include 
 #include 
+#include 
 
 #define CR3_PAE_RESERVED_BITS ((X86_CR3_PWT | X86_CR3_PCD) - 1)
 #define CR3_NONPAE_RESERVED_BITS ((PAGE_SIZE-1) & ~(X86_CR3_PWT | X86_CR3_PCD))
@@ -675,11 +676,18 @@ __init void kvm_arch_init(void);
 
 static inline void kvm_guest_enter(void)
 {
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+   account_system_vtime(current);
+#endif
current->flags |= PF_VCPU;
 }
 
 static inline void kvm_guest_exit(void)
 {
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
+   account_system_vtime(current);
+   current->flags &= ~PF_VCPU;
+#endif
 }
 
 static inline int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t gva,
Index: kvm/kernel/sched.c
===
--- kvm.orig/kernel/sched.c
+++ kvm/kernel/sched.c
@@ -3312,7 +3312,9 @@ void account_system_time(struct task_str
 
if (p->flags & PF_VCPU) {
account_guest_time(p, cputime);
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING
p->flags &= ~PF_VCPU;
+#endif
return;
}
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-16 Thread Christian Borntraeger

Am Dienstag, 16. Oktober 2007 schrieb Eric W. Biederman:

> fs/buffer.c |    3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>  drivers/block/rd.c |   13 +
>  1 files changed, 1 insertions(+), 12 deletions(-)

Your patches look sane so far. I have applied both patches, and the problem 
seems gone. I will try to get these patches to our testers.

As long as they dont find new problems:

Acked-by: Christian Borntraeger <[EMAIL PROTECTED]>

Nick, Eric. What is the best patch for the stable series? Both patches from
Eric or mine? I CCed stable, so they know that something is coming.

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [stable] 2.6.23 regression: top displaying 9999% CPU usage

2007-10-16 Thread Christian Borntraeger

Chuck, Balbir,

we still have a problem with stime occosionally going backwards. I stated
below that I think this is not fixable with the current utime/stime split
algorithm.

Balbir, you wrote this code, Chuck you tried to fix it. Any ideas how to
fix this properly? The only idea I have requires that we save the old value
of utime and stime and therefore requires additional locking.

Otherwise I suggest to apply my reversion patch from below:

Am Sonntag, 14. Oktober 2007 schrieb Christian Borntraeger:
> Am Samstag, 13. Oktober 2007 schrieb Frans Pop:
> > > > Please consider this patch for 2.6.23.2
> > > > http://lkml.org/lkml/2007/10/4/389
> > > Is it already in Linus's tree?  If so, do you have a git commit id?  If
> > > not, please let us (stable@) know when it is, and what the id is, and
> > > then we can add it to our tree.
> > 
> > Not AFAICT.
> > CCing Christian (as patch author) and Ingo (as author of the change that 
> > caused the regression) so they can push it through the correct channels.
> > 
> 
> I dont know how to proceed with this issue. The more I think about it, the
> more I am convinced that using sum_exec_runtime together with sampled utime
> and stime will never guarantee monotonicity for utime and stime in proc. 
> Just imagine an process with 9 ticks for utime and 0 ticks for stime. If
> we now sample one tick for stime (but having only a small increase in
> sum_exec_runtime) the next utime value will only be 90% of the last value. 
> So returning to the 2.6.22 model seems to be the safest solution until 
> somebody else comes up with an idea that works proper.
> 
> Ingo, any opinion?

- patch---

Fix stime going backwards

after 2.6.22 the accounting was changes to use sched_clock and to use stime 
and utime only for the split. This is where task_utime and task_stime were 
introduced. See the code in 2.6.23-rc1.
Unfortunately this broke the accouting on s390 which already has precise 
numbers in utime and stime. So the code was partially reverted to use stime 
and utime again where appropriate, which are of type  cputime_t. 

It now seems, that the rest of the logic is still broken.

This patch basically reverts the rest of 
b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa
and  should restore the 2.6.22 behavior. The process time is used from tasks
utime  and stime instead of the scheduler clock. That means, in general after
a long period of time, it is less accurate than the sum_exec_runtime based 
code, but it doesnt break top.

Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>

---
 fs/proc/array.c |   37 -
 1 file changed, 37 deletions(-)

Index: linux-2.6/fs/proc/array.c
===
--- linux-2.6.orig/fs/proc/array.c
+++ linux-2.6/fs/proc/array.c
@@ -323,7 +323,6 @@ int proc_pid_status(struct task_struct *
 /*
  * Use precise platform statistics if available:
  */
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING
 static cputime_t task_utime(struct task_struct *p)
 {
return p->utime;
@@ -333,42 +332,6 @@ static cputime_t task_stime(struct task_
 {
return p->stime;
 }
-#else
-static cputime_t task_utime(struct task_struct *p)
-{
-   clock_t utime = cputime_to_clock_t(p->utime),
-   total = utime + cputime_to_clock_t(p->stime);
-   u64 temp;
-
-   /*
-* Use CFS's precise accounting:
-*/
-   temp = (u64)nsec_to_clock_t(p->se.sum_exec_runtime);
-
-   if (total) {
-   temp *= utime;
-   do_div(temp, total);
-   }
-   utime = (clock_t)temp;
-
-   return clock_t_to_cputime(utime);
-}
-
-static cputime_t task_stime(struct task_struct *p)
-{
-   clock_t stime;
-
-   /*
-* Use CFS's precise accounting. (we subtract utime from
-* the total, to make sure the total observed by userspace
-* grows monotonically - apps rely on that):
-*/
-   stime = nsec_to_clock_t(p->se.sum_exec_runtime) -
-   cputime_to_clock_t(task_utime(p));
-
-   return clock_t_to_cputime(stime);
-}
-#endif
 
 static int do_task_stat(struct task_struct *task, char *buffer, int whole)
 {
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-16 Thread Christian Borntraeger

Am Dienstag, 16. Oktober 2007 schrieb Nick Piggin:
> While I said it was a good fix when I saw the patch earlier, I think
> it's not closing the entire hole, and as such, Christian's patch is
> probably the way to go for stable.

That would be my preferred method. Merge Erics and my fixup for 2.6.24-rc.
The only open questions is, what was the reiserfs problem? Is it still
causes by Erics patches?

> For mainline, *if* we want to keep the old rd.c around at all, I
> don't see any harm in this patch so long as Christian's is merged
> as well. Sharing common code is always good.

While the merge window is still open, to me the new ramdisk code seems
more like a 2.6.25-rc thing. We actually should test the behaviour in 
low memory scenarios. What do you think?

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [stable] 2.6.23 regression: top displaying 9999% CPU usage

2007-10-16 Thread Christian Borntraeger

Am Dienstag, 16. Oktober 2007 schrieb Balbir Singh:
> I am trying to think out loud as to what the root cause of the problem
> might be. In one of the discussion threads, I saw utime going backwards,
> which seemed very odd, I suspect that those are rounding errors.
> 
> I don't understand your explanation below
> 
> Initially utime = 9, stime = 0, sum_exec_runtime = S1
> 
> Later
> 
> utime = 9, stime = 1, sum_exec_runtime = S2
> 
> We can be sure that S >= (utime + stime)

I think here is the problem. How can we be sure? We cant. utime and stime
are sampled, so they can be largely off in any direction,if the program
sleeps often and manages to synchronize itself to the timer tick. Lets say
a program only does a simple system call and then sleeps. So sum_exec_runtime
is increased by lets say 1000 cycles on a 1Ghz box which means 1000ns. If now 
the timer tick happens exactly at this moment, stime is increased by 1 tick
= 100ns. 

Maybe there is some magic in the code which I did not see, but obviously
the problem exists and looking at Frans data (stime+utime) are not decreasing,
but stime isnt and utime is. If you look at Frans data you see:
Oct 16 11:54:48 8 10
Oct 16 11:54:49 6 12  <-- utime
Oct 16 11:54:50 6 12
Oct 16 11:54:51 6 12
Oct 16 11:54:52 8 10  <-- stime
Oct 16 11:54:53 8 10
Oct 16 11:54:54 8 10
Oct 16 11:54:55 8 12
Oct 16 11:54:56 8 12

(stime+utime) is constant. That means that S2-S1 is obviously smaller than
one tick (See the calculation in task_stime). I am quite sure it is caused
by changes in the sampled values p->utime and p->stime.

> 
> If S2 = S1 + delta, then as per our calculation
> 
> Initially
> 
> utime_proc = (utime * (S1))/(utime + stime)
>= nsec_to_clock_t(9 * S1 / 9)
> 
> later
> 
> utime_proc = nsec_to_clock_t(9 * S2/10)
> 
> Given that S >= (utime + stime), we should be fine.



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] clear PF_VCPU in kvm_guest_exit()

2007-10-17 Thread Christian Borntraeger

Am Mittwoch, 17. Oktober 2007 schrieb Laurent Vivier:
> clear PF_VCPU in kvm_guest_exit() and move kvm_guest_exit() after 
> local_irq_enable().

Looks good. In the next days I will sent a patch for precise guest time 
accounting with CONFIG_VIRT_CPU_ACCOUNTING on top of this patch. 

Acked-By: Christian Borntraeger <[EMAIL PROTECTED]>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-17 Thread Christian Borntraeger

Eric,

Am Dienstag, 16. Oktober 2007 schrieb Christian Borntraeger:
> Am Dienstag, 16. Oktober 2007 schrieb Eric W. Biederman:
> 
> > fs/buffer.c |    3 +++
> > 1 files changed, 3 insertions(+), 0 deletions(-)
> >  drivers/block/rd.c |   13 +
> >  1 files changed, 1 insertions(+), 12 deletions(-)
> 
> Your patches look sane so far. I have applied both patches, and the problem 
> seems gone. I will try to get these patches to our testers.
> 
> As long as they dont find new problems:

Our testers did only a short test, and then they were stopped by problems with
reiserfs. At the moment I cannot say for sure if your patch caused this, but 
we got the following BUG

ReiserFS: ram0: warning: Created .reiserfs_priv on ram0 - reserved for xattr 
storage.
[ cut here ]
kernel BUG at 
/home/autobuild/BUILD/linux-2.6.23-20071017/fs/reiserfs/journal.c:1117!
illegal operation: 0001 [#1]
Modules linked in: reiserfs dm_multipath sunrpc dm_mod qeth ccwgroup vmur
CPU:3Not tainted
Process reiserfs/3 (pid: 2592, task: 77dac418, ksp: 7513ee88)
Krnl PSW : 070c3000 fb344380 (flush_commit_list+0x808/0x95c [reiserfs])
   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:0 CC:3 PM:0
Krnl GPRS: 0002 7411b5c8 002b 
   7b04d000 0001  76d1de00
   7513eec0 0003 0012 77f77680
   7411b608 fb343b7e fb34404a 7513ee50
Krnl Code: fb344374: a7210002   tmll%r2,2
   fb344378: a7840004   brc 8,fb344380
   fb34437c: a7f40001   brc 15,fb34437e
  >fb344380: 5810d8c2   l   %r1,2242(%r13)
   fb344384: 5820b03c   l   %r2,60(%r11)
   fb344388: 0de1   basr%r14,%r1
   fb34438a: 5810d90e   l   %r1,2318(%r13)
   fb34438e: 5820b03c   l   %r2,60(%r11)


Looking at the code, this really seems related to dirty buffers, so your patch
is the main suspect at the moment. 

if (!barrier) {
/* If there was a write error in the journal - we can't commit
 * this transaction - it will be invalid and, if successful,
 * will just end up propagating the write error out to
 * the file system. */
if (likely(!retval && !reiserfs_is_journal_aborted (journal))) {
if (buffer_dirty(jl->j_commit_bh))
1117>BUG();
mark_buffer_dirty(jl->j_commit_bh) ;
sync_dirty_buffer(jl->j_commit_bh) ;
}
}

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-17 Thread Christian Borntraeger

Am Mittwoch, 17. Oktober 2007 schrieb Eric W. Biederman:
> Did you have both of my changes applied?
> To init_page_buffer() and to the ramdisk_set_dirty_page?

Yes, I removed my patch and applied both patches from you. 

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] rd: Mark ramdisk buffers heads dirty

2007-10-18 Thread Christian Borntraeger

Am Donnerstag, 18. Oktober 2007 schrieb Eric W. Biederman:
> Grr. Inconsistent rules on a core piece of infrastructure.
> It looks like that if there is any trivial/minimal fix it
> is based on your patch suppressing try_to_free_buffers.  Ugh.
> 
> Eric

Ok. What do you think about having my patch for 2.6.23 stable, for 2.6.24
and doing a nicer fix (rd rewrite for example for post 2.6.24)?

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Use virtual cpu accounting if available for guest times.

2007-10-18 Thread Christian Borntraeger

Avi,

ppc and s390 offer the possibility to track process times precisely
by looking at cpu timer on every context switch, irq, softirq etc.
We can use that infrastructure as well for guest time accounting.
We need to account the used time before we change the state. 
This patch adds a call to account_system_vtime to kvm_guest_enter
and kvm_guest exit. If CONFIG_VIRT_CPU_ACCOUNTING is not set,
account_system_vtime is defined in hardirq.h as an empty function,
which means this patch does not change the behaviour on other
platforms.

I compile tested this patch on x86 and function tested the patch on
s390.

Avi, please apply.


Signed-off-by: Christian Borntraeger <[EMAIL PROTECTED]>

---
 drivers/kvm/kvm.h |3 +++
 1 file changed, 3 insertions(+)

Index: kvm/drivers/kvm/kvm.h
===
--- kvm.orig/drivers/kvm/kvm.h
+++ kvm/drivers/kvm/kvm.h
@@ -7,6 +7,7 @@
  */
 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -669,11 +670,13 @@ __init void kvm_arch_init(void);
 
 static inline void kvm_guest_enter(void)
 {
+   account_system_vtime(current);
current->flags |= PF_VCPU;
 }
 
 static inline void kvm_guest_exit(void)
 {
+   account_system_vtime(current);
current->flags &= ~PF_VCPU;
 }
 
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Future of Linux 2.6.22.y series

2007-11-06 Thread Christian Borntraeger

Am Dienstag, 6. November 2007 schrieb Jan Kara:
>   In that case you might be interested in picking up the attached fix :)
> I don't know if you already didn't get it...

---
[PATCH] rd: fix data corruption on memory pressure
---

The patch is currently in mm but not yet upstream. As soon as it is merged,
this patch is a good candidate for stable.

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: Bug: 2.6.15 Process accounting fails to account for small time slice loads (still)

2007-01-30 Thread Christian Borntraeger

On Tuesday 30 January 2007 17:24, you wrote:
> The bug described back in 2004 in the following message still seems to 
exist:
> 
> http://linux.derkeiler.com/Mailing-Lists/Kernel/2004-05/4313.html
> 
> Essentially, it appears that if a process sleeps before it uses up a
> complete jiffy then no charge is made to its process accounting table.
> 
> This problem still exists in the same manner in kernel:

This is a known limitation of tick based accounting used on most 
architectures. 
It was fixed on s390 and using a cpu timer on every context switch. (IIRC ppc 
is now using this infrastructure as well)

see for a http://www.kernel-traffic.org/kernel-traffic/kt20041117_284.html#8
for the first patch in the series.

I dont know if there are plans of using the infrastructure on x86 compatible 
platforms. I also dont know it that is actually easy to implement with x86 
hardware. 

-- 
Mit freundlichen Grüßen / Best Regards

Christian Borntraeger
Linux Software Engineer zSeries Linux & Virtualization

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: A set of "standard" virtual devices?

2007-04-03 Thread Christian Borntraeger

On Monday 02 April 2007 23:12, Andi Kleen wrote:
> 
> > How would that work in the case where virtualized guests don't have a
> > visible PCI bus, and the virtual environment doesn't pretend to emulate
> > a PCI bus?
> 
> If they emulated one with the appropiate device 
> then distribution driver auto probing would just work transparently for
> them. 

Still, that would only make sense for virtualized platforms that usually have 
a PCI bus. Thinking about seeing a PCI device on ,lets say, s390 is strange.

Christian
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH v3 02/11] mm: Hardened usercopy

2016-07-19 Thread Christian Borntraeger

On 07/19/2016 09:31 PM, Kees Cook wrote:
> On Tue, Jul 19, 2016 at 2:21 AM, Christian Borntraeger
>  wrote:
>> On 07/15/2016 11:44 PM, Kees Cook wrote:
>>> +config HAVE_ARCH_LINEAR_KERNEL_MAPPING
>>> + bool
>>> + help
>>> +   An architecture should select this if it has a secondary linear
>>> +   mapping of the kernel text. This is used to verify that kernel
>>> +   text exposures are not visible under CONFIG_HARDENED_USERCOPY.
>>
>> I have trouble parsing this. (What does secondary linear mapping mean?)
> 
> I likely need help clarifying this language...
> 
>> So let me give an example below
>>
>>> +
>> [...]
>>> +/* Is this address range in the kernel text area? */
>>> +static inline const char *check_kernel_text_object(const void *ptr,
>>> +unsigned long n)
>>> +{
>>> + unsigned long textlow = (unsigned long)_stext;
>>> + unsigned long texthigh = (unsigned long)_etext;
>>> +
>>> + if (overlaps(ptr, n, textlow, texthigh))
>>> + return "";
>>> +
>>> +#ifdef HAVE_ARCH_LINEAR_KERNEL_MAPPING
>>> + /* Check against linear mapping as well. */
>>> + if (overlaps(ptr, n, (unsigned long)__va(__pa(textlow)),
>>> +  (unsigned long)__va(__pa(texthigh
>>> + return "";
>>> +#endif
>>> +
>>> + return NULL;
>>> +}
>>
>> s390 has an address space for user (primary address space from 0..4TB/8PB) 
>> and a separate
>> address space (home space from 0..4TB/8PB) for the kernel. In this home 
>> space the kernel
>> mapping is virtual containing the physical memory as well as vmalloc memory 
>> (creating aliases
>> into the physical one). The kernel text is mapped from _stext to _etext in 
>> this mapping.
>> So I assume this would qualify for HAVE_ARCH_LINEAR_KERNEL_MAPPING ?
> 
> If I understand your example, yes. In the home space you have two
> addresses that reference the kernel image?

No, there is only one address that points to the kernel.
As we have no kernel ASLR yet, and the kernel mapping is 
a 1:1 mapping from 0 to memory end and the kernel is only
from _stext to _etext. The vmalloc area contains modules
and vmalloc but not a 2nd kernel mapping.

But thanks for your example, now I understood. If we have only
one address 
>>> + if (overlaps(ptr, n, textlow, texthigh))
>>> + return "";

This is just enough.

So what about for the CONFIG text:

   An architecture should select this if the kernel mapping has a secondary
   linear mapping of the kernel text - in other words more than one virtual
   kernel address that points to the kernel image. This is used to verify 
   that kernel text exposures are not visible under 
CONFIG_HARDENED_USERCOPY.


> I wonder if I can avoid the CONFIG entirely if I just did a
> __va(__pa(_stext)) != _stext test... would that break anyone?

Can this be resolved on all platforms at compile time?

Re: [PATCH v3 02/11] mm: Hardened usercopy

2016-07-19 Thread Christian Borntraeger

On 07/19/2016 10:34 PM, Kees Cook wrote:
[...]
>>
>> So what about for the CONFIG text:
>>
>>An architecture should select this if the kernel mapping has a 
>> secondary
>>linear mapping of the kernel text - in other words more than one 
>> virtual
>>kernel address that points to the kernel image. This is used to verify
>>that kernel text exposures are not visible under 
>> CONFIG_HARDENED_USERCOPY.
> 
> Sounds good, I've adjusted it for now.
> 
>>> I wonder if I can avoid the CONFIG entirely if I just did a
>>> __va(__pa(_stext)) != _stext test... would that break anyone?
>>
>> Can this be resolved on all platforms at compile time?
> 
> Well, I think it still needs a runtime check (compile-time may not be
> able to tell about kaslr, or who knows what else). I would really like
> to avoid the CONFIG if possible, though. Would this do the right thing
> on s390? This appears to work where I'm able to test it (32/64 x86,
> 32/64 arm):
> 
> unsigned long textlow = (unsigned long)_stext;
> unsigned long texthigh = (unsigned long)_etext;
> unsigned long textlow_linear = (unsigned long)__va(__pa(textlow);
> unsigned long texthigh_linear = (unsigned long)__va(__pa(texthigh);
> 
as we have

#define PAGE_OFFSET 0x0UL
#define __pa(x) (unsigned long)(x)
#define __va(x) (void *)(unsigned long)(x)

both should be identical on s390 as of today, so it should work fine and only
do the check once

> if (overlaps(ptr, n, textlow, texthigh))
> return "";
> 
> /* Check against possible secondary linear mapping as well. */
> if (textlow != textlow_linear &&
> overlaps(ptr, n, textlow_linear, texthigh_linear))
> return "";
> 
> return NULL;
> 
> 
> -Kees
> 


PS: Not sure how useful and flexible this offers is but you can get some 
temporary
free access to an s390 on https://developer.ibm.com/linuxone/

linux-next: new scheduler messages span: 0-15 (max cpu_capacity = 589) when starting KVM guests

2016-09-19 Thread Christian Borntraeger

Dietmar, Ingo, Tejun,

since commit cd92bfd3b8cb0ec2ee825e55a3aee704cd55aea9
   sched/core: Store maximum per-CPU capacity in root domain

I get tons of messages from the scheduler like
[..]
span: 0-15 (max cpu_capacity = 589)
span: 0-15 (max cpu_capacity = 589)
span: 0-15 (max cpu_capacity = 589)
span: 0-15 (max cpu_capacity = 589)
[..]

whenever I start kvm guests with libvirt.

The reason seems to be that libvirt via systemd/machined tries to move all
guest vcpus into its cpuset and for whatever reasons, the way it is done
will always call rebuild_sched_domains from the cgroup code.

While the message alone is somewhat of a nuisance, I think rebuilding
the scheduling domains for moving kvm vcpus is really expensive.

Tejun, do you have an idea whats going on here? Is libvirt using
the cgroup interface wrong (e.g. also d a memory migrate or whatever)

Christian

Re: linux-next: new scheduler messages span: 0-15 (max cpu_capacity = 589) when starting KVM guests

2016-09-19 Thread Christian Borntraeger

On 09/19/2016 03:40 PM, Peter Zijlstra wrote:
> On Mon, Sep 19, 2016 at 03:19:11PM +0200, Christian Borntraeger wrote:
>> Dietmar, Ingo, Tejun,
>>
>> since commit cd92bfd3b8cb0ec2ee825e55a3aee704cd55aea9
>>sched/core: Store maximum per-CPU capacity in root domain
>>
>> I get tons of messages from the scheduler like
>> [..]
>> span: 0-15 (max cpu_capacity = 589)
>> span: 0-15 (max cpu_capacity = 589)
>> span: 0-15 (max cpu_capacity = 589)
>> span: 0-15 (max cpu_capacity = 589)
>> [..]
>>
> 
> Oh, oops ;-)
> 
> Something like the below ought to cure I think.

That would certainly make the message go away. (e.g. also
good for cpu hotplug)

I am still asking myself why cgroup cpuset really needs to rebuild
the scheduling domains if a vcpu thread is moved. 
> 
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5f7b3cdf0be..fdc9e311fd29 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6990,7 +6990,7 @@ static int build_sched_domains(const struct cpumask 
> *cpu_map,
>   }
>   rcu_read_unlock();
> 
> - if (rq) {
> + if (rq && sched_debug_enabled) {
>   pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
>   cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
>   }
>

Re: linux-next: new scheduler messages span: 0-15 (max cpu_capacity = 589) when starting KVM guests

2016-09-20 Thread Christian Borntraeger

On 09/19/2016 03:40 PM, Peter Zijlstra wrote:
> On Mon, Sep 19, 2016 at 03:19:11PM +0200, Christian Borntraeger wrote:
>> Dietmar, Ingo, Tejun,
>>
>> since commit cd92bfd3b8cb0ec2ee825e55a3aee704cd55aea9
>>sched/core: Store maximum per-CPU capacity in root domain
>>
>> I get tons of messages from the scheduler like
>> [..]
>> span: 0-15 (max cpu_capacity = 589)
>> span: 0-15 (max cpu_capacity = 589)
>> span: 0-15 (max cpu_capacity = 589)
>> span: 0-15 (max cpu_capacity = 589)
>> [..]
>>
> 
> Oh, oops ;-)
> 
> Something like the below ought to cure I think.

Still trying to get some opinion from Tejun, why moving vcpus in
its cpuset causes schedule domain rebuilds, but 
Acked-by: Christian Borntraeger 

for such a patch.

> 
> ---
>  kernel/sched/core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index f5f7b3cdf0be..fdc9e311fd29 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6990,7 +6990,7 @@ static int build_sched_domains(const struct cpumask 
> *cpu_map,
>   }
>   rcu_read_unlock();
> 
> - if (rq) {
> + if (rq && sched_debug_enabled) {
>   pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
>   cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
>   }
>

4.7-rc1: lockdep: inconsistent lock state kcompactd/aio_migratepage/mem_cgroup_migrate....

2016-06-20 Thread Christian Borntraeger

Has anyone seen this before?


[  335.384657] =
[  335.384659] [ INFO: inconsistent lock state ]
[  335.384663] 4.7.0-rc1+ #52 Tainted: GW  
[  335.384666] -
[  335.384669] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[  335.384672] kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
[  335.384674]  (&(&ctx->completion_lock)->rlock){+.?.-.}, at: 
[<0038fd96>] aio_migratepage+0x156/0x1e8
[  335.384692] {IN-SOFTIRQ-W} state was registered at:
[  335.384696]   [<001a8366>] __lock_acquire+0x5b6/0x1930
[  335.384704]   [<001a9b9e>] lock_acquire+0xee/0x270
[  335.384708]   [<00951fee>] _raw_spin_lock_irqsave+0x66/0xb0
[  335.384717]   [<00390108>] aio_complete+0x98/0x328
[  335.384721]   [<0037c7d4>] dio_complete+0xe4/0x1e0
[  335.384728]   [<00650e64>] blk_update_request+0xd4/0x450
[  335.384736]   [<0072a1a8>] scsi_end_request+0x48/0x1c8
[  335.384743]   [<0072d7e2>] scsi_io_completion+0x272/0x698
[  335.384747]   [<0065adb2>] blk_done_softirq+0xca/0xe8
[  335.384753]   [<00953f80>] __do_softirq+0xc8/0x518
[  335.384757]   [<001495de>] irq_exit+0xee/0x110
[  335.384764]   [<0010ceba>] do_IRQ+0x6a/0x88
[  335.384769]   [<0095342e>] io_int_handler+0x11a/0x25c
[  335.384774]   [<0094fb5c>] __mutex_unlock_slowpath+0x144/0x1d8
[  335.384778]   [<0094fb58>] __mutex_unlock_slowpath+0x140/0x1d8
[  335.384783]   [<003c6114>] kernfs_iop_permission+0x64/0x80
[  335.384791]   [<0033ba86>] __inode_permission+0x9e/0xf0
[  335.384799]   [<0033ea96>] link_path_walk+0x6e/0x510
[  335.384825]   [<0033f09c>] path_lookupat+0xc4/0x1a8
[  335.384828]   [<0034195c>] filename_lookup+0x9c/0x160
[  335.384831]   [<00341b44>] user_path_at_empty+0x5c/0x70
[  335.384834]   [<00335250>] SyS_readlinkat+0x68/0x140
[  335.384838]   [<00952f8e>] system_call+0xd6/0x270
[  335.384842] irq event stamp: 971410
[  335.384844] hardirqs last  enabled at (971409): [<0030f982>] 
migrate_page_move_mapping+0x3ea/0x588
[  335.384850] hardirqs last disabled at (971410): [<00951fc4>] 
_raw_spin_lock_irqsave+0x3c/0xb0
[  335.384854] softirqs last  enabled at (970526): [<00954318>] 
__do_softirq+0x460/0x518
[  335.384858] softirqs last disabled at (970519): [<001495de>] 
irq_exit+0xee/0x110
[  335.384862] 
   other info that might help us debug this:
[  335.384864]  Possible unsafe locking scenario:

[  335.384867]CPU0
[  335.384870]
[  335.384871]   lock(&(&ctx->completion_lock)->rlock);
[  335.384875]   
[  335.384877] lock(&(&ctx->completion_lock)->rlock);
[  335.384882] 
*** DEADLOCK ***

[  335.384885] 3 locks held by kcompactd0/151:
[  335.384886]  #0:  (&(&mapping->private_lock)->rlock){+.+.-.}, at: 
[<0038fc82>] aio_migratepage+0x42/0x1e8
[  335.384895]  #1:  (&ctx->ring_lock){+.+.+.}, at: [<0038fc9a>] 
aio_migratepage+0x5a/0x1e8
[  335.384902]  #2:  (&(&ctx->completion_lock)->rlock){+.?.-.}, at: 
[<0038fd96>] aio_migratepage+0x156/0x1e8
[  335.384910] 
   stack backtrace:
[  335.384913] CPU: 20 PID: 151 Comm: kcompactd0 Tainted: GW   
4.7.0-rc1+ #52
[  335.384915]0001c6cbb730 0001c6cbb7c0 0002 
 
  0001c6cbb860 0001c6cbb7d8 0001c6cbb7d8 
00114496 
   00b517ec 00b680b6 
000b 
  0001c6cbb820 0001c6cbb7c0  
 
  04000184ad18 00114496 0001c6cbb7c0 
0001c6cbb820 
[  335.384945] Call Trace:
[  335.384950] ([<001143d2>] show_trace+0xea/0xf0)
[  335.384953] ([<0011444a>] show_stack+0x72/0xf0)
[  335.384959] ([<00684522>] dump_stack+0x9a/0xd8)
[  335.384963] ([<0028679c>] print_usage_bug.part.27+0x2d4/0x2e8)
[  335.384966] ([<001a71ce>] mark_lock+0x17e/0x758)
[  335.384969] ([<001a784a>] mark_held_locks+0xa2/0xd0)
[  335.384972] ([<001a79b8>] trace_hardirqs_on_caller+0x140/0x1c0)
[  335.384977] ([<00326026>] mem_cgroup_migrate+0x266/0x370)
[  335.384980] ([<0038fdaa>] aio_migratepage+0x16a/0x1e8)
[  335.384982] ([<00310568>] move_to_new_page+0xb0/0x260)
[  335.384986] ([<003111b4>] migrate_pages+0x8f4/0x9f0)
[  335.384990] ([<002c507c>] compact_zone+0x4dc/0xdc8)
[  335.384992] ([<002c5e22>] kcompactd_do_work+0x1aa/0x358)
[  335.384994] ([<002c608a>] kcompactd+0xba/0x2c8)
[  335.384999] ([<0016b09a>] kthread+0x10a/0x110)
[  335.385001] ([<0095315a>] kernel_thread_starter+0x6/0xc)
[  335.385003] ([<00953154>] kernel_thread_starter+0x0/0xc)
[  335.385004] INFO: lockdep is turned off.

[PATCH 1/1] mm/page_ref: introduce page_ref_inc_return

2016-06-20 Thread Christian Borntraeger

From: David Hildenbrand 

Let's introduce that helper.

Signed-off-by: David Hildenbrand 
Signed-off-by: Christian Borntraeger 
---
 include/linux/page_ref.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/page_ref.h b/include/linux/page_ref.h
index 8b5e0a9..610e132 100644
--- a/include/linux/page_ref.h
+++ b/include/linux/page_ref.h
@@ -124,6 +124,15 @@ static inline int page_ref_sub_and_test(struct page *page, 
int nr)
return ret;
 }
 
+static inline int page_ref_inc_return(struct page *page)
+{
+   int ret = atomic_inc_return(&page->_refcount);
+
+   if (page_ref_tracepoint_active(__tracepoint_page_ref_mod_and_return))
+   __page_ref_mod_and_return(page, 1, ret);
+   return ret;
+}
+
 static inline int page_ref_dec_and_test(struct page *page)
 {
int ret = atomic_dec_and_test(&page->_refcount);
-- 
2.5.5

[PATCH 0/1] introduce page_ref_inc_return

2016-06-20 Thread Christian Borntraeger

commit 0139aa7b7fa1 ("mm: rename _count, field of the struct page,
to _refcount") changed all accesses to page->_count to use wrappers.
There is already a page_ref_dec_return and we need for kvm/s390
code the function "page_ref_inc_return" as well.

FWIW, the code is under
https://git.kernel.org/cgit/linux/kernel/git/kvms390/linux.git/log/?h=next


Can I get an ack to carry this patch via the KVM/s390 tree (will
be merged into Paolos kvm tree soon)?

David Hildenbrand (1):
  mm/page_ref: introduce page_ref_inc_return

 include/linux/page_ref.h | 9 +
 1 file changed, 9 insertions(+)

-- 
2.5.5

Re: [PATCH] memcg: mem_cgroup_migrate() may be called with irq disabled

2016-06-20 Thread Christian Borntraeger

On 06/20/2016 08:41 PM, Tejun Heo wrote:
> Hello,
> 
> Christian, I *think* this should fix it.  Can you please verify?

I cannot reliably reproduce the bug :-/,but at least I have not seen
it with this patch and the patch makes sense and matches the traces.

> 
> Thanks!
> -- 8< --
> mem_cgroup_migrate() uses local_irq_disable/enable() but can be called
> with irq disabled from migrate_page_copy().  This ends up enabling irq
> while holding a irq context lock triggering the following lockdep
> warning.  Fix it by using irq_save/restore instead.
> 
>   =
>   [ INFO: inconsistent lock state ]
>   4.7.0-rc1+ #52 Tainted: GW  
>   -
>   inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
>   kcompactd0/151 [HC0[0]:SC0[0]:HE1:SE1] takes:
>(&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<0038fd96>] 
> aio_migratepage+0x156/0x1e8
>   {IN-SOFTIRQ-W} state was registered at:
> [<001a8366>] __lock_acquire+0x5b6/0x1930
> [<001a9b9e>] lock_acquire+0xee/0x270
> [<00951fee>] _raw_spin_lock_irqsave+0x66/0xb0
> [<00390108>] aio_complete+0x98/0x328
> [<0037c7d4>] dio_complete+0xe4/0x1e0
> [<00650e64>] blk_update_request+0xd4/0x450
> [<0072a1a8>] scsi_end_request+0x48/0x1c8
> [<0072d7e2>] scsi_io_completion+0x272/0x698
> [<0065adb2>] blk_done_softirq+0xca/0xe8
> [<00953f80>] __do_softirq+0xc8/0x518
> [<001495de>] irq_exit+0xee/0x110
> [<0010ceba>] do_IRQ+0x6a/0x88
> [<0095342e>] io_int_handler+0x11a/0x25c
> [<0094fb5c>] __mutex_unlock_slowpath+0x144/0x1d8
> [<0094fb58>] __mutex_unlock_slowpath+0x140/0x1d8
> [<003c6114>] kernfs_iop_permission+0x64/0x80
> [<0033ba86>] __inode_permission+0x9e/0xf0
> [<0033ea96>] link_path_walk+0x6e/0x510
> [<0033f09c>] path_lookupat+0xc4/0x1a8
> [<0034195c>] filename_lookup+0x9c/0x160
> [<00341b44>] user_path_at_empty+0x5c/0x70
> [<00335250>] SyS_readlinkat+0x68/0x140
> [<00952f8e>] system_call+0xd6/0x270
>   irq event stamp: 971410
>   hardirqs last  enabled at (971409): [<0030f982>] 
> migrate_page_move_mapping+0x3ea/0x588
>   hardirqs last disabled at (971410): [<00951fc4>] 
> _raw_spin_lock_irqsave+0x3c/0xb0
>   softirqs last  enabled at (970526): [<00954318>] 
> __do_softirq+0x460/0x518
>   softirqs last disabled at (970519): [<001495de>] irq_exit+0xee/0x110
> 
>   other info that might help us debug this:
>Possible unsafe locking scenario:
> 
>CPU0
>
> lock(&(&ctx->completion_lock)->rlock);
> 
>   lock(&(&ctx->completion_lock)->rlock);
> 
> *** DEADLOCK ***
> 
>   3 locks held by kcompactd0/151:
>#0:  (&(&mapping->private_lock)->rlock){+.+.-.}, at: [<0038fc82>] 
> aio_migratepage+0x42/0x1e8
>#1:  (&ctx->ring_lock){+.+.+.}, at: [<0038fc9a>] 
> aio_migratepage+0x5a/0x1e8
>#2:  (&(&ctx->completion_lock)->rlock){+.?.-.}, at: [<0038fd96>] 
> aio_migratepage+0x156/0x1e8
> 
>   stack backtrace:
>   CPU: 20 PID: 151 Comm: kcompactd0 Tainted: GW   4.7.0-rc1+ #52
>0001c6cbb730 0001c6cbb7c0 0002  
>0001c6cbb860 0001c6cbb7d8 0001c6cbb7d8 00114496 
> 00b517ec 00b680b6 000b 
>0001c6cbb820 0001c6cbb7c0   
>04000184ad18 00114496 0001c6cbb7c0 0001c6cbb820 
>   Call Trace:
>   ([<001143d2>] show_trace+0xea/0xf0)
>   ([<0011444a>] show_stack+0x72/0xf0)
>   ([<00684522>] dump_stack+0x9a/0xd8)
>   ([<0028679c>] print_usage_bug.part.27+0x2d4/0x2e8)
>   ([<001a71ce>] mark_lock+0x17e/0x758)
>   ([<001a784a>] mark_held_locks+0xa2/0xd0)
>   ([<001a79b8>] trace_hardirqs_on_caller+0x140/0x1c0)
>   ([<00326026>] mem_cgroup_migrate+0x266/0x370)
>   ([<0038fdaa>] aio_migratepage+0x16a/0x1e8)
>   ([<00310568>] move_to_new_page+0xb0/0x260)
>   ([<003111b4>] migrate_pages+0x8f4/0x9f0)
>   ([<002c507c>] compact_zone+0x4dc/0xdc8)
>   ([&l

Re: [PATCH v2 0/2] x86/entry: speed up context-tracking system calls by 150 clock cycles

2016-06-21 Thread Christian Borntraeger

On 06/20/2016 04:58 PM, Paolo Bonzini wrote:
> The first patches are the two optimizations I posted on May 30th
> for the system call entry/exit code.  The only change is in the
> function names, which use the user_{enter,exit}_irqoff favored
> by Andy and Ingo.  The first patch matches what commit d0e536d8939
> ("context_tracking: avoid irq_save/irq_restore on guest entry and exit",
> 2015-10-28) did for guest entry and exit.  The second simply adds
> an inline annotation; the compiler doesn't figure it out because the
> function is not static.
> 
> The second two patches move guest_{enter,exit} to the same naming
> convention, removing the KVM wrappers kvm_guest_{enter,exit} and
> __kvm_guest_{enter,exit} in the process.  I would like these two to
> go through the KVM tree because I have other optimizations for 4.8
> on top of these patches.
> 
> Thanks,
> 
> Paolo
> 
> Paolo Bonzini (4):
>   x86/entry: Avoid interrupt flag save and restore
>   x86/entry: Inline enter_from_user_mode
>   context_tracking: move rcu_virt_note_context_switch out of kvm_host.h
>   KVM: remove kvm_guest_enter/exit wrappers
> 
>  arch/arm/kvm/arm.c   |  8 +++---
>  arch/mips/kvm/mips.c |  4 +--
>  arch/powerpc/kvm/book3s_hv.c |  4 +--
>  arch/powerpc/kvm/book3s_pr.c |  4 +--
>  arch/powerpc/kvm/booke.c |  4 +--
>  arch/powerpc/kvm/powerpc.c   |  2 +-
>  arch/s390/kvm/kvm-s390.c |  4 +--
>  arch/x86/entry/common.c  |  6 ++---
>  arch/x86/kvm/x86.c   |  4 +--
>  include/linux/context_tracking.h | 53 
> +---
>  include/linux/kvm_host.h | 39 -
>  11 files changed, 69 insertions(+), 63 deletions(-)
> 
Series looks sane and does work on s390.
It has a minor conflict with my vsie pull request (so either add vsie.c
to this patch set or fixup my pull request in the merge commit to replace
kvm_guest_exit/enter with the new functions.

Christian

group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)

2016-09-26 Thread Christian Borntraeger

Folks,

I have seen big scalability degredations sind 4.3 (bisected 9d89c257d
sched/fair: Rewrite runnable load and utilization average tracking)
This has not been fixed by subsequent patches,e.g. the ones that try to
fix this for interactive workload.

The problem is only visible for sleep/wakeup heavy workload which must
be part of the scheduler group (e.g. a sysbench OLTP inside a KVM guest
as libvirt will put KVM guests into cgroup instances).

For example a simple sysbench oltp with mysql inside a KVM guests with
16 CPUs backed by 8 host cpus (16 host threads) scales less (scale up
inside a guest, having multiple instances). This is the numbers of
events per second.
Unmounting /sys/fs/cgroup/cpu,cpuacct (thus forcing libvirt to not
use group scheduling for KVM guests) makes the behaviour much better:


instances   group   nogroup
1   34063002
2   50784940
3   60176760
4   64718216 (+27%)
5   67169196
6   69769783
7   712710170
8   739910385 (+40%)

before 9d89c257d ("sched/fair: Rewrite runnable load and utilization
average tracking") there was basically no difference between group
or non-group scheduling. These numbers are with 4.7, older kernels after
9d89c257d show a similar difference.

The bad thing is that there is a lot of idle cpu power in the host
when this happens so the scheduler seems to not realize that this
workload could use more cpus in the host.

I tried some experiments , but I have not found a hack that "fixes" the
degredation, which would give me an indication which part  of the code
is broken. So are there any ideas? Is the estimated group load
calculation just not fast enough for sleep/wakeup workload?

Christian

Re: group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)

2016-09-26 Thread Christian Borntraeger

On 09/26/2016 12:56 PM, Peter Zijlstra wrote:
> On Mon, Sep 26, 2016 at 12:42:22PM +0200, Christian Borntraeger wrote:
>> Folks,
>>
>> I have seen big scalability degredations sind 4.3 (bisected 9d89c257d
>> sched/fair: Rewrite runnable load and utilization average tracking)
>> This has not been fixed by subsequent patches,e.g. the ones that try to
>> fix this for interactive workload.
>>
>> The problem is only visible for sleep/wakeup heavy workload which must
>> be part of the scheduler group (e.g. a sysbench OLTP inside a KVM guest
>> as libvirt will put KVM guests into cgroup instances).
>>
>> For example a simple sysbench oltp with mysql inside a KVM guests with
>> 16 CPUs backed by 8 host cpus (16 host threads) scales less (scale up
>> inside a guest, having multiple instances). This is the numbers of
>> events per second.
>> Unmounting /sys/fs/cgroup/cpu,cpuacct (thus forcing libvirt to not
>> use group scheduling for KVM guests) makes the behaviour much better:
>>
>>
>> instancesgroup   nogroup
>> 134063002
>> 250784940
>> 360176760
>> 464718216 (+27%)
>> 567169196
>> 669769783
>> 7712710170
>> 8739910385 (+40%)
>>
>> before 9d89c257d ("sched/fair: Rewrite runnable load and utilization
>> average tracking") there was basically no difference between group
>> or non-group scheduling. These numbers are with 4.7, older kernels after
>> 9d89c257d show a similar difference.
>>
>> The bad thing is that there is a lot of idle cpu power in the host
>> when this happens so the scheduler seems to not realize that this
>> workload could use more cpus in the host.
>>
>> I tried some experiments , but I have not found a hack that "fixes" the
>> degredation, which would give me an indication which part  of the code
>> is broken. So are there any ideas? Is the estimated group load
>> calculation just not fast enough for sleep/wakeup workload?
> 
> One of the differences in the old and new thing is being addressed by
> these patches:
> 
>   
> https://lkml.kernel.org/r/1473666472-13749-1-git-send-email-vincent.guit...@linaro.org
> 
> Could you see if those patches make a difference? If not, we'll have to
> go poke elsewhere ofcourse ;-)

Those patches do not apply cleanly on v4.7, linux/master or next/master.
Is there a good branch to test these patches?

Re: group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)

2016-09-26 Thread Christian Borntraeger

On 09/26/2016 01:53 PM, Peter Zijlstra wrote:
> On Mon, Sep 26, 2016 at 01:42:05PM +0200, Christian Borntraeger wrote:
>> On 09/26/2016 12:56 PM, Peter Zijlstra wrote:
> 
>>> One of the differences in the old and new thing is being addressed by
>>> these patches:
>>>
>>>   
>>> https://lkml.kernel.org/r/1473666472-13749-1-git-send-email-vincent.guit...@linaro.org
>>>
>>> Could you see if those patches make a difference? If not, we'll have to
>>> go poke elsewhere ofcourse ;-)
>>
>> Those patches do not apply cleanly on v4.7, linux/master or next/master.
>> Is there a good branch to test these patches?
> 
> They seemed to apply for me on tip/sched/core, I pushed out a branch for
> you that has them on.
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git 
> sched/propagate
> 
> I didn't boot the result though; but they applied without issue.

They applied ok on next from 9/13. Things go even worse.
With this host configuration:

CPU NODE BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED ADDRESS
0   000  00:0:0:0 yesyes0
1   000  01:1:1:1 yesyes1
2   000  12:2:2:2 yesyes2
3   000  13:3:3:3 yesyes3
4   001  24:4:4:4 yesyes4
5   001  25:5:5:5 yesyes5
6   001  36:6:6:6 yesyes6
7   001  37:7:7:7 yesyes7
8   001  48:8:8:8 yesyes8
9   001  49:9:9:9 yesyes9
10  001  510:10:10:10 yesyes10
11  001  511:11:11:11 yesyes11
12  001  612:12:12:12 yesyes12
13  001  613:13:13:13 yesyes13
14  001  714:14:14:14 yesyes14
15  001  715:15:15:15 yesyes15

the guest was running either on 0-3 or on 4-15, but never
used the full system. With group scheduling disabled everything was good
again. So looks like that this bug has also some dependency on on the
host topology.

Christian

Re: group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)

2016-09-26 Thread Christian Borntraeger

On 09/26/2016 02:10 PM, Peter Zijlstra wrote:
> On Mon, Sep 26, 2016 at 02:01:43PM +0200, Christian Borntraeger wrote:
>> They applied ok on next from 9/13. Things go even worse.
>> With this host configuration:
>>
>> CPU NODE BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED ADDRESS
>> 0   000  00:0:0:0 yesyes0
>> 1   000  01:1:1:1 yesyes1
>> 2   000  12:2:2:2 yesyes2
>> 3   000  13:3:3:3 yesyes3
>> 4   001  24:4:4:4 yesyes4
>> 5   001  25:5:5:5 yesyes5
>> 6   001  36:6:6:6 yesyes6
>> 7   001  37:7:7:7 yesyes7
>> 8   001  48:8:8:8 yesyes8
>> 9   001  49:9:9:9 yesyes9
>> 10  001  510:10:10:10 yesyes10
>> 11  001  511:11:11:11 yesyes11
>> 12  001  612:12:12:12 yesyes12
>> 13  001  613:13:13:13 yesyes13
>> 14  001  714:14:14:14 yesyes14
>> 15  001  715:15:15:15 yesyes15
>>
>> the guest was running either on 0-3 or on 4-15, but never
>> used the full system. With group scheduling disabled everything was good
>> again. So looks like that this bug has also some dependency on on the
>> host topology.
> 
> OK, so CPU affinities that unevenly straddle topology boundaries like
> that are hard (and is generally not recommended), but its not
> immediately obvious why it would be so much worse with cgroups enabled.

Well thats what I get from LPAR...
With CPUs 0-3 disabled things are better, but there is still 10%
difference between group/nogroup.
Will test Vincents v4 soon.

In any case, would a 5 second sequence of /proc/sched_debug for the
good/bad case with all 16 host CPUs  (or the reduced 12 cpu set) be useful?

Christian

Re: group scheduler regression since 4.3 (bisect 9d89c257d sched/fair: Rewrite runnable load and utilization average tracking)

2016-09-26 Thread Christian Borntraeger

On 09/26/2016 02:10 PM, Peter Zijlstra wrote:
> On Mon, Sep 26, 2016 at 02:01:43PM +0200, Christian Borntraeger wrote:
>> They applied ok on next from 9/13. Things go even worse.
>> With this host configuration:
>>
>> CPU NODE BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED ADDRESS
>> 0   000  00:0:0:0 yesyes0
>> 1   000  01:1:1:1 yesyes1
>> 2   000  12:2:2:2 yesyes2
>> 3   000  13:3:3:3 yesyes3
>> 4   001  24:4:4:4 yesyes4
>> 5   001  25:5:5:5 yesyes5
>> 6   001  36:6:6:6 yesyes6
>> 7   001  37:7:7:7 yesyes7
>> 8   001  48:8:8:8 yesyes8
>> 9   001  49:9:9:9 yesyes9
>> 10  001  510:10:10:10 yesyes10
>> 11  001  511:11:11:11 yesyes11
>> 12  001  612:12:12:12 yesyes12
>> 13  001  613:13:13:13 yesyes13
>> 14  001  714:14:14:14 yesyes14
>> 15  001  715:15:15:15 yesyes15
>>
>> the guest was running either on 0-3 or on 4-15, but never
>> used the full system. With group scheduling disabled everything was good
>> again. So looks like that this bug has also some dependency on on the
>> host topology.
> 
> OK, so CPU affinities that unevenly straddle topology boundaries like
> that are hard (and is generally not recommended), but its not

Ok so I created with cpu hotplug a symmetrical CPU topology:
CPU NODE BOOK SOCKET CORE L1d:L1i:L2d:L2i ONLINE CONFIGURED ADDRESS
0   000  00:0:0:0 yesyes0
1   000  01:1:1:1 yesyes1
2   000  12:2:2:2 yesyes2
3   000  13:3:3:3 yesyes3
4   000  24:4:4:4 yesyes4
5   000  25:5:5:5 yesyes5
6   000  36:6:6:6 yesyes6
7   000  37:7:7:7 yesyes7
8   000  -::: no yes8
9   000  -::: no yes9
10  000  -::: no yes10
11  000  -::: no yes11
12  001  48:8:8:8 yesyes12
13  001  49:9:9:9 yesyes13
14  001  510:10:10:10 yesyes14
15  001  511:11:11:11 yesyes15
16  001  612:12:12:12 yesyes16
17  001  613:13:13:13 yesyes17
18  001  714:14:14:14 yesyes18
19  001  715:15:15:15 yesyes19

Same effect: Only half of the CPUs are used, but the number of
guest CPUs == number of host cpus. Turns out that this is totally
unrelated to this patch set, so it must be something else.

lockdep warning in btrfs in 4.8-rc3

2016-09-08 Thread Christian Borntraeger

Chris,

with 4.8-rc3 I get the following on an s390 box:


[ 1094.009172] =
[ 1094.009174] [ INFO: possible recursive locking detected ]
[ 1094.009177] 4.8.0-rc3 #126 Tainted: GW  
[ 1094.009179] -
[ 1094.009180] vim/12891 is trying to acquire lock:
[ 1094.009182]  (&ei->log_mutex){+.+...}, at: [<03ff817e83c6>] 
btrfs_log_inode+0x126/0x1010 [btrfs]
[ 1094.009256] 
   but task is already holding lock:
[ 1094.009258]  (&ei->log_mutex){+.+...}, at: [<03ff817e83c6>] 
btrfs_log_inode+0x126/0x1010 [btrfs]
[ 1094.009276] 
   other info that might help us debug this:
[ 1094.009278]  Possible unsafe locking scenario:

[ 1094.009280]CPU0
[ 1094.009281]
[ 1094.009282]   lock(&ei->log_mutex);
[ 1094.009284]   lock(&ei->log_mutex);
[ 1094.009286] 
*** DEADLOCK ***

[ 1094.009288]  May be due to missing lock nesting notation

[ 1094.009290] 3 locks held by vim/12891:
[ 1094.009291]  #0:  (&sb->s_type->i_mutex_key#15){+.+.+.}, at: 
[<03ff817afbd6>] btrfs_sync_file+0x1de/0x5e8 [btrfs]
[ 1094.009311]  #1:  (sb_internal#2){.+.+..}, at: [<0035e0ba>] 
__sb_start_write+0x122/0x138
[ 1094.009320]  #2:  (&ei->log_mutex){+.+...}, at: [<03ff817e83c6>] 
btrfs_log_inode+0x126/0x1010 [btrfs]
[ 1094.009370] 
   stack backtrace:
[ 1094.009375] CPU: 14 PID: 12891 Comm: vim Tainted: GW   4.8.0-rc3 
#126
[ 1094.009377] Hardware name: IBM  2964 NC9  704
  (LPAR)
[ 1094.009380]00f061367608 00f061367698 0002 
 
  00f061367738 00f0613676b0 00f0613676b0 
001133ec 
    00f7000a 
00f7000a 
  00f0613676f8 00f061367698  
 
  040001d821c8 001133ec 00f061367698 
00f0613676e8 
[ 1094.009396] Call Trace:
[ 1094.009401] ([<00113334>] show_trace+0xec/0xf0)
[ 1094.009403] ([<0011339a>] show_stack+0x62/0xe8)
[ 1094.009406] ([<0055211c>] dump_stack+0x9c/0xe0)
[ 1094.009411] ([<001d9930>] validate_chain.isra.22+0xc00/0xd70)
[ 1094.009413] ([<001dad9c>] __lock_acquire+0x39c/0x7d8)
[ 1094.009414] ([<001db8d0>] lock_acquire+0x108/0x320)
[ 1094.009420] ([<008845c6>] mutex_lock_nested+0x86/0x3f8)
[ 1094.009440] ([<03ff817e83c6>] btrfs_log_inode+0x126/0x1010 [btrfs])
[ 1094.009457] ([<03ff817e8fb2>] btrfs_log_inode+0xd12/0x1010 [btrfs])
[ 1094.009474] ([<03ff817e95b4>] btrfs_log_inode_parent+0x244/0x980 [btrfs])
[ 1094.009490] ([<03ff817eafea>] btrfs_log_dentry_safe+0x7a/0xa0 [btrfs])
[ 1094.009506] ([<03ff817afe1a>] btrfs_sync_file+0x422/0x5e8 [btrfs])
[ 1094.009512] ([<0039e64e>] do_fsync+0x5e/0x90)
[ 1094.009514] ([<0039e9e2>] SyS_fsync+0x32/0x40)
[ 1094.009517] ([<0088a336>] system_call+0xd6/0x270)
[ 1094.009518] INFO: lockdep is turned off.

Re: lockdep warning in btrfs in 4.8-rc3

2016-09-08 Thread Christian Borntraeger

On 09/08/2016 01:48 PM, Christian Borntraeger wrote:
> Chris,
> 
> with 4.8-rc3 I get the following on an s390 box:

Sorry for the noise, just saw the fix in your pull request.

Re: [PATCH 2/4] virtio_blk: Less function calls in init_vq() after error detection

2016-09-13 Thread Christian Borntraeger

On 09/13/2016 02:13 PM, SF Markus Elfring wrote:
> From: Markus Elfring 
> Date: Tue, 13 Sep 2016 13:20:44 +0200
> 
> The kfree() function was called in up to three cases
> by the init_vq() function during error handling even if
> the passed variable contained a null pointer.
> 
> * Split a condition check for memory allocation failures.
> 
> * Adjust jump targets according to the Linux coding style convention.
> 
> Signed-off-by: Markus Elfring 
> ---
>  drivers/block/virtio_blk.c | 22 +-
>  1 file changed, 17 insertions(+), 5 deletions(-)

Can't you see from this diffstat that the patch actually seems to makes
the code more complex?
In addition, please have a look at commit 
347a529398e8e723338cca5d8a8ae2d9e7e93448
virtio_blk: Fix a slient kernel panic

which did the opposite of your patch. And in fact it fixed a bug. Quite 
obviously
multiple labels are harder to read and harder to get right. For error handling 
with
just kfree one label is just the right thing to.

> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index 6553eb7..d28dbcf 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -395,11 +395,21 @@ static int init_vq(struct virtio_blk *vblk)
>   return -ENOMEM;
> 
>   names = kmalloc_array(num_vqs, sizeof(*names), GFP_KERNEL);
> + if (!names) {
> + err = -ENOMEM;
> + goto free_vblk_vqs;
> + }
> +
>   callbacks = kmalloc_array(num_vqs, sizeof(*callbacks), GFP_KERNEL);
> + if (!callbacks) {
> + err = -ENOMEM;
> + goto free_names;
> + }
> +
>   vqs = kmalloc_array(num_vqs, sizeof(*vqs), GFP_KERNEL);
> - if (!names || !callbacks || !vqs) {
> + if (!vqs) {
>   err = -ENOMEM;
> - goto out;
> + goto free_callbacks;
>   }
> 
>   for (i = 0; i < num_vqs; i++) {
> @@ -411,19 +421,21 @@ static int init_vq(struct virtio_blk *vblk)
>   /* Discover virtqueues and write information to configuration.  */
>   err = vdev->config->find_vqs(vdev, num_vqs, vqs, callbacks, names);
>   if (err)
> - goto out;
> + goto free_vqs;
> 
>   for (i = 0; i < num_vqs; i++) {
>   spin_lock_init(&vblk->vqs[i].lock);
>   vblk->vqs[i].vq = vqs[i];
>   }
>   vblk->num_vqs = num_vqs;
> -
> -out:
> + free_vqs:
>   kfree(vqs);
> + free_callbacks:
>   kfree(callbacks);
> + free_names:
>   kfree(names);
>   if (err)
> + free_vblk_vqs:
>   kfree(vblk->vqs);
>   return err;
>  }
>

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1093 matches

Mail list logo