Re: [PATCH v2 0/2] send tlb_remove_table_smp_sync IPI only to necessary CPUs

2023-06-22 Thread Marcelo Tosatti
On Wed, Jun 21, 2023 at 09:43:37AM +0200, Peter Zijlstra wrote:
> On Tue, Jun 20, 2023 at 05:46:16PM +0300, Yair Podemsky wrote:
> > Currently the tlb_remove_table_smp_sync IPI is sent to all CPUs
> > indiscriminately; this causes unnecessary work and delays, notably in
> > real-time use-cases and on isolated cpus. This series limits the IPI
> > to only the cpus referencing the affected mm.
> > A config option to differentiate architectures that maintain mm_cpumask
> > from those that don't allows safe usage of this feature.
> > 
> > Changes from v1:
> > - The previous version included a patch to only send the IPI to CPUs with
> >   context_tracking in kernel space; this was removed due to race
> >   condition concerns.
> > - For archs that do not maintain mm_cpumask, the mask used should be
> >   cpu_online_mask (Peter Zijlstra).
> >  
> 
> Would it not be much better to fix the root cause? As per the last time,
> there's patches that cure the thp abuse of this.

The other case where the IPI can happen is:

CPU-0                              CPU-1

tlb_remove_table                   local_irq_disable
  tlb_remove_table_sync_one        gup_fast
    IPI                            local_irq_enable


So it's not only the THP case.
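
For illustration, a minimal sketch of the approach described in the cover
letter above (the config symbol and helper name are made up here, not taken
from the actual series): limit the synchronizing IPI to CPUs that have the
affected mm loaded, falling back to all online CPUs on architectures without
an accurate mm_cpumask.

/*
 * Sketch only: CONFIG_ARCH_HAS_MM_CPUMASK and the helper name are made
 * up for illustration; the real series may structure this differently.
 */
static void tlb_remove_table_sync_mm(struct mm_struct *mm)
{
	const struct cpumask *mask;

	if (IS_ENABLED(CONFIG_ARCH_HAS_MM_CPUMASK))
		mask = mm_cpumask(mm);		/* CPUs that have run this mm */
	else
		mask = cpu_online_mask;		/* conservative fallback */

	/* wait == true: return only after every targeted CPU ran the handler */
	on_each_cpu_mask(mask, tlb_remove_table_smp_sync, NULL, true);
}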



Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-19 Thread Marcelo Tosatti
On Wed, Apr 19, 2023 at 01:30:57PM +0200, David Hildenbrand wrote:
> On 06.04.23 20:27, Peter Zijlstra wrote:
> > On Thu, Apr 06, 2023 at 05:51:52PM +0200, David Hildenbrand wrote:
> > > On 06.04.23 17:02, Peter Zijlstra wrote:
> > 
> > > > DavidH, what do you think about reviving Jann's patches here:
> > > > 
> > > > https://bugs.chromium.org/p/project-zero/issues/detail?id=2365#c1
> > > > 
> > > > Those are far more invasive, but afaict they seem to do the right thing.
> > > > 
> > > 
> > > I recall seeing those while discussed on secur...@kernel.org. What we
> > > currently have was (IMHO for good reasons) deemed better to fix the issue,
> > > especially when caring about backports and getting it right.
> > 
> > Yes, and I think that was the right call. However, we can now revisit
> > without having the pressure of a known defect and backport
> > considerations.
> > 
> > > The alternative that was discussed in that context IIRC was to simply
> > > allocate a fresh page table, place the fresh page table into the list
> > > instead, and simply free the old page table (then using common machinery).
> > > 
> > > TBH, I'd wish (and recently raised) that we could just stop wasting memory
> > > on page tables for THPs that are maybe never going to get PTE-mapped ...
> > > and eventually just allocate on demand (with some caching?) and handle the
> > > places where we're OOM and cannot PTE-map a THP in some decent way.
> > > 
> > > ... instead of trying to figure out how to deal with these page tables we
> > > cannot free but have to special-case simply because of GUP-fast.
> > 
> > Not keeping them around sounds good to me, but I'm not *that* familiar
> > with the THP code, most of that happened after I stopped tracking mm. So
> > I'm not sure how feasible it is.
> > 
> > But it does look entirely feasible to rework this page-table freeing
> > along the lines Jann did.
> 
> It's most probably more feasible, although the easiest would be to just
> allocate a fresh page table to deposit and free the old one using the mmu
> gatherer.
> 
> This way we can avoid the khugepaged usage of tlb_remove_table_smp_sync(), but
> not the tlb_remove_table_one() usage. I suspect khugepaged isn't really
> relevant in RT kernels (IIRC, most RT setups disable THP completely).

People will disable khugepaged because it causes IPIs (and the fact that one
has to disable khugepaged is a configuration overhead, and a source of
headache when configuring the realtime system, since one can forget to
do that, etc).

But people do want to run non-RT applications along with RT applications
(in case you have a single box in a privileged location, for example).

> 
> tlb_remove_table_one() only triggers if __get_free_page(GFP_NOWAIT |
> __GFP_NOWARN); fails. IIUC, that can happen easily under memory pressure
> because it doesn't wait for direct reclaim.
> 
> I don't know much about RT workloads (so I'd appreciate some feedback), but
> I guess we can run into memory pressure as well due to some !rt housekeeping
> task on the system?

Yes, exactly (memory for -RT app will be mlocked).
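
For context, a simplified sketch of the allocation-failure path being
discussed (paraphrased from mm/mmu_gather.c; not a verbatim copy, and details
vary by kernel version): only when the batch page cannot be allocated does the
code fall back to tlb_remove_table_one(), which is what ends up issuing the
synchronous IPI.

/* Simplified from mm/mmu_gather.c -- not verbatim; details vary by version. */
void tlb_remove_table(struct mmu_gather *tlb, void *table)
{
	struct mmu_table_batch **batch = &tlb->batch;

	if (*batch == NULL) {
		*batch = (struct mmu_table_batch *)
			__get_free_page(GFP_NOWAIT | __GFP_NOWARN);
		if (*batch == NULL) {
			/* No page for batching: fall back to the synchronous path,
			 * which is where tlb_remove_table_sync_one() sends the IPI. */
			tlb_remove_table_one(table);
			return;
		}
		(*batch)->nr = 0;
	}
	(*batch)->tables[(*batch)->nr++] = table;
	if ((*batch)->nr == MAX_TABLE_BATCH)
		tlb_table_flush(tlb);
}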



Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-19 Thread Marcelo Tosatti
On Thu, Apr 06, 2023 at 03:32:06PM +0200, Peter Zijlstra wrote:
> On Thu, Apr 06, 2023 at 09:49:22AM -0300, Marcelo Tosatti wrote:
> 
> > > > 2) Depends on the application and the definition of "occasional".
> > > > 
> > > > For certain types of applications (for example PLC software or
> > > > RAN processing), upon occurrence of an event, it is necessary to
> > > > complete a certain task in a maximum amount of time (deadline).
> > > 
> > > If the application is properly NOHZ_FULL and never does a kernel entry,
> > > it will never get that IPI. If it is a pile of shit and does kernel
> > > entries while it pretends to be NOHZ_FULL it gets to keep the pieces and
> > > no amount of crying will get me to care.
> > 
> > I suppose it's common practice to use certain system calls in latency
> > sensitive applications, for example nanosleep. Some examples:
> > 
> > 1) cyclictest   (nanosleep)
> 
> cyclictest is not a NOHZ_FULL application, if you think it is, you're
> deluded.

In the field (what end-users do in production):

cyclictest runs on NOHZ_FULL cores.
PLC type programs run on NOHZ_FULL cores.

So, according to the physical reality I observe, I am not deluded.

> > 2) PLC programs (nanosleep)
> 
> What's a PLC? Programmable Logic Circuit?

Programmable logic controller.

> > A system call does not necessarily have to take locks, does it?
> 
> This all is unrelated to locks

OK.

> > Or even if the application does system calls, but runs under a VM,
> > then you are requiring it to never VM-exit.
> 
> That seems to be a goal for performance anyway.

Not sure what you mean.

> > This reduces the flexibility of developing such applications.
> 
> Yeah, that's the cards you're dealt, deal with it.

This is not what happens in the field.



Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-06 Thread Marcelo Tosatti
On Wed, Apr 05, 2023 at 09:54:57PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 05, 2023 at 04:43:14PM -0300, Marcelo Tosatti wrote:
> 
> > Two points:
> > 
> > 1) For a virtualized system, the overhead is not only of executing the
> > IPI but:
> > 
> > VM-exit
> > run VM-exit code in host
> > handle IPI
> > run VM-entry code in host
> > VM-entry
> 
> I thought we could do IPIs without VMexit these days? 

Yes, for IPIs to a vCPU (guest context). The case here is an IPI to the
host pCPU (which requires a VM-exit from guest context).

> Also virt... /me walks away.
> 
> > 2) Depends on the application and the definition of "occasional".
> > 
> > For certain types of applications (for example PLC software or
> > RAN processing), upon occurrence of an event, it is necessary to
> > complete a certain task in a maximum amount of time (deadline).
> 
> If the application is properly NOHZ_FULL and never does a kernel entry,
> it will never get that IPI. If it is a pile of shit and does kernel
> entries while it pretends to be NOHZ_FULL it gets to keep the pieces and
> no amount of crying will get me to care.

I suppose it's common practice to use certain system calls in latency
sensitive applications, for example nanosleep. Some examples:

1) cyclictest   (nanosleep)
2) PLC programs (nanosleep)

A system call does not necessarily have to take locks, does it?

Or even if the application does system calls, but runs under a VM,
then you are requiring it to never VM-exit.

This reduces the flexibility of developing such applications.
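
To make the nanosleep point concrete, a minimal cyclic loop of the kind such
applications typically run (illustrative only; the 1 ms period is arbitrary):
in steady state the only system call is clock_nanosleep(), but that is still a
kernel entry.

/* Illustrative cyclic task skeleton; period chosen arbitrarily. */
#include <time.h>

#define NSEC_PER_SEC 1000000000L
#define PERIOD_NS    1000000L

static void cyclic_loop(void)
{
	struct timespec next;

	clock_gettime(CLOCK_MONOTONIC, &next);
	for (;;) {
		next.tv_nsec += PERIOD_NS;
		if (next.tv_nsec >= NSEC_PER_SEC) {
			next.tv_nsec -= NSEC_PER_SEC;
			next.tv_sec++;
		}
		/* do_control_step();  -- application work, placeholder */
		clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
	}
}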





Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-06 Thread Marcelo Tosatti
On Wed, Apr 05, 2023 at 09:52:26PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 05, 2023 at 04:45:32PM -0300, Marcelo Tosatti wrote:
> > On Wed, Apr 05, 2023 at 01:10:07PM +0200, Frederic Weisbecker wrote:
> > > On Wed, Apr 05, 2023 at 12:44:04PM +0200, Frederic Weisbecker wrote:
> > > > On Tue, Apr 04, 2023 at 04:42:24PM +0300, Yair Podemsky wrote:
> > > > > + int state = atomic_read(&ct->state);
> > > > > + /* will return true only for cpus in kernel space */
> > > > > + return state & CT_STATE_MASK == CONTEXT_KERNEL;
> > > > > +}
> > > > 
> > > > Also note that this doesn't strictly prevent userspace from being
> > > > interrupted. You may well observe the CPU in kernel but it may receive
> > > > the IPI later after switching to userspace.
> > > > 
> > > > We could arrange for avoiding that with marking ct->state with a pending
> > > > work bit to flush upon user entry/exit but that's a bit more overhead so
> > > > I first need to know about your expectations here, ie: can you tolerate
> > > > such an occasional interruption or not?
> > > 
> > > Bah, actually what can we do to prevent from that racy IPI? Not much I 
> > > fear...
> > 
> > Use a different mechanism other than an IPI to ensure in-progress
> > get_user_pages_fast() has finished execution.
> > 
> > Isn't this codepath slow-path enough that it can use
> > synchronize_rcu_expedited?
> 
> To actually hit this path you're doing something really dodgy.

Apparently khugepaged is using the same infrastructure:

$ grep tlb_remove_table khugepaged.c 
tlb_remove_table_sync_one();
tlb_remove_table_sync_one();

So just enabling khugepaged will hit that path.



Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-05 Thread Marcelo Tosatti
On Wed, Apr 05, 2023 at 01:10:07PM +0200, Frederic Weisbecker wrote:
> On Wed, Apr 05, 2023 at 12:44:04PM +0200, Frederic Weisbecker wrote:
> > On Tue, Apr 04, 2023 at 04:42:24PM +0300, Yair Podemsky wrote:
> > > + int state = atomic_read(&ct->state);
> > > + /* will return true only for cpus in kernel space */
> > > + return state & CT_STATE_MASK == CONTEXT_KERNEL;
> > > +}
> > 
> > Also note that this doesn't strictly prevent userspace from being
> > interrupted. You may well observe the CPU in kernel but it may receive
> > the IPI later after switching to userspace.
> > 
> > We could arrange for avoiding that with marking ct->state with a pending
> > work bit to flush upon user entry/exit but that's a bit more overhead so
> > I first need to know about your expectations here, ie: can you tolerate
> > such an occasional interruption or not?
> 
> Bah, actually what can we do to prevent from that racy IPI? Not much I fear...

Use a different mechanism other than an IPI to ensure in-progress
get_user_pages_fast() has finished execution.

Isn't this codepath slow-path enough that it can use
synchronize_rcu_expedited?
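
A rough sketch of what that suggestion could look like (purely illustrative,
not a tested patch or the upstream code): fast GUP runs with interrupts
disabled, which RCU treats as a read-side critical section, so a grace-period
wait also waits for any in-flight fast-GUP walker, while nohz_full CPUs that
stay in userspace are already quiescent and are not interrupted.

static void tlb_remove_table_one(void *table)
{
	/*
	 * Illustrative alternative to tlb_remove_table_sync_one()'s IPI
	 * broadcast: any walker still referencing the table has IRQs off
	 * and is therefore inside an RCU read-side section.
	 */
	synchronize_rcu_expedited();
	__tlb_remove_table(table);
}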



Re: [PATCH 3/3] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to CPUs in kernel mode

2023-04-05 Thread Marcelo Tosatti
On Wed, Apr 05, 2023 at 12:43:58PM +0200, Frederic Weisbecker wrote:
> On Tue, Apr 04, 2023 at 04:42:24PM +0300, Yair Podemsky wrote:
> > @@ -191,6 +192,20 @@ static void tlb_remove_table_smp_sync(void *arg)
> > /* Simply deliver the interrupt */
> >  }
> >  
> > +
> > +#ifdef CONFIG_CONTEXT_TRACKING
> > +static bool cpu_in_kernel(int cpu, void *info)
> > +{
> > +   struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
> 
> Like Peter said, an smp_mb() is required here before the read (unless there is
> already one between the page table modification and that ct->state read?).
> 
> So that you have this pairing:
> 
> 
>     WRITE page_table            WRITE ct->state
>     smp_mb()                    smp_mb()   // implied by atomic_fetch_or
>     READ ct->state              READ page_table
> 
> > > +   int state = atomic_read(&ct->state);
> > +   /* will return true only for cpus in kernel space */
> > +   return state & CT_STATE_MASK == CONTEXT_KERNEL;
> > +}
> 
> Also note that this doesn't strictly prevent userspace from being interrupted.
> You may well observe the CPU in kernel but it may receive the IPI later after
> switching to userspace.
> 
> We could arrange for avoiding that with marking ct->state with a pending work
> bit to flush upon user entry/exit but that's a bit more overhead so I first
> need to know about your expectations here, ie: can you tolerate such an
> occasional interruption or not?

Two points:

1) For a virtualized system, the overhead is not only of executing the
IPI but:

VM-exit
run VM-exit code in host
handle IPI
run VM-entry code in host
VM-entry

2) Depends on the application and the definition of "occasional".

For certain types of applications (for example PLC software or
RAN processing), upon occurrence of an event, it is necessary to
complete a certain task in a maximum amount of time (deadline).

One way to express this requirement is with a pair of numbers,
deadline time and execution time, where:

* deadline time: length of time between event and deadline.
* execution time: length of time it takes for processing of event
  to occur on a particular hardware platform
  (uninterrupted).
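
For clarity, a sketch of cpu_in_kernel() with the barrier Frederic asks for in
the quoted review; the placement is illustrative, not the final patch. Note
also the added parentheses: without them, "state & CT_STATE_MASK ==
CONTEXT_KERNEL" parses as "state & (CT_STATE_MASK == CONTEXT_KERNEL)".

static bool cpu_in_kernel(int cpu, void *info)
{
	struct context_tracking *ct = per_cpu_ptr(&context_tracking, cpu);
	int state;

	smp_mb();	/* order the page-table write vs. the ct->state read;
			 * pairs with the barrier implied by atomic_fetch_or()
			 * on the context-tracking side */
	state = atomic_read(&ct->state);

	/* true only for CPUs currently in kernel space */
	return (state & CT_STATE_MASK) == CONTEXT_KERNEL;
}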





Re: [RFC PATCH 0/5] Generic IPI sending tracepoint

2022-10-07 Thread Marcelo Tosatti
Hi Valentin,

On Fri, Oct 07, 2022 at 04:41:40PM +0100, Valentin Schneider wrote:
> Background
> ==
> 
> Detecting IPI *reception* is relatively easy, e.g. using
> trace_irq_handler_{entry,exit} or even just function-trace
> flush_smp_call_function_queue() for SMP calls.  
> 
> Figuring out their *origin* is trickier, as there is no generic tracepoint
> tied to e.g. smp_call_function():
> 
> o AFAIA x86 has no tracepoint tied to sending IPIs, only receiving them
>   (cf. trace_call_function{_single}_entry()).
> o arm/arm64 do have trace_ipi_raise(), which gives us the target cpus but
>   also a mostly useless string (smp_calls will all be "Function call
>   interrupts").
> o Other architectures don't seem to have any IPI-sending related tracepoint.
> 
> I believe one reason those tracepoints used by arm/arm64 ended up as they were
> is because these archs used to handle IPIs differently from regular interrupts
> (the IRQ driver would directly invoke an IPI-handling routine), which meant
> they never showed up in trace_irq_handler_{entry, exit}. The
> trace_ipi_{entry,exit} tracepoints gave a way to trace IPI reception but those
> have become redundant as of:
> 
>   56afcd3dbd19 ("ARM: Allow IPIs to be handled as normal interrupts")
>   d3afc7f12987 ("arm64: Allow IPIs to be handled as normal interrupts")
> 
> which gave IPIs a "proper" handler function used through
> generic_handle_domain_irq(), which makes them show up via
> trace_irq_handler_{entry, exit}.
> 
> Changing stuff up
> =
> 
> Per the above, it would make sense to reshuffle trace_ipi_raise() and move it
> into generic code. This also came up during Daniel's talk on Osnoise at the
> CPU isolation MC of LPC 2022 [1].
> 
> Now, to be useful, such a tracepoint needs to export:
> o targeted CPU(s)
> o calling context
> 
> The only way to get the calling context with trace_ipi_raise() is to trigger a
> stack dump, e.g. $(trace-cmd -e ipi* -T echo 42).
> 
> As for the targeted CPUs, the existing tracepoint does export them, albeit in
> cpumask form, which is quite inconvenient from a tooling perspective. For
> instance, as far as I'm aware, it's not possible to do event filtering on a
> cpumask via trace-cmd.

https://man7.org/linux/man-pages/man1/trace-cmd-set.1.html

   -f filter
   Specify a filter for the previous event. This must come after
   a -e. This will filter what events get recorded based on the
   content of the event. Filtering is passed to the kernel
   directly so what filtering is allowed may depend on what
   version of the kernel you have. Basically, it will let you
   use C notation to check if an event should be processed or
   not.

   ==, >=, <=, >, <, &, |, && and ||

   The above are usually safe to use to compare fields.

This looks overkill to me (consider large number of bits set in mask).

+#define trace_ipi_send_cpumask(callsite, mask) do {\
+   if (static_key_false(&__tracepoint_ipi_send_cpu.key)) { \
+   int cpu;\
+   for_each_cpu(cpu, mask) \
+   trace_ipi_send_cpu(callsite, cpu);  \
+   }   \
+} while (0)


> 
> Because of the above points, this is introducing a new tracepoint.
> 
> Patches
> ===
> 
> This results in having trace events for:
> 
> o smp_call_function*()
> o smp_send_reschedule()
> o irq_work_queue*()
> 
> This is incomplete, just looking at arm64 there's more IPI types that aren't 
> covered:
> 
>   IPI_CPU_STOP,
>   IPI_CPU_CRASH_STOP,
>   IPI_TIMER,
>   IPI_WAKEUP,
> 
> ... But it feels like a good starting point.

Can't you have a single tracepoint (or variant with cpumask) that would
cover such cases as well?

Maybe (as parameters for tracepoint):

* type (reschedule, smp_call_function, timer, wakeup, ...).

* function address: valid for smp_call_function, irq_work_queue
  types.
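
A rough sketch of what such a single tracepoint could look like (the event and
field names here are invented for illustration, not taken from the series):

/* Illustrative only -- not the tracepoint the series adds. */
TRACE_EVENT(ipi_send,

	TP_PROTO(unsigned int type, unsigned int cpu, void *callback),

	TP_ARGS(type, cpu, callback),

	TP_STRUCT__entry(
		__field(unsigned int,	type)		/* reschedule, smp_call_function, timer, wakeup, ... */
		__field(unsigned int,	cpu)
		__field(void *,		callback)	/* valid for smp_call_function / irq_work */
	),

	TP_fast_assign(
		__entry->type		= type;
		__entry->cpu		= cpu;
		__entry->callback	= callback;
	),

	TP_printk("type=%u cpu=%u callback=%pS",
		  __entry->type, __entry->cpu, __entry->callback)
);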

> Another thing worth mentioning is that depending on the callsite, the _RET_IP_
> fed to the tracepoint is not always useful - generic_exec_single() doesn't
> tell you much about the actual callback being sent via IPI, so there might be
> value in exploding the single tracepoint into at least one variant for
> smp_calls.

Not sure I grasp what you mean by "exploding the single tracepoint...",
but yes, knowing the function or irq_work function is very useful.

> 
> Links
> =
> 
> [1]: https://youtu.be/5gT57y4OzBM?t=14234
> 
> Valentin Schneider (5):
>   trace: Add trace_ipi_send_{cpu, cpumask}
>   sched, smp: Trace send_call_function_single_ipi()
>   smp: Add a multi-CPU variant to send_call_function_single_ipi()
>   irq_work: Trace calls to arch_irq_work_raise()
>   treewide: Rename and trace arch-definitions of smp_send_reschedule()
> 
>  arch/alpha/kernel/smp.c  |  2 +-

[PATCH] PPC: 8xx: update MAINTAINERS entry

2014-05-29 Thread Marcelo Tosatti

Not involved in 8xx activities for years, update MAINTAINERS
to reflect it.

Signed-off-by: Marcelo Tosatti mtosa...@redhat.com

diff --git a/MAINTAINERS b/MAINTAINERS
index c47d268..ed7b606 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5349,7 +5349,6 @@ F:arch/powerpc/*/*/*virtex*
 
 LINUX FOR POWERPC EMBEDDED PPC8XX
 M: Vitaly Bordug v...@kernel.crashing.org
-M: Marcelo Tosatti marc...@kvack.org
 W: http://www.penguinppc.org/
 L: linuxppc-dev@lists.ozlabs.org
 S: Maintained
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 19/38] KVM: PPC: e500mc: add load inst fixup

2012-03-08 Thread Marcelo Tosatti
On Wed, Feb 29, 2012 at 01:09:47AM +0100, Alexander Graf wrote:
 There's always a chance we're unable to read a guest instruction. The guest
 could have its TLB mapped execute-, but not readable, something odd happens
 and our TLB gets flushed. So it's a good idea to be prepared for that case
 and have a fallback that allows us to fix things up in that case.
 
 Add fixup code that keeps guest code from potentially crashing our host 
 kernel.
 
 Signed-off-by: Alexander Graf ag...@suse.de
 
 ---
 
 v1 -> v2:
 
   - fix whitespace

Still broken.

   - use explicit preempt counts
 ---
  arch/powerpc/kvm/bookehv_interrupts.S |   30 +-
  1 files changed, 29 insertions(+), 1 deletions(-)
 
 diff --git a/arch/powerpc/kvm/bookehv_interrupts.S 
 b/arch/powerpc/kvm/bookehv_interrupts.S
 index 63023ae..f7dc3f6 100644
 --- a/arch/powerpc/kvm/bookehv_interrupts.S
 +++ b/arch/powerpc/kvm/bookehv_interrupts.S
 @@ -28,6 +28,7 @@
  #include <asm/asm-compat.h>
  #include <asm/asm-offsets.h>
  #include <asm/bitsperlong.h>
 +#include <asm/thread_info.h>
  
  #include "../kernel/head_booke.h" /* for THREAD_NORMSAVE() */
  
 @@ -171,9 +172,36 @@
   PPC_STL r30, VCPU_GPR(r30)(r4)
   PPC_STL r31, VCPU_GPR(r31)(r4)
   mtspr   SPRN_EPLC, r8
 +
 + /* disable preemption, so we are sure we hit the fixup handler */
 +#ifdef CONFIG_PPC64
 + clrrdi  r8,r1,THREAD_SHIFT
 +#else
 + rlwinm  r8,r1,0,0,31-THREAD_SHIFT   /* current thread_info */
 +#endif
 + li  r7, 1
 +stw  r7, TI_PREEMPT(r8)
 +
   isync
 - lwepx   r9, 0, r5
 +
 + /*
 +  * In case the read goes wrong, we catch it and write an invalid value
 +  * in LAST_INST instead.
 +  */
 +1:   lwepx   r9, 0, r5
 +2:
 +.section .fixup, "ax"
 +3:   li  r9, KVM_INST_FETCH_FAILED
 + b   2b
 +.previous
 +.section __ex_table,"a"
 + PPC_LONG_ALIGN
 + PPC_LONG 1b,3b
 +.previous
 +
   mtspr   SPRN_EPLC, r3
 + li  r7, 0
 +stw  r7, TI_PREEMPT(r8)
   stw r9, VCPU_LAST_INST(r4)
   .endif
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [RFC PATCH 11/11] KVM: PPC: Eliminate global spinlock in kvmppc_h_enter

2011-11-23 Thread Marcelo Tosatti
On Thu, Nov 17, 2011 at 10:55:49AM +1100, Paul Mackerras wrote:
 From dfd5bcfac841f8a36593edf60d9fb15e0d633287 Mon Sep 17 00:00:00 2001
 From: Paul Mackerras pau...@samba.org
 Date: Mon, 14 Nov 2011 13:30:38 +1100
 Subject: 
 
 Currently, kvmppc_h_enter takes a spinlock that is global to the guest,
 kvm->mmu_lock, in order to check for pending PTE invalidations safely.
 On some workloads, kvmppc_h_enter is called heavily and the use of a
 global spinlock could compromise scalability.  We already use a per-
 guest page spinlock in the form of the bit spinlock on the rmap chain,
 and this gives us synchronization with the PTE invalidation side, which
 also takes the bit spinlock on the rmap chain for each page being
 invalidated.  Thus it is sufficient to check for pending invalidations
 while the rmap chain bit spinlock is held.  However, now we require
 barriers in mmu_notifier_retry() and in the places where
 mmu_notifier_count and mmu_notifier_seq are updated, since we can now
 call mmu_notifier_retry() concurrently with updates to those fields.
 
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
 Cc'd to k...@vger.kernel.org for review of the generic kvm changes.

Looks good to me.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [RFC PATCH 0/11] KVM: PPC: Update Book3S HV memory handling

2011-11-23 Thread Marcelo Tosatti
On Sat, Nov 19, 2011 at 08:54:24AM +1100, Paul Mackerras wrote:
 On Fri, Nov 18, 2011 at 02:57:11PM +0100, Alexander Graf wrote:
  
  This touches areas that I'm sure non-PPC people would want to see as
  well. Could you please CC kvm@vger too next time?
  
  Avi, Marcelo, mind to review some of the bits in this patch set? :)
 
 I did cc the last patch (the one that adds barriers in the MMU
 notifier sequence/count logic) to kvm@vger.  Do you mean I should cc
 the whole series?  The only other thing touching generic code is the
 addition of the KVM_MEMSLOT_IO flag in the first patch.

I don't see such flag anywhere in the patches in this thread?

 I'm hoping the extra barriers will be OK since they are no-ops on
 x86.  In fact I now think that the smp_wmbs I added to
 kvm_mmu_notifier_invalidate_page and kvm_mmu_notifier_change_pte
 aren't in fact necessary, since it's only necessary to ensure that the
 sequence number increase is visible before the point where
 kvm_unmap_hva or kvm_set_spte_hva unlock the bitlock on the first rmap
 chain they look at, which will be ensured anyway by the barrier before
 the unlock.
 
 Paul.
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 08/13] kvm/powerpc: Move guest enter/exit down into subarch-specific code

2011-05-17 Thread Marcelo Tosatti
On Wed, May 11, 2011 at 08:43:31PM +1000, Paul Mackerras wrote:
 From 964ee93b2d728e4fb16ae66eaceb6e912bf114ad Mon Sep 17 00:00:00 2001
 From: Paul Mackerras pau...@samba.org
 Date: Tue, 10 May 2011 22:23:18 +1000
 Subject: [PATCH 08/13] kvm/powerpc: Move guest enter/exit down into
  subarch-specific code
 
 Instead of doing the kvm_guest_enter/exit() and local_irq_dis/enable()
 calls in powerpc.c, this moves them down into the subarch-specific
 book3s_pr.c and booke.c.  This eliminates an extra local_irq_enable()
 call in book3s_pr.c, and will be needed for when we do SMT4 guest
 support in the book3s hypervisor mode code.
 
 Signed-off-by: Paul Mackerras pau...@samba.org
 ---
  arch/powerpc/include/asm/kvm_ppc.h   |1 +
  arch/powerpc/kvm/book3s_interrupts.S |2 +-
  arch/powerpc/kvm/book3s_pr.c |   12 ++--
  arch/powerpc/kvm/booke.c |   13 +
  arch/powerpc/kvm/powerpc.c   |6 +-
  5 files changed, 22 insertions(+), 12 deletions(-)
 
 diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
 b/arch/powerpc/include/asm/kvm_ppc.h
 index f3c218a..3210911 100644
 --- a/arch/powerpc/include/asm/kvm_ppc.h
 +++ b/arch/powerpc/include/asm/kvm_ppc.h
 @@ -42,6 +42,7 @@ enum emulation_result {
   EMULATE_AGAIN,/* something went wrong. go again */
  };
  
 +extern int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
  extern int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
  extern char kvmppc_handlers_start[];
  extern unsigned long kvmppc_handler_len;
 diff --git a/arch/powerpc/kvm/book3s_interrupts.S 
 b/arch/powerpc/kvm/book3s_interrupts.S
 index 2f0bc92..8c5e0e1 100644
 --- a/arch/powerpc/kvm/book3s_interrupts.S
 +++ b/arch/powerpc/kvm/book3s_interrupts.S
 @@ -85,7 +85,7 @@
   *  r3: kvm_run pointer
   *  r4: vcpu pointer
   */
 -_GLOBAL(__kvmppc_vcpu_entry)
 +_GLOBAL(__kvmppc_vcpu_run)
  
  kvm_start_entry:
   /* Write correct stack frame */
 diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
 index 08cedf0..f769915 100644
 --- a/arch/powerpc/kvm/book3s_pr.c
 +++ b/arch/powerpc/kvm/book3s_pr.c
 @@ -891,8 +891,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
   vfree(vcpu_book3s);
  }
  
 -extern int __kvmppc_vcpu_entry(struct kvm_run *kvm_run, struct kvm_vcpu 
 *vcpu);
 -int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 +int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
  {
   int ret;
   double fpr[32][TS_FPRWIDTH];
 @@ -944,14 +943,15 @@ int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct 
 kvm_vcpu *vcpu)
   /* Remember the MSR with disabled extensions */
   ext_msr = current->thread.regs->msr;
  
 - /* XXX we get called with irq disabled - change that! */
 - local_irq_enable();
 -
   /* Preload FPU if it's enabled */
   if (vcpu->arch.shared->msr & MSR_FP)
   kvmppc_handle_ext(vcpu, BOOK3S_INTERRUPT_FP_UNAVAIL, MSR_FP);
  
 - ret = __kvmppc_vcpu_entry(kvm_run, vcpu);
 + kvm_guest_enter();


kvm_guest_enter should run with interrupts disabled.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 08/13] kvm/powerpc: Move guest enter/exit down into subarch-specific code

2011-05-17 Thread Marcelo Tosatti
On Tue, May 17, 2011 at 03:05:12PM -0300, Marcelo Tosatti wrote:
   
  -   ret = __kvmppc_vcpu_entry(kvm_run, vcpu);
  +   kvm_guest_enter();
 
 
 kvm_guest_enter should run with interrupts disabled.

Its fine, please ignore message.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] KVM: PPC: Fix SPRG get/set for Book3S and BookE

2011-01-06 Thread Marcelo Tosatti
On Wed, Dec 29, 2010 at 01:51:25PM -0600, Peter Tyser wrote:
 Previously SPRGs 4-7 were improperly read and written in
 kvm_arch_vcpu_ioctl_get_regs() and kvm_arch_vcpu_ioctl_set_regs();
 
 Signed-off-by: Peter Tyser pty...@xes-inc.com
 ---
 I noticed this while grepping for something unrelated and assume it's
 a typo.  Feel free to add to the patch description; I don't use KVM
 so don't know what the high-level consequences of this change are.
 
  arch/powerpc/kvm/book3s.c |   14 --
  arch/powerpc/kvm/booke.c  |   14 --
  2 files changed, 16 insertions(+), 12 deletions(-)

Applied, thanks.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] powerpc: kvm: fix information leak to userland

2010-11-01 Thread Marcelo Tosatti
On Sat, Oct 30, 2010 at 01:04:24PM +0400, Vasiliy Kulikov wrote:
 Structure kvm_ppc_pvinfo is copied to userland with flags and
 pad fields uninitialized.  It leads to leaking of contents of
 kernel stack memory.
 
 Signed-off-by: Vasiliy Kulikov sego...@gmail.com
 ---
  I cannot compile this driver, so it is not tested at all.
 
  arch/powerpc/kvm/powerpc.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

Applied, thanks.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH 0/2] Faster MMU lookups for Book3s v3

2010-07-01 Thread Marcelo Tosatti
On Wed, Jun 30, 2010 at 03:18:44PM +0200, Alexander Graf wrote:
 Book3s suffered from my really bad shadow MMU implementation so far. So
 I finally got around to implement a combined hash and list mechanism that
 allows for much faster lookup of mapped pages.
 
 To show that it really is faster, I tried to run simple process spawning
 code inside the guest with and without these patches:
 
 [without]
 
 debian-powerpc:~# time for i in {1..1000}; do /bin/echo hello > /dev/null; done
 
 real0m20.235s
 user0m10.418s
 sys 0m9.766s
 
 [with]
 
 debian-powerpc:~# time for i in {1..1000}; do /bin/echo hello > /dev/null; done
 
 real0m14.659s
 user0m8.967s
 sys 0m5.688s
 
 So as you can see, performance improved significantly.
 
 v2 -> v3:
 
   - use hlist
   - use global slab cache
 
 Alexander Graf (2):
   KVM: PPC: Add generic hpte management functions
   KVM: PPC: Make use of hash based Shadow MMU
 
  arch/powerpc/include/asm/kvm_book3s.h |9 +
  arch/powerpc/include/asm/kvm_host.h   |   17 ++-
  arch/powerpc/kvm/Makefile |2 +
  arch/powerpc/kvm/book3s.c |   14 ++-
  arch/powerpc/kvm/book3s_32_mmu_host.c |  104 ++---
  arch/powerpc/kvm/book3s_64_mmu_host.c |   98 +---
  arch/powerpc/kvm/book3s_mmu_hpte.c|  277 
 +
  7 files changed, 331 insertions(+), 190 deletions(-)
  create mode 100644 arch/powerpc/kvm/book3s_mmu_hpte.c

Applied, thanks.
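
For readers unfamiliar with the idea in the cover letter above, a toy sketch
of a "combined hash and list" shadow-PTE cache (types and names are invented
here; the real implementation lives in arch/powerpc/kvm/book3s_mmu_hpte.c):

/* Toy illustration only; not the actual KVM PPC code. */
#include <linux/hashtable.h>
#include <linux/list.h>

struct shadow_hpte {
	unsigned long		eaddr;		/* guest effective address */
	struct hlist_node	hash_link;	/* fast lookup by address */
	struct list_head	all_link;	/* full walk for flush-all */
};

struct shadow_mmu {
	DECLARE_HASHTABLE(by_eaddr, 6);
	struct list_head	all;
};

/* Hash lookup instead of walking every shadow PTE on invalidation. */
static struct shadow_hpte *shadow_hpte_find(struct shadow_mmu *mmu,
					    unsigned long eaddr)
{
	struct shadow_hpte *pte;

	hash_for_each_possible(mmu->by_eaddr, pte, hash_link, eaddr >> PAGE_SHIFT)
		if (pte->eaddr == eaddr)
			return pte;
	return NULL;
}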

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [PATCH] KVM: PPC: elide struct thread_struct instances from stack

2010-06-02 Thread Marcelo Tosatti
On Mon, May 31, 2010 at 09:59:13PM +0200, Andreas Schwab wrote:
 Instead of instantiating a whole thread_struct on the stack use only the
 required parts of it.
 
 Signed-off-by: Andreas Schwab sch...@linux-m68k.org
 Tested-by: Alexander Graf ag...@suse.de
 ---
  arch/powerpc/include/asm/kvm_fpu.h   |   27 +
  arch/powerpc/kernel/ppc_ksyms.c  |4 -
  arch/powerpc/kvm/book3s.c|   49 +---
  arch/powerpc/kvm/book3s_paired_singles.c |   94 
 --
  arch/powerpc/kvm/fpu.S   |   18 ++
  5 files changed, 97 insertions(+), 95 deletions(-)

Applied, thanks.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [RFC][PATCH 11/12] KVM: introduce new API for getting/switching dirty bitmaps

2010-05-11 Thread Marcelo Tosatti
On Tue, May 11, 2010 at 02:53:54PM +0900, Takuya Yoshikawa wrote:
 (2010/05/11 12:43), Marcelo Tosatti wrote:
 On Tue, May 04, 2010 at 10:08:21PM +0900, Takuya Yoshikawa wrote:
 +How to Get
 +
 +Before calling this, you have to set the slot member of kvm_user_dirty_log
 +to indicate the target memory slot.
 +
 +struct kvm_user_dirty_log {
 +   __u32 slot;
 +   __u32 flags;
 +   __u64 dirty_bitmap;
 +   __u64 dirty_bitmap_old;
 +};
 +
 +The addresses will be set in the paired members: dirty_bitmap and _old.
 
 Why not pass the bitmap address to the kernel, instead of having the
 kernel allocate it. Because apparently you plan to do that in a new
 generation anyway?
 
 Yes, we want to make qemu allocate and free bitmaps in the future.
 But currently, those are strictly tied with memory slot registration and
 changing it in one patch set is really difficult.
 
 Anyway, we need kernel side allocation mechanism to keep the current
 GET_DIRTY_LOG api. I don't mind not exporting kernel allocated bitmaps
 in this patch set and later introducing a bitmap registration mechanism
 in another patch set.
 
 As this RFC is suggesting, kernel side double buffering infrastructure is
 designed as general as possible and adding a new API like SWITCH can be done
 naturally.
 
 
 Also, why does the kernel need to know about different bitmaps?
 
 Because we need to support current GET API. We don't have any way to get
 a new bitmap in the case of GET and we don't want to do_mmap() every time
 for GET.
 
 
 One alternative would be:
 
 KVM_SWITCH_DIRTY_LOG passing the address of a bitmap. If the active
 bitmap was clean, it returns 0, no switch performed. If the active
 bitmap was dirty, the kernel switches to the new bitmap and returns 1.
 
 And the responsibility of cleaning the new bitmap could also be left
 for userspace.
 
 
 That is a beautiful approach but we can do that only when we give up using
 GET api.
 
 
 I follow you and Avi's advice about that kind of maintenance policy!
 What do you think?

If you introduce a switch ioctl that frees the bitmap vmalloc'ed by the
current set_memory_region (if it's not freed already), after pointing the
memslot to the user supplied one, it should be fine?

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev


Re: [RFC][PATCH RFC 10/12] KVM: move dirty bitmaps to user space

2010-05-10 Thread Marcelo Tosatti
On Tue, May 04, 2010 at 10:07:02PM +0900, Takuya Yoshikawa wrote:
 We move dirty bitmaps to user space.
 
  - Allocation and destruction: we use do_mmap() and do_munmap().
The new bitmap space is twice longer than the original one and we
use the additional space for double buffering: this makes it
possible to update the active bitmap while letting the user space
read the other one safely. For x86, we can also remove the vmalloc()
in kvm_vm_ioctl_get_dirty_log().
 
  - Bitmap manipulations: we replace all functions which access dirty
bitmaps with *_user() functions.
 
  - For ia64: moving the dirty bitmaps of memory slots does not affect
ia64 much because it's using a different place to store dirty logs
rather than the dirty bitmaps of memory slots: all we have to change
are sync and get of dirty log, so we don't need set_bit_user like
functions for ia64.
 
 Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 Signed-off-by: Fernando Luis Vazquez Cao ferna...@oss.ntt.co.jp
 CC: Avi Kivity a...@redhat.com
 CC: Alexander Graf ag...@suse.de
 ---
  arch/ia64/kvm/kvm-ia64.c  |   15 +-
  arch/powerpc/kvm/book3s.c |5 +++-
  arch/x86/kvm/x86.c|   25 --
  include/linux/kvm_host.h  |3 +-
  virt/kvm/kvm_main.c   |   62 +---
  5 files changed, 82 insertions(+), 28 deletions(-)
 
 diff --git a/arch/ia64/kvm/kvm-ia64.c b/arch/ia64/kvm/kvm-ia64.c
 index 17fd65c..03503e6 100644
 --- a/arch/ia64/kvm/kvm-ia64.c
 +++ b/arch/ia64/kvm/kvm-ia64.c
 @@ -1823,11 +1823,19 @@ static int kvm_ia64_sync_dirty_log(struct kvm *kvm,
   n = kvm_dirty_bitmap_bytes(memslot);
   base = memslot->base_gfn / BITS_PER_LONG;
  
 + r = -EFAULT;
 + if (!access_ok(VERIFY_WRITE, memslot->dirty_bitmap, n))
 + goto out;
 +
   for (i = 0; i < n/sizeof(long); ++i) {
   if (dirty_bitmap[base + i])
   memslot->is_dirty = true;
  
 - memslot->dirty_bitmap[i] = dirty_bitmap[base + i];
 + if (__put_user(dirty_bitmap[base + i],
 +&memslot->dirty_bitmap[i])) {
 + r = -EFAULT;
 + goto out;
 + }
   dirty_bitmap[base + i] = 0;
   }
   r = 0;
 @@ -1858,7 +1866,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
   if (memslot->is_dirty) {
   kvm_flush_remote_tlbs(kvm);
   n = kvm_dirty_bitmap_bytes(memslot);
 - memset(memslot->dirty_bitmap, 0, n);
 + if (clear_user(memslot->dirty_bitmap, n)) {
 + r = -EFAULT;
 + goto out;
 + }
   memslot->is_dirty = false;
   }
   r = 0;
 diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
 index 4b074f1..2a31d2f 100644
 --- a/arch/powerpc/kvm/book3s.c
 +++ b/arch/powerpc/kvm/book3s.c
 @@ -1210,7 +1210,10 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
   kvmppc_mmu_pte_pflush(vcpu, ga, ga_end);
  
   n = kvm_dirty_bitmap_bytes(memslot);
 - memset(memslot->dirty_bitmap, 0, n);
 + if (clear_user(memslot->dirty_bitmap, n)) {
 + r = -EFAULT;
 + goto out;
 + }
   memslot->is_dirty = false;
   }
  
 diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
 index 023c7f8..32a3d94 100644
 --- a/arch/x86/kvm/x86.c
 +++ b/arch/x86/kvm/x86.c
 @@ -2760,40 +2760,37 @@ int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm,
   /* If nothing is dirty, don't bother messing with page tables. */
   if (memslot->is_dirty) {
   struct kvm_memslots *slots, *old_slots;
 - unsigned long *dirty_bitmap;
 + unsigned long __user *dirty_bitmap;
 + unsigned long __user *dirty_bitmap_old;
  
   spin_lock(&kvm->mmu_lock);
   kvm_mmu_slot_remove_write_access(kvm, log->slot);
   spin_unlock(&kvm->mmu_lock);
  
 - r = -ENOMEM;
 - dirty_bitmap = vmalloc(n);
 - if (!dirty_bitmap)
 + dirty_bitmap = memslot->dirty_bitmap;
 + dirty_bitmap_old = memslot->dirty_bitmap_old;
 + r = -EFAULT;
 + if (clear_user(dirty_bitmap_old, n))
   goto out;
 - memset(dirty_bitmap, 0, n);
  
   r = -ENOMEM;
   slots = kzalloc(sizeof(struct kvm_memslots), GFP_KERNEL);
 - if (!slots) {
 - vfree(dirty_bitmap);
 + if (!slots)
   goto out;
 - }
 +
   memcpy(slots, kvm->memslots, sizeof(struct kvm_memslots));
 - slots->memslots[log->slot].dirty_bitmap = dirty_bitmap;
 + slots->memslots[log->slot].dirty_bitmap = dirty_bitmap_old;
 + slots->memslots[log->slot].dirty_bitmap_old = dirty_bitmap;
   

Re: [RFC][PATCH 11/12] KVM: introduce new API for getting/switching dirty bitmaps

2010-05-10 Thread Marcelo Tosatti
On Tue, May 04, 2010 at 10:08:21PM +0900, Takuya Yoshikawa wrote:
 Now that dirty bitmaps are accessible from user space, we export the
 addresses of these to achieve zero-copy dirty log check:
 
   KVM_GET_USER_DIRTY_LOG_ADDR
 
 We also need an API for triggering dirty bitmap switch to take the full
 advantage of double buffered bitmaps.
 
   KVM_SWITCH_DIRTY_LOG
 
 See the documentation in this patch for precise explanations.
 
 About performance improvement: the most important feature of switch API
 is the lightness. In our test, this appeared in the form of improved
 responses for GUI manipulations.
 
 Signed-off-by: Takuya Yoshikawa yoshikawa.tak...@oss.ntt.co.jp
 Signed-off-by: Fernando Luis Vazquez Cao ferna...@oss.ntt.co.jp
 CC: Avi Kivity a...@redhat.com
 CC: Alexander Graf ag...@suse.de
 ---
  Documentation/kvm/api.txt |   87 
 +
  arch/ia64/kvm/kvm-ia64.c  |   27 +-
  arch/powerpc/kvm/book3s.c |   18 +++--
  arch/x86/kvm/x86.c|   44 ---
  include/linux/kvm.h   |   11 ++
  include/linux/kvm_host.h  |4 ++-
  virt/kvm/kvm_main.c   |   63 +
  7 files changed, 220 insertions(+), 34 deletions(-)
 
 diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
 index a237518..c106c83 100644
 --- a/Documentation/kvm/api.txt
 +++ b/Documentation/kvm/api.txt
 @@ -892,6 +892,93 @@ arguments.
  This ioctl is only useful after KVM_CREATE_IRQCHIP.  Without an in-kernel
  irqchip, the multiprocessing state must be maintained by userspace.
  
 +4.39 KVM_GET_USER_DIRTY_LOG_ADDR
 +
 +Capability: KVM_CAP_USER_DIRTY_LOG (=1 see below)
 +Architectures: all
 +Type: vm ioctl
 +Parameters: struct kvm_user_dirty_log (in/out)
 +Returns: 0 on success, -1 on error
 +
 +This ioctl makes it possible to use KVM_SWITCH_DIRTY_LOG (see 4.40) instead
 +of the old dirty log manipulation by KVM_GET_DIRTY_LOG.
 +
 +A bit about KVM_CAP_USER_DIRTY_LOG
 +
 +Before this ioctl was introduced, dirty bitmaps for dirty page logging were
 +allocated in the kernel's memory space.  But we have now moved them to user
 +space to get more flexibility and performance.  To achieve this move without
 +breaking the compatibility, we will split KVM_CAP_USER_DIRTY_LOG capability
 +into a few generations which can be identified by its check extension
 +return values.
 +
 +This KVM_GET_USER_DIRTY_LOG_ADDR belongs to the first generation with the
 +KVM_SWITCH_DIRTY_LOG (4.40) and must be supported by all generations.
 +
 +What you get
 +
 +By using this, you can get paired bitmap addresses which are accessible from
 +user space.  See the explanation in 4.40 for the roles of these two bitmaps.
 +
 +How to Get
 +
 +Before calling this, you have to set the slot member of kvm_user_dirty_log
 +to indicate the target memory slot.
 +
 +struct kvm_user_dirty_log {
 + __u32 slot;
 + __u32 flags;
 + __u64 dirty_bitmap;
 + __u64 dirty_bitmap_old;
 +};
 +
 +The addresses will be set in the paired members: dirty_bitmap and _old.

Why not pass the bitmap address to the kernel, instead of having the
kernel allocate it. Because apparently you plan to do that in a new
generation anyway?

Also, why does the kernel need to know about different bitmaps?

One alternative would be:

KVM_SWITCH_DIRTY_LOG passing the address of a bitmap. If the active
bitmap was clean, it returns 0, no switch performed. If the active
bitmap was dirty, the kernel switches to the new bitmap and returns 1.

And the responsibility of cleaning the new bitmap could also be left
for userspace.
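
A minimal sketch of the proposed switch contract (function shape and locking
omitted; the is_dirty/dirty_bitmap/dirty_bitmap_old fields follow the patches
quoted above, everything else is illustrative):

/* Illustrative only -- models the proposed return-value contract. */
static int kvm_switch_dirty_log(struct kvm_memory_slot *memslot,
				unsigned long __user *new_bitmap)
{
	if (!memslot->is_dirty)
		return 0;	/* active bitmap clean: no switch performed */

	/* Hand the currently active (dirty) bitmap to userspace and start
	 * logging into the freshly supplied one; cleaning the new bitmap
	 * is left to userspace. */
	memslot->dirty_bitmap_old = memslot->dirty_bitmap;
	memslot->dirty_bitmap     = new_bitmap;
	memslot->is_dirty         = false;

	return 1;	/* switched; the old bitmap can now be read safely */
}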

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev