What's kvmclock's custom sched_clock for?
AFAICT KVM reliably passes a monotonic TSC through to guests, even if
the host suspends.  That's all that sched_clock needs, I think.  So
why does kvmclock have a custom sched_clock?

On a related note, KVM doesn't pass the "invariant TSC" feature
through to guests on my machine even though "invtsc" is set in QEMU
and the host kernel code appears to support it.  What gives?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
[PATCH] x86/vdso/pvclock: Protect STABLE check with the seqcount
If the clock becomes unstable while we're reading it, we need to bail.
We can do this by simply moving the check into the seqcount loop.

Reported-by: Marcelo Tosatti
Signed-off-by: Andy Lutomirski
---

Marcelo, how's this?

 arch/x86/entry/vdso/vclock_gettime.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c
index 8602f06c759f..1a50e09c945b 100644
--- a/arch/x86/entry/vdso/vclock_gettime.c
+++ b/arch/x86/entry/vdso/vclock_gettime.c
@@ -126,23 +126,23 @@ static notrace cycle_t vread_pvclock(int *mode)
	 *
	 * On Xen, we don't appear to have that guarantee, but Xen still
	 * supplies a valid seqlock using the version field.
-
+	 *
	 * We only do pvclock vdso timing at all if
	 * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to
	 * mean that all vCPUs have matching pvti and that the TSC is
	 * synced, so we can just look at vCPU 0's pvti.
	 */
-	if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
-		*mode = VCLOCK_NONE;
-		return 0;
-	}
-
 	do {
 		version = pvti->version;
 		smp_rmb();
+		if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) {
+			*mode = VCLOCK_NONE;
+			return 0;
+		}
+
 		tsc = rdtsc_ordered();
 		pvti_tsc_to_system_mul = pvti->tsc_to_system_mul;
 		pvti_tsc_shift = pvti->tsc_shift;
-- 
2.4.3
Re: [PATCH v2 1/4] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Mon, Jan 4, 2016 at 12:26 PM, Marcelo Tosatti wrote: > On Sun, Dec 20, 2015 at 03:05:41AM -0800, Andy Lutomirski wrote: >> From: Andy Lutomirski >> >> The pvclock vdso code was too abstracted to understand easily and >> excessively paranoid. Simplify it for a huge speedup. >> >> This opens the door for additional simplifications, as the vdso no >> longer accesses the pvti for any vcpu other than vcpu 0. >> >> Before, vclock_gettime using kvm-clock took about 45ns on my machine. >> With this change, it takes 29ns, which is almost as fast as the pure TSC >> implementation. >> >> Reviewed-by: Paolo Bonzini >> Signed-off-by: Andy Lutomirski >> --- >> arch/x86/entry/vdso/vclock_gettime.c | 81 >> >> 1 file changed, 46 insertions(+), 35 deletions(-) >> >> diff --git a/arch/x86/entry/vdso/vclock_gettime.c >> b/arch/x86/entry/vdso/vclock_gettime.c >> index ca94fa649251..c325ba1bdddf 100644 >> --- a/arch/x86/entry/vdso/vclock_gettime.c >> +++ b/arch/x86/entry/vdso/vclock_gettime.c >> @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info >> *get_pvti(int cpu) >> >> static notrace cycle_t vread_pvclock(int *mode) >> { >> - const struct pvclock_vsyscall_time_info *pvti; >> + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti; >> cycle_t ret; >> - u64 last; >> - u32 version; >> - u8 flags; >> - unsigned cpu, cpu1; >> - >> + u64 tsc, pvti_tsc; >> + u64 last, delta, pvti_system_time; >> + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift; >> >> /* >> - * Note: hypervisor must guarantee that: >> - * 1. cpu ID number maps 1:1 to per-CPU pvclock time info. >> - * 2. that per-CPU pvclock time info is updated if the >> - *underlying CPU changes. >> - * 3. that version is increased whenever underlying CPU >> - *changes. >> + * Note: The kernel and hypervisor must guarantee that cpu ID >> + * number maps 1:1 to per-CPU pvclock time info. 
>> + * >> + * Because the hypervisor is entirely unaware of guest userspace >> + * preemption, it cannot guarantee that per-CPU pvclock time >> + * info is updated if the underlying CPU changes or that that >> + * version is increased whenever underlying CPU changes. >>* >> + * On KVM, we are guaranteed that pvti updates for any vCPU are >> + * atomic as seen by *all* vCPUs. This is an even stronger >> + * guarantee than we get with a normal seqlock. >> + * >> + * On Xen, we don't appear to have that guarantee, but Xen still >> + * supplies a valid seqlock using the version field. >> + >> + * We only do pvclock vdso timing at all if >> + * PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to >> + * mean that all vCPUs have matching pvti and that the TSC is >> + * synced, so we can just look at vCPU 0's pvti. >>*/ >> - do { >> - cpu = __getcpu() & VGETCPU_CPU_MASK; >> - /* TODO: We can put vcpu id into higher bits of pvti.version. >> - * This will save a couple of cycles by getting rid of >> - * __getcpu() calls (Gleb). >> - */ >> - >> - pvti = get_pvti(cpu); >> - >> - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags); >> - >> - /* >> - * Test we're still on the cpu as well as the version. >> - * We could have been migrated just after the first >> - * vgetcpu but before fetching the version, so we >> - * wouldn't notice a version change. >> - */ >> - cpu1 = __getcpu() & VGETCPU_CPU_MASK; >> - } while (unlikely(cpu != cpu1 || >> - (pvti->pvti.version & 1) || >> - pvti->pvti.version != version)); >> - >> - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) >> + >> + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) { >> *mode = VCLOCK_NONE; >> + return 0; >> + } >> + >> + do { >> + version = pvti->version; >> + >> + /* This is also a read barrier, so we'll read version first. */ >> + tsc = rdtsc_ordered(); >> + >>
Re: [PATCH RFC 1/3] x86/pvclock: add setter for pvclock_pvti_cpu0_va
On Tue, Dec 29, 2015 at 4:50 AM, Joao Martins wrote:
> On 12/28/2015 11:45 PM, Andy Lutomirski wrote:
>> On Mon, Dec 28, 2015 at 1:52 PM, Joao Martins
>> wrote:
>>> Right now there is only a pvclock_pvti_cpu0_va() which is defined on
>>> kvmclock since:
>>>
>>> commit dac16fba6fc5
>>> ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
>>>
>>> The only user of this interface so far is kvm. This commit adds a setter
>>> function for the pvti page and moves pvclock_pvti_cpu0_va to pvclock, which
>>> is a more generic place to have it; and would allow other PV clocksources
>>> to use it, such as Xen.
>>>
>>
>>> +
>>> +void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
>>> +{
>>> +       pvti_cpu0_va = pvti;
>>> +}
>>
>> IMO this either wants to be __init or wants a
>> WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)).  The latter hasn't landed in
>> -tip yet, but I think it'll land next week unless the merge window
>> opens early.
> OK, I will add those two once it lands in -tip.
>
> I had a silly mistake in this patch as I blindly omitted the parameter name to
> keep checkpatch happy, but didn't compile-check when built without PARAVIRT.
> Apologies for that and will fix that also on the next version.
>
>>
>> It may pay to actually separate out the kvm-clock clocksource and
>> rename it rather than partially duplicating it, assuming the result
>> wouldn't be messy.
>>
> Not sure if I follow, but I moved out pvclock_pvti_cpu0_va from kvm-clock. Or do
> you mean to separate out kvm-clock in its entirety, or something else within
> kvm-clock that is common to both (such as kvm_setup_vsyscall_timeinfo)?

I meant literally using the same clocksource.  I don't know whether
the Xen and KVM variants are similar enough for that to make sense.

>
>> Can you CC me on the rest of the series for new versions?
>>
> Sure! Thanks for the prompt reply.
>
>> BTW, since this seems to require hypervisor changes to be useful, it
>> might make sense to rethink the interface a bit.  Are you actually
>> planning to support per-cpu pvti for this in any useful way?  If not,
>> I think that this would work a whole lot better and be considerably
>> less code if you had a single global pvti that lived in
>> hypervisor-allocated memory instead of an array that lives in guest
>> memory.  I'd be happy to discuss next week in more detail (currently
>> on vacation).
> Initially I had this series using per-cpu pvti's based on Linux 4.4 but since
> that was removed in favor of vdso using solely cpu0 pvti, then I ended up just
> registering the cpu 0 page. I don't intend to add per-cpu pvti's since it would
> only be used for this case: (unless the reviewers think it should be done)
> meaning I would register pvti's for the other CPUs without having them used.
> Having a global pvti as you suggest it would get a lot simpler for the guest,
> but I guess this would only work assuming PVCLOCK_TSC_STABLE_BIT is there?
> Looking forward to discuss it next week.

Sounds good.

--Andy
Re: [PATCH RFC 1/3] x86/pvclock: add setter for pvclock_pvti_cpu0_va
On Mon, Dec 28, 2015 at 1:52 PM, Joao Martins wrote:
> Right now there is only a pvclock_pvti_cpu0_va() which is defined on
> kvmclock since:
>
> commit dac16fba6fc5
> ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap")
>
> The only user of this interface so far is kvm. This commit adds a setter
> function for the pvti page and moves pvclock_pvti_cpu0_va to pvclock, which
> is a more generic place to have it; and would allow other PV clocksources
> to use it, such as Xen.
>
> +
> +void pvclock_set_pvti_cpu0_va(struct pvclock_vsyscall_time_info *pvti)
> +{
> +       pvti_cpu0_va = pvti;
> +}

IMO this either wants to be __init or wants a
WARN_ON(vclock_was_used(VCLOCK_PVCLOCK)).  The latter hasn't landed in
-tip yet, but I think it'll land next week unless the merge window
opens early.

It may pay to actually separate out the kvm-clock clocksource and
rename it rather than partially duplicating it, assuming the result
wouldn't be messy.

Can you CC me on the rest of the series for new versions?

BTW, since this seems to require hypervisor changes to be useful, it
might make sense to rethink the interface a bit.  Are you actually
planning to support per-cpu pvti for this in any useful way?  If not,
I think that this would work a whole lot better and be considerably
less code if you had a single global pvti that lived in
hypervisor-allocated memory instead of an array that lives in guest
memory.  I'd be happy to discuss next week in more detail (currently
on vacation).

--Andy
Re: kvmclock doesn't work, help?
On Wed, Dec 23, 2015 at 11:27 AM, Marcelo Tosatti wrote:
> On Mon, Dec 21, 2015 at 02:49:25PM -0800, Andy Lutomirski wrote:
>> On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti wrote:
>> > (busy spin is equally problematic as IPI for realtime guests).
>>
>> I disagree.  It's never been safe to call clock_gettime from an RT
>> task and expect a guarantee of real-time performance.  We could fix
>> that, but it's not even safe on non-KVM.
>
> The problem is how long the IPI (or busy spinning in case of version
> above) interrupts the vcpu.

The busy spin should be a few hundred cycles in the very worst case (a
couple of remote cache misses timed such that the guest is spinning
the whole time).  The IPI is always thousands of cycles no matter what
the guest is doing.

>
>> Sending an IPI *always* stalls the task.  Taking a lock (which is
>> effectively what this is doing) only stalls the tasks that contend for
>> the lock, which, most of the time, means that nothing stalls.
>>
>> Also, if the host disables preemption or otherwise boosts its priority
>> while version is odd, then the actual stall will be very short, in
>> contrast to an IPI-induced stall, which will be much, much longer.
>>
>> --Andy
>
> 1) The updates are rare.
> 2) There are no user complaints about the IPI mechanism.

If KVM ever starts directly propagating corrected time
(CLOCK_MONOTONIC, for example), then the updates won't be rare.

Maybe I'll try to instrument this.

--Andy
Re: kvmclock doesn't work, help?
On Fri, Dec 18, 2015 at 1:49 PM, Marcelo Tosatti wrote:
> On Fri, Dec 18, 2015 at 12:25:11PM -0800, Andy Lutomirski wrote:
>> [cc: John Stultz -- maybe you have ideas on how this should best
>> integrate with the core code]
>>
>> On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti
>> wrote:
>> > Can you write an actual proposal (with details) that accommodates the
>> > issue described at "Assuming a stable TSC across physical CPUS, and a
>> > stable TSC" ?
>> >
>> > Yes it would be nicer, the IPIs (to stop the vcpus) are problematic for
>> > realtime guests.
>>
>> This shouldn't require many details, and I don't think there's an ABI
>> change.  The rules are:
>>
>> When the overall system timebase changes (e.g. when the selected
>> clocksource changes or when update_pvclock_gtod is called), the KVM
>> host would:
>>
>> optionally: preempt_disable(); /* for performance */
>>
>> for all vms {
>>
>>   for all registered pvti structures {
>>     pvti->version++;  /* should be odd now */
>>   }
>
> pvti is userspace data, so you have to pin it before?

Yes.  Fortunately, most systems probably only have one page of pvti
structures, I think (unless there are a ton of vcpus), so the
performance impact should be negligible.

>
>>   /* Note: right now, any vcpu that tries to access pvti will start
>>      infinite looping.  We should add cpu_relax() to the guests. */
>>
>>   for all registered pvti structures {
>>     update everything except pvti->version;
>>   }
>>
>>   for all registered pvti structures {
>>     pvti->version++;  /* should be even now */
>>   }
>>
>>   cond_resched();
>> }
>>
>> Is this enough detail?  This should work with all existing guests,
>> too, unless there's a buggy guest out there that actually fails to
>> double-check version.
>
> What is the advantage of this over the brute force method, given
> that guests will busy spin?
>
> (busy spin is equally problematic as IPI for realtime guests).

I disagree.
It's never been safe to call clock_gettime from an RT
task and expect a guarantee of real-time performance.  We could fix
that, but it's not even safe on non-KVM.

Sending an IPI *always* stalls the task.  Taking a lock (which is
effectively what this is doing) only stalls the tasks that contend for
the lock, which, most of the time, means that nothing stalls.

Also, if the host disables preemption or otherwise boosts its priority
while version is odd, then the actual stall will be very short, in
contrast to an IPI-induced stall, which will be much, much longer.

--Andy
[PATCH v2 4/4] x86/vdso: Enable vdso pvclock access on all vdso variants
Now that pvclock doesn't require access to the fixmap, all vdso variants can use it. The kernel side isn't wired up for 32-bit kernels yet, but this covers 32-bit and x32 userspace on 64-bit kernels. Acked-by: Paolo Bonzini Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vclock_gettime.c | 91 1 file changed, 40 insertions(+), 51 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 59a98c25bde7..8602f06c759f 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -17,8 +17,10 @@ #include #include #include +#include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -43,10 +45,6 @@ extern u8 pvclock_page #ifndef BUILD_VDSO32 -#include -#include -#include - notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) { long ret; @@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) return ret; } -#ifdef CONFIG_PARAVIRT_CLOCK +#else + +notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) +{ + long ret; + + asm( + "mov %%ebx, %%edx \n" + "mov %2, %%ebx \n" + "call __kernel_vsyscall \n" + "mov %%edx, %%ebx \n" + : "=a" (ret) + : "0" (__NR_clock_gettime), "g" (clock), "c" (ts) + : "memory", "edx"); + return ret; +} + +notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) +{ + long ret; + + asm( + "mov %%ebx, %%edx \n" + "mov %2, %%ebx \n" + "call __kernel_vsyscall \n" + "mov %%edx, %%ebx \n" + : "=a" (ret) + : "0" (__NR_gettimeofday), "g" (tv), "c" (tz) + : "memory", "edx"); + return ret; +} + +#endif + +#ifdef CONFIG_PARAVIRT_CLOCK static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void) { return (const struct pvclock_vsyscall_time_info *)&pvclock_page; @@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode) do { version = pvti->version; - /* This is also a read barrier, so we'll read version first. 
*/ - tsc = rdtsc_ordered(); + smp_rmb(); + tsc = rdtsc_ordered(); pvti_tsc_to_system_mul = pvti->tsc_to_system_mul; pvti_tsc_shift = pvti->tsc_shift; pvti_system_time = pvti->system_time; @@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode) pvclock_scale_delta(delta, pvti_tsc_to_system_mul, pvti_tsc_shift); - /* refer to tsc.c read_tsc() comment for rationale */ + /* refer to vread_tsc() comment for rationale */ last = gtod->cycle_last; if (likely(ret >= last)) @@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode) } #endif -#else - -notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) -{ - long ret; - - asm( - "mov %%ebx, %%edx \n" - "mov %2, %%ebx \n" - "call __kernel_vsyscall \n" - "mov %%edx, %%ebx \n" - : "=a" (ret) - : "0" (__NR_clock_gettime), "g" (clock), "c" (ts) - : "memory", "edx"); - return ret; -} - -notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) -{ - long ret; - - asm( - "mov %%ebx, %%edx \n" - "mov %2, %%ebx \n" - "call __kernel_vsyscall \n" - "mov %%edx, %%ebx \n" - : "=a" (ret) - : "0" (__NR_gettimeofday), "g" (tv), "c" (tz) - : "memory", "edx"); - return ret; -} - -#ifdef CONFIG_PARAVIRT_CLOCK - -static notrace cycle_t vread_pvclock(int *mode) -{ - *mode = VCLOCK_NONE; - return 0; -} -#endif - -#endif - notrace static cycle_t vread_tsc(void) { cycle_t ret = (cycle_t)rdtsc_ordered(); -- 2.5.0
[PATCH v2 1/4] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
From: Andy Lutomirski The pvclock vdso code was too abstracted to understand easily and excessively paranoid. Simplify it for a huge speedup. This opens the door for additional simplifications, as the vdso no longer accesses the pvti for any vcpu other than vcpu 0. Before, vclock_gettime using kvm-clock took about 45ns on my machine. With this change, it takes 29ns, which is almost as fast as the pure TSC implementation. Reviewed-by: Paolo Bonzini Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vclock_gettime.c | 81 1 file changed, 46 insertions(+), 35 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index ca94fa649251..c325ba1bdddf 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) static notrace cycle_t vread_pvclock(int *mode) { - const struct pvclock_vsyscall_time_info *pvti; + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti; cycle_t ret; - u64 last; - u32 version; - u8 flags; - unsigned cpu, cpu1; - + u64 tsc, pvti_tsc; + u64 last, delta, pvti_system_time; + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift; /* -* Note: hypervisor must guarantee that: -* 1. cpu ID number maps 1:1 to per-CPU pvclock time info. -* 2. that per-CPU pvclock time info is updated if the -*underlying CPU changes. -* 3. that version is increased whenever underlying CPU -*changes. +* Note: The kernel and hypervisor must guarantee that cpu ID +* number maps 1:1 to per-CPU pvclock time info. +* +* Because the hypervisor is entirely unaware of guest userspace +* preemption, it cannot guarantee that per-CPU pvclock time +* info is updated if the underlying CPU changes or that that +* version is increased whenever underlying CPU changes. * +* On KVM, we are guaranteed that pvti updates for any vCPU are +* atomic as seen by *all* vCPUs. 
This is an even stronger +* guarantee than we get with a normal seqlock. +* +* On Xen, we don't appear to have that guarantee, but Xen still +* supplies a valid seqlock using the version field. + +* We only do pvclock vdso timing at all if +* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to +* mean that all vCPUs have matching pvti and that the TSC is +* synced, so we can just look at vCPU 0's pvti. */ - do { - cpu = __getcpu() & VGETCPU_CPU_MASK; - /* TODO: We can put vcpu id into higher bits of pvti.version. -* This will save a couple of cycles by getting rid of -* __getcpu() calls (Gleb). -*/ - - pvti = get_pvti(cpu); - - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags); - - /* -* Test we're still on the cpu as well as the version. -* We could have been migrated just after the first -* vgetcpu but before fetching the version, so we -* wouldn't notice a version change. -*/ - cpu1 = __getcpu() & VGETCPU_CPU_MASK; - } while (unlikely(cpu != cpu1 || - (pvti->pvti.version & 1) || - pvti->pvti.version != version)); - - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) + + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) { *mode = VCLOCK_NONE; + return 0; + } + + do { + version = pvti->version; + + /* This is also a read barrier, so we'll read version first. */ + tsc = rdtsc_ordered(); + + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul; + pvti_tsc_shift = pvti->tsc_shift; + pvti_system_time = pvti->system_time; + pvti_tsc = pvti->tsc_timestamp; + + /* Make sure that the version double-check is last. 
*/ + smp_rmb(); + } while (unlikely((version & 1) || version != pvti->version)); + + delta = tsc - pvti_tsc; + ret = pvti_system_time + + pvclock_scale_delta(delta, pvti_tsc_to_system_mul, + pvti_tsc_shift); /* refer to tsc.c read_tsc() comment for rationale */ last = gtod->cycle_last; -- 2.5.0
[PATCH v2 3/4] x86/vdso: Remove pvclock fixmap machinery
Acked-by: Paolo Bonzini Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vclock_gettime.c | 1 - arch/x86/entry/vdso/vma.c| 1 + arch/x86/include/asm/fixmap.h| 5 - arch/x86/include/asm/pvclock.h | 5 - arch/x86/kernel/kvmclock.c | 6 -- arch/x86/kernel/pvclock.c| 24 6 files changed, 1 insertion(+), 41 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 5dd363d54348..59a98c25bde7 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -45,7 +45,6 @@ extern u8 pvclock_page #include #include -#include #include notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index aa828191c654..b8f69e264ac4 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h index f80d70009ff8..6d7d0e52ed5a 100644 --- a/arch/x86/include/asm/fixmap.h +++ b/arch/x86/include/asm/fixmap.h @@ -19,7 +19,6 @@ #include #include #include -#include #ifdef CONFIG_X86_32 #include #include @@ -72,10 +71,6 @@ enum fixed_addresses { #ifdef CONFIG_X86_VSYSCALL_EMULATION VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT, #endif -#ifdef CONFIG_PARAVIRT_CLOCK - PVCLOCK_FIXMAP_BEGIN, - PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1, -#endif #endif FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index 571dad355bbc..fdcc04020636 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info { } __attribute__((__aligned__(SMP_CACHE_BYTES))); #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info) -#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1) - -int __init 
pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i, -int size); -struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu); #endif /* _ASM_X86_PVCLOCK_H */ diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index ec1b06dc82d2..72cef58693c7 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void) { #ifdef CONFIG_X86_64 int cpu; - int ret; u8 flags; struct pvclock_vcpu_time_info *vcpu_time; unsigned int size; @@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void) return 1; } - if ((ret = pvclock_init_vsyscall(hv_clock, size))) { - put_cpu(); - return ret; - } - put_cpu(); kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK; diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 2f355d229a58..99bfc025111d 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, set_normalized_timespec(ts, now.tv_sec, now.tv_nsec); } - -#ifdef CONFIG_X86_64 -/* - * Initialize the generic pvclock vsyscall state. This will allocate - * a/some page(s) for the per-vcpu pvclock information, set up a - * fixmap mapping for the page(s) - */ - -int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i, -int size) -{ - int idx; - - WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE); - - for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) { - __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx, -__pa(i) + (idx*PAGE_SIZE), -PAGE_KERNEL_VVAR); - } - - return 0; -} -#endif -- 2.5.0
[PATCH v2 2/4] x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
Acked-by: Paolo Bonzini Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vclock_gettime.c | 20 arch/x86/entry/vdso/vdso-layout.lds.S | 3 ++- arch/x86/entry/vdso/vdso2c.c | 3 +++ arch/x86/entry/vdso/vma.c | 13 + arch/x86/include/asm/pvclock.h| 9 + arch/x86/include/asm/vdso.h | 1 + arch/x86/kernel/kvmclock.c| 5 + 7 files changed, 41 insertions(+), 13 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index c325ba1bdddf..5dd363d54348 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void) } #endif +#ifdef CONFIG_PARAVIRT_CLOCK +extern u8 pvclock_page + __attribute__((visibility("hidden"))); +#endif + #ifndef BUILD_VDSO32 #include @@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) #ifdef CONFIG_PARAVIRT_CLOCK -static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) +static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void) { - const struct pvclock_vsyscall_time_info *pvti_base; - int idx = cpu / (PAGE_SIZE/PVTI_SIZE); - int offset = cpu % (PAGE_SIZE/PVTI_SIZE); - - BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END); - - pvti_base = (struct pvclock_vsyscall_time_info *) - __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx); - - return &pvti_base[offset]; + return (const struct pvclock_vsyscall_time_info *)&pvclock_page; } static notrace cycle_t vread_pvclock(int *mode) { - const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti; + const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti; cycle_t ret; u64 tsc, pvti_tsc; u64 last, delta, pvti_system_time; diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S index de2c921025f5..4158acc17df0 100644 --- a/arch/x86/entry/vdso/vdso-layout.lds.S +++ b/arch/x86/entry/vdso/vdso-layout.lds.S @@ -25,7 +25,7 @@ SECTIONS * segment. */ - vvar_start = . 
- 2 * PAGE_SIZE; + vvar_start = . - 3 * PAGE_SIZE; vvar_page = vvar_start; /* Place all vvars at the offsets in asm/vvar.h. */ @@ -36,6 +36,7 @@ SECTIONS #undef EMIT_VVAR hpet_page = vvar_start + PAGE_SIZE; + pvclock_page = vvar_start + 2 * PAGE_SIZE; . = SIZEOF_HEADERS; diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index 785d9922b106..491020b2826d 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -73,6 +73,7 @@ enum { sym_vvar_start, sym_vvar_page, sym_hpet_page, + sym_pvclock_page, sym_VDSO_FAKE_SECTION_TABLE_START, sym_VDSO_FAKE_SECTION_TABLE_END, }; @@ -80,6 +81,7 @@ enum { const int special_pages[] = { sym_vvar_page, sym_hpet_page, + sym_pvclock_page, }; struct vdso_sym { @@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = { [sym_vvar_start] = {"vvar_start", true}, [sym_vvar_page] = {"vvar_page", true}, [sym_hpet_page] = {"hpet_page", true}, + [sym_pvclock_page] = {"pvclock_page", true}, [sym_VDSO_FAKE_SECTION_TABLE_START] = { "VDSO_FAKE_SECTION_TABLE_START", false }, diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index 64df47148160..aa828191c654 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) .name = "[vvar]", .pages = no_pages, }; + struct pvclock_vsyscall_time_info *pvti; if (calculate_addr) { addr = vdso_addr(current->mm->start_stack, @@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) } #endif + pvti = pvclock_pvti_cpu0_va(); + if (pvti && image->sym_pvclock_page) { + ret = remap_pfn_range(vma, + text_start + image->sym_pvclock_page, + __pa(pvti) >> PAGE_SHIFT, + PAGE_SIZE, + PAGE_READONLY); + + if (ret) + goto up_fail; + } + up_fail: if (ret) current->mm->context.vdso = NULL; diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index 7a6bed5c08bc..571dad355bbc 100644 --- 
a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -4,6 +4,15 @@ #include #include +#ifdef CONFIG_KVM_GUEST
[PATCH v2 0/4] x86: KVM vdso and clock improvements
x86: KVM vdso and clock improvements

NB: patch 1 doesn't really belong here, but it makes this a lot easier
for me to test.  Patch 1, if it's okay at all, should go through the
kvm tree.  The rest should probably go through tip:x86/vdso once
they're reviewed.

I'll do a followup to enable vdso pvclock on 32-bit guests.  I'm not
currently set up to test it.  (The KVM people could also do it very
easily on top of these patches.)

Changes from v1:
 - Dropped patch 1
 - Added Paolo's review and acks
 - Fixed a build issue on some configs

Andy Lutomirski (4):
  x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
  x86/vdso: Remove pvclock fixmap machinery
  x86/vdso: Enable vdso pvclock access on all vdso variants

 arch/x86/entry/vdso/vclock_gettime.c  | 151 --
 arch/x86/entry/vdso/vdso-layout.lds.S |   3 +-
 arch/x86/entry/vdso/vdso2c.c          |   3 +
 arch/x86/entry/vdso/vma.c             |  14
 arch/x86/include/asm/fixmap.h         |   5 --
 arch/x86/include/asm/pvclock.h        |  14 ++--
 arch/x86/include/asm/vdso.h           |   1 +
 arch/x86/kernel/kvmclock.c            |  11 ++-
 arch/x86/kernel/pvclock.c             |  24 --
 9 files changed, 107 insertions(+), 119 deletions(-)

-- 
2.5.0
Re: kvmclock doesn't work, help?
[cc: John Stultz -- maybe you have ideas on how this should best integrate with the core code] On Fri, Dec 18, 2015 at 11:45 AM, Marcelo Tosatti wrote: > On Fri, Dec 18, 2015 at 11:27:13AM -0800, Andy Lutomirski wrote: >> On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti wrote: >> > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: >> >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti >> >> wrote: >> >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: >> >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti >> >> >> wrote: >> >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: >> >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski >> >> >> >> wrote: >> >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini >> >> >> >> > wrote: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: >> >> >> >> >>> > RAW TSC NTP corrected TSC >> >> >> >> >>> > t0 10 10 >> >> >> >> >>> > t1 20 19.99 >> >> >> >> >>> > t2 30 29.98 >> >> >> >> >>> > t3 40 39.97 >> >> >> >> >>> > t4 50 49.96 >> >> > >> >> > (1) >> >> > >> >> >> >> >>> > >> >> >> >> >>> > ... >> >> >> >> >>> > >> >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, >> >> >> >> >>> > you can see what will happen. >> >> >> >> >>> >> >> >> >> >>> Sure, but why would you ever switch from one to the other? >> >> >> >> >> >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. >> >> >> >> >> After >> >> >> >> >> resume, the TSC certainly increases at the same rate as before, >> >> >> >> >> but the >> >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased >> >> >> >> >> slower >> >> >> >> >> than the guest kvmclock. >> >> >> >> > >> >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP? >> >> >> >> > >> >> >> >> > If it's the host's, then wouldn't systemtime be reset after >> >> >> >> > resume to >> >> >> >> > the NTP corrected value? 
If so, the guest wouldn't see time go >> >> >> >> > backwards. >> >> >> >> > >> >> >> >> > If it's the guest's, then the guest's NTP correction is applied >> >> >> >> > on top >> >> >> >> > of kvmclock, and this shouldn't matter. >> >> >> >> > >> >> >> >> > I still feel like I'm missing something very basic here. >> >> >> >> > >> >> >> >> >> >> >> >> OK, I think I get it. >> >> >> >> >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the >> >> >> >> host's >> >> >> >> correction to the guest. If it did, indeed, propagate the >> >> >> >> correction >> >> >> >> then, after resume, the host's new system_time would match the >> >> >> >> guest's >> >> >> >> idea of it (after accounting for the guest's long nap), and I don't >> >> >> >> think there would be a problem. >> >> >> >> That being said, I can't find the code in the masterclock stuff that >> >> >> >> would actually do this. >> >> >> > >> >> >> > Guest clock is
Re: kvmclock doesn't work, help?
On Fri, Dec 18, 2015 at 3:47 AM, Marcelo Tosatti wrote: > On Thu, Dec 17, 2015 at 05:12:59PM -0800, Andy Lutomirski wrote: >> On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti >> wrote: >> > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: >> >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti >> >> wrote: >> >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: >> >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski >> >> >> wrote: >> >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini >> >> >> > wrote: >> >> >> >> >> >> >> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: >> >> >> >>> > RAW TSC NTP corrected TSC >> >> >> >>> > t0 10 10 >> >> >> >>> > t1 20 19.99 >> >> >> >>> > t2 30 29.98 >> >> >> >>> > t3 40 39.97 >> >> >> >>> > t4 50 49.96 >> > >> > (1) >> > >> >> >> >>> > >> >> >> >>> > ... >> >> >> >>> > >> >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, >> >> >> >>> > you can see what will happen. >> >> >> >>> >> >> >> >>> Sure, but why would you ever switch from one to the other? >> >> >> >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After >> >> >> >> resume, the TSC certainly increases at the same rate as before, but >> >> >> >> the >> >> >> >> raw TSC restarted counting from 0 and systemtime has increased >> >> >> >> slower >> >> >> >> than the guest kvmclock. >> >> >> > >> >> >> > Wait, are we talking about the host's NTP or the guest's NTP? >> >> >> > >> >> >> > If it's the host's, then wouldn't systemtime be reset after resume to >> >> >> > the NTP corrected value? If so, the guest wouldn't see time go >> >> >> > backwards. >> >> >> > >> >> >> > If it's the guest's, then the guest's NTP correction is applied on >> >> >> > top >> >> >> > of kvmclock, and this shouldn't matter. >> >> >> > >> >> >> > I still feel like I'm missing something very basic here. >> >> >> > >> >> >> >> >> >> OK, I think I get it. 
>> >> >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's >> >> >> correction to the guest. If it did, indeed, propagate the correction >> >> >> then, after resume, the host's new system_time would match the guest's >> >> >> idea of it (after accounting for the guest's long nap), and I don't >> >> >> think there would be a problem. >> >> >> That being said, I can't find the code in the masterclock stuff that >> >> >> would actually do this. >> >> > >> >> > Guest clock is maintained by guest timekeeping code, which does: >> >> > >> >> > timer_interrupt() >> >> > offset = read clocksource since last timer interrupt >> >> > accumulate_to_systemclock(offset) >> >> > >> >> > The frequency correction of NTP in the host can be applied to >> >> > kvmclock, which will be visible to the guest >> >> > at "read clocksource since last timer interrupt" >> >> > (kvmclock_clocksource_read function). >> >> >> >> pvclock_clocksource_read? That seems to do the same thing as all the >> >> other clocksource access functions. >> >> >> >> > >> >> > This does not mean that the NTP correction in the host is propagated >> >> > to the guests system clock directly. >> >> > >> >> > (For example, the guest can run N
Re: kvmclock doesn't work, help?
On Thu, Dec 17, 2015 at 11:08 AM, Marcelo Tosatti wrote: > On Thu, Dec 17, 2015 at 08:33:17AM -0800, Andy Lutomirski wrote: >> On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti wrote: >> > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: >> >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski >> >> wrote: >> >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini >> >> > wrote: >> >> >> >> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: >> >> >>> > RAW TSC NTP corrected TSC >> >> >>> > t0 10 10 >> >> >>> > t1 20 19.99 >> >> >>> > t2 30 29.98 >> >> >>> > t3 40 39.97 >> >> >>> > t4 50 49.96 > > (1) > >> >> >>> > >> >> >>> > ... >> >> >>> > >> >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, >> >> >>> > you can see what will happen. >> >> >>> >> >> >>> Sure, but why would you ever switch from one to the other? >> >> >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After >> >> >> resume, the TSC certainly increases at the same rate as before, but the >> >> >> raw TSC restarted counting from 0 and systemtime has increased slower >> >> >> than the guest kvmclock. >> >> > >> >> > Wait, are we talking about the host's NTP or the guest's NTP? >> >> > >> >> > If it's the host's, then wouldn't systemtime be reset after resume to >> >> > the NTP corrected value? If so, the guest wouldn't see time go >> >> > backwards. >> >> > >> >> > If it's the guest's, then the guest's NTP correction is applied on top >> >> > of kvmclock, and this shouldn't matter. >> >> > >> >> > I still feel like I'm missing something very basic here. >> >> > >> >> >> >> OK, I think I get it. >> >> >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's >> >> correction to the guest. If it did, indeed, propagate the correction >> >> then, after resume, the host's new system_time would match the guest's >> >> idea of it (after accounting for the guest's long nap), and I don't >> >> think there would be a problem. 
>> >> That being said, I can't find the code in the masterclock stuff that >> >> would actually do this. >> > >> > Guest clock is maintained by guest timekeeping code, which does: >> > >> > timer_interrupt() >> > offset = read clocksource since last timer interrupt >> > accumulate_to_systemclock(offset) >> > >> > The frequency correction of NTP in the host can be applied to >> > kvmclock, which will be visible to the guest >> > at "read clocksource since last timer interrupt" >> > (kvmclock_clocksource_read function). >> >> pvclock_clocksource_read? That seems to do the same thing as all the >> other clocksource access functions. >> >> > >> > This does not mean that the NTP correction in the host is propagated >> > to the guests system clock directly. >> > >> > (For example, the guest can run NTP which is free to do further >> > adjustments at "accumulate_to_systemclock(offset)" time). >> >> Of course. But I expected that, in the absence of NTP on the guest, >> that the guest would track the host's *corrected* time. >> >> > >> >> If, on the other hand, the host's NTP correction is not supposed to >> >> propagate to the guest, >> > >> > This is optional. There is a module option to control this, in fact. >> > >> > Its nice to have, because then you can execute a guest without NTP >> > (say without network connection), and have a kvmclock (kvmclock is a >> > clocksource, not a guest system clock) which is NTP corrected. >> >> Can you point to how this works? I found kvm_guest_time_update, whch >> is called under circumstances that I haven't un
Re: kvmclock doesn't work, help?
On Wed, Dec 16, 2015 at 1:57 PM, Marcelo Tosatti wrote: > On Wed, Dec 16, 2015 at 10:17:16AM -0800, Andy Lutomirski wrote: >> On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski wrote: >> > On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini >> > wrote: >> >> >> >> >> >> On 14/12/2015 23:31, Andy Lutomirski wrote: >> >>> > RAW TSC NTP corrected TSC >> >>> > t0 10 10 >> >>> > t1 20 19.99 >> >>> > t2 30 29.98 >> >>> > t3 40 39.97 >> >>> > t4 50 49.96 >> >>> > >> >>> > ... >> >>> > >> >>> > if you suddenly switch from RAW TSC to NTP corrected TSC, >> >>> > you can see what will happen. >> >>> >> >>> Sure, but why would you ever switch from one to the other? >> >> >> >> The guest uses the raw TSC and systemtime = 0 until suspend. After >> >> resume, the TSC certainly increases at the same rate as before, but the >> >> raw TSC restarted counting from 0 and systemtime has increased slower >> >> than the guest kvmclock. >> > >> > Wait, are we talking about the host's NTP or the guest's NTP? >> > >> > If it's the host's, then wouldn't systemtime be reset after resume to >> > the NTP corrected value? If so, the guest wouldn't see time go >> > backwards. >> > >> > If it's the guest's, then the guest's NTP correction is applied on top >> > of kvmclock, and this shouldn't matter. >> > >> > I still feel like I'm missing something very basic here. >> > >> >> OK, I think I get it. >> >> Marcelo, I thought that kvmclock was supposed to propagate the host's >> correction to the guest. If it did, indeed, propagate the correction >> then, after resume, the host's new system_time would match the guest's >> idea of it (after accounting for the guest's long nap), and I don't >> think there would be a problem. >> That being said, I can't find the code in the masterclock stuff that >> would actually do this. 
> > Guest clock is maintained by guest timekeeping code, which does: > > timer_interrupt() > offset = read clocksource since last timer interrupt > accumulate_to_systemclock(offset) > > The frequency correction of NTP in the host can be applied to > kvmclock, which will be visible to the guest > at "read clocksource since last timer interrupt" > (kvmclock_clocksource_read function). pvclock_clocksource_read? That seems to do the same thing as all the other clocksource access functions. > > This does not mean that the NTP correction in the host is propagated > to the guests system clock directly. > > (For example, the guest can run NTP which is free to do further > adjustments at "accumulate_to_systemclock(offset)" time). Of course. But I expected that, in the absence of NTP on the guest, that the guest would track the host's *corrected* time. > >> If, on the other hand, the host's NTP correction is not supposed to >> propagate to the guest, > > This is optional. There is a module option to control this, in fact. > > Its nice to have, because then you can execute a guest without NTP > (say without network connection), and have a kvmclock (kvmclock is a > clocksource, not a guest system clock) which is NTP corrected. Can you point to how this works? I found kvm_guest_time_update, whch is called under circumstances that I haven't untangled. I can't really tell what it's trying to do. In any case, this still seems much more convoluted than it has to be. In the case in which the host has a stable TSC (tsc is selected in the core timekeeping code, VCLOCK_TSC is set, etc), which is basically all the time on the last few generations of CPUs, then the core timekeeping code is already exposing a linear function that's supposed to be used for monotonic, cpu-local access to a corrected nanosecond counter. It's even in pretty much exactly the right form to pass through to the guest via pvclock in the gtod data. Why doesn't KVM pass it through verbatim, updated in real time? 
Is there some legacy reason that KVM must apply its own corrections and
has to jump through hoops to pause vcpus when updating those vcpus'
copies of the pvclock data?

>
>> then shouldn't KVM just update system_time on
>> resume to whatever the guest would think it had (which I think would
>> be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
>> shifted by some per-guest constant offset).
>>
>> --Andy
>
> Sure, you could add a correction to compensate and make sure
> the guest clock does not see time backwards.
> Could you help do that?

You understand the code far better than I do.  As it stands, it simply
doesn't work on any system that suspends and resumes (unless maybe the
system has the upcoming Intel ART feature, and I have no clue when
that'll show up).

--Andy
Re: kvmclock doesn't work, help?
On Wed, Dec 16, 2015 at 9:48 AM, Andy Lutomirski wrote:
> On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini wrote:
>>
>>
>> On 14/12/2015 23:31, Andy Lutomirski wrote:
>>> >          RAW TSC       NTP corrected TSC
>>> > t0       10            10
>>> > t1       20            19.99
>>> > t2       30            29.98
>>> > t3       40            39.97
>>> > t4       50            49.96
>>> >
>>> > ...
>>> >
>>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>>> > you can see what will happen.
>>>
>>> Sure, but why would you ever switch from one to the other?
>>
>> The guest uses the raw TSC and systemtime = 0 until suspend.  After
>> resume, the TSC certainly increases at the same rate as before, but the
>> raw TSC restarted counting from 0 and systemtime has increased slower
>> than the guest kvmclock.
>
> Wait, are we talking about the host's NTP or the guest's NTP?
>
> If it's the host's, then wouldn't systemtime be reset after resume to
> the NTP corrected value?  If so, the guest wouldn't see time go
> backwards.
>
> If it's the guest's, then the guest's NTP correction is applied on top
> of kvmclock, and this shouldn't matter.
>
> I still feel like I'm missing something very basic here.
>

OK, I think I get it.

Marcelo, I thought that kvmclock was supposed to propagate the host's
correction to the guest.  If it did, indeed, propagate the correction
then, after resume, the host's new system_time would match the guest's
idea of it (after accounting for the guest's long nap), and I don't
think there would be a problem.  That being said, I can't find the code
in the masterclock stuff that would actually do this.

If, on the other hand, the host's NTP correction is not supposed to
propagate to the guest, then shouldn't KVM just update system_time on
resume to whatever the guest would think it had (which I think would
be equivalent to the host's CLOCK_MONOTONIC_RAW value, possibly
shifted by some per-guest constant offset).
--Andy
Re: kvmclock doesn't work, help?
On Tue, Dec 15, 2015 at 12:42 AM, Paolo Bonzini wrote:
>
>
> On 14/12/2015 23:31, Andy Lutomirski wrote:
>> >          RAW TSC       NTP corrected TSC
>> > t0       10            10
>> > t1       20            19.99
>> > t2       30            29.98
>> > t3       40            39.97
>> > t4       50            49.96
>> >
>> > ...
>> >
>> > if you suddenly switch from RAW TSC to NTP corrected TSC,
>> > you can see what will happen.
>>
>> Sure, but why would you ever switch from one to the other?
>
> The guest uses the raw TSC and systemtime = 0 until suspend.  After
> resume, the TSC certainly increases at the same rate as before, but the
> raw TSC restarted counting from 0 and systemtime has increased slower
> than the guest kvmclock.

Wait, are we talking about the host's NTP or the guest's NTP?

If it's the host's, then wouldn't systemtime be reset after resume to
the NTP corrected value?  If so, the guest wouldn't see time go
backwards.

If it's the guest's, then the guest's NTP correction is applied on top
of kvmclock, and this shouldn't matter.

I still feel like I'm missing something very basic here.

--Andy
Re: kvmclock doesn't work, help?
On Mon, Dec 14, 2015 at 2:00 PM, Marcelo Tosatti wrote: > On Mon, Dec 14, 2015 at 02:44:15PM +0100, Paolo Bonzini wrote: >> >> >> On 11/12/2015 22:57, Andy Lutomirski wrote: >> > I'm still not seeing the issue. >> > >> > The formula is: >> > >> > (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >> >> > pvti->tsc_shift) + pvti->system_time >> > >> > Obviously, if you reset pvti->tsc_timestamp to the current tsc value >> > after suspend/resume, you would also need to update system_time. >> > >> > I don't see what this has to do with suspend/resume or with whether >> > the effective scale factor is greater than or less than one. The only >> > suspend/resume interaction I can see is that, if the host allows the >> > guest-observed TSC value to jump (which is arguably a bug, what that's >> > not important here), it needs to update pvti before resuming the >> > guest. >> >> Which is not an issue, since freezing obviously gets all CPUs out of >> guest mode. >> >> Marcelo, can you provide an example with made-up values for tsc and pvti? > > I meant "systemtime" at ^. > > guest visible clock = systemtime (updated at time 0, guest initialization) + > scaled tsc reads=LARGE VALUE. > ^^ > guest reads clock to memory at location A = scaled tsc read. > -> suspend resume event > guest visible clock = systemtime (updated at time AFTER SUSPEND) + scaled tsc > reads=0. > guest reads clock to memory at location B. > > So before the suspend/resume event, the clock is the RAW TSC values > (scaled by kvmclock, but the frequency of the RAW TSC). > > After suspend/resume event, the clock is updated from the host > via get_kernel_ns(), which reads the corrected NTP frequency TSC. > > So you switch the timebase, from a clock running at a given frequency, > to a clock running at another frequency (effective frequency). > > Example: > > RAW TSC NTP corrected TSC > t0 10 10 > t1 20 19.99 > t2 30 29.98 > t3 40 39.97 > t4 50 49.96 > > ... 
>
> if you suddenly switch from RAW TSC to NTP corrected TSC,
> you can see what will happen.

Sure, but why would you ever switch from one to the other?

I'm still not seeing the scenario under which this discontinuity is
visible to anything other than the kvmclock code itself.  The only
things that need to be monotonic are the output from vread_pvclock and
the in-kernel equivalent, I think.

--Andy
Re: kvmclock doesn't work, help?
On Fri, Dec 11, 2015 at 3:48 PM, Marcelo Tosatti wrote: > On Fri, Dec 11, 2015 at 01:57:23PM -0800, Andy Lutomirski wrote: >> On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti wrote: >> > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: >> >> I'm trying to clean up kvmclock and I can't get it to work at all. My >> >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. >> >> >> >> If I boot an SMP (2 vcpus) guest, tracing says: >> >> >> >> qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 0 >> >> qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: >> >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc >> >> qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> >> qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> >> qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> >> qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: >> >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> >> qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: >> >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> >> qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 1 >> >> qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 1 >> >> qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 1 >> >> >> >> >> >> If I boot a UP guest, tracing says: >> >> >> >> qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 1 >> >> qemu-system-x86-2571 [002] 102370.447688: 
kvm_update_master_clock: >> >> masterclock 0 hostclock tsc offsetmatched 1 >> >> >> >> I suspect, but I haven't verified, that this is fallout from: >> >> >> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f >> >> Author: Marcelo Tosatti >> >> Date: Wed May 14 12:43:24 2014 -0300 >> >> >> >> KVM: x86: disable master clock if TSC is reset during suspend >> >> >> >> Updating system_time from the kernel clock once master clock >> >> has been enabled can result in time backwards event, in case >> >> kernel clock frequency is lower than TSC frequency. >> >> >> >> Disable master clock in case it is necessary to update it >> >> from the resume path. >> >> >> >> Signed-off-by: Marcelo Tosatti >> >> Signed-off-by: Paolo Bonzini >> >> >> >> >> >> Can we please stop making kvmclock more complex? It's a beast right >> >> now, and not in a good way. It's far too tangled with the vclock >> >> machinery on both the host and guest sides, the pvclock stuff is not >> >> well thought out (even in principle in an ABI sense), and it's never >> >> been clear to my what problem exactly the kvmclock stuff is supposed >> >> to solve. >> >> >> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and >> >> start over. A correctly functioning KVM guest using TSC (i.e. >> >> ignoring kvmclock entirely) >> >> seems to work rather more reliably and >> >> considerably faster than a kvmclock guest. >> >> >> >> --Andy >> >> >> >> -- >> >> Andy Lutomirski >> >> AMA Capital Management, LLC >> > >> > Andy, >> > >> > I am all for solving practical problems rather than pleasing aesthetic >> > pleasure. >> > >> >> Updating system_time from the kernel clock once master clock >> >> has been enabled can result in time backwards event, in case >> >> kernel clock frequency is lower than TSC frequency. >> >> >> >> Disable master clock in case it is necessary to update it >> >> f
Re: kvmclock doesn't work, help?
On Thu, Dec 10, 2015 at 1:32 PM, Marcelo Tosatti wrote: > On Wed, Dec 09, 2015 at 01:10:59PM -0800, Andy Lutomirski wrote: >> I'm trying to clean up kvmclock and I can't get it to work at all. My >> host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC. >> >> If I boot an SMP (2 vcpus) guest, tracing says: >> >> qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 0 >> qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: >> vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc >> qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: >> vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: >> vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc >> qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 1 >> qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 1 >> qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 1 >> >> >> If I boot a UP guest, tracing says: >> >> qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 1 >> qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: >> masterclock 0 hostclock tsc offsetmatched 1 >> >> I suspect, but I haven't verified, that this is fallout from: >> >> commit 16a9602158861687c78b6de6dc6a79e6e8a9136f >> Author: Marcelo Tosatti >> 
Date: Wed May 14 12:43:24 2014 -0300 >> >> KVM: x86: disable master clock if TSC is reset during suspend >> >> Updating system_time from the kernel clock once master clock >> has been enabled can result in time backwards event, in case >> kernel clock frequency is lower than TSC frequency. >> >> Disable master clock in case it is necessary to update it >> from the resume path. >> >> Signed-off-by: Marcelo Tosatti >> Signed-off-by: Paolo Bonzini >> >> >> Can we please stop making kvmclock more complex? It's a beast right >> now, and not in a good way. It's far too tangled with the vclock >> machinery on both the host and guest sides, the pvclock stuff is not >> well thought out (even in principle in an ABI sense), and it's never >> been clear to my what problem exactly the kvmclock stuff is supposed >> to solve. >> >> I'm somewhat tempted to suggest that we delete kvmclock entirely and >> start over. A correctly functioning KVM guest using TSC (i.e. >> ignoring kvmclock entirely) >> seems to work rather more reliably and >> considerably faster than a kvmclock guest. >> >> --Andy >> >> -- >> Andy Lutomirski >> AMA Capital Management, LLC > > Andy, > > I am all for solving practical problems rather than pleasing aesthetic > pleasure. > >> Updating system_time from the kernel clock once master clock >> has been enabled can result in time backwards event, in case >> kernel clock frequency is lower than TSC frequency. >> >> Disable master clock in case it is necessary to update it >> from the resume path. > >> once master clock >> has been enabled can result in time backwards event, in case >> kernel clock frequency is lower than TSC frequency. > > guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc reads. 
>
> If the effective frequency of the kernel clock is lower (for example
> due to NTP correcting the TSC frequency of the system), and you resume
> and update the system, the following happens:
>
> guest visible clock = tsc_timestamp (updated at time 0) + scaled tsc
> reads=LARGE VALUE.
> suspend/resume event.
> guest visible clock = tsc_timestamp (updated at time N) + scaled tsc reads=0.
>

I'm still not seeing the issue.

The formula is:

    (((rdtsc - pvti->tsc_timestamp) * pvti->tsc_to_system_mul) >>
     pvti->tsc_shift) + pvti->system_time

Obviously, if you reset pvti->tsc_timestamp to the current tsc value
after suspend/resume, you would also need to update system_time.
Re: [PATCH 2/5] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
On Fri, Dec 11, 2015 at 12:42 AM, Paolo Bonzini wrote:
>
>
> On 11/12/2015 08:52, Ingo Molnar wrote:
>>
>> * Paolo Bonzini wrote:
>>>
>>> Reviewed-by: Paolo Bonzini
>>
>> Thanks.  I've added your Reviewed-by to the 1/5 patch as well - to be
>> able to put the whole series into the tip:x86/entry tree.  Let me know
>> if you'd like it to be done differently.
>
> The 1/5 patch is entirely in KVM and is not necessary for the rest of
> the series to work.  I would like it to be separate, because Marcelo has
> not yet chimed in to say why it was necessary.
>
> Can you just apply patches 2-5?

Yes, please.  I don't grok the clock update mechanism in the KVM host
well enough to be sure that patch 1 is actually correct.  All I know is
that it works better on my laptop with the patch than without the patch
and that it seems at least conceptually correct.

In any event, patch 1 is a host patch and 2-5 are guest patches, and
they only interact to the extent that it's hard for me to test 2-5 on
the guest without patch 1 on the host because without patch 1 my
laptop's host kernel tends to disable stable kvmclock, thus disabling
the entire mechanism in the guest.

--Andy
Re: [PATCH] x86/platform/uv: Include clocksource.h for clocksource_touch_watchdog()
On Fri, Dec 11, 2015 at 12:06 AM, Ingo Molnar wrote: > > * Andy Lutomirski wrote: > >> diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h >> index f80d70009ff8..6d7d0e52ed5a 100644 >> --- a/arch/x86/include/asm/fixmap.h >> +++ b/arch/x86/include/asm/fixmap.h >> @@ -19,7 +19,6 @@ >> #include >> #include >> #include >> -#include >> #ifdef CONFIG_X86_32 >> #include >> #include > > So this change triggered a build failure on 64-bit allmodconfig - fixed via > the > patch below. Your change unearthed a latent bug, a missing header inclusion. > > Thanks, > > Ingo > > > > From d51953b0873358d13b189996e6976dfa12a9b59d Mon Sep 17 00:00:00 2001 > From: Ingo Molnar > Date: Fri, 11 Dec 2015 09:01:30 +0100 > Subject: [PATCH] x86/platform/uv: Include clocksource.h for > clocksource_touch_watchdog() > > This build failure triggers on 64-bit allmodconfig: > > arch/x86/platform/uv/uv_nmi.c:493:2: error: implicit declaration of > function ‘clocksource_touch_watchdog’ [-Werror=implicit-function-declaration] > > which is caused by recent changes exposing a missing clocksource.h include > in uv_nmi.c: > > cc1e24fdb064 x86/vdso: Remove pvclock fixmap machinery > > this file got clocksource.h indirectly via fixmap.h - that stealth route > of header inclusion is now gone. > > Cc: Borislav Petkov > Cc: H. Peter Anvin > Cc: Linus Torvalds > Cc: Thomas Gleixner > Signed-off-by: Ingo Molnar LGTM. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/5] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
This gets rid of the "did TSC go backwards" logic and just updates all clocks. It should work better (no more disabling of fast timing) and more reliably (all of the clocks are actually updated). Signed-off-by: Andy Lutomirski --- arch/x86/kvm/x86.c | 75 +++--- 1 file changed, 3 insertions(+), 72 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index eed32283d22c..c88f91f4b1a3 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -123,8 +123,6 @@ module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR); unsigned int __read_mostly lapic_timer_advance_ns = 0; module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR); -static bool __read_mostly backwards_tsc_observed = false; - #define KVM_NR_SHARED_MSRS 16 struct kvm_shared_msrs_global { @@ -1671,7 +1669,6 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm) &ka->master_cycle_now); ka->use_master_clock = host_tsc_clocksource && vcpus_matched - && !backwards_tsc_observed && !ka->boot_vcpu_runs_old_kvmclock; if (ka->use_master_clock) @@ -7369,88 +7366,22 @@ int kvm_arch_hardware_enable(void) struct kvm_vcpu *vcpu; int i; int ret; - u64 local_tsc; - u64 max_tsc = 0; - bool stable, backwards_tsc = false; kvm_shared_msr_cpu_online(); ret = kvm_x86_ops->hardware_enable(); if (ret != 0) return ret; - local_tsc = rdtsc(); - stable = !check_tsc_unstable(); list_for_each_entry(kvm, &vm_list, vm_list) { kvm_for_each_vcpu(i, vcpu, kvm) { - if (!stable && vcpu->cpu == smp_processor_id()) + if (vcpu->cpu == smp_processor_id()) { kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); - if (stable && vcpu->arch.last_host_tsc > local_tsc) { - backwards_tsc = true; - if (vcpu->arch.last_host_tsc > max_tsc) - max_tsc = vcpu->arch.last_host_tsc; + kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, +vcpu); } } } - /* -* Sometimes, even reliable TSCs go backwards. This happens on -* platforms that reset TSC during suspend or hibernate actions, but -* maintain synchronization. We must compensate. 
Fortunately, we can -* detect that condition here, which happens early in CPU bringup, -* before any KVM threads can be running. Unfortunately, we can't -* bring the TSCs fully up to date with real time, as we aren't yet far -* enough into CPU bringup that we know how much real time has actually -* elapsed; our helper function, get_kernel_ns() will be using boot -* variables that haven't been updated yet. -* -* So we simply find the maximum observed TSC above, then record the -* adjustment to TSC in each VCPU. When the VCPU later gets loaded, -* the adjustment will be applied. Note that we accumulate -* adjustments, in case multiple suspend cycles happen before some VCPU -* gets a chance to run again. In the event that no KVM threads get a -* chance to run, we will miss the entire elapsed period, as we'll have -* reset last_host_tsc, so VCPUs will not have the TSC adjusted and may -* loose cycle time. This isn't too big a deal, since the loss will be -* uniform across all VCPUs (not to mention the scenario is extremely -* unlikely). It is possible that a second hibernate recovery happens -* much faster than a first, causing the observed TSC here to be -* smaller; this would require additional padding adjustment, which is -* why we set last_host_tsc to the local tsc observed here. -* -* N.B. - this code below runs only on platforms with reliable TSC, -* as that is the only way backwards_tsc is set above. Also note -* that this runs for ALL vcpus, which is not a bug; all VCPUs should -* have the same delta_cyc adjustment applied if backwards_tsc -* is detected. Note further, this adjustment is only done once, -* as we reset last_host_tsc on all VCPUs to stop this from being -* called multiple times (one for each physical CPU bringup). -* -* Platforms with unreliable TSCs don't have to deal with this, they -* will be compensated by the logic in vcpu_load, which sets the TSC to -* catchup mode. This will catchup all VCPUs to
[PATCH 3/5] x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vclock_gettime.c | 20 arch/x86/entry/vdso/vdso-layout.lds.S | 3 ++- arch/x86/entry/vdso/vdso2c.c | 3 +++ arch/x86/entry/vdso/vma.c | 13 + arch/x86/include/asm/pvclock.h| 9 + arch/x86/include/asm/vdso.h | 1 + arch/x86/kernel/kvmclock.c| 5 + 7 files changed, 41 insertions(+), 13 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index c325ba1bdddf..5dd363d54348 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -36,6 +36,11 @@ static notrace cycle_t vread_hpet(void) } #endif +#ifdef CONFIG_PARAVIRT_CLOCK +extern u8 pvclock_page + __attribute__((visibility("hidden"))); +#endif + #ifndef BUILD_VDSO32 #include @@ -62,23 +67,14 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) #ifdef CONFIG_PARAVIRT_CLOCK -static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) +static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void) { - const struct pvclock_vsyscall_time_info *pvti_base; - int idx = cpu / (PAGE_SIZE/PVTI_SIZE); - int offset = cpu % (PAGE_SIZE/PVTI_SIZE); - - BUG_ON(PVCLOCK_FIXMAP_BEGIN + idx > PVCLOCK_FIXMAP_END); - - pvti_base = (struct pvclock_vsyscall_time_info *) - __fix_to_virt(PVCLOCK_FIXMAP_BEGIN+idx); - - return &pvti_base[offset]; + return (const struct pvclock_vsyscall_time_info *)&pvclock_page; } static notrace cycle_t vread_pvclock(int *mode) { - const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti; + const struct pvclock_vcpu_time_info *pvti = &get_pvti0()->pvti; cycle_t ret; u64 tsc, pvti_tsc; u64 last, delta, pvti_system_time; diff --git a/arch/x86/entry/vdso/vdso-layout.lds.S b/arch/x86/entry/vdso/vdso-layout.lds.S index de2c921025f5..4158acc17df0 100644 --- a/arch/x86/entry/vdso/vdso-layout.lds.S +++ b/arch/x86/entry/vdso/vdso-layout.lds.S @@ -25,7 +25,7 @@ SECTIONS * segment. */ - vvar_start = . - 2 * PAGE_SIZE; + vvar_start = . 
- 3 * PAGE_SIZE; vvar_page = vvar_start; /* Place all vvars at the offsets in asm/vvar.h. */ @@ -36,6 +36,7 @@ SECTIONS #undef EMIT_VVAR hpet_page = vvar_start + PAGE_SIZE; + pvclock_page = vvar_start + 2 * PAGE_SIZE; . = SIZEOF_HEADERS; diff --git a/arch/x86/entry/vdso/vdso2c.c b/arch/x86/entry/vdso/vdso2c.c index 785d9922b106..491020b2826d 100644 --- a/arch/x86/entry/vdso/vdso2c.c +++ b/arch/x86/entry/vdso/vdso2c.c @@ -73,6 +73,7 @@ enum { sym_vvar_start, sym_vvar_page, sym_hpet_page, + sym_pvclock_page, sym_VDSO_FAKE_SECTION_TABLE_START, sym_VDSO_FAKE_SECTION_TABLE_END, }; @@ -80,6 +81,7 @@ enum { const int special_pages[] = { sym_vvar_page, sym_hpet_page, + sym_pvclock_page, }; struct vdso_sym { @@ -91,6 +93,7 @@ struct vdso_sym required_syms[] = { [sym_vvar_start] = {"vvar_start", true}, [sym_vvar_page] = {"vvar_page", true}, [sym_hpet_page] = {"hpet_page", true}, + [sym_pvclock_page] = {"pvclock_page", true}, [sym_VDSO_FAKE_SECTION_TABLE_START] = { "VDSO_FAKE_SECTION_TABLE_START", false }, diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index 64df47148160..aa828191c654 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -100,6 +100,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) .name = "[vvar]", .pages = no_pages, }; + struct pvclock_vsyscall_time_info *pvti; if (calculate_addr) { addr = vdso_addr(current->mm->start_stack, @@ -169,6 +170,18 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr) } #endif + pvti = pvclock_pvti_cpu0_va(); + if (pvti && image->sym_pvclock_page) { + ret = remap_pfn_range(vma, + text_start + image->sym_pvclock_page, + __pa(pvti) >> PAGE_SHIFT, + PAGE_SIZE, + PAGE_READONLY); + + if (ret) + goto up_fail; + } + up_fail: if (ret) current->mm->context.vdso = NULL; diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index 7a6bed5c08bc..3864398c7cb2 100644 --- a/arch/x86/include/asm/pvclock.h +++ 
b/arch/x86/include/asm/pvclock.h @@ -4,6 +4,15 @@ #include #include +#ifdef CONFIG_PARAVIRT_CLOCK
[PATCH 0/5] x86: KVM vdso and clock improvements
NB: patch 1 doesn't really belong here, but it makes this a lot easier for me to test. Patch 1, if it's okay at all, should go through the kvm tree. The rest should probably go through tip:x86/vdso once they're reviewed.

I'll do a followup to enable vdso pvclock on 32-bit guests. I'm not currently set up to test it. (The KVM people could also do it very easily on top of these patches.)

Andy Lutomirski (5):
  x86/kvm: On KVM re-enable (e.g. after suspend), update clocks
  x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
  x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap
  x86/vdso: Remove pvclock fixmap machinery
  x86/vdso: Enable vdso pvclock access on all vdso variants

 arch/x86/entry/vdso/vclock_gettime.c  | 151 --
 arch/x86/entry/vdso/vdso-layout.lds.S | 3 +-
 arch/x86/entry/vdso/vdso2c.c          | 3 +
 arch/x86/entry/vdso/vma.c             | 14 
 arch/x86/include/asm/fixmap.h         | 5 --
 arch/x86/include/asm/pvclock.h        | 14 ++--
 arch/x86/include/asm/vdso.h           | 1 +
 arch/x86/kernel/kvmclock.c            | 11 ++-
 arch/x86/kernel/pvclock.c             | 24 --
 arch/x86/kvm/x86.c                    | 75 +
 10 files changed, 110 insertions(+), 191 deletions(-)

-- 
2.5.0

-- 
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/5] x86, vdso, pvclock: Simplify and speed up the vdso pvclock reader
From: Andy Lutomirski The pvclock vdso code was too abstracted to understand easily and excessively paranoid. Simplify it for a huge speedup. This opens the door for additional simplifications, as the vdso no longer accesses the pvti for any vcpu other than vcpu 0. Before, vclock_gettime using kvm-clock took about 45ns on my machine. With this change, it takes 29ns, which is almost as fast as the pure TSC implementation. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vclock_gettime.c | 81 1 file changed, 46 insertions(+), 35 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index ca94fa649251..c325ba1bdddf 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -78,47 +78,58 @@ static notrace const struct pvclock_vsyscall_time_info *get_pvti(int cpu) static notrace cycle_t vread_pvclock(int *mode) { - const struct pvclock_vsyscall_time_info *pvti; + const struct pvclock_vcpu_time_info *pvti = &get_pvti(0)->pvti; cycle_t ret; - u64 last; - u32 version; - u8 flags; - unsigned cpu, cpu1; - + u64 tsc, pvti_tsc; + u64 last, delta, pvti_system_time; + u32 version, pvti_tsc_to_system_mul, pvti_tsc_shift; /* -* Note: hypervisor must guarantee that: -* 1. cpu ID number maps 1:1 to per-CPU pvclock time info. -* 2. that per-CPU pvclock time info is updated if the -*underlying CPU changes. -* 3. that version is increased whenever underlying CPU -*changes. +* Note: The kernel and hypervisor must guarantee that cpu ID +* number maps 1:1 to per-CPU pvclock time info. +* +* Because the hypervisor is entirely unaware of guest userspace +* preemption, it cannot guarantee that per-CPU pvclock time +* info is updated if the underlying CPU changes or that that +* version is increased whenever underlying CPU changes. * +* On KVM, we are guaranteed that pvti updates for any vCPU are +* atomic as seen by *all* vCPUs. This is an even stronger +* guarantee than we get with a normal seqlock. 
+* +* On Xen, we don't appear to have that guarantee, but Xen still +* supplies a valid seqlock using the version field. + +* We only do pvclock vdso timing at all if +* PVCLOCK_TSC_STABLE_BIT is set, and we interpret that bit to +* mean that all vCPUs have matching pvti and that the TSC is +* synced, so we can just look at vCPU 0's pvti. */ - do { - cpu = __getcpu() & VGETCPU_CPU_MASK; - /* TODO: We can put vcpu id into higher bits of pvti.version. -* This will save a couple of cycles by getting rid of -* __getcpu() calls (Gleb). -*/ - - pvti = get_pvti(cpu); - - version = __pvclock_read_cycles(&pvti->pvti, &ret, &flags); - - /* -* Test we're still on the cpu as well as the version. -* We could have been migrated just after the first -* vgetcpu but before fetching the version, so we -* wouldn't notice a version change. -*/ - cpu1 = __getcpu() & VGETCPU_CPU_MASK; - } while (unlikely(cpu != cpu1 || - (pvti->pvti.version & 1) || - pvti->pvti.version != version)); - - if (unlikely(!(flags & PVCLOCK_TSC_STABLE_BIT))) + + if (unlikely(!(pvti->flags & PVCLOCK_TSC_STABLE_BIT))) { *mode = VCLOCK_NONE; + return 0; + } + + do { + version = pvti->version; + + /* This is also a read barrier, so we'll read version first. */ + tsc = rdtsc_ordered(); + + pvti_tsc_to_system_mul = pvti->tsc_to_system_mul; + pvti_tsc_shift = pvti->tsc_shift; + pvti_system_time = pvti->system_time; + pvti_tsc = pvti->tsc_timestamp; + + /* Make sure that the version double-check is last. */ + smp_rmb(); + } while (unlikely((version & 1) || version != pvti->version)); + + delta = tsc - pvti_tsc; + ret = pvti_system_time + + pvclock_scale_delta(delta, pvti_tsc_to_system_mul, + pvti_tsc_shift); /* refer to tsc.c read_tsc() comment for rationale */ last = gtod->cycle_last; -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/5] x86/vdso: Remove pvclock fixmap machinery
Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vclock_gettime.c | 1 - arch/x86/entry/vdso/vma.c| 1 + arch/x86/include/asm/fixmap.h| 5 - arch/x86/include/asm/pvclock.h | 5 - arch/x86/kernel/kvmclock.c | 6 -- arch/x86/kernel/pvclock.c| 24 6 files changed, 1 insertion(+), 41 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 5dd363d54348..59a98c25bde7 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -45,7 +45,6 @@ extern u8 pvclock_page #include #include -#include #include notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) diff --git a/arch/x86/entry/vdso/vma.c b/arch/x86/entry/vdso/vma.c index aa828191c654..b8f69e264ac4 100644 --- a/arch/x86/entry/vdso/vma.c +++ b/arch/x86/entry/vdso/vma.c @@ -12,6 +12,7 @@ #include #include #include +#include #include #include #include diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h index f80d70009ff8..6d7d0e52ed5a 100644 --- a/arch/x86/include/asm/fixmap.h +++ b/arch/x86/include/asm/fixmap.h @@ -19,7 +19,6 @@ #include #include #include -#include #ifdef CONFIG_X86_32 #include #include @@ -72,10 +71,6 @@ enum fixed_addresses { #ifdef CONFIG_X86_VSYSCALL_EMULATION VSYSCALL_PAGE = (FIXADDR_TOP - VSYSCALL_ADDR) >> PAGE_SHIFT, #endif -#ifdef CONFIG_PARAVIRT_CLOCK - PVCLOCK_FIXMAP_BEGIN, - PVCLOCK_FIXMAP_END = PVCLOCK_FIXMAP_BEGIN+PVCLOCK_VSYSCALL_NR_PAGES-1, -#endif #endif FIX_DBGP_BASE, FIX_EARLYCON_MEM_BASE, diff --git a/arch/x86/include/asm/pvclock.h b/arch/x86/include/asm/pvclock.h index 3864398c7cb2..66df22b2e0c9 100644 --- a/arch/x86/include/asm/pvclock.h +++ b/arch/x86/include/asm/pvclock.h @@ -100,10 +100,5 @@ struct pvclock_vsyscall_time_info { } __attribute__((__aligned__(SMP_CACHE_BYTES))); #define PVTI_SIZE sizeof(struct pvclock_vsyscall_time_info) -#define PVCLOCK_VSYSCALL_NR_PAGES (((NR_CPUS-1)/(PAGE_SIZE/PVTI_SIZE))+1) - -int __init pvclock_init_vsyscall(struct 
pvclock_vsyscall_time_info *i, -int size); -struct pvclock_vcpu_time_info *pvclock_get_vsyscall_time_info(int cpu); #endif /* _ASM_X86_PVCLOCK_H */ diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c index ec1b06dc82d2..72cef58693c7 100644 --- a/arch/x86/kernel/kvmclock.c +++ b/arch/x86/kernel/kvmclock.c @@ -310,7 +310,6 @@ int __init kvm_setup_vsyscall_timeinfo(void) { #ifdef CONFIG_X86_64 int cpu; - int ret; u8 flags; struct pvclock_vcpu_time_info *vcpu_time; unsigned int size; @@ -330,11 +329,6 @@ int __init kvm_setup_vsyscall_timeinfo(void) return 1; } - if ((ret = pvclock_init_vsyscall(hv_clock, size))) { - put_cpu(); - return ret; - } - put_cpu(); kvm_clock.archdata.vclock_mode = VCLOCK_PVCLOCK; diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 2f355d229a58..99bfc025111d 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -140,27 +140,3 @@ void pvclock_read_wallclock(struct pvclock_wall_clock *wall_clock, set_normalized_timespec(ts, now.tv_sec, now.tv_nsec); } - -#ifdef CONFIG_X86_64 -/* - * Initialize the generic pvclock vsyscall state. This will allocate - * a/some page(s) for the per-vcpu pvclock information, set up a - * fixmap mapping for the page(s) - */ - -int __init pvclock_init_vsyscall(struct pvclock_vsyscall_time_info *i, -int size) -{ - int idx; - - WARN_ON (size != PVCLOCK_VSYSCALL_NR_PAGES*PAGE_SIZE); - - for (idx = 0; idx <= (PVCLOCK_FIXMAP_END-PVCLOCK_FIXMAP_BEGIN); idx++) { - __set_fixmap(PVCLOCK_FIXMAP_BEGIN + idx, -__pa(i) + (idx*PAGE_SIZE), -PAGE_KERNEL_VVAR); - } - - return 0; -} -#endif -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/5] x86/vdso: Enable vdso pvclock access on all vdso variants
Now that pvclock doesn't require access to the fixmap, all vdso variants can use it. The kernel side isn't wired up for 32-bit kernels yet, but this covers 32-bit and x32 userspace on 64-bit kernels. Signed-off-by: Andy Lutomirski --- arch/x86/entry/vdso/vclock_gettime.c | 91 1 file changed, 40 insertions(+), 51 deletions(-) diff --git a/arch/x86/entry/vdso/vclock_gettime.c b/arch/x86/entry/vdso/vclock_gettime.c index 59a98c25bde7..8602f06c759f 100644 --- a/arch/x86/entry/vdso/vclock_gettime.c +++ b/arch/x86/entry/vdso/vclock_gettime.c @@ -17,8 +17,10 @@ #include #include #include +#include #include #include +#include #define gtod (&VVAR(vsyscall_gtod_data)) @@ -43,10 +45,6 @@ extern u8 pvclock_page #ifndef BUILD_VDSO32 -#include -#include -#include - notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) { long ret; @@ -64,8 +62,42 @@ notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) return ret; } -#ifdef CONFIG_PARAVIRT_CLOCK +#else + +notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) +{ + long ret; + + asm( + "mov %%ebx, %%edx \n" + "mov %2, %%ebx \n" + "call __kernel_vsyscall \n" + "mov %%edx, %%ebx \n" + : "=a" (ret) + : "0" (__NR_clock_gettime), "g" (clock), "c" (ts) + : "memory", "edx"); + return ret; +} + +notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) +{ + long ret; + + asm( + "mov %%ebx, %%edx \n" + "mov %2, %%ebx \n" + "call __kernel_vsyscall \n" + "mov %%edx, %%ebx \n" + : "=a" (ret) + : "0" (__NR_gettimeofday), "g" (tv), "c" (tz) + : "memory", "edx"); + return ret; +} + +#endif + +#ifdef CONFIG_PARAVIRT_CLOCK static notrace const struct pvclock_vsyscall_time_info *get_pvti0(void) { return (const struct pvclock_vsyscall_time_info *)&pvclock_page; @@ -109,9 +141,9 @@ static notrace cycle_t vread_pvclock(int *mode) do { version = pvti->version; - /* This is also a read barrier, so we'll read version first. 
*/ - tsc = rdtsc_ordered(); + smp_rmb(); + tsc = rdtsc_ordered(); pvti_tsc_to_system_mul = pvti->tsc_to_system_mul; pvti_tsc_shift = pvti->tsc_shift; pvti_system_time = pvti->system_time; @@ -126,7 +158,7 @@ static notrace cycle_t vread_pvclock(int *mode) pvclock_scale_delta(delta, pvti_tsc_to_system_mul, pvti_tsc_shift); - /* refer to tsc.c read_tsc() comment for rationale */ + /* refer to vread_tsc() comment for rationale */ last = gtod->cycle_last; if (likely(ret >= last)) @@ -136,49 +168,6 @@ static notrace cycle_t vread_pvclock(int *mode) } #endif -#else - -notrace static long vdso_fallback_gettime(long clock, struct timespec *ts) -{ - long ret; - - asm( - "mov %%ebx, %%edx \n" - "mov %2, %%ebx \n" - "call __kernel_vsyscall \n" - "mov %%edx, %%ebx \n" - : "=a" (ret) - : "0" (__NR_clock_gettime), "g" (clock), "c" (ts) - : "memory", "edx"); - return ret; -} - -notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz) -{ - long ret; - - asm( - "mov %%ebx, %%edx \n" - "mov %2, %%ebx \n" - "call __kernel_vsyscall \n" - "mov %%edx, %%ebx \n" - : "=a" (ret) - : "0" (__NR_gettimeofday), "g" (tv), "c" (tz) - : "memory", "edx"); - return ret; -} - -#ifdef CONFIG_PARAVIRT_CLOCK - -static notrace cycle_t vread_pvclock(int *mode) -{ - *mode = VCLOCK_NONE; - return 0; -} -#endif - -#endif - notrace static cycle_t vread_tsc(void) { cycle_t ret = (cycle_t)rdtsc_ordered(); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvmclock doesn't work, help?
On Wed, Dec 9, 2015 at 2:27 PM, Andy Lutomirski wrote: > On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini wrote: >> >> >> On 09/12/2015 22:49, Andy Lutomirski wrote: >>> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote: >>>> >>>> >>>> On 09/12/2015 22:10, Andy Lutomirski wrote: >>>>> Can we please stop making kvmclock more complex? It's a beast right >>>>> now, and not in a good way. It's far too tangled with the vclock >>>>> machinery on both the host and guest sides, the pvclock stuff is not >>>>> well thought out (even in principle in an ABI sense), and it's never >>>>> been clear to my what problem exactly the kvmclock stuff is supposed >>>>> to solve. >>>> >>>> It's supposed to solve the problem that: >>>> >>>> - not all hosts have a working TSC >>> >>> Fine, but we don't need any vdso integration for that. >> >> Well, you still want a fast time source. That was a given. :) > > If the host can't figure out how to give *itself* a fast time source, > I'd be surprised if KVM can manage to give the guest a fast, reliable > time source. > >> >>>> - even if they all do, virtual machines can be migrated (or >>>> saved/restored) to a host with a different TSC frequency >>>> >>>> - any MMIO- or PIO-based mechanism to access the current time is orders >>>> of magnitude slower than the TSC and less precise too. >>> >>> Yup. But TSC by itself gets that benefit, too. >> >> Yes, the problem is if you want to solve all three of them. The first >> two are solved by the ACPI PM timer with a decent resolution (70 >> ns---much faster anyway than an I/O port access). The third is solved >> by TSC. To solve all three, you need kvmclock. > > Still confused. Is kvmclock really used in cases where even the host > can't pull of working TSC? > >> >>>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and >>>>> start over. A correctly functioning KVM guest using TSC (i.e. 
>>>>> ignoring kvmclock entirely) seems to work rather more reliably and >>>>> considerably faster than a kvmclock guest. >>>> >>>> If all your hosts have a working TSC and you don't do migration or >>>> save/restore, that's a valid configuration. It's not a good default, >>>> however. >>> >>> Er? >>> >>> kvmclock is still really quite slow and buggy. >> >> Unless it takes 3-4000 clock cycles for a gettimeofday, which it >> shouldn't even with vdso disabled, it's definitely not slower than PIO. >> >>> And the patch I identified is definitely a problem here: >>> >>> [ 136.131241] KVM: disabling fast timing permanently due to inability >>> to recover from suspend >>> >>> I got that on the host with this whitespace-damaged patch: >>> >>> if (backwards_tsc) { >>> u64 delta_cyc = max_tsc - local_tsc; >>> + if (!backwards_tsc_observed) >>> + pr_warn("KVM: disabling fast timing >>> permanently due to inability to recover from suspend\n"); >>> >>> when I suspended and resumed. >>> >>> Can anyone explain what problem >>> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On >>> brief inspection, it just seems to be incorrect. Shouldn't KVM's >>> normal TSC logic handle that case right? After all, all vcpus should >>> be paused when we resume from suspend. At worst, we should just need >>> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually, >>> shouldn't we do that regardless of which way the TSC jumped on >>> suspend/resume? After all, the jTSC-to-wall-clock offset is quite >>> likely to change except on the very small handful of CPUs (if any) >>> that keep the TSC running in S3 and hibernate. >> >> I don't recall the details of that patch, so Marcelo will have to answer >> this, or Alex too since he chimed in the original thread. 
At least it >> should be made conditional on the existence of a VM at suspend time (and >> the master clock stuff should be made per VM, as I suggested at >> https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html). >> >> It would indeed be great if the master clock could be dropped. But I'm >>
Re: kvmclock doesn't work, help?
On Wed, Dec 9, 2015 at 2:12 PM, Paolo Bonzini wrote: > > > On 09/12/2015 22:49, Andy Lutomirski wrote: >> On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote: >>> >>> >>> On 09/12/2015 22:10, Andy Lutomirski wrote: >>>> Can we please stop making kvmclock more complex? It's a beast right >>>> now, and not in a good way. It's far too tangled with the vclock >>>> machinery on both the host and guest sides, the pvclock stuff is not >>>> well thought out (even in principle in an ABI sense), and it's never >>>> been clear to my what problem exactly the kvmclock stuff is supposed >>>> to solve. >>> >>> It's supposed to solve the problem that: >>> >>> - not all hosts have a working TSC >> >> Fine, but we don't need any vdso integration for that. > > Well, you still want a fast time source. That was a given. :) If the host can't figure out how to give *itself* a fast time source, I'd be surprised if KVM can manage to give the guest a fast, reliable time source. > >>> - even if they all do, virtual machines can be migrated (or >>> saved/restored) to a host with a different TSC frequency >>> >>> - any MMIO- or PIO-based mechanism to access the current time is orders >>> of magnitude slower than the TSC and less precise too. >> >> Yup. But TSC by itself gets that benefit, too. > > Yes, the problem is if you want to solve all three of them. The first > two are solved by the ACPI PM timer with a decent resolution (70 > ns---much faster anyway than an I/O port access). The third is solved > by TSC. To solve all three, you need kvmclock. Still confused. Is kvmclock really used in cases where even the host can't pull of working TSC? > >>>> I'm somewhat tempted to suggest that we delete kvmclock entirely and >>>> start over. A correctly functioning KVM guest using TSC (i.e. >>>> ignoring kvmclock entirely) seems to work rather more reliably and >>>> considerably faster than a kvmclock guest. 
>>> >>> If all your hosts have a working TSC and you don't do migration or >>> save/restore, that's a valid configuration. It's not a good default, >>> however. >> >> Er? >> >> kvmclock is still really quite slow and buggy. > > Unless it takes 3-4000 clock cycles for a gettimeofday, which it > shouldn't even with vdso disabled, it's definitely not slower than PIO. > >> And the patch I identified is definitely a problem here: >> >> [ 136.131241] KVM: disabling fast timing permanently due to inability >> to recover from suspend >> >> I got that on the host with this whitespace-damaged patch: >> >> if (backwards_tsc) { >> u64 delta_cyc = max_tsc - local_tsc; >> + if (!backwards_tsc_observed) >> + pr_warn("KVM: disabling fast timing >> permanently due to inability to recover from suspend\n"); >> >> when I suspended and resumed. >> >> Can anyone explain what problem >> 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On >> brief inspection, it just seems to be incorrect. Shouldn't KVM's >> normal TSC logic handle that case right? After all, all vcpus should >> be paused when we resume from suspend. At worst, we should just need >> kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually, >> shouldn't we do that regardless of which way the TSC jumped on >> suspend/resume? After all, the jTSC-to-wall-clock offset is quite >> likely to change except on the very small handful of CPUs (if any) >> that keep the TSC running in S3 and hibernate. > > I don't recall the details of that patch, so Marcelo will have to answer > this, or Alex too since he chimed in the original thread. At least it > should be made conditional on the existence of a VM at suspend time (and > the master clock stuff should be made per VM, as I suggested at > https://www.mail-archive.com/kvm@vger.kernel.org/msg102316.html). > > It would indeed be great if the master clock could be dropped. But I'm > definitely missing some of the subtle details. :( Me, too. 
Anyway, see the attached untested patch. Marcelo?

--Andy

From e4a5e834d3fb6fc2499966e1af42cb5bd59f4410 Mon Sep 17 00:00:00 2001
Message-Id: 
From: Andy Lutomirski
Date: Wed, 9 Dec 2015 14:21:05 -0800
Subject: [PATCH] x86/kvm: On KVM re-enable (e.g. after suspend), update clocks

This gets rid of the "did TSC go backwards" logic and just updates all clock
Re: kvmclock doesn't work, help?
On Wed, Dec 9, 2015 at 1:16 PM, Paolo Bonzini wrote: > > > On 09/12/2015 22:10, Andy Lutomirski wrote: >> Can we please stop making kvmclock more complex? It's a beast right >> now, and not in a good way. It's far too tangled with the vclock >> machinery on both the host and guest sides, the pvclock stuff is not >> well thought out (even in principle in an ABI sense), and it's never >> been clear to my what problem exactly the kvmclock stuff is supposed >> to solve. > > It's supposed to solve the problem that: > > - not all hosts have a working TSC Fine, but we don't need any vdso integration for that. > > - even if they all do, virtual machines can be migrated (or > saved/restored) to a host with a different TSC frequency OK, I buy that. So we want to export a linear function that the guest applies to the TSC so the guest can apply it. I suppose we also want ntp frequency corrections on the host to propagate to the guest. > > - any MMIO- or PIO-based mechanism to access the current time is orders > of magnitude slower than the TSC and less precise too. Yup. But TSC by itself gets that benefit, too. > >> I'm somewhat tempted to suggest that we delete kvmclock entirely and >> start over. A correctly functioning KVM guest using TSC (i.e. >> ignoring kvmclock entirely) seems to work rather more reliably and >> considerably faster than a kvmclock guest. > > If all your hosts have a working TSC and you don't do migration or > save/restore, that's a valid configuration. It's not a good default, > however. Er? kvmclock is still really quite slow and buggy. 
And the patch I identified is definitely a problem here:

[ 136.131241] KVM: disabling fast timing permanently due to inability to recover from suspend

I got that on the host with this whitespace-damaged patch:

        if (backwards_tsc) {
                u64 delta_cyc = max_tsc - local_tsc;
+               if (!backwards_tsc_observed)
+                       pr_warn("KVM: disabling fast timing permanently due to inability to recover from suspend\n");

when I suspended and resumed.

Can anyone explain what problem 16a9602158861687c78b6de6dc6a79e6e8a9136f is supposed to solve? On brief inspection, it just seems to be incorrect. Shouldn't KVM's normal TSC logic handle that case right? After all, all vcpus should be paused when we resume from suspend. At worst, we should just need kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu) on all vcpus. (Actually, shouldn't we do that regardless of which way the TSC jumped on suspend/resume? After all, the TSC-to-wall-clock offset is quite likely to change except on the very small handful of CPUs (if any) that keep the TSC running in S3 and hibernate.)

--Andy
kvmclock doesn't work, help?
I'm trying to clean up kvmclock and I can't get it to work at all. My host is 4.4.0-rc3-ish on a Skylake laptop that has a working TSC.

If I boot an SMP (2 vcpus) guest, tracing says:

qemu-system-x86-2517 [001] 102242.610654: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 0
qemu-system-x86-2521 [000] 102242.613742: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 0 nr_online 1 hostclock tsc
qemu-system-x86-2522 [000] 102242.622959: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
qemu-system-x86-2521 [000] 102242.645123: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
qemu-system-x86-2522 [000] 102242.647291: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
qemu-system-x86-2521 [000] 102242.653369: kvm_track_tsc: vcpu_id 0 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
qemu-system-x86-2522 [000] 102242.653429: kvm_track_tsc: vcpu_id 1 masterclock 0 offsetmatched 1 nr_online 2 hostclock tsc
qemu-system-x86-2517 [001] 102242.653447: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 1
qemu-system-x86-2521 [000] 102242.653657: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 1
qemu-system-x86-2522 [002] 102242.664448: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 1

If I boot a UP guest, tracing says:

qemu-system-x86-2567 [001] 102370.447484: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 1
qemu-system-x86-2571 [002] 102370.447688: kvm_update_master_clock: masterclock 0 hostclock tsc offsetmatched 1

I suspect, but I haven't verified, that this is fallout from:

commit 16a9602158861687c78b6de6dc6a79e6e8a9136f
Author: Marcelo Tosatti
Date:   Wed May 14 12:43:24 2014 -0300

    KVM: x86: disable master clock if TSC is reset during suspend

    Updating system_time from the kernel clock once master clock
    has been enabled can result in time backwards event, in case
    kernel clock
frequency is lower than TSC frequency. Disable master clock in case it is necessary to update it from the resume path. Signed-off-by: Marcelo Tosatti Signed-off-by: Paolo Bonzini Can we please stop making kvmclock more complex? It's a beast right now, and not in a good way. It's far too tangled with the vclock machinery on both the host and guest sides, the pvclock stuff is not well thought out (even in principle in an ABI sense), and it's never been clear to my what problem exactly the kvmclock stuff is supposed to solve. I'm somewhat tempted to suggest that we delete kvmclock entirely and start over. A correctly functioning KVM guest using TSC (i.e. ignoring kvmclock entirely) seems to work rather more reliably and considerably faster than a kvmclock guest. --Andy -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/3] virtio DMA API core stuff
On Nov 19, 2015 5:45 AM, "Michael S. Tsirkin" wrote:
>
> On Tue, Oct 27, 2015 at 11:38:57PM -0700, Andy Lutomirski wrote:
> > This switches virtio to use the DMA API unconditionally. I'm sure
> > it breaks things, but it seems to work on x86 using virtio-pci, with
> > and without Xen, and using both the modern 1.0 variant and the
> > legacy variant.
>
> So thinking hard about it, I don't see any real drawbacks to making this
> conditional on a new feature bit, that Xen can then set.

Can you elaborate? If I run QEMU, hosting Xen, hosting Linux, and the virtio device is provided by QEMU, then how does Xen set the bit? Similarly, how would Xen set the bit for a real physical device?

--Andy
Re: [PATCH] kvm/vmx: EPTP switching test
On Mon, Nov 16, 2015 at 10:18 AM, Radim Krčmář wrote:
> 2015-11-16 19:59+0200, Michael S. Tsirkin:
>> On Mon, Nov 16, 2015 at 06:51:06PM +0100, Radim Krčmář wrote:
>>> 2015-11-15 18:00+0200, Michael S. Tsirkin:
>>> (And I think that eptp switching is expected to be used in conjunction
>>> with #VE, so it'd then make sense to implement a nop for it as well.)
>>
>> No idea how would I even test it, so I'm not interested in #VE at this
>> point. If you are - go ahead and post a patch for that on top though,
>> why not.
>
> I thought that it's going to be simpler to provide functionality (that
> utilizes eptp switching) to the guest through #VE, which probably isn't
> true as I think more about it. (Not interested in implementing it :])

I'm willing to review a #VE stub. I'm not likely to implement it.

--Andy
Re: [PATCH v3 0/3] virtio DMA API core stuff
On Wed, Nov 11, 2015 at 2:05 AM, Michael S. Tsirkin wrote:
> On Tue, Nov 10, 2015 at 10:54:21AM -0800, Andy Lutomirski wrote:
>> On Nov 10, 2015 7:02 AM, "Michael S. Tsirkin" wrote:
>> >
>> > On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
>> > > On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
>> > > > I have no problem with that. For example, can we teach
>> > > > the DMA API on intel x86 to use PT for virtio by default?
>> > > > That would allow merging Andy's patches with
>> > > > full compatibility with old guests and hosts.
>> > >
>> > > Well, the only incompatibility comes from an experimental qemu feature,
>> > > more explicitly from a bug in that feature's implementation. So why
>> > > should we work around that in the kernel? I think it is not too hard to
>> > > fix qemu to generate a correct DMAR table which excludes the virtio
>> > > devices from iommu translation.
>> > >
>> > > Joerg
>> >
>> > It's not that easy - you'd have to dedicate some buses
>> > for iommu bypass, and teach management tools to only put
>> > virtio there - but it's possible.
>> >
>> > This will absolutely address guests that don't need to set up IOMMU for
>> > virtio devices, and virtio that bypasses the IOMMU.
>> >
>> > But the problem is that we do want to *allow* guests
>> > to set up IOMMU for virtio devices.
>> > In that case, these are two other usecases:
>> >
>> > A- monolithic virtio within QEMU:
>> >    iommu only needed for VFIO ->
>> >    guest should always use iommu=pt
>> >    iommu=on works but is just useless overhead.
>> >
>> > B- modular out of process virtio outside QEMU:
>> >    iommu needed for VFIO or kernel driver ->
>> >    guest should use iommu=pt or iommu=on
>> >    depending on security/performance requirements
>> >
>> > Note that there could easily be a mix of these in the same system.
>> >
>> > So for these cases we do need QEMU to specify to guest that IOMMU covers
>> > the virtio devices. Also, once one does this, the default on linux is
>> > iommu=on and not pt, which works but ATM is very slow.
>> >
>> > This poses three problems:
>> >
>> > 1. How do we address the different needs of A and B?
>> >    One way would be for virtio to pass the information to guest
>> >    using some virtio specific way, and have drivers
>> >    specify what kind of DMA access they want.
>> >
>> > 2. (Kind of a subset of 1) once we do allow IOMMU, how do we make sure
>> >    most guests use the more sensible iommu=pt.
>> >
>> > 3. Once we do allow IOMMU, how can we keep existing guests work in this
>> >    configuration? Creating different hypervisor configurations depending
>> >    on guest is very nasty. Again, one way would be some virtio specific
>> >    interface.
>> >
>> > I'd rather we figured the answers to this before merging Andy's patches
>> > because I'm concerned that instead of 1 broken configuration
>> > (virtio always bypasses IOMMU) we'll get two bad configurations
>> > (in the second one, virtio uses the slow default with no
>> > gain in security).
>> >
>> > Suggestions welcome.
>>
>> I think there's still no downside of using my patches, even on x86.
>>
>> Old kernels on new QEMU work unless IOMMU is enabled on the host. I
>> think that's the best we can possibly do.
>> New kernels work at full speed on old QEMU.
>
> Only if IOMMU is disabled, right?
>
>> New kernels with new QEMU and iommu enabled work slower. Even newer
>> kernels with default passthrough work at full speed, and there's no
>> obvious downside to the existence of kernels with just my patches.
>>
>> --Andy
>
> I tried to explain the possible downside. Let me try again. Imagine
> that guest kernel notifies hypervisor that it wants IOMMU to actually
> work. This will make old kernel on new QEMU work even with IOMMU
> enabled on host - better than "the best we can do" that you described
> above. Specifically, QEMU will assume that if it didn't get
> notification, it's an old kernel so it should ignore the IOMMU.

Can you flesh out this trick? On x86 IIUC the IOMMU more-or-le
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Nov 10, 2015 4:44 PM, "Benjamin Herrenschmidt" wrote:
>
> On Tue, 2015-11-10 at 15:44 -0800, Andy Lutomirski wrote:
> >
> > > What about partition <-> partition virtio such as what we could do on
> > > PAPR systems. That would have the weak barrier bit.
> >
> > Is it partition <-> partition, bypassing IOMMU?
>
> No.
>
> > I think I'd settle for just something that doesn't regress
> > non-experimental setups that actually work today and that allow new
> > setups (x86 with fixed QEMU and maybe something more complicated on
> > powerpc and/or sparc) to work in all cases.
> >
> > We could certainly just make powerpc and sparc continue bypassing the
> > IOMMU until someone comes up with a way to fix it. I'll send out some
> > patches that do that, and maybe that'll help this make progress.
>
> But we haven't found a solution that works. All we have come up with is
> a quirk that will force bypass on virtio always and will not allow us
> to operate non-bypassing devices on either of those architectures in
> the future.
>
> I'm not too happy about this.

Me neither. At least it wouldn't be a regression, but it's still crappy.

I think that arm is fine, at least. I was unable to find an arm QEMU config that has any problems with my patches.

--Andy
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Tue, Nov 10, 2015 at 2:27 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2015-11-10 at 10:54 -0800, Andy Lutomirski wrote:
>>
>> Does that work on powerpc on existing kernels?
>>
>> Anyway, here's another crazy idea: make the quirk assume that the
>> IOMMU is bypassed if and only if the weak barriers bit is set on
>> systems that are missing the new DT binding.
>
> "New DT bindings" doesn't mean much ... how do we change DT bindings on
> existing machines with a FW in flash?
>
> What about partition <-> partition virtio such as what we could do on
> PAPR systems. That would have the weak barrier bit.

Is it partition <-> partition, bypassing IOMMU?

I think I'd settle for just something that doesn't regress non-experimental setups that actually work today and that allows new setups (x86 with fixed QEMU and maybe something more complicated on powerpc and/or sparc) to work in all cases.

We could certainly just make powerpc and sparc continue bypassing the IOMMU until someone comes up with a way to fix it. I'll send out some patches that do that, and maybe that'll help this make progress.

--Andy
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Nov 10, 2015 2:38 AM, "Benjamin Herrenschmidt" wrote:
>
> On Mon, 2015-11-09 at 21:35 -0800, Andy Lutomirski wrote:
> >
> > We could do it the other way around: on powerpc, if a PCI device is in
> > that range and doesn't have the "bypass" property at all, then it's
> > assumed to bypass the IOMMU. This means that everything that
> > currently works continues working. If someone builds a physical
> > virtio device or uses another system in PCIe target mode speaking
> > virtio, then it won't work until they upgrade their firmware to set
> > bypass=0. Meanwhile everyone using hypothetical new QEMU also gets
> > bypass=0 and no ambiguity.
> >
> > vfio will presumably notice the bypass and correctly refuse to map any
> > current virtio devices.
> >
> > Would that work?
>
> That would be extremely strange from a platform perspective. Any device
> in that vendor/device range would bypass the iommu unless some new
> property "actually-works-like-a-real-pci-device" happens to exist in
> the device-tree, which we would then need to define somewhere and
> handle across at least 3 different platforms that get their device-tree
> from wildly different places.
>
> Also if tomorrow I create a PCI device that implements virtio-net and
> put it in a machine running IBM proprietary firmware (or Apple's or
> Sun's), it won't have that property...
>
> This is not hypothetical. People are using virtio to do point-to-point
> communication between machines via PCIe today.

Does that work on powerpc on existing kernels?

Anyway, here's another crazy idea: make the quirk assume that the IOMMU is bypassed if and only if the weak barriers bit is set on systems that are missing the new DT binding.

--Andy

>
> Cheers,
> Ben.
Re: [PATCH v3 0/3] virtio DMA API core stuff
On Nov 10, 2015 7:02 AM, "Michael S. Tsirkin" wrote:
>
> On Sun, Nov 08, 2015 at 12:49:46PM +0100, Joerg Roedel wrote:
> > On Sun, Nov 08, 2015 at 12:37:47PM +0200, Michael S. Tsirkin wrote:
> > > I have no problem with that. For example, can we teach
> > > the DMA API on intel x86 to use PT for virtio by default?
> > > That would allow merging Andy's patches with
> > > full compatibility with old guests and hosts.
> >
> > Well, the only incompatibility comes from an experimental qemu feature,
> > more explicitly from a bug in that feature's implementation. So why
> > should we work around that in the kernel? I think it is not too hard to
> > fix qemu to generate a correct DMAR table which excludes the virtio
> > devices from iommu translation.
> >
> > Joerg
>
> It's not that easy - you'd have to dedicate some buses
> for iommu bypass, and teach management tools to only put
> virtio there - but it's possible.
>
> This will absolutely address guests that don't need to set up IOMMU for
> virtio devices, and virtio that bypasses the IOMMU.
>
> But the problem is that we do want to *allow* guests
> to set up IOMMU for virtio devices.
> In that case, these are two other usecases:
>
> A- monolithic virtio within QEMU:
>    iommu only needed for VFIO ->
>    guest should always use iommu=pt
>    iommu=on works but is just useless overhead.
>
> B- modular out of process virtio outside QEMU:
>    iommu needed for VFIO or kernel driver ->
>    guest should use iommu=pt or iommu=on
>    depending on security/performance requirements
>
> Note that there could easily be a mix of these in the same system.
>
> So for these cases we do need QEMU to specify to guest that IOMMU covers
> the virtio devices. Also, once one does this, the default on linux is
> iommu=on and not pt, which works but ATM is very slow.
>
> This poses three problems:
>
> 1. How do we address the different needs of A and B?
>    One way would be for virtio to pass the information to guest
>    using some virtio specific way, and have drivers
>    specify what kind of DMA access they want.
>
> 2. (Kind of a subset of 1) once we do allow IOMMU, how do we make sure
>    most guests use the more sensible iommu=pt.
>
> 3. Once we do allow IOMMU, how can we keep existing guests work in this
>    configuration? Creating different hypervisor configurations depending
>    on guest is very nasty. Again, one way would be some virtio specific
>    interface.
>
> I'd rather we figured the answers to this before merging Andy's patches
> because I'm concerned that instead of 1 broken configuration
> (virtio always bypasses IOMMU) we'll get two bad configurations
> (in the second one, virtio uses the slow default with no
> gain in security).
>
> Suggestions welcome.

I think there's still no downside of using my patches, even on x86.

Old kernels on new QEMU work unless IOMMU is enabled on the host. I think that's the best we can possibly do.

New kernels work at full speed on old QEMU.

New kernels with new QEMU and iommu enabled work slower. Even newer kernels with default passthrough work at full speed, and there's no obvious downside to the existence of kernels with just my patches.

--Andy

>
> --
> MST
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, Nov 9, 2015 at 9:28 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
>>
>> /* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
>> static const struct pci_device_id virtio_pci_id_table[] = {
>>         { PCI_DEVICE(0x1af4, PCI_ANY_ID) },
>>         { 0 }
>> };
>>
>> Can we match on that range?
>
> We can, but the problem remains, how do we differentiate an existing
> device that does bypass vs. a newer one that needs the IOMMU and thus
> doesn't have the new "bypass" property in the device-tree.

We could do it the other way around: on powerpc, if a PCI device is in that range and doesn't have the "bypass" property at all, then it's assumed to bypass the IOMMU. This means that everything that currently works continues working. If someone builds a physical virtio device or uses another system in PCIe target mode speaking virtio, then it won't work until they upgrade their firmware to set bypass=0. Meanwhile everyone using hypothetical new QEMU also gets bypass=0 and no ambiguity.

vfio will presumably notice the bypass and correctly refuse to map any current virtio devices.

Would that work?

--Andy
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, Nov 9, 2015 at 9:26 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-09 at 18:18 -0800, Andy Lutomirski wrote:
>>
>> Which leaves the special case of Xen, where even preexisting devices
>> don't bypass the IOMMU. Can we keep this specific to powerpc and
>> sparc? On x86, this problem is basically nonexistent, since the IOMMU
>> is properly self-describing.
>>
>> IOW, I think that on x86 we should assume that all virtio devices
>> honor the IOMMU.
>
> You don't like performances? :-)

This should have basically no effect. Every non-experimental x86 virtio setup in existence either doesn't work at all (Xen) or has DMA ops that are no-ops.

--Andy
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, Nov 9, 2015 at 6:04 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2015-11-09 at 16:46 -0800, Andy Lutomirski wrote:
>> The problem here is that in some of the problematic cases the virtio
>> driver may not even be loaded. If someone runs an L1 guest with an
>> IOMMU-bypassing virtio device and assigns it to L2 using vfio, then
>> *boom* L1 crashes. (Same if, say, DPDK gets used, I think.)
>>
>> > The only way out of this while keeping the "platform" stuff would be to
>> > also bump some kind of version in the virtio config (or PCI header). I
>> > have no other way to differentiate between "this is an old qemu that
>> > doesn't do the 'bypass property' yet" from "this is a virtio device
>> > that doesn't bypass".
>> >
>> > Any better idea?
>>
>> I'd suggest that, in the absence of the new DT binding, we assume that
>> any PCI device with the virtio vendor ID is passthrough on powerpc. I
>> can do this in the virtio driver, but if it's in the platform code
>> then vfio gets it right too (i.e. fails to load).
>
> The problem is there isn't *a* virtio vendor ID. It's the RedHat vendor
> ID which will be used by more than just virtio, so we need to
> specifically list the devices.

Really?

/* Qumranet donated their vendor ID for devices 0x1000 thru 0x10FF. */
static const struct pci_device_id virtio_pci_id_table[] = {
        { PCI_DEVICE(0x1af4, PCI_ANY_ID) },
        { 0 }
};

Can we match on that range?

> Additionally, that still means that once we have a virtio device that
> actually uses the iommu, powerpc will not work since the "workaround"
> above will kick in.

I don't know how to solve that problem, though, especially since the vendor of such a device (especially if it's real hardware) might not set any new bit.

> The "in absence of the new DT binding" doesn't make that much sense.
>
> Those platforms use device-trees defined since the dawn of ages by
> actual open firmware implementations, they either have no iommu
> representation in there (Macs, the platform code hooks it all up) or
> have various properties related to the iommu but no concept of "bypass"
> in there.
>
> We can *add* a new property under some circumstances that indicates a
> bypass on a per-device basis, however that doesn't completely solve it:
>
> - As I said above, what does the absence of that property mean? An
> old qemu that does bypass on all virtio or a new qemu trying to tell
> you that the virtio device actually does use the iommu (or some other
> environment that isn't qemu)?
>
> - On things like macs, the device-tree is generated by openbios, it
> would have to have some added logic to try to figure that out, which
> means it needs to know *via different means* that some or all virtio
> devices bypass the iommu.
>
> I thus go back to my original statement, it's a LOT easier to handle if
> the device itself is self describing, indicating whether it is set to
> bypass a host iommu or not. For L1->L2, well, that wouldn't be the
> first time qemu/VFIO plays tricks with the passed through device
> configuration space...

Which leaves the special case of Xen, where even preexisting devices don't bypass the IOMMU. Can we keep this specific to powerpc and sparc? On x86, this problem is basically nonexistent, since the IOMMU is properly self-describing.

IOW, I think that on x86 we should assume that all virtio devices honor the IOMMU.

> Note that the above can be solved via some kind of compromise: The
> device self describes the ability to honor the iommu, along with the
> property (or ACPI table entry) that indicates whether or not it does.
>
> IE. We could use the revision or ProgIf field of the config space for
> example. Or something in virtio config. If it's an "old" device, we
> know it always bypasses. If it's a new device, we know it only bypasses
> if the corresponding property is in. I still would have to sort out the
> openbios case for mac among others but it's at least a workable
> direction.
>
> BTW. Don't you have a similar problem on x86 that today qemu claims
> that everything honors the iommu in ACPI?

Only on a single experimental configuration, and that can apparently just be fixed going forward without any real problems being caused.

--Andy
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Mon, Nov 9, 2015 at 2:58 PM, Benjamin Herrenschmidt wrote:
> So ...
>
> I've finally tried to sort that out for powerpc and I can't find a way
> to make that work that isn't a complete pile of stinking shit.
>
> I'm very tempted to go back to my original idea: virtio itself should
> indicate its "bypassing ability" via the virtio config space or some
> other bit (like the ProgIf of the PCI header etc...).
>
> I don't see how I can make it work otherwise.
>
> The problem with the statement "it's a platform matter" is that:
>
> - It's not entirely correct. It's actually a feature of a specific
> virtio implementation (qemu's) that it bypasses all the platform IOMMU
> facilities.
>
> - The platforms (on powerpc there's at least 3 or 4 that have qemu
> emulation and support some form of PCI) don't have an existing way to
> convey the information that a device bypasses the IOMMU (if any).
>
> - Even if we add such a mechanism (new property in the device-tree),
> we end up with a big turd: Because we need to be compatible with older
> qemus, we essentially need a quirk that will make all virtio devices
> assume that property is present. That will of course break whenever we
> try to use another implementation of virtio on powerpc which doesn't
> bypass the iommu.
>
> We don't have a way to differentiate a virtio device that does the
> bypass from one that doesn't, and the backward compatibility requirement
> forces us to treat all existing virtio devices as doing bypass.

The problem here is that in some of the problematic cases the virtio driver may not even be loaded. If someone runs an L1 guest with an IOMMU-bypassing virtio device and assigns it to L2 using vfio, then *boom* L1 crashes. (Same if, say, DPDK gets used, I think.)

> The only way out of this while keeping the "platform" stuff would be to
> also bump some kind of version in the virtio config (or PCI header). I
> have no other way to differentiate between "this is an old qemu that
> doesn't do the 'bypass property' yet" from "this is a virtio device
> that doesn't bypass".
>
> Any better idea?

I'd suggest that, in the absence of the new DT binding, we assume that any PCI device with the virtio vendor ID is passthrough on powerpc. I can do this in the virtio driver, but if it's in the platform code then vfio gets it right too (i.e. fails to load).

--Andy
Re: [GIT PULL v4 0/3] dma and virtio prep patches
On Thu, Nov 5, 2015 at 12:08 PM, Christian Borntraeger wrote:
> Andy,
>
> to make it obvious which version is the latest, here is a branch.
>
> The following changes since commit 6a13feb9c82803e2b815eca72fa7a9f5561d7861:
>
>   Linux 4.3 (2015-11-01 16:05:25 -0800)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/borntraeger/linux.git dma
>
> for you to fetch changes up to fc7f9754db6ce0c12281da4055281f731d36bdee:
>
>   s390/dma: Allow per device dma ops (2015-11-05 21:02:40 +0100)

Pulled, thanks.

--Andy
Re: [PATCH 3/3] s390/dma: Allow per device dma ops
On 11/05/2015 01:33 AM, Christian Borntraeger wrote:
> On 03.11.2015 at 13:26, Cornelia Huck wrote:
>> On Tue, 3 Nov 2015 12:54:39 +0100, Christian Borntraeger wrote:
>>
>>> As virtio-ccw now has dma ops, we can no longer default to the PCI
>>> ones. Make use of dev_archdata to keep the dma_ops per device. The
>>> pci devices now use that to override the default, and the default is
>>> changed to use the noop ops for everything that is not PCI. To
>>> compile without PCI support we also have to enable the DMA api with
>>> virtio.
>>
>> Not only with virtio, but generally, right?
>
> Yes, will update the patch description.
>
>>> Signed-off-by: Christian Borntraeger
>>> Reviewed-by: Joerg Roedel
>>> Acked-by: Sebastian Ott
>>> ---
>>>  arch/s390/Kconfig                   | 3 ++-
>>>  arch/s390/include/asm/device.h      | 6 +-
>>>  arch/s390/include/asm/dma-mapping.h | 6 --
>>>  arch/s390/pci/pci.c                 | 1 +
>>>  arch/s390/pci/pci_dma.c             | 4 ++--
>>>  5 files changed, 14 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
>>> index 1d57000..04f0e02 100644
>>> --- a/arch/s390/Kconfig
>>> +++ b/arch/s390/Kconfig
>>> @@ -113,6 +113,7 @@ config S390
>>>  	select GENERIC_FIND_FIRST_BIT
>>>  	select GENERIC_SMP_IDLE_THREAD
>>>  	select GENERIC_TIME_VSYSCALL
>>> +	select HAS_DMA
>>>  	select HAVE_ALIGNED_STRUCT_PAGE if SLUB
>>>  	select HAVE_ARCH_AUDITSYSCALL
>>>  	select HAVE_ARCH_EARLY_PFN_TO_NID
>>> @@ -124,6 +125,7 @@
>>>  	select HAVE_CMPXCHG_DOUBLE
>>>  	select HAVE_CMPXCHG_LOCAL
>>>  	select HAVE_DEBUG_KMEMLEAK
>>> +	select HAVE_DMA_ATTRS
>>>  	select HAVE_DYNAMIC_FTRACE
>>>  	select HAVE_DYNAMIC_FTRACE_WITH_REGS
>>>  	select HAVE_FTRACE_MCOUNT_RECORD
>>> @@ -580,7 +582,6 @@ config QDIO
>>>
>>>  menuconfig PCI
>>>  	bool "PCI support"
>>> -	select HAVE_DMA_ATTRS
>>>  	select PCI_MSI
>>>  	help
>>>  	  Enable PCI support.
>>
>> Hm. Further down in this file, there's
>>
>> config HAS_DMA
>> 	def_bool PCI
>> 	select HAVE_DMA_API_DEBUG
>>
>> Should we maybe select HAVE_DMA_API_DEBUG above, drop the HAS_DMA
>> config option and rely on not defining NO_DMA instead?
>
> Hmm, yes. That would simplify things a lot. Right now we include
> lib/Kconfig (which defines HAS_DMA) and define it ourselves in
> arch/s390/Kconfig. Whoever comes first wins. Adding a select statement
> would make this even more complicated.
>
> Andy, I will simply send you a respin of this patch.

I'm slightly concerned that I'm going to screw this all up and apply the wrong version. Could you resend the whole series with git format-patch -vN for some appropriate N (or similar)?

Thanks,
Andy
Re: [PATCH/RFC 0/4] dma ops and virtio
On Wed, Nov 4, 2015 at 9:52 AM, Cornelia Huck wrote:
> On Wed, 4 Nov 2015 15:29:23 +0100, Cornelia Huck wrote:
>
>> I'm currently suspecting some endianness issues, probably with the ecw
>> accesses, which is why I'd be interested in your trace information (as
>> I currently don't have a LE test setup at hand.)
>
> I think I've got it. We have sense_data as a byte array, which
> implicitly makes it BE already. When we copy to the ecws while building
> the irb, the data ends up in 32 bit values. The conversion from host
> endianness to BE now treats them as LE on your system...
>
> Could you please give the following qemu patch a try?

Tested-by: Andy Lutomirski

Now my test script panics for the right reason (init isn't actually an s390 binary). Thanks!

I'll update my branch with the latest s390 DMA stuff, and now I can give it at least basic regression testing. Of course, I have to find a root filesystem, too. :)

--Andy
Re: [PATCH/RFC 0/4] dma ops and virtio
On Tue, Nov 3, 2015 at 9:59 AM, Cornelia Huck wrote:
> It's just failing very early in the setup phase. As it works for me
> with a kvm setup, I'm suspecting some error in qemu's emulation code,
> which is unfortunately not my turf.

That should be easy to rule out. Can you try with -machine accel=tcg? I can't test with kvm for obvious reasons.

> Some more poke-around-in-the-dark ideas:
>
> - Do you get more debug output when you switch back to s390-ccw-virtio
> (virtio-1), i.e. does cmd 83 work and is it followed by further
> commands?

I'll play with this stuff later today.

--Andy
Re: [PATCH/RFC 0/4] dma ops and virtio
On Tue, Nov 3, 2015 at 12:14 AM, Christian Borntraeger wrote:
> On 02.11.2015 at 21:23, Andy Lutomirski wrote:
>> On Mon, Nov 2, 2015 at 3:16 AM, Cornelia Huck wrote:
>>> On Fri, 30 Oct 2015 13:33:07 -0700, Andy Lutomirski wrote:
>>>
>>>> On Fri, Oct 30, 2015 at 1:25 AM, Cornelia Huck wrote:
>>>>> On Thu, 29 Oct 2015 15:50:38 -0700, Andy Lutomirski wrote:
>>>>>
>>>>>> Progress! After getting that sort-of-working, I figured out what was
>>>>>> wrong with my earlier command, and I got that working, too. Now I
>>>>>> get:
>>>>>>
>>>>>> qemu-system-s390x -fsdev
>>>>>> local,id=virtfs1,path=/,security_model=none,readonly -device
>>>>>> virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio
>>>>>> -nodefaults -device sclpconsole,chardev=console -parallel none -net
>>>>>> none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on
>>>>>> -serial chardev:console -mon chardev=console -vga none -display none
>>>>>> -kernel arch/s390/boot/bzImage -append
>>>>>> 'init=/home/luto/devel/virtme/virtme/guest/virtme-init
>>>>>> psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8"
>>>>>> TERM=xterm-256color rootfstype=9p
>>>>>> rootflags=ro,version=9p2000.L,trans=virtio,access=any
>>>>>> raid=noautodetect debug'
>>>>>
>>>>> The commandline looks sane AFAICS.
>>>>>
>>>>> (...)
>>>>>
>>>>>> vrfy: device 0.0.: rc=0 pgroup=0 mpath=0 vpm=80
>>>>>> virtio_ccw 0.0.: Failed to set online: -5
>>>>>>
>>>>>> ^^^ bad news!
>>>>>
>>>>> I'd like to see where in the onlining process this fails. Could you set
>>>>> up qemu tracing for css_* and virtio_ccw_* (instructions in
>>>>> qemu/docs/tracing.txt)?
>>>>
>>>> I have a file called events that contains:
>>>>
>>>> css_*
>>>> virtio_ccw_*
>>>>
>>>> pointing -trace events= at it results in a trace- file that's 549
>>>> bytes long and contains nothing. Are wildcards not as well-supported
>>>> as the docs suggest?
>>>
>>> Just tried it, seemed to work for me as expected. And as your messages
>>> indicate, at least some of the css tracepoints are guaranteed to be
>>> hit. Odd.
>>>
>>> Can you try the following sophisticated printf debug patch?
>>>
>>> diff --git a/hw/s390x/css.c b/hw/s390x/css.c
>>> index c033612..6a87bd6 100644
>>> --- a/hw/s390x/css.c
>>> +++ b/hw/s390x/css.c
>>> @@ -308,6 +308,8 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr ccw_addr)
>>>          sch->ccw_no_data_cnt++;
>>>      }
>>>
>>> +fprintf(stderr, "CH DBG: %s: cmd_code=%x\n", __func__, ccw.cmd_code);
>>> +
>>>      /* Look at the command. */
>>>      switch (ccw.cmd_code) {
>>>      case CCW_CMD_NOOP:
>>> @@ -375,6 +377,7 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr ccw_addr)
>>>          }
>>>          break;
>>>      }
>>> +fprintf(stderr, "CH DBG: %s: ret=%d\n", __func__, ret);
>>>      sch->last_cmd = ccw;
>>>      sch->last_cmd_valid = true;
>>>      if (ret == 0) {
>>>
>>>>> Which qemu version is this, btw.?
>>>>
>>>> git from yesterday.
>>>
>>> Hm. Might be worth trying the s390-ccw-virtio-2.4 machine instead.
>>
>> No change.
>>
>> With s390-ccw-virtio-2.4, I get:
>>
>> Initializing cgroup subsys cpuset
>> Initializing cgroup subsys cpu
>> Initializing cgroup subsys cpuacct
>> Linux version 4.3.0-rc7-8-gff230d6ec6b2
>> (l...@amaluto.corp.amacapital.net) (gcc version 5.1.1 20150618 (Red
>> Hat Cross 5.1.1-3) (GCC) ) #344 SMP Fri Oct 30 13:16:13 PDT 2015
>> setup: Linux is running under KVM in 64-bit mode
>> setup: Max memory size: 128MB
>> Zone ranges:
>>   DMA [mem 0x-0x7fff]
>>   Normal empty
>> Movable zone start for each node
>> Early memory node ranges
>>   node 0: [mem 0x-0x07ff]
Re: [PATCH/RFC 0/4] dma ops and virtio
On Mon, Nov 2, 2015 at 3:16 AM, Cornelia Huck wrote: > On Fri, 30 Oct 2015 13:33:07 -0700 > Andy Lutomirski wrote: > >> On Fri, Oct 30, 2015 at 1:25 AM, Cornelia Huck >> wrote: >> > On Thu, 29 Oct 2015 15:50:38 -0700 >> > Andy Lutomirski wrote: >> > >> >> Progress! After getting that sort-of-working, I figured out what was >> >> wrong with my earlier command, and I got that working, too. Now I >> >> get: >> >> >> >> qemu-system-s390x -fsdev >> >> local,id=virtfs1,path=/,security_model=none,readonly -device >> >> virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio >> >> -nodefaults -device sclpconsole,chardev=console -parallel none -net >> >> none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on >> >> -serial chardev:console -mon chardev=console -vga none -display none >> >> -kernel arch/s390/boot/bzImage -append >> >> 'init=/home/luto/devel/virtme/virtme/guest/virtme-init >> >> psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8" >> >> TERM=xterm-256color rootfstype=9p >> >> rootflags=ro,version=9p2000.L,trans=virtio,access=any >> >> raid=noautodetect debug' >> > >> > The commandline looks sane AFAICS. >> > >> > (...) >> > >> >> vrfy: device 0.0.: rc=0 pgroup=0 mpath=0 vpm=80 >> >> virtio_ccw 0.0.: Failed to set online: -5 >> >> >> >> ^^^ bad news! >> > >> > I'd like to see where in the onlining process this fails. Could you set >> > up qemu tracing for css_* and virtio_ccw_* (instructions in >> > qemu/docs/tracing.txt)? >> >> I have a file called events that contains: >> >> css_* >> virtio_ccw_* >> >> pointing -trace events= at it results in a trace- file that's 549 >> bytes long and contains nothing. Are wildcards not as well-supported >> as the docs suggest? > > Just tried it, seemed to work for me as expected. And as your messages > indicate, at least some of the css tracepoints are guaranteed to be > hit. Odd. > > Can you try the following sophisticated printf debug patch? 
> > diff --git a/hw/s390x/css.c b/hw/s390x/css.c > index c033612..6a87bd6 100644 > --- a/hw/s390x/css.c > +++ b/hw/s390x/css.c > @@ -308,6 +308,8 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr > ccw_addr) > sch->ccw_no_data_cnt++; > } > > +fprintf(stderr, "CH DBG: %s: cmd_code=%x\n", __func__, ccw.cmd_code); > + > /* Look at the command. */ > switch (ccw.cmd_code) { > case CCW_CMD_NOOP: > @@ -375,6 +377,7 @@ static int css_interpret_ccw(SubchDev *sch, hwaddr > ccw_addr) > } > break; > } > +fprintf(stderr, "CH DBG: %s: ret=%d\n", __func__, ret); > sch->last_cmd = ccw; > sch->last_cmd_valid = true; > if (ret == 0) { > > >> > Which qemu version is this, btw.? >> > >> >> git from yesterday. > > Hm. Might be worth trying the s390-ccw-virtio-2.4 machine instead. > No change. With s390-ccw-virtio-2.4, I get: Initializing cgroup subsys cpuset Initializing cgroup subsys cpu Initializing cgroup subsys cpuacct Linux version 4.3.0-rc7-8-gff230d6ec6b2 (l...@amaluto.corp.amacapital.net) (gcc version 5.1.1 20150618 (Red Hat Cross 5.1.1-3) (GCC) ) #344 SMP Fri Oct 30 13:16:13 PDT 2015 setup: Linux is running under KVM in 64-bit mode setup: Max memory size: 128MB Zone ranges: DMA [mem 0x-0x7fff] Normal empty Movable zone start for each node Early memory node ranges node 0: [mem 0x-0x07ff] Initmem setup node 0 [mem 0x-0x07ff] On node 0 totalpages: 32768 DMA zone: 512 pages used for memmap DMA zone: 0 pages reserved DMA zone: 32768 pages, LIFO batch:7 PERCPU: Embedded 466 pages/cpu @07605000 s1868032 r8192 d32512 u1908736 pcpu-alloc: s1868032 r8192 d32512 u1908736 alloc=466*4096 pcpu-alloc: [0] 0 [0] 1 Built 1 zonelists in Zone order, mobility grouping on. 
Total pages: 32256 Kernel command line: init=/home/luto/devel/virtme/virtme/guest/virtme-init psmouse.proto=exps "virtme_stty_con=rows 45 cols 150 iutf8" TERM=xterm-256color rootfstype=9p rootflags=version=9p2000.L,trans=virtio,access=any raid=noautodetect ro debug PID hash table entries: 512 (order: 0, 4096 bytes) Dentry cache hash table entries: 16384 (order: 5, 131072 bytes) Inode-cache hash table entries: 8192 (order: 4, 65536 bytes) Memory: 92520K/131072K available (8255K kernel code, 802K rwdata, 3860K rodata, 2384K in
Re: [PATCH v4 1/6] virtio-net: Stop doing DMA from the stack
On Fri, Oct 30, 2015 at 6:55 AM, Christian Borntraeger wrote: > Am 30.10.2015 um 02:09 schrieb Andy Lutomirski: >> From: "Michael S. Tsirkin" >> >> Once virtio starts using the DMA API, we won't be able to safely DMA >> from the stack. virtio-net does a couple of config DMA requests >> from small stack buffers -- switch to using dynamically-allocated >> memory. >> >> This should have no effect on any performance-critical code paths. >> >> [I wrote the subject and commit message. mst wrote the code. --luto] >> >> Signed-off-by: Andy Lutomirski >> signed-off-by: Michael S. Tsirkin > > I still get an error when using multiqueue: > > # ethtool -L eth0 combined 4 > [ 33.534686] virtio_ccw 0.0.000d: DMA-API: device driver maps memory from > stack [addr=629e7c06] Fixed in my branch, I think. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC 0/4] dma ops and virtio
On Fri, Oct 30, 2015 at 1:25 AM, Cornelia Huck wrote: > On Thu, 29 Oct 2015 15:50:38 -0700 > Andy Lutomirski wrote: > >> Progress! After getting that sort-of-working, I figured out what was >> wrong with my earlier command, and I got that working, too. Now I >> get: >> >> qemu-system-s390x -fsdev >> local,id=virtfs1,path=/,security_model=none,readonly -device >> virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio >> -nodefaults -device sclpconsole,chardev=console -parallel none -net >> none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on >> -serial chardev:console -mon chardev=console -vga none -display none >> -kernel arch/s390/boot/bzImage -append >> 'init=/home/luto/devel/virtme/virtme/guest/virtme-init >> psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8" >> TERM=xterm-256color rootfstype=9p >> rootflags=ro,version=9p2000.L,trans=virtio,access=any >> raid=noautodetect debug' > > The commandline looks sane AFAICS. > > (...) > >> vrfy: device 0.0.: rc=0 pgroup=0 mpath=0 vpm=80 >> virtio_ccw 0.0.: Failed to set online: -5 >> >> ^^^ bad news! > > I'd like to see where in the onlining process this fails. Could you set > up qemu tracing for css_* and virtio_ccw_* (instructions in > qemu/docs/tracing.txt)? I have a file called events that contains: css_* virtio_ccw_* pointing -trace events= at it results in a trace- file that's 549 bytes long and contains nothing. Are wildcards not as well-supported as the docs suggest? > > Which qemu version is this, btw.? > git from yesterday. --Andy -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 2/6] virtio_ring: Support DMA APIs
On Fri, Oct 30, 2015 at 5:05 AM, Christian Borntraeger wrote: > Am 30.10.2015 um 13:01 schrieb Cornelia Huck: >> On Thu, 29 Oct 2015 18:09:47 -0700 >> Andy Lutomirski wrote: >> >>> virtio_ring currently sends the device (usually a hypervisor) >>> physical addresses of its I/O buffers. This is okay when DMA >>> addresses and physical addresses are the same thing, but this isn't >>> always the case. For example, this never works on Xen guests, and >>> it is likely to fail if a physical "virtio" device ever ends up >>> behind an IOMMU or swiotlb. >>> >>> The immediate use case for me is to enable virtio on Xen guests. >>> For that to work, we need vring to support DMA address translation >>> as well as a corresponding change to virtio_pci or to another >>> driver. >>> >>> With this patch, if enabled, virtfs survives kmemleak and >>> CONFIG_DMA_API_DEBUG. >>> >>> Signed-off-by: Andy Lutomirski >>> --- >>> drivers/virtio/Kconfig | 2 +- >>> drivers/virtio/virtio_ring.c | 190 >>> +++ >>> tools/virtio/linux/dma-mapping.h | 17 >>> 3 files changed, 172 insertions(+), 37 deletions(-) >>> create mode 100644 tools/virtio/linux/dma-mapping.h >> >>> static void detach_buf(struct vring_virtqueue *vq, unsigned int head) >>> { >>> -unsigned int i; >>> +unsigned int i, j; >>> +u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT); >>> >>> /* Clear data ptr. 
*/ >>> -vq->data[head] = NULL; >>> +vq->desc_state[head].data = NULL; >>> >>> -/* Put back on free list: find end */ >>> +/* Put back on free list: unmap first-level descriptors and find end */ >>> i = head; >>> >>> -/* Free the indirect table */ >>> -if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, >>> VRING_DESC_F_INDIRECT)) >>> -kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, >>> vq->vring.desc[i].addr))); >>> - >>> -while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, >>> VRING_DESC_F_NEXT)) { >>> +while (vq->vring.desc[i].flags & nextflag) { >>> +vring_unmap_one(vq, &vq->vring.desc[i]); >>> i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next); >>> vq->vq.num_free++; >>> } >>> >>> +vring_unmap_one(vq, &vq->vring.desc[i]); >>> vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head); >>> vq->free_head = head; >>> + >>> /* Plus final descriptor */ >>> vq->vq.num_free++; >>> + >>> +/* Free the indirect table, if any, now that it's unmapped. */ >>> +if (vq->desc_state[head].indir_desc) { >>> +struct vring_desc *indir_desc = >>> vq->desc_state[head].indir_desc; >>> +u32 len = vq->vring.desc[head].len; >> >> This one needs to be virtio32_to_cpu(...) as well. > > Yes, just did the exact same change > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index f269e1c..f2249df 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -556,7 +556,7 @@ static void detach_buf(struct vring_virtqueue *vq, > unsigned int head) > /* Free the indirect table, if any, now that it's unmapped. */ > if (vq->desc_state[head].indir_desc) { > struct vring_desc *indir_desc = > vq->desc_state[head].indir_desc; > - u32 len = vq->vring.desc[head].len; > + u32 len = virtio32_to_cpu(vq->vq.vdev, > vq->vring.desc[head].len); > > BUG_ON(!(vq->vring.desc[head].flags & > cpu_to_virtio16(vq->vq.vdev, > VRING_DESC_F_INDIRECT))); > > > now it boots. Thanks! I applied this to my tree. 
I won't send a new version quite yet, though, to reduce inbox load. --Andy
Re: [PATCH v4 0/6] virtio core DMA API conversion
On Thu, Oct 29, 2015 at 6:09 PM, Andy Lutomirski wrote:
> This switches virtio to use the DMA API unconditionally. I'm sure
> it breaks things, but it seems to work on x86 using virtio-pci, with
> and without Xen, and using both the modern 1.0 variant and the
> legacy variant.
...
> Andy Lutomirski (5):
>   virtio_ring: Support DMA APIs
>   virtio_pci: Use the DMA API
>   virtio: Add improved queue allocation API
>   virtio_mmio: Use the DMA API
>   virtio_pci: Use the DMA API

Ugh. The two virtio_pci patches should be squashed together. I'll do that for v5, but I'm not going to send it until there's more feedback.

FWIW, I'm collecting this stuff here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=virtio_dma

That branch includes this series (with the squash) and the s390 patches. I'll keep it up to date, since it seems silly to declare it stable enough to stop rebasing yet.

--Andy
[PATCH v4 0/6] virtio core DMA API conversion
This switches virtio to use the DMA API unconditionally. I'm sure it breaks things, but it seems to work on x86 using virtio-pci, with and without Xen, and using both the modern 1.0 variant and the legacy variant.

This appears to work on native and Xen x86_64 using both modern and legacy virtio-pci. It also appears to work on arm and arm64.

It definitely won't work as-is on s390x, and I haven't been able to test Christian's patches because I can't get virtio-ccw to work in QEMU at all. I don't know what I'm doing wrong.

It doesn't work on ppc64. Ben, consider yourself pinged to send me a patch :)

It doesn't work on sparc64. I didn't realize at Kernel Summit that sparc64 has the same problem as ppc64.

DaveM, for background, we're trying to fix virtio to use the DMA API. That will require that every platform that uses virtio supplies valid DMA operations on devices that use virtio_ring. Unfortunately, QEMU historically ignores the IOMMU on virtio devices.

On x86, this isn't really a problem. x86 has a nice way for the platform to describe which devices are behind an IOMMU, and QEMU will be adjusted accordingly. The only thing that will break is a recently-added experimental mode.

Ben's plan for powerpc is to add a quirk for existing virtio-pci devices and to eventually update the devicetree stuff to allow QEMU to tell the guest which devices use the IOMMU.

AFAICT sparc has a similar problem to powerpc. DaveM, can you come up with a straightforward way to get sparc's DMA API to work correctly for virtio-pci devices?

NB: Sadly, the platforms I've successfully tested on don't include any big-endian platforms, so there could still be lurking endian problems.

Changes from v3:
- More big-endian fixes.
- Added better virtio-ring APIs that handle allocation and use them in virtio-mmio and virtio-pci.
- Switch to Michael's virtio-net patch.

Changes from v2:
- Fix vring_mapping_error incorrect argument

Changes from v1:
- Fix an endian conversion error causing a BUG to hit.
- Fix a DMA ordering issue (swiotlb=force works now).
- Minor cleanups.

Andy Lutomirski (5):
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API
  virtio: Add improved queue allocation API
  virtio_mmio: Use the DMA API
  virtio_pci: Use the DMA API

Michael S. Tsirkin (1):
  virtio-net: Stop doing DMA from the stack

 drivers/net/virtio_net.c           |  34 ++--
 drivers/virtio/Kconfig             |   2 +-
 drivers/virtio/virtio_mmio.c       |  67 ++-
 drivers/virtio/virtio_pci_common.h |   6 -
 drivers/virtio/virtio_pci_legacy.c |  42 ++---
 drivers/virtio/virtio_pci_modern.c |  61 ++-
 drivers/virtio/virtio_ring.c       | 348 ++---
 include/linux/virtio.h             |  23 ++-
 include/linux/virtio_ring.h        |  35
 tools/virtio/linux/dma-mapping.h   |  17 ++
 10 files changed, 426 insertions(+), 209 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

--
2.4.3
[PATCH v4 2/6] virtio_ring: Support DMA APIs
virtio_ring currently sends the device (usually a hypervisor) physical addresses of its I/O buffers. This is okay when DMA addresses and physical addresses are the same thing, but this isn't always the case. For example, this never works on Xen guests, and it is likely to fail if a physical "virtio" device ever ends up behind an IOMMU or swiotlb. The immediate use case for me is to enable virtio on Xen guests. For that to work, we need vring to support DMA address translation as well as a corresponding change to virtio_pci or to another driver. With this patch, if enabled, virtfs survives kmemleak and CONFIG_DMA_API_DEBUG. Signed-off-by: Andy Lutomirski --- drivers/virtio/Kconfig | 2 +- drivers/virtio/virtio_ring.c | 190 +++ tools/virtio/linux/dma-mapping.h | 17 3 files changed, 172 insertions(+), 37 deletions(-) create mode 100644 tools/virtio/linux/dma-mapping.h diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index cab9f3f63a38..77590320d44c 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -60,7 +60,7 @@ config VIRTIO_INPUT config VIRTIO_MMIO tristate "Platform bus driver for memory mapped virtio devices" - depends on HAS_IOMEM + depends on HAS_IOMEM && HAS_DMA select VIRTIO ---help--- This drivers provides support for memory mapped virtio diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 096b857e7b75..a872eb89587f 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -24,6 +24,7 @@ #include #include #include +#include #ifdef DEBUG /* For development, we want to crash whenever the ring is screwed. */ @@ -54,7 +55,14 @@ #define END_USE(vq) #endif -struct vring_virtqueue { +struct vring_desc_state +{ + void *data; /* Data for callback. */ + struct vring_desc *indir_desc; /* Indirect descriptor, if any. 
*/ +}; + +struct vring_virtqueue +{ struct virtqueue vq; /* Actual memory layout for this queue */ @@ -92,12 +100,71 @@ struct vring_virtqueue { ktime_t last_add_time; #endif - /* Tokens for callbacks. */ - void *data[]; + /* Per-descriptor state. */ + struct vring_desc_state desc_state[]; }; #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) +/* + * The DMA ops on various arches are rather gnarly right now, and + * making all of the arch DMA ops work on the vring device itself + * is a mess. For now, we use the parent device for DMA ops. + */ +struct device *vring_dma_dev(const struct vring_virtqueue *vq) +{ + return vq->vq.vdev->dev.parent; +} + +/* Map one sg entry. */ +static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq, + struct scatterlist *sg, + enum dma_data_direction direction) +{ + /* +* We can't use dma_map_sg, because we don't use scatterlists in +* the way it expects (we don't guarantee that the scatterlist +* will exist for the lifetime of the mapping). +*/ + return dma_map_page(vring_dma_dev(vq), + sg_page(sg), sg->offset, sg->length, + direction); +} + +static dma_addr_t vring_map_single(const struct vring_virtqueue *vq, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + return dma_map_single(vring_dma_dev(vq), + cpu_addr, size, direction); +} + +static void vring_unmap_one(const struct vring_virtqueue *vq, + struct vring_desc *desc) +{ + u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); + + if (flags & VRING_DESC_F_INDIRECT) { + dma_unmap_single(vring_dma_dev(vq), +virtio64_to_cpu(vq->vq.vdev, desc->addr), +virtio32_to_cpu(vq->vq.vdev, desc->len), +(flags & VRING_DESC_F_WRITE) ? +DMA_FROM_DEVICE : DMA_TO_DEVICE); + } else { + dma_unmap_page(vring_dma_dev(vq), + virtio64_to_cpu(vq->vq.vdev, desc->addr), + virtio32_to_cpu(vq->vq.vdev, desc->len), + (flags & VRING_DESC_F_WRITE) ? 
+ DMA_FROM_DEVICE : DMA_TO_DEVICE); + } +} + +static int vring_mapping_error(const struct vring_virtqueue *vq, + dma_addr_t addr) +{ + return dma_mapping_error(vring_dma_dev(vq), addr); +} + static struct vring_desc *alloc_indirect(struct virtqueue *_vq, unsigned int total_sg, gfp_t gfp) { @@ -131
[PATCH v4 4/6] virtio: Add improved queue allocation API
This leaves vring_new_virtqueue alone for compatibility, but it adds two new improved APIs: vring_create_virtqueue: Creates a virtqueue backed by automatically allocated coherent memory. (Some day this could be extended to support non-coherent memory, too, if there ends up being a platform on which it's worthwhile.) __vring_new_virtqueue: Creates a virtqueue with a manually-specified layout. This should allow mic_virtio to work much more cleanly. Signed-off-by: Andy Lutomirski --- drivers/virtio/virtio_ring.c | 164 +++ include/linux/virtio.h | 23 +- include/linux/virtio_ring.h | 35 + 3 files changed, 190 insertions(+), 32 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index a872eb89587f..f269e1cba00c 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -91,6 +91,11 @@ struct vring_virtqueue /* How to notify other side. FIXME: commonalize hcalls! */ bool (*notify)(struct virtqueue *vq); + /* DMA, allocation, and size information */ + bool we_own_ring; + size_t queue_size_in_bytes; + dma_addr_t queue_dma_addr; + #ifdef DEBUG /* They're supposed to lock for us. */ unsigned int in_use; @@ -821,36 +826,31 @@ irqreturn_t vring_interrupt(int irq, void *_vq) } EXPORT_SYMBOL_GPL(vring_interrupt); -struct virtqueue *vring_new_virtqueue(unsigned int index, - unsigned int num, - unsigned int vring_align, - struct virtio_device *vdev, - bool weak_barriers, - void *pages, - bool (*notify)(struct virtqueue *), - void (*callback)(struct virtqueue *), - const char *name) +struct virtqueue *__vring_new_virtqueue(unsigned int index, + struct vring vring, + struct virtio_device *vdev, + bool weak_barriers, + bool (*notify)(struct virtqueue *), + void (*callback)(struct virtqueue *), + const char *name) { - struct vring_virtqueue *vq; unsigned int i; + struct vring_virtqueue *vq; - /* We assume num is a power of 2.
*/ - if (num & (num - 1)) { - dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num); - return NULL; - } - - vq = kmalloc(sizeof(*vq) + num * sizeof(struct vring_desc_state), + vq = kmalloc(sizeof(*vq) + vring.num * sizeof(struct vring_desc_state), GFP_KERNEL); if (!vq) return NULL; - vring_init(&vq->vring, num, pages, vring_align); + vq->vring = vring; vq->vq.callback = callback; vq->vq.vdev = vdev; vq->vq.name = name; - vq->vq.num_free = num; + vq->vq.num_free = vring.num; vq->vq.index = index; + vq->we_own_ring = false; + vq->queue_dma_addr = 0; + vq->queue_size_in_bytes = 0; vq->notify = notify; vq->weak_barriers = weak_barriers; vq->broken = false; @@ -871,18 +871,105 @@ struct virtqueue *vring_new_virtqueue(unsigned int index, /* Put everything in free lists. */ vq->free_head = 0; - for (i = 0; i < num-1; i++) + for (i = 0; i < vring.num-1; i++) vq->vring.desc[i].next = cpu_to_virtio16(vdev, i + 1); - memset(vq->desc_state, 0, num * sizeof(struct vring_desc_state)); + memset(vq->desc_state, 0, vring.num * sizeof(struct vring_desc_state)); return &vq->vq; } +EXPORT_SYMBOL_GPL(__vring_new_virtqueue); + +struct virtqueue *vring_create_virtqueue( + unsigned int index, + unsigned int num, + unsigned int vring_align, + struct virtio_device *vdev, + bool weak_barriers, + bool may_reduce_num, + bool (*notify)(struct virtqueue *), + void (*callback)(struct virtqueue *), + const char *name) +{ + struct virtqueue *vq; + void *queue; + dma_addr_t dma_addr; + size_t queue_size_in_bytes; + struct vring vring; + + /* We assume num is a power of 2. */ + if (num & (num - 1)) { + dev_warn(&vdev->dev, "Bad virtqueue length %u\n", num); + return NULL; + } + + /* TODO: allocate each queue chunk individually */ + for (; num && vring_size(num, vring_align) > PAGE_SIZE; num /= 2) { + queue = dma_zalloc_coherent( +
[PATCH v4 3/6] virtio_pci: Use the DMA API
This fixes virtio-pci on platforms and busses that have IOMMUs. This will break the experimental QEMU Q35 IOMMU support until QEMU is fixed. In exchange, it fixes physical virtio hardware as well as virtio-pci running under Xen. We should clean up the virtqueue API to do its own allocation and teach virtqueue_get_avail and virtqueue_get_used to return DMA addresses directly. Signed-off-by: Andy Lutomirski --- drivers/virtio/virtio_pci_common.h | 3 ++- drivers/virtio/virtio_pci_legacy.c | 19 +++ drivers/virtio/virtio_pci_modern.c | 34 -- 3 files changed, 41 insertions(+), 15 deletions(-) diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h index b976d968e793..cd6196b513ad 100644 --- a/drivers/virtio/virtio_pci_common.h +++ b/drivers/virtio/virtio_pci_common.h @@ -38,8 +38,9 @@ struct virtio_pci_vq_info { /* the number of entries in the queue */ int num; - /* the virtual address of the ring queue */ + /* the ring queue */ void *queue; + dma_addr_t queue_dma_addr; /* bus address */ /* the list node for the virtqueues list */ struct list_head node; diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c index 48bc9797e530..b5293e5f2af4 100644 --- a/drivers/virtio/virtio_pci_legacy.c +++ b/drivers/virtio/virtio_pci_legacy.c @@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, info->msix_vector = msix_vec; size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); - info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO); + info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size, + &info->queue_dma_addr, + GFP_KERNEL); if (info->queue == NULL) return ERR_PTR(-ENOMEM); /* activate the queue */ - iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, + iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ @@ -169,7 +171,8 @@ out_assign: vring_del_virtqueue(vq); out_activate_queue: 
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); - free_pages_exact(info->queue, size); + dma_free_coherent(&vp_dev->pci_dev->dev, size, + info->queue, info->queue_dma_addr); return ERR_PTR(err); } @@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info) iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN)); - free_pages_exact(info->queue, size); + dma_free_coherent(&vp_dev->pci_dev->dev, size, + info->queue, info->queue_dma_addr); } static const struct virtio_config_ops virtio_pci_config_ops = { @@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device *vp_dev) return -ENODEV; } + rc = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64)); + if (rc) + rc = dma_set_mask_and_coherent(&pci_dev->dev, + DMA_BIT_MASK(32)); + if (rc) + dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA. Trying to continue, but this might not work.\n"); + rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy"); if (rc) return rc; diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c index 8e5cf194cc0b..fbe0bd1c4881 100644 --- a/drivers/virtio/virtio_pci_modern.c +++ b/drivers/virtio/virtio_pci_modern.c @@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num) return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES)); } -static void *alloc_virtqueue_pages(int *num) +static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev, + int *num, dma_addr_t *dma_addr) { void *pages; /* TODO: allocate each queue chunk individually */ for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) { - pages = alloc_pages_exact(vring_pci_size(*num), - GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN); + pages = dma_zalloc_coherent( + &vp_dev->pci_dev->dev, vring_pci_size(*num), + dma_addr, GFP_KERNEL|__GFP_NOWARN); if (pages) return pages; } @@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num) return NULL; /* Try to get a single page. You are my only ho
[PATCH v4 1/6] virtio-net: Stop doing DMA from the stack
From: "Michael S. Tsirkin" Once virtio starts using the DMA API, we won't be able to safely DMA from the stack. virtio-net does a couple of config DMA requests from small stack buffers -- switch to using dynamically-allocated memory. This should have no effect on any performance-critical code paths. [I wrote the subject and commit message. mst wrote the code. --luto] Signed-off-by: Andy Lutomirski signed-off-by: Michael S. Tsirkin --- drivers/net/virtio_net.c | 34 +++--- 1 file changed, 19 insertions(+), 15 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index d8838dedb7a4..f94ab786088f 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -140,6 +140,12 @@ struct virtnet_info { /* CPU hot plug notifier */ struct notifier_block nb; + + /* Control VQ buffers: protected by the rtnl lock */ + struct virtio_net_ctrl_hdr ctrl_hdr; + virtio_net_ctrl_ack ctrl_status; + u8 ctrl_promisc; + u8 ctrl_allmulti; }; struct padded_vnet_hdr { @@ -976,31 +982,30 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, struct scatterlist *out) { struct scatterlist *sgs[4], hdr, stat; - struct virtio_net_ctrl_hdr ctrl; - virtio_net_ctrl_ack status = ~0; unsigned out_num = 0, tmp; /* Caller should know better */ BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)); - ctrl.class = class; - ctrl.cmd = cmd; + vi->ctrl_status = ~0; + vi->ctrl_hdr.class = class; + vi->ctrl_hdr.cmd = cmd; /* Add header */ - sg_init_one(&hdr, &ctrl, sizeof(ctrl)); + sg_init_one(&hdr, &vi->ctrl_hdr, sizeof(vi->ctrl_hdr)); sgs[out_num++] = &hdr; if (out) sgs[out_num++] = out; /* Add return status. 
*/ - sg_init_one(&stat, &status, sizeof(status)); + sg_init_one(&stat, &vi->ctrl_status, sizeof(vi->ctrl_status)); sgs[out_num] = &stat; BUG_ON(out_num + 1 > ARRAY_SIZE(sgs)); virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC); if (unlikely(!virtqueue_kick(vi->cvq))) - return status == VIRTIO_NET_OK; + return vi->ctrl_status == VIRTIO_NET_OK; /* Spin for a response, the kick causes an ioport write, trapping * into the hypervisor, so the request should be handled immediately. @@ -1009,7 +1014,7 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, !virtqueue_is_broken(vi->cvq)) cpu_relax(); - return status == VIRTIO_NET_OK; + return vi->ctrl_status == VIRTIO_NET_OK; } static int virtnet_set_mac_address(struct net_device *dev, void *p) @@ -1151,7 +1156,6 @@ static void virtnet_set_rx_mode(struct net_device *dev) { struct virtnet_info *vi = netdev_priv(dev); struct scatterlist sg[2]; - u8 promisc, allmulti; struct virtio_net_ctrl_mac *mac_data; struct netdev_hw_addr *ha; int uc_count; @@ -1163,22 +1167,22 @@ static void virtnet_set_rx_mode(struct net_device *dev) if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX)) return; - promisc = ((dev->flags & IFF_PROMISC) != 0); - allmulti = ((dev->flags & IFF_ALLMULTI) != 0); + vi->ctrl_promisc = ((dev->flags & IFF_PROMISC) != 0); + vi->ctrl_allmulti = ((dev->flags & IFF_ALLMULTI) != 0); - sg_init_one(sg, &promisc, sizeof(promisc)); + sg_init_one(sg, &vi->ctrl_promisc, sizeof(vi->ctrl_promisc)); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX, VIRTIO_NET_CTRL_RX_PROMISC, sg)) dev_warn(&dev->dev, "Failed to %sable promisc mode.\n", -promisc ? "en" : "dis"); +vi->ctrl_promisc ? "en" : "dis"); - sg_init_one(sg, &allmulti, sizeof(allmulti)); + sg_init_one(sg, &vi->ctrl_allmulti, sizeof(vi->ctrl_allmulti)); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX, VIRTIO_NET_CTRL_RX_ALLMULTI, sg)) dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n", -allmulti ? "en" : "dis"); +vi->ctrl_allmulti ? 
"en" : "dis"); uc_count = netdev_uc_count(dev); mc_count = netdev_mc_count(dev); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4 5/6] virtio_mmio: Use the DMA API
This switches to vring_create_virtqueue, simplifying the driver and adding DMA API support. Signed-off-by: Andy Lutomirski --- drivers/virtio/virtio_mmio.c | 67 ++-- 1 file changed, 15 insertions(+), 52 deletions(-) diff --git a/drivers/virtio/virtio_mmio.c b/drivers/virtio/virtio_mmio.c index f499d9da7237..2b9fab52a3cb 100644 --- a/drivers/virtio/virtio_mmio.c +++ b/drivers/virtio/virtio_mmio.c @@ -99,12 +99,6 @@ struct virtio_mmio_vq_info { /* the actual virtqueue */ struct virtqueue *vq; - /* the number of entries in the queue */ - unsigned int num; - - /* the virtual address of the ring queue */ - void *queue; - /* the list node for the virtqueues list */ struct list_head node; }; @@ -322,15 +316,13 @@ static void vm_del_vq(struct virtqueue *vq) { struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vq->vdev); struct virtio_mmio_vq_info *info = vq->priv; - unsigned long flags, size; + unsigned long flags; unsigned int index = vq->index; spin_lock_irqsave(&vm_dev->lock, flags); list_del(&info->node); spin_unlock_irqrestore(&vm_dev->lock, flags); - vring_del_virtqueue(vq); - /* Select and deactivate the queue */ writel(index, vm_dev->base + VIRTIO_MMIO_QUEUE_SEL); if (vm_dev->version == 1) { @@ -340,8 +332,8 @@ static void vm_del_vq(struct virtqueue *vq) WARN_ON(readl(vm_dev->base + VIRTIO_MMIO_QUEUE_READY)); } - size = PAGE_ALIGN(vring_size(info->num, VIRTIO_MMIO_VRING_ALIGN)); - free_pages_exact(info->queue, size); + vring_del_virtqueue(vq); + kfree(info); } @@ -356,8 +348,6 @@ static void vm_del_vqs(struct virtio_device *vdev) free_irq(platform_get_irq(vm_dev->pdev, 0), vm_dev); } - - static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index, void (*callback)(struct virtqueue *vq), const char *name) @@ -365,7 +355,8 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index, struct virtio_mmio_device *vm_dev = to_virtio_mmio_device(vdev); struct virtio_mmio_vq_info *info; struct virtqueue *vq; - unsigned long 
flags, size; + unsigned long flags; + unsigned int num; int err; if (!name) @@ -388,66 +379,40 @@ static struct virtqueue *vm_setup_vq(struct virtio_device *vdev, unsigned index, goto error_kmalloc; } - /* Allocate pages for the queue - start with a queue as big as -* possible (limited by maximum size allowed by device), drop down -* to a minimal size, just big enough to fit descriptor table -* and two rings (which makes it "alignment_size * 2") -*/ - info->num = readl(vm_dev->base + VIRTIO_MMIO_QUEUE_NUM_MAX); - - /* If the device reports a 0 entry queue, we won't be able to -* use it to perform I/O, and vring_new_virtqueue() can't create -* empty queues anyway, so don't bother to set up the device. -*/ - if (info->num == 0) { + num = readl(vm_dev->base + VIRTIO_MMIO_QUEUE_NUM_MAX); + if (num == 0) { err = -ENOENT; - goto error_alloc_pages; - } - - while (1) { - size = PAGE_ALIGN(vring_size(info->num, - VIRTIO_MMIO_VRING_ALIGN)); - /* Did the last iter shrink the queue below minimum size? */ - if (size < VIRTIO_MMIO_VRING_ALIGN * 2) { - err = -ENOMEM; - goto error_alloc_pages; - } - - info->queue = alloc_pages_exact(size, GFP_KERNEL | __GFP_ZERO); - if (info->queue) - break; - - info->num /= 2; + goto error_new_virtqueue; } /* Create the vring */ - vq = vring_new_virtqueue(index, info->num, VIRTIO_MMIO_VRING_ALIGN, vdev, -true, info->queue, vm_notify, callback, name); + vq = vring_create_virtqueue(index, num, VIRTIO_MMIO_VRING_ALIGN, vdev, +true, true, vm_notify, callback, name); if (!vq) { err = -ENOMEM; goto error_new_virtqueue; } /* Activate the queue */ - writel(info->num, vm_dev->base + VIRTIO_MMIO_QUEUE_NUM); + writel(virtqueue_get_vring_size(vq), vm_dev->base + VIRTIO_MMIO_QUEUE_NUM); if (vm_dev->version == 1) { writel(PAGE_SIZE, vm_dev->base + VIRTIO_MMIO_QUEUE_ALIGN); - writel(virt_to_phys(info->queue) >> PAGE_SHIFT, + write
[PATCH v4 6/6] virtio_pci: Use the DMA API
This switches to vring_create_virtqueue, simplifying the driver and adding DMA API support. Signed-off-by: Andy Lutomirski --- drivers/virtio/virtio_pci_common.h | 7 - drivers/virtio/virtio_pci_legacy.c | 39 +++- drivers/virtio/virtio_pci_modern.c | 61 ++ 3 files changed, 19 insertions(+), 88 deletions(-) diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h index cd6196b513ad..1a3c689d1b9e 100644 --- a/drivers/virtio/virtio_pci_common.h +++ b/drivers/virtio/virtio_pci_common.h @@ -35,13 +35,6 @@ struct virtio_pci_vq_info { /* the actual virtqueue */ struct virtqueue *vq; - /* the number of entries in the queue */ - int num; - - /* the ring queue */ - void *queue; - dma_addr_t queue_dma_addr; /* bus address */ - /* the list node for the virtqueues list */ struct list_head node; diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c index b5293e5f2af4..8c4e61783441 100644 --- a/drivers/virtio/virtio_pci_legacy.c +++ b/drivers/virtio/virtio_pci_legacy.c @@ -119,7 +119,6 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, u16 msix_vec) { struct virtqueue *vq; - unsigned long size; u16 num; int err; @@ -131,29 +130,19 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, if (!num || ioread32(vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN)) return ERR_PTR(-ENOENT); - info->num = num; info->msix_vector = msix_vec; - size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); - info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size, - &info->queue_dma_addr, - GFP_KERNEL); - if (info->queue == NULL) + /* create the vring */ + vq = vring_create_virtqueue(index, num, + VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev, + true, false, vp_notify, callback, name); + if (!vq) return ERR_PTR(-ENOMEM); /* activate the queue */ - iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, + iowrite32(virtqueue_get_desc_addr(vq) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev->ioaddr + 
VIRTIO_PCI_QUEUE_PFN); - /* create the vring */ - vq = vring_new_virtqueue(index, info->num, -VIRTIO_PCI_VRING_ALIGN, &vp_dev->vdev, -true, info->queue, vp_notify, callback, name); - if (!vq) { - err = -ENOMEM; - goto out_activate_queue; - } - vq->priv = (void __force *)vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY; if (msix_vec != VIRTIO_MSI_NO_VECTOR) { @@ -161,18 +150,15 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, msix_vec = ioread16(vp_dev->ioaddr + VIRTIO_MSI_QUEUE_VECTOR); if (msix_vec == VIRTIO_MSI_NO_VECTOR) { err = -EBUSY; - goto out_assign; + goto out_deactivate; } } return vq; -out_assign: - vring_del_virtqueue(vq); -out_activate_queue: +out_deactivate: iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); - dma_free_coherent(&vp_dev->pci_dev->dev, size, - info->queue, info->queue_dma_addr); + vring_del_virtqueue(vq); return ERR_PTR(err); } @@ -180,7 +166,6 @@ static void del_vq(struct virtio_pci_vq_info *info) { struct virtqueue *vq = info->vq; struct virtio_pci_device *vp_dev = to_vp_device(vq->vdev); - unsigned long size; iowrite16(vq->index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_SEL); @@ -191,14 +176,10 @@ static void del_vq(struct virtio_pci_vq_info *info) ioread8(vp_dev->ioaddr + VIRTIO_PCI_ISR); } - vring_del_virtqueue(vq); - /* Select and deactivate the queue */ iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); - size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN)); - dma_free_coherent(&vp_dev->pci_dev->dev, size, - info->queue, info->queue_dma_addr); + vring_del_virtqueue(vq); } static const struct virtio_config_ops virtio_pci_config_ops = { diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c index fbe0bd1c4881..50b0cd5a501e 100644 --- a/drivers/virtio/virtio_pci_modern.c +++ b/drivers/virtio/virtio_pci_modern.c @@ -287,35 +287,6 @@ static u16 vp_config_vector(struct virtio_pci_device *vp_dev, u16 vec
Re: [PATCH v2 1/3] virtio_net: Stop doing DMA from the stack
On Wed, Oct 28, 2015 at 12:07 AM, Michael S. Tsirkin wrote:
> How about this instead? Less code, more robust.
>
> Warning: untested. If you do like this approach, Tested-by would be
> appreciated.

I like it.

Tested-by: Andy Lutomirski

--Andy
Re: [PATCH/RFC 0/4] dma ops and virtio
On Wed, Oct 28, 2015 at 5:04 PM, Christian Borntraeger wrote: > Am 29.10.2015 um 07:22 schrieb Andy Lutomirski: >> On Tue, Oct 27, 2015 at 3:48 PM, Christian Borntraeger >> wrote: >>> This is an RFC to check if I am on the right track. There >>> are some attempts to unify the dma ops (Christoph) as well >>> as some attempts to make virtio use the dma API (Andy). >>> >>> At todays discussion at the kernel summit, we concluded that >>> we want to use the same code on all platforms, whereever >>> possible, so having a dummy dma_op might be the easiest >>> solution to keep virtio-ccw as similar as possible to >>> virtio-pci. Andy Lutomirski will rework his patchset to >>> unconditionally use the dma ops. We will also need a >>> compatibility quirk for powerpc to bypass the iommu mappings >>> on older QEMU version (which translates to all versions as >>> of today) and per device, e.g. via device tree. Ben >>> Herrenschmidt will look into that. >> >> The combination of these patches plus my series don't link for me >> unless I enable PCI. Would s390 need to select HAS_DMA from VIRTIO or >> something similar? > > Well, actually this is a potential improvement for series. I could just > make the noop dma ops default for _all_ devices unless it has a per > device dma_ops (e.g. s390 pci) and the unconditionally set HAS_DMA. >> >> Also, is it possible to use QEMU to emulate an s390x? Even just: >> >> qemu-system-s390x -M s390-ccw-virtio > > Yes, we have no interactive bios and if no boot device is there is bios > will load a disabled wait, which will exit qemu. > > Make sure to compile your kernel for z900 (processor type) as qemu does not > handle all things of newer processors. > You can then do > qemu-system-s390x -nographic -m 256 -kernel vmlinux -initrd > Progress! After getting that sort-of-working, I figured out what was wrong with my earlier command, and I got that working, too. 
Now I get: qemu-system-s390x -fsdev local,id=virtfs1,path=/,security_model=none,readonly -device virtio-9p-ccw,fsdev=virtfs1,mount_tag=/dev/root -M s390-ccw-virtio -nodefaults -device sclpconsole,chardev=console -parallel none -net none -echr 1 -serial none -chardev stdio,id=console,signal=off,mux=on -serial chardev:console -mon chardev=console -vga none -display none -kernel arch/s390/boot/bzImage -append 'init=/home/luto/devel/virtme/virtme/guest/virtme-init psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8" TERM=xterm-256color rootfstype=9p rootflags=ro,version=9p2000.L,trans=virtio,access=any raid=noautodetect debug' Initializing cgroup subsys cpuset Initializing cgroup subsys cpu Initializing cgroup subsys cpuacct Linux version 4.3.0-rc7-00403-ga2b5cd810259-dirty (l...@amaluto.corp.amacapital.net) (gcc version 5.1.1 20150618 (Red Hat Cross 5.1.1-3) (GCC) ) #328 SMP Thu Oct 29 15:46:05 PDT 2015 setup: Linux is running under KVM in 64-bit mode setup: Max memory size: 128MB Zone ranges: DMA [mem 0x-0x7fff] Normal empty Movable zone start for each node Early memory node ranges node 0: [mem 0x-0x07ff] Initmem setup node 0 [mem 0x-0x07ff] On node 0 totalpages: 32768 DMA zone: 512 pages used for memmap DMA zone: 0 pages reserved DMA zone: 32768 pages, LIFO batch:7 PERCPU: Embedded 466 pages/cpu @07605000 s1868032 r8192 d32512 u1908736 pcpu-alloc: s1868032 r8192 d32512 u1908736 alloc=466*4096 pcpu-alloc: [0] 0 [0] 1 Built 1 zonelists in Zone order, mobility grouping on. 
Total pages: 32256 Kernel command line: init=/home/luto/devel/virtme/virtme/guest/virtme-init psmouse.proto=exps "virtme_stty_con=rows 24 cols 150 iutf8" TERM=xterm-256color rootfstype=9p rootflags=ro,version=9p2000.L,trans=virtio,access=any raid=noautodetect debug PID hash table entries: 512 (order: 0, 4096 bytes) Dentry cache hash table entries: 16384 (order: 5, 131072 bytes) Inode-cache hash table entries: 8192 (order: 4, 65536 bytes) Memory: 92552K/131072K available (8229K kernel code, 798K rwdata, 3856K rodata, 2384K init, 14382K bss, 38520K reserved, 0K cma-reserved) Write protected kernel read-only data: 0x10 - 0xccdfff SLUB: HWalign=256, Order=0-3, MinObjects=0, CPUs=2, Nodes=1 Running RCU self tests Hierarchical RCU implementation. RCU debugfs-based tracing is enabled. RCU lockdep checking is enabled. Build-time adjustment of leaf fanout to 64. RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=2. RCU: Adjusting geometry for rcu_fanout_leaf=64, nr_cpu_ids=2 NR_IRQS:3 clocksource: tod: mask: 0x max_cycles: 0x3b0a9be803b0a9, max_idle_ns: 1805497147909793 ns console [ttyS1] enabled Lock dependency validator: Copyright (c) 2006 Red Hat
Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS
On Oct 28, 2015 6:11 PM, "Benjamin Herrenschmidt" wrote:
>
> On Thu, 2015-10-29 at 09:42 +0900, David Woodhouse wrote:
> > On Thu, 2015-10-29 at 09:32 +0900, Benjamin Herrenschmidt wrote:
> >
> > > On Power, I generally have 2 IOMMU windows for a device, one at the
> > > bottom is remapped, and is generally used for 32-bit devices and the
> > > one at the top is set up as a bypass
> >
> > So in the normal case of decent 64-bit devices (and not in a VM),
> > they'll *already* be using the bypass region and have full access to
> > all of memory, all of the time? And you have no protection against
> > driver and firmware bugs causing stray DMA?
>
> Correct, we chose to do that for performance reasons.

Could this be mitigated using pools? I don't know if the net code would
play along easily.

--Andy
Re: [PATCH v3 0/3] virtio DMA API core stuff
On Wed, Oct 28, 2015 at 9:12 AM, Michael S. Tsirkin wrote: > On Wed, Oct 28, 2015 at 11:32:34PM +0900, David Woodhouse wrote: >> > I don't have a problem with extending DMA API to address >> > more usecases. >> >> No, this isn't an extension. This is fixing a bug, on certain platforms >> where the DMA API has currently done the wrong thing. >> >> We have historically worked around that bug by introducing *another* >> bug, which is not to *use* the DMA API in the virtio driver. >> >> Sure, we can co-ordinate those two bug-fixes. But let's not talk about >> them as anything other than bug-fixes. > > It was pretty practical not to use it. All virtio devices at the time > without exception bypassed the IOMMU, so it was a question of omitting a > couple of function calls in virtio versus hacking on DMA implementation > on multiple platforms. We have more policy options now, so I agree it's > time to revisit this. > > But for me, the most important thing is that we do coordinate. > >> > > Drivers use DMA API. No more talky. >> > >> > Well for virtio they don't ATM. And 1:1 mapping makes perfect sense >> > for the wast majority of users, so I can't switch them over >> > until the DMA API actually addresses all existing usecases. >> >> That's still not your business; it's the platform's. And there are >> hardware implementations of the virtio protocols on real PCI cards. And >> we have the option of doing IOMMU translation for the virtio devices >> even in a virtual machine. Just don't get involved. >> >> -- >> dwmw2 >> >> > > I'm involved anyway, it's possible not to put all the code in the virtio > subsystem in guest though. But I suspect we'll need to find a way for > non-linux drivers within guest to work correctly too, and they might > have trouble poking at things at the system level. So possibly virtio > subsystem will have to tell platform "this device wants to bypass IOMMU" > and then DMA API does the right thing. 
> After some discussion at KS, no one came up with an example where it's necessary, and the patches to convert virtqueue to use the DMA API are much nicer when they convert it unconditionally. The two interesting cases we thought of were PPC and x86's emulated Q35 IOMMU. PPC will look in to architecting a devicetree-based way to indicate passthrough status and will add quirks for the existing virtio devices. Everyone seems to agree that x86's emulated Q35 thing is just buggy right now and should be taught to use the existing ACPI mechanism for enumerating passthrough devices. I'll send a new version of the series soon. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH/RFC 0/4] dma ops and virtio
On Tue, Oct 27, 2015 at 3:48 PM, Christian Borntraeger wrote: > This is an RFC to check if I am on the right track. There > are some attempts to unify the dma ops (Christoph) as well > as some attempts to make virtio use the dma API (Andy). > > At todays discussion at the kernel summit, we concluded that > we want to use the same code on all platforms, whereever > possible, so having a dummy dma_op might be the easiest > solution to keep virtio-ccw as similar as possible to > virtio-pci. Andy Lutomirski will rework his patchset to > unconditionally use the dma ops. We will also need a > compatibility quirk for powerpc to bypass the iommu mappings > on older QEMU version (which translates to all versions as > of today) and per device, e.g. via device tree. Ben > Herrenschmidt will look into that. The combination of these patches plus my series don't link for me unless I enable PCI. Would s390 need to select HAS_DMA from VIRTIO or something similar? Also, is it possible to use QEMU to emulate an s390x? Even just: qemu-system-s390x -M s390-ccw-virtio exits immediately. (I build from QEMU git today.) I don't know what options to pass to get -M s390 to do anything useful, and AFAICT it's considered deprecated. Help? --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 0/3] virtio DMA API core stuff
On Tue, Oct 27, 2015 at 11:53 PM, David Woodhouse wrote: > On Tue, 2015-10-27 at 23:38 -0700, Andy Lutomirski wrote: >> >> Changes from v2: >> - Fix really embarrassing bug. This version actually works. > > So embarrassing you didn't want to tell us what it was? ... Shhh, it's a secret! I somehow managed to test-boot a different kernel than I thought I was booting. > > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -292,7 +292,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, > vq, desc, total_sg * sizeof(struct vring_desc), > DMA_TO_DEVICE); > > - if (vring_mapping_error(vq, vq->vring.desc[head].addr)) > + if (vring_mapping_error(vq, addr)) > goto unmap_release; > > vq->vring.desc[head].flags = cpu_to_virtio16(_vq->vdev, > VRING_DESC_F_INDIRECT); > > That wasn't going to be the reason for Christian's failure, was it? > Not obviously, but it's possible. Now that I'm staring at it, I have some more big-endian issues, so there'll be a v4. I'll also play with Michael's thing. Expect a long delay, though -- my flight's about to leave. The readme notwithstanding, virtme (https://github.com/amluto/virtme) actually has s390x support, so I can try to debug when I get home. I'm not about to try doing this on a laptop :) --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 2/3] virtio_ring: Support DMA APIs
virtio_ring currently sends the device (usually a hypervisor) physical addresses of its I/O buffers. This is okay when DMA addresses and physical addresses are the same thing, but this isn't always the case. For example, this never works on Xen guests, and it is likely to fail if a physical "virtio" device ever ends up behind an IOMMU or swiotlb. The immediate use case for me is to enable virtio on Xen guests. For that to work, we need vring to support DMA address translation as well as a corresponding change to virtio_pci or to another driver. With this patch, if enabled, virtfs survives kmemleak and CONFIG_DMA_API_DEBUG. Signed-off-by: Andy Lutomirski --- drivers/virtio/Kconfig | 2 +- drivers/virtio/virtio_ring.c | 187 +++ tools/virtio/linux/dma-mapping.h | 17 3 files changed, 169 insertions(+), 37 deletions(-) create mode 100644 tools/virtio/linux/dma-mapping.h diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index cab9f3f63a38..77590320d44c 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -60,7 +60,7 @@ config VIRTIO_INPUT config VIRTIO_MMIO tristate "Platform bus driver for memory mapped virtio devices" - depends on HAS_IOMEM + depends on HAS_IOMEM && HAS_DMA select VIRTIO ---help--- This drivers provides support for memory mapped virtio diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 096b857e7b75..6962ea37ade0 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -24,6 +24,7 @@ #include #include #include +#include #ifdef DEBUG /* For development, we want to crash whenever the ring is screwed. */ @@ -54,7 +55,14 @@ #define END_USE(vq) #endif -struct vring_virtqueue { +struct vring_desc_state +{ + void *data; /* Data for callback. */ + struct vring_desc *indir_desc; /* Indirect descriptor, if any. 
*/ +}; + +struct vring_virtqueue +{ struct virtqueue vq; /* Actual memory layout for this queue */ @@ -92,12 +100,71 @@ struct vring_virtqueue { ktime_t last_add_time; #endif - /* Tokens for callbacks. */ - void *data[]; + /* Per-descriptor state. */ + struct vring_desc_state desc_state[]; }; #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) +/* + * The DMA ops on various arches are rather gnarly right now, and + * making all of the arch DMA ops work on the vring device itself + * is a mess. For now, we use the parent device for DMA ops. + */ +struct device *vring_dma_dev(const struct vring_virtqueue *vq) +{ + return vq->vq.vdev->dev.parent; +} + +/* Map one sg entry. */ +static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq, + struct scatterlist *sg, + enum dma_data_direction direction) +{ + /* +* We can't use dma_map_sg, because we don't use scatterlists in +* the way it expects (we don't guarantee that the scatterlist +* will exist for the lifetime of the mapping). +*/ + return dma_map_page(vring_dma_dev(vq), + sg_page(sg), sg->offset, sg->length, + direction); +} + +static dma_addr_t vring_map_single(const struct vring_virtqueue *vq, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + return dma_map_single(vring_dma_dev(vq), + cpu_addr, size, direction); +} + +static void vring_unmap_one(const struct vring_virtqueue *vq, + struct vring_desc *desc) +{ + u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); + + if (flags & VRING_DESC_F_INDIRECT) { + dma_unmap_single(vring_dma_dev(vq), +virtio64_to_cpu(vq->vq.vdev, desc->addr), +virtio32_to_cpu(vq->vq.vdev, desc->len), +(flags & VRING_DESC_F_WRITE) ? +DMA_FROM_DEVICE : DMA_TO_DEVICE); + } else { + dma_unmap_page(vring_dma_dev(vq), + virtio64_to_cpu(vq->vq.vdev, desc->addr), + virtio32_to_cpu(vq->vq.vdev, desc->len), + (flags & VRING_DESC_F_WRITE) ? 
+ DMA_FROM_DEVICE : DMA_TO_DEVICE); + } +} + +static int vring_mapping_error(const struct vring_virtqueue *vq, + dma_addr_t addr) +{ + return dma_mapping_error(vring_dma_dev(vq), addr); +} + static struct vring_desc *alloc_indirect(struct virtqueue *_vq, unsigned int total_sg, gfp_t gfp) { @@ -131
[PATCH v3 3/3] virtio_pci: Use the DMA API
This fixes virtio-pci on platforms and busses that have IOMMUs. This will break the experimental QEMU Q35 IOMMU support until QEMU is fixed. In exchange, it fixes physical virtio hardware as well as virtio-pci running under Xen. We should clean up the virtqueue API to do its own allocation and teach virtqueue_get_avail and virtqueue_get_used to return DMA addresses directly. Signed-off-by: Andy Lutomirski --- drivers/virtio/virtio_pci_common.h | 3 ++- drivers/virtio/virtio_pci_legacy.c | 19 +++ drivers/virtio/virtio_pci_modern.c | 34 -- 3 files changed, 41 insertions(+), 15 deletions(-) diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h index b976d968e793..cd6196b513ad 100644 --- a/drivers/virtio/virtio_pci_common.h +++ b/drivers/virtio/virtio_pci_common.h @@ -38,8 +38,9 @@ struct virtio_pci_vq_info { /* the number of entries in the queue */ int num; - /* the virtual address of the ring queue */ + /* the ring queue */ void *queue; + dma_addr_t queue_dma_addr; /* bus address */ /* the list node for the virtqueues list */ struct list_head node; diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c index 48bc9797e530..b5293e5f2af4 100644 --- a/drivers/virtio/virtio_pci_legacy.c +++ b/drivers/virtio/virtio_pci_legacy.c @@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, info->msix_vector = msix_vec; size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); - info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO); + info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size, + &info->queue_dma_addr, + GFP_KERNEL); if (info->queue == NULL) return ERR_PTR(-ENOMEM); /* activate the queue */ - iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, + iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ @@ -169,7 +171,8 @@ out_assign: vring_del_virtqueue(vq); out_activate_queue: 
iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); - free_pages_exact(info->queue, size); + dma_free_coherent(&vp_dev->pci_dev->dev, size, + info->queue, info->queue_dma_addr); return ERR_PTR(err); } @@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info) iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN)); - free_pages_exact(info->queue, size); + dma_free_coherent(&vp_dev->pci_dev->dev, size, + info->queue, info->queue_dma_addr); } static const struct virtio_config_ops virtio_pci_config_ops = { @@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device *vp_dev) return -ENODEV; } + rc = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64)); + if (rc) + rc = dma_set_mask_and_coherent(&pci_dev->dev, + DMA_BIT_MASK(32)); + if (rc) + dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA. Trying to continue, but this might not work.\n"); + rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy"); if (rc) return rc; diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c index 8e5cf194cc0b..fbe0bd1c4881 100644 --- a/drivers/virtio/virtio_pci_modern.c +++ b/drivers/virtio/virtio_pci_modern.c @@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num) return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES)); } -static void *alloc_virtqueue_pages(int *num) +static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev, + int *num, dma_addr_t *dma_addr) { void *pages; /* TODO: allocate each queue chunk individually */ for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) { - pages = alloc_pages_exact(vring_pci_size(*num), - GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN); + pages = dma_zalloc_coherent( + &vp_dev->pci_dev->dev, vring_pci_size(*num), + dma_addr, GFP_KERNEL|__GFP_NOWARN); if (pages) return pages; } @@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num) return NULL; /* Try to get a single page. You are my only ho
[PATCH v3 1/3] virtio_net: Stop doing DMA from the stack
From: Andy Lutomirski Once virtio starts using the DMA API, we won't be able to safely DMA from the stack. virtio-net does a couple of config DMA requests from small stack buffers -- switch to using dynamically-allocated memory. This should have no effect on any performance-critical code paths. Cc: "Michael S. Tsirkin" Cc: virtualizat...@lists.linux-foundation.org Reviewed-by: Joerg Roedel Signed-off-by: Andy Lutomirski --- Hi Michael and DaveM- This is a prerequisite for the virtio DMA fixing project. It works as a standalone patch, though. Would it make sense to apply it to an appropriate networking tree now? (This is unchanged from v2.) drivers/net/virtio_net.c | 53 1 file changed, 36 insertions(+), 17 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index d8838dedb7a4..4f10f8a58811 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, struct scatterlist *out) { struct scatterlist *sgs[4], hdr, stat; - struct virtio_net_ctrl_hdr ctrl; - virtio_net_ctrl_ack status = ~0; + + struct { + struct virtio_net_ctrl_hdr ctrl; + virtio_net_ctrl_ack status; + } *buf; + unsigned out_num = 0, tmp; + bool ret; /* Caller should know better */ BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)); - ctrl.class = class; - ctrl.cmd = cmd; + buf = kmalloc(sizeof(*buf), GFP_ATOMIC); + if (!buf) + return false; + buf->status = ~0; + + buf->ctrl.class = class; + buf->ctrl.cmd = cmd; /* Add header */ - sg_init_one(&hdr, &ctrl, sizeof(ctrl)); + sg_init_one(&hdr, &buf->ctrl, sizeof(buf->ctrl)); sgs[out_num++] = &hdr; if (out) sgs[out_num++] = out; /* Add return status. 
*/ - sg_init_one(&stat, &status, sizeof(status)); + sg_init_one(&stat, &buf->status, sizeof(buf->status)); sgs[out_num] = &stat; BUG_ON(out_num + 1 > ARRAY_SIZE(sgs)); virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC); - if (unlikely(!virtqueue_kick(vi->cvq))) - return status == VIRTIO_NET_OK; + if (unlikely(!virtqueue_kick(vi->cvq))) { + ret = (buf->status == VIRTIO_NET_OK); + goto out; + } /* Spin for a response, the kick causes an ioport write, trapping * into the hypervisor, so the request should be handled immediately. @@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, !virtqueue_is_broken(vi->cvq)) cpu_relax(); - return status == VIRTIO_NET_OK; + ret = (buf->status == VIRTIO_NET_OK); + +out: + kfree(buf); + return ret; } static int virtnet_set_mac_address(struct net_device *dev, void *p) @@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev) { struct virtnet_info *vi = netdev_priv(dev); struct scatterlist sg[2]; - u8 promisc, allmulti; + u8 *cmdbyte; struct virtio_net_ctrl_mac *mac_data; struct netdev_hw_addr *ha; int uc_count; @@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device *dev) if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX)) return; - promisc = ((dev->flags & IFF_PROMISC) != 0); - allmulti = ((dev->flags & IFF_ALLMULTI) != 0); + cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC); + if (!cmdbyte) + return; - sg_init_one(sg, &promisc, sizeof(promisc)); + sg_init_one(sg, cmdbyte, sizeof(*cmdbyte)); + *cmdbyte = ((dev->flags & IFF_PROMISC) != 0); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX, VIRTIO_NET_CTRL_RX_PROMISC, sg)) dev_warn(&dev->dev, "Failed to %sable promisc mode.\n", -promisc ? "en" : "dis"); - - sg_init_one(sg, &allmulti, sizeof(allmulti)); +*cmdbyte ? 
"en" : "dis"); + *cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0); if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX, VIRTIO_NET_CTRL_RX_ALLMULTI, sg)) dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n", -allmulti ? "en" : "dis"); +*cmdbyte ? "en" : "dis"); + + kfree(cmdbyte); uc_count = netdev_uc_count(dev); mc_count = netdev_mc_count(dev); -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 0/3] virtio DMA API core stuff
This switches virtio to use the DMA API unconditionally.  I'm sure it
breaks things, but it seems to work on x86 using virtio-pci, with and
without Xen, and using both the modern 1.0 variant and the legacy
variant.

Changes from v2:
 - Fix really embarrassing bug.  This version actually works.

Changes from v1:
 - Fix an endian conversion error causing a BUG to hit.
 - Fix a DMA ordering issue (swiotlb=force works now).
 - Minor cleanups.

Andy Lutomirski (3):
  virtio_net: Stop doing DMA from the stack
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API

 drivers/net/virtio_net.c           |  53 +++
 drivers/virtio/Kconfig             |   2 +-
 drivers/virtio/virtio_pci_common.h |   3 +-
 drivers/virtio/virtio_pci_legacy.c |  19 +++-
 drivers/virtio/virtio_pci_modern.c |  34 +--
 drivers/virtio/virtio_ring.c       | 187 ++---
 tools/virtio/linux/dma-mapping.h   |  17
 7 files changed, 246 insertions(+), 69 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3
[PATCH v2 3/3] virtio_pci: Use the DMA API
From: Andy Lutomirski This fixes virtio-pci on platforms and busses that have IOMMUs. This will break the experimental QEMU Q35 IOMMU support until QEMU is fixed. In exchange, it fixes physical virtio hardware as well as virtio-pci running under Xen. We should clean up the virtqueue API to do its own allocation and teach virtqueue_get_avail and virtqueue_get_used to return DMA addresses directly. Signed-off-by: Andy Lutomirski --- drivers/virtio/virtio_pci_common.h | 3 ++- drivers/virtio/virtio_pci_legacy.c | 19 +++ drivers/virtio/virtio_pci_modern.c | 34 -- 3 files changed, 41 insertions(+), 15 deletions(-) diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h index b976d968e793..cd6196b513ad 100644 --- a/drivers/virtio/virtio_pci_common.h +++ b/drivers/virtio/virtio_pci_common.h @@ -38,8 +38,9 @@ struct virtio_pci_vq_info { /* the number of entries in the queue */ int num; - /* the virtual address of the ring queue */ + /* the ring queue */ void *queue; + dma_addr_t queue_dma_addr; /* bus address */ /* the list node for the virtqueues list */ struct list_head node; diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c index 48bc9797e530..b5293e5f2af4 100644 --- a/drivers/virtio/virtio_pci_legacy.c +++ b/drivers/virtio/virtio_pci_legacy.c @@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, info->msix_vector = msix_vec; size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); - info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO); + info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size, + &info->queue_dma_addr, + GFP_KERNEL); if (info->queue == NULL) return ERR_PTR(-ENOMEM); /* activate the queue */ - iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, + iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ @@ -169,7 +171,8 @@ out_assign: 
vring_del_virtqueue(vq); out_activate_queue: iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); - free_pages_exact(info->queue, size); + dma_free_coherent(&vp_dev->pci_dev->dev, size, + info->queue, info->queue_dma_addr); return ERR_PTR(err); } @@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info) iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN)); - free_pages_exact(info->queue, size); + dma_free_coherent(&vp_dev->pci_dev->dev, size, + info->queue, info->queue_dma_addr); } static const struct virtio_config_ops virtio_pci_config_ops = { @@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device *vp_dev) return -ENODEV; } + rc = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64)); + if (rc) + rc = dma_set_mask_and_coherent(&pci_dev->dev, + DMA_BIT_MASK(32)); + if (rc) + dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA. Trying to continue, but this might not work.\n"); + rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy"); if (rc) return rc; diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c index 8e5cf194cc0b..fbe0bd1c4881 100644 --- a/drivers/virtio/virtio_pci_modern.c +++ b/drivers/virtio/virtio_pci_modern.c @@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num) return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES)); } -static void *alloc_virtqueue_pages(int *num) +static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev, + int *num, dma_addr_t *dma_addr) { void *pages; /* TODO: allocate each queue chunk individually */ for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) { - pages = alloc_pages_exact(vring_pci_size(*num), - GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN); + pages = dma_zalloc_coherent( + &vp_dev->pci_dev->dev, vring_pci_size(*num), + dma_addr, GFP_KERNEL|__GFP_NOWARN); if (pages) return pages; } @@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num) return NULL; /* Try t
[PATCH v2 2/3] virtio_ring: Support DMA APIs
virtio_ring currently sends the device (usually a hypervisor) physical addresses of its I/O buffers. This is okay when DMA addresses and physical addresses are the same thing, but this isn't always the case. For example, this never works on Xen guests, and it is likely to fail if a physical "virtio" device ever ends up behind an IOMMU or swiotlb. The immediate use case for me is to enable virtio on Xen guests. For that to work, we need vring to support DMA address translation as well as a corresponding change to virtio_pci or to another driver. With this patch, if enabled, virtfs survives kmemleak and CONFIG_DMA_API_DEBUG. Signed-off-by: Andy Lutomirski --- drivers/virtio/Kconfig | 2 +- drivers/virtio/virtio_ring.c | 187 +++ tools/virtio/linux/dma-mapping.h | 17 3 files changed, 169 insertions(+), 37 deletions(-) create mode 100644 tools/virtio/linux/dma-mapping.h diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index cab9f3f63a38..77590320d44c 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -60,7 +60,7 @@ config VIRTIO_INPUT config VIRTIO_MMIO tristate "Platform bus driver for memory mapped virtio devices" - depends on HAS_IOMEM + depends on HAS_IOMEM && HAS_DMA select VIRTIO ---help--- This drivers provides support for memory mapped virtio diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 096b857e7b75..e71bf0ca59a2 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -24,6 +24,7 @@ #include #include #include +#include #ifdef DEBUG /* For development, we want to crash whenever the ring is screwed. */ @@ -54,7 +55,14 @@ #define END_USE(vq) #endif -struct vring_virtqueue { +struct vring_desc_state +{ + void *data; /* Data for callback. */ + struct vring_desc *indir_desc; /* Indirect descriptor, if any. 
*/ +}; + +struct vring_virtqueue +{ struct virtqueue vq; /* Actual memory layout for this queue */ @@ -92,12 +100,71 @@ struct vring_virtqueue { ktime_t last_add_time; #endif - /* Tokens for callbacks. */ - void *data[]; + /* Per-descriptor state. */ + struct vring_desc_state desc_state[]; }; #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) +/* + * The DMA ops on various arches are rather gnarly right now, and + * making all of the arch DMA ops work on the vring device itself + * is a mess. For now, we use the parent device for DMA ops. + */ +struct device *vring_dma_dev(const struct vring_virtqueue *vq) +{ + return vq->vq.vdev->dev.parent; +} + +/* Map one sg entry. */ +static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq, + struct scatterlist *sg, + enum dma_data_direction direction) +{ + /* +* We can't use dma_map_sg, because we don't use scatterlists in +* the way it expects (we don't guarantee that the scatterlist +* will exist for the lifetime of the mapping). +*/ + return dma_map_page(vring_dma_dev(vq), + sg_page(sg), sg->offset, sg->length, + direction); +} + +static dma_addr_t vring_map_single(const struct vring_virtqueue *vq, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + return dma_map_single(vring_dma_dev(vq), + cpu_addr, size, direction); +} + +static void vring_unmap_one(const struct vring_virtqueue *vq, + struct vring_desc *desc) +{ + u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); + + if (flags & VRING_DESC_F_INDIRECT) { + dma_unmap_single(vring_dma_dev(vq), +virtio64_to_cpu(vq->vq.vdev, desc->addr), +virtio32_to_cpu(vq->vq.vdev, desc->len), +(flags & VRING_DESC_F_WRITE) ? +DMA_FROM_DEVICE : DMA_TO_DEVICE); + } else { + dma_unmap_page(vring_dma_dev(vq), + virtio64_to_cpu(vq->vq.vdev, desc->addr), + virtio32_to_cpu(vq->vq.vdev, desc->len), + (flags & VRING_DESC_F_WRITE) ? 
+ DMA_FROM_DEVICE : DMA_TO_DEVICE); + } +} + +static int vring_mapping_error(const struct vring_virtqueue *vq, + dma_addr_t addr) +{ + return dma_mapping_error(vring_dma_dev(vq), addr); +} + static struct vring_desc *alloc_indirect(struct virtqueue *_vq, unsigned int total_sg, gfp_t gfp) { @@ -131
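The patch above replaces the flat `void *data[]` token array with per-descriptor state and unmaps descriptors while walking the chain back onto the free list. The chain/free-list mechanics themselves can be modeled in a few lines of self-contained userspace C (a sketch, not the kernel code: `struct desc`, `ring_alloc_chain`, and `ring_free_chain` are invented stand-ins for the vring descriptor table, `virtqueue_add`, and `detach_buf`, with endianness and DMA mapping omitted):

```c
#include <assert.h>
#include <stdint.h>

#define QSZ 8
#define F_NEXT 0x1

/* Stripped-down model of a vring descriptor: only the fields the
 * free-list logic touches (flags and the next index). */
struct desc {
	uint16_t flags;
	uint16_t next;
};

struct ring {
	struct desc d[QSZ];
	uint16_t free_head;
	unsigned num_free;
};

static void ring_init(struct ring *r)
{
	unsigned i;

	r->free_head = 0;
	r->num_free = QSZ;
	for (i = 0; i < QSZ; i++)
		r->d[i].next = i + 1;	/* chain every slot into the free list */
}

/* Take n descriptors off the free list and chain them together,
 * as the add path does. */
static uint16_t ring_alloc_chain(struct ring *r, unsigned n)
{
	uint16_t head = r->free_head, i = head;
	unsigned k;

	for (k = 0; k + 1 < n; k++) {
		r->d[i].flags = F_NEXT;
		i = r->d[i].next;
	}
	r->d[i].flags = 0;		/* last descriptor ends the chain */
	r->free_head = r->d[i].next;
	r->num_free -= n;
	return head;
}

/* Walk the chain from head and splice it back onto the free list,
 * as detach_buf does (this is also where each entry would be unmapped). */
static void ring_free_chain(struct ring *r, uint16_t head)
{
	uint16_t i = head;

	while (r->d[i].flags & F_NEXT) {
		i = r->d[i].next;
		r->num_free++;
	}
	r->d[i].next = r->free_head;
	r->free_head = head;
	r->num_free++;			/* plus the final descriptor */
}
```

Allocating a 3-descriptor chain and freeing it restores `num_free` to the queue size and makes the freed head the next allocation candidate, which is the invariant the real `detach_buf` maintains.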
[PATCH v2 1/3] virtio_net: Stop doing DMA from the stack
From: Andy Lutomirski Once virtio starts using the DMA API, we won't be able to safely DMA from the stack. virtio-net does a couple of config DMA requests from small stack buffers -- switch to using dynamically-allocated memory. This should have no effect on any performance-critical code paths. Cc: net...@vger.kernel.org Cc: "Michael S. Tsirkin" Cc: virtualizat...@lists.linux-foundation.org Reviewed-by: Joerg Roedel Signed-off-by: Andy Lutomirski --- Hi Michael and DaveM- This is a prerequisite for the virtio DMA fixing project. It works as a standalone patch, though. Would it make sense to apply it to an appropriate networking tree now? drivers/net/virtio_net.c | 53 1 file changed, 36 insertions(+), 17 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index d8838dedb7a4..4f10f8a58811 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, struct scatterlist *out) { struct scatterlist *sgs[4], hdr, stat; - struct virtio_net_ctrl_hdr ctrl; - virtio_net_ctrl_ack status = ~0; + + struct { + struct virtio_net_ctrl_hdr ctrl; + virtio_net_ctrl_ack status; + } *buf; + unsigned out_num = 0, tmp; + bool ret; /* Caller should know better */ BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)); - ctrl.class = class; - ctrl.cmd = cmd; + buf = kmalloc(sizeof(*buf), GFP_ATOMIC); + if (!buf) + return false; + buf->status = ~0; + + buf->ctrl.class = class; + buf->ctrl.cmd = cmd; /* Add header */ - sg_init_one(&hdr, &ctrl, sizeof(ctrl)); + sg_init_one(&hdr, &buf->ctrl, sizeof(buf->ctrl)); sgs[out_num++] = &hdr; if (out) sgs[out_num++] = out; /* Add return status. 
 	 */
-	sg_init_one(&stat, &status, sizeof(status));
+	sg_init_one(&stat, &buf->status, sizeof(buf->status));
 	sgs[out_num] = &stat;
 
 	BUG_ON(out_num + 1 > ARRAY_SIZE(sgs));
 	virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC);
 
-	if (unlikely(!virtqueue_kick(vi->cvq)))
-		return status == VIRTIO_NET_OK;
+	if (unlikely(!virtqueue_kick(vi->cvq))) {
+		ret = (buf->status == VIRTIO_NET_OK);
+		goto out;
+	}
 
 	/* Spin for a response, the kick causes an ioport write, trapping
 	 * into the hypervisor, so the request should be handled immediately.
@@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 	       !virtqueue_is_broken(vi->cvq))
 		cpu_relax();
 
-	return status == VIRTIO_NET_OK;
+	ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+	kfree(buf);
+	return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct scatterlist sg[2];
-	u8 promisc, allmulti;
+	u8 *cmdbyte;
 	struct virtio_net_ctrl_mac *mac_data;
 	struct netdev_hw_addr *ha;
 	int uc_count;
@@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
 		return;
 
-	promisc = ((dev->flags & IFF_PROMISC) != 0);
-	allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+	cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+	if (!cmdbyte)
+		return;
 
-	sg_init_one(sg, &promisc, sizeof(promisc));
+	sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
+	*cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
 		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
-			 promisc ? "en" : "dis");
-
-	sg_init_one(sg, &allmulti, sizeof(allmulti));
+			 *cmdbyte ? "en" : "dis");
 
+	*cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
 		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
-			 allmulti ? "en" : "dis");
+			 *cmdbyte ? "en" : "dis");
+
+	kfree(cmdbyte);
 
 	uc_count = netdev_uc_count(dev);
 	mc_count = netdev_mc_count(dev);
-- 
2.4.3
[PATCH v2 0/3] virtio DMA API core stuff
This switches virtio to use the DMA API unconditionally.  I'm sure it
breaks things, but it seems to work on x86 using virtio-pci, with and
without Xen, and using both the modern 1.0 variant and the legacy
variant.

Changes from v1:
 - Fix an endian conversion error causing a BUG to hit.
 - Fix a DMA ordering issue (swiotlb=force works now).
 - Minor cleanups.

Andy Lutomirski (3):
  virtio_net: Stop doing DMA from the stack
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API

 drivers/net/virtio_net.c           |  53 +++
 drivers/virtio/Kconfig             |   2 +-
 drivers/virtio/virtio_pci_common.h |   3 +-
 drivers/virtio/virtio_pci_legacy.c |  19 +++-
 drivers/virtio/virtio_pci_modern.c |  34 +--
 drivers/virtio/virtio_ring.c       | 187 ++---
 tools/virtio/linux/dma-mapping.h   |  17
 7 files changed, 246 insertions(+), 69 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3
Re: [PATCH 3/3] x86: context_tracking: avoid irq_save/irq_restore on kernel entry and exit
On Tue, Oct 27, 2015 at 6:39 PM, Paolo Bonzini wrote:
> x86 always calls user_enter and user_exit with interrupts disabled.
> Therefore, it is not necessary to check for exceptions, nor to
> save/restore the IRQ state, when context tracking functions are
> called by guest_enter and guest_exit.
>
> Use the previously introduced __context_tracking_entry and
> __context_tracking_exit.

x86 isn't ready for this yet.  We could do a quick-and-dirty fix with
explicit IRQs-on-and-off, mostly protected by the static key, or we
could just wait until I finish the syscall cleanup.  I favor the
latter, but you're all welcome to do the former and I'll review it.

BTW, Frederic, I have a static key patch coming that I think you'll
like.  I'll send it tomorrow once I'm in front of a real computer.

--Andy
Re: [PATCH 2/3] context_tracking: avoid irq_save/irq_restore on guest entry and exit
On Tue, Oct 27, 2015 at 6:39 PM, Paolo Bonzini wrote:
> guest_enter and guest_exit must be called with interrupts disabled,
> since they take the vtime_seqlock with write_seq{lock,unlock}.
> Therefore, it is not necessary to check for exceptions, nor to
> save/restore the IRQ state, when context tracking functions are
> called by guest_enter and guest_exit.
>
> Split the body of context_tracking_entry and context_tracking_exit
> out to __-prefixed functions, and use them from KVM.
>
> Rik van Riel has measured this to speed up a tight vmentry/vmexit
> loop by about 2%.

Looks generally sensible.  I'm not familiar enough with the code to
call it reviewed-by while sitting on the airport shuttle.

--Andy
Re: [PATCH 1/3] context_tracking: remove duplicate enabled check
On Tue, Oct 27, 2015 at 6:39 PM, Paolo Bonzini wrote:
> All calls to context_tracking_enter and context_tracking_exit
> are already checking context_tracking_is_enabled, except the
> context_tracking_user_enter and context_tracking_user_exit
> functions left in for the benefit of assembly calls.
>
> Pull the check up to those functions, by making them simple
> wrappers around the user_enter and user_exit inline functions.

This makes my brain hurt.  Assuming that this survives a boot with
CONFIG_CONTEXT_TRACKING_FORCE=y and CONFIG_PROVE_LOCKING=y (with the
implied CONFIG_PROVE_RCU), then:

Acked-by: Andy Lutomirski
Re: [PATCH 2/3] virtio_ring: Support DMA APIs
On Tue, Oct 27, 2015 at 7:21 PM, Joerg Roedel wrote:
> On Tue, Oct 27, 2015 at 07:13:56PM -0700, Andy Lutomirski wrote:
>> On Tue, Oct 27, 2015 at 7:06 PM, Joerg Roedel wrote:
>> > Hi Andy,
>> >
>> > On Tue, Oct 27, 2015 at 06:17:09PM -0700, Andy Lutomirski wrote:
>> >> From: Andy Lutomirski
>> >>
>> >> virtio_ring currently sends the device (usually a hypervisor)
>> >> physical addresses of its I/O buffers.  This is okay when DMA
>> >> addresses and physical addresses are the same thing, but this isn't
>> >> always the case.  For example, this never works on Xen guests, and
>> >> it is likely to fail if a physical "virtio" device ever ends up
>> >> behind an IOMMU or swiotlb.
>> >
>> > The overall code looks good, but I havn't seen and dma_sync* calls.
>> > When swiotlb=force is in use this would break.
>> >
>> >> +	vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, vring_map_single(
>> >> +		vq,
>> >> +		desc, total_sg * sizeof(struct vring_desc),
>> >> +		DMA_TO_DEVICE));
>>
>> Are you talking about a dma_sync call on the descriptor ring itself?
>> Isn't dma_alloc_coherent supposed to make that unnecessary?  I should
>> move the allocation into the virtqueue code.
>>
>> The docs suggest that I might need to "flush the processor's write
>> buffers before telling devices to read that memory".  I'm not sure how
>> to do that.
>
> The write buffers should be flushed by the dma-api functions if
> necessary.  For dma_alloc_coherent allocations you don't need to call
> dma_sync*, but for the map_single/map_page/map_sg ones, as these might
> be bounce-buffered.

I think that all the necessary barriers are already there.  I had a
nasty bug that swiotlb=force exposed, though, which I've fixed.

--Andy
Re: [PATCH 2/3] virtio_ring: Support DMA APIs
On Tue, Oct 27, 2015 at 7:27 PM, Christian Borntraeger wrote: > Am 28.10.2015 um 10:17 schrieb Andy Lutomirski: > @@ -423,27 +522,42 @@ EXPORT_SYMBOL_GPL(virtqueue_kick); >> >> static void detach_buf(struct vring_virtqueue *vq, unsigned int head) >> { >> - unsigned int i; >> + unsigned int i, j; >> + u16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT); >> >> /* Clear data ptr. */ >> - vq->data[head] = NULL; >> + vq->desc_state[head].data = NULL; >> >> - /* Put back on free list: find end */ >> + /* Put back on free list: unmap first-level descriptors and find end */ >> i = head; >> >> - /* Free the indirect table */ >> - if (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, >> VRING_DESC_F_INDIRECT)) >> - kfree(phys_to_virt(virtio64_to_cpu(vq->vq.vdev, >> vq->vring.desc[i].addr))); >> - >> - while (vq->vring.desc[i].flags & cpu_to_virtio16(vq->vq.vdev, >> VRING_DESC_F_NEXT)) { >> + while (vq->vring.desc[i].flags & nextflag) { >> + vring_unmap_one(vq, &vq->vring.desc[i]); >> i = virtio16_to_cpu(vq->vq.vdev, vq->vring.desc[i].next); >> vq->vq.num_free++; >> } >> >> + vring_unmap_one(vq, &vq->vring.desc[i]); >> vq->vring.desc[i].next = cpu_to_virtio16(vq->vq.vdev, vq->free_head); >> vq->free_head = head; >> + >> /* Plus final descriptor */ >> vq->vq.num_free++; >> + >> + /* Free the indirect table, if any, now that it's unmapped. */ >> + if (vq->desc_state[head].indir_desc) { >> + struct vring_desc *indir_desc = >> vq->desc_state[head].indir_desc; >> + u32 len = vq->vring.desc[head].len; >> + >> + BUG_ON(!(vq->vring.desc[head].flags & VRING_DESC_F_INDIRECT)); >> + BUG_ON(len == 0 || len % sizeof(struct vring_desc)); >> + >> + for (j = 0; j < len / sizeof(struct vring_desc); j++) >> + vring_unmap_one(vq, &indir_desc[j]); >> + >> + kfree(vq->desc_state[head].indir_desc); >> + vq->desc_state[head].indir_desc = NULL; >> + } >> } > > something seems to be broken with indirect descriptors with that change You're big endian, right? 
I had an endian handling bug, and I'll fix it in v2.

--Andy
Re: [PATCH 1/3] virtio_net: Stop doing DMA from the stack
On Tue, Oct 27, 2015 at 7:07 PM, Joerg Roedel wrote:
> On Tue, Oct 27, 2015 at 06:17:08PM -0700, Andy Lutomirski wrote:
>> From: Andy Lutomirski
>>
>> Once virtio starts using the DMA API, we won't be able to safely DMA
>> from the stack.  virtio-net does a couple of config DMA requests
>> from small stack buffers -- switch to using dynamically-allocated
>> memory.
>>
>> This should have no effect on any performance-critical code paths.
>>
>> Signed-off-by: Andy Lutomirski
>> ---
>>  drivers/net/virtio_net.c | 53
>>  1 file changed, 36 insertions(+), 17 deletions(-)
>
> Reviewed-by: Joerg Roedel

Should I just send this to netdev now for 4.4?  We could get this one
out of the way, maybe.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
Re: [PATCH 2/3] virtio_ring: Support DMA APIs
On Tue, Oct 27, 2015 at 7:06 PM, Joerg Roedel wrote:
> Hi Andy,
>
> On Tue, Oct 27, 2015 at 06:17:09PM -0700, Andy Lutomirski wrote:
>> From: Andy Lutomirski
>>
>> virtio_ring currently sends the device (usually a hypervisor)
>> physical addresses of its I/O buffers.  This is okay when DMA
>> addresses and physical addresses are the same thing, but this isn't
>> always the case.  For example, this never works on Xen guests, and
>> it is likely to fail if a physical "virtio" device ever ends up
>> behind an IOMMU or swiotlb.
>
> The overall code looks good, but I havn't seen and dma_sync* calls.
> When swiotlb=force is in use this would break.
>
>> +	vq->vring.desc[head].addr = cpu_to_virtio64(_vq->vdev, vring_map_single(
>> +		vq,
>> +		desc, total_sg * sizeof(struct vring_desc),
>> +		DMA_TO_DEVICE));

Are you talking about a dma_sync call on the descriptor ring itself?
Isn't dma_alloc_coherent supposed to make that unnecessary?  I should
move the allocation into the virtqueue code.

The docs suggest that I might need to "flush the processor's write
buffers before telling devices to read that memory".  I'm not sure how
to do that.

> Nit: I think readability could be improved here by using a temporary
> variable for the return value of vring_map_single().

I'll do something like that for v2.

--Andy
[PATCH 1/3] virtio_net: Stop doing DMA from the stack
From: Andy Lutomirski Once virtio starts using the DMA API, we won't be able to safely DMA from the stack. virtio-net does a couple of config DMA requests from small stack buffers -- switch to using dynamically-allocated memory. This should have no effect on any performance-critical code paths. Signed-off-by: Andy Lutomirski --- drivers/net/virtio_net.c | 53 1 file changed, 36 insertions(+), 17 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index d8838dedb7a4..4f10f8a58811 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -976,31 +976,43 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd, struct scatterlist *out) { struct scatterlist *sgs[4], hdr, stat; - struct virtio_net_ctrl_hdr ctrl; - virtio_net_ctrl_ack status = ~0; + + struct { + struct virtio_net_ctrl_hdr ctrl; + virtio_net_ctrl_ack status; + } *buf; + unsigned out_num = 0, tmp; + bool ret; /* Caller should know better */ BUG_ON(!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_VQ)); - ctrl.class = class; - ctrl.cmd = cmd; + buf = kmalloc(sizeof(*buf), GFP_ATOMIC); + if (!buf) + return false; + buf->status = ~0; + + buf->ctrl.class = class; + buf->ctrl.cmd = cmd; /* Add header */ - sg_init_one(&hdr, &ctrl, sizeof(ctrl)); + sg_init_one(&hdr, &buf->ctrl, sizeof(buf->ctrl)); sgs[out_num++] = &hdr; if (out) sgs[out_num++] = out; /* Add return status. */ - sg_init_one(&stat, &status, sizeof(status)); + sg_init_one(&stat, &buf->status, sizeof(buf->status)); sgs[out_num] = &stat; BUG_ON(out_num + 1 > ARRAY_SIZE(sgs)); virtqueue_add_sgs(vi->cvq, sgs, out_num, 1, vi, GFP_ATOMIC); - if (unlikely(!virtqueue_kick(vi->cvq))) - return status == VIRTIO_NET_OK; + if (unlikely(!virtqueue_kick(vi->cvq))) { + ret = (buf->status == VIRTIO_NET_OK); + goto out; + } /* Spin for a response, the kick causes an ioport write, trapping * into the hypervisor, so the request should be handled immediately. 
@@ -1009,7 +1021,11 @@ static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
 	       !virtqueue_is_broken(vi->cvq))
 		cpu_relax();
 
-	return status == VIRTIO_NET_OK;
+	ret = (buf->status == VIRTIO_NET_OK);
+
+out:
+	kfree(buf);
+	return ret;
 }
 
 static int virtnet_set_mac_address(struct net_device *dev, void *p)
@@ -1151,7 +1167,7 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct scatterlist sg[2];
-	u8 promisc, allmulti;
+	u8 *cmdbyte;
 	struct virtio_net_ctrl_mac *mac_data;
 	struct netdev_hw_addr *ha;
 	int uc_count;
@@ -1163,22 +1179,25 @@ static void virtnet_set_rx_mode(struct net_device *dev)
 	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
 		return;
 
-	promisc = ((dev->flags & IFF_PROMISC) != 0);
-	allmulti = ((dev->flags & IFF_ALLMULTI) != 0);
+	cmdbyte = kmalloc(sizeof(*cmdbyte), GFP_ATOMIC);
+	if (!cmdbyte)
+		return;
 
-	sg_init_one(sg, &promisc, sizeof(promisc));
+	sg_init_one(sg, cmdbyte, sizeof(*cmdbyte));
+	*cmdbyte = ((dev->flags & IFF_PROMISC) != 0);
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_PROMISC, sg))
 		dev_warn(&dev->dev, "Failed to %sable promisc mode.\n",
-			 promisc ? "en" : "dis");
-
-	sg_init_one(sg, &allmulti, sizeof(allmulti));
+			 *cmdbyte ? "en" : "dis");
 
+	*cmdbyte = ((dev->flags & IFF_ALLMULTI) != 0);
 	if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_RX,
 				  VIRTIO_NET_CTRL_RX_ALLMULTI, sg))
 		dev_warn(&dev->dev, "Failed to %sable allmulti mode.\n",
-			 allmulti ? "en" : "dis");
+			 *cmdbyte ? "en" : "dis");
+
+	kfree(cmdbyte);
 
 	uc_count = netdev_uc_count(dev);
 	mc_count = netdev_mc_count(dev);
-- 
2.4.3
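The shape of this refactor — moving the control header and status byte from the stack into a single heap allocation that stays valid for the whole request, with one exit path that frees it — can be sketched in self-contained userspace C. This is an analogy, not the driver code: `malloc`/`free` stand in for `kmalloc`/`kfree`, and the invented `fake_device` replaces the control virtqueue round trip:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NET_OK 0

struct ctrl_hdr {
	uint8_t class;
	uint8_t cmd;
};

struct ctrl_buf {
	struct ctrl_hdr ctrl;	/* written by the driver, read by the "device" */
	uint8_t status;		/* written by the "device", read by the driver */
};

/* Stand-in for the hypervisor handling the control request. */
static void fake_device(struct ctrl_buf *buf)
{
	buf->status = NET_OK;
}

static bool send_command(uint8_t class, uint8_t cmd)
{
	bool ret;
	/* One heap allocation covering the request's full lifetime; a
	 * stack buffer would be unsafe once the transfer can complete
	 * asynchronously, after this frame is gone. */
	struct ctrl_buf *buf = malloc(sizeof(*buf));

	if (!buf)
		return false;

	buf->ctrl.class = class;
	buf->ctrl.cmd = cmd;
	buf->status = 0xff;	/* ~0: not yet answered */

	fake_device(buf);	/* in the driver: add sgs, kick, spin */

	ret = (buf->status == NET_OK);
	free(buf);		/* every path funnels through one free */
	return ret;
}
```

The `goto out` in the patch serves the same purpose as the single `free` here: once the buffer is dynamic, every early return must release it, so the exits are funneled through one cleanup point.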
[PATCH 0/3] virtio DMA API core stuff
This switches virtio to use the DMA API unconditionally.  I'm sure it
breaks things, but it seems to work on x86 using virtio-pci, with and
without Xen, and using both the modern 1.0 variant and the legacy
variant.

Andy Lutomirski (3):
  virtio_net: Stop doing DMA from the stack
  virtio_ring: Support DMA APIs
  virtio_pci: Use the DMA API

 drivers/net/virtio_net.c           |  53 +++
 drivers/virtio/Kconfig             |   2 +-
 drivers/virtio/virtio_pci_common.h |   3 +-
 drivers/virtio/virtio_pci_legacy.c |  19 +++-
 drivers/virtio/virtio_pci_modern.c |  34 ---
 drivers/virtio/virtio_ring.c       | 179 ++---
 tools/virtio/linux/dma-mapping.h   |  17
 7 files changed, 241 insertions(+), 66 deletions(-)
 create mode 100644 tools/virtio/linux/dma-mapping.h

-- 
2.4.3
[PATCH 3/3] virtio_pci: Use the DMA API
From: Andy Lutomirski This fixes virtio-pci on platforms and busses that have IOMMUs. This will break the experimental QEMU Q35 IOMMU support until QEMU is fixed. In exchange, it fixes physical virtio hardware as well as virtio-pci running under Xen. We should clean up the virtqueue API to do its own allocation and teach virtqueue_get_avail and virtqueue_get_used to return DMA addresses directly. Signed-off-by: Andy Lutomirski --- drivers/virtio/virtio_pci_common.h | 3 ++- drivers/virtio/virtio_pci_legacy.c | 19 +++ drivers/virtio/virtio_pci_modern.c | 34 -- 3 files changed, 41 insertions(+), 15 deletions(-) diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h index b976d968e793..cd6196b513ad 100644 --- a/drivers/virtio/virtio_pci_common.h +++ b/drivers/virtio/virtio_pci_common.h @@ -38,8 +38,9 @@ struct virtio_pci_vq_info { /* the number of entries in the queue */ int num; - /* the virtual address of the ring queue */ + /* the ring queue */ void *queue; + dma_addr_t queue_dma_addr; /* bus address */ /* the list node for the virtqueues list */ struct list_head node; diff --git a/drivers/virtio/virtio_pci_legacy.c b/drivers/virtio/virtio_pci_legacy.c index 48bc9797e530..b5293e5f2af4 100644 --- a/drivers/virtio/virtio_pci_legacy.c +++ b/drivers/virtio/virtio_pci_legacy.c @@ -135,12 +135,14 @@ static struct virtqueue *setup_vq(struct virtio_pci_device *vp_dev, info->msix_vector = msix_vec; size = PAGE_ALIGN(vring_size(num, VIRTIO_PCI_VRING_ALIGN)); - info->queue = alloc_pages_exact(size, GFP_KERNEL|__GFP_ZERO); + info->queue = dma_zalloc_coherent(&vp_dev->pci_dev->dev, size, + &info->queue_dma_addr, + GFP_KERNEL); if (info->queue == NULL) return ERR_PTR(-ENOMEM); /* activate the queue */ - iowrite32(virt_to_phys(info->queue) >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, + iowrite32(info->queue_dma_addr >> VIRTIO_PCI_QUEUE_ADDR_SHIFT, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); /* create the vring */ @@ -169,7 +171,8 @@ out_assign: 
vring_del_virtqueue(vq); out_activate_queue: iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); - free_pages_exact(info->queue, size); + dma_free_coherent(&vp_dev->pci_dev->dev, size, + info->queue, info->queue_dma_addr); return ERR_PTR(err); } @@ -194,7 +197,8 @@ static void del_vq(struct virtio_pci_vq_info *info) iowrite32(0, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_PFN); size = PAGE_ALIGN(vring_size(info->num, VIRTIO_PCI_VRING_ALIGN)); - free_pages_exact(info->queue, size); + dma_free_coherent(&vp_dev->pci_dev->dev, size, + info->queue, info->queue_dma_addr); } static const struct virtio_config_ops virtio_pci_config_ops = { @@ -227,6 +231,13 @@ int virtio_pci_legacy_probe(struct virtio_pci_device *vp_dev) return -ENODEV; } + rc = dma_set_mask_and_coherent(&pci_dev->dev, DMA_BIT_MASK(64)); + if (rc) + rc = dma_set_mask_and_coherent(&pci_dev->dev, + DMA_BIT_MASK(32)); + if (rc) + dev_warn(&pci_dev->dev, "Failed to enable 64-bit or 32-bit DMA. Trying to continue, but this might not work.\n"); + rc = pci_request_region(pci_dev, 0, "virtio-pci-legacy"); if (rc) return rc; diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c index 8e5cf194cc0b..fbe0bd1c4881 100644 --- a/drivers/virtio/virtio_pci_modern.c +++ b/drivers/virtio/virtio_pci_modern.c @@ -293,14 +293,16 @@ static size_t vring_pci_size(u16 num) return PAGE_ALIGN(vring_size(num, SMP_CACHE_BYTES)); } -static void *alloc_virtqueue_pages(int *num) +static void *alloc_virtqueue_pages(struct virtio_pci_device *vp_dev, + int *num, dma_addr_t *dma_addr) { void *pages; /* TODO: allocate each queue chunk individually */ for (; *num && vring_pci_size(*num) > PAGE_SIZE; *num /= 2) { - pages = alloc_pages_exact(vring_pci_size(*num), - GFP_KERNEL|__GFP_ZERO|__GFP_NOWARN); + pages = dma_zalloc_coherent( + &vp_dev->pci_dev->dev, vring_pci_size(*num), + dma_addr, GFP_KERNEL|__GFP_NOWARN); if (pages) return pages; } @@ -309,7 +311,9 @@ static void *alloc_virtqueue_pages(int *num) return NULL; /* Try t
[PATCH 2/3] virtio_ring: Support DMA APIs
From: Andy Lutomirski virtio_ring currently sends the device (usually a hypervisor) physical addresses of its I/O buffers. This is okay when DMA addresses and physical addresses are the same thing, but this isn't always the case. For example, this never works on Xen guests, and it is likely to fail if a physical "virtio" device ever ends up behind an IOMMU or swiotlb. The immediate use case for me is to enable virtio on Xen guests. For that to work, we need vring to support DMA address translation as well as a corresponding change to virtio_pci or to another driver. With this patch, if enabled, virtfs survives kmemleak and CONFIG_DMA_API_DEBUG. Signed-off-by: Andy Lutomirski --- drivers/virtio/Kconfig | 2 +- drivers/virtio/virtio_ring.c | 179 +++ tools/virtio/linux/dma-mapping.h | 17 3 files changed, 164 insertions(+), 34 deletions(-) create mode 100644 tools/virtio/linux/dma-mapping.h diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig index cab9f3f63a38..77590320d44c 100644 --- a/drivers/virtio/Kconfig +++ b/drivers/virtio/Kconfig @@ -60,7 +60,7 @@ config VIRTIO_INPUT config VIRTIO_MMIO tristate "Platform bus driver for memory mapped virtio devices" - depends on HAS_IOMEM + depends on HAS_IOMEM && HAS_DMA select VIRTIO ---help--- This drivers provides support for memory mapped virtio diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 096b857e7b75..911fbaa7a260 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -24,6 +24,7 @@ #include #include #include +#include #ifdef DEBUG /* For development, we want to crash whenever the ring is screwed. */ @@ -54,7 +55,14 @@ #define END_USE(vq) #endif -struct vring_virtqueue { +struct vring_desc_state +{ + void *data; /* Data for callback. */ + struct vring_desc *indir_desc; /* Indirect descriptor, if any. 
*/ +}; + +struct vring_virtqueue +{ struct virtqueue vq; /* Actual memory layout for this queue */ @@ -92,12 +100,71 @@ struct vring_virtqueue { ktime_t last_add_time; #endif - /* Tokens for callbacks. */ - void *data[]; + /* Per-descriptor state. */ + struct vring_desc_state desc_state[]; }; #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq) +/* + * The DMA ops on various arches are rather gnarly right now, and + * making all of the arch DMA ops work on the vring device itself + * is a mess. For now, we use the parent device for DMA ops. + */ +struct device *vring_dma_dev(const struct vring_virtqueue *vq) +{ + return vq->vq.vdev->dev.parent; +} + +/* Map one sg entry. */ +static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq, + struct scatterlist *sg, + enum dma_data_direction direction) +{ + /* +* We can't use dma_map_sg, because we don't use scatterlists in +* the way it expects (we don't guarantee that the scatterlist +* will exist for the lifetime of the mapping). +*/ + return dma_map_page(vring_dma_dev(vq), + sg_page(sg), sg->offset, sg->length, + direction); +} + +static dma_addr_t vring_map_single(const struct vring_virtqueue *vq, + void *cpu_addr, size_t size, + enum dma_data_direction direction) +{ + return dma_map_single(vring_dma_dev(vq), + cpu_addr, size, direction); +} + +static void vring_unmap_one(const struct vring_virtqueue *vq, + struct vring_desc *desc) +{ + u16 flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); + + if (flags & VRING_DESC_F_INDIRECT) { + dma_unmap_single(vring_dma_dev(vq), +virtio64_to_cpu(vq->vq.vdev, desc->addr), +virtio32_to_cpu(vq->vq.vdev, desc->len), +(flags & VRING_DESC_F_WRITE) ? +DMA_FROM_DEVICE : DMA_TO_DEVICE); + } else { + dma_unmap_page(vring_dma_dev(vq), + virtio64_to_cpu(vq->vq.vdev, desc->addr), + virtio32_to_cpu(vq->vq.vdev, desc->len), + (flags & VRING_DESC_F_WRITE) ? 
+ DMA_FROM_DEVICE : DMA_TO_DEVICE); + } +} + +static int vring_mapping_error(const struct vring_virtqueue *vq, + dma_addr_t addr) +{ + return dma_mapping_error(vring_dma_dev(vq), addr); +} + static struct vring_desc *alloc_indirect(struct virtqueue *_vq, unsigned
Re: [PATCH 26/26] x86, pkeys: Documentation
On Thu, Oct 1, 2015 at 3:33 PM, Dave Hansen wrote: > On 10/01/2015 01:39 PM, Kees Cook wrote: >> On Thu, Oct 1, 2015 at 4:17 AM, Ingo Molnar wrote: >>> So could we try to add an (opt-in) kernel option that enables this >>> transparently >>> and automatically for all PROT_EXEC && !PROT_WRITE mappings, without any >>> user-space changes and syscalls necessary? >> >> I would like this very much. :) > > Here it is in a quite fugly form (well, it's not opt-in). Init crashes > if I boot with this, though. Somebody really ought to rework things so that a crash in init prints out a normal indication of the unhandled signal and optionally leaves everything else running. Also... EPT seems to have separate R, W, and X flags. I wonder if it would make sense to add a KVM paravirt feature that maps the entire guest physical space an extra time at a monstrous offset with R cleared in the EPT and passes through a #PF or other notification (KVM-specific thing? #VE?) on a read fault. This wouldn't even need a whole duplicate paging hierarchy -- it would just duplicate the EPT PML4 entries, so it would add exactly zero runtime memory usage. The guest would use it by treating the high bit of the physical address as a "may read" bit. This reminds me -- we should probably wire up X86_TRAP_VE with a stub that OOPSes until someone figures out some more useful thing to do. We're probably not doing anyone any favors by unconditionally promoting them to double-faults. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Wed, Sep 30, 2015 at 7:01 AM, Ingo Molnar wrote: > > * Peter Zijlstra wrote: > >> On Mon, Sep 21, 2015 at 09:36:15AM -0700, Linus Torvalds wrote: >> > On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar wrote: >> > > >> > > Linus, what's your preference? >> > >> > So quite frankly, is there any reason we don't just implement >> > native_read_msr() as just >> > >> >unsigned long long native_read_msr(unsigned int msr) >> >{ >> > int err; >> > unsigned long long val; >> > >> > val = native_read_msr_safe(msr, &err); >> > WARN_ON_ONCE(err); >> > return val; >> >} >> > >> > Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be >> > done with it. I don't see the downside. >> > >> > How many msr reads are so critical that the function call >> > overhead would matter? Get rid of the inline version of the _safe() >> > thing too, and put that thing there too. >> >> There are a few in the perf code, and esp. on cores without a stack engine >> the >> call overhead is noticeable. Also note that the perf MSRs are generally >> optimized MSRs and less slow (we cannot say fast, they're still MSRs) than >> regular MSRs. > > These could still be open coded in an inlined fashion, like the scheduler > usage. > We could have a raw_rdmsr for those. OTOH, I'm still not 100% convinced that this warn-but-don't-die behavior is worth the effort. This isn't a frequent source of bugs to my knowledge, and we don't try to recover from incorrect cr writes, out-of-bounds MMIO, etc, so do we really gain much by rigging a recovery mechanism for rdmsr and wrmsr failures for code that doesn't use the _safe variants? --Andy > Thanks, > > Ingo -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On Tue, Sep 29, 2015 at 11:18 AM, H. Peter Anvin wrote: > SGDT would be easy to use, and it is logical that it is faster since it reads > an internal register. SIDT does too, but unlike the GDT, the IDT has a secondary limit > (it can never be larger than 4096 bytes) and so all limits in the range > 4095-65535 are exactly equivalent. > Using the IDT limit would have been a great idea if Intel hadn't decided to clobber it on every VM exit. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On Tue, Sep 29, 2015 at 10:50 AM, Linus Torvalds wrote: > On Tue, Sep 29, 2015 at 1:35 PM, Andy Lutomirski wrote: >> >> Does anyone know what happens if you stick a non-accessed segment in >> the GDT, map the GDT RO, and access it? > > You should get a #PF, as you guess, but go ahead and test it if you > want to make sure. > Then I think that, if we do this, the patch series should first make it percpu and fixmapped but RW and then flip it RO as a separate patch in case we need to revert the actual RO bit. I don't want to break Wine or The Witcher 2 because of this, and we might need various fixups. I really hope that no one uses get_thread_area to check whether TLS has been accessed. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] x86: Use entire page for the per-cpu GDT only if paravirt-enabled
On Sep 29, 2015 2:01 AM, "Ingo Molnar" wrote: > > > * Denys Vlasenko wrote: > > > On 09/28/2015 09:58 AM, Ingo Molnar wrote: > > > > > > * Denys Vlasenko wrote: > > > > > >> On 09/26/2015 09:50 PM, H. Peter Anvin wrote: > > >>> NAK. We really should map the GDT read-only on all 64 bit systems, > > >>> since we can't hide the address from SLDT. Same with the IDT. > > >> > > >> Sorry, I don't understand your point. > > > > > > So the problem is that right now the SGDT instruction (which is > > > unprivileged) > > > leaks the real address of the kernel image: > > > > > > fomalhaut:~> ./sgdt > > > SGDT: 88303fd89000 / 007f > > > > > > that '88303fd89000' is a kernel address. > > > > Thank you. > > I do know that SGDT and friends are unprivileged on x86 > > and thus they allow userspace (and guest kernels in paravirt) > > learn things they don't need to know. > > > > I don't see how making GDT page-aligned and page-sized > > changes anything in this regard. SGDT will still work, > > and still leak GDT address. > > Well, as I try to explain it in the other part of my mail, doing so enables > us to > remap the GDT to a less security sensitive virtual address that does not leak > the > kernel's randomized address: > > > > Your observation in the changelog and your patch: > > > > > It is page-sized because of paravirt. [...] > > > > > > ... conflicts with the intention to mark (remap) the primary GDT address > > > read-only > > > on native kernels as well. > > > > > > So what we should do instead is to use the page alignment properly and > > > remap the > > > GDT to a read-only location, and load that one. > > > > If we'd have a small GDT (i.e. what my patch does), we still can remap the > > entire page which contains small GDT, and simply don't care that some other > > data > > is also visible through that RO page. 
> > That's generally considered fragile: suppose an attacker has a limited > information > leak that can read absolute addresses with system privilege but he doesn't > know > the kernel's randomized base offset. With a 'partial page' mapping there > could be > function pointers near the GDT, part of the page the GDT happens to be on, > that > leak this information. > > (Same goes for crypto keys or other critical information (like canary > information, > salts, etc.) accidentally ending up nearby.) > > Arguably it's a bit tenuous, but when playing remapping games it's generally > considered good to be page aligned and page sized, with zero padding. > > > > This would have a couple of advantages: > > > > > > - This would give kernel address randomization more teeth on x86. > > > > > > - An additional advantage would be that rootkits overwriting the GDT > > > would have > > >a bit more work to do. > > > > > > - A third advantage would be that for NUMA systems we could 'mirror' the > > > GDT into > > >node-local memory and load those. This makes GDT load cache-misses a > > > bit less > > >expensive. > > > > GDT is per-cpu. Isn't per-cpu memory already NUMA-local? > > Indeed it is: > > fomalhaut:~> for ((cpu=1; cpu<9; cpu++)); do taskset $cpu ./sgdt ; done > SGDT: 88103fa09000 / 007f > SGDT: 88103fa29000 / 007f > SGDT: 88103fa29000 / 007f > SGDT: 88103fa49000 / 007f > SGDT: 88103fa49000 / 007f > SGDT: 88103fa49000 / 007f > SGDT: 88103fa29000 / 007f > SGDT: 88103fa69000 / 007f > > I confused it with the IDT, which is still global. > > This also means that the GDT in itself does not leak kernel addresses at the > moment, except it leaks the layout of the percpu area. > > So my suggestion would be to: > > - make the GDT unconditionally page aligned and sized, then remap it to a >read-only address unconditionally as well, like we do it for the IDT. Does anyone know what happens if you stick a non-accessed segment in the GDT, map the GDT RO, and access it? 
The docs are extremely vague on the interplay between segmentation and paging on the segmentation structures themselves. My guess is that it causes #PF. This might break set_thread_area users unless we change set_thread_area to force the accessed bit on. There's a possible worse failure mode: if someone pokes an un-accessed segment into SS or CS using sigreturn, then it's within the realm of possibility that IRET would generate #PF (hey Intel and AMD, please document this!). I don't think that would be rootable, but at the very least we'd want to make sure it doesn't OOPS by either making it impossible or adding an explicit test to sigreturn.c. hpa pointed out in another thread that the GDT *must* be writable on 32-bit kernels because we use a task gate for NMI and jumping through a task gate writes to the GDT. On another note, SGDT is considerably faster than LSL, at least on Sandy Bridge. The vdso might be able to take advantage of that for getcpu. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rdmsr_safe in Linux PV (under Xen) gets an #GP:Re: [Fedora-xen] Running fedora xen on top of KVM?
On Tue, Sep 22, 2015 at 11:23 AM, Konrad Rzeszutek Wilk wrote: > On Sun, Sep 20, 2015 at 09:49:04PM -0700, Andy Lutomirski wrote: >> On Fri, Sep 18, 2015 at 12:04 PM, Borislav Petkov wrote: >> > On Fri, Sep 18, 2015 at 08:20:46AM -0700, Andy Lutomirski wrote: >> >> In any event, Borislav, you must have typed rdmsr_safe for a reason :) >> > >> > Wasn't me: >> > >> > 6c62aa4a3c12 ("x86: make amd.c have 64bit support code") >> > >> > I think the error handling of rdmsrl_safe() was needed to do the pfn >> > games which are done in the if-clause. >> >> I just tried it. rdmsrl_safe and friends definitely work fine in that >> code. I think that Linux's Xen startup code is buggy and fails to set >> up early exception handling. >> >> Try this (horribly whitespace damaged): >> >> static void __init early_identify_cpu(struct cpuinfo_x86 *c) >> { >> + u64 tmp; >> #ifdef CONFIG_X86_64 >> c->x86_clflush_size = 64; >> c->x86_phys_bits = 36; >> @@ -752,6 +753,9 @@ static void __init early_identify_cpu(struct cpuinfo_x86 >> *c) >> c->cpu_index = 0; >> filter_cpuid_features(c, false); >> >> + pr_err("trying to crash\n"); >> + rdmsrl_safe(0x12345678, &tmp); >> + >> >> It works fine. I bet it crashes on a Xen guest, though. I assume >> that Xen just works in most cases by luck. > > (d31) mapping kernel into physical memory > (d31) about to get started... 
> (XEN) traps.c:3151: GPF (): 82d0801a31ed -> 82d08023c77b > (XEN) traps.c:459:d31v0 Unhandled general protection fault fault/trap [#13] > on VCPU 0 [ec=] > (XEN) domain_crash_sync called from entry.S: fault at 82d080238213 > create_bounce_frame+0x12b/0x13a > (XEN) Domain 31 (vcpu#0) crashed on cpu#35: > (XEN) [ Xen-4.5.0 x86_64 debug=n Not tainted ] > (XEN) CPU:35 > (XEN) RIP:e033:[] > (XEN) RFLAGS: 0246 EM: 1 CONTEXT: pv guest > (XEN) rax: rbx: 81c03e64 rcx: 12345678 > (XEN) rdx: 81c03de8 rsi: 81c03dec rdi: 12345278 > (XEN) rbp: 81c03e48 rsp: 81c03dd0 r8: 7420676e69797274 > (XEN) r9: 6873617263206f74 r10: r11: > (XEN) r12: 12345678 r13: 81c03f00 r14: > (XEN) r15: cr0: 8005003b cr4: 001526f0 > (XEN) cr3: 0014e8c97000 cr2: > (XEN) ds: es: fs: gs: ss: e02b cs: e033 > (XEN) Guest stack trace from rsp=81c03dd0: > (XEN)12345678 81041b64 > (XEN)0001e030 00010046 81c03e18 e02b > (XEN)81041b5d 81c03e48 00811809 > (XEN)01a0 0100 82009000 81c03e68 > (XEN)81d211ea 81c03ed8 > (XEN)81d1be59 81c03ed8 811892ab 0010 > (XEN)81c03ee8 81c03ea8 697a696c61697469 81f15442 > (XEN) 81db3900 > (XEN) 81c03f28 81d10f0a > (XEN) > (XEN) 81c03f38 > (XEN)81d10603 81c03ff8 81d15f5c > (XEN) > (XEN) > (XEN) > (XEN) > (XEN) ffd83a031f898b75 22400800 0001 > (XEN) 00010102464c457f > (XEN)0001003e0003 0940 0040 12a0 > (XEN)00380040 001100120044 00050001 > [root@ovs107 ~]# > > (gdb) x/20i 0x81041b64 >0x81041b64: rdmsr > Exactly. I think that Xen is missing code to wire up early_fixup_exception. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mon, Sep 21, 2015 at 9:36 AM, Linus Torvalds wrote: > On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar wrote: >> >> Linus, what's your preference? > > So quite frankly, is there any reason we don't just implement > native_read_msr() as just > >unsigned long long native_read_msr(unsigned int msr) >{ > int err; > unsigned long long val; > > val = native_read_msr_safe(msr, &err); > WARN_ON_ONCE(err); > return val; >} > > Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be > done with it. I don't see the downside. In the interest of sanity, I want to drop the "native_", too, since there appear to be few or no good use cases for native_read_msr as such. I'm tempted to add new functions read_msr and write_msr that forward to rdmsrl_safe and wrmsrl_safe. It looks like the msr helpers are every bit as bad as the TSC helpers used to be :( --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mon, Sep 21, 2015 at 9:49 AM, Arjan van de Ven wrote: > On 9/21/2015 9:36 AM, Linus Torvalds wrote: >> >> On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar wrote: >>> >>> >>> Linus, what's your preference? >> >> >> So quite frankly, is there any reason we don't just implement >> native_read_msr() as just >> >> unsigned long long native_read_msr(unsigned int msr) >> { >>int err; >>unsigned long long val; >> >>val = native_read_msr_safe(msr, &err); >>WARN_ON_ONCE(err); >>return val; >> } >> >> Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be >> done with it. I don't see the downside. >> >> How many msr reads are so critical that the function call >> overhead would matter? > > > if anything qualifies it'd be switch_to() and friends. And maybe the KVM user return notifier. Unfortunately, switch_to might gain another two MSR accesses at some point if we decide to fix the bugs in there. Sigh. > > note that I'm not entirely happy about the notion of "safe" MSRs. > They're safe as in "won't fault". > Reading random MSRs isn't a generic safe operation though, but the name sort > of gives people > the impression that it is. Even with _safe variants, you still need to KNOW > the MSR exists (by means > of CPUID or similar) unfortunately. > I tend to agree. Anyway, the fully out-of-line approach isn't obviously a bad idea, and it simplifies the whole mess (we can drop most of the paravirt patches, too). I'll give it a try and see what happens. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rdmsr_safe in Linux PV (under Xen) gets an #GP:Re: [Fedora-xen] Running fedora xen on top of KVM?
On Fri, Sep 18, 2015 at 12:04 PM, Borislav Petkov wrote: > On Fri, Sep 18, 2015 at 08:20:46AM -0700, Andy Lutomirski wrote: >> In any event, Borislav, you must have typed rdmsr_safe for a reason :) > > Wasn't me: > > 6c62aa4a3c12 ("x86: make amd.c have 64bit support code") > > I think the error handling of rdmsrl_safe() was needed to do the pfn > games which are done in the if-clause. I just tried it. rdmsrl_safe and friends definitely work fine in that code. I think that Linux's Xen startup code is buggy and fails to set up early exception handling. Try this (horribly whitespace damaged): static void __init early_identify_cpu(struct cpuinfo_x86 *c) { + u64 tmp; #ifdef CONFIG_X86_64 c->x86_clflush_size = 64; c->x86_phys_bits = 36; @@ -752,6 +753,9 @@ static void __init early_identify_cpu(struct cpuinfo_x86 *c) c->cpu_index = 0; filter_cpuid_features(c, false); + pr_err("trying to crash\n"); + rdmsrl_safe(0x12345678, &tmp); + It works fine. I bet it crashes on a Xen guest, though. I assume that Xen just works in most cases by luck. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Sep 20, 2015 5:15 PM, "Linus Torvalds" wrote: > > On Sun, Sep 20, 2015 at 5:02 PM, Andy Lutomirski wrote: > > This demotes an OOPS and likely panic due to a failed non-"safe" MSR > > access to a WARN_ON_ONCE and a return of zero (in the RDMSR case). > > We still write a pr_info entry unconditionally for debugging. > > No, this is wrong. > > If you really want to do something like this, then just make all MSR > reads safe. So the only difference between "safe" and "unsafe" is that > the unsafe version just doesn't check the return value, and silently > just returns zero for reads (or writes nothing). > > To quote Obi-Wan: "Use the exception table, Luke". > > Because decoding instructions is just too ugly. We'll do it for CPU > errata where we might have to do it for user space code too (ie the > AMD prefetch mess), but for code that _we_ control? Hell no. > > So NAK on this. My personal preference is to just not do this at all. A couple people disagree. If we make the unsafe variants not oops, then I think we want to have the nice loud warning, since these issues are bugs if they happen. We could certainly use the exception table for this, but it'll result in bigger code, since each MSR access will need an exception table entry and an associated fixup to call some helper that warns and sets the result to zero. I'd be happy to implement that, but only if it'll be applied. Otherwise I'd rather just drop this patch and keep the rest of the series. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
This demotes an OOPS and likely panic due to a failed non-"safe" MSR access to a WARN_ON_ONCE and a return of zero (in the RDMSR case). We still write a pr_info entry unconditionally for debugging. To be clear, this type of failure should *not* happen. This patch exists to minimize the chance of nasty undebuggable failures on systems that used to work due to a now-fixed CONFIG_PARAVIRT=y bug. Signed-off-by: Andy Lutomirski --- arch/x86/kernel/traps.c | 55 + 1 file changed, 55 insertions(+) diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c index 346eec73f7db..f82987643e32 100644 --- a/arch/x86/kernel/traps.c +++ b/arch/x86/kernel/traps.c @@ -437,6 +437,58 @@ exit_trap: do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL); } +static bool paper_over_kernel_gpf(struct pt_regs *regs) +{ + /* +* Try to decode the opcode that failed. So far, we only care +* about boring two-byte unprefixed opcodes, so we don't need +* the full instruction decoder machinery. +*/ + u16 opcode; + + if (probe_kernel_read(&opcode, (const void *)regs->ip, sizeof(opcode))) + return false; + + if (opcode == 0x320f) { + /* RDMSR */ + pr_info("bad kernel RDMSR from non-existent MSR 0x%x", + (unsigned int)regs->cx); + if (!panic_on_oops) { + WARN_ON_ONCE(true); + + /* +* Pretend that RDMSR worked and returned zero. We +* chose zero because zero seems less likely to +* cause further malfunctions than any other value. +*/ + regs->ax = 0; + regs->dx = 0; + regs->ip += 2; + return true; + } else { + /* Don't fix it up. */ + return false; + } + } else if (opcode == 0x300f) { + /* WRMSR */ + pr_info("bad kernel WRMSR writing 0x%08x%08x to MSR 0x%x", + (unsigned int)regs->dx, (unsigned int)regs->ax, + (unsigned int)regs->cx); + if (!panic_on_oops) { + WARN_ON_ONCE(true); + + /* Pretend it worked and carry on. */ + regs->ip += 2; + return true; + } else { + /* Don't fix it up. 
*/ + return false; + } + } + + return false; +} + dotraplinkage void do_general_protection(struct pt_regs *regs, long error_code) { @@ -456,6 +508,9 @@ do_general_protection(struct pt_regs *regs, long error_code) if (fixup_exception(regs)) return; + if (paper_over_kernel_gpf(regs)) + return; + tsk->thread.error_code = error_code; tsk->thread.trap_nr = X86_TRAP_GP; if (notify_die(DIE_GPF, "general protection fault", regs, error_code, -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 0/2] x86/msr: MSR access failure changes
This applies on top of my earlier paravirt MSR series. Changes from v1: - Return zero instead of poison on bad RDMSRs. Andy Lutomirski (2): x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops x86/msr: Set the return value to zero when native_rdmsr_safe fails arch/x86/include/asm/msr.h | 5 - arch/x86/kernel/traps.c| 55 ++ 2 files changed, 59 insertions(+), 1 deletion(-) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/2] x86/msr: Set the return value to zero when native_rdmsr_safe fails
This will cause unchecked native_rdmsr_safe failures to return deterministic results. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/msr.h | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h index 77d8b284e4a7..9eda52205d42 100644 --- a/arch/x86/include/asm/msr.h +++ b/arch/x86/include/asm/msr.h @@ -73,7 +73,10 @@ static inline unsigned long long native_read_msr_safe(unsigned int msr, asm volatile("2: rdmsr ; xor %[err],%[err]\n" "1:\n\t" ".section .fixup,\"ax\"\n\t" -"3: mov %[fault],%[err] ; jmp 1b\n\t" +"3: mov %[fault],%[err]\n\t" +"xorl %%eax, %%eax\n\t" +"xorl %%edx, %%edx\n\t" +"jmp 1b\n\t" ".previous\n\t" _ASM_EXTABLE(2b, 3b) : [err] "=r" (*err), EAX_EDX_RET(val, low, high) -- 2.4.3 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: rdmsr_safe in Linux PV (under Xen) gets an #GP:Re: [Fedora-xen] Running fedora xen on top of KVM?
On Fri, Sep 18, 2015 at 6:54 AM, Borislav Petkov wrote: > On Thu, Sep 17, 2015 at 01:23:31PM -0700, Andy Lutomirski wrote: >> Cc: Borislav. Is TSEG guaranteed to exist? Can we defer that until > > Why not? It is the tseg base address. > > I think this is kvm injecting a #GP as it is not ignoring this MSR. > Presumably modprobing kvm with "ignore_msrs=1" or so should hide the > issue IIUC... But the issue should be fixed, no? Does anyone know why this affects Xen on KVM but not native Linux on KVM? Or is there something weird about the particular KVM configuration? In any event, Borislav, you must have typed rdmsr_safe for a reason :) Either rdmsr_safe in that context is broken on Xen and not native (which is entirely possible -- I didn't check carefully) or the rdmsr_safe isn't "safe" on any type of kernel. Presumably either the kernel or KVM (or both) should be fixed. Given that we can handle fixups in the decompressor, surely it wouldn't be so hard to make early GPF fixups work in the main kernel. --Andy -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html