Re: [GIT PULL 0/5] perf/core improvements and fixes

2015-10-22 Thread Ingo Molnar

* Arnaldo Carvalho de Melo  wrote:

> Hi Ingo,
> 
>   Please consider pulling,
> 
> - Arnaldo
> 
> The following changes since commit 43e41adc9e8c36545888d78fed2ef8d102a938dc:
> 
>   perf record: Add ability to sample call branches (2015-10-20 10:30:55 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git 
> tags/perf-core-for-mingo
> 
> for you to fetch changes up to e3d006ce8180a0c025ce66bdc89bbc125f85be57:
> 
>   perf annotate: Add debug message for out of bounds sample (2015-10-21 
> 18:12:37 -0300)
> 
> 
> perf/core improvements and fixes:
> 
> User visible:
> 
> - Print branch filter state with verbose mode (Andi Kleen)
> 
> - Fix core dump caused by per-socket/core system-wide stat (Kan Liang)
> 
> - Update libtraceevent KVM plugin (Paolo Bonzini)
> 
> Developer stuff:
> 
> - Add fixdep to 'tools/build' .gitignore (Yunlong Song)
> 
> Signed-off-by: Arnaldo Carvalho de Melo 
> 
> 
> Andi Kleen (1):
>   perf evsel: Print branch filter state with -vv
> 
> Arnaldo Carvalho de Melo (1):
>   perf annotate: Add debug message for out of bounds sample
> 
> Kan Liang (1):
>   perf cpu_map: Fix core dump caused by per-socket/core system-wide stat
> 
> Paolo Bonzini (1):
>   tools lib traceevent: update KVM plugin
> 
> Yunlong Song (1):
>   perf build: Add fixdep to .gitignore
> 
>  tools/build/.gitignore|  1 +
>  tools/lib/traceevent/plugin_kvm.c | 25 +
>  tools/perf/util/annotate.c|  5 -
>  tools/perf/util/cpumap.c  |  2 +-
>  tools/perf/util/evsel.c   |  1 +
>  5 files changed, 24 insertions(+), 10 deletions(-)
>  create mode 100644 tools/build/.gitignore

Pulled, thanks a lot Arnaldo!

Ingo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] kvmtool: tiny init fox x86_64

2015-10-22 Thread Pekka Enberg

Hi Oleg,

On 10/19/2015 01:59 PM, Oleg Nesterov wrote:

Yesterday I discovered kvmtool and it looks really cool! Thanks!
Looks like, it does exactly what I need.

But the static /virt/init doesn't look good. This series lessens
the size of lkvm binary from 1045952 to 196200 bytes on x86_64,
but this is minor. I want to write my /virt/init in shell or perl
and this change can help.

I don't really know who should be cc'ed, I've picked some names
from git-log guest/init.c.


Will maintains the standalone kvmtool.git tree.

The series looks good to me:

Acked-by: Pekka Enberg 

- Pekka
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Difference between vcpu_load and kvm_sched_in ?

2015-10-22 Thread Wanpeng Li

On 10/21/15 2:46 PM, Paolo Bonzini wrote:


On 21/10/2015 00:57, Wanpeng Li wrote:

kvm_sched_out and kvm_sched_in are part of KVM's preemption hooks.  The
hooks are registered only between vcpu_load and vcpu_put, therefore they
know that the mutex is taken.  The sequence will go like this:

  vcpu_load
  kvm_sched_out
  kvm_sched_in
  kvm_sched_out
  kvm_sched_in
  ...
  vcpu_put

If this should be:

vcpu_load
kvm_sched_in
kvm_sched_out
kvm_sched_in
kvm_sched_out
...
vcpu_put

No, because vcpu_load is called while the thread is running.  Therefore,
the first preempt notifier call will be a sched_out notification, which
calls kvm_arch_vcpu_put.  Extending the picture above:

   vcpu_load-> kvm_arch_vcpu_load
   kvm_sched_out-> kvm_arch_vcpu_put
   kvm_sched_in -> kvm_arch_vcpu_load
   kvm_sched_out-> kvm_arch_vcpu_put
   kvm_sched_in -> kvm_arch_vcpu_load
   ...
   kvm_sched_out-> kvm_arch_vcpu_put
   kvm_sched_in -> kvm_arch_vcpu_load
   vcpu_put -> kvm_arch_vcpu_put


Got it, thanks. :-)

Regards,
Wanpeng Li
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/3] target-i386: save/restore vcpu's TSC rate during migration

2015-10-22 Thread Eduardo Habkost
On Tue, Oct 20, 2015 at 03:22:51PM +0800, Haozhong Zhang wrote:
> This patchset enables QEMU to save/restore vcpu's TSC rate during the
> migration. When cooperating with KVM which supports TSC scaling, guest
> programs can observe a consistent guest TSC rate even though they are
> migrated among machines with different host TSC rates.
> 
> A pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' are added to
> control the migration of vcpu's TSC rate.

The requirements and goals aren't clear to me. I see two possible use
cases, here:

1) Best effort to keep TSC frequency constant if possible (but not
   aborting migration if not possible). This would be an interesting
   default, but a bit unpredictable.
2) Strictly ensuring TSC frequency stays constant on migration (and
   aborting migration if not possible). This would be an useful feature,
   but can't be enabled by default unless both hosts have the same TSC
   frequency or support TSC scaling.

Which one(s) you are trying to implement?

In other words, what is the right behavior when KVM_SET_TSC_KHZ fails or
KVM_CAP_TSC_CONTROL is not available? We can't answer that question if
the requirements and goals are not clear.

Once we know what exactly is the goal, we could enable the new mode with
a single option, instead of raw options to control migration stream
loading/saving.


>  * By default, the migration of vcpu's TSC rate is enabled only on
>pc-*-2.5 and newer machine types. If the cpu option 'save-tsc-freq'
>is present, the vcpu's TSC rate will be migrated from older machine
>types as well.
>  * Another cpu option 'load-tsc-freq' controls whether the migrated
>vcpu's TSC rate is used. By default, QEMU will not use the migrated
>TSC rate if this option is not present. Otherwise, QEMU will use
>the migrated TSC rate and override the TSC rate given by the cpu
>option 'tsc-freq'.
> 
> Changes in v2:
>  * Add a pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' to
>control the migration of vcpu's TSC rate.
>  * Move all logic of setting TSC rate to target-i386.
>  * Remove the duplicated TSC setup in kvm_arch_init_vcpu().
> 
> Haozhong Zhang (3):
>   target-i386: add a subsection for migrating vcpu's TSC rate
>   target-i386: calculate vcpu's TSC rate to be migrated
>   target-i386: load the migrated vcpu's TSC rate
> 
>  include/hw/i386/pc.h  |  5 +
>  target-i386/cpu.c |  2 ++
>  target-i386/cpu.h |  3 +++
>  target-i386/kvm.c | 61 
> +++
>  target-i386/machine.c | 19 
>  5 files changed, 81 insertions(+), 9 deletions(-)
> 
> -- 
> 2.4.8
> 

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/3] target-i386: load the migrated vcpu's TSC rate

2015-10-22 Thread Eduardo Habkost
On Tue, Oct 20, 2015 at 03:22:54PM +0800, Haozhong Zhang wrote:
> Set vcpu's TSC rate to the migrated value (if any). If KVM supports TSC
> scaling, guest programs will observe TSC increasing in the migrated rate
> other than the host TSC rate.
> 
> The loading is controlled by a new cpu option 'load-tsc-freq'. If it is
> present, then the loading will be enabled and the migrated vcpu's TSC
> rate will override the value specified by the cpu option
> 'tsc-freq'. Otherwise, the loading will be disabled.

Why do we need an option? Why can't we enable loading unconditionally?

> 
> The setting of vcpu's TSC rate in this patch duplicates the code in
> kvm_arch_init_vcpu(), so we remove the latter one.
> 
> Signed-off-by: Haozhong Zhang 
> ---
>  target-i386/cpu.c |  1 +
>  target-i386/cpu.h |  1 +
>  target-i386/kvm.c | 28 +++-
>  3 files changed, 21 insertions(+), 9 deletions(-)
> 
> diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> index b6bb457..763ba4b 100644
> --- a/target-i386/cpu.c
> +++ b/target-i386/cpu.c
> @@ -3144,6 +3144,7 @@ static Property x86_cpu_properties[] = {
>  DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
>  DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
>  DEFINE_PROP_BOOL("save-tsc-freq", X86CPU, env.save_tsc_khz, true),
> +DEFINE_PROP_BOOL("load-tsc-freq", X86CPU, env.load_tsc_khz, false),
>  DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
>  DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
>  DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
> diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> index ba1a289..353f5fb 100644
> --- a/target-i386/cpu.h
> +++ b/target-i386/cpu.h
> @@ -968,6 +968,7 @@ typedef struct CPUX86State {
>  int64_t tsc_khz;
>  int64_t tsc_khz_incoming;
>  bool save_tsc_khz;
> +bool load_tsc_khz;
>  void *kvm_xsave_buf;
>  
>  uint64_t mcg_cap;
> diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> index 698524a..34616f5 100644
> --- a/target-i386/kvm.c
> +++ b/target-i386/kvm.c
> @@ -743,15 +743,6 @@ int kvm_arch_init_vcpu(CPUState *cs)
>  return r;
>  }
>  
> -r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL);
> -if (r && env->tsc_khz) {
> -r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz);
> -if (r < 0) {
> -fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
> -return r;
> -}
> -}
> -
>  if (kvm_has_xsave()) {
>  env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
>  }
> @@ -2223,6 +2214,25 @@ static int kvm_setup_tsc_khz(X86CPU *cpu, int level)
>  return 0;
>  
>  /*
> + * If the cpu option 'load-tsc-freq' is present, the vcpu's TSC rate in 
> the
> + * migrated state will be used and the overrides the user-specified 
> vcpu's
> + * TSC rate (if any).
> + */
> +if (runstate_check(RUN_STATE_INMIGRATE) &&
> +env->load_tsc_khz && env->tsc_khz_incoming) {
> +env->tsc_khz = env->tsc_khz_incoming;
> +}

Please don't make the results of the function depend on global QEMU
runstate, as it makes it harder to reason about it, and easy to
introduce subtle bugs if we change initialization order. Can't we just
ensure tsc_khz gets set to the right value before the function is
called, inside the code that loads migration data?

> +
> +r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL);
> +if (r && env->tsc_khz) {
> +r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz);
> +if (r < 0) {
> +fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
> +return r;
> +}
> +}

So, the final result here does not depend on the configuration, but also
on host capabilities. That means nobody can possibly know if the
tsc-freq option really works, until they enable it, run the VM, and
check the results from inside the VM. Not a good idea.

(This doesn't apply just to the new code, the existing code is already
broken this way.)

-- 
Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next RFC 1/2] vhost: introduce vhost_has_work()

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 01:27:28AM -0400, Jason Wang wrote:
> This path introduces a helper which can give a hint for whether or not
> there's a work queued in the work list.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vhost/vhost.c | 6 ++
>  drivers/vhost/vhost.h | 1 +
>  2 files changed, 7 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index eec2f11..d42d11e 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -245,6 +245,12 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
> vhost_work *work)
>  }
>  EXPORT_SYMBOL_GPL(vhost_work_queue);
>  
> +bool vhost_has_work(struct vhost_dev *dev)
> +{
> + return !list_empty(>work_list);
> +}
> +EXPORT_SYMBOL_GPL(vhost_has_work);
> +
>  void vhost_poll_queue(struct vhost_poll *poll)
>  {
>   vhost_work_queue(poll->dev, >work);


This doesn't take a lock so it's unreliable.
I think it's ok in this case since it's just
an optimization - but pls document this.

> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 4772862..ea0327d 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -37,6 +37,7 @@ struct vhost_poll {
>  
>  void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
>  void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
> +bool vhost_has_work(struct vhost_dev *dev);
>  
>  void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
>unsigned long mask, struct vhost_dev *dev);
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 6/6] vfio: platform: move get/put reset at open/release

2015-10-22 Thread Arnd Bergmann
On Thursday 22 October 2015 11:42:02 Eric Auger wrote:
> Currently reset lookup is done on probe. This introduces a
> race with new registration mechanism in the case where the
> vfio-platform driver is bound to the device before its module
> is loaded: on the load, the probe happens which triggers the
> reset module load which itself attempts to get the symbol for
> the registration function (vfio_platform_register_reset). The
> symbol is not yet available hence the lookup fails. In case we
> do the lookup in the first open we are sure the vfio-platform
> module is loaded and vfio_platform_register_reset is available.
> 
> Signed-off-by: Eric Auger 

I don't understand the explanation. I would expect the request_module()
call to block until the module is actually loaded. Is this not
what happens?

> mutex_unlock(_lock);
> @@ -181,6 +182,8 @@ static int vfio_platform_open(void *device_data)
> if (ret)
> goto err_irq;
>  
> +   vfio_platform_get_reset(vdev);
> +
> if (vdev->reset)
> vdev->reset(vdev);
> 

This needs some error handling to ensure that the open() fails
if there is no reset handler.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/6] vfio: platform: reset: calxedaxgmac: add reset function registration

2015-10-22 Thread Eric Auger
On 10/22/2015 12:13 PM, Arnd Bergmann wrote:
> On Thursday 22 October 2015 11:41:59 Eric Auger wrote:
>> This patch adds the reset function registration/unregistration.
>>
>> Signed-off-by: Eric Auger 
> 
> Looks good, except one thing:
>> @@ -70,6 +69,8 @@ int vfio_platform_calxedaxgmac_reset(struct 
>> vfio_platform_device *vdev)
>>  return -ENOMEM;
>>  }
>>  
>> +pr_info("VFIO reset of %s\n", vdev->name);
>> +
>>  /* disable IRQ */
>>  writel(0, reg.ioaddr + XGMAC_DMA_INTR_ENA);
>>  
> 
> This probably slipped in from debugging, please remove it.
Well actually this is not an oversight but some unappropriate tracing
attempt I am afraid. I wanted to add a trace useful for the end-user to
make sure the VFIO reset function was called. Do you forbid that or do
recommend to use another tracing mechanism/level? In the past I tried
dynamic tracing but with module loading mechanism I found it not that handy.

Eric
> 
>   Arnd
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Peter Zijlstra
On Thu, May 14, 2015 at 07:31:11PM +0200, Andrea Arcangeli wrote:
> @@ -255,21 +259,23 @@ int handle_userfault(struct vm_area_struct *vma, 
> unsigned long address,
>* through poll/read().
>*/
>   __add_wait_queue(>fault_wqh, );
> - for (;;) {
> - set_current_state(TASK_KILLABLE);
> - if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
> - fatal_signal_pending(current))
> - break;
> - spin_unlock(>fault_wqh.lock);
> + set_current_state(TASK_KILLABLE);
> + spin_unlock(>fault_wqh.lock);
>  
> + if (likely(!ACCESS_ONCE(ctx->released) &&
> +!fatal_signal_pending(current))) {
>   wake_up_poll(>fd_wqh, POLLIN);
>   schedule();
> + ret |= VM_FAULT_MAJOR;
> + }

So what happens here if schedule() spontaneously wakes for no reason?

I'm not sure enough of userfaultfd semantics to say if that would be
bad, but the code looks suspiciously like it relies on schedule() not to
do that; which is wrong.

> + __set_current_state(TASK_RUNNING);
> + /* see finish_wait() comment for why list_empty_careful() */
> + if (!list_empty_careful(_list)) {
>   spin_lock(>fault_wqh.lock);
> + list_del_init(_list);
> + spin_unlock(>fault_wqh.lock);
>   }
> - __remove_wait_queue(>fault_wqh, );
> - __set_current_state(TASK_RUNNING);
> - spin_unlock(>fault_wqh.lock);
>  
>   /*
>* ctx may go away after this if the userfault pseudo fd is
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] vfio: platform: use list of registered reset function

2015-10-22 Thread Eric Auger
On 10/22/2015 12:19 PM, Arnd Bergmann wrote:
> On Thursday 22 October 2015 11:42:01 Eric Auger wrote:
>> Remove the static lookup table and use the dynamic list of registered
>> reset functions instead. Also load the reset module through its alias.
>> The reset struct module pointer is stored in vfio_platform_device.
>>
>> We also remove the useless struct device pointer parameter in
>> vfio_platform_get_reset.
>>
>> This patch fixes the issue related to the usage of __symbol_get, which
>> besides from being moot, prevented compilation with CONFIG_MODULES
>> disabled.
>>
>> Also usage of MODULE_ALIAS makes possible to add a new reset module
>> without needing to update the framework. This was suggested by Arnd.
>>
>> Signed-off-by: Eric Auger 
>> Reported-by: Arnd Bergmann 
> 
> Looks good, just small style issues:
> 
>>
>> -}
>> +mutex_lock(_lock);
>> +list_for_each_entry(iter, _list, link) {
>> +if (!strcmp(iter->compat, compat) &&
>> +try_module_get(iter->owner)) {
>> +*module = iter->owner;
>> +mutex_unlock(_lock);
>> +return iter->reset;
>>  }
>>  }
>> +
>> +mutex_unlock(_lock);
>> +return NULL;
> 
> Better use 'break' to have a single mutex_unlock and return  statement.
OK
> 
>> +
>>  
>>  static void vfio_platform_put_reset(struct vfio_platform_device *vdev)
>>  {
>> -if (vdev->reset)
>> -symbol_put_addr(vdev->reset);
>> +if (vdev->reset) {
>> +module_put(vdev->reset_module);
>> +vdev->reset_module = NULL;
>> +vdev->reset = NULL;
>> +}
>>  }
> 
> This gets called from the remove callback. No need to clear those
> struct members immediately before the kfree().
I should have added those assignments in next patch actually. This
latter moves the vfio_platform_put_reset call in the release function.

Best Regards

Eric
> 
>   Arnd
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 6/6] vfio: platform: move get/put reset at open/release

2015-10-22 Thread Eric Auger
Hi Arnd,
On 10/22/2015 12:29 PM, Arnd Bergmann wrote:
> On Thursday 22 October 2015 11:42:02 Eric Auger wrote:
>> Currently reset lookup is done on probe. This introduces a
>> race with new registration mechanism in the case where the
>> vfio-platform driver is bound to the device before its module
>> is loaded: on the load, the probe happens which triggers the
>> reset module load which itself attempts to get the symbol for
>> the registration function (vfio_platform_register_reset). The
>> symbol is not yet available hence the lookup fails. In case we
>> do the lookup in the first open we are sure the vfio-platform
>> module is loaded and vfio_platform_register_reset is available.
>>
>> Signed-off-by: Eric Auger 
> 
> I don't understand the explanation. I would expect the request_module()
> call to block until the module is actually loaded. Is this not
> what happens?

Again many thanks for this new review.

My understanding is the following
1) vfio-platform is attached to the device through the override mechanism
2) vfio-platform get loaded through modprobe:
2-1) the platform driver init function eventually calls the
vfio-platform probe function.
2-2) request_module of vfio-platform-calxedaxgmac gets called.
3) The init of  vfio-platform-calxedaxgmac looks for
vfio_platform_register_reset. Unfortunately at that stage the init of
vfio-platform is not completed so the symbol is not available
3-1) in case symbol_get is used in vfio_platform_calxedaxgmac init, as
of today, this latter simply returns -EINVAL. Reset registration failed
but no stall.
3-2) in case symbol_get is *not* used, I think the module loader
attempts to load vfio-platform, which is already under load and this
causes a stall.

Let me know if you think this understanding is not correct.

Best Regards

Eric
> 
>> mutex_unlock(_lock);
>> @@ -181,6 +182,8 @@ static int vfio_platform_open(void *device_data)
>> if (ret)
>> goto err_irq;
>>  
>> +   vfio_platform_get_reset(vdev);
>> +
>> if (vdev->reset)
>> vdev->reset(vdev);
>>
> 
> This needs some error handling to ensure that the open() fails
> if there is no reset handler.
> 
>   Arnd
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 6/6] vfio: platform: move get/put reset at open/release

2015-10-22 Thread Arnd Bergmann
On Thursday 22 October 2015 13:40:16 Eric Auger wrote:
> On 10/22/2015 12:29 PM, Arnd Bergmann wrote:
> > On Thursday 22 October 2015 11:42:02 Eric Auger wrote:
> >> Currently reset lookup is done on probe. This introduces a
> >> race with new registration mechanism in the case where the
> >> vfio-platform driver is bound to the device before its module
> >> is loaded: on the load, the probe happens which triggers the
> >> reset module load which itself attempts to get the symbol for
> >> the registration function (vfio_platform_register_reset). The
> >> symbol is not yet available hence the lookup fails. In case we
> >> do the lookup in the first open we are sure the vfio-platform
> >> module is loaded and vfio_platform_register_reset is available.
> >>
> >> Signed-off-by: Eric Auger 
> > 
> > I don't understand the explanation. I would expect the request_module()
> > call to block until the module is actually loaded. Is this not
> > what happens?
> 
> Again many thanks for this new review.
> 
> My understanding is the following
> 1) vfio-platform is attached to the device through the override mechanism
> 2) vfio-platform get loaded through modprobe:
> 2-1) the platform driver init function eventually calls the
> vfio-platform probe function.
> 2-2) request_module of vfio-platform-calxedaxgmac gets called.
> 3) The init of  vfio-platform-calxedaxgmac looks for
> vfio_platform_register_reset. Unfortunately at that stage the init of
> vfio-platform is not completed so the symbol is not available
> 3-1) in case symbol_get is used in vfio_platform_calxedaxgmac init, as
> of today, this latter simply returns -EINVAL. Reset registration failed
> but no stall.
> 3-2) in case symbol_get is *not* used, I think the module loader
> attempts to load vfio-platform, which is already under load and this
> causes a stall.
> 
> Let me know if you think this understanding is not correct.

I was expecting the vfio_platform_register_reset() function to be
available by the time of step 2-2.

What I think is going on here is that we are still inside of the
module_init() function of the vfio-platform driver at the time that
we call the request_module(), and the symbols are not made visible
to other modules until the module_init function returns with success.
This is a side-effect of the probe function for the platform driver
getting called from deep inside of the platform_driver_register()
function for all devices that are already present.

I think it already works for the AMBA case, which uses separate modules,
but we need to restructure the platform_device case slightly to do
the same:

diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
index 9ce8afe28450..a00a44814255 100644
--- a/drivers/vfio/platform/Makefile
+++ b/drivers/vfio/platform/Makefile
@@ -1,10 +1,12 @@
 
-vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o
+vfio-platform-base-y := vfio_platform_common.o vfio_platform_irq.o
 
-obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
+obj-$(CONFIG_VFIO_PLATFORM) += vfio_platform.o
+obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform-base.o
 obj-$(CONFIG_VFIO_PLATFORM) += reset/
 
 vfio-amba-y := vfio_amba.o
 
 obj-$(CONFIG_VFIO_AMBA) += vfio-amba.o
+obj-$(CONFIG_VFIO_AMBA) += vfio-platform-base.o
 obj-$(CONFIG_VFIO_AMBA) += reset/


Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/6] vfio: platform: reset: calxedaxgmac: add reset function registration

2015-10-22 Thread Arnd Bergmann
On Thursday 22 October 2015 13:54:42 Eric Auger wrote:
> On 10/22/2015 12:13 PM, Arnd Bergmann wrote:
> > On Thursday 22 October 2015 11:41:59 Eric Auger wrote:
> >> This patch adds the reset function registration/unregistration.
> >>
> >> Signed-off-by: Eric Auger 
> > 
> > Looks good, except one thing:
> >> @@ -70,6 +69,8 @@ int vfio_platform_calxedaxgmac_reset(struct 
> >> vfio_platform_device *vdev)
> >>  return -ENOMEM;
> >>  }
> >>  
> >> +pr_info("VFIO reset of %s\n", vdev->name);
> >> +
> >>  /* disable IRQ */
> >>  writel(0, reg.ioaddr + XGMAC_DMA_INTR_ENA);
> >>  
> > 
> > This probably slipped in from debugging, please remove it.
> Well actually this is not an oversight but some unappropriate tracing
> attempt I am afraid. I wanted to add a trace useful for the end-user to
> make sure the VFIO reset function was called. Do you forbid that or do
> recommend to use another tracing mechanism/level? In the past I tried
> dynamic tracing but with module loading mechanism I found it not that handy.

If you think it's useful to have in the long run, it should be a separate
patch with a description what it's good for, rather than a side-effect of
an unrelated patch. It just looked to me like it's something you do
while debugging the reset code, rather than while using it.

Implementation-wise, you should use dev_info() instead of pr_info() where
possible, and it probably makes sense to put this into the vfio_platform
driver before calling the reset function, for consistency between the
drivers.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/2] KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()

2015-10-22 Thread Christoffer Dall
On Fri, Oct 02, 2015 at 05:44:29PM +0300, Pavel Fedin wrote:
> Now we see that vgic_set_lr() and vgic_sync_lr_elrsr() are always used
> together. Merge them into one function, saving from second vgic_ops
> dereferencing every time.
> 
> Additionally, remove unnecessary vgic_set_lr() in vgic_unqueue_irqs(),
> because the following vgic_retire_lr() will reset lr.state to zero
> anyway.
> 
> Signed-off-by: Pavel Fedin 
> ---
>  include/kvm/arm_vgic.h |  1 -
>  virt/kvm/arm/vgic-v2.c |  5 -
>  virt/kvm/arm/vgic-v3.c |  5 -
>  virt/kvm/arm/vgic.c| 30 --
>  4 files changed, 4 insertions(+), 37 deletions(-)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index d908028..ab5d242 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -112,7 +112,6 @@ struct vgic_vmcr {
>  struct vgic_ops {
>   struct vgic_lr  (*get_lr)(const struct kvm_vcpu *, int);
>   void(*set_lr)(struct kvm_vcpu *, int, struct vgic_lr);
> - void(*sync_lr_elrsr)(struct kvm_vcpu *, int, struct vgic_lr);
>   u64 (*get_elrsr)(const struct kvm_vcpu *vcpu);
>   u64 (*get_eisr)(const struct kvm_vcpu *vcpu);
>   void(*clear_eisr)(struct kvm_vcpu *vcpu);
> diff --git a/virt/kvm/arm/vgic-v2.c b/virt/kvm/arm/vgic-v2.c
> index 8d7b04d..f9d8da5 100644
> --- a/virt/kvm/arm/vgic-v2.c
> +++ b/virt/kvm/arm/vgic-v2.c
> @@ -79,11 +79,7 @@ static void vgic_v2_set_lr(struct kvm_vcpu *vcpu, int lr,
>   lr_val |= (lr_desc.source << GICH_LR_PHYSID_CPUID_SHIFT);
>  
>   vcpu->arch.vgic_cpu.vgic_v2.vgic_lr[lr] = lr_val;
> -}
>  
> -static void vgic_v2_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
> -   struct vgic_lr lr_desc)
> -{
>   if (!(lr_desc.state & LR_STATE_MASK))
>   vcpu->arch.vgic_cpu.vgic_v2.vgic_elrsr |= (1ULL << lr);
>   else
> @@ -166,7 +162,6 @@ static void vgic_v2_enable(struct kvm_vcpu *vcpu)
>  static const struct vgic_ops vgic_v2_ops = {
>   .get_lr = vgic_v2_get_lr,
>   .set_lr = vgic_v2_set_lr,
> - .sync_lr_elrsr  = vgic_v2_sync_lr_elrsr,
>   .get_elrsr  = vgic_v2_get_elrsr,
>   .get_eisr   = vgic_v2_get_eisr,
>   .clear_eisr = vgic_v2_clear_eisr,
> diff --git a/virt/kvm/arm/vgic-v3.c b/virt/kvm/arm/vgic-v3.c
> index 7dd5d62..75f6d91 100644
> --- a/virt/kvm/arm/vgic-v3.c
> +++ b/virt/kvm/arm/vgic-v3.c
> @@ -112,11 +112,7 @@ static void vgic_v3_set_lr(struct kvm_vcpu *vcpu, int lr,
>   }
>  
>   vcpu->arch.vgic_cpu.vgic_v3.vgic_lr[LR_INDEX(lr)] = lr_val;
> -}
>  
> -static void vgic_v3_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
> -   struct vgic_lr lr_desc)
> -{
>   if (!(lr_desc.state & LR_STATE_MASK))
>   vcpu->arch.vgic_cpu.vgic_v3.vgic_elrsr |= (1U << lr);
>   else
> @@ -211,7 +207,6 @@ static void vgic_v3_enable(struct kvm_vcpu *vcpu)
>  static const struct vgic_ops vgic_v3_ops = {
>   .get_lr = vgic_v3_get_lr,
>   .set_lr = vgic_v3_set_lr,
> - .sync_lr_elrsr  = vgic_v3_sync_lr_elrsr,
>   .get_elrsr  = vgic_v3_get_elrsr,
>   .get_eisr   = vgic_v3_get_eisr,
>   .clear_eisr = vgic_v3_clear_eisr,
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 2f4d25a..7e164eb 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -709,10 +709,8 @@ void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
>* interrupt then move the active state to the
>* distributor tracking bit.
>*/
> - if (lr.state & LR_STATE_ACTIVE) {
> + if (lr.state & LR_STATE_ACTIVE)
>   vgic_irq_set_active(vcpu, lr.irq);
> - lr.state &= ~LR_STATE_ACTIVE;
> - }
>  
>   /*
>* Reestablish the pending state on the distributor and the
> @@ -720,17 +718,12 @@ void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
>* is fine, then we are only setting a few bits that were
>* already set.
>*/
> - if (lr.state & LR_STATE_PENDING) {
> + if (lr.state & LR_STATE_PENDING)
>   vgic_dist_irq_set_pending(vcpu, lr.irq);
> - lr.state &= ~LR_STATE_PENDING;
> - }
> -
> - vgic_set_lr(vcpu, i, lr);
>  
>   /*
>* Mark the LR as free for other use.
>*/
> - BUG_ON(lr.state & LR_STATE_MASK);
>   vgic_retire_lr(i, vcpu);
>   vgic_irq_clear_queued(vcpu, lr.irq);
>  
> @@ -1039,12 +1032,6 @@ static void vgic_set_lr(struct kvm_vcpu *vcpu, int lr,
>   vgic_ops->set_lr(vcpu, lr, vlr);
>  }
>  
> -static void vgic_sync_lr_elrsr(struct kvm_vcpu *vcpu, int lr,
> -

Re: [PATCH v2 2/2] KVM/arm: enable enhanced armv7 fp/simd lazy switch

2015-10-22 Thread Christoffer Dall
On Tue, Oct 20, 2015 at 06:10:59PM -0700, Mario Smarduch wrote:
> 
> 
> On 10/20/2015 12:24 AM, Christoffer Dall wrote:
> > On Mon, Oct 19, 2015 at 04:25:04PM -0700, Mario Smarduch wrote:
> >>
> >>
> >> On 10/19/2015 3:14 AM, Christoffer Dall wrote:
> >>> On Sat, Sep 26, 2015 at 04:43:29PM -0700, Mario Smarduch wrote:
>  This patch enhances current lazy vfp/simd hardware switch. In addition to
>  current lazy switch, it tracks vfp/simd hardware state with a vcpu 
>  lazy flag. 
> 
>  vcpu lazy flag is set on guest access and trap to vfp/simd hardware 
>  switch 
>  handler. On vm-enter if lazy flag is set skip trap enable and saving 
>  host fpexc. On vm-exit if flag is set skip hardware context switch
>  and return to host with guest context.
> 
>  On vcpu_put check if vcpu lazy flag is set, and execute a hardware 
>  context 
>  switch to restore host.
> 
>  Signed-off-by: Mario Smarduch 
>  ---
>   arch/arm/include/asm/kvm_asm.h |  1 +
>   arch/arm/kvm/arm.c | 17 
>   arch/arm/kvm/interrupts.S  | 60 
>  +++---
>   arch/arm/kvm/interrupts_head.S | 12 ++---
>   4 files changed, 71 insertions(+), 19 deletions(-)
> 
>  diff --git a/arch/arm/include/asm/kvm_asm.h 
>  b/arch/arm/include/asm/kvm_asm.h
>  index 194c91b..4b45d79 100644
>  --- a/arch/arm/include/asm/kvm_asm.h
>  +++ b/arch/arm/include/asm/kvm_asm.h
>  @@ -97,6 +97,7 @@ extern char __kvm_hyp_code_end[];
>   extern void __kvm_flush_vm_context(void);
>   extern void __kvm_tlb_flush_vmid_ipa(struct kvm *kvm, phys_addr_t ipa);
>   extern void __kvm_tlb_flush_vmid(struct kvm *kvm);
>  +extern void __kvm_restore_host_vfp_state(struct kvm_vcpu *vcpu);
>   
>   extern int __kvm_vcpu_run(struct kvm_vcpu *vcpu);
>   #endif
>  diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>  index ce404a5..79f49c7 100644
>  --- a/arch/arm/kvm/arm.c
>  +++ b/arch/arm/kvm/arm.c
>  @@ -105,6 +105,20 @@ void kvm_arch_check_processor_compat(void *rtn)
>   *(int *)rtn = 0;
>   }
>   
>  +/**
>  + * kvm_switch_fp_regs() - switch guest/host VFP/SIMD registers
>  + * @vcpu:   pointer to vcpu structure.
>  + *
> >>>
> >>> nit: stray blank line
> >> ok
> >>>
>  + */
>  +static void kvm_switch_fp_regs(struct kvm_vcpu *vcpu)
>  +{
>  +#ifdef CONFIG_ARM
>  +if (vcpu->arch.vfp_lazy == 1) {
>  +kvm_call_hyp(__kvm_restore_host_vfp_state, vcpu);
> >>>
> >>> why do you have to do this in HYP mode ?
> >>  Calling it directly works fine. I moved the function outside hyp start/end
> >> range in interrupts.S. Not thinking outside the box, just thought let them 
> >> all
> >> be hyp calls.
> >>
> >>>
>  +vcpu->arch.vfp_lazy = 0;
>  +}
>  +#endif
> >>>
> >>> we've tried to put stuff like this in header files to avoid the ifdefs
> >>> so far.  Could that be done here as well?
> >>
> >> That was a to do, but didn't get around to it.
> >>>
>  +}
>   
>   /**
>    * kvm_arch_init_vm - initializes a VM data structure
>  @@ -295,6 +309,9 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int 
>  cpu)
>   
>   void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>   {
>  +/* Check if Guest accessed VFP registers */
> >>>
> >>> misleading comment: this function does more than checking
> >> Yep sure does, will change.
> >>>
>  +kvm_switch_fp_regs(vcpu);
>  +
>   /*
>    * The arch-generic KVM code expects the cpu field of a vcpu to 
>  be -1
>    * if the vcpu is no longer assigned to a cpu.  This is used 
>  for the
>  diff --git a/arch/arm/kvm/interrupts.S b/arch/arm/kvm/interrupts.S
>  index 900ef6d..6d98232 100644
>  --- a/arch/arm/kvm/interrupts.S
>  +++ b/arch/arm/kvm/interrupts.S
>  @@ -96,6 +96,29 @@ ENTRY(__kvm_flush_vm_context)
>   bx  lr
>   ENDPROC(__kvm_flush_vm_context)
>   
>  +/**
>  + * void __kvm_restore_host_vfp_state(struct vcpu *vcpu) - Executes a 
>  lazy
>  + * fp/simd switch, saves the guest, restores host.
>  + *
> >>>
> >>> nit: stray blank line
> >> ok.
> >>>
>  + */
>  +ENTRY(__kvm_restore_host_vfp_state)
>  +#ifdef CONFIG_VFPv3
>  +push{r3-r7}
>  +
>  +add r7, r0, #VCPU_VFP_GUEST
>  +store_vfp_state r7
>  +
>  +add r7, r0, #VCPU_VFP_HOST
>  +ldr r7, [r7]
>  +restore_vfp_state r7
>  +
>  +ldr r3, [vcpu, #VCPU_VFP_FPEXC]
> >>>
> >>> either use r0 or vcpu throughout this function please
> >> Yeah that's bad - in the same function to
> >>>
>  +VFPFMXR FPEXC, r3
>  +
>  +pop {r3-r7}

Re: [PATCH 1/2] KVM: arm/arm64: Optimize away redundant LR tracking

2015-10-22 Thread Christoffer Dall
On Fri, Oct 02, 2015 at 05:44:28PM +0300, Pavel Fedin wrote:
> Currently we use vgic_irq_lr_map in order to track which LRs hold which
> IRQs, and lr_used bitmap in order to track which LRs are used or free.
> 
> vgic_irq_lr_map is actually used only for piggy-back optimization, and
> can be easily replaced by iteration over lr_used. This is good because in
> future, when LPI support is introduced, number of IRQs will grow up to at
> least 16384, while numbers from 1024 to 8192 are never going to be used.
> This would be a huge memory waste.
> 
> In its turn, lr_used is also completely redundant since
> ae705930fca6322600690df9dc1c7d0516145a93 ("arm/arm64: KVM: Keep elrsr/aisr
> in sync with software model"), because together with lr_used we also update
> elrsr. This allows to easily replace lr_used with elrsr, inverting all
> conditions (because in elrsr '1' means 'free').
> 
> Signed-off-by: Pavel Fedin 
> ---
>  include/kvm/arm_vgic.h |  6 
>  virt/kvm/arm/vgic.c| 74 
> +++---
>  2 files changed, 28 insertions(+), 52 deletions(-)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 4e14dac..d908028 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -296,9 +296,6 @@ struct vgic_v3_cpu_if {
>  };
>  
>  struct vgic_cpu {
> - /* per IRQ to LR mapping */
> - u8  *vgic_irq_lr_map;
> -
>   /* Pending/active/both interrupts on this VCPU */
>   DECLARE_BITMAP( pending_percpu, VGIC_NR_PRIVATE_IRQS);
>   DECLARE_BITMAP( active_percpu, VGIC_NR_PRIVATE_IRQS);
> @@ -309,9 +306,6 @@ struct vgic_cpu {
>   unsigned long   *active_shared;
>   unsigned long   *pend_act_shared;
>  
> - /* Bitmap of used/free list registers */
> - DECLARE_BITMAP( lr_used, VGIC_V2_MAX_LRS);
> -
>   /* Number of list registers on this CPU */
>   int nr_lr;
>  
> diff --git a/virt/kvm/arm/vgic.c b/virt/kvm/arm/vgic.c
> index 6bd1c9b..2f4d25a 100644
> --- a/virt/kvm/arm/vgic.c
> +++ b/virt/kvm/arm/vgic.c
> @@ -102,9 +102,10 @@
>  #include "vgic.h"
>  
>  static void vgic_retire_disabled_irqs(struct kvm_vcpu *vcpu);
> -static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu);
> +static void vgic_retire_lr(int lr_nr, struct kvm_vcpu *vcpu);
>  static struct vgic_lr vgic_get_lr(const struct kvm_vcpu *vcpu, int lr);
>  static void vgic_set_lr(struct kvm_vcpu *vcpu, int lr, struct vgic_lr 
> lr_desc);
> +static u64 vgic_get_elrsr(struct kvm_vcpu *vcpu);
>  static struct irq_phys_map *vgic_irq_map_search(struct kvm_vcpu *vcpu,
>   int virt_irq);
>  
> @@ -683,9 +684,11 @@ bool vgic_handle_cfg_reg(u32 *reg, struct kvm_exit_mmio 
> *mmio,
>  void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
>  {
>   struct vgic_cpu *vgic_cpu = >arch.vgic_cpu;
> + u64 elrsr = vgic_get_elrsr(vcpu);
> + unsigned long *elrsr_ptr = u64_to_bitmask();
>   int i;
>  
> - for_each_set_bit(i, vgic_cpu->lr_used, vgic_cpu->nr_lr) {
> + for_each_clear_bit(i, elrsr_ptr, vgic_cpu->nr_lr) {
>   struct vgic_lr lr = vgic_get_lr(vcpu, i);
>  
>   /*
> @@ -728,7 +731,7 @@ void vgic_unqueue_irqs(struct kvm_vcpu *vcpu)
>* Mark the LR as free for other use.
>*/
>   BUG_ON(lr.state & LR_STATE_MASK);
> - vgic_retire_lr(i, lr.irq, vcpu);
> + vgic_retire_lr(i, vcpu);
>   vgic_irq_clear_queued(vcpu, lr.irq);
>  
>   /* Finally update the VGIC state. */
> @@ -1087,15 +1090,12 @@ static inline void vgic_enable(struct kvm_vcpu *vcpu)
>   vgic_ops->enable(vcpu);
>  }
>  
> -static void vgic_retire_lr(int lr_nr, int irq, struct kvm_vcpu *vcpu)
> +static void vgic_retire_lr(int lr_nr, struct kvm_vcpu *vcpu)
>  {
> - struct vgic_cpu *vgic_cpu = >arch.vgic_cpu;
>   struct vgic_lr vlr = vgic_get_lr(vcpu, lr_nr);
>  
>   vlr.state = 0;
>   vgic_set_lr(vcpu, lr_nr, vlr);
> - clear_bit(lr_nr, vgic_cpu->lr_used);
> - vgic_cpu->vgic_irq_lr_map[irq] = LR_EMPTY;
>   vgic_sync_lr_elrsr(vcpu, lr_nr, vlr);
>  }
>  
> @@ -1110,14 +1110,15 @@ static void vgic_retire_lr(int lr_nr, int irq, struct 
> kvm_vcpu *vcpu)
>   */
>  static void vgic_retire_disabled_irqs(struct kvm_vcpu *vcpu)
>  {
> - struct vgic_cpu *vgic_cpu = >arch.vgic_cpu;
> + u64 elrsr = vgic_get_elrsr(vcpu);
> + unsigned long *elrsr_ptr = u64_to_bitmask();
>   int lr;
>  
> - for_each_set_bit(lr, vgic_cpu->lr_used, vgic->nr_lr) {
> + for_each_clear_bit(lr, elrsr_ptr, vgic->nr_lr) {
>   struct vgic_lr vlr = vgic_get_lr(vcpu, lr);
>  
>   if (!vgic_irq_is_enabled(vcpu, vlr.irq)) {
> - vgic_retire_lr(lr, vlr.irq, vcpu);
> + vgic_retire_lr(lr, vcpu);
>   if (vgic_irq_is_queued(vcpu, vlr.irq))
>   

[PATCH v2 5/6] vfio: platform: use list of registered reset function

2015-10-22 Thread Eric Auger
Remove the static lookup table and use the dynamic list of registered
reset functions instead. Also load the reset module through its alias.
The reset struct module pointer is stored in vfio_platform_device.

We also remove the useless struct device pointer parameter in
vfio_platform_get_reset.

This patch fixes the issue related to the usage of __symbol_get, which
besides from being moot, prevented compilation with CONFIG_MODULES
disabled.

Also usage of MODULE_ALIAS makes possible to add a new reset module
without needing to update the framework. This was suggested by Arnd.

Signed-off-by: Eric Auger 
Reported-by: Arnd Bergmann 

---

v1 -> v2:
- use reset_lock in vfio_platform_lookup_reset
- remove vfio_platform_reset_combo declaration
- remove struct device *dev parameter in vfio_platform_get_reset
- set reset_module and reset to NULL in put function
---
 drivers/vfio/platform/vfio_platform_common.c  | 57 ---
 drivers/vfio/platform/vfio_platform_private.h |  7 +---
 2 files changed, 34 insertions(+), 30 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index e7d615f..95b8294 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -34,37 +34,46 @@ static LIST_HEAD(reset_list);
 static DEFINE_MUTEX(reset_lock);
 static DEFINE_MUTEX(driver_lock);
 
-static const struct vfio_platform_reset_combo reset_lookup_table[] = {
-   {
-   .compat = "calxeda,hb-xgmac",
-   .reset_function_name = "vfio_platform_calxedaxgmac_reset",
-   .module_name = "vfio-platform-calxedaxgmac",
-   },
-};
-
-static void vfio_platform_get_reset(struct vfio_platform_device *vdev,
-   struct device *dev)
+static vfio_platform_reset_fn_t vfio_platform_lookup_reset(const char *compat,
+   struct module **module)
 {
-   int (*reset)(struct vfio_platform_device *);
-   int i;
+   struct vfio_platform_reset_node *iter;
 
-   for (i = 0 ; i < ARRAY_SIZE(reset_lookup_table); i++) {
-   if (!strcmp(reset_lookup_table[i].compat, vdev->compat)) {
-   request_module(reset_lookup_table[i].module_name);
-   reset = __symbol_get(
-   reset_lookup_table[i].reset_function_name);
-   if (reset) {
-   vdev->reset = reset;
-   return;
-   }
+   mutex_lock(_lock);
+   list_for_each_entry(iter, _list, link) {
+   if (!strcmp(iter->compat, compat) &&
+   try_module_get(iter->owner)) {
+   *module = iter->owner;
+   mutex_unlock(_lock);
+   return iter->reset;
}
}
+
+   mutex_unlock(_lock);
+   return NULL;
+}
+
+static void vfio_platform_get_reset(struct vfio_platform_device *vdev)
+{
+   char modname[256];
+
+   vdev->reset = vfio_platform_lookup_reset(vdev->compat,
+   >reset_module);
+   if (!vdev->reset) {
+   snprintf(modname, 256, "vfio-reset:%s", vdev->compat);
+   request_module(modname);
+   vdev->reset = vfio_platform_lookup_reset(vdev->compat,
+>reset_module);
+   }
 }
 
 static void vfio_platform_put_reset(struct vfio_platform_device *vdev)
 {
-   if (vdev->reset)
-   symbol_put_addr(vdev->reset);
+   if (vdev->reset) {
+   module_put(vdev->reset_module);
+   vdev->reset_module = NULL;
+   vdev->reset = NULL;
+   }
 }
 
 static int vfio_platform_regions_init(struct vfio_platform_device *vdev)
@@ -561,7 +570,7 @@ int vfio_platform_probe_common(struct vfio_platform_device 
*vdev,
return ret;
}
 
-   vfio_platform_get_reset(vdev, dev);
+   vfio_platform_get_reset(vdev);
 
mutex_init(>igate);
 
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index 620ee40..e968454 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -57,6 +57,7 @@ struct vfio_platform_device {
int refcnt;
struct mutexigate;
const char  *compat;
+   struct module   *reset_module;
 
/*
 * These fields should be filled by the bus specific binder
@@ -71,12 +72,6 @@ struct vfio_platform_device {
int (*reset)(struct vfio_platform_device *vdev);
 };
 
-struct vfio_platform_reset_combo {
-   const char *compat;
-   const char *reset_function_name;
-   const char 

[PATCH v2 4/6] vfio: platform: add compat in vfio_platform_device

2015-10-22 Thread Eric Auger
Let's retrieve the compatibility string on probe and store it
in the vfio_platform_device struct

Signed-off-by: Eric Auger 
---
 drivers/vfio/platform/vfio_platform_common.c  | 15 ---
 drivers/vfio/platform/vfio_platform_private.h |  1 +
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index 52a4c7b..e7d615f 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -45,16 +45,11 @@ static const struct vfio_platform_reset_combo 
reset_lookup_table[] = {
 static void vfio_platform_get_reset(struct vfio_platform_device *vdev,
struct device *dev)
 {
-   const char *compat;
int (*reset)(struct vfio_platform_device *);
-   int ret, i;
-
-   ret = device_property_read_string(dev, "compatible", );
-   if (ret)
-   return;
+   int i;
 
for (i = 0 ; i < ARRAY_SIZE(reset_lookup_table); i++) {
-   if (!strcmp(reset_lookup_table[i].compat, compat)) {
+   if (!strcmp(reset_lookup_table[i].compat, vdev->compat)) {
request_module(reset_lookup_table[i].module_name);
reset = __symbol_get(
reset_lookup_table[i].reset_function_name);
@@ -545,6 +540,12 @@ int vfio_platform_probe_common(struct vfio_platform_device 
*vdev,
struct iommu_group *group;
int ret;
 
+   ret = device_property_read_string(dev, "compatible", >compat);
+   if (ret) {
+   pr_err("VFIO: cannot retrieve compat for %s\n", vdev->name);
+   return -EINVAL;
+   }
+
if (!vdev)
return -EINVAL;
 
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index 1fb1410..620ee40 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -56,6 +56,7 @@ struct vfio_platform_device {
u32 num_irqs;
int refcnt;
struct mutexigate;
+   const char  *compat;
 
/*
 * These fields should be filled by the bus specific binder
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 1/6] vfio: platform: add capability to register a reset function

2015-10-22 Thread Arnd Bergmann
On Thursday 22 October 2015 11:41:57 Eric Auger wrote:
> In preparation for subsequent changes in reset function lookup,
> lets introduce a dynamic list of reset combos (compat string,
> reset module, reset function). The list can be populated/voided with
> two new functions, vfio_platform_register/unregister_reset. Those are
> not yet used in this patch.
> 
> Signed-off-by: Eric Auger 

Looks correct to me now, just a little style comments.

>  drivers/vfio/platform/vfio_platform_common.c  | 77 
> +++
>  drivers/vfio/platform/vfio_platform_private.h |  7 +++
>  2 files changed, 84 insertions(+)
> 
> diff --git a/drivers/vfio/platform/vfio_platform_common.c 
> b/drivers/vfio/platform/vfio_platform_common.c
> index e43efb5..52a4c7b 100644
> --- a/drivers/vfio/platform/vfio_platform_common.c
> +++ b/drivers/vfio/platform/vfio_platform_common.c
> @@ -23,6 +23,15 @@
>  
>  #include "vfio_platform_private.h"
>  
> +struct vfio_platform_reset_node {
> + struct list_head link;
> + char *compat;
> + struct module *owner;
> + vfio_platform_reset_fn_t reset;
> +};
> +
> +static LIST_HEAD(reset_list);
> +static DEFINE_MUTEX(reset_lock);
>  static DEFINE_MUTEX(driver_lock);

I wonder if having a single mutex here would be enough. If you don't expect
drivers to register/unregister a lot, you could just use driver_lock here
as well.

>  static const struct vfio_platform_reset_combo reset_lookup_table[] = {
> @@ -573,3 +582,71 @@ struct vfio_platform_device 
> *vfio_platform_remove_common(struct device *dev)
>   return vdev;
>  }
>  EXPORT_SYMBOL_GPL(vfio_platform_remove_common);
> +
> +int vfio_platform_register_reset(struct module *reset_owner,
> +  const char *compat,
> +  vfio_platform_reset_fn_t reset)
> +{
> + struct vfio_platform_reset_node *node, *iter;
> + bool found = false;
> +
> + mutex_lock(_lock);
> + list_for_each_entry(iter, _list, link) {
> + if (!strcmp(iter->compat, compat)) {
> + found = true;
> + break;
> + }
> + }
> + if (found) {
> + mutex_unlock(_lock);
> + return -EINVAL;
> + }

This seems to be an unnecesssary safeguard. I would not bother
with the search, or otherwise do a WARN_ON here.

> + node = kmalloc(sizeof(*node), GFP_KERNEL);
> + if (!node) {
> + mutex_unlock(_lock);
> + return -ENOMEM;
> + }
> +
> + node->compat = kstrdup(compat, GFP_KERNEL);
> + if (!node->compat) {
> + kfree(node);
> + mutex_unlock(_lock);
> + return -ENOMEM;
> + }

If you hold a lock, it's better to use a goto for handling errors
and put the unlock and kfree there. I think it can be avoided
entirely here though:

It should be safe to define the interface as keeping a reference
on the string and not do a kstrdup here. I would expect users to
pass a string literal.

We could even go as far as defining the entire interface as a shim,
like

#define vfio_platform_register_reset(__compat, __reset) \
static struct vfio_platform_reset_node __reset ## _node = { \
.owner = THIS_MODULE,   \
.compat = __compat, \
.reset = __reset,   \
};  \
__vfio_platform_register_reset(&__reset ## node);

to make the function really simple.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/6] vfio: platform: reset: add vfio_platform_reset_private.h

2015-10-22 Thread Eric Auger
This header is to be included in all vfio reset modules. It
defines the module_vfio_reset_handler macro whose role is
- to define a module alias
- implement module init/exit function which respectively registers
  and unregisters the reset function.

Signed-off-by: Eric Auger 

---

v2: creation
- this defines the module_vfio_reset_handler macro as suggested by Arnd

Although Arnd suggested me to remove the vfio_platform_register_reset
symbol_get (since the module manager can handle the case where the
vfio-platform driver is not loaded), I prefered to keep it while
introducing the macro. The rationale is, when using symbol_get/put
we are able to release the hold from the reset module on vfio-platform
as soon as the registration is complete and I think this makes sense.
---
 drivers/vfio/platform/reset/Makefile   |  2 +-
 .../platform/reset/vfio_platform_reset_private.h   | 66 ++
 2 files changed, 67 insertions(+), 1 deletion(-)
 create mode 100644 drivers/vfio/platform/reset/vfio_platform_reset_private.h

diff --git a/drivers/vfio/platform/reset/Makefile 
b/drivers/vfio/platform/reset/Makefile
index 2a486af..154a7d5 100644
--- a/drivers/vfio/platform/reset/Makefile
+++ b/drivers/vfio/platform/reset/Makefile
@@ -1,5 +1,5 @@
 vfio-platform-calxedaxgmac-y := vfio_platform_calxedaxgmac.o
 
-ccflags-y += -Idrivers/vfio/platform
+ccflags-y += -Idrivers/vfio/platform -Idrivers/vfio/platform/reset
 
 obj-$(CONFIG_VFIO_PLATFORM_CALXEDAXGMAC_RESET) += vfio-platform-calxedaxgmac.o
diff --git a/drivers/vfio/platform/reset/vfio_platform_reset_private.h 
b/drivers/vfio/platform/reset/vfio_platform_reset_private.h
new file mode 100644
index 000..f212a61
--- /dev/null
+++ b/drivers/vfio/platform/reset/vfio_platform_reset_private.h
@@ -0,0 +1,66 @@
+/*
+ * Interface used by VFIO platform reset modules to register/unregister
+ * their reset function
+ *
+ * Copyright (c) 2015 Linaro Ltd.
+ *  www.linaro.org
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ */
+
+#ifndef VFIO_PLATFORM_RESET_PRIVATE_H
+#define VFIO_PLATFORM_RESET_PRIVATE_H
+
+#include 
+#include "vfio_platform_private.h"
+
+static int reset_module_register(struct module *module,
+   const char *compat,
+   vfio_platform_reset_fn_t reset)
+{
+   int (*register_reset)(struct module *, const char*,
+   vfio_platform_reset_fn_t);
+   int ret;
+
+   register_reset = symbol_get(vfio_platform_register_reset);
+   if (!register_reset)
+   return -EINVAL;
+   ret = register_reset(module, compat, reset);
+   symbol_put(vfio_platform_register_reset);
+   return ret;
+}
+
+static void reset_module_unregister(const char *compat)
+{
+   int (*unregister_reset)(const char *);
+
+   unregister_reset = symbol_get(vfio_platform_unregister_reset);
+   if (!unregister_reset)
+   return;
+
+   unregister_reset(compat);
+
+   symbol_put(vfio_platform_unregister_reset);
+}
+
+#define module_vfio_reset_handler(compat, reset)   \
+MODULE_ALIAS("vfio-reset:" compat);\
+static int __init reset ## _module_init(void)  \
+{  \
+   return reset_module_register(THIS_MODULE, compat, );  \
+}; \
+static void __exit reset ## _module_exit(void) \
+{   \
+   reset_module_unregister(compat);\
+}; \
+module_init(reset ## _module_init);\
+module_exit(reset ## _module_exit)
+
+#endif /* VFIO_PLATFORM_RESET_PRIVATE_H */
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/6] vfio: platform: reset: add vfio_platform_reset_private.h

2015-10-22 Thread Arnd Bergmann
On Thursday 22 October 2015 11:41:58 Eric Auger wrote:
> v2: creation
> - this defines the module_vfio_reset_handler macro as suggested by Arnd
> 
> Although Arnd suggested me to remove the vfio_platform_register_reset
> symbol_get (since the module manager can handle the case where the
> vfio-platform driver is not loaded), I prefered to keep it while
> introducing the macro. The rationale is, when using symbol_get/put
> we are able to release the hold from the reset module on vfio-platform
> as soon as the registration is complete and I think this makes sense.

This makes it highly unusual. I can't think of a good reason to allow
unloading the vfio platform module while a reset module is registered,
that just causes a memory leak (or possibly a crash) when you do
unload the core module while it's used by a reset driver, it
prevents the reset module from getting loaded without loading the
vfio module first (because the autoloading doesn't work), and it
means we don't catch build errors in invalid configurations where you
have only the reset driver but not the vfio driver enabled.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next RFC 2/2] vhost_net: basic polling support

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 01:27:29AM -0400, Jason Wang wrote:
> This patch tries to poll for new added tx buffer for a while at the
> end of tx processing. The maximum time spent on polling were limited
> through a module parameter. To avoid block rx, the loop will end it
> there's new other works queued on vhost so in fact socket receive
> queue is also be polled.
> 
> busyloop_timeout = 50 gives us following improvement on TCP_RR test:
> 
> size/session/+thu%/+normalize%
> 1/ 1/   +5%/  -20%
> 1/50/  +17%/   +3%

Is there a measureable increase in cpu utilization
with busyloop_timeout = 0?

> Signed-off-by: Jason Wang 

We might be able to shave off the minor regression
by careful use of likely/unlikely, or maybe
deferring 

> ---
>  drivers/vhost/net.c | 19 +++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 9eda69e..bbb522a 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -31,7 +31,9 @@
>  #include "vhost.h"
>  
>  static int experimental_zcopytx = 1;
> +static int busyloop_timeout = 50;
>  module_param(experimental_zcopytx, int, 0444);
> +module_param(busyloop_timeout, int, 0444);

Pls add a description, including the units and the special
value 0.

>  MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
>  " 1 -Enable; 0 - Disable");
>  
> @@ -287,12 +289,23 @@ static void vhost_zerocopy_callback(struct ubuf_info 
> *ubuf, bool success)
>   rcu_read_unlock_bh();
>  }
>  
> +static bool tx_can_busy_poll(struct vhost_dev *dev,
> +  unsigned long endtime)
> +{
> + unsigned long now = local_clock() >> 10;

local_clock might go backwards if we jump between CPUs.
One way to fix would be to record the CPU id and break
out of loop if that changes.

Also - defer this until we actually know we need it?

> +
> + return busyloop_timeout && !need_resched() &&
> +!time_after(now, endtime) && !vhost_has_work(dev) &&
> +single_task_running();

signal pending as well?

> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_tx(struct vhost_net *net)
>  {
>   struct vhost_net_virtqueue *nvq = >vqs[VHOST_NET_VQ_TX];
>   struct vhost_virtqueue *vq = >vq;
> + unsigned long endtime;
>   unsigned out, in;
>   int head;
>   struct msghdr msg = {
> @@ -331,6 +344,8 @@ static void handle_tx(struct vhost_net *net)
> % UIO_MAXIOV == nvq->done_idx))
>   break;
>  
> + endtime  = (local_clock() >> 10) + busyloop_timeout;
> +again:
>   head = vhost_get_vq_desc(vq, vq->iov,
>ARRAY_SIZE(vq->iov),
>, ,
> @@ -340,6 +355,10 @@ static void handle_tx(struct vhost_net *net)
>   break;
>   /* Nothing new?  Wait for eventfd to tell us they refilled. */
>   if (head == vq->num) {
> + if (tx_can_busy_poll(vq->dev, endtime)) {
> + cpu_relax();
> + goto again;
> + }
>   if (unlikely(vhost_enable_notify(>dev, vq))) {
>   vhost_disable_notify(>dev, vq);
>   continue;
> -- 
> 1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 1/6] vfio: platform: add capability to register a reset function

2015-10-22 Thread Eric Auger
In preparation for subsequent changes in reset function lookup,
lets introduce a dynamic list of reset combos (compat string,
reset module, reset function). The list can be populated/voided with
two new functions, vfio_platform_register/unregister_reset. Those are
not yet used in this patch.

Signed-off-by: Eric Auger 

---

v1 -> v2:
- reset_list becomes static
- vfio_platform_register/unregister_reset take a const char * as compat
- fix node leak
- add reset_lock to protect the reset list manipulation
- move vfio_platform_reset_node declaration in vfio_platform_common.c
---
 drivers/vfio/platform/vfio_platform_common.c  | 77 +++
 drivers/vfio/platform/vfio_platform_private.h |  7 +++
 2 files changed, 84 insertions(+)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index e43efb5..52a4c7b 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -23,6 +23,15 @@
 
 #include "vfio_platform_private.h"
 
+struct vfio_platform_reset_node {
+   struct list_head link;
+   char *compat;
+   struct module *owner;
+   vfio_platform_reset_fn_t reset;
+};
+
+static LIST_HEAD(reset_list);
+static DEFINE_MUTEX(reset_lock);
 static DEFINE_MUTEX(driver_lock);
 
 static const struct vfio_platform_reset_combo reset_lookup_table[] = {
@@ -573,3 +582,71 @@ struct vfio_platform_device 
*vfio_platform_remove_common(struct device *dev)
return vdev;
 }
 EXPORT_SYMBOL_GPL(vfio_platform_remove_common);
+
+int vfio_platform_register_reset(struct module *reset_owner,
+const char *compat,
+vfio_platform_reset_fn_t reset)
+{
+   struct vfio_platform_reset_node *node, *iter;
+   bool found = false;
+
+   mutex_lock(_lock);
+   list_for_each_entry(iter, _list, link) {
+   if (!strcmp(iter->compat, compat)) {
+   found = true;
+   break;
+   }
+   }
+   if (found) {
+   mutex_unlock(_lock);
+   return -EINVAL;
+   }
+
+   node = kmalloc(sizeof(*node), GFP_KERNEL);
+   if (!node) {
+   mutex_unlock(_lock);
+   return -ENOMEM;
+   }
+
+   node->compat = kstrdup(compat, GFP_KERNEL);
+   if (!node->compat) {
+   kfree(node);
+   mutex_unlock(_lock);
+   return -ENOMEM;
+   }
+
+   node->owner = reset_owner;
+   node->reset = reset;
+
+   list_add(>link, _list);
+   mutex_unlock(_lock);
+   return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_platform_register_reset);
+
+int vfio_platform_unregister_reset(const char *compat)
+{
+   struct vfio_platform_reset_node *iter;
+   bool found = false;
+
+   mutex_lock(_lock);
+   list_for_each_entry(iter, _list, link) {
+   if (!strcmp(iter->compat, compat)) {
+   found = true;
+   break;
+   }
+   }
+   if (!found) {
+   mutex_unlock(_lock);
+   return -EINVAL;
+   }
+
+   list_del(>link);
+   kfree(iter->compat);
+   kfree(iter);
+   mutex_unlock(_lock);
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(vfio_platform_unregister_reset);
+
diff --git a/drivers/vfio/platform/vfio_platform_private.h 
b/drivers/vfio/platform/vfio_platform_private.h
index 1c9b3d5..1fb1410 100644
--- a/drivers/vfio/platform/vfio_platform_private.h
+++ b/drivers/vfio/platform/vfio_platform_private.h
@@ -89,4 +89,11 @@ extern int vfio_platform_set_irqs_ioctl(struct 
vfio_platform_device *vdev,
unsigned start, unsigned count,
void *data);
 
+typedef int (*vfio_platform_reset_fn_t)(struct vfio_platform_device *vdev);
+
+extern int vfio_platform_register_reset(struct module *owner,
+   const char *compat,
+   vfio_platform_reset_fn_t reset);
+extern int vfio_platform_unregister_reset(const char *compat);
+
 #endif /* VFIO_PLATFORM_PRIVATE_H */
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/6] VFIO platform reset module rework

2015-10-22 Thread Eric Auger
This series fixes the current implementation by getting rid of the
usage of __symbol_get which caused a compilation issue with
CONFIG_MODULES disabled. On top of this, the usage of MODULE_ALIAS makes
possible to add a new reset module without being obliged to update the
framework. The new implementation relies on the reset module registering
its reset function to the vfio-platform driver.

The series is available at

https://git.linaro.org/people/eric.auger/linux.git/shortlog/refs/heads/v4.3-rc6-rework-v2

Best Regards

Eric

v1 -> v2:
* in vfio_platform_common.c:
  - move reset lookup at load time and put reset at release: this is to
prevent a race between the 2 load module loads
  - reset_list becomes static
  - vfio_platform_register/unregister_reset take a const char * as compat
  - fix node link
  - remove old combo struct and cleanup proto of vfio_platform_get_reset
  - add mutex to protect the reset list
* in calxeda xgmac reset module
  - introduce vfio_platform_reset_private.h
  - use module_vfio_reset_handler macro
  - do not export vfio_platform_calxedaxgmac_reset symbol anymore
  - add a pr_info to show the device is reset by vfio reset module


Eric Auger (6):
  vfio: platform: add capability to register a reset function
  vfio: platform: reset: add vfio_platform_reset_private.h
  vfio: platform: reset: calxedaxgmac: add reset function registration
  vfio: platform: add compat in vfio_platform_device
  vfio: platform: use list of registered reset function
  vfio: platform: move get/put reset at open/release

 drivers/vfio/platform/reset/Makefile   |   2 +-
 .../platform/reset/vfio_platform_calxedaxgmac.c|   9 +-
 .../platform/reset/vfio_platform_reset_private.h   |  66 +
 drivers/vfio/platform/vfio_platform_common.c   | 151 -
 drivers/vfio/platform/vfio_platform_private.h  |  15 +-
 5 files changed, 201 insertions(+), 42 deletions(-)
 create mode 100644 drivers/vfio/platform/reset/vfio_platform_reset_private.h

-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/6] vfio: platform: reset: calxedaxgmac: add reset function registration

2015-10-22 Thread Arnd Bergmann
On Thursday 22 October 2015 11:41:59 Eric Auger wrote:
> This patch adds the reset function registration/unregistration.
> 
> Signed-off-by: Eric Auger 

Looks good, except one thing:
> @@ -70,6 +69,8 @@ int vfio_platform_calxedaxgmac_reset(struct 
> vfio_platform_device *vdev)
>   return -ENOMEM;
>   }
>  
> + pr_info("VFIO reset of %s\n", vdev->name);
> +
>   /* disable IRQ */
>   writel(0, reg.ioaddr + XGMAC_DMA_INTR_ENA);
>  

This probably slipped in from debugging, please remove it.

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] vfio: platform: use list of registered reset function

2015-10-22 Thread Arnd Bergmann
On Thursday 22 October 2015 11:42:01 Eric Auger wrote:
> Remove the static lookup table and use the dynamic list of registered
> reset functions instead. Also load the reset module through its alias.
> The reset struct module pointer is stored in vfio_platform_device.
> 
> We also remove the useless struct device pointer parameter in
> vfio_platform_get_reset.
> 
> This patch fixes the issue related to the usage of __symbol_get, which
> besides from being moot, prevented compilation with CONFIG_MODULES
> disabled.
> 
> Also usage of MODULE_ALIAS makes possible to add a new reset module
> without needing to update the framework. This was suggested by Arnd.
> 
> Signed-off-by: Eric Auger 
> Reported-by: Arnd Bergmann 

Looks good, just small style issues:

> 
> - }
> + mutex_lock(_lock);
> + list_for_each_entry(iter, _list, link) {
> + if (!strcmp(iter->compat, compat) &&
> + try_module_get(iter->owner)) {
> + *module = iter->owner;
> + mutex_unlock(_lock);
> + return iter->reset;
>   }
>   }
> +
> + mutex_unlock(_lock);
> + return NULL;

Better use 'break' to have a single mutex_unlock and return  statement.

> +
>  
>  static void vfio_platform_put_reset(struct vfio_platform_device *vdev)
>  {
> - if (vdev->reset)
> - symbol_put_addr(vdev->reset);
> + if (vdev->reset) {
> + module_put(vdev->reset_module);
> + vdev->reset_module = NULL;
> + vdev->reset = NULL;
> + }
>  }

This gets called from the remove callback. No need to clear those
struct members immediately before the kfree().

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 3/6] vfio: platform: reset: calxedaxgmac: add reset function registration

2015-10-22 Thread Eric Auger
This patch adds the reset function registration/unregistration.

Signed-off-by: Eric Auger 

---

v1 -> v2:
- uses the module_vfio_reset_handler macro
- add pr_info on vfio reset
- do not export vfio_platform_calxedaxgmac_reset symbol anymore
---
 drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c 
b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
index 619dc7d..8b2 100644
--- a/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
+++ b/drivers/vfio/platform/reset/vfio_platform_calxedaxgmac.c
@@ -25,13 +25,12 @@
 #include 
 
 #include "vfio_platform_private.h"
+#include "vfio_platform_reset_private.h"
 
 #define DRIVER_VERSION  "0.1"
 #define DRIVER_AUTHOR   "Eric Auger "
 #define DRIVER_DESC "Reset support for Calxeda xgmac vfio platform device"
 
-#define CALXEDAXGMAC_COMPAT "calxeda,hb-xgmac"
-
 /* XGMAC Register definitions */
 #define XGMAC_CONTROL   0x  /* MAC Configuration */
 
@@ -70,6 +69,8 @@ int vfio_platform_calxedaxgmac_reset(struct 
vfio_platform_device *vdev)
return -ENOMEM;
}
 
+   pr_info("VFIO reset of %s\n", vdev->name);
+
/* disable IRQ */
writel(0, reg.ioaddr + XGMAC_DMA_INTR_ENA);
 
@@ -78,7 +79,9 @@ int vfio_platform_calxedaxgmac_reset(struct 
vfio_platform_device *vdev)
 
return 0;
 }
-EXPORT_SYMBOL_GPL(vfio_platform_calxedaxgmac_reset);
+
+module_vfio_reset_handler("calxeda,hb-xgmac",
+ vfio_platform_calxedaxgmac_reset);
 
 MODULE_VERSION(DRIVER_VERSION);
 MODULE_LICENSE("GPL v2");
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 6/6] vfio: platform: move get/put reset at open/release

2015-10-22 Thread Eric Auger
Currently reset lookup is done on probe. This introduces a
race with new registration mechanism in the case where the
vfio-platform driver is bound to the device before its module
is loaded: on the load, the probe happens which triggers the
reset module load which itself attempts to get the symbol for
the registration function (vfio_platform_register_reset). The
symbol is not yet available hence the lookup fails. In case we
do the lookup in the first open we are sure the vfio-platform
module is loaded and vfio_platform_register_reset is available.

Signed-off-by: Eric Auger 
---
 drivers/vfio/platform/vfio_platform_common.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vfio/platform/vfio_platform_common.c 
b/drivers/vfio/platform/vfio_platform_common.c
index 95b8294..5ebae8c 100644
--- a/drivers/vfio/platform/vfio_platform_common.c
+++ b/drivers/vfio/platform/vfio_platform_common.c
@@ -155,6 +155,7 @@ static void vfio_platform_release(void *device_data)
vdev->reset(vdev);
vfio_platform_regions_cleanup(vdev);
vfio_platform_irq_cleanup(vdev);
+   vfio_platform_put_reset(vdev);
}
 
mutex_unlock(_lock);
@@ -181,6 +182,8 @@ static int vfio_platform_open(void *device_data)
if (ret)
goto err_irq;
 
+   vfio_platform_get_reset(vdev);
+
if (vdev->reset)
vdev->reset(vdev);
}
@@ -570,8 +573,6 @@ int vfio_platform_probe_common(struct vfio_platform_device 
*vdev,
return ret;
}
 
-   vfio_platform_get_reset(vdev);
-
mutex_init(>igate);
 
return 0;
@@ -585,7 +586,6 @@ struct vfio_platform_device 
*vfio_platform_remove_common(struct device *dev)
vdev = vfio_del_group_dev(dev);
 
if (vdev) {
-   vfio_platform_put_reset(vdev);
iommu_group_put(dev->iommu_group);
}
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Difference between vcpu_load and kvm_sched_in ?

2015-10-22 Thread Paolo Bonzini


On 21/10/2015 19:21, Yacine HEBBAL wrote:
> If I correctly understood you last paragraph, it is better to use vm_ioctl
> to do generic processing that doesn't rely on a given VCPU and hence I won't
> need to use "CPU_FOREACH, run_on_cpu and current_cpu".

Right.  On the other hand, you definitely want a vcpu_ioctl if you need
to call vcpu_load.

Thanks,

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 6/6] vfio: platform: move get/put reset at open/release

2015-10-22 Thread Arnd Bergmann
On Thursday 22 October 2015 15:26:55 Eric Auger wrote:
> >> @@ -181,6 +182,8 @@ static int vfio_platform_open(void *device_data)
> >> if (ret)
> >> goto err_irq;
> >>  
> >> +   vfio_platform_get_reset(vdev);
> >> +
> >> if (vdev->reset)
> >> vdev->reset(vdev);
> >>
> > 
> > This needs some error handling to ensure that the open() fails
> > if there is no reset handler.
> 
> Is that really what we want? The code was meant to allow the use case
> where the VFIO platform driver would be used without such reset module.
> 
> I think the imperious need for a reset module depends on the device and
> more importantly depends on the IOMMU mapping. With QEMU VFIO
> integration this is needed because the whole VM memory is IOMMU mapped
> but in a simpler user-space driver context, we might live without.
> 
> Any thought?

I would think we need a reset driver for any device that can start DMA,
otherwise things can go wrong as soon as you attach it to a different domain
while there is ongoing DMA.

Maybe we could just allow devices to be attached without a reset handler,
but then disallow DMA on them?

Arnd
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Andrea Arcangeli
On Thu, Oct 22, 2015 at 03:38:24PM +0200, Peter Zijlstra wrote:
> On Thu, Oct 22, 2015 at 03:20:15PM +0200, Andrea Arcangeli wrote:
> 
> > If schedule spontaneously wakes up a task in TASK_KILLABLE state that
> > would be a bug in the scheduler in my view. Luckily there doesn't seem
> > to be such a bug, or at least we never experienced it.
> 
> Well, there will be a wakeup, just not the one you were hoping for.
> 
> We have code that does:
> 
>   @cond = true;
>   get_task_struct(p);
>   queue(p)
> 
>   /* random wait somewhere */
>   for (;;) {
>   prepare_to_wait();
>   if (@cond)
> break;
> 
>   ...
> 
>   handle_userfault()
> ...
> schedule();
>   ...
> 
>   dequeue(p)
>   wake_up_process(p) ---> wakeup without userfault wakeup
> 
> 
> These races are (extremely) rare, but they do exist. Therefore one must
> never assume schedule() will not spuriously wake because of these
> things.
> 
> Also, see:
> 
> lkml.kernel.org/r/CA+55aFwHkOo+YGWKYROmce1-H_uG3KfEUmCkJUerTj=ojy2...@mail.gmail.com

With one more spinlock taken in the fast path we could recheck if the
waitqueue is still queued and this is a false positive extremely rare
spurious wakeup, and in such case set the state back to TASK_KILLABLE
and schedule.

However in the long term such a spinlock should be removed because
it's faster to stick with the current lockless list_empty_careful and
not to recheck the auto-remove waitqueue, but then we must be able to
re-enter handle_userfault() even if FAULT_FLAG_TRIED was set
(currently we can't return VM_FAULT_RETRY if FAULT_FLAG_TRIED is set
and that's the problem). This change is planned for a long time as we
need it to arm the vma-less write protection while the app is running,
so I'm not sure if it's worth going for the short term fix if this is
extremely rare.

The risk of memory corruption is still zero no matter what happens
here, in the extremely rare case the app will get a SIGBUS or a
syscall will return -EFAULT. The kernel also cannot crash. So it's not
very severe concern if it happens extremely rarely (we never
reproduced it and stress testing run for months). Of course in the
longer term this would have been fixed regardless as said in previous
email.

I think going for the longer term fix that was already planned, is
better than doing a short term fix and the real question is how I
should proceed to change the arch code and gup to cope with
handle_userfault() being re-entered.

The simplest thing is to drop FAULT_FLAG_TRIED as a whole. Or I could
add a new VM_FAULT_USERFAULT flag specific to handle_userfault that
would be returned even if FAULT_FLAG_TRIED is set, so that only
userfaults will be allowed to be repeated indefinitely (and then
VM_FAULT_USERFAULT shouldn't trigger a transition to FAULT_FLAG_TRIED,
unlike VM_FAULT_RETRY does).

This is all about being allowed to drop the mmap_sem.

If we'd check the waitqueue with the spinlock (to be sure the wakeup
isn't happening from under us while we check if we got an userfault
wakeup or if this is a spurious schedule), we could also limit the
VM_FAULT_RETRY to 2 max events if I add a FAULT_FLAG_TRIED2 and I
still use VM_FAULT_RETRY (instead of VM_FAULT_USERFAULT).

Being able to return VM_FAULT_RETRY indefinitely is only needed if we
don't handle the extremely wakeup race condition in handle_userfault
by taking the spinlock once more time in the fast path (i.e. after the
schedule).

I'm not exactly sure why we allow VM_FAULT_RETRY only once currently
so I'm tempted to drop FAULT_FLAG_TRIED entirely.

I've no real preference on how to tweak the page fault code to be able
to return VM_FAULT_RETRY indefinitely and I would aim for the smallest
change possible, so if you've suggestions now it's good time.

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 6/6] vfio: platform: move get/put reset at open/release

2015-10-22 Thread Eric Auger
On 10/22/2015 04:10 PM, Arnd Bergmann wrote:
> On Thursday 22 October 2015 15:26:55 Eric Auger wrote:
 @@ -181,6 +182,8 @@ static int vfio_platform_open(void *device_data)
 if (ret)
 goto err_irq;
  
 +   vfio_platform_get_reset(vdev);
 +
 if (vdev->reset)
 vdev->reset(vdev);

>>>
>>> This needs some error handling to ensure that the open() fails
>>> if there is no reset handler.
>>
>> Is that really what we want? The code was meant to allow the use case
>> where the VFIO platform driver would be used without such reset module.
>>
>> I think the imperious need for a reset module depends on the device and
>> more importantly depends on the IOMMU mapping. With QEMU VFIO
>> integration this is needed because the whole VM memory is IOMMU mapped
>> but in a simpler user-space driver context, we might live without.
>>
>> Any thought?
> 
> I would think we need a reset driver for any device that can start DMA,
> otherwise things can go wrong as soon as you attach it to a different domain
> while there is ongoing DMA.
> 
> Maybe we could just allow devices to be attached without a reset handler,
> but then disallow DMA on them?

Well I am tempted to think that most assigned devices will perform DMA
accesses so to me this somehow comes to the same result, ie disallowing
functional passthrough for devices not properly/fully integrated.

Alex/Baptiste, any opinion on this?

Thanks

Eric


> 
>   Arnd
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC Patch 12/12] IXGBEVF: Track dma dirty pages

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 12:37:44AM +0800, Lan Tianyu wrote:
> Migration relies on tracking dirty page to migrate memory.
> Hardware can't automatically mark a page as dirty after DMA
> memory access. VF descriptor rings and data buffers are modified
> by hardware when receive and transmit data. To track such dirty memory
> manually, do dummy writes(read a byte and write it back) during receive
> and transmit data.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index d22160f..ce7bd7a 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -414,6 +414,9 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
> *q_vector,
>   if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
>   break;
>  
> + /* write back status to mark page dirty */

Which page? the descriptor ring?  What does marking it dirty accomplish
though, given that we might migrate right before this happens?

It might be a good idea to just specify addresses of rings
to hypervisor, and have it send the ring pages after VM
and the VF are stopped.


> + eop_desc->wb.status = eop_desc->wb.status;
> +
Compiler is likely to optimize this out.
You also probably need a wmb here ...

>   /* clear next_to_watch to prevent false hangs */
>   tx_buffer->next_to_watch = NULL;
>   tx_buffer->desc_num = 0;
> @@ -946,15 +949,17 @@ static struct sk_buff *ixgbevf_fetch_rx_buffer(struct 
> ixgbevf_ring *rx_ring,
>  {
>   struct ixgbevf_rx_buffer *rx_buffer;
>   struct page *page;
> + u8 *page_addr;
>  
>   rx_buffer = _ring->rx_buffer_info[rx_ring->next_to_clean];
>   page = rx_buffer->page;
>   prefetchw(page);
>  
> - if (likely(!skb)) {
> - void *page_addr = page_address(page) +
> -   rx_buffer->page_offset;
> + /* Mark page dirty */

Looks like there's a race condition here: VM could
migrate at this point. RX ring will indicate
packet has been received, but page data would be stale.


One solution I see is explicitly testing for this
condition and discarding the packet.
For example, hypervisor could increment some counter
in RAM during migration.

Then:

x = read counter

get packet from rx ring
mark page dirty

y = read counter

if (x != y)
discard packet


> + page_addr = page_address(page) + rx_buffer->page_offset;
> + *page_addr = *page_addr;

Compiler is likely to optimize this out.
You also probably need a wmb here ...


>  
> + if (likely(!skb)) {
>   /* prefetch first cache line of first page */
>   prefetch(page_addr);

prefetch makes no sense if you read it right here.

>  #if L1_CACHE_BYTES < 128
> @@ -1032,6 +1037,9 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
> *q_vector,
>   if (!ixgbevf_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
>   break;
>  
> + /* Write back status to mark page dirty */
> + rx_desc->wb.upper.status_error = rx_desc->wb.upper.status_error;
> +

same question as for tx.

>   /* This memory barrier is needed to keep us from reading
>* any other fields out of the rx_desc until we know the
>* RXD_STAT_DD bit is set
> -- 
> 1.8.4.rc0.1.g8f6a3e5.dirty
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 12:37:42AM +0800, Lan Tianyu wrote:
> Ring shifting during restoring VF function maybe race with original
> ring operation(transmit/receive package). This patch is to add tx/rx
> lock to protect ring related data.
> 
> Signed-off-by: Lan Tianyu 

That's adding a bunch of locking on data path - what's the
performance impact?

Can't you do something faster? E.g. migration things
are slow path - can't you use something like RCU
to flush outstanding work?


> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  2 ++
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 
> ---
>  2 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index 6eab402e..3a748c8 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -448,6 +448,8 @@ struct ixgbevf_adapter {
>  
>   spinlock_t mbx_lock;
>   unsigned long last_reset;
> + spinlock_t mg_rx_lock;
> + spinlock_t mg_tx_lock;
>  };
>  
>  enum ixbgevf_state_t {
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 15ec361..04b6ce7 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring 
> *ring)
>  
>  int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
>  {
> + struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
>   struct ixgbevf_tx_buffer *tx_buffer = NULL;
>   static union ixgbevf_desc *tx_desc = NULL;
> + unsigned long flags;
>  
>   tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
>   if (!tx_buffer)
> @@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 
> head)
>   if (!tx_desc)
>   return -ENOMEM;
>  
> + spin_lock_irqsave(>mg_tx_lock, flags);
>   memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
>   memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count 
> - head));
>   memcpy(>desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * 
> head);
> @@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 
> head)
>   else
>   r->next_to_use += (r->count - head);
>  
> + spin_unlock_irqrestore(>mg_tx_lock, flags);
> +
>   vfree(tx_buffer);
>   vfree(tx_desc);
>   return 0;
> @@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 
> head)
>  
>  int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
>  {
> + struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
>   struct ixgbevf_rx_buffer *rx_buffer = NULL;
>   static union ixgbevf_desc *rx_desc = NULL;
> + unsigned long flags;
>  
>   rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
>   if (!rx_buffer)
> @@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 
> head)
>   if (!rx_desc)
>   return -ENOMEM;
>  
> + spin_lock_irqsave(>mg_rx_lock, flags);
>   memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
>   memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count 
> - head));
>   memcpy(>desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * 
> head);
> @@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 
> head)
>   r->next_to_use -= head;
>   else
>   r->next_to_use += (r->count - head);
> + spin_unlock_irqrestore(>mg_rx_lock, flags);
>  
>   vfree(rx_buffer);
>   vfree(rx_desc);
> @@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
> *q_vector,
>   if (test_bit(__IXGBEVF_DOWN, >state))
>   return true;
>  
> + spin_lock(>mg_tx_lock);
> + i = tx_ring->next_to_clean;
>   tx_buffer = _ring->tx_buffer_info[i];
>   tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
>   i -= tx_ring->count;
> @@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
> *q_vector,
>   q_vector->tx.total_bytes += total_bytes;
>   q_vector->tx.total_packets += total_packets;
>  
> + spin_unlock(>mg_tx_lock);
> +
>   if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) {
>   struct ixgbe_hw *hw = >hw;
>   union ixgbe_adv_tx_desc *eop_desc;
> @@ -999,10 +1012,12 @@ static int ixgbevf_clean_rx_irq(struct 
> ixgbevf_q_vector *q_vector,
>   struct ixgbevf_ring *rx_ring,
>   int budget)
>  {
> + struct ixgbevf_adapter *adapter = netdev_priv(rx_ring->netdev);
>   unsigned int total_rx_bytes = 0, total_rx_packets = 0;
>   u16 cleaned_count = ixgbevf_desc_unused(rx_ring);
>  

Re: [Qemu-devel] [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"

2015-10-22 Thread Michael S. Tsirkin
On Wed, Oct 21, 2015 at 01:52:48PM -0700, Alexander Duyck wrote:
> Also have you even considered the MSI-X configuration on the VF?  I haven't
> seen anything anywhere that would have migrated the VF's MSI-X configuration
> from BAR 3 on one system to the new system.

Hypervisors do this for virtual devices so they can do this
for physical devices too.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/3] target-i386: save/restore vcpu's TSC rate during migration

2015-10-22 Thread Haozhong Zhang
On Thu, Oct 22, 2015 at 04:45:21PM -0200, Eduardo Habkost wrote:
> On Tue, Oct 20, 2015 at 03:22:51PM +0800, Haozhong Zhang wrote:
> > This patchset enables QEMU to save/restore vcpu's TSC rate during the
> > migration. When cooperating with KVM which supports TSC scaling, guest
> > programs can observe a consistent guest TSC rate even though they are
> > migrated among machines with different host TSC rates.
> > 
> > A pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' are added to
> > control the migration of vcpu's TSC rate.
> 
> The requirements and goals aren't clear to me. I see two possible use
> cases, here:
> 
> 1) Best effort to keep TSC frequency constant if possible (but not
>aborting migration if not possible). This would be an interesting
>default, but a bit unpredictable.
> 2) Strictly ensuring TSC frequency stays constant on migration (and
>aborting migration if not possible). This would be an useful feature,
>but can't be enabled by default unless both hosts have the same TSC
>frequency or support TSC scaling.
> 
> Which one(s) you are trying to implement?
>

The former. I agree that it's unpredictable if setting vcpu's TSC
frequency to the migrated value is enabled by default (but not in this
patchset). The cpu option 'load-tsc-freq' is introduced to allow users
to enable this behavior if they do know the underlying KVM and CPU
support TSC scaling. In this way, I think the behavior is predictable
as users do know what they are doing.

> In other words, what is the right behavior when KVM_SET_TSC_KHZ fails or
> KVM_CAP_TSC_CONTROL is not available? We can't answer that question if
> the requirements and goals are not clear.
>

If KVM_CAP_TSC_CONTROL is unavailable, QEMU and KVM will use the host
TSC frequency as vcpu's TSC frequency.

If KVM_CAP_TSC_CONTROL is available and KVM_SET_TSC_KHZ fails, the
setting of TSC frequency will fail and abort either the VM creation
(this is the case for cpu option 'tsc-freq') or the migration.

> Once we know what exactly is the goal, we could enable the new mode with
> a single option, instead of raw options to control migration stream
> loading/saving.
>

Saving vcpu's TSC frequency does not depend on TSC scaling but the
loading does. And that is why I introduce two cpu options to control
them separately.

Haozhong

> 
> >  * By default, the migration of vcpu's TSC rate is enabled only on
> >pc-*-2.5 and newer machine types. If the cpu option 'save-tsc-freq'
> >is present, the vcpu's TSC rate will be migrated from older machine
> >types as well.
> >  * Another cpu option 'load-tsc-freq' controls whether the migrated
> >vcpu's TSC rate is used. By default, QEMU will not use the migrated
> >TSC rate if this option is not present. Otherwise, QEMU will use
> >the migrated TSC rate and override the TSC rate given by the cpu
> >option 'tsc-freq'.
> > 
> > Changes in v2:
> >  * Add a pair of cpu options 'save-tsc-freq' and 'load-tsc-freq' to
> >control the migration of vcpu's TSC rate.
> >  * Move all logic of setting TSC rate to target-i386.
> >  * Remove the duplicated TSC setup in kvm_arch_init_vcpu().
> > 
> > Haozhong Zhang (3):
> >   target-i386: add a subsection for migrating vcpu's TSC rate
> >   target-i386: calculate vcpu's TSC rate to be migrated
> >   target-i386: load the migrated vcpu's TSC rate
> > 
> >  include/hw/i386/pc.h  |  5 +
> >  target-i386/cpu.c |  2 ++
> >  target-i386/cpu.h |  3 +++
> >  target-i386/kvm.c | 61 
> > +++
> >  target-i386/machine.c | 19 
> >  5 files changed, 81 insertions(+), 9 deletions(-)
> > 
> > -- 
> > 2.4.8
> > 
> 
> -- 
> Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/3] target-i386: load the migrated vcpu's TSC rate

2015-10-22 Thread Haozhong Zhang
On Thu, Oct 22, 2015 at 04:11:37PM -0200, Eduardo Habkost wrote:
> On Tue, Oct 20, 2015 at 03:22:54PM +0800, Haozhong Zhang wrote:
> > Set vcpu's TSC rate to the migrated value (if any). If KVM supports TSC
> > scaling, guest programs will observe TSC increasing in the migrated rate
> > other than the host TSC rate.
> > 
> > The loading is controlled by a new cpu option 'load-tsc-freq'. If it is
> > present, then the loading will be enabled and the migrated vcpu's TSC
> > rate will override the value specified by the cpu option
> > 'tsc-freq'. Otherwise, the loading will be disabled.
> 
> Why do we need an option? Why can't we enable loading unconditionally?
>

If TSC scaling is not supported by KVM and CPU, unconditionally
enabling this loading will not take effect which would be different
from users' expectation. 'load-tsc-freq' is introduced to allow users
to enable the loading of migrated TSC frequency if they do know the
underlying KVM and CPU have TSC scaling support.

> > 
> > The setting of vcpu's TSC rate in this patch duplicates the code in
> > kvm_arch_init_vcpu(), so we remove the latter one.
> > 
> > Signed-off-by: Haozhong Zhang 
> > ---
> >  target-i386/cpu.c |  1 +
> >  target-i386/cpu.h |  1 +
> >  target-i386/kvm.c | 28 +++-
> >  3 files changed, 21 insertions(+), 9 deletions(-)
> > 
> > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > index b6bb457..763ba4b 100644
> > --- a/target-i386/cpu.c
> > +++ b/target-i386/cpu.c
> > @@ -3144,6 +3144,7 @@ static Property x86_cpu_properties[] = {
> >  DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
> >  DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
> >  DEFINE_PROP_BOOL("save-tsc-freq", X86CPU, env.save_tsc_khz, true),
> > +DEFINE_PROP_BOOL("load-tsc-freq", X86CPU, env.load_tsc_khz, false),
> >  DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
> >  DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
> >  DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
> > diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> > index ba1a289..353f5fb 100644
> > --- a/target-i386/cpu.h
> > +++ b/target-i386/cpu.h
> > @@ -968,6 +968,7 @@ typedef struct CPUX86State {
> >  int64_t tsc_khz;
> >  int64_t tsc_khz_incoming;
> >  bool save_tsc_khz;
> > +bool load_tsc_khz;
> >  void *kvm_xsave_buf;
> >  
> >  uint64_t mcg_cap;
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 698524a..34616f5 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -743,15 +743,6 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >  return r;
> >  }
> >  
> > -r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL);
> > -if (r && env->tsc_khz) {
> > -r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz);
> > -if (r < 0) {
> > -fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
> > -return r;
> > -}
> > -}
> > -
> >  if (kvm_has_xsave()) {
> >  env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
> >  }
> > @@ -2223,6 +2214,25 @@ static int kvm_setup_tsc_khz(X86CPU *cpu, int level)
> >  return 0;
> >  
> >  /*
> > + * If the cpu option 'load-tsc-freq' is present, the vcpu's TSC rate 
> > in the
> > + * migrated state will be used and the overrides the user-specified 
> > vcpu's
> > + * TSC rate (if any).
> > + */
> > +if (runstate_check(RUN_STATE_INMIGRATE) &&
> > +env->load_tsc_khz && env->tsc_khz_incoming) {
> > +env->tsc_khz = env->tsc_khz_incoming;
> > +}
> 
> Please don't make the results of the function depend on global QEMU
> runstate, as it makes it harder to reason about it, and easy to
> introduce subtle bugs if we change initialization order. Can't we just
> ensure tsc_khz gets set to the right value before the function is
> called, inside the code that loads migration data?
> 
> > +
> > +r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL);
> > +if (r && env->tsc_khz) {
> > +r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz);
> > +if (r < 0) {
> > +fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
> > +return r;
> > +}
> > +}
> 
> So, the final result here does not depend on the configuration, but also
> on host capabilities. That means nobody can possibly know if the
> tsc-freq option really works, until they enable it, run the VM, and
> check the results from inside the VM. Not a good idea.
> 
> (This doesn't apply just to the new code, the existing code is already
> broken this way.)
> 
> -- 
> Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/3] target-i386: load the migrated vcpu's TSC rate

2015-10-22 Thread Haozhong Zhang
On Thu, Oct 22, 2015 at 04:11:37PM -0200, Eduardo Habkost wrote:
> On Tue, Oct 20, 2015 at 03:22:54PM +0800, Haozhong Zhang wrote:
> > Set vcpu's TSC rate to the migrated value (if any). If KVM supports TSC
> > scaling, guest programs will observe TSC increasing in the migrated rate
> > other than the host TSC rate.
> > 
> > The loading is controlled by a new cpu option 'load-tsc-freq'. If it is
> > present, then the loading will be enabled and the migrated vcpu's TSC
> > rate will override the value specified by the cpu option
> > 'tsc-freq'. Otherwise, the loading will be disabled.
> 
> Why do we need an option? Why can't we enable loading unconditionally?
>
> > 
> > The setting of vcpu's TSC rate in this patch duplicates the code in
> > kvm_arch_init_vcpu(), so we remove the latter one.
> > 
> > Signed-off-by: Haozhong Zhang 
> > ---
> >  target-i386/cpu.c |  1 +
> >  target-i386/cpu.h |  1 +
> >  target-i386/kvm.c | 28 +++-
> >  3 files changed, 21 insertions(+), 9 deletions(-)
> > 
> > diff --git a/target-i386/cpu.c b/target-i386/cpu.c
> > index b6bb457..763ba4b 100644
> > --- a/target-i386/cpu.c
> > +++ b/target-i386/cpu.c
> > @@ -3144,6 +3144,7 @@ static Property x86_cpu_properties[] = {
> >  DEFINE_PROP_BOOL("enforce", X86CPU, enforce_cpuid, false),
> >  DEFINE_PROP_BOOL("kvm", X86CPU, expose_kvm, true),
> >  DEFINE_PROP_BOOL("save-tsc-freq", X86CPU, env.save_tsc_khz, true),
> > +DEFINE_PROP_BOOL("load-tsc-freq", X86CPU, env.load_tsc_khz, false),
> >  DEFINE_PROP_UINT32("level", X86CPU, env.cpuid_level, 0),
> >  DEFINE_PROP_UINT32("xlevel", X86CPU, env.cpuid_xlevel, 0),
> >  DEFINE_PROP_UINT32("xlevel2", X86CPU, env.cpuid_xlevel2, 0),
> > diff --git a/target-i386/cpu.h b/target-i386/cpu.h
> > index ba1a289..353f5fb 100644
> > --- a/target-i386/cpu.h
> > +++ b/target-i386/cpu.h
> > @@ -968,6 +968,7 @@ typedef struct CPUX86State {
> >  int64_t tsc_khz;
> >  int64_t tsc_khz_incoming;
> >  bool save_tsc_khz;
> > +bool load_tsc_khz;
> >  void *kvm_xsave_buf;
> >  
> >  uint64_t mcg_cap;
> > diff --git a/target-i386/kvm.c b/target-i386/kvm.c
> > index 698524a..34616f5 100644
> > --- a/target-i386/kvm.c
> > +++ b/target-i386/kvm.c
> > @@ -743,15 +743,6 @@ int kvm_arch_init_vcpu(CPUState *cs)
> >  return r;
> >  }
> >  
> > -r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL);
> > -if (r && env->tsc_khz) {
> > -r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz);
> > -if (r < 0) {
> > -fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
> > -return r;
> > -}
> > -}
> > -
> >  if (kvm_has_xsave()) {
> >  env->kvm_xsave_buf = qemu_memalign(4096, sizeof(struct kvm_xsave));
> >  }
> > @@ -2223,6 +2214,25 @@ static int kvm_setup_tsc_khz(X86CPU *cpu, int level)
> >  return 0;
> >  
> >  /*
> > + * If the cpu option 'load-tsc-freq' is present, the vcpu's TSC rate 
> > in the
> > + * migrated state will be used and the overrides the user-specified 
> > vcpu's
> > + * TSC rate (if any).
> > + */
> > +if (runstate_check(RUN_STATE_INMIGRATE) &&
> > +env->load_tsc_khz && env->tsc_khz_incoming) {
> > +env->tsc_khz = env->tsc_khz_incoming;
> > +}
> 
> Please don't make the results of the function depend on global QEMU
> runstate, as it makes it harder to reason about it, and easy to
> introduce subtle bugs if we change initialization order. Can't we just
> ensure tsc_khz gets set to the right value before the function is
> called, inside the code that loads migration data?
>

I can make kvm_setup_tsc_khz() called in
do_kvm_cpu_synchronize_post_init() and also make empty stubs for other
targets.

> > +
> > +r = kvm_check_extension(cs->kvm_state, KVM_CAP_TSC_CONTROL);
> > +if (r && env->tsc_khz) {
> > +r = kvm_vcpu_ioctl(cs, KVM_SET_TSC_KHZ, env->tsc_khz);
> > +if (r < 0) {
> > +fprintf(stderr, "KVM_SET_TSC_KHZ failed\n");
> > +return r;
> > +}
> > +}
> 
> So, the final result here does not depend on the configuration, but also
> on host capabilities. That means nobody can possibly know if the
> tsc-freq option really works, until they enable it, run the VM, and
> check the results from inside the VM. Not a good idea.
> 
> (This doesn't apply just to the new code, the existing code is already
> broken this way.)
>

Really should abort QEMU here for both tsc-freq and migration.

> -- 
> Eduardo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 6/6] vfio: platform: move get/put reset at open/release

2015-10-22 Thread Eric Auger
On 10/22/2015 02:05 PM, Arnd Bergmann wrote:
> On Thursday 22 October 2015 13:40:16 Eric Auger wrote:
>> On 10/22/2015 12:29 PM, Arnd Bergmann wrote:
>>> On Thursday 22 October 2015 11:42:02 Eric Auger wrote:
 Currently reset lookup is done on probe. This introduces a
 race with new registration mechanism in the case where the
 vfio-platform driver is bound to the device before its module
 is loaded: on the load, the probe happens which triggers the
 reset module load which itself attempts to get the symbol for
 the registration function (vfio_platform_register_reset). The
 symbol is not yet available hence the lookup fails. In case we
 do the lookup in the first open we are sure the vfio-platform
 module is loaded and vfio_platform_register_reset is available.

 Signed-off-by: Eric Auger 
>>>
>>> I don't understand the explanation. I would expect the request_module()
>>> call to block until the module is actually loaded. Is this not
>>> what happens?
>>
>> Again many thanks for this new review.
>>
>> My understanding is the following
>> 1) vfio-platform is attached to the device through the override mechanism
>> 2) vfio-platform get loaded through modprobe:
>> 2-1) the platform driver init function eventually calls the
>> vfio-platform probe function.
>> 2-2) request_module of vfio-platform-calxedaxgmac gets called.
>> 3) The init of  vfio-platform-calxedaxgmac looks for
>> vfio_platform_register_reset. Unfortunately at that stage the init of
>> vfio-platform is not completed so the symbol is not available
>> 3-1) in case symbol_get is used in vfio_platform_calxedaxgmac init, as
>> of today, this latter simply returns -EINVAL. Reset registration failed
>> but no stall.
>> 3-2) in case symbol_get is *not* used, I think the module loader
>> attempts to load vfio-platform, which is already under load and this
>> causes a stall.
>>
>> Let me know if you think this understanding is not correct.
> 
> I was expecting the vfio_platform_register_reset() function to be
> available by the time of step 2-2.
> 
> What I think is going on here is that we are still inside of the
> module_init() function of the vfio-platform driver at the time that
> we call the request_module(), and the symbols are not made visible
> to other modules until the module_init function returns with success.
> This is a side-effect of the probe function for the platform driver
> getting called from deep inside of the platform_driver_register()
> function for all devices that are already present.
yes we share the same understanding, this is what I meant.
> 
> I think it already works for the AMBA case, which uses separate modules,
> but we need to restructure the platform_device case slightly to do
> the same:

OK I will test the new module structure below and eventually remove the
symbol_get (I got the link issue). Thanks for the hint!

Eric
> 
> diff --git a/drivers/vfio/platform/Makefile b/drivers/vfio/platform/Makefile
> index 9ce8afe28450..a00a44814255 100644
> --- a/drivers/vfio/platform/Makefile
> +++ b/drivers/vfio/platform/Makefile
> @@ -1,10 +1,12 @@
>  
> -vfio-platform-y := vfio_platform.o vfio_platform_common.o vfio_platform_irq.o
> +vfio-platform-base-y := vfio_platform_common.o vfio_platform_irq.o
>  
> -obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform.o
> +obj-$(CONFIG_VFIO_PLATFORM) += vfio_platform.o
> +obj-$(CONFIG_VFIO_PLATFORM) += vfio-platform-base.o
>  obj-$(CONFIG_VFIO_PLATFORM) += reset/
>  
>  vfio-amba-y := vfio_amba.o
>  
>  obj-$(CONFIG_VFIO_AMBA) += vfio-amba.o
> +obj-$(CONFIG_VFIO_AMBA) += vfio-platform-base.o
>  obj-$(CONFIG_VFIO_AMBA) += reset/
> 
> 
>   Arnd
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC Patch 11/12] IXGBEVF: Migrate VF statistic data

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 12:37:43AM +0800, Lan Tianyu wrote:
> VF statistic regs are read-only and can't be migrated via writing back
> directly.
> 
> Currently, statistic data returned to user space by the driver is not equal
> to value of statistic regs. VF driver records value of statistic regs as base 
> data
> when net interface is up or open, calculate increased count of regs during
> last period of online service and added it to saved_reset data. When user
> space collects statistic data, VF driver returns result of
> "current - base + saved_reset". "Current" is reg value at that point.
> 
> Restoring net function after migration just likes net interface is up or open.
> Call existed function to update base and saved_reset data to keep statistic
> data continual during migration.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 04b6ce7..d22160f 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -3005,6 +3005,7 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
>   return 0;
>  
>   del_timer_sync(>service_timer);
> + ixgbevf_update_stats(adapter);
>   pr_info("migration start\n");
>   migration_status = MIGRATION_IN_PROGRESS; 
>  

So far, it seems that the only two things done when
starting migration are very small.

It doesn't seem worth it to let guests defer migration for things like
this.  Surely, cancelling a timer can be done later after VM is
migrated?



> @@ -3017,6 +3018,8 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
>   return 1;
>  
>   ixgbevf_restore_state(adapter);
> + ixgbevf_save_reset_stats(adapter);
> + ixgbevf_init_last_counter_stats(adapter);
>   migration_status = MIGRATION_COMPLETED;
>   pr_info("migration end\n");
>   return 0;
> -- 
> 1.8.4.rc0.1.g8f6a3e5.dirty
> 
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 12:37:32AM +0800, Lan Tianyu wrote:
> This patchset is to propose a new solution to add live migration support for 
> 82599
> SRIOV network card.
> 
> Im our solution, we prefer to put all device specific operation into VF and
> PF driver and make code in the Qemu more general.

Adding code to VF driver makes sense.  However, adding code to PF driver
is problematic: PF and VF run within different environments, you can't
assume PF and VF drivers are the same version.

I guess that would be acceptable if these messages make
it into the official intel spec, along with
hardware registers.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC Patch 06/12] IXGBEVF: Add self emulation layer

2015-10-22 Thread Michael S. Tsirkin
On Wed, Oct 21, 2015 at 01:58:19PM -0700, Alexander Duyck wrote:
> On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> >In order to restore VF function after migration, add self emulation layer
> >to record regs' values during accessing regs.
> >
> >Signed-off-by: Lan Tianyu 
> >---
> >  drivers/net/ethernet/intel/ixgbevf/Makefile|  3 ++-
> >  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
> >  .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 
> > ++
> >  drivers/net/ethernet/intel/ixgbevf/vf.h|  5 -
> >  4 files changed, 33 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> >
> >diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile 
> >b/drivers/net/ethernet/intel/ixgbevf/Makefile
> >index 4ce4c97..841c884 100644
> >--- a/drivers/net/ethernet/intel/ixgbevf/Makefile
> >+++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
> >@@ -31,7 +31,8 @@
> >  obj-$(CONFIG_IXGBEVF) += ixgbevf.o
> >-ixgbevf-objs := vf.o \
> >+ixgbevf-objs := self-emulation.o \
> >+vf.o \
> >  mbx.o \
> >  ethtool.o \
> >  ixgbevf_main.o
> >diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
> >b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> >index a16d267..4446916 100644
> >--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> >+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> >@@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
> > if (IXGBE_REMOVED(reg_addr))
> > return IXGBE_FAILED_READ_REG;
> >-value = readl(reg_addr + reg);
> >+value = ixgbe_self_emul_readl(reg_addr, reg);
> > if (unlikely(value == IXGBE_FAILED_READ_REG))
> > ixgbevf_check_remove(hw, reg);
> > return value;
> >diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c 
> >b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> >new file mode 100644
> >index 000..d74b2da
> >--- /dev/null
> >+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
> >@@ -0,0 +1,26 @@
> >+#include 
> >+#include 
> >+#include 
> >+#include 
> >+#include 
> >+
> >+#include "vf.h"
> >+#include "ixgbevf.h"
> >+
> >+static u32 hw_regs[0x4000];
> >+
> >+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
> >+{
> >+u32 tmp;
> >+
> >+tmp = readl(base + addr);
> >+hw_regs[(unsigned long)addr] = tmp;
> >+
> >+return tmp;
> >+}
> >+
> >+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
> >+{
> >+hw_regs[(unsigned long)addr] = val;
> >+writel(val, (volatile void __iomem *)(base + addr));
> >+}
> 
> So I see what you are doing, however I don't think this adds much value.
> Many of the key registers for the device are not simple Read/Write
> registers.  Most of them are things like write 1 to clear or some other sort
> of value where writing doesn't set the bit but has some other side effect.
> Just take a look through the Datasheet at registers such as the VFCTRL,
> VFMAILBOX, or most of the interrupt registers.  The fact is simply storing
> the values off doesn't give you any real idea of what the state of things
> are.

It doesn't, but I guess the point is to isolate the migration-related logic
in the recovery code.

An alternative would be to have some smart logic all over the place to
only store what's required - that would be much more intrusive.


> >diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h 
> >b/drivers/net/ethernet/intel/ixgbevf/vf.h
> >index d40f036..6a3f4eb 100644
> >--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
> >+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
> >@@ -39,6 +39,9 @@
> >  struct ixgbe_hw;
> >+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr);
> >+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  
> >addr);
> >+
> >  /* iterator type for walking multicast address lists */
> >  typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr,
> >   u32 *vmdq);
> >@@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, 
> >u32 reg, u32 value)
> > if (IXGBE_REMOVED(reg_addr))
> > return;
> >-writel(value, reg_addr + reg);
> >+ixgbe_self_emul_writel(value, reg_addr, reg);
> >  }
> >  #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v)
> 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-22 Thread Alex Williamson
On Thu, 2015-10-22 at 15:32 +0300, Michael S. Tsirkin wrote:
> On Wed, Oct 21, 2015 at 01:20:27PM -0600, Alex Williamson wrote:
> > The trouble here is that the VF needs to be unplugged prior to the start
> > of migration because we can't do effective dirty page tracking while the
> > device is connected and doing DMA.
> 
> That's exactly what patch 12/12 is trying to accomplish.
> 
> I do see some problems with it, but I also suggested some solutions.

I was replying to:

> So... what would you expect service down wise for the following
> solution which is zero touch and I think should work for any VF
> driver:

And then later note:

"Here it's done via an enlightened guest driver."

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 07:01:01AM -0600, Alex Williamson wrote:
> On Thu, 2015-10-22 at 15:32 +0300, Michael S. Tsirkin wrote:
> > On Wed, Oct 21, 2015 at 01:20:27PM -0600, Alex Williamson wrote:
> > > The trouble here is that the VF needs to be unplugged prior to the start
> > > of migration because we can't do effective dirty page tracking while the
> > > device is connected and doing DMA.
> > 
> > That's exactly what patch 12/12 is trying to accomplish.
> > 
> > I do see some problems with it, but I also suggested some solutions.
> 
> I was replying to:
> 
> > So... what would you expect service down wise for the following
> > solution which is zero touch and I think should work for any VF
> > driver:
> 
> And then later note:
> 
> "Here it's done via an enlightened guest driver."

Oh, I misunderstood your intent. Sorry about that.

So we are actually in agreement between us then. That's nice.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Peter Zijlstra
On Thu, Oct 22, 2015 at 03:20:15PM +0200, Andrea Arcangeli wrote:

> If schedule spontaneously wakes up a task in TASK_KILLABLE state that
> would be a bug in the scheduler in my view. Luckily there doesn't seem
> to be such a bug, or at least we never experienced it.

Well, there will be a wakeup, just not the one you were hoping for.

We have code that does:

@cond = true;
get_task_struct(p);
queue(p)

/* random wait somewhere */
for (;;) {
prepare_to_wait();
if (@cond)
  break;

...

handle_userfault()
  ...
  schedule();
...

dequeue(p)
wake_up_process(p) ---> wakeup without userfault wakeup


These races are (extremely) rare, but they do exist. Therefore one must
never assume schedule() will not spuriously wake because of these
things.

Also, see:

lkml.kernel.org/r/CA+55aFwHkOo+YGWKYROmce1-H_uG3KfEUmCkJUerTj=ojy2...@mail.gmail.com

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 3/6] vfio: platform: reset: calxedaxgmac: add reset function registration

2015-10-22 Thread Eric Auger
On 10/22/2015 02:09 PM, Arnd Bergmann wrote:
> On Thursday 22 October 2015 13:54:42 Eric Auger wrote:
>> On 10/22/2015 12:13 PM, Arnd Bergmann wrote:
>>> On Thursday 22 October 2015 11:41:59 Eric Auger wrote:
 This patch adds the reset function registration/unregistration.

 Signed-off-by: Eric Auger 
>>>
>>> Looks good, except one thing:
 @@ -70,6 +69,8 @@ int vfio_platform_calxedaxgmac_reset(struct 
 vfio_platform_device *vdev)
  return -ENOMEM;
  }
  
 +pr_info("VFIO reset of %s\n", vdev->name);
 +
  /* disable IRQ */
  writel(0, reg.ioaddr + XGMAC_DMA_INTR_ENA);
  
>>>
>>> This probably slipped in from debugging, please remove it.
>> Well actually this is not an oversight but some unappropriate tracing
>> attempt I am afraid. I wanted to add a trace useful for the end-user to
>> make sure the VFIO reset function was called. Do you forbid that or do
>> recommend to use another tracing mechanism/level? In the past I tried
>> dynamic tracing but with module loading mechanism I found it not that handy.
> 
> If you think it's useful to have in the long run, it should be a separate
> patch with a description what it's good for, rather than a side-effect of
> an unrelated patch. It just looked to me like it's something you do
> while debugging the reset code, rather than while using it.
> 
> Implementation-wise, you should use dev_info() instead of pr_info() where
> possible, and it probably makes sense to put this into the vfio_platform
> driver before calling the reset function, for consistency between the
> drivers.
OK thanks

Best Regards

Eric
> 
>   Arnd
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-22 Thread Michael S. Tsirkin
On Wed, Oct 21, 2015 at 01:20:27PM -0600, Alex Williamson wrote:
> The trouble here is that the VF needs to be unplugged prior to the start
> of migration because we can't do effective dirty page tracking while the
> device is connected and doing DMA.

That's exactly what patch 12/12 is trying to accomplish.

I do see some problems with it, but I also suggested some solutions.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 12:37:41AM +0800, Lan Tianyu wrote:
> To let VF driver in the guest to know migration status, Qemu will
> fake PCI configure reg 0xF0 and 0xF1 to show migrate status and
> get ack from VF driver.

I guess this works for current devices but not using
0xF0/0xF1 registers is not architectural, is it?

So it could conflict with future devices.

Maybe it's better to just have a dedicated para-virtualized
device (PCI,ACPI,etc) for this migration-related activity.
This driver would then register with this it.


> When migration starts, Qemu will set reg "0xF0" to 1, notify
> VF driver via triggering mail box msg and wait for VF driver to tell
> it's ready for migration(set reg "0xF1" to 1).

This waiting for driver is problematic: high load is one of the reasons
people migrate VMs out.  It would be much better if we could support
migration while VM is completely stopped.


> After migration, Qemu
> will set reg "0xF0" to 0 and notify VF driver by mail box irq. VF
> driver begins to restore tx/rx function after detecting sttatus change.
> 
> When VF receives mail box irq, it will check reg "0xF0" in the service
> task function to get migration status and performs related operations
> according its value.
> 
> Steps of restarting receive and transmit function
> 1) Restore VF status in the PF driver via sending mail event to PF driver
> 2) Write back reg values recorded by self emulation layer
> 3) Restart rx/tx ring
> 4) Recovery interrupt
> 
> Transmit/Receive descriptor head regs are read-only and can't
> be restored via writing back recording reg value directly and they
> are set to 0 during VF reset. To reuse original tx/rx rings, shift
> desc ring in order to move the desc pointed by original head reg to
> first entry of the ring and then enable tx/rx rings. VF restarts to
> receive and transmit from original head desc.
> 
> Signed-off-by: Lan Tianyu 
> ---
>  drivers/net/ethernet/intel/ixgbevf/defines.h   |   6 ++
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h   |   7 +-
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 115 
> -
>  .../net/ethernet/intel/ixgbevf/self-emulation.c| 107 +++
>  4 files changed, 232 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h 
> b/drivers/net/ethernet/intel/ixgbevf/defines.h
> index 770e21a..113efd2 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/defines.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
> @@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc {
>   __le32 mss_l4len_idx;
>  };
>  
> +union ixgbevf_desc {
> + union ixgbe_adv_tx_desc rx_desc;
> + union ixgbe_adv_rx_desc tx_desc;
> + struct ixgbe_adv_tx_context_desc tx_context_desc;
> +};
> +
>  /* Adv Transmit Descriptor Config Masks */
>  #define IXGBE_ADVTXD_DTYP_MASK   0x00F0 /* DTYP mask */
>  #define IXGBE_ADVTXD_DTYP_CTXT   0x0020 /* Advanced Context Desc */
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index c823616..6eab402e 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -109,7 +109,7 @@ struct ixgbevf_ring {
>   struct ixgbevf_ring *next;
>   struct net_device *netdev;
>   struct device *dev;
> - void *desc; /* descriptor ring memory */
> + union ixgbevf_desc *desc;   /* descriptor ring memory */
>   dma_addr_t dma; /* phys. address of descriptor ring */
>   unsigned int size;  /* length in bytes */
>   u16 count;  /* amount of descriptors */
> @@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector 
> *q_vector);
>  
>  void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
>  void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter);
> +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head);
> +int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head);
> +void ixgbevf_restore_state(struct ixgbevf_adapter *adapter);
> +inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter);
> +
>  
>  #ifdef DEBUG
>  char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw);
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 056841c..15ec361 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function 
> Network Driver");
>  MODULE_LICENSE("GPL");
>  MODULE_VERSION(DRV_VERSION);
>  
> +
> +#define MIGRATION_COMPLETED   0x00
> +#define MIGRATION_IN_PROGRESS 0x01
> +
>  #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
>  static int debug = -1;
>  module_param(debug, int, 0);
> @@ -221,6 +225,78 @@ static u64 

Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 12:37:40AM +0800, Lan Tianyu wrote:
> When transmit a package, the end transmit desc of package
> indicates whether package is sent already. Current code records
> the end desc's pointer in the next_to_watch of struct tx buffer.
> This code will be broken if shifting desc ring after migration.
> The pointer will be invalid. This patch is to replace recording
> pointer with recording the desc number of the package and find
> the end decs via the first desc and desc number.
> 
> Signed-off-by: Lan Tianyu 

Do you really need to play the shifting games?
Can't you just reset everything and re-initialize the rings?
It's slower but way less intrusive.
Also removes the need to track writes into rings.

> ---
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  1 +
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 ---
>  2 files changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> index 775d089..c823616 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
> @@ -54,6 +54,7 @@
>   */
>  struct ixgbevf_tx_buffer {
>   union ixgbe_adv_tx_desc *next_to_watch;
> + u16 desc_num;
>   unsigned long time_stamp;
>   struct sk_buff *skb;
>   unsigned int bytecount;
> diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
> b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> index 4446916..056841c 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
> @@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct 
> ixgbevf_ring *tx_ring,
>  DMA_TO_DEVICE);
>   }
>   tx_buffer->next_to_watch = NULL;
> + tx_buffer->desc_num = 0;
>   tx_buffer->skb = NULL;
>   dma_unmap_len_set(tx_buffer, len, 0);
>   /* tx_buffer must be completely set up in the transmit path */
> @@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
> *q_vector,
>   union ixgbe_adv_tx_desc *tx_desc;
>   unsigned int total_bytes = 0, total_packets = 0;
>   unsigned int budget = tx_ring->count / 2;
> - unsigned int i = tx_ring->next_to_clean;
> + int i, watch_index;
>  
>   if (test_bit(__IXGBEVF_DOWN, >state))
>   return true;
> @@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
> *q_vector,
>   i -= tx_ring->count;
>  
>   do {
> - union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
> + union ixgbe_adv_tx_desc *eop_desc;
> +
> + if (!tx_buffer->desc_num)
> + break;
> +
> + if (i + tx_buffer->desc_num >= 0)
> + watch_index = i + tx_buffer->desc_num;
> + else
> + watch_index = i + tx_ring->count + tx_buffer->desc_num;
>  
> - /* if next_to_watch is not set then there is no work pending */
> + eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index);
>   if (!eop_desc)
>   break;
>  
> @@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
> *q_vector,
>  
>   /* clear next_to_watch to prevent false hangs */
>   tx_buffer->next_to_watch = NULL;
> + tx_buffer->desc_num = 0;
>  
>   /* update the statistics for this packet */
>   total_bytes += tx_buffer->bytecount;
> @@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
>   u32 tx_flags = first->tx_flags;
>   __le32 cmd_type;
>   u16 i = tx_ring->next_to_use;
> + u16 start;
>  
>   tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
>  
> @@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
>  
>   /* set next_to_watch value indicating a packet is present */
>   first->next_to_watch = tx_desc;
> + start = first - tx_ring->tx_buffer_info;
> + first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - 
> start;
>  
>   i++;
>   if (i == tx_ring->count)
> -- 
> 1.8.4.rc0.1.g8f6a3e5.dirty
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Andrea Arcangeli
On Thu, Oct 22, 2015 at 02:10:56PM +0200, Peter Zijlstra wrote:
> On Thu, May 14, 2015 at 07:31:11PM +0200, Andrea Arcangeli wrote:
> > @@ -255,21 +259,23 @@ int handle_userfault(struct vm_area_struct *vma, 
> > unsigned long address,
> >  * through poll/read().
> >  */
> > __add_wait_queue(>fault_wqh, );
> > -   for (;;) {
> > -   set_current_state(TASK_KILLABLE);
> > -   if (!uwq.pending || ACCESS_ONCE(ctx->released) ||
> > -   fatal_signal_pending(current))
> > -   break;
> > -   spin_unlock(>fault_wqh.lock);
> > +   set_current_state(TASK_KILLABLE);
> > +   spin_unlock(>fault_wqh.lock);
> >  
> > +   if (likely(!ACCESS_ONCE(ctx->released) &&
> > +  !fatal_signal_pending(current))) {
> > wake_up_poll(>fd_wqh, POLLIN);
> > schedule();
> > +   ret |= VM_FAULT_MAJOR;
> > +   }
> 
> So what happens here if schedule() spontaneously wakes for no reason?
> 
> I'm not sure enough of userfaultfd semantics to say if that would be
> bad, but the code looks suspiciously like it relies on schedule() not to
> do that; which is wrong.

That would repeat the fault and trigger the DEBUG_VM printk above,
complaining that FAULT_FLAG_ALLOW_RETRY is not set. It is only a
problem for kernel faults (copy_user/get_user_pages*). Userland won't
error out in such a way because userland would return 0 and not
VM_FAULT_RETRY. So it's only required when the task schedule in
TASK_KILLABLE state.

If schedule spontaneously wakes up a task in TASK_KILLABLE state that
would be a bug in the scheduler in my view. Luckily there doesn't seem
to be such a bug, or at least we never experienced it.

Overall this dependency on the scheduler will be lifted soon, as it
must be lifted in order to track the write protect faults, so longer
term this is not a concern and this is not a design issue, but this
remains an implementation detail that avoided to change the arch code
and gup.

If you send a reproducer to show how the current scheduler can wake up
the task in TASK_KILLABLE despite not receiving a wakeup, that would
help too as we never experienced that.

The reason this dependency will be lifted soon is that the userfaultfd
write protect tracking may be armed at any time while the app is
running so we may already be in the middle of a page fault that
returned VM_FAULT_RETRY by the time we arm the write protect
tracking. So longer term the arch page fault and
__get_user_pages_locked must allow handle_userfault() to return
VM_FAULT_RETRY even if FAULT_FLAG_TRIED is set. Then we don't care if
the task is waken.

Waking a task in TASK_KILLABLE state it will still waste CPU so the
scheduler still shouldn't do that. All load balancing works better if
the task isn't running anyway so I can't imagine a good reason for
wanting to run a task in TASK_KILLABLE state before it gets the
wakeup.

Trying to predict that a wakeup is always happening in less time than
it takes to schedule the task out of the CPU, sounds a very CPU
intensive thing to measure and it's probably better to leave those
heuristics in the caller like by spinning on a lock for a while before
blocking.

Thanks,
Andrea
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 6/6] vfio: platform: move get/put reset at open/release

2015-10-22 Thread Eric Auger
On 10/22/2015 12:29 PM, Arnd Bergmann wrote:
> On Thursday 22 October 2015 11:42:02 Eric Auger wrote:
>> Currently reset lookup is done on probe. This introduces a
>> race with new registration mechanism in the case where the
>> vfio-platform driver is bound to the device before its module
>> is loaded: on the load, the probe happens which triggers the
>> reset module load which itself attempts to get the symbol for
>> the registration function (vfio_platform_register_reset). The
>> symbol is not yet available hence the lookup fails. In case we
>> do the lookup in the first open we are sure the vfio-platform
>> module is loaded and vfio_platform_register_reset is available.
>>
>> Signed-off-by: Eric Auger 
> 
> I don't understand the explanation. I would expect the request_module()
> call to block until the module is actually loaded. Is this not
> what happens?
> 
>> mutex_unlock(_lock);
>> @@ -181,6 +182,8 @@ static int vfio_platform_open(void *device_data)
>> if (ret)
>> goto err_irq;
>>  
>> +   vfio_platform_get_reset(vdev);
>> +
>> if (vdev->reset)
>> vdev->reset(vdev);
>>
> 
> This needs some error handling to ensure that the open() fails
> if there is no reset handler.

Is that really what we want? The code was meant to allow the use case
where the VFIO platform driver would be used without such reset module.

I think the imperious need for a reset module depends on the device and
more importantly depends on the IOMMU mapping. With QEMU VFIO
integration this is needed because the whole VM memory is IOMMU mapped
but in a simpler user-space driver context, we might live without.

Any thought?

Eric
> 
>   Arnd
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next RFC 2/2] vhost_net: basic polling support

2015-10-22 Thread Rick Jones

On 10/22/2015 02:33 AM, Michael S. Tsirkin wrote:

On Thu, Oct 22, 2015 at 01:27:29AM -0400, Jason Wang wrote:

This patch tries to poll for new added tx buffer for a while at the
end of tx processing. The maximum time spent on polling were limited
through a module parameter. To avoid block rx, the loop will end it
there's new other works queued on vhost so in fact socket receive
queue is also be polled.

busyloop_timeout = 50 gives us following improvement on TCP_RR test:

size/session/+thu%/+normalize%
 1/ 1/   +5%/  -20%
 1/50/  +17%/   +3%


Is there a measureable increase in cpu utilization
with busyloop_timeout = 0?


And since a netperf TCP_RR test is involved, be careful about what 
netperf reports for CPU util if that increase isn't in the context of 
the guest OS.


For completeness, looking at the effect on TCP_STREAM and TCP_MAERTS, 
aggregate _RR and even aggregate _RR/packets per second for many VMs on 
the same system would be in order.


happy benchmarking,

rick jones

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 6/6] vfio: platform: move get/put reset at open/release

2015-10-22 Thread Alex Williamson
On Thu, 2015-10-22 at 16:23 +0200, Eric Auger wrote:
> On 10/22/2015 04:10 PM, Arnd Bergmann wrote:
> > On Thursday 22 October 2015 15:26:55 Eric Auger wrote:
>  @@ -181,6 +182,8 @@ static int vfio_platform_open(void *device_data)
>  if (ret)
>  goto err_irq;
>   
>  +   vfio_platform_get_reset(vdev);
>  +
>  if (vdev->reset)
>  vdev->reset(vdev);
> 
> >>>
> >>> This needs some error handling to ensure that the open() fails
> >>> if there is no reset handler.
> >>
> >> Is that really what we want? The code was meant to allow the use case
> >> where the VFIO platform driver would be used without such reset module.
> >>
> >> I think the imperious need for a reset module depends on the device and
> >> more importantly depends on the IOMMU mapping. With QEMU VFIO
> >> integration this is needed because the whole VM memory is IOMMU mapped
> >> but in a simpler user-space driver context, we might live without.
> >>
> >> Any thought?
> > 
> > I would think we need a reset driver for any device that can start DMA,
> > otherwise things can go wrong as soon as you attach it to a different domain
> > while there is ongoing DMA.
> > 
> > Maybe we could just allow devices to be attached without a reset handler,
> > but then disallow DMA on them?
> 
> Well I am tempted to think that most assigned devices will perform DMA
> accesses so to me this somehow comes to the same result, ie disallowing
> functional passthrough for devices not properly/fully integrated.
> 
> Alex/Baptiste, any opinion on this?

We have an IOMMU and the user doesn't get access to the device until the
IOMMU domain is established.  So, ideally yes, we should have a way to
reset the device, but I don't see it as a requirement.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [RFC Patch 06/12] IXGBEVF: Add self emulation layer

2015-10-22 Thread Alexander Duyck

On 10/22/2015 05:50 AM, Michael S. Tsirkin wrote:

On Wed, Oct 21, 2015 at 01:58:19PM -0700, Alexander Duyck wrote:

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

In order to restore VF function after migration, add self emulation layer
to record regs' values during accessing regs.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbevf/Makefile|  3 ++-
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
  .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++
  drivers/net/ethernet/intel/ixgbevf/vf.h|  5 -
  4 files changed, 33 insertions(+), 3 deletions(-)
  create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile 
b/drivers/net/ethernet/intel/ixgbevf/Makefile
index 4ce4c97..841c884 100644
--- a/drivers/net/ethernet/intel/ixgbevf/Makefile
+++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
@@ -31,7 +31,8 @@
  obj-$(CONFIG_IXGBEVF) += ixgbevf.o
-ixgbevf-objs := vf.o \
+ixgbevf-objs := self-emulation.o \
+   vf.o \
  mbx.o \
  ethtool.o \
  ixgbevf_main.o
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..4446916 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
if (IXGBE_REMOVED(reg_addr))
return IXGBE_FAILED_READ_REG;
-   value = readl(reg_addr + reg);
+   value = ixgbe_self_emul_readl(reg_addr, reg);
if (unlikely(value == IXGBE_FAILED_READ_REG))
ixgbevf_check_remove(hw, reg);
return value;
diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c 
b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
new file mode 100644
index 000..d74b2da
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
@@ -0,0 +1,26 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vf.h"
+#include "ixgbevf.h"
+
+static u32 hw_regs[0x4000];
+
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
+{
+   u32 tmp;
+
+   tmp = readl(base + addr);
+   hw_regs[(unsigned long)addr] = tmp;
+
+   return tmp;
+}
+
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
+{
+   hw_regs[(unsigned long)addr] = val;
+   writel(val, (volatile void __iomem *)(base + addr));
+}

So I see what you are doing, however I don't think this adds much value.
Many of the key registers for the device are not simple Read/Write
registers.  Most of them are things like write 1 to clear or some other sort
of value where writing doesn't set the bit but has some other side effect.
Just take a look through the Datasheet at registers such as the VFCTRL,
VFMAILBOX, or most of the interrupt registers.  The fact is simply storing
the values off doesn't give you any real idea of what the state of things
are.

It doesn't, but I guess the point is to isolate the migration-related logic
in the recovery code.

An alternative would be to have some smart logic all over the place to
only store what's required - that would be much more intrusive.


After reviewing all of the patches yesterday I would say that almost all 
the values being stored aren't needed.  They can be restored from the 
settings of the driver itself anyway.  Copying the values out don't make 
much sense here since there are already enough caches for almost all of 
this data.


- Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Peter Zijlstra
On Thu, Oct 22, 2015 at 04:18:31PM +0200, Andrea Arcangeli wrote:

> The risk of memory corruption is still zero no matter what happens
> here, in the extremely rare case the app will get a SIGBUS or a

That might still upset people, SIGBUS isn't something an app can really
recover from.

> I'm not exactly sure why we allow VM_FAULT_RETRY only once currently
> so I'm tempted to drop FAULT_FLAG_TRIED entirely.

I think to ensure we make forward progress.

> I've no real preference on how to tweak the page fault code to be able
> to return VM_FAULT_RETRY indefinitely and I would aim for the smallest
> change possible, so if you've suggestions now it's good time.

Indefinitely is such a long time, we should try and finish
computation before the computer dies etc. :-)

Yes, yes.. I know, extremely unlikely etc. Still guarantees are good.


In any case, I'm not really too bothered how you fix it, just figured
I'd let you know.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 14/23] userfaultfd: wake pending userfaults

2015-10-22 Thread Andrea Arcangeli
On Thu, Oct 22, 2015 at 05:15:09PM +0200, Peter Zijlstra wrote:
> Indefinitely is such a long time, we should try and finish
> computation before the computer dies etc. :-)

Indefinitely as read_seqcount_retry, eventually it makes progress.

Even returning 0 from the page fault can trigger it again
indefinitely, so VM_FAULT_RETRY isn't fundamentally different from
returning 0 and retrying the page fault again later. So it's not clear
why VM_FAULT_RETRY can only try once more.

FAULT_FLAG_TRIED as a message to the VM so it starts to do heavy
locking and block more aggressively is actually useful as such, but it
shouldn't be a replacement of FAULT_FLAG_ALLOW_RETRY. What I meant
with removing FAULT_FLAG_TRIED is really about converting it to an
hint, but not controlling if the page fault can keep retrying
in-kernel.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v3 08/16] KVM: arm64: handle ITS related GICv3 redistributor registers

2015-10-22 Thread Pavel Fedin
 Hello!

 During my work on live migration i found a big bug in your implementation.

> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf 
> Of Andre Przywara
> Sent: Wednesday, October 07, 2015 5:55 PM
> To: marc.zyng...@arm.com; christoffer.d...@linaro.org
> Cc: eric.au...@linaro.org; p.fe...@samsung.com; kvm...@lists.cs.columbia.edu; 
> linux-arm-
> ker...@lists.infradead.org; kvm@vger.kernel.org
> Subject: [PATCH v3 08/16] KVM: arm64: handle ITS related GICv3 redistributor 
> registers
> 
> In the GICv3 redistributor there are the PENDBASER and PROPBASER
> registers which we did not emulate so far, as they only make sense
> when having an ITS. In preparation for that emulate those MMIO
> accesses by storing the 64-bit data written into it into a variable
> which we later read in the ITS emulation.
> 
> Signed-off-by: Andre Przywara 
> ---
> Changelog v2..v3:
> - rename vgic_handle_base_register to vgic_reg64_access()
> 
>  include/kvm/arm_vgic.h  |  8 
>  virt/kvm/arm/vgic-v3-emul.c | 44 
>  virt/kvm/arm/vgic.c | 31 +++
>  virt/kvm/arm/vgic.h |  2 ++
>  4 files changed, 85 insertions(+)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 067ad09..06c33bc 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -272,6 +272,14 @@ struct vgic_dist {
>   /* Virtual irq to hwirq mapping */
>   spinlock_t  irq_phys_map_lock;
>   struct list_headirq_phys_map_list;
> +
> + /* Address of LPI configuration table shared by all redistributors */
> + u64 propbaser;
> +
> + /* Addresses of LPI pending tables per redistributor */
> + u64 *pendbaser;
> +
> + boollpis_enabled;
>  };
> 
>  struct vgic_v2_cpu_if {
> diff --git a/virt/kvm/arm/vgic-v3-emul.c b/virt/kvm/arm/vgic-v3-emul.c
> index a8cf669..6939f7c 100644
> --- a/virt/kvm/arm/vgic-v3-emul.c
> +++ b/virt/kvm/arm/vgic-v3-emul.c
> @@ -651,6 +651,38 @@ static bool handle_mmio_cfg_reg_redist(struct kvm_vcpu 
> *vcpu,
>   return vgic_handle_cfg_reg(reg, mmio, offset);
>  }
> 
> +/* We don't trigger any actions here, just store the register value */
> +static bool handle_mmio_propbaser_redist(struct kvm_vcpu *vcpu,
> +  struct kvm_exit_mmio *mmio,
> +  phys_addr_t offset)
> +{
> + struct vgic_dist *dist = >kvm->arch.vgic;
> + int mode = ACCESS_READ_VALUE;
> +
> + /* Storing a value with LPIs already enabled is undefined */
> + mode |= dist->lpis_enabled ? ACCESS_WRITE_IGNORED : ACCESS_WRITE_VALUE;
> + vgic_reg64_access(mmio, offset, >propbaser, mode);
> +
> + return false;
> +}
> +
> +/* We don't trigger any actions here, just store the register value */
> +static bool handle_mmio_pendbaser_redist(struct kvm_vcpu *vcpu,
> +  struct kvm_exit_mmio *mmio,
> +  phys_addr_t offset)
> +{
> + struct kvm_vcpu *rdvcpu = mmio->private;
> + struct vgic_dist *dist = >kvm->arch.vgic;
> + int mode = ACCESS_READ_VALUE;
> +
> + /* Storing a value with LPIs already enabled is undefined */
> + mode |= dist->lpis_enabled ? ACCESS_WRITE_IGNORED : ACCESS_WRITE_VALUE;

 Here you store lpis_enabled globally, and this is plain wrong.

 Linux kernel separately programs PENDBASER and enables LPIs on every CPU. 
Therefore, after CPU #0 is initialized (this happens much
earlier than everything else), dist->lpis_enabled is set to true, and 
subsequent PROPBASER writes, even for different
redistributors, will be ignored. As a result, you'll get dist->pendbaser[n] == 
NULL forever, where n > 0. And your
its_sync_lpi_pending_table() actually reads some garbage from physical address 
0 of the guest.

 Attempts to write data to that region silently corrupts random qemu data 
during migration, that's how i discovered it.

> + vgic_reg64_access(mmio, offset,
> +   >pendbaser[rdvcpu->vcpu_id], mode);
> +
> + return false;
> +}
> +
>  #define SGI_base(x) ((x) + SZ_64K)
> 
>  static const struct vgic_io_range vgic_redist_ranges[] = {
> @@ -679,6 +711,18 @@ static const struct vgic_io_range vgic_redist_ranges[] = 
> {
>   .handle_mmio= handle_mmio_raz_wi,
>   },
>   {
> + .base   = GICR_PENDBASER,
> + .len= 0x08,
> + .bits_per_irq   = 0,
> + .handle_mmio= handle_mmio_pendbaser_redist,
> + },
> + {
> + .base   = GICR_PROPBASER,
> + .len= 0x08,
> + .bits_per_irq   = 0,
> + .handle_mmio= handle_mmio_propbaser_redist,
> + },
> + {
>   .base   = GICR_IDREGS,
>  

Re: [PATCH v3 9/9] kvm/x86: Hyper-V kvm exit

2015-10-22 Thread Paolo Bonzini


On 22/10/2015 18:10, Andrey Smetanin wrote:
> A new vcpu exit is introduced to notify the userspace of the
> changes in Hyper-V SynIC configuration triggered by guest writing to the
> corresponding MSRs.
> 
> Changes v3:
> * added KVM_EXIT_HYPERV types and structs notes into docs

Thanks.  The changes look good.  I look forward to the unit tests so I
can merge it.

Paolo
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH v3 08/16] KVM: arm64: handle ITS related GICv3 redistributor registers

2015-10-22 Thread Pavel Fedin
 Hello!

 One more idea...

>  Here you store lpis_enabled globally, and this is plain wrong.

 By the way, may be we should move this flag, together with pendbaser array, 
into struct vgic_cpu? Then we would not have to
allocate them manually.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-22 Thread Or Gerlitz
On Wed, Oct 21, 2015 at 10:20 PM, Alex Williamson
 wrote:

> This is why the typical VF agnostic approach here is to using bonding
> and fail over to a emulated device during migration, so performance
> suffers, but downtime is something acceptable.

bonding in the VM isn't a zero touch solution, right? is it really acceptable?
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next RFC 2/2] vhost_net: basic polling support

2015-10-22 Thread Michael S. Tsirkin
On Thu, Oct 22, 2015 at 08:46:33AM -0700, Rick Jones wrote:
> On 10/22/2015 02:33 AM, Michael S. Tsirkin wrote:
> >On Thu, Oct 22, 2015 at 01:27:29AM -0400, Jason Wang wrote:
> >>This patch tries to poll for new added tx buffer for a while at the
> >>end of tx processing. The maximum time spent on polling were limited
> >>through a module parameter. To avoid block rx, the loop will end it
> >>there's new other works queued on vhost so in fact socket receive
> >>queue is also be polled.
> >>
> >>busyloop_timeout = 50 gives us following improvement on TCP_RR test:
> >>
> >>size/session/+thu%/+normalize%
> >> 1/ 1/   +5%/  -20%
> >> 1/50/  +17%/   +3%
> >
> >Is there a measureable increase in cpu utilization
> >with busyloop_timeout = 0?
> 
> And since a netperf TCP_RR test is involved, be careful about what netperf
> reports for CPU util if that increase isn't in the context of the guest OS.
> 
> For completeness, looking at the effect on TCP_STREAM and TCP_MAERTS,
> aggregate _RR and even aggregate _RR/packets per second for many VMs on the
> same system would be in order.
> 
> happy benchmarking,
> 
> rick jones

Absolutely, merging a new kernel API just for a specific
benchmark doesn't make sense.
I'm guessing this is just an early RFC, a fuller submission
will probably include more numbers.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] kvmtool/run: do not overwrite /virt/init

2015-10-22 Thread Oleg Nesterov
To me kvm_setup_guest_init() behaviour looks "obviously wrong" and
unfriendly because it always overwrites /virt/init.

kvm_setup_guest_init() is also called when we are going to use this
tree as a rootfs, and without another patch ("kvmtool/run: append
cfg.kernel_cmdline at the end of real_cmdline") the user can't use
"lkvm run -p init=my_init_path". This simply means that you can not
use a customized init unless you patch kvmtool.

Change extract_file() to do nothing if the file already exists. This
should not affect do_setup() which calls kvm_setup_guest_init() only
if make_dir(guestfs_name) creates the new/empty dir.

Signed-off-by: Oleg Nesterov 
---
 builtin-setup.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/builtin-setup.c b/builtin-setup.c
index 1e5b1e4..8be8d62 100644
--- a/builtin-setup.c
+++ b/builtin-setup.c
@@ -130,10 +130,14 @@ static int extract_file(const char *guestfs_name, const 
char *filename,
 
snprintf(path, PATH_MAX, "%s%s/%s", kvm__get_dir(),
guestfs_name, filename);
-   remove(path);
-   fd = open(path, O_CREAT | O_WRONLY, 0755);
-   if (fd < 0)
+
+   fd = open(path, O_EXCL | O_CREAT | O_WRONLY, 0755);
+   if (fd < 0) {
+   if (errno == EEXIST)
+   return 0;
die("Fail to setup %s", path);
+   }
+
ret = xwrite(fd, data, (size_t)_size);
if (ret < 0)
die("Fail to setup %s", path);
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/3] kvmtool: don't overwrite /virt/init

2015-10-22 Thread Oleg Nesterov
Hello,

On top of "[PATCH 0/3] kvmtool: tiny init fox x86_64" I sent.

I simply can't understand why kvmtool always overwrites /virt/init.
This doesn't look right, especially because you can't pass "init="
kernel argument without another patch "[PATCH] kvmtool/run: append
cfg.kernel_cmdline at the end of real_cmdline" I sent. And so far
it is not clear to me if you are going to accept that patch or not.

1/3 and 2/3 are off-topic cleanups.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] kvmtool/run: don't abuse "root=" parameter, don't pass "rw" to v9fs_mount()

2015-10-22 Thread Oleg Nesterov
1. kvm_cmd_run_init() appends "root=/dev/root" to real_cmdline if
   cfg.using_rootfs == T. This doesn't hurt but makes no sense and
   looks confusing.

   We do not need to initialiaze the kernel's saved_root_name[] and
   "/dev/root" means nothing to name_to_dev_t().

   We only need to pass this mount-tag to 9p but the kernel always
   uses dev_name="/dev/root" in mount_root() path, so we can safely
   remove this option from the command line.

2. "rw" in rootflags looks confusing too, it is silently ignored by
   v9fs_parse_options() and has no effect.

   We need to clear MS_RDONLY from root_mountflags, this is what the
   "standalone" kernel parameter correctly does.

Signed-off-by: Oleg Nesterov 
---
 builtin-run.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin-run.c b/builtin-run.c
index ab1fc2a..6e4491c 100644
--- a/builtin-run.c
+++ b/builtin-run.c
@@ -590,7 +590,7 @@ static struct kvm *kvm_cmd_run_init(int argc, const char 
**argv)
}
 
if (kvm->cfg.using_rootfs) {
-   strcat(real_cmdline, " root=/dev/root rw 
rootflags=rw,trans=virtio,version=9p2000.L rootfstype=9p");
+   strcat(real_cmdline, " rw 
rootflags=trans=virtio,version=9p2000.L rootfstype=9p");
if (kvm->cfg.custom_rootfs) {
kvm_run_set_sandbox(kvm);
 
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] kvmtool: add lkvm-static to gitignore

2015-10-22 Thread Oleg Nesterov
add lkvm-static to gitignore

Signed-off-by: Oleg Nesterov 
---
 .gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/.gitignore b/.gitignore
index 697a63f..a16a97f 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,4 +1,5 @@
 /lkvm
+/lkvm-static
 /vm
 *.o
 *.d
-- 
2.4.3

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/3] kvmtool: tiny init fox x86_64

2015-10-22 Thread Oleg Nesterov
Hi Pekka,

On 10/22, Pekka Enberg wrote:
>
> The series looks good to me:
>
> Acked-by: Pekka Enberg 

Thanks! I'll try to send some cleanups later.

But let me send another simple series first.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-22 Thread Alex Williamson
On Thu, 2015-10-22 at 18:58 +0300, Or Gerlitz wrote:
> On Wed, Oct 21, 2015 at 10:20 PM, Alex Williamson
>  wrote:
> 
> > This is why the typical VF agnostic approach here is to using bonding
> > and fail over to a emulated device during migration, so performance
> > suffers, but downtime is something acceptable.
> 
> bonding in the VM isn't a zero touch solution, right? is it really acceptable?

The bonding solution requires configuring the bond in the guest and
doing the hot unplug/re-plug around migration.  It's zero touch in that
it works on current code with any PF/VF, but it's certainly not zero
configuration in the guest.  Is what acceptable?  The configuration?
The performance?  The downtime?  I don't think we can hope to improve on
the downtime of an emulated device, but obviously the configuration and
performance are not always acceptable or we wouldn't be seeing so many
people working on migration of assigned devices.  Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 8/9] kvm/x86: Hyper-V synthetic interrupt controller

2015-10-22 Thread Andrey Smetanin
SynIC (synthetic interrupt controller) is a lapic extension,
which is controlled via MSRs and maintains for each vCPU
 - 16 synthetic interrupt "lines" (SINT's); each can be configured to
   trigger a specific interrupt vector optionally with auto-EOI
   semantics
 - a message page in the guest memory with 16 256-byte per-SINT message
   slots
 - an event flag page in the guest memory with 16 2048-bit per-SINT
   event flag areas

The host triggers a SINT whenever it delivers a new message to the
corresponding slot or flips an event flag bit in the corresponding area.
The guest informs the host that it can try delivering a message by
explicitly asserting EOI in lapic or writing to End-Of-Message (EOM)
MSR.

The userspace (qemu) triggers interrupts and receives EOM notifications
via irqfd with resampler; for that, a GSI is allocated for each
configured SINT, and irq_routing api is extended to support GSI-SINT
mapping.

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Vitaly Kuznetsov 
CC: "K. Y. Srinivasan" 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: Roman Kagan 

Changes v3:
* added KVM_CAP_HYPERV_SYNIC and KVM_IRQ_ROUTING_HV_SINT notes into
docs

Changes v2:
* do not use posted interrupts for Hyper-V SynIC AutoEOI vectors
* add Hyper-V SynIC vectors into EOI exit bitmap
* Hyper-V SyniIC SINT msr write logic simplified
---
 Documentation/virtual/kvm/api.txt |  14 ++
 arch/x86/include/asm/kvm_host.h   |  14 ++
 arch/x86/kvm/hyperv.c | 297 ++
 arch/x86/kvm/hyperv.h |  21 +++
 arch/x86/kvm/irq_comm.c   |  34 +
 arch/x86/kvm/lapic.c  |  18 ++-
 arch/x86/kvm/lapic.h  |   5 +
 arch/x86/kvm/x86.c|  12 +-
 include/linux/kvm_host.h  |   6 +
 include/uapi/linux/kvm.h  |   8 +
 10 files changed, 421 insertions(+), 8 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 092ee9f..8710418 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -1451,6 +1451,7 @@ struct kvm_irq_routing_entry {
struct kvm_irq_routing_irqchip irqchip;
struct kvm_irq_routing_msi msi;
struct kvm_irq_routing_s390_adapter adapter;
+   struct kvm_irq_routing_hv_sint hv_sint;
__u32 pad[8];
} u;
 };
@@ -1459,6 +1460,7 @@ struct kvm_irq_routing_entry {
 #define KVM_IRQ_ROUTING_IRQCHIP 1
 #define KVM_IRQ_ROUTING_MSI 2
 #define KVM_IRQ_ROUTING_S390_ADAPTER 3
+#define KVM_IRQ_ROUTING_HV_SINT 4
 
 No flags are specified so far, the corresponding field must be set to zero.
 
@@ -1482,6 +1484,10 @@ struct kvm_irq_routing_s390_adapter {
__u32 adapter_id;
 };
 
+struct kvm_irq_routing_hv_sint {
+   __u32 vcpu;
+   __u32 sint;
+};
 
 4.53 KVM_ASSIGN_SET_MSIX_NR (deprecated)
 
@@ -3685,3 +3691,11 @@ available, means that that the kernel has an 
implementation of the
 H_RANDOM hypercall backed by a hardware random-number generator.
 If present, the kernel H_RANDOM handler can be enabled for guest use
 with the KVM_CAP_PPC_ENABLE_HCALL capability.
+
+8.2 KVM_CAP_HYPERV_SYNIC
+
+Architectures: x86
+This capability, if KVM_CHECK_EXTENSION indicates that it is
+available, means that that the kernel has an implementation of the
+Hyper-V Synthetic interrupt controller(SynIC). SynIC is used to
+support Windows Hyper-V based guest paravirt drivers(VMBus).
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3c6327d..8434f88 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -374,10 +375,23 @@ struct kvm_mtrr {
struct list_head head;
 };
 
+/* Hyper-V synthetic interrupt controller (SynIC)*/
+struct kvm_vcpu_hv_synic {
+   u64 version;
+   u64 control;
+   u64 msg_page;
+   u64 evt_page;
+   atomic64_t sint[HV_SYNIC_SINT_COUNT];
+   atomic_t sint_to_gsi[HV_SYNIC_SINT_COUNT];
+   DECLARE_BITMAP(auto_eoi_bitmap, 256);
+   DECLARE_BITMAP(vec_bitmap, 256);
+};
+
 /* Hyper-V per vcpu emulation context */
 struct kvm_vcpu_hv {
u64 hv_vapic;
s64 runtime_offset;
+   struct kvm_vcpu_hv_synic synic;
 };
 
 struct kvm_vcpu_arch {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 62cf8c9..8ff71f3 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -23,13 +23,296 @@
 
 #include "x86.h"
 #include "lapic.h"
+#include "ioapic.h"
 #include "hyperv.h"
 
 #include 
+#include 
 #include 
 
 #include "trace.h"
 
+static inline u64 synic_read_sint(struct kvm_vcpu_hv_synic *synic, int sint)
+{
+   return atomic64_read(>sint[sint]);
+}
+

[PATCH v3 9/9] kvm/x86: Hyper-V kvm exit

2015-10-22 Thread Andrey Smetanin
A new vcpu exit is introduced to notify the userspace of the
changes in Hyper-V SynIC configuration triggered by guest writing to the
corresponding MSRs.

Changes v3:
* added KVM_EXIT_HYPERV types and structs notes into docs

Signed-off-by: Andrey Smetanin 
Reviewed-by: Roman Kagan 
Signed-off-by: Denis V. Lunev 
CC: Vitaly Kuznetsov 
CC: "K. Y. Srinivasan" 
CC: Gleb Natapov 
CC: Paolo Bonzini 
CC: Roman Kagan 

---
 Documentation/virtual/kvm/api.txt | 22 ++
 arch/x86/include/asm/kvm_host.h   |  1 +
 arch/x86/kvm/hyperv.c | 17 +
 arch/x86/kvm/x86.c|  6 ++
 include/linux/kvm_host.h  |  1 +
 include/uapi/linux/kvm.h  | 17 +
 6 files changed, 64 insertions(+)

diff --git a/Documentation/virtual/kvm/api.txt 
b/Documentation/virtual/kvm/api.txt
index 8710418..a6858eb 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -3337,6 +3337,28 @@ the userspace IOAPIC should process the EOI and 
retrigger the interrupt if
 it is still asserted.  Vector is the LAPIC interrupt vector for which the
 EOI was received.
 
+   struct kvm_hyperv_exit {
+#define KVM_EXIT_HYPERV_SYNIC  1
+   __u32 type;
+   union {
+   struct {
+   __u32 msr;
+   __u64 control;
+   __u64 evt_page;
+   __u64 msg_page;
+   } synic;
+   } u;
+   };
+   /* KVM_EXIT_HYPERV */
+struct kvm_hyperv_exit hyperv;
+Indicates that the VCPU exits into userspace to process some tasks
+related to Hyper-V emulation.
+Valid values for 'type' are:
+   KVM_EXIT_HYPERV_SYNIC -- synchronously notify user-space about
+Hyper-V SynIC state change. Notification is used to remap SynIC
+event/message pages and to enable/disable SynIC messages/events processing
+in userspace.
+
/* Fix the size of the union. */
char padding[256];
};
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8434f88..54c90d3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -392,6 +392,7 @@ struct kvm_vcpu_hv {
u64 hv_vapic;
s64 runtime_offset;
struct kvm_vcpu_hv_synic synic;
+   struct kvm_hyperv_exit exit;
 };
 
 struct kvm_vcpu_arch {
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 8ff71f3..9443920 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -129,6 +129,20 @@ static void kvm_hv_notify_acked_sint(struct kvm_vcpu 
*vcpu, u32 sint)
srcu_read_unlock(>irq_srcu, idx);
 }
 
+static void synic_exit(struct kvm_vcpu_hv_synic *synic, u32 msr)
+{
+   struct kvm_vcpu *vcpu = synic_to_vcpu(synic);
+   struct kvm_vcpu_hv *hv_vcpu = >arch.hyperv;
+
+   hv_vcpu->exit.type = KVM_EXIT_HYPERV_SYNIC;
+   hv_vcpu->exit.u.synic.msr = msr;
+   hv_vcpu->exit.u.synic.control = synic->control;
+   hv_vcpu->exit.u.synic.evt_page = synic->evt_page;
+   hv_vcpu->exit.u.synic.msg_page = synic->msg_page;
+
+   kvm_make_request(KVM_REQ_HV_EXIT, vcpu);
+}
+
 static int synic_set_msr(struct kvm_vcpu_hv_synic *synic,
 u32 msr, u64 data, bool host)
 {
@@ -141,6 +155,7 @@ static int synic_set_msr(struct kvm_vcpu_hv_synic *synic,
switch (msr) {
case HV_X64_MSR_SCONTROL:
synic->control = data;
+   synic_exit(synic, msr);
break;
case HV_X64_MSR_SVERSION:
if (!host) {
@@ -157,6 +172,7 @@ static int synic_set_msr(struct kvm_vcpu_hv_synic *synic,
break;
}
synic->evt_page = data;
+   synic_exit(synic, msr);
break;
case HV_X64_MSR_SIMP:
if (data & HV_SYNIC_SIMP_ENABLE)
@@ -166,6 +182,7 @@ static int synic_set_msr(struct kvm_vcpu_hv_synic *synic,
break;
}
synic->msg_page = data;
+   synic_exit(synic, msr);
break;
case HV_X64_MSR_EOM: {
int i;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b853b2df..0704ee3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6342,6 +6342,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
r = 0;
goto out;
}
+   if (kvm_check_request(KVM_REQ_HV_EXIT, vcpu)) {
+   vcpu->run->exit_reason = KVM_EXIT_HYPERV;
+

[GIT PULL] KVM changes for 4.3-rc7

2015-10-22 Thread Paolo Bonzini
Linus,

The following changes since commit b10d92a54dac25a6152f1aa1ffc95c12908035ce:

  KVM: x86: fix RSM into 64-bit protected mode (2015-10-14 16:39:52 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/for-linus

for you to fetch changes up to ad355e383d826e3506c3caaa0fe991fd112de47b:

  Merge tag 'kvm-arm-for-v4.3-rc7' of 
git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master 
(2015-10-21 17:46:56 +0200)



Bug fixes for ARM, mostly 4.3 regressions related to virtual
interrupt controller changes.


Arnd Bergmann (1):
  KVM: arm: use GIC support unconditionally

Christoffer Dall (3):
  arm/arm64: KVM: Fix arch timer behavior for disabled interrupts
  arm/arm64: KVM: Clear map->active on pend/active clear
  arm/arm64: KVM: Fix disabled distributor operation

Paolo Bonzini (1):
  Merge tag 'kvm-arm-for-v4.3-rc7' of 
git://git.kernel.org/.../kvmarm/kvmarm into kvm-master

Pavel Fedin (2):
  KVM: arm/arm64: Do not inject spurious interrupts
  KVM: arm/arm64: Fix memory leak if timer initialization fails

 arch/arm/kvm/Kconfig  |  1 +
 arch/arm/kvm/arm.c|  2 +-
 virt/kvm/arm/arch_timer.c | 19 ++
 virt/kvm/arm/vgic.c   | 95 +++
 4 files changed, 76 insertions(+), 41 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html