Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function

2015-10-21 Thread Xiao Guangrong



On 10/21/2015 06:49 PM, Stefan Hajnoczi wrote:

On Wed, Oct 21, 2015 at 12:26:35AM +0800, Xiao Guangrong wrote:



On 10/20/2015 11:51 PM, Stefan Hajnoczi wrote:

On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote:

+exit:
+/* Write our output result to dsm memory. */
+((dsm_out *)dsm_ram_addr)->len = out->len;


Missing byteswap?

I thought you were going to remove this field because it wasn't needed
by the guest.



@len is the size of the _DSM result buffer. For example, for
DSM_FUN_IMPLEMENTED the result buffer is 8 bytes, and for
DSM_DEV_FUN_NAMESPACE_LABEL_SIZE it is 4 bytes. It tells the ASL code
how much memory needs to be returned to the _DSM caller.

In the _DSM code, it is handled like this:

"RLEN" is @len; "OBUF" is the remaining memory in the DSM page.

 /* get @len*/
 aml_append(method, aml_store(aml_name("RLEN"), aml_local(6)));
 /* @len << 3 to get bits. */
 aml_append(method, aml_store(aml_shiftleft(aml_local(6),
aml_int(3)), aml_local(6)));

 /* get @len << 3 bits from OBUF, and return it to the caller. */
 aml_append(method, aml_create_field(aml_name("ODAT"), aml_int(0),
 aml_local(6) , "OBUF"));

Since @len is only used internally and is not returned to the guest, I did
not do a byteswap here.


I am not familiar with the ACPI details, but I think this emits bytecode
that will be run by the guest's ACPI interpreter?

You still need to define the endianness of fields since QEMU and the
guest could have different endianness.

In other words, will the following work if a big-endian ppc host is
running a little-endian x86 guest?

   ((dsm_out *)dsm_ram_addr)->len = out->len;



Er... If we do the byteswap in QEMU then it is also needed in the ASL code;
however, ASL lacks this kind of instruction. I guess the ACPI interpreter is
smart enough to convert values to little-endian for all 2-byte / 4-byte /
8-byte accesses.

I will make the change in the next version. Thanks for pointing it out, Stefan!
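
For reference, a minimal sketch of what a fixed-endianness store of the length field could look like. It is not the code from the patch: the struct name dsm_out_hdr and the helpers store_le32()/load_le32() are hypothetical, and in QEMU itself an existing helper such as cpu_to_le32() would normally be used instead of open-coding the byte stores.

```c
#include <stdint.h>

/* Hypothetical layout of the DSM output header. It only mirrors the idea
 * of the dsm_out structure in the patch; the real definition may differ. */
struct dsm_out_hdr {
    uint8_t len_le[4];   /* result length, always stored little-endian */
};

/* Store a 32-bit value in little-endian byte order, independent of the
 * host's native endianness. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Read the value back the way a little-endian consumer would see it. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Because the bytes are written one at a time, the in-memory layout is the same on a big-endian ppc host and on a little-endian x86 host, which is the property Stefan asks about.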

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally

2015-10-21 Thread Arnd Bergmann
On Tuesday 20 October 2015 15:51:05 Paolo Bonzini wrote:
> Should this be "select" or "depends on"? Not a blocker, can always be fixed 
> in 4.4.

We have lots of 'select ARM_GIC' in the tree for platforms that use one, using
'depends on' will limit KVM support to being available only if at least one
of them is being used.

The only platform I can think of that uses ARMv7ve without actually having
a GIC is BCM2836 (Raspberry Pi 2). Can we actually run KVM on a platform
like that? If so, 'depends on' might be better, otherwise let's stay with
'select'.

Note that ARM_GIC is not a user-visible option, you can only turn it on
by picking one or more platforms that have a GIC.

Arnd


Re: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally

2015-10-21 Thread Christoffer Dall
On Wed, Oct 21, 2015 at 03:45:20PM +0200, Arnd Bergmann wrote:
> On Tuesday 20 October 2015 15:51:05 Paolo Bonzini wrote:
> > Should this be "select" or "depends on"? Not a blocker, can always be fixed 
> > in 4.4.
> 
> We have lots of 'select ARM_GIC' in the tree for platforms that use one, using
> 'depends on' will limit KVM support to being available only if at least one
> of them is being used.
> 
> The only platform I can think of that uses ARMv7ve without actually having
> a GIC is BCM2836 (Raspberry Pi 2). Can we actually run KVM on a platform
> like that? If so, 'depends on' might be better, otherwise let's stay with
> 'select'.

Yes you can, just without the VGIC and the timer - you have to emulate
that in userspace.  Samsung also has a broken platform where they
integrated things incorrectly, so you cannot use the VGIC, but that
platform support is out of tree, so I can't see if it uses the GIC in
general or not.

I'm a bit confused why using 'depends on' in this case helps anything?

(I know, I suck at dealing with the config system)

Thanks,
-Christoffer


RE: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally

2015-10-21 Thread Pavel Fedin
 Hello!

> The only platform I can think of that uses ARMv7ve without actually having
> a GIC is BCM2836 (Raspberry Pi 2). Can we actually run KVM on a platform
> like that?

 We can, with two limitations:
1. The GIC has to be emulated in software. I have recently fixed support for
this. The only problem is that KVM currently refuses to initialize if there
is no vGIC, but that is easy to fix; I posted patches for this too.
2. We cannot emulate the CP15 timer, because accesses to the virtual timer
registers cannot be trapped to HYP. It is possible to trap physical timer
accesses, but a small KVM API extension is needed for this.

 Currently it is possible to run the qemu vexpress model in this mode, because
it has another, memory-mapped timer. It is only necessary to either remove the
CP15 timer from the guest device tree or disable support for it in the guest
.config.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia



[PATCH net-next RFC 2/2] vhost_net: basic polling support

2015-10-21 Thread Jason Wang
This patch tries to poll for newly added tx buffers for a while at the
end of tx processing. The maximum time spent on polling is limited
through a module parameter. To avoid blocking rx, the loop ends if
there is other work queued on vhost, so in effect the socket receive
queue is polled as well.

busyloop_timeout = 50 gives us following improvement on TCP_RR test:

size/session/+thu%/+normalize%
1/ 1/   +5%/  -20%
1/50/  +17%/   +3%

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..bbb522a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -31,7 +31,9 @@
 #include "vhost.h"
 
 static int experimental_zcopytx = 1;
+static int busyloop_timeout = 50;
 module_param(experimental_zcopytx, int, 0444);
+module_param(busyloop_timeout, int, 0444);
 MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
   " 1 -Enable; 0 - Disable");
 
@@ -287,12 +289,23 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
rcu_read_unlock_bh();
 }
 
+static bool tx_can_busy_poll(struct vhost_dev *dev,
+unsigned long endtime)
+{
+   unsigned long now = local_clock() >> 10;
+
+   return busyloop_timeout && !need_resched() &&
+  !time_after(now, endtime) && !vhost_has_work(dev) &&
+  single_task_running();
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
struct vhost_virtqueue *vq = &nvq->vq;
+   unsigned long endtime;
unsigned out, in;
int head;
struct msghdr msg = {
@@ -331,6 +344,8 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
+   endtime  = (local_clock() >> 10) + busyloop_timeout;
+again:
head = vhost_get_vq_desc(vq, vq->iov,
 ARRAY_SIZE(vq->iov),
 &out, &in,
@@ -340,6 +355,10 @@ static void handle_tx(struct vhost_net *net)
break;
/* Nothing new?  Wait for eventfd to tell us they refilled. */
if (head == vq->num) {
+   if (tx_can_busy_poll(vq->dev, endtime)) {
+   cpu_relax();
+   goto again;
+   }
if (unlikely(vhost_enable_notify(&net->dev, vq))) {
vhost_disable_notify(&net->dev, vq);
continue;
-- 
1.8.3.1
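
The polling decision made by tx_can_busy_poll() in this patch can be illustrated outside the kernel with a small simulation. This is only a sketch under stated assumptions: struct poll_env, buffer_available() and busy_poll() are hypothetical names, the clock is a fake counter that advances once per iteration, and the need_resched() and single_task_running() checks from the patch are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel state the patch consults. */
struct poll_env {
    uint64_t now_us;        /* fake clock, advanced on every poll iteration */
    uint64_t work_ready_at; /* time at which a new tx buffer shows up */
    bool other_work;        /* plays the role of vhost_has_work() */
};

static bool buffer_available(const struct poll_env *env)
{
    return env->now_us >= env->work_ready_at;
}

/* Spin until a buffer shows up, the deadline passes, or other work is
 * queued. Returns true if a buffer was found while polling. */
static bool busy_poll(struct poll_env *env, uint64_t timeout_us)
{
    uint64_t endtime = env->now_us + timeout_us;

    while (!buffer_available(env)) {
        if (timeout_us == 0 || env->other_work || env->now_us > endtime)
            return false;   /* give up and fall back to notification */
        env->now_us++;      /* stands in for cpu_relax() plus elapsed time */
    }
    return true;
}
```

With a 50-unit timeout, a buffer that arrives at time 30 is picked up by the poll loop, while one that arrives at time 100 causes the loop to give up once the deadline has passed, which corresponds to re-enabling notification in handle_tx().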



[PATCH net-next RFC 1/2] vhost: introduce vhost_has_work()

2015-10-21 Thread Jason Wang
This patch introduces a helper that gives a hint as to whether there
is any work queued in the work list.

Signed-off-by: Jason Wang 
---
 drivers/vhost/vhost.c | 6 ++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..d42d11e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,12 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+bool vhost_has_work(struct vhost_dev *dev)
+{
+   return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4772862..ea0327d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 unsigned long mask, struct vhost_dev *dev);
-- 
1.8.3.1



[RFC Patch 11/12] IXGBEVF: Migrate VF statistic data

2015-10-21 Thread Lan Tianyu
VF statistic regs are read-only and cannot be migrated by writing the
saved values back directly.

Currently, the statistic data that the driver returns to user space is not
equal to the raw value of the statistic regs. The VF driver records the reg
values as base data when the net interface is brought up or opened, calculates
how much the regs increased during the last period of online service, and adds
that to the saved_reset data. When user space collects statistic data, the VF
driver returns "current - base + saved_reset", where "current" is the reg
value at that point.

Restoring the net function after migration is just like bringing the net
interface up or opening it. Call the existing functions to update the base and
saved_reset data so that the statistic data stays continuous across migration.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 04b6ce7..d22160f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3005,6 +3005,7 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
return 0;
 
del_timer_sync(&adapter->service_timer);
+   ixgbevf_update_stats(adapter);
pr_info("migration start\n");
migration_status = MIGRATION_IN_PROGRESS; 
 
@@ -3017,6 +3018,8 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
return 1;
 
ixgbevf_restore_state(adapter);
+   ixgbevf_save_reset_stats(adapter);
+   ixgbevf_init_last_counter_stats(adapter);
migration_status = MIGRATION_COMPLETED;
pr_info("migration end\n");
return 0;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
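
The "current - base + saved_reset" bookkeeping described in the commit message can be sketched in isolation as follows. This is an illustration only, not the ixgbevf implementation: struct vf_counter, counter_read() and counter_rebase() are hypothetical names, and the sketch assumes a 32-bit hardware counter that may wrap.

```c
#include <stdint.h>

/* Hypothetical software view of one read-only hardware counter. */
struct vf_counter {
    uint32_t base;         /* register value when the interface came up */
    uint64_t saved_reset;  /* counts accumulated in earlier online periods */
};

/* Value reported to user space: current - base + saved_reset.
 * The unsigned 32-bit subtraction also copes with a wrapped register. */
static uint64_t counter_read(const struct vf_counter *c, uint32_t current)
{
    return (uint64_t)(uint32_t)(current - c->base) + c->saved_reset;
}

/* Called when the register restarts from a new value, for example after
 * the VF has been reset on the migration target: fold the finished period
 * into saved_reset and take the new register value as the base. */
static void counter_rebase(struct vf_counter *c, uint32_t last_current,
                           uint32_t new_current)
{
    c->saved_reset += (uint32_t)(last_current - c->base);
    c->base = new_current;
}
```

After a rebase the reported value continues from where it left off, even though the hardware register itself started again from a different value.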



[RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device

2015-10-21 Thread Lan Tianyu
Add a "virtfn_index" member to struct pci_dev to record the VF's sequence
number within its PF. This will be used by the VF sysfs node handling.

Signed-off-by: Lan Tianyu 
---
 drivers/pci/iov.c   | 1 +
 include/linux/pci.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..065b6bb 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
virtfn->physfn = pci_dev_get(dev);
virtfn->is_virtfn = 1;
virtfn->multifunction = 0;
+   virtfn->virtfn_index = id;
 
for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
res = &dev->resource[i + PCI_IOV_RESOURCES];
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..85c5531 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -356,6 +356,7 @@ struct pci_dev {
unsigned int io_window_1k:1; /* Intel P2P bridge 1K I/O windows */
unsigned int irq_managed:1;
pci_dev_flags_t dev_flags;
+   unsigned int virtfn_index;
atomic_t enable_cnt; /* pci_enable_device has been called */
 
u32 saved_config_space[16]; /* config space saved at suspend time */
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



Re: [GIT PULL 0/6] A handful of fixes for KVM/ARM for v4.3-rc7

2015-10-21 Thread Paolo Bonzini


On 20/10/2015 18:19, Christoffer Dall wrote:
> Hi Paolo,
> 
> The following changes since commit 920552b213e3dc832a874b4e7ba29ecddbab31bc:
> 
>   KVM: disable halt_poll_ns as default for s390x (2015-09-25 10:31:30 +0200)
> 
> are available in the git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git 
> tags/kvm-arm-for-v4.3-rc7
> 
> for you to fetch changes up to 0d997491f814c87310a6ad7be30a9049c7150489:
> 
>   arm/arm64: KVM: Fix disabled distributor operation (2015-10-20 18:09:13 
> +0200)
> 
> Sorry for sending these relatively late, but we had a situation where we
> found one breakage in the timer implementation changes merged for 4.3,
> then fixing that issue revealed another bug, and then that happened
> again, and now we have something that looks stable.
> 
> Description of the fixes is in the tag and quoted below.
> 
> Thanks,
> -Christoffer
> 
> 
> A late round of KVM/ARM fixes for v4.3-rc7, fixing:
>  - A bug where level-triggered interrupts lowered from userspace
>are still routed to the guest
>  - A memory leak an a failed initialization path
>  - A build error under certain configurations
>  - Several timer bugs introduced with moving the timer to the active
>state handling instead of the masking trick.
> 
> 
> Arnd Bergmann (1):
>   KVM: arm: use GIC support unconditionally
> 
> Christoffer Dall (3):
>   arm/arm64: KVM: Fix arch timer behavior for disabled interrupts
>   arm/arm64: KVM: Clear map->active on pend/active clear
>   arm/arm64: KVM: Fix disabled distributor operation
> 
> Pavel Fedin (2):
>   KVM: arm/arm64: Do not inject spurious interrupts
>   KVM: arm/arm64: Fix memory leak if timer initialization fails
> 
>  arch/arm/kvm/Kconfig  |  1 +
>  arch/arm/kvm/arm.c|  2 +-
>  virt/kvm/arm/arch_timer.c | 19 ++
>  virt/kvm/arm/vgic.c   | 95 
> +++
>  4 files changed, 76 insertions(+), 41 deletions(-)
> 

Pulled, thanks.  I'll send the fixes to Linus tomorrow.

Paolo


[RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Lan Tianyu
This patchset proposes a new solution for adding live migration support to
the 82599 SRIOV network card.

In our solution, we prefer to put all device-specific operations into the VF
and PF drivers and to keep the code in Qemu more general.


VF status migration
===================
VF status can be divided into 4 parts:
1) PCI configuration regs
2) MSIX configuration
3) VF status in the PF driver
4) VF MMIO regs

The first three are handled by Qemu.
The PCI configuration space regs and the MSIX configuration are already
stored in Qemu. To let Qemu save and restore "VF status in the PF driver"
during migration, a new sysfs node "state_in_pf" is added under the
VF sysfs directory.

For VF MMIO regs, we introduce a self-emulation layer in the VF
driver that records MMIO reg values on every MMIO read or write
and keeps this data in guest memory. It is migrated to the new
machine together with the rest of guest memory.


VF function restoration
=======================
Restoring VF function operation is done in the VF and PF drivers.

To let the VF driver know the migration status, Qemu fakes VF
PCI configuration regs to indicate the migration status, and a new sysfs
node "notify_vf" is added to trigger the VF mailbox irq in order to notify
the VF about migration status changes.

Transmit/Receive descriptor head regs are read-only, so they can't
be restored by writing the recorded reg values back directly, and they
are set to 0 during VF reset. To reuse the original tx/rx rings, shift the
desc ring so that the desc pointed to by the original head reg becomes the
first entry of the ring, and then enable the tx/rx rings. The VF then resumes
receiving and transmitting from the original head desc.
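
The ring shift described above amounts to rotating the descriptor array so that the old head entry lands at index 0. A minimal user-space sketch, assuming a generic ring of fixed-size entries (the function name ring_shift() and its signature are hypothetical and not taken from the driver):

```c
#include <stdlib.h>
#include <string.h>

/* Rotate a ring of `count` entries of `entry_size` bytes so that the entry
 * at index `head` becomes entry 0. Returns 0 on success and -1 on invalid
 * arguments or allocation failure. */
static int ring_shift(void *ring, size_t entry_size, size_t count, size_t head)
{
    unsigned char *base = ring;
    unsigned char *tmp;

    if (count == 0 || head >= count)
        return -1;
    if (head == 0)
        return 0;               /* already in place */

    tmp = malloc(entry_size * count);
    if (!tmp)
        return -1;

    /* Keep a copy of the old layout, then write it back in two parts:
     * [head, count) moves to the front, [0, head) moves to the back. */
    memcpy(tmp, base, entry_size * count);
    memcpy(base, tmp + head * entry_size, (count - head) * entry_size);
    memcpy(base + (count - head) * entry_size, tmp, head * entry_size);

    free(tmp);
    return 0;
}
```

For a five-entry ring {0, 1, 2, 3, 4} with head 2, the result is {2, 3, 4, 0, 1}, so hardware that restarts with its head reg at 0 continues with the entry it was about to process before the reset.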


Tracking DMA accessed memory
============================
Migration relies on dirty page tracking to migrate memory.
Hardware can't automatically mark a page as dirty after a DMA
memory access. VF descriptor rings and data buffers are modified
by hardware when receiving and transmitting data. To track such dirty
memory manually, do dummy writes (read a byte and write it back) when
receiving and transmitting data.
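
The dummy write can be sketched as below. This is a user-space illustration, not driver code: touch_pages() is a hypothetical helper, the page size is passed in as a parameter, and the volatile qualifier is there so that the compiler does not optimize the read-and-write-back away.

```c
#include <stddef.h>

/* Read one byte per page of the buffer and write it back unchanged, so
 * that CPU-side dirty tracking sees a write to every page the device may
 * have modified by DMA. Returns the number of write-backs performed. */
static size_t touch_pages(volatile unsigned char *buf, size_t len,
                          size_t page_size)
{
    size_t off, touched = 0;

    if (len == 0 || page_size == 0)
        return 0;

    for (off = 0; off < len; off += page_size) {
        unsigned char b = buf[off];   /* read a byte ... */
        buf[off] = b;                 /* ... and write it back */
        touched++;
    }

    /* The buffer may not be page aligned, so its last byte can sit on one
     * more page than the strided loop reached; touch it as well. */
    buf[len - 1] = buf[len - 1];
    touched++;

    return touched;
}
```

The data itself is left unchanged; only the side effect of the write on dirty tracking matters.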


Service down time test
======================
So far, we have tested migration between two laptops with 82599 NICs that
are connected to a gigabit switch. The VF was pinged at a 0.001s interval
from the source-side host during migration. The service down time
is about 180ms.

[983769928.053604] 64 bytes from 10.239.48.100: icmp_seq=4131 ttl=64 time=2.79 ms
[983769928.056422] 64 bytes from 10.239.48.100: icmp_seq=4132 ttl=64 time=2.79 ms
[983769928.059241] 64 bytes from 10.239.48.100: icmp_seq=4133 ttl=64 time=2.79 ms
[983769928.062071] 64 bytes from 10.239.48.100: icmp_seq=4134 ttl=64 time=2.80 ms
[983769928.064890] 64 bytes from 10.239.48.100: icmp_seq=4135 ttl=64 time=2.79 ms
[983769928.067716] 64 bytes from 10.239.48.100: icmp_seq=4136 ttl=64 time=2.79 ms
[983769928.070538] 64 bytes from 10.239.48.100: icmp_seq=4137 ttl=64 time=2.79 ms
[983769928.073360] 64 bytes from 10.239.48.100: icmp_seq=4138 ttl=64 time=2.79 ms
[983769928.083444] no answer yet for icmp_seq=4139
[983769928.093524] no answer yet for icmp_seq=4140
[983769928.103602] no answer yet for icmp_seq=4141
[983769928.113684] no answer yet for icmp_seq=4142
[983769928.123763] no answer yet for icmp_seq=4143
[983769928.133854] no answer yet for icmp_seq=4144
[983769928.143931] no answer yet for icmp_seq=4145
[983769928.154008] no answer yet for icmp_seq=4146
[983769928.164084] no answer yet for icmp_seq=4147
[983769928.174160] no answer yet for icmp_seq=4148
[983769928.184236] no answer yet for icmp_seq=4149
[983769928.194313] no answer yet for icmp_seq=4150
[983769928.204390] no answer yet for icmp_seq=4151
[983769928.214468] no answer yet for icmp_seq=4152
[983769928.224556] no answer yet for icmp_seq=4153
[983769928.234632] no answer yet for icmp_seq=4154
[983769928.244709] no answer yet for icmp_seq=4155
[983769928.254783] no answer yet for icmp_seq=4156
[983769928.256094] 64 bytes from 10.239.48.100: icmp_seq=4139 ttl=64 time=182 ms
[983769928.256107] 64 bytes from 10.239.48.100: icmp_seq=4140 ttl=64 time=172 ms
[983769928.256114] no answer yet for icmp_seq=4157
[983769928.256236] 64 bytes from 10.239.48.100: icmp_seq=4141 ttl=64 time=162 ms
[983769928.256245] 64 bytes from 10.239.48.100: icmp_seq=4142 ttl=64 time=152 ms
[983769928.256272] 64 bytes from 10.239.48.100: icmp_seq=4143 ttl=64 time=142 ms
[983769928.256310] 64 bytes from 10.239.48.100: icmp_seq=4144 ttl=64 time=132 ms
[983769928.256325] 64 bytes from 10.239.48.100: icmp_seq=4145 ttl=64 time=122 ms
[983769928.256332] 64 bytes from 10.239.48.100: icmp_seq=4146 ttl=64 time=112 ms
[983769928.256440] 64 bytes from 10.239.48.100: icmp_seq=4147 ttl=64 time=102 ms
[983769928.256455] 64 bytes from 10.239.48.100: icmp_seq=4148 ttl=64 time=92.3 ms
[983769928.256494] 64 bytes from 10.239.48.100: icmp_seq=4149 ttl=64 time=82.3 ms
[983769928.256503] 64 

[RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package

2015-10-21 Thread Lan Tianyu
When transmitting a packet, the end transmit desc of the packet
indicates whether the packet has already been sent. The current code
records a pointer to the end desc in next_to_watch of struct
ixgbevf_tx_buffer. This breaks if the desc ring is shifted after
migration, because the pointer becomes invalid. This patch replaces the
recorded pointer with the number of descs in the packet and finds
the end desc from the first desc and that number.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  1 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 ---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..c823616 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -54,6 +54,7 @@
  */
 struct ixgbevf_tx_buffer {
union ixgbe_adv_tx_desc *next_to_watch;
+   u16 desc_num;
unsigned long time_stamp;
struct sk_buff *skb;
unsigned int bytecount;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 4446916..056841c 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct ixgbevf_ring *tx_ring,
   DMA_TO_DEVICE);
}
tx_buffer->next_to_watch = NULL;
+   tx_buffer->desc_num = 0;
tx_buffer->skb = NULL;
dma_unmap_len_set(tx_buffer, len, 0);
/* tx_buffer must be completely set up in the transmit path */
@@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
union ixgbe_adv_tx_desc *tx_desc;
unsigned int total_bytes = 0, total_packets = 0;
unsigned int budget = tx_ring->count / 2;
-   unsigned int i = tx_ring->next_to_clean;
+   int i, watch_index;
 
if (test_bit(__IXGBEVF_DOWN, &adapter->state))
return true;
@@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
i -= tx_ring->count;
 
do {
-   union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
+   union ixgbe_adv_tx_desc *eop_desc;
+
+   if (!tx_buffer->desc_num)
+   break;
+
+   if (i + tx_buffer->desc_num >= 0)
+   watch_index = i + tx_buffer->desc_num;
+   else
+   watch_index = i + tx_ring->count + tx_buffer->desc_num;
 
-   /* if next_to_watch is not set then there is no work pending */
+   eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index);
if (!eop_desc)
break;
 
@@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 
/* clear next_to_watch to prevent false hangs */
tx_buffer->next_to_watch = NULL;
+   tx_buffer->desc_num = 0;
 
/* update the statistics for this packet */
total_bytes += tx_buffer->bytecount;
@@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
u32 tx_flags = first->tx_flags;
__le32 cmd_type;
u16 i = tx_ring->next_to_use;
+   u16 start;
 
tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
 
@@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 
/* set next_to_watch value indicating a packet is present */
first->next_to_watch = tx_desc;
+   start = first - tx_ring->tx_buffer_info;
+   first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - start;
 
i++;
if (i == tx_ring->count)
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
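
The idea of storing a distance instead of a pointer can be shown with plain ring indices. The sketch below is an assumption-laden simplification rather than the driver logic: desc_distance() and desc_end_index() are hypothetical helpers, and all indices are assumed to be smaller than the ring size.

```c
#include <stdint.h>

/* Number of steps from index `first` to index `last` in a ring of `count`
 * entries, allowing for wrap-around past the end of the ring. */
static uint16_t desc_distance(uint16_t first, uint16_t last, uint16_t count)
{
    return (uint16_t)((last + count - first) % count);
}

/* Index of the end descriptor, recomputed from the first descriptor's
 * index and the stored distance. */
static uint16_t desc_end_index(uint16_t first, uint16_t distance,
                               uint16_t count)
{
    return (uint16_t)((first + distance) % count);
}
```

In an 8-entry ring, a packet whose first desc is at index 6 and whose end desc is at index 1 has a distance of 3. If the ring is then shifted so that index 6 becomes index 0, the same distance yields end index 3, which is where the old index 1 now lives. A stored pointer would still refer to the old slot.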



[RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver

2015-10-21 Thread Lan Tianyu
To let the VF driver in the guest know the migration status, Qemu will
fake PCI configuration regs 0xF0 and 0xF1 to show the migration status and
get an ack from the VF driver.

When migration starts, Qemu will set reg "0xF0" to 1, notify the
VF driver by triggering a mailbox msg, and wait for the VF driver to signal
that it is ready for migration (by setting reg "0xF1" to 1). After migration,
Qemu will set reg "0xF0" to 0 and notify the VF driver via the mailbox irq.
The VF driver begins to restore the tx/rx function after detecting the
status change.

When the VF receives the mailbox irq, it checks reg "0xF0" in the service
task function to get the migration status and performs the related operations
according to its value.
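
The handshake above can be modelled as a small state machine. The sketch is a simplified model, not the driver: struct fake_cfg, struct vf_state, enum vf_action and vf_service_task() are hypothetical, the two faked config bytes are plain struct members, and clearing the ack when migration completes is an assumption that the description does not spell out.

```c
#include <stdint.h>

#define MIGRATION_COMPLETED   0x00
#define MIGRATION_IN_PROGRESS 0x01

/* Hypothetical model of the two faked PCI config bytes. */
struct fake_cfg {
    uint8_t status;   /* reg 0xF0, written by Qemu */
    uint8_t ack;      /* reg 0xF1, written by the VF driver */
};

/* What the VF driver should do after looking at the status reg. */
enum vf_action { VF_NONE, VF_STOP, VF_RESTORE };

struct vf_state {
    uint8_t last_status;   /* status seen on the previous mailbox irq */
};

/* Body of the service task run after a mailbox irq: compare the status
 * reg with the last value seen and react only to a change. */
static enum vf_action vf_service_task(struct vf_state *vf,
                                      struct fake_cfg *cfg)
{
    if (cfg->status == vf->last_status)
        return VF_NONE;

    vf->last_status = cfg->status;

    if (cfg->status == MIGRATION_IN_PROGRESS) {
        cfg->ack = 1;      /* tell Qemu the VF is ready for migration */
        return VF_STOP;
    }

    cfg->ack = 0;          /* assumption: ack is cleared for the next round */
    return VF_RESTORE;     /* restore tx/rx as listed in the steps */
}
```

Reacting only to a change in the status byte means that mailbox irqs raised for other reasons leave the migration state untouched.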

Steps for restarting the receive and transmit function:
1) Restore the VF status in the PF driver by sending a mailbox event to the PF driver
2) Write back the reg values recorded by the self-emulation layer
3) Restart the rx/tx rings
4) Restore interrupts

Transmit/Receive descriptor head regs are read-only, so they can't
be restored by writing the recorded reg values back directly, and they
are set to 0 during VF reset. To reuse the original tx/rx rings, shift the
desc ring so that the desc pointed to by the original head reg becomes the
first entry of the ring, and then enable the tx/rx rings. The VF then resumes
receiving and transmitting from the original head desc.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/defines.h   |   6 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h   |   7 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 115 -
 .../net/ethernet/intel/ixgbevf/self-emulation.c| 107 +++
 4 files changed, 232 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h
index 770e21a..113efd2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/defines.h
+++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
@@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc {
__le32 mss_l4len_idx;
 };
 
+union ixgbevf_desc {
+   union ixgbe_adv_tx_desc tx_desc;
+   union ixgbe_adv_rx_desc rx_desc;
+   struct ixgbe_adv_tx_context_desc tx_context_desc;
+};
+
 /* Adv Transmit Descriptor Config Masks */
 #define IXGBE_ADVTXD_DTYP_MASK 0x00F0 /* DTYP mask */
 #define IXGBE_ADVTXD_DTYP_CTXT 0x0020 /* Advanced Context Desc */
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index c823616..6eab402e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -109,7 +109,7 @@ struct ixgbevf_ring {
struct ixgbevf_ring *next;
struct net_device *netdev;
struct device *dev;
-   void *desc; /* descriptor ring memory */
+   union ixgbevf_desc *desc;   /* descriptor ring memory */
dma_addr_t dma; /* phys. address of descriptor ring */
unsigned int size;  /* length in bytes */
u16 count;  /* amount of descriptors */
@@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector);
 
 void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
 void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter);
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head);
+int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head);
+void ixgbevf_restore_state(struct ixgbevf_adapter *adapter);
+inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter);
+
 
 #ifdef DEBUG
 char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw);
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 056841c..15ec361 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function Network Driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_VERSION);
 
+
+#define MIGRATION_COMPLETED   0x00
+#define MIGRATION_IN_PROGRESS 0x01
+
 #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
 static int debug = -1;
 module_param(debug, int, 0);
@@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring)
return ring->stats.packets;
 }
 
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
+{
+   struct ixgbevf_tx_buffer *tx_buffer = NULL;
+   static union ixgbevf_desc *tx_desc = NULL;
+
+   tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
+   if (!tx_buffer)
+   return -ENOMEM;
+
+   tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count);
+   if (!tx_desc)
+   return -ENOMEM;
+
+   memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
+   memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
+   memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
+
+   

[RFC Patch 04/12] IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg.

2015-10-21 Thread Lan Tianyu
This patch adds ixgbe_ping_vf() to notify a specified VF. When the
migration status changes, the VF needs to be notified of the change.
The VF driver checks the migration status when it gets a mailbox msg.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 19 ---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h |  1 +
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 89671eb..e247d67 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -1318,18 +1318,23 @@ void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter)
IXGBE_WRITE_REG(hw, IXGBE_VFRE(1), 0);
 }
 
-void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
+void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn)
 {
struct ixgbe_hw *hw = &adapter->hw;
u32 ping;
+
+   ping = IXGBE_PF_CONTROL_MSG;
+   if (adapter->vfinfo[vfn].clear_to_send)
+   ping |= IXGBE_VT_MSGTYPE_CTS;
+   ixgbe_write_mbx(hw, &ping, 1, vfn);
+}
+
+void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
+{
int i;
 
-   for (i = 0 ; i < adapter->num_vfs; i++) {
-   ping = IXGBE_PF_CONTROL_MSG;
-   if (adapter->vfinfo[i].clear_to_send)
-   ping |= IXGBE_VT_MSGTYPE_CTS;
-   ixgbe_write_mbx(hw, &ping, 1, i);
-   }
+   for (i = 0 ; i < adapter->num_vfs; i++)
+   ixgbe_ping_vf(adapter, i);
 }
 
 int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
index 2c197e6..143e2fd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
@@ -41,6 +41,7 @@ void ixgbe_msg_task(struct ixgbe_adapter *adapter);
 int ixgbe_vf_configuration(struct pci_dev *pdev, unsigned int event_mask);
 void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter);
 void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter);
+void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn);
 int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int queue, u8 *mac);
 int ixgbe_ndo_set_vf_vlan(struct net_device *netdev, int queue, u16 vlan,
   u8 qos);
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 07/12] IXGBEVF: Add new mail box event for migration

2015-10-21 Thread Lan Tianyu
The VF status in the PF driver needs to be restored after migration and
after the VF hardware has been reset. This patch adds a new mailbox event
that the VF driver uses to ask the PF driver to restore that status.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/mbx.h |  3 +++
 drivers/net/ethernet/intel/ixgbevf/vf.c  | 10 ++
 drivers/net/ethernet/intel/ixgbevf/vf.h  |  1 +
 3 files changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/mbx.h b/drivers/net/ethernet/intel/ixgbevf/mbx.h
index 82f44e0..22761d8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/mbx.h
+++ b/drivers/net/ethernet/intel/ixgbevf/mbx.h
@@ -112,6 +112,9 @@ enum ixgbe_pfvf_api_rev {
 #define IXGBE_VF_GET_RETA  0x0a/* VF request for RETA */
 #define IXGBE_VF_GET_RSS_KEY   0x0b/* get RSS hash key */
 
+/* mail box event for live migration  */
+#define IXGBE_VF_NOTIFY_RESUME  0x0c /* VF notify PF migration to restore status */
+
 /* length of permanent address message returned from PF */
 #define IXGBE_VF_PERMADDR_MSG_LEN  4
 /* word in permanent address message with the current multicast type */
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c 
b/drivers/net/ethernet/intel/ixgbevf/vf.c
index d1339b0..1e4e5e6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.c
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.c
@@ -717,6 +717,15 @@ int ixgbevf_get_queues(struct ixgbe_hw *hw, unsigned int 
*num_tcs,
return err;
 }
 
+static void ixgbevf_notify_resume_vf(struct ixgbe_hw *hw)
+{
+   struct ixgbe_mbx_info *mbx = &hw->mbx;
+   u32 msgbuf[1];
+
+   msgbuf[0] = IXGBE_VF_NOTIFY_RESUME;
+   mbx->ops.write_posted(hw, msgbuf, 1);
+}
+
 static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
.init_hw= ixgbevf_init_hw_vf,
.reset_hw   = ixgbevf_reset_hw_vf,
@@ -729,6 +738,7 @@ static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
.update_mc_addr_list= ixgbevf_update_mc_addr_list_vf,
.set_uc_addr= ixgbevf_set_uc_addr_vf,
.set_vfta   = ixgbevf_set_vfta_vf,
+   .notify_resume  = ixgbevf_notify_resume_vf,
 };
 
 const struct ixgbevf_info ixgbevf_82599_vf_info = {
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h 
b/drivers/net/ethernet/intel/ixgbevf/vf.h
index 6a3f4eb..a25fe81 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -70,6 +70,7 @@ struct ixgbe_mac_operations {
s32 (*disable_mc)(struct ixgbe_hw *);
s32 (*clear_vfta)(struct ixgbe_hw *);
s32 (*set_vfta)(struct ixgbe_hw *, u32, u32, bool);
+   void (*notify_resume)(struct ixgbe_hw *); 
 };
 
 enum ixgbe_mac_type {
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC PATCH 1/3] Qemu: Add pci-assign.h to share functions and struct definition with new file

2015-10-21 Thread Lan Tianyu
Signed-off-by: Lan Tianyu 
---
 hw/i386/kvm/pci-assign.c | 111 ++-
 hw/i386/kvm/pci-assign.h | 109 ++
 2 files changed, 112 insertions(+), 108 deletions(-)
 create mode 100644 hw/i386/kvm/pci-assign.h

diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c
index 74d22f4..616532d 100644
--- a/hw/i386/kvm/pci-assign.c
+++ b/hw/i386/kvm/pci-assign.c
@@ -37,112 +37,7 @@
 #include "hw/pci/pci.h"
 #include "hw/pci/msi.h"
 #include "kvm_i386.h"
-
-#define MSIX_PAGE_SIZE 0x1000
-
-/* From linux/ioport.h */
-#define IORESOURCE_IO   0x0100  /* Resource type */
-#define IORESOURCE_MEM  0x0200
-#define IORESOURCE_IRQ  0x0400
-#define IORESOURCE_DMA  0x0800
-#define IORESOURCE_PREFETCH 0x2000  /* No side effects */
-#define IORESOURCE_MEM_64   0x0010
-
-//#define DEVICE_ASSIGNMENT_DEBUG
-
-#ifdef DEVICE_ASSIGNMENT_DEBUG
-#define DEBUG(fmt, ...)   \
-do {  \
-fprintf(stderr, "%s: " fmt, __func__ , __VA_ARGS__);  \
-} while (0)
-#else
-#define DEBUG(fmt, ...)
-#endif
-
-typedef struct PCIRegion {
-int type;   /* Memory or port I/O */
-int valid;
-uint64_t base_addr;
-uint64_t size;/* size of the region */
-int resource_fd;
-} PCIRegion;
-
-typedef struct PCIDevRegions {
-uint8_t bus, dev, func; /* Bus inside domain, device and function */
-int irq;/* IRQ number */
-uint16_t region_number; /* number of active regions */
-
-/* Port I/O or MMIO Regions */
-PCIRegion regions[PCI_NUM_REGIONS - 1];
-int config_fd;
-} PCIDevRegions;
-
-typedef struct AssignedDevRegion {
-MemoryRegion container;
-MemoryRegion real_iomem;
-union {
-uint8_t *r_virtbase; /* mmapped access address for memory regions */
-uint32_t r_baseport; /* the base guest port for I/O regions */
-} u;
-pcibus_t e_size;/* emulated size of region in bytes */
-pcibus_t r_size;/* real size of region in bytes */
-PCIRegion *region;
-} AssignedDevRegion;
-
-#define ASSIGNED_DEVICE_PREFER_MSI_BIT  0
-#define ASSIGNED_DEVICE_SHARE_INTX_BIT  1
-
-#define ASSIGNED_DEVICE_PREFER_MSI_MASK (1 << ASSIGNED_DEVICE_PREFER_MSI_BIT)
-#define ASSIGNED_DEVICE_SHARE_INTX_MASK (1 << ASSIGNED_DEVICE_SHARE_INTX_BIT)
-
-typedef struct MSIXTableEntry {
-uint32_t addr_lo;
-uint32_t addr_hi;
-uint32_t data;
-uint32_t ctrl;
-} MSIXTableEntry;
-
-typedef enum AssignedIRQType {
-ASSIGNED_IRQ_NONE = 0,
-ASSIGNED_IRQ_INTX_HOST_INTX,
-ASSIGNED_IRQ_INTX_HOST_MSI,
-ASSIGNED_IRQ_MSI,
-ASSIGNED_IRQ_MSIX
-} AssignedIRQType;
-
-typedef struct AssignedDevice {
-PCIDevice dev;
-PCIHostDeviceAddress host;
-uint32_t dev_id;
-uint32_t features;
-int intpin;
-AssignedDevRegion v_addrs[PCI_NUM_REGIONS - 1];
-PCIDevRegions real_device;
-PCIINTxRoute intx_route;
-AssignedIRQType assigned_irq_type;
-struct {
-#define ASSIGNED_DEVICE_CAP_MSI (1 << 0)
-#define ASSIGNED_DEVICE_CAP_MSIX (1 << 1)
-uint32_t available;
-#define ASSIGNED_DEVICE_MSI_ENABLED (1 << 0)
-#define ASSIGNED_DEVICE_MSIX_ENABLED (1 << 1)
-#define ASSIGNED_DEVICE_MSIX_MASKED (1 << 2)
-uint32_t state;
-} cap;
-uint8_t emulate_config_read[PCI_CONFIG_SPACE_SIZE];
-uint8_t emulate_config_write[PCI_CONFIG_SPACE_SIZE];
-int msi_virq_nr;
-int *msi_virq;
-MSIXTableEntry *msix_table;
-hwaddr msix_table_addr;
-uint16_t msix_max;
-MemoryRegion mmio;
-char *configfd_name;
-int32_t bootindex;
-} AssignedDevice;
-
-#define TYPE_PCI_ASSIGN "kvm-pci-assign"
-#define PCI_ASSIGN(obj) OBJECT_CHECK(AssignedDevice, (obj), TYPE_PCI_ASSIGN)
+#include "pci-assign.h"
 
 static void assigned_dev_update_irq_routing(PCIDevice *dev);
 
@@ -1044,7 +939,7 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry)
  * sure the physical MSI-X state tracks the guest's view, which is important
  * for some VF/PF and PF/fw communication channels.
  */
-static bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
+bool assigned_dev_msix_skipped(MSIXTableEntry *entry)
 {
 return !entry->data;
 }
@@ -1114,7 +1009,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice 
*pci_dev)
 return r;
 }
 
-static void assigned_dev_update_msix(PCIDevice *pci_dev)
+void assigned_dev_update_msix(PCIDevice *pci_dev)
 {
 AssignedDevice *assigned_dev = PCI_ASSIGN(pci_dev);
 uint16_t ctrl_word = pci_get_word(pci_dev->config + pci_dev->msix_cap +
diff --git a/hw/i386/kvm/pci-assign.h b/hw/i386/kvm/pci-assign.h
new file mode 100644
index 000..91d00ea
--- /dev/null
+++ b/hw/i386/kvm/pci-assign.h
@@ -0,0 +1,109 @@
+#define MSIX_PAGE_SIZE 0x1000
+
+/* From linux/ioport.h */
+#define IORESOURCE_IO   0x0100  /* Resource type */
+#define IORESOURCE_MEM  

[RFC PATCH 0/3] Qemu/IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Lan Tianyu
This patchset is the Qemu part of live migration support for SRIOV NICs.
Information about the kernel part is available at the following link:
http://marc.info/?l=kvm&m=144544635330193&w=2


Lan Tianyu (3):
  Qemu: Add pci-assign.h to share functions and struct definition with
new file
  Qemu: Add post_load_state() to run after restoring CPU state
  Qemu: Introduce pci-sriov device type to support VF live migration

 hw/i386/kvm/Makefile.objs   |   2 +-
 hw/i386/kvm/pci-assign.c| 113 +--
 hw/i386/kvm/pci-assign.h| 109 +++
 hw/i386/kvm/sriov.c | 213 
 include/migration/vmstate.h |   2 +
 migration/savevm.c  |  15 
 6 files changed, 344 insertions(+), 110 deletions(-)
 create mode 100644 hw/i386/kvm/pci-assign.h
 create mode 100644 hw/i386/kvm/sriov.c

-- 
1.9.3



[RFC Patch 06/12] IXGBEVF: Add self emulation layer

2015-10-21 Thread Lan Tianyu
In order to restore VF function after migration, add a self-emulation layer
that records register values whenever the registers are accessed.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/Makefile|  3 ++-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
 .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++
 drivers/net/ethernet/intel/ixgbevf/vf.h|  5 -
 4 files changed, 33 insertions(+), 3 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile 
b/drivers/net/ethernet/intel/ixgbevf/Makefile
index 4ce4c97..841c884 100644
--- a/drivers/net/ethernet/intel/ixgbevf/Makefile
+++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
@@ -31,7 +31,8 @@
 
 obj-$(CONFIG_IXGBEVF) += ixgbevf.o
 
-ixgbevf-objs := vf.o \
+ixgbevf-objs := self-emulation.o \
+   vf.o \
 mbx.o \
 ethtool.o \
 ixgbevf_main.o
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..4446916 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
 
if (IXGBE_REMOVED(reg_addr))
return IXGBE_FAILED_READ_REG;
-   value = readl(reg_addr + reg);
+   value = ixgbe_self_emul_readl(reg_addr, reg);
if (unlikely(value == IXGBE_FAILED_READ_REG))
ixgbevf_check_remove(hw, reg);
return value;
diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c 
b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
new file mode 100644
index 000..d74b2da
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
@@ -0,0 +1,26 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vf.h"
+#include "ixgbevf.h"
+
+static u32 hw_regs[0x4000];
+
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
+{
+   u32 tmp;
+
+   tmp = readl(base + addr);
+   hw_regs[(unsigned long)addr] = tmp;
+
+   return tmp;
+}
+
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
+{
+   hw_regs[(unsigned long)addr] = val;
+   writel(val, (volatile void __iomem *)(base + addr));
+}
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h 
b/drivers/net/ethernet/intel/ixgbevf/vf.h
index d40f036..6a3f4eb 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -39,6 +39,9 @@
 
 struct ixgbe_hw;
 
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr);
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr);
+
 /* iterator type for walking multicast address lists */
 typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr,
  u32 *vmdq);
@@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 
reg, u32 value)
 
if (IXGBE_REMOVED(reg_addr))
return;
-   writel(value, reg_addr + reg);
+   ixgbe_self_emul_writel(value, reg_addr, reg);
 }
 
 #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v)
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
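
Editorial sketch: the self-emulation idea in the commit message above (mirror
every register access into a software copy so the values can be replayed after
migration) can be illustrated outside the kernel as follows. This is a minimal
sketch under stated assumptions, not the driver's code: the names shadow_regs,
shadow_readl and shadow_writel are hypothetical, a plain array stands in for
the mapped BAR, and the shadow table is indexed by 32-bit word with a bounds
check, whereas the patch indexes hw_regs directly by the register offset.

```c
#include <stddef.h>
#include <stdint.h>

#define SHADOW_REG_SPACE 0x4000u  /* bytes of register space mirrored (assumed) */
#define SHADOW_REG_COUNT (SHADOW_REG_SPACE / sizeof(uint32_t))
#define SHADOW_BAD_READ  0xFFFFFFFFu

struct shadow_regs {
    volatile uint32_t *mmio;            /* stand-in for the mapped register BAR */
    uint32_t last[SHADOW_REG_COUNT];    /* last value read or written, per register */
};

/* Map a byte offset to a slot; reject unaligned or out-of-range offsets. */
static int shadow_slot(uint32_t offset, size_t *slot)
{
    if (offset >= SHADOW_REG_SPACE || (offset % sizeof(uint32_t)) != 0)
        return -1;
    *slot = offset / sizeof(uint32_t);
    return 0;
}

/* Read from the device and remember what was read. */
uint32_t shadow_readl(struct shadow_regs *s, uint32_t offset)
{
    size_t slot;
    uint32_t val;

    if (shadow_slot(offset, &slot))
        return SHADOW_BAD_READ;
    val = s->mmio[slot];
    s->last[slot] = val;
    return val;
}

/* Remember the value, then write it to the device. */
void shadow_writel(struct shadow_regs *s, uint32_t offset, uint32_t val)
{
    size_t slot;

    if (shadow_slot(offset, &slot))
        return;
    s->last[slot] = val;
    s->mmio[slot] = val;
}
```

After a migration, a restore path could walk s->last and write the recorded
values back to the new device; which registers are safe to replay is a
hardware question this sketch does not address.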



[RFC PATCH 2/3] Qemu: Add post_load_state() to run after restoring CPU state

2015-10-21 Thread Lan Tianyu
After migration, Qemu needs to trigger a mailbox irq to notify the VF driver
in the guest about the status change. Irq delivery only resumes after the
CPU state is restored. This patch adds a new callback that runs after the
CPU state is restored, which provides a way to trigger the mailbox irq later.

Signed-off-by: Lan Tianyu 
---
 include/migration/vmstate.h |  2 ++
 migration/savevm.c  | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 0695d7c..dc681a6 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -56,6 +56,8 @@ typedef struct SaveVMHandlers {
 int (*save_live_setup)(QEMUFile *f, void *opaque);
 uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t 
max_size);
 
+/* This runs after restoring CPU related state */
+void (*post_load_state)(void *opaque);
 LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 9e0e286..48b6223 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -702,6 +702,20 @@ bool qemu_savevm_state_blocked(Error **errp)
 return false;
 }
 
+void qemu_savevm_post_load(void)
+{
+SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+if (!se->ops || !se->ops->post_load_state) {
+continue;
+}
+
+se->ops->post_load_state(se->opaque);
+}
+}
+
+
 void qemu_savevm_state_header(QEMUFile *f)
 {
 trace_savevm_state_header();
@@ -1140,6 +1154,7 @@ int qemu_loadvm_state(QEMUFile *f)
 }
 
 cpu_synchronize_all_post_init();
+qemu_savevm_post_load();
 
 ret = 0;
 
-- 
1.9.3
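
Editorial sketch: the callback walk that qemu_savevm_post_load() performs in
the patch above can be shown in isolation. This is a simplified illustration
under stated assumptions, not QEMU code: struct load_ops, struct state_entry,
run_post_load_hooks and bump_counter are hypothetical names, and a plain
singly linked list replaces QTAILQ. The point is the same as in the patch:
entries without ops, or without the optional hook, are skipped.

```c
#include <stddef.h>

/* Optional per-device hook, run once CPU state has been restored. */
struct load_ops {
    void (*post_load_state)(void *opaque);
};

struct state_entry {
    const struct load_ops *ops;   /* may be NULL */
    void *opaque;                 /* passed back to the hook unchanged */
    struct state_entry *next;
};

/* Call every registered hook; return how many hooks actually ran. */
static int run_post_load_hooks(struct state_entry *head)
{
    int called = 0;

    for (struct state_entry *se = head; se != NULL; se = se->next) {
        if (!se->ops || !se->ops->post_load_state)
            continue;             /* this entry did not register a hook */
        se->ops->post_load_state(se->opaque);
        called++;
    }
    return called;
}

/* Example hook: a device could raise its mailbox irq here instead. */
static void bump_counter(void *opaque)
{
    ++*(int *)opaque;
}
```

Making the hook optional keeps existing handlers untouched; only a device
that needs late notification, such as the SRIOV device in this series, has to
fill it in.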



[RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation

2015-10-21 Thread Lan Tianyu
Ring shifting while restoring VF function may race with normal ring
operation (transmitting/receiving packets). This patch adds tx/rx locks
to protect ring-related data.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  2 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 ---
 2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 6eab402e..3a748c8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -448,6 +448,8 @@ struct ixgbevf_adapter {
 
spinlock_t mbx_lock;
unsigned long last_reset;
+   spinlock_t mg_rx_lock;
+   spinlock_t mg_tx_lock;
 };
 
 enum ixbgevf_state_t {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 15ec361..04b6ce7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring 
*ring)
 
 int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
 {
+   struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
struct ixgbevf_tx_buffer *tx_buffer = NULL;
static union ixgbevf_desc *tx_desc = NULL;
+   unsigned long flags;
 
tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
if (!tx_buffer)
@@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
if (!tx_desc)
return -ENOMEM;
 
+   spin_lock_irqsave(&adapter->mg_tx_lock, flags);
memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
@@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
else
r->next_to_use += (r->count - head);
 
+   spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
+
vfree(tx_buffer);
vfree(tx_desc);
return 0;
@@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
 
 int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
 {
+   struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
struct ixgbevf_rx_buffer *rx_buffer = NULL;
static union ixgbevf_desc *rx_desc = NULL;
+   unsigned long flags;
 
rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
if (!rx_buffer)
@@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
if (!rx_desc)
return -ENOMEM;
 
+   spin_lock_irqsave(&adapter->mg_rx_lock, flags);
memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
memcpy(r->desc, &rx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
memcpy(&r->desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head);
@@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
r->next_to_use -= head;
else
r->next_to_use += (r->count - head);
+   spin_unlock_irqrestore(&adapter->mg_rx_lock, flags);
 
vfree(rx_buffer);
vfree(rx_desc);
@@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
if (test_bit(__IXGBEVF_DOWN, &adapter->state))
return true;
 
+   spin_lock(&adapter->mg_tx_lock);
+   i = tx_ring->next_to_clean;
tx_buffer = &tx_ring->tx_buffer_info[i];
tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
i -= tx_ring->count;
@@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
q_vector->tx.total_bytes += total_bytes;
q_vector->tx.total_packets += total_packets;
 
+   spin_unlock(&adapter->mg_tx_lock);
+
if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) {
struct ixgbe_hw *hw = &adapter->hw;
union ixgbe_adv_tx_desc *eop_desc;
@@ -999,10 +1012,12 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
struct ixgbevf_ring *rx_ring,
int budget)
 {
+   struct ixgbevf_adapter *adapter = netdev_priv(rx_ring->netdev);
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
u16 cleaned_count = ixgbevf_desc_unused(rx_ring);
struct sk_buff *skb = rx_ring->skb;
 
+   spin_lock(&adapter->mg_rx_lock);
while (likely(total_rx_packets < budget)) {
union ixgbe_adv_rx_desc *rx_desc;
 
@@ -1078,6 +1093,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
q_vector->rx.total_packets += total_rx_packets;
q_vector->rx.total_bytes += total_rx_bytes;
 
+   

[RFC Patch 12/12] IXGBEVF: Track dma dirty pages

2015-10-21 Thread Lan Tianyu
Migration relies on dirty page tracking to migrate memory.
Hardware can't automatically mark a page as dirty after a DMA
memory access. VF descriptor rings and data buffers are modified
by hardware when receiving and transmitting data. To track such dirty memory
manually, do dummy writes (read a byte and write it back) while receiving
and transmitting data.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index d22160f..ce7bd7a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -414,6 +414,9 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
break;
 
+   /* write back status to mark page dirty */
+   eop_desc->wb.status = eop_desc->wb.status;
+
/* clear next_to_watch to prevent false hangs */
tx_buffer->next_to_watch = NULL;
tx_buffer->desc_num = 0;
@@ -946,15 +949,17 @@ static struct sk_buff *ixgbevf_fetch_rx_buffer(struct 
ixgbevf_ring *rx_ring,
 {
struct ixgbevf_rx_buffer *rx_buffer;
struct page *page;
+   u8 *page_addr;
 
rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
page = rx_buffer->page;
prefetchw(page);
 
-   if (likely(!skb)) {
-   void *page_addr = page_address(page) +
- rx_buffer->page_offset;
+   /* Mark page dirty */
+   page_addr = page_address(page) + rx_buffer->page_offset;
+   *page_addr = *page_addr;
 
+   if (likely(!skb)) {
/* prefetch first cache line of first page */
prefetch(page_addr);
 #if L1_CACHE_BYTES < 128
@@ -1032,6 +1037,9 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector 
*q_vector,
if (!ixgbevf_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
break;
 
+   /* Write back status to mark page dirty */
+   rx_desc->wb.upper.status_error = rx_desc->wb.upper.status_error;
+
/* This memory barrier is needed to keep us from reading
 * any other fields out of the rx_desc until we know the
 * RXD_STAT_DD bit is set
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
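
Editorial sketch: the dummy-write technique from the commit message above
(read a byte and write it back so that CPU-side dirty logging sees the page)
can be illustrated in plain C. This is a sketch under stated assumptions, not
the driver's code: touch_byte and touch_range are hypothetical helpers, and
the volatile access is an addition of this sketch, chosen because a plain
self-assignment such as x = x may be removed by an optimizing compiler. In
kernel code the READ_ONCE/WRITE_ONCE macros would presumably serve the same
purpose.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one byte and store it back; volatile keeps both accesses in place. */
static inline void touch_byte(void *addr)
{
    volatile uint8_t *p = addr;

    *p = *p;
}

/*
 * Touch one byte in every page overlapped by [buf, buf + len).
 * Returns the number of pages touched. page_size must be non-zero.
 */
static size_t touch_range(void *buf, size_t len, size_t page_size)
{
    uintptr_t cur = (uintptr_t)buf;
    uintptr_t end = cur + len;
    size_t touched = 0;

    while (cur < end) {
        touch_byte((void *)cur);
        touched++;
        /* advance to the first byte of the next page */
        cur = (cur / page_size + 1) * page_size;
    }
    return touched;
}
```

The write leaves the data unchanged, so it is only useful as a signal to the
dirty-tracking mechanism. A device that writes the same byte between the read
and the write-back could still have its update overwritten, which is a
limitation of the technique itself.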



[RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver

2015-10-21 Thread Lan Tianyu
This patch restores VF status in the PF driver when it receives the
corresponding event from the VF.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h   |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++
 3 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 636f9e3..9d5669a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -148,6 +148,7 @@ struct vf_data_storage {
bool pf_set_mac;
u16 pf_vlan; /* When set, guest VLAN config not allowed. */
u16 pf_qos;
+   u32 vf_lpe;
u16 tx_rate;
u16 vlan_count;
u8 spoofchk_enabled;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
index b1e4703..8fdb38d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
@@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev {
 
 /* mailbox API, version 1.1 VF requests */
 #define IXGBE_VF_GET_QUEUES0x09 /* get queue configuration */
+#define IXGBE_VF_NOTIFY_RESUME0x0c /* VF notify PF migration finishing */
 
 /* GET_QUEUES return data indices within the mailbox */
 #define IXGBE_VF_TX_QUEUES 1   /* number of Tx queues supported */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 1d17b58..ab2a2e2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter 
*adapter, u32 vf,
}
 }
 
+/**
+ *  Restore the settings by mailbox, after migration
+ **/
+void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf)
+{
+   struct ixgbe_hw *hw = &adapter->hw;
+   u32 reg, reg_offset, vf_shift;
+   int rar_entry = hw->mac.num_rar_entries - (vf + 1);
+
+   vf_shift = vf % 32;
+   reg_offset = vf / 32;
+
+   /* enable transmit and receive for vf */
+   reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+   ixgbe_vf_reset_event(adapter, vf);
+
+   hw->mac.ops.set_rar(hw, rar_entry,
+   adapter->vfinfo[vf].vf_mac_addresses,
+   vf, IXGBE_RAH_AV);
+
+
+   if (adapter->vfinfo[vf].vf_lpe)
+   ixgbe_set_vf_lpe(adapter, &adapter->vfinfo[vf].vf_lpe, vf);
+}
+
 static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf)
 {
struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
@@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter 
*adapter, u32 vf)
break;
case IXGBE_VF_SET_LPE:
retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf);
+   adapter->vfinfo[vf].vf_lpe = *msgbuf;
break;
case IXGBE_VF_SET_MACVLAN:
retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf);
@@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter 
*adapter, u32 vf)
case IXGBE_VF_GET_RSS_KEY:
retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf);
break;
+   case IXGBE_VF_NOTIFY_RESUME:
+   ixgbe_restore_setting(adapter, vf);
+   break;
default:
e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]);
retval = IXGBE_ERR_MBX;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty



[RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver

2015-10-21 Thread Lan Tianyu
This patch adds a sysfs interface, state_in_pf, under the sysfs directory
of the VF PCI device, so that Qemu can get and put the VF status held in
the PF driver during migration.

Signed-off-by: Lan Tianyu 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 -
 1 file changed, 155 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index ab2a2e2..89671eb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter 
*adapter)
return -ENOMEM;
 }
 
+#define IXGBE_PCI_VFCOMMAND   0x4
+#define IXGBE_PCI_VFMSIXMC0x72
+#define IXGBE_SRIOV_VF_OFFSET 0x180
+#define IXGBE_SRIOV_VF_STRIDE 0x2
+
+#define to_adapter(dev) ((struct ixgbe_adapter 
*)(pci_get_drvdata(to_pci_dev(dev)->physfn)))
+
+struct state_in_pf {
+   u16 command;
+   u16 msix_message_control;
+   struct vf_data_storage vf_data;
+};
+
+static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn)
+{
+   u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * 
vfn;
+   return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff);
+}
+
+static ssize_t ixgbe_show_state_in_pf(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct pci_dev *pdev = adapter->pdev, *vdev;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   struct ixgbe_hw *hw = &adapter->hw;
+   struct state_in_pf *state = (struct state_in_pf *)buf;
+   int vfn = vf_pdev->virtfn_index;
+   u32 reg, reg_offset, vf_shift;
+
+   /* Clear VF mac and disable VF */
+   ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, 
vfn);
+
+   /* Record PCI configurations */
+   vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+   if (vdev) {
+   pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, &state->command);
+   pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, &state->msix_message_control);
+   }
+   else
+   printk(KERN_WARNING "Unable to find VF device.\n");
+
+   /* Record states hold by PF */
+   memcpy(&state->vf_data, &adapter->vfinfo[vfn], sizeof(struct vf_data_storage));
+
+   vf_shift = vfn % 32;
+   reg_offset = vfn / 32;
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+   return sizeof(struct state_in_pf);
+}
+
+static ssize_t ixgbe_store_state_in_pf(struct device *dev,
+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct pci_dev *pdev = adapter->pdev, *vdev;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   struct state_in_pf *state = (struct state_in_pf *)buf;
+   int vfn = vf_pdev->virtfn_index;
+
+   /* Check struct size */
+   if (count != sizeof(struct state_in_pf)) {
+   printk(KERN_ERR "State in PF size does not fit.\n");
+   goto out;
+   }
+
+   /* Restore PCI configurations */
+   vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+   if (vdev) {
+   pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, 
state->command);
+   pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, 
state->msix_message_control);
+   }
+
+   /* Restore states hold by PF */
+   memcpy(&adapter->vfinfo[vfn], &state->vf_data, sizeof(struct vf_data_storage));
+
+  out:
+   return count;
+}
+
+static struct device_attribute ixgbe_per_state_in_pf_attribute =
+   __ATTR(state_in_pf, S_IRUGO | S_IWUSR,
+   ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
+
+void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
+{
+   struct pci_dev *pdev = adapter->pdev;
+   struct pci_dev *vfdev;
+   unsigned short vf_id;
+   int pos, ret;
+
+   pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
+   if (!pos)
+   return;
+
+   /* get the device ID for the VF */
+   pci_read_config_word(pdev, pos + PCI_SRIOV_VF_DID, &vf_id);
+
+   vfdev = pci_get_device(pdev->vendor, vf_id, NULL);
+
+   while (vfdev) {
+   if (vfdev->is_virtfn) {
+   ret = device_create_file(&vfdev->dev,
+   &ixgbe_per_state_in_pf_attribute);
+   if (ret)
+   

Re: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally

2015-10-21 Thread Arnd Bergmann
On Wednesday 21 October 2015 15:58:44 Christoffer Dall wrote:
> On Wed, Oct 21, 2015 at 03:45:20PM +0200, Arnd Bergmann wrote:
> > On Tuesday 20 October 2015 15:51:05 Paolo Bonzini wrote:
> > > Should this be "select" or "depends on"? Not a blocker, can always be 
> > > fixed in 4.4.
> > 
> > We have lots of 'select ARM_GIC' in the tree for platforms that use one, 
> > using
> > 'depends on' will limit KVM support to being available only if at least one
> > of them is being used.
> > 
> > The only platform I can think of that uses ARMv7ve without actually having
> > a GIC is BCM2836 (Raspberry Pi 2). Can we actually run KVM on a platform
> > like that? If so, 'depends on' might be better, otherwise let's stay with
> > 'select'.
> 
> Yes you can, just without the VGIC and the timer - you have to emulate
> that in userspace.  Samsung also has a broken platform where they
> integrated things incorrectly, so you cannot use the VGIC, but that
> platform support is out of tree, so I can't see if it uses the GIC in
> general or not.

Ok, my patch should be fine then.
 
> I'm a bit confused why using 'depends on' in this case helps anything?
> 
> (I know, I suck at dealing with the config system)

Generally speaking, 'select' causes more problems than 'depends on',
in particular when you get conflicting requirements (A selects B,
B depends on C, but A can be enabled without C).

However, symbols that only have 'select' and no 'depends on', and also
are not user-visible, are not problematic. This is the case here.

Arnd


Re: sanitizing kvmtool

2015-10-21 Thread Sasha Levin
On 10/19/2015 11:15 AM, Dmitry Vyukov wrote:
> On Mon, Oct 19, 2015 at 5:08 PM, Sasha Levin  wrote:
>> > On 10/19/2015 10:47 AM, Dmitry Vyukov wrote:
 >>> Right, the memory areas that are accessed both by the hypervisor and 
 >>> the guest
> >>> > should be treated as untrusted input, but the hypervisor is 
> >>> > supposed to validate
> >>> > the input carefully before using it - so I'm not sure how data 
> >>> > races would
> >>> > introduce anything new that we didn't catch during validation.
>>> >>
>>> >> One possibility would be: if result of a racy read is passed to guest,
>>> >> that can leak arbitrary host data into guest. Does not sound good.
>>> >> Also, without usage of proper atomic operations, it is basically
>>> >> impossible to verify untrusted data, as it can be changing under your
>>> >> feet. And storing data into a local variable does not prevent the data
>>> >> from changing.
>> >
>> > What's missing here is that the guest doesn't directly read/write the 
>> > memory:
>> > every time it accesses a memory that is shared with the host it will 
>> > trigger
>> > an exit, which will stop the vcpu thread that made the access and kernel 
>> > side
>> > kvm will pass the hypervisor the value the guest wrote (or the memory 
>> > address
>> > it attempted to read). The value/address can't change under us in that 
>> > scenario.
> But still: if result of a racy read is passed to guest, that can leak
> arbitrary host data into guest.

I see what you're saying. I need to think about it a bit, maybe we do need
locking for each of the virtio devices we emulate.


On an unrelated note, a few of the reports are pointing to ioport__unregister():

==
WARNING: ThreadSanitizer: data race (pid=109228)
  Write of size 8 at 0x7d1cdf40 by main thread:
#0 free tsan/rtl/tsan_interceptors.cc:570 (lkvm+0x00443376)
#1 ioport__unregister ioport.c:138:2 (lkvm+0x004a9ff9)
#2 pci__exit pci.c:247:2 (lkvm+0x004ac857)
#3 init_list__exit util/init.c:59:8 (lkvm+0x004bca6e)
#4 kvm_cmd_run_exit builtin-run.c:645:2 (lkvm+0x004a68a7)
#5 kvm_cmd_run builtin-run.c:661 (lkvm+0x004a68a7)
#6 handle_command kvm-cmd.c:84:8 (lkvm+0x004bc40c)
#7 handle_kvm_command main.c:11:9 (lkvm+0x004ac0b4)
#8 main main.c:18 (lkvm+0x004ac0b4)

  Previous read of size 8 at 0x7d1cdf40 by thread T55:
#0 rb_int_search_single util/rbtree-interval.c:14:17 (lkvm+0x004bf968)
#1 ioport_search ioport.c:41:9 (lkvm+0x004aa05f)
#2 kvm__emulate_io ioport.c:186 (lkvm+0x004aa05f)
#3 kvm_cpu__emulate_io x86/include/kvm/kvm-cpu-arch.h:41:9 (lkvm+0x004aa718)
#4 kvm_cpu__start kvm-cpu.c:126 (lkvm+0x004aa718)
#5 kvm_cpu_thread builtin-run.c:174:6 (lkvm+0x004a6e3e)

  Thread T55 'kvm-vcpu-2' (tid=109285, finished) created by main thread at:
#0 pthread_create tsan/rtl/tsan_interceptors.cc:848 (lkvm+0x004478a3)
#1 kvm_cmd_run_work builtin-run.c:633:7 (lkvm+0x004a683f)
#2 kvm_cmd_run builtin-run.c:660 (lkvm+0x004a683f)
#3 handle_command kvm-cmd.c:84:8 (lkvm+0x004bc40c)
#4 handle_kvm_command main.c:11:9 (lkvm+0x004ac0b4)
#5 main main.c:18 (lkvm+0x004ac0b4)

SUMMARY: ThreadSanitizer: data race ioport.c:138:2 in ioport__unregister
==

I think this is because we don't perform locking using pthread, but rather pause
the vm entirely - so the cpu threads it's pointing to aren't actually running
when we unregister ioports. Is there a way to annotate that for tsan?
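
If tsan's acquire/release annotations turn out to be the right tool, I'd
imagine the pause handoff looking roughly like the sketch below. This is not
lkvm code: pause_token, vcpu_mark_paused() and main_after_pause() are made-up
names, and I am assuming the __tsan_release()/__tsan_acquire() entry points
from <sanitizer/tsan_interface.h>. The macros compile to nothing when tsan is
not enabled.

```c
/* Sketch only: hypothetical names, not actual lkvm code. */
#ifndef __has_feature
#define __has_feature(x) 0
#endif

#if defined(__SANITIZE_THREAD__) || __has_feature(thread_sanitizer)
#include <sanitizer/tsan_interface.h>
#define TSAN_RELEASE(addr) __tsan_release(addr)
#define TSAN_ACQUIRE(addr) __tsan_acquire(addr)
#else
#define TSAN_RELEASE(addr) ((void)(addr))
#define TSAN_ACQUIRE(addr) ((void)(addr))
#endif

/* Address used only as a synchronization token for the annotations. */
static int pause_token;

/* Called by a vcpu thread once it has stopped touching shared state. */
static void vcpu_mark_paused(void)
{
	TSAN_RELEASE(&pause_token);
}

/* Called by the main thread after it knows all vcpus are paused. */
static void main_after_pause(void)
{
	TSAN_ACQUIRE(&pause_token);
}
```

The idea would be that everything a vcpu thread did before
vcpu_mark_paused() is ordered before whatever the main thread does after
main_after_pause(), which is the guarantee the pause already gives us.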


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Difference between vcpu_load and kvm_sched_in ?

2015-10-21 Thread Yacine HEBBAL
Paolo Bonzini  redhat.com> writes:

> 
> 
> On 21/10/2015 12:17, Hebbal Yacine wrote:
> > Thanks for the explanation, it's very clear.
> > I tried that but I didn't succeed in sending the ioctl from the "run_on_cpu"
> > function, I didn't find how to set the right CPUState
> > I've tried "current_cpu"
> 
> Current_cpu is always NULL outside the VCPU thread.
> 
> > 
> > kvm_main.c:
> > 
> > // yacine.begin
> > 
> > static void do_vmi_start_kvm_ioctl(void *type) {
> > printf("do_vmi_start_kvm_ioctl\n");
> > kvm_vm_ioctl(kvm_state, type);


//yacine.begin
int hmp_vmi_op_result = 0;

static void do_vmi_kvm_ioctl(void *type_ioctl) {
int* type = (int*) type_ioctl;
hmp_vmi_op_result = kvm_vcpu_ioctl(current_cpu, *type);
//hmp_vmi_start_result = kvm_vm_ioctl(kvm_state, *type);
}

int vmi_kvm_ioctl(int type) {
CPUState* cpu;

CPU_FOREACH(cpu) {
run_on_cpu(cpu, do_vmi_kvm_ioctl, &type);
}
return hmp_vmi_op_result;
}
//yacine.end

Yes, it works perfectly this way even when running multiple VCPUs, thank you
a lot :)
In fact, I was using an old version of qemu (1.5.x), and it doesn't have
CPU_FOREACH. I searched a little for something to replace it, but without any
luck. So I upgraded my working version and now everything is cool.

> Are you sure you want a VM ioctl and not a VCPU ioctl?  Or perhaps a VM
> ioctl to do generic processing, and a VCPU ioctl that is then sent to
> all VCPUs?
>If you use a VCPU ioctl, you can use CPU_FOREACH or a for loop to
> iterate over all VCPUs.

In fact, I get the same result when using vm_ioctl or vcpu_ioctl.
If I correctly understood your last paragraph, it is better to use vm_ioctl
to do generic processing that doesn't rely on a given VCPU and hence I won't
need to use "CPU_FOREACH, run_on_cpu and current_cpu".

Thanks again :)

> 
> Paolo


Re: [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

Add "virtfn_index" member in the struct pci_device to record VF sequence
of PF. This will be used in the VF sysfs node handle.

Signed-off-by: Lan Tianyu 
---
  drivers/pci/iov.c   | 1 +
  include/linux/pci.h | 1 +
  2 files changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..065b6bb 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
virtfn->physfn = pci_dev_get(dev);
virtfn->is_virtfn = 1;
virtfn->multifunction = 0;
+   virtfn->virtfn_index = id;
  
  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {

res = &dev->resource[i + PCI_IOV_RESOURCES];
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..85c5531 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -356,6 +356,7 @@ struct pci_dev {
unsigned intio_window_1k:1; /* Intel P2P bridge 1K I/O windows */
unsigned intirq_managed:1;
pci_dev_flags_t dev_flags;
+   unsigned intvirtfn_index;
atomic_tenable_cnt; /* pci_enable_device has been called */
  
  	u32		saved_config_space[16]; /* config space saved at suspend time */




Can't you just calculate the VF index based on the VF BDF number 
combined with the information in the PF BDF number and VF 
offset/stride?  Seems kind of pointless to add a variable that is only 
used by one driver and is in a slowpath when you can just calculate it 
pretty quickly.
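
Something along these lines is what I have in mind. It is only a sketch:
vf_index_from_rid() is a made-up helper, and it assumes the usual SR-IOV
layout where 0-based VF i sits at PF routing ID + First VF Offset +
VF Stride * i, with the routing ID being (bus << 8) | devfn.

```c
#include <stdint.h>

/*
 * Recover a 0-based VF index from routing IDs ((bus << 8) | devfn).
 * Assumes VF i lives at pf_rid + offset + stride * i.
 * Returns -1 if vf_rid does not correspond to a VF of this PF.
 */
static int vf_index_from_rid(uint16_t pf_rid, uint16_t vf_rid,
			     uint16_t offset, uint16_t stride)
{
	int delta = (int)vf_rid - (int)pf_rid - (int)offset;

	if (delta < 0)
		return -1;
	if (stride == 0)
		return delta == 0 ? 0 : -1;	/* only a single VF is possible */
	if (delta % stride)
		return -1;			/* not on a VF boundary */
	return delta / stride;
}
```

With a PF at routing ID 0x0100, an offset of 0x180 and a stride of 2, the VF
at routing ID 0x0286 comes out as index 3.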


- Alex


Re: [RFC PATCH 0/3] Qemu/IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Alex Williamson
On Thu, 2015-10-22 at 00:52 +0800, Lan Tianyu wrote:
> This patchset is Qemu part for live migration support for SRIOV NIC.
> kernel part patch information is in the following link.
> http://marc.info/?l=kvm&m=144544635330193&w=2
> 
> 
> Lan Tianyu (3):
>   Qemu: Add pci-assign.h to share functions and struct definition with
> new file
>   Qemu: Add post_load_state() to run after restoring CPU state
>   Qemu: Introduce pci-sriov device type to support VF live migration
> 
>  hw/i386/kvm/Makefile.objs   |   2 +-
>  hw/i386/kvm/pci-assign.c| 113 +--
>  hw/i386/kvm/pci-assign.h| 109 +++
>  hw/i386/kvm/sriov.c | 213 
> 
>  include/migration/vmstate.h |   2 +
>  migration/savevm.c  |  15 
>  6 files changed, 344 insertions(+), 110 deletions(-)
>  create mode 100644 hw/i386/kvm/pci-assign.h
>  create mode 100644 hw/i386/kvm/sriov.c
> 

Hi Lan,

Seems like there are a couple immediate problems with this approach.
The first is that you're modifying legacy KVM device assignment, which
is deprecated upstream and not even enabled by some distros.  VFIO is
the supported mechanism for doing PCI device assignment now and any
features like this need to be added there first.  It's not only more
secure than legacy KVM device assignment, but it also doesn't limit this
to an x86-only solution.  Surely you want to support 82599 VF migration
on other platforms as well.

Using sysfs to interact with the PF is also problematic since that means
that libvirt needs to grant qemu access to these files, adding one more
layer to the stack.  If we were to use VFIO, we could potentially enable
this through a save-state region on the device file descriptor and if
necessary, virtual interrupt channels for the device as well.  This of
course implies that the kernel internal channels are made as general as
possible in order to support any PF driver.

That said, there are some nice features here.  Using unused PCI config
bytes to communicate with the guest driver and enable guest-based page
dirtying is a nice hack.  However, if we want to add this capability to
other devices, we're not always going to be able to use fixed addresses
0xf0 and 0xf1.  I would suggest that we probably want to create a
virtual capability in the config space of the VF, perhaps a Vendor
Specific capability.  Obviously some devices won't have room for a full
capability in the standard config space, so we may need to optionally
expose it in extended config space.  Those devices would be limited to
only supporting migration in PCI-e configurations in the guest.  Also,
plenty of devices make use of undefined PCI config space, so we may not
be able to simply add a capability to a region we think is unused, maybe
it needs to happen through reserved space in another capability or
perhaps defining a virtual BAR that unenlightened guest drivers would
ignore.  The point is that we somehow need to standardize that, rather
than implicitly knowing that it's at 0xf0/0xf1 on 82599 VFs.
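
To make the discovery side concrete, here is roughly how a guest driver could
locate such a capability instead of hardcoding an offset. It is a sketch that
walks a 256-byte copy of standard config space; find_vendor_cap() is a made-up
name, while the register offsets and the vendor-specific capability ID (0x09)
are the standard PCI ones.

```c
#include <stdint.h>

#define PCI_STATUS		0x06	/* low byte of the 16-bit status register */
#define PCI_STATUS_CAP_LIST	0x10	/* device implements a capability list */
#define PCI_CAPABILITY_LIST	0x34	/* offset of the first capability pointer */
#define PCI_CAP_ID_VNDR		0x09	/* vendor-specific capability */

/*
 * Walk the capability list in a copy of standard config space and return
 * the offset of the first vendor-specific capability, or 0 if there is none.
 */
static uint8_t find_vendor_cap(const uint8_t cfg[256])
{
	uint8_t pos;
	int ttl = 48;	/* bound the walk in case the list loops */

	if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
		return 0;

	pos = (uint8_t)(cfg[PCI_CAPABILITY_LIST] & 0xfc);
	while (pos >= 0x40 && ttl--) {
		if (cfg[pos] == PCI_CAP_ID_VNDR)
			return pos;
		pos = (uint8_t)(cfg[pos + 1] & 0xfc);	/* next pointer */
	}
	return 0;
}
```

A device without room in standard config space would need the equivalent walk
over the extended capability list, which this sketch does not cover.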

Also, I haven't looked at the kernel-side patches yet, but the saved
state received from and loaded into the PF driver needs to be versioned
and maybe we need some way to know whether versions are compatible.
Migration version information is difficult enough for QEMU, it's a
completely foreign concept in the kernel.  Thanks,

Alex



Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Or Gerlitz
On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu  wrote:
> This patchset is to propose a new solution to add live migration support
> for 82599 SRIOV network card.

> In our solution, we prefer to put all device specific operation into VF and
> PF driver and make code in the Qemu more general.

[...]

> Service down time test
> So far, we tested migration between two laptops with 82599 nic which
> are connected to a gigabit switch. Ping VF in the 0.001s interval
> during migration on the host of source side. Its service down
> time is about 180ms.

So... what would you expect service down wise for the following
solution which is zero touch and I think should work for any VF
driver:

on host A: unplug the VM and conduct live migration to host B ala the
no-SRIOV case.

on host B:

when the VM "gets back to live", probe a VF there with the same assigned mac

next, udev on the VM will call the VF driver to create netdev instance

DHCP client would run to get the same IP address

+ under config directive (or from Qemu) send Gratuitous ARP to notify
the switch/es on the new location for that mac.

Or.


Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Alex Williamson
On Wed, 2015-10-21 at 21:45 +0300, Or Gerlitz wrote:
> On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu  wrote:
> > This patchset is to propose a new solution to add live migration support
> > for 82599 SRIOV network card.
> 
> > In our solution, we prefer to put all device specific operation into VF and
> > PF driver and make code in the Qemu more general.
> 
> [...]
> 
> > Service down time test
> > So far, we tested migration between two laptops with 82599 nic which
> > are connected to a gigabit switch. Ping VF in the 0.001s interval
> > during migration on the host of source side. Its service down
> > time is about 180ms.
> 
> So... what would you expect service down wise for the following
> solution which is zero touch and I think should work for any VF
> driver:
> 
> on host A: unplug the VM and conduct live migration to host B ala the
> no-SRIOV case.

The trouble here is that the VF needs to be unplugged prior to the start
of migration because we can't do effective dirty page tracking while the
device is connected and doing DMA.  So the downtime, assuming we're
counting only VF connectivity, is dependent on memory size, rate of
dirtying, and network bandwidth; seconds for small guests, minutes or
more (maybe much, much more) for large guests.
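
As a back-of-the-envelope illustration of that dependency, here is a very
simplified pre-copy model. It is my own approximation, not anything QEMU
computes: it ignores the final stop-and-copy phase, compression, and any
throttling, and precopy_seconds() is a made-up name.

```c
/*
 * Rough pre-copy model: memory "mem" is sent at bandwidth "bw" while the
 * guest dirties memory at "dirty_rate" (same units per second).  Each pass
 * resends what was dirtied during the previous one, so the total time is
 * the geometric sum mem/bw * (1 + d/bw + (d/bw)^2 + ...) = mem / (bw - d)
 * for d < bw.  Returns a negative value when the copy cannot converge.
 */
static double precopy_seconds(double mem, double bw, double dirty_rate)
{
	if (bw <= 0.0 || dirty_rate >= bw)
		return -1.0;
	return mem / (bw - dirty_rate);
}
```

Even with no dirtying at all, 4 GiB over an ideal 125 MB/s gigabit link is
already about 34 seconds without the VF, and the time grows without bound as
the dirty rate approaches the link bandwidth.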

This is why the typical VF agnostic approach here is to use bonding
and fail over to an emulated device during migration, so performance
suffers, but downtime is something acceptable.

If we want the ability to defer the VF unplug until just before the
final stages of the migration, we need the VF to participate in dirty
page tracking.  Here it's done via an enlightened guest driver.  Alex
Graf presented a solution using a device specific enlightenment in QEMU.
Otherwise we'd need hardware support from the IOMMU.  Thanks,

Alex

> on host B:
> 
> when the VM "gets back to live", probe a VF there with the same assigned mac
> 
> next, udev on the VM will call the VF driver to create netdev instance
> 
> DHCP client would run to get the same IP address
> 
> + under config directive (or from Qemu) send Gratuitous ARP to notify
> the switch/es on the new location for that mac.
> 
> Or.


Re: [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

This patch is to add sysfs interface state_in_pf under sysfs directory
of VF PCI device for Qemu to get and put VF status in the PF driver during
migration.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 -
  1 file changed, 155 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index ab2a2e2..89671eb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter)
return -ENOMEM;
  }
  
+#define IXGBE_PCI_VFCOMMAND   0x4

+#define IXGBE_PCI_VFMSIXMC0x72
+#define IXGBE_SRIOV_VF_OFFSET 0x180
+#define IXGBE_SRIOV_VF_STRIDE 0x2
+
+#define to_adapter(dev) ((struct ixgbe_adapter *)(pci_get_drvdata(to_pci_dev(dev)->physfn)))
+
+struct state_in_pf {
+   u16 command;
+   u16 msix_message_control;
+   struct vf_data_storage vf_data;
+};
+
+static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn)
+{
+   u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * vfn;
+   return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff);
+}
+
+static ssize_t ixgbe_show_state_in_pf(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct pci_dev *pdev = adapter->pdev, *vdev;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   struct ixgbe_hw *hw = &adapter->hw;
+   struct state_in_pf *state = (struct state_in_pf *)buf;
+   int vfn = vf_pdev->virtfn_index;
+   u32 reg, reg_offset, vf_shift;
+
+   /* Clear VF mac and disable VF */
+   ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, vfn);
+
+   /* Record PCI configurations */
+   vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+   if (vdev) {
+   pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, &state->command);
+   pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, &state->msix_message_control);
+   }
+   else
+   printk(KERN_WARNING "Unable to find VF device.\n");
+


Formatting for the if/else is incorrect.  The else condition should be 
in brackets as well.



+   /* Record states hold by PF */
+   memcpy(&state->vf_data, &adapter->vfinfo[vfn], sizeof(struct vf_data_storage));
+
+   vf_shift = vfn % 32;
+   reg_offset = vfn / 32;
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+   reg &= ~(1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+   return sizeof(struct state_in_pf);
+}
+


This is a read.  Why does it need to switch off the VF?  Also why turn
off the anti-spoof?  It doesn't make much sense.



+static ssize_t ixgbe_store_state_in_pf(struct device *dev,
+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct pci_dev *pdev = adapter->pdev, *vdev;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   struct state_in_pf *state = (struct state_in_pf *)buf;
+   int vfn = vf_pdev->virtfn_index;
+
+   /* Check struct size */
+   if (count != sizeof(struct state_in_pf)) {
+   printk(KERN_ERR "State in PF size does not fit.\n");
+   goto out;
+   }
+
+   /* Restore PCI configurations */
+   vdev = ixgbe_get_virtfn_dev(pdev, vfn);
+   if (vdev) {
+   pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, state->command);
+   pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, state->msix_message_control);
+   }
+
+   /* Restore states hold by PF */
+   memcpy(&adapter->vfinfo[vfn], &state->vf_data, sizeof(struct vf_data_storage));
+
+  out:
+   return count;
+}


Just doing a memcpy to move the vfinfo over adds no value.  The fact is 
there are a number of filters that have to be configured in hardware 
after, and it isn't as simple as just migrating the values stored.  As I 
mentioned in the case of the 82598 there is also jumbo frames to take 
into account.  If the first PF didn't have it enabled, but the second 
one does that implies the state of the VF needs to change to account for 
that.


I really think you would be better off only migrating the data related 
to what can be configured using the ip link command and leaving other 
values such as clear_to_send at the reset value of 0. Then 

Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

This patch is to add new sysfs interface of "notify_vf" under sysfs
directory of VF PCI device for Qemu to notify VF when migration status
is changed.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 30 ++
  drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |  4 
  2 files changed, 34 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index e247d67..5cc7817 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -217,10 +217,37 @@ static ssize_t ixgbe_store_state_in_pf(struct device *dev,
return count;
  }
  
+static ssize_t ixgbe_store_notify_vf(struct device *dev,

+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct ixgbe_adapter *adapter = to_adapter(dev);
+   struct ixgbe_hw *hw = &adapter->hw;
+   struct pci_dev *vf_pdev = to_pci_dev(dev);
+   int vfn = vf_pdev->virtfn_index;
+   u32 ivar;
+
+   /* Enable VF mailbox irq first */
+   IXGBE_WRITE_REG(hw, IXGBE_PVTEIMS(vfn), 0x4);
+   IXGBE_WRITE_REG(hw, IXGBE_PVTEIAM(vfn), 0x4);
+   IXGBE_WRITE_REG(hw, IXGBE_PVTEIAC(vfn), 0x4);
+
+   ivar = IXGBE_READ_REG(hw, IXGBE_PVTIVAR_MISC(vfn));
+   ivar &= ~0xFF;
+   ivar |= 0x2 | IXGBE_IVAR_ALLOC_VAL;
+   IXGBE_WRITE_REG(hw, IXGBE_PVTIVAR_MISC(vfn), ivar);
+
+   ixgbe_ping_vf(adapter, vfn);
+   return count;
+}
+


NAK, this won't fly.  You can't just go in from the PF and enable 
interrupts on the VF hoping they are configured well enough to handle an 
interrupt you decide to trigger from them.


Also have you even considered the MSI-X configuration on the VF?  I 
haven't seen anything anywhere that would have migrated the VF's MSI-X 
configuration from BAR 3 on one system to the new system.



  static struct device_attribute ixgbe_per_state_in_pf_attribute =
__ATTR(state_in_pf, S_IRUGO | S_IWUSR,
ixgbe_show_state_in_pf, ixgbe_store_state_in_pf);
  
+static struct device_attribute ixgbe_per_notify_vf_attribute =

+   __ATTR(notify_vf, S_IWUSR, NULL, ixgbe_store_notify_vf);
+
  void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
  {
struct pci_dev *pdev = adapter->pdev;
@@ -241,6 +268,8 @@ void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter)
if (vfdev->is_virtfn) {
ret = device_create_file(&vfdev->dev,
&ixgbe_per_state_in_pf_attribute);
+   ret |= device_create_file(&vfdev->dev,
+   &ixgbe_per_notify_vf_attribute);
if (ret)
pr_warn("Unable to add VF attribute for dev %s,\n",
dev_name(&vfdev->dev));
@@ -269,6 +298,7 @@ void ixgbe_remove_vf_attrib(struct ixgbe_adapter *adapter)
while (vfdev) {
if (vfdev->is_virtfn) {
device_remove_file(&vfdev->dev, &ixgbe_per_state_in_pf_attribute);
+   device_remove_file(&vfdev->dev, &ixgbe_per_notify_vf_attribute);
}
  
  		vfdev = pci_get_device(pdev->vendor, vf_id, vfdev);


More driver specific sysfs.  This needs to be moved out of the driver if 
this is to be considered anything more than a proof of concept.



diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index dd6ba59..c6ddb66 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -2302,6 +2302,10 @@ enum {
  #define IXGBE_PVFTDT(P)   (0x06018 + (0x40 * (P)))
  #define IXGBE_PVFTDWBAL(P)(0x06038 + (0x40 * (P)))
  #define IXGBE_PVFTDWBAH(P)(0x0603C + (0x40 * (P)))
+#define IXGBE_PVTEIMS(P)   (0x00D00 + (4 * (P)))
+#define IXGBE_PVTIVAR_MISC(P)  (0x04E00 + (4 * (P)))
+#define IXGBE_PVTEIAC(P)   (0x00F00 + (4 * P))
+#define IXGBE_PVTEIAM(P)   (0x04D00 + (4 * P))
  
  #define IXGBE_PVFTDWBALn(q_per_pool, vf_number, vf_q_index) \

(IXGBE_PVFTDWBAL((q_per_pool)*(vf_number) + (vf_q_index)))




Re: [RFC Patch 06/12] IXGBEVF: Add self emulation layer

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

In order to restore VF function after migration, add self emulation layer
to record regs' values during accessing regs.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbevf/Makefile|  3 ++-
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  |  2 +-
  .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++
  drivers/net/ethernet/intel/ixgbevf/vf.h|  5 -
  4 files changed, 33 insertions(+), 3 deletions(-)
  create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c

diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile b/drivers/net/ethernet/intel/ixgbevf/Makefile
index 4ce4c97..841c884 100644
--- a/drivers/net/ethernet/intel/ixgbevf/Makefile
+++ b/drivers/net/ethernet/intel/ixgbevf/Makefile
@@ -31,7 +31,8 @@
  
  obj-$(CONFIG_IXGBEVF) += ixgbevf.o
  
-ixgbevf-objs := vf.o \

+ixgbevf-objs := self-emulation.o \
+   vf.o \
  mbx.o \
  ethtool.o \
  ixgbevf_main.o
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index a16d267..4446916 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg)
  
  	if (IXGBE_REMOVED(reg_addr))

return IXGBE_FAILED_READ_REG;
-   value = readl(reg_addr + reg);
+   value = ixgbe_self_emul_readl(reg_addr, reg);
if (unlikely(value == IXGBE_FAILED_READ_REG))
ixgbevf_check_remove(hw, reg);
return value;
diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
new file mode 100644
index 000..d74b2da
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c
@@ -0,0 +1,26 @@
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "vf.h"
+#include "ixgbevf.h"
+
+static u32 hw_regs[0x4000];
+
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr)
+{
+   u32 tmp;
+
+   tmp = readl(base + addr);
+   hw_regs[(unsigned long)addr] = tmp;
+
+   return tmp;
+}
+
+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr)
+{
+   hw_regs[(unsigned long)addr] = val;
+   writel(val, (volatile void __iomem *)(base + addr));
+}


So I see what you are doing, however I don't think this adds much 
value.  Many of the key registers for the device are not simple 
Read/Write registers.  Most of them are things like write 1 to clear or 
some other sort of value where writing doesn't set the bit but has some 
other side effect.  Just take a look through the Datasheet at registers 
such as the VFCTRL, VFMAILBOX, or most of the interrupt registers.  The 
fact is simply storing the values off doesn't give you any real idea of
what the state of things is.
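
To illustrate with a toy model (not ixgbevf code, and struct w1c_reg and its
helpers are made-up names): for a write-1-to-clear register, the value the
driver wrote and the value the hardware holds afterwards have almost nothing
to do with each other.

```c
#include <stdint.h>

/* Toy model of a write-1-to-clear (W1C) register. */
struct w1c_reg {
	uint32_t hw;		/* what the device actually holds */
	uint32_t shadow;	/* what a "record every write" layer would store */
};

/* Device side: an event sets status bits. */
static void w1c_hw_event(struct w1c_reg *r, uint32_t bits)
{
	r->hw |= bits;
}

/* Driver write: a 1 clears the matching bit, a 0 leaves it alone. */
static void w1c_write(struct w1c_reg *r, uint32_t val)
{
	r->shadow = val;	/* naive shadowing keeps the written value */
	r->hw &= ~val;		/* the hardware clears those bits instead */
}
```

If the device raises bits 0 and 2 and the driver then writes 0x1 to
acknowledge bit 0, the hardware holds 0x4 while the shadow copy says 0x1.
Replaying the shadow value on the destination would clear bit 0 again rather
than restore anything.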



diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h
index d40f036..6a3f4eb 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -39,6 +39,9 @@
  
  struct ixgbe_hw;
  
+u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr);

+void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32  addr);
+
  /* iterator type for walking multicast address lists */
  typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr,
  u32 *vmdq);
@@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 reg, u32 value)
  
  	if (IXGBE_REMOVED(reg_addr))

return;
-   writel(value, reg_addr + reg);
+   ixgbe_self_emul_writel(value, reg_addr, reg);
  }
  
  #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v)




Re: [RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

This patch is to restore VF status in the PF driver when get event
from VF.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbe/ixgbe.h   |  1 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h   |  1 +
  drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++
  3 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 636f9e3..9d5669a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -148,6 +148,7 @@ struct vf_data_storage {
bool pf_set_mac;
u16 pf_vlan; /* When set, guest VLAN config not allowed. */
u16 pf_qos;
+   u32 vf_lpe;
u16 tx_rate;
u16 vlan_count;
u8 spoofchk_enabled;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
index b1e4703..8fdb38d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
@@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev {

  /* mailbox API, version 1.1 VF requests */
  #define IXGBE_VF_GET_QUEUES   0x09 /* get queue configuration */
+#define IXGBE_VF_NOTIFY_RESUME0x0c /* VF notify PF migration finishing */

  /* GET_QUEUES return data indices within the mailbox */
  #define IXGBE_VF_TX_QUEUES1   /* number of Tx queues supported */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 1d17b58..ab2a2e2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter *adapter, u32 vf,
}
  }

+/**
+ *  Restore the settings by mailbox, after migration
+ **/
+void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf)
+{
+   struct ixgbe_hw *hw = &adapter->hw;
+   u32 reg, reg_offset, vf_shift;
+   int rar_entry = hw->mac.num_rar_entries - (vf + 1);
+
+   vf_shift = vf % 32;
+   reg_offset = vf / 32;
+
+   /* enable transmit and receive for vf */
+   reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+   reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+


This is just blanket enabling Rx and Tx.  I don't see how this can be 
valid.  It seems like it would result in memory corruption for the guest 
if you are enabling Rx on a device that is not ready.  A perfect example 
is if the guest is not configured to handle jumbo frames and the PF has 
jumbo frames enabled.



+   reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+   reg |= (1 << vf_shift);
+   IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);


This assumes that the anti-spoof is enabled.  That may not be the case.


+   ixgbe_vf_reset_event(adapter, vf);
+
+   hw->mac.ops.set_rar(hw, rar_entry,
+   adapter->vfinfo[vf].vf_mac_addresses,
+   vf, IXGBE_RAH_AV);
+
+
+   if (adapter->vfinfo[vf].vf_lpe)
+   ixgbe_set_vf_lpe(adapter, &adapter->vfinfo[vf].vf_lpe, vf);
+}
+


The function ixgbe_set_vf_lpe also enables the receive, you should take
a look at it.  For 82598 you cannot just arbitrarily enable the Rx as 
there is a risk of corrupting guest memory or causing a kernel panic.



  static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf)
  {
struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
@@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf)
break;
case IXGBE_VF_SET_LPE:
retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf);
+   adapter->vfinfo[vf].vf_lpe = *msgbuf;
break;


Why not just leave this for the VF to notify us of via a reset?  It
seems like if the VF is migrated it should start with the cts bits of
the mailbox cleared as though the PF driver has been reloaded.



case IXGBE_VF_SET_MACVLAN:
retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf);
@@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf)
case IXGBE_VF_GET_RSS_KEY:
retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf);
break;
+   case IXGBE_VF_NOTIFY_RESUME:
+   ixgbe_restore_setting(adapter, vf);
+   break;
default:
e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]);
retval = IXGBE_ERR_MBX;



I really don't think the VF should be sending us a message telling us to 
restore settings.  Why not just use the existing messages?


The VF as it is now can survive a suspend/resume 

[GIT PULL 0/5] perf/core improvements and fixes

2015-10-21 Thread Arnaldo Carvalho de Melo
Hi Ingo,

Please consider pulling,

- Arnaldo

The following changes since commit 43e41adc9e8c36545888d78fed2ef8d102a938dc:

  perf record: Add ability to sample call branches (2015-10-20 10:30:55 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo

for you to fetch changes up to e3d006ce8180a0c025ce66bdc89bbc125f85be57:

  perf annotate: Add debug message for out of bounds sample (2015-10-21 18:12:37 -0300)


perf/core improvements and fixes:

User visible:

- Print branch filter state with verbose mode (Andi Kleen)

- Fix core dump caused by per-socket/core system-wide stat (Kan Liang)

- Update libtraceevent KVM plugin (Paolo Bonzini)

Developer stuff:

- Add fixdep to 'tools/build' .gitignore (Yunlong Song)

Signed-off-by: Arnaldo Carvalho de Melo 


Andi Kleen (1):
  perf evsel: Print branch filter state with -vv

Arnaldo Carvalho de Melo (1):
  perf annotate: Add debug message for out of bounds sample

Kan Liang (1):
  perf cpu_map: Fix core dump caused by per-socket/core system-wide stat

Paolo Bonzini (1):
  tools lib traceevent: update KVM plugin

Yunlong Song (1):
  perf build: Add fixdep to .gitignore

 tools/build/.gitignore|  1 +
 tools/lib/traceevent/plugin_kvm.c | 25 +
 tools/perf/util/annotate.c|  5 -
 tools/perf/util/cpumap.c  |  2 +-
 tools/perf/util/evsel.c   |  1 +
 5 files changed, 24 insertions(+), 10 deletions(-)
 create mode 100644 tools/build/.gitignore
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/5] tools lib traceevent: update KVM plugin

2015-10-21 Thread Arnaldo Carvalho de Melo
From: Paolo Bonzini 

The format of the role word has changed through the years and the plugin
was never updated; some VMX exit reasons were missing too.

Signed-off-by: Paolo Bonzini 
Acked-by: Steven Rostedt 
Cc: David Ahern 
Cc: Namhyung Kim 
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/1443695293-31127-1-git-send-email-pbonz...@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo 
---
 tools/lib/traceevent/plugin_kvm.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/tools/lib/traceevent/plugin_kvm.c 
b/tools/lib/traceevent/plugin_kvm.c
index 88fe83dff7cd..18536f756577 100644
--- a/tools/lib/traceevent/plugin_kvm.c
+++ b/tools/lib/traceevent/plugin_kvm.c
@@ -124,7 +124,10 @@ static const char *disassemble(unsigned char *insn, int 
len, uint64_t rip,
_ER(WBINVD,  54)\
_ER(XSETBV,  55)\
_ER(APIC_WRITE,  56)\
-   _ER(INVPCID, 58)
+   _ER(INVPCID, 58)\
+   _ER(PML_FULL,62)\
+   _ER(XSAVES,  63)\
+   _ER(XRSTORS, 64)
 
 #define SVM_EXIT_REASONS \
_ER(EXIT_READ_CR0,  0x000)  \
@@ -352,15 +355,18 @@ static int kvm_nested_vmexit_handler(struct trace_seq *s, 
struct pevent_record *
 union kvm_mmu_page_role {
unsigned word;
struct {
-   unsigned glevels:4;
unsigned level:4;
+   unsigned cr4_pae:1;
unsigned quadrant:2;
-   unsigned pad_for_nice_hex_output:6;
unsigned direct:1;
unsigned access:3;
unsigned invalid:1;
-   unsigned cr4_pge:1;
unsigned nxe:1;
+   unsigned cr0_wp:1;
+   unsigned smep_and_not_wp:1;
+   unsigned smap_and_not_wp:1;
+   unsigned pad_for_nice_hex_output:8;
+   unsigned smm:8;
};
 };
 
@@ -385,15 +391,18 @@ static int kvm_mmu_print_role(struct trace_seq *s, struct 
pevent_record *record,
if (pevent_is_file_bigendian(event->pevent) ==
pevent_is_host_bigendian(event->pevent)) {
 
-   trace_seq_printf(s, "%u/%u q%u%s %s%s %spge %snxe",
+   trace_seq_printf(s, "%u q%u%s %s%s %spae %snxe %swp%s%s%s",
 role.level,
-role.glevels,
 role.quadrant,
 role.direct ? " direct" : "",
 access_str[role.access],
 role.invalid ? " invalid" : "",
-role.cr4_pge ? "" : "!",
-role.nxe ? "" : "!");
+role.cr4_pae ? "" : "!",
+role.nxe ? "" : "!",
+role.cr0_wp ? "" : "!",
+role.smep_and_not_wp ? " smep" : "",
+role.smap_and_not_wp ? " smap" : "",
+role.smm ? " smm" : "");
} else
trace_seq_printf(s, "WORD: %08x", role.word);
 
-- 
2.1.0



Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

When transmitting a packet, the packet's end transmit descriptor
indicates whether the packet has already been sent. The current code
records the end descriptor's pointer in next_to_watch of the tx buffer
struct. That pointer becomes invalid if the descriptor ring is shifted
after migration. This patch replaces the recorded pointer with the
packet's descriptor count and finds the end descriptor from the first
descriptor and that count.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  1 +
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 ---
  2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..c823616 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -54,6 +54,7 @@
   */
  struct ixgbevf_tx_buffer {
union ixgbe_adv_tx_desc *next_to_watch;
+   u16 desc_num;
unsigned long time_stamp;
struct sk_buff *skb;
unsigned int bytecount;


So if you can't use next_to_watch why is it left in here?  Also you 
might want to take a look at moving desc_num to a different spot in the 
buffer as you are leaving a 6 byte hole in the descriptor.
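
As a rough illustration of the padding point above, here is a minimal user-space sketch. The struct names and member lists are hypothetical stand-ins rather than the driver's real definitions, and the 6-byte figure assumes a typical 64-bit (LP64) layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: a 2-byte field placed right after a pointer. */
struct buf_with_hole {
    void *next_to_watch;       /* 8 bytes on LP64 */
    uint16_t desc_num;         /* 2 bytes, followed by padding */
    unsigned long time_stamp;  /* needs 8-byte alignment on LP64 */
    void *skb;
    unsigned int bytecount;
    uint16_t gso_segs;
};

/* Same members, with desc_num moved next to the other small field. */
struct buf_packed_better {
    void *next_to_watch;
    unsigned long time_stamp;
    void *skb;
    unsigned int bytecount;
    uint16_t gso_segs;
    uint16_t desc_num;
};

/* Bytes of padding between desc_num and the member that follows it. */
static size_t hole_after_desc_num(void)
{
    size_t next = offsetof(struct buf_with_hole, time_stamp);
    size_t end = offsetof(struct buf_with_hole, desc_num) + sizeof(uint16_t);

    assert(next >= end); /* members never overlap */
    return next - end;
}
```

Comparing sizeof() of the two structs shows whether the reordering pays off on a given ABI; in the kernel tree, a tool such as pahole reports the same information for the real struct.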



diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 4446916..056841c 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct 
ixgbevf_ring *tx_ring,
   DMA_TO_DEVICE);
}
tx_buffer->next_to_watch = NULL;
+   tx_buffer->desc_num = 0;
tx_buffer->skb = NULL;
dma_unmap_len_set(tx_buffer, len, 0);


This opens up a race condition.  If you have a descriptor ready to be 
cleaned at offset 0 what is to prevent you from just running through the 
ring?  You likely need to find a descriptor number that cannot be valid 
to use here.



/* tx_buffer must be completely set up in the transmit path */
@@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
union ixgbe_adv_tx_desc *tx_desc;
unsigned int total_bytes = 0, total_packets = 0;
unsigned int budget = tx_ring->count / 2;
-   unsigned int i = tx_ring->next_to_clean;
+   int i, watch_index;
  


Where is i being initialized?  It was here but you removed it.  Are you 
using i without initializing it?



if (test_bit(__IXGBEVF_DOWN, &adapter->state))
return true;
@@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
i -= tx_ring->count;
  
  	do {

-   union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
+   union ixgbe_adv_tx_desc *eop_desc;
+
+   if (!tx_buffer->desc_num)
+   break;
+
+   if (i + tx_buffer->desc_num >= 0)
+   watch_index = i + tx_buffer->desc_num;
+   else
+   watch_index = i + tx_ring->count + tx_buffer->desc_num;
  
-		/* if next_to_watch is not set then there is no work pending */

+   eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index);
if (!eop_desc)
break;
  


So I don't see how this isn't triggering Tx hangs.  I suspect for the 
simple ping case desc_num will often be 0.  The fact is there are many 
cases where first and tx_buffer_info are the same descriptor.



@@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
  
  		/* clear next_to_watch to prevent false hangs */

tx_buffer->next_to_watch = NULL;
+   tx_buffer->desc_num = 0;
  
  		/* update the statistics for this packet */

total_bytes += tx_buffer->bytecount;


You cannot use 0 because 0 is a valid number.  You are using it as a 
look-ahead currently and there are cases where i is the eop_desc index.
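
To make the objection concrete, here is a small sketch of the wrap-around distance calculation and of a sentinel that cannot collide with a real distance. The ring size, the sentinel name, and the helper names are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define RING_COUNT 512u           /* hypothetical ring size */
#define DESC_NUM_NONE UINT16_MAX  /* hypothetical sentinel, never a valid distance */

/* Distance from the first descriptor of a packet to its end-of-packet
 * descriptor, wrapping around the ring. A packet that fits in a single
 * descriptor yields 0, which is why 0 cannot mean "nothing pending". */
static uint16_t ring_distance(uint16_t first, uint16_t last, uint16_t count)
{
    assert(first < count && last < count);
    return (uint16_t)(last >= first ? last - first : last + count - first);
}

/* Index of the end-of-packet descriptor, given the first index and the
 * stored distance. */
static uint16_t ring_advance(uint16_t first, uint16_t distance, uint16_t count)
{
    assert(first < count && distance < count);
    return (uint16_t)((first + distance) % count);
}
```

Because every valid distance is strictly smaller than the ring size, any value at or above the ring size (here UINT16_MAX) is safe to use as the "not set" marker.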



@@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
u32 tx_flags = first->tx_flags;
__le32 cmd_type;
u16 i = tx_ring->next_to_use;
+   u16 start;
  
  	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
  
@@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
  
  	/* set next_to_watch value indicating a packet is present */

first->next_to_watch = tx_desc;
+   start = first - tx_ring->tx_buffer_info;
+   first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - 
start;
  
  	i++;

if (i == tx_ring->count)


start and i could be the same value.  If you look at ixgbevf_tx_map you 
should find that if the packet is contained in a single buffer then the 
first and last descriptor in your 

Re: [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

To let the VF driver in the guest know the migration status, Qemu will
fake PCI config regs 0xF0 and 0xF1 to show the migration status and
get an ack from the VF driver.

When migration starts, Qemu will set reg "0xF0" to 1, notify the
VF driver by triggering a mailbox msg and wait for the VF driver to tell
it that it's ready for migration (set reg "0xF1" to 1). After migration, Qemu
will set reg "0xF0" to 0 and notify the VF driver by mailbox irq. The VF
driver begins to restore tx/rx function after detecting the status change.

When the VF receives the mailbox irq, it will check reg "0xF0" in the service
task function to get the migration status and perform the related operations
according to its value.

Steps for restarting the receive and transmit functions:
1) Restore VF status in the PF driver by sending a mailbox event to the PF driver
2) Write back reg values recorded by the self-emulation layer
3) Restart the rx/tx rings
4) Restore interrupts

The transmit/receive descriptor head regs are read-only, so they can't
be restored by writing back the recorded reg values directly, and they
are set to 0 during VF reset. To reuse the original tx/rx rings, shift
the desc ring so that the desc pointed to by the original head reg moves
to the first entry of the ring, and then enable the tx/rx rings. The VF
resumes receiving and transmitting from the original head desc.
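
The ring shift described above amounts to rotating the ring left by the old head index. A minimal user-space sketch follows, assuming a flat array of fixed-size entries; the function names are hypothetical and malloc stands in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Rotate a ring of `count` fixed-size entries left so that the entry at
 * index `head` ends up at index 0. Returns 0 on success, -1 if the
 * temporary copy cannot be allocated. */
static int ring_rotate_to_head(void *ring, size_t entry_size,
                               size_t count, size_t head)
{
    unsigned char *base = ring;
    unsigned char *tmp;

    assert(head < count);
    if (head == 0)
        return 0;

    tmp = malloc(entry_size * count);
    if (!tmp)
        return -1;

    memcpy(tmp, base, entry_size * count);
    /* entries [head, count) move to the front */
    memcpy(base, tmp + head * entry_size, (count - head) * entry_size);
    /* entries [0, head) follow them */
    memcpy(base + (count - head) * entry_size, tmp, head * entry_size);

    free(tmp);
    return 0;
}

/* Translate an old index (for example next_to_use) into the rotated ring. */
static size_t ring_rotate_index(size_t idx, size_t count, size_t head)
{
    assert(idx < count && head < count);
    return idx >= head ? idx - head : idx + count - head;
}
```

Any software index into the ring has to be translated with the same offset, otherwise it points at a different descriptor after the shift.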

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbevf/defines.h   |   6 ++
  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h   |   7 +-
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 115 -
  .../net/ethernet/intel/ixgbevf/self-emulation.c| 107 +++
  4 files changed, 232 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h 
b/drivers/net/ethernet/intel/ixgbevf/defines.h
index 770e21a..113efd2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/defines.h
+++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
@@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc {
__le32 mss_l4len_idx;
  };

+union ixgbevf_desc {
+   union ixgbe_adv_tx_desc rx_desc;
+   union ixgbe_adv_rx_desc tx_desc;
+   struct ixgbe_adv_tx_context_desc tx_context_desc;
+};
+
  /* Adv Transmit Descriptor Config Masks */
  #define IXGBE_ADVTXD_DTYP_MASK  0x00F00000 /* DTYP mask */
  #define IXGBE_ADVTXD_DTYP_CTXT  0x00200000 /* Advanced Context Desc */
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index c823616..6eab402e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -109,7 +109,7 @@ struct ixgbevf_ring {
struct ixgbevf_ring *next;
struct net_device *netdev;
struct device *dev;
-   void *desc; /* descriptor ring memory */
+   union ixgbevf_desc *desc;   /* descriptor ring memory */
dma_addr_t dma; /* phys. address of descriptor ring */
unsigned int size;  /* length in bytes */
u16 count;  /* amount of descriptors */
@@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector 
*q_vector);

  void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
  void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter);
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head);
+int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head);
+void ixgbevf_restore_state(struct ixgbevf_adapter *adapter);
+inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter);
+

  #ifdef DEBUG
  char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw);
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 056841c..15ec361 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function Network 
Driver");
  MODULE_LICENSE("GPL");
  MODULE_VERSION(DRV_VERSION);

+
+#define MIGRATION_COMPLETED   0x00
+#define MIGRATION_IN_PROGRESS 0x01
+
  #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
  static int debug = -1;
  module_param(debug, int, 0);
@@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring 
*ring)
return ring->stats.packets;
  }

+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
+{
+   struct ixgbevf_tx_buffer *tx_buffer = NULL;
+   static union ixgbevf_desc *tx_desc = NULL;
+
+   tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
+   if (!tx_buffer)
+   return -ENOMEM;
+
+   tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count);
+   if (!tx_desc)
+   return -ENOMEM;
+
+   memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
+   memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
+   

Re: [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation

2015-10-21 Thread Alexander Duyck

On 10/21/2015 09:37 AM, Lan Tianyu wrote:

Ring shifting while restoring the VF function may race with the original
ring operations (transmitting/receiving packets). This patch adds tx/rx
locks to protect ring-related data.

Signed-off-by: Lan Tianyu 
---
  drivers/net/ethernet/intel/ixgbevf/ixgbevf.h  |  2 ++
  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 ---
  2 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 6eab402e..3a748c8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -448,6 +448,8 @@ struct ixgbevf_adapter {

spinlock_t mbx_lock;
unsigned long last_reset;
+   spinlock_t mg_rx_lock;
+   spinlock_t mg_tx_lock;
  };



Really, a shared lock for all of the Rx or Tx rings?  This is going to 
kill any chance at performance.  Especially since just recently the VFs 
got support for RSS.


To top it off it also means we cannot clean Tx while adding new buffers 
which will kill Tx performance.


The other concern I have is what is supposed to prevent the hardware 
from accessing the rings while you are reading?  I suspect nothing so I 
don't see how this helps anything.


I would honestly say you are better off just giving up on all of the 
data stored in the descriptor rings rather than trying to restore them. 
 Yes you are going to lose a few packets but you don't have the risk 
for races that this code introduces.



  enum ixbgevf_state_t {
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 15ec361..04b6ce7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring 
*ring)

  int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
  {
+   struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
struct ixgbevf_tx_buffer *tx_buffer = NULL;
static union ixgbevf_desc *tx_desc = NULL;
+   unsigned long flags;

tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
if (!tx_buffer)
@@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
if (!tx_desc)
return -ENOMEM;

+   spin_lock_irqsave(&adapter->mg_tx_lock, flags);
memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
@@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
else
r->next_to_use += (r->count - head);

+   spin_unlock_irqrestore(&adapter->mg_tx_lock, flags);
+
vfree(tx_buffer);
vfree(tx_desc);
return 0;
@@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)

  int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
  {
+   struct ixgbevf_adapter *adapter = netdev_priv(r->netdev);
struct ixgbevf_rx_buffer *rx_buffer = NULL;
static union ixgbevf_desc *rx_desc = NULL;
+   unsigned long flags;

rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count));
if (!rx_buffer)
@@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
if (!rx_desc)
return -ENOMEM;

+   spin_lock_irqsave(&adapter->mg_rx_lock, flags);
memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count));
memcpy(r->desc, &rx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
memcpy(&r->desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head);
@@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head)
r->next_to_use -= head;
else
r->next_to_use += (r->count - head);
+   spin_unlock_irqrestore(&adapter->mg_rx_lock, flags);

vfree(rx_buffer);
vfree(rx_desc);
@@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
if (test_bit(__IXGBEVF_DOWN, &adapter->state))
return true;

+   spin_lock(&adapter->mg_tx_lock);
+   i = tx_ring->next_to_clean;
tx_buffer = &tx_ring->tx_buffer_info[i];
tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
i -= tx_ring->count;
@@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector 
*q_vector,
q_vector->tx.total_bytes += total_bytes;
q_vector->tx.total_packets += total_packets;

+   spin_unlock(&adapter->mg_tx_lock);
+
if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) {
struct ixgbe_hw *hw = &adapter->hw;
union ixgbe_adv_tx_desc *eop_desc;
@@ -999,10 +1012,12 @@ static int 

Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC

2015-10-21 Thread Alexander Duyck

On 10/21/2015 12:20 PM, Alex Williamson wrote:

On Wed, 2015-10-21 at 21:45 +0300, Or Gerlitz wrote:

On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu  wrote:

This patchset is to propose a new solution to add live migration support
for 82599 SRIOV network card.



In our solution, we prefer to put all device specific operation into VF and
PF driver and make code in the Qemu more general.


[...]


Service down time test
So far, we tested migration between two laptops with 82599 NICs which
are connected to a gigabit switch. We pinged the VF at a 0.001s interval
from the source-side host during migration. The service downtime
is about 180ms.


So... what would you expect service down wise for the following
solution which is zero touch and I think should work for any VF
driver:

on host A: unplug the VM and conduct live migration to host B ala the
no-SRIOV case.


The trouble here is that the VF needs to be unplugged prior to the start
of migration because we can't do effective dirty page tracking while the
device is connected and doing DMA.  So the downtime, assuming we're
counting only VF connectivity, is dependent on memory size, rate of
dirtying, and network bandwidth; seconds for small guests, minutes or
more (maybe much, much more) for large guests.


The question of dirty page tracking though should be pretty simple.  We 
start the Tx packets out as dirty so we don't need to add anything 
there.  It seems like the Rx data and Tx/Rx descriptor rings are the issue.



This is why the typical VF agnostic approach here is to use bonding
and fail over to an emulated device during migration, so performance
suffers, but downtime is something acceptable.

If we want the ability to defer the VF unplug until just before the
final stages of the migration, we need the VF to participate in dirty
page tracking.  Here it's done via an enlightened guest driver.  Alex
Graf presented a solution using a device specific enlightenment in QEMU.
Otherwise we'd need hardware support from the IOMMU.


My only real complaint with this patch series is that it seems like 
there was too much focus on instrumenting the driver instead of providing 
the code necessary to enable a driver ecosystem that enables migration.


I don't know if what we need is a full hardware IOMMU.  It seems like a 
good way to take care of the need to flag dirty pages for DMA capable 
devices would be to add functionality to the dma_map_ops calls 
sync_{sg|single}for_cpu and unmap_{page|sg} so that they would take care 
of mapping the pages as dirty for us when needed.  We could probably 
make do with just a few tweaks to existing API in order to make this work.
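
A user-space sketch of that idea is below: a dirty-page bitmap that a DMA unmap or sync-for-CPU path could update once the device has finished writing. The bitmap layout, page size, and every function name here are assumptions for illustration; none of this is an existing kernel interface.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAGE_SHIFT 12u  /* 4 KiB pages, assumed */
#define TRACKED_PAGES 1024u    /* hypothetical guest size: 4 MiB */

struct dirty_log {
    uint8_t bits[TRACKED_PAGES / 8];
};

static void dirty_log_init(struct dirty_log *log)
{
    memset(log->bits, 0, sizeof(log->bits));
}

/* Mark every page touched by [addr, addr + len) as dirty. */
static void dirty_log_mark(struct dirty_log *log, uint64_t addr, uint64_t len)
{
    uint64_t first, last, pfn;

    if (len == 0)
        return;
    first = addr >> SKETCH_PAGE_SHIFT;
    last = (addr + len - 1) >> SKETCH_PAGE_SHIFT;
    assert(last < TRACKED_PAGES);

    for (pfn = first; pfn <= last; pfn++)
        log->bits[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

static int dirty_log_test(const struct dirty_log *log, uint64_t pfn)
{
    assert(pfn < TRACKED_PAGES);
    return (log->bits[pfn / 8] >> (pfn % 8)) & 1;
}

/* Hypothetical hook: what an unmap or sync-for-CPU path could call after
 * the device has written into a receive buffer. */
static void sketch_dma_unmap_from_device(struct dirty_log *log,
                                         uint64_t dma_addr, uint64_t len)
{
    dirty_log_mark(log, dma_addr, len);
}
```

The point of hooking the unmap and sync paths is that the driver already calls them exactly when ownership of a buffer returns to the CPU, so no per-driver instrumentation would be needed.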


As far as the descriptor rings I would argue they are invalid as soon as 
we migrate.  The problem is there is no way to guarantee ordering as we 
cannot pre-emptively mark an Rx data buffer as being a dirty page when 
we haven't even looked at the Rx descriptor for the given buffer yet. 
Tx has similar issues as we cannot guarantee the Tx will disable itself 
after a complete frame.  As such I would say the moment we migrate we 
should just give up on the frames that are still in the descriptor 
rings, drop them, and then start over with fresh rings.


- Alex


Clarification on KVM + vhost-net

2015-10-21 Thread Roland Dreier
Hi,

I would like to invoke QEMU and KVM so that the guest sees a virtio
NIC, and that NIC goes through a SR-IOV VF of a host NIC as directly
and efficiently as possible.  But I don't actually want to pass the VF
through to the guest.  I've found a bunch of discussion and confusing
examples on the web, but I'm not able to figure out what the right
thing to do with modern QEMU is.

I don't think I want to create a macvtap interface attached to the VF,
because I just want to use one MAC address for the VF itself (and
allow the NIC anti-spoofing hardware to work etc).  Am I supposed to
create a raw socket bound to the interface I want to use in a helper,
and then pass that to qemu?  How exactly do I pass that in — do I
still use "-net tap"?  Do I have to create my own vhostfd in my helper
too?

Thanks!
  Roland


Re: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally

2015-10-21 Thread Christoffer Dall
On Tue, Oct 20, 2015 at 03:51:05PM -0400, Paolo Bonzini wrote:
> Should this be "select" or "depends on"? Not a blocker, can always be fixed 
> in 4.4.
> 
Hmm, I don't know actually.  I trusted Arnd to make the right call and
given Marc's ack as well, I didn't pay too much attention to that
particular detail.

Arnd, any comments?

Thanks,
-Christoffer

> 
> 
> -----Original Message-----
> From: Christoffer Dall [christoffer.d...@linaro.org]
> Received: Tuesday, 20 Oct 2015, 18:18
> To: Paolo Bonzini [pbonz...@redhat.com]; kvm...@lists.cs.columbia.edu, 
> kvm@vger.kernel.org, linux-arm-ker...@lists.infradead.org
> CC: Marc Zyngier [marc.zyng...@arm.com]; Arnd Bergmann [a...@arndb.de]; 
> Christoffer Dall [christoffer.d...@linaro.org]
> Subject: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally
> 
> From: Arnd Bergmann 
> 
> The vgic code on ARM is built for all configurations that enable KVM,
> but the parent_data field that it references is only present when
> CONFIG_IRQ_DOMAIN_HIERARCHY is set:
> 
> virt/kvm/arm/vgic.c: In function 'kvm_vgic_map_phys_irq':
> virt/kvm/arm/vgic.c:1781:13: error: 'struct irq_data' has no member named 
> 'parent_data'
> 
> This flag is implied by the GIC driver, and indeed the VGIC code only
> makes sense if a GIC is present. This changes the CONFIG_KVM symbol
> to always select GIC, which avoids the issue.
> 
> Fixes: 662d9715840 ("arm/arm64: KVM: Kill CONFIG_KVM_ARM_{VGIC,TIMER}")
> Signed-off-by: Arnd Bergmann 
> Acked-by: Marc Zyngier 
> Signed-off-by: Christoffer Dall 
> ---
>  arch/arm/kvm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
> index 210ecca..356970f 100644
> --- a/arch/arm/kvm/Kconfig
> +++ b/arch/arm/kvm/Kconfig
> @@ -21,6 +21,7 @@ config KVM
>   depends on MMU && OF
>   select PREEMPT_NOTIFIERS
>   select ANON_INODES
> + select ARM_GIC
>   select HAVE_KVM_CPU_RELAX_INTERCEPT
>   select HAVE_KVM_ARCH_TLB_FLUSH_ALL
>   select KVM_MMIO
> -- 
> 2.1.2.330.g565301e.dirty
> 


Re: [PATCH v3 07/20] KVM: ARM64: PMU: Add perf event map and introduce perf event creating function

2015-10-21 Thread Shannon Zhao


On 2015/10/16 14:08, Wei Huang wrote:
>> +/**
>> > + * kvm_pmu_get_counter_value - get PMU counter value
>> > + * @vcpu: The vcpu pointer
>> > + * @select_idx: The counter index
>> > + */
>> > +unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32 
>> > select_idx)
>> > +{
>> > +  u64 enabled, running;
>> > +  struct kvm_pmu *pmu = &vcpu->arch.pmu;
>> > +  struct kvm_pmc *pmc = &pmu->pmc[select_idx];
>> > +  u64 counter;
>> > +
>> > +  if (!vcpu_mode_is_32bit(vcpu))
>> > +  counter = vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + select_idx);
> The select_idx is from PMSELR_EL0. According to PMUv3 spec, PMSELR_EL0
> is the register that "selects the current event counter PMEVCNTR or
> the cycle counter, CCNT". The code here always reads the counter value
> from PMEVCNTR. It doesn't read the value from cycle counter when
> select_idx=0b11111. We might waste some perf counter resources here.
> 
No, it does read the value from the cycle counter. When
select_idx=0b11111, PMEVCNTR0_EL0 + select_idx = PMCCNTR_EL0 (see patch
03/20).
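
The index arithmetic can be sketched as follows. The enum below is a hypothetical register layout, not the one from patch 03/20; it only assumes 32 counter slots with the cycle counter in the last one.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX_COUNTERS 32u                      /* assumed counter slots */
#define SKETCH_CYCLE_IDX (SKETCH_MAX_COUNTERS - 1u)  /* 0b11111 */

/* Hypothetical register file: 31 event counters followed by the cycle
 * counter, so that base + 31 lands on the cycle counter slot. */
enum sketch_reg {
    SK_PMEVCNTR0 = 0,
    SK_PMCCNTR = SK_PMEVCNTR0 + SKETCH_CYCLE_IDX,
    SK_NR_REGS
};

/* Map a counter select index to its slot in the register file. */
static enum sketch_reg sketch_counter_reg(uint32_t select_idx)
{
    assert(select_idx < SKETCH_MAX_COUNTERS);
    return (enum sketch_reg)(SK_PMEVCNTR0 + select_idx);
}

static int sketch_is_cycle_counter(uint32_t select_idx)
{
    return select_idx == SKETCH_CYCLE_IDX;
}
```

With such a layout no special case is needed in the read path: the same base-plus-index expression reaches both the event counters and the cycle counter.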

-- 
Shannon



Re: Difference between vcpu_load and kvm_sched_in ?

2015-10-21 Thread Paolo Bonzini


On 21/10/2015 00:57, Wanpeng Li wrote:
>> kvm_sched_out and kvm_sched_in are part of KVM's preemption hooks.  The
>> hooks are registered only between vcpu_load and vcpu_put, therefore they
>> know that the mutex is taken.  The sequence will go like this:
>>
>>  vcpu_load
>>  kvm_sched_out
>>  kvm_sched_in
>>  kvm_sched_out
>>  kvm_sched_in
>>  ...
>>  vcpu_put
> 
> If this should be:
> 
> vcpu_load
> kvm_sched_in
> kvm_sched_out
> kvm_sched_in
> kvm_sched_out
> ...
> vcpu_put

No, because vcpu_load is called while the thread is running.  Therefore,
the first preempt notifier call will be a sched_out notification, which
calls kvm_arch_vcpu_put.  Extending the picture above:

  vcpu_load-> kvm_arch_vcpu_load
  kvm_sched_out-> kvm_arch_vcpu_put
  kvm_sched_in -> kvm_arch_vcpu_load
  kvm_sched_out-> kvm_arch_vcpu_put
  kvm_sched_in -> kvm_arch_vcpu_load
  ...
  kvm_sched_out-> kvm_arch_vcpu_put
  kvm_sched_in -> kvm_arch_vcpu_load
  vcpu_put -> kvm_arch_vcpu_put
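
The same sequence can be modelled in a few lines of user-space C. This is only a toy state machine with hypothetical names, not KVM's real structures; the asserts encode the invariant that arch load and put calls strictly alternate.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one vCPU thread. */
struct vcpu_model {
    bool notifier_registered; /* true between vcpu_load and vcpu_put */
    bool arch_loaded;         /* true while arch state is on a physical CPU */
    int arch_loads;
    int arch_puts;
};

static void arch_load(struct vcpu_model *v)
{
    assert(!v->arch_loaded);
    v->arch_loaded = true;
    v->arch_loads++;
}

static void arch_put(struct vcpu_model *v)
{
    assert(v->arch_loaded);
    v->arch_loaded = false;
    v->arch_puts++;
}

/* Called by the running thread, so arch state is loaded immediately. */
static void model_vcpu_load(struct vcpu_model *v)
{
    v->notifier_registered = true;
    arch_load(v);
}

static void model_vcpu_put(struct vcpu_model *v)
{
    arch_put(v);
    v->notifier_registered = false;
}

/* The scheduler only calls these while the notifier is registered. */
static void model_sched_out(struct vcpu_model *v)
{
    assert(v->notifier_registered);
    arch_put(v);
}

static void model_sched_in(struct vcpu_model *v)
{
    assert(v->notifier_registered);
    arch_load(v);
}
```

Calling model_sched_in directly after model_vcpu_load would trip the assert in arch_load, which is the reason the first notification has to be a sched_out.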

Thanks,

Paolo


Re: [PATCH v3 10/20] KVM: ARM64: Add reset and access handlers for PMCCNTR register

2015-10-21 Thread Shannon Zhao


On 2015/10/16 23:06, Wei Huang wrote:
> 
> 
> On 09/24/2015 05:31 PM, Shannon Zhao wrote:
>> Since the reset value of PMCCNTR is UNKNOWN, use reset_unknown for its
>> reset handler. Add a new case to emulate reading to PMCCNTR register.
>>
>> Signed-off-by: Shannon Zhao 
>> ---
>>  arch/arm64/kvm/sys_regs.c | 17 +++--
>>  1 file changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index e7f6058..c38c2de 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -518,6 +518,12 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
>>  }
>>  } else {
>>  switch (r->reg) {
>> +case PMCCNTR_EL0: {
>> +val = kvm_pmu_get_counter_value(vcpu,
>> +ARMV8_MAX_COUNTERS - 1);
>> +*vcpu_reg(vcpu, p->Rt) = val;
>> +break;
>> +}
>>  case PMXEVCNTR_EL0: {
>>  val = kvm_pmu_get_counter_value(vcpu,
>>  vcpu_sys_reg(vcpu, PMSELR_EL0));
>> @@ -748,7 +754,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>>access_pmu_regs, reset_pmceid, PMCEID1_EL0 },
>>  /* PMCCNTR_EL0 */
>>  { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b000),
>> -  trap_raz_wi },
>> +  access_pmu_regs, reset_unknown, PMCCNTR_EL0 },
>>  /* PMXEVTYPER_EL0 */
>>  { Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b001),
>>access_pmu_regs, reset_unknown, PMXEVTYPER_EL0 },
>> @@ -997,6 +1003,12 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
>>  }
>>  } else {
>>  switch (r->reg) {
>> +case c9_PMCCNTR: {
>> +val = kvm_pmu_get_counter_value(vcpu,
>> +ARMV8_MAX_COUNTERS - 1);
> 
> PMCCNTR is for cycle counter. There is a filter register, PMCCFILTR_EL0,
> associated with it. When kvm_pmu_set_counter_event_type() is called, I
> didn't see this filter config been used in perf_event_attr when
> perf_event is created.

The spec says of PMXEVTYPER_EL0: "When PMSELR_EL0.SEL
selects the cycle counter, this accesses PMCCFILTR_EL0." So within
kvm_pmu_set_counter_event_type, I configure the perf_event_attr based on
the bits of PMXEVTYPER_EL0 and only handle bit P for EL1 and bit U for
EL0, since the KVM guest doesn't see EL2 and EL3.

See patch 07/20 :
+   attr.exclude_user = data & ARMV8_EXCLUDE_EL0 ? 1 : 0;
+   attr.exclude_kernel = data & ARMV8_EXCLUDE_EL1 ? 1 : 0;
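
A standalone sketch of that decoding is shown below. The bit positions follow the ARMv8 event type register layout (P in bit 31 filters EL1, U in bit 30 filters EL0); the macro, struct, and function names are local to this sketch and are not the ones used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_EXCLUDE_EL1 (UINT32_C(1) << 31) /* P bit: do not count at EL1 */
#define SKETCH_EXCLUDE_EL0 (UINT32_C(1) << 30) /* U bit: do not count at EL0 */

struct sketch_filter {
    bool exclude_user;
    bool exclude_kernel;
};

/* Build an event type value from an event number and the two filters. */
static uint32_t sketch_encode(uint32_t event, bool excl_user, bool excl_kernel)
{
    /* the event number must not spill into the filter bits */
    assert((event & (SKETCH_EXCLUDE_EL0 | SKETCH_EXCLUDE_EL1)) == 0);
    return event
         | (excl_user ? SKETCH_EXCLUDE_EL0 : 0)
         | (excl_kernel ? SKETCH_EXCLUDE_EL1 : 0);
}

/* Extract the filters, mirroring the two attr assignments quoted above. */
static struct sketch_filter sketch_decode_filter(uint32_t evtyper)
{
    struct sketch_filter f;

    f.exclude_user = (evtyper & SKETCH_EXCLUDE_EL0) != 0;
    f.exclude_kernel = (evtyper & SKETCH_EXCLUDE_EL1) != 0;
    return f;
}
```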


-- 
Shannon



Re: [PATCH v3 15/20] KVM: ARM64: Add reset and access handlers for PMSWINC register

2015-10-21 Thread Shannon Zhao


On 2015/10/16 23:25, Wei Huang wrote:
>>  /**
>> > + * kvm_pmu_software_increment - do software increment
>> > + * @vcpu: The vcpu pointer
>> > + * @val: the value guest writes to PMSWINC register
>> > + */
>> > +void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u32 val)
>> > +{
>> > +  int i;
>> > +  u32 type, enable;
>> > +
>> > +  for (i = 0; i < 32; i++) {
>> > +  if ((val >> i) & 0x1) {
>> > +  if (!vcpu_mode_is_32bit(vcpu)) {
>> > +  type = vcpu_sys_reg(vcpu, PMEVTYPER0_EL0 + i)
>> > + & ARMV8_EVTYPE_EVENT;
>> > +  enable = vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
>> > +  if ((type == 0) && ((enable >> i) & 0x1))
>> > +  vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i)++;
> Most parts make sense here. I just wonder about the case of counter
> overflow here. Should we trigger an interrupt and set Overflow Flag
> status register when SW increment overflows here? I didn't find anything
> in ARM document.
> 
I didn't find anything either. But since SW increment uses PMEVCNTR_EL0 to
count, it should behave the same as other events: trigger an interrupt and
set the Overflow Flag status register.
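
A sketch of what that could look like, assuming 32-bit event counters and one overflow flag per counter; the struct and function names are hypothetical, and whether an interrupt is actually raised would additionally depend on the interrupt enable state, which is not modelled here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_COUNTERS 31u

struct sketch_pmu {
    uint32_t counter[SKETCH_NR_COUNTERS]; /* 32-bit event counters */
    uint32_t evtype[SKETCH_NR_COUNTERS];  /* event number; 0 = software increment */
    uint32_t enabled;                     /* one enable bit per counter */
    uint32_t overflow;                    /* one overflow flag per counter */
};

/* Apply a write of `val` to the software-increment register.
 * Returns nonzero if any counter wrapped, i.e. an interrupt may be due. */
static int sketch_sw_increment(struct sketch_pmu *pmu, uint32_t val)
{
    int wrapped = 0;
    unsigned int i;

    assert(pmu != NULL);
    for (i = 0; i < SKETCH_NR_COUNTERS; i++) {
        if (!((val >> i) & 1u))
            continue;
        if (pmu->evtype[i] != 0 || !((pmu->enabled >> i) & 1u))
            continue;
        if (++pmu->counter[i] == 0) { /* wrapped from 0xffffffff */
            pmu->overflow |= UINT32_C(1) << i;
            wrapped = 1;
        }
    }
    return wrapped;
}
```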

I will add this in the next version of the patch. Thanks.

-- 
Shannon



Re: Difference between vcpu_load and kvm_sched_in ?

2015-10-21 Thread Yacine HEBBAL
Paolo Bonzini writes:

> 
> 
> On 21/10/2015 00:57, Wanpeng Li wrote:
> >> kvm_sched_out and kvm_sched_in are part of KVM's preemption hooks.  The
> >> hooks are registered only between vcpu_load and vcpu_put, therefore they
> >> know that the mutex is taken.  The sequence will go like this:
> >>
> >>  vcpu_load
> >>  kvm_sched_out
> >>  kvm_sched_in
> >>  kvm_sched_out
> >>  kvm_sched_in
> >>  ...
> >>  vcpu_put
> > 
> > If this should be:
> > 
> > vcpu_load
> > kvm_sched_in
> > kvm_sched_out
> > kvm_sched_in
> > kvm_sched_out
> > ...
> > vcpu_put
> 
> No, because vcpu_load is called while the thread is running.  Therefore,
> the first preempt notifier call will be a sched_out notification, which
> calls kvm_arch_vcpu_put.  Extending the picture above:
> 
>   vcpu_load-> kvm_arch_vcpu_load
>   kvm_sched_out-> kvm_arch_vcpu_put
>   kvm_sched_in -> kvm_arch_vcpu_load
>   kvm_sched_out-> kvm_arch_vcpu_put
>   kvm_sched_in -> kvm_arch_vcpu_load
>   ...
>   kvm_sched_out-> kvm_arch_vcpu_put
>   kvm_sched_in -> kvm_arch_vcpu_load
>   vcpu_put -> kvm_arch_vcpu_put
> 
> Thanks,
> 
> Paolo

Thanks for the explanation, it's very clear.
I tried that but I didn't succeed in sending the ioctl from the "run_on_cpu"
function; I couldn't find how to set the right CPUState.
I've tried "current_cpu"

kvm_main.c:

// yacine.begin

static void do_vmi_start_kvm_ioctl(void *type) {
printf("do_vmi_start_kvm_ioctl\n");
kvm_vm_ioctl(kvm_state, type);
}

int vmi_start_kvm_ioctl(int type) { <- called from hmp.c
printf("vmi_start_kvm_ioctl\n");
run_on_cpu(current_cpu, do_vmi_start_kvm_ioctl, (void *) );
return 0;
}
// yacine.end

This gives me a segmentation fault.
Then I tried replacing current_cpu with ENV_GET_CPU(mon_get_cpu()); that
didn't work either: I get no error, but nothing happens.
I also tried passing mon->mon_cpu through int vmi_start_kvm_ioctl(int type)
by adding a first parameter of type CPUState, but I get the compiler error
"dereferencing pointer to incomplete type".
I'm a beginner with the QEMU and KVM code; can you please point me in the
right direction to fix this problem? Thanks in advance.

Yacine






Re: Difference between vcpu_load and kvm_sched_in ?

2015-10-21 Thread Paolo Bonzini


On 21/10/2015 12:17, Hebbal Yacine wrote:
> Thanks for the explanation, it's very clear.
> I tried that, but I didn't manage to send the ioctl from the "run_on_cpu"
> function: I couldn't find how to pass the right CPUState.
> I've tried "current_cpu":

current_cpu is always NULL outside the VCPU thread.

> 
> kvm_main.c:
> 
> // yacine.begin
> 
> static void do_vmi_start_kvm_ioctl(void *type) {
> printf("do_vmi_start_kvm_ioctl\n");
> kvm_vm_ioctl(kvm_state, type);

Are you sure you want a VM ioctl and not a VCPU ioctl?  Or perhaps a VM
ioctl to do generic processing, and a VCPU ioctl that is then sent to
all VCPUs?

If you use a VCPU ioctl, you can use CPU_FOREACH or a for loop to
iterate over all VCPUs.
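The shape of such a loop can be sketched in standalone C. This is an illustration under assumptions, not QEMU code: `struct mock_cpu`, `MOCK_CPU_FOREACH`, `mock_vcpu_ioctl` and `mock_ioctl_all_vcpus` are hypothetical stand-ins so the example compiles on its own. In QEMU the iteration would use CPU_FOREACH over the real CPUState list, and the per-VCPU call would go through QEMU's own VCPU ioctl helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a VCPU object; QEMU's CPUState differs. */
struct mock_cpu {
    int index;
    int ioctl_calls;
    struct mock_cpu *next;
};

/* Stand-in for CPU_FOREACH: walk a singly linked list of VCPUs. */
#define MOCK_CPU_FOREACH(cpu, head) \
    for ((cpu) = (head); (cpu) != NULL; (cpu) = (cpu)->next)

/* Stand-in for a per-VCPU ioctl: records the call and reports success. */
static int mock_vcpu_ioctl(struct mock_cpu *cpu, int request)
{
    assert(cpu != NULL);
    (void)request;
    cpu->ioctl_calls++;
    return 0;
}

/* Send the same request to every VCPU; stop at the first failure. */
static int mock_ioctl_all_vcpus(struct mock_cpu *head, int request)
{
    struct mock_cpu *cpu;

    MOCK_CPU_FOREACH(cpu, head) {
        int ret = mock_vcpu_ioctl(cpu, request);

        if (ret < 0)
            return ret;
    }
    return 0;
}
```

Because each VCPU is named explicitly by the loop variable, nothing here depends on current_cpu, which is what makes this approach usable from the monitor thread.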

Paolo


Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function

2015-10-21 Thread Stefan Hajnoczi
On Wed, Oct 21, 2015 at 12:26:35AM +0800, Xiao Guangrong wrote:
> 
> 
> On 10/20/2015 11:51 PM, Stefan Hajnoczi wrote:
> >On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote:
> >>+exit:
> >>+/* Write our output result to dsm memory. */
> >>+((dsm_out *)dsm_ram_addr)->len = out->len;
> >
> >Missing byteswap?
> >
> >I thought you were going to remove this field because it wasn't needed
> >by the guest.
> >
> 
> @len is the size of the _DSM result buffer. For example, for the
> DSM_FUN_IMPLEMENTED function the result buffer is 8 bytes, and for
> DSM_DEV_FUN_NAMESPACE_LABEL_SIZE the buffer size is 4 bytes. It tells the ASL
> code how much memory we need to return to the _DSM caller.
> 
> In _DSM code, it's handled like this:
> 
> "RLEN" is @len; "OBUF" is the remaining memory in the DSM page.
> 
> /* get @len*/
> aml_append(method, aml_store(aml_name("RLEN"), aml_local(6)));
> /* @len << 3 to get bits. */
> aml_append(method, aml_store(aml_shiftleft(aml_local(6),
>aml_int(3)), aml_local(6)));
> 
> /* get @len << 3 bits from OBUF, and return it to the caller. */
> aml_append(method, aml_create_field(aml_name("ODAT"), aml_int(0),
> aml_local(6) , "OBUF"));
> 
> Since @len is only used internally and is not returned to the guest, I did
> not do a byteswap here.

I am not familiar with the ACPI details, but I think this emits bytecode
that will be run by the guest's ACPI interpreter?

You still need to define the endianness of fields since QEMU and the
guest could have different endianness.

In other words, will the following work if a big-endian ppc host is
running a little-endian x86 guest?

  ((dsm_out *)dsm_ram_addr)->len = out->len;
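One way to make the layout independent of the host is to write the field byte by byte in a fixed order. The sketch below assumes the field is defined as 32-bit little-endian; `store_le32` and `load_le32` are hypothetical helper names chosen so the example compiles standalone. QEMU has its own byte-order helpers for this purpose, which a real patch would use instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Store a 32-bit value in little-endian byte order, regardless of the
 * byte order of the machine executing this code.
 */
static void store_le32(uint8_t *dst, uint32_t val)
{
    assert(dst != NULL);
    dst[0] = (uint8_t)(val & 0xff);
    dst[1] = (uint8_t)((val >> 8) & 0xff);
    dst[2] = (uint8_t)((val >> 16) & 0xff);
    dst[3] = (uint8_t)((val >> 24) & 0xff);
}

/* Read back a 32-bit little-endian value, again host-independent. */
static uint32_t load_le32(const uint8_t *src)
{
    assert(src != NULL);
    return (uint32_t)src[0] |
           ((uint32_t)src[1] << 8) |
           ((uint32_t)src[2] << 16) |
           ((uint32_t)src[3] << 24);
}
```

With this approach a length of 8 always lands in memory as the bytes 08 00 00 00, so a little-endian guest reads 8 whether the host is x86 or big-endian ppc.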

Stefan


RE: [PATCH v3 13/16] KVM: arm64: sync LPI configuration and pending tables

2015-10-21 Thread Pavel Fedin
 Hello!

> -----Original Message-----
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf 
> Of Andre Przywara
> Sent: Wednesday, October 07, 2015 5:55 PM
> To: marc.zyng...@arm.com; christoffer.d...@linaro.org
> Cc: eric.au...@linaro.org; p.fe...@samsung.com; kvm...@lists.cs.columbia.edu; 
> linux-arm-
> ker...@lists.infradead.org; kvm@vger.kernel.org
> Subject: [PATCH v3 13/16] KVM: arm64: sync LPI configuration and pending 
> tables
> 
> The LPI configuration and pending tables of the GICv3 LPIs are held
> in tables in (guest) memory. To achieve reasonable performance, we
> cache this data in our own data structures, so we need to sync those
> two views from time to time. This behaviour is well described in the
> GICv3 spec and is also exercised by hardware, so the sync points are
> well known.
> 
> Provide functions that read the guest memory and store the
> information from the configuration and pending tables in the kernel.
> 
> Signed-off-by: Andre Przywara 
> ---
> Changelog v2..v3:
> - rework functions to avoid propbaser/pendbaser accesses inside lock
> 
>  include/kvm/arm_vgic.h  |   2 +
>  virt/kvm/arm/its-emul.c | 133 
> 
>  virt/kvm/arm/its-emul.h |   3 ++
>  3 files changed, 138 insertions(+)
> 
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 035911f..4ea023c 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -179,6 +179,8 @@ struct vgic_its {
>   int cwriter;
>   struct list_headdevice_list;
>   struct list_headcollection_list;
> + /* memory used for buffering guest's memory */
> + void*buffer_page;
>  };
> 
>  struct vgic_dist {
> diff --git a/virt/kvm/arm/its-emul.c b/virt/kvm/arm/its-emul.c
> index 8349970..7a8c5db 100644
> --- a/virt/kvm/arm/its-emul.c
> +++ b/virt/kvm/arm/its-emul.c
> @@ -59,6 +59,7 @@ struct its_itte {
>   struct its_collection *collection;
>   u32 lpi;
>   u32 event_id;
> + u8 priority;
>   bool enabled;
>   unsigned long *pending;
>  };
> @@ -80,8 +81,124 @@ static struct its_itte *find_itte_by_lpi(struct kvm *kvm, 
> int lpi)
>   return NULL;
>  }
> 
> +#define LPI_PROP_ENABLE_BIT(p)   ((p) & LPI_PROP_ENABLED)
> +#define LPI_PROP_PRIORITY(p) ((p) & 0xfc)
> +
> +/* stores the priority and enable bit for a given LPI */
> +static void update_lpi_config(struct kvm *kvm, struct its_itte *itte, u8 
> prop)
> +{
> + itte->priority = LPI_PROP_PRIORITY(prop);
> + itte->enabled  = LPI_PROP_ENABLE_BIT(prop);
> +}
> +
> +#define GIC_LPI_OFFSET 8192
> +
> +/* We scan the table in chunks the size of the smallest page size */
> +#define CHUNK_SIZE 4096U
> +
>  #define BASER_BASE_ADDRESS(x) ((x) & 0xf000ULL)
> 
> +static int nr_idbits_propbase(u64 propbaser)
> +{
> + int nr_idbits = (1U << (propbaser & 0x1f)) + 1;
> +
> + return max(nr_idbits, INTERRUPT_ID_BITS_ITS);
> +}
> +
> +/*
> + * Scan the whole LPI configuration table and put the LPI configuration
> + * data in our own data structures. This relies on the LPI being
> + * mapped before.
> + */
> +static bool its_update_lpis_configuration(struct kvm *kvm, u64 prop_base_reg)
> +{
> + struct vgic_dist *dist = &kvm->arch.vgic;
> + u8 *prop = dist->its.buffer_page;
> + u32 tsize;
> + gpa_t propbase;
> + int lpi = GIC_LPI_OFFSET;
> + struct its_itte *itte;
> + struct its_device *device;
> + int ret;
> +
> + propbase = BASER_BASE_ADDRESS(prop_base_reg);
> + tsize = nr_idbits_propbase(prop_base_reg);
> +
> + while (tsize > 0) {
> + int chunksize = min(tsize, CHUNK_SIZE);
> +
> + ret = kvm_read_guest(kvm, propbase, prop, chunksize);
> + if (ret)
> + return false;

 I think it would be more convenient to return 'ret' here, and 0 on success. I
see that currently nobody consumes the error code, but with live migration this
may change. The same applies to its_sync_lpi_pending_table().
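To illustrate the suggestion, here is a minimal standalone sketch of a chunked scan that propagates the read error instead of collapsing it to a bool. It is not the patch's code: `mock_read_guest` is a hypothetical stand-in for kvm_read_guest() that returns 0 or -EFAULT, and `mock_scan_table` and `MOCK_CHUNK_SIZE` are names invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MOCK_CHUNK_SIZE 4096U

/* Hypothetical stand-in for kvm_read_guest(): 0 on success, -EFAULT on error. */
static int mock_read_guest(const unsigned char *guest, size_t guest_size,
                           size_t offset, void *buf, size_t len)
{
    if (offset > guest_size || len > guest_size - offset)
        return -EFAULT;
    memcpy(buf, guest + offset, len);
    return 0;
}

/* Scan a table in chunks; return 0 on success or the read error code. */
static int mock_scan_table(const unsigned char *guest, size_t guest_size,
                           size_t tsize)
{
    unsigned char chunk[MOCK_CHUNK_SIZE];
    size_t offset = 0;

    assert(guest != NULL);
    while (tsize > 0) {
        size_t chunksize = tsize < MOCK_CHUNK_SIZE ? tsize : MOCK_CHUNK_SIZE;
        int ret = mock_read_guest(guest, guest_size, offset, chunk, chunksize);

        if (ret)
            return ret; /* propagate the error instead of returning false */

        tsize -= chunksize;
        offset += chunksize;
    }
    return 0;
}
```

A caller that only cares about success can still test the result against 0, while a later consumer such as a migration path could distinguish the failure reason.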

> +
> + spin_lock(&dist->its.lock);
> + /*
> +  * Updating the status for all allocated LPIs. We catch
> +  * those LPIs that get disabled. We really don't care
> +  * about unmapped LPIs, as they need to be updated
> +  * later manually anyway once they get mapped.
> +  */
> + for_each_lpi(device, itte, kvm) {
> + if (itte->lpi < lpi || itte->lpi >= lpi + chunksize)
> + continue;
> +
> + update_lpi_config(kvm, itte, prop[itte->lpi - lpi]);
> + }
> + spin_unlock(&dist->its.lock);
> + tsize -= chunksize;
> + lpi += chunksize;
> + propbase += chunksize;
> + }
> +
> + return true;
> +}
> +
> +/*
> + * Scan the whole LPI pending table and sync the pending bit in there
> + * with 

[PATCH] KVM: x86: fix eflags state following processor init/reset

2015-10-21 Thread Wanpeng Li
Reference SDM 3.4.3:

Following initialization of the processor (either by asserting the
RESET pin or the INIT pin), the state of the EFLAGS register is
00000002H.

However, the eflags fixed bit is not set and other bits are also not
cleared during the init/reset in kvm.

This patch fixes it by setting the eflags register to 00000002H following
initialization of the processor.

Signed-off-by: Wanpeng Li 
---
 arch/x86/kvm/vmx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b680c2e..326f6ea 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4935,6 +4935,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool 
init_event)
vmx_set_efer(vcpu, 0);
vmx_fpu_activate(vcpu);
update_exception_bitmap(vcpu);
+   vmx_set_rflags(vcpu, X86_EFLAGS_FIXED);
 
vpid_sync_context(vmx->vpid);
 }
-- 
1.9.1
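
As a small illustration of the intended reset state, the sketch below models only the flags register. It assumes nothing about KVM internals: `struct mock_regs`, `mock_vcpu_reset` and `MOCK_EFLAGS_FIXED` are hypothetical names for this example. The one architectural fact it relies on is that bit 1 of EFLAGS is reserved and always set, which gives the value 00000002H.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 1 of EFLAGS is reserved and always reads as 1. */
#define MOCK_EFLAGS_FIXED (UINT64_C(1) << 1)

/* Hypothetical stand-in for the VCPU register state. */
struct mock_regs {
    uint64_t rflags;
};

/* Reset: drop whatever flags were set before, keep only the fixed bit. */
static void mock_vcpu_reset(struct mock_regs *regs)
{
    assert(regs != NULL);
    regs->rflags = MOCK_EFLAGS_FIXED;
}
```

Assigning the whole register, rather than OR-ing in the fixed bit, is what clears the stale flags mentioned in the description.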



Re: [PATCH v3 04/20] KVM: ARM64: Add reset and access handlers for PMCR_EL0 register

2015-10-21 Thread Shannon Zhao


On 2015/10/16 13:35, Wei Huang wrote:
> 
> On 09/24/2015 05:31 PM, Shannon Zhao wrote:
>> > Add reset handler which gets host value of PMCR_EL0 and make writable
>> > bits architecturally UNKNOWN. Add a common access handler for PMU
>> > registers which emulates writing and reading register and add emulation
>> > for PMCR.
>> > 
>> > Signed-off-by: Shannon Zhao 
>> > ---
>> >  arch/arm64/kvm/sys_regs.c | 81 
>> > +--
>> >  1 file changed, 79 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> > index b41607d..60c0842 100644
>> > --- a/arch/arm64/kvm/sys_regs.c
>> > +++ b/arch/arm64/kvm/sys_regs.c
>> > @@ -33,6 +33,7 @@
>> >  #include 
>> >  #include 
>> >  #include 
>> > +#include 
>> >  
>> >  #include 
>> >  
>> > @@ -446,6 +447,53 @@ static void reset_mpidr(struct kvm_vcpu *vcpu, const 
>> > struct sys_reg_desc *r)
>> >vcpu_sys_reg(vcpu, MPIDR_EL1) = (1ULL << 31) | mpidr;
>> >  }
>> >  
>> > +static void vcpu_sysreg_write(struct kvm_vcpu *vcpu,
>> > +const struct sys_reg_desc *r, u64 val)
>> > +{
>> > +  if (!vcpu_mode_is_32bit(vcpu))
>> > +  vcpu_sys_reg(vcpu, r->reg) = val;
>> > +  else
>> > +  vcpu_cp15(vcpu, r->reg) = lower_32_bits(val);
>> > +}
>> > +
>> > +static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc 
>> > *r)
>> > +{
>> > +  u64 pmcr, val;
>> > +
>> > +  asm volatile("mrs %0, pmcr_el0\n" : "=r" (pmcr));
>> > +  /* Writable bits of PMCR_EL0 (ARMV8_PMCR_MASK) is reset to UNKNOWN*/
>> > +  val = (pmcr & ~ARMV8_PMCR_MASK) | (ARMV8_PMCR_MASK & 0xdecafbad);
> Two comments:
> (1) In Patch 1, ARMV8_PMCR_MASK is defined as 0x3f. According to ARMv8
> spec, PMCR_EL0.LC (bit 6) is also writable. Should ARMV8_PMCR_MASK be 0x7f?
According to the spec, it should be 0x7f.

> (2) According to the spec, the PMCR_EL0.E bit resets to 0, not UNKNOWN.
> 
Yeah, will fix this.
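
A rough sketch of how the reset value could look with both comments applied, written as standalone C rather than kernel code. The `MOCK_` macros and `mock_pmcr_reset_value` are hypothetical names for this example. The assumptions are the ones stated above: the writable mask is 0x7f so that it covers LC (bit 6), and E (bit 0) resets to 0.

```c
#include <assert.h>
#include <stdint.h>

#define MOCK_PMCR_E    (UINT64_C(1) << 0) /* enable bit */
#define MOCK_PMCR_LC   (UINT64_C(1) << 6) /* long cycle counter bit */
#define MOCK_PMCR_MASK UINT64_C(0x7f)     /* writable bits 0..6, including LC */

/* Compute a guest reset value from the host's PMCR_EL0 value. */
static uint64_t mock_pmcr_reset_value(uint64_t host_pmcr)
{
    uint64_t val;

    /* Keep the read-only host fields; fill writable bits with a junk pattern. */
    val = (host_pmcr & ~MOCK_PMCR_MASK) |
          (MOCK_PMCR_MASK & UINT64_C(0xdecafbad));

    /* E is the exception: it resets to 0 rather than UNKNOWN. */
    val &= ~MOCK_PMCR_E;

    assert((val & MOCK_PMCR_E) == 0);
    return val;
}
```

With a host value whose low seven bits are zero, the writable field becomes 0x2c: the pattern 0xdecafbad masked with 0x7f is 0x2d, and clearing E removes bit 0.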

Thanks,
-- 
Shannon



Re: [PATCH v3 00/20] KVM: ARM64: Add guest PMU support

2015-10-21 Thread Shannon Zhao


On 2015/10/17 1:01, Christopher Covington wrote:
> On 10/16/2015 12:55 AM, Wei Huang wrote:
>> > 
>> > 
>> > On 09/24/2015 05:31 PM, Shannon Zhao wrote:
>>> >> This patchset adds guest PMU support for KVM on ARM64. It takes
>>> >> trap-and-emulate approach. When guest wants to monitor one event, it
>>> >> will be trapped by KVM and KVM will call perf_event API to create a perf
>>> >> event and call relevant perf_event APIs to get the count value of event.
>>> >>
>>> >> Use perf to test this patchset in guest. When using "perf list", it
>>> >> shows the list of the hardware events and hardware cache events perf
>>> >> supports. Then use "perf stat -e EVENT" to monitor some event. For
>>> >> example, use "perf stat -e cycles" to count cpu cycles and
>>> >> "perf stat -e cache-misses" to count cache misses.
>>> >>
>>> >> Below are the outputs of "perf stat -r 5 sleep 5" when running in host
>>> >> and guest.
>>> >>
>>> >> Host:
>>> >>  Performance counter stats for 'sleep 5' (5 runs):
>>> >>
>>> >>       0.551428  task-clock (msec)        #    0.000 CPUs utilized    ( +-  0.91% )
>>> >>              1  context-switches         #    0.002 M/sec
>>> >>              0  cpu-migrations           #    0.000 K/sec
>>> >>             48  page-faults              #    0.088 M/sec            ( +-  1.05% )
>>> >>        1150265  cycles                   #    2.086 GHz              ( +-  0.92% )
>>> >>                 stalled-cycles-frontend
>>> >>                 stalled-cycles-backend
>>> >>         526398  instructions             #    0.46  insns per cycle  ( +-  0.89% )
>>> >>                 branches
>>> >>           9485  branch-misses            #   17.201 M/sec            ( +-  2.35% )
>>> >>
>>> >>    5.000831616 seconds time elapsed                                  ( +-  0.00% )
>>> >>
>>> >> Guest:
>>> >>  Performance counter stats for 'sleep 5' (5 runs):
>>> >>
>>> >>       0.730868  task-clock (msec)        #    0.000 CPUs utilized    ( +-  1.13% )
>>> >>              1  context-switches         #    0.001 M/sec
>>> >>              0  cpu-migrations           #    0.000 K/sec
>>> >>             48  page-faults              #    0.065 M/sec            ( +-  0.42% )
>>> >>        1642982  cycles                   #    2.248 GHz              ( +-  1.04% )
>>> >>                 stalled-cycles-frontend
>>> >>                 stalled-cycles-backend
>>> >>         637964  instructions             #    0.39  insns per cycle  ( +-  0.65% )
>>> >>                 branches
>>> >>          10377  branch-misses            #   14.198 M/sec            ( +-  1.09% )
>>> >>
>>> >>    5.001289068 seconds time elapsed                                  ( +-  0.00% )
>>> >>
>> > 
>> > Thanks for V3. One suggestion is to run more perf stress tests, such as
>> > "perf test". So we know the corner cases are covered as much as possible.
> I'd also recommend Vince Weaver's perf_event_tests. It tests things like
> signal-on-counter-overflow that I've never seen anywhere else (other than some
> of my own code).
> 
> https://github.com/deater/perf_event_tests

Ok. Thanks for your suggestion.

-- 
Shannon
