Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function
On 10/21/2015 06:49 PM, Stefan Hajnoczi wrote:
> On Wed, Oct 21, 2015 at 12:26:35AM +0800, Xiao Guangrong wrote:
>> On 10/20/2015 11:51 PM, Stefan Hajnoczi wrote:
>>> On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote:
>>>> +exit:
>>>> +/* Write our output result to dsm memory. */
>>>> +((dsm_out *)dsm_ram_addr)->len = out->len;
>>>
>>> Missing byteswap?
>>>
>>> I thought you were going to remove this field because it wasn't needed
>>> by the guest.
>>
>> The @len is the size of the _DSM result buffer. For example, for the
>> DSM_FUN_IMPLEMENTED function the result buffer is 8 bytes, and for
>> DSM_DEV_FUN_NAMESPACE_LABEL_SIZE the buffer size is 4 bytes. It tells the
>> ASL code how much memory we need to return to the _DSM caller.
>>
>> In the _DSM code, it's handled like this:
>>
>> "RLEN" is @len, "OBUF" is the remaining memory in the DSM page.
>>
>> /* get @len*/
>> aml_append(method, aml_store(aml_name("RLEN"), aml_local(6)));
>> /* @len << 3 to get bits. */
>> aml_append(method, aml_store(aml_shiftleft(aml_local(6),
>>            aml_int(3)), aml_local(6)));
>> /* get @len << 3 bits from OBUF, and return it to the caller. */
>> aml_append(method, aml_create_field(aml_name("ODAT"), aml_int(0),
>>            aml_local(6) , "OBUF"));
>>
>> Since @len is only used internally and is not returned to the guest, I
>> did not do a byteswap here.
>
> I am not familiar with the ACPI details, but I think this emits bytecode
> that will be run by the guest's ACPI interpreter?
>
> You still need to define the endianness of fields since QEMU and the
> guest could have different endianness.
>
> In other words, will the following work if a big-endian ppc host is
> running a little-endian x86 guest?
>
> ((dsm_out *)dsm_ram_addr)->len = out->len;

Er... If we do a byteswap in QEMU, then one is also needed in the ASL code;
however, ASL lacks this kind of instruction. I guess the ACPI interpreter is
smart enough to convert values to little-endian for all 2-byte / 4-byte /
8-byte accesses.

I will make the change in the next version. Thanks for pointing it out, Stefan!
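To make the byteswap question above concrete, here is a minimal sketch of writing a 32-bit length field in a fixed little-endian layout, so that the bytes in shared memory are the same whether the host is big- or little-endian. The helper names store_le32() and load_le32() are hypothetical and are not taken from the patch; inside QEMU one would more likely reach for the existing cpu_to_le32() helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store a 32-bit value in little-endian byte order,
 * independent of the endianness of the machine running this code. */
static void store_le32(uint8_t *dst, uint32_t value)
{
    assert(dst != NULL);
    dst[0] = (uint8_t)(value & 0xff);
    dst[1] = (uint8_t)((value >> 8) & 0xff);
    dst[2] = (uint8_t)((value >> 16) & 0xff);
    dst[3] = (uint8_t)((value >> 24) & 0xff);
}

/* Hypothetical helper: read back a little-endian 32-bit value. */
static uint32_t load_le32(const uint8_t *src)
{
    assert(src != NULL);
    return (uint32_t)src[0] |
           ((uint32_t)src[1] << 8) |
           ((uint32_t)src[2] << 16) |
           ((uint32_t)src[3] << 24);
}
```

Because the bytes are written one at a time, the layout does not depend on how the host stores a uint32_t, which is the property Stefan is asking for.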
-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally
On Tuesday 20 October 2015 15:51:05 Paolo Bonzini wrote:
> Should this be "select" or "depends on"? Not a blocker, can always be fixed
> in 4.4.

We have lots of 'select ARM_GIC' in the tree for platforms that use one, using
'depends on' will limit KVM support to being available only if at least one
of them is being used.

The only platform I can think of that uses ARMv7ve without actually having
a GIC is BCM2836 (Raspberry Pi 2). Can we actually run KVM on a platform
like that? If so, 'depends on' might be better, otherwise let's stay with
'select'.

Note that ARM_GIC is not a user-visible option, you can only turn it on
by picking one or more platforms that have a GIC.

	Arnd
Re: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally
On Wed, Oct 21, 2015 at 03:45:20PM +0200, Arnd Bergmann wrote:
> On Tuesday 20 October 2015 15:51:05 Paolo Bonzini wrote:
> > Should this be "select" or "depends on"? Not a blocker, can always be fixed
> > in 4.4.
>
> We have lots of 'select ARM_GIC' in the tree for platforms that use one, using
> 'depends on' will limit KVM support to being available only if at least one
> of them is being used.
>
> The only platform I can think of that uses ARMv7ve without actually having
> a GIC is BCM2836 (Raspberry Pi 2). Can we actually run KVM on a platform
> like that? If so, 'depends on' might be better, otherwise let's stay with
> 'select'.

Yes you can, just without the VGIC and the timer - you have to emulate
that in userspace.

Samsung also has a broken platform where they integrated things
incorrectly, so you cannot use the VGIC, but that platform support is
out of tree, so I can't see if it uses the GIC in general or not.

I'm a bit confused why using 'depends on' in this case helps anything?
(I know, I suck at dealing with the config system)

Thanks,
-Christoffer
RE: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally
 Hello!

> The only platform I can think of that uses ARMv7ve without actually having
> a GIC is BCM2836 (Raspberry Pi 2). Can we actually run KVM on a platform
> like that?

 We can, with two limitations:
1. The GIC has to be emulated in software. I have recently fixed support for
   this. The only problem here would be that KVM currently refuses to
   initialize if there's no vGIC, but that is easy to fix; I posted patches
   for this too.
2. We cannot emulate the CP15 timer, because accesses to the virtual timer
   registers cannot be trapped to HYP. However, it is possible to trap
   physical timer accesses, but a small KVM API extension is needed for this.

 Currently it is possible to run the qemu vexpress model in this mode, because
it has another, memory-mapped timer. It is only necessary to either remove the
CP15 timer from the guest device tree, or disable support for it in the guest
.config.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
[PATCH net-next RFC 2/2] vhost_net: basic polling support
This patch tries to poll for newly added tx buffers for a while at the
end of tx processing. The maximum time spent on polling is limited
through a module parameter. To avoid blocking rx, the loop ends if
there is other work queued on vhost, so in effect the socket receive
queue is also polled.

busyloop_timeout = 50 gives us the following improvement on the TCP_RR test:

size/session/+thu%/+normalize%
    1/      1/   +5%/  -20%
    1/     50/  +17%/   +3%

Signed-off-by: Jason Wang
---
 drivers/vhost/net.c | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9eda69e..bbb522a 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -31,7 +31,9 @@
 #include "vhost.h"
 
 static int experimental_zcopytx = 1;
+static int busyloop_timeout = 50;
 module_param(experimental_zcopytx, int, 0444);
+module_param(busyloop_timeout, int, 0444);
 MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
 		                       " 1 -Enable; 0 - Disable");
 
@@ -287,12 +289,23 @@ static void vhost_zerocopy_callback(struct ubuf_info *ubuf, bool success)
 	rcu_read_unlock_bh();
 }
 
+static bool tx_can_busy_poll(struct vhost_dev *dev,
+			     unsigned long endtime)
+{
+	unsigned long now = local_clock() >> 10;
+
+	return busyloop_timeout && !need_resched() &&
+	       !time_after(now, endtime) && !vhost_has_work(dev) &&
+	       single_task_running();
+}
+
 /* Expects to be always run from workqueue - which acts as
  * read-size critical section for our kind of RCU. */
 static void handle_tx(struct vhost_net *net)
 {
 	struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_TX];
 	struct vhost_virtqueue *vq = &nvq->vq;
+	unsigned long endtime;
 	unsigned out, in;
 	int head;
 	struct msghdr msg = {
@@ -331,6 +344,8 @@ static void handle_tx(struct vhost_net *net)
 			      % UIO_MAXIOV == nvq->done_idx))
 			break;
 
+		endtime = (local_clock() >> 10) + busyloop_timeout;
+again:
 		head = vhost_get_vq_desc(vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
@@ -340,6 +355,10 @@ static void handle_tx(struct vhost_net *net)
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
+			if (tx_can_busy_poll(vq->dev, endtime)) {
+				cpu_relax();
+				goto again;
+			}
 			if (unlikely(vhost_enable_notify(&net->dev, vq))) {
 				vhost_disable_notify(&net->dev, vq);
 				continue;
-- 
1.8.3.1
[PATCH net-next RFC 1/2] vhost: introduce vhost_has_work()
This patch introduces a helper which gives a hint as to whether or not
there is work queued in the work list.

Signed-off-by: Jason Wang
---
 drivers/vhost/vhost.c | 6 ++
 drivers/vhost/vhost.h | 1 +
 2 files changed, 7 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index eec2f11..d42d11e 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -245,6 +245,12 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+bool vhost_has_work(struct vhost_dev *dev)
+{
+	return !list_empty(&dev->work_list);
+}
+EXPORT_SYMBOL_GPL(vhost_has_work);
+
 void vhost_poll_queue(struct vhost_poll *poll)
 {
 	vhost_work_queue(poll->dev, &poll->work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4772862..ea0327d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -37,6 +37,7 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
+bool vhost_has_work(struct vhost_dev *dev);
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 		     unsigned long mask, struct vhost_dev *dev);
 
-- 
1.8.3.1
[RFC Patch 11/12] IXGBEVF: Migrate VF statistic data
VF statistic regs are read-only and can't be migrated by writing the
values back directly.

Currently, the statistic data returned to user space by the driver is not
equal to the value of the statistic regs. The VF driver records the value
of the statistic regs as base data when the net interface is brought up or
opened, calculates the increase of each reg during the last period of
online service, and adds it to the saved_reset data. When user space
collects statistic data, the VF driver returns the result of
"current - base + saved_reset", where "current" is the reg value at that
point.

Restoring the net function after migration is just like bringing the net
interface up or opening it. Call the existing functions to update the base
and saved_reset data so that the statistic data stays continuous across
migration.

Signed-off-by: Lan Tianyu
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 04b6ce7..d22160f 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3005,6 +3005,7 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
 		return 0;
 
 	del_timer_sync(&adapter->service_timer);
+	ixgbevf_update_stats(adapter);
 	pr_info("migration start\n");
 	migration_status = MIGRATION_IN_PROGRESS;
 
@@ -3017,6 +3018,8 @@ int ixgbevf_live_mg(struct ixgbevf_adapter *adapter)
 		return 1;
 
 	ixgbevf_restore_state(adapter);
+	ixgbevf_save_reset_stats(adapter);
+	ixgbevf_init_last_counter_stats(adapter);
 	migration_status = MIGRATION_COMPLETED;
 	pr_info("migration end\n");
 	return 0;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
[RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device
Add a "virtfn_index" member to struct pci_dev to record the VF's sequence
number within its PF. This will be used by the VF sysfs node handlers.

Signed-off-by: Lan Tianyu
---
 drivers/pci/iov.c   | 1 +
 include/linux/pci.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ee0ebff..065b6bb 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
 	virtfn->physfn = pci_dev_get(dev);
 	virtfn->is_virtfn = 1;
 	virtfn->multifunction = 0;
+	virtfn->virtfn_index = id;
 
 	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
 		res = &dev->resource[i + PCI_IOV_RESOURCES];
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 353db8d..85c5531 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -356,6 +356,7 @@ struct pci_dev {
 	unsigned int	io_window_1k:1;	/* Intel P2P bridge 1K I/O windows */
 	unsigned int	irq_managed:1;
 	pci_dev_flags_t dev_flags;
+	unsigned int	virtfn_index;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */
 
 	u32		saved_config_space[16]; /* config space saved at suspend time */
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
Re: [GIT PULL 0/6] A handful of fixes for KVM/ARM for v4.3-rc7
On 20/10/2015 18:19, Christoffer Dall wrote:
> Hi Paolo,
>
> The following changes since commit 920552b213e3dc832a874b4e7ba29ecddbab31bc:
>
>   KVM: disable halt_poll_ns as default for s390x (2015-09-25 10:31:30 +0200)
>
> are available in the git repository at:
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm.git tags/kvm-arm-for-v4.3-rc7
>
> for you to fetch changes up to 0d997491f814c87310a6ad7be30a9049c7150489:
>
>   arm/arm64: KVM: Fix disabled distributor operation (2015-10-20 18:09:13 +0200)
>
> Sorry for sending these relatively late, but we had a situation where we
> found one breakage in the timer implementation changes merged for 4.3,
> then fixing that issue revealed another bug, and then that happened
> again, and now we have something that looks stable.
>
> Description of the fixes is in the tag and quoted below.
>
> Thanks,
> -Christoffer
>
>
> A late round of KVM/ARM fixes for v4.3-rc7, fixing:
>  - A bug where level-triggered interrupts lowered from userspace
>    are still routed to the guest
>  - A memory leak on a failed initialization path
>  - A build error under certain configurations
>  - Several timer bugs introduced with moving the timer to the active
>    state handling instead of the masking trick.
>
>
> Arnd Bergmann (1):
>       KVM: arm: use GIC support unconditionally
>
> Christoffer Dall (3):
>       arm/arm64: KVM: Fix arch timer behavior for disabled interrupts
>       arm/arm64: KVM: Clear map->active on pend/active clear
>       arm/arm64: KVM: Fix disabled distributor operation
>
> Pavel Fedin (2):
>       KVM: arm/arm64: Do not inject spurious interrupts
>       KVM: arm/arm64: Fix memory leak if timer initialization fails
>
>  arch/arm/kvm/Kconfig      |  1 +
>  arch/arm/kvm/arm.c        |  2 +-
>  virt/kvm/arm/arch_timer.c | 19 ++
>  virt/kvm/arm/vgic.c       | 95 +++
>  4 files changed, 76 insertions(+), 41 deletions(-)
>

Pulled, thanks. I'll send the fixes to Linus tomorrow.

Paolo
[RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
This patchset proposes a new solution for adding live migration support
to the 82599 SRIOV network card.

In our solution, we prefer to put all device-specific operations into the
VF and PF drivers and keep the code in Qemu more general.


VF status migration
===================
VF status can be divided into 4 parts:
1) PCI configure regs
2) MSIX configure
3) VF status in the PF driver
4) VF MMIO regs

The first three are all handled by Qemu.
The PCI configure space regs and MSIX configure are originally
stored in Qemu. To let Qemu save and restore "VF status in the PF driver"
during migration, a new sysfs node "state_in_pf" is added under the VF
sysfs directory.

For VF MMIO regs, we introduce a self emulation layer in the VF
driver to record MMIO reg values when MMIO is read or written,
and put these data in the guest memory. They will be migrated with
the guest memory to the new machine.


VF function restoration
=======================
Restoring VF function operation is done in the VF and PF drivers.

In order to let the VF driver know the migration status, Qemu fakes VF
PCI configure regs to indicate the migration status and adds a new sysfs
node "notify_vf" to trigger the VF mailbox irq, in order to notify the VF
about a migration status change.

Transmit/Receive descriptor head regs are read-only and can't
be restored by writing back the recorded reg values directly; they
are set to 0 during VF reset. To reuse the original tx/rx rings, shift
the desc ring so that the desc pointed to by the original head reg
moves to the first entry of the ring, and then enable the tx/rx rings.
The VF restarts receiving and transmitting from the original head desc.


Tracking DMA accessed memory
============================
Migration relies on tracking dirty pages to migrate memory.
Hardware can't automatically mark a page as dirty after a DMA
memory access. VF descriptor rings and data buffers are modified
by hardware when receiving and transmitting data. To track such dirty
memory manually, do dummy writes (read a byte and write it back) when
receiving and transmitting data.

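The descriptor ring shift described under "VF function restoration" amounts to rotating the ring so that the entry at the old head index becomes entry 0. The following is a minimal user-space sketch of that rotation; ring_shift() and its parameters are hypothetical names chosen for illustration, and malloc() stands in for whatever allocator the driver would use.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Rotate a ring of `count` elements, each `elem_size` bytes, so that the
 * element at index `head` ends up at index 0 and the elements before it
 * wrap around to the tail.
 * Returns 0 on success, -1 if the temporary copy cannot be allocated. */
static int ring_shift(void *ring, size_t count, size_t elem_size, size_t head)
{
    unsigned char *base = ring;
    unsigned char *tmp;

    assert(ring != NULL && elem_size > 0 && head < count);
    if (head == 0)
        return 0;                       /* already in place */

    tmp = malloc(count * elem_size);
    if (!tmp)
        return -1;

    /* Snapshot the ring, then copy the two halves back in swapped order. */
    memcpy(tmp, base, count * elem_size);
    memcpy(base, tmp + head * elem_size, (count - head) * elem_size);
    memcpy(base + (count - head) * elem_size, tmp, head * elem_size);

    free(tmp);
    return 0;
}
```

For a ring holding {0, 1, 2, 3, 4} and head 3, the result is {3, 4, 0, 1, 2}. Any per-descriptor bookkeeping kept in a parallel array would need the same rotation so that indices keep matching.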
Service down time test
======================
So far, we have tested migration between two laptops with 82599 NICs that
are connected to a gigabit switch, pinging the VF at a 0.001s interval
from the source-side host during migration. The service down time is
about 180ms.

[983769928.053604] 64 bytes from 10.239.48.100: icmp_seq=4131 ttl=64 time=2.79 ms
[983769928.056422] 64 bytes from 10.239.48.100: icmp_seq=4132 ttl=64 time=2.79 ms
[983769928.059241] 64 bytes from 10.239.48.100: icmp_seq=4133 ttl=64 time=2.79 ms
[983769928.062071] 64 bytes from 10.239.48.100: icmp_seq=4134 ttl=64 time=2.80 ms
[983769928.064890] 64 bytes from 10.239.48.100: icmp_seq=4135 ttl=64 time=2.79 ms
[983769928.067716] 64 bytes from 10.239.48.100: icmp_seq=4136 ttl=64 time=2.79 ms
[983769928.070538] 64 bytes from 10.239.48.100: icmp_seq=4137 ttl=64 time=2.79 ms
[983769928.073360] 64 bytes from 10.239.48.100: icmp_seq=4138 ttl=64 time=2.79 ms
[983769928.083444] no answer yet for icmp_seq=4139
[983769928.093524] no answer yet for icmp_seq=4140
[983769928.103602] no answer yet for icmp_seq=4141
[983769928.113684] no answer yet for icmp_seq=4142
[983769928.123763] no answer yet for icmp_seq=4143
[983769928.133854] no answer yet for icmp_seq=4144
[983769928.143931] no answer yet for icmp_seq=4145
[983769928.154008] no answer yet for icmp_seq=4146
[983769928.164084] no answer yet for icmp_seq=4147
[983769928.174160] no answer yet for icmp_seq=4148
[983769928.184236] no answer yet for icmp_seq=4149
[983769928.194313] no answer yet for icmp_seq=4150
[983769928.204390] no answer yet for icmp_seq=4151
[983769928.214468] no answer yet for icmp_seq=4152
[983769928.224556] no answer yet for icmp_seq=4153
[983769928.234632] no answer yet for icmp_seq=4154
[983769928.244709] no answer yet for icmp_seq=4155
[983769928.254783] no answer yet for icmp_seq=4156
[983769928.256094] 64 bytes from 10.239.48.100: icmp_seq=4139 ttl=64 time=182 ms
[983769928.256107] 64 bytes from 10.239.48.100: icmp_seq=4140 ttl=64 time=172 ms
[983769928.256114] no answer yet for icmp_seq=4157
[983769928.256236] 64 bytes from 10.239.48.100: icmp_seq=4141 ttl=64 time=162 ms
[983769928.256245] 64 bytes from 10.239.48.100: icmp_seq=4142 ttl=64 time=152 ms
[983769928.256272] 64 bytes from 10.239.48.100: icmp_seq=4143 ttl=64 time=142 ms
[983769928.256310] 64 bytes from 10.239.48.100: icmp_seq=4144 ttl=64 time=132 ms
[983769928.256325] 64 bytes from 10.239.48.100: icmp_seq=4145 ttl=64 time=122 ms
[983769928.256332] 64 bytes from 10.239.48.100: icmp_seq=4146 ttl=64 time=112 ms
[983769928.256440] 64 bytes from 10.239.48.100: icmp_seq=4147 ttl=64 time=102 ms
[983769928.256455] 64 bytes from 10.239.48.100: icmp_seq=4148 ttl=64 time=92.3 ms
[983769928.256494] 64 bytes from 10.239.48.100: icmp_seq=4149 ttl=64 time=82.3 ms
[983769928.256503] 64
[RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
When transmitting a packet, the last transmit desc of the packet
indicates whether the packet has already been sent. The current code
records a pointer to that last desc in next_to_watch of struct
ixgbevf_tx_buffer. This breaks if the desc ring is shifted after
migration, because the pointer becomes invalid. This patch replaces the
recorded pointer with the desc number of the packet, and finds the last
desc via the first desc and the desc number.

Signed-off-by: Lan Tianyu
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h      |  1 +
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 ---
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index 775d089..c823616 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -54,6 +54,7 @@
  */
 struct ixgbevf_tx_buffer {
 	union ixgbe_adv_tx_desc *next_to_watch;
+	u16 desc_num;
 	unsigned long time_stamp;
 	struct sk_buff *skb;
 	unsigned int bytecount;
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 4446916..056841c 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct ixgbevf_ring *tx_ring,
 			       DMA_TO_DEVICE);
 	}
 	tx_buffer->next_to_watch = NULL;
+	tx_buffer->desc_num = 0;
 	tx_buffer->skb = NULL;
 	dma_unmap_len_set(tx_buffer, len, 0);
 	/* tx_buffer must be completely set up in the transmit path */
@@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 	union ixgbe_adv_tx_desc *tx_desc;
 	unsigned int total_bytes = 0, total_packets = 0;
 	unsigned int budget = tx_ring->count / 2;
-	unsigned int i = tx_ring->next_to_clean;
+	int i, watch_index;
 
 	if (test_bit(__IXGBEVF_DOWN, &adapter->state))
 		return true;
@@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 	i -= tx_ring->count;
 
 	do {
-		union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch;
+		union ixgbe_adv_tx_desc *eop_desc;
+
+		if (!tx_buffer->desc_num)
+			break;
+
+		if (i + tx_buffer->desc_num >= 0)
+			watch_index = i + tx_buffer->desc_num;
+		else
+			watch_index = i + tx_ring->count + tx_buffer->desc_num;
 
-		/* if next_to_watch is not set then there is no work pending */
+		eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index);
 		if (!eop_desc)
 			break;
 
@@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 
 		/* clear next_to_watch to prevent false hangs */
 		tx_buffer->next_to_watch = NULL;
+		tx_buffer->desc_num = 0;
 
 		/* update the statistics for this packet */
 		total_bytes += tx_buffer->bytecount;
@@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 	u32 tx_flags = first->tx_flags;
 	__le32 cmd_type;
 	u16 i = tx_ring->next_to_use;
+	u16 start;
 
 	tx_desc = IXGBEVF_TX_DESC(tx_ring, i);
 
@@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring,
 
 	/* set next_to_watch value indicating a packet is present */
 	first->next_to_watch = tx_desc;
+	start = first - tx_ring->tx_buffer_info;
+	first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - start;
 
 	i++;
 	if (i == tx_ring->count)
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
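The idea of replacing a stored pointer with an offset that survives a ring shift can be sketched with plain modular index arithmetic. The two helpers below are hypothetical and use ordinary 0-based ring indices, unlike ixgbevf_clean_tx_irq(), which biases its loop index by -count; they only illustrate the wrap-around calculation, not the driver's exact code.

```c
#include <assert.h>
#include <stdint.h>

/* Forward distance from the first descriptor of a packet to its last one,
 * walking around a ring of `count` entries. */
static uint16_t desc_distance(uint16_t first, uint16_t last, uint16_t count)
{
    assert(count > 0 && first < count && last < count);
    return (uint16_t)(last >= first ? last - first
                                    : last + count - first);
}

/* Index of the last descriptor, given the first index and the stored
 * distance. The distance stays valid if the whole ring is rotated,
 * whereas a raw pointer into the old layout would not. */
static uint16_t eop_index(uint16_t first, uint16_t distance, uint16_t count)
{
    assert(count > 0 && first < count && distance < count);
    return (uint16_t)(((uint32_t)first + distance) % count);
}
```

For a 512-entry ring, a packet that starts at index 510 and ends at index 1 has distance 3, and eop_index(510, 3, 512) gives back 1. In this sketch a distance of 0 simply means the first and last descriptor coincide.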
[RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver
To let the VF driver in the guest know the migration status, Qemu fakes
PCI configure regs 0xF0 and 0xF1 to show the migration status and to get
an ack from the VF driver.

When migration starts, Qemu sets reg "0xF0" to 1, notifies the VF driver
by triggering a mailbox msg, and waits for the VF driver to tell it that
it is ready for migration (by setting reg "0xF1" to 1). After migration,
Qemu sets reg "0xF0" to 0 and notifies the VF driver via a mailbox irq.
The VF driver begins to restore tx/rx function after detecting the status
change.

When the VF receives a mailbox irq, it checks reg "0xF0" in the service
task function to get the migration status and performs the related
operations according to its value.

Steps for restarting the receive and transmit functions:
1) Restore VF status in the PF driver by sending a mailbox event to the
   PF driver
2) Write back reg values recorded by the self emulation layer
3) Restart the rx/tx rings
4) Recover interrupts

Transmit/Receive descriptor head regs are read-only and can't
be restored by writing back the recorded reg values directly; they
are set to 0 during VF reset. To reuse the original tx/rx rings, shift
the desc ring so that the desc pointed to by the original head reg
moves to the first entry of the ring, and then enable the tx/rx rings.
The VF restarts receiving and transmitting from the original head desc.

Signed-off-by: Lan Tianyu
---
 drivers/net/ethernet/intel/ixgbevf/defines.h       |   6 ++
 drivers/net/ethernet/intel/ixgbevf/ixgbevf.h       |   7 +-
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c  | 115 -
 .../net/ethernet/intel/ixgbevf/self-emulation.c    | 107 +++
 4 files changed, 232 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h
index 770e21a..113efd2 100644
--- a/drivers/net/ethernet/intel/ixgbevf/defines.h
+++ b/drivers/net/ethernet/intel/ixgbevf/defines.h
@@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc {
 	__le32 mss_l4len_idx;
 };
 
+union ixgbevf_desc {
+	union ixgbe_adv_tx_desc rx_desc;
+	union ixgbe_adv_rx_desc tx_desc;
+	struct ixgbe_adv_tx_context_desc tx_context_desc;
+};
+
 /* Adv Transmit Descriptor Config Masks */
 #define IXGBE_ADVTXD_DTYP_MASK	0x00F0 /* DTYP mask */
 #define IXGBE_ADVTXD_DTYP_CTXT	0x0020 /* Advanced Context Desc */
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
index c823616..6eab402e 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h
@@ -109,7 +109,7 @@ struct ixgbevf_ring {
 	struct ixgbevf_ring *next;
 	struct net_device *netdev;
 	struct device *dev;
-	void *desc;			/* descriptor ring memory */
+	union ixgbevf_desc *desc;	/* descriptor ring memory */
 	dma_addr_t dma;			/* phys. address of descriptor ring */
 	unsigned int size;		/* length in bytes */
 	u16 count;			/* amount of descriptors */
@@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector);
 void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter);
 void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter);
 
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head);
+int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head);
+void ixgbevf_restore_state(struct ixgbevf_adapter *adapter);
+inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter);
+
 #ifdef DEBUG
 char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw);
 
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 056841c..15ec361 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function Network Driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_VERSION);
 
+
+#define MIGRATION_COMPLETED   0x00
+#define MIGRATION_IN_PROGRESS 0x01
+
 #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK)
 static int debug = -1;
 module_param(debug, int, 0);
@@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring)
 	return ring->stats.packets;
 }
 
+int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head)
+{
+	struct ixgbevf_tx_buffer *tx_buffer = NULL;
+	static union ixgbevf_desc *tx_desc = NULL;
+
+	tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count));
+	if (!tx_buffer)
+		return -ENOMEM;
+
+	tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count);
+	if (!tx_desc)
+		return -ENOMEM;
+
+	memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count);
+	memcpy(r->desc, &tx_desc[head], sizeof(union ixgbevf_desc) * (r->count - head));
+	memcpy(&r->desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head);
+
+
[RFC Patch 04/12] IXGBE: Add ixgbe_ping_vf() to notify a specified VF via mailbox msg.
This patch adds ixgbe_ping_vf() to notify a specified VF. When the
migration status changes, the VF needs to be notified of the change.
The VF driver checks the migration status when it gets a mailbox msg.

Signed-off-by: Lan Tianyu
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 19 ---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h |  1 +
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 89671eb..e247d67 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -1318,18 +1318,23 @@ void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter)
 	IXGBE_WRITE_REG(hw, IXGBE_VFRE(1), 0);
 }
 
-void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
+void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn)
 {
 	struct ixgbe_hw *hw = &adapter->hw;
 	u32 ping;
+
+	ping = IXGBE_PF_CONTROL_MSG;
+	if (adapter->vfinfo[vfn].clear_to_send)
+		ping |= IXGBE_VT_MSGTYPE_CTS;
+	ixgbe_write_mbx(hw, &ping, 1, vfn);
+}
+
+void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter)
+{
 	int i;
 
-	for (i = 0 ; i < adapter->num_vfs; i++) {
-		ping = IXGBE_PF_CONTROL_MSG;
-		if (adapter->vfinfo[i].clear_to_send)
-			ping |= IXGBE_VT_MSGTYPE_CTS;
-		ixgbe_write_mbx(hw, &ping, 1, i);
-	}
+	for (i = 0 ; i < adapter->num_vfs; i++)
+		ixgbe_ping_vf(adapter, i);
 }
 
 int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int vf, u8 *mac)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
index 2c197e6..143e2fd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.h
@@ -41,6 +41,7 @@ void ixgbe_msg_task(struct ixgbe_adapter *adapter);
 int ixgbe_vf_configuration(struct pci_dev *pdev, unsigned int event_mask);
 void ixgbe_disable_tx_rx(struct ixgbe_adapter *adapter);
 void ixgbe_ping_all_vfs(struct ixgbe_adapter *adapter);
+void ixgbe_ping_vf(struct ixgbe_adapter *adapter, int vfn);
 int ixgbe_ndo_set_vf_mac(struct net_device *netdev, int queue, u8 *mac);
 int ixgbe_ndo_set_vf_vlan(struct net_device *netdev, int queue, u16 vlan,
 			   u8 qos);
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
[RFC Patch 07/12] IXGBEVF: Add new mail box event for migration
The VF status in the PF driver needs to be restored after migration and
after the VF hardware is reset. This patch adds a new mailbox event that
the VF driver uses to notify the PF driver to restore that status.

Signed-off-by: Lan Tianyu
---
 drivers/net/ethernet/intel/ixgbevf/mbx.h |  3 +++
 drivers/net/ethernet/intel/ixgbevf/vf.c  | 10 ++
 drivers/net/ethernet/intel/ixgbevf/vf.h  |  1 +
 3 files changed, 14 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/mbx.h b/drivers/net/ethernet/intel/ixgbevf/mbx.h
index 82f44e0..22761d8 100644
--- a/drivers/net/ethernet/intel/ixgbevf/mbx.h
+++ b/drivers/net/ethernet/intel/ixgbevf/mbx.h
@@ -112,6 +112,9 @@ enum ixgbe_pfvf_api_rev {
 #define IXGBE_VF_GET_RETA	0x0a	/* VF request for RETA */
 #define IXGBE_VF_GET_RSS_KEY	0x0b	/* get RSS hash key */
 
+/* mail box event for live migration */
+#define IXGBE_VF_NOTIFY_RESUME	0x0c /* VF notify PF migration to restore status */
+
 /* length of permanent address message returned from PF */
 #define IXGBE_VF_PERMADDR_MSG_LEN	4
 /* word in permanent address message with the current multicast type */
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.c b/drivers/net/ethernet/intel/ixgbevf/vf.c
index d1339b0..1e4e5e6 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.c
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.c
@@ -717,6 +717,15 @@ int ixgbevf_get_queues(struct ixgbe_hw *hw, unsigned int *num_tcs,
 	return err;
 }
 
+static void ixgbevf_notify_resume_vf(struct ixgbe_hw *hw)
+{
+	struct ixgbe_mbx_info *mbx = &hw->mbx;
+	u32 msgbuf[1];
+
+	msgbuf[0] = IXGBE_VF_NOTIFY_RESUME;
+	mbx->ops.write_posted(hw, msgbuf, 1);
+}
+
 static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
 	.init_hw		= ixgbevf_init_hw_vf,
 	.reset_hw		= ixgbevf_reset_hw_vf,
@@ -729,6 +738,7 @@ static const struct ixgbe_mac_operations ixgbevf_mac_ops = {
 	.update_mc_addr_list	= ixgbevf_update_mc_addr_list_vf,
 	.set_uc_addr		= ixgbevf_set_uc_addr_vf,
 	.set_vfta		= ixgbevf_set_vfta_vf,
+	.notify_resume		= ixgbevf_notify_resume_vf,
 };
 
 const struct ixgbevf_info ixgbevf_82599_vf_info = {
diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h
index 6a3f4eb..a25fe81 100644
--- a/drivers/net/ethernet/intel/ixgbevf/vf.h
+++ b/drivers/net/ethernet/intel/ixgbevf/vf.h
@@ -70,6 +70,7 @@ struct ixgbe_mac_operations {
 	s32 (*disable_mc)(struct ixgbe_hw *);
 	s32 (*clear_vfta)(struct ixgbe_hw *);
 	s32 (*set_vfta)(struct ixgbe_hw *, u32, u32, bool);
+	void (*notify_resume)(struct ixgbe_hw *);
 };
 
 enum ixgbe_mac_type {
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
[RFC PATCH 1/3] Qemu: Add pci-assign.h to share functions and struct definition with new file
Signed-off-by: Lan Tianyu--- hw/i386/kvm/pci-assign.c | 111 ++- hw/i386/kvm/pci-assign.h | 109 ++ 2 files changed, 112 insertions(+), 108 deletions(-) create mode 100644 hw/i386/kvm/pci-assign.h diff --git a/hw/i386/kvm/pci-assign.c b/hw/i386/kvm/pci-assign.c index 74d22f4..616532d 100644 --- a/hw/i386/kvm/pci-assign.c +++ b/hw/i386/kvm/pci-assign.c @@ -37,112 +37,7 @@ #include "hw/pci/pci.h" #include "hw/pci/msi.h" #include "kvm_i386.h" - -#define MSIX_PAGE_SIZE 0x1000 - -/* From linux/ioport.h */ -#define IORESOURCE_IO 0x0100 /* Resource type */ -#define IORESOURCE_MEM 0x0200 -#define IORESOURCE_IRQ 0x0400 -#define IORESOURCE_DMA 0x0800 -#define IORESOURCE_PREFETCH 0x2000 /* No side effects */ -#define IORESOURCE_MEM_64 0x0010 - -//#define DEVICE_ASSIGNMENT_DEBUG - -#ifdef DEVICE_ASSIGNMENT_DEBUG -#define DEBUG(fmt, ...) \ -do { \ -fprintf(stderr, "%s: " fmt, __func__ , __VA_ARGS__); \ -} while (0) -#else -#define DEBUG(fmt, ...) -#endif - -typedef struct PCIRegion { -int type; /* Memory or port I/O */ -int valid; -uint64_t base_addr; -uint64_t size;/* size of the region */ -int resource_fd; -} PCIRegion; - -typedef struct PCIDevRegions { -uint8_t bus, dev, func; /* Bus inside domain, device and function */ -int irq;/* IRQ number */ -uint16_t region_number; /* number of active regions */ - -/* Port I/O or MMIO Regions */ -PCIRegion regions[PCI_NUM_REGIONS - 1]; -int config_fd; -} PCIDevRegions; - -typedef struct AssignedDevRegion { -MemoryRegion container; -MemoryRegion real_iomem; -union { -uint8_t *r_virtbase; /* mmapped access address for memory regions */ -uint32_t r_baseport; /* the base guest port for I/O regions */ -} u; -pcibus_t e_size;/* emulated size of region in bytes */ -pcibus_t r_size;/* real size of region in bytes */ -PCIRegion *region; -} AssignedDevRegion; - -#define ASSIGNED_DEVICE_PREFER_MSI_BIT 0 -#define ASSIGNED_DEVICE_SHARE_INTX_BIT 1 - -#define ASSIGNED_DEVICE_PREFER_MSI_MASK (1 << ASSIGNED_DEVICE_PREFER_MSI_BIT) -#define 
ASSIGNED_DEVICE_SHARE_INTX_MASK (1 << ASSIGNED_DEVICE_SHARE_INTX_BIT) - -typedef struct MSIXTableEntry { -uint32_t addr_lo; -uint32_t addr_hi; -uint32_t data; -uint32_t ctrl; -} MSIXTableEntry; - -typedef enum AssignedIRQType { -ASSIGNED_IRQ_NONE = 0, -ASSIGNED_IRQ_INTX_HOST_INTX, -ASSIGNED_IRQ_INTX_HOST_MSI, -ASSIGNED_IRQ_MSI, -ASSIGNED_IRQ_MSIX -} AssignedIRQType; - -typedef struct AssignedDevice { -PCIDevice dev; -PCIHostDeviceAddress host; -uint32_t dev_id; -uint32_t features; -int intpin; -AssignedDevRegion v_addrs[PCI_NUM_REGIONS - 1]; -PCIDevRegions real_device; -PCIINTxRoute intx_route; -AssignedIRQType assigned_irq_type; -struct { -#define ASSIGNED_DEVICE_CAP_MSI (1 << 0) -#define ASSIGNED_DEVICE_CAP_MSIX (1 << 1) -uint32_t available; -#define ASSIGNED_DEVICE_MSI_ENABLED (1 << 0) -#define ASSIGNED_DEVICE_MSIX_ENABLED (1 << 1) -#define ASSIGNED_DEVICE_MSIX_MASKED (1 << 2) -uint32_t state; -} cap; -uint8_t emulate_config_read[PCI_CONFIG_SPACE_SIZE]; -uint8_t emulate_config_write[PCI_CONFIG_SPACE_SIZE]; -int msi_virq_nr; -int *msi_virq; -MSIXTableEntry *msix_table; -hwaddr msix_table_addr; -uint16_t msix_max; -MemoryRegion mmio; -char *configfd_name; -int32_t bootindex; -} AssignedDevice; - -#define TYPE_PCI_ASSIGN "kvm-pci-assign" -#define PCI_ASSIGN(obj) OBJECT_CHECK(AssignedDevice, (obj), TYPE_PCI_ASSIGN) +#include "pci-assign.h" static void assigned_dev_update_irq_routing(PCIDevice *dev); @@ -1044,7 +939,7 @@ static bool assigned_dev_msix_masked(MSIXTableEntry *entry) * sure the physical MSI-X state tracks the guest's view, which is important * for some VF/PF and PF/fw communication channels. 
*/ -static bool assigned_dev_msix_skipped(MSIXTableEntry *entry) +bool assigned_dev_msix_skipped(MSIXTableEntry *entry) { return !entry->data; } @@ -1114,7 +1009,7 @@ static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev) return r; } -static void assigned_dev_update_msix(PCIDevice *pci_dev) +void assigned_dev_update_msix(PCIDevice *pci_dev) { AssignedDevice *assigned_dev = PCI_ASSIGN(pci_dev); uint16_t ctrl_word = pci_get_word(pci_dev->config + pci_dev->msix_cap + diff --git a/hw/i386/kvm/pci-assign.h b/hw/i386/kvm/pci-assign.h new file mode 100644 index 000..91d00ea --- /dev/null +++ b/hw/i386/kvm/pci-assign.h @@ -0,0 +1,109 @@ +#define MSIX_PAGE_SIZE 0x1000 + +/* From linux/ioport.h */ +#define IORESOURCE_IO 0x0100 /* Resource type */ +#define IORESOURCE_MEM
[RFC PATCH 0/3] Qemu/IXGBE: Add live migration support for SRIOV NIC
This patchset is the Qemu part of live migration support for SRIOV NICs.
Information about the kernel part is available at the following link:
http://marc.info/?l=kvm&m=144544635330193&w=2

Lan Tianyu (3):
  Qemu: Add pci-assign.h to share functions and struct definition with
    new file
  Qemu: Add post_load_state() to run after restoring CPU state
  Qemu: Introduce pci-sriov device type to support VF live migration

 hw/i386/kvm/Makefile.objs   |   2 +-
 hw/i386/kvm/pci-assign.c    | 113 +--
 hw/i386/kvm/pci-assign.h    | 109 +++
 hw/i386/kvm/sriov.c         | 213
 include/migration/vmstate.h |   2 +
 migration/savevm.c          |  15
 6 files changed, 344 insertions(+), 110 deletions(-)
 create mode 100644 hw/i386/kvm/pci-assign.h
 create mode 100644 hw/i386/kvm/sriov.c

-- 
1.9.3
[RFC Patch 06/12] IXGBEVF: Add self emulation layer
In order to restore VF function after migration, add self emulation layer to record regs' values during accessing regs. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbevf/Makefile| 3 ++- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++ drivers/net/ethernet/intel/ixgbevf/vf.h| 5 - 4 files changed, 33 insertions(+), 3 deletions(-) create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile b/drivers/net/ethernet/intel/ixgbevf/Makefile index 4ce4c97..841c884 100644 --- a/drivers/net/ethernet/intel/ixgbevf/Makefile +++ b/drivers/net/ethernet/intel/ixgbevf/Makefile @@ -31,7 +31,8 @@ obj-$(CONFIG_IXGBEVF) += ixgbevf.o -ixgbevf-objs := vf.o \ +ixgbevf-objs := self-emulation.o \ + vf.o \ mbx.o \ ethtool.o \ ixgbevf_main.o diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index a16d267..4446916 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg) if (IXGBE_REMOVED(reg_addr)) return IXGBE_FAILED_READ_REG; - value = readl(reg_addr + reg); + value = ixgbe_self_emul_readl(reg_addr, reg); if (unlikely(value == IXGBE_FAILED_READ_REG)) ixgbevf_check_remove(hw, reg); return value; diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c new file mode 100644 index 000..d74b2da --- /dev/null +++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c @@ -0,0 +1,26 @@ +#include +#include +#include +#include +#include + +#include "vf.h" +#include "ixgbevf.h" + +static u32 hw_regs[0x4000]; + +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr) +{ + u32 tmp; + + tmp = readl(base + addr); + hw_regs[(unsigned long)addr] = tmp; + + return tmp; +} + +void 
ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32 addr) +{ + hw_regs[(unsigned long)addr] = val; + writel(val, (volatile void __iomem *)(base + addr)); +} diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h index d40f036..6a3f4eb 100644 --- a/drivers/net/ethernet/intel/ixgbevf/vf.h +++ b/drivers/net/ethernet/intel/ixgbevf/vf.h @@ -39,6 +39,9 @@ struct ixgbe_hw; +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr); +void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32 addr); + /* iterator type for walking multicast address lists */ typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr, u32 *vmdq); @@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 reg, u32 value) if (IXGBE_REMOVED(reg_addr)) return; - writel(value, reg_addr + reg); + ixgbe_self_emul_writel(value, reg_addr, reg); } #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v) -- 1.8.4.rc0.1.g8f6a3e5.dirty -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
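The shadow-register idea in this patch can be sketched in user-space C. This is only an illustration under stated assumptions, not the driver code: `fake_bar`, `shadow_readl`, `shadow_writel` and `shadow_restore` are hypothetical names, and a plain array stands in for the MMIO window. Unlike the posted patch, which indexes `hw_regs[]` directly with the byte offset, the sketch indexes by register number and checks bounds; that is one possible hardening, not something the patch does.

```c
#include <assert.h>
#include <stdint.h>

#define REG_SPACE_BYTES 0x4000u              /* assumed size of the register window */
#define NUM_REGS (REG_SPACE_BYTES / 4u)      /* one shadow slot per 32-bit register */

static uint32_t fake_bar[NUM_REGS];          /* hypothetical stand-in for the MMIO window */
static uint32_t shadow_regs[NUM_REGS];       /* last value seen for each register */

/* Read a register and remember the value that was read. */
static uint32_t shadow_readl(uint32_t byte_off)
{
    assert(byte_off % 4u == 0 && byte_off < REG_SPACE_BYTES);
    uint32_t val = fake_bar[byte_off / 4u];
    shadow_regs[byte_off / 4u] = val;
    return val;
}

/* Record the value, then write it to the register. */
static void shadow_writel(uint32_t val, uint32_t byte_off)
{
    assert(byte_off % 4u == 0 && byte_off < REG_SPACE_BYTES);
    shadow_regs[byte_off / 4u] = val;
    fake_bar[byte_off / 4u] = val;
}

/* After a (simulated) device reset, replay the recorded value. */
static void shadow_restore(uint32_t byte_off)
{
    assert(byte_off % 4u == 0 && byte_off < REG_SPACE_BYTES);
    fake_bar[byte_off / 4u] = shadow_regs[byte_off / 4u];
}
```

Writing a register through `shadow_writel()`, clearing `fake_bar[]` to mimic a reset, and calling `shadow_restore()` brings the old value back, which is roughly the role the recorded values play after migration.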
[RFC PATCH 2/3] Qemu: Add post_load_state() to run after restoring CPU state
After migration, Qemu needs to trigger the mailbox irq to notify the VF driver
in the guest about the status change. Irq delivery only starts working again
after the CPU state has been restored. This patch adds a new callback that
runs after the CPU state is restored, which provides a way to trigger the
mailbox irq at that later point.

Signed-off-by: Lan Tianyu
---
 include/migration/vmstate.h |  2 ++
 migration/savevm.c          | 15 +++
 2 files changed, 17 insertions(+)

diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 0695d7c..dc681a6 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -56,6 +56,8 @@ typedef struct SaveVMHandlers {
     int (*save_live_setup)(QEMUFile *f, void *opaque);
     uint64_t (*save_live_pending)(QEMUFile *f, void *opaque, uint64_t max_size);
 
+    /* This runs after restoring CPU related state */
+    void (*post_load_state)(void *opaque);
     LoadStateHandler *load_state;
 } SaveVMHandlers;
 
diff --git a/migration/savevm.c b/migration/savevm.c
index 9e0e286..48b6223 100644
--- a/migration/savevm.c
+++ b/migration/savevm.c
@@ -702,6 +702,20 @@ bool qemu_savevm_state_blocked(Error **errp)
     return false;
 }
 
+void qemu_savevm_post_load(void)
+{
+    SaveStateEntry *se;
+
+    QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
+        if (!se->ops || !se->ops->post_load_state) {
+            continue;
+        }
+
+        se->ops->post_load_state(se->opaque);
+    }
+}
+
+
 void qemu_savevm_state_header(QEMUFile *f)
 {
     trace_savevm_state_header();
@@ -1140,6 +1154,7 @@ int qemu_loadvm_state(QEMUFile *f)
     }
 
     cpu_synchronize_all_post_init();
+    qemu_savevm_post_load();
 
     ret = 0;
 
-- 
1.9.3
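The optional-callback walk that the patch adds can be sketched outside Qemu roughly as follows. This is a simplified stand-in, not Qemu's code: `struct handlers`, `struct entry`, `run_post_load` and `count_hook` are hypothetical names, and a plain singly linked list replaces the QTAILQ used in savevm.c.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced handler table; the real SaveVMHandlers has more fields. */
struct handlers {
    void (*post_load_state)(void *opaque);   /* optional, may be NULL */
};

struct entry {
    const struct handlers *ops;               /* may be NULL */
    void *opaque;
    struct entry *next;
};

/* Call every registered post-load hook; return how many hooks ran. */
static int run_post_load(struct entry *head)
{
    int called = 0;

    for (struct entry *e = head; e != NULL; e = e->next) {
        if (e->ops == NULL || e->ops->post_load_state == NULL)
            continue;                         /* entry did not opt in */
        e->ops->post_load_state(e->opaque);
        called++;
    }
    return called;
}

/* Example hook: bump a counter passed through opaque. */
static void count_hook(void *opaque)
{
    assert(opaque != NULL);
    ++*(int *)opaque;
}
```

Entries without `ops` or without the hook are skipped, so existing devices would be unaffected and only a device that registers the hook (here, one that wants to raise the mailbox irq) gets called.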
[RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation
Ring shifting during restoring VF function maybe race with original ring operation(transmit/receive package). This patch is to add tx/rx lock to protect ring related data. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 2 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 --- 2 files changed, 27 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index 6eab402e..3a748c8 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -448,6 +448,8 @@ struct ixgbevf_adapter { spinlock_t mbx_lock; unsigned long last_reset; + spinlock_t mg_rx_lock; + spinlock_t mg_tx_lock; }; enum ixbgevf_state_t { diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 15ec361..04b6ce7 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring) int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) { + struct ixgbevf_adapter *adapter = netdev_priv(r->netdev); struct ixgbevf_tx_buffer *tx_buffer = NULL; static union ixgbevf_desc *tx_desc = NULL; + unsigned long flags; tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count)); if (!tx_buffer) @@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) if (!tx_desc) return -ENOMEM; + spin_lock_irqsave(>mg_tx_lock, flags); memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count); memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - head)); memcpy(>desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head); @@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) else r->next_to_use += (r->count - head); + spin_unlock_irqrestore(>mg_tx_lock, flags); + vfree(tx_buffer); vfree(tx_desc); 
return 0; @@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) { + struct ixgbevf_adapter *adapter = netdev_priv(r->netdev); struct ixgbevf_rx_buffer *rx_buffer = NULL; static union ixgbevf_desc *rx_desc = NULL; + unsigned long flags; rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count)); if (!rx_buffer) @@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) if (!rx_desc) return -ENOMEM; + spin_lock_irqsave(>mg_rx_lock, flags); memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count)); memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - head)); memcpy(>desc[r->count - head], rx_desc, sizeof(union ixgbevf_desc) * head); @@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) r->next_to_use -= head; else r->next_to_use += (r->count - head); + spin_unlock_irqrestore(>mg_rx_lock, flags); vfree(rx_buffer); vfree(rx_desc); @@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, if (test_bit(__IXGBEVF_DOWN, >state)) return true; + spin_lock(>mg_tx_lock); + i = tx_ring->next_to_clean; tx_buffer = _ring->tx_buffer_info[i]; tx_desc = IXGBEVF_TX_DESC(tx_ring, i); i -= tx_ring->count; @@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, q_vector->tx.total_bytes += total_bytes; q_vector->tx.total_packets += total_packets; + spin_unlock(>mg_tx_lock); + if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) { struct ixgbe_hw *hw = >hw; union ixgbe_adv_tx_desc *eop_desc; @@ -999,10 +1012,12 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector, struct ixgbevf_ring *rx_ring, int budget) { + struct ixgbevf_adapter *adapter = netdev_priv(rx_ring->netdev); unsigned int total_rx_bytes = 0, total_rx_packets = 0; u16 cleaned_count = ixgbevf_desc_unused(rx_ring); struct sk_buff *skb = rx_ring->skb; + spin_lock(>mg_rx_lock); 
while (likely(total_rx_packets < budget)) { union ixgbe_adv_rx_desc *rx_desc; @@ -1078,6 +1093,7 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector, q_vector->rx.total_packets += total_rx_packets; q_vector->rx.total_bytes += total_rx_bytes; +
[RFC Patch 12/12] IXGBEVF: Track dma dirty pages
Migration relies on dirty page tracking to migrate memory. Hardware can't
automatically mark a page as dirty after a DMA memory access. VF descriptor
rings and data buffers are modified by hardware when receiving and
transmitting data. To track such dirty memory manually, do dummy writes (read
a byte and write it back) while receiving and transmitting data.

Signed-off-by: Lan Tianyu
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 14 +++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index d22160f..ce7bd7a 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -414,6 +414,9 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector,
 		if (!(eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
 			break;
 
+		/* write back status to mark page dirty */
+		eop_desc->wb.status = eop_desc->wb.status;
+
 		/* clear next_to_watch to prevent false hangs */
 		tx_buffer->next_to_watch = NULL;
 		tx_buffer->desc_num = 0;
@@ -946,15 +949,17 @@ static struct sk_buff *ixgbevf_fetch_rx_buffer(struct ixgbevf_ring *rx_ring,
 {
 	struct ixgbevf_rx_buffer *rx_buffer;
 	struct page *page;
+	u8 *page_addr;
 
 	rx_buffer = &rx_ring->rx_buffer_info[rx_ring->next_to_clean];
 	page = rx_buffer->page;
 	prefetchw(page);
 
-	if (likely(!skb)) {
-		void *page_addr = page_address(page) +
-				  rx_buffer->page_offset;
+	/* Mark page dirty */
+	page_addr = page_address(page) + rx_buffer->page_offset;
+	*page_addr = *page_addr;
 
+	if (likely(!skb)) {
 		/* prefetch first cache line of first page */
 		prefetch(page_addr);
 #if L1_CACHE_BYTES < 128
@@ -1032,6 +1037,9 @@ static int ixgbevf_clean_rx_irq(struct ixgbevf_q_vector *q_vector,
 		if (!ixgbevf_test_staterr(rx_desc, IXGBE_RXD_STAT_DD))
 			break;
 
+		/* Write back status to mark page dirty */
+		rx_desc->wb.upper.status_error = rx_desc->wb.upper.status_error;
+
 		/* This memory barrier is needed to keep us from reading
 		 * any other fields out of the rx_desc until we know the
 		 * RXD_STAT_DD bit is set
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
[RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver
This patch restores the VF status in the PF driver when the PF driver gets
the resume event from the VF.

Signed-off-by: Lan Tianyu
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h       |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h   |  1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++
 3 files changed, 42 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 636f9e3..9d5669a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -148,6 +148,7 @@ struct vf_data_storage {
 	bool pf_set_mac;
 	u16 pf_vlan; /* When set, guest VLAN config not allowed. */
 	u16 pf_qos;
+	u32 vf_lpe;
 	u16 tx_rate;
 	u16 vlan_count;
 	u8 spoofchk_enabled;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
index b1e4703..8fdb38d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h
@@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev {
 
 /* mailbox API, version 1.1 VF requests */
 #define IXGBE_VF_GET_QUEUES	0x09 /* get queue configuration */
+#define IXGBE_VF_NOTIFY_RESUME	0x0c /* VF notify PF migration finishing */
 
 /* GET_QUEUES return data indices within the mailbox */
 #define IXGBE_VF_TX_QUEUES	1	/* number of Tx queues supported */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index 1d17b58..ab2a2e2 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter *adapter, u32 vf,
 	}
 }
 
+/**
+ *  Restore the settings by mailbox, after migration
+ **/
+void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf)
+{
+	struct ixgbe_hw *hw = &adapter->hw;
+	u32 reg, reg_offset, vf_shift;
+	int rar_entry = hw->mac.num_rar_entries - (vf + 1);
+
+	vf_shift = vf % 32;
+	reg_offset = vf / 32;
+
+	/* enable transmit and receive for vf */
+	reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset));
+	reg |= (1 << vf_shift);
+	IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg);
+
+	reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset));
+	reg |= (1 << vf_shift);
+	IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg);
+
+	reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset));
+	reg |= (1 << vf_shift);
+	IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg);
+
+	ixgbe_vf_reset_event(adapter, vf);
+
+	hw->mac.ops.set_rar(hw, rar_entry,
+			    adapter->vfinfo[vf].vf_mac_addresses,
+			    vf, IXGBE_RAH_AV);
+
+
+	if (adapter->vfinfo[vf].vf_lpe)
+		ixgbe_set_vf_lpe(adapter, &adapter->vfinfo[vf].vf_lpe, vf);
+}
+
 static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf)
 {
 	struct ixgbe_ring_feature *vmdq = &adapter->ring_feature[RING_F_VMDQ];
@@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf)
 		break;
 	case IXGBE_VF_SET_LPE:
 		retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf);
+		adapter->vfinfo[vf].vf_lpe = *msgbuf;
 		break;
 	case IXGBE_VF_SET_MACVLAN:
 		retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf);
@@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf)
 	case IXGBE_VF_GET_RSS_KEY:
 		retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf);
 		break;
+	case IXGBE_VF_NOTIFY_RESUME:
+		ixgbe_restore_setting(adapter, vf);
+		break;
 	default:
 		e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]);
 		retval = IXGBE_ERR_MBX;
-- 
1.8.4.rc0.1.g8f6a3e5.dirty
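The `vf % 32` / `vf / 32` arithmetic used for the per-VF enable bits can be shown in isolation. This is a sketch with hypothetical helper names (`vf_reg_index`, `vf_bit_mask`, `vf_enable`, `vf_disable`), and `MAX_VFS` is an assumed bound used only for the illustration. It builds the mask from an unsigned 1 so that bit 31 is well defined, whereas the patch shifts a signed `1`.

```c
#include <assert.h>
#include <stdint.h>

#define VFS_PER_REG 32u   /* one enable bit per VF, 32 VFs per 32-bit register */
#define MAX_VFS     64u   /* assumed upper bound, for illustration only */

/* Index of the 32-bit register that holds this VF's enable bit. */
static uint32_t vf_reg_index(uint32_t vf)
{
    assert(vf < MAX_VFS);
    return vf / VFS_PER_REG;
}

/* Mask of this VF's bit inside that register. */
static uint32_t vf_bit_mask(uint32_t vf)
{
    assert(vf < MAX_VFS);
    return UINT32_C(1) << (vf % VFS_PER_REG);
}

/* Read-modify-write helpers operating on a register value. */
static uint32_t vf_enable(uint32_t reg, uint32_t vf)
{
    return reg | vf_bit_mask(vf);
}

static uint32_t vf_disable(uint32_t reg, uint32_t vf)
{
    return reg & ~vf_bit_mask(vf);
}
```

For example, VF 37 lives in register index 1 at bit 5, so enabling it ORs `0x20` into the second register of the VFTE/VFRE group.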
[RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver
This patch is to add sysfs interface state_in_pf under sysfs directory of VF PCI device for Qemu to get and put VF status in the PF driver during migration. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 - 1 file changed, 155 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index ab2a2e2..89671eb 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter) return -ENOMEM; } +#define IXGBE_PCI_VFCOMMAND 0x4 +#define IXGBE_PCI_VFMSIXMC0x72 +#define IXGBE_SRIOV_VF_OFFSET 0x180 +#define IXGBE_SRIOV_VF_STRIDE 0x2 + +#define to_adapter(dev) ((struct ixgbe_adapter *)(pci_get_drvdata(to_pci_dev(dev)->physfn))) + +struct state_in_pf { + u16 command; + u16 msix_message_control; + struct vf_data_storage vf_data; +}; + +static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn) +{ + u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * vfn; + return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff); +} + +static ssize_t ixgbe_show_state_in_pf(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct ixgbe_adapter *adapter = to_adapter(dev); + struct pci_dev *pdev = adapter->pdev, *vdev; + struct pci_dev *vf_pdev = to_pci_dev(dev); + struct ixgbe_hw *hw = >hw; + struct state_in_pf *state = (struct state_in_pf *)buf; + int vfn = vf_pdev->virtfn_index; + u32 reg, reg_offset, vf_shift; + + /* Clear VF mac and disable VF */ + ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, vfn); + + /* Record PCI configurations */ + vdev = ixgbe_get_virtfn_dev(pdev, vfn); + if (vdev) { + pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, >command); + pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, >msix_message_control); + } + else + printk(KERN_WARNING 
"Unable to find VF device.\n"); + + /* Record states hold by PF */ + memcpy(>vf_data, >vfinfo[vfn], sizeof(struct vf_data_storage)); + + vf_shift = vfn % 32; + reg_offset = vfn / 32; + + reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg); + + return sizeof(struct state_in_pf); +} + +static ssize_t ixgbe_store_state_in_pf(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct ixgbe_adapter *adapter = to_adapter(dev); + struct pci_dev *pdev = adapter->pdev, *vdev; + struct pci_dev *vf_pdev = to_pci_dev(dev); + struct state_in_pf *state = (struct state_in_pf *)buf; + int vfn = vf_pdev->virtfn_index; + + /* Check struct size */ + if (count != sizeof(struct state_in_pf)) { + printk(KERN_ERR "State in PF size does not fit.\n"); + goto out; + } + + /* Restore PCI configurations */ + vdev = ixgbe_get_virtfn_dev(pdev, vfn); + if (vdev) { + pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, state->command); + pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, state->msix_message_control); + } + + /* Restore states hold by PF */ + memcpy(>vfinfo[vfn], >vf_data, sizeof(struct vf_data_storage)); + + out: + return count; +} + +static struct device_attribute ixgbe_per_state_in_pf_attribute = + __ATTR(state_in_pf, S_IRUGO | S_IWUSR, + ixgbe_show_state_in_pf, ixgbe_store_state_in_pf); + +void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter) +{ + struct pci_dev *pdev = adapter->pdev; + struct pci_dev *vfdev; + unsigned short vf_id; + int pos, ret; + + pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV); + if (!pos) + return; + + /* get the device ID for the VF */ + pci_read_config_word(pdev, pos + 
PCI_SRIOV_VF_DID, _id); + + vfdev = pci_get_device(pdev->vendor, vf_id, NULL); + + while (vfdev) { + if (vfdev->is_virtfn) { + ret = device_create_file(>dev, + _per_state_in_pf_attribute); + if (ret) +
Re: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally
On Wednesday 21 October 2015 15:58:44 Christoffer Dall wrote:
> On Wed, Oct 21, 2015 at 03:45:20PM +0200, Arnd Bergmann wrote:
> > On Tuesday 20 October 2015 15:51:05 Paolo Bonzini wrote:
> > > Should this be "select" or "depends on"?  Not a blocker, can always be
> > > fixed in 4.4.
> >
> > We have lots of 'select ARM_GIC' in the tree for platforms that use one,
> > using 'depends on' will limit KVM support to being available only if at
> > least one of them is being used.
> >
> > The only platform I can think of that uses ARMv7ve without actually having
> > a GIC is BCM2836 (Raspberry Pi 2). Can we actually run KVM on a platform
> > like that? If so, 'depends on' might be better, otherwise let's stay with
> > 'select'.
>
> Yes you can, just without the VGIC and the timer - you have to emulate
> that in userspace.  Samsung also has a broken platform where they
> integrated things incorrectly, so you cannot use the VGIC, but that
> platform support is out of tree, so I can't see if it uses the GIC in
> general or not.

Ok, my patch should be fine then.

> I'm a bit confused why using 'depends on' in this case helps anything?
>
> (I know, I suck at dealing with the config system)

Generally speaking, 'select' causes more problems than 'depends on',
in particular when you get conflicting requirements (A selects B,
B depends on C, but A can be enabled without C).

However, symbols that only have 'select' and no 'depends on', and also
are not user-visible, are not problematic. This is the case here.

	Arnd
Re: sanitizing kvmtool
On 10/19/2015 11:15 AM, Dmitry Vyukov wrote:
> On Mon, Oct 19, 2015 at 5:08 PM, Sasha Levin wrote:
>> On 10/19/2015 10:47 AM, Dmitry Vyukov wrote:
>>>> Right, the memory areas that are accessed both by the hypervisor and
>>>> the guest should be treated as untrusted input, but the hypervisor is
>>>> supposed to validate the input carefully before using it - so I'm not
>>>> sure how data races would introduce anything new that we didn't catch
>>>> during validation.
>>>
>>> One possibility would be: if result of a racy read is passed to guest,
>>> that can leak arbitrary host data into guest. Does not sound good.
>>> Also, without usage of proper atomic operations, it is basically
>>> impossible to verify untrusted data, as it can be changing under your
>>> feet. And storing data into a local variable does not prevent the data
>>> from changing.
>>
>> What's missing here is that the guest doesn't directly read/write the
>> memory: every time it accesses a memory that is shared with the host it
>> will trigger an exit, which will stop the vcpu thread that made the
>> access and kernel side kvm will pass the hypervisor the value the guest
>> wrote (or the memory address it attempted to read). The value/address
>> can't change under us in that scenario.
>
> But still: if result of a racy read is passed to guest, that can leak
> arbitrary host data into guest.

I see what you're saying. I need to think about it a bit, maybe we do need
locking for each of the virtio devices we emulate.


On an unrelated note, a few of the reports are pointing to
ioport__unregister():

==================
WARNING: ThreadSanitizer: data race (pid=109228)
  Write of size 8 at 0x7d1cdf40 by main thread:
    #0 free tsan/rtl/tsan_interceptors.cc:570 (lkvm+0x00443376)
    #1 ioport__unregister ioport.c:138:2 (lkvm+0x004a9ff9)
    #2 pci__exit pci.c:247:2 (lkvm+0x004ac857)
    #3 init_list__exit util/init.c:59:8 (lkvm+0x004bca6e)
    #4 kvm_cmd_run_exit builtin-run.c:645:2 (lkvm+0x004a68a7)
    #5 kvm_cmd_run builtin-run.c:661 (lkvm+0x004a68a7)
    #6 handle_command kvm-cmd.c:84:8 (lkvm+0x004bc40c)
    #7 handle_kvm_command main.c:11:9 (lkvm+0x004ac0b4)
    #8 main main.c:18 (lkvm+0x004ac0b4)

  Previous read of size 8 at 0x7d1cdf40 by thread T55:
    #0 rb_int_search_single util/rbtree-interval.c:14:17 (lkvm+0x004bf968)
    #1 ioport_search ioport.c:41:9 (lkvm+0x004aa05f)
    #2 kvm__emulate_io ioport.c:186 (lkvm+0x004aa05f)
    #3 kvm_cpu__emulate_io x86/include/kvm/kvm-cpu-arch.h:41:9 (lkvm+0x004aa718)
    #4 kvm_cpu__start kvm-cpu.c:126 (lkvm+0x004aa718)
    #5 kvm_cpu_thread builtin-run.c:174:6 (lkvm+0x004a6e3e)

  Thread T55 'kvm-vcpu-2' (tid=109285, finished) created by main thread at:
    #0 pthread_create tsan/rtl/tsan_interceptors.cc:848 (lkvm+0x004478a3)
    #1 kvm_cmd_run_work builtin-run.c:633:7 (lkvm+0x004a683f)
    #2 kvm_cmd_run builtin-run.c:660 (lkvm+0x004a683f)
    #3 handle_command kvm-cmd.c:84:8 (lkvm+0x004bc40c)
    #4 handle_kvm_command main.c:11:9 (lkvm+0x004ac0b4)
    #5 main main.c:18 (lkvm+0x004ac0b4)

SUMMARY: ThreadSanitizer: data race ioport.c:138:2 in ioport__unregister
==================

I think this is because we don't perform locking using pthread, but rather
pause the vm entirely - so the cpu threads it's pointing to aren't actually
running when we unregister ioports. Is there a way to annotate that for tsan?


Thanks,
Sasha
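Dmitry's point about validating data that can change underneath the reader can be illustrated with a small "fetch once, validate the snapshot" sketch in C11. It is not kvmtool code: `struct shared_req`, `RING_BYTES` and `copy_request` are hypothetical, and the sketch only protects the length field. The payload bytes could still be modified concurrently by the other side, which is a separate problem.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_BYTES 64u

/* Hypothetical region shared with an untrusted writer. */
struct shared_req {
    _Atomic uint32_t len;        /* may be rewritten by the other side at any time */
    uint8_t data[RING_BYTES];
};

/*
 * Copy a request out of the shared region.
 * Returns the number of bytes copied, or -1 if the length is invalid.
 */
static int copy_request(struct shared_req *req, uint8_t *dst, size_t dst_size)
{
    assert(req != NULL && dst != NULL);

    /* Fetch the length exactly once; validate and use only this snapshot. */
    uint32_t len = atomic_load_explicit(&req->len, memory_order_acquire);

    if (len > RING_BYTES || len > dst_size)
        return -1;

    memcpy(dst, req->data, len);
    return (int)len;
}
```

The design choice is that the checked value and the used value are guaranteed to be the same one, so a later change to `req->len` cannot turn a validated length into an out-of-bounds copy.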
Re: Difference between vcpu_load and kvm_sched_in ?
Paolo Bonzini writes:
>
> On 21/10/2015 12:17, Hebbal Yacine wrote:
> > Thanks for the explanation, it's very clear.
> > I tried that but I didn't succeed in sending the ioctl from the
> > "run_on_cpu" function, I didn't find how to set the right CPUState.
> > I've tried "current_cpu"
>
> Current_cpu is always NULL outside the VCPU thread.
>
> > kvm_main.c:
> >
> > // yacine.begin
> >
> > static void do_vmi_start_kvm_ioctl(void *type) {
> >     printf("do_vmi_start_kvm_ioctl\n");
> >     kvm_vm_ioctl(kvm_state, type);

//yacine.begin
int hmp_vmi_op_result = 0;

static void do_vmi_kvm_ioctl(void *type_ioctl)
{
    int *type = (int *) type_ioctl;
    hmp_vmi_op_result = kvm_vcpu_ioctl(current_cpu, *type);
    //hmp_vmi_start_result = kvm_vm_ioctl(kvm_state, *type);
}

int vmi_kvm_ioctl(int type)
{
    CPUState *cpu;
    CPU_FOREACH(cpu) {
        run_on_cpu(cpu, do_vmi_kvm_ioctl, &type);
    }
    return hmp_vmi_op_result;
}
//yacine.end

Yes, it works perfectly this way even when running multiple VCPUs, thank you
a lot :)
In fact, I was using an old version of qemu (1.5.x), and it doesn't have
CPU_FOREACH. I searched a little for something to replace it, but without
any luck. So I upgraded my working version and now everything is cool.

> Are you sure you want a VM ioctl and not a VCPU ioctl?  Or perhaps a VM
> ioctl to do generic processing, and a VCPU ioctl that is then sent to
> all VCPUs?
> If you use a VCPU ioctl, you can use CPU_FOREACH or a for loop to
> iterate over all VCPUs.

In fact, I get the same result when using vm_ioctl or vcpu_ioctl.
If I correctly understood your last paragraph, it is better to use vm_ioctl
to do generic processing that doesn't rely on a given VCPU, and hence I won't
need to use "CPU_FOREACH, run_on_cpu and current_cpu".

Thanks again :)

>
> Paolo
Re: [RFC Patch 01/12] PCI: Add virtfn_index for struct pci_device
On 10/21/2015 09:37 AM, Lan Tianyu wrote:
> Add "virtfn_index" member in the struct pci_device to record VF sequence
> of PF. This will be used in the VF sysfs node handle.
>
> Signed-off-by: Lan Tianyu
> ---
>  drivers/pci/iov.c   | 1 +
>  include/linux/pci.h | 1 +
>  2 files changed, 2 insertions(+)
>
> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> index ee0ebff..065b6bb 100644
> --- a/drivers/pci/iov.c
> +++ b/drivers/pci/iov.c
> @@ -136,6 +136,7 @@ static int virtfn_add(struct pci_dev *dev, int id, int reset)
>  	virtfn->physfn = pci_dev_get(dev);
>  	virtfn->is_virtfn = 1;
>  	virtfn->multifunction = 0;
> +	virtfn->virtfn_index = id;
>
>  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>  		res = &dev->resource[i + PCI_IOV_RESOURCES];
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 353db8d..85c5531 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -356,6 +356,7 @@ struct pci_dev {
>  	unsigned int	io_window_1k:1;	/* Intel P2P bridge 1K I/O windows */
>  	unsigned int	irq_managed:1;
>  	pci_dev_flags_t dev_flags;
> +	unsigned int	virtfn_index;
>  	atomic_t	enable_cnt;	/* pci_enable_device has been called */
>
>  	u32		saved_config_space[16]; /* config space saved at suspend time */

Can't you just calculate the VF index based on the VF BDF number
combined with the information in the PF BDF number and VF offset/stride?
Seems kind of pointless to add a variable that is only used by one
driver and is in a slowpath when you can just calculate it pretty quickly.

- Alex
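The calculation suggested above can be sketched roughly as follows. This is only an illustration, assuming the usual SR-IOV layout in which zero-based VF n sits at routing ID PF RID + First VF Offset + n * VF Stride; the helper names pci_rid() and vf_index_from_rid() are made up for this sketch and are not existing kernel functions.

```c
#include <stdint.h>

/* Pack a PCI bus number and devfn into a 16-bit routing ID (RID). */
static uint16_t pci_rid(uint8_t bus, uint8_t devfn)
{
	return (uint16_t)(((uint16_t)bus << 8) | devfn);
}

/*
 * Hypothetical helper: recover the zero-based VF index from routing IDs.
 * Assumes VF n lives at pf_rid + offset + n * stride.
 * Returns -1 when vf_rid does not land on a VF slot for this PF.
 */
static int vf_index_from_rid(uint16_t pf_rid, uint16_t vf_rid,
			     uint16_t offset, uint16_t stride)
{
	int delta = (int)vf_rid - (int)pf_rid - (int)offset;

	if (stride == 0 || delta < 0 || delta % stride != 0)
		return -1;
	return delta / stride;
}
```

With the offset 0x180 and stride 0x2 that patch 03/12 of this series hard-codes for ixgbe, a PF at 01:00.0 (RID 0x0100) and a VF at 02:10.6 (RID 0x0286) give (0x0286 - 0x0100 - 0x180) / 2, that is index 3.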
Re: [RFC PATCH 0/3] Qemu/IXGBE: Add live migration support for SRIOV NIC
On Thu, 2015-10-22 at 00:52 +0800, Lan Tianyu wrote:
> This patchset is Qemu part for live migration support for SRIOV NIC.
> kernel part patch information is in the following link.
> http://marc.info/?l=kvm&m=144544635330193&w=2
>
>
> Lan Tianyu (3):
>   Qemu: Add pci-assign.h to share functions and struct definition with
>     new file
>   Qemu: Add post_load_state() to run after restoring CPU state
>   Qemu: Introduce pci-sriov device type to support VF live migration
>
>  hw/i386/kvm/Makefile.objs   |   2 +-
>  hw/i386/kvm/pci-assign.c    | 113 +--
>  hw/i386/kvm/pci-assign.h    | 109 +++
>  hw/i386/kvm/sriov.c         | 213
>  include/migration/vmstate.h |   2 +
>  migration/savevm.c          |  15
>  6 files changed, 344 insertions(+), 110 deletions(-)
>  create mode 100644 hw/i386/kvm/pci-assign.h
>  create mode 100644 hw/i386/kvm/sriov.c
>

Hi Lan,

Seems like there are a couple immediate problems with this approach.
The first is that you're modifying legacy KVM device assignment, which
is deprecated upstream and not even enabled by some distros. VFIO is
the supported mechanism for doing PCI device assignment now and any
features like this need to be added there first. It's not only more
secure than legacy KVM device assignment, but it also doesn't limit this
to an x86-only solution. Surely you want to support 82599 VF migration
on other platforms as well.

Using sysfs to interact with the PF is also problematic since that means
that libvirt needs to grant qemu access to these files, adding one more
layer to the stack. If we were to use VFIO, we could potentially enable
this through a save-state region on the device file descriptor and if
necessary, virtual interrupt channels for the device as well. This of
course implies that the kernel internal channels are made as general as
possible in order to support any PF driver.

That said, there are some nice features here. Using unused PCI config
bytes to communicate with the guest driver and enable guest-based page
dirtying is a nice hack. However, if we want to add this capability to
other devices, we're not always going to be able to use fixed addresses
0xf0 and 0xf1. I would suggest that we probably want to create a
virtual capability in the config space of the VF, perhaps a Vendor
Specific capability. Obviously some devices won't have room for a full
capability in the standard config space, so we may need to optionally
expose it in extended config space. Those devices would be limited to
only supporting migration in PCI-e configurations in the guest. Also,
plenty of devices make use of undefined PCI config space, so we may not
be able to simply add a capability to a region we think is unused, maybe
it needs to happen through reserved space in another capability or
perhaps defining a virtual BAR that unenlightened guest drivers would
ignore. The point is that we somehow need to standardize this, rather
than implicitly knowing that it's at 0xf0/0xf1 on 82599 VFs.

Also, I haven't looked at the kernel-side patches yet, but the saved
state received from and loaded into the PF driver needs to be versioned
and maybe we need some way to know whether versions are compatible.
Migration version information is difficult enough for QEMU, it's a
completely foreign concept in the kernel. Thanks,

Alex
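To make the "Vendor Specific capability" idea concrete, here is a minimal sketch of how a guest driver could locate such a capability by walking the standard capability list instead of relying on fixed offsets. It works on a plain 256-byte copy of config space; the register offsets and the 0x09 capability ID follow the PCI specification (and the names mirror Linux's pci_regs.h), while find_capability() itself is a hypothetical helper written for this illustration.

```c
#include <stdint.h>

#define PCI_STATUS		0x06	/* status register, low byte */
#define PCI_STATUS_CAP_LIST	0x10	/* device implements a capability list */
#define PCI_CAPABILITY_LIST	0x34	/* offset of the first capability */
#define PCI_CAP_ID_VNDR		0x09	/* Vendor Specific capability ID */

/*
 * Hypothetical helper: walk the capability list in a 256-byte snapshot of
 * config space and return the offset of the first capability with cap_id,
 * or 0 if there is none.
 */
static uint8_t find_capability(const uint8_t cfg[256], uint8_t cap_id)
{
	uint8_t pos;
	int ttl = 48;	/* bound the walk in case the list loops */

	if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
		return 0;

	pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;
	while (pos >= 0x40 && ttl-- > 0) {
		if (cfg[pos] == cap_id)
			return pos;
		pos = cfg[pos + 1] & 0xfc;	/* next-capability pointer */
	}
	return 0;
}
```

A guest driver would then look up PCI_CAP_ID_VNDR and use whatever offset comes back, so the migration bytes could live at 0xf0 on one device and somewhere else on another.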
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu wrote:
> This patchset is to propose a new solution to add live migration support
> for 82599 SRIOV network card.

> In our solution, we prefer to put all device specific operation into VF and
> PF driver and make code in the Qemu more general.

[...]

> Service down time test
> So far, we tested migration between two laptops with 82599 nic which
> are connected to a gigabit switch. Ping VF in the 0.001s interval
> during migration on the host of source side. It service down
> time is about 180ms.

So... what would you expect service down wise for the following
solution which is zero touch and I think should work for any VF
driver:

on host A: unplug the VM and conduct live migration to host B ala the
no-SRIOV case.

on host B:

when the VM "gets back to live", probe a VF there with the same assigned mac

next, udev on the VM will call the VF driver to create netdev instance

DHCP client would run to get the same IP address

+ under config directive (or from Qemu) send Gratuitous ARP to notify
the switch/es on the new location for that mac.

Or.
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On Wed, 2015-10-21 at 21:45 +0300, Or Gerlitz wrote:
> On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu wrote:
> > This patchset is to propose a new solution to add live migration support
> > for 82599 SRIOV network card.
>
> > In our solution, we prefer to put all device specific operation into VF and
> > PF driver and make code in the Qemu more general.
>
> [...]
>
> > Service down time test
> > So far, we tested migration between two laptops with 82599 nic which
> > are connected to a gigabit switch. Ping VF in the 0.001s interval
> > during migration on the host of source side. It service down
> > time is about 180ms.
>
> So... what would you expect service down wise for the following
> solution which is zero touch and I think should work for any VF
> driver:
>
> on host A: unplug the VM and conduct live migration to host B ala the
> no-SRIOV case.

The trouble here is that the VF needs to be unplugged prior to the start
of migration because we can't do effective dirty page tracking while the
device is connected and doing DMA. So the downtime, assuming we're
counting only VF connectivity, is dependent on memory size, rate of
dirtying, and network bandwidth; seconds for small guests, minutes or
more (maybe much, much more) for large guests.

This is why the typical VF agnostic approach here is to use bonding
and fail over to an emulated device during migration, so performance
suffers, but downtime is something acceptable.

If we want the ability to defer the VF unplug until just before the
final stages of the migration, we need the VF to participate in dirty
page tracking. Here it's done via an enlightened guest driver. Alex
Graf presented a solution using a device specific enlightenment in QEMU.
Otherwise we'd need hardware support from the IOMMU. Thanks,

Alex

> on host B:
>
> when the VM "gets back to live", probe a VF there with the same assigned mac
>
> next, udev on the VM will call the VF driver to create netdev instance
>
> DHCP client would run to get the same IP address
>
> + under config directive (or from Qemu) send Gratuitous ARP to notify
> the switch/es on the new location for that mac.
>
> Or.
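The dependence on memory size, dirty rate and bandwidth described above can be put into numbers with a toy pre-copy model. This is a sketch under stated assumptions, not QEMU's actual migration algorithm: each round resends what was dirtied during the previous round, and the guest is stopped once the remainder drops below a threshold. The function name precopy_seconds() and all of its parameters are invented for this illustration.

```c
/*
 * Toy pre-copy model (an assumption for illustration, not QEMU's algorithm).
 * Round 0 sends all of memory; each later round sends what was dirtied while
 * the previous round was in flight. Once the remainder is <= stop_bytes it is
 * sent in one final pass. Returns the total time in seconds, or -1.0 if the
 * model does not converge within max_rounds.
 */
static double precopy_seconds(double mem_bytes, double dirty_rate,
			      double bandwidth, double stop_bytes,
			      int max_rounds)
{
	double remaining = mem_bytes;
	double total = 0.0;
	int round;

	if (bandwidth <= 0.0)
		return -1.0;

	for (round = 0; round < max_rounds; round++) {
		double t = remaining / bandwidth;

		total += t;
		remaining = dirty_rate * t;	/* dirtied during this round */
		if (remaining > mem_bytes)
			remaining = mem_bytes;	/* cannot dirty more than exists */
		if (remaining <= stop_bytes)
			return total + remaining / bandwidth;
	}
	return -1.0;
}
```

If the VF has to be unplugged before the first round, the whole returned time counts as VF downtime in this model; if the guest dirties memory as fast as the link can carry it, the model never converges, which corresponds to the "much, much more" case.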
Re: [RFC Patch 03/12] IXGBE: Add sysfs interface for Qemu to migrate VF status in the PF driver
On 10/21/2015 09:37 AM, Lan Tianyu wrote: This patch is to add sysfs interface state_in_pf under sysfs directory of VF PCI device for Qemu to get and put VF status in the PF driver during migration. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 156 - 1 file changed, 155 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index ab2a2e2..89671eb 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -124,6 +124,157 @@ static int __ixgbe_enable_sriov(struct ixgbe_adapter *adapter) return -ENOMEM; } +#define IXGBE_PCI_VFCOMMAND 0x4 +#define IXGBE_PCI_VFMSIXMC0x72 +#define IXGBE_SRIOV_VF_OFFSET 0x180 +#define IXGBE_SRIOV_VF_STRIDE 0x2 + +#define to_adapter(dev) ((struct ixgbe_adapter *)(pci_get_drvdata(to_pci_dev(dev)->physfn))) + +struct state_in_pf { + u16 command; + u16 msix_message_control; + struct vf_data_storage vf_data; +}; + +static struct pci_dev *ixgbe_get_virtfn_dev(struct pci_dev *pdev, int vfn) +{ + u16 rid = pdev->devfn + IXGBE_SRIOV_VF_OFFSET + IXGBE_SRIOV_VF_STRIDE * vfn; + return pci_get_bus_and_slot(pdev->bus->number + (rid >> 8), rid & 0xff); +} + +static ssize_t ixgbe_show_state_in_pf(struct device *dev, + struct device_attribute *attr, char *buf) +{ + struct ixgbe_adapter *adapter = to_adapter(dev); + struct pci_dev *pdev = adapter->pdev, *vdev; + struct pci_dev *vf_pdev = to_pci_dev(dev); + struct ixgbe_hw *hw = >hw; + struct state_in_pf *state = (struct state_in_pf *)buf; + int vfn = vf_pdev->virtfn_index; + u32 reg, reg_offset, vf_shift; + + /* Clear VF mac and disable VF */ + ixgbe_del_mac_filter(adapter, adapter->vfinfo[vfn].vf_mac_addresses, vfn); + + /* Record PCI configurations */ + vdev = ixgbe_get_virtfn_dev(pdev, vfn); + if (vdev) { + pci_read_config_word(vdev, IXGBE_PCI_VFCOMMAND, >command); + pci_read_config_word(vdev, IXGBE_PCI_VFMSIXMC, 
>msix_message_control); + } + else + printk(KERN_WARNING "Unable to find VF device.\n"); + Formatting for the if/else is incorrect. The else condition should be in brackets as well. + /* Record states hold by PF */ + memcpy(>vf_data, >vfinfo[vfn], sizeof(struct vf_data_storage)); + + vf_shift = vfn % 32; + reg_offset = vfn / 32; + + reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset)); + reg &= ~(1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg); + + return sizeof(struct state_in_pf); +} + This is a read. Why does it need to switch off the VF? Also why turn of the anti-spoof, it doesn't make much sense. +static ssize_t ixgbe_store_state_in_pf(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct ixgbe_adapter *adapter = to_adapter(dev); + struct pci_dev *pdev = adapter->pdev, *vdev; + struct pci_dev *vf_pdev = to_pci_dev(dev); + struct state_in_pf *state = (struct state_in_pf *)buf; + int vfn = vf_pdev->virtfn_index; + + /* Check struct size */ + if (count != sizeof(struct state_in_pf)) { + printk(KERN_ERR "State in PF size does not fit.\n"); + goto out; + } + + /* Restore PCI configurations */ + vdev = ixgbe_get_virtfn_dev(pdev, vfn); + if (vdev) { + pci_write_config_word(vdev, IXGBE_PCI_VFCOMMAND, state->command); + pci_write_config_word(vdev, IXGBE_PCI_VFMSIXMC, state->msix_message_control); + } + + /* Restore states hold by PF */ + memcpy(>vfinfo[vfn], >vf_data, sizeof(struct vf_data_storage)); + + out: + return count; +} Just doing a memcpy to move the vfinfo over adds no value. The fact is there are a number of filters that have to be configured in hardware after, and it isn't as simple as just migrating the values stored. 
As I mentioned in the case of the 82598 there are also jumbo frames to take into account. If the first PF didn't have them enabled but the second one does, that implies the state of the VF needs to change to account for that. I really think you would be better off only migrating the data related to what can be configured using the ip link command and leaving other values such as clear_to_send at the reset value of 0. Then
Re: [RFC Patch 05/12] IXGBE: Add new sysfs interface of "notify_vf"
On 10/21/2015 09:37 AM, Lan Tianyu wrote: This patch is to add new sysfs interface of "notify_vf" under sysfs directory of VF PCI device for Qemu to notify VF when migration status is changed. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 30 ++ drivers/net/ethernet/intel/ixgbe/ixgbe_type.h | 4 2 files changed, 34 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index e247d67..5cc7817 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -217,10 +217,37 @@ static ssize_t ixgbe_store_state_in_pf(struct device *dev, return count; } +static ssize_t ixgbe_store_notify_vf(struct device *dev, + struct device_attribute *attr, + const char *buf, size_t count) +{ + struct ixgbe_adapter *adapter = to_adapter(dev); + struct ixgbe_hw *hw = >hw; + struct pci_dev *vf_pdev = to_pci_dev(dev); + int vfn = vf_pdev->virtfn_index; + u32 ivar; + + /* Enable VF mailbox irq first */ + IXGBE_WRITE_REG(hw, IXGBE_PVTEIMS(vfn), 0x4); + IXGBE_WRITE_REG(hw, IXGBE_PVTEIAM(vfn), 0x4); + IXGBE_WRITE_REG(hw, IXGBE_PVTEIAC(vfn), 0x4); + + ivar = IXGBE_READ_REG(hw, IXGBE_PVTIVAR_MISC(vfn)); + ivar &= ~0xFF; + ivar |= 0x2 | IXGBE_IVAR_ALLOC_VAL; + IXGBE_WRITE_REG(hw, IXGBE_PVTIVAR_MISC(vfn), ivar); + + ixgbe_ping_vf(adapter, vfn); + return count; +} + NAK, this won't fly. You can't just go in from the PF and enable interrupts on the VF hoping they are configured well enough to handle an interrupt you decide to trigger from them. Also have you even considered the MSI-X configuration on the VF? I haven't seen anything anywhere that would have migrated the VF's MSI-X configuration from BAR 3 on one system to the new system. 
static struct device_attribute ixgbe_per_state_in_pf_attribute = __ATTR(state_in_pf, S_IRUGO | S_IWUSR, ixgbe_show_state_in_pf, ixgbe_store_state_in_pf); +static struct device_attribute ixgbe_per_notify_vf_attribute = + __ATTR(notify_vf, S_IWUSR, NULL, ixgbe_store_notify_vf); + void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter) { struct pci_dev *pdev = adapter->pdev; @@ -241,6 +268,8 @@ void ixgbe_add_vf_attrib(struct ixgbe_adapter *adapter) if (vfdev->is_virtfn) { ret = device_create_file(>dev, _per_state_in_pf_attribute); + ret |= device_create_file(>dev, + _per_notify_vf_attribute); if (ret) pr_warn("Unable to add VF attribute for dev %s,\n", dev_name(>dev)); @@ -269,6 +298,7 @@ void ixgbe_remove_vf_attrib(struct ixgbe_adapter *adapter) while (vfdev) { if (vfdev->is_virtfn) { device_remove_file(>dev, _per_state_in_pf_attribute); + device_remove_file(>dev, _per_notify_vf_attribute); } vfdev = pci_get_device(pdev->vendor, vf_id, vfdev); More driver specific sysfs. This needs to be moved out of the driver if this is to be considered anything more than a proof of concept. 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h index dd6ba59..c6ddb66 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h @@ -2302,6 +2302,10 @@ enum { #define IXGBE_PVFTDT(P) (0x06018 + (0x40 * (P))) #define IXGBE_PVFTDWBAL(P)(0x06038 + (0x40 * (P))) #define IXGBE_PVFTDWBAH(P)(0x0603C + (0x40 * (P))) +#define IXGBE_PVTEIMS(P) (0x00D00 + (4 * (P))) +#define IXGBE_PVTIVAR_MISC(P) (0x04E00 + (4 * (P))) +#define IXGBE_PVTEIAC(P) (0x00F00 + (4 * P)) +#define IXGBE_PVTEIAM(P) (0x04D00 + (4 * P)) #define IXGBE_PVFTDWBALn(q_per_pool, vf_number, vf_q_index) \ (IXGBE_PVFTDWBAL((q_per_pool)*(vf_number) + (vf_q_index))) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC Patch 06/12] IXGBEVF: Add self emulation layer
On 10/21/2015 09:37 AM, Lan Tianyu wrote: In order to restore VF function after migration, add self emulation layer to record regs' values during accessing regs. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbevf/Makefile| 3 ++- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +- .../net/ethernet/intel/ixgbevf/self-emulation.c| 26 ++ drivers/net/ethernet/intel/ixgbevf/vf.h| 5 - 4 files changed, 33 insertions(+), 3 deletions(-) create mode 100644 drivers/net/ethernet/intel/ixgbevf/self-emulation.c diff --git a/drivers/net/ethernet/intel/ixgbevf/Makefile b/drivers/net/ethernet/intel/ixgbevf/Makefile index 4ce4c97..841c884 100644 --- a/drivers/net/ethernet/intel/ixgbevf/Makefile +++ b/drivers/net/ethernet/intel/ixgbevf/Makefile @@ -31,7 +31,8 @@ obj-$(CONFIG_IXGBEVF) += ixgbevf.o -ixgbevf-objs := vf.o \ +ixgbevf-objs := self-emulation.o \ + vf.o \ mbx.o \ ethtool.o \ ixgbevf_main.o diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index a16d267..4446916 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -156,7 +156,7 @@ u32 ixgbevf_read_reg(struct ixgbe_hw *hw, u32 reg) if (IXGBE_REMOVED(reg_addr)) return IXGBE_FAILED_READ_REG; - value = readl(reg_addr + reg); + value = ixgbe_self_emul_readl(reg_addr, reg); if (unlikely(value == IXGBE_FAILED_READ_REG)) ixgbevf_check_remove(hw, reg); return value; diff --git a/drivers/net/ethernet/intel/ixgbevf/self-emulation.c b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c new file mode 100644 index 000..d74b2da --- /dev/null +++ b/drivers/net/ethernet/intel/ixgbevf/self-emulation.c @@ -0,0 +1,26 @@ +#include +#include +#include +#include +#include + +#include "vf.h" +#include "ixgbevf.h" + +static u32 hw_regs[0x4000]; + +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr) +{ + u32 tmp; + + tmp = readl(base + addr); + hw_regs[(unsigned long)addr] = tmp; + + return 
tmp; +} + +void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32 addr) +{ + hw_regs[(unsigned long)addr] = val; + writel(val, (volatile void __iomem *)(base + addr)); +} So I see what you are doing, however I don't think this adds much value. Many of the key registers for the device are not simple Read/Write registers. Most of them are things like write 1 to clear or some other sort of value where writing doesn't set the bit but has some other side effect. Just take a look through the Datasheet at registers such as the VFCTRL, VFMAILBOX, or most of the interrupt registers. The fact is simply storing the values off doesn't give you any real idea of what the state of things are. diff --git a/drivers/net/ethernet/intel/ixgbevf/vf.h b/drivers/net/ethernet/intel/ixgbevf/vf.h index d40f036..6a3f4eb 100644 --- a/drivers/net/ethernet/intel/ixgbevf/vf.h +++ b/drivers/net/ethernet/intel/ixgbevf/vf.h @@ -39,6 +39,9 @@ struct ixgbe_hw; +u32 ixgbe_self_emul_readl(volatile void __iomem *base, u32 addr); +void ixgbe_self_emul_writel(u32 val, volatile void __iomem *base, u32 addr); + /* iterator type for walking multicast address lists */ typedef u8* (*ixgbe_mc_addr_itr) (struct ixgbe_hw *hw, u8 **mc_addr_ptr, u32 *vmdq); @@ -182,7 +185,7 @@ static inline void ixgbe_write_reg(struct ixgbe_hw *hw, u32 reg, u32 value) if (IXGBE_REMOVED(reg_addr)) return; - writel(value, reg_addr + reg); + ixgbe_self_emul_writel(value, reg_addr, reg); } #define IXGBE_WRITE_REG(h, r, v) ixgbe_write_reg(h, r, v) -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
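The objection about write-1-to-clear registers can be illustrated with a small model. This is only a sketch: struct w1c_reg and w1c_write() are invented for this example and simulate a single W1C register next to the kind of shadow copy that hw_regs[] would hold; no real ixgbevf register is modeled.

```c
#include <stdint.h>

/* Simulated write-1-to-clear register plus a naive software shadow. */
struct w1c_reg {
	uint32_t hw;		/* what the device actually holds */
	uint32_t shadow;	/* what a "store the last written value" layer records */
};

/*
 * Hypothetical model of a write to a W1C register: every bit set in val
 * clears the matching hardware bit, while the shadow simply remembers val.
 */
static void w1c_write(struct w1c_reg *r, uint32_t val)
{
	r->hw &= ~val;
	r->shadow = val;
}
```

Starting from hw = 0x5, writing 0x1 leaves the device at 0x4 while the shadow says 0x1, so replaying the shadow after migration would not reproduce the device state (and would clear bit 0 again rather than set it).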
Re: [RFC Patch 02/12] IXGBE: Add new mail box event to restore VF status in the PF driver
On 10/21/2015 09:37 AM, Lan Tianyu wrote: This patch is to restore VF status in the PF driver when get event from VF. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbe/ixgbe.h | 1 + drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h | 1 + drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 40 ++ 3 files changed, 42 insertions(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h index 636f9e3..9d5669a 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h @@ -148,6 +148,7 @@ struct vf_data_storage { bool pf_set_mac; u16 pf_vlan; /* When set, guest VLAN config not allowed. */ u16 pf_qos; + u32 vf_lpe; u16 tx_rate; u16 vlan_count; u8 spoofchk_enabled; diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h index b1e4703..8fdb38d 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_mbx.h @@ -91,6 +91,7 @@ enum ixgbe_pfvf_api_rev { /* mailbox API, version 1.1 VF requests */ #define IXGBE_VF_GET_QUEUES 0x09 /* get queue configuration */ +#define IXGBE_VF_NOTIFY_RESUME0x0c /* VF notify PF migration finishing */ /* GET_QUEUES return data indices within the mailbox */ #define IXGBE_VF_TX_QUEUES1 /* number of Tx queues supported */ diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index 1d17b58..ab2a2e2 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -648,6 +648,42 @@ static inline void ixgbe_write_qde(struct ixgbe_adapter *adapter, u32 vf, } } +/** + * Restore the settings by mailbox, after migration + **/ +void ixgbe_restore_setting(struct ixgbe_adapter *adapter, u32 vf) +{ + struct ixgbe_hw *hw = >hw; + u32 reg, reg_offset, vf_shift; + int rar_entry = hw->mac.num_rar_entries - (vf + 1); + + vf_shift = vf % 32; + reg_offset = vf / 32; + + /* enable transmit 
and receive for vf */ + reg = IXGBE_READ_REG(hw, IXGBE_VFTE(reg_offset)); + reg |= (1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFTE(reg_offset), reg); + + reg = IXGBE_READ_REG(hw, IXGBE_VFRE(reg_offset)); + reg |= (1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VFRE(reg_offset), reg); + This is just blanket enabling Rx and Tx. I don't see how this can be valid. It seems like it would result in memory corruption for the guest if you are enabling Rx on a device that is not ready. A perfect example is if the guest is not configured to handle jumbo frames and the PF has jumbo frames enabled. + reg = IXGBE_READ_REG(hw, IXGBE_VMECM(reg_offset)); + reg |= (1 << vf_shift); + IXGBE_WRITE_REG(hw, IXGBE_VMECM(reg_offset), reg); This assumes that the anti-spoof is enabled. That may not be the case. + ixgbe_vf_reset_event(adapter, vf); + + hw->mac.ops.set_rar(hw, rar_entry, + adapter->vfinfo[vf].vf_mac_addresses, + vf, IXGBE_RAH_AV); + + + if (adapter->vfinfo[vf].vf_lpe) + ixgbe_set_vf_lpe(adapter, >vfinfo[vf].vf_lpe, vf); +} + The function ixgbe_set_vf_lpe also enabled the receive, you should take a look at it. For 82598 you cannot just arbitrarily enable the Rx as there is a risk of corrupting guest memory or causing a kernel panic. static int ixgbe_vf_reset_msg(struct ixgbe_adapter *adapter, u32 vf) { struct ixgbe_ring_feature *vmdq = >ring_feature[RING_F_VMDQ]; @@ -1047,6 +1083,7 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf) break; case IXGBE_VF_SET_LPE: retval = ixgbe_set_vf_lpe(adapter, msgbuf, vf); + adapter->vfinfo[vf].vf_lpe = *msgbuf; break; Why not just leave this for the VF to notify us of via a reset. It seems like if the VF is migrated it should start with the cts bits of the mailbox cleared as though the PF driver as been reloaded. 
case IXGBE_VF_SET_MACVLAN: retval = ixgbe_set_vf_macvlan_msg(adapter, msgbuf, vf); @@ -1063,6 +1100,9 @@ static int ixgbe_rcv_msg_from_vf(struct ixgbe_adapter *adapter, u32 vf) case IXGBE_VF_GET_RSS_KEY: retval = ixgbe_get_vf_rss_key(adapter, msgbuf, vf); break; + case IXGBE_VF_NOTIFY_RESUME: + ixgbe_restore_setting(adapter, vf); + break; default: e_err(drv, "Unhandled Msg %8.8x\n", msgbuf[0]); retval = IXGBE_ERR_MBX; I really don't think the VF should be sending us a message telling us to restore settings. Why not just use the existing messages? The VF as it is now can survive a suspend/resume
[GIT PULL 0/5] perf/core improvements and fixes
Hi Ingo,

	Please consider pulling,

- Arnaldo

The following changes since commit 43e41adc9e8c36545888d78fed2ef8d102a938dc:

  perf record: Add ability to sample call branches (2015-10-20 10:30:55 +0200)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git tags/perf-core-for-mingo

for you to fetch changes up to e3d006ce8180a0c025ce66bdc89bbc125f85be57:

  perf annotate: Add debug message for out of bounds sample (2015-10-21 18:12:37 -0300)

perf/core improvements and fixes:

User visible:

- Print branch filter state with verbose mode (Andi Kleen)

- Fix core dump caused by per-socket/core system-wide stat (Kan Liang)

- Update libtraceevent KVM plugin (Paolo Bonzini)

Developer stuff:

- Add fixdep to 'tools/build' .gitignore (Yunlong Song)

Signed-off-by: Arnaldo Carvalho de Melo

Andi Kleen (1):
      perf evsel: Print branch filter state with -vv

Arnaldo Carvalho de Melo (1):
      perf annotate: Add debug message for out of bounds sample

Kan Liang (1):
      perf cpu_map: Fix core dump caused by per-socket/core system-wide stat

Paolo Bonzini (1):
      tools lib traceevent: update KVM plugin

Yunlong Song (1):
      perf build: Add fixdep to .gitignore

 tools/build/.gitignore            |  1 +
 tools/lib/traceevent/plugin_kvm.c | 25 +
 tools/perf/util/annotate.c        |  5 -
 tools/perf/util/cpumap.c          |  2 +-
 tools/perf/util/evsel.c           |  1 +
 5 files changed, 24 insertions(+), 10 deletions(-)
 create mode 100644 tools/build/.gitignore
[PATCH 2/5] tools lib traceevent: update KVM plugin
From: Paolo Bonzini

The format of the role word has changed through the years and the
plugin was never updated; some VMX exit reasons were missing too.

Signed-off-by: Paolo Bonzini
Acked-by: Steven Rostedt
Cc: David Ahern
Cc: Namhyung Kim
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/1443695293-31127-1-git-send-email-pbonz...@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo
---
 tools/lib/traceevent/plugin_kvm.c | 25 +
 1 file changed, 17 insertions(+), 8 deletions(-)

diff --git a/tools/lib/traceevent/plugin_kvm.c b/tools/lib/traceevent/plugin_kvm.c
index 88fe83dff7cd..18536f756577 100644
--- a/tools/lib/traceevent/plugin_kvm.c
+++ b/tools/lib/traceevent/plugin_kvm.c
@@ -124,7 +124,10 @@ static const char *disassemble(unsigned char *insn, int len, uint64_t rip,
 	_ER(WBINVD,		 54)		\
 	_ER(XSETBV,		 55)		\
 	_ER(APIC_WRITE,		 56)		\
-	_ER(INVPCID,		 58)
+	_ER(INVPCID,		 58)		\
+	_ER(PML_FULL,		 62)		\
+	_ER(XSAVES,		 63)		\
+	_ER(XRSTORS,		 64)
 
 #define SVM_EXIT_REASONS \
 	_ER(EXIT_READ_CR0,	0x000)		\
@@ -352,15 +355,18 @@ static int kvm_nested_vmexit_handler(struct trace_seq *s, struct pevent_record *
 union kvm_mmu_page_role {
 	unsigned word;
 	struct {
-		unsigned glevels:4;
 		unsigned level:4;
+		unsigned cr4_pae:1;
 		unsigned quadrant:2;
-		unsigned pad_for_nice_hex_output:6;
 		unsigned direct:1;
 		unsigned access:3;
 		unsigned invalid:1;
-		unsigned cr4_pge:1;
 		unsigned nxe:1;
+		unsigned cr0_wp:1;
+		unsigned smep_and_not_wp:1;
+		unsigned smap_and_not_wp:1;
+		unsigned pad_for_nice_hex_output:8;
+		unsigned smm:8;
 	};
 };
 
@@ -385,15 +391,18 @@ static int kvm_mmu_print_role(struct trace_seq *s, struct pevent_record *record,
 	if (pevent_is_file_bigendian(event->pevent) ==
 	    pevent_is_host_bigendian(event->pevent)) {
 
-		trace_seq_printf(s, "%u/%u q%u%s %s%s %spge %snxe",
+		trace_seq_printf(s, "%u q%u%s %s%s %spae %snxe %swp%s%s%s",
 				 role.level,
-				 role.glevels,
 				 role.quadrant,
 				 role.direct ? " direct" : "",
 				 access_str[role.access],
 				 role.invalid ? " invalid" : "",
-				 role.cr4_pge ? "" : "!",
-				 role.nxe ? "" : "!");
+				 role.cr4_pae ? "" : "!",
+				 role.nxe ? "" : "!",
+				 role.cr0_wp ? "" : "!",
+				 role.smep_and_not_wp ? " smep" : "",
+				 role.smap_and_not_wp ? " smap" : "",
+				 role.smm ? " smm" : "");
 	} else
 		trace_seq_printf(s, "WORD: %08x", role.word);
 
-- 
2.1.0
Re: [RFC Patch 08/12] IXGBEVF: Rework code of finding the end transmit desc of package
On 10/21/2015 09:37 AM, Lan Tianyu wrote: When transmit a package, the end transmit desc of package indicates whether package is sent already. Current code records the end desc's pointer in the next_to_watch of struct tx buffer. This code will be broken if shifting desc ring after migration. The pointer will be invalid. This patch is to replace recording pointer with recording the desc number of the package and find the end decs via the first desc and desc number. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 1 + drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 19 --- 2 files changed, 17 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index 775d089..c823616 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -54,6 +54,7 @@ */ struct ixgbevf_tx_buffer { union ixgbe_adv_tx_desc *next_to_watch; + u16 desc_num; unsigned long time_stamp; struct sk_buff *skb; unsigned int bytecount; So if you can't use next_to_watch why is it left in here? Also you might want to take a look at moving desc_num to a different spot in the buffer as you are leaving a 6 byte hole in the descriptor. diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 4446916..056841c 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -210,6 +210,7 @@ static void ixgbevf_unmap_and_free_tx_resource(struct ixgbevf_ring *tx_ring, DMA_TO_DEVICE); } tx_buffer->next_to_watch = NULL; + tx_buffer->desc_num = 0; tx_buffer->skb = NULL; dma_unmap_len_set(tx_buffer, len, 0); This opens up a race condition. If you have a descriptor ready to be cleaned at offset 0 what is to prevent you from just running through the ring? You likely need to find a descriptor number that cannot be valid to use here. 
/* tx_buffer must be completely set up in the transmit path */ @@ -295,7 +296,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, union ixgbe_adv_tx_desc *tx_desc; unsigned int total_bytes = 0, total_packets = 0; unsigned int budget = tx_ring->count / 2; - unsigned int i = tx_ring->next_to_clean; + int i, watch_index; Where is i being initialized? It was here but you removed it. Are you using i without initializing it? if (test_bit(__IXGBEVF_DOWN, >state)) return true; @@ -305,9 +306,17 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, i -= tx_ring->count; do { - union ixgbe_adv_tx_desc *eop_desc = tx_buffer->next_to_watch; + union ixgbe_adv_tx_desc *eop_desc; + + if (!tx_buffer->desc_num) + break; + + if (i + tx_buffer->desc_num >= 0) + watch_index = i + tx_buffer->desc_num; + else + watch_index = i + tx_ring->count + tx_buffer->desc_num; - /* if next_to_watch is not set then there is no work pending */ + eop_desc = IXGBEVF_TX_DESC(tx_ring, watch_index); if (!eop_desc) break; So I don't see how this isn't triggering Tx hangs. I suspect for the simple ping case desc_num will often be 0. The fact is there are many cases where first and tx_buffer_info are the same descriptor. @@ -320,6 +329,7 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, /* clear next_to_watch to prevent false hangs */ tx_buffer->next_to_watch = NULL; + tx_buffer->desc_num = 0; /* update the statistics for this packet */ total_bytes += tx_buffer->bytecount; You cannot use 0 because 0 is a valid number. You are using it as a look-ahead currently and there are cases where i is the eop_desc index. 
@@ -3457,6 +3467,7 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, u32 tx_flags = first->tx_flags; __le32 cmd_type; u16 i = tx_ring->next_to_use; + u16 start; tx_desc = IXGBEVF_TX_DESC(tx_ring, i); @@ -3540,6 +3551,8 @@ static void ixgbevf_tx_map(struct ixgbevf_ring *tx_ring, /* set next_to_watch value indicating a packet is present */ first->next_to_watch = tx_desc; + start = first - tx_ring->tx_buffer_info; + first->desc_num = (i - start >= 0) ? i - start: i + tx_ring->count - start; i++; if (i == tx_ring->count) start and i could be the same value. If you look at ixgbevf_tx_map you should find that if the packet is contained in a single buffer then the first and last descriptor in your
Re: [RFC Patch 09/12] IXGBEVF: Add live migration support for VF driver
On 10/21/2015 09:37 AM, Lan Tianyu wrote:

To let the VF driver in the guest know the migration status, Qemu will fake PCI config registers 0xF0 and 0xF1 to show the migration status and get an ack from the VF driver.

When migration starts, Qemu will set reg "0xF0" to 1, notify the VF driver by triggering a mailbox message, and wait for the VF driver to report that it is ready for migration (by setting reg "0xF1" to 1). After migration, Qemu will set reg "0xF0" to 0 and notify the VF driver via the mailbox irq. The VF driver begins to restore tx/rx function after detecting the status change.

When the VF receives the mailbox irq, it will check reg "0xF0" in the service task function to get the migration status and perform the related operations according to its value.

Steps for restarting the receive and transmit functions:
1) Restore VF status in the PF driver by sending a mail event to the PF driver
2) Write back the reg values recorded by the self-emulation layer
3) Restart the rx/tx rings
4) Recover interrupts

The transmit/receive descriptor head regs are read-only, so they cannot be restored by writing back the recorded values, and they are set to 0 during VF reset. To reuse the original tx/rx rings, shift the desc ring so that the desc pointed to by the original head reg moves to the first entry of the ring, and then enable the tx/rx rings. The VF then resumes receiving and transmitting from the original head desc.
Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbevf/defines.h | 6 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 7 +- drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 115 - .../net/ethernet/intel/ixgbevf/self-emulation.c| 107 +++ 4 files changed, 232 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/defines.h b/drivers/net/ethernet/intel/ixgbevf/defines.h index 770e21a..113efd2 100644 --- a/drivers/net/ethernet/intel/ixgbevf/defines.h +++ b/drivers/net/ethernet/intel/ixgbevf/defines.h @@ -239,6 +239,12 @@ struct ixgbe_adv_tx_context_desc { __le32 mss_l4len_idx; }; +union ixgbevf_desc { + union ixgbe_adv_tx_desc rx_desc; + union ixgbe_adv_rx_desc tx_desc; + struct ixgbe_adv_tx_context_desc tx_context_desc; +}; + /* Adv Transmit Descriptor Config Masks */ #define IXGBE_ADVTXD_DTYP_MASK0x00F0 /* DTYP mask */ #define IXGBE_ADVTXD_DTYP_CTXT0x0020 /* Advanced Context Desc */ diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index c823616..6eab402e 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -109,7 +109,7 @@ struct ixgbevf_ring { struct ixgbevf_ring *next; struct net_device *netdev; struct device *dev; - void *desc; /* descriptor ring memory */ + union ixgbevf_desc *desc; /* descriptor ring memory */ dma_addr_t dma; /* phys. 
address of descriptor ring */ unsigned int size; /* length in bytes */ u16 count; /* amount of descriptors */ @@ -493,6 +493,11 @@ extern void ixgbevf_write_eitr(struct ixgbevf_q_vector *q_vector); void ixgbe_napi_add_all(struct ixgbevf_adapter *adapter); void ixgbe_napi_del_all(struct ixgbevf_adapter *adapter); +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head); +int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head); +void ixgbevf_restore_state(struct ixgbevf_adapter *adapter); +inline void ixgbevf_irq_enable(struct ixgbevf_adapter *adapter); + #ifdef DEBUG char *ixgbevf_get_hw_dev_name(struct ixgbe_hw *hw); diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 056841c..15ec361 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -91,6 +91,10 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit Virtual Function Network Driver"); MODULE_LICENSE("GPL"); MODULE_VERSION(DRV_VERSION); + +#define MIGRATION_COMPLETED 0x00 +#define MIGRATION_IN_PROGRESS 0x01 + #define DEFAULT_MSG_ENABLE (NETIF_MSG_DRV|NETIF_MSG_PROBE|NETIF_MSG_LINK) static int debug = -1; module_param(debug, int, 0); @@ -221,6 +225,78 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring) return ring->stats.packets; } +int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) +{ + struct ixgbevf_tx_buffer *tx_buffer = NULL; + static union ixgbevf_desc *tx_desc = NULL; + + tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count)); + if (!tx_buffer) + return -ENOMEM; + + tx_desc = vmalloc(sizeof(union ixgbevf_desc) * r->count); + if (!tx_desc) + return -ENOMEM; + + memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count); + memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - head)); +
Re: [RFC Patch 10/12] IXGBEVF: Add lock to protect tx/rx ring operation
On 10/21/2015 09:37 AM, Lan Tianyu wrote: Ring shifting during restoring VF function maybe race with original ring operation(transmit/receive package). This patch is to add tx/rx lock to protect ring related data. Signed-off-by: Lan Tianyu--- drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 2 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 28 --- 2 files changed, 27 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h index 6eab402e..3a748c8 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf.h @@ -448,6 +448,8 @@ struct ixgbevf_adapter { spinlock_t mbx_lock; unsigned long last_reset; + spinlock_t mg_rx_lock; + spinlock_t mg_tx_lock; }; Really, a shared lock for all of the Rx or Tx rings? This is going to kill any chance at performance. Especially since just recently the VFs got support for RSS. To top it off it also means we cannot clean Tx while adding new buffers which will kill Tx performance. The other concern I have is what is supposed to prevent the hardware from accessing the rings while you are reading? I suspect nothing so I don't see how this helps anything. I would honestly say you are better off just giving up on all of the data stored in the descriptor rings rather than trying to restore them. Yes you are going to lose a few packets but you don't have the risk for races that this code introduces. 
enum ixbgevf_state_t { diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c index 15ec361..04b6ce7 100644 --- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c +++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c @@ -227,8 +227,10 @@ static u64 ixgbevf_get_tx_completed(struct ixgbevf_ring *ring) int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) { + struct ixgbevf_adapter *adapter = netdev_priv(r->netdev); struct ixgbevf_tx_buffer *tx_buffer = NULL; static union ixgbevf_desc *tx_desc = NULL; + unsigned long flags; tx_buffer = vmalloc(sizeof(struct ixgbevf_tx_buffer) * (r->count)); if (!tx_buffer) @@ -238,6 +240,7 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) if (!tx_desc) return -ENOMEM; + spin_lock_irqsave(>mg_tx_lock, flags); memcpy(tx_desc, r->desc, sizeof(union ixgbevf_desc) * r->count); memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - head)); memcpy(>desc[r->count - head], tx_desc, sizeof(union ixgbevf_desc) * head); @@ -256,6 +259,8 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) else r->next_to_use += (r->count - head); + spin_unlock_irqrestore(>mg_tx_lock, flags); + vfree(tx_buffer); vfree(tx_desc); return 0; @@ -263,8 +268,10 @@ int ixgbevf_tx_ring_shift(struct ixgbevf_ring *r, u32 head) int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) { + struct ixgbevf_adapter *adapter = netdev_priv(r->netdev); struct ixgbevf_rx_buffer *rx_buffer = NULL; static union ixgbevf_desc *rx_desc = NULL; + unsigned long flags; rx_buffer = vmalloc(sizeof(struct ixgbevf_rx_buffer) * (r->count)); if (!rx_buffer) @@ -274,6 +281,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) if (!rx_desc) return -ENOMEM; + spin_lock_irqsave(>mg_rx_lock, flags); memcpy(rx_desc, r->desc, sizeof(union ixgbevf_desc) * (r->count)); memcpy(r->desc, _desc[head], sizeof(union ixgbevf_desc) * (r->count - head)); memcpy(>desc[r->count - head], rx_desc, 
sizeof(union ixgbevf_desc) * head); @@ -291,6 +299,7 @@ int ixgbevf_rx_ring_shift(struct ixgbevf_ring *r, u32 head) r->next_to_use -= head; else r->next_to_use += (r->count - head); + spin_unlock_irqrestore(>mg_rx_lock, flags); vfree(rx_buffer); vfree(rx_desc); @@ -377,6 +386,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, if (test_bit(__IXGBEVF_DOWN, >state)) return true; + spin_lock(>mg_tx_lock); + i = tx_ring->next_to_clean; tx_buffer = _ring->tx_buffer_info[i]; tx_desc = IXGBEVF_TX_DESC(tx_ring, i); i -= tx_ring->count; @@ -471,6 +482,8 @@ static bool ixgbevf_clean_tx_irq(struct ixgbevf_q_vector *q_vector, q_vector->tx.total_bytes += total_bytes; q_vector->tx.total_packets += total_packets; + spin_unlock(>mg_tx_lock); + if (check_for_tx_hang(tx_ring) && ixgbevf_check_tx_hang(tx_ring)) { struct ixgbe_hw *hw = >hw; union ixgbe_adv_tx_desc *eop_desc; @@ -999,10 +1012,12 @@ static int
Re: [RFC Patch 00/12] IXGBE: Add live migration support for SRIOV NIC
On 10/21/2015 12:20 PM, Alex Williamson wrote:
On Wed, 2015-10-21 at 21:45 +0300, Or Gerlitz wrote:
On Wed, Oct 21, 2015 at 7:37 PM, Lan Tianyu wrote:

This patchset is to propose a new solution to add live migration support for 82599 SRIOV network card. In our solution, we prefer to put all device specific operation into VF and PF driver and make code in the Qemu more general.

[...]

Service down time test
So far, we tested migration between two laptops with 82599 nic which are connected to a gigabit switch. Ping VF in the 0.001s interval during migration on the host of source side. Its service down time is about 180ms.

So... what would you expect service down wise for the following solution which is zero touch and I think should work for any VF driver: on host A: unplug the VM and conduct live migration to host B ala the no-SRIOV case.

The trouble here is that the VF needs to be unplugged prior to the start of migration because we can't do effective dirty page tracking while the device is connected and doing DMA. So the downtime, assuming we're counting only VF connectivity, is dependent on memory size, rate of dirtying, and network bandwidth; seconds for small guests, minutes or more (maybe much, much more) for large guests.

The question of dirty page tracking though should be pretty simple. We start the Tx packets out as dirty so we don't need to add anything there. It seems like the Rx data and Tx/Rx descriptor rings are the issue.

This is why the typical VF agnostic approach here is to use bonding and fail over to an emulated device during migration, so performance suffers, but downtime is something acceptable. If we want the ability to defer the VF unplug until just before the final stages of the migration, we need the VF to participate in dirty page tracking. Here it's done via an enlightened guest driver. Alex Graf presented a solution using a device specific enlightenment in QEMU. Otherwise we'd need hardware support from the IOMMU.

My only real complaint with this patch series is that it seems like there was too much focus on instrumenting the driver instead of providing the code necessary to enable a driver ecosystem that enables migration.

I don't know if what we need is a full hardware IOMMU. It seems like a good way to take care of the need to flag dirty pages for DMA capable devices would be to add functionality to the dma_map_ops calls sync_{sg|single}for_cpu and unmap_{page|sg} so that they would take care of marking the pages as dirty for us when needed. We could probably make do with just a few tweaks to existing API in order to make this work.

As far as the descriptor rings I would argue they are invalid as soon as we migrate. The problem is there is no way to guarantee ordering as we cannot pre-emptively mark an Rx data buffer as being a dirty page when we haven't even looked at the Rx descriptor for the given buffer yet. Tx has similar issues as we cannot guarantee the Tx will disable itself after a complete frame. As such I would say the moment we migrate we should just give up on the frames that are still in the descriptor rings, drop them, and then start over with fresh rings.

- Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
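The dma_map_ops idea above can be sketched roughly as follows. This is a toy userspace model under stated assumptions, not kernel code: toy_dirty_log, toy_mark_dirty() and toy_dma_sync_for_cpu() are hypothetical names, the page size is assumed to be 4 KiB, and the log covers only the first 64 pages. It shows only the bookkeeping part: when a buffer is handed back to the CPU, every page the buffer touches is flagged dirty.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12   /* assumed 4 KiB pages */
#define TOY_NUM_PAGES  64   /* toy log covers 64 pages only */

/* Hypothetical dirty log: one bit per page frame. */
struct toy_dirty_log {
    uint64_t bitmap;
};

/* Flag every page overlapped by [addr, addr + len) as dirty. */
void toy_mark_dirty(struct toy_dirty_log *log, uint64_t addr, size_t len)
{
    uint64_t first, last, pfn;

    if (len == 0)
        return;

    first = addr >> TOY_PAGE_SHIFT;
    last = (addr + len - 1) >> TOY_PAGE_SHIFT;

    for (pfn = first; pfn <= last && pfn < TOY_NUM_PAGES; pfn++)
        log->bitmap |= UINT64_C(1) << pfn;
}

/*
 * Stand-in for a sync-for-cpu hook: a real implementation would do the
 * usual cache/bounce-buffer work and, in addition, log the pages.
 */
void toy_dma_sync_for_cpu(struct toy_dirty_log *log, uint64_t addr, size_t len)
{
    toy_mark_dirty(log, addr, len);
}
```

A buffer that straddles a page boundary dirties both pages, which is the case a per-buffer hook has to get right.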
Clarification on KVM + vhost-net
Hi,

I would like to invoke QEMU and KVM so that the guest sees a virtio NIC, and that NIC goes through a SR-IOV VF of a host NIC as directly and efficiently as possible. But I don't actually want to pass the VF through to the guest.

I've found a bunch of discussion and confusing examples on the web, but I'm not able to figure out what the right thing to do with modern QEMU is.

I don't think I want to create a macvtap interface attached to the VF, because I just want to use one MAC address for the VF itself (and allow the NIC anti-spoofing hardware to work etc).

Am I supposed to create a raw socket bound to the interface I want to use in a helper, and then pass that to qemu? How exactly do I pass that in? Do I still use "-net tap"? Do I have to create my own vhostfd in my helper too?

Thanks!
Roland
Re: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally
On Tue, Oct 20, 2015 at 03:51:05PM -0400, Paolo Bonzini wrote:
> Should this be "select" or "depends on"? Not a blocker, can always be fixed
> in 4.4.
>

Hmm, I don't know actually. I trusted Arnd to make the right call and given Marc's ack as well, I didn't pay too much attention to that particular detail.

Arnd, any comments?

Thanks,
-Christoffer

>
>
> -Original Message-
> From: Christoffer Dall [christoffer.d...@linaro.org]
> Received: Tuesday, 20 Oct 2015, 18:18
> To: Paolo Bonzini [pbonz...@redhat.com]; kvm...@lists.cs.columbia.edu,
> kvm@vger.kernel.org, linux-arm-ker...@lists.infradead.org
> CC: Marc Zyngier [marc.zyng...@arm.com]; Arnd Bergmann [a...@arndb.de];
> Christoffer Dall [christoffer.d...@linaro.org]
> Subject: [GIT PULL 3/6] KVM: arm: use GIC support unconditionally
>
> From: Arnd Bergmann
>
> The vgic code on ARM is built for all configurations that enable KVM,
> but the parent_data field that it references is only present when
> CONFIG_IRQ_DOMAIN_HIERARCHY is set:
>
> virt/kvm/arm/vgic.c: In function 'kvm_vgic_map_phys_irq':
> virt/kvm/arm/vgic.c:1781:13: error: 'struct irq_data' has no member named
> 'parent_data'
>
> This flag is implied by the GIC driver, and indeed the VGIC code only
> makes sense if a GIC is present. This changes the CONFIG_KVM symbol
> to always select GIC, which avoids the issue.
>
> Fixes: 662d9715840 ("arm/arm64: KVM: Kill CONFIG_KVM_ARM_{VGIC,TIMER}")
> Signed-off-by: Arnd Bergmann
> Acked-by: Marc Zyngier
> Signed-off-by: Christoffer Dall
> ---
>  arch/arm/kvm/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm/kvm/Kconfig b/arch/arm/kvm/Kconfig
> index 210ecca..356970f 100644
> --- a/arch/arm/kvm/Kconfig
> +++ b/arch/arm/kvm/Kconfig
> @@ -21,6 +21,7 @@ config KVM
>  	depends on MMU && OF
>  	select PREEMPT_NOTIFIERS
>  	select ANON_INODES
> +	select ARM_GIC
>  	select HAVE_KVM_CPU_RELAX_INTERCEPT
>  	select HAVE_KVM_ARCH_TLB_FLUSH_ALL
>  	select KVM_MMIO
> --
> 2.1.2.330.g565301e.dirty
>
Re: [PATCH v3 07/20] KVM: ARM64: PMU: Add perf event map and introduce perf event creating function
On 2015/10/16 14:08, Wei Huang wrote:
>> +/**
>> > + * kvm_pmu_get_counter_value - get PMU counter value
>> > + * @vcpu: The vcpu pointer
>> > + * @select_idx: The counter index
>> > + */
>> > +unsigned long kvm_pmu_get_counter_value(struct kvm_vcpu *vcpu, u32
>> > select_idx)
>> > +{
>> > +	u64 enabled, running;
>> > +	struct kvm_pmu *pmu = &vcpu->arch.pmu;
>> > +	struct kvm_pmc *pmc = &pmu->pmc[select_idx];
>> > +	u64 counter;
>> > +
>> > +	if (!vcpu_mode_is_32bit(vcpu))
>> > +		counter = vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + select_idx);
> The select_idx is from PMSELR_EL0. According to PMUv3 spec, PMSELR_EL0
> is the register that "selects the current event counter PMEVCNTR or
> the cycle counter, CCNT". The code here always reads the counter value
> from PMEVCNTR. It doesn't read the value from cycle counter when
> select_idx=0b1. We might waste some perf counter resources here.
>

No, it does read the value from the cycle counter. When select_idx=0b1, PMEVCNTR0_EL0 + select_idx = PMCCNTR_EL0 (see patch 03/20).

--
Shannon
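The index arithmetic in the reply above can be sketched with a toy register enum. This is only an illustration under assumptions: the enum values and toy_ names are hypothetical and are not the layout from patch 03/20, which is not shown in this thread. The sketch assumes the cycle counter slot sits directly after the 31 event counter slots, so that the architectural cycle-counter selector (PMSELR_EL0.SEL = 0b11111, i.e. 31, which matches ARMV8_MAX_COUNTERS - 1 as used in patch 10/20) lands on it.

```c
/*
 * Hypothetical register indices. The only property that matters is that
 * the cycle counter index equals the first event counter index plus 31.
 */
enum toy_sys_reg {
    TOY_PMEVCNTR0_EL0  = 100,                      /* arbitrary base */
    TOY_PMEVCNTR30_EL0 = TOY_PMEVCNTR0_EL0 + 30,   /* last event counter */
    TOY_PMCCNTR_EL0    = TOY_PMEVCNTR0_EL0 + 31    /* cycle counter */
};

#define TOY_CYCLE_IDX 31u   /* selector value that picks the cycle counter */

/* Map a counter selector to the register index that backs it. */
int toy_counter_reg(unsigned int select_idx)
{
    return TOY_PMEVCNTR0_EL0 + (int)select_idx;
}
```

With such a layout no special case is needed for the cycle counter; the same addition serves both kinds of counter.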
Re: Difference between vcpu_load and kvm_sched_in ?
On 21/10/2015 00:57, Wanpeng Li wrote:
>> kvm_sched_out and kvm_sched_in are part of KVM's preemption hooks. The
>> hooks are registered only between vcpu_load and vcpu_put, therefore they
>> know that the mutex is taken. The sequence will go like this:
>>
>> vcpu_load
>> kvm_sched_out
>> kvm_sched_in
>> kvm_sched_out
>> kvm_sched_in
>> ...
>> vcpu_put
>
> If this should be:
>
> vcpu_load
> kvm_sched_in
> kvm_sched_out
> kvm_sched_in
> kvm_sched_out
> ...
> vcpu_put

No, because vcpu_load is called while the thread is running. Therefore, the first preempt notifier call will be a sched_out notification, which calls kvm_arch_vcpu_put. Extending the picture above:

    vcpu_load     -> kvm_arch_vcpu_load
    kvm_sched_out -> kvm_arch_vcpu_put
    kvm_sched_in  -> kvm_arch_vcpu_load
    kvm_sched_out -> kvm_arch_vcpu_put
    kvm_sched_in  -> kvm_arch_vcpu_load
    ...
    kvm_sched_out -> kvm_arch_vcpu_put
    kvm_sched_in  -> kvm_arch_vcpu_load
    vcpu_put      -> kvm_arch_vcpu_put

Thanks,

Paolo
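The ordering described above can be modelled with a small state machine. This is a toy userspace sketch, not kernel code: toy_vcpu and all toy_ functions are hypothetical stand-ins for vcpu_load, vcpu_put and the two preempt notifier callbacks. The assertions encode the invariant from the explanation: arch load and arch put strictly alternate, and the scheduler hooks are only legal between load and put.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy vCPU: tracks what the real hooks would have done so far. */
struct toy_vcpu {
    bool loaded;      /* arch state currently loaded on a physical CPU */
    bool notifier;    /* preempt notifier registered (between load and put) */
    int arch_loads;   /* number of arch-load calls */
    int arch_puts;    /* number of arch-put calls */
};

static void toy_arch_vcpu_load(struct toy_vcpu *v)
{
    assert(!v->loaded);          /* never load twice in a row */
    v->loaded = true;
    v->arch_loads++;
}

static void toy_arch_vcpu_put(struct toy_vcpu *v)
{
    assert(v->loaded);           /* never put without a prior load */
    v->loaded = false;
    v->arch_puts++;
}

/* Called while the thread is running, so state is loaded right away. */
void toy_vcpu_load(struct toy_vcpu *v)
{
    v->notifier = true;
    toy_arch_vcpu_load(v);
}

/* Thread is being preempted: state must come off the CPU. */
void toy_sched_out(struct toy_vcpu *v)
{
    assert(v->notifier);
    toy_arch_vcpu_put(v);
}

/* Thread is scheduled back in: state goes back on the CPU. */
void toy_sched_in(struct toy_vcpu *v)
{
    assert(v->notifier);
    toy_arch_vcpu_load(v);
}

void toy_vcpu_put(struct toy_vcpu *v)
{
    toy_arch_vcpu_put(v);
    v->notifier = false;
}
```

In this model, calling toy_sched_in() directly after toy_vcpu_load() trips the "never load twice" assertion, which is the reason the first notification after a load has to be a sched_out.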
Re: [PATCH v3 10/20] KVM: ARM64: Add reset and access handlers for PMCCNTR register
On 2015/10/16 23:06, Wei Huang wrote:
>
>
> On 09/24/2015 05:31 PM, Shannon Zhao wrote:
>> Since the reset value of PMCCNTR is UNKNOWN, use reset_unknown for its
>> reset handler. Add a new case to emulate reading to PMCCNTR register.
>>
>> Signed-off-by: Shannon Zhao
>> ---
>>  arch/arm64/kvm/sys_regs.c | 17 +++--
>>  1 file changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index e7f6058..c38c2de 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -518,6 +518,12 @@ static bool access_pmu_regs(struct kvm_vcpu *vcpu,
>>  		}
>>  	} else {
>>  		switch (r->reg) {
>> +		case PMCCNTR_EL0: {
>> +			val = kvm_pmu_get_counter_value(vcpu,
>> +						ARMV8_MAX_COUNTERS - 1);
>> +			*vcpu_reg(vcpu, p->Rt) = val;
>> +			break;
>> +		}
>>  		case PMXEVCNTR_EL0: {
>>  			val = kvm_pmu_get_counter_value(vcpu,
>>  					vcpu_sys_reg(vcpu, PMSELR_EL0));
>> @@ -748,7 +754,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>>  	  access_pmu_regs, reset_pmceid, PMCEID1_EL0 },
>>  	/* PMCCNTR_EL0 */
>>  	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b000),
>> -	  trap_raz_wi },
>> +	  access_pmu_regs, reset_unknown, PMCCNTR_EL0 },
>>  	/* PMXEVTYPER_EL0 */
>>  	{ Op0(0b11), Op1(0b011), CRn(0b1001), CRm(0b1101), Op2(0b001),
>>  	  access_pmu_regs, reset_unknown, PMXEVTYPER_EL0 },
>> @@ -997,6 +1003,12 @@ static bool access_pmu_cp15_regs(struct kvm_vcpu *vcpu,
>>  		}
>>  	} else {
>>  		switch (r->reg) {
>> +		case c9_PMCCNTR: {
>> +			val = kvm_pmu_get_counter_value(vcpu,
>> +						ARMV8_MAX_COUNTERS - 1);
>
> PMCCNTR is for cycle counter. There is a filter register, PMCCFILTR_EL0,
> associated with it. When kvm_pmu_set_counter_event_type() is called, I
> didn't see this filter config been used in perf_event_attr when
> perf_event is created.

According to the spec, for PMXEVTYPER_EL0 it says "When PMSELR_EL0.SEL selects the cycle counter, this accesses PMCCFILTR_EL0." So within kvm_pmu_set_counter_event_type, I configure the perf_event_attr based on the bits of PMXEVTYPER_EL0 and only handle bit P for EL1 and bit U for EL0, since a KVM guest doesn't see EL2 and EL3. See patch 07/20:

+	attr.exclude_user = data & ARMV8_EXCLUDE_EL0 ? 1 : 0;
+	attr.exclude_kernel = data & ARMV8_EXCLUDE_EL1 ? 1 : 0;

--
Shannon
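The two filter bits mentioned above can be tried out in isolation. A minimal sketch, assuming the architectural bit positions of the event type register (P, bit 31, filters EL1; U, bit 30, filters EL0); toy_perf_attr and toy_set_filter() are hypothetical stand-ins and not the perf or KVM API.

```c
#include <stdint.h>

#define TOY_EXCLUDE_EL1 (UINT32_C(1) << 31)   /* P bit: do not count at EL1 */
#define TOY_EXCLUDE_EL0 (UINT32_C(1) << 30)   /* U bit: do not count at EL0 */

/* Hypothetical subset of a perf attribute structure. */
struct toy_perf_attr {
    unsigned int exclude_user : 1;
    unsigned int exclude_kernel : 1;
};

/* Translate the guest-written event type value into exclude flags. */
void toy_set_filter(struct toy_perf_attr *attr, uint32_t data)
{
    attr->exclude_user = (data & TOY_EXCLUDE_EL0) ? 1 : 0;
    attr->exclude_kernel = (data & TOY_EXCLUDE_EL1) ? 1 : 0;
}
```

The explicit "? 1 : 0" matters with one-bit fields: assigning the raw masked value would truncate bit 30 or bit 31 to zero.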
Re: [PATCH v3 15/20] KVM: ARM64: Add reset and access handlers for PMSWINC register
On 2015/10/16 23:25, Wei Huang wrote:
>> /**
>> > + * kvm_pmu_software_increment - do software increment
>> > + * @vcpu: The vcpu pointer
>> > + * @val: the value guest writes to PMSWINC register
>> > + */
>> > +void kvm_pmu_software_increment(struct kvm_vcpu *vcpu, u32 val)
>> > +{
>> > +	int i;
>> > +	u32 type, enable;
>> > +
>> > +	for (i = 0; i < 32; i++) {
>> > +		if ((val >> i) & 0x1) {
>> > +			if (!vcpu_mode_is_32bit(vcpu)) {
>> > +				type = vcpu_sys_reg(vcpu, PMEVTYPER0_EL0 + i)
>> > +				       & ARMV8_EVTYPE_EVENT;
>> > +				enable = vcpu_sys_reg(vcpu, PMCNTENSET_EL0);
>> > +				if ((type == 0) && ((enable >> i) & 0x1))
>> > +					vcpu_sys_reg(vcpu, PMEVCNTR0_EL0 + i)++;
> Most parts make sense here. I just wonder about the case of counter
> overflow here. Should we trigger an interrupt and set Overflow Flag
> status register when SW increment overflows here? I didn't find anything
> in ARM document.
>

I didn't find anything either. But since SW increment uses PMEVCNTR_EL0 to count, it should behave the same as other events: trigger an interrupt and set the Overflow Flag status register. I will add this in the next version of the patch. Thanks.

--
Shannon
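The overflow handling discussed above might look roughly like this. It is a sketch under assumptions, not the follow-up patch: toy_pmu and its fields are hypothetical, the event counters are assumed to be 32 bits wide, event number 0 is taken to be the software increment event (as in the quoted "type == 0" check), and raising the interrupt is left out. Only the "set the overflow flag when the counter wraps" part is shown.

```c
#include <stdint.h>

#define TOY_NUM_COUNTERS 32   /* mirrors the loop bound in the quoted patch */

/* Hypothetical per-vCPU PMU state. */
struct toy_pmu {
    uint32_t evcntr[TOY_NUM_COUNTERS];   /* event counters */
    uint32_t evtype[TOY_NUM_COUNTERS];   /* event number per counter */
    uint32_t cntenset;                   /* counter enable bits */
    uint32_t ovsset;                     /* overflow status bits */
};

/* Apply a guest write of 'val' to the software increment register. */
void toy_pmu_software_increment(struct toy_pmu *pmu, uint32_t val)
{
    int i;

    for (i = 0; i < TOY_NUM_COUNTERS; i++) {
        if (!((val >> i) & 0x1))
            continue;                    /* counter not selected */
        if (pmu->evtype[i] != 0)
            continue;                    /* not counting SW increments */
        if (!((pmu->cntenset >> i) & 0x1))
            continue;                    /* counter disabled */

        pmu->evcntr[i]++;
        if (pmu->evcntr[i] == 0)         /* wrapped from 0xffffffff */
            pmu->ovsset |= UINT32_C(1) << i;
    }
}
```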
Re: Difference between vcpu_load and kvm_sched_in ?
Paolo Bonzini writes:
>
>
> On 21/10/2015 00:57, Wanpeng Li wrote:
> >> kvm_sched_out and kvm_sched_in are part of KVM's preemption hooks. The
> >> hooks are registered only between vcpu_load and vcpu_put, therefore they
> >> know that the mutex is taken. The sequence will go like this:
> >>
> >> vcpu_load
> >> kvm_sched_out
> >> kvm_sched_in
> >> kvm_sched_out
> >> kvm_sched_in
> >> ...
> >> vcpu_put
> >
> > If this should be:
> >
> > vcpu_load
> > kvm_sched_in
> > kvm_sched_out
> > kvm_sched_in
> > kvm_sched_out
> > ...
> > vcpu_put
>
> No, because vcpu_load is called while the thread is running. Therefore,
> the first preempt notifier call will be a sched_out notification, which
> calls kvm_arch_vcpu_put. Extending the picture above:
>
> vcpu_load     -> kvm_arch_vcpu_load
> kvm_sched_out -> kvm_arch_vcpu_put
> kvm_sched_in  -> kvm_arch_vcpu_load
> kvm_sched_out -> kvm_arch_vcpu_put
> kvm_sched_in  -> kvm_arch_vcpu_load
> ...
> kvm_sched_out -> kvm_arch_vcpu_put
> kvm_sched_in  -> kvm_arch_vcpu_load
> vcpu_put      -> kvm_arch_vcpu_put
>
> Thanks,
>
> Paolo
>

Thanks for the explanation, it's very clear.

I tried that, but I did not succeed in sending the ioctl from the "run_on_cpu" function; I could not find how to set the right CPUState. I tried "current_cpu":

kvm_main.c:

    // yacine.begin
    static void do_vmi_start_kvm_ioctl(void *type)
    {
        printf("do_vmi_start_kvm_ioctl\n");
        kvm_vm_ioctl(kvm_state, type);
    }

    /* called from hmp.c */
    int vmi_start_kvm_ioctl(int type)
    {
        printf("vmi_start_kvm_ioctl\n");
        run_on_cpu(current_cpu, do_vmi_start_kvm_ioctl, (void *) &type);
        return 0;
    }
    // yacine.end

This gives me a segmentation fault.

Then I tried to replace current_cpu with ENV_GET_CPU(mon_get_cpu()). That didn't work either: I get no error, but nothing happens.

I also tried to pass mon->mon_cpu through int vmi_start_kvm_ioctl(int type) by adding a first parameter of type CPUState, but I get the compiler error "dereference pointer to incomplete type".

I'm a beginner with the qemu and kvm code; can you please point me in the right direction to fix this problem?

Thanks in advance,
Yacine
Re: Difference between vcpu_load and kvm_sched_in ?
On 21/10/2015 12:17, Hebbal Yacine wrote:
> Thanks for the explanation, it's very clear.
> I tried that but I didn't succeed to send the ioctl from "run_on_cpu"
> function, I didn't find how to set the right CPUState
> I've tried "current_cpu"

current_cpu is always NULL outside the VCPU thread.

>
> kvm_main.c:
>
> // yacine.begin
>
> static void do_vmi_start_kvm_ioctl(void *type) {
> printf("do_vmi_start_kvm_ioctl\n");
> kvm_vm_ioctl(kvm_state, type);

Are you sure you want a VM ioctl and not a VCPU ioctl? Or perhaps a VM ioctl to do generic processing, and a VCPU ioctl that is then sent to all VCPUs?

If you use a VCPU ioctl, you can use CPU_FOREACH or a for loop to iterate over all VCPUs.

Paolo
Re: [PATCH v4 28/33] nvdimm acpi: support DSM_FUN_IMPLEMENTED function
On Wed, Oct 21, 2015 at 12:26:35AM +0800, Xiao Guangrong wrote:
>
>
> On 10/20/2015 11:51 PM, Stefan Hajnoczi wrote:
> >On Mon, Oct 19, 2015 at 08:54:14AM +0800, Xiao Guangrong wrote:
> >>+exit:
> >>+/* Write our output result to dsm memory. */
> >>+((dsm_out *)dsm_ram_addr)->len = out->len;
> >
> >Missing byteswap?
> >
> >I thought you were going to remove this field because it wasn't needed
> >by the guest.
> >
>
> The @len is the size of _DSM result buffer, for example, for the function of
> DSM_FUN_IMPLEMENTED the result buffer is 8 bytes, and for
> DSM_DEV_FUN_NAMESPACE_LABEL_SIZE the buffer size is 4 bytes. It tells ASL code
> how much size of memory we need to return to the _DSM caller.
>
> In _DSM code, it's handled like this:
>
> "RLEN" is @len, "OBUF" is the left memory in DSM page.
>
> /* get @len*/
> aml_append(method, aml_store(aml_name("RLEN"), aml_local(6)));
> /* @len << 3 to get bits. */
> aml_append(method, aml_store(aml_shiftleft(aml_local(6),
>            aml_int(3)), aml_local(6)));
>
> /* get @len << 3 bits from OBUF, and return it to the caller. */
> aml_append(method, aml_create_field(aml_name("ODAT"), aml_int(0),
>            aml_local(6) , "OBUF"));
>
> Since @len is our internally used, it's not return to guest, so i did not do
> byteswap here.

I am not familiar with the ACPI details, but I think this emits bytecode that will be run by the guest's ACPI interpreter?

You still need to define the endianness of fields since QEMU and the guest could have different endianness.

In other words, will the following work if a big-endian ppc host is running a little-endian x86 guest?

    ((dsm_out *)dsm_ram_addr)->len = out->len;

Stefan
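The endianness concern above can be illustrated outside QEMU. A minimal sketch, assuming a 4-byte length field that the guest reads as little-endian; toy_store_le32() and toy_load_le32() are hypothetical helpers written for this example (inside QEMU the usual tool for this is cpu_to_le32() and its relatives). Writing the field byte by byte gives the same memory image on big-endian and little-endian hosts, whereas a plain struct assignment stores the bytes in host order.

```c
#include <stdint.h>

/* Store 'val' at 'dst' in little-endian byte order, on any host. */
void toy_store_le32(uint8_t *dst, uint32_t val)
{
    dst[0] = (uint8_t)(val & 0xff);
    dst[1] = (uint8_t)((val >> 8) & 0xff);
    dst[2] = (uint8_t)((val >> 16) & 0xff);
    dst[3] = (uint8_t)((val >> 24) & 0xff);
}

/* Read a little-endian 32-bit value from 'src', on any host. */
uint32_t toy_load_le32(const uint8_t *src)
{
    return (uint32_t)src[0] |
           ((uint32_t)src[1] << 8) |
           ((uint32_t)src[2] << 16) |
           ((uint32_t)src[3] << 24);
}
```

For a length of 8 the shared page then always starts with the bytes 08 00 00 00; a big-endian host doing a direct 32-bit store would produce 00 00 00 08 instead, which a little-endian reader would interpret as 0x08000000.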
RE: [PATCH v3 13/16] KVM: arm64: sync LPI configuration and pending tables
Hello!

> -Original Message-
> From: kvm-ow...@vger.kernel.org [mailto:kvm-ow...@vger.kernel.org] On Behalf
> Of Andre Przywara
> Sent: Wednesday, October 07, 2015 5:55 PM
> To: marc.zyng...@arm.com; christoffer.d...@linaro.org
> Cc: eric.au...@linaro.org; p.fe...@samsung.com; kvm...@lists.cs.columbia.edu;
> linux-arm-ker...@lists.infradead.org; kvm@vger.kernel.org
> Subject: [PATCH v3 13/16] KVM: arm64: sync LPI configuration and pending
> tables
>
> The LPI configuration and pending tables of the GICv3 LPIs are held
> in tables in (guest) memory. To achieve reasonable performance, we
> cache this data in our own data structures, so we need to sync those
> two views from time to time. This behaviour is well described in the
> GICv3 spec and is also exercised by hardware, so the sync points are
> well known.
>
> Provide functions that read the guest memory and store the
> information from the configuration and pending tables in the kernel.
>
> Signed-off-by: Andre Przywara
> ---
> Changelog v2..v3:
> - rework functions to avoid propbaser/pendbaser accesses inside lock
>
>  include/kvm/arm_vgic.h  |   2 +
>  virt/kvm/arm/its-emul.c | 133
>  virt/kvm/arm/its-emul.h |   3 ++
>  3 files changed, 138 insertions(+)
>
> diff --git a/include/kvm/arm_vgic.h b/include/kvm/arm_vgic.h
> index 035911f..4ea023c 100644
> --- a/include/kvm/arm_vgic.h
> +++ b/include/kvm/arm_vgic.h
> @@ -179,6 +179,8 @@ struct vgic_its {
>  	int			cwriter;
>  	struct list_head	device_list;
>  	struct list_head	collection_list;
> +	/* memory used for buffering guest's memory */
> +	void			*buffer_page;
>  };
>
>  struct vgic_dist {
> diff --git a/virt/kvm/arm/its-emul.c b/virt/kvm/arm/its-emul.c
> index 8349970..7a8c5db 100644
> --- a/virt/kvm/arm/its-emul.c
> +++ b/virt/kvm/arm/its-emul.c
> @@ -59,6 +59,7 @@ struct its_itte {
>  	struct its_collection *collection;
>  	u32 lpi;
>  	u32 event_id;
> +	u8 priority;
>  	bool enabled;
>  	unsigned long *pending;
>  };
> @@ -80,8 +81,124 @@ static struct its_itte *find_itte_by_lpi(struct kvm *kvm, int lpi)
>  	return NULL;
>  }
>
> +#define LPI_PROP_ENABLE_BIT(p)	((p) & LPI_PROP_ENABLED)
> +#define LPI_PROP_PRIORITY(p)	((p) & 0xfc)
> +
> +/* stores the priority and enable bit for a given LPI */
> +static void update_lpi_config(struct kvm *kvm, struct its_itte *itte, u8 prop)
> +{
> +	itte->priority = LPI_PROP_PRIORITY(prop);
> +	itte->enabled = LPI_PROP_ENABLE_BIT(prop);
> +}
> +
> +#define GIC_LPI_OFFSET 8192
> +
> +/* We scan the table in chunks the size of the smallest page size */
> +#define CHUNK_SIZE 4096U
> +
>  #define BASER_BASE_ADDRESS(x) ((x) & 0xf000ULL)
>
> +static int nr_idbits_propbase(u64 propbaser)
> +{
> +	int nr_idbits = (1U << (propbaser & 0x1f)) + 1;
> +
> +	return max(nr_idbits, INTERRUPT_ID_BITS_ITS);
> +}
> +
> +/*
> + * Scan the whole LPI configuration table and put the LPI configuration
> + * data in our own data structures. This relies on the LPI being
> + * mapped before.
> + */
> +static bool its_update_lpis_configuration(struct kvm *kvm, u64 prop_base_reg)
> +{
> +	struct vgic_dist *dist = &kvm->arch.vgic;
> +	u8 *prop = dist->its.buffer_page;
> +	u32 tsize;
> +	gpa_t propbase;
> +	int lpi = GIC_LPI_OFFSET;
> +	struct its_itte *itte;
> +	struct its_device *device;
> +	int ret;
> +
> +	propbase = BASER_BASE_ADDRESS(prop_base_reg);
> +	tsize = nr_idbits_propbase(prop_base_reg);
> +
> +	while (tsize > 0) {
> +		int chunksize = min(tsize, CHUNK_SIZE);
> +
> +		ret = kvm_read_guest(kvm, propbase, prop, chunksize);
> +		if (ret)
> +			return false;

I think it would be more convenient to return 'ret' here, and 0 on success.
I see that currently nobody consumes the error code, but with live migration
this may change. And the same in its_sync_lpi_pending_table().

> +
> +		spin_lock(&dist->its.lock);
> +		/*
> +		 * Updating the status for all allocated LPIs. We catch
> +		 * those LPIs that get disabled. We really don't care
> +		 * about unmapped LPIs, as they need to be updated
> +		 * later manually anyway once they get mapped.
> +		 */
> +		for_each_lpi(device, itte, kvm) {
> +			if (itte->lpi < lpi || itte->lpi >= lpi + chunksize)
> +				continue;
> +
> +			update_lpi_config(kvm, itte, prop[itte->lpi - lpi]);
> +		}
> +		spin_unlock(&dist->its.lock);
> +		tsize -= chunksize;
> +		lpi += chunksize;
> +		propbase += chunksize;
> +	}
> +
> +	return true;
> +}
> +
> +/*
> + * Scan the whole LPI pending table and sync the pending bit in there
> + * with
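For illustration only (this is not part of the patch): a minimal user-space sketch of the chunked configuration-table scan, written with the review suggestion applied, i.e. returning the error code from the read and 0 on success. Everything prefixed with sketch_ or SKETCH_ is hypothetical; sketch_read_guest() is a stand-in for kvm_read_guest() that reads from a flat array, and the bit layout of a configuration byte (bit 0 enable, bits [7:2] priority) is an assumption taken from the 0xfc mask in the quoted code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of one LPI configuration byte (hypothetical names). */
#define SKETCH_LPI_ENABLED	0x01u	/* bit 0: LPI enabled */
#define SKETCH_LPI_PRIO_MASK	0xfcu	/* bits [7:2]: priority */
#define SKETCH_LPI_OFFSET	8192u	/* first LPI number, as in the patch */
#define SKETCH_CHUNK_SIZE	4096u	/* scan granularity, as in the patch */

struct sketch_lpi {
	uint32_t lpi;
	uint8_t priority;
	int enabled;
};

/*
 * Stand-in for kvm_read_guest(): copies from a flat array and fails
 * with -EFAULT if the requested range lies outside of it.
 */
static int sketch_read_guest(const uint8_t *mem, size_t mem_len, size_t off,
			     uint8_t *dst, size_t len)
{
	if (off > mem_len || len > mem_len - off)
		return -EFAULT;
	memcpy(dst, mem + off, len);
	return 0;
}

/* Returns 0 on success or a negative errno from the failed read. */
static int sketch_sync_config(const uint8_t *table, size_t table_len,
			      size_t nr_entries, struct sketch_lpi *lpis,
			      size_t nr_lpis)
{
	uint8_t chunk[SKETCH_CHUNK_SIZE];
	uint32_t first = SKETCH_LPI_OFFSET;	/* LPI number of chunk[0] */
	size_t off = 0;

	assert(table && lpis);

	while (nr_entries > 0) {
		size_t n = nr_entries < SKETCH_CHUNK_SIZE ?
			   nr_entries : SKETCH_CHUNK_SIZE;
		int ret = sketch_read_guest(table, table_len, off, chunk, n);

		if (ret)
			return ret;	/* propagate instead of "false" */

		/* Update only the known LPIs that fall into this chunk. */
		for (size_t i = 0; i < nr_lpis; i++) {
			uint8_t prop;

			if (lpis[i].lpi < first || lpis[i].lpi >= first + n)
				continue;

			prop = chunk[lpis[i].lpi - first];
			lpis[i].priority = prop & SKETCH_LPI_PRIO_MASK;
			lpis[i].enabled = (prop & SKETCH_LPI_ENABLED) != 0;
		}

		nr_entries -= n;
		first += n;
		off += n;
	}

	return 0;
}
```

A caller that only needs the old boolean behaviour can still test the result with !ret, while a future consumer (live migration was the case mentioned above) gets the real error code. The sketch leaves out locking, which the patch does with its.lock around the inner loop.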
[PATCH] KVM: x86: fix eflags state following processor init/reset
Reference SDM 3.4.3:

  Following initialization of the processor (either by asserting the
  RESET pin or the INIT pin), the state of the EFLAGS register is
  0002H.

However, KVM neither sets the fixed bit of EFLAGS nor clears the other
bits during init/reset. Fix this by setting the EFLAGS register to 0002H
following initialization of the processor.

Signed-off-by: Wanpeng Li
---
 arch/x86/kvm/vmx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index b680c2e..326f6ea 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4935,6 +4935,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx_set_efer(vcpu, 0);
 	vmx_fpu_activate(vcpu);
 	update_exception_bitmap(vcpu);
+	vmx_set_rflags(vcpu, X86_EFLAGS_FIXED);
 
 	vpid_sync_context(vmx->vpid);
 }
-- 
1.9.1
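For illustration only (not part of the patch): a small user-space sketch of what the one-line change achieves. The struct and the SKETCH_ names are hypothetical; the only fact relied on is that bit 1 of EFLAGS is the reserved, always-one bit, so the reset value is 0x2 regardless of what the flags held before.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; bit positions follow the x86 EFLAGS layout. */
#define SKETCH_EFLAGS_CF	(1ull << 0)	/* carry flag */
#define SKETCH_EFLAGS_FIXED	(1ull << 1)	/* reserved, always reads as 1 */
#define SKETCH_EFLAGS_IF	(1ull << 9)	/* interrupt enable flag */

/* Minimal stand-in for the vCPU state touched by the patch. */
struct sketch_vcpu {
	uint64_t rflags;
};

/*
 * Mirror of the reset behaviour the patch asks for: whatever the flags
 * held before INIT/RESET, afterwards only the fixed bit is set.
 */
static void sketch_vcpu_reset(struct sketch_vcpu *vcpu)
{
	assert(vcpu);
	vcpu->rflags = SKETCH_EFLAGS_FIXED;
}
```

Assigning the whole register, rather than OR-ing in the fixed bit, is what also clears stale bits such as IF or CF left over from before the reset.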
Re: [PATCH v3 04/20] KVM: ARM64: Add reset and access handlers for PMCR_EL0 register
On 2015/10/16 13:35, Wei Huang wrote:
> On 09/24/2015 05:31 PM, Shannon Zhao wrote:
>> Add reset handler which gets host value of PMCR_EL0 and make writable
>> bits architecturally UNKNOWN. Add a common access handler for PMU
>> registers which emulates writing and reading register and add emulation
>> for PMCR.
>>
>> Signed-off-by: Shannon Zhao
>> ---
>>  arch/arm64/kvm/sys_regs.c | 81 +--
>>  1 file changed, 79 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>> index b41607d..60c0842 100644
>> --- a/arch/arm64/kvm/sys_regs.c
>> +++ b/arch/arm64/kvm/sys_regs.c
>> @@ -33,6 +33,7 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>
>>  #include
>>
>> @@ -446,6 +447,53 @@ static void reset_mpidr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
>>  	vcpu_sys_reg(vcpu, MPIDR_EL1) = (1ULL << 31) | mpidr;
>>  }
>>
>> +static void vcpu_sysreg_write(struct kvm_vcpu *vcpu,
>> +			      const struct sys_reg_desc *r, u64 val)
>> +{
>> +	if (!vcpu_mode_is_32bit(vcpu))
>> +		vcpu_sys_reg(vcpu, r->reg) = val;
>> +	else
>> +		vcpu_cp15(vcpu, r->reg) = lower_32_bits(val);
>> +}
>> +
>> +static void reset_pmcr(struct kvm_vcpu *vcpu, const struct sys_reg_desc *r)
>> +{
>> +	u64 pmcr, val;
>> +
>> +	asm volatile("mrs %0, pmcr_el0\n" : "=r" (pmcr));
>> +	/* Writable bits of PMCR_EL0 (ARMV8_PMCR_MASK) is reset to UNKNOWN*/
>> +	val = (pmcr & ~ARMV8_PMCR_MASK) | (ARMV8_PMCR_MASK & 0xdecafbad);
> Two comments:
> (1) In Patch 1, ARMV8_PMCR_MASK is defined as 0x3f. According to ARMv8
> spec, PMCR_EL0.LC (bit 6) is also writable. Should ARMV8_PMCR_MASK be 0x7f?

According to the spec, it should be 0x7f.

> (2) According to spec the PMCR_EL0.E bit reset to 0, not UNKNOWN.
>

Yeah, will fix this.

Thanks,
-- 
Shannon
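For illustration only: a user-space sketch of how the reset value could be computed once both review comments are applied, i.e. a writable mask of 0x7f (so that LC, bit 6, is covered) and the E bit (bit 0) forced to 0. This is an assumption about the shape of the fix, not the code of the next revision; the SKETCH_ names and the sample host value used in the note below are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names mirroring the constants discussed in the review. */
#define SKETCH_PMCR_E		(1u << 0)	/* enable bit, resets to 0 */
#define SKETCH_PMCR_MASK	0x7fu		/* writable bits [6:0], incl. LC */

/*
 * Keep the read-only fields of the host PMCR_EL0 value, fill the
 * writable bits with an arbitrary "UNKNOWN" pattern, then clear E.
 */
static uint64_t sketch_reset_pmcr(uint64_t host_pmcr)
{
	uint64_t val;

	val = (host_pmcr & ~(uint64_t)SKETCH_PMCR_MASK) |
	      (SKETCH_PMCR_MASK & 0xdecafbadu);
	val &= ~(uint64_t)SKETCH_PMCR_E;

	assert((val & SKETCH_PMCR_E) == 0);	/* E must not survive reset */
	return val;
}
```

With a made-up host value of 0x41013000, the writable bits become 0xdecafbad & 0x7f = 0x2d, clearing E gives 0x2c, and the result is 0x4101302c; the low seven bits of the host value do not influence the outcome.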
Re: [PATCH v3 00/20] KVM: ARM64: Add guest PMU support
On 2015/10/17 1:01, Christopher Covington wrote:
> On 10/16/2015 12:55 AM, Wei Huang wrote:
>> On 09/24/2015 05:31 PM, Shannon Zhao wrote:
>>> This patchset adds guest PMU support for KVM on ARM64. It takes
>>> trap-and-emulate approach. When guest wants to monitor one event, it
>>> will be trapped by KVM and KVM will call perf_event API to create a perf
>>> event and call relevant perf_event APIs to get the count value of event.
>>>
>>> Use perf to test this patchset in guest. When using "perf list", it
>>> shows the list of the hardware events and hardware cache events perf
>>> supports. Then use "perf stat -e EVENT" to monitor some event. For
>>> example, use "perf stat -e cycles" to count cpu cycles and
>>> "perf stat -e cache-misses" to count cache misses.
>>>
>>> Below are the outputs of "perf stat -r 5 sleep 5" when running in host
>>> and guest.
>>>
>>> Host:
>>>  Performance counter stats for 'sleep 5' (5 runs):
>>>
>>>       0.551428      task-clock (msec)         #    0.000 CPUs utilized            ( +-  0.91% )
>>>              1      context-switches          #    0.002 M/sec
>>>              0      cpu-migrations            #    0.000 K/sec
>>>             48      page-faults               #    0.088 M/sec                    ( +-  1.05% )
>>>        1150265      cycles                    #    2.086 GHz                      ( +-  0.92% )
>>>                     stalled-cycles-frontend
>>>                     stalled-cycles-backend
>>>         526398      instructions              #    0.46  insns per cycle          ( +-  0.89% )
>>>                     branches
>>>           9485      branch-misses             #   17.201 M/sec                    ( +-  2.35% )
>>>
>>>    5.000831616 seconds time elapsed                                          ( +-  0.00% )
>>>
>>> Guest:
>>>  Performance counter stats for 'sleep 5' (5 runs):
>>>
>>>       0.730868      task-clock (msec)         #    0.000 CPUs utilized            ( +-  1.13% )
>>>              1      context-switches          #    0.001 M/sec
>>>              0      cpu-migrations            #    0.000 K/sec
>>>             48      page-faults               #    0.065 M/sec                    ( +-  0.42% )
>>>        1642982      cycles                    #    2.248 GHz                      ( +-  1.04% )
>>>                     stalled-cycles-frontend
>>>                     stalled-cycles-backend
>>>         637964      instructions              #    0.39  insns per cycle          ( +-  0.65% )
>>>                     branches
>>>          10377      branch-misses             #   14.198 M/sec                    ( +-  1.09% )
>>>
>>>    5.001289068 seconds time elapsed                                          ( +-  0.00% )
>>>
>>
>> Thanks for V3. One suggestion is to run more perf stress tests, such as
>> "perf test". So we know the corner cases are covered as much as possible.
> I'd also recommend Vince Weaver's perf_event_tests. It tests things like
> signal-on-counter-overflow that I've never seen anywhere else (other than some
> of my own code).
>
> https://github.com/deater/perf_event_tests

Ok. Thanks for your suggestion.

-- 
Shannon
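For illustration only: a short sketch that recomputes the derived columns of the "perf stat" output quoted above (instructions per cycle and the effective clock rate) from the raw counts, and compares guest against host. The struct and function names are hypothetical; the input numbers are the ones from the quoted runs.

```c
#include <assert.h>

/* Raw counters as printed by "perf stat" (hypothetical struct name). */
struct sketch_perf_sample {
	double task_clock_ms;	/* task-clock, in milliseconds */
	double cycles;
	double instructions;
};

/* Instructions retired per cycle ("insns per cycle" column). */
static double sketch_ipc(const struct sketch_perf_sample *s)
{
	assert(s->cycles > 0);
	return s->instructions / s->cycles;
}

/* Effective clock rate in GHz: cycles per millisecond divided by 1e6. */
static double sketch_ghz(const struct sketch_perf_sample *s)
{
	assert(s->task_clock_ms > 0);
	return s->cycles / s->task_clock_ms / 1e6;
}

/* Relative extra cycles the guest needed compared with the host. */
static double sketch_cycle_overhead(const struct sketch_perf_sample *host,
				    const struct sketch_perf_sample *guest)
{
	assert(host->cycles > 0);
	return guest->cycles / host->cycles - 1.0;
}
```

Fed with the host sample (0.551428 ms, 1150265 cycles, 526398 instructions) and the guest sample (0.730868 ms, 1642982 cycles, 637964 instructions), this gives roughly 0.46 and 0.39 instructions per cycle and about 2.09 and 2.25 GHz, consistent with the quoted output, and shows the guest spending roughly 43% more cycles on the same workload.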