Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access

2014-10-11 Thread Nadav Amit
Radim, Paolo, Sorry for the late responses (due to holidays)…

On Oct 11, 2014, at 12:39 PM, Paolo Bonzini  wrote:

> Il 10/10/2014 17:54, Radim Krčmář ha scritto:
 
 One exception is the case of conforming code segment. The SDM says: "Use a
 code-segment override prefix (CS) to read a readable...  [it is] valid 
 because
 the DPL of the code segment selected by the CS register is the same as the
 CPL." This is misleading since CS.DPL may be lower (numerically) than CPL, 
 and
 CS would still be accessible.  The emulator should avoid privilege level
 checks for data reads using CS.
>> Ah, after stripping faulty presumptions, I'm not sure this change is
>> enough ... shouldn't we also skip the check on conforming code segments?
>> 
>> Method 2 is always valid because the privilege level of a conforming
>> code segment is effectively the same as the CPL, regardless of its DPL.
> 
> Radim is right; we need to skip the check on conforming code segments 
> and, once we do that, checking addr.seg is not necessary anymore.  That 
> is because, for a CS override on a nonconforming code segment, at the 
> time we fetch the instruction we know that cpl == desc.dpl.  The less 
> restrictive data segment check (cpl <= desc.dpl) thus always passes.
Yes. I was wrong, assuming the code-segment checks are just a derivative of the 
data segment checks.


> 
> Let's put together this check and the readability check, too, since
> we are adding another "if (fetch)".
> 
> Can you guys think of a way to simplify the following untested patch?
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 03954f7900f5..9f3e33551db9 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -638,9 +638,6 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
>   if ((((ctxt->mode != X86EMUL_MODE_REAL) && (desc.type & 8))
>   || !(desc.type & 2)) && write)
>   goto bad;
> - /* unreadable code segment */
> - if (!fetch && (desc.type & 8) && !(desc.type & 2))
> - goto bad;
>   lim = desc_limit_scaled(&desc);
>   if ((ctxt->mode == X86EMUL_MODE_REAL) && !fetch &&
>   (ctxt->d & NoBigReal)) {
> @@ -660,17 +657,40 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
>   goto bad;
>   }
>   cpl = ctxt->ops->cpl(ctxt);
> - if (!(desc.type & 8)) {
> - /* data segment */
> + if (fetch && (desc.type & 8)) {
> + if (!(desc.type & 4)) {
> + /* nonconforming code segment */
> + if (cpl != desc.dpl)
> + goto bad;
> + break;
> + } else {
> + /* conforming code segment */
> + if (cpl < desc.dpl)
> + goto bad;
> + break;
> + }
> + }
> +
> + if (likely(!(desc.type & 8) || (desc.type & 6) == 2)) {
> + /*
> +  * Data segment or readable, nonconforming code
> +  * segment.  The SDM mentions that access through
> +  * a code-segment override prefix is always valid.
> +  * This really only matters for conforming code
> +  * segments (checked below, and always valid anyway):
> +  * for nonconforming ones, cpl == desc.dpl was checked
> +  * when fetching the instruction, meaning the following
> +  * test will always pass too.
> +  */
>   if (cpl > desc.dpl)
>   goto bad;
> - } else if ((desc.type & 8) && !(desc.type & 4)) {
> - /* nonconforming code segment */
> - if (cpl != desc.dpl)
> - goto bad;
> - } else if ((desc.type & 8) && (desc.type & 4)) {
> - /* conforming code segment */
> - if (cpl < desc.dpl)
> + } else {
> + /*
> +  * These are the (rare) cases that do not behave
> +  * like data segments: nonreadable code segments (bad)
> +  * and readable, conforming code segments (good).
> +  */
> + if (!(desc.type & 2))
>   goto bad;
>   }
>   break;

Looks good. I’ll give it a try, but it is hard to give a definitive answer
since the emulator is still bug-ridden.
Please note I submitted another patch in this area ("Wrong error code on limit
violation during emulat



Re: [PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt

2014-10-11 Thread Eric Dumazet
On Sat, 2014-10-11 at 15:16 +0800, Jason Wang wrote:
> We used to free transmitted packets in ndo_start_xmit() to get better
> performance. One side effect is that skb_orphan() needs to be called in
> ndo_start_xmit(), which makes sk_wmem_alloc inaccurate. For the TCP
> protocol, this means several optimizations, such as TCP small queues and
> auto corking, cannot work well. This can lead to extra low throughput for
> streams of small packets.
> 
> Thanks to the urgent descriptor support, this patch tries to solve the
> issue by enabling the tx interrupt selectively for stream packets. This
> means we don't need to orphan TCP stream packets in ndo_start_xmit() but
> can instead enable the tx interrupt for those packets. After we get a tx
> interrupt, a tx napi is scheduled to free those packets.
> 
> With this method, sk_wmem_alloc of a TCP socket is more accurate than
> before, which lets TCP batch more through TSQ and auto corking.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/net/virtio_net.c | 164 
> ---
>  1 file changed, 128 insertions(+), 36 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 5810841..b450fc4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -72,6 +72,8 @@ struct send_queue {
>  
>   /* Name of the send queue: output.$index */
>   char name[40];
> +
> + struct napi_struct napi;
>  };
>  
>  /* Internal representation of a receive virtqueue */
> @@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue 
> *rq, gfp_t gfp_mask)
>   return p;
>  }
>  
> +static int free_old_xmit_skbs(struct send_queue *sq, int budget)
> +{
> + struct sk_buff *skb;
> + unsigned int len;
> + struct virtnet_info *vi = sq->vq->vdev->priv;
> + struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
> + int sent = 0;
> +
> + while (sent < budget &&
> +(skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> + pr_debug("Sent skb %p\n", skb);
> +
> + u64_stats_update_begin(&stats->tx_syncp);
> + stats->tx_bytes += skb->len;
> + stats->tx_packets++;
> + u64_stats_update_end(&stats->tx_syncp);
> +
> + dev_kfree_skb_any(skb);
> + sent++;
> + }
> +

You could accumulate skb->len in a totlen var, and perform a single

u64_stats_update_begin(&stats->tx_syncp);
stats->tx_bytes += totlen;
stats->tx_packets += sent;
u64_stats_update_end(&stats->tx_syncp);

after the loop.
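
Applied to the helper quoted above, that would look roughly like this
(untested sketch; only the stats update moves out of the loop, nothing
else changes):

static int free_old_xmit_skbs(struct send_queue *sq, int budget)
{
	struct sk_buff *skb;
	unsigned int len;
	struct virtnet_info *vi = sq->vq->vdev->priv;
	struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
	unsigned int totlen = 0;
	int sent = 0;

	while (sent < budget &&
	       (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
		pr_debug("Sent skb %p\n", skb);
		totlen += skb->len;
		dev_kfree_skb_any(skb);
		sent++;
	}

	/* one seqcount-protected update per run instead of one per packet */
	u64_stats_update_begin(&stats->tx_syncp);
	stats->tx_bytes += totlen;
	stats->tx_packets += sent;
	u64_stats_update_end(&stats->tx_syncp);

	return sent;
}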


> + return sent;
> +}
> +

...

> +
> +static bool virtnet_skb_needs_intr(struct sk_buff *skb)
> +{
> + union {
> + unsigned char *network;
> + struct iphdr *ipv4;
> + struct ipv6hdr *ipv6;
> + } hdr;
> + struct tcphdr *th = tcp_hdr(skb);
> + u16 payload_len;
> +
> + hdr.network = skb_network_header(skb);
> +
> + /* Only IPv4/IPv6 with TCP is supported */

Oh well, yet another packet flow dissector :)

If most packets are caught by your implementation, you could use it as
the fast path and fall back to skb_flow_dissect() for encapsulated
traffic.

struct flow_keys keys;   

if (!skb_flow_dissect(skb, &keys)) 
return false;

if (keys.ip_proto != IPPROTO_TCP)
return false;

then check __skb_get_poff() to see how to get th, and check whether there
is some payload...
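
A rough sketch of that fallback, assuming the 3.17-era skb_flow_dissect()
and __skb_get_poff() interfaces (the function name and the shortcut of
reading the TCP header from the linear area are assumptions, not part of
the posted patch):

#include <linux/tcp.h>
#include <net/flow_keys.h>

static bool virtnet_skb_needs_intr_slow(struct sk_buff *skb)
{
	struct flow_keys keys;
	struct tcphdr *th;

	if (!skb_flow_dissect(skb, &keys))
		return false;

	if (keys.ip_proto != IPPROTO_TCP)
		return false;

	/* keys.thoff is the transport header offset found by the dissector */
	th = (struct tcphdr *)(skb->data + keys.thoff);

	/* __skb_get_poff() returns the offset where the payload would start,
	 * so anything beyond it means the segment actually carries data */
	return !th->psh && skb->len > __skb_get_poff(skb);
}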


> + if ((skb->protocol == htons(ETH_P_IP)) &&
> + hdr.ipv4->protocol == IPPROTO_TCP) {
> + payload_len = ntohs(hdr.ipv4->tot_len) - hdr.ipv4->ihl * 4 -
> +   th->doff * 4;
> + } else if ((skb->protocol == htons(ETH_P_IPV6)) &&
> +hdr.ipv6->nexthdr == IPPROTO_TCP) {
> + payload_len = ntohs(hdr.ipv6->payload_len) - th->doff * 4;
> + } else {
> + return false;
> + }
> +
> + /* We don't want to delay packets with the PSH bit set or pure ACKs */
> + if (!th->psh && payload_len)
> + return true;
> +
> + return false;
>  }






Re: [PATCH v2 2/5] KVM: x86: Emulator performs code segment checks on read access

2014-10-11 Thread Paolo Bonzini
Il 10/10/2014 17:54, Radim Krčmář ha scritto:
>> > 
>> > One exception is the case of conforming code segment. The SDM says: "Use a
>> > code-segment override prefix (CS) to read a readable...  [it is] valid 
>> > because
>> > the DPL of the code segment selected by the CS register is the same as the
>> > CPL." This is misleading since CS.DPL may be lower (numerically) than CPL, 
>> > and
>> > CS would still be accessible.  The emulator should avoid privilege level
>> > checks for data reads using CS.
> Ah, after stripping faulty presumptions, I'm not sure this change is
> enough ... shouldn't we also skip the check on conforming code segments?
> 
>  Method 2 is always valid because the privilege level of a conforming
>  code segment is effectively the same as the CPL, regardless of its DPL.

Radim is right; we need to skip the check on conforming code segments 
and, once we do that, checking addr.seg is not necessary anymore.  That 
is because, for a CS override on a nonconforming code segment, at the 
time we fetch the instruction we know that cpl == desc.dpl.  The less 
restrictive data segment check (cpl <= desc.dpl) thus always passes.
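
For reference, the desc.type bits these checks key on (per the SDM segment
descriptor layout) are: bit 3 (0x8) = code segment, bit 2 (0x4) = conforming
for code segments, bit 1 (0x2) = readable (code) / writable (data).  A small
sketch spelling that out (not part of the patch; the helper names are made
up):

static inline bool seg_is_code(const struct desc_struct *desc)
{
	return desc->type & 8;			/* bit 3: code vs. data */
}

static inline bool seg_is_conforming_code(const struct desc_struct *desc)
{
	return (desc->type & 0xc) == 0xc;	/* code segment, C bit set */
}

static inline bool seg_is_readable_code(const struct desc_struct *desc)
{
	return (desc->type & 0xa) == 0xa;	/* code segment, R bit set */
}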

Let's put together this check and the readability check, too, since
we are adding another "if (fetch)".

Can you guys think of a way to simplify the following untested patch?

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 03954f7900f5..9f3e33551db9 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -638,9 +638,6 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
	if ((((ctxt->mode != X86EMUL_MODE_REAL) && (desc.type & 8))
|| !(desc.type & 2)) && write)
goto bad;
-   /* unreadable code segment */
-   if (!fetch && (desc.type & 8) && !(desc.type & 2))
-   goto bad;
lim = desc_limit_scaled(&desc);
if ((ctxt->mode == X86EMUL_MODE_REAL) && !fetch &&
(ctxt->d & NoBigReal)) {
@@ -660,17 +657,40 @@ static int __linearize(struct x86_emulate_ctxt *ctxt,
goto bad;
}
cpl = ctxt->ops->cpl(ctxt);
-   if (!(desc.type & 8)) {
-   /* data segment */
+   if (fetch && (desc.type & 8)) {
+   if (!(desc.type & 4)) {
+   /* nonconforming code segment */
+   if (cpl != desc.dpl)
+   goto bad;
+   break;
+   } else {
+   /* conforming code segment */
+   if (cpl < desc.dpl)
+   goto bad;
+   break;
+   }
+   }
+
+   if (likely(!(desc.type & 8) || (desc.type & 6) == 2)) {
+   /*
+* Data segment or readable, nonconforming code
+* segment.  The SDM mentions that access through
+* a code-segment override prefix is always valid.
+* This really only matters for conforming code
+* segments (checked below, and always valid anyway):
+* for nonconforming ones, cpl == desc.dpl was checked
+* when fetching the instruction, meaning the following
+* test will always pass too.
+*/
if (cpl > desc.dpl)
goto bad;
-   } else if ((desc.type & 8) && !(desc.type & 4)) {
-   /* nonconforming code segment */
-   if (cpl != desc.dpl)
-   goto bad;
-   } else if ((desc.type & 8) && (desc.type & 4)) {
-   /* conforming code segment */
-   if (cpl < desc.dpl)
+   } else {
+   /*
+* These are the (rare) cases that do not behave
+* like data segments: nonreadable code segments (bad)
+* and readable, conforming code segments (good).
+*/
+   if (!(desc.type & 2))
goto bad;
}
break;



[Help] qemu abort when configure ivshmem size >=32M with kvm-3.6

2014-10-11 Thread zhanghailiang

Hi all,

We can't start a VM when we configure ivshmem with a size >= 32M.
It reports: kvm_set_phys_mem: error registering slot: File exists
Aborted (core dumped)

Qemu command:
#qemu-system-x86_64 -name suse10sp1 -enable-kvm -m 4096 -smp 4,sockets=1, \
cores=4,threads=1 -drive file=/home/sles10_sp1_64_2U,if=none,id=drive-ide0-0-0, 
\
format=raw,cache=none,aio=native -device ide-hd,bus=ide.0,unit=0, \
drive=drive-ide0-0-0,id=ide0-0-0,bootindex=1 -vnc :10 \
-device ivshmem,id=ivshmem0,shm=vm-ivshmem,size=32m,role=master

We have made several tests, and the results are:
1.
Guest OS: suse10sp1
KVM version: 3.6
QEMU version:2.1.0
ivshmem size: 4M/8M/16M
result: OK
2.
Guest OS: sles10sp1
KVM version: 3.6
QEMU version: 2.1.0
ivshmem size: 32M/64M
result: fail
3.
Guest OS: sles10sp1
KVM version: *3.15*
QEMU version: 2.1.0
ivshmem size: 4M/8M/16M/32M/64M
result: OK

After some hard work, we found that commit b940413f5 in the kvm source fixes
this problem.

KVM: Fix user memslot overlap check

Prior to memory slot sorting this loop compared all of the user memory
slots for overlap with new entries.  With memory slot sorting, we're
just checking some number of entries in the array that may or may not
be user slots.  Instead, walk all the slots with kvm_for_each_memslot,
which has the added benefit of terminating early when we hit the first
empty slot, and skip comparison to private slots.

Cc: sta...@vger.kernel.org
Signed-off-by: Alex Williamson 
Signed-off-by: Marcelo Tosatti 
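
For reference, the fixed check walks only real user slots; a paraphrased
sketch of what it does (the helper name is made up, and the exact constant
and field names may differ between kernel versions):

static bool memslot_overlaps(struct kvm *kvm, struct kvm_memory_slot *new,
			     gfn_t base_gfn, unsigned long npages)
{
	struct kvm_memory_slot *slot;

	kvm_for_each_memslot(slot, kvm->memslots) {
		/* skip private slots and the slot being updated */
		if (slot == new || slot->id >= KVM_MEMORY_SLOTS)
			continue;
		/* two ranges overlap unless one ends before the other starts */
		if (!(base_gfn + npages <= slot->base_gfn ||
		      base_gfn >= slot->base_gfn + slot->npages))
			return true;
	}
	return false;
}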

Since our project uses kvm-3.6, we want to backport this patch to kvm-3.6.
Are there any potential problems with backporting it?

Actually, we also tested other guest OSes, such as suse11sp2 and suse11sp3;
they are all OK with ivshmem >= 32M. So is it a bug in the guest OS?

Any help will be appreciated.;)

Thanks,
zhanghailiang



Re: [PATCH v3 2/3] arm/arm64: KVM: Ensure memslots are within KVM_PHYS_SIZE

2014-10-11 Thread Marc Zyngier

On 2014-10-10 11:14, Christoffer Dall wrote:
When creating or moving a memslot, make sure the IPA space is within the
addressable range of the guest.  Otherwise, user space can create too
large a memslot and KVM would try to access potentially unallocated page
table entries when inserting entries in the Stage-2 page tables.

Signed-off-by: Christoffer Dall 


Acked-by: Marc Zyngier 

M.
--
Fast, cheap, reliable. Pick two.


Re: [PATCH v3 1/3] arm64: KVM: Implement 48 VA support for KVM EL2 and Stage-2

2014-10-11 Thread Marc Zyngier

On 2014-10-10 11:14, Christoffer Dall wrote:

This patch adds the necessary support for all host kernel PGSIZE and
VA_SPACE configuration options for both EL2 and the Stage-2 page tables.

However, for 40bit and 42bit PARange systems, the architecture mandates
that VTCR_EL2.SL0 is maximum 1, resulting in fewer levels of stage-2
page tables than levels of host kernel page tables.  At the same time,
for systems with a PARange > 42bit, we limit the IPA range by always
setting VTCR_EL2.T0SZ to 24.

To solve the situation with different levels of page tables for Stage-2
translation than the host kernel page tables, we allocate a dummy PGD
with pointers to our actual initial level Stage-2 page table, in order
for us to reuse the kernel pgtable manipulation primitives.  Reproducing
all these in KVM does not look pretty and unnecessarily complicates the
32-bit side.

Systems with a PARange < 40bits are not yet supported.

 [ I have reworked this patch from its original form submitted by
   Jungseok to take the architecture constraints into consideration.
   There were too many changes from the original patch for me to
   preserve the authorship.  Thanks to Catalin Marinas for his help in
   figuring out a good solution to this challenge.  I have also fixed
   various bugs and missing error code handling from the original
   patch. - Christoffer ]

Cc: Marc Zyngier 
Cc: Catalin Marinas 
Signed-off-by: Jungseok Lee 
Signed-off-by: Christoffer Dall 


Acked-by: Marc Zyngier 

M.
--
Fast, cheap, reliable. Pick two.


Re: bug: host crash when vf is passthrough to vm.

2014-10-11 Thread ChenLiang

Thanks for your reply.
I will try a newer kernel. Do you have any other suggestions?
I have googled it; sadly, nothing mentions this bug.

Best regards
ChenLiang

On 2014/10/11 15:37, Alex Williamson wrote:

> On Sat, 2014-10-11 at 11:02 +0800, ChenLiang wrote:
>> Hi all:
>>
>> kernel: 3.0.93-0.8-default
>> qemu: 1.5
> 
> This is a very old kernel and rather old QEMU as well.  This bug is
> hopefully fixed in a newer kernel.  Thanks,
> 
> Alex
> 
>> crash log:
>>
>> [134397.708857] BUG: unable to handle kernel NULL pointer dereference at 
>> 0012
>> [134397.717334] IP: [] iommu_disable_dev_iotlb+0x15/0x30
>> [134397.724268] PGD 0
>> [134397.726686] Oops:  [#1] SMP
>> [134397.730335] kbox: Begin to handle event info
>> [134397.734992] kbox: kbox: Enter into handle die  dump while current 
>> state:Dump State Init
>> [134397.751054]
>> [134398.043275] kbox: End  handling event info
>> [134398.047757] CPU 1
>> [134398.049678] Modules linked in: mlx4_en(FX) mlx4_core(FX) compat(FX) 
>> openvswitch crc32c libcrc32c gre nm_dev(FN) ip6table_filter ip6_tables 
>> iptable_filter ip_tables ebtable_nat ebtables x_tables uvpdump i
>> [134398.125560] Supported: No, Unsupported modules are loaded
>> [134398.131338]
>> [134398.133220] Pid: 183980, comm: qemu-kvm Tainted: GF  NX 
>> 3.0.93-0.8-default #1 HUAWEI TECHNOLOGIES CO.,LTD. CH80TXSUA/CH80TXSUA
>> [134398.146013] RIP: 0010:[]  [] iommu_disable_dev_iotlb+0x15/0x30
>> [134398.155649] RSP: 0018:8817f0e13be0  EFLAGS: 00010202
>> [134398.161342] RAX: 0002 RBX: 880bf34e6600 RCX: 
>> 0007
>> [134398.169158] RDX: 880bf34e6650 RSI: 0292 RDI: 
>> 880bf2011000
>> [134398.176970] RBP:  R08: dead00200200 R09: 
>> dead00100100
>> [134398.184793] R10: 8817f5d4c380 R11: 8128e350 R12: 
>> 880bf34e6640
>> [134398.192549] R13: 880bf5d3de80 R14: 880653e4a000 R15: 
>> 880bf34d7858
>> [134398.200360] FS:  7f6d17fff980() GS:880c3ee2() 
>> knlGS:
>> [134398.209109] CS:  0010 DS:  ES:  CR0: 8005003b
>> [134398.215223] CR2: 0012 CR3: 00096ed95000 CR4: 
>> 001427e0
>> [134398.223000] DR0:  DR1:  DR2: 
>> 
>> [134398.230812] DR3:  DR6: 0ff0 DR7: 
>> 0400
>> [134398.238630] Process qemu-kvm (pid: 183980, threadinfo 8817f0e12000, 
>> task 8817804ac300)
>> [134398.247906] Stack:
>> [134398.250260]  8128f756 ffed 880bf34d7840 
>> 0292
>> [134398.258383]  880bf34d7640 880653e4a090 ffed 
>> 880653e4a000
>> [134398.266509]  880653e4a000 880bee6c4878 812938c6 
>> 880bebe17d60
>> [134398.274659] Call Trace:
>> [134398.277503]  [] domain_remove_one_dev_info+0x156/0x280
>> [134398.284582]  [] intel_iommu_attach_device+0x156/0x170
>> [134398.291583]  [] kvm_assign_device+0x73/0x150 [kvm]
>> [134398.298360]  [] kvm_vm_ioctl_assign_device+0x247/0x3c0 [kvm]
>> [134398.306303]  [] kvm_vm_ioctl_assigned_device+0x2fc/0x6a0 [kvm]
>> [134398.314401]  [] kvm_vm_ioctl+0x101/0x300 [kvm]
>> [134398.320789]  [] do_vfs_ioctl+0x8b/0x3b0
>> [134398.326566]  [] sys_ioctl+0xa1/0xb0
>> [134398.331996]  [] system_call_fastpath+0x16/0x1b
>> [134398.338376]  [<7f6d15b77e57>] 0x7f6d15b77e56
>> [134398.343373] Code: c6 05 97 dc de 00 01 eb a7 66 66 66 66 2e 0f 1f 84 00 
>> 00 00 00 00 48 8b 7f 28 48 85 ff 74 12 48 8b 87 30 09 00 00 48 85 c0 74 06 
>> 40 10 01 75 05 f3 c3 0f 1f 00 e9 ab 61 00 00 66 66
>> [134398.364424] RIP  [] iommu_disable_dev_iotlb+0x15/0x30
>> [134398.371441]  RSP
>> [134398.375314] CR2: 0012
>>





[GIT PULL] VFIO updates for 3.18-rc1

2014-10-11 Thread Alex Williamson
Hi Linus,

The following changes since commit fe82dcec644244676d55a1384c958d5f67979adb:

  Linux 3.17-rc7 (2014-09-28 14:29:07 -0700)

are available in the git repository at:

  git://github.com/awilliam/linux-vfio.git tags/vfio-v3.18-rc1

for you to fetch changes up to 93899a679fd6b2534b5c297d9316bae039ebcbe1:

  vfio-pci: Fix remove path locking (2014-09-29 17:18:39 -0600)


VFIO updates for v3.18-rc1
 - Nested IOMMU extension to type1 (Will Deacon)
 - Restore MSIx message before enabling (Gavin Shan)
 - Fix remove path locking (Alex Williamson)


Alex Williamson (1):
  vfio-pci: Fix remove path locking

Gavin Shan (3):
  PCI: Export MSI message relevant functions
  vfio/pci: Restore MSIx message prior to enabling
  drivers/vfio: Export vfio_spapr_iommu_eeh_ioctl() with GPL

Will Deacon (2):
  iommu: introduce domain attribute for nesting IOMMUs
  vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type

 drivers/pci/msi.c |   2 +
 drivers/vfio/pci/vfio_pci.c   | 136 --
 drivers/vfio/pci/vfio_pci_intrs.c |  15 +
 drivers/vfio/vfio_iommu_type1.c   |  30 +++--
 drivers/vfio/vfio_spapr_eeh.c |   2 +-
 include/linux/iommu.h |   1 +
 include/uapi/linux/vfio.h |   3 +
 7 files changed, 104 insertions(+), 85 deletions(-)



Re: [Qemu-devel] [Bug?] qemu abort when trying to passthrough BCM5719 Gigabit Ethernet

2014-10-11 Thread Alex Williamson
On Sat, 2014-10-11 at 13:58 +0800, zhanghailiang wrote:
> Hi all,
> 
> When I try to pass a BCM5719 Gigabit Ethernet through to a guest using the
> qemu master branch, it aborts and shows:
> kvm_set_phys_mem: error registering slot: Bad Address.
> 
> qemu command:
> #./qemu/qemu/x86_64-softmmu/qemu-system-x86_64 --enable-kvm -smp 4 -m 4096 
> -vnc :99 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x3  -drive 
> file=/home/suse11_sp3_64,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none,aio=native
>  -device scsi-hd,bus=scsi0.0,channel=0,scsi-
> id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -device 
> pci-assign,host=01:00.1,id=mydevice -net none
> 
> info about guest and host:
> host OS: 3.16.5
> *guest OS: Novell SuSE Linux Enterprise Server 11 SP3*
> #cat /proc/cpuinfo
> processor   : 31
> vendor_id   : GenuineIntel
> cpu family  : 6
> model   : 62
> model name  : Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
> stepping: 4
> microcode   : 0x416
> cpu MHz : 1926.875
> cache size  : 20480 KB
> physical id : 1
> siblings: 16
> core id : 7
> cpu cores   : 8
> apicid  : 47
> initial apicid  : 47
> fpu : yes
> fpu_exception   : yes
> cpuid level : 13
> wp  : yes
> flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
> pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
> nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx
> smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt 
> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln
> pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
> bogomips: 4005.35
> clflush size: 64
> cache_alignment : 64
> address sizes   : 46 bits physical, 48 bits virtual
> power management:
> 
> gdb info:
> (gdb) bt
> #0  0x71ce9989 in raise () from /usr/lib64/libc.so.6
> #1  0x71ceb098 in abort () from /usr/lib64/libc.so.6
> #2  0x556275cf in kvm_set_phys_mem (section=0x7fffedcea790, add=true) 
> at /home/qemu/qemu/kvm-all.c:711
> #3  0x5562980f in address_space_update_topology_pass 
> (as=as@entry=0x55d01ca0 , adding=adding@entry=true,
>  new_view=, new_view=, 
> old_view=0x7fffe8022a90, old_view=0x7fffe8022a90) at 
> /home/qemu/qemu/memory.c:752
> #4  0x5562b910 in address_space_update_topology (as=0x55d01ca0 
> ) at /home/qemu/qemu/memory.c:767
> #5  memory_region_transaction_commit () at /home/qemu/qemu/memory.c:808
> #6  0x557a75b4 in pci_update_mappings (d=0x562ba9f0) at 
> hw/pci/pci.c:1113
> #7  0x557a7932 in pci_default_write_config (d=d@entry=0x562ba9f0, 
> addr=addr@entry=20, val_in=val_in@entry=4294967295, l=l@entry=4)
>  at hw/pci/pci.c:1165
> #8  0x5566c17e in assigned_dev_pci_write_config 
> (pci_dev=0x562ba9f0, address=20, val=4294967295, len=4)
>  at /home/qemu/qemu/hw/i386/kvm/pci-assign.c:1196
> #9  0x55628fea in access_with_adjusted_size (addr=addr@entry=0, 
> value=value@entry=0x7fffedceaae0, size=size@entry=4,
>  access_size_min=, access_size_max=, 
> access=0x55629160 , mr=0x56231f00)
>  at /home/qemu/qemu/memory.c:480
> #10 0x5562dbf7 in memory_region_dispatch_write (size=4, 
> data=18446744073709551615, addr=0, mr=0x56231f00) at 
> /home/qemu/qemu/memory.c:1122
> #11 io_mem_write (mr=mr@entry=0x56231f00, addr=0, val=, 
> size=4) at /home/qemu/qemu/memory.c:1958
> #12 0x555f8963 in address_space_rw (as=0x55d01d80 
> , addr=addr@entry=3324, buf=0x77fec000 
> "\377\377\377\377",
>  len=len@entry=4, is_write=is_write@entry=true) at 
> /home/qemu/qemu/exec.c:2145
> #13 0x55628491 in kvm_handle_io (count=1, size=4, 
> direction=, data=, port=3324) at 
> /home/qemu/qemu/kvm-all.c:1614
> #14 kvm_cpu_exec (cpu=cpu@entry=0x5620e610) at 
> /home/qemu/qemu/kvm-all.c:1771
> #15 0x55617182 in qemu_kvm_cpu_thread_fn (arg=0x5620e610) at 
> /home/qemu/qemu/cpus.c:953
> #16 0x76ba2df3 in start_thread () from /usr/lib64/libpthread.so.0
> #17 0x71daa3dd in clone () from /usr/lib64/libc.so.6
> 
> messages log:
> Oct 10 07:43:18 localhost kernel: kvm: zapping shadow pages for mmio 
> generation wraparound
> Oct 10 07:43:27 localhost kernel: kvm [13251]: vcpu0 disabled perfctr wrmsr: 
> 0xc1 data 0xabcd
> Oct 10 07:43:28 localhost kernel: intel_iommu_map: iommu width (48) is not 
> sufficient for the mapped address (fe001000)
> Oct 10 07:43:28 localhost kernel: kvm_iommu_map_address:iommu failed to map 
> pfn=94880
> 
> After bisecting the commits, I found that commit 453c43a4a241a7 leads to this
> problem.
> 
> commit 57271d63c4d93352406704d540453c43a4a241a7
> Author: Paolo Bonzini 
> Date:   Thu Nov 7 17:14:37 2013 +0100
> 
>  exec: make address spaces 64-bit wi

Re: bug: host crash when vf is passthrough to vm.

2014-10-11 Thread Alex Williamson
On Sat, 2014-10-11 at 11:02 +0800, ChenLiang wrote:
> Hi all:
> 
> kernel: 3.0.93-0.8-default
> qemu: 1.5

This is a very old kernel and rather old QEMU as well.  This bug is
hopefully fixed in a newer kernel.  Thanks,

Alex

> crash log:
> 
> [134397.708857] BUG: unable to handle kernel NULL pointer dereference at 
> 0012
> [134397.717334] IP: [] iommu_disable_dev_iotlb+0x15/0x30
> [134397.724268] PGD 0
> [134397.726686] Oops:  [#1] SMP
> [134397.730335] kbox: Begin to handle event info
> [134397.734992] kbox: kbox: Enter into handle die  dump while current 
> state:Dump State Init
> [134397.751054]
> [134398.043275] kbox: End  handling event info
> [134398.047757] CPU 1
> [134398.049678] Modules linked in: mlx4_en(FX) mlx4_core(FX) compat(FX) 
> openvswitch crc32c libcrc32c gre nm_dev(FN) ip6table_filter ip6_tables 
> iptable_filter ip_tables ebtable_nat ebtables x_tables uvpdump i
> [134398.125560] Supported: No, Unsupported modules are loaded
> [134398.131338]
> [134398.133220] Pid: 183980, comm: qemu-kvm Tainted: GF  NX 
> 3.0.93-0.8-default #1 HUAWEI TECHNOLOGIES CO.,LTD. CH80TXSUA/CH80TXSUA
> [134398.146013] RIP: 0010:[]  [] iommu_disable_dev_iotlb+0x15/0x30
> [134398.155649] RSP: 0018:8817f0e13be0  EFLAGS: 00010202
> [134398.161342] RAX: 0002 RBX: 880bf34e6600 RCX: 
> 0007
> [134398.169158] RDX: 880bf34e6650 RSI: 0292 RDI: 
> 880bf2011000
> [134398.176970] RBP:  R08: dead00200200 R09: 
> dead00100100
> [134398.184793] R10: 8817f5d4c380 R11: 8128e350 R12: 
> 880bf34e6640
> [134398.192549] R13: 880bf5d3de80 R14: 880653e4a000 R15: 
> 880bf34d7858
> [134398.200360] FS:  7f6d17fff980() GS:880c3ee2() 
> knlGS:
> [134398.209109] CS:  0010 DS:  ES:  CR0: 8005003b
> [134398.215223] CR2: 0012 CR3: 00096ed95000 CR4: 
> 001427e0
> [134398.223000] DR0:  DR1:  DR2: 
> 
> [134398.230812] DR3:  DR6: 0ff0 DR7: 
> 0400
> [134398.238630] Process qemu-kvm (pid: 183980, threadinfo 8817f0e12000, 
> task 8817804ac300)
> [134398.247906] Stack:
> [134398.250260]  8128f756 ffed 880bf34d7840 
> 0292
> [134398.258383]  880bf34d7640 880653e4a090 ffed 
> 880653e4a000
> [134398.266509]  880653e4a000 880bee6c4878 812938c6 
> 880bebe17d60
> [134398.274659] Call Trace:
> [134398.277503]  [] domain_remove_one_dev_info+0x156/0x280
> [134398.284582]  [] intel_iommu_attach_device+0x156/0x170
> [134398.291583]  [] kvm_assign_device+0x73/0x150 [kvm]
> [134398.298360]  [] kvm_vm_ioctl_assign_device+0x247/0x3c0 [kvm]
> [134398.306303]  [] kvm_vm_ioctl_assigned_device+0x2fc/0x6a0 [kvm]
> [134398.314401]  [] kvm_vm_ioctl+0x101/0x300 [kvm]
> [134398.320789]  [] do_vfs_ioctl+0x8b/0x3b0
> [134398.326566]  [] sys_ioctl+0xa1/0xb0
> [134398.331996]  [] system_call_fastpath+0x16/0x1b
> [134398.338376]  [<7f6d15b77e57>] 0x7f6d15b77e56
> [134398.343373] Code: c6 05 97 dc de 00 01 eb a7 66 66 66 66 2e 0f 1f 84 00 
> 00 00 00 00 48 8b 7f 28 48 85 ff 74 12 48 8b 87 30 09 00 00 48 85 c0 74 06 40 
> 10 01 75 05 f3 c3 0f 1f 00 e9 ab 61 00 00 66 66
> [134398.364424] RIP  [] iommu_disable_dev_iotlb+0x15/0x30
> [134398.371441]  RSP
> [134398.375314] CR2: 0012
> 





[PATCH net-next RFC 1/3] virtio: support for urgent descriptors

2014-10-11 Thread Jason Wang
Below should be useful for some experiments Jason is doing.
I thought I'd send it out for early review/feedback.

The event idx feature allows us to defer interrupts until
a specific number of descriptors has been used.
Sometimes it might be useful to get an interrupt after
a specific descriptor, regardless.
This adds a descriptor flag for this, and an API
to create an urgent output descriptor.
This is still an RFC:
we'll need a feature bit for drivers to detect this,
but we've run out of feature bits for virtio 0.X.
For experimentation purposes, drivers can assume
this is set, or add a driver-specific feature bit.

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
 drivers/virtio/virtio_ring.c | 75 +---
 include/linux/virtio.h   | 14 
 include/uapi/linux/virtio_ring.h |  5 ++-
 3 files changed, 89 insertions(+), 5 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 4d08f45a..a5188c6 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -115,6 +115,7 @@ static inline struct scatterlist *sg_next_arr(struct 
scatterlist *sg,
 
 /* Set up an indirect table of descriptors and add it to the queue. */
 static inline int vring_add_indirect(struct vring_virtqueue *vq,
+bool urgent,
 struct scatterlist *sgs[],
 struct scatterlist *(*next)
   (struct scatterlist *, unsigned int *),
@@ -173,6 +174,8 @@ static inline int vring_add_indirect(struct vring_virtqueue 
*vq,
/* Use a single buffer which doesn't continue */
head = vq->free_head;
vq->vring.desc[head].flags = VRING_DESC_F_INDIRECT;
+   if (urgent)
+   vq->vring.desc[head].flags |= VRING_DESC_F_URGENT;
vq->vring.desc[head].addr = virt_to_phys(desc);
/* kmemleak gives a false positive, as it's hidden by virt_to_phys */
kmemleak_ignore(desc);
@@ -185,6 +188,7 @@ static inline int vring_add_indirect(struct vring_virtqueue 
*vq,
 }
 
 static inline int virtqueue_add(struct virtqueue *_vq,
+   bool urgent,
struct scatterlist *sgs[],
struct scatterlist *(*next)
  (struct scatterlist *, unsigned int *),
@@ -227,7 +231,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
/* If the host supports indirect descriptor tables, and we have multiple
 * buffers, then go indirect. FIXME: tune this threshold */
if (vq->indirect && total_sg > 1 && vq->vq.num_free) {
-   head = vring_add_indirect(vq, sgs, next, total_sg, total_out,
+   head = vring_add_indirect(vq, urgent, sgs, next, total_sg, 
total_out,
  total_in,
  out_sgs, in_sgs, gfp);
if (likely(head >= 0))
@@ -256,6 +260,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
for (n = 0; n < out_sgs; n++) {
for (sg = sgs[n]; sg; sg = next(sg, &total_out)) {
vq->vring.desc[i].flags = VRING_DESC_F_NEXT;
+   if (urgent) {
+   vq->vring.desc[head].flags |= 
VRING_DESC_F_URGENT;
+   urgent = false;
+   }
vq->vring.desc[i].addr = sg_phys(sg);
vq->vring.desc[i].len = sg->length;
prev = i;
@@ -265,6 +273,10 @@ static inline int virtqueue_add(struct virtqueue *_vq,
for (; n < (out_sgs + in_sgs); n++) {
for (sg = sgs[n]; sg; sg = next(sg, &total_in)) {
vq->vring.desc[i].flags = 
VRING_DESC_F_NEXT|VRING_DESC_F_WRITE;
+   if (urgent) {
+   vq->vring.desc[head].flags |= 
VRING_DESC_F_URGENT;
+   urgent = false;
+   }
vq->vring.desc[i].addr = sg_phys(sg);
vq->vring.desc[i].len = sg->length;
prev = i;
@@ -305,6 +317,8 @@ add_head:
 
 /**
  * virtqueue_add_sgs - expose buffers to other end
+ * @urgent: in case virtqueue_enable_cb_delayed was called, cause an interrupt
+ *  after this descriptor was completed
  * @vq: the struct virtqueue we're talking about.
  * @sgs: array of terminated scatterlists.
  * @out_num: the number of scatterlists readable by other side
@@ -337,7 +351,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
for (sg = sgs[i]; sg; sg = sg_next(sg))
total_in++;
}
-   return virtqueue_add(_vq, sgs, sg_next_chained,
+   return virtqueue_add(_vq, false, sgs, sg_next_chained,
 total_out, total_in, out_sgs, in_

[PATCH net-next RFC 2/3] vhost: support urgent descriptors

2014-10-11 Thread Jason Wang
This patch lets vhost-net support urgent descriptors. For the zerocopy case,
two new types of length were introduced to make it work.

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c   | 43 +++
 drivers/vhost/scsi.c  | 23 +++
 drivers/vhost/test.c  |  5 +++--
 drivers/vhost/vhost.c | 44 +---
 drivers/vhost/vhost.h | 19 +--
 5 files changed, 91 insertions(+), 43 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8dae2f7..37b0bb5 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -48,9 +48,13 @@ MODULE_PARM_DESC(experimental_zcopytx, "Enable Zero Copy TX;"
  * status internally; used for zerocopy tx only.
  */
 /* Lower device DMA failed */
-#define VHOST_DMA_FAILED_LEN   3
+#define VHOST_DMA_FAILED_LEN   5
+/* Lower device DMA done, urgent bit set */
+#define VHOST_DMA_DONE_LEN_URGENT  4
 /* Lower device DMA done */
-#define VHOST_DMA_DONE_LEN 2
+#define VHOST_DMA_DONE_LEN 3
+/* Lower device DMA in progress, urgent bit set */
+#define VHOST_DMA_URGENT   2
 /* Lower device DMA in progress */
 #define VHOST_DMA_IN_PROGRESS  1
 /* Buffer unused */
@@ -284,11 +288,13 @@ static void vhost_zerocopy_signal_used(struct vhost_net 
*net,
container_of(vq, struct vhost_net_virtqueue, vq);
int i, add;
int j = 0;
+   bool urgent = false;
 
for (i = nvq->done_idx; i != nvq->upend_idx; i = (i + 1) % UIO_MAXIOV) {
if (vq->heads[i].len == VHOST_DMA_FAILED_LEN)
vhost_net_tx_err(net);
if (VHOST_DMA_IS_DONE(vq->heads[i].len)) {
+   urgent = urgent || vq->heads[i].len == 
VHOST_DMA_DONE_LEN_URGENT;
vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
++j;
} else
@@ -296,7 +302,7 @@ static void vhost_zerocopy_signal_used(struct vhost_net 
*net,
}
while (j) {
add = min(UIO_MAXIOV - nvq->done_idx, j);
-   vhost_add_used_and_signal_n(vq->dev, vq,
+   vhost_add_used_and_signal_n(vq->dev, vq, urgent,
&vq->heads[nvq->done_idx], add);
nvq->done_idx = (nvq->done_idx + add) % UIO_MAXIOV;
j -= add;
@@ -311,9 +317,14 @@ static void vhost_zerocopy_callback(struct ubuf_info 
*ubuf, bool success)
 
rcu_read_lock_bh();
 
-   /* set len to mark this desc buffers done DMA */
-   vq->heads[ubuf->desc].len = success ?
-   VHOST_DMA_DONE_LEN : VHOST_DMA_FAILED_LEN;
+   if (success) {
+   if (vq->heads[ubuf->desc].len == VHOST_DMA_IN_PROGRESS)
+   vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN;
+   else
+   vq->heads[ubuf->desc].len = VHOST_DMA_DONE_LEN_URGENT;
+   } else {
+   vq->heads[ubuf->desc].len = VHOST_DMA_FAILED_LEN;
+   }
cnt = vhost_net_ubuf_put(ubufs);
 
/*
@@ -363,6 +374,7 @@ static void handle_tx(struct vhost_net *net)
zcopy = nvq->ubufs;
 
for (;;) {
+   bool urgent;
/* Release DMAs done buffers first */
if (zcopy)
vhost_zerocopy_signal_used(net, vq);
@@ -374,7 +386,7 @@ static void handle_tx(struct vhost_net *net)
  % UIO_MAXIOV == nvq->done_idx))
break;
 
-   head = vhost_get_vq_desc(vq, vq->iov,
+   head = vhost_get_vq_desc(vq, &urgent, vq->iov,
 ARRAY_SIZE(vq->iov),
 &out, &in,
 NULL, NULL);
@@ -417,7 +429,8 @@ static void handle_tx(struct vhost_net *net)
ubuf = nvq->ubuf_info + nvq->upend_idx;
 
vq->heads[nvq->upend_idx].id = head;
-   vq->heads[nvq->upend_idx].len = VHOST_DMA_IN_PROGRESS;
+   vq->heads[nvq->upend_idx].len = urgent ?
+   VHOST_DMA_URGENT : VHOST_DMA_IN_PROGRESS;
ubuf->callback = vhost_zerocopy_callback;
ubuf->ctx = nvq->ubufs;
ubuf->desc = nvq->upend_idx;
@@ -445,7 +458,7 @@ static void handle_tx(struct vhost_net *net)
pr_debug("Truncated TX packet: "
 " len %d != %zd\n", err, len);
if (!zcopy_used)
-   vhost_add_used_and_signal(&net->dev, vq, head, 0);
+   vhost_add_used_and_signal(&net->dev, vq, urgent, head, 
0);
else
vhost_zerocopy_signal_used(net, vq);
total_len += len;
@@ -488,6 +501,7 @@ static int peek_head_len(struct sock *sk)
  * returns number of buf

[PATCH net-next RFC 3/3] virtio-net: conditionally enable tx interrupt

2014-10-11 Thread Jason Wang
We used to free transmitted packets in ndo_start_xmit() to get better
performance. One side effect is that skb_orphan() needs to be called in
ndo_start_xmit(), which makes sk_wmem_alloc inaccurate. For the TCP
protocol, this means several optimizations, such as TCP small queues and
auto corking, cannot work well. This can lead to extra low throughput for
streams of small packets.

Thanks to the urgent descriptor support, this patch tries to solve the
issue by enabling the tx interrupt selectively for stream packets. This
means we don't need to orphan TCP stream packets in ndo_start_xmit() but
can instead enable the tx interrupt for those packets. After we get a tx
interrupt, a tx napi is scheduled to free those packets.

With this method, sk_wmem_alloc of a TCP socket is more accurate than
before, which lets TCP batch more through TSQ and auto corking.

Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 164 ---
 1 file changed, 128 insertions(+), 36 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 5810841..b450fc4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -72,6 +72,8 @@ struct send_queue {
 
/* Name of the send queue: output.$index */
char name[40];
+
+   struct napi_struct napi;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -217,15 +219,40 @@ static struct page *get_a_page(struct receive_queue *rq, 
gfp_t gfp_mask)
return p;
 }
 
+static int free_old_xmit_skbs(struct send_queue *sq, int budget)
+{
+   struct sk_buff *skb;
+   unsigned int len;
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
+   int sent = 0;
+
+   while (sent < budget &&
+  (skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
+   pr_debug("Sent skb %p\n", skb);
+
+   u64_stats_update_begin(&stats->tx_syncp);
+   stats->tx_bytes += skb->len;
+   stats->tx_packets++;
+   u64_stats_update_end(&stats->tx_syncp);
+
+   dev_kfree_skb_any(skb);
+   sent++;
+   }
+
+   return sent;
+}
+
 static void skb_xmit_done(struct virtqueue *vq)
 {
struct virtnet_info *vi = vq->vdev->priv;
+   struct send_queue *sq = &vi->sq[vq2txq(vq)];
 
-   /* Suppress further interrupts. */
-   virtqueue_disable_cb(vq);
-
-   /* We were probably waiting for more output buffers. */
-   netif_wake_subqueue(vi->dev, vq2txq(vq));
+   if (napi_schedule_prep(&sq->napi)) {
+   virtqueue_disable_cb(vq);
+   virtqueue_disable_cb_urgent(vq);
+   __napi_schedule(&sq->napi);
+   }
 }
 
 static unsigned int mergeable_ctx_to_buf_truesize(unsigned long mrg_ctx)
@@ -772,7 +799,38 @@ again:
return received;
 }
 
+static int virtnet_poll_tx(struct napi_struct *napi, int budget)
+{
+   struct send_queue *sq =
+   container_of(napi, struct send_queue, napi);
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct netdev_queue *txq = netdev_get_tx_queue(vi->dev, vq2txq(sq->vq));
+   unsigned int r, sent = 0;
+
+again:
+   __netif_tx_lock(txq, smp_processor_id());
+   sent += free_old_xmit_skbs(sq, budget - sent);
+
+   if (sent < budget) {
+   r = virtqueue_enable_cb_prepare_urgent(sq->vq);
+   napi_complete(napi);
+   __netif_tx_unlock(txq);
+   if (unlikely(virtqueue_poll(sq->vq, r)) &&
+   napi_schedule_prep(napi)) {
+   virtqueue_disable_cb_urgent(sq->vq);
+   __napi_schedule(napi);
+   goto again;
+   }
+   } else {
+   __netif_tx_unlock(txq);
+   }
+
+   netif_wake_subqueue(vi->dev, vq2txq(sq->vq));
+   return sent;
+}
+
 #ifdef CONFIG_NET_RX_BUSY_POLL
+
 /* must be called with local_bh_disable()d */
 static int virtnet_busy_poll(struct napi_struct *napi)
 {
@@ -820,31 +878,13 @@ static int virtnet_open(struct net_device *dev)
if (!try_fill_recv(&vi->rq[i], GFP_KERNEL))
schedule_delayed_work(&vi->refill, 0);
virtnet_napi_enable(&vi->rq[i]);
+   napi_enable(&vi->sq[i].napi);
}
 
return 0;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq)
-{
-   struct sk_buff *skb;
-   unsigned int len;
-   struct virtnet_info *vi = sq->vq->vdev->priv;
-   struct virtnet_stats *stats = this_cpu_ptr(vi->stats);
-
-   while ((skb = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   pr_debug("Sent skb %p\n", skb);
-
-   u64_stats_update_begin(&stats->tx_syncp);
-   stats->tx_bytes += skb->len;
-   stats->tx_packets++;
-   u64_stats_update_end(&stats->tx_syncp);
-
-   dev_kfree_skb_a

[PATCH net-next RFC 0/3] virtio-net: Conditionally enable tx interrupt

2014-10-11 Thread Jason Wang
Hello all:

We currently free old transmitted packets in ndo_start_xmit(), so any
packet must also be orphaned there. This was done to reduce the overhead
of tx interrupts and achieve better performance, but it may not work well
for some protocols such as TCP streams. TCP depends on the value of
sk_wmem_alloc to implement various optimizations for streams of small
packets, such as TCP small queues and auto corking. Orphaning packets
early in ndo_start_xmit() more or less disables such optimizations, since
sk_wmem_alloc is no longer accurate. This leads to extra low throughput
for TCP streams of small writes.

This series tries to solve the issue by enabling tx interrupts for all TCP
packets other than those with the PSH bit set or pure ACKs. This is done
through the support for urgent descriptors, which can force an interrupt
for a specified packet. If the tx interrupt is enabled for a packet, there
is no need to orphan it in ndo_start_xmit(); we can free it in the tx napi
scheduled by the tx interrupt. Then sk_wmem_alloc is more accurate than
before and TCP can batch more for small writes. TCP produces larger skbs
in this case, improving both throughput and cpu utilization.
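
Roughly, the transmit-path policy described above looks like the sketch
below (illustrative only: the wrapper and xmit_skb_urgent() are made-up
names standing in for the urgent-descriptor path added by patches 1-2,
xmit_skb() is the driver's existing helper, and error handling is
omitted):

static int virtnet_xmit_one(struct send_queue *sq, struct sk_buff *skb)
{
	if (virtnet_skb_needs_intr(skb)) {
		/* TCP payload without PSH: keep the skb charged to the
		 * socket (no skb_orphan()) and request a tx interrupt so
		 * the tx napi frees it later.
		 */
		return xmit_skb_urgent(sq, skb);
	}

	/* Pure ACKs, PSH packets and non-TCP traffic keep the old
	 * behaviour: orphan now, free lazily without an interrupt.
	 */
	skb_orphan(skb);
	return xmit_skb(sq, skb);
}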

Tests show great improvements for TCP streams of small writes. For most of
the other cases, throughput and cpu utilization are the same as in the
past. Only in a few cases was more cpu utilization noticed, which needs
more investigation.

Review and comments are welcome.

Thanks

Test result:

- Two Intel Corporation Xeon 5600s (8 cores) with back to back connected
  82599ES:
- netperf test between guest and remote host
- 1 queue 2 vcpus with zercopy enabled vhost_net
- both host and guest are net-next.git with the patches.
- A value in '[]' means an obvious difference (the significance is greater
  than 95%).
- The significance of the differences between the two averages is calculated
  using an unpaired T-test that takes into account the SD of the averages.

Guest RX
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
64/1/+3.7872%/+3.2307%/+0.5390%/
64/2/-0.2325%/+2.9552%/-3.0962%/
64/4/[-2.0296%]/+2.2955%/[-4.2280%]/
64/8/+0.0944%/[+2.2654%]/-2.4662%/
256/1/+1.1947%/-2.5462%/+3.8386%/
256/2/-1.6477%/+3.4421%/-4.9301%/
256/4/[-5.9526%]/[+6.8861%]/[-11.9951%]/
256/8/-3.6470%/-1.5887%/-2.0916%/
1024/1/-4.2225%/-1.3238%/-2.9376%/
1024/2/+0.3568%/+1.8439%/-1.4601%/
1024/4/-0.7065%/-0.0099%/-2.3483%/
1024/8/-1.8620%/-2.4774%/+0.6310%/
4096/1/+0.0115%/-0.3693%/+0.3823%/
4096/2/-0.0209%/+0.8730%/-0.8862%/
4096/4/+0.0729%/-7.0303%/+7.6403%/
4096/8/-2.3720%/+0.0507%/-2.4214%/
16384/1/+0.0222%/-1.8672%/+1.9254%/
16384/2/+0.0986%/+3.2968%/-3.0961%/
16384/4/-1.2059%/+7.4291%/-8.0379%/
16384/8/-1.4893%/+0.3403%/-1.8234%/
65535/1/-0.0445%/-1.4060%/+1.3808%/
65535/2/-0.0311%/+0.9610%/-0.9827%/
65535/4/-0.7015%/+0.3660%/-1.0637%/
65535/8/-3.1585%/+11.1302%/[-12.8576%]/

Guest TX
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
64/1/[+75.2622%]/[-14.3928%]/[+104.7283%]/
64/2/[+68.9596%]/[-12.6655%]/[+93.4625%]/
64/4/[+68.0126%]/[-12.7982%]/[+92.6710%]/
64/8/[+67.9870%]/[-12.6297%]/[+92.2703%]/
256/1/[+160.4177%]/[-26.9643%]/[+256.5624%]/
256/2/[+48.4357%]/[-24.3380%]/[+96.1825%]/
256/4/[+48.3663%]/[-24.1127%]/[+95.5087%]/
256/8/[+47.9722%]/[-24.2516%]/[+95.3469%]/
1024/1/[+54.4474%]/[-52.9223%]/[+228.0694%]/
1024/2/+0.0742%/[-12.7444%]/[+14.6908%]/
1024/4/[+0.5524%]/-0.0327%/+0.5853%/
1024/8/[-1.2783%]/[+6.2902%]/[-7.1206%]/
4096/1/+0.0778%/-13.1121%/+15.1804%/
4096/2/+0.0189%/[-11.3176%]/[+12.7832%]/
4096/4/+0.0218%/-1.0389%/+1.0718%/
4096/8/-1.3774%/[+12.7396%]/[-12.5218%]/
16384/1/+0.0136%/-2.5043%/+2.5826%/
16384/2/+0.0509%/[-15.3846%]/[+18.2420%]/
16384/4/-0.0163%/[-4.8808%]/[+5.1141%]/
16384/8/[-1.7249%]/[+13.9174%]/[-13.7313%]/
65535/1/+0.0686%/-5.4942%/+5.8862%/
65535/2/+0.0043%/[-7.5816%]/[+8.2082%]/
65535/4/+0.0080%/[-7.2993%]/[+7.8827%]/
65535/8/[-1.3669%]/[+16.6536%]/[-15.4479%]/

Guest TCP_RR
size/sessions/throughput-+%/cpu-+%/per cpu throughput -+%/
256/1/-0.2914%/+12.6457%/-11.4848%/
256/25/-0.5968%/-5.0531%/+4.6935%/
256/50/+0.0262%/+0.2079%/-0.1813%/
4096/1/+2.6965%/[+16.1248%]/[-11.5636%]/
4096/25/-0.5002%/+0.5449%/-1.0395%/
4096/50/[-2.0987%]/-0.0330%/[-2.0664%]/

Tests on mlx4 are ongoing; we will post the results next week.

Jason Wang (3):
  virtio: support for urgent descriptors
  vhost: support urgent descriptors
  virtio-net: conditionally enable tx interrupt

 drivers/net/virtio_net.c | 164 ++-
 drivers/vhost/net.c  |  43 +++---
 drivers/vhost/scsi.c |  23 --
 drivers/vhost/test.c |   5 +-
 drivers/vhost/vhost.c|  44 +++
 drivers/vhost/vhost.h|  19 +++--
 drivers/virtio/virtio_ring.c |  75 +-
 include/linux/virtio.h   |  14 
 include/uapi/linux/virtio_ring.h |   5 +-
 9 files changed, 308 insertions(+), 84 deletions(-)

-- 
1.8.3.1


[PATCH] qcow2: fix double-free of Qcow2DiscardRegion in qcow2_process_discards

2014-10-11 Thread Zhang Haoyu
In qcow2_update_snapshot_refcount -> qcow2_process_discards() -> bdrv_discard(),
the nested call may free the Qcow2DiscardRegion that is referenced by the "next"
pointer in qcow2_process_discards(); in the next iteration, d = next, so
g_free(d) will double-free this Qcow2DiscardRegion.

qcow2_snapshot_delete
|- qcow2_update_snapshot_refcount
|-- qcow2_process_discards
|--- bdrv_discard
| aio_poll
|- aio_dispatch
|-- bdrv_co_io_em_complete
|--- qemu_coroutine_enter(co->coroutine, NULL); <=== coroutine entry is 
bdrv_co_do_rw
|--- g_free(d) <== freeing the first Qcow2DiscardRegion is okay
|--- d = next;  <== this assignment is done in the QTAILQ_FOREACH_SAFE() macro
|--- g_free(d);  <== double-free will happen if, during the previous iteration,
bdrv_discard had already freed this object.

bdrv_co_do_rw
|- bdrv_co_do_writev
|-- bdrv_co_do_pwritev
|--- bdrv_aligned_pwritev
| qcow2_co_writev
|- qcow2_alloc_cluster_link_l2
|-- qcow2_free_any_clusters
|--- qcow2_free_clusters
| update_refcount
|- qcow2_process_discards
|-- g_free(d)  <== In the next iteration, this Qcow2DiscardRegion will be
double-freed.
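
To make the hazard explicit, this is the current loop, annotated (the
bdrv_discard() arguments are abbreviated here):

    QTAILQ_FOREACH_SAFE(d, &s->discards, next, next) {
        QTAILQ_REMOVE(&s->discards, d, next);

        /* May re-enter qcow2_process_discards() via the nested I/O path
         * shown above and g_free() the element already cached in "next". */
        bdrv_discard(bs->file, ...);

        g_free(d);   /* next iteration: d = next, a possibly freed pointer */
    }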

Signed-off-by: Zhang Haoyu 
Signed-off-by: Fu Xuewei 
---
 block/qcow2-refcount.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 2bcaaf9..3b759a3 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -462,9 +462,9 @@ fail_block:
 void qcow2_process_discards(BlockDriverState *bs, int ret)
 {
 BDRVQcowState *s = bs->opaque;
-Qcow2DiscardRegion *d, *next;
+Qcow2DiscardRegion *d;
 
-QTAILQ_FOREACH_SAFE(d, &s->discards, next, next) {
+while ((d = QTAILQ_FIRST(&s->discards)) != NULL) {
 QTAILQ_REMOVE(&s->discards, d, next);
 
 /* Discard is optional, ignore the return value */
-- 
1.7.12.4
