Re: aio poll, io_pgetevents and a new in-kernel poll API V3

2018-01-18 Thread Avi Kivity

On 01/18/2018 07:51 PM, Avi Kivity wrote:

On 01/18/2018 05:46 PM, Jeff Moyer wrote:

FYI, this kernel has issues.  It will boot up, but I don't have
networking, and even rebooting doesn't succeed.  I'm looking into it.


FWIW, I'm running an older version of this patchset on my desktop with 
no problems so far. 


(A Fedora 27)


Re: aio poll, io_pgetevents and a new in-kernel poll API V3

2018-01-18 Thread Avi Kivity

On 01/18/2018 05:46 PM, Jeff Moyer wrote:

FYI, this kernel has issues.  It will boot up, but I don't have
networking, and even rebooting doesn't succeed.  I'm looking into it.


FWIW, I'm running an older version of this patchset on my desktop with 
no problems so far.



-Jeff

Christoph Hellwig  writes:


Hi all,

this series adds support for the IOCB_CMD_POLL operation to poll for the
readiness of file descriptors using the aio subsystem.  The API is based
on patches that existed in RHAS2.1 and RHEL3, which means it already is
supported by libaio.  To implement the poll support efficiently new
methods to poll are introduced in struct file_operations:  get_poll_head
and poll_mask.  The first one returns a wait_queue_head to wait on
(lifetime is bound by the file), and the second does a non-blocking
check for the POLL* events.  This allows aio poll to work without
any additional context switches, unlike epoll.

To make the interface fully useful a new io_pgetevents system call is
added, which atomically saves and restores the signal mask over the
io_pgetevents system call.  It is the logical equivalent to pselect and
ppoll for io_pgetevents.

The corresponding libaio changes for io_pgetevents support and
documentation, as well as a test case will be posted in a separate
series.

The changes were sponsored by Scylladb, and improve performance
of the seastar framework up to 10%, while also removing the need
for a privileged SCHED_FIFO epoll listener thread.

The patches are on top of Al's __poll_t annotations, so I've also
prepared a git branch on top of those here:

 git://git.infradead.org/users/hch/vfs.git aio-poll.3

Gitweb:

 http://git.infradead.org/users/hch/vfs.git/shortlog/refs/heads/aio-poll.3

Libaio changes:

 https://pagure.io/libaio.git io-poll

Seastar changes (not updated for the new io_pgetevents ABI yet):

 https://github.com/avikivity/seastar/commits/aio

Changes since V2:
  - removed a double initialization
  - new vfs_get_poll_head helper
  - document that ->get_poll_head can return NULL
  - call ->poll_mask before sleeping
  - various ACKs
  - add conversion of random to ->poll_mask
  - add conversion of af_alg to ->poll_mask
  - lacking ->poll_mask support now returns -EINVAL for IOCB_CMD_POLL
  - reshuffled the series so that prep patches and everything not
requiring the new in-kernel poll API is in the beginning

Changes since V1:
  - handle the NULL ->poll case in vfs_poll
  - dropped the file argument to the ->poll_mask socket operation
  - replace the ->pre_poll socket operation with ->get_poll_head as
in the file operations

--
To unsubscribe, send a message with 'unsubscribe linux-aio' in
the body to majord...@kvack.org.  For more info on Linux AIO,
see: http://www.kvack.org/aio/
Don't email: a...@kvack.org





Re: [kvm-devel] [virtio-net][PATCH] Don't arm tx hrtimer with a constant 500us each transmit

2007-12-17 Thread Avi Kivity

Rusty Russell wrote:

On Wednesday 12 December 2007 23:54:00 Dor Laor wrote:
  

commit 763769621d271d92204ed27552d75448587c1ac0
Author: Dor Laor [EMAIL PROTECTED]
Date:   Wed Dec 12 14:52:00 2007 +0200

[virtio-net][PATCH] Don't arm tx hrtimer with a constant 500us each
transmit

The current start_xmit sets a 500us hrtimer to kick the host.
The problem is that if another xmit happens before the timer has
fired, the first xmit has to wait an additional 500us.
This patch does not re-arm the timer if one is already pending,
which shortens the tx latency.
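
To make the latency effect concrete, here is a minimal Python sketch of the two arming policies. It is a toy model, not the virtio-net code: the function `kick_time`, its `rearm` flag and the sample timestamps are all made up for illustration; only the 500us delay comes from the commit message.

```python
TIMER_US = 500  # timer delay taken from the commit message


def kick_time(xmit_times_us, rearm):
    """Return the time at which the host is kicked for a burst of xmits.

    xmit_times_us: sorted transmit timestamps in microseconds.
    rearm=True models the old behaviour (every xmit resets the timer);
    rearm=False models the patch (a pending timer is left alone).
    """
    deadline = None
    for t in xmit_times_us:
        if deadline is not None and t >= deadline:
            break  # the timer fired before this xmit arrived
        if deadline is None or rearm:
            deadline = t + TIMER_US
    return deadline


burst = [0, 400, 800]  # hypothetical xmit timestamps
print(kick_time(burst, rearm=True))   # 1300
print(kick_time(burst, rearm=False))  # 500
```

In this made-up burst the first packet waits 1300us when every xmit re-arms the timer, and 500us when a pending timer is left alone.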



Hi Dor!

Yes, I pondered this when I wrote the code.  On the one hand, it's a 
low-probability pathological corner case, on the other, your patch reduces 
the number of timer reprograms in the normal case.
  


One thing that came up in our discussions is to let the host do the 
timer processing instead of the guest.  When tx exit mitigation is 
enabled, the guest bumps the queue pointer, but carefully refrains from 
kicking the host.  The host polls the tx pointer using a timer, kicking 
itself periodically; if polling yields no packets it disables tx exit 
mitigation.  This saves the guest the bother of programming the timer, 
which presumably requires an exit if the timer is the closest one to 
expiration.
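
The host-side polling idea can be sketched as a small simulation. This is only an illustration under assumed names (`run_host_poller`, a list of tx pointer values seen on each timer tick); it is not taken from any kvm or virtio code.

```python
def run_host_poller(tx_pointer_per_tick):
    """Simulate host timer ticks; return (packets_consumed, ticks_polled).

    tx_pointer_per_tick: the guest's tx queue pointer as observed on each
    tick. Polling stops (mitigation is disabled) on the first tick that
    yields no new packets.
    """
    consumed = 0
    ticks = 0
    for tx_pointer in tx_pointer_per_tick:
        ticks += 1
        new_packets = tx_pointer - consumed
        if new_packets == 0:
            break  # idle: disable tx exit mitigation, guest must kick again
        consumed = tx_pointer
    return consumed, ticks


print(run_host_poller([3, 7, 7, 9]))  # (7, 3)
```

The guest never programs a timer here; it only advances the pointer, and the host stops polling once a tick finds nothing new.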


[btw, this can be implemented in virtqueue rather than virtio-net, no?]

--
Any sufficiently difficult bug is indistinguishable from a feature.

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-15 Thread Avi Kivity
Rusty Russell wrote:
 On Thu, 2007-04-12 at 06:32 +0300, Avi Kivity wrote:
   
 I hadn't considered an always-blocking (or unbuffered) networking API. 
 It's very counter to current APIs, but does make sense with things like
 syslets.  Without syslets, I don't think it's very useful as you need
 some artificial threads to keep things humming along.

 (How would userspace specify it? O_DIRECT when opening the tap?)
 

 TBH, I hadn't thought that far.  Tap already has those IFF_NO_PI etc
 flags, but it might make sense to just be the default.  From userspace's
 POV it's not a semantic change.

 OK, just tested: I can get 230,000 packets (28 byte UDP) through the tun
 device in a second (130,000 actually out the 100-base-T NIC, 100,000
 dropped).  If the tun driver's write() blocks until the skb is
 destroyed, it's 4,000 packets.

 So your intuition was right: skb_free latency on xmit (at least for this
 e1000) is far too large for anything but an async solution.

 Will ponder further.
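
As a rough reading of those numbers, the per-packet cost can be worked out directly. The sketch below assumes the blocking case keeps exactly one packet in flight, which the message does not state explicitly; the two rates are the ones quoted above.

```python
ASYNC_PPS = 230_000    # packets/s through tun, from the message
BLOCKING_PPS = 4_000   # packets/s when write() waits for skb destruction

per_packet_async_us = 1e6 / ASYNC_PPS
per_packet_blocking_us = 1e6 / BLOCKING_PPS
slowdown = ASYNC_PPS / BLOCKING_PPS

print(f"{per_packet_async_us:.2f} us per packet (async)")      # 4.35
print(f"{per_packet_blocking_us:.0f} us per packet (blocking)")  # 250
print(f"{slowdown:.1f}x slowdown")                              # 57.5
```

Under that assumption, waiting for the skb to be freed costs roughly 250us per packet.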
   

I think aio_write (but done copylessly) is the way to go.  Not only
is the infrastructure there, but the API already allows for multiple
packet submission and for batching completions.  Fitting into that
framework ought to be easier than starting yet another one.

It still misses scatter/gather and integration with fd-based
notification, but there are patches around for that.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-11 Thread Avi Kivity

Rusty Russell wrote:

On Wed, 2007-04-11 at 07:26 +0300, Avi Kivity wrote:
  

Nope.  Being async is critical for copyless networking:

- in the transmit path, we need to stop the sender (guest) from touching
the memory until it's on the wire.  This means 100% of packets sent will
be blocked.



Hi Avi,

You keep saying stuff like this, and I keep ignoring it.  OK, I'll
bite:

Why would we try to prevent the sender from altering the packets?

  


To avoid data corruption.

The guest wants to send a packet.  It calls write(), which causes an skb 
to be allocated, data to be copied into it, the entire networking stack 
gets into gear, and the guest-side driver instructs the device to send 
the packet.


With async operations, the saga continues like this: the host-side 
driver allocates an skb, get_page()s and attaches the data to the new 
skb, this skb crosses the bridge, trickles into the real ethernet 
device, gets queued there, sent, interrupts fire, triggering async 
completion.  On this completion, we send a virtual interrupt to the 
guest, which tells it to destroy the skb and reclaim the pages attached 
to it.


Without async operations, we don't have a hook to notify the guest when 
to reclaim the skb.  If we do it too soon, the skb can be reclaimed and 
the memory reused before the real device gets to see it, so we end up 
sending data that we did not intend.  The only way to avoid it is to 
copy the data somewhere safe, but that is exactly what we don't want to do.
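
The hazard described above can be shown with a toy model. This is a hedged sketch only: `ZeroCopyQueue`, `submit` and `transmit_all` are hypothetical names standing in for the host-side device queue, and a Python callback stands in for the virtual interrupt.

```python
class ZeroCopyQueue:
    """Toy stand-in for a device queue that holds references, not copies."""

    def __init__(self):
        self._pending = []  # (buffer, on_complete) pairs awaiting "the wire"

    def submit(self, buf, on_complete):
        # No copy: the queue keeps a reference to the caller's buffer.
        self._pending.append((buf, on_complete))

    def transmit_all(self):
        """Put queued buffers on the wire, then signal completion."""
        sent = []
        for buf, on_complete in self._pending:
            sent.append(bytes(buf))  # the device reads the data only now
            on_complete(buf)         # analogous to the virtual interrupt
        self._pending.clear()
        return sent


# Without a completion hook the sender guesses when to reuse the buffer.
q = ZeroCopyQueue()
buf = bytearray(b"packet-1")
q.submit(buf, on_complete=lambda b: None)
buf[:] = b"garbage!"          # reclaimed too early
early = q.transmit_all()

# With completion, the buffer is reused only after the callback fires.
q = ZeroCopyQueue()
buf = bytearray(b"packet-1")
reclaimed = []
q.submit(buf, on_complete=reclaimed.append)
late = q.transmit_all()
for b in reclaimed:
    b[:] = b"garbage!"        # safe: the data is already on the wire

print(early)  # [b'garbage!']
print(late)   # [b'packet-1']
```

The first run sends data that was never intended; the second reclaims the buffer only after completion, with no copy in either case.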



- multiple packets per operation (for interrupt mitigation) (like
lio_listio)



The benefits for interrupt mitigation are less clear to me in a virtual
environment (scheduling tends to make it happen anyway); I'd want to
benchmark it.

  


Yes, the guest will probably submit multiple packets in one hypercall.  
It would be nice for the userspace driver to be able to submit them to 
the host kernel in one syscall.



Some kind of batching to reduce syscall overhead, perhaps, but TSO would
go a fair way towards that anyway (probably not enough).

  


For some workloads, sure.



- scatter/gather packets (iovecs)



Yes, and this is already present in the tap device.  Anthony suggested a
slightly nasty hack for multiple sg packets in one writev()/readv, which
could also give us batching.

  


No need for hacks if we get list aio support one day.


- configurable wakeup (by packet count/timeout) for queue management



I'm not convinced that this is a showstopper, though.
  


It probably isn't.  It's free with aio though.

  

- hacks (tso)



I'd usually go for a batch interface over TSO, but if the card we're
sending to actually does TSO then TSO will probably win.
  


Sure, if tso helps a regular host then it should help one that happens 
to be running a virtual machine.


  

Most of these can be provided by a combination of the pending aio work,
the pending aio/fd integration, and the not-so-pending tap aio work.  As
the first two are available as patches and the third is limited to the
tap device, it is not unreasonable to try it out.  Maybe it will turn
out not to be as difficult as I predicted just a few lines above.



Indeed, I don't think we're asking for a revolution a-la VJ-style
channels.  But I'm still itching to get back to that, and this might yet
provide an excuse 8)
  


I'll be happy if this can be made to work.  It will make the paravirt 
guest-side driver work in kvm-less setups, which are useful for testing, 
and of course reduction in kernel code is beneficial.  It will be slower 
than in-kernel, but if we get the batching right, perhaps not 
significantly slower.  I'm mostly concerned that this depends on code 
that has eluded merging for such a long time.



--
error compiling committee.c: too many arguments to function



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-10 Thread Avi Kivity
Evgeniy Polyakov wrote:
 On Mon, Apr 09, 2007 at 04:38:18PM +0300, Avi Kivity ([EMAIL PROTECTED]) 
 wrote:
   
 But I don't get this "we can enhance the kernel but not userspace" vibe
 8(
  
   
 I've been waiting for network aio since ~2003.  If it arrives in the 
 next few days, I'm all for it; much more than kvm can use it 
 profitably.  But I'm not going to write that interface myself.
 

 Hmm, you missed at least two implementations of network aio in the 
 previous year, and now with syslets we can have third one.
   

I meant, network aio in the mainline kernel.  I am aware of the various
out-of-tree implementations.

 But it looks from this discussion, that it will not prevent from
 changing in-kernel driver - place a hook into skb allocation path and
 allocate data from opposing memory - get pages from another side and put
 them into fragments, then copy headers into skb-data.
   

I don't understand this (opposing memory, another side?).  Can you
elaborate?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-10 Thread Avi Kivity

Evgeniy Polyakov wrote:

But it looks from this discussion, that it will not prevent from
changing in-kernel driver - place a hook into skb allocation path and
allocate data from opposing memory - get pages from another side and put
them into fragments, then copy headers into skb-data.
  
  

I don't understand this (opposing memory, another side?).  Can you
elaborate?



You want to implement zero-copy network device between host and guest, if
I understood this thread correctly?
So, for sending part, device allocates pages from receiver's memory (or
from shared memory), receiver gets an 'interrupt' and got pages from own
memory, which are attached to new skb and transferred up to the network
stack.
It can be extended to use shared ring of pages.
  


This is what Xen does.  It is actually less performant than copying, IIRC.

The problem with flipping pages around is that physical addresses are 
cached both in the kvm mmu and in the on-chip tlbs, necessitating 
expensive page table walks and tlb invalidation IPIs.


Note that sending from the guest to an external host can be done 
copylessly, and for the receive side using a dma engine (like I/OAT) can 
reduce the cost of the copy.


--
error compiling committee.c: too many arguments to function



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-10 Thread Avi Kivity

Evgeniy Polyakov wrote:

This is what Xen does.  It is actually less performant than copying, IIRC.

The problem with flipping pages around is that physical addresses are 
cached both in the kvm mmu and in the on-chip tlbs, necessitating 
expensive page table walks and tlb invalidation IPIs.



Hmm, I'm not familiar with Xen driver, but similar technique was used
with zero-copy network sniffer some time ago, substituting userspace
pages with pages containing skb data was about 25-50% faster than
copying 1500 bytes in general, and in order of 10 times faster in some
cases.

Check a link please in case we are talking about different ideas:
http://marc.info/?l=linux-netdev&m=112262743505711&w=2

  


I don't really understand what you're testing there.  In particular, how 
can the copying time change so dramatically depending on whether you've 
just rebooted or not?




--
error compiling committee.c: too many arguments to function



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-10 Thread Avi Kivity

Evgeniy Polyakov wrote:

On Tue, Apr 10, 2007 at 03:17:45PM +0300, Avi Kivity ([EMAIL PROTECTED]) wrote:
  

Check a link please in case we are talking about different ideas:
http://marc.info/?l=linux-netdev&m=112262743505711&w=2

 
  
I don't really understand what you're testing there.  In particular, how 
can the copying time change so dramatically depending on whether you've 
just rebooted or not?

 
I tested page remapping time - i.e. time to replace a page in two
different mappings - the same should be performed in host and guest
kernels if such design is going to be used for communication.

I can only explain after-reboot slow copy with empty caches - arbitrary
kernel pages were copied into buffer (not the same data as in posted
code).
  


Doing this in kvm would be significantly more complex, as we'd need to 
use full reverse mapping to locate all guest mappings (we already 
reverse map writable pages for other reasons), so the 25-50% difference 
might be nullified or even turn into overhead.


Here are the Xen numbers for reference.  Xen probably has more overhead 
than kvm for such things, though, as it needs to do hypercalls from dom0 
which is in-kernel for kvm.


http://lists.xensource.com/archives/html/xen-devel/2007-03/msg01218.html

--
error compiling committee.c: too many arguments to function



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-10 Thread Avi Kivity
Rusty Russell wrote:
 On Mon, 2007-04-09 at 16:38 +0300, Avi Kivity wrote:
   
 Moreover, some things just don't lend themselves to a userspace 
 abstraction.  If we want to expose tso (tcp segmentation offload), we 
 can easily do so with a kernel driver since the kernel interfaces are 
 all tso aware.  Tacking on tso awareness to tun/tap is doable, but at 
 the very least weird.
 

 It is kinda weird, yes, but it certainly makes sense.  All the arguments
 for tso apply in triplicate to userspace packet sends...

   

Well, write() with a large buffer is a sort of tso device.  The problem
is tso breaks through several layers (like I'm advocating in the other
thread :), pushing tcp functionality into ethernet.  Well, we've seen worse.


 We're dealing with the tun/tap device here, not a socket.
   
 Hmm.  tun actually has aio_write implemented, but it seems synchronous.  
 So does the read path.

 If these are made truly asynchronous, and the write path is made in 
 addition copyless, then we might have something workable.  I still 
 cringe at having a pagetable walk in order to deliver a 1500-byte packet.
 

 Right, now we're talking!

 However, it's not clear to me why creating an skb which references a kvm
 guest's memory doesn't need a pagetable walk, but a packet in (other)
 userspace memory does?
   

Currently guest pages are stashed in a kernel array, as well as being
mmap()ed into user space.

That's not a very strong argument though, as I'd like to be able to map
userspace memory into the guest, or map address_spaces to the guest, or
something, so accessing guest physical memory will become more expensive
in time.

 My conviction which started this discussion is that if we can offer an
 efficient interface for kvm, we should be able to offer an efficient
 interface for any (other) userspace.
   

Fully agreed.  It's mostly a question of who and when.  Designing and
implementing this interface is going to be difficult, require deep
knowledge of Linux networking, and consume a lot of time.

 As to async, I'm not *so* worried about that for the moment, although it
 would probably be nicer to fail than to block.  Otherwise we could
 simply set an skb destructor to wake us up.
   

Nope.  Being async is critical for copyless networking:

- in the transmit path, we need to stop the sender (guest) from touching
the memory until it's on the wire.  This means 100% of packets sent will
be blocked.
- in the receive path, you could separate receive notification from the
single copy that must be done (like poll() + read()), but to make use of
dma engines you need to provide the end address beforehand.

 I think the first step is to see how much worse a decent userspace net
 driver is compared with the current in-kernel one.
   

A userspace net interface needs to provide the following:

- true async operations
- multiple packets per operation (for interrupt mitigation) (like
lio_listio)
- scatter/gather packets (iovecs)
- configurable wakeup (by packet count/timeout) for queue management
- hacks (tso)

Most of these can be provided by a combination of the pending aio work,
the pending aio/fd integration, and the not-so-pending tap aio work.  As
the first two are available as patches and the third is limited to the
tap device, it is not unreasonable to try it out.  Maybe it will turn
out not to be as difficult as I predicted just a few lines above.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-09 Thread Avi Kivity

Rusty Russell wrote:

On Sun, 2007-04-08 at 08:36 +0300, Avi Kivity wrote:
  

Rusty Russell wrote:


Hi Avi,

I don't think you've thought about this very hard.  The receive copy is
completely independent of whether the packet is going to the guest via
a kernel driver or via userspace, so not relevant.
  
  
A packet received in the kernel cannot be made available to userspace in 
a safe manner without a copy, as it will not be aligned with page 
boundaries, so userspace cannot examine the packet until after one copy 
has occurred.



Hi Avi!

I'm a little puzzled by your response.  Hmm...

lguest's userspace network frontend does exactly as many copies as
Ingo's in-host-kernel code.  One from the Guest, one to the Guest.

  


kvm pvnet is suboptimal now.  The number of copies could be reduced by 
two (to zero), by constructing an skb that points to guest memory.  
Right now, this can only be done in-kernel.


With current userspace networking interfaces, one cannot build a network 
device that has less than one copy on transmit, because sendmsg() *must* 
copy the data (as there is no completion notification).  sendfilev(), 
even if it existed, cannot be used: it is copyless, but lacks completion 
notification.  It is useful only on unchanging data like read-only files.


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-09 Thread Avi Kivity

Rusty Russell wrote:

On Mon, 2007-04-09 at 10:10 +0300, Avi Kivity wrote:
  

Rusty Russell wrote:


I'm a little puzzled by your response.  Hmm...

lguest's userspace network frontend does exactly as many copies as
Ingo's in-host-kernel code.  One from the Guest, one to the Guest.
  
kvm pvnet is suboptimal now.  The number of copies could be reduced by 
two (to zero), by constructing an skb that points to guest memory.  
Right now, this can only be done in-kernel.



Sorry, you lost me here.  You mean both input and output copies can be
eliminated?  Or are you talking about another two copies somewhere?
  


On the transmit path, current kvm pvnet has two copies:

1.  on the guest side, the driver copies the skb data into the shared ring
2. on the host side, the device copies the data from the ring into a 
newly allocated skb


Both of these copies can be eliminated with a host-side kernel.  With 
current userspace interfaces, only one copy can be eliminated.


Similar logic applies to receive, except that one copy must remain.


But I don't get this "we can enhance the kernel but not userspace" vibe
8(
  


I've been waiting for network aio since ~2003.  If it arrives in the 
next few days, I'm all for it; much more than kvm can use it 
profitably.  But I'm not going to write that interface myself.


Moreover, some things just don't lend themselves to a userspace 
abstraction.  If we want to expose tso (tcp segmentation offload), we 
can easily do so with a kernel driver since the kernel interfaces are 
all tso aware.  Tacking on tso awareness to tun/tap is doable, but at 
the very least weird.


  
With current userspace networking interfaces, one cannot build a network 
device that has less than one copy on transmit, because sendmsg() *must* 
copy the data (as there is no completion notification).



Why are you talking about sendmsg()?  Perhaps this is where we're
getting tangled up.

We're dealing with the tun/tap device here, not a socket.

  


Hmm.  tun actually has aio_write implemented, but it seems synchronous.  
So does the read path.


If these are made truly asynchronous, and the write path is made in 
addition copyless, then we might have something workable.  I still 
cringe at having a pagetable walk in order to deliver a 1500-byte packet.



 sendfilev(), 
even if it existed, cannot be used: it is copyless, but lacks completion 
notification.  It is useful only on unchanging data like read-only files.



Again, sendfile is a *much* harder problem than sending a single packet
once, which is the question here.
  


sendfile() is a *different* problem.  It doesn't need completion because 
the data is assumed not to change under it.


Consider that the guest may be issuing a megabyte-sized sendfile() which 
is broken into 17 tso frames.  We need to preserve the large structures 
as much as possible or we end up repeating the simple "single packet 
once" path 700 times.
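
The two figures can be reproduced with simple arithmetic. The sketch below rests on assumptions not spelled out in the message: "megabyte-sized" is read as 1 MiB, one TSO frame is capped at 65535 bytes, a packet carries 1500 bytes, and header overhead is ignored.

```python
SENDFILE_BYTES = 1 << 20  # "megabyte-sized", read as 1 MiB (assumption)
MAX_TSO_FRAME = 65535     # assumed upper bound on one TSO frame
PACKET_BYTES = 1500       # the 1500-byte packet mentioned in the thread


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


tso_frames = ceil_div(SENDFILE_BYTES, MAX_TSO_FRAME)
packets = ceil_div(SENDFILE_BYTES, PACKET_BYTES)

print(tso_frames)  # 17
print(packets)     # 700
```

Under these assumptions the per-packet path runs about 40 times more often than the per-frame path.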


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-07 Thread Avi Kivity

Rusty Russell wrote:

On Thu, 2007-04-05 at 10:17 +0300, Avi Kivity wrote:
  

Rusty Russell wrote:


You didn't quote Anthony's point about it's more about there not being
good enough userspace interfaces to do network IO.

It's easier to write a kernel-space network driver, but it's not
obviously the right thing to do until we can show that an efficient
packet-level userspace interface isn't possible.  I don't think that's
been done, and it would be interesting to try.
  
  
In the case of networking, the copyful interfaces on receive are driven 
by the hardware not knowing how to split the header from the data.  On 
transmit I agree, it could be made copyless from userspace (something 
like sendfilev, only not file oriented).



Hi Avi,

I don't think you've thought about this very hard.  The receive copy is
completely independent of whether the packet is going to the guest via
a kernel driver or via userspace, so not relevant.
  


A packet received in the kernel cannot be made available to userspace in 
a safe manner without a copy, as it will not be aligned with page 
boundaries, so userspace cannot examine the packet until after one copy 
has occurred.  After userspace has determined what to do with the packet, 
another copy must take place to get it there.


There's a counterexample, mmapped sockets, but that works only when all 
packets arriving on a card are exposed to the same process.  This is 
useful for tcpdump or for what you outline below but is hardly generic.



And if all packets from the card are going to the guest, you can
deliver directly.  Userspace or kernel, no difference.
  


That is not the common case.  Nor is it true when there is a mismatch 
between the card's capabilities and guest expectations and constraints.  
For example, guest memory is not physically contiguous so a NIC that 
won't do scatter/gather will require bouncing (or an iommu, but that's 
not here yet).



And we have a sendfilev not file oriented: it's called writev 8)
  


writev() cannot be made copyless for networking.  One needs an async 
interface so the kernel can complete the write after the NIC acks the 
dma transfer, or a kernel driver.



An in-kernel driver can avoid system call overhead and page references.
But a better tap device helps more than just KVM.
  


I'll believe it when I see it.

--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-05 Thread Avi Kivity

Ingo Molnar wrote:
so right now the only option for a clean codebase is the KVM in-kernel 
code.


I strongly disagree with this.  Bad code in userspace is not an excuse 
for shoving stuff into the kernel, where maintaining it is much more 
expensive, and the cause of a mistake can be system crashes and data 
loss, affecting unrelated processes.  If we move something into the 
kernel, we'd better have a really good reason for it.


Qemu code _is_ crufty.  We can do one of three things:
1. live with it
2. fork it and clean it up
3. clean it up incrementally and merge it upstream

Currently we're doing (1).  You're suggesting a variant of (2), fork 
plus move into the kernel.  The right thing to do IMO is (3), but I 
don't see anybody volunteering.  Qemu picked up additional committers 
recently and I believe they would be receptive to cleanups.


[In the *pic/pit case, we have other reasons to push things into the 
kernel.  But "this code is crap, let's rewrite it in the kernel" is not 
a justification I'll accept.  I'd be much happier if we could quantify 
these other reasons.]



--
error compiling committee.c: too many arguments to function



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-05 Thread Avi Kivity

Ingo Molnar wrote:

* Avi Kivity [EMAIL PROTECTED] wrote:

  
so right now the only option for a clean codebase is the KVM 
in-kernel code.
  

I strongly disagree with this.



are you disagreeing with my statement that the KVM kernel-side code is 
the only clean codebase here? To me this is a clear fact :)
  


No, I agree with that.  I just disagree with choosing to put the *pic 
code (or other code) into the kernel on *that* basis.  The selection 
should be on design/performance issues alone, *not* the state of 
existing code.


I only pointed out that the only clean codebase at the moment is the KVM 
in-kernel code - i did not make the argument (at all) that every new 
piece of KVM code should be done in the kernel. That would be stupid - 
do you think i'd advocate for example moving command line argument 
parsing into the kernel?
  


No.  But the difference in cruftiness between kvm and qemu code should 
not enter into the discussion of where to do things.


and as i said in the mail: the kernel _is_ the best place to do this 
particular stuff.
  


I agree with this, maybe for different reasons.


--
error compiling committee.c: too many arguments to function



Re: [kvm-devel] QEMU PIC indirection patch for in-kernel APIC work

2007-04-05 Thread Avi Kivity

Rusty Russell wrote:

You didn't quote Anthony's point about it's more about there not being
good enough userspace interfaces to do network IO.

It's easier to write a kernel-space network driver, but it's not
obviously the right thing to do until we can show that an efficient
packet-level userspace interface isn't possible.  I don't think that's
been done, and it would be interesting to try.
  


In the case of networking, the copyful interfaces on receive are driven 
by the hardware not knowing how to split the header from the data.  On 
transmit I agree, it could be made copyless from userspace (something 
like sendfilev, only not file oriented).


--
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.



Re: [3/4] kevent: AIO, aio_sendfile() implementation.

2006-07-26 Thread Avi Kivity

David Miller wrote:


From: Christoph Hellwig [EMAIL PROTECTED]
Date: Wed, 26 Jul 2006 11:04:31 +0100

 And to be honest, I don't think adding all this code is acceptable
 if it can't replace the existing aio code while keeping the
 interface.  So while your interface looks pretty sane the
 implementation needs a lot of work still :)

Networking and disk AIO have significantly different needs.

Surely, there needs to be a unified polling interface to support single 
threaded designs.
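
To illustrate what a unified polling interface buys a single-threaded design, here is a minimal sketch in C. It assumes that both network readiness and disk completion can be exposed as pollable file descriptors, which is precisely the point under debate in this thread. The names wait_either, net_fd and disk_fd are made up for illustration and do not come from the kevent patches.

```c
#include <assert.h>
#include <poll.h>

/* Bits returned by wait_either() (hypothetical names). */
#define READY_NET  1
#define READY_DISK 2

/*
 * Block in exactly one place until either source is readable.
 * Returns a bitmask of READY_* flags, 0 on timeout, -1 on error.
 * Assumption: disk completions are signalled on a file descriptor.
 */
static int wait_either(int net_fd, int disk_fd, int timeout_ms)
{
    struct pollfd fds[2] = {
        { .fd = net_fd,  .events = POLLIN },
        { .fd = disk_fd, .events = POLLIN },
    };
    int ready = 0;

    assert(net_fd >= 0 && disk_fd >= 0);

    if (poll(fds, 2, timeout_ms) < 0)
        return -1;

    if (fds[0].revents & POLLIN)
        ready |= READY_NET;
    if (fds[1].revents & POLLIN)
        ready |= READY_DISK;
    return ready;
}
```

With two separate wait primitives, one thread cannot sleep on both at once; it needs either a helper thread or a busy loop. A single call like the one above avoids both.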


--
error compiling committee.c: too many arguments to function



Re: possible recursive locking in ATM layer

2006-07-05 Thread Avi Kivity

Arjan van de Ven wrote:


From: Arjan van de Ven [EMAIL PROTECTED]

 Linux version 2.6.17-git22 ([EMAIL PROTECTED]) (gcc version 4.0.3 
(Ubuntu 4.0.3-1ubuntu5)) #20 PREEMPT Tue Jul 4 10:35:04 CEST 2006



 [ 2381.598609] =
 [ 2381.619314] [ INFO: possible recursive locking detected ]
 [ 2381.635497] -
 [ 2381.651706] atmarpd/2696 is trying to acquire lock:
 [ 2381.666354]  (skb_queue_lock_key){-+..}, at: [c028c540] 
skb_migrate+0x24/0x6c

 [ 2381.688848]


ok this is a real potential deadlock in a way, it takes two locks of 2
skbuffs without doing any kind of lock ordering; I think the following
patch should fix it. Just sort the lock taking order by address of the
skb.. it's not pretty but it's the best this can do in a minimally
invasive way.



Isn't it a deadlock only if skb_migrate(a, b) and skb_migrate(b, a) can 
be called concurrently?
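
For reference, the address-ordering idea from the quoted mail can be sketched in userspace C with pthread mutexes. This is not the proposed patch (which is not reproduced here); struct queue, lock_pair and migrate are hypothetical stand-ins for the skb queue and skb_migrate. The ordering only matters in the case asked about above, when migrate(a, b) and migrate(b, a) can run at the same time.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a queue that carries its own lock. */
struct queue {
    pthread_mutex_t lock;
    int len;                /* number of queued items */
};

/*
 * Take both locks in ascending address order. Two threads calling
 * migrate(a, b) and migrate(b, a) then agree on which lock comes
 * first, so neither can hold one lock while waiting for the other.
 */
static void lock_pair(struct queue *x, struct queue *y)
{
    assert(x != y);
    if ((uintptr_t)x > (uintptr_t)y) {
        struct queue *tmp = x;
        x = y;
        y = tmp;
    }
    pthread_mutex_lock(&x->lock);
    pthread_mutex_lock(&y->lock);
}

static void unlock_pair(struct queue *x, struct queue *y)
{
    /* Unlock order does not affect deadlock freedom. */
    pthread_mutex_unlock(&x->lock);
    pthread_mutex_unlock(&y->lock);
}

/* Move everything from 'from' to 'to' (modelled as a counter). */
static void migrate(struct queue *from, struct queue *to)
{
    lock_pair(from, to);
    to->len += from->len;
    from->len = 0;
    unlock_pair(from, to);
}
```

If the two call directions can never overlap, the sort is unnecessary for correctness and only serves to keep the lock validator quiet.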


--
error compiling committee.c: too many arguments to function



Re: Van Jacobson's net channels and real-time

2006-04-23 Thread Avi Kivity

Ingo Oeser wrote:

Hi Jörn,

On Saturday, 22. April 2006 13:48, Jörn Engel wrote:
  

Unless I completely misunderstand something, one of the main points of
the netchannels is to have *zero* fields written to by both producer
and consumer. 



Hmm, for me the main point was to keep the complete processing
of a single packet within one CPU/Core where this is a non-issue.
  
But the interrupt for a packet can be received by cpu 0 whereas the rest 
of processing proceeds on cpu 1; so it still helps to keep the producer 
index and consumer index on separate cachelines.
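
A minimal sketch of that layout in C11, as a single-producer/single-consumer ring. It is not taken from the net channel code; struct ring, ring_put, ring_get and the 64-byte line size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHELINE 64    /* assumed cache line size; hardware varies */
#define RING_SIZE 8     /* must be a power of two */

struct ring {
    alignas(CACHELINE) atomic_uint head;    /* written only by the producer */
    alignas(CACHELINE) atomic_uint tail;    /* written only by the consumer */
    alignas(CACHELINE) int slot[RING_SIZE];
};

/* The two indices must not share a cache line. */
static_assert(offsetof(struct ring, tail) - offsetof(struct ring, head)
              >= CACHELINE, "head and tail share a cache line");

/* Producer side: returns 1 on success, 0 if the ring is full. */
static int ring_put(struct ring *r, int v)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return 0;
    r->slot[head % RING_SIZE] = v;
    /* Publish the slot before the new head becomes visible. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer side: returns 1 and stores the value, or 0 if empty. */
static int ring_get(struct ring *r, int *v)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (head == tail)
        return 0;
    *v = r->slot[tail % RING_SIZE];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```

Each side still reads the other side's index, but it writes only its own. If the interrupt lands on cpu 0 and the consumer runs on cpu 1, a store to head does not invalidate the line holding tail.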


--
error compiling committee.c: too many arguments to function



Re: [RFC] [PATCH 0/3] ioat: DMA engine support

2005-11-24 Thread Avi Kivity

Andi Kleen wrote:

Don't forget that there are benefits of not polluting the cache with the 
traffic for the incoming skbs.
   



Is that a general benefit outside benchmarks? I would expect
most real programs to actually do something with the data
- and that usually involves needing it in cache.

 

As an example, an NFS server reads some data pages using iSCSI and sends 
them using NFS/TCP (or vice versa).


In the I/O AT case it might make sense to do a few prefetch()es of the 
userland data on the return-to-userspace code path.  
   



Some prefetches for user space might be a good idea yes

 

As long as they can be turned off. Not all userspace applications want to 
touch the data immediately.
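
A rough userspace sketch of an optional prefetch pass in C. It uses the GCC/Clang builtin __builtin_prefetch as a stand-in for the kernel's prefetch(). The function name, the enabled flag and the 64-byte line size are assumptions for illustration; no such knob is defined anywhere in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 64    /* assumed cache line size; hardware varies */

/*
 * Hint the CPU to pull a just-written buffer into cache, one line at
 * a time, unless the caller has opted out. Returns the number of
 * prefetch hints issued. A data mover that never touches the payload
 * would pass enabled = 0 and keep its cache clean.
 */
static size_t prefetch_user_buffer(const void *buf, size_t len, int enabled)
{
    const char *p = buf;
    size_t hints = 0;

    assert(buf != NULL || len == 0);

    if (!enabled)
        return 0;

    for (size_t off = 0; off < len; off += CACHELINE) {
        /* rw = 0 (read), locality = 3 (keep in all cache levels) */
        __builtin_prefetch(p + off, 0, 3);
        hints++;
    }
    return hints;
}
```

The prefetch is only a hint, so the opt-out costs nothing in correctness; it only decides whether the payload competes for cache space.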






Re: [RFC] [PATCH 0/3] ioat: DMA engine support

2005-11-24 Thread Avi Kivity

Andi Kleen wrote:

 

As an example, an NFS server reads some data pages using iSCSI and sends 
them using NFS/TCP (or vice versa).
   



For TX this can be done zero copy using a sendfile like setup.
 


Yes, or with aio send for anonymous memory.


For RX it may help - but my point was that most applications
are not structured in this simple way.

 

Agreed. But those that do care, care very much. The data mover 
applications, simply because they don't touch the data, expect very high 
bandwidth.


As long as they can be turned off. Not all userspace applications want to 
touch the data immediately.
   



Perhaps.  And lots of others might. Of course the simple
network benchmarks don't so the numbers on them look good.

 


There are very real non-benchmark applications that want this.


Just pointing out that it's not clear it will always be a big help.

 


Agree it should default to in-cache.
