Re: [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-03-13 Thread Paolo Bonzini


On 13/03/2015 15:46, Jason Baron wrote:
> > The throttling algorithm computes a duration for the next IO, which is
> > used to arm a timer in order to delay the request a bit. As timeouts are
> > always rounded *UP* to the effective granularity, epoll_pwait's 1 ms
> > resolution is just too coarse and will lead to severe inaccuracy. With
> > epoll_pwait1, we can avoid the rounding-up.
> 
> Right, but we could use a timerfd here to get the desired precision.

Fam, didn't you see slowdowns with a small number of file descriptors when
comparing epoll_ctl+epoll_wait+timerfd to ppoll?

Do they disappear or improve with epoll_ctl_batch and epoll_pwait1?

Paolo


Re: [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-03-13 Thread Jason Baron

On 03/13/2015 07:31 AM, Fam Zheng wrote:
> On Thu, 03/12 11:02, Jason Baron wrote:
>> On 03/09/2015 09:49 PM, Fam Zheng wrote:
>>
>> Hi,
>>
>> So it sounds like you are comparing the original QEMU code (which was using
>> ppoll) vs. using epoll with these new syscalls. I'm curious whether you have
>> numbers comparing the existing epoll (with, say, a timerfd in your epoll
>> set), so we can see the improvement relative to epoll.
> I did compare them, but the results are too close to show a difference. The
> improvements in epoll_pwait1 don't really help the hot path of guest IO, but
> they do affect the precision of program timers, which are used by various
> device emulations in QEMU.
>
> Although it's kind of subtle and difficult to summarize here, I can give an
> example from the IO throttling implementation in QEMU to show the
> significance:
>
> The throttling algorithm computes a duration for the next IO, which is used
> to arm a timer in order to delay the request a bit. As timeouts are always
> rounded *UP* to the effective granularity, epoll_pwait's 1 ms resolution is
> just too coarse and will lead to severe inaccuracy. With epoll_pwait1, we
> can avoid the rounding-up.

Right, but we could use a timerfd here to get the desired precision.
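
For instance, a minimal (untested) sketch of that approach -- the helper
name is made up, and the timerfd is assumed to have been created with
timerfd_create(CLOCK_MONOTONIC, 0) and already added to epfd with EPOLLIN:

#include <sys/epoll.h>
#include <sys/timerfd.h>

/* Arm the pre-registered timerfd to fire after timeout_ns nanoseconds,
 * then block in epoll_wait() with an infinite timeout.  The fine-grained
 * wakeup comes from the timerfd, not from epoll's millisecond timeout
 * argument.  (A zero it_value would disarm the timer, so callers must
 * pass timeout_ns > 0.) */
int wait_with_ns_timeout(int epfd, int tfd, long timeout_ns,
                         struct epoll_event *events, int maxevents)
{
    struct itimerspec its = {
        .it_value = { .tv_sec  = timeout_ns / 1000000000L,
                      .tv_nsec = timeout_ns % 1000000000L },
    };
    if (timerfd_settime(tfd, 0, &its, NULL) < 0)
        return -1;
    return epoll_wait(epfd, events, maxevents, -1);
}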

> I think this capability could be pretty generally desirable for other
> applications, too.
>
> Regarding the epoll_ctl_batch improvement, again, it is not going to change
> the numbers in the small workloads I managed to test.
>
> Of course, if you have a specific application scenario in mind, I will try
> it. :)

I want to understand what new functionality these syscalls offer over what we
have now. We could show a micro-benchmark where these matter, but is that
enough to justify new syscalls, given that I think we could implement library
wrappers around what we have now to do what you are proposing here? A sketch
of such a wrapper follows.
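
A minimal, untested sketch of such a wrapper, assuming the epoll_ctl_cmd
layout from your cover letter (the wrapper name is hypothetical):

#include <stdint.h>
#include <sys/epoll.h>

/* Same layout as the epoll_ctl_cmd structure documented in the cover
 * letter; a userspace wrapper would have to define it itself. */
struct epoll_ctl_cmd {
    int      flags;
    int      op;
    int      fd;
    uint32_t events;
    uint64_t data;
    int      result;
};

/* Emulate epoll_ctl_batch() with one epoll_ctl() call per command,
 * recording each command's return code in its result field.  Returns
 * the number of commands that succeeded. */
static int emulated_epoll_ctl_batch(int epfd, int ncmds,
                                    struct epoll_ctl_cmd *cmds)
{
    int i, ok = 0;
    for (i = 0; i < ncmds; i++) {
        struct epoll_event ev = {
            .events = cmds[i].events,
            .data   = { .u64 = cmds[i].data },
        };
        cmds[i].result = epoll_ctl(epfd, cmds[i].op, cmds[i].fd, &ev);
        if (cmds[i].result == 0)
            ok++;
    }
    return ok;
}

Functionally that covers the batch case; what the syscall would save is
one user/kernel transition per command.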

Thanks,

-Jason


Re: [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-03-13 Thread Fam Zheng
On Thu, 03/12 11:02, Jason Baron wrote:
> On 03/09/2015 09:49 PM, Fam Zheng wrote:
> >
> > Benchmark for epoll_pwait1
> > ==========================
> >
> > By running fio tests inside a VM with both the original and the modified
> > QEMU, we can compare the difference in performance.
> >
> > With a small VM setup [t1], the original QEMU (ppoll based) has a 4k read
> > latency overhead of around 37 us. In this setup, the main loop polls
> > 10~20 fds.
> >
> > With a slightly larger VM instance [t2], where a virtio-serial device is
> > attached so that there are 80~90 fds in the main loop, the original QEMU
> > has a latency overhead of around 49 us. By adding more such devices [t3],
> > we can see the latency go even higher - 83 us with ~200 fds.
> >
> > After modifying QEMU to use epoll_pwait1 and testing again, the latency
> > numbers are respectively 36 us, 37 us and 47 us for t1, t2 and t3.
> >
> >
> 
> Hi,
> 
> So it sounds like you are comparing the original QEMU code (which was using
> ppoll) vs. using epoll with these new syscalls. I'm curious whether you have
> numbers comparing the existing epoll (with, say, a timerfd in your epoll
> set), so we can see the improvement relative to epoll.

I did compare them, but the results are too close to show a difference. The
improvements in epoll_pwait1 don't really help the hot path of guest IO, but
they do affect the precision of program timers, which are used by various
device emulations in QEMU.

Although it's kind of subtle and difficult to summarize here, I can give an
example from the IO throttling implementation in QEMU to show the significance:

The throttling algorithm computes a duration for the next IO, which is used to
arm a timer in order to delay the request a bit. As timeouts are always rounded
*UP* to the effective granularity, epoll_pwait's 1 ms resolution is just too
coarse and will lead to severe inaccuracy. With epoll_pwait1, we can avoid the
rounding-up.
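
To make the rounding concrete, here is a minimal sketch contrasting the two
timeout types (the 1.5 ms deadline and the helper names are made up):

#define _GNU_SOURCE
#include <poll.h>
#include <sys/epoll.h>
#include <time.h>

/* With ppoll(), a 1.5 ms deadline can be expressed exactly. */
static int wait_ppoll(struct pollfd *fds, nfds_t nfds)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1500000 }; /* 1.5 ms */
    return ppoll(fds, nfds, &ts, NULL);
}

/* With epoll_pwait(), the timeout is an int in milliseconds, so the same
 * deadline has to be rounded up to 2 ms -- a 33% overshoot every time the
 * throttling timer is armed. */
static int wait_epoll(int epfd, struct epoll_event *events, int maxevents)
{
    return epoll_pwait(epfd, events, maxevents, 2 /* ms, rounded up */, NULL);
}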

I think this capability could be pretty generally desirable for other
applications, too.

Regarding the epoll_ctl_batch improvement, again, it is not going to change the
numbers in the small workloads I managed to test.

Of course, if you have a specific application scenario in mind, I will try it. :)

Thanks,
Fam


Re: [PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-03-12 Thread Jason Baron
On 03/09/2015 09:49 PM, Fam Zheng wrote:
>
> Benchmark for epoll_pwait1
> ==========================
>
> By running fio tests inside a VM with both the original and the modified
> QEMU, we can compare the difference in performance.
>
> With a small VM setup [t1], the original QEMU (ppoll based) has a 4k read
> latency overhead of around 37 us. In this setup, the main loop polls 10~20
> fds.
>
> With a slightly larger VM instance [t2], where a virtio-serial device is
> attached so that there are 80~90 fds in the main loop, the original QEMU
> has a latency overhead of around 49 us. By adding more such devices [t3],
> we can see the latency go even higher - 83 us with ~200 fds.
>
> After modifying QEMU to use epoll_pwait1 and testing again, the latency
> numbers are respectively 36 us, 37 us and 47 us for t1, t2 and t3.
>
>

Hi,

So it sounds like you are comparing the original QEMU code (which was using
ppoll) vs. using epoll with these new syscalls. I'm curious whether you have
numbers comparing the existing epoll (with, say, a timerfd in your epoll set),
so we can see the improvement relative to epoll.

Thanks,

-Jason


[PATCH v4 0/9] epoll: Introduce new syscalls, epoll_ctl_batch and epoll_pwait1

2015-03-09 Thread Fam Zheng
Changes from v3:

  - Add "size" field in epoll_wait_params. [Jon, Ingo, Seymour]
  - Input validation for ncmds in epoll_ctl_batch. [Dan]
  - Return -EFAULT if copy_to_user failed in epoll_ctl_batch. [Omar, Michael]
  - Change "timeout" in epoll_wait_params to pointer, to get the same
convention of 'no wait', 'wait indefinitely' and 'wait for specified time'
with epoll_pwait. [Seymour]
  - Add compat implementation of epoll_pwait1.

Justification
=============

QEMU, like many select/poll based applications, considers epoll as an
alternative when its event loop needs to handle a large number of fds.
However, there are currently two concerns with epoll that prevent the switch:

The major one is timeout precision. For example, in QEMU the main loop takes
care of calling callbacks at specific timeouts - the QEMU timer API. The
timeout value passed to ppoll depends on the next timer to fire. epoll_pwait's
millisecond timeout is so coarse that rounding up the timeout hurts
performance badly.

The minor one is the number of system calls needed to update the fd set. While
epoll can handle a large number of fds quickly, it still requires one epoll_ctl
call per fd update, compared to the one-shot call to select/poll with an fd
array. This can make epoll inferior to ppoll in cases where a small but
frequently changing set of fds is polled by the event loop.

This series introduces two new epoll syscalls to address these concerns
respectively. The idea of epoll_ctl_batch was suggested by Andy Lutomirski in
[1], who also suggested clockid as a parameter of epoll_pwait1; a usage sketch
of epoll_pwait1 follows below.

[1]: http://lists.openwall.net/linux-kernel/2015/01/08/542
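
For illustration, a sketch of how a caller might use epoll_pwait1, assuming
an epoll_wait_params layout along the lines described in the changelog above;
the field names, their ordering and the prototype are assumptions, not the
final ABI:

#include <signal.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <time.h>

/* Assumed layout, following the changelog: a "size" field for forward
 * compatibility, a clockid, and a timespec pointer where NULL means
 * "wait indefinitely" and {0, 0} means "do not wait", mirroring ppoll(). */
struct epoll_wait_params {
    size_t           size;
    int              clockid;
    struct timespec *timeout;
    sigset_t        *sigmask;
};

/* Proposed syscall; no libc wrapper exists yet, so this prototype is
 * likewise an assumption based on the series description. */
extern int epoll_pwait1(int epfd, int flags,
                        struct epoll_event *events, int maxevents,
                        struct epoll_wait_params *params);

/* Wait with a 1.5 ms timeout -- a resolution that the millisecond-based
 * epoll_pwait() cannot express without rounding up. */
int example_wait(int epfd, struct epoll_event *events, int maxevents)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1500000 };
    struct epoll_wait_params params = {
        .size    = sizeof(params),
        .clockid = CLOCK_MONOTONIC,
        .timeout = &ts,
        .sigmask = NULL,
    };
    return epoll_pwait1(epfd, 0, events, maxevents, &params);
}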

Benchmark for epoll_pwait1
==========================

By running fio tests inside a VM with both the original and the modified QEMU,
we can compare the difference in performance.

With a small VM setup [t1], the original QEMU (ppoll based) has a 4k read
latency overhead of around 37 us. In this setup, the main loop polls 10~20 fds.

With a slightly larger VM instance [t2], where a virtio-serial device is
attached so that there are 80~90 fds in the main loop, the original QEMU has a
latency overhead of around 49 us. By adding more such devices [t3], we can see
the latency go even higher - 83 us with ~200 fds.

After modifying QEMU to use epoll_pwait1 and testing again, the latency numbers
are respectively 36 us, 37 us and 47 us for t1, t2 and t3.

Previous Changelogs
===================

Changes from v2 (https://lkml.org/lkml/2015/2/4/105)
----------------------------------------------------

  - Rename epoll_ctl_cmd.error_hint to "result". [Michael]

  - Add background introduction in cover letter. [Michael]

  - Expand the last struct of epoll_pwait1, adding clockid and timespec.
  
  - Update man page in cover letter accordingly:

* "error_hint" -> "result".
* The result field's caveat in the "RETURN VALUE" section of epoll_ctl_batch.

  Please review!

Changes from v1 (https://lkml.org/lkml/2015/1/20/189)
-----------------------------------------------------

  - As discussed in the previous thread [1], split the call into
    epoll_ctl_batch and epoll_pwait. [Michael]

  - Fix memory leaks. [Omar]

  - Add a short comment about the ignored copy_to_user failure. [Omar]

  - Cover letter rewritten.

Documentation of the new system calls
=====================================

1) epoll_ctl_batch
------------------

NAME
   epoll_ctl_batch - batch control interface for an epoll descriptor

SYNOPSIS

   #include <sys/epoll.h>

   int epoll_ctl_batch(int epfd, int flags,
                       int ncmds, struct epoll_ctl_cmd *cmds);

DESCRIPTION

   This system call is an extension of epoll_ctl(). The primary difference
   is that it allows you to batch multiple operations in a single system
   call, providing a more efficient interface for updating events on the
   epoll file descriptor epfd.

   The flags argument is reserved and must be 0.

   The argument ncmds is the number of cmds entries being passed in.
   This number must be greater than 0.

   Each operation is specified as an element in the cmds array, defined as:

   struct epoll_ctl_cmd {
       /* Reserved flags for future extension, must be 0. */
       int flags;

       /* The same as the epoll_ctl() op parameter. */
       int op;

       /* The same as the epoll_ctl() fd parameter. */
       int fd;

       /* The same as the "events" field in struct epoll_event. */
       uint32_t events;

       /* The same as the "data" field in struct epoll_event. */
       uint64_t data;

       /* Output field, set to the return code of this command after
        * it is executed by the kernel. */
       int result;
   };
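
   For illustration, a minimal usage sketch based on the structure above
   (the helper name is made up, and the assumption that the return value
   is the number of commands attempted is inferred from the result field,
   not stated in the text above):

   #include <stdint.h>
   #include <stdio.h>
   #include <sys/epoll.h>

   /* Register two fds for reading with one epoll_ctl_batch() call
    * instead of two epoll_ctl() calls, then inspect the per-command
    * result fields. */
   static void add_two_fds(int epfd, int fd_a, int fd_b)
   {
       struct epoll_ctl_cmd cmds[2] = {
           { .op = EPOLL_CTL_ADD, .fd = fd_a, .events = EPOLLIN,
             .data = (uint64_t)fd_a },
           { .op = EPOLL_CTL_ADD, .fd = fd_b, .events = EPOLLIN,
             .data = (uint64_t)fd_b },
       };
       int n = epoll_ctl_batch(epfd, 0, 2, cmds);

       if (n < 0) {
           perror("epoll_ctl_batch");
           return;
       }
       /* Commands run in order; each result field holds that
        * command's own return code. */
       for (int i = 0; i < n; i++)
           if (cmds[i].result < 0)
               fprintf(stderr, "cmd %d failed: %d\n", i, cmds[i].result);
   }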

   This system call is not atomic when updating the epoll descriptor.  All
   entries in cmds are executed in the provided order. If any cmds entry