Re: [PATCH 10/10] perf/doc: update design.txt for exclude_{host|guest} flags

2018-12-12 Thread Andrew Murray
On Wed, Dec 12, 2018 at 09:07:42AM +0100, Christoffer Dall wrote:
> On Tue, Dec 11, 2018 at 01:59:03PM +, Andrew Murray wrote:
> > On Tue, Dec 11, 2018 at 10:06:53PM +1100, Michael Ellerman wrote:
> > > [ Reviving old thread. ]
> > > 
> > > Andrew Murray  writes:
> > > > On Tue, Nov 20, 2018 at 10:31:36PM +1100, Michael Ellerman wrote:
> > > >> Andrew Murray  writes:
> > > >> 
> > > >> > Update design.txt to reflect the presence of the exclude_host
> > > >> > and exclude_guest perf flags.
> > > >> >
> > > >> > Signed-off-by: Andrew Murray 
> > > >> > ---
> > > >> >  tools/perf/design.txt | 4 
> > > >> >  1 file changed, 4 insertions(+)
> > > >> >
> > > >> > diff --git a/tools/perf/design.txt b/tools/perf/design.txt
> > > >> > index a28dca2..7de7d83 100644
> > > >> > --- a/tools/perf/design.txt
> > > >> > +++ b/tools/perf/design.txt
> > > >> > @@ -222,6 +222,10 @@ The 'exclude_user', 'exclude_kernel' and 
> > > >> > 'exclude_hv' bits provide a
> > > >> >  way to request that counting of events be restricted to times when 
> > > >> > the
> > > >> >  CPU is in user, kernel and/or hypervisor mode.
> > > >> >  
> > > >> > +Furthermore the 'exclude_host' and 'exclude_guest' bits provide a 
> > > >> > way
> > > >> > +to request counting of events restricted to guest and host contexts 
> > > >> > when
> > > >> > +using virtualisation.
> > > >> 
> > > >> How does exclude_host differ from exclude_hv ?
> > > >
> > > > I believe exclude_host / exclude_guest are intented to distinguish
> > > > between host and guest in the hosted hypervisor context (KVM).
> > > 
> > > OK yeah, from the perf-list man page:
> > > 
> > >u - user-space counting
> > >k - kernel counting
> > >h - hypervisor counting
> > >I - non idle counting
> > >G - guest counting (in KVM guests)
> > >H - host counting (not in KVM guests)
> > > 
> > > > Whereas exclude_hv allows to distinguish between guest and
> > > > hypervisor in the bare-metal type hypervisors.
> > > 
> > > Except that's exactly not how we use them on powerpc :)
> > > 
> > > We use exclude_hv to exclude "the hypervisor", regardless of whether
> > > it's KVM or PowerVM (which is a bare-metal hypervisor).
> > > 
> > > We don't use exclude_host / exclude_guest at all, which I guess is a
> > > bug, except I didn't know they existed until this thread.
> > > 
> > > eg, in a KVM guest:
> > > 
> > >   $ perf record -e cycles:G /bin/bash -c "for i in {0..10}; do :;done"
> > >   $ perf report -D | grep -Fc "dso: [hypervisor]"
> > >   16
> > > 
> > > 
> > > > In the case of arm64 - if VHE extensions are present then the host
> > > > kernel will run at a higher privilege to the guest kernel, in which
> > > > case there is no distinction between hypervisor and host so we ignore
> > > > exclude_hv. But where VHE extensions are not present then the host
> > > > kernel runs at the same privilege level as the guest and we use a
> > > > higher privilege level to switch between them - in this case we can
> > > > use exclude_hv to discount that hypervisor role of switching between
> > > > guests.
> > > 
> > > I couldn't find any arm64 perf code using exclude_host/guest at all?
> > 
> > Correct - but this is in flight as I am currently adding support for this
> > see [1].
> > 
> > > 
> > > And I don't see any x86 code using exclude_hv.
> > 
> > I can't find any either.
> > 
> > > 
> > > But maybe that's OK, I just worry this is confusing for users.
> > 
> > There is some extra context regarding this where exclude_guest/exclude_host
> > was added, see [2] and where exclude_hv was added, see [3]
> > 
> > Generally it seems that exclude_guest/exclude_host relies upon switching
> > counters off/on on guest/host switch code (which works well in the nested
> > virt case). Whereas exclude_hv tends to rely solely on hardware capability
> > based on privilege level (which works well in the bare metal case where
> > the guest doesn't run at same privilege as the host).
> > 
> > I think from the user perspective exclude_hv allows you to see your overhead
> > if you are a guest (i.e. work done by bare metal hypervisor associated with
> > you as the guest). Whereas exclude_guest/exclude_host doesn't allow you to
> > see events above you (i.e. the kernel hypervisor) if you are the guest...
> > 
> > At least that's how I read this, I've copied in others that may provide
> > more authoritative feedback.
> > 
> > [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-December/033698.html
> > [2] https://www.spinics.net/lists/kvm/msg53996.html
> > [3] https://lore.kernel.org/patchwork/patch/143918/
> > 
> 
> I'll try to answer this in a different way, based on previous
> discussions with Joerg et al. who introduced these flags.  Assume no
> support for nested virtualization as a first approximation:
> 
>   If you are running as a guest:
> - exclude_hv: stop counting events when the hypervisor runs
> - exclude_host: has no effect
>  

Re: [PATCH v2 6/8] socket: Add SO_TIMESTAMP[NS]_NEW

2018-12-12 Thread Willem de Bruijn
> This did not address yet the previous comments on consistency and
> unnecessary code churn.
>
> The existing logic to differentiate SO_TIMESTAMP from SO_TIMESTAMPNS
> in both tcp_recv_timestamp and __sock_recv_timestamp is
>
>   if (sock_flag(sk, SOCK_RCVTSTAMP)) {
>   if (sock_flag(sk, SOCK_RCVTSTAMPNS))
>   /* timespec case */
>   else
>   /* timeval case */
>   }
>
> A new level of nesting needs to be added to differentiate .._OLD from .._NEW.
>
> Even if massively changing the original functions, please do so
> consistently, either
>
>   if (sock_flag(sk, SOCK_RCVTSTAMP)) {
>   if (sock_flag(sk, SOCK_TSTAMP_NEW) {
>   /* new code */
>   } else {
>   if (sock_flag(sk, SOCK_RCVTSTAMPNS))
>   /* timespec case */
>   else
>   /* timeval case */
>  }
>   }

This first example is wrong. I meant

   if (sock_flag(sk, SOCK_RCVTSTAMP)) {
   if (sock_flag(sk, SOCK_RCVTSTAMPNS)) {
   if (sock_flag(sk, SOCK_TSTAMP_NEW)
/* new code */
  else
/* timespec case */
   } else {
   if (sock_flag(sk, SOCK_TSTAMP_NEW)
/* new code */
  else
/* timeval case */
}
   }


Re: [PATCH v2 7/8] socket: Add SO_TIMESTAMPING_NEW

2018-12-12 Thread Willem de Bruijn
On Tue, Dec 11, 2018 at 3:30 PM Deepa Dinamani  wrote:
>
> Add SO_TIMESTAMPING_NEW variant of socket timestamp options.
> This is the y2038 safe versions of the SO_TIMESTAMPING_OLD
> for all architectures.
>
> Signed-off-by: Deepa Dinamani 
> Cc: ch...@zankel.net
> Cc: fenghua...@intel.com
> Cc: r...@twiddle.net
> Cc: t...@linutronix.de
> Cc: ubr...@linux.ibm.com
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-a...@vger.kernel.org
> Cc: linux-i...@vger.kernel.org
> Cc: linux-m...@linux-mips.org
> Cc: linux-s...@vger.kernel.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: sparcli...@vger.kernel.org
> ---
>  arch/alpha/include/uapi/asm/socket.h  |  5 +-
>  arch/mips/include/uapi/asm/socket.h   |  5 +-
>  arch/parisc/include/uapi/asm/socket.h |  5 +-
>  arch/sparc/include/uapi/asm/socket.h  |  8 +--
>  include/linux/socket.h|  7 +++
>  include/uapi/asm-generic/socket.h |  5 +-
>  include/uapi/linux/errqueue.h |  4 ++
>  net/core/scm.c| 27 ++
>  net/core/sock.c   | 73 +++
>  net/ipv4/tcp.c| 29 ++-
>  net/smc/af_smc.c  |  3 +-
>  net/socket.c  | 15 --
>  12 files changed, 122 insertions(+), 64 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/socket.h 
> b/arch/alpha/include/uapi/asm/socket.h
> index 352e3dc0b3d9..8b9f6f7f8087 100644
> --- a/arch/alpha/include/uapi/asm/socket.h
> +++ b/arch/alpha/include/uapi/asm/socket.h
> @@ -116,19 +116,20 @@
>
>  #define SO_TIMESTAMP_NEW 62
>  #define SO_TIMESTAMPNS_NEW   63
> +#define SO_TIMESTAMPING_NEW  64
>
>  #if !defined(__KERNEL__)
>
>  #if __BITS_PER_LONG == 64
>  #define SO_TIMESTAMP   SO_TIMESTAMP_OLD
>  #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
> +#define SO_TIMESTAMPINGSO_TIMESTAMPING_OLD
>  #else
>  #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? 
> SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
>  #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? 
> SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
> +#define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? 
> SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW)
>  #endif

If the previous patch moves this block to a platform-independent
header, that allows this patch to shrink considerably, too.

> +static int setsockopt_timestamping(struct sock *sk, int type, int val)
> +{
> +   if (type == SO_TIMESTAMPING_NEW)
> +   sock_set_flag(sk, SOCK_TSTAMP_NEW);
> +
> +   if (val & ~SOF_TIMESTAMPING_MASK)
> +   return -EINVAL;
> +
> +   if (val & SOF_TIMESTAMPING_OPT_ID &&
> +   !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> +   if (sk->sk_protocol == IPPROTO_TCP &&
> +   sk->sk_type == SOCK_STREAM) {
> +   if ((1 << sk->sk_state) &
> +   (TCPF_CLOSE | TCPF_LISTEN))
> +   return -EINVAL;
> +   sk->sk_tskey = tcp_sk(sk)->snd_una;
> +   } else {
> +   sk->sk_tskey = 0;
> +   }
> +   }
> +
> +   if (val & SOF_TIMESTAMPING_OPT_STATS &&
> +   !(val & SOF_TIMESTAMPING_OPT_TSONLY))
> +   return -EINVAL;
> +
> +   sk->sk_tsflags = val;
> +   if (val & SOF_TIMESTAMPING_RX_SOFTWARE) {
> +   sock_enable_timestamp(sk,
> + SOCK_TIMESTAMPING_RX_SOFTWARE);
> +   } else {
> +   if (type == SO_TIMESTAMPING_NEW)
> +   sock_reset_flag(sk, SOCK_TSTAMP_NEW);
> +   sock_disable_timestamp(sk,
> +  (1UL << 
> SOCK_TIMESTAMPING_RX_SOFTWARE));
> +   }
> +
> +   return 0;
> +}
> +
>  /*
>   * This is meant for all protocols to use and covers goings on
>   * at the socket level. Everything here is generic.
> @@ -851,39 +890,7 @@ int sock_setsockopt(struct socket *sock, int level, int 
> optname,
> break;
>
> case SO_TIMESTAMPING_OLD:
> -   if (val & ~SOF_TIMESTAMPING_MASK) {
> -   ret = -EINVAL;
> -   break;
> -   }
> -
> -   if (val & SOF_TIMESTAMPING_OPT_ID &&
> -   !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) {
> -   if (sk->sk_protocol == IPPROTO_TCP &&
> -   sk->sk_type == SOCK_STREAM) {
> -   if ((1 << sk->sk_state) &
> -   (TCPF_CLOSE | TCPF_LISTEN)) {
> -   ret = -EINVAL;
> -   break;
> -   }
> -   sk->sk_tskey = tcp_sk(sk)->snd_una;
> -   } else {
> -   sk->sk_tskey = 0;
> -   }
> -

Re: [PATCH v2 6/8] socket: Add SO_TIMESTAMP[NS]_NEW

2018-12-12 Thread Willem de Bruijn
On Tue, Dec 11, 2018 at 3:30 PM Deepa Dinamani  wrote:
>
> Add SO_TIMESTAMP_NEW and SO_TIMESTAMPNS_NEW variants of
> socket timestamp options.
> These are the y2038 safe versions of the SO_TIMESTAMP_OLD
> and SO_TIMESTAMPNS_OLD for all architectures.
>
> Note that the format of scm_timestamping.ts[0] is not changed
> in this patch.
>
> Signed-off-by: Deepa Dinamani 
> Cc: j...@parisc-linux.org
> Cc: r...@linux-mips.org
> Cc: r...@twiddle.net
> Cc: linux-alpha@vger.kernel.org
> Cc: linux-m...@linux-mips.org
> Cc: linux-par...@vger.kernel.org
> Cc: linux-r...@vger.kernel.org
> Cc: net...@vger.kernel.org
> Cc: sparcli...@vger.kernel.org
> ---
>  arch/alpha/include/uapi/asm/socket.h  | 15 ++-
>  arch/mips/include/uapi/asm/socket.h   | 14 ++-
>  arch/parisc/include/uapi/asm/socket.h | 14 ++-
>  arch/sparc/include/uapi/asm/socket.h  | 14 ++-
>  include/linux/skbuff.h| 18 +
>  include/net/sock.h|  1 +
>  include/uapi/asm-generic/socket.h | 15 ++-
>  net/core/sock.c   | 51 ++--
>  net/ipv4/tcp.c| 57 +++
>  net/rds/af_rds.c  |  8 +++-
>  net/rds/recv.c| 16 +++-
>  net/socket.c  | 47 --
>  12 files changed, 216 insertions(+), 54 deletions(-)
>
> diff --git a/arch/alpha/include/uapi/asm/socket.h 
> b/arch/alpha/include/uapi/asm/socket.h
> index 00e45c80e574..352e3dc0b3d9 100644
> --- a/arch/alpha/include/uapi/asm/socket.h
> +++ b/arch/alpha/include/uapi/asm/socket.h
> @@ -3,6 +3,7 @@
>  #define _UAPI_ASM_SOCKET_H
>
>  #include 
> +#include 
>
>  /* For setsockopt(2) */
>  /*
> @@ -110,12 +111,22 @@
>
>  #define SO_TIMESTAMP_OLD 29
>  #define SO_TIMESTAMPNS_OLD   35
> +
>  #define SO_TIMESTAMPING_OLD  37
>
> +#define SO_TIMESTAMP_NEW 62
> +#define SO_TIMESTAMPNS_NEW   63
> +
>  #if !defined(__KERNEL__)
>
> -#define SO_TIMESTAMP   SO_TIMESTAMP_OLD
> -#define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
> +#if __BITS_PER_LONG == 64
> +#define SO_TIMESTAMP   SO_TIMESTAMP_OLD
> +#define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD
> +#else
> +#define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? 
> SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW)
> +#define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? 
> SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW)
> +#endif
> +

This is not platform specific. Perhaps it can be deduplicated. The
interface expects callers to include , not
 directly. So perhaps it can go there?

> -/* Similar to __sock_recv_timestamp, but does not require an skb */
> -static void tcp_recv_timestamp(struct msghdr *msg, const struct sock *sk,
> -  struct scm_timestamping *tss)
> +static void tcp_recv_sw_timestamp(struct msghdr *msg, const struct sock *sk, 
> struct timespec64 *ts)
>  {
> -   struct __kernel_old_timeval tv;
> -   bool has_timestamping = false;
> +   int new_tstamp = sock_flag(sk, SOCK_TSTAMP_NEW);
> +
> +   if (ts->tv_sec || ts->tv_nsec) {
> +   if (sock_flag(sk, SOCK_RCVTSTAMPNS)) {
> +   if (new_tstamp) {
> +   struct __kernel_timespec kts = {ts->tv_sec, 
> ts->tv_nsec};
> +
> +   put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMPNS_NEW,
> +sizeof(kts), );
> +   } else {
> +   struct timespec ts_old = 
> timespec64_to_timespec(*ts);
>
> -   if (tss->ts[0].tv_sec || tss->ts[0].tv_nsec) {
> -   if (sock_flag(sk, SOCK_RCVTSTAMP)) {
> -   if (sock_flag(sk, SOCK_RCVTSTAMPNS)) {
> put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMPNS_OLD,
> -sizeof(tss->ts[0]), >ts[0]);
> +sizeof(ts), _old);
> +   }
> +   } else if (sock_flag(sk, SOCK_RCVTSTAMP)) {
> +   if (new_tstamp) {
> +   struct __kernel_sock_timeval stv;
> +
> +   stv.tv_sec = ts->tv_sec;
> +   stv.tv_usec = ts->tv_nsec / 1000;
> +
> +   put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMP_NEW,
> +sizeof(stv), );
> } else {
> -   tv.tv_sec = tss->ts[0].tv_sec;
> -   tv.tv_usec = tss->ts[0].tv_nsec / 1000;
> +   struct __kernel_old_timeval tv;
>
> +   tv.tv_sec = ts->tv_sec;
> +   tv.tv_usec = ts->tv_nsec / 1000;
> put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMP_OLD,
>  sizeof(tv), );
>   

Re: [PATCH 10/10] perf/doc: update design.txt for exclude_{host|guest} flags

2018-12-12 Thread Christoffer Dall
On Tue, Dec 11, 2018 at 01:59:03PM +, Andrew Murray wrote:
> On Tue, Dec 11, 2018 at 10:06:53PM +1100, Michael Ellerman wrote:
> > [ Reviving old thread. ]
> > 
> > Andrew Murray  writes:
> > > On Tue, Nov 20, 2018 at 10:31:36PM +1100, Michael Ellerman wrote:
> > >> Andrew Murray  writes:
> > >> 
> > >> > Update design.txt to reflect the presence of the exclude_host
> > >> > and exclude_guest perf flags.
> > >> >
> > >> > Signed-off-by: Andrew Murray 
> > >> > ---
> > >> >  tools/perf/design.txt | 4 
> > >> >  1 file changed, 4 insertions(+)
> > >> >
> > >> > diff --git a/tools/perf/design.txt b/tools/perf/design.txt
> > >> > index a28dca2..7de7d83 100644
> > >> > --- a/tools/perf/design.txt
> > >> > +++ b/tools/perf/design.txt
> > >> > @@ -222,6 +222,10 @@ The 'exclude_user', 'exclude_kernel' and 
> > >> > 'exclude_hv' bits provide a
> > >> >  way to request that counting of events be restricted to times when the
> > >> >  CPU is in user, kernel and/or hypervisor mode.
> > >> >  
> > >> > +Furthermore the 'exclude_host' and 'exclude_guest' bits provide a way
> > >> > +to request counting of events restricted to guest and host contexts 
> > >> > when
> > >> > +using virtualisation.
> > >> 
> > >> How does exclude_host differ from exclude_hv ?
> > >
> > > I believe exclude_host / exclude_guest are intented to distinguish
> > > between host and guest in the hosted hypervisor context (KVM).
> > 
> > OK yeah, from the perf-list man page:
> > 
> >u - user-space counting
> >k - kernel counting
> >h - hypervisor counting
> >I - non idle counting
> >G - guest counting (in KVM guests)
> >H - host counting (not in KVM guests)
> > 
> > > Whereas exclude_hv allows to distinguish between guest and
> > > hypervisor in the bare-metal type hypervisors.
> > 
> > Except that's exactly not how we use them on powerpc :)
> > 
> > We use exclude_hv to exclude "the hypervisor", regardless of whether
> > it's KVM or PowerVM (which is a bare-metal hypervisor).
> > 
> > We don't use exclude_host / exclude_guest at all, which I guess is a
> > bug, except I didn't know they existed until this thread.
> > 
> > eg, in a KVM guest:
> > 
> >   $ perf record -e cycles:G /bin/bash -c "for i in {0..10}; do :;done"
> >   $ perf report -D | grep -Fc "dso: [hypervisor]"
> >   16
> > 
> > 
> > > In the case of arm64 - if VHE extensions are present then the host
> > > kernel will run at a higher privilege to the guest kernel, in which
> > > case there is no distinction between hypervisor and host so we ignore
> > > exclude_hv. But where VHE extensions are not present then the host
> > > kernel runs at the same privilege level as the guest and we use a
> > > higher privilege level to switch between them - in this case we can
> > > use exclude_hv to discount that hypervisor role of switching between
> > > guests.
> > 
> > I couldn't find any arm64 perf code using exclude_host/guest at all?
> 
> Correct - but this is in flight as I am currently adding support for this
> see [1].
> 
> > 
> > And I don't see any x86 code using exclude_hv.
> 
> I can't find any either.
> 
> > 
> > But maybe that's OK, I just worry this is confusing for users.
> 
> There is some extra context regarding this where exclude_guest/exclude_host
> was added, see [2] and where exclude_hv was added, see [3]
> 
> Generally it seems that exclude_guest/exclude_host relies upon switching
> counters off/on on guest/host switch code (which works well in the nested
> virt case). Whereas exclude_hv tends to rely solely on hardware capability
> based on privilege level (which works well in the bare metal case where
> the guest doesn't run at same privilege as the host).
> 
> I think from the user perspective exclude_hv allows you to see your overhead
> if you are a guest (i.e. work done by bare metal hypervisor associated with
> you as the guest). Whereas exclude_guest/exclude_host doesn't allow you to
> see events above you (i.e. the kernel hypervisor) if you are the guest...
> 
> At least that's how I read this, I've copied in others that may provide
> more authoritative feedback.
> 
> [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-December/033698.html
> [2] https://www.spinics.net/lists/kvm/msg53996.html
> [3] https://lore.kernel.org/patchwork/patch/143918/
> 

I'll try to answer this in a different way, based on previous
discussions with Joerg et al. who introduced these flags.  Assume no
support for nested virtualization as a first approximation:

  If you are running as a guest:
- exclude_hv: stop counting events when the hypervisor runs
- exclude_host: has no effect
- exclude_guest: has no effect
  
  If you are running as a host/hypervisor:
   - exclude_hv: has no effect
   - exclude_host: only count events when the guest is running
   - exclude_guest: only count events when the host is running

With nested virtualization, you get the natural union of the above.

**This has