Re: [PATCH 10/10] perf/doc: update design.txt for exclude_{host|guest} flags
On Wed, Dec 12, 2018 at 09:07:42AM +0100, Christoffer Dall wrote: > On Tue, Dec 11, 2018 at 01:59:03PM +, Andrew Murray wrote: > > On Tue, Dec 11, 2018 at 10:06:53PM +1100, Michael Ellerman wrote: > > > [ Reviving old thread. ] > > > > > > Andrew Murray writes: > > > > On Tue, Nov 20, 2018 at 10:31:36PM +1100, Michael Ellerman wrote: > > > >> Andrew Murray writes: > > > >> > > > >> > Update design.txt to reflect the presence of the exclude_host > > > >> > and exclude_guest perf flags. > > > >> > > > > >> > Signed-off-by: Andrew Murray > > > >> > --- > > > >> > tools/perf/design.txt | 4 > > > >> > 1 file changed, 4 insertions(+) > > > >> > > > > >> > diff --git a/tools/perf/design.txt b/tools/perf/design.txt > > > >> > index a28dca2..7de7d83 100644 > > > >> > --- a/tools/perf/design.txt > > > >> > +++ b/tools/perf/design.txt > > > >> > @@ -222,6 +222,10 @@ The 'exclude_user', 'exclude_kernel' and > > > >> > 'exclude_hv' bits provide a > > > >> > way to request that counting of events be restricted to times when > > > >> > the > > > >> > CPU is in user, kernel and/or hypervisor mode. > > > >> > > > > >> > +Furthermore the 'exclude_host' and 'exclude_guest' bits provide a > > > >> > way > > > >> > +to request counting of events restricted to guest and host contexts > > > >> > when > > > >> > +using virtualisation. > > > >> > > > >> How does exclude_host differ from exclude_hv ? > > > > > > > > I believe exclude_host / exclude_guest are intented to distinguish > > > > between host and guest in the hosted hypervisor context (KVM). > > > > > > OK yeah, from the perf-list man page: > > > > > >u - user-space counting > > >k - kernel counting > > >h - hypervisor counting > > >I - non idle counting > > >G - guest counting (in KVM guests) > > >H - host counting (not in KVM guests) > > > > > > > Whereas exclude_hv allows to distinguish between guest and > > > > hypervisor in the bare-metal type hypervisors. > > > > > > Except that's exactly not how we use them on powerpc :) > > > > > > We use exclude_hv to exclude "the hypervisor", regardless of whether > > > it's KVM or PowerVM (which is a bare-metal hypervisor). > > > > > > We don't use exclude_host / exclude_guest at all, which I guess is a > > > bug, except I didn't know they existed until this thread. > > > > > > eg, in a KVM guest: > > > > > > $ perf record -e cycles:G /bin/bash -c "for i in {0..10}; do :;done" > > > $ perf report -D | grep -Fc "dso: [hypervisor]" > > > 16 > > > > > > > > > > In the case of arm64 - if VHE extensions are present then the host > > > > kernel will run at a higher privilege to the guest kernel, in which > > > > case there is no distinction between hypervisor and host so we ignore > > > > exclude_hv. But where VHE extensions are not present then the host > > > > kernel runs at the same privilege level as the guest and we use a > > > > higher privilege level to switch between them - in this case we can > > > > use exclude_hv to discount that hypervisor role of switching between > > > > guests. > > > > > > I couldn't find any arm64 perf code using exclude_host/guest at all? > > > > Correct - but this is in flight as I am currently adding support for this > > see [1]. > > > > > > > > And I don't see any x86 code using exclude_hv. > > > > I can't find any either. > > > > > > > > But maybe that's OK, I just worry this is confusing for users. > > > > There is some extra context regarding this where exclude_guest/exclude_host > > was added, see [2] and where exclude_hv was added, see [3] > > > > Generally it seems that exclude_guest/exclude_host relies upon switching > > counters off/on on guest/host switch code (which works well in the nested > > virt case). Whereas exclude_hv tends to rely solely on hardware capability > > based on privilege level (which works well in the bare metal case where > > the guest doesn't run at same privilege as the host). > > > > I think from the user perspective exclude_hv allows you to see your overhead > > if you are a guest (i.e. work done by bare metal hypervisor associated with > > you as the guest). Whereas exclude_guest/exclude_host doesn't allow you to > > see events above you (i.e. the kernel hypervisor) if you are the guest... > > > > At least that's how I read this, I've copied in others that may provide > > more authoritative feedback. > > > > [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-December/033698.html > > [2] https://www.spinics.net/lists/kvm/msg53996.html > > [3] https://lore.kernel.org/patchwork/patch/143918/ > > > > I'll try to answer this in a different way, based on previous > discussions with Joerg et al. who introduced these flags. Assume no > support for nested virtualization as a first approximation: > > If you are running as a guest: > - exclude_hv: stop counting events when the hypervisor runs > - exclude_host: has no effect >
Re: [PATCH v2 6/8] socket: Add SO_TIMESTAMP[NS]_NEW
> This did not address yet the previous comments on consistency and > unnecessary code churn. > > The existing logic to differentiate SO_TIMESTAMP from SO_TIMESTAMPNS > in both tcp_recv_timestamp and __sock_recv_timestamp is > > if (sock_flag(sk, SOCK_RCVTSTAMP)) { > if (sock_flag(sk, SOCK_RCVTSTAMPNS)) > /* timespec case */ > else > /* timeval case */ > } > > A new level of nesting needs to be added to differentiate .._OLD from .._NEW. > > Even if massively changing the original functions, please do so > consistently, either > > if (sock_flag(sk, SOCK_RCVTSTAMP)) { > if (sock_flag(sk, SOCK_TSTAMP_NEW) { > /* new code */ > } else { > if (sock_flag(sk, SOCK_RCVTSTAMPNS)) > /* timespec case */ > else > /* timeval case */ > } > } This first example is wrong. I meant if (sock_flag(sk, SOCK_RCVTSTAMP)) { if (sock_flag(sk, SOCK_RCVTSTAMPNS)) { if (sock_flag(sk, SOCK_TSTAMP_NEW) /* new code */ else /* timespec case */ } else { if (sock_flag(sk, SOCK_TSTAMP_NEW) /* new code */ else /* timeval case */ } }
Re: [PATCH v2 7/8] socket: Add SO_TIMESTAMPING_NEW
On Tue, Dec 11, 2018 at 3:30 PM Deepa Dinamani wrote: > > Add SO_TIMESTAMPING_NEW variant of socket timestamp options. > This is the y2038 safe versions of the SO_TIMESTAMPING_OLD > for all architectures. > > Signed-off-by: Deepa Dinamani > Cc: ch...@zankel.net > Cc: fenghua...@intel.com > Cc: r...@twiddle.net > Cc: t...@linutronix.de > Cc: ubr...@linux.ibm.com > Cc: linux-alpha@vger.kernel.org > Cc: linux-a...@vger.kernel.org > Cc: linux-i...@vger.kernel.org > Cc: linux-m...@linux-mips.org > Cc: linux-s...@vger.kernel.org > Cc: linux-xte...@linux-xtensa.org > Cc: sparcli...@vger.kernel.org > --- > arch/alpha/include/uapi/asm/socket.h | 5 +- > arch/mips/include/uapi/asm/socket.h | 5 +- > arch/parisc/include/uapi/asm/socket.h | 5 +- > arch/sparc/include/uapi/asm/socket.h | 8 +-- > include/linux/socket.h| 7 +++ > include/uapi/asm-generic/socket.h | 5 +- > include/uapi/linux/errqueue.h | 4 ++ > net/core/scm.c| 27 ++ > net/core/sock.c | 73 +++ > net/ipv4/tcp.c| 29 ++- > net/smc/af_smc.c | 3 +- > net/socket.c | 15 -- > 12 files changed, 122 insertions(+), 64 deletions(-) > > diff --git a/arch/alpha/include/uapi/asm/socket.h > b/arch/alpha/include/uapi/asm/socket.h > index 352e3dc0b3d9..8b9f6f7f8087 100644 > --- a/arch/alpha/include/uapi/asm/socket.h > +++ b/arch/alpha/include/uapi/asm/socket.h > @@ -116,19 +116,20 @@ > > #define SO_TIMESTAMP_NEW 62 > #define SO_TIMESTAMPNS_NEW 63 > +#define SO_TIMESTAMPING_NEW 64 > > #if !defined(__KERNEL__) > > #if __BITS_PER_LONG == 64 > #define SO_TIMESTAMP SO_TIMESTAMP_OLD > #define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD > +#define SO_TIMESTAMPINGSO_TIMESTAMPING_OLD > #else > #define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? > SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW) > #define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? > SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW) > +#define SO_TIMESTAMPING (sizeof(time_t) == sizeof(__kernel_long_t) ? > SO_TIMESTAMPING_OLD : SO_TIMESTAMPING_NEW) > #endif If the previous patch moves this block to a platform-independent header, that allows this patch to shrink considerably, too. > +static int setsockopt_timestamping(struct sock *sk, int type, int val) > +{ > + if (type == SO_TIMESTAMPING_NEW) > + sock_set_flag(sk, SOCK_TSTAMP_NEW); > + > + if (val & ~SOF_TIMESTAMPING_MASK) > + return -EINVAL; > + > + if (val & SOF_TIMESTAMPING_OPT_ID && > + !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) { > + if (sk->sk_protocol == IPPROTO_TCP && > + sk->sk_type == SOCK_STREAM) { > + if ((1 << sk->sk_state) & > + (TCPF_CLOSE | TCPF_LISTEN)) > + return -EINVAL; > + sk->sk_tskey = tcp_sk(sk)->snd_una; > + } else { > + sk->sk_tskey = 0; > + } > + } > + > + if (val & SOF_TIMESTAMPING_OPT_STATS && > + !(val & SOF_TIMESTAMPING_OPT_TSONLY)) > + return -EINVAL; > + > + sk->sk_tsflags = val; > + if (val & SOF_TIMESTAMPING_RX_SOFTWARE) { > + sock_enable_timestamp(sk, > + SOCK_TIMESTAMPING_RX_SOFTWARE); > + } else { > + if (type == SO_TIMESTAMPING_NEW) > + sock_reset_flag(sk, SOCK_TSTAMP_NEW); > + sock_disable_timestamp(sk, > + (1UL << > SOCK_TIMESTAMPING_RX_SOFTWARE)); > + } > + > + return 0; > +} > + > /* > * This is meant for all protocols to use and covers goings on > * at the socket level. Everything here is generic. > @@ -851,39 +890,7 @@ int sock_setsockopt(struct socket *sock, int level, int > optname, > break; > > case SO_TIMESTAMPING_OLD: > - if (val & ~SOF_TIMESTAMPING_MASK) { > - ret = -EINVAL; > - break; > - } > - > - if (val & SOF_TIMESTAMPING_OPT_ID && > - !(sk->sk_tsflags & SOF_TIMESTAMPING_OPT_ID)) { > - if (sk->sk_protocol == IPPROTO_TCP && > - sk->sk_type == SOCK_STREAM) { > - if ((1 << sk->sk_state) & > - (TCPF_CLOSE | TCPF_LISTEN)) { > - ret = -EINVAL; > - break; > - } > - sk->sk_tskey = tcp_sk(sk)->snd_una; > - } else { > - sk->sk_tskey = 0; > - } > -
Re: [PATCH v2 6/8] socket: Add SO_TIMESTAMP[NS]_NEW
On Tue, Dec 11, 2018 at 3:30 PM Deepa Dinamani wrote: > > Add SO_TIMESTAMP_NEW and SO_TIMESTAMPNS_NEW variants of > socket timestamp options. > These are the y2038 safe versions of the SO_TIMESTAMP_OLD > and SO_TIMESTAMPNS_OLD for all architectures. > > Note that the format of scm_timestamping.ts[0] is not changed > in this patch. > > Signed-off-by: Deepa Dinamani > Cc: j...@parisc-linux.org > Cc: r...@linux-mips.org > Cc: r...@twiddle.net > Cc: linux-alpha@vger.kernel.org > Cc: linux-m...@linux-mips.org > Cc: linux-par...@vger.kernel.org > Cc: linux-r...@vger.kernel.org > Cc: net...@vger.kernel.org > Cc: sparcli...@vger.kernel.org > --- > arch/alpha/include/uapi/asm/socket.h | 15 ++- > arch/mips/include/uapi/asm/socket.h | 14 ++- > arch/parisc/include/uapi/asm/socket.h | 14 ++- > arch/sparc/include/uapi/asm/socket.h | 14 ++- > include/linux/skbuff.h| 18 + > include/net/sock.h| 1 + > include/uapi/asm-generic/socket.h | 15 ++- > net/core/sock.c | 51 ++-- > net/ipv4/tcp.c| 57 +++ > net/rds/af_rds.c | 8 +++- > net/rds/recv.c| 16 +++- > net/socket.c | 47 -- > 12 files changed, 216 insertions(+), 54 deletions(-) > > diff --git a/arch/alpha/include/uapi/asm/socket.h > b/arch/alpha/include/uapi/asm/socket.h > index 00e45c80e574..352e3dc0b3d9 100644 > --- a/arch/alpha/include/uapi/asm/socket.h > +++ b/arch/alpha/include/uapi/asm/socket.h > @@ -3,6 +3,7 @@ > #define _UAPI_ASM_SOCKET_H > > #include > +#include > > /* For setsockopt(2) */ > /* > @@ -110,12 +111,22 @@ > > #define SO_TIMESTAMP_OLD 29 > #define SO_TIMESTAMPNS_OLD 35 > + > #define SO_TIMESTAMPING_OLD 37 > > +#define SO_TIMESTAMP_NEW 62 > +#define SO_TIMESTAMPNS_NEW 63 > + > #if !defined(__KERNEL__) > > -#define SO_TIMESTAMP SO_TIMESTAMP_OLD > -#define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD > +#if __BITS_PER_LONG == 64 > +#define SO_TIMESTAMP SO_TIMESTAMP_OLD > +#define SO_TIMESTAMPNS SO_TIMESTAMPNS_OLD > +#else > +#define SO_TIMESTAMP (sizeof(time_t) == sizeof(__kernel_long_t) ? > SO_TIMESTAMP_OLD : SO_TIMESTAMP_NEW) > +#define SO_TIMESTAMPNS (sizeof(time_t) == sizeof(__kernel_long_t) ? > SO_TIMESTAMPNS_OLD : SO_TIMESTAMPNS_NEW) > +#endif > + This is not platform specific. Perhaps it can be deduplicated. The interface expects callers to include , not directly. So perhaps it can go there? > -/* Similar to __sock_recv_timestamp, but does not require an skb */ > -static void tcp_recv_timestamp(struct msghdr *msg, const struct sock *sk, > - struct scm_timestamping *tss) > +static void tcp_recv_sw_timestamp(struct msghdr *msg, const struct sock *sk, > struct timespec64 *ts) > { > - struct __kernel_old_timeval tv; > - bool has_timestamping = false; > + int new_tstamp = sock_flag(sk, SOCK_TSTAMP_NEW); > + > + if (ts->tv_sec || ts->tv_nsec) { > + if (sock_flag(sk, SOCK_RCVTSTAMPNS)) { > + if (new_tstamp) { > + struct __kernel_timespec kts = {ts->tv_sec, > ts->tv_nsec}; > + > + put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMPNS_NEW, > +sizeof(kts), ); > + } else { > + struct timespec ts_old = > timespec64_to_timespec(*ts); > > - if (tss->ts[0].tv_sec || tss->ts[0].tv_nsec) { > - if (sock_flag(sk, SOCK_RCVTSTAMP)) { > - if (sock_flag(sk, SOCK_RCVTSTAMPNS)) { > put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMPNS_OLD, > -sizeof(tss->ts[0]), >ts[0]); > +sizeof(ts), _old); > + } > + } else if (sock_flag(sk, SOCK_RCVTSTAMP)) { > + if (new_tstamp) { > + struct __kernel_sock_timeval stv; > + > + stv.tv_sec = ts->tv_sec; > + stv.tv_usec = ts->tv_nsec / 1000; > + > + put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMP_NEW, > +sizeof(stv), ); > } else { > - tv.tv_sec = tss->ts[0].tv_sec; > - tv.tv_usec = tss->ts[0].tv_nsec / 1000; > + struct __kernel_old_timeval tv; > > + tv.tv_sec = ts->tv_sec; > + tv.tv_usec = ts->tv_nsec / 1000; > put_cmsg(msg, SOL_SOCKET, SO_TIMESTAMP_OLD, > sizeof(tv), ); >
Re: [PATCH 10/10] perf/doc: update design.txt for exclude_{host|guest} flags
On Tue, Dec 11, 2018 at 01:59:03PM +, Andrew Murray wrote: > On Tue, Dec 11, 2018 at 10:06:53PM +1100, Michael Ellerman wrote: > > [ Reviving old thread. ] > > > > Andrew Murray writes: > > > On Tue, Nov 20, 2018 at 10:31:36PM +1100, Michael Ellerman wrote: > > >> Andrew Murray writes: > > >> > > >> > Update design.txt to reflect the presence of the exclude_host > > >> > and exclude_guest perf flags. > > >> > > > >> > Signed-off-by: Andrew Murray > > >> > --- > > >> > tools/perf/design.txt | 4 > > >> > 1 file changed, 4 insertions(+) > > >> > > > >> > diff --git a/tools/perf/design.txt b/tools/perf/design.txt > > >> > index a28dca2..7de7d83 100644 > > >> > --- a/tools/perf/design.txt > > >> > +++ b/tools/perf/design.txt > > >> > @@ -222,6 +222,10 @@ The 'exclude_user', 'exclude_kernel' and > > >> > 'exclude_hv' bits provide a > > >> > way to request that counting of events be restricted to times when the > > >> > CPU is in user, kernel and/or hypervisor mode. > > >> > > > >> > +Furthermore the 'exclude_host' and 'exclude_guest' bits provide a way > > >> > +to request counting of events restricted to guest and host contexts > > >> > when > > >> > +using virtualisation. > > >> > > >> How does exclude_host differ from exclude_hv ? > > > > > > I believe exclude_host / exclude_guest are intented to distinguish > > > between host and guest in the hosted hypervisor context (KVM). > > > > OK yeah, from the perf-list man page: > > > >u - user-space counting > >k - kernel counting > >h - hypervisor counting > >I - non idle counting > >G - guest counting (in KVM guests) > >H - host counting (not in KVM guests) > > > > > Whereas exclude_hv allows to distinguish between guest and > > > hypervisor in the bare-metal type hypervisors. > > > > Except that's exactly not how we use them on powerpc :) > > > > We use exclude_hv to exclude "the hypervisor", regardless of whether > > it's KVM or PowerVM (which is a bare-metal hypervisor). > > > > We don't use exclude_host / exclude_guest at all, which I guess is a > > bug, except I didn't know they existed until this thread. > > > > eg, in a KVM guest: > > > > $ perf record -e cycles:G /bin/bash -c "for i in {0..10}; do :;done" > > $ perf report -D | grep -Fc "dso: [hypervisor]" > > 16 > > > > > > > In the case of arm64 - if VHE extensions are present then the host > > > kernel will run at a higher privilege to the guest kernel, in which > > > case there is no distinction between hypervisor and host so we ignore > > > exclude_hv. But where VHE extensions are not present then the host > > > kernel runs at the same privilege level as the guest and we use a > > > higher privilege level to switch between them - in this case we can > > > use exclude_hv to discount that hypervisor role of switching between > > > guests. > > > > I couldn't find any arm64 perf code using exclude_host/guest at all? > > Correct - but this is in flight as I am currently adding support for this > see [1]. > > > > > And I don't see any x86 code using exclude_hv. > > I can't find any either. > > > > > But maybe that's OK, I just worry this is confusing for users. > > There is some extra context regarding this where exclude_guest/exclude_host > was added, see [2] and where exclude_hv was added, see [3] > > Generally it seems that exclude_guest/exclude_host relies upon switching > counters off/on on guest/host switch code (which works well in the nested > virt case). Whereas exclude_hv tends to rely solely on hardware capability > based on privilege level (which works well in the bare metal case where > the guest doesn't run at same privilege as the host). > > I think from the user perspective exclude_hv allows you to see your overhead > if you are a guest (i.e. work done by bare metal hypervisor associated with > you as the guest). Whereas exclude_guest/exclude_host doesn't allow you to > see events above you (i.e. the kernel hypervisor) if you are the guest... > > At least that's how I read this, I've copied in others that may provide > more authoritative feedback. > > [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-December/033698.html > [2] https://www.spinics.net/lists/kvm/msg53996.html > [3] https://lore.kernel.org/patchwork/patch/143918/ > I'll try to answer this in a different way, based on previous discussions with Joerg et al. who introduced these flags. Assume no support for nested virtualization as a first approximation: If you are running as a guest: - exclude_hv: stop counting events when the hypervisor runs - exclude_host: has no effect - exclude_guest: has no effect If you are running as a host/hypervisor: - exclude_hv: has no effect - exclude_host: only count events when the guest is running - exclude_guest: only count events when the host is running With nested virtualization, you get the natural union of the above. **This has