Re: [PATCH v2 01/17] asm: simd context helper API

2018-09-06 Thread Thomas Gleixner
On Sat, 1 Sep 2018, Jason A. Donenfeld wrote:
> On Sat, Sep 1, 2018 at 2:32 PM Andy Lutomirski  wrote:
> > I tend to think the right approach is to merge Jason's code and then
> > make it better later.  Even with a totally perfect lazy FPU restore
> > implementation on x86, we'll probably still need some way of dealing
> > with SIMD contexts.  I think we're highly unlikely to ever allow
> > SIMD usage in all NMI contexts, for example, and there will always be
> > cases where we specifically don't want to use all available SIMD
> > capabilities even if we can.  For example, generating random numbers
> > does crypto, but we probably don't want to do *SIMD* crypto, since
> > that will force a save and restore and will probably fire up the
> > AVX512 unit, and that's not worth it unless we're already using it for
> > some other reason.
> >
> > Also, as Rik has discovered, lazy FPU restore is conceptually
> > straightforward but isn't entirely trivial :)
> 
> Sounds good. I'll move ahead on this basis.

Fine with me.
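
For context, the helper API being discussed amounts to roughly the sketch
below: a small SIMD context type wrapping may_use_simd() /
kernel_fpu_begin() / kernel_fpu_end() so the expensive FPU save/restore
only happens once SIMD is actually used. This is illustrative only; the
names may not match the posted series exactly.

/*
 * Illustrative sketch only -- not the posted patch. A context token so
 * that the FPU is enabled lazily, on first use, and released once.
 */
#include <linux/types.h>
#include <asm/simd.h>
#include <asm/fpu/api.h>

typedef enum {
	HAVE_NO_SIMD,		/* e.g. hardirq/NMI context */
	HAVE_FULL_SIMD,		/* SIMD usable, FPU not yet enabled */
	HAVE_SIMD_IN_USE	/* kernel_fpu_begin() has been called */
} simd_context_t;

static inline simd_context_t simd_get(void)
{
	return may_use_simd() ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
}

static inline bool simd_use(simd_context_t *ctx)
{
	if (*ctx == HAVE_FULL_SIMD) {
		kernel_fpu_begin();
		*ctx = HAVE_SIMD_IN_USE;
	}
	return *ctx == HAVE_SIMD_IN_USE;
}

static inline void simd_put(simd_context_t *ctx)
{
	if (*ctx == HAVE_SIMD_IN_USE)
		kernel_fpu_end();
	*ctx = HAVE_NO_SIMD;
}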



Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag

2018-05-28 Thread Thomas Gleixner
On Mon, 28 May 2018, Christoph Hellwig wrote:
> On Mon, May 28, 2018 at 08:23:35AM +0200, Thomas Gleixner wrote:
> > > > They remove the commandline switch before having the replacement in 
> > > > place
> > > > unless I'm misreading something.
> > > 
> > > The command line switch to force 32-bit dma is removed without
> > > replacement.   The PCI quirk to force 32-bit dma for VIA bridges
> > > is kept in place, and switched to a different mechanism in this patch.
> > 
> > Fair enough
> 
> Thanks.  Do you want me to repost them for the x86 tree, or should I
> just pull the changes with the commit logs fixed into the dma-mapping
> tree?

Just take them through your tree. There is nothing conflicting in tip.

Thanks,

tglx



Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag

2018-05-28 Thread Thomas Gleixner
On Mon, 28 May 2018, Christoph Hellwig wrote:

> On Mon, May 28, 2018 at 08:18:46AM +0200, Thomas Gleixner wrote:
> > > Which other two?  The boot optional removal patches?  They just remove
> > > the visible interface, but keep the implementation which is converted
> > > to the better mechanism here, so I think the order makes sense.
> > > But I might be missing something..
> > 
> > They remove the commandline switch before having the replacement in place
> > unless I'm misreading something.
> 
> The command line switch to force 32-bit dma is removed without
> replacement.   The PCI quirk to force 32-bit dma for VIA bridges
> is kept in place, and switched to a different mechanism in this patch.

Fair enough



Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag

2018-05-28 Thread Thomas Gleixner
On Mon, 28 May 2018, Christoph Hellwig wrote:

> On Mon, May 28, 2018 at 08:10:40AM +0200, Thomas Gleixner wrote:
> > On Fri, 25 May 2018, Christoph Hellwig wrote:
> > 
> > x86/pci-dma: ...
> > 
> > Please
> > 
> > > Instead of globally disabling > 32bit DMA using the arch_dma_supported
> > > hook walk the PCI bus under the actually affected bridge and mark every
> > > device with the dma_32bit_limit flag.  This also gets rid of the
> > > arch_dma_supported hook entirely.
> > 
> > Shouldn't this go before the other two? 
> 
> Which other two?  The boot optional removal patches?  They just remove
> the visible interface, but keep the implementation which is converted
> to the better mechanism here, so I think the order makes sense.
> But I might be missing something..

They remove the commandline switch before having the replacement in place
unless I'm misreading something.

Thanks,

tglx



Re: [PATCH 7/7] x86: switch the VIA 32-bit DMA quirk to use the struct device flag

2018-05-28 Thread Thomas Gleixner
On Fri, 25 May 2018, Christoph Hellwig wrote:

x86/pci-dma: ...

Please

> Instead of globally disabling > 32bit DMA using the arch_dma_supported
> hook walk the PCI bus under the actually affected bridge and mark every
> device with the dma_32bit_limit flag.  This also gets rid of the
> arch_dma_supported hook entirely.

Shouldn't this go before the other two? 
 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>


Re: [PATCH 6/7] x86: remove the explicit nodac and allowdac option

2018-05-28 Thread Thomas Gleixner
On Fri, 25 May 2018, Christoph Hellwig wrote:

x86/pci-dma: ...

Please

> This is something drivers should decide (modulo chipset quirks like
> for VIA), which as far as I can tell is how things have been handled
> for the last 15 years.
> 
> Note that we keep the usedac option for now, as it is used in the wild
> to override the too generic VIA quirk.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>


Re: [PATCH 5/7] x86: remove the experimental forcesac boot option

2018-05-28 Thread Thomas Gleixner
On Fri, 25 May 2018, Christoph Hellwig wrote:

x86/pci-dma: ...

Please

> Limiting the dma mask to avoid PCI (pre-PCIe) DAC cycles while paying
> the huge overhead of an IOMMU is rather pointless, and this seriously
> gets in the way of dma mapping work.
> 
> Signed-off-by: Christoph Hellwig <h...@lst.de>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>
 


Re: [PATCH 4/7] x86: remove a stray reference to pci-nommu.c

2018-05-26 Thread Thomas Gleixner
On Fri, 25 May 2018, Christoph Hellwig wrote:

Subject should be: Documentation/x86: Remove .

please

> This is just the minimal workaround.  The file file is mostly either stale

file file?

> and/or duplicative of Documentation/admin-guide/kernel-parameters.txt,
> but that is much more work than I'm willing to do right now.

Yeah, this thing is on the todo list ...

> Signed-off-by: Christoph Hellwig <h...@lst.de>

Other than the above nits:

Reviewed-by: Thomas Gleixner <t...@linutronix.de>

> ---
>  Documentation/x86/x86_64/boot-options.txt | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/x86/x86_64/boot-options.txt 
> b/Documentation/x86/x86_64/boot-options.txt
> index b297c48389b9..153b3a57fba2 100644
> --- a/Documentation/x86/x86_64/boot-options.txt
> +++ b/Documentation/x86/x86_64/boot-options.txt
> @@ -187,9 +187,9 @@ PCI
>  
>  IOMMU (input/output memory management unit)
>  
> - Currently four x86-64 PCI-DMA mapping implementations exist:
> + Multiple x86-64 PCI-DMA mapping implementations exist, for example:
>  
> -   1. : use no hardware/software IOMMU at all
> +   1. : use no hardware/software IOMMU at all
>(e.g. because you have < 3 GB memory).
>Kernel boot message: "PCI-DMA: Disabling IOMMU"
>  
> -- 
> 2.17.0
> 
> 


Re: [net 1/1] net/mlx5: Fix build break when CONFIG_SMP=n

2018-05-15 Thread Thomas Gleixner
On Mon, 14 May 2018, Randy Dunlap wrote:

> On 05/14/2018 03:38 PM, Saeed Mahameed wrote:
> > Avoid using the kernel's irq_descriptor and return IRQ vector affinity
> > directly from the driver.
> > 
> > This fixes the following build break when CONFIG_SMP=n
> > 
> > include/linux/mlx5/driver.h: In function ‘mlx5_get_vector_affinity_hint’:
> > include/linux/mlx5/driver.h:1299:13: error:
> > ‘struct irq_desc’ has no member named ‘affinity_hint’
> > 
> > Fixes: 6082d9c9c94a ("net/mlx5: Fix mlx5_get_vector_affinity function")
> > Signed-off-by: Saeed Mahameed <sae...@mellanox.com>
> > CC: Randy Dunlap <rdun...@infradead.org>
> > CC: Guenter Roeck <li...@roeck-us.net>
> > CC: Thomas Gleixner <t...@linutronix.de>
> > Tested-by: Israel Rukshin <isra...@mellanox.com>
> 
> Reported-by: kbuild test robot <l...@intel.com>
> Reported-by: Randy Dunlap <rdun...@infradead.org>
> Tested-by: Randy Dunlap <rdun...@infradead.org>

Acked-by: Thomas Gleixner <t...@linutronix.de>

Re: [PATCH bpf v3] x86/cpufeature: bpf hack for clang not supporting asm goto

2018-05-13 Thread Thomas Gleixner
On Sun, 13 May 2018, Alexei Starovoitov wrote:
> On Sat, May 12, 2018 at 10:30:02PM +0200, Thomas Gleixner wrote:
> > But yes, the situation is slightly different here because tools which
> > create trace event magic _HAVE_ to pull in kernel headers. At the same time
> > these tools depend on a compiler which failed to implement asm_goto for
> > fricking 8 years.
> 
> As a maintainer of a piece of llvm codebase I have to say that
> this bullying tactic has the opposite effect.

I'm not bullying at all. It's a fact that the discussion about asm goto is
dragging out 8 years now. We've stayed away from mandating it for quite
some time, but at some point it just doesn't make sense anymore.

> The inline asm is processed by gcc and llvm very differently.  gcc is
> leaking internal backend implementation details into inline asm
> syntax. It makes little sense for llvm to do the same, since compiler
> codegen is completely different. gcc doesn't have integrated assembler
> whereas llvm not only can parse asm, but can potentially optimize it as
> well.  Instead of demanding asm-goto that matches gcc one to one it's
> better to work with the community to define the syntax that works for
> both kernel and llvm.

Come on, we surely are open for discussions, but what I've seen so far is
just 'oh we can't do this because' instead of a sane proposal how it can be
done w/o rewriting the whole ASM GOTO stuff in the kernel or even
duplicating it.

> > + * Workaround for the sake of BPF compilation which utilizes kernel
> > + * headers, but clang does not support ASM GOTO and fails the build.
> > + */
> > +#ifndef __BPF__
> > +#warning "Compiler lacks ASM_GOTO support. Add -D __BPF__ to your compiler 
> > arguments"
> > +#endif
> 
> Agree.
> The warning makes sense to me, but it has to be different macro name.
> How about -D__BPF_TRACING__ or -D__BPF_KPROBES__ or something similar ?

Fair enough.

> Such name will also make it clear that only tracing bpf programs
> need this. Networking programs shouldn't be including kernel headers.
> There was never a need, but since the tracing progs are often used
> as an example people copy paste makefiles too.
> We tried to document it as much as possible, but people still use
> 'clang -target native -I/kernel/includes bpf_prog.c -emit-llvm | llc 
> -march=bpf'
> in their builds.
> (sometimes as a workaround for setups where clang is older version,
> but llc/llvm is new)
> Now they will see this warning and it will force them to think whether
> they actually need the kernel headers.

Makes sense.

> > +
> > +#define static_cpu_has(bit)	boot_cpu_has(bit)
> > +
> > +#else
> > +
> >  /*
> >   * Static testing of CPU features.  Used the same as boot_cpu_has().
> >   * These will statically patch the target code for additional
> > @@ -195,6 +209,7 @@ static __always_inline __pure bool _stat
> > boot_cpu_has(bit) : \
> > _static_cpu_has(bit)\
> >  )
> > +#endif
> >  
> >  #define cpu_has_bug(c, bit)		cpu_has(c, (bit))
> >  #define set_cpu_bug(c, bit)		set_cpu_cap(c, (bit))
> > --- a/samples/bpf/Makefile
> > +++ b/samples/bpf/Makefile
> > @@ -255,7 +255,7 @@ verify_target_bpf: verify_cmds
> >  $(obj)/%.o: $(src)/%.c
> > $(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
> > -I$(srctree)/tools/testing/selftests/bpf/ \
> > -   -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
> > +   -D__KERNEL__ -D__BPF__ -Wno-unused-value -Wno-pointer-sign \
> 
> Yep. In samples/bpf and libbcc we can selectively add -D__BPF_TRACING__
> I think sysdig and other folks can live with that as well.
> Agree?

Sure. Care to send an updated patch?

Thanks,

tglx
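
For reference, the agreed rename would presumably make the workaround look
roughly like this (a sketch of the follow-up, not the actual updated patch):

#ifndef CC_HAVE_ASM_GOTO

/*
 * Sketch only: same workaround as above, but keyed off __BPF_TRACING__
 * so that only tracing BPF programs, which really have to pull in kernel
 * headers, opt into it.
 */
#ifndef __BPF_TRACING__
#warning "Compiler lacks ASM_GOTO support. Add -D __BPF_TRACING__ to your compiler arguments"
#endif

#define static_cpu_has(bit)	boot_cpu_has(bit)

#else
/* ... regular asm goto based static_cpu_has() ... */
#endif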


Re: [PATCH bpf v3] x86/cpufeature: bpf hack for clang not supporting asm goto

2018-05-12 Thread Thomas Gleixner
On Sat, 12 May 2018, Alexei Starovoitov wrote:
> On Thu, May 10, 2018 at 10:58 AM, Alexei Starovoitov
>  wrote:
> > I see no option, but to fix the kernel.
> > Regardless whether it's called user space breakage or kernel breakage.

There is a big difference. If you are abusing a kernel internal header in a
user space tool, then there is absolutely ZERO excuse for requesting that
the header in question has to be modified.

But yes, the situation is slightly different here because tools which
create trace event magic _HAVE_ to pull in kernel headers. At the same time
these tools depend on a compiler which failed to implement asm_goto for
fricking 8 years.

So while Boris is right, that nothing has to fiddle with a kernel only
header, I grumpily agree with you that we need a workaround in the kernel
for this particular issue.

> could you please ack the patch or better yet take it into tip tree
> and send to Linus asap ?

Nope. The patch is a horrible hack.

Why the heck do we need that extra fugly define? That has exactly zero
value simply because we already have a define which denotes availability of
ASM GOTO: CC_HAVE_ASM_GOTO.

In case of samples/bpf/ and libbcc the compile does not go through the
arch/x86 Makefile which stops the build anyway when ASM_GOTO is
missing. Those builds merely pull in the headers and have their own build
magic, which is broken btw: Changing a kernel header which gets pulled into
the build does not rebuild anything in samples/bpf. Qualitee..

So we can just use CC_HAVE_ASM_GOTO and be done with it.

But we also want the tools which needs this to be aware of this. Peter
requested -D __BPF__ several times which got ignored. It's not too much of
a request to add that.

Find a patch which does exactly this for samples/bpf, but also allows other
tools to build with a warning emitted so they get fixed.

Thanks,

tglx

8<
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -140,6 +140,20 @@ extern void clear_cpu_cap(struct cpuinfo
 
 #define setup_force_cpu_bug(bit) setup_force_cpu_cap(bit)
 
+#ifndef CC_HAVE_ASM_GOTO
+
+/*
+ * Workaround for the sake of BPF compilation which utilizes kernel
+ * headers, but clang does not support ASM GOTO and fails the build.
+ */
+#ifndef __BPF__
+#warning "Compiler lacks ASM_GOTO support. Add -D __BPF__ to your compiler 
arguments"
+#endif
+
+#define static_cpu_has(bit)	boot_cpu_has(bit)
+
+#else
+
 /*
  * Static testing of CPU features.  Used the same as boot_cpu_has().
  * These will statically patch the target code for additional
@@ -195,6 +209,7 @@ static __always_inline __pure bool _stat
boot_cpu_has(bit) : \
_static_cpu_has(bit)\
 )
+#endif
 
 #define cpu_has_bug(c, bit)		cpu_has(c, (bit))
 #define set_cpu_bug(c, bit)		set_cpu_cap(c, (bit))
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -255,7 +255,7 @@ verify_target_bpf: verify_cmds
 $(obj)/%.o: $(src)/%.c
$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
-I$(srctree)/tools/testing/selftests/bpf/ \
-   -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
+   -D__KERNEL__ -D__BPF__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
-Wno-gnu-variable-sized-type-not-at-end \
-Wno-address-of-packed-member -Wno-tautological-compare \


Re: [PATCH] net/mlx5: Fix mlx5_get_vector_affinity function

2018-05-06 Thread Thomas Gleixner
On Sun, 6 May 2018, Thomas Gleixner wrote:
> On Sat, 5 May 2018, Guenter Roeck wrote:
> > > -#ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK
> > > - mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
> > > -#else
> > > - mask = desc->irq_common_data.affinity;
> > > -#endif
> > > - return mask;
> > > + return desc->affinity_hint;
> 
> NAK.
> 
> Nothing in regular device drivers is supposed to ever fiddle with struct
> irq_desc. The existing code is already a violation of that rule and needs
> to be fixed, but not in that way.
> 
> The logic here is completely screwed. affinity_hint is set by the driver,
> so the driver already knows what it is. If the driver does not set it, then
> the thing is NULL.

And this completely insane fiddling with irq_desc is in MLX4 as
well. Dammit, why can't people respect subsystem boundaries and just fiddle
with everything just because they can? If there is something missing at the
core level then please talk to the maintainers instead of hacking utter
crap into your driver.

Yours grumpy

  tglx


Re: [PATCH] net/mlx5: Fix mlx5_get_vector_affinity function

2018-05-06 Thread Thomas Gleixner
On Sat, 5 May 2018, Guenter Roeck wrote:

> On Thu, Apr 12, 2018 at 09:49:11AM +, Israel Rukshin wrote:
> > Adding the vector offset when calling to mlx5_vector2eqn() is wrong.
> > This is because mlx5_vector2eqn() checks if EQ index is equal to vector 
> > number
> > and the fact that the internal completion vectors that mlx5 allocates
> > don't get an EQ index.
> > 
> > The second problem here is that using effective_affinity_mask gives the same
> > CPU for different vectors.
> > This leads to unmapped queues when calling it from blk_mq_rdma_map_queues().
> > This doesn't happen when using affinity_hint mask.
> > 
> Except that affinity_hint is only defined if SMP is enabled. Without:
> 
> include/linux/mlx5/driver.h: In function ‘mlx5_get_vector_affinity_hint’:
> include/linux/mlx5/driver.h:1299:13: error:
> ‘struct irq_desc’ has no member named ‘affinity_hint’
> 
> Note that this is the only use of affinity_hint outside kernel/irq.
> Don't other drivers have similar problems ?

Aside of that.

> >  static inline const struct cpumask *
> > -mlx5_get_vector_affinity(struct mlx5_core_dev *dev, int vector)
> > +mlx5_get_vector_affinity_hint(struct mlx5_core_dev *dev, int vector)
> >  {
> > -   const struct cpumask *mask;
> > struct irq_desc *desc;
> > unsigned int irq;
> > int eqn;
> > int err;
> >  
> > > -   err = mlx5_vector2eqn(dev, MLX5_EQ_VEC_COMP_BASE + vector, &eqn, &irq);
> > > +   err = mlx5_vector2eqn(dev, vector, &eqn, &irq);
> > if (err)
> > return NULL;
> >  
> > desc = irq_to_desc(irq);
> > -#ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK
> > > -   mask = irq_data_get_effective_affinity_mask(&desc->irq_data);
> > -#else
> > -   mask = desc->irq_common_data.affinity;
> > -#endif
> > -   return mask;
> > +   return desc->affinity_hint;

NAK.

Nothing in regular device drivers is supposed to ever fiddle with struct
irq_desc. The existing code is already a violation of that rule and needs
to be fixed, but not in that way.

The logic here is completely screwed. affinity_hint is set by the driver,
so the driver already knows what it is. If the driver does not set it, then
the thing is NULL.

Thanks,

tglx
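
The driver-side alternative implied by that last point is roughly the sketch
below: the driver keeps the mask it handed to irq_set_affinity_hint() in its
own per-vector data and returns its own copy, instead of reaching into struct
irq_desc. Hypothetical code, not the actual mlx5 fix.

/*
 * Hypothetical sketch: per-vector bookkeeping owned by the driver. The
 * driver sets the affinity hint, so it can hand out its own copy later
 * without touching core-internal struct irq_desc.
 */
#include <linux/cpumask.h>
#include <linux/interrupt.h>
#include <linux/gfp.h>

struct drv_irq_info {
	unsigned int	irqn;
	cpumask_var_t	mask;
};

struct drv_priv {
	int			numa_node;
	struct drv_irq_info	*irq_info;	/* one entry per vector */
};

static int drv_setup_vector(struct drv_priv *priv, int vec, unsigned int irqn)
{
	struct drv_irq_info *info = &priv->irq_info[vec];

	if (!zalloc_cpumask_var(&info->mask, GFP_KERNEL))
		return -ENOMEM;

	info->irqn = irqn;
	cpumask_set_cpu(cpumask_local_spread(vec, priv->numa_node), info->mask);
	irq_set_affinity_hint(irqn, info->mask);
	return 0;
}

static const struct cpumask *drv_get_vector_affinity_hint(struct drv_priv *priv,
							   int vec)
{
	/* The driver set the hint itself, so it simply returns its copy. */
	return priv->irq_info[vec].mask;
}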



Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-24 Thread Thomas Gleixner
On Mon, 23 Apr 2018, Jesus Sanchez-Palencia wrote:
> On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> > On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> >> +struct tbs_sched_data {
> >> +  bool sorting;
> >> +  int clockid;
> >> +  int queue;
> >> +  s32 delta; /* in ns */
> >> +  ktime_t last; /* The txtime of the last skb sent to the netdevice. */
> >> +  struct rb_root head;
> > 
> > Hmm. You are reimplementing timerqueue open coded. Have you checked whether
> > you could reuse the timerqueue implementation?
> > 
> > That requires to add a timerqueue node to struct skbuff
> > 
> > @@ -671,7 +671,8 @@ struct sk_buff {
> > unsigned long   dev_scratch;
> > };
> > };
> > -   struct rb_node  rbnode; /* used in netem & tcp stack */
> > +   struct rb_node  rbnode; /* used in netem & tcp stack */
> > +   struct timerqueue_node  tqnode;
> > };
> > struct sock *sk;
> > 
> > Then you can use timerqueue_head in your scheduler data and all the open
> > coded rbtree handling goes away.
> 
> 
> I just noticed that doing the above increases the size of struct sk_buff by 8
> bytes - struct timerqueue_node is 32bytes long while struct rb_node is only
> 24bytes long.
> 
> Given the feedback we got here before against touching struct sk_buff at all 
> for
> non-generic use cases, I will keep the implementation of sch_tbs.c as is, thus
> keeping the open-coded version for now, ok?

The size of sk_buff is 216 and the size of sk_buff_fclones is 440
bytes. The sk_buff and sk_buff_fclones kmem_caches use objects sized 256
and 512 bytes because the kmem_caches are created with SLAB_HWCACHE_ALIGN.

So adding 8 bytes to spare duplicated code will not change the kmem_cache
object size and I really doubt that anyone will notice.

Thanks,

tglx
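
For illustration, the timerqueue based variant suggested above boils down to
something like this (a sketch assuming the tqnode member proposed for struct
sk_buff; not an actual patch):

/*
 * Sketch only: qdisc private data using a timerqueue_head instead of an
 * open coded rbtree. Assumes struct sk_buff grew the 'tqnode' member
 * proposed above.
 */
#include <linux/timerqueue.h>
#include <linux/skbuff.h>

struct tbs_sched_data {
	bool			sorting;
	int			clockid;
	int			queue;
	s32			delta;	/* in ns */
	ktime_t			last;	/* txtime of the last skb sent */
	struct timerqueue_head	head;	/* replaces struct rb_root */
};

static void tbs_insert(struct tbs_sched_data *q, struct sk_buff *skb,
		       ktime_t txtime)
{
	timerqueue_init(&skb->tqnode);
	skb->tqnode.expires = txtime;
	timerqueue_add(&q->head, &skb->tqnode);
}

static struct sk_buff *tbs_peek(struct tbs_sched_data *q)
{
	struct timerqueue_node *node = timerqueue_getnext(&q->head);

	return node ? container_of(node, struct sk_buff, tqnode) : NULL;
}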



Re: [PATCH] bpf, x86_32: add eBPF JIT compiler for ia32 (x86_32)

2018-04-19 Thread Thomas Gleixner
On Wed, 18 Apr 2018, Wang YanQing wrote:
> @@ -0,0 +1,147 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* bpf_jit.S : BPF JIT helper functions

Please do not add these file names to the top level comment. They provide
no value and just become stale when the file gets moved/renamed.

> + *
> + * Copyright (C) 2018 Wang YanQing (udkni...@gmail.com)
> + * Copyright (C) 2011 Eric Dumazet (eric.duma...@gmail.com)
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; version 2
> + * of the License.

You have already the License Identifier. So you don't need the boiler plate
text.

Thanks,

tglx


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-19 Thread Thomas Gleixner
On Wed, 11 Apr 2018, Jesus Sanchez-Palencia wrote:
> On 04/11/2018 01:16 PM, Thomas Gleixner wrote:
> >> So there is a "clockid" that can be used for the full hw offload modes. On 
> >> this
> >> case, the txtimes are in reference to the NIC's PTP clock, and, as 
> >> discussed, we
> >> can't just use a clockid that was computed from the fd pointing to 
> >> /dev/ptpX .
> > 
> > And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
> > yet another clock, right?
> 
> Just breaking this down a bit, yes, TAI is the network time base, and the NICs
> PTP clock use that because PTP is (commonly) based on TAI. After the PHCs have
> been synchronized over the network (e.g. with ptp4l), my understanding is that
> if applications want to use the clockid_t CLOCK_TAI as a network clock 
> reference
> it's required that something (i.e. phc2sys) is synchronizing the PHCs and the
> system clock, and also that something calls adjtime to apply the TAI vs UTC
> offset to CLOCK_TAI.
> 
> If we are fine with those 'dependencies', then I agree there is no need for
> another clock.
> 
> I was thinking about the full offload use-cases, thus when no scheduling is
> happening inside the qdiscs. Applications could just read the time from the 
> PHC
> clocks directly without having to rely on any of the above. On this case,
> userspace would use DYNAMIC_CLOCK just to flag that this is the case, but I 
> must
> admit it's not clear to me how common of a use-case that is, or even if it 
> makes
> sense.

I don't think it makes a lot of sense because the only use case for that is
a full user space scheduler which routes _ALL_ traffic. I don't think
that's something which we want to proliferate.

So I'd rather start off with the CLOCK_TAI assumption and if the need
really arises we can discuss that separately. So you can take a clockid
into account when designing the ABI, but have it CLOCK_TAI only for the
start.

Thanks,

tglx
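
In ABI terms, that amounts to something like the sketch below: the per-socket
SO_TXTIME argument carries a clockid so other clocks can be allowed later,
but only CLOCK_TAI passes validation for the start. Illustrative only, not a
merged interface; the flag names are hypothetical.

/*
 * Illustrative sketch of a per-socket SO_TXTIME argument, as discussed
 * above: keep the clockid field, accept only CLOCK_TAI initially.
 */
#include <linux/types.h>
#include <linux/time.h>
#include <linux/errno.h>

#define SOF_TXTIME_DEADLINE_MODE	(1 << 0)	/* hypothetical flag */
#define SOF_TXTIME_FLAGS_MASK		(SOF_TXTIME_DEADLINE_MODE)

struct sock_txtime {
	clockid_t	clockid;	/* reference clock for the txtime values */
	u32		flags;		/* e.g. deadline vs. explicit mode */
};

static int txtime_validate(const struct sock_txtime *info)
{
	/* CLOCK_TAI only for the start, as discussed above. */
	if (info->clockid != CLOCK_TAI)
		return -EINVAL;
	if (info->flags & ~SOF_TXTIME_FLAGS_MASK)
		return -EINVAL;
	return 0;
}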


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-11 Thread Thomas Gleixner
On Tue, 10 Apr 2018, Jesus Sanchez-Palencia wrote:
> >> This will be provided by tbs if the socket which is transmitting packets is
> >> configured for deadline mode.
> > 
> > You don't want the socket to decide that. The qdisc into which a socket
> > feeds defines the mode and the qdisc rejects requests with the wrong mode.
> > 
> > Making a qdisc do both and letting the user decide what he wants it to be is
> > not really going to fly. Especially if you have different users which want
> > a different mode. It's clearly distinct functionality.
> 
> 
> Ok, so just to make sure I got this right, are you suggesting that both the
> 'tbs' qdisc *and* the socket (i.e. through SO_TXTIME) should have a config
> parameter for specifying the txtime mode? This way if there is a mismatch,
> packets from that socket are rejected by the qdisc.

Correct. The same is true if you try to set SO_TXTIME for something which
is just routing regular traffic.

> (...)
> > 
> >> Another question for this mode (but perhaps that applies to both modes) 
> >> is, what
> >> if the qdisc misses the deadline for *any* reason? I'm assuming it should 
> >> drop
> >> the packet during dequeue.
> > 
> > There the question is how user space is notified about that issue. The
> > application which queued the packet on time does rightfully assume that
> > it's going to be on the wire on time.
> > 
> > This is a violation of the overall scheduling plan, so you need to have
> > a sane design to handle that.
> 
> In addition to the qdisc stats, we could look into using the socket's error
> queue to notify the application about that.

Makes sense.
 
> >> Putting it all together, we end up with:
> >>
> >> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look 
> >> like:
> >> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 15 offload sorting
> > 
> > Why CLOCK_REALTIME? The only interesting time in a TSN network is
> > CLOCK_TAI, really.
> 
> REALTIME was just an example here to show that the qdisc has to be configured
> with a clockid parameter. Are you suggesting that instead both of the new 
> qdiscs
> (i.e. tbs and taprio) should always be using CLOCK_TAI implicitly?

I think so. It's _the_ network time on which everything is based on.

> >> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
> >> either as a txtime or as deadline by tbs (and further the NIC driver for the
> >> offload case): SCM_TXTIME.
> >>
> >> 3) a new socket option: SO_TXTIME. It will be used to enable the feature 
> >> for a
> >> socket, and will have as parameters a clockid and a txtime mode (deadline 
> >> or
> >> explicit), that defines the semantics of the timestamp set on packets using
> >> SCM_TXTIME.
> >>
> >> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .
> > 
> > Can you remind me why we would need that?
> 
> So there is a "clockid" that can be used for the full hw offload modes. On 
> this
> case, the txtimes are in reference to the NIC's PTP clock, and, as discussed, 
> we
> can't just use a clockid that was computed from the fd pointing to /dev/ptpX .

And the NICs PTP clock is CLOCK_TAI, so there should be no reason to have
yet another clock, right?

Thanks,

tglx
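
On the application side, point 2) of the proposal (a per-packet SCM_TXTIME
cmsg) would be used roughly as in the sketch below. SCM_TXTIME does not exist
as a uapi constant at the time of this discussion, so the define here is a
placeholder purely to keep the example self-contained.

/*
 * Userspace sketch of the proposed per-packet txtime cmsg. SCM_TXTIME is
 * a proposed name only; the value below is a placeholder.
 */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#ifndef SCM_TXTIME
#define SCM_TXTIME 61	/* placeholder for the proposed uapi constant */
#endif

static ssize_t send_at(int sock, const void *buf, size_t len, uint64_t txtime_ns)
{
	char control[CMSG_SPACE(sizeof(txtime_ns))] = {};
	struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
	struct msghdr msg = {
		.msg_iov	= &iov,
		.msg_iovlen	= 1,
		.msg_control	= control,
		.msg_controllen	= sizeof(control),
	};
	struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type  = SCM_TXTIME;
	cm->cmsg_len   = CMSG_LEN(sizeof(txtime_ns));
	memcpy(CMSG_DATA(cm), &txtime_ns, sizeof(txtime_ns));

	return sendmsg(sock, &msg, 0);
}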


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-10 Thread Thomas Gleixner
Jesus,

On Mon, 9 Apr 2018, Jesus Sanchez-Palencia wrote:
> On 03/28/2018 12:48 AM, Thomas Gleixner wrote:
> > Yes, you have the root qdisc, which is in charge of the overall scheduling
> > plan, how complex or not it is defined does not matter. It exposes traffic
> > classes which have properties defined by the configuration.
> 
> Perfect. Let's see if we can agree on an overall plan, then. Hopefully I'm not
> missing anything.
> 
> For the above we'll develop a new qdisc, designed along the 'taprio' ideas, 
> thus
> a Qbv style scheduler, to be used as root qdisc. It can run the schedule 
> inside
> the kernel or just offload it to the NIC if supported. Similarly to the other
> multiqueue qdiscs, it will expose the HW Tx queues.
> 
> What is new here from the ideas we shared last year is that this new root 
> qdisc
> will be responsible for calling the attached qdiscs' dequeue functions during
> their timeslices, making it the only entity capable of enqueueing packets into
> the NIC.

Correct. Aside of that it's the entity which is in charge of the overall
scheduling.

> This is the "global scheduler", but we still need the txtime aware
> qdisc. For that, we'll modify tbs to accommodate the feedback from this
> thread. More below.

> > The qdiscs which are attached to those traffic classes can be anything
> > including:
> >
> >  - Simple feed through (Applications are time contraints aware and set the
> >exact schedule). qdisc has admission control.
> 
> This will be provided by the tbs qdisc. It will still provide a txtime sorted
> list and hw offload, but now there will be a per-socket option that tells the
> qdisc if the per-packet timestamp is the txtime (i.e. explicit mode, as you've
> called it) or a deadline. The drop_if_late flag will be removed.
> 
> When in explicit mode, packets from that socket are dequeued from the qdisc
> during its time slice if their [(txtime - delta) < now].
> 
> >
> >  - Deadline aware qdisc to handle e.g. A/V streams. Applications are aware
> >of time constraints and provide the packet deadline. qdisc has admission
> >control. This can be a simple first comes, first served scheduler or
> >something like EDF which allows optimized utilization. The qdisc sets
> >the TX time depending on the deadline and feeds into the root.
> 
> This will be provided by tbs if the socket which is transmitting packets is
> configured for deadline mode.

You don't want the socket to decide that. The qdisc into which a socket
feeds defines the mode and the qdisc rejects requests with the wrong mode.

Making a qdisc do both and letting the user decide what he wants it to be is
not really going to fly. Especially if you have different users which want
a different mode. It's clearly distinct functionality.

Please stop trying to develop Swiss army knives with integrated coffee
machines.

> For the deadline -> txtime conversion, what I have in mind is: when dequeue is
> called tbs will just change the skbuff's timestamp from the deadline to 'now'
> (i.e. as soon as possible) and dequeue the packet. Would that be enough or
> should we use the delta parameter of the qdisc in this case and make [txtime =
> now + delta]? The only benefit of doing so would be to provide a configurable
> 'fudge' factor.

Well, that really depends on how your deadline scheduler works.

> Another question for this mode (but perhaps that applies to both modes) is, 
> what
> if the qdisc misses the deadline for *any* reason? I'm assuming it should drop
> the packet during dequeue.

There the question is how user space is notified about that issue. The
application which queued the packet on time does rightfully assume that
it's going to be on the wire on time.

This is a violation of the overall scheduling plan, so you need to have
a sane design to handle that.

> Putting it all together, we end up with:
> 
> 1) a new txtime aware qdisc, tbs, to be used per queue. Its cli will look 
> like:
> $ tc qdisc add (...) tbs clockid CLOCK_REALTIME delta 15 offload sorting

Why CLOCK_REALTIME? The only interesting time in a TSN network is
CLOCK_TAI, really.

> 2) a new cmsg-interface for setting a per-packet timestamp that will be used
> either as a txtime or as deadline by tbs (and further the NIC driver for the
> offload case): SCM_TXTIME.
> 
> 3) a new socket option: SO_TXTIME. It will be used to enable the feature for a
> socket, and will have as parameters a clockid and a txtime mode (deadline or
> explicit), that defines the semantics of the timestamp set on packets using
> SCM_TXTIME.
> 
> 4) a new #define DYNAMIC_CLOCKID 15 added to include/uapi/linux/time.h .

Can you remind me why we would need that?

> 5) a new schedule-awa

Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-28 Thread Thomas Gleixner
Jesus,

On Tue, 27 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/25/2018 04:46 AM, Thomas Gleixner wrote:
> >   This is missing right now and you want to get that right from the very
> >   beginning. Duct taping it on the interface later on is a bad idea.
> 
> Agreed that this is needed. On the SO_TXTIME + tbs proposal, I believe it's 
> been
> covered by the (per-packet) SCM_DROP_IF_LATE. Do you think we need a different
> mechanism for expressing that?

Uuurgh. No. DROP_IF_LATE is just crap to be honest.

There are two modes:

  1) Send at the given TX time (Explicit mode)

  2) Send before given TX time (Deadline mode)

There is no need to specify 'drop if late' simply because if the message is
handed in past the given TX time, it's too late by definition. What you are
trying to implement is a hybrid of TSN and general purpose (not time aware)
networking in one go. And you do that because your overall design is not
looking at the big picture. You designed from a given use case assumption
and tried to fit other things into it with duct tape.

> >   So you really want a way for the application to query the timing
> >   constraints and perhaps other properties of the channel it connects
> >   to. And you want that now before the first application starts to use the
> >   new ABI. If the application developer does not use it, you still have to
> >   fix the application, but you have to fix it because the developer was a
> >   lazy bastard and not because the design was bad. That's a major
> >   difference.
> 
> Ok, this is something that we have considered in the past, but then the 
> feedback
> here drove us onto a different direction. The overall input we got here was 
> that
> applications would have to be adjusted or that userspace would have to handle
> the coordination between applications somehow (e.g.: a daemon could be 
> developed
> separately to accommodate the fully dynamic use-cases, etc).

The only thing which will happen is that you get applications which require
to control the full interface themselves because they are so important and
the only ones which get it right. Good luck with fixing them up.

That extra daemon if it ever surfaces will be just a PITA. Think about
20kHz control loops. Do you really want queueing, locking, several context
switches and priority configuration nightmares in such a scenario?
Definitely not! You want a fast channel directly to the root qdisc which
takes care of getting it out at the right point, which might be immediate
handover if the adapter supports hw scheduling.

> This is a new requirement for the entire discussion.
> 
> If I'm not missing anything, however, underutilization of the time slots is 
> only
> a problem:
> 
> 1) for the fully dynamic use-cases and;
> 2) because now you are designing applications in terms of time slices, right?

No. It's a general problem. I'm not designing applications in terms of time
slices. Time slices are a fundamental property of TSN. Whether you use them
for explicit scheduling or bandwidth reservation or make them flat does not
matter.

The application does not necessarily need to know about the time
constraints at all. But if it wants to use timed scheduling then it better
does know about them.

> We have not thought of making any of the proposed qdiscs capable of 
> (optionally)
> adjusting the "time slices", but mainly because this is not a problem we had
> here before. Our assumption was that per-port Tx schedules would only be used
> for static systems. In other words, no, we didn't think that re-balancing the
> slots was a requirement, not even for 'taprio'.

Sigh. Utilization is not something entirely new in the network space. I'm
not saying that this needs to be implemented right away, but designing it
in a way which forces underutilization is just wrong.

> > Coming back to the overall scheme. If you start upfront with a time slice
> > manager which is designed to:
> >
> >   - Handle multiple channels
> >
> >   - Expose the time constraints, properties per channel
> >
> > then you can fit all kind of use cases, whether designed by committee or
> > not. You can configure that thing per node or network wide. It does not
> > make a difference. The only difference are the resulting constraints.
> 
>
> Ok, and I believe the above was covered by what we had proposed before, unless
> what you meant by time constraints is beyond the configured port schedule.
>
> Are you suggesting that we'll need to have a kernel entity that is not only
> aware of the current traffic classes 'schedule', but also of the resources 
> that
> are still available for new streams to be accommodated into the classes? 
> Putting
> it differently, is the TAS you envision just an entity that ru

Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-25 Thread Thomas Gleixner
On Fri, 23 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/22/2018 03:52 PM, Thomas Gleixner wrote:
> > So what's the plan for this? Having TAS as a separate entity or TAS feeding
> > into the proposed 'basic' time transmission thing?
> 
> The second one, I guess.

That's just wrong. It won't work. See below.

> Elaborating, the plan is at some point having TAS as a separate entity,
> but which can use tbs for one of its classes (and cbs for another, and
> strict priority for everything else, etc).
>
> Basically, the design would something along the lines of 'taprio'. A root 
> qdisc
> that is both time and priority aware, and capable of running a schedule for 
> the
> port. That schedule can run inside the kernel with hrtimers, or just be
> offloaded into the controller if Qbv is supported on HW.
> 
> Because it would expose the inner traffic classes in a mq / mqprio / prio 
> style,
> then it would allow for other per-queue qdiscs to be attached to it. On a 
> system
> using the i210, for instance, we could then have tbs installed on traffic 
> class
> 0 just dialing hw offload. The Qbv schedule would be running in SW on the TAS
> entity (i.e. 'taprio') which would be setting the packets' txtime before
> dequeueing packets on a fast path -> tbs -> NIC.
> 
> Similarly, other qdisc, like cbs, could be installed if all that traffic class
> requires is traffic shaping once its 'gate' is allowed to execute the selected
> tx algorithm attached to it.
> 
> > I've not yet seen a convincing argument why this low level stuff with all
> > of its weird flavours is superiour over something which reflects the basic
> > operating principle of TSN.
> 
> 
> As you know, not all TSN systems are designed the same. Take AVB systems, for
> example. These not always are running on networks that are aware of any time
> schedule, or at least not quite like what is described by Qbv.
> 
> On those systems there is usually a certain number of streams with different
> priorities that care mostly about having their bandwidth reserved along the
> network. The applications running on such systems are usually based on AVTP,
> thus they already have to calculate and set the "avtp presentation time"
> per-packet themselves. A Qbv scheduler would probably provide very little
> benefits to this domain, IMHO. For "talkers" of these AVB systems, shaping
> traffic using txtime (i.e. tbs) can provide a low-jitter alternative to cbs, 
> for
> instance.

You're looking at it from particular use cases and trying to accommodate
them in the simplest possible way. I don't think that cuts it.

Let's take a step back and look at it from a more general POV without
trying to make it fit to any of the standards first. I'm deliberately NOT
using any of the standard defined terms.

At the (local) network level you have always an explicit plan. This plan
might range from no plan at all to a very elaborate plan which is strict
about when each node is allowed to TX a particular class of packets.

So lets assume we have the following picture:

           [NIC]
             |
   [ Time slice manager ]

Now in the simplest case, the time slice manager has no constraints and
exposes a single input which allows the application to say: "Send my packet
at time X". There is no restriction on 'time X' except if there is a time
collision with an already queued packet or the requested TX time has
already passed. That's close to what you implemented.

  Is the TX timestamp which you defined in the user space ABI a fixed
  scheduling point or is it a deadline?

  That's an important distinction and for this all to work accross various
  use cases you need a way to express that in the ABI. It might be an
  implicit property of the socket/channel to which the application connects
  to but still you want to express it from the application side to do
  proper sanity checking.

  Just think about stuff like audio/video streaming. The point of
  transmission does not have to be fixed if you have some intelligent
  controller at the receiving end which can buffer stuff. The only relevant
  information is the deadline, i.e. the latest point in time where the
  packet needs to go out on the wire in order to keep the stream steady at
  the consumer side. Having the notion of a deadline and that's the only
  thing the provider knows about allows you proper utilization by using an
  appropriate scheduling algorithm like EDF.

  Contrary to that you want very explicit TX points for applications like
  automation control. For this kind of use case there is no wiggle room, it
  has to go out at a fixed time because that's the way control systems
  work.

  This is missing right now and you want to get that right from the very
  beginning. Duct taping it on the interfa

Re: rcu: Add might_sleep() check to synchronize_rcu()

2018-03-23 Thread Thomas Gleixner
On Fri, 23 Mar 2018, Steven Rostedt wrote:
> On Fri, 23 Mar 2018 22:33:29 +0100 (CET)
> [  150.741223]  [] synchronize_rcu+0x27/0x90
> [  150.746908]  [] __l2tp_session_unhash+0x3d5/0x550
> 
> Looks like __l2tp_session_unhash() is the real culprit here.

Yes. I reported that to netdev already.

> [  150.753281]  [] ? __l2tp_session_unhash+0x1bf/0x550
> [  150.759828]  [] ? __local_bh_enable_ip+0x6a/0xd0
> [  150.766123]  [] ? l2tp_udp_encap_recv+0xd90/0xd90
> [  150.772497]  [] l2tp_tunnel_closeall+0x1e7/0x3a0
> [  150.778782]  [] l2tp_tunnel_destruct+0x30e/0x5a0
> [  150.785067]  [] ? l2tp_tunnel_destruct+0x1aa/0x5a0
> [  150.791537]  [] ? l2tp_tunnel_del_work+0x460/0x460
> [  150.797997]  [] __sk_destruct+0x53/0x570
> [  150.803588]  [] rcu_process_callbacks+0x898/0x1300
> [  150.810048]  [] ? rcu_process_callbacks+0x977/0x1300
> [  150.816684]  [] ? __sk_dst_check+0x240/0x240
> [  150.822625]  [] __do_softirq+0x206/0x951
> [  150.828223]  [] irq_exit+0x165/0x190
> [  150.833557]  [] smp_apic_timer_interrupt+0x7b/0xa0
> [  150.840018]  [] apic_timer_interrupt+0xa0/0xb0
> [  150.846132]   [  150.848166]  [] ? 
> native_safe_halt+0x6/0x10
> [  150.854036]  [] ? trace_hardirqs_on+0xd/0x10
> [  150.859973]  [] default_idle+0x55/0x360
> [  150.865478]  [] arch_cpu_idle+0xa/0x10
> 
> I think you want this instead, as __l2tp_session_unhash is what looks
> like might be hiding the call to synchronize_rcu(). It's not called in
> all instances, and I don't think your patch would have triggered the
> issues before hand. You want this:
>
> diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
> index 194a7483bb93..857b494bee29 100644
> --- a/net/l2tp/l2tp_core.c
> +++ b/net/l2tp/l2tp_core.c
> @@ -1677,6 +1677,8 @@ void __l2tp_session_unhash(struct l2tp_session *session)
>  {
>   struct l2tp_tunnel *tunnel = session->tunnel;
>  
> + might_sleep();
> +
>   /* Remove the session from core hashes */
>   if (tunnel) {
>   /* Remove from the per-tunnel hash */

That too :)



Re: syzbot rcu/debugobjects warning

2018-03-23 Thread Thomas Gleixner
On Fri, 23 Mar 2018, Joel Fernandes wrote:
> On Fri, Mar 23, 2018 at 2:11 AM, Thomas Gleixner <t...@linutronix.de> wrote:
> > On Thu, 22 Mar 2018, Joel Fernandes wrote:
> Sorry. Here is the raw crash log: https://pastebin.com/raw/puvh0cXE
> (The kernel logs are toward the end with the above).

And that is interesting:

[  150.629667]   [  150.631700]  [] dump_stack+0xc1/0x128
[  150.637051]  [] ? __debug_object_init+0x526/0xc40
[  150.643431]  [] panic+0x1bc/0x3a8
[  150.648416]  [] ? 
percpu_up_read_preempt_enable.constprop.53+0xd7/0xd7
[  150.656611]  [] ? load_image_and_restore+0xf9/0xf9
[  150.663070]  [] ? vprintk_default+0x1d/0x30
[  150.668925]  [] ? __warn+0x1a9/0x1e0
[  150.674170]  [] ? __debug_object_init+0x526/0xc40
[  150.680543]  [] __warn+0x1c4/0x1e0
[  150.685614]  [] warn_slowpath_null+0x2c/0x40
[  150.691972]  [] __debug_object_init+0x526/0xc40
[  150.698174]  [] ? debug_object_fixup+0x30/0x30
[  150.704283]  [] debug_object_init_on_stack+0x19/0x20
[  150.710917]  [] __wait_rcu_gp+0x93/0x1b0
[  150.716508]  [] synchronize_rcu.part.65+0x101/0x110
[  150.723054]  [] ? rcu_pm_notify+0xc0/0xc0
[  150.728735]  [] ? __call_rcu.constprop.72+0x910/0x910
[  150.735459]  [] ? __lock_is_held+0xa1/0xf0
[  150.741223]  [] synchronize_rcu+0x27/0x90

So this calls synchronize_rcu from a rcu callback. That's a nono. This is
on the back of an interrupt in softirq context and __wait_rcu_gp() can
sleep, which is obviously a bad idea in softirq context.

Cc'ed netdev 

And that also explains the debug object splat because this is not running
on the task stack. It's running on the softirq stack.

[  150.746908]  [] __l2tp_session_unhash+0x3d5/0x550
[  150.753281]  [] ? __l2tp_session_unhash+0x1bf/0x550
[  150.759828]  [] ? __local_bh_enable_ip+0x6a/0xd0
[  150.766123]  [] ? l2tp_udp_encap_recv+0xd90/0xd90
[  150.772497]  [] l2tp_tunnel_closeall+0x1e7/0x3a0
[  150.778782]  [] l2tp_tunnel_destruct+0x30e/0x5a0
[  150.785067]  [] ? l2tp_tunnel_destruct+0x1aa/0x5a0
[  150.791537]  [] ? l2tp_tunnel_del_work+0x460/0x460
[  150.797997]  [] __sk_destruct+0x53/0x570
[  150.803588]  [] rcu_process_callbacks+0x898/0x1300
[  150.810048]  [] ? rcu_process_callbacks+0x977/0x1300
[  150.816684]  [] ? __sk_dst_check+0x240/0x240
[  150.822625]  [] __do_softirq+0x206/0x951
[  150.828223]  [] irq_exit+0x165/0x190
[  150.833557]  [] smp_apic_timer_interrupt+0x7b/0xa0
[  150.840018]  [] apic_timer_interrupt+0xa0/0xb0
[  150.846132]   [  150.848166]  [] ? 
native_safe_halt+0x6/0x10
[  150.854036]  [] ? trace_hardirqs_on+0xd/0x10
[  150.859973]  [] default_idle+0x55/0x360
[  150.865478]  [] arch_cpu_idle+0xa/0x10
[  150.870896]  [] default_idle_call+0x36/0x60
[  150.876751]  [] cpu_startup_entry+0x2b0/0x380
[  150.882787]  [] ? cpu_in_idle+0x20/0x20
[  150.888291]  [] ? clockevents_register_device+0x123/0x200
[  150.895358]  [] start_secondary+0x303/0x3e0
[  150.901209]  [] ? set_cpu_sibling_map+0x11f0/0x11f0

Thanks,

tglx
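
As a generic illustration of the problem (hypothetical code, not the l2tp
functions from the trace): a sleeping primitive like synchronize_rcu() must
never be reached from an RCU callback, because callbacks run from softirq
context; the sleeping part has to be pushed out to process context instead.

/* Hypothetical example of the broken pattern and the usual way out. */
#include <linux/rcupdate.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct obj {
	struct rcu_head		rcu;
	struct work_struct	work;
	/* ... */
};

static void obj_free_rcu(struct rcu_head *head)
{
	struct obj *o = container_of(head, struct obj, rcu);

	synchronize_rcu();	/* WRONG: sleeps in softirq context */
	kfree(o);
}

/* Defer the sleeping part to process context instead: */
static void obj_free_work(struct work_struct *work)
{
	struct obj *o = container_of(work, struct obj, work);

	synchronize_rcu();	/* fine: process context, may sleep */
	kfree(o);
}

static void obj_free_rcu_fixed(struct rcu_head *head)
{
	struct obj *o = container_of(head, struct obj, rcu);

	INIT_WORK(&o->work, obj_free_work);
	schedule_work(&o->work);
}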


Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS

2018-03-23 Thread Thomas Gleixner
On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/21/2018 07:22 AM, Thomas Gleixner wrote:
> > Bah, and probably there you need CLOCK_TAI because that's what PTP is based
> > on, so clock_system needs to accommodate that as well. Dammit, there goes
> > the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
> > bits plus the adapter bit.
> > 
> > Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
> > don't see us adding new fixed clocks, so we really can reserve #15 for
> > selecting the adapter clock if sparing that extra bit is truly required.
> 
> 
> So what about just using the previous single 'clockid' argument, but then just
> adding to uapi time.h something like:
> 
> #define DYNAMIC_CLOCKID 15
> 
> And using it for that, instead. This way applications that will use the raw hw
> offload mode must use this value for their per-socket clockid, and the qdisc's
> clockid would be implicitly initialized to the same value.

That's what I suggested above.

Thanks,

tglx


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-23 Thread Thomas Gleixner
On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/22/2018 03:11 PM, Thomas Gleixner wrote:
> So, are you just opposing to the case where sorting off + offload off is used?
> (i.e. the scheduled FIFO case)

FIFO does not make any sense if your packets have a fixed transmission
time. I have yet to see a reasonable explanation why FIFO in the context of
time-ordered transmission would be a good thing.

Thanks,

tglx


Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-22 Thread Thomas Gleixner
On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
> Our plan was to work directly with the Qbv-like scheduling (per-port) just 
> after
> the cbs qdisc (Qav), but the feedback here and offline was that there were use
> cases for a more simplistic launchtime approach (per-queue) as well. We've
> decided to invest on it first (and postpone the 'taprio' qdisc until there was
> NIC available with HW support for it, basically).

I missed that discussion due to other urgent stuff on my plate. Just
skimmed through it. More below.

> You are right, and we agree, that using tbs for a per-port schedule of any 
> sort
> will require a SW scheduler to be developed on top of it, but we've never said
> the contrary either. Our vision has always been that these are separate
> mechanisms with different use-cases, so we do see the value for the kernel to
> provide both.
> 
> In other words, tbs is not the final solution for Qbv, and we agree that a 
> 'TAS'
> qdisc is still necessary. And due to the wide range of applications and hw 
> being
> used for those out there, we need both specially given that one does not block
> the other.

So what's the plan for this? Having TAS as a separate entity or TAS feeding
into the proposed 'basic' time transmission thing?

The general objection I have with the current approach is that it creates
the playground for all flavours of misdesigned user space implementations
and just replaces the home-brewed and ugly user mode network adapter
drivers.

But that's not helping the cause at all. There is enough crappy stuff out
there already and I rather see a proper designed slice management which can
be utilized and improved by all involved parties.

All variants which utilize the basic time driven packet transmission are
based on periodic explicit plan scheduling with (local) network wide time
slice assignment.

It does not matter whether you feed VLAN traffic into a time slice, where
the VLAN itself does not even have to know about it, or if you have aware
applications feeding packets to a designated timeslot. The basic principle
of this is always the same.

So coming back to last years discussion. It totally went into the wrong
direction because it turned from an approach (the patches) which came from
the big picture to an single use case and application centric view. That's
just wrong and I regret that I didn't have the time to pay attention back
then.

You always need to look at the big picture first and design from there, not
the other way round. There will always be the argument:

But my application is special and needs X

It's easy to fall for that. From a long experience I know that none of
these claims ever held. These arguments are made because the people making
them have either never looked at the big picture or are simply refusing to
do so because it would cause them work.

If you start from the use case and application centric view and ignore the
big picture then you end up in a gazillion of extra magic features over
time which could have been completely avoided if you had put your foot down
and made everyone to agree on a proper and versatile design in the first
place.

The more low level access you hand out in the beginning the less commonly
used, improved and maintained infrastructure you will get in the end. That
has happened before in other areas and it will happen here as well. You
create a user space ABI which you can't get rid of and before you come out
with the proper interface after that a large number of involved parties
have gone off and implemented on top of the low level ABI and they will
never look back.

In the (not so) long run this will create a lot more issues than it
solves. A simple example is that you cannot run two applications which
easily could share the network in parallel without major surgery because
both require to be the management authority.

I've not yet seen a convincing argument why this low level stuff with all
of its weird flavours is superiour over something which reflects the basic
operating principle of TSN.

Thanks,

tglx










Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-22 Thread Thomas Gleixner
On Thu, 22 Mar 2018, Jesus Sanchez-Palencia wrote:
> On 03/21/2018 06:46 AM, Thomas Gleixner wrote:
> > If you look at the use cases of TDM in various fields then FIFO mode is
> > pretty much useless. In industrial/automotive fieldbus applications the
> > various time slices are filled by different threads or even processes.
> > 
> > Sure, the rbtree queue/dequeue has overhead compared to a simple linked
> > list, but you pay for that with more indirections and lots of mostly
> > duplicated code. And in the worst case one of these code pathes is going to
> > be rarely used and prone to bitrot.
> 
> 
> Our initial version (on RFC v2) was performing the sorting for all modes. 
> After
> all the feedback we got we decided to make it optional and provide FIFO modes 
> as
> well. For the SW fallback we need the scheduled FIFO, and for "pure" hw 
> offload
> we need the "raw" FIFO.

I don't see how FIFO ever works without the issue that a newly queued
packet which has an earlier time stamp than the head of the FIFO list will
lose. Why would you even want to have that mode? Just because some weird
existing application misdesign thinks its required? That doesn't make it a
good idea.

With pure hardware offload the packets are immediately handed off to the
network card and that one is responsible for sending it on time. So there
is no FIFO at all. It's actually a bypass mode.

> This was a way to accommodate all the use cases without imposing too much of a
> burden onto anyone, regardless of their application's segment (i.e. 
> industrial,
> pro a/v, automotive, etc).

I'm not buying that argument at all. That's all handwaving.

The whole approach is a burden on every application segment because it
pushes the whole schedule and time slice management out to user space,
which also requires that you route general traffic down to that user space
scheduling entity and then queue it back into the proper time slice. And
FIFO makes that even worse.

> Having the sorting always enabled requires that a valid static clockid is 
> passed
> to the qdisc. For the hw offload mode, that means that the PHC and one of the
> system clocks must be synchronized since hrtimers do not support dynamic 
> clocks.
> Not all systems do that or want to, and given that we do not want to perform
> crosstimestamping between the packets' clock reference and the qdisc's one, 
> the
> only solution for these systems would be using the raw hw offload mode.

There are two variants of hardware offload:

1) Full hardware offload

   That bypasses the queue completely. You just stick the thing into the
   scatter gather buffers. Except when there is no room anymore, then you
   have to queue, but it does not make any difference if you queue in FIFO
   or in time order. The packets go out in time order anyway.

2) Single packet hardware offload

   What you do here is to schedule a hrtimer a bit earlier than the first
   packet tx time and when it fires stick the packet into the hardware and
   rearm the timer for the next one.

   The whole point of TSN with hardware support is that you have:

   - Global network time

   and

   - Frequency adjustment of the system time base

PTP is TAI based and the kernel exposes clock TAI directly through
hrtimers. You don't need dynamic clocks for that.

You can even use clock MONOTONIC as it basically is just

   TAI - offset

If the network card uses anything else than TAI or a time stamp with a
strict correlation to TAI for actual TX scheduling then the whole thing is
broken to begin with.

Thanks,

tglx
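
Variant 2) above would look roughly like this (an illustrative sketch;
hw_queue_packet() stands in for a hypothetical driver hook):

/*
 * Sketch of single packet hardware offload: arm a CLOCK_TAI hrtimer a
 * bit before the packet's TX time and hand the packet to the NIC when
 * it fires.
 */
#include <linux/hrtimer.h>
#include <linux/skbuff.h>

void hw_queue_packet(struct sk_buff *skb, ktime_t txtime);	/* hypothetical */

struct txtime_slot {
	struct hrtimer	timer;
	struct sk_buff	*skb;
	ktime_t		txtime;		/* CLOCK_TAI based */
	s64		lead_ns;	/* fire this much before txtime */
};

static enum hrtimer_restart txtime_fire(struct hrtimer *t)
{
	struct txtime_slot *slot = container_of(t, struct txtime_slot, timer);

	/* Hand the packet to the hardware; it transmits at slot->txtime. */
	hw_queue_packet(slot->skb, slot->txtime);
	return HRTIMER_NORESTART;
}

static void txtime_arm(struct txtime_slot *slot)
{
	hrtimer_init(&slot->timer, CLOCK_TAI, HRTIMER_MODE_ABS);
	slot->timer.function = txtime_fire;
	hrtimer_start(&slot->timer, ktime_sub_ns(slot->txtime, slot->lead_ns),
		      HRTIMER_MODE_ABS);
}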



Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-21 Thread Thomas Gleixner
On Wed, 21 Mar 2018, Thomas Gleixner wrote:
> If you look at the use cases of TDM in various fields then FIFO mode is
> pretty much useless. In industrial/automotive fieldbus applications the
> various time slices are filled by different threads or even processes.

That brings me to a related question. The TDM cases I'm familiar with which
aim to use this utilize multiple periodic time slices, aka 802.1Qbv
time-aware scheduling.

Simple example:

[1a][1b][1c][1d][1a][1b][1c][1d]...
[2a][2b][2c][2d]
[3a][3b]
[4a][4b]
--> t   


where 1-4 is the slice level and a-d are network nodes.

In most cases the slice levels on a node are handled by different
applications or threads. Some of the protocols utilize dedicated time slice
levels - let's assume '4' in the above example - to run general network
traffic which might even be allowed to have collisions, i.e. [4a-d] would
become [4] and any node can send; the involved components like switches are
supposed to handle that.

I'm not seeing how TBS is going to assist with any of that. It requires
everything to be handled at the application level. Not really useful
especially not for general traffic which does not know about the scheduling
bands at all.

If you look at an industrial control node. It basically does:

    queue_first_packet(tx, slice1);
    while (!stop) {
            if (wait_for_packet(rx) == ERROR)
                    goto errorhandling;
            tx = do_computation(rx);
            queue_next_tx(tx, slice1);
    }

that's a pretty common pattern for these kind of applications. For audio
sources queue_next() might be triggered by the input sampler which needs to
be synchronized to the network slices anyway in order to work properly.

TBS per current implementation is nice as a proof of concept, but it solves
just a small portion of the complete problem space. I have the suspicion
that this was 'designed' to replace the user space hack in the AVNU stack
with something close to it. Not really a good plan to be honest.

I think what we really want is a strict periodic scheduler which supports
multiple slices as shown above because thats what all relevant TDM use
cases need: A/V, industrial fieldbusses .

      |-----------------------------------------------------|
      |                                                     |
      |                         TAS                         |<- Config
      |     1            2            3            4        |
      |-----------------------------------------------------|
            |            |            |            |
            |            |            |            |
            |            |            |            |
            |            |            |            |
     [DirectSocket] [Qdisc FIFO] [Qdisc Prio]  [Qdisc FIFO]
            |            |            |
            |            |            |
        [Socket]     [Socket] [General traffic]


The interesting thing here is that it does not require any time stamp
information brought in from the application. That's especially good for
general network traffic which is routed through a dedicated time slot. If
we don't have that then we need a user space scheduler which does exactly
the same thing and we have to route the general traffic out to user space
and back into the kernel, which is obviously a pointless exercise.
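For illustration, a minimal user space sketch of the slice bookkeeping such a
scheduler has to do; the schedule values and all names are made up:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11
#endif

#define NSEC_PER_SEC 1000000000ULL

struct tdm_schedule {
        uint64_t base_ns;       /* start of cycle 0, in CLOCK_TAI time */
        uint64_t cycle_ns;      /* sum of all slice durations */
        uint64_t slice_ns[4];   /* durations of slice levels 1..4 */
};

/* Map 'now' to the slice level which owns it and report the slice end. */
static int current_slice(const struct tdm_schedule *s, uint64_t now_ns,
                         uint64_t *slice_end_ns)
{
        uint64_t off = (now_ns - s->base_ns) % s->cycle_ns;
        uint64_t edge = 0;
        int i;

        for (i = 0; i < 4; i++) {
                edge += s->slice_ns[i];
                if (off < edge) {
                        *slice_end_ns = now_ns - off + edge;
                        return i;
                }
        }
        return -1;      /* unreachable when cycle_ns == sum of the slices */
}

int main(void)
{
        struct tdm_schedule s = {
                .base_ns  = 0,
                .cycle_ns = 1000000,                            /* 1ms cycle */
                .slice_ns = { 250000, 250000, 250000, 250000 }, /* 4 x 250us */
        };
        struct timespec ts;
        uint64_t now, end;
        int slice;

        clock_gettime(CLOCK_TAI, &ts);
        now = (uint64_t)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
        slice = current_slice(&s, now, &end);
        printf("slice %d, next boundary in %llu ns\n",
               slice, (unsigned long long)(end - now));
        return 0;
}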

There are all kind of TDM schemes out there which are not directly driven
by applications, but rather route categorized traffic like VLANs through
dedicated time slices. That works pretty well with the above scheme because
in that case the applications might be completely oblivious about the tx
time schedule.

Surely there are protocols which do not utilize every time slice they could
use, so we need a way to tell the number of empty slices between two
consecutive packets. There are also different policies vs. the unused time
slices, like sending dummy frames or just nothing which wants to be
addressed, but I don't think that changes the general approach.

There might be some special cases for setup or node hotplug, but the
protocols I'm familiar with handle these in dedicated time slices or
through general traffic so it should just fit in.

I'm surely missing some details, but from my knowledge about the protocols
which want to utilize this, the general direction should be fine.

Feel free to tell me that I'm missing the point completely though :)

Thoughts?

Thanks,

tglx







Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS

2018-03-21 Thread Thomas Gleixner
On Wed, 21 Mar 2018, Richard Cochran wrote:

> On Wed, Mar 21, 2018 at 03:22:11PM +0100, Thomas Gleixner wrote:
> > Which clockid will be handed in from the application? The network adapter
> > time has no fixed clockid. The only way you can get to it is via a fd based
> > posix clock and that does not work at all because the qdisc setup might
> > have a different FD than the application which queues packets.
> 
> Duh.  That explains it.  Please ignore my "why not?" Q in the other thread...

:)

So in that case you are either bound to rely on the application to use the
proper dynamic clock or if we need a sanity check, then you need a cookie
of some form which can be retrieved from the posix clock file descriptor
and handed in as 'clockid' together with clock_adapter = true.

That's doable, but that needs a bit more trickery. A simple unique ID per
dynamic posix-clock would be trivial to add, but that would not give you
any form of verification whether this ID actually belongs to the network
adapter or not.

So either you ignore the clockid and rely on the application not being
stupid when it says "clock_adpater = true" or you need some extra
complexity to build an association of a "clockid" to a network adapter.

There is a connection already, via

 adapter->ptp_clock->devid

which is MKDEV(major, index) which is accessible at least at the network
driver level, but probably not from networking core. So you'd need to drill
a few more holes by adding yet another callback to net_device_ops.
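Just to illustrate the shape of such a hole (nothing below exists today; the
callback name and the cookie check are hypothetical):

#include <linux/netdevice.h>

/*
 * Hypothetical only: let the driver report the devid of its PHC so that the
 * qdisc offload setup could verify a clock cookie handed in from user space.
 */
struct net_device_ops_sketch {
        dev_t (*ndo_get_phc_devid)(struct net_device *dev);
};

static bool clock_cookie_matches(const struct net_device_ops_sketch *ops,
                                 struct net_device *dev, u32 cookie)
{
        if (!ops->ndo_get_phc_devid)
                return false;

        return cookie == (u32)ops->ndo_get_phc_devid(dev);
}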

I'm not sure if it's worth the trouble. If the application hands in bogus
timestamps, packets go out at the wrong time or are dropped. That's true
whether it uses the proper clock or not. So nothing the kernel should
really worry about.

For clock_system - REAL/MONO/TAI(sigh) - you surely need a sanity check,
but that is independent of the underlying network adapater even in the
qdisc assisted HW offload case.

Thanks,

tglx








Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params

2018-03-21 Thread Thomas Gleixner
On Wed, 21 Mar 2018, Richard Cochran wrote:

@Intel: I removed intel-wired-lan@ as I have absolutely zero interest in
the moderation spam from that list. Can you please either get rid
of this moderation nonsense or stop CC'ing that list when posting
to lkml/netdev?

> On Wed, Mar 21, 2018 at 01:58:51PM +0100, Thomas Gleixner wrote:
> > Errm. No. There is no way to support fd based clocks or one of the CPU
> > time/process time based clocks for this.
> 
> Why not?
>  
> If the we have HW offloading, then the transmit time had better be
> expressed in terms of the MAC's internal clock.  Otherwise we would
> need to translate between a kernel clock and the MAC clock, but that
> is expensive (eg over PCIe) and silly (because in a typical use case
> the MAC will already be synchronized to the network time).

Sure, but you CANNOT use a clockid for that because there is NONE.

The mac clock is exposed via a dynamic posix clock and can only be
referenced via a file descriptor.

The qdisc setup does fd = open(...) and hands that in as clockid. Later the
application does fd = open(...) and uses that as clockid for tagging the
messages.

What the heck guarantees that both the setup and the application will get
the same fd number?

Exactly nothing. So any attempt to use the filedescriptor number as clockid
is broken by definition.
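For reference, this is how user space addresses a dynamic posix clock: the
clockid is derived from the process local file descriptor, which is exactly
why two openers end up with different values. A minimal, self-contained
example (/dev/ptp0 is just an example path):

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define CLOCKFD                 3
#define FD_TO_CLOCKID(fd)       ((~(clockid_t)(fd) << 3) | CLOCKFD)

int main(void)
{
        struct timespec ts;
        int fd = open("/dev/ptp0", O_RDONLY);

        if (fd < 0)
                return 1;

        /*
         * The clockid depends on this process's fd number. A qdisc setup
         * tool and the application queueing packets will in general not
         * get the same number, so the value cannot serve as a shared id.
         */
        if (clock_gettime(FD_TO_CLOCKID(fd), &ts) == 0)
                printf("PHC time: %lld.%09ld\n",
                       (long long)ts.tv_sec, ts.tv_nsec);

        close(fd);
        return 0;
}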

Thanks,

tglx


Re: [RFC v3 net-next 14/18] net/sched: Add HW offloading capability to TBS

2018-03-21 Thread Thomas Gleixner
On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload
> 
> In this example, the Qdisc will use HW offload for the control of the
> transmission time through the network adapter. It's assumed the timestamp
> in skbuffs are in reference to the interface's PHC and setting any other
> valid clockid would be treated as an error. Because there is no
> scheduling being performed in the qdisc, setting a delta != 0 would also
> be considered an error.

Which clockid will be handed in from the application? The network adapter
time has no fixed clockid. The only way you can get to it is via a fd based
posix clock and that does not work at all because the qdisc setup might
have a different FD than the application which queues packets.

I think this should look like this:

clock_adapter:  1 = clock of the network adapter
0 = system clock selected by clock_system

clock_system:   0 = CLOCK_REALTIME
1 = CLOCK_MONOTONIC

or something like that.

> Example 2:
> 
> $ tc qdisc replace dev enp2s0 parent root handle 100 mqprio num_tc 3 \
>map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
> 
> $ tc qdisc add dev enp2s0 parent 100:1 tbs offload delta 10 \
>  clockid CLOCK_REALTIME sorting
> 
> Here, the Qdisc will use HW offload for the txtime control again,
> but now sorting will be enabled, and thus there will be scheduling being
> performed by the qdisc. That is done based on the clockid CLOCK_REALTIME
> reference and packets leave the Qdisc "delta" (10) nanoseconds before
> their transmission time. Because this will be using HW offload and
> since dynamic clocks are not supported by the hrtimer, the system clock
> and the PHC clock must be synchronized for this mode to behave as expected.

So what you do here is queueing the packets in the qdisk and then schedule
them at some point ahead of actual transmission time for delivery to the
hardware. That delivery uses the same txtime as used for qdisc scheduling
to tell the hardware when the packet should go on the wire. That's needed
when the network adapter does not support queueing of multiple packets.

Bah, and probably there you need CLOCK_TAI because that's what PTP is based
on, so clock_system needs to accommodate that as well. Dammit, there goes
the simple 2 bits implementation. CLOCK_TAI is 11, so we'd need 4 clock
bits plus the adapter bit.

Though we could spare a bit. The fixed CLOCK_* space goes from 0 to 15. I
don't see us adding new fixed clocks, so we really can reserve #15 for
selecting the adapter clock if sparing that extra bit is truly required.
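Spelled out as bits, purely as a sketch and not a settled ABI:

/*
 * Illustration only: 4 bits select one of the fixed system clocks (0..14),
 * the reserved value 15 means "use the clock of the network adapter".
 */
#define TXTIME_CLK_MASK         0x0f
#define TXTIME_CLK_ADAPTER      0x0f    /* reserved: the NIC's own clock */

struct txtime_ctl_sketch {
        unsigned int clk          : 4;  /* CLOCK_REALTIME/MONOTONIC/TAI or 15 */
        unsigned int drop_if_late : 1;
};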

Thanks,

tglx




Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-03-21 Thread Thomas Gleixner
On Tue, 6 Mar 2018, Jesus Sanchez-Palencia wrote:
> +struct tbs_sched_data {
> + bool sorting;
> + int clockid;
> + int queue;
> + s32 delta; /* in ns */
> + ktime_t last; /* The txtime of the last skb sent to the netdevice. */
> + struct rb_root head;

Hmm. You are reimplementing timerqueue open coded. Have you checked whether
you could reuse the timerqueue implementation?

That requires to add a timerqueue node to struct skbuff

@@ -671,7 +671,8 @@ struct sk_buff {
unsigned long   dev_scratch;
};
};
-   struct rb_node  rbnode; /* used in netem & tcp stack */
+   struct rb_node  rbnode; /* used in netem & tcp stack */
+   struct timerqueue_node  tqnode;
};
struct sock *sk;

Then you can use timerqueue_head in your scheduler data and all the open
coded rbtree handling goes away.
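A minimal sketch of what enqueue/peek/erase then look like, assuming the
tqnode member above and a timerqueue_head in the qdisc private data (the
names are illustrative):

#include <linux/ktime.h>
#include <linux/skbuff.h>
#include <linux/timerqueue.h>

/* tqhead is initialized once with timerqueue_init_head() */
struct tbs_sched_data_sketch {
        struct timerqueue_head tqhead;
        /* ... */
};

static void tbs_tq_enqueue(struct tbs_sched_data_sketch *q,
                           struct sk_buff *skb, ktime_t txtime)
{
        skb->tqnode.expires = txtime;           /* sort key */
        timerqueue_add(&q->tqhead, &skb->tqnode);
}

static struct sk_buff *tbs_tq_peek(struct tbs_sched_data_sketch *q)
{
        struct timerqueue_node *node = timerqueue_getnext(&q->tqhead);

        /* counterpart of rb_to_skb() for the tqnode member */
        return node ? container_of(node, struct sk_buff, tqnode) : NULL;
}

static void tbs_tq_erase(struct tbs_sched_data_sketch *q, struct sk_buff *skb)
{
        timerqueue_del(&q->tqhead, &skb->tqnode);
}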

> +static bool is_packet_valid(struct Qdisc *sch, struct sk_buff *nskb)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> + ktime_t txtime = nskb->tstamp;
> + struct sock *sk = nskb->sk;
> + ktime_t now;
> +
> + if (sk && !sock_flag(sk, SOCK_TXTIME))
> + return false;
> +
> + /* We don't perform crosstimestamping.
> +  * Drop if packet's clockid differs from qdisc's.
> +  */
> + if (nskb->txtime_clockid != q->clockid)
> + return false;
> +
> + now = get_time_by_clockid(q->clockid);

If you store the time getter function pointer in tbs_sched_data then you
avoid the lookup and just can do

   now = q->get_time();

That applies to lots of other places.
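I.e. cache the getter once at init time; a sketch, assuming only the fixed
system clocks are accepted:

#include <linux/errno.h>
#include <linux/ktime.h>
#include <linux/time.h>

struct tbs_clock_sketch {
        ktime_t (*get_time)(void);
};

static int tbs_set_clock(struct tbs_clock_sketch *q, clockid_t clockid)
{
        switch (clockid) {
        case CLOCK_REALTIME:    q->get_time = ktime_get_real;           break;
        case CLOCK_MONOTONIC:   q->get_time = ktime_get;                break;
        case CLOCK_TAI:         q->get_time = ktime_get_clocktai;       break;
        default:                return -EINVAL;
        }
        return 0;
}

/* The hot path then simply does: now = q->get_time(); */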

> + if (ktime_before(txtime, now) || ktime_before(txtime, q->last))
> + return false;
> +
> + return true;
> +}
> +
> +static struct sk_buff *tbs_peek(struct Qdisc *sch)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> +
> + return q->peek(sch);
> +}
> +
> +static struct sk_buff *tbs_peek_timesortedlist(struct Qdisc *sch)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> + struct rb_node *p;
> +
> + p = rb_first(&q->head);

timerqueue gives you direct access to the first expiring entry w/o walking
the rbtree. So that would become:

	p = timerqueue_getnext(&q->tqhead);
	return p ? rb_to_skb(p) : NULL;

> + if (!p)
> + return NULL;
> +
> + return rb_to_skb(p);
> +}

> +static int tbs_enqueue_timesortedlist(struct sk_buff *nskb, struct Qdisc 
> *sch,
> +   struct sk_buff **to_free)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> + struct rb_node **p = &q->head.rb_node, *parent = NULL;
> + ktime_t txtime = nskb->tstamp;
> +
> + if (!is_packet_valid(sch, nskb))
> + return qdisc_drop(nskb, sch, to_free);
> +
> + while (*p) {
> + struct sk_buff *skb;
> +
> + parent = *p;
> + skb = rb_to_skb(parent);
> + if (ktime_after(txtime, skb->tstamp))
> + p = &parent->rb_right;
> + else
> + p = &parent->rb_left;
> + }
> + rb_link_node(&nskb->rbnode, parent, p);
> + rb_insert_color(&nskb->rbnode, &q->head);

That'd become:

   nskb->tqnode.expires = txtime;
   timerqueue_add(&q->tqhead, &nskb->tqnode);

> + qdisc_qstats_backlog_inc(sch, nskb);
> + sch->q.qlen++;
> +
> + /* Now we may need to re-arm the qdisc watchdog for the next packet. */
> + reset_watchdog(sch);
> +
> + return NET_XMIT_SUCCESS;
> +}
> +
> +static void timesortedlist_erase(struct Qdisc *sch, struct sk_buff *skb,
> +  bool drop)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> +
> + rb_erase(&skb->rbnode, &q->head);
> +
> + qdisc_qstats_backlog_dec(sch, skb);
> +
> + if (drop) {
> + struct sk_buff *to_free = NULL;
> +
> + qdisc_drop(skb, sch, &to_free);
> + kfree_skb_list(to_free);
> + qdisc_qstats_overlimit(sch);
> + } else {
> + qdisc_bstats_update(sch, skb);
> +
> + q->last = skb->tstamp;
> + }
> +
> + sch->q.qlen--;
> +
> + /* The rbnode field in the skb re-uses these fields, now that
> +  * we are done with the rbnode, reset them.
> +  */
> + skb->next = NULL;
> + skb->prev = NULL;
> + skb->dev = qdisc_dev(sch);
> +}
> +
> +static struct sk_buff *tbs_dequeue(struct Qdisc *sch)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> +
> + return q->dequeue(sch);
> +}
> +
> +static struct sk_buff *tbs_dequeue_scheduledfifo(struct Qdisc *sch)
> +{
> + struct tbs_sched_data *q = qdisc_priv(sch);
> + struct sk_buff *skb = tbs_peek(sch);
> + ktime_t now, next;
> +
> + if (!skb)
> + return NULL;
> +
> + now = get_time_by_clockid(q->clockid);
> +
> + /* Drop if packet has expired while in queue and the drop_if_late
> +  * flag is set.
> +  */
> + if 

Re: [RFC v3 net-next 08/18] net: SO_TXTIME: Add clockid and drop_if_late params

2018-03-21 Thread Thomas Gleixner
On Tue, 6 Mar 2018, Richard Cochran wrote:

> On Tue, Mar 06, 2018 at 06:53:29PM -0800, Eric Dumazet wrote:
> > This is adding 32+1 bits to sk_buff, and possibly holes in this very
> > very hot (and already too fat) structure.
> > 
> > Do we really need 32 bits for a clockid_t ?
> 
> Probably we can live with fewer bits.
> 
> For clock IDs with a positive sign, the max possible clock value is 16.
> 
> For clock IDs with a negative sign, IIRC, three bits are for the type
> code (we have also posix timers packed like this) and the are for the
> file descriptor.  So maybe we could use 16 bits, allowing 12 bits or
> so for encoding the FD.
> 
> The downside would be that this forces the application to make sure
> and open the dynamic posix clock early enough before the FD count gets
> too high.

Errm. No. There is no way to support fd based clocks or one of the CPU
time/process time based clocks for this.

CLOCK_REALTIME and CLOCK_MONOTONIC are probably the only interesting
ones. BOOTTIME is hopefully soon irrelevant as we make MONOTONIC and
BOOTTIME the same unless this causes unexpectedly a major issues. I don't
think that CLOCK_TAI makes sense in that context, but I might be wrong.

The rest of the CLOCK_* space cannot be used at all.

So you need at max 2 bits for this, but I think 1 is good enough.

Thanks,

tglx









Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-20 Thread Thomas Gleixner
On Tue, 20 Mar 2018, Ingo Molnar wrote:
> * Thomas Gleixner <t...@linutronix.de> wrote:
> 
> > > So I do think we could do more in this area to improve driver 
> > > performance, if the 
> > > code is correct and if there's actual benchmarks that are showing real 
> > > benefits.
> > 
> > If it's about hotpath performance I'm all for it, but the use case here is
> > a debug facility...
> > 
> > And if we go down that road then we want a AVX based memcpy()
> > implementation which is runtime conditional on the feature bit(s) and
> > length dependent. Just slapping a readqq() at it and use it in a loop does
> > not make any sense.
> 
> Yeah, so generic memcpy() replacement is only feasible I think if the most 
> optimistic implementation is actually correct:
> 
>  - if no preempt disable()/enable() is required
> 
>  - if direct access to the AVX[2] registers does not disturb legacy FPU state 
> in 
>any fashion
> 
>  - if direct access to the AVX[2] registers cannot raise weird exceptions or 
> have
>weird behavior if the FPU control word is modified to non-standard values 
> by 
>untrusted user-space
> 
> If we have to touch the FPU tag or control words then it's probably only good 
> for 
> a specialized API.

I did not mean to have a general memcpy replacement. Rather something like
magic_memcpy() which falls back to memcpy when AVX is not usable or the
length does not justify the AVX stuff at all.
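Roughly this shape, as a sketch only; the threshold, the naming and the
unrolled copy are arbitrary and it obviously needs numbers to justify it:

#include <linux/string.h>
#include <asm/cpufeature.h>
#include <asm/fpu/api.h>

#define MAGIC_MEMCPY_MIN        512     /* arbitrary cut-off, needs measuring */

static void *magic_memcpy(void *dst, const void *src, size_t len)
{
        size_t done = 0;

        /* Fall back when AVX cannot be used in this context or when the
         * length does not amortize kernel_fpu_begin()/end().
         */
        if (len < MAGIC_MEMCPY_MIN || !boot_cpu_has(X86_FEATURE_AVX) ||
            !irq_fpu_usable())
                return memcpy(dst, src, len);

        kernel_fpu_begin();
        /* ymm0 is free to clobber between kernel_fpu_begin()/end() */
        for (; done + 32 <= len; done += 32)
                asm volatile("vmovdqu %1, %%ymm0\n\t"
                             "vmovdqu %%ymm0, %0"
                             : "=m" (*(unsigned char (*)[32])(dst + done))
                             : "m" (*(const unsigned char (*)[32])(src + done)));
        kernel_fpu_end();

        /* Tail in plain C */
        if (done < len)
                memcpy(dst + done, src + done, len - done);
        return dst;
}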

Thanks,

tglx


Re: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-20 Thread Thomas Gleixner
On Tue, 20 Mar 2018, Ingo Molnar wrote:
> * Thomas Gleixner <t...@linutronix.de> wrote:
> 
> > > Useful also for code that needs AVX-like registers to do things like CRCs.
> > 
> > x86/crypto/ has a lot of AVX optimized code.
> 
> Yeah, that's true, but the crypto code is processing fundamentally bigger 
> blocks 
> of data, which amortizes the cost of using kernel_fpu_begin()/_end().

Correct.

> So assuming the target driver will only load on modern FPUs I *think* it 
> should 
> actually be possible to do something like (pseudocode):
> 
>   vmovdqa %ymm0, 40(%rsp)
>   vmovdqa %ymm1, 80(%rsp)
> 
>   ...
>   # use ymm0 and ymm1
>   ...
> 
>   vmovdqa 80(%rsp), %ymm1
>   vmovdqa 40(%rsp), %ymm0
> 
> ... without using the heavy XSAVE/XRSTOR instructions.
> 
> Note that preemption probably still needs to be disabled and possibly there 
> are 
> other details as well, but there should be no 'heavy' FPU operations.

Emphasis on should :)

> I think this should still preserve all user-space FPU state and shouldn't 
> muck up 
> any 'weird' user-space FPU state (such as pending exceptions, legacy x87 
> running 
> code, NaN registers or weird FPU control word settings) we might have 
> interrupted 
> either.
> 
> But I could be wrong, it should be checked whether this sequence is safe. 
> Worst-case we might have to save/restore the FPU control and tag words - but 
> those 
> operations should still be much faster than a full XSAVE/XRSTOR pair.

Fair enough.

> So I do think we could do more in this area to improve driver performance, if 
> the 
> code is correct and if there's actual benchmarks that are showing real 
> benefits.

If it's about hotpath performance I'm all for it, but the use case here is
a debug facility...

And if we go down that road then we want a AVX based memcpy()
implementation which is runtime conditional on the feature bit(s) and
length dependent. Just slapping a readqq() at it and use it in a loop does
not make any sense.

Thanks,

tglx


RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-19 Thread Thomas Gleixner
On Mon, 19 Mar 2018, David Laight wrote:
> From: Thomas Gleixner
> > Sent: 19 March 2018 15:05
> > 
> > On Mon, 19 Mar 2018, David Laight wrote:
> > > From: Rahul Lakkireddy
> > > In principle it ought to be possible to get access to one or two
> > > (eg) AVX registers by saving them to stack and telling the fpu
> > > save code where you've put them.
> > 
> > No. We have functions for this and we are not adding new ad hoc magic.
> 
> I was thinking that a real API might do this...

We have a real API and that's good enough for the stuff we have using AVX
in the kernel.

> Useful also for code that needs AVX-like registers to do things like CRCs.

x86/crypto/ has a lot of AVX optimized code.

> > > OTOH, for x86, if the code always runs in process context (eg from a
> > > system call) then, since the ABI defines them all as caller-saved
> > > the AVX(2) registers, it is only necessary to ensure that the current
> > > FPU registers belong to the current process once.
> > > The registers can be set to zero by an 'invalidate' instruction on
> > > system call entry (hope this is done) and after use.
> > 
> > Why would a system call touch the FPU registers? The kernel normally does
> > not use FPU instructions and the code which explicitely does has to take
> > care of save/restore. It would be performance madness to fiddle with the
> > FPU stuff unconditionally if nothing uses it.
> 
> If system call entry reset the AVX registers then any FP save/restore
> would be faster because the AVX registers wouldn't need to be saved
> (and the cpu won't save them).
> I believe the instruction to reset the AVX registers is fast.
> The AVX registers only ever need saving if the process enters the
> kernel through an interrupt.

Wrong. The x86_64 ABI clearly states:

   Linux Kernel code is not allowed to change the x87 and SSE units. If
   those are changed by kernel code, they have to be restored properly
   before sleeping or leaving the kernel.

That means the syscall interface relies on FPU state being not changed by
the kernel. So if you want to clear AVX on syscall entry you need to save
it first and then restore before returning. That would be a huge
performance hit.

Thanks,

tglx


RE: [RFC PATCH 0/3] kernel: add support for 256-bit IO access

2018-03-19 Thread Thomas Gleixner
On Mon, 19 Mar 2018, David Laight wrote:
> From: Rahul Lakkireddy
> In principle it ought to be possible to get access to one or two
> (eg) AVX registers by saving them to stack and telling the fpu
> save code where you've put them.

No. We have functions for this and we are not adding new ad hoc magic.

> OTOH, for x86, if the code always runs in process context (eg from a
> system call) then, since the ABI defines them all as caller-saved
> the AVX(2) registers, it is only necessary to ensure that the current
> FPU registers belong to the current process once.
> The registers can be set to zero by an 'invalidate' instruction on
> system call entry (hope this is done) and after use.

Why would a system call touch the FPU registers? The kernel normally does
not use FPU instructions and the code which explicitely does has to take
care of save/restore. It would be performance madness to fiddle with the
FPU stuff unconditionally if nothing uses it.

Thanks,

tglx


Re: [RFC PATCH 2/3] x86/io: implement 256-bit IO read and write

2018-03-19 Thread Thomas Gleixner
On Mon, 19 Mar 2018, Rahul Lakkireddy wrote:

> Use VMOVDQU AVX CPU instruction when available to do 256-bit
> IO read and write.

That's not what the patch does. See below.

> Signed-off-by: Rahul Lakkireddy 
> Signed-off-by: Ganesh Goudar 

That Signed-off-by chain is wrong

> +#ifdef CONFIG_AS_AVX
> +#include 
> +
> +static inline u256 __readqq(const volatile void __iomem *addr)
> +{
> + u256 ret;
> +
> + kernel_fpu_begin();
> + asm volatile("vmovdqu %0, %%ymm0" :
> +  : "m" (*(volatile u256 __force *)addr));
> + asm volatile("vmovdqu %%ymm0, %0" : "=m" (ret));
> + kernel_fpu_end();
> + return ret;

You _cannot_ assume that the instruction is available just because
CONFIG_AS_AVX is set. The availability is determined by the runtime
evaluated CPU feature flags, i.e. X86_FEATURE_AVX.

Aside of that I very much doubt that this is faster than 4 consecutive
64bit reads/writes as you have the full overhead of
kernel_fpu_begin()/end() for each access.

You did not provide any numbers for this so it's even harder to
determine.

As far as I can tell the code where you are using this is a debug
facility. What's the point? Debug is hardly a performance critical problem.

Thanks,

tglx






Re: [PATCH v2 1/3] vsprintf: Remove accidental VLA usage

2018-03-08 Thread Thomas Gleixner
On Thu, 8 Mar 2018, Rasmus Villemoes wrote:
> On 2018-03-08 04:30, Kees Cook wrote:
> >  /**
> > + * SIMPLE_MAX - return maximum of two values without any type checking
> > + * @x: first value
> > + * @y: second value
> > + *
> > + * This should only be used in stack array sizes, since the type-checking
> > + * from max() confuses the compiler into thinking a VLA is being used.
> > + */
> > +#define SIMPLE_MAX(x, y)   ((size_t)(x) > (size_t)(y) ? (size_t)(x) \
> > +  : (size_t)(y))
> 
> This will be abused at some point, leading to the usual double
> evaluation etc. etc. problems. The name is also too long (and in general
> we should avoid adjectives like "simple", "safe", people reading the
> code won't know what is simple or safe about it). I think this should work
> 
> #define MAX(x, y) (__builtin_choose_expr((x) > (y), x, y))
> 
> That forces (x)>(y) to be a compile-time constant, so x and y must also
> be; hence there can be no side effects. The MIN version of this could
> replace the custom __const_min in fs/file.c, and probably other places
> as well.
> 
> I tested that this at least works in the vsprintf case, -Wvla no longer
> complains. fs/file.c also compiles with the MIN version of this.
> 
> I suppose MIN and MAX will collide with other uses in the tree. Hmm.

Make it CONST_MAX() or something like that which makes it entirely clear.
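I.e. something like this, shown as a stand-alone user space test so the
-Wvla behaviour is easy to check (a sketch, not a patch):

#include <stdio.h>

/*
 * Compile-time max: __builtin_choose_expr() requires a constant condition,
 * so there is no double evaluation hazard and the array below is a fixed
 * size array rather than a VLA.
 */
#define CONST_MAX(x, y) (__builtin_choose_expr((x) > (y), (x), (y)))

int main(void)
{
        char buf[CONST_MAX(sizeof(long long), 10)];

        printf("%zu\n", sizeof(buf));
        return 0;
}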

Thanks,

tglx


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sun, 7 Jan 2018, Linus Torvalds wrote:
> We need to fix the security problem, but we need to do it *without*
> these braindead arguments that performance is somehow secondary.

I surely agree, but we have gone the way of PTI without the ability of
exempting individual processes exactly for one reason:

  Lack of time

It can be done on top of the PTI implementation and it won't take ages.

For spectre_v1/2 we face the same problem simply because we got informed so
much ahead of time and we were all twiddling thumbs, enjoying our christmas
vacation and having a good time.

The exploits are out in the wild and they are observed already, so we
really have to make a decision whether we want to fix that in the most
obvious ways even if it hurts performance right now and then take a break
from all that hell and sit down and sort the performance mess or whether we
want to discuss the right way to fix it for the next 3 month and leave all
doors open until the last bit of performance is squeezed out.

I totally can understand the anger of those who learned all of this a few
days ago and are now grasping straws to avoid the obviously coming
performance hit, but its not our fault that we have to make a decision
which cannot make everyone happy tomorrow.

If the general sentiment is that we have to fix the performance issue
first, then please let me know and I take 3 weeks of vacation right now.

Thanks,

tglx


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sun, 7 Jan 2018, James Bottomley wrote:

> On Sat, 2018-01-06 at 20:36 -0500, David Miller wrote:
> > From: Willy Tarreau 
> > Date: Sat, 6 Jan 2018 21:42:29 +0100
> > 
> > > On Sat, Jan 06, 2018 at 06:38:59PM +, Alan Cox wrote:
> > >> Normally people who propose security fixes don't have to argue
> > about the
> > >> fact they added 30 clocks to avoid your box being 0wned.
> > > 
> > > In fact it depends, because if a fix makes the system unusable for
> > its
> > > initial purpose, this fix will simply not be deployed at all, which
> > is
> > > the worst that can happen.
> > 
> > +1
> > 
> > I completely agree with Willy and Alexei.
> > 
> > And the scale isn't even accurate, we're talking about at least
> > hundreds upon hundreds of clocks, not 30, if we add an operation
> > whose side effect is to wait for all pending loads to complete.  So
> > yeah this is going to be heavily scrutinized.
> 
> Plus this is the standard kernel code review MO: we've never blindly
> accepted code just because *security* (otherwise we'd have grsec in by
> now).  We use the pushback to get better and more performant code.
>  What often happens is it turns out that the "either security or
> performance" position was a false dichotomy and there is a way of
> fixing stuff that's acceptable (although not usually perfect) for
> everyone.  I'm not saying this always happens, but it is reasonable to
> let the iterative pushback see if we can get to better code in this
> case rather than trying to cut it of with the "because *security*"
> argument.

In principle I agree, though there are a few things to consider:

1) We do not have the time to discuss that for the next 6 months

2) Alexei's analysis is purely based on the public information from the
   Google Zero folks. If it were complete and the only attack vector, all fine.

   If not, and I doubt it is, we're going to regret this decision faster
   than we made it, and this is not the kind of playing field where we can
   afford that.

The whole 'we know better and performance is key' attitude is what led to
this disaster in the first place. We should finally start to learn.

Can we please stop that and live with the extra cycles for a few months up
to the point where we get more informed answers to all these questions?

Thanks,

tglx


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-07 Thread Thomas Gleixner
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> which clearly states that bpf_tail_call() was used in the attack.
> Yet none of the intel nor arm patches address speculation in
> this bpf helper!
> It means that:
> - gpz didn't share neither exploit nor the detailed description
>   of the POC with cpu vendors until now
> - coverity rules used to find all these places in the kernel
>   failed to find bpf_tail_call
> - cpu vendors were speculating what variant 1 can actually do

You forgot to mention that there might be other attacks than the public POC
which are not covered by a simple AND.

Thanks,

tglx


Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Thomas Gleixner
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:
> So how about we do array_access() macro similar to above by default
> with extra CONFIG_ to convert it to lfence ?
> Why default to AND approach instead of lfence ?
> Because the kernel should still be usable. If security
> sacrifices performance so much such security will be turned off.
> Ex: kpti suppose to add 5-30%. If it means 10% on production workload
> and the datacenter capacity cannot grow 10% overnight, kpti will be off.

That's the decision and responsibility of the person who disables it.

Thanks,

tglx




Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok

2018-01-06 Thread Thomas Gleixner
On Sat, 6 Jan 2018, Alexei Starovoitov wrote:

> On Sat, Jan 06, 2018 at 10:54:27AM -0800, Dan Williams wrote:
> > On Sat, Jan 6, 2018 at 10:39 AM, Alexei Starovoitov
> >  wrote:
> > [..]
> > >> retpoline is variant-2, this patch series is about variant-1.
> > >
> > > that's exactly the point. Don't slow down the kernel with lfences
> > > to solve variant 1. retpoline for 2 is ok from long term kernel
> > > viability perspective.
> > >
> > 
> > Setting aside that we still need to measure the impact of these
> > changes the end result will still be nospec_array_ptr() sprinkled in
> > various locations. So can we save the debate about what's inside that
> > macro on various architectures and at least proceed with annotating
> > the problematic locations? Perhaps we can go a step further and have a
> > config option to switch between the clever array_access() approach
> > from Linus that might be fine depending on the compiler, and the
> > cpu-vendor-recommended not to speculate implementation of
> > nospec_array_ptr().
> 
> recommended by panicing vendors who had no better ideas?
> Ohh, speculation is exploitable, let's stop speculation.
> Instead of fighting it we can safely steer it where it doesn't leak
> kernel data. AND approach is doing exactly that.

For one particular architecture and that's not a solution for generic
code.

Aside of that I fundamentally disagree with your purely performance
optimized argumentation. We need to make sure that we have a solution which
kills the problem safely and then take it from there. Correctness first,
optimization later is the rule for this. Better safe than sorry.

Thanks,

tglx




Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-14 Thread Thomas Gleixner
On Mon, 13 Nov 2017, Sagi Grimberg wrote:
> > > Can you explain what do you mean by "subsystem"? I thought that the
> > > subsystem would be the irq subsystem (which means you are the one to
> > > provide
> > > the needed input :) ) and the driver would pass in something
> > > like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
> > > requirements that you listed and NULL to tell the core to leave it alone
> > > and do what it sees fit (or pass msi_irq_ops with flag that means that).
> > > 
> > > ops structure is a very common way for drivers to communicate with a
> > > subsystem core.
> > 
> > So if you look at the above pseudo code then the subsys_*_move_callbacks
> > are probably subsystem specific, i.e. block or networking.
> > 
> > Those subsystem callbacks might either handle it at the subsystem level
> > directly or call into the particular driver.
> 
> Personally I do not think that integrating this to networking/block
> stacks in that level is going to work. drivers don't communicate any
> information on what they do with msi(x) vectors (and I'm not sure they
> should).

The callback does not need to transport any information about the interrupt
itself. The driver associates the interrupt to a queue and that queue _is_
known at the subsystem level. So a queue cookie can be handed in along with
the callback information.

I think that the subsystem, e.g. block should handle parts of this, at
least the generic driver independent management of the queue, i.e.

- Marking the queue as closed so no new requests can be queued
  anymore. Having this at the driver level seems to be a layering
  violation.

- Making sure that the outstanding requests are completed. That needs
  driver support for sure, but OTOH the subsystem should know about
  outstanding requests.

- Marking the queue as open again. Again that's mostly subsystem level.
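Shaped roughly like this, purely as a sketch of the direction; no such ops
structure exists today and the names are made up:

#include <linux/cpumask.h>

struct irq_aff_move_ops_sketch {
        /*
         * Called before the core moves the interrupt. May veto the move by
         * returning an error and has to drain/close the queue identified
         * by the subsystem level cookie.
         */
        int     (*pre_move)(void *cookie, const struct cpumask *newmask);

        /*
         * Called after the move took effect: make the new resources
         * effective, free the old ones and reopen the queue.
         */
        void    (*post_move)(void *cookie);
};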

Thanks,

tglx




Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-13 Thread Thomas Gleixner
On Mon, 13 Nov 2017, Sagi Grimberg wrote:
> > > >  #1 Before the core tries to move the interrupt so it can veto
> > > > the
> > > >   move if it cannot allocate new resources or whatever is 
> > > > required
> > > >   to operate after the move.
> > > 
> > > What would the core do if a driver veto a move?
> > 
> > Return the error code from write_affinity() as it does with any other error
> > which fails to set the affinity.
> 
> OK, so this would mean that the driver queue no longer has a vector
> correct? so is the semantics that it needs to cleanup its resources or
> should it expect another callout for that?

The driver queue still has the old vector, i.e.

echo XXX > /proc/irq/YYY/affinity

  write_irq_affinity(newaffinity)

        newvec = reserve_new_vector();

        ret = subsys_pre_move_callback(..., newaffinity);

        if (ret) {
                drop_new_vector(newvec);
                return ret;
        }

        shutdown(oldvec);
        install(newvec);

        subsys_post_move_callback();

        startup(newvec);

subsys_pre_move_callback()

        ret = do_whatever();
        if (ret)
                return ret;

        /*
         * Make sure nothing is queued anymore and outstanding
         * requests are completed. Same as for managed CPU hotplug.
         */
        stop_and_drain_queue();
        return 0;

subsys_post_move_callback()

        install_new_data();

        /* Reenable queue. Same as for managed CPU hotplug */
        reenable_queue();

        free_old_data();
        return;

Does that clarify the mechanism?

> > > This looks like it can work to me, but I'm probably not familiar enough
> > > to see the full picture here.
> > 
> > On the interrupt core side this is workable, I just need the input from the
> > driver^Wsubsystem side if this can be implemented sanely.
> 
> Can you explain what do you mean by "subsystem"? I thought that the
> subsystem would be the irq subsystem (which means you are the one to provide
> the needed input :) ) and the driver would pass in something
> like msi_irq_ops to pci_alloc_irq_vectors() if it supports the driver
> requirements that you listed and NULL to tell the core to leave it alone
> and do what it sees fit (or pass msi_irq_ops with flag that means that).
> 
> ops structure is a very common way for drivers to communicate with a
> subsystem core.

So if you look at the above pseudo code then the subsys_*_move_callbacks
are probably subsystem specific, i.e. block or networking.

Those subsystem callbacks might either handle it at the subsystem level
directly or call into the particular driver.

That's certainly out of the scope what the generic interrupt code can do :)

Thanks,

tglx



Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-13 Thread Thomas Gleixner
On Mon, 13 Nov 2017, Sagi Grimberg wrote:
> > 3) Affinity override in managed mode
> > 
> >   Doable, but there are a couple of things to think about:
> 
> I think that it will be good to shoot for (3). Given that there are
> driver requirements I'd say that driver will expose up front if it can
> handle it, and if not we fallback to (1).
> 
> >* How is this enabled?
> > 
> >  - Opt-in by driver
> > 
> >  - Extra sysfs/procfs knob
> > 
> >  We definitely should not enable it per default because that would
> >  surprise users/drivers which work with the current managed devices and
> >  rely on the affinity files to be non writeable in managed mode.
> 
> Do you know if any exist? Would it make sense to have a survey to
> understand if anyone relies on it?
> 
> From what I've seen so far, drivers that were converted simply worked
> with the non-managed facility and didn't have any special code for it.
> Perhaps Christoph can comment as he convert most of them.
> 
> But if there aren't any drivers that absolutely rely on it, maybe its
> not a bad idea to allow it by default?

Sure, I was just cautious and I have to admit that I have no insight into
the driver side details.

> >* When and how is the driver informed about the change?
> > 
> >   When:
> > 
> > #1 Before the core tries to move the interrupt so it can veto the
> >   move if it cannot allocate new resources or whatever is required
> >   to operate after the move.
> 
> What would the core do if a driver veto a move?

Return the error code from write_affinity() as it does with any other error
which fails to set the affinity.

> I'm wandering in what conditions a driver will be unable to allocate
> resources for move to cpu X but able to allocate for move to cpu Y.

Node affine memory allocation is the only thing which comes to my mind, or
some decision not to have a gazillion of queues on a single CPU. 

> This looks like it can work to me, but I'm probably not familiar enough
> to see the full picture here.

On the interrupt core side this is workable, I just need the input from the
driver^Wsubsystem side if this can be implemented sanely.

Thanks,

tglx



Re: [run_timer_softirq] BUG: unable to handle kernel paging request at 0000000000010007

2017-11-10 Thread Thomas Gleixner
On Fri, 10 Nov 2017, Linus Torvalds wrote:

> On Wed, Nov 8, 2017 at 9:19 PM, Fengguang Wu <fengguang...@intel.com> wrote:
> >
> > Yes it's accessing the list. Here is the faddr2line output.
> 
> Ok, so it's a corrupted timer list. Which is not a big surprise.
> 
> It's
> 
> next->pprev = pprev;
> 
> in __hlist_del(), and the trapping instruction decodes as
> 
> mov%rdx,0x8(%rax)
> 
> with %rax having the value dead000000000200,
> 
> Which is just LIST_POISON2.
> 
> So we've deleted that entry twice - LIST_POISON2 is what hlist_del()
> sets pprev to after already deleting it once.
> 
> Although in this case it might not be hlist_del(), because
> detach_timer() also sets entry->next to LIST_POISON2.
> 
> Which is pretty bogus, we are supposed to use LIST_POISON1 for the
> "next" pointer. Oh well. Nobody cares, except for the list entry
> debugging code, which isn't run on the hlist cases.
> 
> Adding Thomas Gleixner to the cc. It should not be possible to delete
> the same timer twice.

Right, it shouldn't.

Fengguang, can you please enable:

CONFIG_DEBUG_OBJECTS
CONFIG_DEBUG_OBJECTS_TIMERS

and try to reproduce? Debugobject should catch that hopefully.

Thanks,

tglx


Re: [RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-10 Thread Thomas Gleixner
On Fri, 10 Nov 2017, Saeed Mahameed wrote:

> Well, I can speak for mlx5 case or most of the network drivers, where
> all of the queues associated with an interrupt, move with it, so i
> don't think our current driver have this issue. I don't believe there
> are network driver with fixed Per cpu resources, but it worth double
> checking.

At least the intel i40e[vf] drivers use the affinity notifier. Same for
infiniband drivers and scsi. So there seem to be requirements.

> Regarding the below solutions, any one that will gurantee the initial
> managed spreading and still allow the user to modify affinity via
> /proc/irq/xyz/smp_afinity will be acceptable, since many tools and user
> rely on this sysfs entry e.g. (irqbalance)

Well, any one is not very specific.

This needs to be discussed and agreed on in detail, otherwise we end up
with the same situation 3 month after the changes hit mainline and drivers
got converted over.

Thanks,

tglx


[RFD] Managed interrupt affinities [ Was: mlx5 broken affinity ]

2017-11-09 Thread Thomas Gleixner
Find below a summary of the technical details, implications and options

What can be done for 4.14?

  We basically have two options: Revert at the driver level or ship as
  is.

  Even if we come up with a quick and dirty hack then it will be too late
  for proper testing before sunday.


What can be done with some time to work on?

The managed mechanism consists of 3 pieces:

 1) Vector spreading

 2) Managed vector allocation, which becomes a guaranteed reservation in
    4.15 due to the big rework of the vector management code.

    Non managed interrupts get a best effort reservation to handle the CPU
    unplug vector pressure problem in a sane way.

 3) CPU hotplug management

    If the last CPU in the affinity set goes offline, then the interrupt is
    shut down and restarted when the first CPU in the affinity set comes
    online again. The driver code needs to ensure that the queue associated
    with that interrupt is drained before shutdown and nothing is queued
    there after this point.

So we have options:

1) Initial vector spreading 

 Let the driver use the initial vector spreading. That does only the
 initial affinity setup, but otherwise the interrupts are handled like any
 other non managed interrupt, i.e. best effort reservation, affinity
 settings enabled and CPU unplug breaks affinity and moves them to some
 random other online CPU.

 The simplest solution of all.

2) Allowing a driver supplied mask

 Certainly simple to do, but as you said it's not really a solution. I'm
 not sure whether we want to go there as this is going to be replaced fast
 enough and then create another breakage/frustration level.


3) Affinity override in managed mode

 Doable, but there are a couple of things to think about:

  * How is this enabled?

- Opt-in by driver
 
- Extra sysfs/procfs knob

We definitely should not enable it per default because that would
surprise users/drivers which work with the current managed devices and
rely on the affinity files to be non writeable in managed mode.

  * Is it allowed to set the affinity to offline, but present CPUs?

 In principle yes, because the core management code can do that as well
 at setup time.

  * The affinity setting must fail when it cannot do a guaranteed
reservation on the new target CPU(s).

 This is not much of a question. That's a matter of fact because
 otherwise the association cannot be guaranteed and things fall apart
 all over the place.

  * When and how is the driver informed about the change?

 When:

   #1 Before the core tries to move the interrupt so it can veto the
  move if it cannot allocate new resources or whatever is required
  to operate after the move.
  
   #2 After the core made the move effective because:

  - The interrupt might be moved from an offline set to an online
set and needs to be started up, so the related queue must be
enabled as well.

  - The interrupt might be moved from an online set to an offline
set, so the queue needs to be drained and disabled.

  - Resources which have been allocated in the first step must be
made effective and old resources freed.

 How:

   The existing affinity notification mechanism does not work for this
   and it's a horrible piece of crap which should go away sooner than
   later.

   So we need some sensible way to provide callbacks. Emphasis on
   callbacks as one multiplexing callback is not a good idea.

  * How can the change be made effective?

When the preliminaries (vector reservation on the new set and
possibly resource allocation in the subsystem) have been done, then the
actual move can be made.

But, there is a caveat. x86 is not good at reassociating interrupts on
the fly except when it sits behind an interrupt remapping unit, but we
cannot rely on that.

So the change flow which works for everything would be:

if (reserve_vectors() < 0)
   return FAIL;

if (subsys_prep_callback() < 0) {
   release_vectors();
   return FAIL;
}

shutdown(irq);

if (!online(newset))
   return SUCCESS;

startup(irq);

subsys_post_callback();
return SUCCESS;

subsys_prep_callback() must basically work the same way as the CPU
offline mechanism and drain the queue and prevent queueing before the
irq is restarted. If the move results in keeping it shutdown because
the new set is offline, then the irq will be restarted via the CPU
hotplug code and the subsystem will be informed about that via the
hotplug mechanism as well.

subsys_post_callback() is more or less the same as the hotplug callback
and restarts the queue. The only difference to the hotplug code as of
today is that it might need to make previously allocated resources
effective and free the old ones.

I named that subsys_*_callback() 

Re: mlx5 broken affinity

2017-11-09 Thread Thomas Gleixner
On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 10:07 AM, Thomas Gleixner wrote:
> > I say it one last time: It can be done and I'm willing to help.
> 
> It didn't sound like it earlier, but that's good news.

Well, I'm equally frustrated by this whole thing, but I certainly never
said that I'm not willing to change anything.

Thanks,

tglx




Re: mlx5 broken affinity

2017-11-09 Thread Thomas Gleixner
On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 10:03 AM, Thomas Gleixner wrote:
> > On Thu, 9 Nov 2017, Jens Axboe wrote:
> >> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> >> If that's the attitude at your end, then I do suggest we just revert the
> >> driver changes. Clearly this isn't going to be productive going forward.
> >>
> >> The better solution was to make the managed setup more flexible, but
> >> it doesn't sound like that is going to be viable at all.
> > 
> > That's not true. I indicated several times, that we can do that, but not
> > just by breaking the managed facility.
> > 
> > What I'm arguing against is that the blame is focused on those who
> > implemented the managed facility with the existing semantics.
> > 
> > I'm still waiting for a proper description of what needs to be changed in
> > order to make these drivers work again. All I have seen so far is to break
> > managed interrupts completely and that's not going to happen.
> 
> There's no blame as far as I'm concerned, just frustration that we broke
> this and folks apparently not acknowledging that it's a concern.
> 
> What used to work was that you could move IRQs around as you wanted to.
> That was very useful for custom setups, for tuning, or for isolation
> purposes. None of that works now, which is unfortunate.

Well, it's unfortunate and I can understand your frustration, but I really
have a hard time understanding why these concerns were not brought up
when the whole thing was discussed and in the early stages of
implementation, way before it got merged.

It was not my decision to make it the way it is and I merely try to prevent
hasty and damaging hackery right now.

I'll answer to the technical details in a separate mail.

Thanks,

tglx


Re: mlx5 broken affinity

2017-11-09 Thread Thomas Gleixner
On Thu, 9 Nov 2017, Jens Axboe wrote:

> On 11/09/2017 09:01 AM, Sagi Grimberg wrote:
> >> Now you try to blame the people who implemented the managed affinity stuff
> >> for the wreckage, which was created by people who changed drivers to use
> >> it. Nice try.
> > 
> > I'm not trying to blame anyone, really. I was just trying to understand
> > how to move forward with making users happy and still enjoy subsystem
> > services instead of doing lots of similar things inside mlx5 driver.
> 
> Exactly. The key here is how we make it work for both cases. But there
> has to be a willingness to make the infrastructure work for that.

I say it one last time: It can be done and I'm willing to help.

Please tell me what exactly you expect and I can have a look what needs to
be done w/o creating an utter mess.

Thanks,

tglx


Re: mlx5 broken affinity

2017-11-09 Thread Thomas Gleixner
On Thu, 9 Nov 2017, Jens Axboe wrote:
> On 11/09/2017 07:19 AM, Thomas Gleixner wrote:
> If that's the attitude at your end, then I do suggest we just revert the
> driver changes. Clearly this isn't going to be productive going forward.
> 
> The better solution was to make the managed setup more flexible, but
> it doesn't sound like that is going to be viable at all.

That's not true. I indicated several times, that we can do that, but not
just by breaking the managed facility.

What I'm arguing against is that the blame is focused on those who
implemented the managed facility with the existing semantics.

I'm still waiting for a proper description of what needs to be changed in
order to make these drivers work again. All I have seen so far is to break
managed interrupts completely and that's not going to happen.

Thanks,

tglx






Re: mlx5 broken affinity

2017-11-09 Thread Thomas Gleixner
On Thu, 9 Nov 2017, Sagi Grimberg wrote:
> Thomas,
> 
> > > Because the user sometimes knows better based on statically assigned
> > > loads, or the user wants consistency across kernels. It's great that the
> > > system is better at allocating this now, but we also need to allow for a
> > > user to change it. Like anything on Linux, a user wanting to blow off
> > > his/her own foot, should be allowed to do so.
> > 
> > That's fine, but that's not what the managed affinity facility provides. If
> > you want to leverage the spread mechanism, but avoid the managed part, then
> > this is a different story and we need to figure out how to provide that
> > without breaking the managed side of it.
> > 
> > As I said it's possible, but I vehemently disagree, that this is a bug in
> > the core code, as it was claimed several times in this thread.
> > 
> > The real issue is that the driver was converted to something which was
> > expected to behave differently. That's hardly a bug in the core code, at
> > most it's a documentation problem.
> 
> I disagree here, this is not a device specific discussion. The question
> of exposing IRQ affinity assignment knob to user-space holds for every
> device (NICs, HBAs and alike). The same issue could have been raised on
> any other device using managed affinitization (and we have quite a few
> of those now). I can't see any reason why its OK for device X to use it
> while its not OK for device Y to use it.
>
> If the resolution is "YES we must expose a knob to user-space" then the
> managed facility is unusable in its current form for *any* device. If the
> answer is "NO, user-space can't handle all the stuff the kernel can" then
> we should document it. This is really device independent.

The early discussion of the managed facility came to the conclusion that it
will manage this stuff completely to allow fixed association of 'queue /
interrupt / corresponding memory' to a single CPU or a set of CPUs. That
removes a lot of 'affinity' handling magic from the driver and utilizes the
hardware in a sensible way. That was not my decision, really. It surely
made sense to me and I helped Christoph to implement it.

The real question is whether you want to have the fixed 'queue / interrupt/
corresponding memory association' and get rid of all the 'affinity' dance
in your driver or not.

If you answer that question with 'yes' then the consequence is that there
is no knob.

If you answer that question with 'no' then you should not use
the managed facility in the first place and if you need parts of that
functionality then this needs to be added to the core code _before_ a
driver gets converted and not afterwards.

It's not my problem if people decide to use this and then trip over the
limitations after the changes hit the tree. This could have been figured
out before even a single patch was posted.

Now you try to blame the people who implemented the managed affinity stuff
for the wreckage, which was created by people who changed drivers to use
it. Nice try.

Thanks,

tglx


Re: [PATCH x86/urgent] bpf: emulate push insns for uprobe on x86

2017-11-08 Thread Thomas Gleixner
On Wed, 8 Nov 2017, Yonghong Song wrote:
> On 11/8/17 4:06 PM, David Miller wrote:
> > From: Yonghong Song 
> > Date: Wed, 8 Nov 2017 13:37:12 -0800
> > 
> > > Uprobe is a tracing mechanism for userspace programs.
> > > Typical uprobe will incur overhead of two traps.
> > > First trap is caused by replaced trap insn, and
> > > the second trap is to execute the original displaced
> > > insn in user space.
> >   ...
> > 
> > I don't understand how this is bpf related, and if it is you don't
> > explain it well in the commit message.
> 
> Right. This is not related to bpf. Will remove the "bpf" from the subject line
> in the next revision.

The proper subject is something like:

[PATCH] uprobes/x86: ...

which you can figure out by looking at the subsystem prefixes via

git log arch/x86/kernel/uprobes.c

Note that it says [PATCH] and nothing else. That patch is a nice
performance improvement, but certainly not x86/urgent material. x86/urgent
is for bug and regression fixes.

Thanks,

tglx


Re: mlx5 broken affinity

2017-11-08 Thread Thomas Gleixner
On Wed, 8 Nov 2017, Jes Sorensen wrote:
> On 11/07/2017 10:07 AM, Thomas Gleixner wrote:
> > On Sun, 5 Nov 2017, Sagi Grimberg wrote:
> >> I do agree that the user would lose better cpu online/offline behavior,
> >> but it seems that users want to still have some control over the IRQ
> >> affinity assignments even if they lose this functionality.
> > 
> > Depending on the machine and the number of queues this might even result in
> > completely losing the ability to suspend/hibernate because the number of
> > available vectors on CPU0 is not sufficient to accommodate all queue
> > interrupts.
> 
> Depending on the system, suspend/resume is really the lesser interesting
> issue to the user. Pretty much any system with a 10/25GBps mlx5 NIC in
> it will be in some sort of rack and is unlikely to ever want to
> suspend/resume.

The discussions with Intel about that tell a different story and cpu
online/offline for power management purposes is - while debatable - widely
used.

That's where the whole idea for managed affinities originated from along
with avoiding the affinity hint and affinity notifier machinery which
creates more problems than it solves.

> >> Would it be possible to keep the managed facility until a user overrides
> >> an affinity assignment? This way if the user didn't touch it, we keep
> >> all the perks, and in case the user overrides it, we log the implication
> >> so the user is aware?
> > 
> > A lot of things are possible, the question is whether it makes sense. The
> > whole point is to have resources (queues, interrupts etc.) per CPU and have
> > them strictly associated.
> > 
> > Why would you give the user a knob to destroy what you carefully optimized?
> > 
> > Just because we can and just because users love those knobs or is there any
> > real technical reason?
> 
> Because the user sometimes knows better based on statically assigned
> loads, or the user wants consistency across kernels. It's great that the
> system is better at allocating this now, but we also need to allow for a
> user to change it. Like anything on Linux, a user wanting to blow off
> his/her own foot, should be allowed to do so.

That's fine, but that's not what the managed affinity facility provides. If
you want to leverage the spread mechanism, but avoid the managed part, then
this is a different story and we need to figure out how to provide that
without breaking the managed side of it.

As I said it's possible, but I vehemently disagree, that this is a bug in
the core code, as it was claimed several times in this thread.

The real issue is that the driver was converted to something which was
expected to behave differently. That's hardly a bug in the core code, at
most it's a documentation problem.

Thanks,

tglx





Re: Is there a race between __mod_timer() and del_timer()?

2017-11-08 Thread Thomas Gleixner
On Wed, 8 Nov 2017, David Howells wrote:

> Is there a race between the optimisation for networking code in __mod_timer()
> and del_timer() - or, at least, a race that matters?
> 
> Consider:
> 
>   CPU A   CPU B
>   === ===
>   [timer X is active]
>   ==>__mod_timer(X)
>   if (timer_pending(timer))
>   [Take the true path]
>   -- IRQ --   ==>del_timer(X)
>   <==
>   if (timer->expires == expires)
>   [Take the true path]
>   <==return 1
>   [timer X is not active]
> 
> There's no locking to prevent this, but __mod_timer() returns without
> restarting the timer.  I'm not sure this is a problem exactly, however, since
> del_timer() *was* issued, and could've deleted the timer after __mod_timer()
> returned.

Correct, if two CPUs fiddle with the same timer concurrently then there is
no guaranteed outcome.

> A couple of possible alleviations:
> 
>  (1) Recheck timer_pending() before returning from __mod_timer().

That's just adding more instructions into that code path for a dubious
value.

>  (2) Set timer->expires to jiffies in del_timer() - but since there's nothing
>  preventing the optimisation in __mod_timer() from occurring concurrently
>  with del_timer(), this probably won't help.

Right.

> I think it might just be best to put a note in the comments in __mod_timer().

Agreed.

Thanks,

tglx
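
For reference, a minimal sketch of where such a note could go, assuming the
__mod_timer() fast path of that era; the comment wording below is illustrative,
not the one that was merged:

	/*
	 * Lockless fast path: a concurrent del_timer() on another CPU can
	 * win the race after this check, so the timer may already be gone
	 * by the time we return 1.  Modifying the same timer concurrently
	 * from several CPUs has no guaranteed outcome.
	 */
	if (timer_pending(timer)) {
		if (timer->expires == expires)
			return 1;
	}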


Re: mlx5 broken affinity

2017-11-07 Thread Thomas Gleixner
On Sun, 5 Nov 2017, Sagi Grimberg wrote:
> > > > This wasn't to start a debate about which allocation method is the
> > > > perfect solution. I am perfectly happy with the new default, the part
> > > > that is broken is to take away the user's option to reassign the
> > > > affinity. That is a bug and it needs to be fixed!
> > > 
> > > Well,
> > > 
> > > I would really want to wait for Thomas/Christoph to reply, but this
> > > simple change fixed it for me:
> > > --
> > > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> > > index 573dc52b0806..eccd06be5e44 100644
> > > --- a/kernel/irq/manage.c
> > > +++ b/kernel/irq/manage.c
> > > @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
> > >   {
> > >  struct irq_desc *desc = irq_to_desc(irq);
> > > 
> > > -   return __irq_can_set_affinity(desc) &&
> > > -   !irqd_affinity_is_managed(&desc->irq_data);
> > > +   return __irq_can_set_affinity(desc);
> > 
> > Which defeats the whole purpose of the managed facility, which is _not_ to
> > break the affinities on cpu offline and bring the interrupt back on the CPU
> > when it comes online again.
> > 
> > What I can do is to have a separate flag, which only uses the initial
> > distribution mechanism, but I really want to have Christophs opinion on
> > that.
> 
> I do agree that the user would lose better cpu online/offline behavior,
> but it seems that users want to still have some control over the IRQ
> affinity assignments even if they lose this functionality.

Depending on the machine and the number of queues this might even result in
completely losing the ability to suspend/hibernate because the number of
available vectors on CPU0 is not sufficient to accommodate all queue
interrupts.

> Would it be possible to keep the managed facility until a user overrides
> an affinity assignment? This way if the user didn't touch it, we keep
> all the perks, and in case the user overrides it, we log the implication
> so the user is aware?

A lot of things are possible, the question is whether it makes sense. The
whole point is to have resources (queues, interrupts etc.) per CPU and have
them strictly associated.

Why would you give the user a knob to destroy what you carefully optimized?

Just because we can and just because users love those knobs or is there any
real technical reason?

Thanks,

tglx






Re: mlx5 broken affinity

2017-11-02 Thread Thomas Gleixner
On Thu, 2 Nov 2017, Sagi Grimberg wrote:

> 
> > This wasn't to start a debate about which allocation method is the
> > perfect solution. I am perfectly happy with the new default, the part
> > that is broken is to take away the user's option to reassign the
> > affinity. That is a bug and it needs to be fixed!
> 
> Well,
> 
> I would really want to wait for Thomas/Christoph to reply, but this
> simple change fixed it for me:
> --
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index 573dc52b0806..eccd06be5e44 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -146,8 +146,7 @@ bool irq_can_set_affinity_usr(unsigned int irq)
>  {
> struct irq_desc *desc = irq_to_desc(irq);
> 
> -   return __irq_can_set_affinity(desc) &&
> -   !irqd_affinity_is_managed(&desc->irq_data);
> +   return __irq_can_set_affinity(desc);

Which defeats the whole purpose of the managed facility, which is _not_ to
break the affinities on cpu offline and bring the interrupt back on the CPU
when it comes online again.

What I can do is to have a separate flag, which only uses the initial
distribution mechanism, but I really want to have Christophs opinion on
that.

Thanks,

tglx
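
A minimal sketch of what such a separate flag could look like in
irq_can_set_affinity_usr(); the irqd_affinity_spread_only() helper is
hypothetical and only illustrates keeping the initial spreading while leaving
the affinity writable from user space:

bool irq_can_set_affinity_usr(unsigned int irq)
{
	struct irq_desc *desc = irq_to_desc(irq);

	/* writable if not managed, or if only the initial spread was requested */
	return __irq_can_set_affinity(desc) &&
	       (!irqd_affinity_is_managed(&desc->irq_data) ||
		irqd_affinity_spread_only(&desc->irq_data));
}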


Re: KASAN: use-after-free Write in detach_if_pending

2017-10-29 Thread Thomas Gleixner
On Fri, 27 Oct 2017, syzbot wrote:

Cc'ed network folks.

> syzkaller hit the following crash on e7989f973ae1b90ec7c0b671c81f7f553affccbe
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/master
> compiler: gcc (GCC) 7.1.1 20170620
> .config is attached
> Raw console output is attached.
> C reproducer is attached
> syzkaller reproducer is attached. See https://goo.gl/kgGztJ
> for information about syzkaller reproducers
> 
> 
> BUG: KASAN: use-after-free in __write_once_size include/linux/compiler.h:305
> [inline]
> BUG: KASAN: use-after-free in __hlist_del include/linux/list.h:648 [inline]
> BUG: KASAN: use-after-free in detach_timer kernel/time/timer.c:791 [inline]
> BUG: KASAN: use-after-free in detach_if_pending+0x557/0x610
> kernel/time/timer.c:808
> Write of size 8 at addr 8801d3bab780 by task syzkaller900516/2986

That's just the point where this gets detected.

> CPU: 1 PID: 2986 Comm: syzkaller900516 Not tainted 4.13.0+ #82

> __hlist_del include/linux/list.h:648 [inline]
> detach_timer kernel/time/timer.c:791 [inline]
> detach_if_pending+0x557/0x610 kernel/time/timer.c:808
> try_to_del_timer_sync+0xa2/0x120 kernel/time/timer.c:1182
> del_timer_sync+0x18a/0x240 kernel/time/timer.c:1247
> tun_flow_uninit drivers/net/tun.c:1104 [inline]
> tun_free_netdev+0x105/0x1b0 drivers/net/tun.c:1776

   This shouldn't be called I think

> netdev_run_todo+0x870/0xca0 net/core/dev.c:7864
> rtnl_unlock+0xe/0x10 net/core/rtnetlink.c:106
> tun_detach drivers/net/tun.c:588 [inline]
> tun_chr_close+0x49/0x60 drivers/net/tun.c:2609
> __fput+0x333/0x7f0 fs/file_table.c:210
> fput+0x15/0x20 fs/file_table.c:246
> task_work_run+0x199/0x270 kernel/task_work.c:112
> exit_task_work include/linux/task_work.h:21 [inline]
> do_exit+0xa52/0x1b40 kernel/exit.c:865

Here is the allocation path

> alloc_netdev_mqs+0x16e/0xed0 net/core/dev.c:8018
> tun_set_iff drivers/net/tun.c:2022 [inline]
> __tun_chr_ioctl+0x12be/0x3d20 drivers/net/tun.c:2276
> tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2521
> vfs_ioctl fs/ioctl.c:45 [inline]
> do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
> SYSC_ioctl fs/ioctl.c:700 [inline]
> SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
> entry_SYSCALL_64_fastpath+0x1f/0xbe


And this is free.

> netdev_freemem net/core/dev.c:7970 [inline]
> free_netdev+0x2cf/0x360 net/core/dev.c:8132
> tun_set_iff drivers/net/tun.c:2105 [inline]

err_free_flow:
tun_flow_uninit(tun);   <

> __tun_chr_ioctl+0x2cf6/0x3d20 drivers/net/tun.c:2276
> tun_chr_ioctl+0x2a/0x40 drivers/net/tun.c:2521
> vfs_ioctl fs/ioctl.c:45 [inline]
> do_vfs_ioctl+0x1b1/0x1530 fs/ioctl.c:685
> SYSC_ioctl fs/ioctl.c:700 [inline]
> SyS_ioctl+0x8f/0xc0 fs/ioctl.c:691
> entry_SYSCALL_64_fastpath+0x1f/0xbe

So it's the TUNSETIFF ioctl which first allocates and then frees in the
errorpath of tun_set_iff.

But for some reason this sticks and the exit of that task does it again,
which triggers KASAN in the innocent timer code.

Thanks,

tglx


Re: [PATCH 32/58] isdn/gigaset: Convert timers to use timer_setup()

2017-10-19 Thread Thomas Gleixner
On Thu, 19 Oct 2017, Paul Bolle wrote:

> On Thu, 2017-10-19 at 23:31 +0200, Thomas Gleixner wrote:
> > bas_gigaset_exit()
> > {
> > for (i = 0; i < driver->minors; i++) {
> > if (gigaset_shutdown(driver->cs + i) < 0)
> > 
> > gigaset_shutdown(cs)
> > {
> > mutex_lock(&cs->mutex); < Explodes here
> > 
> > So driver->cs + i is invalid. No idea how that might be related to that
> > timer conversion patch, but 
> 
> Thanks for peeking into this!
> 
> Please note that driver->minors is one of the more embarrassing warts of the
> gigaset code. It's basically hardcoded to 1 for all three drivers (including
> bas_gigaset). So driver->cs itself is invalid here.
> 
> And since the patch uses
> struct cardstate *cs = urb->context;
> 
> in a few places my guess is that it's really the patch that triggers this.

Well, that does not explain why

  driver->cs + i

would be corrupted. That would require that this cs -> urb link points at
driver magically and then wrecks that driver data structure. Might be
the case, but if so then there are dragons buried somewhere

Thanks,

tglx


Re: [PATCH 32/58] isdn/gigaset: Convert timers to use timer_setup()

2017-10-19 Thread Thomas Gleixner
On Thu, 19 Oct 2017, Paul Bolle wrote:

> On Thu, 2017-10-19 at 23:03 +0200, Paul Bolle wrote:
> > On Mon, 2017-10-16 at 17:29 -0700, Kees Cook wrote:
> > > In preparation for unconditionally passing the struct timer_list pointer 
> > > to
> > > all timer callbacks, switch to using the new timer_setup() and 
> > > from_timer()
> > > to pass the timer pointer explicitly.
> > 
> > Acked-by: Paul Bolle 
> 
> I have to take this back, sorry!
> 
> > For the record: this patch made me nervous but survived the rigorous 
> > testing I
> > threw at it. (Ie, dialing up using bas_gigaset and downloading almost 20 MB 
> > in
> > just over an hour. Whoot! That's more than good enough to ack this patch.)
> > 
> > There was some cleanup I had in mind to make this patch more 
> > straightforward.
> > But that can wait until someone finds a way to hit an issue with this patch.
> > We'll see.
> 
> That someone turns out to be me, doing "modprobe -r bas_gigaset":
> 
> <1>[30143.538135] BUG: unable to handle kernel NULL pointer dereference at 
> 01e9
> <1>[30143.538154] IP: mutex_lock+0x19/0x30
> <0>[30143.538300]  gigaset_shutdown+0x28/0x130 [gigaset]
> <0>[30143.538307]  ? find_module_all+0x62/0x80
> <0>[30143.538314]  bas_gigaset_exit+0x31/0x1077 [bas_gigaset]

bas_gigaset_exit()
{
for (i = 0; i < driver->minors; i++) {
if (gigaset_shutdown(driver->cs + i) < 0)

gigaset_shutdown(cs)
{
mutex_lock(&cs->mutex); < Explodes here

So driver->cs + i is invalid. No idea how that might be related to that
timer conversion patch, but 

Thanks,

tglx


Re: [PATCH v2 0/2] enable hires timer to timeout datagram socket

2017-09-25 Thread Thomas Gleixner
On Wed, 20 Sep 2017, Vallish Vaidyeshwara wrote:
> On Sat, Sep 16, 2017 at 11:47:56AM +0200, Thomas Gleixner wrote:
> > > So if we need to replace all 'legacy' timers to high resolution timer,
> > > because some application was _relying_ on jiffies being kind of precise,
> > > maybe it is better to revert the change done on legacy timers.
> > 
> > Which would be a major step back in terms of timer performance and system
> > disturbance caused by massive recascading operations.
> > 
> > > Or continue the migration and make them use high res internally.
> > > 
> > > select() and poll() are the standard way to have precise timeouts,
> > > it is silly we have to maintain a timeout handling in the datagram fast
> > > path.
> > 
> > A few years ago we switched select/poll over to use hrtimers because the
> > wheel timers were too inaccurate for some operations, so it feels
> > consequent to switch the timeout in the datagram rcv path over as well. I
> > agree that the whole timeout magic there feels silly, but unfortunately
> > it's a documented property of sockets.
> > 
> 
> Thanks for your comments. This patch has been NACK'ed by David Miller. Is
> there any other approach to solve this problem with out application code
> being recompiled?

We have only three options here:

   1) Do a massive revert of the timer wheel changes and lose all the
  benefits of that rework.

   2) Make that timer list -> hrtimer change in the datagram code

   3) Ignore it

#1 Would be pretty ironic as networking would take the biggest penalty of
   the revert.

#2 Is IMO the proper solution as it cures a user space visible regression,
   though the patch itself could be made way simpler

#3 Shrug

Dave, Eric?

Thanks,

tglx
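
A minimal sketch of what option #2 could look like, assuming the datagram
receive path still sleeps via schedule_timeout(); the function name is made up
and this is not the patch that was posted:

static long skb_wait_for_data_hires(struct sock *sk, long timeo)
{
	ktime_t expires = ktime_add_ns(ktime_get(), jiffies_to_nsecs(timeo));
	DEFINE_WAIT(wait);
	long ret;

	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
	/* 0 == hrtimer expired, -EINTR == woken up or signalled early */
	ret = schedule_hrtimeout(&expires, HRTIMER_MODE_ABS);
	finish_wait(sk_sleep(sk), &wait);
	return ret;
}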


Re: [PATCH v2 0/2] enable hires timer to timeout datagram socket

2017-09-16 Thread Thomas Gleixner
On Fri, 8 Sep 2017, Eric Dumazet wrote:
> On Fri, 2017-09-08 at 11:55 -0700, Eduardo Valentin wrote:
> > Hello,
> > 
> > On Fri, Sep 08, 2017 at 10:26:45AM -0700, David Miller wrote:
> > > From: David Woodhouse 
> > > Date: Fri, 08 Sep 2017 18:23:22 +0100
> > > 
> > > > I don't know that anyone's ever tried saying "show me the chapter
> > and
> > > > verse of the documentation"
> > > 
> > > Do you know why I brought this up?  Because the person I am replying
> > > to told me that the syscall documentation should have suggested this
> > > or that.
> > > 
> > > That's why.
> > 
> > :-) My intention was for sure not to upset anybody.
> > 
> > Just to reiterate, the point of patch is simple, there was a change in
> > behavior in the system call from one kernel version to the other. As I
> > mentioned, I agree that the userspace could use other means to achieve
> > the same, but still the system call behavior has changed.
> > 
> > > 
> > > So let's concentrate on the other aspects of my reply, ok?
> > 
> > I agree. I would prefer to understand here what is the technical
> > reason not to accept these patches other than "use other system
> > calls".
> 
> So if we need to replace all 'legacy' timers to high resolution timer,
> because some application was _relying_ on jiffies being kind of precise,
> maybe it is better to revert the change done on legacy timers.

Which would be a major step back in terms of timer performance and system
disturbance caused by massive recascading operations.

> Or continue the migration and make them use high res internally.
> 
> select() and poll() are the standard way to have precise timeouts,
> it is silly we have to maintain a timeout handling in the datagram fast
> path.

A few years ago we switched select/poll over to use hrtimers because the
wheel timers were too inaccurate for some operations, so it feels
consequent to switch the timeout in the datagram rcv path over as well. I
agree that the whole timeout magic there feels silly, but unfortunately
it's a documented property of sockets.

Thanks,

tglx


Re: 915f3e3f76 ("mac80211_hwsim: Replace bogus hrtimer clockid"): BUG: kernel reboot-without-warning in test stage

2017-09-05 Thread Thomas Gleixner
On Tue, 5 Sep 2017, Sebastian Andrzej Siewior wrote:

> On 2017-09-05 09:12:40 [+0200], Thomas Gleixner wrote:
> > Sorry, no. That bisect is completely bogus. The commit in question merely
> > replaces the unsupported clockid with a valid one.
> 
> The bisect is correct. It just has problems to express itself properly. So
> the table says:
> 
> | WARNING:at_kernel/time/hrtimer.c:#hrtimer_init  | 7   | ||   |
> | BUG:kernel_reboot-without-warning_in_test_stage | 0   | 12  | 6  | 6 |
> 
> which means _before_ your commit it counted a warning in hrtimer_init()
> (an unsupported clock id was used). With your commit, the warning was
> gone and I *think* the userland then printed
> "BUG:kernel_reboot-without-warning_in_test_stage" because it had no
> warning.
> It seems that the bot learned to live with that warning which was around
> for more than three years. Now that you removed it, it seems to be a
> mistake to do so because nobody complained about it so far.

Thanks for the translation. I'll never learn to decode these reports.

   tglx


Re: 915f3e3f76 ("mac80211_hwsim: Replace bogus hrtimer clockid"): BUG: kernel reboot-without-warning in test stage

2017-09-05 Thread Thomas Gleixner
On Tue, 5 Sep 2017, kernel test robot wrote:

> Greetings,
> 
> 0day kernel testing robot got the below dmesg and the first bad commit is
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> 
> commit 915f3e3f76c05b2da93c4cc278eebc2d9219d9f4
> Author: Thomas Gleixner <t...@linutronix.de>
> AuthorDate: Sat Feb 25 11:27:37 2017 +0100
> Commit: Linus Torvalds <torva...@linux-foundation.org>
> CommitDate: Sat Feb 25 09:48:16 2017 -0800
> 
> mac80211_hwsim: Replace bogus hrtimer clockid
> 
> mac80211_hwsim initializes a hrtimer with clockid CLOCK_MONOTONIC_RAW.
> That's not supported.
> 
> Use CLOCK_MONOTONIC instead.

Sorry, no. That bisect is completely bogus. The commit in question merely
replaces the unsupported clockid with a valid one.

Thanks,

tglx



Re: [PATCH 00/51] rtc: stop using rtc deprecated functions

2017-06-20 Thread Thomas Gleixner
On Tue, 20 Jun 2017, Alexandre Belloni wrote:
> On 20/06/2017 at 22:15:36 +0100, Russell King - ARM Linux wrote:
> > On Tue, Jun 20, 2017 at 05:07:46PM +0200, Benjamin Gaignard wrote:
> > > 2017-06-20 15:48 GMT+02:00 Alexandre Belloni
> > > :
> > > >> Yes, that's argument against changing rtc _drivers_ for hardware that
> > > >> can not do better than 32bit. For generic code (such as 44/51 sysfs,
> > > >> 51/51 suspend test), the change still makes sense.
> > > 
> > > What I had in mind when writing those patches was to remove the 
> > > limitations
> > > coming from those functions usage, even more since they been marked has
> > > deprecated.
> > 
> > I'd say that they should not be marked as deprecated.  They're entirely
> > appropriate for use with hardware that only supports a 32-bit
> > representation of time.
> > 
> > It's entirely reasonable to fix the ones that use other representations
> > that exceed that, but for those which do not, we need to keep using the
> > 32-bit versions.  Doing so actually gives us _more_ flexibility in the
> > future.
> > 
> > Consider that at the moment, we define the 32-bit RTC representation to
> > start at a well known epoch.  We _could_ decide that when it wraps to
> > 0x8000 seconds, we'll define the lower 0x4000 seconds to mean
> > dates in the future - and keep rolling that forward each time we cross
> > another 0x4000 seconds.  Unless someone invents a real time machine,
> > we shouldn't need to set a modern RTC back to 1970.
> > 
> 
> I agree with that but not the android guys. They seem to mandate an RTC
> that can store time from 01/01/1970. I don't know much more than that
> because they never cared to explain why that was actually necessary
> (apart from a laconic "this will result in a bad user experience")
> 
> I think tglx had a plan for offsetting the time at some point so 32-bit
> platform can pass 2038 properly.

Yes, but there are still quite some issues to solve there:

 1) How do you tell the system that it should apply the offset in the
first place, i.e at boot time before NTP or any other mechanism can
correct it?

 2) Deal with creative vendors who have their own idea about the 'start
of the epoch'

 3) Add the information of wraparound time to the rtc device which
needs to be filled in for each device. That way the rtc_***
accessor functions can deal with them whether they wrap in 2038 or
2100 or whatever.

#3 is the simplest problem of them :)   

> My opinion is that as long as userspace is not ready to handle those
> dates, it doesn't really matter because it is quite unlikely that
> anything will be able to continue running anyway.

That's a different story. Making the kernel y2038 ready in general is a
good thing. Whether userspace will be ready by then or not is completely
irrelevant.

Thanks,

tglx




Re: run_timer_softirq gpf. [smc]

2017-03-21 Thread Thomas Gleixner
On Tue, 21 Mar 2017, Dave Jones wrote:
> On Tue, Mar 21, 2017 at 08:25:39PM +0100, Thomas Gleixner wrote:
>  
>  > > I just hit this while fuzzing..
>  > > 
>  > > general protection fault:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
>  > > CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.11.0-rc2-think+ #1 
>  > > task: 88017f0ed440 task.stack: c9094000
>  > > RIP: 0010:run_timer_softirq+0x15f/0x700
>  > > RSP: 0018:880507c03ec8 EFLAGS: 00010086
>  > > RAX: dead0200 RBX: 880507dd0d00 RCX: 0002
>  > > RDX: 880507c03ed0 RSI:  RDI: 8204b3a0
>  > > RBP: 880507c03f48 R08: 880507dd12d0 R09: 880507c03ed8
>  > > R10: 880507dd0db0 R11:  R12: 8215cc38
>  > > R13: 880507c03ed0 R14: 82005188 R15: 8804b55491a8
>  > > FS:  () GS:880507c0() 
> knlGS:
>  > > CS:  0010 DS:  ES:  CR0: 80050033
>  > > CR2: 0004 CR3: 05011000 CR4: 001406e0
>  > > Call Trace:
>  > >  
>  > >  ? clockevents_program_event+0x47/0x120
>  > >  __do_softirq+0xbf/0x5b1
>  > >  irq_exit+0xb5/0xc0
>  > >  smp_apic_timer_interrupt+0x3d/0x50
>  > >  apic_timer_interrupt+0x97/0xa0
>  > > RIP: 0010:cpuidle_enter_state+0x12e/0x400
>  > > RSP: 0018:c9097e40 EFLAGS: 0202
>  > > [CONT START]  ORIG_RAX: ff10
>  > > RAX: 88017f0ed440 RBX: e8a03cc8 RCX: 0001
>  > > RDX: 20c49ba5e353f7cf RSI: 0001 RDI: 88017f0ed440
>  > > RBP: c9097e80 R08:  R09: 0008
>  > > R10:  R11:  R12: 0005
>  > > R13: 820b9338 R14: 0005 R15: 820b9320
>  > >  
>  > >  cpuidle_enter+0x17/0x20
>  > >  call_cpuidle+0x23/0x40
>  > >  do_idle+0xfb/0x200
>  > >  cpu_startup_entry+0x71/0x80
>  > >  start_secondary+0x16a/0x210
>  > >  start_cpu+0x14/0x14
>  > > Code: 8b 05 ce 1b ef 7e 83 f8 03 0f 87 4e 01 00 00 89 c0 49 0f a3 04 24 
> 0f 82 0a 01 00 00 49 8b 07 49 8b 57 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 
> 41 f6 47 2a 20 49 c7 47 08 00 00 00 00 48 89 df 48 
>  > 
>  > The timer which expires has timer->entry.next == POISON2 !
>  > 
>  > it's a classic list corruption.  The
>  > bad news is that there is no trace of the culprit because that happens when
>  > some other timer expires after some random amount of time.
>  > 
>  > If that is reproducible, then please enable debugobjects. That should
>  > pinpoint the culprit.
> 
> It's net/smc.  This recently had a similar bug with workqueues. 
> (https://marc.info/?l=linux-kernel&m=148821582909541) fixed by
> 637fdbae60d6cb9f6e963c1079d7e0445c86ff7d

Fixed? It's not fixed by that commit. The workqueue code merely got a new
WARN_ON_ONCE(). But the underlying problem is still unfixed in net/smc.

> so it's probably unsurprising that there are similar issues.

That one is related to workqueues:

> WARNING: CPU: 0 PID: 2430 at lib/debugobjects.c:289 
> debug_print_object+0x87/0xb0
> ODEBUG: free active (active state 0) object type: timer_list hint: 
> delayed_work_timer_fn+0x0/0x20

delayed_work_timer_fn() is what queues the work once the timer expires.

> CPU: 0 PID: 2430 Comm: trinity-c4 Not tainted 4.11.0-rc3-think+ #3 
> Call Trace:
>  dump_stack+0x68/0x93
>  __warn+0xcb/0xf0
>  warn_slowpath_fmt+0x5f/0x80
>  ? debug_check_no_obj_freed+0xd9/0x260
>  debug_print_object+0x87/0xb0
>  ? work_on_cpu+0xd0/0xd0
>  debug_check_no_obj_freed+0x219/0x260
>  ? __sk_destruct+0x10d/0x1c0
>  kmem_cache_free+0x9f/0x370
>  __sk_destruct+0x10d/0x1c0
>  sk_destruct+0x20/0x30
>  __sk_free+0x43/0xa0
>  sk_free+0x18/0x20

smc_release does at the end of the function:

if (smc->use_fallback) {
schedule_delayed_work(&smc->sock_put_work, TCP_TIMEWAIT_LEN);
} else if (sk->sk_state == SMC_CLOSED) {
smc_conn_free(&smc->conn);
schedule_delayed_work(&smc->sock_put_work,
  SMC_CLOSE_SOCK_PUT_DELAY);
}
sk->sk_prot->unhash(sk);
release_sock(sk);

sock_put(sk);

sock_put(sk)
{
if (atomic_dec_and_test(&sk->sk_refcnt))
sk_free(sk);
}

That means either smc_release() queued delayed work or it was already
queued.

But in neither case it holds an extra refcount on sk. Otherwise sock_put()
would not end up in sk_free().

Thanks,

tglx
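
One way to make the refcounting consistent is to take an extra reference for
the queued work; this is only an illustrative sketch of the fallback branch,
not the fix that eventually went into net/smc:

	if (smc->use_fallback) {
		sock_hold(sk);	/* reference for the queued work */
		if (!schedule_delayed_work(&smc->sock_put_work,
					   TCP_TIMEWAIT_LEN))
			sock_put(sk);	/* work was already pending */
	}

The work function then does the final sock_put() when it has finished.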






Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

2017-03-09 Thread Thomas Gleixner
On Thu, 9 Mar 2017, Daniel Borkmann wrote:

> On 03/09/2017 02:10 PM, Thomas Gleixner wrote:
> > On Thu, 9 Mar 2017, Daniel Borkmann wrote:
> > > With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
> > > code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
> > > was set each time we switched attrs and a cpa_flush_range() was
> > > performed (with the correct number of pages and cache set to 0). That
> > > would be a __flush_tlb_all() eventually.
> > > 
> > > Hmm, it indeed might seem likely that this could be an emulation bug.
> > 
> > Which variant of __flush_tlb_all() is used when the test fails?
> > 
> > Check for the following flags in /proc/cpuinfo: pge invpcid
> 
> I added the following and booted with both variants:
> 
> printk("X86_FEATURE_PGE:%u\n", static_cpu_has(X86_FEATURE_PGE));
> printk("X86_FEATURE_INVPCID:%u\n", static_cpu_has(X86_FEATURE_INVPCID));
> 
> "-cpu host" gives:
> 
> [8.326117] X86_FEATURE_PGE:1
> [8.326381] X86_FEATURE_INVPCID:1
> 
> "-cpu kvm64" gives:
> 
> [8.517069] X86_FEATURE_PGE:1
> [8.517393] X86_FEATURE_INVPCID:0

That's the one which fails. So it's using the CR4 based flushing. Just ran
a test on a physical system with PGE=1 and INVPCID=0. Works fine.

Emulation problem?

Thanks,

tglx




Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

2017-03-09 Thread Thomas Gleixner
On Wed, 8 Mar 2017, Linus Torvalds wrote:

> Adding x86 people too, since this seems to be something off about
> ARCH_HAS_SET_MEMORY for x86-32.
> 
> The code seems to be shared between x86-32 and 64, I'm not seeing why
> set_memory_r[ow]() should fail on one but not the other.

Indeed.

> Considering that it seems to be flaky even on 32-bit, maybe it's
> timing-related, or possibly related to TLB sizes or whatever (ie more
> likely hidden by a larger TLB on more modern hardware?)

The only difference I can see is the way __flush_tlb_all() is
implemented. We have 3 variants:

   invpcid_flush_all()  - depends on X86_FEATURE_INVPCID and X86_FEATURE_PGE
   cr4 based flush  - depends on X86_FEATURE_PGE
   cr3 based flush

No idea which variant is used in that failure case.

> Anyway, just looking at change_page_attr_set_clr(), I notice that the
> page alias checking treats NX specially:
> 
> /* No alias checking for _NX bit modifications */
> checkalias = (pgprot_val(mask_set) | pgprot_val(mask_clr)) != 
> _PAGE_NX;
> 
> which seems insane. Why would NX be different from other protection
> bits (like _PAGE_RW)?

The reason is that the alias mapping should never be executable at all.

Thanks,

tglx
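
For reference, a sketch of the CR4 based variant mentioned above; toggling
CR4.PGE flushes the complete TLB including global pages (the helper name is
illustrative):

static void flush_tlb_all_cr4(void)
{
	unsigned long cr4 = __read_cr4();

	__write_cr4(cr4 & ~X86_CR4_PGE);	/* drops all entries, incl. global */
	__write_cr4(cr4);			/* restore PGE */
}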



Re: [net/bpf] 3051bf36c2 BUG: unable to handle kernel paging request at 0000a7cf

2017-03-09 Thread Thomas Gleixner
On Thu, 9 Mar 2017, Daniel Borkmann wrote:
> With regard to CPA_FLUSHTLB that Linus mentioned, when I investigated
> code paths in change_page_attr_set_clr(), I did see that CPA_FLUSHTLB
> was set each time we switched attrs and a cpa_flush_range() was
> performed (with the correct number of pages and cache set to 0). That
> would be a __flush_tlb_all() eventually.
> 
> Hmm, it indeed might seem likely that this could be an emulation bug.

Which variant of __flush_tlb_all() is used when the test fails?

Check for the following flags in /proc/cpuinfo: pge invpcid

Thanks,

tglx


net/3com/3c515: Fix timer handling, prevent leaks and crashes

2016-12-11 Thread Thomas Gleixner
The timer handling in this driver is broken in several ways:

- corkscrew_open() initializes and arms a timer before requesting the
  device interrupt. If the request fails the timer stays armed.

  A second call to corkscrew_open will unconditionally reinitialize the
  queued timer and arm it again. Also an immediate device removal will leave
  the timer queued because close() is not called (open() failed) and
  therefore nothing issues del_timer().

  The reinitialization corrupts the link chain in the timer wheel hash
  bucket and causes a NULL pointer dereference when the timer wheel tries
  to operate on that hash bucket. Immediate device removal lets the link
  chain poke into freed and possibly reused memory.

  Solution: Arm the timer after the successful irq request.

- corkscrew_close() uses del_timer()

  On close the timer is disarmed with del_timer() which lets the following
  code race against a concurrent timer expiry function.

  Solution: Use del_timer_sync() instead

- corkscrew_close() calls del_timer() unconditionally

  del_timer() is invoked even if the timer was never initialized. This
  works by chance because the struct containing the timer is zeroed at
  allocation time.

  Solution: Move the setup of the timer into corkscrew_setup().

Reported-by: Matthew Whitehead <tedheads...@gmail.com>
Signed-off-by: Thomas Gleixner <t...@linutronix.de>
Cc: Andy Lutomirski <l...@kernel.org>
Cc: netdev@vger.kernel.org
---
 drivers/net/ethernet/3com/3c515.c |   15 ---
 1 file changed, 8 insertions(+), 7 deletions(-)

--- a/drivers/net/ethernet/3com/3c515.c
+++ b/drivers/net/ethernet/3com/3c515.c
@@ -628,6 +628,8 @@ static int corkscrew_setup(struct net_de
 
spin_lock_init(&vp->lock);
 
+   setup_timer(&vp->timer, corkscrew_timer, (unsigned long) dev);
+
/* Read the station address from the EEPROM. */
EL3WINDOW(0);
for (i = 0; i < 0x18; i++) {
@@ -708,6 +710,7 @@ static int corkscrew_open(struct net_dev
 {
int ioaddr = dev->base_addr;
struct corkscrew_private *vp = netdev_priv(dev);
+   bool armtimer = false;
__u32 config;
int i;
 
@@ -732,12 +735,7 @@ static int corkscrew_open(struct net_dev
if (corkscrew_debug > 1)
pr_debug("%s: Initial media type %s.\n",
   dev->name, media_tbl[dev->if_port].name);
-
-   init_timer(&vp->timer);
-   vp->timer.expires = jiffies + media_tbl[dev->if_port].wait;
-   vp->timer.data = (unsigned long) dev;
-   vp->timer.function = corkscrew_timer;   /* timer handler */
-   add_timer(&vp->timer);
+   armtimer = true;
} else
dev->if_port = vp->default_media;
 
@@ -777,6 +775,9 @@ static int corkscrew_open(struct net_dev
return -EAGAIN;
}
 
+   if (armtimer)
+   mod_timer(&vp->timer, jiffies + media_tbl[dev->if_port].wait);
+
if (corkscrew_debug > 1) {
EL3WINDOW(4);
pr_debug("%s: corkscrew_open() irq %d media status %4.4x.\n",
@@ -1427,7 +1428,7 @@ static int corkscrew_close(struct net_de
dev->name, rx_nocopy, rx_copy, queued_packet);
}
 
-   del_timer(&vp->timer);
+   del_timer_sync(&vp->timer);
 
/* Turn off statistics ASAP.  We update lp->stats below. */
outw(StatsDisable, ioaddr + EL3_CMD);


Re: [PATCH v2 11/13] clocksource: export the clocks_calc_mult_shift to use by timestamp code

2016-11-29 Thread Thomas Gleixner
On Mon, 28 Nov 2016, Grygorii Strashko wrote:

> From: Murali Karicheri <m-kariche...@ti.com>
> 
> The CPSW CPTS driver is capable of timestamping tx/rx packets and
> requires the mult and shift factors for timestamp conversion from raw
> counter values to nanoseconds (ptp clock). Currently these mult and shift
> factors are calculated manually and provided through DT, which makes it
> very hard to support a large number of platforms, especially if the CPTS
> refclk is not the same for some kinds of boards and depends on efuse
> settings (Keystone 2 platforms). Hence, export clocks_calc_mult_shift() to
> allow drivers like CPSW CPTS (and other ptp drivers) to benefit from
> automatic calculation of the mult and shift factors.
> 
> Cc: John Stultz <john.stu...@linaro.org>
> Cc: Thomas Gleixner <t...@linutronix.de>
> Signed-off-by: Murali Karicheri <m-kariche...@ti.com>
> Signed-off-by: Grygorii Strashko <grygorii.stras...@ti.com>

Acked-by: Thomas Gleixner <t...@linutronix.de>

> ---
>  kernel/time/clocksource.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 7e4fad7..150242c 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -89,6 +89,7 @@ clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 
> to, u32 maxsec)
>   *mult = tmp;
>   *shift = sft;
>  }
> +EXPORT_SYMBOL_GPL(clocks_calc_mult_shift);
>  
>  /*[Clocksource internal variables]-
>   * curr_clocksource:
> -- 
> 2.10.1
> 
> 
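
For illustration, this is roughly how a driver can use the exported helper once
it knows its reference clock rate; refclk_hz and cycle_delta are made-up names:

	u32 mult, shift;
	u64 ns;

	/* factors to convert refclk_hz cycles to nanoseconds, valid for
	   deltas of up to ~10 seconds */
	clocks_calc_mult_shift(&mult, &shift, refclk_hz, NSEC_PER_SEC, 10);

	/* later: convert a raw counter delta to nanoseconds */
	ns = ((u64)cycle_delta * mult) >> shift;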


Re: [PATCH] net/iucv: use explicit clean up labels in iucv_init()

2016-11-28 Thread Thomas Gleixner
On Mon, 28 Nov 2016, David Miller wrote:
> From: Sebastian Andrzej Siewior 
> Date: Thu, 24 Nov 2016 17:10:13 +0100
> 
> > Ursula suggested to use explicit labels for clean up in the error path
> > instead of one `out_free' label which handles multiple exits.
> > Since the previous patch got already applied, here is a follow up patch.
> > 
> > Signed-off-by: Sebastian Andrzej Siewior 
> 
> "Previous patch" doesn't tell readers anything specific enough to identify
> the change you are referring to.  This will be even more true years down
> the line if someone tries to read this commit and figure out what you
> are referring to.
> 
> We have a standard mechanism to refer to commits, via SHA1_ID and commit
> header line text, please use it.

I amended the commit message.

Thanks,

tglx


Re: [BUG] X86: Removing inline decl on arch/x86/include/asm/desc.h.

2016-11-16 Thread Thomas Gleixner
On Tue, 15 Nov 2016, Corcodel Marian wrote:
>  Inline declarations suppress warning messages from the compiler, but
>  most of these functions are declared static and are not used locally in the file.

Huch? This is a header file and the functions are marked inline on purpose.

Can you please explain what you are trying to achieve and why you think
that this is a good idea?

Thanks,

tglx





Re: [PATCH v3 0/4] make POSIX timers optional with some Kconfig help

2016-11-08 Thread Thomas Gleixner
On Mon, 7 Nov 2016, Nicolas Pitre wrote:

> Many embedded systems don't need the full POSIX timer support.
> Configuring them out provides a nice kernel image size reduction.
> 
> When POSIX timers are configured out, the PTP clock subsystem should be
> left out as well. However a bunch of ethernet drivers currently *select*
> the later in their Kconfig entries. Therefore some more work was needed
> to break that hard dependency from those drivers without preventing their
> usage altogether.
> 
> Therefore this series also includes kconfig changes to implement a new
> keyword to express some reverse dependencies like "select" does, named
> "imply", and still allowing for the target config symbol to be disabled
> if the user or a direct dependency says so.
> 
> At this point I'd like to gather ACKs especially from people in the "To"
> field. Ideally this would need to go upstream as a single series to avoid
> cross subsystem dependency issues.  So far it was suggested that this should 
> go
> via the kbuild tree.

For the whole series:

Acked-by: Thomas Gleixner <t...@linutronix.de>


Re: [PATCH 3/4] ptp_clock: allow for it to be optional

2016-10-20 Thread Thomas Gleixner
On Thu, 20 Oct 2016, Nicolas Pitre wrote:
> On Thu, 20 Oct 2016, Thomas Gleixner wrote:
> 
> > On Wed, 19 Oct 2016, Nicolas Pitre wrote:
> > > The pch_gbe driver is a bit special as it relies on extra code in
> > > drivers/ptp/ptp_pch.c. Therefore we let the make process descend into
> > > drivers/ptp/ even if PTP_1588_CLOCK is unselected.
> > 
> > The above paragraph looks like a leftover of the previous patch set.
> 
> Not really.  Without the change to drivers/Makefile, drivers/ptp/ is not 
> visited when CONFIG_PTP_1588_CLOCK=n. If you then have CONFIG_PCH_GBE=y 
> you end up with:
> 
> drivers/built-in.o: In function `pch_gbe_ioctl':
> pch_gbe_main.c:(.text+0x28c914): undefined reference to `pch_ch_control_write'
> 
> Hence the above paragraph.

Indeed. I misread that.

Thanks,

tglx


Re: [PATCH 3/4] ptp_clock: allow for it to be optional

2016-10-20 Thread Thomas Gleixner
On Wed, 19 Oct 2016, Nicolas Pitre wrote:
> The pch_gbe driver is a bit special as it relies on extra code in
> drivers/ptp/ptp_pch.c. Therefore we let the make process descend into
> drivers/ptp/ even if PTP_1588_CLOCK is unselected.

The above paragraph looks like a leftover of the previous patch set.
 
Thanks,

tglx


Re: [PATCH 0/4] make POSIX timers optional with some Kconfig help

2016-10-20 Thread Thomas Gleixner
On Wed, 19 Oct 2016, Nicolas Pitre wrote:
> Therefore this series also includes kconfig changes to implement a new
> keyword to express some reverse dependencies like "select" does, named
> "imply", and still allowing for the target config symbol to be disabled
> if the user or a direct dependency says so.

That's really nice work! Thanks for doing that. It makes the whole thing
more palatable.

Thanks,

tglx


Re: [PATCH v2 0/2] make POSIX timers optional

2016-09-21 Thread Thomas Gleixner
On Tue, 20 Sep 2016, Nicolas Pitre wrote:
> There are way more drivers than subsystems and if you had to go around 
> unselecting all NIC drivers for CONFIG_ETHERNET to be turned off, and 
> with CONFIG_ETHERNET=n you'd finally be able to turn networking off, 
> then this would be a nightmare.

It's also a nightmare to try to enable something which depends on a chain
of config dependencies. You search A and see that it depends on B. Now you
try to enable B and find out it depends on C. The deepest chain I found so
far was 5 levels deep. That's horrible.

> IMHO it is much nicer for the poor user configuring the kernel to have a 
> single configuration prompt for PTP support, and then have whatever 
> driver that can provide a PTP clock just do it (or omit it) based on 
> that single prompt.  Prompting for PTP support for each individual 
> ethernet driver is silly.

No, it's not. PTP support in a NIC can require a substantial amount of code
and adds overhead in the hotpath. So if you have two NICs on your system
and only use PTP on one of them then configuring out PTP for the other is
desired.

And talking about nightmares. With your patch I can now end up with the
following configuration:

CONFIG_NETWORK_DRIVER=y
CONFIG_PTP_1588_CLOCK=m

which makes the network driver to use the stub functions due to
IS_REACHABLE(). That's going to be an utter nightmare for the poor user to
figure out why PTP suddenly does not work anymore. That's simply crap.

Thanks,

tglx


Re: [PATCH v2 0/2] make POSIX timers optional

2016-09-20 Thread Thomas Gleixner
On Tue, 20 Sep 2016, Nicolas Pitre wrote:
> On Tue, 20 Sep 2016, Richard Cochran wrote:
> 
> > On Tue, Sep 20, 2016 at 10:25:56PM +0200, Richard Cochran wrote:
> > > After this series, if I don't pay enough attention to dmesg, then I
> > > have lost functionality that I had in step #1.  That sucks, and it has
> > > nothing to do with the tinification option at all.  It will bite even
> > > if I have no knowledge of it.  That isn't acceptable to me.
> > 
> > Can't you leave all the "select PTP_1588_CLOCK" alone and simply add
> > 
> > #ifdef CONFIG_POSIX_TIMERS
> > // global declarations
> > #else
> > // static inlines
> > #endif

Eew. No! That's an even more blatant layering violation.
 
> > to ptp_clock_kernel.h, and then sandwich ptp_clock.c in
> > #ifdef CONFIG_POSIX_TIMERS ... #endif ?
> 
> Sure I could... but I'm sure I'll be flamed by others for making things 
> even more obscure and hackish than they are right now.

I think the whole approach is wrong because it makes the PTP split at the
wrong level.

Currently we have:

  DRIVER_X
  tristate "Driver X"
  select PTP

In order to make POSIX_CLOCK configurable we should have

  PTP
  tristate "PTP"
  select POSIX_CLOCK

Now if you want to disentangle PTP from a driver then you split it at the
driver level and not at the PTP level:

  DRIVER_X
  tristate "Driver X"

  DRIVER_X_PTP
  bool "Enable PTP support"
  default y if !MAKE_IT_TINY
  depends on DRIVER_X
  select PTP

We already have drivers following that scheme. That way you make the PTP
support in the driver conditional on DRIVER_X_PTP and have no hassle with
modules and dependencies.

Your tiny config can simply disable all the PTP extra bits and then you can
disable PTP and finally POSIX_TIMERS.

Thanks,

tglx
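
On the driver side such a split usually boils down to providing stubs when the
new option is off, e.g. (symbol and names hypothetical):

#if IS_ENABLED(CONFIG_DRIVER_X_PTP)
int driver_x_ptp_init(struct driver_x_priv *priv);
#else
static inline int driver_x_ptp_init(struct driver_x_priv *priv) { return 0; }
#endif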


Re: [e1000_netpoll] BUG: sleeping function called from invalid context at kernel/irq/manage.c:110

2016-07-28 Thread Thomas Gleixner
On Thu, 28 Jul 2016, Sabrina Dubroca wrote:
> 2016-07-28, 07:43:55 +0200, Eric Dumazet wrote:
> > I would prefer having a definitive advice from Thomas Gleixner and/or
> > others if disable_irq() is forbidden from IRQ path.

Yes it is. Before we added threaded interrupt handlers it was not an issue,
but with (possibly) threaded interrupts it's an absolute no-no.

> > As I said, about all netpoll() methods in net drivers use disable_irq()
> > so a lot of patches would be needed.
> > 
> > disable_irq() should then test this condition earlier, so that we can
> > detect potential bug, even if the IRQ is not (yet) threaded.
> 
> The idea when this first came up was to skip the sleeping part of
> disable_irq():
> 
> http://marc.info/?l=linux-netdev=142314159626052
> 
> This fell off my todolist and I didn't send the conversion patches,
> which would basically look like this:
>
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
> b/drivers/net/ethernet/intel/e1000e/netdev.c
> index 41f32c0b341e..b022691e680b 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -6713,20 +6713,20 @@ static irqreturn_t e1000_intr_msix(int 
> __always_unused irq, void *data)
>  
>   vector = 0;
>   msix_irq = adapter->msix_entries[vector].vector;
> - disable_irq(msix_irq);
> - e1000_intr_msix_rx(msix_irq, netdev);
> + if (disable_hardirq(msix_irq))
> + e1000_intr_msix_rx(msix_irq, netdev);
>   enable_irq(msix_irq);

That'll work nicely even when one of the affected interrupts is threaded.

Thanks,

tglx


Re: [e1000_netpoll] BUG: sleeping function called from invalid context at kernel/irq/manage.c:110

2016-07-28 Thread Thomas Gleixner
On Tue, 26 Jul 2016, nick wrote:
> diff --git a/drivers/net/ethernet/intel/e1000/e1000_main.c 
> b/drivers/net/ethernet/intel/e1000/e1000_main.c
> index f42129d..e1830af 100644
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -3797,7 +3797,7 @@ static irqreturn_t e1000_intr(int irq, void *data)
>   hw->get_link_status = 1;
>   /* guard against interrupt when we're going down */
>   if (!test_bit(__E1000_DOWN, &adapter->flags))
> - schedule_delayed_work(&adapter->watchdog_task, 1);
> + mod_work(&adapter->watchdog_task, jiffies + 1);

And that's not even funny anymore. Are you using a random generator to create
these patches?




Re: [e1000_netpoll] BUG: sleeping function called from invalid context at kernel/irq/manage.c:110

2016-07-26 Thread Thomas Gleixner
On Tue, 26 Jul 2016, Fengguang Wu wrote:
> --- a/drivers/net/ethernet/intel/e1000/e1000_main.c
> +++ b/drivers/net/ethernet/intel/e1000/e1000_main.c
> @@ -3797,7 +3797,7 @@ static irqreturn_t e1000_intr(int irq, void *data)
>   hw->get_link_status = 1;
>   /* guard against interrupt when we're going down */
>   if (!test_bit(__E1000_DOWN, &adapter->flags))
> - schedule_delayed_work(&adapter->watchdog_task, 1);
> + mod_timer(&adapter->watchdog_timer, jiffies + 1);

ROTFL 



Re: [RFC PATCH 0/2] net: threadable napi poll loop

2016-05-10 Thread Thomas Gleixner
On Tue, 10 May 2016, Paolo Abeni wrote:

Nice patch set and very promising results! 

> At this point we are not really sure if we should go with this simpler
> approach by putting NAPI itself into kthreads or leverage the threadirqs
> function by putting the whole interrupt into a thread and signaling NAPI
> that it does not reschedule itself in a softirq but to simply run at
> this particular context of the interrupt handler.
> 
> While the threaded irq way seems to better integrate into the kernel and
> also other devices could move their interrupts into the threads easily
> on a common policy, we don't know how to really express the necessary
> knobs with the current device driver model (module parameters, sysfs
> attributes, etc.). This is where we would like to hear some opinions.
> NAPI would e.g. have to query the kernel if the particular IRQ/MSI if it
> should be scheduled in a softirq or in a thread, so we don't have to
> rewrite all device drivers. This might even be needed on a per rx-queue
> granularity.

Utilizing threaded irqs should be halfway simple even without touching the
device driver at all.

We can do the switch to threading in two ways:

1) Let the driver request the interrupt(s) as it does now and then have a
   /proc/irq/NNN/threaded file which converts it to a threaded interrupt on
   the fly. That should be fairly trivial.

2) Let the driver request the interrupt(s) as it does now and retrieve the
   interrupt number which belongs to the device/queue from the network core
   and let the irq core switch it over to threaded.

So the interrupt flow of the device would be:

interrupt
IRQ_WAKE_THREAD

irq thread()
{ 
local_bh_disable();
action->thread_fn(action->irq, action->dev_id); <-- driver handler
irq_finalize_oneshot(desc, action);
local_bh_enable();
}

The driver irq handler calls napi_schedule(). So if your napi_struct is
flagged POLL_IRQ_THREAD then you can call your polling machinery from there
instead of raising the softirq.

You surely need some way to figure out whether the interrupt is threaded when
you set up the device in order to flag your napi struct, but that should be
not too hard to achieve.

Thanks,

tglx
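
A rough sketch of the NAPI side of that idea; the POLL_IRQ_THREAD state bit and
the function name are hypothetical and only show where the decision between
softirq and direct polling would be made:

static void driver_rx_thread_poll(struct napi_struct *n)
{
	if (test_bit(NAPI_STATE_POLL_IRQ_THREAD, &n->state)) {
		/* irq thread context; BH already disabled by the irq core */
		n->poll(n, n->weight);
	} else {
		/* normal path: defer to the NET_RX softirq */
		napi_schedule(n);
	}
}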



   






Re: [PATCH] net: mvneta: Add missing hotplug notifier transition

2016-03-11 Thread Thomas Gleixner
Gregory,

On Fri, 11 Mar 2016, Gregory CLEMENT wrote:
>  On ven., mars 11 2016, Anna-Maria Gleixner  wrote:
> 
> > The mvneta_percpu_notifier() hotplug callback lacks handling of the
> > CPU_DOWN_FAILED case. That means, if CPU_DOWN_PREPARE failes, the
> > driver is not well configured on the CPU.
> >
> > Add handling for CPU_DOWN_FAILED[_FROZEN] hotplug notifier transition
> > to setup the driver.
> 
> I agree that we need to handle this case, however reusing CPU_ONLINE
> case for it seems too much.

Why is that too much? If it does the job, then it's definitely preferable
over an extra case for a non-hotpath error handling operation.

Thanks,

tglx
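
A sketch of that reuse in terms of the old-style hotplug notifier; the actual
per-CPU setup calls are left out as placeholders:

static int mvneta_percpu_notifier(struct notifier_block *nfb,
				  unsigned long action, void *hcpu)
{
	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_ONLINE:
	case CPU_DOWN_FAILED:
		/* DOWN_PREPARE failed, the CPU stays up: run the same
		 * per-CPU (re)configuration as for the online case */
		break;
	case CPU_DOWN_PREPARE:
		/* quiesce the per-CPU queues and interrupts */
		break;
	}
	return NOTIFY_OK;
}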


Re: [PATCH 6/8] x86: tsc: Always Running Timer (ART) correlated clocksource

2016-03-03 Thread Thomas Gleixner
On Wed, 2 Mar 2016, John Stultz wrote:

Subject: x86: tsc: Always Running ...

Please make that:

Subject: x86/tsc: Always Running ...

>  
> +#else
> +
> +#define detect_art()

Inline stub, if at all. Why does detect_art() sit under CONFIG_CPU_FREQ while
the rest of the ART code does not?

Thanks

tglx


Re: Softirq priority inversion from "softirq: reduce latencies"

2016-02-29 Thread Thomas Gleixner
On Mon, 29 Feb 2016, Peter Hurley wrote:
> On 02/29/2016 10:24 AM, Eric Dumazet wrote:
> >> Just to be clear
> >>
> >>if (time_before(jiffies, end) && !need_resched() &&
> >>--max_restart)
> >>goto restart;
> >>
> >> aborts softirq *even if 0ns have elapsed*, if NET_RX has woken a process.
> > 
> > Sure, now remove the 1st and 2nd condition.
> 
> Well just removing the 2nd condition has everything working fine,
> because that fixes the priority inversion.

No. It does not fix anything. It hides the shortcomings of the driver.
 
> However, when system resources are _not_ contended, it makes no
> sense to be forced to revert to ksoftirqd resolution, which is strictly
> intended as fallback.

No. You claim it is simply because your driver does not handle that situation
properly.
 
> Or flipping your argument on its head, why not just _always_ execute
> softirq in ksoftirqd?

Which is what that change effectively does. And that makes a lot of sense,
because you get the softirq load under scheduler control and do not let the
softirq run as a context stealing entity which is completely uncontrollable by
the scheduler.

Running the softirq on return from interrupt can cause real priority
inversions.

Thanks,

tglx


Re: [PATCH v8 6/8] x86: tsc: Always Running Timer (ART) correlated clocksource

2016-02-24 Thread Thomas Gleixner
On Mon, 22 Feb 2016, Christopher S. Hall wrote:

> On modern Intel systems TSC is derived from the new Always Running Timer
> (ART). ART can be captured simultaneous to the capture of
> audio and network device clocks, allowing a correlation between timebases
> to be constructed. Upon capture, the driver converts the captured ART
> value to the appropriate system clock using the correlated clocksource
> mechanism.
> 
> On systems that support ART a new CPUID leaf (0x15) returns parameters
> “m” and “n” such that:
> 
> TSC_value = (ART_value * m) / n + k [n >= 2]
> 
> [k is an offset that can adjusted by a privileged agent. The
> IA32_TSC_ADJUST MSR is an example of an interface to adjust k.
> See 17.14.4 of the Intel SDM for more details]
> 
> Signed-off-by: Christopher S. Hall <christopher.s.h...@intel.com>
> [jstultz: Tweaked to fix build issue, also reworked math for
> 64bit division on 32bit systems]
> Signed-off-by: John Stultz <john.stu...@linaro.org>

Reviewed-by: Thomas Gleixner <t...@linutronix.de>
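
The quoted relation maps to a conversion along these lines; a sketch only, with
parameter names following the commit message (mul_u64_u32_div() keeps the 64x32
multiplication from overflowing):

static u64 art_to_tsc(u64 art, u32 m, u32 n, u64 k)
{
	/* TSC = (ART * m) / n + k, per CPUID leaf 0x15 */
	return mul_u64_u32_div(art, m, n) + k;
}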

Re: [PATCH v8 5/8] time: Add history to cross timestamp interface supporting slower devices

2016-02-24 Thread Thomas Gleixner
On Mon, 22 Feb 2016, Christopher S. Hall wrote:
> +{
> + struct timekeeper *tk = &tk_core.timekeeper;
> + bool interp_forward;
> + u64 corr_raw, corr_real;
> + int ret;

Once more:

struct timekeeper *tk = &tk_core.timekeeper;
u64 corr_raw, corr_real;
bool interp_forward;
int ret;

Is way simpler to parse fast.

>  {
>   struct timekeeper *tk = &tk_core.timekeeper;
> @@ -929,6 +1046,12 @@ int get_device_system_crosststamp(int (*get_time_fn)
>   ktime_t base_real;
>   s64 nsec_raw;
>   s64 nsec_real;
> + cycles_t cycles;
> + cycle_t now;
> + cycle_t interval_start;
> + unsigned int clock_was_set_seq;
> + u8 cs_was_changed_seq;
> + bool do_interp;

Single lines for same types .....

Other than that:

Reviewed-by: Thomas Gleixner <t...@linutronix.de>


Re: [PATCH v8 4/8] time: Add driver cross timestamp interface for higher precision time synchronization

2016-02-24 Thread Thomas Gleixner
On Mon, 22 Feb 2016, Christopher S. Hall wrote:
> +int get_device_system_crosststamp(int (*get_time_fn)
> +   (ktime_t *device_time,
> +struct system_counterval_t *sys_counterval,
> +void *ctx),
> +   void *ctx,
> +   struct system_device_crosststamp *xtstamp)
> +{
> + struct timekeeper *tk = &tk_core.timekeeper;
> + unsigned long seq;
> + struct system_counterval_t system_counterval;
> + ktime_t base_raw;
> + ktime_t base_real;
> + s64 nsec_raw;
> + s64 nsec_real;

Single lines for variables of the same type, ordered by descending length,
which makes it simpler to parse, please.

Other than that: Reviewed-by: Thomas Gleixner <t...@linutronix.de>


Re: [PATCH v8 2/8] time: Add timekeeping snapshot code capturing system time and counter

2016-02-24 Thread Thomas Gleixner
On Mon, 22 Feb 2016, Christopher S. Hall wrote:

> In the current timekeeping code there isn't any interface to
> atomically capture the current relationship between the system counter
> and system time. ktime_get_snapshot() returns this triple (counter,
> monotonic raw, realtime) in the system_time_snapshot struct.

> +/**
> + * ktime_get_snapshot - snapshots the realtime/monotonic raw clocks with 
> counter
> + * @systime_snapshot:pointer to struct receiving the system time 
> snapshot
> + */
> +void ktime_get_snapshot(struct system_time_snapshot *systime_snapshot)
> +{
> + struct timekeeper *tk = &tk_core.timekeeper;
> + unsigned long seq;
> + ktime_t base_raw;
> + ktime_t base_real;
> + s64 nsec_raw;
> + s64 nsec_real;

I'd prefer to have the variables of the same type in a single line. Other than
that:

Reviewed-by: Thomas Gleixner <t...@linutronix.de>


Re: [PATCH v8 3/8] time: Remove duplicated code in ktime_get_raw_and_real()

2016-02-24 Thread Thomas Gleixner
On Mon, 22 Feb 2016, Christopher S. Hall wrote:

> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -888,6 +888,8 @@ void ktime_get_snapshot(struct system_time_snapshot 
> *systime_snapshot)
>   s64 nsec_real;
>   cycle_t now;
>  
> + WARN_ON(timekeeping_suspended);

WARN_ON_ONCE please

Other than that: Reviewed-by: Thomas Gleixner <t...@linutronix.de>


Re: [PATCH v8 1/8] time: Add cycles to nanoseconds translation

2016-02-24 Thread Thomas Gleixner
On Mon, 22 Feb 2016, Christopher S. Hall wrote:

> The timekeeping code does not currently provide a way to translate
> externally provided clocksource cycles to system time. The cycle count
> is always provided by the result clocksource read() method internal to
> the timekeeping code. The added function timekeeping_cycles_to_ns()
> calculated a nanosecond value from a cycle count that can be added to
> tk_read_base.base value yielding the current system time. This allows
> clocksource cycle values external to the timekeeping code to provide a
> cycle count that can be transformed to system time.

Reviewed-by: Thomas Gleixner <t...@linutronix.de>
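
For reference, the conversion the description talks about has the usual
timekeeper shape; this sketch follows the struct tk_read_base fields and is not
the exact helper added by the patch:

static u64 cycles_to_ns(const struct tk_read_base *tkr, u64 cycles)
{
	u64 delta = (cycles - tkr->cycle_last) & tkr->mask;

	return (delta * tkr->mult + tkr->xtime_nsec) >> tkr->shift;
}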


Re: Asterisk deadlocks since Kernel 4.1

2015-11-17 Thread Thomas Gleixner
On Tue, 17 Nov 2015, Stefan Priebe wrote:
> I've now also two gdb backtraces from two crashes:
> http://pastebin.com/raw.php?i=yih5jNt8
> 
> http://pastebin.com/raw.php?i=kGEcvH4T

They don't tell me anything as I have no idea of the inner workings of
asterisk. You might be better off talking to the asterisk folks to help
you track down what that thing is waiting for, so we can actually look
at a well defined area.

Thanks,

tglx


Re: [PATCH] bpf: convert hashtab lock to raw lock

2015-11-02 Thread Thomas Gleixner
On Sun, 1 Nov 2015, Alexei Starovoitov wrote:
> On Sat, Oct 31, 2015 at 09:47:36AM -0400, Steven Rostedt wrote:
> > On Fri, 30 Oct 2015 17:03:58 -0700
> > Alexei Starovoitov  wrote:
> > 
> > > On Fri, Oct 30, 2015 at 03:16:26PM -0700, Yang Shi wrote:
> > > > When running bpf samples on rt kernel, it reports the below warning:
> > > > 
> > > > BUG: sleeping function called from invalid context at 
> > > > kernel/locking/rtmutex.c:917
> > > > in_atomic(): 1, irqs_disabled(): 128, pid: 477, name: ping
> > > > Preemption disabled at:[] kprobe_perf_func+0x30/0x228 
> > > >  
> > > ...
> > > > diff --git a/kernel/bpf/hashtab.c b/kernel/bpf/hashtab.c
> > > > index 83c209d..972b76b 100644
> > > > --- a/kernel/bpf/hashtab.c
> > > > +++ b/kernel/bpf/hashtab.c
> > > > @@ -17,7 +17,7 @@
> > > >  struct bpf_htab {
> > > > struct bpf_map map;
> > > > struct hlist_head *buckets;
> > > > -   spinlock_t lock;
> > > > +   raw_spinlock_t lock;  
> > > 
> > > How do we address such things in general?
> > > I bet there are tons of places around the kernel that
> > > call spin_lock from atomic.
> > > I'd hate to lose the benefits of lockdep of non-raw spin_lock
> > > just to make rt happy.
> > 
> > You wont lose any benefits of lockdep. Lockdep still checks
> > raw_spin_lock(). The only difference between raw_spin_lock and
> > spin_lock is that in -rt spin_lock turns into an rt_mutex() and
> > raw_spin_lock stays a spin lock.
> 
> I see. The patch makes sense then.
> Would be good to document this peculiarity of spin_lock.

I'm working on a document.

Thanks,

tglx
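
For completeness, the call sites change along with the lock type; a sketch of
the pattern, with the surrounding hash table code only illustrative:

	unsigned long flags;

	/* the irqsave variant keeps working from the kprobe/perf context
	   the splat above was triggered from */
	raw_spin_lock_irqsave(&htab->lock, flags);
	hlist_add_head_rcu(&l_new->hash_node, head);
	raw_spin_unlock_irqrestore(&htab->lock, flags);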


Re: [PATCH v4 1/4] Produce system time from correlated clocksource

2015-10-21 Thread Thomas Gleixner
On Tue, 20 Oct 2015, John Stultz wrote:

> On Tue, Oct 20, 2015 at 12:11 PM, Thomas Gleixner <t...@linutronix.de> wrote:
> > On Tue, 20 Oct 2015, Richard Cochran wrote:
> >> On Tue, Oct 20, 2015 at 01:51:13PM +0200, Richard Cochran wrote:
> >> > You can, in fact, achieve "proper" correlation by sampling.  As John
> >> > said, the question is whether the method in the patch set "measurably
> >> > improves the error" over using another, simpler method.
> >>
> >> Here is a short example to put some numbers on the expected error.
> >> Let the driver sample at an interval of 1 ms.  If the system time's
> >> frequency hasn't changed between two samples, A and B, then the driver
> >> may interpolate without introducing any error.
> >
> > Darn, we don't want to have that kind of sampling in every driver
> > which has this kind of problem even if it looks like the simpler
> > choice for this particular use case. This is going to be something
> > which next generation chips will have on more than just the audio
> > interface and we really want to have a generic solution for this.
> 
> I sort of agree with Richard that the timekeeper history approach
> doesn't seem like a generic solution here.
 
I'm not pushing that approach. I just want a generic facility of some
sort to solve that.

> And again, you seem to be speaking with a bigger picture in mind that
> at least I don't yet share (apologies for being thick headed here).
> Being able to have various hardware sharing a time base is quite
> useful, and methods for correlating timestamps together are useful.
> But I don't yet really understand why its important that we can
> translate a hardware timestamp from some time in the past to the
> correct system time in the past without error.

If your device can only provide timestamps from the past, then having
access to the history is important if you want to have precise
correlation.

Thanks,

tglx
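
As an illustration of why history matters here, a minimal sketch of
translating a device timestamp captured in the past, assuming a short
array of (device counter, system time) correlation points is kept.  The
structure and function names are hypothetical, not the proposed kernel
interface; div64_u64() is the regular kernel helper from linux/math64.h,
and overflow of the intermediate multiplication is ignored for brevity:

#include <linux/types.h>
#include <linux/math64.h>

struct corr_point {
        u64     dev_cycles;     /* e.g. ART count at the snapshot */
        u64     sys_ns;         /* system time at the same instant */
};

/* history[] is ordered oldest to newest and has n >= 2 entries */
static u64 dev_cycles_to_sys_ns(const struct corr_point *history, int n,
                                u64 dev_cycles)
{
        const struct corr_point *a, *b;
        int i;

        /* older than the kept history: clamp to the oldest snapshot */
        if (dev_cycles < history[0].dev_cycles)
                dev_cycles = history[0].dev_cycles;

        /* find the pair of snapshots that brackets the device timestamp */
        for (i = 1; i < n - 1; i++) {
                if (dev_cycles <= history[i].dev_cycles)
                        break;
        }
        a = &history[i - 1];
        b = &history[i];

        if (b->dev_cycles == a->dev_cycles)
                return a->sys_ns;

        /* linear interpolation between the two bracketing snapshots */
        return a->sys_ns + div64_u64((dev_cycles - a->dev_cycles) *
                                     (b->sys_ns - a->sys_ns),
                                     b->dev_cycles - a->dev_cycles);
}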


Re: [PATCH v4 1/4] Produce system time from correlated clocksource

2015-10-20 Thread Thomas Gleixner
On Tue, 20 Oct 2015, Richard Cochran wrote:

> On Mon, Oct 19, 2015 at 05:36:56PM -0700, John Stultz wrote:
> > If we're only tracking 4ms of history, how does this solution
> > measurably improve the error over using the timestamps to generate
> > MONOTONIC_RAW clock deltas (which doesn't require keeping any history)
> > and using getnstime_raw_and_real to take an anchor point to calculate
> > the delta from?  Why is adding complexity necessary?
> 
> This idea is variant of what I suggested in another reply in this
> thread.  To my understanding, there is no need at all to keep a
> history arbitrarily 4 ms long.  Instead, the DSP driver (or whoever
> else may need such a thing) can simply sample the system time at the
> rate needed for that particular application.

That's complete nonsense. The whole point is to have a proper
correlation from ART/audio timestamps to system time. Sampling system
time does not help in any way.

Thanks,

tglx


Re: [PATCH v4 1/4] Produce system time from correlated clocksource

2015-10-20 Thread Thomas Gleixner
On Tue, 20 Oct 2015, Richard Cochran wrote:
> On Tue, Oct 20, 2015 at 01:51:13PM +0200, Richard Cochran wrote:
> > You can, in fact, achieve "proper" correlation by sampling.  As John
> > said, the question is whether the method in the patch set "measurably
> > improves the error" over using another, simpler method.
> 
> Here is a short example to put some numbers on the expected error.
> Let the driver sample at an interval of 1 ms.  If the system time's
> frequency hasn't changed between two samples, A and B, then the driver
> may interpolate without introducing any error.

Darn, we don't want to have that kind of sampling in every driver
which has this kind of problem even if it looks like the simpler
choice for this particular use case. This is going to be something
which next generation chips will have on more than just the audio
interface and we really want to have a generic solution for this.

Thanks,

tglx


Re: [PATCH v4 1/4] Produce system time from correlated clocksource

2015-10-15 Thread Thomas Gleixner
On Wed, 14 Oct 2015, Christopher Hall wrote:
> On Tue, 13 Oct 2015 12:42:52 -0700, Thomas Gleixner <t...@linutronix.de>
> wrote:
> > On Mon, 12 Oct 2015, Christopher S. Hall wrote:
> > > audio.
> > 
> > This wants to be a separate patch, really.
> 
> OK. This makes sense, I'll do this the next time.
> 
> > > +/* This needs to be 3 or greater for backtracking to be useful */
> > 
> > Why?
> 
> The current index points to a copy and the next one may be being changed by
> update_wall_time(), leaving n-2 entries with useful history available.
> I'll add more descriptive comments here.
> 
> > 
> > > +#define SHADOW_HISTORY_DEPTH 7
> > 
> > And that number is 7 because?
> 
> Because the depth is rounded up to a power of 2, it will be 8 instead. As above,
> the useful history is (8 - 2) * 1 ms = 6 ms (1 ms is the minimum jiffy length).
> An array size of 4 would not leave enough history for the DSP, which requires
> 4 ms of history in the worst case.

And how exactly does 7 magically become 8?
 
> > 
> > What's the point of this? Why are you not making the few lines which
> > you can actually reuse a helper function and leave the PTP code alone?
> 
> The audio driver is structured in such a way that it's simpler to provide a
> value rather than a callback.  I changed this to allow the audio developers to
> provide an ART value as input.  If a callback is provided, the resulting
> counter value is guaranteed to be later than cycle_last and there is no need
> to do extra checking (the goto skips that check).  Is this an answer to your
> question?

Make it a separate function which the caller can hand the information to, and
leave the PTP-specific sample/conversion function alone.

Thanks,

tglx
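
For what it's worth, a sketch of the ring-buffer bookkeeping argued about
above, with the depth rounded up to a power of two so the index can wrap
with a cheap mask.  The names, the struct layout and the depth of 8 are
illustrative only, not the actual patch; of the 8 slots, two are in flight
at any moment (the current copy and the one update_wall_time() may be
rewriting), leaving 6 usable entries, i.e. 6 ms of history at a 1 ms
minimum jiffy:

#include <linux/types.h>

#define SHADOW_HISTORY_DEPTH    8       /* power of 2; 8 slots, 2 in flight, 6 usable */

struct tk_snapshot {
        u64     cycles;         /* clocksource count at the update */
        u64     sys_ns;         /* system time at the same update */
};

static struct tk_snapshot shadow_hist[SHADOW_HISTORY_DEPTH];
static unsigned int shadow_head;        /* next slot to be written */

/* called from the timekeeping update path, e.g. once per tick */
static void shadow_record(u64 cycles, u64 sys_ns)
{
        unsigned int idx = shadow_head & (SHADOW_HISTORY_DEPTH - 1);

        shadow_hist[idx].cycles = cycles;
        shadow_hist[idx].sys_ns = sys_ns;
        shadow_head++;
}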

