from:"Benjamin Herrenschmidt"

Re: [PATCH net-next v4 2/2] of: net: fix of_get_mac_addr_nvmem() for non-platform devices

2021-04-15 Thread Benjamin Herrenschmidt

On Mon, 2021-04-12 at 19:47 +0200, Michael Walle wrote:
> 
>  /**
>   * of_get_phy_mode - Get phy mode for given device_node
> @@ -59,15 +60,39 @@ static int of_get_mac_addr(struct device_node *np, const 
> char *name, u8 *addr)
>  static int of_get_mac_addr_nvmem(struct device_node *np, u8 *addr)
>  {
> struct platform_device *pdev = of_find_device_by_node(np);
> +   struct nvmem_cell *cell;
> +   const void *mac;
> +   size_t len;
> int ret;
>  
> -   if (!pdev)
> -   return -ENODEV;
> +   /* Try lookup by device first, there might be a nvmem_cell_lookup
> +* associated with a given device.
> +*/
> +   if (pdev) {
> +   ret = nvmem_get_mac_address(>dev, addr);
> +   put_device(>dev);
> +   return ret;
> +   }
> +

This smells like the wrong band aid :)

Any struct device can contain an OF node pointer these days.

This seems all backwards. I think we are dealing with bad evolution.

We need to do a lookup for the device because we get passed an of_node.
We should just get passed a device here... or rather stop calling
of_get_mac_addr() from all those drivers and instead call
eth_platform_get_mac_address() which in turns calls of_get_mac_addr().

Then the nvmem stuff gets put in eth_platform_get_mac_address().

of_get_mac_addr() becomes a low-level thingy that most drivers don't
care about.

Cheers,
Ben.

Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue

2020-10-26 Thread Benjamin Herrenschmidt

On Fri, 2020-10-23 at 13:08 +, Dylan Hung wrote:
> The issue was found on our test chip (ast2600 version A0) which is
> just for testing and won't be mass-produced.  This HW bug has been
> fixed on ast2600 A1 and later versions.
> 
> To verify the HW fix, I run overnight iperf and kvm tests on
> ast2600A1 without this patch, and get stable result without hanging.
> So I think we can discard this patch.

This is great news. Thanks !

Cheers,
Ben.

Re: [PATCH] net: ftgmac100: Ensure tx descriptor updates are visible

2020-10-22 Thread Benjamin Herrenschmidt

On Wed, 2020-10-21 at 14:40 +0200, Arnd Bergmann wrote:
> On Wed, Oct 21, 2020 at 12:39 PM Joel Stanley  wrote:
> 
> > 
> > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
> > b/drivers/net/ethernet/faraday/ftgmac100.c
> > index 331d4bdd4a67..15cdfeb135b0 100644
> > --- a/drivers/net/ethernet/faraday/ftgmac100.c
> > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> > @@ -653,6 +653,11 @@ static bool ftgmac100_tx_complete_packet(struct 
> > ftgmac100 *priv)
> > ftgmac100_free_tx_packet(priv, pointer, skb, txdes, ctl_stat);
> > txdes->txdes0 = cpu_to_le32(ctl_stat & priv->txdes0_edotr_mask);
> > 
> > +   /* Ensure the descriptor config is visible before setting the tx
> > +* pointer.
> > +*/
> > +   smp_wmb();
> > +
> > priv->tx_clean_pointer = ftgmac100_next_tx_pointer(priv, pointer);
> > 
> > return true;
> > @@ -806,6 +811,11 @@ static netdev_tx_t ftgmac100_hard_start_xmit(struct 
> > sk_buff *skb,
> > dma_wmb();
> > first->txdes0 = cpu_to_le32(f_ctl_stat);
> > 
> > +   /* Ensure the descriptor config is visible before setting the tx
> > +* pointer.
> > +*/
> > +   smp_wmb();
> > +
> 
> Shouldn't these be paired with smp_rmb() on the reader side?

(Not near the code right now) I *think* the reader already has them
where it matters but I might have overlooked something when I quickly
checked the other day.

Cheers,
Ben.

Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue

2020-10-22 Thread Benjamin Herrenschmidt

On Wed, 2020-10-21 at 14:11 +0200, Arnd Bergmann wrote:
> (replying to my own mail from a different address to deal with the
> regular one being blacklisted somewhere, sorry for any duplicates)
> 
> On Wed, Oct 21, 2020 at 9:16 AM Arnd Bergmann  wrote:
> > 
> > On Wed, Oct 21, 2020 at 12:10 AM Benjamin Herrenschmidt
> >  wrote:
> > > On Tue, 2020-10-20 at 21:49 +0200, Arnd Bergmann wrote:
> > > > On Tue, Oct 20, 2020 at 11:37 AM Dylan Hung  
> > > > wrote:
> > > > > > +1 @first is system memory from dma_alloc_coherent(), right?
> > > > > > 
> > > > > > You shouldn't have to do this. Is coherent DMA memory broken on your
> > > > > > platform?
> > > > > 
> > > > > It is about the arbitration on the DRAM controller.  There are two 
> > > > > queues in the dram controller, one is for the CPU access and the 
> > > > > other is for the HW engines.
> > > > > When CPU issues a store command, the dram controller just 
> > > > > acknowledges cpu's request and pushes the request into the queue.  
> > > > > Then CPU triggers the HW MAC engine, the HW engine starts to fetch 
> > > > > the DMA memory.
> > > > > But since the cpu's request may still stay in the queue, the HW 
> > > > > engine may fetch the wrong data.
> > > 
> > > Actually, I take back what I said earlier, the above seems to imply
> > > this is more generic.
> > > 
> > > Dylan, please confirm, does this affect *all* DMA capable devices ? If
> > > yes, then it's a really really bad design bug in your chips
> > > unfortunately and the proper fix is indeed to make dma_wmb() do a dummy
> > > read of some sort (what address though ? would any dummy non-cachable
> > > page do ?) to force the data out as *all* drivers will potentially be
> > > affected.
> > > 
> > > I was under the impression that it was a specific timing issue in the
> > > vhub and ethernet parts, but if it's more generic then it needs to be
> > > fixed globally.
> > 
> > We have CONFIG_ARM_HEAVY_MB for SoCs with similar problems,
> > it turns mb() and wmb() into a platform specific function call, though it
> > doesn't do that for dma_wmb() and smp_wmb(), which should not
> > be affected if the problem is only missing serialization between DMA
> > and CPU writes.

Right. I got mixed up by David mention of dma_wmb, it's not the issue
here, it's indeed the ordering of writes to "coherent" memory vs
iowrite.

> > > > If either of the two is the case, then the READ_ONCE() would just
> > > > introduce a long delay before the iowrite32() that makes it more likely
> > > > that the data is there, but the inconsistent state would still be 
> > > > observable
> > > > by the device if it is still working on previous frames.
> > > 
> > > I think it just get stuck until we try another packet, ie, it doesn't
> > > see the new descriptor valid bit. But Dylan can elaborate.
> > 
> > Ok, that would point to an insufficient barrier in iowrite32() as well,
> > not in dma_wmb().

Correct.

> > At the moment, the only chips that need the heavy barrier are
> > omap4 and mstar_v7, and early l2 cache controllers (not the one
> > on Cortex-A7) have another synchronization callback that IIRC
> > is used for streaming mappings.

 .../...

> > Obviously, adding one of these for ast2600 would slow down every
> > mb() and writel() a lot, but if it is a chip-wide problem rather than
> > one isolated to the network device, it would be the correct solution,
> > provided that a correct code sequence can be found.

I'm surprised that problem doesn't already exist on the ast2400 and
2500 and I thus worry about the performance impact of such a workaround
applied generally to every MMIO writes

But we did kill mmiowb so ... ;-)

Cheers,
Ben.

Re: [PATCH] net: ftgmac100: Ensure tx descriptor updates are visible

2020-10-22 Thread Benjamin Herrenschmidt

On Wed, 2020-10-21 at 08:18 +, David Laight wrote:
> From: Benjamin Herrenschmidt
> > Sent: 21 October 2020 01:00
> > 
> > On Wed, 2020-10-21 at 08:36 +1030, Joel Stanley wrote:
> > > We must ensure the tx descriptor updates are visible before updating
> > > the tx pointer.
> > > 
> > > This resolves the tx hangs observed on the 2600 when running iperf:
> > 
> > To clarify the comment here. This doesn't ensure they are visible to
> > the hardware but to other CPUs. This is the ordering vs start_xmit and
> > tx_complete.
> 
> You need two barriers.
> 1) after making the data buffers available before transferring
> the descriptor ownership to the device.
> 2) after transferring the ownership before 'kicking' the mac engine.
> 
> The first is needed because the mac engine can poll the descriptors
> at any time (eg on completing the previous transmit).
> This stops it transmitting garbage.
> 
> The second makes sure it finds the descriptor you've just set.
> This stops delays before sending the packet.
> (But it will get sent later.)

The above is unrelated to this patch. This isn't about fixing any
device <-> CPU ordering or interaction but purely about ensuring proper
ordering between start_xmit and tx packet cleanup. IE. We are looking
at two different issues with this driver.

> For (2) dma_wmb() is the documented barrier.
> 
> I'm not sure which barrier you need for (1).
> smp_wmb() would be right if the reader were another cpu,
> but it is (at most) a compile barrier on UP kernels.
> So you need something stronger than smp_wmb().

There should already be sufficient barriers for that in the driver
(except for the HW bug mentioned earlier).

> On a TSO system (which yours probably is) a compile barrier
> is probably sufficient, but if memory writes can get re-ordered
> it needs to be a stronger barrier - but not necessarily as strong
> as dma_wmb().
> 
>   David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)

Re: [PATCH] net: ftgmac100: Ensure tx descriptor updates are visible

2020-10-20 Thread Benjamin Herrenschmidt

On Wed, 2020-10-21 at 08:36 +1030, Joel Stanley wrote:
> We must ensure the tx descriptor updates are visible before updating
> the tx pointer.
> 
> This resolves the tx hangs observed on the 2600 when running iperf:

To clarify the comment here. This doesn't ensure they are visible to
the hardware but to other CPUs. This is the ordering vs start_xmit and
tx_complete.

Cheers,
Ben.

> root@ast2600:~# iperf3 -c 192.168.86.146 -R
> Connecting to host 192.168.86.146, port 5201
> Reverse mode, remote host 192.168.86.146 is sending
> [  5] local 192.168.86.173 port 43886 connected to 192.168.86.146
> port 5201
> [ ID] Interval   Transfer Bitrate
> [  5]   0.00-1.00   sec  90.7 MBytes   760 Mbits/sec
> [  5]   1.00-2.00   sec  91.7 MBytes   769 Mbits/sec
> [  5]   2.00-3.00   sec  91.7 MBytes   770 Mbits/sec
> [  5]   3.00-4.00   sec  91.7 MBytes   769 Mbits/sec
> [  5]   4.00-5.00   sec  91.8 MBytes   771 Mbits/sec
> [  5]   5.00-6.00   sec  91.8 MBytes   771 Mbits/sec
> [  5]   6.00-7.00   sec  91.9 MBytes   771 Mbits/sec
> [  5]   7.00-8.00   sec  91.4 MBytes   767 Mbits/sec
> [  5]   8.00-9.00   sec  91.3 MBytes   766 Mbits/sec
> [  5]   9.00-10.00  sec  91.9 MBytes   771 Mbits/sec
> [  5]  10.00-11.00  sec  91.8 MBytes   770 Mbits/sec
> [  5]  11.00-12.00  sec  91.8 MBytes   770 Mbits/sec
> [  5]  12.00-13.00  sec  90.6 MBytes   761 Mbits/sec
> [  5]  13.00-14.00  sec  45.2 KBytes   370 Kbits/sec
> [  5]  14.00-15.00  sec  0.00 Bytes  0.00 bits/sec
> [  5]  15.00-16.00  sec  0.00 Bytes  0.00 bits/sec
> [  5]  16.00-17.00  sec  0.00 Bytes  0.00 bits/sec
> [  5]  17.00-18.00  sec  0.00 Bytes  0.00 bits/sec
> [   67.031671] [ cut here ]
> [   67.036870] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:442
> dev_watchdog+0x2dc/0x300
> [   67.046123] NETDEV WATCHDOG: eth2 (ftgmac100): transmit queue 0
> timed out
> 
> Fixes: 52c0cae87465 ("ftgmac100: Remove tx descriptor accessors")
> Signed-off-by: Joel Stanley 
> ---
>  drivers/net/ethernet/faraday/ftgmac100.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c
> b/drivers/net/ethernet/faraday/ftgmac100.c
> index 331d4bdd4a67..15cdfeb135b0 100644
> --- a/drivers/net/ethernet/faraday/ftgmac100.c
> +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> @@ -653,6 +653,11 @@ static bool ftgmac100_tx_complete_packet(struct
> ftgmac100 *priv)
>   ftgmac100_free_tx_packet(priv, pointer, skb, txdes, ctl_stat);
>   txdes->txdes0 = cpu_to_le32(ctl_stat & priv-
> >txdes0_edotr_mask);
>  
> + /* Ensure the descriptor config is visible before setting the
> tx
> +  * pointer.
> +  */
> + smp_wmb();
> +
>   priv->tx_clean_pointer = ftgmac100_next_tx_pointer(priv,
> pointer);
>  
>   return true;
> @@ -806,6 +811,11 @@ static netdev_tx_t
> ftgmac100_hard_start_xmit(struct sk_buff *skb,
>   dma_wmb();
>   first->txdes0 = cpu_to_le32(f_ctl_stat);
>  
> + /* Ensure the descriptor config is visible before setting the
> tx
> +  * pointer.
> +  */
> + smp_wmb();
> +
>   /* Update next TX pointer */
>   priv->tx_pointer = pointer;
>

Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue

2020-10-20 Thread Benjamin Herrenschmidt

On Tue, 2020-10-20 at 21:49 +0200, Arnd Bergmann wrote:
> On Tue, Oct 20, 2020 at 11:37 AM Dylan Hung  wrote:
> > > +1 @first is system memory from dma_alloc_coherent(), right?
> > > 
> > > You shouldn't have to do this. Is coherent DMA memory broken on your
> > > platform?
> > 
> > It is about the arbitration on the DRAM controller.  There are two queues 
> > in the dram controller, one is for the CPU access and the other is for the 
> > HW engines.
> > When CPU issues a store command, the dram controller just acknowledges 
> > cpu's request and pushes the request into the queue.  Then CPU triggers the 
> > HW MAC engine, the HW engine starts to fetch the DMA memory.
> > But since the cpu's request may still stay in the queue, the HW engine may 
> > fetch the wrong data.

Actually, I take back what I said earlier, the above seems to imply
this is more generic.

Dylan, please confirm, does this affect *all* DMA capable devices ? If
yes, then it's a really really bad design bug in your chips
unfortunately and the proper fix is indeed to make dma_wmb() do a dummy
read of some sort (what address though ? would any dummy non-cachable
page do ?) to force the data out as *all* drivers will potentially be
affected.

I was under the impression that it was a specific timing issue in the
vhub and ethernet parts, but if it's more generic then it needs to be
fixed globally.

> There is still something missing in the explanation: The iowrite32()
> only tells the
> device that it should check the queue, but not where the data is. I would 
> expect
> the device to either see the correct data that was marked valid by the
> 'dma_wmb();first->txdes0 = cpu_to_le32(f_ctl_stat);' operation, or it would 
> see
> the old f_ctl_stat value telling it that the data is not yet valid and
> not look at
> the rest of the descriptor. In the second case you would see the data
> not getting sent out until the next start_xmit(), but the device should not
> fetch wrong data.
> 
> There are two possible scenarios in which your patch would still help:
> 
> a) the dma_wmb() does not serialize the stores as seen by DMA the
> way it is supposed to, so the device can observe the new value of txdec0
> before it observes the correct data.
> 
> b) The txdes0 field sometimes contains stale data that marks the
> descriptor as valid before the correct data is written. This field
> should have been set in ftgmac100_tx_complete_packet() earlier
> 
> If either of the two is the case, then the READ_ONCE() would just
> introduce a long delay before the iowrite32() that makes it more likely
> that the data is there, but the inconsistent state would still be observable
> by the device if it is still working on previous frames.

I think it just get stuck until we try another packet, ie, it doesn't
see the new descriptor valid bit. But Dylan can elaborate.

Cheers,
Ben.

Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue

2020-10-20 Thread Benjamin Herrenschmidt

On Tue, 2020-10-20 at 13:15 +, David Laight wrote:
> That rather depends where the data is 'stuck'.
> 
> An old sparc cpu would flush the cpu store buffer before the read.
> But a modern x86 cpu will satisfy the read from the store buffer
> for cached data.
> 
> If the write is 'posted' on a PCI(e) bus then the read can't overtake it.
> But that is a memory access so shouldn't be to a PCI(e) address.
> 
> Shouldn't dma_wb() actually force your 'cpu to dram' queue be flushed?
> In which case you need one after writing the ring descriptor and
> before the poke of the mac engine.
> 
> The barrier before the descriptor write only needs to guarantee
> ordering of the writes - it can probably be a lighter barrier?
> 
> It might be that your dma_wmb() needs to do a write+read of
> an uncached DRAM location in order to empty the cpu to dram queue.

This is a specific bug with how a specific IP block is hooked up in
those SOCs, I wouldn't bloat the global dma_wmb for that. The read back
in the driver with appropriate comment should be enough.

Cheers,
Ben.

Re: [PATCH 1/4] ftgmac100: Fix race issue on TX descriptor[0]

2020-10-20 Thread Benjamin Herrenschmidt

On Tue, 2020-10-20 at 04:13 +, Joel Stanley wrote:
> On Mon, 19 Oct 2020 at 23:20, Benjamin Herrenschmidt
>  wrote:
> > 
> > On Mon, 2020-10-19 at 16:57 +0800, Dylan Hung wrote:
> > > These rules must be followed when accessing the TX descriptor:
> > > 
> > > 1. A TX descriptor is "cleanable" only when its value is non-zero
> > > and the owner bit is set to "software"
> > 
> > Can you elaborate ? What is the point of that change ? The owner
> > bit
> > should be sufficient, why do we need to check other fields ?
> 
> I would like Dylan to clarify too. The datasheet has a footnote below
> the descriptor layout:
> 
>  - TXDES#0: Bits 27 ~ 14 are valid only when FTS = 1
>  - TXDES#1: Bits 31 ~ 0 are valid only when FTS = 1
> 
> So the ownership bit (31) is not valid unless FTS is set. However,
> this isn't what his patch does. It adds checks for EDOTR.

No I think it adds a check for everything except EDOTR which just marks
the end of ring and needs to be ignored in the comparison.

That said, we do need a better explanation.

One potential bug I did find by looking at my code however is:

static bool ftgmac100_tx_complete_packet(struct ftgmac100 *priv)
{
struct net_device *netdev = priv->netdev;
struct ftgmac100_txdes *txdes;
struct sk_buff *skb;
unsigned int pointer;
u32 ctl_stat;

pointer = priv->tx_clean_pointer;
txdes = >txdes[pointer];

ctl_stat = le32_to_cpu(txdes->txdes0);
if (ctl_stat & FTGMAC100_TXDES0_TXDMA_OWN)
return false;

skb = priv->tx_skbs[pointer];
netdev->stats.tx_packets++;
netdev->stats.tx_bytes += skb->len;
ftgmac100_free_tx_packet(priv, pointer, skb, txdes, ctl_stat);
txdes->txdes0 = cpu_to_le32(ctl_stat & priv->txdes0_edotr_mask);

   There should probably be an smp_wmb() here to ensure that all the above
stores are visible before the tx clean pointer is updated.

priv->tx_clean_pointer = ftgmac100_next_tx_pointer(priv, pointer);

return true;
}

Similarly we probablu should have one before setting tx_pointer in start_xmit().

As for the read side of this, I'm not 100% sure, I'll have to think more about
it, it *think* the existing barriers are sufficient at first sight.

Cheers,
Ben.

> > 
> > > 2. A TX descriptor is "writable" only when its value is zero
> > > regardless the edotr mask.
> > 
> > Again, why is that ? Can you elaborate ? What race are you trying
> > to
> > address here ?
> > 
> > Cheers,
> > Ben.
> > 
> > > Fixes: 52c0cae87465 ("ftgmac100: Remove tx descriptor accessors")
> > > Signed-off-by: Dylan Hung 
> > > Signed-off-by: Joel Stanley 
> > > ---
> > >  drivers/net/ethernet/faraday/ftgmac100.c | 10 ++
> > >  1 file changed, 10 insertions(+)
> > > 
> > > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c
> > > b/drivers/net/ethernet/faraday/ftgmac100.c
> > > index 00024dd41147..7cacbe4aecb7 100644
> > > --- a/drivers/net/ethernet/faraday/ftgmac100.c
> > > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> > > @@ -647,6 +647,9 @@ static bool
> > > ftgmac100_tx_complete_packet(struct
> > > ftgmac100 *priv)
> > >   if (ctl_stat & FTGMAC100_TXDES0_TXDMA_OWN)
> > >   return false;
> > > 
> > > + if ((ctl_stat & ~(priv->txdes0_edotr_mask)) == 0)
> > > + return false;
> > > +
> > >   skb = priv->tx_skbs[pointer];
> > >   netdev->stats.tx_packets++;
> > >   netdev->stats.tx_bytes += skb->len;
> > > @@ -756,6 +759,9 @@ static netdev_tx_t
> > > ftgmac100_hard_start_xmit(struct sk_buff *skb,
> > >   pointer = priv->tx_pointer;
> > >   txdes = first = >txdes[pointer];
> > > 
> > > + if (le32_to_cpu(txdes->txdes0) & ~priv->txdes0_edotr_mask)
> > > + goto drop;
> > > +
> > >   /* Setup it up with the packet head. Don't write the head
> > > to
> > > the
> > >* ring just yet
> > >*/
> > > @@ -787,6 +793,10 @@ static netdev_tx_t
> > > ftgmac100_hard_start_xmit(struct sk_buff *skb,
> > >   /* Setup descriptor */
> > >   priv->tx_skbs[pointer] = skb;
> > >   txdes = >txdes[pointer];
> > > +
> > > + if (le32_to_cpu(txdes->txdes0) & ~priv-
> > > > txdes0_edotr_mask)
> > > 
> > > + goto dma_err;
> > > +
> > >   ctl_stat = ftgmac100_base_tx_ctlstat(priv,
> > > pointer);
> > >   ctl_stat |= FTGMAC100_TXDES0_TXDMA_OWN;
> > >   ctl_stat |= FTGMAC100_TXDES0_TXBUF_SIZE(len);

Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue

2020-10-20 Thread Benjamin Herrenschmidt

On Mon, 2020-10-19 at 19:57 -0700, Jakub Kicinski wrote:
> > I suspect the problem is that the HW (and yes this would be a HW bug)
> > doesn't order the CPU -> memory and the CPU -> MMIO path.
> > 
> > What I think happens is that the store to txde0 is potentially still in
> > a buffer somewhere on its way to memory, gets bypassed by the store to
> > MMIO, causing the MAC to try to read the descriptor, and getting the
> > "old" data from memory.
> 
> I see, but in general this sort of a problem should be resolved by
> adding an appropriate memory barrier. And in fact such barrier should
> (these days) be implied by a writel (I'm not 100% clear on why this
> driver uses iowrite, and if it matters).

No, a barrier won't solve this I think.

This is a coherency problem at the fabric/interconnect level. I has to
do with the way they implemented the DMA path from memory to the
ethernet controller using a different "port" of the memory controller
than the one used by the CPU, separately from the MMIO path, with no
proper ordering between those busses. Old school design  and
broken.

By doing a read back, they probably force the previous write to memory
to get past the point where it will be visible to a subsequent DMA read
by the ethernet controller.

> > It's ... fishy, but that isn't the first time an Aspeed chip has that
> > type of bug (there's a similar one in the USB device controler iirc).

Cheers,
Ben.

Re: [PATCH v1 1/2] net: ftgmac100: move phy connect out from ftgmac100_setup_mdio

2020-10-19 Thread Benjamin Herrenschmidt

On Thu, 2020-10-15 at 15:49 +0300, Ivan Mikhaylov wrote:
> Split MDIO registration and PHY connect into ftgmac100_setup_mdio and
> ftgmac100_mii_probe.

Please keep me CCod on ftgmac100 patches.

> Signed-off-by: Ivan Mikhaylov 

Acked-by: Benjamin Herrenschmidt 

> ---
>  drivers/net/ethernet/faraday/ftgmac100.c | 92 
>  1 file changed, 47 insertions(+), 45 deletions(-)
> 
> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
> b/drivers/net/ethernet/faraday/ftgmac100.c
> index 87236206366f..6997e121824b 100644
> --- a/drivers/net/ethernet/faraday/ftgmac100.c
> +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> @@ -1044,11 +1044,47 @@ static void ftgmac100_adjust_link(struct net_device 
> *netdev)
>   schedule_work(>reset_task);
>  }
>  
> -static int ftgmac100_mii_probe(struct ftgmac100 *priv, phy_interface_t intf)
> +static int ftgmac100_mii_probe(struct net_device *netdev)
>  {
> - struct net_device *netdev = priv->netdev;
> + struct ftgmac100 *priv = netdev_priv(netdev);
> + struct platform_device *pdev = to_platform_device(priv->dev);
> + struct device_node *np = pdev->dev.of_node;
> + phy_interface_t phy_intf = PHY_INTERFACE_MODE_RGMII;
>   struct phy_device *phydev;
>  
> + /* Get PHY mode from device-tree */
> + if (np) {
> + /* Default to RGMII. It's a gigabit part after all */
> + phy_intf = of_get_phy_mode(np, _intf);
> + if (phy_intf < 0)
> + phy_intf = PHY_INTERFACE_MODE_RGMII;
> +
> + /* Aspeed only supports these. I don't know about other IP
> +  * block vendors so I'm going to just let them through for
> +  * now. Note that this is only a warning if for some obscure
> +  * reason the DT really means to lie about it or it's a newer
> +  * part we don't know about.
> +  *
> +  * On the Aspeed SoC there are additionally straps and SCU
> +  * control bits that could tell us what the interface is
> +  * (or allow us to configure it while the IP block is held
> +  * in reset). For now I chose to keep this driver away from
> +  * those SoC specific bits and assume the device-tree is
> +  * right and the SCU has been configured properly by pinmux
> +  * or the firmware.
> +  */
> + if (priv->is_aspeed &&
> + phy_intf != PHY_INTERFACE_MODE_RMII &&
> + phy_intf != PHY_INTERFACE_MODE_RGMII &&
> + phy_intf != PHY_INTERFACE_MODE_RGMII_ID &&
> + phy_intf != PHY_INTERFACE_MODE_RGMII_RXID &&
> + phy_intf != PHY_INTERFACE_MODE_RGMII_TXID) {
> + netdev_warn(netdev,
> + "Unsupported PHY mode %s !\n",
> + phy_modes(phy_intf));
> + }
> + }
> +
>   phydev = phy_find_first(priv->mii_bus);
>   if (!phydev) {
>   netdev_info(netdev, "%s: no PHY found\n", netdev->name);
> @@ -1056,7 +1092,7 @@ static int ftgmac100_mii_probe(struct ftgmac100 *priv, 
> phy_interface_t intf)
>   }
>  
>   phydev = phy_connect(netdev, phydev_name(phydev),
> -  _adjust_link, intf);
> +  _adjust_link, phy_intf);
>  
>   if (IS_ERR(phydev)) {
>   netdev_err(netdev, "%s: Could not attach to PHY\n", 
> netdev->name);
> @@ -1601,7 +1637,6 @@ static int ftgmac100_setup_mdio(struct net_device 
> *netdev)
>  {
>   struct ftgmac100 *priv = netdev_priv(netdev);
>   struct platform_device *pdev = to_platform_device(priv->dev);
> - phy_interface_t phy_intf = PHY_INTERFACE_MODE_RGMII;
>   struct device_node *np = pdev->dev.of_node;
>   int i, err = 0;
>   u32 reg;
> @@ -1623,39 +1658,6 @@ static int ftgmac100_setup_mdio(struct net_device 
> *netdev)
>   iowrite32(reg, priv->base + FTGMAC100_OFFSET_REVR);
>   }
>  
> - /* Get PHY mode from device-tree */
> - if (np) {
> - /* Default to RGMII. It's a gigabit part after all */
> - err = of_get_phy_mode(np, _intf);
> - if (err)
> - phy_intf = PHY_INTERFACE_MODE_RGMII;
> -
> - /* Aspeed only supports these. I don't know about other IP
> -  * block vendors so I'm going to just let them through for
> -  * now. Note that this is only a warning if for some obscure
> -  * reason the DT real

Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue

2020-10-19 Thread Benjamin Herrenschmidt

On Mon, 2020-10-19 at 12:00 -0700, Jakub Kicinski wrote:
> On Mon, 19 Oct 2020 08:57:03 + Joel Stanley wrote:
> > > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c
> > > b/drivers/net/ethernet/faraday/ftgmac100.c
> > > index 00024dd41147..9a99a87f29f3 100644
> > > --- a/drivers/net/ethernet/faraday/ftgmac100.c
> > > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> > > @@ -804,7 +804,8 @@ static netdev_tx_t
> > > ftgmac100_hard_start_xmit(struct sk_buff *skb,
> > >  * before setting the OWN bit on the first descriptor.
> > >  */
> > > dma_wmb();
> > > -   first->txdes0 = cpu_to_le32(f_ctl_stat);
> > > +   WRITE_ONCE(first->txdes0, cpu_to_le32(f_ctl_stat));
> > > +   READ_ONCE(first->txdes0);  
> > 
> > I understand what you're trying to do here, but I'm not sure that
> > this
> > is the correct way to go about it.
> > 
> > It does cause the compiler to produce a store and then a load.
> 
> +1 @first is system memory from dma_alloc_coherent(), right?
> 
> You shouldn't have to do this. Is coherent DMA memory broken 
> on your platform?

I suspect the problem is that the HW (and yes this would be a HW bug)
doesn't order the CPU -> memory and the CPU -> MMIO path.

What I think happens is that the store to txde0 is potentially still in
a buffer somewhere on its way to memory, gets bypassed by the store to
MMIO, causing the MAC to try to read the descriptor, and getting the
"old" data from memory.

It's ... fishy, but that isn't the first time an Aspeed chip has that
type of bug (there's a similar one in the USB device controler iirc).

Cheers,
Ben.

Re: [PATCH 1/4] ftgmac100: Fix race issue on TX descriptor[0]

2020-10-19 Thread Benjamin Herrenschmidt

On Mon, 2020-10-19 at 16:57 +0800, Dylan Hung wrote:
> These rules must be followed when accessing the TX descriptor:
> 
> 1. A TX descriptor is "cleanable" only when its value is non-zero
> and the owner bit is set to "software"

Can you elaborate ? What is the point of that change ? The owner bit
should be sufficient, why do we need to check other fields ?

> 2. A TX descriptor is "writable" only when its value is zero
> regardless the edotr mask.

Again, why is that ? Can you elaborate ? What race are you trying to
address here ?

Cheers,
Ben.

> Fixes: 52c0cae87465 ("ftgmac100: Remove tx descriptor accessors")
> Signed-off-by: Dylan Hung 
> Signed-off-by: Joel Stanley 
> ---
>  drivers/net/ethernet/faraday/ftgmac100.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c
> b/drivers/net/ethernet/faraday/ftgmac100.c
> index 00024dd41147..7cacbe4aecb7 100644
> --- a/drivers/net/ethernet/faraday/ftgmac100.c
> +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> @@ -647,6 +647,9 @@ static bool ftgmac100_tx_complete_packet(struct
> ftgmac100 *priv)
>   if (ctl_stat & FTGMAC100_TXDES0_TXDMA_OWN)
>   return false;
>  
> + if ((ctl_stat & ~(priv->txdes0_edotr_mask)) == 0)
> + return false;
> +
>   skb = priv->tx_skbs[pointer];
>   netdev->stats.tx_packets++;
>   netdev->stats.tx_bytes += skb->len;
> @@ -756,6 +759,9 @@ static netdev_tx_t
> ftgmac100_hard_start_xmit(struct sk_buff *skb,
>   pointer = priv->tx_pointer;
>   txdes = first = >txdes[pointer];
>  
> + if (le32_to_cpu(txdes->txdes0) & ~priv->txdes0_edotr_mask)
> + goto drop;
> +
>   /* Setup it up with the packet head. Don't write the head to
> the
>* ring just yet
>*/
> @@ -787,6 +793,10 @@ static netdev_tx_t
> ftgmac100_hard_start_xmit(struct sk_buff *skb,
>   /* Setup descriptor */
>   priv->tx_skbs[pointer] = skb;
>   txdes = >txdes[pointer];
> +
> + if (le32_to_cpu(txdes->txdes0) & ~priv-
> >txdes0_edotr_mask)
> + goto dma_err;
> +
>   ctl_stat = ftgmac100_base_tx_ctlstat(priv, pointer);
>   ctl_stat |= FTGMAC100_TXDES0_TXDMA_OWN;
>   ctl_stat |= FTGMAC100_TXDES0_TXBUF_SIZE(len);

Re: [PATCH 4/4] ftgmac100: Restart MAC HW once

2020-10-19 Thread Benjamin Herrenschmidt

On Mon, 2020-10-19 at 16:57 +0800, Dylan Hung wrote:
> The interrupt handler may set the flag to reset the mac in the
> future,
> but that flag is not cleared once the reset has occured.
> 
> Fixes: 10cbd6407609 ("ftgmac100: Rework NAPI & interrupts handling")
> Signed-off-by: Dylan Hung 
> Signed-off-by: Joel Stanley 

Acked-by: Benjamin Herrenschmidt 

> ---
>  drivers/net/ethernet/faraday/ftgmac100.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c
> b/drivers/net/ethernet/faraday/ftgmac100.c
> index 0c67fc3e27df..57736b049de3 100644
> --- a/drivers/net/ethernet/faraday/ftgmac100.c
> +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> @@ -1326,6 +1326,7 @@ static int ftgmac100_poll(struct napi_struct
> *napi, int budget)
>*/
>   if (unlikely(priv->need_mac_restart)) {
>   ftgmac100_start_hw(priv);
> + priv->need_mac_restart = false;
>  
>   /* Re-enable "bad" interrupts */
>   ftgmac100_write(FTGMAC100_INT_BAD, priv->base +
> FTGMAC100_OFFSET_IER);

Re: [PATCH 3/4] ftgmac100: Add a dummy read to ensure running sequence

2020-10-19 Thread Benjamin Herrenschmidt

On Mon, 2020-10-19 at 16:57 +0800, Dylan Hung wrote:
> On the AST2600 care must be taken to ensure writes appear correctly when
> modifying the interrupt reglated registers.
> 
> Add a function to perform a read after all writes to the IER and ISR 
> registers.

You need to elaborate here. MMIO writes shouldn't get out of order,
though they can get posted. I thus object to that "blanket"
ftgmac100_write(). Instead, specifically add reads to flush posted
writes where they are necessary and document it with a comment. Unless
there's a deeper problem in the HW, in which case you need a better
explanation.

Cheers,
Ben.

> Fixes: 39bfab8844a0 ("net: ftgmac100: Add support for DT phy-handle property")
> Signed-off-by: Dylan Hung 
> Signed-off-by: Joel Stanley 
> ---
>  drivers/net/ethernet/faraday/ftgmac100.c | 38 
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c 
> b/drivers/net/ethernet/faraday/ftgmac100.c
> index 810bda80f138..0c67fc3e27df 100644
> --- a/drivers/net/ethernet/faraday/ftgmac100.c
> +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> @@ -111,6 +111,14 @@ struct ftgmac100 {
>   bool is_aspeed;
>  };
>  
> +/* Helper to ensure writes are observed with the correct ordering. Use only
> + * for IER and ISR accesses. */
> +static void ftgmac100_write(u32 val, void __iomem *addr)
> +{
> + iowrite32(val, addr);
> + ioread32(addr);
> +}
> +
>  static int ftgmac100_reset_mac(struct ftgmac100 *priv, u32 maccr)
>  {
>   struct net_device *netdev = priv->netdev;
> @@ -1048,7 +1056,7 @@ static void ftgmac100_adjust_link(struct net_device 
> *netdev)
>   return;
>  
>   /* Disable all interrupts */
> - iowrite32(0, priv->base + FTGMAC100_OFFSET_IER);
> + ftgmac100_write(0, priv->base + FTGMAC100_OFFSET_IER);
>  
>   /* Reset the adapter asynchronously */
>   schedule_work(>reset_task);
> @@ -1246,7 +1254,7 @@ static irqreturn_t ftgmac100_interrupt(int irq, void 
> *dev_id)
>  
>   /* Fetch and clear interrupt bits, process abnormal ones */
>   status = ioread32(priv->base + FTGMAC100_OFFSET_ISR);
> - iowrite32(status, priv->base + FTGMAC100_OFFSET_ISR);
> + ftgmac100_write(status, priv->base + FTGMAC100_OFFSET_ISR);
>   if (unlikely(status & FTGMAC100_INT_BAD)) {
>  
>   /* RX buffer unavailable */
> @@ -1266,7 +1274,7 @@ static irqreturn_t ftgmac100_interrupt(int irq, void 
> *dev_id)
>   if (net_ratelimit())
>   netdev_warn(netdev,
>  "AHB bus error ! Resetting chip.\n");
> - iowrite32(0, priv->base + FTGMAC100_OFFSET_IER);
> + ftgmac100_write(0, priv->base + FTGMAC100_OFFSET_IER);
>   schedule_work(>reset_task);
>   return IRQ_HANDLED;
>   }
> @@ -1281,7 +1289,7 @@ static irqreturn_t ftgmac100_interrupt(int irq, void 
> *dev_id)
>   }
>  
>   /* Only enable "bad" interrupts while NAPI is on */
> - iowrite32(new_mask, priv->base + FTGMAC100_OFFSET_IER);
> + ftgmac100_write(new_mask, priv->base + FTGMAC100_OFFSET_IER);
>  
>   /* Schedule NAPI bh */
>   napi_schedule_irqoff(>napi);
> @@ -1320,8 +1328,7 @@ static int ftgmac100_poll(struct napi_struct *napi, int 
> budget)
>   ftgmac100_start_hw(priv);
>  
>   /* Re-enable "bad" interrupts */
> - iowrite32(FTGMAC100_INT_BAD,
> -   priv->base + FTGMAC100_OFFSET_IER);
> + ftgmac100_write(FTGMAC100_INT_BAD, priv->base + 
> FTGMAC100_OFFSET_IER);
>   }
>  
>   /* As long as we are waiting for transmit packets to be
> @@ -1336,13 +1343,7 @@ static int ftgmac100_poll(struct napi_struct *napi, 
> int budget)
>* they were masked. So we clear them first, then we need
>* to re-check if there's something to process
>*/
> - iowrite32(FTGMAC100_INT_RXTX,
> -   priv->base + FTGMAC100_OFFSET_ISR);
> -
> - /* Push the above (and provides a barrier vs. subsequent
> -  * reads of the descriptor).
> -  */
> - ioread32(priv->base + FTGMAC100_OFFSET_ISR);
> + ftgmac100_write(FTGMAC100_INT_RXTX, priv->base + 
> FTGMAC100_OFFSET_ISR);
>  
>   /* Check RX and TX descriptors for more work to do */
>   if (ftgmac100_check_rx(priv) ||
> @@ -1353,8 +1354,7 @@ static int ftgmac100_poll(struct napi_struct *napi, int 
> budget)
>   napi_complete(napi);
>  
>   /* enable all interrupts */
> - iowrite32(FTGMAC100_INT_ALL,
> -   priv->base + FTGMAC100_OFFSET_IER);
> + ftgmac100_write(FTGMAC100_INT_ALL, priv->base + 
> FTGMAC100_OFFSET_IER);
>   }
>  
>   return work_done;
> @@ -1382,7 +1382,7 @@ static int

Re: [PATCH v1 0/2] video: fbdev: radeonfb: PCI PM framework upgrade and fix-ups.

2020-09-07 Thread Benjamin Herrenschmidt

On Mon, 2020-09-07 at 09:55 +0200, Daniel Vetter wrote:
> On Thu, Aug 06, 2020 at 12:52:54PM +0530, Vaibhav Gupta wrote:
> > Linux Kernel Mentee: Remove Legacy Power Management. 
> > 
> > The original goal of the patch series is to upgrade the power
> > management
> > framework of radeonfb fbdev driver. This has been done by upgrading
> > .suspend()
> > and .resume() callbacks.
> > 
> > The upgrade makes sure that the involvement of PCI Core does not
> > change the
> > order of operations executed in a driver. Thus, does not change its
> > behavior.
> > 
> > During this process, it was found that "#if defined(CONFIG_PM)" at
> > line 1434 is
> > redundant. This was introduced in the commit
> > 42ddb453a0cd ("radeon: Conditionally compile PM code").
> 
> I do wonder whether it wouldn't be better to just outright delete
> these,
> we have the drm radeon driver for pretty much all the same hardware
> ...

The only thing is, afaik, the DRM drivers never got the D2/D3 code that
I wrote for radeonfb to get old powerbooks to suspend/resume.

Cheers,
Ben.

> -Daniel
> 
> > 
> > 
> > 
> > Before 42ddb453a0cd:
> > $ git show 65122f7e80b5:drivers/video/aty/radeon_pm.c | grep -n
> > "#ifdef\|#if\|#else\|#endif\|#elif\|#ifndef"
> > 
> > Based on output in terminal:
> > 
> > 547:#ifdef CONFIG_PM
> >|-- 959:#ifdef CONFIG_PPC_PMAC
> >|-- 972:#endif
> >|-- 1291:#ifdef CONFIG_PPC_OF
> >|-- 1301:#endif /* CONFIG_PPC_OF */
> >|-- 1943:#ifdef CONFIG_PPC_OF
> >|-- 2206:#if 0 /* Not ready yet */
> >|-- 2508:#endif /* 0 */
> >|-- 2510:#endif /* CONFIG_PPC_OF */
> >|-- 2648:#ifdef CONFIG_PPC_PMAC
> >|-- 2654:#endif /* CONFIG_PPC_PMAC */
> >|-- 2768:#ifdef CONFIG_PPC_PMAC
> >|-- 2774:#endif /* CONFIG_PPC_PMAC */
> >|-- 2791:#ifdef CONFIG_PPC_OF__disabled
> >|-- 2801:#endif /* CONFIG_PPC_OF */
> > 2803:#endif /* CONFIG_PM */
> > 
> > 
> > 
> > After 42ddb453a0cd:
> > $ git show 42ddb453a0cd:drivers/video/aty/radeon_pm.c | grep -n
> > "#ifdef\|#if\|#else\|#endif\|#elif\|#ifndef"
> > 
> > Based on output in terminal:
> > 
> > 547:#ifdef CONFIG_PM
> >|-- 959:#ifdef CONFIG_PPC_PMAC
> >|-- 972:#endif
> >|-- 1291:#ifdef CONFIG_PPC_OF
> >|-- 1301:#endif /* CONFIG_PPC_OF */
> >|-- 1430:#if defined(CONFIG_PM)
> >|-- 1431:#if defined(CONFIG_X86) ||
> > defined(CONFIG_PPC_PMAC)
> >|-- 1944:#endif
> >|-- 1946:#ifdef CONFIG_PPC_OF
> >|-- 1947:#ifdef CONFIG_PPC_PMAC
> >|-- 2208:#endif
> >|-- 2209:#endif
> >|-- 2211:#if 0 /* Not ready yet */
> >|-- 2513:#endif /* 0 */
> >|-- 2515:#endif /* CONFIG_PPC_OF */
> >|-- 2653:#ifdef CONFIG_PPC_PMAC
> >|-- 2659:#endif /* CONFIG_PPC_PMAC */
> >|-- 2773:#ifdef CONFIG_PPC_PMAC
> >|-- 2779:#endif /* CONFIG_PPC_PMAC */
> >|-- 2796:#ifdef CONFIG_PPC_OF__disabled
> >|-- 2806:#endif /* CONFIG_PPC_OF */
> > 2808:#endif /* CONFIG_PM */
> > 
> > 
> > 
> > This also affected the CONFIG_PPC_OF container (line 1943 at commit
> > 65122f7e80b5)
> > 
> > The patch-series fixes it along with PM upgrade.
> > 
> > All patches are compile-tested only.
> > 
> > Test tools:
> > - Compiler: gcc (GCC) 10.1.0
> > - allmodconfig build: make -j$(nproc) W=1 all
> > 
> > Vaibhav Gupta (2):
> >   video: fbdev: aty: radeon_pm: remove redundant CONFIG_PM
> > container
> >   fbdev: radeonfb:use generic power management
> > 
> >  drivers/video/fbdev/aty/radeon_base.c | 10 ---
> >  drivers/video/fbdev/aty/radeon_pm.c   | 38 ---
> > 
> >  drivers/video/fbdev/aty/radeonfb.h|  3 +--
> >  3 files changed, 35 insertions(+), 16 deletions(-)
> > 
> > -- 
> > 2.27.0
> > 
> > ___
> > dri-devel mailing list
> > dri-de...@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/dri-devel
> 
>

Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs

2020-09-01 Thread Benjamin Herrenschmidt

On Tue, 2020-09-01 at 13:37 -0500, Bjorn Helgaas wrote:
> On Mon, Aug 31, 2020 at 03:18:27PM +, Clint Sbisa wrote:
> > Using write-combine is crucial for performance of PCI devices where
> > significant amounts of transactions go over PCI BARs.
> > 
> > arm64 supports write-combine PCI mappings, so the appropriate
> > define
> > has been added which will expose write-combine mappings under sysfs
> > for prefetchable PCI resources.
> > 
> > Signed-off-by: Clint Sbisa 
> 
> Fine with me, I assume Will or Catalin will apply this.

Haha ! Client had sent it to them originally and I told him to resend
it to linux-pci, yourself and Lorenzo :-)

So the confusion is on me.

Will, Catalin, it's all yours. You should have the original patch in
your mbox already, otherwise:

https://patchwork.kernel.org/patch/11729875/

Cheers,
Ben.

Re: [PATCH 1/2] fsi/sbefifo: Clean up correct FIFO when receiving reset request from SBE

2020-07-26 Thread Benjamin Herrenschmidt

On Fri, 2020-07-24 at 16:45 +0930, Joel Stanley wrote:
> From: Joachim Fenkes 
> 
> When the SBE requests a reset via the down FIFO, that is also the
> FIFO we should go and reset ;)

Is it ?

I no longer work for IBM and dont have access to any of the
documentation here but I had vague memories that we would get a reset
request in the down fifo in order to reset the up one since we control
the up one and the host controls the down one, no ?

Cheers,
Ben.

> Fixes: 9f4a8a2d7f9d ("fsi/sbefifo: Add driver for the SBE FIFO")
> Signed-off-by: Joachim Fenkes 
> Signed-off-by: Joel Stanley 
> ---
>  drivers/fsi/fsi-sbefifo.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/fsi/fsi-sbefifo.c b/drivers/fsi/fsi-sbefifo.c
> index f54df9ebc8b3..655b45c1f6ba 100644
> --- a/drivers/fsi/fsi-sbefifo.c
> +++ b/drivers/fsi/fsi-sbefifo.c
> @@ -400,7 +400,7 @@ static int sbefifo_cleanup_hw(struct sbefifo
> *sbefifo)
>   /* The FIFO already contains a reset request from the SBE ? */
>   if (down_status & SBEFIFO_STS_RESET_REQ) {
>   dev_info(dev, "Cleanup: FIFO reset request set,
> resetting\n");
> - rc = sbefifo_regw(sbefifo, SBEFIFO_UP,
> SBEFIFO_PERFORM_RESET);
> + rc = sbefifo_regw(sbefifo, SBEFIFO_DOWN,
> SBEFIFO_PERFORM_RESET);
>   if (rc) {
>   sbefifo->broken = true;
>   dev_err(dev, "Cleanup: Reset reg write failed,
> rc=%d\n", rc);

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-23 Thread Benjamin Herrenschmidt

On Thu, 2020-07-23 at 01:21 -0400, Alex Ghiti wrote:
> > works fine with huge pages, what is your problem there ? You rely on
> > punching small-page size holes in there ?
> > 
> 
> ARCH_HAS_STRICT_KERNEL_RWX prevents the use of a hugepage for the kernel 
> mapping in the direct mapping as it sets different permissions to 
> different part of the kernel (data, text..etc).

Ah ok, that can be solved in a couple of ways...

One is to use the linker script to ensure those sections are linked
HUGE_PAGE_SIZE appart and moved appropriately by early boot code. One
is to selectively degrade just those huge pages.

I'm not familiar with the RiscV MMU (I should probably go have a look)
but if it's a classic radix tree with huge pages at PUD/PMD level, then
you could just degrade the one(s) that cross those boundaries.

Cheers,
Ben.

Re: [PATCH] usb: gadget: net2280: fix memory leak on probe error handling paths

2020-07-22 Thread Benjamin Herrenschmidt

On Wed, 2020-07-22 at 22:56 +0300, Evgeny Novikov wrote:
> Hi Alan,
> 
> I have neither an appropriate hardware nor an experience to deal with
> issues that you mentioned. Our framework does not allow to detect
> them as well at the moment. At last, it seems that rather many
> drivers can suffer from these issues. So, it would be much better if
> somebody else will suggest necessary fixes and test them carefully.
> 
> BTW, you have already discussed the race within net2280_remove() with
> my colleague about 3 years ago. But you did not achieve a consensus
> at that time and no fixes were made after all.
> 
> Anyway, one can consider both issues independently on the one fixed
> by the patch.

FYI. It looks like I'm likely to resume my work on that driver in the
next few weeks in which case I could probably look into these Alan.

Cheers,
Ben.


> -- 
> Evgeny Novikov
> Linux Verification Center, ISP RAS
> http://linuxtesting.org
> 
> 22.07.2020, 17:17, "Alan Stern" :
> > On Tue, Jul 21, 2020 at 11:15:58PM +0300, Evgeny Novikov wrote:
> > >  Driver does not release memory for device on error handling
> > > paths in
> > >  net2280_probe() when gadget_release() is not registered yet.
> > > 
> > >  The patch fixes the bug like in other similar drivers.
> > > 
> > >  Found by Linux Driver Verification project (linuxtesting.org).
> > > 
> > >  Signed-off-by: Evgeny Novikov 
> > >  ---
> > >   drivers/usb/gadget/udc/net2280.c | 4 +++-
> > >   1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > >  diff --git a/drivers/usb/gadget/udc/net2280.c
> > > b/drivers/usb/gadget/udc/net2280.c
> > >  index 5eff85eeaa5a..d5fe071b2db2 100644
> > >  --- a/drivers/usb/gadget/udc/net2280.c
> > >  +++ b/drivers/usb/gadget/udc/net2280.c
> > >  @@ -3781,8 +3781,10 @@ static int net2280_probe(struct pci_dev
> > > *pdev, const struct pci_device_id *id)
> > >   return 0;
> > > 
> > >   done:
> > >  - if (dev)
> > >  + if (dev) {
> > >   net2280_remove(pdev);
> > >  + kfree(dev);
> > >  + }
> > >   return retval;
> > >   }
> > 
> > This patch seems to be the tip of an iceberg. Following through its
> > implications led to a couple of discoveries.
> > 
> > usb_del_gadget_udc() calls device_unregister(>dev). Once
> > this
> > call returns, gadget has to be regarded as a stale pointer. But the
> > very next line of code does:
> > 
> > memset(>dev, 0x00, sizeof(gadget->dev));
> > 
> > for no apparent reason. I'm amazed this hasn't caused problems
> > already.
> > Is there any justification for keeping this memset? It's hard to
> > imagine that it does any good.
> > 
> > Similarly, net2280_remove() calls usb_del_gadget_udc(>gadget)
> > at
> > its start, and so dev must be a stale pointer for the entire
> > remainder
> > of the routine. But it gets used repeatedly. Surely we ought to
> > have
> > a device_get() and device_put() in there.
> > 
> > Alan Stern

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-21 Thread Benjamin Herrenschmidt

On Tue, 2020-07-21 at 16:48 -0700, Palmer Dabbelt wrote:
> > Why ? Branch distance limits ? You can't use trampolines ?
> 
> Nothing fundamental, it's just that we don't have a large code model in the C
> compiler.  As a result all the global symbols are resolved as 32-bit
> PC-relative accesses.  We could fix this with a fast large code model, but 
> then
> the kernel would need to relax global symbol references in modules and we 
> don't
> even do that for the simple code models we have now.  FWIW, some of the
> proposed large code models are essentially just split-PLT/GOT and therefor
> don't require relaxation, but at that point we're essentially PIC until we
> have more that 2GiB of kernel text -- and even then, we keep all the
> performance issues.

My memory might be out of date but I *think* we do it on powerpc
without going to a large code model, but just having the in-kernel
linker insert trampolines.

Cheers,
Ben.

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-21 Thread Benjamin Herrenschmidt

On Tue, 2020-07-21 at 12:05 -0700, Palmer Dabbelt wrote:
> 
> * We waste vmalloc space on 32-bit systems, where there isn't a lot of it.
> * On 64-bit systems the VA space around the kernel is precious because it's 
> the
>   only place we can place text (modules, BPF, whatever). 

Why ? Branch distance limits ? You can't use trampolines ?

>  If we start putting
>   the kernel in the vmalloc space then we either have to pre-allocate a bunch
>   of space around it (essentially making it a fixed mapping anyway) or it
>   becomes likely that we won't be able to find space for modules as they're
>   loaded into running systems.

I dislike the kernel being in the vmalloc space (see my other email)
but I don't understand the specific issue with modules.

> * Relying on a relocatable kernel for sv48 support introduces a fairly large
>   performance hit.

Out of curiosity why would relocatable kernels introduce a significant
hit ? Where about do you see the overhead coming from ?

> Roughly, my proposal would be to:
> 
> * Leave the 32-bit memory map alone.  On 32-bit systems we can load modules
>   anywhere and we only have one VA width, so we're not really solving any
>   problems with these changes.
> * Staticly allocate a 2GiB portion of the VA space for all our text, as its 
> own
>   region.  We'd link/relocate the kernel here instead of around PAGE_OFFSET,
>   which would decouple the kernel from the physical memory layout of the 
> system.
>   This would have the side effect of sorting out a bunch of bootloader 
> headaches
>   that we currently have.
> * Sort out how to maintain a linear map as the canonical hole moves around
>   between the VA widths without adding a bunch of overhead to the virt2phys 
> and
>   friends.  This is probably going to be the trickiest part, but I think if we
>   just change the page table code to essentially lie about VAs when an sv39
>   system runs an sv48+sv39 kernel we could make it work -- there'd be some
>   logical complexity involved, but it would remain fast.
> 
> This doesn't solve the problem of virtually relocatable kernels, but it does
> let us decouple that from the sv48 stuff.  It also lets us stop relying on a
> fixed physical address the kernel is loaded into, which is another thing I
> don't like.
> 
> I know this may be a more complicated approach, but there aren't any sv48
> systems around right now so I just don't see the rush to support them,
> particularly when there's a cost to what already exists (for those who haven't
> been watching, so far all the sv48 patch sets have imposed a significant
> performance penalty on all systems).

Re: [PATCH v5 1/4] riscv: Move kernel mapping to vmalloc zone

2020-07-21 Thread Benjamin Herrenschmidt

On Tue, 2020-07-21 at 14:36 -0400, Alex Ghiti wrote:
> > > I guess I don't understand why this is necessary at all.  
> > > Specifically: why
> > > can't we just relocate the kernel within the linear map?  That would 
> > > let the
> > > bootloader put the kernel wherever it wants, modulo the physical 
> > > memory size we
> > > support.  We'd need to handle the regions that are coupled to the 
> > > kernel's
> > > execution address, but we could just put them in an explicit memory 
> > > region
> > > which is what we should probably be doing anyway.
> > 
> > Virtual relocation in the linear mapping requires to move the kernel 
> > physically too. Zong implemented this physical move in its KASLR RFC 
> > patchset, which is cumbersome since finding an available physical spot 
> > is harder than just selecting a virtual range in the vmalloc range.
> > 
> > In addition, having the kernel mapping in the linear mapping prevents 
> > the use of hugepage for the linear mapping resulting in performance loss 
> > (at least for the GB that encompasses the kernel).
> > 
> > Why do you find this "ugly" ? The vmalloc region is just a bunch of 
> > available virtual addresses to whatever purpose we want, and as noted by 
> > Zong, arm64 uses the same scheme.

I don't get it :-)

At least on powerpc we move the kernel in the linear mapping and it
works fine with huge pages, what is your problem there ? You rely on
punching small-page size holes in there ?

At least in the old days, there were a number of assumptions that
the kernel text/data/bss resides in the linear mapping.

If you change that you need to ensure that it's still physically
contiguous and you'll have to tweak __va and __pa, which might induce
extra overhead.

Cheers,
Ben.

Re: [PATCH] powerpc/64: Fix an out of date comment about MMIO ordering

2020-07-16 Thread Benjamin Herrenschmidt

On Thu, 2020-07-16 at 12:38 -0700, Palmer Dabbelt wrote:
> From: Palmer Dabbelt 
> 
> This primitive has been renamed, but because it was spelled incorrectly in the
> first place it must have escaped the fixup patch.  As far as I can tell this
> logic is still correct: smp_mb__after_spinlock() uses the default smp_mb()
> implementation, which is "sync" rather than "hwsync" but those are the same
> (though I'm not that familiar with PowerPC).

Typo ? That must be me ... :)

Looks fine. Yes, sync and hwsync are the same (by opposition to lwsync
which is lighter weight and doesn't order cache inhibited).

Cheers,
Ben.

> Signed-off-by: Palmer Dabbelt 
> ---
>  arch/powerpc/kernel/entry_64.S | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
> index b3c9f15089b6..7b38b4daca93 100644
> --- a/arch/powerpc/kernel/entry_64.S
> +++ b/arch/powerpc/kernel/entry_64.S
> @@ -357,7 +357,7 @@ _GLOBAL(_switch)
>* kernel/sched/core.c).
>*
>* Uncacheable stores in the case of involuntary preemption must
> -  * be taken care of. The smp_mb__before_spin_lock() in __schedule()
> +  * be taken care of. The smp_mb__after_spinlock() in __schedule()
>* is implemented as hwsync on powerpc, which orders MMIO too. So
>* long as there is an hwsync in the context switch path, it will
>* be executed on the source CPU after the task has performed

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-15 Thread Benjamin Herrenschmidt

On Wed, 2020-07-15 at 17:12 -0500, Bjorn Helgaas wrote:
> > I've 'played' with PCIe error handling - without much success.
> > What might be useful is for a driver that has just read ~0u to
> > be able to ask 'has there been an error signalled for this device?'.
> 
> In many cases a driver will know that ~0 is not a valid value for the
> register it's reading.  But if ~0 *could* be valid, an interface like
> you suggest could be useful.  I don't think we have anything like that
> today, but maybe we could.  It would certainly be nice if the PCI core
> noticed, logged, and cleared errors.  We have some of that for AER,
> but that's an optional feature, and support for the error bits in the
> garden-variety PCI_STATUS register is pretty haphazard.  As you note
> below, this sort of SERR/PERR reporting is frequently hard-wired in
> ways that takes it out of our purview.

We do have pci_channel_state (via pci_channel_offline()) which covers
the cases where the underlying error handling (such as EEH or unplug)
results in the device being offlined though this tend to be
asynchronous so it might take a few ~0's before you get it.

It's typically used to break potentially infinite loops in some
drivers.

There is no interface to check whether *an* error happened though for
the most cases it will be captured in the status register, which is
harvested (and cleared ?) by some EDAC drivers iirc... 

All this lacks coordination, I agree.

Cheers,
Ben.

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-15 Thread Benjamin Herrenschmidt

On Wed, 2020-07-15 at 08:47 +0200, Arnd Bergmann wrote:
> drivers/misc/cardreader/rts5261.c:  retval =
> rtsx_pci_write_config_dword(pcr, PCR_SETTING_REG2, lval);
> 
> That last one is interesting because I think this is a case in which
> we
> actually want to check for errors, as the driver seems to use it
> to ensure that accessing extended config space at offset 0x814
> works before relying on the value. Unfortunately the implementation
> seems buggy as it a) keeps using the possibly uninitialized value
> after
> printing a warning and b) returns the PCIBIOS_* value in place of a
> negative errno and then ignores it in the caller.

In cases like this, usually checking against ~0 is sufficient

Cheers,
Ben.

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Benjamin Herrenschmidt

On Tue, 2020-07-14 at 18:46 -0500, Bjorn Helgaas wrote:
> Yes.  I have no problem with that.  There are a few cases where it's
> important to check for errors, e.g., we read a status register and do
> something based on a bit being set.  A failure will return all bits
> set, and we may do the wrong thing.  But most of the errors we care
> about will be on MMIO reads, not config reads, so we can probably
> ignore most config read errors.

And in both cases, we don't have the plumbing to provide accurate
and reliable error returns for all platforms anyways (esp. not for
MMIO).

I think it makes sense to stick to the good old "if all 1's, then go
out of line" including for config space.

 ../..

> Yep, except for things like device removal or other PCI errors.

A whole bunch of these are reported asynchronously, esp for writes (and
yes, including config writes, they are supposed to be non-posted but
more often than not, the path  from the CPU to the PCI bridge remains
posted for writes including config ones).

> So maybe a good place to start is by removing some of the useless
> error checking for pci_read_config_*() and pci_write_config_*().
> That's a decent-sized but not impractical project that could be done
> per subsystem or something:
> 
>   git grep -E "(if|return|=).*\ 
> finds about 400 matches.
> 
> Some of those callers probably really *do* want to check for errors,
> and I guess we'd have to identify them and do them separately as you
> mentioned.

I'd be curious about these considering how unreliable our error return
is accross the board.

Cheers,
Ben.

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Benjamin Herrenschmidt

On Tue, 2020-07-14 at 23:02 +0200, Kjetil Oftedal wrote:
> > 
> > > For b), it might be nice to also change other aspects of the
> > > interface, e.g. passing a pci_host_bridge pointer plus bus number
> > > instead of a pci_bus pointer, or having the callback in the
> > > pci_host_bridge structure.
> > 
> > I like this idea a lot, too.  I think the fact that
> > pci_bus_read_config_word() requires a pci_bus * complicates things in
> > a few places.
> > 
> > I think it's completely separate, as you say, and we should defer it
> > for now because even part a) is a lot of work.  I added it to my list
> > of possible future projects.
> > 
> 
> What about strange PCI devices such as Non-Transparent bridges?
> They will require their own PCI Config space accessors that is not
> connected to a host bridge if one wants to do some sort of
> punch-through enumeration.
> I guess the kernel doesn't care much about them?

Well, today they would require a pci_bus anyway.. . so if you want to do
that sort of funny trick you may as well create a "virtual" host bridge.

Cheers,
Ben.

Re: [RFC PATCH 00/35] Move all PCIBIOS* definitions into arch/x86

2020-07-14 Thread Benjamin Herrenschmidt

On Tue, 2020-07-14 at 13:45 -0500, Bjorn Helgaas wrote:
> 
> > fail for valid arguments on a valid pci_device* ?
> 
> I really like this idea.
> 
> pci_write_config_*() has one return value, and only 100ish of 2500
> callers check for errors.  It's sometimes possible for config
> accessors to detect PCI errors and return failure, e.g., device was
> removed or didn't respond, but most of them don't, and detecting
> these
> errors is not really that valuable.
> 
> pci_read_config_*() is much more interesting because it returns two
> things, the function return value and the value read from the PCI
> device, and it's complicated to check both. 

  .../...

I agree. It's a mess at the moment.

We have separate mechanism to convey PCI errors (among other things the
channel state) which should apply to config space when detection is
possible.

I think returning all 1's is the right thing to do here and avoids odd
duplicate error detection logic which I bet you is never properly
tested.

> > For b), it might be nice to also change other aspects of the
> > interface, e.g. passing a pci_host_bridge pointer plus bus number
> > instead of a pci_bus pointer, or having the callback in the
> > pci_host_bridge structure.
> 
> I like this idea a lot, too.  I think the fact that
> pci_bus_read_config_word() requires a pci_bus * complicates things in
> a few places.
> 
> I think it's completely separate, as you say, and we should defer it
> for now because even part a) is a lot of work.  I added it to my list
> of possible future projects.

Agreed on both points.

Cheers,
Ben.

Re: [PATCH v1 6/6] console: Fix trivia typo 'change' -> 'chance'

2020-06-18 Thread Benjamin Herrenschmidt

On Thu, 2020-06-18 at 19:47 +0300, Andy Shevchenko wrote:
> I bet the word 'chance' has to be used in 'had a chance to be called',
> but, alas, I'm not native speaker...

Yup, typo :)

> Signed-off-by: Andy Shevchenko 
> 

Acked-by: Benjamin Herrenschmidt 

> ---
>  kernel/printk/printk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index aaea3ad182e1..6623e975675a 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2705,7 +2705,7 @@ static int try_enable_new_console(struct console 
> *newcon, bool user_specified)
>   /*
>* Some consoles, such as pstore and netconsole, can be enabled even
>* without matching. Accept the pre-enabled consoles only when match()
> -  * and setup() had a change to be called.
> +  * and setup() had a chance to be called.
>*/
>   if (newcon->flags & CON_ENABLED && c->user_specified == user_specified)
>   return 0;

Re: [PATCH v1 5/6] console: Propagate error code from console ->setup()

2020-06-18 Thread Benjamin Herrenschmidt

On Thu, 2020-06-18 at 19:47 +0300, Andy Shevchenko wrote:
> Since console ->setup() hook returns meaningful error codes,
> propagate it to the caller of try_enable_new_console().
> 
> Signed-off-by: Andy Shevchenko 
> 

Acked-by: Benjamin Herrenschmidt 

> ---A
>  kernel/printk/printk.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/printk/printk.c b/kernel/printk/printk.c
> index 8c14835be46c..aaea3ad182e1 100644
> --- a/kernel/printk/printk.c
> +++ b/kernel/printk/printk.c
> @@ -2668,7 +2668,7 @@ early_param("keep_bootcon", keep_bootcon_setup);
>  static int try_enable_new_console(struct console *newcon, bool 
> user_specified)
>  {
>   struct console_cmdline *c;
> - int i;
> + int i, err;
>  
>   for (i = 0, c = console_cmdline;
>i < MAX_CMDLINECONSOLES && c->name[0];
> @@ -2691,8 +2691,8 @@ static int try_enable_new_console(struct console 
> *newcon, bool user_specified)
>   return 0;
>  
>   if (newcon->setup &&
> - newcon->setup(newcon, c->options) != 0)
> - return -EIO;
> + (err = newcon->setup(newcon, c->options)) != 0)
> + return err;
>   }
>   newcon->flags |= CON_ENABLED;
>   if (i == preferred_console) {

Re: [PATCH] irqchip/gic-v3-its: Don't try to move a disabled irq

2020-06-08 Thread Benjamin Herrenschmidt

On Mon, 2020-06-08 at 15:48 +0200, Thomas Gleixner wrote:
> "Herrenschmidt, Benjamin"  writes:
> > On Wed, 2020-06-03 at 16:16 +0100, Marc Zyngier wrote:
> > > > My original patch should certain check activated and not disabled.
> > > > With that do you still have reservations Marc?
> > > 
> > > I'd still prefer it if we could do something in core code, rather
> > > than spreading these checks in the individual drivers. If we can't,
> > > fair enough. But it feels like the core set_affinity function could
> > > just do the same thing in a single place (although the started vs
> > > activated is yet another piece of the puzzle I didn't consider,
> > > and the ITS doesn't need the "can_reserve" thing).
> > 
> > For the sake of fixing the problem in a timely and backportable way I
> > would suggest first merging the fix, *then* fixing the core core.
> 
> The "fix" is just wrong
> 
> > if (cpu != its_dev->event_map.col_map[id]) {
> > target_col = _dev->its->collections[cpu];
> > -   its_send_movi(its_dev, target_col, id);
> > +
> > +   /* If the IRQ is disabled a discard was sent so don't move */
> > +   if (!irqd_irq_disabled(d))
> 
> That check needs to be !irqd_is_activated() because enable_irq() does
> not touch anything affinity related.

Right. Note: other  drivers  (like arch/powerpc/sysdev/xive/common.c
use irqd_is_started() ... this gets confusing :)

> > +   its_send_movi(its_dev, target_col, id);
> > +
> > its_dev->event_map.col_map[id] = cpu;
> > irq_data_update_effective_affinity(d, cpumask_of(cpu));
> 
> And then these associtations are disconnected from reality in any case.

Not sure what you mean here, that said...

> Something like the completely untested patch below should work.

Ok. One possible issue though is before, the driver always had the
opportunity to "vet" the affinity mask for whatever platform
constraints may be there and change it before applying it. This is no
longer the case on a deactivated interrupt with your patch as far as I
can tell. I don't know if that is a problem and if drivers that do that
have what it takes to "fixup" the affinity at startup time, the ones I
wrote don't need that feature, but...

Cheers,
Ben.

> Thanks,
> 
> tglx
> 
> ---
>  arch/x86/kernel/apic/vector.c |   21 +++--
>  kernel/irq/manage.c   |   37 +++--
>  2 files changed, 38 insertions(+), 20 deletions(-)
> 
> --- a/arch/x86/kernel/apic/vector.c
> +++ b/arch/x86/kernel/apic/vector.c
> @@ -446,12 +446,10 @@ static int x86_vector_activate(struct ir
>   trace_vector_activate(irqd->irq, apicd->is_managed,
> apicd->can_reserve, reserve);
>  
> - /* Nothing to do for fixed assigned vectors */
> - if (!apicd->can_reserve && !apicd->is_managed)
> - return 0;
> -
>   raw_spin_lock_irqsave(_lock, flags);
> - if (reserve || irqd_is_managed_and_shutdown(irqd))
> + if (!apicd->can_reserve && !apicd->is_managed)
> + assign_irq_vector_any_locked(irqd);
> + else if (reserve || irqd_is_managed_and_shutdown(irqd))
>   vector_assign_managed_shutdown(irqd);
>   else if (apicd->is_managed)
>   ret = activate_managed(irqd);
> @@ -775,21 +773,8 @@ void lapic_offline(void)
>  static int apic_set_affinity(struct irq_data *irqd,
>const struct cpumask *dest, bool force)
>  {
> - struct apic_chip_data *apicd = apic_chip_data(irqd);
>   int err;
>  
> - /*
> -  * Core code can call here for inactive interrupts. For inactive
> -  * interrupts which use managed or reservation mode there is no
> -  * point in going through the vector assignment right now as the
> -  * activation will assign a vector which fits the destination
> -  * cpumask. Let the core code store the destination mask and be
> -  * done with it.
> -  */
> - if (!irqd_is_activated(irqd) &&
> - (apicd->is_managed || apicd->can_reserve))
> - return IRQ_SET_MASK_OK;
> -
>   raw_spin_lock(_lock);
>   cpumask_and(vector_searchmask, dest, cpu_online_mask);
>   if (irqd_affinity_is_managed(irqd))
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -195,9 +195,9 @@ void irq_set_thread_affinity(struct irq_
>   set_bit(IRQTF_AFFINITY, >thread_flags);
>  }
>  
> +#ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK
>  static void irq_validate_effective_affinity(struct irq_data *data)
>  {
> -#ifdef CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK
>   const struct cpumask *m = irq_data_get_effective_affinity_mask(data);
>   struct irq_chip *chip = irq_data_get_irq_chip(data);
>  
> @@ -205,9 +205,19 @@ static void irq_validate_effective_affin
>   return;
>   pr_warn_once("irq_chip %s did not update eff. affinity mask of irq 
> %u\n",
>chip->name, data->irq);
> -#endif
>

Re: [GIT PULL] x86/mm changes for v5.8

2020-06-02 Thread Benjamin Herrenschmidt

On Tue, 2020-06-02 at 09:33 +0200, Ingo Molnar wrote:
> Or rather, we should ask a higher level question as well, maybe we 
> should not do this feature at all?

Well, it does solve a real issue in some circumstances and there was a
reasonable discussion about this on the list that lead to it being
merged with Kees and Thomas (and others) agreeing :)

But yes, it is pointless with SMT and yes maybe we should make it
explicitly do nothing on SMT, but let's not throw the baby out with the
bath water shall we ?

> Typically cloud computing systems such as AWS will have SMT enabled, 
> because cloud computing pricing is essentially per vCPU, and they want 
> to sell the hyperthreads as vCPUs.

Not necessarily and not in every circumstances. Yes, VMs will typically
have SMT enabled. This isn't targeted at them though. One example that
was given during the discussions was containers pertaining to different
users. Now maybe we could discuss making the flush on changes of
cgroups or other similar things rather than individual process, but so
far that hasn't come up during the disucssion on the mailing list.

Another example would be a process that  handles more critical data
such as payment information, than the rest of the system and wants to
protect itself (or the admin wants that process protected) against
possible data leaks to less trusted processes.

>  So the safest solution, disabling 
> SMT on affected systems, is not actually done, because it's an 
> economic non-starter. (I'd like to note the security double standard 
> there: the most secure option, to disable SMT, is not actually used ...)

This has nothing to do about SMT, though yes maybe we should make the
patch do nothing on SMT but this isn't what this feature is about.

> BTW., I wonder how Amazon is solving the single-vCPU customer workload 
> problem on AWS: if the vast majority of AWS computing capacity is 
> running on a single vCPU, because it's the cheapest tier and because 
> it's more than enough capacity to run a website. Even core-scheduling 
> doesn't solve this fundamental SMT security problem: separate customer 
> workloads *cannot* share the same core - but this means that the 
> single-vCPU workloads will only be able to utilize 50% of all 
> available vCPUs if they are properly isolated.
> 
> Or if the majority of AWS EC2 etc. customer systems are using 2,4 or 
> more vCPUs, then both this feature and 'core-scheduling' is 
> effectively pointless from a security POV, because the cloud computing 
> systems are de-facto partitioned into cores already, with each core 
> accounted as 2 vCPUs.

AWS has more than just VMs for rent :-) There are a whole pile of
higher level "services" that our users can use and not all of them
necessarily run on VMs charged per vCPU.

> The hour-up-rounded way AWS (and many other cloud providers) account 
> system runtime costs suggests that they are doing relatively static 
> partitioning of customer workloads already, i.e. customer workloads 
> are mapped to actual physical hardware in an exclusive fashion, with 
> no overcommitting of physical resources and no sharing of cores 
> between customers.
> 
> If I look at the pricing and capabilities table of AWS:
> 
>   https://aws.amazon.com/ec2/pricing/on-demand/
> 
> Only the 't2' and 't3' On-Demand instances have 'Variable' pricing, 
> which is only 9% of the offered 228 configurations.
> 
> I.e. I strongly suspect that neither L1D flushing nor core-scheduling 
> is actually required on affected vulnerable CPUs to keep customer 
> workloads isolated from each other, on the majority of cloud computing 
> systems, because they are already isolated via semi-static 
> partitioning, using pricing that reflects static partitioning.

This isn't about that. These patches aren't trying to solve problems
happening inside of a customer VM running SMT not are they about
protecting VMs against other VMs on the same system.

Cheers,
Ben.

Re: [PATCH v3 07/18] nitro_enclaves: Init misc device providing the ioctl interface

2020-05-31 Thread Benjamin Herrenschmidt

On Thu, 2020-05-28 at 15:12 +0200, Greg KH wrote:
> So at runtime, after all is booted and up and going, you just ripped
> cores out from under someone's feet?  :)
> 
> And the code really handles writing to that value while the module is
> already loaded and up and running?  At a quick glance, it didn't seem
> like it would handle that very well as it only is checked at ne_init()
> time.
> 
> Or am I missing something?
> 
> Anyway, yes, if you can dynamically do this at runtime, that's great,
> but it feels ackward to me to rely on one configuration thing as a
> module parameter, and everything else through the ioctl interface.
> Unification would seem to be a good thing, right?

I personally still prefer a sysfs file :) I really don't like module
parameters as a way to do such things.

Cheers,
Ben.

Re: [PATCH v3 01/18] nitro_enclaves: Add ioctl interface definition

2020-05-31 Thread Benjamin Herrenschmidt

On Wed, 2020-05-27 at 09:49 +0100, Stefan Hajnoczi wrote:
> 
> What about feature bits or a API version number field? If you add
> features to the NE driver, how will userspace detect them?
> 
> Even if you intend to always compile userspace against the exact kernel
> headers that the program will run on, it can still be useful to have an
> API version for informational purposes and to easily prevent user
> errors (running a new userspace binary on an old kernel where the API is
> different).
> 
> Finally, reserved struct fields may come in handy in the future. That
> way userspace and the kernel don't need to explicitly handle multiple
> struct sizes.

Beware, Greg might disagree :)

That said, yes, at least a way to query the API version would be
useful.

Cheers,
Ben.

Re: [PATCH v3 07/18] nitro_enclaves: Init misc device providing the ioctl interface

2020-05-31 Thread Benjamin Herrenschmidt

On Wed, 2020-05-27 at 00:24 +0200, Greg KH wrote:
> > Would you want random users to get the ability to hot unplug CPUs from your
> > system? At unlimited quantity? I don't :).
> 
> A random user, no, but one with admin rights, why not?  They can do that
> already today on your system, this isn't new.

So I agree with you that a module parameter sucks. I disagree on the
ioctl :)

This is the kind of setup task that will probably end up being done by
some shell script at boot time based on some config file. Being able to
echo something in a sysfs file which will parse the standard-format CPU
list using the existing kernel functions to turn that into a cpu_mask
makes a lot more sense than having a binary interface via an ioctl
which will require an extra userspace program for use by the admin for
that one single task.

So sysfs is my preference here.

Another approach would be configfs, which would provide a more flexible
interface to potentially create multiple "CPU sets" that could be given
to different users or classes of users etc... but that might be pushing
it, I don't know.

Cheers,
Ben.

Re: [PATCH v3 02/18] nitro_enclaves: Define the PCI device interface

2020-05-31 Thread Benjamin Herrenschmidt

On Wed, 2020-05-27 at 00:21 +0200, Greg KH wrote:
> > There are a couple of data structures with more than one member and multiple
> > field sizes. And for the ones that are not, gathered as feedback from
> > previous rounds of review that should consider adding a "flags" field in
> > there for further extensibility.
> 
> Please do not do that in ioctls.  Just create new calls instead of
> trying to "extend" existing ones.  It's always much easier.
> 
> > I can modify to have "__packed" instead of the attribute callout.
> 
> Make sure you even need that, as I don't think you do for structures
> like the above one, right?

Hrm, my impression (granted I only just started to look at this code)
is that these are protocol messages with the PCI devices, not strictly
just ioctl arguments (though they do get conveyed via such ioctls).

Andra-Irina, did I get that right ? :-)

That said, I still think that by carefully ordering the fields and
using explicit padding, we can avoid the need of the packed attributed.

Cheers,
Ben.

Re: [PATCH v3 04/18] nitro_enclaves: Init PCI device driver

2020-05-31 Thread Benjamin Herrenschmidt

On Tue, 2020-05-26 at 21:35 +0300, Paraschiv, Andra-Irina wrote:
> This was needed to have an identifier for the overall NE logic - PCI 
> dev, ioctl and misc dev.
> 
> The ioctl and misc dev logic has pr_* logs, but I can update them to 
> dev_* with misc dev, then remove this prefix.

Or #define pr_fmt, but dev_ is better.

Cheers,
Ben.

Re: [PATCH v3 02/18] nitro_enclaves: Define the PCI device interface

2020-05-31 Thread Benjamin Herrenschmidt

On Tue, 2020-05-26 at 20:01 +0300, Paraschiv, Andra-Irina wrote:
> 
> On 26/05/2020 09:44, Greg KH wrote:
> > On Tue, May 26, 2020 at 01:13:18AM +0300, Andra Paraschiv wrote:
> > > +struct enclave_get_slot_req {
> > > + /* Context ID (CID) for the enclave vsock device. */
> > > + u64 enclave_cid;
> > > +} __attribute__ ((__packed__));
> > 
> > Can you really "pack" a single member structure?
> > 
> > Anyway, we have better ways to specify this instead of the "raw"
> > __attribute__ option.  But first see if you really need any of
> > these, at
> > first glance, I do not think you do at all, and they can all be
> > removed.
> 
> There are a couple of data structures with more than one member and 
> multiple field sizes. And for the ones that are not, gathered as 
> feedback from previous rounds of review that should consider adding
> a 
> "flags" field in there for further extensibility.
> 
> I can modify to have "__packed" instead of the attribute callout.

I tend to prefer designing the protocol so that all the fields are
naturally aligned, which should avoid the need for the attribute. Is
it possible in this case ?

Cheers,
Ben.

Re: [PATCH v3 07/18] nitro_enclaves: Init misc device providing the ioctl interface

2020-05-31 Thread Benjamin Herrenschmidt

On Tue, 2020-05-26 at 14:44 +0200, Alexander Graf wrote:
> So I really don't think an ioctl would be a great user experience. Same 
> for a sysfs file - although that's probably slightly better than the ioctl.

What would be wrong with a sysfs file ?

Another way to approach that makes sense from a kernel perspective is
to have the user first offline the CPUs, then "donate" them to the
driver via a sysfs file.

> Other options I can think of:
> 
>* sysctl (for modules?)

Why would that be any good ? If anything sysctl's are even more awkward
in my book :)

>* module parameter (as implemented here)
>* proc file (deprecated FWIW)

Yeah no.

> The key is the tenant split: Admin sets the pool up, user consumes. This 
> setup should happen (early) on boot, so that system services can spawn 
> enclaves.

Right and you can have some init script or udev rule that sets that up
from a sys admin produced config file at boot upon detection of the
enclave PCI device for example.

> > module parameters are a major pain, you know this :)
> 
> I think in this case it's the least painful option ;). But I'm really 
> happy to hear about an actually good alternative to it. Right now, I 
> just can't think of any.

Cheers,
Ben.

Re: [PATCH v3 07/18] nitro_enclaves: Init misc device providing the ioctl interface

2020-05-31 Thread Benjamin Herrenschmidt

On Tue, 2020-05-26 at 08:51 +0200, Greg KH wrote:
> 
> And get them to sign off on it too, showing they agree with the design
> decisions here :)

Isn't it generally frowned upon to publish a patch with internal sign-
off's on it already ? Or do you mean for us to publicly sign off once
we have reviewed ?

Cheers,
Ben.

Re: [PATCH] net: bmac: Fix stack corruption panic in bmac_probe()

2020-05-19 Thread Benjamin Herrenschmidt

On Tue, 2020-05-19 at 08:59 +0800, Jeremy Kerr wrote:
> Hi Stan,
> 
> > The new kernel compiled and booted with no errors, with these
> > STACKPROTECTOR options in .config (the last two revealed the bug):
> > 
> > CONFIG_HAVE_STACKPROTECTOR=y
> > CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
> > CONFIG_STACKPROTECTOR=y
> > CONFIG_STACKPROTECTOR_STRONG=y
> 
> Brilliant, thanks for testing. I'll send a standalone patch to
> netdev.

Nice catch :-) Lots of people used these machines for years without
noticing :-)  Granted most people used newer generation stuff with
a gmac instead.

Cheers,
Ben.

Re: [RFC PATCH v2 0/3] Prefer working VT console over SPCR and device-tree chosen stdout-path

2020-05-13 Thread Benjamin Herrenschmidt

On Wed, 2020-05-13 at 16:37 +0200, Petr Mladek wrote:
> The only common rules are:
> 
>+ The last console on the command line should always be the
>  preferred one when defined.
> 
>+ Consoles defined by the device (SPCR, device tree) are used
>  when there is no commandline.

With the exception that on x86, SPCR is only used for early_con, we
don't do add_preferred_console() at all for it.

I sort-of understand why... the track record on BIOS quality out there
being what it is, I could see this causing a number of systems start
sending the console to a non-existent or non-wired serial port instead
of the tty/gpu because the BIOS leave SPCR set/enabled for no reason.

It may or may not be the case in practice but I don't see how we can
figure that out without either a large campain of data collection from
tons of systems (which will miss plenty) or just taking the chance &
breaking people and see who screams :-)

Cheers,
Ben.

Re: [PATCH v0 linux master] i2c/busses: Avoid i2c interrupt status clear race condition.

2020-04-30 Thread Benjamin Herrenschmidt

On Wed, 2020-04-29 at 11:03 +0200, Wolfram Sang wrote:
> > And is there maybe a Fixes: tag for it?
> > [Ryan Chen] Yes it is a fix patch.
> 
> I meant this (from submitting-patches.rst):

It fixes the original implementation of the driver basically. It's just
a classic posted-write fix. The write to clear the pending interrupt is
asynchronous, so you can get spurrious ones if you return from the
handler before it has percolated to the HW.

I assume it's just more visible on the 2600 because of the cores are
significantly faster but the IO bus is still as dumb.

Ryan: You could always add a Fixed-by: tag that specifies the commit
that added the initial driver...

Cheers,
Ben.

Re: [PATCH v0 linux master] i2c/busses: Avoid i2c interrupt status clear race condition.

2020-04-30 Thread Benjamin Herrenschmidt

On Wed, 2020-04-29 at 11:37 +0800, ryan_chen wrote:
> In AST2600 there have a slow peripheral bus between CPU
>  and i2c controller.
> Therefore GIC i2c interrupt status clear have delay timing,
> when CPU issue write clear i2c controller interrupt status.
> To avoid this issue, the driver need have read after write
>  clear at i2c ISR.
> 
> Signed-off-by: ryan_chen 

Acked-by: Benjamin Herrenschmidt 
--


> ---
>  drivers/i2c/busses/i2c-aspeed.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/i2c/busses/i2c-aspeed.c
> b/drivers/i2c/busses/i2c-aspeed.c
> index 07c1993274c5..f51702d86a90 100644
> --- a/drivers/i2c/busses/i2c-aspeed.c
> +++ b/drivers/i2c/busses/i2c-aspeed.c
> @@ -603,6 +603,7 @@ static irqreturn_t aspeed_i2c_bus_irq(int irq,
> void *dev_id)
>   /* Ack all interrupts except for Rx done */
>   writel(irq_received & ~ASPEED_I2CD_INTR_RX_DONE,
>  bus->base + ASPEED_I2C_INTR_STS_REG);
> + readl(bus->base + ASPEED_I2C_INTR_STS_REG);
>   irq_remaining = irq_received;
>  
>  #if IS_ENABLED(CONFIG_I2C_SLAVE)
> @@ -645,9 +646,11 @@ static irqreturn_t aspeed_i2c_bus_irq(int irq,
> void *dev_id)
>   irq_received, irq_handled);
>  
>   /* Ack Rx done */
> - if (irq_received & ASPEED_I2CD_INTR_RX_DONE)
> + if (irq_received & ASPEED_I2CD_INTR_RX_DONE) {
>   writel(ASPEED_I2CD_INTR_RX_DONE,
>  bus->base + ASPEED_I2C_INTR_STS_REG);
> + readl(bus->base + ASPEED_I2C_INTR_STS_REG);
> + }
>   spin_unlock(>lock);
>   return irq_remaining ? IRQ_NONE : IRQ_HANDLED;
>  }

Re: [PATCH v5 3/5] drivers/soc/litex: add LiteX SoC Controller driver

2020-04-28 Thread Benjamin Herrenschmidt

On Mon, 2020-04-27 at 11:13 +0200, Mateusz Holenko wrote:
> As Gabriel Somlo  suggested to me, I could still use
> readl/writel/ioread/iowrite() standard functions providing memory
> barriers *and* have values in CPU native endianness by using the
> following constructs:
> 
> `le32_to_cpu(readl(addr))`
> 
> and
> 
> `writel(cpu_to_le32(value), addr)`
> 
> as le32_to_cpu/cpu_to_le32():
> - does nothing on LE CPUs and
> - reorders bytes on BE CPUs which in turn reverts swapping made by
> readl() resulting in returning the original value.

It's a bit sad... I don't understand why you need this. The HW has a
fied endian has you have mentioned earlier (and that is a good design).

The fact that you are trying to shove things into a "smaller pipe" than
the actual register shouldn't affect at what address the MSB and LSB
reside. And readl/writel (or ioread32/iowrite32) will always be LE as
well, so will match the HW layout. Thus I don't see why you need to
play swapping games here.

This however would be avoided completely if the HW was a tiny bit
smarter and would do the multi-beat access for you which shouldn't be
terribly hard to implement.

That said, it would be even clearer if you just open coded the 2 or 3
useful cases: 32/8, 32/16 and 32/32. The loop with calculated shifts
(and no masks) makes the code hard to understand.

Cheers,
Ben.

Re: [PATCH v5 3/5] drivers/soc/litex: add LiteX SoC Controller driver

2020-04-28 Thread Benjamin Herrenschmidt

On Sat, 2020-04-25 at 13:42 +0200, Mateusz Holenko wrote:
> From: Pawel Czarnecki 
> 
> This commit adds driver for the FPGA-based LiteX SoC
> Controller from LiteX SoC builder.

Sorry for jumping in late, Joel only just pointed me to this :)

> + * The purpose of `litex_set_reg`/`litex_get_reg` is to implement
> + * the logic of writing to/reading from the LiteX CSR in a single
> + * place that can be then reused by all LiteX drivers.
> + */
> +void litex_set_reg(void __iomem *reg, unsigned long reg_size,
> + unsigned long val)
> +{
> + unsigned long shifted_data, shift, i;
> + unsigned long flags;
> +
> + spin_lock_irqsave(_lock, flags);
> +
> + for (i = 0; i < reg_size; ++i) {
> + shift = ((reg_size - i - 1) * LITEX_SUBREG_SIZE_BIT);
> + shifted_data = val >> shift;
> +
> + __raw_writel(shifted_data, reg + (LITEX_REG_SIZE * i));
> + }
> +
> + spin_unlock_irqrestore(_lock, flags);
> +}
> +
> +unsigned long litex_get_reg(void __iomem *reg, unsigned long reg_size)
> +{
> + unsigned long shifted_data, shift, i;
> + unsigned long result = 0;
> + unsigned long flags;
> +
> + spin_lock_irqsave(_lock, flags);
> +
> + for (i = 0; i < reg_size; ++i) {
> + shifted_data = __raw_readl(reg + (LITEX_REG_SIZE * i));
> +
> + shift = ((reg_size - i - 1) * LITEX_SUBREG_SIZE_BIT);
> + result |= (shifted_data << shift);
> + }
> +
> + spin_unlock_irqrestore(_lock, flags);
> +
> + return result;
> +}

I really don't like the fact that the register sizes & sub sizes are
#defined. As your comment explains, this makes it harder to support
other configurations. This geometry should come from the device-tree
instead.

Also this while thing is rather gross (and the lock will not help
performance). Why can't CSRs be normally memory mapped always instead ?

Even when transporting them on a HW bus that's smaller, the HW bus
conversion should be able to do the break-down into a multi-breat
transfer rather than doing that in SW.

Or at least have a fast-path if the register size is no larger than the
sub size, so you can use a normal ioread32/iowrite32.

Also I wonder ... last I played with LiteX, it would re-generate the
register layout (including the bit layout inside registers potentially)
rather enthousiastically, making it pretty hard to have a fixed
register layout for use by a kernel driver. Was this addressed ?

Cheers,
Ben.

Re: [PATCH v2] ftgmac100: Disable HW checksum generation on AST2500

2019-10-19 Thread Benjamin Herrenschmidt

On Sat, 2019-10-19 at 01:31 +, Vijay Khemka wrote:
> Thanks Ben,
> I will try to add some trace and test whatever possible and test it.
> As we
> don't have tcpdump into our image and I have limited understanding of
> networking stack so if you get some time to verify ipv6, it will be
> really
> helpful. 
> 

You only need tcpdump (or wireshark) on the *other end* of the link,
could even be your laptop, to look at what the generated frames look
like and compare with your traces.

Cheers,
Ben.

Re: [PATCH v2] ftgmac100: Disable HW checksum generation on AST2500

2019-10-18 Thread Benjamin Herrenschmidt

On Fri, 2019-10-18 at 22:50 +, Vijay Khemka wrote:
> I don't have much understanding of IP Stack but I went through code details 
> and 
> you are right and found that it should fallback to SW calculation for IPV6 
> but it doesn't
> happen because ftgmac100_hard_start_xmit checks for CHECKSUM_PARTIAL before
> setting HW checksum and calling ftgmac100_prep_tx_csum function. And in my 
> understanding, this value is set CHECKSUM_PARTIAL in IP stack. I looked up IP 
> stack for
> IPV6, file net/ipv6/ip6_output.c, function __ip6_append_data: here it sets 
> CHECKSUM_PARTIAL only for UDP packets not for TCP packets. Please look at line
>  number 1880. This could be an issue we are seeing here as why
> ftgmac100_prep_tx_csum is not getting triggered for IPV6 with TCP. Please 
> correct
> me if my understanding is wrong.
> 

Not entirely sure. tcp_v6_send_response() in tcp_ipv6.c does set
CHECKSUM_PARTIAL as well. I don't really know how things are being
handled in that part of the network stack though.

>From a driver perspective, if the value of ip_summed is not
CHECKSUM_PARTIAL it means we should not have to calculate any checksum.
At least that's my understanding here.

You may need to add some traces to the driver to see what you get in
there, what protocol indication etc... and analyze the corresponding
packets with something like tcpdump or wireshark on the other end.

Cheers,
Ben.

Re: [PATCH v2] ftgmac100: Disable HW checksum generation on AST2500

2019-10-17 Thread Benjamin Herrenschmidt

On Fri, 2019-10-18 at 00:06 +, Vijay Khemka wrote:
> 
> > This is not a matter of unsupported csum, it is broken hw csum. 
> > That's why we disable hw checksum. My guess is once we disable
> > Hw checksum, it will use sw checksum. So I am just disabling hw 
> > Checksum.
> 
> I don't understand what you are saying. You reported a problem with
> IPV6 checksums generation. The HW doesn't support it. What's "not a
> matter of unsupported csum" ?
> 
> Your patch uses a *deprecated* bit to tell the network stack to only do
> HW checksum generation on IPV4.
> 
> This bit is deprecated for a reason, again, see skbuff.h. The right
> approach, *which the driver already does*, is to tell the stack that we
> support HW checksuming using NETIF_F_HW_CSUM, and then, in the transmit
> handler, to call skb_checksum_help() to have the SW calculate the
> checksum if it's not a supported type.
> 
> My understanding was when we enable NETIF_F_HW_CSUM means network 
> stack enables HW checksum and doesn't calculate SW checksum. But as per
> this supported types HW checksum are used only for IPV4 and not for IPV6 even
> though driver enabled NETIF_F_HW_CSUM. For IPV6 it is always a SW generated
> checksum, please correct me here.

Have you actually read the comments in skbuff.h that I pointed you to ?

And the rest of my email for that matter ?

> This is exactly what ftgmac100_prep_tx_csum() does. It only enables HW
> checksum generation on supported types and uses skb_checksum_help()
> otherwise, supported types being protocol ETH_P_IP and IP protocol
> being raw IP, TCP and UDP.
> 
> 
> So this *should* have fallen back to SW for IPV6. So either something
> in my code there is making an incorrect assumption, or something is
> broken in skb_checksum_help() for IPV6 (which I somewhat doubt) or
> something else I can't think of, but setting a *deprecated* flag is
> definitely not the right answer, neither is completely disabling HW
> checksumming.
> 
> So can you investigate what's going on a bit more closely please ? I
> can try myself, though I have very little experience with IPV6 and
> probably won't have time before next week.
> 
> Cheers,
> Ben.
> 
> > The driver should have handled unsupported csum via SW fallback
> > already in ftgmac100_prep_tx_csum()
> > 
> > Can you check why this didn't work for you ?
> > 
> > Cheers,
> > Ben.
> > 
> > > Signed-off-by: Vijay Khemka 
> > > ---
> > > Changes since v1:
> > >  Enabled IPV4 hw checksum generation as it works for IPV4.
> > > 
> > >  drivers/net/ethernet/faraday/ftgmac100.c | 13 -
> > >  1 file changed, 12 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c
> > > b/drivers/net/ethernet/faraday/ftgmac100.c
> > > index 030fed65393e..0255a28d2958 100644
> > > --- a/drivers/net/ethernet/faraday/ftgmac100.c
> > > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> > > @@ -1842,8 +1842,19 @@ static int ftgmac100_probe(struct
> > > platform_device *pdev)
> > >   /* AST2400  doesn't have working HW checksum generation */
> > >   if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
> > >   netdev->hw_features &= ~NETIF_F_HW_CSUM;
> > > +
> > > + /* AST2500 doesn't have working HW checksum generation for IPV6
> > > +  * but it works for IPV4, so disabling hw checksum and enabling
> > > +  * it for only IPV4.
> > > +  */
> > > + if (np && (of_device_is_compatible(np, "aspeed,ast2500-mac")))
> > > {
> > > + netdev->hw_features &= ~NETIF_F_HW_CSUM;
> > > + netdev->hw_features |= NETIF_F_IP_CSUM;
> > > + }
> > > +
> > >   if (np && of_get_property(np, "no-hw-checksum", NULL))
> > > - netdev->hw_features &= ~(NETIF_F_HW_CSUM |
> > > NETIF_F_RXCSUM);
> > > + netdev->hw_features &= ~(NETIF_F_HW_CSUM |
> > > NETIF_F_RXCSUM
> > > +  | NETIF_F_IP_CSUM);
> > >   netdev->features |= netdev->hw_features;
> > >  
> > >   /* register network device */
> > 
> > 
> > 
> 
> 
>

Re: [PATCH v2] ftgmac100: Disable HW checksum generation on AST2500

2019-10-17 Thread Benjamin Herrenschmidt

On Fri, 2019-10-18 at 10:14 +1100, Benjamin Herrenschmidt wrote:
> 
> I don't understand what you are saying. You reported a problem with
> IPV6 checksums generation. The HW doesn't support it. What's "not a
> matter of unsupported csum" ?
> 
> Your patch uses a *deprecated* bit to tell the network stack to only do
> HW checksum generation on IPV4.
> 
> This bit is deprecated for a reason, again, see skbuff.h. The right
> approach, *which the driver already does*, is to tell the stack that we
> support HW checksuming using NETIF_F_HW_CSUM, and then, in the transmit
> handler, to call skb_checksum_help() to have the SW calculate the
> checksum if it's not a supported type.
> 
> This is exactly what ftgmac100_prep_tx_csum() does. It only enables HW
> checksum generation on supported types and uses skb_checksum_help()
> otherwise, supported types being protocol ETH_P_IP and IP protocol
> being raw IP, TCP and UDP.
> 
> So this *should* have fallen back to SW for IPV6. So either something
> in my code there is making an incorrect assumption, or something is
> broken in skb_checksum_help() for IPV6 (which I somewhat doubt) or
> something else I can't think of, but setting a *deprecated* flag is
> definitely not the right answer, neither is completely disabling HW
> checksumming.
> 
> So can you investigate what's going on a bit more closely please ? I
> can try myself, though I have very little experience with IPV6 and
> probably won't have time before next week.

I did get that piece of information from Aspeed: The HW checksum
generation is supported if:

 - The length of UDP header is always 20 bytes.
 - The length of TCP and IP header have 4 * N bytes (N is 5 to 15).

Now these afaik are also the protocol limits, so it *should* work.

Or am I missing something or some funky encaspulation/header format
that can be used under some circumstances ?

Cheers,
Ben.

Re: [PATCH v2] ftgmac100: Disable HW checksum generation on AST2500

2019-10-17 Thread Benjamin Herrenschmidt

On Thu, 2019-10-17 at 22:01 +, Vijay Khemka wrote:
> 
> On 10/16/19, 6:29 PM, "Benjamin Herrenschmidt"  
> wrote:
> 
> On Fri, 2019-10-11 at 14:30 -0700, Vijay Khemka wrote:
> > HW checksum generation is not working for AST2500, specially with
> > IPV6
> > over NCSI. All TCP packets with IPv6 get dropped. By disabling this
> > it works perfectly fine with IPV6. As it works for IPV4 so enabled
> > hw checksum back for IPV4.
> > 
> > Verified with IPV6 enabled and can do ssh.
> 
> So while this probably works, I don't think this is the right
> approach, at least according to the comments in skbuff.h
> 
> This is not a matter of unsupported csum, it is broken hw csum. 
> That's why we disable hw checksum. My guess is once we disable
> Hw checksum, it will use sw checksum. So I am just disabling hw 
> Checksum.

I don't understand what you are saying. You reported a problem with
IPV6 checksums generation. The HW doesn't support it. What's "not a
matter of unsupported csum" ?

Your patch uses a *deprecated* bit to tell the network stack to only do
HW checksum generation on IPV4.

This bit is deprecated for a reason, again, see skbuff.h. The right
approach, *which the driver already does*, is to tell the stack that we
support HW checksuming using NETIF_F_HW_CSUM, and then, in the transmit
handler, to call skb_checksum_help() to have the SW calculate the
checksum if it's not a supported type.

This is exactly what ftgmac100_prep_tx_csum() does. It only enables HW
checksum generation on supported types and uses skb_checksum_help()
otherwise, supported types being protocol ETH_P_IP and IP protocol
being raw IP, TCP and UDP.

So this *should* have fallen back to SW for IPV6. So either something
in my code there is making an incorrect assumption, or something is
broken in skb_checksum_help() for IPV6 (which I somewhat doubt) or
something else I can't think of, but setting a *deprecated* flag is
definitely not the right answer, neither is completely disabling HW
checksumming.

So can you investigate what's going on a bit more closely please ? I
can try myself, though I have very little experience with IPV6 and
probably won't have time before next week.

Cheers,
Ben.

> The driver should have handled unsupported csum via SW fallback
> already in ftgmac100_prep_tx_csum()
> 
> Can you check why this didn't work for you ?
> 
> Cheers,
> Ben.
> 
> > Signed-off-by: Vijay Khemka 
> > ---
> > Changes since v1:
> >  Enabled IPV4 hw checksum generation as it works for IPV4.
> > 
> >  drivers/net/ethernet/faraday/ftgmac100.c | 13 -
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c
> > b/drivers/net/ethernet/faraday/ftgmac100.c
> > index 030fed65393e..0255a28d2958 100644
> > --- a/drivers/net/ethernet/faraday/ftgmac100.c
> > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> > @@ -1842,8 +1842,19 @@ static int ftgmac100_probe(struct
> > platform_device *pdev)
> > /* AST2400  doesn't have working HW checksum generation */
> > if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
> > netdev->hw_features &= ~NETIF_F_HW_CSUM;
> > +
> > +   /* AST2500 doesn't have working HW checksum generation for IPV6
> > +* but it works for IPV4, so disabling hw checksum and enabling
> > +* it for only IPV4.
> > +*/
> > +   if (np && (of_device_is_compatible(np, "aspeed,ast2500-mac")))
> > {
> > +   netdev->hw_features &= ~NETIF_F_HW_CSUM;
> > +   netdev->hw_features |= NETIF_F_IP_CSUM;
> > +   }
> > +
> > if (np && of_get_property(np, "no-hw-checksum", NULL))
> > -   netdev->hw_features &= ~(NETIF_F_HW_CSUM |
> > NETIF_F_RXCSUM);
> > +   netdev->hw_features &= ~(NETIF_F_HW_CSUM |
> > NETIF_F_RXCSUM
> > +| NETIF_F_IP_CSUM);
> > netdev->features |= netdev->hw_features;
> >  
> > /* register network device */
> 
> 
>

Re: [PATCH v2] ftgmac100: Disable HW checksum generation on AST2500

2019-10-16 Thread Benjamin Herrenschmidt

On Fri, 2019-10-11 at 14:30 -0700, Vijay Khemka wrote:
> HW checksum generation is not working for AST2500, specially with
> IPV6
> over NCSI. All TCP packets with IPv6 get dropped. By disabling this
> it works perfectly fine with IPV6. As it works for IPV4 so enabled
> hw checksum back for IPV4.
> 
> Verified with IPV6 enabled and can do ssh.

So while this probably works, I don't think this is the right
approach, at least according to the comments in skbuff.h

The driver should have handled unsupported csum via SW fallback
already in ftgmac100_prep_tx_csum()

Can you check why this didn't work for you ?

Cheers,
Ben.

> Signed-off-by: Vijay Khemka 
> ---
> Changes since v1:
>  Enabled IPV4 hw checksum generation as it works for IPV4.
> 
>  drivers/net/ethernet/faraday/ftgmac100.c | 13 -
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/faraday/ftgmac100.c
> b/drivers/net/ethernet/faraday/ftgmac100.c
> index 030fed65393e..0255a28d2958 100644
> --- a/drivers/net/ethernet/faraday/ftgmac100.c
> +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> @@ -1842,8 +1842,19 @@ static int ftgmac100_probe(struct
> platform_device *pdev)
>   /* AST2400  doesn't have working HW checksum generation */
>   if (np && (of_device_is_compatible(np, "aspeed,ast2400-mac")))
>   netdev->hw_features &= ~NETIF_F_HW_CSUM;
> +
> + /* AST2500 doesn't have working HW checksum generation for IPV6
> +  * but it works for IPV4, so disabling hw checksum and enabling
> +  * it for only IPV4.
> +  */
> + if (np && (of_device_is_compatible(np, "aspeed,ast2500-mac")))
> {
> + netdev->hw_features &= ~NETIF_F_HW_CSUM;
> + netdev->hw_features |= NETIF_F_IP_CSUM;
> + }
> +
>   if (np && of_get_property(np, "no-hw-checksum", NULL))
> - netdev->hw_features &= ~(NETIF_F_HW_CSUM |
> NETIF_F_RXCSUM);
> + netdev->hw_features &= ~(NETIF_F_HW_CSUM |
> NETIF_F_RXCSUM
> +  | NETIF_F_IP_CSUM);
>   netdev->features |= netdev->hw_features;
>  
>   /* register network device */

Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500

2019-10-10 Thread Benjamin Herrenschmidt

On Thu, 2019-10-10 at 19:15 +, Vijay Khemka wrote:
> Any news on this ? AST2400 has no HW checksum logic in HW, AST2500
> should work for IPV4 fine, we should only selectively disable it for
> IPV6.
> 
> Ben, I have already sent v2 for this with requested change which only disable 
> for IPV6 in AST2500. I can send it again.

I didn't see it, did you CC me ? I maintain that driver...

Cheers,
Ben.

Re: [PATCH 1/3] dt-bindings: net: ftgmac100: Document AST2600 compatible

2019-10-09 Thread Benjamin Herrenschmidt

On Wed, 2019-10-09 at 15:25 +1030, Andrew Jeffery wrote:
> 
> On Wed, 9 Oct 2019, at 15:19, Andrew Jeffery wrote:
> > 
> > 
> > On Wed, 9 Oct 2019, at 15:08, Benjamin Herrenschmidt wrote:
> > > On Tue, 2019-10-08 at 22:21 +1030, Andrew Jeffery wrote:
> > > > The AST2600 contains an FTGMAC100-compatible MAC, although it
> > > > no-
> > > > longer
> > > > contains an MDIO controller.
> > > 
> > > How do you talk to the PHY then ?
> > 
> > There are still MDIO controllers, they're just not in the MAC IP on
> > the 2600.
> 
> Sorry, on reflection that description is a little ambiguous in its
> use of 'it'. I'll
> fix that in v2 as well. Does this read better?
> 
> "The AST2600 contains an FTGMAC100-compatible MAC, although the MAC
> no-longer contains an MDIO controller."

That's fine. Or to be pendantic, say the MDIO controller has been moved
of the MAC unit into its own separate block or something along those
lines so people like me don't get anxious :)

Cheers,
Ben.

Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500

2019-10-09 Thread Benjamin Herrenschmidt

On Wed, 2019-10-09 at 14:18 -0400, Oskar Senft wrote:
> Does HW in the AST2500 actually perform the HW checksum calculation,
> or would that be the responsibility of the NIC that it's talking to
> via NC-SI?

I wouldn't rely on the NC-SI NIC for UDP/TCP checksums. We should be
providing it with well formed traffic.

Cheers,
Ben.

> Oskar.
> 
> On Wed, Oct 9, 2019 at 12:38 AM Benjamin Herrenschmidt <
> b...@kernel.crashing.org> wrote:
> > On Wed, 2019-09-11 at 14:48 +, Joel Stanley wrote:
> > > Hi Ben,
> > > 
> > > On Tue, 10 Sep 2019 at 22:05, Florian Fainelli <
> > f.faine...@gmail.com>
> > > wrote:
> > > > 
> > > > On 9/10/19 2:37 PM, Vijay Khemka wrote:
> > > > > HW checksum generation is not working for AST2500, specially
> > with
> > > > > IPV6
> > > > > over NCSI. All TCP packets with IPv6 get dropped. By
> > disabling
> > > > > this
> > > > > it works perfectly fine with IPV6.
> > > > > 
> > > > > Verified with IPV6 enabled and can do ssh.
> > > > 
> > > > How about IPv4, do these packets have problem? If not, can you
> > > > continue
> > > > advertising NETIF_F_IP_CSUM but take out NETIF_F_IPV6_CSUM?
> > > > 
> > > > > 
> > > > > Signed-off-by: Vijay Khemka 
> > > > > ---
> > > > >  drivers/net/ethernet/faraday/ftgmac100.c | 5 +++--
> > > > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c
> > > > > b/drivers/net/ethernet/faraday/ftgmac100.c
> > > > > index 030fed65393e..591c9725002b 100644
> > > > > --- a/drivers/net/ethernet/faraday/ftgmac100.c
> > > > > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> > > > > @@ -1839,8 +1839,9 @@ static int ftgmac100_probe(struct
> > > > > platform_device *pdev)
> > > > >   if (priv->use_ncsi)
> > > > >   netdev->hw_features |=
> > NETIF_F_HW_VLAN_CTAG_FILTER;
> > > > > 
> > > > > - /* AST2400  doesn't have working HW checksum generation
> > */
> > > > > - if (np && (of_device_is_compatible(np, "aspeed,ast2400-
> > > > > mac")))
> > > > > + /* AST2400  and AST2500 doesn't have working HW
> > checksum
> > > > > generation */
> > > > > + if (np && (of_device_is_compatible(np, "aspeed,ast2400-
> > > > > mac") ||
> > > > > +of_device_is_compatible(np, "aspeed,ast2500-
> > > > > mac")))
> > > 
> > > Do you recall under what circumstances we need to disable
> > hardware
> > > checksumming?
> > 
> > Any news on this ? AST2400 has no HW checksum logic in HW, AST2500
> > should work for IPV4 fine, we should only selectively disable it
> > for
> > IPV6.
> > 
> > Can you do an updated patch ?
> > 
> > Cheers,
> > Ben.
> >

Re: [PATCH 1/3] dt-bindings: net: ftgmac100: Document AST2600 compatible

2019-10-08 Thread Benjamin Herrenschmidt

On Tue, 2019-10-08 at 22:21 +1030, Andrew Jeffery wrote:
> The AST2600 contains an FTGMAC100-compatible MAC, although it no-
> longer
> contains an MDIO controller.

How do you talk to the PHY then ?

Cheers,
Ben.

> Signed-off-by: Andrew Jeffery 
> ---
>  Documentation/devicetree/bindings/net/ftgmac100.txt | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/ftgmac100.txt
> b/Documentation/devicetree/bindings/net/ftgmac100.txt
> index 72e7aaf7242e..04cc0191b7dd 100644
> --- a/Documentation/devicetree/bindings/net/ftgmac100.txt
> +++ b/Documentation/devicetree/bindings/net/ftgmac100.txt
> @@ -9,6 +9,7 @@ Required properties:
>  
>   - "aspeed,ast2400-mac"
>   - "aspeed,ast2500-mac"
> + - "aspeed,ast2600-mac"
>  
>  - reg: Address and length of the register set for the device
>  - interrupts: Should contain ethernet controller interrupt

Re: [PATCH] ftgmac100: Disable HW checksum generation on AST2500

2019-10-08 Thread Benjamin Herrenschmidt

On Wed, 2019-09-11 at 14:48 +, Joel Stanley wrote:
> Hi Ben,
> 
> On Tue, 10 Sep 2019 at 22:05, Florian Fainelli 
> wrote:
> > 
> > On 9/10/19 2:37 PM, Vijay Khemka wrote:
> > > HW checksum generation is not working for AST2500, specially with
> > > IPV6
> > > over NCSI. All TCP packets with IPv6 get dropped. By disabling
> > > this
> > > it works perfectly fine with IPV6.
> > > 
> > > Verified with IPV6 enabled and can do ssh.
> > 
> > How about IPv4, do these packets have problem? If not, can you
> > continue
> > advertising NETIF_F_IP_CSUM but take out NETIF_F_IPV6_CSUM?
> > 
> > > 
> > > Signed-off-by: Vijay Khemka 
> > > ---
> > >  drivers/net/ethernet/faraday/ftgmac100.c | 5 +++--
> > >  1 file changed, 3 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/net/ethernet/faraday/ftgmac100.c
> > > b/drivers/net/ethernet/faraday/ftgmac100.c
> > > index 030fed65393e..591c9725002b 100644
> > > --- a/drivers/net/ethernet/faraday/ftgmac100.c
> > > +++ b/drivers/net/ethernet/faraday/ftgmac100.c
> > > @@ -1839,8 +1839,9 @@ static int ftgmac100_probe(struct
> > > platform_device *pdev)
> > >   if (priv->use_ncsi)
> > >   netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_FILTER;
> > > 
> > > - /* AST2400  doesn't have working HW checksum generation */
> > > - if (np && (of_device_is_compatible(np, "aspeed,ast2400-
> > > mac")))
> > > + /* AST2400  and AST2500 doesn't have working HW checksum
> > > generation */
> > > + if (np && (of_device_is_compatible(np, "aspeed,ast2400-
> > > mac") ||
> > > +of_device_is_compatible(np, "aspeed,ast2500-
> > > mac")))
> 
> Do you recall under what circumstances we need to disable hardware
> checksumming?

Any news on this ? AST2400 has no HW checksum logic in HW, AST2500
should work for IPV4 fine, we should only selectively disable it for
IPV6.

Can you do an updated patch ?

Cheers,
Ben.

Re: [PATCH] powerpc/time: use feature fixup in __USE_RTC() instead of cpu feature.

2019-08-26 Thread Benjamin Herrenschmidt

On Mon, 2019-08-26 at 21:41 +1000, Michael Ellerman wrote:
> Christophe Leroy  writes:
> > sched_clock(), used by printk(), calls __USE_RTC() to know
> > whether to use realtime clock or timebase.
> > 
> > __USE_RTC() uses cpu_has_feature() which is initialised by
> > machine_init(). Before machine_init(), __USE_RTC() returns true,
> > leading to a program check exception on CPUs not having realtime
> > clock.
> > 
> > In order to be able to use printk() earlier, use feature fixup.
> > Feature fixups are applies in early_init(), enabling the use of
> > printk() earlier.
> > 
> > Signed-off-by: Christophe Leroy 
> > ---
> >  arch/powerpc/include/asm/time.h | 9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> The other option would be just to make this a compile time decision, eg.
> add CONFIG_PPC_601 and use that to gate whether we use RTC.
> 
> Given how many 601 users there are, maybe 1?, I think that would be a
> simpler option and avoids complicating the code / binary for everyone
> else.

Didn't we ditch 601 support years ago anyway ? We had workaround we
threw out I think...

Cheers,
Ben.

> cheers
> 
> > diff --git a/arch/powerpc/include/asm/time.h 
> > b/arch/powerpc/include/asm/time.h
> > index 54f4ec1f9fab..3455cb54c333 100644
> > --- a/arch/powerpc/include/asm/time.h
> > +++ b/arch/powerpc/include/asm/time.h
> > @@ -42,7 +42,14 @@ struct div_result {
> >  /* Accessor functions for the timebase (RTC on 601) registers. */
> >  /* If one day CONFIG_POWER is added just define __USE_RTC as 1 */
> >  #ifdef CONFIG_PPC_BOOK3S_32
> > -#define __USE_RTC()(cpu_has_feature(CPU_FTR_USE_RTC))
> > +static inline bool __USE_RTC(void)
> > +{
> > +   asm_volatile_goto(ASM_FTR_IFCLR("nop;", "b %1;", %0) ::
> > + "i" (CPU_FTR_USE_RTC) :: l_use_rtc);
> > +   return false;
> > +l_use_rtc:
> > +   return true;
> > +}
> >  #else
> >  #define __USE_RTC()0
> >  #endif
> > -- 
> > 2.13.3

Re: [PATCH v4 2/4] nvme-pci: Add support for variable IO SQ element size

2019-08-22 Thread Benjamin Herrenschmidt

On Thu, 2019-08-22 at 19:52 -0700, Sagi Grimberg wrote:
> > > I'll fix it. Note that I'm going to take it out of the tree soon
> > > because it will have conflicts with Jens for-5.4/block, so we
> > > will send it to Jens after the initial merge window, after he
> > > rebases off of Linus.
> > 
> > Conflicts too hard to fixup at merge time ? Otherwise I could just
> > rebase on top of Jens and put in a topic branch...
> 
> The quirk enumeration conflicts with 5.3-rc. Not a big deal, just
> thought it'd be easier to handle that way.
> 
> Rebasing on top of Jens won't help because his for-5.4/block which
> does not have the rc code yet.

Ok sure, do as you prefer. Let me know if you want me to do anything.

Cheers,
Ben.

Re: [PATCH v4 2/4] nvme-pci: Add support for variable IO SQ element size

2019-08-22 Thread Benjamin Herrenschmidt

On Thu, 2019-08-22 at 11:02 -0700, Sagi Grimberg wrote:
> > > 
> I'll fix it. Note that I'm going to take it out of the tree soon
> because it will have conflicts with Jens for-5.4/block, so we
> will send it to Jens after the initial merge window, after he
> rebases off of Linus.

Conflicts too hard to fixup at merge time ? Otherwise I could just
rebase on top of Jens and put in a topic branch...

Whatever works for you.

Cheers,
Ben.

Re: [PATCH v4 2/4] nvme-pci: Add support for variable IO SQ element size

2019-08-21 Thread Benjamin Herrenschmidt

On Thu, 2019-08-22 at 02:28 +0200, Christoph Hellwig wrote:
> On Wed, Aug 07, 2019 at 05:51:20PM +1000, Benjamin Herrenschmidt
> wrote:
> > +#define NVME_NVM_ADMSQES   6
> >  #define NVME_NVM_IOSQES6
> >  #define NVME_NVM_IOCQES4
> 
> The NVM in the two defines here stands for the NVM command set,
> so this should just be named NVME_ADM_SQES or so.  But except for
> this
> the patch looks good:
> 
> Reviewed-by: Christoph Hellwig 
> 
> So maybe Sagi can just fix this up in the tree.

Ah ok I missed the meaning. Thanks. Sagi, can you fix that up or do you
need me to resubmit ?

Cheers,
Ben.

Re: [PATCH] powerpc/vdso32: inline __get_datapage()

2019-08-20 Thread Benjamin Herrenschmidt

On Fri, 2019-08-16 at 14:48 +, Christophe Leroy wrote:
> __get_datapage() is only a few instructions to retrieve the
> address of the page where the kernel stores data to the VDSO.
> 
> By inlining this function into its users, a bl/blr pair and
> a mflr/mtlr pair is avoided, plus a few reg moves.
> 
> The improvement is noticeable (about 55 nsec/call on an 8xx)

Interesting... would be worth doing the same on vdso64 no ?

> vdsotest before the patch:
> gettimeofday:vdso: 731 nsec/call
> clock-gettime-realtime-coarse:vdso: 668 nsec/call
> clock-gettime-monotonic-coarse:vdso: 745 nsec/call
> 
> vdsotest after the patch:
> gettimeofday:vdso: 677 nsec/call
> clock-gettime-realtime-coarse:vdso: 613 nsec/call
> clock-gettime-monotonic-coarse:vdso: 690 nsec/call
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/kernel/vdso32/cacheflush.S   | 10 +-
>  arch/powerpc/kernel/vdso32/datapage.S | 29 -
> 
>  arch/powerpc/kernel/vdso32/datapage.h | 12 
>  arch/powerpc/kernel/vdso32/gettimeofday.S | 11 +--
>  4 files changed, 26 insertions(+), 36 deletions(-)
>  create mode 100644 arch/powerpc/kernel/vdso32/datapage.h
> 
> diff --git a/arch/powerpc/kernel/vdso32/cacheflush.S
> b/arch/powerpc/kernel/vdso32/cacheflush.S
> index 7f882e7b9f43..e9453837e4ee 100644
> --- a/arch/powerpc/kernel/vdso32/cacheflush.S
> +++ b/arch/powerpc/kernel/vdso32/cacheflush.S
> @@ -10,6 +10,8 @@
>  #include 
>  #include 
>  
> +#include "datapage.h"
> +
>   .text
>  
>  /*
> @@ -24,14 +26,12 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>.cfi_startproc
>   mflrr12
>.cfi_register lr,r12
> - mr  r11,r3
> - bl  __get_datapage@local
> + get_datapager10, r0
>   mtlrr12
> - mr  r10,r3
>  
>   lwz r7,CFG_DCACHE_BLOCKSZ(r10)
>   addir5,r7,-1
> - andcr6,r11,r5   /* round low to line bdy */
> + andcr6,r3,r5/* round low to line bdy */
>   subfr8,r6,r4/* compute length */
>   add r8,r8,r5/* ensure we get enough */
>   lwz r9,CFG_DCACHE_LOGBLOCKSZ(r10)
> @@ -48,7 +48,7 @@ V_FUNCTION_BEGIN(__kernel_sync_dicache)
>  
>   lwz r7,CFG_ICACHE_BLOCKSZ(r10)
>   addir5,r7,-1
> - andcr6,r11,r5   /* round low to line bdy */
> + andcr6,r3,r5/* round low to line bdy */
>   subfr8,r6,r4/* compute length */
>   add r8,r8,r5
>   lwz r9,CFG_ICACHE_LOGBLOCKSZ(r10)
> diff --git a/arch/powerpc/kernel/vdso32/datapage.S
> b/arch/powerpc/kernel/vdso32/datapage.S
> index 6984125b9fc0..d480d2d4a3fe 100644
> --- a/arch/powerpc/kernel/vdso32/datapage.S
> +++ b/arch/powerpc/kernel/vdso32/datapage.S
> @@ -11,34 +11,13 @@
>  #include 
>  #include 
>  
> +#include "datapage.h"
> +
>   .text
>   .global __kernel_datapage_offset;
>  __kernel_datapage_offset:
>   .long   0
>  
> -V_FUNCTION_BEGIN(__get_datapage)
> -  .cfi_startproc
> - /* We don't want that exposed or overridable as we want other
> objects
> -  * to be able to bl directly to here
> -  */
> - .protected __get_datapage
> - .hidden __get_datapage
> -
> - mflrr0
> -  .cfi_register lr,r0
> -
> - bcl 20,31,data_page_branch
> -data_page_branch:
> - mflrr3
> - mtlrr0
> - addir3, r3, __kernel_datapage_offset-data_page_branch
> - lwz r0,0(r3)
> -  .cfi_restore lr
> - add r3,r0,r3
> - blr
> -  .cfi_endproc
> -V_FUNCTION_END(__get_datapage)
> -
>  /*
>   * void *__kernel_get_syscall_map(unsigned int *syscall_count) ;
>   *
> @@ -53,7 +32,7 @@ V_FUNCTION_BEGIN(__kernel_get_syscall_map)
>   mflrr12
>.cfi_register lr,r12
>   mr  r4,r3
> - bl  __get_datapage@local
> + get_datapager3, r0
>   mtlrr12
>   addir3,r3,CFG_SYSCALL_MAP32
>   cmpli   cr0,r4,0
> @@ -74,7 +53,7 @@ V_FUNCTION_BEGIN(__kernel_get_tbfreq)
>.cfi_startproc
>   mflrr12
>.cfi_register lr,r12
> - bl  __get_datapage@local
> + get_datapager3, r0
>   lwz r4,(CFG_TB_TICKS_PER_SEC + 4)(r3)
>   lwz r3,CFG_TB_TICKS_PER_SEC(r3)
>   mtlrr12
> diff --git a/arch/powerpc/kernel/vdso32/datapage.h
> b/arch/powerpc/kernel/vdso32/datapage.h
> new file mode 100644
> index ..ad96256be090
> --- /dev/null
> +++ b/arch/powerpc/kernel/vdso32/datapage.h
> @@ -0,0 +1,12 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +.macro get_datapage ptr, tmp
> + bcl 20,31,888f
> +888:
> + mflr\ptr
> + addi\ptr, \ptr, __kernel_datapage_offset - 888b
> + lwz \tmp, 0(\ptr)
> + add \ptr, \tmp, \ptr
> +.endm
> +
> +
> diff --git a/arch/powerpc/kernel/vdso32/gettimeofday.S
> b/arch/powerpc/kernel/vdso32/gettimeofday.S
> index e10098cde89c..91a58f01dcd5 100644
> ---

Re: [PATCH] fsi: scom: Don't abort operations for minor errors

2019-08-20 Thread Benjamin Herrenschmidt

On Thu, 2019-08-15 at 14:08 -0500, Eddie James wrote:
> The scom driver currently fails out of operations if certain system
> errors are flagged in the status register; system checkstop, special
> attention, or recoverable error. These errors won't impact the ability
> of the scom engine to perform operations, so the driver should continue
> under these conditions.
> Also, don't do a PIB reset for these conditions, since it won't help.
> 
> Signed-off-by: Eddie James 

Acked-by: Benjamin Herrenschmidt 

> ---
>  drivers/fsi/fsi-scom.c | 8 +---
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/drivers/fsi/fsi-scom.c b/drivers/fsi/fsi-scom.c
> index 343153d..004dc03 100644
> --- a/drivers/fsi/fsi-scom.c
> +++ b/drivers/fsi/fsi-scom.c
> @@ -38,8 +38,7 @@
>  #define SCOM_STATUS_PIB_RESP_MASK0x7000
>  #define SCOM_STATUS_PIB_RESP_SHIFT   12
>  
> -#define SCOM_STATUS_ANY_ERR  (SCOM_STATUS_ERR_SUMMARY | \
> -  SCOM_STATUS_PROTECTION | \
> +#define SCOM_STATUS_ANY_ERR  (SCOM_STATUS_PROTECTION | \
>SCOM_STATUS_PARITY | \
>SCOM_STATUS_PIB_ABORT | \
>SCOM_STATUS_PIB_RESP_MASK)
> @@ -251,11 +250,6 @@ static int handle_fsi2pib_status(struct scom_device 
> *scom, uint32_t status)
>   /* Return -EBUSY on PIB abort to force a retry */
>   if (status & SCOM_STATUS_PIB_ABORT)
>   return -EBUSY;
> - if (status & SCOM_STATUS_ERR_SUMMARY) {
> - fsi_device_write(scom->fsi_dev, SCOM_FSI2PIB_RESET_REG, ,
> -  sizeof(uint32_t));
> - return -EIO;
> - }
>   return 0;
>  }
>

Re: [PATCH v2 05/12] powerpc/mm: rework io-workaround invocation.

2019-08-20 Thread Benjamin Herrenschmidt

On Wed, 2019-08-21 at 00:28 +0200, Christoph Hellwig wrote:
> On Tue, Aug 20, 2019 at 02:07:13PM +, Christophe Leroy wrote:
> > ppc_md.ioremap() is only used for I/O workaround on CELL platform,
> > so indirect function call can be avoided.
> > 
> > This patch reworks the io-workaround and ioremap() functions to
> > use the global 'io_workaround_inited' flag for the activation
> > of io-workaround.
> > 
> > When CONFIG_PPC_IO_WORKAROUNDS or CONFIG_PPC_INDIRECT_MMIO are not
> > selected, the I/O workaround ioremap() voids and the global flag is
> > not used.
> 
> Note that CONFIG_PPC_IO_WORKAROUNDS is only selected by a specific cell
> config,  and CONFIG_PPC_INDIRECT_MMIO is always selected by cell, so
> I think we can make CONFIG_PPC_IO_WORKAROUNDS depend on
> CONFIG_PPC_INDIRECT_MMIO

Or we can deprecate that old platform... not sure anybody uses it
anymore (if anybody ever did).

Cheers,
ben.

[PATCH v4 1/4] nvme-pci: Pass the queue to SQ_SIZE/CQ_SIZE macros

2019-08-07 Thread Benjamin Herrenschmidt

This will make it easier to handle variable queue entry sizes
later. No functional change.

Signed-off-by: Benjamin Herrenschmidt 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Minwoo Im 
---
 drivers/nvme/host/pci.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 362a1a9ced36..b5b296984aa1 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,8 +28,8 @@
 #include "trace.h"
 #include "nvme.h"
 
-#define SQ_SIZE(depth) (depth * sizeof(struct nvme_command))
-#define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion))
+#define SQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_command))
+#define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion))
 
 #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
 
@@ -1344,16 +1344,16 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
 
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
-   dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq->q_depth),
+   dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
if (!nvmeq->sq_cmds)
return;
 
if (test_and_clear_bit(NVMEQ_SQ_CMB, >flags)) {
pci_free_p2pmem(to_pci_dev(nvmeq->dev->dev),
-   nvmeq->sq_cmds, SQ_SIZE(nvmeq->q_depth));
+   nvmeq->sq_cmds, SQ_SIZE(nvmeq));
} else {
-   dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq->q_depth),
+   dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq),
nvmeq->sq_cmds, nvmeq->sq_dma_addr);
}
 }
@@ -1433,12 +1433,12 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int 
nr_io_queues,
 }
 
 static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
-   int qid, int depth)
+   int qid)
 {
struct pci_dev *pdev = to_pci_dev(dev->dev);
 
if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
-   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(nvmeq));
if (nvmeq->sq_cmds) {
nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
nvmeq->sq_cmds);
@@ -1447,11 +1447,11 @@ static int nvme_alloc_sq_cmds(struct nvme_dev *dev, 
struct nvme_queue *nvmeq,
return 0;
}
 
-   pci_free_p2pmem(pdev, nvmeq->sq_cmds, SQ_SIZE(depth));
+   pci_free_p2pmem(pdev, nvmeq->sq_cmds, SQ_SIZE(nvmeq));
}
}
 
-   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(nvmeq),
>sq_dma_addr, GFP_KERNEL);
if (!nvmeq->sq_cmds)
return -ENOMEM;
@@ -1465,12 +1465,13 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
qid, int depth)
if (dev->ctrl.queue_count > qid)
return 0;
 
-   nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(depth),
+   nvmeq->q_depth = depth;
+   nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(nvmeq),
 >cq_dma_addr, GFP_KERNEL);
if (!nvmeq->cqes)
goto free_nvmeq;
 
-   if (nvme_alloc_sq_cmds(dev, nvmeq, qid, depth))
+   if (nvme_alloc_sq_cmds(dev, nvmeq, qid))
goto free_cqdma;
 
nvmeq->dev = dev;
@@ -1479,15 +1480,14 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
qid, int depth)
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = >dbs[qid * 2 * dev->db_stride];
-   nvmeq->q_depth = depth;
nvmeq->qid = qid;
dev->ctrl.queue_count++;
 
return 0;
 
  free_cqdma:
-   dma_free_coherent(dev->dev, CQ_SIZE(depth), (void *)nvmeq->cqes,
-   nvmeq->cq_dma_addr);
+   dma_free_coherent(dev->dev, CQ_SIZE(nvmeq), (void *)nvmeq->cqes,
+ nvmeq->cq_dma_addr);
  free_nvmeq:
return -ENOMEM;
 }
@@ -1515,7 +1515,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 
qid)
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = >dbs[qid * 2 * dev->db_stride];
-   memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
+   memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq));
nvme_dbbuf_init(dev, nvmeq, qid);
dev->online_queues++;
wmb(); /* ensure the first interrupt sees the initialization */
-- 
2.17.1

[PATCH v4 0/4] nvme-pci: Support for Apple 201+ (T2 chip)

2019-08-07 Thread Benjamin Herrenschmidt

This series combines the original series and an updated version of the
shared tags patch, and is rebased on nvme-5.4.

This adds support for the controller found in recent Apple machines
which is basically a SW emulated NVME controller in the T2 chip.

The original reverse engineering work was done by
Paul Pawlowski .

[PATCH v4 4/4] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-08-07 Thread Benjamin Herrenschmidt

Another issue with the Apple T2 based 2018 controllers seem to be
that they blow up (and shut the machine down) if there's a tag
collision between the IO queue and the Admin queue.

My suspicion is that they use our tags for their internal tracking
and don't mix them with the queue id. They also seem to not like
when tags go beyond the IO queue depth, ie 128 tags.

This adds a quirk that marks tags 0..31 of the IO queue reserved

Signed-off-by: Benjamin Herrenschmidt 
Reviewed-by: Ming Lei 
Acked-by: Keith Busch 
---
 drivers/nvme/host/nvme.h |  5 +
 drivers/nvme/host/pci.c  | 31 ++-
 2 files changed, 35 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 0925f7fc13ff..3e64f7187e70 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -102,6 +102,11 @@ enum nvme_quirks {
 * Use non-standard 128 bytes SQEs.
 */
NVME_QUIRK_128_BYTES_SQES   = (1 << 11),
+
+   /*
+* Prevent tag overlap between queues
+*/
+   NVME_QUIRK_SHARED_TAGS  = (1 << 12),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index c683263cdf60..de8c170d5abc 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2106,6 +2106,14 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
unsigned long size;
 
nr_io_queues = max_io_queues();
+
+   /*
+* If tags are shared with admin queue (Apple bug), then
+* make sure we only use one IO queue.
+*/
+   if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
+   nr_io_queues = 1;
+
result = nvme_set_queue_count(>ctrl, _io_queues);
if (result < 0)
return result;
@@ -2276,6 +2284,14 @@ static int nvme_dev_add(struct nvme_dev *dev)
dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
dev->tagset.driver_data = dev;
 
+   /*
+* Some Apple controllers requires tags to be unique
+* across admin and IO queue, so reserve the first 32
+* tags of the IO queue.
+*/
+   if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
+   dev->tagset.reserved_tags = NVME_AQ_DEPTH;
+
ret = blk_mq_alloc_tag_set(>tagset);
if (ret) {
dev_warn(dev->ctrl.device,
@@ -2356,6 +2372,18 @@ static int nvme_pci_enable(struct nvme_dev *dev)
 "set queue depth=%u\n", dev->q_depth);
}
 
+   /*
+* Controllers with the shared tags quirk need the IO queue to be
+* big enough so that we get 32 tags for the admin queue
+*/
+   if ((dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) &&
+   (dev->q_depth < (NVME_AQ_DEPTH + 2))) {
+   dev->q_depth = NVME_AQ_DEPTH + 2;
+   dev_warn(dev->ctrl.device, "IO queue depth clamped to %d\n",
+dev->q_depth);
+   }
+
+
nvme_map_cmb(dev);
 
pci_enable_pcie_error_reporting(pdev);
@@ -3057,7 +3085,8 @@ static const struct pci_device_id nvme_id_table[] = {
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2005),
.driver_data = NVME_QUIRK_SINGLE_VECTOR |
-   NVME_QUIRK_128_BYTES_SQES },
+   NVME_QUIRK_128_BYTES_SQES |
+   NVME_QUIRK_SHARED_TAGS },
{ 0, }
 };
 MODULE_DEVICE_TABLE(pci, nvme_id_table);
-- 
2.17.1

[PATCH v4 2/4] nvme-pci: Add support for variable IO SQ element size

2019-08-07 Thread Benjamin Herrenschmidt

The size of a submission queue element should always be 6 (64 bytes)
by spec.

However some controllers such as Apple's are not properly implementing
the standard and require a different size.

This provides the ground work for the subsequent quirks for these
controllers.

Signed-off-by: Benjamin Herrenschmidt 
Reviewed-by: Minwoo Im 
---
 drivers/nvme/host/pci.c | 11 ---
 include/linux/nvme.h|  1 +
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b5b296984aa1..78a660e229d9 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,7 +28,7 @@
 #include "trace.h"
 #include "nvme.h"
 
-#define SQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_command))
+#define SQ_SIZE(q) ((q)->q_depth << (q)->sqes)
 #define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion))
 
 #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
@@ -100,6 +100,7 @@ struct nvme_dev {
unsigned io_queues[HCTX_MAX_TYPES];
unsigned int num_vecs;
int q_depth;
+   int io_sqes;
u32 db_stride;
void __iomem *bar;
unsigned long bar_mapped_size;
@@ -162,7 +163,7 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl 
*ctrl)
 struct nvme_queue {
struct nvme_dev *dev;
spinlock_t sq_lock;
-   struct nvme_command *sq_cmds;
+   void *sq_cmds;
 /* only used for poll queues: */
spinlock_t cq_poll_lock cacheline_aligned_in_smp;
volatile struct nvme_completion *cqes;
@@ -178,6 +179,7 @@ struct nvme_queue {
u16 last_cq_head;
u16 qid;
u8 cq_phase;
+   u8 sqes;
unsigned long flags;
 #define NVMEQ_ENABLED  0
 #define NVMEQ_SQ_CMB   1
@@ -488,7 +490,8 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, 
struct nvme_command *cmd,
bool write_sq)
 {
spin_lock(>sq_lock);
-   memcpy(>sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd));
+   memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
+  cmd, sizeof(*cmd));
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
nvme_write_sq_db(nvmeq, write_sq);
@@ -1465,6 +1468,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
qid, int depth)
if (dev->ctrl.queue_count > qid)
return 0;
 
+   nvmeq->sqes = qid ? dev->io_sqes : NVME_NVM_ADMSQES;
nvmeq->q_depth = depth;
nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(nvmeq),
 >cq_dma_addr, GFP_KERNEL);
@@ -2317,6 +2321,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
dev->dbs = dev->bar + 4096;
+   dev->io_sqes = NVME_NVM_IOSQES;
 
/*
 * Temporary fix for the Apple controller found in the MacBook8,1 and
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 01aa6a6c241d..d5a4bc21f36b 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -140,6 +140,7 @@ enum {
  * Submission and Completion Queue Entry Sizes for the NVM command set.
  * (In bytes and specified as a power of two (2^n)).
  */
+#define NVME_NVM_ADMSQES   6
 #define NVME_NVM_IOSQES6
 #define NVME_NVM_IOCQES4
 
-- 
2.17.1

[PATCH v4 3/4] nvme-pci: Add support for Apple 2018+ models

2019-08-07 Thread Benjamin Herrenschmidt

Based on reverse engineering and original patch by

Paul Pawlowski 

This adds support for Apple weird implementation of NVME in their
2018 or later machines. It accounts for the twice-as-big SQ entries
for the IO queues, and the fact that only interrupt vector 0 appears
to function properly.

Signed-off-by: Benjamin Herrenschmidt 
Reviewed-by: Minwoo Im 
---
 drivers/nvme/host/nvme.h | 10 ++
 drivers/nvme/host/pci.c  | 21 -
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 8dc010ca30e5..0925f7fc13ff 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -92,6 +92,16 @@ enum nvme_quirks {
 * Broken Write Zeroes.
 */
NVME_QUIRK_DISABLE_WRITE_ZEROES = (1 << 9),
+
+   /*
+* Use only one interrupt vector for all queues
+*/
+   NVME_QUIRK_SINGLE_VECTOR= (1 << 10),
+
+   /*
+* Use non-standard 128 bytes SQEs.
+*/
+   NVME_QUIRK_128_BYTES_SQES   = (1 << 11),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 78a660e229d9..c683263cdf60 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2081,6 +2081,13 @@ static int nvme_setup_irqs(struct nvme_dev *dev, 
unsigned int nr_io_queues)
dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
dev->io_queues[HCTX_TYPE_READ] = 0;
 
+   /*
+* Some Apple controllers require all queues to use the
+* first vector.
+*/
+   if (dev->ctrl.quirks & NVME_QUIRK_SINGLE_VECTOR)
+   irq_queues = 1;
+
return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
  PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, );
 }
@@ -2321,7 +2328,16 @@ static int nvme_pci_enable(struct nvme_dev *dev)
dev->ctrl.sqsize = dev->q_depth - 1; /* 0's based queue depth */
dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
dev->dbs = dev->bar + 4096;
-   dev->io_sqes = NVME_NVM_IOSQES;
+
+   /*
+* Some Apple controllers require a non-standard SQE size.
+* Interestingly they also seem to ignore the CC:IOSQES register
+* so we don't bother updating it here.
+*/
+   if (dev->ctrl.quirks & NVME_QUIRK_128_BYTES_SQES)
+   dev->io_sqes = 7;
+   else
+   dev->io_sqes = NVME_NVM_IOSQES;
 
/*
 * Temporary fix for the Apple controller found in the MacBook8,1 and
@@ -3039,6 +3055,9 @@ static const struct pci_device_id nvme_id_table[] = {
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
+   { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2005),
+   .driver_data = NVME_QUIRK_SINGLE_VECTOR |
+   NVME_QUIRK_128_BYTES_SQES },
{ 0, }
 };
 MODULE_DEVICE_TABLE(pci, nvme_id_table);
-- 
2.17.1

Re: [PATCH v3] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-08-05 Thread Benjamin Herrenschmidt

On Mon, 2019-08-05 at 13:07 -0700, Sagi Grimberg wrote:
> > > > > > Ping ? I had another look today and I don't feel like mucking around
> > > > > > with all the AQ size logic, AEN magic tag etc... just for that sake 
> > > > > > of
> > > > > > that Apple gunk. I'm happy to have it give up IO tags, it doesn't 
> > > > > > seem
> > > > > > to make much of a difference in practice anyway.
> > > > > > 
> > > > > > But if you feel strongly about it, then I'll implement the "proper" 
> > > > > > way
> > > > > > sometimes this week, adding a way to shrink the AQ down to something
> > > > > > like 3 (one admin request, one async event (AEN), and the empty 
> > > > > > slot)
> > > > > > by making a bunch of the constants involved variables instead.
> > > > > 
> > > > > I don't feel too strongly about it. I think your patch is fine, so
> > > > > 
> > > > > Acked-by: Keith Busch 
> > > > 
> > > > Should we pick this up for 5.3-rc?
> > > 
> > > No, it's not a regression fix. Queue it up for 5.4 instead.
> > 
> > OK, will queue it up for nvme-5.4
> 
> Doesn't apply..
> 
> Ben, can you please respin a patch that applies on nvme-5.4?
> 
> http://git.infradead.org/nvme.git/shortlog/refs/heads/nvme-5.4

Sure, will do in the next couple of days !

Cheers,
Ben.

Re: [PATCH v3] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-08-05 Thread Benjamin Herrenschmidt

On Mon, 2019-08-05 at 11:27 -0700, Sagi Grimberg wrote:
> > > Ping ? I had another look today and I don't feel like mucking
> > > around
> > > with all the AQ size logic, AEN magic tag etc... just for that
> > > sake of
> > > that Apple gunk. I'm happy to have it give up IO tags, it doesn't
> > > seem
> > > to make much of a difference in practice anyway.
> > > 
> > > But if you feel strongly about it, then I'll implement the
> > > "proper" way
> > > sometimes this week, adding a way to shrink the AQ down to
> > > something
> > > like 3 (one admin request, one async event (AEN), and the empty
> > > slot)
> > > by making a bunch of the constants involved variables instead.
> > 
> > I don't feel too strongly about it. I think your patch is fine, so
> > 
> > Acked-by: Keith Busch 
> 
> Should we pick this up for 5.3-rc?

Well, if you are talking about just this patch, it's not needed unless
you also merge the rest of the Apple T2 support patches, which some
people I'm sure would like but I think Christoph is a bit cold feet
about (I don't care either way).

If you are talking about the whole series, give me a day or two to
respin with the added check that Keith requested (I believe his ack is
for a respin with the check so setting the IO queue too small doesn't
kill the driver).

Cheers,
Ben.

Re: [PATCH v3] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-08-05 Thread Benjamin Herrenschmidt

On Mon, 2019-08-05 at 07:49 -0600, Keith Busch wrote:
> > Ping ? I had another look today and I don't feel like mucking around
> > with all the AQ size logic, AEN magic tag etc... just for that sake of
> > that Apple gunk. I'm happy to have it give up IO tags, it doesn't seem
> > to make much of a difference in practice anyway.
> > 
> > But if you feel strongly about it, then I'll implement the "proper" way
> > sometimes this week, adding a way to shrink the AQ down to something
> > like 3 (one admin request, one async event (AEN), and the empty slot)
> > by making a bunch of the constants involved variables instead.
> 
> I don't feel too strongly about it. I think your patch is fine, so
> 
> Acked-by: Keith Busch 

Thanks, I'll fold the test and respin this week.

Cheers,
Ben.

Re: WARNING in kernfs_new_node

2019-08-05 Thread Benjamin Herrenschmidt

On Mon, 2019-08-05 at 05:38 -0700, syzbot wrote:
> Hello,
> 
> syzbot found the following crash on:

A fix for that was submitted recently by Muchun Song  and should be queued up by Greg.

Cheers,
Ben.

> HEAD commit:1e78030e Merge tag 'mmc-v5.3-rc1' of
> git://git.kernel.org/..
> git tree:   upstream
> console output: 
> https://syzkaller.appspot.com/x/log.txt?x=12aab5cc60
> kernel config:  
> https://syzkaller.appspot.com/x/.config?x=e397351d2615e10
> dashboard link: 
> https://syzkaller.appspot.com/bug?extid=499aea72daa2cea73cb7
> compiler:   clang version 9.0.0 (/home/glider/llvm/clang  
> 80fee25776c2fb61e74c1ecb1a523375c2500b69)
> syz repro:  
> https://syzkaller.appspot.com/x/repro.syz?x=111d80fc60
> C reproducer:   
> https://syzkaller.appspot.com/x/repro.c?x=14d9060c60
> 
> The bug was bisected to:
> 
> commit 726e41097920a73e4c7c33385dcc0debb1281e18
> Author: Benjamin Herrenschmidt 
> Date:   Tue Jul 10 00:29:10 2018 +
> 
>  drivers: core: Remove glue dirs from sysfs earlier
> 
> bisection log:  
> https://syzkaller.appspot.com/x/bisect.txt?x=138b41ec60
> final crash:
> https://syzkaller.appspot.com/x/report.txt?x=104b41ec60
> console output: 
> https://syzkaller.appspot.com/x/log.txt?x=178b41ec60
> 
> IMPORTANT: if you fix the bug, please add the following tag to the
> commit:
> Reported-by: syzbot+499aea72daa2cea73...@syzkaller.appspotmail.com
> Fixes: 726e41097920 ("drivers: core: Remove glue dirs from sysfs
> earlier")
> 
> debugfs: Directory 'hci3' with parent 'bluetooth' already present!
> [ cut here ]
> WARNING: CPU: 0 PID: 9903 at fs/kernfs/dir.c:493 kernfs_get  
> fs/kernfs/dir.c:493 [inline]
> WARNING: CPU: 0 PID: 9903 at fs/kernfs/dir.c:493  
> kernfs_new_node+0x155/0x180 fs/kernfs/dir.c:700
> Kernel panic - not syncing: panic_on_warn set ...
> CPU: 0 PID: 9903 Comm: syz-executor126 Not tainted 5.3.0-rc2+ #59
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS  
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x1d8/0x2f8 lib/dump_stack.c:113
>   panic+0x29b/0x7d9 kernel/panic.c:219
>   __warn+0x22f/0x230 kernel/panic.c:576
>   report_bug+0x190/0x290 lib/bug.c:186
>   fixup_bug arch/x86/kernel/traps.c:179 [inline]
>   do_error_trap+0xd7/0x440 arch/x86/kernel/traps.c:272
>   do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:291
>   invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1026
> RIP: 0010:kernfs_get fs/kernfs/dir.c:493 [inline]
> RIP: 0010:kernfs_new_node+0x155/0x180 fs/kernfs/dir.c:700
> Code: d2 ff 4c 89 23 4c 89 f0 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f
> 5d c3  
> e8 69 47 98 ff 48 c7 c7 6d b9 7a 88 31 c0 e8 9e a6 80 ff <0f> 0b eb
> 8e 44  
> 89 e1 80 e1 07 80 c1 03 38 c1 0f 8c 67 ff ff ff 4c
> RSP: 0018:88808d96f6f0 EFLAGS: 00010246
> RAX: 0024 RBX:  RCX: d31929014df6e300
> RDX:  RSI: 8000 RDI: 
> RBP: 88808d96f728 R08: 816068e4 R09: fbfff11fbdfd
> R10: fbfff11fbdfd R11:  R12: 88809377b620
> R13: 8880982858c0 R14: 888095d39460 R15: 88821b70d150
>   kernfs_create_dir_ns+0x44/0x130 fs/kernfs/dir.c:1022
>   sysfs_create_dir_ns+0x161/0x310 fs/sysfs/dir.c:59
>   create_dir lib/kobject.c:89 [inline]
>   kobject_add_internal+0x459/0xd50 lib/kobject.c:255
>   kobject_add_varg lib/kobject.c:390 [inline]
>   kobject_add+0x138/0x200 lib/kobject.c:442
>   device_add+0x508/0x1570 drivers/base/core.c:2065
>   hci_register_dev+0x331/0x720 net/bluetooth/hci_core.c:3307
>   __vhci_create_device drivers/bluetooth/hci_vhci.c:124 [inline]
>   vhci_create_device+0x2f3/0x530 drivers/bluetooth/hci_vhci.c:148
>   vhci_get_user drivers/bluetooth/hci_vhci.c:204 [inline]
>   vhci_write+0x3ac/0x440 drivers/bluetooth/hci_vhci.c:284
>   call_write_iter include/linux/fs.h:1870 [inline]
>   new_sync_write fs/read_write.c:483 [inline]
>   __vfs_write+0x617/0x7d0 fs/read_write.c:496
>   vfs_write+0x275/0x590 fs/read_write.c:558
>   ksys_write+0x16b/0x2a0 fs/read_write.c:611
>   __do_sys_write fs/read_write.c:623 [inline]
>   __se_sys_write fs/read_write.c:620 [inline]
>   __x64_sys_write+0x7b/0x90 fs/read_write.c:620
>   do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:296
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x441d39
> Code: e8 4c e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48
> 89 f7  
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
> f0 ff  
> ff 0f 83 db 07 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7ffd7e3026c8 EFLAGS: 0246 ORIG_RAX:
>

Re: [PATCH v3] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-08-05 Thread Benjamin Herrenschmidt

On Tue, 2019-07-30 at 13:28 -0700, Benjamin Herrenschmidt wrote:
> > One problem is that we've an nvme parameter, io_queue_depth, that a user
> > could set to something less than 32, and then you won't be able to do
> > any IO. I'd recommend enforce the admin queue to QD1 for this device so
> > that you have more potential IO tags.
> 
> So I had a look and it's not that trivial. I would have to change
> a few things that use constants for the admin queue depth, such as
> the AEN tag etc...
> 
> For such a special case, I am tempted instead to do the much simpler:
> 
> if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) {
> if (dev->q_depth < (NVME_AQ_DEPTH + 2))
> dev->q_depth = NVME_AQ_DEPTH + 2;
> }
> 
> In nvme_pci_enable() next to the existing q_depth hackery for other
> controllers.
> 
> Thoughts ?

Ping ? I had another look today and I don't feel like mucking around
with all the AQ size logic, AEN magic tag etc... just for that sake of
that Apple gunk. I'm happy to have it give up IO tags, it doesn't seem
to make much of a difference in practice anyway.

But if you feel strongly about it, then I'll implement the "proper" way
sometimes this week, adding a way to shrink the AQ down to something
like 3 (one admin request, one async event (AEN), and the empty slot)
by making a bunch of the constants involved variables instead.

This leas to a question: Wouldn't be be nicer/cleaner to make AEN be
tag 0 of the AQ ? That way we just include it as reserved tag ? Not a
huge different from what we do now, just a thought.

Cheers,
Ben.

Re: [PATCH] drivers/macintosh/smu.c: Mark expected switch fall-through

2019-07-30 Thread Benjamin Herrenschmidt

On Tue, 2019-07-30 at 10:07 -0700, Kees Cook wrote:
> 
> > Why do we think it's an expected fall through? I can't really
> > convince
> > myself from the surrounding code that it's definitely intentional.
> 
> Yeah, good question. Just now when I went looking for who
> used SMU_I2C_TRANSFER_COMBINED, I found the only caller in
> arch/powerpc/platforms/powermac/low_i2c.c and it is clearly using a
> fall-through for building the command for "stdsub" and "combined",
> so I think that's justification enough:

Yes, sorry for the delay, the fall through is intentional.

Cheers,
Ben.

Re: [PATCH v3] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-30 Thread Benjamin Herrenschmidt

On Tue, 2019-07-30 at 09:30 -0600, Keith Busch wrote:
> On Fri, Jul 19, 2019 at 03:31:02PM +1000, Benjamin Herrenschmidt wrote:
> > From 8dcba2ef5b1466b023b88b4eca463b30de78d9eb Mon Sep 17 00:00:00 2001
> > From: Benjamin Herrenschmidt 
> > Date: Fri, 19 Jul 2019 15:03:06 +1000
> > Subject: 
> > 
> > Another issue with the Apple T2 based 2018 controllers seem to be
> > that they blow up (and shut the machine down) if there's a tag
> > collision between the IO queue and the Admin queue.
> > 
> > My suspicion is that they use our tags for their internal tracking
> > and don't mix them with the queue id. They also seem to not like
> > when tags go beyond the IO queue depth, ie 128 tags.
> > 
> > This adds a quirk that marks tags 0..31 of the IO queue reserved
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> 
> One problem is that we've an nvme parameter, io_queue_depth, that a user
> could set to something less than 32, and then you won't be able to do
> any IO. I'd recommend enforce the admin queue to QD1 for this device so
> that you have more potential IO tags.

So I had a look and it's not that trivial. I would have to change
a few things that use constants for the admin queue depth, such as
the AEN tag etc...

For such a special case, I am tempted instead to do the much simpler:

if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS) {
if (dev->q_depth < (NVME_AQ_DEPTH + 2))
dev->q_depth = NVME_AQ_DEPTH + 2;
}

In nvme_pci_enable() next to the existing q_depth hackery for other
controllers.

Thoughts ?

Cheers,
Ben.

Re: [PATCH v3] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-30 Thread Benjamin Herrenschmidt

On Tue, 2019-07-30 at 09:30 -0600, Keith Busch wrote:
> On Fri, Jul 19, 2019 at 03:31:02PM +1000, Benjamin Herrenschmidt wrote:
> > From 8dcba2ef5b1466b023b88b4eca463b30de78d9eb Mon Sep 17 00:00:00 2001
> > From: Benjamin Herrenschmidt 
> > Date: Fri, 19 Jul 2019 15:03:06 +1000
> > Subject: 
> > 
> > Another issue with the Apple T2 based 2018 controllers seem to be
> > that they blow up (and shut the machine down) if there's a tag
> > collision between the IO queue and the Admin queue.
> > 
> > My suspicion is that they use our tags for their internal tracking
> > and don't mix them with the queue id. They also seem to not like
> > when tags go beyond the IO queue depth, ie 128 tags.
> > 
> > This adds a quirk that marks tags 0..31 of the IO queue reserved
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> 
> One problem is that we've an nvme parameter, io_queue_depth, that a user
> could set to something less than 32, and then you won't be able to do
> any IO. I'd recommend enforce the admin queue to QD1 for this device so
> that you have more potential IO tags.

Makes sense, I don't think we care much about the number of admin tags 
on these devices.

I'm travelling, not sure I'll be able to respin & test before next
week.

Thanks.

Cheers,
Ben.

Re: [PATCH v3] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-22 Thread Benjamin Herrenschmidt

On Fri, 2019-07-19 at 23:51 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2019-07-19 at 14:28 +0200, Christoph Hellwig wrote:
> > Yikes, that things looks worse and worse.  I think at this point
> > we'll
> > have to defer the support to 5.4 unfortunately as it is getting more
> > and more involved..
> 
> Well, at least v3 of that patch, thanks to Damien's idea, isn't
> particularly invasive and I've hammered the SSD with it over night with
> a combination of IOs and smart commands, it's solid.
> 
> But if you prefer waiting for 5.4, no worries.

BTW. It's been solid for 3 days now, so I think that was the last of it
(famous last words...)

Cheers,
Ben.

Re: [PATCH v2 6/8] PCI: al: Add support for DW based driver type

2019-07-21 Thread Benjamin Herrenschmidt

On Sun, 2019-07-21 at 15:08 +, Chocron, Jonathan wrote:
> > Please sort the variables following the reverse tree order.
> > 
> 
> Done. 
> 
> I'd think that it would make sense to group variables which have a
> common characteristic (e.g. resources read from the DT), even if it
> mildly breaks the convention (as long as the general frame is longest
> to shortest). Does this sound ok?
> 
> BTW, I couldn't find any documentation regarding the reverse-tree
> convention, do you have a pointer to some?

It's a complete stupidity that people who have nothing better to
comment about keep throwing at you when you are trying to get stuff
working.

Yes, we should avoid ugly stuff such as

int rc;
struct something_very_long foo;

But this definitely should come second to saner ordering such as the
one you propose of grouping related things together. At least IMHO.

You'll notice that the kernel these days attract way more people
interested in commenting to death on various minor points of coding
style than actual worthwhile technical comments on the code.

Cheers,
Ben.

Re: [PATCH v3] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-19 Thread Benjamin Herrenschmidt

On Fri, 2019-07-19 at 14:28 +0200, Christoph Hellwig wrote:
> Yikes, that things looks worse and worse.  I think at this point
> we'll
> have to defer the support to 5.4 unfortunately as it is getting more
> and more involved..

Well, at least v3 of that patch, thanks to Damien's idea, isn't
particularly invasive and I've hammered the SSD with it over night with
a combination of IOs and smart commands, it's solid.

But if you prefer waiting for 5.4, no worries.

Cheers,
Ben.

[PATCH v3] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-18 Thread Benjamin Herrenschmidt

>From 8dcba2ef5b1466b023b88b4eca463b30de78d9eb Mon Sep 17 00:00:00 2001
From: Benjamin Herrenschmidt 
Date: Fri, 19 Jul 2019 15:03:06 +1000
Subject: 

Another issue with the Apple T2 based 2018 controllers seem to be
that they blow up (and shut the machine down) if there's a tag
collision between the IO queue and the Admin queue.

My suspicion is that they use our tags for their internal tracking
and don't mix them with the queue id. They also seem to not like
when tags go beyond the IO queue depth, ie 128 tags.

This adds a quirk that marks tags 0..31 of the IO queue reserved

Signed-off-by: Benjamin Herrenschmidt 
---

Thanks Damien, reserved tags work and make this a lot simpler !

 drivers/nvme/host/nvme.h |  5 +
 drivers/nvme/host/pci.c  | 19 ++-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ced0e0a7e039..8732da6df555 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -102,6 +102,11 @@ enum nvme_quirks {
 * Use non-standard 128 bytes SQEs.
 */
NVME_QUIRK_128_BYTES_SQES   = (1 << 11),
+
+   /*
+* Prevent tag overlap between queues
+*/
+   NVME_QUIRK_SHARED_TAGS  = (1 << 12),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7088971d4c42..fc74395a028b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2106,6 +2106,14 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
unsigned long size;
 
nr_io_queues = max_io_queues();
+
+   /*
+* If tags are shared with admin queue (Apple bug), then
+* make sure we only use one IO queue.
+*/
+   if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
+   nr_io_queues = 1;
+
result = nvme_set_queue_count(>ctrl, _io_queues);
if (result < 0)
return result;
@@ -2278,6 +2286,14 @@ static int nvme_dev_add(struct nvme_dev *dev)
dev->tagset.flags = BLK_MQ_F_SHOULD_MERGE;
dev->tagset.driver_data = dev;
 
+   /*
+* Some Apple controllers requires tags to be unique
+* across admin and IO queue, so reserve the first 32
+* tags of the IO queue.
+*/
+   if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
+   dev->tagset.reserved_tags = NVME_AQ_DEPTH;
+
ret = blk_mq_alloc_tag_set(>tagset);
if (ret) {
dev_warn(dev->ctrl.device,
@@ -3057,7 +3073,8 @@ static const struct pci_device_id nvme_id_table[] = {
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2005),
.driver_data = NVME_QUIRK_SINGLE_VECTOR |
-   NVME_QUIRK_128_BYTES_SQES },
+   NVME_QUIRK_128_BYTES_SQES |
+   NVME_QUIRK_SHARED_TAGS },
{ 0, }
 };
 MODULE_DEVICE_TABLE(pci, nvme_id_table);

Re: [PATCH v2] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-18 Thread Benjamin Herrenschmidt

On Fri, 2019-07-19 at 05:01 +, Damien Le Moal wrote:
> > I suppose that would work and be simpler. I honestly don't know much
> > about the block layer and tag allocation so I stayed away from it :-)
> > 
> > I'll dig, but a hint would be welcome :)
> 
> Uuuh.. Never played with the tag management code directly myself either. A 
> quick
> look seem to indicate that blk_mq_get/put_tag() is what you should be using. 
> But
> further looking, struct blk_mq_tags has the field nr_reserved_tags which is 
> used
> as an offset start point for searching free tags, which is exactly what you
> would need.

Yup. I was getting there, it's just that we use the tagset mess which I
had to untangle a bit first :-)

Cheers,
Ben.

Re: [PATCH v2] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-18 Thread Benjamin Herrenschmidt

On Fri, 2019-07-19 at 14:49 +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2019-07-19 at 04:43 +, Damien Le Moal wrote:
> > On 2019/07/19 13:37, Benjamin Herrenschmidt wrote:
> > > Another issue with the Apple T2 based 2018 controllers seem to be
> > > that they blow up (and shut the machine down) if there's a tag
> > > collision between the IO queue and the Admin queue.
> > > 
> > > My suspicion is that they use our tags for their internal
> > > tracking
> > > and don't mix them with the queue id. They also seem to not like
> > > when tags go beyond the IO queue depth, ie 128 tags.
> > > 
> > > This adds a quirk that offsets all the tags in the IO queue by 32
> > > to avoid those collisions. It also limits the number of IO queues
> > > to 1 since the code wouldn't otherwise make sense (the device
> > > supports only one queue anyway but better safe than sorry) and
> > > reduces the size of the IO queue
> > 
> > What about keeping the IO queue QD at 128, but marking the first 32
> > tags as
> > "allocated" when the device is initialized ? This way, these tags
> > numbers are
> > never used for regular IOs and you can avoid the entire tag +/-
> > offset dance at
> > runtime. The admin queue uses tags 0-31 and the IO queue uses tags
> > 32-127. No ?
> 
> I suppose that would work and be simpler. I honestly don't know much
> about the block layer and tag allocation so I stayed away from it :-)
> 
> I'll dig, but a hint would be welcome :)

Hrm .. tagset->reserved_tags ?

> Cheers,
> Ben.
> 
> > > 
> > > Signed-off-by: Benjamin Herrenschmidt 
> > > ---
> > > 
> > > Note: One thing I noticed is how we have nvme_completion as
> > > volatile.
> > > 
> > > I don't think we really need that, it's forcing the compiler to
> > > constantly
> > > reload things which makes no sense once we have established that
> > > an
> > > entry is valid.
> > > 
> > > And since we have a data & control dependency from
> > > nvme_cqe_pending(),
> > > we know that reading the CQE is going to depend on it being
> > > valid. I
> > > don't really see what volatile is buying us here other than cargo
> > > culting.
> > > 
> > > Cheers,
> > > Ben.
> > > 
> > >  drivers/nvme/host/nvme.h |  5 
> > >  drivers/nvme/host/pci.c  | 52 +-
> > > --
> > >  2 files changed, 49 insertions(+), 8 deletions(-)
> > > 
> > > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > > index ced0e0a7e039..7c6de398de7d 100644
> > > --- a/drivers/nvme/host/nvme.h
> > > +++ b/drivers/nvme/host/nvme.h
> > > @@ -102,6 +102,11 @@ enum nvme_quirks {
> > >* Use non-standard 128 bytes SQEs.
> > >*/
> > >   NVME_QUIRK_128_BYTES_SQES   = (1 << 11),
> > > +
> > > + /*
> > > +  * Prevent tag overlap between queues
> > > +  */
> > > + NVME_QUIRK_SHARED_TAGS  = (1 << 12),
> > >  };
> > >  
> > >  /*
> > > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > > index 7088971d4c42..c38e946ad8ca 100644
> > > --- a/drivers/nvme/host/pci.c
> > > +++ b/drivers/nvme/host/pci.c
> > > @@ -178,6 +178,7 @@ struct nvme_queue {
> > >   u16 cq_head;
> > >   u16 last_cq_head;
> > >   u16 qid;
> > > + u16 tag_offset;
> > >   u8 cq_phase;
> > >   u8 sqes;
> > >   unsigned long flags;
> > > @@ -490,6 +491,7 @@ static void nvme_submit_cmd(struct nvme_queue
> > > *nvmeq, struct nvme_command *cmd,
> > >   bool write_sq)
> > >  {
> > >   spin_lock(>sq_lock);
> > > + cmd->common.command_id += nvmeq->tag_offset;
> > >   memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
> > >  cmd, sizeof(*cmd));
> > >   if (++nvmeq->sq_tail == nvmeq->q_depth)
> > > @@ -951,9 +953,10 @@ static inline void
> > > nvme_ring_cq_doorbell(struct nvme_queue *nvmeq)
> > >  static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16
> > > idx)
> > >  {
> > >   volatile struct nvme_completion *cqe = >cqes[idx];
> > > + u16 ctag = cqe->command_id - nvmeq->tag_offset;
> > >   struct request *req;
> > >  
> > > - if (unlikely(cqe

Re: [PATCH v2] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-18 Thread Benjamin Herrenschmidt

On Fri, 2019-07-19 at 04:43 +, Damien Le Moal wrote:
> On 2019/07/19 13:37, Benjamin Herrenschmidt wrote:
> > Another issue with the Apple T2 based 2018 controllers seem to be
> > that they blow up (and shut the machine down) if there's a tag
> > collision between the IO queue and the Admin queue.
> > 
> > My suspicion is that they use our tags for their internal tracking
> > and don't mix them with the queue id. They also seem to not like
> > when tags go beyond the IO queue depth, ie 128 tags.
> > 
> > This adds a quirk that offsets all the tags in the IO queue by 32
> > to avoid those collisions. It also limits the number of IO queues
> > to 1 since the code wouldn't otherwise make sense (the device
> > supports only one queue anyway but better safe than sorry) and
> > reduces the size of the IO queue
> 
> What about keeping the IO queue QD at 128, but marking the first 32 tags as
> "allocated" when the device is initialized ? This way, these tags numbers are
> never used for regular IOs and you can avoid the entire tag +/- offset dance 
> at
> runtime. The admin queue uses tags 0-31 and the IO queue uses tags 32-127. No 
> ?

I suppose that would work and be simpler. I honestly don't know much
about the block layer and tag allocation so I stayed away from it :-)

I'll dig, but a hint would be welcome :)

Cheers,
Ben.

> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> > 
> > Note: One thing I noticed is how we have nvme_completion as volatile.
> > 
> > I don't think we really need that, it's forcing the compiler to constantly
> > reload things which makes no sense once we have established that an
> > entry is valid.
> > 
> > And since we have a data & control dependency from nvme_cqe_pending(),
> > we know that reading the CQE is going to depend on it being valid. I
> > don't really see what volatile is buying us here other than cargo culting.
> > 
> > Cheers,
> > Ben.
> > 
> >  drivers/nvme/host/nvme.h |  5 
> >  drivers/nvme/host/pci.c  | 52 +---
> >  2 files changed, 49 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> > index ced0e0a7e039..7c6de398de7d 100644
> > --- a/drivers/nvme/host/nvme.h
> > +++ b/drivers/nvme/host/nvme.h
> > @@ -102,6 +102,11 @@ enum nvme_quirks {
> >  * Use non-standard 128 bytes SQEs.
> >  */
> > NVME_QUIRK_128_BYTES_SQES   = (1 << 11),
> > +
> > +   /*
> > +* Prevent tag overlap between queues
> > +*/
> > +   NVME_QUIRK_SHARED_TAGS  = (1 << 12),
> >  };
> >  
> >  /*
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 7088971d4c42..c38e946ad8ca 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -178,6 +178,7 @@ struct nvme_queue {
> > u16 cq_head;
> > u16 last_cq_head;
> > u16 qid;
> > +   u16 tag_offset;
> > u8 cq_phase;
> > u8 sqes;
> > unsigned long flags;
> > @@ -490,6 +491,7 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, 
> > struct nvme_command *cmd,
> > bool write_sq)
> >  {
> > spin_lock(>sq_lock);
> > +   cmd->common.command_id += nvmeq->tag_offset;
> > memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
> >cmd, sizeof(*cmd));
> > if (++nvmeq->sq_tail == nvmeq->q_depth)
> > @@ -951,9 +953,10 @@ static inline void nvme_ring_cq_doorbell(struct 
> > nvme_queue *nvmeq)
> >  static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx)
> >  {
> > volatile struct nvme_completion *cqe = >cqes[idx];
> > +   u16 ctag = cqe->command_id - nvmeq->tag_offset;
> > struct request *req;
> >  
> > -   if (unlikely(cqe->command_id >= nvmeq->q_depth)) {
> > +   if (unlikely(ctag >= nvmeq->q_depth)) {
> > dev_warn(nvmeq->dev->ctrl.device,
> > "invalid id %d completed on queue %d\n",
> > cqe->command_id, le16_to_cpu(cqe->sq_id));
> > @@ -966,14 +969,13 @@ static inline void nvme_handle_cqe(struct nvme_queue 
> > *nvmeq, u16 idx)
> >  * aborts.  We don't even bother to allocate a struct request
> >  * for them but rather special case them here.
> >  */
> > -   if (unlikely(nvmeq->qid == 0 &&
> > -   cqe->command_id >= NVM

[PATCH v2] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-18 Thread Benjamin Herrenschmidt

Another issue with the Apple T2 based 2018 controllers seem to be
that they blow up (and shut the machine down) if there's a tag
collision between the IO queue and the Admin queue.

My suspicion is that they use our tags for their internal tracking
and don't mix them with the queue id. They also seem to not like
when tags go beyond the IO queue depth, ie 128 tags.

This adds a quirk that offsets all the tags in the IO queue by 32
to avoid those collisions. It also limits the number of IO queues
to 1 since the code wouldn't otherwise make sense (the device
supports only one queue anyway but better safe than sorry) and
reduces the size of the IO queue

Signed-off-by: Benjamin Herrenschmidt 
---

Note: One thing I noticed is how we have nvme_completion as volatile.

I don't think we really need that, it's forcing the compiler to constantly
reload things which makes no sense once we have established that an
entry is valid.

And since we have a data & control dependency from nvme_cqe_pending(),
we know that reading the CQE is going to depend on it being valid. I
don't really see what volatile is buying us here other than cargo culting.

Cheers,
Ben.

 drivers/nvme/host/nvme.h |  5 
 drivers/nvme/host/pci.c  | 52 +---
 2 files changed, 49 insertions(+), 8 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ced0e0a7e039..7c6de398de7d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -102,6 +102,11 @@ enum nvme_quirks {
 * Use non-standard 128 bytes SQEs.
 */
NVME_QUIRK_128_BYTES_SQES   = (1 << 11),
+
+   /*
+* Prevent tag overlap between queues
+*/
+   NVME_QUIRK_SHARED_TAGS  = (1 << 12),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7088971d4c42..c38e946ad8ca 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -178,6 +178,7 @@ struct nvme_queue {
u16 cq_head;
u16 last_cq_head;
u16 qid;
+   u16 tag_offset;
u8 cq_phase;
u8 sqes;
unsigned long flags;
@@ -490,6 +491,7 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, 
struct nvme_command *cmd,
bool write_sq)
 {
spin_lock(>sq_lock);
+   cmd->common.command_id += nvmeq->tag_offset;
memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
   cmd, sizeof(*cmd));
if (++nvmeq->sq_tail == nvmeq->q_depth)
@@ -951,9 +953,10 @@ static inline void nvme_ring_cq_doorbell(struct nvme_queue 
*nvmeq)
 static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx)
 {
volatile struct nvme_completion *cqe = >cqes[idx];
+   u16 ctag = cqe->command_id - nvmeq->tag_offset;
struct request *req;
 
-   if (unlikely(cqe->command_id >= nvmeq->q_depth)) {
+   if (unlikely(ctag >= nvmeq->q_depth)) {
dev_warn(nvmeq->dev->ctrl.device,
"invalid id %d completed on queue %d\n",
cqe->command_id, le16_to_cpu(cqe->sq_id));
@@ -966,14 +969,13 @@ static inline void nvme_handle_cqe(struct nvme_queue 
*nvmeq, u16 idx)
 * aborts.  We don't even bother to allocate a struct request
 * for them but rather special case them here.
 */
-   if (unlikely(nvmeq->qid == 0 &&
-   cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH)) {
+   if (unlikely(nvmeq->qid == 0 && ctag >= NVME_AQ_BLK_MQ_DEPTH)) {
nvme_complete_async_event(>dev->ctrl,
cqe->status, >result);
return;
}
 
-   req = blk_mq_tag_to_rq(*nvmeq->tags, cqe->command_id);
+   req = blk_mq_tag_to_rq(*nvmeq->tags, ctag);
trace_nvme_sq(req, cqe->sq_head, nvmeq->sq_tail);
nvme_end_request(req, cqe->status, cqe->result);
 }
@@ -1004,7 +1006,10 @@ static inline int nvme_process_cq(struct nvme_queue 
*nvmeq, u16 *start,
 
*start = nvmeq->cq_head;
while (nvme_cqe_pending(nvmeq)) {
-   if (tag == -1U || nvmeq->cqes[nvmeq->cq_head].command_id == tag)
+   u16 ctag = nvmeq->cqes[nvmeq->cq_head].command_id;
+
+   ctag -= nvmeq->tag_offset;
+   if (tag == -1U || ctag == tag)
found++;
nvme_update_cq_head(nvmeq);
}
@@ -1487,6 +1492,10 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
qid, int depth)
nvmeq->qid = qid;
dev->ctrl.queue_count++;
 
+   if (qid && (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS))
+   nvmeq->tag_offset = NVME_AQ_DEPTH;
+   else
+   nvmeq->tag_offset = 0;
return 0;
 
  free_cqdma:
@@ -2106,6 +2115

Re: [PATCH] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-18 Thread Benjamin Herrenschmidt

On Thu, 2019-07-18 at 17:11 +1000, Benjamin Herrenschmidt wrote:
> Another issue with the Apple T2 based 2018 controllers seem to be
> that they blow up (and shut the machine down) if there's a tag
> collision between the IO queue and the Admin queue.
> 
> This adds a quirk that offsets all the tags in the IO queue by 32
> to avoid those collisions. It also limits the number of IO queues
> to 1 since the code wouldn't otherwise make sense (the device
> supports only one queue anyway but better safe than sorry).
> 
> The bug is typically triggered by tag collisions between SMART
> commands from smartd and IO commands, often at boot time.
> 
> Signed-off-by: Benjamin Herrenschmidt 
> ---
> 
> Note: This is the smallest way I found of doing this that keeps
> the impact self contained to pci.c. Feel free to suggest
> alternatives.

Looks like it's not enough ... the bug is a lot harder to hit but I
still occasionally get a duplicate tag. I'm now wondering if it's
unhappy about having tags bigger than q_depth... I wouldn't be
surprised with anything here.

I'll try again with a reduce q_depth as well...

Ben.

>  drivers/nvme/host/nvme.h |  5 +
>  drivers/nvme/host/pci.c  | 26 --
>  2 files changed, 25 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
> index 564b967058f4..eeb99e485898 100644
> --- a/drivers/nvme/host/nvme.h
> +++ b/drivers/nvme/host/nvme.h
> @@ -102,6 +102,11 @@ enum nvme_quirks {
>* Use non-standard 128 bytes SQEs.
>*/
>   NVME_QUIRK_128_BYTES_SQES   = (1 << 11),
> +
> + /*
> +  * Prevent tag overlap between queues
> +  */
> + NVME_QUIRK_SHARED_TAGS  = (1 << 12),
>  };
>  
>  /*
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index e399e59863c7..1055f19e57a4 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -194,6 +194,7 @@ struct nvme_queue {
>   u16 cq_head;
>   u16 last_cq_head;
>   u16 qid;
> + u16 tag_offset;
>   u8 cq_phase;
>   u8 sqes;
>   unsigned long flags;
> @@ -506,6 +507,7 @@ static void nvme_submit_cmd(struct nvme_queue
> *nvmeq, struct nvme_command *cmd,
>   bool write_sq)
>  {
>   spin_lock(>sq_lock);
> + cmd->common.command_id += nvmeq->tag_offset;
>   memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
>  cmd, sizeof(*cmd));
>   if (++nvmeq->sq_tail == nvmeq->q_depth)
> @@ -967,9 +969,10 @@ static inline void nvme_ring_cq_doorbell(struct
> nvme_queue *nvmeq)
>  static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16
> idx)
>  {
>   volatile struct nvme_completion *cqe = >cqes[idx];
> + u16 ctag = cqe->command_id - nvmeq->tag_offset;
>   struct request *req;
>  
> - if (unlikely(cqe->command_id >= nvmeq->q_depth)) {
> + if (unlikely(ctag >= nvmeq->q_depth)) {
>   dev_warn(nvmeq->dev->ctrl.device,
>   "invalid id %d completed on queue %d\n",
>   cqe->command_id, le16_to_cpu(cqe->sq_id));
> @@ -982,14 +985,13 @@ static inline void nvme_handle_cqe(struct
> nvme_queue *nvmeq, u16 idx)
>* aborts.  We don't even bother to allocate a struct request
>* for them but rather special case them here.
>*/
> - if (unlikely(nvmeq->qid == 0 &&
> - cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH)) {
> + if (unlikely(nvmeq->qid == 0 && ctag >= NVME_AQ_BLK_MQ_DEPTH))
> {
>   nvme_complete_async_event(>dev->ctrl,
>   cqe->status, >result);
>   return;
>   }
>  
> - req = blk_mq_tag_to_rq(*nvmeq->tags, cqe->command_id);
> + req = blk_mq_tag_to_rq(*nvmeq->tags, ctag);
>   trace_nvme_sq(req, cqe->sq_head, nvmeq->sq_tail);
>   nvme_end_request(req, cqe->status, cqe->result);
>  }
> @@ -1020,7 +1022,10 @@ static inline int nvme_process_cq(struct
> nvme_queue *nvmeq, u16 *start,
>  
>   *start = nvmeq->cq_head;
>   while (nvme_cqe_pending(nvmeq)) {
> - if (tag == -1U || nvmeq->cqes[nvmeq-
> >cq_head].command_id == tag)
> + u16 ctag = nvmeq->cqes[nvmeq->cq_head].command_id;
> +
> + ctag -= nvmeq->tag_offset;
> + if (tag == -1U || ctag == tag)
>   found++;
>   nvme_update_cq_head(nvmeq);
>   }
> @@ -1499,6 +1504,10 @@ static int nvme_alloc_queue(struct

[PATCH] nvme-pci: Support shared tags across queues for Apple 2018 controllers

2019-07-18 Thread Benjamin Herrenschmidt

Another issue with the Apple T2 based 2018 controllers seem to be
that they blow up (and shut the machine down) if there's a tag
collision between the IO queue and the Admin queue.

This adds a quirk that offsets all the tags in the IO queue by 32
to avoid those collisions. It also limits the number of IO queues
to 1 since the code wouldn't otherwise make sense (the device
supports only one queue anyway but better safe than sorry).

The bug is typically triggered by tag collisions between SMART
commands from smartd and IO commands, often at boot time.

Signed-off-by: Benjamin Herrenschmidt 
---

Note: This is the smallest way I found of doing this that keeps
the impact self contained to pci.c. Feel free to suggest alternatives.

 drivers/nvme/host/nvme.h |  5 +
 drivers/nvme/host/pci.c  | 26 --
 2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 564b967058f4..eeb99e485898 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -102,6 +102,11 @@ enum nvme_quirks {
 * Use non-standard 128 bytes SQEs.
 */
NVME_QUIRK_128_BYTES_SQES   = (1 << 11),
+
+   /*
+* Prevent tag overlap between queues
+*/
+   NVME_QUIRK_SHARED_TAGS  = (1 << 12),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e399e59863c7..1055f19e57a4 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -194,6 +194,7 @@ struct nvme_queue {
u16 cq_head;
u16 last_cq_head;
u16 qid;
+   u16 tag_offset;
u8 cq_phase;
u8 sqes;
unsigned long flags;
@@ -506,6 +507,7 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, 
struct nvme_command *cmd,
bool write_sq)
 {
spin_lock(>sq_lock);
+   cmd->common.command_id += nvmeq->tag_offset;
memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
   cmd, sizeof(*cmd));
if (++nvmeq->sq_tail == nvmeq->q_depth)
@@ -967,9 +969,10 @@ static inline void nvme_ring_cq_doorbell(struct nvme_queue 
*nvmeq)
 static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx)
 {
volatile struct nvme_completion *cqe = >cqes[idx];
+   u16 ctag = cqe->command_id - nvmeq->tag_offset;
struct request *req;
 
-   if (unlikely(cqe->command_id >= nvmeq->q_depth)) {
+   if (unlikely(ctag >= nvmeq->q_depth)) {
dev_warn(nvmeq->dev->ctrl.device,
"invalid id %d completed on queue %d\n",
cqe->command_id, le16_to_cpu(cqe->sq_id));
@@ -982,14 +985,13 @@ static inline void nvme_handle_cqe(struct nvme_queue 
*nvmeq, u16 idx)
 * aborts.  We don't even bother to allocate a struct request
 * for them but rather special case them here.
 */
-   if (unlikely(nvmeq->qid == 0 &&
-   cqe->command_id >= NVME_AQ_BLK_MQ_DEPTH)) {
+   if (unlikely(nvmeq->qid == 0 && ctag >= NVME_AQ_BLK_MQ_DEPTH)) {
nvme_complete_async_event(>dev->ctrl,
cqe->status, >result);
return;
}
 
-   req = blk_mq_tag_to_rq(*nvmeq->tags, cqe->command_id);
+   req = blk_mq_tag_to_rq(*nvmeq->tags, ctag);
trace_nvme_sq(req, cqe->sq_head, nvmeq->sq_tail);
nvme_end_request(req, cqe->status, cqe->result);
 }
@@ -1020,7 +1022,10 @@ static inline int nvme_process_cq(struct nvme_queue 
*nvmeq, u16 *start,
 
*start = nvmeq->cq_head;
while (nvme_cqe_pending(nvmeq)) {
-   if (tag == -1U || nvmeq->cqes[nvmeq->cq_head].command_id == tag)
+   u16 ctag = nvmeq->cqes[nvmeq->cq_head].command_id;
+
+   ctag -= nvmeq->tag_offset;
+   if (tag == -1U || ctag == tag)
found++;
nvme_update_cq_head(nvmeq);
}
@@ -1499,6 +1504,10 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
qid, int depth)
nvmeq->qid = qid;
dev->ctrl.queue_count++;
 
+   if (qid && (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS))
+   nvmeq->tag_offset = NVME_AQ_DEPTH;
+   else
+   nvmeq->tag_offset = 0;
return 0;
 
  free_cqdma:
@@ -2110,6 +2119,10 @@ static int nvme_setup_io_queues(struct nvme_dev *dev)
unsigned long size;
 
nr_io_queues = max_io_queues();
+
+   if (dev->ctrl.quirks & NVME_QUIRK_SHARED_TAGS)
+   nr_io_queues = 1;
+
result = nvme_set_queue_count(>ctrl, _io_queues);
if (result < 0)
return result;
@@ -2957,7 +2970,8 @@ static const struct pci_device_id nvme_id_table[] = {
{

Re: [PATCH v2 2/3] nvme-pci: Add support for variable IO SQ element size

2019-07-17 Thread Benjamin Herrenschmidt

On Wed, 2019-07-17 at 20:51 +0900, Minwoo Im wrote:
> - struct nvme_command *sq_cmds;
> > +   void *sq_cmds;
> 
> It would be great if it can remain the existing data type for the
> SQEs...  But I'm fine with this also.
> 
> It looks good to me.

I changed it on purpose so we aren't tempted to index the array, since
that's not always valid.

Cheers,
Ben.

Re: [PATCH v2 3/3] nvme-pci: Add support for Apple 2018+ models

2019-07-16 Thread Benjamin Herrenschmidt

On Wed, 2019-07-17 at 06:50 +0200, Christoph Hellwig wrote:
> > # Conflicts:
> > #   drivers/nvme/host/core.c
> 
> I thought you were going to fix this up :)

Haha yeah I was ...

> But I can do that and this version of the series looks fine to me.

Thanks !

Cheers,
Ben.

[PATCH v2 3/3] nvme-pci: Add support for Apple 2018+ models

2019-07-16 Thread Benjamin Herrenschmidt

Based on reverse engineering and original patch by

Paul Pawlowski 

This adds support for Apple weird implementation of NVME in their
2018 or later machines. It accounts for the twice-as-big SQ entries
for the IO queues, and the fact that only interrupt vector 0 appears
to function properly.

Signed-off-by: Benjamin Herrenschmidt 

# Conflicts:
#   drivers/nvme/host/core.c
---
 drivers/nvme/host/nvme.h | 10 ++
 drivers/nvme/host/pci.c  | 21 -
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 716a876119c8..ced0e0a7e039 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -92,6 +92,16 @@ enum nvme_quirks {
 * Broken Write Zeroes.
 */
NVME_QUIRK_DISABLE_WRITE_ZEROES = (1 << 9),
+
+   /*
+* Use only one interrupt vector for all queues
+*/
+   NVME_QUIRK_SINGLE_VECTOR= (1 << 10),
+
+   /*
+* Use non-standard 128 bytes SQEs.
+*/
+   NVME_QUIRK_128_BYTES_SQES   = (1 << 11),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 1637677afb78..7088971d4c42 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2081,6 +2081,13 @@ static int nvme_setup_irqs(struct nvme_dev *dev, 
unsigned int nr_io_queues)
dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
dev->io_queues[HCTX_TYPE_READ] = 0;
 
+   /*
+* Some Apple controllers require all queues to use the
+* first vector.
+*/
+   if (dev->ctrl.quirks & NVME_QUIRK_SINGLE_VECTOR)
+   irq_queues = 1;
+
return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
  PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, );
 }
@@ -2322,7 +2329,16 @@ static int nvme_pci_enable(struct nvme_dev *dev)
io_queue_depth);
dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
dev->dbs = dev->bar + 4096;
-   dev->io_sqes = NVME_NVM_IOSQES;
+
+   /*
+* Some Apple controllers require a non-standard SQE size.
+* Interestingly they also seem to ignore the CC:IOSQES register
+* so we don't bother updating it here.
+*/
+   if (dev->ctrl.quirks & NVME_QUIRK_128_BYTES_SQES)
+   dev->io_sqes = 7;
+   else
+   dev->io_sqes = NVME_NVM_IOSQES;
 
/*
 * Temporary fix for the Apple controller found in the MacBook8,1 and
@@ -3039,6 +3055,9 @@ static const struct pci_device_id nvme_id_table[] = {
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
+   { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2005),
+   .driver_data = NVME_QUIRK_SINGLE_VECTOR |
+   NVME_QUIRK_128_BYTES_SQES },
{ 0, }
 };
 MODULE_DEVICE_TABLE(pci, nvme_id_table);
-- 
2.17.1

[PATCH v2 1/3] nvme-pci: Pass the queue to SQ_SIZE/CQ_SIZE macros

2019-07-16 Thread Benjamin Herrenschmidt

This will make it easier to handle variable queue entry sizes
later. No functional change.

Signed-off-by: Benjamin Herrenschmidt 
Reviewed-by: Christoph Hellwig 
---
 drivers/nvme/host/pci.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index dd10cf78f2d3..8f006638452b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,8 +28,8 @@
 #include "trace.h"
 #include "nvme.h"
 
-#define SQ_SIZE(depth) (depth * sizeof(struct nvme_command))
-#define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion))
+#define SQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_command))
+#define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion))
 
 #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
 
@@ -1344,16 +1344,16 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
 
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
-   dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq->q_depth),
+   dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
if (!nvmeq->sq_cmds)
return;
 
if (test_and_clear_bit(NVMEQ_SQ_CMB, >flags)) {
pci_free_p2pmem(to_pci_dev(nvmeq->dev->dev),
-   nvmeq->sq_cmds, SQ_SIZE(nvmeq->q_depth));
+   nvmeq->sq_cmds, SQ_SIZE(nvmeq));
} else {
-   dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq->q_depth),
+   dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq),
nvmeq->sq_cmds, nvmeq->sq_dma_addr);
}
 }
@@ -1433,12 +1433,12 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int 
nr_io_queues,
 }
 
 static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
-   int qid, int depth)
+   int qid)
 {
struct pci_dev *pdev = to_pci_dev(dev->dev);
 
if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
-   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(nvmeq));
if (nvmeq->sq_cmds) {
nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
nvmeq->sq_cmds);
@@ -1447,11 +1447,11 @@ static int nvme_alloc_sq_cmds(struct nvme_dev *dev, 
struct nvme_queue *nvmeq,
return 0;
}
 
-   pci_free_p2pmem(pdev, nvmeq->sq_cmds, SQ_SIZE(depth));
+   pci_free_p2pmem(pdev, nvmeq->sq_cmds, SQ_SIZE(nvmeq));
}
}
 
-   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(nvmeq),
>sq_dma_addr, GFP_KERNEL);
if (!nvmeq->sq_cmds)
return -ENOMEM;
@@ -1465,12 +1465,13 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
qid, int depth)
if (dev->ctrl.queue_count > qid)
return 0;
 
-   nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(depth),
+   nvmeq->q_depth = depth;
+   nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(nvmeq),
 >cq_dma_addr, GFP_KERNEL);
if (!nvmeq->cqes)
goto free_nvmeq;
 
-   if (nvme_alloc_sq_cmds(dev, nvmeq, qid, depth))
+   if (nvme_alloc_sq_cmds(dev, nvmeq, qid))
goto free_cqdma;
 
nvmeq->dev = dev;
@@ -1479,15 +1480,14 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
qid, int depth)
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = >dbs[qid * 2 * dev->db_stride];
-   nvmeq->q_depth = depth;
nvmeq->qid = qid;
dev->ctrl.queue_count++;
 
return 0;
 
  free_cqdma:
-   dma_free_coherent(dev->dev, CQ_SIZE(depth), (void *)nvmeq->cqes,
-   nvmeq->cq_dma_addr);
+   dma_free_coherent(dev->dev, CQ_SIZE(nvmeq), (void *)nvmeq->cqes,
+ nvmeq->cq_dma_addr);
  free_nvmeq:
return -ENOMEM;
 }
@@ -1515,7 +1515,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 
qid)
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = >dbs[qid * 2 * dev->db_stride];
-   memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
+   memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq));
nvme_dbbuf_init(dev, nvmeq, qid);
dev->online_queues++;
wmb(); /* ensure the first interrupt sees the initialization */
-- 
2.17.1

[PATCH v2 2/3] nvme-pci: Add support for variable IO SQ element size

2019-07-16 Thread Benjamin Herrenschmidt

The size of a submission queue element should always be 6 (64 bytes)
by spec.

However some controllers such as Apple's are not properly implementing
the standard and require a different size.

This provides the ground work for the subsequent quirks for these
controllers.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/nvme/host/pci.c | 11 ---
 include/linux/nvme.h|  1 +
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8f006638452b..1637677afb78 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,7 +28,7 @@
 #include "trace.h"
 #include "nvme.h"
 
-#define SQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_command))
+#define SQ_SIZE(q) ((q)->q_depth << (q)->sqes)
 #define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion))
 
 #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
@@ -100,6 +100,7 @@ struct nvme_dev {
unsigned io_queues[HCTX_MAX_TYPES];
unsigned int num_vecs;
int q_depth;
+   int io_sqes;
u32 db_stride;
void __iomem *bar;
unsigned long bar_mapped_size;
@@ -162,7 +163,7 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl 
*ctrl)
 struct nvme_queue {
struct nvme_dev *dev;
spinlock_t sq_lock;
-   struct nvme_command *sq_cmds;
+   void *sq_cmds;
 /* only used for poll queues: */
spinlock_t cq_poll_lock cacheline_aligned_in_smp;
volatile struct nvme_completion *cqes;
@@ -178,6 +179,7 @@ struct nvme_queue {
u16 last_cq_head;
u16 qid;
u8 cq_phase;
+   u8 sqes;
unsigned long flags;
 #define NVMEQ_ENABLED  0
 #define NVMEQ_SQ_CMB   1
@@ -488,7 +490,8 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, 
struct nvme_command *cmd,
bool write_sq)
 {
spin_lock(>sq_lock);
-   memcpy(>sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd));
+   memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
+  cmd, sizeof(*cmd));
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
nvme_write_sq_db(nvmeq, write_sq);
@@ -1465,6 +1468,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
qid, int depth)
if (dev->ctrl.queue_count > qid)
return 0;
 
+   nvmeq->sqes = qid ? dev->io_sqes : NVME_NVM_ADMSQES;
nvmeq->q_depth = depth;
nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(nvmeq),
 >cq_dma_addr, GFP_KERNEL);
@@ -2318,6 +2322,7 @@ static int nvme_pci_enable(struct nvme_dev *dev)
io_queue_depth);
dev->db_stride = 1 << NVME_CAP_STRIDE(dev->ctrl.cap);
dev->dbs = dev->bar + 4096;
+   dev->io_sqes = NVME_NVM_IOSQES;
 
/*
 * Temporary fix for the Apple controller found in the MacBook8,1 and
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 01aa6a6c241d..d5a4bc21f36b 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -140,6 +140,7 @@ enum {
  * Submission and Completion Queue Entry Sizes for the NVM command set.
  * (In bytes and specified as a power of two (2^n)).
  */
+#define NVME_NVM_ADMSQES   6
 #define NVME_NVM_IOSQES6
 #define NVME_NVM_IOCQES4
 
-- 
2.17.1

Re: [PATCH 2/3] nvme: Retrieve the required IO queue entry size from the controller

2019-07-16 Thread Benjamin Herrenschmidt

On Tue, 2019-07-16 at 14:05 +0200, Christoph Hellwig wrote:
> On Tue, Jul 16, 2019 at 08:58:28PM +1000, Benjamin Herrenschmidt wrote:
> > The main risk is if existing controllers return crap in SQES and we try
> > to then use that crap. The rest should essentially be NOPs.
> > 
> > Maybe I should add some kind of printk to warn in case we use/detect a
> > non-standard size. That would help diagnosing issues.
> 
> Given that the spec currently requires bits 0 to 3 of SQES to be 6
> we might as well not check SQES and just hardcode it to 6 or 7 depending
> on the quirk.  That actually was my initial idea, I just suggested using
> the SQES naming and indexing.

If we're going to do that, then I can move it back to pci.c and leave
core.c alone then I suppose. Up to you. I'm just doing that for fun, no
beef in that game :-) let me know how you want it.

Cheers,
Ben.

Re: [PATCH 2/3] nvme: Retrieve the required IO queue entry size from the controller

2019-07-16 Thread Benjamin Herrenschmidt

On Tue, 2019-07-16 at 11:33 +0200, Christoph Hellwig wrote:
> > >So back to the version
> > > you circulated to me in private mail that just sets q->sqes and has a
> > > comment that this is magic for The Apple controller.  If/when we get
> > > standardized large SQE support we'll need to discover that earlier or
> > > do a disable/enable dance.  Sorry for misleading you down this road and
> > > creating the extra work.  
> > 
> > I think it's still ok, let me know...
> 
> Ok, let's go with this series then unless the other maintainers have
> objections.
> 
> I'm still not sure if we want to queue this up for 5.3 (new hardware
> enablement) or wait a bit, though.

The main risk is if existing controllers return crap in SQES and we try
to then use that crap. The rest should essentially be NOPs.

Maybe I should add some kind of printk to warn in case we use/detect a
non-standard size. That would help diagnosing issues.

Cheers,
Ben.

Re: [PATCH 3/3] nvme: Add support for Apple 2018+ models

2019-07-16 Thread Benjamin Herrenschmidt

On Tue, 2019-07-16 at 08:06 +0200, Christoph Hellwig wrote:
> 
> >  /*
> > diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> > index 54b35ea4af88..ab2358137419 100644
> > --- a/drivers/nvme/host/pci.c
> > +++ b/drivers/nvme/host/pci.c
> > @@ -2080,6 +2080,9 @@ static int nvme_setup_irqs(struct nvme_dev *dev, 
> > unsigned int nr_io_queues)
> > dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
> > dev->io_queues[HCTX_TYPE_READ] = 0;
> >  
> > +   if (dev->ctrl.quirks & NVME_QUIRK_SINGLE_VECTOR)
> > +   irq_queues = 1;
> > +
> > return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
> >   PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, );
> 
> Callin pci_alloc_irq_vectors_affinity in this case seems a bit
> pointless, but if this works for you I'd rather keep it as-is for now
> if this works for you.

It seems to work and it's simpler that way. The original patch was
grabbing all the interrupts then hacking the queues to all use vector 0
(well there's only one IO queue). The above is a bit cleaner imho.

Cheers,
Ben.

Re: [PATCH 2/3] nvme: Retrieve the required IO queue entry size from the controller

2019-07-16 Thread Benjamin Herrenschmidt

On Tue, 2019-07-16 at 08:04 +0200, Christoph Hellwig wrote:
> > 
> > +   /*
> > +* If our IO queue size isn't the default, update the setting
> > +* in CC:IOSQES.
> > +*/
> > +   if (ctrl->iosqes != NVME_NVM_IOSQES) {
> > +   ctrl->ctrl_config &= ~(0xfu << NVME_CC_IOSQES_SHIFT);
> > +   ctrl->ctrl_config |= ctrl->iosqes << 
> > NVME_CC_IOSQES_SHIFT;
> > +   ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC,
> > +ctrl->ctrl_config);
> > +   if (ret) {
> > +   dev_err(ctrl->device,
> > +   "error updating CC register\n");
> > +   goto out_free;
> > +   }
> > +   }
> 
> Actually, this doesn't work on a "real" nvme controller, to change CC
> values the controller needs to be disabled.

Not really. The specs says that MPS, AMD and CSS need to be set before
enabling, but IOCQES and IOSQES can be modified later as long as there
is no IO queue created yet.

This is necessary otherwise there's a chicken and egg problem. You need
the admin queue to do the controller id in order to get the sizes and
for that you need the controller to be enabled.

Note: This is not a huge issue anyway since I only update the register
if the required size isn't 6 which is probably never going to be the
case on non-Apple HW.

>   So back to the version
> you circulated to me in private mail that just sets q->sqes and has a
> comment that this is magic for The Apple controller.  If/when we get
> standardized large SQE support we'll need to discover that earlier or
> do a disable/enable dance.  Sorry for misleading you down this road and
> creating the extra work.  

I think it's still ok, let me know...

Ben.

Re: [PATCH 1/3] nvme: Pass the queue to SQ_SIZE/CQ_SIZE macros

2019-07-15 Thread Benjamin Herrenschmidt

On Tue, 2019-07-16 at 10:46 +1000, Benjamin Herrenschmidt wrote:
> # Conflicts:
> #   drivers/nvme/host/pci.c
> ---

Oops :-) You can strip that or should I resend ? I'll wait for
comments/reviews regardless.

Cheers,
Ben.

[PATCH 3/3] nvme: Add support for Apple 2018+ models

2019-07-15 Thread Benjamin Herrenschmidt

Based on reverse engineering and original patch by

Paul Pawlowski 

This adds support for Apple weird implementation of NVME in their
2018 or later machines. It accounts for the twice-as-big SQ entries
for the IO queues, and the fact that only interrupt vector 0 appears
to function properly.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/nvme/host/core.c |  5 -
 drivers/nvme/host/nvme.h | 10 ++
 drivers/nvme/host/pci.c  |  6 ++
 3 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 716ebe87a2b8..480ea24d8cf4 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2701,7 +2701,10 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
ctrl->hmmaxd = le16_to_cpu(id->hmmaxd);
 
/* Grab required IO queue size */
-   ctrl->iosqes = id->sqes & 0xf;
+   if (ctrl->quirks & NVME_QUIRK_128_BYTES_SQES)
+   ctrl->iosqes = 7;
+   else
+   ctrl->iosqes = id->sqes & 0xf;
if (ctrl->iosqes < NVME_NVM_IOSQES) {
dev_err(ctrl->device,
"unsupported required IO queue size %d\n", 
ctrl->iosqes);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 34ef35fcd8a5..b2a78d08b984 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -92,6 +92,16 @@ enum nvme_quirks {
 * Broken Write Zeroes.
 */
NVME_QUIRK_DISABLE_WRITE_ZEROES = (1 << 9),
+
+   /*
+* Use only one interrupt vector for all queues
+*/
+   NVME_QUIRK_SINGLE_VECTOR= (1 << 10),
+
+   /*
+* Use non-standard 128 bytes SQEs.
+*/
+   NVME_QUIRK_128_BYTES_SQES   = (1 << 11),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 54b35ea4af88..ab2358137419 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2080,6 +2080,9 @@ static int nvme_setup_irqs(struct nvme_dev *dev, unsigned 
int nr_io_queues)
dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
dev->io_queues[HCTX_TYPE_READ] = 0;
 
+   if (dev->ctrl.quirks & NVME_QUIRK_SINGLE_VECTOR)
+   irq_queues = 1;
+
return pci_alloc_irq_vectors_affinity(pdev, 1, irq_queues,
  PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, );
 }
@@ -3037,6 +3040,9 @@ static const struct pci_device_id nvme_id_table[] = {
{ PCI_DEVICE_CLASS(PCI_CLASS_STORAGE_EXPRESS, 0xff) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2001) },
{ PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2003) },
+   { PCI_DEVICE(PCI_VENDOR_ID_APPLE, 0x2005),
+   .driver_data = NVME_QUIRK_SINGLE_VECTOR |
+   NVME_QUIRK_128_BYTES_SQES },
{ 0, }
 };
 MODULE_DEVICE_TABLE(pci, nvme_id_table);
-- 
2.17.1

[PATCH 1/3] nvme: Pass the queue to SQ_SIZE/CQ_SIZE macros

2019-07-15 Thread Benjamin Herrenschmidt

This will make it easier to handle variable queue entry sizes
later. No functional change.

Signed-off-by: Benjamin Herrenschmidt 

# Conflicts:
#   drivers/nvme/host/pci.c
---
 drivers/nvme/host/pci.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index dd10cf78f2d3..8f006638452b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,8 +28,8 @@
 #include "trace.h"
 #include "nvme.h"
 
-#define SQ_SIZE(depth) (depth * sizeof(struct nvme_command))
-#define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion))
+#define SQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_command))
+#define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion))
 
 #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
 
@@ -1344,16 +1344,16 @@ static enum blk_eh_timer_return nvme_timeout(struct 
request *req, bool reserved)
 
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
-   dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq->q_depth),
+   dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
if (!nvmeq->sq_cmds)
return;
 
if (test_and_clear_bit(NVMEQ_SQ_CMB, >flags)) {
pci_free_p2pmem(to_pci_dev(nvmeq->dev->dev),
-   nvmeq->sq_cmds, SQ_SIZE(nvmeq->q_depth));
+   nvmeq->sq_cmds, SQ_SIZE(nvmeq));
} else {
-   dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq->q_depth),
+   dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq),
nvmeq->sq_cmds, nvmeq->sq_dma_addr);
}
 }
@@ -1433,12 +1433,12 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int 
nr_io_queues,
 }
 
 static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
-   int qid, int depth)
+   int qid)
 {
struct pci_dev *pdev = to_pci_dev(dev->dev);
 
if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
-   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(nvmeq));
if (nvmeq->sq_cmds) {
nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
nvmeq->sq_cmds);
@@ -1447,11 +1447,11 @@ static int nvme_alloc_sq_cmds(struct nvme_dev *dev, 
struct nvme_queue *nvmeq,
return 0;
}
 
-   pci_free_p2pmem(pdev, nvmeq->sq_cmds, SQ_SIZE(depth));
+   pci_free_p2pmem(pdev, nvmeq->sq_cmds, SQ_SIZE(nvmeq));
}
}
 
-   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(depth),
+   nvmeq->sq_cmds = dma_alloc_coherent(dev->dev, SQ_SIZE(nvmeq),
>sq_dma_addr, GFP_KERNEL);
if (!nvmeq->sq_cmds)
return -ENOMEM;
@@ -1465,12 +1465,13 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
qid, int depth)
if (dev->ctrl.queue_count > qid)
return 0;
 
-   nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(depth),
+   nvmeq->q_depth = depth;
+   nvmeq->cqes = dma_alloc_coherent(dev->dev, CQ_SIZE(nvmeq),
 >cq_dma_addr, GFP_KERNEL);
if (!nvmeq->cqes)
goto free_nvmeq;
 
-   if (nvme_alloc_sq_cmds(dev, nvmeq, qid, depth))
+   if (nvme_alloc_sq_cmds(dev, nvmeq, qid))
goto free_cqdma;
 
nvmeq->dev = dev;
@@ -1479,15 +1480,14 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int 
qid, int depth)
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = >dbs[qid * 2 * dev->db_stride];
-   nvmeq->q_depth = depth;
nvmeq->qid = qid;
dev->ctrl.queue_count++;
 
return 0;
 
  free_cqdma:
-   dma_free_coherent(dev->dev, CQ_SIZE(depth), (void *)nvmeq->cqes,
-   nvmeq->cq_dma_addr);
+   dma_free_coherent(dev->dev, CQ_SIZE(nvmeq), (void *)nvmeq->cqes,
+ nvmeq->cq_dma_addr);
  free_nvmeq:
return -ENOMEM;
 }
@@ -1515,7 +1515,7 @@ static void nvme_init_queue(struct nvme_queue *nvmeq, u16 
qid)
nvmeq->cq_head = 0;
nvmeq->cq_phase = 1;
nvmeq->q_db = >dbs[qid * 2 * dev->db_stride];
-   memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq->q_depth));
+   memset((void *)nvmeq->cqes, 0, CQ_SIZE(nvmeq));
nvme_dbbuf_init(dev, nvmeq, qid);
dev->online_queues++;
wmb(); /* ensure the first interrupt sees the initialization */
-- 
2.17.1

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 5350 matches

Mail list logo