[GIT PULL] FSI fixes for 4.20

2018-11-25 Thread Benjamin Herrenschmidt
Hi Greg !

Here are two very minor fixes for FSI. One from Arnd is a Kconfig fixup
and has been rusting away in my tree for a while (I had forgotten about
it). The other one just removes a duplicate #include, courtesy of
Brajeswar Ghosh.

The following changes since commit 2e6e902d185027f8e3cb8b7305238f7e35d6a436:

  Linux 4.20-rc4 (2018-11-25 14:19:31 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git 
tags/fsi-updates-2018-11-26

for you to fetch changes up to d20810530b7109a95abef5130e6dcec09c5180d7:

  fsi: fsi-scom.c: Remove duplicate header (2018-11-26 10:13:04 +1100)


Arnd Bergmann (1):
  fsi: master-ast-cf: select GENERIC_ALLOCATOR

Brajeswar Ghosh (1):
  fsi: fsi-scom.c: Remove duplicate header

 drivers/fsi/Kconfig| 1 +
 drivers/fsi/fsi-scom.c | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)



Re: [PATCH] drivers/fsi/fsi-scom.c: Remove duplicate header

2018-11-25 Thread Benjamin Herrenschmidt
On Sat, 2018-11-24 at 13:51 +0530, Brajeswar Ghosh wrote:
> On Fri, Nov 16, 2018 at 4:17 PM Brajeswar Ghosh
>  wrote:
> > Remove linux/cdev.h which is included more than once
> > 
> > Signed-off-by: Brajeswar Ghosh 
> 
> Any comment on this patch?

Ah sorry, I missed it. Yeah it's fine. I will send it to Greg.

> > ---
> >  drivers/fsi/fsi-scom.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/drivers/fsi/fsi-scom.c b/drivers/fsi/fsi-scom.c
> > index df94021dd9d1..81dc01ac2351 100644
> > --- a/drivers/fsi/fsi-scom.c
> > +++ b/drivers/fsi/fsi-scom.c
> > @@ -20,7 +20,6 @@
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> >  #include 
> > 
> >  #include 
> > --
> > 2.17.1
> > 



Re: [[PATCH] 8/9] DMA-UART-Driver-for-AST2500

2018-10-18 Thread Benjamin Herrenschmidt
On Thu, 2018-10-18 at 15:25 +0530, Vinod wrote:
> 
> > It's not a dmaengine driver. It's a serial UART driver that happens to
> > use a dedicated DMA engine.
> 
> Then I see no reason for it to use dmaengine APIs. The framework allows
> people to share a controller for many clients, but if you have dedicated
> one then you may use it directly

Well... the engine is shared by a few UARTs, they have dedicated rings
but there's a common set of regs for interrupt handling etc.

That said, I still think it could be contained within a UART driver,
there's little benefit in adding the framework overhead, esp since
these are really weak cores, any overhead will be felt.

Ben.

> > It's unclear whether it should be split into two drivers, or just have
> > the serial driver directly use the dma engine since that engine is
> > dedicated in HW to only work on those UARTs and nothing else...
> > 
> > Cheers,
> > Ben.
> > 
> > 
> > > While doing resubmission please take some time to understand subsystem
> > > tags to use. (hint git log  will tell you)
> > > 
> > > Also series has [[PATCH] 8/9] whereas it should be [PATCH 8/9] please
> > > let git generate that for you (hint git format-patch start..end does a
> > > good job)
> > > 
> > > > @@ -0,0 +1,1594 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + *   drivers/tty/serial/8250/8250_aspeed_uart_dma.c
> > > > + *1. 2018/07/01 Shivah Shankar created
> > > > + *2. 2018/08/25 sudheer.veliseti modified
> > > 
> > > we dont use this log in kernel. I do not see s-o-b by Shivah, that
> > > should be added. I think he should be author and you need to list
> > > changes you did..
> > > 
> 
> 



Re: [[PATCH] 8/9] DMA-UART-Driver-for-AST2500

2018-10-17 Thread Benjamin Herrenschmidt
On Wed, 2018-10-17 at 11:35 +0530, Vinod wrote:
> On 17-10-18, 09:41, sudheer.v wrote:
> 
> Please add the change log describing the driver and its features
> 
> > Signed-off-by: sudheer.v 
> 
> 
> > ---
> >  drivers/tty/serial/8250/8250_aspeed_uart_dma.c | 1594 
> > 
> >  1 file changed, 1594 insertions(+)
> >  create mode 100644 drivers/tty/serial/8250/8250_aspeed_uart_dma.c
> > 
> > diff --git a/drivers/tty/serial/8250/8250_aspeed_uart_dma.c 
> > b/drivers/tty/serial/8250/8250_aspeed_uart_dma.c
> > new file mode 100644
> > index 000..e1019a8
> > --- /dev/null
> > +++ b/drivers/tty/serial/8250/8250_aspeed_uart_dma.c
> 
> why is this in serial. It is dmaengine driver so belongs to drivers/dma/
> like other controllers. Please move it out and resubmit.
 
It's not a dmaengine driver. It's a serial UART driver that happens to
use a dedicated DMA engine.

It's unclear whether it should be split into two drivers, or just have
the serial driver directly use the dma engine since that engine is
dedicated in HW to only work on those UARTs and nothing else...

Cheers,
Ben.


> While doing resubmission please take some time to understand subsystem
> tags to use. (hint git log  will tell you)
> 
> Also series has [[PATCH] 8/9] whereas it should be [PATCH 8/9] please
> let git generate that for you (hint git format-patch start..end does a
> good job)
> 
> > @@ -0,0 +1,1594 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + *   drivers/tty/serial/8250/8250_aspeed_uart_dma.c
> > + *1. 2018/07/01 Shivah Shankar created
> > + *2. 2018/08/25 sudheer.veliseti modified
> 
> we dont use this log in kernel. I do not see s-o-b by Shivah, that
> should be added. I think he should be author and you need to list
> changes you did..
> 



Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection

2018-09-27 Thread Benjamin Herrenschmidt
On Thu, 2018-09-27 at 15:49 +0200, Christoph Hellwig wrote:
> On Thu, Sep 27, 2018 at 11:45:15AM +1000, Benjamin Herrenschmidt wrote:
> > I'm not sure this is entirely right.
> > 
> > Let's say the mask is 30 bits. You will return GFP_DMA32, which will
> > fail if you allocate something above 1G (which is legit for
> > ZONE_DMA32).
> 
> And then we will try GFP_DMA further down in the function:
> 
>   if (IS_ENABLED(CONFIG_ZONE_DMA) &&
>   dev->coherent_dma_mask < DMA_BIT_MASK(32) &&
>   !(gfp & GFP_DMA)) {
>   gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
>   goto again;
>   }
> 
> This is and old optimization from x86, because chances are high that
> GFP_DMA32 will give you suitable memory for the infamous 31-bit
> dma mask devices (at least at boot time) and thus we don't have
> to deplete the tiny ZONE_DMA pool.

I see, it's rather confusing :-) Wouldn't it be better to check against
top of 32-bit memory instead here too ?


Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 19:52 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 7:47 PM Linus Torvalds
>  wrote:
> > 
> > The comment actually does talk about it, although the comment also
> > claims that the cs read would use load_unaligned_zeropad(), which it
> > no longer does (now it only does the read_word_at_a_time).
> 
> IOW, look at commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and
> dentry_cmp() interfaces") for why the zeropad went away for the cs
> access (but the comment wasn't updated).
> 
> And then bfe7aa6c39b1 ("fs/dcache: Use read_word_at_a_time() in
> dentry_string_cmp()") did the "let's make KASAN happy thing.
> 
> And yes, the word-at-a-time code actually matters a lot for certain
> loads. The "copy-and-hash" thing for path components ends up being
> pretty critical in all the pathname handling.

Yup, makes sense.

Thanks !

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 19:10 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Still, I can potentially see an issue with DEBUG_PAGEALLOC
> 
> An unmapped page isn't a problem. That's what the whole
> load_unaligned_zeropad() is about: it's ok to take a fault on the part
> that crosses a page, and we'll just fill the value with zeroes (that's
> the "zeropad" part).

Ah, my bad reading, I was looking at read_word_at_a_time() instead of
load_unaligned_zeropad(). I'm not familiar enough with the dentry qstr
stuff, I assume this is safe ?

> So as long as it's rare (and it is), it's all fine.
> 
> That said, I think we turn off for DEBUG_PAGEALLOC simply because it's
> not rare _enough_.
> 
> And vmalloc() should actually be safe too, simply because I think we
> strive for a guard page between vmalloc areas.
> 
> So only a *mapped* page after the page that matters, and only if it's
> something you can't read without side effects.
> 
> Which basically doesn't happen on x86 in reality. BIOSes just don't
> put MMIO right after the last page of RAM. I think this is why it only
> triggered on Xen, due to some crazy "Xen reacts badly" case where we
> do the speculation into a balloon address.
> 
> So _practically_ this is just a Xen bug, nothing more.
> 
> But since in _theory_ you could have MMIO abut regular RAM directly,
> it's worth maybe making sure it's purely theory.




Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote:
> [ Adding a few new people the the cc.
> 
>   The issue is the worry about software-speculative accesses (ie
> things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw
> speculation now) accessing past RAM into possibly contiguous IO ]
> 
> On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds
>  wrote:
> > 
> > If you have a machine with RAM that touches IO, you need to disable
> > the last page, exactly the same way we disable and marked reserved the
> > first page at zero.

So I missed the departure of that train ... stupid question, with
CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can),
what prevents it from crossing into a non-mapped page (not even IO) and
causing an oops ? Looking at a random user in fs/dcache.c its not a
uaccess-style read with recovery Or am I missing somethign obvious
here ?

IE, should we "reserve" the last page of any memory region (maybe mark
it read-only) to avoid this along with avoiding leakage into IO space ?

> > I thought we already did that.
> 
> We don't seem to do that.
> 
> And it's not just the last page, it's _any_ last page in a region that
> bumps up to IO. That's actually much more common in the low 4G area on
> PC's, I suspect, although the reserved BIOS ranges always tend to be
> there.

What makes IO more "wrong" than oopsing due to the page not being
mapped ?

> I suspect it should be trivial to do - maybe in
> e820__memblock_setup()? That's where we already trim partial pages
> etc.
>
> In fact, I think this might be done as an extension of commit
> 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> memblock.reserved"), except making sure that non-RAM regions mark one
> page _previous_ as reserved too.
> 
> I assume memory hotplug might have the same issue, and checking
> whether ARM64 and powerpc perhaps might have already done something
> like this (or might need to add it).
> 
> We discussed long ago the case of user space mapping IO in user space,
> and decided we didn't care. But the kernel should probably explicitly
> make sure we don't either, even if I can't recall having ever seen a
> machine that actually maps IO contiguously to RAM. The layout always
> tends to end up having holes anyway.

Can't we put the safety in generic memblock ? IE, don't hand out an
allocation that contain the last page of a "block" and handle that last
page in the memblock->buddy transition rather than in arch specific
code ?

Cheers,
Ben.




Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:42 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:38 PM Linus Torvalds
>  wrote:
> > It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> Just to clarify .. that's true of the dcache stuff.
> 
> The strscpy case actually explicitly limits things to page boundaries
> and falls back to the byte-by-byte case after that.

Ah ok, that makes sense.

Still, I can potentially see an issue with DEBUG_PAGEALLOC

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Mon, 2018-09-03 at 10:48 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote:
> > [ Adding a few new people the the cc.
> > 
> >   The issue is the worry about software-speculative accesses (ie
> > things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw
> > speculation now) accessing past RAM into possibly contiguous IO ]
> > 
> > On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds
> >  wrote:
> > > 
> > > If you have a machine with RAM that touches IO, you need to disable
> > > the last page, exactly the same way we disable and marked reserved the
> > > first page at zero.
> 
> So I missed the departure of that train ... stupid question, with
> CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can),
> what prevents it from crossing into a non-mapped page (not even IO) and
> causing an oops ? Looking at a random user in fs/dcache.c its not a
> uaccess-style read with recovery Or am I missing somethign obvious
> here ?

Also, if we cross page boundaries with those guys then we have a bigger
problem no ? we could fall off a vmalloc page into the nether or into
an ioremap mapping no ?

Cheers,
Ben. 



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Also, if we cross page boundaries with those guys then we have a bigger
> > problem no ? we could fall off a vmalloc page into the nether or into
> > an ioremap mapping no ?
> 
> It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> So it can cross pages, and it can fall off the end of memory, but it
> can't do random stuff.

Actually what about DEBUG_PAGEALLOC ? Can't we fall off a page and
fault that way too ?

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Also, if we cross page boundaries with those guys then we have a bigger
> > problem no ? we could fall off a vmalloc page into the nether or into
> > an ioremap mapping no ?
> 
> It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> So it can cross pages, and it can fall off the end of memory, but it
> can't do random stuff.

Ok, it might be worth adding a DEBUG_VM based (or similar) warning in
case somebody ever thinks of passing a vmalloc pointer to it...

As for falling out of the end of memory, yes it could be a real problem
though I don't see why IO is any different than just hitting a non-
mapped area in that regard. So we should probably keep an unused
(readonly if possible) zero page at the end.
 
Cheers,
Ben.



Re: v4.17 regression: PowerMac G3 won't boot, was Re: [PATCH v5 1/3] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()

2018-08-30 Thread Benjamin Herrenschmidt
On Fri, 2018-08-31 at 14:35 +1000, Benjamin Herrenschmidt wrote:
> 
> > If I force output with "-f", the resulting file has no occurrences 
> > of "phandle".
> 
> Are you booting with BootX or Open Firmware ?

Assuming you are using BootX (or miBoot), can you try this patch ?

--- a/arch/powerpc/platforms/powermac/bootx_init.c
+++ b/arch/powerpc/platforms/powermac/bootx_init.c
@@ -37,6 +37,7 @@ static unsigned long __initdata bootx_dt_strend;
 static unsigned long __initdata bootx_node_chosen;
 static boot_infos_t * __initdata bootx_info;
 static char __initdata bootx_disp_path[256];
+static int __initdata bootx_phandle;
 
 /* Is boot-info compatible ? */
 #define BOOT_INFO_IS_COMPATIBLE(bi) \
@@ -258,6 +259,8 @@ static void __init bootx_scan_dt_build_strings(unsigned 
long base,
namep = pp->name ? (char *)(base + pp->name) : NULL;
if (namep == NULL || strcmp(namep, "name") == 0)
goto next;
+   if (!strcmp(namep, "phandle") || !strcmp(namep, 
"linux,phandle"))
+   bootx_phandle = -1;
/* get/create string entry */
soff = bootx_dt_find_string(namep);
if (soff == 0)
@@ -310,6 +313,7 @@ static void __init bootx_scan_dt_build_struct(unsigned long 
base,
*mem_end = _ALIGN_UP((unsigned long)lp + 1, 4);
 
/* get and store all properties */
+   has_phandle = false;
while (*ppp) {
struct bootx_dt_prop *pp =
(struct bootx_dt_prop *)(base + *ppp);
@@ -330,6 +334,12 @@ static void __init bootx_scan_dt_build_struct(unsigned 
long base,
ppp = >next;
}
 
+   /* add a phandle */
+   if (bootx_phandle > 0) {
+   bootx_dt_add_prop("phandle", _phandle, 4, mem_end);
+   bootx_phandle++;
+   }
+
if (node == bootx_node_chosen) {
bootx_add_chosen_props(base, mem_end);
if (bootx_info->dispDeviceRegEntryOffset == 0)
@@ -385,6 +395,8 @@ static unsigned long __init bootx_flatten_dt(unsigned long 
start)
bootx_dt_add_string("linux,bootx-height", _end);
bootx_dt_add_string("linux,bootx-linebytes", _end);
bootx_dt_add_string("linux,bootx-addr", _end);
+   if (bootx_phandle > 0)
+   bootx_dt_add_string("phandle", _end);
/* Wrap up strings */
hdr->off_dt_strings = bootx_dt_strbase - mem_start;
hdr->dt_strings_size = bootx_dt_strend - bootx_dt_strbase;
@@ -482,6 +494,7 @@ void __init bootx_init(unsigned long r3, unsigned long r4)
bootx_dt_strbase = bootx_dt_strend = 0;
bootx_node_chosen = 0;
bootx_disp_path[0] = 0;
+   bootx_phandle = 1;
 
if (!BOOT_INFO_IS_V2_COMPATIBLE(bi))
bi->logicalDisplayBase = bi->dispDeviceBase;



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-30 Thread Benjamin Herrenschmidt
On Mon, 2018-08-27 at 19:02 +1000, Nicholas Piggin wrote:
> > More tlbies ? With the cost of the broadasts on the fabric ? I don't
> > think so.. or I'm not understanding your point...
> 
> More tlbies are no good, but there will be some places where it works
> out much better (and fewer tlbies). Worst possible case for current code
> is a big unmap with lots of scattered page sizes. We _should_ get that
> with just a single PID flush at the end, but what we will get today is
> a bunch of PID and VA flushes.
> 
> I don't propose doing that though, I'd rather be explicit about
> tracking start and end range of each page size. Still not "optimal"
> but neither is existing single range for sparse mappings... anyway it
> will need to be profiled, but my point is we don't really fit exactly
> what x86/arm want.

If we have an arch specific part, we could just remember up to N
"large" pages there without actually flushing, and if that overflows,
upgrade to a full flush.

Cheers,
Ben.



Re: [PATCH 2/4] tty: Hold tty_ldisc_lock() during tty_reopen()

2018-08-29 Thread Benjamin Herrenschmidt
On Wed, 2018-08-29 at 13:34 +0900, Sergey Senozhatsky wrote:
> Hi,
> 
> Cc-ing Benjamin on this.
> 
> On (08/29/18 03:23), Dmitry Safonov wrote:
> > BUG: unable to handle kernel paging request at 2260
> > IP: [..] n_tty_receive_buf_common+0x5f/0x86d
> > Workqueue: events_unbound flush_to_ldisc
> > Call Trace:
> >  [..] n_tty_receive_buf2
> >  [..] tty_ldisc_receive_buf
> >  [..] flush_to_ldisc
> >  [..] process_one_work
> >  [..] worker_thread
> >  [..] kthread
> >  [..] ret_from_fork
> 
> Seems that you are not the first one to hit this NULL deref.
> 
> > I think, tty_ldisc_reinit() should be called with ldisc_sem hold for
> > writing, which will protect any reader against line discipline changes.
> 
> Per https://lore.kernel.org/patchwork/patch/777220/
> 
> : Note that we noticed one path that called reinit without the ldisc lock
> : held for writing, we added that, but it didn't fix the problem.
> 
> And I guess that Ben meant the same reinit path which you patched:

This is too old for me to remember buit yes, there definitely was a bug
there...

> > @@ -1267,15 +1267,20 @@ static int tty_reopen(struct tty_struct *tty)
> > if (test_bit(TTY_EXCLUSIVE, >flags) && !capable(CAP_SYS_ADMIN))
> > return -EBUSY;
> >  
> > -   tty->count++;
> > +   retval = tty_ldisc_lock(tty, 5 * HZ);
> > +   if (retval)
> > +   return retval;
> >  
> > +   tty->count++;
> > if (tty->ldisc)
> > -   return 0;
> > +   goto out_unlock;
> >  
> > retval = tty_ldisc_reinit(tty, tty->termios.c_line);
> > if (retval)
> > tty->count--;
> >  
> > +out_unlock:
> > +   tty_ldisc_unlock(tty);
> > return retval;
> >  }
> 
>   -ss



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Benjamin Herrenschmidt
On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
> > Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
> > ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
> > do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
> > still correct for huge pages, inefficient maybe, but so is flushing
> > every other page because 'sparse' transparant-huge-pages.
> 
> It could do that. It requires a tlbie that matches the page size,
> so it means 3 sizes. I think possibly even that would be better
> than current code, but we could do better if we had a few specific
> fields in there.

More tlbies ? With the cost of the broadasts on the fabric ? I don't
think so.. or I'm not understanding your point...

Sadly our architecture requires a precise match between the page size
specified in the tlbie instruction and the entry in the TLB or it won't
be flushed.

Ben.



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 22:11 -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2018 at 9:54 PM Benjamin Herrenschmidt  
> wrote:
> > 
> > 
> > So we do need a different flush instruction for the page tables vs. the
> > normal TLB pages.
> 
> Right. ARM wants it too. x86 is odd in that a regular "invlpg" already
> invalidates all the internal tlb cache nodes.
> 
> So the "new world order" is exactly that patch that PeterZ sent you, that 
> adds a
> 
> +   unsigned intfreed_tables : 1;
> 

 .../...

> So instead, when you get to the actual "tlb_flush(tlb)", you do
> exactly that - flush the tlb. And the mmu_gather structure shows you
> how much you need to flush. If you see that "freed_tables" is set,
> then you know that you need to also do the special instruction to
> flush the inner level caches. The range continues to show the page
> range.

Yup. That looks like a generic version of the "need_flush_all" flag we
have, which is fine by us.

Just don't blame powerpc for all the historical crap :-)

Cheers,
Ben.

> 
>Linus



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-22 Thread Benjamin Herrenschmidt
On Wed, 2018-08-22 at 20:59 -0700, Linus Torvalds wrote:
> On Wed, Aug 22, 2018 at 8:45 PM Nicholas Piggin  wrote:
> > 
> > powerpc/radix has no such issue, it already does this tracking.
> 
> Yeah, I now realize that this was why you wanted to add that hacky
> thing to the generic code, so that you can add the tlb_flush_pgtable()
> call.
> 
> I thought it was because powerpc had some special flush instruction
> for it, and the regular tlb flush didn't do it. But no. It was because
> the regular code had lost the tlb flush _entirely_, because powerpc
> didn't want it.

Heh :-) Well, back on hash we didn't (we do now with Radix) but I
wouldn't blame us for the generic code being broken ... the RCU table
freeing was in arch/powerpc at the time :-) I don't think it was us
making it generic :)

> > We were discussing this a couple of months ago, I wasn't aware of ARM's
> > issue but I suggested x86 could go the same way as powerpc.
> 
> The problem is that x86 _used_ to do this all correctly long long ago.
> 
> And then we switched over to the "generic" table flushing (which
> harkens back to the powerpc code).

Yes, we wrote it the RCU stuff to solve the races with SW walking,
which is completely orthogonal with HW walking & TLB content. We didn't
do the move to generic code though ;-)

> Which actually turned out to be not generic at all, and did not flush
> the internal pages like x86 used to (back when x86 just used
> tlb_remove_page for everything).

Well, having RCU do the flushing is rather generic, it makes sense
whenever there's somebody doing a SW walk *and* you don't have IPIs to
synchronize your flushes (ie, anybody with HW TLB invalidation
broadcast basically, so ARM and us).

> So as a result, x86 had unintentionally lost the TLB flush we used to
> have, because tlb_remove_table() had lost the tlb flushing because of
> a powerpc quirk.

This is a somewhat odd way of putting the "blame" :-) But yeah ok...

> You then added it back as a hacky per-architecture hook (apparently
> having realized that you never did it at all), which didn't fix the
> unintentional lack of flushing on x86.
> 
> So now we're going to do it right.  No more "oh, powerpc didn't need
> to flush because the hash tables weren't in the tlb at all" thing in
> the generic code that then others need to work around.

So we do need a different flush instruction for the page tables vs. the
normal TLB pages. 

Cheers,
Ben.




Re: [PATCH v2] i2c: Remove '-Wno-deprecated-declarations' compiler warning

2018-08-20 Thread Benjamin Herrenschmidt
On Mon, 2018-08-20 at 10:06 +0200, Wolfram Sang wrote:
> 
> > Some people still use those old machines, I'll give a try at fixing the
> > driver.
> 
> I did this here a while ago [1] ("macintosh: therm_windtunnel: drop
> using attach adapter")...
> 
> > Paul: Do you still have one of these machines ?
> 
> ... but this was the main problem. I couldn't find anyone to test this.
> Even asked on PPC lists of various distributions.

Yeah this one is annoying because it's only used by one specific
machine model..

> Note that we did apply the same approach of patch [1] already in commit
> ac397c80de89 ("ALSA: ppc: keywest: drop using attach adapter"). This has
> been tested on HW. I also have a hack to make this driver probe on one
> of my ARM boards. Of course, no real HW present there, but shows that
> the new probing works at least there.
> 
> This is very interesting timing, too. Because I wanted to use the
> tentative version jump to 5.0 to finally get rid of this deprecated
> feature independently of the discussion here. Which would have meant
> resending the patch and arguing we have enough indications that the
> patch [1] is sane. And then I could finally remove this callback.
> 
> Thanks,
> 
>Wolfram
> 
> [1] http://patchwork.ozlabs.org/patch/461759/

Ok, worst case we can remove the driver and re-instate it back later if
somebody complains, but I'll try to chase a machine first.

Cheers,
Ben.




Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

2018-08-20 Thread Benjamin Herrenschmidt
On Mon, 2018-08-20 at 09:17 +0200, Lukas Wunner wrote:
> On Mon, Aug 20, 2018 at 12:10:59PM +1000, Benjamin Herrenschmidt wrote:
> > I chose to create a new mutex which we should be able to address other
> > similar races if we find them. The other solutions that I dismissed
> > were:
> > 
> >  - Using the device_lock. A explained previously, this is tricky, I
> > prefer not using this for anything other than locking against
> > concurrent add/remove. The main issue is that drivers will be sometimes
> > called in context where that's already held, so we can't take it inside
> > pci_enable_device() and I'd rather not add new constraints such as
> > "pci_enable_device() must be only called from probe() unless you also
> > take the device lock". It would be tricky to audit everybody...
> > 
> >  - Using a global mutex. We could move the bridge lock from AER to core
> > code for example, and use that. But it doesn't buy us much, and
> > slightly redecuces parallelism. It also makes it a little bit more
> > messy to walk up the bridge chain, we'd have to do a
> > pci_enable_device_unlocked or something, messy.
> 
> +1 from my side for adding a struct mutex to struct pci_dev to protect
> state changes.

Ok thanks. This is what my patch proposes. We can use it later to
protect more things if we wish to do so.

> The device_lock() primarily protects binding / unbinding of the device
> and pci_dev state may have to be changed while binding / unbinding.

Yup, precisely.

> A global lock invites deadlocks if multiple devices are added / removed
> concurrently where one is a parent of the other.  (Think hot-removal of
> multiple devices on a Thunderbolt daisy-chain.)

Yes.

> As said I'd also welcome folding PCI_DEV_DISCONNECTED into enum
> pci_channel_state, either as an additional state or by using
> pci_channel_io_perm_failure.

Ok. I have that in my tentative series but I think for robustness, I
should make the error_state field atomically updated in order to ensure
that no transition out of "disconnected" can happen while racing with
concurrent error_state updates at interrupt time (at least with EEH, it
can be updated from any read{b,w,l,q}).

I'll do a bit more work on the patches this week as time permits and
send a non-RFC series.

Cheers,
Ben. 

> Thanks,
> 
> Lukas



Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

2018-08-20 Thread Benjamin Herrenschmidt
On Mon, 2018-08-20 at 11:55 +0530, Hari Vyas wrote:
> On Mon, Aug 20, 2018 at 7:40 AM, Benjamin Herrenschmidt
>  wrote:
> > On Sat, 2018-08-18 at 21:24 -0500, Bjorn Helgaas wrote:
> > > On Sat, Aug 18, 2018 at 01:24:51PM +1000, Benjamin Herrenschmidt wrote:
> > > > On Fri, 2018-08-17 at 10:44 -0500, Bjorn Helgaas wrote:
> > > > > On Fri, Aug 17, 2018 at 02:48:57PM +1000, Benjamin Herrenschmidt 
> > > > > wrote:
> > > > > > This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
> > > > > 
> > > > > Just to be clear, if I understand correctly, this is a pure revert of
> > > > > 44bda4b7d26e and as such it reintroduces the problem solved by that
> > > > > commit.
> > > > > 
> > > > > If your solution turns out to be better, that's great, but it would be
> > > > > nice to avoid the bisection hole of reintroducing the problem, then
> > > > > fixing it again later.
> > > > 
> > > > There is no way to do that other than merging the revert and the fix
> > > > into one. That said, the race is sufficiently hard to hit that I
> > > > wouldn't worry too much about it.
> > > 
> > > OK, then at least mention that in the changelog.
> > 
> > Sure will do. This is just RFC at this stage :-)
> > 
> > As for the race with enable, what's your take on my approach ? The
> > basic premise is that we need some kind of mutex to make the updates to
> > enable_cnt and the actual config space changes atomic. While at it we
> > can fold pci_set_master vs. is_busmaster as well as those are racy too.
> > 
> > I chose to create a new mutex which we should be able to address other
> > similar races if we find them. The other solutions that I dismissed
> > were:
> > 
> >  - Using the device_lock. A explained previously, this is tricky, I
> > prefer not using this for anything other than locking against
> > concurrent add/remove. The main issue is that drivers will be sometimes
> > called in context where that's already held, so we can't take it inside
> > pci_enable_device() and I'd rather not add new constraints such as
> > "pci_enable_device() must be only called from probe() unless you also
> > take the device lock". It would be tricky to audit everybody...
> > 
> >  - Using a global mutex. We could move the bridge lock from AER to core
> > code for example, and use that. But it doesn't buy us much, and
> > slightly redecuces parallelism. It also makes it a little bit more
> > messy to walk up the bridge chain, we'd have to do a
> > pci_enable_device_unlocked or something, messy.
> > 
> > So are you ok with the approach ? Do you prefer one of the above
> > regardless ? Something else ?
> > 
> > Cheers,
> > Ben.
> > 
> > 
> 
> Some concern was raised about race situation so just to be more clear
> about race condition.

This is not what the above discussion is about.

The race with is is_added is due to the fact that the bit is set too
later as discussed previously and in my changelog.

The race I'm talking about here is the race related to multiple
subtrees trying to simultaneously enable a parent bridge.

> Situation is described in Bug 200283 - PCI: Data corruption happening
> due to a race condition.
> Issue was discovered by our broadcom QA team.
> Initially commit was to use one separate lock only for avoiding race
> condition but after review, approach was slightly changed to move
> is_added  to pci_dev private flags as it was completely
> internal/private variable of pci core driver.
> Powerpc legacy arch code was using is_added flag directly which looks
> bit strange so  ../../ type of inclusion of pci.h was required. I know
> it looks ugly but it is being used in some legacy code still.
> Anyway, as stated earlier too, problem should be just solved in better way.

The is_added race can be fixed with a 3 lines patch moving is_added up
to before device_attach() I believe. I yet have to find a scenario
where doing so would break something but it's possible that I missed a
case.

At this stage, I'm more intested however in us agreeing how to fix the
other race, the one with enabling. As I wrote above, I proposed an
approach based on a mutex in pci_dev, and this is what I would like
discussed.

This race is currently causing our large systems to randomly error out
at boot time when probing some PCIe devices. I'm putting a band-aid in
the powerpc code for now to pre-enable bridges at boot, but I'd rather
have the race fixed in the generic code.

Ben.




Re: [PATCH v2] i2c: Remove '-Wno-deprecated-declarations' compiler warning

2018-08-19 Thread Benjamin Herrenschmidt
On Sun, 2018-08-19 at 09:46 -0700, Linus Torvalds wrote:
> On Sun, Aug 19, 2018 at 6:51 AM Sedat Dilek  wrote:
> > 
> > This can be dropped with commit 771c035372a036f83353eef46dbb829780330234
> > ("deprecate the '__deprecated' attribute warnings entirely and for good")
> > now in upstream.
> 
> Could we please just remove the __deprecated use in i2c entirely?
> 
> As fat as I can tell, it's used for one thing, which is the
> "attach_adapter" function pointer member in 'struct i2c_driver'.
> 
> And I think there is _one_ remaining driver using it, namely the
> powermac "therm_windtunnel" driver.
> 
> Could somebody who knows the i2c probing just a bit convert that *one*
> remaining driver, then remove the "attach_adapter" entirely, and also
> remove this thing?
> 
> Alternatively, perhaps the driver should be removed? The last actual
> powermac change to that driver seems to be from August 2009.
> Everything since has been just cleanup unrelated to that driver code
> itself (ie spelling fixes, automated coccinelle scripts etc).
> 
> Added a few people either because they're still officially maintainers
> of that file, or because they touched the code ten years ago.

Some people still use those old machines, I'll give a try at fixing the
driver.

Paul: Do you still have one of these machines ?

Cheers
Ben.



Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

2018-08-19 Thread Benjamin Herrenschmidt
On Sat, 2018-08-18 at 21:24 -0500, Bjorn Helgaas wrote:
> On Sat, Aug 18, 2018 at 01:24:51PM +1000, Benjamin Herrenschmidt wrote:
> > On Fri, 2018-08-17 at 10:44 -0500, Bjorn Helgaas wrote:
> > > On Fri, Aug 17, 2018 at 02:48:57PM +1000, Benjamin Herrenschmidt wrote:
> > > > This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
> > > 
> > > Just to be clear, if I understand correctly, this is a pure revert of
> > > 44bda4b7d26e and as such it reintroduces the problem solved by that
> > > commit.
> > > 
> > > If your solution turns out to be better, that's great, but it would be
> > > nice to avoid the bisection hole of reintroducing the problem, then
> > > fixing it again later.
> > 
> > There is no way to do that other than merging the revert and the fix
> > into one. That said, the race is sufficiently hard to hit that I
> > wouldn't worry too much about it.
> 
> OK, then at least mention that in the changelog.

Sure will do. This is just RFC at this stage :-)

As for the race with enable, what's your take on my approach ? The
basic premise is that we need some kind of mutex to make the updates to
enable_cnt and the actual config space changes atomic. While at it we
can fold pci_set_master vs. is_busmaster as well as those are racy too.

I chose to create a new mutex which we should be able to address other
similar races if we find them. The other solutions that I dismissed
were:

 - Using the device_lock. A explained previously, this is tricky, I
prefer not using this for anything other than locking against
concurrent add/remove. The main issue is that drivers will be sometimes
called in context where that's already held, so we can't take it inside
pci_enable_device() and I'd rather not add new constraints such as
"pci_enable_device() must be only called from probe() unless you also
take the device lock". It would be tricky to audit everybody...

 - Using a global mutex. We could move the bridge lock from AER to core
code for example, and use that. But it doesn't buy us much, and
slightly redecuces parallelism. It also makes it a little bit more
messy to walk up the bridge chain, we'd have to do a
pci_enable_device_unlocked or something, messy.

So are you ok with the approach ? Do you prefer one of the above
regardless ? Something else ?

Cheers,
Ben.




Re: [RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

2018-08-17 Thread Benjamin Herrenschmidt
On Fri, 2018-08-17 at 20:15 +0200, Lukas Wunner wrote:
> 
> The two steps (enumeration and driver attachment) are protected by
> pci_lock_rescan_remove().  Initially, when a pci_dev is newly allocated
> with kzalloc(), is_added is 0.  The transition from 0 to 1 happens while
> under pci_lock_rescan_remove().  When that lock is released, is_added is
> always 1, barring a device_attach() error, in which case it will remain
> at 0.
> 
> AFAICS, there is no second mutex needed to ensure synchronization of
> is_added, the existing mutex should be sufficient and the only problem
> are RMW races when accessing adjacent flags. Those were correctly addressed
> by Hari's patch.  Benjamin seems to be alleging that concurrency issues
> exist beyond the RMW races, I fail to see them, sorry.

Im saying that fixing those issues using atomic bitops is a generally
unsafe practice even if it happens to work in a specific case.

In this one, I argue that the root problem was simply that is_added was
set in the wrong place. The "fix" from Hari touches 9 files, adds
horrible relative includes to an architecture and generally bloats
things while a single 3 liner would have been sufficient.

The current use of is_added is asymetric (it's cleared during
device_attach but set during detach) which is bogus and the entire race
stems from the fact that it is set after device_attach returns.

So setting it early is, imho, the right fix, is much simpler, and fixes
the odd imbalance we have to begin with.

Ben.




Re: [RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

2018-08-17 Thread Benjamin Herrenschmidt
On Fri, 2018-08-17 at 11:25 -0500, Bjorn Helgaas wrote:
> On Fri, Aug 17, 2018 at 02:48:58PM +1000, Benjamin Herrenschmidt wrote:
> > This re-fixes the bug reported by Hari Vyas 
> > after my revert of his commit but in a much simpler way.
> > 
> > The main issues is that is_added was being set after the driver
> > got bound and started, and thus setting it could race with other
> > changes to struct pci_dev.
> 
> The "bind driver, then set dev->added = 1" order seems to have been
> there since the beginning of dev->is_added:
> 
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=8a1bc9013a03
> 
> This patch seems reasonable, but I'm a little dubious about the
> existence of "is_added" in the first place.  As far as I can tell, the
> only other buses with something similar are the MEN Chameleon bus and
> the Intel Management Engine Interface.
> 
> The PCI uses of "is_added" don't seem *that* critical or unique to
> PCI, so I'm not 100% convinced we need it at all.  But I haven't
> looked into it enough to be able to propose an alternative.

This is a whole different conversation you are taking us into :-)

is_added is currently needed for a number of reasons, mostly relating
to partial hotplug, and historically comes from the fact that we
separated the PCI probing & tree construction from the registration
with the device-model. This of course comes from the fact that the
device model didn't actually exist yet when the PCI code was
created :-)

So let's keep things separate shall we ? I'd rather fix this correctly
now, and get rid of that pesky atomic priv_flags which I think is just
going to be a long term add to the mess rather than an improvement, and
separately we can discuss whether is_added is something that can go
away, but I suspect this will come in the form of either a deeper
rework of how we do PCI probing, or simply finding a struct device/kobj
field we can use as a hint that we've added the device already for
hotplug.

> > This fixes it by setting the flag first, which also has the
> > advantage of matching the fact that we are clearing it *after*
> > unbinding in the remove path, thus the flag is now symtetric
> > and always set while the driver code is running.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  drivers/pci/bus.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> > index 35b7fc87eac5..48ae63673aa8 100644
> > --- a/drivers/pci/bus.c
> > +++ b/drivers/pci/bus.c
> > @@ -321,16 +321,16 @@ void pci_bus_add_device(struct pci_dev *dev)
> > pci_proc_attach_device(dev);
> > pci_bridge_d3_update(dev);
> >  
> > +   dev->is_added = 1;
> > dev->match_driver = true;
> > retval = device_attach(>dev);
> > if (retval < 0 && retval != -EPROBE_DEFER) {
> > +   dev->is_added = 0;
> > pci_warn(dev, "device attach failed (%d)\n", retval);
> > pci_proc_detach_device(dev);
> > pci_remove_sysfs_dev_files(dev);
> > return;
> > }
> > -
> > -   dev->is_added = 1;
> >  }
> >  EXPORT_SYMBOL_GPL(pci_bus_add_device);
> >  
> > -- 
> > 2.17.1
> > 



Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

2018-08-17 Thread Benjamin Herrenschmidt
On Fri, 2018-08-17 at 10:44 -0500, Bjorn Helgaas wrote:
> On Fri, Aug 17, 2018 at 02:48:57PM +1000, Benjamin Herrenschmidt wrote:
> > This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
> 
> Just to be clear, if I understand correctly, this is a pure revert of
> 44bda4b7d26e and as such it reintroduces the problem solved by that
> commit.
> 
> If your solution turns out to be better, that's great, but it would be
> nice to avoid the bisection hole of reintroducing the problem, then
> fixing it again later.

There is no way to do that other than merging the revert and the fix
into one. That said, the race is sufficiently hard to hit that I
wouldn't worry too much about it.

> > The new pci state mutex provides a simpler way of addressing
> > this race and avoids the relative includes added to the powerpc
> > code.



Re: [PATCH] platform/x86: acer-wmi: Silence "unsupported" message a bit

2018-08-17 Thread Benjamin Herrenschmidt
On Fri, 2018-08-17 at 14:27 +0300, Andy Shevchenko wrote:
> On Fri, Aug 17, 2018 at 2:22 PM, Benjamin Herrenschmidt
>  wrote:
> > On Fri, 2018-08-17 at 14:09 +0300, Andy Shevchenko wrote:
> > > On Fri, Aug 17, 2018 at 2:07 PM, Andy Shevchenko
> > >  wrote:
> > > > On Thu, Aug 16, 2018 at 2:27 AM, Benjamin Herrenschmidt
> > > >  wrote:
> > > > > This driver prints that "Unsupported machine..." message on every 
> > > > > boot on
> > > > > ThinkPad X1 Carbon laptops (and I assume a number of other systems), 
> > > > > which
> > > > > causes graphical boots to "glitch" a bit and is rather annoying ...
> > > > > 
> > > > > Make it a pr_debug instead.
> > > > 
> > > > Thanks for the patch. Unfortunately it doesn't apply.
> > > > Please, check your mail clients and tools for formatting patch (it
> > > > produced at the end spaces instead of tabs)
> > > 
> > > No need to resend _this_ time. I have pushed to my review and testing 
> > > queue.
> > 
> > Thanks, odd. Not sure what's going on with the latest evolution, it's
> > the second time it's done this. I'll try to investigate. Small patch
> > like that I tend to just copy/paste in an email in "preformat" style
> > ;-)
> 
> Evolution is totally broken for years already (yes, I'm still using it
> due to Outlook Web Services support, but it's utter PITA).
> I could just easily come up with couple of dozens of bugs that are
> present there for more than 3 years.

I had reasonably good luck getting bugs fixed by reporting them via
Fedora actually. But yeah, something with the composer tends to break
regularly.

Cheers,
Ben.

> > 
> > Cheers,
> > Ben.
> > > 
> > > > > 
> > > > > Signed-off-by: Benjamin Herrenschmidt 
> > > > > ---
> > > > > 
> > > > > diff --git a/drivers/platform/x86/acer-wmi.c 
> > > > > b/drivers/platform/x86/acer-wmi.c
> > > > > index 8952173dd380..7f7192e8fdab 100644
> > > > > --- a/drivers/platform/x86/acer-wmi.c
> > > > > +++ b/drivers/platform/x86/acer-wmi.c
> > > > > @@ -2216,7 +2216,7 @@ static int __init acer_wmi_init(void)
> > > > > if (wmi_has_guid(AMW0_GUID1) &&
> > > > > !dmi_check_system(amw0_whitelist) &&
> > > > > quirks == _unknown) {
> > > > > -   pr_err("Unsupported machine has AMW0_GUID1, unable to 
> > > > > load\n");
> > > > > +   pr_debug("Unsupported machine has AMW0_GUID1, unable 
> > > > > to load\n");
> > > > > return -ENODEV;
> > > > > }
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > --
> > > > With Best Regards,
> > > > Andy Shevchenko
> > > 
> > > 
> > > 
> 
> 
> 



Re: [PATCH] platform/x86: acer-wmi: Silence "unsupported" message a bit

2018-08-17 Thread Benjamin Herrenschmidt
On Fri, 2018-08-17 at 14:09 +0300, Andy Shevchenko wrote:
> On Fri, Aug 17, 2018 at 2:07 PM, Andy Shevchenko
>  wrote:
> > On Thu, Aug 16, 2018 at 2:27 AM, Benjamin Herrenschmidt
> >  wrote:
> > > This driver prints that "Unsupported machine..." message on every boot on
> > > ThinkPad X1 Carbon laptops (and I assume a number of other systems), which
> > > causes graphical boots to "glitch" a bit and is rather annoying ...
> > > 
> > > Make it a pr_debug instead.
> > 
> > Thanks for the patch. Unfortunately it doesn't apply.
> > Please, check your mail clients and tools for formatting patch (it
> > produced at the end spaces instead of tabs)
> 
> No need to resend _this_ time. I have pushed to my review and testing queue.

Thanks, odd. Not sure what's going on with the latest evolution, it's
the second time it's done this. I'll try to investigate. Small patch
like that I tend to just copy/paste in an email in "preformat" style
;-)

Cheers,
Ben.
> 
> > > 
> > > Signed-off-by: Benjamin Herrenschmidt 
> > > ---
> > > 
> > > diff --git a/drivers/platform/x86/acer-wmi.c 
> > > b/drivers/platform/x86/acer-wmi.c
> > > index 8952173dd380..7f7192e8fdab 100644
> > > --- a/drivers/platform/x86/acer-wmi.c
> > > +++ b/drivers/platform/x86/acer-wmi.c
> > > @@ -2216,7 +2216,7 @@ static int __init acer_wmi_init(void)
> > > if (wmi_has_guid(AMW0_GUID1) &&
> > > !dmi_check_system(amw0_whitelist) &&
> > > quirks == _unknown) {
> > > -   pr_err("Unsupported machine has AMW0_GUID1, unable to 
> > > load\n");
> > > +   pr_debug("Unsupported machine has AMW0_GUID1, unable to 
> > > load\n");
> > > return -ENODEV;
> > > }
> > > 
> > > 
> > 
> > 
> > 
> > --
> > With Best Regards,
> > Andy Shevchenko
> 
> 
> 



Re: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

2018-08-17 Thread Benjamin Herrenschmidt
On Fri, 2018-08-17 at 15:40 +0530, Hari Vyas wrote:
> 
> > So I'm rather dubious of adding a whole new layer of "modify" callbacks
> > to config space accessors for that, especially since they won't do any
> > good on architectures with lockless config accesses such as ... x86
> > 
> 
> modify() is optional.  host/controller layers may override it.
> It was just a proposal to avoid race issues which happens in SMP environment 
> and
> solves other issues. I agree, nothing great anyhow can be done in
> lockless config

Right so let's not go down the path of creating low level accessors
providing a semantic that cannot actually be provided by some
architectures :-)

> > > 2) SW level(internal data structure):
> > > About is_added,is_busmaster: These all are bit fields so infact I too
> > > suggested to remove those bit fields and
> > > make separate variables in pci_dev structure.
> > > This will avoid all data-overwritten,concurrency issues but ofcourse
> > > at the level of space cost.
> > 
> > It's unnecessary to do blanket changes without first understanding the
> > details of what's going on. A lot of these things are never touched
> > outside of the overall single threaded environment of
> > discovery/assignment or under driver control, in which case it's
> > expectd that the driver doesn't call them in racy ways
> > 
> > That said, I'm happy to move some of them under my proposed
> > dev_state_lock.
> > 
> > For is_added specifically, the field is simply set at the wrong time as
> > you can see in my previous patch.
> > 
> Issue needs to be addressed and that is our goal.

Yes.

> Some times simple mistakes need lot of debugging which happened in my case and
> my suggestion is to just avoid. SMP related issues are popping up now
> so we just need to be careful.

Oh I agree, and I took me a while to re-debug independently some of the
issues on my side too :-) I should be clearer that I very much do
appreciate your debugging work and finding those problems !

I might disagree on the solution but your effort is very valuable and
I'm sure we can converge on a solution that works for everybody.

 .../...

> > Can you remind us in this thread which specific cases of RMW races of
> > config space you were trying to address ?
> > 
> 
> Same pci bridge master, memory bit setting concern only (which my
> colleague Srinath figured out after lot of effort some time back) where only 
> one bit in
> PCI_COMMAND  was getting set.
> (Bug 200793 - PCI: read-write config operation doesn't look like SMP safe)
> My approach is to handle with modify operations at lower level so bits
> are not over-written or lost.

Right so those are the same old races with pci_set_master() on the
bridge upward chain ?

If yes then this should be addressed by my patches but there are still
some debate about whether to add that new mutex or try to use an
existing one, so let's see what Bjorn has to say.

I think the set-master and enable/disable issues are intimately related
and should be solved together by some higher level locking.

This is a typical case where the "atomic" enable_cnt provided people
with a false sense of safety while the code in practice was still
completely racy.

> As stated earlier, issue should be just resolved in better way. No
> issue in going with majority

Hehe, ok. I think the is_added fix is simply to move it up since it
matches the rest of the code in that area, but let's see what Bjorn has
to say in that regard.

As for the enable/disable/set_master races, Ive wrote my arguments in
favor of a dedicated lock, let's see what others have to respond.

Cheers,
Ben.

> Regards,
> hari
> > Cheers,
> > Ben.
> > 
> > 



Re: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

2018-08-17 Thread Benjamin Herrenschmidt
On Fri, 2018-08-17 at 14:30 +0530, Hari Vyas wrote:
> So many mail threads for common issues going so just trying to
> summarize concern from my side.
> 1) HW level
> PCI Config data overwritten (ex: PCI_COMMAND bits etc) was happening
> at lower layer so from my
> perspective, it is best to handle concern at lower level with locking
> mechanism and that is what
> I proposed in my upcoming patch. Without that, I guess, we may end up
> in issues later with some weird scenario
> which may not be known today  and we will again be fine tuning stuff
> here and there.

Maybe... at this point I'm not trying to address that specific issue.
That being said, the PCI_COMMAND register should be under control of
the driver at runtime and thus shouldn't be exposed to races like that
except in the usual case of races in configuring bridges. Those races
definitely need to be handled at the higher level.

So I'm rather dubious of adding a whole new layer of "modify" callbacks
to config space accessors for that, especially since they won't do any
good on architectures with lockless config accesses such as ... x86 

> 2) SW level(internal data structure):
> About is_added,is_busmaster: These all are bit fields so infact I too
> suggested to remove those bit fields and
> make separate variables in pci_dev structure. 
> This will avoid all data-overwritten,concurrency issues but ofcourse
> at the level of space cost.

It's unnecessary to do blanket changes without first understanding the
details of what's going on. A lot of these things are never touched
outside of the overall single threaded environment of
discovery/assignment or under driver control, in which case it's
expectd that the driver doesn't call them in racy ways

That said, I'm happy to move some of them under my proposed
dev_state_lock.

For is_added specifically, the field is simply set at the wrong time as
you can see in my previous patch.

> Other possibility will be either to use atomic

Atomic is almost always wrong. It doesn't synchronize anything other
than the field, so either it's a big waste of it gives a false sense of
security for something that's almost always still racy.

I'd rather keep the indication that a bunch of those fields *are*
unprotected and rely on the outer layers being single threaded.

>  or locking mechanism
> for these. My point here is also same, better
> avoid fine tuning later stage.
> Moving is_added up and down doesn't look like good as we are just
> shifting issue up and down.

No, you are wrong. I also originally covered is_added with my new mutex
but in hindsight it's the wrong thing to do, and went back to the
correct fix at this point which is precisely to move it up.

is_added is essentially owned by the part of the PCI probe code that
runs single threaded before the device gets exposed to the device
model.

Pretty much all of the code, including all the corresponding fields
(struct resources etc...) in pci_dev are unprotected and rely on being
accessed in a single threaded manner. is_added is no exception.

It was simply a mistake to set it after device_attach().

Thus moving it back up fixes it *correctly* by making it follow the
same rules as all the related fields.

That said, if we want to start adding protection to all those other
fields, then this is a different matter alltogether. But at this point,
this is the best fix for the problem at hand.

> Please check and decide accordingly. I will hold my to-be-submitted
> modify() patch about handling hw level
> over-written case with locking around read-write operation.

Can you remind us in this thread which specific cases of RMW races of
config space you were trying to address ?

Cheers,
Ben.




Re: [RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

2018-08-17 Thread Benjamin Herrenschmidt
On Fri, 2018-08-17 at 10:09 +0200, Marta Rybczynska wrote:
> 
> - On 17 Aug, 2018, at 06:49, Benjamin Herrenschmidt 
> b...@kernel.crashing.org wrote:
> 
> > This protects enable/disable operations using the state mutex to
> > avoid races with, for example, concurrent enables on a bridge.
> > 
> > The bus hierarchy is walked first before taking the lock to
> > avoid lock nesting (though it would be ok if we ensured that
> > we always nest them bottom-up, it is better to just avoid the
> > issue alltogether, especially as we might find cases where
> > we want to take it top-down later).
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> 
> 
> > 
> > static void pci_enable_bridge(struct pci_dev *dev)
> > {
> > struct pci_dev *bridge;
> > -   int retval;
> > +   int retval, enabled;
> > 
> > bridge = pci_upstream_bridge(dev);
> > if (bridge)
> > pci_enable_bridge(bridge);
> > 
> > -   if (pci_is_enabled(dev)) {
> > -   if (!dev->is_busmaster)
> > -   pci_set_master(dev);
> > +   /* Already enabled ? */
> > +   pci_dev_state_lock(dev);
> > +   enabled = pci_is_enabled(dev);
> > +   if (enabled && !dev->is_busmaster)
> > +   pci_set_master(dev);
> > +   pci_dev_state_unlock(dev);
> > +   if (enabled)
> > return;
> > -   }
> > 
> 
> This looks complicated too me especially with the double locking. What do you
> think about a function doing that check that used an unlocked version of
> pcie_set_master?
> 
> Like:
> 
> dev_state_lock(dev);
> enabled = pci_is_enabled(dev);
> if (enabled &&  !dev->is_busmaster)
> pci_set_master_unlocked();
> pci_dev_state_unlock(dev);
> 
> BTW If I remember correctly the code today can set bus mastering multiple
> times without checking if already done.

I don't mind but I tend to dislike all those _unlocked() suffixes, I
suppose my generation is typing adverse enough that we call these
__something instead :)

As for setting multiple times, yes pci_set_master() doesn't check but
look at the "-" hunks of my patch, I'm not changing the existing test
here. Not that it matters much, it's an optimization.

In fact my original version just completely removed the whole lot
and just unconditionally did pci_enable_device() + pci_set_master(),
just ignoring the existing state :-)

But I decided to keep the patch functionally equivalent so I added it
back.

Cheers,
Ben.




[RFC PATCH 4/6] pci: Add a mutex to pci_dev to protect device state

2018-08-16 Thread Benjamin Herrenschmidt
This adds a pci_dev mutex which will be used to fix a number
of races related to the various state bits maintained in
that structure.

We call it "state_lock" with similarly named accessors to avoid
confusion with the pci_dev_lock() which uses the device_lock().

This is a low level mutex meant to protect the mapping between
the state fields and the hardware state, for example enabling
disabling, setting/clearing bus master etc...

These operations can happen while the device_lock() is already
held (but don't have to) so a separate mutex is preferable.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/pci/probe.c |  1 +
 include/linux/pci.h | 17 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 440445ac7dfa..3ce287ab6150 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2155,6 +2155,7 @@ struct pci_dev *pci_alloc_dev(struct pci_bus *bus)
INIT_LIST_HEAD(>bus_list);
dev->dev.type = _dev_type;
dev->bus = pci_bus_get(bus);
+   mutex_init(>state_lock);
 
return dev;
 }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index f58bda204f09..0d4fc22df190 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -443,8 +444,24 @@ struct pci_dev {
phys_addr_t rom;/* Physical address if not from BAR */
size_t  romlen; /* Length if not from BAR */
char*driver_override; /* Driver name to force a match */
+
+   unsigned long   priv_flags; /* Private flags for the PCI driver */
+
+   struct mutexstate_lock; /* Protect local state bits */
+
+   /* --- Fields below this line are protected by the state_lock mutex */
 };
 
+static inline void pci_dev_state_lock(struct pci_dev *dev)
+{
+   mutex_lock(>state_lock);
+}
+
+static inline void pci_dev_state_unlock(struct pci_dev *dev)
+{
+   mutex_unlock(>state_lock);
+}
+
 static inline struct pci_dev *pci_physfn(struct pci_dev *dev)
 {
 #ifdef CONFIG_PCI_IOV
-- 
2.17.1



[RFC PATCH 6/6] pci: Protect is_busmaster using the state lock

2018-08-16 Thread Benjamin Herrenschmidt
This wraps pci_set_master() and pci_clear_master() with the pci_dev
state lock. The clearing of is_busmaster in pci_disable_device()
is already covered.

This also adds a comment explaining why is_busmaster must not be
checked in pci_set_master() due to how the power management code
uses it.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/pci/pci.c   | 9 +
 include/linux/pci.h | 2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 68152de2b5a0..13d988d5b2a3 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4052,11 +4052,18 @@ void __weak pcibios_set_master(struct pci_dev *dev)
  *
  * Enables bus-mastering on the device and calls pcibios_set_master()
  * to do the needed arch specific settings.
+ *
+ * Note: This must not check dev->is_busmaster because the power management
+ *   code will call this in order to restore the config space to the
+ *   state of is_busmaster, thus is_busmaster might be set but the
+ *   config space bit cleared.
  */
 void pci_set_master(struct pci_dev *dev)
 {
+   pci_dev_state_lock(dev);
__pci_set_master(dev, true);
pcibios_set_master(dev);
+   pci_dev_state_unlock(dev);
 }
 EXPORT_SYMBOL(pci_set_master);
 
@@ -4066,7 +4073,9 @@ EXPORT_SYMBOL(pci_set_master);
  */
 void pci_clear_master(struct pci_dev *dev)
 {
+   pci_dev_state_lock(dev);
__pci_set_master(dev, false);
+   pci_dev_state_unlock(dev);
 }
 EXPORT_SYMBOL(pci_clear_master);
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0d4fc22df190..a5bac5b21454 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -375,7 +375,6 @@ struct pci_dev {
unsigned intmultifunction:1;/* Multi-function device */
 
unsigned intis_added:1;
-   unsigned intis_busmaster:1; /* Is busmaster */
unsigned intno_msi:1;   /* May not use MSI */
unsigned intno_64bit_msi:1; /* May only use 32-bit MSIs */
unsigned intblock_cfg_access:1; /* Config space access blocked 
*/
@@ -450,6 +449,7 @@ struct pci_dev {
struct mutexstate_lock; /* Protect local state bits */
 
/* --- Fields below this line are protected by the state_lock mutex */
+   unsigned intis_busmaster:1; /* Is busmaster */
 };
 
 static inline void pci_dev_state_lock(struct pci_dev *dev)
-- 
2.17.1



[RFC PATCH 3/6] pci: Remove priv_flags and use dev->error_state for "disconnected" status

2018-08-16 Thread Benjamin Herrenschmidt
This already represents whether a device is accessible or not,
creating a new flag isn't particularly helpful.

dev->error_state being an int, assigning it doesn't require
an atomic operation per-se. The existing atomic bitop only
protects the field, not anything else anyway.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/pci/access.c | 12 ++--
 drivers/pci/msi.c|  4 ++--
 drivers/pci/pci.c|  2 +-
 drivers/pci/pci.h| 11 ++-
 include/linux/pci.h  |  2 --
 5 files changed, 11 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index a3ad2fe185b9..ad97a3544636 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -535,7 +535,7 @@ EXPORT_SYMBOL(pcie_capability_clear_and_set_dword);
 
 int pci_read_config_byte(const struct pci_dev *dev, int where, u8 *val)
 {
-   if (pci_dev_is_disconnected(dev)) {
+   if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -545,7 +545,7 @@ EXPORT_SYMBOL(pci_read_config_byte);
 
 int pci_read_config_word(const struct pci_dev *dev, int where, u16 *val)
 {
-   if (pci_dev_is_disconnected(dev)) {
+   if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -556,7 +556,7 @@ EXPORT_SYMBOL(pci_read_config_word);
 int pci_read_config_dword(const struct pci_dev *dev, int where,
u32 *val)
 {
-   if (pci_dev_is_disconnected(dev)) {
+   if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -566,7 +566,7 @@ EXPORT_SYMBOL(pci_read_config_dword);
 
 int pci_write_config_byte(const struct pci_dev *dev, int where, u8 val)
 {
-   if (pci_dev_is_disconnected(dev))
+   if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_byte(dev->bus, dev->devfn, where, val);
 }
@@ -574,7 +574,7 @@ EXPORT_SYMBOL(pci_write_config_byte);
 
 int pci_write_config_word(const struct pci_dev *dev, int where, u16 val)
 {
-   if (pci_dev_is_disconnected(dev))
+   if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_word(dev->bus, dev->devfn, where, val);
 }
@@ -583,7 +583,7 @@ EXPORT_SYMBOL(pci_write_config_word);
 int pci_write_config_dword(const struct pci_dev *dev, int where,
 u32 val)
 {
-   if (pci_dev_is_disconnected(dev))
+   if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
 }
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index f2ef896464b3..2f872db40042 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -298,7 +298,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct 
msi_msg *msg)
 {
struct pci_dev *dev = msi_desc_to_pci_dev(entry);
 
-   if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) {
+   if (dev->current_state != PCI_D0 || pci_channel_offline(dev)) {
/* Don't touch the hardware now */
} else if (entry->msi_attrib.is_msix) {
void __iomem *base = pci_msix_desc_addr(entry);
@@ -975,7 +975,7 @@ static void pci_msix_shutdown(struct pci_dev *dev)
if (!pci_msi_enable || !dev || !dev->msix_enabled)
return;
 
-   if (pci_dev_is_disconnected(dev)) {
+   if (pci_channel_offline(dev)) {
dev->msix_enabled = 0;
return;
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 29ff9619b5fa..62591c153999 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5713,7 +5713,7 @@ bool pci_device_is_present(struct pci_dev *pdev)
 {
u32 v;
 
-   if (pci_dev_is_disconnected(pdev))
+   if (pci_channel_offline(pdev))
return false;
return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, , 0);
 }
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 473aa10a5dbf..f7704333e6a1 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -293,18 +293,11 @@ struct pci_sriov {
booldrivers_autoprobe; /* Auto probing of VFs by driver */
 };
 
-/* pci_dev priv_flags */
-#define PCI_DEV_DISCONNECTED 0
-
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 {
-   set_bit(PCI_DEV_DISCONNECTED, >priv_flags);
-   return 0;
-}
+   dev->error_state = pci_channel_io_perm_failure;
 
-static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
-{
-   return test_bit(PCI_DEV_DISCONNECTED, >priv_flags);
+   return 0;
 }
 
 #ifdef CONFIG_PCIEAER
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 9799109c5e25..f58bda204f09 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -4

[RFC PATCH v2 3/6] pci: Remove priv_flags and use dev->error_state for "disconnected" status

2018-08-16 Thread Benjamin Herrenschmidt
This already represents whether a device is accessible or not,
creating a new flag isn't particularly helpful.

dev->error_state being an int, assigning it doesn't require
an atomic operation per-se. The existing atomic bitop only
protects the field, not anything else anyway.

Signed-off-by: Benjamin Herrenschmidt 
---

v2. Initialize the io channel state before we try to read the header

 drivers/pci/access.c | 12 ++--
 drivers/pci/msi.c|  4 ++--
 drivers/pci/pci.c|  2 +-
 drivers/pci/pci.h| 11 ++-
 drivers/pci/probe.c  |  4 +++-
 include/linux/pci.h  |  2 --
 6 files changed, 14 insertions(+), 21 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index a3ad2fe185b9..ad97a3544636 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -535,7 +535,7 @@ EXPORT_SYMBOL(pcie_capability_clear_and_set_dword);
 
 int pci_read_config_byte(const struct pci_dev *dev, int where, u8 *val)
 {
-   if (pci_dev_is_disconnected(dev)) {
+   if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -545,7 +545,7 @@ EXPORT_SYMBOL(pci_read_config_byte);
 
 int pci_read_config_word(const struct pci_dev *dev, int where, u16 *val)
 {
-   if (pci_dev_is_disconnected(dev)) {
+   if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -556,7 +556,7 @@ EXPORT_SYMBOL(pci_read_config_word);
 int pci_read_config_dword(const struct pci_dev *dev, int where,
u32 *val)
 {
-   if (pci_dev_is_disconnected(dev)) {
+   if (pci_channel_offline(dev)) {
*val = ~0;
return PCIBIOS_DEVICE_NOT_FOUND;
}
@@ -566,7 +566,7 @@ EXPORT_SYMBOL(pci_read_config_dword);
 
 int pci_write_config_byte(const struct pci_dev *dev, int where, u8 val)
 {
-   if (pci_dev_is_disconnected(dev))
+   if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_byte(dev->bus, dev->devfn, where, val);
 }
@@ -574,7 +574,7 @@ EXPORT_SYMBOL(pci_write_config_byte);
 
 int pci_write_config_word(const struct pci_dev *dev, int where, u16 val)
 {
-   if (pci_dev_is_disconnected(dev))
+   if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_word(dev->bus, dev->devfn, where, val);
 }
@@ -583,7 +583,7 @@ EXPORT_SYMBOL(pci_write_config_word);
 int pci_write_config_dword(const struct pci_dev *dev, int where,
 u32 val)
 {
-   if (pci_dev_is_disconnected(dev))
+   if (pci_channel_offline(dev))
return PCIBIOS_DEVICE_NOT_FOUND;
return pci_bus_write_config_dword(dev->bus, dev->devfn, where, val);
 }
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index f2ef896464b3..2f872db40042 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -298,7 +298,7 @@ void __pci_write_msi_msg(struct msi_desc *entry, struct 
msi_msg *msg)
 {
struct pci_dev *dev = msi_desc_to_pci_dev(entry);
 
-   if (dev->current_state != PCI_D0 || pci_dev_is_disconnected(dev)) {
+   if (dev->current_state != PCI_D0 || pci_channel_offline(dev)) {
/* Don't touch the hardware now */
} else if (entry->msi_attrib.is_msix) {
void __iomem *base = pci_msix_desc_addr(entry);
@@ -975,7 +975,7 @@ static void pci_msix_shutdown(struct pci_dev *dev)
if (!pci_msi_enable || !dev || !dev->msix_enabled)
return;
 
-   if (pci_dev_is_disconnected(dev)) {
+   if (pci_channel_offline(dev)) {
dev->msix_enabled = 0;
return;
}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 29ff9619b5fa..62591c153999 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5713,7 +5713,7 @@ bool pci_device_is_present(struct pci_dev *pdev)
 {
u32 v;
 
-   if (pci_dev_is_disconnected(pdev))
+   if (pci_channel_offline(pdev))
return false;
return pci_bus_read_dev_vendor_id(pdev->bus, pdev->devfn, , 0);
 }
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 473aa10a5dbf..f7704333e6a1 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -293,18 +293,11 @@ struct pci_sriov {
booldrivers_autoprobe; /* Auto probing of VFs by driver */
 };
 
-/* pci_dev priv_flags */
-#define PCI_DEV_DISCONNECTED 0
-
 static inline int pci_dev_set_disconnected(struct pci_dev *dev, void *unused)
 {
-   set_bit(PCI_DEV_DISCONNECTED, >priv_flags);
-   return 0;
-}
+   dev->error_state = pci_channel_io_perm_failure;
 
-static inline bool pci_dev_is_disconnected(const struct pci_dev *dev)
-{
-   return test_bit(PCI_DEV_DISCONNECTED, >priv_flags);
+   return 0;
 }
 
 #ifdef CONFIG_PCIEAER
diff --git a/drivers/pci/probe.c b/drivers/

[RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

2018-08-16 Thread Benjamin Herrenschmidt
This re-fixes the bug reported by Hari Vyas 
after my revert of his commit but in a much simpler way.

The main issues is that is_added was being set after the driver
got bound and started, and thus setting it could race with other
changes to struct pci_dev.

This fixes it by setting the flag first, which also has the
advantage of matching the fact that we are clearing it *after*
unbinding in the remove path, thus the flag is now symtetric
and always set while the driver code is running.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/pci/bus.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 35b7fc87eac5..48ae63673aa8 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -321,16 +321,16 @@ void pci_bus_add_device(struct pci_dev *dev)
pci_proc_attach_device(dev);
pci_bridge_d3_update(dev);
 
+   dev->is_added = 1;
dev->match_driver = true;
retval = device_attach(>dev);
if (retval < 0 && retval != -EPROBE_DEFER) {
+   dev->is_added = 0;
pci_warn(dev, "device attach failed (%d)\n", retval);
pci_proc_detach_device(dev);
pci_remove_sysfs_dev_files(dev);
return;
}
-
-   dev->is_added = 1;
 }
 EXPORT_SYMBOL_GPL(pci_bus_add_device);
 
-- 
2.17.1



[RFC PATCH 5/6] pci: Protect the enable/disable state of pci_dev using the state mutex

2018-08-16 Thread Benjamin Herrenschmidt
This protects enable/disable operations using the state mutex to
avoid races with, for example, concurrent enables on a bridge.

The bus hierarchy is walked first before taking the lock to
avoid lock nesting (though it would be ok if we ensured that
we always nest them bottom-up, it is better to just avoid the
issue alltogether, especially as we might find cases where
we want to take it top-down later).

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/pci/pci.c | 51 ---
 1 file changed, 35 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 62591c153999..68152de2b5a0 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1540,26 +1540,33 @@ static int do_pci_enable_device(struct pci_dev *dev, 
int bars)
  */
 int pci_reenable_device(struct pci_dev *dev)
 {
+   int rc = 0;
+
+   pci_dev_state_lock(dev);
if (pci_is_enabled(dev))
-   return do_pci_enable_device(dev, (1 << PCI_NUM_RESOURCES) - 1);
-   return 0;
+   rc = do_pci_enable_device(dev, (1 << PCI_NUM_RESOURCES) - 1);
+   pci_dev_state_unlock(dev);
+   return rc;
 }
 EXPORT_SYMBOL(pci_reenable_device);
 
 static void pci_enable_bridge(struct pci_dev *dev)
 {
struct pci_dev *bridge;
-   int retval;
+   int retval, enabled;
 
bridge = pci_upstream_bridge(dev);
if (bridge)
pci_enable_bridge(bridge);
 
-   if (pci_is_enabled(dev)) {
-   if (!dev->is_busmaster)
-   pci_set_master(dev);
+   /* Already enabled ? */
+   pci_dev_state_lock(dev);
+   enabled = pci_is_enabled(dev);
+   if (enabled && !dev->is_busmaster)
+   pci_set_master(dev);
+   pci_dev_state_unlock(dev);
+   if (enabled)
return;
-   }
 
retval = pci_enable_device(dev);
if (retval)
@@ -1571,9 +1578,16 @@ static void pci_enable_bridge(struct pci_dev *dev)
 static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
 {
struct pci_dev *bridge;
-   int err;
+   int err = 0;
int i, bars = 0;
 
+   /* Handle upstream bridges first to avoid locking issues */
+   bridge = pci_upstream_bridge(dev);
+   if (bridge)
+   pci_enable_bridge(bridge);
+
+   pci_dev_state_lock(dev);
+
/*
 * Power state could be unknown at this point, either due to a fresh
 * boot or a device removal call.  So get the current power state
@@ -1586,12 +1600,9 @@ static int pci_enable_device_flags(struct pci_dev *dev, 
unsigned long flags)
dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
}
 
+   /* Already enabled ? */
if (atomic_inc_return(>enable_cnt) > 1)
-   return 0;   /* already enabled */
-
-   bridge = pci_upstream_bridge(dev);
-   if (bridge)
-   pci_enable_bridge(bridge);
+   goto bail;
 
/* only skip sriov related */
for (i = 0; i <= PCI_ROM_RESOURCE; i++)
@@ -1604,6 +1615,8 @@ static int pci_enable_device_flags(struct pci_dev *dev, 
unsigned long flags)
err = do_pci_enable_device(dev, bars);
if (err < 0)
atomic_dec(>enable_cnt);
+ bail:
+   pci_dev_state_unlock(dev);
return err;
 }
 
@@ -1820,12 +1833,16 @@ static void do_pci_disable_device(struct pci_dev *dev)
  * @dev: PCI device to disable
  *
  * NOTE: This function is a backend of PCI power management routines and is
- * not supposed to be called drivers.
+ * not supposed to be called drivers. It will keep enable_cnt and is_busmaster
+ * unmodified so that the resume code knows how to restore the corresponding
+ * command register bits.
  */
 void pci_disable_enabled_device(struct pci_dev *dev)
 {
+   pci_dev_state_lock(dev);
if (pci_is_enabled(dev))
do_pci_disable_device(dev);
+   pci_dev_state_unlock(dev);
 }
 
 /**
@@ -1849,12 +1866,14 @@ void pci_disable_device(struct pci_dev *dev)
dev_WARN_ONCE(>dev, atomic_read(>enable_cnt) <= 0,
  "disabling already-disabled device");
 
+   pci_dev_state_lock(dev);
if (atomic_dec_return(>enable_cnt) != 0)
-   return;
+   goto bail;
 
do_pci_disable_device(dev);
-
dev->is_busmaster = 0;
+ bail:
+   pci_dev_state_unlock(dev);
 }
 EXPORT_SYMBOL(pci_disable_device);
 
-- 
2.17.1



Re: [RFC PATCH 0/6] pci: Rework is_added race fix and address bridge enable races

2018-08-16 Thread Benjamin Herrenschmidt
On Fri, 2018-08-17 at 14:48 +1000, Benjamin Herrenschmidt wrote:
> The second part aims at fixing the enable/disable/set_master races,
> and does so by providing a framework for future device state locking
> issues.
> 
> It introduces a pci_dev->state_mutex which is used at a lower level
> than the device_lock (the device lock isn't suitable, as explained
> in the cset comments) and uses it to protect enablement and set_master.

As discussed in the series, I'm not using the device_lock because I
don't like it creeping out of drivers/base too much unless you
explicitly try to lock against concurrent add/remove.

That being said, if we decided we prefer using it to solve the
enable/disable race, then we have to be careful of a few things:

 - Driver callbacks hold it, so we can't take it from within
pci_enable_device(), pci_set_master() etc... themselves. We'll have to
assume the caller has it

 - The above means auditing callers of these various APIs that might be
calling them from outside of a driver callback and add the necessary
lock

 - We need to take great care about the possibility of the parent
device(s) lock being held. It can happen for USB for example. Now we
don't have USB->PCI adapters that I know of that uses the PCI stack in
Linux but generally assuming your parent lock isn't held is risky. So I
would be a bit weary of walking up the bridge chain and taking it
unconditionally.

Alternatively, we could hijack an existing global lock, for example
mvoe the bridge_mutex outside of ACPI and use it for the upwards bridge
walk, and ignore the other possible races with enable/disable.

Cheers,
Ben.





Re: [RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

2018-08-16 Thread Benjamin Herrenschmidt
s/device_add/device_attach .. ugh.

On Fri, 2018-08-17 at 14:48 +1000, Benjamin Herrenschmidt wrote:
> This re-fixes the bug reported by Hari Vyas 
> after my revert of his commit but in a much simpler way.
> 
> The main issues is that is_added was being set after the driver
> got bound and started, and thus setting it could race with other
> changes to struct pci_dev.
> 
> This fixes it by setting the flag first, which also has the
> advantage of matching the fact that we are clearing it *after*
> unbinding in the remove path, thus the flag is now symtetric
> and always set while the driver code is running.
> 
> Signed-off-by: Benjamin Herrenschmidt 
> ---
>  drivers/pci/bus.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 35b7fc87eac5..48ae63673aa8 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -321,16 +321,16 @@ void pci_bus_add_device(struct pci_dev *dev)
>   pci_proc_attach_device(dev);
>   pci_bridge_d3_update(dev);
>  
> + dev->is_added = 1;
>   dev->match_driver = true;
>   retval = device_attach(>dev);
>   if (retval < 0 && retval != -EPROBE_DEFER) {
> + dev->is_added = 0;
>   pci_warn(dev, "device attach failed (%d)\n", retval);
>   pci_proc_detach_device(dev);
>   pci_remove_sysfs_dev_files(dev);
>   return;
>   }
> -
> - dev->is_added = 1;
>  }
>  EXPORT_SYMBOL_GPL(pci_bus_add_device);
>  



Re: [RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

2018-08-16 Thread Benjamin Herrenschmidt
On Fri, 2018-08-17 at 14:48 +1000, Benjamin Herrenschmidt wrote:
> This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.
> 
> The new pci state mutex provides a simpler way of addressing
> this race and avoids the relative includes added to the powerpc
> code.

Ignore the cset comment, my "fix" no longer relies on a state mutex,
I'll re-post with a better comment after discussions.

The actual fix is in patch 2:

[RFC PATCH 2/6] pci: Set pci_dev->is_added before calling device_add

(and yes that was supposed be device_attach ... ugh, not enough
caffeine today).

> Signed-off-by: Benjamin Herrenschmidt 
> ---
>  arch/powerpc/kernel/pci-common.c  |  4 +---
>  arch/powerpc/platforms/powernv/pci-ioda.c |  3 +--
>  arch/powerpc/platforms/pseries/setup.c|  3 +--
>  drivers/pci/bus.c |  6 +++---
>  drivers/pci/hotplug/acpiphp_glue.c|  2 +-
>  drivers/pci/pci.h | 11 ---
>  drivers/pci/probe.c   |  4 ++--
>  drivers/pci/remove.c  |  5 ++---
>  include/linux/pci.h   |  1 +
>  9 files changed, 12 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/pci-common.c 
> b/arch/powerpc/kernel/pci-common.c
> index 471aac313b89..fe9733aa 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -42,8 +42,6 @@
>  #include 
>  #include 
>  
> -#include "../../../drivers/pci/pci.h"
> -
>  /* hose_spinlock protects accesses to the the phb_bitmap. */
>  static DEFINE_SPINLOCK(hose_spinlock);
>  LIST_HEAD(hose_list);
> @@ -1016,7 +1014,7 @@ void pcibios_setup_bus_devices(struct pci_bus *bus)
>   /* Cardbus can call us to add new devices to a bus, so ignore
>* those who are already fully discovered
>*/
> - if (pci_dev_is_added(dev))
> + if (dev->is_added)
>   continue;
>  
>   pcibios_setup_device(dev);
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 70b2e1e0f23c..5bd0eb6681bc 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -46,7 +46,6 @@
>  
>  #include "powernv.h"
>  #include "pci.h"
> -#include "../../../../drivers/pci/pci.h"
>  
>  #define PNV_IODA1_M64_NUM16  /* Number of M64 BARs   */
>  #define PNV_IODA1_M64_SEGS   8   /* Segments per M64 BAR */
> @@ -3139,7 +3138,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
> pci_dev *pdev)
>   struct pci_dn *pdn;
>   int mul, total_vfs;
>  
> - if (!pdev->is_physfn || pci_dev_is_added(pdev))
> + if (!pdev->is_physfn || pdev->is_added)
>   return;
>  
>   pdn = pci_get_pdn(pdev);
> diff --git a/arch/powerpc/platforms/pseries/setup.c 
> b/arch/powerpc/platforms/pseries/setup.c
> index 8a4868a3964b..139f0af6c3d9 100644
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -71,7 +71,6 @@
>  #include 
>  
>  #include "pseries.h"
> -#include "../../../../drivers/pci/pci.h"
>  
>  int CMO_PrPSP = -1;
>  int CMO_SecPSP = -1;
> @@ -665,7 +664,7 @@ static void pseries_pci_fixup_iov_resources(struct 
> pci_dev *pdev)
>   const int *indexes;
>   struct device_node *dn = pci_device_to_OF_node(pdev);
>  
> - if (!pdev->is_physfn || pci_dev_is_added(pdev))
> + if (!pdev->is_physfn || pdev->is_added)
>   return;
>   /*Firmware must support open sriov otherwise dont configure*/
>   indexes = of_get_property(dn, "ibm,open-sriov-vf-bar-info", NULL);
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 5cb40b2518f9..35b7fc87eac5 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -330,7 +330,7 @@ void pci_bus_add_device(struct pci_dev *dev)
>   return;
>   }
>  
> - pci_dev_assign_added(dev, true);
> + dev->is_added = 1;
>  }
>  EXPORT_SYMBOL_GPL(pci_bus_add_device);
>  
> @@ -347,14 +347,14 @@ void pci_bus_add_devices(const struct pci_bus *bus)
>  
>   list_for_each_entry(dev, >devices, bus_list) {
>   /* Skip already-added devices */
> - if (pci_dev_is_added(dev))
> + if (dev->is_added)
>   continue;
>   pci_bus_add_device(dev);
>   }
>  
>   list_for_each_entry(dev, >devices, bus_list) {
>   /* Skip if device attach failed */
> - i

[RFC PATCH 0/6] pci: Rework is_added race fix and address bridge enable races

2018-08-16 Thread Benjamin Herrenschmidt
This is really two series but since they conflict a bit separately
here they are in one:

First we undo the mess of those atomic priv_flags. The atomicity
doesn't provide any security since there's no locking against
the other state pertaining to those flags, it only protects the
flags themselves.

The is_added mess is fixed much more simply by moving the assignment
of the flag to before we start the driver. This is in line with the
rest of the PCI code: until bound to the device model, we are essentially
assuming a single threaded environment. is_added is a flag that is
logically owned by that part of the PCI code, and thus should be set
and cleared within that "safer" environment.

This removes the horrid relative includes in the powerpc code as well.

The second part aims at fixing the enable/disable/set_master races,
and does so by providing a framework for future device state locking
issues.

It introduces a pci_dev->state_mutex which is used at a lower level
than the device_lock (the device lock isn't suitable, as explained
in the cset comments) and uses it to protect enablement and set_master.




[RFC PATCH 1/6] Revert "PCI: Fix is_added/is_busmaster race condition"

2018-08-16 Thread Benjamin Herrenschmidt
This reverts commit 44bda4b7d26e9fffed6d7152d98a2e9edaeb2a76.

The new pci state mutex provides a simpler way of addressing
this race and avoids the relative includes added to the powerpc
code.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/pci-common.c  |  4 +---
 arch/powerpc/platforms/powernv/pci-ioda.c |  3 +--
 arch/powerpc/platforms/pseries/setup.c|  3 +--
 drivers/pci/bus.c |  6 +++---
 drivers/pci/hotplug/acpiphp_glue.c|  2 +-
 drivers/pci/pci.h | 11 ---
 drivers/pci/probe.c   |  4 ++--
 drivers/pci/remove.c  |  5 ++---
 include/linux/pci.h   |  1 +
 9 files changed, 12 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 471aac313b89..fe9733aa 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -42,8 +42,6 @@
 #include 
 #include 
 
-#include "../../../drivers/pci/pci.h"
-
 /* hose_spinlock protects accesses to the the phb_bitmap. */
 static DEFINE_SPINLOCK(hose_spinlock);
 LIST_HEAD(hose_list);
@@ -1016,7 +1014,7 @@ void pcibios_setup_bus_devices(struct pci_bus *bus)
/* Cardbus can call us to add new devices to a bus, so ignore
 * those who are already fully discovered
 */
-   if (pci_dev_is_added(dev))
+   if (dev->is_added)
continue;
 
pcibios_setup_device(dev);
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index 70b2e1e0f23c..5bd0eb6681bc 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -46,7 +46,6 @@
 
 #include "powernv.h"
 #include "pci.h"
-#include "../../../../drivers/pci/pci.h"
 
 #define PNV_IODA1_M64_NUM  16  /* Number of M64 BARs   */
 #define PNV_IODA1_M64_SEGS 8   /* Segments per M64 BAR */
@@ -3139,7 +3138,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
struct pci_dn *pdn;
int mul, total_vfs;
 
-   if (!pdev->is_physfn || pci_dev_is_added(pdev))
+   if (!pdev->is_physfn || pdev->is_added)
return;
 
pdn = pci_get_pdn(pdev);
diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 8a4868a3964b..139f0af6c3d9 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -71,7 +71,6 @@
 #include 
 
 #include "pseries.h"
-#include "../../../../drivers/pci/pci.h"
 
 int CMO_PrPSP = -1;
 int CMO_SecPSP = -1;
@@ -665,7 +664,7 @@ static void pseries_pci_fixup_iov_resources(struct pci_dev 
*pdev)
const int *indexes;
struct device_node *dn = pci_device_to_OF_node(pdev);
 
-   if (!pdev->is_physfn || pci_dev_is_added(pdev))
+   if (!pdev->is_physfn || pdev->is_added)
return;
/*Firmware must support open sriov otherwise dont configure*/
indexes = of_get_property(dn, "ibm,open-sriov-vf-bar-info", NULL);
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 5cb40b2518f9..35b7fc87eac5 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -330,7 +330,7 @@ void pci_bus_add_device(struct pci_dev *dev)
return;
}
 
-   pci_dev_assign_added(dev, true);
+   dev->is_added = 1;
 }
 EXPORT_SYMBOL_GPL(pci_bus_add_device);
 
@@ -347,14 +347,14 @@ void pci_bus_add_devices(const struct pci_bus *bus)
 
list_for_each_entry(dev, >devices, bus_list) {
/* Skip already-added devices */
-   if (pci_dev_is_added(dev))
+   if (dev->is_added)
continue;
pci_bus_add_device(dev);
}
 
list_for_each_entry(dev, >devices, bus_list) {
/* Skip if device attach failed */
-   if (!pci_dev_is_added(dev))
+   if (!dev->is_added)
continue;
child = dev->subordinate;
if (child)
diff --git a/drivers/pci/hotplug/acpiphp_glue.c 
b/drivers/pci/hotplug/acpiphp_glue.c
index ef0b1b6ba86f..3a17b290df5d 100644
--- a/drivers/pci/hotplug/acpiphp_glue.c
+++ b/drivers/pci/hotplug/acpiphp_glue.c
@@ -509,7 +509,7 @@ static void enable_slot(struct acpiphp_slot *slot)
 
list_for_each_entry(dev, >devices, bus_list) {
/* Assume that newly added devices are powered on already. */
-   if (!pci_dev_is_added(dev))
+   if (!dev->is_added)
dev->current_state = PCI_D0;
}
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 6e0d1528d471..473aa10a5dbf 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -295,7 +295,6 @@ struct

Re: [RFC PATCH] pci: Proof of concept at fixing pci_enable_device/bridge races

2018-08-16 Thread Benjamin Herrenschmidt
On Thu, 2018-08-16 at 22:07 -0500, Bjorn Helgaas wrote:
> [+cc Srinath, Guenter, Jens, Lukas, Konstantin, Marta, Pierre-Yves]
> 
> On Thu, Aug 16, 2018 at 07:50:13AM +1000, Benjamin Herrenschmidt wrote:
> > [Note: This isn't meant to be merged, it need splitting at the very
> > least, see below]
> > 
> > This is something I cooked up quickly today to test if that would fix
> > my problems with large number of switch and NVME devices on POWER.
> > 
> > So far it does...
> > 
> > The issue (as discussed in the Re: PCIe enable device races thread) is
> > that pci_enable_device and related functions along with pci_set_master
> > and pci_enable_bridge are fundamentally racy.
> > 
> > There is no lockign protecting the state of the device in pci_dev and
> > if multiple devices under the same bridge try to enable it simultaneously
> > one some of them will potentially start accessing it before it has actually
> > been enabled.
> > 
> > Now there are a LOT more fields in pci_dev that aren't covered by any
> > form of locking.
> 
> Most of the PCI core relies on the assumption that only a single
> thread touches a device at a time.  This is generally true of the core
> during enumeration because enumeration is single-threaded.  It's
> generally true in normal driver operation because the core shouldn't
> touch a device after a driver claims it.

Mostly :-) There are a few exceptions though.

> But there are several exceptions, and I think we need to understand
> those scenarios before applying locks willy-nilly.

We need to stop creating ad-hoc locks. We have a good handle already on
the main enable/disable and bus master scenario, and the race with
is_added.

Ignore the patch itself, it has at least 2 bugs with PM, I'll send a
series improving things a bit later.

> One big exception is that enabling device A may also touch an upstream
> bridge B.  That causes things like your issue and Srinath's issue
> where drivers simultaneously enabling two devices below the same
> bridge corrupt the bridge's state [1].  Marta reported essentially the
> same issue [2].
> 
> Hari's issue [3] seems related to a race between a driver work queue
> and the core enumerating the device.  I should have pushed harder to
> understand this; I feel like we papered over the immediate problem
> without clearing up the larger issue of why the core enumeration path
> (pci_bus_add_device()) is running at the same time as a driver.

It's not. What is happening is that is_added is set by
pci_bus_add_device() after it has bound the driver. An easy fix would
have been to move it up instead:

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 737d1c52f002..ff4d536d43fc 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -323,16 +323,16 @@ void pci_bus_add_device(struct pci_dev *dev)
pci_proc_attach_device(dev);
pci_bridge_d3_update(dev);
 
+   dev->is_added = 1;
dev->match_driver = true;
retval = device_attach(>dev);
if (retval < 0 && retval != -EPROBE_DEFER) {
+   dev->is_added = 0;
pci_warn(dev, "device attach failed (%d)\n", retval);
pci_proc_detach_device(dev);
pci_remove_sysfs_dev_files(dev);
return;
}
-
-   dev->is_added = 1;
 }
 EXPORT_SYMBOL_GPL(pci_bus_add_device);

(Untested).

Note: another advantage of the above is that the current code has an
odd asymetry: is_added is currently set after we attach but also
cleared after we detatch.

If we want to keep the flag being set after attaching, then we do
indeed need to protect it against concurrent access to other fields.

The easiest way to do that would have been to remove the :1 as this:

-   unsigned intis_added:1;
+   unsigned intis_added;

Again, none of these approach involves the invasive patch your merged
which uses that atomic operation which provides the false sense of
security that your are somewhat "protected" while in fact you only
protect the field itself, but provide no protection about overall
concurrency of the callers which might clash in different ways.

Finally, we could also move is_added under the protection of the new
mutex I propose adding, but that would really only work as long as
we move all the :1 fields protected by that mutex together inside
the struct pci_dev structure as to avoid collisions with other fields
being modified.

All of the above are preferable to what you merged.

> DPC/AER error handling adds more cases where the core potentially
> accesses devices asynchronously to the driver.
>
> User-mode sysfs controls like ASPM are also asynchronous to drivers.
> 
> Even setpci is a potential issue, though I don't know how far we want
> to go to protect it

Re: [RFC PATCH] pci: Proof of concept at fixing pci_enable_device/bridge races

2018-08-15 Thread Benjamin Herrenschmidt
On Wed, 2018-08-15 at 15:40 -0700, Guenter Roeck wrote:
> On Thu, Aug 16, 2018 at 07:50:13AM +1000, Benjamin Herrenschmidt wrote:
> > (Resent with lkml on copy)
> > 
> > [Note: This isn't meant to be merged, it need splitting at the very
> > least, see below]
> > 
> > This is something I cooked up quickly today to test if that would fix
> > my problems with large number of switch and NVME devices on POWER.
> > 
> 
> Is that a problem that can be reproduced with a qemu setup ?

With difficulty... mt-tcg might help, but you need a rather large
systems to reproduce it.

My repro-case is a 2 socket POWER9 system (about 40 cores off the top
of my mind, so 160 threads) with 72 NVME devices underneath a tree of
switches (I don't have the system at hand today to check how many).

It's possible to observe it I suppose on a smaller system (in theory a
single bridge with 2 devices is enough) but in practice the timing is
extremely hard to hit.

You need a combination of:

  - The bridges come up disabled (which is the case when Linux does the
resource assignment, such as on POWER but not on x86 unless it's
hotplug)

  - The nvme devices try to enable them simultaneously

Also the resulting error is a UR, I don't know how well qemu models
that.

On the above system, I get usually *one* device failing due to the race
out of 72, and not on every boot.

However, the bug is known (see Bjorn's reply to the other thread) "Re:
PCIe enable device races (Was: [PATCH v3] PCI: Data corruption
happening due to race condition)" on linux-pci, so I'm not the only one
with a repro-case around.

Cheers,
Ben.




[PATCH] platform/x86: acer-wmi: Silence "unsupported" message a bit

2018-08-15 Thread Benjamin Herrenschmidt
This driver prints that "Unsupported machine..." message on every boot on
ThinkPad X1 Carbon laptops (and I assume a number of other systems), which
causes graphical boots to "glitch" a bit and is rather annoying ...

Make it a pr_debug instead.

Signed-off-by: Benjamin Herrenschmidt 
---

diff --git a/drivers/platform/x86/acer-wmi.c b/drivers/platform/x86/acer-wmi.c
index 8952173dd380..7f7192e8fdab 100644
--- a/drivers/platform/x86/acer-wmi.c
+++ b/drivers/platform/x86/acer-wmi.c
@@ -2216,7 +2216,7 @@ static int __init acer_wmi_init(void)
if (wmi_has_guid(AMW0_GUID1) &&
!dmi_check_system(amw0_whitelist) &&
quirks == _unknown) {
-   pr_err("Unsupported machine has AMW0_GUID1, unable to load\n");
+   pr_debug("Unsupported machine has AMW0_GUID1, unable to 
load\n");
return -ENODEV;
}
  



[RFC PATCH] pci: Proof of concept at fixing pci_enable_device/bridge races

2018-08-15 Thread Benjamin Herrenschmidt
(Resent with lkml on copy)

[Note: This isn't meant to be merged, it need splitting at the very
least, see below]

This is something I cooked up quickly today to test if that would fix
my problems with large number of switch and NVME devices on POWER.

So far it does...

The issue (as discussed in the Re: PCIe enable device races thread) is
that pci_enable_device and related functions along with pci_set_master
and pci_enable_bridge are fundamentally racy.

There is no lockign protecting the state of the device in pci_dev and
if multiple devices under the same bridge try to enable it simultaneously
one some of them will potentially start accessing it before it has actually
been enabled.

Now there are a LOT more fields in pci_dev that aren't covered by any
form of locking.

Additionally, some other parts such as ASPM or SR-IOV seem to be trying to do
their own ad-hoc locking, but will manipulate bit fields in pci_dev that
might be sharing words with other unrelated bit fields racily.

So I think we need to introduce a pci_dev mutex aimed at synchronizing
the main enable/master state of a given device, and possibly also the
bazillion of state bits held inside pci_dev.

I suggest a progressive approach:

 1- First we add the mutex, and put a red line comment "/* - */" in
struct pci_dev, at the bottom

 2- We move the basic enable/disable/set_master stuff under that lock and
move the corresponding fields in pci_dev below the line comment.

 3- Progressively, as we untangle them and verify their locking, we move
individual fields below the line so that we have a good visibility of
what has been addressed/audited already and what not

Note: I would very much like to keep most of the probing/resource allocation
single threaded at least within a given domain, I know we have some attempts
at locking in the hotplug code for that, I'm not ready to tackle that or
change it at this stage.

We might be able to also addrss the is_added issues (Hari Vyas stuff) using
the same mutex, I haven't looked into the details.

Not-signed-off-by: Benjamin Herrenschmidt 
---

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 316496e99da9..9c4895c40f1d 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1348,9 +1348,13 @@ static int do_pci_enable_device(struct pci_dev *dev, int 
bars)
  */
 int pci_reenable_device(struct pci_dev *dev)
 {
+   int rc = 0;
+
+   mutex_lock(>lock);
if (pci_is_enabled(dev))
-   return do_pci_enable_device(dev, (1 << PCI_NUM_RESOURCES) - 1);
-   return 0;
+   rc = do_pci_enable_device(dev, (1 << PCI_NUM_RESOURCES) - 1);
+   mutex_unlock(>lock);
+   return rc;
 }
 EXPORT_SYMBOL(pci_reenable_device);
 
@@ -1363,12 +1367,6 @@ static void pci_enable_bridge(struct pci_dev *dev)
if (bridge)
pci_enable_bridge(bridge);
 
-   if (pci_is_enabled(dev)) {
-   if (!dev->is_busmaster)
-   pci_set_master(dev);
-   return;
-   }
-
retval = pci_enable_device(dev);
if (retval)
pci_err(dev, "Error enabling bridge (%d), continuing\n",
@@ -1379,9 +1377,16 @@ static void pci_enable_bridge(struct pci_dev *dev)
 static int pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
 {
struct pci_dev *bridge;
-   int err;
+   int err = 0;
int i, bars = 0;
 
+   /* Handle upstream bridges first to avoid locking issues */
+   bridge = pci_upstream_bridge(dev);
+   if (bridge)
+   pci_enable_bridge(bridge);
+
+   mutex_lock(>lock);
+
/*
 * Power state could be unknown at this point, either due to a fresh
 * boot or a device removal call.  So get the current power state
@@ -1394,12 +1399,9 @@ static int pci_enable_device_flags(struct pci_dev *dev, 
unsigned long flags)
dev->current_state = (pmcsr & PCI_PM_CTRL_STATE_MASK);
}
 
+   /* Already enabled ? */
if (atomic_inc_return(>enable_cnt) > 1)
-   return 0;   /* already enabled */
-
-   bridge = pci_upstream_bridge(dev);
-   if (bridge)
-   pci_enable_bridge(bridge);
+   goto bail;
 
/* only skip sriov related */
for (i = 0; i <= PCI_ROM_RESOURCE; i++)
@@ -1412,6 +1414,8 @@ static int pci_enable_device_flags(struct pci_dev *dev, 
unsigned long flags)
err = do_pci_enable_device(dev, bars);
if (err < 0)
atomic_dec(>enable_cnt);
+ bail:
+   mutex_unlock(>lock);
return err;
 }
 
@@ -1618,6 +1622,7 @@ static void do_pci_disable_device(struct pci_dev *dev)
if (pci_command & PCI_COMMAND_MASTER) {
pci_command &= ~PCI_COMMAND_MASTER;
pci_write_config_word(dev, PCI_COMMAND, pci_command);
+   dev->is_busmaster = 0;
}
 
pcibi

Re: [PATCH] fsi: ast: select GENERIC_ALLOCATOR

2018-08-13 Thread Benjamin Herrenschmidt
On Tue, 2018-08-14 at 00:37 +0200, Arnd Bergmann wrote:
> In randconfig builds without CONFIG_GENERIC_ALLOCATOR, this driver
> fails to link:
> 
> ERROR: "gen_pool_alloc_algo" [drivers/fsi/fsi-master-ast-cf.ko] undefined!
> ERROR: "gen_pool_fixed_alloc" [drivers/fsi/fsi-master-ast-cf.ko] undefined!
> ERROR: "of_gen_pool_get" [drivers/fsi/fsi-master-ast-cf.ko] undefined!
> ERROR: "gen_pool_free" [drivers/fsi/fsi-master-ast-cf.ko] undefined!
> 
> Select the dependency as all other users do.
> 
> Fixes: 6a794a27daca ("fsi: master-ast-cf: Add new FSI master using Aspeed 
> ColdFire")
> Signed-off-by: Arnd Bergmann 
> ---

Thanks, I'll pick it up and send to Greg with a slightly different
subject

Cheers,
Ben.

>  drivers/fsi/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
> index af3a20dd5aa4..99c99a5d57fe 100644
> --- a/drivers/fsi/Kconfig
> +++ b/drivers/fsi/Kconfig
> @@ -46,6 +46,7 @@ config FSI_MASTER_AST_CF
>   tristate "FSI master based on Aspeed ColdFire coprocessor"
>   depends on GPIOLIB
>   depends on GPIO_ASPEED
> + select GENERIC_ALLOCATOR
>   ---help---
>   This option enables a FSI master using the AST2400 and AST2500 GPIO
>   lines driven by the internal ColdFire coprocessor. This requires



Re: [RFC 0/4] Virtio uses DMA API for all devices

2018-08-08 Thread Benjamin Herrenschmidt
On Wed, 2018-08-08 at 05:30 -0700, Christoph Hellwig wrote:
> On Wed, Aug 08, 2018 at 08:07:49PM +1000, Benjamin Herrenschmidt wrote:
> > Qemu virtio bypasses that iommu when the VIRTIO_F_IOMMU_PLATFORM flag
> > is not set (default) but there's nothing in the device-tree to tell the
> > guest about this since it's a violation of our pseries architecture, so
> > we just rely on Linux virtio "knowing" that it happens. It's a bit
> > yucky but that's now history...
> 
> That is ugly as hell, but it is how virtio works everywhere, so nothing
> special so far.

Yup.

> > Essentially pseries "architecturally" does not have the concept of not
> > having an iommu in the way and qemu violates that architecture today.
> > 
> > (Remember it comes from pHyp, our priorietary HV, which we are somewhat
> > mimmicing here).
> 
> It shouldnt be too hard to have a dt property that communicates this,
> should it?

We could invent something I suppose. The additional problem then (yeah
I know ... what a mess) is that qemu doesn't create the DT for PCI
devices, the firmware (SLOF) inside the guest does using normal PCI
probing.

That said, that FW could know about all the virtio vendor/device IDs,
check the VIRTIO_F_IOMMU_PLATFORM and set that property accordingly...
messy but doable. It's not a bus property (see my other reply below as
this could complicate things with your bus mask).

But we are drifting from the problem at hand :-) You propose we do set
VIRTIO_F_IOMMU_PLATFORM so we aren't in the above case, and the bypass
stuff works, so no need to touch it.

See my recap at the end of the email to make sure I understand fully
what you suggest.

> > So if we always set VIRTIO_F_IOMMU_PLATFORM, it *will* force all virtio
> > through that iommu and performance will suffer (esp vhost I suspect),
> > especially since adding/removing translations in the iommu is a
> > hypercall.

> Well, we'd nee to make sure that for this particular bus we skip the
> actualy iommu.

It's not a bus property. Qemu will happily mix up everything on the
same bus, that includes emulated devices that go through the emulated
iommu, real VFIO devices that go through an actual HW iommu and virtio
that bypasses everything.

This makes things tricky in general (not just in my powerpc secure VM
case) since, at least on powerpc but I suppose elsewhere too, iommu
related properties tend to be per "bus" while here, qemu will mix and
match.

But again, I think we are drifting away from the topic, see below

> > > It would not be the same effect.  The problem with that is that you must
> > > now assumes that your qemu knows that for example you might be passing
> > > a dma offset if the bus otherwise requires it. 
> > 
> > I would assume that arch_virtio_wants_dma_ops() only returns true when
> > no such offsets are involved, at least in our case that would be what
> > happens.
> 
> That would work, but we're really piling hacĸs ontop of hacks here.

Sort-of :-) At least none of what we are discussing now involves
touching the dma_ops themselves so we are not in the way of your big
cleanup operation here. But yeah, let's continue discussing your other
solution below.

> > >  Or in other words:
> > > you potentially break the contract between qemu and the guest of always
> > > passing down physical addresses.  If we explicitly change that contract
> > > through using a flag that says you pass bus address everything is fine.
> > 
> > For us a "bus address" is behind the iommu so that's what
> > VIRTIO_F_IOMMU_PLATFORM does already. We don't have the concept of a
> > bus address that is different. I suppose it's an ARMism to have DMA
> > offsets that are separate from iommus ? 
> 
> No, a lot of platforms support a bus address that has an offset from
> the physical address. including a lot of power platforms:

Ok, just talking past each other :-) For all the powerpc ones, these
*do* go through the iommu, which is what I meant. It's just a window of
the iommu that provides some kind of direct mapping of memory.

For pseries, there is no such thing however. What we do to avoid
constant map/unmap of iommu PTEs in pseries guests is that we use
hypercalls to create a 64-bit window and populate all its PTEs with an
identity mapping. But that's not as efficient as a real bypass.

There are good historical reasons for that, since pseries is a guest
platform, its memory is never really where the guest thinks it is, so
you always need an iommu to remap. Even for virtual devices, since for
most of them, in the "IBM" pHyp model, the "peer" is actually another
partition, so the virtual iommu handles translating accross the two
partitions.

Same goes with cell in HW, no rea

[GIT PULL] FSI fixes for 4.19

2018-08-07 Thread Benjamin Herrenschmidt
Hi Greg !

This adds two FSI fixes to what you already have in your tree

  - Fix a NULL dereference in the scom driver

  - Fix a command buffer size issue in the sbefifo driver that
breaks some operations with POWER system debugger (cronus)

Cheers,
Ben.

The following changes since tag fsi-updates-2018-07-27:

  fsi: Prevent multiple concurrent rescans (2018-07-27 09:58:12 +1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git 
tags/fsi-updates-2018-08-08

for you to fetch changes up to 15e2a7218c2788d79c566d17cb9428c221e7:

  fsi: sbefifo: Bump max command length (2018-08-08 15:44:47 +1000)


Benjamin Herrenschmidt (2):
  fsi: scom: Fix NULL dereference
  fsi: sbefifo: Bump max command length

 drivers/fsi/fsi-sbefifo.c | 39 ---
 drivers/fsi/fsi-scom.c|  1 +
 2 files changed, 33 insertions(+), 7 deletions(-)



[GIT PULL] FSI updates round 4 (last) for 4.19

2018-07-26 Thread Benjamin Herrenschmidt
Hi Greg !

This adds a few fixes for things reported since the last merge,
and the latch batch of changes pending for FSI for 4.19.

That batch is a rather mechanical conversion of the misc devices
into proper char devices.

The misc devices were ill suited, the minor space for them is
limited and we can have a lot of chips in a system creating FSI
devices.

This also allows us to better control (and fix) object lifetime
getting rid of the bad devm_kzalloc() of the structures containing
the devices etc...

Finally, we add a chardev to the core FSI that provides raw CFAM
access to FSI slaves as a replacement for the current "raw" binary
sysfs file which will be ultimately deprecated and removed.

Thanks !
Ben.

The following changes since commit 0a213777d1dd879092225a7aa847b6e9b3a1c267:

  fsi: Add support for device-tree provided chip IDs (2018-07-23 16:27:32 +1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git 
tags/fsi-updates-2018-07-27

for you to fetch changes up to 9840fcd8cc43bfba486a53b4461044f1a1189cdc:

  fsi: Prevent multiple concurrent rescans (2018-07-27 09:58:12 +1000)

----
Benjamin Herrenschmidt (8):
  fsi: master-ast-cf: Fix build warnings on 64-bit platforms
  fsi: master-ast-cf: Mask unused bits in RTAG/RCRC
  fsi: master-ast-cf: Rename dump_trace() to avoid name collision
  fsi: Add new central chardev support
  fsi: sbefifo: Convert to use the new chardev
  fsi: scom: Convert to use the new chardev
  fsi: Add cfam char devices
  fsi: Prevent multiple concurrent rescans

Gustavo A. R. Silva (1):
  fsi: master-ast-cf: Fix memory leak

 drivers/fsi/Kconfig |  15 +++
 drivers/fsi/fsi-core.c  | 375 
+++--
 drivers/fsi/fsi-master-ast-cf.c |  24 +++--
 drivers/fsi/fsi-master.h|   2 +
 drivers/fsi/fsi-sbefifo.c   |  84 ++-
 drivers/fsi/fsi-scom.c  | 130 ++-
 include/linux/fsi.h |  12 ++-
 7 files changed, 496 insertions(+), 146 deletions(-)



Re: [PATCH] fsi: master-ast-cf: Fix memory leak

2018-07-25 Thread Benjamin Herrenschmidt
On Wed, 2018-07-25 at 08:38 -0500, Gustavo A. R. Silva wrote:
> In case memory resources for *fw* were allocated, release them
> before return.
> 
> Addresses-Coverity-ID: 1472044 ("Resource leak")
> Fixes: 6a794a27daca ("fsi: master-ast-cf: Add new FSI master using Aspeed 
> ColdFire")
> Signed-off-by: Gustavo A. R. Silva 

Thanks Gustavo !

I'll apply before I send my next pull request to Greg !

Cheers,
Ben.

> ---
>  drivers/fsi/fsi-master-ast-cf.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/fsi/fsi-master-ast-cf.c b/drivers/fsi/fsi-master-ast-cf.c
> index c4a3557..e39a4630 100644
> --- a/drivers/fsi/fsi-master-ast-cf.c
> +++ b/drivers/fsi/fsi-master-ast-cf.c
> @@ -861,7 +861,8 @@ static int load_copro_firmware(struct fsi_master_acf 
> *master)
>   if (sig != wanted_sig) {
>   dev_err(master->dev, "Failed to locate image sig %04x in FW 
> blob\n",
>   wanted_sig);
> - return -ENODEV;
> + rc = -ENODEV;
> + goto release_fw;
>   }
>   if (size > master->cf_mem_size) {
>   dev_err(master->dev, "FW size (%zd) bigger than memory reserve 
> (%zd)\n",
> @@ -870,8 +871,9 @@ static int load_copro_firmware(struct fsi_master_acf 
> *master)
>   } else {
>   memcpy_toio(master->cf_mem, data, size);
>   }
> - release_firmware(fw);
>  
> +release_fw:
> + release_firmware(fw);
>   return rc;
>  }
>  



Re: [PATCH] fsi: master-ast-cf: use correct format string for size_t

2018-07-24 Thread Benjamin Herrenschmidt
On Tue, 2018-07-24 at 11:34 +0200, Arnd Bergmann wrote:
> On 32-bit targets, size_t is often 'unsigned int', so printing it as %lu
> causes a warning:
> 
> drivers/fsi/fsi-master-ast-cf.c: In function 'fsi_master_acf_read':
> drivers/fsi/fsi-master-ast-cf.c:609:23: error: format '%u' expects argument 
> of type 'unsigned int', but argument 6 has type 'size_t' {aka 'long unsigned 
> int'} [-Werror=format=]
>   dev_dbg(master->dev, "read id %d addr %x size %ud\n", id, addr, size);
> drivers/fsi/fsi-master-ast-cf.c: In function 'fsi_master_acf_write':
> drivers/fsi/fsi-master-ast-cf.c:634:23: error: format '%u' expects argument 
> of type 'unsigned int', but argument 6 has type 'size_t' {aka 'long unsigned 
> int'} [-Werror=format=]
> 
> The correct format string is %zu.

Thanks Arnd, but I've already pushed such a fix in the fsi tree today
:)

Cheers,
Ben.

> Fixes: 6a794a27daca ("fsi: master-ast-cf: Add new FSI master using Aspeed 
> ColdFire")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/fsi/fsi-master-ast-cf.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/fsi/fsi-master-ast-cf.c b/drivers/fsi/fsi-master-ast-cf.c
> index 57afaae0b691..d1bbdf8b8343 100644
> --- a/drivers/fsi/fsi-master-ast-cf.c
> +++ b/drivers/fsi/fsi-master-ast-cf.c
> @@ -606,7 +606,7 @@ static int fsi_master_acf_read(struct fsi_master 
> *_master, int link,
>   return -ENODEV;
>  
>   mutex_lock(>lock);
> - dev_dbg(master->dev, "read id %d addr %x size %ud\n", id, addr, size);
> + dev_dbg(master->dev, "read id %d addr %x size %zu\n", id, addr, size);
>   build_ar_command(master, , id, addr, size, NULL);
>   rc = fsi_master_acf_xfer(master, id, , size, val);
>   last_address_update(master, id, rc == 0, addr);
> @@ -631,7 +631,7 @@ static int fsi_master_acf_write(struct fsi_master 
> *_master, int link,
>  
>   mutex_lock(>lock);
>   build_ar_command(master, , id, addr, size, val);
> - dev_dbg(master->dev, "write id %d addr %x size %ud raw_data: %08x\n",
> + dev_dbg(master->dev, "write id %d addr %x size %zu raw_data: %08x\n",
>   id, addr, size, *(uint32_t *)val);
>   rc = fsi_master_acf_xfer(master, id, , 0, NULL);
>   last_address_update(master, id, rc == 0, addr);



Re: linux-next: build warning after merge of the fsi tree

2018-07-24 Thread Benjamin Herrenschmidt
On Tue, 2018-07-24 at 16:03 +1000, Stephen Rothwell wrote:
> Hi Benjamin,
> 
> After merging the fsi tree, today's linux-next build (x86_64 allmodconfig)
> produced this warning:

Ah thanks. It's a pair of missing "z" qualifiers in those dev_dbg.

This normally won't show as this driver can only be enabled on 32-bit
ARM, but COMPILE_TEST will allow you to build it on x86_64 thus
triggering the warning.

I've pushed a fix in the fsi tree.

Greg, do you want a new signed tag with the fix or that can wait
for my next pull request ? I have a last bunch of chardev conversions
and object lifetime fixes being reviewed that I was thinking of sending
later this week.

Cheers,
Ben.

> 
> In file included from include/linux/printk.h:336:0,
>  from include/linux/kernel.h:14,
>  from include/linux/delay.h:22,
>  from drivers/fsi/fsi-master-ast-cf.c:8:
> drivers/fsi/fsi-master-ast-cf.c: In function 'fsi_master_acf_read':
> drivers/fsi/fsi-master-ast-cf.c:609:23: warning: format '%u' expects argument 
> of type 'unsigned int', but argument 6 has type 'size_t {aka long unsigned 
> int}' [-Wformat=]
>   dev_dbg(master->dev, "read id %d addr %x size %ud\n", id, addr, size);
>^
> include/linux/dynamic_debug.h:135:39: note: in definition of macro 
> 'dynamic_dev_dbg'
>__dynamic_dev_dbg(, dev, fmt, \
>^~~
> include/linux/device.h:1428:23: note: in expansion of macro 'dev_fmt'
>   dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__)
>^~~
> drivers/fsi/fsi-master-ast-cf.c:609:2: note: in expansion of macro 'dev_dbg'
>   dev_dbg(master->dev, "read id %d addr %x size %ud\n", id, addr, size);
>   ^~~
> drivers/fsi/fsi-master-ast-cf.c: In function 'fsi_master_acf_write':
> drivers/fsi/fsi-master-ast-cf.c:634:23: warning: format '%u' expects argument 
> of type 'unsigned int', but argument 6 has type 'size_t {aka long unsigned 
> int}' [-Wformat=]
>   dev_dbg(master->dev, "write id %d addr %x size %ud raw_data: %08x\n",
>^
> include/linux/dynamic_debug.h:135:39: note: in definition of macro 
> 'dynamic_dev_dbg'
>__dynamic_dev_dbg(, dev, fmt, \
>^~~
> include/linux/device.h:1428:23: note: in expansion of macro 'dev_fmt'
>   dynamic_dev_dbg(dev, dev_fmt(fmt), ##__VA_ARGS__)
>^~~
> drivers/fsi/fsi-master-ast-cf.c:634:2: note: in expansion of macro 'dev_dbg'
>   dev_dbg(master->dev, "write id %d addr %x size %ud raw_data: %08x\n",
>   ^~~
> 
> Introduced by commit
> 
>   6a794a27daca ("fsi: master-ast-cf: Add new FSI master using Aspeed 
> ColdFire")
> 



[PATCH 2/5] fsi: sbefifo: Convert to use the new chardev

2018-07-23 Thread Benjamin Herrenschmidt
This converts FSI sbefifo to use the new fsi-core controlled
chardev allocator and use a real cdev instead of a miscdev.

One side effect is to fix the object lifetime by removing
the use of devm_kzalloc() for something that contains kobjects,
and using proper reference counting.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/fsi/fsi-sbefifo.c | 84 +--
 1 file changed, 55 insertions(+), 29 deletions(-)

diff --git a/drivers/fsi/fsi-sbefifo.c b/drivers/fsi/fsi-sbefifo.c
index 33a5d9a43a07..8a185f53c9d8 100644
--- a/drivers/fsi/fsi-sbefifo.c
+++ b/drivers/fsi/fsi-sbefifo.c
@@ -17,9 +17,8 @@
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
+#include 
 #include 
 #include 
 #include 
@@ -118,11 +117,11 @@ struct sbefifo {
uint32_tmagic;
 #define SBEFIFO_MAGIC  0x53424546 /* "SBEF" */
struct fsi_device   *fsi_dev;
-   struct miscdevice   mdev;
+   struct device   dev;
+   struct cdev cdev;
struct mutexlock;
-   charname[32];
-   int idx;
boolbroken;
+   booldead;
boolasync_ffdc;
 };
 
@@ -133,9 +132,9 @@ struct sbefifo_user {
size_t  pending_len;
 };
 
-static DEFINE_IDA(sbefifo_ida);
 static DEFINE_MUTEX(sbefifo_ffdc_mutex);
 
+
 static void __sbefifo_dump_ffdc(struct device *dev, const __be32 *ffdc,
size_t ffdc_sz, bool internal)
 {
@@ -667,6 +666,9 @@ static int __sbefifo_submit(struct sbefifo *sbefifo,
struct device *dev = >fsi_dev->dev;
int rc;
 
+   if (sbefifo->dead)
+   return -ENODEV;
+
if (cmd_len < 2 || be32_to_cpu(command[0]) != cmd_len) {
dev_vdbg(dev, "Invalid command len %zd (header: %d)\n",
 cmd_len, be32_to_cpu(command[0]));
@@ -751,8 +753,7 @@ EXPORT_SYMBOL_GPL(sbefifo_submit);
  */
 static int sbefifo_user_open(struct inode *inode, struct file *file)
 {
-   struct sbefifo *sbefifo = container_of(file->private_data,
-  struct sbefifo, mdev);
+   struct sbefifo *sbefifo = container_of(inode->i_cdev, struct sbefifo, 
cdev);
struct sbefifo_user *user;
 
user = kzalloc(sizeof(struct sbefifo_user), GFP_KERNEL);
@@ -889,6 +890,14 @@ static const struct file_operations sbefifo_fops = {
.release= sbefifo_user_release,
 };
 
+static void sbefifo_free(struct device *dev)
+{
+   struct sbefifo *sbefifo = container_of(dev, struct sbefifo, dev);
+
+   put_device(>fsi_dev->dev);
+   kfree(sbefifo);
+}
+
 /*
  * Probe/remove
  */
@@ -900,15 +909,23 @@ static int sbefifo_probe(struct device *dev)
struct device_node *np;
struct platform_device *child;
char child_name[32];
-   int rc, child_idx = 0;
+   int rc, didx, child_idx = 0;
 
dev_dbg(dev, "Found sbefifo device\n");
 
-   sbefifo = devm_kzalloc(dev, sizeof(*sbefifo), GFP_KERNEL);
+   sbefifo = kzalloc(sizeof(*sbefifo), GFP_KERNEL);
if (!sbefifo)
return -ENOMEM;
+
+   /* Grab a reference to the device (parent of our cdev), we'll drop it 
later */
+   if (!get_device(dev)) {
+   return -ENODEV;
+   kfree(sbefifo);
+   }
+
sbefifo->magic = SBEFIFO_MAGIC;
sbefifo->fsi_dev = fsi_dev;
+   dev_set_drvdata(dev, sbefifo);
mutex_init(>lock);
 
/*
@@ -919,28 +936,30 @@ static int sbefifo_probe(struct device *dev)
if (rc && rc != -ESHUTDOWN)
dev_err(dev, "Initial HW cleanup failed, will retry later\n");
 
-   sbefifo->idx = ida_simple_get(_ida, 1, INT_MAX, GFP_KERNEL);
-   snprintf(sbefifo->name, sizeof(sbefifo->name), "sbefifo%d",
-sbefifo->idx);
+   /* Create chardev for userspace access */
+   sbefifo->dev.type = _cdev_type;
+   sbefifo->dev.parent = dev;
+   sbefifo->dev.release = sbefifo_free;
+   device_initialize(>dev);
 
-   dev_set_drvdata(dev, sbefifo);
+   /* Allocate a minor in the FSI space */
+   rc = fsi_get_new_minor(fsi_dev, fsi_dev_sbefifo, >dev.devt, 
);
+   if (rc)
+   goto err;
 
-   /* Create misc chardev for userspace access */
-   sbefifo->mdev.minor = MISC_DYNAMIC_MINOR;
-   sbefifo->mdev.fops = _fops;
-   sbefifo->mdev.name = sbefifo->name;
-   sbefifo->mdev.parent = dev;
-   rc = misc_register(>mdev);
+   dev_set_name(>dev, "sbefifo%d", didx);
+   cdev_init(>cdev, _fops);
+   rc = cdev_device_add(>cdev, >dev);
if (rc) {
-   dev_err(dev, "Failed to register miscdev

[PATCH 5/5] fsi: Prevent multiple concurrent rescans

2018-07-23 Thread Benjamin Herrenschmidt
The bus scanning process isn't terribly good at parallel attempts
at rescanning the same bus. Let's have a per-master mutex protecting
the scanning process.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/fsi/fsi-core.c   | 16 ++--
 drivers/fsi/fsi-master.h |  2 ++
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/fsi/fsi-core.c b/drivers/fsi/fsi-core.c
index dea5bd48acc5..2c31563fdcae 100644
--- a/drivers/fsi/fsi-core.c
+++ b/drivers/fsi/fsi-core.c
@@ -1203,8 +1203,14 @@ static void fsi_master_unscan(struct fsi_master *master)
 
 int fsi_master_rescan(struct fsi_master *master)
 {
+   int rc;
+
+   mutex_lock(>scan_lock);
fsi_master_unscan(master);
-   return fsi_master_scan(master);
+   rc = fsi_master_scan(master);
+   mutex_unlock(>scan_lock);
+
+   return rc;
 }
 EXPORT_SYMBOL_GPL(fsi_master_rescan);
 
@@ -1240,6 +1246,7 @@ int fsi_master_register(struct fsi_master *master)
int rc;
struct device_node *np;
 
+   mutex_init(>scan_lock);
master->idx = ida_simple_get(_ida, 0, INT_MAX, GFP_KERNEL);
dev_set_name(>dev, "fsi%d", master->idx);
 
@@ -1264,8 +1271,11 @@ int fsi_master_register(struct fsi_master *master)
}
 
np = dev_of_node(>dev);
-   if (!of_property_read_bool(np, "no-scan-on-init"))
+   if (!of_property_read_bool(np, "no-scan-on-init")) {
+   mutex_lock(>scan_lock);
fsi_master_scan(master);
+   mutex_unlock(>scan_lock);
+   }
 
return 0;
 }
@@ -1278,7 +1288,9 @@ void fsi_master_unregister(struct fsi_master *master)
master->idx = -1;
}
 
+   mutex_lock(>scan_lock);
fsi_master_unscan(master);
+   mutex_unlock(>scan_lock);
device_unregister(>dev);
 }
 EXPORT_SYMBOL_GPL(fsi_master_unregister);
diff --git a/drivers/fsi/fsi-master.h b/drivers/fsi/fsi-master.h
index f653f75da7be..040a7d4cf717 100644
--- a/drivers/fsi/fsi-master.h
+++ b/drivers/fsi/fsi-master.h
@@ -18,6 +18,7 @@
 #define DRIVERS_FSI_MASTER_H
 
 #include 
+#include 
 
 /* Various protocol delays */
 #defineFSI_ECHO_DELAY_CLOCKS   16  /* Number clocks for echo delay 
*/
@@ -59,6 +60,7 @@ struct fsi_master {
int idx;
int n_links;
int flags;
+   struct mutexscan_lock;
int (*read)(struct fsi_master *, int link, uint8_t id,
uint32_t addr, void *val, size_t size);
int (*write)(struct fsi_master *, int link, uint8_t id,
-- 
2.17.1



[PATCH 4/5] fsi: Add cfam char devices

2018-07-23 Thread Benjamin Herrenschmidt
This aims to deprecate the "raw" sysfs file used for directly
accessing the CFAM and instead use a char device like the
other sub drivers.

Since it reworks the slave creation code and adds a cfam device
type, we also use the opportunity to convert the attributes
to attribute groups and add a couple more.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/fsi/fsi-core.c | 264 +
 1 file changed, 213 insertions(+), 51 deletions(-)

diff --git a/drivers/fsi/fsi-core.c b/drivers/fsi/fsi-core.c
index faa1760a5a40..dea5bd48acc5 100644
--- a/drivers/fsi/fsi-core.c
+++ b/drivers/fsi/fsi-core.c
@@ -11,6 +11,11 @@
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
+ *
+ * TODO:
+ *  - Rework topology
+ *  - s/chip_id/chip_loc
+ *  - s/cfam/chip (cfam_id -> chip_id etc...)
  */
 
 #include 
@@ -21,6 +26,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 #include "fsi-master.h"
 
@@ -78,8 +86,11 @@ static DEFINE_IDA(master_ida);
 struct fsi_slave {
struct device   dev;
struct fsi_master   *master;
-   int id;
-   int link;
+   struct cdev cdev;
+   int cdev_idx;
+   int id; /* FSI address */
+   int link;   /* FSI link# */
+   u32 cfam_id;
int chip_id;
uint32_tsize;   /* size of slave address space */
u8  t_send_delay;
@@ -607,29 +618,6 @@ static const struct bin_attribute fsi_slave_raw_attr = {
.write = fsi_slave_sysfs_raw_write,
 };
 
-static ssize_t fsi_slave_sysfs_term_write(struct file *file,
-   struct kobject *kobj, struct bin_attribute *attr,
-   char *buf, loff_t off, size_t count)
-{
-   struct fsi_slave *slave = to_fsi_slave(kobj_to_dev(kobj));
-   struct fsi_master *master = slave->master;
-
-   if (!master->term)
-   return -ENODEV;
-
-   master->term(master, slave->link, slave->id);
-   return count;
-}
-
-static const struct bin_attribute fsi_slave_term_attr = {
-   .attr = {
-   .name = "term",
-   .mode = 0200,
-   },
-   .size = 0,
-   .write = fsi_slave_sysfs_term_write,
-};
-
 static void fsi_slave_release(struct device *dev)
 {
struct fsi_slave *slave = to_fsi_slave(dev);
@@ -682,6 +670,127 @@ static struct device_node *fsi_slave_find_of_node(struct 
fsi_master *master,
return NULL;
 }
 
+static ssize_t cfam_read(struct file *filep, char __user *buf, size_t count,
+loff_t *offset)
+{
+   struct fsi_slave *slave = filep->private_data;
+   size_t total_len, read_len;
+   loff_t off = *offset;
+   ssize_t rc;
+
+   if (off < 0)
+   return -EINVAL;
+
+   if (off > 0x || count > 0x || off + count > 0x)
+   return -EINVAL;
+
+   for (total_len = 0; total_len < count; total_len += read_len) {
+   __be32 data;
+
+   read_len = min_t(size_t, count, 4);
+   read_len -= off & 0x3;
+
+   rc = fsi_slave_read(slave, off, , read_len);
+   if (rc)
+   goto fail;
+   rc = copy_to_user(buf + total_len, , read_len);
+   if (rc) {
+   rc = -EFAULT;
+   goto fail;
+   }
+   off += read_len;
+   }
+   rc = count;
+ fail:
+   *offset = off;
+   return count;
+}
+
+static ssize_t cfam_write(struct file *filep, const char __user *buf,
+ size_t count, loff_t *offset)
+{
+   struct fsi_slave *slave = filep->private_data;
+   size_t total_len, write_len;
+   loff_t off = *offset;
+   ssize_t rc;
+
+
+   if (off < 0)
+   return -EINVAL;
+
+   if (off > 0x || count > 0x || off + count > 0x)
+   return -EINVAL;
+
+   for (total_len = 0; total_len < count; total_len += write_len) {
+   __be32 data;
+
+   write_len = min_t(size_t, count, 4);
+   write_len -= off & 0x3;
+
+   rc = copy_from_user(, buf + total_len, write_len);
+   if (rc) {
+   rc = -EFAULT;
+   goto fail;
+   }
+   rc = fsi_slave_write(slave, off, , write_len);
+   if (rc)
+   goto fail;
+   off += write_len;
+   }
+   rc = count;
+ fail:
+   *offset = off;
+   return count;
+}
+
+static loff_t cfam_llseek(struct file *file, loff_t o

[PATCH 1/5] fsi: Add new central chardev support

2018-07-23 Thread Benjamin Herrenschmidt
The various FSI devices (sbefifo, occ, scom, more to come)
currently use misc devices.

This is problematic as the minor device space for misc is
limited and there can be a lot of them. Also it limits our
ability to move them to a dedicated /dev/fsi directory or
to be smart about device naming and numbering.

It also means we have IDAs on every single of these drivers

This creates a common fsi "device_type" for the optional
/dev/fsi grouping and a dev_t allocator for all FSI devices.

"Legacy" devices get to use a backward compatible numbering
scheme (as long as chip id <16 and there's only one copy
of a given unit type per chip).

A single major number and a single IDA are shared for all
FSI devices.

This doesn't convert the FSI device drivers to use the new
scheme yet, they will be converted individually.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/fsi/Kconfig| 15 +++
 drivers/fsi/fsi-core.c | 95 +-
 include/linux/fsi.h| 12 +-
 3 files changed, 119 insertions(+), 3 deletions(-)

diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
index 8d82b1e60514..af3a20dd5aa4 100644
--- a/drivers/fsi/Kconfig
+++ b/drivers/fsi/Kconfig
@@ -12,6 +12,21 @@ menuconfig FSI
 
 if FSI
 
+config FSI_NEW_DEV_NODE
+   bool "Create '/dev/fsi' directory for char devices"
+   default n
+   ---help---
+   This option causes char devices created for FSI devices to be
+   located under a common /dev/fsi/ directory. Set to N unless your
+   userspace has been updated to handle the new location.
+
+   Additionally, it also causes the char device names to be offset
+   by one so that chip 0 will have /dev/scom1 and chip1 /dev/scom2
+   to match old userspace expectations.
+
+   New userspace will use udev rules to generate predictable access
+   symlinks in /dev/fsi/by-path when this option is enabled.
+
 config FSI_MASTER_GPIO
tristate "GPIO-based FSI master"
depends on GPIOLIB
diff --git a/drivers/fsi/fsi-core.c b/drivers/fsi/fsi-core.c
index eab6c5c4990e..faa1760a5a40 100644
--- a/drivers/fsi/fsi-core.c
+++ b/drivers/fsi/fsi-core.c
@@ -92,6 +92,13 @@ struct fsi_slave {
 static const int slave_retries = 2;
 static int discard_errors;
 
+static dev_t fsi_base_dev;
+static DEFINE_IDA(fsi_minor_ida);
+#define FSI_CHAR_MAX_DEVICES   0x1000
+
+/* Legacy /dev numbering: 4 devices per chip, 16 chips */
+#define FSI_CHAR_LEGACY_TOP64
+
 static int fsi_master_read(struct fsi_master *master, int link,
uint8_t slave_id, uint32_t addr, void *val, size_t size);
 static int fsi_master_write(struct fsi_master *master, int link,
@@ -627,6 +634,7 @@ static void fsi_slave_release(struct device *dev)
 {
struct fsi_slave *slave = to_fsi_slave(dev);
 
+   fsi_free_minor(slave->dev.devt);
of_node_put(dev->of_node);
kfree(slave);
 }
@@ -729,6 +737,75 @@ static ssize_t chip_id_show(struct device *dev,
 
 static DEVICE_ATTR_RO(chip_id);
 
+static char *fsi_cdev_devnode(struct device *dev, umode_t *mode,
+ kuid_t *uid, kgid_t *gid)
+{
+#ifdef CONFIG_FSI_NEW_DEV_NODE
+   return kasprintf(GFP_KERNEL, "fsi/%s", dev_name(dev));
+#else
+   return kasprintf(GFP_KERNEL, "%s", dev_name(dev));
+#endif
+}
+
+const struct device_type fsi_cdev_type = {
+   .name = "fsi-cdev",
+   .devnode = fsi_cdev_devnode,
+};
+EXPORT_SYMBOL_GPL(fsi_cdev_type);
+
+/* Backward compatible /dev/ numbering in "old style" mode */
+static int fsi_adjust_index(int index)
+{
+#ifdef CONFIG_FSI_NEW_DEV_NODE
+   return index;
+#else
+   return index + 1;
+#endif
+}
+
+static int __fsi_get_new_minor(struct fsi_slave *slave, enum fsi_dev_type type,
+  dev_t *out_dev, int *out_index)
+{
+   int cid = slave->chip_id;
+   int id;
+
+   /* Check if we qualify for legacy numbering */
+   if (cid >= 0 && cid < 16 && type < 4) {
+   /* Try reserving the legacy number */
+   id = (cid << 4) | type;
+   id = ida_simple_get(_minor_ida, id, id + 1, GFP_KERNEL);
+   if (id >= 0) {
+   *out_index = fsi_adjust_index(cid);
+   *out_dev = fsi_base_dev + id;
+   return 0;
+   }
+   /* Other failure */
+   if (id != -ENOSPC)
+   return id;
+   /* Fallback to non-legacy allocation */
+   }
+   id = ida_simple_get(_minor_ida, FSI_CHAR_LEGACY_TOP,
+   FSI_CHAR_MAX_DEVICES, GFP_KERNEL);
+   if (id < 0)
+   return id;
+   *out_index = fsi_adjust_index(id);
+   *out_dev = fsi_base_dev + id;
+   return 0;
+}
+
+int fsi_get_new_minor(struct fsi_device *fdev, enum fsi_dev_type type,

[PATCH 3/5] fsi: scom: Convert to use the new chardev

2018-07-23 Thread Benjamin Herrenschmidt
This converts FSI scom to use the new fsi-core controlled
chardev allocator and use a real cdev instead of a miscdev.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/fsi/fsi-scom.c | 130 +
 1 file changed, 80 insertions(+), 50 deletions(-)

diff --git a/drivers/fsi/fsi-scom.c b/drivers/fsi/fsi-scom.c
index 39c74351f1bf..0f303a700f69 100644
--- a/drivers/fsi/fsi-scom.c
+++ b/drivers/fsi/fsi-scom.c
@@ -20,9 +20,8 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
-#include 
 
 #include 
 
@@ -77,18 +76,12 @@
 struct scom_device {
struct list_head link;
struct fsi_device *fsi_dev;
-   struct miscdevice mdev;
+   struct device dev;
+   struct cdev cdev;
struct mutex lock;
-   charname[32];
-   int idx;
+   bool dead;
 };
 
-#define to_scom_dev(x) container_of((x), struct scom_device, mdev)
-
-static struct list_head scom_devices;
-
-static DEFINE_IDA(scom_ida);
-
 static int __put_scom(struct scom_device *scom_dev, uint64_t value,
  uint32_t addr, uint32_t *status)
 {
@@ -374,9 +367,7 @@ static int get_scom(struct scom_device *scom, uint64_t 
*value,
 static ssize_t scom_read(struct file *filep, char __user *buf, size_t len,
 loff_t *offset)
 {
-   struct miscdevice *mdev =
-   (struct miscdevice *)filep->private_data;
-   struct scom_device *scom = to_scom_dev(mdev);
+   struct scom_device *scom = filep->private_data;
struct device *dev = >fsi_dev->dev;
uint64_t val;
int rc;
@@ -385,7 +376,10 @@ static ssize_t scom_read(struct file *filep, char __user 
*buf, size_t len,
return -EINVAL;
 
mutex_lock(>lock);
-   rc = get_scom(scom, , *offset);
+   if (scom->dead)
+   rc = -ENODEV;
+   else
+   rc = get_scom(scom, , *offset);
mutex_unlock(>lock);
if (rc) {
dev_dbg(dev, "get_scom fail:%d\n", rc);
@@ -403,8 +397,7 @@ static ssize_t scom_write(struct file *filep, const char 
__user *buf,
  size_t len, loff_t *offset)
 {
int rc;
-   struct miscdevice *mdev = filep->private_data;
-   struct scom_device *scom = to_scom_dev(mdev);
+   struct scom_device *scom = filep->private_data;
struct device *dev = >fsi_dev->dev;
uint64_t val;
 
@@ -418,7 +411,10 @@ static ssize_t scom_write(struct file *filep, const char 
__user *buf,
}
 
mutex_lock(>lock);
-   rc = put_scom(scom, val, *offset);
+   if (scom->dead)
+   rc = -ENODEV;
+   else
+   rc = put_scom(scom, val, *offset);
mutex_unlock(>lock);
if (rc) {
dev_dbg(dev, "put_scom failed with:%d\n", rc);
@@ -532,12 +528,15 @@ static int scom_check(struct scom_device *scom, void 
__user *argp)
 
 static long scom_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
-   struct miscdevice *mdev = file->private_data;
-   struct scom_device *scom = to_scom_dev(mdev);
+   struct scom_device *scom = file->private_data;
void __user *argp = (void __user *)arg;
int rc = -ENOTTY;
 
mutex_lock(>lock);
+   if (scom->dead) {
+   mutex_unlock(>lock);
+   return -ENODEV;
+   }
switch(cmd) {
case FSI_SCOM_CHECK:
rc = scom_check(scom, argp);
@@ -556,48 +555,88 @@ static long scom_ioctl(struct file *file, unsigned int 
cmd, unsigned long arg)
return rc;
 }
 
+static int scom_open(struct inode *inode, struct file *file)
+{
+   struct scom_device *scom = container_of(inode->i_cdev, struct 
scom_device, cdev);
+
+   file->private_data = scom;
+
+   return 0;
+}
+
 static const struct file_operations scom_fops = {
.owner  = THIS_MODULE,
+   .open   = scom_open,
.llseek = scom_llseek,
.read   = scom_read,
.write  = scom_write,
.unlocked_ioctl = scom_ioctl,
 };
 
+static void scom_free(struct device *dev)
+{
+   struct scom_device *scom = container_of(dev, struct scom_device, dev);
+
+   put_device(>fsi_dev->dev);
+   kfree(scom);
+}
+
 static int scom_probe(struct device *dev)
 {
struct fsi_device *fsi_dev = to_fsi_dev(dev);
struct scom_device *scom;
+   int rc, didx;
 
-   scom = devm_kzalloc(dev, sizeof(*scom), GFP_KERNEL);
+   scom = kzalloc(sizeof(*scom), GFP_KERNEL);
if (!scom)
return -ENOMEM;
-
+   dev_set_drvdata(dev, scom);
mutex_init(>lock);
-   scom->idx = ida_simple_get(_ida, 1, INT_MAX, GFP_KERNEL);
-   snprintf(scom->name, sizeof(scom->name), "scom%d", scom->idx);
-   scom->fsi_dev = fsi_dev;
-   scom-

[PATCH 0/5] fsi: Convert misc devs to proper chardevs and more

2018-07-23 Thread Benjamin Herrenschmidt
This converts the various FSI devices from misc dev to chardev,
as there can potentially be too much of them for misc devs limited
minors, and because there are some lifetime issues with the current
support.

This provide a common infrastructure to allocate an FSI major and
distribute minors in a way that keeps it compatible with existing
userspace. A new representation grouping FSI devices under a
/dev/fsi directory is optinally provided, which will work in
conjunction with new udev scripts aimed at providing fixed ID
based symlinks.

A side effect of those conversions is to fix some object lifetime
issues caused by a mixup between devm_kzalloc and proper object
lifetime.

This series also adds a /dev{/fsi}/cfamN chardev for raw CFAM
access that will superseed the existing "raw" sysfs file.

Finally there's also a locking fix to avoid horrible mixups if
the "rescan" file is poked while a rescan is already in progress.




[GIT PULL] FSI updates round 3 for 4.19

2018-07-23 Thread Benjamin Herrenschmidt
Hi Greg !

This adds support for offloading the FSI low level bitbanging to the
ColdFire coprocessor of the Aspeed SoCs. All the pre-requisites have
already been merged, this is the final piece in the puzzle.

This branch also pull gpio/ib-aspeed which is a topic branch already
in gpio/for-next (and thus in next) whic contains pre-requisites.

Finally, there's also a bug fix to the sbefifo driver for some
inconsistent use of a mutex in the error handling code.

Note: The oddball "origin" commit of that pull request comes from
the fact that Linus Walleij gpio/ib-aspeed tree is ahead of mine,
so pulling his branch pulled a bunch of already upstream other things
in. As such, i had to base this pull request off a local merge. This
shouldn't affect you at all, and allows the diffstat below to be
correct.

The bindings updates have been acked by Rob and the corresponding
device-tree updates will be merged by Joel via the ARM SoC tree.

Thanks !
Ben.

The following changes since commit d5e748ff2b996d83489ac76c072e8b99f9ecef13:

  Merge remote-tracking branch 'gpio/ib-aspeed' into upstream-ready (2018-07-23 
15:21:39 +1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git 
tags/fsi-updates-2018-07-24

for you to fetch changes up to 0a213777d1dd879092225a7aa847b6e9b3a1c267:

  fsi: Add support for device-tree provided chip IDs (2018-07-23 16:27:32 +1000)

----
Benjamin Herrenschmidt (6):
  devres: Add devm_of_iomap()
  dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"
  fsi: master-ast-cf: Add new FSI master using Aspeed ColdFire
  fsi: sbefifo: Fix inconsistent use of ffdc mutex
  dt-bindings: fsi: Add optional chip-id to CFAMs
  fsi: Add support for device-tree provided chip IDs

 Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt |   36 +++
 Documentation/devicetree/bindings/fsi/fsi.txt   |5 +
 drivers/fsi/Kconfig |9 ++
 drivers/fsi/Makefile|1 +
 drivers/fsi/cf-fsi-fw.h |  157 

 drivers/fsi/fsi-core.c  |   24 +
 drivers/fsi/fsi-master-ast-cf.c | 1438 
+
 drivers/fsi/fsi-sbefifo.c   |   13 ++-
 include/linux/device.h  |4 +
 include/trace/events/fsi_master_ast_cf.h|  150 
++
 lib/devres.c|   36 +++
 11 files changed, 1869 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
 create mode 100644 drivers/fsi/cf-fsi-fw.h
 create mode 100644 drivers/fsi/fsi-master-ast-cf.c
 create mode 100644 include/trace/events/fsi_master_ast_cf.h



Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-22 Thread Benjamin Herrenschmidt
On Sat, 2018-07-21 at 09:53 +0200, Greg Kroah-Hartman wrote:
> > > So I hate using kobject_get_unless_zero(), and resisted ever adding it
> > > to the tree as it shows a bad locking/tree situation as you point out
> > > here.  But for some reason, the block developers seemed to insist they
> > > needed it, and so it is in the tree for them.  I don't want it to spread
> > > if at all possible, which makes me want to reject this patch as this
> > > should be "a case that can never be hit".
> > 
> > Except it can in that situation... at least unless you get my patch 2/2
> > (or the newer one I'm about to send that avoids adding a child counter
> > and uses the one in kernfs instead).
> 
> To follow up on this.  I've applied the 2/2 patch for this series, so
> this 1/2 "should" not be needed.  Ben, if you still see this trigger
> with that, I guess I can take this, but it still feels wrong to me :)

I've tested my repro-case with only patch 2 and it doesn't trigger
anymore, as expected.

That said, you and Linus might want to sync your clocks on how sysfs
"should" work :-)

Cheers,
Ben.




Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-19 Thread Benjamin Herrenschmidt
On Fri, 2018-07-20 at 09:37 +0930, Andrew Jeffery wrote:
> > 
> > Andrew, can you start with a list that shows what you expect us to need
> > on our systems ?
> > 
> 
> Okay, our Witherspoon and Romulus platforms containing the ASPEED AST2500 
> currently need the following tuneables exposed:
> 
> > From the SCU:
> 
> - Debug UART enable
> - VGA DAC mux
> - VGA scratch registers 0-7
> - LPC SuperIO decode enable
> - VGA MMIO decode enable
> 
> > From the LPC controller:
> 
> - iLPC2AHB enable
> - SuperIO scratch registers 0x20-0x2f
> 
> (The LPC controller is just as much of a collection of random bits as the SCU)
> 
> Lastly, our Palmetto platform uses an AST2400 which has fewer features 
> compared to the AST2500. Its tuneable list is the same as the above with the 
> exception of "Debug UART enable".
> 
> Tuneables that we may need to expose in the future include:
> 
> > From the SCU:
> 
> - PCI VID/DID for the BMC PCIe device
> - VGA device enable (may need to be disabled if the platform contains a 
> discrete graphics processor)

Additionally there's a bunch of resigters controlling the mapping of
various MMIO regions of the BMC PCIe device to portions of the BMC
address space. I'm not sure what's the best way to handle that.

This specific set might require a dedicated device as a subnode of
the SCU in the DT that contains all the mappings as properties... 

That or we consider them static enough and just whack it in u-boot.

> > From the LPC controller:
> 
> - UART mux
> 
> Alexander, Eugene, can you chime in with your platforms' needs?
> 
> Cheers,
> 
> Andrew


Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-18 Thread Benjamin Herrenschmidt
On Thu, 2018-07-19 at 11:58 +0930, Andrew Jeffery wrote:
> > > I agree
> > > that not using /dev/mem is a good thing, but there are several ways
> > > you could do that independent of any DT binding.
> > 
> > Such as ? The only other option is to have one or more kernel drivers
> > that have built-in the precise and complete list of those tunables that
> > need to be exposed.
> > 
> > It's not impossible, but I worry that it's not going to scale terribly
> > well, and that vendors will constantly "fork" that driver to add
> > different things to that list.
> 
> Picking on that last point, "different things" doesn't necessarily
> mean "different fields in the SoC" (though it may). We could just
> need to use different initial values for the fields already
> described. 

I don't worry about initial values too much. Those could be specified
by userspace. It would be trivial to have something akin to
/etc/sysctl.conf that contains the initial values and a script blasting
them early. In fact I prefer this being in userspace to be frank. It
could be in the initramfs if we want it done early enough, maybe with a
usermode helper.

Unless we have some demonstrable reasons why some of these must be set
before the various kernel drivers initialize.

> So taking that into account, the field descriptions could vary wildly
> from platform to platform - where "platform" here is the combination
> of the BMC, its host system, and the host system's required
> configuration - not just referring to the BMC in isolation.  That in
> turn may cause a fork of the driver (changes that are incompatible
> with other platforms), or not scale terribly well as Ben suggests.

I really only worry about the actual set of registers/fields. I think
folding in initial values goes a bit too far.

> The initial value concept can help reduce the impact on userspace as
> userpace may no-longer need to care about it, so I think it's a
> desirable property.

I don't worry too much about userspace containing system specific bits
and pieces such as a config file with the appropriate initial values
for the platform. Userspace will have to contain platform specific
stuff regardless (if anything, stuff like thermal control is
intrinsicly different from one platform to the next).

>  With respect to devicetree, the field definitions will stay in the
> SoC dtsi, but the initial values would be described on a per-platform 
> basis in the dts.

If the fields are in the SoC dtsi, then Rob has a reasonable argument
that the list of fields is rather fixed for a given SoC and thus could
be hard wired in the driver 

I think at this stage, we need to create more clarity. In order to do
that, I think we need to come up with a list, as exhaustive as we can,
of what are the fields/register we need for our platforms, and find any
reason why userspace wouldn't be a good enough location to stick the
initial values.

Then we need Dell and Yadro (and maybe others) to help with their
variant of that list for the same SoCs to begin with.

With that, we'll get an idea of whether we think we can get a
reasonable stable long term list that's appropriate for most vendors.

If that's the case, then my objections to have it in the kernel instead
of the DT fade away a bit.

If one or two of those fields absolutely must have their defaults
early, then we can consider a specific set of one or two properties in
the SoC node for that SoC family that provides those specific defaults.

But unless we have that list, I think we'll be throwing too many
hypotheticals around to convince Rob and others.

Andrew, can you start with a list that shows what you expect us to need
on our systems ?

Cheers,
Ben



Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-18 Thread Benjamin Herrenschmidt
On Wed, 2018-07-18 at 13:50 -0600, Rob Herring wrote:
> 
> > So Rob, I think that's precisely where the disconnect is.
> > 
> > I think we all (well hopefully) agree that those few tunables don't fit
> > in any existing subystem and aren't likely to ever do (famous last
> > words...).
> > 
> > Where we disagree is we want to make this parametrized via the DT, and
> > you want us to hard wire the list in some kind of SoC driver for a
> > given SoC family/version.
> > 
> > The reason I think hard wiring the list in the driver is not a great
> > solution is that that list in itself is prone to variations, possibly
> > fairly often, between boards, vendors, versions of boards, etc...
> 
> Can we separate the list of what's enabled from the details of the
> registers?  Even if we put all this into DT, we may still want to have
> some separation for dts file structure. Some portion of this has to be
> SoC specific because you are simply exposing SoC registers.

Not sure I understand what you mean by "what's enabled".

The goal here is to expose register "fields" (nor raw values, some
registers need locking for RMW etc... the kernel would handle that via
syscon locking I expect), so basically named "tunables" to userspace.

Userspace is the one that knows the values for a given system.

> > We can't know for sure every SoC tunable (out of the gazillions in
> > those chips) are going to be needed for a given system. We know which
> > ones we do use for ours, and that's a couple of handfuls, but it could
> > be that Dell need a slightly different set, and so might Yadro, or so
> > might our next board revision for that matter.
> 
> That's very hand wavy. Do you have some concrete examples (i.e. dts
> files) showing the differences.

We could list what we have on the pile for some of our IBM systems
today, but we would need Dell and Yadro to chime in with what they
need.

I still, from experience with that stuff and gut feeling, am pretty
sure it's going to be a moving list, and updating the kernel driver
constantly isn't going to fly.

> One problem I'm having with this is you are still going to need per
> board specifics in userspace.

Yes. Userspace is ultimately the one that knows what needs to be done
on a given machine.

>  You may be moving some of details out,
> but moving to DT is not going to completely eliminate that. 

We aren't trying to either. We are trying to make sure we don't need to
change the *kernel* all the time, in part bcs we are pushing hard for
OpenBMC vendors to use upstream with, if possible, no vendor changes.

So userspace has to know the board specific tunables anyways. Today in
many systems, it does that by whacking /dev/mem. I guess you can
understand why that is bad :-)

> I agree
> that not using /dev/mem is a good thing, but there are several ways
> you could do that independent of any DT binding.

Such as ? The only other option is to have one or more kernel drivers
that have built-in the precise and complete list of those tunables that
need to be exposed.

It's not impossible, but I worry that it's not going to scale terribly
well, and that vendors will constantly "fork" that driver to add
different things to that list.

I might be wrong here, so I'd like for Eugene (Dell) and Alexander
(Yadro) to chime in, but experience with BMCs has shown that we
regularily , as we add a feature or rewrite something, need to find
another new magic SoC tunable the HW manufacturer hid somewhere that
will make our stuff work.

> > Now, updating the device-tree in the board flash with whatever vendor
> > specific information is needed is a LOT easier than getting the kernel
> > driver constantly updated. The device-tree after all is there to
> > reflect among other things system specific ways in which the SoC is
> > wired and configured. This is rather close...
> 
> Sadly, updating my kernel is easier than updating my PC firmware
> (though packaged firmware on my current laptop changes that). I can
> assure you that ARM boards are generally much worse in that regard.
> BTW, you may want to pay attention to EBBR[1][2]. Not sure how much
> you care for BMCs, but there may be some interest.

You are conflating your host kernel and your BMC here. The BMC kernel
is part of the "firmware", as is its DTS and the BMC userspace.

(Again this isn't the host DTS, this is the BMC DTS).

They get all updated together. My point isn't about the ease or
difficulty for a *user* to udpate their BMC, in that case the solutions
above are equivalent.

The point is from a system vendor perspective. A system vendor using
OpenBMC will *customize* their BMC build in various ways. Typically
they *will* have a custom DT since this is what represents their
specific system and they will have some specific userspace bits.

However, we are trying very hard for them *not* to fork the kernel, and
if possible move OpenBMC towards a fully upstream kernel (still working
on getting all the SoC drivers cleaned up and pushed up but 

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-16 Thread Benjamin Herrenschmidt
On Mon, 2018-07-16 at 07:55 -0600, Rob Herring wrote:
> If that data is one set per SoC, then i'm not that concerned having
> platform-specific data in the driver. That doesn't mean the driver is
> not "generic". It's still not clear to me in this thread, how much of
> this is board specific, but given that you've placed all the data in
> an SoC dtsi file it seems to be all per SoC.

So Rob, I think that's precisely where the disconnect is.

I think we all (well hopefully) agree that those few tunables don't fit
in any existing subystem and aren't likely to ever do (famous last
words...).

Where we disagree is we want to make this parametrized via the DT, and
you want us to hard wire the list in some kind of SoC driver for a
given SoC family/version.

The reason I think hard wiring the list in the driver is not a great
solution is that that list in itself is prone to variations, possibly
fairly often, between boards, vendors, versions of boards, etc...

We can't know for sure every SoC tunable (out of the gazillions in
those chips) are going to be needed for a given system. We know which
ones we do use for ours, and that's a couple of handfuls, but it could
be that Dell need a slightly different set, and so might Yadro, or so
might our next board revision for that matter.

Now, updating the device-tree in the board flash with whatever vendor
specific information is needed is a LOT easier than getting the kernel
driver constantly updated. The device-tree after all is there to
reflect among other things system specific ways in which the SoC is
wired and configured. This is rather close...

Cheers,
Ben.



Re: [PATCH v3 1/5] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-16 Thread Benjamin Herrenschmidt
On Mon, 2018-07-16 at 09:33 -0600, Rob Herring wrote:
> On Thu, Jul 12, 2018 at 01:48:43PM +1000, Benjamin Herrenschmidt wrote:
> > This isn't per-se a real device, it's a pseudo-device that
> > represents the use of the Aspeed built-in ColdFire to
> > implement the FSI protocol by bitbanging the GPIOs instead
> > of doing it from the ARM core.
> > 
> > Thus it's a drop-in replacement for the existing
> > fsi-master-gpio pseudo-device for use on systems based
> > on the Aspeed chips. It has most of the same properties,
> > plus some more needed to operate the coprocessor.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  .../bindings/fsi/fsi-master-ast-cf.txt| 36 +++
> >  1 file changed, 36 insertions(+)
> >  create mode 100644 
> > Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt 
> > b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > new file mode 100644
> > index ..431bf8a423ce
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > @@ -0,0 +1,36 @@
> > +Device-tree bindings for ColdFire offloaded gpio-based FSI master driver
> > +
> > +
> > +Required properties:
> > + - compatible =
> > +   "aspeed,ast2400-cf-fsi-master" for an AST2400 based system
> > +   or
> > +   "aspeed,ast2500-cf-fsi-master" for an AST2500 based system
> > +
> > + - clock-gpios = ;: GPIO for FSI clock
> > + - data-gpios = ; : GPIO for FSI data signal
> > + - enable-gpios = ;   : GPIO for enable signal
> > + - trans-gpios = ;: GPIO for voltage translator 
> > enable
> > + - mux-gpios = ;  : GPIO for pin multiplexing with other
> > +  functions (eg, external FSI 
> > masters)
> > + - memory-region = ;  : Reference to the reserved 
> > memory for
> > +  the ColdFire. Must be 2M aligned 
> > on
> > + AST2400 and 1M aligned on AST2500
> > + - aspeed,sram = ;: Reference to the SRAM node.
> > + - aspeed,cvic = ;: Reference to the CVIC node.
> > +
> > +Examples:
> > +
> > +fsi-master {
> > +compatible = "aspeed,ast2500-cf-fsi-master", "fsi-master";
> > +
> > +   clock-gpios = < 0>;
> > +data-gpios = < 1>;
> > +enable-gpios = < 2>;
> > +trans-gpios = < 3>;
> > +mux-gpios = < 4>;
> > +
> > +   memory-region = <_memory>;
> > +   sram = <>;
> > +   cvic = <>;
> 
> Need to update the example. With that

Ah right, thanks. The next spin will have that fixed and will go into
the fsi tree.

> Reviewed-by: Rob Herring 


Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-12 Thread Benjamin Herrenschmidt
On Thu, 2018-07-12 at 09:11 -0600, Rob Herring wrote:
> On Wed, Jul 11, 2018 at 6:54 PM Andrew Jeffery  wrote:
> > 
> > Hi Rob,
> > 
> > Thanks for the response.
> > 
> > On Thu, 12 Jul 2018, at 05:34, Rob Herring wrote:
> > > On Wed, Jul 11, 2018 at 03:01:19PM +0930, Andrew Jeffery wrote:
> > > > Baseboard Management Controllers (BMCs) are embedded SoCs that exist to
> > > > provide remote management of (primarily) server platforms. BMCs are
> > > > often tightly coupled to the platform in terms of behaviour and provide
> > > > many hardware features integral to booting and running the host system.
> > > > 
> > > > Some of these hardware features are simple, for example scratch
> > > > registers provided by the BMC that are exposed to both the host and the
> > > > BMC. In other cases there's a single bit switch to enable or disable
> > > > some of the provided functionality.
> > > > 
> > > > The documentation defines bindings for fields in registers that do not
> > > > integrate well into other driver models yet must be described to allow
> > > > the BMC kernel to assume control of these features.
> > > 
> > > So we'll get a new binding when that happens? That will break
> > > compatibility.
> > 
> > Can you please expand on this? I'm not following.
> 
> If we have a subsystem in the future, then there would likely be an
> associated binding which would be different. So if you update the DT,
> then old kernels won't work with it.

What kind of "subsystem" ? There is almost no way there could be one
for that sort of BMC tunables. We've look at several BMC chips out
there and requirements from several vendors, BIOS and system
manufacturers and it's all over the place.

> > I feel like this is an argument of tradition. Maybe people have
> > been dissuaded from doing so when they don't have a reasonable use-
> > case? I'm not saying that what I'm proposing is unquestionably
> > reasonable, but I don't want to dismiss it out of hand.
> 
> One of experience. The one that stands out is clock bindings.
> Initially we were doing a node per clock modelling which could end up
> being 100s of nodes and is difficult to get right (with DT being an
> ABI).
> 
> It comes up with system controller type blocks too that just have a
> bunch of random registers. Those change in every SoC and not in any
> controlled or ordered way that would make describing the individual
> sub-functions in DT worthwhile.

So what's the alternative ? Because without something like what we
propose, what's going to happen is /dev/mem ... that's what people do
today.

> > > A node per register bit doesn't scale.
> > 
> > It isn't meant to scale in terms of a single system. Using it
> > extensively is very likely wrong. Separately, register-bit-led does
> > pretty much the same thing. Doesn't the scale argument apply there?
> > Who is to stop me from attaching an insane number of LEDs to a
> > system?
> 
> Review.
> 
> If you look, register-bit-led is rarely used outside of some ARM, Ltd.
> boards. It's simply quite rare to have MMIO register bits that have a
> fixed function of LED control.

Well, same here, we hope to review what goes upstream to make it
reasonable. Otherwise it doens't matter. If a random vendor, let's say
IBM, chose to chip a system where they put an insane amount of cruft in
there, it will only affect those systems's BMC and the userspace stack
on it.

Thankfully that stack is OpenBMC and IBM is aiming at having their
device-tree's upstream, thus reviewed, thus it won't happen.

*Anything* can be abused. The point here is that we have a number,
thankfully rather small, maybe a dozen or two, of tunables that are
quite specific to a combination (system vendor, bmc vendor, system
model) which control a few HW features that essentially do *NOT* fit in
a subsystem.

For everything that does, we have created proper drivers (and are doing
more).


> > Obviously if there are lots of systems using it sparingly and
> > legitimately then maybe there's a scale issue, but isn't that just
> > a reality of different hardware designs? Whoever is implementing
> > support for the system is going to have to describe the hardware
> > one way or another.
> > 
> > > 
> > > Maybe this should be modelled using GPIO binding? There's a line there
> > > too as whether the signals are "general purpose" or not.
> > 
> > I don't think so, mainly because some of the things it is intended to be 
> > used for are not GPIOs. For instance, take the DAC mux I've described in 
> > the patch. It doesn't directly influence anything external to the SoC (i.e. 
> > it's certainly not a traditional GPIO in any sense). However, it does 
> > *indirectly* influence the SoC's behaviour by muxing the DAC internally 
> > between:
> > 
> > 0. VGA device exposed on the host PCIe bus
> > 1. The "Graphics CRT" controller
> > 2. VGA port A
> > 3. VGA port B
> 
> And this mux control is fixed in the SoC design?

This specific family of SoC (Aspeed) support those 4 

Re: [PATCH v3 3/5] fsi: master-ast-cf: Add new FSI master using Aspeed ColdFire

2018-07-12 Thread Benjamin Herrenschmidt
On Thu, 2018-07-12 at 17:53 +1000, Joel Stanley wrote:
> On 12 July 2018 at 13:48, Benjamin Herrenschmidt
>  wrote:
> > The Aspeed AST2x00 can contain a ColdFire v1 coprocessor which
> > is currently unused on OpenPower systems.
> > 
> > This adds an alternative to the fsi-master-gpio driver that
> > uses that coprocessor instead of bit banging from the ARM
> > core itself. The end result is about 4 times faster.
> > 
> > The firmware for the coprocessor and its source code can be
> > found at https://github.com/ozbenh/cf-fsi and is system specific.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  drivers/fsi/Kconfig  |9 +
> >  drivers/fsi/Makefile |1 +
> >  drivers/fsi/cf-fsi-fw.h  |  157 +++
> >  drivers/fsi/fsi-master-ast-cf.c  | 1438 ++
> >  include/trace/events/fsi_master_ast_cf.h |  150 +++
> >  5 files changed, 1755 insertions(+)
> >  create mode 100644 drivers/fsi/cf-fsi-fw.h
> >  create mode 100644 drivers/fsi/fsi-master-ast-cf.c
> >  create mode 100644 include/trace/events/fsi_master_ast_cf.h
> > 
> > diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
> > index 322cec393cf2..e0220d1e1357 100644
> > --- a/drivers/fsi/Kconfig
> > +++ b/drivers/fsi/Kconfig
> > @@ -27,6 +27,15 @@ config FSI_MASTER_HUB
> > allow chaining of FSI links to an arbitrary depth.  This allows for
> > a high target device fanout.
> > 
> > +config FSI_MASTER_AST_CF
> > +   tristate "FSI master based on Aspeed ColdFire coprocessor"
> > +   depends on GPIOLIB
> > +   depends on GPIO_ASPEED
> > +   ---help---
> > +   This option enables a FSI master using the AST2400 and AST2500 GPIO
> > +   lines driven by the internal ColdFire coprocessor. This requires
> > +   the corresponding machine specific ColdFire firmware to be 
> > available.
> 
> The "machine specific" part isn't true anymore, is it?

Right, I'll fixup the changelog before pushing if that's the last spin.

> I gave this a spin on a palmetto and it appeared to work fine for me.
> 
> Tested-by: Joel Stanley 

Thanks !

Cheers,
Ben
.


[PATCH v3 1/5] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-11 Thread Benjamin Herrenschmidt
This isn't per-se a real device, it's a pseudo-device that
represents the use of the Aspeed built-in ColdFire to
implement the FSI protocol by bitbanging the GPIOs instead
of doing it from the ARM core.

Thus it's a drop-in replacement for the existing
fsi-master-gpio pseudo-device for use on systems based
on the Aspeed chips. It has most of the same properties,
plus some more needed to operate the coprocessor.

Signed-off-by: Benjamin Herrenschmidt 
---
 .../bindings/fsi/fsi-master-ast-cf.txt| 36 +++
 1 file changed, 36 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt

diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt 
b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
new file mode 100644
index ..431bf8a423ce
--- /dev/null
+++ b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
@@ -0,0 +1,36 @@
+Device-tree bindings for ColdFire offloaded gpio-based FSI master driver
+
+
+Required properties:
+ - compatible =
+   "aspeed,ast2400-cf-fsi-master" for an AST2400 based system
+   or
+   "aspeed,ast2500-cf-fsi-master" for an AST2500 based system
+
+ - clock-gpios = ;: GPIO for FSI clock
+ - data-gpios = ; : GPIO for FSI data signal
+ - enable-gpios = ;   : GPIO for enable signal
+ - trans-gpios = ;: GPIO for voltage translator enable
+ - mux-gpios = ;  : GPIO for pin multiplexing with other
+  functions (eg, external FSI masters)
+ - memory-region = ;  : Reference to the reserved memory for
+  the ColdFire. Must be 2M aligned on
+ AST2400 and 1M aligned on AST2500
+ - aspeed,sram = ;: Reference to the SRAM node.
+ - aspeed,cvic = ;: Reference to the CVIC node.
+
+Examples:
+
+fsi-master {
+compatible = "aspeed,ast2500-cf-fsi-master", "fsi-master";
+
+   clock-gpios = < 0>;
+data-gpios = < 1>;
+enable-gpios = < 2>;
+trans-gpios = < 3>;
+mux-gpios = < 4>;
+
+   memory-region = <_memory>;
+   sram = <>;
+   cvic = <>;
+}
-- 
2.17.1



[PATCH v3 3/5] fsi: master-ast-cf: Add new FSI master using Aspeed ColdFire

2018-07-11 Thread Benjamin Herrenschmidt
The Aspeed AST2x00 can contain a ColdFire v1 coprocessor which
is currently unused on OpenPower systems.

This adds an alternative to the fsi-master-gpio driver that
uses that coprocessor instead of bit banging from the ARM
core itself. The end result is about 4 times faster.

The firmware for the coprocessor and its source code can be
found at https://github.com/ozbenh/cf-fsi and is system specific.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/fsi/Kconfig  |9 +
 drivers/fsi/Makefile |1 +
 drivers/fsi/cf-fsi-fw.h  |  157 +++
 drivers/fsi/fsi-master-ast-cf.c  | 1438 ++
 include/trace/events/fsi_master_ast_cf.h |  150 +++
 5 files changed, 1755 insertions(+)
 create mode 100644 drivers/fsi/cf-fsi-fw.h
 create mode 100644 drivers/fsi/fsi-master-ast-cf.c
 create mode 100644 include/trace/events/fsi_master_ast_cf.h

diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
index 322cec393cf2..e0220d1e1357 100644
--- a/drivers/fsi/Kconfig
+++ b/drivers/fsi/Kconfig
@@ -27,6 +27,15 @@ config FSI_MASTER_HUB
allow chaining of FSI links to an arbitrary depth.  This allows for
a high target device fanout.
 
+config FSI_MASTER_AST_CF
+   tristate "FSI master based on Aspeed ColdFire coprocessor"
+   depends on GPIOLIB
+   depends on GPIO_ASPEED
+   ---help---
+   This option enables a FSI master using the AST2400 and AST2500 GPIO
+   lines driven by the internal ColdFire coprocessor. This requires
+   the corresponding machine specific ColdFire firmware to be available.
+
 config FSI_SCOM
tristate "SCOM FSI client device driver"
---help---
diff --git a/drivers/fsi/Makefile b/drivers/fsi/Makefile
index 75fdc6d8cfc4..62687ec86d2e 100644
--- a/drivers/fsi/Makefile
+++ b/drivers/fsi/Makefile
@@ -2,6 +2,7 @@
 obj-$(CONFIG_FSI) += fsi-core.o
 obj-$(CONFIG_FSI_MASTER_HUB) += fsi-master-hub.o
 obj-$(CONFIG_FSI_MASTER_GPIO) += fsi-master-gpio.o
+obj-$(CONFIG_FSI_MASTER_AST_CF) += fsi-master-ast-cf.o
 obj-$(CONFIG_FSI_SCOM) += fsi-scom.o
 obj-$(CONFIG_FSI_SBEFIFO) += fsi-sbefifo.o
 obj-$(CONFIG_FSI_OCC) += fsi-occ.o
diff --git a/drivers/fsi/cf-fsi-fw.h b/drivers/fsi/cf-fsi-fw.h
new file mode 100644
index ..712df0461911
--- /dev/null
+++ b/drivers/fsi/cf-fsi-fw.h
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0+
+#ifndef __CF_FSI_FW_H
+#define __CF_FSI_FW_H
+
+/*
+ * uCode file layout
+ *
+ * ...03ff : m68k exception vectors
+ * 0400...04ff : Header info & boot config block
+ * 0500... : Code & stack
+ */
+
+/*
+ * Header info & boot config area
+ *
+ * The Header info is built into the ucode and provide version and
+ * platform information.
+ *
+ * the Boot config needs to be adjusted by the ARM prior to starting
+ * the ucode if the Command/Status area isn't at 0x32 in CF space
+ * (ie. beginning of SRAM).
+ */
+
+#define HDR_OFFSET 0x400
+
+/* Info: Signature & version */
+#define HDR_SYS_SIG0x00/* 2 bytes system signature */
+#define  SYS_SIG_SHARED0x5348
+#define  SYS_SIG_SPLIT 0x5350
+#define HDR_FW_VERS0x02/* 2 bytes Major.Minor */
+#define HDR_API_VERS   0x04/* 2 bytes Major.Minor */
+#define  API_VERSION_MAJ   2   /* Current version */
+#define  API_VERSION_MIN   1
+#define HDR_FW_OPTIONS 0x08/* 4 bytes option flags */
+#define  FW_OPTION_TRACE_EN0x0001  /* FW tracing enabled */
+#define FW_OPTION_CONT_CLOCK   0x0002  /* Continuous clocking 
supported */
+#define HDR_FW_SIZE0x10/* 4 bytes size for combo image */
+
+/* Boot Config: Address of Command/Status area */
+#define HDR_CMD_STAT_AREA  0x80/* 4 bytes CF address */
+#define HDR_FW_CONTROL 0x84/* 4 bytes control flags */
+#define FW_CONTROL_CONT_CLOCK  0x0002  /* Continuous clocking 
enabled */
+#define FW_CONTROL_DUMMY_RD0x0004  /* Extra dummy read 
(AST2400) */
+#define FW_CONTROL_USE_STOP0x0008  /* Use STOP 
instructions */
+#define HDR_CLOCK_GPIO_VADDR   0x90/* 2 bytes offset from GPIO base */
+#define HDR_CLOCK_GPIO_DADDR   0x92/* 2 bytes offset from GPIO base */
+#define HDR_DATA_GPIO_VADDR0x94/* 2 bytes offset from GPIO base */
+#define HDR_DATA_GPIO_DADDR0x96/* 2 bytes offset from GPIO base */
+#define HDR_TRANS_GPIO_VADDR   0x98/* 2 bytes offset from GPIO base */
+#define HDR_TRANS_GPIO_DADDR   0x9a/* 2 bytes offset from GPIO base */
+#define HDR_CLOCK_GPIO_BIT 0x9c/* 1 byte bit number */
+#define HDR_DATA_GPIO_BIT  0x9d/* 1 byte bit number */
+#define HDR_TRANS_GPIO_BIT 0x9e/* 1 byte bit number */
+
+/*
+ *  Command/Status area layout: Main part
+ */
+
+/* Command/Status register:
+ *
+ * +---+
+ * | STAT | RLEN | CLEN | C

[PATCH v3 2/5] devres: Add devm_of_iomap()

2018-07-11 Thread Benjamin Herrenschmidt
There are still quite a few cases where a device might want
to get to a different node of the device-tree, obtain the
resources and map them.

We have of_iomap() and of_io_request_and_map() but they both
have shortcomings, such as not returning the size of the
resource found (which can be useful) and not being "managed".

This adds a devm_of_iomap() that provides all of these and
should probably replace uses of the above in most drivers.

Signed-off-by: Benjamin Herrenschmidt 
Reviewed-by: Linus Walleij 
Reviewed-by: Joel Stanley 
---
 include/linux/device.h |  4 
 lib/devres.c   | 36 
 2 files changed, 40 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 477956990f5e..96249d790374 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -688,6 +688,10 @@ extern void devm_free_pages(struct device *dev, unsigned 
long addr);
 
 void __iomem *devm_ioremap_resource(struct device *dev, struct resource *res);
 
+void __iomem *devm_of_iomap(struct device *dev,
+   struct device_node *node, int index,
+   resource_size_t *size);
+
 /* allows to add/remove a custom action to devres stack */
 int devm_add_action(struct device *dev, void (*action)(void *), void *data);
 void devm_remove_action(struct device *dev, void (*action)(void *), void 
*data);
diff --git a/lib/devres.c b/lib/devres.c
index 5bec1120b392..faccf1a037d0 100644
--- a/lib/devres.c
+++ b/lib/devres.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 
 enum devm_ioremap_type {
DEVM_IOREMAP = 0,
@@ -162,6 +163,41 @@ void __iomem *devm_ioremap_resource(struct device *dev, 
struct resource *res)
 }
 EXPORT_SYMBOL(devm_ioremap_resource);
 
+/*
+ * devm_of_iomap - Requests a resource and maps the memory mapped IO
+ *for a given device_node managed by a given device
+ *
+ * Checks that a resource is a valid memory region, requests the memory
+ * region and ioremaps it. All operations are managed and will be undone
+ * on driver detach of the device.
+ *
+ * This is to be used when a device requests/maps resources described
+ * by other device tree nodes (children or otherwise).
+ *
+ * @dev:   The device "managing" the resource
+ * @node:   The device-tree node where the resource resides
+ * @index: index of the MMIO range in the "reg" property
+ * @size:  Returns the size of the resource (pass NULL if not needed)
+ * Returns a pointer to the requested and mapped memory or an ERR_PTR() encoded
+ * error code on failure. Usage example:
+ *
+ * base = devm_of_iomap(>dev, node, 0, NULL);
+ * if (IS_ERR(base))
+ * return PTR_ERR(base);
+ */
+void __iomem *devm_of_iomap(struct device *dev, struct device_node *node, int 
index,
+   resource_size_t *size)
+{
+   struct resource res;
+
+   if (of_address_to_resource(node, index, ))
+   return IOMEM_ERR_PTR(-EINVAL);
+   if (size)
+   *size = resource_size();
+   return devm_ioremap_resource(dev, );
+}
+EXPORT_SYMBOL(devm_of_iomap);
+
 #ifdef CONFIG_HAS_IOPORT_MAP
 /*
  * Generic iomap devres
-- 
2.17.1



[PATCH v3 4/5] arm: dts: OpenPower Romulus system can use coprocessor for FSI

2018-07-11 Thread Benjamin Herrenschmidt
Signed-off-by: Benjamin Herrenschmidt 
---
 arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts 
b/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts
index 347938673c83..4f4e42e47e3f 100644
--- a/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts
+++ b/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts
@@ -37,6 +37,11 @@
compatible = "shared-dma-pool";
reusable;
};
+
+   coldfire_memory: codefire_memory@9ef0 {
+   reg = <0x9ef0 0x0010>;
+   no-map;
+   };
};
 
leds {
@@ -56,10 +61,13 @@
};
 
fsi: gpio-fsi {
-   compatible = "fsi-master-gpio", "fsi-master";
+   compatible = "aspeed,ast2500-cf-fsi-master", "fsi-master";
#address-cells = <2>;
#size-cells = <0>;
-   no-gpio-delays;
+
+   memory-region = <_memory>;
+   aspeed,sram = <>;
+   aspeed,cvic = <>;
 
clock-gpios = < ASPEED_GPIO(AA, 0) GPIO_ACTIVE_HIGH>;
data-gpios = < ASPEED_GPIO(AA, 2) GPIO_ACTIVE_HIGH>;
-- 
2.17.1



[PATCH v3 5/5] arm: dts: OpenPower Palmetto system can use coprocessor for FSI

2018-07-11 Thread Benjamin Herrenschmidt
This switches away from userspace bitbanging to kernel FSI
using the coprocessor.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts | 28 ++-
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts 
b/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts
index e6095f51ecf5..feec377c04d4 100644
--- a/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts
+++ b/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts
@@ -31,6 +31,11 @@
no-map;
reg = <0x9800 0x0100>; /* 16MB */
};
+
+   coldfire_memory: codefire_memory@5ee0 {
+   reg = <0x5ee0 0x0020>;
+   no-map;
+   };
};
 
leds {
@@ -49,6 +54,22 @@
};
};
 
+   fsi: gpio-fsi {
+   compatible = "aspeed,ast2400-cf-fsi-master", "fsi-master";
+   #address-cells = <2>;
+   #size-cells = <0>;
+
+   memory-region = <_memory>;
+   aspeed,sram = <>;
+   aspeed,cvic = <>;
+
+   clock-gpios = < ASPEED_GPIO(A, 4) GPIO_ACTIVE_HIGH>;
+   data-gpios = < ASPEED_GPIO(A, 5) GPIO_ACTIVE_HIGH>;
+   mux-gpios = < ASPEED_GPIO(A, 6) GPIO_ACTIVE_HIGH>;
+   enable-gpios = < ASPEED_GPIO(D, 0) GPIO_ACTIVE_HIGH>;
+   trans-gpios = < ASPEED_GPIO(H, 6) GPIO_ACTIVE_HIGH>;
+   };
+
gpio-keys {
compatible = "gpio-keys";
 
@@ -323,13 +344,6 @@
line-name = "SYS_PWROK_BMC";
};
 
-   pin_gpio_h6 {
-   gpio-hog;
-   gpios = ;
-   output-high;
-   line-name = "SCM1_FSI0_DATA_EN";
-   };
-
pin_gpio_h7 {
gpio-hog;
gpios = ;
-- 
2.17.1



[PATCH 0/5] fsi: Coldfire coprocessor offload

2018-07-11 Thread Benjamin Herrenschmidt
This series implements support for offloading the FSI protocol bitbanging
to the ColdFire secondary core of the Aspeed SoCs. The result increases
FSI performance by a factor of 4, and on systems that don't support async
FSI clock, provide much more regular and continuous clocking which helps
reliability.

This series is much smaller than the previous submissions as I already
merged all the "preparatory" work into the FSI tree. This is now strictly
the coldfire support and is now only waiting for ack of the DT bindings
(and whatever other review comments might come my way) before I put it
into the FSI tree and sends it to Greg.

There are two dependencies to be able to build/use this. The above
prep work:

  https://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git/log/

And the Aspeed GPIO driver changes for handling with GPIO lines ownership and
handshaking. The patches have been merged into the GPIO tree and a dedicated
immutable topic branch and can be found here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio.git/log/?h=ib-aspeed

Finally, the driver needs a machine specific firmware file. The firwmare
is open source and available at:

https://github.com/ozbenh/cf-fsi

I will submit it to linux-firmware if there's enough popular demand ;-)

v2. Fix misuse of devm_kzalloc for objects containing a struct device
in fsi-master-gpio and fsi-master-ast-cf (similar fixes for the
various FSI drivers will come later a part of rework to move away
from misc devs).

v3. Sparse fix and updated DT bindings as per Rob's comments.





[GIT PULL] FSI updates round 2 for 4.19

2018-07-11 Thread Benjamin Herrenschmidt
Hi Greg !

This is the second round of updates of the FSI stack.

These comprise of:

 - Some build fixes detected with COMPILE_TEST when
used on "other" archs

 - Sparse warning fixes

 - Some object lifetime fixes

 - A new feature to control some of the protocol delays

 - Overheaul of the "SCOM" driver used to access the POWER
processor internal SCOM bus.

 - Preparatory work for the new ColdFire coprocessor based FSI master
work which, along some other newer work is still going through the last
few rounds of polish and DT bindings reviews.

The following changes since commit 9f4a8a2d7f9d71093f41c4bb0ef8707e8145bad3:

  fsi/sbefifo: Add driver for the SBE FIFO (2018-06-12 14:05:39 +1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git 
tags/fsi-updates-2018-07-12

for you to fetch changes up to fea9cf321c916e9372874e6f2af1bf0b5beb89fb:

  fsi: Move various master definitions to a common header (2018-07-12 12:06:02 
+1000)

----
Benjamin Herrenschmidt (17):
  fsi: sbefifo: Remove unneeded semicolon
  fsi: scom: Add mutex around FSI2PIB accesses
  fsi: scom: Whitespace fixes
  fsi: scom: Fixup endian annotations
  fsi: scom: Add register definitions
  fsi: scom: Major overhaul
  fsi: sbefifo: Fix checker warning about late NULL check
  fsi: Move code around to avoid forward declaration
  fsi: Add mechanism to set the tSendDelay and tEchoDelay values
  fsi: master-gpio: Rename and adjust send delay
  fsi: master-gpio: Add support for link_config
  fsi: master-gpio: Add more tracepoints
  fsi: master-gpio: Remove unused definitions
  fsi: master-gpio: Remove "GPIO" prefix on some definitions
  fsi: Don't use device_unregister() in fsi_master_register()
  fsi: master-gpio: Add missing release function
  fsi: Move various master definitions to a common header

Eddie James (1):
  fsi: sbefifo: Add missing mutex_unlock

Guenter Roeck (1):
  fsi/sbefifo: Add dependency on OF_ADDRESS

Joel Stanley (3):
  fsi: sbefifo: Fix sparse warnings
  fsi: master-hub: Fix sparse warnings
  fsi: core: Fix sparse warnings

 drivers/fsi/Kconfig|   1 +
 drivers/fsi/fsi-core.c | 215 +++--
 drivers/fsi/fsi-master-gpio.c  | 170 ---
 drivers/fsi/fsi-master-hub.c   |   5 +-
 drivers/fsi/fsi-master.h   |  35 +
 drivers/fsi/fsi-sbefifo.c  |  15 +-
 drivers/fsi/fsi-scom.c | 433 
+++---
 include/trace/events/fsi_master_gpio.h |  59 
 include/uapi/linux/fsi.h   |  58 
 9 files changed, 813 insertions(+), 178 deletions(-)
 create mode 100644 include/uapi/linux/fsi.h


Re: [PATCH 06/14] fsi: master-gpio: Add more tracepoints

2018-07-11 Thread Benjamin Herrenschmidt
On Thu, 2018-06-28 at 13:41 +0930, Joel Stanley wrote:
> On 27 June 2018 at 08:55, Benjamin Herrenschmidt
>  wrote:
> > This adds a few more tracepoints that have proven useful when
> > debugging issues with the FSI bus.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> 
> Reviewed-by: Joel Stanley 
> 
> > ---
> >  drivers/fsi/fsi-master-gpio.c  | 16 ---
> >  include/trace/events/fsi_master_gpio.h | 59 ++
> >  2 files changed, 69 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/fsi/fsi-master-gpio.c b/drivers/fsi/fsi-master-gpio.c
> > index 48e0e65b2982..a00a85aa6d56 100644
> > --- a/drivers/fsi/fsi-master-gpio.c
> > +++ b/drivers/fsi/fsi-master-gpio.c
> > @@ -130,10 +130,17 @@ static void set_sda_output(struct fsi_master_gpio 
> > *master, int value)
> > 
> >  static void clock_zeros(struct fsi_master_gpio *master, int count)
> >  {
> > +   trace_fsi_master_gpio_clock_zeros(master, count);
> > set_sda_output(master, 1);
> > clock_toggle(master, count);
> >  }
> > 
> > +static void echo_delay(struct fsi_master_gpio *master)
> > +{
> > +   clock_zeros(master, master->t_echo_delay);
> > +}
> 
> This doesn't look like it belongs in this patch.
> 
> You've just moved it up. Not worth a re-spin.

I've done more, I made it use clock_zeros() instead of open coding it
in order to share the tracepoint.
> 
> > +
> > +
> >  static void serial_in(struct fsi_master_gpio *master, struct fsi_gpio_msg 
> > *msg,
> > uint8_t num_bits)
> >  {
> > @@ -279,16 +286,19 @@ static void build_ar_command(struct fsi_master_gpio 
> > *master,
> > addr_bits = 2;
> > opcode_bits = 2;
> > opcode = FSI_GPIO_CMD_SAME_AR;
> > +   trace_fsi_master_gpio_cmd_same_addr(master);
> > 
> > } else if (check_relative_address(master, id, addr, _addr)) {
> > /* 8 bits plus sign */
> > addr_bits = 9;
> > addr = rel_addr;
> > opcode = FSI_GPIO_CMD_REL_AR;
> > +   trace_fsi_master_gpio_cmd_rel_addr(master, rel_addr);
> > 
> > } else {
> > addr_bits = 21;
> > opcode = FSI_GPIO_CMD_ABS_AR;
> > +   trace_fsi_master_gpio_cmd_abs_addr(master, addr);
> > }
> > 
> > /*
> > @@ -337,12 +347,6 @@ static void build_epoll_command(struct fsi_gpio_msg 
> > *cmd, uint8_t slave_id)
> > msg_push_crc(cmd);
> >  }
> > 
> > -static void echo_delay(struct fsi_master_gpio *master)
> > -{
> > -   set_sda_output(master, 1);
> > -   clock_toggle(master, master->t_echo_delay);
> > -}
> > -
> >  static void build_term_command(struct fsi_gpio_msg *cmd, uint8_t slave_id)
> >  {
> > cmd->bits = 0;


Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-11 Thread Benjamin Herrenschmidt
On Wed, 2018-07-11 at 14:04 -0600, Rob Herring wrote:
> On Wed, Jul 11, 2018 at 03:01:19PM +0930, Andrew Jeffery wrote:
> > Baseboard Management Controllers (BMCs) are embedded SoCs that exist to
> > provide remote management of (primarily) server platforms. BMCs are
> > often tightly coupled to the platform in terms of behaviour and provide
> > many hardware features integral to booting and running the host system.
> > 
> > Some of these hardware features are simple, for example scratch
> > registers provided by the BMC that are exposed to both the host and the
> > BMC. In other cases there's a single bit switch to enable or disable
> > some of the provided functionality.
> > 
> > The documentation defines bindings for fields in registers that do not
> > integrate well into other driver models yet must be described to allow
> > the BMC kernel to assume control of these features.
> 
> So we'll get a new binding when that happens? That will break 
> compatibility.

What do you mean ? The binding provides a way to describe arbitrary
register fields and expose them to userspace. I'm not sure what you
mean by backward compatibility.

There is simply no way these things can be "abstracted" via some kind
of nice layered Linux subsystem of some sort. It's just a whole bunch
of knobs that control various things integral to the operation of the
host system. Andrew can provide a more precise list if you need to.

> 
> > 
> > Signed-off-by: Andrew Jeffery 
> > ---
> > 
> > Since RFC v1:
> > 
> > * Add a commit message
> > * Minor changes to documented labels
> > 
> >  .../bindings/misc/bmc-misc-ctrl.txt   | 252 ++
> >  MAINTAINERS   |   6 +
> >  2 files changed, 258 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt 
> > b/Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt
> > new file mode 100644
> > index ..2c869fcc7ef2
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt
> > @@ -0,0 +1,252 @@
> > +BMC Miscellaneous Control Interfaces
> > +
> > +
> > +Baseboard Management Controllers (BMCs) often have an array of hardware
> > +features that need to be described but are awkward to sensibly expose.
> > +
> > +This bindings document provides a generic mechanism for describing such
> > +features, covering read-only (RO), read-modify-write (RMW) and
> > +write-1-set/write-1-clear (W1SC) semantics.
> 
> If we wanted a generic mechanism for single register bits/fields in DT, 
> we'd have one already. A node per register bit doesn't scale.

Register *field*. I think you are looking at this from the wrong angle.
This isn't about exposing 236821643 SoC registers in an embedded
product. This is about exposing a dozen or so tunables that don't
control the SoC from the perspective of the OS running on it, but
controls how various features of the SoC are exposed to the *host*
system. Examples could be which of the SoC internal PCIe devices
exposed to the host is enabled (the SoC can appear as a GPU, a BMC misc
device or both or neither, it can have a DMA controller or not,
etc...). Other examples are scratch registers that need to be set to
system specific values by userspace, which the BIOS of the host will
read to determine some configuration information. That sort of thing.

There isn't that many, scalability isn't a problem. Both the list of
registers of interest and what needs to go in them is very much
system/vendor specific. This is a way for the kernel to act as a
"conduit" that doesn't need to know the specifics of a given
system/vendor/BIOS combination.

Your response smells like yet another case of trying to apply one of
those general "blanket" rules to something where it's completely
unapplicable.

> Maybe this should be modelled using GPIO binding? There's a line there 
> too as whether the signals are "general purpose" or not.

GPIOs are binary values right ? These are arbitrary fields. We want
arbitrary fields. This is really needed Rob, otherwise we'll NEVER have
BMC support upstream. 

The other option will be a dozen random ad-hoc char-devs that would
have to be updated every time a new bit needs to be added or changed.

Ben.


Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-10 Thread Benjamin Herrenschmidt
On Tue, 2018-07-10 at 16:55 -0700, Linus Torvalds wrote:
> On Tue, Jul 10, 2018 at 4:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > > I like that fix, which should make this patch obsolete, right?
> > 
> > Yes, for that specific issue, but Linus seemed to think patch 1 was the
> > "right thing to do" regardless...
> 
> I would definitely prefer either a kobject_get_unless_zero() or a
> warning if it is ever zero.
> 
> The fact that right now it silently can do known bad things, and then
> causes odd corruption _later_, is not good.

Maybe we should make the existing warning in refcount unconditional ? 

This is whe warning that allowed me to pull that string with the
gluedirs and fix what ended up very weird crashes caused by the
resulting use-after-free. So it's definitely valuable.

As for whether we should generalize kobject_get_unless_zero() vs avoid
using it alltogether, this is a debate for you to have with Greg ;-)

He seems to think nothing should ever try to get a zeroed object (which
I tend to agree with, it's close to my opinion that visibility and
lifetime should be disconnected).

That being said, there are existing constructs such as the "late
removal from sysfs from kobject_release" that mean that zero-reference
objects *can* still be visible, via either sysfs or ksets, as far as I
can tell.

So it's a bit of a mess... but if we chose to go Greg's way we should
probably put a WARN'ing in kobject_release() for late-removal from
sysfs since this exposes 0-ref objects to outside visibility.

Cheers,
Ben.



Re: [PATCH v11 5/8] i2c: fsi: Add transfer implementation

2018-07-10 Thread Benjamin Herrenschmidt
On Tue, 2018-07-10 at 12:52 -0500, Eddie James wrote:
> 
> On 07/09/2018 05:41 PM, Wolfram Sang wrote:
> > > + cmd |= FIELD_PREP(I2C_CMD_ADDR, msg->addr >> 1);
> > 
> > I just noticed this and wonder: Don't you need the LSB of the address?
> > It is not the RW flag, this is encoded in msg->flags.
> 
> So, the hardware interprets the LSB as the RW flag. It wouldn't be 
> possible to have a device addressed with the LSB set on this I2C master.

What do you mean ? That doesn't sound right... 
> 
> > 
> > Also, no seperate handling for 10 bit addresses? Technically, 7-bit 0x50
> > is different on the wire from 10-bit 0x050. This is minor, though. There
> > are no 10-bit devices out there. Still, did you test 10-bit support?
> 
> Indeed, real 10-bit addresses require some additional manipulation of 
> this I2C master in order to work. We don't support it right now.
> 
> Thanks,
> Eddie
> 
> > 
> > Rest looks good.
> > 


Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-10 Thread Benjamin Herrenschmidt
On Tue, 2018-07-10 at 16:55 +0200, Greg Kroah-Hartman wrote:
> On Tue, Jul 10, 2018 at 09:44:33AM +1000, Benjamin Herrenschmidt wrote:
> > On Sat, 2018-07-07 at 18:48 +0200, Greg Kroah-Hartman wrote:
> > > No, kobject_get() should never happen on a 0 refcount object.  That
> > > being said, the code does allow it, so if things are messed up, it will
> > > happen.  I think that change happened when the switch to refcount_t
> > > occured, before then we would WARN_ON() if that ever happened.  I should
> > > go fix that up, and restore that old behavior, so that syzbot starts
> > > complaining loudly when stuff like that hits.
> > > 
> > > So I hate using kobject_get_unless_zero(), and resisted ever adding it
> > > to the tree as it shows a bad locking/tree situation as you point out
> > > here.  But for some reason, the block developers seemed to insist they
> > > needed it, and so it is in the tree for them.  I don't want it to spread
> > > if at all possible, which makes me want to reject this patch as this
> > > should be "a case that can never be hit".
> > 
> > Except it can in that situation... at least unless you get my patch 2/2
> > (or the newer one I'm about to send that avoids adding a child counter
> > and uses the one in kernfs instead).
> 
> I like that fix, which should make this patch obsolete, right?

Yes, for that specific issue, but Linus seemed to think patch 1 was the
"right thing to do" regardless...

I suggest you read the backlog of thread if you are interested in the
ins and outs of his position, we had a rather extensive discussion on
this stuff.

Cheers,
Ben.


Re: [PATCH v2 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-10 Thread Benjamin Herrenschmidt
On Tue, 2018-07-10 at 16:55 +0200, Greg Kroah-Hartman wrote:
> 
> > +/**
> > + * kobject_has_children - Returns whether a kobject has children.
> > + * @kobj: the object to test
> > + *
> > + * This will return whether a kobject has other kobjects as children.
> > + *
> > + * It does NOT account for the presence of attribute files, only sub
> > + * directories. It also assumes there is no concurrent addition or
> > + * removal of such children, and thus relies on external locking.
> > + */
> > +static inline bool kobject_has_children(struct kobject *kobj)
> > +{
> > +   WARN_ON_ONCE(kref_read(>kref) == 0);
> 
> Why warn on?  Who is going to hit this and how are you going to fix up
> the syzbot reports?  :)

Well, that's it, the hope is nobody ever hits it ... but if one does it
would be useful to get a backtrace to figure it out. You can shoot the
reports my way I suppose :-)

> Anyway, this looks good, I can just take this and not the 1/2 patch now,
> right?  I really didn't like that patch.

Yes, it will fix the practical problem. As for patch 1, it's rather
funny, you and Linus seem to have a completely opposite idea of how
this stuff should work :-)

Cheers,
Ben.

> thanks,
> 
> greg k-h


Re: [PATCH v2 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-09 Thread Benjamin Herrenschmidt
On Mon, 2018-07-09 at 17:33 -0700, Linus Torvalds wrote:
> On Mon, Jul 9, 2018 at 5:29 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > For devices with a class, we create a "glue" directory between
> > the parent device and the new device with the class name.
> > 
> > This directory is never "explicitely" removed when empty however,
> 
> explicitly
> 
> Is the mis-spelling why you had the quotes? I do find that spelling in
> the kernel, but not in drivers/base/.

No idea :-) Just my poor english, I'm not sure why I put quotes, I
think I meant *explictly* as enphasis, not sure.

> > This fixes it by instead doing an explicit kobject_del() when
> > the glue dir is empty, by keeping track of the number of
> > child devices of the gluedir.
> 
> Ack. This looks good to me.
> 
> I didn't see your 1/2 - you should probably re-send that one too so
> that Greg doesn't have to fish for it. But I'll Ack that one too in
> this same email regardless.

OK, I didn't re-send it. Greg just nak'ed it though :-)

Cheers,
Ben.


Re: [PATCH] gpio: aspeed: fix compile testing warning

2018-07-09 Thread Benjamin Herrenschmidt
On Mon, 2018-07-09 at 16:56 +0200, Arnd Bergmann wrote:
> Gcc cannot always see that BUG_ON(1) is guaranteed to not
> return, so we get a warning message in some configurations:
> 
> drivers/gpio/gpio-aspeed.c: In function 'bank_reg':
> drivers/gpio/gpio-aspeed.c:244:1: error: control reaches end of non-void 
> function [-Werror=return-type]
> 
> Using a plain BUG() is easier here and avoids the problem.
> 
> Fixes: 44ddf559d579 ("gpio: aspeed: Rework register type accessors")
> Signed-off-by: Arnd Bergmann 

Acked-by: Benjamin Herrenschmidt 

Linus, can you plonk that on top of the patches in that topic branch
you created ?

> ---
>  drivers/gpio/gpio-aspeed.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpio/gpio-aspeed.c b/drivers/gpio/gpio-aspeed.c
> index 1e00f4045f9d..2342e154029b 100644
> --- a/drivers/gpio/gpio-aspeed.c
> +++ b/drivers/gpio/gpio-aspeed.c
> @@ -240,7 +240,7 @@ static inline void __iomem *bank_reg(struct aspeed_gpio 
> *gpio,
>   case reg_cmdsrc1:
>   return gpio->base + bank->cmdsrc_regs + GPIO_CMDSRC_1;
>   }
> - BUG_ON(1);
> + BUG();
>  }
>  
>  #define GPIO_BANK(x) ((x) >> 5)




[PATCH v2 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-09 Thread Benjamin Herrenschmidt
For devices with a class, we create a "glue" directory between
the parent device and the new device with the class name.

This directory is never "explicitely" removed when empty however,
this is left to the implicit sysfs removal done by kobject_release()
when the object loses its last reference via kobject_put().

This is problematic because as long as it's not been removed from
sysfs, it is still present in the class kset and in sysfs directory
structure.

The presence in the class kset exposes a use after free bug fixed
by the previous patch, but the presence in sysfs means that until
the kobject is released, which can take a while (especially with
kobject debugging), any attempt at re-creating such as binding a
new device for that class/parent pair, will result in a sysfs
duplicate file name error.

This fixes it by instead doing an explicit kobject_del() when
the glue dir is empty, by keeping track of the number of
child devices of the gluedir.

This is made easy by the fact that all glue dir operations are
done with a global mutex, and there's already a function
(cleanup_glue_dir) called in all the right places taking that
mutex that can be enhanced for this. It appears that this was
in fact the intent of the function, but the implementation was
wrong.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/base/core.c |  2 ++
 include/linux/kobject.h | 17 +
 2 files changed, 19 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index e9eff2099896..93c0f8d1a447 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1572,6 +1572,8 @@ static void cleanup_glue_dir(struct device *dev, struct 
kobject *glue_dir)
return;
 
mutex_lock(_mutex);
+   if (!kobject_has_children(glue_dir))
+   kobject_del(glue_dir);
kobject_put(glue_dir);
mutex_unlock(_mutex);
 }
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..270b40515e79 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -116,6 +116,23 @@ extern void kobject_put(struct kobject *kobj);
 extern const void *kobject_namespace(struct kobject *kobj);
 extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
 
+/**
+ * kobject_has_children - Returns whether a kobject has children.
+ * @kobj: the object to test
+ *
+ * This will return whether a kobject has other kobjects as children.
+ *
+ * It does NOT account for the presence of attribute files, only sub
+ * directories. It also assumes there is no concurrent addition or
+ * removal of such children, and thus relies on external locking.
+ */
+static inline bool kobject_has_children(struct kobject *kobj)
+{
+   WARN_ON_ONCE(kref_read(>kref) == 0);
+
+   return kobj->sd && kobj->sd->dir.subdirs;
+}
+
 struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;



Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-09 Thread Benjamin Herrenschmidt
On Sat, 2018-07-07 at 18:51 +0200, Greg Kroah-Hartman wrote:
> 
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index b610816eb887..e9eff2099896 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -1517,11 +1517,13 @@ static struct kobject *get_device_parent(struct 
> > device *dev,
> >  
> > /* find our class-directory at the parent and reference it */
> > spin_lock(>class->p->glue_dirs.list_lock);
> > -   list_for_each_entry(k, >class->p->glue_dirs.list, entry)
> > +   list_for_each_entry(k, >class->p->glue_dirs.list, entry) {
> > if (k->parent == parent_kobj) {
> > -   kobj = kobject_get(k);
> > -   break;
> > +   kobj = kobject_get_unless_zero(k);
> > +   if (kobj)
> > +   break;
> 
> A parent directory _should_ not ever be able to be removed before the
> object being removed was, as we should have had a reference to it,
> right?  So I don't see how this can get hit "in real life".
> 
> Yes, enabling kobject debugging does keep objects around for a long time
> in order to try to help figure out where people are messing up their
> usage of them.  What subsystem is doing this in a way that causes
> problems here?  Shouldn't we fix that up instead?

The broken subsystem is the driver core itself :-) See the descriptions
here and in patch 2/2.

Note: This is a more generic problem with ksets vs relying on the magic
sysfs cleanup happening in kobject_release().

Any kobject that is a member of a kset and doesn't get explicitely
removed from sysfs with kobject_del() prior to dropping the last
reference with kobject_put() (and thus relies instead on the automatic
cleanup done by kobject_release()) will be exposed to the race:

The last kobject_put() will drop the refcount to 0 while the object is
still in the kset. Only some amount of time later (which can be very
short or very long if you enable kobject debugging), will
kobject_release() take it out of sysfs and out of the kset.

Thus the object will be visible, with a 0 refcount, to anything that
"walks" the kset during that period.

This is exactly what happens with the gluedirs in the device core, but
it could happen elsewhere for all I know.

Cheers,
Ben.


Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-09 Thread Benjamin Herrenschmidt
On Sat, 2018-07-07 at 18:48 +0200, Greg Kroah-Hartman wrote:
> No, kobject_get() should never happen on a 0 refcount object.  That
> being said, the code does allow it, so if things are messed up, it will
> happen.  I think that change happened when the switch to refcount_t
> occured, before then we would WARN_ON() if that ever happened.  I should
> go fix that up, and restore that old behavior, so that syzbot starts
> complaining loudly when stuff like that hits.
> 
> So I hate using kobject_get_unless_zero(), and resisted ever adding it
> to the tree as it shows a bad locking/tree situation as you point out
> here.  But for some reason, the block developers seemed to insist they
> needed it, and so it is in the tree for them.  I don't want it to spread
> if at all possible, which makes me want to reject this patch as this
> should be "a case that can never be hit".

Except it can in that situation... at least unless you get my patch 2/2
(or the newer one I'm about to send that avoids adding a child counter
and uses the one in kernfs instead).

Ben.



Re: [PATCH 11/14] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-06 Thread Benjamin Herrenschmidt
On Thu, 2018-07-05 at 10:08 -0600, Rob Herring wrote:
> 
> > It's not really a SOC block from a vendor, it's a pseudo-device in a
> > way. The current one that doesn't use the coldfire offload is just
> > compatible "fsi-master-gpio".
> > 
> > I can add a vendor but what should it be ? aspeed because it runs on
> > the aspeed SoCs only ? ibm because we wrote it and FSI is an IBM
> > protocol ?
> 
> I would say aspeed as it is tied to their chip.
> 
> > 
> > - doesn't make sense here though.
> 
> But you do already have  in the compatible, but in a slightly
> different form and position. And "cf" is the block.
>
> So I'd propose: aspeed,ast2500-cf-fsi-master

Ok, I'll do that.

> > 
> > > > +
> > > > + - clock-gpios = ;: GPIO for FSI clock
> > > > + - data-gpios = ; : GPIO for FSI data signal
> > > > + - enable-gpios = ;   : GPIO for enable signal
> > > > + - trans-gpios = ;: GPIO for voltage 
> > > > translator enable
> > > > + - mux-gpios = ;  : GPIO for pin multiplexing with 
> > > > other
> > > 
> > > So the gpio info is pased to the CF? Otherwise, what's the point of
> > > having these in DT?
> > 
> > In the original version you are looking at, they are not passed to the
> > CF per-se but the driver does use aspeed GPIO specific APIs to
> > configure them to be owned by the CF, so we need the references.
> 
> Okay.
> 
> > However, I've just reworked the ucode with a few tricks to avoid losing
> > singificant performance, so that we can indeed pass them to the CF,
> > thus avoiding the need for a per-system image, so the above are here to
> > stay.
> > 
> > > > +  functions (eg, external FSI 
> > > > masters)
> > > > + - memory-region = ;  : Reference to the reserved 
> > > > memory for
> > > > +  the ColdFire. Must be 2M 
> > > > aligned on
> > > > + AST2400 and 1M aligned on AST2500
> > > > + - sram = ;   : Reference to the SRAM 
> > > > node.
> > > > + - cvic = ;   : Reference to the CVIC 
> > > > node.
> > > 
> > > Vendor prefixes.
> > 
> > On what ? Why would an "sram" pointer have a vendor prefix ? Or a
> > memory region pointer ?
> 
> memory-region is a standard property. sram and cvic are not, so should
> have vendor prefixes. However, perhaps we should add a common "sram"
> property to sram/sram.txt.

Hrm... originally vendor prefix on properties were for things that
didn't have a binding afaik. IE a way for an f-code driver to stash
things in the DT that were vendor specific and retrieve them from the
OS driver for example.

Here with well defined bindings it's rather bloaty don't you think ? I
don't strongly object to doing it, it's just a bit ... odd.

Cheers,
Ben.



Re: [PATCH 11/14] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-03 Thread Benjamin Herrenschmidt
On Tue, 2018-07-03 at 16:30 -0600, Rob Herring wrote:
> On Wed, Jun 27, 2018 at 09:26:02AM +1000, Benjamin Herrenschmidt wrote:
> > This isn't per-se a real device, it's a pseudo-device that
> > represents the use of the Aspeed built-in ColdFire to
> > implement the FSI protocol by bitbanging the GPIOs instead
> > of doing it from the ARM core.
> > 
> > Thus it's a drop-in replacement for the existing
> > fsi-master-gpio pseudo-device for use on systems for which
> > a corresponding firmware file exists. It has most of the
> > same properties, plus some more needed to operate the
> > coprocessor.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  .../bindings/fsi/fsi-master-ast-cf.txt| 36 +++
> >  1 file changed, 36 insertions(+)
> >  create mode 100644 
> > Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt 
> > b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > new file mode 100644
> > index ..50913ae685cc
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > @@ -0,0 +1,36 @@
> > +Device-tree bindings for ColdFire offloaded gpio-based FSI master driver
> > +
> > +
> > +Required properties:
> > + - compatible =
> > + "fsi-master-ast-2400-cf" for an AST2400 based system
> > +   or
> > + "fsi-master-ast-2500-cf" for an AST2500 based system
> 
> ,-

It's not really a SOC block from a vendor, it's a pseudo-device in a
way. The current one that doesn't use the coldfire offload is just
compatible "fsi-master-gpio".

I can add a vendor but what should it be ? aspeed because it runs on
the aspeed SoCs only ? ibm because we wrote it and FSI is an IBM
protocol ?

- doesn't make sense here though.

> > +
> > + - clock-gpios = ;: GPIO for FSI clock
> > + - data-gpios = ; : GPIO for FSI data signal
> > + - enable-gpios = ;   : GPIO for enable signal
> > + - trans-gpios = ;: GPIO for voltage translator 
> > enable
> > + - mux-gpios = ;  : GPIO for pin multiplexing with other
> 
> So the gpio info is pased to the CF? Otherwise, what's the point of 
> having these in DT?

In the original version you are looking at, they are not passed to the
CF per-se but the driver does use aspeed GPIO specific APIs to
configure them to be owned by the CF, so we need the references.

However, I've just reworked the ucode with a few tricks to avoid losing
singificant performance, so that we can indeed pass them to the CF,
thus avoiding the need for a per-system image, so the above are here to
stay.

> > +  functions (eg, external FSI 
> > masters)
> > + - memory-region = ;  : Reference to the reserved 
> > memory for
> > +  the ColdFire. Must be 2M aligned 
> > on
> > + AST2400 and 1M aligned on AST2500
> > + - sram = ;   : Reference to the SRAM node.
> > + - cvic = ;   : Reference to the CVIC node.
> 
> Vendor prefixes.

On what ? Why would an "sram" pointer have a vendor prefix ? Or a
memory region pointer ?

> > + - fw-name = ; : The name used to construct 
> > the firmware
> > + file name (cf-fsi-.bin)
> 
> firmware-name is used in some other places (and is the full filename).

It's gone in the latest version as there's a single FW file now for all
systems.
> 
> > + - fw-platform-sig = ;  : A signature value that must 
> > match the one
> > + contained in the firmware 
> > image. Known
> > + values are listed in the firmware 
> > interface
> > + file cf-fsi-fw.h
> 
> Vendor prefix unless you think this could be common.

It's going away.

> > +Examples:
> > +
> > +fsi-master {
> > +compatible = "fsi-master-gpio", "fsi-master";
> 
> Forget to update the example?

Indeed :)

> > +clock-gpios = < 0>;
> > +data-gpios = < 1>;
> > +enable-gpios = < 2>;
> > +trans-gpios = < 3>;
> > +mux-gpios = < 4>;
> > +}
> > -- 
> > 2.17.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe devicetree" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fsi: sbefifo: Add missing mutex_unlock

2018-07-03 Thread Benjamin Herrenschmidt
On Tue, 2018-07-03 at 16:21 -0500, Eddie James wrote:
> There was no unlock of the FFDC mutex.
> 
> Fixes: 9f4a8a2d7f9d ("fsi/sbefifo: Add driver for the SBE FIFO")
> Signed-off-by: Eddie James 

Thanks. 

Cheers,
Ben.

> ---
>  drivers/fsi/fsi-sbefifo.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/fsi/fsi-sbefifo.c b/drivers/fsi/fsi-sbefifo.c
> index 4f076fd..a34ff99 100644
> --- a/drivers/fsi/fsi-sbefifo.c
> +++ b/drivers/fsi/fsi-sbefifo.c
> @@ -194,6 +194,7 @@ static void sbefifo_dump_ffdc(struct device *dev, const 
> __be32 *ffdc,
>   }
>   dev_warn(dev, 
> "+---+\n");
>   }
> + mutex_unlock(_ffdc_mutex);
>  }
>  
>  int sbefifo_parse_status(struct device *dev, u16 cmd, __be32 *response,


Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-03 Thread Benjamin Herrenschmidt
On Tue, 2018-07-03 at 08:46 -0700, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 03, 2018 at 03:22:47PM +1000, Benjamin Herrenschmidt wrote:
> > +bool kernfs_has_children(struct kernfs_node *kn)
> > +{
> > +   bool has_children = false;
> > +   struct kernfs_node *pos;
> > +
> > +   /* Lockless shortcut */
> > +   if (RB_EMPTY_NODE(>rb))
> > +   return false;
> 
> Hmm... shouldn't it be testing !rb_first(kn->dir.children)?  The above
> would test whether @kn itself is unlinked.

Ah possibly, the rb tree usage got me confused a few times in there.

> > +
> > +   /* Now check for active children */
> > +   mutex_lock(_mutex);
> > +   pos = NULL;
> > +   while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
> > +   if (kernfs_active(pos)) {
> > +   has_children = true;
> > +   break;
> > +   }
> > +   }
> > +   mutex_unlock(_mutex);
> > +
> > +   return has_children;
> 
> The active ref is there to synchronize removal against in-flight reads
> so that kernfs_remove() can wait for them to drain.  On return from
> kernfs_remove(), the node is guaranteed to be off the rbtree and the
> above test isn't necessary.  !rb_first() test should be enough
> (provided that there's external synchronization, of course).

Ok. Yes there's external synchronization. So a simple !rb_first test
doesn't need the mutex I suppose ?

Cheers,
Ben.

> 
> Thanks.
> 


Re: [RFC PATCH 1/4] dts: misc: Add bindings documentation for bmc-misc-ctrl

2018-07-03 Thread Benjamin Herrenschmidt
On Tue, 2018-07-03 at 16:31 +0200, Greg KH wrote:
> On Wed, Jul 04, 2018 at 12:16:49AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2018-07-03 at 09:50 +0200, Greg KH wrote:
> > > On Tue, Jul 03, 2018 at 05:04:10PM +1000, Andrew Jeffery wrote:
> > > > Signed-off-by: Andrew Jeffery 
> > > > ---
> > > 
> > > I can't take patches without any changelog text at all :(
> > 
> > Greg (and replying to your other comments as well)...
> > 
> > This is an RFC series, it's not meant for you to take at this point,
> > it's about discussing the overall approach to exposing BMC random
> > "tunables" as explained in patch 0 of the series.
> > 
> > Yes the individual patches aren't yet at the level of polish for a
> > formal submission, we (naively ?) thought that's what the whole RFC tag
> > is about :-)
> 
> Oh come on, putting a basic "here is what this patch does" comment
> should be part of every patch, otherwise what is there to comment on if
> we don't know what is going on in the patch itself?

Well, it adds documentation :-) You can just read the patch which is
... the documentation :)
> 
> Anyway, I provided a bunch of feedback to the "real" patch in this
> series...

Yes, you did that's fine. Thanks.

Cheers,
Ben.



Re: [PATCH -next] fsi/sbefifo: Add dependency on OF_ADDRESS

2018-07-03 Thread Benjamin Herrenschmidt
On Tue, 2018-07-03 at 06:43 -0700, Guenter Roeck wrote:
> The driver calls of_platform_device_create() which is only available
> if OF_ADDRESS is enabled. When building sparc64 images, this results
> in
> 
> ERROR: "of_platform_device_create" [drivers/fsi/fsi-sbefifo.ko] undefined!
> 
> Fixes: 9f4a8a2d7f9d ("fsi/sbefifo: Add driver for the SBE FIFO")
> Cc: Benjamin Herrenschmidt 
> Signed-off-by: Guenter Roeck 

Thanks Guenter ! I'll include this into the fsi tree.

(I was wondering what that kbuild report on sparc64 was about... I must
have misread the code, I didn't realize of_platform_device_create
wasn't under CONFIG_OF alone).

Cheers,
Ben.

> ---
>  drivers/fsi/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
> index 24f84a96b8b9..9c08f467a7bb 100644
> --- a/drivers/fsi/Kconfig
> +++ b/drivers/fsi/Kconfig
> @@ -34,6 +34,7 @@ config FSI_SCOM
>  
>  config FSI_SBEFIFO
>   tristate "SBEFIFO FSI client device driver"
> + depends on OF_ADDRESS
>   ---help---
>   This option enables an FSI based SBEFIFO device driver. The SBEFIFO is
>   a pipe-like FSI device for communicating with the self boot engine


Re: [RFC PATCH 1/4] dts: misc: Add bindings documentation for bmc-misc-ctrl

2018-07-03 Thread Benjamin Herrenschmidt
On Tue, 2018-07-03 at 09:50 +0200, Greg KH wrote:
> On Tue, Jul 03, 2018 at 05:04:10PM +1000, Andrew Jeffery wrote:
> > Signed-off-by: Andrew Jeffery 
> > ---
> 
> I can't take patches without any changelog text at all :(

Greg (and replying to your other comments as well)...

This is an RFC series, it's not meant for you to take at this point,
it's about discussing the overall approach to exposing BMC random
"tunables" as explained in patch 0 of the series.

Yes the individual patches aren't yet at the level of polish for a
formal submission, we (naively ?) thought that's what the whole RFC tag
is about :-)

Cheers,
Ben.



Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt
On Tue, 2018-07-03 at 12:39 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-02 at 19:26 -0700, Linus Torvalds wrote:
> > On Mon, Jul 2, 2018 at 7:15 PM Linus Torvalds
> >  wrote:
> > > 
> > > It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> > > USE UNDER ANY CIRCUMSTANCES.  Think of it more as a "something like
> > > this might work, but probably doesn't". Maybe it gives you an idea,
> > > although that idea might be "Linus has finally lost it".
> > 
> > Even if it were to work, it should probably just be done inside kernfs
> > and exposed as some kind of "kernfs_remove_if_empty()" function.
> 
> That would be harder, we'd have to expose it via a
> kobject_del_if_empty() then.
> 
> Since we control completely when things are added/removed from the
> gluedir and have a big fat mutex for it, I don't think locking is much
> of an issue. At least in that specific case.

Ok, kernfs was a tad tougher to crack than I hoped but here's a
prototype lightly tested.

Tejun, Greg, does it look reasonable to you ?

Cheers,
Ben.

diff --git a/drivers/base/core.c b/drivers/base/core.c
index b610816eb887..9166f68276c6 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1570,6 +1570,8 @@ static void cleanup_glue_dir(struct device *dev, struct 
kobject *glue_dir)
return;
 
mutex_lock(_mutex);
+   if (!kobject_has_children(glue_dir))
+   kobject_del(glue_dir);
kobject_put(glue_dir);
mutex_unlock(_mutex);
 }
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc19340b..e4bec09bc602 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1251,6 +1251,29 @@ void kernfs_activate(struct kernfs_node *kn)
mutex_unlock(_mutex);
 }
 
+bool kernfs_has_children(struct kernfs_node *kn)
+{
+   bool has_children = false;
+   struct kernfs_node *pos;
+
+   /* Lockless shortcut */
+   if (RB_EMPTY_NODE(>rb))
+   return false;
+
+   /* Now check for active children */
+   mutex_lock(_mutex);
+   pos = NULL;
+   while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
+   if (kernfs_active(pos)) {
+   has_children = true;
+   break;
+   }
+   }
+   mutex_unlock(_mutex);
+
+   return has_children;
+}
+
 static void __kernfs_remove(struct kernfs_node *kn)
 {
struct kernfs_node *pos;
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index ab25c8b6d9e3..776350ac1cbc 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -300,6 +300,12 @@ static inline bool kernfs_ns_enabled(struct kernfs_node 
*kn)
return kn->flags & KERNFS_NS;
 }
 
+/**
+ * kernfs_has_children - Returns whether a kernfs node has children.
+ * @kn: the node to test
+ */
+bool kernfs_has_children(struct kernfs_node *kn);
+
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
 int kernfs_path_from_node(struct kernfs_node *root_kn, struct kernfs_node *kn,
  char *buf, size_t buflen);
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..1e3294ab95f0 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -116,6 +116,17 @@ extern void kobject_put(struct kobject *kobj);
 extern const void *kobject_namespace(struct kobject *kobj);
 extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
 
+/**
+ * kobject_has_children - Returns whether a kobject has children.
+ * @kobj: the object to test
+ */
+static inline bool kobject_has_children(struct kobject *kobj)
+{
+   WARN_ON_ONCE(kref_read(>kref) == 0);
+
+   return kobj->sd && kernfs_has_children(kobj->sd);
+}
+
 struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;



Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt
On Mon, 2018-07-02 at 19:26 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 7:15 PM Linus Torvalds
>  wrote:
> > 
> > It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> > USE UNDER ANY CIRCUMSTANCES.  Think of it more as a "something like
> > this might work, but probably doesn't". Maybe it gives you an idea,
> > although that idea might be "Linus has finally lost it".
> 
> Even if it were to work, it should probably just be done inside kernfs
> and exposed as some kind of "kernfs_remove_if_empty()" function.

That would be harder, we'd have to expose it via a
kobject_del_if_empty() then.

Since we control completely when things are added/removed from the
gluedir and have a big fat mutex for it, I don't think locking is much
of an issue. At least in that specific case.

> We happen to know that we hold the lock that creates all the entries
> in the glue directory (and we don't allow users to move or create
> stuff, afaik, even if we alloc chmod etc), so there should be no
> races, but a generic kernfs helper function would probably have to get
> proper locks and check it the right way.

Yes that or document very clearly under what circumstances that
function can be used.

> Linus


Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt
On Mon, 2018-07-02 at 19:15 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 5:57 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > I'll have a look see if I can find a way to check that the sysfs dir is
> > empty without adding that childcount, that would at least alleviate
> > some of your objection, and would probably make the patch
> > smaller/simpler.
> 
> That would definitely make me happier.
> 
> Right now we already remove the actual device node sysfs associations
> synchronously with "kobject_del()" (even if it still then stays around
> as a kobject), and that does the remove for the object itself - just
> not the glue directory.
> 
> If we then just did a "if glue dir is empty, delete the glue dir too"
> in cleanup_glue_dir(), at least ther patch would be prettier.
> 
> I *think* that should be easy. Maybe. Something almost, but not quite,
> entirely unlike the below UNTESTED hack?

Yes, something like that. I have a bunch of other things I need to
attend to today, so I might not come up with a patch before some time
tomorrow but I'll have a look.
 
> It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> USE UNDER ANY CIRCUMSTANCES.  Think of it more as a "something like
> this might work, but probably doesn't". Maybe it gives you an idea,
> although that idea might be "Linus has finally lost it".
> 
> I suspect it's broken because kernfs has this notion of inactive
> nodes, so just testing for the rbtree being empty is probably wrong. I
> think maybe it needs to test for there not being any active nodes, but
> then you'd need the kernfs_mutex etc, so that would make things much
> worse.

There's also a "subdirs" in kernfs_node which we could use. That's a
bit ugly because it only account directories but we happen to know that
the gluedirs only contain directories, so it would work fine in
practice for that case.

> I really haven't touched the kernfs code much, thus the "this may be
> complete and utter garbage" warning.
> 
> Adding Tejun to the participants, he should know. Or Greg, who has
> been silent, presumably because he's still getting over his crazy
> travels.

Cheers,
Ben.

> Linus
> 
> ---
> 
> --- a/drivers/base/core.c
>+++ b/drivers/base/core.c
>@@ -1585,6 +1585,12 @@ static inline struct kobject
> *get_glue_dir(struct device *dev)
> return dev->kobj.parent;
> }
> 
>+static inline bool has_children(struct kobject *dir)
>+{
>+struct kernfs_node *kn = dir->sd;
>+return kn && kn->dir.children.rb_node;
>+}
>+
> /*
>  * make sure cleaning up dir as the last step, we need to make
>  * sure .release handler of kobject is run with holding the
>@@ -1597,6 +1603,10 @@ static void cleanup_glue_dir(struct device
> *dev, struct kobject *glue_dir)
> return;
> 
> mutex_lock(_mutex);
>-kobject_put(glue_dir);
>+if (has_children(glue_dir))
>+kobject_put(glue_dir);
>+else
>+kobject_del(glue_dir);
> mutex_unlock(_mutex);
> }


Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt
On Mon, 2018-07-02 at 12:24 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 3:23 AM Benjamin Herrenschmidt
>  wrote:
> > 
> > Let me try one last ditch attempt to convince you using maybe a
> > different perspective: this is how sysfs is intended to work and how
> > the device model already does everywhere else except the gluedirs.
> 
> So don't get me wrong. I don't think that patch is _wrong_. It may
> well be the best thing we can do now.
> 
> I just think some of the arguments for the patch are bogus.
> 
> I still think that the auto-cleanup is actually a good thing in
> general. Not because it simplifies things (which it can do), but
> because it fundamentally *allows* you to use less locking.
> 
> And that ends up being my real reason to not like your patch all that
> much. It depends on
> 
>  (a) an extra reference count

Right, that is not that great, and it might be possible to avoid it if
we have a way to check that the sysfs dir is empty. Maybe kobj->sd-
>subdirs ? This will find subdirs, not files, but for gluedirs that's
sufficient. I can play with that today see if it works.

BTW. Talking of extra reference counts, we already have two everywhere
afaik... kernfs has its own that's separate from the kobject one. I got
lost in that maze a couple of times already.

>  (b) on fairly heavy-handed locking (ie the whole "lock on release
> too, not just on allocation").

This is already there, I'm not adding it ;-) It's just ineffective
without my patch due to the possibility of the "auto-cleanup" to leak
outside the lock.

> and I think both of those are non-optimal.
> 
> So:
> 
> > The actual lifetime of the struct device is handled *separately* by
> > device_get/put which are just wrappers on kobject_get/put.
> 
> I agree that that is the end result of your patch (and perhaps the
> buggy intent of the original code).

This is the existing device_add/del without my patch. I was describing
how the core device stuff works for the normal device directories
today. My patch makes the gludirs do the same thing basically, which I
think was the intend of the code already but was broken.

> I just don't necessarily agree it's a great thing in general. It
> basically uses sysfs visibility as an argument for why lifetimes
> should differ from the refcounted lifetimes.

Yes, it does that so that we can remove and re-add without name
clashes.

> And that may be a practical argument, but I don't think it's a very
> good argument in general. I think it would arguably be much better to
> be less serialized, and depend on refcounting more, and make less of a
> deal about the sysfs visibility.
> 
> For example, if we *really* add back the exact same device immediately
> after removing it, and the device was still in use somehow (ie the
> refcount never went down to zero), maybe we really should not have
> reused the same name? Or if we do re-use the same name, maybe we
> should have re-used the device node itself?

Well, maybe but that's what we've been doing forever, and I think
drivers deal with it already. It applies to devices and to chardevs for
example.

> See what I'm saying? I understand where your patch is coming from, but
> I am suspicious of the fact that it seems to want to re-use a name
> (but not the object) that is by definition still in use.

Yes, I see what you mean.

I don't really have the same qualms as you do, it's how we've done
things forever (it's just that the gluedir code was broken), and it
generally works well.

It's not different in principle to what we do with files. You can
unlink an open file. It's still "in use" but the directory entry is
gone and one can create a new one.

It a slightly more roundabout way, we do that with mm_structs, the last
process can be gone but we keep them around due to being referenced (or
lazily attached to a kthread). Here we have 2 counters too.

> Maybe that's the right thing in this case. Considering that we have a
> real oops and a real problem, and I don't have an alternative patch, I
> guess the "patches talk, bullshit walks" rule applies.

Haha, as you prefer. Patch 1 fixes the oops, patch 2 fixes the name
collision, so in theory you could get away with just patch 1. But then
device would start failing to be re-added or re-bound.

(In my case it was a sub-device created by the driver of a parent
device, so it happens on bind/unbind of that parent driver).

I'll have a look see if I can find a way to check that the sysfs dir is
empty without adding that childcount, that would at least alleviate
some of your objection, and would probably make the patch
smaller/simpler.

Cheers,
Ben.



Re: DMA mappings and crossing boundaries

2018-07-02 Thread Benjamin Herrenschmidt
On Mon, 2018-07-02 at 14:06 +0100, Robin Murphy wrote:

 .../...

Thanks Robin, I was starting to depair anybody would reply ;-)

> > AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
> > HOWTO.txt) as always allocating to the next power-of-2 order, so we
> > should never have the problem unless we allocate a single chunk larger
> > than the IOMMU page size.
> 
> (and even then it's not *that* much of a problem, since it comes down to 
> just finding n > 1 consecutive unused IOMMU entries for exclusive use by 
> that new chunk)

Yes, this case is not my biggest worry.

> > For dma_map_sg() however, if a request that has a single "entry"
> > spawning such a boundary, we need to ensure that the result mapping is
> > 2 contiguous "large" iommu pages as well.
> > 
> > However, that doesn't fit well with us re-using existing mappings since
> > they may already exist and either not be contiguous, or partially exist
> > with no free hole around them.
> > 
> > Now, we *could* possibly construe a way to solve this by detecting this
> > case and just allocating another "pair" (or set if we cross even more
> > pages) of IOMMU pages elsewhere, thus partially breaking our re-use
> > scheme.
> > 
> > But while doable, this introduce some serious complexity in the
> > implementation, which I would very much like to avoid.
> > 
> > So I was wondering if you guys thought that was ever likely to happen ?
> > Do you see reasonable cases where dma_map_sg() would be called with a
> > list in which a single entry crosses a 256M or 1G boundary ?
> 
> For streaming mappings of buffers cobbled together out of any old CPU 
> pages (e.g. user memory), you may well happen to get two 
> physically-adjacent pages falling either side of an IOMMU boundary, 
> which comprise all or part of a single request - note that whilst it's 
> probably less likely than the scatterlist case, this could technically 
> happen for dma_map_{page, single}() calls too.

Could it ? I wouldn't think dma_map_page is allows to cross page
boundaries ... what about single() ? The main worry is people using
these things on kmalloc'ed memory

> Conceptually it looks pretty easy to extend the allocation constraints 
> to cope with that - even the pathological worst case would have an 
> absolute upper bound of 3 IOMMU entries for any one physical region - 
> but if in practice it's a case of mapping arbitrary CPU pages to 32-bit 
> DMA addresses having only 4 1GB slots to play with, I can't really see a 
> way to make that practical :(

No we are talking about 40-ish-bits of address space, so there's a bit
of leeway. Of course no scheme will work if the user app tries to map
more than the GPU can possibly access.

But with newer AMD adding a few more bits and nVidia being at 47-bits,
I think we have some margin, it's just that they can't reach our
discontiguous memory with a normal 'bypass' mapping and I'd rather not
teach Linux about every single way our HW can scatter memory accross
nodes, so an "on demand" mechanism is by far the most flexible way to
deal with all configurations.

> Maybe the best compromise would be some sort of hybrid scheme which 
> makes sure that one of the IOMMU entries always covers the SWIOTLB 
> buffer, and invokes software bouncing for the awkward cases.

Hrm... not too sure about that. I'm happy to limit that scheme to well
known GPU vendor/device IDs, and SW bouncing is pointless in these
cases. It would be nice if we could have some kind of guarantee that a
single mapping or sglist entry never crossed a specific boundary
though... We more/less have that for 4G already (well, we are supposed
to at least). Who are the main potential problematic subsystems here ?
I'm thinking network skb allocation pools ... and page cache if it
tries to coalesce entries before issuing the map request, does it ?

Ben.

> Robin.



Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt
On Mon, 2018-07-02 at 09:36 +1000, Benjamin Herrenschmidt wrote:
> > No. See above. The reason I think your patch 2/2 is wrong is that is
> > actually *breaks* the above model, exactly because of that thing that
> > you hatre.
> > 
> > The explicit removal is actively wrong for the "I want to reuse the
> > object" model, exactly because it happens before the refcount has gone
> > to zero.
> 
> That's indeed where we disagree.

Sooo...

I had some quality time with my good friend Oban 14 last night,
thinking through this and during the resulting hangover today. I've
tried very hard to "get" your perspective, but after another dive
through the code, I'm afraid I still think patch 2/2 is the right thing
to do.

(Let's ignore patch 1/1 for now and assume we get that part right)

Let me try one last ditch attempt to convince you using maybe a
different perspective: this is how sysfs is intended to work and how
the device model already does everywhere else except the gluedirs.

Sooo... let's start pulling the string bit by bit...

I think you agree since (I'm not going to try to quote you here) you
did mention adding/removing need to generally be protected by higher
level locks in whatever subsystem owns the kobjects. I don't think
there's an argument here.

This is hinted by some comments (among others) in sysfs/dir.c about the
kobject owner being responsible to ensure that removal doesn't race
with any operation, which is hard to do when such removal doesn't
happen under subsystem internal locks.

Now let's look at how things are handled for a normal struct device
kobject.

The device core provides device_add and device_del functions. Those are
completely orthogonal to the object lifetime controlled by
device_initialize (creation) and device_put. And for good reasons (yes,
there are helpers, device_register/unregister that combine them, but
those are just that... helpers).

Why ? For the exact same reason mentioned above: Along with all the
driver binding/unbinding business etc... those functions ensure that
the kobject of the device is added/removed from sysfs under all the
necessary locks in the device model, so that they cannot race with each
other.

This guarantees among others that it is possible to do a new device_add
for the same device immediately after device_del, without clashing with
a duplicate sysfs name, *even* if something still holds a reference to
the kobject for any reason (and there are plenty of these, see the cdev
example below).

The actual lifetime of the struct device is handled *separately* by
device_get/put which are just wrappers on kobject_get/put.

This is the same with a cdev. cdev_add and cdev_del or their friends
cdev_device_add/del do the same thing. The chardev case is interesting
specifically because it is much more common for these to persist
(lifetime) beyond cdev_del. You just need userspace to have an open
file descriptor to the chardev. Drivers have to deal with it of course,
they need to handle things like read/write/ioctl being called on
already open files after remove/del/unbind, and hopefully do so.

*But* it's also possible to create a new identical cdev with the same
sysfs name immediately after cdev_device_del even if the old one still
has open instanced precisely because we separate the sysfs presence,
handled through the device model internal locking, from the object
lifetime.

Now, going back to our gluedirs, in essence those are just more objects
owned by the device core. More specifically, they need to be created
synchronously with the first child device of a class added under a
given parent, and removed synchronously with the last child device of
that class under that same parent.

Delaying that removal from sysfs simply means that we now are unable to
re-add the same device after removing it for some arbitrary amount of
time, because something might still hold a kobject reference to the
gluedir for any reason, it doesn't matter.

It also means that the deletion from sysfs is no longer synchronized
with addition since it no longer happens from within the context of
decice_del which is the "owner" of the glue dir, under whatever locks
it uses to ensure there is no races between multiple devices addition
removal etc...

In fact, if you look at the code and the way the gdp_mutex is used, I
think this is *exactly* the intention of the code. That the
kobject_put() done inside device_del() is the last one, done under that
mutex, thus removing the gluedir synchronously and we have no race.

However, that code just fails to take into account the delayed removal
by kobject debugging and the possibility of another reference existing.

Now it's possible there there can be no other references of a gluedir
today, to be frank I haven't audited the entire code base to figure out
if anything can possibly hold one that isn't a child of that gluedir,
so if indeed there can't be any and the only r

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt
On Sun, 2018-07-01 at 10:04 -0700, Linus Torvalds wrote:
> On Sun, Jul 1, 2018 at 12:16 AM Benjamin Herrenschmidt
>  wrote:
> > 
> > I suspect you didn't read it my entire argument or I wasn't clear
> > enough :-) This is actually the crux of the problem:
> > 
> > Yes the object continues to exist. However, the *last* kobject_put to
> > it will no longer be done under whatever higher level locking the
> > subsystem provides (whatever prevents for example concurrent add and
> > removes).
> 
> Well, yes and no.
> 
> Why "no"?
> 
> The last dropping is actually not necessarily that interesting.
> Especially with the sysfs interface, we basically know that you can
> look up the object using RCU (since that's what the filesystem lookup
> does), and that basically means that the refcount is always the final
> serialization mechanism.There is nothing else that can possibly lock
> it.
> 
> So this is where we disagree:
> 
> > Thus in that scenario the "last minute" kobject_release() done by the
> > last kobject_put() will be effectively unprotected from for example the
> > gdp_mutex (in the case of the gluedirs) or whatever other locking the
> > subsystem owning the kobject is using to avoid making that "refount 0"
> > object "discoverable".
> 
> No. *Fundamentally*, there is only one thing that protects that
> object: the refcount.
> 
> And my argument is that anything that has this model (which means
> anything that has any sysfs linkage, which pretty much means any
> kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
> had an existing stable pointer and just wants to increase the refcount
> (ie "already got a reference throuigh one of my data structures that
> use the lock")
> 
> But if you have that model, that means that the "last drop" is
> actually almost totally irrelevant.  Because it doesn't matter if it's
> done inside the lock or not - you know that once it has been done,
> that object is entirely gone. It's not discoverable any more -
> regardless of locking. The discoverability is basically controlled
> entirely by the refcount.
> 
> So what happens then?
> 
> The locking isn't important for the last release, but it *is*
> important for new object *creation*.
> 
> Why?
> 
> The refcount means that once an object is gone, it's gone for
> *everyone*. It's a one-way thing, and it's thread-safe. So the code
> that does *creation* can do this:
> 
>  - get subsystem lock
>  - look up object (using the "unless_zero()" model)
>  - if you got the object, re-use it, and you're done: drop lock and return
>  - otherwise, you know that nobody else can get it either
>  - create new object and instantiate it
>  - drop lock
> 
> and this means that you create a new object IFF the old object had its
> refcount drop to zero. So you always have exactly one copy (or no copy
> at all, in between the last drop and the creation of the new one).
> 
> See? The lack of locking at drop time didn't matter.  The refcount
> itself serialized things.
> 
> So the above is what I *think* the "glue_dir" logic should be. No need
> for any other count. Just re-use the old glue dir if you can find it,
> and create a new one if you can't.
> 
> The one important thing is that the object lookup above needs to use
> the lookup needs to find the object all the way until the refcount has
> become zero. And that actually means that the object MUST NOT be
> removed from the object lists until *after* the refcount has been
> decremented to zero. Which is actually why that "automatic cleanup"
> that you hate is actually an integral and important part of the
> process: removing the object *before* the refcount went to zero is
> broken, because that means that the "look up object" phase can now
> miss an object that still has a non-zero refcount.
> 
> > So my patch 1/2 prevents us from finding the old dying object (and thus
> > from crashing) but replaces this with the duplicate name problem.
> 
> So I absolutely agree with your patch 1/2. My argument against it is
> actually that I think the "unless_zero" thing needs to be more
> universal.
> 
> > My patch 2/2 removes that second problem by ensuring we remove the
> > object from sysfs synchronously in device_del when it no longer
> > contains any children, explicitely rather than implicitely by the
> > virtue of doing the "last" kobject_put.
> 
> No. See above. The reason I think your patch 2/2 is wrong is that is
> actually *breaks* the above model, exactly because of that thing that

  1   2   3   4   5   6   7   8   9   10   >