from:"Benjamin Herrenschmidt"

[PATCH 3/5] fsi: scom: Convert to use the new chardev

2018-07-23 Thread Benjamin Herrenschmidt

This converts FSI scom to use the new fsi-core controlled
chardev allocator and use a real cdev instead of a miscdev.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/fsi/fsi-scom.c | 130 +
 1 file changed, 80 insertions(+), 50 deletions(-)

diff --git a/drivers/fsi/fsi-scom.c b/drivers/fsi/fsi-scom.c
index 39c74351f1bf..0f303a700f69 100644
--- a/drivers/fsi/fsi-scom.c
+++ b/drivers/fsi/fsi-scom.c
@@ -20,9 +20,8 @@
 #include 
 #include 
 #include 
-#include 
+#include 
 #include 
-#include 
 
 #include 
 
@@ -77,18 +76,12 @@
 struct scom_device {
struct list_head link;
struct fsi_device *fsi_dev;
-   struct miscdevice mdev;
+   struct device dev;
+   struct cdev cdev;
struct mutex lock;
-   charname[32];
-   int idx;
+   bool dead;
 };
 
-#define to_scom_dev(x) container_of((x), struct scom_device, mdev)
-
-static struct list_head scom_devices;
-
-static DEFINE_IDA(scom_ida);
-
 static int __put_scom(struct scom_device *scom_dev, uint64_t value,
  uint32_t addr, uint32_t *status)
 {
@@ -374,9 +367,7 @@ static int get_scom(struct scom_device *scom, uint64_t 
*value,
 static ssize_t scom_read(struct file *filep, char __user *buf, size_t len,
 loff_t *offset)
 {
-   struct miscdevice *mdev =
-   (struct miscdevice *)filep->private_data;
-   struct scom_device *scom = to_scom_dev(mdev);
+   struct scom_device *scom = filep->private_data;
struct device *dev = >fsi_dev->dev;
uint64_t val;
int rc;
@@ -385,7 +376,10 @@ static ssize_t scom_read(struct file *filep, char __user 
*buf, size_t len,
return -EINVAL;
 
mutex_lock(>lock);
-   rc = get_scom(scom, , *offset);
+   if (scom->dead)
+   rc = -ENODEV;
+   else
+   rc = get_scom(scom, , *offset);
mutex_unlock(>lock);
if (rc) {
dev_dbg(dev, "get_scom fail:%d\n", rc);
@@ -403,8 +397,7 @@ static ssize_t scom_write(struct file *filep, const char 
__user *buf,
  size_t len, loff_t *offset)
 {
int rc;
-   struct miscdevice *mdev = filep->private_data;
-   struct scom_device *scom = to_scom_dev(mdev);
+   struct scom_device *scom = filep->private_data;
struct device *dev = >fsi_dev->dev;
uint64_t val;
 
@@ -418,7 +411,10 @@ static ssize_t scom_write(struct file *filep, const char 
__user *buf,
}
 
mutex_lock(>lock);
-   rc = put_scom(scom, val, *offset);
+   if (scom->dead)
+   rc = -ENODEV;
+   else
+   rc = put_scom(scom, val, *offset);
mutex_unlock(>lock);
if (rc) {
dev_dbg(dev, "put_scom failed with:%d\n", rc);
@@ -532,12 +528,15 @@ static int scom_check(struct scom_device *scom, void 
__user *argp)
 
 static long scom_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
-   struct miscdevice *mdev = file->private_data;
-   struct scom_device *scom = to_scom_dev(mdev);
+   struct scom_device *scom = file->private_data;
void __user *argp = (void __user *)arg;
int rc = -ENOTTY;
 
mutex_lock(>lock);
+   if (scom->dead) {
+   mutex_unlock(>lock);
+   return -ENODEV;
+   }
switch(cmd) {
case FSI_SCOM_CHECK:
rc = scom_check(scom, argp);
@@ -556,48 +555,88 @@ static long scom_ioctl(struct file *file, unsigned int 
cmd, unsigned long arg)
return rc;
 }
 
+static int scom_open(struct inode *inode, struct file *file)
+{
+   struct scom_device *scom = container_of(inode->i_cdev, struct 
scom_device, cdev);
+
+   file->private_data = scom;
+
+   return 0;
+}
+
 static const struct file_operations scom_fops = {
.owner  = THIS_MODULE,
+   .open   = scom_open,
.llseek = scom_llseek,
.read   = scom_read,
.write  = scom_write,
.unlocked_ioctl = scom_ioctl,
 };
 
+static void scom_free(struct device *dev)
+{
+   struct scom_device *scom = container_of(dev, struct scom_device, dev);
+
+   put_device(>fsi_dev->dev);
+   kfree(scom);
+}
+
 static int scom_probe(struct device *dev)
 {
struct fsi_device *fsi_dev = to_fsi_dev(dev);
struct scom_device *scom;
+   int rc, didx;
 
-   scom = devm_kzalloc(dev, sizeof(*scom), GFP_KERNEL);
+   scom = kzalloc(sizeof(*scom), GFP_KERNEL);
if (!scom)
return -ENOMEM;
-
+   dev_set_drvdata(dev, scom);
mutex_init(>lock);
-   scom->idx = ida_simple_get(_ida, 1, INT_MAX, GFP_KERNEL);
-   snprintf(scom->name, sizeof(scom->name), "scom%d", scom->idx);
-   scom->fsi_dev = fsi_dev;
-   scom-

[PATCH 0/5] fsi: Convert misc devs to proper chardevs and more

2018-07-23 Thread Benjamin Herrenschmidt

This converts the various FSI devices from misc dev to chardev,
as there can potentially be too much of them for misc devs limited
minors, and because there are some lifetime issues with the current
support.

This provide a common infrastructure to allocate an FSI major and
distribute minors in a way that keeps it compatible with existing
userspace. A new representation grouping FSI devices under a
/dev/fsi directory is optinally provided, which will work in
conjunction with new udev scripts aimed at providing fixed ID
based symlinks.

A side effect of those conversions is to fix some object lifetime
issues caused by a mixup between devm_kzalloc and proper object
lifetime.

This series also adds a /dev{/fsi}/cfamN chardev for raw CFAM
access that will superseed the existing "raw" sysfs file.

Finally there's also a locking fix to avoid horrible mixups if
the "rescan" file is poked while a rescan is already in progress.

[GIT PULL] FSI updates round 3 for 4.19

2018-07-23 Thread Benjamin Herrenschmidt

Hi Greg !

This adds support for offloading the FSI low level bitbanging to the
ColdFire coprocessor of the Aspeed SoCs. All the pre-requisites have
already been merged, this is the final piece in the puzzle.

This branch also pull gpio/ib-aspeed which is a topic branch already
in gpio/for-next (and thus in next) whic contains pre-requisites.

Finally, there's also a bug fix to the sbefifo driver for some
inconsistent use of a mutex in the error handling code.

Note: The oddball "origin" commit of that pull request comes from
the fact that Linus Walleij gpio/ib-aspeed tree is ahead of mine,
so pulling his branch pulled a bunch of already upstream other things
in. As such, i had to base this pull request off a local merge. This
shouldn't affect you at all, and allows the diffstat below to be
correct.

The bindings updates have been acked by Rob and the corresponding
device-tree updates will be merged by Joel via the ARM SoC tree.

Thanks !
Ben.

The following changes since commit d5e748ff2b996d83489ac76c072e8b99f9ecef13:

  Merge remote-tracking branch 'gpio/ib-aspeed' into upstream-ready (2018-07-23 
15:21:39 +1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git 
tags/fsi-updates-2018-07-24

for you to fetch changes up to 0a213777d1dd879092225a7aa847b6e9b3a1c267:

  fsi: Add support for device-tree provided chip IDs (2018-07-23 16:27:32 +1000)

----
Benjamin Herrenschmidt (6):
  devres: Add devm_of_iomap()
  dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"
  fsi: master-ast-cf: Add new FSI master using Aspeed ColdFire
  fsi: sbefifo: Fix inconsistent use of ffdc mutex
  dt-bindings: fsi: Add optional chip-id to CFAMs
  fsi: Add support for device-tree provided chip IDs

 Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt |   36 +++
 Documentation/devicetree/bindings/fsi/fsi.txt   |5 +
 drivers/fsi/Kconfig |9 ++
 drivers/fsi/Makefile|1 +
 drivers/fsi/cf-fsi-fw.h |  157 

 drivers/fsi/fsi-core.c  |   24 +
 drivers/fsi/fsi-master-ast-cf.c | 1438 
+
 drivers/fsi/fsi-sbefifo.c   |   13 ++-
 include/linux/device.h  |4 +
 include/trace/events/fsi_master_ast_cf.h|  150 
++
 lib/devres.c|   36 +++
 11 files changed, 1869 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
 create mode 100644 drivers/fsi/cf-fsi-fw.h
 create mode 100644 drivers/fsi/fsi-master-ast-cf.c
 create mode 100644 include/trace/events/fsi_master_ast_cf.h

[GIT PULL] FSI updates round 3 for 4.19

2018-07-23 Thread Benjamin Herrenschmidt

Hi Greg !

This adds support for offloading the FSI low level bitbanging to the
ColdFire coprocessor of the Aspeed SoCs. All the pre-requisites have
already been merged, this is the final piece in the puzzle.

This branch also pull gpio/ib-aspeed which is a topic branch already
in gpio/for-next (and thus in next) whic contains pre-requisites.

Finally, there's also a bug fix to the sbefifo driver for some
inconsistent use of a mutex in the error handling code.

Note: The oddball "origin" commit of that pull request comes from
the fact that Linus Walleij gpio/ib-aspeed tree is ahead of mine,
so pulling his branch pulled a bunch of already upstream other things
in. As such, i had to base this pull request off a local merge. This
shouldn't affect you at all, and allows the diffstat below to be
correct.

The bindings updates have been acked by Rob and the corresponding
device-tree updates will be merged by Joel via the ARM SoC tree.

Thanks !
Ben.

The following changes since commit d5e748ff2b996d83489ac76c072e8b99f9ecef13:

  Merge remote-tracking branch 'gpio/ib-aspeed' into upstream-ready (2018-07-23 
15:21:39 +1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git 
tags/fsi-updates-2018-07-24

for you to fetch changes up to 0a213777d1dd879092225a7aa847b6e9b3a1c267:

  fsi: Add support for device-tree provided chip IDs (2018-07-23 16:27:32 +1000)

----
Benjamin Herrenschmidt (6):
  devres: Add devm_of_iomap()
  dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"
  fsi: master-ast-cf: Add new FSI master using Aspeed ColdFire
  fsi: sbefifo: Fix inconsistent use of ffdc mutex
  dt-bindings: fsi: Add optional chip-id to CFAMs
  fsi: Add support for device-tree provided chip IDs

 Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt |   36 +++
 Documentation/devicetree/bindings/fsi/fsi.txt   |5 +
 drivers/fsi/Kconfig |9 ++
 drivers/fsi/Makefile|1 +
 drivers/fsi/cf-fsi-fw.h |  157 

 drivers/fsi/fsi-core.c  |   24 +
 drivers/fsi/fsi-master-ast-cf.c | 1438 
+
 drivers/fsi/fsi-sbefifo.c   |   13 ++-
 include/linux/device.h  |4 +
 include/trace/events/fsi_master_ast_cf.h|  150 
++
 lib/devres.c|   36 +++
 11 files changed, 1869 insertions(+), 4 deletions(-)
 create mode 100644 Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
 create mode 100644 drivers/fsi/cf-fsi-fw.h
 create mode 100644 drivers/fsi/fsi-master-ast-cf.c
 create mode 100644 include/trace/events/fsi_master_ast_cf.h

Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-22 Thread Benjamin Herrenschmidt

On Sat, 2018-07-21 at 09:53 +0200, Greg Kroah-Hartman wrote:
> > > So I hate using kobject_get_unless_zero(), and resisted ever adding it
> > > to the tree as it shows a bad locking/tree situation as you point out
> > > here.  But for some reason, the block developers seemed to insist they
> > > needed it, and so it is in the tree for them.  I don't want it to spread
> > > if at all possible, which makes me want to reject this patch as this
> > > should be "a case that can never be hit".
> > 
> > Except it can in that situation... at least unless you get my patch 2/2
> > (or the newer one I'm about to send that avoids adding a child counter
> > and uses the one in kernfs instead).
> 
> To follow up on this.  I've applied the 2/2 patch for this series, so
> this 1/2 "should" not be needed.  Ben, if you still see this trigger
> with that, I guess I can take this, but it still feels wrong to me :)

I've tested my repro-case with only patch 2 and it doesn't trigger
anymore, as expected.

That said, you and Linus might want to sync your clocks on how sysfs
"should" work :-)

Cheers,
Ben.

Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-22 Thread Benjamin Herrenschmidt

On Sat, 2018-07-21 at 09:53 +0200, Greg Kroah-Hartman wrote:
> > > So I hate using kobject_get_unless_zero(), and resisted ever adding it
> > > to the tree as it shows a bad locking/tree situation as you point out
> > > here.  But for some reason, the block developers seemed to insist they
> > > needed it, and so it is in the tree for them.  I don't want it to spread
> > > if at all possible, which makes me want to reject this patch as this
> > > should be "a case that can never be hit".
> > 
> > Except it can in that situation... at least unless you get my patch 2/2
> > (or the newer one I'm about to send that avoids adding a child counter
> > and uses the one in kernfs instead).
> 
> To follow up on this.  I've applied the 2/2 patch for this series, so
> this 1/2 "should" not be needed.  Ben, if you still see this trigger
> with that, I guess I can take this, but it still feels wrong to me :)

I've tested my repro-case with only patch 2 and it doesn't trigger
anymore, as expected.

That said, you and Linus might want to sync your clocks on how sysfs
"should" work :-)

Cheers,
Ben.

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-19 Thread Benjamin Herrenschmidt

On Fri, 2018-07-20 at 09:37 +0930, Andrew Jeffery wrote:
> > 
> > Andrew, can you start with a list that shows what you expect us to need
> > on our systems ?
> > 
> 
> Okay, our Witherspoon and Romulus platforms containing the ASPEED AST2500 
> currently need the following tuneables exposed:
> 
> > From the SCU:
> 
> - Debug UART enable
> - VGA DAC mux
> - VGA scratch registers 0-7
> - LPC SuperIO decode enable
> - VGA MMIO decode enable
> 
> > From the LPC controller:
> 
> - iLPC2AHB enable
> - SuperIO scratch registers 0x20-0x2f
> 
> (The LPC controller is just as much of a collection of random bits as the SCU)
> 
> Lastly, our Palmetto platform uses an AST2400 which has fewer features 
> compared to the AST2500. Its tuneable list is the same as the above with the 
> exception of "Debug UART enable".
> 
> Tuneables that we may need to expose in the future include:
> 
> > From the SCU:
> 
> - PCI VID/DID for the BMC PCIe device
> - VGA device enable (may need to be disabled if the platform contains a 
> discrete graphics processor)

Additionally there's a bunch of resigters controlling the mapping of
various MMIO regions of the BMC PCIe device to portions of the BMC
address space. I'm not sure what's the best way to handle that.

This specific set might require a dedicated device as a subnode of
the SCU in the DT that contains all the mappings as properties... 

That or we consider them static enough and just whack it in u-boot.

> > From the LPC controller:
> 
> - UART mux
> 
> Alexander, Eugene, can you chime in with your platforms' needs?
> 
> Cheers,
> 
> Andrew

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-19 Thread Benjamin Herrenschmidt

On Fri, 2018-07-20 at 09:37 +0930, Andrew Jeffery wrote:
> > 
> > Andrew, can you start with a list that shows what you expect us to need
> > on our systems ?
> > 
> 
> Okay, our Witherspoon and Romulus platforms containing the ASPEED AST2500 
> currently need the following tuneables exposed:
> 
> > From the SCU:
> 
> - Debug UART enable
> - VGA DAC mux
> - VGA scratch registers 0-7
> - LPC SuperIO decode enable
> - VGA MMIO decode enable
> 
> > From the LPC controller:
> 
> - iLPC2AHB enable
> - SuperIO scratch registers 0x20-0x2f
> 
> (The LPC controller is just as much of a collection of random bits as the SCU)
> 
> Lastly, our Palmetto platform uses an AST2400 which has fewer features 
> compared to the AST2500. Its tuneable list is the same as the above with the 
> exception of "Debug UART enable".
> 
> Tuneables that we may need to expose in the future include:
> 
> > From the SCU:
> 
> - PCI VID/DID for the BMC PCIe device
> - VGA device enable (may need to be disabled if the platform contains a 
> discrete graphics processor)

Additionally there's a bunch of resigters controlling the mapping of
various MMIO regions of the BMC PCIe device to portions of the BMC
address space. I'm not sure what's the best way to handle that.

This specific set might require a dedicated device as a subnode of
the SCU in the DT that contains all the mappings as properties... 

That or we consider them static enough and just whack it in u-boot.

> > From the LPC controller:
> 
> - UART mux
> 
> Alexander, Eugene, can you chime in with your platforms' needs?
> 
> Cheers,
> 
> Andrew

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-18 Thread Benjamin Herrenschmidt

On Thu, 2018-07-19 at 11:58 +0930, Andrew Jeffery wrote:
> > > I agree
> > > that not using /dev/mem is a good thing, but there are several ways
> > > you could do that independent of any DT binding.
> > 
> > Such as ? The only other option is to have one or more kernel drivers
> > that have built-in the precise and complete list of those tunables that
> > need to be exposed.
> > 
> > It's not impossible, but I worry that it's not going to scale terribly
> > well, and that vendors will constantly "fork" that driver to add
> > different things to that list.
> 
> Picking on that last point, "different things" doesn't necessarily
> mean "different fields in the SoC" (though it may). We could just
> need to use different initial values for the fields already
> described. 

I don't worry about initial values too much. Those could be specified
by userspace. It would be trivial to have something akin to
/etc/sysctl.conf that contains the initial values and a script blasting
them early. In fact I prefer this being in userspace to be frank. It
could be in the initramfs if we want it done early enough, maybe with a
usermode helper.

Unless we have some demonstrable reasons why some of these must be set
before the various kernel drivers initialize.

> So taking that into account, the field descriptions could vary wildly
> from platform to platform - where "platform" here is the combination
> of the BMC, its host system, and the host system's required
> configuration - not just referring to the BMC in isolation.  That in
> turn may cause a fork of the driver (changes that are incompatible
> with other platforms), or not scale terribly well as Ben suggests.

I really only worry about the actual set of registers/fields. I think
folding in initial values goes a bit too far.

> The initial value concept can help reduce the impact on userspace as
> userpace may no-longer need to care about it, so I think it's a
> desirable property.

I don't worry too much about userspace containing system specific bits
and pieces such as a config file with the appropriate initial values
for the platform. Userspace will have to contain platform specific
stuff regardless (if anything, stuff like thermal control is
intrinsicly different from one platform to the next).

>  With respect to devicetree, the field definitions will stay in the
> SoC dtsi, but the initial values would be described on a per-platform 
> basis in the dts.

If the fields are in the SoC dtsi, then Rob has a reasonable argument
that the list of fields is rather fixed for a given SoC and thus could
be hard wired in the driver 

I think at this stage, we need to create more clarity. In order to do
that, I think we need to come up with a list, as exhaustive as we can,
of what are the fields/register we need for our platforms, and find any
reason why userspace wouldn't be a good enough location to stick the
initial values.

Then we need Dell and Yadro (and maybe others) to help with their
variant of that list for the same SoCs to begin with.

With that, we'll get an idea of whether we think we can get a
reasonable stable long term list that's appropriate for most vendors.

If that's the case, then my objections to have it in the kernel instead
of the DT fade away a bit.

If one or two of those fields absolutely must have their defaults
early, then we can consider a specific set of one or two properties in
the SoC node for that SoC family that provides those specific defaults.

But unless we have that list, I think we'll be throwing too many
hypotheticals around to convince Rob and others.

Andrew, can you start with a list that shows what you expect us to need
on our systems ?

Cheers,
Ben

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-18 Thread Benjamin Herrenschmidt

On Thu, 2018-07-19 at 11:58 +0930, Andrew Jeffery wrote:
> > > I agree
> > > that not using /dev/mem is a good thing, but there are several ways
> > > you could do that independent of any DT binding.
> > 
> > Such as ? The only other option is to have one or more kernel drivers
> > that have built-in the precise and complete list of those tunables that
> > need to be exposed.
> > 
> > It's not impossible, but I worry that it's not going to scale terribly
> > well, and that vendors will constantly "fork" that driver to add
> > different things to that list.
> 
> Picking on that last point, "different things" doesn't necessarily
> mean "different fields in the SoC" (though it may). We could just
> need to use different initial values for the fields already
> described. 

I don't worry about initial values too much. Those could be specified
by userspace. It would be trivial to have something akin to
/etc/sysctl.conf that contains the initial values and a script blasting
them early. In fact I prefer this being in userspace to be frank. It
could be in the initramfs if we want it done early enough, maybe with a
usermode helper.

Unless we have some demonstrable reasons why some of these must be set
before the various kernel drivers initialize.

> So taking that into account, the field descriptions could vary wildly
> from platform to platform - where "platform" here is the combination
> of the BMC, its host system, and the host system's required
> configuration - not just referring to the BMC in isolation.  That in
> turn may cause a fork of the driver (changes that are incompatible
> with other platforms), or not scale terribly well as Ben suggests.

I really only worry about the actual set of registers/fields. I think
folding in initial values goes a bit too far.

> The initial value concept can help reduce the impact on userspace as
> userpace may no-longer need to care about it, so I think it's a
> desirable property.

I don't worry too much about userspace containing system specific bits
and pieces such as a config file with the appropriate initial values
for the platform. Userspace will have to contain platform specific
stuff regardless (if anything, stuff like thermal control is
intrinsicly different from one platform to the next).

>  With respect to devicetree, the field definitions will stay in the
> SoC dtsi, but the initial values would be described on a per-platform 
> basis in the dts.

If the fields are in the SoC dtsi, then Rob has a reasonable argument
that the list of fields is rather fixed for a given SoC and thus could
be hard wired in the driver 

I think at this stage, we need to create more clarity. In order to do
that, I think we need to come up with a list, as exhaustive as we can,
of what are the fields/register we need for our platforms, and find any
reason why userspace wouldn't be a good enough location to stick the
initial values.

Then we need Dell and Yadro (and maybe others) to help with their
variant of that list for the same SoCs to begin with.

With that, we'll get an idea of whether we think we can get a
reasonable stable long term list that's appropriate for most vendors.

If that's the case, then my objections to have it in the kernel instead
of the DT fade away a bit.

If one or two of those fields absolutely must have their defaults
early, then we can consider a specific set of one or two properties in
the SoC node for that SoC family that provides those specific defaults.

But unless we have that list, I think we'll be throwing too many
hypotheticals around to convince Rob and others.

Andrew, can you start with a list that shows what you expect us to need
on our systems ?

Cheers,
Ben

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-18 Thread Benjamin Herrenschmidt

On Wed, 2018-07-18 at 13:50 -0600, Rob Herring wrote:
> 
> > So Rob, I think that's precisely where the disconnect is.
> > 
> > I think we all (well hopefully) agree that those few tunables don't fit
> > in any existing subystem and aren't likely to ever do (famous last
> > words...).
> > 
> > Where we disagree is we want to make this parametrized via the DT, and
> > you want us to hard wire the list in some kind of SoC driver for a
> > given SoC family/version.
> > 
> > The reason I think hard wiring the list in the driver is not a great
> > solution is that that list in itself is prone to variations, possibly
> > fairly often, between boards, vendors, versions of boards, etc...
> 
> Can we separate the list of what's enabled from the details of the
> registers?  Even if we put all this into DT, we may still want to have
> some separation for dts file structure. Some portion of this has to be
> SoC specific because you are simply exposing SoC registers.

Not sure I understand what you mean by "what's enabled".

The goal here is to expose register "fields" (nor raw values, some
registers need locking for RMW etc... the kernel would handle that via
syscon locking I expect), so basically named "tunables" to userspace.

Userspace is the one that knows the values for a given system.

> > We can't know for sure every SoC tunable (out of the gazillions in
> > those chips) are going to be needed for a given system. We know which
> > ones we do use for ours, and that's a couple of handfuls, but it could
> > be that Dell need a slightly different set, and so might Yadro, or so
> > might our next board revision for that matter.
> 
> That's very hand wavy. Do you have some concrete examples (i.e. dts
> files) showing the differences.

We could list what we have on the pile for some of our IBM systems
today, but we would need Dell and Yadro to chime in with what they
need.

I still, from experience with that stuff and gut feeling, am pretty
sure it's going to be a moving list, and updating the kernel driver
constantly isn't going to fly.

> One problem I'm having with this is you are still going to need per
> board specifics in userspace.

Yes. Userspace is ultimately the one that knows what needs to be done
on a given machine.

>  You may be moving some of details out,
> but moving to DT is not going to completely eliminate that. 

We aren't trying to either. We are trying to make sure we don't need to
change the *kernel* all the time, in part bcs we are pushing hard for
OpenBMC vendors to use upstream with, if possible, no vendor changes.

So userspace has to know the board specific tunables anyways. Today in
many systems, it does that by whacking /dev/mem. I guess you can
understand why that is bad :-)

> I agree
> that not using /dev/mem is a good thing, but there are several ways
> you could do that independent of any DT binding.

Such as ? The only other option is to have one or more kernel drivers
that have built-in the precise and complete list of those tunables that
need to be exposed.

It's not impossible, but I worry that it's not going to scale terribly
well, and that vendors will constantly "fork" that driver to add
different things to that list.

I might be wrong here, so I'd like for Eugene (Dell) and Alexander
(Yadro) to chime in, but experience with BMCs has shown that we
regularily , as we add a feature or rewrite something, need to find
another new magic SoC tunable the HW manufacturer hid somewhere that
will make our stuff work.

> > Now, updating the device-tree in the board flash with whatever vendor
> > specific information is needed is a LOT easier than getting the kernel
> > driver constantly updated. The device-tree after all is there to
> > reflect among other things system specific ways in which the SoC is
> > wired and configured. This is rather close...
> 
> Sadly, updating my kernel is easier than updating my PC firmware
> (though packaged firmware on my current laptop changes that). I can
> assure you that ARM boards are generally much worse in that regard.
> BTW, you may want to pay attention to EBBR[1][2]. Not sure how much
> you care for BMCs, but there may be some interest.

You are conflating your host kernel and your BMC here. The BMC kernel
is part of the "firmware", as is its DTS and the BMC userspace.

(Again this isn't the host DTS, this is the BMC DTS).

They get all updated together. My point isn't about the ease or
difficulty for a *user* to udpate their BMC, in that case the solutions
above are equivalent.

The point is from a system vendor perspective. A system vendor using
OpenBMC will *customize* their BMC build in various ways. Typically
they *will* have a custom DT since this is what represents their
specific system and they will have some specific userspace bits.

However, we are trying very hard for them *not* to fork the kernel, and
if possible move OpenBMC towards a fully upstream kernel (still working
on getting all the SoC drivers cleaned up and pushed up but

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-18 Thread Benjamin Herrenschmidt

On Wed, 2018-07-18 at 13:50 -0600, Rob Herring wrote:
> 
> > So Rob, I think that's precisely where the disconnect is.
> > 
> > I think we all (well hopefully) agree that those few tunables don't fit
> > in any existing subystem and aren't likely to ever do (famous last
> > words...).
> > 
> > Where we disagree is we want to make this parametrized via the DT, and
> > you want us to hard wire the list in some kind of SoC driver for a
> > given SoC family/version.
> > 
> > The reason I think hard wiring the list in the driver is not a great
> > solution is that that list in itself is prone to variations, possibly
> > fairly often, between boards, vendors, versions of boards, etc...
> 
> Can we separate the list of what's enabled from the details of the
> registers?  Even if we put all this into DT, we may still want to have
> some separation for dts file structure. Some portion of this has to be
> SoC specific because you are simply exposing SoC registers.

Not sure I understand what you mean by "what's enabled".

The goal here is to expose register "fields" (nor raw values, some
registers need locking for RMW etc... the kernel would handle that via
syscon locking I expect), so basically named "tunables" to userspace.

Userspace is the one that knows the values for a given system.

> > We can't know for sure every SoC tunable (out of the gazillions in
> > those chips) are going to be needed for a given system. We know which
> > ones we do use for ours, and that's a couple of handfuls, but it could
> > be that Dell need a slightly different set, and so might Yadro, or so
> > might our next board revision for that matter.
> 
> That's very hand wavy. Do you have some concrete examples (i.e. dts
> files) showing the differences.

We could list what we have on the pile for some of our IBM systems
today, but we would need Dell and Yadro to chime in with what they
need.

I still, from experience with that stuff and gut feeling, am pretty
sure it's going to be a moving list, and updating the kernel driver
constantly isn't going to fly.

> One problem I'm having with this is you are still going to need per
> board specifics in userspace.

Yes. Userspace is ultimately the one that knows what needs to be done
on a given machine.

>  You may be moving some of details out,
> but moving to DT is not going to completely eliminate that. 

We aren't trying to either. We are trying to make sure we don't need to
change the *kernel* all the time, in part bcs we are pushing hard for
OpenBMC vendors to use upstream with, if possible, no vendor changes.

So userspace has to know the board specific tunables anyways. Today in
many systems, it does that by whacking /dev/mem. I guess you can
understand why that is bad :-)

> I agree
> that not using /dev/mem is a good thing, but there are several ways
> you could do that independent of any DT binding.

Such as ? The only other option is to have one or more kernel drivers
that have built-in the precise and complete list of those tunables that
need to be exposed.

It's not impossible, but I worry that it's not going to scale terribly
well, and that vendors will constantly "fork" that driver to add
different things to that list.

I might be wrong here, so I'd like for Eugene (Dell) and Alexander
(Yadro) to chime in, but experience with BMCs has shown that we
regularily , as we add a feature or rewrite something, need to find
another new magic SoC tunable the HW manufacturer hid somewhere that
will make our stuff work.

> > Now, updating the device-tree in the board flash with whatever vendor
> > specific information is needed is a LOT easier than getting the kernel
> > driver constantly updated. The device-tree after all is there to
> > reflect among other things system specific ways in which the SoC is
> > wired and configured. This is rather close...
> 
> Sadly, updating my kernel is easier than updating my PC firmware
> (though packaged firmware on my current laptop changes that). I can
> assure you that ARM boards are generally much worse in that regard.
> BTW, you may want to pay attention to EBBR[1][2]. Not sure how much
> you care for BMCs, but there may be some interest.

You are conflating your host kernel and your BMC here. The BMC kernel
is part of the "firmware", as is its DTS and the BMC userspace.

(Again this isn't the host DTS, this is the BMC DTS).

They get all updated together. My point isn't about the ease or
difficulty for a *user* to udpate their BMC, in that case the solutions
above are equivalent.

The point is from a system vendor perspective. A system vendor using
OpenBMC will *customize* their BMC build in various ways. Typically
they *will* have a custom DT since this is what represents their
specific system and they will have some specific userspace bits.

However, we are trying very hard for them *not* to fork the kernel, and
if possible move OpenBMC towards a fully upstream kernel (still working
on getting all the SoC drivers cleaned up and pushed up but

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-16 Thread Benjamin Herrenschmidt

On Mon, 2018-07-16 at 07:55 -0600, Rob Herring wrote:
> If that data is one set per SoC, then i'm not that concerned having
> platform-specific data in the driver. That doesn't mean the driver is
> not "generic". It's still not clear to me in this thread, how much of
> this is board specific, but given that you've placed all the data in
> an SoC dtsi file it seems to be all per SoC.

So Rob, I think that's precisely where the disconnect is.

I think we all (well hopefully) agree that those few tunables don't fit
in any existing subystem and aren't likely to ever do (famous last
words...).

Where we disagree is we want to make this parametrized via the DT, and
you want us to hard wire the list in some kind of SoC driver for a
given SoC family/version.

The reason I think hard wiring the list in the driver is not a great
solution is that that list in itself is prone to variations, possibly
fairly often, between boards, vendors, versions of boards, etc...

We can't know for sure every SoC tunable (out of the gazillions in
those chips) are going to be needed for a given system. We know which
ones we do use for ours, and that's a couple of handfuls, but it could
be that Dell need a slightly different set, and so might Yadro, or so
might our next board revision for that matter.

Now, updating the device-tree in the board flash with whatever vendor
specific information is needed is a LOT easier than getting the kernel
driver constantly updated. The device-tree after all is there to
reflect among other things system specific ways in which the SoC is
wired and configured. This is rather close...

Cheers,
Ben.

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-16 Thread Benjamin Herrenschmidt

On Mon, 2018-07-16 at 07:55 -0600, Rob Herring wrote:
> If that data is one set per SoC, then i'm not that concerned having
> platform-specific data in the driver. That doesn't mean the driver is
> not "generic". It's still not clear to me in this thread, how much of
> this is board specific, but given that you've placed all the data in
> an SoC dtsi file it seems to be all per SoC.

So Rob, I think that's precisely where the disconnect is.

I think we all (well hopefully) agree that those few tunables don't fit
in any existing subystem and aren't likely to ever do (famous last
words...).

Where we disagree is we want to make this parametrized via the DT, and
you want us to hard wire the list in some kind of SoC driver for a
given SoC family/version.

The reason I think hard wiring the list in the driver is not a great
solution is that that list in itself is prone to variations, possibly
fairly often, between boards, vendors, versions of boards, etc...

We can't know for sure every SoC tunable (out of the gazillions in
those chips) are going to be needed for a given system. We know which
ones we do use for ours, and that's a couple of handfuls, but it could
be that Dell need a slightly different set, and so might Yadro, or so
might our next board revision for that matter.

Now, updating the device-tree in the board flash with whatever vendor
specific information is needed is a LOT easier than getting the kernel
driver constantly updated. The device-tree after all is there to
reflect among other things system specific ways in which the SoC is
wired and configured. This is rather close...

Cheers,
Ben.

Re: [PATCH v3 1/5] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-16 Thread Benjamin Herrenschmidt

On Mon, 2018-07-16 at 09:33 -0600, Rob Herring wrote:
> On Thu, Jul 12, 2018 at 01:48:43PM +1000, Benjamin Herrenschmidt wrote:
> > This isn't per-se a real device, it's a pseudo-device that
> > represents the use of the Aspeed built-in ColdFire to
> > implement the FSI protocol by bitbanging the GPIOs instead
> > of doing it from the ARM core.
> > 
> > Thus it's a drop-in replacement for the existing
> > fsi-master-gpio pseudo-device for use on systems based
> > on the Aspeed chips. It has most of the same properties,
> > plus some more needed to operate the coprocessor.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  .../bindings/fsi/fsi-master-ast-cf.txt| 36 +++
> >  1 file changed, 36 insertions(+)
> >  create mode 100644 
> > Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt 
> > b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > new file mode 100644
> > index ..431bf8a423ce
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > @@ -0,0 +1,36 @@
> > +Device-tree bindings for ColdFire offloaded gpio-based FSI master driver
> > +
> > +
> > +Required properties:
> > + - compatible =
> > +   "aspeed,ast2400-cf-fsi-master" for an AST2400 based system
> > +   or
> > +   "aspeed,ast2500-cf-fsi-master" for an AST2500 based system
> > +
> > + - clock-gpios = ;: GPIO for FSI clock
> > + - data-gpios = ; : GPIO for FSI data signal
> > + - enable-gpios = ;   : GPIO for enable signal
> > + - trans-gpios = ;: GPIO for voltage translator 
> > enable
> > + - mux-gpios = ;  : GPIO for pin multiplexing with other
> > +  functions (eg, external FSI 
> > masters)
> > + - memory-region = ;  : Reference to the reserved 
> > memory for
> > +  the ColdFire. Must be 2M aligned 
> > on
> > + AST2400 and 1M aligned on AST2500
> > + - aspeed,sram = ;: Reference to the SRAM node.
> > + - aspeed,cvic = ;: Reference to the CVIC node.
> > +
> > +Examples:
> > +
> > +fsi-master {
> > +compatible = "aspeed,ast2500-cf-fsi-master", "fsi-master";
> > +
> > +   clock-gpios = < 0>;
> > +data-gpios = < 1>;
> > +enable-gpios = < 2>;
> > +trans-gpios = < 3>;
> > +mux-gpios = < 4>;
> > +
> > +   memory-region = <_memory>;
> > +   sram = <>;
> > +   cvic = <>;
> 
> Need to update the example. With that

Ah right, thanks. The next spin will have that fixed and will go into
the fsi tree.

> Reviewed-by: Rob Herring

Re: [PATCH v3 1/5] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-16 Thread Benjamin Herrenschmidt

On Mon, 2018-07-16 at 09:33 -0600, Rob Herring wrote:
> On Thu, Jul 12, 2018 at 01:48:43PM +1000, Benjamin Herrenschmidt wrote:
> > This isn't per-se a real device, it's a pseudo-device that
> > represents the use of the Aspeed built-in ColdFire to
> > implement the FSI protocol by bitbanging the GPIOs instead
> > of doing it from the ARM core.
> > 
> > Thus it's a drop-in replacement for the existing
> > fsi-master-gpio pseudo-device for use on systems based
> > on the Aspeed chips. It has most of the same properties,
> > plus some more needed to operate the coprocessor.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  .../bindings/fsi/fsi-master-ast-cf.txt| 36 +++
> >  1 file changed, 36 insertions(+)
> >  create mode 100644 
> > Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt 
> > b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > new file mode 100644
> > index ..431bf8a423ce
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > @@ -0,0 +1,36 @@
> > +Device-tree bindings for ColdFire offloaded gpio-based FSI master driver
> > +
> > +
> > +Required properties:
> > + - compatible =
> > +   "aspeed,ast2400-cf-fsi-master" for an AST2400 based system
> > +   or
> > +   "aspeed,ast2500-cf-fsi-master" for an AST2500 based system
> > +
> > + - clock-gpios = ;: GPIO for FSI clock
> > + - data-gpios = ; : GPIO for FSI data signal
> > + - enable-gpios = ;   : GPIO for enable signal
> > + - trans-gpios = ;: GPIO for voltage translator 
> > enable
> > + - mux-gpios = ;  : GPIO for pin multiplexing with other
> > +  functions (eg, external FSI 
> > masters)
> > + - memory-region = ;  : Reference to the reserved 
> > memory for
> > +  the ColdFire. Must be 2M aligned 
> > on
> > + AST2400 and 1M aligned on AST2500
> > + - aspeed,sram = ;: Reference to the SRAM node.
> > + - aspeed,cvic = ;: Reference to the CVIC node.
> > +
> > +Examples:
> > +
> > +fsi-master {
> > +compatible = "aspeed,ast2500-cf-fsi-master", "fsi-master";
> > +
> > +   clock-gpios = < 0>;
> > +data-gpios = < 1>;
> > +enable-gpios = < 2>;
> > +trans-gpios = < 3>;
> > +mux-gpios = < 4>;
> > +
> > +   memory-region = <_memory>;
> > +   sram = <>;
> > +   cvic = <>;
> 
> Need to update the example. With that

Ah right, thanks. The next spin will have that fixed and will go into
the fsi tree.

> Reviewed-by: Rob Herring

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-12 Thread Benjamin Herrenschmidt

On Thu, 2018-07-12 at 09:11 -0600, Rob Herring wrote:
> On Wed, Jul 11, 2018 at 6:54 PM Andrew Jeffery  wrote:
> > 
> > Hi Rob,
> > 
> > Thanks for the response.
> > 
> > On Thu, 12 Jul 2018, at 05:34, Rob Herring wrote:
> > > On Wed, Jul 11, 2018 at 03:01:19PM +0930, Andrew Jeffery wrote:
> > > > Baseboard Management Controllers (BMCs) are embedded SoCs that exist to
> > > > provide remote management of (primarily) server platforms. BMCs are
> > > > often tightly coupled to the platform in terms of behaviour and provide
> > > > many hardware features integral to booting and running the host system.
> > > > 
> > > > Some of these hardware features are simple, for example scratch
> > > > registers provided by the BMC that are exposed to both the host and the
> > > > BMC. In other cases there's a single bit switch to enable or disable
> > > > some of the provided functionality.
> > > > 
> > > > The documentation defines bindings for fields in registers that do not
> > > > integrate well into other driver models yet must be described to allow
> > > > the BMC kernel to assume control of these features.
> > > 
> > > So we'll get a new binding when that happens? That will break
> > > compatibility.
> > 
> > Can you please expand on this? I'm not following.
> 
> If we have a subsystem in the future, then there would likely be an
> associated binding which would be different. So if you update the DT,
> then old kernels won't work with it.

What kind of "subsystem" ? There is almost no way there could be one
for that sort of BMC tunables. We've look at several BMC chips out
there and requirements from several vendors, BIOS and system
manufacturers and it's all over the place.

> > I feel like this is an argument of tradition. Maybe people have
> > been dissuaded from doing so when they don't have a reasonable use-
> > case? I'm not saying that what I'm proposing is unquestionably
> > reasonable, but I don't want to dismiss it out of hand.
> 
> One of experience. The one that stands out is clock bindings.
> Initially we were doing a node per clock modelling which could end up
> being 100s of nodes and is difficult to get right (with DT being an
> ABI).
> 
> It comes up with system controller type blocks too that just have a
> bunch of random registers. Those change in every SoC and not in any
> controlled or ordered way that would make describing the individual
> sub-functions in DT worthwhile.

So what's the alternative ? Because without something like what we
propose, what's going to happen is /dev/mem ... that's what people do
today.

> > > A node per register bit doesn't scale.
> > 
> > It isn't meant to scale in terms of a single system. Using it
> > extensively is very likely wrong. Separately, register-bit-led does
> > pretty much the same thing. Doesn't the scale argument apply there?
> > Who is to stop me from attaching an insane number of LEDs to a
> > system?
> 
> Review.
> 
> If you look, register-bit-led is rarely used outside of some ARM, Ltd.
> boards. It's simply quite rare to have MMIO register bits that have a
> fixed function of LED control.

Well, same here, we hope to review what goes upstream to make it
reasonable. Otherwise it doens't matter. If a random vendor, let's say
IBM, chose to chip a system where they put an insane amount of cruft in
there, it will only affect those systems's BMC and the userspace stack
on it.

Thankfully that stack is OpenBMC and IBM is aiming at having their
device-tree's upstream, thus reviewed, thus it won't happen.

*Anything* can be abused. The point here is that we have a number,
thankfully rather small, maybe a dozen or two, of tunables that are
quite specific to a combination (system vendor, bmc vendor, system
model) which control a few HW features that essentially do *NOT* fit in
a subsystem.

For everything that does, we have created proper drivers (and are doing
more).


> > Obviously if there are lots of systems using it sparingly and
> > legitimately then maybe there's a scale issue, but isn't that just
> > a reality of different hardware designs? Whoever is implementing
> > support for the system is going to have to describe the hardware
> > one way or another.
> > 
> > > 
> > > Maybe this should be modelled using GPIO binding? There's a line there
> > > too as whether the signals are "general purpose" or not.
> > 
> > I don't think so, mainly because some of the things it is intended to be 
> > used for are not GPIOs. For instance, take the DAC mux I've described in 
> > the patch. It doesn't directly influence anything external to the SoC (i.e. 
> > it's certainly not a traditional GPIO in any sense). However, it does 
> > *indirectly* influence the SoC's behaviour by muxing the DAC internally 
> > between:
> > 
> > 0. VGA device exposed on the host PCIe bus
> > 1. The "Graphics CRT" controller
> > 2. VGA port A
> > 3. VGA port B
> 
> And this mux control is fixed in the SoC design?

This specific family of SoC (Aspeed) support those 4

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-12 Thread Benjamin Herrenschmidt

On Thu, 2018-07-12 at 09:11 -0600, Rob Herring wrote:
> On Wed, Jul 11, 2018 at 6:54 PM Andrew Jeffery  wrote:
> > 
> > Hi Rob,
> > 
> > Thanks for the response.
> > 
> > On Thu, 12 Jul 2018, at 05:34, Rob Herring wrote:
> > > On Wed, Jul 11, 2018 at 03:01:19PM +0930, Andrew Jeffery wrote:
> > > > Baseboard Management Controllers (BMCs) are embedded SoCs that exist to
> > > > provide remote management of (primarily) server platforms. BMCs are
> > > > often tightly coupled to the platform in terms of behaviour and provide
> > > > many hardware features integral to booting and running the host system.
> > > > 
> > > > Some of these hardware features are simple, for example scratch
> > > > registers provided by the BMC that are exposed to both the host and the
> > > > BMC. In other cases there's a single bit switch to enable or disable
> > > > some of the provided functionality.
> > > > 
> > > > The documentation defines bindings for fields in registers that do not
> > > > integrate well into other driver models yet must be described to allow
> > > > the BMC kernel to assume control of these features.
> > > 
> > > So we'll get a new binding when that happens? That will break
> > > compatibility.
> > 
> > Can you please expand on this? I'm not following.
> 
> If we have a subsystem in the future, then there would likely be an
> associated binding which would be different. So if you update the DT,
> then old kernels won't work with it.

What kind of "subsystem" ? There is almost no way there could be one
for that sort of BMC tunables. We've look at several BMC chips out
there and requirements from several vendors, BIOS and system
manufacturers and it's all over the place.

> > I feel like this is an argument of tradition. Maybe people have
> > been dissuaded from doing so when they don't have a reasonable use-
> > case? I'm not saying that what I'm proposing is unquestionably
> > reasonable, but I don't want to dismiss it out of hand.
> 
> One of experience. The one that stands out is clock bindings.
> Initially we were doing a node per clock modelling which could end up
> being 100s of nodes and is difficult to get right (with DT being an
> ABI).
> 
> It comes up with system controller type blocks too that just have a
> bunch of random registers. Those change in every SoC and not in any
> controlled or ordered way that would make describing the individual
> sub-functions in DT worthwhile.

So what's the alternative ? Because without something like what we
propose, what's going to happen is /dev/mem ... that's what people do
today.

> > > A node per register bit doesn't scale.
> > 
> > It isn't meant to scale in terms of a single system. Using it
> > extensively is very likely wrong. Separately, register-bit-led does
> > pretty much the same thing. Doesn't the scale argument apply there?
> > Who is to stop me from attaching an insane number of LEDs to a
> > system?
> 
> Review.
> 
> If you look, register-bit-led is rarely used outside of some ARM, Ltd.
> boards. It's simply quite rare to have MMIO register bits that have a
> fixed function of LED control.

Well, same here, we hope to review what goes upstream to make it
reasonable. Otherwise it doens't matter. If a random vendor, let's say
IBM, chose to chip a system where they put an insane amount of cruft in
there, it will only affect those systems's BMC and the userspace stack
on it.

Thankfully that stack is OpenBMC and IBM is aiming at having their
device-tree's upstream, thus reviewed, thus it won't happen.

*Anything* can be abused. The point here is that we have a number,
thankfully rather small, maybe a dozen or two, of tunables that are
quite specific to a combination (system vendor, bmc vendor, system
model) which control a few HW features that essentially do *NOT* fit in
a subsystem.

For everything that does, we have created proper drivers (and are doing
more).


> > Obviously if there are lots of systems using it sparingly and
> > legitimately then maybe there's a scale issue, but isn't that just
> > a reality of different hardware designs? Whoever is implementing
> > support for the system is going to have to describe the hardware
> > one way or another.
> > 
> > > 
> > > Maybe this should be modelled using GPIO binding? There's a line there
> > > too as whether the signals are "general purpose" or not.
> > 
> > I don't think so, mainly because some of the things it is intended to be 
> > used for are not GPIOs. For instance, take the DAC mux I've described in 
> > the patch. It doesn't directly influence anything external to the SoC (i.e. 
> > it's certainly not a traditional GPIO in any sense). However, it does 
> > *indirectly* influence the SoC's behaviour by muxing the DAC internally 
> > between:
> > 
> > 0. VGA device exposed on the host PCIe bus
> > 1. The "Graphics CRT" controller
> > 2. VGA port A
> > 3. VGA port B
> 
> And this mux control is fixed in the SoC design?

This specific family of SoC (Aspeed) support those 4

Re: [PATCH v3 3/5] fsi: master-ast-cf: Add new FSI master using Aspeed ColdFire

2018-07-12 Thread Benjamin Herrenschmidt

On Thu, 2018-07-12 at 17:53 +1000, Joel Stanley wrote:
> On 12 July 2018 at 13:48, Benjamin Herrenschmidt
>  wrote:
> > The Aspeed AST2x00 can contain a ColdFire v1 coprocessor which
> > is currently unused on OpenPower systems.
> > 
> > This adds an alternative to the fsi-master-gpio driver that
> > uses that coprocessor instead of bit banging from the ARM
> > core itself. The end result is about 4 times faster.
> > 
> > The firmware for the coprocessor and its source code can be
> > found at https://github.com/ozbenh/cf-fsi and is system specific.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  drivers/fsi/Kconfig  |9 +
> >  drivers/fsi/Makefile |1 +
> >  drivers/fsi/cf-fsi-fw.h  |  157 +++
> >  drivers/fsi/fsi-master-ast-cf.c  | 1438 ++
> >  include/trace/events/fsi_master_ast_cf.h |  150 +++
> >  5 files changed, 1755 insertions(+)
> >  create mode 100644 drivers/fsi/cf-fsi-fw.h
> >  create mode 100644 drivers/fsi/fsi-master-ast-cf.c
> >  create mode 100644 include/trace/events/fsi_master_ast_cf.h
> > 
> > diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
> > index 322cec393cf2..e0220d1e1357 100644
> > --- a/drivers/fsi/Kconfig
> > +++ b/drivers/fsi/Kconfig
> > @@ -27,6 +27,15 @@ config FSI_MASTER_HUB
> > allow chaining of FSI links to an arbitrary depth.  This allows for
> > a high target device fanout.
> > 
> > +config FSI_MASTER_AST_CF
> > +   tristate "FSI master based on Aspeed ColdFire coprocessor"
> > +   depends on GPIOLIB
> > +   depends on GPIO_ASPEED
> > +   ---help---
> > +   This option enables a FSI master using the AST2400 and AST2500 GPIO
> > +   lines driven by the internal ColdFire coprocessor. This requires
> > +   the corresponding machine specific ColdFire firmware to be 
> > available.
> 
> The "machine specific" part isn't true anymore, is it?

Right, I'll fixup the changelog before pushing if that's the last spin.

> I gave this a spin on a palmetto and it appeared to work fine for me.
> 
> Tested-by: Joel Stanley 

Thanks !

Cheers,
Ben
.

Re: [PATCH v3 3/5] fsi: master-ast-cf: Add new FSI master using Aspeed ColdFire

2018-07-12 Thread Benjamin Herrenschmidt

On Thu, 2018-07-12 at 17:53 +1000, Joel Stanley wrote:
> On 12 July 2018 at 13:48, Benjamin Herrenschmidt
>  wrote:
> > The Aspeed AST2x00 can contain a ColdFire v1 coprocessor which
> > is currently unused on OpenPower systems.
> > 
> > This adds an alternative to the fsi-master-gpio driver that
> > uses that coprocessor instead of bit banging from the ARM
> > core itself. The end result is about 4 times faster.
> > 
> > The firmware for the coprocessor and its source code can be
> > found at https://github.com/ozbenh/cf-fsi and is system specific.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  drivers/fsi/Kconfig  |9 +
> >  drivers/fsi/Makefile |1 +
> >  drivers/fsi/cf-fsi-fw.h  |  157 +++
> >  drivers/fsi/fsi-master-ast-cf.c  | 1438 ++
> >  include/trace/events/fsi_master_ast_cf.h |  150 +++
> >  5 files changed, 1755 insertions(+)
> >  create mode 100644 drivers/fsi/cf-fsi-fw.h
> >  create mode 100644 drivers/fsi/fsi-master-ast-cf.c
> >  create mode 100644 include/trace/events/fsi_master_ast_cf.h
> > 
> > diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
> > index 322cec393cf2..e0220d1e1357 100644
> > --- a/drivers/fsi/Kconfig
> > +++ b/drivers/fsi/Kconfig
> > @@ -27,6 +27,15 @@ config FSI_MASTER_HUB
> > allow chaining of FSI links to an arbitrary depth.  This allows for
> > a high target device fanout.
> > 
> > +config FSI_MASTER_AST_CF
> > +   tristate "FSI master based on Aspeed ColdFire coprocessor"
> > +   depends on GPIOLIB
> > +   depends on GPIO_ASPEED
> > +   ---help---
> > +   This option enables a FSI master using the AST2400 and AST2500 GPIO
> > +   lines driven by the internal ColdFire coprocessor. This requires
> > +   the corresponding machine specific ColdFire firmware to be 
> > available.
> 
> The "machine specific" part isn't true anymore, is it?

Right, I'll fixup the changelog before pushing if that's the last spin.

> I gave this a spin on a palmetto and it appeared to work fine for me.
> 
> Tested-by: Joel Stanley 

Thanks !

Cheers,
Ben
.

[PATCH v3 1/5] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-11 Thread Benjamin Herrenschmidt

This isn't per-se a real device, it's a pseudo-device that
represents the use of the Aspeed built-in ColdFire to
implement the FSI protocol by bitbanging the GPIOs instead
of doing it from the ARM core.

Thus it's a drop-in replacement for the existing
fsi-master-gpio pseudo-device for use on systems based
on the Aspeed chips. It has most of the same properties,
plus some more needed to operate the coprocessor.

Signed-off-by: Benjamin Herrenschmidt 
---
 .../bindings/fsi/fsi-master-ast-cf.txt| 36 +++
 1 file changed, 36 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt

diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt 
b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
new file mode 100644
index ..431bf8a423ce
--- /dev/null
+++ b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
@@ -0,0 +1,36 @@
+Device-tree bindings for ColdFire offloaded gpio-based FSI master driver
+
+
+Required properties:
+ - compatible =
+   "aspeed,ast2400-cf-fsi-master" for an AST2400 based system
+   or
+   "aspeed,ast2500-cf-fsi-master" for an AST2500 based system
+
+ - clock-gpios = ;: GPIO for FSI clock
+ - data-gpios = ; : GPIO for FSI data signal
+ - enable-gpios = ;   : GPIO for enable signal
+ - trans-gpios = ;: GPIO for voltage translator enable
+ - mux-gpios = ;  : GPIO for pin multiplexing with other
+  functions (eg, external FSI masters)
+ - memory-region = ;  : Reference to the reserved memory for
+  the ColdFire. Must be 2M aligned on
+ AST2400 and 1M aligned on AST2500
+ - aspeed,sram = ;: Reference to the SRAM node.
+ - aspeed,cvic = ;: Reference to the CVIC node.
+
+Examples:
+
+fsi-master {
+compatible = "aspeed,ast2500-cf-fsi-master", "fsi-master";
+
+   clock-gpios = < 0>;
+data-gpios = < 1>;
+enable-gpios = < 2>;
+trans-gpios = < 3>;
+mux-gpios = < 4>;
+
+   memory-region = <_memory>;
+   sram = <>;
+   cvic = <>;
+}
-- 
2.17.1

[PATCH v3 3/5] fsi: master-ast-cf: Add new FSI master using Aspeed ColdFire

2018-07-11 Thread Benjamin Herrenschmidt

The Aspeed AST2x00 can contain a ColdFire v1 coprocessor which
is currently unused on OpenPower systems.

This adds an alternative to the fsi-master-gpio driver that
uses that coprocessor instead of bit banging from the ARM
core itself. The end result is about 4 times faster.

The firmware for the coprocessor and its source code can be
found at https://github.com/ozbenh/cf-fsi and is system specific.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/fsi/Kconfig  |9 +
 drivers/fsi/Makefile |1 +
 drivers/fsi/cf-fsi-fw.h  |  157 +++
 drivers/fsi/fsi-master-ast-cf.c  | 1438 ++
 include/trace/events/fsi_master_ast_cf.h |  150 +++
 5 files changed, 1755 insertions(+)
 create mode 100644 drivers/fsi/cf-fsi-fw.h
 create mode 100644 drivers/fsi/fsi-master-ast-cf.c
 create mode 100644 include/trace/events/fsi_master_ast_cf.h

diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
index 322cec393cf2..e0220d1e1357 100644
--- a/drivers/fsi/Kconfig
+++ b/drivers/fsi/Kconfig
@@ -27,6 +27,15 @@ config FSI_MASTER_HUB
allow chaining of FSI links to an arbitrary depth.  This allows for
a high target device fanout.
 
+config FSI_MASTER_AST_CF
+   tristate "FSI master based on Aspeed ColdFire coprocessor"
+   depends on GPIOLIB
+   depends on GPIO_ASPEED
+   ---help---
+   This option enables a FSI master using the AST2400 and AST2500 GPIO
+   lines driven by the internal ColdFire coprocessor. This requires
+   the corresponding machine specific ColdFire firmware to be available.
+
 config FSI_SCOM
tristate "SCOM FSI client device driver"
---help---
diff --git a/drivers/fsi/Makefile b/drivers/fsi/Makefile
index 75fdc6d8cfc4..62687ec86d2e 100644
--- a/drivers/fsi/Makefile
+++ b/drivers/fsi/Makefile
@@ -2,6 +2,7 @@
 obj-$(CONFIG_FSI) += fsi-core.o
 obj-$(CONFIG_FSI_MASTER_HUB) += fsi-master-hub.o
 obj-$(CONFIG_FSI_MASTER_GPIO) += fsi-master-gpio.o
+obj-$(CONFIG_FSI_MASTER_AST_CF) += fsi-master-ast-cf.o
 obj-$(CONFIG_FSI_SCOM) += fsi-scom.o
 obj-$(CONFIG_FSI_SBEFIFO) += fsi-sbefifo.o
 obj-$(CONFIG_FSI_OCC) += fsi-occ.o
diff --git a/drivers/fsi/cf-fsi-fw.h b/drivers/fsi/cf-fsi-fw.h
new file mode 100644
index ..712df0461911
--- /dev/null
+++ b/drivers/fsi/cf-fsi-fw.h
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0+
+#ifndef __CF_FSI_FW_H
+#define __CF_FSI_FW_H
+
+/*
+ * uCode file layout
+ *
+ * ...03ff : m68k exception vectors
+ * 0400...04ff : Header info & boot config block
+ * 0500... : Code & stack
+ */
+
+/*
+ * Header info & boot config area
+ *
+ * The Header info is built into the ucode and provide version and
+ * platform information.
+ *
+ * the Boot config needs to be adjusted by the ARM prior to starting
+ * the ucode if the Command/Status area isn't at 0x32 in CF space
+ * (ie. beginning of SRAM).
+ */
+
+#define HDR_OFFSET 0x400
+
+/* Info: Signature & version */
+#define HDR_SYS_SIG0x00/* 2 bytes system signature */
+#define  SYS_SIG_SHARED0x5348
+#define  SYS_SIG_SPLIT 0x5350
+#define HDR_FW_VERS0x02/* 2 bytes Major.Minor */
+#define HDR_API_VERS   0x04/* 2 bytes Major.Minor */
+#define  API_VERSION_MAJ   2   /* Current version */
+#define  API_VERSION_MIN   1
+#define HDR_FW_OPTIONS 0x08/* 4 bytes option flags */
+#define  FW_OPTION_TRACE_EN0x0001  /* FW tracing enabled */
+#define FW_OPTION_CONT_CLOCK   0x0002  /* Continuous clocking 
supported */
+#define HDR_FW_SIZE0x10/* 4 bytes size for combo image */
+
+/* Boot Config: Address of Command/Status area */
+#define HDR_CMD_STAT_AREA  0x80/* 4 bytes CF address */
+#define HDR_FW_CONTROL 0x84/* 4 bytes control flags */
+#define FW_CONTROL_CONT_CLOCK  0x0002  /* Continuous clocking 
enabled */
+#define FW_CONTROL_DUMMY_RD0x0004  /* Extra dummy read 
(AST2400) */
+#define FW_CONTROL_USE_STOP0x0008  /* Use STOP 
instructions */
+#define HDR_CLOCK_GPIO_VADDR   0x90/* 2 bytes offset from GPIO base */
+#define HDR_CLOCK_GPIO_DADDR   0x92/* 2 bytes offset from GPIO base */
+#define HDR_DATA_GPIO_VADDR0x94/* 2 bytes offset from GPIO base */
+#define HDR_DATA_GPIO_DADDR0x96/* 2 bytes offset from GPIO base */
+#define HDR_TRANS_GPIO_VADDR   0x98/* 2 bytes offset from GPIO base */
+#define HDR_TRANS_GPIO_DADDR   0x9a/* 2 bytes offset from GPIO base */
+#define HDR_CLOCK_GPIO_BIT 0x9c/* 1 byte bit number */
+#define HDR_DATA_GPIO_BIT  0x9d/* 1 byte bit number */
+#define HDR_TRANS_GPIO_BIT 0x9e/* 1 byte bit number */
+
+/*
+ *  Command/Status area layout: Main part
+ */
+
+/* Command/Status register:
+ *
+ * +---+
+ * | STAT | RLEN | CLEN | C

[PATCH v3 1/5] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-11 Thread Benjamin Herrenschmidt

This isn't per-se a real device, it's a pseudo-device that
represents the use of the Aspeed built-in ColdFire to
implement the FSI protocol by bitbanging the GPIOs instead
of doing it from the ARM core.

Thus it's a drop-in replacement for the existing
fsi-master-gpio pseudo-device for use on systems based
on the Aspeed chips. It has most of the same properties,
plus some more needed to operate the coprocessor.

Signed-off-by: Benjamin Herrenschmidt 
---
 .../bindings/fsi/fsi-master-ast-cf.txt| 36 +++
 1 file changed, 36 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt

diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt 
b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
new file mode 100644
index ..431bf8a423ce
--- /dev/null
+++ b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
@@ -0,0 +1,36 @@
+Device-tree bindings for ColdFire offloaded gpio-based FSI master driver
+
+
+Required properties:
+ - compatible =
+   "aspeed,ast2400-cf-fsi-master" for an AST2400 based system
+   or
+   "aspeed,ast2500-cf-fsi-master" for an AST2500 based system
+
+ - clock-gpios = ;: GPIO for FSI clock
+ - data-gpios = ; : GPIO for FSI data signal
+ - enable-gpios = ;   : GPIO for enable signal
+ - trans-gpios = ;: GPIO for voltage translator enable
+ - mux-gpios = ;  : GPIO for pin multiplexing with other
+  functions (eg, external FSI masters)
+ - memory-region = ;  : Reference to the reserved memory for
+  the ColdFire. Must be 2M aligned on
+ AST2400 and 1M aligned on AST2500
+ - aspeed,sram = ;: Reference to the SRAM node.
+ - aspeed,cvic = ;: Reference to the CVIC node.
+
+Examples:
+
+fsi-master {
+compatible = "aspeed,ast2500-cf-fsi-master", "fsi-master";
+
+   clock-gpios = < 0>;
+data-gpios = < 1>;
+enable-gpios = < 2>;
+trans-gpios = < 3>;
+mux-gpios = < 4>;
+
+   memory-region = <_memory>;
+   sram = <>;
+   cvic = <>;
+}
-- 
2.17.1

[PATCH v3 3/5] fsi: master-ast-cf: Add new FSI master using Aspeed ColdFire

2018-07-11 Thread Benjamin Herrenschmidt

The Aspeed AST2x00 can contain a ColdFire v1 coprocessor which
is currently unused on OpenPower systems.

This adds an alternative to the fsi-master-gpio driver that
uses that coprocessor instead of bit banging from the ARM
core itself. The end result is about 4 times faster.

The firmware for the coprocessor and its source code can be
found at https://github.com/ozbenh/cf-fsi and is system specific.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/fsi/Kconfig  |9 +
 drivers/fsi/Makefile |1 +
 drivers/fsi/cf-fsi-fw.h  |  157 +++
 drivers/fsi/fsi-master-ast-cf.c  | 1438 ++
 include/trace/events/fsi_master_ast_cf.h |  150 +++
 5 files changed, 1755 insertions(+)
 create mode 100644 drivers/fsi/cf-fsi-fw.h
 create mode 100644 drivers/fsi/fsi-master-ast-cf.c
 create mode 100644 include/trace/events/fsi_master_ast_cf.h

diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
index 322cec393cf2..e0220d1e1357 100644
--- a/drivers/fsi/Kconfig
+++ b/drivers/fsi/Kconfig
@@ -27,6 +27,15 @@ config FSI_MASTER_HUB
allow chaining of FSI links to an arbitrary depth.  This allows for
a high target device fanout.
 
+config FSI_MASTER_AST_CF
+   tristate "FSI master based on Aspeed ColdFire coprocessor"
+   depends on GPIOLIB
+   depends on GPIO_ASPEED
+   ---help---
+   This option enables a FSI master using the AST2400 and AST2500 GPIO
+   lines driven by the internal ColdFire coprocessor. This requires
+   the corresponding machine specific ColdFire firmware to be available.
+
 config FSI_SCOM
tristate "SCOM FSI client device driver"
---help---
diff --git a/drivers/fsi/Makefile b/drivers/fsi/Makefile
index 75fdc6d8cfc4..62687ec86d2e 100644
--- a/drivers/fsi/Makefile
+++ b/drivers/fsi/Makefile
@@ -2,6 +2,7 @@
 obj-$(CONFIG_FSI) += fsi-core.o
 obj-$(CONFIG_FSI_MASTER_HUB) += fsi-master-hub.o
 obj-$(CONFIG_FSI_MASTER_GPIO) += fsi-master-gpio.o
+obj-$(CONFIG_FSI_MASTER_AST_CF) += fsi-master-ast-cf.o
 obj-$(CONFIG_FSI_SCOM) += fsi-scom.o
 obj-$(CONFIG_FSI_SBEFIFO) += fsi-sbefifo.o
 obj-$(CONFIG_FSI_OCC) += fsi-occ.o
diff --git a/drivers/fsi/cf-fsi-fw.h b/drivers/fsi/cf-fsi-fw.h
new file mode 100644
index ..712df0461911
--- /dev/null
+++ b/drivers/fsi/cf-fsi-fw.h
@@ -0,0 +1,157 @@
+// SPDX-License-Identifier: GPL-2.0+
+#ifndef __CF_FSI_FW_H
+#define __CF_FSI_FW_H
+
+/*
+ * uCode file layout
+ *
+ * ...03ff : m68k exception vectors
+ * 0400...04ff : Header info & boot config block
+ * 0500... : Code & stack
+ */
+
+/*
+ * Header info & boot config area
+ *
+ * The Header info is built into the ucode and provide version and
+ * platform information.
+ *
+ * the Boot config needs to be adjusted by the ARM prior to starting
+ * the ucode if the Command/Status area isn't at 0x32 in CF space
+ * (ie. beginning of SRAM).
+ */
+
+#define HDR_OFFSET 0x400
+
+/* Info: Signature & version */
+#define HDR_SYS_SIG0x00/* 2 bytes system signature */
+#define  SYS_SIG_SHARED0x5348
+#define  SYS_SIG_SPLIT 0x5350
+#define HDR_FW_VERS0x02/* 2 bytes Major.Minor */
+#define HDR_API_VERS   0x04/* 2 bytes Major.Minor */
+#define  API_VERSION_MAJ   2   /* Current version */
+#define  API_VERSION_MIN   1
+#define HDR_FW_OPTIONS 0x08/* 4 bytes option flags */
+#define  FW_OPTION_TRACE_EN0x0001  /* FW tracing enabled */
+#define FW_OPTION_CONT_CLOCK   0x0002  /* Continuous clocking 
supported */
+#define HDR_FW_SIZE0x10/* 4 bytes size for combo image */
+
+/* Boot Config: Address of Command/Status area */
+#define HDR_CMD_STAT_AREA  0x80/* 4 bytes CF address */
+#define HDR_FW_CONTROL 0x84/* 4 bytes control flags */
+#define FW_CONTROL_CONT_CLOCK  0x0002  /* Continuous clocking 
enabled */
+#define FW_CONTROL_DUMMY_RD0x0004  /* Extra dummy read 
(AST2400) */
+#define FW_CONTROL_USE_STOP0x0008  /* Use STOP 
instructions */
+#define HDR_CLOCK_GPIO_VADDR   0x90/* 2 bytes offset from GPIO base */
+#define HDR_CLOCK_GPIO_DADDR   0x92/* 2 bytes offset from GPIO base */
+#define HDR_DATA_GPIO_VADDR0x94/* 2 bytes offset from GPIO base */
+#define HDR_DATA_GPIO_DADDR0x96/* 2 bytes offset from GPIO base */
+#define HDR_TRANS_GPIO_VADDR   0x98/* 2 bytes offset from GPIO base */
+#define HDR_TRANS_GPIO_DADDR   0x9a/* 2 bytes offset from GPIO base */
+#define HDR_CLOCK_GPIO_BIT 0x9c/* 1 byte bit number */
+#define HDR_DATA_GPIO_BIT  0x9d/* 1 byte bit number */
+#define HDR_TRANS_GPIO_BIT 0x9e/* 1 byte bit number */
+
+/*
+ *  Command/Status area layout: Main part
+ */
+
+/* Command/Status register:
+ *
+ * +---+
+ * | STAT | RLEN | CLEN | C

[PATCH v3 2/5] devres: Add devm_of_iomap()

2018-07-11 Thread Benjamin Herrenschmidt

There are still quite a few cases where a device might want
to get to a different node of the device-tree, obtain the
resources and map them.

We have of_iomap() and of_io_request_and_map() but they both
have shortcomings, such as not returning the size of the
resource found (which can be useful) and not being "managed".

This adds a devm_of_iomap() that provides all of these and
should probably replace uses of the above in most drivers.

Signed-off-by: Benjamin Herrenschmidt 
Reviewed-by: Linus Walleij 
Reviewed-by: Joel Stanley 
---
 include/linux/device.h |  4 
 lib/devres.c   | 36 
 2 files changed, 40 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 477956990f5e..96249d790374 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -688,6 +688,10 @@ extern void devm_free_pages(struct device *dev, unsigned 
long addr);
 
 void __iomem *devm_ioremap_resource(struct device *dev, struct resource *res);
 
+void __iomem *devm_of_iomap(struct device *dev,
+   struct device_node *node, int index,
+   resource_size_t *size);
+
 /* allows to add/remove a custom action to devres stack */
 int devm_add_action(struct device *dev, void (*action)(void *), void *data);
 void devm_remove_action(struct device *dev, void (*action)(void *), void 
*data);
diff --git a/lib/devres.c b/lib/devres.c
index 5bec1120b392..faccf1a037d0 100644
--- a/lib/devres.c
+++ b/lib/devres.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 
 enum devm_ioremap_type {
DEVM_IOREMAP = 0,
@@ -162,6 +163,41 @@ void __iomem *devm_ioremap_resource(struct device *dev, 
struct resource *res)
 }
 EXPORT_SYMBOL(devm_ioremap_resource);
 
+/*
+ * devm_of_iomap - Requests a resource and maps the memory mapped IO
+ *for a given device_node managed by a given device
+ *
+ * Checks that a resource is a valid memory region, requests the memory
+ * region and ioremaps it. All operations are managed and will be undone
+ * on driver detach of the device.
+ *
+ * This is to be used when a device requests/maps resources described
+ * by other device tree nodes (children or otherwise).
+ *
+ * @dev:   The device "managing" the resource
+ * @node:   The device-tree node where the resource resides
+ * @index: index of the MMIO range in the "reg" property
+ * @size:  Returns the size of the resource (pass NULL if not needed)
+ * Returns a pointer to the requested and mapped memory or an ERR_PTR() encoded
+ * error code on failure. Usage example:
+ *
+ * base = devm_of_iomap(>dev, node, 0, NULL);
+ * if (IS_ERR(base))
+ * return PTR_ERR(base);
+ */
+void __iomem *devm_of_iomap(struct device *dev, struct device_node *node, int 
index,
+   resource_size_t *size)
+{
+   struct resource res;
+
+   if (of_address_to_resource(node, index, ))
+   return IOMEM_ERR_PTR(-EINVAL);
+   if (size)
+   *size = resource_size();
+   return devm_ioremap_resource(dev, );
+}
+EXPORT_SYMBOL(devm_of_iomap);
+
 #ifdef CONFIG_HAS_IOPORT_MAP
 /*
  * Generic iomap devres
-- 
2.17.1

[PATCH v3 4/5] arm: dts: OpenPower Romulus system can use coprocessor for FSI

2018-07-11 Thread Benjamin Herrenschmidt

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts 
b/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts
index 347938673c83..4f4e42e47e3f 100644
--- a/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts
+++ b/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts
@@ -37,6 +37,11 @@
compatible = "shared-dma-pool";
reusable;
};
+
+   coldfire_memory: codefire_memory@9ef0 {
+   reg = <0x9ef0 0x0010>;
+   no-map;
+   };
};
 
leds {
@@ -56,10 +61,13 @@
};
 
fsi: gpio-fsi {
-   compatible = "fsi-master-gpio", "fsi-master";
+   compatible = "aspeed,ast2500-cf-fsi-master", "fsi-master";
#address-cells = <2>;
#size-cells = <0>;
-   no-gpio-delays;
+
+   memory-region = <_memory>;
+   aspeed,sram = <>;
+   aspeed,cvic = <>;
 
clock-gpios = < ASPEED_GPIO(AA, 0) GPIO_ACTIVE_HIGH>;
data-gpios = < ASPEED_GPIO(AA, 2) GPIO_ACTIVE_HIGH>;
-- 
2.17.1

[PATCH v3 4/5] arm: dts: OpenPower Romulus system can use coprocessor for FSI

2018-07-11 Thread Benjamin Herrenschmidt

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts 
b/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts
index 347938673c83..4f4e42e47e3f 100644
--- a/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts
+++ b/arch/arm/boot/dts/aspeed-bmc-opp-romulus.dts
@@ -37,6 +37,11 @@
compatible = "shared-dma-pool";
reusable;
};
+
+   coldfire_memory: codefire_memory@9ef0 {
+   reg = <0x9ef0 0x0010>;
+   no-map;
+   };
};
 
leds {
@@ -56,10 +61,13 @@
};
 
fsi: gpio-fsi {
-   compatible = "fsi-master-gpio", "fsi-master";
+   compatible = "aspeed,ast2500-cf-fsi-master", "fsi-master";
#address-cells = <2>;
#size-cells = <0>;
-   no-gpio-delays;
+
+   memory-region = <_memory>;
+   aspeed,sram = <>;
+   aspeed,cvic = <>;
 
clock-gpios = < ASPEED_GPIO(AA, 0) GPIO_ACTIVE_HIGH>;
data-gpios = < ASPEED_GPIO(AA, 2) GPIO_ACTIVE_HIGH>;
-- 
2.17.1

[PATCH v3 2/5] devres: Add devm_of_iomap()

2018-07-11 Thread Benjamin Herrenschmidt

There are still quite a few cases where a device might want
to get to a different node of the device-tree, obtain the
resources and map them.

We have of_iomap() and of_io_request_and_map() but they both
have shortcomings, such as not returning the size of the
resource found (which can be useful) and not being "managed".

This adds a devm_of_iomap() that provides all of these and
should probably replace uses of the above in most drivers.

Signed-off-by: Benjamin Herrenschmidt 
Reviewed-by: Linus Walleij 
Reviewed-by: Joel Stanley 
---
 include/linux/device.h |  4 
 lib/devres.c   | 36 
 2 files changed, 40 insertions(+)

diff --git a/include/linux/device.h b/include/linux/device.h
index 477956990f5e..96249d790374 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -688,6 +688,10 @@ extern void devm_free_pages(struct device *dev, unsigned 
long addr);
 
 void __iomem *devm_ioremap_resource(struct device *dev, struct resource *res);
 
+void __iomem *devm_of_iomap(struct device *dev,
+   struct device_node *node, int index,
+   resource_size_t *size);
+
 /* allows to add/remove a custom action to devres stack */
 int devm_add_action(struct device *dev, void (*action)(void *), void *data);
 void devm_remove_action(struct device *dev, void (*action)(void *), void 
*data);
diff --git a/lib/devres.c b/lib/devres.c
index 5bec1120b392..faccf1a037d0 100644
--- a/lib/devres.c
+++ b/lib/devres.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 
 enum devm_ioremap_type {
DEVM_IOREMAP = 0,
@@ -162,6 +163,41 @@ void __iomem *devm_ioremap_resource(struct device *dev, 
struct resource *res)
 }
 EXPORT_SYMBOL(devm_ioremap_resource);
 
+/*
+ * devm_of_iomap - Requests a resource and maps the memory mapped IO
+ *for a given device_node managed by a given device
+ *
+ * Checks that a resource is a valid memory region, requests the memory
+ * region and ioremaps it. All operations are managed and will be undone
+ * on driver detach of the device.
+ *
+ * This is to be used when a device requests/maps resources described
+ * by other device tree nodes (children or otherwise).
+ *
+ * @dev:   The device "managing" the resource
+ * @node:   The device-tree node where the resource resides
+ * @index: index of the MMIO range in the "reg" property
+ * @size:  Returns the size of the resource (pass NULL if not needed)
+ * Returns a pointer to the requested and mapped memory or an ERR_PTR() encoded
+ * error code on failure. Usage example:
+ *
+ * base = devm_of_iomap(>dev, node, 0, NULL);
+ * if (IS_ERR(base))
+ * return PTR_ERR(base);
+ */
+void __iomem *devm_of_iomap(struct device *dev, struct device_node *node, int 
index,
+   resource_size_t *size)
+{
+   struct resource res;
+
+   if (of_address_to_resource(node, index, ))
+   return IOMEM_ERR_PTR(-EINVAL);
+   if (size)
+   *size = resource_size();
+   return devm_ioremap_resource(dev, );
+}
+EXPORT_SYMBOL(devm_of_iomap);
+
 #ifdef CONFIG_HAS_IOPORT_MAP
 /*
  * Generic iomap devres
-- 
2.17.1

[PATCH v3 5/5] arm: dts: OpenPower Palmetto system can use coprocessor for FSI

2018-07-11 Thread Benjamin Herrenschmidt

This switches away from userspace bitbanging to kernel FSI
using the coprocessor.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts | 28 ++-
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts 
b/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts
index e6095f51ecf5..feec377c04d4 100644
--- a/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts
+++ b/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts
@@ -31,6 +31,11 @@
no-map;
reg = <0x9800 0x0100>; /* 16MB */
};
+
+   coldfire_memory: codefire_memory@5ee0 {
+   reg = <0x5ee0 0x0020>;
+   no-map;
+   };
};
 
leds {
@@ -49,6 +54,22 @@
};
};
 
+   fsi: gpio-fsi {
+   compatible = "aspeed,ast2400-cf-fsi-master", "fsi-master";
+   #address-cells = <2>;
+   #size-cells = <0>;
+
+   memory-region = <_memory>;
+   aspeed,sram = <>;
+   aspeed,cvic = <>;
+
+   clock-gpios = < ASPEED_GPIO(A, 4) GPIO_ACTIVE_HIGH>;
+   data-gpios = < ASPEED_GPIO(A, 5) GPIO_ACTIVE_HIGH>;
+   mux-gpios = < ASPEED_GPIO(A, 6) GPIO_ACTIVE_HIGH>;
+   enable-gpios = < ASPEED_GPIO(D, 0) GPIO_ACTIVE_HIGH>;
+   trans-gpios = < ASPEED_GPIO(H, 6) GPIO_ACTIVE_HIGH>;
+   };
+
gpio-keys {
compatible = "gpio-keys";
 
@@ -323,13 +344,6 @@
line-name = "SYS_PWROK_BMC";
};
 
-   pin_gpio_h6 {
-   gpio-hog;
-   gpios = ;
-   output-high;
-   line-name = "SCM1_FSI0_DATA_EN";
-   };
-
pin_gpio_h7 {
gpio-hog;
gpios = ;
-- 
2.17.1

[PATCH v3 5/5] arm: dts: OpenPower Palmetto system can use coprocessor for FSI

2018-07-11 Thread Benjamin Herrenschmidt

This switches away from userspace bitbanging to kernel FSI
using the coprocessor.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts | 28 ++-
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts 
b/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts
index e6095f51ecf5..feec377c04d4 100644
--- a/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts
+++ b/arch/arm/boot/dts/aspeed-bmc-opp-palmetto.dts
@@ -31,6 +31,11 @@
no-map;
reg = <0x9800 0x0100>; /* 16MB */
};
+
+   coldfire_memory: codefire_memory@5ee0 {
+   reg = <0x5ee0 0x0020>;
+   no-map;
+   };
};
 
leds {
@@ -49,6 +54,22 @@
};
};
 
+   fsi: gpio-fsi {
+   compatible = "aspeed,ast2400-cf-fsi-master", "fsi-master";
+   #address-cells = <2>;
+   #size-cells = <0>;
+
+   memory-region = <_memory>;
+   aspeed,sram = <>;
+   aspeed,cvic = <>;
+
+   clock-gpios = < ASPEED_GPIO(A, 4) GPIO_ACTIVE_HIGH>;
+   data-gpios = < ASPEED_GPIO(A, 5) GPIO_ACTIVE_HIGH>;
+   mux-gpios = < ASPEED_GPIO(A, 6) GPIO_ACTIVE_HIGH>;
+   enable-gpios = < ASPEED_GPIO(D, 0) GPIO_ACTIVE_HIGH>;
+   trans-gpios = < ASPEED_GPIO(H, 6) GPIO_ACTIVE_HIGH>;
+   };
+
gpio-keys {
compatible = "gpio-keys";
 
@@ -323,13 +344,6 @@
line-name = "SYS_PWROK_BMC";
};
 
-   pin_gpio_h6 {
-   gpio-hog;
-   gpios = ;
-   output-high;
-   line-name = "SCM1_FSI0_DATA_EN";
-   };
-
pin_gpio_h7 {
gpio-hog;
gpios = ;
-- 
2.17.1

[PATCH 0/5] fsi: Coldfire coprocessor offload

2018-07-11 Thread Benjamin Herrenschmidt

This series implements support for offloading the FSI protocol bitbanging
to the ColdFire secondary core of the Aspeed SoCs. The result increases
FSI performance by a factor of 4, and on systems that don't support async
FSI clock, provide much more regular and continuous clocking which helps
reliability.

This series is much smaller than the previous submissions as I already
merged all the "preparatory" work into the FSI tree. This is now strictly
the coldfire support and is now only waiting for ack of the DT bindings
(and whatever other review comments might come my way) before I put it
into the FSI tree and sends it to Greg.

There are two dependencies to be able to build/use this. The above
prep work:

  https://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git/log/

And the Aspeed GPIO driver changes for handling with GPIO lines ownership and
handshaking. The patches have been merged into the GPIO tree and a dedicated
immutable topic branch and can be found here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio.git/log/?h=ib-aspeed

Finally, the driver needs a machine specific firmware file. The firwmare
is open source and available at:

https://github.com/ozbenh/cf-fsi

I will submit it to linux-firmware if there's enough popular demand ;-)

v2. Fix misuse of devm_kzalloc for objects containing a struct device
in fsi-master-gpio and fsi-master-ast-cf (similar fixes for the
various FSI drivers will come later a part of rework to move away
from misc devs).

v3. Sparse fix and updated DT bindings as per Rob's comments.

[PATCH 0/5] fsi: Coldfire coprocessor offload

2018-07-11 Thread Benjamin Herrenschmidt

This series implements support for offloading the FSI protocol bitbanging
to the ColdFire secondary core of the Aspeed SoCs. The result increases
FSI performance by a factor of 4, and on systems that don't support async
FSI clock, provide much more regular and continuous clocking which helps
reliability.

This series is much smaller than the previous submissions as I already
merged all the "preparatory" work into the FSI tree. This is now strictly
the coldfire support and is now only waiting for ack of the DT bindings
(and whatever other review comments might come my way) before I put it
into the FSI tree and sends it to Greg.

There are two dependencies to be able to build/use this. The above
prep work:

  https://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git/log/

And the Aspeed GPIO driver changes for handling with GPIO lines ownership and
handshaking. The patches have been merged into the GPIO tree and a dedicated
immutable topic branch and can be found here:

  
https://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio.git/log/?h=ib-aspeed

Finally, the driver needs a machine specific firmware file. The firwmare
is open source and available at:

https://github.com/ozbenh/cf-fsi

I will submit it to linux-firmware if there's enough popular demand ;-)

v2. Fix misuse of devm_kzalloc for objects containing a struct device
in fsi-master-gpio and fsi-master-ast-cf (similar fixes for the
various FSI drivers will come later a part of rework to move away
from misc devs).

v3. Sparse fix and updated DT bindings as per Rob's comments.

[GIT PULL] FSI updates round 2 for 4.19

2018-07-11 Thread Benjamin Herrenschmidt

Hi Greg !

This is the second round of updates of the FSI stack.

These comprise of:

 - Some build fixes detected with COMPILE_TEST when
used on "other" archs

 - Sparse warning fixes

 - Some object lifetime fixes

 - A new feature to control some of the protocol delays

 - Overheaul of the "SCOM" driver used to access the POWER
processor internal SCOM bus.

 - Preparatory work for the new ColdFire coprocessor based FSI master
work which, along some other newer work is still going through the last
few rounds of polish and DT bindings reviews.

The following changes since commit 9f4a8a2d7f9d71093f41c4bb0ef8707e8145bad3:

  fsi/sbefifo: Add driver for the SBE FIFO (2018-06-12 14:05:39 +1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git 
tags/fsi-updates-2018-07-12

for you to fetch changes up to fea9cf321c916e9372874e6f2af1bf0b5beb89fb:

  fsi: Move various master definitions to a common header (2018-07-12 12:06:02 
+1000)

----
Benjamin Herrenschmidt (17):
  fsi: sbefifo: Remove unneeded semicolon
  fsi: scom: Add mutex around FSI2PIB accesses
  fsi: scom: Whitespace fixes
  fsi: scom: Fixup endian annotations
  fsi: scom: Add register definitions
  fsi: scom: Major overhaul
  fsi: sbefifo: Fix checker warning about late NULL check
  fsi: Move code around to avoid forward declaration
  fsi: Add mechanism to set the tSendDelay and tEchoDelay values
  fsi: master-gpio: Rename and adjust send delay
  fsi: master-gpio: Add support for link_config
  fsi: master-gpio: Add more tracepoints
  fsi: master-gpio: Remove unused definitions
  fsi: master-gpio: Remove "GPIO" prefix on some definitions
  fsi: Don't use device_unregister() in fsi_master_register()
  fsi: master-gpio: Add missing release function
  fsi: Move various master definitions to a common header

Eddie James (1):
  fsi: sbefifo: Add missing mutex_unlock

Guenter Roeck (1):
  fsi/sbefifo: Add dependency on OF_ADDRESS

Joel Stanley (3):
  fsi: sbefifo: Fix sparse warnings
  fsi: master-hub: Fix sparse warnings
  fsi: core: Fix sparse warnings

 drivers/fsi/Kconfig|   1 +
 drivers/fsi/fsi-core.c | 215 +++--
 drivers/fsi/fsi-master-gpio.c  | 170 ---
 drivers/fsi/fsi-master-hub.c   |   5 +-
 drivers/fsi/fsi-master.h   |  35 +
 drivers/fsi/fsi-sbefifo.c  |  15 +-
 drivers/fsi/fsi-scom.c | 433 
+++---
 include/trace/events/fsi_master_gpio.h |  59 
 include/uapi/linux/fsi.h   |  58 
 9 files changed, 813 insertions(+), 178 deletions(-)
 create mode 100644 include/uapi/linux/fsi.h

[GIT PULL] FSI updates round 2 for 4.19

2018-07-11 Thread Benjamin Herrenschmidt

Hi Greg !

This is the second round of updates of the FSI stack.

These comprise of:

 - Some build fixes detected with COMPILE_TEST when
used on "other" archs

 - Sparse warning fixes

 - Some object lifetime fixes

 - A new feature to control some of the protocol delays

 - Overheaul of the "SCOM" driver used to access the POWER
processor internal SCOM bus.

 - Preparatory work for the new ColdFire coprocessor based FSI master
work which, along some other newer work is still going through the last
few rounds of polish and DT bindings reviews.

The following changes since commit 9f4a8a2d7f9d71093f41c4bb0ef8707e8145bad3:

  fsi/sbefifo: Add driver for the SBE FIFO (2018-06-12 14:05:39 +1000)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git 
tags/fsi-updates-2018-07-12

for you to fetch changes up to fea9cf321c916e9372874e6f2af1bf0b5beb89fb:

  fsi: Move various master definitions to a common header (2018-07-12 12:06:02 
+1000)

----
Benjamin Herrenschmidt (17):
  fsi: sbefifo: Remove unneeded semicolon
  fsi: scom: Add mutex around FSI2PIB accesses
  fsi: scom: Whitespace fixes
  fsi: scom: Fixup endian annotations
  fsi: scom: Add register definitions
  fsi: scom: Major overhaul
  fsi: sbefifo: Fix checker warning about late NULL check
  fsi: Move code around to avoid forward declaration
  fsi: Add mechanism to set the tSendDelay and tEchoDelay values
  fsi: master-gpio: Rename and adjust send delay
  fsi: master-gpio: Add support for link_config
  fsi: master-gpio: Add more tracepoints
  fsi: master-gpio: Remove unused definitions
  fsi: master-gpio: Remove "GPIO" prefix on some definitions
  fsi: Don't use device_unregister() in fsi_master_register()
  fsi: master-gpio: Add missing release function
  fsi: Move various master definitions to a common header

Eddie James (1):
  fsi: sbefifo: Add missing mutex_unlock

Guenter Roeck (1):
  fsi/sbefifo: Add dependency on OF_ADDRESS

Joel Stanley (3):
  fsi: sbefifo: Fix sparse warnings
  fsi: master-hub: Fix sparse warnings
  fsi: core: Fix sparse warnings

 drivers/fsi/Kconfig|   1 +
 drivers/fsi/fsi-core.c | 215 +++--
 drivers/fsi/fsi-master-gpio.c  | 170 ---
 drivers/fsi/fsi-master-hub.c   |   5 +-
 drivers/fsi/fsi-master.h   |  35 +
 drivers/fsi/fsi-sbefifo.c  |  15 +-
 drivers/fsi/fsi-scom.c | 433 
+++---
 include/trace/events/fsi_master_gpio.h |  59 
 include/uapi/linux/fsi.h   |  58 
 9 files changed, 813 insertions(+), 178 deletions(-)
 create mode 100644 include/uapi/linux/fsi.h

Re: [PATCH 06/14] fsi: master-gpio: Add more tracepoints

2018-07-11 Thread Benjamin Herrenschmidt

On Thu, 2018-06-28 at 13:41 +0930, Joel Stanley wrote:
> On 27 June 2018 at 08:55, Benjamin Herrenschmidt
>  wrote:
> > This adds a few more tracepoints that have proven useful when
> > debugging issues with the FSI bus.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> 
> Reviewed-by: Joel Stanley 
> 
> > ---
> >  drivers/fsi/fsi-master-gpio.c  | 16 ---
> >  include/trace/events/fsi_master_gpio.h | 59 ++
> >  2 files changed, 69 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/fsi/fsi-master-gpio.c b/drivers/fsi/fsi-master-gpio.c
> > index 48e0e65b2982..a00a85aa6d56 100644
> > --- a/drivers/fsi/fsi-master-gpio.c
> > +++ b/drivers/fsi/fsi-master-gpio.c
> > @@ -130,10 +130,17 @@ static void set_sda_output(struct fsi_master_gpio 
> > *master, int value)
> > 
> >  static void clock_zeros(struct fsi_master_gpio *master, int count)
> >  {
> > +   trace_fsi_master_gpio_clock_zeros(master, count);
> > set_sda_output(master, 1);
> > clock_toggle(master, count);
> >  }
> > 
> > +static void echo_delay(struct fsi_master_gpio *master)
> > +{
> > +   clock_zeros(master, master->t_echo_delay);
> > +}
> 
> This doesn't look like it belongs in this patch.
> 
> You've just moved it up. Not worth a re-spin.

I've done more, I made it use clock_zeros() instead of open coding it
in order to share the tracepoint.
> 
> > +
> > +
> >  static void serial_in(struct fsi_master_gpio *master, struct fsi_gpio_msg 
> > *msg,
> > uint8_t num_bits)
> >  {
> > @@ -279,16 +286,19 @@ static void build_ar_command(struct fsi_master_gpio 
> > *master,
> > addr_bits = 2;
> > opcode_bits = 2;
> > opcode = FSI_GPIO_CMD_SAME_AR;
> > +   trace_fsi_master_gpio_cmd_same_addr(master);
> > 
> > } else if (check_relative_address(master, id, addr, _addr)) {
> > /* 8 bits plus sign */
> > addr_bits = 9;
> > addr = rel_addr;
> > opcode = FSI_GPIO_CMD_REL_AR;
> > +   trace_fsi_master_gpio_cmd_rel_addr(master, rel_addr);
> > 
> > } else {
> > addr_bits = 21;
> > opcode = FSI_GPIO_CMD_ABS_AR;
> > +   trace_fsi_master_gpio_cmd_abs_addr(master, addr);
> > }
> > 
> > /*
> > @@ -337,12 +347,6 @@ static void build_epoll_command(struct fsi_gpio_msg 
> > *cmd, uint8_t slave_id)
> > msg_push_crc(cmd);
> >  }
> > 
> > -static void echo_delay(struct fsi_master_gpio *master)
> > -{
> > -   set_sda_output(master, 1);
> > -   clock_toggle(master, master->t_echo_delay);
> > -}
> > -
> >  static void build_term_command(struct fsi_gpio_msg *cmd, uint8_t slave_id)
> >  {
> > cmd->bits = 0;

Re: [PATCH 06/14] fsi: master-gpio: Add more tracepoints

2018-07-11 Thread Benjamin Herrenschmidt

On Thu, 2018-06-28 at 13:41 +0930, Joel Stanley wrote:
> On 27 June 2018 at 08:55, Benjamin Herrenschmidt
>  wrote:
> > This adds a few more tracepoints that have proven useful when
> > debugging issues with the FSI bus.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> 
> Reviewed-by: Joel Stanley 
> 
> > ---
> >  drivers/fsi/fsi-master-gpio.c  | 16 ---
> >  include/trace/events/fsi_master_gpio.h | 59 ++
> >  2 files changed, 69 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/fsi/fsi-master-gpio.c b/drivers/fsi/fsi-master-gpio.c
> > index 48e0e65b2982..a00a85aa6d56 100644
> > --- a/drivers/fsi/fsi-master-gpio.c
> > +++ b/drivers/fsi/fsi-master-gpio.c
> > @@ -130,10 +130,17 @@ static void set_sda_output(struct fsi_master_gpio 
> > *master, int value)
> > 
> >  static void clock_zeros(struct fsi_master_gpio *master, int count)
> >  {
> > +   trace_fsi_master_gpio_clock_zeros(master, count);
> > set_sda_output(master, 1);
> > clock_toggle(master, count);
> >  }
> > 
> > +static void echo_delay(struct fsi_master_gpio *master)
> > +{
> > +   clock_zeros(master, master->t_echo_delay);
> > +}
> 
> This doesn't look like it belongs in this patch.
> 
> You've just moved it up. Not worth a re-spin.

I've done more, I made it use clock_zeros() instead of open coding it
in order to share the tracepoint.
> 
> > +
> > +
> >  static void serial_in(struct fsi_master_gpio *master, struct fsi_gpio_msg 
> > *msg,
> > uint8_t num_bits)
> >  {
> > @@ -279,16 +286,19 @@ static void build_ar_command(struct fsi_master_gpio 
> > *master,
> > addr_bits = 2;
> > opcode_bits = 2;
> > opcode = FSI_GPIO_CMD_SAME_AR;
> > +   trace_fsi_master_gpio_cmd_same_addr(master);
> > 
> > } else if (check_relative_address(master, id, addr, _addr)) {
> > /* 8 bits plus sign */
> > addr_bits = 9;
> > addr = rel_addr;
> > opcode = FSI_GPIO_CMD_REL_AR;
> > +   trace_fsi_master_gpio_cmd_rel_addr(master, rel_addr);
> > 
> > } else {
> > addr_bits = 21;
> > opcode = FSI_GPIO_CMD_ABS_AR;
> > +   trace_fsi_master_gpio_cmd_abs_addr(master, addr);
> > }
> > 
> > /*
> > @@ -337,12 +347,6 @@ static void build_epoll_command(struct fsi_gpio_msg 
> > *cmd, uint8_t slave_id)
> > msg_push_crc(cmd);
> >  }
> > 
> > -static void echo_delay(struct fsi_master_gpio *master)
> > -{
> > -   set_sda_output(master, 1);
> > -   clock_toggle(master, master->t_echo_delay);
> > -}
> > -
> >  static void build_term_command(struct fsi_gpio_msg *cmd, uint8_t slave_id)
> >  {
> > cmd->bits = 0;

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-11 Thread Benjamin Herrenschmidt

On Wed, 2018-07-11 at 14:04 -0600, Rob Herring wrote:
> On Wed, Jul 11, 2018 at 03:01:19PM +0930, Andrew Jeffery wrote:
> > Baseboard Management Controllers (BMCs) are embedded SoCs that exist to
> > provide remote management of (primarily) server platforms. BMCs are
> > often tightly coupled to the platform in terms of behaviour and provide
> > many hardware features integral to booting and running the host system.
> > 
> > Some of these hardware features are simple, for example scratch
> > registers provided by the BMC that are exposed to both the host and the
> > BMC. In other cases there's a single bit switch to enable or disable
> > some of the provided functionality.
> > 
> > The documentation defines bindings for fields in registers that do not
> > integrate well into other driver models yet must be described to allow
> > the BMC kernel to assume control of these features.
> 
> So we'll get a new binding when that happens? That will break 
> compatibility.

What do you mean ? The binding provides a way to describe arbitrary
register fields and expose them to userspace. I'm not sure what you
mean by backward compatibility.

There is simply no way these things can be "abstracted" via some kind
of nice layered Linux subsystem of some sort. It's just a whole bunch
of knobs that control various things integral to the operation of the
host system. Andrew can provide a more precise list if you need to.

> 
> > 
> > Signed-off-by: Andrew Jeffery 
> > ---
> > 
> > Since RFC v1:
> > 
> > * Add a commit message
> > * Minor changes to documented labels
> > 
> >  .../bindings/misc/bmc-misc-ctrl.txt   | 252 ++
> >  MAINTAINERS   |   6 +
> >  2 files changed, 258 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt 
> > b/Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt
> > new file mode 100644
> > index ..2c869fcc7ef2
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt
> > @@ -0,0 +1,252 @@
> > +BMC Miscellaneous Control Interfaces
> > +
> > +
> > +Baseboard Management Controllers (BMCs) often have an array of hardware
> > +features that need to be described but are awkward to sensibly expose.
> > +
> > +This bindings document provides a generic mechanism for describing such
> > +features, covering read-only (RO), read-modify-write (RMW) and
> > +write-1-set/write-1-clear (W1SC) semantics.
> 
> If we wanted a generic mechanism for single register bits/fields in DT, 
> we'd have one already. A node per register bit doesn't scale.

Register *field*. I think you are looking at this from the wrong angle.
This isn't about exposing 236821643 SoC registers in an embedded
product. This is about exposing a dozen or so tunables that don't
control the SoC from the perspective of the OS running on it, but
controls how various features of the SoC are exposed to the *host*
system. Examples could be which of the SoC internal PCIe devices
exposed to the host is enabled (the SoC can appear as a GPU, a BMC misc
device or both or neither, it can have a DMA controller or not,
etc...). Other examples are scratch registers that need to be set to
system specific values by userspace, which the BIOS of the host will
read to determine some configuration information. That sort of thing.

There isn't that many, scalability isn't a problem. Both the list of
registers of interest and what needs to go in them is very much
system/vendor specific. This is a way for the kernel to act as a
"conduit" that doesn't need to know the specifics of a given
system/vendor/BIOS combination.

Your response smells like yet another case of trying to apply one of
those general "blanket" rules to something where it's completely
unapplicable.

> Maybe this should be modelled using GPIO binding? There's a line there 
> too as whether the signals are "general purpose" or not.

GPIOs are binary values right ? These are arbitrary fields. We want
arbitrary fields. This is really needed Rob, otherwise we'll NEVER have
BMC support upstream. 

The other option will be a dozen random ad-hoc char-devs that would
have to be updated every time a new bit needs to be added or changed.

Ben.

Re: [RFC PATCH v2 1/4] dt-bindings: misc: Add bindings for misc. BMC control fields

2018-07-11 Thread Benjamin Herrenschmidt

On Wed, 2018-07-11 at 14:04 -0600, Rob Herring wrote:
> On Wed, Jul 11, 2018 at 03:01:19PM +0930, Andrew Jeffery wrote:
> > Baseboard Management Controllers (BMCs) are embedded SoCs that exist to
> > provide remote management of (primarily) server platforms. BMCs are
> > often tightly coupled to the platform in terms of behaviour and provide
> > many hardware features integral to booting and running the host system.
> > 
> > Some of these hardware features are simple, for example scratch
> > registers provided by the BMC that are exposed to both the host and the
> > BMC. In other cases there's a single bit switch to enable or disable
> > some of the provided functionality.
> > 
> > The documentation defines bindings for fields in registers that do not
> > integrate well into other driver models yet must be described to allow
> > the BMC kernel to assume control of these features.
> 
> So we'll get a new binding when that happens? That will break 
> compatibility.

What do you mean ? The binding provides a way to describe arbitrary
register fields and expose them to userspace. I'm not sure what you
mean by backward compatibility.

There is simply no way these things can be "abstracted" via some kind
of nice layered Linux subsystem of some sort. It's just a whole bunch
of knobs that control various things integral to the operation of the
host system. Andrew can provide a more precise list if you need to.

> 
> > 
> > Signed-off-by: Andrew Jeffery 
> > ---
> > 
> > Since RFC v1:
> > 
> > * Add a commit message
> > * Minor changes to documented labels
> > 
> >  .../bindings/misc/bmc-misc-ctrl.txt   | 252 ++
> >  MAINTAINERS   |   6 +
> >  2 files changed, 258 insertions(+)
> >  create mode 100644 Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt 
> > b/Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt
> > new file mode 100644
> > index ..2c869fcc7ef2
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/misc/bmc-misc-ctrl.txt
> > @@ -0,0 +1,252 @@
> > +BMC Miscellaneous Control Interfaces
> > +
> > +
> > +Baseboard Management Controllers (BMCs) often have an array of hardware
> > +features that need to be described but are awkward to sensibly expose.
> > +
> > +This bindings document provides a generic mechanism for describing such
> > +features, covering read-only (RO), read-modify-write (RMW) and
> > +write-1-set/write-1-clear (W1SC) semantics.
> 
> If we wanted a generic mechanism for single register bits/fields in DT, 
> we'd have one already. A node per register bit doesn't scale.

Register *field*. I think you are looking at this from the wrong angle.
This isn't about exposing 236821643 SoC registers in an embedded
product. This is about exposing a dozen or so tunables that don't
control the SoC from the perspective of the OS running on it, but
controls how various features of the SoC are exposed to the *host*
system. Examples could be which of the SoC internal PCIe devices
exposed to the host is enabled (the SoC can appear as a GPU, a BMC misc
device or both or neither, it can have a DMA controller or not,
etc...). Other examples are scratch registers that need to be set to
system specific values by userspace, which the BIOS of the host will
read to determine some configuration information. That sort of thing.

There isn't that many, scalability isn't a problem. Both the list of
registers of interest and what needs to go in them is very much
system/vendor specific. This is a way for the kernel to act as a
"conduit" that doesn't need to know the specifics of a given
system/vendor/BIOS combination.

Your response smells like yet another case of trying to apply one of
those general "blanket" rules to something where it's completely
unapplicable.

> Maybe this should be modelled using GPIO binding? There's a line there 
> too as whether the signals are "general purpose" or not.

GPIOs are binary values right ? These are arbitrary fields. We want
arbitrary fields. This is really needed Rob, otherwise we'll NEVER have
BMC support upstream. 

The other option will be a dozen random ad-hoc char-devs that would
have to be updated every time a new bit needs to be added or changed.

Ben.

Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-10 Thread Benjamin Herrenschmidt

On Tue, 2018-07-10 at 16:55 -0700, Linus Torvalds wrote:
> On Tue, Jul 10, 2018 at 4:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > > I like that fix, which should make this patch obsolete, right?
> > 
> > Yes, for that specific issue, but Linus seemed to think patch 1 was the
> > "right thing to do" regardless...
> 
> I would definitely prefer either a kobject_get_unless_zero() or a
> warning if it is ever zero.
> 
> The fact that right now it silently can do known bad things, and then
> causes odd corruption _later_, is not good.

Maybe we should make the existing warning in refcount unconditional ? 

This is whe warning that allowed me to pull that string with the
gluedirs and fix what ended up very weird crashes caused by the
resulting use-after-free. So it's definitely valuable.

As for whether we should generalize kobject_get_unless_zero() vs avoid
using it alltogether, this is a debate for you to have with Greg ;-)

He seems to think nothing should ever try to get a zeroed object (which
I tend to agree with, it's close to my opinion that visibility and
lifetime should be disconnected).

That being said, there are existing constructs such as the "late
removal from sysfs from kobject_release" that mean that zero-reference
objects *can* still be visible, via either sysfs or ksets, as far as I
can tell.

So it's a bit of a mess... but if we chose to go Greg's way we should
probably put a WARN'ing in kobject_release() for late-removal from
sysfs since this exposes 0-ref objects to outside visibility.

Cheers,
Ben.

Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-10 Thread Benjamin Herrenschmidt

On Tue, 2018-07-10 at 16:55 -0700, Linus Torvalds wrote:
> On Tue, Jul 10, 2018 at 4:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > > I like that fix, which should make this patch obsolete, right?
> > 
> > Yes, for that specific issue, but Linus seemed to think patch 1 was the
> > "right thing to do" regardless...
> 
> I would definitely prefer either a kobject_get_unless_zero() or a
> warning if it is ever zero.
> 
> The fact that right now it silently can do known bad things, and then
> causes odd corruption _later_, is not good.

Maybe we should make the existing warning in refcount unconditional ? 

This is whe warning that allowed me to pull that string with the
gluedirs and fix what ended up very weird crashes caused by the
resulting use-after-free. So it's definitely valuable.

As for whether we should generalize kobject_get_unless_zero() vs avoid
using it alltogether, this is a debate for you to have with Greg ;-)

He seems to think nothing should ever try to get a zeroed object (which
I tend to agree with, it's close to my opinion that visibility and
lifetime should be disconnected).

That being said, there are existing constructs such as the "late
removal from sysfs from kobject_release" that mean that zero-reference
objects *can* still be visible, via either sysfs or ksets, as far as I
can tell.

So it's a bit of a mess... but if we chose to go Greg's way we should
probably put a WARN'ing in kobject_release() for late-removal from
sysfs since this exposes 0-ref objects to outside visibility.

Cheers,
Ben.

Re: [PATCH v11 5/8] i2c: fsi: Add transfer implementation

2018-07-10 Thread Benjamin Herrenschmidt

On Tue, 2018-07-10 at 12:52 -0500, Eddie James wrote:
> 
> On 07/09/2018 05:41 PM, Wolfram Sang wrote:
> > > + cmd |= FIELD_PREP(I2C_CMD_ADDR, msg->addr >> 1);
> > 
> > I just noticed this and wonder: Don't you need the LSB of the address?
> > It is not the RW flag, this is encoded in msg->flags.
> 
> So, the hardware interprets the LSB as the RW flag. It wouldn't be 
> possible to have a device addressed with the LSB set on this I2C master.

What do you mean ? That doesn't sound right... 
> 
> > 
> > Also, no seperate handling for 10 bit addresses? Technically, 7-bit 0x50
> > is different on the wire from 10-bit 0x050. This is minor, though. There
> > are no 10-bit devices out there. Still, did you test 10-bit support?
> 
> Indeed, real 10-bit addresses require some additional manipulation of 
> this I2C master in order to work. We don't support it right now.
> 
> Thanks,
> Eddie
> 
> > 
> > Rest looks good.
> >

Re: [PATCH v11 5/8] i2c: fsi: Add transfer implementation

2018-07-10 Thread Benjamin Herrenschmidt

On Tue, 2018-07-10 at 12:52 -0500, Eddie James wrote:
> 
> On 07/09/2018 05:41 PM, Wolfram Sang wrote:
> > > + cmd |= FIELD_PREP(I2C_CMD_ADDR, msg->addr >> 1);
> > 
> > I just noticed this and wonder: Don't you need the LSB of the address?
> > It is not the RW flag, this is encoded in msg->flags.
> 
> So, the hardware interprets the LSB as the RW flag. It wouldn't be 
> possible to have a device addressed with the LSB set on this I2C master.

What do you mean ? That doesn't sound right... 
> 
> > 
> > Also, no seperate handling for 10 bit addresses? Technically, 7-bit 0x50
> > is different on the wire from 10-bit 0x050. This is minor, though. There
> > are no 10-bit devices out there. Still, did you test 10-bit support?
> 
> Indeed, real 10-bit addresses require some additional manipulation of 
> this I2C master in order to work. We don't support it right now.
> 
> Thanks,
> Eddie
> 
> > 
> > Rest looks good.
> >

Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-10 Thread Benjamin Herrenschmidt

On Tue, 2018-07-10 at 16:55 +0200, Greg Kroah-Hartman wrote:
> On Tue, Jul 10, 2018 at 09:44:33AM +1000, Benjamin Herrenschmidt wrote:
> > On Sat, 2018-07-07 at 18:48 +0200, Greg Kroah-Hartman wrote:
> > > No, kobject_get() should never happen on a 0 refcount object.  That
> > > being said, the code does allow it, so if things are messed up, it will
> > > happen.  I think that change happened when the switch to refcount_t
> > > occured, before then we would WARN_ON() if that ever happened.  I should
> > > go fix that up, and restore that old behavior, so that syzbot starts
> > > complaining loudly when stuff like that hits.
> > > 
> > > So I hate using kobject_get_unless_zero(), and resisted ever adding it
> > > to the tree as it shows a bad locking/tree situation as you point out
> > > here.  But for some reason, the block developers seemed to insist they
> > > needed it, and so it is in the tree for them.  I don't want it to spread
> > > if at all possible, which makes me want to reject this patch as this
> > > should be "a case that can never be hit".
> > 
> > Except it can in that situation... at least unless you get my patch 2/2
> > (or the newer one I'm about to send that avoids adding a child counter
> > and uses the one in kernfs instead).
> 
> I like that fix, which should make this patch obsolete, right?

Yes, for that specific issue, but Linus seemed to think patch 1 was the
"right thing to do" regardless...

I suggest you read the backlog of thread if you are interested in the
ins and outs of his position, we had a rather extensive discussion on
this stuff.

Cheers,
Ben.

Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-10 Thread Benjamin Herrenschmidt

On Tue, 2018-07-10 at 16:55 +0200, Greg Kroah-Hartman wrote:
> On Tue, Jul 10, 2018 at 09:44:33AM +1000, Benjamin Herrenschmidt wrote:
> > On Sat, 2018-07-07 at 18:48 +0200, Greg Kroah-Hartman wrote:
> > > No, kobject_get() should never happen on a 0 refcount object.  That
> > > being said, the code does allow it, so if things are messed up, it will
> > > happen.  I think that change happened when the switch to refcount_t
> > > occured, before then we would WARN_ON() if that ever happened.  I should
> > > go fix that up, and restore that old behavior, so that syzbot starts
> > > complaining loudly when stuff like that hits.
> > > 
> > > So I hate using kobject_get_unless_zero(), and resisted ever adding it
> > > to the tree as it shows a bad locking/tree situation as you point out
> > > here.  But for some reason, the block developers seemed to insist they
> > > needed it, and so it is in the tree for them.  I don't want it to spread
> > > if at all possible, which makes me want to reject this patch as this
> > > should be "a case that can never be hit".
> > 
> > Except it can in that situation... at least unless you get my patch 2/2
> > (or the newer one I'm about to send that avoids adding a child counter
> > and uses the one in kernfs instead).
> 
> I like that fix, which should make this patch obsolete, right?

Yes, for that specific issue, but Linus seemed to think patch 1 was the
"right thing to do" regardless...

I suggest you read the backlog of thread if you are interested in the
ins and outs of his position, we had a rather extensive discussion on
this stuff.

Cheers,
Ben.

Re: [PATCH v2 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-10 Thread Benjamin Herrenschmidt

On Tue, 2018-07-10 at 16:55 +0200, Greg Kroah-Hartman wrote:
> 
> > +/**
> > + * kobject_has_children - Returns whether a kobject has children.
> > + * @kobj: the object to test
> > + *
> > + * This will return whether a kobject has other kobjects as children.
> > + *
> > + * It does NOT account for the presence of attribute files, only sub
> > + * directories. It also assumes there is no concurrent addition or
> > + * removal of such children, and thus relies on external locking.
> > + */
> > +static inline bool kobject_has_children(struct kobject *kobj)
> > +{
> > +   WARN_ON_ONCE(kref_read(>kref) == 0);
> 
> Why warn on?  Who is going to hit this and how are you going to fix up
> the syzbot reports?  :)

Well, that's it, the hope is nobody ever hits it ... but if one does it
would be useful to get a backtrace to figure it out. You can shoot the
reports my way I suppose :-)

> Anyway, this looks good, I can just take this and not the 1/2 patch now,
> right?  I really didn't like that patch.

Yes, it will fix the practical problem. As for patch 1, it's rather
funny, you and Linus seem to have a completely opposite idea of how
this stuff should work :-)

Cheers,
Ben.

> thanks,
> 
> greg k-h

Re: [PATCH v2 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-10 Thread Benjamin Herrenschmidt

On Tue, 2018-07-10 at 16:55 +0200, Greg Kroah-Hartman wrote:
> 
> > +/**
> > + * kobject_has_children - Returns whether a kobject has children.
> > + * @kobj: the object to test
> > + *
> > + * This will return whether a kobject has other kobjects as children.
> > + *
> > + * It does NOT account for the presence of attribute files, only sub
> > + * directories. It also assumes there is no concurrent addition or
> > + * removal of such children, and thus relies on external locking.
> > + */
> > +static inline bool kobject_has_children(struct kobject *kobj)
> > +{
> > +   WARN_ON_ONCE(kref_read(>kref) == 0);
> 
> Why warn on?  Who is going to hit this and how are you going to fix up
> the syzbot reports?  :)

Well, that's it, the hope is nobody ever hits it ... but if one does it
would be useful to get a backtrace to figure it out. You can shoot the
reports my way I suppose :-)

> Anyway, this looks good, I can just take this and not the 1/2 patch now,
> right?  I really didn't like that patch.

Yes, it will fix the practical problem. As for patch 1, it's rather
funny, you and Linus seem to have a completely opposite idea of how
this stuff should work :-)

Cheers,
Ben.

> thanks,
> 
> greg k-h

Re: [PATCH v2 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-09 Thread Benjamin Herrenschmidt

On Mon, 2018-07-09 at 17:33 -0700, Linus Torvalds wrote:
> On Mon, Jul 9, 2018 at 5:29 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > For devices with a class, we create a "glue" directory between
> > the parent device and the new device with the class name.
> > 
> > This directory is never "explicitely" removed when empty however,
> 
> explicitly
> 
> Is the mis-spelling why you had the quotes? I do find that spelling in
> the kernel, but not in drivers/base/.

No idea :-) Just my poor english, I'm not sure why I put quotes, I
think I meant *explictly* as enphasis, not sure.

> > This fixes it by instead doing an explicit kobject_del() when
> > the glue dir is empty, by keeping track of the number of
> > child devices of the gluedir.
> 
> Ack. This looks good to me.
> 
> I didn't see your 1/2 - you should probably re-send that one too so
> that Greg doesn't have to fish for it. But I'll Ack that one too in
> this same email regardless.

OK, I didn't re-send it. Greg just nak'ed it though :-)

Cheers,
Ben.

Re: [PATCH v2 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-09 Thread Benjamin Herrenschmidt

On Mon, 2018-07-09 at 17:33 -0700, Linus Torvalds wrote:
> On Mon, Jul 9, 2018 at 5:29 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > For devices with a class, we create a "glue" directory between
> > the parent device and the new device with the class name.
> > 
> > This directory is never "explicitely" removed when empty however,
> 
> explicitly
> 
> Is the mis-spelling why you had the quotes? I do find that spelling in
> the kernel, but not in drivers/base/.

No idea :-) Just my poor english, I'm not sure why I put quotes, I
think I meant *explictly* as enphasis, not sure.

> > This fixes it by instead doing an explicit kobject_del() when
> > the glue dir is empty, by keeping track of the number of
> > child devices of the gluedir.
> 
> Ack. This looks good to me.
> 
> I didn't see your 1/2 - you should probably re-send that one too so
> that Greg doesn't have to fish for it. But I'll Ack that one too in
> this same email regardless.

OK, I didn't re-send it. Greg just nak'ed it though :-)

Cheers,
Ben.

Re: [PATCH] gpio: aspeed: fix compile testing warning

2018-07-09 Thread Benjamin Herrenschmidt

On Mon, 2018-07-09 at 16:56 +0200, Arnd Bergmann wrote:
> Gcc cannot always see that BUG_ON(1) is guaranteed to not
> return, so we get a warning message in some configurations:
> 
> drivers/gpio/gpio-aspeed.c: In function 'bank_reg':
> drivers/gpio/gpio-aspeed.c:244:1: error: control reaches end of non-void 
> function [-Werror=return-type]
> 
> Using a plain BUG() is easier here and avoids the problem.
> 
> Fixes: 44ddf559d579 ("gpio: aspeed: Rework register type accessors")
> Signed-off-by: Arnd Bergmann 

Acked-by: Benjamin Herrenschmidt 

Linus, can you plonk that on top of the patches in that topic branch
you created ?

> ---
>  drivers/gpio/gpio-aspeed.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpio/gpio-aspeed.c b/drivers/gpio/gpio-aspeed.c
> index 1e00f4045f9d..2342e154029b 100644
> --- a/drivers/gpio/gpio-aspeed.c
> +++ b/drivers/gpio/gpio-aspeed.c
> @@ -240,7 +240,7 @@ static inline void __iomem *bank_reg(struct aspeed_gpio 
> *gpio,
>   case reg_cmdsrc1:
>   return gpio->base + bank->cmdsrc_regs + GPIO_CMDSRC_1;
>   }
> - BUG_ON(1);
> + BUG();
>  }
>  
>  #define GPIO_BANK(x) ((x) >> 5)

Re: [PATCH] gpio: aspeed: fix compile testing warning

2018-07-09 Thread Benjamin Herrenschmidt

On Mon, 2018-07-09 at 16:56 +0200, Arnd Bergmann wrote:
> Gcc cannot always see that BUG_ON(1) is guaranteed to not
> return, so we get a warning message in some configurations:
> 
> drivers/gpio/gpio-aspeed.c: In function 'bank_reg':
> drivers/gpio/gpio-aspeed.c:244:1: error: control reaches end of non-void 
> function [-Werror=return-type]
> 
> Using a plain BUG() is easier here and avoids the problem.
> 
> Fixes: 44ddf559d579 ("gpio: aspeed: Rework register type accessors")
> Signed-off-by: Arnd Bergmann 

Acked-by: Benjamin Herrenschmidt 

Linus, can you plonk that on top of the patches in that topic branch
you created ?

> ---
>  drivers/gpio/gpio-aspeed.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpio/gpio-aspeed.c b/drivers/gpio/gpio-aspeed.c
> index 1e00f4045f9d..2342e154029b 100644
> --- a/drivers/gpio/gpio-aspeed.c
> +++ b/drivers/gpio/gpio-aspeed.c
> @@ -240,7 +240,7 @@ static inline void __iomem *bank_reg(struct aspeed_gpio 
> *gpio,
>   case reg_cmdsrc1:
>   return gpio->base + bank->cmdsrc_regs + GPIO_CMDSRC_1;
>   }
> - BUG_ON(1);
> + BUG();
>  }
>  
>  #define GPIO_BANK(x) ((x) >> 5)

[PATCH v2 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-09 Thread Benjamin Herrenschmidt

For devices with a class, we create a "glue" directory between
the parent device and the new device with the class name.

This directory is never "explicitely" removed when empty however,
this is left to the implicit sysfs removal done by kobject_release()
when the object loses its last reference via kobject_put().

This is problematic because as long as it's not been removed from
sysfs, it is still present in the class kset and in sysfs directory
structure.

The presence in the class kset exposes a use after free bug fixed
by the previous patch, but the presence in sysfs means that until
the kobject is released, which can take a while (especially with
kobject debugging), any attempt at re-creating such as binding a
new device for that class/parent pair, will result in a sysfs
duplicate file name error.

This fixes it by instead doing an explicit kobject_del() when
the glue dir is empty, by keeping track of the number of
child devices of the gluedir.

This is made easy by the fact that all glue dir operations are
done with a global mutex, and there's already a function
(cleanup_glue_dir) called in all the right places taking that
mutex that can be enhanced for this. It appears that this was
in fact the intent of the function, but the implementation was
wrong.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/base/core.c |  2 ++
 include/linux/kobject.h | 17 +
 2 files changed, 19 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index e9eff2099896..93c0f8d1a447 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1572,6 +1572,8 @@ static void cleanup_glue_dir(struct device *dev, struct 
kobject *glue_dir)
return;
 
mutex_lock(_mutex);
+   if (!kobject_has_children(glue_dir))
+   kobject_del(glue_dir);
kobject_put(glue_dir);
mutex_unlock(_mutex);
 }
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..270b40515e79 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -116,6 +116,23 @@ extern void kobject_put(struct kobject *kobj);
 extern const void *kobject_namespace(struct kobject *kobj);
 extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
 
+/**
+ * kobject_has_children - Returns whether a kobject has children.
+ * @kobj: the object to test
+ *
+ * This will return whether a kobject has other kobjects as children.
+ *
+ * It does NOT account for the presence of attribute files, only sub
+ * directories. It also assumes there is no concurrent addition or
+ * removal of such children, and thus relies on external locking.
+ */
+static inline bool kobject_has_children(struct kobject *kobj)
+{
+   WARN_ON_ONCE(kref_read(>kref) == 0);
+
+   return kobj->sd && kobj->sd->dir.subdirs;
+}
+
 struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;

[PATCH v2 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-09 Thread Benjamin Herrenschmidt

For devices with a class, we create a "glue" directory between
the parent device and the new device with the class name.

This directory is never "explicitely" removed when empty however,
this is left to the implicit sysfs removal done by kobject_release()
when the object loses its last reference via kobject_put().

This is problematic because as long as it's not been removed from
sysfs, it is still present in the class kset and in sysfs directory
structure.

The presence in the class kset exposes a use after free bug fixed
by the previous patch, but the presence in sysfs means that until
the kobject is released, which can take a while (especially with
kobject debugging), any attempt at re-creating such as binding a
new device for that class/parent pair, will result in a sysfs
duplicate file name error.

This fixes it by instead doing an explicit kobject_del() when
the glue dir is empty, by keeping track of the number of
child devices of the gluedir.

This is made easy by the fact that all glue dir operations are
done with a global mutex, and there's already a function
(cleanup_glue_dir) called in all the right places taking that
mutex that can be enhanced for this. It appears that this was
in fact the intent of the function, but the implementation was
wrong.

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/base/core.c |  2 ++
 include/linux/kobject.h | 17 +
 2 files changed, 19 insertions(+)

diff --git a/drivers/base/core.c b/drivers/base/core.c
index e9eff2099896..93c0f8d1a447 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1572,6 +1572,8 @@ static void cleanup_glue_dir(struct device *dev, struct 
kobject *glue_dir)
return;
 
mutex_lock(_mutex);
+   if (!kobject_has_children(glue_dir))
+   kobject_del(glue_dir);
kobject_put(glue_dir);
mutex_unlock(_mutex);
 }
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..270b40515e79 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -116,6 +116,23 @@ extern void kobject_put(struct kobject *kobj);
 extern const void *kobject_namespace(struct kobject *kobj);
 extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
 
+/**
+ * kobject_has_children - Returns whether a kobject has children.
+ * @kobj: the object to test
+ *
+ * This will return whether a kobject has other kobjects as children.
+ *
+ * It does NOT account for the presence of attribute files, only sub
+ * directories. It also assumes there is no concurrent addition or
+ * removal of such children, and thus relies on external locking.
+ */
+static inline bool kobject_has_children(struct kobject *kobj)
+{
+   WARN_ON_ONCE(kref_read(>kref) == 0);
+
+   return kobj->sd && kobj->sd->dir.subdirs;
+}
+
 struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;

Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-09 Thread Benjamin Herrenschmidt

On Sat, 2018-07-07 at 18:51 +0200, Greg Kroah-Hartman wrote:
> 
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index b610816eb887..e9eff2099896 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -1517,11 +1517,13 @@ static struct kobject *get_device_parent(struct 
> > device *dev,
> >  
> > /* find our class-directory at the parent and reference it */
> > spin_lock(>class->p->glue_dirs.list_lock);
> > -   list_for_each_entry(k, >class->p->glue_dirs.list, entry)
> > +   list_for_each_entry(k, >class->p->glue_dirs.list, entry) {
> > if (k->parent == parent_kobj) {
> > -   kobj = kobject_get(k);
> > -   break;
> > +   kobj = kobject_get_unless_zero(k);
> > +   if (kobj)
> > +   break;
> 
> A parent directory _should_ not ever be able to be removed before the
> object being removed was, as we should have had a reference to it,
> right?  So I don't see how this can get hit "in real life".
> 
> Yes, enabling kobject debugging does keep objects around for a long time
> in order to try to help figure out where people are messing up their
> usage of them.  What subsystem is doing this in a way that causes
> problems here?  Shouldn't we fix that up instead?

The broken subsystem is the driver core itself :-) See the descriptions
here and in patch 2/2.

Note: This is a more generic problem with ksets vs relying on the magic
sysfs cleanup happening in kobject_release().

Any kobject that is a member of a kset and doesn't get explicitely
removed from sysfs with kobject_del() prior to dropping the last
reference with kobject_put() (and thus relies instead on the automatic
cleanup done by kobject_release()) will be exposed to the race:

The last kobject_put() will drop the refcount to 0 while the object is
still in the kset. Only some amount of time later (which can be very
short or very long if you enable kobject debugging), will
kobject_release() take it out of sysfs and out of the kset.

Thus the object will be visible, with a 0 refcount, to anything that
"walks" the kset during that period.

This is exactly what happens with the gluedirs in the device core, but
it could happen elsewhere for all I know.

Cheers,
Ben.

Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-09 Thread Benjamin Herrenschmidt

On Sat, 2018-07-07 at 18:51 +0200, Greg Kroah-Hartman wrote:
> 
> > diff --git a/drivers/base/core.c b/drivers/base/core.c
> > index b610816eb887..e9eff2099896 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -1517,11 +1517,13 @@ static struct kobject *get_device_parent(struct 
> > device *dev,
> >  
> > /* find our class-directory at the parent and reference it */
> > spin_lock(>class->p->glue_dirs.list_lock);
> > -   list_for_each_entry(k, >class->p->glue_dirs.list, entry)
> > +   list_for_each_entry(k, >class->p->glue_dirs.list, entry) {
> > if (k->parent == parent_kobj) {
> > -   kobj = kobject_get(k);
> > -   break;
> > +   kobj = kobject_get_unless_zero(k);
> > +   if (kobj)
> > +   break;
> 
> A parent directory _should_ not ever be able to be removed before the
> object being removed was, as we should have had a reference to it,
> right?  So I don't see how this can get hit "in real life".
> 
> Yes, enabling kobject debugging does keep objects around for a long time
> in order to try to help figure out where people are messing up their
> usage of them.  What subsystem is doing this in a way that causes
> problems here?  Shouldn't we fix that up instead?

The broken subsystem is the driver core itself :-) See the descriptions
here and in patch 2/2.

Note: This is a more generic problem with ksets vs relying on the magic
sysfs cleanup happening in kobject_release().

Any kobject that is a member of a kset and doesn't get explicitely
removed from sysfs with kobject_del() prior to dropping the last
reference with kobject_put() (and thus relies instead on the automatic
cleanup done by kobject_release()) will be exposed to the race:

The last kobject_put() will drop the refcount to 0 while the object is
still in the kset. Only some amount of time later (which can be very
short or very long if you enable kobject debugging), will
kobject_release() take it out of sysfs and out of the kset.

Thus the object will be visible, with a 0 refcount, to anything that
"walks" the kset during that period.

This is exactly what happens with the gluedirs in the device core, but
it could happen elsewhere for all I know.

Cheers,
Ben.

Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-09 Thread Benjamin Herrenschmidt

On Sat, 2018-07-07 at 18:48 +0200, Greg Kroah-Hartman wrote:
> No, kobject_get() should never happen on a 0 refcount object.  That
> being said, the code does allow it, so if things are messed up, it will
> happen.  I think that change happened when the switch to refcount_t
> occured, before then we would WARN_ON() if that ever happened.  I should
> go fix that up, and restore that old behavior, so that syzbot starts
> complaining loudly when stuff like that hits.
> 
> So I hate using kobject_get_unless_zero(), and resisted ever adding it
> to the tree as it shows a bad locking/tree situation as you point out
> here.  But for some reason, the block developers seemed to insist they
> needed it, and so it is in the tree for them.  I don't want it to spread
> if at all possible, which makes me want to reject this patch as this
> should be "a case that can never be hit".

Except it can in that situation... at least unless you get my patch 2/2
(or the newer one I'm about to send that avoids adding a child counter
and uses the one in kernfs instead).

Ben.

Re: [PATCH 1/2] drivers: core: Don't try to use a dead glue_dir

2018-07-09 Thread Benjamin Herrenschmidt

On Sat, 2018-07-07 at 18:48 +0200, Greg Kroah-Hartman wrote:
> No, kobject_get() should never happen on a 0 refcount object.  That
> being said, the code does allow it, so if things are messed up, it will
> happen.  I think that change happened when the switch to refcount_t
> occured, before then we would WARN_ON() if that ever happened.  I should
> go fix that up, and restore that old behavior, so that syzbot starts
> complaining loudly when stuff like that hits.
> 
> So I hate using kobject_get_unless_zero(), and resisted ever adding it
> to the tree as it shows a bad locking/tree situation as you point out
> here.  But for some reason, the block developers seemed to insist they
> needed it, and so it is in the tree for them.  I don't want it to spread
> if at all possible, which makes me want to reject this patch as this
> should be "a case that can never be hit".

Except it can in that situation... at least unless you get my patch 2/2
(or the newer one I'm about to send that avoids adding a child counter
and uses the one in kernfs instead).

Ben.

Re: [PATCH 11/14] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-06 Thread Benjamin Herrenschmidt

On Thu, 2018-07-05 at 10:08 -0600, Rob Herring wrote:
> 
> > It's not really a SOC block from a vendor, it's a pseudo-device in a
> > way. The current one that doesn't use the coldfire offload is just
> > compatible "fsi-master-gpio".
> > 
> > I can add a vendor but what should it be ? aspeed because it runs on
> > the aspeed SoCs only ? ibm because we wrote it and FSI is an IBM
> > protocol ?
> 
> I would say aspeed as it is tied to their chip.
> 
> > 
> > - doesn't make sense here though.
> 
> But you do already have  in the compatible, but in a slightly
> different form and position. And "cf" is the block.
>
> So I'd propose: aspeed,ast2500-cf-fsi-master

Ok, I'll do that.

> > 
> > > > +
> > > > + - clock-gpios = ;: GPIO for FSI clock
> > > > + - data-gpios = ; : GPIO for FSI data signal
> > > > + - enable-gpios = ;   : GPIO for enable signal
> > > > + - trans-gpios = ;: GPIO for voltage 
> > > > translator enable
> > > > + - mux-gpios = ;  : GPIO for pin multiplexing with 
> > > > other
> > > 
> > > So the gpio info is pased to the CF? Otherwise, what's the point of
> > > having these in DT?
> > 
> > In the original version you are looking at, they are not passed to the
> > CF per-se but the driver does use aspeed GPIO specific APIs to
> > configure them to be owned by the CF, so we need the references.
> 
> Okay.
> 
> > However, I've just reworked the ucode with a few tricks to avoid losing
> > singificant performance, so that we can indeed pass them to the CF,
> > thus avoiding the need for a per-system image, so the above are here to
> > stay.
> > 
> > > > +  functions (eg, external FSI 
> > > > masters)
> > > > + - memory-region = ;  : Reference to the reserved 
> > > > memory for
> > > > +  the ColdFire. Must be 2M 
> > > > aligned on
> > > > + AST2400 and 1M aligned on AST2500
> > > > + - sram = ;   : Reference to the SRAM 
> > > > node.
> > > > + - cvic = ;   : Reference to the CVIC 
> > > > node.
> > > 
> > > Vendor prefixes.
> > 
> > On what ? Why would an "sram" pointer have a vendor prefix ? Or a
> > memory region pointer ?
> 
> memory-region is a standard property. sram and cvic are not, so should
> have vendor prefixes. However, perhaps we should add a common "sram"
> property to sram/sram.txt.

Hrm... originally vendor prefix on properties were for things that
didn't have a binding afaik. IE a way for an f-code driver to stash
things in the DT that were vendor specific and retrieve them from the
OS driver for example.

Here with well defined bindings it's rather bloaty don't you think ? I
don't strongly object to doing it, it's just a bit ... odd.

Cheers,
Ben.

Re: [PATCH 11/14] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-06 Thread Benjamin Herrenschmidt

On Thu, 2018-07-05 at 10:08 -0600, Rob Herring wrote:
> 
> > It's not really a SOC block from a vendor, it's a pseudo-device in a
> > way. The current one that doesn't use the coldfire offload is just
> > compatible "fsi-master-gpio".
> > 
> > I can add a vendor but what should it be ? aspeed because it runs on
> > the aspeed SoCs only ? ibm because we wrote it and FSI is an IBM
> > protocol ?
> 
> I would say aspeed as it is tied to their chip.
> 
> > 
> > - doesn't make sense here though.
> 
> But you do already have  in the compatible, but in a slightly
> different form and position. And "cf" is the block.
>
> So I'd propose: aspeed,ast2500-cf-fsi-master

Ok, I'll do that.

> > 
> > > > +
> > > > + - clock-gpios = ;: GPIO for FSI clock
> > > > + - data-gpios = ; : GPIO for FSI data signal
> > > > + - enable-gpios = ;   : GPIO for enable signal
> > > > + - trans-gpios = ;: GPIO for voltage 
> > > > translator enable
> > > > + - mux-gpios = ;  : GPIO for pin multiplexing with 
> > > > other
> > > 
> > > So the gpio info is pased to the CF? Otherwise, what's the point of
> > > having these in DT?
> > 
> > In the original version you are looking at, they are not passed to the
> > CF per-se but the driver does use aspeed GPIO specific APIs to
> > configure them to be owned by the CF, so we need the references.
> 
> Okay.
> 
> > However, I've just reworked the ucode with a few tricks to avoid losing
> > singificant performance, so that we can indeed pass them to the CF,
> > thus avoiding the need for a per-system image, so the above are here to
> > stay.
> > 
> > > > +  functions (eg, external FSI 
> > > > masters)
> > > > + - memory-region = ;  : Reference to the reserved 
> > > > memory for
> > > > +  the ColdFire. Must be 2M 
> > > > aligned on
> > > > + AST2400 and 1M aligned on AST2500
> > > > + - sram = ;   : Reference to the SRAM 
> > > > node.
> > > > + - cvic = ;   : Reference to the CVIC 
> > > > node.
> > > 
> > > Vendor prefixes.
> > 
> > On what ? Why would an "sram" pointer have a vendor prefix ? Or a
> > memory region pointer ?
> 
> memory-region is a standard property. sram and cvic are not, so should
> have vendor prefixes. However, perhaps we should add a common "sram"
> property to sram/sram.txt.

Hrm... originally vendor prefix on properties were for things that
didn't have a binding afaik. IE a way for an f-code driver to stash
things in the DT that were vendor specific and retrieve them from the
OS driver for example.

Here with well defined bindings it's rather bloaty don't you think ? I
don't strongly object to doing it, it's just a bit ... odd.

Cheers,
Ben.

Re: [PATCH 11/14] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 16:30 -0600, Rob Herring wrote:
> On Wed, Jun 27, 2018 at 09:26:02AM +1000, Benjamin Herrenschmidt wrote:
> > This isn't per-se a real device, it's a pseudo-device that
> > represents the use of the Aspeed built-in ColdFire to
> > implement the FSI protocol by bitbanging the GPIOs instead
> > of doing it from the ARM core.
> > 
> > Thus it's a drop-in replacement for the existing
> > fsi-master-gpio pseudo-device for use on systems for which
> > a corresponding firmware file exists. It has most of the
> > same properties, plus some more needed to operate the
> > coprocessor.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  .../bindings/fsi/fsi-master-ast-cf.txt| 36 +++
> >  1 file changed, 36 insertions(+)
> >  create mode 100644 
> > Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt 
> > b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > new file mode 100644
> > index ..50913ae685cc
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > @@ -0,0 +1,36 @@
> > +Device-tree bindings for ColdFire offloaded gpio-based FSI master driver
> > +
> > +
> > +Required properties:
> > + - compatible =
> > + "fsi-master-ast-2400-cf" for an AST2400 based system
> > +   or
> > + "fsi-master-ast-2500-cf" for an AST2500 based system
> 
> ,-

It's not really a SOC block from a vendor, it's a pseudo-device in a
way. The current one that doesn't use the coldfire offload is just
compatible "fsi-master-gpio".

I can add a vendor but what should it be ? aspeed because it runs on
the aspeed SoCs only ? ibm because we wrote it and FSI is an IBM
protocol ?

- doesn't make sense here though.

> > +
> > + - clock-gpios = ;: GPIO for FSI clock
> > + - data-gpios = ; : GPIO for FSI data signal
> > + - enable-gpios = ;   : GPIO for enable signal
> > + - trans-gpios = ;: GPIO for voltage translator 
> > enable
> > + - mux-gpios = ;  : GPIO for pin multiplexing with other
> 
> So the gpio info is pased to the CF? Otherwise, what's the point of 
> having these in DT?

In the original version you are looking at, they are not passed to the
CF per-se but the driver does use aspeed GPIO specific APIs to
configure them to be owned by the CF, so we need the references.

However, I've just reworked the ucode with a few tricks to avoid losing
singificant performance, so that we can indeed pass them to the CF,
thus avoiding the need for a per-system image, so the above are here to
stay.

> > +  functions (eg, external FSI 
> > masters)
> > + - memory-region = ;  : Reference to the reserved 
> > memory for
> > +  the ColdFire. Must be 2M aligned 
> > on
> > + AST2400 and 1M aligned on AST2500
> > + - sram = ;   : Reference to the SRAM node.
> > + - cvic = ;   : Reference to the CVIC node.
> 
> Vendor prefixes.

On what ? Why would an "sram" pointer have a vendor prefix ? Or a
memory region pointer ?

> > + - fw-name = ; : The name used to construct 
> > the firmware
> > + file name (cf-fsi-.bin)
> 
> firmware-name is used in some other places (and is the full filename).

It's gone in the latest version as there's a single FW file now for all
systems.
> 
> > + - fw-platform-sig = ;  : A signature value that must 
> > match the one
> > + contained in the firmware 
> > image. Known
> > + values are listed in the firmware 
> > interface
> > + file cf-fsi-fw.h
> 
> Vendor prefix unless you think this could be common.

It's going away.

> > +Examples:
> > +
> > +fsi-master {
> > +compatible = "fsi-master-gpio", "fsi-master";
> 
> Forget to update the example?

Indeed :)

> > +clock-gpios = < 0>;
> > +data-gpios = < 1>;
> > +enable-gpios = < 2>;
> > +trans-gpios = < 3>;
> > +mux-gpios = < 4>;
> > +}
> > -- 
> > 2.17.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe devicetree" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 11/14] dt-bindings: fsi: Document binding for the fsi-master-ast-cf "device"

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 16:30 -0600, Rob Herring wrote:
> On Wed, Jun 27, 2018 at 09:26:02AM +1000, Benjamin Herrenschmidt wrote:
> > This isn't per-se a real device, it's a pseudo-device that
> > represents the use of the Aspeed built-in ColdFire to
> > implement the FSI protocol by bitbanging the GPIOs instead
> > of doing it from the ARM core.
> > 
> > Thus it's a drop-in replacement for the existing
> > fsi-master-gpio pseudo-device for use on systems for which
> > a corresponding firmware file exists. It has most of the
> > same properties, plus some more needed to operate the
> > coprocessor.
> > 
> > Signed-off-by: Benjamin Herrenschmidt 
> > ---
> >  .../bindings/fsi/fsi-master-ast-cf.txt| 36 +++
> >  1 file changed, 36 insertions(+)
> >  create mode 100644 
> > Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > 
> > diff --git a/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt 
> > b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > new file mode 100644
> > index ..50913ae685cc
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/fsi/fsi-master-ast-cf.txt
> > @@ -0,0 +1,36 @@
> > +Device-tree bindings for ColdFire offloaded gpio-based FSI master driver
> > +
> > +
> > +Required properties:
> > + - compatible =
> > + "fsi-master-ast-2400-cf" for an AST2400 based system
> > +   or
> > + "fsi-master-ast-2500-cf" for an AST2500 based system
> 
> ,-

It's not really a SOC block from a vendor, it's a pseudo-device in a
way. The current one that doesn't use the coldfire offload is just
compatible "fsi-master-gpio".

I can add a vendor but what should it be ? aspeed because it runs on
the aspeed SoCs only ? ibm because we wrote it and FSI is an IBM
protocol ?

- doesn't make sense here though.

> > +
> > + - clock-gpios = ;: GPIO for FSI clock
> > + - data-gpios = ; : GPIO for FSI data signal
> > + - enable-gpios = ;   : GPIO for enable signal
> > + - trans-gpios = ;: GPIO for voltage translator 
> > enable
> > + - mux-gpios = ;  : GPIO for pin multiplexing with other
> 
> So the gpio info is pased to the CF? Otherwise, what's the point of 
> having these in DT?

In the original version you are looking at, they are not passed to the
CF per-se but the driver does use aspeed GPIO specific APIs to
configure them to be owned by the CF, so we need the references.

However, I've just reworked the ucode with a few tricks to avoid losing
singificant performance, so that we can indeed pass them to the CF,
thus avoiding the need for a per-system image, so the above are here to
stay.

> > +  functions (eg, external FSI 
> > masters)
> > + - memory-region = ;  : Reference to the reserved 
> > memory for
> > +  the ColdFire. Must be 2M aligned 
> > on
> > + AST2400 and 1M aligned on AST2500
> > + - sram = ;   : Reference to the SRAM node.
> > + - cvic = ;   : Reference to the CVIC node.
> 
> Vendor prefixes.

On what ? Why would an "sram" pointer have a vendor prefix ? Or a
memory region pointer ?

> > + - fw-name = ; : The name used to construct 
> > the firmware
> > + file name (cf-fsi-.bin)
> 
> firmware-name is used in some other places (and is the full filename).

It's gone in the latest version as there's a single FW file now for all
systems.
> 
> > + - fw-platform-sig = ;  : A signature value that must 
> > match the one
> > + contained in the firmware 
> > image. Known
> > + values are listed in the firmware 
> > interface
> > + file cf-fsi-fw.h
> 
> Vendor prefix unless you think this could be common.

It's going away.

> > +Examples:
> > +
> > +fsi-master {
> > +compatible = "fsi-master-gpio", "fsi-master";
> 
> Forget to update the example?

Indeed :)

> > +clock-gpios = < 0>;
> > +data-gpios = < 1>;
> > +enable-gpios = < 2>;
> > +trans-gpios = < 3>;
> > +mux-gpios = < 4>;
> > +}
> > -- 
> > 2.17.1
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe devicetree" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fsi: sbefifo: Add missing mutex_unlock

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 16:21 -0500, Eddie James wrote:
> There was no unlock of the FFDC mutex.
> 
> Fixes: 9f4a8a2d7f9d ("fsi/sbefifo: Add driver for the SBE FIFO")
> Signed-off-by: Eddie James 

Thanks. 

Cheers,
Ben.

> ---
>  drivers/fsi/fsi-sbefifo.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/fsi/fsi-sbefifo.c b/drivers/fsi/fsi-sbefifo.c
> index 4f076fd..a34ff99 100644
> --- a/drivers/fsi/fsi-sbefifo.c
> +++ b/drivers/fsi/fsi-sbefifo.c
> @@ -194,6 +194,7 @@ static void sbefifo_dump_ffdc(struct device *dev, const 
> __be32 *ffdc,
>   }
>   dev_warn(dev, 
> "+---+\n");
>   }
> + mutex_unlock(_ffdc_mutex);
>  }
>  
>  int sbefifo_parse_status(struct device *dev, u16 cmd, __be32 *response,

Re: [PATCH] fsi: sbefifo: Add missing mutex_unlock

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 16:21 -0500, Eddie James wrote:
> There was no unlock of the FFDC mutex.
> 
> Fixes: 9f4a8a2d7f9d ("fsi/sbefifo: Add driver for the SBE FIFO")
> Signed-off-by: Eddie James 

Thanks. 

Cheers,
Ben.

> ---
>  drivers/fsi/fsi-sbefifo.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/fsi/fsi-sbefifo.c b/drivers/fsi/fsi-sbefifo.c
> index 4f076fd..a34ff99 100644
> --- a/drivers/fsi/fsi-sbefifo.c
> +++ b/drivers/fsi/fsi-sbefifo.c
> @@ -194,6 +194,7 @@ static void sbefifo_dump_ffdc(struct device *dev, const 
> __be32 *ffdc,
>   }
>   dev_warn(dev, 
> "+---+\n");
>   }
> + mutex_unlock(_ffdc_mutex);
>  }
>  
>  int sbefifo_parse_status(struct device *dev, u16 cmd, __be32 *response,

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 08:46 -0700, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 03, 2018 at 03:22:47PM +1000, Benjamin Herrenschmidt wrote:
> > +bool kernfs_has_children(struct kernfs_node *kn)
> > +{
> > +   bool has_children = false;
> > +   struct kernfs_node *pos;
> > +
> > +   /* Lockless shortcut */
> > +   if (RB_EMPTY_NODE(>rb))
> > +   return false;
> 
> Hmm... shouldn't it be testing !rb_first(kn->dir.children)?  The above
> would test whether @kn itself is unlinked.

Ah possibly, the rb tree usage got me confused a few times in there.

> > +
> > +   /* Now check for active children */
> > +   mutex_lock(_mutex);
> > +   pos = NULL;
> > +   while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
> > +   if (kernfs_active(pos)) {
> > +   has_children = true;
> > +   break;
> > +   }
> > +   }
> > +   mutex_unlock(_mutex);
> > +
> > +   return has_children;
> 
> The active ref is there to synchronize removal against in-flight reads
> so that kernfs_remove() can wait for them to drain.  On return from
> kernfs_remove(), the node is guaranteed to be off the rbtree and the
> above test isn't necessary.  !rb_first() test should be enough
> (provided that there's external synchronization, of course).

Ok. Yes there's external synchronization. So a simple !rb_first test
doesn't need the mutex I suppose ?

Cheers,
Ben.

> 
> Thanks.
>

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 08:46 -0700, Tejun Heo wrote:
> Hello,
> 
> On Tue, Jul 03, 2018 at 03:22:47PM +1000, Benjamin Herrenschmidt wrote:
> > +bool kernfs_has_children(struct kernfs_node *kn)
> > +{
> > +   bool has_children = false;
> > +   struct kernfs_node *pos;
> > +
> > +   /* Lockless shortcut */
> > +   if (RB_EMPTY_NODE(>rb))
> > +   return false;
> 
> Hmm... shouldn't it be testing !rb_first(kn->dir.children)?  The above
> would test whether @kn itself is unlinked.

Ah possibly, the rb tree usage got me confused a few times in there.

> > +
> > +   /* Now check for active children */
> > +   mutex_lock(_mutex);
> > +   pos = NULL;
> > +   while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
> > +   if (kernfs_active(pos)) {
> > +   has_children = true;
> > +   break;
> > +   }
> > +   }
> > +   mutex_unlock(_mutex);
> > +
> > +   return has_children;
> 
> The active ref is there to synchronize removal against in-flight reads
> so that kernfs_remove() can wait for them to drain.  On return from
> kernfs_remove(), the node is guaranteed to be off the rbtree and the
> above test isn't necessary.  !rb_first() test should be enough
> (provided that there's external synchronization, of course).

Ok. Yes there's external synchronization. So a simple !rb_first test
doesn't need the mutex I suppose ?

Cheers,
Ben.

> 
> Thanks.
>

Re: [RFC PATCH 1/4] dts: misc: Add bindings documentation for bmc-misc-ctrl

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 16:31 +0200, Greg KH wrote:
> On Wed, Jul 04, 2018 at 12:16:49AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2018-07-03 at 09:50 +0200, Greg KH wrote:
> > > On Tue, Jul 03, 2018 at 05:04:10PM +1000, Andrew Jeffery wrote:
> > > > Signed-off-by: Andrew Jeffery 
> > > > ---
> > > 
> > > I can't take patches without any changelog text at all :(
> > 
> > Greg (and replying to your other comments as well)...
> > 
> > This is an RFC series, it's not meant for you to take at this point,
> > it's about discussing the overall approach to exposing BMC random
> > "tunables" as explained in patch 0 of the series.
> > 
> > Yes the individual patches aren't yet at the level of polish for a
> > formal submission, we (naively ?) thought that's what the whole RFC tag
> > is about :-)
> 
> Oh come on, putting a basic "here is what this patch does" comment
> should be part of every patch, otherwise what is there to comment on if
> we don't know what is going on in the patch itself?

Well, it adds documentation :-) You can just read the patch which is
... the documentation :)
> 
> Anyway, I provided a bunch of feedback to the "real" patch in this
> series...

Yes, you did that's fine. Thanks.

Cheers,
Ben.

Re: [RFC PATCH 1/4] dts: misc: Add bindings documentation for bmc-misc-ctrl

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 16:31 +0200, Greg KH wrote:
> On Wed, Jul 04, 2018 at 12:16:49AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2018-07-03 at 09:50 +0200, Greg KH wrote:
> > > On Tue, Jul 03, 2018 at 05:04:10PM +1000, Andrew Jeffery wrote:
> > > > Signed-off-by: Andrew Jeffery 
> > > > ---
> > > 
> > > I can't take patches without any changelog text at all :(
> > 
> > Greg (and replying to your other comments as well)...
> > 
> > This is an RFC series, it's not meant for you to take at this point,
> > it's about discussing the overall approach to exposing BMC random
> > "tunables" as explained in patch 0 of the series.
> > 
> > Yes the individual patches aren't yet at the level of polish for a
> > formal submission, we (naively ?) thought that's what the whole RFC tag
> > is about :-)
> 
> Oh come on, putting a basic "here is what this patch does" comment
> should be part of every patch, otherwise what is there to comment on if
> we don't know what is going on in the patch itself?

Well, it adds documentation :-) You can just read the patch which is
... the documentation :)
> 
> Anyway, I provided a bunch of feedback to the "real" patch in this
> series...

Yes, you did that's fine. Thanks.

Cheers,
Ben.

Re: [PATCH -next] fsi/sbefifo: Add dependency on OF_ADDRESS

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 06:43 -0700, Guenter Roeck wrote:
> The driver calls of_platform_device_create() which is only available
> if OF_ADDRESS is enabled. When building sparc64 images, this results
> in
> 
> ERROR: "of_platform_device_create" [drivers/fsi/fsi-sbefifo.ko] undefined!
> 
> Fixes: 9f4a8a2d7f9d ("fsi/sbefifo: Add driver for the SBE FIFO")
> Cc: Benjamin Herrenschmidt 
> Signed-off-by: Guenter Roeck 

Thanks Guenter ! I'll include this into the fsi tree.

(I was wondering what that kbuild report on sparc64 was about... I must
have misread the code, I didn't realize of_platform_device_create
wasn't under CONFIG_OF alone).

Cheers,
Ben.

> ---
>  drivers/fsi/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
> index 24f84a96b8b9..9c08f467a7bb 100644
> --- a/drivers/fsi/Kconfig
> +++ b/drivers/fsi/Kconfig
> @@ -34,6 +34,7 @@ config FSI_SCOM
>  
>  config FSI_SBEFIFO
>   tristate "SBEFIFO FSI client device driver"
> + depends on OF_ADDRESS
>   ---help---
>   This option enables an FSI based SBEFIFO device driver. The SBEFIFO is
>   a pipe-like FSI device for communicating with the self boot engine

Re: [PATCH -next] fsi/sbefifo: Add dependency on OF_ADDRESS

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 06:43 -0700, Guenter Roeck wrote:
> The driver calls of_platform_device_create() which is only available
> if OF_ADDRESS is enabled. When building sparc64 images, this results
> in
> 
> ERROR: "of_platform_device_create" [drivers/fsi/fsi-sbefifo.ko] undefined!
> 
> Fixes: 9f4a8a2d7f9d ("fsi/sbefifo: Add driver for the SBE FIFO")
> Cc: Benjamin Herrenschmidt 
> Signed-off-by: Guenter Roeck 

Thanks Guenter ! I'll include this into the fsi tree.

(I was wondering what that kbuild report on sparc64 was about... I must
have misread the code, I didn't realize of_platform_device_create
wasn't under CONFIG_OF alone).

Cheers,
Ben.

> ---
>  drivers/fsi/Kconfig | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/fsi/Kconfig b/drivers/fsi/Kconfig
> index 24f84a96b8b9..9c08f467a7bb 100644
> --- a/drivers/fsi/Kconfig
> +++ b/drivers/fsi/Kconfig
> @@ -34,6 +34,7 @@ config FSI_SCOM
>  
>  config FSI_SBEFIFO
>   tristate "SBEFIFO FSI client device driver"
> + depends on OF_ADDRESS
>   ---help---
>   This option enables an FSI based SBEFIFO device driver. The SBEFIFO is
>   a pipe-like FSI device for communicating with the self boot engine

Re: [RFC PATCH 1/4] dts: misc: Add bindings documentation for bmc-misc-ctrl

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 09:50 +0200, Greg KH wrote:
> On Tue, Jul 03, 2018 at 05:04:10PM +1000, Andrew Jeffery wrote:
> > Signed-off-by: Andrew Jeffery 
> > ---
> 
> I can't take patches without any changelog text at all :(

Greg (and replying to your other comments as well)...

This is an RFC series, it's not meant for you to take at this point,
it's about discussing the overall approach to exposing BMC random
"tunables" as explained in patch 0 of the series.

Yes the individual patches aren't yet at the level of polish for a
formal submission, we (naively ?) thought that's what the whole RFC tag
is about :-)

Cheers,
Ben.

Re: [RFC PATCH 1/4] dts: misc: Add bindings documentation for bmc-misc-ctrl

2018-07-03 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 09:50 +0200, Greg KH wrote:
> On Tue, Jul 03, 2018 at 05:04:10PM +1000, Andrew Jeffery wrote:
> > Signed-off-by: Andrew Jeffery 
> > ---
> 
> I can't take patches without any changelog text at all :(

Greg (and replying to your other comments as well)...

This is an RFC series, it's not meant for you to take at this point,
it's about discussing the overall approach to exposing BMC random
"tunables" as explained in patch 0 of the series.

Yes the individual patches aren't yet at the level of polish for a
formal submission, we (naively ?) thought that's what the whole RFC tag
is about :-)

Cheers,
Ben.

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 12:39 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-02 at 19:26 -0700, Linus Torvalds wrote:
> > On Mon, Jul 2, 2018 at 7:15 PM Linus Torvalds
> >  wrote:
> > > 
> > > It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> > > USE UNDER ANY CIRCUMSTANCES.  Think of it more as a "something like
> > > this might work, but probably doesn't". Maybe it gives you an idea,
> > > although that idea might be "Linus has finally lost it".
> > 
> > Even if it were to work, it should probably just be done inside kernfs
> > and exposed as some kind of "kernfs_remove_if_empty()" function.
> 
> That would be harder, we'd have to expose it via a
> kobject_del_if_empty() then.
> 
> Since we control completely when things are added/removed from the
> gluedir and have a big fat mutex for it, I don't think locking is much
> of an issue. At least in that specific case.

Ok, kernfs was a tad tougher to crack than I hoped but here's a
prototype lightly tested.

Tejun, Greg, does it look reasonable to you ?

Cheers,
Ben.

diff --git a/drivers/base/core.c b/drivers/base/core.c
index b610816eb887..9166f68276c6 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1570,6 +1570,8 @@ static void cleanup_glue_dir(struct device *dev, struct 
kobject *glue_dir)
return;
 
mutex_lock(_mutex);
+   if (!kobject_has_children(glue_dir))
+   kobject_del(glue_dir);
kobject_put(glue_dir);
mutex_unlock(_mutex);
 }
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc19340b..e4bec09bc602 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1251,6 +1251,29 @@ void kernfs_activate(struct kernfs_node *kn)
mutex_unlock(_mutex);
 }
 
+bool kernfs_has_children(struct kernfs_node *kn)
+{
+   bool has_children = false;
+   struct kernfs_node *pos;
+
+   /* Lockless shortcut */
+   if (RB_EMPTY_NODE(>rb))
+   return false;
+
+   /* Now check for active children */
+   mutex_lock(_mutex);
+   pos = NULL;
+   while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
+   if (kernfs_active(pos)) {
+   has_children = true;
+   break;
+   }
+   }
+   mutex_unlock(_mutex);
+
+   return has_children;
+}
+
 static void __kernfs_remove(struct kernfs_node *kn)
 {
struct kernfs_node *pos;
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index ab25c8b6d9e3..776350ac1cbc 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -300,6 +300,12 @@ static inline bool kernfs_ns_enabled(struct kernfs_node 
*kn)
return kn->flags & KERNFS_NS;
 }
 
+/**
+ * kernfs_has_children - Returns whether a kernfs node has children.
+ * @kn: the node to test
+ */
+bool kernfs_has_children(struct kernfs_node *kn);
+
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
 int kernfs_path_from_node(struct kernfs_node *root_kn, struct kernfs_node *kn,
  char *buf, size_t buflen);
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..1e3294ab95f0 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -116,6 +116,17 @@ extern void kobject_put(struct kobject *kobj);
 extern const void *kobject_namespace(struct kobject *kobj);
 extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
 
+/**
+ * kobject_has_children - Returns whether a kobject has children.
+ * @kobj: the object to test
+ */
+static inline bool kobject_has_children(struct kobject *kobj)
+{
+   WARN_ON_ONCE(kref_read(>kref) == 0);
+
+   return kobj->sd && kernfs_has_children(kobj->sd);
+}
+
 struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Tue, 2018-07-03 at 12:39 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-07-02 at 19:26 -0700, Linus Torvalds wrote:
> > On Mon, Jul 2, 2018 at 7:15 PM Linus Torvalds
> >  wrote:
> > > 
> > > It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> > > USE UNDER ANY CIRCUMSTANCES.  Think of it more as a "something like
> > > this might work, but probably doesn't". Maybe it gives you an idea,
> > > although that idea might be "Linus has finally lost it".
> > 
> > Even if it were to work, it should probably just be done inside kernfs
> > and exposed as some kind of "kernfs_remove_if_empty()" function.
> 
> That would be harder, we'd have to expose it via a
> kobject_del_if_empty() then.
> 
> Since we control completely when things are added/removed from the
> gluedir and have a big fat mutex for it, I don't think locking is much
> of an issue. At least in that specific case.

Ok, kernfs was a tad tougher to crack than I hoped but here's a
prototype lightly tested.

Tejun, Greg, does it look reasonable to you ?

Cheers,
Ben.

diff --git a/drivers/base/core.c b/drivers/base/core.c
index b610816eb887..9166f68276c6 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -1570,6 +1570,8 @@ static void cleanup_glue_dir(struct device *dev, struct 
kobject *glue_dir)
return;
 
mutex_lock(_mutex);
+   if (!kobject_has_children(glue_dir))
+   kobject_del(glue_dir);
kobject_put(glue_dir);
mutex_unlock(_mutex);
 }
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 89d1dc19340b..e4bec09bc602 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1251,6 +1251,29 @@ void kernfs_activate(struct kernfs_node *kn)
mutex_unlock(_mutex);
 }
 
+bool kernfs_has_children(struct kernfs_node *kn)
+{
+   bool has_children = false;
+   struct kernfs_node *pos;
+
+   /* Lockless shortcut */
+   if (RB_EMPTY_NODE(>rb))
+   return false;
+
+   /* Now check for active children */
+   mutex_lock(_mutex);
+   pos = NULL;
+   while ((pos = kernfs_next_descendant_post(pos, kn)) && pos != kn) {
+   if (kernfs_active(pos)) {
+   has_children = true;
+   break;
+   }
+   }
+   mutex_unlock(_mutex);
+
+   return has_children;
+}
+
 static void __kernfs_remove(struct kernfs_node *kn)
 {
struct kernfs_node *pos;
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index ab25c8b6d9e3..776350ac1cbc 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -300,6 +300,12 @@ static inline bool kernfs_ns_enabled(struct kernfs_node 
*kn)
return kn->flags & KERNFS_NS;
 }
 
+/**
+ * kernfs_has_children - Returns whether a kernfs node has children.
+ * @kn: the node to test
+ */
+bool kernfs_has_children(struct kernfs_node *kn);
+
 int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen);
 int kernfs_path_from_node(struct kernfs_node *root_kn, struct kernfs_node *kn,
  char *buf, size_t buflen);
diff --git a/include/linux/kobject.h b/include/linux/kobject.h
index 7f6f93c3df9c..1e3294ab95f0 100644
--- a/include/linux/kobject.h
+++ b/include/linux/kobject.h
@@ -116,6 +116,17 @@ extern void kobject_put(struct kobject *kobj);
 extern const void *kobject_namespace(struct kobject *kobj);
 extern char *kobject_get_path(struct kobject *kobj, gfp_t flag);
 
+/**
+ * kobject_has_children - Returns whether a kobject has children.
+ * @kobj: the object to test
+ */
+static inline bool kobject_has_children(struct kobject *kobj)
+{
+   WARN_ON_ONCE(kref_read(>kref) == 0);
+
+   return kobj->sd && kernfs_has_children(kobj->sd);
+}
+
 struct kobj_type {
void (*release)(struct kobject *kobj);
const struct sysfs_ops *sysfs_ops;

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Mon, 2018-07-02 at 19:26 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 7:15 PM Linus Torvalds
>  wrote:
> > 
> > It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> > USE UNDER ANY CIRCUMSTANCES.  Think of it more as a "something like
> > this might work, but probably doesn't". Maybe it gives you an idea,
> > although that idea might be "Linus has finally lost it".
> 
> Even if it were to work, it should probably just be done inside kernfs
> and exposed as some kind of "kernfs_remove_if_empty()" function.

That would be harder, we'd have to expose it via a
kobject_del_if_empty() then.

Since we control completely when things are added/removed from the
gluedir and have a big fat mutex for it, I don't think locking is much
of an issue. At least in that specific case.

> We happen to know that we hold the lock that creates all the entries
> in the glue directory (and we don't allow users to move or create
> stuff, afaik, even if we alloc chmod etc), so there should be no
> races, but a generic kernfs helper function would probably have to get
> proper locks and check it the right way.

Yes that or document very clearly under what circumstances that
function can be used.

> Linus

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Mon, 2018-07-02 at 19:26 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 7:15 PM Linus Torvalds
>  wrote:
> > 
> > It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> > USE UNDER ANY CIRCUMSTANCES.  Think of it more as a "something like
> > this might work, but probably doesn't". Maybe it gives you an idea,
> > although that idea might be "Linus has finally lost it".
> 
> Even if it were to work, it should probably just be done inside kernfs
> and exposed as some kind of "kernfs_remove_if_empty()" function.

That would be harder, we'd have to expose it via a
kobject_del_if_empty() then.

Since we control completely when things are added/removed from the
gluedir and have a big fat mutex for it, I don't think locking is much
of an issue. At least in that specific case.

> We happen to know that we hold the lock that creates all the entries
> in the glue directory (and we don't allow users to move or create
> stuff, afaik, even if we alloc chmod etc), so there should be no
> races, but a generic kernfs helper function would probably have to get
> proper locks and check it the right way.

Yes that or document very clearly under what circumstances that
function can be used.

> Linus

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Mon, 2018-07-02 at 19:15 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 5:57 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > I'll have a look see if I can find a way to check that the sysfs dir is
> > empty without adding that childcount, that would at least alleviate
> > some of your objection, and would probably make the patch
> > smaller/simpler.
> 
> That would definitely make me happier.
> 
> Right now we already remove the actual device node sysfs associations
> synchronously with "kobject_del()" (even if it still then stays around
> as a kobject), and that does the remove for the object itself - just
> not the glue directory.
> 
> If we then just did a "if glue dir is empty, delete the glue dir too"
> in cleanup_glue_dir(), at least ther patch would be prettier.
> 
> I *think* that should be easy. Maybe. Something almost, but not quite,
> entirely unlike the below UNTESTED hack?

Yes, something like that. I have a bunch of other things I need to
attend to today, so I might not come up with a patch before some time
tomorrow but I'll have a look.
 
> It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> USE UNDER ANY CIRCUMSTANCES.  Think of it more as a "something like
> this might work, but probably doesn't". Maybe it gives you an idea,
> although that idea might be "Linus has finally lost it".
> 
> I suspect it's broken because kernfs has this notion of inactive
> nodes, so just testing for the rbtree being empty is probably wrong. I
> think maybe it needs to test for there not being any active nodes, but
> then you'd need the kernfs_mutex etc, so that would make things much
> worse.

There's also a "subdirs" in kernfs_node which we could use. That's a
bit ugly because it only account directories but we happen to know that
the gluedirs only contain directories, so it would work fine in
practice for that case.

> I really haven't touched the kernfs code much, thus the "this may be
> complete and utter garbage" warning.
> 
> Adding Tejun to the participants, he should know. Or Greg, who has
> been silent, presumably because he's still getting over his crazy
> travels.

Cheers,
Ben.

> Linus
> 
> ---
> 
> --- a/drivers/base/core.c
>+++ b/drivers/base/core.c
>@@ -1585,6 +1585,12 @@ static inline struct kobject
> *get_glue_dir(struct device *dev)
> return dev->kobj.parent;
> }
> 
>+static inline bool has_children(struct kobject *dir)
>+{
>+struct kernfs_node *kn = dir->sd;
>+return kn && kn->dir.children.rb_node;
>+}
>+
> /*
>  * make sure cleaning up dir as the last step, we need to make
>  * sure .release handler of kobject is run with holding the
>@@ -1597,6 +1603,10 @@ static void cleanup_glue_dir(struct device
> *dev, struct kobject *glue_dir)
> return;
> 
> mutex_lock(_mutex);
>-kobject_put(glue_dir);
>+if (has_children(glue_dir))
>+kobject_put(glue_dir);
>+else
>+kobject_del(glue_dir);
> mutex_unlock(_mutex);
> }

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Mon, 2018-07-02 at 19:15 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 5:57 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > I'll have a look see if I can find a way to check that the sysfs dir is
> > empty without adding that childcount, that would at least alleviate
> > some of your objection, and would probably make the patch
> > smaller/simpler.
> 
> That would definitely make me happier.
> 
> Right now we already remove the actual device node sysfs associations
> synchronously with "kobject_del()" (even if it still then stays around
> as a kobject), and that does the remove for the object itself - just
> not the glue directory.
> 
> If we then just did a "if glue dir is empty, delete the glue dir too"
> in cleanup_glue_dir(), at least ther patch would be prettier.
> 
> I *think* that should be easy. Maybe. Something almost, but not quite,
> entirely unlike the below UNTESTED hack?

Yes, something like that. I have a bunch of other things I need to
attend to today, so I might not come up with a patch before some time
tomorrow but I'll have a look.
 
> It's whitespace-damaged on purpose. It's probably broken shit. DO NOT
> USE UNDER ANY CIRCUMSTANCES.  Think of it more as a "something like
> this might work, but probably doesn't". Maybe it gives you an idea,
> although that idea might be "Linus has finally lost it".
> 
> I suspect it's broken because kernfs has this notion of inactive
> nodes, so just testing for the rbtree being empty is probably wrong. I
> think maybe it needs to test for there not being any active nodes, but
> then you'd need the kernfs_mutex etc, so that would make things much
> worse.

There's also a "subdirs" in kernfs_node which we could use. That's a
bit ugly because it only account directories but we happen to know that
the gluedirs only contain directories, so it would work fine in
practice for that case.

> I really haven't touched the kernfs code much, thus the "this may be
> complete and utter garbage" warning.
> 
> Adding Tejun to the participants, he should know. Or Greg, who has
> been silent, presumably because he's still getting over his crazy
> travels.

Cheers,
Ben.

> Linus
> 
> ---
> 
> --- a/drivers/base/core.c
>+++ b/drivers/base/core.c
>@@ -1585,6 +1585,12 @@ static inline struct kobject
> *get_glue_dir(struct device *dev)
> return dev->kobj.parent;
> }
> 
>+static inline bool has_children(struct kobject *dir)
>+{
>+struct kernfs_node *kn = dir->sd;
>+return kn && kn->dir.children.rb_node;
>+}
>+
> /*
>  * make sure cleaning up dir as the last step, we need to make
>  * sure .release handler of kobject is run with holding the
>@@ -1597,6 +1603,10 @@ static void cleanup_glue_dir(struct device
> *dev, struct kobject *glue_dir)
> return;
> 
> mutex_lock(_mutex);
>-kobject_put(glue_dir);
>+if (has_children(glue_dir))
>+kobject_put(glue_dir);
>+else
>+kobject_del(glue_dir);
> mutex_unlock(_mutex);
> }

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Mon, 2018-07-02 at 12:24 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 3:23 AM Benjamin Herrenschmidt
>  wrote:
> > 
> > Let me try one last ditch attempt to convince you using maybe a
> > different perspective: this is how sysfs is intended to work and how
> > the device model already does everywhere else except the gluedirs.
> 
> So don't get me wrong. I don't think that patch is _wrong_. It may
> well be the best thing we can do now.
> 
> I just think some of the arguments for the patch are bogus.
> 
> I still think that the auto-cleanup is actually a good thing in
> general. Not because it simplifies things (which it can do), but
> because it fundamentally *allows* you to use less locking.
> 
> And that ends up being my real reason to not like your patch all that
> much. It depends on
> 
>  (a) an extra reference count

Right, that is not that great, and it might be possible to avoid it if
we have a way to check that the sysfs dir is empty. Maybe kobj->sd-
>subdirs ? This will find subdirs, not files, but for gluedirs that's
sufficient. I can play with that today see if it works.

BTW. Talking of extra reference counts, we already have two everywhere
afaik... kernfs has its own that's separate from the kobject one. I got
lost in that maze a couple of times already.

>  (b) on fairly heavy-handed locking (ie the whole "lock on release
> too, not just on allocation").

This is already there, I'm not adding it ;-) It's just ineffective
without my patch due to the possibility of the "auto-cleanup" to leak
outside the lock.

> and I think both of those are non-optimal.
> 
> So:
> 
> > The actual lifetime of the struct device is handled *separately* by
> > device_get/put which are just wrappers on kobject_get/put.
> 
> I agree that that is the end result of your patch (and perhaps the
> buggy intent of the original code).

This is the existing device_add/del without my patch. I was describing
how the core device stuff works for the normal device directories
today. My patch makes the gludirs do the same thing basically, which I
think was the intend of the code already but was broken.

> I just don't necessarily agree it's a great thing in general. It
> basically uses sysfs visibility as an argument for why lifetimes
> should differ from the refcounted lifetimes.

Yes, it does that so that we can remove and re-add without name
clashes.

> And that may be a practical argument, but I don't think it's a very
> good argument in general. I think it would arguably be much better to
> be less serialized, and depend on refcounting more, and make less of a
> deal about the sysfs visibility.
> 
> For example, if we *really* add back the exact same device immediately
> after removing it, and the device was still in use somehow (ie the
> refcount never went down to zero), maybe we really should not have
> reused the same name? Or if we do re-use the same name, maybe we
> should have re-used the device node itself?

Well, maybe but that's what we've been doing forever, and I think
drivers deal with it already. It applies to devices and to chardevs for
example.

> See what I'm saying? I understand where your patch is coming from, but
> I am suspicious of the fact that it seems to want to re-use a name
> (but not the object) that is by definition still in use.

Yes, I see what you mean.

I don't really have the same qualms as you do, it's how we've done
things forever (it's just that the gluedir code was broken), and it
generally works well.

It's not different in principle to what we do with files. You can
unlink an open file. It's still "in use" but the directory entry is
gone and one can create a new one.

It a slightly more roundabout way, we do that with mm_structs, the last
process can be gone but we keep them around due to being referenced (or
lazily attached to a kthread). Here we have 2 counters too.

> Maybe that's the right thing in this case. Considering that we have a
> real oops and a real problem, and I don't have an alternative patch, I
> guess the "patches talk, bullshit walks" rule applies.

Haha, as you prefer. Patch 1 fixes the oops, patch 2 fixes the name
collision, so in theory you could get away with just patch 1. But then
device would start failing to be re-added or re-bound.

(In my case it was a sub-device created by the driver of a parent
device, so it happens on bind/unbind of that parent driver).

I'll have a look see if I can find a way to check that the sysfs dir is
empty without adding that childcount, that would at least alleviate
some of your objection, and would probably make the patch
smaller/simpler.

Cheers,
Ben.

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Mon, 2018-07-02 at 12:24 -0700, Linus Torvalds wrote:
> On Mon, Jul 2, 2018 at 3:23 AM Benjamin Herrenschmidt
>  wrote:
> > 
> > Let me try one last ditch attempt to convince you using maybe a
> > different perspective: this is how sysfs is intended to work and how
> > the device model already does everywhere else except the gluedirs.
> 
> So don't get me wrong. I don't think that patch is _wrong_. It may
> well be the best thing we can do now.
> 
> I just think some of the arguments for the patch are bogus.
> 
> I still think that the auto-cleanup is actually a good thing in
> general. Not because it simplifies things (which it can do), but
> because it fundamentally *allows* you to use less locking.
> 
> And that ends up being my real reason to not like your patch all that
> much. It depends on
> 
>  (a) an extra reference count

Right, that is not that great, and it might be possible to avoid it if
we have a way to check that the sysfs dir is empty. Maybe kobj->sd-
>subdirs ? This will find subdirs, not files, but for gluedirs that's
sufficient. I can play with that today see if it works.

BTW. Talking of extra reference counts, we already have two everywhere
afaik... kernfs has its own that's separate from the kobject one. I got
lost in that maze a couple of times already.

>  (b) on fairly heavy-handed locking (ie the whole "lock on release
> too, not just on allocation").

This is already there, I'm not adding it ;-) It's just ineffective
without my patch due to the possibility of the "auto-cleanup" to leak
outside the lock.

> and I think both of those are non-optimal.
> 
> So:
> 
> > The actual lifetime of the struct device is handled *separately* by
> > device_get/put which are just wrappers on kobject_get/put.
> 
> I agree that that is the end result of your patch (and perhaps the
> buggy intent of the original code).

This is the existing device_add/del without my patch. I was describing
how the core device stuff works for the normal device directories
today. My patch makes the gludirs do the same thing basically, which I
think was the intend of the code already but was broken.

> I just don't necessarily agree it's a great thing in general. It
> basically uses sysfs visibility as an argument for why lifetimes
> should differ from the refcounted lifetimes.

Yes, it does that so that we can remove and re-add without name
clashes.

> And that may be a practical argument, but I don't think it's a very
> good argument in general. I think it would arguably be much better to
> be less serialized, and depend on refcounting more, and make less of a
> deal about the sysfs visibility.
> 
> For example, if we *really* add back the exact same device immediately
> after removing it, and the device was still in use somehow (ie the
> refcount never went down to zero), maybe we really should not have
> reused the same name? Or if we do re-use the same name, maybe we
> should have re-used the device node itself?

Well, maybe but that's what we've been doing forever, and I think
drivers deal with it already. It applies to devices and to chardevs for
example.

> See what I'm saying? I understand where your patch is coming from, but
> I am suspicious of the fact that it seems to want to re-use a name
> (but not the object) that is by definition still in use.

Yes, I see what you mean.

I don't really have the same qualms as you do, it's how we've done
things forever (it's just that the gluedir code was broken), and it
generally works well.

It's not different in principle to what we do with files. You can
unlink an open file. It's still "in use" but the directory entry is
gone and one can create a new one.

It a slightly more roundabout way, we do that with mm_structs, the last
process can be gone but we keep them around due to being referenced (or
lazily attached to a kthread). Here we have 2 counters too.

> Maybe that's the right thing in this case. Considering that we have a
> real oops and a real problem, and I don't have an alternative patch, I
> guess the "patches talk, bullshit walks" rule applies.

Haha, as you prefer. Patch 1 fixes the oops, patch 2 fixes the name
collision, so in theory you could get away with just patch 1. But then
device would start failing to be re-added or re-bound.

(In my case it was a sub-device created by the driver of a parent
device, so it happens on bind/unbind of that parent driver).

I'll have a look see if I can find a way to check that the sysfs dir is
empty without adding that childcount, that would at least alleviate
some of your objection, and would probably make the patch
smaller/simpler.

Cheers,
Ben.

Re: DMA mappings and crossing boundaries

2018-07-02 Thread Benjamin Herrenschmidt

On Mon, 2018-07-02 at 14:06 +0100, Robin Murphy wrote:

 .../...

Thanks Robin, I was starting to depair anybody would reply ;-)

> > AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
> > HOWTO.txt) as always allocating to the next power-of-2 order, so we
> > should never have the problem unless we allocate a single chunk larger
> > than the IOMMU page size.
> 
> (and even then it's not *that* much of a problem, since it comes down to 
> just finding n > 1 consecutive unused IOMMU entries for exclusive use by 
> that new chunk)

Yes, this case is not my biggest worry.

> > For dma_map_sg() however, if a request that has a single "entry"
> > spawning such a boundary, we need to ensure that the result mapping is
> > 2 contiguous "large" iommu pages as well.
> > 
> > However, that doesn't fit well with us re-using existing mappings since
> > they may already exist and either not be contiguous, or partially exist
> > with no free hole around them.
> > 
> > Now, we *could* possibly construe a way to solve this by detecting this
> > case and just allocating another "pair" (or set if we cross even more
> > pages) of IOMMU pages elsewhere, thus partially breaking our re-use
> > scheme.
> > 
> > But while doable, this introduce some serious complexity in the
> > implementation, which I would very much like to avoid.
> > 
> > So I was wondering if you guys thought that was ever likely to happen ?
> > Do you see reasonable cases where dma_map_sg() would be called with a
> > list in which a single entry crosses a 256M or 1G boundary ?
> 
> For streaming mappings of buffers cobbled together out of any old CPU 
> pages (e.g. user memory), you may well happen to get two 
> physically-adjacent pages falling either side of an IOMMU boundary, 
> which comprise all or part of a single request - note that whilst it's 
> probably less likely than the scatterlist case, this could technically 
> happen for dma_map_{page, single}() calls too.

Could it ? I wouldn't think dma_map_page is allows to cross page
boundaries ... what about single() ? The main worry is people using
these things on kmalloc'ed memory

> Conceptually it looks pretty easy to extend the allocation constraints 
> to cope with that - even the pathological worst case would have an 
> absolute upper bound of 3 IOMMU entries for any one physical region - 
> but if in practice it's a case of mapping arbitrary CPU pages to 32-bit 
> DMA addresses having only 4 1GB slots to play with, I can't really see a 
> way to make that practical :(

No we are talking about 40-ish-bits of address space, so there's a bit
of leeway. Of course no scheme will work if the user app tries to map
more than the GPU can possibly access.

But with newer AMD adding a few more bits and nVidia being at 47-bits,
I think we have some margin, it's just that they can't reach our
discontiguous memory with a normal 'bypass' mapping and I'd rather not
teach Linux about every single way our HW can scatter memory accross
nodes, so an "on demand" mechanism is by far the most flexible way to
deal with all configurations.

> Maybe the best compromise would be some sort of hybrid scheme which 
> makes sure that one of the IOMMU entries always covers the SWIOTLB 
> buffer, and invokes software bouncing for the awkward cases.

Hrm... not too sure about that. I'm happy to limit that scheme to well
known GPU vendor/device IDs, and SW bouncing is pointless in these
cases. It would be nice if we could have some kind of guarantee that a
single mapping or sglist entry never crossed a specific boundary
though... We more/less have that for 4G already (well, we are supposed
to at least). Who are the main potential problematic subsystems here ?
I'm thinking network skb allocation pools ... and page cache if it
tries to coalesce entries before issuing the map request, does it ?

Ben.

> Robin.

Re: DMA mappings and crossing boundaries

2018-07-02 Thread Benjamin Herrenschmidt

On Mon, 2018-07-02 at 14:06 +0100, Robin Murphy wrote:

 .../...

Thanks Robin, I was starting to depair anybody would reply ;-)

> > AFAIK, dma_alloc_coherent() is defined (Documentation/DMA-API-
> > HOWTO.txt) as always allocating to the next power-of-2 order, so we
> > should never have the problem unless we allocate a single chunk larger
> > than the IOMMU page size.
> 
> (and even then it's not *that* much of a problem, since it comes down to 
> just finding n > 1 consecutive unused IOMMU entries for exclusive use by 
> that new chunk)

Yes, this case is not my biggest worry.

> > For dma_map_sg() however, if a request that has a single "entry"
> > spawning such a boundary, we need to ensure that the result mapping is
> > 2 contiguous "large" iommu pages as well.
> > 
> > However, that doesn't fit well with us re-using existing mappings since
> > they may already exist and either not be contiguous, or partially exist
> > with no free hole around them.
> > 
> > Now, we *could* possibly construe a way to solve this by detecting this
> > case and just allocating another "pair" (or set if we cross even more
> > pages) of IOMMU pages elsewhere, thus partially breaking our re-use
> > scheme.
> > 
> > But while doable, this introduce some serious complexity in the
> > implementation, which I would very much like to avoid.
> > 
> > So I was wondering if you guys thought that was ever likely to happen ?
> > Do you see reasonable cases where dma_map_sg() would be called with a
> > list in which a single entry crosses a 256M or 1G boundary ?
> 
> For streaming mappings of buffers cobbled together out of any old CPU 
> pages (e.g. user memory), you may well happen to get two 
> physically-adjacent pages falling either side of an IOMMU boundary, 
> which comprise all or part of a single request - note that whilst it's 
> probably less likely than the scatterlist case, this could technically 
> happen for dma_map_{page, single}() calls too.

Could it ? I wouldn't think dma_map_page is allows to cross page
boundaries ... what about single() ? The main worry is people using
these things on kmalloc'ed memory

> Conceptually it looks pretty easy to extend the allocation constraints 
> to cope with that - even the pathological worst case would have an 
> absolute upper bound of 3 IOMMU entries for any one physical region - 
> but if in practice it's a case of mapping arbitrary CPU pages to 32-bit 
> DMA addresses having only 4 1GB slots to play with, I can't really see a 
> way to make that practical :(

No we are talking about 40-ish-bits of address space, so there's a bit
of leeway. Of course no scheme will work if the user app tries to map
more than the GPU can possibly access.

But with newer AMD adding a few more bits and nVidia being at 47-bits,
I think we have some margin, it's just that they can't reach our
discontiguous memory with a normal 'bypass' mapping and I'd rather not
teach Linux about every single way our HW can scatter memory accross
nodes, so an "on demand" mechanism is by far the most flexible way to
deal with all configurations.

> Maybe the best compromise would be some sort of hybrid scheme which 
> makes sure that one of the IOMMU entries always covers the SWIOTLB 
> buffer, and invokes software bouncing for the awkward cases.

Hrm... not too sure about that. I'm happy to limit that scheme to well
known GPU vendor/device IDs, and SW bouncing is pointless in these
cases. It would be nice if we could have some kind of guarantee that a
single mapping or sglist entry never crossed a specific boundary
though... We more/less have that for 4G already (well, we are supposed
to at least). Who are the main potential problematic subsystems here ?
I'm thinking network skb allocation pools ... and page cache if it
tries to coalesce entries before issuing the map request, does it ?

Ben.

> Robin.

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Mon, 2018-07-02 at 09:36 +1000, Benjamin Herrenschmidt wrote:
> > No. See above. The reason I think your patch 2/2 is wrong is that is
> > actually *breaks* the above model, exactly because of that thing that
> > you hatre.
> > 
> > The explicit removal is actively wrong for the "I want to reuse the
> > object" model, exactly because it happens before the refcount has gone
> > to zero.
> 
> That's indeed where we disagree.

Sooo...

I had some quality time with my good friend Oban 14 last night,
thinking through this and during the resulting hangover today. I've
tried very hard to "get" your perspective, but after another dive
through the code, I'm afraid I still think patch 2/2 is the right thing
to do.

(Let's ignore patch 1/1 for now and assume we get that part right)

Let me try one last ditch attempt to convince you using maybe a
different perspective: this is how sysfs is intended to work and how
the device model already does everywhere else except the gluedirs.

Sooo... let's start pulling the string bit by bit...

I think you agree since (I'm not going to try to quote you here) you
did mention adding/removing need to generally be protected by higher
level locks in whatever subsystem owns the kobjects. I don't think
there's an argument here.

This is hinted by some comments (among others) in sysfs/dir.c about the
kobject owner being responsible to ensure that removal doesn't race
with any operation, which is hard to do when such removal doesn't
happen under subsystem internal locks.

Now let's look at how things are handled for a normal struct device
kobject.

The device core provides device_add and device_del functions. Those are
completely orthogonal to the object lifetime controlled by
device_initialize (creation) and device_put. And for good reasons (yes,
there are helpers, device_register/unregister that combine them, but
those are just that... helpers).

Why ? For the exact same reason mentioned above: Along with all the
driver binding/unbinding business etc... those functions ensure that
the kobject of the device is added/removed from sysfs under all the
necessary locks in the device model, so that they cannot race with each
other.

This guarantees among others that it is possible to do a new device_add
for the same device immediately after device_del, without clashing with
a duplicate sysfs name, *even* if something still holds a reference to
the kobject for any reason (and there are plenty of these, see the cdev
example below).

The actual lifetime of the struct device is handled *separately* by
device_get/put which are just wrappers on kobject_get/put.

This is the same with a cdev. cdev_add and cdev_del or their friends
cdev_device_add/del do the same thing. The chardev case is interesting
specifically because it is much more common for these to persist
(lifetime) beyond cdev_del. You just need userspace to have an open
file descriptor to the chardev. Drivers have to deal with it of course,
they need to handle things like read/write/ioctl being called on
already open files after remove/del/unbind, and hopefully do so.

*But* it's also possible to create a new identical cdev with the same
sysfs name immediately after cdev_device_del even if the old one still
has open instanced precisely because we separate the sysfs presence,
handled through the device model internal locking, from the object
lifetime.

Now, going back to our gluedirs, in essence those are just more objects
owned by the device core. More specifically, they need to be created
synchronously with the first child device of a class added under a
given parent, and removed synchronously with the last child device of
that class under that same parent.

Delaying that removal from sysfs simply means that we now are unable to
re-add the same device after removing it for some arbitrary amount of
time, because something might still hold a kobject reference to the
gluedir for any reason, it doesn't matter.

It also means that the deletion from sysfs is no longer synchronized
with addition since it no longer happens from within the context of
decice_del which is the "owner" of the glue dir, under whatever locks
it uses to ensure there is no races between multiple devices addition
removal etc...

In fact, if you look at the code and the way the gdp_mutex is used, I
think this is *exactly* the intention of the code. That the
kobject_put() done inside device_del() is the last one, done under that
mutex, thus removing the gluedir synchronously and we have no race.

However, that code just fails to take into account the delayed removal
by kobject debugging and the possibility of another reference existing.

Now it's possible there there can be no other references of a gluedir
today, to be frank I haven't audited the entire code base to figure out
if anything can possibly hold one that isn't a child of that gluedir,
so if indeed there can't be any and the only r

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Mon, 2018-07-02 at 09:36 +1000, Benjamin Herrenschmidt wrote:
> > No. See above. The reason I think your patch 2/2 is wrong is that is
> > actually *breaks* the above model, exactly because of that thing that
> > you hatre.
> > 
> > The explicit removal is actively wrong for the "I want to reuse the
> > object" model, exactly because it happens before the refcount has gone
> > to zero.
> 
> That's indeed where we disagree.

Sooo...

I had some quality time with my good friend Oban 14 last night,
thinking through this and during the resulting hangover today. I've
tried very hard to "get" your perspective, but after another dive
through the code, I'm afraid I still think patch 2/2 is the right thing
to do.

(Let's ignore patch 1/1 for now and assume we get that part right)

Let me try one last ditch attempt to convince you using maybe a
different perspective: this is how sysfs is intended to work and how
the device model already does everywhere else except the gluedirs.

Sooo... let's start pulling the string bit by bit...

I think you agree since (I'm not going to try to quote you here) you
did mention adding/removing need to generally be protected by higher
level locks in whatever subsystem owns the kobjects. I don't think
there's an argument here.

This is hinted by some comments (among others) in sysfs/dir.c about the
kobject owner being responsible to ensure that removal doesn't race
with any operation, which is hard to do when such removal doesn't
happen under subsystem internal locks.

Now let's look at how things are handled for a normal struct device
kobject.

The device core provides device_add and device_del functions. Those are
completely orthogonal to the object lifetime controlled by
device_initialize (creation) and device_put. And for good reasons (yes,
there are helpers, device_register/unregister that combine them, but
those are just that... helpers).

Why ? For the exact same reason mentioned above: Along with all the
driver binding/unbinding business etc... those functions ensure that
the kobject of the device is added/removed from sysfs under all the
necessary locks in the device model, so that they cannot race with each
other.

This guarantees among others that it is possible to do a new device_add
for the same device immediately after device_del, without clashing with
a duplicate sysfs name, *even* if something still holds a reference to
the kobject for any reason (and there are plenty of these, see the cdev
example below).

The actual lifetime of the struct device is handled *separately* by
device_get/put which are just wrappers on kobject_get/put.

This is the same with a cdev. cdev_add and cdev_del or their friends
cdev_device_add/del do the same thing. The chardev case is interesting
specifically because it is much more common for these to persist
(lifetime) beyond cdev_del. You just need userspace to have an open
file descriptor to the chardev. Drivers have to deal with it of course,
they need to handle things like read/write/ioctl being called on
already open files after remove/del/unbind, and hopefully do so.

*But* it's also possible to create a new identical cdev with the same
sysfs name immediately after cdev_device_del even if the old one still
has open instanced precisely because we separate the sysfs presence,
handled through the device model internal locking, from the object
lifetime.

Now, going back to our gluedirs, in essence those are just more objects
owned by the device core. More specifically, they need to be created
synchronously with the first child device of a class added under a
given parent, and removed synchronously with the last child device of
that class under that same parent.

Delaying that removal from sysfs simply means that we now are unable to
re-add the same device after removing it for some arbitrary amount of
time, because something might still hold a kobject reference to the
gluedir for any reason, it doesn't matter.

It also means that the deletion from sysfs is no longer synchronized
with addition since it no longer happens from within the context of
decice_del which is the "owner" of the glue dir, under whatever locks
it uses to ensure there is no races between multiple devices addition
removal etc...

In fact, if you look at the code and the way the gdp_mutex is used, I
think this is *exactly* the intention of the code. That the
kobject_put() done inside device_del() is the last one, done under that
mutex, thus removing the gluedir synchronously and we have no race.

However, that code just fails to take into account the delayed removal
by kobject debugging and the possibility of another reference existing.

Now it's possible there there can be no other references of a gluedir
today, to be frank I haven't audited the entire code base to figure out
if anything can possibly hold one that isn't a child of that gluedir,
so if indeed there can't be any and the only r

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Sun, 2018-07-01 at 10:04 -0700, Linus Torvalds wrote:
> On Sun, Jul 1, 2018 at 12:16 AM Benjamin Herrenschmidt
>  wrote:
> > 
> > I suspect you didn't read it my entire argument or I wasn't clear
> > enough :-) This is actually the crux of the problem:
> > 
> > Yes the object continues to exist. However, the *last* kobject_put to
> > it will no longer be done under whatever higher level locking the
> > subsystem provides (whatever prevents for example concurrent add and
> > removes).
> 
> Well, yes and no.
> 
> Why "no"?
> 
> The last dropping is actually not necessarily that interesting.
> Especially with the sysfs interface, we basically know that you can
> look up the object using RCU (since that's what the filesystem lookup
> does), and that basically means that the refcount is always the final
> serialization mechanism.There is nothing else that can possibly lock
> it.
> 
> So this is where we disagree:
> 
> > Thus in that scenario the "last minute" kobject_release() done by the
> > last kobject_put() will be effectively unprotected from for example the
> > gdp_mutex (in the case of the gluedirs) or whatever other locking the
> > subsystem owning the kobject is using to avoid making that "refount 0"
> > object "discoverable".
> 
> No. *Fundamentally*, there is only one thing that protects that
> object: the refcount.
> 
> And my argument is that anything that has this model (which means
> anything that has any sysfs linkage, which pretty much means any
> kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
> had an existing stable pointer and just wants to increase the refcount
> (ie "already got a reference throuigh one of my data structures that
> use the lock")
> 
> But if you have that model, that means that the "last drop" is
> actually almost totally irrelevant.  Because it doesn't matter if it's
> done inside the lock or not - you know that once it has been done,
> that object is entirely gone. It's not discoverable any more -
> regardless of locking. The discoverability is basically controlled
> entirely by the refcount.
> 
> So what happens then?
> 
> The locking isn't important for the last release, but it *is*
> important for new object *creation*.
> 
> Why?
> 
> The refcount means that once an object is gone, it's gone for
> *everyone*. It's a one-way thing, and it's thread-safe. So the code
> that does *creation* can do this:
> 
>  - get subsystem lock
>  - look up object (using the "unless_zero()" model)
>  - if you got the object, re-use it, and you're done: drop lock and return
>  - otherwise, you know that nobody else can get it either
>  - create new object and instantiate it
>  - drop lock
> 
> and this means that you create a new object IFF the old object had its
> refcount drop to zero. So you always have exactly one copy (or no copy
> at all, in between the last drop and the creation of the new one).
> 
> See? The lack of locking at drop time didn't matter.  The refcount
> itself serialized things.
> 
> So the above is what I *think* the "glue_dir" logic should be. No need
> for any other count. Just re-use the old glue dir if you can find it,
> and create a new one if you can't.
> 
> The one important thing is that the object lookup above needs to use
> the lookup needs to find the object all the way until the refcount has
> become zero. And that actually means that the object MUST NOT be
> removed from the object lists until *after* the refcount has been
> decremented to zero. Which is actually why that "automatic cleanup"
> that you hate is actually an integral and important part of the
> process: removing the object *before* the refcount went to zero is
> broken, because that means that the "look up object" phase can now
> miss an object that still has a non-zero refcount.
> 
> > So my patch 1/2 prevents us from finding the old dying object (and thus
> > from crashing) but replaces this with the duplicate name problem.
> 
> So I absolutely agree with your patch 1/2. My argument against it is
> actually that I think the "unless_zero" thing needs to be more
> universal.
> 
> > My patch 2/2 removes that second problem by ensuring we remove the
> > object from sysfs synchronously in device_del when it no longer
> > contains any children, explicitely rather than implicitely by the
> > virtue of doing the "last" kobject_put.
> 
> No. See above. The reason I think your patch 2/2 is wrong is that is
> actually *breaks* the above model, exactly because of that thing that

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-02 Thread Benjamin Herrenschmidt

On Sun, 2018-07-01 at 10:04 -0700, Linus Torvalds wrote:
> On Sun, Jul 1, 2018 at 12:16 AM Benjamin Herrenschmidt
>  wrote:
> > 
> > I suspect you didn't read it my entire argument or I wasn't clear
> > enough :-) This is actually the crux of the problem:
> > 
> > Yes the object continues to exist. However, the *last* kobject_put to
> > it will no longer be done under whatever higher level locking the
> > subsystem provides (whatever prevents for example concurrent add and
> > removes).
> 
> Well, yes and no.
> 
> Why "no"?
> 
> The last dropping is actually not necessarily that interesting.
> Especially with the sysfs interface, we basically know that you can
> look up the object using RCU (since that's what the filesystem lookup
> does), and that basically means that the refcount is always the final
> serialization mechanism.There is nothing else that can possibly lock
> it.
> 
> So this is where we disagree:
> 
> > Thus in that scenario the "last minute" kobject_release() done by the
> > last kobject_put() will be effectively unprotected from for example the
> > gdp_mutex (in the case of the gluedirs) or whatever other locking the
> > subsystem owning the kobject is using to avoid making that "refount 0"
> > object "discoverable".
> 
> No. *Fundamentally*, there is only one thing that protects that
> object: the refcount.
> 
> And my argument is that anything that has this model (which means
> anything that has any sysfs linkage, which pretty much means any
> kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
> had an existing stable pointer and just wants to increase the refcount
> (ie "already got a reference throuigh one of my data structures that
> use the lock")
> 
> But if you have that model, that means that the "last drop" is
> actually almost totally irrelevant.  Because it doesn't matter if it's
> done inside the lock or not - you know that once it has been done,
> that object is entirely gone. It's not discoverable any more -
> regardless of locking. The discoverability is basically controlled
> entirely by the refcount.
> 
> So what happens then?
> 
> The locking isn't important for the last release, but it *is*
> important for new object *creation*.
> 
> Why?
> 
> The refcount means that once an object is gone, it's gone for
> *everyone*. It's a one-way thing, and it's thread-safe. So the code
> that does *creation* can do this:
> 
>  - get subsystem lock
>  - look up object (using the "unless_zero()" model)
>  - if you got the object, re-use it, and you're done: drop lock and return
>  - otherwise, you know that nobody else can get it either
>  - create new object and instantiate it
>  - drop lock
> 
> and this means that you create a new object IFF the old object had its
> refcount drop to zero. So you always have exactly one copy (or no copy
> at all, in between the last drop and the creation of the new one).
> 
> See? The lack of locking at drop time didn't matter.  The refcount
> itself serialized things.
> 
> So the above is what I *think* the "glue_dir" logic should be. No need
> for any other count. Just re-use the old glue dir if you can find it,
> and create a new one if you can't.
> 
> The one important thing is that the object lookup above needs to use
> the lookup needs to find the object all the way until the refcount has
> become zero. And that actually means that the object MUST NOT be
> removed from the object lists until *after* the refcount has been
> decremented to zero. Which is actually why that "automatic cleanup"
> that you hate is actually an integral and important part of the
> process: removing the object *before* the refcount went to zero is
> broken, because that means that the "look up object" phase can now
> miss an object that still has a non-zero refcount.
> 
> > So my patch 1/2 prevents us from finding the old dying object (and thus
> > from crashing) but replaces this with the duplicate name problem.
> 
> So I absolutely agree with your patch 1/2. My argument against it is
> actually that I think the "unless_zero" thing needs to be more
> universal.
> 
> > My patch 2/2 removes that second problem by ensuring we remove the
> > object from sysfs synchronously in device_del when it no longer
> > contains any children, explicitely rather than implicitely by the
> > virtue of doing the "last" kobject_put.
> 
> No. See above. The reason I think your patch 2/2 is wrong is that is
> actually *breaks* the above model, exactly because of that thing that

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-01 Thread Benjamin Herrenschmidt

On Sun, 2018-07-01 at 10:04 -0700, Linus Torvalds wrote:
> 
> So this is where we disagree:
> 
> > Thus in that scenario the "last minute" kobject_release() done by the
> > last kobject_put() will be effectively unprotected from for example the
> > gdp_mutex (in the case of the gluedirs) or whatever other locking the
> > subsystem owning the kobject is using to avoid making that "refount 0"
> > object "discoverable".
> 
> No. *Fundamentally*, there is only one thing that protects that
> object: the refcount.
> 
> And my argument is that anything that has this model (which means
> anything that has any sysfs linkage, which pretty much means any
> kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
> had an existing stable pointer and just wants to increase the refcount
> (ie "already got a reference throuigh one of my data structures that
> use the lock")
> 
> But if you have that model, that means that the "last drop" is
> actually almost totally irrelevant.  Because it doesn't matter if it's
> done inside the lock or not - you know that once it has been done,
> that object is entirely gone. It's not discoverable any more -
> regardless of locking. The discoverability is basically controlled
> entirely by the refcount.

This is almost true... the issue there is that gap of time between the
refcount going to 0 and kobject_release() actually doing the
kobject_del. Thus the object will be present in the kset for example,
and thus discoverable, while its refcount is 0.

Yes, you are absolutely right, anything that can find such a dead
object must be using kobject_get_unless_zero(). This is what patch 1/2
does and it fixes the basic crashing race.

But we are then exposed to the duplicate name issue on fast bind/unbind
or add/remove.

> So what happens then?
> 
> The locking isn't important for the last release, but it *is*
> important for new object *creation*.
> 
> Why?
> 
> The refcount means that once an object is gone, it's gone for
> *everyone*. It's a one-way thing, and it's thread-safe. So the code
> that does *creation* can do this:
> 
>  - get subsystem lock
>  - look up object (using the "unless_zero()" model)
>  - if you got the object, re-use it, and you're done: drop lock and return
>  - otherwise, you know that nobody else can get it either
>  - create new object and instantiate it
>  - drop lock
> 
> and this means that you create a new object IFF the old object had its
> refcount drop to zero. So you always have exactly one copy (or no copy
> at all, in between the last drop and the creation of the new one).

Except that create will occasionally fail due to duplicate names. So we
need to fix that one way or another (see below)

> See? The lack of locking at drop time didn't matter.  The refcount
> itself serialized things.

Oh I absolutely agree with all that when it comes to object lifetime, I
don't completely agree when it comes to visibility to the outside
world, see below.

> So the above is what I *think* the "glue_dir" logic should be. No need
> for any other count. Just re-use the old glue dir if you can find it,
> and create a new one if you can't.
> 
> The one important thing is that the object lookup above needs to use
> the lookup needs to find the object all the way until the refcount has
> become zero. And that actually means that the object MUST NOT be
> removed from the object lists until *after* the refcount has been
> decremented to zero. Which is actually why that "automatic cleanup"
> that you hate is actually an integral and important part of the
> process: removing the object *before* the refcount went to zero is
> broken, because that means that the "look up object" phase can now
> miss an object that still has a non-zero refcount.

In the case of gluedir that is perfectly fine though. It's a bit like
having an open unlinked file. You want to "unlink" the gluedir
synchronously with the last of its children going away, in order to be
able to create a new one as soon as a new children comes in. However
you don't want to "free" the underlying kobject until after all users
have dropped their reference.

What I don't like is that idea of conflating discoverability and
lifetime. We may just have to agree to disagree on that one, but unless
we can make kernfs prevent "finding" of a 0-ref object and thus
allow creating of a duplicate in that case (which isn't that simple a
problem to solve), we still have a problem.  

> > So my patch 1/2 prevents us from finding the old dying object (and thus
> > from crashing) but replaces this with the duplicate name problem.
> 
> So I absolutely agree with your patch 1/2. My argument against it is
> actually that I think the "unless_zero" thing needs to be more
> universal.

I agree with that, absolutely. kobject_get() should probably just be an
unless_zero variant always rather than a special case.

> > My patch 2/2 removes that second problem by ensuring we remove the
> > object from sysfs synchronously in device_del when

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-01 Thread Benjamin Herrenschmidt

On Sun, 2018-07-01 at 10:04 -0700, Linus Torvalds wrote:
> 
> So this is where we disagree:
> 
> > Thus in that scenario the "last minute" kobject_release() done by the
> > last kobject_put() will be effectively unprotected from for example the
> > gdp_mutex (in the case of the gluedirs) or whatever other locking the
> > subsystem owning the kobject is using to avoid making that "refount 0"
> > object "discoverable".
> 
> No. *Fundamentally*, there is only one thing that protects that
> object: the refcount.
> 
> And my argument is that anything that has this model (which means
> anything that has any sysfs linkage, which pretty much means any
> kobject) absolutely *must* use "kobject_get_unless_zero()" unless it
> had an existing stable pointer and just wants to increase the refcount
> (ie "already got a reference throuigh one of my data structures that
> use the lock")
> 
> But if you have that model, that means that the "last drop" is
> actually almost totally irrelevant.  Because it doesn't matter if it's
> done inside the lock or not - you know that once it has been done,
> that object is entirely gone. It's not discoverable any more -
> regardless of locking. The discoverability is basically controlled
> entirely by the refcount.

This is almost true... the issue there is that gap of time between the
refcount going to 0 and kobject_release() actually doing the
kobject_del. Thus the object will be present in the kset for example,
and thus discoverable, while its refcount is 0.

Yes, you are absolutely right, anything that can find such a dead
object must be using kobject_get_unless_zero(). This is what patch 1/2
does and it fixes the basic crashing race.

But we are then exposed to the duplicate name issue on fast bind/unbind
or add/remove.

> So what happens then?
> 
> The locking isn't important for the last release, but it *is*
> important for new object *creation*.
> 
> Why?
> 
> The refcount means that once an object is gone, it's gone for
> *everyone*. It's a one-way thing, and it's thread-safe. So the code
> that does *creation* can do this:
> 
>  - get subsystem lock
>  - look up object (using the "unless_zero()" model)
>  - if you got the object, re-use it, and you're done: drop lock and return
>  - otherwise, you know that nobody else can get it either
>  - create new object and instantiate it
>  - drop lock
> 
> and this means that you create a new object IFF the old object had its
> refcount drop to zero. So you always have exactly one copy (or no copy
> at all, in between the last drop and the creation of the new one).

Except that create will occasionally fail due to duplicate names. So we
need to fix that one way or another (see below)

> See? The lack of locking at drop time didn't matter.  The refcount
> itself serialized things.

Oh I absolutely agree with all that when it comes to object lifetime, I
don't completely agree when it comes to visibility to the outside
world, see below.

> So the above is what I *think* the "glue_dir" logic should be. No need
> for any other count. Just re-use the old glue dir if you can find it,
> and create a new one if you can't.
> 
> The one important thing is that the object lookup above needs to use
> the lookup needs to find the object all the way until the refcount has
> become zero. And that actually means that the object MUST NOT be
> removed from the object lists until *after* the refcount has been
> decremented to zero. Which is actually why that "automatic cleanup"
> that you hate is actually an integral and important part of the
> process: removing the object *before* the refcount went to zero is
> broken, because that means that the "look up object" phase can now
> miss an object that still has a non-zero refcount.

In the case of gluedir that is perfectly fine though. It's a bit like
having an open unlinked file. You want to "unlink" the gluedir
synchronously with the last of its children going away, in order to be
able to create a new one as soon as a new children comes in. However
you don't want to "free" the underlying kobject until after all users
have dropped their reference.

What I don't like is that idea of conflating discoverability and
lifetime. We may just have to agree to disagree on that one, but unless
we can make kernfs prevent "finding" of a 0-ref object and thus
allow creating of a duplicate in that case (which isn't that simple a
problem to solve), we still have a problem.  

> > So my patch 1/2 prevents us from finding the old dying object (and thus
> > from crashing) but replaces this with the duplicate name problem.
> 
> So I absolutely agree with your patch 1/2. My argument against it is
> actually that I think the "unless_zero" thing needs to be more
> universal.

I agree with that, absolutely. kobject_get() should probably just be an
unless_zero variant always rather than a special case.

> > My patch 2/2 removes that second problem by ensuring we remove the
> > object from sysfs synchronously in device_del when

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-01 Thread Benjamin Herrenschmidt

On Sat, 2018-06-30 at 20:57 -0700, Linus Torvalds wrote:
> On Sat, Jun 30, 2018 at 8:42 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > But what if something *else* still holds a reference to the kobject ?
> > It could be anything really... t
> 
> But that's fine. Then the object will continue to exist, and the sysfs
> file will continue to exist, and you won't get a new glue directory.

I suspect you didn't read it my entire argument or I wasn't clear
enough :-) This is actually the crux of the problem:

Yes the object continues to exist. However, the *last* kobject_put to
it will no longer be done under whatever higher level locking the
subsystem provides (whatever prevents for example concurrent add and
removes).

Thus in that scenario the "last minute" kobject_release() done by the
last kobject_put() will be effectively unprotected from for example the
gdp_mutex (in the case of the gluedirs) or whatever other locking the
subsystem owning the kobject is using to avoid making that "refount 0"
object "discoverable".

So not only the conitnued existence of the object will potentially
trigger the duplicate sysfs name problem, but it *will* also re-open
the window for the object to be temporarily be visible with a 0
refcount.

The kobject debugging doesn't create a new race here. It just
significantly increase the size of an existing race window.

You argued yourself that the reason that window is a non-issue without
kobject debugging is that the expectation is that the release happens
synchronously with the last kobject_put done by device_del, which has
appropriate higher-level locking.

My point is that this specific assumption doesn't hold in the case
where another reference exists. In that case the object will be freed
with the race open for others to observe it while its refcount is 0.

> In fact, what you describe is a problem with *your* patch, exactly
> because you introduce another counter that is *not* the reference
> count, and now you have the problem with "old directory kobject is
> still live, but I removed the sysfs part, and now I'm creating a new
> object with the same name".
> 
> Hmm?

So my patch 1/2 prevents us from finding the old dying object (and thus
from crashing) but replaces this with the duplicate name problem.

My patch 2/2 removes that second problem by ensuring we remove the
object from sysfs synchronously in device_del when it no longer
contains any children, explicitely rather than implicitely by the
virtue of doing the "last" kobject_put.

You'll notice the existing code wraps that kobject_put() with the
gdp_mutex, specifically I suppose to synronize with the creation path
of the gluedir, ie with the expectation that this kobject_put() will
synchronously remove the object from the kset.

I argue this assumption is flawed, that another reference could delay
the removal from the kset to a point where the mutex isn't held, and
thus the race re-opens.

My patch doesn't implement a separate "refcount" per-se, it implements
a "childcount". This is akin to mm_users vs mm_count.

We want the glue dir to be removed when it has no more children, we
currently "conflate" this with the kobject having no reference. This is
what I argue is incorrect. The kobject can have "unrelated" references,
that define the lifetime of the object in memory, but it's presence in
sysfs is dictated by the number of children it contains.

> > It is and there's a WARN_ON about it inside kobject_get(). I don't
> > think anybody argues against that, you are absolutely right.
> 
> No.  That the zero kobject_get() will not result in a warning. It just
> does a kref_get(), no warnings anywhere.

It's there but it's in refcount:

void refcount_inc(refcount_t *r)
{
WARN_ONCE(!refcount_inc_not_zero(r), "refcount_t: increment on 0; 
use-after-free.\n");
}
EXPORT_SYMBOL(refcount_inc);

In fact that's how I started digging into that problem in the first place :-)

> Yes, there is a kobject_get() warning, but that's about an
> _uninitialized_ kobject, not a already released one! You will get no
> warning that I can see from the "oh, you just got a stale one".

Cheers,
Ben.

>Linus

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-07-01 Thread Benjamin Herrenschmidt

On Sat, 2018-06-30 at 20:57 -0700, Linus Torvalds wrote:
> On Sat, Jun 30, 2018 at 8:42 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > But what if something *else* still holds a reference to the kobject ?
> > It could be anything really... t
> 
> But that's fine. Then the object will continue to exist, and the sysfs
> file will continue to exist, and you won't get a new glue directory.

I suspect you didn't read it my entire argument or I wasn't clear
enough :-) This is actually the crux of the problem:

Yes the object continues to exist. However, the *last* kobject_put to
it will no longer be done under whatever higher level locking the
subsystem provides (whatever prevents for example concurrent add and
removes).

Thus in that scenario the "last minute" kobject_release() done by the
last kobject_put() will be effectively unprotected from for example the
gdp_mutex (in the case of the gluedirs) or whatever other locking the
subsystem owning the kobject is using to avoid making that "refount 0"
object "discoverable".

So not only the conitnued existence of the object will potentially
trigger the duplicate sysfs name problem, but it *will* also re-open
the window for the object to be temporarily be visible with a 0
refcount.

The kobject debugging doesn't create a new race here. It just
significantly increase the size of an existing race window.

You argued yourself that the reason that window is a non-issue without
kobject debugging is that the expectation is that the release happens
synchronously with the last kobject_put done by device_del, which has
appropriate higher-level locking.

My point is that this specific assumption doesn't hold in the case
where another reference exists. In that case the object will be freed
with the race open for others to observe it while its refcount is 0.

> In fact, what you describe is a problem with *your* patch, exactly
> because you introduce another counter that is *not* the reference
> count, and now you have the problem with "old directory kobject is
> still live, but I removed the sysfs part, and now I'm creating a new
> object with the same name".
> 
> Hmm?

So my patch 1/2 prevents us from finding the old dying object (and thus
from crashing) but replaces this with the duplicate name problem.

My patch 2/2 removes that second problem by ensuring we remove the
object from sysfs synchronously in device_del when it no longer
contains any children, explicitely rather than implicitely by the
virtue of doing the "last" kobject_put.

You'll notice the existing code wraps that kobject_put() with the
gdp_mutex, specifically I suppose to synronize with the creation path
of the gluedir, ie with the expectation that this kobject_put() will
synchronously remove the object from the kset.

I argue this assumption is flawed, that another reference could delay
the removal from the kset to a point where the mutex isn't held, and
thus the race re-opens.

My patch doesn't implement a separate "refcount" per-se, it implements
a "childcount". This is akin to mm_users vs mm_count.

We want the glue dir to be removed when it has no more children, we
currently "conflate" this with the kobject having no reference. This is
what I argue is incorrect. The kobject can have "unrelated" references,
that define the lifetime of the object in memory, but it's presence in
sysfs is dictated by the number of children it contains.

> > It is and there's a WARN_ON about it inside kobject_get(). I don't
> > think anybody argues against that, you are absolutely right.
> 
> No.  That the zero kobject_get() will not result in a warning. It just
> does a kref_get(), no warnings anywhere.

It's there but it's in refcount:

void refcount_inc(refcount_t *r)
{
WARN_ONCE(!refcount_inc_not_zero(r), "refcount_t: increment on 0; 
use-after-free.\n");
}
EXPORT_SYMBOL(refcount_inc);

In fact that's how I started digging into that problem in the first place :-)

> Yes, there is a kobject_get() warning, but that's about an
> _uninitialized_ kobject, not a already released one! You will get no
> warning that I can see from the "oh, you just got a stale one".

Cheers,
Ben.

>Linus

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-06-30 Thread Benjamin Herrenschmidt

(This is a resend with lkml added for background & archival purposes)

On Sat, 2018-06-30 at 12:29 -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 7:19 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > For devices with a class, we create a "glue" directory between
> > the parent device and the new device with the class name.
> > 
> > This directory is never "explicitely" removed when empty however,
> > this is left to the implicit sysfs removal done by kobjects when
> > they are released on the last kobject_put().
> > 
> > This is problematic because as long as it's not been removed from
> > sysfs, it is still present in the class kset and in sysfs directory
> > structure.
> 
> Ok, so I don't hate the patch per se, but looking around, I think this
> is still wrong in the end.
> 
> Why?
> 
> Because normally, the last kobject_put() really does do a synchronous
> kobject_del(). So normally, this is all completely pointless, and the
> normal kobject lifetime rules should be sufficient.

Not entirely sadly

There's a small race between the kref going down to 0 and the last
kobject_put(). Something might still "catch" the 0-reference object
during that window. 

In this specific case our bacon is mostly saved by the gdp_mutex which
is taken by cleanup_glue_dir() around what *should* be the lats
kobject_put but what if it isn't ?

If anything else happens to hold a reference to the directory object
(open files in sysfs maybe ?), then the last put will come from
elsewhere and will happen without that mutex being held, thus re-
opening the tiny race.

Is this possible ?

> So I *think* your problem happens because you have
> CONFIG_DEBUG_KOBJECT_RELEASE enabled, and that intentionally delays
> the cleanup.

I think it just opens an existing race more widely. The race always
exist becaues another CPU can observe the object between the reference
going to 0 and the last kobject_del done by kobject_release. That's one
main reason why I dislike this "auto-clean" mechanism.

One other way to solve it, which I just thought about, could be to,
inside kobject_put() itself, check that the reference is *1* and do
kobject_del() before the last kref_put. That does mean that somebody
can snatch it in that window after it's been removed from sysfs though,
is that ok ? It won't crash I suppose...

> This is actually not really what DEBUG_KOBJECT_RELEASE is really
> documented to do.  It is documented as a "let's debug problems where
> drivers think deletion is immediate", but the sysfs interaction with
> the same-name issue really smells different.
> 
> So what the patch does is basically to just fight
> DEBUG_KOBJECT_RELEASE delaying rules, and that kind of stinks.

Not entirely, it fight an existing race that DEBUG_KOBJECT_RELEASE just
opens more widely.

> To me, it really feels like either we should see the
> DEBUG_KOBJECT_RELEASE rules are "real" (in which case fighting them is
> wrong), or we should admit that DEBUG_KOBJECT_RELEASE causes problems
> (in which case we should probably try to fix the debug aid).
> 
> Ben, can you confirm that your problem just goes away if you don't
> select DEBUG_KOBJECT_RELEASE?

The easily reproducable crash goes away, because the device I've been
observing these is a tiny slow single core ARM embedded thing. But I'm
not sure the therical race is solved.

> Greg - comments? The pattern of "remove last device, add a new device
> of the same class" really seems to be a valid pattern, and
> CONFIG_DEBUG_KOBJECT_RELEASE seems to actively break it.
> 
> Could we perhaps add a synthetic test for exactly this pattern (add a
> silly device with a bogus class, remove it, and add another device
> with the same class)?
> 
>  Linus

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-06-30 Thread Benjamin Herrenschmidt

(This is a resend with lkml added for background & archival purposes)

On Sat, 2018-06-30 at 12:29 -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 7:19 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > For devices with a class, we create a "glue" directory between
> > the parent device and the new device with the class name.
> > 
> > This directory is never "explicitely" removed when empty however,
> > this is left to the implicit sysfs removal done by kobjects when
> > they are released on the last kobject_put().
> > 
> > This is problematic because as long as it's not been removed from
> > sysfs, it is still present in the class kset and in sysfs directory
> > structure.
> 
> Ok, so I don't hate the patch per se, but looking around, I think this
> is still wrong in the end.
> 
> Why?
> 
> Because normally, the last kobject_put() really does do a synchronous
> kobject_del(). So normally, this is all completely pointless, and the
> normal kobject lifetime rules should be sufficient.

Not entirely sadly

There's a small race between the kref going down to 0 and the last
kobject_put(). Something might still "catch" the 0-reference object
during that window. 

In this specific case our bacon is mostly saved by the gdp_mutex which
is taken by cleanup_glue_dir() around what *should* be the lats
kobject_put but what if it isn't ?

If anything else happens to hold a reference to the directory object
(open files in sysfs maybe ?), then the last put will come from
elsewhere and will happen without that mutex being held, thus re-
opening the tiny race.

Is this possible ?

> So I *think* your problem happens because you have
> CONFIG_DEBUG_KOBJECT_RELEASE enabled, and that intentionally delays
> the cleanup.

I think it just opens an existing race more widely. The race always
exist becaues another CPU can observe the object between the reference
going to 0 and the last kobject_del done by kobject_release. That's one
main reason why I dislike this "auto-clean" mechanism.

One other way to solve it, which I just thought about, could be to,
inside kobject_put() itself, check that the reference is *1* and do
kobject_del() before the last kref_put. That does mean that somebody
can snatch it in that window after it's been removed from sysfs though,
is that ok ? It won't crash I suppose...

> This is actually not really what DEBUG_KOBJECT_RELEASE is really
> documented to do.  It is documented as a "let's debug problems where
> drivers think deletion is immediate", but the sysfs interaction with
> the same-name issue really smells different.
> 
> So what the patch does is basically to just fight
> DEBUG_KOBJECT_RELEASE delaying rules, and that kind of stinks.

Not entirely, it fight an existing race that DEBUG_KOBJECT_RELEASE just
opens more widely.

> To me, it really feels like either we should see the
> DEBUG_KOBJECT_RELEASE rules are "real" (in which case fighting them is
> wrong), or we should admit that DEBUG_KOBJECT_RELEASE causes problems
> (in which case we should probably try to fix the debug aid).
> 
> Ben, can you confirm that your problem just goes away if you don't
> select DEBUG_KOBJECT_RELEASE?

The easily reproducable crash goes away, because the device I've been
observing these is a tiny slow single core ARM embedded thing. But I'm
not sure the therical race is solved.

> Greg - comments? The pattern of "remove last device, add a new device
> of the same class" really seems to be a valid pattern, and
> CONFIG_DEBUG_KOBJECT_RELEASE seems to actively break it.
> 
> Could we perhaps add a synthetic test for exactly this pattern (add a
> silly device with a bogus class, remove it, and add another device
> with the same class)?
> 
>  Linus

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-06-30 Thread Benjamin Herrenschmidt

On Sat, 2018-06-30 at 19:18 -0700, Linus Torvalds wrote:
> On Sat, Jun 30, 2018 at 7:07 PM Linus Torvalds
>  wrote:
> > 
> > Those locks won't protect kobj races in _general_ (ie there is no
> > locking between two totally unrelated buses), but they *should*
> > serialize the case of a device being added within one class. No?
> 
> Side note: there had *better* be some locking whenever there is a way
> to find an object, because otherwise you have a fundamental lifetime
> problem: one thread finding the object at the same time another thread
> frees it for the last time. Even the "unless_zero()" won't fix it,
> because the final free will release the underlying object itself, so
> the "zero" state is ephemeral.

Right. We agree. The problem here relates to that built-in cleanup in
kobject_release() which violates that rule: it removes the object from
sysfs (and from any kset) after the reference has been dropped.

Users such as the gluedir code try to workaround this by using outer
mutexes or locks, but it's fundamentally racy if there's any chance
that another 3rd party holds the last reference, since in that case,
the cleanup happens outside of any added locking.

That's why I tend to think that anything that relies on that "late
removal from sysfs" is broken by design :) Anything that uses kobject
should ensure they are taken out of sysfs and their respective ksets
before the last reference drops.

Now it's probably all fine for most cases, I assume sysfs itself checks
the reference to be non-0 in lookup path (I haven't checked), at least
to prevent use-after-free. This is still open to the duplicate name
issue. The bigger issue of use-after-free only happens in the rare
cases where there's a separate list being maintained and cleaned up
late.

The gluedir is such an example because it "manually" thrawls the kset
list.

So you are right, kobject_get should probably always be _unless_zero,
but my point is we even with that, we probably still need to ensure the
gluedir is taken out of the kset before we drop the last reference, so
this can be done with the appropriate upper layer subsystem locking,
and thus without racing with a new device being bound.

> That locking might be just RCU during lookup, and rcu-delaying the
> release, of course.  I think that's all the sysfs code needs, for
> example, since that's what lookup uses.
> 
> And for any other embedded kobj cases, where you can reach the object
> using some random subsystem pointers, there had better be other
> locking in place for that pointer lookup vs the last removal.
> 
> kobject itself doesn't provide that locking, it only provides the
> reference counting. But that's partly why it really has to disallow
> any kobject_get() of a zero object, because it means that the
> tear-down has been started, but the tear-down itself may not have had
> time to get the lock yet (ie kobject_release() may be just about to
> call the t->release() function).
> 
> But maybe I'm missing something subtle.
> 
>Linus

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-06-30 Thread Benjamin Herrenschmidt

On Sat, 2018-06-30 at 19:18 -0700, Linus Torvalds wrote:
> On Sat, Jun 30, 2018 at 7:07 PM Linus Torvalds
>  wrote:
> > 
> > Those locks won't protect kobj races in _general_ (ie there is no
> > locking between two totally unrelated buses), but they *should*
> > serialize the case of a device being added within one class. No?
> 
> Side note: there had *better* be some locking whenever there is a way
> to find an object, because otherwise you have a fundamental lifetime
> problem: one thread finding the object at the same time another thread
> frees it for the last time. Even the "unless_zero()" won't fix it,
> because the final free will release the underlying object itself, so
> the "zero" state is ephemeral.

Right. We agree. The problem here relates to that built-in cleanup in
kobject_release() which violates that rule: it removes the object from
sysfs (and from any kset) after the reference has been dropped.

Users such as the gluedir code try to workaround this by using outer
mutexes or locks, but it's fundamentally racy if there's any chance
that another 3rd party holds the last reference, since in that case,
the cleanup happens outside of any added locking.

That's why I tend to think that anything that relies on that "late
removal from sysfs" is broken by design :) Anything that uses kobject
should ensure they are taken out of sysfs and their respective ksets
before the last reference drops.

Now it's probably all fine for most cases, I assume sysfs itself checks
the reference to be non-0 in lookup path (I haven't checked), at least
to prevent use-after-free. This is still open to the duplicate name
issue. The bigger issue of use-after-free only happens in the rare
cases where there's a separate list being maintained and cleaned up
late.

The gluedir is such an example because it "manually" thrawls the kset
list.

So you are right, kobject_get should probably always be _unless_zero,
but my point is we even with that, we probably still need to ensure the
gluedir is taken out of the kset before we drop the last reference, so
this can be done with the appropriate upper layer subsystem locking,
and thus without racing with a new device being bound.

> That locking might be just RCU during lookup, and rcu-delaying the
> release, of course.  I think that's all the sysfs code needs, for
> example, since that's what lookup uses.
> 
> And for any other embedded kobj cases, where you can reach the object
> using some random subsystem pointers, there had better be other
> locking in place for that pointer lookup vs the last removal.
> 
> kobject itself doesn't provide that locking, it only provides the
> reference counting. But that's partly why it really has to disallow
> any kobject_get() of a zero object, because it means that the
> tear-down has been started, but the tear-down itself may not have had
> time to get the lock yet (ie kobject_release() may be just about to
> call the t->release() function).
> 
> But maybe I'm missing something subtle.
> 
>Linus

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-06-30 Thread Benjamin Herrenschmidt

On Sat, 2018-06-30 at 19:07 -0700, Linus Torvalds wrote:
> [ Ben - feel free to post the missing emails to lkml too ]
> 
> On Sat, Jun 30, 2018 at 6:56 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > On Sat, 2018-06-30 at 12:29 -0700, Linus Torvalds wrote:
> > > 
> > > Because normally, the last kobject_put() really does do a synchronous
> > > kobject_del(). So normally, this is all completely pointless, and the
> > > normal kobject lifetime rules should be sufficient.
> > 
> > Not entirely sadly
> > 
> > There's a small race between the kref going down to 0 and the last
> > kobject_put(). Something might still "catch" the 0-reference object
> > during that window.
> 
> That's only true if the underlying subsystem doesn't have any
> serialization itself.
> 
> Which I don't think is normal.
> 
> IOW, if you have (say) a PCI hotplug model, the PCI layer will have
> the pci_hp_mutex, for example, which protects the PCI hotplug slot
> lists and the kobj things.
> 
> Those locks won't protect kobj races in _general_ (ie there is no
> locking between two totally unrelated buses), but they *should*
> serialize the case of a device being added within one class. No?
> 
> If not, maybe we need to add some class-based serialization.

Well, yes and no ... yes I agree there should be upper level
serialization such that you can't hit the "add" path until the "del"
path has completed.

And this will work as long as the kobject_put done at the end of "del"
is indeed the very last one.

But what if something *else* still holds a reference to the kobject ?
It could be anything really... the driver itself, that is now unbound,
might have a client that's holding a stale reference because some
chardev is still open, for example. I don't know if sysfs can also hold
one because a users still has an open directory there, but in the
general case I don't thing one can assume that the kobject_put() done
by the end of the "owner" subsystem delete path (in our example
device_del) is the very last kobject_put.

And since some other random thing might be doing that last put, it may
also do it without any of the upper level locking/serialization.

Which is the reason I so dislike the whole "cleanup sysfs in the last
kobject_put" we do in kobject_release(). This is also why we clearly
differenciate in the device model device_del from the final device_put
etc...

We generally separeate "visibility to the outside world" from the
actual object lifetime, the latter being what the kref controls.

 .. except when we dont, such as this gluedir case. And I think that's
wrong in principle.

This is the reason why I prefer the approach of explicitely doing the
kobject_del() instead of kobject_put() in the tail of device_del (what
this 2/2 patch does), because it restores that separation between the
object being "active/visible" or not from the actual lifetime of the
structure.

> > I think it just opens an existing race more widely. The race always
> > exist becaues another CPU can observe the object between the reference
> > going to 0 and the last kobject_del done by kobject_release.
> 
> No it really damn well can't, exactly because we should not allow it -
> and allowing it would be fundamentally racy.
> 
> We should never allow a kobject_get() with a zero reference count.
> It's always a *fundamental* bug - see my other email on your 1/2
> patch.

It is and there's a WARN_ON about it inside kobject_get(). I don't
think anybody argues against that, you are absolutely right.

That said, the fact that it happens is I think the result of a deeper
problem of conflating the visibility of the object and its refcount,
and the fact that we do the sysfs "cleanup" (visibility) after we've
dropped the last reference.

Yes, the race is generally not going to happen if the put() done by the
owning subystem is indeed the last one, because as you said, the
subsystem should have higher level locking to precent racing between
removal and addition.

However, it *can* happen because by definition kobjects can have
references held elsewhere for any reason, and those can be dropped
outside of any subsystem locking.

> So there is no race, because the reference count itself protects the
> object. Either another CPU gets it in time (before it went down to
> zero), or it went down to zero and it is dead (because at that point,
> deletion will have been scheduled.
> 
> Note that this is entirely independent of the auto-clean actions,
> since even with absolutely zero cleanup, you still have the
> fundamental issue of "reference count goes to zero causes the object
> to be free'd".

Yes.

Cheers,
Ben.

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-06-30 Thread Benjamin Herrenschmidt

On Sat, 2018-06-30 at 19:07 -0700, Linus Torvalds wrote:
> [ Ben - feel free to post the missing emails to lkml too ]
> 
> On Sat, Jun 30, 2018 at 6:56 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > On Sat, 2018-06-30 at 12:29 -0700, Linus Torvalds wrote:
> > > 
> > > Because normally, the last kobject_put() really does do a synchronous
> > > kobject_del(). So normally, this is all completely pointless, and the
> > > normal kobject lifetime rules should be sufficient.
> > 
> > Not entirely sadly
> > 
> > There's a small race between the kref going down to 0 and the last
> > kobject_put(). Something might still "catch" the 0-reference object
> > during that window.
> 
> That's only true if the underlying subsystem doesn't have any
> serialization itself.
> 
> Which I don't think is normal.
> 
> IOW, if you have (say) a PCI hotplug model, the PCI layer will have
> the pci_hp_mutex, for example, which protects the PCI hotplug slot
> lists and the kobj things.
> 
> Those locks won't protect kobj races in _general_ (ie there is no
> locking between two totally unrelated buses), but they *should*
> serialize the case of a device being added within one class. No?
> 
> If not, maybe we need to add some class-based serialization.

Well, yes and no ... yes I agree there should be upper level
serialization such that you can't hit the "add" path until the "del"
path has completed.

And this will work as long as the kobject_put done at the end of "del"
is indeed the very last one.

But what if something *else* still holds a reference to the kobject ?
It could be anything really... the driver itself, that is now unbound,
might have a client that's holding a stale reference because some
chardev is still open, for example. I don't know if sysfs can also hold
one because a users still has an open directory there, but in the
general case I don't thing one can assume that the kobject_put() done
by the end of the "owner" subsystem delete path (in our example
device_del) is the very last kobject_put.

And since some other random thing might be doing that last put, it may
also do it without any of the upper level locking/serialization.

Which is the reason I so dislike the whole "cleanup sysfs in the last
kobject_put" we do in kobject_release(). This is also why we clearly
differenciate in the device model device_del from the final device_put
etc...

We generally separeate "visibility to the outside world" from the
actual object lifetime, the latter being what the kref controls.

 .. except when we dont, such as this gluedir case. And I think that's
wrong in principle.

This is the reason why I prefer the approach of explicitely doing the
kobject_del() instead of kobject_put() in the tail of device_del (what
this 2/2 patch does), because it restores that separation between the
object being "active/visible" or not from the actual lifetime of the
structure.

> > I think it just opens an existing race more widely. The race always
> > exist becaues another CPU can observe the object between the reference
> > going to 0 and the last kobject_del done by kobject_release.
> 
> No it really damn well can't, exactly because we should not allow it -
> and allowing it would be fundamentally racy.
> 
> We should never allow a kobject_get() with a zero reference count.
> It's always a *fundamental* bug - see my other email on your 1/2
> patch.

It is and there's a WARN_ON about it inside kobject_get(). I don't
think anybody argues against that, you are absolutely right.

That said, the fact that it happens is I think the result of a deeper
problem of conflating the visibility of the object and its refcount,
and the fact that we do the sysfs "cleanup" (visibility) after we've
dropped the last reference.

Yes, the race is generally not going to happen if the put() done by the
owning subystem is indeed the last one, because as you said, the
subsystem should have higher level locking to precent racing between
removal and addition.

However, it *can* happen because by definition kobjects can have
references held elsewhere for any reason, and those can be dropped
outside of any subsystem locking.

> So there is no race, because the reference count itself protects the
> object. Either another CPU gets it in time (before it went down to
> zero), or it went down to zero and it is dead (because at that point,
> deletion will have been scheduled.
> 
> Note that this is entirely independent of the auto-clean actions,
> since even with absolutely zero cleanup, you still have the
> fundamental issue of "reference count goes to zero causes the object
> to be free'd".

Yes.

Cheers,
Ben.

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-06-29 Thread Benjamin Herrenschmidt

On Sat, 2018-06-30 at 11:04 +1000, Benjamin Herrenschmidt wrote:
> I had a look (see my other email). It's non-trivial. We can still look
> into it, but from what I gathered of how sysfs works, it's based on
> kernfs which doesn't have the kobjects nor access to the reference
> count, and is the holder of the names rbtree.
> 
> So we'd need to find a way to convey that "in-use" information down to
> kernfs (I thought maybe adding an optional pointer to a kref ? but
> then, it's still somewhat racy ...)
> 
> Additionally, we would need a few more pairs of eyes to make sure that
> sticking duplicates in that rbtree isn't going to break some corner
> case in there.

Just to clarify, I will look into it, but it will take more time so I'm
keen on having the low hanging fixes in now.

Also in general, I dislike the idea of leaving things in sysfs with a
refcount of 0. That "late" cleanup in kobject_release() looks to me
more like a band-aid than a feature.

I'd be almost tempted to stick a WARN_ON in there and see who gets hit
but I worry it's going send me down one of those rabbit holes and my
time is limited these days ;-)

Cheers,
Ben.

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-06-29 Thread Benjamin Herrenschmidt

On Sat, 2018-06-30 at 11:04 +1000, Benjamin Herrenschmidt wrote:
> I had a look (see my other email). It's non-trivial. We can still look
> into it, but from what I gathered of how sysfs works, it's based on
> kernfs which doesn't have the kobjects nor access to the reference
> count, and is the holder of the names rbtree.
> 
> So we'd need to find a way to convey that "in-use" information down to
> kernfs (I thought maybe adding an optional pointer to a kref ? but
> then, it's still somewhat racy ...)
> 
> Additionally, we would need a few more pairs of eyes to make sure that
> sticking duplicates in that rbtree isn't going to break some corner
> case in there.

Just to clarify, I will look into it, but it will take more time so I'm
keen on having the low hanging fixes in now.

Also in general, I dislike the idea of leaving things in sysfs with a
refcount of 0. That "late" cleanup in kobject_release() looks to me
more like a band-aid than a feature.

I'd be almost tempted to stick a WARN_ON in there and see who gets hit
but I worry it's going send me down one of those rabbit holes and my
time is limited these days ;-)

Cheers,
Ben.

Re: [RFC PATCH 01/14] devres: Add devm_of_iomap()

2018-06-29 Thread Benjamin Herrenschmidt

On Fri, 2018-06-29 at 23:27 +0300, Andy Shevchenko wrote:
> On Fri, Jun 29, 2018 at 12:14 PM, Linus Walleij
>  wrote:
> 
> > I wonder if it is easy to find these cases and replace them with
> > this neat function...
> 
> Would be reasonable easy by using coccinelle.

For the obvious ones yes. A lot of the existing users of of_iomap
however don't do the request_region, and while they probably should and
should use the new accessor, this can't be done blindly without
testing, because there are many old things around that have broken
memory region tracking and that will fail..

I plan to do a sweep through some of my old powermac/powerpc stuff one
of these days and do some conversions.

Cheers,
Ben.

Re: [RFC PATCH 01/14] devres: Add devm_of_iomap()

2018-06-29 Thread Benjamin Herrenschmidt

On Fri, 2018-06-29 at 23:27 +0300, Andy Shevchenko wrote:
> On Fri, Jun 29, 2018 at 12:14 PM, Linus Walleij
>  wrote:
> 
> > I wonder if it is easy to find these cases and replace them with
> > this neat function...
> 
> Would be reasonable easy by using coccinelle.

For the obvious ones yes. A lot of the existing users of of_iomap
however don't do the request_region, and while they probably should and
should use the new accessor, this can't be done blindly without
testing, because there are many old things around that have broken
memory region tracking and that will fail..

I plan to do a sweep through some of my old powermac/powerpc stuff one
of these days and do some conversions.

Cheers,
Ben.

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-06-29 Thread Benjamin Herrenschmidt

On Fri, 2018-06-29 at 06:56 -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 7:22 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > This fixes it by instead doing an explicit kobject_del() when
> > the glue dir is empty, by keeping track of the number of
> > child devices of the gluedir.
> 
> Ugh. I was hoping that you'd just do the "only check duplicate names
> if actually in use" instead.

I had a look (see my other email). It's non-trivial. We can still look
into it, but from what I gathered of how sysfs works, it's based on
kernfs which doesn't have the kobjects nor access to the reference
count, and is the holder of the names rbtree.

So we'd need to find a way to convey that "in-use" information down to
kernfs (I thought maybe adding an optional pointer to a kref ? but
then, it's still somewhat racy ...)

Additionally, we would need a few more pairs of eyes to make sure that
sticking duplicates in that rbtree isn't going to break some corner
case in there.

It's a code base I'm not at all familiar with. So while I agree with
the overall idea that a kobject whose reference has gone down to 0
should probably be ignored by sysfs, actually doing it doesn't seem
simple.

> This patch seems to make patch 1/2 pointless. No?

Somewhat. It's stil the "right thing" to do in case a hole shows up
elsewhere, and it's an easier stable backport. But yes. I wrote it
before I figured out what to do with 2/2 (I was still trying to do what
you wanted and just skip ref=0 in sysfs name lookups).

But yes, with 2/2, there shouldn't be a kobject with a 0 ref in the
list anymore... hopefully.

I would very much appreciate the opinion of someone more familiar with
sysfs and the device core on all this though.

Cheers,
Ben.

Re: [PATCH 2/2] drivers: core: Remove glue dirs from sysfs earlier

2018-06-29 Thread Benjamin Herrenschmidt

On Fri, 2018-06-29 at 06:56 -0700, Linus Torvalds wrote:
> On Thu, Jun 28, 2018 at 7:22 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > This fixes it by instead doing an explicit kobject_del() when
> > the glue dir is empty, by keeping track of the number of
> > child devices of the gluedir.
> 
> Ugh. I was hoping that you'd just do the "only check duplicate names
> if actually in use" instead.

I had a look (see my other email). It's non-trivial. We can still look
into it, but from what I gathered of how sysfs works, it's based on
kernfs which doesn't have the kobjects nor access to the reference
count, and is the holder of the names rbtree.

So we'd need to find a way to convey that "in-use" information down to
kernfs (I thought maybe adding an optional pointer to a kref ? but
then, it's still somewhat racy ...)

Additionally, we would need a few more pairs of eyes to make sure that
sticking duplicates in that rbtree isn't going to break some corner
case in there.

It's a code base I'm not at all familiar with. So while I agree with
the overall idea that a kobject whose reference has gone down to 0
should probably be ignored by sysfs, actually doing it doesn't seem
simple.

> This patch seems to make patch 1/2 pointless. No?

Somewhat. It's stil the "right thing" to do in case a hole shows up
elsewhere, and it's an easier stable backport. But yes. I wrote it
before I figured out what to do with 2/2 (I was still trying to do what
you wanted and just skip ref=0 in sysfs name lookups).

But yes, with 2/2, there shouldn't be a kobject with a 0 ref in the
list anymore... hopefully.

I would very much appreciate the opinion of someone more familiar with
sysfs and the device core on all this though.

Cheers,
Ben.

< 1 2 3 4 5 6 7 8 9 10 >

301 - 400 of 5350 matches

Mail list logo