Re: [PATCH 03/17] base: fix order of OF initialization

2017-06-07 Thread Benjamin Herrenschmidt
On Wed, 2017-06-07 at 11:39 -0700, Wesley Terpstra wrote:
> It was a while ago that I debugged this. I already reported this bug
> to Benjamin Herrenschmidt (now in CC), and I believe he has a patch of
> his own to fix the same issue.
> 
> As I understand it, of_core_init sets up the OF entries in
> /sys/firmware/devicetree. During platform bringup, when the system
> describes the cpu + cache hierarchy, it also makes an of_node symlink
> into that directory. However, if it doesn't exist yet, you get the
> warning.

Ugh... yes I did a patch for that and I think it fell through the
cracks, I can't even find it anymore...

The patch quoted here is fine I think. Everything in the device model
can potentially use OF bits these days, it makes sense to have them
initialized earlier.

Cheers,
Ben.

> # ls -l /sys/devices/system/cpu/cpu3/of_node
> lrwxrwxrwx 1 root root 0 Jan  1 00:00
> /sys/devices/system/cpu/cpu3/of_node ->
> ../../../../firmware/devicetree/base/cpus/cpu@3
> 
> On Wed, Jun 7, 2017 at 2:35 AM, Mark Rutland <mark.rutl...@arm.com> wrote:
> > On Wed, Jun 07, 2017 at 09:07:20AM +0200, Geert Uytterhoeven wrote:
> > > CC devicetree folks
> > > 
> > > On Wed, Jun 7, 2017 at 12:59 AM, Palmer Dabbelt <pal...@dabbelt.com> 
> > > wrote:
> > > > From: "Wesley W. Terpstra" <wes...@sifive.com>
> > > > 
> > > > This fixes: [0.01] cpu cpu0: Error -2 creating of_node link
> > > > ... which you get for every CPU on all architectures with a OF cpu/ 
> > > > node.
> > 
> > I take it this means a /cpus node? Or the /cpus/cpu@* nodes?
> > 
> > I'm not seeing this on arm64 when booting v4.12-rc4 with DT, so clearly
> > this doesn't affect all such architectures.
> > 
> > What path are these errors happening in?
> > 
> > Thanks,
> > Mark.
> > 
> > > > 
> > > > This affects riscv, nios, etc.
> > > > 
> > > > Signed-off-by: Palmer Dabbelt <pal...@dabbelt.com>
> > > > ---
> > > >  drivers/base/init.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/drivers/base/init.c b/drivers/base/init.c
> > > > index 48c0e220acc0..0dcd17e561d0 100644
> > > > --- a/drivers/base/init.c
> > > > +++ b/drivers/base/init.c
> > > > @@ -31,9 +31,9 @@ void __init driver_init(void)
> > > > /* These are also core pieces, but must come after the
> > > >  * core core pieces.
> > > >  */
> > > > +   of_core_init();
> > > > platform_bus_init();
> > > > cpu_dev_init();
> > > > memory_dev_init();
> > > > container_dev_init();
> > > > -   of_core_init();
> > > >  }
> > > > --
> > > > 2.13.0
> > > 


Re: [PATCH 10/16] powerpc: vio: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Benjamin Herrenschmidt
On Wed, 2017-06-07 at 07:45 +0200, Greg Kroah-Hartman wrote:
> On Wed, Jun 07, 2017 at 09:04:41AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2017-06-06 at 21:30 +0200, Greg Kroah-Hartman wrote:
> > > >   
> > > >   static struct device_attribute vio_dev_attrs[] = {
> > > >    __ATTR_RO(name),
> > > > @@ -1573,6 +1576,13 @@ static struct device_attribute vio_dev_attrs[] = 
> > > > {
> > > >    __ATTR_RO(modalias),
> > > >    __ATTR_NULL
> > > >   };
> > > > +static struct attribute *vio_dev_attrs[] = {
> > > 
> > > Hm, this feels wrong, odd that 0-day passed it.  I should be deleting
> > > the above vio_dev_attrs field as well.  Is powerpc really a dead
> > > platform?  :)
> > 
> > Haha, not yet no, and the above is actually still quite actively
> > in use as it's part of our hypervisor virtual IO infrastructure.
> 
> Ok, let me fix this; right after I emailed, 0-day sent me the build error :)

Thanks !

Cheers,
Ben.



Re: [PATCH 10/16] powerpc: vio: use dev_groups and not dev_attrs for bus_type

2017-06-06 Thread Benjamin Herrenschmidt
On Tue, 2017-06-06 at 21:30 +0200, Greg Kroah-Hartman wrote:
> >   
> >   static struct device_attribute vio_dev_attrs[] = {
> >    __ATTR_RO(name),
> > @@ -1573,6 +1576,13 @@ static struct device_attribute vio_dev_attrs[] = {
> >    __ATTR_RO(modalias),
> >    __ATTR_NULL
> >   };
> > +static struct attribute *vio_dev_attrs[] = {
> 
> Hm, this feels wrong, odd that 0-day passed it.  I should be deleting
> the above vio_dev_attrs field as well.  Is powerpc really a dead
> platform?  :)

Haha, not yet no, and the above is actually still quite actively
in use as it's part of our hypervisor virtual IO infrastructure.

Cheers,
Ben.



Re: [v2] powerpc: handle simultaneous interrupts at once

2017-06-05 Thread Benjamin Herrenschmidt
On Mon, 2017-06-05 at 20:21 +1000, Michael Ellerman wrote:
> On Thu, 2017-03-16 at 08:55:45 UTC, Christophe Leroy wrote:
> > It often happens to have simultaneous interrupts, for instance
> > when having double Ethernet attachment. With the current
> > implementation, we suffer the cost of kernel entry/exit for each
> > interrupt.
> > 
> > This patch introduces a loop in __do_irq() to handle all interrupts
> > at once before returning.
> > 
> > Signed-off-by: Christophe Leroy 
> 
> Applied to powerpc next, thanks.
> 
> https://git.kernel.org/powerpc/c/45cb08f4791ce6a15c54598b4cb73d

Hrm, I hadn't noticed that patch...

We used to do that and then removed the code for it. There's a cost,
sometimes noticeable, to an extra call to ppc_md.get_irq.

Why not have your get_irq (or eoi) implementation set a per-cpu flag
requesting a new spin of the loop ?

We could move the xive force replay stuff to use the same thing.
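
A rough sketch of that approach (illustrative only: the per-cpu flag and
the simplified __do_irq() below are not the actual powerpc code):

/* Set by the PIC's get_irq()/eoi() path when it notices another
 * interrupt already pending on this CPU. */
static DEFINE_PER_CPU(bool, irq_replay_requested);

void request_irq_replay(void)
{
        __this_cpu_write(irq_replay_requested, true);
}

void __do_irq(struct pt_regs *regs)
{
        unsigned int irq;

        do {
                __this_cpu_write(irq_replay_requested, false);
                irq = ppc_md.get_irq();
                if (irq)
                        generic_handle_irq(irq);
        } while (__this_cpu_read(irq_replay_requested));
}

That way the loop only re-spins when the interrupt controller said there
is more work, instead of paying for an unconditional extra
ppc_md.get_irq() call on every interrupt.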

Ben.



Re: [PATCH 2/5] powerpc/mm: split store_updates_sp() in two parts in do_page_fault()

2017-06-02 Thread Benjamin Herrenschmidt
On Fri, 2017-06-02 at 11:39 +0200, Christophe LEROY wrote:
> The difference between get_user() and __get_user() is that get_user() 
> performs an access_ok() in addition.
> 
> Doesn't access_ok() only checks whether addr is below TASK_SIZE to 
> ensure it is a valid user address ?

Do you have a measurable improvement by skipping that check ? I agree
with your reasoning but I'm also paranoid and so I wouldn't change it
unless it's really worth it.

Cheers,
Ben.



Re: [PATCH v9 5/5] i2c: aspeed: added slave support for Aspeed I2C driver

2017-06-02 Thread Benjamin Herrenschmidt
On Fri, 2017-06-02 at 01:46 -0700, Brendan Higgins wrote:
> Added slave support for Aspeed I2C controller. Supports fourteen busses
> present in AST24XX and AST25XX BMC SoCs by Aspeed.

(Not an issue for merging)

Have you looked at a "mode" by which you implement just enough of the
slave support to decode smbus notifications ?

This looks like it could be useful when using pmbus devices, as some of
them use smbus notifications fairly often.

Today only one driver calls i2c_handle_smbus_host_notify() (the Intel
one of course).
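
As a rough illustration of how little bus-driver glue that needs (a
sketch only: it assumes the adapter advertises I2C_FUNC_SMBUS_HOST_NOTIFY
so the core has set up its Host Notify IRQ domain, and the helper name is
made up; only i2c_handle_smbus_host_notify() is the real core API):

/* Called from the interrupt path once it has seen a master write
 * addressed to the SMBus Host address (0x08); the first data byte is
 * the notifying device's own address in shifted 8-bit form. */
static void forward_host_notify(struct i2c_adapter *adap, u8 first_byte)
{
        i2c_handle_smbus_host_notify(adap, first_byte >> 1);
}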

Cheers,
Ben.



Re: [PATCH v8 4/5] i2c: aspeed: added driver for Aspeed I2C

2017-06-02 Thread Benjamin Herrenschmidt
On Fri, 2017-06-02 at 02:29 -0700, Brendan Higgins wrote:
> > That way you avoid the above lock which is racy, slave activity could
> > get in there and trigger an error interrupt clobbering cmd_err for
> > example no ? Or am I missing something...
> 
> The slave handler does not touch these fields, so that should be fine.
> The only way I can think
> that we could get in this state is if we had some sort of error
> interrupt fire after we handled a
> STOP or previous error; this should not happen, but I think it is
> better to be safe than sorry.
> Nevertheless, I do not think it is necessary to do more than what I
> have already done because
> it would mean the bus is in a pretty bad state anyway. Maybe I should
> just drop the locks here.
> 
> If you disagree, could you elaborate on what you meant by putting
> cmd_err in a separate field?
> Did you just mean having one for xfer and one for everything else (like 
> resets)?

None of that is a big deal and definitely not a blocker for merging,
you can always "improve" things with a latter patch.

I was thinking that the act of "completing" a request could cleaner if
entirely done from the interrupt code, thus clearing bus->msgs and
setting a "final" status code to be returned to the caller.

Now that could be "cmd_err"; it's just that today it is written to
under various circumstances that may or may not be related to a command
being in progress, and so I worry that with the lock dropping we
might "lose" the value that actually pertains to the command itself (and
possibly caused it to fail).

This is all quite academic however, as you said, if that happens we are
already probably in a pretty bad shape.

I suspect the only errors that would happen in "normal" circumstances
would be loss of arbitration and nacks, which are I think, already
properly handled, at least from my reading of the code.

Cheers,
Ben.

> > > + /* If nothing went wrong, return number of messages transferred. */
> > > + if (ret >= 0)
> > > + return bus->msgs_index + 1;
> > > + else
> > > + return ret;
> > > +}
> > > +
> 


Re: [PATCH v9 4/5] i2c: aspeed: added driver for Aspeed I2C

2017-06-02 Thread Benjamin Herrenschmidt
On Fri, 2017-06-02 at 01:46 -0700, Brendan Higgins wrote:
> Added initial master support for Aspeed I2C controller. Supports
> fourteen busses present in AST24XX and AST25XX BMC SoCs by Aspeed.
> 
> Signed-off-by: Brendan Higgins <brendanhigg...@google.com>

Reviewed-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---

Just spotted a completely minor/trivial nit:

> +static int aspeed_i2c_master_xfer(struct i2c_adapter *adap,
> +   struct i2c_msg *msgs, int num)
> +{
> + struct aspeed_i2c_bus *bus = adap->algo_data;
> + unsigned long time_left, flags;
> + int ret = 0;
> +
> + spin_lock_irqsave(&bus->lock, flags);
> + bus->cmd_err = 0;
^

This is probably unnecessary now since it's initialized further down.

> + /* If bus is busy, attempt recovery. We assume a single master
> +  * environment.
> +  */
> + if (readl(bus->base + ASPEED_I2C_CMD_REG) & ASPEED_I2CD_BUS_BUSY_STS) {
> + spin_unlock_irqrestore(&bus->lock, flags);
> + ret = aspeed_i2c_recover_bus(bus);
> + if (ret)
> + return ret;
> + spin_lock_irqsave(&bus->lock, flags);
> + }
> +
> + bus->cmd_err = 0;
> + bus->msgs = msgs;
> + bus->msgs_index = 0;
> + bus->msgs_count = num;

Now I'd like Andrew to give it a spin as he's currently
testing/debugging some i2c related stuff to be 100% certain it works
fine on our systems.

Thanks !

Cheers,
Ben.



Re: [PATCH v8 4/5] i2c: aspeed: added driver for Aspeed I2C

2017-05-30 Thread Benjamin Herrenschmidt
On Wed, 2017-05-24 at 23:49 -0700, Brendan Higgins wrote:
> Added initial master support for Aspeed I2C controller. Supports
> fourteen busses present in AST24XX and AST25XX BMC SoCs by Aspeed.

Hi Brendan !

It's getting there :-) I have a few comments mostly about some corner
cases below.

Cheers,
Ben.

> Signed-off-by: Brendan Higgins 
> ---
> Changes for v2:
>   - Added single module_init (multiple was breaking some builds).
> Changes for v3:
>   - Removed "bus" device tree param; now extracted from bus address offset
> Changes for v4:
>   - I2C adapter number is now generated dynamically unless specified in alias.
> Changes for v5:
>   - Removed irq_chip used to multiplex IRQ and replaced it with dummy_irq_chip
> along with some other IRQ cleanup.
>   - Addressed comments from Cedric, and Vladimir, mostly stylistic things and
> using devm managed resources.
>   - Increased max clock frequency before the bus is put in HighSpeed mode, as
> per Kachalov's comment.
> Changes for v6:
>   - No longer arbitrarily restrict bus to be slave xor master.
>   - Pulled out "struct aspeed_i2c_controller" as a interrupt controller.
>   - Pulled out slave support into its own commit.
>   - Rewrote code that sets clock divider register because the original version
> set it incorrectly.
>   - Rewrote the aspeed_i2c_master_irq handler because the old method of
> completing a completion in between restarts was too slow causing devices 
> to
> misbehave.
>   - Added support for I2C_M_RECV_LEN which I had incorrectly said was 
> supported
> before.
>   - Addressed other comments from Vladimir.
> Changes for v7:
>   - Changed clock-frequency to bus-frequency
>   - Made some fixes to clock divider code
>   - Added hardware reset function
>   - Marked functions that need to be called with the lock held as "unlocked"
>   - Did a bunch of clean up
> Changes for v8:
>   - ACK IRQ status bits before doing anything else
>   - Added multi-master device tree property
>   - Do not send STOP commands after interrupt errors
>   - Fix SMBUS_QUICK emulation handling
>   - Removed highspeed clock code (I will do it in a later patch set)
>   - Use the platform_device name for the adapter name
>   - Reset for all failed recoveries
>   - Removed the "__" prefix from all of the non-thread safe functions
> ---
>  drivers/i2c/busses/Kconfig  |  10 +
>  drivers/i2c/busses/Makefile |   1 +
>  drivers/i2c/busses/i2c-aspeed.c | 695 
> 
>  3 files changed, 706 insertions(+)
>  create mode 100644 drivers/i2c/busses/i2c-aspeed.c
> 
> diff --git a/drivers/i2c/busses/Kconfig b/drivers/i2c/busses/Kconfig
> index 144cbadc7c72..280f84a0d7d1 100644
> --- a/drivers/i2c/busses/Kconfig
> +++ b/drivers/i2c/busses/Kconfig
> @@ -326,6 +326,16 @@ config I2C_POWERMAC
>  
>  comment "I2C system bus drivers (mostly embedded / system-on-chip)"
>  
> +config I2C_ASPEED
> + tristate "Aspeed I2C Controller"
> + depends on ARCH_ASPEED
> + help
> +   If you say yes to this option, support will be included for the
> +   Aspeed I2C controller.
> +
> +   This driver can also be built as a module.  If so, the module
> +   will be called i2c-aspeed.
> +
>  config I2C_AT91
>   tristate "Atmel AT91 I2C Two-Wire interface (TWI)"
>   depends on ARCH_AT91
> diff --git a/drivers/i2c/busses/Makefile b/drivers/i2c/busses/Makefile
> index 30b60855fbcd..e84604b9bf3b 100644
> --- a/drivers/i2c/busses/Makefile
> +++ b/drivers/i2c/busses/Makefile
> @@ -29,6 +29,7 @@ obj-$(CONFIG_I2C_HYDRA) += i2c-hydra.o
>  obj-$(CONFIG_I2C_POWERMAC)   += i2c-powermac.o
>  
>  # Embedded system I2C/SMBus host controller drivers
> +obj-$(CONFIG_I2C_ASPEED) += i2c-aspeed.o
>  obj-$(CONFIG_I2C_AT91)   += i2c-at91.o
>  obj-$(CONFIG_I2C_AU1550) += i2c-au1550.o
>  obj-$(CONFIG_I2C_AXXIA)  += i2c-axxia.o
> diff --git a/drivers/i2c/busses/i2c-aspeed.c b/drivers/i2c/busses/i2c-aspeed.c
> new file mode 100644
> index ..3eb1252442d7
> --- /dev/null
> +++ b/drivers/i2c/busses/i2c-aspeed.c
> @@ -0,0 +1,695 @@
> +/*
> + *  Aspeed 24XX/25XX I2C Controller.
> + *
> + *  Copyright (C) 2012-2017 ASPEED Technology Inc.
> + *  Copyright 2017 IBM Corporation
> + *  Copyright 2017 Google, Inc.
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License version 2 as
> + *  published by the Free Software Foundation.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/* I2C Register */
> +#define ASPEED_I2C_FUN_CTRL_REG  0x00
> +#define ASPEED_I2C_AC_TIMING_REG10x04
> +#define ASPEED_I2C_AC_TIMING_REG20x08
> +#define 

Re: [PATCH 4/7] RISC-V: arch/riscv/include

2017-05-23 Thread Benjamin Herrenschmidt
On Tue, 2017-05-23 at 14:55 +0200, Arnd Bergmann wrote:
> > +
> > +#include 
> 
> I would recommend providing your own {read,write}{b,w,l,q}{,_relaxed}
> helpers using inline assembly, to prevent the compiler from breaking
> up accesses into byte accesses.
> 
> Also, most architectures require some synchronization after a
> non-relaxed readl() to prevent prefetching of DMA buffers, and
> before a writel() to flush write buffers when a DMA gets triggered.

Right, I was about to comment on that one.

The question, Palmer, is about the ordering semantics of non-cached
storage.

What kind of ordering is provided architecturally ? Especially
between cachable and non-cachable loads and stores ?

Also, you have PCIe right ? What is the behaviour of MSIs ?

Does your HW provide a guarantee that in the case of a series of DMA
writes to memory by a device followed by an MSI, the CPU getting the
MSI will only get it after all the previous DMA writes have reached
coherency ? (Unlike LSIs where the driver is required to do an MMIO
read from the device, MSIs are expected to be ordered with data).

Another thing with the read*() accessors. It's not uncommon for
a driver to do:

writel(1, reset_reg);
readl(reset_reg); /* flush posted writes */
udelay(10);
writel(0, reset_reg);

Now, in the above case, what can typically happen if you aren't careful
is that the readl which is intended to "push" the previous writel, will
not actually do its job because the return value hasn't been "consumed"
by the processor. Thus, the CPU will stick that on some kind of load
queue and won't actually wait for the return value before hitting the
delay loop.

Thus you might end up in a situation where the writel of 1 to the
device is itself reaching the device way after you started the delay
loop, and thus end up violating the delay requirement of the HW.

On powerpc we solve that by using a special instruction construct
inside the read* accessors that prevents the CPU from executing
subsequent instructions until the read value has been returned.

You may want to consider something similar.
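
For reference, the construct boils down to a never-taken trap that
consumes the loaded value, followed by isync; a simplified sketch only
(the real accessors in arch/powerpc/include/asm/io.h also handle
byteswapping and error recovery):

static inline u32 my_readl(const volatile void __iomem *addr)
{
        u32 val;

        __asm__ __volatile__(
                "sync\n"          /* order against preceding stores */
                "lwz %0,0(%1)\n"  /* the MMIO load itself */
                "twi 0,%0,0\n"    /* trap-never, but it uses the value */
                "isync"           /* nothing later runs until twi resolves */
                : "=r" (val) : "b" (addr) : "memory");
        return val;
}

With something like that, the delay loop in the example above cannot
start until the read data has actually come back from the device.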

Cheers,
Ben.



Re: [patches] Re: [PATCH 2/7] RISC-V: arch/riscv Makefile and Kconfigs

2017-05-23 Thread Benjamin Herrenschmidt
On Mon, 2017-05-22 at 22:16 -0700, Olof Johansson wrote:
> The same is true for some other drivers. Actually, I wonder if it
> might be just as easy to implement a sbi backend for hvc -- see
> hvc_udbg.c for an example where, on power, you have a simple get/put
> char hypervisor call in a very similar manner.

Rather look at hvc_opal. This is the console driver we use on native
POWER servers with the OPAL firmware, and it calls into a firmware
in a very similar way.

The driver lives in drivers/tty/hvc. It does call some "helpers" in
the arch code that wrap the actual FW calls.
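
A back-of-the-envelope sketch of such a backend (assuming arch helpers
named sbi_console_putchar()/sbi_console_getchar() that wrap the firmware
calls; those names are illustrative, only the hv_ops/hvc_alloc()
interface from drivers/tty/hvc/hvc_console.h is the real hvc API):

static int hvc_sbi_put_chars(uint32_t vtermno, const char *buf, int count)
{
        int i;

        for (i = 0; i < count; i++)
                sbi_console_putchar(buf[i]);
        return count;
}

static int hvc_sbi_get_chars(uint32_t vtermno, char *buf, int count)
{
        int i, c;

        for (i = 0; i < count; i++) {
                c = sbi_console_getchar();
                if (c < 0)
                        break;
                buf[i] = c;
        }
        return i;
}

static const struct hv_ops hvc_sbi_ops = {
        .get_chars = hvc_sbi_get_chars,
        .put_chars = hvc_sbi_put_chars,
};

/* probe then boils down to: hvc_alloc(0, 0, &hvc_sbi_ops, 16); */

The arch keeps only the thin firmware-call wrappers, and everything
tty-related stays in drivers/tty/hvc.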

> Either way (keeping discrete sbi driver or implementing hvc backend),
> moving to drivers/tty is the right thing here -- we've worked hard on
> ARM to get rid of random drivers under arch/ and it'd be nice to not
> see new ones intoduced here.

Yup. The reason mostly is that if the tty maintainer needs to do a
subsystem-wide change, he can address all drivers in drivers/tty and
doesn't have to look for others elsewhere in the tree. This is the same
for all subsystems.

Cheers,
Ben.



Re: [RFC][PATCH] time: Add warning about imminent deprecation of CONFIG_GENERIC_TIME_VSYSCALL_OLD

2017-05-22 Thread Benjamin Herrenschmidt
On Mon, 2017-05-22 at 12:06 -0700, John Stultz wrote:
> So, long ago I talked w/ Paul Mackerras about the ppc vdso code, as
> ppc has some other legacy "userspace time" code that has to be
> maintained as well (I believe there's no code page, just a data page
> that userspace pulls directly from).

Hrm... the ppc VDSO has a code page, userspace just calls
__kernel_clock_*

I don't know if anything still uses the direct mapped values, we should
enquire internally.

Cheers,
Ben.



Re: [PATCH 4/4] arch/powerpc/44x/fsp2: wdt tcr update instead of whole rewrite

2017-05-19 Thread Benjamin Herrenschmidt
On Mon, 2017-05-15 at 16:07 +0300, Ivan Mikhaylov wrote:
> +#ifdef CONFIG_FSP2
> +   /*
> +    * Prevent a kernel panic caused by unintentionally clearing TCR
> +    * watchdog bits.  At this point in the kernel boot, the watchdog has
> +    * already been enabled by u-boot.  The original code's attempt to
> +    * write to the TCR register results in an inadvertent clearing of the
> +    * watchdog configuration bits, causing the 440 to reset.
> +    */
> +   tcr = mfspr(SPRN_TCR);
> +   tcr &= TCR_WP_MASK; /* clear all bits except for TCR[WP] */
> +   tcr |= TCR_DIE; /* enable decrementer */
> +   mtspr(SPRN_TCR, tcr);
> +#else

This should be a runtime test, not a compile time option.
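
Something along these lines instead (a sketch; whether "ibm,fsp2" is the
right compatible string to key off is an assumption, use whatever the
fsp2 device tree actually advertises):

        unsigned long tcr = TCR_DIE;    /* enable decrementer */

        if (of_machine_is_compatible("ibm,fsp2")) {
                /* u-boot already armed the watchdog: keep the TCR[WP] bits */
                tcr |= mfspr(SPRN_TCR) & TCR_WP_MASK;
        }
        mtspr(SPRN_TCR, tcr);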

Cheers,
Ben.



Re: [v3 0/9] parallelized "struct page" zeroing

2017-05-16 Thread Benjamin Herrenschmidt
On Fri, 2017-05-12 at 13:37 -0400, David Miller wrote:
> > Right now it is larger, but what I suggested is to add a new optimized
> > routine just for this case, which would do STBI for 64-bytes but
> > without membar (do membar at the end of memmap_init_zone() and
> > deferred_init_memmap()
> > 
> > #define struct_page_clear(page) \
> >  __asm__ __volatile__(   \
> >  "stxa   %%g0, [%0]%2\n" \
> >  "stxa   %%xg0, [%0 + %1]%2\n"   \
> >  : /* No output */   \
> >  : "r" (page), "r" (0x20), "i"(ASI_BLK_INIT_QUAD_LDD_P))
> > 
> > And insert it into __init_single_page() instead of memset()
> > 
> > The final result is 4.01s/T which is even faster compared to current
> > 4.97s/T
> 
> Ok, indeed, that would work.

On ppc64, that might not. We have a dcbz instruction that clears an
entire cache line at once. That's what we use for memsets and page
clearing. However, 64 bytes is half a cache line on modern processors,
so we can't use it with that semantic and would have to fall back to the
slower stores.
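
(For reference, dcbz zeroes one whole aligned cache block, i.e. something
like the helper below, where the name is made up; with 128-byte lines a
64-byte struct page can't be cleared this way without also clobbering its
neighbour.)

static inline void dcbz_line(void *addr)
{
        asm volatile("dcbz 0,%0" : : "r" (addr) : "memory");
}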

Cheers,
Ben.



Re: [PATCH v7 4/5] i2c: aspeed: added driver for Aspeed I2C

2017-05-09 Thread Benjamin Herrenschmidt
On Mon, 2017-05-08 at 16:28 -0700, Brendan Higgins wrote:
> 
> I am curious what everyone thinks about this. It seemed that, earlier
> on, people did not like me disabling multi-master mode, but I think
> that it would make bus recovery not work as well. Given that, I think
> it makes the most sense to provide a device tree option either to
> enable multi-master support or disable it. Thoughts?

I think we probably want to keep it enabled yes. Another reason is that
SMBus notifications are in effect a type of multi-master (well slave
really), and we should really add support for them even when the slave
is not enabled.

We are hitting some PMbus devices that apparently can't be prevented
from sending those and that's causing issues.

Cheers,
Ben.



Re: Is iounmap(NULL) safe or not?

2017-05-06 Thread Benjamin Herrenschmidt
On Sat, 2017-05-06 at 01:50 +0300, Alexey Khoroshilov wrote:
> Could you please clarify if iounmap(NULL) is safe or not.
> I guess it would be less error-prone if the answer is architecture independent.

I think it's supposed to be and we should fix ppc.
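
In other words the ppc implementation just needs the same early-out that
kfree(NULL) has; a minimal sketch:

void iounmap(volatile void __iomem *addr)
{
        if (!addr)      /* make iounmap(NULL) a harmless no-op */
                return;
        /* ... existing unmap path ... */
}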

Cheers,
Ben.



Re: [PATCH] crypto: vmx: Remove dubiously licensed crypto code

2017-05-05 Thread Benjamin Herrenschmidt
On Fri, 2017-05-05 at 15:52 +0200, Michal Suchánek wrote:
> 
> So you have an e-mail message from one of the authors of the code.
> Andy Polyakov wrote most of the code but there are probably other
> contributors who never gave explicit consent for using their code
> outside of OpenSSL.

The only contributions to that file in OpenSSL that isn't from Andy
are a whitespace fix and the addition of the licence :-)

If that makes you feel better we could undo the whitespace fix :-)

I've checked the other PowerPC related files in there and the situation
is identical (aes-ppc.pl does have also a spelling fix in comments).

At this point, it's pretty clear that this code is authored solely by
Andy who gave permission to use it.

Cheers,
Ben.



Re: [PATCH v4 2/2] drivers/serial: Add driver for Aspeed virtual UART

2017-05-02 Thread Benjamin Herrenschmidt
On Tue, 2017-05-02 at 17:15 +0930, Joel Stanley wrote:
> From: Jeremy Kerr <j...@ozlabs.org>
> 
> This change adds a driver for the 16550-based Aspeed virtual UART
> device. We use a similar process to the of_serial driver for device
> probe, but expose some VUART-specific functions through sysfs too.
> 
> The VUART is two UART 'front ends' connected by their FIFO (no actual
> serial line in between). One is on the BMC side (management controller)
> and one is on the host CPU side.
> 
> This driver is for the BMC side. The sysfs files allow the BMC
> userspace, which owns the system configuration policy, to specify at
> what IO port and interrupt number the host side will appear to the host
> on the Host <-> BMC LPC bus. It could be different on a different system
> (though most of them use 3f8/4).
> 
> OpenPOWER host firmware doesn't like it when the host-side of the
> VUART's FIFO is not drained. This driver only disables host TX discard
> mode when the port is in use. We set the VUART enabled bit when we bind
> to the device, and clear it on unbind.
> 
> We don't want to do this on open/release, as the host may be using this
> bit to configure serial output modes, which is independent of whether
> the devices has been opened by BMC userspace.
> 
> Signed-off-by: Jeremy Kerr <j...@ozlabs.org>
> Signed-off-by: Joel Stanley <j...@jms.id.au>
> Acked-by: Rob Herring <r...@kernel.org>

Reviewed-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>

> ---
> v4:
>  - Reorder if statements
>  - Remove uncessary comment
> 
> v3:
>  - remove whitespace in header
>  - reformat comment
>  - don't check for reg-io-width property; we don't need any special
>accessors for reading/writing bytes
>  - move file to 8250_aspeed_vuart.c
> 
> v2:
>  - Use attribute groups and DEVICE_ATTR_RW
>  - Use platform_get_resource/devm_ioremap_resource
>  - of_find_property -> of_property_read_bool
>  - Drop unncessary 0xff mask
>  - Fix comment style
>  - Use BIT and GENMASK where pssible
>  - Move to 8250 directory
>  - Rename ast -> aspeed to match other Aspeed drivers
>  - Add documentation of sysfs file
>  - Add detail to the commit message
> 
>  Documentation/ABI/stable/sysfs-driver-aspeed-vuart |  15 +
>  Documentation/devicetree/bindings/serial/8250.txt  |   2 +
>  drivers/tty/serial/8250/8250_aspeed_vuart.c| 323 
> +
>  drivers/tty/serial/8250/Kconfig|  10 +
>  drivers/tty/serial/8250/Makefile   |   1 +
>  5 files changed, 351 insertions(+)
>  create mode 100644 Documentation/ABI/stable/sysfs-driver-aspeed-vuart
>  create mode 100644 drivers/tty/serial/8250/8250_aspeed_vuart.c
> 
> diff --git a/Documentation/ABI/stable/sysfs-driver-aspeed-vuart 
> b/Documentation/ABI/stable/sysfs-driver-aspeed-vuart
> new file mode 100644
> index ..8062953ce77b
> --- /dev/null
> +++ b/Documentation/ABI/stable/sysfs-driver-aspeed-vuart
> @@ -0,0 +1,15 @@
> +What:/sys/bus/platform/drivers/aspeed-vuart/*/lpc_address
> +Date:April 2017
> +Contact: Jeremy Kerr <j...@ozlabs.org>
> +Description: Configures which IO port the host side of the UART
> + will appear on the host <-> BMC LPC bus.
> +Users:   OpenBMC.  Proposed changes should be mailed to
> + open...@lists.ozlabs.org
> +
> +What:/sys/bus/platform/drivers/aspeed-vuart*/sirq
> +Date:April 2017
> +Contact: Jeremy Kerr <j...@ozlabs.org>
> +Description: Configures which interrupt number the host side of
> + the UART will appear on the host <-> BMC LPC bus.
> +Users:   OpenBMC.  Proposed changes should be mailed to
> + open...@lists.ozlabs.org
> diff --git a/Documentation/devicetree/bindings/serial/8250.txt 
> b/Documentation/devicetree/bindings/serial/8250.txt
> index 10276a46ecef..656733949309 100644
> --- a/Documentation/devicetree/bindings/serial/8250.txt
> +++ b/Documentation/devicetree/bindings/serial/8250.txt
> @@ -20,6 +20,8 @@ Required properties:
>   - "fsl,16550-FIFO64"
>   - "fsl,ns16550"
>   - "ti,da830-uart"
> + - "aspeed,ast2400-vuart"
> + - "aspeed,ast2500-vuart"
>   - "serial" if the port type is unknown.
>  - reg : offset and length of the register set for the device.
>  - interrupts : should contain uart interrupt.
> diff --git a/drivers/tty/serial/8250/8250_aspeed_vuart.c 
> b/drivers/tty/serial/8250/8250_aspeed_vuart.c
> new file mode 100644
> index ..822be4906763
> --- /dev/null
> +++ b

Re: [PATCH v6 4/5] i2c: aspeed: added driver for Aspeed I2C

2017-04-25 Thread Benjamin Herrenschmidt
On Tue, 2017-04-25 at 08:50 +, Ryan Chen wrote:
> Hello All,
>   ASPEED_I2CD_M_SDA_DRIVE_1T_EN,
> ASPEED_I2CD_SDA_DRIVE_1T_EN are specific to certain use cases.
>   For example, if the i2c bus is used at "high speed" with a
> single slave and master and the bus is too long, it needs to drive SDA
> or SCL with less latency; then it would be enabled.
>   Otherwise, don't enable it, especially with multiple masters,
> where it can't be enabled.

That smells like a specific enough use case that we should probably
cover with a device-tree property, something like an empty
"sda-extra-drive" property (empty properties are typically used
for booleans, their presence means "true").

Thanks Ryan. Can you shed some light on the meaning of the high-speed
bit as well please ? Does it force to a specific speed (ignoring the
divisor) or we can still play with the clock high/low counts ?

Cheers,
Ben.

>     
>   
> 
> Best Regards,
> Ryan
> 
> 信驊科技股份有限公司
> ASPEED Technology Inc.
> 2F,No.15,Industry East Road 4.,Hsinchu Science Park, Hsinchu City
> 30077, Taiwan
> Tel: 886-3-578-9568  #857
> Fax: 886-3-578-9586
> * Email Confidentiality Notice 
> DISCLAIMER:
> This message (and any attachments) may contain legally privileged
> and/or other confidential information. If you have received it in
> error, please notify the sender by reply e-mail and immediately
> delete the e-mail and any attachments without copying or disclosing
> the contents. Thank you.
> 
> 
> -----Original Message-----
> From: Brendan Higgins [mailto:brendanhigg...@google.com] 
> Sent: Tuesday, April 25, 2017 4:32 PM
> To: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> Cc: Wolfram Sang <w...@the-dreams.de>; Rob Herring <robh...@kernel.org
> >; Mark Rutland <mark.rutl...@arm.com>; Thomas Gleixner <tglx@linutro
> nix.de>; Jason Cooper <ja...@lakedaemon.net>; Marc Zyngier  i...@arm.com>; Joel Stanley <j...@jms.id.au>; Vladimir Zapolskiy <vz@m
> leia.com>; Kachalov Anton <mo...@mayc.ru>; Cédric Le Goater <clg@kaod
> .org>; linux-...@vger.kernel.org; devicet...@vger.kernel.org; Linux
> Kernel Mailing List <linux-kernel@vger.kernel.org>; OpenBMC Maillist
> <open...@lists.ozlabs.org>; Ryan Chen <ryan_c...@aspeedtech.com>
> Subject: Re: [PATCH v6 4/5] i2c: aspeed: added driver for Aspeed I2C
> 
> Adding Ryan.
> 
> On Mon, Apr 24, 2017 at 7:19 PM, Benjamin Herrenschmidt <benh@kernel.
> crashing.org> wrote:
> > On Mon, 2017-04-24 at 11:56 -0700, Brendan Higgins wrote:
> > > > > +struct aspeed_i2c_bus {
> > > > > + struct i2c_adapter  adap;
> > > > > + struct device   *dev;
> > > > > + void __iomem*base;
> > > > > + /* Synchronizes I/O mem access to base. */
> > > > > + spinlock_t  lock;
> > > > 
> > > > I am not entirely convinced we need that lock. The i2c core
> > > > will 
> > > > take a mutex protecting all operations on the bus. So we only
> > > > need 
> > > > to synchronize between our "xfer" code and our interrupt
> > > > handler.
> > > 
> > > You are right if both having slave and master active at the same
> > > time 
> > > was not possible; however, it is.
> > 
> > Right, I somewhat forgot about the slave case.
> > 
> >   ...
> > 
> > > > Some of those error states probably also warrant a reset of
> > > > the 
> > > > controller, I think aspeed does that in the SDK.
> > > 
> > > For timeout and cmd_err, I do not see any argument against it;
> > > it 
> > > sounds like we are in a very messed up, very unknown state, so
> > > full 
> > > reset is probably the best last resort.
> > 
> > Yup.
> > 
> > > For SDA staying pulled down, I
> > > think we can say with reasonable confidence that some device on
> > > our 
> > > bus is behaving very badly and I am not convinced that resetting
> > > the 
> > > controller is likely to do anything to help;
> > 
> > Right. Hammering with STOPs and pray ...
> 
> I think sending recovery mode sends stops as a part of the recovery
> algorithm it executes.
> 
> > 
> > >  that being said, I really
> > > do not have any good ideas to address that. So maybe praying and 
> > > resetting the controller is *the most reasonable thing to do.* I 
> > &
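
For illustration, a minimal sketch of what consuming such an empty boolean
property could look like on the driver side. The "sda-extra-drive" name is
only the suggestion above, and the register offset and bit below are
placeholders, not values from the datasheet:

#include <linux/bitops.h>
#include <linux/io.h>
#include <linux/of.h>

#define EXAMPLE_I2C_FUN_CTRL_REG        0x00    /* placeholder offset */
#define EXAMPLE_I2CD_SDA_DRIVE_1T_EN    BIT(8)  /* placeholder bit */

static void example_apply_sda_drive_quirk(void __iomem *base,
                                          struct device_node *np)
{
        u32 reg;

        /* Empty DT property: present means "true". */
        if (!of_property_read_bool(np, "sda-extra-drive"))
                return;

        reg = readl(base + EXAMPLE_I2C_FUN_CTRL_REG);
        reg |= EXAMPLE_I2CD_SDA_DRIVE_1T_EN;
        writel(reg, base + EXAMPLE_I2C_FUN_CTRL_REG);
}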

Re: [PATCH v7 4/5] i2c: aspeed: added driver for Aspeed I2C

2017-04-24 Thread Benjamin Herrenschmidt
On Mon, 2017-04-24 at 11:18 -0700, Brendan Higgins wrote:
> +static int __aspeed_i2c_init_clk(struct aspeed_i2c_bus *bus,
> +    struct platform_device *pdev)
> +{

Minor nit ... I'm really not fan of those underscores.

We use __ functions in some cases in the kernel for low level
helpers, usually when it's a low level variant of an existing
function or an "unlocked" variant, but I don't think generalizing
it to pretty much everything in the driver is worthwhile here.

If you want to be explicit about locking, I would suggest you
use a comment in front of the function explaining if it
expects to be called with the lock held.

We tend to only do that when *both* functions exist and one is
implemented in term of the other.

Cheers,
Ben.
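
To make the convention concrete, a small sketch with entirely made-up names
(the struct, register offset and function are invented for the example): the
double-underscore variant only exists because the plain variant is
implemented in terms of it, and its locking expectation is spelled out in a
comment:

#include <linux/io.h>
#include <linux/spinlock.h>

#define EXAMPLE_CLK_DIV_REG     0x04            /* placeholder offset */

struct example_bus {
        void __iomem *base;
        spinlock_t lock;
};

/* Caller must hold bus->lock. */
static void __example_set_clk_div(struct example_bus *bus, u32 div)
{
        writel(div, bus->base + EXAMPLE_CLK_DIV_REG);
}

static void example_set_clk_div(struct example_bus *bus, u32 div)
{
        unsigned long flags;

        spin_lock_irqsave(&bus->lock, flags);
        __example_set_clk_div(bus, div);
        spin_unlock_irqrestore(&bus->lock, flags);
}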



Re: [PATCH v6 4/5] i2c: aspeed: added driver for Aspeed I2C

2017-04-24 Thread Benjamin Herrenschmidt
On Mon, 2017-04-24 at 11:56 -0700, Brendan Higgins wrote:
> > > +struct aspeed_i2c_bus {
> > > + struct i2c_adapter  adap;
> > > + struct device   *dev;
> > > + void __iomem*base;
> > > + /* Synchronizes I/O mem access to base. */
> > > + spinlock_t  lock;
> > 
> > I am not entirely convinced we need that lock. The i2c core will
> > take a mutex protecting all operations on the bus. So we only need
> > to synchronize between our "xfer" code and our interrupt handler.
> 
> You are right if both having slave and master active at the same time
> was not possible; however, it is.

Right, I somewhat forgot about the slave case.

  ...

> > Some of those error states probably also warrant a reset of the
> > controller,
> > I think aspeed does that in the SDK.
> 
> For timeout and cmd_err, I do not see any argument against it; it
> sounds like we are in a very messed up, very unknown state, so full
> reset is probably the best last resort.

Yup.

> For SDA staying pulled down, I
> think we can say with reasonable confidence that some device on our
> bus is behaving very badly and I am not convinced that resetting the
> controller is likely to do anything to help;

Right. Hammering with STOPs and pray ...

>  that being said, I really
> do not have any good ideas to address that. So maybe praying and
> resetting the controller is *the most reasonable thing to do.* I
> would like to know what you think we should do in that case.

Well, there's a (small ?) chance that it's a controller bug asserting
the line so ... but there's little we can do if not.

> While I was thinking about this I also realized that the SDA line
> check after recovery happens in the else branch, but SCL line check
> does not happen after we attempt to STOP if SCL is hung. If we decide
> to make special note SDA being hung by a device that won't let go, we
> might want to make a special note that SCL is hung by a device that
> won't let go. Just a thought.

Maybe. Or just "unrecoverable error"... hopefully these don't happen
too often ... We had cases of a TPM misbehaving like that.

> > > +out:
> 
> ...
> > What about I2C_M_NOSTART ?
> > 
> > Not that I've ever seen it used... ;-)
> 
> Right now I am not doing any of the protocol mangling options, but I
> can add them in if you think it is important for initial support.

No, not important, we can add that later if it ever becomes useful.

 ...

> > In general, you always ACK all interrupts first. Then you handle
> > the bits you have harvested.
> > 
> 
> The documentation says to ACK the interrupt after handling in the RX
> case:
> 
> <<<
> S/W needs to clear this status bit to allow next data receiving.
> >>>
> 
> I will double check with Ryan to make sure TX works the same way.
> 
> > > + if (irq_status & ASPEED_I2CD_INTR_ERROR ||
> > > + (!bus->msgs && bus->master_state !=
> > > ASPEED_I2C_MASTER_STOP)) {
> 
> ...
> > 
> > I would set master_state to "RECOVERY" (new state ?) and ensure
> > those things are caught if they happen outside of a recovery.

I replied privately ... as long as we ack before we start a new command
we should be ok but we shouldn't ack after.

Your latest patch still does that. It will do things like start a STOP
command *then* ack the status bits. I'm pretty sure that's bogus.

That way it's a lot simpler to simply move the

writel(irq_status, bus->base + ASPEED_I2C_INTR_STS_REG);

to right after the readl of the status reg at the beginning of
aspeed_i2c_master_irq().

I would be very surprised if that didn't work properly and wasn't much
safer than what you are currently doing. 

> Let me know if you still think we need a "RECOVERY" state.

The way you just switch to stop state and store the error for later
should work I think.

> > 
> > > + if (bus->master_state == ASPEED_I2C_MASTER_START) {
> 
> ...
> > 
> > > + dev_dbg(bus->dev,
> > > + "no slave present at %02x", msg-
> > > >addr);
> > > + status_ack |= ASPEED_I2CD_INTR_TX_NAK;
> > > + bus->cmd_err = -EIO;
> > > + do_stop(bus);
> > > + goto out_no_complete;
> > > + } else {
> > > + status_ack |= ASPEED_I2CD_INTR_TX_ACK;
> > > + if (msg->flags & I2C_M_RD)
> > > + bus->master_state =
> > > ASPEED_I2C_MASTER_RX;
> > > + else
> > > + bus->master_state =
> > > ASPEED_I2C_MASTER_TX_FIRST;
> > 
> > What about the SMBUS_QUICK case ? (0-len transfer). Do we need
> > to handle this here ? A quick look at the TX_FIRST case makes
> > me think we are ok there but I'm not sure about the RX case.
> 
> I did not think that there is an SMBUS_QUICK RX. Could you point me
> to an example?

Not so much an RX, it's more like you are sending a 1-bit data in
the 
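
For reference, a sketch of the interrupt handler ordering suggested above:
read the status register, ack everything harvested immediately, then handle
the bits. The struct is trimmed to the fields used and the register offset
is a placeholder, not the real one:

#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/spinlock.h>

#define EXAMPLE_I2C_INTR_STS_REG        0x10    /* placeholder offset */

struct example_i2c_bus {                /* trimmed to the fields used here */
        void __iomem *base;
        spinlock_t lock;
};

static irqreturn_t example_i2c_master_irq(int irq, void *dev_id)
{
        struct example_i2c_bus *bus = dev_id;
        u32 irq_status;

        spin_lock(&bus->lock);

        irq_status = readl(bus->base + EXAMPLE_I2C_INTR_STS_REG);
        /*
         * Ack everything harvested *before* any new command (such as a
         * STOP) is issued further down, so a status bit raised by that
         * new command cannot be cleared by mistake.
         */
        writel(irq_status, bus->base + EXAMPLE_I2C_INTR_STS_REG);

        /* ... drive the master/slave state machines from irq_status ... */

        spin_unlock(&bus->lock);
        return irq_status ? IRQ_HANDLED : IRQ_NONE;
}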

Re: [PATCH v4 05/11] VAS: Define helpers for access MMIO regions

2017-04-24 Thread Benjamin Herrenschmidt
On Mon, 2017-04-24 at 10:25 -0700, Sukadev Bhattiprolu wrote:
> which maybe due to this :-) Should I change to pgprot_noncached() for
> the MMIO writes?

Just use normal ioremap().

> > requires being mapped cachable. Ask Aneesh for a cleaner way of
> > doing it too while at it.


Re: [PATCH v4 05/11] VAS: Define helpers for access MMIO regions

2017-04-24 Thread Benjamin Herrenschmidt
On Thu, 2017-03-30 at 22:13 -0700, Sukadev Bhattiprolu wrote:
> +static void *map_mmio_region(char *name, uint64_t start, int len)
> +{
> +   void *map;
> +
> +   if (!request_mem_region(start, len, name)) {
> +   pr_devel("%s(): request_mem_region(0x%llx, %d) failed\n",
> +   __func__, start, len);
> +   return NULL;
> +   }
> +
> +   map = __ioremap(start, len, pgprot_val(pgprot_cached(__pgprot(0))));
> +   if (!map) {
> +   pr_devel("%s(): ioremap(0x%llx, %d) failed\n", __func__, 
> start,
> +   len);
> +   return NULL;
> +   }
> +
> +   return map;
> +}

That's very wrong. I assume this never worked right ?

MMIO regions must be mapped non-cachable. Only the paste region
requires being mapped cachable. Ask Aneesh for a cleaner way of
doing it too while at it.

> +/*
> + * Unmap the MMIO regions for a window.
> + */
> +static void unmap_wc_paste_kaddr(struct vas_window *window)
> +{
> +   int len;

Don't use "wc"... that usually means "write combine".

Cheers,
Ben.
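
A sketch of what the helper might look like with the advice above applied,
i.e. a plain, non-cachable ioremap(); the error handling is illustrative
only:

#include <linux/io.h>
#include <linux/ioport.h>
#include <linux/printk.h>

static void __iomem *example_map_mmio_region(const char *name, u64 start,
                                             resource_size_t len)
{
        void __iomem *map;

        if (!request_mem_region(start, len, name)) {
                pr_devel("%s(): request_mem_region(0x%llx, %llu) failed\n",
                         __func__, (unsigned long long)start,
                         (unsigned long long)len);
                return NULL;
        }

        /* Plain ioremap() gives the non-cachable mapping MMIO needs. */
        map = ioremap(start, len);
        if (!map) {
                pr_devel("%s(): ioremap(0x%llx, %llu) failed\n",
                         __func__, (unsigned long long)start,
                         (unsigned long long)len);
                release_mem_region(start, len);
                return NULL;
        }

        return map;
}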



Re: [PATCH v4 04/11] VAS: Define vas_init() and vas_exit()

2017-04-24 Thread Benjamin Herrenschmidt
On Thu, 2017-03-30 at 22:13 -0700, Sukadev Bhattiprolu wrote:
> 
> + p = of_get_property(dn, "vas-id", NULL);
> + if (!p) {
> + pr_err("VAS: NULL vas-id? %p\n", p);
> + return -ENODEV;
> + }
> +
> + vinst->vas_id = of_read_number(p, 1);

of_property_read_u32()

> + /*
> +  * Hardcoded 6 is tied to corresponding code in
> +  *  skiboot.git/core/vas.c
> +  */
> + rc = of_property_read_variable_u64_array(dn, "reg", values,
> 6, 6);
> + if (rc != 6) {
> + pr_err("VAS %d: Unable to read reg properties, rc
> %d\n",
> + vinst->vas_id, rc);
> + return rc;
> + }
> +
> + vinst->hvwc_bar_start = values[0];
> + vinst->hvwc_bar_len = values[1];
> + vinst->uwc_bar_start = values[2];
> + vinst->uwc_bar_len = values[3];
> + vinst->win_base_addr = values[4];
> + vinst->win_id_shift = values[5];

If you make the VAS a proper MMIO device on the powerbus (as explained
in my comment to your OPAL patch), and you create this as a platform
device based on the DT, the core will parse the reg property for you
and will populate the device struct resource array for you.

> + return 0;
> +}
> +
> +/*
> + * Although this is read/used multiple times, it is written to only
> + * during initialization.
> + */
> +struct vas_instance *find_vas_instance(int vasid)
> +{
> + int i;
> + struct vas_instance *vinst;
> +
> + for (i = 0; i < vas_num_instances; i++) {
> + vinst = &vas_instances[i];
> + if (vinst->vas_id == vasid)
> + return vinst;
> + }
> + pr_err("VAS instance for vas-id %d not found\n", vasid);
> + WARN_ON_ONCE(1);
> + return NULL;
> +}

This is gross. What is it needed for  ?

> +
> +bool vas_initialized(void)
> +{
> + return init_done;
> +}
> +
> +int vas_init(void)
> +{
> + int rc;
> + struct device_node *dn;
> + struct vas_instance *vinst;
> +
> + if (!pvr_version_is(PVR_POWER9))
> + return -ENODEV;

No. Create a platform device that matches the corresponding device-tree
node. One per instance.

The resulting VAS "driver" thus becomes a loadable module which you can
put elsewhere in the tree, matching on the DT node "compatible"
property.

You can then have the copros be children of it. The VAS driver can then
create additional platform devices for its children, which can have their
own module (which *can* depend on symbols exported by the VAS one)
etc...

> + vas_num_instances = 0;
> + for_each_node_by_name(dn, "vas")
> + vas_num_instances++;
> +
> + if (!vas_num_instances)
> + return -ENODEV;
> +
> + vas_instances = kcalloc(vas_num_instances, sizeof(*vinst),
> GFP_KERNEL);
> + if (!vas_instances)
> + return -ENOMEM;
> +
> + vinst = &vas_instances[0];
> + for_each_node_by_name(dn, "vas") {
> + rc = init_vas_instance(dn, vinst);
> + if (rc) {
> + pr_err("Error %d initializing VAS instance
> %ld\n", rc,
> + (vinst - &vas_instances[0]));
> + goto cleanup;
> + }
> + vinst++;
> + }
> +
> + rc = -ENODEV;
> + if (vinst == &vas_instances[0]) {
> + /* Should not happen as we saw some above. */
> + pr_err("VAS: Did not find any VAS DT nodes now!\n");
> + goto cleanup;
> + }
> +
> + pr_devel("VAS: Initialized %d instances\n",
> vas_num_instances);
> + init_done = true;
> +
> + return 0;
> +
> +cleanup:
> + kfree(vas_instances);
> + return rc;
> +}
> +
> +void vas_exit(void)
> +{
> + init_done = false;
> + kfree(vas_instances);
> +}
> +
> +module_init(vas_init);
> +module_exit(vas_exit);
> +MODULE_DESCRIPTION("IBM Virtual Accelerator Switchboard");
> +MODULE_AUTHOR("Sukadev Bhattiprolu ");
> +MODULE_LICENSE("GPL");
> diff --git a/arch/powerpc/platforms/powernv/vas.h
> b/arch/powerpc/platforms/powernv/vas.h
> index c63395d..46aaa17 100644
> --- a/arch/powerpc/platforms/powernv/vas.h
> +++ b/arch/powerpc/platforms/powernv/vas.h
> @@ -384,4 +384,7 @@ struct vas_winctx {
>   enum vas_notify_after_count notify_after_count;
>  };
>  
> +extern bool vas_initialized(void);
> +extern struct vas_instance *find_vas_instance(int vasid);
> +
>  #endif /* _VAS_H */
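
To illustrate the two suggestions above (of_property_read_u32() and a
platform driver bound by compatible string), a minimal sketch. The
"ibm,vas" compatible string and the assumption that firmware exposes the
BARs through a normal "reg" property are mine, not taken from the patch:

#include <linux/ioport.h>
#include <linux/module.h>
#include <linux/of.h>
#include <linux/platform_device.h>

static int example_vas_probe(struct platform_device *pdev)
{
        struct device_node *dn = pdev->dev.of_node;
        struct resource *res;
        u32 vas_id;
        int rc;

        rc = of_property_read_u32(dn, "vas-id", &vas_id);
        if (rc) {
                dev_err(&pdev->dev, "no vas-id property (%d)\n", rc);
                return -ENODEV;
        }

        /* With a regular "reg" property the OF core has already turned
         * the BARs into resources, no hand parsing needed. */
        res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
        if (!res)
                return -ENODEV;

        dev_info(&pdev->dev, "VAS id %u, first MMIO range %pR\n", vas_id, res);
        return 0;
}

static const struct of_device_id example_vas_of_match[] = {
        { .compatible = "ibm,vas" },    /* assumed compatible string */
        { }
};

static struct platform_driver example_vas_driver = {
        .driver = {
                .name = "example-vas",
                .of_match_table = example_vas_of_match,
        },
        .probe = example_vas_probe,
};
module_platform_driver(example_vas_driver);
MODULE_LICENSE("GPL");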


Re: [PATCH v4 04/11] VAS: Define vas_init() and vas_exit()

2017-04-24 Thread Benjamin Herrenschmidt
On Thu, 2017-03-30 at 22:13 -0700, Sukadev Bhattiprolu wrote:
> +config VAS
> +   tristate "IBM Virtual Accelerator Switchboard (VAS)"

CONFIG_IBM_VAS or PPC_VAS ... too high risk of collision otherwise

Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 16:24 -0600, Jason Gunthorpe wrote:
> Basically, all this list processing is a huge overhead compared to
> just putting a helper call in the existing sg iteration loop of the
> actual op.  Particularly if the actual op is a no-op like no-mmu x86
> would use.

Yes, I'm leaning toward that approach too.

The helper itself could hang off the devmap though.

> Since dma mapping is a performance path we must be careful not to
> create intrinsic inefficiencies with otherwise nice layering :)
> 
> Jason
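
As a purely illustrative sketch of "a helper call in the existing sg
iteration loop": everything prefixed example_ is hypothetical, only
is_zone_device_page() and the scatterlist accessors are existing kernel
API:

#include <linux/device.h>
#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>

/* Both helpers are hypothetical: one stands in for "whatever the devmap
 * provides", the other for the arch's existing per-page mapping. */
bool example_p2p_map_page(struct device *dev, struct scatterlist *sg);
dma_addr_t example_map_host_page(struct device *dev, struct scatterlist *sg,
                                 enum dma_data_direction dir);

static int example_dma_map_sg(struct device *dev, struct scatterlist *sgl,
                              int nents, enum dma_data_direction dir)
{
        struct scatterlist *sg;
        int i;

        for_each_sg(sgl, sg, nents, i) {
                if (is_zone_device_page(sg_page(sg))) {
                        /* One helper call per device page, as suggested. */
                        if (!example_p2p_map_page(dev, sg))
                                return 0;
                        continue;
                }

                sg_dma_address(sg) = example_map_host_page(dev, sg, dir);
                sg_dma_len(sg) = sg->length;
        }

        return nents;
}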


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 17:21 -0600, Jason Gunthorpe wrote:
> Splitting the sgl is different from iommu batching.
> 
> As an example, an O_DIRECT write of 1 MB with a single 4K P2P page in
> the middle.
> 
> The optimum behavior is to allocate a 1MB-4K iommu range and fill it
> with the CPU memory. Then return a SGL with three entries, two
> pointing into the range and one to the p2p.
> 
> It is creating each range which tends to be expensive, so creating
> two
> ranges (or worse, if every SGL created a range it would be 255) is
> very undesired.

I think it's easier to get us started to just use a helper and
stick it in the existing sglist processing loop of the architecture.

As we noticed, stacking dma_ops is actually non-trivial and opens quite
the can of worms.

As Jerome mentioned, you can end up with IOs ops containing an sglist
that is a collection of memory and GPU pages for example.

Cheers,
Ben.


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 15:22 -0600, Jason Gunthorpe wrote:
> On Tue, Apr 18, 2017 at 02:11:33PM -0700, Dan Williams wrote:
> > > I think this opens an even bigger can of worms..
> > 
> > No, I don't think it does. You'd only shim when the target page is
> > backed by a device, not host memory, and you can figure this out by
> > a
> > is_zone_device_page()-style lookup.
> 
> The bigger can of worms is how do you meaningfully stack dma_ops.
> 
> What does the p2p provider do when it detects a p2p page?

Yeah I think we don't really want to stack dma_ops... thinking more
about it.

As I just wrote, it looks like we might need a more specialised hook
in the devmap to be used by the main dma_op, on a per-page basis.

Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 15:03 -0600, Jason Gunthorpe wrote:
> I don't follow, when does get_dma_ops() return a p2p aware provider?
> It has no way to know if the DMA is going to involve p2p, get_dma_ops
> is called with the device initiating the DMA.
> 
> So you'd always return the P2P shim on a system that has registered
> P2P memory?
> 
> Even so, how does this shim work? dma_ops are not really intended to
> be stacked. How would we make unmap work, for instance? What happens
> when the underlying iommu dma ops actually natively understands p2p
> and doesn't want the shim?

Good point. We only know on a per-page basis ... ugh.

So we really need to change the arch main dma_ops. I'm not opposed to
that. What we then need to do is have that main arch dma_map_sg,
when it encounters a "device" page, call into a helper attached to
the devmap to handle *that page*, providing sufficient context.

That helper wouldn't perform the actual iommu mapping. It would simply
return something along the lines of:

 - "use that alternate bus address and don't map in the iommu"
 - "use that alternate bus address and do map in the iommu"
 - "proceed as normal"
 - "fail"

What do you think ?

Cheers,
Ben.
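
Translating those four outcomes into a sketch; none of these types exist,
this is just the shape of the per-page hook being discussed:

#include <linux/device.h>
#include <linux/mm_types.h>
#include <linux/types.h>

enum example_p2p_map_action {
        EXAMPLE_P2P_MAP_BUS_ADDR,       /* alternate bus address, no iommu */
        EXAMPLE_P2P_MAP_BUS_ADDR_IOMMU, /* alternate bus address, via iommu */
        EXAMPLE_P2P_MAP_NORMAL,         /* proceed as for host memory */
        EXAMPLE_P2P_MAP_FAIL,
};

/* Hook a dev_pagemap (or whatever owns the device pages) could expose to
 * the arch dma_map_sg() path for the pages it is responsible for. */
struct example_p2p_ops {
        enum example_p2p_map_action (*map_page)(struct device *initiator,
                                                struct page *page,
                                                dma_addr_t *bus_addr);
};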



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 14:48 -0600, Logan Gunthorpe wrote:
> > ...and that dma_map goes through get_dma_ops(), so I don't see the conflict?
> 
> The main conflict is in dma_map_sg which only does get_dma_ops once but
> the sg may contain memory of different types.

We can handle that in our "overridden" dma ops.

It's a bit tricky but it *could* break it down into segments and
forward portions back to the original dma ops.

Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 12:00 -0600, Jason Gunthorpe wrote:
> - All platforms can succeed if the PCI devices are under the same
>   'segment', but where segments begin is somewhat platform specific
>   knowledge. (this is 'same switch' idea Logan has talked about)

We also need to be careful whether P2P is enabled in the switch
or not.

Cheers,
Ben.
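
As a rough sketch of the "same switch / same segment" test (not an existing
helper; whether the switch actually forwards p2p, e.g. its ACS
configuration, still has to be checked separately):

#include <linux/pci.h>

static bool example_share_upstream_bridge(struct pci_dev *a,
                                          struct pci_dev *b)
{
        struct pci_dev *up_a, *up_b;

        for (up_a = pci_upstream_bridge(a); up_a;
             up_a = pci_upstream_bridge(up_a)) {
                for (up_b = pci_upstream_bridge(b); up_b;
                     up_b = pci_upstream_bridge(up_b)) {
                        if (up_a == up_b)
                                return true;
                }
        }

        return false;
}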



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Tue, 2017-04-18 at 10:27 -0700, Dan Williams wrote:
> > FWIW, RDMA probably wouldn't want to use a p2mem device either, we
> > already have APIs that map BAR memory to user space, and would like to
> > keep using them. A 'enable P2P for bar' helper function sounds better
> > to me.
> 
> ...and I think it's not a helper function as much as asking the bus
> provider "can these two device dma to each other". The "helper" is the
> dma api redirecting through a software-iommu that handles bus address
> translation differently than it would handle host memory dma mapping.

Do we even need that function ? The dma_ops have a dma_supported()
call...

If we have those override ops built into the "dma_target" object,
then these things can make that decision knowing both the source
and target device.

Cheers,
Ben.
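
A hypothetical sketch of what such a check could look like if the override
ops lived in a "dma_target" style object; none of these types exist today:

#include <linux/device.h>

struct example_dma_target_ops {
        bool (*peer_supported)(struct device *initiator,
                               struct device *target);
};

struct example_dma_target {
        struct device *dev;             /* device exposing the BAR/memory */
        const struct example_dma_target_ops *ops;
};

static bool example_dma_target_usable_from(struct example_dma_target *t,
                                           struct device *initiator)
{
        if (t->ops && t->ops->peer_supported)
                return t->ops->peer_supported(initiator, t->dev);
        return false;
}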



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-18 Thread Benjamin Herrenschmidt
On Mon, 2017-04-17 at 23:43 -0600, Logan Gunthorpe wrote:
> 
> On 17/04/17 03:11 PM, Benjamin Herrenschmidt wrote:
> > Is it ? Again, you create a "concept" the user may have no idea about,
> > "p2pmem memory". So now any kind of memory buffer on a device can could
> > be use for p2p but also potentially a bunch of other things becomes
> > special and called "p2pmem" ...
> 
> The user is going to have to have an idea about it if they are designing
> systems to make use of it. I've said it before many times: this is an
> optimization with significant trade-offs so the user does have to make
> decisions regarding when to enable it.

Not necessarily. There are many cases where the "end user" won't have
any idea. In any case, I think we bring the story down to those two
points of conflating the allocator with the peer to peer DMA, and the
lack of generality in the approach to solve the peer to peer DMA
problem.

> > But what do you have in p2pmem that somebody benefits from. Again I
> > don't understand what that "p2pmem" device buys you in term of
> > functionality vs. having the device just instanciate the pages.
> 
> Well thanks for just taking a big shit on all of our work without even
> reading the patches. Bravo.

Now now now  calm down. We are being civil here. I'm not shitting
on anything, I'm asking what seems to be a reasonable question in terms
of benefits of the approach you have chosen. Using that sort of
language will not get you anywhere. 

> > Now having some kind of way to override the dma_ops, yes I do get that,
> > and it could be that this "p2pmem" is typically the way to do it, but
> > at the moment you don't even have that. So I'm a bit at a loss here.
> 
> Yes, we've already said many times that this is something we will need
> to add.
> 
> > But it doesn't *have* to be. Again, take my GPU example. The fact that
> > a NIC might be able to DMA into it doesn't make it specifically "p2p
> > memory".
> 
> Just because you use it for other things doesn't mean it can't also
> provide the service of a "p2pmem" device.

But there is no such thing as a "p2pmem" device.. that's what I'm
trying to tell you...

As both Jerome and I tried to explain, there are many reasons why one
may want to do peer DMA into some device memory; that doesn't make that
memory some kind of "p2pmem". It's trying to stick a generic label onto
something that isn't.

That's why I'm suggesting we disconnect the two aspects. On one hand
the problem of handling p2p DMA, whether the target is some memory,
some MMIO registers, etc...

On the other hand, some generic "utility" that can optionally be used
by drivers to manage a pool of DMA memory in the device, essentially a
simple allocator.

The two things are completely orthogonal.

> > So now your "p2pmem" device needs to also be laid out on top of those
> > MMIO registers ? It's becoming weird.
> 
> Yes, Max Gurtovoy has also expressed an interest in expanding this work
> to cover things other than memory. He's suggested simply calling it a
> p2p device, but until we figure out what exactly that all means we can't
> really finalize a name.

Possibly. In any case, I think it should be separate from the
allocation.

> > See, basically, doing peer 2 peer between devices has 3 main challenges
> > today: The DMA API needing struct pages, the MMIO translation issues
> > and the IOMMU translation issues.
> > 
> > You seem to create that added device as some kind of "owner" for the
> > struct pages, solving #1, but leave #2 and #3 alone.
> 
> Well there are other challenges too. Like figuring out when it's
> appropriate to use, tying together the device that provides the memory
> with the driver tring to use it in DMA transactions, etc, etc. Our patch
> set tackles these latter issues.

But it tries to conflate the allocation, which is basically the fact
that this is some kind of "memory pool" with the problem of doing peer
DMA.

I'm advocating for separating the concepts.

> > If we go down that path, though, rather than calling it p2pmem I would
> > call it something like dma_target which I find much clearer especially
> > since it doesn't have to be just memory.
> 
> I'm not set on the name. My arguments have been specifically for the
> existence of an independent struct device. But I'm not really interested
> in getting into bike shedding arguments over what to call it at this
> time when we don't even really know what it's going to end up doing in
> the end.

It's not bike shedding. It's about taking out the allocator part and
making it clear that this isn't something to lay out on top of a p

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Benjamin Herrenschmidt
On Mon, 2017-04-17 at 10:52 -0600, Logan Gunthorpe wrote:
> 
> On 17/04/17 01:20 AM, Benjamin Herrenschmidt wrote:
> > But is it ? For example take a GPU, does it, in your scheme, need an
> > additional "p2pmem" child ? Why can't the GPU driver just use some
> > helper to instantiate the necessary struct pages ? What does having an
> > actual "struct device" child buys you ?
> 
> Yes, in this scheme, it needs an additional p2pmem child. Why is that an
> issue? It certainly makes it a lot easier for the user to understand the
> p2pmem memory in the system (through the sysfs tree) and reason about
> the topology and when to use it. This is important.

Is it ? Again, you create a "concept" the user may have no idea about,
"p2pmem memory". So now any kind of memory buffer on a device that could
be used for p2p, but also potentially a bunch of other things, becomes
special and called "p2pmem" ...

> > > 2) In order to create the struct pages we use the ZONE_DEVICE
> > > infrastructure which requires a struct device. (See
> > > devm_memremap_pages.)
> > 
> > Yup, but you already have one in the actual pci_dev ... What is the
> > benefit of adding a second one ?
> 
> But that would tie all of this very tightly to be pci only and may get
> hard to differentiate if more users of ZONE_DEVICE crop up who happen to
> be using a pci device.

But what do you have in p2pmem that somebody benefits from? Again I
don't understand what that "p2pmem" device buys you in terms of
functionality vs. having the device just instantiate the pages.

Now having some kind of way to override the dma_ops, yes I do get that,
and it could be that this "p2pmem" is typically the way to do it, but
at the moment you don't even have that. So I'm a bit at a loss here.
 
>  Having a specific class for this makes it very
> clear how this memory would be handled.

But it doesn't *have* to be. Again, take my GPU example. The fact that
a NIC might be able to DMA into it doesn't make it specifically "p2p
memory".

Essentially you are saying that any device that happens to have a piece
of mappable "memory" (or something that behaves like it) and can be
DMA'ed into should now have that "p2pmem" thing attached to it.

Now take an example where that becomes really awkward (it's also a real
example of something people want to do). I have a NIC and a GPU, the
NIC DMA's data to/from the GPU, but they also want to poke at each
other doorbell, the GPU to kick the NIC into action when data is ready
to send, the NIC to poke the GPU when data has been received.

Those doorbells are MMIO registers.

So now your "p2pmem" device needs to also be laid out on top of those
MMIO registers ? It's becoming weird.

See, basically, doing peer 2 peer between devices has 3 main challenges
today: The DMA API needing struct pages, the MMIO translation issues
and the IOMMU translation issues.

You seem to create that added device as some kind of "owner" for the
struct pages, solving #1, but leave #2 and #3 alone.

Now, as I said, it could very well be that having the devmap pointer
point to some specific device-type with a well known structure to
provide solutions for #2 and #3 such as dma_ops overrides, is indeed
the right way to solve these problems.

If we go down that path, though, rather than calling it p2pmem I would
call it something like dma_target which I find much clearer especially
since it doesn't have to be just memory.

For the sole case of creating struct page's however, I fail to see the
point.

>  For example, although I haven't
> looked into it, this could very well be a point of conflict with HMM. If
> they were to use the pci device to populate the dev_pagemap then we
> couldn't also use the pci device. I feel it's much better for users of
> dev_pagemap to have their struct devices they own to avoid such conflicts.

If we are going to create some sort of struct dma_target, HMM could
potentially just look for the parent if it needs the PCI device.

> > >  This amazingly gets us the get_dev_pagemap
> > > architecture which also uses a struct device. So by using a p2pmem
> > > device we can go from struct page to struct device to p2pmem device
> > > quickly and effortlessly.
> > 
> > Which isn't terribly useful in itself right ? What you care about is
> > the "enclosing" pci_dev no ? Or am I missing something ?
> 
> Sure it is. What if we want to someday support p2pmem that's on another bus?

But why not directly use that other bus' device in that case ?

> > > 3) You wouldn't want to use the pci's struct device because it doesn't
> > > really describe what's going on. For example, there may be multiple
> > > devices on the pci devi
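
For reference, a sketch of that point: the ZONE_DEVICE pages can simply be
hung off the pci_dev's own struct device, no extra child required. The BAR
index and the percpu_ref handling are elided, and the signature shown is
the four-argument one from this timeframe:

#include <linux/memremap.h>
#include <linux/pci.h>
#include <linux/percpu-refcount.h>

static void *example_remap_bar_as_pages(struct pci_dev *pdev, int bar,
                                        struct percpu_ref *ref)
{
        /* Hang the ZONE_DEVICE mapping off the pci_dev's own device;
         * the caller checks the return value with IS_ERR(). */
        return devm_memremap_pages(&pdev->dev, &pdev->resource[bar],
                                   ref, NULL);
}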

Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-17 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 23:13 -0600, Logan Gunthorpe wrote:
> 
> > 
> > I'm still not 100% sure why you need a "p2pmem device", mind you ...
> 
> Well, you don't "need" it but it is a design choice that I think makes a
> lot of sense for the following reasons:
> 
> 1) p2pmem is in fact a device on the pci bus. A pci driver will need to
> set it up and create the device and thus it will have a natural parent
> pci device. Instantiating a struct device for it means it will appear in
> the device hierarchy and one can use that to reason about its position
> in the topology.

But is it ? For example take a GPU, does it, in your scheme, need an
additional "p2pmem" child ? Why can't the GPU driver just use some
helper to instantiate the necessary struct pages ? What does having an
actual "struct device" child buy you ?

> 2) In order to create the struct pages we use the ZONE_DEVICE
> infrastructure which requires a struct device. (See
> devm_memremap_pages.)

Yup, but you already have one in the actual pci_dev ... What is the
benefit of adding a second one ?
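
For comparison, a minimal sketch of creating the ZONE_DEVICE pages directly
against the pci_dev's existing struct device (assuming the 4.x-era
devm_memremap_pages(dev, res, ref, altmap) signature; percpu_ref set-up and
error handling are omitted):

#include <linux/pci.h>
#include <linux/memremap.h>
#include <linux/percpu-refcount.h>

/* Hypothetical driver fragment: struct pages over BAR 'bar', owned by
 * the pci_dev itself rather than an extra child device. */
static void *map_bar_pages(struct pci_dev *pdev, int bar,
                           struct percpu_ref *ref)
{
        struct resource *res = &pdev->resource[bar];

        /* ref: the caller's percpu_ref for page pinning; no altmap here */
        return devm_memremap_pages(&pdev->dev, res, ref, NULL);
}

In other words devm_memremap_pages does not itself require a dedicated
child device, only some struct device to own the mapping's lifetime.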

>  This amazingly gets us the get_dev_pagemap
> architecture which also uses a struct device. So by using a p2pmem
> device we can go from struct page to struct device to p2pmem device
> quickly and effortlessly.

Which isn't terribly useful in itself right ? What you care about is
the "enclosing" pci_dev no ? Or am I missing something ?

> 3) You wouldn't want to use the pci's struct device because it doesn't
> really describe what's going on. For example, there may be multiple
> devices on the pci device in question: eg. an NVME card and some p2pmem.

What is "some p2pmem" ?

> Or it could be a NIC with some p2pmem.

Again what is "some p2pmem" ?

That a device might have some memory-like buffer space is all well and
good but does it need to be specifically distinguished at the device
level ? It could be inherent to what the device is... for example again
take the GPU example, why would you call the FB memory "p2pmem" ? 

>  Or it could just be p2pmem by itself. And the logic to figure out what
>  memory is available and where
> the address is will be non-standard so it's really straightforward to
> have any pci driver just instantiate a p2pmem device.

Again I'm not sure why it needs to "instantiate a p2pmem" device. Maybe
it's the term "p2pmem" that puts me off. If p2pmem provided a standard
way to look up the various offsets etc. that I mentioned earlier, then
yes, it would make sense to have it as a staging point. As-is, I don't
know.

> It is probably worth you reading the RFC patches at this point to get a
> better feel for this.

Yup, I'll have another look a bit more in depth.

Cheers,
Ben.


> Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 10:34 -0600, Logan Gunthorpe wrote:
> 
> On 16/04/17 09:53 AM, Dan Williams wrote:
> > ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
> > context about the physical address in question. I'm thinking you can
> > hang bus address translation data off of that structure. This seems
> > vaguely similar to what HMM is doing.
> 
> Thanks! I didn't realize you had the infrastructure to look up a device
> from a pfn/page. That would really come in handy for us.

It does indeed. I won't be able to play with that much for a few weeks
(see my other email) so if you're going to tackle this while I'm away,
can you work with Jerome to make sure you don't conflict with HMM ?

I really want a way for HMM to be able to lay out struct pages over the
GPU BARs rather than in "allocated free space" for the case where the
BAR is big enough to cover all of the GPU memory.

In general, I'd like a simple & generic way for any driver to ask the
core to lay out DMA'able struct pages over BAR space. I am not convinced
this requires a "p2pmem device" to be created on top of this, but
that's a different discussion.

Of course the actual ability to perform the DMA mapping will be subject
to various restrictions that will have to be implemented in the actual
"dma_ops override" backend. We can have generic code to handle the case
where devices reside on the same domain, which can deal with switch
configuration etc.; we will need iommu-specific code to handle
the case going through the fabric. 

Virtualization is a separate can of worms due to how qemu completely
fakes the MMIO space, we can look into that later.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 10:47 -0600, Logan Gunthorpe wrote:
> > I think you need to give other archs a chance to support this with a
> > design that considers the offset case as a first class citizen rather
> > than an afterthought.
> 
> I'll consider this. Given the fact I can use your existing
> get_dev_pagemap infrastructure to look up the p2pmem device this
> probably isn't as hard as I thought it would be anyway (we probably
> don't even need a page flag). We'd just have to look up the dev_pagemap,
> test if it's a p2pmem device, and if so, call a p2pmem_dma_map function
> which could apply the offset or do any other arch specific logic (if
> necessary).

I'm still not 100% sure why you need a "p2pmem device", mind you ...

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 08:53 -0700, Dan Williams wrote:
> > Just thinking out loud ... I don't have a firm idea or a design. But
> > peer to peer is definitely a problem we need to tackle generically, the
> > demand for it keeps coming up.
> 
> ZONE_DEVICE allows you to redirect via get_dev_pagemap() to retrieve
> context about the physical address in question. I'm thinking you can
> hang bus address translation data off of that structure. This seems
> vaguely similar to what HMM is doing.

Ok, that's interesting. That would be a way to handle the "lookup" I
was mentioning in the email I sent a few minutes ago.

We would probably need to put some "structure" to that context.
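
Something along these lines, perhaps (a sketch only: p2p_ctx and its
bus_offset field are made-up names for that "context"; only
get_dev_pagemap(), dev_get_drvdata() and PHYS_PFN() are existing
interfaces, it assumes the dev_pagemap of that era which records the
registering struct device, and the reference/locking rules around the
pagemap lookup are elided):

#include <linux/memremap.h>
#include <linux/device.h>
#include <linux/pfn.h>

/* Hypothetical per-region context a driver would hang off the pagemap's
 * registering device. */
struct p2p_ctx {
        u64 bus_offset;         /* CPU physical -> PCI bus address delta */
};

/* Sketch: translate a peer BAR physical address into a bus address. */
static u64 p2p_phys_to_bus(phys_addr_t phys)
{
        struct dev_pagemap *pgmap = get_dev_pagemap(PHYS_PFN(phys), NULL);
        struct p2p_ctx *ctx = dev_get_drvdata(pgmap->dev);

        return phys - ctx->bus_offset;
}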

I'm very short on time to look into the details of this for at least
a month (I'm taking about 3 weeks off for personal reasons next week),
but I'm happy to dive more into this when I'm back and sort out with
Jerome how to make it all coexist nicely with HMM.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-16 Thread Benjamin Herrenschmidt
On Sun, 2017-04-16 at 08:44 -0700, Dan Williams wrote:
> The difference is that there was nothing fundamental in the core
> design of pmem + DAX that prevented other archs from growing pmem
> support.

Indeed. In fact we have work in progress support for pmem on power
using experimental HW.

> THP and memory hotplug existed on other architectures and
> they just need to plug in their arch-specific enabling. p2p support
> needs the same starting point of something more than one architecture
> can plug into, and handling the bus address offset case needs to be
> incorporated into the design.
> 
> pmem + dax did not change the meaning of what a dma_addr_t is, p2p does.

The more I think about it, the more I tend toward something along the
lines of having the arch DMA ops be able to quickly differentiate
between "normal" memory (which includes non-PCI pmem in some cases,
it's an architecture choice I suppose) and "special device" (page flag
? pfn bit ? ... there are options).

From there, we keep our existing fast path for the normal case.

For the special case, we need to provide a fast lookup mechanism
(assuming we can't stash enough stuff in struct page or the pfn)
to get back to a struct of some sort that provides the necessary
information to resolve the translation.

This *could* be something like a struct p2pmem device that carries
a special set of DMA ops, though we probably shouldn't make the generic
structure PCI specific.

This is a slightly slower path, but that "stub" structure allows the
special DMA ops to provide the necessary bus-specific knowledge, which
for PCI for example, can check whether the devices are on the same
segment, whether the switches are configured to allow p2p, etc...

What form should that fast lookup take ? It's not completely clear to
me at that point. We could start with a simple linear lookup I suppose
and improve in a second stage.
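
To make the fast path / slow path split concrete, the shape of the idea
might be something like this (the two helpers are hypothetical stand-ins,
not existing code):

#include <linux/mm.h>
#include <linux/device.h>
#include <linux/dma-direction.h>

dma_addr_t arch_map_normal_page(struct device *dev, struct page *page,
                                unsigned long offset, size_t size,
                                enum dma_data_direction dir);
dma_addr_t p2p_stub_map_page(struct device *dev, struct page *page,
                             unsigned long offset, size_t size,
                             enum dma_data_direction dir);

static dma_addr_t arch_map_page(struct device *dev, struct page *page,
                                unsigned long offset, size_t size,
                                enum dma_data_direction dir)
{
        if (likely(!is_zone_device_page(page)))
                /* fast path: ordinary system memory, behaviour unchanged */
                return arch_map_normal_page(dev, page, offset, size, dir);

        /* slow path: find the owning "stub" structure and let its
         * bus-specific ops check the topology (same segment ? switches
         * configured for p2p ?) and apply any address translation. */
        return p2p_stub_map_page(dev, page, offset, size, dir);
}

arch_map_normal_page() and p2p_stub_map_page() stand in for whatever the
architecture and the "stub" structure would actually provide.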

Of course this pipes into the old discussion about disconnecting
the DMA ops from struct page. If we keep struct page, any device that
wants to be a potential DMA target will need to do something "special"
to create those struct pages etc.. though we could make that a simple
pci helper that pops the necessary bits and pieces for a given BAR &
range.

If we don't need struct page, then it might be possible to hide it all
in the PCI infrastructure.

> > Virtualization specifically would be a _lot_ more difficult than simply
> > supporting offsets. The actual topology of the bus will probably be lost
> > on the guest OS and it would therefore have a difficult time figuring out
> > when it's acceptable to use p2pmem. I also have a difficult time seeing
> > a use case for it and thus I have a hard time with the argument that we
> > can't support use cases that do want it because use cases that don't
> > want it (perhaps yet) won't work.
> > 
> > > This is an interesting experiement to look at I suppose, but if you
> > > ever want this upstream I would like at least for you to develop a
> > > strategy to support the wider case, if not an actual implementation.
> > 
> > I think there are plenty of avenues forward to support offsets, etc.
> > It's just work. Nothing we'd be proposing would be incompatible with it.
> > We just don't want to have to do it all upfront especially when no one
> > really knows how well various architectures' hardware supports this or
> > if anyone even wants to run it on systems such as those. (Keep in mind
> > this is a pretty specific optimization that mostly helps systems
> > designed in specific ways -- not a general "everybody gets faster" type
> > situation.) Get the cases working we know will work, can easily support
> > and people actually want.  Then expand it to support others as people
> > come around with hardware to test and use cases for it.
> 
> I think you need to give other archs a chance to support this with a
> design that considers the offset case as a first class citizen rather
> than an afterthought.

Thanks :-) There's a reason why I'm insisting on this. We have constant
requests for this today. We have hacks in the GPU drivers to do it for
GPUs behind a switch, but those are just that, ad-hoc hacks in the
drivers. We have similar grossness around the corner with some CAPI
NICs trying to DMA to GPUs. I have people trying to use PLX DMA engines
to whack NVMe devices.

I'm very interested in a more generic solution to deal with the problem
of P2P between devices. I'm happy to contribute with code to handle the
powerpc bits but we need to agree on the design first :)

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-15 Thread Benjamin Herrenschmidt
On Sat, 2017-04-15 at 15:09 -0700, Dan Williams wrote:
> I'm wondering, since this is limited to support behind a single
> switch, if you could have a software-iommu hanging off that switch
> device object that knows how to catch and translate the non-zero
> offset bus address case. We have something like this with VMD driver,
> and I toyed with a soft pci bridge when trying to support AHCI+NVME
> bar remapping. When the dma api looks up the iommu for its device it
> hits this soft-iommu and that driver checks if the page is host memory
> or device memory to do the dma translation. You wouldn't need a bit in
> struct page, just a lookup to the hosting struct dev_pagemap in the
> is_zone_device_page() case and that can point you to p2p details.

I was thinking about a hook in the arch DMA ops but that kind of
wrapper might work instead indeed. However I'm not sure what's the best
way to "instantiate" it.

The main issue is that the DMA ops are a function of the initiator,
not the target (since the target is supposed to be memory) so things
are a bit awkward.

One (user ?) would have to know that a given device "intends" to DMA
directly to another device.

This is awkward because in the ideal scenario, this isn't something the
device knows. For example, one could want to have an existing NIC DMA
directly to/from NVME pages or GPU pages.

The NIC itself doesn't know the characteristic of these pages, but
*something* needs to insert itself in the DMA ops of that bridge to
make it possible.

That's why I wonder if it's the struct page of the target that should
be "marked" in such a way that the arch dma_ops can immediately catch
that they belong to a device and might require "wrapped" operations.

Are ZONE_DEVICE pages identifiable based on the struct page alone ? (a
flag ?)
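
For reference, with CONFIG_ZONE_DEVICE that test already exists and is
cheap; include/linux/mm.h has, roughly:

static inline bool is_zone_device_page(const struct page *page)
{
        return page_zonenum(page) == ZONE_DEVICE;
}

so the fast path for normal memory targets mentioned below would only pay
for a single zone check.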

That would allow us to keep a fast path for normal memory targets, but
also have some kind of way to handle the special cases of such peer 2
peer (or also handle other types of peer to peer that don't necessarily
involve PCI address wrangling but could require additional iommu bits).

Just thinking out loud ... I don't have a firm idea or a design. But
peer to peer is definitely a problem we need to tackle generically, the
demand for it keeps coming up.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-15 Thread Benjamin Herrenschmidt
On Sat, 2017-04-15 at 11:41 -0600, Logan Gunthorpe wrote:
> Thanks, Benjamin, for the summary of some of the issues.
> 
> On 14/04/17 04:07 PM, Benjamin Herrenschmidt wrote
> > So I assume the p2p code provides a way to address that too via special
> > dma_ops ? Or wrappers ?
> 
> Not at this time. We will probably need a way to ensure the iommus do
> not attempt to remap these addresses.

You can't. If the iommu is on, everything is remapped. Or do you mean
to have dma_map_* not do a remapping ?

That's the problem again, same as before, for that to work, the
dma_map_* ops would have to do something special that depends on *both*
the source and target device.

The current DMA infrastructure doesn't have anything like that. It's a
rather fundamental issue to your design that you need to address.

The dma_ops today are architecture specific and have no way to
differentiate between normal and those special P2P DMA pages.
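
In other words, what is missing is an interface shaped roughly like this
(entirely hypothetical; nothing of the sort exists in the DMA API today):

#include <linux/device.h>
#include <linux/dma-direction.h>
#include <linux/types.h>

/* Hypothetical: a mapping call that sees both ends of the transfer. */
dma_addr_t dma_map_peer(struct device *initiator, struct device *target,
                        phys_addr_t target_phys, size_t size,
                        enum dma_data_direction dir);

whereas dma_map_page() and friends only ever see the initiator and assume
the target is ordinary system memory.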

> Though if it does, I'd expect
> everything would still work you just wouldn't get the performance or
> traffic flow you are looking for. We've been testing with the software
> iommu which doesn't have this problem.

So first, no, it's more than "you wouldn't get the performance". On
some systems it may also just not work. Also what do you mean by "the
SW iommu doesn't have this problem" ? It catches the fact that
addresses don't point to RAM and maps differently ?

> > The problem is that the latter while seemingly easier, is also slower
> > and not supported by all platforms and architectures (for example,
> > POWER currently won't allow it, or rather only allows a store-only
> > subset of it under special circumstances).
> 
> Yes, I think situations where we have to cross host bridges will remain
> unsupported by this work for a long time. There are too many cases where
> it just doesn't work or it performs too poorly to be useful.

And the situation where you don't cross bridges is the one where you
need to also take into account the offsets.

*both* cases mean that you need to somehow intervene at the dma_ops
level to handle this. Which means having a way to identify your special
struct pages or PFNs to allow the arch to add a special case to the
dma_ops.

> > I don't fully understand how p2pmem "solves" that by creating struct
> > pages. The offset problem is one issue. But there's the iommu issue as
> > well, the driver cannot just use the normal dma_map ops.
> 
> We are not using a proper iommu and we are dealing with systems that
> have zero offset. This case is also easily supported. I expect fixing
> the iommus to not map these addresses would also be reasonably achievable.

So you are designing something that is built from scratch to only work
on a specific limited category of systems and is also incompatible with
virtualization.

This is an interesting experiment to look at, I suppose, but if you
ever want this upstream I would like at least for you to develop a
strategy to support the wider case, if not an actual implementation.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Fri, 2017-04-14 at 14:04 -0500, Bjorn Helgaas wrote:
> I'm a little hesitant about excluding offset support, so I'd like to
> hear more about this.
> 
> Is the issue related to PCI BARs that are not completely addressable
> by the CPU?  If so, that sounds like a first-class issue that should
> be resolved up front because I don't think the PCI core in general
> would deal well with that.
> 
> If all the PCI memory of interest is in fact addressable by the CPU,
> I would think it would be pretty straightforward to support offsets
> --
> everywhere you currently use a PCI bus address, you just use the
> corresponding CPU physical address instead.

It's not *that* easy, sadly. The reason is that the normal dma map APIs
assume the "target" of the DMAs is system memory; there is no way to
pass them "another device" in a way that allows them to understand the
offsets if needed.

That said, dma_map will already be problematic when doing p2p behind
the same bridge due to the fact that the iommu is not in the way so you
can't use the arch standard ops there.

So I assume the p2p code provides a way to address that too via special
dma_ops ? Or wrappers ?

Basically there are two very different ways you can do p2p, either
behind the same host bridge or across two host bridges:

 - Behind the same host bridge, you don't go through the iommu, which
means that even if your target looks like a struct page, you can't just
use dma_map_* on it because what you'll get back from that is an iommu
token, not a sibling BAR offset. Additionally, if you use the target
struct resource address, you will need to offset it to get back to the
actual BAR value on the PCIe bus.

 - Behind different host bridges, then you go through the iommu and the
host remapping. I.e., that's actually the easy case. You can probably
just use the normal iommu path and normal dma mapping ops, you just
need to have your struct page representing the target device BAR *in
CPU space* this time. So no offsetting required either.

The problem is that the latter while seemingly easier, is also slower
and not supported by all platforms and architectures (for example,
POWER currently won't allow it, or rather only allows a store-only
subset of it under special circumstances).

So what people practically want to do is to have 2 devices behind a
switch DMA'ing to/from each other.

But that brings the 2 problems above.

I don't fully understand how p2pmem "solves" that by creating struct
pages. The offset problem is one issue. But there's the iommu issue as
well, the driver cannot just use the normal dma_map ops.

I haven't had a chance to look at the details of the patches but it's
not clear from the description in patch 0 how that is solved.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 22:40 -0600, Logan Gunthorpe wrote:
> 
> On 13/04/17 10:16 PM, Jason Gunthorpe wrote:
> > I'd suggest just detecting if there is any translation in bus
> > addresses anywhere and just hard disabling P2P on such systems.
> 
> That's a fantastic suggestion. It simplifies things significantly.
> Unless there are any significant objections I think I will plan on
> doing
> that.

I object.

> > On modern hardware with 64 bit BARs there is very little reason to
> > have translation, so I think this is a legacy feature.
> 
> Yes, p2pmem users are likely to be designing systems around it (ie
> JBOFs) and not trying to shoehorn it onto legacy architectures.
> 
> At the very least, it makes sense to leave it out and if someone
> comes
> along who cares they can put in the effort to support the address
> translation.
> 
> Thanks,
> 
> Logan


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Fri, 2017-04-14 at 21:37 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2017-04-13 at 22:40 -0600, Logan Gunthorpe wrote:
> > 
> > On 13/04/17 10:16 PM, Jason Gunthorpe wrote:
> > > I'd suggest just detecting if there is any translation in bus
> > > addresses anywhere and just hard disabling P2P on such systems.
> > 
> > That's a fantastic suggestion. It simplifies things significantly.
> > Unless there are any significant objections I think I will plan on
> > doing
> > that.
> 
> I object.

Note: It would also make your stuff fundamentally incompatible with
KVM guest pass-through since KVM plays with remapping BARs all over
the place.

Ben.

> > > On modern hardware with 64 bit BARs there is very little reason
> > > to
> > > have translation, so I think this is a legacy feature.
> > 
> > Yes, p2pmem users are likely to be designing systems around it (ie
> > JBOFs) and not trying to shoehorn it onto legacy architectures.
> > 
> > At the very least, it makes sense to leave it out and if someone
> > comes
> > along who cares they can put in the effort to support the address
> > translation.
> > 
> > Thanks,
> > 
> > Logan



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-14 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 22:16 -0600, Jason Gunthorpe wrote:
> > Any caller of pci_add_resource_offset() uses CPU addresses different from
> > the PCI bus addresses (unless the offset is zero, of course).  All ACPI
> > platforms also support this translation (see "translation_offset"), though
> > in most x86 systems the offset is zero.  I'm aware of one x86 system that
> > was tested with a non-zero offset but I don't think it was shipped that
> > way.
> 
> I'd suggest just detecting if there is any translation in bus
> addresses anywhere and just hard disabling P2P on such systems.

I object to designing a subsystem that by design cannot work on whole
categories of architectures out there.

> On modern hardware with 64 bit BARs there is very little reason to
> have translation, so I think this is a legacy feature.

No it's not.

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-13 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 15:22 -0600, Logan Gunthorpe wrote:
> 
> On 12/04/17 03:55 PM, Benjamin Herrenschmidt wrote:
> > Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
> > will perform the conversion between the struct resource content (CPU
> > physical address) and the actual PCI bus side address.
> 
> Ah, thanks for the tip! On my system, this translation returns the same
> address so it was not necessary. And, yes, that means this would have to
> find its way into the dma mapping routine somehow. This means we'll
> eventually need a way to look-up the p2pmem device from the struct page.
> Which means we will likely need a new flag bit in the struct page or
> something. The big difficulty I see is testing. Do you know what
> architectures or in what circumstances are these translations used?

I think a bunch of non-x86 architectures but I don't know which ones
outside of powerpc.

> > When behind the same switch you need to use PCI addresses. If one tries
> > later to do P2P between host bridges (via the CPU fabric) things get
> > more complex and one will have to use either CPU addresses or something
> > else altogether (probably would have to teach the arch DMA mapping
> > routines to work with those struct pages you create and return the
> > right thing).
> 
> Probably for starters we'd want to explicitly deny cases between host
> bridges and add that later if someone wants to do the testing.

Cheers,
Ben.

> Thanks,
> 
> Logan


Re: [PATCH 3/3] powernv:idle: Set LPCR_UPRT on wakeup from deep-stop

2017-04-12 Thread Benjamin Herrenschmidt
On Thu, 2017-04-13 at 09:28 +0530, Aneesh Kumar K.V wrote:
> >   #endif
> >    mtctr   r12
> >    bctrl
> > +/*
> > + * cur_cpu_spec->cpu_restore would restore LPCR to a
> > + * sane value that is set at early boot time,
> > + * thereby clearing LPCR_UPRT.
> > + * LPCR_UPRT is required if we are running in Radix mode.
> > + * Set it here if that be the case.
> > + */
> > +BEGIN_MMU_FTR_SECTION
> > + mfspr   r3, SPRN_LPCR
> > + LOAD_REG_IMMEDIATE(r4, LPCR_UPRT)
> > + or  r3, r3, r4
> > + mtspr   SPRN_LPCR, r3
> > +END_MMU_FTR_SECTION_IFSET(MMU_FTR_TYPE_RADIX)

We are probably better off saving the value somewhere during boot
and just "blasting" it whole back.
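
A minimal sketch of that alternative (the variable and the hook points are
hypothetical; only mfspr/mtspr and SPRN_LPCR are existing symbols):

#include <linux/init.h>
#include <linux/types.h>
#include <asm/reg.h>

/* Saved once during early boot, after LPCR has been fully configured
 * (including LPCR_UPRT when running radix). */
static u64 boot_lpcr;

static void __init save_boot_lpcr(void)
{
        boot_lpcr = mfspr(SPRN_LPCR);
}

/* On wakeup from a deep stop state, after cpu_restore has run: */
static void restore_boot_lpcr(void)
{
        mtspr(SPRN_LPCR, boot_lpcr);    /* blast the whole saved value back */
}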

Cheers
Ben.


Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-12 Thread Benjamin Herrenschmidt
On Wed, 2017-04-12 at 11:09 -0600, Logan Gunthorpe wrote:
> 
> > Do you handle funky address translation too ? IE. the fact that the PCI
> > addresses aren't the same as the CPU physical addresses for a BAR ?
> 
> No, we use the CPU physical address of the BAR. If it's not mapped that
> way we can't use it.

Ok, you need to fix that or a bunch of architectures won't work. 

Look at pcibios_resource_to_bus() and pcibios_bus_to_resource(). They
will perform the conversion between the struct resource content (CPU
physical address) and the actual PCI bus side address.
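
Concretely, something like this, where pdev and bar stand for whatever
device and BAR the exporting driver owns (a sketch, not taken from the
patches under discussion):

#include <linux/pci.h>

/* The address a peer behind the same switch must DMA to is the bus
 * address of the BAR, not necessarily its CPU physical address. */
static u64 bar_bus_addr(struct pci_dev *pdev, int bar)
{
        struct pci_bus_region region;

        pcibios_resource_to_bus(pdev->bus, &region, &pdev->resource[bar]);
        return region.start;    /* may differ from pdev->resource[bar].start */
}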

When behind the same switch you need to use PCI addresses. If one tries
later to do P2P between host bridges (via the CPU fabric) things get
more complex and one will have to use either CPU addresses or something
else alltogether (probably would have to teach the arch DMA mapping
routines to work with those struct pages you create and return the
right thing).

> > > This will mean many setups that could likely
> > > work well will not be supported so that we can be more confident it
> > > will work and not place any responsibility on the user to understand
> > > their topology. (We've chosen to go this route based on feedback we
> > > received at LSF).
> > > 
> > > In order to enable this functionality we introduce a new p2pmem device
> > > which can be instantiated by PCI drivers. The device will register some
> > > PCI memory as ZONE_DEVICE and provide an genalloc based allocator for
> > > users of these devices to get buffers.
> > 
> > I don't completely understand this. This is actual memory on the PCI
> > bus ? Where does it come from ? Or are you just trying to create struct
> > pages that cover your PCIe DMA target ?
> 
> Yes, the memory is on the PCI bus in a BAR. For now we have a special
> PCI card for this, but in the future it would likely be the CMB in an
> NVMe card. These patches create struct pages to map these BAR addresses
> using ZONE_DEVICE.

Ok.

So ideally we'd want things like dma_map_* to be able to be fed those
struct pages and do the right thing which is ... tricky, especially
with the address translation I mentioned since the address will be
different depending on whether the initiator is on the same host bridge as the
target or not.

> > So correct me if I'm wrong, you are trying to create struct page's that
> > map a PCIe BAR right ? I'm trying to understand how that interacts with
> > what Jerome is doing for HMM.
> 
> Yes, well we are using ZONE_DEVICE in the exact same way as the dax code
> is. These patches use the existing API with no modifications. As I
> understand it, HMM was using ZONE_DEVICE in a way that was quite
> different to how it was originally designed.

Sort-of. I don't see why there would be a conflict with the struct
pages use though. Jerome can you chime in ? Jerome: It looks like they
are just laying out struct pages over a BAR, which is the same thing I
think you should do when the BAR is "large enough" for the GPU memory.

The case where HMM uses "holes" in the address space for its struct
page is somewhat orthogonal but I also see no conflict here.

> > The reason is that the HMM currently creates the struct pages with
> > "fake" PFNs pointing to a hole in the address space rather than
> > covering the actual PCIe memory of the GPU. He does that to deal with
> > the fact that some GPUs have a smaller aperture on PCIe than their
> > total memory.
> 
> I'm aware of what HMM is trying to do and although I'm not familiar with
> the intimate details, I saw it as fairly orthogonal to what we are
> attempting to do.

Right.

> > However, I have asked him to only apply that policy if the aperture is
> > indeed smaller, and if not, create struct pages that directly cover the
> > PCIe BAR of the GPU instead, which will work better on systems or
> > architectures that don't have a "pinhole" window limitation.
> > However he was under the impression that this was going to collide with
> > what you guys are doing, so I'm trying to understand how. 
> 
> I'm not sure I understand how either. However, I suspect if you collide
> with these patches then you'd also be breaking dax too.

Possibly but as I said, I don't see why so I'll let Jerome chime in
since he was under the impression that there was a conflict here :-)

Cheers,
Ben.



Re: [RFC 0/8] Copy Offload with Peer-to-Peer PCI Memory

2017-04-12 Thread Benjamin Herrenschmidt
On Thu, 2017-03-30 at 16:12 -0600, Logan Gunthorpe wrote:
> Hello,
> 
> As discussed at LSF/MM we'd like to present our work to enable
> copy offload support in NVMe fabrics RDMA targets. We'd appreciate
> some review and feedback from the community on our direction.
> This series is not intended to go upstream at this point.
> 
> The concept here is to use memory that's exposed on a PCI BAR as
> data buffers in the NVME target code such that data can be transferred
> from an RDMA NIC to the special memory and then directly to an NVMe
> device avoiding system memory entirely. The upside of this is better
> QoS for applications running on the CPU utilizing memory and lower
> PCI bandwidth required to the CPU (such that systems could be designed
> with fewer lanes connected to the CPU). However, the trade-off is
> currently a reduction in overall throughput. (Largely due to hardware
> issues that would certainly improve in the future).

Another issue of course is that not all systems support P2P
between host bridges :-) (Though almost all switches can enable it).

> Due to these trade-offs we've designed the system to only enable using
> the PCI memory in cases where the NIC, NVMe devices and memory are all
> behind the same PCI switch.

Ok. I suppose that's a reasonable starting point. I haven't looked
at the patches in detail yet but it would be nice if that policy was in
a well isolated component so it can potentially be affected by
arch/platform code.
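
One way to keep that policy self-contained (purely a sketch with a
made-up helper, not code from the series) is a small predicate that
walks the upstream bridges of the two devices:

#include <linux/pci.h>

/* Sketch: true if the two devices share an upstream PCI bridge, i.e.
 * sit under the same switch or root port. Arch/platform code could
 * override or tighten a check like this.
 */
static bool p2p_share_upstream_bridge(struct pci_dev *a, struct pci_dev *b)
{
	struct pci_dev *up_a, *up_b;

	for (up_a = pci_upstream_bridge(a); up_a;
	     up_a = pci_upstream_bridge(up_a))
		for (up_b = pci_upstream_bridge(b); up_b;
		     up_b = pci_upstream_bridge(up_b))
			if (up_a == up_b)
				return true;
	return false;
}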

Do you handle funky address translation too ? IE. the fact that the PCI
addresses aren't the same as the CPU physical addresses for a BAR ?

> This will mean many setups that could likely
> work well will not be supported so that we can be more confident it
> will work and not place any responsibility on the user to understand
> their topology. (We've chosen to go this route based on feedback we
> received at LSF).
> 
> In order to enable this functionality we introduce a new p2pmem device
> which can be instantiated by PCI drivers. The device will register some
> PCI memory as ZONE_DEVICE and provide a genalloc-based allocator for
> users of these devices to get buffers.

I don't completely understand this. This is actual memory on the PCI
bus ? Where does it come from ? Or are you just trying to create struct
pages that cover your PCIe DMA target ?
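
(For reference, the "genalloc based allocator" in the quoted
description could, in outline, look something like the following
sketch; the names and error handling are illustrative, not the actual
p2pmem code.)

#include <linux/genalloc.h>
#include <linux/io.h>
#include <linux/pci.h>

/* Sketch: build a gen_pool over a BAR so callers can grab buffers that
 * live in the device's PCI memory.
 */
static struct gen_pool *p2p_pool_create(struct pci_dev *pdev, int bar)
{
	size_t size = pci_resource_len(pdev, bar);
	struct gen_pool *pool;
	void __iomem *va;

	va = pci_iomap(pdev, bar, size);
	if (!va)
		return NULL;

	pool = gen_pool_create(PAGE_SHIFT, dev_to_node(&pdev->dev));
	if (!pool) {
		pci_iounmap(pdev, va);
		return NULL;
	}

	/* Track both the kernel mapping and the CPU physical base of the BAR. */
	if (gen_pool_add_virt(pool, (unsigned long)va,
			      pci_resource_start(pdev, bar), size,
			      dev_to_node(&pdev->dev))) {
		gen_pool_destroy(pool);
		pci_iounmap(pdev, va);
		return NULL;
	}
	return pool;
}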
 
> We give an example of enabling
> p2p memory with the cxgb4 driver, however currently these devices have
> some hardware issues that prevent their use so we will likely be
> dropping this patch in the future. Ideally, we'd want to enable this
> functionality with NVME CMB buffers, however we don't have any hardware
> with this feature at this time.

So correct me if I'm wrong, you are trying to create struct page's that
map a PCIe BAR right ? I'm trying to understand how that interacts with
what Jerome is doing for HMM.

The reason is that the HMM currently creates the struct pages with
"fake" PFNs pointing to a hole in the address space rather than
covering the actual PCIe memory of the GPU. He does that to deal with
the fact that some GPUs have a smaller aperture on PCIe than their
total memory.

However, I have asked him to only apply that policy if the aperture is
indeed smaller, and if not, create struct pages that directly cover the
PCIe BAR of the GPU instead, which will work better on systems or
architecture that don't have a "pinhole" window limitation.

However he was under the impression that this was going to collide with
what you guys are doing, so I'm trying to understand how. 

> In nvmet-rdma, we attempt to get an appropriate p2pmem device at
> queue creation time and if a suitable one is found we will use it for
> all the (non-inlined) memory in the queue. An 'allow_p2pmem' configfs
> attribute is also created which is required to be set before any p2pmem
> is attempted.
> 
> This patchset also includes a more controversial patch which provides an
> interface for userspace to obtain p2pmem buffers through an mmap call on
> a cdev. This enables userspace to fairly easily use p2pmem with RDMA and
> O_DIRECT interfaces. However, the user would be entirely responsible for
> knowing what they're doing and inspecting sysfs to understand the PCI
> topology and only using it in sane situations.
> 
> Thanks,
> 
> Logan
> 
> 
> Logan Gunthorpe (6):
>   Introduce Peer-to-Peer memory (p2pmem) device
>   nvmet: Use p2pmem in nvme target
>   scatterlist: Modify SG copy functions to support io memory.
>   nvmet: Be careful about using iomem accesses when dealing with p2pmem
>   p2pmem: Support device removal
>   p2pmem: Added char device user interface
> 
> Steve Wise (2):
>   cxgb4: setup pcie memory window 4 and create p2pmem region
>   p2pmem: Add debugfs "stats" file
> 
>  drivers/memory/Kconfig  |   5 +
>  drivers/memory/Makefile |   2 +
>  drivers/memory/p2pmem.c | 697 
> 
>  

Re: [PATCH v3 21/32] powerpc: include default ioremap_nopost() implementation

2017-04-11 Thread Benjamin Herrenschmidt
On Tue, 2017-04-11 at 15:24 +0100, Lorenzo Pieralisi wrote:
> Ok, point taken. BTW, may I ask you guys to have a look into this
> please ?
> 
> https://lkml.org/lkml/2017/4/6/743
> 
> It is a side effect of this thread (v2), not sure why 
> on powerpc has to include .

Not sure how we ended up with that... it's odd indeed.

Michael ? Any reason we can't just remove it ?

Cheers,
Ben.


