[PATCH 2/3] nvme: Retrieve the required IO queue entry size from the controller

2019-07-15 Thread Benjamin Herrenschmidt
On PCIe-based NVMe devices, this retrieves the IO queue entry
size from the controller and uses the "required" (i.e., minimum) setting.

It should always be 6 (64 bytes) per the spec. However, some controllers
such as Apple's do not properly implement the spec and require
the size to be 7 (128 bytes).

This provides the groundwork for the subsequent quirks for these
controllers.
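As an illustration, a follow-up quirk can then simply override the
value read from SQES (a sketch; the quirk name here is hypothetical
at this point in the series):

	if (ctrl->quirks & NVME_QUIRK_128_BYTES_SQES)	/* hypothetical */
		ctrl->iosqes = 7;	/* 2^7 = 128 byte SQ entries */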

Signed-off-by: Benjamin Herrenschmidt 
---
 drivers/nvme/host/core.c | 25 +++++++++++++++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/pci.c  |  9 ++++++---
 include/linux/nvme.h     |  1 +
 4 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index cc09b81fc7f4..716ebe87a2b8 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1986,6 +1986,7 @@ int nvme_enable_ctrl(struct nvme_ctrl *ctrl, u64 cap)
ctrl->ctrl_config = NVME_CC_CSS_NVM;
ctrl->ctrl_config |= (page_shift - 12) << NVME_CC_MPS_SHIFT;
ctrl->ctrl_config |= NVME_CC_AMS_RR | NVME_CC_SHN_NONE;
+   /* Use default IOSQES. We'll update it later if needed */
ctrl->ctrl_config |= NVME_CC_IOSQES | NVME_CC_IOCQES;
ctrl->ctrl_config |= NVME_CC_ENABLE;
 
@@ -2698,6 +2699,30 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
ctrl->hmmin = le32_to_cpu(id->hmmin);
ctrl->hmminds = le32_to_cpu(id->hmminds);
ctrl->hmmaxd = le16_to_cpu(id->hmmaxd);
+
+   /* Grab required IO queue size */
+   ctrl->iosqes = id->sqes & 0xf;
+   if (ctrl->iosqes < NVME_NVM_IOSQES) {
+   dev_err(ctrl->device,
+   "unsupported required IO queue size %d\n", 
ctrl->iosqes);
+   ret = -EINVAL;
+   goto out_free;
+   }
+   /*
+* If our IO queue size isn't the default, update the setting
+* in CC:IOSQES.
+*/
+   if (ctrl->iosqes != NVME_NVM_IOSQES) {
+   ctrl->ctrl_config &= ~(0xfu << NVME_CC_IOSQES_SHIFT);
+   ctrl->ctrl_config |= ctrl->iosqes << NVME_CC_IOSQES_SHIFT;
+   ret = ctrl->ops->reg_write32(ctrl, NVME_REG_CC,
+ctrl->ctrl_config);
+   if (ret) {
+   dev_err(ctrl->device,
+   "error updating CC register\n");
+   goto out_free;
+   }
+   }
}
 
ret = nvme_mpath_init(ctrl, id);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 716a876119c8..34ef35fcd8a5 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -244,6 +244,7 @@ struct nvme_ctrl {
u32 hmmin;
u32 hmminds;
u16 hmmaxd;
+   u8 iosqes;
 
/* Fabrics only */
u16 sqsize;
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 8f006638452b..54b35ea4af88 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -28,7 +28,7 @@
 #include "trace.h"
 #include "nvme.h"
 
-#define SQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_command))
+#define SQ_SIZE(q) ((q)->q_depth << (q)->sqes)
 #define CQ_SIZE(q) ((q)->q_depth * sizeof(struct nvme_completion))
 
 #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
@@ -162,7 +162,7 @@ static inline struct nvme_dev *to_nvme_dev(struct nvme_ctrl *ctrl)
 struct nvme_queue {
struct nvme_dev *dev;
spinlock_t sq_lock;
-   struct nvme_command *sq_cmds;
+   void *sq_cmds;
 /* only used for poll queues: */
spinlock_t cq_poll_lock ____cacheline_aligned_in_smp;
volatile struct nvme_completion *cqes;
@@ -178,6 +178,7 @@ struct nvme_queue {
u16 last_cq_head;
u16 qid;
u8 cq_phase;
+   u8 sqes;
unsigned long flags;
 #define NVMEQ_ENABLED  0
 #define NVMEQ_SQ_CMB   1
@@ -488,7 +489,8 @@ static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd,
bool write_sq)
 {
spin_lock(&nvmeq->sq_lock);
-   memcpy(&nvmeq->sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd));
+   memcpy(nvmeq->sq_cmds + (nvmeq->sq_tail << nvmeq->sqes),
+  cmd, sizeof(*cmd));
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
nvme_write_sq_db(nvmeq, write_sq);
@@ -1465,6 +1467,7 @@ static int nvme_alloc_queue(struct nvme_dev *dev, int qid, int depth)
if (dev->ctrl.queue_count > qid)
return 0;
 
+   nvmeq->sqes = qid ? dev->ctrl

Re: [PATCH] nvme: Add support for Apple 2018+ models

2019-07-15 Thread Benjamin Herrenschmidt
On Mon, 2019-07-15 at 12:28 +0300, Maxim Levitsky wrote:
> 
> To be honest, the spec explicitly states that the minimum submission queue
> entry size is 64 and the minimum completion entry size is 16 bytes for the
> NVM command set:
> 
> "Bits 3:0 define the required (i.e., minimum) Submission Queue Entry size when
> using the NVM Command Set. This is the minimum entry size that may be used.
> The value is in bytes and is reported as a power of two (2^n). The required
> value shall be 6, corresponding to 64."

Yes, I saw that :-) Apple seems to ignore this and CC:IOSQES, and
effectively hard-wires a size of 7 (128 bytes) for the IO queue.

> "Bits 3:0 define the required (i.e., minimum) Completion Queue entry size 
> when using
> the NVM Command Set. This is the minimum entry size that may be used. The 
> value
> is in bytes and is reported as a power of two (2^n). The required value shall 
> be 4,
> corresponding to 16."
> 
> Pages 136/137, NVME 1.3d.
> 
> In theory the spec allows for a non-NVM IO command set, for which the sq/cq
> entry sizes can be of any size, as indicated in SQES/CQES and set in
> CC.IOCQES/CC.IOSQES, but then most of the spec won't apply to it.
> 
> 
> Also FYI, values in CC (IOCQES/IOSQES) are for I/O queues, which kind of
> implies that the admin queue should always use the 64/16 byte entries,
> although I haven't found any explicit mention of that.

Right, and it does on the Apple HW as well.

Cheers,
Ben.




Re: [PATCH] nvme: Add support for Apple 2018+ models

2019-07-15 Thread Benjamin Herrenschmidt
On Mon, 2019-07-15 at 18:43 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2019-07-15 at 10:10 +0200, Christoph Hellwig wrote:
> > > + /*
> > > +  * Apple 2018 and later variants have a few issues
> > > +  */
> > > + NVME_QUIRK_APPLE_2018   = (1 << 10),
> > 
> > We try to have quirks for the actual issue, so this should be one quirk
> > for the irq vectors issues, and another for the sq entry size.  Note that
> > NVMe actually has the concept of an I/O queue entry size (IOSQES in the
> > Cc register based on values reported in the SQES field in Identify
> > Controller.  Do these controllers report anything interesting there?
> 
> Ah good to know, I'll dig.

Interesting... so SQES is 0x76, indicating that it supports the larger
entry size but not that it mandates it.

However, we configure CC:IOSQES with 6 and the HW fails unless we use
the 128-byte entry size.

So the HW is bogus, but we can probably sort that out by doing a better
job of fixing up SQES in the identify data on the Apple HW, and then
actually using it for the SQ.

I checked and CC is 0x00460001 so it takes our write of "6" fine. I
think they just ignore the value.
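For reference, a quick decode of those two values (a standalone
sketch; the bit layout is per NVMe 1.3, where SQES packs the
required/maximum entry sizes and CC holds IOSQES/IOCQES):

#include <stdio.h>

int main(void)
{
	unsigned int sqes = 0x76;	/* Identify Controller SQES byte */
	unsigned int cc = 0x00460001;	/* CC register read back */

	/* SQES: bits 3:0 = required (minimum), bits 7:4 = maximum, as log2 */
	printf("SQES: required=2^%u max=2^%u\n", sqes & 0xf, sqes >> 4);
	/* CC: EN is bit 0, IOSQES is bits 19:16, IOCQES is bits 23:20 */
	printf("CC: EN=%u IOSQES=2^%u IOCQES=2^%u\n",
	       cc & 1, (cc >> 16) & 0xf, (cc >> 20) & 0xf);
	return 0;	/* prints: required=2^6 max=2^7; EN=1 IOSQES=2^6 IOCQES=2^4 */
}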

How do you want to proceed here ? Should I go all the way and attempt
to honor the sqes "mandatory" size field (and quirk *that*), or should
I just go the simpler way and stick to shift 6 unless it's Apple ?

If I go the complicated path, should I do the same with cq size
(knowing that no known HW has a non-4 mandatory size there and we don't
know of a HW bug... yet).

Cheers,
Ben.




Re: [PATCH] nvme: Add support for Apple 2018+ models

2019-07-15 Thread Benjamin Herrenschmidt
On Mon, 2019-07-15 at 10:10 +0200, Christoph Hellwig wrote:
> > +   /*
> > +* Apple 2018 and later variants have a few issues
> > +*/
> > +   NVME_QUIRK_APPLE_2018   = (1 << 10),
> 
> We try to have quirks for the actual issue, so this should be one quirk
> for the irq vectors issues, and another for the sq entry size.  Note that
> NVMe actually has the concept of an I/O queue entry size (IOSQES in the
> Cc register based on values reported in the SQES field in Identify
> Controller.  Do these controllers report anything interesting there?

Ah good to know, I'll dig.

> At the very least I'd make all the terminology based on that and then
> just treat the Apple controllers as a buggy implementation of that model.

Yup, sounds good. I'll poke around tomorrow.

> Btw, are there open source darwin NVMe driver that could explain this
> mess a little better?

You wish...

> > @@ -504,8 +505,11 @@ static inline void nvme_write_sq_db(struct nvme_queue 
> > *nvmeq, bool write_sq)
> >  static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command 
> > *cmd,
> > bool write_sq)
> >  {
> > +   u16 sq_actual_pos;
> > +
> > spin_lock(&nvmeq->sq_lock);
> > -   memcpy(&nvmeq->sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd));
> > +   sq_actual_pos = nvmeq->sq_tail << nvmeq->sq_extra_shift;
> > +   memcpy(&nvmeq->sq_cmds[sq_actual_pos], cmd, sizeof(*cmd));
> 
> This is a little too magic.  I think we'd better off making sq_cmds
> a void array and use manual indexing, at least that makes it very
> obvious what is going on.

Ok. That's plan B as I described in the message. There's an advantage
to doing that: it merges the indexing shift and the quirk shift into one.

I'll look into it & respin

> > -   nvmeq->sq_cmds, SQ_SIZE(nvmeq->q_depth));
> > +   nvmeq->sq_cmds, SQ_SIZE(nvmeq));
> 
> Btw, chaning SQ_SIZE to take the queue seems like something that should
> be split into a prep patch, making the main change a lot smaller.

Sure. Will do.

> > -   if (!polled)
> > +   if (!polled) {
> > +
> > +   /*
> > > +* On Apple 2018 or later implementations, only vector 0 is accepted
> 
> Please fix the > 80 char line.

Ok.

Thanks for the review.

Cheers,
Ben.




[PATCH] nvme: Add support for Apple 2018+ models

2019-07-14 Thread Benjamin Herrenschmidt
Based on reverse engineering and an original patch by
Paul Pawlowski 

This adds support for Apple's weird implementation of NVMe in their
2018 or later machines. It accounts for the twice-as-big SQ entries
for the IO queues, and the fact that only interrupt vector 0 appears
to function properly.

Signed-off-by: Benjamin Herrenschmidt 
---

I reworked Paul's patch to be less invasive in the nvme_submit_cmd()
hot path, effectively only adding a shift with a value hopefully
coming from the same cache line as existing stuff.

It could probably be made even less invasive by making sq_extra_shift
instead be "sq_shift" and contain the complete shift between entries,
i.e., 6 or 7, and then replacing the memcpy to

&nvmeq->sq_cmds[nvmeq->sq_tail]

With something like

((char *)nvmeq->sq_cmds) + (((size_t)nvmeq->sq_tail) << nvmeq->sq_shift)

But I doubt the difference will be measurable anywhere and it makes
the code grosser imho.

Note: I'm not subscribed to linux-nvme, please CC me on replies.

 drivers/nvme/host/nvme.h |  5 +++++
 drivers/nvme/host/pci.c  | 59 ++++++++++++++++++++++++++++++++++++++++++-----------------
 2 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 3d293a98..9ae53cbfb320 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -92,6 +92,11 @@ enum nvme_quirks {
 * Broken Write Zeroes.
 */
NVME_QUIRK_DISABLE_WRITE_ZEROES = (1 << 9),
+
+   /*
+* Apple 2018 and latter variant has a few issues
+*/
+   NVME_QUIRK_APPLE_2018   = (1 << 10),
 };
 
 /*
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 524d6bd6d095..1a41412fc48b 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -27,8 +27,8 @@
 #include "trace.h"
 #include "nvme.h"
 
-#define SQ_SIZE(depth) (depth * sizeof(struct nvme_command))
-#define CQ_SIZE(depth) (depth * sizeof(struct nvme_completion))
+#define SQ_SIZE(q) ((((size_t)(q)->q_depth) << (q)->sq_extra_shift) * sizeof(struct nvme_command))
+#define CQ_SIZE(q) (((size_t)(q)->q_depth) * sizeof(struct nvme_completion))
 
 #define SGES_PER_PAGE  (PAGE_SIZE / sizeof(struct nvme_sgl_desc))
 
@@ -195,6 +195,7 @@ struct nvme_queue {
u16 last_cq_head;
u16 qid;
u8 cq_phase;
+   u8 sq_extra_shift;
unsigned long flags;
 #define NVMEQ_ENABLED  0
 #define NVMEQ_SQ_CMB   1
@@ -504,8 +505,11 @@ static inline void nvme_write_sq_db(struct nvme_queue 
*nvmeq, bool write_sq)
 static void nvme_submit_cmd(struct nvme_queue *nvmeq, struct nvme_command *cmd,
bool write_sq)
 {
+   u16 sq_actual_pos;
+
spin_lock(&nvmeq->sq_lock);
-   memcpy(&nvmeq->sq_cmds[nvmeq->sq_tail], cmd, sizeof(*cmd));
+   sq_actual_pos = nvmeq->sq_tail << nvmeq->sq_extra_shift;
+   memcpy(&nvmeq->sq_cmds[sq_actual_pos], cmd, sizeof(*cmd));
if (++nvmeq->sq_tail == nvmeq->q_depth)
nvmeq->sq_tail = 0;
nvme_write_sq_db(nvmeq, write_sq);
@@ -1361,16 +1365,16 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 
 static void nvme_free_queue(struct nvme_queue *nvmeq)
 {
-   dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq->q_depth),
+   dma_free_coherent(nvmeq->dev->dev, CQ_SIZE(nvmeq),
(void *)nvmeq->cqes, nvmeq->cq_dma_addr);
if (!nvmeq->sq_cmds)
return;
 
if (test_and_clear_bit(NVMEQ_SQ_CMB, &nvmeq->flags)) {
pci_free_p2pmem(to_pci_dev(nvmeq->dev->dev),
-   nvmeq->sq_cmds, SQ_SIZE(nvmeq->q_depth));
+   nvmeq->sq_cmds, SQ_SIZE(nvmeq));
} else {
-   dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq->q_depth),
+   dma_free_coherent(nvmeq->dev->dev, SQ_SIZE(nvmeq),
nvmeq->sq_cmds, nvmeq->sq_dma_addr);
}
 }
@@ -1450,12 +1454,12 @@ static int nvme_cmb_qdepth(struct nvme_dev *dev, int nr_io_queues,
 }
 
 static int nvme_alloc_sq_cmds(struct nvme_dev *dev, struct nvme_queue *nvmeq,
-   int qid, int depth)
+   int qid)
 {
struct pci_dev *pdev = to_pci_dev(dev->dev);
 
if (qid && dev->cmb_use_sqes && (dev->cmbsz & NVME_CMBSZ_SQS)) {
-   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(depth));
+   nvmeq->sq_cmds = pci_alloc_p2pmem(pdev, SQ_SIZE(nvmeq));
nvmeq->sq_dma_addr = pci_p2pmem_virt_to_bus(pdev,
nvmeq->sq_cmds);
if (nvmeq->sq_dma_addr) {
@@ -1

Re: [PATCH 4/8] PCI: Add quirk to disable MSI support for Amazon's Annapurna Labs host bridge

2019-07-14 Thread Benjamin Herrenschmidt
On Sun, 2019-07-14 at 15:09 +, Chocron, Jonathan wrote:
> > s/host bridge/Root Port/, if I understand correctly.
> > 
> 
> Ack.
> 
> BTW, what is the main difference between the 2 terms, since they seem
> to be (mistakenly?) used interchangeably?

The host bridge is the parent of the root port. You can have several
root ports under a host bridge in fact. They tend to be part of the
same silicon and somewhat intimately linked but they are distinct
logical entities. The root port appears as a PCIe p2p bridge sitting on
the top-level bus provided by the host bridge. The host bridge doesn't
have to have a representation in config space (it sometimes does for
historical reasons, but as a sibling of the devices on that top-level
bus; in PCIe land, these are chipset built-in devices).
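Roughly, the topology looks like this (a sketch; the bus/device
numbers are made up for illustration):

  host bridge ---> root bus 00
                    |- 00:01.0 Root Port (PCIe p2p bridge) ---> bus 01 (endpoint)
                    |- 00:02.0 Root Port (PCIe p2p bridge) ---> bus 02 (endpoint)
                    `- 00:1f.0 chipset built-in device (sibling of the Root Ports)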

Ben.




Re: [PATCH v1 OPT1] driver core: Fix use-after-free and double free on glue directory

2019-07-03 Thread Benjamin Herrenschmidt
On Thu, 2019-07-04 at 07:41 +0200, Greg KH wrote:
> On Thu, Jul 04, 2019 at 08:57:53AM +1000, Benjamin Herrenschmidt
> wrote:
> > On Wed, 2019-07-03 at 21:37 +0200, Greg KH wrote:
> > > Ok, I guess I have to take this patch, as the other one is so bad
> > > :)
> > > 
> > > But, I need a very large comment here saying why we are poking
> > > around in
> > > a kref and why we need to do this, at the expense of anything
> > > else.
> > > 
> > > So can you respin this patch with a comment here to help explain
> > > it so
> > > we have a chance to understand it when we run across this line in
> > > 10
> > > years?
> > 
> > Also are we confident that an open dir on the glue dir from
> > userspace
> > won't keep the kref up ?
> 
> How do you "open" a directory which raises the kref?

Hrm.. kernfs foo .. not sure. You probably can't :-) I don't know.

Cheers,
Ben.



Re: [PATCH v4 OPT2] driver core: Fix use-after-free and double free on glue directory

2019-07-03 Thread Benjamin Herrenschmidt
On Thu, 2019-07-04 at 07:40 +0200, Greg KH wrote:
> On Thu, Jul 04, 2019 at 08:57:13AM +1000, Benjamin Herrenschmidt
> wrote:
> > On Wed, 2019-07-03 at 21:36 +0200, Greg KH wrote:
> > > 
> > > > -static struct kobject *get_device_parent(struct device *dev,
> > > > -struct device *parent)
> > > > +/**
> > > > + * __get_device_parent() - Get the parent device kobject.
> > > > + * @dev: Pointer to the device structure.
> > > > + * @parent: Pointer to the parent device structure.
> > > > + * @lock: When we live in a glue directory, should we hold the
> > > > + *gdp_mutex lock when this function returns? If @lock
> > > > + *is true, this function returns with the gdp_mutex
> > > > + *held. Otherwise it will not.
> > > 
> > > Ugh, if you are trying to get me to hate one version of these
> > > patches,
> > > this is how you do it :)
> > > 
> > > A function should not "sometimes takes a lock, sometimes does
> > > not,
> > > depending on a parameter passed into it"  That way lies
> > > madness...
> > 
> > Yes, I prefer this approach to the fix but I dont like the patch
> > either
> > for the same reason...
> > 
> >  ...
> > 
> > > Anyway, this is a mess.
> > > 
> > > Ugh I hate glue dirs...
> > 
> > Amen...
> 
> Well, can we just remove them?  Who relies on them anymore?

Isn't it an ABI ? I'm sure there are going to be userspace things that
break if we do...

Cheers,
Ben.




Re: [patch v3 1/5] AST2500 DMA UART driver

2019-07-03 Thread Benjamin Herrenschmidt
On Wed, 2019-07-03 at 19:49 +0200, Greg KH wrote:
> > +
> > + if (tx_sts & UART_SDMA0_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA0_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[0]));
> > + } else if (tx_sts & UART_SDMA1_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA1_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[1]));
> > + } else if (tx_sts & UART_SDMA2_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA2_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[2]));
> > + } else if (tx_sts & UART_SDMA3_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA3_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[3]));
> > + } else if (tx_sts & UART_SDMA4_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA4_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[4]));
> > + } else if (tx_sts & UART_SDMA5_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA5_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[5]));
> > + } else if (tx_sts & UART_SDMA6_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA6_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[6]));
> > + } else if (tx_sts & UART_SDMA7_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA7_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[7]));
> > + } else if (tx_sts & UART_SDMA8_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA8_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[8]));
> > + } else if (tx_sts & UART_SDMA9_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA9_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[9]));
> > + } else if (tx_sts & UART_SDMA10_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA10_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[10]));
> > + } else if (tx_sts & UART_SDMA11_INT) {
> > + ast_uart_sdma_write(sdma, UART_SDMA11_INT, UART_TX_SDMA_ISR);
> > + ast_sdma_bufffdone(&(sdma->dma_ch->tx_dma_info[11]));
> > + } else {
> > + }

Also this should be a for () loop...
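Something like this (an untested sketch, assuming UART_SDMAn_INT is
simply BIT(n), which needs checking against the datasheet):

	int i;

	for (i = 0; i < 12; i++) {
		if (tx_sts & BIT(i)) {
			ast_uart_sdma_write(sdma, BIT(i), UART_TX_SDMA_ISR);
			ast_sdma_bufffdone(&sdma->dma_ch->tx_dma_info[i]);
			break;	/* the else-if chain only handles one channel */
		}
	}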

Cheers,
Ben.




Re: [PATCH v1 OPT1] driver core: Fix use-after-free and double free on glue directory

2019-07-03 Thread Benjamin Herrenschmidt
On Wed, 2019-07-03 at 21:37 +0200, Greg KH wrote:
> Ok, I guess I have to take this patch, as the other one is so bad :)
> 
> But, I need a very large comment here saying why we are poking around in
> a kref and why we need to do this, at the expense of anything else.
> 
> So can you respin this patch with a comment here to help explain it so
> we have a chance to understand it when we run across this line in 10
> years?

Also are we confident that an open dir on the glue dir from userspace
won't keep the kref up ?

Cheers,
Ben.




Re: [PATCH v4 OPT2] driver core: Fix use-after-free and double free on glue directory

2019-07-03 Thread Benjamin Herrenschmidt
On Wed, 2019-07-03 at 21:36 +0200, Greg KH wrote:
> 
> > -static struct kobject *get_device_parent(struct device *dev,
> > -struct device *parent)
> > +/**
> > + * __get_device_parent() - Get the parent device kobject.
> > + * @dev: Pointer to the device structure.
> > + * @parent: Pointer to the parent device structure.
> > + * @lock: When we live in a glue directory, should we hold the
> > + *gdp_mutex lock when this function returns? If @lock
> > + *is true, this function returns with the gdp_mutex
> > + *held. Otherwise it will not.
> 
> Ugh, if you are trying to get me to hate one version of these patches,
> this is how you do it :)
> 
> A function should not "sometimes takes a lock, sometimes does not,
> depending on a parameter passed into it"  That way lies madness...

Yes, I prefer this approach to the fix but I dont like the patch either
for the same reason...

 ...

> Anyway, this is a mess.
> 
> Ugh I hate glue dirs...

Amen...

Ben.




Re: [GIT PULL] FSI changes for 5.3

2019-07-02 Thread Benjamin Herrenschmidt
On Wed, 2019-07-03 at 03:39 +, Joel Stanley wrote:
> Hello Greg,
> 
> We've not had a MAINTAINERS entry for drivers/fsi, so this fixes that. It names
> Jeremy and I as maintainers, so if it works for you we will send pull requests
> to you each cycle.

Ack. I no longer work for IBM and thus cannot handle that subsystem
anymore.

> I realise this one is a bit late, but please consider including so we have a
> clear path for future submissions from 5.3 on.
> 
> This pull request contains two code changes. One touches hwmon and has an ack
> from Guenter as the hwmon maintainer.
> 
> The following changes since commit a188339ca5a396acc588e5851ed7e19f66b0ebd9:
> 
>   Linux 5.2-rc1 (2019-05-19 15:47:09 -0700)
> 
> are available in the Git repository at:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/joel/fsi.git tags/fsi-for-5.3
> 
> for you to fetch changes up to 371975b0b07520c85098652d561639837a60a905:
> 
>   fsi/core: Fix error paths on CFAM init (2019-07-03 10:42:53 +0930)
> 
> 
> FSI changes for 5.3
> 
>  - Add MAINTAINERS entry. There is now a git tree and a mailing
>  list/patchwork for collecting FSI patches
> 
>  - Bug fix for error driver registration error paths
> 
>  - Correction for the OCC hwmon driver to meet the spec
> 
> 
> Eddie James (1):
>   OCC: FSI and hwmon: Add sequence numbering
> 
> Jeremy Kerr (1):
>   fsi/core: Fix error paths on CFAM init
> 
> Joel Stanley (1):
>   MAINTAINERS: Add FSI subsystem
> 
>  MAINTAINERS| 13 +
>  drivers/fsi/fsi-core.c | 32 
>  drivers/fsi/fsi-occ.c  | 15 ---
>  drivers/hwmon/occ/common.c |  4 ++--
>  drivers/hwmon/occ/common.h |  1 +
>  5 files changed, 48 insertions(+), 17 deletions(-)



Re: WARNING: refcount bug in kobject_add_internal

2019-07-01 Thread Benjamin Herrenschmidt
Muchun, is this what your patch fixes ?


On Mon, 2019-07-01 at 16:27 -0700, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 726e41097920a73e4c7c33385dcc0debb1281e18
> Author: Benjamin Herrenschmidt 
> Date:   Tue Jul 10 00:29:10 2018 +
> 
>  drivers: core: Remove glue dirs from sysfs earlier
> 
> bisection log:  
> https://syzkaller.appspot.com/x/bisect.txt?x=140d6739a0
> start commit:   6fbc7275 Linux 5.2-rc7
> git tree:   upstream
> final crash:
> https://syzkaller.appspot.com/x/report.txt?x=160d6739a0
> console output: 
> https://syzkaller.appspot.com/x/log.txt?x=120d6739a0
> kernel config:  
> https://syzkaller.appspot.com/x/.config?x=bff6583efcfaed3f
> dashboard link: 
> https://syzkaller.appspot.com/bug?extid=32259bb9bc1a487ad206
> syz repro:  
> https://syzkaller.appspot.com/x/repro.syz?x=115bad39a0
> C reproducer:   
> https://syzkaller.appspot.com/x/repro.c?x=1241bdd5a0
> 
> Reported-by: syzbot+32259bb9bc1a487ad...@syzkaller.appspotmail.com
> Fixes: 726e41097920 ("drivers: core: Remove glue dirs from sysfs
> earlier")
> 
> For information about bisection process see: 
> https://goo.gl/tpsmEJ#bisection



Re: [PATCH v4] driver core: Fix use-after-free and double free on glue directory

2019-06-25 Thread Benjamin Herrenschmidt
On Tue, 2019-06-25 at 23:06 +0800, Muchun Song wrote:
> Benjamin Herrenschmidt wrote on Wed, Jun 19, 2019 at 5:51 AM:
> > 
> > On Tue, 2019-06-18 at 18:13 +0200, Greg KH wrote:
> > > 
> > > Again, I am totally confused and do not see a patch in an email
> > > that
> > > I
> > > can apply...
> > > 
> > > Someone needs to get people to agree here...
> > 
> > I think he was hoping you would chose which solution you prefered
> > here
> 
> Yeah, right, I am hoping you would chose which solution you prefered
> here.
> Thanks.
> 
> > :-) His original or the one I suggested instead. I don't think
> > there's
> > anybody else with understanding of sysfs guts around to form an
> > opinion.
> > 

Muchun, I don't think Greg still has the previous emails. He deals with
too much to keep track of old stuff.

Can you send both patches tagged as [OPT1] and [OPT2] along with a
comment in one go so Greg can see both and decide ?

I think looking at the refcount is fragile. I might be wrong, but I
think it mostly papers over the root of the problem, which is the fact
that the lock isn't held across both operations, thus exposing the
race. But I'm happy if Greg prefers your approach as long as it's
fixed.
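To illustrate the race (a deliberately simplified sketch, not the
literal driver-core code paths):

	/* thread A: looking up / creating the glue dir */
	mutex_lock(&gdp_mutex);
	kobj = kset_find_obj(kset, name);	/* glue dir found, ref taken */
	mutex_unlock(&gdp_mutex);
	/* <-- window: thread B's cleanup_glue_dir() can kobject_del()
	 *     the now-empty glue dir right here */
	kobject_add(&dev->kobj, kobj, "%s", dev_name(dev));	/* under a deleted dir */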

Cheers,
Ben.



Re: [PATCH 1/2] i2c: aspeed: allow to customize base clock divisor

2019-06-19 Thread Benjamin Herrenschmidt
On Wed, 2019-06-19 at 22:32 +, Tao Ren wrote:
> Thank you for the quick response, Brendan.
> 
> Aspeed I2C bus frequency is defined by 3 parameters
> (base_clk_divisor, clk_high_width, clk_low_width), and I choose
> base_clk_divisor because it controls all the Aspeed I2C timings (such
> as setup time and hold time). Once base_clk_divisor is decided
> (either by the current logic in i2c-aspeed driver or manually set in
> device tree), clk_high_width and clk_low_width will be calculated by
> i2c-aspeed driver to meet the specified I2C bus speed.
> 
> For example, by setting I2C bus frequency to 100KHz on AST2500
> platform, (base_clock_divisor, clk_high_width, clk_low_width) is set
> to (3, 15, 14) by our driver. But some slave devices (on CMM i2c-8
> and Minipack i2c-0) NACK byte transactions with the default timing
> setting: the issue can be resolved by setting base_clk_divisor to 4,
> and (clk_high_width, clk_low_width) will be set to (7, 7) by our i2c-
> aspeed driver to achieve similar I2C bus speed.
> 
> Not sure if my answer helps to address your concerns, but kindly let
> me know if you have further questions/suggestions.

Did you look at the resulting output on a scope ? I'm curious what
might be wrong 

CCing Ryan from Aspeed, he might have some idea.

Could it be that with some specific dividers you have more jitter ?
Still, i2c devices tend to be rather robust vs crappy clocks unless you
are massively out of bounds, which makes me wonder whether something
else might be wrong in your setup.

Cheers,
Ben.




Re: [PATCH] powerpc: enable a 30-bit ZONE_DMA for 32-bit pmac

2019-06-19 Thread Benjamin Herrenschmidt
On Wed, 2019-06-19 at 22:32 +1000, Michael Ellerman wrote:
> Christoph Hellwig  writes:
> > Any chance this could get picked up to fix the regression?
> 
> Was hoping Ben would Ack it. He's still powermac maintainer :)
> 
> I guess he OK'ed it in the other thread, will add it to my queue.

Yeah, ack. If I had written it myself, I would have made the DMA bits a
variable and only set it down to 30 if I see that device in the DT
early on, but I can't be bothered now; if it works, ship it :-)

Note: The patch affects all ppc32, though I don't think it will cause
any significant issue on those who don't need it.

Cheers,
Ben.

> cheers
> 
> > On Thu, Jun 13, 2019 at 10:24:46AM +0200, Christoph Hellwig wrote:
> > > With the strict dma mask checking introduced with the switch to
> > > the generic DMA direct code common wifi chips on 32-bit
> > > powerbooks
> > > stopped working.  Add a 30-bit ZONE_DMA to the 32-bit pmac builds
> > > to allow them to reliably allocate dma coherent memory.
> > > 
> > > Fixes: 65a21b71f948 ("powerpc/dma: remove
> > > dma_nommu_dma_supported")
> > > Reported-by: Aaro Koskinen 
> > > Signed-off-by: Christoph Hellwig 
> > > ---
> > >  arch/powerpc/include/asm/page.h | 7 +++
> > >  arch/powerpc/mm/mem.c   | 3 ++-
> > >  arch/powerpc/platforms/powermac/Kconfig | 1 +
> > >  3 files changed, 10 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
> > > index b8286a2013b4..0d52f57fca04 100644
> > > --- a/arch/powerpc/include/asm/page.h
> > > +++ b/arch/powerpc/include/asm/page.h
> > > @@ -319,6 +319,13 @@ struct vm_area_struct;
> > >  #endif /* __ASSEMBLY__ */
> > >  #include 
> > >  
> > > +/*
> > > + * Allow 30-bit DMA for very limited Broadcom wifi chips on many
> > > powerbooks.
> > > + */
> > > +#ifdef CONFIG_PPC32
> > > +#define ARCH_ZONE_DMA_BITS 30
> > > +#else
> > >  #define ARCH_ZONE_DMA_BITS 31
> > > +#endif
> > >  
> > >  #endif /* _ASM_POWERPC_PAGE_H */
> > > diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
> > > index cba29131bccc..2540d3b2588c 100644
> > > --- a/arch/powerpc/mm/mem.c
> > > +++ b/arch/powerpc/mm/mem.c
> > > @@ -248,7 +248,8 @@ void __init paging_init(void)
> > >  (long int)((top_of_ram - total_ram) >> 20));
> > >  
> > >  #ifdef CONFIG_ZONE_DMA
> > > - max_zone_pfns[ZONE_DMA] = min(max_low_pfn, 0x7fffffffUL >> PAGE_SHIFT);
> > > + max_zone_pfns[ZONE_DMA] = min(max_low_pfn,
> > > + ((1UL << ARCH_ZONE_DMA_BITS) - 1) >> PAGE_SHIFT);
> > >  #endif
> > >   max_zone_pfns[ZONE_NORMAL] = max_low_pfn;
> > >  #ifdef CONFIG_HIGHMEM
> > > diff --git a/arch/powerpc/platforms/powermac/Kconfig b/arch/powerpc/platforms/powermac/Kconfig
> > > index f834a19ed772..c02d8c503b29 100644
> > > --- a/arch/powerpc/platforms/powermac/Kconfig
> > > +++ b/arch/powerpc/platforms/powermac/Kconfig
> > > @@ -7,6 +7,7 @@ config PPC_PMAC
> > >   select PPC_INDIRECT_PCI if PPC32
> > >   select PPC_MPC106 if PPC32
> > >   select PPC_NATIVE
> > > + select ZONE_DMA if PPC32
> > >   default y
> > >  
> > >  config PPC_PMAC64
> > > -- 
> > > 2.20.1
> > 
> > ---end quoted text---



Re: [PATCH v4] driver core: Fix use-after-free and double free on glue directory

2019-06-18 Thread Benjamin Herrenschmidt
On Tue, 2019-06-18 at 18:13 +0200, Greg KH wrote:
> 
> Again, I am totally confused and do not see a patch in an email that
> I
> can apply...
> 
> Someone needs to get people to agree here...

I think he was hoping you would chose which solution you prefered here
:-) His original or the one I suggested instead. I don't think there's
anybody else with understanding of sysfs guts around to form an
opinion.

Cheers,
Ben.




Re: [PATCH v4] driver core: Fix use-after-free and double free on glue directory

2019-06-18 Thread Benjamin Herrenschmidt
On Tue, 2019-06-18 at 21:40 +0800, Muchun Song wrote:
> Ping guys ? I think this is worth fixing.

I agree :-)

My opinion hasn't changed though: the right fix isn't making guesses
based on the refcount but solving the actual race, which is the mutex
being dropped between looking for the object's existence and deciding
to create it :-)

Cheers,
Ben.

> Muchun Song wrote on Sat, May 25, 2019 at 8:15 PM:
> 
> > 
> > Hi greg k-h,
> > 
> > Greg KH wrote on Sat, May 25, 2019 at 3:04 AM:
> > > 
> > > On Thu, May 16, 2019 at 10:23:42PM +0800, Muchun Song wrote:
> > > > There is a race condition between removing glue directory and
> > > > adding a new
> > > > device under the glue directory. It can be reproduced in
> > > > following test:
> > > 
> > > 
> > > 
> > > Is this related to:
> > > Subject: [PATCH v3] drivers: core: Remove glue dirs early
> > > only when refcount is 1
> > > 
> > > ?
> > > 
> > > If so, why is the solution so different?
> > 
> > In the v1 patch, the solution is to remove glue dirs early only
> > when the refcount is 1. So the v1 patch is like below:
> > 
> > @@ -1825,7 +1825,7 @@ static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
> > return;
> > 
> > mutex_lock(&gdp_mutex);
> > -   if (!kobject_has_children(glue_dir))
> > +   if (!kobject_has_children(glue_dir) && kref_read(&glue_dir->kref) == 1)
> > kobject_del(glue_dir);
> > kobject_put(glue_dir);
> > mutex_unlock(&gdp_mutex);
> > ---
> > 
> > But from Ben's suggestion as below:
> > 
> > I find relying on the object count for such decisions rather
> > fragile as
> > it could be taken temporarily for other reasons, couldn't it ? In
> > which
> > case we would just fail...
> > 
> > Ideally, the looking up of the glue dir and creation of its child
> > should be protected by the same lock instance (the gdp_mutex in
> > that
> > case).
> > ---
> > 
> > So another solution is used from Ben's suggestion in the v2 patch.
> > But
> > I forgot to update the commit message until the v4 patch. Thanks.
> > 
> > Yours,
> > Muchun



Re: [PATCH v3 2/2] PCI: Fix disabling of bridge BARs when assigning bus resources

2019-06-18 Thread Benjamin Herrenschmidt
On Mon, 2019-06-17 at 08:53 -0500, Bjorn Helgaas wrote:
> On Fri, May 31, 2019 at 11:12:16AM -0600, Logan Gunthorpe wrote:
> > One odd quirk of PLX switches is that their upstream bridge port has
> > 256K of space allocated behind its BAR0 (most other bridge
> > implementations do not report any BAR space).
> 
> Somewhat unusual, but completely legal, of course.

Ah yes, I've seen these. They have an MMIO path to their internal
registers in addition to cfg. Can be annoying.

> If a bridge has memory BARs, AFAIK it is impossible to enable a memory
> window without also enabling the BARs, so if we want to use the bridge
> at all, we *must* allocate space for its BARs, just like for any other
> device.

Right.

 .../... (agreeing violently)

> In my ideal world we wouldn't zap the flags of any resource.  I think
> we should derive the flags from the device's config space *once*
> during enumeration and remember them for the life of the device.

Amen brother. It will take a little while to get there. One thing we
should do is have a clearer way to mark a resource that failed to
assign/allocate (though technically parent == NULL really is it, as
long as all archs these days claim resources properly; it used not to
be the case).

We do wipe *bridge* windows (not BARs) all over the place, that is less
of an issue I suppose though I would be more comfortable if we also
wrote to the bridge to close those windows as we do so...

The problem of course is how much old weird quirky stuff will break due to
subtle assumptions as we "fix" these things :-)

> This patch preserves res->flags for bridge BARs just like for any
> other device, so I think this is definitely a step in the right
> direction.
> 
> I'm not sure the "dev->subordinate" test is really correct, though.

Right, shouldn't it be pci_is_bridge() ?

> I think the original intent of this code was to clear res->flags for
> bridge windows under the assumptions that (a) we can identify bridges
> by "dev->subordinate" being non-zero, and (b) bridges only have
> windows and didn't have BARs.
> 
> This patch fixes assumption (b), but I think (a) is false, and we
> should fix it as well.  One can imagine a bridge device without a
> subordinate bus (maybe we ran out of bus numbers), so I don't think we
> should test dev->subordinate.

Yup.

> We could test something like pci_is_bridge(), although testing for idx
> being in the PCI_BRIDGE_RESOURCES range should be sufficient because I
> don't think we use those resource for anything other than windows.

Yeah quite possibly.

Cheers,
Ben.




Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-13 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 14:41 -0500, Larry Finger wrote:
> On 6/12/19 1:55 AM, Christoph Hellwig wrote:
> > 
> > Ooops, yes.  But I think we could just enable ZONE_DMA on 32-bit
> > powerpc.  Crude enablement hack below:
> > 
> > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> > index 8c1c636308c8..1dd71a98b70c 100644
> > --- a/arch/powerpc/Kconfig
> > +++ b/arch/powerpc/Kconfig
> > @@ -372,7 +372,7 @@ config PPC_ADV_DEBUG_DAC_RANGE
> >
> >config ZONE_DMA
> >bool
> > - default y if PPC_BOOK3E_64
> > + default y
> >
> >config PGTABLE_LEVELS
> >int
> > 
> 
> With the patch for Kconfig above, and the original patch setting 
> ARCH_ZONE_DMA_BITS to 30, everything works.
> 
> Do you have any ideas on what should trigger the change in ARCH_ZONE_DMA_BITS? 
> Should it be CONFIG_PPC32 defined, or perhaps CONFIG_G4_CPU defined?

I think CONFIG_PPC32 is fine

Ben.



Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-13 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 13:00 +0200, Borislav Petkov wrote:
> On Wed, Jun 12, 2019 at 07:42:42AM -0300, Mauro Carvalho Chehab wrote:
> > That's said, from the admin PoV, it makes sense to have a single
> > daemon that collect errors from all error sources and take the
> > needed actions.
> 
> Doing recovery actions in userspace is too flaky. Daemon can get killed
> at any point in time

So what ? If root kills your RAS daemon, then so be it. That has never
been a problem on POWER8/POWER9 server platforms and those have some of
the nastiest RAS in town.

You can kill PID 1 too you know ... 

>  and there are error types where you want to do recovery *before* you return 
> to userspace.

Very few (precise examples please), and I have yet to see why those
need some kind of magic coordinator.

> Yes, we do have different error reporting facilities but I still think
> that concentrating all the error information needed in order to do
> proper recovery action is the better approach here. And make that part
> of the kernel so that it is robust. Userspace can still configure it and
> so on.

Ben.




Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-13 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 12:42 +0200, Borislav Petkov wrote:
> On Wed, Jun 12, 2019 at 06:29:26PM +1000, Benjamin Herrenschmidt wrote:
> > I tend to disagree here. We've been down that rabbit hole in the past
> > and we (Linux in general) are trying to move away from that sort of
> > "platform" overarching driver as much as possible.
> 
> Why is a "platform" driver like that so bad?

It tends to be a slippery slope. Also in the ARM world, most SoCs tend
to reuse IP blocks, so you get a lot of code duplication, bugs fixed in
one and not the other, etc...

I don't necessarily mind having a "platform" component that handles
policies in case where userspace is really not an option, but it
shouldn't be doing it by containing the actual drivers for the
individual IP block error collection. It could however "use" them via
in-kernel APIs.

> > This is a policy. It should either belong to userspace,
> 
> For some errors you can't do userspace as it is too late for it - you
> wanna address that before you return to it.

Those are rare. At the end of the day, if you have a UE on memory, it's
a matter of luck. It could have hit your kernel as well. You get lucky
it only hit userspace but you can't make a general statement you "can't
trust userspace".

Cache errors tend to be the kind that has to be addressed immediately,
but even then, that's often local to some architecture machine check
handling, not even in EDAC.

Do you have a concrete example of a type of error that

 - Must be addressed in the kernel

 - Relies on coordinating drivers for more than one IP block

?

Even then though, my argument would be that the right way to do that,
assuming that's even platform specific, would be to have then the
"platform RAS driver" just layout on top of the individual EDAC drivers
and consume their output. Not contain the drivers themselves.
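For instance (a purely hypothetical shape, all names invented for
illustration), the IP-block drivers could publish raw errors on a
notifier chain that an optional platform policy driver subscribes to:

	/* in some shared RAS core */
	static BLOCKING_NOTIFIER_HEAD(soc_ras_chain);

	int soc_ras_report(struct soc_ras_event *ev)	/* called by IP-block drivers */
	{
		return blocking_notifier_call_chain(&soc_ras_chain, ev->type, ev);
	}

	/* in the optional platform policy driver */
	blocking_notifier_chain_register(&soc_ras_chain, &my_policy_nb);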

> > or be in some generic RAS code in the kernel, there's no reason why
> > these can't be abstracted.
> 
> Yes, we have this drivers/ras/cec.c thing which collects correctable
> DRAM errors on x86. :-)

Using machine checks, not EDAC. It's completely orthogonal at this
point at least.

That said, it would make sense to have an EDAC API to match that
address back into a DIMM location and give the user an informational
message about failures happening on that DIMM. But that could be done
via core EDAC MC APIs.

Here too, no need for having an over-arching platform driver.

> > Also in your specific example, it could be entirely local to the MC
> > EDAC / DRAM controller path, we could have a generic way for EDAC to
> > advertise that a given memory channel is giving lots of errors and
> > have memory controller drivers listen to it but usually the EDAC MC
> > driver *is* the only thing that looks like a MC driver to begin with,
> > 
> > so again, pretty much no overlap with L1/L2 caches RAS or PCIe RAS
> > etc...
> > 
> > Unless I'm mistaken, that amd64 EDAC is just an MC one... but I only
> > had a cursory glance at the code.
> 
> EDAC has historically been concentrating on DRAM errors as that is
> what people have been paying attention to. But it isn't limited to
> DRAM errors - there is some basic PCI errors functionality behind
> edac_pci_create_generic_ctl() which polls for PCI parity errors.

Right, somebody whacked the PCI stuff into the same driver. So what ?
There's no coordination here nor any particular reason it has to be so.
Those PCI bits could have moved to a separate driver easily. Maybe they
didn't because they didn't have a good way to probe the two separately via
ACPI ? I don't know. But it doesn't matter nor does it affect the
situation with ARM.

That said, x86 platforms tend to be less diverse in their selection of
IP blocks, and tend to have more integrated chipsets where the
distinction between the memory controller and PCI may be a bit less
obvious. This isn't the case on ARM.

I still think that doesn't prove or disprove anything.

> So it still makes sense to me to have a single driver which takes care
> of all things RAS for a platform. You just load one driver and it does
> it all, including recovery actions.

Why ? Because one or two historical drivers mix MC and PCI then "it
makes sense" to do that for everybody ?

And then you have 20 platforms and 20 drivers, with 50% or more code
duplication, bugs fixed in one and not the other, gratuitous behaviour
differences to confuse users etc... No. that doesn't make sense.

> > Maybe because what you are promoting might not be the right path
> > here... seriously, there's a reason why all vendors want to go down
that path and in this case I don't think they are wrong.

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-13 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 14:25 +0200, Borislav Petkov wrote:
> > But for the main case that really needs to be in the kernel, which is
> > DRAM, the recovery can usually be contained to the MC driver anyway.
> 
> Right, if that is enough to handle the error properly.
> 
> The memory-failure.c example I gave before is the error reporting
> mechanism (x86 MCA) calling into the mm subsystem to poison and isolate
> page frames which are known to contain errors. So you have two things
> talking to each other.

And none of them is an EDAC driver...

I mean yes, the network drivers talk to the network stack, or even the
memory allocator :-)

I still don't see how that requires a big platform coordinator...

Ben.




Re: [PATCH+DISCUSSION] irqchip: armada-370-xp: Remove redundant ops assignment

2019-06-13 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 15:16 +1000, Benjamin Herrenschmidt wrote:
> pci_msi_create_irq_domain -> pci_msi_domain_update_chip_ops will
> set those two already since the driver sets MSI_FLAG_USE_DEF_CHIP_OPS
> 
> Signed-off-by: Benjamin Herrenschmidt 
> ---
> 
> [UNTESTED]
> 
> Just something I noticed while browsing through those drivers in
> search of ways to factor some of the code.
> 
> That leads to a question here:
> 
> Some MSI drivers such as this one (or any using the defaults
> mask/unmask
> provided by drivers/pci/msi.c) only call the PCI MSI mask/unmask
> functions.
> 
> Some other drivers call those PCI functions but *also* call the parent
> mask/unmask (gic-v2m for example) which generally is the inner domain
> which just itself forwards to its own parent.

  .../...

So I looked at x86 and it also only uses pci_msi_unmask_irq; it doesn't
mask at the parent level. And it also specifies those explicitly, which
isn't necessary, so the same trivial cleanup patch could be done (happy
to do it unless I missed something here).

Question: If that's indeed the rule we want to establish, should we
consider making all MSI controllers just use the PCI masking and remove
the forwarding to the parent ?

The ones that do the parent, at least in drivers/irqchip/* and
drivers/pci/controller/* (ther are more in arch code) are all the GIC
ones (v2m, v3-its, v3-mbi), alpine which was copied on GIC I think,
tango and dwc.

The other approach would be to make the generic ops setup by
pci_msi_domain_update_chip_ops call the parent as well .. if there is
one and it has corresponding mask/unmask callbacks. That means things
like armada_370 would be unaffected since their "middle" irqdomain chip
doesn't have them, at least until somebody decides that masking at the
parent level as well is a good thing. I *think* it would also work for
x86 since the parent in that case is x86_vector_domain which also
doesn't have mask and unmask callbacks, so it would be a nop change.
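Concretely, the generic callback could look something like this (an
untested sketch of that second approach):

	static void pci_msi_mask_irq_default(struct irq_data *d)
	{
		pci_msi_mask_irq(d);
		if (d->parent_data && d->parent_data->chip->irq_mask)
			irq_chip_mask_parent(d);
	}

(and the symmetric thing for unmask.)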

Let me know what you think.

Cheers,
Ben.





Re: [PATCH+DISCUSSION] irqchip: armada-370-xp: Remove redundant ops assignment

2019-06-13 Thread Benjamin Herrenschmidt
On Thu, 2019-06-13 at 10:22 +0100, Marc Zyngier wrote:
> 
> It looks to me that masking at the PCI level is rather superfluous as
> long as the MSI controller HW has the capability to mask the interrupt
> on a per MSI basis. After all, most non MSI-X endpoint lack support
> for masking of individual vectors, so I think that we should just mask
> things at the irqchip level. This is also consistent with what you'd
> have to do for non-PCI MSI, where nothing standardises the MSI
> masking.
> 
> I think this is in effect a split in responsibilities:
> 
> - the end-point driver should (directly or indirectly) control the
>   interrupt generation at the end-point level,
> 
> - the MSI controller driver should control the signalling of the MSI
>   to the CPU.
> 
> The only case where we should rely on masking interrupts at the
> end-point level is when the MSI controller doesn't provide a method to
> do so (hopefully a rare exception).

While I would tend to agree, I'm also wary of standardizing on
something which isn't what x86 does today :-)

You know what happens when we break them... interestingly enough they
(like quite a few other drivers) don't even bother trying to mask at
the APIC level unless I misread the code. That means that for endpoints
that don't support masking, they just get those MSIs and
"ignore" them...

But I'll look into it, see what the patch looks like.

I've also looked at trying to make the "inner domain" more generic but
that's looking a tad trickier... not giving up yet though :-)

Cheers,
Ben.




Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 08:42 -0300, Mauro Carvalho Chehab wrote:
> > Yes, we do have different error reporting facilities but I still
> > think
> > that concentrating all the error information needed in order to do
> > proper recovery action is the better approach here. And make that
> > part
> > of the kernel so that it is robust. Userspace can still configure
> > it and
> > so on.
> 
> If the error reporting facilities are for the same hardware "group"
> (like the machine's memory controllers), I agree with you: it makes
> sense to have a single driver. 
> 
> If they are for completely independent hardware then implementing
> as separate drivers would work equally well, with the advantage of
> making easier to maintain and make it generic enough to support
> different vendors using the same IP block.

Right. And if you really want a platform orchestrator for recovery in
the kernel, it should be a separate one that consumes data from the
individual IP block drivers that report the raw errors anyway.

But for the main case that really needs to be in the kernel, which is
DRAM, the recovery can usually be contained to the MC driver anyway.

Cheers,
Ben.




Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-12 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 05:48 +0200, Borislav Petkov wrote:
> On Wed, Jun 12, 2019 at 08:25:52AM +1000, Benjamin Herrenschmidt wrote:
> > Yes, we would be in a world of pain already if tracepoints couldn't
> > handle concurrency :-)
> 
> Right, lockless buffer and the whole shebang :)

Yup.

> > Sort-of... I still don't see a race in what we propose but I might be
> > missing something subtle. We are talking about two drivers for two
> > different IP blocks updating different counters etc...
> 
> If you do only *that* you should be fine. That should technically be ok.

Yes, that' the point.

> I still think, though, that the sensible thing to do is have one
> platform driver which concentrates all RAS functionality. 

I tend to disagree here. We've been down that rabbit hole in the past
and we (Linux in general) are trying to move away from that sort of
"platform" overarching driver as much as possible.

> It is the
> more sensible design and takes care of potential EDAC shortcomings and
> the need to communicate between the different logging functionality,
> as in, for example, "I had so many errors, lemme go and increase DRAM
> scrubber frequency." For example. And all the other advantages of having
> everything in a single driver.

This is a policy. It should either belong to userspace, or be in some
generic RAS code in the kernel, there's no reason why these can't be
abstracted. Also in your specific example, it could be entirely local
to the MC EDAC / DRAM controller path, we could have a generic way for
EDAC to advertise that a given memory channel is giving lots of errors
and have memory controller drivers listen to it but usually the EDAC MC
driver *is* the only thing that looks like a MC driver to begin with,
so again, pretty much no overlap with L1/L2 caches RAS or PCIe RAS
etc...

> And x86 already does that - we even have a single driver for all AMD
> platforms - amd64_edac. Intel has a couple but there's still a lot of
> sharing.

Unless I'm mistaken, that amd64 EDAC is just an MC one... but I only
had a cursory glance at the code.

> But apparently ARM folks want to have one driver per IP block. And we
> have this discussion each time a new vendor decides to upstream its
> driver. And there's no shortage of vendors in ARM-land trying to do
> that.

For good reasons :-)

> James and I have tried to come up with a nice scheme to make that work
> on ARM and he has an example prototype here:
> 
> http://www.linux-arm.org/git?p=linux-jm.git;a=shortlog;h=refs/heads/edac_dummy/v1
>
> to show what it could look like.
> 
> But I'm slowly growing a serious aversion against having this very same
> discussion each time an ARM vendor sends a driver. And that happens
> pretty often nowadays.

Maybe because what you are promoting might not be the right path
here... seriously, there's a reason why all vendors want to go down
that path and in this case I don't think they are wrong.

This isn't about just another ARM vendor, in fact I'm rather new to the
whole ARM thing, I used to maintain arch/powerpc :-) The point is what
you are trying to push for goes against everything we've been trying to
do in Linux when it comes to splitting drivers into individual IP blocks.

Yes, in *some* cases coordination will be needed in which case there
are ways to do that that don't necessarily involve matching a driver to
the root of the DT, and a pseudo-device is in fact a very reasonable
way to do it, it was a common practice in IEEE1275 before I invented
the FDT, and we do that for a number of other things already.

Cheers,
Ben.




Re: [PATCH v2 8/8] habanalabs: enable 64-bit DMA mask in POWER9

2019-06-12 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 09:25 +0300, Oded Gabbay wrote:
> 
> > You can't. Your device is broken. Devices that don't support DMAing to
> > the full 64-bit deserve to be added to the trash pile.
> > 
> 
> Hmm... right now they are added to customers' data-centers but what do I know ;)

Well, some customers don't know they are being sold a lemon :)

> > As a result, getting it to work will require hacks. Some GPUs have
> > similar issues and require similar hacks, it's unfortunate.
> > 
> > Added a couple of guys on CC who might be able to help get those hacks
> > right.
> 
> Thanks :)
> > 
> > It's still very fishy .. the idea is to detect the case where setting a
> > 64-bit mask will give your system memory mapped at a fixed high address
> > (1 << 59 in our case) and program that in your chip in the "Fixed high
> > bits" register that you seem to have (also make sure it doesn't affect
> > MSIs or it will break them).
> 
> MSI-X are working. The set of bit 59 doesn't apply to MSI-X
> transactions (AFAICS from the PCIe controller spec we have).

Ok.

> > This will only work as long as all of the system memory can be
> > addressed at an offset from that fixed address that itself fits your
> > device addressing capabilities (50 bits in this case). It may or may
> > not be the case but there's no way to check since the DMA mask logic
> > won't really apply.
> 
> Understood. In the specific system we are integrated to, that is the
> case - we have less than 48 bits. But, as you pointed out, it is not a
> generic solution but with my H/W I can't give a generic fit-all
> solution for POWER9. I'll settle for the best that I can do.
> 
> > 
> > You might want to consider fixing your HW in the next iteration... This
> > is going to bite you when x86 increases the max physical memory for
> > example, or on other architectures.
> 
> Understood and taken care of.

Cheers,
Ben.

> > 
> > Cheers,
> > Ben.
> > 
> > 
> > 
> > 



Re: [PATCH v2 8/8] habanalabs: enable 64-bit DMA mask in POWER9

2019-06-12 Thread Benjamin Herrenschmidt
On Wed, 2019-06-12 at 15:45 +1000, Oliver O'Halloran wrote:
> 
> Also, are you sure about the MSI thing? The IODA3 spec says the only
> important bits for a 64bit MSI are bits 61:60 (to hit the window) and
> the lower bits that determine what IVE to use. Everything in between
> is ignored so ORing in bit 59 shouldn't break anything.

On IODA3... could be different on another system. My point is you can't
just have a fixed setting for all top bits for DMA & MSIs.

> > This will only work as long as all of the system memory can be
> > addressed at an offset from that fixed address that itself fits your
> > device addressing capabilities (50 bits in this case). It may or may
> > not be the case but there's no way to check since the DMA mask logic
> > won't really apply.
> > 
> > You might want to consider fixing your HW in the next iteration... This
> > is going to bite you when x86 increases the max physical memory for
> > example, or on other architectures.
> 
> Yes, do this. The easiest way to avoid this sort of wierd hack is to
> just design the PCIe interface to the spec in the first place.

Ben.



[PATCH+DISCUSSION] irqchip: armada-370-xp: Remove redundant ops assignment

2019-06-11 Thread Benjamin Herrenschmidt
pci_msi_create_irq_domain -> pci_msi_domain_update_chip_ops will
set those two already since the driver sets MSI_FLAG_USE_DEF_CHIP_OPS

Signed-off-by: Benjamin Herrenschmidt 
---

[UNTESTED]

Just something I noticed while browsing through those drivers in
search of ways to factor some of the code.

That leads to a question here:

Some MSI drivers such as this one (or any using the defaults mask/unmask
provided by drivers/pci/msi.c) only call the PCI MSI mask/unmask functions.

Some other drivers call those PCI functions but *also* call the parent
mask/unmask (gic-v2m for example) which generally is the inner domain
which just itself forwards to its own parent.

Is there any preference for doing it one way or the other ? I can see
that in cases where the device doesn't support MSI masking, calling the
parent could be useful but we don't know that at the moment in the
corresponding code.

It feels like something we should consolidate (and remove code from
drivers). For example, the defaults in drivers/pci/msi.c could always
call the parent if it exists and has a mask/unmask callback.

Opinions ? I'm happy to produce patches once we agree...

diff --git a/drivers/irqchip/irq-armada-370-xp.c b/drivers/irqchip/irq-armada-370-xp.c
index c9bdc5221b82..911230f28e2d 100644
--- a/drivers/irqchip/irq-armada-370-xp.c
+++ b/drivers/irqchip/irq-armada-370-xp.c
@@ -197,8 +197,6 @@ static void armada_370_xp_irq_unmask(struct irq_data *d)
 
 static struct irq_chip armada_370_xp_msi_irq_chip = {
.name = "MPIC MSI",
-   .irq_mask = pci_msi_mask_irq,
-   .irq_unmask = pci_msi_unmask_irq,
 };
 
 static struct msi_domain_info armada_370_xp_msi_domain_info = {



Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-11 Thread Benjamin Herrenschmidt
On Tue, 2019-06-11 at 20:52 -0500, Larry Finger wrote:
> On 6/11/19 5:46 PM, Benjamin Herrenschmidt wrote:
> > On Tue, 2019-06-11 at 17:20 -0500, Larry Finger wrote:
> > > b43-pci-bridge 0001:11:00.0: dma_direct_supported: failed (mask =
> > > 0x3fff,
> > > min_mask = 0x5000/0x5000, dma bits = 0x1f
> > 
> > Ugh ? A mask with holes in it ? That's very wrong... That min_mask is
> > bogus.
> 
> I agree, but that is not likely serious as most systems will have
> enough memory that the max_pfn term will be much larger than the
> initial min_mask, and min_mask will be unchanged by the min function.

Well no... it's too much memory that is the problem. If min_mask is
bogus though it will cause problems later too, so one should look into
it.

> In addition, min_mask is not used beyond this routine, and then only to
> decide if direct dma is supported. The following patch generates masks
> with no holes, but I cannot see that it is needed.

The right fix is to round up max_pfn to a power of 2, something like:

	min_mask = min_t(u64, min_mask,
			 (roundup_pow_of_two(max_pfn - 1)) << PAGE_SHIFT);
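
(Untested aside: since the shifted power of two above is a single set
bit rather than an all-ones mask, a variant that yields a contiguous
low mask would be

	min_mask = min_t(u64, min_mask,
			 ((u64)roundup_pow_of_two(max_pfn) << PAGE_SHIFT) - 1);

either form works for the numeric >= comparison that
dma_direct_supported() does at the end.)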

> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index 2c2772e9702a..e3edd4f29e80 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -384,7 +384,8 @@ int dma_direct_supported(struct device *dev, u64 mask)
>  else
>  min_mask = DMA_BIT_MASK(32);
> 
> -   min_mask = min_t(u64, min_mask, (max_pfn - 1) << PAGE_SHIFT);
> +   min_mask = min_t(u64, min_mask, ((max_pfn - 1) << PAGE_SHIFT) |
> +DMA_BIT_MASK(PAGE_SHIFT));
> 
>  /*
>   * This check needs to be against the actual bit mask value, so
> 
> 
> Larry



Re: [PATCH v2 8/8] habanalabs: enable 64-bit DMA mask in POWER9

2019-06-11 Thread Benjamin Herrenschmidt
On Tue, 2019-06-11 at 20:22 +0300, Oded Gabbay wrote:
> 
> > So, to summarize:
> > If I call pci_set_dma_mask with 48, then it fails on POWER9. However,
> > in runtime, I don't know if its POWER9 or not, so upon failure I will
> > call it again with 32, which makes our device pretty much unusable.
> > If I call pci_set_dma_mask with 64, and do the dedicated configuration
> > in Goya's PCIe controller, then it won't work on x86-64, because bit
> > 59 will be set and the host won't like it (I checked it). In addition,
> > I might get addresses above 50 bits, which my device can't generate.
> > 
> > I hope this makes things more clear. Now, please explain to me how I
> > can call pci_set_dma_mask without any regard to whether I run on
> > x86-64 or POWER9, considering what I wrote above ?
> > 
> > Thanks,
> > Oded
> 
> Adding ppc mailing list.

You can't. Your device is broken. Devices that don't support DMAing to
the full 64-bit deserve to be added to the trash pile.

As a result, getting it to work will require hacks. Some GPUs have
similar issues and require similar hacks, it's unfortunate.

Added a couple of guys on CC who might be able to help get those hacks
right.

It's still very fishy... the idea is to detect the case where setting a
64-bit mask will give your system memory mapped at a fixed high address
(1 << 59 in our case) and program that in your chip in the "Fixed high
bits" register that you seem to have (also make sure it doesn't affect
MSIs or it will break them).
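
In driver terms the hack amounts to something like the rough sketch
below. goya_set_dma_high_bits() is a hypothetical device-specific
helper, and cleanly detecting the fixed base is precisely the part the
DMA API gives you no interface for:

        /* Ask for the full 64 bits first */
        if (!dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64))) {
                /*
                 * Hypothetical: we somehow know this platform maps all
                 * of memory at a fixed high base (1 << 59 on this
                 * POWER9 setup), so program the device's "fixed high
                 * bits" register accordingly.
                 */
                goya_set_dma_high_bits(hdev, 1ULL << 59);
        } else if (dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(48))) {
                /* last resort: 32 bits, device barely usable */
                dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(32));
        }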

This will only work as long as all of the system memory can be
addressed at an offset from that fixed address that itself fits your
device addressing capabilities (50 bits in this case). It may or may
not be the case but there's no way to check since the DMA mask logic
won't really apply.

You might want to consider fixing your HW in the next iteration... This
is going to bite you when x86 increases the max physical memory for
example, or on other architectures.

Cheers,
Ben.






Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-11 Thread Benjamin Herrenschmidt
On Tue, 2019-06-11 at 17:20 -0500, Larry Finger wrote:
> b43-pci-bridge 0001:11:00.0: dma_direct_supported: failed (mask =
> 0x3fff, 
> min_mask = 0x5000/0x5000, dma bits = 0x1f

Ugh ? A mask with holes in it ? That's very wrong... That min_mask is
bogus.

Ben.




Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Benjamin Herrenschmidt
On Tue, 2019-06-11 at 13:56 +0200, Borislav Petkov wrote:
> On Tue, Jun 11, 2019 at 05:21:39PM +1000, Benjamin Herrenschmidt wrote:
> > So looking again ... all the registration/removal of edac devices seem
> > to already be protected by mutexes, so that's not a problem.
> > 
> > Tell me more about what specific races you think we might have here,
> > I'm not sure I follow...
> 
> Well, as I said "it might work or it might set your cat on fire." For
> example, one of the error logging paths is edac_mc_handle_error() and
> that thing mostly operates using the *mci pointer which should be ok
> but then it calls the "trace_mc_event" tracepoint and I'd suppose that
> tracepoints can do lockless but I'm not sure.

Yes, we would be in a world of pain already if tracepoints couldn't
handle concurrency :-)

> So what needs to happen is for paths which weren't called by multiple
> EDAC agents in parallel but need to get called in parallel now due to
> ARM drivers wanting to do that, to get audited that they're safe.

That's the thing, I don't think we have such a path. We are talking about
having separate L1/L2 vs. MC drivers, they don't overlap.

> Situation is easy if you have one platform driver where you can
> synchronize things in the driver but since you guys need to do separate
> drivers for whatever reason, then that would need to be done prior.
> 
> Makes more sense?

Sort-of... I still don't see a race in what we propose but I might be
missing something subtle. We are talking about two drivers for two
different IP blocks updating different counters etc...

Cheers,
Ben.




Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-11 Thread Benjamin Herrenschmidt
On Tue, 2019-06-11 at 09:54 +0200, Christoph Hellwig wrote:
> On Tue, Jun 11, 2019 at 04:59:54PM +1000, Benjamin Herrenschmidt
> wrote:
> > Ah stupid me ... it's dma_set_mask that failed, since it has no
> > idea
> > that the calling driver is limited to lowmem.
> > 
> > That's also why the "wrong" patch worked.
> > 
> > So yes, a ZONE_DMA at 30-bits will work, though it's somewhat
> > overkill.
> 
> Well, according to Larry it doesn't actually work, which is odd.

Oh I assume that's just a glitch in the patch :-)

Cheers,
Ben.




Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-11 Thread Benjamin Herrenschmidt
On Tue, 2019-06-11 at 15:50 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2019-06-08 at 11:05 +0200, Borislav Petkov wrote:
> > On Sat, Jun 08, 2019 at 10:16:11AM +1000, Benjamin Herrenschmidt wrote:
> > > Those IP blocks don't need any SW coordination at runtime. The drivers
> > > don't share data nor communicate with each other. There is absolutely
> > > no reason to go down that path.
> > 
> > Let me set one thing straight: the EDAC "subsystem" if you will - or
> > that pile of code which does error counting and reporting - has its
> > limitations in supporting one EDAC driver per platform. And whenever we
> > have two drivers loadable on a platform, we have to do dirty hacks like
> > 
> >   301375e76432 ("EDAC: Add owner check to the x86 platform drivers")
> > 
> > What that means is, that if you need to call EDAC logging routines or
> > whatnot from two different drivers, there's no locking, no nothing. So
> > it might work or it might set your cat on fire.
> 
> Should we fix that then instead ? What are the big issues with adding
> some basic locking ? being called from NMIs ?
> 
> If the separate drivers operate on distinct counters I don't see a big
> problem there.

So looking again ... all the registration/removal of edac devices seem
to already be protected by mutexes, so that's not a problem.

Tell me more about what specific races you think we might have here,
I'm not sure I follow...

Cheers,
Ben.




Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-11 Thread Benjamin Herrenschmidt
On Tue, 2019-06-11 at 16:58 +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2019-06-11 at 08:08 +0200, Christoph Hellwig wrote:
> > On Tue, Jun 11, 2019 at 03:56:33PM +1000, Benjamin Herrenschmidt
> > wrote:
> > > The reason I think it sort-of-mostly-worked is that to get more
> > > than 1GB of RAM, those machines use CONFIG_HIGHMEM. And *most*
> > > network buffers aren't allocated in Highmem so you got lucky.
> > > 
> > > That said, there is such a thing as no-copy send on the network,
> > > so I wouldn't be surprised if some things would still have failed,
> > > just not frequent enough for you to notice.
> > 
> > Unless NETIF_F_HIGHDMA is set on a netdev, the core networking code
> > will bounce buffer highmem pages for the driver under all
> > circumstances.
> 
>  ... which b43legacy doesn't set to the best of my knowledge ...
> 
> Which makes me wonder how come it didn't work even with your patches ?
> AFAIK, we have less than 1GB of lowmem unless the config has been
> tweaked.

Ah stupid me ... it's dma_set_mask that failed, since it has no idea
that the calling driver is limited to lowmem.

That's also why the "wrong" patch worked.

So yes, a ZONE_DMA at 30-bits will work, though it's somewhat overkill.

Cheers,
Ben.




Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-10 Thread Benjamin Herrenschmidt
On Tue, 2019-06-11 at 08:08 +0200, Christoph Hellwig wrote:
> On Tue, Jun 11, 2019 at 03:56:33PM +1000, Benjamin Herrenschmidt
> wrote:
> > The reason I think it sort-of-mostly-worked is that to get more
> > than 1GB of RAM, those machines use CONFIG_HIGHMEM. And *most*
> > network buffers aren't allocated in Highmem so you got lucky.
> > 
> > That said, there is such a thing as no-copy send on the network, so I
> > wouldn't be surprised if some things would still have failed, just
> > not frequent enough for you to notice.
> 
> Unless NETIF_F_HIGHDMA is set on a netdev, the core networking code
> will bounce buffer highmem pages for the driver under all
> circumstances.

 ... which b43legacy doesn't set to the best of my knowledge ...

Which makes me wonder how come it didn't work even with your patches ?
AFAIK, we have less than 1GB of lowmem unless the config has been
tweaked.

Cheers,
Ben.




Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-10 Thread Benjamin Herrenschmidt
On Mon, 2019-06-10 at 13:44 -0500, Larry Finger wrote:
> On 6/7/19 11:21 PM, Benjamin Herrenschmidt wrote:
> > 
> > > Please try the attached patch. I'm not really pleased with it and I
> > > will continue to determine why the fallback to a 30-bit mask fails,
> > > but at least this one works for me.
> > 
> > Your patch only makes sense if the device is indeed capable of
> > addressing 31-bits.
> > 
> > So either the driver is buggy and asks for a too small mask in which
> > case your patch is ok, or it's not and you're just going to cause all
> > sort of interesting random problems including possible memory
> > corruption.
> 
> Of course the driver may be buggy, but it asks for the correct mask.
> 
> This particular device is not capable of handling 32-bit DMA. The
> driver detects the 32-bit failure and falls back to 30 bits. It works
> on x86, and did on PPC32 until 5.1. As Christoph said, it should always
> be possible to use fewer bits than the maximum.

No, I don't think it *worked* on ppc32 before Christoph patch. I think
it "mostly sort-of worked" :-)

The reason I'm saying that is if your system has more than 1GB of RAM,
then you'll have chunks of memory that the device simply cannot
address.
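
(Concretely: a 30-bit mask covers only 0x0 - 0x3fffffff, i.e. the first
1 GB, so with the 1129 MB fitted in the machine reported elsewhere in
this saga, the top ~105 MB of RAM would be unreachable by the device.)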

Before Christoph patches, we had no ZONE_DMA or ZONE_DMA32 covering the
30-bit limited space, so any memory allocation could in theory land
above 30-bits, causing all sort of horrible things to happen with that
driver.

The reason I think it sort-of-mostly-worked is that to get more than
1GB of RAM, those machines use CONFIG_HIGHMEM. And *most* network
buffers aren't allocated in Highmem so you got lucky.

That said, there is such a thing as no-copy send on the network, so I
wouldn't be surprised if some things would still have failed, just not
frequent enough for you to notice.

> Similar devices that are new enough to use b43 rather than b43legacy
> work with new kernels; however, they have and use 32-bit DMA.

Cheers,
Ben.




Re: [PATCH 1/1] irqchip/gic: Add support for Amazon Graviton variant of GICv3+GICv2m

2019-06-10 Thread Benjamin Herrenschmidt
On Mon, 2019-06-10 at 09:20 +0100, Marc Zyngier wrote:
> Hi Zeev,
> 
> On 07/06/2019 00:17, Zeev Zilberman wrote:
> > The patch adds support for Amazon Graviton custom variant of GICv2m, where
> > hw irq is encoded using the MSI message address, as opposed to standard
> > GICv2m, where hw irq is encoded in the MSI message data.
> > In addition, the Graviton flavor of GICv2m is used along GICv3 (and not
> > GICv2).
> > 
> > Signed-off-by: Zeev Zilberman 
> > Signed-off-by: Benjamin Herrenschmidt 
> 
> There seem to be some confusion about who is the author of this patch.
> As you're the one posting the patch, your SoB tag should be the last
> one. And assuming the patch has been developed together with Ben, it
> should read:
> 
> Co-developed-by: Benjamin Herrenschmidt 
> Signed-off-by: Benjamin Herrenschmidt 
> Signed-off-by: Zeev Zilberman 

It was his patch originally. I shuffled a few things around to make it
less intrusive, then Zeev picked it back up and addresses your previous
comments. I'm happy for him to take full ownership.
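
To make the encoding concrete: with this scheme the doorbell address
itself selects the interrupt, so a device signalling hwirq 42 writes
data 0 to res.start | ((42 - 32) << 3), i.e. res.start + 0x50, which is
exactly what the gicv2m_get_msi_addr() hunk below computes.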

> > ---
> > diff --git a/drivers/irqchip/irq-gic-v2m.c b/drivers/irqchip/irq-gic-v2m.c
> > index 3c77ab6..eeed19f 100644
> > --- a/drivers/irqchip/irq-gic-v2m.c
> > +++ b/drivers/irqchip/irq-gic-v2m.c
> > @@ -56,6 +56,7 @@
> >  
> >  /* List of flags for specific v2m implementation */
> >  #define GICV2M_NEEDS_SPI_OFFSET0x0001
> > +#define GICV2M_GRAVITON_ADDRESS_ONLY   0x0002
> >  
> >  static LIST_HEAD(v2m_nodes);
> >  static DEFINE_SPINLOCK(v2m_lock);
> > @@ -98,15 +99,26 @@ static struct msi_domain_info gicv2m_msi_domain_info = {
> > .chip   = &gicv2m_msi_irq_chip,
> >  };
> >  
> > +static phys_addr_t gicv2m_get_msi_addr(struct v2m_data *v2m, int hwirq)
> > +{
> > +   if (v2m->flags & GICV2M_GRAVITON_ADDRESS_ONLY)
> > +   return v2m->res.start | ((hwirq - 32) << 3);
> > +   else
> > +   return v2m->res.start + V2M_MSI_SETSPI_NS;
> > +}
> > +
> >  static void gicv2m_compose_msi_msg(struct irq_data *data, struct msi_msg 
> > *msg)
> >  {
> > struct v2m_data *v2m = irq_data_get_irq_chip_data(data);
> > -   phys_addr_t addr = v2m->res.start + V2M_MSI_SETSPI_NS;
> > +   phys_addr_t addr = gicv2m_get_msi_addr(v2m, data->hwirq);
> >  
> > msg->address_hi = upper_32_bits(addr);
> > msg->address_lo = lower_32_bits(addr);
> > -   msg->data = data->hwirq;
> >  
> > +   if (v2m->flags & GICV2M_GRAVITON_ADDRESS_ONLY)
> > +   msg->data = 0;
> > +   else
> > +   msg->data = data->hwirq;
> > if (v2m->flags & GICV2M_NEEDS_SPI_OFFSET)
> > msg->data -= v2m->spi_offset;
> >  
> > @@ -188,7 +200,7 @@ static int gicv2m_irq_domain_alloc(struct irq_domain 
> > *domain, unsigned int virq,
> > hwirq = v2m->spi_start + offset;
> >  
> > err = iommu_dma_prepare_msi(info->desc,
> > -   v2m->res.start + V2M_MSI_SETSPI_NS);
> > +   gicv2m_get_msi_addr(v2m, hwirq));
> > if (err)
> > return err;
> >  
> > @@ -307,7 +319,7 @@ static int gicv2m_allocate_domains(struct irq_domain 
> > *parent)
> >  
> >  static int __init gicv2m_init_one(struct fwnode_handle *fwnode,
> >   u32 spi_start, u32 nr_spis,
> > - struct resource *res)
> > + struct resource *res, u32 flags)
> >  {
> > int ret;
> > struct v2m_data *v2m;
> > @@ -320,6 +332,7 @@ static int __init gicv2m_init_one(struct fwnode_handle 
> > *fwnode,
> >  
> > INIT_LIST_HEAD(&v2m->entry);
> > v2m->fwnode = fwnode;
> > +   v2m->flags = flags;
> >  
> > memcpy(&v2m->res, res, sizeof(struct resource));
> >  
> > @@ -334,7 +347,14 @@ static int __init gicv2m_init_one(struct fwnode_handle 
> > *fwnode,
> > v2m->spi_start = spi_start;
> > v2m->nr_spis = nr_spis;
> > } else {
> > -   u32 typer = readl_relaxed(v2m->base + V2M_MSI_TYPER);
> > +   u32 typer;
> > +
> > +   /* Graviton should always have explicit spi_start/nr_spis */
> > +   if (v2m->flags & GICV2M_GRAVITON_ADDRESS_ONLY) {
> > +   ret = -EINVAL;
> > +   goto err_iounmap;
> > +   }
> > +   typer = readl_relaxed(v2m

Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-10 Thread Benjamin Herrenschmidt
On Sat, 2019-06-08 at 11:05 +0200, Borislav Petkov wrote:
> On Sat, Jun 08, 2019 at 10:16:11AM +1000, Benjamin Herrenschmidt wrote:
> > Those IP blocks don't need any SW coordination at runtime. The drivers
> > don't share data nor communicate with each other. There is absolutely
> > no reason to go down that path.
> 
> Let me set one thing straight: the EDAC "subsystem" if you will - or
> that pile of code which does error counting and reporting - has its
> limitations in supporting one EDAC driver per platform. And whenever we
> have two drivers loadable on a platform, we have to do dirty hacks like
> 
>   301375e76432 ("EDAC: Add owner check to the x86 platform drivers")
> 
> What that means is, that if you need to call EDAC logging routines or
> whatnot from two different drivers, there's no locking, no nothing. So
> it might work or it might set your cat on fire.

Should we fix that then instead ? What are the big issues with adding
some basic locking ? being called from NMIs ?

If the separate drivers operate on distinct counters I don't see a big
problem there.

> IOW, having multiple separate "drivers" or representations of RAS
> functionality using EDAC facilities is something that hasn't been
> done. Well, almost. highbank_mc_edac.c and highbank_l2_edac.c is one
> example but they make sure they don't step on each other's toes by using
> different EDAC pieces - a device vs a memory controller abstraction.

That sounds like a reasonable requirement.

> And now the moment all of a sudden you decide you want for those
> separate "drivers" to synchronize on something, you need to do something
> hacky like the amd_register_ecc_decoder() thing, for example, because we
> need to call into the EDAC memory controller driver to decode a DRAM ECC
> error properly, while the rest of the error types get decoded somewhere
> else...
> 
> Then there comes the issue with code reuse - wouldn't it be great if a
> memory controller driver can be shared between platform drivers instead of
> copying it in both?
> 
> We already do that - see fsl_ddr_edac.c which gets shared between PPC
> *and* ARM. drivers/edac/skx_common.c is another example for Intel chips.
> 
> Now, if you have a platform with 10 IP blocks which each have RAS
> functionality, are you saying you'll do 10 different pieces called
> 
> __edac.c
> 
> ?
> 
> And if  has an old IP block with the old RAS
> functionality, you load __edac.c on the new
> platform too?

I'm not sure why  ...

Anyway, let's get back to the specific case of our Amazon platform here
since it's a concrete example.

Hanna, can you give us a reasonably exhaustive list of how many such
"drivers" we'll want in the EDAC subsystem and whether you envision any
coordination requirement between them or not ?

Cheers,
Ben.





Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-07 Thread Benjamin Herrenschmidt


> Please try the attached patch. I'm not really pleased with it and I
> will continue to determine why the fallback to a 30-bit mask fails,
> but at least this one works for me.

Your patch only makes sense if the device is indeed capable of
addressing 31-bits.

So either the driver is buggy and asks for a too small mask in which
case your patch is ok, or it's not and you're just going to cause all
sort of interesting random problems including possible memory
corruption.

Cheers,
Ben.




Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-07 Thread Benjamin Herrenschmidt
On Fri, 2019-06-07 at 16:11 +0100, James Morse wrote:
> I'm coming at this from somewhere else. This stuff has to be considered
> all the way through the system. Just because each component supports
> error detection, doesn't mean you aren't going to get silent
> corruption. Likewise if another platform picks up two piecemeal edac
> drivers for hardware it happens to have in common with yours, it
> doesn't mean we're counting all the errors. This stuff has to be viewed
> for the whole platform.

Sure but you don't solve that problem by having a magic myplatform.c
overseer. And even if you do, it can perfectly access the individual IP
block drivers, finding them via phandles in the DT for example etc...
without having to make those individual drivers dependent on some
overarching machine-wide probing mechanism.

> But this doesn't give you a device you can bind a driver to, to kick
> this stuff off. This (I assume) is why you added a dummy 'edac_l1_l2'
> node, that just probes the driver. The hardware is to do with the CPU
> and caches; 'edac_l1_l2' doesn't correspond to any distinct part of
> the SoC.
> 
> The request is to use the machine compatible, not a dummy node. This
> wraps up the firmware properties too, and any other platform property
> we don't know about today.
> 
> Once you have this, you don't really need the cpu/cache integration
> annotations, and your future memory-controller support can be picked up
> as part of the platform driver. If you have otherwise identical
> platforms with different memory controllers, OF gives you the API to
> match the node in the DT.

Dummy nodes are perfectly fine, and have been since the early days of Open
Firmware. That said, these aren't so much dummy as a way to expose the
control path to the caches. The DT isn't perfect in its structure and
the way caches and CPUs are represented makes it difficult to express
arbitrary control paths to them without extra nodes, which is thus what
people do.

Cheers,
Ben.



Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

2019-06-07 Thread Benjamin Herrenschmidt
On Thu, 2019-06-06 at 11:33 +0100, James Morse wrote:

> > Disagree. The various drivers don't depend on each other.
> > I think we should keep the drivers separated as they are distinct and 
> > independent IP blocks.
> 
> But they don't exist in isolation, they both depend on the
> integration-choices/firmware that makes up your platform.

What do you mean ? They are exposing counters from independent IP
blocks. They are independent drivers. Your argument could be used to
claim the entire SoC depends on "integration choices / firmware" ... I
don't get it.

> Other platforms may have exactly the same IP blocks, configured
> differently, or with different features enabled in firmware.

Sure, like every other IP block on the planet. That has never been a
good reason to bring back the ugly spectre of myboard.c file...

> This means we can't just probe the driver based on the presence of the
> IP block, we need to know the integration choices and firmware settings
> match what the driver requires.

Such as ? I mean none of that differs between these EDAC drivers and
any other IP block, and we still probe them individually.

> (Case in point, that A57 ECC support is optional, another A57 may not have it)

So what ? That belongs in the DT.

> Descriptions of what firmware did don't really belong in the DT. It's
> not a hardware property.

Since when ? I'm tired of people coming up over and over about that
complete fallacy that the DT should for some obscure religious reason
be strictly limited to "HW properties". ACPI isn't. The old Open
Firmware which I used as a basis for creating the FDT wasn't.

It is perfectly legitimate for the DT to contain configuration
information and firmware choices.

What's not OK is to stick there things that are essentially specific to
the Linux driver implementation but that isn't what we are talking
about here.

> This is why it's better to probe this stuff based on the
> machine-compatible/platform-name, not the presence of the IP block in
> the DT.

No. No no no no. This is bringing back the days of having board files
etc... this is wrong.

Those IP blocks don't need any SW coordination at runtime. The drivers
don't share data nor communicate with each other. There is absolutely
no reason to go down that path.

> Will either of your separate drivers ever run alone? If they're probed
> from the same machine-compatible this won't happen.

They should be probed independently from independent DT nodes, what's
the problem you are trying to fix here ?

> How does your memory controller report errors? Does it send back some
> data with an invalid checksum, or a specific poison/invalid flag? Will
> the cache report this as a cache error too, if it's an extra signal,
> does the cache know what it is?

That's ok, you get the error from both sides; power has done it that
way forever. It's not always possible to correlate anyway and it's
certainly not the job of the EDAC drivers to try.

> All these are integration choices between the two IP blocks; done as
> separate drivers we don't have anywhere to store that information.

We do, it's called the DT.

> Even if you don't care about this, making them separate drivers should
> only be done to make them usable on other platforms, where these
> choices may have been different.

That wouldn't make the drivers unusable on other platforms at all.

Cheers,
Ben.




Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-06 Thread Benjamin Herrenschmidt
On Thu, 2019-06-06 at 20:56 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2019-06-06 at 12:31 +0300, Aaro Koskinen wrote:
> > Hi,
> > 
> > On Thu, Jun 06, 2019 at 10:54:51AM +1000, Benjamin Herrenschmidt
> > wrote:
> > > On Thu, 2019-06-06 at 01:50 +0300, Aaro Koskinen wrote:
> > > > Hi,
> > > > 
> > > > When upgrading from v5.0 -> v5.1 on G4 PowerBook, I noticed WLAN
> > > > does
> > > > not work anymore:
> > > > 
> > > > [   42.004303] b43legacy-phy0: Loading firmware version 0x127,
> > > > patch level 14 (2005-04-18 02:36:27)
> > > > [   42.184837] b43legacy-phy0 debug: Chip initialized
> > > > [   42.184873] b43legacy-phy0 ERROR: The machine/kernel does not
> > > > support the required 30-bit DMA mask
> > > > 
> > > > The same happens with the current mainline.
> > > 
> > > How much RAM do you have ?
> > 
> > The system has 1129 MB RAM. Booting with mem=1G makes it work.
> 
> Wow... that's an odd amount. One thing we could possibly do is add code
> to limit the amount of RAM when we detect that device

Sent too quickly... I mean that *or* force swiotlb at 30 bits on those
systems based on detecting the presence of that device in the device-tree.

Cheers,
Ben.




Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-06 Thread Benjamin Herrenschmidt
On Thu, 2019-06-06 at 12:31 +0300, Aaro Koskinen wrote:
> Hi,
> 
> On Thu, Jun 06, 2019 at 10:54:51AM +1000, Benjamin Herrenschmidt
> wrote:
> > On Thu, 2019-06-06 at 01:50 +0300, Aaro Koskinen wrote:
> > > Hi,
> > > 
> > > When upgrading from v5.0 -> v5.1 on G4 PowerBook, I noticed WLAN
> > > does
> > > not work anymore:
> > > 
> > > [   42.004303] b43legacy-phy0: Loading firmware version 0x127,
> > > patch level 14 (2005-04-18 02:36:27)
> > > [   42.184837] b43legacy-phy0 debug: Chip initialized
> > > [   42.184873] b43legacy-phy0 ERROR: The machine/kernel does not
> > > support the required 30-bit DMA mask
> > > 
> > > The same happens with the current mainline.
> > 
> > How much RAM do you have ?
> 
> The system has 1129 MB RAM. Booting with mem=1G makes it work.

Wow... that's an odd amount. One thing we could possibly do is add code
to limit the amount of RAM when we detect that device

Cheers,
Ben.




Re: [PATCH v2 2/2] irqchip: al-fic: Introduce Amazon's Annapurna Labs Fabric Interrupt Controller Driver

2019-06-06 Thread Benjamin Herrenschmidt
On Thu, 2019-06-06 at 08:05 +0100, Marc Zyngier wrote:
> 
> > I disagree Marc. This is a rather bad error which indicates that the
> > device-tree is probably incorrect (or the HW was wired in a way that
> > cannot work).
> 
> But surely that's something you'll spot pretty quickly.

Not really. A level/edge mismatch isn't something you can spot that
quickly, but will cause lost interrupts on load. Since the kernel can
spot the error pretty much right away, I think that could even be a
pr_err :)

> Also, you get
> a splat from the irq subsystem already, telling you that things went
> wrong (see __irq_set_trigger). At that stage, you can enable debugging
> and figure it out.

Ah, returning an error will cause such a splat indeed.

> What I'm trying to avoid is the kernel becoming a (pretty bad)
> validation tool for DTS files.

Haha, yeah, I don't like it going out of its way to validate them but
that sort of very obvious sanity checking makes sense.

> > Basically a given FIC can either be entirely level sensitive or
> > entirely edge sensitive. This catches cases where the DT has routed
> > a mix of both to the same FIC. Definitely worth barfing loudly
> > about rather than trying to understand subtle odd misbehaviours of
> > the device in the field.
> 
> Then, in the interest of not producing incorrect DTs, could the
> edge/level property be encoded in the FIC description itself, rather
> than in the interrupt specifiers of the individual devices? It would
> sidestep the problem altogether. You can still put the wrong one in
> the FIC node, but it then becomes even more obvious what is going
> on...

This was Talel's original approach internally, in fact. I told him to put
it in the specifier instead :-) The advantage of doing it that way is
that you get the right flags in the descriptor by default iirc, so the
right value in /proc/interrupts etc... And it will continue working if
a future FIC loses that limitation.

That said, if you feel strongly about it, we can revert to putting a
global property in the FIC node itself. Let us know what you want.

Cheers,
Ben.




Re: [PATCH 2/3] irqchip: al-fic: Introduce Amazon's Annapurna Labs Fabric Interrupt Controller Driver

2019-06-06 Thread Benjamin Herrenschmidt
On Thu, 2019-06-06 at 16:47 +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2019-06-06 at 08:37 +0200, Greg KH wrote:
> > On Thu, Jun 06, 2019 at 07:55:43AM +1000, Benjamin Herrenschmidt wrote:
> > > On Wed, 2019-06-05 at 09:59 +0200, Greg KH wrote:
> > > > 
> > > > > +struct irq_domain *al_fic_wire_get_domain(struct al_fic *fic);
> > > > > +
> > > > > +struct al_fic *al_fic_wire_init(struct device_node *node,
> > > > > + void __iomem *base,
> > > > > + const char *name,
> > > > > + unsigned int parent_irq);
> > > > > +int al_fic_cleanup(struct al_fic *fic);
> > > > 
> > > > Who is using these new functions?  We don't add new apis that no one
> > > > uses :(
> > > 
> > > They will be used by subsequent driver submissions but those aren't
> > > quite ready yet, so we can hold onto patch 3 for now until they are.
> > 
> > Patch 2 also should have these removed :)
> 
> That's a mistake, that export should have been in patch3. Talel, pls
> fix that in your next spin.

Actually that's already fixed in v2 of the series. The API have been
removed for now.

Cheers,
Ben.



Re: [PATCH 2/3] irqchip: al-fic: Introduce Amazon's Annapurna Labs Fabric Interrupt Controller Driver

2019-06-05 Thread Benjamin Herrenschmidt
On Thu, 2019-06-06 at 08:37 +0200, Greg KH wrote:
> On Thu, Jun 06, 2019 at 07:55:43AM +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2019-06-05 at 09:59 +0200, Greg KH wrote:
> > > 
> > > > +struct irq_domain *al_fic_wire_get_domain(struct al_fic *fic);
> > > > +
> > > > +struct al_fic *al_fic_wire_init(struct device_node *node,
> > > > +   void __iomem *base,
> > > > +   const char *name,
> > > > +   unsigned int parent_irq);
> > > > +int al_fic_cleanup(struct al_fic *fic);
> > > 
> > > Who is using these new functions?  We don't add new apis that no one
> > > uses :(
> > 
> > They will be used by subsequent driver submissions but those aren't
> > quite ready yet, so we can hold onto patch 3 for now until they are.
> 
> Patch 2 also should have these removed :)

That's a mistake, that export should have been in patch3. Talel, pls
fix that in your next spin.

> You know we don't add new apis until we have a real, in-kernel user for
> them...

Yup, the users are going to be drivers in other subsystems, so
coordination is a bit tricky, which is I think why Talel wanted to
submit that now, but the patches for those other drivers aren't quite
ready yet so we can hold onto that one for the time being.

It's nothing nefarious :-) Just coordination issues.

Cheers,
Ben.

> thanks,
> 
> greg k-h



Re: [BISECTED REGRESSION] b43legacy broken on G4 PowerBook

2019-06-05 Thread Benjamin Herrenschmidt
On Thu, 2019-06-06 at 01:50 +0300, Aaro Koskinen wrote:
> Hi,
> 
> When upgrading from v5.0 -> v5.1 on G4 PowerBook, I noticed WLAN does
> not work anymore:
> 
> [   42.004303] b43legacy-phy0: Loading firmware version 0x127, patch level 14 
> (2005-04-18 02:36:27)
> [   42.184837] b43legacy-phy0 debug: Chip initialized
> [   42.184873] b43legacy-phy0 ERROR: The machine/kernel does not support the 
> required 30-bit DMA mask
> 
> The same happens with the current mainline.

How much RAM do you have ?

Ben.

> 
> Bisected to:
> 
>   commit 65a21b71f948406201e4f62e41f06513350ca390
>   Author: Christoph Hellwig 
>   Date:   Wed Feb 13 08:01:26 2019 +0100
> 
>   powerpc/dma: remove dma_nommu_dma_supported
> 
>   This function is largely identical to the generic version used
>   everywhere else.  Replace it with the generic version.
> 
>   Signed-off-by: Christoph Hellwig 
>   Tested-by: Christian Zigotzky 
>   Signed-off-by: Michael Ellerman 
> 
> A.



Re: [PATCH v2 2/2] irqchip: al-fic: Introduce Amazon's Annapurna Labs Fabric Interrupt Controller Driver

2019-06-05 Thread Benjamin Herrenschmidt
On Wed, 2019-06-05 at 16:12 +0100, Marc Zyngier wrote:
> > Those error messages are control path messages. If we return the same
> > error value from here and from the previous error, how can we
> > differentiate between the two error cases by looking at the log?
> > 
> > Having informative printouts seems like a good idea for bad
> > configuration cases such as this, wouldn't you agree?
> 
> I completely disagree. The kernel log isn't a dumping ground for this
> kind of pretty useless information. Furthermore, the irq subsystem will
> also shout at you when it gets an error, so no need to add insult to injury.
> 
> If you really want to keep them around, turn them into pr_debug.

I disagree Marc. This is a rather bad error which indicates that the device-tree
is probably incorrect (or the HW was wired in a way that cannot work).

Basically a given FIC can either be entirely level sensitive or entirely edge
sensitive. This catches cases where the DT has routed a mix of both to the
same FIC. Definitely worth barfing loudly about rather than trying to understand
subtle odd misbehaviours of the device in the field.

Cheers,
Ben.




Re: [PATCH v3 1/2] dt-bindings: interrupt-controller: Amazon's Annapurna Labs FIC

2019-06-05 Thread Benjamin Herrenschmidt
On Wed, 2019-06-05 at 17:59 +0300, Talel Shenhar wrote:
> 

 .../...

> +- compatible: should be "amazon,al-fic"
> +- reg: physical base address and size of the registers
> +- interrupt-controller: identifies the node as an interrupt controller
> +- #interrupt-cells: must be 2.
> +- interrupt-parent: specifies the parent interrupt controller.
> +- interrupts: describes which input line of the interrupt parent this
> +  fic's output is connected to.
> +
> +Example:
> +
> +amazon_fic: interrupt-controller@0xfd8a8500 {
> + compatible = "amazon,al-fic";
> + interrupt-controller;
> + #interrupt-cells = <1>;
^ should be 2

> + reg = <0x0 0xfd8a8500 0x0 0x1000>;
> + interrupt-parent = <&gic>;
> + interrupts = ;
> +};



Re: [PATCH v2 2/2] irqchip: al-fic: Introduce Amazon's Annapurna Labs Fabric Interrupt Controller Driver

2019-06-05 Thread Benjamin Herrenschmidt
On Wed, 2019-06-05 at 13:22 +0100, Marc Zyngier wrote:
> 
> > +* This is generally fixed depending on what pieces of HW it's wired up
> > +* to.
> > +*
> > +* We configure it based on the sensitivity of the first source
> > +* being setup, and reject any subsequent attempt at configuring it in a
> > +* different way.
> 
> Is that a reliable guess? It also strikes me that the DT binding doesn't
> allow for the trigger type to be passed, meaning the individual drivers
> have to request the trigger as part of their request_irq() call. I'd
> rather you have a complete interrupt specifier in DT, and document the
> various limitations of the HW.

Actually the DT does, but Talel forgot to update the "example" part of
the binding patch. The description does say 2 cells.

This is the best approach imho (translation: I asked Talel to do it
this way :-) The other option which I don't like is to stick to
#interrupt-cells = 1, and have a separate property in the interrupt
controller node to indicate whether it needs to be configured as level
or edge.

These FICs are used for what is generally fixed wires inside the SoC,
so it doesn't matter much either way, but I prefer having it self
configured based on source just in case a future implementation doesn't
have the limitation of all inputs having the same trigger type.
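
As a sketch of what "self configured based on source" means in practice
(illustrative names only, not the submitted driver):

static int al_fic_set_type_sketch(struct irq_data *data, unsigned int type)
{
        struct al_fic *fic = irq_data_get_irq_chip_data(data);

        if (type != IRQ_TYPE_LEVEL_HIGH && type != IRQ_TYPE_EDGE_RISING)
                return -EINVAL;

        if (!fic->configured_type)
                fic->configured_type = type;    /* latch the first source */
        else if (fic->configured_type != type)
                return -EINVAL;                 /* mixed triggers on one FIC */

        return 0;
}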

Cheers,
Ben.




Re: [PATCH 2/3] irqchip: al-fic: Introduce Amazon's Annapurna Labs Fabric Interrupt Controller Driver

2019-06-05 Thread Benjamin Herrenschmidt
On Wed, 2019-06-05 at 09:59 +0200, Greg KH wrote:
> 
> > +struct irq_domain *al_fic_wire_get_domain(struct al_fic *fic);
> > +
> > +struct al_fic *al_fic_wire_init(struct device_node *node,
> > +   void __iomem *base,
> > +   const char *name,
> > +   unsigned int parent_irq);
> > +int al_fic_cleanup(struct al_fic *fic);
> 
> Who is using these new functions?  We don't add new apis that no one
> uses :(

They will be used by subsequent driver submissions but those aren't
quite ready yet, so we can hold onto patch 3 for now until they are.

Cheers,
Ben.




Re: [PATCH] driver core: Fix use-after-free and double free on glue directory

2019-05-09 Thread Benjamin Herrenschmidt
On Thu, 2019-05-09 at 20:08 +0530, Gaurav Kohli wrote:
> Hi,
> 
> The last patch will serialize the addition of a child to the parent
> directory; won't it affect performance?

I doubt this is a significant issue, and there's already a global lock
taken once or twice in that path. The fix is purely to make sure that
the same locked section is used both for the lookup and the addition, as
the bug comes from the window in between those two operations allowing
the object to be removed after it was "found".

Cheers,
Ben.
 
> 
> Regards
> Gaurav
> 
> On 5/4/2019 9:04 PM, Greg KH wrote:
> > On Sat, May 04, 2019 at 10:47:07PM +0800, Muchun Song wrote:
> > > Benjamin Herrenschmidt wrote on Thursday, 2 May 2019 at 2:25 PM:
> > > 
> > > > > > The basic idea yes, the whole bool *locked is horrid
> > > > > > though.
> > > > > > Wouldn't it
> > > > > > work to have a get_device_parent_locked that always returns
> > > > > > with
> > > > > > the mutex held,
> > > > > > or just move the mutex to the caller or something simpler
> > > > > > like this
> > > > > > ?
> > > > > > 
> > > > > 
> > > > > Greg and Rafael, do you have any suggestions for this? Or you
> > > > > also
> > > > > agree with Ben?
> > > > 
> > > > Ping guys ? This is worth fixing...
> > > 
> > > I also agree with you. But Greg and Rafael seem to be high
> > > latency right now.
> > 
> > It's in my list of patches to get to, sorry, hopefully will dig out
> > of
> > that next week with the buffer that the merge window provides me.
> > 
> > thanks,
> > 
> > greg k-h
> > 
> 
> 



Re: [PATCH] driver core: Fix use-after-free and double free on glue directory

2019-05-01 Thread Benjamin Herrenschmidt
On Sun, 2019-04-28 at 22:49 +0800, Muchun Song wrote:
> Hi Greg and Rafael:
> 
> 
> Benjamin Herrenschmidt wrote on Sunday, 28 April 2019 at 6:10 PM:
> > 
> > The basic idea yes, the whole bool *locked is horrid though.
> > Wouldn't it
> > work to have a get_device_parent_locked that always returns with
> > the mutex held,
> > or just move the mutex to the caller or something simpler like this
> > ?
> > 
> 
> Greg and Rafael, do you have any suggestions for this? Or you also
> agree with Ben?

Ping guys ? This is worth fixing... 

Ben.



Re: [PATCH] driver core: Fix use-after-free and double free on glue directory

2019-04-28 Thread Benjamin Herrenschmidt
On Thu, 2019-04-25 at 23:44 +0800, Muchun Song wrote:
> I agree with you that the looking up of the glue dir and the creation
> of its child should be protected by the same gdp_mutex lock. So, do you
> agree with the fix in the following code snippet?

The basic idea yes, the whole bool *locked is horrid though. Wouldn't it
work to have a get_device_parent_locked that always returns with the mutex held,
or just move the mutex to the caller or something simpler like this ?

Ben.




Re: [PATCH] driver core: Fix use-after-free and double free on glue directory

2019-04-25 Thread Benjamin Herrenschmidt
On Tue, 2019-04-23 at 22:32 +0800, Muchun Song wrote:
> There is a race condition between removing glue directory and adding a new
> device under the glue directory. It can be reproduced in following test:
> 

 .../...

> In order to avoid this happening, we should not call kobject_del() on
> path2 when the reference count of glue_dir is greater than 1. So we add
> a conditional statement to fix it.

Good catch ! However I'm not completely happy about the fix you
propose.

I find relying on the object count for such decisions rather fragile as
it could be taken temporarily for other reasons, couldn't it ? In which
case we would just fail...

Ideally, the looking up of the glue dir and creation of its child
should be protected by the same lock instance (the gdp_mutex in that
case).

That might require a bit of shuffling around though.

Greg, thoughts ? This whole gluedir business is annoyingly racy still.

My gut feeling is that the "right fix" is to ensure the lookup of the
glue dir and creation of the child object(s) are done under a single
instance of gdp_mutex so we never see a stale "empty" but still
potentially used glue dir around.

This should also be true when creating such gluedir in the first place
in fact, though that race is a lot harder to hit.
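
Concretely, the shape would be something like this (illustrative only;
it assumes get_device_parent() is reworked to run with the caller
holding gdp_mutex instead of taking it itself):

        mutex_lock(&gdp_mutex);
        /*
         * Lookup-or-creation of the glue dir and the addition of the
         * child now sit in one critical section, so the glue dir can't
         * be emptied and deleted in between the two.
         */
        parent_kobj = get_device_parent(dev, dev->parent);
        error = kobject_add(&dev->kobj, parent_kobj, "%s", dev_name(dev));
        mutex_unlock(&gdp_mutex);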

Cheers,
Ben.




Re: [PATCH v2 01/21] docs/memory-barriers.txt: Rewrite "KERNEL I/O BARRIER EFFECTS" section

2019-04-14 Thread Benjamin Herrenschmidt
On Fri, 2019-04-12 at 14:17 +0100, Will Deacon wrote:
> 
> +the same CPU thread to a particular device will arrive in program
> +order.
> +
> + 2. A writeX() by a CPU thread to the peripheral will first wait for the
> +completion of all prior writes to memory either issued by the thread
> +or issued while holding a spinlock that was subsequently taken by the
> +thread. This ensures that writes by the CPU to an outbound DMA
> +buffer allocated by dma_alloc_coherent() will be visible to a DMA
> +engine when the CPU writes to its MMIO control register to trigger
> +the transfer.

Not particularly trying to be annoying here, but I find the above
rather hard to parse :) I know what you're getting at but I'm not sure
somebody who doesn't will understand.

One way would be to instead prefix the whole thing with a blurb along
the lines of:

readX() and writeX() provide some ordering guarantees versus
each other and other memory accesses that are described below. 
Those guarantees apply to accesses performed either by the same
logical thread of execution, or by different threads but while 
holding the same lock (spinlock or mutex).

Then have a simpler description of each case. No ?
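
For instance, the guarantee in point 2 is what makes the usual "fill a
coherent buffer, then ring the doorbell" pattern safe with no explicit
barrier (sketch; DOORBELL is a made-up register offset):

        __le32 *desc = dma_alloc_coherent(dev, size, &dma_handle, GFP_KERNEL);

        desc[0] = cpu_to_le32(cmd);     /* plain store to the DMA buffer... */
        writel(1, base + DOORBELL);     /* ...writeX() waits for that store
                                           before the MMIO write goes out */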

> + 3. A readX() by a CPU thread from the peripheral will complete before
> +any subsequent reads from memory by the same thread can begin. This
> +ensures that reads by the CPU from an incoming DMA buffer allocated
> +by dma_alloc_coherent() will not see stale data after reading from
> +the DMA engine's MMIO status register to establish that the DMA
> +transfer has completed.
> +
> + 4. A readX() by a CPU thread from the peripheral will complete before
> +any subsequent delay() loop can begin execution on the same thread.
> +This ensures that two MMIO register writes by the CPU to a peripheral
> +will arrive at least 1us apart if the first write is immediately read
> +back with readX() and udelay(1) is called prior to the second
> +writeX():
>  
>   writel(42, DEVICE_REGISTER_0); // Arrives at the device...
>   readl(DEVICE_REGISTER_0);
> @@ -2600,8 +2604,10 @@ guarantees:
>   These will perform appropriately for the type of access they're actually
>   doing, be it inX()/outX() or readX()/writeX().
>  
> -All of these accessors assume that the underlying peripheral is 
> little-endian,
> -and will therefore perform byte-swapping operations on big-endian 
> architectures.
> +With the exception of the string accessors (insX(), outsX(), readsX() and
> +writesX()), all of the above assume that the underlying peripheral is
> +little-endian and will therefore perform byte-swapping operations on 
> big-endian
> +architectures.
>  
>  
>  



Re: [PATCH v2 01/21] docs/memory-barriers.txt: Rewrite "KERNEL I/O BARRIER EFFECTS" section

2019-04-11 Thread Benjamin Herrenschmidt
On Thu, 2019-04-11 at 15:34 -0700, Linus Torvalds wrote:
> On Thu, Apr 11, 2019 at 3:13 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Minor nit... I would have said "All readX() and writeX() accesses
> > _from
> > the same CPU_ to the same peripheral... and then s/the CPU/this
> > CPU.
> 
> Maybe talk about "same thread" rather than "same cpu", with the
> understanding that scheduling/preemption has to include the
> appropriate cross-CPU IO barrier?

Works for me, but why not spell all this out in the document ? We know,
but others might not.

Cheers,
Ben.



Re: [PATCH v2 01/21] docs/memory-barriers.txt: Rewrite "KERNEL I/O BARRIER EFFECTS" section

2019-04-11 Thread Benjamin Herrenschmidt
On Fri, 2019-04-05 at 14:59 +0100, Will Deacon wrote:
> + 1. All readX() and writeX() accesses to the same peripheral are ordered
> +with respect to each other. For example, this ensures that MMIO 
> register
> +   writes by the CPU to a particular device will arrive in program order.

Minor nit... I would have said "All readX() and writeX() accesses _from
the same CPU_ to the same peripheral... and then s/the CPU/this CPU.

> - Accesses to this space may be fully synchronous (as on i386), but
> - intermediary bridges (such as the PCI host bridge) may not fully honour
> - that.
> + 2. A writeX() by the CPU to the peripheral will first wait for the
> +completion of all prior CPU writes to memory. For example, this 
> ensures
> +that writes by the CPU to an outbound DMA buffer allocated by
> +dma_alloc_coherent() will be visible to a DMA engine when the CPU 
> writes
> +to its MMIO control register to trigger the transfer.

Similarly, "the CPU" -> "a CPU"
>  
> - They are guaranteed to be fully ordered with respect to each other.
> + 3. A readX() by the CPU from the peripheral will complete before any
> +   subsequent CPU reads from memory can begin. For example, this ensures
> +   that reads by the CPU from an incoming DMA buffer allocated by
> +   dma_alloc_coherent() will not see stale data after reading from the 
> DMA
> +   engine's MMIO status register to establish that the DMA transfer has
> +   completed.
>  
> - They are not guaranteed to be fully ordered with respect to other types 
> of
> - memory and I/O operation.
> + 4. A readX() by the CPU from the peripheral will complete before any
> +   subsequent delay() loop can begin execution. For example, this ensures
> +   that two MMIO register writes by the CPU to a peripheral will arrive 
> at
> +   least 1us apart if the first write is immediately read back with 
> readX()
> +   and udelay(1) is called prior to the second writeX().
>  
> - (*) readX(), writeX():
> + __iomem pointers obtained with non-default attributes (e.g. those 
> returned
> + by ioremap_wc()) are unlikely to provide many of these guarantees.

So we give up on defining _wc semantics ? :-) Fair enough, it's a
mess...

 .../...

> +All of these accessors assume that the underlying peripheral is 
> little-endian,
> +and will therefore perform byte-swapping operations on big-endian 
> architectures.

This is not true of readsX()/writesX(): those perform native accesses
and are intrinsically endian-neutral.
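
A two-line illustration of the difference (sketch; STATUS and the fifo
pointer are made up):

        u32 stat = readl(regs + STATUS);  /* LE register: swapped on BE CPUs */
        readsl(fifo, buf, words);         /* FIFO data: native, never swapped */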

> +Composing I/O ordering barriers with SMP ordering barriers and LOCK/UNLOCK
> +operations is a dangerous sport which may require the use of mmiowb(). See 
> the
> +subsection "Acquires vs I/O accesses" for more information.

Cheers,
Ben.



Re: [PATCH v3] PCI: al: Add Amazon Annapurna Labs PCIe host controller driver

2019-04-08 Thread Benjamin Herrenschmidt
On Thu, 2019-03-28 at 13:57 +0200, Jonathan Chocron wrote:
> Add support for Amazon's Annapurna Labs PCIe driver. The HW controller
> is based on DesignWare's IP.
> 
> The HW doesn't support accessing the Root Port's config space via ECAM,
> so we obtain its base address via an AMZN0001 device.
> 
> Furthermore, the DesignWare PCIe controller doesn't filter out config
> transactions sent to devices 1 and up on its bus, so they are filtered
> by the driver.
> 
> All subordinate buses do support ECAM access.
> 
> Implementing specific PCI config access functions involves:
>  - Adding an init function to obtain the Root Port's base address
>    from an AMZN0001 device.
>  - Adding a new entry in the MCFG quirk array
> 
> Co-developed-by: Vladimir Aerov 
> Signed-off-by: Jonathan Chocron 
> Signed-off-by: Vladimir Aerov 
> Reviewed-by: David Woodhouse 

Late to the party, sorry :-) That kernel.crashing.org email is on its
last legs...

Reviewed-by: Benjamin Herrenschmidt 

> ---
> 
> --v2:
>   - Fix commit message comments (incl. using AMZN0001 instead of
> PNP0C02)
>   - Use the usual multi-line comment style
> 
> --v3:
>   - Fix additional commit message comments
> 
>  MAINTAINERS  |  6 +++
>  drivers/acpi/pci_mcfg.c  | 12 +
>  drivers/pci/controller/dwc/Makefile  |  1 +
>  drivers/pci/controller/dwc/pcie-al.c | 93
> 
>  include/linux/pci-ecam.h |  1 +
>  5 files changed, 113 insertions(+)
>  create mode 100644 drivers/pci/controller/dwc/pcie-al.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 32d76a90..7a17017f9f82 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11769,6 +11769,12 @@ T:   git
> git://git.kernel.org/pub/scm/linux/kernel/git/lpieralisi/pci.git/
>  S:   Supported
>  F:   drivers/pci/controller/
>  
> +PCIE DRIVER FOR ANNAPURNA LABS
> +M:   Jonathan Chocron 
> +L:   linux-...@vger.kernel.org
> +S:   Maintained
> +F:   drivers/pci/controller/dwc/pcie-al.c
> +
>  PCIE DRIVER FOR AMLOGIC MESON
>  M:   Yue Wang 
>  L:   linux-...@vger.kernel.org
> diff --git a/drivers/acpi/pci_mcfg.c b/drivers/acpi/pci_mcfg.c
> index a4e8432fc2fb..b42be067fb83 100644
> --- a/drivers/acpi/pci_mcfg.c
> +++ b/drivers/acpi/pci_mcfg.c
> @@ -52,6 +52,18 @@ struct mcfg_fixup {
>  static struct mcfg_fixup mcfg_quirks[] = {
>  /*   { OEM_ID, OEM_TABLE_ID, REV, SEGMENT, BUS_RANGE, ops, cfgres },
> */
>  
> +#define AL_ECAM(table_id, rev, seg, ops) \
> + { "AMAZON", table_id, rev, seg, MCFG_BUS_ANY, ops }
> +
> + AL_ECAM("GRAVITON", 0, 0, &al_pcie_ops),
> + AL_ECAM("GRAVITON", 0, 1, &al_pcie_ops),
> + AL_ECAM("GRAVITON", 0, 2, &al_pcie_ops),
> + AL_ECAM("GRAVITON", 0, 3, &al_pcie_ops),
> + AL_ECAM("GRAVITON", 0, 4, &al_pcie_ops),
> + AL_ECAM("GRAVITON", 0, 5, &al_pcie_ops),
> + AL_ECAM("GRAVITON", 0, 6, &al_pcie_ops),
> + AL_ECAM("GRAVITON", 0, 7, &al_pcie_ops),
> +
>  #define QCOM_ECAM32(seg) \
>   { "QCOM  ", "QDF2432 ", 1, seg, MCFG_BUS_ANY, &pci_32b_ops }
>  
> diff --git a/drivers/pci/controller/dwc/Makefile
> b/drivers/pci/controller/dwc/Makefile
> index 7bcdcdf5024e..1ea773c0070d 100644
> --- a/drivers/pci/controller/dwc/Makefile
> +++ b/drivers/pci/controller/dwc/Makefile
> @@ -28,5 +28,6 @@ obj-$(CONFIG_PCIE_UNIPHIER) += pcie-uniphier.o
>  # depending on whether ACPI, the DT driver, or both are enabled.
>  
>  ifdef CONFIG_PCI
> +obj-$(CONFIG_ARM64) += pcie-al.o
>  obj-$(CONFIG_ARM64) += pcie-hisi.o
>  endif
> diff --git a/drivers/pci/controller/dwc/pcie-al.c
> b/drivers/pci/controller/dwc/pcie-al.c
> new file mode 100644
> index ..65a9776c12be
> --- /dev/null
> +++ b/drivers/pci/controller/dwc/pcie-al.c
> @@ -0,0 +1,93 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * PCIe host controller driver for Amazon's Annapurna Labs IP (used
> in chips
> + * such as Graviton and Alpine)
> + *
> + * Copyright 2019 Amazon.com, Inc. or its affiliates. All Rights
> Reserved.
> + *
> + * Author: Jonathan Chocron 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include "../../pci.h"
> +
> +#if defined(CONFIG_ACPI) && defined(CONFIG_PCI_QUIRKS)
> +
> +struct al_pcie_acpi  {
> + void __iomem *dbi_base;
> +};
> +
> +static void __iomem *al_pcie_map_bus(struct pci_bus *bus, unsigned
> int devfn,
> +  int where)
> +{
> + struct

Re: [RFC PATCH] docs/memory-barriers.txt: Rewrite "KERNEL I/O BARRIER EFFECTS" section

2019-02-11 Thread Benjamin Herrenschmidt
On Mon, 2019-02-11 at 14:34 -0800, Linus Torvalds wrote:
> On Mon, Feb 11, 2019 at 9:30 AM Will Deacon  wrote:
> > +
> > + 1. All readX() and writeX() accesses to the same peripheral are 
> > ordered
> > +with respect to each other. For example, this ensures that MMIO 
> > register
> > +   writes by the CPU to a particular device will arrive in program 
> > order.
> 
> Hmm. I'd like more people look at strengthening this one wrt across
> CPUs and locking.
> 
> Right now we document mmiowb(), but that "documentation" is really
> just a fairy tale. Very *very* few drivers actually do mmiowb() on
> their own.
> 
> IOW, we should seriously just consider making the rule be that locking
> will order mmio too. Because that's practically the rule anyway.
> 
> Powerpc already does it. IO within a locked region will serialize with the 
> lock.

Yup. It's a bit ugly but I felt back then that getting drivers to use
mmiowb() properly was going to be a losing battle.

Cheers,
Ben.



Re: [PATCH v1 03/16] powerpc/32: move LOAD_MSR_KERNEL() into head_32.h and use it

2019-02-11 Thread Benjamin Herrenschmidt
On Mon, 2019-02-11 at 07:26 +0100, Christophe Leroy wrote:
> 
> Le 11/02/2019 à 01:21, Benjamin Herrenschmidt a écrit :
> > On Fri, 2019-02-08 at 12:52 +, Christophe Leroy wrote:
> > >   /*
> > > + * MSR_KERNEL is > 0x8000 on 4xx/Book-E since it includes MSR_CE.
> > > + */
> > > +.macro __LOAD_MSR_KERNEL r, x
> > > +.if \x >= 0x8000
> > > +   lis \r, (\x)@h
> > > +   ori \r, \r, (\x)@l
> > > +.else
> > > +   li \r, (\x)
> > > +.endif
> > > +.endm
> > > +#define LOAD_MSR_KERNEL(r, x) __LOAD_MSR_KERNEL r, x
> > > +
> > 
> > You changed the limit from >= 0x1 to >= 0x8000 without a
> > corresponding explanation as to why...
> 
> Yes, the existing LOAD_MSR_KERNEL() was buggy because 'li' takes a
> signed 16-bit immediate, ie between -0x8000 and 0x7fff.

Ah yes, I was only looking at the "large" case which is fine...

> By chance it was working because until now nobody was trying to set 
> MSR_KERNEL | MSR_EE.
> 
> Christophe



Re: [PATCH] powerpc: fix 32-bit KVM-PR lockup and panic with MacOS guest

2019-02-10 Thread Benjamin Herrenschmidt
On Fri, 2019-02-08 at 14:51 +, Mark Cave-Ayland wrote:
> 
> Indeed, but there are still some questions to be asked here:
> 
> 1) Why were these bits removed from the original bitmask in the first
> place without it being documented in the commit message?
> 
> 2) Is this the right fix? I'm told that MacOS guests already run
> without this patch on a G5 under 64-bit KVM-PR, which may suggest that
> this is a workaround for another bug elsewhere in the 32-bit powerpc
> code.
> 
> 
> If you think that these points don't matter, then I'm happy to resubmit
> the patch as-is based upon your comments above.

We should write a test case to verify that FE0/FE1 are properly
preserved/context-switched etc... I bet if we accidentally wiped them,
we wouldn't notice 99.9% of the time.

Cheers,
Ben.



Re: [PATCH v1 03/16] powerpc/32: move LOAD_MSR_KERNEL() into head_32.h and use it

2019-02-10 Thread Benjamin Herrenschmidt
On Fri, 2019-02-08 at 12:52 +, Christophe Leroy wrote:
>  /*
> + * MSR_KERNEL is > 0x8000 on 4xx/Book-E since it include MSR_CE.
> + */
> +.macro __LOAD_MSR_KERNEL r, x
> +.if \x >= 0x8000
> +   lis \r, (\x)@h
> +   ori \r, \r, (\x)@l
> +.else
> +   li \r, (\x)
> +.endif
> +.endm
> +#define LOAD_MSR_KERNEL(r, x) __LOAD_MSR_KERNEL r, x
> +

You changed the limit from >= 0x10000 to >= 0x8000 without a
corresponding explanation as to why...

Ben.



Re: [RFC PATCH] drm/ttm: force cached mappings for system RAM on ARM

2019-01-21 Thread Benjamin Herrenschmidt
On Wed, 2019-01-16 at 07:35 +, Koenig, Christian wrote:
> No, but you answer the wrong question.
> 
> See we don't want to have different mappings of cached and non-cached on 
> the CPU, but rather want to know if a snooped DMA from the PCIe counts 
> as cached access as well.
> 
> As far as I know on x86 it doesn't, so when you have an un-cached page 
> you can still access it with a snooping DMA read/write operation and 
> don't cause trouble.

Hrm... well, if you map it uncached on the CPU on powerpc, a snoop DMA
will work fine too, it won't hit any cache. The only problem I'm aware
of is a core (or CAPI device) emitting non-cached load/stores colliding
with a cache snooper.

> > The old hack of using non-cached mapping to avoid snoop cost in AGP and
> > others is just that ... an ugly and horrible hack that should have
> > never eventuated, when the search for performance pushes HW people into
> > utter insanity :)
> 
> Well I agree that un-cached system memory makes things much more 
> complicated for a questionable gain.
> 
> But fact is we now have to deal with the mess, so no point in 
> complaining about it to much :)

I wish we could just send the HW designers home and tell them we won't
support that crap... oh well.

Ben.

> Cheers,
> Christian.
> 
> > Cheers,
> > Ben.
> > 
> > 



Re: [RFC PATCH] drm/ttm: force cached mappings for system RAM on ARM

2019-01-16 Thread Benjamin Herrenschmidt
On Wed, 2019-01-16 at 08:47 +0100, Ard Biesheuvel wrote:
> > As far as I know on x86 it doesn't, so when you have an un-cached page
> > you can still access it with a snooping DMA read/write operation and
> > don't cause trouble.
> > 
> 
> I think it is the other way around. The question is, on an otherwise
> cache coherent device, whether the NoSnoop attribute set by the GPU
> propagates all the way to the bus so that it bypasses the caches.

On powerpc it's ignored, all DMA accesses will be snooped. But that's
fine regardless of whether the memory was mapped cachable or not, the
snooper will simply not find anything if not. I *think* we only do
cache inject if the line already exists in one of the caches.

> On x86, we can tolerate if this is not the case, since uncached memory
> accesses by the CPU snoop the caches as well.
> 
> On other architectures, uncached accesses go straight to main memory,
> so if the device wrote anything to the caches we won't see it.

Well, on all powerpc implementations that I am aware of at least (dunno
about ARM), they do, but we don't have a problem because I don't think
the devices can/will write to the caches directly unless a
corresponding line already exists (but I might be wrong, we need to
double check all implementations which is tricky).

I am not aware of any powerpc chip implementing NoSnoop.

> So to use this optimization, you have to either be 100% sure that
> NoSnoop is implemented correctly, or have a x86 CPU.
> 
> > > The old hack of using non-cached mapping to avoid snoop cost in AGP and
> > > others is just that ... an ugly and horrible hack that should have
> > > never eventuated, when the search for performance pushes HW people into
> > > utter insanity :)
> > 
> > Well I agree that un-cached system memory makes things much more
> > complicated for a questionable gain.
> > 
> > But fact is we now have to deal with the mess, so no point in
> > complaining about it to much :)
> > 
> 
> Indeed. I wonder if we should just disable it altogether unless CONFIG_X86=y

The question is whether DMA from a device can instantiate cache lines
in your system. This is a system-specific rather than architecture-
specific question, I suspect...
 
Cheers,
Ben.




Re: [RFC PATCH] drm/ttm: force cached mappings for system RAM on ARM

2019-01-15 Thread Benjamin Herrenschmidt
On Tue, 2019-01-15 at 22:31 +1100, Michael Ellerman wrote:
> > > As far as I know Power doesn't really support un-cached memory at all, 
> > > except for a very very old and odd configuration with AGP.
> > 
> > Hopefully Michael/Ben can elaborate here, but I was under the (possibly
> > mistaken) impression that mismatched attributes could cause a machine-check
> > on Power.
> 
> That's what I've always been told, but I can't actually find where it's
> documented, I'll keep searching.
> 
> But you're right that mixing cached / uncached is not really supported,
> and probably results in a machine check or worse.

 .. or worse :) It could checkstop.

It's also my understanding that on ARM v7 and above, it's technically
forbidden to map the same physical page with both cached and non-cached 
mappings, since the cached one could prefetch (or speculatively load),
thus creating collisions and inconsistencies. Am I wrong here ?

The old hack of using non-cached mapping to avoid snoop cost in AGP and
others is just that ... an ugly and horrible hack that should have
never eventuated, when the search for performance pushes HW people into
utter insanity :)

Cheers,
Ben.





Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-08 Thread Benjamin Herrenschmidt
On Wed, 2019-01-09 at 15:53 +1100, Alexey Kardashevskiy wrote:
> "A PCI completion timeout occurred for an outstanding PCI-E transaction"
> it is.
> 
> This is how I bind the device to vfio:
> 
> echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.0/driver_override'
> echo vfio-pci > '/sys/bus/pci/devices/0000:01:00.1/driver_override'
> echo '0000:01:00.0' > '/sys/bus/pci/devices/0000:01:00.0/driver/unbind'
> echo '0000:01:00.1' > '/sys/bus/pci/devices/0000:01:00.1/driver/unbind'
> echo '0000:01:00.0' > /sys/bus/pci/drivers/vfio-pci/bind
> echo '0000:01:00.1' > /sys/bus/pci/drivers/vfio-pci/bind
> 
> 
> and I noticed that EEH only happens with the last command. The order
> (.0,.1  or .1,.0) does not matter, it seems that putting one function to
> D3 is fine but putting another one when the first one is already in D3 -
> produces EEH. And I do not recall ever seeing this on the firestone
> machine. Weird.

Putting all functions into D3 is what allows the device to actually go
into D3.

Does it work with other devices ? We do have that bug on early P9
revisions where the attempt to bring the link to L1 as part of the
D3 process fails in horrible ways, I thought P8 would be ok but maybe
not ...

Otherwise, it might be that our timeouts are too low (you may want to
talk to our PCIe guys internally)

Cheers,
Ben.




Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-08 Thread Benjamin Herrenschmidt
On Wed, 2019-01-09 at 17:32 +1100, Alexey Kardashevskiy wrote:
> I have just moved the "Mellanox Technologies MT27700 Family
> [ConnectX-4]" from garrison to firestone machine and there it does not
> produce an EEH, with the same kernel and skiboot (both upstream + my
> debug). Hm. I cannot really blame the card but I cannot see what could
> cause the difference in skiboot either. I even tried disabling NPU so
> garrison would look like firestone, still EEH'ing.

The systems have a different chip though, firestone is P8 and garrison
is P8', which has a slightly different PHB revision. Worth checking if we
have anything significantly different in our inits and poke at the HW
guys.

BTW. Are the cards behind a switch in either case ?

Cheers,
Ben.




Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-08 Thread Benjamin Herrenschmidt
On Mon, 2019-01-07 at 21:01 -0700, Jason Gunthorpe wrote:
> 
> > In a very cryptic way that requires manual parsing using non-public
> > docs sadly but yes. From the look of it, it's a completion timeout.
> > 
> > Looks to me like we don't get a response to a config space access
> > during the change of D state. I don't know if it's the write of the D3
> > state itself or the read back though (it's probably detected on the
> > read back or a subsequent read, but that doesn't tell me which specific
> > one failed).
> 
> If it is just one card doing it (again, check you have latest
> firmware) I wonder if it is a sketchy PCI-E electrical link that is
> causing a long re-training cycle? Can you tell if the PCI-E link is
> permanently gone or does it eventually return?

No, it's 100% reproducible on systems with that specific card model,
not card instance, and maybe different systems/cards as well, I'll let
David & Alexey comment further on that.

> Does the card work in Gen 3 when it starts? Is there any indication of
> PCI-E link errors?

Nope.

> Everytime or sometimes?
> 
> POWER 8 firmware is good? If the link does eventually come back, is
> the POWER8's D3 resumption timeout long enough?
> 
> If this doesn't lead to an obvious conclusion you'll probably need to
> connect to IBM's Mellanox support team to get more information from
> the card side.

We are IBM :-) So far, it seems to be that the card is doing something
not quite right, but we don't know what. We might need to engage
Mellanox themselves.

Cheers,
Ben.




Re: [PATCH] PCI: Add no-D3 quirk for Mellanox ConnectX-[45]

2019-01-05 Thread Benjamin Herrenschmidt
On Sat, 2019-01-05 at 10:51 -0700, Jason Gunthorpe wrote:
> 
> > Interesting.  I've investigated this further, though I don't have as
> > many new clues as I'd like.  The problem occurs reliably, at least on
> > one particular type of machine (a POWER8 "Garrison" with ConnectX-4).
> > I don't yet know if it occurs with other machines, I'm having trouble
> > getting access to other machines with a suitable card.  I didn't
> > manage to reproduce it on a different POWER8 machine with a
> > ConnectX-5, but I don't know if it's the difference in machine or
> > difference in card revision that's important.
> 
> Make sure the card has the latest firmware is always good advice..
> 
> > So possibilities that occur to me:
> >   * It's something specific about how the vfio-pci driver uses D3
> > state - have you tried rebinding your device to vfio-pci?
> >   * It's something specific about POWER, either the kernel or the PCI
> > bridge hardware
> >   * It's something specific about this particular type of machine
> 
> Does the EEH indicate what happened to actually trigger it?

In a very cryptic way that requires manual parsing using non-public
docs sadly but yes. From the look of it, it's a completion timeout.

Looks to me like we don't get a response to a config space access
during the change of D state. I don't know if it's the write of the D3
state itself or the read back though (it's probably detected on the
read back or a subsequent read, but that doesn't tell me which specific
one failed).

Some extra logging in OPAL might help pin that down by checking the InA
error state in the config accessor after the config write (and polling
on it for a while as from a CPU perspective I don't know if the write is
synchronous, probably not).

Cheers,
Ben.




Re: [PATCH] fsi:fsi-sbefifo: Fix possible concurrency use-after-free bugs in sbefifo_user_release

2019-01-03 Thread Benjamin Herrenschmidt
On Fri, 2019-01-04 at 10:26 +0800, Jia-Ju Bai wrote:
> 
> On 2019/1/4 8:47, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-12-26 at 21:56 +0800, Jia-Ju Bai wrote:
> > > In drivers/fsi/fsi-sbefifo.c, the functions sbefifo_user_release(),
> > > sbefifo_user_read() and sbefifo_user_write() may be concurrently executed.
> > So after refreshing my mind, looking at the code and talking with Al, I
> > really don't see what race you are trying to fix here.
> > 
> > read/write should never be concurrent with release for a given file and
> > the stuff we are protecting here is local to the file instance.
> > 
> > Do you have an actual problem you observed ?
> > 
> 
> Thanks for the reply.
> 
> In fact, this report was found by a static tool I wrote,
> instead of by real execution.
> My tool found that in some drivers, for the structure "struct 
> file_operations", the code in intetrfaces ".read" , "write" and 
> ".release" are protected by the same lock.
> The functions kcs_bmc_read(), kcs_bmc_write() and kcs_bmc_release() are 
> examples.
> Thus, my tool inferred that the interfaces ".read", ".write" and
> ".release" of "struct file_operations" can be concurrently executed, and 
> generated this report.
> I manually checked this report, but I was not very sure of it, so I 
> marked it as a "possible bug" and reported it.

So what happens is that they cannot be executed concurrently for a
given struct file. But they can be for separate files.

In the fsi-sbefifo case, all of the data and the lock are part of a
private structure which is allocated in open() and thus is per-file
instance, so there should be no race.

In the example you gave, kcs_bmc.c, the data and lock are part of a
per-device (struct kcs_bmc) and thus shared by all file instances. So
in that case, the race does exist.
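
In (hypothetical) code, the safe pattern looks like this:

#include <linux/fs.h>
#include <linux/mutex.h>
#include <linux/slab.h>

/* Per-open state, as in fsi-sbefifo: each struct file gets its own
 * instance, so read/write cannot race release on it. */
struct foo_user {
	struct mutex lock;
	void *cmd_buf;
};

static int foo_open(struct inode *inode, struct file *file)
{
	struct foo_user *user = kzalloc(sizeof(*user), GFP_KERNEL);

	if (!user)
		return -ENOMEM;
	mutex_init(&user->lock);
	file->private_data = user;	/* private to this open */
	return 0;
}

/* If instead 'lock' lived in a per-device structure (the kcs_bmc
 * case), it would be shared by every open file, and release() on one
 * file could then race read()/write() on another. */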

>  From your message, now I know my report is false, and ".read" , "write" 
> cannot be concurrently executed with ".release" for a given file.
> Sorry for my false report, and thanks for your message.

Right, your tool is valuable as pre-screening but you need in addition
to check (probably manually) whether the data accessed (and lock) are
shared by multiple open file instances or are entirely local to a given
file instance.

Cheers,
Ben.



Re: [PATCH] fsi:fsi-sbefifo: Fix possible concurrency use-after-free bugs in sbefifo_user_release

2019-01-03 Thread Benjamin Herrenschmidt
On Wed, 2018-12-26 at 21:56 +0800, Jia-Ju Bai wrote:
> In drivers/fsi/fsi-sbefifo.c, the functions sbefifo_user_release(), 
> sbefifo_user_read() and sbefifo_user_write() may be concurrently executed.

So after refreshing my mind, looking at the code and talking with Al, I
really don't see what race you are trying to fix here.

read/write should never be concurrent with release for a given file and
the stuff we are protecting here is local to the file instance.

Do you have an actual problem you observed ?

Cheers,
Ben.

> sbefifo_user_release()
>   sbefifo_release_command()
> vfree(user->pending_cmd);
> 
> sbefifo_user_read()
>   mutex_lock();
>   rc = __sbefifo_submit(sbefifo, user->pending_cmd, ...);
> 
> sbefifo_user_write()
>   mutex_lock();
>   user->pending_cmd = user->cmd_page;
>   user->pending_cmd = vmalloc(len);
> 
> Thus, possible concurrency use-after-free bugs may occur in
> sbefifo_user_release().
> 
> To fix these bugs, the calls to mutex_lock() and mutex_unlock() are
> added in sbefifo_user_release().
> 
> 
> Signed-off-by: Jia-Ju Bai 
> ---
>  drivers/fsi/fsi-sbefifo.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/fsi/fsi-sbefifo.c b/drivers/fsi/fsi-sbefifo.c
> index d92f5b87c251..e278a9014b8f 100644
> --- a/drivers/fsi/fsi-sbefifo.c
> +++ b/drivers/fsi/fsi-sbefifo.c
> @@ -900,8 +900,10 @@ static int sbefifo_user_release(struct inode *inode, struct file *file)
>   if (!user)
>   return -EINVAL;
>  
> + mutex_lock(&user->file_lock);
>   sbefifo_release_command(user);
>   free_page((unsigned long)user->cmd_page);
> + mutex_unlock(&user->file_lock);
>   kfree(user);
>  
>   return 0;



Re: [PATCH] fsi:fsi-sbefifo: Fix possible concurrency use-after-free bugs in sbefifo_user_release

2019-01-02 Thread Benjamin Herrenschmidt
On Thu, 2019-01-03 at 14:27 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-02 at 09:34 +, David Howells wrote:
> > Jia-Ju Bai  wrote:
> > 
> > > + mutex_lock(&user->file_lock);
> > >   sbefifo_release_command(user);
> > >   free_page((unsigned long)user->cmd_page);
> > > + mutex_unlock(&user->file_lock);
> > 
> > It shouldn't be necessary to do the free_page() call inside the locked
> > section.
> 
> True. However, I didn't realize read/write could be concurrent with
> release so we have another problem.
> 
> I assume when release is called, no new read/write can be issued, am I
> correct ? So all we have to protect against is a read/write that has
> started prior to release being called, right ?

Hrm... looking briefly at the vfs, read/write are wrapped in
fdget/fdput, so release shouldn't happen concurrently or am I missing
something here ?

Cheers,
Ben.




Re: [PATCH] fsi:fsi-sbefifo: Fix possible concurrency use-after-free bugs in sbefifo_user_release

2019-01-02 Thread Benjamin Herrenschmidt
On Wed, 2019-01-02 at 09:34 +, David Howells wrote:
> Jia-Ju Bai  wrote:
> 
> > +   mutex_lock(&user->file_lock);
> > sbefifo_release_command(user);
> > free_page((unsigned long)user->cmd_page);
> > +   mutex_unlock(&user->file_lock);
> 
> It shouldn't be necessary to do the free_page() call inside the locked
> section.

True. However, I didn't realize read/write could be concurrent with
release so we have another problem.

I assume when release is called, no new read/write can be issued, am I
correct ? So all we have to protect against is a read/write that has
started prior to release being called, right ?

In that case, what can happen is that release() wins the race on the
mutex, frees everything, then read or write starts using freed stuff.

This is nasty to fix because the mutex is in the user structure,
so even looking at the mutex is racy if release is called.

The right fix would be, I think, for "user" (pointed to by
file->private_data) to be protected by a kref. That doesn't close it
completely as the free in release() can still lead to the structure
becoming re-used before read/write tries to get the kref but after it
has NULL checked the private data.

So to make that solid, I would also RCU-defer the actual freeing and
use RCU around dereferencing file->private_data
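
A minimal sketch of that scheme (hypothetical names, not actual
sbefifo code):

#include <linux/fs.h>
#include <linux/kref.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo_user {
	struct kref ref;
	struct rcu_head rcu;
	/* ... buffers, lock, etc. ... */
};

static void foo_user_free(struct kref *ref)
{
	struct foo_user *user = container_of(ref, struct foo_user, ref);

	kfree_rcu(user, rcu);		/* defer the free past RCU readers */
}

/* read()/write() side: take a reference or detect the lost race. */
static struct foo_user *foo_user_get(struct file *file)
{
	struct foo_user *user;

	rcu_read_lock();
	user = READ_ONCE(file->private_data);
	if (user && !kref_get_unless_zero(&user->ref))
		user = NULL;		/* release() already dropped it */
	rcu_read_unlock();
	return user;
}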

Now, I have yet to see other chardevs do any of the above, so does that mean
they are all hopelessly racy ?

Cheers,
Ben.



Re: [PATCH 16/33] powerpc/powernv: remove dead npu-dma code

2018-12-22 Thread Benjamin Herrenschmidt
On Mon, 2018-10-15 at 12:34 +1100, Alexey Kardashevskiy wrote:
> On 10/10/2018 00:24, Christoph Hellwig wrote:
> > This code has been unused since it was merged and is in the way of
> > cleaning up the DMA code, thus remove it.
> > 
> > This effectively reverts commit 5d2aa710 ("powerpc/powernv: Add support
> > for Nvlink NPUs").
> 
> 
> This code is heavily used by the NVIDIA GPU driver.

Some of it is, yes. And while I don't want to be involved in the
discussion about that specific can of worms, there is code in this file
related to the custom "always error" DMA ops that I suppose we could
remove, which is what is getting in the way of Christoph's cleanups. It's
just meant as debug stuff to catch incorrect attempts at doing the
dma mappings on the wrong "side" of the GPU.

Cheers,
Ben.




Re: [PATCH 2/5] dma-direct: add an explicit dma_direct_get_required_mask

2018-12-15 Thread Benjamin Herrenschmidt
On Thu, 2018-09-20 at 20:52 +0200, Christoph Hellwig wrote:
> This is somewhat modelled after the powerpc version, and differs from
> the legacy fallback in using fls64 instead of pointlessly splitting up the
> address into low and high dwords and in that it takes (__)phys_to_dma
> into account.

This looks like it will be usable if/when we switch powerpc to
dma/direct.c

Acked-by: Benjamin Herrenschmidt 
---
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/dma-direct.h |  1 +
>  kernel/dma/direct.c| 21 ++---
>  2 files changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/dma-direct.h b/include/linux/dma-direct.h
> index 86a59ba5a7f3..b79496d8c75b 100644
> --- a/include/linux/dma-direct.h
> +++ b/include/linux/dma-direct.h
> @@ -55,6 +55,7 @@ static inline void dma_mark_clean(void *addr, size_t size)
>  }
>  #endif /* CONFIG_ARCH_HAS_DMA_MARK_CLEAN */
>  
> +u64 dma_direct_get_required_mask(struct device *dev);
>  void *dma_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
>   gfp_t gfp, unsigned long attrs);
>  void dma_direct_free(struct device *dev, size_t size, void *cpu_addr,
> diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
> index c954f0a6dc62..81b73a5bba54 100644
> --- a/kernel/dma/direct.c
> +++ b/kernel/dma/direct.c
> @@ -53,11 +53,25 @@ check_addr(struct device *dev, dma_addr_t dma_addr, size_t size,
>   return true;
>  }
>  
> +static inline dma_addr_t phys_to_dma_direct(struct device *dev,
> + phys_addr_t phys)
> +{
> + if (force_dma_unencrypted())
> + return __phys_to_dma(dev, phys);
> + return phys_to_dma(dev, phys);
> +}
> +
> +u64 dma_direct_get_required_mask(struct device *dev)
> +{
> + u64 max_dma = phys_to_dma_direct(dev, (max_pfn - 1) << PAGE_SHIFT);
> +
> + return (1ULL << (fls64(max_dma) - 1)) * 2 - 1;
> +}
> +
>  static bool dma_coherent_ok(struct device *dev, phys_addr_t phys, size_t size)
>  {
> - dma_addr_t addr = force_dma_unencrypted() ?
> - __phys_to_dma(dev, phys) : phys_to_dma(dev, phys);
> - return addr + size - 1 <= dev->coherent_dma_mask;
> + return phys_to_dma_direct(dev, phys) + size - 1 <=
> + dev->coherent_dma_mask;
>  }
>  
>  void *dma_direct_alloc_pages(struct device *dev, size_t size,
> @@ -296,6 +310,7 @@ const struct dma_map_ops dma_direct_ops = {
>   .unmap_page = dma_direct_unmap_page,
>   .unmap_sg   = dma_direct_unmap_sg,
>  #endif
> + .get_required_mask  = dma_direct_get_required_mask,
>   .dma_supported  = dma_direct_supported,
>   .mapping_error  = dma_direct_mapping_error,
>   .cache_sync = arch_dma_cache_sync,
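
For reference, here is what that fls64 expression computes, redone as
stand-alone user-space code:

#include <stdint.h>
#include <stdio.h>

static int fls64_demo(uint64_t x)
{
	return x ? 64 - __builtin_clzll(x) : 0;
}

int main(void)
{
	/* Highest DMA address with 4 GiB of RAM mapped 1:1. */
	uint64_t max_dma = 0xffffffffULL;
	uint64_t mask = (1ULL << (fls64_demo(max_dma) - 1)) * 2 - 1;

	/* Prints 0xffffffff: a 32-bit mask is needed to reach all of RAM. */
	printf("required mask = 0x%llx\n", (unsigned long long)mask);
	return 0;
}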



Re: use generic DMA mapping code in powerpc V4

2018-12-11 Thread Benjamin Herrenschmidt
On Tue, 2018-12-11 at 19:17 +0100, Christian Zigotzky wrote:
> X5000 (P5020 board): U-Boot loads the kernel and the dtb file. Then the 
> kernel starts but it doesn't find any hard disks (partitions). That 
> means this is also the bad commit for the P5020 board.

What are the disks hanging off ? A PCIe device of some sort ?

Can you send good & bad dmesg logs ?

Ben.




Re: use generic DMA mapping code in powerpc V4

2018-12-10 Thread Benjamin Herrenschmidt
On Mon, 2018-12-10 at 20:33 +0100, Christoph Hellwig wrote:
> On Mon, Dec 10, 2018 at 05:04:46PM +, Rui Salvaterra wrote:
> > Hi, Christoph and Ben,
> > 
> > It just came to my mind (and this is most likely a stupid question,
> > but still)… Is there any possibility of these changes having an
> > (positive) effect on the long-standing problem of Power Mac machines
> > with AGP graphics cards (which have to be limited to PCI transfers,
> > otherwise they'll hang, due to coherence issues)? If so, I have a G4
> > machine where I'd gladly test them.
> 
> These patches themselves are not going to affect that directly.
> But IFF the problem really is that the AGP needs to be treated as not
> cache coherent (I have no idea if that is true) the generic direct
> mapping code has full support for a per-device coherent flag, so
> support for a non-coherent AGP slot could be implemented relatively
> simply.

AGP is a gigantic nightmare :-) It's not just cache coherency issues
(some implementations are coherent, some aren't, Apple's is ... weird).

Apple has all sorts of bugs, and Darwin source code only sheds light on
some of them. Some implementations can only read, not write, I think, for
example. There are issues with transfers crossing some boundaries I
believe, but it's all unclear.

Apple makes this work with a combination of hacks in the AGP "driver"
and the closed source GPU driver, which we don't see.

I have given up trying to make that stuff work reliably a decade ago :)

Cheers,
Ben.




Re: use generic DMA mapping code in powerpc V4

2018-12-08 Thread Benjamin Herrenschmidt
On Tue, 2018-11-27 at 08:42 +0100, Christoph Hellwig wrote:
> Any comments?  I'd like to at least get the ball moving on the easy
> bits.

So I had to clean up some dust, but it works on G5 with and without iommu
and 32-bit powermacs at least.

We're doing more tests, hopefully mpe can dig out some PASemi and
NXP/FSL HW as well. I'll try to review & ack the patches over the next
few days too.

Cheers,
Ben.

> On Wed, Nov 14, 2018 at 09:22:40AM +0100, Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series switches the powerpc port to use the generic swiotlb and
> > noncoherent dma ops, and to use more generic code for the coherent
> > direct mapping, as well as removing a lot of dead code.
> > 
> > As this series is very large and depends on the dma-mapping tree I've
> > also published a git tree:
> > 
> > git://git.infradead.org/users/hch/misc.git powerpc-dma.4
> > 
> > Gitweb:
> > 
> > 
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.4
> > 
> > Changes since v3:
> >  - rebase on the powerpc fixes tree
> >  - add a new patch to actually make the baseline amigaone config
> >configure without warnings
> >  - only use ZONE_DMA for 64-bit embedded CPUs, on pseries an IOMMU is
> >always present
> >  - fix compile in mem.c for one configuration
> >  - drop the full npu removal for now, will be resent separately
> >  - a few git bisection fixes
> > 
> > The changes since v1 are too big to list and v2 was not posted in public.
> > 
> > ___
> > iommu mailing list
> > io...@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/iommu
> ---end quoted text---



Re: use generic DMA mapping code in powerpc V4

2018-12-08 Thread Benjamin Herrenschmidt
On Tue, 2018-11-27 at 08:42 +0100, Christoph Hellwig wrote:
> Any comments?  I'd like to at least get the ball moving on the easy
> bits.

I completely missed your posting of V4 ! I was wondering what was
taking you so long :)

I'll give it a spin & send acks over the next 2 or 3 days.

Cheers,
Ben.

> On Wed, Nov 14, 2018 at 09:22:40AM +0100, Christoph Hellwig wrote:
> > Hi all,
> > 
> > this series switches the powerpc port to use the generic swiotlb and
> > noncoherent dma ops, and to use more generic code for the coherent
> > direct mapping, as well as removing a lot of dead code.
> > 
> > As this series is very large and depends on the dma-mapping tree I've
> > also published a git tree:
> > 
> > git://git.infradead.org/users/hch/misc.git powerpc-dma.4
> > 
> > Gitweb:
> > 
> > 
> > http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/powerpc-dma.4
> > 
> > Changes since v3:
> >  - rebase on the powerpc fixes tree
> >  - add a new patch to actually make the baseline amigaone config
> >configure without warnings
> >  - only use ZONE_DMA for 64-bit embedded CPUs, on pseries an IOMMU is
> >always present
> >  - fix compile in mem.c for one configuration
> >  - drop the full npu removal for now, will be resent separately
> >  - a few git bisection fixes
> > 
> > The changes since v1 are too big to list and v2 was not posted in public.
> > 
> > ___
> > iommu mailing list
> > io...@lists.linux-foundation.org
> > https://lists.linuxfoundation.org/mailman/listinfo/iommu
> ---end quoted text---



[GIT PULL] FSI fixes for 4.20

2018-11-25 Thread Benjamin Herrenschmidt
Hi Greg !

Here are two very minor fixes for FSI. One from Arnd is a Kconfig fixup
and has been rusting away in my tree for a while (I had forgotten about
it). The other one just removes a duplicate #include, courtesy of
Brajeswar Ghosh.

The following changes since commit 2e6e902d185027f8e3cb8b7305238f7e35d6a436:

  Linux 4.20-rc4 (2018-11-25 14:19:31 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/linux-fsi.git tags/fsi-updates-2018-11-26

for you to fetch changes up to d20810530b7109a95abef5130e6dcec09c5180d7:

  fsi: fsi-scom.c: Remove duplicate header (2018-11-26 10:13:04 +1100)


Arnd Bergmann (1):
  fsi: master-ast-cf: select GENERIC_ALLOCATOR

Brajeswar Ghosh (1):
  fsi: fsi-scom.c: Remove duplicate header

 drivers/fsi/Kconfig| 1 +
 drivers/fsi/fsi-scom.c | 1 -
 2 files changed, 1 insertion(+), 1 deletion(-)



Re: [PATCH] drivers/fsi/fsi-scom.c: Remove duplicate header

2018-11-25 Thread Benjamin Herrenschmidt
On Sat, 2018-11-24 at 13:51 +0530, Brajeswar Ghosh wrote:
> On Fri, Nov 16, 2018 at 4:17 PM Brajeswar Ghosh
>  wrote:
> > Remove linux/cdev.h which is included more than once
> > 
> > Signed-off-by: Brajeswar Ghosh 
> 
> Any comment on this patch?

Ah sorry, I missed it. Yeah it's fine. I will send it to Greg.

> > ---
> >  drivers/fsi/fsi-scom.c | 1 -
> >  1 file changed, 1 deletion(-)
> > 
> > diff --git a/drivers/fsi/fsi-scom.c b/drivers/fsi/fsi-scom.c
> > index df94021dd9d1..81dc01ac2351 100644
> > --- a/drivers/fsi/fsi-scom.c
> > +++ b/drivers/fsi/fsi-scom.c
> > @@ -20,7 +20,6 @@
> >  #include 
> >  #include 
> >  #include 
> > -#include 
> >  #include 
> > 
> >  #include 
> > --
> > 2.17.1
> > 



Re: [[PATCH] 8/9] DMA-UART-Driver-for-AST2500

2018-10-18 Thread Benjamin Herrenschmidt
On Thu, 2018-10-18 at 15:25 +0530, Vinod wrote:
> 
> > It's not a dmaengine driver. It's a serial UART driver that happens to
> > use a dedicated DMA engine.
> 
> Then I see no reason for it to use dmaengine APIs. The framework allows
> people to share a controller for many clients, but if you have dedicated
> one then you may use it directly

Well... the engine is shared by a few UARTs; they have dedicated rings
but there's a common set of regs for interrupt handling etc.

That said, I still think it could be contained within a UART driver,
there's little benefit in adding the framework overhead, esp since
these are really weak cores, any overhead will be felt.

Ben.

> > It's unclear whether it should be split into two drivers, or just have
> > the serial driver directly use the dma engine since that engine is
> > dedicated in HW to only work on those UARTs and nothing else...
> > 
> > Cheers,
> > Ben.
> > 
> > 
> > > While doing resubmission please take some time to understand subsystem
> > > tags to use. (hint git log  will tell you)
> > > 
> > > Also series has [[PATCH] 8/9] whereas it should be [PATCH 8/9] please
> > > let git generate that for you (hint git format-patch start..end does a
> > > good job)
> > > 
> > > > @@ -0,0 +1,1594 @@
> > > > +// SPDX-License-Identifier: GPL-2.0
> > > > +/*
> > > > + *   drivers/tty/serial/8250/8250_aspeed_uart_dma.c
> > > > + *1. 2018/07/01 Shivah Shankar created
> > > > + *2. 2018/08/25 sudheer.veliseti modified
> > > 
> > > we don't use this log in the kernel. I do not see s-o-b by Shivah, that
> > > should be added. I think he should be author and you need to list
> > > changes you did..
> > > 
> 
> 



Re: [[PATCH] 8/9] DMA-UART-Driver-for-AST2500

2018-10-17 Thread Benjamin Herrenschmidt
On Wed, 2018-10-17 at 11:35 +0530, Vinod wrote:
> On 17-10-18, 09:41, sudheer.v wrote:
> 
> Please add the change log describing the driver and its features
> 
> > Signed-off-by: sudheer.v 
> 
> 
> > ---
> >  drivers/tty/serial/8250/8250_aspeed_uart_dma.c | 1594 
> > 
> >  1 file changed, 1594 insertions(+)
> >  create mode 100644 drivers/tty/serial/8250/8250_aspeed_uart_dma.c
> > 
> > diff --git a/drivers/tty/serial/8250/8250_aspeed_uart_dma.c b/drivers/tty/serial/8250/8250_aspeed_uart_dma.c
> > new file mode 100644
> > index 0000000..e1019a8
> > --- /dev/null
> > +++ b/drivers/tty/serial/8250/8250_aspeed_uart_dma.c
> 
> why is this in serial. It is dmaengine driver so belongs to drivers/dma/
> like other controllers. Please move it out and resubmit.
 
It's not a dmaengine driver. It's a serial UART driver that happens to
use a dedicated DMA engine.

It's unclear whether it should be split into two drivers, or just have
the serial driver directly use the dma engine since that engine is
dedicated in HW to only work on those UARTs and nothing else...

Cheers,
Ben.


> While doing resubmission please take some time to understand subsystem
> tags to use. (hint git log  will tell you)
> 
> Also series has [[PATCH] 8/9] whereas it should be [PATCH 8/9] please
> let git generate that for you (hint git format-patch start..end does a
> good job)
> 
> > @@ -0,0 +1,1594 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + *   drivers/tty/serial/8250/8250_aspeed_uart_dma.c
> > + *1. 2018/07/01 Shivah Shankar created
> > + *2. 2018/08/25 sudheer.veliseti modified
> 
> we don't use this log in the kernel. I do not see s-o-b by Shivah, that
> should be added. I think he should be author and you need to list
> changes you did..
> 



Re: [PATCH 3/5] dma-direct: refine dma_direct_alloc zone selection

2018-09-27 Thread Benjamin Herrenschmidt
On Thu, 2018-09-27 at 15:49 +0200, Christoph Hellwig wrote:
> On Thu, Sep 27, 2018 at 11:45:15AM +1000, Benjamin Herrenschmidt wrote:
> > I'm not sure this is entirely right.
> > 
> > Let's say the mask is 30 bits. You will return GFP_DMA32, which will
> > fail if you allocate something above 1G (which is legit for
> > ZONE_DMA32).
> 
> And then we will try GFP_DMA further down in the function:
> 
>   if (IS_ENABLED(CONFIG_ZONE_DMA) &&
>   dev->coherent_dma_mask < DMA_BIT_MASK(32) &&
>   !(gfp & GFP_DMA)) {
>   gfp = (gfp & ~GFP_DMA32) | GFP_DMA;
>   goto again;
>   }
> 
> This is and old optimization from x86, because chances are high that
> GFP_DMA32 will give you suitable memory for the infamous 31-bit
> dma mask devices (at least at boot time) and thus we don't have
> to deplete the tiny ZONE_DMA pool.

I see, it's rather confusing :-) Wouldn't it be better to check against
top of 32-bit memory instead here too ?
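
I.e. something like this hypothetical helper (the boundary arguments
are made up):

#include <linux/gfp.h>
#include <linux/types.h>

/* Pick the zone from the coherent mask by comparing against actual
 * memory boundaries rather than fixed bit widths. */
static gfp_t zone_for_coherent_mask(u64 mask, u64 ram_top, u64 dma32_top)
{
	if (mask >= ram_top - 1)
		return 0;		/* no zone restriction needed */
	if (mask >= dma32_top - 1)
		return GFP_DMA32;	/* below the top of 32-bit memory */
	return GFP_DMA;
}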


Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 19:52 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 7:47 PM Linus Torvalds
>  wrote:
> > 
> > The comment actually does talk about it, although the comment also
> > claims that the cs read would use load_unaligned_zeropad(), which it
> > no longer does (now it only does the read_word_at_a_time).
> 
> IOW, look at commit 12f8ad4b0533 ("vfs: clean up __d_lookup_rcu() and
> dentry_cmp() interfaces") for why the zeropad went away for the cs
> access (but the comment wasn't updated).
> 
> And then bfe7aa6c39b1 ("fs/dcache: Use read_word_at_a_time() in
> dentry_string_cmp()") did the "let's make KASAN happy" thing.
> 
> And yes, the word-at-a-time code actually matters a lot for certain
> loads. The "copy-and-hash" thing for path components ends up being
> pretty critical in all the pathname handling.

Yup, makes sense.

Thanks !

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 19:10 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 7:01 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Still, I can potentially see an issue with DEBUG_PAGEALLOC
> 
> An unmapped page isn't a problem. That's what the whole
> load_unaligned_zeropad() is about: it's ok to take a fault on the part
> that crosses a page, and we'll just fill the value with zeroes (that's
> the "zeropad" part).

Ah, my bad reading, I was looking at read_word_at_a_time() instead of
load_unaligned_zeropad(). I'm not familiar enough with the dentry qstr
stuff, I assume this is safe ?

> So as long as it's rare (and it is), it's all fine.
> 
> That said, I think we turn off for DEBUG_PAGEALLOC simply because it's
> not rare _enough_.
> 
> And vmalloc() should actually be safe too, simply because I think we
> strive for a guard page between vmalloc areas.
> 
> So only a *mapped* page after the page that matters, and only if it's
> something you can't read without side effects.
> 
> Which basically doesn't happen on x86 in reality. BIOSes just don't
> put MMIO right after the last page of RAM. I think this is why it only
> triggered on Xen, due to some crazy "Xen reacts badly" case where we
> do the speculation into a balloon address.
> 
> So _practically_ this is just a Xen bug, nothing more.
> 
> But since in _theory_ you could have MMIO abut regular RAM directly,
> it's worth maybe making sure it's purely theory.




Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote:
> [ Adding a few new people to the cc.
> 
>   The issue is the worry about software-speculative accesses (ie
> things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw
> speculation now) accessing past RAM into possibly contiguous IO ]
> 
> On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds
>  wrote:
> > 
> > If you have a machine with RAM that touches IO, you need to disable
> > the last page, exactly the same way we disable and marked reserved the
> > first page at zero.

So I missed the departure of that train ... stupid question, with
CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can),
what prevents it from crossing into a non-mapped page (not even IO) and
causing an oops ? Looking at a random user in fs/dcache.c, it's not a
uaccess-style read with recovery. Or am I missing something obvious
here ?

IE, should we "reserve" the last page of any memory region (maybe mark
it read-only) to avoid this along with avoiding leakage into IO space ?
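
To illustrate the hazard in user space (this deliberately faults,
since there is no fixup like the kernel's load_unaligned_zeropad()
exception handler):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	char *map = mmap(NULL, 2 * pg, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (map == MAP_FAILED)
		return 1;
	mprotect(map + pg, pg, PROT_NONE);	/* inaccessible second page */

	char *s = map + pg - 3;			/* "hi\0" ends at the page end */
	memcpy(s, "hi", 3);

	uint64_t word;
	memcpy(&word, s, sizeof(word));		/* 5 bytes fall in the bad page: SIGSEGV */
	printf("0x%016llx\n", (unsigned long long)word);
	return 0;
}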

> > I thought we already did that.
> 
> We don't seem to do that.
> 
> And it's not just the last page, it's _any_ last page in a region that
> bumps up to IO. That's actually much more common in the low 4G area on
> PC's, I suspect, although the reserved BIOS ranges always tend to be
> there.

What makes IO more "wrong" than oopsing due to the page not being
mapped ?

> I suspect it should be trivial to do - maybe in
> e820__memblock_setup()? That's where we already trim partial pages
> etc.
>
> In fact, I think this might be done as an extension of commit
> 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
> memblock.reserved"), except making sure that non-RAM regions mark one
> page _previous_ as reserved too.
> 
> I assume memory hotplug might have the same issue, and checking
> whether ARM64 and powerpc perhaps might have already done something
> like this (or might need to add it).
> 
> We discussed long ago the case of user space mapping IO in user space,
> and decided we didn't care. But the kernel should probably explicitly
> make sure we don't either, even if I can't recall having ever seen a
> machine that actually maps IO contiguously to RAM. The layout always
> tends to end up having holes anyway.

Can't we put the safety in generic memblock ? IE, don't hand out an
allocation that contains the last page of a "block" and handle that last
page in the memblock->buddy transition rather than in arch specific
code ?

Cheers,
Ben.




Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:42 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:38 PM Linus Torvalds
>  wrote:
> > It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> Just to clarify .. that's true of the dcache stuff.
> 
> The strscpy case actually explicitly limits things to page boundaries
> and falls back to the byte-by-byte case after that.

Ah ok, that makes sense.

Still, I can potentially see an issue with DEBUG_PAGEALLOC

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Mon, 2018-09-03 at 10:48 +1000, Benjamin Herrenschmidt wrote:
> On Sat, 2018-09-01 at 11:06 -0700, Linus Torvalds wrote:
> > [ Adding a few new people to the cc.
> > 
> >   The issue is the worry about software-speculative accesses (ie
> > things like CONFIG_DCACHE_WORD_ACCESS - not talking about the hw
> > speculation now) accessing past RAM into possibly contiguous IO ]
> > 
> > On Sat, Sep 1, 2018 at 10:27 AM Linus Torvalds
> >  wrote:
> > > 
> > > If you have a machine with RAM that touches IO, you need to disable
> > > the last page, exactly the same way we disable and marked reserved the
> > > first page at zero.
> 
> So I missed the departure of that train ... stupid question, with
> CONFIG_DCACHE_WORD_ACCESS, if that can be unaligned (I assume it can),
> what prevents it from crossing into a non-mapped page (not even IO) and
> causing an oops ? Looking at a random user in fs/dcache.c, it's not a
> uaccess-style read with recovery. Or am I missing something obvious
> here ?

Also, if we cross page boundaries with those guys then we have a bigger
problem, no ? We could fall off a vmalloc page into the nether or into
an ioremap mapping no ?

Cheers,
Ben. 



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Also, if we cross page boundaries with those guys then we have a bigger
> > problem, no ? We could fall off a vmalloc page into the nether or into
> > an ioremap mapping no ?
> 
> It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> So it can cross pages, and it can fall off the end of memory, but it
> can't do random stuff.

Actually what about DEBUG_PAGEALLOC ? Can't we fall off a page and
fault that way too ?

Cheers,
Ben.



Re: Access to non-RAM pages

2018-09-02 Thread Benjamin Herrenschmidt
On Sun, 2018-09-02 at 18:38 -0700, Linus Torvalds wrote:
> On Sun, Sep 2, 2018 at 6:32 PM Benjamin Herrenschmidt
>  wrote:
> > 
> > Also, if we cross page boundaries with those guys then we have a bigger
> > problem, no ? We could fall off a vmalloc page into the nether or into
> > an ioremap mapping no ?
> 
> It's not used for vmalloc stuff. It's just regular kmalloc().
> 
> So it can cross pages, and it can fall off the end of memory, but it
> can't do random stuff.

Ok, it might be worth adding a DEBUG_VM based (or similar) warning in
case somebody ever thinks of passing a vmalloc pointer to it...
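
Something like this (hypothetical wrapper):

#include <linux/compiler.h>
#include <linux/mm.h>
#include <linux/mmdebug.h>

/* Warn (under DEBUG_VM) if a vmalloc address sneaks in. */
static inline unsigned long read_word_checked(const void *addr)
{
	VM_WARN_ON_ONCE(is_vmalloc_addr(addr));
	return read_word_at_a_time(addr);
}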

As for falling out of the end of memory, yes it could be a real problem
though I don't see why IO is any different than just hitting a non-
mapped area in that regard. So we should probably keep an unused
(readonly if possible) zero page at the end.
 
Cheers,
Ben.



Re: v4.17 regression: PowerMac G3 won't boot, was Re: [PATCH v5 1/3] of: cache phandle nodes to reduce cost of of_find_node_by_phandle()

2018-08-30 Thread Benjamin Herrenschmidt
On Fri, 2018-08-31 at 14:35 +1000, Benjamin Herrenschmidt wrote:
> 
> > If I force output with "-f", the resulting file has no occurrences 
> > of "phandle".
> 
> Are you booting with BootX or Open Firmware ?

Assuming you are using BootX (or miBoot), can you try this patch ?

--- a/arch/powerpc/platforms/powermac/bootx_init.c
+++ b/arch/powerpc/platforms/powermac/bootx_init.c
@@ -37,6 +37,7 @@ static unsigned long __initdata bootx_dt_strend;
 static unsigned long __initdata bootx_node_chosen;
 static boot_infos_t * __initdata bootx_info;
 static char __initdata bootx_disp_path[256];
+static int __initdata bootx_phandle;
 
 /* Is boot-info compatible ? */
 #define BOOT_INFO_IS_COMPATIBLE(bi) \
@@ -258,6 +259,8 @@ static void __init bootx_scan_dt_build_strings(unsigned long base,
namep = pp->name ? (char *)(base + pp->name) : NULL;
if (namep == NULL || strcmp(namep, "name") == 0)
goto next;
+	if (!strcmp(namep, "phandle") || !strcmp(namep, "linux,phandle"))
+   bootx_phandle = -1;
/* get/create string entry */
soff = bootx_dt_find_string(namep);
if (soff == 0)
@@ -310,6 +313,7 @@ static void __init bootx_scan_dt_build_struct(unsigned long base,
*mem_end = _ALIGN_UP((unsigned long)lp + 1, 4);
 
/* get and store all properties */
+   has_phandle = false;
while (*ppp) {
struct bootx_dt_prop *pp =
(struct bootx_dt_prop *)(base + *ppp);
@@ -330,6 +334,12 @@ static void __init bootx_scan_dt_build_struct(unsigned long base,
ppp = &pp->next;
}
 
+   /* add a phandle */
+   if (bootx_phandle > 0) {
+   bootx_dt_add_prop("phandle", &bootx_phandle, 4, mem_end);
+   bootx_phandle++;
+   }
+
if (node == bootx_node_chosen) {
bootx_add_chosen_props(base, mem_end);
if (bootx_info->dispDeviceRegEntryOffset == 0)
@@ -385,6 +395,8 @@ static unsigned long __init bootx_flatten_dt(unsigned long start)
bootx_dt_add_string("linux,bootx-height", &mem_end);
bootx_dt_add_string("linux,bootx-linebytes", &mem_end);
bootx_dt_add_string("linux,bootx-addr", &mem_end);
+   if (bootx_phandle > 0)
+   bootx_dt_add_string("phandle", &mem_end);
/* Wrap up strings */
hdr->off_dt_strings = bootx_dt_strbase - mem_start;
hdr->dt_strings_size = bootx_dt_strend - bootx_dt_strbase;
@@ -482,6 +494,7 @@ void __init bootx_init(unsigned long r3, unsigned long r4)
bootx_dt_strbase = bootx_dt_strend = 0;
bootx_node_chosen = 0;
bootx_disp_path[0] = 0;
+   bootx_phandle = 1;
 
if (!BOOT_INFO_IS_V2_COMPATIBLE(bi))
bi->logicalDisplayBase = bi->dispDeviceBase;



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-30 Thread Benjamin Herrenschmidt
On Mon, 2018-08-27 at 19:02 +1000, Nicholas Piggin wrote:
> > More tlbies ? With the cost of the broadcasts on the fabric ? I don't
> > think so.. or I'm not understanding your point...
> 
> More tlbies are no good, but there will be some places where it works
> out much better (and fewer tlbies). Worst possible case for current code
> is a big unmap with lots of scattered page sizes. We _should_ get that
> with just a single PID flush at the end, but what we will get today is
> a bunch of PID and VA flushes.
> 
> I don't propose doing that though, I'd rather be explicit about
> tracking start and end range of each page size. Still not "optimal"
> but neither is existing single range for sparse mappings... anyway it
> will need to be profiled, but my point is we don't really fit exactly
> what x86/arm want.

If we have an arch specific part, we could just remember up to N
"large" pages there without actually flushing, and if that overflows,
upgrade to a full flush.
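
Roughly (all names invented):

#include <linux/types.h>

#define MAX_LARGE_BATCH	8

struct hpage_batch {
	unsigned long va[MAX_LARGE_BATCH];
	unsigned int nr;
	bool overflow;			/* upgrade to a full PID flush */
};

static void batch_large_page(struct hpage_batch *b, unsigned long va)
{
	if (b->overflow)
		return;			/* already doing the full flush */
	if (b->nr == MAX_LARGE_BATCH) {
		b->overflow = true;	/* too many: flush the whole PID */
		return;
	}
	b->va[b->nr++] = va;
}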

Cheers,
Ben.



Re: [PATCH 2/4] tty: Hold tty_ldisc_lock() during tty_reopen()

2018-08-29 Thread Benjamin Herrenschmidt
On Wed, 2018-08-29 at 13:34 +0900, Sergey Senozhatsky wrote:
> Hi,
> 
> Cc-ing Benjamin on this.
> 
> On (08/29/18 03:23), Dmitry Safonov wrote:
> > BUG: unable to handle kernel paging request at 2260
> > IP: [..] n_tty_receive_buf_common+0x5f/0x86d
> > Workqueue: events_unbound flush_to_ldisc
> > Call Trace:
> >  [..] n_tty_receive_buf2
> >  [..] tty_ldisc_receive_buf
> >  [..] flush_to_ldisc
> >  [..] process_one_work
> >  [..] worker_thread
> >  [..] kthread
> >  [..] ret_from_fork
> 
> Seems that you are not the first one to hit this NULL deref.
> 
> > I think, tty_ldisc_reinit() should be called with ldisc_sem hold for
> > writing, which will protect any reader against line discipline changes.
> 
> Per https://lore.kernel.org/patchwork/patch/777220/
> 
> : Note that we noticed one path that called reinit without the ldisc lock
> : held for writing, we added that, but it didn't fix the problem.
> 
> And I guess that Ben meant the same reinit path which you patched:

This is too old for me to remember buit yes, there definitely was a bug
there...

> > @@ -1267,15 +1267,20 @@ static int tty_reopen(struct tty_struct *tty)
> > if (test_bit(TTY_EXCLUSIVE, &tty->flags) && !capable(CAP_SYS_ADMIN))
> > return -EBUSY;
> >  
> > -   tty->count++;
> > +   retval = tty_ldisc_lock(tty, 5 * HZ);
> > +   if (retval)
> > +   return retval;
> >  
> > +   tty->count++;
> > if (tty->ldisc)
> > -   return 0;
> > +   goto out_unlock;
> >  
> > retval = tty_ldisc_reinit(tty, tty->termios.c_line);
> > if (retval)
> > tty->count--;
> >  
> > +out_unlock:
> > +   tty_ldisc_unlock(tty);
> > return retval;
> >  }
> 
>   -ss



Re: [PATCH 3/4] mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE

2018-08-27 Thread Benjamin Herrenschmidt
On Mon, 2018-08-27 at 18:04 +1000, Nicholas Piggin wrote:
> > Yes.. I see that. tlb_remove_check_page_size_change() really is a rather
> > ugly thing, it can cause loads of TLB flushes. Do you really _have_ to
> > do that? The way ARM and x86 work is that using INVLPG in a 4K stride is
> > still correct for huge pages, inefficient maybe, but so is flushing
> every other page because 'sparse' transparent-huge-pages.
> 
> It could do that. It requires a tlbie that matches the page size,
> so it means 3 sizes. I think possibly even that would be better
> than current code, but we could do better if we had a few specific
> fields in there.

More tlbies ? With the cost of the broadcasts on the fabric ? I don't
think so.. or I'm not understanding your point...

Sadly our architecture requires a precise match between the page size
specified in the tlbie instruction and the entry in the TLB or it won't
be flushed.

Ben.


