Re: [PATCH 5/5] powerpc: use the generic dma_ops_bypass mode

2020-08-30 Thread Christoph Hellwig
On Sun, Aug 30, 2020 at 11:04:21AM +0200, Cédric Le Goater wrote:
> Hello,
> 
> On 7/8/20 5:24 PM, Christoph Hellwig wrote:
> > Use the DMA API bypass mechanism for direct window mappings.  This uses
> > common code and speeds up the direct mapping case by avoiding indirect
> > calls when not using DMA ops at all.  It also fixes a problem where
> > the sync_* methods were using the bypass check for DMA allocations, but
> > those are part of the streaming ops.
> > 
> > Note that this patch loses the DMA_ATTR_WEAK_ORDERING override, which
> > has never been well defined, and is only used by a few drivers, which
> > IIRC never showed up in the typical Cell blade setups that are affected
> > by the ordering workaround.
> > 
> > Fixes: efd176a04bef ("powerpc/pseries/dma: Allow SWIOTLB")
> > Signed-off-by: Christoph Hellwig 
> > ---
> >  arch/powerpc/Kconfig  |  1 +
> >  arch/powerpc/include/asm/device.h |  5 --
> >  arch/powerpc/kernel/dma-iommu.c   | 90 ---
> >  3 files changed, 10 insertions(+), 86 deletions(-)
> 
> I am seeing corruptions on a couple of POWER9 systems (boston) when
> stressed with IO. stress-ng gives some results but I have first seen
> it when compiling the kernel in a guest and this is still the best way
> to raise the issue.
> 
> These systems have an Adaptec SAS controller:
> 
>   0003:01:00.0 Serial Attached SCSI controller: Adaptec Series 8 12G SAS/PCIe 3 (rev 01)
> 
> When the failure occurs, the POWERPC EEH interrupt fires and dumps
> lowlevel PHB4 registers among which :
> 
>   [ 2179.251069490,3] PHB#0003[0:3]:   phbErrorStatus = 0280
>   [ 2179.251117476,3] PHB#0003[0:3]:  phbFirstErrorStatus = 0200
> 
> The bits raised identify a PPC 'TCE' error, which means it is related
> to DMAs. See below for more details.
> 
> 
> Reverting this patch "fixes" the issue, but the problem is probably
> elsewhere, in some other layer or in the aacraid driver. How should I
> proceed to get more information?

The aacraid DMA masks look like a mess.  Can you try the hack
below and see if it helps?

diff --git a/drivers/scsi/aacraid/aachba.c b/drivers/scsi/aacraid/aachba.c
index 769af4ca9ca97e..79c6b744dbb66c 100644
--- a/drivers/scsi/aacraid/aachba.c
+++ b/drivers/scsi/aacraid/aachba.c
@@ -2228,18 +2228,6 @@ int aac_get_adapter_info(struct aac_dev* dev)
expose_physicals = 0;
}
 
-   if (dev->dac_support) {
-   if (!pci_set_dma_mask(dev->pdev, DMA_BIT_MASK(64))) {
-   if (!dev->in_reset)
-   dev_info(&dev->pdev->dev, "64 Bit DAC enabled\n");
-   } else if (!pci_set_dma_mask(dev->pdev, DMA_BIT_MASK(32))) {
-   dev_info(&dev->pdev->dev, "DMA mask set failed, 64 Bit DAC disabled\n");
-   dev->dac_support = 0;
-   } else {
-   dev_info(&dev->pdev->dev, "No suitable DMA available\n");
-   rcode = -ENOMEM;
-   }
-   }
/*
 * Deal with configuring for the individualized limits of each packet
 * interface.
diff --git a/drivers/scsi/aacraid/commsup.c b/drivers/scsi/aacraid/commsup.c
index adbdc3b7c7a706..dbb23b351a4e7d 100644
--- a/drivers/scsi/aacraid/commsup.c
+++ b/drivers/scsi/aacraid/commsup.c
@@ -1479,7 +1479,6 @@ static int _aac_reset_adapter(struct aac_dev *aac, int forced, u8 reset_type)
struct Scsi_Host *host = aac->scsi_host_ptr;
int jafo = 0;
int bled;
-   u64 dmamask;
int num_of_fibs = 0;
 
/*
@@ -1558,22 +1557,7 @@ static int _aac_reset_adapter(struct aac_dev *aac, int forced, u8 reset_type)
kfree(aac->fsa_dev);
aac->fsa_dev = NULL;
 
-   dmamask = DMA_BIT_MASK(32);
quirks = aac_get_driver_ident(index)->quirks;
-   if (quirks & AAC_QUIRK_31BIT)
-   retval = pci_set_dma_mask(aac->pdev, dmamask);
-   else if (!(quirks & AAC_QUIRK_SRC))
-   retval = pci_set_dma_mask(aac->pdev, dmamask);
-   else
-   retval = pci_set_consistent_dma_mask(aac->pdev, dmamask);
-
-   if (quirks & AAC_QUIRK_31BIT && !retval) {
-   dmamask = DMA_BIT_MASK(31);
-   retval = pci_set_consistent_dma_mask(aac->pdev, dmamask);
-   }
-
-   if (retval)
-   goto out;
 
if ((retval = (*(aac_get_driver_ident(index)->init))(aac)))
goto out;
diff --git a/drivers/scsi/aacraid/linit.c b/drivers/scsi/aacraid/linit.c
index 8588da0a065551..d897a9d59e24a1 100644
--- a/drivers/scsi/aacraid/linit.c
+++ b/drivers/scsi/aacraid/linit.c
@@ -1634,8 +1634,6 @@ static int aac_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
struct list_head *insert = &aac_devices;
int error;
int unique_id = 0;
-   u64 dmamask;
-   int mask_bits = 0;
extern int aac_sync_
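For reference, the 64-bit-first, 32-bit-fallback negotiation that the removed hunks implemented can be sketched in standalone C. This is only an illustration of the pattern being deleted; the toy `set_mask()` stands in for `pci_set_dma_mask()` and `dev_limit` for the device's addressing capability, neither of which is the driver's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Toy stand-in for pci_set_dma_mask(): succeeds if every address
 * under the requested mask is reachable by the device. */
static bool set_mask(uint64_t dev_limit, uint64_t mask)
{
	return mask <= dev_limit;
}

/* The 64-bit-first negotiation with a 32-bit fallback, as the removed
 * aacraid code did; returns 0 when no suitable mask exists. */
static uint64_t pick_dma_mask(uint64_t dev_limit)
{
	if (set_mask(dev_limit, DMA_BIT_MASK(64)))
		return DMA_BIT_MASK(64);
	if (set_mask(dev_limit, DMA_BIT_MASK(32)))
		return DMA_BIT_MASK(32);
	return 0;
}
```

The point of the hack above is that this negotiation should happen once at probe time rather than being repeated (with slightly different quirk handling) in several places.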

Re: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()

2020-08-30 Thread Vinod Koul
Hi Linus,

On 29-08-20, 14:20, Linus Torvalds wrote:
> On Sat, Aug 29, 2020 at 1:40 PM Guenter Roeck  wrote:
> >
> > Except for
> >
> > CHECK: spaces preferred around that '+' (ctx:VxV)
> > #29: FILE: drivers/dma/fsldma.h:223:
> > +   u32 val_lo = in_be32((u32 __iomem *)addr+1);
> 
> Added spaces.
> 
> > I don't see anything wrong with it either, so
> >
> > Reviewed-by: Guenter Roeck 
> >
> > Since I didn't see the real problem with the original code,
> > I'd take that with a grain of salt, though.
> 
> Well, honestly, the old code was so confused that just making it build
> is clearly already an improvement even if everything else were to be
> wrong.
> 
> So I committed my "fix". If it turns out there's more wrong in there
> and somebody tests it, we can fix it again. But now it hopefully
> compiles, at least.
> 
> My bet is that if that driver ever worked on ppc32, it will continue
> to work whatever we do to that function.
> 
> I _think_ the old code happened to - completely by mistake - get the
> value right for the case of "little endian access, with dma_addr_t
> being 32-bit". Because then it would still read the upper bits wrong,
> but the cast to dma_addr_t would then throw those bits away. And the
> lower bits would be right.
> 
> But for big-endian accesses or for ARCH_DMA_ADDR_T_64BIT it really
> looks like it always returned a completely incorrect value.
> 
> And again - the driver may have worked even with that completely
> incorrect value, since the use of it seems to be very incidental.

Thank you for the fix.

Acked-By: Vinod Koul 

> 
> In either case ("it didn't work before" or "it worked because the
> value doesn't really matter"), I don't think I could possibly have
> made things worse.
> 
> Famous last words.

I guess no one testing this on 32-bit is what caused this.

-- 
~Vinod
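The fix being discussed amounts to combining two big-endian 32-bit reads into a single 64-bit value, high word first. A standalone sketch of that idea (plain array loads stand in for `in_be32()`; this is not the driver's literal code):

```c
#include <stdint.h>

/* Sketch of the corrected fsl_ioread64() logic: the 64-bit register is
 * read as two 32-bit halves, with the high word at the lower address on
 * a big-endian layout, then recombined without losing the upper bits. */
static uint64_t read64_hi_lo(const uint32_t *addr)
{
	uint32_t val_hi = addr[0]; /* stands in for in_be32(addr)     */
	uint32_t val_lo = addr[1]; /* stands in for in_be32(addr + 1) */

	return ((uint64_t)val_hi << 32) | val_lo;
}
```

The bug in the old code was that the recombination discarded or misplaced the upper word, which only happened to give a usable result when a 32-bit dma_addr_t threw those bits away anyway.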


[PATCH] powerpc/powernv/pci: Drop pnv_phb->initialized

2020-08-30 Thread Oliver O'Halloran
The pnv_phb->initialized flag is an odd beast. It was added back in 2012 in
commit db1266c85261 ("powerpc/powernv: Skip check on PE if necessary") to
allow devices to be enabled even if their PE assignments hadn't been
completed yet. I can't think of any situation where we would (or should)
have PCI devices being enabled before their PEs are assigned, so I can only
assume it was a workaround for a bug or some other undesirable behaviour
from the PCI core.

Since commit dc3d8f85bb57 ("powerpc/powernv/pci: Re-work bus PE
configuration") the PE setup occurs before the PCI core allows drivers to
attach to the device, so the problem should no longer exist. Even if it
does, allowing the device to be enabled before we have assigned it to a
PE is almost certainly broken and will cause spurious EEH events, so we
should probably just remove it.

It's also worth pointing out that the ->initialized flag is set in
pnv_pci_ioda_create_dbgfs(), which has the entire function body wrapped in
#ifdef CONFIG_DEBUG_FS.  That has the fun side effect of bypassing any
other checks in pnv_pci_enable_device_hook() when debugfs is disabled,
which is probably not what anybody wants.

Signed-off-by: Oliver O'Halloran 
---
 arch/powerpc/platforms/powernv/pci-ioda.c | 17 -
 arch/powerpc/platforms/powernv/pci.h  |  1 -
 2 files changed, 18 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index 023a4f987bb2..6ac3c637b313 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2410,9 +2410,6 @@ static void pnv_pci_ioda_create_dbgfs(void)
list_for_each_entry_safe(hose, tmp, &hose_list, list_node) {
phb = hose->private_data;
 
-   /* Notify initialization of PHB done */
-   phb->initialized = 1;
-
sprintf(name, "PCI%04x", hose->global_number);
phb->dbgfs = debugfs_create_dir(name, powerpc_debugfs_root);
 
@@ -2609,17 +2606,8 @@ static resource_size_t pnv_pci_default_alignment(void)
  */
 static bool pnv_pci_enable_device_hook(struct pci_dev *dev)
 {
-   struct pnv_phb *phb = pci_bus_to_pnvhb(dev->bus);
struct pci_dn *pdn;
 
-   /* The function is probably called while the PEs have
-* not be created yet. For example, resource reassignment
-* during PCI probe period. We just skip the check if
-* PEs isn't ready.
-*/
-   if (!phb->initialized)
-   return true;
-
pdn = pci_get_pdn(dev);
if (!pdn || pdn->pe_number == IODA_INVALID_PE)
return false;
@@ -2629,14 +2617,9 @@ static bool pnv_pci_enable_device_hook(struct pci_dev *dev)
 
 static bool pnv_ocapi_enable_device_hook(struct pci_dev *dev)
 {
-   struct pci_controller *hose = pci_bus_to_host(dev->bus);
-   struct pnv_phb *phb = hose->private_data;
struct pci_dn *pdn;
struct pnv_ioda_pe *pe;
 
-   if (!phb->initialized)
-   return true;
-
pdn = pci_get_pdn(dev);
if (!pdn)
return false;
diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
index 739a0b3b72e1..36d22920f5a3 100644
--- a/arch/powerpc/platforms/powernv/pci.h
+++ b/arch/powerpc/platforms/powernv/pci.h
@@ -119,7 +119,6 @@ struct pnv_phb {
int flags;
void __iomem*regs;
u64 regs_phys;
-   int initialized;
spinlock_t  lock;
 
 #ifdef CONFIG_DEBUG_FS
-- 
2.26.2
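With the ->initialized escape hatch removed, the enable hooks above reduce to "refuse the device unless a PE has been assigned". A minimal standalone sketch of that check (simplified structures; not the kernel's exact code):

```c
#include <stdbool.h>
#include <stddef.h>

#define IODA_INVALID_PE (-1)

/* Toy pci_dn carrying only the field the hook inspects. */
struct pci_dn {
	int pe_number;
};

/* Sketch of the post-patch pnv_pci_enable_device_hook() logic:
 * a device may only be enabled once its PE assignment exists. */
static bool enable_device_allowed(const struct pci_dn *pdn)
{
	return pdn != NULL && pdn->pe_number != IODA_INVALID_PE;
}
```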



Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-08-30 Thread Alexey Kardashevskiy



On 29/08/2020 04:36, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:
>>
>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> As of today, if the biggest DDW that can be created can't map the whole
>>> partition, its creation is skipped and the default DMA window
>>> "ibm,dma-window" is used instead.
>>>
>>> DDW is 16x bigger than the default DMA window,
>>
>> 16x only under very specific circumstances which are
>> 1. phyp
>> 2. sriov
>> 3. device class in hmc (or what that priority number is in the lpar config).
> 
> Yeah, missing details.
> 
>>> having the same amount of
>>> pages, but increasing the page size to 64k.
>>> Besides larger DMA window,
>>
>> "Besides being larger"?
> 
> You are right there.
> 
>>
>>> it performs better for allocations over 4k,
>>
>> Better how?
> 
> I was thinking for allocations larger than (512 * 4k), since >2
> hypercalls are needed here, and for 64k pages would still be just 1
> hypercall up to (512 * 64k). 
> But yeah, not the usual case anyway.

Yup.


> 
>>
>>> so it would be nice to use it instead.
>>
>> I'd rather say something like:
>> ===
>> So far we assumed we can map the guest RAM 1:1 to the bus which worked
>> with a small number of devices. SRIOV changes that, as the user can
>> configure hundreds of VFs, and since phyp preallocates TCEs and does not
>> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
>> per PE to limit the waste of physical pages.
>> ===
> 
> I mixed this in my commit message, it looks like this:
> 
> ===
> powerpc/pseries/iommu: Make use of DDW for indirect mapping
> 
> So far it's been assumed possible to map the guest RAM 1:1 to the bus,
> which works with a small number of devices. SRIOV changes that, as the
> user can configure hundreds of VFs, and since phyp preallocates TCEs and
> does not allow IOMMU pages bigger than 64K, it has to limit the number
> of TCEs per PE to limit the waste of physical pages.
> 
> As of today, if the assumed direct mapping is not possible, DDW
> creation is skipped and the default DMA window "ibm,dma-window" is used
> instead.
> 
> The default DMA window uses 4k pages instead of 64k pages, and since
> the amount of pages is the same,


Is the amount really the same? I thought you can prioritize some VFs
over others (== allocate different number of TCEs). Does it really
matter if it is the same?


> making use of DDW instead of the
> default DMA window for indirect mapping will expand by 16x the amount
> of memory that can be mapped for DMA.

Stop saying "16x", it is not guaranteed by anything :)


> 
> The DDW created will be used for direct mapping by default. [...]
> ===
> 
> What do you think?
> 
>>> The DDW created will be used for direct mapping by default.
>>> If it's not available, indirect mapping will be used instead.
>>>
>>> For indirect mapping, it's necessary to update the iommu_table so
>>> iommu_alloc() can use the DDW created. For this,
>>> iommu_table_update_window() is called when everything else succeeds
>>> at enable_ddw().
>>>
>>> Removing the default DMA window for using DDW with indirect mapping
>>> is only allowed if there is no current IOMMU memory allocated in
>>> the iommu_table. enable_ddw() is aborted otherwise.
>>>
>>> As there will never be both direct and indirect mappings at the same
>>> time, the same property name can be used for the created DDW.
>>>
>>> So renaming
>>> define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
>>> to
>>> define DMA64_PROPNAME "linux,dma64-ddr-window-info"
>>> looks the right thing to do.
>>
>> I know I suggested this but this does not look so good anymore as I
>> suspect it breaks kexec (from older kernel to this one) so you either
>> need to check for both DT names or just keep the old one. Changing the
>> macro name is fine.
>>
> 
> Yeah, having 'direct' in the name doesn't really make sense if it's used
> for indirect mapping. I will just add this new define instead of
> replacing the old one, and check for both.
> Is that ok?


No, having two of these does not seem right or useful. It is pseries
which does not use petitboot (it relies on grub instead, so until the
target kernel is started there will be no DDW), so realistically we need
this property for kexec/kdump, which uses the same kernel but a different
initramdisk, and for that purpose we need the same property name.

But I can see myself annoyed when I try petitboot in the hacked pseries
qemu and things may crash :) On this basis I'd suggest keeping the name
and adding a comment next to it that it is not always "direct" anymore.


> 
>>
>>> To make sure the property differentiates both cases, a new u32 for flags
>>> was added at the end of the property, where BIT(0) set means direct
>>> mapping.
>>>
>>> Signed-off-by: Leonardo Bras 
>>> ---
>>>  arch/powerpc/platforms/pseries/iommu.c | 108 +++--
>>>  1 file changed, 84 insertions(+), 24 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>>> b/arch/powerpc/plat
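The proposed property layout with the trailing flags word could look roughly like this. The field and macro names here are hypothetical illustrations of the description above (the kernel would store the fields big-endian, which is omitted for brevity):

```c
#include <stdbool.h>
#include <stdint.h>

#define DDW_FLAG_DIRECT (1u << 0) /* hypothetical name for the BIT(0) flag */

/* Hypothetical layout: the existing window description plus a trailing
 * u32 of flags, as proposed in the discussion above. */
struct ddw_prop {
	uint32_t liobn;
	uint64_t dma_base;
	uint32_t tce_shift;
	uint32_t window_shift;
	uint32_t flags; /* new trailing u32; BIT(0) set => direct mapping */
};

/* Consumers can then distinguish a direct-mapped window from an
 * indirect-mapped one with the same property name. */
static bool ddw_is_direct(const struct ddw_prop *p)
{
	return (p->flags & DDW_FLAG_DIRECT) != 0;
}
```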

Re: [PATCH v1 08/10] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2020-08-30 Thread Alexey Kardashevskiy



On 29/08/2020 01:25, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 15:07 +1000, Alexey Kardashevskiy wrote:
>>
>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> Code used to create a ddw property that was previously scattered in
>>> enable_ddw() is now gathered in ddw_property_create(), which deals with
>>> allocating and filling the property, leaving it ready for
>>> of_add_property(), which now occurs in sequence.
>>>
>>> This created an opportunity to reorganize the second part of enable_ddw():
>>>
>>> Without this patch enable_ddw() does, in order:
>>> kzalloc() property & members, create_ddw(), fill ddwprop inside property,
>>> ddw_list_add(), do tce_setrange_multi_pSeriesLP_walk in all memory,
>>> of_add_property().
>>>
>>> With this patch enable_ddw() does, in order:
>>> create_ddw(), ddw_property_create(), of_add_property(), ddw_list_add(),
>>> do tce_setrange_multi_pSeriesLP_walk in all memory.
>>>
>>> This change requires of_remove_property() in case anything fails after
>>> of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
>>> in all memory, which looks like the most expensive operation, only if
>>> everything else succeeds.
>>>
>>> Signed-off-by: Leonardo Bras 
>>> ---
>>>  arch/powerpc/platforms/pseries/iommu.c | 97 +++---
>>>  1 file changed, 57 insertions(+), 40 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
>>> index 4031127c9537..3a1ef02ad9d5 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -1123,6 +1123,31 @@ static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
>>>  ret);
>>>  }
>>>  
>>> +static int ddw_property_create(struct property **ddw_win, const char *propname,
>>
>> @propname is always the same, do you really want to pass it every time?
> 
> I think it reads better, like "create a ddw property with this name".

This reads as "there are at least two ddw properties".

> Also, it makes possible to create ddw properties with other names, in
> case we decide to create properties with different names depending on
> the window created.

It is one window at any given moment, why call it different names... I
get the part that it is not always "direct" anymore but still...


> Also, it's probably optimized / inlined at this point.
> Is it ok doing it like this?
> 
>>
>>> +  u32 liobn, u64 dma_addr, u32 page_shift, u32 window_shift)
>>> +{
>>> +   struct dynamic_dma_window_prop *ddwprop;
>>> +   struct property *win64;
>>> +
>>> +   *ddw_win = win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
>>> +   if (!win64)
>>> +   return -ENOMEM;
>>> +
>>> +   win64->name = kstrdup(propname, GFP_KERNEL);
>>
>> Not clear why "win64->name = DIRECT64_PROPNAME" would not work here, the
>> generic OF code does not try kfree() it but it is probably out of scope
>> here.
> 
> Yeah, I had that question too.
> The previous code was like that, and I was trying not to mess too much
> with how it's done.
> 
>>> +   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
>>> +   win64->value = ddwprop;
>>> +   win64->length = sizeof(*ddwprop);
>>> +   if (!win64->name || !win64->value)
>>> +   return -ENOMEM;
>>
>> Up to 2 memory leaks here. I see the cleanup at "out_free_prop:" but
>> still looks fragile. Instead you could simply return win64 as the only
>> error possible here is -ENOMEM and returning NULL is equally good.
> 
> I agree. It's better if this function has its own cleanup routine.
> It will be fixed for next version.
> 
>>
>>
>>> +
>>> +   ddwprop->liobn = cpu_to_be32(liobn);
>>> +   ddwprop->dma_base = cpu_to_be64(dma_addr);
>>> +   ddwprop->tce_shift = cpu_to_be32(page_shift);
>>> +   ddwprop->window_shift = cpu_to_be32(window_shift);
>>> +
>>> +   return 0;
>>> +}
>>> +
>>>  /*
>>>   * If the PE supports dynamic dma windows, and there is space for a table
>>>   * that can map all pages in a linear offset, then setup such a table,
>>> @@ -1140,12 +1165,11 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> struct ddw_query_response query;
>>> struct ddw_create_response create;
>>> int page_shift;
>>> -   u64 max_addr;
>>> +   u64 max_addr, win_addr;
>>> struct device_node *dn;
>>> u32 ddw_avail[DDW_APPLICABLE_SIZE];
>>> struct direct_window *window;
>>> -   struct property *win64;
>>> -   struct dynamic_dma_window_prop *ddwprop;
>>> +   struct property *win64 = NULL;
>>> struct failed_ddw_pdn *fpdn;
>>> bool default_win_removed = false;
>>>  
>>> @@ -1244,38 +1268,34 @@ static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> goto out_failed;
>>> }
>>> len = order_base_2(max_addr);
>>> -   win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
>>> -   if (!win64) {
>>> -   dev_info(&dev->dev,
>>> -   "couldn't allocate proper

Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-08-30 Thread Alexey Kardashevskiy



On 31/08/2020 11:41, Oliver O'Halloran wrote:
> On Mon, Aug 31, 2020 at 10:08 AM Alexey Kardashevskiy  wrote:
>>
>> On 29/08/2020 05:55, Leonardo Bras wrote:
>>> On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:

 On 28/08/2020 01:32, Leonardo Bras wrote:
> Hello Alexey, thank you for this feedback!
>
> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
>>> +#define TCE_RPN_BITS 52  /* Bits 0-51 represent RPN on TCE */
>>
>> Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
>> is the actual limit.
>
> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical 
> memory addressable in the machine. IIUC, it means we can access physical 
> address up to (1ul << MAX_PHYSMEM_BITS).
>
> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
> 0-51 as the RPN. By looking at code, I understand that it means we may 
> input any address < (1ul << 52) to TCE.
>
> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I 
> suppose we can't ever pass a physical page address over
> (1ul << 51), and TCE accepts up to (1ul << 52).
> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily mean
> that TCE_RPN_BITS will also be increased, so I think they are independent 
> values.
>
> Does it make sense? Please let me know if I am missing something.

 The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
 6Apr2012.pdf spec says:

 "The number of most significant RPN bits implemented in the TCE is
 dependent on the max size of System Memory to be supported by the 
 platform".

 IODA3 is the same on this matter.

 This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
 on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
 where TCE_RPN_BITS comes from exactly - I have no idea.
>>>
>>> Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
>>> hardcoded 40-bit mask (0xfful), for a hard-coded 12-bit (4k)
>>> page size; PAPR+/LoPAR also defines TCE bits 0-51 as the RPN,
>>> as described before.
>>>
>>> IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
>>> shows system memory mapping into a TCE, and the TCE also has bits 0-51
>>> for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
>>> In fact, by the looks of those figures, the RPN_MASK should always be a
>>> 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.
>>
>> I suspect the mask is there in the first place for extra protection
>> against too-big addresses going to the TCE table (and/or for virtual vs
>> physical addresses). Using a 52-bit mask makes no sense for anything; you
>> could just drop the mask and let the C compiler deal with a 64-bit "uint" as it
>> is basically a 4K page address anywhere in the 64bit space. Thanks,
> 
> Assuming 4K pages you need 52 RPN bits to cover the whole 64bit
> physical address space. The IODA3 spec does explicitly say the upper
> bits are optional and the implementation only needs to support enough
> to cover up to the physical address limit, which is 56bits of P9 /
> PHB4. If you want to validate that the address will fit inside of
> MAX_PHYSMEM_BITS then fine, but I think that should be done as a
> WARN_ON or similar rather than just silently masking off the bits.

We can do this and probably should anyway but I am also pretty sure we
can just ditch the mask and have the hypervisor return an error which
will show up in dmesg.


-- 
Alexey


Re: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()

2020-08-30 Thread Michael Ellerman
Linus Torvalds  writes:
> On Sat, Aug 29, 2020 at 1:40 PM Guenter Roeck  wrote:
>>
>> Except for
>>
>> CHECK: spaces preferred around that '+' (ctx:VxV)
>> #29: FILE: drivers/dma/fsldma.h:223:
>> +   u32 val_lo = in_be32((u32 __iomem *)addr+1);
>
> Added spaces.
>
>> I don't see anything wrong with it either, so
>>
>> Reviewed-by: Guenter Roeck 
>>
>> Since I didn't see the real problem with the original code,
>> I'd take that with a grain of salt, though.
>
> Well, honestly, the old code was so confused that just making it build
> is clearly already an improvement even if everything else were to be
> wrong.

The old code is not that old, only ~18 months:

a1ff82a9c165 ("dmaengine: fsldma: Adding macro FSL_DMA_IN/OUT implement for ARM platform") (Jan 2019)

So I think it's possible it's never been tested on 32-bit ppc at all.

I did have a 32-bit FSL machine but it lost its network card in a power
outage and now it won't boot (and I can't get to it physically).

> So I committed my "fix". If it turns out there's more wrong in there
> and somebody tests it, we can fix it again. But now it hopefully
> compiles, at least.
>
> My bet is that if that driver ever worked on ppc32, it will continue
> to work whatever we do to that function.
>
> I _think_ the old code happened to - completely by mistake - get the
> value right for the case of "little endian access, with dma_addr_t
> being 32-bit". Because then it would still read the upper bits wrong,
> but the cast to dma_addr_t would then throw those bits away. And the
> lower bits would be right.
>
> But for big-endian accesses or for ARCH_DMA_ADDR_T_64BIT it really
> looks like it always returned a completely incorrect value.
>
> And again - the driver may have worked even with that completely
> incorrect value, since the use of it seems to be very incidental.
>
> In either case ("it didn't work before" or "it worked because the
> value doesn't really matter"), I don't think I could possibly have
> made things worse.

Agreed.

Hopefully someone from NXP can test it.

cheers


Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-08-30 Thread Oliver O'Halloran
On Mon, Aug 31, 2020 at 10:08 AM Alexey Kardashevskiy  wrote:
>
> On 29/08/2020 05:55, Leonardo Bras wrote:
> > On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
> >>
> >> On 28/08/2020 01:32, Leonardo Bras wrote:
> >>> Hello Alexey, thank you for this feedback!
> >>>
> >>> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
> > +#define TCE_RPN_BITS 52  /* Bits 0-51 represent RPN on TCE */
> 
>  Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
>  is the actual limit.
> >>>
> >>> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical 
> >>> memory addressable in the machine. IIUC, it means we can access physical 
> >>> address up to (1ul << MAX_PHYSMEM_BITS).
> >>>
> >>> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
> >>> 0-51 as the RPN. By looking at code, I understand that it means we may 
> >>> input any address < (1ul << 52) to TCE.
> >>>
> >>> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I 
> >>> suppose we can't ever pass a physical page address over
> >>> (1ul << 51), and TCE accepts up to (1ul << 52).
> >>> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily mean
> >>> that TCE_RPN_BITS will also be increased, so I think they are independent 
> >>> values.
> >>>
> >>> Does it make sense? Please let me know if I am missing something.
> >>
> >> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
> >> 6Apr2012.pdf spec says:
> >>
> >> "The number of most significant RPN bits implemented in the TCE is
> >> dependent on the max size of System Memory to be supported by the 
> >> platform".
> >>
> >> IODA3 is the same on this matter.
> >>
> >> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
> >> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
> >> where TCE_RPN_BITS comes from exactly - I have no idea.
> >
> > Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
> > hardcoded 40-bit mask (0xfful), for a hard-coded 12-bit (4k)
> > page size; PAPR+/LoPAR also defines TCE bits 0-51 as the RPN,
> > as described before.
> >
> > IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
> > shows system memory mapping into a TCE, and the TCE also has bits 0-51
> > for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
> > In fact, by the looks of those figures, the RPN_MASK should always be a
> > 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.
>
> I suspect the mask is there in the first place for extra protection
> against too-big addresses going to the TCE table (and/or for virtual vs
> physical addresses). Using a 52-bit mask makes no sense for anything; you
> could just drop the mask and let the C compiler deal with a 64-bit "uint" as it
> is basically a 4K page address anywhere in the 64bit space. Thanks,

Assuming 4K pages you need 52 RPN bits to cover the whole 64bit
physical address space. The IODA3 spec does explicitly say the upper
bits are optional and the implementation only needs to support enough
to cover up to the physical address limit, which is 56bits of P9 /
PHB4. If you want to validate that the address will fit inside of
MAX_PHYSMEM_BITS then fine, but I think that should be done as a
WARN_ON or similar rather than just silently masking off the bits.
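Oliver's suggestion — validating the address against MAX_PHYSMEM_BITS instead of silently masking bits off — might be sketched as follows. The 4K TCE page size and 51-bit limit are assumed values from the thread, and a plain `assert()` stands in for the kernel's `WARN_ON()`:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define TCE_SHIFT	 12 /* 4K TCE pages */
#define MAX_PHYSMEM_BITS 51

/* Compute the RPN for a physical address, warning (here: asserting)
 * rather than silently truncating when the address is out of range. */
static uint64_t tce_rpn(uint64_t addr)
{
	assert((addr >> MAX_PHYSMEM_BITS) == 0); /* kernel would WARN_ON() */
	return addr >> TCE_SHIFT;
}
```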


Re: [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

2020-08-30 Thread Nicholas Piggin
Excerpts from pet...@infradead.org's message of August 28, 2020 9:15 pm:
> On Fri, Aug 28, 2020 at 08:00:19PM +1000, Nicholas Piggin wrote:
> 
>> Closing this race only requires interrupts to be disabled while ->mm
>> and ->active_mm are being switched, but the TLB problem requires also
>> holding interrupts off over activate_mm. Unfortunately not all archs
>> can do that yet, e.g., arm defers the switch if irqs are disabled and
>> expects finish_arch_post_lock_switch() to be called to complete the
>> flush; um takes a blocking lock in activate_mm().
> 
> ARM at least has activate_mm() := switch_mm(), so it could be made to
> work.
>

Yeah, so long as that post_lock_switch switch did the right thing with
respect to its TLB flushing. It should do because arm doesn't seem to
check ->mm or ->active_mm (and if it was broken, the scheduler context
switch would be suspect too). I don't think the fix would be hard, just
that I don't have a good way to test it and qemu isn't great for testing
this kind of thing.

um too I think could probably defer that lock until after interrupts are
enabled again. I might throw a bunch of arch conversion patches over the
wall if this gets merged and try to move things along.

Thanks,
Nick


Re: KVM on POWER8 host lock up since 10d91611f426 ("powerpc/64s: Reimplement book3s idle code in C")

2020-08-30 Thread Nicholas Piggin
Excerpts from Michal Suchánek's message of August 31, 2020 6:11 am:
> Hello,
> 
> POWER8 KVM hosts lock up since commit 10d91611f426 ("powerpc/64s:
> Reimplement book3s idle code in C").
> 
> The symptom is host locking up completely after some hours of KVM
> workload with messages like
> 
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 71
> 2020-08-30T10:51:31+00:00 obs-power8-01 kernel: KVM: couldn't grab cpu 47
> 
> printed before the host locks up.
> 
> The machines run sandboxed builds, which is a mixed workload resulting in
> IO/single-core/multiple-core load over time, and there are periods of no
> activity with no VMs running as well. The VMs are short-lived, so VM
> setup/teardown is somewhat exercised as well.
> 
> POWER9 with the new guest entry fast path does not seem to be affected.
> 
> Reverted the patch and the followup idle fixes on top of 5.2.14 and
> re-applied commit a3f3072db6ca ("powerpc/powernv/idle: Restore IAMR
> after idle") which gives same idle code as 5.1.16 and the kernel seems
> stable.
> 
> Config is attached.
> 
> I cannot easily revert this commit, especially if I want to use the same
> kernel on POWER8 and POWER9 - many of the POWER9 fixes are applicable
> only to the new idle code.
> 
> Any idea what can be the problem?

So hwthread_state is never getting back to HWTHREAD_IN_IDLE on
those threads. I wonder what they are doing. POWER8 doesn't have a good
NMI IPI and I don't know if it supports pdbg dumping registers from the
BMC unfortunately. Do the messages always come in pairs of CPUs?

I'm not sure where to start with reproducing; I'll have to try. How many
vCPUs in the guests? Do you have several guests running at once?

Thanks,
Nick
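For readers unfamiliar with the symptom: a loudly simplified model of what "couldn't grab cpu" means. KVM spins waiting for a sibling hardware thread's state word to report it idle, and gives up after a bounded number of polls if the state never gets back to the idle value, which matches the observation above that hwthread_state never returns to HWTHREAD_IN_IDLE. Names and structure here are illustrative, not the kernel code:

```c
#include <stdatomic.h>
#include <stdio.h>

/* Illustrative states; the real kernel uses KVM_HWTHREAD_* values. */
enum hwthread_state { HWTHREAD_IN_KERNEL, HWTHREAD_IN_IDLE };

/* Poll a sibling thread's state word; time out if it never goes idle. */
static int grab_hwthread(_Atomic int *state, int cpu, int max_polls)
{
    for (int i = 0; i < max_polls; i++)
        if (atomic_load(state) == HWTHREAD_IN_IDLE)
            return 0;               /* thread reported idle: grabbed */
    fprintf(stderr, "KVM: couldn't grab cpu %d\n", cpu);
    return -1;                      /* stuck: the symptom in the log */
}
```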



Re: [PATCH v1 07/10] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

2020-08-30 Thread Alexey Kardashevskiy



On 29/08/2020 00:04, Leonardo Bras wrote:
> On Mon, 2020-08-24 at 13:44 +1000, Alexey Kardashevskiy wrote:
>>
>>> On 18/08/2020 09:40, Leonardo Bras wrote:
>>> enable_ddw() currently returns the address of the DMA window, which is
>>> considered invalid if it has the value 0x00.
>>>
>>> Also, it only considers valid an address returned from find_existing_ddw
>>> if it's not 0x00.
>>>
>>> Changing this behavior makes sense, given the users of enable_ddw() only
>>> need to know if direct mapping is possible. It can also allow a DMA window
>>> starting at 0x00 to be used.
>>>
>>> This will be helpful for using a DDW with indirect mapping, as the window
>>> address will be different than 0x00, but it will not map the whole
>>> partition.
>>>
>>> Signed-off-by: Leonardo Bras 
>>> ---
>>>  arch/powerpc/platforms/pseries/iommu.c | 30 --
>>>  1 file changed, 14 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
>>> b/arch/powerpc/platforms/pseries/iommu.c
>>> index fcdefcc0f365..4031127c9537 100644
>>> --- a/arch/powerpc/platforms/pseries/iommu.c
>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
>>> @@ -852,24 +852,25 @@ static void remove_ddw(struct device_node *np, bool 
>>> remove_prop)
>>> np, ret);
>>>  }
  
>>> -static u64 find_existing_ddw(struct device_node *pdn)
>>> +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
>>>  {
>>> struct direct_window *window;
>>> const struct dynamic_dma_window_prop *direct64;
>>> -   u64 dma_addr = 0;
>>> +   bool found = false;
>>>  
>>> spin_lock(&direct_window_list_lock);
>>> /* check if we already created a window and dupe that config if so */
>>> list_for_each_entry(window, &direct_window_list, list) {
>>> if (window->device == pdn) {
>>> direct64 = window->prop;
>>> -   dma_addr = be64_to_cpu(direct64->dma_base);
>>> +   *dma_addr = be64_to_cpu(direct64->dma_base);
>>> +   found = true;
>>> break;
>>> }
>>> }
>>> spin_unlock(&direct_window_list_lock);
>>>  
>>> -   return dma_addr;
>>> +   return found;
>>>  }
>>>  
>>>  static struct direct_window *ddw_list_add(struct device_node *pdn,
>>> @@ -1131,15 +1132,15 @@ static void reset_dma_window(struct pci_dev *dev, 
>>> struct device_node *par_dn)
>>>   * pdn: the parent pe node with the ibm,dma_window property
>>>   * Future: also check if we can remap the base window for our base page 
>>> size
>>>   *
>>> - * returns the dma offset for use by the direct mapped DMA code.
>>> + * returns true if can map all pages (direct mapping), false otherwise..
>>>   */
>>> -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>> +static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>>>  {
>>> int len, ret;
>>> struct ddw_query_response query;
>>> struct ddw_create_response create;
>>> int page_shift;
>>> -   u64 dma_addr, max_addr;
>>> +   u64 max_addr;
>>> struct device_node *dn;
>>> u32 ddw_avail[DDW_APPLICABLE_SIZE];
>>> struct direct_window *window;
>>> @@ -1150,8 +1151,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
>>> device_node *pdn)
>>>  
>>> mutex_lock(&direct_window_init_mutex);
>>>  
>>> -   dma_addr = find_existing_ddw(pdn);
>>> -   if (dma_addr != 0)
>>> +   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
>>> goto out_unlock;
>>>  
>>> /*
>>> @@ -1292,7 +1292,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
>>> device_node *pdn)
>>> goto out_free_window;
>>> }
>>>  
>>> -   dma_addr = be64_to_cpu(ddwprop->dma_base);
>>> +   dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
>>
>> Do not you need the same chunk in the find_existing_ddw() case above as
>> well? Thanks,
> 
> The new signature of find_existing_ddw() is 
> static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> 
> And on enable_ddw(), we call 
> find_existing_ddw(pdn, &dev->dev.archdata.dma_offset)
> 
> And inside the function we do:
> *dma_addr = be64_to_cpu(direct64->dma_base);
> 
> I think it's the same as the chunk before.
> Am I missing something?

ah no, sorry, you are not missing anything.


Reviewed-by: Alexey Kardashevskiy 




-- 
Alexey
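The refactor being reviewed follows a common C pattern: when 0 is a legitimate result, a sentinel return value is ambiguous, so the API moves to a bool return plus an out-parameter. A minimal stand-alone sketch of the before/after shapes (illustrative names, not the kernel code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct window { const void *id; uint64_t dma_base; };

/* Old style: the address is the return value, and 0 doubles as "not
 * found", which makes a window legitimately starting at 0 unusable. */
static uint64_t find_old(const struct window *w, size_t n, const void *id)
{
    for (size_t i = 0; i < n; i++)
        if (w[i].id == id)
            return w[i].dma_base;
    return 0;
}

/* New style: success/failure in the return value, the address in an
 * out-parameter, so dma_base == 0 is a valid hit. */
static bool find_new(const struct window *w, size_t n, const void *id,
                     uint64_t *dma_base)
{
    for (size_t i = 0; i < n; i++)
        if (w[i].id == id) {
            *dma_base = w[i].dma_base;
            return true;
        }
    return false;
}
```

This is exactly why the caller can now pass &dev->dev.archdata.dma_offset directly and let a window at DMA base 0x00 be used.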


Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

2020-08-30 Thread Alexey Kardashevskiy



On 29/08/2020 06:41, Leonardo Bras wrote:
> On Fri, 2020-08-28 at 11:40 +1000, Alexey Kardashevskiy wrote:
>>> I think it would be better to keep the code as generic as possible
>>> regarding page sizes.
>>
>> Then you need to test it. Does 4K guest even boot (it should but I would
>> not bet much on it)?
> 
> Maybe testing with host 64k pagesize and IOMMU 16MB pagesize in qemu
> should be enough, is there any chance to get indirect mapping in qemu
> like this? (DDW but with smaller DMA window available) 


You will have to hack the guest kernel to always do indirect mapping or
hack QEMU's rtas_ibm_query_pe_dma_window() to return a small number of
available TCEs. But you will be testing QEMU/KVM which behave quite
differently to pHyp in this particular case.



 Because if we want the former (==support), then we'll have to align the
 size up to the bigger page size when allocating/zeroing system pages,
 etc. 
>>>
>>> This part I don't understand. Why do we need to align everything to the
>>> bigger pagesize? 
>>>
>>> I mean, isn't it enough that the range [ret, ret + size[ is both
>>> allocated by mm and mapped onto an IOMMU range?
>>>
>>> Suppose a iommu_alloc_coherent() of 16kB on PAGESIZE = 4k and
>>> IOMMU_PAGE_SIZE() == 64k.
>>> Why 4 * cpu_pages mapped by a 64k IOMMU page is not enough? 
>>> All the space the user asked for is allocated and mapped for DMA.
>>
>> The user asked to map 16K, the rest - 48K - is used for something else
>> (may be even mapped to another device) but you are making all 64K
>> accessible by the device which only should be able to access 16K.
>>
>> In practice, if this happens, H_PUT_TCE will simply fail.
> 
> I have noticed the mlx5 driver getting a few bytes in a buffer and using
> iommu_map_page(). It does map a whole page for as few bytes as the user


Whole 4K system page or whole 64K iommu page?

> wants mapped, and the other bytes get used for something else, or just
> mapped on another DMA page.
> It seems to work fine.  



With a 4K system page and a 64K IOMMU page? In practice it would take
effort and/or bad luck to see it crashing. Thanks,



> 
>>
>>
 Bigger pages are not the case here as I understand it.
>>>
>>> I did not get this part, what do you mean?
>>
>> Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
>> supported set of sizes is different for P8/P9 and type of IO (PHB,
>> NVLink/CAPI).
>>
>>
> Update those functions to guarantee alignment with requested size
> using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
>
> Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
> with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
>
> Signed-off-by: Leonardo Bras 
> ---
>  arch/powerpc/kernel/iommu.c | 17 +
>  1 file changed, 9 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 9704f3f76e63..d7086087830f 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct device 
> *dev,
>   }
>  
>   if (dev)
> - boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> -   1 << tbl->it_page_shift);
> + boundary_size = IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, 
> tbl);

 Run checkpatch.pl, should complain about a long line.
>>>
>>> It's 86 columns long, which is less than the new limit of 100 columns
>>> Linus announced a few weeks ago. checkpatch.pl was updated too:
>>> https://www.phoronix.com/scan.php?page=news_item&px=Linux-Kernel-Deprecates-80-Col
>>
>> Yay finally :) Thanks,
> 
> :)
> 
>>
>>

>   else
> - boundary_size = ALIGN(1UL << 32, 1 << tbl->it_page_shift);
> + boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
>   /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
>  
>   n = iommu_area_alloc(tbl->it_map, limit, start, npages, tbl->it_offset,
> @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, struct 
> iommu_table *tbl,
>   unsigned int order;
>   unsigned int nio_pages, io_order;
>   struct page *page;
> + size_t size_io = size;
>  
>   size = PAGE_ALIGN(size);
>   order = get_order(size);
> @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, struct 
> iommu_table *tbl,
>   memset(ret, 0, size);
>  
>   /* Set up tces to cover the allocated range */
> - nio_pages = size >> tbl->it_page_shift;
> - io_order = get_iommu_order(size, tbl);
> + size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
> + nio_pages = size_io >> tbl->it_page_shift;
> + io_order = get_iommu_order(size_io, tbl);
>   mapping = iommu_alloc(dev, tbl, ret, nio_pages, DMA_BIDIRECTIONAL,
> mask >> tbl->it_page_shift, io_order, 
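The hunk quoted above rounds the requested size up to the IOMMU page size before computing nio_pages and io_order. A stand-alone sketch of that arithmetic (simplified re-implementations, not the kernel macros; power-of-two page sizes assumed, as in the real code):

```c
#include <stdint.h>

/* Round size up to the IOMMU page size, given the page shift
 * (simplified stand-in for IOMMU_PAGE_ALIGN(size, tbl)). */
static uint64_t iommu_page_align(uint64_t size, unsigned int page_shift)
{
    uint64_t page_size = 1ULL << page_shift;
    return (size + page_size - 1) & ~(page_size - 1);
}

/* Number of IOMMU pages covering the (aligned) size. */
static unsigned long nio_pages(uint64_t size, unsigned int page_shift)
{
    return iommu_page_align(size, page_shift) >> page_shift;
}
```

With a 4K system page and a 64K IOMMU page (shift 16), a 16K request rounds up to one 64K IOMMU page, which is exactly the over-exposure concern raised earlier in the thread.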

Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-08-30 Thread Alexey Kardashevskiy



On 29/08/2020 05:55, Leonardo Bras wrote:
> On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
>>
>> On 28/08/2020 01:32, Leonardo Bras wrote:
>>> Hello Alexey, thank you for this feedback!
>>>
>>> On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
> +#define TCE_RPN_BITS 52  /* Bits 0-51 represent 
> RPN on TCE */

 Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
 is the actual limit.
>>>
>>> I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical 
>>> memory addressable in the machine. IIUC, it means we can access physical 
>>> address up to (1ul << MAX_PHYSMEM_BITS). 
>>>
>>> This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
>>> 0-51 as the RPN. By looking at code, I understand that it means we may 
>>> input any address < (1ul << 52) to TCE.
>>>
>>> In practice, MAX_PHYSMEM_BITS should be enough as of today, because I 
>>> suppose we can't ever pass a physical page address over 
>>> (1ul << 51), and TCE accepts up to (1ul << 52).
>>> But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily mean that
>>> TCE_RPN_BITS will also be increased, so I think they are independent 
>>> values. 
>>>
>>> Does it make sense? Please let me know if I am missing something.
>>
>> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
>> 6Apr2012.pdf spec says:
>>
>> "The number of most significant RPN bits implemented in the TCE is
>> dependent on the max size of System Memory to be supported by the platform".
>>
>> IODA3 is the same on this matter.
>>
>> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
>> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
>> where TCE_RPN_BITS comes from exactly - I have no idea.
> 
> Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
> hardcoded 40-bit mask (0xfful), for hard-coded 12-bit (4k)
> pagesize, and PAPR+/LoPAR also defines the TCE as having bits 0-51
> described as the RPN, as mentioned before.
> 
> IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
> shows system memory mapping into a TCE, and the TCE also has bits 0-51
> for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.
> In fact, by the looks of those figures, the RPN_MASK should always be a
> 52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.


I suspect the mask is there in the first place for extra protection
against too-big addresses going to the TCE table (and/or for virtual vs
physical addresses). Using a 52-bit mask makes no sense for anything; you
could just drop the mask and let the C compiler deal with the 64-bit
"uint", as it is basically a 4K page address anywhere in the 64-bit space.
Thanks,


> Maybe that's it?




> 
>>
>>

> +#define TCE_RPN_MASK(ps) ((1ul << (TCE_RPN_BITS - (ps))) - 1)
>  #define TCE_VALID0x800   /* TCE valid */
>  #define TCE_ALLIO0x400   /* TCE valid for all 
> lpars */
>  #define TCE_PCI_WRITE0x2 /* write from PCI 
> allowed */
> diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> b/arch/powerpc/platforms/pseries/iommu.c
> index e4198700ed1a..8fe23b7dff3a 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table *tbl, 
> long index,
>   u64 proto_tce;
>   __be64 *tcep;
>   u64 rpn;
> + const unsigned long tceshift = tbl->it_page_shift;
> + const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
> + const u64 rpn_mask = TCE_RPN_MASK(tceshift);

 Using IOMMU_PAGE_SIZE macro for the page size and not using
 IOMMU_PAGE_MASK for the mask - this incosistency makes my small brain
 explode :) I understand the history but man... Oh well, ok.

>>>
>>> Yeah, it feels kind of weird after two IOMMU related consts. :)
>>> But sure IOMMU_PAGE_MASK() would not be useful here :)
>>>
>>> And this kind of let me thinking:
> + rpn = __pa(uaddr) >> tceshift;
> + *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << tceshift);
>>> Why not:
>>> rpn_mask =  TCE_RPN_MASK(tceshift) << tceshift;
>>
>> A mask for a page number (but not the address!) hurts my brain; masks
>> make sense for addresses, but numbers should already have all bits
>> adjusted, IMHO. Maybe it is just me :-/
>>
>>
>>> 
>>> rpn = __pa(uaddr) & rpn_mask;
>>> *tcep = cpu_to_be64(proto_tce | rpn)
>>>
>>> I am usually afraid of changing stuff like this, but I think it's safe.
>>>
 Good, otherwise. Thanks,
>>>
>>> Thank you for reviewing!
>>>  
>>>
>>>
> 

-- 
Alexey
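For what it's worth, the two formulations discussed above (masking the page number and then shifting back, versus masking the address with a pre-shifted mask) produce identical TCE address bits. A stand-alone sketch, with TCE_RPN_BITS = 52 as in the patch and everything else illustrative:

```c
#include <stdint.h>

#define TCE_RPN_BITS 52
#define TCE_RPN_MASK(ps) ((1ULL << (TCE_RPN_BITS - (ps))) - 1)

/* Variant from the patch: mask the page number, then shift back. */
static uint64_t tce_from_pn(uint64_t pa, unsigned int shift)
{
    uint64_t rpn = pa >> shift;
    return (rpn & TCE_RPN_MASK(shift)) << shift;
}

/* Variant proposed in the review: mask the address directly with a
 * pre-shifted mask. */
static uint64_t tce_from_addr(uint64_t pa, unsigned int shift)
{
    uint64_t mask = TCE_RPN_MASK(shift) << shift;
    return pa & mask;
}
```

Both keep exactly bits [shift, 52) of the physical address and zero everything else, which is also why Alexey argues the mask buys little beyond protection against out-of-range addresses.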


Re: fsl_espi errors on v5.7.15

2020-08-30 Thread Chris Packham

On 31/08/20 9:41 am, Heiner Kallweit wrote:
> On 30.08.2020 23:00, Chris Packham wrote:
>> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>> 
>>
 I've also now seen the RX FIFO not empty error on the T2080RDB

 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
 fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32

 With my current workaround of emptying the RX FIFO. It seems
 survivable. Interestingly it only ever seems to be 1 extra byte in the
 RX FIFO and it seems to be after either a READ_SR or a READ_FSR.

 fsl_espi ffe11.spi: tx 70
 fsl_espi ffe11.spi: rx 03
 fsl_espi ffe11.spi: Extra RX 00
 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
 fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
 fsl_espi ffe11.spi: tx 05
 fsl_espi ffe11.spi: rx 00
 fsl_espi ffe11.spi: Extra RX 03
 fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
 fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
 fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
 fsl_espi ffe11.spi: tx 05
 fsl_espi ffe11.spi: rx 00
 fsl_espi ffe11.spi: Extra RX 03

From all the Micron SPI-NOR datasheets I've got access to it is
 possible to continually read the SR/FSR. But I've no idea why it
 happens some times and not others.
>>> So I think I've got a reproduction and I think I've bisected the problem
>>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>>> C"). My day is just finishing now so I haven't applied too much scrutiny
>>> to this result. Given the various rabbit holes I've been down on this
>>> issue already I'd take this information with a good degree of 
>>> skepticism.
>>>
>> OK, so an easy test should be to re-test with a 5.4 kernel.
>> It doesn't have yet the change you're referring to, and the fsl-espi 
>> driver
>> is basically the same as in 5.7 (just two small changes in 5.7).
> There's 6cc0c16d82f88 and maybe also other interrupt related patches
> around this time that could affect book E, so it's good if that exact
> patch is confirmed.
 My confirmation is basically that I can induce the issue in a 5.4 kernel
 by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
 5.9-rc2 by reverting that one commit.

 In both cases it's not exactly a clean cherry-pick/revert, so I also
 confirmed the bisection result by building at 3282a3da25bd (which sees
 the issue) and the commit just before (which does not).
>>> Thanks for testing, that confirms it well.
>>>
>>> [snip patch]
>>>
 I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
 didn't report anything (either with or without the change above).
>>> Okay, it was a bit of a shot in the dark. I still can't see what
>>> else has changed.
>>>
>>> What would cause this, a lost interrupt? A spurious interrupt? Or
>>> higher interrupt latency?
>>>
>>> I don't think the patch should cause significantly worse latency,
>>> (it's supposed to be a bit better if anything because it doesn't set
>>> up the full interrupt frame). But it's possible.
>> My working theory is that the SPI_DON indication is all about the TX
>> direction and now that the interrupts are faster we're hitting an error
>> because there is still RX activity going on. Heiner disagrees with my
>> interpretation of the SPI_DON indication and the fact that it doesn't
>> happen every time does throw doubt on it.
>>
> It's right that the eSPI spec can be interpreted as SPI_DON referring to
> TX only. However, this wouldn't really make sense, because we also
> program the frame length for RX, and therefore want to be notified once
> the full frame has been received. Practical experience also shows that
> SPI_DON is set after RX-only transfers.
> The typical SPI NOR use case is that you write a read command + start
> address, followed by a longer read. If the TX-only interpretation were
> right, we'd always end up with SPI_DON not being set.
>
>> I can't really explain the extra RX byte in the fifo. We know how many
>> bytes to expect and we pull that many from the fifo so it's not as if
>> we're missing an interrupt causing us to skip the last byte. I've been
>> looking for some kind of off-by-one calculation but again if it were
>> something like tha

Re: fsl_espi errors on v5.7.15

2020-08-30 Thread Heiner Kallweit
On 30.08.2020 23:00, Chris Packham wrote:
> 
> On 31/08/20 12:30 am, Nicholas Piggin wrote:
>> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
> 
> 
> 
>>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>>
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>>
>>> With my current workaround of emptying the RX FIFO. It seems
>>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>>
>>> fsl_espi ffe11.spi: tx 70
>>> fsl_espi ffe11.spi: rx 03
>>> fsl_espi ffe11.spi: Extra RX 00
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe11.spi: tx 05
>>> fsl_espi ffe11.spi: rx 00
>>> fsl_espi ffe11.spi: Extra RX 03
>>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>> fsl_espi ffe11.spi: tx 05
>>> fsl_espi ffe11.spi: rx 00
>>> fsl_espi ffe11.spi: Extra RX 03
>>>
>>>   From all the Micron SPI-NOR datasheets I've got access to it is
>>> possible to continually read the SR/FSR. But I've no idea why it
>>> happens some times and not others.
>> So I think I've got a reproduction and I think I've bisected the problem
>> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
>> C"). My day is just finishing now so I haven't applied too much scrutiny
>> to this result. Given the various rabbit holes I've been down on this
>> issue already I'd take this information with a good degree of skepticism.
>>
> OK, so an easy test should be to re-test with a 5.4 kernel.
> It doesn't have yet the change you're referring to, and the fsl-espi 
> driver
> is basically the same as in 5.7 (just two small changes in 5.7).
 There's 6cc0c16d82f88 and maybe also other interrupt related patches
 around this time that could affect book E, so it's good if that exact
 patch is confirmed.
>>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>>> 5.9-rc2 by reverting that one commit.
>>>
>>> In both cases it's not exactly a clean cherry-pick/revert, so I also
>>> confirmed the bisection result by building at 3282a3da25bd (which sees
>>> the issue) and the commit just before (which does not).
>> Thanks for testing, that confirms it well.
>>
>> [snip patch]
>>
>>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>>> didn't report anything (either with or without the change above).
>> Okay, it was a bit of a shot in the dark. I still can't see what
>> else has changed.
>>
>> What would cause this, a lost interrupt? A spurious interrupt? Or
>> higher interrupt latency?
>>
>> I don't think the patch should cause significantly worse latency,
>> (it's supposed to be a bit better if anything because it doesn't set
>> up the full interrupt frame). But it's possible.
> 
> My working theory is that the SPI_DON indication is all about the TX 
> direction and now that the interrupts are faster we're hitting an error 
> because there is still RX activity going on. Heiner disagrees with my 
> interpretation of the SPI_DON indication and the fact that it doesn't 
> happen every time does throw doubt on it.
> 
It's right that the eSPI spec can be interpreted as SPI_DON referring to
TX only. However, this wouldn't really make sense, because we also
program the frame length for RX, and therefore want to be notified once
the full frame has been received. Practical experience also shows that
SPI_DON is set after RX-only transfers.
The typical SPI NOR use case is that you write a read command + start
address, followed by a longer read. If the TX-only interpretation were
right, we'd always end up with SPI_DON not being set.

> I can't really explain the extra RX byte in the fifo. We know how many 
> bytes to expect and we pull that many from the fifo so it's not as if 
> we're missing an interrupt causing us to skip the last byte. I've been 
> looking for some kind of off-by-one calculation but again if it were 
> something like that it'd happen all the time.
> 
Maybe it helps to know what value this extra byte in the FIFO has. Is it:
- a duplicate of the last read b

Re: fsl_espi errors on v5.7.15

2020-08-30 Thread Chris Packham

On 31/08/20 12:30 am, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:



>> I've also now seen the RX FIFO not empty error on the T2080RDB
>>
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>>
>> With my current workaround of emptying the RX FIFO. It seems
>> survivable. Interestingly it only ever seems to be 1 extra byte in the
>> RX FIFO and it seems to be after either a READ_SR or a READ_FSR.
>>
>> fsl_espi ffe11.spi: tx 70
>> fsl_espi ffe11.spi: rx 03
>> fsl_espi ffe11.spi: Extra RX 00
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe11.spi: tx 05
>> fsl_espi ffe11.spi: rx 00
>> fsl_espi ffe11.spi: Extra RX 03
>> fsl_espi ffe11.spi: Transfer done but SPIE_DON isn't set!
>> fsl_espi ffe11.spi: Transfer done but rx/tx fifo's aren't empty!
>> fsl_espi ffe11.spi: SPIE_RXCNT = 1, SPIE_TXCNT = 32
>> fsl_espi ffe11.spi: tx 05
>> fsl_espi ffe11.spi: rx 00
>> fsl_espi ffe11.spi: Extra RX 03
>>
>>   From all the Micron SPI-NOR datasheets I've got access to it is
>> possible to continually read the SR/FSR. But I've no idea why it
>> happens some times and not others.
> So I think I've got a reproduction and I think I've bisected the problem
> to commit 3282a3da25bd ("powerpc/64: Implement soft interrupt replay in
> C"). My day is just finishing now so I haven't applied too much scrutiny
> to this result. Given the various rabbit holes I've been down on this
> issue already I'd take this information with a good degree of skepticism.
>
 OK, so an easy test should be to re-test with a 5.4 kernel.
 It doesn't have yet the change you're referring to, and the fsl-espi driver
 is basically the same as in 5.7 (just two small changes in 5.7).
>>> There's 6cc0c16d82f88 and maybe also other interrupt related patches
>>> around this time that could affect book E, so it's good if that exact
>>> patch is confirmed.
>> My confirmation is basically that I can induce the issue in a 5.4 kernel
>> by cherry-picking 3282a3da25bd. I'm also able to "fix" the issue in
>> 5.9-rc2 by reverting that one commit.
>>
>> In both cases it's not exactly a clean cherry-pick/revert, so I also
>> confirmed the bisection result by building at 3282a3da25bd (which sees
>> the issue) and the commit just before (which does not).
> Thanks for testing, that confirms it well.
>
> [snip patch]
>
>> I still saw the issue with this change applied. PPC_IRQ_SOFT_MASK_DEBUG
>> didn't report anything (either with or without the change above).
> Okay, it was a bit of a shot in the dark. I still can't see what
> else has changed.
>
> What would cause this, a lost interrupt? A spurious interrupt? Or
> higher interrupt latency?
>
> I don't think the patch should cause significantly worse latency,
> (it's supposed to be a bit better if anything because it doesn't set
> up the full interrupt frame). But it's possible.

My working theory is that the SPI_DON indication is all about the TX 
direction and now that the interrupts are faster we're hitting an error 
because there is still RX activity going on. Heiner disagrees with my 
interpretation of the SPI_DON indication and the fact that it doesn't 
happen every time does throw doubt on it.

I can't really explain the extra RX byte in the fifo. We know how many 
bytes to expect and we pull that many from the fifo so it's not as if 
we're missing an interrupt causing us to skip the last byte. I've been 
looking for some kind of off-by-one calculation but again if it were 
something like that it'd happen all the time.
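The workaround Chris describes, pulling the expected number of bytes and then draining whatever is left, can be sketched against a toy FIFO. The helpers here are hypothetical stand-ins, not the fsl-espi driver's register accessors, which read a memory-mapped SPIE register and data port instead:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software model of the eSPI RX FIFO. */
struct rx_fifo { const uint8_t *data; size_t count; size_t pos; };

static int rx_fifo_empty(const struct rx_fifo *f) { return f->pos >= f->count; }
static uint8_t rx_fifo_pop(struct rx_fifo *f) { return f->data[f->pos++]; }

/* Pull the expected bytes, then drain (and discard) anything left so
 * the next transfer does not start with a stale byte in the FIFO.
 * Returns the number of stale bytes discarded. */
static size_t read_and_drain(struct rx_fifo *f, uint8_t *buf, size_t expected)
{
    size_t i, extra = 0;

    for (i = 0; i < expected && !rx_fifo_empty(f); i++)
        buf[i] = rx_fifo_pop(f);
    while (!rx_fifo_empty(f)) {     /* the "Extra RX" bytes in the log */
        (void)rx_fifo_pop(f);
        extra++;
    }
    return extra;
}
```

In the logs above, expected is 1 (a READ_SR/READ_FSR status byte) and exactly one extra byte shows up, matching SPIE_RXCNT = 1.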



Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.9-4 tag

2020-08-30 Thread pr-tracker-bot
The pull request you sent on Sun, 30 Aug 2020 22:27:25 +1000:

> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
> tags/powerpc-5.9-4

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/8bb5021cc2ee5d5dd129a9f2f5ad2bb76eea297d

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: ptrace_syscall_32 is failing

2020-08-30 Thread Andy Lutomirski
On Sat, Aug 29, 2020 at 9:40 PM Brian Gerst  wrote:
>
> On Sat, Aug 29, 2020 at 12:52 PM Andy Lutomirski  wrote:
> >
> > Seems to be a recent regression, maybe related to entry/exit work changes.
> >
> > # ./tools/testing/selftests/x86/ptrace_syscall_32
> > [RUN]Check int80 return regs
> > [OK]getpid() preserves regs
> > [OK]kill(getpid(), SIGUSR1) preserves regs
> > [RUN]Check AT_SYSINFO return regs
> > [OK]getpid() preserves regs
> > [OK]kill(getpid(), SIGUSR1) preserves regs
> > [RUN]ptrace-induced syscall restart
> > Child will make one syscall
> > [RUN]SYSEMU
> > [FAIL]Initial args are wrong (nr=224, args=10 11 12 13 14 4289172732)
> > [RUN]Restart the syscall (ip = 0xf7f3b549)
> > [OK]Restarted nr and args are correct
> > [RUN]Change nr and args and restart the syscall (ip = 0xf7f3b549)
> > [OK]Replacement nr and args are correct
> > [OK]Child exited cleanly
> > [RUN]kernel syscall restart under ptrace
> > Child will take a nap until signaled
> > [RUN]SYSCALL
> > [FAIL]Initial args are wrong (nr=29, args=0 0 0 0 0 4289172732)
> > [RUN]SYSCALL
> > [OK]Args after SIGUSR1 are correct (ax = -514)
> > [OK]Child got SIGUSR1
> > [RUN]Step again
> > [OK]pause(2) restarted correctly
>
> Bisected to commit 0b085e68f407 ("x86/entry: Consolidate 32/64 bit
> syscall entry").
> It looks like it is because syscall_enter_from_user_mode() is called
> before reading the 6th argument from the user stack.

Ugh.  I caught, in review, a potentially related issue with exit (not a
problem in current kernels), but I missed the entry version.

Thomas, can we revert the syscall_enter() and syscall_exit() part of
the series?  I think that they almost work for x86, but not quite, as
indicated by this bug.  Even if we imagine we can somehow hack around
this bug, I imagine we're going to find other problems with this
model, e.g. the potential upcoming exit problem I noted in my review.

I really think the model should be:

void do_syscall_whatever(...)
{
  irqentry_enter(...);
  instrumentation_begin();

  /* Do whatever arch ABI oddities are needed on entry. */

  Then either:
  syscall_begin(arch, nr, regs);
  dispatch the syscall;
  syscall_end(arch, nr, regs);

  Or just:
  generic_do_syscall(arch, nr, regs);

  /* Do whatever arch ABI oddities are needed on exit from the syscall. */

  instrumentation_end();
  irqentry_exit(...);
}

x86 has an ABI oddity needed on entry: this fast syscall argument
fixup.  We also might end up with ABI oddities on exit if we ever try
to make single-stepping of syscalls work fully correctly.  x86 sort of
gets away without specifying arch because the arch helpers that get
called for audit, etc can deduce the arch, but this is kind of gross.
I suppose we could omit arch as an explicit parameter.

Or I suppose we could try to rejigger the API in time for 5.9.
Fortunately only x86 uses the new APIs so far.  I cc'd a bunch of
other arch maintainers to see if other architectures fit well in the
new syscall_enter() model, but I feel like the fact that x86 is
already broken indicates that we messed it up a bit.

--Andy
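Brian's bisection finding can be illustrated with a toy model of the ordering bug: in the 32-bit fast-syscall path the sixth argument lives on the user stack and must be copied into the register frame before the common entry hook runs, or ptrace/seccomp see a stale value. The hook name syscall_enter_from_user_mode() is real, but everything else here is an illustrative sketch, not the x86 entry code:

```c
#include <stdint.h>

struct regs { uint64_t args[6]; };

static uint64_t seen_by_tracer[6];

/* Stand-in for syscall_enter_from_user_mode(): this is where a tracer
 * (ptrace/seccomp) inspects the arguments. */
static void entry_hook(const struct regs *r)
{
    for (int i = 0; i < 6; i++)
        seen_by_tracer[i] = r->args[i];
}

/* Correct order: fix up the stack-passed argument first. */
static void enter_fixed(struct regs *r, uint64_t stack_arg6)
{
    r->args[5] = stack_arg6;   /* copied from the user stack */
    entry_hook(r);
}

/* Buggy order (the regression): the hook runs before the fixup. */
static void enter_buggy(struct regs *r, uint64_t stack_arg6)
{
    entry_hook(r);
    r->args[5] = stack_arg6;
}
```

This matches the selftest failure above: args 10 11 12 13 14 are fine, but the sixth slot still holds the stale user-stack pointer value.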


Re: fsl_espi errors on v5.7.15

2020-08-30 Thread Heiner Kallweit
On 30.08.2020 14:30, Nicholas Piggin wrote:
> Excerpts from Chris Packham's message of August 28, 2020 8:07 am:
>> On 27/08/20 7:12 pm, Nicholas Piggin wrote:
>>> Excerpts from Heiner Kallweit's message of August 26, 2020 4:38 pm:
 On 26.08.2020 08:07, Chris Packham wrote:
> On 26/08/20 1:48 pm, Chris Packham wrote:
>> On 26/08/20 10:22 am, Chris Packham wrote:
>>> On 25/08/20 7:22 pm, Heiner Kallweit wrote:
>>>
>>> 
 I've been staring at spi-fsl-espi.c for while now and I think I've
> identified a couple of deficiencies that may or may not be related
> to my
> issue.
>
> First I think the 'Transfer done but SPIE_DON isn't set' message
> can be
> generated spuriously. In fsl_espi_irq() we read the ESPI_SPIE
> register.
> We also write back to it to clear the current events. We re-read it in
> fsl_espi_cpu_irq() and complain when SPIE_DON is not set. But we can
> naturally end up in that situation if we're doing a large read.
> Consider
> the messages for reading a block of data from a spi-nor chip
>
>     tx = READ_OP + ADDR
>     rx = data
>
> We set up the transfer and pump out the tx_buf. The first interrupt
> goes
> off and ESPI_SPIE has SPIM_DON and SPIM_RXT set. We empty the rx fifo,
> clear ESPI_SPIE and wait for the next interrupt. The next interrupt
> fires and this time we have ESPI_SPIE with just SPIM_RXT set. This
> continues until we've received all the data and we finish with
> ESPI_SPIE
> having only SPIM_RXT set. When we re-read it we complain that SPIE_DON
> isn't set.
>
> The other deficiency is that we only get an interrupt when the
> amount of
> data in the rx fifo is above FSL_ESPI_RXTHR. If there are fewer than
> FSL_ESPI_RXTHR left to be received we will never pull them out of
> the fifo.
>
 SPIM_DON will trigger an interrupt once the last characters have been
 transferred, and the remaining characters can then be read from the FIFO.
>>> The T2080RM that I have says the following about the DON bit
>>>
>>> "Last character was transmitted. The last character was transmitted
>>> and a new command can be written for the next frame."
>>>
>>> That does at least seem to fit with my assertion that it's all about
>>> the TX direction. But the fact that it doesn't happen all the time
>>> throws some doubt on it.
>>>
 I think the reason I'm seeing some variability is because of how fast
> (or slow) the interrupts get processed and how fast the spi-nor
> chip can
> fill the CPU's rx fifo.
>
 To rule out timing issues at high bus frequencies I initially asked
 for re-testing at lower frequencies. If you e.g. limit the bus to 1 MHz
 or even less, then timing shouldn't be an issue.
>>> Yes I've currently got spi-max-frequency = <100>; in my dts. I
>>> would also expect a slower frequency would fit my "DON is for TX"
>>> narrative.
 The last relevant functional changes were made almost 4 years ago,
 and yours is the first such report I've seen. So the question is what
 could be so special about your setup that it seems you're the only
 one being affected. The scenarios you describe are standard, so many
 more people should be affected if this were a driver bug.
>>> Agreed. But even on my hardware (which may have a latent issue
>>> despite being in the field for going on 5 years) the issue only
>>> triggers under some fairly specific circumstances.
 You said that kernel config impacts how frequently the issue happens.
 Therefore the question is what's the diff in kernel config, and how could
 the differences be related to SPI.
>>> It did seem to be somewhat random. Things like CONFIG_PREEMPT have an
>>> impact but every time I found something that seemed to be having an
>>> impact I've been able to disprove it. I actually think it's about how
>>> busy the system is which may or may not affect when we get round to
>>> processing the interrupts.
>>>
>>> I have managed to get the 'Transfer done but SPIE_DON isn't set!' to
>>> occur on the T2080RDB.
>>>
>>> I've had to add the following to expose the environment as a mtd
>>> partition
>>>
>>> diff --git a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> index ff87e67c70da..fbf95fc1fd68 100644
>>> --- a/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> +++ b/arch/powerpc/boot/dts/fsl/t208xrdb.dtsi
>>> @@ -116,6 +116,15 @@ flash@0 {
>>>      compatible = "micron,n25q512ax3",
>>> "jedec,spi-nor";
>>>   

Re: fsl_espi errors on v5.7.15

2020-08-30 Thread Nicholas Piggin

[GIT PULL] Please pull powerpc/linux.git powerpc-5.9-4 tag

2020-08-30 Thread Michael Ellerman

Hi Linus,

Please pull some more powerpc fixes for 5.9:

The following changes since commit 64ef8f2c4791940d7f3945507b6a45c20d959260:

  powerpc/perf/hv-24x7: Move cpumask file to top folder of hv-24x7 driver 
(2020-08-21 23:35:27 +1000)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-5.9-4

for you to fetch changes up to 4a133eb351ccc275683ad49305d0b04dde903733:

  powerpc/32s: Disable VMAP stack which CONFIG_ADB_PMU (2020-08-28 12:03:18 
+1000)

----------------------------------------------------------------
powerpc fixes for 5.9 #4

Revert our removal of PROT_SAO; at least one user expressed an interest in using
it on Power9. Instead don't allow it to be used in guests unless enabled
explicitly at compile time.

A fix for a crash introduced by a recent change to FP handling.

Revert a change to our idle code that left Power10 with no idle support.

One minor fix for the new scv system call path to set PPR.

Fix a crash in our "generic" PMU if branch stack events were enabled.

A fix for the IMC PMU, to correctly identify host kernel samples.

The ADB_PMU powermac code was found to be incompatible with VMAP_STACK, so make
them incompatible in Kconfig until the code can be fixed.

A build fix in drivers/video/fbdev/controlfb.c, and a documentation fix.

Thanks to:
  Alexey Kardashevskiy, Athira Rajeev, Christophe Leroy, Giuseppe Sacco,
  Madhavan Srinivasan, Milton Miller, Nicholas Piggin, Pratik Rajesh Sampat,
  Randy Dunlap, Shawn Anastasio, Vaidyanathan Srinivasan.

----------------------------------------------------------------
Alexey Kardashevskiy (1):
  powerpc/perf: Fix crashes with generic_compat_pmu & BHRB

Athira Rajeev (1):
  powerpc/perf: Fix reading of MSR[HV/PR] bits in trace-imc

Christophe Leroy (1):
  powerpc/32s: Disable VMAP stack which CONFIG_ADB_PMU

Michael Ellerman (2):
  video: fbdev: controlfb: Fix build for COMPILE_TEST=y && PPC_PMAC=n
  powerpc/64s: Fix crash in load_fp_state() due to fpexc_mode

Nicholas Piggin (1):
  powerpc/64s: scv entry should set PPR

Pratik Rajesh Sampat (1):
  Revert "powerpc/powernv/idle: Replace CPU feature check with PVR check"

Randy Dunlap (1):
  Documentation/powerpc: fix malformed table in syscall64-abi

Shawn Anastasio (3):
  Revert "powerpc/64s: Remove PROT_SAO support"
  powerpc/64s: Disallow PROT_SAO in LPARs by default
  selftests/powerpc: Update PROT_SAO test to skip ISA 3.1


 Documentation/powerpc/syscall64-abi.rst   |  4 +-
 arch/powerpc/Kconfig  | 12 ++
 arch/powerpc/include/asm/book3s/64/pgtable.h  |  8 ++--
 arch/powerpc/include/asm/cputable.h   | 10 ++---
 arch/powerpc/include/asm/mman.h   | 31 --
 arch/powerpc/include/asm/nohash/64/pgtable.h  |  2 +
 arch/powerpc/include/uapi/asm/mman.h  |  2 +-
 arch/powerpc/kernel/dt_cpu_ftrs.c |  2 +-
 arch/powerpc/kernel/entry_64.S|  4 ++
 arch/powerpc/kernel/process.c | 12 --
 arch/powerpc/mm/book3s64/hash_utils.c |  2 +
 arch/powerpc/perf/core-book3s.c   | 19 ++---
 arch/powerpc/perf/imc-pmu.c   |  4 +-
 arch/powerpc/platforms/Kconfig.cputype|  2 +-
 arch/powerpc/platforms/powernv/idle.c |  2 +-
 drivers/video/fbdev/controlfb.c   |  2 +
 include/linux/mm.h|  2 +
 include/trace/events/mmflags.h|  2 +
 mm/ksm.c  |  4 ++
 tools/testing/selftests/powerpc/mm/.gitignore |  1 +
 tools/testing/selftests/powerpc/mm/Makefile   |  4 +-
 tools/testing/selftests/powerpc/mm/prot_sao.c | 43 
 22 files changed, 144 insertions(+), 30 deletions(-)
 create mode 100644 tools/testing/selftests/powerpc/mm/prot_sao.c


Re: [PATCH] fsldma: fsl_ioread64*() do not need lower_32_bits()

2020-08-30 Thread Luc Van Oostenryck
On Sat, Aug 29, 2020 at 10:29:55AM -0700, Linus Torvalds wrote:
> On Sat, Aug 29, 2020 at 5:46 AM Luc Van Oostenryck
>  wrote:
> >
> > But the pointer is already 32-bit, so simply cast the pointer to u32.
> 
> Yeah, that code was completely pointless. If the pointer had actually
> been 64-bit, the old code would have warned too.
> 
> The odd thing is that the fsl_iowrite64() functions make sense. It's
> only the fsl_ioread64() functions that seem to be written by somebody
> who is really confused.
> 
> That said, this patch only humors the confusion. The cast to 'u32' is
> completely pointless. In fact, it seems to be actively wrong, because
> it means that the later "fsl_addr + 1" is done entirely incorrectly -
> it now literally adds "1" to an integer value, while the iowrite()
> functions will add one to a "u32 __iomem *" pointer (so will do
> pointer arithmetic, and add 4).
> 
My bad. I had noticed the '+ 1' and so automatically assumed
'OK, pointer arithmetic now' without noticing that the cast was
done only after the addition. Grrr.

FWIW, the version you committed looks much better to me.

-- Luc


Re: [PATCH v2] powerpc/32s: Disable VMAP stack which CONFIG_ADB_PMU

2020-08-30 Thread Michael Ellerman
On Thu, 27 Aug 2020 18:30:27 + (UTC), Christophe Leroy wrote:
> low_sleep_handler() can't restore the context from virtual
> stack because the stack can hardly be accessed with MMU OFF.
> 
> For now, disable VMAP stack when CONFIG_ADB_PMU is selected.

Applied to powerpc/fixes.

[1/1] powerpc/32s: Disable VMAP stack which CONFIG_ADB_PMU
  https://git.kernel.org/powerpc/c/4a133eb351ccc275683ad49305d0b04dde903733

cheers