Re: [PATCH v1 06/10] powerpc/pseries/iommu: Add ddw_list_add() helper

2020-08-28 Thread Leonardo Bras
On Fri, 2020-08-28 at 11:58 +1000, Alexey Kardashevskiy wrote:
> 
> On 28/08/2020 08:11, Leonardo Bras wrote:
> > On Mon, 2020-08-24 at 13:46 +1000, Alexey Kardashevskiy wrote:
> > > >  static int find_existing_ddw_windows(void)
> > > >  {
> > > > int len;
> > > > @@ -887,18 +905,11 @@ static int find_existing_ddw_windows(void)
> > > > if (!direct64)
> > > > continue;
> > > >  
> > > > -   window = kzalloc(sizeof(*window), GFP_KERNEL);
> > > > -   if (!window || len < sizeof(struct 
> > > > dynamic_dma_window_prop)) {
> > > > +   window = ddw_list_add(pdn, direct64);
> > > > +   if (!window || len < sizeof(*direct64)) {
> > > 
> > > Since you are touching this code, it looks like the "len <
> > > sizeof(*direct64)" part should go above to "if (!direct64)".
> > 
> > Sure, makes sense.
> > It will be fixed for v2.
> > 
> > > 
> > > 
> > > > kfree(window);
> > > > remove_ddw(pdn, true);
> > > > -   continue;
> > > > }
> > > > -
> > > > -   window->device = pdn;
> > > > -   window->prop = direct64;
> > > > -   spin_lock(&direct_window_list_lock);
> > > > -   list_add(&window->list, &direct_window_list);
> > > > -   spin_unlock(&direct_window_list_lock);
> > > > }
> > > >  
> > > > return 0;
> > > > @@ -1261,7 +1272,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> > > > device_node *pdn)
> > > > dev_dbg(&dev->dev, "created tce table LIOBN 0x%x for %pOF\n",
> > > >   create.liobn, dn);
> > > >  
> > > > -   window = kzalloc(sizeof(*window), GFP_KERNEL);
> > > > +   /* Add new window to existing DDW list */
> > > 
> > > The comment seems to duplicate what the ddw_list_add name already 
> > > suggests.
> > 
> > Ok, I will remove it then.
> > 
> > > > +   window = ddw_list_add(pdn, ddwprop);
> > > > if (!window)
> > > > goto out_clear_window;
> > > >  
> > > > @@ -1280,16 +1292,14 @@ static u64 enable_ddw(struct pci_dev *dev, 
> > > > struct device_node *pdn)
> > > > goto out_free_window;
> > > > }
> > > >  
> > > > -   window->device = pdn;
> > > > -   window->prop = ddwprop;
> > > > -   spin_lock(&direct_window_list_lock);
> > > > -   list_add(&window->list, &direct_window_list);
> > > > -   spin_unlock(&direct_window_list_lock);
> > > 
> > > I'd leave these 3 lines here and in find_existing_ddw_windows() (which
> > > would make  ddw_list_add -> ddw_prop_alloc). In general you want to have
> > > less stuff to do on the failure path. kmalloc may fail and needs kfree
> > > but you can safely delay list_add (which cannot fail) and avoid having
> > > the lock held twice in the same function (one of them is hidden inside
> > > ddw_list_add).
> > > Not sure if this change is really needed after all. Thanks,
> > 
> > I understand this leads to better performance in case anything fails.
> > Also, I think list_add happening in the end is less error-prone (in
> > case the list is checked between list_add and a fail).
> 
> Performance was not in my mind at all.
> 
> I noticed you remove from a list with a lock held and it was not there
> before, and there is a bunch of labels on the exit path, so I started
> looking for list_add() and checking that you do not double remove from the list.
> 
> 
> > But what if we put it at the end?
> > What is the chance of a kzalloc of 4 pointers (struct direct_window)
> > failing after walk_system_ram_range?
> 
> This is not about chances really, it is about readability. If let's say
> kmalloc failed, you just go to the error exit label and simply call kfree()
> on that pointer; kfree() will do nothing if it is NULL already, simple.
> list_del() does not have this simplicity.
> 
> 
> > Is it not worthy doing that for making enable_ddw() easier to
> > understand?
> 
> This is my goal here :)

Ok, it makes sense to me now. 
I tried creating ddw_list_add() to keep everything related to list-adding
in a single place, instead of splitting it around the other stuff,
but now I understand that the code may look more complex than it was
before, because the failure path grows in size.

For me it was strange creating a list entry and not list_add()ing it
right away, but maybe it's something worth getting used to, as it may
keep the failure path simpler, since list_add() doesn't fail.

I will try to see if the ddw_list_add() routine would become a useful
ddw_list_entry(), but if not, I will remove this patch.
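
Just to make that concrete, here is a minimal sketch of the ddw_list_entry()
idea (illustration only, reusing the existing struct direct_window,
direct_window_list and direct_window_list_lock):

static struct direct_window *ddw_list_entry(struct device_node *pdn,
			const struct dynamic_dma_window_prop *dma64)
{
	/* Allocation is the only step that can fail, so it comes first. */
	struct direct_window *window;

	window = kzalloc(sizeof(*window), GFP_KERNEL);
	if (!window)
		return NULL;

	window->device = pdn;
	window->prop = dma64;
	return window;
}

/*
 * The caller would then do the list_add() only after everything else has
 * succeeded, since list_add() cannot fail:
 *
 *	spin_lock(&direct_window_list_lock);
 *	list_add(&window->list, &direct_window_list);
 *	spin_unlock(&direct_window_list_lock);
 */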

Alexey, Thank you for reviewing this series!
Best regards,

Leonardo



Re: [PATCH v1 02/10] powerpc/kernel/iommu: Align size for IOMMU_PAGE_SIZE on iommu_*_coherent()

2020-08-28 Thread Leonardo Bras
On Fri, 2020-08-28 at 11:40 +1000, Alexey Kardashevskiy wrote:
> > I think it would be better to keep the code as generic as possible
> > regarding page sizes. 
> 
> Then you need to test it. Does 4K guest even boot (it should but I would
> not bet much on it)?

Maybe testing with host 64k pagesize and IOMMU 16MB pagesize in qemu
should be enough. Is there any chance to get indirect mapping in qemu
like this? (DDW but with a smaller DMA window available)

> > > Because if we want the former (==support), then we'll have to align the
> > > size up to the bigger page size when allocating/zeroing system pages,
> > > etc. 
> > 
> > This part I don't understand. Why do we need to align everything to the
> > bigger pagesize? 
> > 
> > I mean, isn't it enough that the range [ret, ret + size[ is both
> > allocated by mm and mapped onto an IOMMU range?
> > 
> > Suppose an iommu_alloc_coherent() of 16kB with PAGE_SIZE = 4k and
> > IOMMU_PAGE_SIZE() == 64k.
> > Why are 4 * cpu_pages mapped by a 64k IOMMU page not enough?
> > All the space the user asked for is allocated and mapped for DMA.
> 
> The user asked to map 16K, the rest - 48K - is used for something else
> (may be even mapped to another device) but you are making all 64K
> accessible by the device which only should be able to access 16K.
> 
> In practice, if this happens, H_PUT_TCE will simply fail.

I have noticed the mlx5 driver putting a few bytes in a buffer and using
iommu_map_page(). It maps a whole page even when the user wants only a few
bytes mapped, and the other bytes get used for something else, or just get
mapped in another DMA page.
It seems to work fine.
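
To put numbers on that example (just an illustration, using the macros from
this patch):

/*
 * iommu_alloc_coherent() of 16kB with PAGE_SIZE = 4k and
 * IOMMU_PAGE_SIZE(tbl) = 64k:
 *
 *	size      = PAGE_ALIGN(16k)               = 16k  (4 system pages)
 *	size_io   = IOMMU_PAGE_ALIGN(16k, tbl)    = 64k  (TCE-mapped size)
 *	nio_pages = size_io >> tbl->it_page_shift = 1
 *
 * so a single 64k TCE covers the four 4k system pages plus 48k the caller
 * never asked for, which is the over-mapping concern raised above.
 */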

> 
> 
> > > Bigger pages are not the case here as I understand it.
> > 
> > I did not get this part, what do you mean?
> 
> Possible IOMMU page sizes are 4K, 64K, 2M, 16M, 256M, 1GB, and the
> supported set of sizes is different for P8/P9 and type of IO (PHB,
> NVLink/CAPI).
> 
> 
> > > > Update those functions to guarantee alignment with requested size
> > > > using IOMMU_PAGE_ALIGN() before doing iommu_alloc() / iommu_free().
> > > > 
> > > > Also, on iommu_range_alloc(), replace ALIGN(n, 1 << tbl->it_page_shift)
> > > > with IOMMU_PAGE_ALIGN(n, tbl), which seems easier to read.
> > > > 
> > > > Signed-off-by: Leonardo Bras 
> > > > ---
> > > >  arch/powerpc/kernel/iommu.c | 17 +
> > > >  1 file changed, 9 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> > > > index 9704f3f76e63..d7086087830f 100644
> > > > --- a/arch/powerpc/kernel/iommu.c
> > > > +++ b/arch/powerpc/kernel/iommu.c
> > > > @@ -237,10 +237,9 @@ static unsigned long iommu_range_alloc(struct 
> > > > device *dev,
> > > > }
> > > >  
> > > > if (dev)
> > > > -   boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > > > - 1 << tbl->it_page_shift);
> > > > +   boundary_size = 
> > > > IOMMU_PAGE_ALIGN(dma_get_seg_boundary(dev) + 1, tbl);
> > > 
> > > Run checkpatch.pl, should complain about a long line.
> > 
> > It's 86 columns long, which is less than the new limit of 100 columns
> > Linus announced a few weeks ago. checkpatch.pl was updated too:
> > https://www.phoronix.com/scan.php?page=news_item=Linux-Kernel-Deprecates-80-Col
> 
> Yay finally :) Thanks,

:)

> 
> 
> > > 
> > > > else
> > > > -   boundary_size = ALIGN(1UL << 32, 1 << 
> > > > tbl->it_page_shift);
> > > > +   boundary_size = IOMMU_PAGE_ALIGN(1UL << 32, tbl);
> > > > /* 4GB boundary for iseries_hv_alloc and iseries_hv_map */
> > > >  
> > > > n = iommu_area_alloc(tbl->it_map, limit, start, npages, 
> > > > tbl->it_offset,
> > > > @@ -858,6 +857,7 @@ void *iommu_alloc_coherent(struct device *dev, 
> > > > struct iommu_table *tbl,
> > > > unsigned int order;
> > > > unsigned int nio_pages, io_order;
> > > > struct page *page;
> > > > +   size_t size_io = size;
> > > >  
> > > > size = PAGE_ALIGN(size);
> > > > order = get_order(size);
> > > > @@ -884,8 +884,9 @@ void *iommu_alloc_coherent(struct device *dev, 
> > > > struct iommu_table *tbl,
> > > > memset(ret, 0, size);
> > > >  
> > > > /* Set up tces to cover the allocated range */
> > > > -   nio_pages = size >> tbl->it_page_shift;
> > > > -   io_order = get_iommu_order(size, tbl);
> > > > +   size_io = IOMMU_PAGE_ALIGN(size_io, tbl);
> > > > +   nio_pages = size_io >> tbl->it_page_shift;
> > > > +   io_order = get_iommu_order(size_io, tbl);
> > > > mapping = iommu_alloc(dev, tbl, ret, nio_pages, 
> > > > DMA_BIDIRECTIONAL,
> > > >   mask >> tbl->it_page_shift, io_order, 0);
> > > > if (mapping == DMA_MAPPING_ERROR) {
> > > > @@ -900,11 +901,11 @@ void iommu_free_coherent(struct iommu_table *tbl, 
> > > > size_t size,
> > > >  void *vaddr, dma_addr_t dma_handle)

Re: [PATCH v1 01/10] powerpc/pseries/iommu: Replace hard-coded page shift

2020-08-28 Thread Leonardo Bras
On Fri, 2020-08-28 at 12:27 +1000, Alexey Kardashevskiy wrote:
> 
> On 28/08/2020 01:32, Leonardo Bras wrote:
> > Hello Alexey, thank you for this feedback!
> > 
> > On Sat, 2020-08-22 at 19:33 +1000, Alexey Kardashevskiy wrote:
> > > > +#define TCE_RPN_BITS   52  /* Bits 0-51 represent 
> > > > RPN on TCE */
> > > 
> > > Ditch this one and use MAX_PHYSMEM_BITS instead? I am pretty sure this
> > > is the actual limit.
> > 
> > I understand this MAX_PHYSMEM_BITS(51) comes from the maximum physical 
> > memory addressable in the machine. IIUC, it means we can access physical 
> > address up to (1ul << MAX_PHYSMEM_BITS). 
> > 
> > This 52 comes from PAPR "Table 9. TCE Definition" which defines bits
> > 0-51 as the RPN. By looking at code, I understand that it means we may 
> > input any address < (1ul << 52) to TCE.
> > 
> > In practice, MAX_PHYSMEM_BITS should be enough as of today, because I 
> > suppose we can't ever pass a physical page address over 
> > (1ul << 51), and TCE accepts up to (1ul << 52).
> > But if we ever increase MAX_PHYSMEM_BITS, it doesn't necessarily mean that 
> > TCE_RPN_BITS will also be increased, so I think they are independent 
> > values. 
> > 
> > Does it make sense? Please let me know if I am missing something.
> 
> The underlying hardware is PHB3/4 about which the IODA2 Version 2.4
> 6Apr2012.pdf spec says:
> 
> "The number of most significant RPN bits implemented in the TCE is
> dependent on the max size of System Memory to be supported by the platform".
> 
> IODA3 is the same on this matter.
> 
> This is MAX_PHYSMEM_BITS and PHB itself does not have any other limits
> on top of that. So the only real limit comes from MAX_PHYSMEM_BITS and
> where TCE_RPN_BITS comes from exactly - I have no idea.

Well, I created this TCE_RPN_BITS = 52 because the previous mask was a
hard-coded 40-bit mask (0xfffffffffful) for a hard-coded 12-bit (4k)
pagesize, and PAPR+/LoPAR also defines TCE bits 0-51 as the RPN, as
described before.

IODA3 Revision 3.0_prd1 (OpenPowerFoundation), Figure 3.4 and 3.5.
shows system memory mapping into a TCE, and the TCE also has bits 0-51
for the RPN (52 bits). "Table 3.6. TCE Definition" also shows it.

In fact, by the looks of those figures, the RPN_MASK should always be a
52-bit mask, and RPN = (page >> tceshift) & RPN_MASK.

Maybe that's it?
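
For reference, here is a standalone sketch of the two equivalent ways of
applying that 52-bit limit (my illustration, not the patch itself; proto_tce
and tceshift as used in tce_build_pSeries()):

#define TCE_RPN_BITS		52
#define TCE_RPN_MASK(ps)	((1ul << (TCE_RPN_BITS - (ps))) - 1)

/* 1) mask the page number, then shift it back into place */
static inline u64 tce_from_masked_rpn(u64 proto_tce, unsigned long pa,
				      unsigned long tceshift)
{
	u64 rpn = pa >> tceshift;

	return proto_tce | ((rpn & TCE_RPN_MASK(tceshift)) << tceshift);
}

/* 2) mask the address directly, as suggested further down in this mail */
static inline u64 tce_from_masked_addr(u64 proto_tce, unsigned long pa,
				       unsigned long tceshift)
{
	u64 rpn_mask = TCE_RPN_MASK(tceshift) << tceshift;

	return proto_tce | (pa & rpn_mask);
}

/* Both produce the same TCE for any physical address pa. */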

> 
> 
> > > 
> > > > +#define TCE_RPN_MASK(ps)   ((1ul << (TCE_RPN_BITS - (ps))) - 1)
> > > >  #define TCE_VALID  0x800   /* TCE valid */
> > > >  #define TCE_ALLIO  0x400   /* TCE valid for all 
> > > > lpars */
> > > >  #define TCE_PCI_WRITE  0x2 /* write from PCI 
> > > > allowed */
> > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > > > b/arch/powerpc/platforms/pseries/iommu.c
> > > > index e4198700ed1a..8fe23b7dff3a 100644
> > > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > > @@ -107,6 +107,9 @@ static int tce_build_pSeries(struct iommu_table 
> > > > *tbl, long index,
> > > > u64 proto_tce;
> > > > __be64 *tcep;
> > > > u64 rpn;
> > > > +   const unsigned long tceshift = tbl->it_page_shift;
> > > > +   const unsigned long pagesize = IOMMU_PAGE_SIZE(tbl);
> > > > +   const u64 rpn_mask = TCE_RPN_MASK(tceshift);
> > > 
> > > Using IOMMU_PAGE_SIZE macro for the page size and not using
> > > IOMMU_PAGE_MASK for the mask - this inconsistency makes my small brain
> > > explode :) I understand the history but man... Oh well, ok.
> > > 
> > 
> > Yeah, it feels kind of weird after two IOMMU related consts. :)
> > But sure IOMMU_PAGE_MASK() would not be useful here :)
> > 
> > And this kind of let me thinking:
> > > > +   rpn = __pa(uaddr) >> tceshift;
> > > > +   *tcep = cpu_to_be64(proto_tce | (rpn & rpn_mask) << 
> > > > tceshift);
> > Why not:
> > rpn_mask =  TCE_RPN_MASK(tceshift) << tceshift;
> 
> A mask for a page number (but not the address!) hurts my brain, masks
> are good against addresses but numbers should already have all bits
> adjusted imho, may be it is just me :-/
> 
> 
> > 
> > rpn = __pa(uaddr) & rpn_mask;
> > *tcep = cpu_to_be64(proto_tce | rpn)
> > 
> > I am usually afraid of changing stuff like this, but I think it's safe.
> > 
> > > Good, otherwise. Thanks,
> > 
> > Thank you for reviewing!
> >  
> > 
> > 



Re: [PATCH v1 09/10] powerpc/pseries/iommu: Make use of DDW even if it does not map the partition

2020-08-28 Thread Leonardo Bras
On Mon, 2020-08-24 at 15:17 +1000, Alexey Kardashevskiy wrote:
> 
> On 18/08/2020 09:40, Leonardo Bras wrote:
> > As of today, if the biggest DDW that can be created can't map the whole
> > partition, its creation is skipped and the default DMA window
> > "ibm,dma-window" is used instead.
> > 
> > DDW is 16x bigger than the default DMA window,
> 
> 16x only under very specific circumstances which are
> 1. phyp
> 2. sriov
> 3. device class in hmc (or what that priority number is in the lpar config).

Yeah, missing details.

> > having the same amount of
> > pages, but increasing the page size to 64k.
> > Besides larger DMA window,
> 
> "Besides being larger"?

You are right there.

> 
> > it performs better for allocations over 4k,
> 
> Better how?

I was thinking of allocations larger than (512 * 4k), since 2+
hypercalls are needed there, while for 64k pages it would still be just 1
hypercall up to (512 * 64k).
But yeah, not the usual case anyway.
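
For context, the rough arithmetic behind that claim (assuming
H_PUT_TCE_INDIRECT batches up to 512 TCEs per hypercall):

/*
 *	512 TCEs *  4k =  2MB mapped per hypercall with 4k TCEs
 *	512 TCEs * 64k = 32MB mapped per hypercall with 64k TCEs
 *
 * so a mapping between 2MB and 32MB takes 2+ hypercalls with 4k TCEs but
 * still a single one with 64k TCEs.
 */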

> 
> > so it would be nice to use it instead.
> 
> I'd rather say something like:
> ===
> So far we assumed we can map the guest RAM 1:1 to the bus which worked
> with a small number of devices. SRIOV changes it as the user can
> configure hundreds VFs and since phyp preallocates TCEs and does not
> allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
> per a PE to limit waste of physical pages.
> ===

I mixed this in my commit message, it looks like this:

===
powerpc/pseries/iommu: Make use of DDW for indirect mapping

So far it's assumed possible to map the guest RAM 1:1 to the bus, which
works with a small number of devices. SRIOV changes it as the user can
configure hundreds of VFs, and since phyp preallocates TCEs and does not
allow IOMMU pages bigger than 64K, it has to limit the number of TCEs
per PE to limit waste of physical pages.

As of today, if the assumed direct mapping is not possible, DDW
creation is skipped and the default DMA window "ibm,dma-window" is used
instead.

The default DMA window uses 4k pages instead of 64k pages, and since
the amount of pages is the same, making use of DDW instead of the
default DMA window for indirect mapping will expand by 16x the amount
of memory that can be mapped for DMA.

The DDW created will be used for direct mapping by default. [...]
===

What do you think?

> > The DDW created will be used for direct mapping by default.
> > If it's not available, indirect mapping will be used instead.
> > 
> > For indirect mapping, it's necessary to update the iommu_table so
> > iommu_alloc() can use the DDW created. For this,
> > iommu_table_update_window() is called when everything else succeeds
> > at enable_ddw().
> > 
> > Removing the default DMA window for using DDW with indirect mapping
> > is only allowed if there is no current IOMMU memory allocated in
> > the iommu_table. enable_ddw() is aborted otherwise.
> > 
> > As there will never have both direct and indirect mappings at the same
> > time, the same property name can be used for the created DDW.
> > 
> > So renaming
> > define DIRECT64_PROPNAME "linux,direct64-ddr-window-info"
> > to
> > define DMA64_PROPNAME "linux,dma64-ddr-window-info"
> > looks the right thing to do.
> 
> I know I suggested this but this does not look so good anymore as I
> suspect it breaks kexec (from older kernel to this one) so you either
> need to check for both DT names or just keep the old one. Changing the
> macro name is fine.
> 

Yeah, having 'direct' in the name doesn't really make sense if it's used
for indirect mapping. I will just add this new define instead of
replacing the old one, and check for both. 
Is that ok?
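
Roughly, the lookup side would become something like this hypothetical helper
(sketch only, assuming DMA64_PROPNAME is the new define and DIRECT64_PROPNAME
is kept for kexec from older kernels):

static const struct dynamic_dma_window_prop *
ddw_find_prop(struct device_node *pdn, int *len)
{
	const struct dynamic_dma_window_prop *dma64;

	/* Prefer the new name, fall back to the one older kernels created. */
	dma64 = of_get_property(pdn, DMA64_PROPNAME, len);
	if (!dma64)
		dma64 = of_get_property(pdn, DIRECT64_PROPNAME, len);

	return dma64;
}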

> 
> > To make sure the property differentiates both cases, a new u32 for flags
> > was added at the end of the property, where BIT(0) set means direct
> > mapping.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  arch/powerpc/platforms/pseries/iommu.c | 108 +++--
> >  1 file changed, 84 insertions(+), 24 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > b/arch/powerpc/platforms/pseries/iommu.c
> > index 3a1ef02ad9d5..9544e3c91ced 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -350,8 +350,11 @@ struct dynamic_dma_window_prop {
> > __be64  dma_base;   /* address hi,lo */
> > __be32  tce_shift;  /* ilog2(tce_page_size) */
> > __be32  window_shift;   /* ilog2(tce_window_size) */
> > +   __be32  flags;  /* DDW properties, see below */
> >  };
> >  
> > +#define DDW_FLAGS_DIRECT   0x01
> 
> This is set if ((1 << len) >= ddw_memory_hotplug_max()), you
> could simply check window_shift and drop the flags.
> 

Yeah, it's better this way, I will revert this.

> 
> > +
> >  struct direct_window {
> > struct device_node *device;
> > const struct dynamic_dma_window_prop *prop;
> > @@ -377,7 +380,7 @@ static LIST_HEAD(direct_window_list);
> >  static DEFINE_SPINLOCK(direct_window_list_lock);
> >  /* protects initializing 

Re: kernel since 5.6 do not boot anymore on Apple PowerBook

2020-08-28 Thread Gabriel Paubert
On Thu, Aug 27, 2020 at 12:39:06PM +0200, Christophe Leroy wrote:
> Hi,
> 
> Le 27/08/2020 à 10:28, Giuseppe Sacco a écrit :
> > Il giorno gio, 27/08/2020 alle 09.46 +0200, Giuseppe Sacco ha scritto:
> > > Il giorno gio, 27/08/2020 alle 00.28 +0200, Giuseppe Sacco ha scritto:
> > > > Hello Christophe,
> > > > 
> > > > Il giorno mer, 26/08/2020 alle 15.53 +0200, Christophe Leroy ha
> > > > scritto:
> > > > [...]
> > > > > If there is no warning, then the issue is something else, bad luck.
> > > > > 
> > > > > Could you increase the loglevel and try again both with and without
> > > > > VMAP_STACK ? Maybe we'll get more information on where it stops.
> > > > 
> > > > The problem is related to the CPU frequency changes. This is where the
> > > > system stop: cpufreq get the CPU frequency limits and then start the
> > > > default governor (performance) and then calls function
> > > > cpufreq_gov_performance_limits() that never returns.
> > > > 
> > > > Rebuilding after enabling pr_debug for cpufreq.c, I've got some more
> > > > lines of output:
> > > > 
> > > > cpufreq: setting new policy for CPU 0: 667000 - 867000 kHz
> > > > cpufreq: new min and max freqs are 667000 - 867000 kHz
> > > > cpufreq: governor switch
> > > > cpufreq: cpufreq_init_governor: for CPU 0
> > > > cpufreq: cpufreq_start_governor: for CPU 0
> > > > cpufreq: target for CPU 0: 867000 kHz, relation 1, requested 867000 kHz
> > > > cpufreq: __target_index: cpu: 0, oldfreq: 667000, new freq: 867000
> > > > cpufreq: notification 0 of frequency transition to 867000 kHz
> > > > cpufreq: saving 133328 as reference value for loops_per_jiffy; freq is 
> > > > 667000 kHz
> > > > 
> > > > no more lines are printed. I think this output only refers to the
> > > > notification sent prior to the frequency change.
> > > > 
> > > I was thinking that selecting a governor that would run at 667MHz would
> > > > probably skip the problem. I added cpufreq.default_governor=powersave
> > > > to the command line parameters but it did not work: the selected
> > > > governor was still performance and the system crashed.
> > > 
> > > I kept following the thread and found that CPU frequency is changed in
> > > function pmu_set_cpu_speed() in file drivers/cpufreq/pmac32-cpufreq.c.
> > > As first thing, the function calls the macro preempt_disable() and this
> > > is where it stops.
> > 
> > Sorry, I made a mistake. The real problem is down, on the same
> > function, when it calls low_sleep_handler(). This is where the problem
> > probably is.
> > 
> 
> 
> Great, you spotted the problem.
> 
> I see what it is, it is in low_sleep_handler() in
> arch/powerpc/platforms/powermac/sleep.S
> 
> All critical registers are saved on the stack. At restore, they are restored
> BEFORE re-enabling MMU (because they are needed for that). But when we have
> VMAP_STACK, the stack can hardly be accessed without the MMU enabled.
> tophys() doesn't work for virtual stack addresses.
> 
> Therefore, the low_sleep_handler() has to be reworked to use an area in
> the linear memory instead of the stack.

Actually there is already one: at the end of sleep.S there is a 4-byte
area called sleep_storage. It should be enough to move the CPU state
which has to be restored before the MMU is enabled there:
- SDR1
- BATs
- SPRG (they might stay on the stack by reordering the code unless I
  miss something)

Note that save_cpu_setup/restore_cpu_setup also use another static
storage area to save other SPRs.

Using a static area would not work on SMP, but all PPC-based Apple laptops
are single-core and single-thread.

Gabriel




Re: Flushing transparent hugepages

2020-08-28 Thread Catalin Marinas
On Tue, Aug 18, 2020 at 05:08:16PM +0100, Will Deacon wrote:
> On Tue, Aug 18, 2020 at 04:07:36PM +0100, Matthew Wilcox wrote:
> > For example, arm64 seems confused in this scenario:
> > 
> > void flush_dcache_page(struct page *page)
> > {
> > if (test_bit(PG_dcache_clean, &page->flags))
> > clear_bit(PG_dcache_clean, &page->flags);
> > }
> > 
> > ...
> > 
> > void __sync_icache_dcache(pte_t pte)
> > {
> > struct page *page = pte_page(pte);
> > 
> > if (!test_and_set_bit(PG_dcache_clean, &page->flags))
> > sync_icache_aliases(page_address(page), page_size(page));
> > }
> > 
> > So arm64 keeps track on a per-page basis which ones have been flushed.
> > page_size() will return PAGE_SIZE if called on a tail page or regular
> > page, but will return PAGE_SIZE << compound_order if called on a head
> > page.  So this will either over-flush, or it's missing the opportunity
> > to clear the bits on all the subpages which have now been flushed.
> 
> Hmm, that seems to go all the way back to 2014 as the result of a bug fix
> in 923b8f5044da ("arm64: mm: Make icache synchronisation logic huge page
> aware") which has a Reported-by Mark and a CC stable, suggesting something
> _was_ going wrong at the time :/ Was there a point where the tail pages
> could end up with PG_arch_1 uncleared on allocation?

In my experience, it's the other way around: you can end up with
PG_arch_1 cleared in a tail page when the head one was set (splitting
THP).

> > What would you _like_ to see?  Would you rather flush_dcache_page()
> > were called once for each subpage, or would you rather maintain
> > the page-needs-flushing state once per compound page?  We could also
> > introduce flush_dcache_thp() if some architectures would prefer it one
> > way and one the other, although that brings into question what to do
> > for hugetlbfs pages.
> 
> For arm64, we'd like to see PG_arch_1 preserved during huge page splitting
> [1], but there was a worry that it might break x86 and s390. It's also not
> clear to me that we can change __sync_icache_dcache() as it's called when
> we're installing the entry in the page-table, so why would it be called
> again for the tail pages?

Indeed, __sync_icache_dcache() is called from set_pte_at() on the head
page, though it could always iterate and flush the tail pages
individually (I think we could have done this in commit 923b8f5044da).
Currently I suspect it does some over-flushing if you use THP on
executable pages (it's a no-op on non-exec pages).
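
As a rough sketch of the 'iterate and flush the tail pages individually'
option mentioned above (untested illustration, assuming page here is the
head page of the compound page):

void __sync_icache_dcache(pte_t pte)
{
	struct page *page = pte_page(pte);
	unsigned long i, nr = compound_nr(page);

	/* Mark and flush each subpage on its own instead of the whole THP. */
	for (i = 0; i < nr; i++, page++)
		if (!test_and_set_bit(PG_dcache_clean, &page->flags))
			sync_icache_aliases(page_address(page), PAGE_SIZE);
}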

With MTE (arm64 memory tagging) I'm introducing a PG_arch_2 flag and
losing this is more problematic as it can lead to clearing valid tags.
In the subsequent patch [2], mte_sync_tags() (also called from
set_pte_at()) checks the PG_arch_2 in each page of a compound one.

My preference would be to treat both PG_arch_1 and _2 similarly.

> [1] 
> https://lore.kernel.org/linux-arch/20200703153718.16973-8-catalin.mari...@arm.com/

[2] 
https://lore.kernel.org/linux-arch/20200703153718.16973-9-catalin.mari...@arm.com/

-- 
Catalin


[powerpc:fixes-test] BUILD SUCCESS 4a133eb351ccc275683ad49305d0b04dde903733

2020-08-28 Thread kernel test robot
defconfig
alpha   defconfig
alpha    allyesconfig
xtensa   allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
parisc   allyesconfig
s390     defconfig
i386     allyesconfig
sparc    allyesconfig
sparc    defconfig
i386     defconfig
mips allmodconfig
powerpc defconfig
powerpc  allyesconfig
powerpc  allmodconfig
x86_64   randconfig-a003-20200827
x86_64   randconfig-a002-20200827
x86_64   randconfig-a001-20200827
x86_64   randconfig-a005-20200827
x86_64   randconfig-a006-20200827
x86_64   randconfig-a004-20200827
i386 randconfig-a002-20200828
i386 randconfig-a005-20200828
i386 randconfig-a003-20200828
i386 randconfig-a004-20200828
i386 randconfig-a001-20200828
i386 randconfig-a006-20200828
x86_64   randconfig-a015-20200828
x86_64   randconfig-a012-20200828
x86_64   randconfig-a016-20200828
x86_64   randconfig-a014-20200828
x86_64   randconfig-a011-20200828
x86_64   randconfig-a013-20200828
i386 randconfig-a013-20200828
i386 randconfig-a012-20200828
i386 randconfig-a011-20200828
i386 randconfig-a016-20200828
i386 randconfig-a014-20200828
i386 randconfig-a015-20200828
riscv    allyesconfig
riscv    allnoconfig
riscv    defconfig
riscv    allmodconfig
x86_64   rhel
x86_64   rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[powerpc:merge] BUILD SUCCESS f5d3660ab83be7c3dc2c8a5d662bbac7bc1b092f

2020-08-28 Thread kernel test robot
 randconfig-a002-20200828
i386 randconfig-a005-20200828
i386 randconfig-a003-20200828
i386 randconfig-a004-20200828
i386 randconfig-a001-20200828
i386 randconfig-a006-20200828
x86_64   randconfig-a015-20200828
x86_64   randconfig-a012-20200828
x86_64   randconfig-a016-20200828
x86_64   randconfig-a014-20200828
x86_64   randconfig-a011-20200828
x86_64   randconfig-a013-20200828
i386 randconfig-a016-20200828
i386 randconfig-a014-20200828
i386 randconfig-a015-20200828
i386 randconfig-a013-20200828
i386 randconfig-a012-20200828
i386 randconfig-a011-20200828
x86_64   randconfig-a003-20200827
x86_64   randconfig-a002-20200827
x86_64   randconfig-a001-20200827
x86_64   randconfig-a005-20200827
x86_64   randconfig-a006-20200827
x86_64   randconfig-a004-20200827
riscv    allyesconfig
riscv    allnoconfig
riscv    defconfig
riscv    allmodconfig
x86_64   rhel
x86_64   allyesconfig
x86_64   rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[powerpc:next-test] BUILD SUCCESS WITH WARNING dd419a93bd9954cfdd333f8387a9a0d22218720d

2020-08-28 Thread kernel test robot
   randconfig-a002-20200827
x86_64   randconfig-a001-20200827
x86_64   randconfig-a005-20200827
x86_64   randconfig-a006-20200827
x86_64   randconfig-a004-20200827
i386 randconfig-a002-20200828
i386 randconfig-a005-20200828
i386 randconfig-a003-20200828
i386 randconfig-a004-20200828
i386 randconfig-a001-20200828
i386 randconfig-a006-20200828
x86_64   randconfig-a015-20200828
x86_64   randconfig-a012-20200828
x86_64   randconfig-a014-20200828
x86_64   randconfig-a011-20200828
x86_64   randconfig-a013-20200828
x86_64   randconfig-a016-20200828
i386 randconfig-a013-20200828
i386 randconfig-a012-20200828
i386 randconfig-a011-20200828
i386 randconfig-a016-20200828
i386 randconfig-a014-20200828
i386 randconfig-a015-20200828
riscv    allyesconfig
riscv    allnoconfig
riscv    defconfig
riscv    allmodconfig
x86_64   rhel
x86_64   allyesconfig
x86_64   rhel-7.6-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH v1 08/10] powerpc/pseries/iommu: Add ddw_property_create() and refactor enable_ddw()

2020-08-28 Thread Leonardo Bras
On Mon, 2020-08-24 at 15:07 +1000, Alexey Kardashevskiy wrote:
> 
> On 18/08/2020 09:40, Leonardo Bras wrote:
> > Code used to create a ddw property that was previously scattered in
> > enable_ddw() is now gathered in ddw_property_create(), which deals with
> > allocation and filling the property, letting it ready for
> > of_property_add(), which now occurs in sequence.
> > 
> > This created an opportunity to reorganize the second part of enable_ddw():
> > 
> > Without this patch enable_ddw() does, in order:
> > kzalloc() property & members, create_ddw(), fill ddwprop inside property,
> > ddw_list_add(), do tce_setrange_multi_pSeriesLP_walk in all memory,
> > of_add_property().
> > 
> > With this patch enable_ddw() does, in order:
> > create_ddw(), ddw_property_create(), of_add_property(), ddw_list_add(),
> > do tce_setrange_multi_pSeriesLP_walk in all memory.
> > 
> > This change requires of_remove_property() in case anything fails after
> > of_add_property(), but we get to do tce_setrange_multi_pSeriesLP_walk
> > in all memory, which looks the most expensive operation, only if
> > everything else succeeds.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  arch/powerpc/platforms/pseries/iommu.c | 97 +++---
> >  1 file changed, 57 insertions(+), 40 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > b/arch/powerpc/platforms/pseries/iommu.c
> > index 4031127c9537..3a1ef02ad9d5 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -1123,6 +1123,31 @@ static void reset_dma_window(struct pci_dev *dev, 
> > struct device_node *par_dn)
> >  ret);
> >  }
> >  
> > +static int ddw_property_create(struct property **ddw_win, const char 
> > *propname,
> 
> @propname is always the same, do you really want to pass it every time?

I think it reads better, like "create a ddw property with this name".
Also, it makes it possible to create ddw properties with other names, in
case we decide to create properties with different names depending on
the window created.

Also, it's probably optimized / inlined at this point.
Is it ok doing it like this?

> 
> > +  u32 liobn, u64 dma_addr, u32 page_shift, u32 
> > window_shift)
> > +{
> > +   struct dynamic_dma_window_prop *ddwprop;
> > +   struct property *win64;
> > +
> > +   *ddw_win = win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
> > +   if (!win64)
> > +   return -ENOMEM;
> > +
> > +   win64->name = kstrdup(propname, GFP_KERNEL);
> 
> Not clear why "win64->name = DIRECT64_PROPNAME" would not work here, the
> generic OF code does not try kfree() it but it is probably out of scope
> here.

Yeah, I had that question too. 
The previous code was like that, and I was trying not to mess too much
with how it's done.

> > +   ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
> > +   win64->value = ddwprop;
> > +   win64->length = sizeof(*ddwprop);
> > +   if (!win64->name || !win64->value)
> > +   return -ENOMEM;
> 
> Up to 2 memory leaks here. I see the cleanup at "out_free_prop:" but
> still looks fragile. Instead you could simply return win64 as the only
> error possible here is -ENOMEM and returning NULL is equally good.

I agree. It's better if this function has its own cleanup routine.
It will be fixed for next version.
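
Roughly what I have in mind for v2 (sketch only, not the final patch): let
the helper own its error path and return the property, or NULL, so the
caller never sees a half-built one:

static struct property *ddw_property_create(const char *propname, u32 liobn,
					    u64 dma_addr, u32 page_shift,
					    u32 window_shift)
{
	struct dynamic_dma_window_prop *ddwprop;
	struct property *win64;

	win64 = kzalloc(sizeof(*win64), GFP_KERNEL);
	if (!win64)
		return NULL;

	win64->name = kstrdup(propname, GFP_KERNEL);
	ddwprop = kzalloc(sizeof(*ddwprop), GFP_KERNEL);
	win64->value = ddwprop;
	win64->length = sizeof(*ddwprop);
	if (!win64->name || !win64->value) {
		/* Clean up locally instead of leaking to the caller. */
		kfree(win64->name);
		kfree(win64->value);
		kfree(win64);
		return NULL;
	}

	ddwprop->liobn = cpu_to_be32(liobn);
	ddwprop->dma_base = cpu_to_be64(dma_addr);
	ddwprop->tce_shift = cpu_to_be32(page_shift);
	ddwprop->window_shift = cpu_to_be32(window_shift);

	return win64;
}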

> 
> 
> > +
> > +   ddwprop->liobn = cpu_to_be32(liobn);
> > +   ddwprop->dma_base = cpu_to_be64(dma_addr);
> > +   ddwprop->tce_shift = cpu_to_be32(page_shift);
> > +   ddwprop->window_shift = cpu_to_be32(window_shift);
> > +
> > +   return 0;
> > +}
> > +
> >  /*
> >   * If the PE supports dynamic dma windows, and there is space for a table
> >   * that can map all pages in a linear offset, then setup such a table,
> > @@ -1140,12 +1165,11 @@ static bool enable_ddw(struct pci_dev *dev, struct 
> > device_node *pdn)
> > struct ddw_query_response query;
> > struct ddw_create_response create;
> > int page_shift;
> > -   u64 max_addr;
> > +   u64 max_addr, win_addr;
> > struct device_node *dn;
> > u32 ddw_avail[DDW_APPLICABLE_SIZE];
> > struct direct_window *window;
> > -   struct property *win64;
> > -   struct dynamic_dma_window_prop *ddwprop;
> > +   struct property *win64 = NULL;
> > struct failed_ddw_pdn *fpdn;
> > bool default_win_removed = false;
> >  
> > @@ -1244,38 +1268,34 @@ static bool enable_ddw(struct pci_dev *dev, struct 
> > device_node *pdn)
> > goto out_failed;
> > }
> > len = order_base_2(max_addr);
> > -   win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
> > -   if (!win64) {
> > > -   dev_info(&dev->dev,
> > -   "couldn't allocate property for 64bit dma window\n");
> > +
> > > +   ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
> > +   if (ret != 0)
> 
> It is usually just "if (ret)"

It was previously like that, and all query_ddw() return-value checks are
written this way. Should I update them all or just this one?


Re: [PATCH v1 07/10] powerpc/pseries/iommu: Allow DDW windows starting at 0x00

2020-08-28 Thread Leonardo Bras
On Mon, 2020-08-24 at 13:44 +1000, Alexey Kardashevskiy wrote:
> 
> > On 18/08/2020 09:40, Leonardo Bras wrote:
> > enable_ddw() currently returns the address of the DMA window, which is
> > considered invalid if it has the value 0x00.
> > 
> > Also, it only considers valid an address returned from find_existing_ddw
> > if it's not 0x00.
> > 
> > Changing this behavior makes sense, given the users of enable_ddw() only
> > need to know if direct mapping is possible. It can also allow a DMA window
> > starting at 0x00 to be used.
> > 
> > This will be helpful for using a DDW with indirect mapping, as the window
> > address will be different than 0x00, but it will not map the whole
> > partition.
> > 
> > Signed-off-by: Leonardo Bras 
> > ---
> >  arch/powerpc/platforms/pseries/iommu.c | 30 --
> >  1 file changed, 14 insertions(+), 16 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> > b/arch/powerpc/platforms/pseries/iommu.c
> > index fcdefcc0f365..4031127c9537 100644
> > --- a/arch/powerpc/platforms/pseries/iommu.c
> > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > @@ -852,24 +852,25 @@ static void remove_ddw(struct device_node *np, bool 
> > remove_prop)
> > np, ret);
> >  }
> > >  
> > -static u64 find_existing_ddw(struct device_node *pdn)
> > +static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)
> >  {
> > struct direct_window *window;
> > const struct dynamic_dma_window_prop *direct64;
> > -   u64 dma_addr = 0;
> > +   bool found = false;
> >  
> > > spin_lock(&direct_window_list_lock);
> > > /* check if we already created a window and dupe that config if so */
> > > list_for_each_entry(window, &direct_window_list, list) {
> > if (window->device == pdn) {
> > direct64 = window->prop;
> > -   dma_addr = be64_to_cpu(direct64->dma_base);
> > +   *dma_addr = be64_to_cpu(direct64->dma_base);
> > +   found = true;
> > break;
> > }
> > }
> > > spin_unlock(&direct_window_list_lock);
> >  
> > -   return dma_addr;
> > +   return found;
> >  }
> >  
> >  static struct direct_window *ddw_list_add(struct device_node *pdn,
> > @@ -1131,15 +1132,15 @@ static void reset_dma_window(struct pci_dev *dev, 
> > struct device_node *par_dn)
> >   * pdn: the parent pe node with the ibm,dma_window property
> >   * Future: also check if we can remap the base window for our base page 
> > size
> >   *
> > - * returns the dma offset for use by the direct mapped DMA code.
> > + * returns true if can map all pages (direct mapping), false otherwise..
> >   */
> > -static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> > +static bool enable_ddw(struct pci_dev *dev, struct device_node *pdn)
> >  {
> > int len, ret;
> > struct ddw_query_response query;
> > struct ddw_create_response create;
> > int page_shift;
> > -   u64 dma_addr, max_addr;
> > +   u64 max_addr;
> > struct device_node *dn;
> > u32 ddw_avail[DDW_APPLICABLE_SIZE];
> > struct direct_window *window;
> > @@ -1150,8 +1151,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> > device_node *pdn)
> >  
> > > mutex_lock(&direct_window_init_mutex);
> > >  
> > > -   dma_addr = find_existing_ddw(pdn);
> > > -   if (dma_addr != 0)
> > > +   if (find_existing_ddw(pdn, &dev->dev.archdata.dma_offset))
> > goto out_unlock;
> >  
> > /*
> > @@ -1292,7 +1292,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> > device_node *pdn)
> > goto out_free_window;
> > }
> >  
> > -   dma_addr = be64_to_cpu(ddwprop->dma_base);
> > +   dev->dev.archdata.dma_offset = be64_to_cpu(ddwprop->dma_base);
> 
> Do not you need the same chunk in the find_existing_ddw() case above as
> well? Thanks,

The new signature of find_existing_ddw() is 
static bool find_existing_ddw(struct device_node *pdn, u64 *dma_addr)

And on enable_ddw(), we call 
find_existing_ddw(pdn, &dev->dev.archdata.dma_offset)

And inside the function we do:
*dma_addr = be64_to_cpu(direct64->dma_base);

I think it's the same as the chunk before.
Am I missing something?

> 
> 
> > goto out_unlock;
> >  
> >  out_free_window:
> > @@ -1309,6 +1309,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> > device_node *pdn)
> > kfree(win64->name);
> > kfree(win64->value);
> > kfree(win64);
> > +   win64 = NULL;
> >  
> >  out_failed:
> > if (default_win_removed)
> > @@ -1322,7 +1323,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct 
> > device_node *pdn)
> >  
> >  out_unlock:
> > > mutex_unlock(&direct_window_init_mutex);
> > -   return dma_addr;
> > +   return win64;
> >  }
> >  
> >  static void pci_dma_dev_setup_pSeriesLP(struct pci_dev *dev)
> > @@ -1401,11 +1402,8 @@ static bool iommu_bypass_supported_pSeriesLP(struct 
> > pci_dev *pdev, u64 dma_mask)
> > break;
> > }
> >  
> > -   if (pdn && PCI_DN(pdn)) {
> > -   

Re: [PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

2020-08-28 Thread peterz
On Fri, Aug 28, 2020 at 08:00:19PM +1000, Nicholas Piggin wrote:

> Closing this race only requires interrupts to be disabled while ->mm
> and ->active_mm are being switched, but the TLB problem requires also
> holding interrupts off over activate_mm. Unfortunately not all archs
> can do that yet, e.g., arm defers the switch if irqs are disabled and
> expects finish_arch_post_lock_switch() to be called to complete the
> flush; um takes a blocking lock in activate_mm().

ARM at least has activate_mm() := switch_mm(), so it could be made to
work.


[PATCH v2] powerpc/book3s64/radix: Fix boot failure with large amount of guest memory

2020-08-28 Thread Aneesh Kumar K.V
If the hypervisor doesn't support hugepages, the kernel ends up allocating a 
large
number of page table pages. The early page table allocation was wrongly
setting the max memblock limit to ppc64_rma_size with radix translation
which resulted in boot failure as shown below.

Kernel panic - not syncing:
early_alloc_pgtable: Failed to allocate 16777216 bytes align=0x100 nid=-1 
from=0x max_addr=0x
 CPU: 0 PID: 0 Comm: swapper Not tainted 5.8.0-24.9-default+ #2
 Call Trace:
 [c16f3d00] [c07c6470] dump_stack+0xc4/0x114 (unreliable)
 [c16f3d40] [c014c78c] panic+0x164/0x418
 [c16f3dd0] [c0098890] early_alloc_pgtable+0xe0/0xec
 [c16f3e60] [c10a5440] radix__early_init_mmu+0x360/0x4b4
 [c16f3ef0] [c1099bac] early_init_mmu+0x1c/0x3c
 [c16f3f10] [c109a320] early_setup+0x134/0x170

This was because the kernel was checking for the radix feature before we enable 
the
feature via mmu_features. This resulted in the kernel using hash restrictions on
radix.

Rework the early init code such that the kernel boots with memblock restrictions
as imposed by hash. At that point, the kernel still hasn't finalized the
translation the kernel will end up using.

We have three different ways of detecting radix.

1. dt_cpu_ftrs_scan -> used only in case of PowerNV
2. ibm,pa-features -> Used when we don't use cpu_dt_ftr_scan
3. CAS -> Where we negotiate with hypervisor about the supported translation.

We look at 1 or 2 early in the boot and after that, we look at the CAS vector to
finalize the translation the kernel will use. We also support a kernel command
line option (disable_radix) to switch to hash.

Update the memblock limit after mmu_early_init_devtree() if the kernel is going
to use radix translation. This forces some of the memblock allocations we do 
before
mmu_early_init_devtree() to be within the RMA limit.

Fixes: 2bfd65e45e87 ("powerpc/mm/radix: Add radix callbacks for early init 
routines")
Reviewed-by: Hari Bathini 
Reported-by: Shirisha Ganta 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/mmu.h | 10 +-
 arch/powerpc/mm/book3s64/radix_pgtable.c | 15 ---
 arch/powerpc/mm/init_64.c| 11 +--
 3 files changed, 14 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index 55442d45c597..b392384a3b15 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -239,14 +239,14 @@ static inline void early_init_mmu_secondary(void)
 
 extern void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
 phys_addr_t first_memblock_size);
-extern void radix__setup_initial_memory_limit(phys_addr_t first_memblock_base,
-phys_addr_t first_memblock_size);
 static inline void setup_initial_memory_limit(phys_addr_t first_memblock_base,
  phys_addr_t first_memblock_size)
 {
-   if (early_radix_enabled())
-   return radix__setup_initial_memory_limit(first_memblock_base,
-  first_memblock_size);
+   /*
+* Hash has more strict restrictions. At this point we don't
+* know which translations we will pick. Hence go with hash
+* restrictions.
+*/
return hash__setup_initial_memory_limit(first_memblock_base,
   first_memblock_size);
 }
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 28c784976bed..d5f0c10d752a 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -734,21 +734,6 @@ void radix__mmu_cleanup_all(void)
}
 }
 
-void radix__setup_initial_memory_limit(phys_addr_t first_memblock_base,
-   phys_addr_t first_memblock_size)
-{
-   /*
-* We don't currently support the first MEMBLOCK not mapping 0
-* physical on those processors
-*/
-   BUG_ON(first_memblock_base != 0);
-
-   /*
-* Radix mode is not limited by RMA / VRMA addressing.
-*/
-   ppc64_rma_size = ULONG_MAX;
-}
-
 #ifdef CONFIG_MEMORY_HOTPLUG
 static void free_pte_table(pte_t *pte_start, pmd_t *pmd)
 {
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 02e127fa5777..8459056cce67 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -433,9 +433,16 @@ void __init mmu_early_init_devtree(void)
if (!(mfmsr() & MSR_HV))
early_check_vec5();
 
-   if (early_radix_enabled())
+   if (early_radix_enabled()) {
radix__early_init_devtree();
-   else
+   /*
+* We have finalized the translation we 

[PATCH 4/4] powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

2020-08-28 Thread Nicholas Piggin
Commit 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of
single-threaded mm_cpumask") added a mechanism to trim the mm_cpumask of
a process under certain conditions. One of the assumptions is that
mm_users would not be incremented via a reference outside the process
context with mmget_not_zero() then go on to kthread_use_mm() via that
reference.

That invariant was broken by io_uring code (see previous sparc64 fix),
but I'll point Fixes: to the original powerpc commit because we are
changing that assumption going forward, so this will make backports
match up.

Fix this by no longer relying on that assumption, but by having each CPU
check the mm is not being used, and clearing their own bit from the mask
if it's okay. This fix relies on commit 38cf307c1f20 ("mm: fix
kthread_use_mm() vs TLB invalidate") to disable irqs over the mm switch,
and ARCH_WANT_IRQS_OFF_ACTIVATE_MM to be enabled.

Fixes: 0cef77c7798a7 ("powerpc/64s/radix: flush remote CPUs out of 
single-threaded mm_cpumask")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/tlb.h   | 13 -
 arch/powerpc/mm/book3s64/radix_tlb.c | 23 ---
 2 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index fbc6f3002f23..d97f061fecac 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -66,19 +66,6 @@ static inline int mm_is_thread_local(struct mm_struct *mm)
return false;
return cpumask_test_cpu(smp_processor_id(), mm_cpumask(mm));
 }
-static inline void mm_reset_thread_local(struct mm_struct *mm)
-{
-   WARN_ON(atomic_read(&mm->context.copros) > 0);
-   /*
-* It's possible for mm_access to take a reference on mm_users to
-* access the remote mm from another thread, but it's not allowed
-* to set mm_cpumask, so mm_users may be > 1 here.
-*/
-   WARN_ON(current->mm != mm);
-   atomic_set(&mm->context.active_cpus, 1);
-   cpumask_clear(mm_cpumask(mm));
-   cpumask_set_cpu(smp_processor_id(), mm_cpumask(mm));
-}
 #else /* CONFIG_PPC_BOOK3S_64 */
 static inline int mm_is_thread_local(struct mm_struct *mm)
 {
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0d233763441f..a421a0e3f930 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -645,19 +645,29 @@ static void do_exit_flush_lazy_tlb(void *arg)
struct mm_struct *mm = arg;
unsigned long pid = mm->context.id;
 
+   /*
+* A kthread could have done a mmget_not_zero() after the flushing CPU
+* checked mm_users == 1, and be in the process of kthread_use_mm when
+* interrupted here. In that case, current->mm will be set to mm,
+* because kthread_use_mm() setting ->mm and switching to the mm is
+* done with interrupts off.
+*/
if (current->mm == mm)
-   return; /* Local CPU */
+   goto out_flush;
 
if (current->active_mm == mm) {
-   /*
-* Must be a kernel thread because sender is single-threaded.
-*/
-   BUG_ON(current->mm);
+   WARN_ON_ONCE(current->mm != NULL);
+   /* Is a kernel thread and is using mm as the lazy tlb */
mmgrab(&init_mm);
-   switch_mm(mm, &init_mm, current);
current->active_mm = &init_mm;
+   switch_mm_irqs_off(mm, &init_mm, current);
mmdrop(mm);
}
+
+   atomic_dec(&mm->context.active_cpus);
+   cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
+
+out_flush:
_tlbiel_pid(pid, RIC_FLUSH_ALL);
 }
 
@@ -672,7 +682,6 @@ static void exit_flush_lazy_tlbs(struct mm_struct *mm)
 */
smp_call_function_many(mm_cpumask(mm), do_exit_flush_lazy_tlb,
(void *)mm, 1);
-   mm_reset_thread_local(mm);
 }
 
 void radix__flush_tlb_mm(struct mm_struct *mm)
-- 
2.23.0



[PATCH 3/4] sparc64: remove mm_cpumask clearing to fix kthread_use_mm race

2020-08-28 Thread Nicholas Piggin
The de facto (and apparently uncommented) standard for using an mm had,
thanks to this code in sparc if nothing else, been that you must have a
reference on mm_users *and that reference must have been obtained with
mmget()*, i.e., from a thread with a reference to mm_users that had used
the mm.

The introduction of mmget_not_zero() in commit d2005e3f41d4
("userfaultfd: don't pin the user memory in userfaultfd_file_create()")
allowed mm_count holders to operate on user mappings asynchronously
from the actual threads using the mm, but they were not to load those
mappings into their TLB (i.e., walking vmas and page tables is okay,
kthread_use_mm() is not).

io_uring 2b188cc1bb857 ("Add io_uring IO interface") added code which
does a kthread_use_mm() from a mmget_not_zero() refcount.

The problem with this is code which previously assumed mm == current->mm
and mm->mm_users == 1 implies the mm will remain single-threaded at
least until this thread creates another mm_users reference, has now
broken.

arch/sparc/kernel/smp_64.c:

if (atomic_read(&mm->mm_users) == 1) {
cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
goto local_flush_and_out;
}

vs fs/io_uring.c

if (unlikely(!(ctx->flags & IORING_SETUP_SQPOLL) ||
 !mmget_not_zero(ctx->sqo_mm)))
return -EFAULT;
kthread_use_mm(ctx->sqo_mm);

mmget_not_zero() could come in right after the mm_users == 1 test, then
kthread_use_mm() which sets its CPU in the mm_cpumask. That update could
be lost if cpumask_copy() occurs afterward.

I propose we fix this by allowing mmget_not_zero() to be a first-class
reference, and not have this obscure undocumented and unchecked
restriction.

The basic fix for sparc64 is to remove its mm_cpumask clearing code. The
optimisation could be effectively restored by sending IPIs to mm_cpumask
members and having them remove themselves from mm_cpumask. This is more
tricky so I leave it as an exercise for someone with a sparc64 SMP.
powerpc has a (currently similarly broken) example.
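
For illustration, the restored optimisation could look roughly like this
(hypothetical helpers, untested, mirroring the powerpc patch in this series):

static void trim_mm_cpumask_ipi(void *arg)
{
	struct mm_struct *mm = arg;

	/* Each CPU drops only its own bit, and only if it is not using mm. */
	if (current->mm != mm && current->active_mm != mm)
		cpumask_clear_cpu(smp_processor_id(), mm_cpumask(mm));
}

static void trim_mm_cpumask(struct mm_struct *mm)
{
	smp_call_function_many(mm_cpumask(mm), trim_mm_cpumask_ipi, mm, 1);
}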

Cc: sparcli...@vger.kernel.org
Signed-off-by: Nicholas Piggin 
---
 arch/sparc/kernel/smp_64.c | 65 --
 1 file changed, 14 insertions(+), 51 deletions(-)

diff --git a/arch/sparc/kernel/smp_64.c b/arch/sparc/kernel/smp_64.c
index e286e2badc8a..e38d8bf454e8 100644
--- a/arch/sparc/kernel/smp_64.c
+++ b/arch/sparc/kernel/smp_64.c
@@ -1039,38 +1039,9 @@ void smp_fetch_global_pmu(void)
  * are flush_tlb_*() routines, and these run after flush_cache_*()
  * which performs the flushw.
  *
- * The SMP TLB coherency scheme we use works as follows:
- *
- * 1) mm->cpu_vm_mask is a bit mask of which cpus an address
- *space has (potentially) executed on, this is the heuristic
- *we use to avoid doing cross calls.
- *
- *Also, for flushing from kswapd and also for clones, we
- *use cpu_vm_mask as the list of cpus to make run the TLB.
- *
- * 2) TLB context numbers are shared globally across all processors
- *in the system, this allows us to play several games to avoid
- *cross calls.
- *
- *One invariant is that when a cpu switches to a process, and
- *that processes tsk->active_mm->cpu_vm_mask does not have the
- *current cpu's bit set, that tlb context is flushed locally.
- *
- *If the address space is non-shared (ie. mm->count == 1) we avoid
- *cross calls when we want to flush the currently running process's
- *tlb state.  This is done by clearing all cpu bits except the current
- *processor's in current->mm->cpu_vm_mask and performing the
- *flush locally only.  This will force any subsequent cpus which run
- *this task to flush the context from the local tlb if the process
- *migrates to another cpu (again).
- *
- * 3) For shared address spaces (threads) and swapping we bite the
- *bullet for most cases and perform the cross call (but only to
- *the cpus listed in cpu_vm_mask).
- *
- *The performance gain from "optimizing" away the cross call for threads is
- *questionable (in theory the big win for threads is the massive sharing of
- *address space state across processors).
+ * mm->cpu_vm_mask is a bit mask of which cpus an address
+ * space has (potentially) executed on, this is the heuristic
+ * we use to limit cross calls.
  */
 
 /* This currently is only used by the hugetlb arch pre-fault
@@ -1080,18 +1051,13 @@ void smp_fetch_global_pmu(void)
 void smp_flush_tlb_mm(struct mm_struct *mm)
 {
u32 ctx = CTX_HWBITS(mm->context);
-   int cpu = get_cpu();
 
-   if (atomic_read(&mm->mm_users) == 1) {
-   cpumask_copy(mm_cpumask(mm), cpumask_of(cpu));
-   goto local_flush_and_out;
-   }
+   get_cpu();
 
smp_cross_call_masked(&xcall_flush_tlb_mm,
  ctx, 0, 0,
  mm_cpumask(mm));
 
-local_flush_and_out:
__flush_tlb_mm(ctx, SECONDARY_CONTEXT);
 
put_cpu();
@@ -1114,17 +1080,15 @@ void 

[PATCH 2/4] powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM

2020-08-28 Thread Nicholas Piggin
powerpc uses IPIs in some situations to switch a kernel thread away
from a lazy tlb mm, which is subject to the TLB flushing race
described in the changelog introducing ARCH_WANT_IRQS_OFF_ACTIVATE_MM.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig   | 1 +
 arch/powerpc/include/asm/mmu_context.h | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1f48bbfb3ce9..65cb32211574 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -149,6 +149,7 @@ config PPC
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
select ARCH_USE_QUEUED_SPINLOCKSif PPC_QUEUED_SPINLOCKS
select ARCH_WANT_IPC_PARSE_VERSION
+   select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select ARCH_WEAK_RELEASE_ACQUIRE
select BINFMT_ELF
select BUILDTIME_TABLE_SORT
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 7f3658a97384..e02aa793420b 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -244,7 +244,7 @@ static inline void switch_mm(struct mm_struct *prev, struct 
mm_struct *next,
  */
 static inline void activate_mm(struct mm_struct *prev, struct mm_struct *next)
 {
-   switch_mm(prev, next, current);
+   switch_mm_irqs_off(prev, next, current);
 }
 
 /* We don't currently use enter_lazy_tlb() for anything */
-- 
2.23.0



[PATCH 1/4] mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race

2020-08-28 Thread Nicholas Piggin
Reading and modifying current->mm and current->active_mm and switching
mm should be done with irqs off, to prevent races seeing an intermediate
state.

This is similar to commit 38cf307c1f20 ("mm: fix kthread_use_mm() vs TLB
invalidate"). At exec-time when the new mm is activated, the old one
should usually be single-threaded and no longer used, unless something
else is holding an mm_users reference (which may be possible).

Absent other mm_users, there is also a race with preemption and lazy tlb
switching. Consider the kernel_execve case where the current thread is
using a lazy tlb active mm:

  call_usermodehelper()
kernel_execve()
  old_mm = current->mm;
  active_mm = current->active_mm;
  *** preempt *** -------------------->  schedule()
   prev->active_mm = NULL;
   mmdrop(prev active_mm);
 ...
  <--------------------  schedule()
  current->mm = mm;
  current->active_mm = mm;
  if (!old_mm)
  mmdrop(active_mm);

If we switch back to the kernel thread from a different mm, there is a
double free of the old active_mm, and a missing free of the new one.

Closing this race only requires interrupts to be disabled while ->mm
and ->active_mm are being switched, but the TLB problem requires also
holding interrupts off over activate_mm. Unfortunately not all archs
can do that yet, e.g., arm defers the switch if irqs are disabled and
expects finish_arch_post_lock_switch() to be called to complete the
flush; um takes a blocking lock in activate_mm().

So as a first step, disable interrupts across the mm/active_mm updates
to close the lazy tlb preempt race, and provide an arch option to
extend that to activate_mm which allows architectures doing IPI based
TLB shootdowns to close the second race.

This is a bit ugly, but in the interest of fixing the bug and backporting
before all architectures are converted this is a compromise.

Signed-off-by: Nicholas Piggin 
---
 arch/Kconfig |  7 +++
 fs/exec.c| 17 +++--
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index af14a567b493..94821e3f94d1 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -414,6 +414,13 @@ config MMU_GATHER_NO_GATHER
bool
depends on MMU_GATHER_TABLE_FREE
 
+config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
+   bool
+   help
+ Temporary select until all architectures can be converted to have
+ irqs disabled over activate_mm. Architectures that do IPI based TLB
+ shootdowns should enable this.
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
bool
 
diff --git a/fs/exec.c b/fs/exec.c
index a91003e28eaa..d4fb18baf1fb 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1130,11 +1130,24 @@ static int exec_mmap(struct mm_struct *mm)
}
 
task_lock(tsk);
-   active_mm = tsk->active_mm;
membarrier_exec_mmap(mm);
-   tsk->mm = mm;
+
+   local_irq_disable();
+   active_mm = tsk->active_mm;
tsk->active_mm = mm;
+   tsk->mm = mm;
+   /*
+* This prevents preemption while active_mm is being loaded and
+* it and mm are being updated, which could cause problems for
+* lazy tlb mm refcounting when these are updated by context
+* switches. Not all architectures can handle irqs off over
+* activate_mm yet.
+*/
+   if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+   local_irq_enable();
activate_mm(active_mm, mm);
+   if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
+   local_irq_enable();
tsk->mm->vmacache_seqnum = 0;
vmacache_flush(tsk);
task_unlock(tsk);
-- 
2.23.0



[PATCH 0/4] more mm switching vs TLB shootdown and lazy tlb

2020-08-28 Thread Nicholas Piggin
This is an attempt to fix a few different related issues around
switching mm, TLB flushing, and lazy tlb mm handling.

This will require all architectures to eventually move to disabling
irqs over activate_mm, but it's possible we could add another arch
call after irqs are re-enabled for those few which can't do their
entire activation with irqs disabled.
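
For illustration, the extra arch call floated above might look something
like this (a purely hypothetical shape, not an existing interface):

	/* Hypothetical hook, sketched from the paragraph above:
	 * architectures that cannot keep irqs off across the whole
	 * activation would do the minimal irq-off part in activate_mm()
	 * and finish the rest here, after exec_mmap() has re-enabled
	 * interrupts. */
	#ifndef arch_activate_mm_done
	static inline void arch_activate_mm_done(struct mm_struct *prev,
						 struct mm_struct *next)
	{
	}
	#endif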

I'd like some feedback on the sparc/powerpc vs kthread_use_mm
problem too.

Thanks,
Nick

Nicholas Piggin (4):
  mm: fix exec activate_mm vs TLB shootdown and lazy tlb switching race
  powerpc: select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
  sparc64: remove mm_cpumask clearing to fix kthread_use_mm race
  powerpc/64s/radix: Fix mm_cpumask trimming race vs kthread_use_mm

 arch/Kconfig   |  7 +++
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/include/asm/mmu_context.h |  2 +-
 arch/powerpc/include/asm/tlb.h | 13 --
 arch/powerpc/mm/book3s64/radix_tlb.c   | 23 ++---
 arch/sparc/kernel/smp_64.c | 65 ++
 fs/exec.c  | 17 ++-
 7 files changed, 54 insertions(+), 74 deletions(-)

-- 
2.23.0



Re: [PATCHv5 2/2] powerpc/pseries: update device tree before ejecting hotplug uevents

2020-08-28 Thread Pingfan Liu
On Thu, Aug 27, 2020 at 3:53 PM Laurent Dufour  wrote:
>
> Le 10/08/2020 à 10:52, Pingfan Liu a écrit :
> > A bug is observed on pseries by taking the following steps on rhel:
> > -1. drmgr -c mem -r -q 5
> > -2. echo c > /proc/sysrq-trigger
> >
> > And then, the failure looks like:
> > kdump: saving to /sysroot//var/crash/127.0.0.1-2020-01-16-02:06:14/
> > kdump: saving vmcore-dmesg.txt
> > kdump: saving vmcore-dmesg.txt complete
> > kdump: saving vmcore
> >   Checking for memory holes : [  0.0 %] /
> >   Checking for memory holes : [100.0 %] |
> >   Excluding unnecessary pages : [100.0 %] \
> >   Copying data : [  0.3 %] -  eta: 38s
> > [   44.337636] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > [   44.337663] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> > [   44.337677] hash-mmu: mm: Hashing failure ! EA=0x7fffba40 access=0x8004 current=makedumpfile
> > [   44.337692] hash-mmu: trap=0x300 vsid=0x13a109c ssize=1 base psize=2 psize 2 pte=0xc0005504
> > [   44.337708] makedumpfile[469]: unhandled signal 7 at 7fffba40 nip 7fffbbc4d7fc lr 00011356ca3c code 2
> > [   44.338548] Core dump to |/bin/false pipe failed
> > /lib/kdump-lib-initramfs.sh: line 98:   469 Bus error   $CORE_COLLECTOR /proc/vmcore $_mp/$KDUMP_PATH/$HOST_IP-$DATEDIR/vmcore-incomplete
> > kdump: saving vmcore failed
> >
> > * Root cause *
> > After analyzing, it turns out that in the current implementation, when
> > hot-removing an lmb, the KOBJ_REMOVE event is ejected before the dt is
> > updated, because __remove_memory() is called before drmem_update_dt().
> > So in the kdump kernel, read_from_oldmem() resorts to
> > pSeries_lpar_hpte_insert() to install an hpte, which fails with -2 due
> > to a non-existent pfn. Finally, low_hash_fault() raises SIGBUS to the
> > process, which is observed as the "Bus error" above.
> >
> > From a publisher/listener viewpoint, the publisher notifies the
> > listener before the data is ready. This introduces a problem where udev
> > launches kexec-tools (due to KOBJ_REMOVE) and loads a stale dt before
> > the update. In the capture kernel, makedumpfile then accesses memory
> > based on the stale dt info and hits a SIGBUS due to a non-existent lmb.
> >
> > * Fix *
> > This bug was introduced by commit 063b8b1251fd
> > ("powerpc/pseries/memory-hotplug: Only update DT once per memory DLPAR
> > request"), which tried to combine all the dt updates into one.
> >
> > To fix this issue without introducing quadratic runtime complexity via
> > the nested iteration
> >     dlpar_memory_add_by_count
> >       for_each_drmem_lmb            <--
> >         dlpar_add_lmb
> >           drmem_update_dt(_v1|_v2)
> >             for_each_drmem_lmb      <--
> > the dt should still be updated only once, just before the last memory
> > online/offline event is ejected to user space. Achieve this by tracking
> > the number of lmbs added or removed (see the sketch below).
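
An editorial sketch of that counting model follows (flow and the
change_one_lmb() helper name are illustrative only, not the patch body,
which is in the truncated diff below):

	static int dlpar_change_lmbs(u32 lmbs_to_change)
	{
		struct drmem_lmb *lmb;
		u32 remaining = lmbs_to_change;

		for_each_drmem_lmb(lmb) {
			if (!remaining)
				break;
			/* publish an up-to-date dt before the last uevent */
			if (remaining == 1)
				drmem_update_dt();
			if (change_one_lmb(lmb) == 0)	/* illustrative helper */
				remaining--;
		}

		return remaining ? -EINVAL : 0;
	}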
> >
> > Signed-off-by: Pingfan Liu 
> > Cc: Michael Ellerman 
> > Cc: Hari Bathini 
> > Cc: Nathan Lynch 
> > Cc: Nathan Fontenot 
> > Cc: Laurent Dufour 
> > To: linuxppc-dev@lists.ozlabs.org
> > Cc: ke...@lists.infradead.org
> > ---
> > v4 -> v5: change dlpar_add_lmb()/dlpar_remove_lmb() prototype to report
> >whether dt is updated successfully.
> >Fix a condition boundary check bug
> > v3 -> v4: resolve a quadratic runtime complexity issue.
> >This series is applied on next-test branch
> >   arch/powerpc/platforms/pseries/hotplug-memory.c | 102 +++-
> >   1 file changed, 80 insertions(+), 22 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > index 46cbcd1..1567d9f 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> > @@ -350,13 +350,22 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
> >   return true;
> >   }
> >
> > -static int dlpar_add_lmb(struct drmem_lmb *);
> > +enum dt_update_status {
> > + DT_NOUPDATE,
> > + DT_TOUPDATE,
> > + DT_UPDATED,
> > +};
> > +
> > +/* "*dt_update" returns DT_UPDATED if updated */
> > +static int dlpar_add_lmb(struct drmem_lmb *lmb,
> > + enum dt_update_status *dt_update);
> >
> > -static int dlpar_remove_lmb(struct drmem_lmb *lmb)
> > +static int dlpar_remove_lmb(struct drmem_lmb *lmb,
> > + enum dt_update_status *dt_update)
> >   {
> >   unsigned long block_sz;
> >   phys_addr_t base_addr;
> > - int rc, nid;
> > + int rc, ret, nid;
> >
> >   if (!lmb_is_removable(lmb))
> >   return -EINVAL;
> > @@ -372,6 +381,13 

[PATCH v2 5/5] powerpc/vdso: Declare vdso_patches[] as __initdata

2020-08-28 Thread Christophe Leroy
vdso_patches[] table is used only at init time.

Mark it __initdata.
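
As a generic illustration of what __initdata buys (not code from this
patch): the object lands in .init.data, which the kernel frees once boot
completes, so it may only be referenced from __init code.

	#include <linux/init.h>

	static int example_table[] __initdata = { 1, 2, 3 };	/* freed after boot */

	static int __init example_init(void)
	{
		return example_table[0] ? 0 : -EINVAL;	/* __init code may use it */
	}
	arch_initcall(example_init);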

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/vdso.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 4ad042995ccc..dfa08a7b4e7c 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -76,7 +76,7 @@ struct vdso_patch_def
  * Currently, we only change sync_dicache to do nothing on processors
  * with a coherent icache
  */
-static struct vdso_patch_def vdso_patches[] = {
+static struct vdso_patch_def vdso_patches[] __initdata = {
{
CPU_FTR_COHERENT_ICACHE, CPU_FTR_COHERENT_ICACHE,
"__kernel_sync_dicache", "__kernel_sync_dicache_p5"
-- 
2.25.0



[PATCH v2 4/5] powerpc/vdso: Declare constant vars as __ro_after_init

2020-08-28 Thread Christophe Leroy
To avoid any risk of modification of vital VDSO variables,
declare them __ro_after_init.

vdso32_kbase and vdso64_kbase could be made 'const', but that would
have a high impact on all functions using them, as the compiler doesn't
expect the const property to be discarded.
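
For reference, a generic __ro_after_init illustration (not from this
patch): the variable stays writable during boot and is made read-only
when mark_rodata_ro() runs (with CONFIG_STRICT_KERNEL_RWX), so a stray
late write faults instead of silently corrupting vital VDSO state.

	#include <linux/cache.h>
	#include <linux/init.h>

	static unsigned long example_base __ro_after_init;

	static int __init example_setup(void)
	{
		example_base = 0x10000;	/* allowed: still init time */
		return 0;
	}
	arch_initcall(example_setup);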

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/vdso.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index fb393266b9cb..4ad042995ccc 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -38,19 +38,19 @@
 #define VDSO_ALIGNMENT (1 << 16)
 
 extern char vdso32_start, vdso32_end;
-static unsigned int vdso32_pages;
-static void *vdso32_kbase = &vdso32_start;
-unsigned long vdso32_sigtramp;
-unsigned long vdso32_rt_sigtramp;
+static unsigned int vdso32_pages __ro_after_init;
+static void *vdso32_kbase __ro_after_init = &vdso32_start;
+unsigned long vdso32_sigtramp __ro_after_init;
+unsigned long vdso32_rt_sigtramp __ro_after_init;
 
 extern char vdso64_start, vdso64_end;
-static void *vdso64_kbase = &vdso64_start;
-static unsigned int vdso64_pages;
+static void *vdso64_kbase __ro_after_init = &vdso64_start;
+static unsigned int vdso64_pages __ro_after_init;
 #ifdef CONFIG_PPC64
-unsigned long vdso64_rt_sigtramp;
+unsigned long vdso64_rt_sigtramp __ro_after_init;
 #endif /* CONFIG_PPC64 */
 
-static int vdso_ready;
+static int vdso_ready __ro_after_init;
 
 /*
  * The vdso data page (aka. systemcfg for old ppc64 fans) is here.
-- 
2.25.0



[PATCH v2 3/5] powerpc/vdso: Initialise vdso32_kbase at compile time

2020-08-28 Thread Christophe Leroy
Initialise vdso32_kbase at compile time like vdso64_kbase.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/vdso.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 8f245e988a8a..fb393266b9cb 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -37,13 +37,12 @@
 /* The alignment of the vDSO */
 #define VDSO_ALIGNMENT (1 << 16)
 
+extern char vdso32_start, vdso32_end;
 static unsigned int vdso32_pages;
-static void *vdso32_kbase;
+static void *vdso32_kbase = &vdso32_start;
 unsigned long vdso32_sigtramp;
 unsigned long vdso32_rt_sigtramp;
 
-extern char vdso32_start, vdso32_end;
-
 extern char vdso64_start, vdso64_end;
 static void *vdso64_kbase = &vdso64_start;
 static unsigned int vdso64_pages;
@@ -689,8 +688,6 @@ static int __init vdso_init(void)
 */
	vdso64_pages = (&vdso64_end - &vdso64_start) >> PAGE_SHIFT;
 
-	vdso32_kbase = &vdso32_start;
-
/*
 * Calculate the size of the 32 bits vDSO
 */
-- 
2.25.0



[PATCH v2 1/5] powerpc/vdso: Remove DBG()

2020-08-28 Thread Christophe Leroy
DBG() is defined as a no-op when DEBUG is not defined,
and DEBUG is explicitly undefined here.

This means the only way to get the messages printed
is to modify the source code.

It was most likely useful in the early days of the VDSO, but
today the only three DBG() calls don't deserve special
handling.

Just remove them. If one day someone needs such messages back,
a standard pr_debug() or equivalent can be used.
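
For reference, a sketch of what a pr_debug() based replacement could look
like (illustrative only): with CONFIG_DYNAMIC_DEBUG the message can be
enabled at runtime through debugfs, with no source edit.

	#define pr_fmt(fmt) "vdso: " fmt
	#include <linux/printk.h>

	static void example_report(const char *gen_name, const char *fix_name)
	{
		/* compiled out unless DEBUG or dynamic debug enables it */
		pr_debug("replacing %s with %s\n", gen_name,
			 fix_name ? fix_name : "NONE");
	}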

Signed-off-by: Christophe Leroy 
---
This is a follow up series, applying on top of the series that
switches powerpc VDSO to _install_special_mapping(),
rebased on today's powerpc/next-test (dd419a93bd99)

v2 removes the modification to arch_setup_additional_pages() that handled
is_32bit_task() returning true when CONFIG_VDSO32 is not set, as this
should never happen.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/vdso.c | 13 -
 1 file changed, 13 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index e2568d9ecdff..3ef3fc546ac8 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -31,14 +31,6 @@
 #include 
 #include 
 
-#undef DEBUG
-
-#ifdef DEBUG
-#define DBG(fmt...) printk(fmt)
-#else
-#define DBG(fmt...)
-#endif
-
 /* Max supported size for symbol names */
#define MAX_SYMNAME	64
 
@@ -567,9 +559,6 @@ static __init int vdso_fixup_alt_funcs(struct lib32_elfinfo *v32,
if (!match)
continue;
 
-   DBG("replacing %s with %s...\n", patch->gen_name,
-   patch->fix_name ? "NONE" : patch->fix_name);
-
/*
 * Patch the 32 bits and 64 bits symbols. Note that we do not
 * patch the "." symbol on 64 bits.
@@ -704,7 +693,6 @@ static int __init vdso_init(void)
 * Calculate the size of the 64 bits vDSO
 */
	vdso64_pages = (&vdso64_end - &vdso64_start) >> PAGE_SHIFT;
-   DBG("vdso64_kbase: %p, 0x%x pages\n", vdso64_kbase, vdso64_pages);
 
	vdso32_kbase = &vdso32_start;
 
@@ -712,7 +700,6 @@ static int __init vdso_init(void)
 * Calculate the size of the 32 bits vDSO
 */
	vdso32_pages = (&vdso32_end - &vdso32_start) >> PAGE_SHIFT;
-   DBG("vdso32_kbase: %p, 0x%x pages\n", vdso32_kbase, vdso32_pages);
 
/*
 * Setup the syscall map in the vDOS
-- 
2.25.0



[PATCH v2 2/5] powerpc/vdso: Don't rely on vdso_pages being 0 for failure

2020-08-28 Thread Christophe Leroy
If vdso initialisation failed, vdso_ready is not set.
Otherwise, vdso_pages is only 0 when it is a 32-bit task
and CONFIG_VDSO32 is not selected.

As arch_setup_additional_pages() now bails out directly in
that case, we don't need to set vdso_pages to 0.
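
For context, an assumed shape of that bail-out (editorial sketch; the
series this applies on top of is not quoted here):

	int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
	{
		if (!vdso_ready)
			return 0;	/* vdso_init() failed, run without vDSO */
		if (is_32bit_task() && !IS_ENABLED(CONFIG_VDSO32))
			return 0;	/* no 32-bit vDSO built in */
		/* ... map vdso_pages plus the data page ... */
		return 0;
	}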

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/vdso.c | 23 ++-
 1 file changed, 6 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index 3ef3fc546ac8..8f245e988a8a 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -176,11 +176,6 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 
current->mm->context.vdso_base = 0;
 
-   /* vDSO has a problem and was disabled, just don't "enable" it for the
-* process
-*/
-   if (vdso_pages == 0)
-   return 0;
/* Add a page to the vdso size for the data page */
vdso_pages ++;
 
@@ -710,14 +705,16 @@ static int __init vdso_init(void)
 * Initialize the vDSO images in memory, that is do necessary
 * fixups of vDSO symbols, locate trampolines, etc...
 */
-   if (vdso_setup())
-   goto setup_failed;
+   if (vdso_setup()) {
+   pr_err("vDSO setup failure, not enabled !\n");
+   return 0;
+   }
 
if (IS_ENABLED(CONFIG_VDSO32)) {
/* Make sure pages are in the correct state */
		pagelist = kcalloc(vdso32_pages + 1, sizeof(struct page *), GFP_KERNEL);
if (!pagelist)
-   goto alloc_failed;
+   return 0;
 
pagelist[0] = virt_to_page(vdso_data);
 
@@ -730,7 +727,7 @@ static int __init vdso_init(void)
if (IS_ENABLED(CONFIG_PPC64)) {
		pagelist = kcalloc(vdso64_pages + 1, sizeof(struct page *), GFP_KERNEL);
if (!pagelist)
-   goto alloc_failed;
+   return 0;
 
pagelist[0] = virt_to_page(vdso_data);
 
@@ -743,14 +740,6 @@ static int __init vdso_init(void)
smp_wmb();
vdso_ready = 1;
 
-   return 0;
-
-setup_failed:
-   pr_err("vDSO setup failure, not enabled !\n");
-alloc_failed:
-   vdso32_pages = 0;
-   vdso64_pages = 0;
-
return 0;
 }
 arch_initcall(vdso_init);
-- 
2.25.0