Re: [PATCH v2 3/3] powerpc/powernv: Always stop secondaries before reboot/shutdown

2018-04-02 Thread Vasant Hegde

On 04/01/2018 04:06 PM, Nicholas Piggin wrote:

Currently powernv reboot and shutdown requests just leave secondaries
to do their own things. This is undesirable because they can trigger
any number of watchdogs while waiting for reboot, but also we don't
know what else they might be doing -- they might be causing trouble,
trampling memory, etc.

The opal scheduled flash update code already ran into watchdog problems
due to flashing taking a long time, and it was fixed with 2196c6f1ed
("powerpc/powernv: Return secondary CPUs to firmware before FW update"),
which returns secondaries to opal. It's been found that regular reboots
can take over 10 seconds, which can result in the hard lockup watchdog
firing,

   reboot: Restarting system
   [  360.038896709,5] OPAL: Reboot request...
   Watchdog CPU:0 Hard LOCKUP
   Watchdog CPU:44 detected Hard LOCKUP other CPUS:16
   Watchdog CPU:16 Hard LOCKUP
   watchdog: BUG: soft lockup - CPU#16 stuck for 3s! [swapper/16:0]

This patch removes the special case for flash update, and calls
smp_send_stop in all cases before calling reboot/shutdown.

smp_send_stop could return CPUs to OPAL; the main reason not to is
that the request could come from an NMI that interrupts OPAL code,
so re-entering OPAL can cause a number of problems. Putting
secondaries into simple spin loops improves the chances of a
successful reboot.
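
For illustration, the spin-loop approach amounts to something like the
following (a minimal sketch, not the exact patch; the real stop_this_cpu()
lives in arch/powerpc/kernel/smp.c and the name here is made up):

    /* Sketch: park a secondary CPU without re-entering OPAL. */
    static void stop_this_cpu_sketch(void *dummy)
    {
            /* Drop out of the online mask so nobody waits on us. */
            set_cpu_online(smp_processor_id(), false);

            /* Hard disable interrupts so nothing pulls us back out. */
            hard_irq_disable();

            /* Spin forever; no calls back into OPAL from here. */
            while (1)
                    ;
    }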

Cc: Vasant Hegde 


Nick,

Patch looks good to me. We have tested this patchset on an FSP-based system and 
everything looks fine.


Reviewed-by:  Vasant Hegde 

-Vasant


Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/include/asm/opal.h |  2 +-
  arch/powerpc/platforms/powernv/opal-flash.c | 28 +---
  arch/powerpc/platforms/powernv/setup.c  | 15 +--
  3 files changed, 7 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index dde60089d0d4..7159e1a6a61a 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -325,7 +325,7 @@ struct rtc_time;
  extern unsigned long opal_get_boot_time(void);
  extern void opal_nvram_init(void);
  extern void opal_flash_update_init(void);
-extern void opal_flash_term_callback(void);
+extern void opal_flash_update_print_message(void);
  extern int opal_elog_init(void);
  extern void opal_platform_dump_init(void);
  extern void opal_sys_param_init(void);
diff --git a/arch/powerpc/platforms/powernv/opal-flash.c b/arch/powerpc/platforms/powernv/opal-flash.c
index 1cb0b895a236..b37015101bf6 100644
--- a/arch/powerpc/platforms/powernv/opal-flash.c
+++ b/arch/powerpc/platforms/powernv/opal-flash.c
@@ -303,26 +303,9 @@ static int opal_flash_update(int op)
return rc;
  }

-/* Return CPUs to OPAL before starting FW update */
-static void flash_return_cpu(void *info)
-{
-   int cpu = smp_processor_id();
-
-   if (!cpu_online(cpu))
-   return;
-
-   /* Disable IRQ */
-   hard_irq_disable();
-
-   /* Return the CPU to OPAL */
-   opal_return_cpu();
-}
-
  /* This gets called just before system reboots */
-void opal_flash_term_callback(void)
+void opal_flash_update_print_message(void)
  {
-   struct cpumask mask;
-
if (update_flash_data.status != FLASH_IMG_READY)
return;

@@ -333,15 +316,6 @@ void opal_flash_term_callback(void)

/* Small delay to help getting the above message out */
msleep(500);
-
-   /* Return secondary CPUs to firmware */
-   cpumask_copy(&mask, cpu_online_mask);
-   cpumask_clear_cpu(smp_processor_id(), &mask);
-   if (!cpumask_empty(&mask))
-   smp_call_function_many(&mask,
-  flash_return_cpu, NULL, false);
-   /* Hard disable interrupts */
-   hard_irq_disable();
  }

  /*
diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
index 5f963286232f..ef8c9ce53a61 100644
--- a/arch/powerpc/platforms/powernv/setup.c
+++ b/arch/powerpc/platforms/powernv/setup.c
@@ -201,17 +201,12 @@ static void pnv_prepare_going_down(void)
 */
opal_event_shutdown();

-   /* Soft disable interrupts */
-   local_irq_disable();
+   /* Print flash update message if one is scheduled. */
+   opal_flash_update_print_message();

-   /*
-* Return secondary CPUs to firwmare if a flash update
-* is pending otherwise we will get all sort of error
-* messages about CPU being stuck etc.. This will also
-* have the side effect of hard disabling interrupts so
-* past this point, the kernel is effectively dead.
-*/
-   opal_flash_term_callback();
+   smp_send_stop();
+
+   hard_irq_disable();
  }

  static void  __noreturn pnv_restart(char *cmd)





Re: [PATCH 4/5] powerpc/pm: add sleep and deep sleep on QorIQ SoCs

2018-04-02 Thread kbuild test robot
Hi Ran,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on v4.16 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Ran-Wang/powerpc-pm-Fix-suspend-n-in-menuconfig-for-e500mc-platforms/20180330-072848
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-ppc64e_defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc

All warnings (new ones prefixed by >>):

   arch/powerpc/platforms/85xx/sleep.S: Assembler messages:
>> arch/powerpc/platforms/85xx/sleep.S:1174: Warning: invalid register 
>> expression

vim +1174 arch/powerpc/platforms/85xx/sleep.S

  1137  
  1138  _GLOBAL(fsl_booke_deep_sleep_resume)
  1139  /* disable interrupts */
  1140  FSL_DIS_ALL_IRQ
  1141  
  1142  /* switch to 64-bit mode */
  1143  bl  .enable_64b_mode
  1144  
  1145  /* set TOC pointer */
  1146  bl  .relative_toc
  1147  
  1148  /* setup initial TLBs, switch to kernel space ... */
  1149  bl  .start_initialization_book3e
  1150  
  1151  /* address space changed, set TOC pointer again */
  1152  bl  .relative_toc
  1153  
  1154  /* call a cpu state restore handler */
  1155  LOAD_REG_ADDR(r23, cur_cpu_spec)
  1156  ld  r23,0(r23)
  1157  ld  r23,CPU_SPEC_RESTORE(r23)
  1158  cmpdi   0,r23,0
  1159  beq 1f
  1160  ld  r23,0(r23)
  1161  mtctr   r23
  1162  bctrl
  1163  1:
  1164  LOAD_REG_ADDR(r3, regs_buffer)
  1165  bl  e5500_cpu_state_restore
  1166  
  1167  /* Load each CAM entry */
  1168  LOAD_REG_ADDR(r3, tlbcam_index)
  1169  lwz r3, 0(r3)
  1170  mtctr   r3
  1171  li  r0, 0
  1172  3:  mr  r3, r0
  1173  bl  loadcam_entry
> 1174  addi    r0, r0, 1
  1175  bdnz3b
  1176  
  1177  /* restore return address */
  1178  LOAD_REG_ADDR(r3, buf_tmp)
  1179  ld  r4, 16(r3)
  1180  mtspr   SPRN_TCR, r4
  1181  ld  r4, 0(r3)
  1182  mtlr    r4
  1183  ld  r4, 8(r3)
  1184  mtmsr   r4
  1185  ld  r4, 24(r3)
  1186  mtcr    r4
  1187  
  1188  blr
  1189  

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH v3 6/7] powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.

2018-04-02 Thread Pingfan Liu
I think CMA has protected us from hot-remove, so this patch is not necessary.

Regards,
Pingfan

On Mon, Apr 2, 2018 at 2:30 PM, Mahesh J Salgaonkar
 wrote:
> From: Mahesh Salgaonkar 
>
> For fadump to work successfully there should not be any holes in the reserved
> memory ranges where the kernel has asked firmware to move the content of the
> old kernel's memory in the event of a crash. But this memory area is currently
> not protected from hot-remove operations. Hence, the fadump service can fail to
> re-register after a hot-remove operation if the hot-removed memory belongs
> to the fadump reserved region. To avoid this, make sure that memory from the
> fadump reserved area is not hot-removable while fadump is registered.
>
> However, if the user still wants to remove that memory, they can do so by
> manually stopping the fadump service before the hot-remove operation.
>
> Signed-off-by: Mahesh Salgaonkar 
> ---
>  arch/powerpc/include/asm/fadump.h   |2 +-
>  arch/powerpc/kernel/fadump.c|   10 --
>  arch/powerpc/platforms/pseries/hotplug-memory.c |7 +--
>  3 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/fadump.h 
> b/arch/powerpc/include/asm/fadump.h
> index 44b6ebfe9be6..d16dc77107a8 100644
> --- a/arch/powerpc/include/asm/fadump.h
> +++ b/arch/powerpc/include/asm/fadump.h
> @@ -207,7 +207,7 @@ struct fad_crash_memory_ranges {
> unsigned long long  size;
>  };
>
> -extern int is_fadump_boot_memory_area(u64 addr, ulong size);
> +extern int is_fadump_memory_area(u64 addr, ulong size);
>  extern int early_init_dt_scan_fw_dump(unsigned long node,
> const char *uname, int depth, void *data);
>  extern int fadump_reserve_mem(void);
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 59aaf0357a52..2c3c7e655eec 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -162,13 +162,19 @@ int __init early_init_dt_scan_fw_dump(unsigned long 
> node,
>
>  /*
>   * If fadump is registered, check if the memory provided
> - * falls within boot memory area.
> + * falls within boot memory area and reserved memory area.
>   */
> -int is_fadump_boot_memory_area(u64 addr, ulong size)
> +int is_fadump_memory_area(u64 addr, ulong size)
>  {
> +   u64 d_start = fw_dump.reserve_dump_area_start;
> +   u64 d_end = d_start + fw_dump.reserve_dump_area_size;
> +
> if (!fw_dump.dump_registered)
> return 0;
>
> +   if (((addr + size) > d_start) && (addr <= d_end))
> +   return 1;
> +
> return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
>  }
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
> b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index c1578f54c626..e4c658cda3a7 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -389,8 +389,11 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
> phys_addr = lmb->base_addr;
>
>  #ifdef CONFIG_FA_DUMP
> -   /* Don't hot-remove memory that falls in fadump boot memory area */
> -   if (is_fadump_boot_memory_area(phys_addr, block_sz))
> +   /*
> +* Don't hot-remove memory that falls in fadump boot memory area
> +* and memory that is reserved for capturing old kernel memory.
> +*/
> +   if (is_fadump_memory_area(phys_addr, block_sz))
> return false;
>  #endif
>
>


Re: [PATCH v2 1/3] powerpc: use NMI IPI for smp_send_stop

2018-04-02 Thread Nicholas Piggin
On Tue, 3 Apr 2018 10:13:26 +1000
Balbir Singh  wrote:

> On Sun,  1 Apr 2018 20:36:13 +1000
> Nicholas Piggin  wrote:
> 
> > Use the NMI IPI rather than smp_call_function for smp_send_stop.
> > Have stopped CPUs hard disable interrupts rather than just soft
> > disable.
> > 
> > This function is used in crash/panic/shutdown paths to bring other
> > CPUs down as quickly and reliably as possible, and minimizing their
> > potential to cause trouble.
> > 
> > Avoiding the Linux smp_call_function infrastructure and (if supported)
> > using true NMI IPIs makes this more robust.
> > 
> > Signed-off-by: Nicholas Piggin 
> > ---
> >  arch/powerpc/kernel/smp.c | 8 
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index cfc08b099c49..db88660bf6bd 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -565,7 +565,11 @@ void crash_send_ipi(void (*crash_ipi_callback)(struct 
> > pt_regs *))
> >  }
> >  #endif
> >  
> > +#ifdef CONFIG_NMI_IPI
> > +static void stop_this_cpu(struct pt_regs *regs)
> > +#else
> >  static void stop_this_cpu(void *dummy)
> > +#endif
> >  {
> > /* Remove this CPU */
> > set_cpu_online(smp_processor_id(), false);
> > @@ -577,7 +581,11 @@ static void stop_this_cpu(void *dummy)
> >  
> >  void smp_send_stop(void)
> >  {
> > +#ifdef CONFIG_NMI_IPI
> > +   smp_send_nmi_ipi(NMI_IPI_ALL_OTHERS, stop_this_cpu, 100);  
> 
> I wonder if the delay_us should be a function of the number of CPUs and the
> callee should figure this out on its own? Maybe not in this series, but
> in the longer run.

It possibly should. The big per-cpu/core serialized delay is in firmware
(worst case we have to do a special wakeup on each core and bring it out
of deep stop), and that gets done before the delay starts. So we count
delay from after all threads are woken and given their sreset command, so
this is probably okay even for very big systems.

But it should at least be a #define constant if not something more
sophisticated like you suggest. We have a few delays like that including
in xmon and other crash paths that could use some more thought.
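
Something along these lines, perhaps (purely illustrative; the constant name
and the scaling are made up, not from any posted patch):

    /* Hypothetical replacement for the bare "100" passed to smp_send_nmi_ipi(). */
    #define SMP_SEND_STOP_BASE_DELAY_US     100

    static u64 smp_send_stop_delay_us(void)
    {
            /* Give very large systems a little extra slack per possible CPU. */
            return SMP_SEND_STOP_BASE_DELAY_US + num_possible_cpus();
    }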

Thanks,
Nick


> 
> 
> Balbir Singh.



Re: [RFC PATCH v2 0/2] Randomization of address chosen by mmap.

2018-04-02 Thread Ilya Smith
> On 29 Mar 2018, at 00:07, Luck, Tony  wrote:
> 
>> The default limit of only 65536 VMAs will also quickly come into play
>> if consecutive anon mmaps don't get merged. Of course this can be
>> raised, but it has significant resource and performance (fork) costs.
> 
> Could the random mmap address chooser look for how many existing
> VMAs have space before/after and the right attributes to merge with the
> new one you want to create? If this is above some threshold (100?) then
> pick one of them randomly and allocate the new address so that it will
> merge from below/above with an existing one.
> 
> That should still give you a very high degree of randomness, but prevent
> out of control numbers of VMAs from being created.

I think this wouldn't work. For example, these 100 allocations may have happened
during process initialization, but when an attacker comes to the server, all of
his allocations would be made at predictable offsets from each other. So in the
end we did nothing except decrease the performance of the first 100 allocations.
I think I can add an ioctl to turn off this randomization per process, which
could be used if needed, for example if an application is going to allocate a
big chunk of memory or create a lot of memory pressure, etc.

Best regards,
Ilya


Re: [PATCH v2 1/3] powerpc: use NMI IPI for smp_send_stop

2018-04-02 Thread Balbir Singh
On Sun,  1 Apr 2018 20:36:13 +1000
Nicholas Piggin  wrote:

> Use the NMI IPI rather than smp_call_function for smp_send_stop.
> Have stopped CPUs hard disable interrupts rather than just soft
> disable.
> 
> This function is used in crash/panic/shutdown paths to bring other
> CPUs down as quickly and reliably as possible, and minimizing their
> potential to cause trouble.
> 
> Avoiding the Linux smp_call_function infrastructure and (if supported)
> using true NMI IPIs makes this more robust.
> 
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/kernel/smp.c | 8 
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index cfc08b099c49..db88660bf6bd 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -565,7 +565,11 @@ void crash_send_ipi(void (*crash_ipi_callback)(struct 
> pt_regs *))
>  }
>  #endif
>  
> +#ifdef CONFIG_NMI_IPI
> +static void stop_this_cpu(struct pt_regs *regs)
> +#else
>  static void stop_this_cpu(void *dummy)
> +#endif
>  {
>   /* Remove this CPU */
>   set_cpu_online(smp_processor_id(), false);
> @@ -577,7 +581,11 @@ static void stop_this_cpu(void *dummy)
>  
>  void smp_send_stop(void)
>  {
> +#ifdef CONFIG_NMI_IPI
> + smp_send_nmi_ipi(NMI_IPI_ALL_OTHERS, stop_this_cpu, 100);

I wonder if the delay_us should be a function of the number of CPUs and the
callee should figure this out on its own? Maybe not in this series, but
in the longer run.


Balbir Singh.


Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> This change is inspired by the Peter's proposal patch [1] which was
> protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
> that particular case, and it is introducing major performance degradation
> due to excessive scheduling operations.
> 
> To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
> is protecting its access using a rwlock.  As the mm_rb tree is an O(log n)
> search it is safe to protect it using such a lock.  The VMA cache is not
> protected by the new rwlock and it should not be used without holding the
> mmap_sem.
> 
> To allow the picked VMA structure to be used once the rwlock is released, a
> use count is added to the VMA structure. When the VMA is allocated it is
> set to 1.  Each time the VMA is picked with the rwlock held its use count
> is incremented. Each time the VMA is released it is decremented. When the
> use count hits zero, this means that the VMA is no longer used and should be
> freed.
> 
> This patch is preparing for 2 kinds of VMA access:
>  - as usual, under the control of the mmap_sem,
>  - without holding the mmap_sem for the speculative page fault handler.
> 
> Access done under the control of the mmap_sem doesn't require grabbing the
> rwlock to protect read access to the mm_rb tree, but access in write must
> be done under the protection of the rwlock too. This affects inserting and
> removing of elements in the RB tree.
> 
> The patch is introducing 2 new functions:
>  - vma_get() to find a VMA based on an address by holding the new rwlock.
>  - vma_put() to release the VMA when it is no longer used.
> These services are designed to be used when access are made to the RB tree
> without holding the mmap_sem.
> 
> When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
> we rely on the WMB done when releasing the rwlock to serialize the write
> with the RMB done in a later patch to check for the VMA's validity.
> 
> When free_vma is called, the file associated with the VMA is closed
> immediately, but the policy and the file structure remain in use until
> the VMA's use count reaches 0, which may happen later when exiting an
> in-progress speculative page fault.
> 
> [1] https://patchwork.kernel.org/patch/5108281/
> 
> Cc: Peter Zijlstra (Intel) 
> Cc: Matthew Wilcox 
> Signed-off-by: Laurent Dufour 

Can __free_vma() be generalized for mm/nommu.c's delete_vma() and 
do_mmap()?
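
For reference, the get/put pattern described above looks roughly like this
(a minimal sketch; the lock and field names such as mm_rb_lock, vm_ref_count
and __find_vma_rb() are illustrative, not necessarily the patch's exact API):

    /* Look up a VMA under the rbtree rwlock and pin it with a use count. */
    struct vm_area_struct *vma_get(struct mm_struct *mm, unsigned long addr)
    {
            struct vm_area_struct *vma;

            read_lock(&mm->mm_rb_lock);             /* protects mm_rb only, not the vmacache */
            vma = __find_vma_rb(mm, addr);          /* hypothetical plain rbtree walk */
            if (vma)
                    atomic_inc(&vma->vm_ref_count); /* keep it alive past the unlock */
            read_unlock(&mm->mm_rb_lock);

            return vma;
    }

    /* Drop a reference; the last user frees the VMA. */
    void vma_put(struct vm_area_struct *vma)
    {
            if (atomic_dec_and_test(&vma->vm_ref_count))
                    __free_vma(vma);
    }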


Re: [PATCH v9 16/24] mm: Introduce __page_add_new_anon_rmap()

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> When dealing with the speculative page fault handler, we may race with the VMA
> being split or merged. In this case the vma->vm_start and vma->vm_end
> fields may not match the address the page fault is occurring at.
> 
> This can only happen when the VMA is split, but in that case the
> anon_vma pointer of the new VMA will be the same as the original one,
> because in __split_vma the new->anon_vma is set to src->anon_vma when
> *new = *vma.
> 
> So even if the VMA boundaries are not correct, the anon_vma pointer is
> still valid.
> 
> If the VMA has been merged, then the VMA in which it has been merged
> must have the same anon_vma pointer otherwise the merge can't be done.
> 
> So in all cases we know that the anon_vma is valid, since we have
> checked before starting the speculative page fault that the anon_vma
> pointer is valid for this VMA, and since there is an anon_vma this
> means that at one time a page has been backed and that before the VMA
> is cleaned, the page table lock would have to be grabbed to clean the
> PTE, and the anon_vma field is checked once the PTE is locked.
> 
> This patch introduces a new __page_add_new_anon_rmap() service which
> doesn't check for the VMA boundaries, and creates a new inline one
> which does the check.
> 
> When called from a page fault handler, if this is not a speculative one,
> there is a guarantee that vm_start and vm_end match the faulting address,
> so this check is useless. In the context of the speculative page fault
> handler, this check may be wrong but anon_vma is still valid as explained
> above.
> 
> Signed-off-by: Laurent Dufour 

I'm indifferent on this: it could be argued both ways that the new 
function and its variant for a simple VM_BUG_ON() aren't worth it and that 
the check should rather be done in the callers of page_add_new_anon_rmap().  
It feels like it would be better left to the caller, with a comment added 
to page_add_anon_rmap() itself in mm/rmap.c.
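
For context, the split being discussed has roughly this shape (a sketch of
the idea only, not the exact patch):

    /* Does the work, without any boundary check (speculative path). */
    void __page_add_new_anon_rmap(struct page *page,
                                  struct vm_area_struct *vma,
                                  unsigned long address, bool compound);

    /* Regular path keeps the sanity check on the VMA boundaries. */
    static inline void page_add_new_anon_rmap(struct page *page,
                                              struct vm_area_struct *vma,
                                              unsigned long address,
                                              bool compound)
    {
            VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
            __page_add_new_anon_rmap(page, vma, address, compound);
    }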


Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a84ddc218bbd..73b8b99f482b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1263,8 +1263,11 @@ struct zap_details {
>   pgoff_t last_index; /* Highest page->index to unmap 
> */
>  };
>  
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -  pte_t pte, bool with_public_device);
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +   pte_t pte, bool with_public_device,
> +   unsigned long vma_flags);
> +#define _vm_normal_page(vma, addr, pte, with_public_device) \
> + __vm_normal_page(vma, addr, pte, with_public_device, (vma)->vm_flags)
>  #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
>  
>  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long 
> addr,

If _vm_normal_page() is a static inline function, does something break?  
It's nice to avoid the #defines.
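
i.e. something like this instead of the macro (a sketch, assuming the
__vm_normal_page() prototype from the hunk above):

    static inline struct page *_vm_normal_page(struct vm_area_struct *vma,
                                               unsigned long addr, pte_t pte,
                                               bool with_public_device)
    {
            /* Same expansion as the #define, but type-checked. */
            return __vm_normal_page(vma, addr, pte, with_public_device,
                                    vma->vm_flags);
    }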

> diff --git a/mm/memory.c b/mm/memory.c
> index af0338fbc34d..184a0d663a76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -826,8 +826,9 @@ static void print_bad_pte(struct vm_area_struct *vma, 
> unsigned long addr,
>  #else
>  # define HAVE_PTE_SPECIAL 0
>  #endif
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -  pte_t pte, bool with_public_device)
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +   pte_t pte, bool with_public_device,
> +   unsigned long vma_flags)
>  {
>   unsigned long pfn = pte_pfn(pte);
>  

Would it be possible to update the comment since the function itself is no 
longer named vm_normal_page?


Re: [PATCH v9 14/24] mm: Introduce __maybe_mkwrite()

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dfa81a638b7c..a84ddc218bbd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -684,13 +684,18 @@ void free_compound_page(struct page *page);
>   * pte_mkwrite.  But get_user_pages can cause write faults for mappings
>   * that do not have writing enabled, when used by access_process_vm.
>   */
> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +static inline pte_t __maybe_mkwrite(pte_t pte, unsigned long vma_flags)
>  {
> - if (likely(vma->vm_flags & VM_WRITE))
> + if (likely(vma_flags & VM_WRITE))
>   pte = pte_mkwrite(pte);
>   return pte;
>  }
>  
> +static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> + return __maybe_mkwrite(pte, vma->vm_flags);
> +}
> +
>  int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
>   struct page *page);
>  int finish_fault(struct vm_fault *vmf);
> diff --git a/mm/memory.c b/mm/memory.c
> index 0a0a483d9a65..af0338fbc34d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2472,7 +2472,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>  
>   flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>   entry = pte_mkyoung(vmf->orig_pte);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>   if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
>   update_mmu_cache(vma, vmf->address, vmf->pte);
>   pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -2549,8 +2549,8 @@ static int wp_page_copy(struct vm_fault *vmf)
>   inc_mm_counter_fast(mm, MM_ANONPAGES);
>   }
>   flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> - entry = mk_pte(new_page, vma->vm_page_prot);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + entry = mk_pte(new_page, vmf->vma_page_prot);
> + entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>   /*
>* Clear the pte entry and flush it first, before updating the
>* pte with the new entry. This will avoid a race condition

Don't you also need to do this in do_swap_page()?

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3067,9 +3067,9 @@ int do_swap_page(struct vm_fault *vmf)
 
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
-   pte = mk_pte(page, vma->vm_page_prot);
+   pte = mk_pte(page, vmf->vma_page_prot);
if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
-   pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+   pte = __maybe_mkwrite(pte_mkdirty(pte), vmf->vma_flags);
vmf->flags &= ~FAULT_FLAG_WRITE;
ret |= VM_FAULT_WRITE;
exclusive = RMAP_EXCLUSIVE;


Re: [PATCH v9 13/24] mm: Introduce __lru_cache_add_active_or_unevictable

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> The speculative page fault handler which is run without holding the
> mmap_sem is calling lru_cache_add_active_or_unevictable() but the vm_flags
> is not guaranteed to remain constant.
> Introducing __lru_cache_add_active_or_unevictable() which has the vma flags
> value parameter instead of the vma pointer.
> 
> Signed-off-by: Laurent Dufour 

Acked-by: David Rientjes 


Re: [PATCH v9 12/24] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> migrate_misplaced_page() is only called during the page fault handling so
> it's better to pass the pointer to the struct vm_fault instead of the vma.
> 
> This way during the speculative page fault path the saved vma->vm_flags
> could be used.
> 
> Signed-off-by: Laurent Dufour 

Acked-by: David Rientjes 


Re: [PATCH v9 11/24] mm: Cache some VMA fields in the vm_fault structure

2018-04-02 Thread David Rientjes
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef6ef0627090..dfa81a638b7c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -359,6 +359,12 @@ struct vm_fault {
>* page table to avoid allocation from
>* atomic context.
>*/
> + /*
> +  * These entries are required when handling speculative page fault.
> +  * This way the page handling is done using consistent field values.
> +  */
> + unsigned long vma_flags;
> + pgprot_t vma_page_prot;
>  };
>  
>  /* page entry size for vm->huge_fault() */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 446427cafa19..f71db2b42b30 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3717,6 +3717,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct 
> vm_area_struct *vma,
>   .vma = vma,
>   .address = address,
>   .flags = flags,
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,
>   /*
>* Hard to debug if it ends up being
>* used by a callee that assumes
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 32314e9e48dd..a946d5306160 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -882,6 +882,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct 
> *mm,
>   .flags = FAULT_FLAG_ALLOW_RETRY,
>   .pmd = pmd,
>   .pgoff = linear_page_index(vma, address),
> + .vma_flags = vma->vm_flags,
> + .vma_page_prot = vma->vm_page_prot,
>   };
>  
>   /* we only decide to swapin, if there is enough young ptes */
> diff --git a/mm/memory.c b/mm/memory.c
> index 0200340ef089..46fe92b93682 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2615,7 +2615,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>* Don't let another task, with possibly unlocked vma,
>* keep the mlocked page.
>*/
> - if (page_copied && (vma->vm_flags & VM_LOCKED)) {
> + if (page_copied && (vmf->vma_flags & VM_LOCKED)) {
>   lock_page(old_page);/* LRU manipulation */
>   if (PageMlocked(old_page))
>   munlock_vma_page(old_page);

Doesn't wp_page_copy() also need to pass this to anon_vma_prepare() so 
that find_mergeable_anon_vma() works correctly?

> @@ -2649,7 +2649,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>   */
>  int finish_mkwrite_fault(struct vm_fault *vmf)
>  {
> - WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
> + WARN_ON_ONCE(!(vmf->vma_flags & VM_SHARED));
>   if (!pte_map_lock(vmf))
>   return VM_FAULT_RETRY;
>   /*
> @@ -2751,7 +2751,7 @@ static int do_wp_page(struct vm_fault *vmf)
>* We should not cow pages in a shared writeable mapping.
>* Just mark the pages writable and/or call ops->pfn_mkwrite.
>*/
> - if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> + if ((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
>(VM_WRITE|VM_SHARED))
>   return wp_pfn_shared(vmf);
>  
> @@ -2798,7 +2798,7 @@ static int do_wp_page(struct vm_fault *vmf)
>   return VM_FAULT_WRITE;
>   }
>   unlock_page(vmf->page);
> - } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> + } else if (unlikely((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
>   (VM_WRITE|VM_SHARED))) {
>   return wp_page_shared(vmf);
>   }
> @@ -3067,7 +3067,7 @@ int do_swap_page(struct vm_fault *vmf)
>  
>   inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>   dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
> - pte = mk_pte(page, vma->vm_page_prot);
> + pte = mk_pte(page, vmf->vma_page_prot);
>   if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
>   pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>   vmf->flags &= ~FAULT_FLAG_WRITE;
> @@ -3093,7 +3093,7 @@ int do_swap_page(struct vm_fault *vmf)
>  
>   swap_free(entry);
>   if (mem_cgroup_swap_full(page) ||
> - (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> + (vmf->vma_flags & VM_LOCKED) || PageMlocked(page))
>   try_to_free_swap(page);
>   unlock_page(page);
>   if (page != swapcache && swapcache) {
> @@ -3150,7 +3150,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
>   pte_t entry;
>  
>   /* File mapping without ->vm_ops ? */
> - if (vma->vm_flags & VM_SHARED)
> + if (vmf->vma_flags & VM_SHARED)
>  

[PATCH] tty:hvcs: Move Assignment for hvcsd->port.count

2018-04-02 Thread Bryant G. Ly
Move the assignment to hvcsd->port.count out from within the
if () condition. It is bad practice to have an assignment within
an if () condition.
Signed-off-by: Bryant G. Ly 
---
 drivers/tty/hvc/hvcs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/hvc/hvcs.c b/drivers/tty/hvc/hvcs.c
index 1db1d97..333dc76 100644
--- a/drivers/tty/hvc/hvcs.c
+++ b/drivers/tty/hvc/hvcs.c
@@ -1202,7 +1202,8 @@ static void hvcs_close(struct tty_struct *tty, struct 
file *filp)
hvcsd = tty->driver_data;
 
spin_lock_irqsave(>lock, flags);
-   if (--hvcsd->port.count == 0) {
+   --hvcsd->port.count;
+   if (hvcsd->port.count == 0) {
 
vio_disable_interrupts(hvcsd->vdev);
 
-- 
2.7.2



Re: [PATCH 0/3] stop secondaries for reboot/shutdown

2018-04-02 Thread ppaidipe

On 2018-04-01 16:06, Nicholas Piggin wrote:

I'm rebasing and resending this series because the hard
lockup messages are being seen in the field.

Since last time, one of the patches was merged, and patches
were split and reordered a bit. No major changes.

This still hasn't been tested with the FSP firmware update.



I have tested this series on an FSP platform with firmware update
and multiple reboot/shutdown tests. I have not seen any issues.

Tested-by: Pridhiviraj Paidipeddi 

Thanks
Pridhiviraj


Thanks,
Nick

Nicholas Piggin (3):
  powerpc: use NMI IPI for smp_send_stop
  powerpc: hard disable irqs in smp_send_stop loop
  powerpc/powernv: Always stop secondaries before reboot/shutdown

 arch/powerpc/include/asm/opal.h |  2 +-
 arch/powerpc/kernel/smp.c   | 13 +++--
 arch/powerpc/platforms/powernv/opal-flash.c | 28 +---
 arch/powerpc/platforms/powernv/setup.c  | 15 +--
 4 files changed, 18 insertions(+), 40 deletions(-)




Re: RFC on writel and writel_relaxed

2018-04-02 Thread Sinan Kaya
On 3/29/2018 9:40 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt 
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
 Let's fix all archs, it's way easier than fixing all drivers. Half of
 the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention to any other architecture besides arm64.
> 
> Thanks for going through that exercise !
> 
> Once sparc, s390, microblaze and mips reply, I think we'll have a good
> coverage, maybe riscv is to put in that lot too.


I posted the following two patches for supporting microblaze and unicore32. 

[PATCH v2 1/2] io: prevent compiler reordering on the default writeX() implementation
[PATCH v2 2/2] io: prevent compiler reordering on the default readX() implementation
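
The idea behind those patches is roughly the following (a sketch of the
generic accessor shape only, not the exact asm-generic/io.h code):

    /* Keep the compiler from reordering memory accesses around MMIO writes. */
    static inline void writel(u32 value, volatile void __iomem *addr)
    {
            barrier();              /* order prior accesses before the store */
            __raw_writel(cpu_to_le32(value), addr);
            barrier();              /* and later accesses after it */
    }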

The rest of the arches except mips and alpha seem OK.
I sent a question email on Friday to the mips and alpha mailing lists. I'll
follow up with an actual patch today.


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm 
Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH V2] powerpc: Fix oops due to wrong access of lppaca on non virtualized platform

2018-04-02 Thread Nicholas Piggin
On Mon,  2 Apr 2018 13:03:37 +0530
"Aneesh Kumar K.V"  wrote:

> Commit 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not
> virtualized") removed allocation of the lppaca on non-virtualized platforms.
> But with CONFIG_PPC_SPLPAR enabled, we do access the lppaca on non-virtualized
> platforms in some code paths.
> 
> Fix this by adding a runtime check for virtualized platforms.
> 
> Fixes: 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not 
> virtualized")
> 
> Signed-off-by: Aneesh Kumar K.V 

Thanks, this looks good to me...

> ---
> Changes from V1:
> * update commit summary
> 
>  arch/powerpc/include/asm/lppaca.h   | 3 +++
>  arch/powerpc/include/asm/spinlock.h | 2 ++
>  arch/powerpc/kernel/time.c  | 3 +++
>  3 files changed, 8 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/lppaca.h 
> b/arch/powerpc/include/asm/lppaca.h
> index 65d589689f01..7c23ce8a5a4c 100644
> --- a/arch/powerpc/include/asm/lppaca.h
> +++ b/arch/powerpc/include/asm/lppaca.h
> @@ -34,6 +34,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /*
>   * The lppaca is the "virtual processor area" registered with the hypervisor,
> @@ -114,6 +115,8 @@ struct lppaca {
>  
>  static inline bool lppaca_shared_proc(struct lppaca *l)
>  {
> + if (!firmware_has_feature(FW_FEATURE_SPLPAR))
> + return false;
>   return !!(l->__old_status & LPPACA_OLD_SHARED_PROC);
>  }
>  

We should eventually clean this up. There's no need to pass in an lppaca
here AFAIKS. Is it even per-CPU, does it change dynamically? We could set
it up in a FW_FEATURE parameter at boot, which could make some of
these tests cheaper.
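
i.e. latch the answer once at boot, something along these lines (purely
illustrative; init_shared_proc_mode() is a made-up hook, not an existing
kernel function):

    static bool shared_proc_mode __read_mostly;

    void __init init_shared_proc_mode(void)
    {
            /* Decide once, instead of dereferencing an lppaca on every call. */
            if (firmware_has_feature(FW_FEATURE_SPLPAR))
                    shared_proc_mode = !!(get_lppaca()->__old_status &
                                          LPPACA_OLD_SHARED_PROC);
    }

    static inline bool lppaca_shared_proc(void)
    {
            return shared_proc_mode;
    }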

> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index b9ebc3085fb7..72dc4ddc2972 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -56,6 +56,8 @@
>  #define vcpu_is_preempted vcpu_is_preempted
>  static inline bool vcpu_is_preempted(int cpu)
>  {
> + if (!firmware_has_feature(FW_FEATURE_SPLPAR))
> + return false;
>   return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
>  }
>  #endif
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index f7d96a68ecaa..360e71d455cc 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -266,6 +266,9 @@ void accumulate_stolen_time(void)
>  
>  static inline u64 calculate_stolen_time(u64 stop_tb)
>  {
> + if (!firmware_has_feature(FW_FEATURE_SPLPAR))
> + return 0;
> +
>   if (get_paca()->dtl_ridx != be64_to_cpu(get_lppaca()->dtl_idx))
>   return scan_dispatch_log(stop_tb);
>  



[PATCH V2] powerpc: Fix oops due to wrong access of lppaca on non virtualized platform

2018-04-02 Thread Aneesh Kumar K.V
Commit 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not
virtualized") removed allocation of the lppaca on non-virtualized platforms. But
with CONFIG_PPC_SPLPAR enabled, we do access the lppaca on non-virtualized
platforms in some code paths.

Fix this by adding a runtime check for virtualized platforms.

Fixes: 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not 
virtualized")

Signed-off-by: Aneesh Kumar K.V 
---
Changes from V1:
* update commit summary

 arch/powerpc/include/asm/lppaca.h   | 3 +++
 arch/powerpc/include/asm/spinlock.h | 2 ++
 arch/powerpc/kernel/time.c  | 3 +++
 3 files changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/lppaca.h 
b/arch/powerpc/include/asm/lppaca.h
index 65d589689f01..7c23ce8a5a4c 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * The lppaca is the "virtual processor area" registered with the hypervisor,
@@ -114,6 +115,8 @@ struct lppaca {
 
 static inline bool lppaca_shared_proc(struct lppaca *l)
 {
+   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   return false;
return !!(l->__old_status & LPPACA_OLD_SHARED_PROC);
 }
 
diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index b9ebc3085fb7..72dc4ddc2972 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -56,6 +56,8 @@
 #define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
+   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   return false;
return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
 }
 #endif
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f7d96a68ecaa..360e71d455cc 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -266,6 +266,9 @@ void accumulate_stolen_time(void)
 
 static inline u64 calculate_stolen_time(u64 stop_tb)
 {
+   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   return 0;
+
if (get_paca()->dtl_ridx != be64_to_cpu(get_lppaca()->dtl_idx))
return scan_dispatch_log(stop_tb);
 
-- 
2.14.3



[PATCH] powerpc: Fix oops due to deferral of lppaca on non virtualized platform

2018-04-02 Thread Aneesh Kumar K.V
Commit 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not
virtualized") removed allocation of the lppaca on non-virtualized platforms. But
with CONFIG_PPC_SPLPAR enabled, we do access the lppaca on non-virtualized
platforms in some code paths.

Fix this by adding a runtime check for virtualized platforms.

Fixes: 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not 
virtualized")

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/lppaca.h   | 3 +++
 arch/powerpc/include/asm/spinlock.h | 2 ++
 arch/powerpc/kernel/time.c  | 3 +++
 3 files changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/lppaca.h 
b/arch/powerpc/include/asm/lppaca.h
index 65d589689f01..7c23ce8a5a4c 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
  * The lppaca is the "virtual processor area" registered with the hypervisor,
@@ -114,6 +115,8 @@ struct lppaca {
 
 static inline bool lppaca_shared_proc(struct lppaca *l)
 {
+   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   return false;
return !!(l->__old_status & LPPACA_OLD_SHARED_PROC);
 }
 
diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index b9ebc3085fb7..72dc4ddc2972 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -56,6 +56,8 @@
 #define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
+   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   return false;
return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
 }
 #endif
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f7d96a68ecaa..360e71d455cc 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -266,6 +266,9 @@ void accumulate_stolen_time(void)
 
 static inline u64 calculate_stolen_time(u64 stop_tb)
 {
+   if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+   return 0;
+
if (get_paca()->dtl_ridx != be64_to_cpu(get_lppaca()->dtl_idx))
return scan_dispatch_log(stop_tb);
 
-- 
2.14.3



Re: [PATCH] Extract initrd free logic from arch-specific code.

2018-04-02 Thread Ingo Molnar

* Shea Levy  wrote:

> > Please also put it into Documentation/features/.
> 
> I switched this patch series (the latest revision v6 was just posted) to
> using weak symbols instead of Kconfig. Does it still warrant documentation?

Probably not.

Thanks,

Ingo


[PATCH v3 7/7] powerpc/fadump: un-register fadump on kexec path.

2018-04-02 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

otherwise the fadump registration in the new kexec-ed kernel complains that
fadump is already registered. This makes the new kernel continue using the
fadump registered by the previous kernel, which may lead to invalid vmcore
generation. Hence this patch fixes the issue by un-registering fadump in
fadump_cleanup(), which is called on the kexec path, so that the new kernel
can register fadump with new, valid values.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/kernel/fadump.c |3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 2c3c7e655eec..cb496c8f40f3 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1345,6 +1345,9 @@ void fadump_cleanup(void)
init_fadump_mem_struct(,
get_fadump_metadata_base(fdm_active));
fadump_invalidate_dump();
+   } else if (fw_dump.dump_registered) {
+   /* Un-register Firmware-assisted dump if it was registered. */
+   fadump_unregister_dump();
}
 }
 



[PATCH v3 6/7] powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.

2018-04-02 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

For fadump to work successfully there should not be any holes in the reserved
memory ranges where the kernel has asked firmware to move the content of the
old kernel's memory in the event of a crash. But this memory area is currently
not protected from hot-remove operations. Hence, the fadump service can fail to
re-register after a hot-remove operation if the hot-removed memory belongs
to the fadump reserved region. To avoid this, make sure that memory from the
fadump reserved area is not hot-removable while fadump is registered.

However, if the user still wants to remove that memory, they can do so by
manually stopping the fadump service before the hot-remove operation.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/fadump.h   |2 +-
 arch/powerpc/kernel/fadump.c|   10 --
 arch/powerpc/platforms/pseries/hotplug-memory.c |7 +--
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h 
b/arch/powerpc/include/asm/fadump.h
index 44b6ebfe9be6..d16dc77107a8 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -207,7 +207,7 @@ struct fad_crash_memory_ranges {
unsigned long long  size;
 };
 
-extern int is_fadump_boot_memory_area(u64 addr, ulong size);
+extern int is_fadump_memory_area(u64 addr, ulong size);
 extern int early_init_dt_scan_fw_dump(unsigned long node,
const char *uname, int depth, void *data);
 extern int fadump_reserve_mem(void);
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 59aaf0357a52..2c3c7e655eec 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -162,13 +162,19 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
 
 /*
  * If fadump is registered, check if the memory provided
- * falls within boot memory area.
+ * falls within boot memory area and reserved memory area.
  */
-int is_fadump_boot_memory_area(u64 addr, ulong size)
+int is_fadump_memory_area(u64 addr, ulong size)
 {
+   u64 d_start = fw_dump.reserve_dump_area_start;
+   u64 d_end = d_start + fw_dump.reserve_dump_area_size;
+
if (!fw_dump.dump_registered)
return 0;
 
+   if (((addr + size) > d_start) && (addr <= d_end))
+   return 1;
+
return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
 }
 
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c1578f54c626..e4c658cda3a7 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -389,8 +389,11 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
phys_addr = lmb->base_addr;
 
 #ifdef CONFIG_FA_DUMP
-   /* Don't hot-remove memory that falls in fadump boot memory area */
-   if (is_fadump_boot_memory_area(phys_addr, block_sz))
+   /*
+* Don't hot-remove memory that falls in fadump boot memory area
+* and memory that is reserved for capturing old kernel memory.
+*/
+   if (is_fadump_memory_area(phys_addr, block_sz))
return false;
 #endif
 



[PATCH v3 5/7] powerpc/fadump: throw proper error message on fadump registration failure.

2018-04-02 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

fadump fails to register when there are holes in the reserved memory area.
This can happen if the user has hot-removed memory that falls in the fadump
reserved memory area. Throw a meaningful error message to the user in
such a case.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/kernel/fadump.c |   33 +
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index a497e9fb93fe..59aaf0357a52 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -216,6 +216,36 @@ static int is_boot_memory_area_contiguous(void)
return ret;
 }
 
+/*
+ * Returns 1, if there are no holes in reserved memory area,
+ * 0 otherwise.
+ */
+static int is_reserved_memory_area_contiguous(void)
+{
+   struct memblock_region *reg;
+   unsigned long start, end;
+   unsigned long d_start = fw_dump.reserve_dump_area_start;
+   unsigned long d_end = d_start + fw_dump.reserve_dump_area_size;
+   int ret = 0;
+
+   for_each_memblock(memory, reg) {
+   start = max(d_start, (unsigned long)reg->base);
+   end = min(d_end, (unsigned long)(reg->base + reg->size));
+   if (d_start < end) {
+   /* Memory hole from d_start to start */
+   if (start > d_start)
+   break;
+
+   if (end == d_end) {
+   ret = 1;
+   break;
+   }
+   d_start = end + 1;
+   }
+   }
+   return ret;
+}
+
 /* Print firmware assisted dump configurations for debugging purpose. */
 static void fadump_show_config(void)
 {
@@ -602,6 +632,9 @@ static int register_fw_dump(struct fadump_mem_struct *fdm)
if (!is_boot_memory_area_contiguous())
pr_err("Can't have holes in boot memory area while "
   "registering fadump\n");
+   else if (!is_reserved_memory_area_contiguous())
+   pr_err("Can't have holes in reserved memory area while"
+  " registering fadump\n");
 
printk(KERN_ERR "Failed to register firmware-assisted kernel"
" dump. Parameter Error(%d).\n", rc);



[PATCH v3 4/7] powerpc/fadump: exclude memory holes while reserving memory in second kernel.

2018-04-02 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

The second kernel, during early boot after the crash, reserves the rest of
the memory above the boot memory size to make sure it does not touch any of
the dump memory area. It uses memblock_reserve(), which reserves the specified
memory region irrespective of memory holes present within that region.
There are cases where the previous kernel would have hot-removed some of
its memory, leaving memory holes behind. In such cases the fadump kernel reports
an incorrect number of reserved pages through the arch_reserved_kernel_pages()
hook, causing the kernel to hang or panic.

Fix this by excluding memory holes while reserving the rest of the memory
above the boot memory size during the second kernel's boot after the crash.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/kernel/fadump.c |   17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 011f1aa7abab..a497e9fb93fe 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -433,6 +433,21 @@ static inline unsigned long get_fadump_metadata_base(
return be64_to_cpu(fdm->metadata_region.source_address);
 }
 
+static void fadump_memblock_reserve(unsigned long base, unsigned long size)
+{
+   struct memblock_region *reg;
+   unsigned long start, end;
+
+   for_each_memblock(memory, reg) {
+   start = max(base, (unsigned long)reg->base);
+   end = reg->base + reg->size;
+   end = min(base + size, end);
+
+   if (start < end)
+   memblock_reserve(start, end - start);
+   }
+}
+
 int __init fadump_reserve_mem(void)
 {
unsigned long base, size, memory_boundary;
@@ -487,7 +502,7 @@ int __init fadump_reserve_mem(void)
 */
base = fw_dump.boot_memory_size;
size = memory_boundary - base;
-   memblock_reserve(base, size);
+   fadump_memblock_reserve(base, size);
printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
"for saving crash dump\n",
(unsigned long)(size >> 20),



[PATCH v3 3/7] powerpc/fadump: Reservationless firmware assisted dump

2018-04-02 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. On large
systems with TeraBytes of memory, this reservation can be quite
significant.

In some cases, fadump fails if the memory reserved is insufficient, or
if the reserved memory was DLPAR hot-removed.

In the normal case, post reboot, the preserved memory is filtered to
extract only relevant areas of interest using the makedumpfile tool.
While the tool provides flexibility to determine what needs to be part
of the dump and what memory to filter out, all supported distributions
default this to "Capture only kernel data and nothing else".

We take advantage of this default and the Linux kernel's Contiguous
Memory Allocator (CMA) to fundamentally change the memory reservation
model for fadump. Fadump can now only capture kernel memory.

Instead of setting aside a significant chunk of memory that nobody can use,
this patch uses CMA to reserve a significant chunk of memory
that the kernel is prevented from using (due to MIGRATE_CMA), but
that applications are free to use.

Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:
[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           7557         193        6822          12         541        6725
Swap:          4095           0        4095

With this patch:
[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           8133         194        7464          12         475        7338
Swap:          4095           0        4095

Changes made here are completely transparent to how fadump has
traditionally worked.

Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand
CMA and its usage.

TODO:
- Handle case where CMA reservation spans nodes.

Signed-off-by: Ananth N Mavinakayanahalli 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/kernel/fadump.c |  122 --
 1 file changed, 104 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 80552fd7d5f8..011f1aa7abab 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -45,11 +46,57 @@
 static struct fw_dump fw_dump;
 static struct fadump_mem_struct fdm;
 static const struct fadump_mem_struct *fdm_active;
+static struct cma *fadump_cma;
 
 static DEFINE_MUTEX(fadump_mutex);
 struct fad_crash_memory_ranges crash_memory_ranges[INIT_CRASHMEM_RANGES];
 int crash_mem_ranges;
 
+/*
+ * fadump_cma_reserve() - reserve area for fadump memory reservation
+ *
+ * This function reserves memory from early allocator. It should be
+ * called by arch specific code once the memblock allocator
+ * has been activated.
+ */
+int __init fadump_cma_reserve(void)
+{
+   unsigned long long base, size;
+   int rc;
+
+   if (!fw_dump.fadump_enabled)
+   return 0;
+
+   base = fw_dump.reserve_dump_area_start;
+   size = fw_dump.reserve_dump_area_size;
+   pr_debug("Original reserve area base %ld, size %ld\n",
+   (unsigned long)base >> 20,
+   (unsigned long)size >> 20);
+   if (!size)
+   return 0;
+
+   rc = cma_declare_contiguous(base, size, 0, 0, 0, false,
+   "fadump_cma", _cma);
+   if (rc) {
+   printk(KERN_ERR "fadump: Failed to reserve cma area for "
+   "firmware-assisted dump, %d\n", rc);
+   fw_dump.reserve_dump_area_size = 0;
+   return 0;
+   }
+   /*
+* So we now have cma area reserved for fadump. base may be different
+* from what we requested.
+*/
+   fw_dump.reserve_dump_area_start = cma_get_base(fadump_cma);
+   fw_dump.reserve_dump_area_size = cma_get_size(fadump_cma);
+   printk("Reserved %ldMB cma area at %ldMB for firmware-assisted dump "
+   "(System RAM: %ldMB)\n",
+   cma_get_size(fadump_cma) >> 20,
+   (unsigned long)cma_get_base(fadump_cma) >> 20,
+   (unsigned long)(memblock_phys_mem_size() >> 20));
+   return 1;
+}
+
 /* Scan the Firmware Assisted dump configuration details. */
 int __init early_init_dt_scan_fw_dump(unsigned long node,
const char *uname, int depth, void *data)
@@ -381,7 +428,7 @@ static unsigned long get_fadump_area_size(void)
 }
 
 static inline unsigned long get_fadump_metadata_base(
-   const struct fadump_mem_struct *fdm)
+   const struct fadump_mem_struct *fdm)
 {
return 

[PATCH v3 2/7] powerpc/fadump: Update documentation to reflect the metadata region movement.

2018-04-02 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

The metadata region that holds the fadump header and ELF header is now placed
at the start of the reserved memory area. Update the documentation accordingly.
Signed-off-by: Mahesh Salgaonkar 
---
 Documentation/powerpc/firmware-assisted-dump.txt |   31 +-
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt 
b/Documentation/powerpc/firmware-assisted-dump.txt
index bdd344aa18d9..c77f8dc9231f 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -115,17 +115,24 @@ be kept permanently reserved, so that it can act as a 
receptacle
 for a copy of the boot memory content in addition to CPU state
 and HPTE region, in the case a crash does occur.
 
+The first kernel, during fadump registration, prepares ELF core header
+that contains necessary information for the coredump of the 1st kernel
+which includes its physical memory layout. This ELF header and some more
+additional data is stored in the area called metadata region at the start
+of the reserved memory area.
+
   o Memory Reservation during first kernel
 
   Low memory Top of memory
   0  boot memory size   |
   |   ||<--Reserved dump area -->|  |
   V   V|   Permanent Reservation |  V
-  +---+--/ /---+---++---++--+
-  |   ||CPU|HPTE|  DUMP |ELF |  |
-  +---+--/ /---+---++---++--+
-|   ^
-|   |
+  +---+--/ /---++---++---+--+
+  |   ||ELF |CPU|HPTE|  DUMP |  |
+  +---+-/ /++---++---+--+
+|^  ^
+||  |
+|metadata region|
 \   /
  ---
   Boot memory content gets transferred to
@@ -137,14 +144,14 @@ and HPTE region, in the case a crash does occur.
 
   Low memoryTop of memory
   0  boot memory size   |
-  |   |<- Reserved dump area --- -->|
+  |   |<- Reserved dump area -->|
   V   V V
-  +---+--/ /---+---++---++--+
-  |   ||CPU|HPTE|  DUMP |ELF |  |
-  +---+--/ /---+---++---++--+
-|  |
-V  V
-   Used by second/proc/vmcore
+  +---+--/ /---+++---+--+
+  |   ||ELF |CPU|HPTE|  DUMP |  |
+  +---+--/ /---+++---+--+
+| |
+V V
+   Used by second /proc/vmcore
kernel to boot
Fig. 2
 



[PATCH v3 1/7] powerpc/fadump: Move the metadata region to start of the reserved area.

2018-04-02 Thread Mahesh J Salgaonkar
From: Mahesh Salgaonkar 

Currently the metadata region that holds crash info structure and ELF core
header is placed towards the end of reserved memory area. This patch places
it at the beginning of the reserved memory area.

Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/include/asm/fadump.h |4 ++
 arch/powerpc/kernel/fadump.c  |   92 -
 2 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h 
b/arch/powerpc/include/asm/fadump.h
index 5a23010af600..44b6ebfe9be6 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -61,6 +61,9 @@
 #define FADUMP_UNREGISTER  2
 #define FADUMP_INVALIDATE  3
 
+/* Number of dump sections requested by kernel */
+#define FADUMP_NUM_SECTIONS4
+
 /* Dump status flag */
 #define FADUMP_ERROR_FLAG  0x2000
 
@@ -119,6 +122,7 @@ struct fadump_mem_struct {
struct fadump_section   cpu_state_data;
struct fadump_section   hpte_region;
struct fadump_section   rmr_region;
+   struct fadump_section   metadata_region;
 };
 
 /* Firmware-assisted dump configuration details. */
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3c2c2688918f..80552fd7d5f8 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -188,17 +188,48 @@ static void fadump_show_config(void)
pr_debug("Boot memory size  : %lx\n", fw_dump.boot_memory_size);
 }
 
+static unsigned long get_fadump_metadata_size(void)
+{
+   unsigned long size = 0;
+
+   /*
+* If fadump is active then look into fdm_active to get metadata
+* size set by previous kernel.
+*/
+   if (fw_dump.dump_active) {
+   size = fdm_active->cpu_state_data.destination_address -
+   fdm_active->metadata_region.source_address;
+   goto out;
+   }
+   size += sizeof(struct fadump_crash_info_header);
+   size += sizeof(struct elfhdr); /* ELF core header.*/
+   size += sizeof(struct elf_phdr); /* place holder for cpu notes */
+   /* Program headers for crash memory regions. */
+   size += sizeof(struct elf_phdr) * (memblock_num_regions(memory) + 2);
+
+   size = PAGE_ALIGN(size);
+out:
+   pr_debug("fadump Metadata size is %ld\n", size);
+   return size;
+}
+
 static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
unsigned long addr)
 {
+   uint16_t num_sections = 0;
+   unsigned long metadata_base = 0;
+
if (!fdm)
return 0;
 
+   /* Skip the fadump metadata area. */
+   metadata_base = addr;
+   addr += get_fadump_metadata_size();
+
memset(fdm, 0, sizeof(struct fadump_mem_struct));
addr = addr & PAGE_MASK;
 
fdm->header.dump_format_version = cpu_to_be32(0x0001);
-   fdm->header.dump_num_sections = cpu_to_be16(3);
fdm->header.dump_status_flag = 0;
fdm->header.offset_first_dump_section =
cpu_to_be32((u32)offsetof(struct fadump_mem_struct, 
cpu_state_data));
@@ -222,6 +253,7 @@ static unsigned long init_fadump_mem_struct(struct 
fadump_mem_struct *fdm,
fdm->cpu_state_data.source_address = 0;
fdm->cpu_state_data.source_len = 
cpu_to_be64(fw_dump.cpu_state_data_size);
fdm->cpu_state_data.destination_address = cpu_to_be64(addr);
+   num_sections++;
addr += fw_dump.cpu_state_data_size;
 
/* hpte region section */
@@ -230,6 +262,7 @@ static unsigned long init_fadump_mem_struct(struct 
fadump_mem_struct *fdm,
fdm->hpte_region.source_address = 0;
fdm->hpte_region.source_len = cpu_to_be64(fw_dump.hpte_region_size);
fdm->hpte_region.destination_address = cpu_to_be64(addr);
+   num_sections++;
addr += fw_dump.hpte_region_size;
 
/* RMA region section */
@@ -238,8 +271,25 @@ static unsigned long init_fadump_mem_struct(struct 
fadump_mem_struct *fdm,
fdm->rmr_region.source_address = cpu_to_be64(RMA_START);
fdm->rmr_region.source_len = cpu_to_be64(fw_dump.boot_memory_size);
fdm->rmr_region.destination_address = cpu_to_be64(addr);
+   num_sections++;
addr += fw_dump.boot_memory_size;
 
+   /*
+* fadump metadata section.
+* Add an entry with source len a zero. We are only interested in
+* source address which will help us to detect the location of
+* metadata area where fadump header and elf core header is placed.
+*/
+   fdm->metadata_region.request_flag = cpu_to_be32(FADUMP_REQUEST_FLAG);
+   fdm->metadata_region.source_data_type =
+   cpu_to_be16(FADUMP_REAL_MODE_REGION);
+   fdm->metadata_region.source_address = cpu_to_be64(metadata_base);
+   

[PATCH v3 0/7] powerpc/fadump: Improvements and fixes for firmware-assisted dump.

2018-04-02 Thread Mahesh J Salgaonkar
One of the primary issues with Firmware Assisted Dump (fadump) on Power
is that it needs a large amount of memory to be reserved. This reserved
memory is used for saving the contents of the old crashed kernel's memory
before the fadump capture kernel uses the old kernel's memory area to boot.
However, this reserved memory area stays unused until a system crash and
isn't available for the production kernel to use.

Instead of setting aside a significant chunk of memory that nobody can use,
take advantage of the Linux kernel's Contiguous Memory Allocator (CMA) feature
to reserve a significant chunk of memory that the kernel is prevented from
using, but that applications are free to use.

Patch 1 moves the metadata region to the start of the reserved area for easy
handling/detection of the metadata region in the second kernel.
Patch 3 uses the CMA region to allow the production kernel to use that memory
for applications, making fadump reservation-less.
Patches 4-7 fix various fadump issues and bugs.

Changes in V3:
- patch 1 & 2: move metadata region and documentation update.
- patch 7: Un-register the faudmp on kexec path


---

Mahesh Salgaonkar (7):
  powerpc/fadump: Move the metadata region to start of the reserved area.
  powerpc/fadump: Update documentation to reflect the metadata region 
movement.
  powerpc/fadump: Reservationless firmware assisted dump
  powerpc/fadump: exclude memory holes while reserving memory in second 
kernel.
  powerpc/fadump: throw proper error message on fadump registration failure.
  powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
  powerpc/fadump: un-register fadump on kexec path.


 Documentation/powerpc/firmware-assisted-dump.txt |   31 ++
 arch/powerpc/include/asm/fadump.h|6 
 arch/powerpc/kernel/fadump.c |  275 +++---
 arch/powerpc/platforms/pseries/hotplug-memory.c  |7 -
 4 files changed, 271 insertions(+), 48 deletions(-)

--
Signature