Re: [PATCH v2 3/3] powerpc/powernv: Always stop secondaries before reboot/shutdown
On 04/01/2018 04:06 PM, Nicholas Piggin wrote:
> Currently powernv reboot and shutdown requests just leave secondaries
> to do their own things. This is undesirable because they can trigger
> any number of watchdogs while waiting for reboot, but also we don't
> know what else they might be doing -- they might be causing trouble,
> trampling memory, etc.
>
> The opal scheduled flash update code already ran into watchdog problems
> due to flashing taking a long time, and it was fixed with 2196c6f1ed
> ("powerpc/powernv: Return secondary CPUs to firmware before FW update"),
> which returns secondaries to opal. It's been found that regular reboots
> can take over 10 seconds, which can result in the hard lockup watchdog
> firing,
>
>   reboot: Restarting system
>   [  360.038896709,5] OPAL: Reboot request...
>       Watchdog CPU:0 Hard LOCKUP
>       Watchdog CPU:44 detected Hard LOCKUP other CPUS:16
>       Watchdog CPU:16 Hard LOCKUP
>   watchdog: BUG: soft lockup - CPU#16 stuck for 3s! [swapper/16:0]
>
> This patch removes the special case for flash update, and calls
> smp_send_stop in all cases before calling reboot/shutdown.
>
> smp_send_stop could return CPUs to OPAL, the main reason not to is that
> the request could come from a NMI that interrupts OPAL code, so re-entry
> to OPAL can cause a number of problems. Putting secondaries into simple
> spin loops improves the chances of a successful reboot.
>
> Cc: Vasant Hegde

Nick,

Patch looks good to me. We have tested this patchset on FSP based system
and everything looks fine.
Reviewed-by: Vasant Hegde

-Vasant

> Signed-off-by: Nicholas Piggin
> ---
>  arch/powerpc/include/asm/opal.h             |  2 +-
>  arch/powerpc/platforms/powernv/opal-flash.c | 28 +---
>  arch/powerpc/platforms/powernv/setup.c      | 15 +--
>  3 files changed, 7 insertions(+), 38 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index dde60089d0d4..7159e1a6a61a 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -325,7 +325,7 @@ struct rtc_time;
>  extern unsigned long opal_get_boot_time(void);
>  extern void opal_nvram_init(void);
>  extern void opal_flash_update_init(void);
> -extern void opal_flash_term_callback(void);
> +extern void opal_flash_update_print_message(void);
>  extern int opal_elog_init(void);
>  extern void opal_platform_dump_init(void);
>  extern void opal_sys_param_init(void);
> diff --git a/arch/powerpc/platforms/powernv/opal-flash.c b/arch/powerpc/platforms/powernv/opal-flash.c
> index 1cb0b895a236..b37015101bf6 100644
> --- a/arch/powerpc/platforms/powernv/opal-flash.c
> +++ b/arch/powerpc/platforms/powernv/opal-flash.c
> @@ -303,26 +303,9 @@ static int opal_flash_update(int op)
>  	return rc;
>  }
>
> -/* Return CPUs to OPAL before starting FW update */
> -static void flash_return_cpu(void *info)
> -{
> -	int cpu = smp_processor_id();
> -
> -	if (!cpu_online(cpu))
> -		return;
> -
> -	/* Disable IRQ */
> -	hard_irq_disable();
> -
> -	/* Return the CPU to OPAL */
> -	opal_return_cpu();
> -}
> -
>  /* This gets called just before system reboots */
> -void opal_flash_term_callback(void)
> +void opal_flash_update_print_message(void)
>  {
> -	struct cpumask mask;
> -
>  	if (update_flash_data.status != FLASH_IMG_READY)
>  		return;
>
> @@ -333,15 +316,6 @@ void opal_flash_term_callback(void)
>
>  	/* Small delay to help getting the above message out */
>  	msleep(500);
> -
> -	/* Return secondary CPUs to firmware */
> -	cpumask_copy(&mask, cpu_online_mask);
> -	cpumask_clear_cpu(smp_processor_id(), &mask);
> -	if (!cpumask_empty(&mask))
> -		smp_call_function_many(&mask,
> -				flash_return_cpu, NULL, false);
> -	/* Hard disable interrupts */
> -	hard_irq_disable();
>  }
>
>  /*
> diff --git a/arch/powerpc/platforms/powernv/setup.c b/arch/powerpc/platforms/powernv/setup.c
> index 5f963286232f..ef8c9ce53a61 100644
> --- a/arch/powerpc/platforms/powernv/setup.c
> +++ b/arch/powerpc/platforms/powernv/setup.c
> @@ -201,17 +201,12 @@ static void pnv_prepare_going_down(void)
>  	 */
>  	opal_event_shutdown();
>
> -	/* Soft disable interrupts */
> -	local_irq_disable();
> +	/* Print flash update message if one is scheduled. */
> +	opal_flash_update_print_message();
>
> -	/*
> -	 * Return secondary CPUs to firwmare if a flash update
> -	 * is pending otherwise we will get all sort of error
> -	 * messages about CPU being stuck etc.. This will also
> -	 * have the side effect of hard disabling interrupts so
> -	 * past this point, the kernel is effectively dead.
> -	 */
> -	opal_flash_term_callback();
> +	smp_send_stop();
> +
> +	hard_irq_disable();
>  }
>
>  static void __noreturn pnv_restart(char *cmd)
Re: [PATCH 4/5] powerpc/pm: add sleep and deep sleep on QorIQ SoCs
Hi Ran,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on powerpc/next]
[also build test WARNING on v4.16 next-20180329]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Ran-Wang/powerpc-pm-Fix-suspend-n-in-menuconfig-for-e500mc-platforms/20180330-072848
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-ppc64e_defconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=powerpc

All warnings (new ones prefixed by >>):

   arch/powerpc/platforms/85xx/sleep.S: Assembler messages:
>> arch/powerpc/platforms/85xx/sleep.S:1174: Warning: invalid register expression

vim +1174 arch/powerpc/platforms/85xx/sleep.S

  1137
  1138	_GLOBAL(fsl_booke_deep_sleep_resume)
  1139		/* disable interrupts */
  1140		FSL_DIS_ALL_IRQ
  1141
  1142		/* switch to 64-bit mode */
  1143		bl	.enable_64b_mode
  1144
  1145		/* set TOC pointer */
  1146		bl	.relative_toc
  1147
  1148		/* setup initial TLBs, switch to kernel space ... */
  1149		bl	.start_initialization_book3e
  1150
  1151		/* address space changed, set TOC pointer again */
  1152		bl	.relative_toc
  1153
  1154		/* call a cpu state restore handler */
  1155		LOAD_REG_ADDR(r23, cur_cpu_spec)
  1156		ld	r23,0(r23)
  1157		ld	r23,CPU_SPEC_RESTORE(r23)
  1158		cmpdi	0,r23,0
  1159		beq	1f
  1160		ld	r23,0(r23)
  1161		mtctr	r23
  1162		bctrl
  1163	1:
  1164		LOAD_REG_ADDR(r3, regs_buffer)
  1165		bl	e5500_cpu_state_restore
  1166
  1167		/* Load each CAM entry */
  1168		LOAD_REG_ADDR(r3, tlbcam_index)
  1169		lwz	r3, 0(r3)
  1170		mtctr	r3
  1171		li	r0, 0
  1172	3:	mr	r3, r0
  1173		bl	loadcam_entry
> 1174		addi	r0, r0, 1
  1175		bdnz	3b
  1176
  1177		/* restore return address */
  1178		LOAD_REG_ADDR(r3, buf_tmp)
  1179		ld	r4, 16(r3)
  1180		mtspr	SPRN_TCR, r4
  1181		ld	r4, 0(r3)
  1182		mtlr	r4
  1183		ld	r4, 8(r3)
  1184		mtmsr	r4
  1185		ld	r4, 24(r3)
  1186		mtcr	r4
  1187
  1188		blr
  1189

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
Re: [PATCH v3 6/7] powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
I think CMA has protected us from hot-remove, so this patch is not necessary.

Regards,
Pingfan

On Mon, Apr 2, 2018 at 2:30 PM, Mahesh J Salgaonkar wrote:
> From: Mahesh Salgaonkar
>
> For fadump to work successfully there should not be any holes in reserved
> memory ranges where kernel has asked firmware to move the content of old
> kernel memory in event of crash. But this memory area is currently not
> protected from hot-remove operations. Hence, fadump service can fail to
> re-register after the hot-remove operation, if hot-removed memory belongs
> to fadump reserved region. To avoid this make sure that memory from fadump
> reserved area is not hot-removable if fadump is registered.
>
> However, if user still wants to remove that memory, he can do so by
> manually stopping fadump service before hot-remove operation.
>
> Signed-off-by: Mahesh Salgaonkar
> ---
>  arch/powerpc/include/asm/fadump.h               |  2 +-
>  arch/powerpc/kernel/fadump.c                    | 10 --
>  arch/powerpc/platforms/pseries/hotplug-memory.c |  7 +--
>  3 files changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
> index 44b6ebfe9be6..d16dc77107a8 100644
> --- a/arch/powerpc/include/asm/fadump.h
> +++ b/arch/powerpc/include/asm/fadump.h
> @@ -207,7 +207,7 @@ struct fad_crash_memory_ranges {
>  	unsigned long long size;
>  };
>
> -extern int is_fadump_boot_memory_area(u64 addr, ulong size);
> +extern int is_fadump_memory_area(u64 addr, ulong size);
>  extern int early_init_dt_scan_fw_dump(unsigned long node,
>  		const char *uname, int depth, void *data);
>  extern int fadump_reserve_mem(void);
> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> index 59aaf0357a52..2c3c7e655eec 100644
> --- a/arch/powerpc/kernel/fadump.c
> +++ b/arch/powerpc/kernel/fadump.c
> @@ -162,13 +162,19 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,
>
>  /*
>   * If fadump is registered, check if the memory provided
> - * falls within boot memory area.
> + * falls within boot memory area and reserved memory area.
>   */
> -int is_fadump_boot_memory_area(u64 addr, ulong size)
> +int is_fadump_memory_area(u64 addr, ulong size)
>  {
> +	u64 d_start = fw_dump.reserve_dump_area_start;
> +	u64 d_end = d_start + fw_dump.reserve_dump_area_size;
> +
>  	if (!fw_dump.dump_registered)
>  		return 0;
>
> +	if (((addr + size) > d_start) && (addr <= d_end))
> +		return 1;
> +
>  	return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
>  }
>
> diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
> index c1578f54c626..e4c658cda3a7 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-memory.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
> @@ -389,8 +389,11 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
>  	phys_addr = lmb->base_addr;
>
>  #ifdef CONFIG_FA_DUMP
> -	/* Don't hot-remove memory that falls in fadump boot memory area */
> -	if (is_fadump_boot_memory_area(phys_addr, block_sz))
> +	/*
> +	 * Don't hot-remove memory that falls in fadump boot memory area
> +	 * and memory that is reserved for capturing old kernel memory.
> +	 */
> +	if (is_fadump_memory_area(phys_addr, block_sz))
>  		return false;
>  #endif
>
Re: [PATCH v2 1/3] powerpc: use NMI IPI for smp_send_stop
On Tue, 3 Apr 2018 10:13:26 +1000
Balbir Singh wrote:

> On Sun, 1 Apr 2018 20:36:13 +1000
> Nicholas Piggin wrote:
>
> > Use the NMI IPI rather than smp_call_function for smp_send_stop.
> > Have stopped CPUs hard disable interrupts rather than just soft
> > disable.
> >
> > This function is used in crash/panic/shutdown paths to bring other
> > CPUs down as quickly and reliably as possible, and minimizing their
> > potential to cause trouble.
> >
> > Avoiding the Linux smp_call_function infrastructure and (if supported)
> > using true NMI IPIs makes this more robust.
> >
> > Signed-off-by: Nicholas Piggin
> > ---
> >  arch/powerpc/kernel/smp.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index cfc08b099c49..db88660bf6bd 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -565,7 +565,11 @@ void crash_send_ipi(void (*crash_ipi_callback)(struct pt_regs *))
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_NMI_IPI
> > +static void stop_this_cpu(struct pt_regs *regs)
> > +#else
> >  static void stop_this_cpu(void *dummy)
> > +#endif
> >  {
> >  	/* Remove this CPU */
> >  	set_cpu_online(smp_processor_id(), false);
> > @@ -577,7 +581,11 @@ static void stop_this_cpu(void *dummy)
> >
> >  void smp_send_stop(void)
> >  {
> > +#ifdef CONFIG_NMI_IPI
> > +	smp_send_nmi_ipi(NMI_IPI_ALL_OTHERS, stop_this_cpu, 100);
>
> I wonder if the delay_us should be a function of number of cpus and the
> callee should figure this out on its own? May be not in this series, but
> in the longer run.

It possibly should. The big per-cpu/core serialized delay is in firmware
(worst case we have to do a special wakeup on each core and bring it out
of deep stop), and that gets done before the delay starts. So we count
delay from after all threads are woken and given their sreset command,
so this is probably okay even for very big systems.

But it should at least be a #define constant if not something more
sophisticated like you suggest. We have a few delays like that including
in xmon and other crash paths that could use some more thought.

Thanks,
Nick

> Balbir Singh.
Re: [RFC PATCH v2 0/2] Randomization of address chosen by mmap.
> On 29 Mar 2018, at 00:07, Luck, Tony wrote:
>
>> The default limit of only 65536 VMAs will also quickly come into play
>> if consecutive anon mmaps don't get merged. Of course this can be
>> raised, but it has significant resource and performance (fork) costs.
>
> Could the random mmap address chooser look for how many existing
> VMAs have space before/after and the right attributes to merge with the
> new one you want to create? If this is above some threshold (100?) then
> pick one of them randomly and allocate the new address so that it will
> merge from below/above with an existing one.
>
> That should still give you a very high degree of randomness, but prevent
> out of control numbers of VMAs from being created.

I think this wouldn’t work. For example, these 100 allocations may have
happened during process initialization. But when an attacker comes to the
server, all his allocations would be made at predictable offsets from each
other. So in the end we would have done nothing except decrease the
performance of the first 100 allocations.

I think I can make an ioctl to turn off this randomization per process, and
it could be used if needed — for example, if an application is going to
allocate a big chunk or create big memory pressure, etc.

Best regards,
Ilya
Re: [PATCH v2 1/3] powerpc: use NMI IPI for smp_send_stop
On Sun, 1 Apr 2018 20:36:13 +1000
Nicholas Piggin wrote:

> Use the NMI IPI rather than smp_call_function for smp_send_stop.
> Have stopped CPUs hard disable interrupts rather than just soft
> disable.
>
> This function is used in crash/panic/shutdown paths to bring other
> CPUs down as quickly and reliably as possible, and minimizing their
> potential to cause trouble.
>
> Avoiding the Linux smp_call_function infrastructure and (if supported)
> using true NMI IPIs makes this more robust.
>
> Signed-off-by: Nicholas Piggin
> ---
>  arch/powerpc/kernel/smp.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index cfc08b099c49..db88660bf6bd 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -565,7 +565,11 @@ void crash_send_ipi(void (*crash_ipi_callback)(struct pt_regs *))
>  }
>  #endif
>
> +#ifdef CONFIG_NMI_IPI
> +static void stop_this_cpu(struct pt_regs *regs)
> +#else
>  static void stop_this_cpu(void *dummy)
> +#endif
>  {
>  	/* Remove this CPU */
>  	set_cpu_online(smp_processor_id(), false);
> @@ -577,7 +581,11 @@ static void stop_this_cpu(void *dummy)
>
>  void smp_send_stop(void)
>  {
> +#ifdef CONFIG_NMI_IPI
> +	smp_send_nmi_ipi(NMI_IPI_ALL_OTHERS, stop_this_cpu, 100);

I wonder if the delay_us should be a function of number of cpus and the
callee should figure this out on its own? May be not in this series, but
in the longer run.

Balbir Singh.
Re: [PATCH v9 17/24] mm: Protect mm_rb tree with a rwlock
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> This change is inspired by the Peter's proposal patch [1] which was
> protecting the VMA using SRCU. Unfortunately, SRCU is not scaling well in
> that particular case, and it is introducing major performance degradation
> due to excessive scheduling operations.
>
> To allow access to the mm_rb tree without grabbing the mmap_sem, this patch
> is protecting it access using a rwlock. As the mm_rb tree is a O(log n)
> search it is safe to protect it using such a lock. The VMA cache is not
> protected by the new rwlock and it should not be used without holding the
> mmap_sem.
>
> To allow the picked VMA structure to be used once the rwlock is released, a
> use count is added to the VMA structure. When the VMA is allocated it is
> set to 1. Each time the VMA is picked with the rwlock held its use count
> is incremented. Each time the VMA is released it is decremented. When the
> use count hits zero, this means that the VMA is no more used and should be
> freed.
>
> This patch is preparing for 2 kind of VMA access:
> - as usual, under the control of the mmap_sem,
> - without holding the mmap_sem for the speculative page fault handler.
>
> Access done under the control the mmap_sem doesn't require to grab the
> rwlock to protect read access to the mm_rb tree, but access in write must
> be done under the protection of the rwlock too. This affects inserting and
> removing of elements in the RB tree.
>
> The patch is introducing 2 new functions:
> - vma_get() to find a VMA based on an address by holding the new rwlock.
> - vma_put() to release the VMA when its no more used.
> These services are designed to be used when access are made to the RB tree
> without holding the mmap_sem.
>
> When a VMA is removed from the RB tree, its vma->vm_rb field is cleared and
> we rely on the WMB done when releasing the rwlock to serialize the write
> with the RMB done in a later patch to check for the VMA's validity.
>
> When free_vma is called, the file associated with the VMA is closed
> immediately, but the policy and the file structure remained in used until
> the VMA's use count reach 0, which may happens later when exiting an
> in progress speculative page fault.
>
> [1] https://patchwork.kernel.org/patch/5108281/
>
> Cc: Peter Zijlstra (Intel)
> Cc: Matthew Wilcox
> Signed-off-by: Laurent Dufour

Can __free_vma() be generalized for mm/nommu.c's delete_vma() and do_mmap()?
Re: [PATCH v9 16/24] mm: Introduce __page_add_new_anon_rmap()
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> When dealing with speculative page fault handler, we may race with VMA
> being split or merged. In this case the vma->vm_start and vm->vm_end
> fields may not match the address the page fault is occurring.
>
> This can only happens when the VMA is split but in that case, the
> anon_vma pointer of the new VMA will be the same as the original one,
> because in __split_vma the new->anon_vma is set to src->anon_vma when
> *new = *vma.
>
> So even if the VMA boundaries are not correct, the anon_vma pointer is
> still valid.
>
> If the VMA has been merged, then the VMA in which it has been merged
> must have the same anon_vma pointer otherwise the merge can't be done.
>
> So in all the case we know that the anon_vma is valid, since we have
> checked before starting the speculative page fault that the anon_vma
> pointer is valid for this VMA and since there is an anon_vma this
> means that at one time a page has been backed and that before the VMA
> is cleaned, the page table lock would have to be grab to clean the
> PTE, and the anon_vma field is checked once the PTE is locked.
>
> This patch introduce a new __page_add_new_anon_rmap() service which
> doesn't check for the VMA boundaries, and create a new inline one
> which do the check.
>
> When called from a page fault handler, if this is not a speculative one,
> there is a guarantee that vm_start and vm_end match the faulting address,
> so this check is useless. In the context of the speculative page fault
> handler, this check may be wrong but anon_vma is still valid as explained
> above.
>
> Signed-off-by: Laurent Dufour

I'm indifferent on this: it could be argued both ways that the new function
and its variant for a simple VM_BUG_ON() isn't worth it and it should rather
be done in the callers of page_add_new_anon_rmap(). It feels like it would
be better left to the caller, with a comment added to page_add_anon_rmap()
itself in mm/rmap.c.
Re: [PATCH v9 15/24] mm: Introduce __vm_normal_page()
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a84ddc218bbd..73b8b99f482b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1263,8 +1263,11 @@ struct zap_details {
>  	pgoff_t last_index;			/* Highest page->index to unmap */
>  };
>
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte, bool with_public_device);
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +			      pte_t pte, bool with_public_device,
> +			      unsigned long vma_flags);
> +#define _vm_normal_page(vma, addr, pte, with_public_device) \
> +	__vm_normal_page(vma, addr, pte, with_public_device, (vma)->vm_flags)
>  #define vm_normal_page(vma, addr, pte) _vm_normal_page(vma, addr, pte, false)
>
>  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,

If _vm_normal_page() is a static inline function does it break somehow? It's
nice to avoid the #define's.

> diff --git a/mm/memory.c b/mm/memory.c
> index af0338fbc34d..184a0d663a76 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -826,8 +826,9 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  #else
>  # define HAVE_PTE_SPECIAL 0
>  #endif
> -struct page *_vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> -			     pte_t pte, bool with_public_device)
> +struct page *__vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> +			      pte_t pte, bool with_public_device,
> +			      unsigned long vma_flags)
>  {
>  	unsigned long pfn = pte_pfn(pte);
>

Would it be possible to update the comment since the function itself is no
longer named vm_normal_page?
Re: [PATCH v9 14/24] mm: Introduce __maybe_mkwrite()
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index dfa81a638b7c..a84ddc218bbd 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -684,13 +684,18 @@ void free_compound_page(struct page *page);
>   * pte_mkwrite. But get_user_pages can cause write faults for mappings
>   * that do not have writing enabled, when used by access_process_vm.
>   */
> -static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +static inline pte_t __maybe_mkwrite(pte_t pte, unsigned long vma_flags)
>  {
> -	if (likely(vma->vm_flags & VM_WRITE))
> +	if (likely(vma_flags & VM_WRITE))
>  		pte = pte_mkwrite(pte);
>  	return pte;
>  }
>
> +static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
> +{
> +	return __maybe_mkwrite(pte, vma->vm_flags);
> +}
> +
>  int alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
>  		struct page *page);
>  int finish_fault(struct vm_fault *vmf);
> diff --git a/mm/memory.c b/mm/memory.c
> index 0a0a483d9a65..af0338fbc34d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2472,7 +2472,7 @@ static inline void wp_page_reuse(struct vm_fault *vmf)
>
>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>  	entry = pte_mkyoung(vmf->orig_pte);
> -	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>  	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
>  		update_mmu_cache(vma, vmf->address, vmf->pte);
>  	pte_unmap_unlock(vmf->pte, vmf->ptl);
> @@ -2549,8 +2549,8 @@ static int wp_page_copy(struct vm_fault *vmf)
>  		inc_mm_counter_fast(mm, MM_ANONPAGES);
>  	}
>  	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
> -	entry = mk_pte(new_page, vma->vm_page_prot);
> -	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> +	entry = mk_pte(new_page, vmf->vma_page_prot);
> +	entry = __maybe_mkwrite(pte_mkdirty(entry), vmf->vma_flags);
>  	/*
>  	 * Clear the pte entry and flush it first, before updating the
>  	 * pte with the new entry. This will avoid a race condition

Don't you also need to do this in do_swap_page()?

diff --git a/mm/memory.c b/mm/memory.c
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3067,9 +3067,9 @@ int do_swap_page(struct vm_fault *vmf)

 	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
-	pte = mk_pte(page, vma->vm_page_prot);
+	pte = mk_pte(page, vmf->vma_page_prot);
 	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+		pte = __maybe_mkwrite(pte_mkdirty(pte), vmf->vma_flags);
 		vmf->flags &= ~FAULT_FLAG_WRITE;
 		ret |= VM_FAULT_WRITE;
 		exclusive = RMAP_EXCLUSIVE;
Re: [PATCH v9 13/24] mm: Introduce __lru_cache_add_active_or_unevictable
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> The speculative page fault handler which is run without holding the
> mmap_sem is calling lru_cache_add_active_or_unevictable() but the vm_flags
> is not guaranteed to remain constant.
>
> Introducing __lru_cache_add_active_or_unevictable() which has the vma flags
> value parameter instead of the vma pointer.
>
> Signed-off-by: Laurent Dufour

Acked-by: David Rientjes
Re: [PATCH v9 12/24] mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> migrate_misplaced_page() is only called during the page fault handling so
> it's better to pass the pointer to the struct vm_fault instead of the vma.
>
> This way during the speculative page fault path the saved vma->vm_flags
> could be used.
>
> Signed-off-by: Laurent Dufour

Acked-by: David Rientjes
Re: [PATCH v9 11/24] mm: Cache some VMA fields in the vm_fault structure
On Tue, 13 Mar 2018, Laurent Dufour wrote:

> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ef6ef0627090..dfa81a638b7c 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -359,6 +359,12 @@ struct vm_fault {
>  					 * page table to avoid allocation from
>  					 * atomic context.
>  					 */
> +	/*
> +	 * These entries are required when handling speculative page fault.
> +	 * This way the page handling is done using consistent field values.
> +	 */
> +	unsigned long vma_flags;
> +	pgprot_t vma_page_prot;
>  };
>
>  /* page entry size for vm->huge_fault() */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 446427cafa19..f71db2b42b30 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3717,6 +3717,8 @@ static int hugetlb_no_page(struct mm_struct *mm, struct vm_area_struct *vma,
>  			.vma = vma,
>  			.address = address,
>  			.flags = flags,
> +			.vma_flags = vma->vm_flags,
> +			.vma_page_prot = vma->vm_page_prot,
>  			/*
>  			 * Hard to debug if it ends up being
>  			 * used by a callee that assumes
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 32314e9e48dd..a946d5306160 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -882,6 +882,8 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm,
>  		.flags = FAULT_FLAG_ALLOW_RETRY,
>  		.pmd = pmd,
>  		.pgoff = linear_page_index(vma, address),
> +		.vma_flags = vma->vm_flags,
> +		.vma_page_prot = vma->vm_page_prot,
>  	};
>
>  	/* we only decide to swapin, if there is enough young ptes */
> diff --git a/mm/memory.c b/mm/memory.c
> index 0200340ef089..46fe92b93682 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2615,7 +2615,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>  		 * Don't let another task, with possibly unlocked vma,
>  		 * keep the mlocked page.
>  		 */
> -		if (page_copied && (vma->vm_flags & VM_LOCKED)) {
> +		if (page_copied && (vmf->vma_flags & VM_LOCKED)) {
>  			lock_page(old_page);	/* LRU manipulation */
>  			if (PageMlocked(old_page))
>  				munlock_vma_page(old_page);

Doesn't wp_page_copy() also need to pass this to anon_vma_prepare() so that
find_mergeable_anon_vma() works correctly?

> @@ -2649,7 +2649,7 @@ static int wp_page_copy(struct vm_fault *vmf)
>   */
>  int finish_mkwrite_fault(struct vm_fault *vmf)
>  {
> -	WARN_ON_ONCE(!(vmf->vma->vm_flags & VM_SHARED));
> +	WARN_ON_ONCE(!(vmf->vma_flags & VM_SHARED));
>  	if (!pte_map_lock(vmf))
>  		return VM_FAULT_RETRY;
>  	/*
> @@ -2751,7 +2751,7 @@ static int do_wp_page(struct vm_fault *vmf)
>  	 * We should not cow pages in a shared writeable mapping.
>  	 * Just mark the pages writable and/or call ops->pfn_mkwrite.
>  	 */
> -	if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> +	if ((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
>  			     (VM_WRITE|VM_SHARED))
>  		return wp_pfn_shared(vmf);
>
> @@ -2798,7 +2798,7 @@ static int do_wp_page(struct vm_fault *vmf)
>  			return VM_FAULT_WRITE;
>  		}
>  		unlock_page(vmf->page);
> -	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
> +	} else if (unlikely((vmf->vma_flags & (VM_WRITE|VM_SHARED)) ==
>  					(VM_WRITE|VM_SHARED))) {
>  		return wp_page_shared(vmf);
>  	}
> @@ -3067,7 +3067,7 @@ int do_swap_page(struct vm_fault *vmf)
>
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
> -	pte = mk_pte(page, vma->vm_page_prot);
> +	pte = mk_pte(page, vmf->vma_page_prot);
>  	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
> @@ -3093,7 +3093,7 @@ int do_swap_page(struct vm_fault *vmf)
>
>  	swap_free(entry);
>  	if (mem_cgroup_swap_full(page) ||
> -	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
> +	    (vmf->vma_flags & VM_LOCKED) || PageMlocked(page))
>  		try_to_free_swap(page);
>  	unlock_page(page);
>  	if (page != swapcache && swapcache) {
> @@ -3150,7 +3150,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
>  	pte_t entry;
>
>  	/* File mapping without ->vm_ops ? */
> -	if (vma->vm_flags & VM_SHARED)
> +
[PATCH] tty:hvcs: Move Assignment for hvcsd->port.count
Move the assignment to hvcsd->port.count out of the if () condition. It is
bad practice to have assignments within an if () condition.

Signed-off-by: Bryant G. Ly
---
 drivers/tty/hvc/hvcs.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/hvc/hvcs.c b/drivers/tty/hvc/hvcs.c
index 1db1d97..333dc76 100644
--- a/drivers/tty/hvc/hvcs.c
+++ b/drivers/tty/hvc/hvcs.c
@@ -1202,7 +1202,8 @@ static void hvcs_close(struct tty_struct *tty, struct file *filp)
 	hvcsd = tty->driver_data;
 
 	spin_lock_irqsave(&hvcsd->lock, flags);
-	if (--hvcsd->port.count == 0) {
+	--hvcsd->port.count;
+	if (hvcsd->port.count == 0) {
 		vio_disable_interrupts(hvcsd->vdev);
-- 
2.7.2
Re: [PATCH 0/3] stop secondaries for reboot/shutdown
On 2018-04-01 16:06, Nicholas Piggin wrote:
> I'm rebasing and resending this series because the hard lockup messages
> are being seen in the field. Since last time, one of the patches was
> merged, and patches were split and reordered a bit. No major changes.
>
> This still hasn't been tested with the FSP firmware update.

I have tested this series on a FSP platform with firmware update and
multiple reboot/shutdown tests. Not seen any issues.

Tested-by: Pridhiviraj Paidipeddi

Thanks
Pridhiviraj

> Thanks,
> Nick
>
> Nicholas Piggin (3):
>   powerpc: use NMI IPI for smp_send_stop
>   powerpc: hard disable irqs in smp_send_stop loop
>   powerpc/powernv: Always stop secondaries before reboot/shutdown
>
>  arch/powerpc/include/asm/opal.h             |  2 +-
>  arch/powerpc/kernel/smp.c                   | 13 +++++--
>  arch/powerpc/platforms/powernv/opal-flash.c | 28 +---
>  arch/powerpc/platforms/powernv/setup.c      | 15 +--
>  4 files changed, 18 insertions(+), 40 deletions(-)
Re: RFC on writel and writel_relaxed
On 3/29/2018 9:40 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
>>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>>> the archs are unused or dead anyway.
>>>
>>> Agreed.
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention to any other architecture besides arm64.
>
> Thanks for going through that exercise !
>
> Once sparc, s390, microblaze and mips reply, I think we'll have a good
> coverage, maybe riscv is to put in that lot too.

I posted the following two patches for supporting microblaze and unicore32.

[PATCH v2 1/2] io: prevent compiler reordering on the default writeX() implementation
[PATCH v2 2/2] io: prevent compiler reordering on the default readX() implementation

The rest of the arches except mips and alpha seem OK. I sent a question
email on Friday to mips and alpha mailing lists. I'll follow up with an
actual patch today.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
Re: [PATCH V2] powerpc: Fix oops due to wrong access of lppaca on non virtualized platform
On Mon, 2 Apr 2018 13:03:37 +0530 "Aneesh Kumar K.V" wrote:

> Commit 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not
> virtualized") removed allocation of the lppaca on non-virtualized
> platforms. But with CONFIG_PPC_SPLPAR enabled, we do access the lppaca
> on non-virtualized platforms in some code paths.
>
> Fix this by adding a runtime check for virtualized platforms.
>
> Fixes: 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not
> virtualized")
>
> Signed-off-by: Aneesh Kumar K.V

Thanks, this looks good to me...

> ---
> Changes from V1:
> * update commit summary
>
>  arch/powerpc/include/asm/lppaca.h   | 3 +++
>  arch/powerpc/include/asm/spinlock.h | 2 ++
>  arch/powerpc/kernel/time.c          | 3 +++
>  3 files changed, 8 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
> index 65d589689f01..7c23ce8a5a4c 100644
> --- a/arch/powerpc/include/asm/lppaca.h
> +++ b/arch/powerpc/include/asm/lppaca.h
> @@ -34,6 +34,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  /*
>   * The lppaca is the "virtual processor area" registered with the hypervisor,
> @@ -114,6 +115,8 @@ struct lppaca {
>
>  static inline bool lppaca_shared_proc(struct lppaca *l)
>  {
> +	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
> +		return false;
>  	return !!(l->__old_status & LPPACA_OLD_SHARED_PROC);
>  }
>

We should eventually clean this up. There's no need to pass in an lppaca here AFAIKS. Is it even per-CPU; does it change dynamically? We could set it up in a FW_FEATURE parameter at boot, which could make some of these tests cheaper.

> diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
> index b9ebc3085fb7..72dc4ddc2972 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -56,6 +56,8 @@
>  #define vcpu_is_preempted vcpu_is_preempted
>  static inline bool vcpu_is_preempted(int cpu)
>  {
> +	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
> +		return false;
>  	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
>  }
>  #endif
>
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index f7d96a68ecaa..360e71d455cc 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -266,6 +266,9 @@ void accumulate_stolen_time(void)
>
>  static inline u64 calculate_stolen_time(u64 stop_tb)
>  {
> +	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
> +		return 0;
> +
>  	if (get_paca()->dtl_ridx != be64_to_cpu(get_lppaca()->dtl_idx))
>  		return scan_dispatch_log(stop_tb);
>
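The idea behind all three hunks is the same guard pattern: consult the firmware feature flag before dereferencing a structure that is only allocated on some platforms. A minimal Python sketch of that pattern (all names here are illustrative stand-ins, not kernel API):

```python
# Sketch of the guard pattern from this patch: on non-SPLPAR (bare metal)
# systems the lppaca is never allocated, so every accessor must bail out
# before touching it. Feature set and dict layout are made up for the demo.
LPPACA_OLD_SHARED_PROC = 0x2

def lppaca_shared_proc(firmware_features, lppaca):
    """Return True only when running as a shared-processor LPAR."""
    if "SPLPAR" not in firmware_features:
        return False          # lppaca may be None here; never dereference it
    return bool(lppaca["old_status"] & LPPACA_OLD_SHARED_PROC)

# Bare metal: lppaca is None, but the guard keeps us safe.
print(lppaca_shared_proc(set(), None))                      # False
# Virtualized shared-processor LPAR:
print(lppaca_shared_proc({"SPLPAR"}, {"old_status": 0x2}))  # True
```

Without the guard, the bare-metal call would dereference an unallocated structure, which is exactly the oops the patch fixes.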
[PATCH V2] powerpc: Fix oops due to wrong access of lppaca on non virtualized platform
Commit 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not virtualized") removed allocation of the lppaca on non-virtualized platforms. But with CONFIG_PPC_SPLPAR enabled, we do access the lppaca on non-virtualized platforms in some code paths.

Fix this by adding a runtime check for virtualized platforms.

Fixes: 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not virtualized")
Signed-off-by: Aneesh Kumar K.V
---
Changes from V1:
* update commit summary

 arch/powerpc/include/asm/lppaca.h   | 3 +++
 arch/powerpc/include/asm/spinlock.h | 2 ++
 arch/powerpc/kernel/time.c          | 3 +++
 3 files changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index 65d589689f01..7c23ce8a5a4c 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include

 /*
  * The lppaca is the "virtual processor area" registered with the hypervisor,
@@ -114,6 +115,8 @@ struct lppaca {

 static inline bool lppaca_shared_proc(struct lppaca *l)
 {
+	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+		return false;
 	return !!(l->__old_status & LPPACA_OLD_SHARED_PROC);
 }

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index b9ebc3085fb7..72dc4ddc2972 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -56,6 +56,8 @@
 #define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
+	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+		return false;
 	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
 }
 #endif

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f7d96a68ecaa..360e71d455cc 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -266,6 +266,9 @@ void accumulate_stolen_time(void)

 static inline u64 calculate_stolen_time(u64 stop_tb)
 {
+	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+		return 0;
+
 	if (get_paca()->dtl_ridx != be64_to_cpu(get_lppaca()->dtl_idx))
 		return scan_dispatch_log(stop_tb);

-- 
2.14.3
[PATCH] powerpc: Fix oops due to deferral of lppaca on non virtualized platform
Commit 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not virtualized") removed allocation of the lppaca on non-virtualized platforms. But with CONFIG_PPC_SPLPAR enabled, we do access the lppaca on non-virtualized platforms in some code paths.

Fix this by adding a runtime check for virtualized platforms.

Fixes: 8e0b634b1327 ("powerpc/64s: Do not allocate lppaca if we are not virtualized")
Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/include/asm/lppaca.h   | 3 +++
 arch/powerpc/include/asm/spinlock.h | 2 ++
 arch/powerpc/kernel/time.c          | 3 +++
 3 files changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/lppaca.h b/arch/powerpc/include/asm/lppaca.h
index 65d589689f01..7c23ce8a5a4c 100644
--- a/arch/powerpc/include/asm/lppaca.h
+++ b/arch/powerpc/include/asm/lppaca.h
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include

 /*
  * The lppaca is the "virtual processor area" registered with the hypervisor,
@@ -114,6 +115,8 @@ struct lppaca {

 static inline bool lppaca_shared_proc(struct lppaca *l)
 {
+	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+		return false;
 	return !!(l->__old_status & LPPACA_OLD_SHARED_PROC);
 }

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index b9ebc3085fb7..72dc4ddc2972 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -56,6 +56,8 @@
 #define vcpu_is_preempted vcpu_is_preempted
 static inline bool vcpu_is_preempted(int cpu)
 {
+	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+		return false;
 	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
 }
 #endif

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index f7d96a68ecaa..360e71d455cc 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -266,6 +266,9 @@ void accumulate_stolen_time(void)

 static inline u64 calculate_stolen_time(u64 stop_tb)
 {
+	if (!firmware_has_feature(FW_FEATURE_SPLPAR))
+		return 0;
+
 	if (get_paca()->dtl_ridx != be64_to_cpu(get_lppaca()->dtl_idx))
 		return scan_dispatch_log(stop_tb);

-- 
2.14.3
Re: [PATCH] Extract initrd free logic from arch-specific code.
* Shea Levy wrote:

> > Please also put it into Documentation/features/.
>
> I switched this patch series (the latest revision v6 was just posted) to
> using weak symbols instead of Kconfig. Does it still warrant documentation?

Probably not.

Thanks,

	Ingo
[PATCH v3 7/7] powerpc/fadump: un-register fadump on kexec path.
From: Mahesh Salgaonkar

Otherwise the fadump registration in the new kexec-ed kernel complains that fadump is already registered. This makes the new kernel continue using the fadump registered by the previous kernel, which may lead to invalid vmcore generation. Hence this patch fixes this issue by un-registering fadump in fadump_cleanup(), which is called during the kexec path, so that the new kernel can register fadump with new valid values.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 2c3c7e655eec..cb496c8f40f3 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -1345,6 +1345,9 @@ void fadump_cleanup(void)
 		init_fadump_mem_struct(&fdm,
 				get_fadump_metadata_base(fdm_active));
 		fadump_invalidate_dump();
+	} else if (fw_dump.dump_registered) {
+		/* Un-register Firmware-assisted dump if it was registered. */
+		fadump_unregister_dump();
 	}
 }
[PATCH v3 6/7] powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
From: Mahesh Salgaonkar

For fadump to work successfully there should not be any holes in the reserved memory ranges where the kernel has asked firmware to move the contents of the old kernel's memory in the event of a crash. But this memory area is currently not protected from hot-remove operations. Hence, the fadump service can fail to re-register after a hot-remove operation if the hot-removed memory belongs to the fadump reserved region. To avoid this, make sure that memory from the fadump reserved area is not hot-removable if fadump is registered.

However, if the user still wants to remove that memory, he can do so by manually stopping the fadump service before the hot-remove operation.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/include/asm/fadump.h               |  2 +-
 arch/powerpc/kernel/fadump.c                    | 10 --
 arch/powerpc/platforms/pseries/hotplug-memory.c |  7 +-
 3 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 44b6ebfe9be6..d16dc77107a8 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -207,7 +207,7 @@ struct fad_crash_memory_ranges {
 	unsigned long long	size;
 };

-extern int is_fadump_boot_memory_area(u64 addr, ulong size);
+extern int is_fadump_memory_area(u64 addr, ulong size);
 extern int early_init_dt_scan_fw_dump(unsigned long node,
 		const char *uname, int depth, void *data);
 extern int fadump_reserve_mem(void);

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 59aaf0357a52..2c3c7e655eec 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -162,13 +162,19 @@ int __init early_init_dt_scan_fw_dump(unsigned long node,

 /*
  * If fadump is registered, check if the memory provided
- * falls within boot memory area.
+ * falls within boot memory area and reserved memory area.
  */
-int is_fadump_boot_memory_area(u64 addr, ulong size)
+int is_fadump_memory_area(u64 addr, ulong size)
 {
+	u64 d_start = fw_dump.reserve_dump_area_start;
+	u64 d_end = d_start + fw_dump.reserve_dump_area_size;
+
 	if (!fw_dump.dump_registered)
 		return 0;

+	if (((addr + size) > d_start) && (addr <= d_end))
+		return 1;
+
 	return (addr + size) > RMA_START && addr <= fw_dump.boot_memory_size;
 }

diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c b/arch/powerpc/platforms/pseries/hotplug-memory.c
index c1578f54c626..e4c658cda3a7 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -389,8 +389,11 @@ static bool lmb_is_removable(struct drmem_lmb *lmb)
 	phys_addr = lmb->base_addr;

 #ifdef CONFIG_FA_DUMP
-	/* Don't hot-remove memory that falls in fadump boot memory area */
-	if (is_fadump_boot_memory_area(phys_addr, block_sz))
+	/*
+	 * Don't hot-remove memory that falls in fadump boot memory area
+	 * and memory that is reserved for capturing old kernel memory.
+	 */
+	if (is_fadump_memory_area(phys_addr, block_sz))
 		return false;
 #endif
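The new check in is_fadump_memory_area() is a classic interval-overlap test: reject a hot-remove block when it intersects the reserved dump area. A small Python sketch of that logic (mirroring the patch's inclusive d_end comparison; addresses are illustrative):

```python
# Sketch of the interval-overlap test used by is_fadump_memory_area():
# a [addr, addr+size) block is considered fadump memory when it intersects
# the reserved dump area [d_start, d_end].
def overlaps_reserved(addr, size, d_start, d_end):
    return (addr + size) > d_start and addr <= d_end

# Block 0x1000..0x2000 overlaps reserved area 0x1800..0x4000:
print(overlaps_reserved(0x1000, 0x1000, 0x1800, 0x4000))  # True
# Block 0x0..0x1000 ends before the reserved area starts:
print(overlaps_reserved(0x0, 0x1000, 0x2000, 0x4000))     # False
```

Any block for which this returns True is refused by lmb_is_removable(), so a registered fadump reservation can never be punched full of holes by hot-remove.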
[PATCH v3 5/7] powerpc/fadump: throw proper error message on fadump registration failure.
From: Mahesh Salgaonkar

fadump fails to register when there are holes in the reserved memory area. This can happen if the user has hot-removed memory that falls in the fadump reserved memory area. Throw a meaningful error message to the user in such a case.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 33 +
 1 file changed, 33 insertions(+)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index a497e9fb93fe..59aaf0357a52 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -216,6 +216,36 @@ static int is_boot_memory_area_contiguous(void)
 	return ret;
 }

+/*
+ * Returns 1, if there are no holes in reserved memory area,
+ * 0 otherwise.
+ */
+static int is_reserved_memory_area_contiguous(void)
+{
+	struct memblock_region *reg;
+	unsigned long start, end;
+	unsigned long d_start = fw_dump.reserve_dump_area_start;
+	unsigned long d_end = d_start + fw_dump.reserve_dump_area_size;
+	int ret = 0;
+
+	for_each_memblock(memory, reg) {
+		start = max(d_start, (unsigned long)reg->base);
+		end = min(d_end, (unsigned long)(reg->base + reg->size));
+		if (d_start < end) {
+			/* Memory hole from d_start to start */
+			if (start > d_start)
+				break;
+
+			if (end == d_end) {
+				ret = 1;
+				break;
+			}
+
+			d_start = end + 1;
+		}
+	}
+
+	return ret;
+}
+
 /* Print firmware assisted dump configurations for debugging purpose. */
 static void fadump_show_config(void)
 {
@@ -602,6 +632,9 @@ static int register_fw_dump(struct fadump_mem_struct *fdm)
 		if (!is_boot_memory_area_contiguous())
 			pr_err("Can't have holes in boot memory area while "
			       "registering fadump\n");
+		else if (!is_reserved_memory_area_contiguous())
+			pr_err("Can't have holes in reserved memory area while"
+			       " registering fadump\n");

 		printk(KERN_ERR "Failed to register firmware-assisted kernel"
			" dump. Parameter Error(%d).\n", rc);
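The new is_reserved_memory_area_contiguous() walks the memblock regions and verifies the reserved dump area is fully covered with no gaps. The loop can be sketched in Python (the region list stands in for memblock; the control flow mirrors the patch):

```python
# Sketch of is_reserved_memory_area_contiguous(): walk sorted memory
# regions and verify the reserved dump area [d_start, d_end] has no holes.
def reserved_area_contiguous(regions, d_start, d_end):
    for base, size in regions:             # regions sorted by base address
        start = max(d_start, base)
        end = min(d_end, base + size)
        if d_start < end:
            if start > d_start:            # hole in front of this region
                return False
            if end == d_end:               # covered all the way to the end
                return True
            d_start = end + 1              # advance past the covered part
    return False

# One region covering the whole reserved area: contiguous.
print(reserved_area_contiguous([(0, 100)], 10, 50))           # True
# Hole at 20..40 inside the reserved area 10..50: not contiguous.
print(reserved_area_contiguous([(0, 20), (40, 60)], 10, 50))  # False
```

When this returns False, register_fw_dump() now prints the "Can't have holes in reserved memory area" error instead of a bare parameter-error message.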
[PATCH v3 4/7] powerpc/fadump: exclude memory holes while reserving memory in second kernel.
From: Mahesh Salgaonkar

The second kernel, during early boot after the crash, reserves the rest of the memory above boot memory size to make sure it does not touch any of the dump memory area. It uses memblock_reserve() that reserves the specified memory region irrespective of memory holes present within that region. There are chances where the previous kernel would have hot-removed some of its memory, leaving memory holes behind. In such cases the fadump kernel reports an incorrect number of reserved pages through the arch_reserved_kernel_pages() hook, causing the kernel to hang or panic.

Fix this by excluding memory holes while reserving the rest of the memory above boot memory size during second kernel boot after crash.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 011f1aa7abab..a497e9fb93fe 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -433,6 +433,21 @@ static inline unsigned long get_fadump_metadata_base(
 	return be64_to_cpu(fdm->metadata_region.source_address);
 }

+static void fadump_memblock_reserve(unsigned long base, unsigned long size)
+{
+	struct memblock_region *reg;
+	unsigned long start, end;
+
+	for_each_memblock(memory, reg) {
+		start = max(base, (unsigned long)reg->base);
+		end = reg->base + reg->size;
+		end = min(base + size, end);
+
+		if (start < end)
+			memblock_reserve(start, end - start);
+	}
+}
+
 int __init fadump_reserve_mem(void)
 {
 	unsigned long base, size, memory_boundary;
@@ -487,7 +502,7 @@ int __init fadump_reserve_mem(void)
 	 */
 	base = fw_dump.boot_memory_size;
 	size = memory_boundary - base;
-	memblock_reserve(base, size);
+	fadump_memblock_reserve(base, size);

 	printk(KERN_INFO "Reserved %ldMB of memory at %ldMB "
		"for saving crash dump\n", (unsigned long)(size >> 20),
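fadump_memblock_reserve() replaces one blind reservation of [base, base+size) with per-region intersections, so hot-remove holes are skipped. A Python sketch of that intersection loop (region layout invented for the demo):

```python
# Sketch of fadump_memblock_reserve(): reserve only the intersections of
# the requested span with actual memory regions, skipping holes left by
# hot-removed memory. Returns the (start, length) pairs that would be
# passed to memblock_reserve().
def reserve_excluding_holes(regions, base, size):
    reserved = []
    for rbase, rsize in regions:
        start = max(base, rbase)
        end = min(base + size, rbase + rsize)
        if start < end:
            reserved.append((start, end - start))  # memblock_reserve(start, len)
    return reserved

# Memory at 0..100 and 150..200 (hole at 100..150); reserve span 50..170:
print(reserve_excluding_holes([(0, 100), (150, 50)], 50, 120))
# [(50, 50), (150, 20)] -- the 100..150 hole is never reserved
```

Because the hole is never reserved, the reserved-page count seen by arch_reserved_kernel_pages() matches memory that actually exists.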
[PATCH v3 3/7] powerpc/fadump: Reservationless firmware assisted dump
From: Mahesh Salgaonkar

One of the primary issues with Firmware Assisted Dump (fadump) on Power is that it needs a large amount of memory to be reserved. On large systems with terabytes of memory, this reservation can be quite significant. In some cases, fadump fails if the memory reserved is insufficient, or if the reserved memory was DLPAR hot-removed.

In the normal case, post reboot, the preserved memory is filtered to extract only relevant areas of interest using the makedumpfile tool. While the tool provides flexibility to determine what needs to be part of the dump and what memory to filter out, all supported distributions default this to "Capture only kernel data and nothing else". We take advantage of this default and the Linux kernel's Contiguous Memory Allocator (CMA) to fundamentally change the memory reservation model for fadump. Fadump can now only capture kernel memory.

Instead of setting aside a significant chunk of memory nobody can use, this patch uses CMA instead, to reserve a significant chunk of memory that the kernel is prevented from using (due to MIGRATE_CMA), but applications are free to use it. Essentially, on a P9 LPAR with 2 cores, 8GB RAM and current upstream:

[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           7557         193        6822          12         541        6725
Swap:          4095           0        4095

With this patch:

[root@zzxx-yy10 ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           8133         194        7464          12         475        7338
Swap:          4095           0        4095

Changes made here are completely transparent to how fadump has traditionally worked.

Thanks to Aneesh Kumar and Anshuman Khandual for helping us understand CMA and its usage.

TODO:
- Handle case where CMA reservation spans nodes.

Signed-off-by: Ananth N Mavinakayanahalli
Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/kernel/fadump.c | 122 --
 1 file changed, 104 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 80552fd7d5f8..011f1aa7abab 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -34,6 +34,7 @@
 #include
 #include
 #include
+#include

 #include
 #include
@@ -45,11 +46,57 @@
 static struct fw_dump fw_dump;
 static struct fadump_mem_struct fdm;
 static const struct fadump_mem_struct *fdm_active;
+static struct cma *fadump_cma;

 static DEFINE_MUTEX(fadump_mutex);
 struct fad_crash_memory_ranges crash_memory_ranges[INIT_CRASHMEM_RANGES];
 int crash_mem_ranges;

+/*
+ * fadump_cma_reserve() - reserve area for fadump memory reservation
+ *
+ * This function reserves memory from early allocator. It should be
+ * called by arch specific code once the memblock allocator
+ * has been activated.
+ */
+int __init fadump_cma_reserve(void)
+{
+	unsigned long long base, size;
+	int rc;
+
+	if (!fw_dump.fadump_enabled)
+		return 0;
+
+	base = fw_dump.reserve_dump_area_start;
+	size = fw_dump.reserve_dump_area_size;
+	pr_debug("Original reserve area base %ld, size %ld\n",
+			(unsigned long)base >> 20,
+			(unsigned long)size >> 20);
+	if (!size)
+		return 0;
+
+	rc = cma_declare_contiguous(base, size, 0, 0, 0, false,
+					"fadump_cma", &fadump_cma);
+	if (rc) {
+		printk(KERN_ERR "fadump: Failed to reserve cma area for "
+				"firmware-assisted dump, %d\n", rc);
+		fw_dump.reserve_dump_area_size = 0;
+		return 0;
+	}
+	/*
+	 * So we now have cma area reserved for fadump. base may be different
+	 * from what we requested.
+	 */
+	fw_dump.reserve_dump_area_start = cma_get_base(fadump_cma);
+	fw_dump.reserve_dump_area_size = cma_get_size(fadump_cma);
+	printk("Reserved %ldMB cma area at %ldMB for firmware-assisted dump "
+		"(System RAM: %ldMB)\n",
+		cma_get_size(fadump_cma) >> 20,
+		(unsigned long)cma_get_base(fadump_cma) >> 20,
+		(unsigned long)(memblock_phys_mem_size() >> 20));
+	return 1;
+}
+
 /* Scan the Firmware Assisted dump configuration details. */
 int __init early_init_dt_scan_fw_dump(unsigned long node,
 			const char *uname, int depth, void *data)
@@ -381,7 +428,7 @@ static unsigned long get_fadump_area_size(void)
 }

 static inline unsigned long get_fadump_metadata_base(
-			const struct fadump_mem_struct *fdm)
+				const struct fadump_mem_struct *fdm)
 {
 	return
[PATCH v3 2/7] powerpc/fadump: Update documentation to reflect the metadata region movement.
From: Mahesh Salgaonkar

The metadata region that holds the fadump header and ELF header is now placed at the start of the reserved memory area. Update the documentation accordingly.

Signed-off-by: Mahesh Salgaonkar
---
 Documentation/powerpc/firmware-assisted-dump.txt | 31 +-
 1 file changed, 19 insertions(+), 12 deletions(-)

diff --git a/Documentation/powerpc/firmware-assisted-dump.txt b/Documentation/powerpc/firmware-assisted-dump.txt
index bdd344aa18d9..c77f8dc9231f 100644
--- a/Documentation/powerpc/firmware-assisted-dump.txt
+++ b/Documentation/powerpc/firmware-assisted-dump.txt
@@ -115,17 +115,24 @@
 be kept permanently reserved, so that it can act as a receptacle
 for a copy of the boot memory content in addition to CPU state
 and HPTE region, in the case a crash does occur.

+The first kernel, during fadump registration, prepares ELF core header
+that contains necessary information for the coredump of the 1st kernel
+which includes its physical memory layout. This ELF header and some more
+additional data is stored in the area called metadata region at the start
+of the reserved memory area.
+
   o Memory Reservation during first kernel

   Low memory                                        Top of memory
   0      boot memory size                                      |
   |           |       |<-- Reserved dump area -->|             |
   V           V       |  Permanent Reservation   |             V
-  +---+--/ /---+---+----+------+----+--+
-  |   |        |CPU|HPTE| DUMP |ELF |  |
-  +---+--/ /---+---+----+------+----+--+
+  +---+--/ /---+----+---+----+------+--+
+  |   |        |ELF |CPU|HPTE| DUMP |  |
+  +---+--/ /---+----+---+----+------+--+
+       |          ^                ^
+       |          |                |
+       |   metadata region         |
        \                          /
         -------------------------
   Boot memory content gets transferred to
   (reserved area by firmware at the time of crash)
@@ -137,14 +144,14 @@
   Low memory                                        Top of memory
   0      boot memory size                                      |
-  |           |       |<--- Reserved dump area --- -->|
+  |           |       |<---  Reserved dump area  --->|
   V           V       V
-  +---+--/ /---+---+----+------+----+--+
-  |   |        |CPU|HPTE| DUMP |ELF |  |
-  +---+--/ /---+---+----+------+----+--+
-        |                          |
-        V                          V
-   Used by second             /proc/vmcore
+  +---+--/ /---+----+---+----+------+--+
+  |   |        |ELF |CPU|HPTE| DUMP |  |
+  +---+--/ /---+----+---+----+------+--+
+        |                          |
+        V                          V
+   Used by second             /proc/vmcore
    kernel to boot

    Fig. 2
[PATCH v3 1/7] powerpc/fadump: Move the metadata region to start of the reserved area.
From: Mahesh Salgaonkar

Currently the metadata region that holds the crash info structure and ELF core header is placed towards the end of the reserved memory area. This patch places it at the beginning of the reserved memory area.

Signed-off-by: Mahesh Salgaonkar
---
 arch/powerpc/include/asm/fadump.h |  4 ++
 arch/powerpc/kernel/fadump.c      | 92 -
 2 files changed, 83 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/fadump.h b/arch/powerpc/include/asm/fadump.h
index 5a23010af600..44b6ebfe9be6 100644
--- a/arch/powerpc/include/asm/fadump.h
+++ b/arch/powerpc/include/asm/fadump.h
@@ -61,6 +61,9 @@
 #define FADUMP_UNREGISTER	2
 #define FADUMP_INVALIDATE	3

+/* Number of dump sections requested by kernel */
+#define FADUMP_NUM_SECTIONS	4
+
 /* Dump status flag */
 #define FADUMP_ERROR_FLAG	0x2000

@@ -119,6 +122,7 @@ struct fadump_mem_struct {
 	struct fadump_section	cpu_state_data;
 	struct fadump_section	hpte_region;
 	struct fadump_section	rmr_region;
+	struct fadump_section	metadata_region;
 };

 /* Firmware-assisted dump configuration details. */
diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3c2c2688918f..80552fd7d5f8 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -188,17 +188,48 @@ static void fadump_show_config(void)
 	pr_debug("Boot memory size  : %lx\n", fw_dump.boot_memory_size);
 }

+static unsigned long get_fadump_metadata_size(void)
+{
+	unsigned long size = 0;
+
+	/*
+	 * If fadump is active then look into fdm_active to get metadata
+	 * size set by previous kernel.
+	 */
+	if (fw_dump.dump_active) {
+		size = fdm_active->cpu_state_data.destination_address -
+			fdm_active->metadata_region.source_address;
+		goto out;
+	}
+
+	size += sizeof(struct fadump_crash_info_header);
+	size += sizeof(struct elfhdr);   /* ELF core header.*/
+	size += sizeof(struct elf_phdr); /* place holder for cpu notes */
+	/* Program headers for crash memory regions. */
+	size += sizeof(struct elf_phdr) * (memblock_num_regions(memory) + 2);
+
+	size = PAGE_ALIGN(size);
+out:
+	pr_debug("fadump metadata size is %ld\n", size);
+	return size;
+}
+
 static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
 						unsigned long addr)
 {
+	uint16_t num_sections = 0;
+	unsigned long metadata_base = 0;
+
 	if (!fdm)
 		return 0;

+	/* Skip the fadump metadata area. */
+	metadata_base = addr;
+	addr += get_fadump_metadata_size();
+
 	memset(fdm, 0, sizeof(struct fadump_mem_struct));
 	addr = addr & PAGE_MASK;

 	fdm->header.dump_format_version = cpu_to_be32(0x0001);
-	fdm->header.dump_num_sections = cpu_to_be16(3);
 	fdm->header.dump_status_flag = 0;
 	fdm->header.offset_first_dump_section =
 		cpu_to_be32((u32)offsetof(struct fadump_mem_struct, cpu_state_data));
@@ -222,6 +253,7 @@ static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
 	fdm->cpu_state_data.source_address = 0;
 	fdm->cpu_state_data.source_len = cpu_to_be64(fw_dump.cpu_state_data_size);
 	fdm->cpu_state_data.destination_address = cpu_to_be64(addr);
+	num_sections++;
 	addr += fw_dump.cpu_state_data_size;

 	/* hpte region section */
@@ -230,6 +262,7 @@ static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
 	fdm->hpte_region.source_address = 0;
 	fdm->hpte_region.source_len = cpu_to_be64(fw_dump.hpte_region_size);
 	fdm->hpte_region.destination_address = cpu_to_be64(addr);
+	num_sections++;
 	addr += fw_dump.hpte_region_size;

 	/* RMA region section */
@@ -238,8 +271,25 @@ static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
 	fdm->rmr_region.source_address = cpu_to_be64(RMA_START);
 	fdm->rmr_region.source_len = cpu_to_be64(fw_dump.boot_memory_size);
 	fdm->rmr_region.destination_address = cpu_to_be64(addr);
+	num_sections++;
 	addr += fw_dump.boot_memory_size;

+	/*
+	 * fadump metadata section.
+	 * Add an entry with source len as zero. We are only interested in
+	 * the source address, which will help us to detect the location of
+	 * the metadata area where the fadump header and ELF core header are
+	 * placed.
+	 */
+	fdm->metadata_region.request_flag = cpu_to_be32(FADUMP_REQUEST_FLAG);
+	fdm->metadata_region.source_data_type =
+		cpu_to_be16(FADUMP_REAL_MODE_REGION);
+	fdm->metadata_region.source_address = cpu_to_be64(metadata_base);
+
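The sizing logic in get_fadump_metadata_size() is a simple sum of header sizes plus one program header per memory region (with two spares), rounded up to a page. A Python sketch of that arithmetic (struct sizes are made-up stand-ins; the kernel uses the real sizeof values and PAGE_ALIGN):

```python
# Sketch of the metadata sizing in this patch: sum the crash-info header,
# the ELF header, one placeholder phdr for CPU notes, and one phdr per
# memory region (+2 spares), then round up to a page boundary.
PAGE_SIZE = 4096

def page_align(n):
    # Equivalent of the kernel's PAGE_ALIGN(): round n up to PAGE_SIZE.
    return (n + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)

def metadata_size(num_memory_regions,
                  crash_hdr=1024, elf_hdr=64, elf_phdr=56):  # illustrative sizes
    size = crash_hdr + elf_hdr + elf_phdr          # headers + CPU-notes phdr
    size += elf_phdr * (num_memory_regions + 2)    # phdrs for crash regions
    return page_align(size)

print(page_align(1))        # 4096
print(metadata_size(4))     # 4096 (everything fits in one page here)
```

Page-aligning the result is what lets the dump sections that follow the metadata area start on a page boundary, as init_fadump_mem_struct() assumes.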
[PATCH v3 0/7] powerpc/fadump: Improvements and fixes for firmware-assisted dump.
One of the primary issues with Firmware Assisted Dump (fadump) on Power is that it needs a large amount of memory to be reserved. This reserved memory is used for saving the contents of the old crashed kernel's memory before the fadump capture kernel uses the old kernel's memory area to boot. However, this reserved memory area stays unused until a system crash and isn't available for the production kernel to use.

Instead of setting aside a significant chunk of memory that nobody can use, take advantage of the Linux kernel's Contiguous Memory Allocator (CMA) feature to reserve a significant chunk of memory that the kernel is prevented from using, but applications are free to use.

Patch 1 moves the metadata region to the start of the reserved area for easy handling/detection of the metadata region in the second kernel. Patch 3 implements the usage of the CMA region to allow the production kernel to use that memory for applications, making fadump reservationless. Patches 4-7 fix various fadump issues and bugs.

Changes in V3:
- patch 1 & 2: move metadata region and documentation update.
- patch 7: Un-register fadump on the kexec path.

---

Mahesh Salgaonkar (7):
      powerpc/fadump: Move the metadata region to start of the reserved area.
      powerpc/fadump: Update documentation to reflect the metadata region movement.
      powerpc/fadump: Reservationless firmware assisted dump
      powerpc/fadump: exclude memory holes while reserving memory in second kernel.
      powerpc/fadump: throw proper error message on fadump registration failure.
      powerpc/fadump: Do not allow hot-remove memory from fadump reserved area.
      powerpc/fadump: un-register fadump on kexec path.

 Documentation/powerpc/firmware-assisted-dump.txt |  31 ++-
 arch/powerpc/include/asm/fadump.h                |   6 +
 arch/powerpc/kernel/fadump.c                     | 275 +++---
 arch/powerpc/platforms/pseries/hotplug-memory.c  |   7 -
 4 files changed, 271 insertions(+), 48 deletions(-)

--
Signature