Re: [PATCH v13 00/10] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-24 Thread Gabriel Paubert
On Thu, Jan 24, 2019 at 04:58:41PM +0100, Christophe Leroy wrote:
> 
> 
> Le 24/01/2019 à 16:01, Christophe Leroy a écrit :
> > 
> > 
> > Le 24/01/2019 à 10:43, Christophe Leroy a écrit :
> > > 
> > > 
> > > On 01/24/2019 01:06 AM, Michael Ellerman wrote:
> > > > Christophe Leroy  writes:
> > > > > Le 12/01/2019 à 10:55, Christophe Leroy a écrit :
> > > > > > The purpose of this serie is to activate
> > > > > > CONFIG_THREAD_INFO_IN_TASK which
> > > > > > moves the thread_info into task_struct.
> > > > > > 
> > > > > > Moving thread_info into task_struct has the following advantages:
> > > > > > - It protects thread_info from corruption in the case of stack
> > > > > > overflows.
> > > > > > - Its address is harder to determine if stack addresses are
> > > > > > leaked, making a number of attacks more difficult.
> > > > > 
> > > > > I ran null_syscall and context_switch benchmark selftests
> > > > > and the result
> > > > > is surprising. There is slight degradation in context_switch and a
> > > > > significant one on null_syscall:
> > > > > 
> > > > > Without the serie:
> > > > > 
> > > > > ~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
> > > > > 55542
> > > > > 55562
> > > > > 55564
> > > > > 55562
> > > > > 55568
> > > > > ...
> > > > > 
> > > > > ~# ./null_syscall
> > > > >  2546.71 ns 336.17 cycles
> > > > > 
> > > > > 
> > > > > With the serie:
> > > > > 
> > > > > ~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
> > > > > 55138
> > > > > 55142
> > > > > 55152
> > > > > 55144
> > > > > 55142
> > > > > 
> > > > > ~# ./null_syscall
> > > > >  3479.54 ns 459.30 cycles
> > > > > 
> > > > > So 0,8% less context switches per second and 37% more time
> > > > > for one syscall ?
> > > > > 
> > > > > Any idea ?
> > > > 
> > > > What platform is that on?
> > > 
> > > It is on the 8xx
> 
> On the 83xx, I have a slight improvment:
> 
> Without the serie:
> 
> root@vgoippro:~# ./null_syscall
> 921.44 ns 307.15 cycles
> 
> With the serie:
> 
> root@vgoippro:~# ./null_syscall
> 918.78 ns 306.26 cycles
> 

The 8xx has very low cache associativity, something like 2, right?

In this case it may be access patterns which change the number of
cache line transfers when you move things around. 

Try to move things around in main(), for example allocate a variable of
~1kB on the stack in the function that performs the null_syscalls (use 
the variable before and after the loop, to avoid clever compiler
optimizations).

Gabriel


> Christophe
> 
> > > 
> > > > 
> > > > On 64-bit we have to turn one mtmsrd into two and that's obviously a
> > > > slow down. But I don't see that you've done anything similar in 32-bit
> > > > code.
> > > > 
> > > > I assume it's patch 8 that causes the slow down?
> > > 
> > > I have not digged into it yet, but why patch 8 ?
> > > 
> > 
> > The increase of null_syscall duration happens with patch 5 when we
> > activate CONFIG_THREAD_INFO_IN_TASK.
> > 


Re: BUG: memcmp(): Accessing invalid memory location

2019-01-24 Thread Christophe Leroy




Le 25/01/2019 à 01:55, Benjamin Herrenschmidt a écrit :

On Thu, 2019-01-24 at 19:48 +0530, Chandan Rajendra wrote:

- Here we execute "LD rB,0,r4". In the case of this bug, r4 has an unaligned
   value and hence ends up accessing the "next" double word. The "next" double
   word happens to occur after the last page mapped into the kernel's address
   space and hence this leads to the previously listed oops.
   


This is interesting ... should we mark the last page of any piece of
mapped linear mapping as reserved to avoid that sort of issue ?


Or revert to a normal comparison once remaining length is < 8 and r4 in 
unaligned ?


Christophe



Nick ? Aneesh ?

Cheers,
Ben.



RE: [PATCH] dmaengine: fsldma: Add 64-bit I/O accessors for powerpc64

2019-01-24 Thread Peng Ma
Hi Vinod,

Sorry to replay late.
1:This patch has already send to the patchwork.
Please see the patch link: https://patchwork.kernel.org/patch/10741521/
2:I have already compile the fsl patches on arm and powerpc after patched 
https://patchwork.kernel.org/patch/10741521/
The compile will successful, please let me know the reported regression 
results, thanks very much.

Best Regards,
Peng

>-Original Message-
>From: Vinod Koul 
>Sent: 2019年1月19日 20:59
>To: Peng Ma 
>Cc: Scott Wood ; Leo Li ; Zhang Wei
>; linuxppc-dev@lists.ozlabs.org;
>dmaeng...@vger.kernel.org; Wen He 
>Subject: Re: [PATCH] dmaengine: fsldma: Add 64-bit I/O accessors for
>powerpc64
>
>On 24-12-18, 05:29, Peng Ma wrote:
>> Hi Scott,
>>
>> Oh, I did not see the in_XX64/out_XX64 supported only __powerpc64__ just
>now.
>> Thanks for your reminder.
>
>Can you send the formal patch for this...
>
>FWIW, fsl patches were not merged last cycle because of reported regression...
>
>>
>> #ifdef __powerpc64__
>>
>> #ifdef __BIG_ENDIAN__
>> DEF_MMIO_OUT_D(out_be64, 64, std);
>> DEF_MMIO_IN_D(in_be64, 64, ld);
>>
>> /* There is no asm instructions for 64 bits reverse loads and stores
>> */ static inline u64 in_le64(const volatile u64 __iomem *addr) {
>> return swab64(in_be64(addr));
>> }
>>
>> static inline void out_le64(volatile u64 __iomem *addr, u64 val) {
>> out_be64(addr, swab64(val));
>> }
>> #else
>> DEF_MMIO_OUT_D(out_le64, 64, std);
>> DEF_MMIO_IN_D(in_le64, 64, ld);
>>
>> /* There is no asm instructions for 64 bits reverse loads and stores
>> */ static inline u64 in_be64(const volatile u64 __iomem *addr) {
>> return swab64(in_le64(addr));
>> }
>>
>> static inline void out_be64(volatile u64 __iomem *addr, u64 val) {
>> out_le64(addr, swab64(val));
>> }
>>
>> #endif
>> #endif /* __powerpc64__ */
>>
>> Best Regards,
>> Peng
>> >-Original Message-
>> >From: Scott Wood 
>> >Sent: 2018年12月24日 12:46
>> >To: Peng Ma ; Leo Li ; Zhang
>Wei
>> >
>> >Cc: linuxppc-dev@lists.ozlabs.org; dmaeng...@vger.kernel.org; Wen He
>> >
>> >Subject: Re: [PATCH] dmaengine: fsldma: Add 64-bit I/O accessors for
>> >powerpc64
>> >
>> >On Mon, 2018-12-24 at 03:42 +, Peng Ma wrote:
>> >> Hi Scott,
>> >>
>> >> You are right, we should support powerpc64, so could I changed it
>> >> as
>> >> fallows:
>> >>
>> >> diff --git a/drivers/dma/fsldma.h b/drivers/dma/fsldma.h index
>> >> 88db939..057babf 100644
>> >> --- a/drivers/dma/fsldma.h
>> >> +++ b/drivers/dma/fsldma.h
>> >> @@ -202,35 +202,10 @@ struct fsldma_chan {
>> >>  #define fsl_iowrite32(v, p)out_le32(p, v)
>> >>  #define fsl_iowrite32be(v, p)  out_be32(p, v)
>> >>
>> >> -#ifndef __powerpc64__
>> >> -static u64 fsl_ioread64(const u64 __iomem *addr) -{
>> >> -   u32 fsl_addr = lower_32_bits(addr);
>> >> -   u64 fsl_addr_hi = (u64)in_le32((u32 *)(fsl_addr + 1)) << 32;
>> >> -
>> >> -   return fsl_addr_hi | in_le32((u32 *)fsl_addr);
>> >> -}
>> >> -
>> >> -static void fsl_iowrite64(u64 val, u64 __iomem *addr) -{
>> >> -   out_le32((u32 __iomem *)addr + 1, val >> 32);
>> >> -   out_le32((u32 __iomem *)addr, (u32)val);
>> >> -}
>> >> -
>> >> -static u64 fsl_ioread64be(const u64 __iomem *addr) -{
>> >> -   u32 fsl_addr = lower_32_bits(addr);
>> >> -   u64 fsl_addr_hi = (u64)in_be32((u32 *)fsl_addr) << 32;
>> >> -
>> >> -   return fsl_addr_hi | in_be32((u32 *)(fsl_addr + 1));
>> >> -}
>> >> -
>> >> -static void fsl_iowrite64be(u64 val, u64 __iomem *addr) -{
>> >> -   out_be32((u32 __iomem *)addr, val >> 32);
>> >> -   out_be32((u32 __iomem *)addr + 1, (u32)val);
>> >> -}
>> >> -#endif
>> >> +#define fsl_ioread64(p)in_le64(p)
>> >> +#define fsl_ioread64be(p)  in_be64(p)
>> >> +#define fsl_iowrite64(v, p)out_le64(p, v)
>> >> +#define fsl_iowrite64be(v, p)  out_be64(p, v)
>> >>  #endif
>> >
>> >Then you'll break 32-bit, assuming those
>> >fake-it-with-two-32-bit-accesses were actually needed.
>> >
>> >-Scott
>> >
>>
>
>--
>~Vinod


[RFC PATCH 2/2] cxl: Force a CAPP reset when unloading CXL module

2019-01-24 Thread Vaibhav Jain
This patch forces shutdown of CAPP when CXL module is unloaded. This
is accomplished via a call to pnv_phb_to_cxl_mode() with mode ==
OPAL_PHB_CAPI_MODE_PCIE.

Signed-off-by: Vaibhav Jain 
---
 drivers/misc/cxl/cxl.h  |  1 +
 drivers/misc/cxl/main.c |  3 +++
 drivers/misc/cxl/pci.c  | 25 -
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index d1d927ccb589..e545c2b81faf 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -1136,4 +1136,5 @@ void cxl_context_mm_count_get(struct cxl_context *ctx);
 /* Decrements the reference count to "struct mm_struct" */
 void cxl_context_mm_count_put(struct cxl_context *ctx);
 
+void cxl_pci_shutdown_capp(void);
 #endif
diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c
index f35406be465a..f14ff0dcf231 100644
--- a/drivers/misc/cxl/main.c
+++ b/drivers/misc/cxl/main.c
@@ -372,6 +372,9 @@ static void exit_cxl(void)
if (cxl_is_power8())
unregister_cxl_calls(&cxl_calls);
idr_destroy(&cxl_adapter_idr);
+
+   if (cpu_has_feature(CPU_FTR_HVMODE))
+   cxl_pci_shutdown_capp();
 }
 
 module_init(init_cxl);
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index c79ba1c699ad..01be2e2d1069 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -25,7 +25,7 @@
 
 #include "cxl.h"
 #include 
-
+#include 
 
 #define CXL_PCI_VSEC_ID0x1280
 #define CXL_VSEC_MIN_SIZE 0x80
@@ -2065,6 +2065,29 @@ static void cxl_pci_resume(struct pci_dev *pdev)
}
 }
 
+void cxl_pci_shutdown_capp(void)
+{
+   struct pci_dev *pdev;
+   struct pci_bus *root_bus;
+   int rc;
+
+   /* Iterate over all CAPP supported PHB's and force them to PCI mode */
+   list_for_each_entry(root_bus, &pci_root_buses, node) {
+   for_each_pci_bridge(pdev, root_bus) {
+
+   if (!cxllib_slot_is_supported(pdev, 0))
+   continue;
+
+   rc = pnv_phb_to_cxl_mode(pdev,
+OPAL_PHB_CAPI_MODE_PCIE);
+   if (rc)
+   dev_err(&pdev->dev,
+   "cxl: Error resetting CAPP. Err=%d\n",
+   rc);
+   }
+   }
+}
+
 static const struct pci_error_handlers cxl_err_handler = {
.error_detected = cxl_pci_error_detected,
.slot_reset = cxl_pci_slot_reset,
-- 
2.20.1



[RFC PATCH 1/2] powerpc/powernv: Add support for CXL mode switch that need PHB reset

2019-01-24 Thread Vaibhav Jain
Recent updates to OPAL [1] have provided support for new CXL modes on
PHB that need to force a cold reset on the bridge (CRESET). However
PHB CRESET is a multi step process and cannot be completed
synchronously as expected by current kernel implementation that issues
opal call opal_pci_set_phb_cxl_mode().

Hence this patch updates pnv_phb_to_cxl_mode() to implement a polling
loop that handles specific error codes (OPAL_BUSY) returned from
opal_pci_set_phb_cxl_mode() and drive the OPAL pci-state machine, if the
requested CXL mode needs a CRESET.

The patch also updates pnv_phb_to_cxl_mode() to convert and return
OPAL error codes into kernel error codes. This removes a previous
issue where callers to this function would have to include
'opal-api.h' to check for specific OPAL error codes.

References:
[1]: https://lists.ozlabs.org/pipermail/skiboot/2019-January/013063.html

Signed-off-by: Vaibhav Jain 
---
 arch/powerpc/platforms/powernv/pci-cxl.c | 71 +---
 1 file changed, 63 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-cxl.c 
b/arch/powerpc/platforms/powernv/pci-cxl.c
index 1b18111453d7..d33d662c6212 100644
--- a/arch/powerpc/platforms/powernv/pci-cxl.c
+++ b/arch/powerpc/platforms/powernv/pci-cxl.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "pci.h"
 
@@ -18,21 +19,75 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t mode)
struct pci_controller *hose = pci_bus_to_host(dev->bus);
struct pnv_phb *phb = hose->private_data;
struct pnv_ioda_pe *pe;
+   unsigned long starttime, endtime;
int rc;
 
pe = pnv_ioda_get_pe(dev);
if (!pe)
-   return -ENODEV;
+   return -ENOENT;
+
+   pe_info(pe, "Switching PHB to CXL mode=%d\n", mode);
+
+   /*
+* Use a 15 second timeout for mode switch. Value arrived after
+* limited testing and may need more tweaking.
+*/
+   starttime = jiffies;
+   endtime = starttime + HZ * 15;
+
+   do {
+   rc = opal_pci_set_phb_cxl_mode(phb->opal_id, mode,
+  pe->pe_number);
+
+   /* Wait until mode transistion done */
+   if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT)
+   break;
+
+   /* Check if we timedout */
+   if (time_after(jiffies, endtime)) {
+   rc = OPAL_TIMEOUT;
+   break;
+   }
 
-   pe_info(pe, "Switching PHB to CXL\n");
+   /* Opal Busy with mode switch. Run pci state-machine */
+   rc = opal_pci_poll(phb->opal_id);
+   if (rc >= 0) {
+   /* wait for some time */
+   if (rc > 0)
+   msleep(rc);
+   opal_poll_events(NULL);
+   rc = OPAL_BUSY;
+   /* Continue with the mode switch */
+   }
+   } while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT);
+
+   pe_level_printk(pe, KERN_DEBUG, "CXL mode switch finished in %u-msecs.",
+   jiffies_to_msecs(jiffies - starttime));
 
-   rc = opal_pci_set_phb_cxl_mode(phb->opal_id, mode, pe->pe_number);
-   if (rc == OPAL_UNSUPPORTED)
-   dev_err(&dev->dev, "Required cxl mode not supported by firmware 
- update skiboot\n");
-   else if (rc)
-   dev_err(&dev->dev, "opal_pci_set_phb_cxl_mode failed: %i\n", 
rc);
+   /* Check OPAL errors and convert them to kernel error codes */
+   switch (rc) {
+   case OPAL_SUCCESS:
+   return 0;
 
-   return rc;
+   case OPAL_PARAMETER:
+   dev_err(&dev->dev, "CXL not supported on this PHB\n");
+   return -ENOENT;
+
+   case OPAL_UNSUPPORTED:
+   dev_err(&dev->dev,
+   "Required cxl mode not supported by firmware"
+   " - update skiboot\n");
+   return -ENODEV;
+
+   case OPAL_TIMEOUT:
+   dev_err(&dev->dev, "opal_pci_set_phb_cxl_mode Timedout\n");
+   return -ETIME;
+
+   default:
+   dev_err(&dev->dev,
+   "opal_pci_set_phb_cxl_mode failed: %i\n", rc);
+   return -EIO;
+   };
 }
 EXPORT_SYMBOL(pnv_phb_to_cxl_mode);
 
-- 
2.20.1



[RFC PATCH 0/2] cxl: Add support for disabling CAPP when unloading CXL

2019-01-24 Thread Vaibhav Jain
Recent updates to OPAL have implemented necessary infrastructure [1] to disable
CAPP and switch PHB back to PCIE mode during fast reset. This small patch-set
uses the same OPAL infrastructure to force disable of CAPP when CXL module is
unloaded via rmmod.

References:
[1]: https://lists.ozlabs.org/pipermail/skiboot/2019-January/013063.html

Vaibhav Jain (2):
  powerpc/powernv: Add support for CXL mode switch that need PHB reset
  cxl: Force a CAPP reset when unloading CXL module

 arch/powerpc/platforms/powernv/pci-cxl.c | 71 +---
 drivers/misc/cxl/cxl.h   |  1 +
 drivers/misc/cxl/main.c  |  3 +
 drivers/misc/cxl/pci.c   | 25 -
 4 files changed, 91 insertions(+), 9 deletions(-)

-- 
2.20.1



[PATCH] cxl: Wrap iterations over afu slices inside 'afu_list_lock'

2019-01-24 Thread Vaibhav Jain
Within cxl module, iteration over array 'adapter->slices' may be racy
at few points as it might be simultaneously read during an EEH and its
contents being set to NULL while driver is being unloaded or unbound
from the adapter. This might result in a NULL pointer to 'struct afu'
being de-referenced during an EEH thereby causing a kernel oops.

This patch fixes this by making sure that all access to the array
'adapter->slices' is wrapped within the context of spin-lock
'adapter->afu_list_lock'.

Signed-off-by: Vaibhav Jain 
---
 drivers/misc/cxl/guest.c |  2 ++
 drivers/misc/cxl/pci.c   | 38 --
 2 files changed, 30 insertions(+), 10 deletions(-)

diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index 5d28d9e454f5..08f4a512afad 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -267,6 +267,7 @@ static int guest_reset(struct cxl *adapter)
int i, rc;
 
pr_devel("Adapter reset request\n");
+   spin_lock(&adapter->afu_list_lock);
for (i = 0; i < adapter->slices; i++) {
if ((afu = adapter->afu[i])) {
pci_error_handlers(afu, CXL_ERROR_DETECTED_EVENT,
@@ -283,6 +284,7 @@ static int guest_reset(struct cxl *adapter)
pci_error_handlers(afu, CXL_RESUME_EVENT, 0);
}
}
+   spin_unlock(&adapter->afu_list_lock);
return rc;
 }
 
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index c79ba1c699ad..28c28bceb063 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1805,7 +1805,7 @@ static pci_ers_result_t cxl_vphb_error_detected(struct 
cxl_afu *afu,
/* There should only be one entry, but go through the list
 * anyway
 */
-   if (afu->phb == NULL)
+   if (afu == NULL || afu->phb == NULL)
return result;
 
list_for_each_entry(afu_dev, &afu->phb->bus->devices, bus_list) {
@@ -1843,6 +1843,8 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 
/* If we're permanently dead, give up. */
if (state == pci_channel_io_perm_failure) {
+   /* Stop the slice traces */
+   spin_lock(&adapter->afu_list_lock);
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
/*
@@ -1851,6 +1853,7 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 */
cxl_vphb_error_detected(afu, state);
}
+   spin_unlock(&adapter->afu_list_lock);
return PCI_ERS_RESULT_DISCONNECT;
}
 
@@ -1932,14 +1935,20 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 * * In slot_reset, free the old resources and allocate new ones.
 * * In resume, clear the flag to allow things to start.
 */
+
+   /* Make sure no one else changes the afu list */
+   spin_lock(&adapter->afu_list_lock);
+
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
 
afu_result = cxl_vphb_error_detected(afu, state);
 
-   cxl_context_detach_all(afu);
-   cxl_ops->afu_deactivate_mode(afu, afu->current_mode);
-   pci_deconfigure_afu(afu);
+   if (afu != NULL) {
+   cxl_context_detach_all(afu);
+   cxl_ops->afu_deactivate_mode(afu, afu->current_mode);
+   pci_deconfigure_afu(afu);
+   }
 
/* Disconnect trumps all, NONE trumps NEED_RESET */
if (afu_result == PCI_ERS_RESULT_DISCONNECT)
@@ -1948,6 +1957,7 @@ static pci_ers_result_t cxl_pci_error_detected(struct 
pci_dev *pdev,
 (result == PCI_ERS_RESULT_NEED_RESET))
result = PCI_ERS_RESULT_NONE;
}
+   spin_unlock(&adapter->afu_list_lock);
 
/* should take the context lock here */
if (cxl_adapter_context_lock(adapter) != 0)
@@ -1980,14 +1990,15 @@ static pci_ers_result_t cxl_pci_slot_reset(struct 
pci_dev *pdev)
 */
cxl_adapter_context_unlock(adapter);
 
+   spin_lock(&adapter->afu_list_lock);
for (i = 0; i < adapter->slices; i++) {
afu = adapter->afu[i];
 
if (pci_configure_afu(afu, adapter, pdev))
-   goto err;
+   goto err_unlock;
 
if (cxl_afu_select_best_mode(afu))
-   goto err;
+   goto err_unlock;
 
if (afu->phb == NULL)
continue;
@@ -1999,16 +2010,16 @@ static pci_ers_result_t cxl_pci_slot_reset(struct 
pci_dev *pdev)
ctx = cxl_get_context(afu_dev);
 
if (ctx && cxl_release_context(ctx))
-   goto err;
+ 

Re: BUG: memcmp(): Accessing invalid memory location

2019-01-24 Thread Benjamin Herrenschmidt
On Thu, 2019-01-24 at 19:48 +0530, Chandan Rajendra wrote:
> - Here we execute "LD rB,0,r4". In the case of this bug, r4 has an unaligned
>   value and hence ends up accessing the "next" double word. The "next" double
>   word happens to occur after the last page mapped into the kernel's address
>   space and hence this leads to the previously listed oops.
>   

This is interesting ... should we mark the last page of any piece of
mapped linear mapping as reserved to avoid that sort of issue ?

Nick ? Aneesh ?

Cheers,
Ben.




Re: [RFC 5/6] powerpc/pci/hotplug: Use common drcinfo parsing

2019-01-24 Thread Tyrel Datwyler
On 01/14/2019 04:28 PM, Bjorn Helgaas wrote:
> On Fri, Dec 14, 2018 at 02:51:31PM -0600, Michael Bringmann wrote:
>> The implementation of the pseries-specific drc info properties
>> is currently implemented in pseries-specific and non-pseries-specific
>> files.  This patch set uses a new implementation of the device-tree
>> parsing code for the properties.
>>
>> This patch refactors parsing of the pseries-specific drc-info properties
>> out of rpaphp_core.c to use the common parser.  In the case where an
>> architecture does not use these properties, an __weak copy of the
>> function is provided with dummy return values.  Changes include creating
>> appropriate callback functions and passing callback-specific data
>> blocks into arch_find_drc_match.  All functions that were used just
>> to support the previous parsing implementation have been moved.
>>
>> Signed-off-by: Michael Bringmann 
> 
> This is fine with me.  Any rpaphp_core.c maintainers want to comment?
> Tyrel?

It greatly simplifies the code in rpaphp_core.c, and as far as I can tell the
refactoring maintains the existing functionality.

Acked-by: Tyrel Datwyler 

> 
> $ ./scripts/get_maintainer.pl -f drivers/pci/hotplug/rpaphp_core.c
> Tyrel Datwyler  (supporter:IBM Power PCI Hotplug 
> Driver for RPA-compliant...)
> Benjamin Herrenschmidt  (supporter:LINUX FOR 
> POWERPC (32-BIT AND 64-BIT))
> Paul Mackerras  (supporter:LINUX FOR POWERPC (32-BIT AND 
> 64-BIT))
> Michael Ellerman  (supporter:LINUX FOR POWERPC (32-BIT 
> AND 64-BIT))
> 
>> ---
>>  drivers/pci/hotplug/rpaphp_core.c |  232 
>> -
>>  1 file changed, 28 insertions(+), 204 deletions(-)
>>
>> diff --git a/drivers/pci/hotplug/rpaphp_core.c 
>> b/drivers/pci/hotplug/rpaphp_core.c
>> index bcd5d35..9ad7384 100644
>> --- a/drivers/pci/hotplug/rpaphp_core.c
>> +++ b/drivers/pci/hotplug/rpaphp_core.c
>> @@ -154,182 +154,18 @@ static enum pci_bus_speed get_max_bus_speed(struct 
>> slot *slot)
>>  return speed;
>>  }
>>  
>> -static int get_children_props(struct device_node *dn, const int 
>> **drc_indexes,
>> -const int **drc_names, const int **drc_types,
>> -const int **drc_power_domains)
>> -{
>> -const int *indexes, *names, *types, *domains;
>> -
>> -indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
>> -names = of_get_property(dn, "ibm,drc-names", NULL);
>> -types = of_get_property(dn, "ibm,drc-types", NULL);
>> -domains = of_get_property(dn, "ibm,drc-power-domains", NULL);
>> -
>> -if (!indexes || !names || !types || !domains) {
>> -/* Slot does not have dynamically-removable children */
>> -return -EINVAL;
>> -}
>> -if (drc_indexes)
>> -*drc_indexes = indexes;
>> -if (drc_names)
>> -/* &drc_names[1] contains NULL terminated slot names */
>> -*drc_names = names;
>> -if (drc_types)
>> -/* &drc_types[1] contains NULL terminated slot types */
>> -*drc_types = types;
>> -if (drc_power_domains)
>> -*drc_power_domains = domains;
>> -
>> -return 0;
>> -}
>> -
>> -
>>  /* Verify the existence of 'drc_name' and/or 'drc_type' within the
>> - * current node.  First obtain it's my-drc-index property.  Next,
>> - * obtain the DRC info from it's parent.  Use the my-drc-index for
>> - * correlation, and obtain/validate the requested properties.
>> + * current node.
>>   */
>>  
>> -static int rpaphp_check_drc_props_v1(struct device_node *dn, char *drc_name,
>> -char *drc_type, unsigned int my_index)
>> -{
>> -char *name_tmp, *type_tmp;
>> -const int *indexes, *names;
>> -const int *types, *domains;
>> -int i, rc;
>> -
>> -rc = get_children_props(dn->parent, &indexes, &names, &types, &domains);
>> -if (rc < 0) {
>> -return -EINVAL;
>> -}
>> -
>> -name_tmp = (char *) &names[1];
>> -type_tmp = (char *) &types[1];
>> -
>> -/* Iterate through parent properties, looking for my-drc-index */
>> -for (i = 0; i < be32_to_cpu(indexes[0]); i++) {
>> -if ((unsigned int) indexes[i + 1] == my_index)
>> -break;
>> -
>> -name_tmp += (strlen(name_tmp) + 1);
>> -type_tmp += (strlen(type_tmp) + 1);
>> -}
>> -
>> -if (((drc_name == NULL) || (drc_name && !strcmp(drc_name, name_tmp))) &&
>> -((drc_type == NULL) || (drc_type && !strcmp(drc_type, type_tmp
>> -return 0;
>> -
>> -return -EINVAL;
>> -}
>> -
>> -static int rpaphp_check_drc_props_v2(struct device_node *dn, char *drc_name,
>> -char *drc_type, unsigned int my_index)
>> -{
>> -struct property *info;
>> -unsigned int entries;
>> -struct of_drc_info drc;
>> -const __be32 *value;
>> -char cell_drc_name[MAX_DRC_NAME_LEN];
>> -int j, fndit;
>> -
>> -info = of_find_property(dn->parent, "ibm,drc-info", NULL);
>> -if (info == NULL)

Re: [RFC 1/6] powerpc:/drc Define interface to acquire arch-specific drc info

2019-01-24 Thread Tyrel Datwyler
On 12/14/2018 12:50 PM, Michael Bringmann wrote:
> Define interface to acquire arch-specific drc info to match against
> hotpluggable devices.  The current implementation exposes several
> pseries-specific dynamic memory properties in generic kernel code.
> This patch set provides an interface to pull that code out of the
> generic kernel.
> 
> Signed-off-by: Michael Bringmann 
> ---
>  include/linux/topology.h |9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index cb0775e..df97f5f 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -44,6 +44,15 @@
>  
>  int arch_update_cpu_topology(void);

On another note a kern doc comment for this function would also be nice.

-Tyrel

>  
> +int arch_find_drc_match(struct device_node *dn,
> + bool (*usercb)(struct device_node *dn,
> + u32 drc_index, char *drc_name,
> + char *drc_type, u32 drc_power_domain,
> + void *data),
> + char *opt_drc_type, char *opt_drc_name,
> + bool match_drc_index, bool ck_php_type,
> + void *data);
> +
>  /* Conform to ACPI 2.0 SLIT distance definitions */
>  #define LOCAL_DISTANCE   10
>  #define REMOTE_DISTANCE  20
> 



Re: [RFC 3/6] pseries/drcinfo: Pseries impl of arch_find_drc_info

2019-01-24 Thread Tyrel Datwyler
On 12/14/2018 12:51 PM, Michael Bringmann wrote:
> This patch provides a common interface to parse ibm,drc-indexes,
> ibm,drc-names, ibm,drc-types, ibm,drc-power-domains, or ibm,drc-info.
> The generic interface arch_find_drc_match is provided which accepts
> callback functions that may be applied to examine the data for each
> entry.
> 

The title of your patch is "pseries/drcinfo: Pseries impl of arch_find_drc_info"
but the name of the function you are ultimately implementing is
arch_find_drc_match if I'm not mistaken.

> Signed-off-by: Michael Bringmann 
> ---
>  arch/powerpc/include/asm/prom.h |3 
>  arch/powerpc/platforms/pseries/of_helpers.c |  299 
> +++
>  include/linux/topology.h|2 
>  3 files changed, 298 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
> index b04c5ce..910d1dc 100644
> --- a/arch/powerpc/include/asm/prom.h
> +++ b/arch/powerpc/include/asm/prom.h
> @@ -91,9 +91,6 @@ struct of_drc_info {
>   u32 last_drc_index;
>  };
>  
> -extern int of_read_drc_info_cell(struct property **prop,
> - const __be32 **curval, struct of_drc_info *data);
> -
>  
>  /*
>   * There are two methods for telling firmware what our capabilities are.
> diff --git a/arch/powerpc/platforms/pseries/of_helpers.c 
> b/arch/powerpc/platforms/pseries/of_helpers.c
> index 0185e50..11c90cd 100644
> --- a/arch/powerpc/platforms/pseries/of_helpers.c
> +++ b/arch/powerpc/platforms/pseries/of_helpers.c
> @@ -1,5 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  
> +#define pr_fmt(fmt) "drc: " fmt
> +
>  #include 
>  #include 
>  #include 
> @@ -11,6 +13,12 @@
>  
>  #define  MAX_DRC_NAME_LEN 64
>  
> +static int drc_debug;
> +#define dbg(args...) if (drc_debug) { printk(KERN_DEBUG args); }
> +#define err(arg...) printk(KERN_ERR args);
> +#define info(arg...) printk(KERN_INFO args);
> +#define warn(arg...) printk(KERN_WARNING args);

Its pretty standard these days to use the pr_debug, pr_err, pr_info, pr_warn
variations over printk(LEVEL args).

> +
>  /**
>   * pseries_of_derive_parent - basically like dirname(1)
>   * @path:  the full_name of a node to be added to the tree
> @@ -46,7 +54,8 @@ struct device_node *pseries_of_derive_parent(const char 
> *path)
>  
>  /* Helper Routines to convert between drc_index to cpu numbers */
>  
> -int of_read_drc_info_cell(struct property **prop, const __be32 **curval,
> +static int of_read_drc_info_cell(struct property **prop,
> + const __be32 **curval,
>   struct of_drc_info *data)
>  {
>   const char *p;
> @@ -90,4 +99,290 @@ int of_read_drc_info_cell(struct property **prop, const 
> __be32 **curval,
>  
>   return 0;
>  }
> -EXPORT_SYMBOL(of_read_drc_info_cell);
> +
> +static int walk_drc_info(struct device_node *dn,
> + bool (*usercb)(struct of_drc_info *drc,
> + void *data,
> + int *ret_code),
> + char *opt_drc_type,
> + void *data)
> +{
> + struct property *info;
> + unsigned int entries;
> + struct of_drc_info drc;
> + const __be32 *value;
> + int j, ret_code = -EINVAL;
> + bool done = false;
> +
> + info = of_find_property(dn, "ibm,drc-info", NULL);
> + if (info == NULL)
> + return -EINVAL;
> +
> + value = info->value;
> + entries = of_read_number(value++, 1);
> +
> + for (j = 0, done = 0; (j < entries) && (!done); j++) {
> + of_read_drc_info_cell(&info, &value, &drc);
> +
> + if (opt_drc_type && strcmp(opt_drc_type, drc.drc_type))
> + continue;
> +
> + done = usercb(&drc, data, &ret_code);
> + }
> +
> + return ret_code;
> +}
> +
> +static int get_children_props(struct device_node *dn, const int 
> **drc_indexes,
> + const int **drc_names, const int **drc_types,
> + const int **drc_power_domains)
> +{
> + const int *indexes, *names, *types, *domains;
> +
> + indexes = of_get_property(dn, "ibm,drc-indexes", NULL);
> + names = of_get_property(dn, "ibm,drc-names", NULL);
> + types = of_get_property(dn, "ibm,drc-types", NULL);
> + domains = of_get_property(dn, "ibm,drc-power-domains", NULL);
> +
> + if (!indexes || !names || !types || !domains) {
> + /* Slot does not have dynamically-removable children */
> + return -EINVAL;
> + }
> + if (drc_indexes)
> + *drc_indexes = indexes;
> + if (drc_names)
> + /* &drc_names[1] contains NULL terminated slot names */
> + *drc_names = names;
> + if (drc_types)
> + /* &drc_types[1] contains NULL terminated slot types */
> + *drc_types = types;
> + if (drc_power_domains)
> + *drc_power_domains = domains;
> +
> + return 0;
> +}
> +
> +static int is_php_type(char *drc_ty

Re: [RFC 1/6] powerpc:/drc Define interface to acquire arch-specific drc info

2019-01-24 Thread Tyrel Datwyler
On 12/14/2018 12:50 PM, Michael Bringmann wrote:
> Define interface to acquire arch-specific drc info to match against
> hotpluggable devices.  The current implementation exposes several
> pseries-specific dynamic memory properties in generic kernel code.
> This patch set provides an interface to pull that code out of the
> generic kernel.
> 
> Signed-off-by: Michael Bringmann 
> ---
>  include/linux/topology.h |9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/linux/topology.h b/include/linux/topology.h
> index cb0775e..df97f5f 100644
> --- a/include/linux/topology.h
> +++ b/include/linux/topology.h
> @@ -44,6 +44,15 @@

As far as I know pseries is the only platform that uses DR connectors, and I
highly doubt that any other powerpc platform or arch ever will. So, I'm not sure
that this is really generic enough to belong in topology.h. If anything I would
suggest putting this in an include in arch/powerpc/include/ named something like
drcinfo.h or pseries-drc.h. That will make it visible to modules like rpaphp
that want/need to use this functionality.

-Tyrel

>  
>  int arch_update_cpu_topology(void);
>  
> +int arch_find_drc_match(struct device_node *dn,
> + bool (*usercb)(struct device_node *dn,
> + u32 drc_index, char *drc_name,
> + char *drc_type, u32 drc_power_domain,
> + void *data),
> + char *opt_drc_type, char *opt_drc_name,
> + bool match_drc_index, bool ck_php_type,
> + void *data);
> +
>  /* Conform to ACPI 2.0 SLIT distance definitions */
>  #define LOCAL_DISTANCE   10
>  #define REMOTE_DISTANCE  20
> 



BUG: memcmp(): Accessing invalid memory location

2019-01-24 Thread Chandan Rajendra
When executing fstests' generic/026 test, I hit the following call trace,

[  417.061038] BUG: Unable to handle kernel data access at 0xc0062ac4
[  417.062172] Faulting instruction address: 0xc0092240
[  417.062242] Oops: Kernel access of bad area, sig: 11 [#1]
[  417.062299] LE SMP NR_CPUS=2048 DEBUG_PAGEALLOC NUMA pSeries
[  417.062366] Modules linked in:
[  417.062401] CPU: 0 PID: 27828 Comm: chacl Not tainted 
5.0.0-rc2-next-20190115-1-g6de6dba64dda #1
[  417.062495] NIP:  c0092240 LR: c066a55c CTR: 
[  417.062567] REGS: c0062c0c3430 TRAP: 0300   Not tainted  
(5.0.0-rc2-next-20190115-1-g6de6dba64dda)
[  417.062660] MSR:  82009033   CR: 44000842  
XER: 2000
[  417.062750] CFAR: 7fff7f3108ac DAR: c0062ac4 DSISR: 4000 
IRQMASK: 0
   GPR00:  c0062c0c36c0 c17f4c00 
c121a660
   GPR04: c0062ac3fff9 0004 0020 
275b19c4
   GPR08: 000c 46494c45 5347495f41434c5f 
c26073a0
   GPR12:  c27a  

   GPR16:    

   GPR20: c0062ea70020 c0062c0c38d0 0002 
0002
   GPR24: c0062ac3ffe8 275b19c4 0001 
c0062ac3
   GPR28: c0062c0c38d0 c0062ac30050 c0062ac30058 

[  417.063563] NIP [c0092240] memcmp+0x120/0x690
[  417.063635] LR [c066a55c] xfs_attr3_leaf_lookup_int+0x53c/0x5b0
[  417.063709] Call Trace:
[  417.063744] [c0062c0c36c0] [c066a098] 
xfs_attr3_leaf_lookup_int+0x78/0x5b0 (unreliable)
[  417.063851] [c0062c0c3760] [c0693f8c] 
xfs_da3_node_lookup_int+0x32c/0x5a0
[  417.063944] [c0062c0c3820] [c06634a0] 
xfs_attr_node_addname+0x170/0x6b0
[  417.064034] [c0062c0c38b0] [c0664ffc] xfs_attr_set+0x2ac/0x340
[  417.064118] [c0062c0c39a0] [c0758d40] __xfs_set_acl+0xf0/0x230
[  417.064190] [c0062c0c3a00] [c0758f50] xfs_set_acl+0xd0/0x160
[  417.064268] [c0062c0c3aa0] [c04b69b0] set_posix_acl+0xc0/0x130
[  417.064339] [c0062c0c3ae0] [c04b6a88] 
posix_acl_xattr_set+0x68/0x110
[  417.064412] [c0062c0c3b20] [c04532d4] __vfs_setxattr+0xa4/0x110
[  417.064485] [c0062c0c3b80] [c0454c2c] 
__vfs_setxattr_noperm+0xac/0x240
[  417.064566] [c0062c0c3bd0] [c0454ee8] vfs_setxattr+0x128/0x130
[  417.064638] [c0062c0c3c30] [c0455138] setxattr+0x248/0x600
[  417.064710] [c0062c0c3d90] [c0455738] path_setxattr+0x108/0x120
[  417.064785] [c0062c0c3e00] [c0455778] sys_setxattr+0x28/0x40
[  417.064858] [c0062c0c3e20] [c000bae4] system_call+0x5c/0x70
[  417.064930] Instruction dump:
[  417.064964] 7d201c28 7d402428 7c295040 38630008 38840008 408201f0 4200ffe8 
2c05
[  417.065051] 4182ff6c 20c50008 54c61838 7d201c28 <7d402428> 7d293436 7d4a3436 
7c295040
[  417.065150] ---[ end trace 0d060411b5e3741b ]---


Both the memory locations passed to memcmp() had "SGI_ACL_FILE" and len
argument of memcmp() was set to 12. s1 argument of memcmp() had the value
0xf4af0485, while s2 argument had the value 0xce9e316f.

The following is the code path within memcmp() that gets executed for the
above mentioned values,

- Since len (i.e. 12) is greater than 7, we branch to .Lno_short.
- We then prefetch the contents of r3 & r4 and branch to
  .Ldiffoffset_8bytes_make_align_start.
- Under .Ldiffoffset_novmx_cmp, Since r3 is unaligned we end up comparing
  "SGI" part of the string. r3's value is then aligned. r4's value is
  incremented by 3. For comparing the remaining 9 bytes, we jump to
  .Lcmp_lt32bytes.
- Here, 8 bytes of the remaining 9 bytes are compared and execution moves to
  .Lcmp_rest_lt8bytes.
- Here we execute "LD rB,0,r4". In the case of this bug, r4 has an unaligned
  value and hence ends up accessing the "next" double word. The "next" double
  word happens to occur after the last page mapped into the kernel's address
  space and hence this leads to the previously listed oops.
  
-- 
chandan





Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-24 Thread Dave Hansen
On 1/24/19 6:17 AM, Michal Hocko wrote:
> and nr_cpus set to 4. The underlying reason is tha the device is bound
> to node 2 which doesn't have any memory and init_cpu_to_node only
> initializes memory-less nodes for possible cpus which nr_cpus restrics.
> This in turn means that proper zonelists are not allocated and the page
> allocator blows up.

This looks OK to me.

Could we add a few DEBUG_VM checks that *look* for these invalid
zonelists?  Or, would our existing list debugging have caught this?

Basically, is this bug also a sign that we need better debugging around
this?


Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-24 Thread Mike Rapoport
On Thu, Jan 24, 2019 at 03:17:27PM +0100, Michal Hocko wrote:
> a friendly ping for this. Does anybody see any problem with this
> approach?

FWIW, it looks fine to me.

It'd just be nice to have a few more words in the changelog about *how* the
x86 init was reworked ;-)
 
> On Mon 14-01-19 09:24:16, Michal Hocko wrote:
> > From: Michal Hocko 
> > 
> > Pingfan Liu has reported the following splat
> > [5.772742] BUG: unable to handle kernel paging request at 
> > 2088
> > [5.773618] PGD 0 P4D 0
> > [5.773618] Oops:  [#1] SMP NOPTI
> > [5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
> > [5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 
> > 06/29/2018
> > [5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
> > [5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da 
> > c1 ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 
> > <3b> 77 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
> > e1 44 89 e6 89
> > [5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
> > [5.773618] RAX:  RBX: 006012c0 RCX: 
> > 
> > [5.773618] RDX:  RSI: 0002 RDI: 
> > 2080
> > [5.773618] RBP: 006012c0 R08:  R09: 
> > 0002
> > [5.773618] R10: 006080c0 R11: 0002 R12: 
> > 
> > [5.773618] R13: 0001 R14:  R15: 
> > 0002
> > [5.773618] FS:  () GS:8c69afe0() 
> > knlGS:
> > [5.773618] CS:  0010 DS:  ES:  CR0: 80050033
> > [5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 
> > 003406e0
> > [5.773618] Call Trace:
> > [5.773618]  new_slab+0xa9/0x570
> > [5.773618]  ___slab_alloc+0x375/0x540
> > [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> > [5.773618]  __slab_alloc+0x1c/0x38
> > [5.773618]  __kmalloc_node_track_caller+0xc8/0x270
> > [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> > [5.773618]  devm_kmalloc+0x28/0x60
> > [5.773618]  pinctrl_bind_pins+0x2b/0x2a0
> > [5.773618]  really_probe+0x73/0x420
> > [5.773618]  driver_probe_device+0x115/0x130
> > [5.773618]  __driver_attach+0x103/0x110
> > [5.773618]  ? driver_probe_device+0x130/0x130
> > [5.773618]  bus_for_each_dev+0x67/0xc0
> > [5.773618]  ? klist_add_tail+0x3b/0x70
> > [5.773618]  bus_add_driver+0x41/0x260
> > [5.773618]  ? pcie_port_setup+0x4d/0x4d
> > [5.773618]  driver_register+0x5b/0xe0
> > [5.773618]  ? pcie_port_setup+0x4d/0x4d
> > [5.773618]  do_one_initcall+0x4e/0x1d4
> > [5.773618]  ? init_setup+0x25/0x28
> > [5.773618]  kernel_init_freeable+0x1c1/0x26e
> > [5.773618]  ? loglevel+0x5b/0x5b
> > [5.773618]  ? rest_init+0xb0/0xb0
> > [5.773618]  kernel_init+0xa/0x110
> > [5.773618]  ret_from_fork+0x22/0x40
> > [5.773618] Modules linked in:
> > [5.773618] CR2: 2088
> > [5.773618] ---[ end trace 1030c9120a03d081 ]---
> > 
> > with his AMD machine with the following topology
> >   NUMA node0 CPU(s): 0,8,16,24
> >   NUMA node1 CPU(s): 2,10,18,26
> >   NUMA node2 CPU(s): 4,12,20,28
> >   NUMA node3 CPU(s): 6,14,22,30
> >   NUMA node4 CPU(s): 1,9,17,25
> >   NUMA node5 CPU(s): 3,11,19,27
> >   NUMA node6 CPU(s): 5,13,21,29
> >   NUMA node7 CPU(s): 7,15,23,31
> > 
> > [0.007418] Early memory node ranges
> > [0.007419]   node   1: [mem 0x1000-0x0008efff]
> > [0.007420]   node   1: [mem 0x0009-0x0009]
> > [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> > [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> > [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> > [0.007424]   node   1: [mem 0x0001-0x00047fff]
> > [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> > 
> > and nr_cpus set to 4. The underlying reason is tha the device is bound
> > to node 2 which doesn't have any memory and init_cpu_to_node only
> > initializes memory-less nodes for possible cpus which nr_cpus restrics.
> > This in turn means that proper zonelists are not allocated and the page
> > allocator blows up.
> > 
> > Fix the issue by reworking how x86 initializes the memory less nodes.
> > The current implementation is hacked into the workflow and it doesn't
> > allow any flexibility. There is init_memory_less_node called for each
> > offline node that has a CPU as already mentioned above. This will make
> > sure that we will have a new online node without any memory. Much later
> > on we build a zone list for this node and things seem to work, except
> > they do not (e.g. due to nr_cpus). Not to mention that it doesn't really
> > make much sense to consider an empty node as o

Re: powerpc/ps3: Use struct_size() in kzalloc()

2019-01-24 Thread Gustavo A. R. Silva



On 1/23/19 9:40 PM, Michael Ellerman wrote:
> On Tue, 2019-01-08 at 21:00:10 UTC, "Gustavo A. R. Silva" wrote:
>> One of the more common cases of allocation size calculations is finding the
>> size of a structure that has a zero-sized array at the end, along with memory
>> for some number of elements for that array. For example:
>>
>> struct foo {
>> int stuff;
>> void *entry[];
>> };
>>
>> instance = kzalloc(sizeof(struct foo) + sizeof(void *) * count, GFP_KERNEL);
>>
>> Instead of leaving these open-coded and prone to type mistakes, we can now
>> use the new struct_size() helper:
>>
>> instance = kzalloc(struct_size(instance, entry, count), GFP_KERNEL);
>>
>> This code was detected with the help of Coccinelle.
>>
>> Signed-off-by: Gustavo A. R. Silva 
> 
> Applied to powerpc next, thanks.
> 
> https://git.kernel.org/powerpc/c/31367b9a01d6a3f4f77694bd44f547d6
> 

Thanks, Michael.

--
Gustavo


Re: [PATCH] powerpc/ptrace: Mitigate potential Spectre v1

2019-01-24 Thread Gustavo A. R. Silva



On 1/24/19 8:01 AM, Breno Leitao wrote:
> 'regno' is directly controlled by user space, hence leading to a potential
> exploitation of the Spectre variant 1 vulnerability.
> 
> On PTRACE_SETREGS and PTRACE_GETREGS requests, user space passes the
> register number that would be read or written. This register number is
> called 'regno' which is part of the 'addr' syscall parameter.
> 
> This 'regno' value is checked against the maximum pt_regs structure size,
> and then used to dereference it, which matches the initial part of a
> Spectre v1 (and Spectre v1.1) attack. The dereferenced value, then,
> is returned to userspace in the GETREGS case.
> 

Was this reported by any tool?

If so, it might be worth mentioning it.

> This patch sanitizes 'regno' before using it to dereference pt_reg.
> 
> Notice that given that speculation windows are large, the policy is
> to kill the speculation on the first load and not worry if it can be
> completed with a dependent load/store [1].
> 
> [1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2
> 
> Signed-off-by: Breno Leitao 
> ---
>  arch/powerpc/kernel/ptrace.c | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> index cdd5d1d3ae41..3eac38a29863 100644
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -33,6 +33,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -298,6 +299,9 @@ int ptrace_get_reg(struct task_struct *task, int regno, 
> unsigned long *data)
>  #endif
>  
>   if (regno < (sizeof(struct user_pt_regs) / sizeof(unsigned long))) {

I would use a variable to store sizeof(struct user_pt_regs) / sizeof(unsigned 
long).

> + regno = array_index_nospec(regno,
> + (sizeof(struct user_pt_regs) /
> +  sizeof(unsigned long)));

See the rest of my comments below.

>   *data = ((unsigned long *)task->thread.regs)[regno];
>   return 0;
>   }
> @@ -321,6 +325,7 @@ int ptrace_put_reg(struct task_struct *task, int regno, 
> unsigned long data)
>   return set_user_dscr(task, data);
>  
>   if (regno <= PT_MAX_PUT_REG) {
> + regno = array_index_nospec(regno, PT_MAX_PUT_REG);

This is wrong.  array_index_nospec() will return PT_MAX_PUT_REG - 1 in case 
regno is equal to
PT_MAX_PUT_REG, and this is not what you want.

Similar reasoning applies to the case above.

>   ((unsigned long *)task->thread.regs)[regno] = data;
>   return 0;
>   }
> 

Thanks
--
Gustavo


Re: [PATCH v14 02/12] powerpc/irq: use memblock functions returning virtual address

2019-01-24 Thread Mark Rutland
On Thu, Jan 24, 2019 at 07:25:53PM +0200, Mike Rapoport wrote:
> On Thu, Jan 24, 2019 at 04:51:53PM +, Mark Rutland wrote:
> > On Thu, Jan 24, 2019 at 04:19:33PM +, Christophe Leroy wrote:
> > > Since only the virtual address of allocated blocks is used,
> > > lets use functions returning directly virtual address.
> > > 
> > > Those functions have the advantage of also zeroing the block.
> > > 
> > > Suggested-by: Mike Rapoport 
> > > Acked-by: Mike Rapoport 
> > > Signed-off-by: Christophe Leroy 
> > 
> > [...]
> > 
> > > +static void *__init alloc_stack(void)
> > > +{
> > > + void *ptr = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
> > > +
> > > + if (!ptr)
> > > + panic("cannot allocate stacks");
> > > +
> > > + return ptr;
> > > +}
> > 
> > I believe memblock_alloc() will panic() if it cannot allocate memory,
> > since that goes:
> > 
> >  memblock_alloc()
> >  -> memblock_alloc_try_nid()
> > -> panic()
> > 
> > So you can get rid of the panic() here, or if you want a custom panic
> > message, you can use memblock_alloc_nopanic().
> 
> As we've already discussed it in [1], I'm working on removing the
> _nopanic() versions and dropping the panic() calls from memblock_alloc()
> and friends.
> 
> I've posted v2 of the patches earlier this week [2].
>  
> > [...]
> 
> [1] https://lore.kernel.org/lkml/20190108143428.GB14063@rapoport-lnx/ 
> [2] 
> https://lore.kernel.org/lkml/1548057848-15136-1-git-send-email-r...@linux.ibm.com/

Fair enough.

Feel free to take the ack regardless, then!

Thanks,
Mark.


Re: [PATCH v14 02/12] powerpc/irq: use memblock functions returning virtual address

2019-01-24 Thread Mike Rapoport
On Thu, Jan 24, 2019 at 04:51:53PM +, Mark Rutland wrote:
> On Thu, Jan 24, 2019 at 04:19:33PM +, Christophe Leroy wrote:
> > Since only the virtual address of allocated blocks is used,
> > lets use functions returning directly virtual address.
> > 
> > Those functions have the advantage of also zeroing the block.
> > 
> > Suggested-by: Mike Rapoport 
> > Acked-by: Mike Rapoport 
> > Signed-off-by: Christophe Leroy 
> 
> [...]
> 
> > +static void *__init alloc_stack(void)
> > +{
> > +   void *ptr = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
> > +
> > +   if (!ptr)
> > +   panic("cannot allocate stacks");
> > +
> > +   return ptr;
> > +}
> 
> I believe memblock_alloc() will panic() if it cannot allocate memory,
> since that goes:
> 
>  memblock_alloc()
>  -> memblock_alloc_try_nid()
> -> panic()
> 
> So you can get rid of the panic() here, or if you want a custom panic
> message, you can use memblock_alloc_nopanic().

As we've already discussed it in [1], I'm working on removing the
_nopanic() versions and dropping the panic() calls from memblock_alloc()
and friends.

I've posted v2 of the patches earlier this week [2].
 
> [...]

[1] https://lore.kernel.org/lkml/20190108143428.GB14063@rapoport-lnx/ 
[2] 
https://lore.kernel.org/lkml/1548057848-15136-1-git-send-email-r...@linux.ibm.com/

> >  static void *__init alloc_stack(unsigned long limit, int cpu)
> >  {
> > -   unsigned long pa;
> > +   void *ptr;
> >  
> > BUILD_BUG_ON(STACK_INT_FRAME_SIZE % 16);
> >  
> > -   pa = memblock_alloc_base_nid(THREAD_SIZE, THREAD_SIZE, limit,
> > -   early_cpu_to_node(cpu), MEMBLOCK_NONE);
> > -   if (!pa) {
> > -   pa = memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit);
> > -   if (!pa)
> > -   panic("cannot allocate stacks");
> > -   }
> > +   ptr = memblock_alloc_try_nid(THREAD_SIZE, THREAD_SIZE,
> > +MEMBLOCK_LOW_LIMIT, limit,
> > +early_cpu_to_node(cpu));
> > +   if (!ptr)
> > +   panic("cannot allocate stacks");
> 
> The same applies here -- memblock_alloc_try_nid() will panic itself
> rather than returning NULL.
> 
> Otherwise, this looks like a nice cleanup. With the panics removed (or
> using the _nopanic() allocators), feel free to add:
> 
> Acked-by: Mark Rutland 
> 
> Thanks,
> Mark.
> 

-- 
Sincerely yours,
Mike.



Re: [PATCH v14 07/12] powerpc: Activate CONFIG_THREAD_INFO_IN_TASK

2019-01-24 Thread Mark Rutland
On Thu, Jan 24, 2019 at 04:19:43PM +, Christophe Leroy wrote:
> This patch activates CONFIG_THREAD_INFO_IN_TASK which
> moves the thread_info into task_struct.
> 
> Moving thread_info into task_struct has the following advantages:
> - It protects thread_info from corruption in the case of stack
> overflows.
> - Its address is harder to determine if stack addresses are
> leaked, making a number of attacks more difficult.
> 
> This has the following consequences:
> - thread_info is now located at the beginning of task_struct.
> - The 'cpu' field is now in task_struct, and only exists when
> CONFIG_SMP is active.
> - thread_info doesn't have anymore the 'task' field.
> 
> This patch:
> - Removes all recopy of thread_info struct when the stack changes.
> - Changes the CURRENT_THREAD_INFO() macro to point to current.
> - Selects CONFIG_THREAD_INFO_IN_TASK.
> - Modifies raw_smp_processor_id() to get ->cpu from current without
> including linux/sched.h to avoid circular inclusion and without
> including asm/asm-offsets.h to avoid symbol names duplication
> between ASM constants and C constants.
> - Modifies klp_init_thread_info() to take a task_struct pointer
> argument.
> 
> Signed-off-by: Christophe Leroy 
> Reviewed-by: Nicholas Piggin 

[...]

> +ifdef CONFIG_SMP
> +prepare: task_cpu_prepare
> +
> +task_cpu_prepare: prepare0
> + $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TI_CPU") 
> print $$3;}' include/generated/asm-offsets.h))
> +endif

[...]

> -#define raw_smp_processor_id()   (current_thread_info()->cpu)
> +/*
> + * This is particularly ugly: it appears we can't actually get the definition
> + * of task_struct here, but we need access to the CPU this task is running 
> on.
> + * Instead of using task_struct we're using _TASK_CPU which is extracted from
> + * asm-offsets.h by kbuild to get the current processor ID.
> + *
> + * This also needs to be safeguarded when building asm-offsets.s because at
> + * that time _TASK_CPU is not defined yet. It could have been guarded by
> + * _TASK_CPU itself, but we want the build to fail if _TASK_CPU is missing
> + * when building something else than asm-offsets.s
> + */
> +#ifdef GENERATING_ASM_OFFSETS
> +#define raw_smp_processor_id()   (0)
> +#else
> +#define raw_smp_processor_id()   (*(unsigned int *)((void 
> *)current + _TASK_CPU))
> +#endif
>  #define hard_smp_processor_id()  (smp_hw_index[smp_processor_id()])

On arm64 we have the per-cpu offset in a CPU register (TPIDR_EL1), so we
can do:

DEFINE_PER_CPU_READ_MOSTLY(int, cpu_number);

#define raw_smp_processor_id() (*raw_cpu_ptr(&cpu_number))

... but I guess that's not possible on PPC for some reason?

I think I asked that before, but I couldn't find the thread.

Otherwise, this all looks sound to me, but I don't know much about PPC.

Thanks,
Mark.


Re: [PATCH v14 05/12] powerpc: prep stack walkers for THREAD_INFO_IN_TASK

2019-01-24 Thread Mark Rutland
On Thu, Jan 24, 2019 at 04:19:39PM +, Christophe Leroy wrote:
> [text copied from commit 9bbd4c56b0b6
> ("arm64: prep stack walkers for THREAD_INFO_IN_TASK")]
> 
> When CONFIG_THREAD_INFO_IN_TASK is selected, task stacks may be freed
> before a task is destroyed. To account for this, the stacks are
> refcounted, and when manipulating the stack of another task, it is
> necessary to get/put the stack to ensure it isn't freed and/or re-used
> while we do so.
> 
> This patch reworks the powerpc stack walking code to account for this.
> When CONFIG_THREAD_INFO_IN_TASK is not selected these perform no
> refcounting, and this should only be a structural change that does not
> affect behaviour.
> 
> Signed-off-by: Christophe Leroy 

I'm not familiar with the powerpc code, but AFAICT this is analagous to
the arm64 code, and I'm not aware of any special caveats. FWIW:

Acked-by: Mark Rutland 

Thanks,
Mark.

> ---
>  arch/powerpc/kernel/process.c| 23 +--
>  arch/powerpc/kernel/stacktrace.c | 29 ++---
>  2 files changed, 47 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> index ce393df243aa..4ffbb677c9f5 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -2027,7 +2027,7 @@ int validate_sp(unsigned long sp, struct task_struct *p,
>  
>  EXPORT_SYMBOL(validate_sp);
>  
> -unsigned long get_wchan(struct task_struct *p)
> +static unsigned long __get_wchan(struct task_struct *p)
>  {
>   unsigned long ip, sp;
>   int count = 0;
> @@ -2053,6 +2053,20 @@ unsigned long get_wchan(struct task_struct *p)
>   return 0;
>  }
>  
> +unsigned long get_wchan(struct task_struct *p)
> +{
> + unsigned long ret;
> +
> + if (!try_get_task_stack(p))
> + return 0;
> +
> + ret = __get_wchan(p);
> +
> + put_task_stack(p);
> +
> + return ret;
> +}
> +
>  static int kstack_depth_to_print = CONFIG_PRINT_STACK_DEPTH;
>  
>  void show_stack(struct task_struct *tsk, unsigned long *stack)
> @@ -2067,6 +2081,9 @@ void show_stack(struct task_struct *tsk, unsigned long 
> *stack)
>   int curr_frame = 0;
>  #endif
>  
> + if (!try_get_task_stack(tsk))
> + return;
> +
>   sp = (unsigned long) stack;
>   if (tsk == NULL)
>   tsk = current;
> @@ -2081,7 +2098,7 @@ void show_stack(struct task_struct *tsk, unsigned long 
> *stack)
>   printk("Call Trace:\n");
>   do {
>   if (!validate_sp(sp, tsk, STACK_FRAME_OVERHEAD))
> - return;
> + break;
>  
>   stack = (unsigned long *) sp;
>   newsp = stack[0];
> @@ -2121,6 +2138,8 @@ void show_stack(struct task_struct *tsk, unsigned long 
> *stack)
>  
>   sp = newsp;
>   } while (count++ < kstack_depth_to_print);
> +
> + put_task_stack(tsk);
>  }
>  
>  #ifdef CONFIG_PPC64
> diff --git a/arch/powerpc/kernel/stacktrace.c 
> b/arch/powerpc/kernel/stacktrace.c
> index e2c50b55138f..a5745571e06e 100644
> --- a/arch/powerpc/kernel/stacktrace.c
> +++ b/arch/powerpc/kernel/stacktrace.c
> @@ -67,12 +67,17 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct 
> stack_trace *trace)
>  {
>   unsigned long sp;
>  
> + if (!try_get_task_stack(tsk))
> + return;
> +
>   if (tsk == current)
>   sp = current_stack_pointer();
>   else
>   sp = tsk->thread.ksp;
>  
>   save_context_stack(trace, sp, tsk, 0);
> +
> + put_task_stack(tsk);
>  }
>  EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
>  
> @@ -84,9 +89,8 @@ save_stack_trace_regs(struct pt_regs *regs, struct 
> stack_trace *trace)
>  EXPORT_SYMBOL_GPL(save_stack_trace_regs);
>  
>  #ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
> -int
> -save_stack_trace_tsk_reliable(struct task_struct *tsk,
> - struct stack_trace *trace)
> +static int __save_stack_trace_tsk_reliable(struct task_struct *tsk,
> +struct stack_trace *trace)
>  {
>   unsigned long sp;
>   unsigned long stack_page = (unsigned long)task_stack_page(tsk);
> @@ -193,6 +197,25 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
>   }
>   return 0;
>  }
> +
> +int save_stack_trace_tsk_reliable(struct task_struct *tsk,
> +   struct stack_trace *trace)
> +{
> + int ret;
> +
> + /*
> +  * If the task doesn't have a stack (e.g., a zombie), the stack is
> +  * "reliably" empty.
> +  */
> + if (!try_get_task_stack(tsk))
> + return 0;
> +
> + ret = __save_stack_trace_tsk_reliable(trace, tsk);
> +
> + put_task_stack(tsk);
> +
> + return ret;
> +}
>  EXPORT_SYMBOL_GPL(save_stack_trace_tsk_reliable);
>  #endif /* CONFIG_HAVE_RELIABLE_STACKTRACE */
>  
> -- 
> 2.13.3
> 


Re: [PATCH v14 02/12] powerpc/irq: use memblock functions returning virtual address

2019-01-24 Thread Mark Rutland
On Thu, Jan 24, 2019 at 04:19:33PM +, Christophe Leroy wrote:
> Since only the virtual address of allocated blocks is used,
> lets use functions returning directly virtual address.
> 
> Those functions have the advantage of also zeroing the block.
> 
> Suggested-by: Mike Rapoport 
> Acked-by: Mike Rapoport 
> Signed-off-by: Christophe Leroy 

[...]

> +static void *__init alloc_stack(void)
> +{
> + void *ptr = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
> +
> + if (!ptr)
> + panic("cannot allocate stacks");
> +
> + return ptr;
> +}

I believe memblock_alloc() will panic() if it cannot allocate memory,
since that goes:

 memblock_alloc()
 -> memblock_alloc_try_nid()
-> panic()

So you can get rid of the panic() here, or if you want a custom panic
message, you can use memblock_alloc_nopanic().

[...]

>  static void *__init alloc_stack(unsigned long limit, int cpu)
>  {
> - unsigned long pa;
> + void *ptr;
>  
>   BUILD_BUG_ON(STACK_INT_FRAME_SIZE % 16);
>  
> - pa = memblock_alloc_base_nid(THREAD_SIZE, THREAD_SIZE, limit,
> - early_cpu_to_node(cpu), MEMBLOCK_NONE);
> - if (!pa) {
> - pa = memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit);
> - if (!pa)
> - panic("cannot allocate stacks");
> - }
> + ptr = memblock_alloc_try_nid(THREAD_SIZE, THREAD_SIZE,
> +  MEMBLOCK_LOW_LIMIT, limit,
> +  early_cpu_to_node(cpu));
> + if (!ptr)
> + panic("cannot allocate stacks");

The same applies here -- memblock_alloc_try_nid() will panic itself
rather than returning NULL.

Otherwise, this looks like a nice cleanup. With the panics removed (or
using the _nopanic() allocators), feel free to add:

Acked-by: Mark Rutland 

Thanks,
Mark.


[PATCH v14 02/12] powerpc/irq: use memblock functions returning virtual address

2019-01-24 Thread Christophe Leroy
Since only the virtual address of allocated blocks is used,
lets use functions returning directly virtual address.

Those functions have the advantage of also zeroing the block.

Suggested-by: Mike Rapoport 
Acked-by: Mike Rapoport 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/irq.c  |  5 -
 arch/powerpc/kernel/setup_32.c | 25 +++--
 arch/powerpc/kernel/setup_64.c | 19 +++
 3 files changed, 22 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index bb299613a462..4a5dd8800946 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -725,18 +725,15 @@ void exc_lvl_ctx_init(void)
 #endif
 #endif
 
-   memset((void *)critirq_ctx[cpu_nr], 0, THREAD_SIZE);
tp = critirq_ctx[cpu_nr];
tp->cpu = cpu_nr;
tp->preempt_count = 0;
 
 #ifdef CONFIG_BOOKE
-   memset((void *)dbgirq_ctx[cpu_nr], 0, THREAD_SIZE);
tp = dbgirq_ctx[cpu_nr];
tp->cpu = cpu_nr;
tp->preempt_count = 0;
 
-   memset((void *)mcheckirq_ctx[cpu_nr], 0, THREAD_SIZE);
tp = mcheckirq_ctx[cpu_nr];
tp->cpu = cpu_nr;
tp->preempt_count = HARDIRQ_OFFSET;
@@ -754,12 +751,10 @@ void irq_ctx_init(void)
int i;
 
for_each_possible_cpu(i) {
-   memset((void *)softirq_ctx[i], 0, THREAD_SIZE);
tp = softirq_ctx[i];
tp->cpu = i;
klp_init_thread_info(tp);
 
-   memset((void *)hardirq_ctx[i], 0, THREAD_SIZE);
tp = hardirq_ctx[i];
tp->cpu = i;
klp_init_thread_info(tp);
diff --git a/arch/powerpc/kernel/setup_32.c b/arch/powerpc/kernel/setup_32.c
index 947f904688b0..f0e25d845f8c 100644
--- a/arch/powerpc/kernel/setup_32.c
+++ b/arch/powerpc/kernel/setup_32.c
@@ -196,6 +196,16 @@ static int __init ppc_init(void)
 }
 arch_initcall(ppc_init);
 
+static void *__init alloc_stack(void)
+{
+   void *ptr = memblock_alloc(THREAD_SIZE, THREAD_SIZE);
+
+   if (!ptr)
+   panic("cannot allocate stacks");
+
+   return ptr;
+}
+
 void __init irqstack_early_init(void)
 {
unsigned int i;
@@ -203,10 +213,8 @@ void __init irqstack_early_init(void)
/* interrupt stacks must be in lowmem, we get that for free on ppc32
 * as the memblock is limited to lowmem by default */
for_each_possible_cpu(i) {
-   softirq_ctx[i] = (struct thread_info *)
-   __va(memblock_phys_alloc(THREAD_SIZE, THREAD_SIZE));
-   hardirq_ctx[i] = (struct thread_info *)
-   __va(memblock_phys_alloc(THREAD_SIZE, THREAD_SIZE));
+   softirq_ctx[i] = alloc_stack();
+   hardirq_ctx[i] = alloc_stack();
}
 }
 
@@ -224,13 +232,10 @@ void __init exc_lvl_early_init(void)
hw_cpu = 0;
 #endif
 
-   critirq_ctx[hw_cpu] = (struct thread_info *)
-   __va(memblock_phys_alloc(THREAD_SIZE, THREAD_SIZE));
+   critirq_ctx[hw_cpu] = alloc_stack();
 #ifdef CONFIG_BOOKE
-   dbgirq_ctx[hw_cpu] = (struct thread_info *)
-   __va(memblock_phys_alloc(THREAD_SIZE, THREAD_SIZE));
-   mcheckirq_ctx[hw_cpu] = (struct thread_info *)
-   __va(memblock_phys_alloc(THREAD_SIZE, THREAD_SIZE));
+   dbgirq_ctx[hw_cpu] = alloc_stack();
+   mcheckirq_ctx[hw_cpu] = alloc_stack();
 #endif
}
 }
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 236c1151a3a7..080dd515d587 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -634,19 +634,17 @@ __init u64 ppc64_bolted_size(void)
 
 static void *__init alloc_stack(unsigned long limit, int cpu)
 {
-   unsigned long pa;
+   void *ptr;
 
BUILD_BUG_ON(STACK_INT_FRAME_SIZE % 16);
 
-   pa = memblock_alloc_base_nid(THREAD_SIZE, THREAD_SIZE, limit,
-   early_cpu_to_node(cpu), MEMBLOCK_NONE);
-   if (!pa) {
-   pa = memblock_alloc_base(THREAD_SIZE, THREAD_SIZE, limit);
-   if (!pa)
-   panic("cannot allocate stacks");
-   }
+   ptr = memblock_alloc_try_nid(THREAD_SIZE, THREAD_SIZE,
+MEMBLOCK_LOW_LIMIT, limit,
+early_cpu_to_node(cpu));
+   if (!ptr)
+   panic("cannot allocate stacks");
 
-   return __va(pa);
+   return ptr;
 }
 
 void __init irqstack_early_init(void)
@@ -739,20 +737,17 @@ void __init emergency_stack_init(void)
struct thread_info *ti;
 
ti = alloc_stack(limit, i);
-   memset(ti, 0, THREAD_SIZE);
emerg_stack_init_thread_info(ti, i);
paca_ptrs[i]->e

[PATCH v14 11/12] powerpc/64: Remove CURRENT_THREAD_INFO

2019-01-24 Thread Christophe Leroy
Now that current_thread_info is located at the beginning of 'current'
task struct, CURRENT_THREAD_INFO macro is not really needed any more.

This patch replaces it by loads of the value at PACACURRENT(r13).

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/exception-64s.h   |  4 ++--
 arch/powerpc/include/asm/thread_info.h |  4 
 arch/powerpc/kernel/entry_64.S | 10 +-
 arch/powerpc/kernel/exceptions-64e.S   |  2 +-
 arch/powerpc/kernel/exceptions-64s.S   |  2 +-
 arch/powerpc/kernel/idle_book3e.S  |  2 +-
 arch/powerpc/kernel/idle_power4.S  |  2 +-
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |  6 +++---
 8 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h 
b/arch/powerpc/include/asm/exception-64s.h
index 3b4767ed3ec5..dd6a5ae7a769 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -671,7 +671,7 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 
 #define RUNLATCH_ON\
 BEGIN_FTR_SECTION  \
-   CURRENT_THREAD_INFO(r3, r1);\
+   ld  r3, PACACURRENT(r13);   \
ld  r4,TI_LOCAL_FLAGS(r3);  \
andi.   r0,r4,_TLF_RUNLATCH;\
beqlppc64_runlatch_on_trampoline;   \
@@ -721,7 +721,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_CTRL)
 #ifdef CONFIG_PPC_970_NAP
 #define FINISH_NAP \
 BEGIN_FTR_SECTION  \
-   CURRENT_THREAD_INFO(r11, r1);   \
+   ld  r11, PACACURRENT(r13);  \
ld  r9,TI_LOCAL_FLAGS(r11); \
andi.   r10,r9,_TLF_NAPPING;\
bnelpower4_fixup_nap;   \
diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index c959b8d66cac..8e1d0195ac36 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -17,10 +17,6 @@
 
 #define THREAD_SIZE(1 << THREAD_SHIFT)
 
-#ifdef CONFIG_PPC64
-#define CURRENT_THREAD_INFO(dest, sp)  stringify_in_c(ld dest, 
PACACURRENT(r13))
-#endif
-
 #ifndef __ASSEMBLY__
 #include 
 #include 
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 01d0706d873f..83bddacd7a17 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -166,7 +166,7 @@ system_call:/* label this so stack 
traces look sane */
li  r10,IRQS_ENABLED
std r10,SOFTE(r1)
 
-   CURRENT_THREAD_INFO(r11, r1)
+   ld  r11, PACACURRENT(r13)
ld  r10,TI_FLAGS(r11)
andi.   r11,r10,_TIF_SYSCALL_DOTRACE
bne .Lsyscall_dotrace   /* does not return */
@@ -213,7 +213,7 @@ system_call:/* label this so stack 
traces look sane */
ld  r3,RESULT(r1)
 #endif
 
-   CURRENT_THREAD_INFO(r12, r1)
+   ld  r12, PACACURRENT(r13)
 
ld  r8,_MSR(r1)
 #ifdef CONFIG_PPC_BOOK3S
@@ -348,7 +348,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_HAS_PPR)
 
/* Repopulate r9 and r10 for the syscall path */
addir9,r1,STACK_FRAME_OVERHEAD
-   CURRENT_THREAD_INFO(r10, r1)
+   ld  r10, PACACURRENT(r13)
ld  r10,TI_FLAGS(r10)
 
cmpldi  r0,NR_syscalls
@@ -746,7 +746,7 @@ _GLOBAL(ret_from_except_lite)
mtmsrd  r10,1 /* Update machine state */
 #endif /* CONFIG_PPC_BOOK3E */
 
-   CURRENT_THREAD_INFO(r9, r1)
+   ld  r9, PACACURRENT(r13)
ld  r3,_MSR(r1)
 #ifdef CONFIG_PPC_BOOK3E
ld  r10,PACACURRENT(r13)
@@ -860,7 +860,7 @@ resume_kernel:
 1: bl  preempt_schedule_irq
 
/* Re-test flags and eventually loop */
-   CURRENT_THREAD_INFO(r9, r1)
+   ld  r9, PACACURRENT(r13)
ld  r4,TI_FLAGS(r9)
andi.   r0,r4,_TIF_NEED_RESCHED
bne 1b
diff --git a/arch/powerpc/kernel/exceptions-64e.S 
b/arch/powerpc/kernel/exceptions-64e.S
index 20f14996281d..04ee24789f80 100644
--- a/arch/powerpc/kernel/exceptions-64e.S
+++ b/arch/powerpc/kernel/exceptions-64e.S
@@ -493,7 +493,7 @@ exc_##n##_bad_stack:
\
  * interrupts happen before the wait instruction.
  */
 #define CHECK_NAPPING()
\
-   CURRENT_THREAD_INFO(r11, r1);   \
+   ld  r11, PACACURRENT(r13);  \
ld  r10,TI_LOCAL_FLAGS(r11);\
andi.   r9,r10,_TLF_NAPPING;\
beq+1f; \
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 9e253ce27e08..c7c4e2d6f98f 100644

[PATCH v14 07/12] powerpc: Activate CONFIG_THREAD_INFO_IN_TASK

2019-01-24 Thread Christophe Leroy
This patch activates CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.

This has the following consequences:
- thread_info is now located at the beginning of task_struct.
- The 'cpu' field is now in task_struct, and only exists when
CONFIG_SMP is active.
- thread_info doesn't have anymore the 'task' field.

This patch:
- Removes all recopy of thread_info struct when the stack changes.
- Changes the CURRENT_THREAD_INFO() macro to point to current.
- Selects CONFIG_THREAD_INFO_IN_TASK.
- Modifies raw_smp_processor_id() to get ->cpu from current without
including linux/sched.h to avoid circular inclusion and without
including asm/asm-offsets.h to avoid symbol names duplication
between ASM constants and C constants.
- Modifies klp_init_thread_info() to take a task_struct pointer
argument.

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/Kconfig   |  1 +
 arch/powerpc/Makefile  |  7 +++
 arch/powerpc/include/asm/irq.h |  4 --
 arch/powerpc/include/asm/livepatch.h   |  6 +--
 arch/powerpc/include/asm/ptrace.h  |  2 +-
 arch/powerpc/include/asm/smp.h | 17 +++-
 arch/powerpc/include/asm/thread_info.h | 17 +---
 arch/powerpc/kernel/asm-offsets.c  |  7 ++-
 arch/powerpc/kernel/entry_32.S |  9 ++--
 arch/powerpc/kernel/exceptions-64e.S   | 11 -
 arch/powerpc/kernel/head_32.S  |  6 +--
 arch/powerpc/kernel/head_44x.S |  4 +-
 arch/powerpc/kernel/head_64.S  |  1 +
 arch/powerpc/kernel/head_booke.h   |  8 +---
 arch/powerpc/kernel/head_fsl_booke.S   |  7 ++-
 arch/powerpc/kernel/irq.c  | 79 +-
 arch/powerpc/kernel/kgdb.c | 28 
 arch/powerpc/kernel/machine_kexec_64.c |  6 +--
 arch/powerpc/kernel/process.c  |  2 +-
 arch/powerpc/kernel/setup-common.c |  2 +-
 arch/powerpc/kernel/setup_64.c | 21 -
 arch/powerpc/kernel/smp.c  |  2 +-
 arch/powerpc/net/bpf_jit32.h   |  5 +--
 23 files changed, 57 insertions(+), 195 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 2890d36eb531..0a26e0075ce5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -241,6 +241,7 @@ config PPC
select RTC_LIB
select SPARSE_IRQ
select SYSCTL_EXCEPTION_TRACE
+   select THREAD_INFO_IN_TASK
select VIRT_TO_BUS  if !PPC64
#
# Please keep this list sorted alphabetically.
diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index ac033341ed55..53ffe935f3b0 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -427,6 +427,13 @@ else
 endif
 endif
 
+ifdef CONFIG_SMP
+prepare: task_cpu_prepare
+
+task_cpu_prepare: prepare0
+   $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TI_CPU") 
print $$3;}' include/generated/asm-offsets.h))
+endif
+
 # Check toolchain versions:
 # - gcc-4.6 is the minimum kernel-wide version so nothing required.
 checkbin:
diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index 2efbae8d93be..28a7ace0a1b9 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -51,9 +51,6 @@ struct pt_regs;
 extern struct thread_info *critirq_ctx[NR_CPUS];
 extern struct thread_info *dbgirq_ctx[NR_CPUS];
 extern struct thread_info *mcheckirq_ctx[NR_CPUS];
-extern void exc_lvl_ctx_init(void);
-#else
-#define exc_lvl_ctx_init()
 #endif
 
 /*
@@ -62,7 +59,6 @@ extern void exc_lvl_ctx_init(void);
 extern struct thread_info *hardirq_ctx[NR_CPUS];
 extern struct thread_info *softirq_ctx[NR_CPUS];
 
-extern void irq_ctx_init(void);
 void call_do_softirq(void *sp);
 void call_do_irq(struct pt_regs *regs, void *sp);
 extern void do_IRQ(struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/livepatch.h 
b/arch/powerpc/include/asm/livepatch.h
index 47a03b9b528b..7cb514865a28 100644
--- a/arch/powerpc/include/asm/livepatch.h
+++ b/arch/powerpc/include/asm/livepatch.h
@@ -43,13 +43,13 @@ static inline unsigned long 
klp_get_ftrace_location(unsigned long faddr)
return ftrace_location_range(faddr, faddr + 16);
 }
 
-static inline void klp_init_thread_info(struct thread_info *ti)
+static inline void klp_init_thread_info(struct task_struct *p)
 {
/* + 1 to account for STACK_END_MAGIC */
-   ti->livepatch_sp = (unsigned long *)(ti + 1) + 1;
+   task_thread_info(p)->livepatch_sp = end_of_stack(p) + 1;
 }
 #else
-static void klp_init_thread_info(struct thread_info *ti) { }
+static inline void klp_init_thread_info(struct task_struct *p) { }
 #endif /* CONFIG_LIVEPATCH */
 
 #endif /* _ASM_POWERPC_LIVEPATCH_H */
diff --git a/arch/

[PATCH v14 03/12] book3s/64: avoid circular header inclusion in mmu-hash.h

2019-01-24 Thread Christophe Leroy
When activating CONFIG_THREAD_INFO_IN_TASK, linux/sched.h
includes asm/current.h. This generates a circular dependency.
To avoid that, asm/processor.h shall not be included in mmu-hash.h

In order to do that, this patch moves into a new header called
asm/task_size_user64.h the information from asm/processor.h required
by mmu-hash.h

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h |  2 +-
 arch/powerpc/include/asm/processor.h  | 34 +-
 arch/powerpc/include/asm/task_size_user64.h   | 42 +++
 arch/powerpc/kvm/book3s_hv_hmi.c  |  1 +
 4 files changed, 45 insertions(+), 34 deletions(-)
 create mode 100644 arch/powerpc/include/asm/task_size_user64.h

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 12e522807f9f..b2aba048301e 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -23,7 +23,7 @@
  */
 #include 
 #include 
-#include 
+#include 
 #include 
 
 /*
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index ee58526cb6c2..692f7383d461 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -95,40 +95,8 @@ void release_thread(struct task_struct *);
 #endif
 
 #ifdef CONFIG_PPC64
-/*
- * 64-bit user address space can have multiple limits
- * For now supported values are:
- */
-#define TASK_SIZE_64TB  (0x4000UL)
-#define TASK_SIZE_128TB (0x8000UL)
-#define TASK_SIZE_512TB (0x0002UL)
-#define TASK_SIZE_1PB   (0x0004UL)
-#define TASK_SIZE_2PB   (0x0008UL)
-/*
- * With 52 bits in the address we can support
- * upto 4PB of range.
- */
-#define TASK_SIZE_4PB   (0x0010UL)
 
-/*
- * For now 512TB is only supported with book3s and 64K linux page size.
- */
-#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
-/*
- * Max value currently used:
- */
-#define TASK_SIZE_USER64   TASK_SIZE_4PB
-#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_128TB
-#define TASK_CONTEXT_SIZE  TASK_SIZE_512TB
-#else
-#define TASK_SIZE_USER64   TASK_SIZE_64TB
-#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_64TB
-/*
- * We don't need to allocate extended context ids for 4K page size, because
- * we limit the max effective address on this config to 64TB.
- */
-#define TASK_CONTEXT_SIZE  TASK_SIZE_64TB
-#endif
+#include 
 
 /*
  * 32-bit user address space is 4GB - 1 page
diff --git a/arch/powerpc/include/asm/task_size_user64.h 
b/arch/powerpc/include/asm/task_size_user64.h
new file mode 100644
index ..a4043075864b
--- /dev/null
+++ b/arch/powerpc/include/asm/task_size_user64.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_TASK_SIZE_USER64_H
+#define _ASM_POWERPC_TASK_SIZE_USER64_H
+
+#ifdef CONFIG_PPC64
+/*
+ * 64-bit user address space can have multiple limits
+ * For now supported values are:
+ */
+#define TASK_SIZE_64TB  (0x4000UL)
+#define TASK_SIZE_128TB (0x8000UL)
+#define TASK_SIZE_512TB (0x0002UL)
+#define TASK_SIZE_1PB   (0x0004UL)
+#define TASK_SIZE_2PB   (0x0008UL)
+/*
+ * With 52 bits in the address we can support
+ * upto 4PB of range.
+ */
+#define TASK_SIZE_4PB   (0x0010UL)
+
+/*
+ * For now 512TB is only supported with book3s and 64K linux page size.
+ */
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_64K_PAGES)
+/*
+ * Max value currently used:
+ */
+#define TASK_SIZE_USER64   TASK_SIZE_4PB
+#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_128TB
+#define TASK_CONTEXT_SIZE  TASK_SIZE_512TB
+#else
+#define TASK_SIZE_USER64   TASK_SIZE_64TB
+#define DEFAULT_MAP_WINDOW_USER64  TASK_SIZE_64TB
+/*
+ * We don't need to allocate extended context ids for 4K page size, because
+ * we limit the max effective address on this config to 64TB.
+ */
+#define TASK_CONTEXT_SIZE  TASK_SIZE_64TB
+#endif
+
+#endif /* CONFIG_PPC64 */
+#endif /* _ASM_POWERPC_TASK_SIZE_USER64_H */
diff --git a/arch/powerpc/kvm/book3s_hv_hmi.c b/arch/powerpc/kvm/book3s_hv_hmi.c
index e3f738eb1cac..64b5011475c7 100644
--- a/arch/powerpc/kvm/book3s_hv_hmi.c
+++ b/arch/powerpc/kvm/book3s_hv_hmi.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 
 void wait_for_subcore_guest_exit(void)
 {
-- 
2.13.3



[PATCH v14 12/12] powerpc: clean stack pointers naming

2019-01-24 Thread Christophe Leroy
Some stack pointers used to also be thread_info pointers
and were called tp. Now that they are only stack pointers,
rename them sp.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/irq.c  | 17 +++--
 arch/powerpc/kernel/setup_64.c | 11 +++
 2 files changed, 10 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 938944c6e2ee..8a936723c791 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -659,21 +659,21 @@ void __do_irq(struct pt_regs *regs)
 void do_IRQ(struct pt_regs *regs)
 {
struct pt_regs *old_regs = set_irq_regs(regs);
-   void *curtp, *irqtp, *sirqtp;
+   void *cursp, *irqsp, *sirqsp;
 
/* Switch to the irq stack to handle this */
-   curtp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
-   irqtp = hardirq_ctx[raw_smp_processor_id()];
-   sirqtp = softirq_ctx[raw_smp_processor_id()];
+   cursp = (void *)(current_stack_pointer() & ~(THREAD_SIZE - 1));
+   irqsp = hardirq_ctx[raw_smp_processor_id()];
+   sirqsp = softirq_ctx[raw_smp_processor_id()];
 
/* Already there ? */
-   if (unlikely(curtp == irqtp || curtp == sirqtp)) {
+   if (unlikely(cursp == irqsp || cursp == sirqsp)) {
__do_irq(regs);
set_irq_regs(old_regs);
return;
}
/* Switch stack and call */
-   call_do_irq(regs, irqtp);
+   call_do_irq(regs, irqsp);
 
set_irq_regs(old_regs);
 }
@@ -695,10 +695,7 @@ void *hardirq_ctx[NR_CPUS] __read_mostly;
 
 void do_softirq_own_stack(void)
 {
-   void *irqtp;
-
-   irqtp = softirq_ctx[smp_processor_id()];
-   call_do_softirq(irqtp);
+   call_do_softirq(softirq_ctx[smp_processor_id()]);
 }
 
 irq_hw_number_t virq_to_hw(unsigned int virq)
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index 2db1c5f7d141..daa361fc6a24 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -716,19 +716,14 @@ void __init emergency_stack_init(void)
limit = min(ppc64_bolted_size(), ppc64_rma_size);
 
for_each_possible_cpu(i) {
-   void *ti;
-
-   ti = alloc_stack(limit, i);
-   paca_ptrs[i]->emergency_sp = ti + THREAD_SIZE;
+   paca_ptrs[i]->emergency_sp = alloc_stack(limit, i) + 
THREAD_SIZE;
 
 #ifdef CONFIG_PPC_BOOK3S_64
/* emergency stack for NMI exception handling. */
-   ti = alloc_stack(limit, i);
-   paca_ptrs[i]->nmi_emergency_sp = ti + THREAD_SIZE;
+   paca_ptrs[i]->nmi_emergency_sp = alloc_stack(limit, i) + 
THREAD_SIZE;
 
/* emergency stack for machine check exception handling. */
-   ti = alloc_stack(limit, i);
-   paca_ptrs[i]->mc_emergency_sp = ti + THREAD_SIZE;
+   paca_ptrs[i]->mc_emergency_sp = alloc_stack(limit, i) + 
THREAD_SIZE;
 #endif
}
 }
-- 
2.13.3



[PATCH v14 10/12] powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU

2019-01-24 Thread Christophe Leroy
Now that thread_info is similar to task_struct, its address is in r2
so CURRENT_THREAD_INFO() macro is useless. This patch removes it.

At the same time, as the 'cpu' field is not anymore in thread_info,
this patch renames it to TASK_CPU.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/Makefile  |  2 +-
 arch/powerpc/include/asm/thread_info.h |  2 --
 arch/powerpc/kernel/asm-offsets.c  |  2 +-
 arch/powerpc/kernel/entry_32.S | 43 --
 arch/powerpc/kernel/epapr_hcalls.S |  5 ++--
 arch/powerpc/kernel/head_fsl_booke.S   |  5 ++--
 arch/powerpc/kernel/idle_6xx.S |  8 +++
 arch/powerpc/kernel/idle_e500.S|  8 +++
 arch/powerpc/kernel/misc_32.S  |  3 +--
 arch/powerpc/mm/hash_low_32.S  | 14 ---
 arch/powerpc/sysdev/6xx-suspend.S  |  5 ++--
 11 files changed, 35 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 53ffe935f3b0..7de49889bd5d 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -431,7 +431,7 @@ ifdef CONFIG_SMP
 prepare: task_cpu_prepare
 
 task_cpu_prepare: prepare0
-   $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == "TI_CPU") 
print $$3;}' include/generated/asm-offsets.h))
+   $(eval KBUILD_CFLAGS += -D_TASK_CPU=$(shell awk '{if ($$2 == 
"TASK_CPU") print $$3;}' include/generated/asm-offsets.h))
 endif
 
 # Check toolchain versions:
diff --git a/arch/powerpc/include/asm/thread_info.h 
b/arch/powerpc/include/asm/thread_info.h
index d91523c2c7d8..c959b8d66cac 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -19,8 +19,6 @@
 
 #ifdef CONFIG_PPC64
 #define CURRENT_THREAD_INFO(dest, sp)  stringify_in_c(ld dest, 
PACACURRENT(r13))
-#else
-#define CURRENT_THREAD_INFO(dest, sp)  stringify_in_c(mr dest, r2)
 #endif
 
 #ifndef __ASSEMBLY__
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 94ac190a0b16..03439785c2ea 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -96,7 +96,7 @@ int main(void)
 #endif /* CONFIG_PPC64 */
OFFSET(TASK_STACK, task_struct, stack);
 #ifdef CONFIG_SMP
-   OFFSET(TI_CPU, task_struct, cpu);
+   OFFSET(TASK_CPU, task_struct, cpu);
 #endif
 
 #ifdef CONFIG_LIVEPATCH
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 367dbf34c8a7..bc85d4320283 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -168,8 +168,7 @@ transfer_to_handler:
tophys(r11,r11)
addir11,r11,global_dbcr0@l
 #ifdef CONFIG_SMP
-   CURRENT_THREAD_INFO(r9, r1)
-   lwz r9,TI_CPU(r9)
+   lwz r9,TASK_CPU(r2)
slwir9,r9,3
add r11,r11,r9
 #endif
@@ -189,8 +188,7 @@ transfer_to_handler:
ble-stack_ovf   /* then the kernel stack overflowed */
 5:
 #if defined(CONFIG_PPC_BOOK3S_32) || defined(CONFIG_E500)
-   CURRENT_THREAD_INFO(r9, r1)
-   tophys(r9,r9)   /* check local flags */
+   tophys(r9,r2)   /* check local flags */
lwz r12,TI_LOCAL_FLAGS(r9)
mtcrf   0x01,r12
bt- 31-TLF_NAPPING,4f
@@ -198,8 +196,7 @@ transfer_to_handler:
 #endif /* CONFIG_PPC_BOOK3S_32 || CONFIG_E500 */
 3:
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-   CURRENT_THREAD_INFO(r9, r1)
-   tophys(r9, r9)
+   tophys(r9, r2)
ACCOUNT_CPU_USER_ENTRY(r9, r11, r12)
 #endif
.globl transfer_to_handler_cont
@@ -344,8 +341,7 @@ _GLOBAL(DoSyscall)
mtmsr   r11
 1:
 #endif /* CONFIG_TRACE_IRQFLAGS */
-   CURRENT_THREAD_INFO(r10, r1)
-   lwz r11,TI_FLAGS(r10)
+   lwz r11,TI_FLAGS(r2)
andi.   r11,r11,_TIF_SYSCALL_DOTRACE
bne-syscall_dotrace
 syscall_dotrace_cont:
@@ -378,13 +374,12 @@ ret_from_syscall:
lwz r3,GPR3(r1)
 #endif
mr  r6,r3
-   CURRENT_THREAD_INFO(r12, r1)
/* disable interrupts so current_thread_info()->flags can't change */
LOAD_MSR_KERNEL(r10,MSR_KERNEL) /* doesn't include MSR_EE */
/* Note: We don't bother telling lockdep about it */
SYNC
MTMSRD(r10)
-   lwz r9,TI_FLAGS(r12)
+   lwz r9,TI_FLAGS(r2)
li  r8,-MAX_ERRNO
andi.   
r0,r9,(_TIF_SYSCALL_DOTRACE|_TIF_SINGLESTEP|_TIF_USER_WORK_MASK|_TIF_PERSYSCALL_MASK)
bne-syscall_exit_work
@@ -431,8 +426,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_NEED_PAIRED_STWCX)
 #ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
andi.   r4,r8,MSR_PR
beq 3f
-   CURRENT_THREAD_INFO(r4, r1)
-   ACCOUNT_CPU_USER_EXIT(r4, r5, r7)
+   ACCOUNT_CPU_USER_EXIT(r2, r5, r7)
 3:
 #endif
lwz r4,_LINK(r1)
@@ -525,7 +519,7 @@ syscall_exit_work:
/* Clear per-syscall TIF flags if any are set.  */
 
li  r11,_TIF_PERSYSCALL_MASK
-   addir12,r12,TI_FLAGS

[PATCH v14 09/12] powerpc: 'current_set' is now a table of task_struct pointers

2019-01-24 Thread Christophe Leroy
The table of pointers 'current_set' has been used for retrieving
the stack and current. They used to be thread_info pointers as
they were pointing to the stack and current was taken from the
'task' field of the thread_info.

Now, the pointers of 'current_set' table are now both pointers
to task_struct and pointers to thread_info.

As they are used to get current, and the stack pointer is
retrieved from current's stack field, this patch changes
their type to task_struct, and renames secondary_ti to
secondary_current.

Reviewed-by: Nicholas Piggin 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/asm-prototypes.h |  4 ++--
 arch/powerpc/kernel/head_32.S |  6 +++---
 arch/powerpc/kernel/head_44x.S|  4 ++--
 arch/powerpc/kernel/head_fsl_booke.S  |  4 ++--
 arch/powerpc/kernel/smp.c | 10 --
 5 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index 1d911f68a23b..1484df6779ab 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -23,8 +23,8 @@
 #include 
 
 /* SMP */
-extern struct thread_info *current_set[NR_CPUS];
-extern struct thread_info *secondary_ti;
+extern struct task_struct *current_set[NR_CPUS];
+extern struct task_struct *secondary_current;
 void start_secondary(void *unused);
 
 /* kexec */
diff --git a/arch/powerpc/kernel/head_32.S b/arch/powerpc/kernel/head_32.S
index 309a45779ad5..146385b1c2da 100644
--- a/arch/powerpc/kernel/head_32.S
+++ b/arch/powerpc/kernel/head_32.S
@@ -846,9 +846,9 @@ __secondary_start:
 #endif /* CONFIG_PPC_BOOK3S_32 */
 
/* get current's stack and current */
-   lis r1,secondary_ti@ha
-   tophys(r1,r1)
-   lwz r2,secondary_ti@l(r1)
+   lis r2,secondary_current@ha
+   tophys(r2,r2)
+   lwz r2,secondary_current@l(r2)
tophys(r1,r2)
lwz r1,TASK_STACK(r1)
 
diff --git a/arch/powerpc/kernel/head_44x.S b/arch/powerpc/kernel/head_44x.S
index f94a93b6c2f2..37117ab11584 100644
--- a/arch/powerpc/kernel/head_44x.S
+++ b/arch/powerpc/kernel/head_44x.S
@@ -1020,8 +1020,8 @@ _GLOBAL(start_secondary_47x)
/* Now we can get our task struct and real stack pointer */
 
/* Get current's stack and current */
-   lis r1,secondary_ti@ha
-   lwz r2,secondary_ti@l(r1)
+   lis r2,secondary_current@ha
+   lwz r2,secondary_current@l(r2)
lwz r1,TASK_STACK(r2)
 
/* Current stack pointer */
diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
b/arch/powerpc/kernel/head_fsl_booke.S
index 11f38adbe020..4ed2a7c8e89b 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -1091,8 +1091,8 @@ __secondary_start:
bl  call_setup_cpu
 
/* get current's stack and current */
-   lis r1,secondary_ti@ha
-   lwz r2,secondary_ti@l(r1)
+   lis r2,secondary_current@ha
+   lwz r2,secondary_current@l(r2)
lwz r1,TASK_STACK(r2)
 
/* stack */
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index aa4517686f90..a41fa8924004 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -76,7 +76,7 @@
 static DEFINE_PER_CPU(int, cpu_state) = { 0 };
 #endif
 
-struct thread_info *secondary_ti;
+struct task_struct *secondary_current;
 bool has_big_cores;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
@@ -664,7 +664,7 @@ void smp_send_stop(void)
 }
 #endif /* CONFIG_NMI_IPI */
 
-struct thread_info *current_set[NR_CPUS];
+struct task_struct *current_set[NR_CPUS];
 
 static void smp_store_cpu_info(int id)
 {
@@ -929,7 +929,7 @@ void smp_prepare_boot_cpu(void)
paca_ptrs[boot_cpuid]->__current = current;
 #endif
set_numa_node(numa_cpu_lookup_table[boot_cpuid]);
-   current_set[boot_cpuid] = task_thread_info(current);
+   current_set[boot_cpuid] = current;
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
@@ -1014,15 +1014,13 @@ static bool secondaries_inhibited(void)
 
 static void cpu_idle_thread_init(unsigned int cpu, struct task_struct *idle)
 {
-   struct thread_info *ti = task_thread_info(idle);
-
 #ifdef CONFIG_PPC64
paca_ptrs[cpu]->__current = idle;
paca_ptrs[cpu]->kstack = (unsigned long)task_stack_page(idle) +
 THREAD_SIZE - STACK_FRAME_OVERHEAD;
 #endif
idle->cpu = cpu;
-   secondary_ti = current_set[cpu] = ti;
+   secondary_current = current_set[cpu] = idle;
 }
 
 int __cpu_up(unsigned int cpu, struct task_struct *tidle)
-- 
2.13.3



[PATCH v14 08/12] powerpc: regain entire stack space

2019-01-24 Thread Christophe Leroy
thread_info is not anymore in the stack, so the entire stack
can now be used.

There is also no risk anymore of corrupting task_cpu(p) with a
stack overflow so the patch removes the test.

When doing this, an explicit test for NULL stack pointer is
needed in validate_sp() as it is not anymore implicitely covered
by the sizeof(thread_info) gap.

In the meantime, with the previous patch all pointers to the stacks
are not anymore pointers to thread_info so this patch changes them
to void*

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/irq.h   | 10 +-
 arch/powerpc/include/asm/processor.h |  3 +--
 arch/powerpc/kernel/asm-offsets.c|  1 -
 arch/powerpc/kernel/entry_32.S   | 14 --
 arch/powerpc/kernel/irq.c| 19 +--
 arch/powerpc/kernel/misc_32.S|  6 ++
 arch/powerpc/kernel/process.c| 32 +---
 arch/powerpc/kernel/setup_64.c   |  8 
 8 files changed, 38 insertions(+), 55 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index 28a7ace0a1b9..c91a60cda4fa 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -48,16 +48,16 @@ struct pt_regs;
  * Per-cpu stacks for handling critical, debug and machine check
  * level interrupts.
  */
-extern struct thread_info *critirq_ctx[NR_CPUS];
-extern struct thread_info *dbgirq_ctx[NR_CPUS];
-extern struct thread_info *mcheckirq_ctx[NR_CPUS];
+extern void *critirq_ctx[NR_CPUS];
+extern void *dbgirq_ctx[NR_CPUS];
+extern void *mcheckirq_ctx[NR_CPUS];
 #endif
 
 /*
  * Per-cpu stacks for handling hard and soft interrupts.
  */
-extern struct thread_info *hardirq_ctx[NR_CPUS];
-extern struct thread_info *softirq_ctx[NR_CPUS];
+extern void *hardirq_ctx[NR_CPUS];
+extern void *softirq_ctx[NR_CPUS];
 
 void call_do_softirq(void *sp);
 void call_do_irq(struct pt_regs *regs, void *sp);
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index 15acb282a876..8179b64871ed 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -325,8 +325,7 @@ struct thread_struct {
 #define ARCH_MIN_TASKALIGN 16
 
 #define INIT_SP(sizeof(init_stack) + (unsigned long) 
&init_stack)
-#define INIT_SP_LIMIT \
-   (_ALIGN_UP(sizeof(struct thread_info), 16) + (unsigned long)&init_stack)
+#define INIT_SP_LIMIT  ((unsigned long)&init_stack)
 
 #ifdef CONFIG_SPE
 #define SPEFSCR_INIT \
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 1fb52206c106..94ac190a0b16 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -92,7 +92,6 @@ int main(void)
DEFINE(SIGSEGV, SIGSEGV);
DEFINE(NMI_MASK, NMI_MASK);
 #else
-   DEFINE(THREAD_INFO_GAP, _ALIGN_UP(sizeof(struct thread_info), 16));
OFFSET(KSP_LIMIT, thread_struct, ksp_limit);
 #endif /* CONFIG_PPC64 */
OFFSET(TASK_STACK, task_struct, stack);
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 29b9258dfff7..367dbf34c8a7 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -97,14 +97,11 @@ crit_transfer_to_handler:
mfspr   r0,SPRN_SRR1
stw r0,_SRR1(r11)
 
-   /* set the stack limit to the current stack
-* and set the limit to protect the thread_info
-* struct
-*/
+   /* set the stack limit to the current stack */
mfspr   r8,SPRN_SPRG_THREAD
lwz r0,KSP_LIMIT(r8)
stw r0,SAVED_KSP_LIMIT(r11)
-   rlwimi  r0,r1,0,0,(31-THREAD_SHIFT)
+   rlwinm  r0,r1,0,0,(31 - THREAD_SHIFT)
stw r0,KSP_LIMIT(r8)
/* fall through */
 #endif
@@ -121,14 +118,11 @@ crit_transfer_to_handler:
mfspr   r0,SPRN_SRR1
stw r0,crit_srr1@l(0)
 
-   /* set the stack limit to the current stack
-* and set the limit to protect the thread_info
-* struct
-*/
+   /* set the stack limit to the current stack */
mfspr   r8,SPRN_SPRG_THREAD
lwz r0,KSP_LIMIT(r8)
stw r0,saved_ksp_limit@l(0)
-   rlwimi  r0,r1,0,0,(31-THREAD_SHIFT)
+   rlwinm  r0,r1,0,0,(31 - THREAD_SHIFT)
stw r0,KSP_LIMIT(r8)
/* fall through */
 #endif
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 85c48911938a..938944c6e2ee 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -618,9 +618,8 @@ static inline void check_stack_overflow(void)
sp = current_stack_pointer() & (THREAD_SIZE-1);
 
/* check for stack overflow: is there less than 2KB free? */
-   if (unlikely(sp < (sizeof(struct thread_info) + 2048))) {
-   pr_err("do_IRQ: stack overflow: %ld\n",
-   sp - sizeof(struct thread_info));
+   if (unlikely(sp < 2048)) {
+   pr_err("do_IRQ: stack overflow: %

[PATCH v14 06/12] powerpc: Prepare for moving thread_info into task_struct

2019-01-24 Thread Christophe Leroy
This patch cleans the powerpc kernel before activating
CONFIG_THREAD_INFO_IN_TASK:
- The purpose of the pointer given to call_do_softirq() and
call_do_irq() is to point the new stack ==> change it to void* and
rename it 'sp'
- Don't use CURRENT_THREAD_INFO() to locate the stack.
- Fix a few comments.
- Replace current_thread_info()->task by current
- Rename THREAD_INFO to TASK_STASK: as it is in fact the offset of the
pointer to the stack in task_struct, this pointer will not be impacted
by the move of THREAD_INFO.
- Makes TASK_STACK available to PPC64. PPC64 will need it to get the
stack pointer from current once the thread_info have been moved.

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/irq.h   | 4 ++--
 arch/powerpc/include/asm/processor.h | 4 ++--
 arch/powerpc/include/asm/reg.h   | 2 +-
 arch/powerpc/kernel/asm-offsets.c| 2 +-
 arch/powerpc/kernel/entry_32.S   | 2 +-
 arch/powerpc/kernel/entry_64.S   | 2 +-
 arch/powerpc/kernel/head_32.S| 4 ++--
 arch/powerpc/kernel/head_40x.S   | 4 ++--
 arch/powerpc/kernel/head_44x.S   | 2 +-
 arch/powerpc/kernel/head_8xx.S   | 2 +-
 arch/powerpc/kernel/head_booke.h | 4 ++--
 arch/powerpc/kernel/head_fsl_booke.S | 4 ++--
 arch/powerpc/kernel/irq.c| 2 +-
 arch/powerpc/kernel/misc_32.S| 4 ++--
 arch/powerpc/kernel/process.c| 6 +++---
 arch/powerpc/kernel/smp.c| 4 +++-
 16 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
index ee39ce56b2a2..2efbae8d93be 100644
--- a/arch/powerpc/include/asm/irq.h
+++ b/arch/powerpc/include/asm/irq.h
@@ -63,8 +63,8 @@ extern struct thread_info *hardirq_ctx[NR_CPUS];
 extern struct thread_info *softirq_ctx[NR_CPUS];
 
 extern void irq_ctx_init(void);
-extern void call_do_softirq(struct thread_info *tp);
-extern void call_do_irq(struct pt_regs *regs, struct thread_info *tp);
+void call_do_softirq(void *sp);
+void call_do_irq(struct pt_regs *regs, void *sp);
 extern void do_IRQ(struct pt_regs *regs);
 extern void __init init_IRQ(void);
 extern void __do_irq(struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/processor.h 
b/arch/powerpc/include/asm/processor.h
index 692f7383d461..15acb282a876 100644
--- a/arch/powerpc/include/asm/processor.h
+++ b/arch/powerpc/include/asm/processor.h
@@ -40,7 +40,7 @@
 
 #ifndef __ASSEMBLY__
 #include 
-#include 
+#include 
 #include 
 #include 
 
@@ -326,7 +326,7 @@ struct thread_struct {
 
 #define INIT_SP(sizeof(init_stack) + (unsigned long) 
&init_stack)
 #define INIT_SP_LIMIT \
-   (_ALIGN_UP(sizeof(init_thread_info), 16) + (unsigned long) &init_stack)
+   (_ALIGN_UP(sizeof(struct thread_info), 16) + (unsigned long)&init_stack)
 
 #ifdef CONFIG_SPE
 #define SPEFSCR_INIT \
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 1c98ef1f2d5b..581e61db2dcf 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1062,7 +1062,7 @@
  * - SPRG9 debug exception scratch
  *
  * All 32-bit:
- * - SPRG3 current thread_info pointer
+ * - SPRG3 current thread_struct physical addr pointer
  *(virtual on BookE, physical on others)
  *
  * 32-bit classic:
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 9ffc72ded73a..b2b52e002a76 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -90,10 +90,10 @@ int main(void)
DEFINE(SIGSEGV, SIGSEGV);
DEFINE(NMI_MASK, NMI_MASK);
 #else
-   OFFSET(THREAD_INFO, task_struct, stack);
DEFINE(THREAD_INFO_GAP, _ALIGN_UP(sizeof(struct thread_info), 16));
OFFSET(KSP_LIMIT, thread_struct, ksp_limit);
 #endif /* CONFIG_PPC64 */
+   OFFSET(TASK_STACK, task_struct, stack);
 
 #ifdef CONFIG_LIVEPATCH
OFFSET(TI_livepatch_sp, thread_info, livepatch_sp);
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 0165edd03b38..8d4d8440681e 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -1165,7 +1165,7 @@ ret_from_debug_exc:
mfspr   r9,SPRN_SPRG_THREAD
lwz r10,SAVED_KSP_LIMIT(r1)
stw r10,KSP_LIMIT(r9)
-   lwz r9,THREAD_INFO-THREAD(r9)
+   lwz r9,TASK_STACK-THREAD(r9)
CURRENT_THREAD_INFO(r10, r1)
lwz r10,TI_PREEMPT(r10)
stw r10,TI_PREEMPT(r9)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 435927f549c4..01d0706d873f 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -695,7 +695,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
 2:
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
-   CURRENT_THREAD_INFO(r7, r8)  /* base of new stack */
+   clrrdi  r7, r8, THREAD_SHIFT/* base of new stack */
/* Note: this uses SWITCH_FRAME_SIZE rather than INT_FRAME_SIZE

[PATCH v14 05/12] powerpc: prep stack walkers for THREAD_INFO_IN_TASK

2019-01-24 Thread Christophe Leroy
[text copied from commit 9bbd4c56b0b6
("arm64: prep stack walkers for THREAD_INFO_IN_TASK")]

When CONFIG_THREAD_INFO_IN_TASK is selected, task stacks may be freed
before a task is destroyed. To account for this, the stacks are
refcounted, and when manipulating the stack of another task, it is
necessary to get/put the stack to ensure it isn't freed and/or re-used
while we do so.

This patch reworks the powerpc stack walking code to account for this.
When CONFIG_THREAD_INFO_IN_TASK is not selected these perform no
refcounting, and this should only be a structural change that does not
affect behaviour.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/process.c| 23 +--
 arch/powerpc/kernel/stacktrace.c | 29 ++---
 2 files changed, 47 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index ce393df243aa..4ffbb677c9f5 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2027,7 +2027,7 @@ int validate_sp(unsigned long sp, struct task_struct *p,
 
 EXPORT_SYMBOL(validate_sp);
 
-unsigned long get_wchan(struct task_struct *p)
+static unsigned long __get_wchan(struct task_struct *p)
 {
unsigned long ip, sp;
int count = 0;
@@ -2053,6 +2053,20 @@ unsigned long get_wchan(struct task_struct *p)
return 0;
 }
 
+unsigned long get_wchan(struct task_struct *p)
+{
+   unsigned long ret;
+
+   if (!try_get_task_stack(p))
+   return 0;
+
+   ret = __get_wchan(p);
+
+   put_task_stack(p);
+
+   return ret;
+}
+
 static int kstack_depth_to_print = CONFIG_PRINT_STACK_DEPTH;
 
 void show_stack(struct task_struct *tsk, unsigned long *stack)
@@ -2067,6 +2081,9 @@ void show_stack(struct task_struct *tsk, unsigned long 
*stack)
int curr_frame = 0;
 #endif
 
+   if (!try_get_task_stack(tsk))
+   return;
+
sp = (unsigned long) stack;
if (tsk == NULL)
tsk = current;
@@ -2081,7 +2098,7 @@ void show_stack(struct task_struct *tsk, unsigned long 
*stack)
printk("Call Trace:\n");
do {
if (!validate_sp(sp, tsk, STACK_FRAME_OVERHEAD))
-   return;
+   break;
 
stack = (unsigned long *) sp;
newsp = stack[0];
@@ -2121,6 +2138,8 @@ void show_stack(struct task_struct *tsk, unsigned long 
*stack)
 
sp = newsp;
} while (count++ < kstack_depth_to_print);
+
+   put_task_stack(tsk);
 }
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index e2c50b55138f..a5745571e06e 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -67,12 +67,17 @@ void save_stack_trace_tsk(struct task_struct *tsk, struct 
stack_trace *trace)
 {
unsigned long sp;
 
+   if (!try_get_task_stack(tsk))
+   return;
+
if (tsk == current)
sp = current_stack_pointer();
else
sp = tsk->thread.ksp;
 
save_context_stack(trace, sp, tsk, 0);
+
+   put_task_stack(tsk);
 }
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk);
 
@@ -84,9 +89,8 @@ save_stack_trace_regs(struct pt_regs *regs, struct 
stack_trace *trace)
 EXPORT_SYMBOL_GPL(save_stack_trace_regs);
 
 #ifdef CONFIG_HAVE_RELIABLE_STACKTRACE
-int
-save_stack_trace_tsk_reliable(struct task_struct *tsk,
-   struct stack_trace *trace)
+static int __save_stack_trace_tsk_reliable(struct task_struct *tsk,
+  struct stack_trace *trace)
 {
unsigned long sp;
unsigned long stack_page = (unsigned long)task_stack_page(tsk);
@@ -193,6 +197,25 @@ save_stack_trace_tsk_reliable(struct task_struct *tsk,
}
return 0;
 }
+
+int save_stack_trace_tsk_reliable(struct task_struct *tsk,
+ struct stack_trace *trace)
+{
+   int ret;
+
+   /*
+* If the task doesn't have a stack (e.g., a zombie), the stack is
+* "reliably" empty.
+*/
+   if (!try_get_task_stack(tsk))
+   return 0;
+
+   ret = __save_stack_trace_tsk_reliable(trace, tsk);
+
+   put_task_stack(tsk);
+
+   return ret;
+}
 EXPORT_SYMBOL_GPL(save_stack_trace_tsk_reliable);
 #endif /* CONFIG_HAVE_RELIABLE_STACKTRACE */
 
-- 
2.13.3



[PATCH v14 04/12] powerpc: Only use task_struct 'cpu' field on SMP

2019-01-24 Thread Christophe Leroy
When moving to CONFIG_THREAD_INFO_IN_TASK, the thread_info 'cpu' field
gets moved into task_struct and only defined when CONFIG_SMP is set.

This patch ensures that TI_CPU is only used when CONFIG_SMP is set and
that task_struct 'cpu' field is not used directly out of SMP code.

Signed-off-by: Christophe Leroy 
Reviewed-by: Nicholas Piggin 
---
 arch/powerpc/kernel/head_fsl_booke.S | 2 ++
 arch/powerpc/kernel/misc_32.S| 4 
 arch/powerpc/xmon/xmon.c | 2 +-
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/head_fsl_booke.S 
b/arch/powerpc/kernel/head_fsl_booke.S
index 2386ce2a9c6e..2c21e8642a00 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -243,8 +243,10 @@ set_ivor:
li  r0,0
stwur0,THREAD_SIZE-STACK_FRAME_OVERHEAD(r1)
 
+#ifdef CONFIG_SMP
CURRENT_THREAD_INFO(r22, r1)
stw r24, TI_CPU(r22)
+#endif
 
bl  early_init
 
diff --git a/arch/powerpc/kernel/misc_32.S b/arch/powerpc/kernel/misc_32.S
index 57d2ffb2d45c..02b8cdd73792 100644
--- a/arch/powerpc/kernel/misc_32.S
+++ b/arch/powerpc/kernel/misc_32.S
@@ -183,10 +183,14 @@ _GLOBAL(low_choose_750fx_pll)
or  r4,r4,r5
mtspr   SPRN_HID1,r4
 
+#ifdef CONFIG_SMP
/* Store new HID1 image */
CURRENT_THREAD_INFO(r6, r1)
lwz r6,TI_CPU(r6)
slwir6,r6,2
+#else
+   li  r6, 0
+#endif
addis   r6,r6,nap_save_hid1@ha
stw r4,nap_save_hid1@l(r6)
 
diff --git a/arch/powerpc/xmon/xmon.c b/arch/powerpc/xmon/xmon.c
index 757b8499aba2..a0f44f992360 100644
--- a/arch/powerpc/xmon/xmon.c
+++ b/arch/powerpc/xmon/xmon.c
@@ -2997,7 +2997,7 @@ static void show_task(struct task_struct *tsk)
printf("%px %016lx %6d %6d %c %2d %s\n", tsk,
tsk->thread.ksp,
tsk->pid, rcu_dereference(tsk->parent)->pid,
-   state, task_thread_info(tsk)->cpu,
+   state, task_cpu(tsk),
tsk->comm);
 }
 
-- 
2.13.3



[PATCH v14 01/12] powerpc/32: Fix CONFIG_VIRT_CPU_ACCOUNTING_NATIVE for 40x/booke

2019-01-24 Thread Christophe Leroy
40x/booke have another path to reach 3f from transfer_to_handler,
so ACCOUNT_CPU_USER_ENTRY() have to be moved there.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/entry_32.S | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 0768dfd8a64e..0165edd03b38 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -185,12 +185,6 @@ transfer_to_handler:
addir12,r12,-1
stw r12,4(r11)
 #endif
-#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
-   CURRENT_THREAD_INFO(r9, r1)
-   tophys(r9, r9)
-   ACCOUNT_CPU_USER_ENTRY(r9, r11, r12)
-#endif
-
b   3f
 
 2: /* if from kernel, check interrupted DOZE/NAP mode and
@@ -208,9 +202,14 @@ transfer_to_handler:
bt- 31-TLF_NAPPING,4f
bt- 31-TLF_SLEEPING,7f
 #endif /* CONFIG_PPC_BOOK3S_32 || CONFIG_E500 */
+3:
+#ifdef CONFIG_VIRT_CPU_ACCOUNTING_NATIVE
+   CURRENT_THREAD_INFO(r9, r1)
+   tophys(r9, r9)
+   ACCOUNT_CPU_USER_ENTRY(r9, r11, r12)
+#endif
.globl transfer_to_handler_cont
 transfer_to_handler_cont:
-3:
mflrr9
lwz r11,0(r9)   /* virtual address of handler */
lwz r9,4(r9)/* where to go when done */
-- 
2.13.3



[PATCH v14 00/12] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-24 Thread Christophe Leroy
The purpose of this serie is to activate CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.

Added Mark Rutland in copy as he did the work on arm64 and may have 
recommendations for us.

Changes in v14 (ie since v13):
 - Added in front a fixup patch which conflicts with this serie
 - Added a patch for using try_get_task_stack()/put_task_stack() in stack 
walkers.
 - Fixed compilation failure in the preparation patch (by moving the 
modification
 of klp_init_thread_info() to the following patch)

Changes since v12:
 - Patch 1: Taken comment from Mike (re-introduced the 'panic' in case memblock 
allocation fails in setup_64.c
 - Patch 1: Added alloc_stack() function in setup_32.c to also panic in case of 
allocation failure.

Changes since v11:
 - Rebased on 81775f5563fa ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")
 - Added a first patch to change memblock allocs to functions returning virtual 
addrs. This removes
   the memset() which were the only remaining stuff in irq_ctx_init() and 
exc_lvl_ctx_init() at the end.
 - dropping irq_ctx_init() and exc_lvl_ctx_init() in patch 5 (powerpc: Activate 
CONFIG_THREAD_INFO_IN_TASK)
 - A few cosmetic changes in commit log and code.

Changes since v10:
 - Rebased on 21622a0d2023 ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")
  ==> Fixed conflict in setup_32.S

Changes since v9:
 - Rebased on 183cbf93be88 ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")
  ==> Fixed conflict on xmon

Changes since v8:
 - Rebased on e589b79e40d9 ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")
  ==> Main impact was conflicts due to commit 9a8dd708d547 ("memblock: rename 
memblock_alloc{_nid,_try_nid} to memblock_phys_alloc*")

Changes since v7:
 - Rebased on fb6c6ce7907d ("Automatic merge of branches 'master', 'next' and 
'fixes' into merge")

Changes since v6:
 - Fixed validate_sp() to exclude NULL sp in 'regain entire stack space' patch 
(early crash with CONFIG_KMEMLEAK)

Changes since v5:
 - Fixed livepatch_sp setup by using end_of_stack() instead of hardcoding
 - Fixed PPC_BPF_LOAD_CPU() macro

Changes since v4:
 - Fixed a build failure on 32bits SMP when include/generated/asm-offsets.h is 
not
 already existing, was due to spaces instead of a tab in the Makefile

Changes since RFC v3: (based on Nick's review)
 - Renamed task_size.h to task_size_user64.h to better relate to what it 
contains.
 - Handling of the isolation of thread_info cpu field inside CONFIG_SMP #ifdefs 
moved to a separate patch.
 - Removed CURRENT_THREAD_INFO macro completely.
 - Added a guard in asm/smp.h to avoid build failure before _TASK_CPU is 
defined.
 - Added a patch at the end to rename 'tp' pointers to 'sp' pointers
 - Renamed 'tp' into 'sp' pointers in preparation patch when relevant
 - Fixed a few commit logs
 - Fixed checkpatch report.

Changes since RFC v2:
 - Removed the modification of names in asm-offsets
 - Created a rule in arch/powerpc/Makefile to append the offset of current->cpu 
in CFLAGS
 - Modified asm/smp.h to use the offset set in CFLAGS
 - Squashed the renaming of THREAD_INFO to TASK_STACK in the preparation patch
 - Moved the modification of current_pt_regs in the patch activating 
CONFIG_THREAD_INFO_IN_TASK

Changes since RFC v1:
 - Removed the first patch which was modifying header inclusion order in timer
 - Modified some names in asm-offsets to avoid conflicts when including 
asm-offsets in C files
 - Modified asm/smp.h to avoid having to include linux/sched.h (using 
asm-offsets instead)
 - Moved some changes from the activation patch to the preparation patch.

Christophe Leroy (12):
  powerpc/32: Fix CONFIG_VIRT_CPU_ACCOUNTING_NATIVE for 40x/booke
  powerpc/irq: use memblock functions returning virtual address
  book3s/64: avoid circular header inclusion in mmu-hash.h
  powerpc: Only use task_struct 'cpu' field on SMP
  powerpc: prep stack walkers for THREAD_INFO_IN_TASK
  powerpc: Prepare for moving thread_info into task_struct
  powerpc: Activate CONFIG_THREAD_INFO_IN_TASK
  powerpc: regain entire stack space
  powerpc: 'current_set' is now a table of task_struct pointers
  powerpc/32: Remove CURRENT_THREAD_INFO and rename TI_CPU
  powerpc/64: Remove CURRENT_THREAD_INFO
  powerpc: clean stack pointers naming

 arch/powerpc/Kconfig   |   1 +
 arch/powerpc/Makefile  |   7 ++
 arch/powerpc/include/asm/asm-prototypes.h  |   4 +-
 arch/powerpc/include/asm/book3s/64/mmu-hash.h  |   2 +-
 arch/powerpc/include/asm/exception-64s.h   |   4 +-
 arch/powerpc/include/asm/irq.h |  18 ++--
 arch/powerpc/include/asm/livepatch.h   |   6 +-
 arch

Re: [PATCH v13 00/10] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-24 Thread Christophe Leroy




Le 24/01/2019 à 16:01, Christophe Leroy a écrit :



Le 24/01/2019 à 10:43, Christophe Leroy a écrit :



On 01/24/2019 01:06 AM, Michael Ellerman wrote:

Christophe Leroy  writes:

Le 12/01/2019 à 10:55, Christophe Leroy a écrit :
The purpose of this serie is to activate CONFIG_THREAD_INFO_IN_TASK 
which

moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.


I ran null_syscall and context_switch benchmark selftests and the 
result

is surprising. There is slight degradation in context_switch and a
significant one on null_syscall:

Without the serie:

~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
55542
55562
55564
55562
55568
...

~# ./null_syscall
 2546.71 ns 336.17 cycles


With the serie:

~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
55138
55142
55152
55144
55142

~# ./null_syscall
 3479.54 ns 459.30 cycles

So 0,8% less context switches per second and 37% more time for one 
syscall ?


Any idea ?


What platform is that on?


It is on the 8xx


On the 83xx, I have a slight improvment:

Without the serie:

root@vgoippro:~# ./null_syscall
921.44 ns 307.15 cycles

With the serie:

root@vgoippro:~# ./null_syscall
918.78 ns 306.26 cycles

Christophe





On 64-bit we have to turn one mtmsrd into two and that's obviously a
slow down. But I don't see that you've done anything similar in 32-bit
code.

I assume it's patch 8 that causes the slow down?


I have not digged into it yet, but why patch 8 ?



The increase of null_syscall duration happens with patch 5 when we 
activate CONFIG_THREAD_INFO_IN_TASK.




Re: [PATCH v13 00/10] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-24 Thread Christophe Leroy




Le 24/01/2019 à 01:59, Michael Ellerman a écrit :

Christophe Leroy  writes:

Le 19/01/2019 à 11:23, Michael Ellerman a écrit :

Christophe Leroy  writes:


The purpose of this serie is to activate CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.

Changes since v12:
   - Patch 1: Taken comment from Mike (re-introduced the 'panic' in case 
memblock allocation fails in setup_64.c
   - Patch 1: Added alloc_stack() function in setup_32.c to also panic in case 
of allocation failure.


Hi Christophe,

I can't get this series to boot on qemu mac99. I'm getting eg:

[0.981514] NFS: Registering the id_resolver key type
[0.981752] Key type id_resolver registered
[0.981868] Key type id_legacy registered
[0.995711] Unrecoverable exception 0 at 0 (msr=0)
[0.996091] Oops: Unrecoverable exception, sig: 6 [#1]
[0.996314] BE PAGE_SIZE=4K MMU=Hash PowerMac
[0.996617] Modules linked in:
[0.996869] CPU: 0 PID: 416 Comm: modprobe Not tainted 
5.0.0-rc2-gcc-7.3.0-00043-g53f2de798792 #342
[0.997138] NIP:   LR:  CTR: 
[0.997309] REGS: ef237f50 TRAP:    Not tainted  
(5.0.0-rc2-gcc-7.3.0-00043-g53f2de798792)
[0.997508] MSR:   <>  CR:   XER: 
[0.997712]
[0.997712] GPR00:  ef238000     
 
[0.997712] GPR08:       
c006477c ef13d8c0
[0.997712] GPR16:       
 
[0.997712] GPR24:       
 
[0.998671] NIP []   (null)
[0.998774] LR []   (null)
[0.998895] Call Trace:
[0.999030] Instruction dump:
[0.999320]        

[0.999546]     6000   

[1.23] ---[ end trace 925ea3419844fe68 ]---


No such issue on my side. Do you have a ramdisk with anythink special or
a special config ? I see your kernel is modprobing something, know what
it is ?


It's just a debian installer image, nothing special AFAIK.


Especially, what is the amount of memory in your config ? On my side
there is 128M:


I have 1G.

But today I can't reproduce the crash :/

So I guess it must have been something else in my config.


Or it could be because I didn't protect stack walks ? See

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9bbd4c56b0b642f04396da378296e68096d5afca

Anyway, I'll soon send out v14 including a patch for that.

Christophe


Re: [PATCH v13 00/10] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-24 Thread Christophe Leroy




Le 24/01/2019 à 10:43, Christophe Leroy a écrit :



On 01/24/2019 01:06 AM, Michael Ellerman wrote:

Christophe Leroy  writes:

Le 12/01/2019 à 10:55, Christophe Leroy a écrit :
The purpose of this serie is to activate CONFIG_THREAD_INFO_IN_TASK 
which

moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.


I ran null_syscall and context_switch benchmark selftests and the result
is surprising. There is slight degradation in context_switch and a
significant one on null_syscall:

Without the serie:

~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
55542
55562
55564
55562
55568
...

~# ./null_syscall
 2546.71 ns 336.17 cycles


With the serie:

~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
55138
55142
55152
55144
55142

~# ./null_syscall
 3479.54 ns 459.30 cycles

So 0,8% less context switches per second and 37% more time for one 
syscall ?


Any idea ?


What platform is that on?


It is on the 8xx



On 64-bit we have to turn one mtmsrd into two and that's obviously a
slow down. But I don't see that you've done anything similar in 32-bit
code.

I assume it's patch 8 that causes the slow down?


I have not digged into it yet, but why patch 8 ?



The increase of null_syscall duration happens with patch 5 when we 
activate CONFIG_THREAD_INFO_IN_TASK.


Christophe


Re: [RFC PATCH] x86, numa: always initialize all possible nodes

2019-01-24 Thread Michal Hocko
a friendly ping for this. Does anybody see any problem with this
approach?

On Mon 14-01-19 09:24:16, Michal Hocko wrote:
> From: Michal Hocko 
> 
> Pingfan Liu has reported the following splat
> [5.772742] BUG: unable to handle kernel paging request at 2088
> [5.773618] PGD 0 P4D 0
> [5.773618] Oops:  [#1] SMP NOPTI
> [5.773618] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 4.20.0-rc1+ #3
> [5.773618] Hardware name: Dell Inc. PowerEdge R7425/02MJ3T, BIOS 1.4.3 
> 06/29/2018
> [5.773618] RIP: 0010:__alloc_pages_nodemask+0xe2/0x2a0
> [5.773618] Code: 00 00 44 89 ea 80 ca 80 41 83 f8 01 44 0f 44 ea 89 da c1 
> ea 08 83 e2 01 88 54 24 20 48 8b 54 24 08 48 85 d2 0f 85 46 01 00 00 <3b> 77 
> 08 0f 82 3d 01 00 00 48 89 f8 44 89 ea 48 89
> e1 44 89 e6 89
> [5.773618] RSP: 0018:aa65fb20 EFLAGS: 00010246
> [5.773618] RAX:  RBX: 006012c0 RCX: 
> 
> [5.773618] RDX:  RSI: 0002 RDI: 
> 2080
> [5.773618] RBP: 006012c0 R08:  R09: 
> 0002
> [5.773618] R10: 006080c0 R11: 0002 R12: 
> 
> [5.773618] R13: 0001 R14:  R15: 
> 0002
> [5.773618] FS:  () GS:8c69afe0() 
> knlGS:
> [5.773618] CS:  0010 DS:  ES:  CR0: 80050033
> [5.773618] CR2: 2088 CR3: 00087e00a000 CR4: 
> 003406e0
> [5.773618] Call Trace:
> [5.773618]  new_slab+0xa9/0x570
> [5.773618]  ___slab_alloc+0x375/0x540
> [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  __slab_alloc+0x1c/0x38
> [5.773618]  __kmalloc_node_track_caller+0xc8/0x270
> [5.773618]  ? pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  devm_kmalloc+0x28/0x60
> [5.773618]  pinctrl_bind_pins+0x2b/0x2a0
> [5.773618]  really_probe+0x73/0x420
> [5.773618]  driver_probe_device+0x115/0x130
> [5.773618]  __driver_attach+0x103/0x110
> [5.773618]  ? driver_probe_device+0x130/0x130
> [5.773618]  bus_for_each_dev+0x67/0xc0
> [5.773618]  ? klist_add_tail+0x3b/0x70
> [5.773618]  bus_add_driver+0x41/0x260
> [5.773618]  ? pcie_port_setup+0x4d/0x4d
> [5.773618]  driver_register+0x5b/0xe0
> [5.773618]  ? pcie_port_setup+0x4d/0x4d
> [5.773618]  do_one_initcall+0x4e/0x1d4
> [5.773618]  ? init_setup+0x25/0x28
> [5.773618]  kernel_init_freeable+0x1c1/0x26e
> [5.773618]  ? loglevel+0x5b/0x5b
> [5.773618]  ? rest_init+0xb0/0xb0
> [5.773618]  kernel_init+0xa/0x110
> [5.773618]  ret_from_fork+0x22/0x40
> [5.773618] Modules linked in:
> [5.773618] CR2: 2088
> [5.773618] ---[ end trace 1030c9120a03d081 ]---
> 
> with his AMD machine with the following topology
>   NUMA node0 CPU(s): 0,8,16,24
>   NUMA node1 CPU(s): 2,10,18,26
>   NUMA node2 CPU(s): 4,12,20,28
>   NUMA node3 CPU(s): 6,14,22,30
>   NUMA node4 CPU(s): 1,9,17,25
>   NUMA node5 CPU(s): 3,11,19,27
>   NUMA node6 CPU(s): 5,13,21,29
>   NUMA node7 CPU(s): 7,15,23,31
> 
> [0.007418] Early memory node ranges
> [0.007419]   node   1: [mem 0x1000-0x0008efff]
> [0.007420]   node   1: [mem 0x0009-0x0009]
> [0.007422]   node   1: [mem 0x0010-0x5c3d6fff]
> [0.007422]   node   1: [mem 0x643df000-0x68ff7fff]
> [0.007423]   node   1: [mem 0x6c528000-0x6fff]
> [0.007424]   node   1: [mem 0x0001-0x00047fff]
> [0.007425]   node   5: [mem 0x00048000-0x00087eff]
> 
> and nr_cpus set to 4. The underlying reason is tha the device is bound
> to node 2 which doesn't have any memory and init_cpu_to_node only
> initializes memory-less nodes for possible cpus which nr_cpus restrics.
> This in turn means that proper zonelists are not allocated and the page
> allocator blows up.
> 
> Fix the issue by reworking how x86 initializes the memory less nodes.
> The current implementation is hacked into the workflow and it doesn't
> allow any flexibility. There is init_memory_less_node called for each
> offline node that has a CPU as already mentioned above. This will make
> sure that we will have a new online node without any memory. Much later
> on we build a zone list for this node and things seem to work, except
> they do not (e.g. due to nr_cpus). Not to mention that it doesn't really
> make much sense to consider an empty node as online because we just
> consider this node whenever we want to iterate nodes to use and empty
> node is obviously not the best candidate. This is all just too fragile.
> 
> Reported-by: Pingfan Liu 
> Tested-by: Pingfan Liu 
> Signed-off-by: Michal Hocko 
> ---
> 
> Hi,
> I am sending this as an RFC because I am not sure this is the proper way
> to go myself. I am especially not sure abou

[PATCH] powerpc/ptrace: Mitigate potential Spectre v1

2019-01-24 Thread Breno Leitao
'regno' is directly controlled by user space, hence leading to a potential
exploitation of the Spectre variant 1 vulnerability.

On PTRACE_SETREGS and PTRACE_GETREGS requests, user space passes the
register number that would be read or written. This register number is
called 'regno' which is part of the 'addr' syscall parameter.

This 'regno' value is checked against the maximum pt_regs structure size,
and then used to dereference it, which matches the initial part of a
Spectre v1 (and Spectre v1.1) attack. The dereferenced value, then,
is returned to userspace in the GETREGS case.

This patch sanitizes 'regno' before using it to dereference pt_reg.

Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].

[1] https://marc.info/?l=linux-kernel&m=152449131114778&w=2

Signed-off-by: Breno Leitao 
---
 arch/powerpc/kernel/ptrace.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
index cdd5d1d3ae41..3eac38a29863 100644
--- a/arch/powerpc/kernel/ptrace.c
+++ b/arch/powerpc/kernel/ptrace.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -298,6 +299,9 @@ int ptrace_get_reg(struct task_struct *task, int regno, 
unsigned long *data)
 #endif
 
if (regno < (sizeof(struct user_pt_regs) / sizeof(unsigned long))) {
+   regno = array_index_nospec(regno,
+   (sizeof(struct user_pt_regs) /
+sizeof(unsigned long)));
*data = ((unsigned long *)task->thread.regs)[regno];
return 0;
}
@@ -321,6 +325,7 @@ int ptrace_put_reg(struct task_struct *task, int regno, 
unsigned long data)
return set_user_dscr(task, data);
 
if (regno <= PT_MAX_PUT_REG) {
+   regno = array_index_nospec(regno, PT_MAX_PUT_REG);
((unsigned long *)task->thread.regs)[regno] = data;
return 0;
}
-- 
2.19.0



Re: [PATCH v13 00/10] powerpc: Switch to CONFIG_THREAD_INFO_IN_TASK

2019-01-24 Thread Christophe Leroy




On 01/24/2019 01:06 AM, Michael Ellerman wrote:

Christophe Leroy  writes:

Le 12/01/2019 à 10:55, Christophe Leroy a écrit :

The purpose of this serie is to activate CONFIG_THREAD_INFO_IN_TASK which
moves the thread_info into task_struct.

Moving thread_info into task_struct has the following advantages:
- It protects thread_info from corruption in the case of stack
overflows.
- Its address is harder to determine if stack addresses are
leaked, making a number of attacks more difficult.


I ran null_syscall and context_switch benchmark selftests and the result
is surprising. There is slight degradation in context_switch and a
significant one on null_syscall:

Without the serie:

~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
55542
55562
55564
55562
55568
...

~# ./null_syscall
 2546.71 ns 336.17 cycles


With the serie:

~# chrt -f 98 ./context_switch --no-altivec --no-vector --no-fp
55138
55142
55152
55144
55142

~# ./null_syscall
 3479.54 ns 459.30 cycles

So 0,8% less context switches per second and 37% more time for one syscall ?

Any idea ?


What platform is that on?


It is on the 8xx



On 64-bit we have to turn one mtmsrd into two and that's obviously a
slow down. But I don't see that you've done anything similar in 32-bit
code.

I assume it's patch 8 that causes the slow down?


I have not digged into it yet, but why patch 8 ?


I run null_syscall with perf, and I get the following. Can we conclude 
on something with that ?


Without the serie:

# Overhead   Samples  Command   Shared Object  Symbol 

#       . 


#
32.95% 46375  null_syscall  [kernel.kallsyms]  [k] DoSyscall
23.64% 33275  null_syscall  [kernel.kallsyms]  [k] 
__task_pid_nr_ns
15.47% 21778  null_syscall  libc-2.23.so   [.] 
__GI___getppid
 8.92% 12556  null_syscall  [kernel.kallsyms]  [k] 
__rcu_read_unlock

 5.69%  8014  null_syscall  [kernel.kallsyms]  [k] sys_getppid
 4.01%  5643  null_syscall  [kernel.kallsyms]  [k] 
__rcu_read_lock
 3.67%  5166  null_syscall  [kernel.kallsyms]  [k] 
syscall_dotrace_cont

 2.52%  3542  null_syscall  null_syscall   [.] main

With the serie:

30.04% 56337  null_syscall  [kernel.kallsyms]  [k] DoSyscall
13.89% 26060  null_syscall  [kernel.kallsyms]  [k] 
__rcu_read_unlock
13.36% 25062  null_syscall  libc-2.23.so   [.] 
__GI___getppid
12.73% 23872  null_syscall  [kernel.kallsyms]  [k] 
__task_pid_nr_ns

11.21% 21033  null_syscall  [kernel.kallsyms]  [k] sys_getppid
 8.24% 15457  null_syscall  [kernel.kallsyms]  [k] 
syscall_dotrace_cont
 4.38%  8217  null_syscall  [kernel.kallsyms]  [k] 
ret_from_syscall

 2.54%  4773  null_syscall  null_syscall   [.] main


Christophe


Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support

2019-01-24 Thread Cédric Le Goater
On 1/23/19 10:25 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
>>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>>> instead of the "IPI" that was there at VM init time.
>>
>> So that makes it sound like there is a whole lot going on that hasn't
>> even been hinted at in the patch descriptions...  It sounds like we
>> need a good description of how all this works and fits together
>> somewhere under Documentation/.
>>
>> In any case we need much more informative patch descriptions.  I
>> realize that it's all currently in Cedric's head, but I bet that in
>> two or three years' time when we come to try to debug something, it
>> won't be in anyone's head...
> 
> The main problem is understanding XIVE itself. It's not realistic to
> ask Cedric to write a proper documentation for XIVE as part of the
> patch series, but sadly IBM doesn't have a good one to provide either.

QEMU has a preliminary introduction we could use :

https://git.qemu.org/?p=qemu.git;a=blob;f=include/hw/ppc/xive.h;h=ec23253ba448e25c621356b55a119a738f8e;hb=HEAD

With some extensions for sPAPR and KVM, the resulting file could 
be moved to the Linux documentation directory. This would be an
iterative process over time of course. 

Cheers,

C.