Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-05 Thread Nicholas Piggin
On Mon, 06 Nov 2017 11:48:06 +0530
"Aneesh Kumar K.V"  wrote:

> Nicholas Piggin  writes:
> 
> > On Fri, 3 Nov 2017 18:05:20 +0100
> > Florian Weimer  wrote:
> >  
> >> We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
> >> variant, but I have not seen it on our own builders) where running 
> >> localedef as part of the glibc build crashes with a segmentation fault.
> >> 
> >> Kernel version is 4.13.9 (Fedora 26 variant).
> >> 
> >> I have only seen this with an explicit loader invocation, like this:
> >> 
> >> while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
> >> --alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
> >> charmaps/UTF-8 
> >> --prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
> >> ; done
> >> 
> >> To be run in the localedata subdirectory of a glibc *source* tree, after 
> >> a build.  You may have to create the 
> >> /builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
> >> directory.  I have only reproduced this inside a Fedora 27 chroot on a 
> >> Fedora 26 host, but there it does not matter if you run the old (chroot) 
> >> or newly built binary.
> >> 
> >> I filed this as a glibc bug for tracking:
> >> 
> >>https://sourceware.org/bugzilla/show_bug.cgi?id=22390
> >> 
> >> There's an strace log and a coredump from the crash.
> >> 
> >> I think the data shows that the address in question should be writable.
> >> 
> >> The crossed 0x8000 binary is very suggestive.  I think that 
> >> based on the operation of glibc's malloc, this write would be the first 
> >> time this happens during the lifetime of the process.
> >> 
> >> Does that ring any bells?  Is there anything I can do to provide more 
> >> data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
> >> any diagnostics tool which is provided by Fedora.  
> >
> > There was a recent change to move to 128TB address space by default,
> > and option for 512TB addresses if explicitly requested.
> >
> > Your brk request asked for > 128TB which the kernel gave it, but the
> > address limit in the paca that the SLB miss tests against was not
> > updated to reflect the switch to 512TB address space.  
> 
> We should not return that address, unless we requested with a hint value
> of > 128TB. IIRC we discussed this early during the mmap interface
> change and said, we will return an address > 128T only if the hint
> address is above 128TB (not hint addr + length).

Yeah, I'm thinking we should change that. Make explicit addr + length
hint return > 128TB. Well, it already seems to for this case, which
is why powerpc breaks.

This restriction was added for reasonably well written apps that just
happened to assume they don't get > 128TB va returned by mmap. An app
which asked for addr < 128TB && addr + len > 128TB and were relying on
that to fail is very different. I don't think we should add a big
unintuitive wart to the interface for such an obscure and broken type
of app.

"You get < 128TB unless explicitly requested."

Simple, reasonable, obvious rule. Avoids breaking apps that store
some bits in the top of pointers (provided that memory allocator
userspace libraries also do the right thing).

Thanks,
Nick


Re: [PATCH] powerpc/opal: Fix EBUSY bug in acquiring tokens

2017-11-05 Thread William Kennington

> On Nov 4, 2017, at 2:14 AM, Michael Ellerman  > wrote:
> 
> "William A. Kennington III" > writes:
> 
>> The current code checks the completion map to look for the first token
>> that is complete. In some cases, a completion can come in but the token
>> can still be on lease to the caller processing the completion. If this
>> completed but unreleased token is the first token found in the bitmap by
>> another tasks trying to acquire a token, then the __test_and_set_bit
>> call will fail since the token will still be on lease. The acquisition
>> will then fail with an EBUSY.
>> 
>> This patch reorganizes the acquisition code to look at the
>> opal_async_token_map for an unleased token. If the token has no lease it
>> must have no outstanding completions so we should never see an EBUSY,
>> unless we have leased out too many tokens. Since
>> opal_async_get_token_inrerruptible is protected by a semaphore, we will
>> practically never see EBUSY anymore.
>> 
>> Signed-off-by: William A. Kennington III > >
>> ---
>> arch/powerpc/platforms/powernv/opal-async.c | 6 +++---
>> 1 file changed, 3 insertions(+), 3 deletions(-)
> 
> I think this is superseeded by Cyrils rework (which he's finally
> posted):
> 
>  http://patchwork.ozlabs.org/patch/833630/ 
> 
> 
> 
> If not please let us know.
> 
> cheers

Yeah, I think Cyril’s rework fixes this. I wasn’t sure how long it would take 
for master to receive his changes so I figured we could use something in the 
interim to fix the locking failures. If his changes will be mailed into the 
next merge window then we should have the issue fixed in master. I understand 
that rework probably won’t make it into stable kernels? If not then we should 
probably send this along to stable kernel maintainers.

- William

Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-05 Thread Aneesh Kumar K.V
Nicholas Piggin  writes:

> On Fri, 3 Nov 2017 18:05:20 +0100
> Florian Weimer  wrote:
>
>> We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
>> variant, but I have not seen it on our own builders) where running 
>> localedef as part of the glibc build crashes with a segmentation fault.
>> 
>> Kernel version is 4.13.9 (Fedora 26 variant).
>> 
>> I have only seen this with an explicit loader invocation, like this:
>> 
>> while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
>> --alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
>> charmaps/UTF-8 
>> --prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
>> ; done
>> 
>> To be run in the localedata subdirectory of a glibc *source* tree, after 
>> a build.  You may have to create the 
>> /builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
>> directory.  I have only reproduced this inside a Fedora 27 chroot on a 
>> Fedora 26 host, but there it does not matter if you run the old (chroot) 
>> or newly built binary.
>> 
>> I filed this as a glibc bug for tracking:
>> 
>>https://sourceware.org/bugzilla/show_bug.cgi?id=22390
>> 
>> There's an strace log and a coredump from the crash.
>> 
>> I think the data shows that the address in question should be writable.
>> 
>> The crossed 0x8000 binary is very suggestive.  I think that 
>> based on the operation of glibc's malloc, this write would be the first 
>> time this happens during the lifetime of the process.
>> 
>> Does that ring any bells?  Is there anything I can do to provide more 
>> data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
>> any diagnostics tool which is provided by Fedora.
>
> There was a recent change to move to 128TB address space by default,
> and option for 512TB addresses if explicitly requested.
>
> Your brk request asked for > 128TB which the kernel gave it, but the
> address limit in the paca that the SLB miss tests against was not
> updated to reflect the switch to 512TB address space.

We should not return that address, unless we requested with a hint value
of > 128TB. IIRC we discussed this early during the mmap interface
change and said, we will return an address > 128T only if the hint
address is above 128TB (not hint addr + length). I am not sure why
we are finding us returning and address > 128TB with paca limit set to
128TB?


>
> Why is your brk starting so high? Are you trying to test the > 128TB
> case, or maybe something is confused by the 64->128TB change? What's
> the strace look like if you run on a distro or <= 4.10 kernel?
>
> Something like the following patch may help if you could test.
>
> Thanks,
> Nick
>

-aneesh



Re: [RFC PATCH kernel] powerpc/pci: Get rid of unused @parent pointer in pci_controller

2017-11-05 Thread Alexey Kardashevskiy
On 06/11/17 14:33, Benjamin Herrenschmidt wrote:
> On Mon, 2017-11-06 at 14:24 +1100, Alexey Kardashevskiy wrote:
>> The @parent pointer is supposed to point to a device which represents
>> a PCI controller, however it is never set to anything and remains NULL;
>> it is also quite common to pass NULL to pci_create_root_bus().
>>
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
> 
> It's used for anything using of_pci_phb_probe() such as Cell.


ah, ok, thanks.

> 
> 
>>
>> This @parent is NULL for every PHB in garrison machines, for example.
>> Where would this pointer be really used?
>>
>> I wonder how does setting it make any difference in sysfs for CXL vphb?
>>
>> In general, I'd think that @parent makes sense and every PHB
>> has a parent device struct like any other device and the platform
>> code binds a correct device driver to a PHB and we can avoid
>> having PNV_PHB_IODA2 switches but for some reason PHBs are
>> hardcoded in the platform code.
>>
>> ---
>>  arch/powerpc/include/asm/pci-bridge.h | 1 -
>>  arch/powerpc/kernel/pci-common.c  | 2 +-
>>  drivers/misc/cxl/vphb.c   | 3 ---
>>  3 files changed, 1 insertion(+), 5 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/pci-bridge.h 
>> b/arch/powerpc/include/asm/pci-bridge.h
>> index 752718a2949d..ad88ddeaa6b7 100644
>> --- a/arch/powerpc/include/asm/pci-bridge.h
>> +++ b/arch/powerpc/include/asm/pci-bridge.h
>> @@ -60,7 +60,6 @@ struct pci_controller {
>>  #endif
>>  struct device_node *dn;
>>  struct list_head list_node;
>> -struct device *parent;
>>  
>>  int first_busno;
>>  int last_busno;
>> diff --git a/arch/powerpc/kernel/pci-common.c 
>> b/arch/powerpc/kernel/pci-common.c
>> index 02831a396419..597576777c34 100644
>> --- a/arch/powerpc/kernel/pci-common.c
>> +++ b/arch/powerpc/kernel/pci-common.c
>> @@ -1681,7 +1681,7 @@ void pcibios_scan_phb(struct pci_controller *hose)
>>  pci_add_resource(, >busn);
>>  
>>  /* Create an empty bus for the toplevel */
>> -bus = pci_create_root_bus(hose->parent, hose->first_busno,
>> +bus = pci_create_root_bus(NULL, hose->first_busno,
>>hose->ops, hose, );
>>  if (bus == NULL) {
>>  pr_err("Failed to create bus for PCI domain %04x\n",
>> diff --git a/drivers/misc/cxl/vphb.c b/drivers/misc/cxl/vphb.c
>> index 512a4897dbf6..ae3fe1563812 100644
>> --- a/drivers/misc/cxl/vphb.c
>> +++ b/drivers/misc/cxl/vphb.c
>> @@ -232,9 +232,6 @@ int cxl_pci_vphb_add(struct cxl_afu *afu)
>>  if (!phb)
>>  return -ENODEV;
>>  
>> -/* Setup parent in sysfs */
>> -phb->parent = parent;
>> -
>>  /* Setup the PHB using arch provided callback */
>>  phb->ops = _pcie_pci_ops;
>>  phb->cfg_addr = NULL;


-- 
Alexey


Re: [RFC PATCH kernel] powerpc/pci: Get rid of unused @parent pointer in pci_controller

2017-11-05 Thread Benjamin Herrenschmidt
On Mon, 2017-11-06 at 14:24 +1100, Alexey Kardashevskiy wrote:
> The @parent pointer is supposed to point to a device which represents
> a PCI controller, however it is never set to anything and remains NULL;
> it is also quite common to pass NULL to pci_create_root_bus().
> 
> Signed-off-by: Alexey Kardashevskiy 
> ---

It's used for anything using of_pci_phb_probe() such as Cell.


> 
> This @parent is NULL for every PHB in garrison machines, for example.
> Where would this pointer be really used?
> 
> I wonder how does setting it make any difference in sysfs for CXL vphb?
> 
> In general, I'd think that @parent makes sense and every PHB
> has a parent device struct like any other device and the platform
> code binds a correct device driver to a PHB and we can avoid
> having PNV_PHB_IODA2 switches but for some reason PHBs are
> hardcoded in the platform code.
> 
> ---
>  arch/powerpc/include/asm/pci-bridge.h | 1 -
>  arch/powerpc/kernel/pci-common.c  | 2 +-
>  drivers/misc/cxl/vphb.c   | 3 ---
>  3 files changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/pci-bridge.h 
> b/arch/powerpc/include/asm/pci-bridge.h
> index 752718a2949d..ad88ddeaa6b7 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -60,7 +60,6 @@ struct pci_controller {
>  #endif
>   struct device_node *dn;
>   struct list_head list_node;
> - struct device *parent;
>  
>   int first_busno;
>   int last_busno;
> diff --git a/arch/powerpc/kernel/pci-common.c 
> b/arch/powerpc/kernel/pci-common.c
> index 02831a396419..597576777c34 100644
> --- a/arch/powerpc/kernel/pci-common.c
> +++ b/arch/powerpc/kernel/pci-common.c
> @@ -1681,7 +1681,7 @@ void pcibios_scan_phb(struct pci_controller *hose)
>   pci_add_resource(, >busn);
>  
>   /* Create an empty bus for the toplevel */
> - bus = pci_create_root_bus(hose->parent, hose->first_busno,
> + bus = pci_create_root_bus(NULL, hose->first_busno,
> hose->ops, hose, );
>   if (bus == NULL) {
>   pr_err("Failed to create bus for PCI domain %04x\n",
> diff --git a/drivers/misc/cxl/vphb.c b/drivers/misc/cxl/vphb.c
> index 512a4897dbf6..ae3fe1563812 100644
> --- a/drivers/misc/cxl/vphb.c
> +++ b/drivers/misc/cxl/vphb.c
> @@ -232,9 +232,6 @@ int cxl_pci_vphb_add(struct cxl_afu *afu)
>   if (!phb)
>   return -ENODEV;
>  
> - /* Setup parent in sysfs */
> - phb->parent = parent;
> -
>   /* Setup the PHB using arch provided callback */
>   phb->ops = _pcie_pci_ops;
>   phb->cfg_addr = NULL;


Re: [PATCH] powerpc: eeh: stop using do_gettimeofday()

2017-11-05 Thread Russell Currey
On Sat, 2017-11-04 at 22:26 +0100, Arnd Bergmann wrote:
> This interface is inefficient and deprecated because of the y2038
> overflow.
> 
> ktime_get_seconds() is an appropriate replacement here, since it
> has sufficient granularity but is more efficient and uses monotonic
> time.
> 
> Signed-off-by: Arnd Bergmann 

Acked-by: Russell Currey 


Re: [PATCH] powerpc: eeh: stop using do_gettimeofday()

2017-11-05 Thread Andrew Donnellan

On 05/11/17 08:26, Arnd Bergmann wrote:

This interface is inefficient and deprecated because of the y2038
overflow.

ktime_get_seconds() is an appropriate replacement here, since it
has sufficient granularity but is more efficient and uses monotonic
time.

Signed-off-by: Arnd Bergmann 


Reviewed-by: Andrew Donnellan 


---
  arch/powerpc/include/asm/eeh.h   | 2 +-
  arch/powerpc/kernel/eeh_driver.c | 2 +-
  arch/powerpc/kernel/eeh_pe.c | 8 
  3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
index cd1df96335e5..5161c37dd039 100644
--- a/arch/powerpc/include/asm/eeh.h
+++ b/arch/powerpc/include/asm/eeh.h
@@ -93,7 +93,7 @@ struct eeh_pe {
struct pci_bus *bus;/* Top PCI bus for bus PE   */
int check_count;/* Times of ignored error   */
int freeze_count;   /* Times of froze up*/
-   struct timeval tstamp;  /* Time on first-time freeze*/
+   time64_t tstamp;/* Time on first-time freeze*/
int false_positives;/* Times of reported #ff's  */
atomic_t pass_dev_cnt;  /* Count of passed through devs */
struct eeh_pe *parent;  /* Parent PE*/
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 4e1b433f6cb5..4f71e4c9beb7 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -623,7 +623,7 @@ static int eeh_reset_device(struct eeh_pe *pe, struct 
pci_bus *bus,
struct eeh_rmv_data *rmv_data)
  {
struct pci_bus *frozen_bus = eeh_pe_bus_get(pe);
-   struct timeval tstamp;
+   time64_t tstamp;
int cnt, rc;
struct eeh_dev *edev;
  
diff --git a/arch/powerpc/kernel/eeh_pe.c b/arch/powerpc/kernel/eeh_pe.c

index 2e8d1b2b5af4..2d4956e97aa9 100644
--- a/arch/powerpc/kernel/eeh_pe.c
+++ b/arch/powerpc/kernel/eeh_pe.c
@@ -526,16 +526,16 @@ int eeh_rmv_from_parent_pe(struct eeh_dev *edev)
   */
  void eeh_pe_update_time_stamp(struct eeh_pe *pe)
  {
-   struct timeval tstamp;
+   time64_t tstamp;
  
  	if (!pe) return;
  
  	if (pe->freeze_count <= 0) {

pe->freeze_count = 0;
-   do_gettimeofday(>tstamp);
+   pe->tstamp = ktime_get_seconds();
} else {
-   do_gettimeofday();
-   if (tstamp.tv_sec - pe->tstamp.tv_sec > 3600) {
+   tstamp = ktime_get_seconds();
+   if (tstamp - pe->tstamp > 3600) {
pe->tstamp = tstamp;
pe->freeze_count = 0;
}



--
Andrew Donnellan  OzLabs, ADL Canberra
andrew.donnel...@au1.ibm.com  IBM Australia Limited



[RFC PATCH kernel] powerpc/pci: Get rid of unused @parent pointer in pci_controller

2017-11-05 Thread Alexey Kardashevskiy
The @parent pointer is supposed to point to a device which represents
a PCI controller, however it is never set to anything and remains NULL;
it is also quite common to pass NULL to pci_create_root_bus().

Signed-off-by: Alexey Kardashevskiy 
---

This @parent is NULL for every PHB in garrison machines, for example.
Where would this pointer be really used?

I wonder how does setting it make any difference in sysfs for CXL vphb?

In general, I'd think that @parent makes sense and every PHB
has a parent device struct like any other device and the platform
code binds a correct device driver to a PHB and we can avoid
having PNV_PHB_IODA2 switches but for some reason PHBs are
hardcoded in the platform code.

---
 arch/powerpc/include/asm/pci-bridge.h | 1 -
 arch/powerpc/kernel/pci-common.c  | 2 +-
 drivers/misc/cxl/vphb.c   | 3 ---
 3 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/pci-bridge.h 
b/arch/powerpc/include/asm/pci-bridge.h
index 752718a2949d..ad88ddeaa6b7 100644
--- a/arch/powerpc/include/asm/pci-bridge.h
+++ b/arch/powerpc/include/asm/pci-bridge.h
@@ -60,7 +60,6 @@ struct pci_controller {
 #endif
struct device_node *dn;
struct list_head list_node;
-   struct device *parent;
 
int first_busno;
int last_busno;
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 02831a396419..597576777c34 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1681,7 +1681,7 @@ void pcibios_scan_phb(struct pci_controller *hose)
pci_add_resource(, >busn);
 
/* Create an empty bus for the toplevel */
-   bus = pci_create_root_bus(hose->parent, hose->first_busno,
+   bus = pci_create_root_bus(NULL, hose->first_busno,
  hose->ops, hose, );
if (bus == NULL) {
pr_err("Failed to create bus for PCI domain %04x\n",
diff --git a/drivers/misc/cxl/vphb.c b/drivers/misc/cxl/vphb.c
index 512a4897dbf6..ae3fe1563812 100644
--- a/drivers/misc/cxl/vphb.c
+++ b/drivers/misc/cxl/vphb.c
@@ -232,9 +232,6 @@ int cxl_pci_vphb_add(struct cxl_afu *afu)
if (!phb)
return -ENODEV;
 
-   /* Setup parent in sysfs */
-   phb->parent = parent;
-
/* Setup the PHB using arch provided callback */
phb->ops = _pcie_pci_ops;
phb->cfg_addr = NULL;
-- 
2.11.0



Re: [PATCH] selftests/powerpc: Check FP/VEC on exception in TM

2017-11-05 Thread Cyril Bur
On Fri, 2017-11-03 at 10:28 -0200, Gustavo Romero wrote:
> Hi Cyril!
> 
> On 01-11-2017 20:10, Cyril Bur wrote:
> > Thanks Gustavo,
> > 
> > I do have one more thought on an improvement for this test which is
> > that:
> > +   /* Counter for busy wait *
> > +   uint64_t counter = 0x1ff00;
> > is a bit fragile, what we should do is have the test work out long it
> > should spin until it reliably gets a TM_CAUSE_FAC_UNAV failure and then
> > use that for these tests.
> > 
> > This will only become a problem if we were to change kernel heuristics
> > which is fine for now. I'll try to get that added soon but for now this
> > test has proven too useful to delay adding as is.
> 
> I see. Yup, 'counter' value was indeed determined experimentally under many
> different scenarios (VM and BM, different CPU loads, etc). At least if the
> heuristics changes hurting the test it will catch that pointing out that
> the expected failure did not happen, like:
> 
> Checking if FP/VEC registers are sane after a FP unavailable exception...
> If MSR.FP=0 MSR.VEC=0:
> Expecting the transaction to fail, but it didn't
> FP ok VEC ok
> ...
> 
> So it won't let the hurting change pass fine silently :-)
> 

Yeah, all for merging as is.

It would be nice so that when someone does  make a heuristic change
they don't also have to go fix tests - there is nothing more annoying
than a fragile test suite.

> 
> > > Signed-off-by: Gustavo Romero 
> > > Signed-off-by: Breno Leitao 
> > > Signed-off-by: Cyril Bur 
> 
> Thanks a lot for reviewing it.
> 
> Cheers,
> Gustavo
> 


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-05 Thread Florian Weimer

On 11/05/2017 01:18 PM, Nicholas Piggin wrote:

Something like the following patch may help if you could test.


The patch appears to fix it:

# /lib64/ld64.so.1 ./a.out
initial brk value: 0x7fffe459
probing at 0x8001fffc

I used the follow simplified reproducer:

#include 
#include 
#include 
#include 
#include 

int
main (void)
{
  errno = 0;
  void *p = sbrk (0);
  if (errno != 0)
err (1, "sbrk (0)");
  printf ("initial brk value: %p\n", p);
  unsigned long long target = 0x8002ULL;
  if ((uintptr_t) p >= target)
errx (1, "initial brk value is already above target");
  unsigned long long increment = target - (uintptr_t) p;
  errno = 0;
  sbrk (increment);
  if (errno != 0)
err (1, "sbrk (0x%llx)", increment);
  volatile int *pi = (volatile int *) (target - 4);
  printf ("probing at %p\n", pi);
  *pi = 1;
}


It is still probabilistic because if the increment is too large, the 
second sbrk call will fail with an out of memory error (which is 
expected), so you'll have to run it a couple of times.


If the test fails, the write at the will segfault.

Thanks,
Florian


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-05 Thread Nicholas Piggin
On Sun, 5 Nov 2017 13:35:40 +0100
Florian Weimer  wrote:

> On 11/05/2017 01:18 PM, Nicholas Piggin wrote:
> 
> > There was a recent change to move to 128TB address space by default,
> > and option for 512TB addresses if explicitly requested.  
> 
> Do you have a commit hash for the introduction of 128TB by default?  Thanks.

I guess this one

f6eedbba7a26 ("powerpc/mm/hash: Increase VA range to 128TB")

> 
> > Your brk request asked for > 128TB which the kernel gave it, but the
> > address limit in the paca that the SLB miss tests against was not
> > updated to reflect the switch to 512TB address space.
> > 
> > Why is your brk starting so high? Are you trying to test the > 128TB
> > case, or maybe something is confused by the 64->128TB change? What's
> > the strace look like if you run on a distro or <= 4.10 kernel?  
> 
> I think it is a consequence of running with an explicit loader 
> invocation.  With that, the heap is placed above ld.so, which can be 
> quite high in the address space.
> 
> I'm attaching two runs of cat, one executing directly as /bin/cat, and 
> one with /lib64/ld64.so.1 /bin/cat.
> 
> Fortunately, this does *not* apply to PIE binaries (also attached). 
> However, explicit loader invocations are sometimes used in test suites 
> (not just for glibc), and these sporadic test failures are quite annoying.
> 
> Do you still need the strace log?  And if yes, of what exactly?

Thanks, that should be quite helpful. I'll spend a bit more time to
study it, I'll let you know if I need any other traces.

> 
> > Something like the following patch may help if you could test.  
> 
> Okay, this will take some time.

It's no rush, there will probably be a revision to come.

Thanks,
Nick


Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-05 Thread Florian Weimer

On 11/05/2017 01:18 PM, Nicholas Piggin wrote:


There was a recent change to move to 128TB address space by default,
and option for 512TB addresses if explicitly requested.


Do you have a commit hash for the introduction of 128TB by default?  Thanks.


Your brk request asked for > 128TB which the kernel gave it, but the
address limit in the paca that the SLB miss tests against was not
updated to reflect the switch to 512TB address space.

Why is your brk starting so high? Are you trying to test the > 128TB
case, or maybe something is confused by the 64->128TB change? What's
the strace look like if you run on a distro or <= 4.10 kernel?


I think it is a consequence of running with an explicit loader 
invocation.  With that, the heap is placed above ld.so, which can be 
quite high in the address space.


I'm attaching two runs of cat, one executing directly as /bin/cat, and 
one with /lib64/ld64.so.1 /bin/cat.


Fortunately, this does *not* apply to PIE binaries (also attached). 
However, explicit loader invocations are sometimes used in test suites 
(not just for glibc), and these sporadic test failures are quite annoying.


Do you still need the strace log?  And if yes, of what exactly?


Something like the following patch may help if you could test.


Okay, this will take some time.

Thanks,
Florian
1231d-1231e r-xp  fd:00 17852425 
/root/a.out
1231e-1231f r--p  fd:00 17852425 
/root/a.out
1231f-12320 rw-p 0001 fd:00 17852425 
/root/a.out
1000dbc-1000dbf rw-p  00:00 0[heap]
7fffa31d-7fffa340 r-xp  fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fffa340-7fffa341 ---p 0023 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fffa341-7fffa342 r--p 0023 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fffa342-7fffa343 rw-p 0024 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fffa344-7fffa346 r-xp  00:00 0  [vdso]
7fffa346-7fffa34a r-xp  fd:00 8390329
/usr/lib64/ld-2.25.so
7fffa34a-7fffa34b r--p 0003 fd:00 8390329
/usr/lib64/ld-2.25.so
7fffa34b-7fffa34c rw-p 0004 fd:00 8390329
/usr/lib64/ld-2.25.so
7fffe945-7fffe948 rw-p  00:00 0  [stack]
7fff7e79-7fff7e7d rw-p  00:00 0 
7fff7e7d-7fff7e83 r--p  fd:00 25167925   
/usr/lib/locale/en_US.utf8/LC_CTYPE
7fff7e83-7fff7e84 r--p  fd:00 25167928   
/usr/lib/locale/en_US.utf8/LC_NUMERIC
7fff7e84-7fff7e85 r--p  fd:00 16798929   
/usr/lib/locale/en_US.utf8/LC_TIME
7fff7e85-7fff7e98 r--p  fd:00 25167924   
/usr/lib/locale/en_US.utf8/LC_COLLATE
7fff7e98-7fff7e99 r--p  fd:00 16798927   
/usr/lib/locale/en_US.utf8/LC_MONETARY
7fff7e99-7fff7e9a r--p  fd:00 2511   
/usr/lib/locale/en_US.utf8/LC_MESSAGES/SYS_LC_MESSAGES
7fff7e9a-7fff7e9b r--p  fd:00 16798942   
/usr/lib/locale/en_US.utf8/LC_PAPER
7fff7e9b-7fff7e9c r--p  fd:00 25167927   
/usr/lib/locale/en_US.utf8/LC_NAME
7fff7e9c-7fff7e9d r--p  fd:00 16798924   
/usr/lib/locale/en_US.utf8/LC_ADDRESS
7fff7e9d-7fff7e9e r--p  fd:00 16798928   
/usr/lib/locale/en_US.utf8/LC_TELEPHONE
7fff7e9e-7fff7e9f r--p  fd:00 16798926   
/usr/lib/locale/en_US.utf8/LC_MEASUREMENT
7fff7e9f-7fff7ea0 r--s  fd:00 8390669
/usr/lib64/gconv/gconv-modules.cache
7fff7ea0-7fff7ec3 r-xp  fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fff7ec3-7fff7ec4 ---p 0023 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fff7ec4-7fff7ec5 r--p 0023 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fff7ec5-7fff7ec6 rw-p 0024 fd:00 25167936   
/usr/lib64/power8/libc-2.25.so
7fff7ec6-7fff7ec7 r--p  fd:00 16798925   
/usr/lib/locale/en_US.utf8/LC_IDENTIFICATION
7fff7ec7-7fff7ec8 r-xp  fd:00 202293 
/usr/bin/cat
7fff7ec8-7fff7ec9 r--p  fd:00 202293 
/usr/bin/cat
7fff7ec9-7fff7eca rw-p 0001 fd:00 202293 
/usr/bin/cat
7fff7eca-7fff7ecc r-xp  00:00 0  [vdso]
7fff7ecc-7fff7ed0 r-xp  fd:00 8390329
/usr/lib64/ld-2.25.so
7fff7ed0-7fff7ed1 r--p 0003 fd:00 8390329

[PATCH] KVM: PPC: Book3S HV: Handle host system reset in guest mode

2017-11-05 Thread Nicholas Piggin
If the host takes a system reset interrupt while a guest is running,
the CPU must exit the guest before processing the host exception
handler.

After this patch, taking a sysrq+x with a CPU running in a guest
gives a trace like this:

   cpu 0x27: Vector: 100 (System Reset) at [c00fdf5776f0]
   pc: c00810158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv]
   lr: c00810158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv]
   sp: c00fdf577850
  msr: 92803033
 current = 0xc00fdf4b1e00
 paca= 0xcfd4d680softe: 3irq_happened: 0x01
   pid   = 6608, comm = qemu-system-ppc
   Linux version 4.14.0-rc7-01489-g47e1893a404a-dirty #26 SMP
   [c00fdf577a00] c00810159dd4 kvmppc_vcpu_run_hv+0x3dc/0x12d0 [kvm_hv]
   [c00fdf577b30] c008100a537c kvmppc_vcpu_run+0x44/0x60 [kvm]
   [c00fdf577b60] c008100a1ae0 kvm_arch_vcpu_ioctl_run+0x118/0x310 [kvm]
   [c00fdf577c00] c00810093e98 kvm_vcpu_ioctl+0x530/0x7c0 [kvm]
   [c00fdf577d50] c0357bf8 do_vfs_ioctl+0xd8/0x8c0
   [c00fdf577df0] c0358448 SyS_ioctl+0x68/0x100
   [c00fdf577e30] c000b220 system_call+0x58/0x6c
   --- Exception: c01 (System Call) at 7fff76868df0
   SP (7fff7069baf0) is in userspace

Fixes: e36d0a2ed5 ("powerpc/powernv: Implement NMI IPI with 
OPAL_SIGNAL_SYSTEM_RESET")
Signed-off-by: Nicholas Piggin 
--

It has always been possible to sreset the host with direct scom
access, but the patch e36d0a2ed5 has significantly expanded this
functionality so in practice this is a required as a fix for it.

Since RFC:
- Removed the last hunk as sugggested by Paul.
- Re-tested.

Thanks,
Nick
---
 arch/powerpc/include/asm/hw_irq.h| 1 +
 arch/powerpc/kernel/exceptions-64s.S | 2 ++
 arch/powerpc/kernel/irq.c| 3 ++-
 arch/powerpc/kvm/book3s_hv.c | 7 ++-
 4 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/hw_irq.h 
b/arch/powerpc/include/asm/hw_irq.h
index 92a3e9a79cb4..a8bbac425ae6 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -40,6 +40,7 @@
 
 #ifndef __ASSEMBLY__
 
+extern void replay_system_reset(void);
 extern void __replay_interrupt(unsigned int vector);
 
 extern void timer_interrupt(struct pt_regs *);
diff --git a/arch/powerpc/kernel/exceptions-64s.S 
b/arch/powerpc/kernel/exceptions-64s.S
index 651e1a0114ed..bff2ed6e3c3c 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -113,6 +113,7 @@ EXC_VIRT_NONE(0x4000, 0x100)
cmpwi   cr3,r10,2 ; \
BRANCH_TO_C000(r10, system_reset_idle_common) ; \
 1: \
+   KVMTEST_PR(n) ; \
END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)
 #else
 #define IDLETEST NOTEST
@@ -129,6 +130,7 @@ EXC_REAL_BEGIN(system_reset, 0x100, 0x100)
 
 EXC_REAL_END(system_reset, 0x100, 0x100)
 EXC_VIRT_NONE(0x4100, 0x100)
+TRAMP_KVM(PACA_EXNMI, 0x100)
 
 #ifdef CONFIG_PPC_P7_NAP
 EXC_COMMON_BEGIN(system_reset_idle_common)
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index efbadcbbf694..7e8259106944 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -437,7 +437,7 @@ static const u8 srr1_to_lazyirq[0x10] = {
PACA_IRQ_HMI,
0, 0, 0, 0, 0 };
 
-static noinline void replay_system_reset(void)
+void replay_system_reset(void)
 {
struct pt_regs regs;
 
@@ -447,6 +447,7 @@ static noinline void replay_system_reset(void)
system_reset_exception();
get_paca()->in_nmi = 0;
 }
+EXPORT_SYMBOL_GPL(replay_system_reset);
 
 void irq_set_pending_from_srr1(unsigned long srr1)
 {
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 8f34715cfbff..31a362669fea 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -47,6 +47,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1089,9 +1090,10 @@ static int kvmppc_handle_exit_hv(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
vcpu->stat.ext_intr_exits++;
r = RESUME_GUEST;
break;
-   /* HMI is hypervisor interrupt and host has handled it. Resume guest.*/
+   /* SR/HMI/PMI are HV interrupts that host has handled. Resume guest.*/
case BOOK3S_INTERRUPT_HMI:
case BOOK3S_INTERRUPT_PERFMON:
+   case BOOK3S_INTERRUPT_SYSTEM_RESET:
r = RESUME_GUEST;
break;
case BOOK3S_INTERRUPT_MACHINE_CHECK:
@@ -2604,6 +2606,9 @@ static void set_irq_happened(int trap)
case BOOK3S_INTERRUPT_HMI:
local_paca->irq_happened |= PACA_IRQ_HMI;
break;
+   case BOOK3S_INTERRUPT_SYSTEM_RESET:
+   replay_system_reset();
+   

Re: POWER: Unexpected fault when writing to brk-allocated memory

2017-11-05 Thread Nicholas Piggin
On Fri, 3 Nov 2017 18:05:20 +0100
Florian Weimer  wrote:

> We are seeing an issue on ppc64le and ppc64 (and perhaps on some arm 
> variant, but I have not seen it on our own builders) where running 
> localedef as part of the glibc build crashes with a segmentation fault.
> 
> Kernel version is 4.13.9 (Fedora 26 variant).
> 
> I have only seen this with an explicit loader invocation, like this:
> 
> while I18NPATH=. /lib64/ld64.so.1 /usr/bin/localedef 
> --alias-file=../intl/locale.alias --no-archive -i locales/nl_AW -c -f 
> charmaps/UTF-8 
> --prefix=/builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64 nl_AW ; do : 
> ; done
> 
> To be run in the localedata subdirectory of a glibc *source* tree, after 
> a build.  You may have to create the 
> /builddir/build/BUILDROOT/glibc-2.26-16.fc27.ppc64/usr/lib/locale 
> directory.  I have only reproduced this inside a Fedora 27 chroot on a 
> Fedora 26 host, but there it does not matter if you run the old (chroot) 
> or newly built binary.
> 
> I filed this as a glibc bug for tracking:
> 
>https://sourceware.org/bugzilla/show_bug.cgi?id=22390
> 
> There's an strace log and a coredump from the crash.
> 
> I think the data shows that the address in question should be writable.
> 
> The crossed 0x8000 binary is very suggestive.  I think that 
> based on the operation of glibc's malloc, this write would be the first 
> time this happens during the lifetime of the process.
> 
> Does that ring any bells?  Is there anything I can do to provide more 
> data?  The host is an LPAR with a stock Fedora 26 kernel, so I can use 
> any diagnostics tool which is provided by Fedora.

There was a recent change to move to 128TB address space by default,
and option for 512TB addresses if explicitly requested.

Your brk request asked for > 128TB which the kernel gave it, but the
address limit in the paca that the SLB miss tests against was not
updated to reflect the switch to 512TB address space.

Why is your brk starting so high? Are you trying to test the > 128TB
case, or maybe something is confused by the 64->128TB change? What's
the strace look like if you run on a distro or <= 4.10 kernel?

Something like the following patch may help if you could test.

Thanks,
Nick

---
 arch/powerpc/mm/hugetlbpage-radix.c| 18 ++
 arch/powerpc/mm/mmap.c | 34 +-
 arch/powerpc/mm/mmu_context_book3s64.c | 14 +++---
 arch/powerpc/mm/slice.c| 28 +++-
 4 files changed, 65 insertions(+), 29 deletions(-)

diff --git a/arch/powerpc/mm/hugetlbpage-radix.c 
b/arch/powerpc/mm/hugetlbpage-radix.c
index a12e86395025..44e1109765b5 100644
--- a/arch/powerpc/mm/hugetlbpage-radix.c
+++ b/arch/powerpc/mm/hugetlbpage-radix.c
@@ -50,8 +50,16 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned 
long addr,
struct hstate *h = hstate_file(file);
struct vm_unmapped_area_info info;
 
-   if (unlikely(addr > mm->context.addr_limit && addr < TASK_SIZE))
-   mm->context.addr_limit = TASK_SIZE;
+   /*
+* If address is specified explicitly and crosses addr_limit, or if
+* address is unspecified but len is greater than addr_limit, then
+* expand out to TASK_SIZE.
+*/
+   if (unlikely(addr + len >= mm->context.addr_limit)) {
+   if ((!addr || addr + len > mm->context.addr_limit) &&
+   mm->context.addr_limit != TASK_SIZE)
+   mm->context.addr_limit = TASK_SIZE;
+   }
 
if (len & ~huge_page_mask(h))
return -EINVAL;
@@ -82,8 +90,10 @@ radix__hugetlb_get_unmapped_area(struct file *file, unsigned 
long addr,
info.align_mask = PAGE_MASK & ~huge_page_mask(h);
info.align_offset = 0;
 
-   if (addr > DEFAULT_MAP_WINDOW)
-   info.high_limit += mm->context.addr_limit - DEFAULT_MAP_WINDOW;
+   if (addr + len >= DEFAULT_MAP_WINDOW) {
+   if (!addr || addr + len > DEFAULT_MAP_WINDOW)
+   info.high_limit += mm->context.addr_limit - 
DEFAULT_MAP_WINDOW;
+   }
 
return vm_unmapped_area();
 }
diff --git a/arch/powerpc/mm/mmap.c b/arch/powerpc/mm/mmap.c
index 5d78b193fec4..a8fe1eaf1d96 100644
--- a/arch/powerpc/mm/mmap.c
+++ b/arch/powerpc/mm/mmap.c
@@ -108,9 +108,16 @@ radix__arch_get_unmapped_area(struct file *filp, unsigned 
long addr,
struct vm_area_struct *vma;
struct vm_unmapped_area_info info;
 
-   if (unlikely(addr > mm->context.addr_limit &&
-mm->context.addr_limit != TASK_SIZE))
-   mm->context.addr_limit = TASK_SIZE;
+   /*
+* If address is specified explicitly and crosses addr_limit, or if
+* address is unspecified but len is greater than addr_limit, then
+* expand out to TASK_SIZE.
+*/
+   if (unlikely(addr + len >= mm->context.addr_limit)) {
+