Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Andy Lutomirski
On Mon, Mar 13, 2017 at 7:07 PM, Till Smejkal
 wrote:
> On Mon, 13 Mar 2017, Andy Lutomirski wrote:
>> This sounds rather complicated.  Getting TLB flushing right seems
>> tricky.  Why not just map the same thing into multiple mms?
>
> This is exactly what happens at the end. The memory region that is described by
> the VAS segment will be mapped in the ASes that use the segment.

So why is this kernel feature better than just doing MAP_SHARED
manually in userspace?
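For reference, what "just doing MAP_SHARED manually" can look like in userspace,
as a minimal sketch (object name and size purely illustrative):

/* Illustrative only: share one region between processes with plain
 * mmap(MAP_SHARED); any process opening the same name sees the same pages. */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 1 << 20;				/* 1 MiB segment */
	int fd = shm_open("/shared_segment_demo", O_CREAT | O_RDWR, 0600);

	if (fd < 0 || ftruncate(fd, len) < 0)
		exit(1);

	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		exit(1);

	p[0] = 42;					/* visible to every mapper */

	munmap(p, len);
	close(fd);
	return 0;
}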


>> Ick.  Please don't do this.  Can we please keep an mm as just an mm
>> and not make it look magically different depending on which process
>> maps it?  If you need a trampoline (which you do, of course), just
>> write a trampoline in regular user code and map it manually.
>
> Did I understand you correctly that you are proposing that the switching thread
> should make sure by itself that its code, stack, … memory regions are properly
> set up in the new AS before/after switching into it? I think this would make
> using first class virtual address spaces much more difficult for user
> applications, to the extent that I am not even sure if they can be used at all.
> At the moment, switching into a VAS is a very simple operation for an
> application because the kernel will simply do the right thing.

Yes.  I think that having the same mm_struct look different from
different tasks is problematic.  Getting it right in the arch code is
going to be nasty.  The heuristics of what to share are also tough --
why would text + data + stack or whatever you're doing be adequate?
What if you're in a thread?  What if two tasks have their stacks in
the same place?

I could imagine that something like a sigaltstack() mode that lets you set
a signal up to also switch mm could be useful.
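Today sigaltstack() only selects an alternate signal stack; as a sketch, current
usage looks like this, with a comment marking where such a hypothetical
mm-switching flag would have to go:

/* Standard sigaltstack() setup as it exists today. A hypothetical
 * "also switch mm on signal delivery" mode would presumably be an extra
 * ss_flags bit; no such flag exists in the kernel at the moment. */
#include <signal.h>
#include <stdlib.h>

static void setup_alt_stack(void)
{
	stack_t ss;

	ss.ss_sp = malloc(SIGSTKSZ);
	ss.ss_size = SIGSTKSZ;
	ss.ss_flags = 0;	/* a hypothetical SS_SWITCH_MM flag would go here */
	sigaltstack(&ss, NULL);
}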


Re: [RFC][PATCH 0/2] reworking cause_ipi and adding global doorbell support

2017-03-13 Thread Benjamin Herrenschmidt
On Tue, 2017-03-14 at 14:35 +1000, Nicholas Piggin wrote:
> > We might need a sync still between clearing the byte and calling the
> > handler no ? Or at least a smp_wmb() to ensure that the clear is
> > visible before any action of the handler.
> 
> Yes I have exactly that (smp_wmb).
> 
> At first I checked and cleared each byte then did a single smp_wmb, but
> I changed my mind because most of the time the IPI will fire with only
> one message set, so it does not seem like it's worth the extra branches
> to avoid a lwsync in the rare case of 2 messages.

Will lwsync provide transitivity ?

What we care about is that if the handler does something that when observed
by another CPU causes that other CPU to send back an IPI, the write by that
other CPU comes after our clear. I'm not sure if lwsync is enough. We might
need Paul Mck for that :-)

Cheers,
Ben.



Re: [RFC][PATCH 0/2] reworking cause_ipi and adding global doorbell support

2017-03-13 Thread Nicholas Piggin
On Tue, 14 Mar 2017 14:57:20 +1100
Benjamin Herrenschmidt  wrote:

> On Tue, 2017-03-14 at 12:53 +1000, Nicholas Piggin wrote:
> > >   - Load all
> > >   - For each byte if set
> > >  - clear byte
> > >  - then call handler  
> > 
> > Yes. I think that will be okay because we shouldn't get any load-hit-store
> > issues. I'll do some benchmarking anyway.
> 
> We might need a sync still between clearing the byte and calling the
> handler no ? Or at least a smp_wmb() to ensure that the clear is
> visible before any action of the handler.

Yes I have exactly that (smp_wmb).

At first I checked and cleared each byte then did a single smp_wmb, but
I changed my mind because most of the time the IPI will fire with only
one message set, so it does not seem like it's worth the extra branches
to avoid a lwsync in the rare case of 2 messages.
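Roughly, the scheme being discussed looks like this sketch (made-up names,
assuming the usual kernel barrier primitives; not the actual patch):

/* Sketch of the per-byte IPI message demux discussed above: clear the
 * byte, make the clear visible, then run the handler for that message. */
static void ipi_demux_sketch(unsigned char *msgs, int nr_msgs,
			     void (*handlers[])(void))
{
	int i;

	for (i = 0; i < nr_msgs; i++) {
		if (!READ_ONCE(msgs[i]))
			continue;

		WRITE_ONCE(msgs[i], 0);	/* clear byte			      */
		smp_wmb();		/* clear visible before handler acts  */
		handlers[i]();		/* then call handler		      */
	}
}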

Thanks,
Nick


[PATCH 3/3] cxl: Provide user-space access to afu descriptor on bare metal

2017-03-13 Thread Vaibhav Jain
This patch implements the cxl backend to provide user-space access to
the binary afu descriptor contents via sysfs. We add a new member to
struct cxl_afu_native named phy_desc that caches the physical base
address of the afu descriptor, which is then used in the implementation
of the new native cxl backend ops, namely:

* native_afu_desc_size()
* native_afu_desc_read()
* native_afu_desc_mmap()

The implementations of all these callbacks are mostly trivial, except for
native_afu_desc_mmap(), which maps the PFNs of the afu descriptor in I/O
memory into the user-space vm_area_struct.

Signed-off-by: Vaibhav Jain 
---
 drivers/misc/cxl/cxl.h|  3 +++
 drivers/misc/cxl/native.c | 33 +
 drivers/misc/cxl/pci.c|  3 +++
 3 files changed, 39 insertions(+)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 1c43d06..c6db1fa 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -386,6 +386,9 @@ struct cxl_afu_native {
int spa_order;
int spa_max_procs;
u64 pp_offset;
+
+   /* Afu descriptor physical address */
+   u64 phy_desc;
 };
 
 struct cxl_afu_guest {
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 20d3df6..44e3e84 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -1330,6 +1330,36 @@ static ssize_t native_afu_read_err_buffer(struct cxl_afu *afu, char *buf,
return __aligned_memcpy(buf, ebuf, off, count, afu->eb_len);
 }
 
+static ssize_t native_afu_desc_size(struct cxl_afu *afu)
+{
+   return afu->adapter->native->afu_desc_size;
+}
+
+static ssize_t native_afu_desc_read(struct cxl_afu *afu, char *buf, loff_t off,
+size_t count)
+{
+   return __aligned_memcpy(buf, afu->native->afu_desc_mmio, off, count,
+   afu->adapter->native->afu_desc_size);
+}
+
+static int native_afu_desc_mmap(struct cxl_afu *afu, struct file *filp,
+struct vm_area_struct *vma)
+{
+   u64 len = vma->vm_end - vma->vm_start;
+
+   /* Check the vma size so that it doesn't go beyond the afu descriptor size */
+   if (len > native_afu_desc_size(afu)) {
+   pr_err("Requested VMA too large. Requested=%lld, Available=%ld\n",
+  len, native_afu_desc_size(afu));
+   return -EINVAL;
+   }
+
+   vma->vm_flags |= VM_IO | VM_PFNMAP;
+   vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+   return vm_iomap_memory(vma, afu->native->phy_desc, len);
+}
+
 const struct cxl_backend_ops cxl_native_ops = {
.module = THIS_MODULE,
.adapter_reset = cxl_pci_reset,
@@ -1361,4 +1391,7 @@ const struct cxl_backend_ops cxl_native_ops = {
.afu_cr_write16 = native_afu_cr_write16,
.afu_cr_write32 = native_afu_cr_write32,
.read_adapter_vpd = cxl_pci_read_adapter_vpd,
+   .afu_desc_read = native_afu_desc_read,
+   .afu_desc_mmap = native_afu_desc_mmap,
+   .afu_desc_size =  native_afu_desc_size
 };
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 541dc9a..a6166e0 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -869,6 +869,9 @@ static int pci_map_slice_regs(struct cxl_afu *afu, struct cxl *adapter, struct p
	if (afu_desc) {
		if (!(afu->native->afu_desc_mmio = ioremap(afu_desc, adapter->native->afu_desc_size)))
			goto err2;
+
+   /* Cache the afu descriptor physical address */
+   afu->native->phy_desc = afu_desc;
}
 
return 0;
-- 
2.9.3



[PATCH 2/3] cxl: Introduce afu_desc sysfs attribute

2017-03-13 Thread Vaibhav Jain
This patch introduces a new afu sysfs attribute named afu_desc. This
binary attribute provides user-space access to the raw contents of the
afu descriptor. Direct access to the afu descriptor is useful for
libcxl, which can use it to determine whether the CXL card has been
fenced, or to give applications access to afu attributes beyond those
defined in the CAIA.

We introduce three new backend-ops:

* afu_desc_size(): Return the size in bytes of the afu descriptor.

* afu_desc_read(): Copy the contents of the afu descriptor, starting at
  a specific offset, into a provided buffer.

* afu_desc_mmap(): Memory map the afu descriptor to the given
  vm_area_struct.

If afu_desc_size() > 0, the afu_desc attribute gets created for the AFU.
The bin_attribute callbacks route the calls to the corresponding cxl
backend implementation.

Signed-off-by: Vaibhav Jain 
---
 Documentation/ABI/testing/sysfs-class-cxl |  9 +++
 drivers/misc/cxl/cxl.h|  9 +++
 drivers/misc/cxl/sysfs.c  | 45 +++
 3 files changed, 63 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl
index 640f65e..9ac84c4 100644
--- a/Documentation/ABI/testing/sysfs-class-cxl
+++ b/Documentation/ABI/testing/sysfs-class-cxl
@@ -6,6 +6,15 @@ Example: The real path of the attribute /sys/class/cxl/afu0.0s/irqs_max is
 
 Slave contexts (eg. /sys/class/cxl/afu0.0s):
 
+What:   /sys/class/cxl//afu_desc
+Date:   March 2016
+Contact:linuxppc-dev@lists.ozlabs.org
+Description:read only
+AFU Descriptor contents. The contents of this file are
+   binary contents of the AFU descriptor. LIBCXL library can
+   use this file to read afu descriptor and in some special cases
+   determine if the cxl card has been fenced.
+
 What:   /sys/class/cxl//afu_err_buf
 Date:   September 2014
 Contact:linuxppc-dev@lists.ozlabs.org
diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index ef683b7..1c43d06 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -426,6 +426,9 @@ struct cxl_afu {
u64 eb_len, eb_offset;
struct bin_attribute attr_eb;
 
+   /* Afu descriptor */
+   struct bin_attribute attr_afud;
+
/* pointer to the vphb */
struct pci_controller *phb;
 
@@ -995,6 +998,12 @@ struct cxl_backend_ops {
	int (*afu_cr_write16)(struct cxl_afu *afu, int cr_idx, u64 offset, u16 val);
	int (*afu_cr_write32)(struct cxl_afu *afu, int cr_idx, u64 offset, u32 val);
	ssize_t (*read_adapter_vpd)(struct cxl *adapter, void *buf, size_t count);
+   /* Access to AFU descriptor */
+   ssize_t (*afu_desc_size)(struct cxl_afu *afu);
+   ssize_t (*afu_desc_read)(struct cxl_afu *afu, char *buf, loff_t off,
+size_t count);
+   int (*afu_desc_mmap)(struct cxl_afu *afu, struct file *filp,
+struct vm_area_struct *vma);
 };
 extern const struct cxl_backend_ops cxl_native_ops;
 extern const struct cxl_backend_ops cxl_guest_ops;
diff --git a/drivers/misc/cxl/sysfs.c b/drivers/misc/cxl/sysfs.c
index a8b6d6a..fff3468 100644
--- a/drivers/misc/cxl/sysfs.c
+++ b/drivers/misc/cxl/sysfs.c
@@ -426,6 +426,26 @@ static ssize_t afu_eb_read(struct file *filp, struct kobject *kobj,
return cxl_ops->afu_read_err_buffer(afu, buf, off, count);
 }
 
+static ssize_t afu_desc_read(struct file *filp, struct kobject *kobj,
+struct bin_attribute *bin_attr, char *buf,
+loff_t off, size_t count)
+{
+   struct cxl_afu *afu = to_cxl_afu(kobj_to_dev(kobj));
+
+   return cxl_ops->afu_desc_read ?
+   cxl_ops->afu_desc_read(afu, buf, off, count) : -EIO;
+}
+
+static int afu_desc_mmap(struct file *filp, struct kobject *kobj,
+struct bin_attribute *attr, struct vm_area_struct *vma)
+{
+   struct cxl_afu *afu = to_cxl_afu(kobj_to_dev(kobj));
+
+   return cxl_ops->afu_desc_mmap ?
+   cxl_ops->afu_desc_mmap(afu, filp, vma) : -EINVAL;
+}
+
+
 static struct device_attribute afu_attrs[] = {
__ATTR_RO(mmio_size),
__ATTR_RO(irqs_min),
@@ -625,6 +645,9 @@ void cxl_sysfs_afu_remove(struct cxl_afu *afu)
struct afu_config_record *cr, *tmp;
int i;
 
+   if (afu->attr_afud.size > 0)
device_remove_bin_file(&afu->dev, &afu->attr_afud);
+
/* remove the err buffer bin attribute */
if (afu->eb_len)
device_remove_bin_file(&afu->dev, &afu->attr_eb);
@@ -686,6 +709,28 @@ int cxl_sysfs_afu_add(struct cxl_afu *afu)
list_add(&cr->list, &afu->crs);
}
 
+   /* Create the sysfs binattr for afu-descriptor */
+   afu->attr_afud.size = cxl_ops->afu_desc_size ?
+   cxl_ops->afu_desc_size(afu) : 0;
+
+   if (afu->attr_afud.size > 0) {
+   

[PATCH 1/3] cxl: Re-factor cxl_pci_afu_read_err_buffer()

2017-03-13 Thread Vaibhav Jain
This patch moves, renames and re-factors the function
cxl_pci_afu_read_err_buffer(). The function is moved from pci.c to
native.c and renamed native_afu_read_err_buffer().

Also, the ability to copy data from h/w that enforces 4/8 byte aligned
access is useful and better shared across other functions. So this
patch moves the core logic of the existing cxl_pci_afu_read_err_buffer()
to a new function named __aligned_memcpy(). The new implementation of
native_afu_read_err_buffer() is simply a call to __aligned_memcpy()
with the appropriate actual parameters.

Signed-off-by: Vaibhav Jain 
---
 drivers/misc/cxl/cxl.h|  3 ---
 drivers/misc/cxl/native.c | 56 ++-
 drivers/misc/cxl/pci.c| 44 -
 3 files changed, 55 insertions(+), 48 deletions(-)

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 79e60ec..ef683b7 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -739,9 +739,6 @@ static inline u64 cxl_p2n_read(struct cxl_afu *afu, cxl_p2n_reg_t reg)
return ~0ULL;
 }
 
-ssize_t cxl_pci_afu_read_err_buffer(struct cxl_afu *afu, char *buf,
-   loff_t off, size_t count);
-
 /* Internal functions wrapped in cxl_base to allow PHB to call them */
bool _cxl_pci_associate_default_context(struct pci_dev *dev, struct cxl_afu *afu);
 void _cxl_pci_disable_device(struct pci_dev *dev);
diff --git a/drivers/misc/cxl/native.c b/drivers/misc/cxl/native.c
index 7ae7105..20d3df6 100644
--- a/drivers/misc/cxl/native.c
+++ b/drivers/misc/cxl/native.c
@@ -1276,6 +1276,60 @@ static int native_afu_cr_write8(struct cxl_afu *afu, int cr, u64 off, u8 in)
return rc;
 }
 
+#define ERR_BUFF_MAX_COPY_SIZE PAGE_SIZE
+
+/*
+ * __aligned_memcpy:
+ * Copies count or max_read bytes (whichever is smaller) from src to dst buffer
+ * starting at offset off in src buffer. This specialized implementation of
+ * memcpy_fromio is needed as capi h/w only supports 4/8 bytes aligned access.
+ * So in case the requested offset/count aren't 8 byte aligned the function uses
+ * a bounce buffer which can be max ERR_BUFF_MAX_COPY_SIZE == PAGE_SIZE
+ */
+static ssize_t __aligned_memcpy(void *dst, void __iomem *src, loff_t off,
+  size_t count, size_t max_read)
+{
+   loff_t aligned_start, aligned_end;
+   size_t aligned_length;
+   void *tbuf;
+
+   if (count == 0 || off < 0 || (size_t)off >= max_read)
+   return 0;
+
+   /* calculate aligned read window */
+   count = min((size_t)(max_read - off), count);
+   aligned_start = round_down(off, 8);
+   aligned_end = round_up(off + count, 8);
+   aligned_length = aligned_end - aligned_start;
+
+   /* max we can copy in one read is PAGE_SIZE */
+   if (aligned_length > ERR_BUFF_MAX_COPY_SIZE) {
+   aligned_length = ERR_BUFF_MAX_COPY_SIZE;
+   count = ERR_BUFF_MAX_COPY_SIZE - (off & 0x7);
+   }
+
+   /* use bounce buffer for copy */
+   tbuf = (void *)__get_free_page(GFP_TEMPORARY);
+   if (!tbuf)
+   return -ENOMEM;
+
+   /* perform aligned read from the mmio region */
+   memcpy_fromio(tbuf, src + aligned_start, aligned_length);
+   memcpy(dst, tbuf + (off & 0x7), count);
+
+   free_page((unsigned long)tbuf);
+
+   return count;
+}
+
+static ssize_t native_afu_read_err_buffer(struct cxl_afu *afu, char *buf,
+   loff_t off, size_t count)
+{
+   void __iomem *ebuf = afu->native->afu_desc_mmio + afu->eb_offset;
+
+   return __aligned_memcpy(buf, ebuf, off, count, afu->eb_len);
+}
+
 const struct cxl_backend_ops cxl_native_ops = {
.module = THIS_MODULE,
.adapter_reset = cxl_pci_reset,
@@ -1294,7 +1348,7 @@ const struct cxl_backend_ops cxl_native_ops = {
.support_attributes = native_support_attributes,
.link_ok = cxl_adapter_link_ok,
.release_afu = cxl_pci_release_afu,
-   .afu_read_err_buffer = cxl_pci_afu_read_err_buffer,
+   .afu_read_err_buffer = native_afu_read_err_buffer,
.afu_check_and_enable = native_afu_check_and_enable,
.afu_activate_mode = native_afu_activate_mode,
.afu_deactivate_mode = native_afu_deactivate_mode,
diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c
index 91f6459..541dc9a 100644
--- a/drivers/misc/cxl/pci.c
+++ b/drivers/misc/cxl/pci.c
@@ -1051,50 +1051,6 @@ static int sanitise_afu_regs(struct cxl_afu *afu)
return 0;
 }
 
-#define ERR_BUFF_MAX_COPY_SIZE PAGE_SIZE
-/*
- * afu_eb_read:
- * Called from sysfs and reads the afu error info buffer. The h/w only supports
- * 4/8 bytes aligned access. So in case the requested offset/count arent 8 byte
- * aligned the function uses a bounce buffer which can be max PAGE_SIZE.
- */
-ssize_t cxl_pci_afu_read_err_buffer(struct cxl_afu *afu, char *buf,
-   loff_t off, 

[PATCH 0/3] cxl: Provide user-space r/o access to the AFU descriptor

2017-03-13 Thread Vaibhav Jain
Hi,

This patch-set provides a fix for libcxl github issue #20,
"libcxl doesn't handle a mmio read of 0xfff... (all 1's) correctly", at
https://github.com/ibm-capi/libcxl/issues/20.

The issue arises because libcxl uses mmio read values from the problem-state
area (PSA) to determine whether the card has fenced. When it reads a value of
0xFFs (all 1s) it assumes that the card has fenced and returns an error for the
mmio read. Unfortunately, 0xFFs can be a valid value in the PSA of some AFUs,
and in such cases libcxl incorrectly assumes the card to be fenced and
indicates an error to the caller.

To fix this, the patch-set provides direct, read-only access to the afu
descriptor for libcxl via a binary sysfs attribute named 'afu_desc'. Libcxl can
mmap this attribute into its address space. If an mmio read returns 0xFFs, it
re-reads the afu_desc contents at an offset that is known not to contain
~0x0ULL. If the card is fenced, that offset will also read back as ~0x0ULL, in
which case libcxl reports the fenced card to the caller; otherwise the 0xFFs
was a valid mmio read.
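For illustration only, a rough userspace sketch of that check (path, size and
offset parameters are placeholders; the real code is in the libcxl pull request
referenced further below):

/* Hypothetical sketch of the fence check described above. */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

static int afu_is_fenced(const char *afu_desc_path, size_t desc_size,
			 off_t known_nonff_offset)
{
	int fd = open(afu_desc_path, O_RDONLY);
	if (fd < 0)
		return -1;

	uint64_t *desc = mmap(NULL, desc_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (desc == MAP_FAILED)
		return -1;

	/* A fenced card returns all 1s everywhere; a healthy card has at
	 * least one descriptor word known to differ from ~0x0ULL. */
	int fenced = (desc[known_nonff_offset / 8] == ~0x0ULL);

	munmap(desc, desc_size);
	return fenced;
}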


There are three patches in this series:

- The first patch refactors the function cxl_pci_afu_read_err_buffer() and moves
  its implementation to 'native.c'. It also moves the core logic of the function
  to a new function __aligned_memcpy().

- The second patch introduces a new sysfs attribute named afu_desc to be used by
  libcxl to read the raw contents of the afu descriptor.

- The third patch provides native implementations for the new cxl backend-ops
  introduced in the previous patch. Most importantly it implements the
  afu_desc_mmap backend-op that mmaps the afu descriptor to a given vma.

Additional related changes apart from this patch-set:

- Corresponding libcxl changes are posted as pull request #25,
  "libcxl: Check afu link when read from PSA mmio return all FFs", at
  https://github.com/ibm-capi/libcxl/pull/25

- Patch "kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()"
  which fixes a kernel oops occurring while removing a bin_attribute from sysfs
  at https://patchwork.ozlabs.org/patch/738536/.

Vaibhav Jain (3):
  cxl: Re-factor cxl_pci_afu_read_err_buffer()
  cxl: Introduce afu_desc sysfs attribute
  cxl: Provide user-space access to afu descriptor on bare metal

 Documentation/ABI/testing/sysfs-class-cxl |  9 
 drivers/misc/cxl/cxl.h| 15 --
 drivers/misc/cxl/native.c | 89 ++-
 drivers/misc/cxl/pci.c| 47 ++--
 drivers/misc/cxl/sysfs.c  | 45 
 5 files changed, 157 insertions(+), 48 deletions(-)

-- 
2.9.3



Re: [RFC][PATCH 0/2] reworking cause_ipi and adding global doorbell support

2017-03-13 Thread Benjamin Herrenschmidt
On Tue, 2017-03-14 at 12:53 +1000, Nicholas Piggin wrote:
> >   - Load all
> >   - For each byte if set
> >  - clear byte
> >  - then call handler
> 
> Yes. I think that will be okay because we shouldn't get any load-hit-store
> issues. I'll do some benchmarking anyway.

We might need a sync still between clearing the byte and calling the
handler no ? Or at least a smp_wmb() to ensure that the clear is
visible before any action of the handler.

Cheers,
Ben.



Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
On Mon, 13 Mar 2017, Andy Lutomirski wrote:
> On Mon, Mar 13, 2017 at 3:14 PM, Till Smejkal
>  wrote:
> > This patchset extends the kernel memory management subsystem with a new
> > type of address spaces (called VAS) which can be created and destroyed
> > independently of processes by a user in the system. During its lifetime
> > such a VAS can be attached to processes by the user which allows a process
> > to have multiple address spaces and thereby multiple, potentially
> > different, views on the system's main memory. During its execution the
> > threads belonging to the process are able to switch freely between the
> > different attached VAS and the process' original AS enabling them to
> > utilize the different available views on the memory.
> 
> Sounds like the old SKAS feature for UML.

I haven't heard of this feature before, but after briefly looking at the
description on the UML website it actually has some similarities with what I am
proposing. But as far as I can see this was not merged into the mainline
kernel, was it? In addition, I think that first class virtual address spaces go
even one step further by allowing an AS to live independently of processes.

> > In addition to the concept of first class virtual address spaces, this
> > patchset introduces yet another feature called VAS segments. VAS segments
> > are memory regions which have a fixed size and position in the virtual
> > address space and can be shared between multiple first class virtual
> > address spaces. Such shareable memory regions are especially useful for
> > in-memory pointer-based data structures or other pure in-memory data.
> 
> This sounds rather complicated.  Getting TLB flushing right seems
> tricky.  Why not just map the same thing into multiple mms?

This is exactly what happens at the end. The memory region that is described by
the VAS segment will be mapped in the ASes that use the segment.

> >
> >         |   VAS   | processes |
> > --------------------------------
> > switch  |  468ns  |  1944ns   |
> 
> The solution here is IMO to fix the scheduler.

IMHO it will be very difficult for the scheduler code to reach the same
switching time as the pure VAS switch, because switching between VAS does not
involve saving any registers or FPU state and does not require selecting the
next runnable task. A VAS switch is basically a system call that just changes
the AS of the current thread, which makes it a very lightweight operation.
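Conceptually, the switch amounts to something like the following sketch
(made-up names, not the actual patch code; mm refcounting and TLB details
omitted):

/* Sketch: the calling thread swaps its mm and does the hardware AS switch;
 * nothing is scheduled, no registers or FPU state are saved. */
static long vas_switch_sketch(struct task_struct *tsk, struct mm_struct *vas_mm)
{
	struct mm_struct *old_mm = tsk->active_mm;

	task_lock(tsk);
	tsk->mm = vas_mm;
	tsk->active_mm = vas_mm;
	switch_mm(old_mm, vas_mm, tsk);		/* hardware switch only */
	task_unlock(tsk);

	return 0;
}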

> Also, FWIW, I have patches (that need a little work) that will make
> switch_mm() way faster on x86.

These patches will also improve the speed of the VAS switch operation. We are
also using the switch_mm function in the background to perform the actual
hardware switch between the two ASes. The main reason why the VAS switch is
faster than the task switch is that it just has to do fewer things.

> > At the current state of the development, first class virtual address spaces
> > have one limitation, that we haven't been able to solve so far. The feature
> > allows, that different threads of the same process can execute in different
> > AS at the same time. This is possible, because the VAS-switch operation
> > only changes the active mm_struct for the task_struct of the calling
> > thread. However, when a thread switches into a first class virtual address
> > space, some parts of its original AS are duplicated into the new one to
> > allow the thread to continue its execution at its current state.
> 
> Ick.  Please don't do this.  Can we please keep an mm as just an mm
> and not make it look magically different depending on which process
> maps it?  If you need a trampoline (which you do, of course), just
> write a trampoline in regular user code and map it manually.

Did I understand you correctly that you are proposing that the switching thread
should make sure by itself that its code, stack, … memory regions are properly
set up in the new AS before/after switching into it? I think this would make
using first class virtual address spaces much more difficult for user
applications, to the extent that I am not even sure if they can be used at all.
At the moment, switching into a VAS is a very simple operation for an
application because the kernel will simply do the right thing.

Till


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
On Tue, 14 Mar 2017, Richard Henderson wrote:
> On 03/14/2017 10:39 AM, Till Smejkal wrote:
> > > Is this an indication that full virtual address spaces are useless?  It
> > > would seem like if you only use virtual address segments then you avoid
> > > all of the problems with executing code, active stacks, and brk.
> > 
> > What do you mean with *virtual address segments*? The nice part of first
> > class virtual address spaces is that one can share/reuse collections of
> > address space segments easily.
> 
> What do *I* mean?  You introduced the term, didn't you?
> Rereading your original I see you called them "VAS segments".

Oh, I am sorry. I thought that you were referring to some other feature that I
don't know about.

> Anyway, whatever they are called, it would seem that these segments do not
> require any of the syncing mechanisms that are causing you problems.

Yes, VAS segments provide a possibility to share memory regions between multiple
address spaces without the need to synchronize heap, stack, etc. Unfortunately,
the VAS segment feature by itself, without the whole concept of first class
virtual address spaces, is not as powerful. With some additional work it can
probably be represented with the existing shmem functionality.

The first class virtual address space feature, on the other hand, provides a
real benefit for applications in our opinion, namely that an application can
switch between different views on its memory, which enables various interesting
programming paradigms as mentioned in the cover letter.

Till


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Richard Henderson

On 03/14/2017 10:39 AM, Till Smejkal wrote:

> > Is this an indication that full virtual address spaces are useless?  It
> > would seem like if you only use virtual address segments then you avoid all
> > of the problems with executing code, active stacks, and brk.
> 
> What do you mean with *virtual address segments*? The nice part of first class
> virtual address spaces is that one can share/reuse collections of address space
> segments easily.


What do *I* mean?  You introduced the term, didn't you?
Rereading your original I see you called them "VAS segments".

Anyway, whatever they are called, it would seem that these segments do not 
require any of the syncing mechanisms that are causing you problems.



r~


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Andy Lutomirski
On Mon, Mar 13, 2017 at 3:14 PM, Till Smejkal
 wrote:
> This patchset extends the kernel memory management subsystem with a new
> type of address spaces (called VAS) which can be created and destroyed
> independently of processes by a user in the system. During its lifetime
> such a VAS can be attached to processes by the user which allows a process
> to have multiple address spaces and thereby multiple, potentially
> different, views on the system's main memory. During its execution the
> threads belonging to the process are able to switch freely between the
> different attached VAS and the process' original AS enabling them to
> utilize the different available views on the memory.

Sounds like the old SKAS feature for UML.

> In addition to the concept of first class virtual address spaces, this
> patchset introduces yet another feature called VAS segments. VAS segments
> are memory regions which have a fixed size and position in the virtual
> address space and can be shared between multiple first class virtual
> address spaces. Such shareable memory regions are especially useful for
> in-memory pointer-based data structures or other pure in-memory data.

This sounds rather complicated.  Getting TLB flushing right seems
tricky.  Why not just map the same thing into multiple mms?

>
>         |   VAS   | processes |
> --------------------------------
> switch  |  468ns  |  1944ns   |

The solution here is IMO to fix the scheduler.

Also, FWIW, I have patches (that need a little work) that will make
switch_mm() way faster on x86.

> At the current state of the development, first class virtual address spaces
> have one limitation, that we haven't been able to solve so far. The feature
> allows, that different threads of the same process can execute in different
> AS at the same time. This is possible, because the VAS-switch operation
> only changes the active mm_struct for the task_struct of the calling
> thread. However, when a thread switches into a first class virtual address
> space, some parts of its original AS are duplicated into the new one to
> allow the thread to continue its execution at its current state.

Ick.  Please don't do this.  Can we please keep an mm as just an mm
and not make it look magically different depending on which process
maps it?  If you need a trampoline (which you do, of course), just
write a trampoline in regular user code and map it manually.
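For illustration, a rough userspace sketch of mapping a trampoline manually
(address, size and the trampoline bytes are made up):

/* Illustrative only: place a tiny trampoline at a fixed address so that it
 * can exist at the same location in every address space that maps it. */
#include <string.h>
#include <sys/mman.h>

#define TRAMPOLINE_ADDR	((void *)0x700000000000UL)	/* made-up address */
#define TRAMPOLINE_SIZE	4096

static const unsigned char trampoline_code[] = {
	0xc3,			/* x86-64 "ret": stand-in for the real code */
};

static void *map_trampoline(void)
{
	void *p = mmap(TRAMPOLINE_ADDR, TRAMPOLINE_SIZE,
		       PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	memcpy(p, trampoline_code, sizeof(trampoline_code));
	mprotect(p, TRAMPOLINE_SIZE, PROT_READ | PROT_EXEC);
	return p;
}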

--Andy


Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
On Tue, 14 Mar 2017, Richard Henderson wrote:
> On 03/14/2017 08:14 AM, Till Smejkal wrote:
> > At the current state of the development, first class virtual address spaces
> > have one limitation, that we haven't been able to solve so far. The feature
> > allows, that different threads of the same process can execute in different
> > AS at the same time. This is possible, because the VAS-switch operation
> > only changes the active mm_struct for the task_struct of the calling
> > thread. However, when a thread switches into a first class virtual address
> > space, some parts of its original AS are duplicated into the new one to
> > allow the thread to continue its execution at its current state.
> > Accordingly, parts of the processes AS (e.g. the code section, data
> > section, heap section and stack sections) exist in multiple AS if the
> > process has a VAS attached to it. Changes to these shared memory regions
> > are synchronized between the address spaces whenever a thread switches
> > between two of them. Unfortunately, in some scenarios the kernel is not
> > able to properly synchronize all these shared memory regions because of
> > conflicting changes. One such example happens if there are two threads, one
> > executing in an attached first class virtual address space, the other in
> > the tasks original address space. If both threads make changes to the heap
> > section that cause expansion of the underlying vm_area_struct, the kernel
> > cannot correctly synchronize these changes, because that would cause parts
> > of the virtual address space to be overwritten with unrelated data. In the
> > current implementation such conflicts are only detected but not resolved
> > and result in an error code being returned by the kernel during the VAS
> > switch operation. Unfortunately, that means for the particular thread that
> > tried to make the switch, that it cannot do this anymore in the future and
> > accordingly has to be killed.
> 
> This sounds like a fairly fundamental problem to me.

Yes, I agree. This is a significant limitation of first class virtual address
spaces. However, conflicts like this can be mitigated by being careful in the
application that uses multiple first class virtual address spaces. If all
threads make sure that they never resize shared memory regions when executing
inside a VAS, such conflicts do not occur. Another possibility that I
investigated but have not yet finished is that such resizes of shared memory
regions are synchronized more frequently than just at every switch between
VASes. If one, for example, "forwards" memory region resizes to all ASes that
share this particular memory region during the resize operation, one can
completely eliminate this problem. Unfortunately, this introduces a significant
cost as well as a difficult-to-handle race condition.

> Is this an indication that full virtual address spaces are useless?  It
> would seem like if you only use virtual address segments then you avoid all
> of the problems with executing code, active stacks, and brk.

What do you mean with *virtual address segments*? The nice part of first class
virtual address spaces is that one can share/reuse collections of address space
segments easily.

Till


Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
Hi Greg,

First of all thanks for your reply.

On Tue, 14 Mar 2017, Greg Kroah-Hartman wrote:
> On Mon, Mar 13, 2017 at 03:14:12PM -0700, Till Smejkal wrote:
> 
> There's no way with that many cc: lists and people that this is really
> making it through very many people's filters and actually on a mailing
> list.  Please trim them down.

I am sorry that the patch's cc-list is too big. This was the list of people that
the get_maintainers.pl script produced. I already recognized that it was a huge
number of people, but I didn't want to remove anyone from the list because I
wasn't sure who would be interested in this patch set. Do you have any
suggestion of whom to remove from the list? I don't want to annoy anyone with
useless emails.

> Minor sysfs questions/issues:
> 
> > +struct vas {
> > +   struct kobject kobj;/* < the internal kobject that we use *
> > +*   for reference counting and sysfs *
> > +*   handling.*/
> > +
> > +   int id; /* < ID   */
> > +   char name[VAS_MAX_NAME_LENGTH]; /* < name */
> 
> The kobject has a name, why not use that?

The reason why I don't use the kobject's name is that I don't restrict the names
that are used for VAS/VAS segments. Accordingly, it would be allowed to use a
name like "foo/bar/xyz" as a VAS name. However, I am not sure what would happen
in sysfs if I used such a name for the kobject, especially since one could think
of another VAS with the name "foo/bar" whose name would conflict with the first
one although it does not necessarily have any connection with it.

> > +
> > +   struct mutex mtx;   /* < lock for parallel access.*/
> > +
> > +   struct mm_struct *mm;   /* < a partial memory map containing  *
> > +*   all mappings of this VAS.*/
> > +
> > +   struct list_head link;  /* < the link in the global VAS list. */
> > +   struct rcu_head rcu;/* < the RCU helper used for  *
> > +*   asynchronous VAS deletion.   */
> > +
> > +   u16 refcount;   /* < how often is the VAS attached.   */
> 
> The kobject has a refcount, use that?  Don't have 2 refcounts in the
> same structure, that way lies madness.  And bugs, lots of bugs...
> 
> And if this really is a refcount (hint, I don't think it is), you should
> use the refcount_t type.

I actually use both the internal kobject refcount, to keep track of how often a
VAS/VAS segment is referenced, and this 'refcount' variable, to keep track of
how often the VAS is actually attached to a task. They do not necessarily have
to be related to each other. I can rename this variable to attach_count. Or, if
preferred, I can only use the kobject reference counter and remove this variable
completely, though I would lose information about how often the VAS is attached
to a task, because the kobject reference counter is also used to keep track of
other variables referencing the VAS.
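For what it's worth, a minimal sketch of an attach counter built on refcount_t
(field and function names made up):

/* Hypothetical sketch: track attachments with refcount_t instead of a u16. */
#include <linux/refcount.h>

struct vas_attach_tracking {
	refcount_t attach_count;	/* how often the VAS is attached */
};

static void vas_first_attach(struct vas_attach_tracking *t)
{
	refcount_set(&t->attach_count, 1);
}

static void vas_attach(struct vas_attach_tracking *t)
{
	refcount_inc(&t->attach_count);
}

static bool vas_detach_last(struct vas_attach_tracking *t)
{
	return refcount_dec_and_test(&t->attach_count);	/* true on last detach */
}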

> > +/**
> > + * The sysfs structure we need to handle attributes of a VAS.
> > + **/
> > +struct vas_sysfs_attr {
> > +   struct attribute attr;
> > +   ssize_t (*show)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> > +   char *buf);
> > +   ssize_t (*store)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> > +const char *buf, size_t count);
> > +};
> > +
> > +#define VAS_SYSFS_ATTR(NAME, MODE, SHOW, STORE)\
> > +static struct vas_sysfs_attr vas_sysfs_attr_##NAME =  \
> > +   __ATTR(NAME, MODE, SHOW, STORE)
> 
> __ATTR_RO and __ATTR_RW should work better for you.  If you really need
> this.

Thank you. I will have a look at these functions.

> Oh, and where is the Documentation/ABI/ updates to try to describe the
> sysfs structure and files?  Did I miss that in the series?

Oh sorry, I forgot to add this file. I will add the ABI descriptions for future
submissions.

> > +static ssize_t __show_vas_name(struct vas *vas, struct vas_sysfs_attr 
> > *vsattr,
> > +  char *buf)
> > +{
> > +   return scnprintf(buf, PAGE_SIZE, "%s", vas->name);
> 
> It's a page size, just use sprintf() and be done with it.  No need to
> ever check, you "know" it will be correct.

OK. I was following the sysfs example in the documentation that used scnprintf,
but if sprintf is preferred, I can change this.

> Also, what about a trailing '\n' for these attributes?

I will change this.
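i.e. something like this sketch of the adjusted callback:

static ssize_t __show_vas_name(struct vas *vas, struct vas_sysfs_attr *vsattr,
			       char *buf)
{
	/* buf is a full page; plain sprintf plus a trailing newline suffices */
	return sprintf(buf, "%s\n", vas->name);
}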

> Oh wait, why have a name when the kobject name is already there in the
> directory itself?  Do you really need this?

See above.

> > +/**
> > + * The ktype data structure representing a VAS.
> > + **/
> > +static struct kobj_type vas_ktype = {
> > +   .sysfs_ops = &vas_sysfs_ops,
> > +   .release = __vas_release,
> 
> Why the odd 

Re: [RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Richard Henderson

On 03/14/2017 08:14 AM, Till Smejkal wrote:

> At the current state of the development, first class virtual address spaces
> have one limitation, that we haven't been able to solve so far. The feature
> allows, that different threads of the same process can execute in different
> AS at the same time. This is possible, because the VAS-switch operation
> only changes the active mm_struct for the task_struct of the calling
> thread. However, when a thread switches into a first class virtual address
> space, some parts of its original AS are duplicated into the new one to
> allow the thread to continue its execution at its current state.
> Accordingly, parts of the processes AS (e.g. the code section, data
> section, heap section and stack sections) exist in multiple AS if the
> process has a VAS attached to it. Changes to these shared memory regions
> are synchronized between the address spaces whenever a thread switches
> between two of them. Unfortunately, in some scenarios the kernel is not
> able to properly synchronize all these shared memory regions because of
> conflicting changes. One such example happens if there are two threads, one
> executing in an attached first class virtual address space, the other in
> the tasks original address space. If both threads make changes to the heap
> section that cause expansion of the underlying vm_area_struct, the kernel
> cannot correctly synchronize these changes, because that would cause parts
> of the virtual address space to be overwritten with unrelated data. In the
> current implementation such conflicts are only detected but not resolved
> and result in an error code being returned by the kernel during the VAS
> switch operation. Unfortunately, that means for the particular thread that
> tried to make the switch, that it cannot do this anymore in the future and
> accordingly has to be killed.


This sounds like a fairly fundamental problem to me.

Is this an indication that full virtual address spaces are useless?  It would 
seem like if you only use virtual address segments then you avoid all of the 
problems with executing code, active stacks, and brk.



r~


Re: [PATCH] kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()

2017-03-13 Thread Greg Kroah-Hartman
On Tue, Mar 14, 2017 at 08:17:00AM +0530, Vaibhav Jain wrote:
> Recently started seeing a kernel oops when a module tries removing a
> memory mapped sysfs bin_attribute. On closer investigation the root
> cause seems to be kernfs_release_file() trying to call
> kernfs_op.release() callback that's NULL for such sysfs
> bin_attributes. The oops occurs when kernfs_release_file() is called from
> kernfs_drain_open_files() to cleanup any open handles with active
> memory mappings.
> 
> The patch fixes this by checking for flag KERNFS_HAS_RELEASE before
> calling kernfs_release_file() in function kernfs_drain_open_files().
> 
> On ppc64-le arch with cxl module the oops back-trace is of the
> form below:
> [  861.381126] Unable to handle kernel paging request for instruction fetch
> [  861.381360] Faulting instruction address: 0x
> [  861.381428] Oops: Kernel access of bad area, sig: 11 [#1]
> 
> [  861.382481] NIP:  LR: c0362c60 CTR:
> 
> 
> Call Trace:
> [c00f1680b750] [c0362c34] kernfs_drain_open_files+0x104/0x1d0 
> (unreliable)
> [c00f1680b790] [c035fa00] __kernfs_remove+0x260/0x2c0
> [c00f1680b820] [c0360da0] kernfs_remove_by_name_ns+0x60/0xe0
> [c00f1680b8b0] [c03638f4] sysfs_remove_bin_file+0x24/0x40
> [c00f1680b8d0] [c062a164] device_remove_bin_file+0x24/0x40
> [c00f1680b8f0] [d9b7b22c] cxl_sysfs_afu_remove+0x144/0x170 [cxl]
> [c00f1680b940] [d9b7c7e4] cxl_remove+0x6c/0x1a0 [cxl]
> [c00f1680b990] [c052f694] pci_device_remove+0x64/0x110
> [c00f1680b9d0] [c06321d4] 
> device_release_driver_internal+0x1f4/0x2b0
> [c00f1680ba20] [c0525cb0] pci_stop_bus_device+0xa0/0xd0
> [c00f1680ba60] [c0525e80] pci_stop_and_remove_bus_device+0x20/0x40
> [c00f1680ba90] [c004a6c4] pci_hp_remove_devices+0x84/0xc0
> [c00f1680bad0] [c004a688] pci_hp_remove_devices+0x48/0xc0
> [c00f1680bb10] [c09dfda4] eeh_reset_device+0xb0/0x290
> [c00f1680bbb0] [c0032b4c] eeh_handle_normal_event+0x47c/0x530
> [c00f1680bc60] [c0032e64] eeh_handle_event+0x174/0x350
> [c00f1680bd10] [c0033228] eeh_event_handler+0x1e8/0x1f0
> [c00f1680bdc0] [c00d384c] kthread+0x14c/0x190
> [c00f1680be30] [c000b5a0] ret_from_kernel_thread+0x5c/0xbc
> 
> Fixes: f83f3c515654("kernfs: fix locking around kernfs_ops->release()
> callback")
> Signed-off-by: Vaibhav Jain 
> ---
>  fs/kernfs/file.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 8e4dc7a..ac2dfe0 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -809,7 +809,8 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
>   if (kn->flags & KERNFS_HAS_MMAP)
>   unmap_mapping_range(inode->i_mapping, 0, 0, 1);
>  
> - kernfs_release_file(kn, of);
> + if (kn->flags & KERNFS_HAS_RELEASE)
> + kernfs_release_file(kn, of);
>   }
>  
>   mutex_unlock(_open_file_mutex);
> -- 
> 2.9.3

Tejun, want to take this through your tree, or at the least, give me an
ack for this?

thanks,

greg k-h


Re: [RFC][PATCH 0/2] reworking cause_ipi and adding global doorbell support

2017-03-13 Thread Nicholas Piggin
On Tue, 14 Mar 2017 13:34:38 +1100
Benjamin Herrenschmidt  wrote:

> On Tue, 2017-03-14 at 11:49 +1000, Nicholas Piggin wrote:
> > On Tue, 14 Mar 2017 10:31:08 +1100
> > Benjamin Herrenschmidt  wrote:
> > 
> > > On Mon, 2017-03-13 at 03:13 +1000, Nicholas Piggin wrote:  
> > > > Hi,
> > > > 
> > > > Just after the previous two fixes, I would like to propose changing
> > > > the way we do doorbell vs interrupt controller IPIs, and add support
> > > > for global doorbells supported by POWER9 in HV mode.
> > > > 
> > > > After this, the platform code knows about doorbells and interrupt
> > > > controller IPIs, rather than they know about each other.    
> > > 
> > > A few things come to mind:
> > > 
> > >  - We don't want to use doorbells under KVM. They are going to turn
> > > into traps and be emulated, slower than using H_IPI, at least on P9.
> > > Even for core only doorbells. I'm not sure how to convey that to the
> > > guest.  
> > 
> > msgsndp will be okay, won't it? Guest just chooses that based on
> > HVMODE (which pseries platform knows is core only).  
> 
> No. It will suck. Because KVM can run each guest thread on a different core,
> the HW won't work, so we have to disable it and trap the instructions &
> emulate them. We really don't want P9 guests to use it under KVM (it's fine
> under pHyp).

Ah, gotcha.

> > >  - On PP9 DD1 we need a CI load instead of msgsync (a DARN instruction
> > > would do too if it works)  
> > 
> > Yes, Paul pointed this out too. I'll add an alt patch for it. Apparently
> > also msgsync needs lwsync afterwards for DD2.  
> 
> Odd. Ok.
> 
> > >  - Can we get rid of the atomic ops for manipulating the IPI mux ? What
> > > about a cache line per message and just set/clear ? If we clear in the
> > > doorbell handler before we call the respective targets, we shouldn't
> > > "lose" messages no ? As long as the actual handlers "loop" as necessary
> > > of course.  
> > 
> > Yes I think that would work. Good idea. A single cacheline with messages
> > being independently stored bytes within it might work better, so the
> > receiver CPU does not have to go through and load multiple cachelines
> > to check for messages. It could load up to 8 message types with one load.  
> 
> Ok. But we need to make sure we use multiple stores to not lose messages.
> 
> Ie.
> 
>  - Load all
>  - For each byte if set
> - clear byte
> - then call handler

Yes. I think that will be okay because we shouldn't get any load-hit-store
issues. I'll do some benchmarking anyway.

Thanks,
Nick



[PATCH] kernfs: Check KERNFS_HAS_RELEASE before calling kernfs_release_file()

2017-03-13 Thread Vaibhav Jain
Recently started seeing a kernel oops when a module tries removing a
memory mapped sysfs bin_attribute. On closer investigation the root
cause seems to be kernfs_release_file() trying to call
kernfs_op.release() callback that's NULL for such sysfs
bin_attributes. The oops occurs when kernfs_release_file() is called from
kernfs_drain_open_files() to cleanup any open handles with active
memory mappings.

The patch fixes this by checking for flag KERNFS_HAS_RELEASE before
calling kernfs_release_file() in function kernfs_drain_open_files().

On ppc64-le arch with cxl module the oops back-trace is of the
form below:
[  861.381126] Unable to handle kernel paging request for instruction fetch
[  861.381360] Faulting instruction address: 0x
[  861.381428] Oops: Kernel access of bad area, sig: 11 [#1]

[  861.382481] NIP:  LR: c0362c60 CTR:


Call Trace:
[c00f1680b750] [c0362c34] kernfs_drain_open_files+0x104/0x1d0 
(unreliable)
[c00f1680b790] [c035fa00] __kernfs_remove+0x260/0x2c0
[c00f1680b820] [c0360da0] kernfs_remove_by_name_ns+0x60/0xe0
[c00f1680b8b0] [c03638f4] sysfs_remove_bin_file+0x24/0x40
[c00f1680b8d0] [c062a164] device_remove_bin_file+0x24/0x40
[c00f1680b8f0] [d9b7b22c] cxl_sysfs_afu_remove+0x144/0x170 [cxl]
[c00f1680b940] [d9b7c7e4] cxl_remove+0x6c/0x1a0 [cxl]
[c00f1680b990] [c052f694] pci_device_remove+0x64/0x110
[c00f1680b9d0] [c06321d4] device_release_driver_internal+0x1f4/0x2b0
[c00f1680ba20] [c0525cb0] pci_stop_bus_device+0xa0/0xd0
[c00f1680ba60] [c0525e80] pci_stop_and_remove_bus_device+0x20/0x40
[c00f1680ba90] [c004a6c4] pci_hp_remove_devices+0x84/0xc0
[c00f1680bad0] [c004a688] pci_hp_remove_devices+0x48/0xc0
[c00f1680bb10] [c09dfda4] eeh_reset_device+0xb0/0x290
[c00f1680bbb0] [c0032b4c] eeh_handle_normal_event+0x47c/0x530
[c00f1680bc60] [c0032e64] eeh_handle_event+0x174/0x350
[c00f1680bd10] [c0033228] eeh_event_handler+0x1e8/0x1f0
[c00f1680bdc0] [c00d384c] kthread+0x14c/0x190
[c00f1680be30] [c000b5a0] ret_from_kernel_thread+0x5c/0xbc

Fixes: f83f3c515654("kernfs: fix locking around kernfs_ops->release()
callback")
Signed-off-by: Vaibhav Jain 
---
 fs/kernfs/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8e4dc7a..ac2dfe0 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -809,7 +809,8 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
if (kn->flags & KERNFS_HAS_MMAP)
unmap_mapping_range(inode->i_mapping, 0, 0, 1);
 
-   kernfs_release_file(kn, of);
+   if (kn->flags & KERNFS_HAS_RELEASE)
+   kernfs_release_file(kn, of);
}
 
mutex_unlock(_open_file_mutex);
-- 
2.9.3



Re: [RFC][PATCH 0/2] reworking cause_ipi and adding global doorbell support

2017-03-13 Thread Benjamin Herrenschmidt
On Tue, 2017-03-14 at 11:49 +1000, Nicholas Piggin wrote:
> On Tue, 14 Mar 2017 10:31:08 +1100
> Benjamin Herrenschmidt  wrote:
> 
> > On Mon, 2017-03-13 at 03:13 +1000, Nicholas Piggin wrote:
> > > Hi,
> > > 
> > > Just after the previous two fixes, I would like to propose changing
> > > the way we do doorbell vs interrupt controller IPIs, and add support
> > > for global doorbells supported by POWER9 in HV mode.
> > > 
> > > After this, the platform code knows about doorbells and interrupt
> > > controller IPIs, rather than they know about each other.  
> > 
> > A few things come to mind:
> > 
> >  - We don't want to use doorbells under KVM. They are going to turn
> > into traps and be emulated, slower than using H_IPI, at least on P9.
> > Even for core only doorbells. I'm not sure how to convey that to the
> > guest.
> 
> msgsndp will be okay, won't it? Guest just chooses that based on
> HVMODE (which pseries platform knows is core only).

No. It will suck. Because KVM can run each guest thread on a different core,
the HW won't work, so we have to disable it and trap the instructions & emulate
them. We really don't want P9 guests to use it under KVM (it's fine under pHyp).

> >  - On PP9 DD1 we need a CI load instead of msgsync (a DARN instruction
> > would do too if it works)
> 
> Yes, Paul pointed this out too. I'll add an alt patch for it. Apparently
> also msgsync needs lwsync afterwards for DD2.

Odd. Ok.

> >  - Can we get rid of the atomic ops for manipulating the IPI mux ? What
> > about a cache line per message and just set/clear ? If we clear in the
> > doorbell handler before we call the respective targets, we shouldn't
> > "lose" messages no ? As long as the actual handlers "loop" as necessary
> > of course.
> 
> Yes I think that would work. Good idea. A single cacheline with messages
> being independently stored bytes within it might work better, so the
> receiver CPU does not have to go through and load multiple cachelines
> to check for messages. It could load up to 8 message types with one load.

Ok. But we need to make sure we use multiple stores to not lose messages.

Ie.

 - Load all
 - For each byte if set
- clear byte
- then call handler

Cheers,
Ben.



Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
Hi Vineet,

On Mon, 13 Mar 2017, Vineet Gupta wrote:
> I've not looked at the patches closely (or read the referenced paper fully
> yet), but at first glance it seems that on the ARC architecture we can
> potentially use/leverage this mechanism to implement shared TLB entries.
> Before anyone shouts: these are not the same as the IA64/x86 protection keys
> which allow TLB entries with different protection bits across processes etc.
> These TLB entries are actually *shared* by processes.
> 
> Conceptually there are shared address spaces, independent of processes. e.g.
> ldso code is shared address space #1, libc (code) #2  System can support a
> limited number of shared addr spaces (say 64, enough for a typical embedded
> system).
> 
> While normal TLB entries are tagged with an ASID (address space ID) to keep
> them unique across processes, shared TLB entries are tagged with a shared
> address space ID.
> 
> A process MMU context consists of an ASID (a single number) and a SASID
> bitmap (to allow "subscription" to multiple shared spaces). The subscriptions
> are set up by userspace ld.so, which knows about the libs the process wants
> to map.
> 
> The restriction of course is that the spaces are mapped at the *same* vaddr
> in all participating processes. I know this goes against whole security,
> address space randomization - but it gives much better real time performance.
> Why does each process need to take a MMU exception for libc code...
> 
> So long story short - it seems there can be multiple uses of this
> infrastructure!

During the development of this code, we also looked at shared TLB entries, but
the other way around. We wanted to use them to prevent flushing of TLB entries
of shared memory regions when switching between multiple ASes. Unfortunately,
we never finished this part of the code.

However, we also investigated a different use-case for first class virtual
address spaces that is related to what you propose, if I didn't misunderstand
something. The idea is to move shared libraries into their own first class
virtual address space and only load some small trampoline code into the
application AS. This trampoline code performs the VAS switch into the library's
AS and executes the requested function there. If we combine this architecture
with tagged TLB entries to prevent TLB flushes during the switch operation, it
can also reach an acceptable performance. A side effect of moving the shared
library into its own AS is that it cannot be used by ROP attacks because it is
not accessible in the application's AS.

Till


Re: [RFC][PATCH 0/2] reworking cause_ipi and adding global doorbell support

2017-03-13 Thread Nicholas Piggin
On Tue, 14 Mar 2017 10:31:08 +1100
Benjamin Herrenschmidt  wrote:

> On Mon, 2017-03-13 at 03:13 +1000, Nicholas Piggin wrote:
> > Hi,
> > 
> > Just after the previous two fixes, I would like to propose changing
> > the way we do doorbell vs interrupt controller IPIs, and add support
> > for global doorbells supported by POWER9 in HV mode.
> > 
> > After this, the platform code knows about doorbells and interrupt
> > controller IPIs, rather than they know about each other.  
> 
> A few things come to mind:
> 
>  - We don't want to use doorbells under KVM. They are going to turn
> into traps and be emulated, slower than using H_IPI, at least on P9.
> Even for core only doorbells. I'm not sure how to convey that to the
> guest.

msgsndp will be okay, won't it? Guest just chooses that based on
HVMODE (which pseries platform knows is core only).

>  - On PP9 DD1 we need a CI load instead of msgsync (a DARN instruction
> would do too if it works)

Yes, Paul pointed this out too. I'll add an alt patch for it. Apparently
also msgsync needs lwsync afterwards for DD2.

> 
>  - Can we get rid of the atomic ops for manipulating the IPI mux ? What
> about a cache line per message and just set/clear ? If we clear in the
> doorbell handler before we call the respective targets, we shouldn't
> "lose" messages no ? As long as the actual handlers "loop" as necessary
> of course.

Yes I think that would work. Good idea. A single cacheline with messages
being independently stored bytes within it might work better, so the
receiver CPU does not have to go through and load multiple cachelines
to check for messages. It could load up to 8 message types with one load.

Thanks,
Nick


Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces

2017-03-13 Thread Vineet Gupta
+CC Ingo, tglx

Hi Till,

On 03/13/2017 03:14 PM, Till Smejkal wrote:
> Introduce a different type of address spaces which are first class citizens
> in the OS. That means that the kernel now handles two types of AS, those
> which are closely coupled with a process and those which aren't. While the
> former ones are created and destroyed together with the process by the
> kernel and are the default type of AS in the Linux kernel, the latter ones
> have to be managed explicitly by the user and are the newly introduced
> type.
> 
> Accordingly, a first class AS (also called VAS == virtual address space)
> can exist in the OS independently from any process. A user has to
> explicitly create and destroy them in the system. Processes and VAS can be
> combined by attaching a previously created VAS to a process which basically
> adds an additional AS to the process that the process' threads are able to
> execute in. Hence, VAS allow a process to have different views onto the
> main memory of the system (its original AS and the attached VAS) between
> which its threads can switch arbitrarily during their lifetime.
> 
> The functionality made available through first class virtual address spaces
> can be used in various different ways. One possible way to utilize VAS is
> to compartmentalize a process for security reasons. Another possible usage
> is to improve the performance of data-centric applications by being able to
> manage different sets of data in memory without the need to map or unmap
> them.
> 
> Furthermore, first class virtual address spaces can be attached to
> different processes at the same time if the underlying memory is only
> readable. This mechanism allows sharing of whole address spaces between
> multiple processes that can both execute in them using the contained
> memory.

I've not looked at the patches closely (or read the referenced paper fully
yet), but at first glance it seems that on the ARC architecture we can
potentially use/leverage this mechanism to implement shared TLB entries.
Before anyone shouts: these are not the same as the IA64/x86 protection keys
which allow TLB entries with different protection bits across processes etc.
These TLB entries are actually *shared* by processes.

Conceptually there are shared address spaces, independent of processes. e.g.
ldso code is shared address space #1, libc (code) #2  System can support a
limited number of shared addr spaces (say 64, enough for a typical embedded
system).

While normal TLB entries are tagged with an ASID (address space ID) to keep
them unique across processes, shared TLB entries are tagged with a shared
address space ID.

A process MMU context consists of ASID (a single number) and a SASID bitmap (to
allow "subscription" to multiple Shared spaces. The subscriptions are set up bu
userspace ld.so which knows about the libs process wants to map.

The restriction of course is that the spaces are mapped at the *same* vaddr in
all participating processes. I know this goes against the whole security /
address space randomization story - but it gives much better real-time
performance. Why does each process need to take an MMU exception for libc
code...
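
To make that a bit more concrete, the per-process context could conceptually
look like the following sketch (illustrative only, not actual ARC kernel code;
names are made up):

#include <stdint.h>

#define NR_SHARED_SPACES  64    /* system-wide limit on shared addr spaces */

/* Per-process MMU context: a private ASID plus a bitmap of subscriptions
 * to the system-wide shared address spaces (SASIDs). */
struct mm_context {
	unsigned int asid;       /* tags this process' private TLB entries   */
	uint64_t     sasid_map;  /* bit N set => subscribed to shared space N */
};

/* Conceptually called on behalf of ld.so when it maps e.g. libc: subscribe
 * the process to shared address space 'sasid'.  A shared TLB entry then
 * hits if its SASID bit is set in sasid_map, regardless of the ASID. */
static inline void subscribe_shared_space(struct mm_context *ctx, int sasid)
{
	ctx->sasid_map |= UINT64_C(1) << sasid;
}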

So long story short - it seems there can be multiple uses of this
infrastructure!

-Vineet


[GIT PULL] Please pull powerpc/linux.git powerpc-4.11-4 tag

2017-03-13 Thread Michael Ellerman
Hi Linus,

Please pull some more powerpc fixes for 4.11.

The bulk of the diffstat is the Power9 Machine Check handler. It's bigger than
I'd usually send after rc2, but apparently folks are hitting them in the field,
and it's from Nick.

This also includes the fix for macio devices that was sent directly to you by
Larry Finger.

cheers


The following changes since commit f7d6a7283aa1de430c6323a9714d1a787bc2d1ea:

  Merge tag 'powerpc-4.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux (2017-03-07 10:46:10 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-4.11-4

for you to fetch changes up to 7b9f71f974a12740e79e918cfd58c2fce0b5b580:

  powerpc/64s: POWER9 machine check handler (2017-03-10 16:32:08 +1100)


powerpc fixes for 4.11 #4

The main item is the addition of the Power9 Machine Check handler. This was
delayed to make sure some details were correct, and is as minimal as possible.

The rest is small fixes, two for the Power9 PMU, two dealing with obscure
toolchain problems, two for the PowerNV IOMMU code (used by VFIO), and one to
fix a crash on 32-bit machines with macio devices due to missing dma_ops.

Thanks to:
  Alexey Kardashevskiy, Cyril Bur, Larry Finger, Madhavan Srinivasan, Nicholas
  Piggin.


Alexey Kardashevskiy (2):
  powerpc/powernv/ioda2: Gracefully fail if too many TCE levels requested
  powerpc/powernv/ioda2: Update iommu table base on ownership change

Cyril Bur (1):
  selftests/powerpc: Replace stxvx and lxvx with stxvd2x/lxvd2x

Larry Finger (1):
  powerpc/pmac: Fix crash in dma-mapping.h with NULL dma_ops

Madhavan Srinivasan (2):
  powerpc/perf: Fix perf_get_data_addr() for power9 DD1
  powerpc/perf: Handle sdar_mode for marked event in power9

Michael Ellerman (1):
  powerpc/boot: Fix zImage TOC alignment

Nicholas Piggin (3):
  powerpc/64s: fix handling of non-synchronous machine checks
  powerpc/64s: allow machine check handler to set severity and initiator
  powerpc/64s: POWER9 machine check handler

 arch/powerpc/boot/zImage.lds.S|   1 +
 arch/powerpc/include/asm/bitops.h |   4 +
 arch/powerpc/include/asm/mce.h| 108 +-
 arch/powerpc/kernel/cputable.c|   3 +
 arch/powerpc/kernel/mce.c |  88 +++-
 arch/powerpc/kernel/mce_power.c   | 237 ++
 arch/powerpc/perf/core-book3s.c   |   2 +
 arch/powerpc/perf/isa207-common.c |  43 +++-
 arch/powerpc/perf/isa207-common.h |   1 +
 arch/powerpc/platforms/powernv/opal.c |  21 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |  20 +-
 drivers/macintosh/macio_asic.c|   1 +
 tools/testing/selftests/powerpc/include/vsx_asm.h |  48 ++---
 13 files changed, 523 insertions(+), 54 deletions(-)




Re: [PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration

2017-03-13 Thread David Gibson
On Tue, Mar 14, 2017 at 11:54:03AM +1100, Alexey Kardashevskiy wrote:
> On 10/03/17 15:48, David Gibson wrote:
> > On Fri, Mar 10, 2017 at 02:53:27PM +1100, Alexey Kardashevskiy wrote:
> >> This is my current queue of patches to add acceleration of TCE
> >> updates in KVM.
> >>
> >> This is based on Linus'es tree sha1 c1aa905a304e.
> > 
> > I think we're finally there - I've now sent an R-b for all patches.
> 
> Thanks for the patience.
> 
> 
> I supposed in order to proceed now I need an ack from Alex, correct?

That, or simply for him to merge it.

> 
> 
> > 
> > 
> >>
> >> Please comment. Thanks.
> >>
> >> Changes:
> >> v8:
> >> * kept fixing oddities with error handling in 10/10
> >>
> >> v7:
> >> * added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c
> >>
> >> v6:
> >> * reworked the last patch in terms of error handling and parameters 
> >> checking
> >>
> >> v5:
> >> * replaced "KVM: PPC: Separate TCE validation from update" with
> >> "KVM: PPC: iommu: Unify TCE checking"
> >> * changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup 
> >> iommu_table disposal"
> >> * reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
> >> * more details in individual commit logs
> >>
> >> v4:
> >> * addressed comments from v3
> >> * updated subject lines with correct component names
> >> * regrouped the patchset in order:
> >>- powerpc fixes;
> >>- vfio_spapr_tce driver fixes;
> >>- KVM/PPC fixes;
> >>- KVM+PPC+VFIO;
> >> * everything except last 2 patches have "Reviewed-By: David"
> >>
> >> v3:
> >> * there was no full repost, only last patch was posted
> >>
> >> v2:
> >> * 11/11 reworked to use new notifiers, it is rather RFC as it still has
> >> a issue;
> >> * got 09/11, 10/11 to use notifiers in 11/11;
> >> * added rb: David to most of patches and added a comment in 05/11.
> >>
> >> Alexey Kardashevskiy (10):
> >>   powerpc/mmu: Add real mode support for IOMMU preregistered memory
> >>   powerpc/powernv/iommu: Add real mode version of
> >> iommu_table_ops::exchange()
> >>   powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
> >>   powerpc/vfio_spapr_tce: Add reference counting to iommu_table
> >>   KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
> >>   KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
> >>   KVM: PPC: Pass kvm* to kvmppc_find_table()
> >>   KVM: PPC: Use preregistered memory API to access TCE list
> >>   KVM: PPC: iommu: Unify TCE checking
> >>   KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
> >>
> >>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
> >>  arch/powerpc/include/asm/iommu.h   |  32 ++-
> >>  arch/powerpc/include/asm/kvm_host.h|   8 +
> >>  arch/powerpc/include/asm/kvm_ppc.h |  12 +-
> >>  arch/powerpc/include/asm/mmu_context.h |   4 +
> >>  include/uapi/linux/kvm.h   |   9 +
> >>  arch/powerpc/kernel/iommu.c|  86 +---
> >>  arch/powerpc/kvm/book3s_64_vio.c   | 330 
> >> -
> >>  arch/powerpc/kvm/book3s_64_vio_hv.c| 303 
> >> ++
> >>  arch/powerpc/kvm/powerpc.c |   2 +
> >>  arch/powerpc/mm/mmu_context_iommu.c|  39 
> >>  arch/powerpc/platforms/powernv/pci-ioda.c  |  46 ++--
> >>  arch/powerpc/platforms/powernv/pci.c   |   1 +
> >>  arch/powerpc/platforms/pseries/iommu.c |   3 +-
> >>  arch/powerpc/platforms/pseries/vio.c   |   2 +-
> >>  drivers/vfio/vfio_iommu_spapr_tce.c|   2 +-
> >>  virt/kvm/vfio.c|  60 ++
> >>  arch/powerpc/kvm/Kconfig   |   1 +
> >>  18 files changed, 855 insertions(+), 107 deletions(-)
> >>
> > 
> 
> 




-- 
David Gibson| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson




Re: [PATCH kernel v8 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration

2017-03-13 Thread Alexey Kardashevskiy
On 10/03/17 15:48, David Gibson wrote:
> On Fri, Mar 10, 2017 at 02:53:27PM +1100, Alexey Kardashevskiy wrote:
>> This is my current queue of patches to add acceleration of TCE
>> updates in KVM.
>>
>> This is based on Linus'es tree sha1 c1aa905a304e.
> 
> I think we're finally there - I've now sent an R-b for all patches.

Thanks for the patience.


I supposed in order to proceed now I need an ack from Alex, correct?


> 
> 
>>
>> Please comment. Thanks.
>>
>> Changes:
>> v8:
>> * kept fixing oddities with error handling in 10/10
>>
>> v7:
>> * added realmode's WARN_ON_ONCE_RM in arch/powerpc/kvm/book3s_64_vio_hv.c
>>
>> v6:
>> * reworked the last patch in terms of error handling and parameters checking
>>
>> v5:
>> * replaced "KVM: PPC: Separate TCE validation from update" with
>> "KVM: PPC: iommu: Unify TCE checking"
>> * changed already reviewed "powerpc/iommu/vfio_spapr_tce: Cleanup 
>> iommu_table disposal"
>> * reworked "KVM: PPC: VFIO: Add in-kernel acceleration for VFIO"
>> * more details in individual commit logs
>>
>> v4:
>> * addressed comments from v3
>> * updated subject lines with correct component names
>> * regrouped the patchset in order:
>>  - powerpc fixes;
>>  - vfio_spapr_tce driver fixes;
>>  - KVM/PPC fixes;
>>  - KVM+PPC+VFIO;
>> * everything except last 2 patches have "Reviewed-By: David"
>>
>> v3:
>> * there was no full repost, only last patch was posted
>>
>> v2:
>> * 11/11 reworked to use new notifiers, it is rather RFC as it still has
>> a issue;
>> * got 09/11, 10/11 to use notifiers in 11/11;
>> * added rb: David to most of patches and added a comment in 05/11.
>>
>> Alexey Kardashevskiy (10):
>>   powerpc/mmu: Add real mode support for IOMMU preregistered memory
>>   powerpc/powernv/iommu: Add real mode version of
>> iommu_table_ops::exchange()
>>   powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal
>>   powerpc/vfio_spapr_tce: Add reference counting to iommu_table
>>   KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number
>>   KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently
>>   KVM: PPC: Pass kvm* to kvmppc_find_table()
>>   KVM: PPC: Use preregistered memory API to access TCE list
>>   KVM: PPC: iommu: Unify TCE checking
>>   KVM: PPC: VFIO: Add in-kernel acceleration for VFIO
>>
>>  Documentation/virtual/kvm/devices/vfio.txt |  22 +-
>>  arch/powerpc/include/asm/iommu.h   |  32 ++-
>>  arch/powerpc/include/asm/kvm_host.h|   8 +
>>  arch/powerpc/include/asm/kvm_ppc.h |  12 +-
>>  arch/powerpc/include/asm/mmu_context.h |   4 +
>>  include/uapi/linux/kvm.h   |   9 +
>>  arch/powerpc/kernel/iommu.c|  86 +---
>>  arch/powerpc/kvm/book3s_64_vio.c   | 330 
>> -
>>  arch/powerpc/kvm/book3s_64_vio_hv.c| 303 ++
>>  arch/powerpc/kvm/powerpc.c |   2 +
>>  arch/powerpc/mm/mmu_context_iommu.c|  39 
>>  arch/powerpc/platforms/powernv/pci-ioda.c  |  46 ++--
>>  arch/powerpc/platforms/powernv/pci.c   |   1 +
>>  arch/powerpc/platforms/pseries/iommu.c |   3 +-
>>  arch/powerpc/platforms/pseries/vio.c   |   2 +-
>>  drivers/vfio/vfio_iommu_spapr_tce.c|   2 +-
>>  virt/kvm/vfio.c|  60 ++
>>  arch/powerpc/kvm/Kconfig   |   1 +
>>  18 files changed, 855 insertions(+), 107 deletions(-)
>>
> 


-- 
Alexey





Re: [RFC][PATCH 0/2] reworking cause_ipi and adding global doorbell support

2017-03-13 Thread Benjamin Herrenschmidt
On Mon, 2017-03-13 at 03:13 +1000, Nicholas Piggin wrote:
> Hi,
> 
> Just after the previous two fixes, I would like to propose changing
> the way we do doorbell vs interrupt controller IPIs, and add support
> for global doorbells supported by POWER9 in HV mode.
> 
> After this, the platform code knows about doorbells and interrupt
> controller IPIs, rather than they know about each other.

A few things come to mind:

 - We don't want to use doorbells under KVM. They are going to turn
into traps and be emulated, slower than using H_IPI, at least on P9.
Even for core only doorbells. I'm not sure how to convey that to the
guest.

 - On P9 DD1 we need a CI load instead of msgsync (a DARN instruction
would do too if it works)

 - Can we get rid of the atomic ops for manipulating the IPI mux ? What
about a cache line per message and just set/clear ? If we clear in the
doorbell handler before we call the respective targets, we shouldn't
"lose" messages no ? As long as the actual handlers "loop" as necessary
of course.

Cheers,
Ben.

> Thanks,
> Nick
> 
> Nicholas Piggin (2):
>   powerpc/64s: change the doorbell IPI calling convention
>   powerpc/64s: use global doorbell on POWER9 in HV mode
> 
>  arch/powerpc/include/asm/dbell.h   | 38 ++
> -
>  arch/powerpc/include/asm/smp.h |  4 +--
>  arch/powerpc/include/asm/xics.h|  2 +-
>  arch/powerpc/kernel/dbell.c| 47 ++
> 
>  arch/powerpc/kernel/smp.c  | 27 ++-
>  arch/powerpc/platforms/85xx/smp.c  |  9 +--
>  arch/powerpc/platforms/powermac/smp.c  |  2 +-
>  arch/powerpc/platforms/powernv/smp.c   | 32 +--
>  arch/powerpc/platforms/pseries/smp.c   | 28 
>  arch/powerpc/sysdev/xics/icp-hv.c  |  2 +-
>  arch/powerpc/sysdev/xics/icp-native.c  | 12 +
>  arch/powerpc/sysdev/xics/icp-opal.c|  2 +-
>  arch/powerpc/sysdev/xics/xics-common.c |  3 ---
>  13 files changed, 118 insertions(+), 90 deletions(-)
> 


Re: [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions

2017-03-13 Thread Till Smejkal
Hi Matthew,

On Mon, 13 Mar 2017, Matthew Wilcox wrote:
> On Mon, Mar 13, 2017 at 03:14:13PM -0700, Till Smejkal wrote:
> > +/**
> > + * Create a new VAS segment.
> > + *
> > + * @param[in] name:The name of the new VAS segment.
> > + * @param[in] start:   The address where the VAS segment 
> > begins.
> > + * @param[in] end: The address where the VAS segment ends.
> > + * @param[in] mode:The access rights for the VAS segment.
> > + *
> > + * @returns:   The VAS segment ID on success, -ERRNO 
> > otherwise.
> > + **/
> 
> Please follow the kernel-doc conventions, as described in
> Documentation/doc-guide/kernel-doc.rst.  Also, function documentation
> goes with the implementation, not the declaration.

Thank you for this pointer. I wasn't aware of this convention. I will change the
patches accordingly.
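
For reference, in kernel-doc style and placed above the function definition
(e.g. in mm/vas.c), the comment above would become roughly the following; the
signature is taken from the syscall prototype and the function body is elided:

/**
 * vas_seg_create() - Create a new VAS segment.
 * @name:  The name of the new VAS segment.
 * @start: The address where the VAS segment begins.
 * @end:   The address where the VAS segment ends.
 * @mode:  The access rights for the VAS segment.
 *
 * Return: The VAS segment ID on success, -ERRNO otherwise.
 */
int vas_seg_create(const char *name, unsigned long start, unsigned long end,
		   umode_t mode)
{
	/* ... */
}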

> > +/**
> > + * Get ID of the VAS segment belonging to a given name.
> > + *
> > + * @param[in] name:The name of the VAS segment for which 
> > the ID
> > + * should be returned.
> > + *
> > + * @returns:   The VAS segment ID on success, -ERRNO
> > + * otherwise.
> > + **/
> > +extern int vas_seg_find(const char *name);
> 
> So ... segments have names, and IDs ... and access permissions ...
> Why isn't this a special purpose filesystem?

We also thought about this. However, we decided against implementing them as a
special purpose filesystem, mainly because we could not think of a good way to
represent a VAS/VAS segment in such a file system (should they be represented
as a file or a directory?) and we weren't sure what a hierarchy in the
filesystem would mean for the underlying address spaces. Hence we decided
against it and used a combination of IDR and sysfs instead. However, I don't
have any strong feelings and would also reimplement them as a special purpose
filesystem if people would rather have them be one.

Till


Re: [RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions

2017-03-13 Thread Matthew Wilcox
On Mon, Mar 13, 2017 at 03:14:13PM -0700, Till Smejkal wrote:
> +/**
> + * Create a new VAS segment.
> + *
> + * @param[in] name:  The name of the new VAS segment.
> + * @param[in] start: The address where the VAS segment begins.
> + * @param[in] end:   The address where the VAS segment ends.
> + * @param[in] mode:  The access rights for the VAS segment.
> + *
> + * @returns: The VAS segment ID on success, -ERRNO otherwise.
> + **/

Please follow the kernel-doc conventions, as described in
Documentation/doc-guide/kernel-doc.rst.  Also, function documentation
goes with the implementation, not the declaration.

> +/**
> + * Get ID of the VAS segment belonging to a given name.
> + *
> + * @param[in] name:  The name of the VAS segment for which the ID
> + *   should be returned.
> + *
> + * @returns: The VAS segment ID on success, -ERRNO
> + *   otherwise.
> + **/
> +extern int vas_seg_find(const char *name);

So ... segments have names, and IDs ... and access permissions ...
Why isn't this a special purpose filesystem?



[RFC PATCH 13/13] fs/proc: Add procfs support for first class virtual address spaces

2017-03-13 Thread Till Smejkal
Add new files and directories to the procfs file system that contain
various information about the first class virtual address spaces attached
to the processes in the system.

To the procfs directory of each process in the system (/proc/$PID) an
additional directory named 'vas' is added that contains information about
all the VAS that are attached to this process. In this directory, each
attached VAS gets its own subdirectory with a file containing status
information about the attachment, a file with the current memory map of
the attached VAS, and a link to the sysfs folder of the underlying VAS.

Signed-off-by: Till Smejkal 
---
 fs/proc/base.c | 528 +
 fs/proc/inode.c|   1 +
 fs/proc/internal.h |   1 +
 mm/Kconfig |   9 +
 4 files changed, 539 insertions(+)

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 87c9a9aacda3..e60c13dd087c 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -45,6 +45,9 @@
  *
  *  Paul Mundt :
  *  Overall revision about smaps.
+ *
+ *  Till Smejkal :
+ *  Add entries for first class virtual address spaces.
  */
 
 #include 
@@ -87,6 +90,7 @@
 #include 
 #include 
 #include 
+#include 
 #ifdef CONFIG_HARDWALL
 #include 
 #endif
@@ -2841,6 +2845,527 @@ static int proc_pid_personality(struct seq_file *m, 
struct pid_namespace *ns,
return err;
 }
 
+#ifdef CONFIG_VAS_PROCFS
+
+/**
+ * Get a string representation of the access type to a VAS.
+ **/
+#define vas_access_type_str(type) ((type) & MAY_WRITE ?
\
+  ((type) & MAY_READ ? "rw" : "wo") : "ro")
+
+static int att_vas_show_status(struct seq_file *sf, void *unused)
+{
+   struct inode *inode = sf->private;
+   struct proc_inode *pi = PROC_I(inode);
+   struct task_struct *tsk;
+   struct vas_context *vas_ctx;
+   struct att_vas *avas;
+   int vid = pi->vas_id;
+
+   tsk = get_proc_task(inode);
+   if (!tsk)
+   return -ENOENT;
+
+   vas_ctx = tsk->vas_ctx;
+
+   vas_context_lock(vas_ctx);
+
+   list_for_each_entry(avas, &vas_ctx->vases, tsk_link) {
+   if (vid == avas->vas->id)
+   goto good_att_vas;
+   }
+
+   vas_context_unlock(vas_ctx);
+   put_task_struct(tsk);
+
+   return -ENOENT;
+
+good_att_vas:
+   seq_printf(sf,
+  "pid:  %d\n"
+  "vid:  %d\n"
+  "type: %s\n",
+  avas->tsk->pid, avas->vas->id,
+  vas_access_type_str(avas->type));
+
+   vas_context_unlock(vas_ctx);
+   put_task_struct(tsk);
+
+   return 0;
+}
+
+static int att_vas_show_status_open(struct inode *inode, struct file *file)
+{
+   return single_open(file, att_vas_show_status, inode);
+}
+
+static const struct file_operations att_vas_show_status_fops = {
+   .open   = att_vas_show_status_open,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= single_release,
+};
+
+static int att_vas_show_mappings(struct seq_file *sf, void *unused)
+{
+   struct inode *inode = sf->private;
+   struct proc_inode *pi = PROC_I(inode);
+   struct task_struct *tsk;
+   struct vas_context *vas_ctx;
+   struct att_vas *avas;
+   struct mm_struct *mm;
+   struct vm_area_struct *vma;
+   int vid = pi->vas_id;
+
+   tsk = get_proc_task(inode);
+   if (!tsk)
+   return -ENOENT;
+
+   vas_ctx = tsk->vas_ctx;
+
+   vas_context_lock(vas_ctx);
+
+   list_for_each_entry(avas, &vas_ctx->vases, tsk_link) {
+   if (avas->vas->id == vid)
+   goto good_att_vas;
+   }
+
+   vas_context_unlock(vas_ctx);
+   put_task_struct(tsk);
+
+   return -ENOENT;
+
+good_att_vas:
+   mm = avas->mm;
+
+   down_read(&mm->mmap_sem);
+
+   if (!mm->mmap) {
+   seq_puts(sf, "EMPTY\n");
+   goto out_unlock;
+   }
+
+   for (vma = mm->mmap; vma; vma = vma->vm_next) {
+   vm_flags_t flags = vma->vm_flags;
+   struct file *file = vma->vm_file;
+   unsigned long long pgoff = 0;
+
+   if (file)
+   pgoff = ((loff_t)vma->vm_pgoff) << PAGE_SHIFT;
+
+   seq_printf(sf, "%08lx-%08lx %c%c%c%c [%c:%c] %08llx",
+  vma->vm_start, vma->vm_end,
+  flags & VM_READ ? 'r' : '-',
+  flags & VM_WRITE ? 'w' : '-',
+  flags & VM_EXEC ? 'x' : '-',
+  flags & VM_MAYSHARE ? 's' : 'p',
+  vma->vas_reference ? 'v' : '-',
+  vma->vas_attached ? 'a' : '-',
+  pgoff);
+
+   seq_putc(sf, ' ');
+
+   if (file) {
+   

[RFC PATCH 12/13] mm/vas: Add lazy-attach support for first class virtual address spaces

2017-03-13 Thread Till Smejkal
Until now, whenever a task attaches a first class virtual address space,
all the memory regions currently present in the task are replicated into
the first class virtual address space so that the task can continue
executing as if nothing has changed. However, this technique causes the
attach and detach operations to be very costly, since the whole memory map
of the task has to be duplicated.

Lazy-attaching, on the other hand, uses a technique similar to the way page
tables are copied during fork. Instead of completely duplicating the memory
map of the task together with its page tables, only a skeleton memory map
is created, which is later filled with content when a page fault is
triggered because the process actually accesses the memory regions. The big
advantage is that unnecessary memory regions are not duplicated at all,
but just those that the process actually uses while executing inside the
first class virtual address space. The only memory region which is always
duplicated during the attach-operation is the code memory section, because
this memory region is always necessary for execution and saves us one page
fault later during the process execution.
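
A very rough sketch of the attach-side idea described above (not the actual
patch code; the helper name and the VM_EXEC test are made up for illustration,
and dup_page_range() is the function introduced in patch 09 of this series):

static int vas_attach_skeleton(struct mm_struct *vas_mm,
			       struct mm_struct *task_mm)
{
	struct vm_area_struct *vma, *new;

	for (vma = task_mm->mmap; vma; vma = vma->vm_next) {
		/* hypothetical helper: duplicate the VMA itself, but leave
		 * its page tables empty for now */
		new = duplicate_vma_skeleton(vas_mm, vma);
		if (!new)
			return -ENOMEM;

		if (vma->vm_flags & VM_EXEC)
			/* code is always needed for execution: populate it
			 * eagerly and save one page fault later */
			dup_page_range(vas_mm, new, task_mm, vma);
		/* everything else is filled in on demand when the fault
		 * handler calls vas_lazy_attach_vma() */
	}
	return 0;
}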

Signed-off-by: Till Smejkal 
---
 include/linux/mm_types.h |   1 +
 include/linux/vas.h  |  26 
 mm/Kconfig   |  18 ++
 mm/memory.c  |   5 ++
 mm/vas.c | 164 ++-
 5 files changed, 197 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 82bf78ea83ee..65e04f14225d 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -362,6 +362,7 @@ struct vm_area_struct {
 #ifdef CONFIG_VAS
struct mm_struct *vas_reference;
ktime_t vas_last_update;
+   bool vas_attached;
 #endif
 };
 
diff --git a/include/linux/vas.h b/include/linux/vas.h
index 376b9fa1ee27..8682bfc86568 100644
--- a/include/linux/vas.h
+++ b/include/linux/vas.h
@@ -2,6 +2,7 @@
 #define _LINUX_VAS_H
 
 
+#include 
 #include 
 #include 
 
@@ -293,4 +294,29 @@ static inline int vas_exit(struct task_struct *tsk) { 
return 0; }
 
 #endif /* CONFIG_VAS */
 
+
+/***
+ * Management of the VAS lazy attaching
+ ***/
+
+#ifdef CONFIG_VAS_LAZY_ATTACH
+
+/**
+ * Lazily update the page tables of a vm_area which was not completely setup
+ * during the VAS attaching.
+ *
+ * @param[in] vma: The vm_area for which the page tables should be
+ * setup before continuing the page fault handling.
+ *
+ * @returns:   0 if the lazy-attach was successful or not
+ * necessary, or 1 if something went wrong.
+ */
+extern int vas_lazy_attach_vma(struct vm_area_struct *vma);
+
+#else /* CONFIG_VAS_LAZY_ATTACH */
+
+static inline int vas_lazy_attach_vma(struct vm_area_struct *vma) { return 0; }
+
+#endif /* CONFIG_VAS_LAZY_ATTACH */
+
 #endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 9a80877f3536..934c56bcdbf4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -720,6 +720,24 @@ config VAS
 
  If not sure, then say N.
 
+config VAS_LAZY_ATTACH
+   bool "Use lazy-attach for First Class Virtual Address Spaces"
+   depends on VAS
+   default y
+   help
+ When this option is enabled, memory regions of First Class Virtual 
+ Address Spaces will be mapped in the task's address space lazily after
+ the switch happened. That means, the actual mapping will happen when a
+ page fault occurs for the particular memory region. While this
+ technique is less costly during the switching operation, it can become
+ very costly during the page fault handling.
+
+ Hence if the program uses a lot of different memory regions, this
+ lazy-attaching technique can be more costly than doing the mapping
+ eagerly during the switch.
+
+ If not sure, then say Y.
+
 config VAS_DEBUG
bool "Debugging output for First Class Virtual Address Spaces"
depends on VAS
diff --git a/mm/memory.c b/mm/memory.c
index e4747b3fd5b9..cdefc99a50ac 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -64,6 +64,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -4000,6 +4001,10 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned 
long address,
/* do counter updates before entering really critical section. */
check_sync_rss_stat(current);
 
+   /* Check if this VMA belongs to a VAS and needs to be lazy attached. */
+   if (unlikely(vas_lazy_attach_vma(vma)))
+   return VM_FAULT_SIGSEGV;
+
/*
 * Enable the memcg OOM handling for faults triggered in user
 * space.  Kernel faults are handled more gracefully.
diff --git a/mm/vas.c b/mm/vas.c
index 345b023c21aa..953ba8d6e603 100644
--- a/mm/vas.c
+++ b/mm/vas.c
@@ -138,12 +138,13 @@ static void __dump_memory_map(const char *title, struct 
mm_struct *mm)
 

[RFC PATCH 11/13] mm/vas: Introduce VAS segments - shareable address space regions

2017-03-13 Thread Till Smejkal
VAS segments are an extension to first class virtual address spaces that
can be used to share specific memory regions between multiple first class
virtual address spaces. VAS segments have a specific size and position in a
virtual address space and can thereby be used to share in-memory pointer
based data structures between multiple address spaces as well as other
in-memory data without the need to represent them in mmap-able files or
use shmem.

Similar to first class virtual address spaces, VAS segments must be created
and destroyed explicitly by a user. The system will never automatically
destroy or create a virtual segment. By attaching a VAS segment to a first
class virtual address space, the memory that is contained in the VAS
segment can be accessed and changed.
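
For illustration, userspace usage of the new system calls could look roughly
like this. The syscall numbers are the x86_64 ones from the table below; the
chosen addresses, the missing libc wrappers and the encoding of the 'type'
argument are guesses:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Numbers from the x86_64 syscall table added by this patch. */
#define __NR_vas_seg_create  341
#define __NR_vas_seg_find    343
#define __NR_vas_seg_attach  344

int main(void)
{
	/* Carve out a named, shareable piece of the virtual address space. */
	long sid = syscall(__NR_vas_seg_create, "shared-tree",
			   0x600000000000UL, 0x600001000000UL, 0600);
	if (sid < 0) {
		perror("vas_seg_create");
		return 1;
	}

	/* Another process could later look the segment up by name ... */
	long found = syscall(__NR_vas_seg_find, "shared-tree");

	/* ... and attach it to a previously created first class VAS (vid).
	 * The encoding of 'type' (read-only vs. read-write) is a guess. */
	if (syscall(__NR_vas_seg_attach, /* vid */ 1, found, /* type */ 0) < 0)
		perror("vas_seg_attach");

	return 0;
}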

Signed-off-by: Till Smejkal 
Signed-off-by: Marco Benatto 
---
 arch/x86/entry/syscalls/syscall_32.tbl |7 +
 arch/x86/entry/syscalls/syscall_64.tbl |7 +
 include/linux/syscalls.h   |   10 +
 include/linux/vas.h|  114 +++
 include/linux/vas_types.h  |   91 ++-
 include/uapi/asm-generic/unistd.h  |   16 +-
 include/uapi/linux/vas.h   |   12 +
 kernel/sys_ni.c|7 +
 mm/vas.c   | 1234 ++--
 9 files changed, 1451 insertions(+), 47 deletions(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 8c553eef8c44..a4f91d14a856 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -398,3 +398,10 @@
 389i386active_vas  sys_active_vas
 390i386vas_getattr sys_vas_getattr
 391i386vas_setattr sys_vas_setattr
+392i386vas_seg_create  sys_vas_seg_create
+393i386vas_seg_delete  sys_vas_seg_delete
+394i386vas_seg_findsys_vas_seg_find
+395i386vas_seg_attach  sys_vas_seg_attach
+396i386vas_seg_detach  sys_vas_seg_detach
+397i386vas_seg_getattr sys_vas_seg_getattr
+398i386vas_seg_setattr sys_vas_seg_setattr
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index 72f1f0495710..a0f9503c3d28 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -347,6 +347,13 @@
 338common  active_vas  sys_active_vas
 339common  vas_getattr sys_vas_getattr
 340common  vas_setattr sys_vas_setattr
+341common  vas_seg_create  sys_vas_seg_create
+342common  vas_seg_delete  sys_vas_seg_delete
+343common  vas_seg_findsys_vas_seg_find
+344common  vas_seg_attach  sys_vas_seg_attach
+345common  vas_seg_detach  sys_vas_seg_detach
+346common  vas_seg_getattr sys_vas_seg_getattr
+347common  vas_seg_setattr sys_vas_seg_setattr
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index fdea27d37c96..7380dcdc4bc1 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -66,6 +66,7 @@ struct perf_event_attr;
 struct file_handle;
 struct sigaltstack;
 struct vas_attr;
+struct vas_seg_attr;
 union bpf_attr;
 
 #include 
@@ -914,4 +915,13 @@ asmlinkage long sys_active_vas(void);
 asmlinkage long sys_vas_getattr(int vid, struct vas_attr __user *attr);
 asmlinkage long sys_vas_setattr(int vid, struct vas_attr __user *attr);
 
+asmlinkage long sys_vas_seg_create(const char __user *name, unsigned long 
start,
+  unsigned long end, umode_t mode);
+asmlinkage long sys_vas_seg_delete(int sid);
+asmlinkage long sys_vas_seg_find(const char __user *name);
+asmlinkage long sys_vas_seg_attach(int vid, int sid, int type);
+asmlinkage long sys_vas_seg_detach(int vid, int sid);
+asmlinkage long sys_vas_seg_getattr(int sid, struct vas_seg_attr __user *attr);
+asmlinkage long sys_vas_seg_setattr(int sid, struct vas_seg_attr __user *attr);
+
 #endif
diff --git a/include/linux/vas.h b/include/linux/vas.h
index 6a72e42f96d2..376b9fa1ee27 100644
--- a/include/linux/vas.h
+++ b/include/linux/vas.h
@@ -138,6 +138,120 @@ extern int vas_setattr(int vid, struct vas_attr *attr);
 
 
 /***
+ * Management of VAS segments
+ ***/
+
+/**
+ * Lock and unlock helper for VAS segments.
+ **/
+#define vas_seg_lock(seg) mutex_lock(&(seg)->mtx)
+#define vas_seg_unlock(seg) mutex_unlock(&(seg)->mtx)
+
+/**
+ * Create a new VAS segment.
+ *
+ * @param[in] name:The name of the new VAS segment.
+ * @param[in] start:   The address where the VAS segment begins.
+ * @param[in] end: The address where the VAS segment ends.
+ * @param[in] mode:The access rights for the VAS segment.
+ *

[RFC PATCH 10/13] mm: Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
Introduce a different type of address spaces which are first class citizens
in the OS. That means that the kernel now handles two types of AS, those
which are closely coupled with a process and those which aren't. While the
former ones are created and destroyed together with the process by the
kernel and are the default type of AS in the Linux kernel, the latter ones
have to be managed explicitly by the user and are the newly introduced
type.

Accordingly, a first class AS (also called VAS == virtual address space)
can exist in the OS independently from any process. A user has to
explicitly create and destroy them in the system. Processes and VAS can be
combined by attaching a previously created VAS to a process which basically
adds an additional AS to the process that the process' threads are able to
execute in. Hence, VAS allow a process to have different views onto the
main memory of the system (its original AS and the attached VAS) between
which its threads can switch arbitrarily during their lifetime.

The functionality made available through first class virtual address spaces
can be used in various different ways. One possible way to utilize VAS is
to compartmentalize a process for security reasons. Another possible usage
is to improve the performance of data-centric applications by being able to
manage different sets of data in memory without the need to map or unmap
them.

Furthermore, first class virtual address spaces can be attached to
different processes at the same time if the underlying memory is only
readable. This mechanism allows sharing of whole address spaces between
multiple processes that can both execute in them using the contained
memory.
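
As a small illustration of the user-visible side, the following hypothetical
snippet asks the kernel which address space the calling thread currently
executes in. The syscall number is the x86_64 one from the table below; that a
return value of 0 means the process' original address space is an assumption:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#define __NR_active_vas 338   /* from the x86_64 syscall table in this patch */

int main(void)
{
	long vid = syscall(__NR_active_vas);

	if (vid < 0)
		perror("active_vas");
	else if (vid == 0)
		printf("executing in the process' original address space\n");
	else
		printf("executing in attached VAS %ld\n", vid);

	return 0;
}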

Signed-off-by: Till Smejkal 
Signed-off-by: Marco Benatto 
---
 MAINTAINERS|   10 +
 arch/x86/entry/syscalls/syscall_32.tbl |9 +
 arch/x86/entry/syscalls/syscall_64.tbl |9 +
 fs/exec.c  |3 +
 include/linux/mm_types.h   |8 +
 include/linux/sched.h  |   17 +
 include/linux/syscalls.h   |   11 +
 include/linux/vas.h|  182 +++
 include/linux/vas_types.h  |   88 ++
 include/uapi/asm-generic/unistd.h  |   20 +-
 include/uapi/linux/Kbuild  |1 +
 include/uapi/linux/vas.h   |   16 +
 init/main.c|2 +
 kernel/exit.c  |2 +
 kernel/fork.c  |   28 +-
 kernel/sys_ni.c|   11 +
 mm/Kconfig |   20 +
 mm/Makefile|1 +
 mm/internal.h  |8 +
 mm/memory.c|3 +
 mm/mmap.c  |   22 +
 mm/vas.c   | 2188 
 22 files changed, 2657 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/vas.h
 create mode 100644 include/linux/vas_types.h
 create mode 100644 include/uapi/linux/vas.h
 create mode 100644 mm/vas.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 527d13759ecc..060b1c64e67a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5040,6 +5040,16 @@ F:   Documentation/firmware_class/
 F: drivers/base/firmware*.c
 F: include/linux/firmware.h
 
+FIRST CLASS VIRTUAL ADDRESS SPACES
+M: Till Smejkal 
+L: linux-ker...@vger.kernel.org
+L: linux...@kvack.org
+S: Maintained
+F: include/linux/vas_types.h
+F: include/linux/vas.h
+F: include/uapi/linux/vas.h
+F: mm/vas.c
+
 FLASH ADAPTER DRIVER (IBM Flash Adapter 900GB Full Height PCI Flash Card)
 M: Joshua Morris 
 M: Philip Kelleher 
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl 
b/arch/x86/entry/syscalls/syscall_32.tbl
index 2b3618542544..8c553eef8c44 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -389,3 +389,12 @@
 380i386pkey_mprotect   sys_pkey_mprotect
 381i386pkey_alloc  sys_pkey_alloc
 382i386pkey_free   sys_pkey_free
+383i386vas_create  sys_vas_create
+384i386vas_delete  sys_vas_delete
+385i386vas_findsys_vas_find
+386i386vas_attach  sys_vas_attach
+387i386vas_detach  sys_vas_detach
+388i386vas_switch  sys_vas_switch
+389i386active_vas  sys_active_vas
+390i386vas_getattr sys_vas_getattr
+391i386vas_setattr sys_vas_setattr
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
b/arch/x86/entry/syscalls/syscall_64.tbl
index e93ef0b38db8..72f1f0495710 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -338,6 +338,15 @@
 329

[RFC PATCH 09/13] mm/memory: Add function to one-to-one duplicate page ranges

2017-03-13 Thread Till Smejkal
Add a new function to duplicate a page table range of one memory map
one-to-one into another memory map. The new function 'dup_page_range' copies the
page table entries for the specified region from the page table of the
source memory map to the page table of the destination memory map and
thereby allows actual sharing of the referenced memory pages instead of
relying on copy-on-write for anonymous memory pages or page faults for
read-only memory pages as it is done by the existing function
'copy_page_range'. Hence, 'dup_page_range' will produce shared pages
between two address spaces whereas 'copy_page_range' will result in copies
of pages if necessary.

Preexisting mappings in the page table of the destination memory map are
properly zapped by the 'dup_page_range' function if they differ from the
ones in the source memory map before they are replaced with the new ones.
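
As a purely hypothetical illustration of the difference, a caller such as the
VAS attach path could do something along these lines; the two prototypes match
the ones added by this patch, everything else (the function and the predicate)
is made up:

int attach_vma(struct mm_struct *vas_mm, struct vm_area_struct *new_vma,
	       struct mm_struct *task_mm, struct vm_area_struct *old_vma)
{
	if (vma_wants_real_sharing(new_vma))    /* hypothetical predicate */
		/* share the pages one-to-one between the two address spaces */
		return dup_page_range(vas_mm, new_vma, task_mm, old_vma);

	/* fork-style behaviour: COW for anonymous pages, faults for the rest */
	return copy_page_range(vas_mm, task_mm, old_vma);
}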

Signed-off-by: Till Smejkal 
---
 include/linux/huge_mm.h |   6 +
 include/linux/hugetlb.h |   5 +
 include/linux/mm.h  |   2 +
 mm/huge_memory.c|  65 +++
 mm/hugetlb.c| 205 +++--
 mm/memory.c | 461 +---
 6 files changed, 620 insertions(+), 124 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 94a0e680b7d7..52c0498426ef 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -5,6 +5,12 @@ extern int do_huge_pmd_anonymous_page(struct vm_fault *vmf);
 extern int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
 struct vm_area_struct *vma);
+extern int dup_huge_pmd(struct mm_struct *dst_mm,
+   struct vm_area_struct *dst_vma,
+   struct mm_struct *src_mm,
+   struct vm_area_struct *src_vma,
+   struct mmu_gather *tlb, pmd_t *dst_pmd, pmd_t *src_pmd,
+   unsigned long addr);
 extern void huge_pmd_set_accessed(struct vm_fault *vmf, pmd_t orig_pmd);
 extern int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd);
 extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 72260cc252f2..d8eb682e39a1 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -63,6 +63,10 @@ int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int,
 #endif
 
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct 
vm_area_struct *);
+int dup_hugetlb_page_range(struct mm_struct *dst_mm,
+  struct vm_area_struct *dst_vma,
+  struct mm_struct *src_mm,
+  struct vm_area_struct *src_vma);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 struct page **, struct vm_area_struct **,
 unsigned long *, unsigned long *, long, unsigned int);
@@ -134,6 +138,7 @@ static inline unsigned long hugetlb_total_pages(void)
 #define follow_hugetlb_page(m,v,p,vs,a,b,i,w)  ({ BUG(); 0; })
 #define follow_huge_addr(mm, addr, write)  ERR_PTR(-EINVAL)
 #define copy_hugetlb_page_range(src, dst, vma) ({ BUG(); 0; })
+#define dup_hugetlb_page_range(dst, dst_vma, src, src_vma) ({ BUG(); 0; })
 static inline void hugetlb_report_meminfo(struct seq_file *m)
 {
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 92925d97da20..b39ec795f64c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1208,6 +1208,8 @@ void free_pgd_range(struct mmu_gather *tlb, unsigned long 
addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
struct vm_area_struct *vma);
+int dup_page_range(struct mm_struct *dst, struct vm_area_struct *dst_vma,
+  struct mm_struct *src, struct vm_area_struct *src_vma);
 void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows);
 int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5b2604867e5..1edf8c6d1814 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -887,6 +887,71 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct 
mm_struct *src_mm,
return ret;
 }
 
+int dup_huge_pmd(struct mm_struct *dst_mm, struct vm_area_struct *dst_vma,
+struct mm_struct *src_mm, struct vm_area_struct *src_vma,
+struct mmu_gather *tlb, pmd_t *dst_pmd, pmd_t *src_pmd,
+unsigned long addr)
+{
+   spinlock_t *dst_ptl, *src_ptl;
+   struct page *page;
+   pmd_t pmd;
+   pgtable_t pgtable;
+   int ret;
+
+   pgtable = pte_alloc_one(dst_mm, addr);
+   if (!pgtable)
+   return 

[RFC PATCH 08/13] kernel/fork: Define explicitly which mm_struct to duplicate during fork

2017-03-13 Thread Till Smejkal
The dup_mm() function used during 'do_fork' to duplicate the current task's
mm_struct for the newly forked task always implicitly uses current->mm for
this purpose. However, during copy_mm it was already decided which
mm_struct to copy/duplicate. So pass this mm_struct to dup_mm instead of
again deciding which mm_struct to use.

Signed-off-by: Till Smejkal 
---
 kernel/fork.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 9209f6d5d7c0..d3087d870855 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1158,9 +1158,10 @@ void mm_release(struct task_struct *tsk, struct 
mm_struct *mm)
  * Allocate a new mm structure and copy contents from the
  * mm structure of the passed in task structure.
  */
-static struct mm_struct *dup_mm(struct task_struct *tsk)
+static struct mm_struct *dup_mm(struct task_struct *tsk,
+   struct mm_struct *oldmm)
 {
-   struct mm_struct *mm, *oldmm = current->mm;
+   struct mm_struct *mm;
int err;
 
mm = allocate_mm();
@@ -1226,7 +1227,7 @@ static int copy_mm(unsigned long clone_flags, struct 
task_struct *tsk)
}
 
retval = -ENOMEM;
-   mm = dup_mm(tsk);
+   mm = dup_mm(tsk, oldmm);
if (!mm)
goto fail_nomem;
 
-- 
2.12.0



[RFC PATCH 07/13] kernel/fork: Split and export 'mm_alloc' and 'mm_init'

2017-03-13 Thread Till Smejkal
The only way until now to create a new memory map was via the exported
function 'mm_alloc'. Unfortunately, this function not only allocates a new
memory map, but also completely initializes it. However, with the
introduction of first class virtual address spaces, some initialization
steps done in 'mm_alloc' are not applicable to the memory maps needed for
this feature and hence would lead to errors in the kernel code.

Instead of introducing a new function that can allocate and initialize
memory maps for first class virtual address spaces and potentially
duplicate some code, I decided to split the mm_alloc function as well as
the 'mm_init' function that it uses.

Now there are four functions exported instead of only one. The new
'mm_alloc' function only allocates a new mm_struct and zeros it out. If one
wants to have the old behavior of mm_alloc, one can use the newly introduced
function 'mm_alloc_and_setup' which not only allocates a new mm_struct but
also fully initializes it.

The old 'mm_init' function which fully initialized a mm_struct was split up
into two separate functions. The first one - 'mm_setup' - does all the
initialization of the mm_struct that is not related to the task to which
the mm_struct belongs. The task-related part of the initialization is done
in the 'mm_set_task' function. This way it is possible to create memory maps
that don't have any task-specific information as needed by the first class
virtual address space feature. Both functions, 'mm_setup' and 'mm_set_task'
are also exported, so that they can be used in all files in the source
tree.
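
Roughly, the four exported helpers are expected to compose as in the following
simplified sketch (the exact error handling in the patch differs):

struct mm_struct *mm_alloc(void)
{
	struct mm_struct *mm = allocate_mm();   /* just allocate ... */

	if (!mm)
		return NULL;

	memset(mm, 0, sizeof(*mm));             /* ... and zero it out */
	return mm;
}

struct mm_struct *mm_alloc_and_setup(void)
{
	struct mm_struct *mm = mm_alloc();

	if (!mm)
		return NULL;

	if (!mm_setup(mm))                      /* task-independent init */
		return NULL;

	/* tie the mm to the current task and its user namespace */
	return mm_set_task(mm, current, current_user_ns());
}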

Signed-off-by: Till Smejkal 
---
 arch/arm/mach-rpc/ecard.c |  2 +-
 fs/exec.c |  2 +-
 include/linux/sched.h |  7 +-
 kernel/fork.c | 64 +--
 4 files changed, 59 insertions(+), 16 deletions(-)

diff --git a/arch/arm/mach-rpc/ecard.c b/arch/arm/mach-rpc/ecard.c
index dc67a7fb3831..15845e8abd7e 100644
--- a/arch/arm/mach-rpc/ecard.c
+++ b/arch/arm/mach-rpc/ecard.c
@@ -245,7 +245,7 @@ static void ecard_init_pgtables(struct mm_struct *mm)
 
 static int ecard_init_mm(void)
 {
-   struct mm_struct * mm = mm_alloc();
+   struct mm_struct *mm = mm_alloc_and_setup();
struct mm_struct *active_mm = current->active_mm;
 
if (!mm)
diff --git a/fs/exec.c b/fs/exec.c
index e57946610733..68d7908a1e5a 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -380,7 +380,7 @@ static int bprm_mm_init(struct linux_binprm *bprm)
int err;
struct mm_struct *mm = NULL;
 
-   bprm->mm = mm = mm_alloc();
+   bprm->mm = mm = mm_alloc_and_setup();
err = -ENOMEM;
if (!mm)
goto err;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42b9b93a50ac..7955adc00397 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2922,7 +2922,12 @@ static inline unsigned long sigsp(unsigned long sp, 
struct ksignal *ksig)
 /*
  * Routines for handling mm_structs
  */
-extern struct mm_struct * mm_alloc(void);
+extern struct mm_struct *mm_setup(struct mm_struct *mm);
+extern struct mm_struct *mm_set_task(struct mm_struct *mm,
+struct task_struct *p,
+struct user_namespace *user_ns);
+extern struct mm_struct *mm_alloc(void);
+extern struct mm_struct *mm_alloc_and_setup(void);
 
 /* mmdrop drops the mm and the page tables */
 extern void __mmdrop(struct mm_struct *);
diff --git a/kernel/fork.c b/kernel/fork.c
index 11c5c8ab827c..9209f6d5d7c0 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -747,8 +747,10 @@ static void mm_init_owner(struct mm_struct *mm, struct 
task_struct *p)
 #endif
 }
 
-static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
-   struct user_namespace *user_ns)
+/**
+ * Initialize all the task-unrelated fields of a mm_struct.
+ **/
+struct mm_struct *mm_setup(struct mm_struct *mm)
 {
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
@@ -767,24 +769,37 @@ static struct mm_struct *mm_init(struct mm_struct *mm, 
struct task_struct *p,
	spin_lock_init(&mm->page_table_lock);
mm_init_cpumask(mm);
mm_init_aio(mm);
-   mm_init_owner(mm, p);
mmu_notifier_mm_init(mm);
clear_tlb_flush_pending(mm);
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
mm->pmd_huge_pte = NULL;
 #endif
 
+   mm->flags = default_dump_filter;
+   mm->def_flags = 0;
+
+   if (mm_alloc_pgd(mm))
+   goto fail_nopgd;
+
+   return mm;
+
+fail_nopgd:
+   free_mm(mm);
+   return NULL;
+}
+
+/**
+ * Initialize all the task-related fields of a mm_struct.
+ **/
+struct mm_struct *mm_set_task(struct mm_struct *mm, struct task_struct *p,
+ struct user_namespace *user_ns)
+{
if (current->mm) {
mm->flags = current->mm->flags & MMF_INIT_MASK;
mm->def_flags = 

[RFC PATCH 06/13] mm/mmap: Export 'vma_link' and 'find_vma_links' to mm subsystem

2017-03-13 Thread Till Smejkal
Make the functions 'vma_link' and 'find_vma_links' accessible to other
source files in the mm/ source directory of the kernel so that other files
in that directory can also perform low level changes to mm_struct data
structures.

Signed-off-by: Till Smejkal 
---
 mm/internal.h | 11 +++
 mm/mmap.c | 12 ++--
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 7aa2ea0a8623..e22cb031b45b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -76,6 +76,17 @@ static inline void set_page_refcounted(struct page *page)
 extern unsigned long highest_memmap_pfn;
 
 /*
+ * in mm/mmap.c
+ */
+extern void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
+struct vm_area_struct *prev, struct rb_node **rb_link,
+struct rb_node *rb_parent);
+extern int find_vma_links(struct mm_struct *mm, unsigned long addr,
+ unsigned long end, struct vm_area_struct **pprev,
+ struct rb_node ***rb_link,
+ struct rb_node **rb_parent);
+
+/*
  * in mm/vmscan.c:
  */
 extern int isolate_lru_page(struct page *page);
diff --git a/mm/mmap.c b/mm/mmap.c
index 3f60c8ebd6b6..d35c6b51cadf 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -466,9 +466,9 @@ anon_vma_interval_tree_post_update_vma(struct 
vm_area_struct *vma)
	anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
 }
 
-static int find_vma_links(struct mm_struct *mm, unsigned long addr,
-   unsigned long end, struct vm_area_struct **pprev,
-   struct rb_node ***rb_link, struct rb_node **rb_parent)
+int find_vma_links(struct mm_struct *mm, unsigned long addr,
+  unsigned long end, struct vm_area_struct **pprev,
+  struct rb_node ***rb_link, struct rb_node **rb_parent)
 {
struct rb_node **__rb_link, *__rb_parent, *rb_prev;
 
@@ -580,9 +580,9 @@ __vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
__vma_link_rb(mm, vma, rb_link, rb_parent);
 }
 
-static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
-   struct vm_area_struct *prev, struct rb_node **rb_link,
-   struct rb_node *rb_parent)
+void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct vm_area_struct *prev, struct rb_node **rb_link,
+ struct rb_node *rb_parent)
 {
struct address_space *mapping = NULL;
 
-- 
2.12.0



[RFC PATCH 05/13] mm: Add mm_struct argument to 'mm_populate' and '__mm_populate'

2017-03-13 Thread Till Smejkal
Add to the 'mm_populate' and '__mm_populate' functions an additional
argument specifying which mm_struct they should use during their execution. Before,
these functions simply used the memory map of the current task. However,
with the introduction of first class virtual address spaces, both
functions also need to be able to operate on other memory maps than just
the one of the current task. Accordingly, it is now possible to specify
explicitly which memory map these functions should use via an additional
argument.

Signed-off-by: Till Smejkal 
---
 arch/x86/mm/mpx.c  |  2 +-
 include/linux/mm.h | 13 -
 ipc/shm.c  |  9 +
 mm/gup.c   |  4 ++--
 mm/mlock.c | 21 +++--
 mm/mmap.c  |  6 +++---
 mm/mremap.c|  2 +-
 mm/util.c  |  2 +-
 8 files changed, 32 insertions(+), 27 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 99c664a97c35..b46f7cdbdad8 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -54,7 +54,7 @@ static unsigned long mpx_mmap(unsigned long len)
   MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate);
	up_write(&mm->mmap_sem);
if (populate)
-   mm_populate(addr, populate);
+   mm_populate(mm, addr, populate);
 
return addr;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1520da8f9c67..92925d97da20 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2040,15 +2040,18 @@ do_mmap_pgoff(struct mm_struct *mm, struct file *file, 
unsigned long addr,
 }
 
 #ifdef CONFIG_MMU
-extern int __mm_populate(unsigned long addr, unsigned long len,
-int ignore_errors);
-static inline void mm_populate(unsigned long addr, unsigned long len)
+extern int __mm_populate(struct mm_struct *mm, unsigned long addr,
+unsigned long len, int ignore_errors);
+static inline void mm_populate(struct mm_struct *mm, unsigned long addr,
+  unsigned long len)
 {
/* Ignore errors */
-   (void) __mm_populate(addr, len, 1);
+   (void) __mm_populate(mm, addr, len, 1);
 }
 #else
-static inline void mm_populate(unsigned long addr, unsigned long len) {}
+static inline void mm_populate(struct mm_struct *mm, unsigned long addr,
+  unsigned long len)
+{}
 #endif
 
 /* These take the mm semaphore themselves */
diff --git a/ipc/shm.c b/ipc/shm.c
index 2fd73cda4ec8..be692e0abe79 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1106,6 +1106,7 @@ long do_shmat(int shmid, char __user *shmaddr, int 
shmflg, ulong *raddr,
struct shm_file_data *sfd;
struct path path;
fmode_t f_mode;
+   struct mm_struct *mm = current->mm;
unsigned long populate = 0;
 
err = -EINVAL;
@@ -1208,7 +1209,7 @@ long do_shmat(int shmid, char __user *shmaddr, int 
shmflg, ulong *raddr,
if (err)
goto out_fput;
 
-   if (down_write_killable(&current->mm->mmap_sem)) {
+   if (down_write_killable(&mm->mmap_sem)) {
err = -EINTR;
goto out_fput;
}
@@ -1218,7 +1219,7 @@ long do_shmat(int shmid, char __user *shmaddr, int 
shmflg, ulong *raddr,
if (addr + size < addr)
goto invalid;
 
-   if (find_vma_intersection(current->mm, addr, addr + size))
+   if (find_vma_intersection(mm, addr, addr + size))
goto invalid;
}
 
@@ -1229,9 +1230,9 @@ long do_shmat(int shmid, char __user *shmaddr, int 
shmflg, ulong *raddr,
if (IS_ERR_VALUE(addr))
err = (long)addr;
 invalid:
-   up_write(&current->mm->mmap_sem);
+   up_write(&mm->mmap_sem);
if (populate)
-   mm_populate(addr, populate);
+   mm_populate(mm, addr, populate);
 
 out_fput:
fput(file);
diff --git a/mm/gup.c b/mm/gup.c
index 5531489d..ca5ba2703b40 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1053,9 +1053,9 @@ long populate_vma_page_range(struct vm_area_struct *vma,
  * flags. VMAs must be already marked with the desired vm_flags, and
  * mmap_sem must not be held.
  */
-int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
+int __mm_populate(struct mm_struct *mm, unsigned long start, unsigned long len,
+ int ignore_errors)
 {
-   struct mm_struct *mm = current->mm;
unsigned long end, nstart, nend;
struct vm_area_struct *vma = NULL;
int locked = 0;
diff --git a/mm/mlock.c b/mm/mlock.c
index cdbed8aaa426..9d74948c7b22 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -664,6 +664,7 @@ static int count_mm_mlocked_page_nr(struct mm_struct *mm,
 
 static __must_check int do_mlock(unsigned long start, size_t len, vm_flags_t 
flags)
 {
+   struct mm_struct *mm = current->mm;
unsigned long locked;
unsigned long lock_limit;
int error = -ENOMEM;
@@ -680,10 +681,10 @@ static __must_check int 

[RFC PATCH 04/13] mm: Add mm_struct argument to 'get_unmapped_area' and 'vm_unmapped_area'

2017-03-13 Thread Till Smejkal
Add the mm_struct for which an unmapped area should be found as an
explicit argument to the 'get_unmapped_area' function. Previously, the
function simply searched for an unmapped area in the memory map of the
current task. However, with the introduction of first class virtual
address spaces, it is necessary that get_unmapped_area also can look for
unmapped area in memory maps other than the one of the current task.

Changing the signature of the generic 'get_unmapped_area' function also
requires that all the 'arch_get_unmapped_area' functions as well as the
'vm_unmapped_area' function with its dependents have to take the memory
map that they should work on as additional argument. Simply using the one
of the current task, as these functions did before, is not correct anymore
and leads to incorrect results.

Signed-off-by: Till Smejkal 
---
 arch/alpha/kernel/osf_sys.c  | 19 ++--
 arch/arc/mm/mmap.c   |  8 ++---
 arch/arm/kernel/process.c|  2 +-
 arch/arm/mm/mmap.c   | 19 ++--
 arch/arm64/kernel/vdso.c |  2 +-
 arch/blackfin/include/asm/pgtable.h  |  3 +-
 arch/blackfin/kernel/sys_bfin.c  |  5 ++--
 arch/frv/mm/elf-fdpic.c  | 11 +++
 arch/hexagon/kernel/vdso.c   |  2 +-
 arch/ia64/kernel/perfmon.c   |  3 +-
 arch/ia64/kernel/sys_ia64.c  |  6 ++--
 arch/ia64/mm/hugetlbpage.c   |  7 +++--
 arch/metag/mm/hugetlbpage.c  | 11 +++
 arch/mips/kernel/vdso.c  |  2 +-
 arch/mips/mm/mmap.c  | 27 +
 arch/parisc/kernel/sys_parisc.c  | 19 ++--
 arch/parisc/mm/hugetlbpage.c |  7 +++--
 arch/powerpc/include/asm/book3s/64/hugetlb.h |  6 ++--
 arch/powerpc/include/asm/page_64.h   |  3 +-
 arch/powerpc/kernel/vdso.c   |  2 +-
 arch/powerpc/mm/hugetlbpage-radix.c  |  9 +++---
 arch/powerpc/mm/hugetlbpage.c|  9 +++---
 arch/powerpc/mm/mmap.c   | 17 +--
 arch/powerpc/mm/slice.c  | 25 
 arch/s390/kernel/vdso.c  |  3 +-
 arch/s390/mm/mmap.c  | 42 +-
 arch/sh/kernel/vsyscall/vsyscall.c   |  2 +-
 arch/sh/mm/mmap.c| 19 ++--
 arch/sparc/include/asm/pgtable_64.h  |  4 +--
 arch/sparc/kernel/sys_sparc_32.c |  6 ++--
 arch/sparc/kernel/sys_sparc_64.c | 31 +++-
 arch/sparc/mm/hugetlbpage.c  | 26 
 arch/tile/kernel/vdso.c  |  2 +-
 arch/tile/mm/hugetlbpage.c   | 26 
 arch/x86/entry/vdso/vma.c|  2 +-
 arch/x86/kernel/sys_x86_64.c | 19 ++--
 arch/x86/mm/hugetlbpage.c| 26 
 arch/xtensa/kernel/syscall.c |  7 +++--
 drivers/char/mem.c   | 15 ++
 drivers/dax/dax.c| 10 +++
 drivers/media/usb/uvc/uvc_v4l2.c |  6 ++--
 drivers/media/v4l2-core/v4l2-dev.c   |  8 ++---
 drivers/media/v4l2-core/videobuf2-v4l2.c |  5 ++--
 drivers/mtd/mtdchar.c|  3 +-
 drivers/usb/gadget/function/uvc_v4l2.c   |  3 +-
 fs/hugetlbfs/inode.c |  8 ++---
 fs/proc/inode.c  | 10 +++
 fs/ramfs/file-mmu.c  |  5 ++--
 fs/ramfs/file-nommu.c| 10 ---
 fs/romfs/mmap-nommu.c|  3 +-
 include/linux/fs.h   |  2 +-
 include/linux/huge_mm.h  |  6 ++--
 include/linux/hugetlb.h  |  5 ++--
 include/linux/mm.h   | 16 ++
 include/linux/mm_types.h |  7 +++--
 include/linux/sched.h| 10 +++
 include/linux/shmem_fs.h |  5 ++--
 include/media/v4l2-dev.h |  3 +-
 include/media/videobuf2-v4l2.h   |  5 ++--
 ipc/shm.c| 10 +++
 kernel/events/uprobes.c  |  2 +-
 mm/huge_memory.c | 18 +++-
 mm/mmap.c| 44 ++--
 mm/mremap.c  | 11 +++
 mm/nommu.c   | 10 ---
 mm/shmem.c   | 14 -
 sound/core/pcm_native.c  |  3 +-
 67 files changed, 370 insertions(+), 326 deletions(-)

diff --git a/arch/alpha/kernel/osf_sys.c b/arch/alpha/kernel/osf_sys.c
index 54d8616644e2..281109bcdc5d 100644
--- 

[RFC PATCH 03/13] mm: Rename 'unmap_region' and add mm_struct argument

2017-03-13 Thread Till Smejkal
Rename the 'unmap_region' function to 'munmap_region' so that it uses the
same naming pattern as the do_mmap <-> mmap_region couple. In addition
also make the new 'munmap_region' function publicly available to all other
kernel sources.

In addition, add the mm_struct it should operate on to the function as an
additional argument. Before, the function simply used the memory map of
the current task. However, with the introduction of first class virtual
address spaces, munmap_region also needs to be able to operate on other
memory maps than just the current task's one. Accordingly, add a new
argument to the function so that one can define explicitly which memory
map should be used.

Signed-off-by: Till Smejkal 
---
 include/linux/mm.h |  4 
 mm/mmap.c  | 14 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fb11be77545f..71a90604d21f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2023,6 +2023,10 @@ extern unsigned long do_mmap(struct mm_struct *mm, 
struct file *file,
unsigned long addr, unsigned long len, unsigned long prot,
unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,
unsigned long *populate);
+
+extern void munmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
+ struct vm_area_struct *prev, unsigned long start,
+ unsigned long end);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 static inline unsigned long
diff --git a/mm/mmap.c b/mm/mmap.c
index 70028bf7b58d..ea79bc4da5b7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -70,10 +70,6 @@ int mmap_rnd_compat_bits __read_mostly = 
CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
 static bool ignore_rlimit_data;
 core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);
 
-static void unmap_region(struct mm_struct *mm,
-   struct vm_area_struct *vma, struct vm_area_struct *prev,
-   unsigned long start, unsigned long end);
-
 /* description of effects of mapping type and prot in current implementation.
  * this is due to the limited x86 page protection hardware.  The expected
  * behavior is in parens:
@@ -1731,7 +1727,7 @@ unsigned long mmap_region(struct mm_struct *mm, struct 
file *file,
fput(file);
 
/* Undo any partial mapping done by a device driver. */
-   unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
+   munmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
charged = 0;
if (vm_flags & VM_SHARED)
mapping_unmap_writable(file->f_mapping);
@@ -2447,9 +2443,9 @@ static void remove_vma_list(struct mm_struct *mm, struct 
vm_area_struct *vma)
  *
  * Called with the mm semaphore held.
  */
-static void unmap_region(struct mm_struct *mm,
-   struct vm_area_struct *vma, struct vm_area_struct *prev,
-   unsigned long start, unsigned long end)
+void munmap_region(struct mm_struct *mm, struct vm_area_struct *vma,
+   struct vm_area_struct *prev, unsigned long start,
+   unsigned long end)
 {
struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
struct mmu_gather tlb;
@@ -2654,7 +2650,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, 
size_t len)
 * Remove the vma's, and unmap the actual pages
 */
detach_vmas_to_be_unmapped(mm, vma, prev, end);
-   unmap_region(mm, vma, prev, start, end);
+   munmap_region(mm, vma, prev, start, end);
 
arch_unmap(mm, vma, start, end);
 
-- 
2.12.0



[RFC PATCH 02/13] mm: Add mm_struct argument to 'do_mmap' and 'do_mmap_pgoff'

2017-03-13 Thread Till Smejkal
Add to the 'do_mmap' and 'do_mmap_pgoff' functions the mm_struct they
should operate on as an additional argument. Before, both functions simply
used the memory map of the current task. However, with the introduction of
first class virtual address spaces, these functions also need to be usable
with memory maps other than just the current process' one. Hence, let the
caller state explicitly which memory map to use.

Signed-off-by: Till Smejkal 
---
 arch/x86/mm/mpx.c  |  4 ++--
 fs/aio.c   |  4 ++--
 include/linux/mm.h | 11 ++-
 ipc/shm.c  |  3 ++-
 mm/mmap.c  | 16 
 mm/nommu.c |  7 ---
 mm/util.c  |  2 +-
 7 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index af59f808742f..99c664a97c35 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -50,8 +50,8 @@ static unsigned long mpx_mmap(unsigned long len)
return -EINVAL;
 
down_write(&mm->mmap_sem);
-   addr = do_mmap(NULL, 0, len, PROT_READ | PROT_WRITE,
-   MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate);
+   addr = do_mmap(mm, NULL, 0, len, PROT_READ | PROT_WRITE,
+  MAP_ANONYMOUS | MAP_PRIVATE, VM_MPX, 0, &populate);
up_write(&mm->mmap_sem);
if (populate)
mm_populate(addr, populate);
diff --git a/fs/aio.c b/fs/aio.c
index 873b4ca82ccb..df9bba5a2aff 100644
--- a/fs/aio.c
+++ b/fs/aio.c
@@ -510,8 +510,8 @@ static int aio_setup_ring(struct kioctx *ctx)
return -EINTR;
}
 
-   ctx->mmap_base = do_mmap_pgoff(ctx->aio_ring_file, 0, ctx->mmap_size,
-  PROT_READ | PROT_WRITE,
+   ctx->mmap_base = do_mmap_pgoff(current->mm, ctx->aio_ring_file, 0,
+  ctx->mmap_size, PROT_READ | PROT_WRITE,
   MAP_SHARED, 0, &unused);
up_write(&mm->mmap_sem);
if (IS_ERR((void *)ctx->mmap_base)) {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa483d2ff3eb..fb11be77545f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2019,17 +2019,18 @@ extern unsigned long get_unmapped_area(struct file *, 
unsigned long, unsigned lo
 extern unsigned long mmap_region(struct mm_struct *mm, struct file *file,
 unsigned long addr, unsigned long len,
 vm_flags_t vm_flags, unsigned long pgoff);
-extern unsigned long do_mmap(struct file *file, unsigned long addr,
-   unsigned long len, unsigned long prot, unsigned long flags,
-   vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate);
+extern unsigned long do_mmap(struct mm_struct *mm, struct file *file,
+   unsigned long addr, unsigned long len, unsigned long prot,
+   unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,
+   unsigned long *populate);
 extern int do_munmap(struct mm_struct *, unsigned long, size_t);
 
 static inline unsigned long
-do_mmap_pgoff(struct file *file, unsigned long addr,
+do_mmap_pgoff(struct mm_struct *mm, struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
unsigned long pgoff, unsigned long *populate)
 {
-   return do_mmap(file, addr, len, prot, flags, 0, pgoff, populate);
+   return do_mmap(mm, file, addr, len, prot, flags, 0, pgoff, populate);
 }
 
 #ifdef CONFIG_MMU
diff --git a/ipc/shm.c b/ipc/shm.c
index 81203e8ba013..64c21fb32ca9 100644
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -1222,7 +1222,8 @@ long do_shmat(int shmid, char __user *shmaddr, int 
shmflg, ulong *raddr,
goto invalid;
}
 
-   addr = do_mmap_pgoff(file, addr, size, prot, flags, 0, &populate);
+   addr = do_mmap_pgoff(mm, file, addr, size, prot, flags, 0,
+&populate);
*raddr = addr;
err = 0;
if (IS_ERR_VALUE(addr))
diff --git a/mm/mmap.c b/mm/mmap.c
index 5ac276ac9807..70028bf7b58d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1299,14 +1299,14 @@ static inline int mlock_future_check(struct mm_struct 
*mm,
 }
 
 /*
- * The caller must hold down_write(&current->mm->mmap_sem).
+ * The caller must hold down_write(&mm->mmap_sem).
  */
-unsigned long do_mmap(struct file *file, unsigned long addr,
-   unsigned long len, unsigned long prot,
-   unsigned long flags, vm_flags_t vm_flags,
-   unsigned long pgoff, unsigned long *populate)
+unsigned long do_mmap(struct mm_struct *mm, struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long prot, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff,
+ unsigned long *populate)
 {
-   struct mm_struct *mm = current->mm;
int pkey = 0;
 
*populate = 0;
@@ -2779,8 +2779,8 @@ SYSCALL_DEFINE5(remap_file_pages, 

[RFC PATCH 01/13] mm: Add mm_struct argument to 'mmap_region'

2017-03-13 Thread Till Smejkal
Add to the 'mmap_region' function the mm_struct that it should operate on
as an additional argument. Before, the function simply used the memory map
of the current task. However, with the introduction of first class virtual
address spaces, mmap_region also needs to be able to operate on memory maps
other than just the current task's one. By adding it as an argument we can
now explicitly define which memory map to use.

Signed-off-by: Till Smejkal 
---
 arch/mips/kernel/vdso.c |  2 +-
 arch/tile/mm/elf.c  |  2 +-
 include/linux/mm.h  |  5 +++--
 mm/mmap.c   | 10 +-
 4 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index f9dbfb14af33..9631b42908f3 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -108,7 +108,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, 
int uses_interp)
return -EINTR;
 
/* Map delay slot emulation page */
-   base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
+   base = mmap_region(mm, NULL, STACK_TOP, PAGE_SIZE,
   VM_READ|VM_WRITE|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
   0);
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 6225cc998db1..a22768059b7a 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -141,7 +141,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
 */
if (!retval) {
unsigned long addr = MEM_USER_INTRPT;
-   addr = mmap_region(NULL, addr, INTRPT_SIZE,
+   addr = mmap_region(mm, NULL, addr, INTRPT_SIZE,
   VM_READ|VM_EXEC|
   VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0);
if (addr > (unsigned long) -PAGE_SIZE)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b84615b0f64c..fa483d2ff3eb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2016,8 +2016,9 @@ extern int install_special_mapping(struct mm_struct *mm,
 
 extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned 
long, unsigned long, unsigned long);
 
-extern unsigned long mmap_region(struct file *file, unsigned long addr,
-   unsigned long len, vm_flags_t vm_flags, unsigned long pgoff);
+extern unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+unsigned long addr, unsigned long len,
+vm_flags_t vm_flags, unsigned long pgoff);
 extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate);
diff --git a/mm/mmap.c b/mm/mmap.c
index dc4291dcc99b..5ac276ac9807 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1447,7 +1447,7 @@ unsigned long do_mmap(struct file *file, unsigned long 
addr,
vm_flags |= VM_NORESERVE;
}
 
-   addr = mmap_region(file, addr, len, vm_flags, pgoff);
+   addr = mmap_region(mm, file, addr, len, vm_flags, pgoff);
if (!IS_ERR_VALUE(addr) &&
((vm_flags & VM_LOCKED) ||
 (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1582,10 +1582,10 @@ static inline int accountable_mapping(struct file 
*file, vm_flags_t vm_flags)
return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
-unsigned long mmap_region(struct file *file, unsigned long addr,
-   unsigned long len, vm_flags_t vm_flags, unsigned long pgoff)
+unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+   unsigned long addr, unsigned long len, vm_flags_t vm_flags,
+   unsigned long pgoff)
 {
-   struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
int error;
struct rb_node **rb_link, *rb_parent;
@@ -1704,7 +1704,7 @@ unsigned long mmap_region(struct file *file, unsigned 
long addr,
vm_stat_account(mm, vm_flags, len >> PAGE_SHIFT);
if (vm_flags & VM_LOCKED) {
if (!((vm_flags & VM_SPECIAL) || is_vm_hugetlb_page(vma) ||
-   vma == get_gate_vma(current->mm)))
+   vma == get_gate_vma(mm)))
mm->locked_vm += (len >> PAGE_SHIFT);
else
vma->vm_flags &= VM_LOCKED_CLEAR_MASK;
-- 
2.12.0



[RFC PATCH 00/13] Introduce first class virtual address spaces

2017-03-13 Thread Till Smejkal
First class virtual address spaces (also called VAS) are a new functionality of
the Linux kernel allowing address spaces to exist independently of processes.
The general idea behind this feature is described in a paper at ASPLOS16 with
the title 'SpaceJMP: Programming with Multiple Virtual Address Spaces' [1].

This patchset extends the kernel memory management subsystem with a new
type of address spaces (called VAS) which can be created and destroyed
independently of processes by a user in the system. During its lifetime
such a VAS can be attached to processes by the user which allows a process
to have multiple address spaces and thereby multiple, potentially
different, views on the system's main memory. During its execution the
threads belonging to the process are able to switch freely between the
different attached VAS and the process' original AS enabling them to
utilize the different available views on the memory. These multiple virtual
address spaces per process and the possibility to switch between them
freely can be used in multiple interesting ways as also outlined in the
mentioned paper. Some of the many possible applications are for example to
compartmentalize a process for security reasons, to improve the performance
of data-centric applications and to introduce new application models [1].

In addition to the concept of first class virtual address spaces, this
patchset introduces yet another feature called VAS segments. VAS segments
are memory regions which have a fixed size and position in the virtual
address space and can be shared between multiple first class virtual
address spaces. Such shareable memory regions are especially useful for
in-memory pointer-based data structures or other pure in-memory data.

First class virtual address spaces have a significant advantage compared to
forking a process and using inter-process communication mechanisms, namely
that creating and switching between VAS is significantly faster than
creating and switching between processes. As can be seen in the following
table, measured on an Intel Xeon E5620 CPU at 2.40GHz, creating a VAS is
about 7 times faster than forking, and switching between VAS is up to 4
times faster than switching between processes.

        |     VAS     |  processes  |
--------+-------------+-------------+
 switch |       468ns |      1944ns |
 create |     20003ns |    150491ns |

Hence, first class virtual address spaces provide a fast mechanism for
applications to utilize multiple virtual address spaces in parallel with a
higher performance than splitting up the application into multiple
independent processes.

Both VAS and VAS segments have another significant advantage when combined
with non-volatile memory. Because of their independent life cycle from
processes and other kernel data structures, they can be used to save
special memory regions or even whole AS into non-volatile memory making it
possible to reuse them across multiple system reboots.
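
To make the intended usage more concrete, the following is a minimal
user-space sketch. The wrapper names (vas_create/vas_attach/vas_switch) are
illustrative stand-ins for the new system calls and are not taken verbatim
from the patches:

#include <sys/types.h>
#include <unistd.h>

/* assumed thin wrappers around the new syscalls -- names are illustrative */
extern int vas_create(const char *name, mode_t mode);
extern int vas_attach(pid_t pid, int vid, int flags);
extern int vas_switch(int vid);

int example(void)
{
	int vid = vas_create("scratch", 0600);	/* AS exists independently of us */

	if (vid < 0)
		return -1;

	vas_attach(getpid(), vid, 0);	/* make the VAS usable by this process */
	vas_switch(vid);		/* the calling thread now runs in the VAS */
	/* work on memory that is only mapped inside the VAS */
	vas_switch(0);			/* switch back to the process' original AS */
	return 0;
}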

At the current state of development, first class virtual address spaces
have one limitation that we haven't been able to solve so far. The feature
allows different threads of the same process to execute in different
AS at the same time. This is possible because the VAS-switch operation
only changes the active mm_struct for the task_struct of the calling
thread. However, when a thread switches into a first class virtual address
space, some parts of its original AS are duplicated into the new one to
allow the thread to continue its execution in its current state.
Accordingly, parts of the process' AS (e.g. the code section, data
section, heap section and stack sections) exist in multiple AS if the
process has a VAS attached to it. Changes to these shared memory regions
are synchronized between the address spaces whenever a thread switches
between two of them. Unfortunately, in some scenarios the kernel is not
able to properly synchronize all these shared memory regions because of
conflicting changes. One such example happens if there are two threads, one
executing in an attached first class virtual address space, the other in
the tasks original address space. If both threads make changes to the heap
section that cause expansion of the underlying vm_area_struct, the kernel
cannot correctly synchronize these changes, because that would cause parts
of the virtual address space to be overwritten with unrelated data. In the
current implementation such conflicts are only detected but not resolved
and result in an error code being returned by the kernel during the VAS
switch operation. Unfortunately, that means that the particular thread
which tried to make the switch cannot do so anymore in the future and
accordingly has to be killed.

This code was developed during an internship at Hewlett Packard Enterprise.

[1] http://impact.crhc.illinois.edu/shared/Papers/ASPLOS16-SpaceJMP.pdf

Till Smejkal (13):
  mm: Add mm_struct argument to 'mmap_region'
  mm: Add 

Re: [RFC PATCH 10/13] mm: Introduce first class virtual address spaces

2017-03-13 Thread Greg Kroah-Hartman
On Mon, Mar 13, 2017 at 03:14:12PM -0700, Till Smejkal wrote:

There's no way with that many cc: lists and people that this is really
making it through very many people's filters and actually on a mailing
list.  Please trim them down.

Minor sysfs questions/issues:

> +struct vas {
> + struct kobject kobj;/* < the internal kobject that we use *
> +  *   for reference counting and sysfs *
> +  *   handling.*/
> +
> + int id; /* < ID   */
> + char name[VAS_MAX_NAME_LENGTH]; /* < name */

The kobject has a name, why not use that?
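
(A sketch of that suggestion -- not code from the patch -- would let the
embedded kobject carry the name; "vas_parent_kobj" is assumed here purely
for illustration:)

static int vas_register(struct vas *vas, const char *name)
{
	int ret;

	/* the kobject core copies and owns the name string */
	ret = kobject_init_and_add(&vas->kobj, &vas_ktype,
				   vas_parent_kobj, "%s", name);
	if (ret)
		kobject_put(&vas->kobj);	/* drop the ref on failure */
	return ret;
}

/* later, the name is simply kobject_name(&vas->kobj) */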

> +
> + struct mutex mtx;   /* < lock for parallel access.*/
> +
> + struct mm_struct *mm;   /* < a partial memory map containing  *
> +  *   all mappings of this VAS.*/
> +
> + struct list_head link;  /* < the link in the global VAS list. */
> + struct rcu_head rcu;/* < the RCU helper used for  *
> +  *   asynchronous VAS deletion.   */
> +
> + u16 refcount;   /* < how often is the VAS attached.   */

The kobject has a refcount, use that?  Don't have 2 refcounts in the
same structure, that way lies madness.  And bugs, lots of bugs...

And if this really is a refcount (hint, I don't think it is), you should
use the refcount_t type.
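
(A minimal sketch of what piggy-backing on the kobject's refcount could look
like -- illustrative only, not taken from the patch:)

static struct vas *vas_get(struct vas *vas)
{
	if (vas)
		kobject_get(&vas->kobj);	/* pin the VAS while it is in use */
	return vas;
}

static void vas_put(struct vas *vas)
{
	if (vas)
		kobject_put(&vas->kobj);	/* __vas_release() runs on the last put */
}

An attach would then call vas_get() and a detach vas_put(), with no second
counter to keep in sync.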

> +/**
> + * The sysfs structure we need to handle attributes of a VAS.
> + **/
> +struct vas_sysfs_attr {
> + struct attribute attr;
> + ssize_t (*show)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> + char *buf);
> + ssize_t (*store)(struct vas *vas, struct vas_sysfs_attr *vsattr,
> +  const char *buf, size_t count);
> +};
> +
> +#define VAS_SYSFS_ATTR(NAME, MODE, SHOW, STORE)  
> \
> +static struct vas_sysfs_attr vas_sysfs_attr_##NAME = \
> + __ATTR(NAME, MODE, SHOW, STORE)

__ATTR_RO and __ATTR_RW should work better for you.  If you really need
this.
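
(For instance, with the macro form being suggested here -- a sketch reusing
the 'id' field from the struct above, not code from the patch:)

static ssize_t id_show(struct vas *vas, struct vas_sysfs_attr *vsattr,
		       char *buf)
{
	return sprintf(buf, "%d\n", vas->id);
}

/* __ATTR_RO(id) gives mode 0444 and wires .show up to id_show() */
static struct vas_sysfs_attr vas_sysfs_attr_id = __ATTR_RO(id);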

Oh, and where is the Documentation/ABI/ updates to try to describe the
sysfs structure and files?  Did I miss that in the series?

> +static ssize_t __show_vas_name(struct vas *vas, struct vas_sysfs_attr 
> *vsattr,
> +char *buf)
> +{
> + return scnprintf(buf, PAGE_SIZE, "%s", vas->name);

It's a page size, just use sprintf() and be done with it.  No need to
ever check, you "know" it will be correct.

Also, what about a trailing '\n' for these attributes?

Oh wait, why have a name when the kobject name is already there in the
directory itself?  Do you really need this?

> +/**
> + * The ktype data structure representing a VAS.
> + **/
> +static struct kobj_type vas_ktype = {
> + .sysfs_ops = &vas_sysfs_ops,
> + .release = __vas_release,

Why the odd __vas* naming?  What's wrong with vas_release?


> + .default_attrs = vas_default_attr,
> +};
> +
> +
> +/***
> + * Internally visible functions
> + ***/
> +
> +/**
> + * Working with the global VAS list.
> + **/
> +static inline void vas_remove(struct vas *vas)



You have a ton of inline functions, for no good reason.  Make them all
"real" functions please.  Unless you can measure the size/speed
differences?  If so, please say so.


thanks,

greg k-h


Re: [RFC] powerpc: handle simultanneous interrupts at once

2017-03-13 Thread Benjamin Herrenschmidt
On Fri, 2017-03-10 at 12:11 +0100, Christophe Leroy wrote:
> It often happens to have simultaneous interrupts, for instance
> when having double Ethernet attachment. With the current
> implementation, we suffer the cost of kernel entry/exit for each
> interrupt.
> 
> This patch introduces a loop in __do_irq() to handle all interrupts
> at once before returning.
> 
> Signed-off-by: Christophe Leroy 
> ---
>  arch/powerpc/include/asm/hw_irq.h |  6 ++
>  arch/powerpc/kernel/irq.c | 22 +++---
>  2 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/hw_irq.h
> b/arch/powerpc/include/asm/hw_irq.h
> index eba60416536e..d69ae5846955 100644
> --- a/arch/powerpc/include/asm/hw_irq.h
> +++ b/arch/powerpc/include/asm/hw_irq.h
> @@ -123,6 +123,11 @@ static inline void may_hard_irq_enable(void)
>   __hard_irq_enable();
>  }
>  
> +static inline void may_hard_irq_disable(void)
> +{
> + __hard_irq_disable();
> +}
> +
>  static inline bool arch_irq_disabled_regs(struct pt_regs *regs)
>  {
>   return !regs->softe;
> @@ -204,6 +209,7 @@ static inline bool arch_irq_disabled_regs(struct
> pt_regs *regs)
>  }
>  
>  static inline void may_hard_irq_enable(void) { }
> +static inline void may_hard_irq_disable(void) { }
>  
>  #endif /* CONFIG_PPC64 */
>  
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index a018f5cae899..28aca510c166 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -515,14 +515,22 @@ void __do_irq(struct pt_regs *regs)
>    */
>   irq = ppc_md.get_irq();
>  
> - /* We can hard enable interrupts now to allow perf
> interrupts */
> - may_hard_irq_enable();
> + do {
> + /* We can hard enable interrupts now to allow perf
> interrupts */
> + may_hard_irq_enable();
> +
> + /* And finally process it */
> + if (unlikely(!irq))
> + __this_cpu_inc(irq_stat.spurious_irqs);
> + else
> + generic_handle_irq(irq);
> +
> + may_hard_irq_disable();

Not sure the above is that useful. If another interrupt is pending,
on ppc64 at least, you will have got it right at may_hard_irq_enable()
which would have caused a hard-disable for you anyway.

> - /* And finally process it */
> - if (unlikely(!irq))
> - __this_cpu_inc(irq_stat.spurious_irqs);
> - else
> - generic_handle_irq(irq);
> + irq = ppc_md.get_irq();
> + } while (irq);
> +
> + may_hard_irq_enable();
>  
>   trace_irq_exit(regs);
>  


Re: 5-level pagetable patches break ppc64le

2017-03-13 Thread Anton Blanchard
Hi Kirill,

> > My ppc64le boot tests stopped working as of commit c2febafc6773
> > ("mm: convert generic code to 5-level paging")
> > 
> > We hang part way during boot, just before bringing up the network. I
> > haven't had a chance to narrow it down yet.  
> 
> Please check if patch by this link helps:
> 
> http://lkml.kernel.org/r/20170313052213.11411-1-kirill.shute...@linux.intel.com

It does fix the ppc64le boot hangs, thanks.

Tested-by: Anton Blanchard 

Anton


Re: syscall statx not implemented on powerpc

2017-03-13 Thread Chris Packham
On 13/03/17 21:52, Chandan Rajendra wrote:
> On Monday, March 13, 2017 03:33:07 AM Chris Packham wrote:
>> Hi,
>>
>> I've just attempted to build a powerpc kernel from 4.11-rc2 using a
>> custom defconfig (available on request) and I'm hitting the following
>> error in the early stages of compilation.
>>
>> :1325:2: error: #warning syscall statx not implemented [-Werror=cpp]
>>
>> Same thing seems to happen with mpc85xx_basic_defconfig.
>>
>> I don't actually need this syscall so I'd be happy to turn something off
>> to get things building. I did a quick search and couldn't see anything
>> on linuxppc-dev but google keeps correcting "statx" to "stats" so I
>> could have missed it.
>>
>
> The upstream commit
> (https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a528d35e8bfcc521d7cb70aaf03e1bd296c8493f)
> that introduces the statx syscall provides a test program. I will wire-up the
> syscall on ppc64, run that test program and post the patch if the test program
> works well.
>

Thanks, I'd be happy to test a patch here.

In the meantime I worked around the build issue by adding __IGNORE_statx 
to checksyscalls.sh.



Re: [PATCH v2 2/6] powerpc/perf: Export memory hierarchy info to user space

2017-03-13 Thread Sukadev Bhattiprolu
Madhavan Srinivasan [ma...@linux.vnet.ibm.com] wrote:
> The LDST field and DATA_SRC in SIER identifies the memory hierarchy level
> (eg: L1, L2 etc), from which a data-cache miss for a marked instruction
> was satisfied. Use the 'perf_mem_data_src' object to export this
> hierarchy level to user space.
> 


> diff --git a/arch/powerpc/include/asm/perf_event_server.h 
> b/arch/powerpc/include/asm/perf_event_server.h
>  int isa207_get_constraint(u64 event, unsigned long *maskp, unsigned long 
> *valp)
>  {
>   unsigned int unit, pmc, cache, ebb;
> diff --git a/arch/powerpc/perf/isa207-common.h 
> b/arch/powerpc/perf/isa207-common.h
> index cf9bd8990159..982542cce991 100644
> --- a/arch/powerpc/perf/isa207-common.h
> +++ b/arch/powerpc/perf/isa207-common.h
> @@ -259,6 +259,19 @@
>  #define MAX_ALT  2
>  #define MAX_PMU_COUNTERS 6
> 
> +#define ISA207_SIER_TYPE_SHIFT   15
> +#define ISA207_SIER_TYPE_MASK(0x7ull << 
> ISA207_SIER_TYPE_SHIFT)
> +
> +#define ISA207_SIER_LDST_SHIFT   1
> +#define ISA207_SIER_LDST_MASK(0x7ull << 
> ISA207_SIER_LDST_SHIFT)
> +
> +#define ISA207_SIER_DATA_SRC_SHIFT   53
> +#define ISA207_SIER_DATA_SRC_MASK(0x7ull << ISA207_SIER_DATA_SRC_SHIFT)
> +
> +#define P(a, b)  PERF_MEM_S(a, b)

Madhavan, Peter,

Can we see if we can get the kernel to set 'perf_mem_data_src.val' in
endian-neutral format?

With something like  (untested) in include/uapi/linux/perf_event.h

#define PERF_MEM_OP_NBITS   PERF_MEM_LVL_SHIFT
#define PERF_MEM_LVL_NBITS  PERF_MEM_SNOOP_SHIFT
#define PERF_MEM_SNOOP_NBITSPERF_MEM_LOCK_SHIFT
#define PERF_MEM_TLB_NBITS  PERF_MEM_TLB_SHIFT

and here in arch/powerpc/perf/isa207-common.h

#define PERF_MEM_S_BE_SHIFT(a)  \
(63 - PERF_MEM_##a##_NBITS - PERF_MEM_##a##_SHIFT)

#define PERF_MEM_S_BE(a, s) \
(((__u64)PERF_MEM_##a##_##s) << PERF_MEM_S_BE_SHIFT(a))

#define P(a, b) PERF_MEM_S_BE(a, b)

Basically, have PERF_MEM_OP_NA be the rightmost bit and PERF_MEM_TLB_OS
the leftmost bit in perf_mem_data_src.val regardless of the endianness?

Sukadev



[PATCH] ASoC: fsl: constify snd_soc_ops structures

2017-03-13 Thread Bhumika Goyal
Declare snd_soc_ops structures as const as they are only stored
in the ops field of a snd_soc_dai_link structure. This field is
of type const, so snd_soc_ops structures having this property
can be made const too.

The following .o files did not compile:
sound/soc/fsl/{p1022_rdk.c/p1022_ds.c/mpc8610_hpcd.c}

Signed-off-by: Bhumika Goyal 
---
 sound/soc/fsl/eukrea-tlv320.c   | 2 +-
 sound/soc/fsl/imx-mc13783.c | 2 +-
 sound/soc/fsl/mpc8610_hpcd.c| 2 +-
 sound/soc/fsl/mx27vis-aic32x4.c | 2 +-
 sound/soc/fsl/p1022_ds.c| 2 +-
 sound/soc/fsl/p1022_rdk.c   | 2 +-
 sound/soc/fsl/phycore-ac97.c| 2 +-
 sound/soc/fsl/wm1133-ev1.c  | 2 +-
 8 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/sound/soc/fsl/eukrea-tlv320.c b/sound/soc/fsl/eukrea-tlv320.c
index 883087f..84ef638 100644
--- a/sound/soc/fsl/eukrea-tlv320.c
+++ b/sound/soc/fsl/eukrea-tlv320.c
@@ -64,7 +64,7 @@ static int eukrea_tlv320_hw_params(struct snd_pcm_substream 
*substream,
return 0;
 }
 
-static struct snd_soc_ops eukrea_tlv320_snd_ops = {
+static const struct snd_soc_ops eukrea_tlv320_snd_ops = {
.hw_params  = eukrea_tlv320_hw_params,
 };
 
diff --git a/sound/soc/fsl/imx-mc13783.c b/sound/soc/fsl/imx-mc13783.c
index bb04590..9d19b80 100644
--- a/sound/soc/fsl/imx-mc13783.c
+++ b/sound/soc/fsl/imx-mc13783.c
@@ -48,7 +48,7 @@ static int imx_mc13783_hifi_hw_params(struct 
snd_pcm_substream *substream,
return snd_soc_dai_set_tdm_slot(cpu_dai, 0x3, 0x3, 2, 16);
 }
 
-static struct snd_soc_ops imx_mc13783_hifi_ops = {
+static const struct snd_soc_ops imx_mc13783_hifi_ops = {
.hw_params = imx_mc13783_hifi_hw_params,
 };
 
diff --git a/sound/soc/fsl/mpc8610_hpcd.c b/sound/soc/fsl/mpc8610_hpcd.c
index ddf49f3..a639b52 100644
--- a/sound/soc/fsl/mpc8610_hpcd.c
+++ b/sound/soc/fsl/mpc8610_hpcd.c
@@ -174,7 +174,7 @@ static int mpc8610_hpcd_machine_remove(struct snd_soc_card 
*card)
 /**
  * mpc8610_hpcd_ops: ASoC machine driver operations
  */
-static struct snd_soc_ops mpc8610_hpcd_ops = {
+static const struct snd_soc_ops mpc8610_hpcd_ops = {
.startup = mpc8610_hpcd_startup,
 };
 
diff --git a/sound/soc/fsl/mx27vis-aic32x4.c b/sound/soc/fsl/mx27vis-aic32x4.c
index 198eeb3..d7ec3d2 100644
--- a/sound/soc/fsl/mx27vis-aic32x4.c
+++ b/sound/soc/fsl/mx27vis-aic32x4.c
@@ -73,7 +73,7 @@ static int mx27vis_aic32x4_hw_params(struct snd_pcm_substream 
*substream,
return 0;
 }
 
-static struct snd_soc_ops mx27vis_aic32x4_snd_ops = {
+static const struct snd_soc_ops mx27vis_aic32x4_snd_ops = {
.hw_params  = mx27vis_aic32x4_hw_params,
 };
 
diff --git a/sound/soc/fsl/p1022_ds.c b/sound/soc/fsl/p1022_ds.c
index a1f780e..41c623c 100644
--- a/sound/soc/fsl/p1022_ds.c
+++ b/sound/soc/fsl/p1022_ds.c
@@ -184,7 +184,7 @@ static int p1022_ds_machine_remove(struct snd_soc_card 
*card)
 /**
  * p1022_ds_ops: ASoC machine driver operations
  */
-static struct snd_soc_ops p1022_ds_ops = {
+static const struct snd_soc_ops p1022_ds_ops = {
.startup = p1022_ds_startup,
 };
 
diff --git a/sound/soc/fsl/p1022_rdk.c b/sound/soc/fsl/p1022_rdk.c
index d4d88a8..4afbdd6 100644
--- a/sound/soc/fsl/p1022_rdk.c
+++ b/sound/soc/fsl/p1022_rdk.c
@@ -188,7 +188,7 @@ static int p1022_rdk_machine_remove(struct snd_soc_card 
*card)
 /**
  * p1022_rdk_ops: ASoC machine driver operations
  */
-static struct snd_soc_ops p1022_rdk_ops = {
+static const struct snd_soc_ops p1022_rdk_ops = {
.startup = p1022_rdk_startup,
 };
 
diff --git a/sound/soc/fsl/phycore-ac97.c b/sound/soc/fsl/phycore-ac97.c
index ae403c2..66fb6c4 100644
--- a/sound/soc/fsl/phycore-ac97.c
+++ b/sound/soc/fsl/phycore-ac97.c
@@ -23,7 +23,7 @@
 
 static struct snd_soc_card imx_phycore;
 
-static struct snd_soc_ops imx_phycore_hifi_ops = {
+static const struct snd_soc_ops imx_phycore_hifi_ops = {
 };
 
 static struct snd_soc_dai_link imx_phycore_dai_ac97[] = {
diff --git a/sound/soc/fsl/wm1133-ev1.c b/sound/soc/fsl/wm1133-ev1.c
index b454972..cdaf163 100644
--- a/sound/soc/fsl/wm1133-ev1.c
+++ b/sound/soc/fsl/wm1133-ev1.c
@@ -139,7 +139,7 @@ static int wm1133_ev1_hw_params(struct snd_pcm_substream 
*substream,
return 0;
 }
 
-static struct snd_soc_ops wm1133_ev1_ops = {
+static const struct snd_soc_ops wm1133_ev1_ops = {
.hw_params = wm1133_ev1_hw_params,
 };
 
-- 
2.7.4



Re: 4.11.0-rc1 boot resulted in WARNING: CPU: 14 PID: 1722 at fs/sysfs/dir.c:31 .sysfs_warn_dup+0x78/0xb0

2017-03-13 Thread Abdul Haleem
On Sat, 2017-03-11 at 15:46 -0700, Jens Axboe wrote:
> On 03/09/2017 05:59 AM, Brian Foster wrote:
> > cc linux-block
> > 
> > On Thu, Mar 09, 2017 at 04:20:06PM +0530, Abdul Haleem wrote:
> >> On Wed, 2017-03-08 at 08:17 -0500, Brian Foster wrote:
> >>> On Tue, Mar 07, 2017 at 10:01:04PM +0530, Abdul Haleem wrote:
> 
>  Hi,
> 
>  Today's mainline (4.11.0-rc1) booted with warnings on Power7 LPAR.
> 
>  Issue is not reproducible all the time.
> 
> Is that still the case with -git as of yesterday? Check that you
> have this merge:
> 
> 34bbce9e344b47e8871273409632f525973afad4
> 
> in your tree.
> 

Thanks for pointing that out; with the below merge commit the warnings disappear.

commit 34bbce9e344b47e8871273409632f525973afad4
Merge: bb61ce5 672a2c8
Author: Linus Torvalds 
Date:   Thu Mar 9 15:53:25 2017 -0800

Merge branch 'for-linus' of git://git.kernel.dk/linux-block


Thanks for the fix !

Reported-by & Tested-by : Abdul Haleem 

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





Re: [BUG] [PowerPC] Kernel Oops when booting Linux mainline

2017-03-13 Thread Abdul Haleem
On Mon, 2017-03-13 at 14:48 +0530, Abdul Haleem wrote:
> Hi,
> 
> Mainline boot is broken on PowerPC bare metal with below traces:
> Machine Type : Power 8 Bare metal
> 
> [  OK  ] Mounted Debug File System.
> [  OK  ] Started Nameserver information manager.
> [  OK  ] Started LVM2 metadata daemon.
> Unable to handle kernel paging request for data at address 0x30079
> Faulting instruction address: 0xc0811cac
> Oops: Kernel access of bad area, sig: 11 [#1]
> Unable to handle kernel paging request for instruction fetch
> Unable to handle kernel paging request for data at address 0xc07c77881e28
> Faulting instruction address: 0xc03c7acd7b80
> Faulting instruction address: 0xc029fd88
> Thread overran stack, or stack corrupted
> SMP NR_CPUS=2048 
> NUMA 
> PowerNV
> Modules linked in: ip_table�KK~<(E) x_tables�\8|<(E) autofs4(E) raid10 
> raid456 async_raid6_recov 
> async_memcpy async_pq async_xor async_tx xor raid6_pq multipath bnx2x(E) 
> mdio(E) libcrc32�^B�w<(E)
> CPU: 21 PID: 137 Comm: kworker/21:0 Not tainted 
> 4.11.0-rc1-00335-g56b24d1-dirty #1
> Workqueue: events dbs_work_handler
> task: c03c8e86b380 task.stack: c03c8eae
> NIP: c0811cac LR: c0813014 CTR: c0811c50
> REGS: c03c8eae3980 TRAP: 0300   Not tainted  
> (4.11.0-rc1-00335-g56b24d1-dirty)
> MSR: 90010280b033 
>   CR: 24002422  XER:   
> CFAR: c0008860 DAR: 00030079 DSISR: 4000 SOFTE: 1 
> GPR00: c0813014 c03c8eae3c00 c1019800 c03c780d0800 
> GPR04: 0540   c03c8e86b380 
> GPR08:  003c 003c 8397 
> GPR12: c0811c50 cfe05400 c00f2e08 c03ffb073d80 
> GPR16: c03ffc564328 c03ffc5640f8 c03ffc5640a0 0001 
> GPR20:   c0f50bc0 fef7 
> GPR24:  c03ffc564400  00030001 
> GPR28: c03c780d0b60 00030001 c03c780d0800 c03c780d0b60 
> NIP [c0811cac] od_dbs_update+0x5c/0x260
> LR [c0813014] dbs_work_handler+0x54/0xa0
> Call Trace:
> [c03c8eae3c00] [c03ffc564880] 0xc03ffc564880 (unreliable)
> [c03c8eae3c50] [c0813014] dbs_work_handler+0x54/0xa0
> [c03c8eae3c90] [c00ea600] process_one_work+0x2a0/0x590
> [c03c8eae3d20] [c00ea998] worker_thread+0xa8/0x660
> [c03c8eae3dc0] [c00f2f4c] kthread+0x14c/0x190
> [c03c8eae3e30] [c000b4e8] ret_from_kernel_thread+0x5c/0x74
> Instruction dump:
> f821ffb1 ebe30340 813f00ac 895f00ac ebbf0078 71280001 5549003c 993f00ac 
> 40820164 eb9e0340 7fc3f378 eb7c0078  48001539 6000 3920 
> ---[ end trace 587e7f5a13c0f2ad ]---
> 
> 
> Detailed logs and config is attached.
> 
> FYI, Good commit of last successful boot is : 
> 
> commit ea6200e84182989a3cce9687cf79a23ac44ec4db
> Merge: b4fb8f6 fc69910
> Author: Linus Torvalds 
> Date:   Wed Mar 8 14:45:31 2017 -0800
> 
> Merge branch 'core-urgent-for-linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
> 
> 
> 

With this patch, ppc boots fine. Thanks for the fix

http://lkml.kernel.org/r/20170313052213.11411-1-kirill.shute...@linux.intel.com

Tested-by : Abdul Haleem 

-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre





Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-13 Thread Peter Zijlstra
On Mon, Mar 13, 2017 at 04:45:51PM +0530, Madhavan Srinivasan wrote:
> >  - should you not have fixed this in the tool only? This patch
> >effectively breaks ABI on big-endian architectures.
> 
> IIUC, we are the first BE user for this feature
> (Kindly correct me if I am wrong), so technically we
> are not breaking ABI here :) .  But let me also look  at
> the dynamic conversion part.

Huh? PPC hasn't yet implemented this? Then why are you fixing it?



Re: 5-level pagetable patches break ppc64le

2017-03-13 Thread Kirill A. Shutemov
On Mon, Mar 13, 2017 at 09:05:50PM +1100, Anton Blanchard wrote:
> Hi,
> 
> My ppc64le boot tests stopped working as of commit c2febafc6773 ("mm:
> convert generic code to 5-level paging")
> 
> We hang part way during boot, just before bringing up the network. I
> haven't had a chance to narrow it down yet.

Please check if patch by this link helps:

http://lkml.kernel.org/r/20170313052213.11411-1-kirill.shute...@linux.intel.com

-- 
 Kirill A. Shutemov


Re: [PATCH v2 1/6] powerpc/perf: Define big-endian version of perf_mem_data_src

2017-03-13 Thread Madhavan Srinivasan



On Tuesday 07 March 2017 03:53 PM, Peter Zijlstra wrote:

On Tue, Mar 07, 2017 at 03:28:17PM +0530, Madhavan Srinivasan wrote:


On Monday 06 March 2017 04:52 PM, Peter Zijlstra wrote:

On Mon, Mar 06, 2017 at 04:13:08PM +0530, Madhavan Srinivasan wrote:

From: Sukadev Bhattiprolu 

perf_mem_data_src is a union that is initialized via the ->val field
and accessed via the bitmap fields. For this to work on big endian
platforms, we also need a big-endian representation of perf_mem_data_src.

Doesn't this break interpreting the data on a different endian machine?

IIUC, we will need this patch to not break interpreting the data
on a different endian machine. Data collected from power8 LE/BE
guests with this patchset applied. Kindly correct me if I missed
your question here.

So your patch adds compile time bitfield differences. My worry was that
there was no dynamic conversion routine in the tools (it has for a lot
of other places).

This yields two questions:

  - are these two static layouts identical? (seeing that you illustrate
cross-endian things working this seems likely).

  - should you not have fixed this in the tool only? This patch
effectively breaks ABI on big-endian architectures.


IIUC, we are the first BE user for this feature
(Kindly correct me if I am wrong), so technically we
are not breaking ABI here :) .  But let me also look  at
the dynamic conversion part.

Maddy







Re: [FIX PATCH v1] powerpc/pseries: Fix reference count leak during CPU unplug

2017-03-13 Thread Bharata B Rao
On Thu, Mar 09, 2017 at 01:34:00PM -0800, Tyrel Datwyler wrote:
> On 03/08/2017 08:37 PM, Bharata B Rao wrote:
> > The following warning is seen when a CPU is hot unplugged on a PowerKVM
> > guest:
> 
> Is this the case with cpus present at boot? What about cpus hotplugged
> after boot?

I have observed this for CPUs that are hotplugged.

> 
> My suspicion is that the refcount was wrong to begin with. See my
> comments below. The use of the of_node_put() calls is correct as in each
> case we incremented the ref count earlier in the same function.
> 
> > 
> > refcount_t: underflow; use-after-free.
> > [ cut here ]
> > WARNING: CPU: 0 PID: 53 at lib/refcount.c:128 
> > refcount_sub_and_test+0xd8/0xf0
> > Modules linked in:
> > CPU: 0 PID: 53 Comm: kworker/u510:1 Not tainted 4.11.0-rc1 #3
> > Workqueue: pseries hotplug workque pseries_hp_work_fn
> > task: c000fb475000 task.stack: c000fb81c000
> > NIP: c06f0808 LR: c06f0804 CTR: c07b98c0
> > REGS: c000fb81f710 TRAP: 0700   Not tainted  (4.11.0-rc1)
> > MSR: 8282b033 
> >   CR: 4800  XER: 2000
> > CFAR: c0c438e0 SOFTE: 1
> > GPR00: c06f0804 c000fb81f990 c1573b00 0026
> > GPR04:  016c 667265652e0d0a73 652d61667465722d
> > GPR08: 0007 0007 0001 0006
> > GPR12: 2200 cff4 c010c578 c001f11b9f40
> > GPR16: c001fe0312a8 c001fe031078 c001fe031020 0001
> > GPR20:   c1454808 fef7
> > GPR24:  c001f1677648  
> > GPR28: 1008 c0e4d3d8  c001eaae07d8
> > NIP [c06f0808] refcount_sub_and_test+0xd8/0xf0
> > LR [c06f0804] refcount_sub_and_test+0xd4/0xf0
> > Call Trace:
> > [c000fb81f990] [c06f0804] refcount_sub_and_test+0xd4/0xf0 
> > (unreliable)
> > [c000fb81f9f0] [c06d04b4] kobject_put+0x44/0x2a0
> > [c000fb81fa70] [c09d5284] of_node_put+0x34/0x50
> > [c000fb81faa0] [c00aceb8] dlpar_cpu_remove_by_index+0x108/0x130
> > [c000fb81fb30] [c00ae128] dlpar_cpu+0x78/0x550
> > [c000fb81fbe0] [c00a7b40] handle_dlpar_errorlog+0xc0/0x160
> > [c000fb81fc50] [c00a7c74] pseries_hp_work_fn+0x94/0xa0
> > [c000fb81fc80] [c0102cec] process_one_work+0x23c/0x540
> > [c000fb81fd20] [c010309c] worker_thread+0xac/0x620
> > [c000fb81fdc0] [c010c6c4] kthread+0x154/0x1a0
> > [c000fb81fe30] [c000bbe0] ret_from_kernel_thread+0x5c/0x7c
> > 
> > Fix this by ensuring that of_node_put() is called only from the
> > error path in dlpar_cpu_remove_by_index(). In the normal path,
> > of_node_put() happens as part of dlpar_detach_node().
> > 
> > Signed-off-by: Bharata B Rao 
> > Cc: Nathan Fontenot 
> > ---
> > Changes in v1:
> > - Fixed the refcount problem in the userspace driven unplug path
> >   in addition to in-kernel unplug path. (Sachin Sant)
> > 
> > v0: https://patchwork.ozlabs.org/patch/736547/
> > 
> >  arch/powerpc/platforms/pseries/hotplug-cpu.c | 12 
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c 
> > b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > index 7bc0e91..c5ed510 100644
> > --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> > @@ -619,7 +619,8 @@ static int dlpar_cpu_remove_by_index(u32 drc_index)
> > }
> > 
> > rc = dlpar_cpu_remove(dn, drc_index);
> > -   of_node_put(dn);
> > +   if (rc)
> > +   of_node_put(dn);
> 
> I think there is another issue at play here because this is wrong.
> Regardless of whether the dlpar_cpu_remove() succeeds or fails we still
> need of_node_put() for both cases because we incremented the ref count
> earlier in this function with a call to cpu_drc_index_to_dn(). That
> call doesn't, but should, document that it returns a device_node with
> incremented refcount.
> 
> > return rc;
> >  }
> > 
> > @@ -856,9 +857,12 @@ static ssize_t dlpar_cpu_release(const char *buf, 
> > size_t count)
> > }
> > 
> > rc = dlpar_cpu_remove(dn, drc_index);
> > -   of_node_put(dn);
> > -
> > -   return rc ? rc : count;
> > +   if (rc) {
> > +   of_node_put(dn);
> > +   return rc;
> > +   } else {
> > +   return count;
> > +   }
> 
> Same comment as above. The call earlier in the function to
> of_find_node_by_path() returned a device_node struct with its ref count
> incremented. So, regardless of whether dlpar_cpu_remove() succeeds or
> fails we need to decrement the ref count with of_node_put().
> 
> Looking closer at the call paths for attach and detach one will notice
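
(As an illustration of the reference-counting rule being described above --
a sketch only, not the actual fix being discussed:)

static int example_remove(const char *path, u32 drc_index)
{
	struct device_node *dn;
	int rc;

	dn = of_find_node_by_path(path);	/* takes a reference on dn */
	if (!dn)
		return -ENODEV;

	rc = dlpar_cpu_remove(dn, drc_index);	/* may succeed or fail */

	of_node_put(dn);	/* drop the reference we took, in either case */
	return rc;
}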

5-level pagetable patches break ppc64le

2017-03-13 Thread Anton Blanchard
Hi,

My ppc64le boot tests stopped working as of commit c2febafc6773 ("mm:
convert generic code to 5-level paging")

We hang part way during boot, just before bringing up the network. I
haven't had a chance to narrow it down yet.

Anton


[BUG] [PowerPC] Kernel Oops when booting Linux mainline

2017-03-13 Thread Abdul Haleem
Hi,

Mainline boot is broken on PowerPC bare metal with below traces:
Machine Type : Power 8 Bare metal

[  OK  ] Mounted Debug File System.
[  OK  ] Started Nameserver information manager.
[  OK  ] Started LVM2 metadata daemon.
Unable to handle kernel paging request for data at address 0x30079
Faulting instruction address: 0xc0811cac
Oops: Kernel access of bad area, sig: 11 [#1]
Unable to handle kernel paging request for instruction fetch
Unable to handle kernel paging request for data at address 0xc07c77881e28
Faulting instruction address: 0xc03c7acd7b80
Faulting instruction address: 0xc029fd88
Thread overran stack, or stack corrupted
SMP NR_CPUS=2048 
NUMA 
PowerNV
Modules linked in: ip_table�KK~<(E) x_tables�\8|<(E) autofs4(E) raid10 raid456 
async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq multipath bnx2x(E) 
mdio(E) libcrc32�^B�w<(E)
CPU: 21 PID: 137 Comm: kworker/21:0 Not tainted 4.11.0-rc1-00335-g56b24d1-dirty 
#1
Workqueue: events dbs_work_handler
task: c03c8e86b380 task.stack: c03c8eae
NIP: c0811cac LR: c0813014 CTR: c0811c50
REGS: c03c8eae3980 TRAP: 0300   Not tainted  
(4.11.0-rc1-00335-g56b24d1-dirty)
MSR: 90010280b033 
  CR: 24002422  XER:   
CFAR: c0008860 DAR: 00030079 DSISR: 4000 SOFTE: 1 
GPR00: c0813014 c03c8eae3c00 c1019800 c03c780d0800 
GPR04: 0540   c03c8e86b380 
GPR08:  003c 003c 8397 
GPR12: c0811c50 cfe05400 c00f2e08 c03ffb073d80 
GPR16: c03ffc564328 c03ffc5640f8 c03ffc5640a0 0001 
GPR20:   c0f50bc0 fef7 
GPR24:  c03ffc564400  00030001 
GPR28: c03c780d0b60 00030001 c03c780d0800 c03c780d0b60 
NIP [c0811cac] od_dbs_update+0x5c/0x260
LR [c0813014] dbs_work_handler+0x54/0xa0
Call Trace:
[c03c8eae3c00] [c03ffc564880] 0xc03ffc564880 (unreliable)
[c03c8eae3c50] [c0813014] dbs_work_handler+0x54/0xa0
[c03c8eae3c90] [c00ea600] process_one_work+0x2a0/0x590
[c03c8eae3d20] [c00ea998] worker_thread+0xa8/0x660
[c03c8eae3dc0] [c00f2f4c] kthread+0x14c/0x190
[c03c8eae3e30] [c000b4e8] ret_from_kernel_thread+0x5c/0x74
Instruction dump:
f821ffb1 ebe30340 813f00ac 895f00ac ebbf0078 71280001 5549003c 993f00ac 
40820164 eb9e0340 7fc3f378 eb7c0078  48001539 6000 3920 
---[ end trace 587e7f5a13c0f2ad ]---


Detailed logs and config is attached.

FYI, Good commit of last successful boot is : 

commit ea6200e84182989a3cce9687cf79a23ac44ec4db
Merge: b4fb8f6 fc69910
Author: Linus Torvalds 
Date:   Wed Mar 8 14:45:31 2017 -0800

Merge branch 'core-urgent-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip



-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre


[pexpect]#kexec -e
[ 5899.394322] kexec_core: Starting new kernel
[ 5899.394322] kexec_core: Starting new kernel
[ 5989.055024957,5] OPAL: Switch to big-endian OS
[ 5996.176385535,5] OPAL: Switch to little-endian OS
[0.00] opal: OPAL detected !
[0.00] Page sizes from device-tree:
[0.00] base_shift=12: shift=12, sllp=0x, avpnm=0x, 
tlbiel=1, penc=0
[0.00] base_shift=12: shift=16, sllp=0x, avpnm=0x, 
tlbiel=1, penc=7
[0.00] base_shift=12: shift=24, sllp=0x, avpnm=0x, 
tlbiel=1, penc=56
[0.00] base_shift=16: shift=16, sllp=0x0110, avpnm=0x, 
tlbiel=1, penc=1
[0.00] base_shift=16: shift=24, sllp=0x0110, avpnm=0x, 
tlbiel=1, penc=8
[0.00] base_shift=20: shift=20, sllp=0x0130, avpnm=0x, 
tlbiel=0, penc=2
[0.00] base_shift=24: shift=24, sllp=0x0100, avpnm=0x0001, 
tlbiel=0, penc=0
[0.00] base_shift=34: shift=34, sllp=0x0120, avpnm=0x07ff, 
tlbiel=0, penc=3
[0.00] Using 1TB segments
[0.00] Initializing hash mmu with SLB
[0.00] Linux version 4.11.0-rc1-00335-g56b24d1-dirty (root@pkvmhab012) 
(gcc version 6.2.0 20161005 (Ubuntu 6.2.0-5ubuntu12) ) #1 SMP Mon Mar 13 
04:51:35 EDT 2017
[0.00] Found initrd at 0xc2a9:0xc397cecd
[0.00] OPAL: Found non-mapped LPC bus on chip 0
[0.00] Using PowerNV machine description
[0.00] bootconsole [udbg0] enabled
[0.00] CPU maps initialized for 8 threads per core
 -> smp_release_cpus()
spinning_secondaries = 79
 <- smp_release_cpus()
[0.00] -
[0.00] ppc64_pft_size= 0x0
[0.00] phys_mem_size = 0x40
[0.00] dcache_bsize  = 0x80
[0.00] icache_bsize  = 0x80
[0.00] cpu_features  = 

Re: syscall statx not implemented on powerpc

2017-03-13 Thread Chandan Rajendra
On Monday, March 13, 2017 03:33:07 AM Chris Packham wrote:
> Hi,
> 
> I've just attempted to build a powerpc kernel from 4.11-rc2 using a 
> custom defconfig (available on request) and I'm hitting the following 
> error in the early stages of compilation.
> 
> :1325:2: error: #warning syscall statx not implemented [-Werror=cpp]
> 
> Same thing seems to happen with mpc85xx_basic_defconfig.
> 
> I don't actually need this syscall so I'd be happy to turn something off 
> to get things building. I did a quick search and couldn't see anything 
> on linuxppc-dev but google keeps correcting "statx" to "stats" so I 
> could have missed it.
> 
> 

The upstream commit
(https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a528d35e8bfcc521d7cb70aaf03e1bd296c8493f)
that introduces the statx syscall provides a test program. I will wire-up the
syscall on ppc64, run that test program and post the patch if the test program
works well.

-- 
chandan



Today's mainline failed to boot on Power 6 with WARNING: CPU: 8 PID: 0 at kernel/sched/sched.h

2017-03-13 Thread Abdul Haleem
Hi,

4.11.0-rc1 failed to boot on Power 6 LPAR with the below warnings.

Machine Type : Power 6 LPAR
Kernel : 4.11.0-rc1
Config : p6-BE-config file attached

rq->clock_update_flags < RQCF_ACT_SKIP
[ cut here ]
WARNING: CPU: 8 PID: 0 at kernel/sched/sched.h:833 
.update_blocked_averages+0xe78/0xfb0
Modules linked in: sr_mod(E) sd_mod(E) cdrom(E) dm_mirror(E) dm_region_hash(E) 
dm_log(E) dm_mod(E)
CPU: 8 PID: 0 Comm: swapper/8 Tainted: GE   4.11.0-rc1-autotest #1
task: c001c2980080 task.stack: c001c299
NIP: c0108478 LR: c0108474 CTR: 
REGS: c001c2993180 TRAP: 0700   Tainted: GE
(4.11.0-rc1-autotest)
MSR: 80029032 
  CR: 24282022  XER: 2000 
CFAR: c08341c4 SOFTE: 0 
GPR00: c0108474 c001c2993400 c1025600 0026 
GPR04:  0006 c10c5600 c10c5600 
GPR08:  c0beebb0 d242 c08741c0 
GPR12: 24282024 cedc1800  0001 
GPR16: c000d3026ed0  ba7e ba7e 
GPR20: afb50401  c0c06600 c0c06600 
GPR24: afb504000afb5041 c000d3026600 0040  
GPR28:  c000fc85b200 c000fcb16d40 c000fcb16c00 
NIP [c0108478] .update_blocked_averages+0xe78/0xfb0
LR [c0108474] .update_blocked_averages+0xe74/0xfb0
Call Trace:
[c001c2993400] [c0108474] .update_blocked_averages+0xe74/0xfb0 
(unreliable)
[c001c2993520] [c0110c14] .rebalance_domains+0x74/0x330
[c001c2993610] [c00c8d00] .__do_softirq+0x180/0x3f0
[c001c2993720] [c00c9358] .irq_exit+0x108/0x120
[c001c2993790] [c001f17c] .timer_interrupt+0x9c/0xd0
[c001c2993810] [c0009308] decrementer_common+0x158/0x160
--- interrupt: 901 at plpar_hcall_norets_trace+0x58/0x8c
LR = plpar_hcall_norets_trace+0x34/0x8c
[c001c2993b70] [c06a5cd4] .check_and_cede_processor+0x24/0x40
[c001c2993be0] [c06a5f34] .dedicated_cede_loop+0x64/0x170
[c001c2993c70] [c06a3954] .cpuidle_enter_state+0xc4/0x3b0
[c001c2993d20] [c011b320] .call_cpuidle+0x40/0x70
[c001c2993d90] [c011b6e0] .do_idle+0x2a0/0x310
[c001c2993e60] [c011b97c] .cpu_startup_entry+0x2c/0x30
[c001c2993ee0] [c003e6a8] .start_secondary+0x358/0x3a0
[c001c2993f90] [c000b06c] start_secondary_prolog+0x10/0x14
Instruction dump:
6000 6000 3d42fff0 892addef 2f89 40fef2a0 3c62ffa4 3901 
3863de68 990addef 4872bd25 6000 <0fe0> 4bfff280 5540d97e 794a06e0 
---[ end trace f610c7e162cdd3b8 ]---
Unable to handle kernel paging request for data at address 0x2c2341820054
Faulting instruction address: 0xc01076e0
Oops: Kernel access of bad area, sig: 11 [#1]



-- 
Regard's

Abdul Haleem
IBM Linux Technology Centre


#
# Automatically generated file; DO NOT EDIT.
# Linux/powerpc 4.10.0-rc3 Kernel Configuration
#
CONFIG_PPC64=y

#
# Processor support
#
CONFIG_PPC_BOOK3S_64=y
# CONFIG_PPC_BOOK3E_64 is not set
CONFIG_GENERIC_CPU=y
# CONFIG_CELL_CPU is not set
# CONFIG_POWER4_CPU is not set
# CONFIG_POWER5_CPU is not set
# CONFIG_POWER6_CPU is not set
# CONFIG_POWER7_CPU is not set
# CONFIG_POWER8_CPU is not set
CONFIG_PPC_BOOK3S=y
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_VSX=y
# CONFIG_PPC_ICSWX is not set
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_STD_MMU_64=y
CONFIG_PPC_RADIX_MMU=y
CONFIG_PPC_MM_SLICES=y
CONFIG_PPC_HAVE_PMU_SUPPORT=y
CONFIG_PPC_PERF_CTRS=y
CONFIG_SMP=y
CONFIG_NR_CPUS=1024
CONFIG_PPC_DOORBELL=y
CONFIG_VDSO32=y
CONFIG_CPU_BIG_ENDIAN=y
# CONFIG_CPU_LITTLE_ENDIAN is not set
CONFIG_64BIT=y
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_MMU=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NR_IRQS=512
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_ARCH_HAS_ILOG2_U32=y
CONFIG_ARCH_HAS_ILOG2_U64=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK=y
CONFIG_PPC=y
# CONFIG_GENERIC_CSUM is not set
CONFIG_EARLY_PRINTK=y
CONFIG_PANIC_TIMEOUT=180
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_PPC_UDBG_16550=y
# CONFIG_GENERIC_TBSYNC is not set
CONFIG_AUDIT_ARCH=y
CONFIG_GENERIC_BUG=y
CONFIG_EPAPR_BOOT=y
# CONFIG_DEFAULT_UIMAGE is not set
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
# CONFIG_PPC_DCR_NATIVE is not set
# CONFIG_PPC_DCR_MMIO is not set
# CONFIG_PPC_OF_PLATFORM_PCI is not set
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_PPC_EMULATE_SSTEP=y
CONFIG_ZONE_DMA32=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y

[PATCH 3/3] powernv:Recover correct PACA on wakeup from a stop on P9 DD1

2017-03-13 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

The POWER9 platform can be configured to rebalance per-thread resources
within a core in order to improve SMT performance.  Certain STOP
states can be configured to relinquish resources, including some
hypervisor SPRs, in order to enable SMT thread folding.

Because per-thread resources are relinquished under certain platform
configurations, some SPR context can be lost within a core and needs
to be recovered.  This state loss is due to reconfiguration of
SMT threads and not due to an actual electrical power loss.

This patch implements a context recovery framework within the threads of
a core, by provisioning space in paca_struct for saving every sibling
thread's paca pointer. Basically, we should be able to arrive at the
right paca pointer from any of the sibling threads' existing paca
pointers.

At bootup, during powernv idle-init, we save the paca address of every
CPU in each of its siblings' paca_struct, in the slot corresponding to
this CPU's index in the core.

On wakeup from a stop, the thread will determine its index in the core
from the lower 2 bits of the PIR register and recover its PACA pointer
by indexing into the correct slot in the provisioned space in the
current PACA.

[Changelog written with inputs from sva...@linux.vnet.ibm.com]

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/include/asm/paca.h   |  5 
 arch/powerpc/kernel/asm-offsets.c |  1 +
 arch/powerpc/kernel/idle_book3s.S | 43 ++-
 arch/powerpc/platforms/powernv/idle.c | 22 ++
 4 files changed, 70 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
index 708c3e5..4405630 100644
--- a/arch/powerpc/include/asm/paca.h
+++ b/arch/powerpc/include/asm/paca.h
@@ -172,6 +172,11 @@ struct paca_struct {
u8 thread_mask;
/* Mask to denote subcore sibling threads */
u8 subcore_sibling_mask;
+   /*
+* Pointer to an array which contains pointer
+* to the sibling threads' paca.
+*/
+   struct paca_struct *thread_sibling_pacas[8];
 #endif
 
 #ifdef CONFIG_PPC_BOOK3S_64
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index 4367e7d..6ec5016 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -727,6 +727,7 @@ int main(void)
OFFSET(PACA_THREAD_IDLE_STATE, paca_struct, thread_idle_state);
OFFSET(PACA_THREAD_MASK, paca_struct, thread_mask);
OFFSET(PACA_SUBCORE_SIBLING_MASK, paca_struct, subcore_sibling_mask);
+   OFFSET(PACA_SIBLING_PACA_PTRS, paca_struct, thread_sibling_pacas);
 #endif
 
DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 9957287..5a90f2c 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -385,7 +385,48 @@ _GLOBAL(power9_idle_stop)
  */
 _GLOBAL(pnv_restore_hyp_resource)
 BEGIN_FTR_SECTION
-   ld  r2,PACATOC(r13);
+BEGIN_FTR_SECTION_NESTED(70)
+/* Save our LR in r17 */
+   mflrr17
+/*
+ * On entering certain stop states, the thread might relinquish its
+ * per-thread resources due to some reconfiguration for improved SMT
+ * performance. This would result in certain SPR context such as
+ * HSPRG0 (which contains the paca pointer) to be lost within the core.
+ *
+ * Fortunately, the PIR is invariant to thread reconfiguration. Since
+ * this thread's paca pointer is recorded in all its sibling's
+ * paca, we can correctly recover this thread's paca pointer if we
+ * know the index of this thread in the core.
+ * This index can be obtained from the lower two bits of the PIR.
+ *
+ * i.e, thread's position in the core = PIR[62:63].
+ * If this value is i, then this thread's paca is
+ * paca->thread_sibling_pacas[i].
+ */
+   mfspr   r4, SPRN_PIR
+   andi.   r4, r4, 0x3
+/*
+ * Since each entry in thread_sibling_pacas is 8 bytes
+ * we need to left-shift by 3 bits. Thus r4 = i * 8
+ */
+   sldir4, r4, 3
+/* Get &paca->thread_sibling_pacas[0] in r5 */
+   addir5, r13, PACA_SIBLING_PACA_PTRS
+/* Load paca->thread_sibling_pacas[i] into r3 */
+   ldx r3, r4, r5
+/* Move our paca pointer to r13 */
+   mr  r13, r3
+/* Correctly set up our PACA */
+   ld  r2, PACATOC(r13)
+   ld  r1, PACAR1(r13)
+   bl  setup_paca
+/* Now we are all set! Set our LR and TOC */
+   mtlrr17
+   ld  r2, PACATOC(r13)
+FTR_SECTION_ELSE_NESTED(70)
+   ld  r2, PACATOC(r13)
+ALT_FTR_SECTION_END_NESTED_IFSET(CPU_FTR_POWER9_DD1, 70)
/*
 * POWER ISA 3. Use PSSCR to determine if we
 * are waking up from deep idle state
diff --git a/arch/powerpc/platforms/powernv/idle.c 
b/arch/powerpc/platforms/powernv/idle.c
index 9fde6e4..87311c2 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ 
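
(A rough, illustrative sketch of the boot-time setup described in the
changelog -- not the actual idle.c hunk -- would look roughly like this:)

static void pnv_save_sibling_pacas(void)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		int base = cpu_first_thread_sibling(cpu);
		int idx = cpu_thread_in_core(cpu);	/* same value PIR[62:63] gives */
		int i;

		/* record this CPU's paca in slot 'idx' of every core sibling */
		for (i = 0; i < threads_per_core; i++)
			paca[base + i].thread_sibling_pacas[idx] = &paca[cpu];
	}
}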

[PATCH 1/3] powernv:smp: Add busy-wait loop as fall back for CPU-Hotplug

2017-03-13 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently, the powernv cpu-offline function assumes that platform idle
states such as stop on POWER9, winkle/sleep/nap on POWER8 are always
available. On POWER8, it picks nap as the default state if other deep
idle states like sleep/winkle are not available and enabled in the
platform.

On POWER9, nap is not available and all idle states are managed by the
STOP instruction.  The parameters of the idle state are passed through
the processor stop status and control register (PSSCR).  Hence, executing
STOP as such would take its parameters from the current PSSCR. We do not
want to make any assumptions in the kernel about which STOP states and
PSSCR features are configured by the platform.

Ideally the platform will configure a good set of stop states that can be
used by the kernel.  We would like to start with a clean slate if the
platform chooses not to configure any state, or if there is an error in
the platform firmware that leads to no stop states being configured or
allowed to be requested.

This patch adds a fallback method for CPU-Hotplug that is similar to
the snooze idle loop, where the threads are left to spin at low priority
and hence reduce the cycles consumed.

This is a safe fallback mechanism for the case where no stop state can
be requested because the platform firmware did not configure any, most
likely due to an error condition.

Requesting a stop state that the platform has not configured or enabled
would lead to further error conditions which could be difficult to
debug.

[Changelog written with inputs from sva...@linux.vnet.ibm.com]
Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/platforms/powernv/smp.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/powernv/smp.c 
b/arch/powerpc/platforms/powernv/smp.c
index e39e6c4..8d5b99e 100644
--- a/arch/powerpc/platforms/powernv/smp.c
+++ b/arch/powerpc/platforms/powernv/smp.c
@@ -192,8 +192,16 @@ static void pnv_smp_cpu_kill_self(void)
} else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
   (idle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
srr1 = power7_sleep();
-   } else {
+   } else if (idle_states & OPAL_PM_NAP_ENABLED) {
srr1 = power7_nap(1);
+   } else {
+   /* This is the fallback method. We emulate snooze */
+   while (!generic_check_cpu_restart(cpu)) {
+   HMT_low();
+   HMT_very_low();
+   }
+   srr1 = 0;
+   HMT_medium();
}
 
ppc64_runlatch_on();
-- 
1.9.4



[PATCH 2/3] powernv:idle: Don't override default/deepest directly in kernel

2017-03-13 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Currently, during idle-init on POWER9, if we don't find suitable stop
states in the device tree that can be used as the
default_stop/deepest_stop, we set stop0 (ESL=1, EC=1) as the default
stop state PSSCR to be used by power9_idle and as the deepest stop
state used by CPU-Hotplug.

However, if the platform firmware has not configured or enabled a stop
state, the kernel should not make any assumptions or fall back to a
default choice.

If the kernel uses a stop state that is not configured by the platform
firmware, it may cause further failures, which should be avoided.

In this patch, we modify the init code to ensure that the kernel uses
only the stop states exposed by the firmware through the device tree.
When a suitable default stop state isn't found, we disable
ppc_md.power_save for POWER9.  Similarly, when a suitable deepest stop
state is not found in the device tree exported by the firmware, we fall
back to the default busy-wait loop in the CPU-Hotplug code.
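
The net effect on idle selection is roughly the following (a simplified
sketch that mirrors the pnv_init_idle_states() hunk below; when
ppc_md.power_save stays unset, the arch idle loop falls back to its
generic low-priority wait):

	if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
		ppc_md.power_save = power7_idle;
	else if ((supported_cpuidle_states & OPAL_PM_STOP_INST_FAST) &&
		 default_stop_found)
		ppc_md.power_save = power9_idle;
	/* else: ppc_md.power_save remains NULL */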

[Changelog written with inputs from sva...@linux.vnet.ibm.com]

Signed-off-by: Gautham R. Shenoy 
---
 arch/powerpc/platforms/powernv/idle.c| 27 ++-
 arch/powerpc/platforms/powernv/powernv.h |  1 +
 arch/powerpc/platforms/powernv/smp.c |  8 +++-
 3 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
index 4ee837e..9fde6e4 100644
--- a/arch/powerpc/platforms/powernv/idle.c
+++ b/arch/powerpc/platforms/powernv/idle.c
@@ -147,7 +147,6 @@ u32 pnv_get_supported_cpuidle_states(void)
 }
 EXPORT_SYMBOL_GPL(pnv_get_supported_cpuidle_states);
 
-
 static void pnv_fastsleep_workaround_apply(void *info)
 
 {
@@ -243,6 +242,7 @@ static DEVICE_ATTR(fastsleep_workaround_applyonce, 0600,
  */
 u64 pnv_default_stop_val;
 u64 pnv_default_stop_mask;
+bool default_stop_found;
 
 /*
  * Used for ppc_md.power_save which needs a function with no parameters
@@ -264,6 +264,7 @@ static void power9_idle(void)
  */
 u64 pnv_deepest_stop_psscr_val;
 u64 pnv_deepest_stop_psscr_mask;
+bool deepest_stop_found;
 
 /*
  * Power ISA 3.0 idle initialization.
@@ -352,7 +353,6 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
u32 *residency_ns = NULL;
u64 max_residency_ns = 0;
int rc = 0, i;
-   bool default_stop_found = false, deepest_stop_found = false;
 
psscr_val = kcalloc(dt_idle_states, sizeof(*psscr_val), GFP_KERNEL);
psscr_mask = kcalloc(dt_idle_states, sizeof(*psscr_mask), GFP_KERNEL);
@@ -433,20 +433,22 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
}
 
if (!default_stop_found) {
-   pnv_default_stop_val = PSSCR_HV_DEFAULT_VAL;
-   pnv_default_stop_mask = PSSCR_HV_DEFAULT_MASK;
-   pr_warn("Setting default stop psscr 
val=0x%016llx,mask=0x%016llx\n",
+   pr_warn("powernv:idle: Default stop not found. Disabling 
ppcmd.powersave\n");
+   } else {
+   pr_info("powernv:idle: Default stop: psscr = 
0x%016llx,mask=0x%016llx\n",
pnv_default_stop_val, pnv_default_stop_mask);
}
 
if (!deepest_stop_found) {
-   pnv_deepest_stop_psscr_val = PSSCR_HV_DEFAULT_VAL;
-   pnv_deepest_stop_psscr_mask = PSSCR_HV_DEFAULT_MASK;
-   pr_warn("Setting default stop psscr 
val=0x%016llx,mask=0x%016llx\n",
+   pr_warn("powernv:idle: Deepest stop not found. CPU-Hotplug is 
affected\n");
+   } else {
+   pr_info("powernv:idle: Deepest stop: psscr = 
0x%016llx,mask=0x%016llx\n",
pnv_deepest_stop_psscr_val,
pnv_deepest_stop_psscr_mask);
}
 
+   pr_info("powernv:idle: RL value of first deep stop = 0x%llx\n",
+   pnv_first_deep_stop_state);
 out:
kfree(psscr_val);
kfree(psscr_mask);
@@ -454,6 +456,12 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
return rc;
 }
 
+bool pnv_check_deepest_stop(void)
+{
+   return deepest_stop_found;
+}
+EXPORT_SYMBOL_GPL(pnv_check_deepest_stop);
+
 /*
  * Probe device tree for supported idle states
  */
@@ -526,7 +534,8 @@ static int __init pnv_init_idle_states(void)
 
if (supported_cpuidle_states & OPAL_PM_NAP_ENABLED)
ppc_md.power_save = power7_idle;
-   else if (supported_cpuidle_states & OPAL_PM_STOP_INST_FAST)
+   else if ((supported_cpuidle_states & OPAL_PM_STOP_INST_FAST) &&
+default_stop_found)
ppc_md.power_save = power9_idle;
 
 out:
diff --git a/arch/powerpc/platforms/powernv/powernv.h b/arch/powerpc/platforms/powernv/powernv.h
index 6130522..9acd5eb 100644
--- a/arch/powerpc/platforms/powernv/powernv.h
+++ b/arch/powerpc/platforms/powernv/powernv.h
@@ -18,6 +18,7 @@ static inline void 

[PATCH 0/3] powernv:idle: Fixes for CPU-Hotplug on POWER9 DD1.0

2017-03-13 Thread Gautham R. Shenoy
From: "Gautham R. Shenoy" 

Hi,

This patchset contains fixes to make CPU-Hotplug work correctly on
POWER9 DD1 systems.

There are three patches in the series.

- The first patch adds a fallback mechanism for CPU-Hotplug when no
  platform idle state is available.

- The second patch ensures that the kernel doesn't use any stop state
  that is not exposed by the firmware.

- The third patch adds a recovery framework for correctly recovering
  the PACA pointer of a thread waking up from a stop.

These patches are based on powerpc/linux.git "fixes" with the top
commit a7d2475af7aed ("powerpc: Sort the selects under CONFIG_PPC").

The patches have been tested with stop1 (ESL=EC=1) as the deepest state
entered during CPU-Hotplug.

Gautham R. Shenoy (3):
  powernv:smp: Add busy-wait loop as fall back for CPU-Hotplug
  powernv:idle: Don't override default/deepest directly in kernel
  powernv:Recover correct PACA on wakeup from a stop on P9 DD1

 arch/powerpc/include/asm/paca.h  |  5 
 arch/powerpc/kernel/asm-offsets.c|  1 +
 arch/powerpc/kernel/idle_book3s.S| 43 +++-
 arch/powerpc/platforms/powernv/idle.c| 49 ++--
 arch/powerpc/platforms/powernv/powernv.h |  1 +
 arch/powerpc/platforms/powernv/smp.c | 18 ++--
 6 files changed, 105 insertions(+), 12 deletions(-)

-- 
1.9.4



Re: [net-next v2 00/10] QorIQ DPAA 1 updates

2017-03-13 Thread David Miller
From: Madalin Bucur 
Date: Thu, 9 Mar 2017 16:36:55 +0200

> This patch set introduces a series of fixes and features to the DPAA 1
> drivers. Besides activating hardware Rx checksum offloading, four traffic
> classes are added for Tx traffic prioritisation.
> 
> The changes are also available on the dpaa_eth-next branch in the git
> repository at:
> 
>   git://git.freescale.com/ppc/upstream/linux.git
> 
> changes from v1: added patch to enable context-A stashing

Pulled, thanks.