Re: [patch 14/30] drm/i915/pmu: Replace open coded kstat_irqs() copy

2020-12-11 Thread Tvrtko Ursulin



On 10/12/2020 19:25, Thomas Gleixner wrote:

Driver code has no business with the internals of the irq descriptor.

Aside of that the count is per interrupt line and therefore takes
interrupts from other devices into account which share the interrupt line
and are not handled by the graphics driver.

Replace it with a pmu private count which only counts interrupts which
originate from the graphics card.

To avoid atomics or heuristics of some sort make the counter field
'unsigned long'. That limits the count to 4e9 on 32bit which is a lot and
postprocessing can easily deal with the occasional wraparound.
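
As an illustration of that postprocessing (hypothetical consumer code, not
part of the patch): as long as the counter is sampled more often than once
per full wrap, an unsigned 32-bit subtraction recovers the correct delta
across a single wraparound.

#include <stdint.h>

/*
 * Fold a possibly-wrapped 32-bit sample into a 64-bit running total.
 * The modulo-2^32 subtraction absorbs one wraparound between samples.
 */
static uint64_t accumulate_irq_count(uint64_t total, uint32_t prev, uint32_t now)
{
	return total + (uint32_t)(now - prev);
}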


After my failed hasty sketch from last night I had a different one which 
was kind of heuristics based (re-reading the upper dword and retrying if 
it changed on 32-bit). But you are right - it is okay to at least start 
like this today and if later there is a need we can either do that or 
deal with wrap at PMU read time.
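
For completeness, the non-heuristic way to make a 64-bit counter readable on
32-bit without atomics would be a seqcount; a rough sketch below (struct and
function names are made up), which is exactly the extra machinery the
'unsigned long' approach avoids:

struct pmu_irq_count {		/* hypothetical */
	seqcount_t seq;		/* init with seqcount_init() */
	u64 count;
};

/* single writer: the (non-reentrant) interrupt handler */
static void pmu_irq_count_inc(struct pmu_irq_count *c)
{
	write_seqcount_begin(&c->seq);
	c->count++;
	write_seqcount_end(&c->seq);
}

/* lockless reader: the PMU read side */
static u64 pmu_irq_count_read(struct pmu_irq_count *c)
{
	unsigned int seq;
	u64 val;

	do {
		seq = read_seqcount_begin(&c->seq);
		val = c->count;
	} while (read_seqcount_retry(&c->seq, seq));

	return val;
}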


So thanks for dealing with it, some small comments below but overall it 
is fine.



Signed-off-by: Thomas Gleixner 
Cc: Tvrtko Ursulin 
Cc: Jani Nikula 
Cc: Joonas Lahtinen 
Cc: Rodrigo Vivi 
Cc: David Airlie 
Cc: Daniel Vetter 
Cc: intel-...@lists.freedesktop.org
Cc: dri-de...@lists.freedesktop.org
---
  drivers/gpu/drm/i915/i915_irq.c |   34 ++
  drivers/gpu/drm/i915/i915_pmu.c |   18 +-
  drivers/gpu/drm/i915/i915_pmu.h |8 
  3 files changed, 43 insertions(+), 17 deletions(-)

--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -60,6 +60,24 @@
   * and related files, but that will be described in separate chapters.
   */
  
+/*

+ * Interrupt statistic for PMU. Increments the counter only if the
+ * interrupt originated from the GPU so interrupts from a device which
+ * shares the interrupt line are not accounted.
+ */
+static inline void pmu_irq_stats(struct drm_i915_private *priv,


We never use priv as a local name, it should be either i915 or dev_priv.


+irqreturn_t res)
+{
+   if (unlikely(res != IRQ_HANDLED))
+   return;
+
+   /*
+* A clever compiler translates that into INC. A not so clever one
+* should at least prevent store tearing.
+*/
+   WRITE_ONCE(priv->pmu.irq_count, priv->pmu.irq_count + 1);


Curious, probably more educational for me - given x86_32 and x86_64, and 
the context of it getting called, what is the difference from just doing 
irq_count++?
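
(For reference, the difference is about what the compiler is allowed to do:
a plain irq_count++ may in principle be compiled into several partial
stores, while WRITE_ONCE() guarantees a single store, so a lockless reader
never observes a torn value. A hypothetical paired read side would be:)

/* Hypothetical lockless read side for the counter written above. */
static unsigned long pmu_irq_count(struct drm_i915_private *i915)
{
	return READ_ONCE(i915->pmu.irq_count);
}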



+}
+
  typedef bool (*long_pulse_detect_func)(enum hpd_pin pin, u32 val);
  
  static const u32 hpd_ilk[HPD_NUM_PINS] = {

@@ -1599,6 +1617,8 @@ static irqreturn_t valleyview_irq_handle
valleyview_pipestat_irq_handler(dev_priv, pipe_stats);
} while (0);
  
+	pmu_irq_stats(dev_priv, ret);

+
	enable_rpm_wakeref_asserts(&dev_priv->runtime_pm);
  
  	return ret;

@@ -1676,6 +1696,8 @@ static irqreturn_t cherryview_irq_handle
valleyview_pipestat_irq_handler(dev_priv, pipe_stats);
} while (0);
  
+	pmu_irq_stats(dev_priv, ret);

+
	enable_rpm_wakeref_asserts(&dev_priv->runtime_pm);
  
  	return ret;

@@ -2103,6 +2125,8 @@ static irqreturn_t ilk_irq_handler(int i
if (sde_ier)
raw_reg_write(regs, SDEIER, sde_ier);
  
+	pmu_irq_stats(i915, ret);

+
/* IRQs are synced during runtime_suspend, we don't require a wakeref */
	enable_rpm_wakeref_asserts(&i915->runtime_pm);
  
@@ -2419,6 +2443,8 @@ static irqreturn_t gen8_irq_handler(int
  
  	gen8_master_intr_enable(regs);
  
+	pmu_irq_stats(dev_priv, IRQ_HANDLED);

+
return IRQ_HANDLED;
  }
  
@@ -2514,6 +2540,8 @@ static __always_inline irqreturn_t
  
  	gen11_gu_misc_irq_handler(gt, gu_misc_iir);
  
+	pmu_irq_stats(i915, IRQ_HANDLED);

+
return IRQ_HANDLED;
  }
  
@@ -3688,6 +3716,8 @@ static irqreturn_t i8xx_irq_handler(int

i8xx_pipestat_irq_handler(dev_priv, iir, pipe_stats);
} while (0);
  
+	pmu_irq_stats(dev_priv, ret);

+
	enable_rpm_wakeref_asserts(&dev_priv->runtime_pm);
  
  	return ret;

@@ -3796,6 +3826,8 @@ static irqreturn_t i915_irq_handler(int
i915_pipestat_irq_handler(dev_priv, iir, pipe_stats);
} while (0);
  
+	pmu_irq_stats(dev_priv, ret);

+
	enable_rpm_wakeref_asserts(&dev_priv->runtime_pm);
  
  	return ret;

@@ -3941,6 +3973,8 @@ static irqreturn_t i965_irq_handler(int
i965_pipestat_irq_handler(dev_priv, iir, pipe_stats);
} while (0);
  
+	pmu_irq_stats(dev_priv, IRQ_HANDLED);

+
	enable_rpm_wakeref_asserts(&dev_priv->runtime_pm);
  
  	return ret;

--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -423,22 +423,6 @@ static enum hrtimer_restart i915_sample(
return HRTIMER_RESTART;
  }


In this file you can also drop the #include <linux/irq.h> line.

  
-static u64 count_interrupts(struct drm_i915_private *i915)

-{
-   /* open-coded kstat_irqs() */

Re: [Intel-gfx] [PATCH v3 2/4] drm/i915/pmu: Use kstat_irqs to get interrupt count

2020-12-10 Thread Tvrtko Ursulin



On 10/12/2020 17:44, Thomas Gleixner wrote:

On Thu, Dec 10 2020 at 17:09, Tvrtko Ursulin wrote:

On 10/12/2020 16:35, Thomas Gleixner wrote:

I'll send out a series addressing irq_to_desc() (ab)use all over the
place shortly. i915 is in there...


Yep we don't need atomic, my bad. And we would care about the shared
interrupt line. And without atomic the extra accounting falls way below
noise.


You have to be careful though. If you make the accumulated counter 64
bit wide then you need to be careful vs. 32bit machines.


Yep, thanks, I am bad jumping from one thing to another. Forgot about 
the read side atomicity completely..



So in the light of it all, it sounds best I just quickly replace our
abuse with private counting and then you don't have to deal with it in
your series.


I mostly have it. Still chewing on the 32bit vs. 64bit thing. And
keeping it in my series allows me to remove the export of irq_to_desc()
at the end without waiting for your tree to be merged.

Give me a few.


Ok.

Regards,

Tvrtko


Re: [Intel-gfx] [PATCH v3 2/4] drm/i915/pmu: Use kstat_irqs to get interrupt count

2020-12-10 Thread Tvrtko Ursulin



On 10/12/2020 16:35, Thomas Gleixner wrote:

On Thu, Dec 10 2020 at 10:45, Tvrtko Ursulin wrote:

On 10/12/2020 07:53, Joonas Lahtinen wrote:

I think later in the thread there was a suggestion to replace this with
simple counter increment in IRQ handler.


It was indeed unsafe until recent b00bccb3f0bb ("drm/i915/pmu: Handle
PCI unbind") but now should be fine.

If kstat_irqs does not get exported it is easy enough for i915 to keep a
local counter. Reasoning was very infrequent per cpu summation is much
cheaper than very frequent atomic add. Up to thousands of interrupts per
second vs "once per second" PMU read kind of thing.


Why do you need an atomic_add? It's ONE interrupt which can only be
executed on ONE CPU at a time. Interrupt handlers are non-reentrant.

The core code function will just return an accumulated counter nowadays
which is only 32bit wide, which is what the interface provided forever.
That needs to be fixed first.

Aside of that the accounting is wrong when the interrupt line is shared
because the core accounts interrupt per line not per device sharing the
line. Don't know whether you care or not.

I'll send out a series addressing irq_to_desc() (ab)use all over the
place shortly. i915 is in there...


Yep we don't need atomic, my bad. And we would care about the shared 
interrupt line. And without atomic the extra accounting falls way below 
noise.


So in the light of it all, it sounds best I just quickly replace our 
abuse with private counting and then you don't have to deal with it in 
your series.


Regards,

Tvrtko


Re: [Intel-gfx] [PATCH v3 2/4] drm/i915/pmu: Use kstat_irqs to get interrupt count

2020-12-10 Thread Tvrtko Ursulin



On 10/12/2020 07:53, Joonas Lahtinen wrote:

+ Tvrtko and Chris for comments

Code seems to be added in:

commit 0cd4684d6ea9a4ffec33fc19de4dd667bb90d0a5
Author: Tvrtko Ursulin 
Date:   Tue Nov 21 18:18:50 2017 +

 drm/i915/pmu: Add interrupt count metric

I think later in the thread there was a suggestion to replace this with
simple counter increment in IRQ handler.


It was indeed unsafe until recent b00bccb3f0bb ("drm/i915/pmu: Handle 
PCI unbind") but now should be fine.


If kstat_irqs does not get exported it is easy enough for i915 to keep a 
local counter. Reasoning was very infrequent per cpu summation is much 
cheaper than very frequent atomic add. Up to thousands of interrupts per 
second vs "once per second" PMU read kind of thing.


Regards,

Tvrtko


Quoting Thomas Gleixner (2020-12-06 18:38:44)

On Fri, Dec 04 2020 at 18:43, Jerry Snitselaar wrote:


Now that kstat_irqs is exported, get rid of count_interrupts in
i915_pmu.c
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -423,22 +423,6 @@ static enum hrtimer_restart i915_sample(struct hrtimer 
*hrtimer)
   return HRTIMER_RESTART;
  }
  
-static u64 count_interrupts(struct drm_i915_private *i915)

-{
- /* open-coded kstat_irqs() */
- struct irq_desc *desc = irq_to_desc(i915->drm.pdev->irq);
- u64 sum = 0;
- int cpu;
-
- if (!desc || !desc->kstat_irqs)
- return 0;
-
- for_each_possible_cpu(cpu)
- sum += *per_cpu_ptr(desc->kstat_irqs, cpu);
-
- return sum;
-}


May I ask why this has been merged in the first place?

Nothing in a driver has ever to fiddle with the internals of an irq
descriptor. We have functions for properly accessing them. Just because
C allows to fiddle with everything is not a justification. If the
required function is not exported then adding the export with a proper
explanation is not asked too much.

Also this lacks protection or at least a comment why this can be called
safely and is not subject to a concurrent removal of the irq descriptor.
The same problem exists when calling kstat_irqs(). It's even documented
at the top of the function.

Thanks,

 tglx





Re: [PATCH v4 0/7] Convert the intel iommu driver to the dma-iommu api

2020-11-03 Thread Tvrtko Ursulin




On 03/11/2020 02:53, Lu Baolu wrote:

On 11/2/20 7:52 PM, Tvrtko Ursulin wrote:


On 02/11/2020 02:00, Lu Baolu wrote:

Hi Tvrtko,
On 10/12/20 4:44 PM, Tvrtko Ursulin wrote:


On 29/09/2020 01:11, Lu Baolu wrote:

Hi Tvrtko,

On 9/28/20 5:44 PM, Tvrtko Ursulin wrote:


On 27/09/2020 07:34, Lu Baolu wrote:

Hi,

The previous post of this series could be found here.

https://lore.kernel.org/linux-iommu/20200912032200.11489-1-baolu...@linux.intel.com/ 



This version introduces a new patch [4/7] to fix an issue reported 
here.


https://lore.kernel.org/linux-iommu/51a1baec-48d1-c0ac-181b-1fba92aa4...@linux.intel.com/ 



There aren't any other changes.

Please help to test and review.

Best regards,
baolu

Lu Baolu (3):
   iommu: Add quirk for Intel graphic devices in map_sg


Since I do have patches to fix i915 to handle this, do we want to 
co-ordinate the two and avoid having to add this quirk and then 
later remove it? Or you want to go the staged approach?


I have no preference. It depends on which patch goes first. Let the
maintainers help here.


FYI we have merged the required i915 patches to our tree last week 
or so. I *think* this means they will go into 5.11. So the i915 
specific workaround patch will not be needed in Intel IOMMU.


Do you mind telling me what's the status of this fix patch? I tried this
series on v5.10-rc1 with the graphic quirk patch dropped. I am still
seeing dma faults from graphic device.


Hmm back then I thought i915 fixes for this would land in 5.11 so I 
will stick with that. :) (See my quoted text a paragraph above yours.)


What size are those fixes? I am considering pushing this series for
v5.11. Is it possible to get some acks for those patches and let them
go to Linus through iommu tree?


For 5.10 you mean? They feel a bit too large for comfort to go via a 
non-i915/drm tree. These are the two patches required:


https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-gt-next=8a473dbadccfc6206150de3db3223c40785da348
https://cgit.freedesktop.org/drm-intel/commit/?h=drm-intel-gt-next=934941ed5a3070a7833c688c9b1d71484fc01a68

I'll copy Joonas as our maintainer - how does the idea of taking the 
above two patches through the iommu tree sound to you?


Regards,

Tvrtko


Re: [PATCH v4 0/7] Convert the intel iommu driver to the dma-iommu api

2020-11-02 Thread Tvrtko Ursulin



On 02/11/2020 02:00, Lu Baolu wrote:

Hi Tvrtko,
On 10/12/20 4:44 PM, Tvrtko Ursulin wrote:


On 29/09/2020 01:11, Lu Baolu wrote:

Hi Tvrtko,

On 9/28/20 5:44 PM, Tvrtko Ursulin wrote:


On 27/09/2020 07:34, Lu Baolu wrote:

Hi,

The previous post of this series could be found here.

https://lore.kernel.org/linux-iommu/20200912032200.11489-1-baolu...@linux.intel.com/ 



This version introduces a new patch [4/7] to fix an issue reported 
here.


https://lore.kernel.org/linux-iommu/51a1baec-48d1-c0ac-181b-1fba92aa4...@linux.intel.com/ 



There aren't any other changes.

Please help to test and review.

Best regards,
baolu

Lu Baolu (3):
   iommu: Add quirk for Intel graphic devices in map_sg


Since I do have patches to fix i915 to handle this, do we want to 
co-ordinate the two and avoid having to add this quirk and then 
later remove it? Or you want to go the staged approach?


I have no preference. It depends on which patch goes first. Let the
maintainers help here.


FYI we have merged the required i915 patches to our tree last week or 
so. I *think* this means they will go into 5.11. So the i915 specific 
workaround patch will not be needed in Intel IOMMU.


Do you mind telling me what's the status of this fix patch? I tried this
series on v5.10-rc1 with the graphic quirk patch dropped. I am still
seeing dma faults from graphic device.


Hmm back then I thought i915 fixes for this would land in 5.11 so I will 
stick with that. :) (See my quoted text a paragraph above yours.)


Regards,

Tvrtko


Re: [PATCH v4 0/7] Convert the intel iommu driver to the dma-iommu api

2020-10-12 Thread Tvrtko Ursulin



On 29/09/2020 01:11, Lu Baolu wrote:

Hi Tvrtko,

On 9/28/20 5:44 PM, Tvrtko Ursulin wrote:


On 27/09/2020 07:34, Lu Baolu wrote:

Hi,

The previous post of this series could be found here.

https://lore.kernel.org/linux-iommu/20200912032200.11489-1-baolu...@linux.intel.com/ 



This version introduces a new patch [4/7] to fix an issue reported here.

https://lore.kernel.org/linux-iommu/51a1baec-48d1-c0ac-181b-1fba92aa4...@linux.intel.com/ 



There aren't any other changes.

Please help to test and review.

Best regards,
baolu

Lu Baolu (3):
   iommu: Add quirk for Intel graphic devices in map_sg


Since I do have patches to fix i915 to handle this, do we want to 
co-ordinate the two and avoid having to add this quirk and then later 
remove it? Or you want to go the staged approach?


I have no preference. It depends on which patch goes first. Let the
maintainers help here.


FYI we have merged the required i915 patches to our tree last week or 
so. I *think* this means they will go into 5.11. So the i915 specific 
workaround patch will not be needed in Intel IOMMU.


Regards,

Tvrtko


Re: [PATCH v4 0/7] Convert the intel iommu driver to the dma-iommu api

2020-09-28 Thread Tvrtko Ursulin



On 27/09/2020 07:34, Lu Baolu wrote:

Hi,

The previous post of this series could be found here.

https://lore.kernel.org/linux-iommu/20200912032200.11489-1-baolu...@linux.intel.com/

This version introduces a new patch [4/7] to fix an issue reported here.

https://lore.kernel.org/linux-iommu/51a1baec-48d1-c0ac-181b-1fba92aa4...@linux.intel.com/

There aren't any other changes.

Please help to test and review.

Best regards,
baolu

Lu Baolu (3):
   iommu: Add quirk for Intel graphic devices in map_sg


Since I do have patches to fix i915 to handle this, do we want to 
co-ordinate the two and avoid having to add this quirk and then later 
remove it? Or you want to go the staged approach?


Regards,

Tvrtko


   iommu/vt-d: Update domain geometry in iommu_ops.at(de)tach_dev
   iommu/vt-d: Cleanup after converting to dma-iommu ops

Tom Murphy (4):
   iommu: Handle freelists when using deferred flushing in iommu drivers
   iommu: Add iommu_dma_free_cpu_cached_iovas()
   iommu: Allow the dma-iommu api to use bounce buffers
   iommu/vt-d: Convert intel iommu driver to the iommu ops

  .../admin-guide/kernel-parameters.txt |   5 -
  drivers/iommu/dma-iommu.c | 228 -
  drivers/iommu/intel/Kconfig   |   1 +
  drivers/iommu/intel/iommu.c   | 901 +++---
  include/linux/dma-iommu.h |   8 +
  include/linux/iommu.h |   1 +
  6 files changed, 336 insertions(+), 808 deletions(-)



Re: [PATCH 08/11] drm/i915: use vmap in i915_gem_object_map

2020-09-25 Thread Tvrtko Ursulin
return NULL;
+   }
  
-		for_each_sgt_page(page, iter, sgt)

-   **ptes++ = mk_pte(page, pgprot);
-   } else {
-   resource_size_t iomap;
-   struct sgt_iter iter;
-   pte_t **ptes = mem;
-   dma_addr_t addr;
+   i = 0;
+   for_each_sgt_page(page, iter, obj->mm.pages)
+   pages[i++] = page;
+   vaddr = vmap(pages, n_pages, 0, pgprot);
+   if (pages != stack)
+   kvfree(pages);
+   return vaddr;
+}
  
-		iomap = obj->mm.region->iomap.base;

-   iomap -= obj->mm.region->region.start;
+static void *i915_gem_object_map_pfn(struct drm_i915_gem_object *obj,
+   enum i915_map_type type)
+{
+   resource_size_t iomap = obj->mm.region->iomap.base -
+   obj->mm.region->region.start;
+   unsigned long n_pfn = obj->base.size >> PAGE_SHIFT;
+   unsigned long stack[32], *pfns = stack, i;
+   struct sgt_iter iter;
+   dma_addr_t addr;
+   void *vaddr;
+
+   if (type != I915_MAP_WC)
+   return NULL;
  
-		for_each_sgt_daddr(addr, iter, sgt)

-   **ptes++ = iomap_pte(iomap, addr, pgprot);
+   if (n_pfn > ARRAY_SIZE(stack)) {
+   /* Too big for stack -- allocate temporary array instead */
+   pfns = kvmalloc_array(n_pfn, sizeof(*pfns), GFP_KERNEL);
+   if (!pfns)
+   return NULL;
}
  
-	if (mem != stack)

-   kvfree(mem);
-
-   return area->addr;
+   for_each_sgt_daddr(addr, iter, obj->mm.pages)
+   pfns[i++] = (iomap + addr) >> PAGE_SHIFT;
+   vaddr = vmap_pfn(pfns, n_pfn, pgprot_writecombine(PAGE_KERNEL_IO));
+   if (pfns != stack)
+   kvfree(pfns);
+   return vaddr;
  }
  
  /* get, pin, and map the pages of the object into kernel space */

@@ -383,7 +367,13 @@ void *i915_gem_object_pin_map(struct drm_i915_gem_object 
*obj,
}
  
  	if (!ptr) {

-   ptr = i915_gem_object_map(obj, type);
+   if (GEM_WARN_ON(type == I915_MAP_WC &&
+   !static_cpu_has(X86_FEATURE_PAT)))
+   ptr = NULL;
+   else if (i915_gem_object_has_struct_page(obj))
+   ptr = i915_gem_object_map_page(obj, type);
+   else
+   ptr = i915_gem_object_map_pfn(obj, type);
if (!ptr) {
err = -ENOMEM;
goto err_unpin;



Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: [PATCH 06/11] drm/i915: use vmap in shmem_pin_map

2020-09-25 Thread Tvrtko Ursulin



On 24/09/2020 14:58, Christoph Hellwig wrote:

shmem_pin_map somewhat awkwardly reimplements vmap using
alloc_vm_area and manual pte setup.  The only practical difference
is that alloc_vm_area prefaults the vmalloc area PTEs, which doesn't
seem to be required here (and could be added to vmap using a flag if
actually required).  Switch to use vmap, and use vfree to free both the
vmalloc mapping and the page array, as well as dropping the references
to each page.

Signed-off-by: Christoph Hellwig 
---
  drivers/gpu/drm/i915/gt/shmem_utils.c | 76 +++
  1 file changed, 18 insertions(+), 58 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/shmem_utils.c 
b/drivers/gpu/drm/i915/gt/shmem_utils.c
index 43c7acbdc79dea..f011ea42487e11 100644
--- a/drivers/gpu/drm/i915/gt/shmem_utils.c
+++ b/drivers/gpu/drm/i915/gt/shmem_utils.c
@@ -49,80 +49,40 @@ struct file *shmem_create_from_object(struct 
drm_i915_gem_object *obj)
return file;
  }
  
-static size_t shmem_npte(struct file *file)

-{
-   return file->f_mapping->host->i_size >> PAGE_SHIFT;
-}
-
-static void __shmem_unpin_map(struct file *file, void *ptr, size_t n_pte)
-{
-   unsigned long pfn;
-
-   vunmap(ptr);
-
-   for (pfn = 0; pfn < n_pte; pfn++) {
-   struct page *page;
-
-   page = shmem_read_mapping_page_gfp(file->f_mapping, pfn,
-  GFP_KERNEL);
-   if (!WARN_ON(IS_ERR(page))) {
-   put_page(page);
-   put_page(page);
-   }
-   }
-}
-
  void *shmem_pin_map(struct file *file)
  {
-   const size_t n_pte = shmem_npte(file);
-   pte_t *stack[32], **ptes, **mem;
-   struct vm_struct *area;
-   unsigned long pfn;
-
-   mem = stack;
-   if (n_pte > ARRAY_SIZE(stack)) {
-   mem = kvmalloc_array(n_pte, sizeof(*mem), GFP_KERNEL);
-   if (!mem)
-   return NULL;
-   }
+   struct page **pages;
+   size_t n_pages, i;
+   void *vaddr;
  
-	area = alloc_vm_area(n_pte << PAGE_SHIFT, mem);

-   if (!area) {
-   if (mem != stack)
-   kvfree(mem);
+   n_pages = file->f_mapping->host->i_size >> PAGE_SHIFT;
+   pages = kvmalloc_array(n_pages, sizeof(*pages), GFP_KERNEL);
+   if (!pages)
return NULL;
-   }
  
-	ptes = mem;

-   for (pfn = 0; pfn < n_pte; pfn++) {
-   struct page *page;
-
-   page = shmem_read_mapping_page_gfp(file->f_mapping, pfn,
-  GFP_KERNEL);
-   if (IS_ERR(page))
+   for (i = 0; i < n_pages; i++) {
+   pages[i] = shmem_read_mapping_page_gfp(file->f_mapping, i,
+  GFP_KERNEL);
+   if (IS_ERR(pages[i]))
goto err_page;
-
-   **ptes++ = mk_pte(page,  PAGE_KERNEL);
}
  
-	if (mem != stack)

-   kvfree(mem);
-
+   vaddr = vmap(pages, n_pages, VM_MAP_PUT_PAGES, PAGE_KERNEL);
+   if (!vaddr)
+   goto err_page;
mapping_set_unevictable(file->f_mapping);
-   return area->addr;
-
+   return vaddr;
  err_page:
-   if (mem != stack)
-   kvfree(mem);
-
-   __shmem_unpin_map(file, area->addr, pfn);
+   while (--i >= 0)
+   put_page(pages[i]);
+   kvfree(pages);
return NULL;
  }
  
  void shmem_unpin_map(struct file *file, void *ptr)

  {
mapping_clear_unevictable(file->f_mapping);
-   __shmem_unpin_map(file, ptr, shmem_npte(file));
+   vfree(ptr);
  }
  
  static int __shmem_rw(struct file *file, loff_t off,




Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: [PATCH 07/11] drm/i915: stop using kmap in i915_gem_object_map

2020-09-25 Thread Tvrtko Ursulin



On 24/09/2020 14:58, Christoph Hellwig wrote:

kmap for !PageHighmem is just a convoluted way to say page_address,
and kunmap is a no-op in that case.

Signed-off-by: Christoph Hellwig 
---
  drivers/gpu/drm/i915/gem/i915_gem_pages.c | 7 ++-
  1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c 
b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
index d6eeefab3d018b..6550c0bc824ea2 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -162,8 +162,6 @@ static void unmap_object(struct drm_i915_gem_object *obj, 
void *ptr)
  {
if (is_vmalloc_addr(ptr))
vunmap(ptr);
-   else
-   kunmap(kmap_to_page(ptr));
  }
  
  struct sg_table *

@@ -277,11 +275,10 @@ static void *i915_gem_object_map(struct 
drm_i915_gem_object *obj,
 * forever.
 *
 * So if the page is beyond the 32b boundary, make an explicit
-* vmap. On 64b, this check will be optimised away as we can
-* directly kmap any page on the system.
+* vmap.
 */
if (!PageHighMem(page))
-   return kmap(page);
+   return page_address(page);
}
  
  	mem = stack;




Reviewed-by: Tvrtko Ursulin 

Regards,

Tvrtko


Re: [Intel-gfx] [PATCH 4/6] drm/i915: use vmap in i915_gem_object_map

2020-09-24 Thread Tvrtko Ursulin



On 23/09/2020 15:44, Christoph Hellwig wrote:

On Wed, Sep 23, 2020 at 02:58:43PM +0100, Tvrtko Ursulin wrote:

Series did not get a CI run from our side because of a different base so I
don't know if you would like to have a run there? If so you would need to
rebase against git://anongit.freedesktop.org/drm-tip drm-tip and you could
even send a series to intel-gfx-try...@lists.freedesktop.org, suppressing
cc, to check it out without sending a copy to the real mailing list.


It doesn't seem like I can post to any freedesktop list, as I always
get rejection messages.  But I'll happily prepare a branch if one
of you can feed it into your CI.


That's fine, just ping me and I will forward it for testing, thanks!


 git://git.infradead.org/users/hch/misc.git i915-vmap-wip

Gitweb:

 
http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/i915-vmap-wip

note that this includes a new commit to clean up one of the recent
commits in the code.


CI says series looks good from the i915 perspective (*).

I don't know how you will handle it logistically, but when you have the 
final version I am happy to re-read and r-b the i915 patches.



Regards,

Tvrtko

*)
https://patchwork.freedesktop.org/series/82051/


Re: [Intel-gfx] [PATCH 4/6] drm/i915: use vmap in i915_gem_object_map

2020-09-23 Thread Tvrtko Ursulin



On 23/09/2020 14:41, Christoph Hellwig wrote:

On Wed, Sep 23, 2020 at 10:52:33AM +0100, Tvrtko Ursulin wrote:


On 18/09/2020 17:37, Christoph Hellwig wrote:

i915_gem_object_map implements fairly low-level vmap functionality in
a driver.  Split it into two helpers, one for remapping kernel memory
which can use vmap, and one for I/O memory that uses vmap_pfn.

The only practical difference is that alloc_vm_area prefaults the
vmalloc area PTEs, which doesn't seem to be required here for the
kernel memory case (and could be added to vmap using a flag if actually
required).


Patch looks good to me.

Series did not get a CI run from our side because of a different base so I
don't know if you would like to have a run there? If so you would need to
rebase against git://anongit.freedesktop.org/drm-tip drm-tip and you could
even send a series to intel-gfx-try...@lists.freedesktop.org, suppressing
cc, to check it out without sending a copy to the real mailing list.


It doesn't seem like I can post to any freedesktop list, as I always
get rejection messages.  But I'll happily prepare a branch if one
of you can feed it into your CI.


That's fine, just ping me and I will forward it for testing, thanks!

Regards,

Tvrtko


Re: [Intel-gfx] [PATCH 4/6] drm/i915: use vmap in i915_gem_object_map

2020-09-23 Thread Tvrtko Ursulin



On 18/09/2020 17:37, Christoph Hellwig wrote:

i915_gem_object_map implements fairly low-level vmap functionality in
a driver.  Split it into two helpers, one for remapping kernel memory
which can use vmap, and one for I/O memory that uses vmap_pfn.

The only practical difference is that alloc_vm_area prefaults the
vmalloc area PTEs, which doesn't seem to be required here for the
kernel memory case (and could be added to vmap using a flag if actually
required).


Patch looks good to me.

Series did not get a CI run from our side because of a different base so 
I don't know if you would like to have a run there? If so you would need 
to rebase against git://anongit.freedesktop.org/drm-tip drm-tip and you 
could even send a series to intel-gfx-try...@lists.freedesktop.org, 
suppressing cc, to check it out without sending a copy to the real 
mailing list.


Regards,

Tvrtko


Signed-off-by: Christoph Hellwig 
---
  drivers/gpu/drm/i915/Kconfig  |   1 +
  drivers/gpu/drm/i915/gem/i915_gem_pages.c | 101 ++
  2 files changed, 47 insertions(+), 55 deletions(-)

diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 9afa5c4a6bf006..1e1cb245fca778 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -25,6 +25,7 @@ config DRM_I915
select CRC32
select SND_HDA_I915 if SND_HDA_CORE
select CEC_CORE if CEC_NOTIFIER
+   select VMAP_PFN
help
  Choose this option if you have a system that has "Intel Graphics
  Media Accelerator" or "HD Graphics" integrated graphics,
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c 
b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
index e8a083743e0927..90029ea83aede9 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c
@@ -234,50 +234,24 @@ int __i915_gem_object_put_pages(struct 
drm_i915_gem_object *obj)
return err;
  }
  
-static inline pte_t iomap_pte(resource_size_t base,

- dma_addr_t offset,
- pgprot_t prot)
-{
-   return pte_mkspecial(pfn_pte((base + offset) >> PAGE_SHIFT, prot));
-}
-
  /* The 'mapping' part of i915_gem_object_pin_map() below */
-static void *i915_gem_object_map(struct drm_i915_gem_object *obj,
+static void *i915_gem_object_map_page(struct drm_i915_gem_object *obj,
 enum i915_map_type type)
  {
-   unsigned long n_pte = obj->base.size >> PAGE_SHIFT;
-   struct sg_table *sgt = obj->mm.pages;
-   pte_t *stack[32], **mem;
-   struct vm_struct *area;
+   unsigned long n_pages = obj->base.size >> PAGE_SHIFT, i;
+   struct page *stack[32], **pages = stack, *page;
+   struct sgt_iter iter;
pgprot_t pgprot;
-
-   if (!i915_gem_object_has_struct_page(obj) && type != I915_MAP_WC)
-   return NULL;
-
-   /* A single page can always be kmapped */
-   if (n_pte == 1 && type == I915_MAP_WB)
-   return kmap(sg_page(sgt->sgl));
-
-   mem = stack;
-   if (n_pte > ARRAY_SIZE(stack)) {
-   /* Too big for stack -- allocate temporary array instead */
-   mem = kvmalloc_array(n_pte, sizeof(*mem), GFP_KERNEL);
-   if (!mem)
-   return NULL;
-   }
-
-   area = alloc_vm_area(obj->base.size, mem);
-   if (!area) {
-   if (mem != stack)
-   kvfree(mem);
-   return NULL;
-   }
+   void *vaddr;
  
  	switch (type) {

default:
MISSING_CASE(type);
fallthrough;/* to use PAGE_KERNEL anyway */
case I915_MAP_WB:
+   /* A single page can always be kmapped */
+   if (n_pages == 1)
+   return kmap(sg_page(obj->mm.pages->sgl));
pgprot = PAGE_KERNEL;
break;
case I915_MAP_WC:
@@ -285,30 +259,44 @@ static void *i915_gem_object_map(struct 
drm_i915_gem_object *obj,
break;
}
  
-	if (i915_gem_object_has_struct_page(obj)) {

-   struct sgt_iter iter;
-   struct page *page;
-   pte_t **ptes = mem;
-
-   for_each_sgt_page(page, iter, sgt)
-   **ptes++ = mk_pte(page, pgprot);
-   } else {
-   resource_size_t iomap;
-   struct sgt_iter iter;
-   pte_t **ptes = mem;
-   dma_addr_t addr;
+   if (n_pages > ARRAY_SIZE(stack)) {
+   /* Too big for stack -- allocate temporary array instead */
+   pages = kvmalloc_array(n_pages, sizeof(*pages), GFP_KERNEL);
+   if (!pages)
+   return NULL;
+   }
  
-		iomap = obj->mm.region->iomap.base;

-   iomap -= obj->mm.region->region.start;
+   for_each_sgt_page(page, iter, obj->mm.pages)
+   pages[i++] = page;
+   vaddr = 

Re: [Intel-gfx] [PATCH 3/6] drm/i915: use vmap in shmem_pin_map

2020-09-22 Thread Tvrtko Ursulin



On 22/09/2020 07:22, Christoph Hellwig wrote:

On Mon, Sep 21, 2020 at 08:11:57PM +0100, Matthew Wilcox wrote:

This is awkward.  I'd like it if we had a vfree() variant which called
put_page() instead of __free_pages().  I'd like it even more if we
used release_pages() instead of our own loop that called put_page().


Note that we don't need a new vfree variant, we can do this manually if
we want, take a look at kernel/dma/remap.c.  But I thought this code
intentionally doesn't want to do that to avoid locking in the memory
for the pages array.  Maybe the i915 maintainers can clarify.


+ Chris & Matt who were involved with this part of i915.

If I understood this sub-thread correctly, iterating and freeing the 
pages via the vmapped ptes, so no need for a shmem_read_mapping_page_gfp 
loop in shmem_unpin_map, looks plausible to me.

I did not get the reference to kernel/dma/remap.c though, and also not 
sure how to do the error unwind path in shmem_pin_map at which point the 
allocated vm area hasn't been fully populated yet. Hand-roll the loop 
walking vm area struct in there?


Regards,

Tvrtko


Re: [Intel-gfx] [PATCH 0/8] Convert the intel iommu driver to the dma-iommu api

2020-09-09 Thread Tvrtko Ursulin



On 08/09/2020 23:43, Tom Murphy wrote:

On Tue, 8 Sep 2020 at 16:56, Tvrtko Ursulin
 wrote:



On 08/09/2020 16:44, Logan Gunthorpe wrote:

On 2020-09-08 9:28 a.m., Tvrtko Ursulin wrote:


diff --git a/drivers/gpu/drm/i915/i915_scatterlist.h
b/drivers/gpu/drm/i915/i915
index b7b59328cb76..9367ac801f0c 100644
--- a/drivers/gpu/drm/i915/i915_scatterlist.h
+++ b/drivers/gpu/drm/i915/i915_scatterlist.h
@@ -27,13 +27,19 @@ static __always_inline struct sgt_iter {
} __sgt_iter(struct scatterlist *sgl, bool dma) {
   struct sgt_iter s = { .sgp = sgl };

+   if (sgl && !sg_dma_len(s.sgp))


I'd extend the condition to be, just to be safe:
  if (dma && sgl && !sg_dma_len(s.sgp))



Right, good catch, that's definitely necessary.


+   s.sgp = NULL;
+
   if (s.sgp) {
   s.max = s.curr = s.sgp->offset;
-   s.max += s.sgp->length;
-   if (dma)
+
+   if (dma) {
+   s.max += sg_dma_len(s.sgp);
   s.dma = sg_dma_address(s.sgp);
-   else
+   } else {
+   s.max += s.sgp->length;
   s.pfn = page_to_pfn(sg_page(s.sgp));
+   }


Otherwise has this been tested or alternatively how to test it? (How to
repro the issue.)


It has not been tested. To test it, you need Tom's patch set without the
last "DO NOT MERGE" patch:

https://lkml.kernel.org/lkml/20200907070035.ga25...@infradead.org/T/


Tom, do you have a branch somewhere I could pull from? (Just being lazy
about downloading a bunch of messages from the archives.)


I don't unfortunately. I'm working locally with poor internet.



What GPU is in your Lenovo x1 carbon 5th generation and what
graphical/desktop setup I need to repro?



Is this enough info?:

$ lspci -vnn | grep VGA -A 12
00:02.0 VGA compatible controller [0300]: Intel Corporation HD
Graphics 620 [8086:5916] (rev 02) (prog-if 00 [VGA controller])
 Subsystem: Lenovo ThinkPad X1 Carbon 5th Gen [17aa:224f]
 Flags: bus master, fast devsel, latency 0, IRQ 148
 Memory at eb00 (64-bit, non-prefetchable) [size=16M]
 Memory at 6000 (64-bit, prefetchable) [size=256M]
 I/O ports at e000 [size=64]
 [virtual] Expansion ROM at 000c [disabled] [size=128K]
 Capabilities: [40] Vendor Specific Information: Len=0c 
 Capabilities: [70] Express Root Complex Integrated Endpoint, MSI 00
 Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
 Capabilities: [d0] Power Management version 2
 Capabilities: [100] Process Address Space ID (PASID)
 Capabilities: [200] Address Translation Service (ATS)


Works for a start. What about the steps to repro? Any desktop 
environment and it is just visual corruption, no hangs/stalls or such?


I've submitted a series consisting of what I understood are the patches 
needed to repro the issue to our automated CI here:


https://patchwork.freedesktop.org/series/81489/

So will see if it will catch something, or more targeted testing will be 
required. Hopefully it does trip over in which case I can add the patch 
suggested by Logan on top and see if that fixes it. Or I'll need to 
write a new test case.


If you could glance over my series to check I identified the patches 
correctly it would be appreciated.


Regards,

Tvrtko


Re: [Intel-gfx] [PATCH 0/8] Convert the intel iommu driver to the dma-iommu api

2020-09-08 Thread Tvrtko Ursulin



Hi,

On 27/08/2020 22:36, Logan Gunthorpe wrote:

On 2020-08-23 6:04 p.m., Tom Murphy wrote:

I have added a check for the sg_dma_len == 0 :
"""
  } __sgt_iter(struct scatterlist *sgl, bool dma) {
 struct sgt_iter s = { .sgp = sgl };

+   if (sgl && sg_dma_len(sgl) == 0)
+   s.sgp = NULL;

 if (s.sgp) {
 .
"""
at location [1].
but it doesn't fix the problem.


Based on my read of the code, it looks like we also need to change usage
of sgl->length... Something like the rough patch below, maybe?


This thread was brought to my attention and I initially missed this 
reply. Essentially I came to the same conclusion about the need to use 
sg_dma_len. One small correction below:



Also, Tom, do you have an updated version of the patchset to convert the
Intel IOMMU to dma-iommu available? The last one I've found doesn't
apply cleanly (I'm assuming parts of it have been merged in slightly
modified forms).

Thanks,

Logan

--

diff --git a/drivers/gpu/drm/i915/i915_scatterlist.h
b/drivers/gpu/drm/i915/i915
index b7b59328cb76..9367ac801f0c 100644
--- a/drivers/gpu/drm/i915/i915_scatterlist.h
+++ b/drivers/gpu/drm/i915/i915_scatterlist.h
@@ -27,13 +27,19 @@ static __always_inline struct sgt_iter {
  } __sgt_iter(struct scatterlist *sgl, bool dma) {
 struct sgt_iter s = { .sgp = sgl };

+   if (sgl && !sg_dma_len(s.sgp))


I'd extend the condition to be, just to be safe:

if (dma && sgl && !sg_dma_len(s.sgp))


+   s.sgp = NULL;
+
 if (s.sgp) {
 s.max = s.curr = s.sgp->offset;
-   s.max += s.sgp->length;
-   if (dma)
+
+   if (dma) {
+   s.max += sg_dma_len(s.sgp);
 s.dma = sg_dma_address(s.sgp);
-   else
+   } else {
+   s.max += s.sgp->length;
 s.pfn = page_to_pfn(sg_page(s.sgp));
+   }


Otherwise has this been tested or alternatively how to test it? (How to 
repro the issue.)


Regards,

Tvrtko


Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Tvrtko Ursulin


On 11/06/2020 12:29, Daniel Vetter wrote:
> On Thu, Jun 11, 2020 at 12:36 PM Tvrtko Ursulin
>  wrote:
>> On 10/06/2020 16:17, Daniel Vetter wrote:
>>> On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
>>>  wrote:
>>>>
>>>>
>>>> On 04/06/2020 09:12, Daniel Vetter wrote:
>>>>> Design is similar to the lockdep annotations for workers, but with
>>>>> some twists:
>>>>>
>>>>> - We use a read-lock for the execution/worker/completion side, so that
>>>>>  this explicit annotation can be more liberally sprinkled around.
>>>>>  With read locks lockdep isn't going to complain if the read-side
>>>>>  isn't nested the same way under all circumstances, so ABBA deadlocks
>>>>>  are ok. Which they are, since this is an annotation only.
>>>>>
>>>>> - We're using non-recursive lockdep read lock mode, since in recursive
>>>>>  read lock mode lockdep does not catch read side hazards. And we
>>>>>  _very_ much want read side hazards to be caught. For full details of
>>>>>  this limitation see
>>>>>
>>>>>  commit e91498589746065e3ae95d9a00b068e525eec34f
>>>>>  Author: Peter Zijlstra 
>>>>>  Date:   Wed Aug 23 13:13:11 2017 +0200
>>>>>
>>>>>  locking/lockdep/selftests: Add mixed read-write ABBA tests
>>>>>
>>>>> - To allow nesting of the read-side explicit annotations we explicitly
>>>>>  keep track of the nesting. lock_is_held() allows us to do that.
>>>>>
>>>>> - The wait-side annotation is a write lock, and entirely done within
>>>>>  dma_fence_wait() for everyone by default.
>>>>>
>>>>> - To be able to freely annotate helper functions I want to make it ok
>>>>>  to call dma_fence_begin/end_signalling from soft/hardirq context.
>>>>>  First attempt was using the hardirq locking context for the write
>>>>>  side in lockdep, but this forces all normal spinlocks nested within
>>>>>  dma_fence_begin/end_signalling to be spinlocks. That bollocks.
>>>>>
>>>>>  The approach now is to simple check in_atomic(), and for these cases
>>>>>  entirely rely on the might_sleep() check in dma_fence_wait(). That
>>>>>  will catch any wrong nesting against spinlocks from soft/hardirq
>>>>>  contexts.
>>>>>
>>>>> The idea here is that every code path that's critical for eventually
>>>>> signalling a dma_fence should be annotated with
>>>>> dma_fence_begin/end_signalling. The annotation ideally starts right
>>>>> after a dma_fence is published (added to a dma_resv, exposed as a
>>>>> sync_file fd, attached to a drm_syncobj fd, or anything else that
>>>>> makes the dma_fence visible to other kernel threads), up to and
>>>>> including the dma_fence_wait(). Examples are irq handlers, the
>>>>> scheduler rt threads, the tail of execbuf (after the corresponding
>>>>> fences are visible), any workers that end up signalling dma_fences and
>>>>> really anything else. Not annotated should be code paths that only
>>>>> complete fences opportunistically as the gpu progresses, like e.g.
>>>>> shrinker/eviction code.
>>>>>
>>>>> The main class of deadlocks this is supposed to catch are:
>>>>>
>>>>> Thread A:
>>>>>
>>>>> mutex_lock(A);
>>>>> mutex_unlock(A);
>>>>>
>>>>> dma_fence_signal();
>>>>>
>>>>> Thread B:
>>>>>
>>>>> mutex_lock(A);
>>>>> dma_fence_wait();
>>>>> mutex_unlock(A);
>>>>>
>>>>> Thread B is blocked on A signalling the fence, but A never gets around
>>>>> to that because it cannot acquire the lock A.
>>>>>
>>>>> Note that dma_fence_wait() is allowed to be nested within
>>>>> dma_fence_begin/end_signalling sections. To allow this to happen the
>>>>> read lock needs to be upgraded to a write lock, which means that any
>>>>> other lock is acquired between the dma_fence_begin_signalling() call and
>>>>> the call to dma_fence_wait(), and still held, this will result i

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-11 Thread Tvrtko Ursulin


On 10/06/2020 16:17, Daniel Vetter wrote:
> On Wed, Jun 10, 2020 at 4:22 PM Tvrtko Ursulin
>  wrote:
>>
>>
>> On 04/06/2020 09:12, Daniel Vetter wrote:
>>> Design is similar to the lockdep annotations for workers, but with
>>> some twists:
>>>
>>> - We use a read-lock for the execution/worker/completion side, so that
>>> this explicit annotation can be more liberally sprinkled around.
>>> With read locks lockdep isn't going to complain if the read-side
>>> isn't nested the same way under all circumstances, so ABBA deadlocks
>>> are ok. Which they are, since this is an annotation only.
>>>
>>> - We're using non-recursive lockdep read lock mode, since in recursive
>>> read lock mode lockdep does not catch read side hazards. And we
>>> _very_ much want read side hazards to be caught. For full details of
>>> this limitation see
>>>
>>> commit e91498589746065e3ae95d9a00b068e525eec34f
>>> Author: Peter Zijlstra 
>>> Date:   Wed Aug 23 13:13:11 2017 +0200
>>>
>>> locking/lockdep/selftests: Add mixed read-write ABBA tests
>>>
>>> - To allow nesting of the read-side explicit annotations we explicitly
>>> keep track of the nesting. lock_is_held() allows us to do that.
>>>
>>> - The wait-side annotation is a write lock, and entirely done within
>>> dma_fence_wait() for everyone by default.
>>>
>>> - To be able to freely annotate helper functions I want to make it ok
>>> to call dma_fence_begin/end_signalling from soft/hardirq context.
>>> First attempt was using the hardirq locking context for the write
>>> side in lockdep, but this forces all normal spinlocks nested within
>>> dma_fence_begin/end_signalling to be spinlocks. That bollocks.
>>>
>>> The approach now is to simple check in_atomic(), and for these cases
>>> entirely rely on the might_sleep() check in dma_fence_wait(). That
>>> will catch any wrong nesting against spinlocks from soft/hardirq
>>> contexts.
>>>
>>> The idea here is that every code path that's critical for eventually
>>> signalling a dma_fence should be annotated with
>>> dma_fence_begin/end_signalling. The annotation ideally starts right
>>> after a dma_fence is published (added to a dma_resv, exposed as a
>>> sync_file fd, attached to a drm_syncobj fd, or anything else that
>>> makes the dma_fence visible to other kernel threads), up to and
>>> including the dma_fence_wait(). Examples are irq handlers, the
>>> scheduler rt threads, the tail of execbuf (after the corresponding
>>> fences are visible), any workers that end up signalling dma_fences and
>>> really anything else. Not annotated should be code paths that only
>>> complete fences opportunistically as the gpu progresses, like e.g.
>>> shrinker/eviction code.
>>>
>>> The main class of deadlocks this is supposed to catch are:
>>>
>>> Thread A:
>>>
>>>mutex_lock(A);
>>>mutex_unlock(A);
>>>
>>>dma_fence_signal();
>>>
>>> Thread B:
>>>
>>>mutex_lock(A);
>>>dma_fence_wait();
>>>mutex_unlock(A);
>>>
>>> Thread B is blocked on A signalling the fence, but A never gets around
>>> to that because it cannot acquire the lock A.
>>>
>>> Note that dma_fence_wait() is allowed to be nested within
>>> dma_fence_begin/end_signalling sections. To allow this to happen the
>>> read lock needs to be upgraded to a write lock, which means that any
>>> other lock is acquired between the dma_fence_begin_signalling() call and
>>> the call to dma_fence_wait(), and still held, this will result in an
>>> immediate lockdep complaint. The only other option would be to not
>>> annotate such calls, defeating the point. Therefore these annotations
>>> cannot be sprinkled over the code entirely mindless to avoid false
>>> positives.
>>>
>>> v2: handle soft/hardirq ctx better against write side and dont forget
>>> EXPORT_SYMBOL, drivers can't use this otherwise.
>>>
>>> v3: Kerneldoc.
>>>
>>> v4: Some spelling fixes from Mika
>>>
>>> Cc: Mika Kuoppala 
>>> Cc: Thomas Hellstrom 
>>> Cc: linux-me...@vger.kernel.org
>>> Cc: linaro-mm-...@lists.linaro.org
>>> Cc: linux-r...@vger.kernel.org
>

Re: [Intel-gfx] [PATCH 03/18] dma-fence: basic lockdep annotations

2020-06-10 Thread Tvrtko Ursulin



On 04/06/2020 09:12, Daniel Vetter wrote:

Design is similar to the lockdep annotations for workers, but with
some twists:

- We use a read-lock for the execution/worker/completion side, so that
   this explicit annotation can be more liberally sprinkled around.
   With read locks lockdep isn't going to complain if the read-side
   isn't nested the same way under all circumstances, so ABBA deadlocks
   are ok. Which they are, since this is an annotation only.

- We're using non-recursive lockdep read lock mode, since in recursive
   read lock mode lockdep does not catch read side hazards. And we
   _very_ much want read side hazards to be caught. For full details of
   this limitation see

   commit e91498589746065e3ae95d9a00b068e525eec34f
   Author: Peter Zijlstra 
   Date:   Wed Aug 23 13:13:11 2017 +0200

   locking/lockdep/selftests: Add mixed read-write ABBA tests

- To allow nesting of the read-side explicit annotations we explicitly
   keep track of the nesting. lock_is_held() allows us to do that.

- The wait-side annotation is a write lock, and entirely done within
   dma_fence_wait() for everyone by default.

- To be able to freely annotate helper functions I want to make it ok
   to call dma_fence_begin/end_signalling from soft/hardirq context.
   First attempt was using the hardirq locking context for the write
   side in lockdep, but this forces all normal spinlocks nested within
   dma_fence_begin/end_signalling to be spinlocks. That bollocks.

   The approach now is to simple check in_atomic(), and for these cases
   entirely rely on the might_sleep() check in dma_fence_wait(). That
   will catch any wrong nesting against spinlocks from soft/hardirq
   contexts.

The idea here is that every code path that's critical for eventually
signalling a dma_fence should be annotated with
dma_fence_begin/end_signalling. The annotation ideally starts right
after a dma_fence is published (added to a dma_resv, exposed as a
sync_file fd, attached to a drm_syncobj fd, or anything else that
makes the dma_fence visible to other kernel threads), up to and
including the dma_fence_wait(). Examples are irq handlers, the
scheduler rt threads, the tail of execbuf (after the corresponding
fences are visible), any workers that end up signalling dma_fences and
really anything else. Not annotated should be code paths that only
complete fences opportunistically as the gpu progresses, like e.g.
shrinker/eviction code.

The main class of deadlocks this is supposed to catch are:

Thread A:

mutex_lock(A);
mutex_unlock(A);

dma_fence_signal();

Thread B:

mutex_lock(A);
dma_fence_wait();
mutex_unlock(A);

Thread B is blocked on A signalling the fence, but A never gets around
to that because it cannot acquire the lock A.

Note that dma_fence_wait() is allowed to be nested within
dma_fence_begin/end_signalling sections. To allow this to happen the
read lock needs to be upgraded to a write lock, which means that any
other lock is acquired between the dma_fence_begin_signalling() call and
the call to dma_fence_wait(), and still held, this will result in an
immediate lockdep complaint. The only other option would be to not
annotate such calls, defeating the point. Therefore these annotations
cannot be sprinkled over the code entirely mindless to avoid false
positives.
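
(Purely for illustration, not part of the quoted patch: the intended usage
of the annotations described above is to bracket the code path that
publishes and eventually signals the fence, roughly like this.)

static void publish_and_signal(struct dma_fence *fence)
{
	bool cookie;

	/* fence is (about to become) visible to other threads */
	cookie = dma_fence_begin_signalling();

	/* ... kick the hardware, take only locks that are safe here ... */

	dma_fence_signal(fence);
	dma_fence_end_signalling(cookie);
}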

v2: handle soft/hardirq ctx better against write side and dont forget
EXPORT_SYMBOL, drivers can't use this otherwise.

v3: Kerneldoc.

v4: Some spelling fixes from Mika

Cc: Mika Kuoppala 
Cc: Thomas Hellstrom 
Cc: linux-me...@vger.kernel.org
Cc: linaro-mm-...@lists.linaro.org
Cc: linux-r...@vger.kernel.org
Cc: amd-...@lists.freedesktop.org
Cc: intel-...@lists.freedesktop.org
Cc: Chris Wilson 
Cc: Maarten Lankhorst 
Cc: Christian König 
Signed-off-by: Daniel Vetter 
---
  Documentation/driver-api/dma-buf.rst |  12 +-
  drivers/dma-buf/dma-fence.c  | 161 +++
  include/linux/dma-fence.h|  12 ++
  3 files changed, 182 insertions(+), 3 deletions(-)

diff --git a/Documentation/driver-api/dma-buf.rst 
b/Documentation/driver-api/dma-buf.rst
index 63dec76d1d8d..05d856131140 100644
--- a/Documentation/driver-api/dma-buf.rst
+++ b/Documentation/driver-api/dma-buf.rst
@@ -100,11 +100,11 @@ CPU Access to DMA Buffer Objects
  .. kernel-doc:: drivers/dma-buf/dma-buf.c
 :doc: cpu access
  
-Fence Poll Support

-~~
+Implicit Fence Poll Support
+~~~
  
  .. kernel-doc:: drivers/dma-buf/dma-buf.c

-   :doc: fence polling
+   :doc: implicit fence polling
  
  Kernel Functions and Structures Reference

  ~
@@ -133,6 +133,12 @@ DMA Fences
  .. kernel-doc:: drivers/dma-buf/dma-fence.c
 :doc: DMA fences overview
  
+DMA Fence Signalling Annotations

+
+
+.. kernel-doc:: drivers/dma-buf/dma-fence.c
+   :doc: fence signalling annotation
+
  DMA Fences Functions Reference
 

Re: [Intel-gfx] [RFC 06/17] drm: i915: fix sg_table nents vs. orig_nents misuse

2020-04-28 Thread Tvrtko Ursulin



On 28/04/2020 14:19, Marek Szyprowski wrote:

The Documentation/DMA-API-HOWTO.txt states that dma_map_sg returns the
number of the created entries in the DMA address space. However the
subsequent calls to dma_sync_sg_for_{device,cpu} and dma_unmap_sg must be
called with the original number of entries passed to dma_map_sg. The
sg_table->nents in turn holds the result of the dma_map_sg call as stated
in include/linux/scatterlist.h. Adapt the code to obey those rules.
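
(A minimal sketch of the rule just described, with a hypothetical struct
sg_table *sgt: map with orig_nents, use the returned count only to walk the
DMA addresses, and sync/unmap with orig_nents again.)

	sgt->nents = dma_map_sg(dev, sgt->sgl, sgt->orig_nents, DMA_TO_DEVICE);
	if (!sgt->nents)
		return -ENOMEM;		/* mapping failed */

	/* iterate sgt->nents entries via sg_dma_address()/sg_dma_len() */

	dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->orig_nents, DMA_TO_DEVICE);
	dma_unmap_sg(dev, sgt->sgl, sgt->orig_nents, DMA_TO_DEVICE);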

Signed-off-by: Marek Szyprowski 
---
  drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c   | 13 +++--
  drivers/gpu/drm/i915/gem/i915_gem_internal.c |  4 ++--
  drivers/gpu/drm/i915/gem/i915_gem_region.c   |  4 ++--
  drivers/gpu/drm/i915/gem/i915_gem_shmem.c|  5 +++--
  drivers/gpu/drm/i915/gem/selftests/huge_pages.c  | 10 +-
  drivers/gpu/drm/i915/gem/selftests/mock_dmabuf.c |  5 +++--
  drivers/gpu/drm/i915/gt/intel_ggtt.c | 12 ++--
  drivers/gpu/drm/i915/i915_gem_gtt.c  | 12 +++-
  drivers/gpu/drm/i915/i915_scatterlist.c  |  4 ++--
  drivers/gpu/drm/i915/selftests/scatterlist.c |  8 
  10 files changed, 41 insertions(+), 36 deletions(-)

diff --git a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c 
b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
index 7db5a79..d829852 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c
@@ -36,21 +36,22 @@ static struct sg_table *i915_gem_map_dma_buf(struct 
dma_buf_attachment *attachme
goto err_unpin_pages;
}
  
-	ret = sg_alloc_table(st, obj->mm.pages->nents, GFP_KERNEL);

+   ret = sg_alloc_table(st, obj->mm.pages->orig_nents, GFP_KERNEL);
if (ret)
goto err_free;
  
  	src = obj->mm.pages->sgl;

dst = st->sgl;
-   for (i = 0; i < obj->mm.pages->nents; i++) {
+   for (i = 0; i < obj->mm.pages->orig_nents; i++) {
sg_set_page(dst, sg_page(src), src->length, 0);
dst = sg_next(dst);
src = sg_next(src);
}
  
-	if (!dma_map_sg_attrs(attachment->dev,

- st->sgl, st->nents, dir,
- DMA_ATTR_SKIP_CPU_SYNC)) {
+   st->nents = dma_map_sg_attrs(attachment->dev,
+st->sgl, st->orig_nents, dir,
+DMA_ATTR_SKIP_CPU_SYNC);
+   if (!st->nents) {
ret = -ENOMEM;
goto err_free_sg;
}
@@ -74,7 +75,7 @@ static void i915_gem_unmap_dma_buf(struct dma_buf_attachment 
*attachment,
struct drm_i915_gem_object *obj = dma_buf_to_obj(attachment->dmabuf);
  
  	dma_unmap_sg_attrs(attachment->dev,

-  sg->sgl, sg->nents, dir,
+  sg->sgl, sg->orig_nents, dir,
   DMA_ATTR_SKIP_CPU_SYNC);
sg_free_table(sg);
kfree(sg);
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c 
b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
index cbbff81..a8ebfdd 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c
@@ -73,7 +73,7 @@ static int i915_gem_object_get_pages_internal(struct 
drm_i915_gem_object *obj)
}
  
  	sg = st->sgl;

-   st->nents = 0;
+   st->nents = st->orig_nents = 0;
sg_page_sizes = 0;
  
  	do {

@@ -94,7 +94,7 @@ static int i915_gem_object_get_pages_internal(struct 
drm_i915_gem_object *obj)
  
  		sg_set_page(sg, page, PAGE_SIZE << order, 0);

sg_page_sizes |= PAGE_SIZE << order;
-   st->nents++;
+   st->nents = st->orig_nents = st->nents + 1;
  
  		npages -= 1 << order;

if (!npages) {
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_region.c 
b/drivers/gpu/drm/i915/gem/i915_gem_region.c
index 1515384..58ca560 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_region.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_region.c
@@ -53,7 +53,7 @@
GEM_BUG_ON(list_empty(blocks));
  
  	sg = st->sgl;

-   st->nents = 0;
+   st->nents = st->orig_nents = 0;
sg_page_sizes = 0;
prev_end = (resource_size_t)-1;
  
@@ -78,7 +78,7 @@
  
  			sg->length = block_size;
  
-			st->nents++;

+   st->nents = st->orig_nents = st->nents + 1;
} else {
sg->length += block_size;
sg_dma_len(sg) += block_size;
diff --git a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c 
b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
index 5d5d7ee..851a732 100644
--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
+++ b/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -80,7 +80,7 @@ static int shmem_get_pages(struct drm_i915_gem_object *obj)
noreclaim |= __GFP_NORETRY | __GFP_NOWARN;
  
  	sg = st->sgl;

-   st->nents = 0;
+   st->nents = st->orig_nents = 0;
sg_page_sizes = 0;
for (i = 0; i < page_count; i++) {
 

Re: [Intel-gfx] [PATCH v5 2/3] drm/i915: Move on the new pm runtime interface

2018-12-31 Thread Tvrtko Ursulin



On 21/12/2018 13:26, Vincent Guittot wrote:

On Fri, 21 Dec 2018 at 12:33, Tvrtko Ursulin
 wrote:



Hi,

On 21/12/2018 10:33, Vincent Guittot wrote:

Use the new pm runtime interface to get the accounted suspended time:
pm_runtime_suspended_time().
This new interface helps to simplify and cleanup the code that computes
__I915_SAMPLE_RC6_ESTIMATED and to remove direct access to internals of
PM runtime.

Reviewed-by: Ulf Hansson 
Signed-off-by: Vincent Guittot 
---
   drivers/gpu/drm/i915/i915_pmu.c | 16 ++--
   drivers/gpu/drm/i915/i915_pmu.h |  4 ++--
   2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index d6c8f8f..3f76f60 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -5,6 +5,7 @@
*/

   #include <linux/irq.h>
+#include <linux/pm_runtime.h>
   #include "i915_pmu.h"
   #include "intel_ringbuffer.h"
   #include "i915_drv.h"
@@ -478,7 +479,6 @@ static u64 get_rc6(struct drm_i915_private *i915)
* counter value.
*/
   spin_lock_irqsave(&i915->pmu.lock, flags);
- spin_lock(&kdev->power.lock);

   /*
* After the above branch intel_runtime_pm_get_if_in_use failed
@@ -491,16 +491,13 @@ static u64 get_rc6(struct drm_i915_private *i915)
* suspended and if not we cannot do better than report the last
* known RC6 value.
*/
- if (kdev->power.runtime_status == RPM_SUSPENDED) {
- if (!i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur)
- i915->pmu.suspended_jiffies_last =
-   kdev->power.suspended_jiffies;
+ if (pm_runtime_status_suspended(kdev)) {
+ val = pm_runtime_suspended_time(kdev);


There is a race condition between the status check and timestamp access
which the existing code solves by holding the power.lock over it. But I
don't exactly remember how this issue was manifesting. Is
kdev->power.suspended_jiffies perhaps reset on exit from runtime
suspend, which was then underflowing the val, not sure.

Anyways, is the new way of doing this safe with regards to this race? In


AFAICT it is safe.
The current version does:
1-take lock,
2-test if dev is suspended
3-read some internals field to computed an up-to-date suspended time
4-update __I915_SAMPLE_RC6_ESTIMATED
5-release lock

The new version does:
1-test if dev is suspended
2-get an up-to-date  suspended time with pm_runtime_suspended_time.
This is atomic and monotonic
3-update __I915_SAMPLE_RC6_ESTIMATED

A change from suspended to another state that happens just before
step 1 is ok for both, as we will run the else if.
No change of state can happen after step 1 in the current code, and the
estimated suspended time will be the time up to step 2. In parallel,
any state change will have to wait for step 5 to continue.
If a change from suspended to another state happens after step 1 in the
new code, the suspended time returned by the PM core will be the time up to
this change. So I would say you don't delay the state transition and you
get a more accurate estimated suspended time (even if the difference
should be small).
If a change from suspended to another state happens after step 2 in the
new code, the suspended time returned by the PM core will be the time up to
step 2, so there is no change.



other words is the value pm_runtime_suspended_time always monotonic,
even when not suspended? If not we have to handle the race somehow.


Yes pm_runtime_suspended_time is monotonic and stays unchanged when
not suspended
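
(For reference, the new estimation branch from the patch, condensed into one
place with the pmu->lock handling omitted; all names are the ones used in the
diff. Because pm_runtime_suspended_time() is monotonic, a racing resume at
worst makes a single sample slightly stale:)

	u64 val;

	if (pm_runtime_status_suspended(kdev)) {
		val = pm_runtime_suspended_time(kdev);	/* monotonic */

		/* Latch the baseline on the first estimated sample. */
		if (!i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur)
			i915->pmu.suspended_time_last = val;

		/* Suspended time since the baseline on top of last known RC6. */
		val -= i915->pmu.suspended_time_last;
		val += i915->pmu.sample[__I915_SAMPLE_RC6].cur;

		i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur = val;
	} else {
		val = i915->pmu.sample[__I915_SAMPLE_RC6].cur;
	}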



If it is always monotonic, then worst case we report one wrong sample,
which I guess is still not ideal since someone could be querying the PMU
with quite low frequency.

There are tests which probably can hit this, but to run them
automatically your patches would need to be rebased on drm-tip and maybe
sent to our trybot. I can do that after the holiday break if you are
okay with having the series waiting until then.


yes looks good to me


Looks good to me as well. And our CI agrees with it. So:

Reviewed-by: Tvrtko Ursulin 

I assume you will take the patch through some power related tree and we 
will eventually pull it back to drm-tip.


Regards,

Tvrtko


Re: [Intel-gfx] [PATCH v5 2/3] drm/i915: Move on the new pm runtime interface

2018-12-21 Thread Tvrtko Ursulin



Hi,

On 21/12/2018 10:33, Vincent Guittot wrote:

Use the new pm runtime interface to get the accounted suspended time:
pm_runtime_suspended_time().
This new interface helps to simplify and cleanup the code that computes
__I915_SAMPLE_RC6_ESTIMATED and to remove direct access to internals of
PM runtime.

Reviewed-by: Ulf Hansson 
Signed-off-by: Vincent Guittot 
---
  drivers/gpu/drm/i915/i915_pmu.c | 16 ++--
  drivers/gpu/drm/i915/i915_pmu.h |  4 ++--
  2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index d6c8f8f..3f76f60 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -5,6 +5,7 @@
   */
  
  #include 

+#include 
  #include "i915_pmu.h"
  #include "intel_ringbuffer.h"
  #include "i915_drv.h"
@@ -478,7 +479,6 @@ static u64 get_rc6(struct drm_i915_private *i915)
 * counter value.
 */
	spin_lock_irqsave(&i915->pmu.lock, flags);
-   spin_lock(&kdev->power.lock);
  
  		/*

 * After the above branch intel_runtime_pm_get_if_in_use failed
@@ -491,16 +491,13 @@ static u64 get_rc6(struct drm_i915_private *i915)
 * suspended and if not we cannot do better than report the last
 * known RC6 value.
 */
-   if (kdev->power.runtime_status == RPM_SUSPENDED) {
-   if (!i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur)
-   i915->pmu.suspended_jiffies_last =
- kdev->power.suspended_jiffies;
+   if (pm_runtime_status_suspended(kdev)) {
+   val = pm_runtime_suspended_time(kdev);


There is a race condition between the status check and timestamp access 
which the existing code solves by holding the power.lock over it. But I 
don't exactly remember how this issue was manifesting. Is 
kdev->power.suspended_jiffies perhaps reset on exit from runtime 
suspend, which was then underflowing the val, not sure.


Anyways, is the new way of doing this safe with regards to this race? In 
other words is the value pm_runtime_suspended_time always monotonic, 
even when not suspended? If not we have to handle the race somehow.


If it is always monotonic, then worst case we report one wrong sample, 
which I guess is still not ideal since someone could be querying the PMU 
with quite low frequency.


There are tests which probably can hit this, but to run them 
automatically your patches would need to be rebased on drm-tip and maybe 
sent to our trybot. I can do that after the holiday break if you are 
okay with having the series waiting until then.


Regards,

Tvrtko

  
-			val = kdev->power.suspended_jiffies -
-			      i915->pmu.suspended_jiffies_last;
-   val += jiffies - kdev->power.accounting_timestamp;
+   if (!i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur)
+   i915->pmu.suspended_time_last = val;
  
-			val = jiffies_to_nsecs(val);

+   val -= i915->pmu.suspended_time_last;
val += i915->pmu.sample[__I915_SAMPLE_RC6].cur;
  
  			i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur = val;

@@ -510,7 +507,6 @@ static u64 get_rc6(struct drm_i915_private *i915)
val = i915->pmu.sample[__I915_SAMPLE_RC6].cur;
}
  
-		spin_unlock(&kdev->power.lock);

	spin_unlock_irqrestore(&i915->pmu.lock, flags);
}
  
diff --git a/drivers/gpu/drm/i915/i915_pmu.h b/drivers/gpu/drm/i915/i915_pmu.h

index 7f164ca..3dc2a30 100644
--- a/drivers/gpu/drm/i915/i915_pmu.h
+++ b/drivers/gpu/drm/i915/i915_pmu.h
@@ -95,9 +95,9 @@ struct i915_pmu {
 */
struct i915_pmu_sample sample[__I915_NUM_PMU_SAMPLERS];
/**
-* @suspended_jiffies_last: Cached suspend time from PM core.
+* @suspended_time_last: Cached suspend time from PM core.
 */
-   unsigned long suspended_jiffies_last;
+   u64 suspended_time_last;
/**
 * @i915_attr: Memory block holding device attributes.
 */



Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)

2018-09-28 Thread Tvrtko Ursulin



On 28/09/2018 15:02, Thomas Gleixner wrote:

Tvrtko,

On Fri, 28 Sep 2018, Tvrtko Ursulin wrote:

On 28/09/2018 11:26, Thomas Gleixner wrote:

On Wed, 19 Sep 2018, Tvrtko Ursulin wrote:

It would be very helpful if you cc all involved people on the cover letter
instead of just cc'ing your own pile of email addresses. CC'ed now.


I accept it was my bad to miss adding Cc's on the cover letter, but my own
email addresses hopefully should not bother you. It is simply a question of
what I have in .gitconfig vs what I forgot to do manually.


The keyword in the above sentence is 'just'. You can add as many of yours
as you want as long as everybody else is cc'ed.


Sure, but you also used the word "pile" and I would argue that made the 
rest of your sentence, after and including "instead", sound like it not 
only bothers you that I forgot to Cc people on the cover letter, but 
also that I included a "pile" of my own addresses. If that wasn't your 
intention in the slightest then I apologise for misreading it.



I read through the previous thread and there was a clear request to involve
security people into this. Especially those who are deeply involved with
hardware side channels. I don't see anyone Cc'ed on the whole series.


Who would you recommend I add? Because I really don't know..


Sure, and because you don't know you didn't bother to ask around and
ignored the review request.


No, not because of that. You are assuming my actions and motivations and 
constructing a story.


"did not bother" = negative connotations
"ignored" = negative connotations

Note instead the time lapse between this and previous posting of the 
series, and if you want to assume something, assume things can get 
missed and forgotten without intent or malice.



I already added Kees and Jann. Please look for the SECCOMP folks in
MAINTAINERS.


Thanks!


For the record, I'm not buying the handwavy 'more noise' argument at
all. It wants a proper analysis and we need to come up with criteria which
PMUs can be exposed at all.

All of this want's a proper documentation clearly explaining the risks and
scope of these knobs per PMU. Just throwing magic knobs at sysadmins and
then saying 'its their problem to figure it out' is not acceptable.


Presumably you see adding fine grained control as diminishing the overall
security rather than raising it? Could you explain why? Because incompetent
sysadmin will turn it off for some PMU, while without having the fine-grained
control they wouldn't turn it off globally?


I did not say at all that this might be diminishing security. And the
argumentation with 'incompetent sysadmins' is just the wrong attitude.


Wrong attitude what? I was trying to guess your reasoning (cues in 
"presumably" and a lot of question marks) since it wasn't clear to me 
why is your position what it is.



What I was asking for is proper documentation and this proper documentation
is meant for _competent_ sysadmins.

That documentation has to clearly describe what kind of information is
accessible and what potential side effects security wise this might
have. You cannot expect that even competent sysadmins know offhand what
which PMU might expose. And telling them 'Use Google' is just not the right
thing to do.


I did not mention Google.


If you can't explain and document it, then providing the knob is just
fulfilling somebodys 'I want a pony' request.


Well it's not a pony, it is a mechanism to avoid having to turn off all 
security. We can hopefully discuss it without ponies.



This feature was requested by the exact opposite concern, that in order to
access the i915 PMU, one has to compromise the security of the entire system
by allowing access to *all* PMU's.

Making this ability fine-grained sounds like a logical solution for solving
this weakening of security controls.


Sure, and this wants to be documented in the cover letter and the
changelogs.

But this does also require a proper analysis and documentation why it is
not a security risk to expose the i915 PMU or what potential security
issues this can create, so that the competent sysadmin can make a
judgement.

And the same is required for all other PMUs which can be enabled in the
same way for unprivileged access. And we might as well come to the
conclusion via analysis that for some PMUs unpriviledged access is just not
a good idea and exclude them. I surely know a few which qualify for
exclusion, so the right approach is to provide this knob only when the risk
is analyzed and documented and the PMU has been flagged as candidate for
unpriviledged exposure. I.e. opt in and not opt out.


I am happy to work on the mechanics of achieving this once the security 
guys and all PMU owners get involved. Even though I am not convinced the 
bar to allow fine-grained control should be evaluating all possible 
PMUs*, but if the security folks agree that is the case it is fine by me.


Regards,

Tv

Re: [RFC 0/5] perf: Per PMU access controls (paranoid setting)

2018-09-28 Thread Tvrtko Ursulin



Hi,

On 28/09/2018 11:26, Thomas Gleixner wrote:

Tvrtko,

On Wed, 19 Sep 2018, Tvrtko Ursulin wrote:

It would be very helpful if you cc all involved people on the cover letter
instead of just cc'ing your own pile of email addresses. CC'ed now.


I accept it was my bad to miss adding Cc's on the cover letter, but my 
own email addresses hopefully should not bother you. It is simply a 
question of what I have in .gitconfig vs what I forgot to do manually.



For situations where sysadmins might want to allow different levels of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.

These work in an equivalent fashion to the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
/device//perf_event_paranoid, can be adjusted to achieve
fine grained access control.

Discussion from previous posting:
https://lkml.org/lkml/2018/5/21/156


This is really not helpful. The cover letter and the change logs should
contain a summary of that discussion and a proper justification of the
proposed change. Just saying 'sysadmins might want to allow' is not useful
at all, it's yet another 'I want a pony' thing.


Okay, for the next round I will expand the cover letter with at least 
one concrete example on how it is usable and summarize the discussion a bit.



I read through the previous thread and there was a clear request to involve
security people into this. Especially those who are deeply involved with
hardware side channels. I don't see anyone Cc'ed on the whole series.


Who would you recommend I add? Because I really don't know..


For the record, I'm not buying the handwavy 'more noise' argument at
all. It wants a proper analysis and we need to come up with criteria which
PMUs can be exposed at all.

All of this want's a proper documentation clearly explaining the risks and
scope of these knobs per PMU. Just throwing magic knobs at sysadmins and
then saying 'its their problem to figure it out' is not acceptable.


Presumably you see adding fine grained control as diminishing the 
overall security rather than raising it? Could you explain why? Because 
incompetent sysadmin will turn it off for some PMU, while without having 
the fine-grained control they wouldn't turn it off globally?


This feature was requested by the exact opposite concern, that in order 
to access the i915 PMU, one has to compromise the security of the entire 
system by allowing access to *all* PMU's.


Making this ability fine-grained sounds like a logical solution for 
solving this weakening of security controls.


Concrete example was that on video transcoding farms users want to 
monitor the utilization of GPU engines (like CPU cores) and they can do 
that via the i915 PMU. But for that to work today they have to dial down 
the global perf_event_paranoid setting. Obvious improvement was to allow 
them to only dial down the i915.perf_event_paranoid setting. As such, 
for this specific use case at least, the security is increased.
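
For illustration only (not something from this thread), a minimal tool of
that kind boils down to a single perf_event_open() call on the i915 PMU; the
event config value is not hard-coded here but has to be taken from the PMU's
sysfs events directory:

/* Minimal sketch: count one i915 PMU event system-wide for a second.
 * Build: gcc -o i915_count i915_count.c
 * Usage: ./i915_count <config>   (config taken from
 *        /sys/bus/event_source/devices/i915/events/)
 */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct perf_event_attr attr = { .size = sizeof(attr) };
	uint64_t count;
	FILE *f;
	int fd;

	if (argc < 2) {
		fprintf(stderr, "usage: %s <config>\n", argv[0]);
		return 1;
	}

	/* Dynamic PMU type id as registered by the i915 driver. */
	f = fopen("/sys/bus/event_source/devices/i915/type", "r");
	if (!f || fscanf(f, "%u", &attr.type) != 1) {
		fprintf(stderr, "cannot read i915 PMU type\n");
		return 1;
	}
	fclose(f);

	attr.config = strtoull(argv[1], NULL, 0);

	/* pid == -1, cpu == 0: a system-wide counter, which is exactly what
	 * requires perf_event_paranoid to be lowered for unprivileged users. */
	fd = syscall(__NR_perf_event_open, &attr, -1, 0, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}

	sleep(1);
	if (read(fd, &count, sizeof(count)) == sizeof(count))
		printf("count: %llu\n", (unsigned long long)count);
	close(fd);

	return 0;
}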


Regards,

Tvrtko


Re: [RFC 3/5] perf: Allow per PMU access control

2018-09-28 Thread Tvrtko Ursulin



On 27/09/2018 21:15, Andi Kleen wrote:

+   mutex_lock(&pmus_lock);
+   list_for_each_entry(pmu, &pmus, entry)
+   pmu->perf_event_paranoid = sysctl_perf_event_paranoid;
+   mutex_unlock(&pmus_lock);


What happens to pmus that got added later?


There is a hunk a bit lower in the patch where in perf_pmu_register the 
initial setting is assigned from the global sysctl.
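
(Roughly this, i.e. the registration path picking up whatever the global
sysctl currently is -- paraphrasing the hunk rather than quoting it verbatim:)

int perf_pmu_register(struct pmu *pmu, const char *name, int type)
{
	...
	/* Newly registered PMUs inherit the current global setting. */
	pmu->perf_event_paranoid = sysctl_perf_event_paranoid;
	...
}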



The rest looks good.

Can you post a non RFC version?


Sure!

Regards,

Tvrtko


[PATCH 2/6] lib/scatterlist: Use consistent types in sgl API

2018-09-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

There is an inconsistency between data types used for order and number of
sg elements in the API.

Fix it so both are always unsigned int which, in the case of number of
elements, matches the underlying struct scatterlist.

Signed-off-by: Tvrtko Ursulin 
Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Jens Axboe 
---
 include/linux/scatterlist.h | 5 +++--
 lib/scatterlist.c   | 9 +
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index bdede25901b5..f6cf4d7c9a69 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -285,8 +285,9 @@ struct scatterlist *sgl_alloc_order(unsigned long length, 
unsigned int order,
unsigned int *nent_p);
 struct scatterlist *sgl_alloc(unsigned long length, gfp_t gfp,
  unsigned int *nent_p);
-void sgl_free_n_order(struct scatterlist *sgl, int nents, int order);
-void sgl_free_order(struct scatterlist *sgl, int order);
+void sgl_free_n_order(struct scatterlist *sgl, unsigned int nents,
+ unsigned int order);
+void sgl_free_order(struct scatterlist *sgl, unsigned int order);
 void sgl_free(struct scatterlist *sgl);
 #endif /* CONFIG_SGL_ALLOC */
 
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 014018f90e28..23e53dce897d 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -549,11 +549,12 @@ EXPORT_SYMBOL(sgl_alloc);
  * - All pages in a chained scatterlist can be freed at once by setting @nents
  *   to a high number.
  */
-void sgl_free_n_order(struct scatterlist *sgl, int nents, int order)
+void sgl_free_n_order(struct scatterlist *sgl, unsigned int nents,
+ unsigned int order)
 {
struct scatterlist *sg;
struct page *page;
-   int i;
+   unsigned int i;
 
for_each_sg(sgl, sg, nents, i) {
if (!sg)
@@ -571,9 +572,9 @@ EXPORT_SYMBOL(sgl_free_n_order);
  * @sgl: Scatterlist with one or more elements
  * @order: Second argument for __free_pages()
  */
-void sgl_free_order(struct scatterlist *sgl, int order)
+void sgl_free_order(struct scatterlist *sgl, unsigned int order)
 {
-   sgl_free_n_order(sgl, INT_MAX, order);
+   sgl_free_n_order(sgl, UINT_MAX, order);
 }
 EXPORT_SYMBOL(sgl_free_order);
 
-- 
2.17.1



[PATCH 6/6] lib/scatterlist: Fix overflow check in sgl_alloc_order

2018-09-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

It is necessary to ensure types on both sides of the comparison are of the
same width. Otherwise the check overflows sooner than expected, due to the
left hand side being an unsigned long length and the right hand side an
unsigned int number of elements multiplied by the element size.

Signed-off-by: Tvrtko Ursulin 
Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Jens Axboe 
---
 lib/scatterlist.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 581a2e91e515..c87243d46f10 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -485,7 +485,7 @@ struct scatterlist *sgl_alloc_order(unsigned long length, 
unsigned int order,
 
nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
/* Check for integer overflow */
-   if (length > (nent << (PAGE_SHIFT + order)))
+   if (length > ((unsigned long)nent << (PAGE_SHIFT + order)))
return NULL;
nalloc = nent;
if (chainable) {
-- 
2.17.1
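
To make the failure mode concrete (illustration only, not part of the patch):
on a 64-bit build with PAGE_SHIFT == 12 and order == 0, a perfectly valid
5 GiB request used to be rejected because the right hand side of the check
wrapped in 32-bit arithmetic:

#include <stdio.h>

int main(void)
{
	const unsigned int page_shift = 12, order = 0;
	unsigned long length = 5UL << 30;			/* 5 GiB, page aligned */
	unsigned int nent = length >> (page_shift + order);	/* 1310720, no overflow */

	unsigned long old_rhs = nent << (page_shift + order);	/* unsigned int math, wraps to 1 GiB */
	unsigned long new_rhs = (unsigned long)nent << (page_shift + order);

	printf("old check rejects: %d\n", length > old_rhs);	/* 1 - spurious failure */
	printf("new check rejects: %d\n", length > new_rhs);	/* 0 */
	return 0;
}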



[PATCH 4/6] lib/scatterlist: Do not leak pages when high-order allocation fails

2018-09-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

If a higher-order allocation fails, the existing abort and cleanup path
would consider all segments allocated so far as 0-order page allocations
and would therefore leak memory.

Fix this by cleaning up using sgl_free_n_order which allows the correct
page order to be passed in.

Signed-off-by: Tvrtko Ursulin 
Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Jens Axboe 
---
 lib/scatterlist.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 3cc01cd82242..0caed79d7291 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -481,7 +481,7 @@ struct scatterlist *sgl_alloc_order(unsigned long length, 
unsigned int order,
 {
struct scatterlist *sgl, *sg;
struct page *page;
-   unsigned int nent, nalloc;
+   unsigned int nent, nalloc, i;
u32 elem_len;
 
nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
@@ -501,17 +501,19 @@ struct scatterlist *sgl_alloc_order(unsigned long length, 
unsigned int order,
 
sg_init_table(sgl, nalloc);
sg = sgl;
+   i = 0;
while (length) {
elem_len = min_t(u64, length, PAGE_SIZE << order);
page = alloc_pages(gfp, order);
if (!page) {
-   sgl_free(sgl);
+   sgl_free_n_order(sgl, i, order);
return NULL;
}
 
sg_set_page(sg, page, elem_len, 0);
length -= elem_len;
sg = sg_next(sg);
+   i++;
}
WARN_ONCE(length, "length = %ld\n", length);
if (nent_p)
-- 
2.17.1



[PATCH 1/6] lib/scatterlist: Use natural long for sgl_alloc(_order) length parameter

2018-09-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

None of the callers need unsigned long long (they either pass in an int,
u32, or size_t) so it is not required to burden the 32-bit builds with an
overspecified length parameter.

Signed-off-by: Tvrtko Ursulin 
Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Jens Axboe 
---
 include/linux/scatterlist.h |  8 
 lib/scatterlist.c   | 10 +-
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index 093aa57120b0..bdede25901b5 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -280,10 +280,10 @@ int sg_alloc_table_from_pages(struct sg_table *sgt, 
struct page **pages,
  unsigned long size, gfp_t gfp_mask);
 
 #ifdef CONFIG_SGL_ALLOC
-struct scatterlist *sgl_alloc_order(unsigned long long length,
-   unsigned int order, bool chainable,
-   gfp_t gfp, unsigned int *nent_p);
-struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
+struct scatterlist *sgl_alloc_order(unsigned long length, unsigned int order,
+   bool chainable, gfp_t gfp,
+   unsigned int *nent_p);
+struct scatterlist *sgl_alloc(unsigned long length, gfp_t gfp,
  unsigned int *nent_p);
 void sgl_free_n_order(struct scatterlist *sgl, int nents, int order);
 void sgl_free_order(struct scatterlist *sgl, int order);
diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 7c6096a71704..014018f90e28 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -475,9 +475,9 @@ EXPORT_SYMBOL(sg_alloc_table_from_pages);
  *
  * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
  */
-struct scatterlist *sgl_alloc_order(unsigned long long length,
-   unsigned int order, bool chainable,
-   gfp_t gfp, unsigned int *nent_p)
+struct scatterlist *sgl_alloc_order(unsigned long length, unsigned int order,
+   bool chainable, gfp_t gfp,
+   unsigned int *nent_p)
 {
struct scatterlist *sgl, *sg;
struct page *page;
@@ -514,7 +514,7 @@ struct scatterlist *sgl_alloc_order(unsigned long long 
length,
length -= elem_len;
sg = sg_next(sg);
}
-   WARN_ONCE(length, "length = %lld\n", length);
+   WARN_ONCE(length, "length = %ld\n", length);
if (nent_p)
*nent_p = nent;
return sgl;
@@ -529,7 +529,7 @@ EXPORT_SYMBOL(sgl_alloc_order);
  *
  * Returns: A pointer to an initialized scatterlist or %NULL upon failure.
  */
-struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
+struct scatterlist *sgl_alloc(unsigned long length, gfp_t gfp,
  unsigned int *nent_p)
 {
return sgl_alloc_order(length, 0, false, gfp, nent_p);
-- 
2.17.1



[PATCH 3/6] lib/scatterlist: Skip requesting zeroed allocations in sgl_alloc_order

2018-09-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

sg_init_table will clear the allocated block so requesting zeroed memory
from the allocator is redundant.

Signed-off-by: Tvrtko Ursulin 
Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Jens Axboe 
---
 lib/scatterlist.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 23e53dce897d..3cc01cd82242 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -495,8 +495,7 @@ struct scatterlist *sgl_alloc_order(unsigned long length, 
unsigned int order,
return NULL;
nalloc++;
}
-   sgl = kmalloc_array(nalloc, sizeof(struct scatterlist),
-   (gfp & ~GFP_DMA) | __GFP_ZERO);
+   sgl = kmalloc_array(nalloc, sizeof(struct scatterlist), gfp & ~GFP_DMA);
if (!sgl)
return NULL;
 
-- 
2.17.1
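
(For reference, the zeroing being relied on here; sg_init_table() in
lib/scatterlist.c starts, roughly, with:)

void sg_init_table(struct scatterlist *sgl, unsigned int nents)
{
	memset(sgl, 0, sizeof(*sgl) * nents);
	sg_init_marker(sgl, nents);
}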



[PATCH 0/6] lib/scatterlist: sgl API fixes and cleanups

2018-09-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Mostly same fixes and cleanups I've sent earlier in the year, but with some
patches dropped and some split into smaller ones as per request.

Tvrtko Ursulin (6):
  lib/scatterlist: Use natural long for sgl_alloc(_order) length
parameter
  lib/scatterlist: Use consistent types in sgl API
  lib/scatterlist: Skip requesting zeroed allocations in sgl_alloc_order
  lib/scatterlist: Do not leak pages when high-order allocation fails
  lib/scatterlist: Use appropriate type for elem_len in sgl_alloc_order
  lib/scatterlist: Fix overflow check in sgl_alloc_order

 include/linux/scatterlist.h | 13 +++--
 lib/scatterlist.c   | 33 +
 2 files changed, 24 insertions(+), 22 deletions(-)

-- 
2.17.1



[PATCH 5/6] lib/scatterlist: Use appropriate type for elem_len in sgl_alloc_order

2018-09-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

We should not use an explicit width u32 for elem_len but unsigned int to
match the underlying type in struct scatterlist.

Signed-off-by: Tvrtko Ursulin 
Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Johannes Thumshirn 
Cc: Jens Axboe 
---
 lib/scatterlist.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/lib/scatterlist.c b/lib/scatterlist.c
index 0caed79d7291..581a2e91e515 100644
--- a/lib/scatterlist.c
+++ b/lib/scatterlist.c
@@ -481,8 +481,7 @@ struct scatterlist *sgl_alloc_order(unsigned long length, 
unsigned int order,
 {
struct scatterlist *sgl, *sg;
struct page *page;
-   unsigned int nent, nalloc, i;
-   u32 elem_len;
+   unsigned int nent, nalloc, elem_len, i;
 
nent = round_up(length, PAGE_SIZE << order) >> (PAGE_SHIFT + order);
/* Check for integer overflow */
@@ -503,7 +502,7 @@ struct scatterlist *sgl_alloc_order(unsigned long length, 
unsigned int order,
sg = sgl;
i = 0;
while (length) {
-   elem_len = min_t(u64, length, PAGE_SIZE << order);
+   elem_len = min_t(unsigned long, length, PAGE_SIZE << order);
page = alloc_pages(gfp, order);
if (!page) {
sgl_free_n_order(sgl, i, order);
-- 
2.17.1



[RFC 2/5] perf: Pass pmu pointer to perf_paranoid_* helpers

2018-09-19 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

To enable per-PMU access controls in a following patch we need to start
passing in the PMU object pointer to perf_paranoid_* helpers.

This patch only changes the API across the code base without changing the
behaviour.

v2:
 * Correct errors in core-book3s.c as reported by kbuild test robot.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
 arch/powerpc/perf/core-book3s.c | 31 ++-
 arch/x86/events/intel/bts.c |  2 +-
 arch/x86/events/intel/core.c|  2 +-
 arch/x86/events/intel/p4.c  |  2 +-
 include/linux/perf_event.h  |  6 +++---
 kernel/events/core.c| 15 ---
 kernel/trace/trace_event_perf.c |  6 --
 7 files changed, 40 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 81f8a0c838ae..1e8b1aed6e81 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -95,7 +95,13 @@ static inline unsigned long perf_ip_adjust(struct pt_regs 
*regs)
 {
return 0;
 }
-static inline void perf_get_data_addr(struct pt_regs *regs, u64 *addrp) { }
+
+static inline void
+perf_get_data_addr(struct pmu *pmu, struct pt_regs *regs, u64 *addrp)
+{
+
+}
+
 static inline u32 perf_get_misc_flags(struct pt_regs *regs)
 {
return 0;
@@ -126,7 +132,13 @@ static unsigned long ebb_switch_in(bool ebb, struct 
cpu_hw_events *cpuhw)
 static inline void power_pmu_bhrb_enable(struct perf_event *event) {}
 static inline void power_pmu_bhrb_disable(struct perf_event *event) {}
 static void power_pmu_sched_task(struct perf_event_context *ctx, bool 
sched_in) {}
-static inline void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw) {}
+
+static inline void
+power_pmu_bhrb_read(struct pmu *pmu,struct cpu_hw_events *cpuhw)
+{
+
+}
+
 static void pmao_restore_workaround(bool ebb) { }
 #endif /* CONFIG_PPC32 */
 
@@ -170,7 +182,8 @@ static inline unsigned long perf_ip_adjust(struct pt_regs 
*regs)
  * pointed to by SIAR; this is indicated by the [POWER6_]MMCRA_SDSYNC, the
  * [POWER7P_]MMCRA_SDAR_VALID bit in MMCRA, or the SDAR_VALID bit in SIER.
  */
-static inline void perf_get_data_addr(struct pt_regs *regs, u64 *addrp)
+static inline void
+perf_get_data_addr(struct pmu *pmu, struct pt_regs *regs, u64 *addrp)
 {
unsigned long mmcra = regs->dsisr;
bool sdar_valid;
@@ -195,7 +208,7 @@ static inline void perf_get_data_addr(struct pt_regs *regs, 
u64 *addrp)
if (!(mmcra & MMCRA_SAMPLE_ENABLE) || sdar_valid)
*addrp = mfspr(SPRN_SDAR);
 
-   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN) &&
+   if (perf_paranoid_kernel(pmu) && !capable(CAP_SYS_ADMIN) &&
is_kernel_addr(mfspr(SPRN_SDAR)))
*addrp = 0;
 }
@@ -435,7 +448,7 @@ static __u64 power_pmu_bhrb_to(u64 addr)
 }
 
 /* Processing BHRB entries */
-static void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
+static void power_pmu_bhrb_read(struct pmu *pmu, struct cpu_hw_events *cpuhw)
 {
u64 val;
u64 addr;
@@ -463,8 +476,8 @@ static void power_pmu_bhrb_read(struct cpu_hw_events *cpuhw)
 * exporting it to userspace (avoid exposure of regions
 * where we could have speculative execution)
 */
-   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN) &&
-   is_kernel_addr(addr))
+   if (perf_paranoid_kernel(pmu) &&
+   !capable(CAP_SYS_ADMIN) && is_kernel_addr(addr))
continue;
 
/* Branches are read most recent first (ie. mfbhrb 0 is
@@ -2066,12 +2079,12 @@ static void record_and_restart(struct perf_event 
*event, unsigned long val,
 
if (event->attr.sample_type &
(PERF_SAMPLE_ADDR | PERF_SAMPLE_PHYS_ADDR))
-   perf_get_data_addr(regs, &data.addr);
+   perf_get_data_addr(event->pmu, regs, &data.addr);
 
if (event->attr.sample_type & PERF_SAMPLE_BRANCH_STACK) {
struct cpu_hw_events *cpuhw;
cpuhw = this_cpu_ptr(&cpu_hw_events);
-   power_pmu_bhrb_read(cpuhw);
+   power_pmu_bhrb_read(event->pmu, cpuhw);
data.br_stack = &cpuhw->bhrb_stack;
}
 
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 24ffa1e88cf9..e416c9e2400a 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -555,7 +555,7 @@ stati

[RFC 3/5] perf: Allow per PMU access control

2018-09-19 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

For situations where sysadmins might want to allow different levels of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.

These work in an equivalent fashion to the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
/device//perf_event_paranoid, can be adjusted to achieve
fine grained access control.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
 include/linux/perf_event.h | 12 ++--
 kernel/events/core.c   | 59 ++
 kernel/sysctl.c|  4 ++-
 3 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 22906bcc1bcd..bb82e47f5343 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -271,6 +271,9 @@ struct pmu {
/* number of address filters this PMU can do */
unsigned intnr_addr_filters;
 
+   /* per PMU access control */
+   int perf_event_paranoid;
+
/*
 * Fully disable/enable this PMU, can be used to protect from the PMI
 * as well as for lazy/batch writing of the MSRs.
@@ -1169,6 +1172,9 @@ extern int sysctl_perf_cpu_time_max_percent;
 
 extern void perf_sample_event_took(u64 sample_len_ns);
 
+extern int perf_proc_paranoid_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp,
+   loff_t *ppos);
 extern int perf_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
@@ -1181,17 +1187,17 @@ int perf_event_max_stack_handler(struct ctl_table 
*table, int write,
 
 static inline bool perf_paranoid_tracepoint_raw(const struct pmu *pmu)
 {
-   return sysctl_perf_event_paranoid > -1;
+   return pmu->perf_event_paranoid > -1;
 }
 
 static inline bool perf_paranoid_cpu(const struct pmu *pmu)
 {
-   return sysctl_perf_event_paranoid > 0;
+   return pmu->perf_event_paranoid > 0;
 }
 
 static inline bool perf_paranoid_kernel(const struct pmu *pmu)
 {
-   return sysctl_perf_event_paranoid > 1;
+   return pmu->perf_event_paranoid > 1;
 }
 
 extern void perf_event_init(void);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index f556144bc0c5..35f122349508 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -432,6 +432,24 @@ static void update_perf_cpu_limits(void)
 
 static bool perf_rotate_context(struct perf_cpu_context *cpuctx);
 
+int perf_proc_paranoid_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp,
+   loff_t *ppos)
+{
+   int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+   struct pmu *pmu;
+
+   if (ret || !write)
+   return ret;
+
+   mutex_lock(&pmus_lock);
+   list_for_each_entry(pmu, &pmus, entry)
+   pmu->perf_event_paranoid = sysctl_perf_event_paranoid;
+   mutex_unlock(&pmus_lock);
+
+   return 0;
+}
+
 int perf_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -9430,6 +9448,41 @@ static void free_pmu_context(struct pmu *pmu)
mutex_unlock(&pmus_lock);
 }
 
+/*
+ * Fine-grained access control:
+ */
+static ssize_t
+perf_event_paranoid_show(struct device *dev,
+struct device_attribute *attr,
+char *page)
+{
+   struct pmu *pmu = dev_get_drvdata(dev);
+
+   return snprintf(page, PAGE_SIZE - 1, "%d\n", pmu->perf_event_paranoid);
+}
+
+static ssize_t
+perf_event_paranoid_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+   struct pmu *pmu = dev_get_drvdata(dev);
+   int ret, val;
+
+   ret = kstrtoint(buf, 0, &val);
+   if (ret)
+   return ret;
+
+   if (val < -1 || val > 2)
+   return -EINVAL;
+
+   pmu->perf_event_paranoid = val;
+
+   return count;
+}
+
+static DEVICE_ATTR_RW(perf_event_paranoid);
+
 /*
  * Let userspace know that this PMU supports address range filtering:
  */
@@ -9544,6 +9597,11 @@ static int pmu_dev_alloc(struct pmu *pmu)
if (ret)
goto free_dev;
 
+   /* Add fine-grained access control attribute. */
+ 

[RFC 1/5] perf: Move some access checks later in perf_event_open

2018-09-19 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

To enable per-PMU access controls in a following patch, first move all call
sites of perf_paranoid_kernel() to after the event has been created.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
 kernel/events/core.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index bccce538f51d..adcd9eae13fb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10194,10 +10194,6 @@ static int perf_copy_attr(struct perf_event_attr 
__user *uattr,
 */
attr->branch_sample_type = mask;
}
-   /* privileged levels capture (kernel, hv): check permissions */
-   if ((mask & PERF_SAMPLE_BRANCH_PERM_PLM)
-   && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
}
 
if (attr->sample_type & PERF_SAMPLE_REGS_USER) {
@@ -10414,11 +10410,6 @@ SYSCALL_DEFINE5(perf_event_open,
if (err)
return err;
 
-   if (!attr.exclude_kernel) {
-   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
-   }
-
if (attr.namespaces) {
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -10432,11 +10423,6 @@ SYSCALL_DEFINE5(perf_event_open,
return -EINVAL;
}
 
-   /* Only privileged users can get physical addresses */
-   if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
-   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
-
/*
 * In cgroup mode, the pid argument is used to pass the fd
 * opened to the cgroup directory in cgroupfs. The cpu argument
@@ -10506,6 +10492,28 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_cred;
}
 
+   if (!attr.exclude_kernel) {
+   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;
+   goto err_alloc;
+   }
+   }
+
+   /* Only privileged users can get physical addresses */
+   if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
+   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;
+   goto err_alloc;
+   }
+
+   /* privileged levels capture (kernel, hv): check permissions */
+   if ((attr.sample_type & PERF_SAMPLE_BRANCH_STACK) &&
+   (attr.branch_sample_type & PERF_SAMPLE_BRANCH_PERM_PLM) &&
+   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;
+   goto err_alloc;
+   }
+
if (is_sampling_event(event)) {
if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
err = -EOPNOTSUPP;
-- 
2.17.1



[RFC 4/5] perf Documentation: Document the per PMU perf_event_paranoid interface

2018-09-19 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Explain the behaviour of the new control knob alongside the existing perf
event documentation.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
 .../testing/sysfs-bus-event_source-devices-events  | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-events 
b/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
index 505f080d20a1..b3096ae9be6f 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
@@ -92,3 +92,17 @@ Description: Perf event scaling factors
 
This is provided to avoid performing floating point arithmetic
in the kernel.
+
+What: /sys/bus/event_source/devices//perf_event_paranoid
+Date: 2018/06/26
+Contact:   Linux kernel mailing list 
+Description:   Per PMU access control file
+
+   This is the per PMU version of the perf_event_paranoid sysctl
+   which allows controlling the access control level for each
+   individual PMU.
+
+   Writes to the sysctl will propagate to all PMU providers.
+
+   For explanation of supported values please consult the sysctl
+   documentation.
-- 
2.17.1
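
For illustration of the interface documented above, a minimal userspace
sketch that adjusts one PMU's access control and reads it back. The
"i915" PMU directory name is only an assumed example; any directory
under /sys/bus/event_source/devices/ exposing this file would work, and
the snippet is not part of the posted series.

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *path =
		"/sys/bus/event_source/devices/i915/perf_event_paranoid";
	FILE *f;
	int val;

	/* Relax access control for this PMU only (requires root). */
	f = fopen(path, "w");
	if (!f) {
		perror("fopen for write");
		return EXIT_FAILURE;
	}
	fprintf(f, "0\n");
	fclose(f);

	/* Read the value back to confirm the store took effect. */
	f = fopen(path, "r");
	if (!f) {
		perror("fopen for read");
		return EXIT_FAILURE;
	}
	if (fscanf(f, "%d", &val) == 1)
		printf("%s = %d\n", path, val);
	fclose(f);

	return EXIT_SUCCESS;
}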



[RFC 0/5] perf: Per PMU access controls (paranoid setting)

2018-09-19 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

For situations where sysadmins might want to allow different levels of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.

These work in an equivalent fashion to the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
/device//perf_event_paranoid, can be adjusted to achieve
fine grained access control.

Discussion from previous posting:
https://lkml.org/lkml/2018/5/21/156

New in this version:

1. Hopefully fixed build issues in core-book3s.c. (Thanks kbuild test robot!)
2. Perf tool support for new controls, largely contributed by Jiri Olsa.

Tvrtko Ursulin (5):
  perf: Move some access checks later in perf_event_open
  perf: Pass pmu pointer to perf_paranoid_* helpers
  perf: Allow per PMU access control
  perf Documentation: Document the per PMU perf_event_paranoid interface
  tools/perf: Add support for per-PMU access control

 .../sysfs-bus-event_source-devices-events |  14 +++
 arch/powerpc/perf/core-book3s.c   |  31 --
 arch/x86/events/intel/bts.c   |   2 +-
 arch/x86/events/intel/core.c  |   2 +-
 arch/x86/events/intel/p4.c|   2 +-
 include/linux/perf_event.h|  18 ++-
 kernel/events/core.c  | 104 +++---
 kernel/sysctl.c   |   4 +-
 kernel/trace/trace_event_perf.c   |   6 +-
 tools/perf/arch/arm/util/cs-etm.c |   2 +-
 tools/perf/arch/arm64/util/arm-spe.c  |   2 +-
 tools/perf/arch/x86/util/intel-bts.c  |   2 +-
 tools/perf/arch/x86/util/intel-pt.c   |   2 +-
 tools/perf/util/evsel.c   |  41 ++-
 tools/perf/util/pmu.c |  17 +++
 tools/perf/util/pmu.h |   1 +
 16 files changed, 204 insertions(+), 46 deletions(-)

-- 
2.17.1
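
The series leaves the meaning of the individual values to the existing
sysctl documentation (see the ABI entry above). As an editorial summary,
the standard perf_event_paranoid levels that both the global sysctl and
the per-PMU files accept can be written as a small C enum; the enum and
its names are illustrative only and are not an interface added by the
series.

/*
 * Illustrative summary of the standard perf_event_paranoid levels;
 * the per-PMU files accept the same -1..2 range as the global sysctl,
 * matching the range check in the perf_event_paranoid_store() helper
 * quoted earlier.
 */
enum paranoid_level {
	PARANOID_ALLOW_ALL      = -1,	/* no restrictions for unprivileged users */
	PARANOID_NO_RAW_TRACE   =  0,	/* deny unprivileged raw tracepoint access */
	PARANOID_NO_CPU_EVENTS  =  1,	/* additionally deny unprivileged CPU-wide events */
	PARANOID_NO_KERNEL_PROF =  2,	/* additionally deny unprivileged kernel profiling */
};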



[RFC 5/5] tools/perf: Add support for per-PMU access control

2018-09-19 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Now that the perf core supports per-PMU paranoid settings we need to
extend the tool to support that.

We handle the per-PMU setting in the platform support code where
applicable and also notify the user of the new facility on failures to
open the event.

Thanks to Jiri Olsa for contributing most of the "meat" for this patch and
for the suggestion to improve the error banner as well.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
 tools/perf/arch/arm/util/cs-etm.c|  2 +-
 tools/perf/arch/arm64/util/arm-spe.c |  2 +-
 tools/perf/arch/x86/util/intel-bts.c |  2 +-
 tools/perf/arch/x86/util/intel-pt.c  |  2 +-
 tools/perf/util/evsel.c  | 41 ++--
 tools/perf/util/pmu.c| 17 
 tools/perf/util/pmu.h|  1 +
 7 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/tools/perf/arch/arm/util/cs-etm.c 
b/tools/perf/arch/arm/util/cs-etm.c
index 2f595cd73da6..a5437f100ab9 100644
--- a/tools/perf/arch/arm/util/cs-etm.c
+++ b/tools/perf/arch/arm/util/cs-etm.c
@@ -69,7 +69,7 @@ static int cs_etm_recording_options(struct auxtrace_record 
*itr,
struct perf_pmu *cs_etm_pmu = ptr->cs_etm_pmu;
struct perf_evsel *evsel, *cs_etm_evsel = NULL;
const struct cpu_map *cpus = evlist->cpus;
-   bool privileged = (geteuid() == 0 || perf_event_paranoid() < 0);
+   bool privileged = (geteuid() == 0 || cs_etm_pmu->paranoid < 0);
 
ptr->evlist = evlist;
ptr->snapshot_mode = opts->auxtrace_snapshot_mode;
diff --git a/tools/perf/arch/arm64/util/arm-spe.c 
b/tools/perf/arch/arm64/util/arm-spe.c
index 5ccfce87e693..e7e8154757fa 100644
--- a/tools/perf/arch/arm64/util/arm-spe.c
+++ b/tools/perf/arch/arm64/util/arm-spe.c
@@ -65,7 +65,7 @@ static int arm_spe_recording_options(struct auxtrace_record 
*itr,
container_of(itr, struct arm_spe_recording, itr);
struct perf_pmu *arm_spe_pmu = sper->arm_spe_pmu;
struct perf_evsel *evsel, *arm_spe_evsel = NULL;
-   bool privileged = geteuid() == 0 || perf_event_paranoid() < 0;
+   bool privileged = geteuid() == 0 || arm_spe_pmu->paranoid < 0;
struct perf_evsel *tracking_evsel;
int err;
 
diff --git a/tools/perf/arch/x86/util/intel-bts.c 
b/tools/perf/arch/x86/util/intel-bts.c
index 781df40b2966..c97e6556c8e7 100644
--- a/tools/perf/arch/x86/util/intel-bts.c
+++ b/tools/perf/arch/x86/util/intel-bts.c
@@ -116,7 +116,7 @@ static int intel_bts_recording_options(struct 
auxtrace_record *itr,
struct perf_pmu *intel_bts_pmu = btsr->intel_bts_pmu;
struct perf_evsel *evsel, *intel_bts_evsel = NULL;
const struct cpu_map *cpus = evlist->cpus;
-   bool privileged = geteuid() == 0 || perf_event_paranoid() < 0;
+   bool privileged = geteuid() == 0 || intel_bts_pmu->paranoid < 0;
 
btsr->evlist = evlist;
btsr->snapshot_mode = opts->auxtrace_snapshot_mode;
diff --git a/tools/perf/arch/x86/util/intel-pt.c 
b/tools/perf/arch/x86/util/intel-pt.c
index db0ba8caf5a2..ffbe5f7f1c57 100644
--- a/tools/perf/arch/x86/util/intel-pt.c
+++ b/tools/perf/arch/x86/util/intel-pt.c
@@ -555,7 +555,7 @@ static int intel_pt_recording_options(struct 
auxtrace_record *itr,
bool have_timing_info, need_immediate = false;
struct perf_evsel *evsel, *intel_pt_evsel = NULL;
const struct cpu_map *cpus = evlist->cpus;
-   bool privileged = geteuid() == 0 || perf_event_paranoid() < 0;
+   bool privileged = geteuid() == 0 || intel_pt_pmu->paranoid < 0;
u64 tsc_bit;
int err;
 
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1a61628a1c12..fbebce0593f4 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -9,6 +9,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -20,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include "asm/bug.h"
@@ -2843,10 +2845,37 @@ static bool find_process(const char *name)
return ret ? false : true;
 }
 
+static int __pmu_paranoid_value(const char *name, char *buf, int bufsz)
+{
+   char path[PATH_MAX];
+   int fd, ret;
+
+   ret = snprintf(path, sizeof(path),
+  "%s/bus/event_source/devices/%s/perf_event_paranoid",
+   sysfs__mountpoint(), name);
+   if (ret < 0 || ret == sizeof(path))
+   return -1;
+
+   fd = open(path, O_RDONLY);
+   if (fd < 0)
+   return -1;
+
+   ret = read(fd, buf, bufsz - 1);
+   if (ret <= 0)
+   r
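
The quoted __pmu_paranoid_value() helper is cut off in the archive. As a
rough standalone illustration of the idea, here is a minimal userspace
sketch that reads a PMU's perf_event_paranoid file and falls back to the
global sysctl when the PMU does not expose one; the fallback behaviour is
an assumption made for illustration, and the code is not the actual
tools/perf implementation.

#include <limits.h>
#include <stdio.h>

static int read_paranoid(const char *path, int *val)
{
	FILE *f = fopen(path, "r");
	int ok = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", val) == 1)
		ok = 0;
	fclose(f);
	return ok;
}

int pmu_paranoid_value(const char *name)
{
	char path[PATH_MAX];
	int val;

	snprintf(path, sizeof(path),
		 "/sys/bus/event_source/devices/%s/perf_event_paranoid", name);

	if (read_paranoid(path, &val) == 0)
		return val;

	/* No per-PMU file: fall back to the global setting. */
	if (read_paranoid("/proc/sys/kernel/perf_event_paranoid", &val) == 0)
		return val;

	return INT_MAX; /* assume most restrictive if nothing could be read */
}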

Re: [RFC 0/4] perf: Per PMU access controls (paranoid setting)

2018-09-12 Thread Tvrtko Ursulin



On 12/09/18 07:52, Alexey Budankov wrote:


Hi,

Is there any plans or may be even progress on that so far?


It's hanging in the back of my mind. AFAIR after last round there was a 
build failure or two to fix on more exotic (to me) hardware, and Jiri 
Olsa provided a tools/perf snippet supporting the feature.


But essentially I haven't done any work on it since, due to not seeing the
route to upstream. :( In other words, will someone review it, and will
that r-b give it a chance of getting into some tree? If I had a
clear statement from someone with authority in these aspects I would
progress it, but otherwise it felt like it was not going anywhere.


Regards,

Tvrtko



Thanks,
Alexey

On 26.06.2018 18:36, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

For situations where sysadmins might want to allow different level of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.

These work in equivalent fashion as the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
/device//perf_event_paranoid, can be adjusted to achieve
fine grained access control.

Discussion from previous posting:
https://lkml.org/lkml/2018/5/21/156

Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org

Tvrtko Ursulin (4):
   perf: Move some access checks later in perf_event_open
   perf: Pass pmu pointer to perf_paranoid_* helpers
   perf: Allow per PMU access control
   perf Documentation: Document the per PMU perf_event_paranoid interface

  .../sysfs-bus-event_source-devices-events |  14 +++
  arch/powerpc/perf/core-book3s.c   |   2 +-
  arch/x86/events/intel/bts.c   |   2 +-
  arch/x86/events/intel/core.c  |   2 +-
  arch/x86/events/intel/p4.c|   2 +-
  include/linux/perf_event.h|  18 ++-
  kernel/events/core.c  | 104 +++---
  kernel/sysctl.c   |   4 +-
  kernel/trace/trace_event_perf.c   |   6 +-
  9 files changed, 123 insertions(+), 31 deletions(-)





Re: [RFC 2/4] perf: Pass pmu pointer to perf_paranoid_* helpers

2018-07-03 Thread Tvrtko Ursulin



Hi Ravi,

On 03/07/18 11:24, Ravi Bangoria wrote:

Hi Tvrtko,


@@ -199,7 +199,7 @@ static inline void perf_get_data_addr(struct pt_regs *regs, 
u64 *addrp)
if (!(mmcra & MMCRA_SAMPLE_ENABLE) || sdar_valid)
*addrp = mfspr(SPRN_SDAR);
  
-	if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN) &&

+   if (perf_paranoid_kernel(ppmu) && !capable(CAP_SYS_ADMIN) &&
is_kernel_addr(mfspr(SPRN_SDAR)))
*addrp = 0;
  }


This patch fails for me on powerpc:

arch/powerpc/perf/core-book3s.c: In function ‘perf_get_data_addr’:
arch/powerpc/perf/core-book3s.c:202:27: error: passing argument 1 of 
‘perf_paranoid_kernel’ from incompatible pointer type 
[-Werror=incompatible-pointer-types]
   if (perf_paranoid_kernel(ppmu) && !capable(CAP_SYS_ADMIN) &&
^~~~
In file included from arch/powerpc/perf/core-book3s.c:13:0:
./include/linux/perf_event.h:1191:20: note: expected ‘const struct pmu *’ but 
argument is of type ‘struct power_pmu *’
  static inline bool perf_paranoid_kernel(const struct pmu *pmu)
 ^~~~
arch/powerpc/perf/core-book3s.c: In function ‘power_pmu_bhrb_read’:
arch/powerpc/perf/core-book3s.c:470:8: error: too few arguments to function 
‘perf_paranoid_kernel’
 if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN) &&
 ^~~~
In file included from arch/powerpc/perf/core-book3s.c:13:0:
./include/linux/perf_event.h:1191:20: note: declared here
  static inline bool perf_paranoid_kernel(const struct pmu *pmu)
 ^~~~
   CC  net/ipv6/route.o


Yep, kbuild reported this as well. I will need to re-arrange the code a
bit to a) pass the correct pointer in, and b) handle the one call-site I
missed as well.


I was holding off on doing that until some more general feedback arrived.

Regards,

Tvrtko


Re: [RFC 3/4] perf: Allow per PMU access control

2018-06-27 Thread Tvrtko Ursulin



On 27/06/18 10:47, Alexey Budankov wrote:



On 27.06.2018 12:15, Tvrtko Ursulin wrote:


On 26/06/18 18:25, Alexey Budankov wrote:

Hi,

On 26.06.2018 18:36, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

For situations where sysadmins might want to allow different level of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.

These work in equivalent fashion as the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
/device//perf_event_paranoid, can be adjusted to achieve
fine grained access control.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
   include/linux/perf_event.h | 12 ++--
   kernel/events/core.c   | 59 ++
   kernel/sysctl.c    |  4 ++-
   3 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d7938d88c028..22e91cc2d9e1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -271,6 +271,9 @@ struct pmu {
   /* number of address filters this PMU can do */
   unsigned int    nr_addr_filters;
   +    /* per PMU access control */
+    int    perf_event_paranoid;


It looks like it needs to be declared as atomic and atomic_read/atomic_write
operations need to be explicitly used below in the patch as far this
variable may be manipulated by different threads at the same time
without explicit locking.


It is just a write of an integer from either sysfs access or sysctl. As such I 
don't think going atomic would change anything. There is no RMW or increment or 
anything on it.

Unless there are architectures where int stores are not atomic? But then the 
existing sysctl would have the same issue. So I suspect we can indeed rely on 
int store being atomic.


Yep, aligned word read/write is atomic on Intel and there is no runtime issue
currently, but the implementation itself is multithread and implicitly relies
on that atomicity so my suggestion is just explicitly code that assumption :).
Also, as you mentioned, that makes the arch independent part of code more 
portable.


Perhaps we are not on the same page, but my argument was that the 
current sysctl (or sysctl via proc) has the same issue with multiple 
threads potentially writing to it. As long as the end result is a valid 
value it is not a problem. So I don't see what this patch changes in 
that respect. Different tasks writing either of the sysctl/sysfs values 
race, but end up with valid values everywhere. If we can rely on int 
stores to be atomic on all architectures I don't see an effective 
difference after changing to atomic_t, or even taking the pmu mutex over 
the per PMU sysfs writes. So I don't see value in complicating the code 
with either approach but am happy to be proved wrong if that is the case.


Regards,

Tvrtko
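
For reference, a minimal kernel-style sketch of the explicitly atomic
variant being discussed; the posted series keeps a plain int and relies
on aligned int loads and stores being atomic, so this is illustrative
only and not code from the series.

#include <linux/atomic.h>
#include <linux/types.h>

/* Stand-in for the perf_event_paranoid field added to struct pmu. */
static atomic_t example_paranoid = ATOMIC_INIT(2);

static bool example_paranoid_kernel(void)
{
	/* level 2: deny unprivileged kernel profiling */
	return atomic_read(&example_paranoid) > 1;
}

static void example_paranoid_set(int val)
{
	/* the sysfs store / sysctl propagation path would call this */
	atomic_set(&example_paranoid, val);
}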


Re: [RFC 3/4] perf: Allow per PMU access control

2018-06-27 Thread Tvrtko Ursulin



On 26/06/18 18:25, Alexey Budankov wrote:

Hi,

On 26.06.2018 18:36, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

For situations where sysadmins might want to allow different level of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.

These work in equivalent fashion as the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
/device//perf_event_paranoid, can be adjusted to achieve
fine grained access control.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
  include/linux/perf_event.h | 12 ++--
  kernel/events/core.c   | 59 ++
  kernel/sysctl.c|  4 ++-
  3 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d7938d88c028..22e91cc2d9e1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -271,6 +271,9 @@ struct pmu {
/* number of address filters this PMU can do */
unsigned intnr_addr_filters;
  
+	/* per PMU access control */

+   int perf_event_paranoid;


It looks like it needs to be declared as atomic and atomic_read/atomic_write
operations need to be explicitly used below in the patch as far this
variable may be manipulated by different threads at the same time
without explicit locking.


It is just a write of an integer from either sysfs access or sysctl. As 
such I don't think going atomic would change anything. There is no RMW 
or increment or anything on it.


Unless there are architectures where int stores are not atomic? But then 
the existing sysctl would have the same issue. So I suspect we can 
indeed rely on int store being atomic.


Regards,

Tvrtko




+
/*
 * Fully disable/enable this PMU, can be used to protect from the PMI
 * as well as for lazy/batch writing of the MSRs.
@@ -1168,6 +1171,9 @@ extern int sysctl_perf_cpu_time_max_percent;
  
  extern void perf_sample_event_took(u64 sample_len_ns);
  
+extern int perf_proc_paranoid_handler(struct ctl_table *table, int write,

+   void __user *buffer, size_t *lenp,
+   loff_t *ppos);
  extern int perf_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
@@ -1180,17 +1186,17 @@ int perf_event_max_stack_handler(struct ctl_table 
*table, int write,
  
  static inline bool perf_paranoid_tracepoint_raw(const struct pmu *pmu)

  {
-   return sysctl_perf_event_paranoid > -1;
+   return pmu->perf_event_paranoid > -1;
  }
  
  static inline bool perf_paranoid_cpu(const struct pmu *pmu)

  {
-   return sysctl_perf_event_paranoid > 0;
+   return pmu->perf_event_paranoid > 0;
  }
  
  static inline bool perf_paranoid_kernel(const struct pmu *pmu)

  {
-   return sysctl_perf_event_paranoid > 1;
+   return pmu->perf_event_paranoid > 1;
  }
  
  extern void perf_event_init(void);

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 370c89e81722..da36317dc8dc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -432,6 +432,24 @@ static void update_perf_cpu_limits(void)
  
  static bool perf_rotate_context(struct perf_cpu_context *cpuctx);
  
+int perf_proc_paranoid_handler(struct ctl_table *table, int write,

+   void __user *buffer, size_t *lenp,
+   loff_t *ppos)
+{
+   int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+   struct pmu *pmu;
+
+   if (ret || !write)
+   return ret;
+
+   mutex_lock(&pmus_lock);
+   list_for_each_entry(pmu, &pmus, entry)
+   pmu->perf_event_paranoid = sysctl_perf_event_paranoid;
+   mutex_unlock(&pmus_lock);
+
+   return 0;
+}
+
  int perf_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -9425,6 +9443,41 @@ static void free_pmu_context(struct pmu *pmu)
mutex_unlock(&pmus_lock);
  }
  
+/*

+ * Fine-grained access control:
+ */
+static ssize_t
+perf_event_paranoid_show(struct device *dev,
+struct device_attribute *attr,
+char *page)
+{
+   struct pmu *pmu = dev_get_drvdata(dev);
+
+   return snprintf(page, PAGE_SIZE - 1, "%d\n", pmu->perf_event_paranoid);
+}
+
+static 

Re: [RFC 1/4] perf: Move some access checks later in perf_event_open

2018-06-27 Thread Tvrtko Ursulin



On 26/06/18 18:24, Alexey Budankov wrote:

Hi,

On 26.06.2018 18:36, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

To enable per-PMU access controls in a following patch first move all call
sites of perf_paranoid_kernel() to after the event has been created.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
  kernel/events/core.c | 36 ++--
  1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f490caca9aa4..12de95b0472e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10189,10 +10189,6 @@ static int perf_copy_attr(struct perf_event_attr 
__user *uattr,
 */
attr->branch_sample_type = mask;
}
-   /* privileged levels capture (kernel, hv): check permissions */
-   if ((mask & PERF_SAMPLE_BRANCH_PERM_PLM)
-   && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
}
  
  	if (attr->sample_type & PERF_SAMPLE_REGS_USER) {

@@ -10409,11 +10405,6 @@ SYSCALL_DEFINE5(perf_event_open,
if (err)
return err;
  
-	if (!attr.exclude_kernel) {

-   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
-   }
-
if (attr.namespaces) {
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -10427,11 +10418,6 @@ SYSCALL_DEFINE5(perf_event_open,
return -EINVAL;
}
  
-	/* Only privileged users can get physical addresses */

-   if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
-   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
-
/*
 * In cgroup mode, the pid argument is used to pass the fd
 * opened to the cgroup directory in cgroupfs. The cpu argument
@@ -10501,6 +10487,28 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_cred;
}
  
+	if (!attr.exclude_kernel) {

+   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;


I would separate this combined permissions check into a function e.g.
static bool perf_test_pmu_paranoid(const struct pmu *pmu, int *err) to avoid
code duplication.


My thinking was for this to be as mechanical (code movement) as 
possible, but I can consider it.


Regards,

Tvrtko


+   goto err_alloc;
+   }
+   }
+
+   /* Only privileged users can get physical addresses */
+   if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
+   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;
+   goto err_alloc;
+   }
+
+   /* privileged levels capture (kernel, hv): check permissions */
+   if ((attr.sample_type & PERF_SAMPLE_BRANCH_STACK) &&
+   (attr.branch_sample_type & PERF_SAMPLE_BRANCH_PERM_PLM) &&
+   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;
+   goto err_alloc;
+   }
+
if (is_sampling_event(event)) {
if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
err = -EOPNOTSUPP;





Re: [RFC 1/4] perf: Move some access checks later in perf_event_open

2018-06-27 Thread Tvrtko Ursulin



On 26/06/18 18:24, Alexey Budankov wrote:

Hi,

On 26.06.2018 18:36, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

To enable per-PMU access controls in a following patch first move all call
sites of perf_paranoid_kernel() to after the event has been created.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
  kernel/events/core.c | 36 ++--
  1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f490caca9aa4..12de95b0472e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10189,10 +10189,6 @@ static int perf_copy_attr(struct perf_event_attr 
__user *uattr,
 */
attr->branch_sample_type = mask;
}
-   /* privileged levels capture (kernel, hv): check permissions */
-   if ((mask & PERF_SAMPLE_BRANCH_PERM_PLM)
-   && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
}
  
  	if (attr->sample_type & PERF_SAMPLE_REGS_USER) {

@@ -10409,11 +10405,6 @@ SYSCALL_DEFINE5(perf_event_open,
if (err)
return err;
  
-	if (!attr.exclude_kernel) {

-   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
-   }
-
if (attr.namespaces) {
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -10427,11 +10418,6 @@ SYSCALL_DEFINE5(perf_event_open,
return -EINVAL;
}
  
-	/* Only privileged users can get physical addresses */

-   if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
-   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
-
/*
 * In cgroup mode, the pid argument is used to pass the fd
 * opened to the cgroup directory in cgroupfs. The cpu argument
@@ -10501,6 +10487,28 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_cred;
}
  
+	if (!attr.exclude_kernel) {

+   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;


I would separate this combined permissions check into a function e.g.
static bool perf_test_pmu_paranoid(const struct pmu *pmu, int *err) to avoid
code duplication.


My thinking was for this to be as mechanical (code movement) as 
possible, but I can consider it.


Regards,

Tvrtko


+   goto err_alloc;
+   }
+   }
+
+   /* Only privileged users can get physical addresses */
+   if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
+   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;
+   goto err_alloc;
+   }
+
+   /* privileged levels capture (kernel, hv): check permissions */
+   if ((attr.sample_type & PERF_SAMPLE_BRANCH_STACK) &&
+   (attr.branch_sample_type & PERF_SAMPLE_BRANCH_PERM_PLM) &&
+   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;
+   goto err_alloc;
+   }
+
if (is_sampling_event(event)) {
if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
err = -EOPNOTSUPP;
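
For reference, a minimal sketch of the helper Alexey suggests might look as
follows. It assumes the perf_paranoid_kernel(pmu) variant introduced later in
this series; the name comes from the review comment above, while the body and
call-site shape are illustrative rather than part of the posted patches:

    /*
     * Combine the repeated "paranoid kernel + CAP_SYS_ADMIN" test so the
     * three checks in perf_event_open() above do not duplicate it.
     */
    static bool perf_test_pmu_paranoid(const struct pmu *pmu, int *err)
    {
            if (perf_paranoid_kernel(pmu) && !capable(CAP_SYS_ADMIN)) {
                    *err = -EACCES;
                    return false;
            }

            return true;
    }

Each of the three blocks above would then reduce to something like
"if (!perf_test_pmu_paranoid(event->pmu, &err)) goto err_alloc;", still guarded
by the relevant attr condition.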





[RFC 2/4] perf: Pass pmu pointer to perf_paranoid_* helpers

2018-06-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

To enable per-PMU access controls in a following patch we need to start
passing in the PMU object pointer to perf_paranoid_* helpers.

This patch only changes the API across the code base without changing the
behaviour.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
 arch/powerpc/perf/core-book3s.c |  2 +-
 arch/x86/events/intel/bts.c |  2 +-
 arch/x86/events/intel/core.c|  2 +-
 arch/x86/events/intel/p4.c  |  2 +-
 include/linux/perf_event.h  |  6 +++---
 kernel/events/core.c| 15 ---
 kernel/trace/trace_event_perf.c |  6 --
 7 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 3f66fcf8ad99..ae6716cea308 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -199,7 +199,7 @@ static inline void perf_get_data_addr(struct pt_regs *regs, 
u64 *addrp)
if (!(mmcra & MMCRA_SAMPLE_ENABLE) || sdar_valid)
*addrp = mfspr(SPRN_SDAR);
 
-   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN) &&
+   if (perf_paranoid_kernel(ppmu) && !capable(CAP_SYS_ADMIN) &&
is_kernel_addr(mfspr(SPRN_SDAR)))
*addrp = 0;
 }
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 24ffa1e88cf9..e416c9e2400a 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -555,7 +555,7 @@ static int bts_event_init(struct perf_event *event)
 * Note that the default paranoia setting permits unprivileged
 * users to profile the kernel.
 */
-   if (event->attr.exclude_kernel && perf_paranoid_kernel() &&
+   if (event->attr.exclude_kernel && perf_paranoid_kernel(event->pmu) &&
!capable(CAP_SYS_ADMIN))
return -EACCES;
 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 707b2a96e516..6b126bdbd16c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3025,7 +3025,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (x86_pmu.version < 3)
return -EINVAL;
 
-   if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+   if (perf_paranoid_cpu(event->pmu) && !capable(CAP_SYS_ADMIN))
return -EACCES;
 
event->hw.config |= ARCH_PERFMON_EVENTSEL_ANY;
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index d32c0eed38ca..878451ef1ace 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -776,7 +776,7 @@ static int p4_validate_raw_event(struct perf_event *event)
 * the user needs special permissions to be able to use it
 */
if (p4_ht_active() && p4_event_bind_map[v].shared) {
-   if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+   if (perf_paranoid_cpu(event->pmu) && !capable(CAP_SYS_ADMIN))
return -EACCES;
}
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1fa12887ec02..d7938d88c028 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1178,17 +1178,17 @@ extern int perf_cpu_time_max_percent_handler(struct 
ctl_table *table, int write,
 int perf_event_max_stack_handler(struct ctl_table *table, int write,
 void __user *buffer, size_t *lenp, loff_t 
*ppos);
 
-static inline bool perf_paranoid_tracepoint_raw(void)
+static inline bool perf_paranoid_tracepoint_raw(const struct pmu *pmu)
 {
return sysctl_perf_event_paranoid > -1;
 }
 
-static inline bool perf_paranoid_cpu(void)
+static inline bool perf_paranoid_cpu(const struct pmu *pmu)
 {
return sysctl_perf_event_paranoid > 0;
 }
 
-static inline bool perf_paranoid_kernel(void)
+static inline bool perf_paranoid_kernel(const struct pmu *pmu)
 {
return sysctl_perf_event_paranoid > 1;
 }
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 12de95b0472e..370c89e81722 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -4113,7 +4113,7 @@ find_get_context(struct pmu *pmu, struct task_struct 
*task,
 
if (!task) {
/* Must be root to operate on a CPU event: */
-   if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+   if (perf_paranoid_cpu(pmu) && !capable(CAP_SYS_ADMIN))
return ERR_PTR(-EACCES);
 
cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
@@ -5681,7 +5681,7 @@ static int perf_mma


[RFC 3/4] perf: Allow per PMU access control

2018-06-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

For situations where sysadmins might want to allow different level of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.

These work in equivalent fashion as the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
/device//perf_event_paranoid, can be adjusted to achieve
fine grained access control.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
 include/linux/perf_event.h | 12 ++--
 kernel/events/core.c   | 59 ++
 kernel/sysctl.c|  4 ++-
 3 files changed, 71 insertions(+), 4 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index d7938d88c028..22e91cc2d9e1 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -271,6 +271,9 @@ struct pmu {
/* number of address filters this PMU can do */
unsigned intnr_addr_filters;
 
+   /* per PMU access control */
+   int perf_event_paranoid;
+
/*
 * Fully disable/enable this PMU, can be used to protect from the PMI
 * as well as for lazy/batch writing of the MSRs.
@@ -1168,6 +1171,9 @@ extern int sysctl_perf_cpu_time_max_percent;
 
 extern void perf_sample_event_took(u64 sample_len_ns);
 
+extern int perf_proc_paranoid_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp,
+   loff_t *ppos);
 extern int perf_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
@@ -1180,17 +1186,17 @@ int perf_event_max_stack_handler(struct ctl_table 
*table, int write,
 
 static inline bool perf_paranoid_tracepoint_raw(const struct pmu *pmu)
 {
-   return sysctl_perf_event_paranoid > -1;
+   return pmu->perf_event_paranoid > -1;
 }
 
 static inline bool perf_paranoid_cpu(const struct pmu *pmu)
 {
-   return sysctl_perf_event_paranoid > 0;
+   return pmu->perf_event_paranoid > 0;
 }
 
 static inline bool perf_paranoid_kernel(const struct pmu *pmu)
 {
-   return sysctl_perf_event_paranoid > 1;
+   return pmu->perf_event_paranoid > 1;
 }
 
 extern void perf_event_init(void);
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 370c89e81722..da36317dc8dc 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -432,6 +432,24 @@ static void update_perf_cpu_limits(void)
 
 static bool perf_rotate_context(struct perf_cpu_context *cpuctx);
 
+int perf_proc_paranoid_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp,
+   loff_t *ppos)
+{
+   int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+   struct pmu *pmu;
+
+   if (ret || !write)
+   return ret;
+
+   mutex_lock(&pmus_lock);
+   list_for_each_entry(pmu, &pmus, entry)
+   pmu->perf_event_paranoid = sysctl_perf_event_paranoid;
+   mutex_unlock(&pmus_lock);
+
+   return 0;
+}
+
 int perf_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos)
@@ -9425,6 +9443,41 @@ static void free_pmu_context(struct pmu *pmu)
mutex_unlock(&pmus_lock);
 }
 
+/*
+ * Fine-grained access control:
+ */
+static ssize_t
+perf_event_paranoid_show(struct device *dev,
+struct device_attribute *attr,
+char *page)
+{
+   struct pmu *pmu = dev_get_drvdata(dev);
+
+   return snprintf(page, PAGE_SIZE - 1, "%d\n", pmu->perf_event_paranoid);
+}
+
+static ssize_t
+perf_event_paranoid_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+   struct pmu *pmu = dev_get_drvdata(dev);
+   int ret, val;
+
+   ret = kstrtoint(buf, 0, &val);
+   if (ret)
+   return ret;
+
+   if (val < -1 || val > 2)
+   return -EINVAL;
+
+   pmu->perf_event_paranoid = val;
+
+   return count;
+}
+
+static DEVICE_ATTR_RW(perf_event_paranoid);
+
 /*
  * Let userspace know that this PMU supports address range filtering:
  */
@@ -9539,6 +9592,11 @@ static int pmu_dev_alloc(struct pmu *pmu)
if (ret)
goto free_dev;
 
+   /* Add fine-grained access control attribute. */
+ 

[RFC 4/4] perf Documentation: Document the per PMU perf_event_paranoid interface

2018-06-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

Explain behaviour of the new control knob along side the existing perf
event documentation.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
 .../testing/sysfs-bus-event_source-devices-events  | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/Documentation/ABI/testing/sysfs-bus-event_source-devices-events 
b/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
index 505f080d20a1..b3096ae9be6f 100644
--- a/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
+++ b/Documentation/ABI/testing/sysfs-bus-event_source-devices-events
@@ -92,3 +92,17 @@ Description: Perf event scaling factors
 
This is provided to avoid performing floating point arithmetic
in the kernel.
+
+What: /sys/bus/event_source/devices//perf_event_paranoid
+Date: 2018/06/26
+Contact:   Linux kernel mailing list 
+Description:   Per PMU access control file
+
+   This is the per PMU version of the perf_event_paranoid sysctl
+   which allows controlling the access control level for each
+   individual PMU.
+
+   Writes to the sysctl will propagate to all PMU providers.
+
+   For explanation of supported values please consult the sysctl
+   documentation.
-- 
2.17.1
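
To illustrate how a monitoring tool could consume the attribute documented
above, here is a small self-contained userspace sketch; the "i915" device name
is only an assumed example, and the path pattern is the one from the ABI entry:

    /* Read a per-PMU perf_event_paranoid value from sysfs. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            const char *path =
                    "/sys/bus/event_source/devices/i915/perf_event_paranoid";
            FILE *f = fopen(path, "r");
            int level;

            if (!f) {
                    perror(path);
                    return EXIT_FAILURE;
            }
            if (fscanf(f, "%d", &level) != 1) {
                    fprintf(stderr, "unexpected contents in %s\n", path);
                    fclose(f);
                    return EXIT_FAILURE;
            }
            fclose(f);

            printf("%s: level %d\n", path, level);
            return EXIT_SUCCESS;
    }

Writing a new level back works the same way with a file opened for writing,
subject to the usual sysfs permissions (typically root only).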




[RFC 1/4] perf: Move some access checks later in perf_event_open

2018-06-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

To enable per-PMU access controls in a following patch first move all call
sites of perf_paranoid_kernel() to after the event has been created.

Signed-off-by: Tvrtko Ursulin 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org
---
 kernel/events/core.c | 36 ++--
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index f490caca9aa4..12de95b0472e 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -10189,10 +10189,6 @@ static int perf_copy_attr(struct perf_event_attr 
__user *uattr,
 */
attr->branch_sample_type = mask;
}
-   /* privileged levels capture (kernel, hv): check permissions */
-   if ((mask & PERF_SAMPLE_BRANCH_PERM_PLM)
-   && perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
}
 
if (attr->sample_type & PERF_SAMPLE_REGS_USER) {
@@ -10409,11 +10405,6 @@ SYSCALL_DEFINE5(perf_event_open,
if (err)
return err;
 
-   if (!attr.exclude_kernel) {
-   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
-   }
-
if (attr.namespaces) {
if (!capable(CAP_SYS_ADMIN))
return -EACCES;
@@ -10427,11 +10418,6 @@ SYSCALL_DEFINE5(perf_event_open,
return -EINVAL;
}
 
-   /* Only privileged users can get physical addresses */
-   if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
-   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN))
-   return -EACCES;
-
/*
 * In cgroup mode, the pid argument is used to pass the fd
 * opened to the cgroup directory in cgroupfs. The cpu argument
@@ -10501,6 +10487,28 @@ SYSCALL_DEFINE5(perf_event_open,
goto err_cred;
}
 
+   if (!attr.exclude_kernel) {
+   if (perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;
+   goto err_alloc;
+   }
+   }
+
+   /* Only privileged users can get physical addresses */
+   if ((attr.sample_type & PERF_SAMPLE_PHYS_ADDR) &&
+   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;
+   goto err_alloc;
+   }
+
+   /* privileged levels capture (kernel, hv): check permissions */
+   if ((attr.sample_type & PERF_SAMPLE_BRANCH_STACK) &&
+   (attr.branch_sample_type & PERF_SAMPLE_BRANCH_PERM_PLM) &&
+   perf_paranoid_kernel() && !capable(CAP_SYS_ADMIN)) {
+   err = -EACCES;
+   goto err_alloc;
+   }
+
if (is_sampling_event(event)) {
if (event->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT) {
err = -EOPNOTSUPP;
-- 
2.17.1




[RFC 0/4] perf: Per PMU access controls (paranoid setting)

2018-06-26 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

For situations where sysadmins might want to allow different level of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.

These work in equivalent fashion as the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
/device//perf_event_paranoid, can be adjusted to achieve
fine grained access control.

Discussion from previous posting:
https://lkml.org/lkml/2018/5/21/156

Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: "H. Peter Anvin" 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: Madhavan Srinivasan 
Cc: Andi Kleen 
Cc: Alexey Budankov 
Cc: linux-kernel@vger.kernel.org
Cc: x...@kernel.org

Tvrtko Ursulin (4):
  perf: Move some access checks later in perf_event_open
  perf: Pass pmu pointer to perf_paranoid_* helpers
  perf: Allow per PMU access control
  perf Documentation: Document the per PMU perf_event_paranoid interface

 .../sysfs-bus-event_source-devices-events |  14 +++
 arch/powerpc/perf/core-book3s.c   |   2 +-
 arch/x86/events/intel/bts.c   |   2 +-
 arch/x86/events/intel/core.c  |   2 +-
 arch/x86/events/intel/p4.c|   2 +-
 include/linux/perf_event.h|  18 ++-
 kernel/events/core.c  | 104 +++---
 kernel/sysctl.c   |   4 +-
 kernel/trace/trace_event_perf.c   |   6 +-
 9 files changed, 123 insertions(+), 31 deletions(-)

-- 
2.17.1




Re: [RFC] perf: Allow fine-grained PMU access control

2018-06-11 Thread Tvrtko Ursulin



Hi,

On 22/05/2018 18:19, Andi Kleen wrote:

IMHO, it is unsafe for CBOX pmu but could IMC, UPI pmus be an exception here?
Because currently perf stat -I from IMC, UPI counters is only allowed when
system wide monitoring is permitted and this prevents joint perf record and
perf stat -I in cluster environments where users usually lack ability to
modify paranoid. Adding Andi who may have more ideas regarding all that.


PMU isolation is about not making side channels worse. There are normally
already side channels from timing, but it has a degree of noise.

PMU isolation is just to prevent opening side channels with less noise.
But reducing noise is always a trade off, it can never be perfect
and at some point there are diminishing returns.

In general the farther you are from the origin of the noise there
is already more noise. The PMU can reduce the noise, but if it's far
enough away it may not make much difference.

So there are always trade offs with shades of grey, not a black
and white situation. Depending on your security requirements
it may be totally reasonable e.g. to allow the PMU
on the memory controller (which is already very noisy in any case),
but not on the caches.

Or allow it only on the graphics which is already fairly isolated.

So per pmu paranoid settings are a useful concept.


So it seems there is some positive feedback and fine-grained controls 
would be useful for other PMUs in cluster environments.


If we have agreement on that, the question is how to drive this forward? 
Would someone be able to review the patch I've sent, or suggest more 
people to look at it before it could be queued up for merge?


Regards,

Tvrtko




Re: [RFC] perf: Allow fine-grained PMU access control

2018-05-22 Thread Tvrtko Ursulin


On 22/05/2018 13:32, Peter Zijlstra wrote:

On Tue, May 22, 2018 at 10:29:29AM +0100, Tvrtko Ursulin wrote:


On 22/05/18 10:05, Peter Zijlstra wrote:

On Mon, May 21, 2018 at 10:25:49AM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin 

For situations where sysadmins might want to allow different levels of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.


Could you explain how exactly this makes sense?

For example, how does it make sense for one PMU to reveal kernel data
while another PMU is not allowed.

Once you allow one PMU to do so, the secret is out.

So please explain, in excruciating detail, how you want to use this and
how exactly that makes sense from a security pov.


Not sure it will be excruciating but will try to explain once again.

There are two things:

1. i915 PMU which exports data such as different engine busyness levels.
(Perhaps you remember, you helped us implement this from the perf API
angle.)


Right, but I completely forgot everything again.. So thanks for
reminding.


2. Customers who want to look at those stats in production.

They want to use it to answer questions such as:

a) How loaded is my server and can it take one more of X type of job?
b) What is the least utilised video engine to submit the next packet of work
to?
c) What is the least utilised server to schedule the next transcoding job
on?


On the other hand, do those counters provide enough information for a
side-channel (timing) attack on GPGPU workloads? Because, as you say, it
is a shared resource. So if user A is doing GPGPU crypto, and user B is
observing, might he infer things from the counters?


This question would need to be looked at by security experts. And maybe 
it would be best to spawn off that effort separately. Because for me the 
most important question here is whether adding per PMU access control 
makes security worse, better, or is neutral? At the moment I cannot see 
that it makes anything worse, since the real-world alternative is to 
turn all security off. Enabling sysadmins to only relax access to a 
subset of PMU's I think can at worst be neutral. And if it is not 
possible to side-channel everything from anything, then it should be 
better overall security.


In terms of what metrics i915 PMU exposes the current list is this:

1. GPU global counters
1.1 Driver requested frequency and actual GPU frequency
1.2 Time spent in RC6 state
1.3 Interrupt count

2. Per GPU engine counters
2.1 Time spent engine was executing something
2.2 Time spent engine was waiting on semaphores
2.3 Time spent engine was waiting on sync events

In the future we are also considering:

2.4 Number of requests queued / runnable / running


Current option for them is to turn off the global paranoid setting which
then enables unprivileged access to _all_ PMU providers.


Right.


To me it sounded quite logical that it would be better for the paranoid knob
to be more fine-grained, so that they can configure their servers so only
access to needed data is possible.


The proposed semantics are a tad awkward though, the moment you prod at
the sysctl you lose all individual PMU settings. Ideally the per-pmu
would have a special setting that says follow-global in addition to the
existing ones.


Hmm.. possibly follow global makes sense for some use cases, but also I 
do not at the moment see awkwardness in the proposed semantics. The 
master knob should be only touched by sysadmins so any override of 
individual settings is a top-level decision, together with all the 
sub-controls, which is as it should be. If we had follow global, I 
suspect we would still need to have top-level override so it is 
basically a discussion on the richness of the controls.



I am not sure what you mean by "Once you allow one PMU to do so, the
secret is out."? What secret? Are you implying that enabling unprivileged
access to i915 engine busyness data opens up access to CPU PMU's as well via
some side channel?


It was not i915 specific; but if you look at the descriptions:

  * perf event paranoia level:
  *  -1 - not paranoid at all
  *   0 - disallow raw tracepoint access for unpriv
  *   1 - disallow cpu events for unpriv
  *   2 - disallow kernel profiling for unpriv

Then the moment you allow some data to escape, it cannot be put back.
i915 is fairly special in that (afaict) it doesn't leak kernel specific
data

In general I think allowing access to uncore PMUs will leak kernel data.
Thus in general I'm fairly wary of all this.


Yeah, I guess I don't follow this argument since I am not relaxing any 
security criteria. Just adding ability to apply the existing scale per 
individual PMU provider.



Is there no other way to expose this information? Can't we do a
traditional load-avg like thing for the GPU?


We of course could expose the same data in sysfs, or somewhere, and then 
control access to it via the filesystem, but we wanted to avoid 
duplication. Since

Re: [RFC] perf: Allow fine-grained PMU access control

2018-05-22 Thread Tvrtko Ursulin


On 22/05/18 10:05, Peter Zijlstra wrote:

On Mon, May 21, 2018 at 10:25:49AM +0100, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin <tvrtko.ursu...@intel.com>

For situations where sysadmins might want to allow different levels of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.


Could you explain how exactly this makes sense?

For example, how does it make sense for one PMU to reveal kernel data
while another PMU is not allowed.

Once you allow one PMU to do so, the secret is out.

So please explain, in excruciating detail, how you want to use this and
how exactly that makes sense from a security pov.


Not sure it will be excruciating but will try to explain once again.

There are two things:

1. i915 PMU which exports data such as different engine busyness levels. 
(Perhaps you remember, you helped us implement this from the perf API 
angle.)


2. Customers who want to look at those stats in production.

They want to use it to answer questions such as:

a) How loaded is my server and can it take one more of X type of job?
b) What is the least utilised video engine to submit the next packet of 
work to?
c) What is the least utilised server to schedule the next transcoding 
job on?


Current option for them is to turn off the global paranoid setting which 
then enables unprivileged access to _all_ PMU providers.


To me it sounded quite logical that it would be better for the paranoid 
knob to be more fine-grained, so that they can configure their servers 
so only access to needed data is possible.


I am not sure what you mean by "Once you allow one PMU to do so, the 
secret is out."? What secret? Are you implying that enabling 
unprivileged access to i915 engine busyness data opens up access to CPU 
PMU's as well via some side channel?


Regards,

Tvrtko


These work in equivalent fashion as the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
/device//perf_event_paranoid, can be adjusted to achieve
fine grained access control.






[RFC] perf: Allow fine-grained PMU access control

2018-05-21 Thread Tvrtko Ursulin
From: Tvrtko Ursulin 

For situations where sysadmins might want to allow different levels of
access control for different PMUs, we start creating per-PMU
perf_event_paranoid controls in sysfs.

These work in equivalent fashion as the existing perf_event_paranoid
sysctl, which now becomes the parent control for each PMU.

On PMU registration the global/parent value will be inherited by each PMU,
as it will be propagated to all registered PMUs when the sysctl is
updated.

At any later point individual PMU access controls, located in
/device//perf_event_paranoid, can be adjusted to achieve
fine grained access control.

Signed-off-by: Tvrtko Ursulin 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Namhyung Kim 
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/events/intel/bts.c |  2 +-
 arch/x86/events/intel/core.c|  2 +-
 arch/x86/events/intel/p4.c  |  2 +-
 include/linux/perf_event.h  | 18 --
 kernel/events/core.c| 99 +++--
 kernel/sysctl.c |  4 +-
 kernel/trace/trace_event_perf.c |  6 +-
 7 files changed, 105 insertions(+), 28 deletions(-)

diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 24ffa1e88cf9..e416c9e2400a 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -555,7 +555,7 @@ static int bts_event_init(struct perf_event *event)
 * Note that the default paranoia setting permits unprivileged
 * users to profile the kernel.
 */
-   if (event->attr.exclude_kernel && perf_paranoid_kernel() &&
+   if (event->attr.exclude_kernel && perf_paranoid_kernel(event->pmu) &&
!capable(CAP_SYS_ADMIN))
return -EACCES;
 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 707b2a96e516..6b126bdbd16c 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3025,7 +3025,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
if (x86_pmu.version < 3)
return -EINVAL;
 
-   if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+   if (perf_paranoid_cpu(event->pmu) && !capable(CAP_SYS_ADMIN))
return -EACCES;
 
event->hw.config |= ARCH_PERFMON_EVENTSEL_ANY;
diff --git a/arch/x86/events/intel/p4.c b/arch/x86/events/intel/p4.c
index d32c0eed38ca..878451ef1ace 100644
--- a/arch/x86/events/intel/p4.c
+++ b/arch/x86/events/intel/p4.c
@@ -776,7 +776,7 @@ static int p4_validate_raw_event(struct perf_event *event)
 * the user needs special permissions to be able to use it
 */
if (p4_ht_active() && p4_event_bind_map[v].shared) {
-   if (perf_paranoid_cpu() && !capable(CAP_SYS_ADMIN))
+   if (perf_paranoid_cpu(event->pmu) && !capable(CAP_SYS_ADMIN))
return -EACCES;
}
 
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index e71e99eb9a4e..2d9e7b4bcfac 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -271,6 +271,9 @@ struct pmu {
/* number of address filters this PMU can do */
unsigned intnr_addr_filters;
 
+   /* fine grained access control */
+   int perf_event_paranoid;
+
/*
 * Fully disable/enable this PMU, can be used to protect from the PMI
 * as well as for lazy/batch writing of the MSRs.
@@ -1159,6 +1162,9 @@ extern int sysctl_perf_cpu_time_max_percent;
 
 extern void perf_sample_event_took(u64 sample_len_ns);
 
+extern int perf_proc_paranoid_handler(struct ctl_table *table, int write,
+   void __user *buffer, size_t *lenp,
+   loff_t *ppos);
 extern int perf_proc_update_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
@@ -1169,19 +1175,19 @@ extern int perf_cpu_time_max_percent_handler(struct ctl_table *table, int write,
 int perf_event_max_stack_handler(struct ctl_table *table, int write,
 void __user *buffer, size_t *lenp, loff_t *ppos);
 
-static inline bool perf_paranoid_tracepoint_raw(void)
+static inline bool perf_paranoid_tracepoint_raw(const struct pmu *pmu)
 {
-   return sysctl_perf_event_paranoid > -1;
+   return pmu->perf_event_paranoid > -1;
 }
 
-static inline bool perf_paranoid_cpu(void)
+static inline bool perf_paranoid_cpu(const struct pmu *pmu)
 {
-   return sysctl_perf_event_paranoid > 0;
+   return pmu->perf_event_paranoid > 0;
 }
 
-static inline bool perf_paranoid_kernel(void)
+static inline bool perf_paranoid_kernel(const struct pmu *pmu)
 {
-   return sysctl_perf_event_paranoid > 1;
+   return pmu->perf_event_paranoid > 1;
 }
 
 extern void perf_event_i

Re: [Intel-gfx] [PATCH] [v2] drm/i915/pmu: avoid -Wmaybe-uninitialized warning

2018-03-14 Thread Tvrtko Ursulin


On 13/03/2018 20:10, Arnd Bergmann wrote:

On Tue, Mar 13, 2018 at 6:46 PM, Tvrtko Ursulin
<tvrtko.ursu...@linux.intel.com> wrote:


On 13/03/2018 16:19, Arnd Bergmann wrote:


The conditional spinlock confuses gcc into thinking the 'flags' value
might contain uninitialized data:

drivers/gpu/drm/i915/i915_pmu.c: In function '__i915_pmu_event_read':
arch/x86/include/asm/paravirt_types.h:573:3: error: 'flags' may be used
uninitialized in this function [-Werror=maybe-uninitialized]



Hm, how does paravirt_types.h come into the picture?


spin_unlock_irqrestore() calls arch_local_irq_restore()


The code is correct, but it's easy to see how the compiler gets confused
here. This avoids the problem by pulling the lock outside of the function
into its only caller.



Is it a specific gcc version, specific options, or specific kernel config that
this happens with?


Not gcc version specific (same result with gcc-4.9 through 8, didn't test
earlier versions that are currently broken).


Strange that it hasn't been seen so far.


It seems to be a relatively rare 'randconfig' combination. Looking at
the preprocessed sources, I find:

static u64 get_rc6(struct drm_i915_private *i915, bool locked)
{

  unsigned long flags;
  u64 val;

  if (intel_runtime_pm_get_if_in_use(i915)) {
   val = __get_rc6(i915);
   intel_runtime_pm_put(i915);
   if (!locked)
do { do { ({ unsigned long __dummy; typeof(flags) __dummy2;
(void)(&__dummy == &__dummy2); 1; }); do { do { do { ({ unsigned long
__dummy; typeof(flags) __dummy2; (void)(&__dummy == &__dummy2); 1; });
flags = arch_local_irq_save(); } while (0); trace_hardirqs_off(); }
while (0); do { __asm__ __volatile__("": : :"memory"); do { (void)0;
(void)(spinlock_check(&i915->pmu.lock)); } while (0); } while (0); }
while (0); } while (0); } while (0);

   if (val >= i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur) {
i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur = 0;
i915->pmu.sample[__I915_SAMPLE_RC6].cur = val;
   } else {
val = i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur;
   }
   if (!locked)
spin_unlock_irqrestore(&i915->pmu.lock, flags);
  } else {
   struct pci_dev *pdev = i915->drm.pdev;
   struct device *kdev = &pdev->dev;
   unsigned long flags2;
# 455 "/git/arm-soc/drivers/gpu/drm/i915/i915_pmu.c"
   if (!locked)
do { do { ({ unsigned long __dummy; typeof(flags) __dummy2;
(void)(&__dummy == &__dummy2); 1; }); do { do { do { ({ unsigned long
__dummy; typeof(flags) __dummy2; (void)(&__dummy == &__dummy2); 1; });
flags = arch_local_irq_save(); } while (0); trace_hardirqs_off(); }
while (0); do { __asm__ __volatile__("": : :"memory"); do { (void)0;
(void)(spinlock_check(&i915->pmu.lock)); } while (0); } while (0); }
while (0); } while (0); } while (0);

   do { do { ({ unsigned long __dummy; typeof(flags2) __dummy2;
(void)(&__dummy == &__dummy2); 1; }); do { do { do { ({ unsigned long
__dummy; typeof(flags2) __dummy2; (void)(&__dummy == &__dummy2); 1;
}); flags2 = arch_local_irq_save(); } while (0); trace_hardirqs_off();
} while (0); do { __asm__ __volatile__("": : :"memory"); do { (void)0;
(void)(spinlock_check(&kdev->power.lock)); } while (0); } while (0); }
while (0); } while (0); } while (0);

   if (!i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur)
i915->pmu.suspended_jiffies_last =
   kdev->power.suspended_jiffies;

   val = kdev->power.suspended_jiffies -
 i915->pmu.suspended_jiffies_last;
   val += jiffies - kdev->power.accounting_timestamp;

   spin_unlock_irqrestore(&kdev->power.lock, flags2);

   val = jiffies_to_nsecs(val);
   val += i915->pmu.sample[__I915_SAMPLE_RC6].cur;
   i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur = val;

   if (!locked)
spin_unlock_irqrestore(&i915->pmu.lock, flags);
  }
   return val;
}

so it seems that the spin_lock_irqsave() is completely inlined through
a macro while the unlock is not, and the lock contains a memory barrier
(among other things) that might tell the compiler that the state of the
'locked' flag could change underneath it.
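
A minimal self-contained version of that pattern looks like the snippet below
(not the i915 code; whether the warning actually fires depends on gcc version,
options and inlining): 'flags' is written and read under the same '!locked'
test, but the barrier in between is enough for some configurations to lose
track of that.

static unsigned long fake_irq_save(void)
{
	return 0x200;				/* pretend to save IRQ flags */
}

static void fake_irq_restore(unsigned long flags)
{
	(void)flags;				/* pretend to restore them */
}

unsigned long read_counter(const unsigned long *counter, int locked)
{
	unsigned long flags;
	unsigned long val;

	if (!locked)
		flags = fake_irq_save();

	__asm__ __volatile__("" : : : "memory");	/* barrier, as hidden in the lock macro */
	val = *counter;

	if (!locked)
		fake_irq_restore(flags);	/* some gcc configs warn 'flags' may be uninitialized */

	return val;
}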


Ha, interesting. So it sounds more like us having to work around a bug in 
the paravirt spinlock macros.


I think I would prefer a different solution, where we don't end up doing 
MMIO under irqsave spinlock. I'll send a patch.


Regards,

Tvrtko



It could also be the problem that arch_local_irq_restore() uses
__builtin_expect() in  PVOP_TEST_NULL(op) when
CONFIG_PARAVIRT_DEBUG is enabled, see

static inline __attribute__((unused))
__attribute__((no_instrument_function))
__attribute__((no_instrument_function)) void
arch_local_irq_restore(unsigned long f)
{
  ({ unsigned long __eax = __eax, __edx = __edx, __ecx = __ecx;; do {
if (__builtin_expect(!!(pv_irq_ops.restore_fl.func == ((void *)0)),
0)) do { do { asm volatile("1:\t" ".byte 0x0f, 0x0b" "\n"
".pushsection 

Re: [Intel-gfx] [PATCH] [v2] drm/i915/pmu: avoid -Wmaybe-uninitialized warning

2018-03-13 Thread Tvrtko Ursulin


On 13/03/2018 16:19, Arnd Bergmann wrote:

The conditional spinlock confuses gcc into thinking the 'flags' value
might contain uninitialized data:

drivers/gpu/drm/i915/i915_pmu.c: In function '__i915_pmu_event_read':
arch/x86/include/asm/paravirt_types.h:573:3: error: 'flags' may be used 
uninitialized in this function [-Werror=maybe-uninitialized]


Hm, how does paravirt_types.h come into the picture?


The code is correct, but it's easy to see how the compiler gets confused
here. This avoids the problem by pulling the lock outside of the function
into its only caller.


Is it a specific gcc version, specific options, or specific kernel config 
that this happens with?


Strange that it hasn't been seen so far.

Regards,

Tvrtko


Fixes: 1fe699e30113 ("drm/i915/pmu: Fix sleep under atomic in RC6 readout")
Signed-off-by: Arnd Bergmann 
---
v2: removed unused function argument, fixed 'break' statement.
---
  drivers/gpu/drm/i915/i915_pmu.c | 25 ++---
  1 file changed, 10 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_pmu.c b/drivers/gpu/drm/i915/i915_pmu.c
index 4bc7aefa9541..d6b9b6b5fb98 100644
--- a/drivers/gpu/drm/i915/i915_pmu.c
+++ b/drivers/gpu/drm/i915/i915_pmu.c
@@ -412,10 +412,9 @@ static u64 __get_rc6(struct drm_i915_private *i915)
return val;
  }
  
-static u64 get_rc6(struct drm_i915_private *i915, bool locked)

+static u64 get_rc6(struct drm_i915_private *i915)
  {
  #if IS_ENABLED(CONFIG_PM)
-   unsigned long flags;
u64 val;
  
  	if (intel_runtime_pm_get_if_in_use(i915)) {

@@ -428,18 +427,12 @@ static u64 get_rc6(struct drm_i915_private *i915, bool locked)
 * previously.
 */
  
-		if (!locked)

-   spin_lock_irqsave(&i915->pmu.lock, flags);
-
if (val >= i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur) {
i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur = 0;
i915->pmu.sample[__I915_SAMPLE_RC6].cur = val;
} else {
val = i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur;
}
-
-   if (!locked)
-   spin_unlock_irqrestore(&i915->pmu.lock, flags);
} else {
struct pci_dev *pdev = i915->drm.pdev;
struct device *kdev = &pdev->dev;
@@ -452,9 +445,6 @@ static u64 get_rc6(struct drm_i915_private *i915, bool locked)
 * on top of the last known real value, as the approximated RC6
 * counter value.
 */
-   if (!locked)
-   spin_lock_irqsave(&i915->pmu.lock, flags);
-
spin_lock_irqsave(&kdev->power.lock, flags2);
  
  		if (!i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur)

@@ -470,9 +460,6 @@ static u64 get_rc6(struct drm_i915_private *i915, bool locked)
val = jiffies_to_nsecs(val);
val += i915->pmu.sample[__I915_SAMPLE_RC6].cur;
i915->pmu.sample[__I915_SAMPLE_RC6_ESTIMATED].cur = val;
-
-   if (!locked)
-   spin_unlock_irqrestore(&i915->pmu.lock, flags);
}
  
  	return val;

@@ -519,7 +506,15 @@ static u64 __i915_pmu_event_read(struct perf_event *event, bool locked)
val = count_interrupts(i915);
break;
case I915_PMU_RC6_RESIDENCY:
-   val = get_rc6(i915, locked);
+   if (!locked) {
+   unsigned long flags;
+
+   spin_lock_irqsave(&i915->pmu.lock, flags);
+   val = get_rc6(i915);
+   spin_unlock_irqrestore(&i915->pmu.lock, flags);
+   } else {
+   val = get_rc6(i915);
+   }
break;
}
}



Re: [PATCH 1/6] lib/scatterlist: Tidy types and fix overflow checking in sgl_alloc_order

2018-03-09 Thread Tvrtko Ursulin


On 07/03/18 17:06, Tvrtko Ursulin wrote:

On 07/03/18 16:10, Bart Van Assche wrote:

On Wed, 2018-03-07 at 12:47 +, Tvrtko Ursulin wrote:
sgl_alloc_order explicitly takes a 64-bit length (unsigned long long) but
then rejects it in overflow checking if a greater than 4GiB allocation was
requested. This is a consequence of using unsigned int for the right hand
side condition, which then naturally overflows when shifted left, earlier
than nent otherwise would.

Fix is to promote the right hand side of the conditional to unsigned long.


Agreed.
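
To make the overflow concrete, here is a small standalone demonstration (the
check shape is an assumption modelled on the description above, not the exact
lib/scatterlist.c code): with the right hand side computed in 32 bits a
representable 6 GiB request is wrongly rejected, while promoting it to 64 bits
accepts it.

#include <stdio.h>

int main(void)
{
	unsigned long long length = 6ULL << 30;		/* a 6 GiB request */
	unsigned int page_shift = 12, order = 0;
	unsigned int nent = length >> (page_shift + order);

	/* Buggy shape: the shift is computed in 32 bits and wraps. */
	int rejected_buggy = length > (nent << (page_shift + order));
	/* Fixed shape: promote the right hand side before shifting. */
	int rejected_fixed = length > ((unsigned long long)nent << (page_shift + order));

	printf("buggy check rejects: %d, fixed check rejects: %d\n",
	       rejected_buggy, rejected_fixed);
	return 0;
}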


It is also not useful to allow for a 64-bit length on 32-bit platforms, so
I have changed this type to a natural unsigned long. Like this it changes
size naturally depending on the architecture.


I do not agree. Although uncommon, it is possible that e.g. a SCSI initiator
sends a transfer of more than 4 GB to a target system and that that transfer
must not be split. Since this code is used by the SCSI target, I think that's
an example of an application where it is useful to allow allocations of more
than 4 GB at once on a 32-bit system.


If it can work on 32-bit (it can DMA from highmem or what?) and 
allocation can realistically succeed then  I'm happy to defer to storage 
experts on this one.


Furthermore on this specific point, the only caller of sgl_alloc_order 
in the tree passes in u32 for length.


So even if there will be some realistic use case for >4GiB allocations 
on 32-bit systems (where unsigned long is 32-bit, to be more precise) I 
don't see a problem with my patch right now.


First five patches from the series are all IMO fixes and cleanup of 
unused parts of the API.


Regards,

Tvrtko




2.

elem_len should not be explicitly sized u32 but unsigned int, to match
the underlying struct scatterlist nents type. Same for the nent_p output
parameter type.


Are you sure it is useful to support allocations with an order that exceeds
(31 - PAGE_SHIFT)? Since memory gets fragmented easily in the Linux kernel I
think that it's unlikely that such allocations will succeed.


Not sure what you are getting at here.

There are no explicit width types anywhere in the SGL API apart from this 
u32 elem_len.


So I changed it to unsigned int not to confuse. It gets passed in to 
sg_set_page which takes unsigned int. So no reason for it to be u32.





I renamed this to chunk_len and consolidated its use throughout the
function.


Please undo this change such that the diff remains as short as possible.


Name change only? Yeah can do that. Even though chunk as a term is 
somewhat established elsewhere in lib/scatterlist.c.



-void sgl_free_n_order(struct scatterlist *sgl, int nents, int order)
+void sgl_free_n_order(struct scatterlist *sgl, unsigned int nents,
+  unsigned int order)
  {
  struct scatterlist *sg;
  struct page *page;
-    int i;
+    unsigned int i;
  for_each_sg(sgl, sg, nents, i) {
  if (!sg)
@@ -583,9 +587,9 @@ EXPORT_SYMBOL(sgl_free_n_order);
   * @sgl: Scatterlist with one or more elements
   * @order: Second argument for __free_pages()
   */
-void sgl_free_order(struct scatterlist *sgl, int order)
+void sgl_free_order(struct scatterlist *sgl, unsigned int order)
  {
-    sgl_free_n_order(sgl, INT_MAX, order);
+    sgl_free_n_order(sgl, UINT_MAX, order);
  }
  EXPORT_SYMBOL(sgl_free_order);


Do you have an application that calls these functions to allocate more than
INT_MAX * PAGE_SIZE bytes at once? If not, please leave these changes out.


There is no reason to use signed int here and it is even inconsistent 
with itself because sgl_alloc_order returns you nents in an unsigned 
int. And sg_init_table takes unsigned int for nents. So really I see no 
reason to have signed types for nents on the sgl_free side of the API.


Regards,

Tvrtko


Re: [PATCH 6/6] lib/scatterlist: Drop order argument from sgl_free_n_order

2018-03-08 Thread Tvrtko Ursulin


On 08/03/18 15:56, Bart Van Assche wrote:

On Thu, 2018-03-08 at 07:59 +, Tvrtko Ursulin wrote:

However there is a different bug in my patch relating to the last entry,
which can have a shorter length than the rest. So get_order on the last
entry is incorrect - I have to store the deduced order and carry it over.


Will that work if there is only one entry in the list and if it is a short
entry?


Yeah, needs more work. I especially don't like that case (as in any 
other with a final short chunk) wasting memory. So it would need more 
refactoring to make it possible.


It did work in my internal tree where sgl_alloc_order was extended to 
become sgl_alloc_order_min_max, and as such uses a smaller order for 
smaller chunks.


This patch can be dropped for now but the earlier ones are still valid I 
think. On those I think we have some open questions on how to proceed, so if 
you could reply there, where applicable, that would be great.


Regards,

Tvrtko


Re: [PATCH 0/6] lib/scatterlist: Small SGL tidy

2018-03-08 Thread Tvrtko Ursulin


On 07/03/18 18:38, Bart Van Assche wrote:

On Wed, 2018-03-07 at 12:47 +, Tvrtko Ursulin wrote:

I spotted a few small issues in the recent added SGL code so am sending some
patches to tidy this.


Can you send the fixes as a separate series and keep the rework / behavior 
changes
for later such that Jens can queue the fixes for kernel v4.16?


Yes of course. But which ones do you consider rework and behaviour changes? 
To me they all look like fixes / cleanups. (The last patch in the series needs 
a fix/respin btw.)


Regards,

Tvrtko



Re: [PATCH 6/6] lib/scatterlist: Drop order argument from sgl_free_n_order

2018-03-08 Thread Tvrtko Ursulin

Hi,

On 07/03/18 18:30, James Bottomley wrote:

On Wed, 2018-03-07 at 12:47 +, Tvrtko Ursulin wrote:

From: Tvrtko Ursulin <tvrtko.ursu...@intel.com>


Firstly, I don't see any justifiable benefit to churning this API, so
why bother? But secondly this:


Primarily because I wanted to extend sgl_alloc_order slightly in order 
to be able to use it from i915. And then in the process I noticed a couple 
of bugs in the implementation, type inconsistencies and unused exported 
symbols. That gave me a feeling the API could actually use a bit of work.



We can derive the order from sg->length and so do not need to pass it
in explicitly.


Is wrong.  I can have a length 2 scatterlist that crosses a page
boundary, but I can also have one within a single page, so the order
cannot be deduced from the length.


sgl_alloc_order never does this.

However there is a different bug in my patch relating to the last entry, 
which can have a shorter length than the rest. So get_order on the last 
entry is incorrect - I have to store the deduced order and carry it over.


In which case it may even make sense to refactor sgl_alloc_order a bit 
more to avoid wastage on the last entry with high order allocations.


Regards,

Tvrtko


Re: [PATCH 2/6] lib/scatterlist: Skip requesting zeroed allocations in sgl_alloc_order

2018-03-07 Thread Tvrtko Ursulin


On 07/03/18 15:35, Andy Shevchenko wrote:

On Wed, Mar 7, 2018 at 2:47 PM, Tvrtko Ursulin <tursu...@ursulin.net> wrote:


+   sgl = kmalloc_array(nent, sizeof(struct scatterlist), (gfp & ~GFP_DMA));


The parens now become redundant.


True, thanks! I am also not sure that re-using the same gfp_t for 
metadata and backing store is the right approach. At least I think 
__GFP_HIGHMEM also needs to be cleared for kmalloc.
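
Something along these lines is what I have in mind, as a sketch only (assumed
flag handling and a made-up helper name, not a final patch): keep the caller's
gfp for the backing-store pages but strip the bits that make no sense for the
kmalloc'ed metadata array.

#include <linux/scatterlist.h>
#include <linux/slab.h>

/* Hypothetical helper: allocate only the scatterlist metadata array. */
static struct scatterlist *sgl_alloc_metadata(unsigned int nent, gfp_t gfp)
{
	/* Highmem and DMA-zone placement only make sense for the backing pages. */
	gfp_t meta_gfp = gfp & ~(__GFP_HIGHMEM | __GFP_DMA);

	return kmalloc_array(nent, sizeof(struct scatterlist), meta_gfp);
}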


Regards,

Tvrtko



Re: [PATCH 6/6] lib/scatterlist: Drop order argument from sgl_free_n_order

2018-03-07 Thread Tvrtko Ursulin


On 07/03/18 16:23, Bart Van Assche wrote:

On Wed, 2018-03-07 at 12:47 +, Tvrtko Ursulin wrote:

We can derive the order from sg->length and so do not need to pass it in
explicitly. Rename the function to sgl_free_n.


Using get_order() to compute the order looks fine to me but this patch will
have to be rebased in order to address the comments on the previous patches.


Ok I guess my main questions are the ones from the cover letter - where 
is this API going and why did it get in a bit of a funky state? Because 
it doesn't look fully thought through and tested to me.


My motivation is that I would like to extend it to add 
sgl_alloc_order_min_max, which takes min order and max order, and 
allocates as large chunks as it can given those constraints. This is 
something we have in i915 and could then drop our implementation and use 
the library function.
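
The core of that idea, as a rough sketch (sgl_alloc_order_min_max does not 
exist upstream and alloc_chunk here is a made-up helper), is to try the 
largest permitted order first and step down towards the minimum when 
higher-order pages are not available:

#include <linux/gfp.h>
#include <linux/mm.h>

/*
 * Hypothetical chunk allocator for an sgl_alloc_order_min_max() style API;
 * assumes min_order <= max_order.
 */
static struct page *alloc_chunk(unsigned int min_order, unsigned int max_order,
				gfp_t gfp, unsigned int *order_out)
{
	unsigned int order;

	for (order = max_order; ; order--) {
		struct page *page = alloc_pages(gfp | __GFP_NOWARN, order);

		if (page) {
			*order_out = order;
			return page;
		}
		if (order == min_order)
			return NULL;		/* even the smallest allowed order failed */
	}
}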


But I also wanted to refactor sgl_alloc_order to benefit from the 
existing chained struct scatterlist allocator. However, the SGL API does not 
embed into struct sg_table, nor does it explicitly carry the number of 
nents allocated, making it impossible to correctly free with the existing 
sg_free_table.


Another benefit of using the existing sg allocator would be that for 
large allocations you don't depend on the availability of contiguous 
chunks like you do with kmalloc_array.


For instance, in another reply you mentioned 4GiB allocations are a 
possibility. If you use order 0 that means you need 1M nents, which can 
be something like 32 bytes each, and you need a 32MiB kmalloc for the 
nents array and that's quite big. If you would be able to reuse the 
existing sg_alloc_table infrastructure (I have patches which extract it 
if you don't want to deal with struct sg_table), you would benefit from 
PAGE_SIZE allocations.
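
Spelling that arithmetic out (standalone C; the 32 byte entry size is a rough 
assumption, the real sizeof(struct scatterlist) depends on config):

#include <stdio.h>

int main(void)
{
	unsigned long long total = 4ULL << 30;	/* 4 GiB of backing store */
	unsigned long page_size = 4096;		/* order-0 chunk size */
	unsigned long sg_entry = 32;		/* rough sizeof(struct scatterlist) */
	unsigned long long nents = total / page_size;

	/* Prints: nents = 1048576, metadata = 32 MiB */
	printf("nents = %llu, metadata = %llu MiB\n",
	       nents, (nents * sg_entry) >> 20);
	return 0;
}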


Also I am not sure if a single gfp argument to sgl_alloc_order is the 
right thing to do. I have a feeling you either need to ignore it for 
kmalloc_array, or pass in two gfp_t arguments to be used for metadata 
and backing storage respectively.


So I have many questions regarding the current state and future 
direction, but essentially would like to make it usable for other 
drivers, like i915, as well.


Regards,

Tvrtko


Re: [PATCH 4/6] lib/scatterlist: Unexport some trivial wrappers

2018-03-07 Thread Tvrtko Ursulin


On 07/03/18 16:19, Bart Van Assche wrote:

On Wed, 2018-03-07 at 12:47 +, Tvrtko Ursulin wrote:

Save some kernel size by moving trivial wrappers to header as static
inline instead of exporting symbols for them.
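
For illustration, the shape of the change is simply the before/after below (a 
sketch based on the existing one-line sgl_free() wrapper; the two definitions 
are shown side by side, not as one compilable unit):

#include <linux/scatterlist.h>

/* Before: a one-line wrapper in lib/scatterlist.c costing an exported symbol. */
void sgl_free(struct scatterlist *sgl)
{
	sgl_free_order(sgl, 0);
}
EXPORT_SYMBOL(sgl_free);

/* After: the same wrapper as a static inline in include/linux/scatterlist.h. */
static inline void sgl_free(struct scatterlist *sgl)
{
	sgl_free_order(sgl, 0);
}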


Something that you may be unaware of is that the introduction of the sgl
helper functions is only a first step. The next step will be to introduce
a caching allocator for sg-lists. So for small sg-lists inlining won't
help performance. But moving these definitions from a .c file into a .h
file will (slightly) slow down kernel compilation. So I'd prefer that you
drop this patch.


Question is how will the future work influence these trivial wrappers?

I wasn't suggesting I removed them for performance reasons, but just 
because they are really trivial and so there is no need right now to 
have them as exported symbols.


And actually, in some earlier work I did in lib/scatterlist.c, Andrew 
Morton complained a bit about the prevalence of these trivial 
wrappers. So I even had plans to remove some of the existing ones but 
never got round to it.


Regards,

Tvrtko

