[PATCH v5] tools/perf: Fix the mask in regs_dump__printf and print_sample_iregs

2016-06-22 Thread Madhavan Srinivasan
When decoding the perf_regs mask in regs_dump__printf(),
we loop through the mask using find_first_bit and find_next_bit functions.
"mask" is of type "u64", but sent as a "unsigned long *" to
lib functions along with sizeof().

While the existing code works fine in most cases,
the logic is broken when using a 32bit perf on a 64bit kernel (Big Endian).
When reading a u64 using (u32 *)(&mask)[0], perf (lib/find_*_bit()) assumes it gets
the lower 32 bits of the u64, which is wrong. The proposed fix is to swap the words
of the u64 to handle this case. This is _not_ an endianness swap.
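
For example, a minimal sketch of the intended word order (assuming a 32-bit
"unsigned long", as in 32bit userspace; values are made up for illustration):

	u64 mask = 0x100000002ULL;		/* bit 1 and bit 32 set */
	unsigned long dst[2];

	dst[0] = mask & ULONG_MAX;		/* low word:  bits 0..31  */
	if (sizeof(mask) > sizeof(unsigned long))
		dst[1] = mask >> 32;		/* high word: bits 32..63 */

	/* A plain (unsigned long *)&mask cast would instead see the high
	 * word at index 0 on a 32bit big endian system. */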

Suggested-by: Yury Norov 
Cc: Yury Norov 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Adrian Hunter 
Cc: Kan Liang 
Cc: Wang Nan 
Cc: Michael Ellerman 
Signed-off-by: Madhavan Srinivasan 
---
Changelog v4:
1) Removed the new macro and reused DECLARE_BITMAP

Changelog v3:
1)Moved the swap function to lib/bitmap.c
2)Added a macro for declaration
3)Added the comments

Changelog v2:
1)Moved the swap code to a common function
2)Added more comments in the code

Changelog v1:
1)updated commit message and patch subject
2)Add the fix to print_sample_iregs() in builtin-script.c

 tools/include/linux/bitmap.h |  2 ++
 tools/lib/bitmap.c   | 18 ++
 tools/perf/builtin-script.c  |  4 +++-
 tools/perf/util/session.c|  4 +++-
 4 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index 28f5493da491..5e98525387dc 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -2,6 +2,7 @@
 #define _PERF_BITOPS_H
 
 #include 
+#include 
 #include 
 
 #define DECLARE_BITMAP(name,bits) \
@@ -10,6 +11,7 @@
 int __bitmap_weight(const unsigned long *bitmap, int bits);
 void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
 const unsigned long *bitmap2, int bits);
+void bitmap_from_u64(unsigned long *dst, u64 mask);
 
 #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
 
diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
index 0a1adcfd..464a0cc63e6a 100644
--- a/tools/lib/bitmap.c
+++ b/tools/lib/bitmap.c
@@ -29,3 +29,21 @@ void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
for (k = 0; k < nr; k++)
dst[k] = bitmap1[k] | bitmap2[k];
 }
+
+/*
+ * bitmap_from_u64 - Check and swap words within u64.
+ *  @mask: source bitmap
+ *  @dst:  destination bitmap
+ *
+ * In 32 bit big endian userspace on a 64bit kernel, 'unsigned long' is 32 bits.
+ * When reading u64 using (u32 *)(&mask)[0] and (u32 *)(&mask)[1],
+ * we will get wrong value for the mask. That is "(u32 *)(&mask)[0]"
+ * gets upper 32 bits of u64, but perf may expect lower 32bits of u64.
+ */
+void bitmap_from_u64(unsigned long *dst, u64 mask)
+{
+   dst[0] = mask & ULONG_MAX;
+
+   if (sizeof(mask) > sizeof(unsigned long))
+   dst[1] = mask >> 32;
+}
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index e3ce2f34d3ad..132a4bb70c31 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -412,11 +412,13 @@ static void print_sample_iregs(struct perf_sample *sample,
struct regs_dump *regs = &sample->intr_regs;
uint64_t mask = attr->sample_regs_intr;
unsigned i = 0, r;
+   DECLARE_BITMAP(_mask, 64);
 
if (!regs)
return;
 
-   for_each_set_bit(r, (unsigned long *) &mask, sizeof(mask) * 8) {
+   bitmap_from_u64(_mask, mask);
+   for_each_set_bit(r, _mask, sizeof(mask) * 8) {
u64 val = regs->regs[i++];
printf("%5s:0x%"PRIx64" ", perf_reg_name(r), val);
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 5214974e841a..3576bf13 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -940,8 +940,10 @@ static void branch_stack__printf(struct perf_sample *sample)
 static void regs_dump__printf(u64 mask, u64 *regs)
 {
unsigned rid, i = 0;
+   DECLARE_BITMAP(_mask, 64);
 
-   for_each_set_bit(rid, (unsigned long *) &mask, sizeof(mask) * 8) {
+   bitmap_from_u64(_mask, mask);
+   for_each_set_bit(rid, _mask, sizeof(mask) * 8) {
u64 val = regs[i++];
 
printf(" %-5s 0x%" PRIx64 "\n",
-- 
1.9.1

Re: [PATCH v3 3/9] kexec_file: Factor out kexec_locate_mem_hole from kexec_add_buffer.

2016-06-22 Thread Dave Young


- Original Message -
From: "Dave Young" 
To: "Thiago Jung Bauermann" 
Cc: linuxppc-dev@lists.ozlabs.org, ke...@lists.infradead.org, 
linux-ker...@vger.kernel.org, "Eric Biederman" 
Sent: Thursday, June 23, 2016 10:30:52 AM
Subject: Re: [PATCH v3 3/9] kexec_file: Factor out kexec_locate_mem_hole from 
kexec_add_buffer.

On 06/22/16 at 08:34pm, Thiago Jung Bauermann wrote:
> Am Mittwoch, 22 Juni 2016, 18:18:01 schrieb Dave Young:
> > On 06/21/16 at 04:48pm, Thiago Jung Bauermann wrote:
> > > +/**
> > > + * kexec_locate_mem_hole - find free memory to load segment or use in purgatory
> > > + * @image:    kexec image being updated.
> > > + * @size:     Memory size.
> > > + * @align:    Minimum alignment needed.
> > > + * @min_addr: Minimum starting address.
> > > + * @max_addr: Maximum end address.
> > > + * @top_down: Find the highest free memory region?
> > > + * @addr:     On success, will have start address of the memory region found.
> > > + *
> > > + * Return: 0 on success, negative errno on error.
> > > + */
> > > +int kexec_locate_mem_hole(struct kimage *image, unsigned long size,
> > > +   unsigned long align, unsigned long min_addr,
> > > +   unsigned long max_addr, bool top_down,
> > > +   unsigned long *addr)
> > > +{
> > > + int ret;
> > > + struct kexec_buf buf;
> > > +
> > > + memset(&buf, 0, sizeof(struct kexec_buf));
> > > + buf.image = image;
> > > +
> > > + buf.memsz = size;
> > > + buf.buf_align = align;
> > > + buf.buf_min = min_addr;
> > > + buf.buf_max = max_addr;
> > > + buf.top_down = top_down;
> > 
> > Since patch 2/9 moved kexec_buf from the internal header file to kexec.h, it
> > will be natural to pass a kexec_buf pointer instead of passing all
> > these arguments to kexec_locate_mem_hole.
> > 
> > kbuf.mem can be used for addr.
> 
> Ok. What about this version?
> -- 
> []'s
> Thiago Jung Bauermann
> IBM Linux Technology Center
> 
> 
> Subject: [PATCH 3/9] kexec_file: Factor out kexec_locate_mem_hole from
>  kexec_add_buffer.
> 
> kexec_locate_mem_hole will be used by the PowerPC kexec_file_load
> implementation to find free memory for the purgatory stack.
> 
> Signed-off-by: Thiago Jung Bauermann 
> Cc: Eric Biederman 
> Cc: Dave Young 
> Cc: ke...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  include/linux/kexec.h | 12 +---
>  kernel/kexec_file.c   | 25 -
>  2 files changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 3d91bcfc180d..e8b099da47f5 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -147,9 +147,14 @@ struct kexec_file_ops {
>  #endif
>  };
>  
> -/*
> - * Keeps track of buffer parameters as provided by caller for requesting
> - * memory placement of buffer.
> +/**
> + * struct kexec_buf - parameters for finding a place for a buffer in memory
> + * @image:   kexec image in which memory to search.
> + * @size:     Memory size for the buffer.
> + * @align:    Minimum alignment needed.
> + * @min_addr: Minimum starting address.
> + * @max_addr: Maximum end address.
> + * @top_down: Find the highest free memory region?

Hmm, hold on. For declaring a struct in a header file, the comment should be
just after each field, like below; your format is for a function instead:
struct pci_slot {
struct pci_bus *bus;            /* The bus this slot is on */
struct list_head list;          /* node in list of slots on this bus */
struct hotplug_slot *hotplug;   /* Hotplug info (migrate over time) */
unsigned char number;           /* PCI_SLOT(pci_dev->devfn) */
struct kobject kobj;
};

BTW, what is @size? There's no size field in kexec_buf. I think it is not
necessary to add these comments; they are easy to understand. If you really
want, please rewrite them correctly. For example, the "image" description is
wrong: it is not only for searching memory. The top_down description is also bad.

Thanks
Dave

Re: [PATCH v4] tools/perf: Fix the mask in regs_dump__printf and print_sample_iregs

2016-06-22 Thread Madhavan Srinivasan


On Thursday 23 June 2016 10:48 AM, Yury Norov wrote:
> On Thu, Jun 23, 2016 at 10:31:16AM +0530, Madhavan Srinivasan wrote:
>> When decoding the perf_regs mask in regs_dump__printf(),
>> we loop through the mask using find_first_bit and find_next_bit functions.
>> "mask" is of type "u64", but sent as a "unsigned long *" to
>> lib functions along with sizeof().
>>
>> While the existing code works fine in most cases,
>> the logic is broken when using a 32bit perf on a 64bit kernel (Big Endian).
>> When reading a u64 using (u32 *)(&mask)[0], perf (lib/find_*_bit()) assumes it gets
>> the lower 32 bits of the u64, which is wrong. The proposed fix is to swap the words
>> of the u64 to handle this case. This is _not_ an endianness swap.
>>
>> Suggested-by: Yury Norov 
>> Cc: Yury Norov 
>> Cc: Peter Zijlstra 
>> Cc: Ingo Molnar 
>> Cc: Arnaldo Carvalho de Melo 
>> Cc: Alexander Shishkin 
>> Cc: Jiri Olsa 
>> Cc: Adrian Hunter 
>> Cc: Kan Liang 
>> Cc: Wang Nan 
>> Cc: Michael Ellerman 
>> Signed-off-by: Madhavan Srinivasan 
>> ---
>> Changelog v3:
>> 1)Moved the swap function to lib/bitmap.c
>> 2)Added a macro for declaration
>> 3)Added the comments
>>
>> Changelog v2:
>> 1)Moved the swap code to a common function
>> 2)Added more comments in the code
>>
>> Changelog v1:
>> 1)updated commit message and patch subject
>> 2)Add the fix to print_sample_iregs() in builtin-script.c
>>
>>  tools/include/linux/bitmap.h |  5 +
>>  tools/lib/bitmap.c   | 18 ++
>>  tools/perf/builtin-script.c  |  4 +++-
>>  tools/perf/util/session.c|  4 +++-
>>  4 files changed, 29 insertions(+), 2 deletions(-)
>>
>> diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
>> index 28f5493da491..6be9a7ddcb03 100644
>> --- a/tools/include/linux/bitmap.h
>> +++ b/tools/include/linux/bitmap.h
>> @@ -2,14 +2,19 @@
>>  #define _PERF_BITOPS_H
>>  
>>  #include 
>> +#include 
>>  #include 
>>  
>>  #define DECLARE_BITMAP(name,bits) \
>>  unsigned long name[BITS_TO_LONGS(bits)]
>>  
>> +#define DECLARE_U64_BITMAP(__name) \
>> +unsigned long __name[sizeof(u64)/sizeof(unsigned long)]
>> +
>>  int __bitmap_weight(const unsigned long *bitmap, int bits);
>>  void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
>>   const unsigned long *bitmap2, int bits);
>> +void bitmap_from_u64(unsigned long *dst, u64 mask);
>>  
>>  #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
>>  
>> diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
>> index 0a1adcfd..464a0cc63e6a 100644
>> --- a/tools/lib/bitmap.c
>> +++ b/tools/lib/bitmap.c
>> @@ -29,3 +29,21 @@ void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
>>  for (k = 0; k < nr; k++)
>>  dst[k] = bitmap1[k] | bitmap2[k];
>>  }
>> +
>> +/*
>> + * bitmap_from_u64 - Check and swap words within u64.
>> + *  @mask: source bitmap
>> + *  @dst:  destination bitmap
>> + *
>> + * In 32 bit big endian userspace on a 64bit kernel, 'unsigned long' is 32 bits.
>> + * When reading u64 using (u32 *)(&mask)[0] and (u32 *)(&mask)[1],
>> + * we will get wrong value for the mask. That is "(u32 *)(&mask)[0]"
>> + * gets upper 32 bits of u64, but perf may expect lower 32bits of u64.
>> + */
>> +void bitmap_from_u64(unsigned long *dst, u64 mask)
>> +{
>> +dst[0] = mask & ULONG_MAX;
>> +
>> +if (sizeof(mask) > sizeof(unsigned long))
>> +dst[1] = mask >> 32;
>> +}
>> diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
>> index e3ce2f34d3ad..1120ca117071 100644
>> --- a/tools/perf/builtin-script.c
>> +++ b/tools/perf/builtin-script.c
>> @@ -412,11 +412,13 @@ static void print_sample_iregs(struct perf_sample *sample,
>>  struct regs_dump *regs = &sample->intr_regs;
>>  uint64_t mask = attr->sample_regs_intr;
>>  unsigned i = 0, r;
>> +DECLARE_U64_BITMAP(_mask);
> I thought again, and realized that it may be just
>   DECLARE_BITMAP(_mask, 64);
>
> I think it's better than introducing a new macro and I'd recommend you to
> send v5 doing this. But this version is OK to me as well. So it's up
> to you.

Yeah, makes sense. My bad, I did not look closely at DECLARE_BITMAP.
Will send out a v5 now with that change.

Maddy
>
> Reviewed-by: Yury Norov 
>
>>  if (!regs)
>>  return;
>>  
>> -for_each_set_bit(r, (unsigned long *) &mask, sizeof(mask) * 8) {
>> +bitmap_from_u64(_mask, mask);
>> +for_each_set_bit(r, _mask, sizeof(mask) * 8) {
>>  u64 val = regs->regs[i++];
>>  printf("%5s:0x%"PRIx64" ", perf_reg_name(r), val);
>>  }
>> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
>> index 

Re: [PATCH v4] tools/perf: Fix the mask in regs_dump__printf and print_sample_iregs

2016-06-22 Thread Yury Norov
On Thu, Jun 23, 2016 at 10:31:16AM +0530, Madhavan Srinivasan wrote:
> When decoding the perf_regs mask in regs_dump__printf(),
> we loop through the mask using find_first_bit and find_next_bit functions.
> "mask" is of type "u64", but sent as a "unsigned long *" to
> lib functions along with sizeof().
> 
> While the existing code works fine in most cases,
> the logic is broken when using a 32bit perf on a 64bit kernel (Big Endian).
> When reading a u64 using (u32 *)(&mask)[0], perf (lib/find_*_bit()) assumes it gets
> the lower 32 bits of the u64, which is wrong. The proposed fix is to swap the words
> of the u64 to handle this case. This is _not_ an endianness swap.
> 
> Suggested-by: Yury Norov 
> Cc: Yury Norov 
> Cc: Peter Zijlstra 
> Cc: Ingo Molnar 
> Cc: Arnaldo Carvalho de Melo 
> Cc: Alexander Shishkin 
> Cc: Jiri Olsa 
> Cc: Adrian Hunter 
> Cc: Kan Liang 
> Cc: Wang Nan 
> Cc: Michael Ellerman 
> Signed-off-by: Madhavan Srinivasan 
> ---
> Changelog v3:
> 1)Moved the swap function to lib/bitmap.c
> 2)Added a macro for declaration
> 3)Added the comments
> 
> Changelog v2:
> 1)Moved the swap code to a common function
> 2)Added more comments in the code
> 
> Changelog v1:
> 1)updated commit message and patch subject
> 2)Add the fix to print_sample_iregs() in builtin-script.c
> 
>  tools/include/linux/bitmap.h |  5 +
>  tools/lib/bitmap.c   | 18 ++
>  tools/perf/builtin-script.c  |  4 +++-
>  tools/perf/util/session.c|  4 +++-
>  4 files changed, 29 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
> index 28f5493da491..6be9a7ddcb03 100644
> --- a/tools/include/linux/bitmap.h
> +++ b/tools/include/linux/bitmap.h
> @@ -2,14 +2,19 @@
>  #define _PERF_BITOPS_H
>  
>  #include 
> +#include 
>  #include 
>  
>  #define DECLARE_BITMAP(name,bits) \
>   unsigned long name[BITS_TO_LONGS(bits)]
>  
> +#define DECLARE_U64_BITMAP(__name) \
> + unsigned long __name[sizeof(u64)/sizeof(unsigned long)]
> +
>  int __bitmap_weight(const unsigned long *bitmap, int bits);
>  void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
>const unsigned long *bitmap2, int bits);
> +void bitmap_from_u64(unsigned long *dst, u64 mask);
>  
>  #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
>  
> diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
> index 0a1adcfd..464a0cc63e6a 100644
> --- a/tools/lib/bitmap.c
> +++ b/tools/lib/bitmap.c
> @@ -29,3 +29,21 @@ void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
>   for (k = 0; k < nr; k++)
>   dst[k] = bitmap1[k] | bitmap2[k];
>  }
> +
> +/*
> + * bitmap_from_u64 - Check and swap words within u64.
> + *  @mask: source bitmap
> + *  @dst:  destination bitmap
> + *
> + * In 32 bit big endian userspace on a 64bit kernel, 'unsigned long' is 32 bits.
> + * When reading u64 using (u32 *)(&mask)[0] and (u32 *)(&mask)[1],
> + * we will get wrong value for the mask. That is "(u32 *)(&mask)[0]"
> + * gets upper 32 bits of u64, but perf may expect lower 32bits of u64.
> + */
> +void bitmap_from_u64(unsigned long *dst, u64 mask)
> +{
> + dst[0] = mask & ULONG_MAX;
> +
> + if (sizeof(mask) > sizeof(unsigned long))
> + dst[1] = mask >> 32;
> +}
> diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
> index e3ce2f34d3ad..1120ca117071 100644
> --- a/tools/perf/builtin-script.c
> +++ b/tools/perf/builtin-script.c
> @@ -412,11 +412,13 @@ static void print_sample_iregs(struct perf_sample *sample,
>   struct regs_dump *regs = &sample->intr_regs;
>   uint64_t mask = attr->sample_regs_intr;
>   unsigned i = 0, r;
> + DECLARE_U64_BITMAP(_mask);

I thought again, and realized that it may be just
DECLARE_BITMAP(_mask, 64);

I think it's better than introducing a new macro and I'd recommend you to
send v5 doing this. But this version is OK to me as well. So it's up
to you.

Reviewed-by: Yury Norov 

>   if (!regs)
>   return;
>  
> - for_each_set_bit(r, (unsigned long *) &mask, sizeof(mask) * 8) {
> + bitmap_from_u64(_mask, mask);
> + for_each_set_bit(r, _mask, sizeof(mask) * 8) {
>   u64 val = regs->regs[i++];
>   printf("%5s:0x%"PRIx64" ", perf_reg_name(r), val);
>   }
> diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
> index 5214974e841a..fab1f9c1e0f5 100644
> --- a/tools/perf/util/session.c
> +++ b/tools/perf/util/session.c
> @@ -940,8 +940,10 @@ static void branch_stack__printf(struct perf_sample *sample)
>  static void regs_dump__printf(u64 mask, u64 *regs)
>  {
>   unsigned rid, i = 0;
> + 

[PATCH v4] tools/perf: Fix the mask in regs_dump__printf and print_sample_iregs

2016-06-22 Thread Madhavan Srinivasan
When decoding the perf_regs mask in regs_dump__printf(),
we loop through the mask using find_first_bit and find_next_bit functions.
"mask" is of type "u64", but sent as a "unsigned long *" to
lib functions along with sizeof().

While the existing code works fine in most cases,
the logic is broken when using a 32bit perf on a 64bit kernel (Big Endian).
When reading a u64 using (u32 *)(&mask)[0], perf (lib/find_*_bit()) assumes it gets
the lower 32 bits of the u64, which is wrong. The proposed fix is to swap the words
of the u64 to handle this case. This is _not_ an endianness swap.

Suggested-by: Yury Norov 
Cc: Yury Norov 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexander Shishkin 
Cc: Jiri Olsa 
Cc: Adrian Hunter 
Cc: Kan Liang 
Cc: Wang Nan 
Cc: Michael Ellerman 
Signed-off-by: Madhavan Srinivasan 
---
Changelog v3:
1)Moved the swap function to lib/bitmap.c
2)Added a macro for declaration
3)Added the comments

Changelog v2:
1)Moved the swap code to a common function
2)Added more comments in the code

Changelog v1:
1)updated commit message and patch subject
2)Add the fix to print_sample_iregs() in builtin-script.c

 tools/include/linux/bitmap.h |  5 +
 tools/lib/bitmap.c   | 18 ++
 tools/perf/builtin-script.c  |  4 +++-
 tools/perf/util/session.c|  4 +++-
 4 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index 28f5493da491..6be9a7ddcb03 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -2,14 +2,19 @@
 #define _PERF_BITOPS_H
 
 #include 
+#include 
 #include 
 
 #define DECLARE_BITMAP(name,bits) \
unsigned long name[BITS_TO_LONGS(bits)]
 
+#define DECLARE_U64_BITMAP(__name) \
+   unsigned long __name[sizeof(u64)/sizeof(unsigned long)]
+
 int __bitmap_weight(const unsigned long *bitmap, int bits);
 void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
 const unsigned long *bitmap2, int bits);
+void bitmap_from_u64(unsigned long *dst, u64 mask);
 
 #define BITMAP_FIRST_WORD_MASK(start) (~0UL << ((start) & (BITS_PER_LONG - 1)))
 
diff --git a/tools/lib/bitmap.c b/tools/lib/bitmap.c
index 0a1adcfd..464a0cc63e6a 100644
--- a/tools/lib/bitmap.c
+++ b/tools/lib/bitmap.c
@@ -29,3 +29,21 @@ void __bitmap_or(unsigned long *dst, const unsigned long *bitmap1,
for (k = 0; k < nr; k++)
dst[k] = bitmap1[k] | bitmap2[k];
 }
+
+/*
+ * bitmap_from_u64 - Check and swap words within u64.
+ *  @mask: source bitmap
+ *  @dst:  destination bitmap
+ *
+ * In 32 bit big endian userspace on a 64bit kernel, 'unsigned long' is 32 bits.
+ * When reading u64 using (u32 *)(&mask)[0] and (u32 *)(&mask)[1],
+ * we will get wrong value for the mask. That is "(u32 *)(&mask)[0]"
+ * gets upper 32 bits of u64, but perf may expect lower 32bits of u64.
+ */
+void bitmap_from_u64(unsigned long *dst, u64 mask)
+{
+   dst[0] = mask & ULONG_MAX;
+
+   if (sizeof(mask) > sizeof(unsigned long))
+   dst[1] = mask >> 32;
+}
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index e3ce2f34d3ad..1120ca117071 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -412,11 +412,13 @@ static void print_sample_iregs(struct perf_sample *sample,
struct regs_dump *regs = &sample->intr_regs;
uint64_t mask = attr->sample_regs_intr;
unsigned i = 0, r;
+   DECLARE_U64_BITMAP(_mask);
 
if (!regs)
return;
 
-   for_each_set_bit(r, (unsigned long *) &mask, sizeof(mask) * 8) {
+   bitmap_from_u64(_mask, mask);
+   for_each_set_bit(r, _mask, sizeof(mask) * 8) {
u64 val = regs->regs[i++];
printf("%5s:0x%"PRIx64" ", perf_reg_name(r), val);
}
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 5214974e841a..fab1f9c1e0f5 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -940,8 +940,10 @@ static void branch_stack__printf(struct perf_sample *sample)
 static void regs_dump__printf(u64 mask, u64 *regs)
 {
unsigned rid, i = 0;
+   DECLARE_U64_BITMAP(_mask);
 
-   for_each_set_bit(rid, (unsigned long *) &mask, sizeof(mask) * 8) {
+   bitmap_from_u64(_mask, mask);
+   for_each_set_bit(rid, _mask, sizeof(mask) * 8) {
u64 val = regs[i++];
 
printf(" %-5s 0x%" PRIx64 "\n",
-- 
1.9.1

Re: [PATCH] cpuidle/powernv: Fix snooze timeout

2016-06-22 Thread Shreyas B Prabhu


On 06/23/2016 05:18 AM, Balbir Singh wrote:
> 
> 
> On 23/06/16 05:36, Shreyas B. Prabhu wrote:
>> Snooze is a poll idle state in powernv and pseries platforms. Snooze
>> has a timeout so that if a cpu stays in snooze for more than the target
>> residency of the next available idle state, then it would exit, thereby
>> giving a chance to the cpuidle governor to re-evaluate and
>> promote the cpu to a deeper idle state. Therefore whenever snooze exits
>> due to this timeout, its last_residency will be target_residency of next
>> deeper state.
>>
>> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
>> changed the math around last_residency calculation. Specifically, while
>> converting last_residency value from nanoseconds to microseconds it does
>> right shift by 10. Due to this, in snooze timeout exit scenarios
>> last_residency calculated is roughly 2.3% less than target_residency of
>> next available state. This pattern is picked up by get_typical_interval()
>> in the menu governor and therefore expected_interval in menu_select() is
>> frequently less than the target_residency of any state but snooze.
>>
>> Due to this we are entering snooze at a higher rate, thereby affecting
>> the single thread performance.
>> Since the math around last_residency is not meant to be precise, fix this
>> issue by setting the snooze timeout to 105% of the target_residency of the next
>> available idle state.
>>
>> This also adds comment around why snooze timeout is necessary.
>>
>> Reported-by: Anton Blanchard 
>> Signed-off-by: Shreyas B. Prabhu 
>> ---
>>  drivers/cpuidle/cpuidle-powernv.c | 14 ++
>>  drivers/cpuidle/cpuidle-pseries.c | 13 +
>>  2 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
>> index e12dc30..5835491 100644
>> --- a/drivers/cpuidle/cpuidle-powernv.c
>> +++ b/drivers/cpuidle/cpuidle-powernv.c
>> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>>  cpuidle_state_table = powernv_states;
>>  /* Device tree can indicate more idle states */
>>  max_idle_state = powernv_add_idle_states();
>> +
>> +/*
>> + * Staying in snooze for a long period can degrade the
>> + * performance of the sibling cpus. Set timeout for snooze such
>> + * that if the cpu stays in snooze longer than target residency
>> + * of the next available idle state then exit from snooze. This
>> + * gives a chance to the cpuidle governor to re-evaluate and
>> + * promote it to deeper idle states.
>> + */
>>  if (max_idle_state > 1) {
>>  snooze_timeout_en = true;
>>  snooze_timeout = powernv_states[1].target_residency *
>>   tb_ticks_per_usec;
>> +/*
>> + * Give a 5% margin since target residency related math
>> + * is not precise in cpuidle core.
>> + */
> 
> Is this due to the microsecond conversion mentioned above? It would be nice to
> have it in the comment. Does
> 
> (powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve
> your rounding issues, assuming the issue is really rounding? Or maybe it is due
> to the shift by 10; could you please elaborate on what related math is not
> precise? That would explain to me why I missed understanding your changes.
> 
>> +snooze_timeout += snooze_timeout / 20;
> 
> For now 5% is sufficient, but do you want to add an assert to check that
> 
> snooze_timeout (in microseconds) / tb_ticks_per_usec > powernv_states[i].target_residency?
> 

This is not a rounding issue. As I mentioned in the commit message, this
is because of the last_residency calculation in cpuidle.c.
To elaborate, last residency calculation is done in the following way
after commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with
local_clock()") -

cpuidle_enter_state()
{
[...]
time_start = local_clock();
[enter idle state]
time_end = local_clock();
/*
 * local_clock() returns the time in nanosecond, let's shift
 * by 10 (divide by 1024) to have microsecond based time.
 */
diff = (time_end - time_start) >> 10;
[...]
dev->last_residency = (int) diff;
}

Because of >>10 as opposed to /1000, last_residency is about 2.3% smaller.

Because of this, in snooze timeout exit scenarios the calculated last_residency
is about 2.3% less than the target_residency of the next available state.
This affects get_typical_interval() in the menu governor and therefore
expected_interval in menu_select() is frequently less than the
target_residency of any state but snooze.
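
As a concrete example (numbers chosen for illustration): if the next state's
target_residency is 100 us and snooze exits after exactly 100,000 ns, the
computed last_residency is 100000 >> 10 = 97 us, since 1000/1024 is roughly
0.977. That is the ~2.3% shortfall described above, and why a 5% margin on
the snooze timeout comfortably covers it.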


I'll expand the comments as you suggested.

Thanks,
Shreyas

RE: [PATCH 2/4] soc: fsl: add GUTS driver for QorIQ platforms

2016-06-22 Thread Yangbo Lu
Hi Arnd,

Could you comment on these?
Thanks.


Best regards,
Yangbo Lu


> -Original Message-
> From: Scott Wood [mailto:o...@buserror.net]
> Sent: Saturday, June 11, 2016 9:51 AM
> To: Arnd Bergmann; linuxppc-dev@lists.ozlabs.org
> Cc: Mark Rutland; Ulf Hansson; linux-ker...@vger.kernel.org; linux-
> i...@vger.kernel.org; linux-...@vger.kernel.org; Qiang Zhao; Russell King;
> Bhupesh Sharma; Joerg Roedel; Claudiu Manoil; devicet...@vger.kernel.org;
> Kumar Gala; Rob Herring; Santosh Shilimkar; linux-arm-
> ker...@lists.infradead.org; net...@vger.kernel.org; linux-
> m...@vger.kernel.org; Xiaobo Xie; Yang-Leo Li; iommu@lists.linux-
> foundation.org; Yangbo Lu
> Subject: Re: [PATCH 2/4] soc: fsl: add GUTS driver for QorIQ platforms
> 
> On Thu, 2016-06-02 at 10:43 +0200, Arnd Bergmann wrote:
> > On Wednesday, June 1, 2016 8:47:22 PM CEST Scott Wood wrote:
> > > On Mon, 2016-05-30 at 15:15 +0200, Arnd Bergmann wrote:
> > > > diff --git a/drivers/soc/fsl/guts.c b/drivers/soc/fsl/guts.c new
> > > > file mode 100644 index ..2f30698f5bcf
> > > > --- /dev/null
> > > > +++ b/drivers/soc/fsl/guts.c
> > > > @@ -0,0 +1,130 @@
> > > > +/*
> > > > + * Freescale QorIQ Platforms GUTS Driver
> > > > + *
> > > > + * Copyright (C) 2016 Freescale Semiconductor, Inc.
> > > > + *
> > > > + * This program is free software; you can redistribute it and/or
> > > > +modify
> > > > + * it under the terms of the GNU General Public License as
> > > > +published by
> > > > + * the Free Software Foundation; either version 2 of the License,
> > > > +or
> > > > + * (at your option) any later version.
> > > > + */
> > > > +
> > > > +#include 
> > > > +#include  #include 
> > > > +#include  #include  #include
> > > > + #include 
> > > > +
> > > > +#define GUTS_PVR   0x0a0
> > > > +#define GUTS_SVR   0x0a4
> > > > +
> > > > +struct guts {
> > > > +   void __iomem *regs;
> > >
> > > We already have a struct to define guts.  Why are you not using it?
> > > Why do you consider using it to be "abuse"?  What if we want to move
> > > more guts functionality into this driver?
> >
> > This structure was in the original patch, I left it in there, only
> > removed the inclusion of the powerpc header file, which seemed to be
> > misplaced.
> 
> I'm not referring to "struct guts".  I'm referring to changing "struct
> ccsr_guts __iomem *regs" into "void __iomem *regs".
> 
> And it's not a powerpc header file.
> 
> > > > +/*
> > > > + * Table for matching compatible strings, for device tree
> > > > + * guts node, for Freescale QorIQ SOCs.
> > > > + */
> > > > +static const struct of_device_id fsl_guts_of_match[] = {
> > > > +   /* For T4 & B4 Series SOCs */
> > > > +   { .compatible = "fsl,qoriq-device-config-1.0", .data = "T4/B4
> > > > series" },
> > > [snip]
> > > > +   { .compatible = "fsl,qoriq-device-config-2.0", .data = "P
> > > > series"
> > >
> > > As noted in my comment on patch 3/4, these descriptions are reversed.
> > >
> > > They're also incomplete.  t2080 has device config 2.0.  t1040 is
> > > described as
> > > 2.0 though it should probably be 2.1 (or better, drop the generic
> > > compatible altogether).
> >
> > Ok. Ideally I think we'd even look up the specific SoC names from the
> > SVR rather than the compatible string. I just didn't have a good list
> > for those to put in the driver.
> 
> The list is in arch/powerpc/include/asm/mpc85xx.h but I don't know why we
> need to convert it to a string in the first place.
> 
> >
> > > > +   /*
> > > > +* syscon devices default to little-endian, but on powerpc we
> > > > have
> > > > +* existing device trees with big-endian maps and an absent
> > > > endianess
> > > > +* "big-property"
> > > > +*/
> > > > +   if (!IS_ENABLED(CONFIG_POWERPC) &&
> > > > +   !of_property_read_bool(dev->of_node, "big-endian"))
> > > > +   guts->little_endian = true;
> > >
> > > This is not a syscon device (Yangbo's patch to add a guts node on
> > > ls2080 is the only guts node that says "syscon", and that was a
> > > leftover from earlier revisions and should probably be removed).
> > > Even if it were, where is it documented that syscon defaults to
> > > little-endian?
> >
> > Documentation/devicetree/bindings/regmap/regmap.txt
> >
> > We had a little screwup here, basically regmap (and by consequence,
> > syscon) always defaulted to little-endian way before that was
> > documented, so it's too late to change it,
> 
> What causes a device node to fall under the jurisdiction of regmap.txt?
>  Again, these nodes do not claim "syscon" compatibility.
> 
> > although I agree it would have made sense to document regmap to
> > default to big-endian on powerpc.
> 
> Please don't.  It's enough of a mess as is; no need to start throwing in
> architecture ifdefs.
> 
> > > Documentation/devicetree/bindings/common-properties.txt says that
> > > the individual binding specifies the default.  The default for this
> > > node 

Re: [PATCH v3 3/9] kexec_file: Factor out kexec_locate_mem_hole from kexec_add_buffer.

2016-06-22 Thread Dave Young
On 06/22/16 at 08:34pm, Thiago Jung Bauermann wrote:
> Am Mittwoch, 22 Juni 2016, 18:18:01 schrieb Dave Young:
> > On 06/21/16 at 04:48pm, Thiago Jung Bauermann wrote:
> > > +/**
> > > + * kexec_locate_mem_hole - find free memory to load segment or use in purgatory
> > > + * @image:    kexec image being updated.
> > > + * @size:     Memory size.
> > > + * @align:    Minimum alignment needed.
> > > + * @min_addr: Minimum starting address.
> > > + * @max_addr: Maximum end address.
> > > + * @top_down: Find the highest free memory region?
> > > + * @addr:     On success, will have start address of the memory region found.
> > > + *
> > > + * Return: 0 on success, negative errno on error.
> > > + */
> > > +int kexec_locate_mem_hole(struct kimage *image, unsigned long size,
> > > +   unsigned long align, unsigned long min_addr,
> > > +   unsigned long max_addr, bool top_down,
> > > +   unsigned long *addr)
> > > +{
> > > + int ret;
> > > + struct kexec_buf buf;
> > > +
> > > + memset(&buf, 0, sizeof(struct kexec_buf));
> > > + buf.image = image;
> > > +
> > > + buf.memsz = size;
> > > + buf.buf_align = align;
> > > + buf.buf_min = min_addr;
> > > + buf.buf_max = max_addr;
> > > + buf.top_down = top_down;
> > 
> > Since patch 2/9 moved kexec_buf from the internal header file to kexec.h, it
> > will be natural to pass a kexec_buf pointer instead of passing all
> > these arguments to kexec_locate_mem_hole.
> > 
> > kbuf.mem can be used for addr.
> 
> Ok. What about this version?
> -- 
> []'s
> Thiago Jung Bauermann
> IBM Linux Technology Center
> 
> 
> Subject: [PATCH 3/9] kexec_file: Factor out kexec_locate_mem_hole from
>  kexec_add_buffer.
> 
> kexec_locate_mem_hole will be used by the PowerPC kexec_file_load
> implementation to find free memory for the purgatory stack.
> 
> Signed-off-by: Thiago Jung Bauermann 
> Cc: Eric Biederman 
> Cc: Dave Young 
> Cc: ke...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  include/linux/kexec.h | 12 +---
>  kernel/kexec_file.c   | 25 -
>  2 files changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 3d91bcfc180d..e8b099da47f5 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -147,9 +147,14 @@ struct kexec_file_ops {
>  #endif
>  };
>  
> -/*
> - * Keeps track of buffer parameters as provided by caller for requesting
> - * memory placement of buffer.
> +/**
> + * struct kexec_buf - parameters for finding a place for a buffer in memory
> + * @image:   kexec image in which memory to search.
> + * @size:     Memory size for the buffer.
> + * @align:    Minimum alignment needed.
> + * @min_addr: Minimum starting address.
> + * @max_addr: Maximum end address.
> + * @top_down: Find the highest free memory region?

The parameter comments above should go into the previous patch.
Other than that it looks good.

Thanks
Dave
>   */
>  struct kexec_buf {
>   struct kimage *image;
> @@ -163,6 +168,7 @@ struct kexec_buf {
>  
>  int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
>  int (*func)(u64, u64, void *));
> +int kexec_locate_mem_hole(struct kexec_buf *kbuf);
>  #endif /* CONFIG_KEXEC_FILE */
>  
>  struct kimage {
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index b1f1f6402518..445d66add8ca 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -449,6 +449,23 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
>   return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
>  }
>  
> +/**
> + * kexec_locate_mem_hole - find free memory to load segment or use in purgatory
> + * @kbuf:Parameters for the memory search.
> + *
> + * On success, kbuf->mem will have the start address of the memory region found.
> + *
> + * Return: 0 on success, negative errno on error.
> + */
> +int kexec_locate_mem_hole(struct kexec_buf *kbuf)
> +{
> + int ret;
> +
> + ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
> +
> + return ret == 1 ? 0 : -EADDRNOTAVAIL;
> +}
> +
>  /*
>   * Helper function for placing a buffer in a kexec segment. This assumes
>   * that kexec_mutex is held.
> @@ -493,11 +510,9 @@ int kexec_add_buffer(struct kimage *image, char *buffer, unsigned long bufsz,
>   kbuf->top_down = top_down;
>  
>   /* Walk the RAM ranges and allocate a suitable range for the buffer */
> - ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
> - if (ret != 1) {
> - /* A suitable memory range could not be found for buffer */
> - return -EADDRNOTAVAIL;
> - }
> + ret = kexec_locate_mem_hole(kbuf);
> + if (ret)
> + return ret;
>  
>   /* Found a suitable memory range */
>   ksegment = &image->segment[image->nr_segments];
> -- 
> 1.9.1
> 
> 
> 
> 

Re: [PATCH v3 2/9] kexec_file: Generalize kexec_add_buffer.

2016-06-22 Thread Dave Young
On 06/22/16 at 08:30pm, Thiago Jung Bauermann wrote:
> Am Mittwoch, 22 Juni 2016, 18:20:47 schrieb Dave Young:
> > The patch looks good, but could the subject be more specific?
> > 
> > For example, just like the first sentence of the patch description:
> > Allow architectures to specify their own memory walking function
> 
> Ok, what about this? I also changed the description to refer to the x86 arch
> instead of Intel arch.

It looks good to me.

Thanks
Dave

> 
> []'s
> Thiago Jung Bauermann
> IBM Linux Technology Center
> 
> 
> Subject: [PATCH 2/9] kexec_file: Allow arch-specific memory walking for
>  kexec_add_buffer
> 
> Allow architectures to specify a different memory walking function for
> kexec_add_buffer. x86 uses iomem to track reserved memory ranges, but
> PowerPC uses the memblock subsystem.
> 
> Signed-off-by: Thiago Jung Bauermann 
> Cc: Eric Biederman 
> Cc: Dave Young 
> Cc: ke...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  include/linux/kexec.h   | 19 ++-
>  kernel/kexec_file.c | 30 ++
>  kernel/kexec_internal.h | 14 --
>  3 files changed, 40 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index e8acb2b43dd9..3d91bcfc180d 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -146,7 +146,24 @@ struct kexec_file_ops {
>   kexec_verify_sig_t *verify_sig;
>  #endif
>  };
> -#endif
> +
> +/*
> + * Keeps track of buffer parameters as provided by caller for requesting
> + * memory placement of buffer.
> + */
> +struct kexec_buf {
> + struct kimage *image;
> + unsigned long mem;
> + unsigned long memsz;
> + unsigned long buf_align;
> + unsigned long buf_min;
> + unsigned long buf_max;
> + bool top_down;  /* allocate from top of memory hole */
> +};
> +
> +int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
> +int (*func)(u64, u64, void *));
> +#endif /* CONFIG_KEXEC_FILE */
>  
>  struct kimage {
>   kimage_entry_t head;
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index b6eec7527e9f..b1f1f6402518 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -428,6 +428,27 @@ static int locate_mem_hole_callback(u64 start, u64 end, void *arg)
>   return locate_mem_hole_bottom_up(start, end, kbuf);
>  }
>  
> +/**
> + * arch_kexec_walk_mem - call func(data) on free memory regions
> + * @kbuf: Context info for the search. Also passed to @func.
> + * @func: Function to call for each memory region.
> + *
> + * Return: The memory walk will stop when func returns a non-zero value
> + * and that value will be returned. If all free regions are visited without
> + * func returning non-zero, then zero will be returned.
> + */
> +int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
> +int (*func)(u64, u64, void *))
> +{
> + if (kbuf->image->type == KEXEC_TYPE_CRASH)
> + return walk_iomem_res_desc(crashk_res.desc,
> +IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> +crashk_res.start, crashk_res.end,
> +kbuf, func);
> + else
> + return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
> +}
> +
>  /*
>   * Helper function for placing a buffer in a kexec segment. This assumes
>   * that kexec_mutex is held.
> @@ -472,14 +493,7 @@ int kexec_add_buffer(struct kimage *image, char *buffer, unsigned long bufsz,
>   kbuf->top_down = top_down;
>  
>   /* Walk the RAM ranges and allocate a suitable range for the buffer */
> - if (image->type == KEXEC_TYPE_CRASH)
> - ret = walk_iomem_res_desc(crashk_res.desc,
> - IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> - crashk_res.start, crashk_res.end, kbuf,
> - locate_mem_hole_callback);
> - else
> - ret = walk_system_ram_res(0, -1, kbuf,
> -   locate_mem_hole_callback);
> + ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
>   if (ret != 1) {
>   /* A suitable memory range could not be found for buffer */
>   return -EADDRNOTAVAIL;
> diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
> index eefd5bf960c2..4cef7e4706b0 100644
> --- a/kernel/kexec_internal.h
> +++ b/kernel/kexec_internal.h
> @@ -20,20 +20,6 @@ struct kexec_sha_region {
>   unsigned long len;
>  };
>  
> -/*
> - * Keeps track of buffer parameters as provided by caller for requesting
> - * memory placement of buffer.
> - */
> -struct kexec_buf {
> - struct kimage *image;
> - unsigned long mem;
> - unsigned long memsz;
> - unsigned long buf_align;
> - unsigned long buf_min;
> - unsigned 

Re: [PATCH v10 13/18] powerpc/pci: Delay populating pdn

2016-06-22 Thread Gavin Shan
On Thu, Jun 23, 2016 at 10:59:57AM +1000, Daniel Axtens wrote:
>Hi,
>
>mpe merged this to his next yesterday, and my nightly scripts picked up
>the following cppcheck warning:
>
>[arch/powerpc/kernel/pci_dn.c:486]: (error) Memory leak: pdn
>
>That goes to the following function:
>
>static void *add_pdn(struct device_node *dn, void *data)
>{
>struct pci_controller *hose = data;
>struct pci_dn *pdn;
>
>pdn = pci_add_device_node_info(hose, dn);
>if (!pdn)
>return ERR_PTR(-ENOMEM);
>
>return NULL;
>}
>
>It looks like this is just called in pci_devs_phb_init_dynamic:
>
>/* Update dn->phb ptrs for new phb and children devices */
>pci_traverse_device_nodes(dn, add_pdn, phb);
>
>So my understanding is that this isn't a real memory leak but just
>cppcheck getting confused.
>
>Gavin, can you confirm that?
>

Daniel, thanks for the report. It's a false alarm: pci_traverse_device_nodes()
terminates immediately on a non-NULL return value from add_pdn(). So the
function (add_pdn()) returns NULL intentionally. The newly created @pdn has
been attached to @dn. The @pdn is released when the @dn is released at PCI
hot remove time.
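
In other words, the traversal contract is roughly the following (a simplified
sketch for illustration, not the actual implementation):

	static void *traverse(struct device_node *start,
			      void *(*fn)(struct device_node *, void *),
			      void *data)
	{
		struct device_node *dn;
		void *ret = NULL;

		for_each_child_of_node(start, dn) {	/* simplified walk */
			ret = fn(dn, data);		/* add_pdn() here */
			if (ret)			/* non-NULL ends the walk */
				break;
		}
		return ret;
	}

add_pdn() returning NULL therefore means "keep going", and the newly created
pdn stays reachable through the device node, so nothing is leaked.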

>Regards,
>Daniel
>

Thanks,
Gavin

>Gavin Shan  writes:
>
>> The pdn (struct pci_dn) instances are allocated from memblock or
>> bootmem when creating PCI controller (hoses) in setup_arch(). PCI
>> hotplug, which will be supported by subsequent patches, releases
>> PCI device nodes and their corresponding pdn on unplugging event.
>> The memory chunks for pdn instances allocated from memblock or
>> bootmem are hard to reuse after being released.
>>
>> This delays creating pdn by pci_devs_phb_init() from setup_arch()
>> to core_initcall() so that they are allocated from slab. The memory
>> consumed by pdn can be released to system without problem during
>> PCI unplugging time. It indicates that pci_dn is unavailable in
>> setup_arch() and the fixup on pdn (like AGP's) can't be carried
>> out that time. We have to do that in pcibios_root_bridge_prepare()
>> on maple/pasemi/powermac platforms where/when the pdn is available.
>> pcibios_root_bridge_prepare is called from subsys_initcall() which
>> is executed after core_initcall() so the code flow does not change.
>>
>> Meanwhile, the EEH device is created when pdn is populated,
>> meaning pdn and the EEH device have the same life cycle. In turn, we needn't
>> call eeh_dev_init() to create EEH device explicitly.
>>
>> Signed-off-by: Gavin Shan 
>> Reviewed-by: Alexey Kardashevskiy 
>> ---
>>  arch/powerpc/include/asm/eeh.h |  2 +-
>>  arch/powerpc/include/asm/ppc-pci.h |  2 --
>>  arch/powerpc/kernel/eeh_dev.c  | 17 +++
>>  arch/powerpc/kernel/pci_dn.c   | 23 
>>  arch/powerpc/platforms/maple/pci.c | 34 ++
>>  arch/powerpc/platforms/pasemi/pci.c|  3 ---
>>  arch/powerpc/platforms/powermac/pci.c  | 38 
>> +-
>>  arch/powerpc/platforms/powernv/pci.c   |  3 ---
>>  arch/powerpc/platforms/pseries/setup.c |  6 +-
>>  9 files changed, 69 insertions(+), 59 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>> index fb9f376..8721580 100644
>> --- a/arch/powerpc/include/asm/eeh.h
>> +++ b/arch/powerpc/include/asm/eeh.h
>> @@ -274,7 +274,7 @@ void eeh_pe_restore_bars(struct eeh_pe *pe);
>>  const char *eeh_pe_loc_get(struct eeh_pe *pe);
>>  struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe);
>>  
>> -void *eeh_dev_init(struct pci_dn *pdn, void *data);
>> +struct eeh_dev *eeh_dev_init(struct pci_dn *pdn);
>>  void eeh_dev_phb_init_dynamic(struct pci_controller *phb);
>>  int eeh_init(void);
>>  int __init eeh_ops_register(struct eeh_ops *ops);
>> diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
>> index 8753e4e..0f73de0 100644
>> --- a/arch/powerpc/include/asm/ppc-pci.h
>> +++ b/arch/powerpc/include/asm/ppc-pci.h
>> @@ -39,8 +39,6 @@ void *pci_traverse_device_nodes(struct device_node *start,
>>  void *traverse_pci_dn(struct pci_dn *root,
>>void *(*fn)(struct pci_dn *, void *),
>>void *data);
>> -
>> -extern void pci_devs_phb_init(void);
>>  extern void pci_devs_phb_init_dynamic(struct pci_controller *phb);
>>  
>>  /* From rtas_pci.h */
>> diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
>> index 7815095..d6b2ca7 100644
>> --- a/arch/powerpc/kernel/eeh_dev.c
>> +++ b/arch/powerpc/kernel/eeh_dev.c
>> @@ -44,14 +44,13 @@
>>  /**
>>   * eeh_dev_init - Create EEH device according to OF node
>>   * @pdn: PCI device node
>> - * @data: PHB
>>   *
>>   * It will create EEH device according to the given OF node. The function
>>   * might be called by PCI enumeration, DR, PHB hotplug.
>>   */
>> -void *eeh_dev_init(struct pci_dn *pdn, void *data)
>> 

Re: [PATCH v10 13/18] powerpc/pci: Delay populating pdn

2016-06-22 Thread Daniel Axtens
Hi,

mpe merged this to his next yesterday, and my nightly scripts picked up
the following cppcheck warning:

[arch/powerpc/kernel/pci_dn.c:486]: (error) Memory leak: pdn

That goes to the following function:

static void *add_pdn(struct device_node *dn, void *data)
{
struct pci_controller *hose = data;
struct pci_dn *pdn;

pdn = pci_add_device_node_info(hose, dn);
if (!pdn)
return ERR_PTR(-ENOMEM);

return NULL;
}

It looks like this is just called in pci_devs_phb_init_dynamic:

/* Update dn->phb ptrs for new phb and children devices */
pci_traverse_device_nodes(dn, add_pdn, phb);

So my understanding is that this isn't a real memory leak but just
cppcheck getting confused.

Gavin, can you confirm that?

Regards,
Daniel

Gavin Shan  writes:

> The pdn (struct pci_dn) instances are allocated from memblock or
> bootmem when creating PCI controller (hoses) in setup_arch(). PCI
> hotplug, which will be supported by subsequent patches, releases
> PCI device nodes and their corresponding pdn on unplugging event.
> The memory chunks for pdn instances allocated from memblock or
> bootmem are hard to reuse after being released.
>
> This delays creating pdn by pci_devs_phb_init() from setup_arch()
> to core_initcall() so that they are allocated from slab. The memory
> consumed by pdn can be released to system without problem during
> PCI unplugging time. It indicates that pci_dn is unavailable in
> setup_arch() and the fixup on pdn (like AGP's) can't be carried
> out that time. We have to do that in pcibios_root_bridge_prepare()
> on maple/pasemi/powermac platforms where/when the pdn is available.
> pcibios_root_bridge_prepare is called from subsys_initcall() which
> is executed after core_initcall() so the code flow does not change.
>
> Meanwhile, the EEH device is created when pdn is populated,
> meaning pdn and the EEH device have the same life cycle. In turn, we needn't
> call eeh_dev_init() to create EEH device explicitly.
>
> Signed-off-by: Gavin Shan 
> Reviewed-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/include/asm/eeh.h |  2 +-
>  arch/powerpc/include/asm/ppc-pci.h |  2 --
>  arch/powerpc/kernel/eeh_dev.c  | 17 +++
>  arch/powerpc/kernel/pci_dn.c   | 23 
>  arch/powerpc/platforms/maple/pci.c | 34 ++
>  arch/powerpc/platforms/pasemi/pci.c|  3 ---
>  arch/powerpc/platforms/powermac/pci.c  | 38 
> +-
>  arch/powerpc/platforms/powernv/pci.c   |  3 ---
>  arch/powerpc/platforms/pseries/setup.c |  6 +-
>  9 files changed, 69 insertions(+), 59 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index fb9f376..8721580 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -274,7 +274,7 @@ void eeh_pe_restore_bars(struct eeh_pe *pe);
>  const char *eeh_pe_loc_get(struct eeh_pe *pe);
>  struct pci_bus *eeh_pe_bus_get(struct eeh_pe *pe);
>  
> -void *eeh_dev_init(struct pci_dn *pdn, void *data);
> +struct eeh_dev *eeh_dev_init(struct pci_dn *pdn);
>  void eeh_dev_phb_init_dynamic(struct pci_controller *phb);
>  int eeh_init(void);
>  int __init eeh_ops_register(struct eeh_ops *ops);
> diff --git a/arch/powerpc/include/asm/ppc-pci.h b/arch/powerpc/include/asm/ppc-pci.h
> index 8753e4e..0f73de0 100644
> --- a/arch/powerpc/include/asm/ppc-pci.h
> +++ b/arch/powerpc/include/asm/ppc-pci.h
> @@ -39,8 +39,6 @@ void *pci_traverse_device_nodes(struct device_node *start,
>  void *traverse_pci_dn(struct pci_dn *root,
> void *(*fn)(struct pci_dn *, void *),
> void *data);
> -
> -extern void pci_devs_phb_init(void);
>  extern void pci_devs_phb_init_dynamic(struct pci_controller *phb);
>  
>  /* From rtas_pci.h */
> diff --git a/arch/powerpc/kernel/eeh_dev.c b/arch/powerpc/kernel/eeh_dev.c
> index 7815095..d6b2ca7 100644
> --- a/arch/powerpc/kernel/eeh_dev.c
> +++ b/arch/powerpc/kernel/eeh_dev.c
> @@ -44,14 +44,13 @@
>  /**
>   * eeh_dev_init - Create EEH device according to OF node
>   * @pdn: PCI device node
> - * @data: PHB
>   *
>   * It will create EEH device according to the given OF node. The function
>   * might be called by PCI enumeration, DR, PHB hotplug.
>   */
> -void *eeh_dev_init(struct pci_dn *pdn, void *data)
> +struct eeh_dev *eeh_dev_init(struct pci_dn *pdn)
>  {
> - struct pci_controller *phb = data;
> + struct pci_controller *phb = pdn->phb;
>   struct eeh_dev *edev;
>  
>   /* Allocate EEH device */
> @@ -69,7 +68,7 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
>   INIT_LIST_HEAD(>list);
>   INIT_LIST_HEAD(>rmv_list);
>  
> - return NULL;
> + return edev;
>  }
>  
>  /**
> @@ -81,16 +80,8 @@ void *eeh_dev_init(struct pci_dn *pdn, void *data)
>   */
>  void 

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Andy Lutomirski
On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeuf  wrote:
> Andy,
>
> So I got a chance to look at this some more.  I'm thinking that to make
> this feature more consistently useful, we shouldn't only annotate
> pt_regs frames for calls to handlers; other calls should be annotated as
> well: preempt_schedule_irq, CALL_enter_from_user_mode,
> prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> etc.  That way, the unwinder will always be able to find pt_regs from an
> interrupt/exception, even if starting from one of these other calls.
>
> But then, things get ugly.  You have to either setup and tear down the
> frame for every possible call, or do a higher-level setup/teardown
> across multiple calls, which invalidates several assumptions in the
> entry code about the location of pt_regs on the stack.
>

Here's yet another harebrained idea.  Maybe it works better than my
previous harebrained ideas :)

Your patch is already creating a somewhat nonstandard stack frame:

+   movq    %rbp,               0*8(%rsp)
+   movq    $entry_frame_ret,   1*8(%rsp)
+   movq    %rsp, %rbp

It's kind of a normal stack frame, but rbp points at something odd,
and to unwind it fully correctly, the unwinder needs to know about it.

What if we made it even more special, along the lines of:

leaq offset_to_ptregs(%rsp), %rbp
xorq $-1, %rbp

IOW, don't write anything to the stack at all, and just put a special
value into RBP that says "the next frame is pt_regs at such-and-such
address".  Do this once on entry and make sure to restore RBP (from
pt_regs) on exit.  Now the unwinder can notice that RBP has the high
bits clear *and* that the negation of it points to the stack, and it
can figure out what's going on.
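
A minimal sketch of how an unwinder could decode such a frame (the function
name and the stack-bounds arguments are illustrative assumptions, not an
existing API):

	/* RBP was set to ~(address of pt_regs), so its high bits are
	 * clear; a normal kernel frame pointer would have them set. */
	static bool entry_rbp_decode(unsigned long bp, unsigned long stack_lo,
				     unsigned long stack_hi, unsigned long *regs)
	{
		unsigned long candidate = ~bp;	/* undo the xorq $-1 */

		if (bp & (1UL << 63))		/* ordinary frame pointer */
			return false;
		if (candidate < stack_lo || candidate >= stack_hi)
			return false;		/* not on this task's stack */
		*regs = candidate;		/* pt_regs lives here */
		return true;
	}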

What do you think?  Am I nuts or could this work?

It had better not have much risk of breaking things worse than they
currently are, given that current kernel allow user code to stick any
value it likes into the very last element of the RBP chain.

--Andy

Re: [PATCH v3 0/9] kexec_file_load implementation for PowerPC

2016-06-22 Thread Balbir Singh


On 23/06/16 03:02, Thiago Jung Bauermann wrote:
> Hello Balbir,
>
Hi Thiago
 
>>> 3. have IMA pass-on its event log (where integrity measurements are
>>>
>>> registered) across kexec to the second kernel, so that the event
>>> history is preserved.
>>
>> OK.. and this is safe? Do both the kernels need to be signed by the
>> same certificate?
> 
> They don't. The integrity of the event log (assuming that is what you mean 
> by "this" in "this is safe") is guaranteed by the TPM device. Each event in 
> the measurement list extends a PCR and records its PCR value. It is 
> cryptographically guaranteed that if you replay the PCR extends recorded in
> the event log and, at the end of the process, they match the current PCR
> values in the TPM device, then that event log is correct.
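
A minimal sketch of that replay check (assuming SHA-1 PCR banks; the event
struct and the sha1() helper are illustrative, not kernel APIs):

	#define DIGEST_LEN 20			/* SHA-1 digest size */

	struct ima_log_event { unsigned char digest[DIGEST_LEN]; };

	static int replay_matches_tpm(const struct ima_log_event *log, int n,
				      const unsigned char tpm_pcr[DIGEST_LEN])
	{
		unsigned char pcr[DIGEST_LEN] = { 0 };	/* PCRs start zeroed */
		unsigned char buf[2 * DIGEST_LEN];
		int i;

		for (i = 0; i < n; i++) {
			/* extend: new PCR = H(old PCR || event digest) */
			memcpy(buf, pcr, DIGEST_LEN);
			memcpy(buf + DIGEST_LEN, log[i].digest, DIGEST_LEN);
			sha1(buf, sizeof(buf), pcr);	/* assumed helper */
		}
		return memcmp(pcr, tpm_pcr, DIGEST_LEN) == 0;
	}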


What I meant was: how does the new kernel know that the old kernel did not
cheat while passing on the values? I presume because we trust that kernel
via a signature.


and

How do we know the new kernel is safe to load - I guess via a signature that
the new kernel is signed with (assuming it is present in the key ring).

Balbir Singh

Re: [PATCH] cpuidle/powernv: Fix snooze timeout

2016-06-22 Thread Balbir Singh


On 23/06/16 05:36, Shreyas B. Prabhu wrote:
> Snooze is a poll idle state in powernv and pseries platforms. Snooze
> has a timeout so that if a cpu stays in snooze for more than the target
> residency of the next available idle state, then it would exit, thereby
> giving a chance to the cpuidle governor to re-evaluate and
> promote the cpu to a deeper idle state. Therefore whenever snooze exits
> due to this timeout, its last_residency will be target_residency of next
> deeper state.
> 
> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
> changed the math around last_residency calculation. Specifically, while
> converting last_residency value from nanoseconds to microseconds it does
> right shift by 10. Due to this, in snooze timeout exit scenarios
> last_residency calculated is roughly 2.3% less than target_residency of
> next available state. This pattern is picked up by get_typical_interval()
> in the menu governor and therefore expected_interval in menu_select() is
> frequently less than the target_residency of any state but snooze.
> 
> Due to this we are entering snooze at a higher rate, thereby affecting
> the single thread performance.
> Since the math around last_residency is not meant to be precise, fix this
> issue by setting the snooze timeout to 105% of the target_residency of the next
> available idle state.
> 
> This also adds a comment explaining why the snooze timeout is necessary.
> 
> Reported-by: Anton Blanchard 
> Signed-off-by: Shreyas B. Prabhu 
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 14 ++
>  drivers/cpuidle/cpuidle-pseries.c | 13 +
>  2 files changed, 27 insertions(+)
> 
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index e12dc30..5835491 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
>   cpuidle_state_table = powernv_states;
>   /* Device tree can indicate more idle states */
>   max_idle_state = powernv_add_idle_states();
> +
> + /*
> +  * Staying in snooze for a long period can degrade the
> +  * performance of the sibling cpus. Set timeout for snooze such
> +  * that if the cpu stays in snooze longer than target residency
> +  * of the next available idle state then exit from snooze. This
> +  * gives a chance to the cpuidle governor to re-evaluate and
> +  * promote it to deeper idle states.
> +  */
>   if (max_idle_state > 1) {
>   snooze_timeout_en = true;
>   snooze_timeout = powernv_states[1].target_residency *
>tb_ticks_per_usec;
> + /*
> +  * Give a 5% margin since target residency related math
> +  * is not precise in cpuidle core.
> +  */

Is this due to the microsecond conversion mentioned above? It would be nice to
have it in the comment. Does

(powernv_states[1].target_residency + tb_ticks_per_usec) / tb_ticks_per_usec solve
your rounding issues, assuming the issue is really rounding? Or maybe it is due
to the shift by 10; could you please elaborate on what related math is not
precise? That would explain to me why I missed understanding your changes.
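For reference, a small userspace sketch (illustrative only, not part of the
patch) of how the shift-by-10 conversion undershoots an exact division by
1000:

#include <stdio.h>

int main(void)
{
	unsigned long long residency_ns = 100000000ULL;	/* 100 ms */

	/* the cpuidle core converts ns to us with a right shift by 10 ... */
	unsigned long long us_shift = residency_ns >> 10;
	/* ... instead of an exact division by 1000 */
	unsigned long long us_exact = residency_ns / 1000;

	/* 1000/1024 ~= 0.977, i.e. roughly 2.3% below the exact value */
	printf("shift: %llu us, exact: %llu us\n", us_shift, us_exact);
	return 0;
}

which is where the ~2.3% figure in the commit message comes from.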

> + snooze_timeout += snooze_timeout / 20;

For now 5% is sufficient, but do you want to add an assert to check that

snooze_timeout / tb_ticks_per_usec (in microseconds) >
powernv_states[i].target_residency?

>   }
>   } else
>   return -ENODEV;
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 07135e0..22de841 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -250,10 +250,23 @@ static int pseries_idle_probe(void)
>   } else
>   return -ENODEV;
>  
> + /*
> +  * Staying in snooze for a long period can degrade the
> +  * performance of the sibling cpus. Set timeout for snooze such
> +  * that if the cpu stays in snooze longer than target residency
> +  * of the next available idle state then exit from snooze. This
> +  * gives a chance to the cpuidle governor to re-evaluate and
> +  * promote it to deeper idle states.
> +  */
>   if (max_idle_state > 1) {
>   snooze_timeout_en = true;
>   snooze_timeout = cpuidle_state_table[1].target_residency *
>tb_ticks_per_usec;
> + /*
> +  * Give a 5% margin since target residency related math
> +  * is not precise in cpuidle core.
> +  */
> + snooze_timeout += snooze_timeout / 20;
>   }
>   return 0;
>  }
> 

Re: [PATCH v3 3/9] kexec_file: Factor out kexec_locate_mem_hole from kexec_add_buffer.

2016-06-22 Thread Thiago Jung Bauermann
On Wednesday, 22 June 2016 18:18:01, Dave Young wrote:
> On 06/21/16 at 04:48pm, Thiago Jung Bauermann wrote:
> > +/**
> > + * kexec_locate_mem_hole - find free memory to load segment or use in purgatory
> > + * @image: kexec image being updated.
> > + * @size:  Memory size.
> > + * @align: Minimum alignment needed.
> > + * @min_addr:  Minimum starting address.
> > + * @max_addr:  Maximum end address.
> > + * @top_down:  Find the highest free memory region?
> > + * @addr:  On success, will have start address of the memory region found.
> > + *
> > + * Return: 0 on success, negative errno on error.
> > + */
> > +int kexec_locate_mem_hole(struct kimage *image, unsigned long size,
> > + unsigned long align, unsigned long min_addr,
> > + unsigned long max_addr, bool top_down,
> > + unsigned long *addr)
> > +{
> > +   int ret;
> > +   struct kexec_buf buf;
> > +
> > +   memset(&buf, 0, sizeof(struct kexec_buf));
> > +   buf.image = image;
> > +
> > +   buf.memsz = size;
> > +   buf.buf_align = align;
> > +   buf.buf_min = min_addr;
> > +   buf.buf_max = max_addr;
> > +   buf.top_down = top_down;
> 
> Since patch 2/9 moved kexec_buf from internal header file to kexec.h it
> will be natural to passing a kexec_buf pointer intead of passing all
> these arguments in kexec_locate_mem_hole.
> 
> kbuf.mem can be used for addr.

Ok. What about this version?
-- 
[]'s
Thiago Jung Bauermann
IBM Linux Technology Center


Subject: [PATCH 3/9] kexec_file: Factor out kexec_locate_mem_hole from
 kexec_add_buffer.

kexec_locate_mem_hole will be used by the PowerPC kexec_file_load
implementation to find free memory for the purgatory stack.

Signed-off-by: Thiago Jung Bauermann 
Cc: Eric Biederman 
Cc: Dave Young 
Cc: ke...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
---
 include/linux/kexec.h | 12 +---
 kernel/kexec_file.c   | 25 -
 2 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 3d91bcfc180d..e8b099da47f5 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -147,9 +147,14 @@ struct kexec_file_ops {
 #endif
 };
 
-/*
- * Keeps track of buffer parameters as provided by caller for requesting
- * memory placement of buffer.
+/**
+ * struct kexec_buf - parameters for finding a place for a buffer in memory
+ * @image: kexec image in which memory to search.
+ * @size:  Memory size for the buffer.
+ * @align: Minimum alignment needed.
+ * @min_addr:  Minimum starting address.
+ * @max_addr:  Maximum end address.
+ * @top_down:  Find the highest free memory region?
  */
 struct kexec_buf {
struct kimage *image;
@@ -163,6 +168,7 @@ struct kexec_buf {
 
 int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
   int (*func)(u64, u64, void *));
+int kexec_locate_mem_hole(struct kexec_buf *kbuf);
 #endif /* CONFIG_KEXEC_FILE */
 
 struct kimage {
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index b1f1f6402518..445d66add8ca 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -449,6 +449,23 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
 }
 
+/**
+ * kexec_locate_mem_hole - find free memory to load segment or use in purgatory
+ * @kbuf:  Parameters for the memory search.
+ *
+ * On success, kbuf->mem will have the start address of the memory region found.
+ *
+ * Return: 0 on success, negative errno on error.
+ */
+int kexec_locate_mem_hole(struct kexec_buf *kbuf)
+{
+   int ret;
+
+   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
+
+   return ret == 1 ? 0 : -EADDRNOTAVAIL;
+}
+
 /*
  * Helper function for placing a buffer in a kexec segment. This assumes
  * that kexec_mutex is held.
@@ -493,11 +510,9 @@ int kexec_add_buffer(struct kimage *image, char *buffer, unsigned long bufsz,
kbuf->top_down = top_down;
 
/* Walk the RAM ranges and allocate a suitable range for the buffer */
-   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
-   if (ret != 1) {
-   /* A suitable memory range could not be found for buffer */
-   return -EADDRNOTAVAIL;
-   }
+   ret = kexec_locate_mem_hole(kbuf);
+   if (ret)
+   return ret;
 
/* Found a suitable memory range */
	ksegment = &image->segment[image->nr_segments];
-- 
1.9.1
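
As a usage illustration, a sketch of how a caller such as the PowerPC
purgatory-stack code mentioned above might use the new helper (sizes and
limits here are made up for the example, not taken from the series):

	struct kexec_buf kbuf = { .image = image };
	int ret;

	kbuf.memsz = 2 * PAGE_SIZE;	/* hypothetical stack size */
	kbuf.buf_align = PAGE_SIZE;
	kbuf.buf_min = 0;
	kbuf.buf_max = ULONG_MAX;
	kbuf.top_down = true;

	ret = kexec_locate_mem_hole(&kbuf);
	if (ret)
		return ret;
	/* on success, kbuf.mem holds the start of the free region found */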



Re: [PATCH v3 2/9] kexec_file: Generalize kexec_add_buffer.

2016-06-22 Thread Thiago Jung Bauermann
On Wednesday, 22 June 2016 18:20:47, Dave Young wrote:
> The patch looks good, but could the subject be more specific?
> 
> For example just like the first sentence of the patch descriotion:
> Allow architectures to specify their own memory walking function

Ok, what about this? I also changed the description to refer to the x86 arch
instead of the Intel arch.

[]'s
Thiago Jung Bauermann
IBM Linux Technology Center


Subject: [PATCH 2/9] kexec_file: Allow arch-specific memory walking for
 kexec_add_buffer

Allow architectures to specify a different memory walking function for
kexec_add_buffer. x86 uses iomem to track reserved memory ranges, but
PowerPC uses the memblock subsystem.

Signed-off-by: Thiago Jung Bauermann 
Cc: Eric Biederman 
Cc: Dave Young 
Cc: ke...@lists.infradead.org
Cc: linux-ker...@vger.kernel.org
---
 include/linux/kexec.h   | 19 ++-
 kernel/kexec_file.c | 30 ++
 kernel/kexec_internal.h | 14 --
 3 files changed, 40 insertions(+), 23 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index e8acb2b43dd9..3d91bcfc180d 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -146,7 +146,24 @@ struct kexec_file_ops {
kexec_verify_sig_t *verify_sig;
 #endif
 };
-#endif
+
+/*
+ * Keeps track of buffer parameters as provided by caller for requesting
+ * memory placement of buffer.
+ */
+struct kexec_buf {
+   struct kimage *image;
+   unsigned long mem;
+   unsigned long memsz;
+   unsigned long buf_align;
+   unsigned long buf_min;
+   unsigned long buf_max;
+   bool top_down;  /* allocate from top of memory hole */
+};
+
+int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
+  int (*func)(u64, u64, void *));
+#endif /* CONFIG_KEXEC_FILE */
 
 struct kimage {
kimage_entry_t head;
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index b6eec7527e9f..b1f1f6402518 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -428,6 +428,27 @@ static int locate_mem_hole_callback(u64 start, u64 end, void *arg)
return locate_mem_hole_bottom_up(start, end, kbuf);
 }
 
+/**
+ * arch_kexec_walk_mem - call func(data) on free memory regions
+ * @kbuf:  Context info for the search. Also passed to @func.
+ * @func:  Function to call for each memory region.
+ *
+ * Return: The memory walk will stop when func returns a non-zero value
+ * and that value will be returned. If all free regions are visited without
+ * func returning non-zero, then zero will be returned.
+ */
+int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
+  int (*func)(u64, u64, void *))
+{
+   if (kbuf->image->type == KEXEC_TYPE_CRASH)
+   return walk_iomem_res_desc(crashk_res.desc,
+  IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
+  crashk_res.start, crashk_res.end,
+  kbuf, func);
+   else
+   return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
+}
+
 /*
  * Helper function for placing a buffer in a kexec segment. This assumes
  * that kexec_mutex is held.
@@ -472,14 +493,7 @@ int kexec_add_buffer(struct kimage *image, char *buffer, unsigned long bufsz,
kbuf->top_down = top_down;
 
/* Walk the RAM ranges and allocate a suitable range for the buffer */
-   if (image->type == KEXEC_TYPE_CRASH)
-   ret = walk_iomem_res_desc(crashk_res.desc,
-   IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
-   crashk_res.start, crashk_res.end, kbuf,
-   locate_mem_hole_callback);
-   else
-   ret = walk_system_ram_res(0, -1, kbuf,
- locate_mem_hole_callback);
+   ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
if (ret != 1) {
/* A suitable memory range could not be found for buffer */
return -EADDRNOTAVAIL;
diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
index eefd5bf960c2..4cef7e4706b0 100644
--- a/kernel/kexec_internal.h
+++ b/kernel/kexec_internal.h
@@ -20,20 +20,6 @@ struct kexec_sha_region {
unsigned long len;
 };
 
-/*
- * Keeps track of buffer parameters as provided by caller for requesting
- * memory placement of buffer.
- */
-struct kexec_buf {
-   struct kimage *image;
-   unsigned long mem;
-   unsigned long memsz;
-   unsigned long buf_align;
-   unsigned long buf_min;
-   unsigned long buf_max;
-   bool top_down;  /* allocate from top of memory hole */
-};
-
 void kimage_file_post_load_cleanup(struct kimage *image);
 #else /* CONFIG_KEXEC_FILE */
 static inline void kimage_file_post_load_cleanup(struct kimage *image) { }
-- 
1.9.1
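
To illustrate why the __weak hook matters, a PowerPC override based on
memblock might look roughly like the sketch below. This is an assumption
about what a memblock-based walker could do, not code from this series; it
uses the existing for_each_free_mem_range() iterator:

int arch_kexec_walk_mem(struct kexec_buf *kbuf,
			int (*func)(u64, u64, void *))
{
	phys_addr_t start, end;
	u64 i;
	int ret = 0;

	/* visit each free memblock region until func asks to stop */
	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
				&start, &end, NULL) {
		/* func expects an inclusive end, as in the resource walks */
		ret = func(start, end - 1, kbuf);
		if (ret)
			break;
	}

	return ret;
}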



Re: [PATCH] cpuidle/powernv: Fix snooze timeout

2016-06-22 Thread Rafael J. Wysocki
On Wed, Jun 22, 2016 at 9:36 PM, Shreyas B. Prabhu wrote:
> Snooze is a poll idle state in powernv and pseries platforms. Snooze
> has a timeout so that if a cpu stays in snooze for more than the target
> residency of the next available idle state, it exits, thereby giving the
> cpuidle governor a chance to re-evaluate and promote the cpu to a deeper
> idle state. Therefore whenever snooze exits due to this timeout, its
> last_residency will be the target_residency of the next deeper state.
>
> commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
> changed the math around the last_residency calculation. Specifically, while
> converting the last_residency value from nanoseconds to microseconds it does
> a right shift by 10. Due to this, in snooze timeout exit scenarios the
> calculated last_residency is roughly 2.3% less than the target_residency of
> the next available state. This pattern is picked up by get_typical_interval()
> in the menu governor, and therefore expected_interval in menu_select() is
> frequently less than the target_residency of any state but snooze.
>
> Due to this we are entering snooze at a higher rate, thereby affecting
> single thread performance.
> Since the math around last_residency is not meant to be precise, fix this
> issue by setting the snooze timeout to 105% of the target_residency of the
> next available idle state.
>
> This also adds a comment explaining why the snooze timeout is necessary.

Daniel, any comments?

> Reported-by: Anton Blanchard 
> Signed-off-by: Shreyas B. Prabhu 
> ---
>  drivers/cpuidle/cpuidle-powernv.c | 14 ++
>  drivers/cpuidle/cpuidle-pseries.c | 13 +
>  2 files changed, 27 insertions(+)
>
> diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
> index e12dc30..5835491 100644
> --- a/drivers/cpuidle/cpuidle-powernv.c
> +++ b/drivers/cpuidle/cpuidle-powernv.c
> @@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
> cpuidle_state_table = powernv_states;
> /* Device tree can indicate more idle states */
> max_idle_state = powernv_add_idle_states();
> +
> +   /*
> +* Staying in snooze for a long period can degrade the
> +* performance of the sibling cpus. Set timeout for snooze such
> +* that if the cpu stays in snooze longer than target residency
> +* of the next available idle state then exit from snooze. This
> +* gives a chance to the cpuidle governor to re-evaluate and
> +* promote it to deeper idle states.
> +*/
> if (max_idle_state > 1) {
> snooze_timeout_en = true;
> snooze_timeout = powernv_states[1].target_residency *
>  tb_ticks_per_usec;
> +   /*
> +* Give a 5% margin since target residency related math
> +* is not precise in cpuidle core.
> +*/
> +   snooze_timeout += snooze_timeout / 20;
> }
> } else
> return -ENODEV;
> diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
> index 07135e0..22de841 100644
> --- a/drivers/cpuidle/cpuidle-pseries.c
> +++ b/drivers/cpuidle/cpuidle-pseries.c
> @@ -250,10 +250,23 @@ static int pseries_idle_probe(void)
> } else
> return -ENODEV;
>
> +   /*
> +* Staying in snooze for a long period can degrade the
> +* performance of the sibling cpus. Set timeout for snooze such
> +* that if the cpu stays in snooze longer than target residency
> +* of the next available idle state then exit from snooze. This
> +* gives a chance to the cpuidle governor to re-evaluate and
> +* promote it to deeper idle states.
> +*/
> if (max_idle_state > 1) {
> snooze_timeout_en = true;
> snooze_timeout = cpuidle_state_table[1].target_residency *
>  tb_ticks_per_usec;
> +   /*
> +* Give a 5% margin since target residency related math
> +* is not precise in cpuidle core.
> +*/
> +   snooze_timeout += snooze_timeout / 20;
> }
> return 0;
>  }
> --
> 2.1.4
>

RE: Kernel 4.7: PAGE_GUARDED and _PAGE_NO_CACHE

2016-06-22 Thread luigi burdo
Christian, I think Darren's patch overrides some memory region that was
different on kernel.org, and this makes the patched kernel run and work.

Luigi

From: Christian Zigotzky [chzigot...@xenosoft.de]
Sent: Wednesday, 22 June 2016 21.07
To: Benjamin Herrenschmidt; Michael Ellerman; Aneesh Kumar K.V; Darren Stevens; 
linuxppc-dev@lists.ozlabs.org; Michael Ellerman; Julian Margetson; Adrian Cox; 
R.T.Dickinson; R.T.Dickinson; Pat Wall; Pat Wall; cont...@a-eon.com; Matthew 
Leaman; luigi burdo; Christian Zigotzky
Subject: Kernel 4.7: PAGE_GUARDED and _PAGE_NO_CACHE

Hi All,

Please find attached Darren's patch. With this patch, the Nemo board
boots. That means the problematic source code is somewhere in this patch.
Which file in this patch is responsible for starting the kernel?

Thanks,

Christian

On 13 June 2016 at 8:09 PM, Christian Zigotzky wrote:
> Hi Ben,
>
> I could send you a patch but it doesn't work with the three PowerPC
> commits. I think we have to fix the boot issue at first. After that we
> can integrate the first patch for the Nemo board.
>
> Cheers,
>
> Christian
>
> On 13 June 2016 at 10:19 AM, Benjamin Herrenschmidt wrote:
>> The right way to not have this problem anymore is to cleanup and
>> submit your patches upstream so they don't break all the time :-)
>>
>> Cheers,
>> Ben.
>>
>>


Re: cpuidle broken on mainline

2016-06-22 Thread Shreyas B Prabhu
Hi,

On 06/22/2016 08:46 PM, Anton Blanchard wrote:
> Hi,
> 
> I was noticing some pretty big run to run variations on single threaded
> benchmarks, and I've isolated it cpuidle issues. If I look at
> the cpuidle tracepoint, I notice we only go into the snooze state.
> 
> Do we have any known bugs in cpuidle at the moment?
> 
I've posted a fix here-
https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-June/144726.html

Thanks,
Shreyas


[PATCH] cpuidle/powernv: Fix snooze timeout

2016-06-22 Thread Shreyas B. Prabhu
Snooze is a poll idle state in powernv and pseries platforms. Snooze
has a timeout so that if a cpu stays in snooze for more than the target
residency of the next available idle state, it exits, thereby giving the
cpuidle governor a chance to re-evaluate and promote the cpu to a deeper
idle state. Therefore whenever snooze exits due to this timeout, its
last_residency will be the target_residency of the next deeper state.

commit e93e59ce5b85 ("cpuidle: Replace ktime_get() with local_clock()")
changed the math around the last_residency calculation. Specifically, while
converting the last_residency value from nanoseconds to microseconds it does
a right shift by 10. Due to this, in snooze timeout exit scenarios the
calculated last_residency is roughly 2.3% less than the target_residency of
the next available state. This pattern is picked up by get_typical_interval()
in the menu governor, and therefore expected_interval in menu_select() is
frequently less than the target_residency of any state but snooze.

Due to this we are entering snooze at a higher rate, thereby affecting
single thread performance.
Since the math around last_residency is not meant to be precise, fix this
issue by setting the snooze timeout to 105% of the target_residency of the
next available idle state.

This also adds a comment explaining why the snooze timeout is necessary.

Reported-by: Anton Blanchard 
Signed-off-by: Shreyas B. Prabhu 
---
 drivers/cpuidle/cpuidle-powernv.c | 14 ++
 drivers/cpuidle/cpuidle-pseries.c | 13 +
 2 files changed, 27 insertions(+)

diff --git a/drivers/cpuidle/cpuidle-powernv.c b/drivers/cpuidle/cpuidle-powernv.c
index e12dc30..5835491 100644
--- a/drivers/cpuidle/cpuidle-powernv.c
+++ b/drivers/cpuidle/cpuidle-powernv.c
@@ -268,10 +268,24 @@ static int powernv_idle_probe(void)
cpuidle_state_table = powernv_states;
/* Device tree can indicate more idle states */
max_idle_state = powernv_add_idle_states();
+
+   /*
+* Staying in snooze for a long period can degrade the
+* performance of the sibling cpus. Set timeout for snooze such
+* that if the cpu stays in snooze longer than target residency
+* of the next available idle state then exit from snooze. This
+* gives a chance to the cpuidle governor to re-evaluate and
+* promote it to deeper idle states.
+*/
if (max_idle_state > 1) {
snooze_timeout_en = true;
snooze_timeout = powernv_states[1].target_residency *
 tb_ticks_per_usec;
+   /*
+* Give a 5% margin since target residency related math
+* is not precise in cpuidle core.
+*/
+   snooze_timeout += snooze_timeout / 20;
}
} else
return -ENODEV;
diff --git a/drivers/cpuidle/cpuidle-pseries.c b/drivers/cpuidle/cpuidle-pseries.c
index 07135e0..22de841 100644
--- a/drivers/cpuidle/cpuidle-pseries.c
+++ b/drivers/cpuidle/cpuidle-pseries.c
@@ -250,10 +250,23 @@ static int pseries_idle_probe(void)
} else
return -ENODEV;
 
+   /*
+* Staying in snooze for a long period can degrade the
+* performance of the sibling cpus. Set timeout for snooze such
+* that if the cpu stays in snooze longer than target residency
+* of the next available idle state then exit from snooze. This
+* gives a chance to the cpuidle governor to re-evaluate and
+* promote it to deeper idle states.
+*/
if (max_idle_state > 1) {
snooze_timeout_en = true;
snooze_timeout = cpuidle_state_table[1].target_residency *
 tb_ticks_per_usec;
+   /*
+* Give a 5% margin since target residency related math
+* is not precise in cpuidle core.
+*/
+   snooze_timeout += snooze_timeout / 20;
}
return 0;
 }
-- 
2.1.4


[PATCH v2 2/2] powerpc/fadump: parse fadump reserve memory size based on memory range

2016-06-22 Thread Hari Bathini
Currently, memory for fadump can be specified with fadump_reserve_mem=size,
where only a fixed size can be specified. Add the below syntax as well, to
support conditional reservation based on system memory size:

fadump_reserve_mem=ramsize-range:size[,ramsize-range:size,...]

This syntax makes it possible to use the same command-line parameter across
systems with different memory sizes.
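
For instance (illustrative values only), a command line of

fadump_reserve_mem=4G-16G:768M,16G-64G:1G,64G-:2G

would reserve 768M on a system with between 4G and 16G of RAM, 1G between
16G and 64G, and 2G above that.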

Signed-off-by: Hari Bathini 
Reviewed-by: Mahesh J Salgaonkar 
---
Changes in v2:
1. Changed subject from "[PATCH v2 2/2] powerpc/fadump: add support to parse 
size based on memory range".
2. Rebased to latest upstream.

 arch/powerpc/kernel/fadump.c |   64 --
 1 file changed, 55 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 3cb3b02a..e435828 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -193,6 +193,56 @@ static unsigned long init_fadump_mem_struct(struct fadump_mem_struct *fdm,
return addr;
 }
 
+/*
+ * This function parses command line for fadump_reserve_mem=
+ *
+ * Supports the below two syntaxes:
+ *1. fadump_reserve_mem=size
+ *2. fadump_reserve_mem=ramsize-range:size[,...]
+ *
+ * Sets fw_dump.reserve_bootvar with the memory size
+ * provided, 0 otherwise
+ *
+ * The function returns -EINVAL on failure, 0 otherwise.
+ */
+static int __init parse_fadump_reserve_mem(void)
+{
+   char *name = "fadump_reserve_mem=";
+   char *fadump_cmdline = NULL, *cur;
+
+   fw_dump.reserve_bootvar = 0;
+
+   /* find fadump_reserve_mem and use the last one if there are many */
+   cur = strstr(boot_command_line, name);
+   while (cur) {
+   fadump_cmdline = cur;
+   cur = strstr(cur+1, name);
+   }
+
+   /* when no fadump_reserve_mem= cmdline option is provided */
+   if (!fadump_cmdline)
+   return 0;
+
+   fadump_cmdline += strlen(name);
+
+   /* for fadump_reserve_mem=size cmdline syntax */
+   if (!is_param_range_based(fadump_cmdline)) {
+   fw_dump.reserve_bootvar = memparse(fadump_cmdline, NULL);
+   return 0;
+   }
+
+   /* for fadump_reserve_mem=ramsize-range:size[,...] cmdline syntax */
+   cur = fadump_cmdline;
+   fw_dump.reserve_bootvar = parse_mem_range_size("fadump_reserve_mem",
+   &cur, memblock_phys_mem_size());
+   if (cur == fadump_cmdline) {
+   printk(KERN_INFO "fadump_reserve_mem: Invalid syntax!\n");
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
 /**
  * fadump_calculate_reserve_size(): reserve variable boot area 5% of System RAM
  *
@@ -212,12 +262,17 @@ static inline unsigned long fadump_calculate_reserve_size(void)
 {
unsigned long size;
 
+   /* sets fw_dump.reserve_bootvar */
+   parse_fadump_reserve_mem();
+
/*
 * Check if the size is specified through fadump_reserve_mem= cmdline
 * option. If yes, then use that.
 */
if (fw_dump.reserve_bootvar)
return fw_dump.reserve_bootvar;
+   else
+   printk(KERN_INFO "fadump: calculating default boot size\n");
 
/* divide by 20 to get 5% of value */
size = memblock_end_of_DRAM() / 20;
@@ -348,15 +403,6 @@ static int __init early_fadump_param(char *p)
 }
 early_param("fadump", early_fadump_param);
 
-/* Look for fadump_reserve_mem= cmdline option */
-static int __init early_fadump_reserve_mem(char *p)
-{
-   if (p)
-   fw_dump.reserve_bootvar = memparse(p, &p);
-   return 0;
-}
-early_param("fadump_reserve_mem", early_fadump_reserve_mem);
-
 static void register_fw_dump(struct fadump_mem_struct *fdm)
 {
int rc;


[PATCH v2 1/2] refactor code parsing size based on memory range

2016-06-22 Thread Hari Bathini
Currently, the crashkernel parameter supports the below syntax to parse the
size based on the memory range:

crashkernel=range1:size1[,range2:size2,...]

While such parsing is currently implemented only for the crashkernel
parameter, it can apply to other parameters with similar syntax. So, move
this code to a more generic place for code reuse.
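
For example, under the documented crashkernel semantics,
crashkernel=512M-2G:64M,2G-:128M reserves 64M when system RAM falls within
[512M, 2G) and 128M when it is 2G or more.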

Cc: Eric Biederman 
Cc: Vivek Goyal 
Cc: Rusty Russell 
Cc: ke...@lists.infradead.org
Signed-off-by: Hari Bathini 
---
changes in v2:
1. Rebased to latest upstream.
2. Marked few more people on cc.

 include/linux/kernel.h |5 +++
 kernel/kexec_core.c|   63 +++-
 kernel/params.c|   96 
 3 files changed, 106 insertions(+), 58 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 94aa10f..72f55e5 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -436,6 +436,11 @@ extern char *get_options(const char *str, int nints, int *ints);
 extern unsigned long long memparse(const char *ptr, char **retptr);
 extern bool parse_option_str(const char *str, const char *option);
 
+extern bool __init is_param_range_based(const char *cmdline);
+extern unsigned long long __init parse_mem_range_size(const char *param,
+ char **str,
+ unsigned long long system_ram);
+
 extern int core_kernel_text(unsigned long addr);
 extern int core_kernel_data(unsigned long addr);
 extern int __kernel_text_address(unsigned long addr);
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index 56b3ed0..d43f5cc 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -1083,59 +1083,9 @@ static int __init parse_crashkernel_mem(char *cmdline,
char *cur = cmdline, *tmp;
 
/* for each entry of the comma-separated list */
-   do {
-   unsigned long long start, end = ULLONG_MAX, size;
-
-   /* get the start of the range */
-   start = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("crashkernel: Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (*cur != '-') {
-   pr_warn("crashkernel: '-' expected\n");
-   return -EINVAL;
-   }
-   cur++;
-
-   /* if no ':' is here, than we read the end */
-   if (*cur != ':') {
-   end = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("crashkernel: Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (end <= start) {
-   pr_warn("crashkernel: end <= start\n");
-   return -EINVAL;
-   }
-   }
-
-   if (*cur != ':') {
-   pr_warn("crashkernel: ':' expected\n");
-   return -EINVAL;
-   }
-   cur++;
-
-   size = memparse(cur, &tmp);
-   if (cur == tmp) {
-   pr_warn("Memory value expected\n");
-   return -EINVAL;
-   }
-   cur = tmp;
-   if (size >= system_ram) {
-   pr_warn("crashkernel: invalid size\n");
-   return -EINVAL;
-   }
-
-   /* match ? */
-   if (system_ram >= start && system_ram < end) {
-   *crash_size = size;
-   break;
-   }
-   } while (*cur++ == ',');
	*crash_size = parse_mem_range_size("crashkernel", &cur, system_ram);
+   if (cur == cmdline)
+   return -EINVAL;
 
if (*crash_size > 0) {
while (*cur && *cur != ' ' && *cur != '@')
@@ -1272,7 +1222,6 @@ static int __init __parse_crashkernel(char *cmdline,
 const char *name,
 const char *suffix)
 {
-   char*first_colon, *first_space;
char*ck_cmdline;
 
BUG_ON(!crash_size || !crash_base);
@@ -1290,12 +1239,10 @@ static int __init __parse_crashkernel(char *cmdline,
return parse_crashkernel_suffix(ck_cmdline, crash_size,
suffix);
/*
-* if the commandline contains a ':', then that's the extended
+* if the parameter is range based, then that's the extended
 * syntax -- if not, it must be the classic syntax
 */
-   first_colon = strchr(ck_cmdline, ':');
-   first_space = strchr(ck_cmdline, ' ');
-   if (first_colon && (!first_space || first_colon < first_space))
+   if (is_param_range_based(ck_cmdline))

[PATCH v2 0/2] powerpc/fadump: support memory range syntax for fadump memory reservation

2016-06-22 Thread Hari Bathini
This patchset adds support for specifying the fadump reservation size based
on the system memory range. The crashkernel parameter already supports such
syntax. The first patch refactors the crashkernel parameter parsing code for
reuse. The second patch uses the newly refactored parsing code to reserve
memory for fadump based on the system memory size.

---

Hari Bathini (2):
  refactor code parsing size based on memory range
  powerpc/fadump: parse fadump reserve memory size based on memory range


 arch/powerpc/kernel/fadump.c |   64 
 include/linux/kernel.h   |5 ++
 kernel/kexec_core.c  |   63 ++--
 kernel/params.c  |   96 ++
 4 files changed, 161 insertions(+), 67 deletions(-)

--
Hari Bathini


Re: [PATCH v13 01/16] PCI: Let pci_mmap_page_range() take resource address

2016-06-22 Thread Yinghai Lu
On Wed, Jun 22, 2016 at 8:22 AM, Bjorn Helgaas  wrote:
> On Tue, Jun 21, 2016 at 09:32:49PM -0700, Yinghai Lu wrote:
>
> If sparc is broken, let's make this a tiny sparc-only patch that fixes
> only the breakage -- no cleanup or restructuring.  Then we can do the
> more extensive work in a separate patch.

ok. please check attached two patches that take position for
[PATCH v13 01/16]

>
> The example mmap() I keep asking for would be very helpful to me in
> understanding the problem.  It would probably also help folks who
> maintain user programs that use mmap.  They need to figure out whether
> they have code that worked most places but has always been broken on
> sparc, or code that depended on the previous sparc behavior and will
> be broken by this change, or what.

ok.

calling : ./test_mmap_proc /proc/bus/pci/:00/04.0 0x200

code : test_mmap_proc.c

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

#define PCIIOC_BASE ('P' << 24 | 'C' << 16 | 'I' << 8)
#define PCIIOC_CONTROLLER   (PCIIOC_BASE | 0x00)/* Get controller for PCI device. */
#define PCIIOC_MMAP_IS_IO   (PCIIOC_BASE | 0x01)/* Set mmap state to I/O space. */
#define PCIIOC_MMAP_IS_MEM  (PCIIOC_BASE | 0x02)/* Set mmap state to MEM space. */
#define PCIIOC_WRITE_COMBINE(PCIIOC_BASE | 0x03)/* Enable/disable write-combining. */

/* sparc64 */
#define PAGE_SHIFT 13
#define PAGE_SIZE (1UL << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE-1))

int main(int argc, char *argv[])
{
  int i;
  int fd;
  char *addr;
  unsigned long offset = 0;
  unsigned long left = 0;

  fd = open(argv[1], O_RDONLY);
  if (fd < 0) {
perror("open");
return 1;
  }

  if (argc > 2)
 sscanf(argv[2], "0x%lx", &offset);

  left = offset & (PAGE_SIZE - 1);
  offset &= PAGE_MASK;

  ioctl(fd, PCIIOC_MMAP_IS_MEM);
  addr = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, offset);
  if (addr == MAP_FAILED) {
perror("mmap");
return 1;
  }

  printf("map finished\n");


  for (i = 0; i < 8; i++)
printf("%x ", addr[i + left]);
  printf("\n");

  munmap(addr, PAGE_SIZE);

  close(fd);

  return 0;
}
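
The test program builds with a plain invocation such as
"gcc -o test_mmap_proc test_mmap_proc.c"; nothing kernel-specific is needed
on the userspace side.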
From: Yinghai Lu 
Subject: [PATCH v13.update.part1 01/16] PCI: Fix proc mmap on sparc

Commit 8c05cd08a7 ("PCI: fix offset check for sysfs mmapped files") tried to
check the exposed value against the resource start/end in the proc mmap path.

|start = vma->vm_pgoff;
|size = ((pci_resource_len(pdev, resno) - 1) >> PAGE_SHIFT) + 1;
|pci_start = (mmap_api == PCI_MMAP_PROCFS) ?
|pci_resource_start(pdev, resno) >> PAGE_SHIFT : 0;
|if (start >= pci_start && start < pci_start + size &&
|start + nr <= pci_start + size)

vma->vm_pgoff in the above code segment is the user/BAR value >> PAGE_SHIFT.
pci_start is resource->start >> PAGE_SHIFT.

For sparc, the resource start is different from the BAR start, aka the pci
bus address. An offset must be added to the pci bus address to get the
resource start.

So that commit breaks all arches where the exposed value is the BAR/user
value and needs to be offset to obtain the resource address.
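
To make that concrete with illustrative numbers (not from a real machine):
if a BAR sits at pci bus address 0x02000000 and the host bridge translates
bus addresses by adding 0x800000000000, the struct resource holds
0x800002000000. User space mmaps with the bus-address-derived value, so a
check against the resource start can never match unless the offset is
applied, which is what pci_resource_to_user() accounts for in the patch below.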

test code using: ./test_mmap_proc /proc/bus/pci/:00/04.0 0x200
test code segment:
  fd = open(argv[1], O_RDONLY);
  ...
  sscanf(argv[2], "0x%lx", &offset);
  left = offset & (PAGE_SIZE - 1);
  offset &= PAGE_MASK;
  ioctl(fd, PCIIOC_MMAP_IS_MEM);
  addr = mmap(NULL, PAGE_SIZE, PROT_READ, MAP_SHARED, fd, offset);
  for (i = 0; i < 8; i++)
printf("%x ", addr[i + left]);
  munmap(addr, PAGE_SIZE);
  close(fd);

Fixes: 8c05cd08a7 ("PCI: fix offset check for sysfs mmapped files")
Signed-off-by: Yinghai Lu 

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index bcd10c7..e907154 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -974,15 +974,20 @@ void pci_remove_legacy_files(struct pci_bus *b)
 int pci_mmap_fits(struct pci_dev *pdev, int resno, struct vm_area_struct *vma,
 		  enum pci_mmap_api mmap_api)
 {
-	unsigned long nr, start, size, pci_start;
+	unsigned long nr, start, size, pci_start = 0;
 
 	if (pci_resource_len(pdev, resno) == 0)
 		return 0;
 	nr = vma_pages(vma);
 	start = vma->vm_pgoff;
 	size = ((pci_resource_len(pdev, resno) - 1) >> PAGE_SHIFT) + 1;
-	pci_start = (mmap_api == PCI_MMAP_PROCFS) ?
-			pci_resource_start(pdev, resno) >> PAGE_SHIFT : 0;
+	if (mmap_api == PCI_MMAP_PROCFS) {
+		resource_size_t user_start, user_end;
+
+		pci_resource_to_user(pdev, resno, &pdev->resource[resno],
+				     &user_start, &user_end);
+		pci_start = user_start >> PAGE_SHIFT;
+	}
 	if (start >= pci_start && start < pci_start + size &&
 			start + nr <= pci_start + size)
 		return 1;
From: Yinghai Lu 
Subject: [PATCH v13.update.part2 01/16] PCI: Let pci_mmap_page_range() take resource address

The original pci_mmap_page_range() takes the PCI BAR value, aka the user address.

Bjorn found out that it would be much simpler to pass the resource address
directly and avoid the extra __pci_mmap_make_offset() calls.

In this patch:

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Andy Lutomirski
On Wed, Jun 22, 2016 at 11:40 AM, Josh Poimboeuf  wrote:
> On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote:
>> >
>> > So are you suggesting something like:
>> >
>> > .macro ENTRY_CALL func pt_regs_offset=0
>> > call \func
>> > 1:  .pushsection .entry_calls, "a"
>> > .long 1b - .
>> > .long \pt_regs_offset
>> > .popsection
>> > .endm
>> >
>> > and then change every call in the entry code to ENTRY_CALL?
>>
>> Yes, exactly, modulo whether the section name is good.  hpa is
>> probably the authority on that.
>
> Well, as you probably know, I don't really like peppering ENTRY_CALL
> everywhere. :-/

Me neither.  But at least it's less constraining on the
already-fairly-hairy code.

>
> Also I wonder how we could annotate the hypercalls, for example
> DISABLE_INTERRUPTS actually wraps the call in a push/pop pair.

Oh, yuck.  But forcing all the DISABLE_INTERRUPTS and
ENABLE_INTERRUPTS invocations to be in frame pointer regions isn't so
great either.

DWARF solves this problem completely and IMO fairly cleanly.  Maybe we
should add your task flag and then consider removing it again when
DWARF happens.

Kernel 4.7: PAGE_GUARDED and _PAGE_NO_CACHE

2016-06-22 Thread Christian Zigotzky
Hi All,

Please find attached Darren's patch. With this patch, the Nemo board
boots. That means the problematic source code is somewhere in this patch.
Which file in this patch is responsible for starting the kernel?

Thanks,

Christian

On 13 June 2016 at 8:09 PM, Christian Zigotzky wrote:
> Hi Ben,
>
> I could send you a patch but it doesn't work with the three PowerPC
> commits. I think we have to fix the boot issue at first. After that we
> can integrate the first patch for the Nemo board.
>
> Cheers,
>
> Christian
>
> On 13 June 2016 at 10:19 AM, Benjamin Herrenschmidt wrote:
>> The right way to not have this problem anymore is to cleanup and
>> submit your patches upstream so they don't break all the time :-)
>>
>> Cheers,
>> Ben.
>>
>>

diff --git a/arch/powerpc/include/asm/book3s/64/hash.h b/arch/powerpc/include/asm/book3s/64/hash.h
index f61cad3..cd3e915 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -45,17 +45,17 @@
 /*
  * Define the address range of the kernel non-linear virtual area
  */
-#define H_KERN_VIRT_START	ASM_CONST(0xD000000000000000)
-#define H_KERN_VIRT_SIZE	ASM_CONST(0x0000100000000000)
+#define KERN_VIRT_START	ASM_CONST(0xD000000000000000)
+#define KERN_VIRT_SIZE	ASM_CONST(0x0000100000000000)
 
 /*
  * The vmalloc space starts at the beginning of that region, and
  * occupies half of it on hash CPUs and a quarter of it on Book3E
  * (we keep a quarter for the virtual memmap)
  */
-#define H_VMALLOC_START	H_KERN_VIRT_START
-#define H_VMALLOC_SIZE	(H_KERN_VIRT_SIZE >> 1)
-#define H_VMALLOC_END	(H_VMALLOC_START + H_VMALLOC_SIZE)
+#define VMALLOC_START	KERN_VIRT_START
+#define VMALLOC_SIZE	(KERN_VIRT_SIZE >> 1)
+#define VMALLOC_END	(VMALLOC_START + VMALLOC_SIZE)
 
 /*
  * Region IDs
@@ -64,7 +64,7 @@
 #define REGION_MASK		(0xfUL << REGION_SHIFT)
 #define REGION_ID(ea)		(((unsigned long)(ea)) >> REGION_SHIFT)
 
-#define VMALLOC_REGION_ID	(REGION_ID(H_VMALLOC_START))
+#define VMALLOC_REGION_ID	(REGION_ID(VMALLOC_START))
 #define KERNEL_REGION_ID	(REGION_ID(PAGE_OFFSET))
 #define VMEMMAP_REGION_ID	(0xfUL)	/* Server only */
 #define USER_REGION_ID		(0UL)
@@ -73,7 +73,7 @@
  * Defines the address of the vmemap area, in its own region on
  * hash table CPUs.
  */
-#define H_VMEMMAP_BASE		(VMEMMAP_REGION_ID << REGION_SHIFT)
+#define VMEMMAP_BASE		(VMEMMAP_REGION_ID << REGION_SHIFT)
 
 #ifdef CONFIG_PPC_MM_SLICES
 #define HAVE_ARCH_UNMAPPED_AREA
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 88a5eca..bdfea62 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -218,18 +218,6 @@ extern unsigned long __pte_frag_size_shift;
 #define PUD_MASKED_BITS		0xc0000000000000ffUL
 /* Bits to mask out from a PGD to get to the PUD page */
 #define PGD_MASKED_BITS		0xc0000000000000ffUL
-
-extern unsigned long __vmalloc_start;
-extern unsigned long __vmalloc_end;
-#define VMALLOC_START	__vmalloc_start
-#define VMALLOC_END	__vmalloc_end
-
-extern unsigned long __kernel_virt_start;
-extern unsigned long __kernel_virt_size;
-#define KERN_VIRT_START __kernel_virt_start
-#define KERN_VIRT_SIZE  __kernel_virt_size
-extern struct page *vmemmap;
-extern unsigned long ioremap_bot;
 #endif /* __ASSEMBLY__ */
 
 #include 
@@ -242,6 +230,7 @@ extern unsigned long ioremap_bot;
 #endif
 
 #include 
+
 /*
  * The second half of the kernel virtual space is used for IO mappings,
  * it's itself carved into the PIO region (ISA and PHB IO space) and
@@ -260,6 +249,8 @@ extern unsigned long ioremap_bot;
 #define IOREMAP_BASE	(PHB_IO_END)
 #define IOREMAP_END	(KERN_VIRT_START + KERN_VIRT_SIZE)
 
+#define vmemmap			((struct page *)VMEMMAP_BASE)
+
 /* Advertise special mapping type for AGP */
 #define HAVE_PAGE_AGP
 
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 937d4e2..a8b24d6 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -31,74 +31,6 @@
 			  RADIX_PUD_INDEX_SIZE + RADIX_PGD_INDEX_SIZE + PAGE_SHIFT)
 #define RADIX_PGTABLE_RANGE (ASM_CONST(1) << RADIX_PGTABLE_EADDR_SIZE)
 
-/*
- * We support 52 bit address space, Use top bit for kernel
- * virtual mapping. Also make sure kernel fit in the top
- * quadrant.
- *
- *
- * [diagram: kernel virtual map (0xc008000000000000) sits above the kernel
- *  linear map (0xc000000000000000) in the top (0b11) quadrant, with
- *  quadrants 2, 1 and 0 below]

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Josh Poimboeuf
On Wed, Jun 22, 2016 at 11:26:21AM -0700, Andy Lutomirski wrote:
> On Wed, Jun 22, 2016 at 11:22 AM, Josh Poimboeuf  wrote:
> > On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:
> >> > So I got a chance to look at this some more.  I'm thinking that to make
> >> > this feature more consistently useful, we shouldn't only annotate
> >> > pt_regs frames for calls to handlers; other calls should be annotated as
> >> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
> >> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> >> > etc.  That way, the unwinder will always be able to find pt_regs from an
> >> > interrupt/exception, even if starting from one of these other calls.
> >> >
> >> > But then, things get ugly.  You have to either setup and tear down the
> >> > frame for every possible call, or do a higher-level setup/teardown
> >> > across multiple calls, which invalidates several assumptions in the
> >> > entry code about the location of pt_regs on the stack.
> >> >
> >> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
> >> > make assumptions about the location of pt_regs.  And they're used by
> >> > both syscall and interrupt code.  So if we didn't create a frame pointer
> >> > header for syscalls, we'd basically need two versions of the macros: one
> >> > for irqs/exceptions and one for syscalls.
> >> >
> >> > So I think the cleanest way to handle this is to always allocate two
> >> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
> >> > entry code can assume that pt_regs is at a constant location, and all
> >> > the above problems go away.  Another benefit is that we'd only need two
> >> > saves instead of three -- the pointer to pt_regs is no longer needed
> >> > since pt_regs is always immediately after the frame header.
> >> >
> >> > I worked up a patch to implement this -- see below.  It writes the frame
> >> > pointer in all entry paths, including syscalls.  This helps keep the
> >> > code simple.
> >> >
> >> > The downside is a small performance penalty: with getppid()-in-a-loop on
> >> > my laptop, the average syscall went from 52ns to 53ns, which is about a
> >> > 2% slowdown.  But I doubt it would be measurable in a real-world
> >> > workload.
> >> >
> >> > It looks like about half the slowdown is due to the extra stack
> >> > allocation (which presumably adds a little d-cache pressure on the stack
> >> > memory) and the other half is due to the stack writes.
> >> >
> >> > I could remove the writes from the syscall path but it would only save
> >> > about half a ns, and it would make the code less robust.  Plus it's nice
> >> > to have the consistency of having *all* pt_regs frames annotated.
> >>
> >> This is a bit messy, and I'm not really sure that the entry code
> >> should have to operate under constraints like this.  Also,
> >> convincing myself this works for NMI sounds unpleasant.
> >>
> >> Maybe we should go back to my idea of just listing the call sites in a 
> >> table.
> >
> > So are you suggesting something like:
> >
> > .macro ENTRY_CALL func pt_regs_offset=0
> > call \func
> > 1:  .pushsection .entry_calls, "a"
> > .long 1b - .
> > .long \pt_regs_offset
> > .popsection
> > .endm
> >
> > and then change every call in the entry code to ENTRY_CALL?
> 
> Yes, exactly, modulo whether the section name is good.  hpa is
> probably the authority on that.

Well, as you probably know, I don't really like peppering ENTRY_CALL
everywhere. :-/

Also I wonder how we could annotate the hypercalls, for example
DISABLE_INTERRUPTS actually wraps the call in a push/pop pair.

-- 
Josh

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Andy Lutomirski
On Wed, Jun 22, 2016 at 11:22 AM, Josh Poimboeuf  wrote:
> On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:
>> > So I got a chance to look at this some more.  I'm thinking that to make
>> > this feature more consistently useful, we shouldn't only annotate
>> > pt_regs frames for calls to handlers; other calls should be annotated as
>> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
>> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
>> > etc.  That way, the unwinder will always be able to find pt_regs from an
>> > interrupt/exception, even if starting from one of these other calls.
>> >
>> > But then, things get ugly.  You have to either setup and tear down the
>> > frame for every possible call, or do a higher-level setup/teardown
>> > across multiple calls, which invalidates several assumptions in the
>> > entry code about the location of pt_regs on the stack.
>> >
>> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
>> > make assumptions about the location of pt_regs.  And they're used by
>> > both syscall and interrupt code.  So if we didn't create a frame pointer
>> > header for syscalls, we'd basically need two versions of the macros: one
>> > for irqs/exceptions and one for syscalls.
>> >
>> > So I think the cleanest way to handle this is to always allocate two
>> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
>> > entry code can assume that pt_regs is at a constant location, and all
>> > the above problems go away.  Another benefit is that we'd only need two
>> > saves instead of three -- the pointer to pt_regs is no longer needed
>> > since pt_regs is always immediately after the frame header.
>> >
>> > I worked up a patch to implement this -- see below.  It writes the frame
>> > pointer in all entry paths, including syscalls.  This helps keep the
>> > code simple.
>> >
>> > The downside is a small performance penalty: with getppid()-in-a-loop on
>> > my laptop, the average syscall went from 52ns to 53ns, which is about a
>> > 2% slowdown.  But I doubt it would be measurable in a real-world
>> > workload.
>> >
>> > It looks like about half the slowdown is due to the extra stack
>> > allocation (which presumably adds a little d-cache pressure on the stack
>> > memory) and the other half is due to the stack writes.
>> >
>> > I could remove the writes from the syscall path but it would only save
>> > about half a ns, and it would make the code less robust.  Plus it's nice
>> > to have the consistency of having *all* pt_regs frames annotated.
>>
>> This is a bit messy, and I'm not really sure that the entry code
> should have to operate under constraints like this.  Also,
>> convincing myself this works for NMI sounds unpleasant.
>>
>> Maybe we should go back to my idea of just listing the call sites in a table.
>
> So are you suggesting something like:
>
> .macro ENTRY_CALL func pt_regs_offset=0
> call \func
> 1:  .pushsection .entry_calls, "a"
> .long 1b - .
> .long \pt_regs_offset
> .popsection
> .endm
>
> and then change every call in the entry code to ENTRY_CALL?

Yes, exactly, modulo whether the section name is good.  hpa is
probably the authority on that.

--Andy

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Josh Poimboeuf
On Wed, Jun 22, 2016 at 10:59:23AM -0700, Andy Lutomirski wrote:
> > So I got a chance to look at this some more.  I'm thinking that to make
> > this feature more consistently useful, we shouldn't only annotate
> > pt_regs frames for calls to handlers; other calls should be annotated as
> > well: preempt_schedule_irq, CALL_enter_from_user_mode,
> > prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
> > etc.  That way, the unwinder will always be able to find pt_regs from an
> > interrupt/exception, even if starting from one of these other calls.
> >
> > But then, things get ugly.  You have to either setup and tear down the
> > frame for every possible call, or do a higher-level setup/teardown
> > across multiple calls, which invalidates several assumptions in the
> > entry code about the location of pt_regs on the stack.
> >
> > Also problematic is that several of the macros (like TRACE_IRQS_IRETQ)
> > make assumptions about the location of pt_regs.  And they're used by
> > both syscall and interrupt code.  So if we didn't create a frame pointer
> > header for syscalls, we'd basically need two versions of the macros: one
> > for irqs/exceptions and one for syscalls.
> >
> > So I think the cleanest way to handle this is to always allocate two
> > extra registers on the stack in ALLOC_PT_GPREGS_ON_STACK.  Then all
> > entry code can assume that pt_regs is at a constant location, and all
> > the above problems go away.  Another benefit is that we'd only need two
> > saves instead of three -- the pointer to pt_regs is no longer needed
> > since pt_regs is always immediately after the frame header.
> >
> > I worked up a patch to implement this -- see below.  It writes the frame
> > pointer in all entry paths, including syscalls.  This helps keep the
> > code simple.
> >
> > The downside is a small performance penalty: with getppid()-in-a-loop on
> > my laptop, the average syscall went from 52ns to 53ns, which is about a
> > 2% slowdown.  But I doubt it would be measurable in a real-world
> > workload.
> >
> > It looks like about half the slowdown is due to the extra stack
> > allocation (which presumably adds a little d-cache pressure on the stack
> > memory) and the other half is due to the stack writes.
> >
> > I could remove the writes from the syscall path but it would only save
> > about half a ns, and it would make the code less robust.  Plus it's nice
> > to have the consistency of having *all* pt_regs frames annotated.
> 
> This is a bit messy, and I'm not really sure that the entry code
> should have to operate under constraints like this.  Also,
> convincing myself this works for NMI sounds unpleasant.
> 
> Maybe we should go back to my idea of just listing the call sites in a table.

So are you suggesting something like:

.macro ENTRY_CALL func pt_regs_offset=0
call \func
1:  .pushsection .entry_calls, "a"
.long 1b - .
.long \pt_regs_offset
.popsection
.endm

and then change every call in the entry code to ENTRY_CALL?

-- 
Josh

Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Andy Lutomirski
On Wed, Jun 22, 2016 at 9:30 AM, Josh Poimboeuf  wrote:
> On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote:
>> On May 23, 2016 7:28 PM, "Josh Poimboeuf"  wrote:
>> > > Maybe I'm coming around to liking this idea.
>> >
>> > Ok, good :-)
>> >
>> > > In an ideal world (DWARF support, high-quality unwinder, nice pretty
>> > > printer, etc), unwinding across a kernel exception would look like:
>> > >
>> > >  - some_func
>> > >  - some_other_func
>> > >  - do_page_fault
>> > >  - page_fault
>> > >
>> > > After page_fault, the next unwind step takes us to the faulting RIP
>> > > (faulting_func) and reports that all GPRs are known.  It should
>> > > probably learn this fact from DWARF if DWARF is available, instead of
>> > > directly decoding pt_regs, due to a few funny cases in which pt_regs
>> > > may be incomplete.  A nice pretty printer could now print all the
>> > > regs.
>> > >
>> > >  - faulting_func
>> > >  - etc.
>> > >
>> > > For this to work, we don't actually need the unwinder to explicitly
>> > > know where pt_regs is.
>> >
>> > That's true (but only for DWARF).
>> >
>> > > Food for thought, though: if user code does SYSENTER with TF set,
>> > > you'll end up with partial pt_regs.  There's nothing the kernel can do
>> > > about it.  DWARF will handle it without any fanfare, but I don't know
>> > > if it's going to cause trouble for you pre-DWARF.
>> >
>> > In this case it should see the stack pointer is past the pt_regs offset,
>> > so it would just report it as an empty stack.
>>
>> OK
>>
>> >
>> > > I'm also not sure it makes sense to apply this before the unwinder
>> > > that can consume it is ready.  Maybe, if it would be consistent with
>> > > your plans, it would make sense to rewrite the unwinder first, then
>> > > apply this and teach live patching to use the new unwinder, and *then*
>> > > add DWARF support?
>> >
>> > For the purposes of livepatch, the reliable unwinder needs to detect
>> > whether an interrupt/exception pt_regs frame exists on a sleeping task
>> > (or current).  This patch would allow us to do that.
>> >
>> > So my preferred order of doing things would be:
>> >
>> > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
>> > 2) this patch for annotating pt_regs stack frames
>> > 3) reliable unwinder, similar to what I already posted, except it relies
>> >on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
>> >the new inactive task frame format of #1
>> > 4) livepatch consistency model which uses the reliable unwinder
>> > 5) rewrite unwinder, and port all users to the new interface
>> > 6) add DWARF unwinder
>> >
>> > 1-4 are pretty much already written, whereas 5 and 6 will take
>> > considerably more work.
>>
>> Fair enough.
>>
>> >
>> > > > +   /*
>> > > > +* Create a stack frame for the saved pt_regs.  This allows 
>> > > > frame
>> > > > +* pointer based unwinders to find pt_regs on the stack.
>> > > > +*/
>> > > > +   .macro CREATE_PT_REGS_FRAME regs=%rsp
>> > > > +#ifdef CONFIG_FRAME_POINTER
>> > > > +   pushq   \regs
>> > > > +   pushq   $pt_regs+1
>> > >
>> > > Why the +1?
>> >
>> > Some unwinders like gdb are smart enough to report the function which
>> > contains the instruction *before* the return address.  Without the +1,
>> > they would show the wrong function.
>>
>> Lovely.  Want to add a comment?
>>
>> >
>> > > > +   pushq   %rbp
>> > > > +   movq%rsp, %rbp
>> > > > +#endif
>> > > > +   .endm
>> > >
>> > > I keep wanting this to be only two pushes and to fudge rbp to make it
>> > > work, but I don't see a good way.  But let's call it
>> > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
>> > > nested_frame or similar.
>> >
>> > Or, if we aren't going to annotate syscall pt_regs, we could give it a
>> > more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?
>>
>> CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
>> too.  CREATE_PT_REGS_FRAME is probably fine.
>>
>> > > > +
>> > > > +/* fake function which allows stack unwinders to detect pt_regs 
>> > > > frames */
>> > > > +#ifdef CONFIG_FRAME_POINTER
>> > > > +ENTRY(pt_regs)
>> > > > +   nop
>> > > > +   nop
>> > > > +ENDPROC(pt_regs)
>> > > > +#endif /* CONFIG_FRAME_POINTER */
>> > >
>> > > Why is this two bytes long?  Is there some reason it has to be more
>> > > than one byte?
>> >
>> > Similar to above, this is related to the need to support various
>> > unwinders.  Whether the unwinder displays the ret_addr or the
>> > instruction preceding it, either way the instruction needs to be inside
>> > the function for the function to be reported.
>>
>> OK.
>
> Andy,
>
> So I got a chance to look at this some more.  I'm thinking that to make
> this feature more consistently useful, we shouldn't only annotate
> pt_regs frames for calls to handlers; other calls should be annotated as
> well: 

Re: [PATCH v3 0/9] kexec_file_load implementation for PowerPC

2016-06-22 Thread Thiago Jung Bauermann
Hello Balbir,

On Wednesday, 22 June 2016 23:29:46, Balbir Singh wrote:
> On Tue, 21 Jun 2016 16:48:32 -0300
> Thiago Jung Bauermann  wrote:
> > This patch series implements the kexec_file_load system call on
> > PowerPC.
> > 
> > This system call moves the reading of the kernel, initrd and the
> > device tree from the userspace kexec tool to the kernel. This is
> > needed if you want to do one or both of the following:
> > 
> > 1. only allow loading of signed kernels.
> > 2. "measure" (i.e., record the hashes of) the kernel, initrd, kernel
> > 
> >command line and other boot inputs for the Integrity Measurement
> >Architecture subsystem.
> > 
> > The above are the functions kexec already has built into
> > kexec_file_load. Yesterday I posted a set of patches which allows a
> > third feature:
> > 
> > 3. have IMA pass-on its event log (where integrity measurements are
> > 
> >registered) accross kexec to the second kernel, so that the event
> >history is preserved.
> 
> OK.. and this is safe? Do both the kernels need to be signed by the
> same certificate?

They don't. The integrity of the event log (assuming that is what you mean 
by "this" in "this is safe") is guaranteed by the TPM device. Each event in 
the measurement list extends a PCR and records its PCR value. It is 
cryptographically guaranteed that if you replay the PCR extends recorded in
the event log and at the end of the process they match the current PCR
values in the TPM device, then that event log is correct.
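
A sketch of that replay check, with a hypothetical event type and a
hypothetical sha1_concat() helper (out = SHA1(a || b)), purely to illustrate
the argument:

static bool event_log_matches_tpm(const struct tpm_event *log, int count,
				  const u8 tpm_pcr[20])
{
	u8 pcr[20] = { 0 };	/* PCRs start zeroed at power-on */
	int i;

	/* replay every recorded extend: pcr = SHA1(pcr || digest) */
	for (i = 0; i < count; i++)
		sha1_concat(pcr, log[i].digest, pcr);

	/* the log is trustworthy only if the replay lands on the live PCR */
	return memcmp(pcr, tpm_pcr, 20) == 0;
}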

The kernel signature serves to ensure that you only run kernels from an 
authorized provider. It doesn't play a role in integrity assurance, which 
aims to verify that the machine is really running the code it says it is 
running. As I understand it, at least. It's a bit subtle and I could be 
missing something...

[]'s
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCHv2 1/7] ppc bpf/jit: Disable classic BPF JIT on ppc64le

2016-06-22 Thread Thadeu Lima de Souza Cascardo
On Wed, Jun 22, 2016 at 09:55:01PM +0530, Naveen N. Rao wrote:
> Classic BPF JIT was never ported completely to work on little endian
> powerpc. However, it can be enabled and will crash the system when used.
> As such, disable use of BPF JIT on ppc64le.

Thanks, Naveen.

Acked-by: Thadeu Lima de Souza Cascardo 

> 
> Cc: sta...@vger.kernel.org
> Cc: Matt Evans 
> Cc: Denis Kirjanov 
> Cc: Michael Ellerman 
> Cc: Paul Mackerras 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: "David S. Miller" 
> Cc: Ananth N Mavinakayanahalli 
> Cc: Thadeu Lima de Souza Cascardo 
> Reported-by: Thadeu Lima de Souza Cascardo 
> Signed-off-by: Naveen N. Rao 
> ---
>  arch/powerpc/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 01f7464..0a9d439 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -128,7 +128,7 @@ config PPC
>   select IRQ_FORCED_THREADING
>   select HAVE_RCU_TABLE_FREE if SMP
>   select HAVE_SYSCALL_TRACEPOINTS
> - select HAVE_CBPF_JIT
> + select HAVE_CBPF_JIT if CPU_BIG_ENDIAN
>   select HAVE_ARCH_JUMP_LABEL
>   select ARCH_HAVE_NMI_SAFE_CMPXCHG
>   select ARCH_HAS_GCOV_PROFILE_ALL
> -- 
> 2.8.2
> 

Re: [PATCH 0/6] kexec_file: Add buffer hand-over for the next kernel

2016-06-22 Thread Thiago Jung Bauermann
Hello Dave,

Thanks for your considerations on this feature.

On Wednesday, 22 June 2016 at 09:20:46, Dave Young wrote:
> On 06/20/16 at 10:44pm, Thiago Jung Bauermann wrote:
> > This feature was implemented because the Integrity Measurement
> > Architecture subsystem needs to preserve its measurement list across
> > the kexec reboot. This is so that IMA can implement trusted boot
> > support on the OpenPower platform, because on such systems an
> > intermediary Linux instance running as part of the firmware is used to
> > boot the target operating system via kexec. Using this mechanism, IMA
> > on this intermediary instance can hand over to the target OS the
> > measurements of the components that were used to boot it.
> We have CONFIG_KEXEC_VERIFY_SIG, why not verifying the kernel to be
> loaded instead?  I feel IMA should rebuild its measurement instead of
> passing it to another kernel.

In trusted boot, each stage of the boot process (firmware, boot loader, 
target OS) measures the following stage before passing control to it, and 
records that measurement cumulatively so that the target OS can look back 
and see measurements of all the components that were used from the earliest 
boot stages until the target OS was loaded (including a measurement of the 
OS itself).

If IMA had to rebuild the measurements, it would mean that one stage is
measuring itself. That violates the design property of the trusted boot
process that each boot stage is measured by the one before it, so it's not
really an option. It has to receive the measurements from the boot stage
that ran before it.

> Kexec reboot is also a reboot. If we have
> to preserve something get from firmware we can do it, but other than
> that I think it sounds not a good idea.

OpenPower uses a Linux kernel (and initrd with a tiny system image) as a 
boot loader, so on this platform a kexec reboot is not a reboot. It is part
of the boot process itself as the way of passing control from the boot 
loader to the target OS.

> > This series applies on top of v2 of the "kexec_file_load implementation
> 
> > for PowerPC" patch series at:
> The kexec_file_load patches should be addressed first, no?

Yes. I posted this series for two reasons:

1. The PowerPC maintainer asked why he would want to have the
kexec_file_load system call, and this feature is one of the reasons. I
wanted to show that it is not a hypothetical feature; there is a
functioning implementation.

2. I want to start discussion on this feature with the community early, so 
that I can incorporate feedback and have it ready to be accepted (or closer 
to ready at least) by the time the kexec_file_load patches are accepted.

[]'s
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [RFC PATCH v2 05/18] sched: add task flag for preempt IRQ tracking

2016-06-22 Thread Josh Poimboeuf
On Mon, May 23, 2016 at 08:52:12PM -0700, Andy Lutomirski wrote:
> On May 23, 2016 7:28 PM, "Josh Poimboeuf"  wrote:
> > > Maybe I'm coming around to liking this idea.
> >
> > Ok, good :-)
> >
> > > In an ideal world (DWARF support, high-quality unwinder, nice pretty
> > > printer, etc), unwinding across a kernel exception would look like:
> > >
> > >  - some_func
> > >  - some_other_func
> > >  - do_page_fault
> > >  - page_fault
> > >
> > > After page_fault, the next unwind step takes us to the faulting RIP
> > > (faulting_func) and reports that all GPRs are known.  It should
> > > probably learn this fact from DWARF if DWARF is available, instead of
> > > directly decoding pt_regs, due to a few funny cases in which pt_regs
> > > may be incomplete.  A nice pretty printer could now print all the
> > > regs.
> > >
> > >  - faulting_func
> > >  - etc.
> > >
> > > For this to work, we don't actually need the unwinder to explicitly
> > > know where pt_regs is.
> >
> > That's true (but only for DWARF).
> >
> > > Food for thought, though: if user code does SYSENTER with TF set,
> > > you'll end up with partial pt_regs.  There's nothing the kernel can do
> > > about it.  DWARF will handle it without any fanfare, but I don't know
> > > if it's going to cause trouble for you pre-DWARF.
> >
> > In this case it should see the stack pointer is past the pt_regs offset,
> > so it would just report it as an empty stack.
> 
> OK
> 
> >
> > > I'm also not sure it makes sense to apply this before the unwinder
> > > that can consume it is ready.  Maybe, if it would be consistent with
> > > your plans, it would make sense to rewrite the unwinder first, then
> > > apply this and teach live patching to use the new unwinder, and *then*
> > > add DWARF support?
> >
> > For the purposes of livepatch, the reliable unwinder needs to detect
> > whether an interrupt/exception pt_regs frame exists on a sleeping task
> > (or current).  This patch would allow us to do that.
> >
> > So my preferred order of doing things would be:
> >
> > 1) Brian Gerst's switch_to() cleanup and any related unwinder fixes
> > 2) this patch for annotating pt_regs stack frames
> > 3) reliable unwinder, similar to what I already posted, except it relies
> >on this patch instead of PF_PREEMPT_IRQ, and knows how to deal with
> >the new inactive task frame format of #1
> > 4) livepatch consistency model which uses the reliable unwinder
> > 5) rewrite unwinder, and port all users to the new interface
> > 6) add DWARF unwinder
> >
> > 1-4 are pretty much already written, whereas 5 and 6 will take
> > considerably more work.
> 
> Fair enough.
> 
> >
> > > > +   /*
> > > > +* Create a stack frame for the saved pt_regs.  This allows 
> > > > frame
> > > > +* pointer based unwinders to find pt_regs on the stack.
> > > > +*/
> > > > +   .macro CREATE_PT_REGS_FRAME regs=%rsp
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > +   pushq   \regs
> > > > +   pushq   $pt_regs+1
> > >
> > > Why the +1?
> >
> > Some unwinders like gdb are smart enough to report the function which
> > contains the instruction *before* the return address.  Without the +1,
> > they would show the wrong function.
> 
> Lovely.  Want to add a comment?
> 
> >
> > > > +   pushq   %rbp
> > > > +   movq%rsp, %rbp
> > > > +#endif
> > > > +   .endm
> > >
> > > I keep wanting this to be only two pushes and to fudge rbp to make it
> > > work, but I don't see a good way.  But let's call it
> > > CREATE_NESTED_ENTRY_FRAME or something, and let's rename pt_regs to
> > > nested_frame or similar.
> >
> > Or, if we aren't going to annotate syscall pt_regs, we could give it a
> > more specific name.  CREATE_INTERRUPT_FRAME and interrupt_frame()?
> 
> CREATE_INTERRUPT_FRAME is confusing because it applies to idtentry,
> too.  CREATE_PT_REGS_FRAME is probably fine.
> 
> > > > +
> > > > +/* fake function which allows stack unwinders to detect pt_regs frames 
> > > > */
> > > > +#ifdef CONFIG_FRAME_POINTER
> > > > +ENTRY(pt_regs)
> > > > +   nop
> > > > +   nop
> > > > +ENDPROC(pt_regs)
> > > > +#endif /* CONFIG_FRAME_POINTER */
> > >
> > > Why is this two bytes long?  Is there some reason it has to be more
> > > than one byte?
> >
> > Similar to above, this is related to the need to support various
> > unwinders.  Whether the unwinder displays the ret_addr or the
> > instruction preceding it, either way the instruction needs to be inside
> > the function for the function to be reported.
> 
> OK.

Andy,

So I got a chance to look at this some more.  I'm thinking that to make
this feature more consistently useful, we shouldn't only annotate
pt_regs frames for calls to handlers; other calls should be annotated as
well: preempt_schedule_irq, CALL_enter_from_user_mode,
prepare_exit_to_usermode, SWAPGS, TRACE_IRQS_OFF, DISABLE_INTERRUPTS,
etc.  That way, the unwinder will always be able to find pt_regs from an
interrupt/exception, even 
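
A minimal sketch of how a frame-pointer unwinder could consume this
annotation (editorial illustration: the frame layout and marker handling
are assumptions modeled on the CREATE_PT_REGS_FRAME discussion above, not
kernel code):

#include <stdio.h>
#include <stdint.h>

/* stand-in for the 2-byte ENTRY(pt_regs) marker function (two nops) */
static const unsigned char pt_regs_marker[2] = { 0x90, 0x90 };

struct frame {
	const struct frame *next_bp; /* saved frame pointer */
	uintptr_t ret_addr;          /* "pushq $pt_regs+1" lands here */
	uintptr_t regs;              /* "pushq \regs" lands here */
};

/*
 * ret_addr-1 must fall inside the marker, which is why the marker is at
 * least two bytes and why the annotation pushes $pt_regs+1: unwinders
 * that report ret_addr and those that report the instruction before it
 * both resolve to the marker function.
 */
static int frame_is_pt_regs(const struct frame *f)
{
	uintptr_t lo = (uintptr_t)pt_regs_marker;

	return f->ret_addr > lo && f->ret_addr <= lo + sizeof(pt_regs_marker);
}

int main(void)
{
	struct frame f = {
		.next_bp  = NULL,
		.ret_addr = (uintptr_t)pt_regs_marker + 1, /* $pt_regs+1 */
		.regs     = 0xdeadbeef,                    /* fake pt_regs ptr */
	};

	printf("pt_regs frame: %s\n", frame_is_pt_regs(&f) ? "yes" : "no");
	return 0;
}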

[PATCHv2 7/7] ppc: ebpf/jit: Implement JIT compiler for extended BPF

2016-06-22 Thread Naveen N. Rao
PPC64 eBPF JIT compiler.

Enable with:
echo 1 > /proc/sys/net/core/bpf_jit_enable
or
echo 2 > /proc/sys/net/core/bpf_jit_enable

... to see the generated JIT code. This can further be processed with
tools/net/bpf_jit_disasm.

With CONFIG_TEST_BPF=m and 'modprobe test_bpf':
test_bpf: Summary: 305 PASSED, 0 FAILED, [297/297 JIT'ed]

... on both ppc64 BE and LE.

The details of the approach are documented through various comments in
the code.

Cc: Matt Evans 
Cc: Denis Kirjanov 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Thadeu Lima de Souza Cascardo 
Acked-by: Alexei Starovoitov 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/Kconfig  |   3 +-
 arch/powerpc/include/asm/asm-compat.h |   2 +
 arch/powerpc/include/asm/ppc-opcode.h |  20 +-
 arch/powerpc/net/Makefile |   4 +
 arch/powerpc/net/bpf_jit.h|  53 +-
 arch/powerpc/net/bpf_jit64.h  | 102 
 arch/powerpc/net/bpf_jit_asm64.S  | 180 +++
 arch/powerpc/net/bpf_jit_comp64.c | 954 ++
 8 files changed, 1315 insertions(+), 3 deletions(-)
 create mode 100644 arch/powerpc/net/bpf_jit64.h
 create mode 100644 arch/powerpc/net/bpf_jit_asm64.S
 create mode 100644 arch/powerpc/net/bpf_jit_comp64.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0a9d439..ee82f9a 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -128,7 +128,8 @@ config PPC
select IRQ_FORCED_THREADING
select HAVE_RCU_TABLE_FREE if SMP
select HAVE_SYSCALL_TRACEPOINTS
-   select HAVE_CBPF_JIT if CPU_BIG_ENDIAN
+   select HAVE_CBPF_JIT if !PPC64
+   select HAVE_EBPF_JIT if PPC64
select HAVE_ARCH_JUMP_LABEL
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_HAS_GCOV_PROFILE_ALL
diff --git a/arch/powerpc/include/asm/asm-compat.h 
b/arch/powerpc/include/asm/asm-compat.h
index dc85dcb..cee3aa0 100644
--- a/arch/powerpc/include/asm/asm-compat.h
+++ b/arch/powerpc/include/asm/asm-compat.h
@@ -36,11 +36,13 @@
 #define PPC_MIN_STKFRM 112
 
 #ifdef __BIG_ENDIAN__
+#define LHZX_BE	stringify_in_c(lhzx)
 #define LWZX_BE	stringify_in_c(lwzx)
 #define LDX_BE	stringify_in_c(ldx)
 #define STWX_BE	stringify_in_c(stwx)
 #define STDX_BE	stringify_in_c(stdx)
 #else
+#define LHZX_BE	stringify_in_c(lhbrx)
 #define LWZX_BE	stringify_in_c(lwbrx)
 #define LDX_BE	stringify_in_c(ldbrx)
 #define STWX_BE	stringify_in_c(stwbrx)
diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index fd8d640..6a77d130 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -142,9 +142,11 @@
 #define PPC_INST_ISEL  0x7c1e
 #define PPC_INST_ISEL_MASK 0xfc3e
 #define PPC_INST_LDARX 0x7ca8
+#define PPC_INST_STDCX 0x7c0001ad
 #define PPC_INST_LSWI  0x7c0004aa
 #define PPC_INST_LSWX  0x7c00042a
 #define PPC_INST_LWARX 0x7c28
+#define PPC_INST_STWCX 0x7c00012d
 #define PPC_INST_LWSYNC0x7c2004ac
 #define PPC_INST_SYNC  0x7c0004ac
 #define PPC_INST_SYNC_MASK 0xfc0007fe
@@ -211,8 +213,11 @@
 #define PPC_INST_LBZ   0x8800
 #define PPC_INST_LD0xe800
 #define PPC_INST_LHZ   0xa000
-#define PPC_INST_LHBRX 0x7c00062c
 #define PPC_INST_LWZ   0x8000
+#define PPC_INST_LHBRX 0x7c00062c
+#define PPC_INST_LDBRX 0x7c000428
+#define PPC_INST_STB   0x9800
+#define PPC_INST_STH   0xb000
 #define PPC_INST_STD   0xf800
 #define PPC_INST_STDU  0xf801
 #define PPC_INST_STW   0x9000
@@ -221,22 +226,34 @@
 #define PPC_INST_MTLR  0x7c0803a6
 #define PPC_INST_CMPWI 0x2c00
 #define PPC_INST_CMPDI 0x2c20
+#define PPC_INST_CMPW  0x7c00
+#define PPC_INST_CMPD  0x7c20
 #define PPC_INST_CMPLW 0x7c40
+#define PPC_INST_CMPLD 0x7c200040
 #define PPC_INST_CMPLWI0x2800
+#define PPC_INST_CMPLDI0x2820
 #define PPC_INST_ADDI  0x3800
 #define PPC_INST_ADDIS 0x3c00
 #define PPC_INST_ADD   0x7c000214
 #define PPC_INST_SUB   0x7c50
 #define PPC_INST_BLR   0x4e800020
 #define 

[PATCHv2 6/7] ppc: bpf/jit: Isolate classic BPF JIT specifics into a separate header

2016-06-22 Thread Naveen N. Rao
Break out classic BPF JIT specifics into a separate header in
preparation for eBPF JIT implementation. Note that ppc32 will still need
the classic BPF JIT.

Cc: Matt Evans 
Cc: Denis Kirjanov 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Thadeu Lima de Souza Cascardo 
Acked-by: Alexei Starovoitov 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/net/bpf_jit.h  | 121 +-
 arch/powerpc/net/bpf_jit32.h| 139 
 arch/powerpc/net/bpf_jit_asm.S  |   2 +-
 arch/powerpc/net/bpf_jit_comp.c |   2 +-
 4 files changed, 143 insertions(+), 121 deletions(-)
 create mode 100644 arch/powerpc/net/bpf_jit32.h

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 9041d3f..313cfaf 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -1,4 +1,5 @@
-/* bpf_jit.h: BPF JIT compiler for PPC64
+/*
+ * bpf_jit.h: BPF JIT compiler for PPC
  *
  * Copyright 2011 Matt Evans , IBM Corporation
  *
@@ -10,66 +11,8 @@
 #ifndef _BPF_JIT_H
 #define _BPF_JIT_H
 
-#ifdef CONFIG_PPC64
-#define BPF_PPC_STACK_R3_OFF   48
-#define BPF_PPC_STACK_LOCALS   32
-#define BPF_PPC_STACK_BASIC	(48+64)
-#define BPF_PPC_STACK_SAVE (18*8)
-#define BPF_PPC_STACKFRAME (BPF_PPC_STACK_BASIC+BPF_PPC_STACK_LOCALS+ \
-BPF_PPC_STACK_SAVE)
-#define BPF_PPC_SLOWPATH_FRAME (48+64)
-#else
-#define BPF_PPC_STACK_R3_OFF   24
-#define BPF_PPC_STACK_LOCALS   16
-#define BPF_PPC_STACK_BASIC	(24+32)
-#define BPF_PPC_STACK_SAVE (18*4)
-#define BPF_PPC_STACKFRAME (BPF_PPC_STACK_BASIC+BPF_PPC_STACK_LOCALS+ \
-BPF_PPC_STACK_SAVE)
-#define BPF_PPC_SLOWPATH_FRAME (24+32)
-#endif
-
-#define REG_SZ (BITS_PER_LONG/8)
-
-/*
- * Generated code register usage:
- *
- * As normal PPC C ABI (e.g. r1=sp, r2=TOC), with:
- *
- * skb r3  (Entry parameter)
- * A register  r4
- * X register  r5
- * addr param  r6
- * r7-r10  scratch
- * skb->data   r14
- * skb headlen r15 (skb->len - skb->data_len)
- * m[0]r16
- * m[...]  ...
- * m[15]   r31
- */
-#define r_skb  3
-#define r_ret  3
-#define r_A4
-#define r_X5
-#define r_addr 6
-#define r_scratch1 7
-#define r_scratch2 8
-#define r_D14
-#define r_HL   15
-#define r_M16
-
 #ifndef __ASSEMBLY__
 
-/*
- * Assembly helpers from arch/powerpc/net/bpf_jit.S:
- */
-#define DECLARE_LOAD_FUNC(func)\
-   extern u8 func[], func##_negative_offset[], func##_positive_offset[]
-
-DECLARE_LOAD_FUNC(sk_load_word);
-DECLARE_LOAD_FUNC(sk_load_half);
-DECLARE_LOAD_FUNC(sk_load_byte);
-DECLARE_LOAD_FUNC(sk_load_byte_msh);
-
 #ifdef CONFIG_PPC64
 #define FUNCTION_DESCR_SIZE24
 #else
@@ -131,46 +74,6 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
 #define PPC_BPF_STLU(r, base, i) do { PPC_STWU(r, base, i); } while(0)
 #endif
 
-/* Convenience helpers for the above with 'far' offsets: */
-#define PPC_LBZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LBZ(r, base, i);   \
-   else {  PPC_ADDIS(r, base, IMM_HA(i));\
-   PPC_LBZ(r, r, IMM_L(i)); } } while(0)
-
-#define PPC_LD_OFFS(r, base, i) do { if ((i) < 32768) PPC_LD(r, base, i); \
-   else {  PPC_ADDIS(r, base, IMM_HA(i));\
-   PPC_LD(r, r, IMM_L(i)); } } while(0)
-
-#define PPC_LWZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LWZ(r, base, i);   \
-   else {  PPC_ADDIS(r, base, IMM_HA(i));\
-   PPC_LWZ(r, r, IMM_L(i)); } } while(0)
-
-#define PPC_LHZ_OFFS(r, base, i) do { if ((i) < 32768) PPC_LHZ(r, base, i);   \
-   else {  PPC_ADDIS(r, base, IMM_HA(i));\
-   PPC_LHZ(r, r, IMM_L(i)); } } while(0)
-
-#ifdef CONFIG_PPC64
-#define PPC_LL_OFFS(r, base, i) do { PPC_LD_OFFS(r, base, i); } while(0)
-#else
-#define PPC_LL_OFFS(r, base, i) do { PPC_LWZ_OFFS(r, base, i); } while(0)
-#endif
-
-#ifdef CONFIG_SMP
-#ifdef CONFIG_PPC64
-#define PPC_BPF_LOAD_CPU(r)\
-   do { BUILD_BUG_ON(FIELD_SIZEOF(struct paca_struct, paca_index) != 2);   
\
-   PPC_LHZ_OFFS(r, 13, offsetof(struct paca_struct, paca_index));  
\
-   } while (0)
-#else
-#define PPC_BPF_LOAD_CPU(r) \
-   do { BUILD_BUG_ON(FIELD_SIZEOF(struct thread_info, cpu) != 4);  
\
-   PPC_LHZ_OFFS(r, (1 & ~(THREAD_SIZE - 1)),   
\
- 

[PATCHv2 5/7] ppc: bpf/jit: A few cleanups

2016-06-22 Thread Naveen N. Rao
1. Per the ISA, ADDIS actually uses RT, rather than RS. Though
the result is the same, make the usage clear.
2. The multiply instruction used is a 32-bit multiply. Rename PPC_MUL()
to PPC_MULW() to make this clear.
3. PPC_STW[U] take the entire 16-bit immediate value and do not require
word-alignment, per the ISA. Change the macros to use IMM_L().
4. A few white-space cleanups to satisfy checkpatch.pl.

Cc: Matt Evans 
Cc: Denis Kirjanov 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Thadeu Lima de Souza Cascardo 
Acked-by: Alexei Starovoitov 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/net/bpf_jit.h  | 13 +++--
 arch/powerpc/net/bpf_jit_comp.c |  8 
 2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 95d0e38..9041d3f 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -83,7 +83,7 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
  */
 #define IMM_H(i)   ((uintptr_t)(i)>>16)
 #define IMM_HA(i)  (((uintptr_t)(i)>>16) +   \
-(((uintptr_t)(i) & 0x8000) >> 15))
+   (((uintptr_t)(i) & 0x8000) >> 15))
 #define IMM_L(i)   ((uintptr_t)(i) & 0x)
 
 #define PLANT_INSTR(d, idx, instr)   \
@@ -99,16 +99,16 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
 #define PPC_MR(d, a)   PPC_OR(d, a, a)
 #define PPC_LI(r, i)   PPC_ADDI(r, 0, i)
 #define PPC_ADDIS(d, a, i) EMIT(PPC_INST_ADDIS | \
-___PPC_RS(d) | ___PPC_RA(a) | IMM_L(i))
+___PPC_RT(d) | ___PPC_RA(a) | IMM_L(i))
 #define PPC_LIS(r, i)  PPC_ADDIS(r, 0, i)
 #define PPC_STD(r, base, i)EMIT(PPC_INST_STD | ___PPC_RS(r) |\
 ___PPC_RA(base) | ((i) & 0xfffc))
 #define PPC_STDU(r, base, i)   EMIT(PPC_INST_STDU | ___PPC_RS(r) |   \
 ___PPC_RA(base) | ((i) & 0xfffc))
 #define PPC_STW(r, base, i)EMIT(PPC_INST_STW | ___PPC_RS(r) |\
-___PPC_RA(base) | ((i) & 0xfffc))
+___PPC_RA(base) | IMM_L(i))
 #define PPC_STWU(r, base, i)   EMIT(PPC_INST_STWU | ___PPC_RS(r) |   \
-___PPC_RA(base) | ((i) & 0xfffc))
+___PPC_RA(base) | IMM_L(i))
 
 #define PPC_LBZ(r, base, i)EMIT(PPC_INST_LBZ | ___PPC_RT(r) |\
 ___PPC_RA(base) | IMM_L(i))
@@ -174,13 +174,14 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
 #define PPC_CMPWI(a, i)EMIT(PPC_INST_CMPWI | ___PPC_RA(a) | 
IMM_L(i))
 #define PPC_CMPDI(a, i)EMIT(PPC_INST_CMPDI | ___PPC_RA(a) | 
IMM_L(i))
 #define PPC_CMPLWI(a, i)   EMIT(PPC_INST_CMPLWI | ___PPC_RA(a) | IMM_L(i))
-#define PPC_CMPLW(a, b)EMIT(PPC_INST_CMPLW | ___PPC_RA(a) | 
___PPC_RB(b))
+#define PPC_CMPLW(a, b)EMIT(PPC_INST_CMPLW | ___PPC_RA(a) |
  \
+   ___PPC_RB(b))
 
 #define PPC_SUB(d, a, b)   EMIT(PPC_INST_SUB | ___PPC_RT(d) |\
 ___PPC_RB(a) | ___PPC_RA(b))
 #define PPC_ADD(d, a, b)   EMIT(PPC_INST_ADD | ___PPC_RT(d) |\
 ___PPC_RA(a) | ___PPC_RB(b))
-#define PPC_MUL(d, a, b)   EMIT(PPC_INST_MULLW | ___PPC_RT(d) |  \
+#define PPC_MULW(d, a, b)  EMIT(PPC_INST_MULLW | ___PPC_RT(d) |  \
 ___PPC_RA(a) | ___PPC_RB(b))
 #define PPC_MULHWU(d, a, b)EMIT(PPC_INST_MULHWU | ___PPC_RT(d) | \
 ___PPC_RA(a) | ___PPC_RB(b))
diff --git a/arch/powerpc/net/bpf_jit_comp.c b/arch/powerpc/net/bpf_jit_comp.c
index 2d66a84..6012aac 100644
--- a/arch/powerpc/net/bpf_jit_comp.c
+++ b/arch/powerpc/net/bpf_jit_comp.c
@@ -161,14 +161,14 @@ static int bpf_jit_build_body(struct bpf_prog *fp, u32 
*image,
break;
case BPF_ALU | BPF_MUL | BPF_X: /* A *= X; */
ctx->seen |= SEEN_XREG;
-   PPC_MUL(r_A, r_A, r_X);
+   PPC_MULW(r_A, r_A, r_X);
break;
case BPF_ALU | BPF_MUL | BPF_K: /* A *= K */
if (K < 32768)
PPC_MULI(r_A, r_A, K);
else {
 

[PATCHv2 4/7] ppc: bpf/jit: Introduce rotate immediate instructions

2016-06-22 Thread Naveen N. Rao
Since we will be using the rotate immediate instructions for extended
BPF JIT, let's introduce macros for the same. And since the shift
immediate operations use the rotate immediate instructions, let's redo
those macros to use the newly introduced instructions.
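
Since the shift macros are being re-expressed in terms of rlwinm, here is a
quick runnable model of its semantics (editorial sketch; PPC/IBM bit
numbering, bit 0 = most significant bit):

#include <stdio.h>
#include <stdint.h>

/* rotate a 32-bit value left by n (0..31) */
static uint32_t rotl32(uint32_t x, unsigned n)
{
	return n ? (x << n) | (x >> (32 - n)) : x;
}

/* mask of bits mb..me, numbered 0..31 from the most significant bit */
static uint32_t mask32(unsigned mb, unsigned me)
{
	uint32_t m = 0;
	unsigned i;

	for (i = mb; i <= me; i++)
		m |= 0x80000000u >> i;
	return m;
}

/* rlwinm rd, rs, sh, mb, me: rotate left, then AND with the mask */
static uint32_t rlwinm(uint32_t rs, unsigned sh, unsigned mb, unsigned me)
{
	return rotl32(rs, sh) & mask32(mb, me);
}

int main(void)
{
	uint32_t x = 0xdeadbeef;
	unsigned n = 5;

	/* slwi = rlwinm Rx, Ry, n, 0, 31-n ; srwi = rlwinm Rx, Ry, 32-n, n, 31 */
	printf("x << %u: 0x%08x vs 0x%08x\n", n, x << n, rlwinm(x, n, 0, 31 - n));
	printf("x >> %u: 0x%08x vs 0x%08x\n", n, x >> n, rlwinm(x, 32 - n, n, 31));
	return 0;
}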

Cc: Matt Evans 
Cc: Denis Kirjanov 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Thadeu Lima de Souza Cascardo 
Acked-by: Alexei Starovoitov 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/ppc-opcode.h |  2 ++
 arch/powerpc/net/bpf_jit.h| 20 +++-
 2 files changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc-opcode.h 
b/arch/powerpc/include/asm/ppc-opcode.h
index 1d035c1..fd8d640 100644
--- a/arch/powerpc/include/asm/ppc-opcode.h
+++ b/arch/powerpc/include/asm/ppc-opcode.h
@@ -272,6 +272,8 @@
 #define __PPC_SH(s)	__PPC_WS(s)
 #define __PPC_MB(s)	(((s) & 0x1f) << 6)
 #define __PPC_ME(s)	(((s) & 0x1f) << 1)
+#define __PPC_MB64(s)	(__PPC_MB(s) | ((s) & 0x20))
+#define __PPC_ME64(s)	__PPC_MB64(s)
 #define __PPC_BI(s)	(((s) & 0x1f) << 16)
 #define __PPC_CT(t)	(((t) & 0x0f) << 21)
 
diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 4c1e055..95d0e38 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -210,18 +210,20 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
 ___PPC_RS(a) | ___PPC_RB(s))
 #define PPC_SRW(d, a, s)   EMIT(PPC_INST_SRW | ___PPC_RA(d) |\
 ___PPC_RS(a) | ___PPC_RB(s))
+#define PPC_RLWINM(d, a, i, mb, me)EMIT(PPC_INST_RLWINM | ___PPC_RA(d) | \
+   ___PPC_RS(a) | __PPC_SH(i) |  \
+   __PPC_MB(mb) | __PPC_ME(me))
+#define PPC_RLDICR(d, a, i, me)EMIT(PPC_INST_RLDICR | 
___PPC_RA(d) | \
+   ___PPC_RS(a) | __PPC_SH(i) |  \
+   __PPC_ME64(me) | (((i) & 0x20) >> 4))
+
 /* slwi = rlwinm Rx, Ry, n, 0, 31-n */
-#define PPC_SLWI(d, a, i)  EMIT(PPC_INST_RLWINM | ___PPC_RA(d) | \
-___PPC_RS(a) | __PPC_SH(i) | \
-__PPC_MB(0) | __PPC_ME(31-(i)))
+#define PPC_SLWI(d, a, i)  PPC_RLWINM(d, a, i, 0, 31-(i))
 /* srwi = rlwinm Rx, Ry, 32-n, n, 31 */
-#define PPC_SRWI(d, a, i)  EMIT(PPC_INST_RLWINM | ___PPC_RA(d) | \
-___PPC_RS(a) | __PPC_SH(32-(i)) |\
-__PPC_MB(i) | __PPC_ME(31))
+#define PPC_SRWI(d, a, i)  PPC_RLWINM(d, a, 32-(i), i, 31)
 /* sldi = rldicr Rx, Ry, n, 63-n */
-#define PPC_SLDI(d, a, i)  EMIT(PPC_INST_RLDICR | ___PPC_RA(d) | \
-___PPC_RS(a) | __PPC_SH(i) | \
-__PPC_MB(63-(i)) | (((i) & 0x20) >> 4))
+#define PPC_SLDI(d, a, i)  PPC_RLDICR(d, a, i, 63-(i))
+
 #define PPC_NEG(d, a)  EMIT(PPC_INST_NEG | ___PPC_RT(d) | ___PPC_RA(a))
 
 /* Long jump; (unconditional 'branch') */
-- 
2.8.2


[PATCHv2 2/7] ppc: bpf/jit: Fix/enhance 32-bit Load Immediate implementation

2016-06-22 Thread Naveen N. Rao
The existing LI32() macro can sometimes result in a sign-extended 32-bit
load that does not clear the top 32-bits properly. As an example,
loading 0x7fffffff results in the register containing
0xffffffff7fffffff. While this does not impact the classic BPF JIT
implementation (since that only uses the lower word for all operations),
we would like to share this macro between classic BPF JIT and extended
BPF JIT, wherein the entire 64-bit value in the register matters. Fix
this by first doing a shifted LI followed by ORI.

An additional optimization is with loading values between -32768 and -1,
where we now only need a single LI.

The new implementation now generates the same number of instructions,
or fewer.
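
A small host-side sketch of why the old sequence went wrong (editorial
illustration: these helpers just simulate the effect of the ppc li/addis
and lis/ori instructions on a 64-bit register):

#include <stdio.h>
#include <stdint.h>

static int64_t li(int16_t imm)               { return imm; }
static int64_t lis(int16_t imm)              { return (int64_t)imm << 16; }
static int64_t addis(int64_t r, int16_t imm) { return r + ((int64_t)imm << 16); }
static int64_t ori(int64_t r, uint16_t imm)  { return r | imm; }

int main(void)
{
	uint32_t v = 0x7fffffff;
	int16_t lo = (int16_t)(v & 0xffff);                       /* IMM_L  */
	int16_t ha = (int16_t)((v >> 16) + ((v & 0x8000) >> 15)); /* IMM_HA */
	int16_t hi = (int16_t)(v >> 16);                          /* IMM_H  */

	/* old sequence: li sign-extends, and the top half never gets cleared */
	printf("li+addis: 0x%016llx\n", (unsigned long long)addis(li(lo), ha));

	/* new sequence: lis+ori leaves the top 32 bits clear for this value */
	printf("lis+ori : 0x%016llx\n", (unsigned long long)ori(lis(hi), v & 0xffff));
	return 0;
}

This prints 0xffffffff7fffffff for the old sequence and
0x000000007fffffff for the new one.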

Cc: Matt Evans 
Cc: Denis Kirjanov 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Thadeu Lima de Souza Cascardo 
Acked-by: Alexei Starovoitov 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/net/bpf_jit.h | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index 889fd19..a9882db 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -232,10 +232,17 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
 (((cond) & 0x3ff) << 16) |   \
 (((dest) - (ctx->idx * 4)) & \
  0xfffc))
-#define PPC_LI32(d, i) do { PPC_LI(d, IMM_L(i)); \
-   if ((u32)(uintptr_t)(i) >= 32768) {   \
-   PPC_ADDIS(d, d, IMM_HA(i));   \
+/* Sign-extended 32-bit immediate load */
+#define PPC_LI32(d, i) do {  \
+   if ((int)(uintptr_t)(i) >= -32768 &&  \
+   (int)(uintptr_t)(i) < 32768)  \
+   PPC_LI(d, i); \
+   else {\
+   PPC_LIS(d, IMM_H(i)); \
+   if (IMM_L(i)) \
+   PPC_ORI(d, d, IMM_L(i));  \
} } while(0)
+
 #define PPC_LI64(d, i) do {  \
if (!((uintptr_t)(i) & 0xULL))\
PPC_LI32(d, i);   \
-- 
2.8.2


[PATCHv2 3/7] ppc: bpf/jit: Optimize 64-bit Immediate loads

2016-06-22 Thread Naveen N. Rao
Similar to the LI32() optimization, if the value can be represented
in 32-bits, use LI32(). Also handle loading a few specific forms of
immediate values in an optimum manner.

Cc: Matt Evans 
Cc: Denis Kirjanov 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Thadeu Lima de Souza Cascardo 
Acked-by: Alexei Starovoitov 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/net/bpf_jit.h | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/net/bpf_jit.h b/arch/powerpc/net/bpf_jit.h
index a9882db..4c1e055 100644
--- a/arch/powerpc/net/bpf_jit.h
+++ b/arch/powerpc/net/bpf_jit.h
@@ -244,20 +244,25 @@ DECLARE_LOAD_FUNC(sk_load_byte_msh);
} } while(0)
 
 #define PPC_LI64(d, i) do {  \
-   if (!((uintptr_t)(i) & 0xULL))\
+   if ((long)(i) >= -2147483648 &&   \
+   (long)(i) < 2147483648)   \
PPC_LI32(d, i);   \
else {\
-   PPC_LIS(d, ((uintptr_t)(i) >> 48));   \
-   if ((uintptr_t)(i) & 0xULL)   \
-   PPC_ORI(d, d, \
-   ((uintptr_t)(i) >> 32) & 0x); \
+   if (!((uintptr_t)(i) & 0x8000ULL))\
+   PPC_LI(d, ((uintptr_t)(i) >> 32) & 0x);   \
+   else {\
+   PPC_LIS(d, ((uintptr_t)(i) >> 48));   \
+   if ((uintptr_t)(i) & 0xULL)   \
+   PPC_ORI(d, d, \
+ ((uintptr_t)(i) >> 32) & 0x);   \
+   } \
PPC_SLDI(d, d, 32);   \
if ((uintptr_t)(i) & 0xULL)   \
PPC_ORIS(d, d,\
 ((uintptr_t)(i) >> 16) & 0x);\
if ((uintptr_t)(i) & 0xULL)   \
PPC_ORI(d, d, (uintptr_t)(i) & 0x);   \
-   } } while (0);
+   } } while (0)
 
 #ifdef CONFIG_PPC64
 #define PPC_FUNC_ADDR(d,i) do { PPC_LI64(d, i); } while(0)
-- 
2.8.2


[PATCHv2 1/7] ppc bpf/jit: Disable classic BPF JIT on ppc64le

2016-06-22 Thread Naveen N. Rao
Classic BPF JIT was never ported completely to work on little endian
powerpc. However, it can be enabled and will crash the system when used.
As such, disable use of BPF JIT on ppc64le.

Cc: sta...@vger.kernel.org
Cc: Matt Evans 
Cc: Denis Kirjanov 
Cc: Michael Ellerman 
Cc: Paul Mackerras 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: "David S. Miller" 
Cc: Ananth N Mavinakayanahalli 
Cc: Thadeu Lima de Souza Cascardo 
Reported-by: Thadeu Lima de Souza Cascardo 
Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 01f7464..0a9d439 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -128,7 +128,7 @@ config PPC
select IRQ_FORCED_THREADING
select HAVE_RCU_TABLE_FREE if SMP
select HAVE_SYSCALL_TRACEPOINTS
-   select HAVE_CBPF_JIT
+   select HAVE_CBPF_JIT if CPU_BIG_ENDIAN
select HAVE_ARCH_JUMP_LABEL
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_HAS_GCOV_PROFILE_ALL
-- 
2.8.2


[PATCHv2 0/7] eBPF JIT for PPC64

2016-06-22 Thread Naveen N. Rao
v2 changes:
  - Patch 1 is new and is cc'ed -stable
  - Patch 7 has 3 changes:
- Include asm/kprobes.h to resolve a build error reported by Michael
Ellerman
- Remove check for fp in bpf_int_jit_compile() as suggested by Daniel
Borkmann
- Fix a crash on Cell processor reported by Michael Ellerman by
changing image to be a u8 pointer for proper size calculation.


Earlier cover letter:
Implement extended BPF JIT for ppc64. We retain the classic BPF JIT for
ppc32 and move ppc64 BE/LE to use the new JIT. Classic BPF filters will
be converted to extended BPF (see convert_filter()) and JIT'ed with the
new compiler.

Most of the existing macros are retained and fixed/enhanced where
appropriate. Patches 1-4 are geared towards this.

Patch 5 breaks out the classic BPF JIT specifics into a separate
bpf_jit32.h header file, while retaining all the generic instruction
macros in bpf_jit.h.

Patch 6 implements eBPF JIT for ppc64.

Since the RFC patchset [1], powerpc JIT has gained support for skb
access helpers and now passes all tests in test_bpf.ko. Review comments
on the RFC patches have been addressed (use of an ABI macro [2] and use
of bpf_jit_binary_alloc()), along with a few other generic fixes and
updates.

Prominent TODOs:
 - implement BPF tail calls
 - support for BPF constant blinding

Please note that patch [2] is a pre-requisite for this patchset, and is
not yet upstream.


- Naveen

[1] http://thread.gmane.org/gmane.linux.kernel/2188694
[2] http://thread.gmane.org/gmane.linux.ports.ppc.embedded/96514


Naveen N. Rao (7):
  ppc bpf/jit: Disable classic BPF JIT on ppc64le
  ppc: bpf/jit: Fix/enhance 32-bit Load Immediate implementation
  ppc: bpf/jit: Optimize 64-bit Immediate loads
  ppc: bpf/jit: Introduce rotate immediate instructions
  ppc: bpf/jit: A few cleanups
  ppc: bpf/jit: Isolate classic BPF JIT specifics into a separate header
  ppc: ebpf/jit: Implement JIT compiler for extended BPF

 arch/powerpc/Kconfig  |   3 +-
 arch/powerpc/include/asm/asm-compat.h |   2 +
 arch/powerpc/include/asm/ppc-opcode.h |  22 +-
 arch/powerpc/net/Makefile |   4 +
 arch/powerpc/net/bpf_jit.h| 235 -
 arch/powerpc/net/bpf_jit32.h  | 139 +
 arch/powerpc/net/bpf_jit64.h  | 102 
 arch/powerpc/net/bpf_jit_asm.S|   2 +-
 arch/powerpc/net/bpf_jit_asm64.S  | 180 +++
 arch/powerpc/net/bpf_jit_comp.c   |  10 +-
 arch/powerpc/net/bpf_jit_comp64.c | 954 ++
 11 files changed, 1502 insertions(+), 151 deletions(-)
 create mode 100644 arch/powerpc/net/bpf_jit32.h
 create mode 100644 arch/powerpc/net/bpf_jit64.h
 create mode 100644 arch/powerpc/net/bpf_jit_asm64.S
 create mode 100644 arch/powerpc/net/bpf_jit_comp64.c

-- 
2.8.2


Re: [PATCH 3/3] powerpc/pnv/pci: Fix incorrect PE reservation attempt on some 64-bit BARs

2016-06-22 Thread Benjamin Herrenschmidt
On Wed, 2016-06-22 at 20:32 +1000, Gavin Shan wrote:
> In pci_read_bridge_mmio_pref(), no prefetchable window (64bits+pref)
> is populated if bit PCI_PREF_RANGE_TYPE_64 (0x1) isn't set on PCI
> config register (PCI_PREF_MEMORY_BASE, 0x24). During the resource
> resizing and assigning stage in PCI core, all resources including
> 64-bits prefetchable resources will be covered by 32-bits bridge
> window.

Yeah ok, that's it, thanks !

Cheers,
Ben.

Re: [PATCH v13 01/16] PCI: Let pci_mmap_page_range() take resource address

2016-06-22 Thread Bjorn Helgaas
On Tue, Jun 21, 2016 at 09:32:49PM -0700, Yinghai Lu wrote:
> On Sat, Jun 18, 2016 at 5:17 AM, Bjorn Helgaas  wrote:
> > On Fri, Jun 17, 2016 at 07:24:46PM -0700, Yinghai Lu wrote:
> >> In 8c05cd08a7 ("PCI: fix offset check for sysfs mmapped files"), try
> >> to check exposed value with resource start/end in proc mmap path.
> >>
> >> |start = vma->vm_pgoff;
> >> |size = ((pci_resource_len(pdev, resno) - 1) >> PAGE_SHIFT) + 1;
> >> |pci_start = (mmap_api == PCI_MMAP_PROCFS) ?
> >> |pci_resource_start(pdev, resno) >> PAGE_SHIFT : 0;
> >> |if (start >= pci_start && start < pci_start + size &&
> >> |start + nr <= pci_start + size)
> >>
> >> That breaks sparc that exposed value is BAR value, and need to be offseted
> >> to resource address.
> >
> > I asked this same question of the v12 patch, but I don't think you
> > answered it:
> >
> > I'm not quite sure what you're saying here.  Are you saying that sparc
> > is currently broken, and this patch fixes it?  If so, what exactly is
> > broken?  Can you give a small example of an mmap that is currently
> > broken?
> 
> Yes, for sparc that path (proc mmap) is broken, but only according to
> code inspection.
> 
> The reason the problem was never discovered is that, it seems, all users
> (other than x86) avoid proc mmap?
> 
> In that code segment, vma->vm_pgoff is the user-supplied BAR value >> PAGE_SHIFT.
> pci_start is resource->start >> PAGE_SHIFT.
> 
> For sparc, the resource start is different from the BAR start, aka the PCI
> bus address: the bus address plus an offset gives the resource start.

If sparc is broken, let's make this a tiny sparc-only patch that fixes
only the breakage -- no cleanup or restructuring.  Then we can do the
more extensive work in a separate patch.

The example mmap() I keep asking for would be very helpful to me in
understanding the problem.  It would probably also help folks who
maintain user programs that use mmap.  They need to figure out whether
they have code that worked most places but has always been broken on
sparc, or code that depended on the previous sparc behavior and will
be broken by this change, or what.

Bjorn
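
For concreteness, an editorial sketch of the kind of proc-mmap call under
discussion (illustration only: the device path is made up, and the offset
semantics are exactly the arch-dependent point in question):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/ioctl.h>
#include <linux/pci.h>

int main(void)
{
	int fd = open("/proc/bus/pci/00/01.0", O_RDWR); /* hypothetical device */

	if (fd < 0) {
		perror("open");
		return 1;
	}
	ioctl(fd, PCIIOC_MMAP_IS_MEM); /* map memory space, not I/O space */

	/*
	 * The mmap offset is where the arches disagree: the report above is
	 * that sparc exposes the raw BAR (bus) address, while the offset
	 * check quoted earlier compares against the resource address.
	 */
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	if (p == MAP_FAILED)
		perror("mmap");
	close(fd);
	return 0;
}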

cpuidle broken on mainline

2016-06-22 Thread Anton Blanchard via Linuxppc-dev
Hi,

I was noticing some pretty big run-to-run variations on single threaded
benchmarks, and I've isolated it to cpuidle issues. If I look at
the cpuidle tracepoint, I notice we only go into the snooze state.

Do we have any known bugs in cpuidle at the moment?

While looking around, I also noticed PM_QOS_CPU_DMA_LATENCY,
which seems like a bad thing for us. It is used in places like
the e1000e adapter:

drivers/net/ethernet/intel/e1000e/netdev.c:
pm_qos_add_request(&adapter->pm_qos_req, PM_QOS_CPU_DMA_LATENCY

Seems like that would force us into SMT8 mode all the time. Can we
disable it on ppc64le completely?

Anton


Re: [PATCH] ppc: Fix BPF JIT for ABIv2

2016-06-22 Thread Naveen N. Rao
On 2016/06/22 12:42PM, Naveen N Rao wrote:
> On 2016/06/21 11:47AM, Thadeu Lima de Souza Cascardo wrote:
> > On Tue, Jun 21, 2016 at 09:15:48PM +1000, Michael Ellerman wrote:
> > > On Tue, 2016-06-21 at 14:28 +0530, Naveen N. Rao wrote:
> > > > On 2016/06/20 03:56PM, Thadeu Lima de Souza Cascardo wrote:
> > > > > On Sun, Jun 19, 2016 at 11:19:14PM +0530, Naveen N. Rao wrote:
> > > > > > On 2016/06/17 10:00AM, Thadeu Lima de Souza Cascardo wrote:
> > > > > > > 
> > > > > > > Hi, Michael and Naveen.
> > > > > > > 
> > > > > > > I noticed independently that there is a problem with BPF JIT and 
> > > > > > > ABIv2, and
> > > > > > > worked out the patch below before I noticed Naveen's patchset and 
> > > > > > > the latest
> > > > > > > changes in ppc tree for a better way to check for ABI versions.
> > > > > > > 
> > > > > > > However, since the issue described below affect mainline and 
> > > > > > > stable kernels,
> > > > > > > would you consider applying it before merging your two patchsets, 
> > > > > > > so that we can
> > > > > > > more easily backport the fix?
> > > > > > 
> > > > > > Hi Cascardo,
> > > > > > Given that this has been broken on ABIv2 since forever, I didn't 
> > > > > > bother 
> > > > > > fixing it. But, I can see why this would be a good thing to have 
> > > > > > for 
> > > > > > -stable and existing distros. However, while your patch below may 
> > > > > > fix 
> > > > > > the crash you're seeing on ppc64le, it is not sufficient -- you'll 
> > > > > > need 
> > > > > > changes in bpf_jit_asm.S as well.
> > > > > 
> > > > > Hi, Naveen.
> > > > > 
> > > > > Any tips on how to exercise possible issues there? Or what changes 
> > > > > you think
> > > > > would be sufficient?
> > > > 
> > > > The calling convention is different with ABIv2 and so we'll need 
> > > > changes 
> > > > in bpf_slow_path_common() and sk_negative_common().
> > > 
> > > How big would those changes be? Do we know?

So, this does need quite a few changes:
- the skb helpers need to emit code to set up the TOC, and the JIT code
  needs to be updated to set up r12.
- the slow path code needs to be changed to store r3 elsewhere on ABIv2
- the above also means we need to change the stack macros with the
  proper ABIv2 values
- the little endian support isn't complete either -- some of the skb
  helpers are not using byte swap instructions.

As such, I think we should just disable classic JIT on ppc64le.


- Naveen
 


Re: [PATCH 0/6] kexec_file: Add buffer hand-over for the next kernel

2016-06-22 Thread Mimi Zohar
Hi Dave,

On Wed, 2016-06-22 at 09:20 +0800, Dave Young wrote:
> On 06/20/16 at 10:44pm, Thiago Jung Bauermann wrote:
> > Hello,
> > 
> > This patch series implements a mechanism which allows the kernel to pass on
> > a buffer to the kernel that will be kexec'd. This buffer is passed as a
> > segment which is added to the kimage when it is being prepared by
> > kexec_file_load.
> > 
> > How the second kernel is informed of this buffer is architecture-specific.
> > On PowerPC, this is done via the device tree, by checking the properties
> > /chosen/linux,kexec-handover-buffer-start and
> > /chosen/linux,kexec-handover-buffer-end, which is analogous to how the
> > kernel finds the initrd.
> > 
> > This feature was implemented because the Integrity Measurement Architecture
> > subsystem needs to preserve its measurement list across the kexec reboot.
> > This is so that IMA can implement trusted boot support on the OpenPower
> > platform, because on such systems an intermediary Linux instance running as
> > part of the firmware is used to boot the target operating system via kexec.
> > Using this mechanism, IMA on this intermediary instance can hand over to the
> > target OS the measurements of the components that were used to boot it.
> 
> We have CONFIG_KEXEC_VERIFY_SIG, why not verifying the kernel to be
> loaded instead?  I feel IMA should rebuild its measurement instead of
> passing it to another kernel. Kexec reboot is also a reboot. If we have
> to preserve something get from firmware we can do it, but other than
> that I think it sounds not a good idea.

The signature verification is needed for secure boot.  Carrying the IMA
measurement list across kexec is needed for trusted boot.  In this case,
the boot loader is Linux, which needs to carry the measurements, stored
in memory, across kexec to the target system.

The kernel_read_file_from_fd() calls the pre and post security
kernel_read hooks.  These hooks can verify file signatures, store
measurements in the IMA measurement list and extend the TPM.  To enable
both measuring and appraising (signature verification) of the kernel
image and the initramfs, include the following rules in the IMA policy:

measure func=KEXEC_KERNEL_CHECK
appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig
#
measure func=KEXEC_INITRAMFS_CHECK
appraise func=KEXEC_INITRAMFS_CHECK appraise_type=imasig

Thiago's path set provides the means for carrying the trusted boot
measurements across kexec.

Mimi


Re: [v6, 1/2] cxl: Add mechanism for delivering AFU driver specific events

2016-06-22 Thread Philippe Bergheaud

Should also check against the length of the user buffer (count) provided in
the read call. Ideally this condition check should be moved to the read
call, where you have access to the count variable.

Right now libcxl is using a hardcoded value of CXL_READ_MIN_SIZE to
issue the read call, and in kernel code we have a check to ensure that
the read buffer is at least CXL_READ_MIN_SIZE in size.

But it might be a good idea to decouple the driver from
CXL_MAX_EVENT_DATA_SIZE. Ideally the maximum event size that we can
support should depend on the amount of user buffer we receive in the
read call. That way a future libcxl can support larger event_data without
needing a change to cxl.h.


[...]

+#define CXL_MAX_EVENT_DATA_SIZE 128
+



Agree with Matt's earlier comments. 128 is very small and I would prefer
at least a page-size (4K/64K) limit.



afu_read() enforces a minimum buffer size of CXL_READ_MIN_SIZE = 4K, as 
documented in Documentation/powerpc/cxl.txt. This information is missing from 
the man pages of the libcxl functions cxl_read_event/cxl_read_expected_event. I 
will fix these.

Regarding the maximum event size, as afu_read returns one event per call, and 
as there is no API to tell userland the maximum size of a cxl event, I think 
that we should simply use (and document) the same value (4K) as the maximum cxl 
event size.

Philippe
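
A tiny userspace model of that size contract (editorial sketch: the checks
mirror the behaviour described above, they are not copied from afu_read):

#include <stdio.h>
#include <stddef.h>
#include <errno.h>

#define CXL_READ_MIN_SIZE 0x1000 /* 4K, per Documentation/powerpc/cxl.txt */

/* reject reads whose buffer cannot hold one maximally sized event */
static int check_read_size(size_t count, size_t event_size)
{
	if (count < CXL_READ_MIN_SIZE)
		return -EINVAL; /* buffer smaller than the documented minimum */
	if (event_size > CXL_READ_MIN_SIZE)
		return -EFBIG;  /* event exceeds the documented maximum */
	return 0;
}

int main(void)
{
	printf("%d %d %d\n",
	       check_read_size(0x1000, 128),     /* ok */
	       check_read_size(0x200, 128),      /* buffer too small */
	       check_read_size(0x1000, 0x2000)); /* oversized event */
	return 0;
}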




Re: [v6, 1/2] cxl: Add mechanism for delivering AFU driver specific events

2016-06-22 Thread Philippe Bergheaud

Matthew R. Ochs wrote:

On Jun 21, 2016, at 5:34 AM, Vaibhav Jain  wrote:

Hi Ian,

Ian Munsie  writes:



Excerpts from Vaibhav Jain's message of 2016-06-20 14:20:16 +0530:

What exactly is the use case for this API? I'd vote to drop it if we can
do without it.


Agree with this. Functionality of this API can be merged with
cxl_set_driver_ops when called with NULL arg for cxl_afu_driver_ops.



Passing a NULL arg instead of calling an 'unset' API is fine with us.

I'll add that for cxlflash, I can't envision a scenario where we'll unset the
driver ops for a context.


Agreed, thanks. I will drop API cxl_unset_driver_ops() in v7.

Philippe


Re: [PATCH v3 0/9] kexec_file_load implementation for PowerPC

2016-06-22 Thread Balbir Singh
On Tue, 21 Jun 2016 16:48:32 -0300
Thiago Jung Bauermann  wrote:

> Hello,
> 
> This patch series implements the kexec_file_load system call on
> PowerPC.
> 
> This system call moves the reading of the kernel, initrd and the
> device tree from the userspace kexec tool to the kernel. This is
> needed if you want to do one or both of the following:
> 
> 1. only allow loading of signed kernels.
> 2. "measure" (i.e., record the hashes of) the kernel, initrd, kernel
>command line and other boot inputs for the Integrity Measurement
>Architecture subsystem.
> 
> The above are the functions kexec already has built into
> kexec_file_load. Yesterday I posted a set of patches which allows a
> third feature:
> 
> 3. have IMA pass-on its event log (where integrity measurements are
>registered) accross kexec to the second kernel, so that the event
>history is preserved.

OK.. and this is safe? Do both the kernels need to be signed by the
same certificate?


Balbir Singh

Re: [PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF

2016-06-22 Thread Naveen N. Rao
On 2016/06/22 08:37PM, Michael Ellerman wrote:
> On Tue, 2016-06-07 at 19:02 +0530, Naveen N. Rao wrote:
> 
> > PPC64 eBPF JIT compiler.
> > 
> > Enable with:
> > echo 1 > /proc/sys/net/core/bpf_jit_enable
> > or
> > echo 2 > /proc/sys/net/core/bpf_jit_enable
> > 
> > ... to see the generated JIT code. This can further be processed with
> > tools/net/bpf_jit_disasm.
> > 
> > With CONFIG_TEST_BPF=m and 'modprobe test_bpf':
> > test_bpf: Summary: 305 PASSED, 0 FAILED, [297/297 JIT'ed]
> > 
> > ... on both ppc64 BE and LE.
> > 
> > The details of the approach are documented through various comments in
> > the code.
> 
> This is crashing for me on a Cell machine, not sure why at a glance:
> 
> 
> test_bpf: #250 JMP_JSET_X: if (0x3 & 0x) return 1 jited:1 14 PASS
> test_bpf: #251 JMP_JA: Jump, gap, jump, ... jited:1 15 PASS
> test_bpf: #252 BPF_MAXINSNS: Maximum possible literals 
> Unable to handle kernel paging request for data at address 0xd7b2
> Faulting instruction address: 0xc0667b6c
> cpu 0x0: Vector: 300 (Data Access) at [c007f83bf3a0]
> pc: c0667b6c: .flush_icache_range+0x3c/0x84
> lr: c0082354: .bpf_int_jit_compile+0x1fc/0x2c8
> sp: c007f83bf620
>msr: 9200b032
>dar: d7b2
>  dsisr: 4000
>   current = 0xc007f8249580
>   paca= 0xcfff softe: 0irq_happened: 0x01
> pid   = 1822, comm = insmod
> Linux version 4.7.0-rc3-00061-g007c99b9d8c1 (mich...@ka3.ozlabs.ibm.com) (gcc 
> version 6.1.0 (GCC) ) #3 SMP Wed Jun 22 19:22:23 AEST 2016
> enter ? for help
> [link register   ] c0082354 .bpf_int_jit_compile+0x1fc/0x2c8
> [c007f83bf620] c00822fc .bpf_int_jit_compile+0x1a4/0x2c8 
> (unreliable)
> [c007f83bf700] c013cda4 .bpf_prog_select_runtime+0x24/0x108
> [c007f83bf780] c0548918 .bpf_prepare_filter+0x9b0/0x9e8
> [c007f83bf830] c05489d4 .bpf_prog_create+0x84/0xd0
> [c007f83bf8c0] d3b21158 .test_bpf_init+0x28c/0x83c [test_bpf]
> [c007f83bfa00] c000a7b4 .do_one_initcall+0x5c/0x1c0
> [c007f83bfae0] c0669058 .do_init_module+0x80/0x21c
> [c007f83bfb80] c011e3a0 .load_module+0x2028/0x23a8
> [c007f83bfd20] c011e898 .SyS_init_module+0x178/0x1b0
> [c007f83bfe30] c0009220 system_call+0x38/0x110
> --- Exception: c01 (System Call) at 0ff5e0c4
> SP (ffde0960) is in userspace
> 0:mon> r
> R00 = c01c   R16 = 
> R01 = c007f83bf620   R17 = 024000c0
> R02 = c094ce00   R18 = 
> R03 = d7b1   R19 = d3c32df0
> R04 = d7b40338   R20 = c072b488

Wow. I can't actually understand why this did not trigger for me. We are 
sending incorrect values into flush_icache_range(). So the first page is 
being flushed properly, but we are faulting trying to access another 
page. Patch forthcoming.

Thanks,
Naveen
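
An editorial sketch of the pointer-arithmetic trap being described (the
v2 changelog's fix of making the image a u8 pointer addresses exactly
this; the addresses below are made up):

#include <stdio.h>
#include <stdint.h>

static void show_range(const char *tag, uintptr_t start, uintptr_t end)
{
	printf("%s: flush %#lx..%#lx (%lu bytes)\n", tag,
	       (unsigned long)start, (unsigned long)end,
	       (unsigned long)(end - start));
}

int main(void)
{
	uintptr_t image = 0x10000; /* fake JIT buffer address */
	size_t size = 0x800;       /* image size in bytes */

	/* broken: with a u32 *, "image + size" advances size * 4 bytes */
	show_range("u32 *", image, image + size * sizeof(uint32_t));

	/* fixed: with a u8 *, "image + size" advances size bytes */
	show_range("u8  *", image, image + size);
	return 0;
}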
 


Re: [PATCH v3] tools/perf: Fix the mask in regs_dump__printf and print_sample_iregs

2016-06-22 Thread Yury Norov
On Wed, Jun 22, 2016 at 09:20:43AM +0530, Madhavan Srinivasan wrote:
> 
> 
> On Tuesday 21 June 2016 09:05 PM, Yury Norov wrote:
> > On Tue, Jun 21, 2016 at 08:26:40PM +0530, Madhavan Srinivasan wrote:
> >> When decoding the perf_regs mask in regs_dump__printf(),
> >> we loop through the mask using find_first_bit and find_next_bit functions.
> >> "mask" is of type "u64", but sent as a "unsigned long *" to
> >> lib functions along with sizeof().
> >>
> >> While the existing code works fine in most cases,
> >> the logic is broken when using a 32bit perf on a 64bit kernel (Big Endian).
> >> When reading u64 using (u32 *)()[0], perf (lib/find_*_bit()) assumes 
> >> it gets
> >> lower 32bits of u64 which is wrong. Proposed fix is to swap the words
> >> of the u64 to handle this case. This is _not_ endianess swap.
> >>
> >> Suggested-by: Yury Norov 
> >> Cc: Yury Norov 
> >> Cc: Peter Zijlstra 
> >> Cc: Ingo Molnar 
> >> Cc: Arnaldo Carvalho de Melo 
> >> Cc: Alexander Shishkin 
> >> Cc: Jiri Olsa 
> >> Cc: Adrian Hunter 
> >> Cc: Kan Liang 
> >> Cc: Wang Nan 
> >> Cc: Michael Ellerman 
> >> Signed-off-by: Madhavan Srinivasan 
> >> ---
> >> Changelog v2:
> >> 1)Moved the swap code to a common function
> >> 2)Added more comments in the code
> >>
> >> Changelog v1:
> >> 1)updated commit message and patch subject
> >> 2)Add the fix to print_sample_iregs() in builtin-script.c
> >>
> >>  tools/include/linux/bitmap.h |  9 +
> > What about include/linux/bitmap.h? I think we'd place it there first.
> 
> Wanted to handle that separately.
> 
> >
> >>  tools/perf/builtin-script.c  | 16 +++-
> >>  tools/perf/util/session.c| 16 +++-
> >>  3 files changed, 39 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
> >> index 28f5493da491..79998b26eb04 100644
> >> --- a/tools/include/linux/bitmap.h
> >> +++ b/tools/include/linux/bitmap.h
> >> @@ -2,6 +2,7 @@
> >>  #define _PERF_BITOPS_H
> >>  
> >>  #include 
> >> +#include 
> >>  #include 
> >>  
> >>  #define DECLARE_BITMAP(name,bits) \
> >> @@ -22,6 +23,14 @@ void __bitmap_or(unsigned long *dst, const unsigned 
> >> long *bitmap1,
> >>  #define small_const_nbits(nbits) \
> >>(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG)
> >>  
> >> +static inline void bitmap_from_u64(unsigned long *_mask, u64 mask)
> > Inline is not required. Some people don't like it. An underscored parameter in
> > parameter in
> 
> Not sure why you say that. IIUC we can avoid a function call overhead,
> also rest of the functions in the file likes it.

Documentation/CodingStyle, Chapter 15. 

But I was wrong here for 3 reasons.
1. Your function is exactly 3 lines of code;
2. Condition is calculated at compile time;
3. Code around uses inline, and your patch should not be an exception.

Anyway, inline is only a hint for GCC, and you cannot expect overhead
elimination in every specific case. If you really need inlining, use the
__always_inline directive.
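
The difference in one self-contained snippet (editorial sketch: the
kernel's __always_inline expands to the attribute used here):

#include <stdio.h>

/* plain inline: advice the compiler is free to ignore */
static inline int hint_only(int x) { return x + 1; }

/* always_inline attribute: inlining is forced even at -O0 */
static inline __attribute__((always_inline)) int forced(int x) { return x + 1; }

int main(void)
{
	printf("%d %d\n", hint_only(1), forced(1));
	return 0;
}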

> > a function declaration is not the best idea either. Try:
> > static void bitmap_from_u64(unsigned long *bitmap, u64 mask)
> >
> >> +{
> >> +  _mask[0] = mask & ULONG_MAX;
> >> +
> >> +  if (sizeof(mask) > sizeof(unsigned long))
> >> +  _mask[1] = mask >> 32;
> >> +}
> >> +
> >>  static inline void bitmap_zero(unsigned long *dst, int nbits)
> >>  {
> >>if (small_const_nbits(nbits))
> >> diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
> >> index e3ce2f34d3ad..73928310fd91 100644
> >> --- a/tools/perf/builtin-script.c
> >> +++ b/tools/perf/builtin-script.c
> >> @@ -412,11 +412,25 @@ static void print_sample_iregs(struct perf_sample 
> >> *sample,
> >>struct regs_dump *regs = >intr_regs;
> >>uint64_t mask = attr->sample_regs_intr;
> >>unsigned i = 0, r;
> >> +  unsigned long _mask[sizeof(mask)/sizeof(unsigned long)];
> > If we start with it, I think we'd hide declaration machinery as well:
> >
> > #define DECLARE_L64_BITMAP(__name) unsigned long 
> > __name[sizeof(u64)/sizeof(unsigned long)]
> > or
> > #define L64_BITMAP_SIZE (sizeof(u64)/sizeof(unsigned long))
> >
> > Or both :) Whatever you prefer.
> 
> ok
> 
> >
> >>  
> >>if (!regs)
> >>return;
> >>  
> >> -  for_each_set_bit(r, (unsigned long *) , sizeof(mask) * 8) {
> >> +  /*
> >> +   * Since u64 is passed as 'unsigned long *', check
> >> +   * to see whether we need to swap words within u64.
> >> +   * Reason being, in 32 bit big endian userspace on a
> >> +   * 64bit kernel, 'unsigned long' is 32 bits.
> >> +   * When reading u64 using (u32 *)()[0] and (u32 *)()[1],
> >> +   * we will get wrong value for the mask. This is what
> >> +   * find_first_bit() and 

Re: [PATCH 6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF

2016-06-22 Thread Michael Ellerman
On Tue, 2016-06-07 at 19:02 +0530, Naveen N. Rao wrote:

> PPC64 eBPF JIT compiler.
> 
> Enable with:
> echo 1 > /proc/sys/net/core/bpf_jit_enable
> or
> echo 2 > /proc/sys/net/core/bpf_jit_enable
> 
> ... to see the generated JIT code. This can further be processed with
> tools/net/bpf_jit_disasm.
> 
> With CONFIG_TEST_BPF=m and 'modprobe test_bpf':
> test_bpf: Summary: 305 PASSED, 0 FAILED, [297/297 JIT'ed]
> 
> ... on both ppc64 BE and LE.
> 
> The details of the approach are documented through various comments in
> the code.

This is crashing for me on a Cell machine, not sure why at a glance:


test_bpf: #250 JMP_JSET_X: if (0x3 & 0x) return 1 jited:1 14 PASS
test_bpf: #251 JMP_JA: Jump, gap, jump, ... jited:1 15 PASS
test_bpf: #252 BPF_MAXINSNS: Maximum possible literals 
Unable to handle kernel paging request for data at address 0xd7b2
Faulting instruction address: 0xc0667b6c
cpu 0x0: Vector: 300 (Data Access) at [c007f83bf3a0]
pc: c0667b6c: .flush_icache_range+0x3c/0x84
lr: c0082354: .bpf_int_jit_compile+0x1fc/0x2c8
sp: c007f83bf620
   msr: 9200b032
   dar: d7b2
 dsisr: 4000
  current = 0xc007f8249580
  paca= 0xcfff   softe: 0irq_happened: 0x01
pid   = 1822, comm = insmod
Linux version 4.7.0-rc3-00061-g007c99b9d8c1 (mich...@ka3.ozlabs.ibm.com) (gcc 
version 6.1.0 (GCC) ) #3 SMP Wed Jun 22 19:22:23 AEST 2016
enter ? for help
[link register   ] c0082354 .bpf_int_jit_compile+0x1fc/0x2c8
[c007f83bf620] c00822fc .bpf_int_jit_compile+0x1a4/0x2c8 
(unreliable)
[c007f83bf700] c013cda4 .bpf_prog_select_runtime+0x24/0x108
[c007f83bf780] c0548918 .bpf_prepare_filter+0x9b0/0x9e8
[c007f83bf830] c05489d4 .bpf_prog_create+0x84/0xd0
[c007f83bf8c0] d3b21158 .test_bpf_init+0x28c/0x83c [test_bpf]
[c007f83bfa00] c000a7b4 .do_one_initcall+0x5c/0x1c0
[c007f83bfae0] c0669058 .do_init_module+0x80/0x21c
[c007f83bfb80] c011e3a0 .load_module+0x2028/0x23a8
[c007f83bfd20] c011e898 .SyS_init_module+0x178/0x1b0
[c007f83bfe30] c0009220 system_call+0x38/0x110
--- Exception: c01 (System Call) at 0ff5e0c4
SP (ffde0960) is in userspace
0:mon> r
R00 = c01c   R16 = 
R01 = c007f83bf620   R17 = 024000c0
R02 = c094ce00   R18 = 
R03 = d7b1   R19 = d3c32df0
R04 = d7b40338   R20 = c072b488
R05 = 007f   R21 = d7b1
R06 = d7b2   R22 = c098184c
R07 = 0080   R23 = 
R08 = 0607   R24 = 000300e0
R09 = 0007   R25 = c020
R10 = c0861ee0   R26 = d7b10270
R11 = c06755f8   R27 = c007fe0e
R12 = d7b10270   R28 = 2003
R13 = cfff   R29 = c007f83bf690
R14 = d3c32d61   R30 = 0003
R15 =    R31 = d7ae
pc  = c0667b6c .flush_icache_range+0x3c/0x84
lr  = c0082354 .bpf_int_jit_compile+0x1fc/0x2c8
msr = 9200b032   cr  = 44000248
ctr = 0407   xer = 2000   trap =  300
dar = d7b2   dsisr = 4000
0:mon> S
msr  = 90001032  sprg0= 8001
pvr  = 00703000  sprg1= cfff
dec  = 9f2d8ba4  sprg2= cfff
sp   = c007f83bed30  sprg3= 
toc  = c094ce00  dar  = d7b2
0:mon> u
SLB contents of cpu 0x0
00 c800 af32f5079500 256M ESID=c  VSID=af32f5079 
LLP:100 
01 d800 836935091510 256M ESID=d  VSID=836935091 
LLP:110 
02 c007f800 b52186c20500 256M ESID=c007f  VSID=b52186c20 
LLP:100 
03 c003f000 b224435e0500
04 c007f000 b52186c20500
05 c003f000 b224435e0500
06 c007f000 b52186c20500
07 c003f000 b224435e0500
08 c007f000 b52186c20500
09 c003f000 b224435e0500
10 c007f000 b52186c20500
11 c003f000 b224435e0500
12 c007f000 b52186c20500
13 c003f000 b224435e0500
14 c007f000 b52186c20500
15 c003f000 b224435e0500
16 c007f000 b52186c20500
17 c0007800 af86a8668500 256M ESID=c0007  VSID=af86a8668 
LLP:100 
18 c003f000 b224435e0500
19 c007f000 b52186c20500
20 c003f000 b224435e0500
21 c007f000 b52186c20500
22 c003f000 b224435e0500
23 c007f000 b52186c20500
24 c003f000 b224435e0500
25 c007f000 b52186c20500
26 c003f000 b224435e0500
27 c007f000 b52186c20500
28 c003f000 b224435e0500
29 c007f000 b52186c20500
30 c003f000 b224435e0500
31 c007f000 

Re: [PATCH 3/3] powerpc/pnv/pci: Fix incorrect PE reservation attempt on some 64-bit BARs

2016-06-22 Thread Gavin Shan
On Wed, Jun 22, 2016 at 05:26:19PM +1000, Benjamin Herrenschmidt wrote:
>The generic allocation code may sometimes decide to assign a prefetchable
>64-bit BAR to the M32 window. In fact it may also decide to allocate
>a 64-bit non-prefetchable BAR to the M64 one ! So using the resource
>flags as a test to decide which window was used for PE allocation is
>just wrong and leads to insane PE numbers.
>
>Instead, compare the addresses to figure it out.
>
>CC: sta...@vger.kernel.org
>Signed-off-by: Benjamin Herrenschmidt 

Acked-by: Gavin Shan 

>---
>
>This is a pretty nasty bug, I'd like to have Gavin ack it first but
>then we should push it back to distros. I don't know yet *why* the
>generic code is eager to put my BARs into 32-bit space but that's
>irrelevant here, it's allowed to do that and we should do the right
>thing anyway.
>

It's likely related to the lost 64-bit flag in the prefetchable window on the
root port. A similar issue was observed with a CAPI adapter connected directly
to the RC on the Garrison platform; it was fixed by skiboot commit d40160f
("PHB3: Emulate root complex pref 64-bits window").

pcibios_init
  pcibios_scan_phb
    pci_scan_child_bus            -> Scan root port
      pci_scan_bridge             -> Create bus behind root port
        pci_scan_child_bus        -> Scan devices on bus#1
          pcibios_fixup_bus
            pci_read_bridge_bases -> Setup bridge windows of root port
              pci_read_bridge_mmio_pref

In pci_read_bridge_mmio_pref() (sketched below), no 64-bit prefetchable
window is populated if bit PCI_PREF_RANGE_TYPE_64 (0x1) isn't set in the PCI
config register (PCI_PREF_MEMORY_BASE, 0x24). During the resource sizing and
assignment stage in the PCI core, all resources, including 64-bit prefetchable
ones, then end up covered by the 32-bit bridge window.
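
For reference, a condensed paraphrase of the check being described here (an
illustration only -- see drivers/pci/probe.c for the real
pci_read_bridge_mmio_pref()). The window is treated as 64-bit only when the
type bits in PCI_PREF_MEMORY_BASE say so, which is exactly the information
that was lost:

#include <linux/pci.h>

/* Sketch: how the bridge prefetchable window gets set up. */
static void sketch_read_bridge_mmio_pref(struct pci_dev *dev)
{
	u16 mem_base_lo, mem_limit_lo;
	u64 base, limit;

	pci_read_config_word(dev, PCI_PREF_MEMORY_BASE, &mem_base_lo);
	pci_read_config_word(dev, PCI_PREF_MEMORY_LIMIT, &mem_limit_lo);
	base  = ((u64)mem_base_lo  & PCI_PREF_RANGE_MASK) << 16;
	limit = ((u64)mem_limit_lo & PCI_PREF_RANGE_MASK) << 16;

	/* Only when the type bits say "64-bit" are the upper halves read. */
	if ((mem_base_lo & PCI_PREF_RANGE_TYPE_MASK) == PCI_PREF_RANGE_TYPE_64) {
		u32 mem_base_hi, mem_limit_hi;

		pci_read_config_dword(dev, PCI_PREF_BASE_UPPER32, &mem_base_hi);
		pci_read_config_dword(dev, PCI_PREF_LIMIT_UPPER32, &mem_limit_hi);
		base  |= (u64)mem_base_hi  << 32;
		limit |= (u64)mem_limit_hi << 32;
	}
	/*
	 * If the type bits were lost (the skiboot bug above), base and limit
	 * stay below 4GB and the window never carries IORESOURCE_MEM_64.
	 */
}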

Thanks,
Gavin

> arch/powerpc/platforms/powernv/pci-ioda.c | 27 +--
> 1 file changed, 17 insertions(+), 10 deletions(-)
>
>diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
>b/arch/powerpc/platforms/powernv/pci-ioda.c
>index c6396b6..0321ba3 100644
>--- a/arch/powerpc/platforms/powernv/pci-ioda.c
>+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>@@ -110,10 +110,16 @@ static int __init iommu_setup(char *str)
> }
> early_param("iommu", iommu_setup);
>
>-static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
>+static inline bool pnv_pci_is_mem_pref_64(struct pnv_phb *phb, struct 
>resource *r)
> {
>-  return ((flags & (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH)) ==
>-  (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
>+  /* WARNING: We cannot rely on the resource flags. The Linux PCI
>+   * allocation code sometimes decides to put a 64-bit prefetchable
>+   * BAR in the 32-bit window, so we have to compare the addresses.
>+   *
>+   * For simplicity we only test resource start.
>+   */
>+  return (r->start >= phb->ioda.m64_base &&
>+  r->start < (phb->ioda.m64_base + phb->ioda.m64_size));
> }
>
> static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb, int pe_no)
>@@ -230,7 +236,7 @@ static void pnv_ioda_reserve_dev_m64_pe(struct pci_dev 
>*pdev,
>   sgsz = phb->ioda.m64_segsize;
>   for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
>   r = >resource[i];
>-  if (!r->parent || !pnv_pci_is_mem_pref_64(r->flags))
>+  if (!r->parent || !pnv_pci_is_mem_pref_64(phb, r))
>   continue;
>
>   start = _ALIGN_DOWN(r->start - base, sgsz);
>@@ -3059,7 +3065,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
>pci_dev *pdev)
>   res = >resource[i + PCI_IOV_RESOURCES];
>   if (!res->flags || res->parent)
>   continue;
>-  if (!pnv_pci_is_mem_pref_64(res->flags)) {
>+  if (!pnv_pci_is_mem_pref_64(phb, res)) {
>   dev_warn(>dev, "Don't support SR-IOV with"
>   " non M64 VF BAR%d: %pR. \n",
>i, res);
>@@ -3154,8 +3160,7 @@ static void pnv_ioda_setup_pe_res(struct pnv_ioda_pe *pe,
>   region.start += phb->ioda.io_segsize;
>   index++;
>   }
>-  } else if ((res->flags & IORESOURCE_MEM) &&
>- !pnv_pci_is_mem_pref_64(res->flags)) {
>+  } else if ((res->flags & IORESOURCE_MEM) && 
>!pnv_pci_is_mem_pref_64(phb, res)) {
>   region.start = res->start -
>  phb->hose->mem_offset[0] -
>  phb->ioda.m32_pci_base;
>@@ -3314,9 +3319,11 @@ static resource_size_t pnv_pci_window_alignment(struct 
>pci_bus *bus,
>   bridge = bridge->bus->self;
>   }
>
>-  /* We fail back to M32 if M64 isn't supported */
>-  if (phb->ioda.m64_segsize &&
>-  pnv_pci_is_mem_pref_64(type))
>+  /* We fail back to M32 if M64 isn't supported. We enforce the M64
>+   * alignment for any 64-bit resource, PCIe doesn't care and
>+   * bridges only do 64-bit prefetchable anyway
>+   */
>+  if (phb->ioda.m64_segsize && (type & IORESOURCE_MEM_64))
>+  return phb->ioda.m64_segsize;
>   if (type & IORESOURCE_MEM)
>   return phb->ioda.m32_segsize;

Re: [PATCH v3 2/9] kexec_file: Generalize kexec_add_buffer.

2016-06-22 Thread Dave Young

The patch looks good, but could the subject be more specific?

For example, just like the first sentence of the patch description:
Allow architectures to specify their own memory walking function

On 06/21/16 at 04:48pm, Thiago Jung Bauermann wrote:
> Allow architectures to specify different memory walking functions for
> kexec_add_buffer. Intel uses iomem to track reserved memory ranges,
> but PowerPC uses the memblock subsystem.
> 
> Signed-off-by: Thiago Jung Bauermann 
> Cc: Eric Biederman 
> Cc: Dave Young 
> Cc: ke...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  include/linux/kexec.h   | 19 ++-
>  kernel/kexec_file.c | 30 ++
>  kernel/kexec_internal.h | 14 --
>  3 files changed, 40 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index e8acb2b43dd9..3d91bcfc180d 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -146,7 +146,24 @@ struct kexec_file_ops {
>   kexec_verify_sig_t *verify_sig;
>  #endif
>  };
> -#endif
> +
> +/*
> + * Keeps track of buffer parameters as provided by caller for requesting
> + * memory placement of buffer.
> + */
> +struct kexec_buf {
> + struct kimage *image;
> + unsigned long mem;
> + unsigned long memsz;
> + unsigned long buf_align;
> + unsigned long buf_min;
> + unsigned long buf_max;
> + bool top_down;  /* allocate from top of memory hole */
> +};
> +
> +int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
> +int (*func)(u64, u64, void *));
> +#endif /* CONFIG_KEXEC_FILE */
>  
>  struct kimage {
>   kimage_entry_t head;
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index b6eec7527e9f..b1f1f6402518 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -428,6 +428,27 @@ static int locate_mem_hole_callback(u64 start, u64 end, 
> void *arg)
>   return locate_mem_hole_bottom_up(start, end, kbuf);
>  }
>  
> +/**
> + * arch_kexec_walk_mem - call func(data) on free memory regions
> + * @kbuf: Context info for the search. Also passed to @func.
> + * @func: Function to call for each memory region.
> + *
> + * Return: The memory walk will stop when func returns a non-zero value
> + * and that value will be returned. If all free regions are visited without
> + * func returning non-zero, then zero will be returned.
> + */
> +int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
> +int (*func)(u64, u64, void *))
> +{
> + if (kbuf->image->type == KEXEC_TYPE_CRASH)
> + return walk_iomem_res_desc(crashk_res.desc,
> +IORESOURCE_SYSTEM_RAM | 
> IORESOURCE_BUSY,
> +crashk_res.start, crashk_res.end,
> +kbuf, func);
> + else
> + return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
> +}
> +
>  /*
>   * Helper function for placing a buffer in a kexec segment. This assumes
>   * that kexec_mutex is held.
> @@ -472,14 +493,7 @@ int kexec_add_buffer(struct kimage *image, char *buffer, 
> unsigned long bufsz,
>   kbuf->top_down = top_down;
>  
>   /* Walk the RAM ranges and allocate a suitable range for the buffer */
> - if (image->type == KEXEC_TYPE_CRASH)
> - ret = walk_iomem_res_desc(crashk_res.desc,
> - IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY,
> - crashk_res.start, crashk_res.end, kbuf,
> - locate_mem_hole_callback);
> - else
> - ret = walk_system_ram_res(0, -1, kbuf,
> -   locate_mem_hole_callback);
> + ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
>   if (ret != 1) {
>   /* A suitable memory range could not be found for buffer */
>   return -EADDRNOTAVAIL;
> diff --git a/kernel/kexec_internal.h b/kernel/kexec_internal.h
> index eefd5bf960c2..4cef7e4706b0 100644
> --- a/kernel/kexec_internal.h
> +++ b/kernel/kexec_internal.h
> @@ -20,20 +20,6 @@ struct kexec_sha_region {
>   unsigned long len;
>  };
>  
> -/*
> - * Keeps track of buffer parameters as provided by caller for requesting
> - * memory placement of buffer.
> - */
> -struct kexec_buf {
> - struct kimage *image;
> - unsigned long mem;
> - unsigned long memsz;
> - unsigned long buf_align;
> - unsigned long buf_min;
> - unsigned long buf_max;
> - bool top_down;  /* allocate from top of memory hole */
> -};
> -
>  void kimage_file_post_load_cleanup(struct kimage *image);
>  #else /* CONFIG_KEXEC_FILE */
>  static inline void kimage_file_post_load_cleanup(struct kimage *image) { }

Thanks
Dave

Re: [PATCH v3 3/9] kexec_file: Factor out kexec_locate_mem_hole from kexec_add_buffer.

2016-06-22 Thread Dave Young
On 06/21/16 at 04:48pm, Thiago Jung Bauermann wrote:
> kexec_locate_mem_hole will be used by the PowerPC kexec_file_load
> implementation to find free memory for the purgatory stack.
> 
> Signed-off-by: Thiago Jung Bauermann 
> Cc: Eric Biederman 
> Cc: Dave Young 
> Cc: ke...@lists.infradead.org
> Cc: linux-ker...@vger.kernel.org
> ---
>  include/linux/kexec.h |  4 
>  kernel/kexec_file.c   | 66 
> ++-
>  2 files changed, 53 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index 3d91bcfc180d..4ca6f5f95d66 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -227,6 +227,10 @@ extern asmlinkage long sys_kexec_load(unsigned long 
> entry,
>   struct kexec_segment __user *segments,
>   unsigned long flags);
>  extern int kernel_kexec(void);
> +int kexec_locate_mem_hole(struct kimage *image, unsigned long size,
> +   unsigned long align, unsigned long min_addr,
> +   unsigned long max_addr, bool top_down,
> +   unsigned long *addr);
>  extern int kexec_add_buffer(struct kimage *image, char *buffer,
>   unsigned long bufsz, unsigned long memsz,
>   unsigned long buf_align, unsigned long buf_min,
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index b1f1f6402518..85a515511925 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -449,6 +449,46 @@ int __weak arch_kexec_walk_mem(struct kexec_buf *kbuf,
>   return walk_system_ram_res(0, ULONG_MAX, kbuf, func);
>  }
>  
> +/**
> + * kexec_locate_mem_hole - find free memory to load segment or use in 
> purgatory
> + * @image:   kexec image being updated.
> + * @size: Memory size.
> + * @align:   Minimum alignment needed.
> + * @min_addr: Minimum starting address.
> + * @max_addr: Maximum end address.
> + * @top_down: Find the highest free memory region?
> + * @addr: On success, will have start address of the memory region found.
> + *
> + * Return: 0 on success, negative errno on error.
> + */
> +int kexec_locate_mem_hole(struct kimage *image, unsigned long size,
> +   unsigned long align, unsigned long min_addr,
> +   unsigned long max_addr, bool top_down,
> +   unsigned long *addr)
> +{
> + int ret;
> + struct kexec_buf buf;
> +
> + memset(, 0, sizeof(struct kexec_buf));
> + buf.image = image;
> +
> + buf.memsz = size;
> + buf.buf_align = align;
> + buf.buf_min = min_addr;
> + buf.buf_max = max_addr;
> + buf.top_down = top_down;

Since patch 2/9 moved kexec_buf from the internal header file to kexec.h, it
would be natural to pass a kexec_buf pointer instead of all these separate
arguments to kexec_locate_mem_hole (see the sketch below).

kbuf.mem can be used for addr.
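
A minimal sketch of the suggested interface (the shape is hypothetical; the
names simply mirror the code quoted above):

/* The caller fills a struct kexec_buf; the result comes back in kbuf->mem. */
int kexec_locate_mem_hole(struct kexec_buf *kbuf)
{
	int ret;

	ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
	/* The walk returns 1 once a suitable hole has been found. */
	return ret == 1 ? 0 : -EADDRNOTAVAIL;
}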

> +
> + ret = arch_kexec_walk_mem(, locate_mem_hole_callback);
> + if (ret != 1) {
> + /* A suitable memory range could not be found for buffer */
> + return -EADDRNOTAVAIL;
> + }
> +
> + *addr = buf.mem;
> +
> + return 0;
> +}
> +
>  /*
>   * Helper function for placing a buffer in a kexec segment. This assumes
>   * that kexec_mutex is held.
> @@ -460,8 +500,8 @@ int kexec_add_buffer(struct kimage *image, char *buffer, 
> unsigned long bufsz,
>  {
>  
>   struct kexec_segment *ksegment;
> - struct kexec_buf buf, *kbuf;
>   int ret;
> + unsigned long addr, align, size;
>  
>   /* Currently adding segment this way is allowed only in file mode */
>   if (!image->file_mode)
> @@ -482,29 +522,21 @@ int kexec_add_buffer(struct kimage *image, char 
> *buffer, unsigned long bufsz,
>   return -EINVAL;
>   }
>  
> - memset(, 0, sizeof(struct kexec_buf));
> - kbuf = 
> - kbuf->image = image;
> -
> - kbuf->memsz = ALIGN(memsz, PAGE_SIZE);
> - kbuf->buf_align = max(buf_align, PAGE_SIZE);
> - kbuf->buf_min = buf_min;
> - kbuf->buf_max = buf_max;
> - kbuf->top_down = top_down;
> + size = ALIGN(memsz, PAGE_SIZE);
> + align = max(buf_align, PAGE_SIZE);
>  
>   /* Walk the RAM ranges and allocate a suitable range for the buffer */
> - ret = arch_kexec_walk_mem(kbuf, locate_mem_hole_callback);
> - if (ret != 1) {
> - /* A suitable memory range could not be found for buffer */
> - return -EADDRNOTAVAIL;
> - }
> + ret = kexec_locate_mem_hole(image, size, align, buf_min, buf_max,
> + top_down, );
> + if (ret)
> + return ret;
>  
>   /* Found a suitable memory range */
>   ksegment = >segment[image->nr_segments];
>   ksegment->kbuf = buffer;
>   ksegment->bufsz = bufsz;
> - ksegment->mem = kbuf->mem;
> - 

Kernel 4.7 doesn't boot with the commit "powerpc-4.7-1" on the Nemo board

2016-06-22 Thread Christian Zigotzky
Hi All,

We have a patch that contains the problematic code.

One question: which file in this patch is responsible for starting the kernel?
Maybe more than one file. We know that the kernel doesn't start, so it must be
source code for the early startup.

I was able to compile rc4 without the three PowerPC commits, and it boots and
works.

Please find attached our patch. Please note: we are working on Linux support
for the Nemo board. We are not developers; we are trying to solve issues with
the kernel. We didn't create the Nemo patch, we only modified it a little.

Thanks,

Christian

> Hello Christian
> 
>> @Darren: Have you already found the problematic code? I think the PowerPC
>> developers can't help us.
> 
> Funny you should ask... I've just been giving a new patch a quick test -
> successfully.
> 
> I did a manual revert on commit: d6a9996e84ac4beb7713e9485f4563e100a9b03e
> against 4.7-rc3 (patch attached)
> 
> I'm still very much at a loss as to why this patch breaks booting on our
> system, but it definitely does.
> 
> I will try removing parts of it to see which bit causes the problem.
> 
> Regards
> Darren
> 
> P.S. The patch is pulled from my Git tree and hasn't been directly tested.
diff --git a/arch/powerpc/include/asm/book3s/64/hash.h 
b/arch/powerpc/include/asm/book3s/64/hash.h
index f61cad3..cd3e915 100644
--- a/arch/powerpc/include/asm/book3s/64/hash.h
+++ b/arch/powerpc/include/asm/book3s/64/hash.h
@@ -45,17 +45,17 @@
 /*
  * Define the address range of the kernel non-linear virtual area
  */
-#define H_KERN_VIRT_START ASM_CONST(0xD000)
-#define H_KERN_VIRT_SIZE   ASM_CONST(0x1000)
+#define KERN_VIRT_START ASM_CONST(0xD000)
+#define KERN_VIRT_SIZE ASM_CONST(0x1000)
 
 /*
  * The vmalloc space starts at the beginning of that region, and
  * occupies half of it on hash CPUs and a quarter of it on Book3E
  * (we keep a quarter for the virtual memmap)
  */
-#define H_VMALLOC_START H_KERN_VIRT_START
-#define H_VMALLOC_SIZE (H_KERN_VIRT_SIZE >> 1)
-#define H_VMALLOC_END  (H_VMALLOC_START + H_VMALLOC_SIZE)
+#define VMALLOC_START  KERN_VIRT_START
+#define VMALLOC_SIZE   (KERN_VIRT_SIZE >> 1)
+#define VMALLOC_END(VMALLOC_START + VMALLOC_SIZE)
 
 /*
  * Region IDs
@@ -64,7 +64,7 @@
 #define REGION_MASK (0xfUL << REGION_SHIFT)
 #define REGION_ID(ea)  (((unsigned long)(ea)) >> REGION_SHIFT)
 
-#define VMALLOC_REGION_ID  (REGION_ID(H_VMALLOC_START))
+#define VMALLOC_REGION_ID  (REGION_ID(VMALLOC_START))
 #define KERNEL_REGION_ID   (REGION_ID(PAGE_OFFSET))
 #define VMEMMAP_REGION_ID  (0xfUL) /* Server only */
 #define USER_REGION_ID (0UL)
@@ -73,7 +73,7 @@
  * Defines the address of the vmemap area, in its own region on
  * hash table CPUs.
  */
-#define H_VMEMMAP_BASE (VMEMMAP_REGION_ID << REGION_SHIFT)
+#define VMEMMAP_BASE   (VMEMMAP_REGION_ID << REGION_SHIFT)
 
 #ifdef CONFIG_PPC_MM_SLICES
 #define HAVE_ARCH_UNMAPPED_AREA
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 88a5eca..bdfea62 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -218,18 +218,6 @@ extern unsigned long __pte_frag_size_shift;
 #define PUD_MASKED_BITS 0xc0ffUL
 /* Bits to mask out from a PGD to get to the PUD page */
 #define PGD_MASKED_BITS 0xc0ffUL
-
-extern unsigned long __vmalloc_start;
-extern unsigned long __vmalloc_end;
-#define VMALLOC_START  __vmalloc_start
-#define VMALLOC_END__vmalloc_end
-
-extern unsigned long __kernel_virt_start;
-extern unsigned long __kernel_virt_size;
-#define KERN_VIRT_START __kernel_virt_start
-#define KERN_VIRT_SIZE  __kernel_virt_size
-extern struct page *vmemmap;
-extern unsigned long ioremap_bot;
 #endif /* __ASSEMBLY__ */
 
 #include 
@@ -242,6 +230,7 @@ extern unsigned long ioremap_bot;
 #endif
 
 #include 
+
 /*
  * The second half of the kernel virtual space is used for IO mappings,
  * it's itself carved into the PIO region (ISA and PHB IO space) and
@@ -260,6 +249,8 @@ extern unsigned long ioremap_bot;
 #define IOREMAP_BASE   (PHB_IO_END)
 #define IOREMAP_END (KERN_VIRT_START + KERN_VIRT_SIZE)
 
+#define vmemmap ((struct page *)VMEMMAP_BASE)
+
 /* Advertise special mapping type for AGP */
 #define HAVE_PAGE_AGP
 
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index 937d4e2..a8b24d6 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -31,74 +31,6 @@
  RADIX_PUD_INDEX_SIZE + RADIX_PGD_INDEX_SIZE + 
PAGE_SHIFT)
 #define RADIX_PGTABLE_RANGE (ASM_CONST(1) << RADIX_PGTABLE_EADDR_SIZE)
 
-/*
- * We support 52 bit address space, Use top bit for kernel
- * virtual mapping. Also make sure kernel fit in the top
- * quadrant.

Re: [PATCH 2/2] KVM: PPC: hypervisor large decrementer support

2016-06-22 Thread oliver
On Wed, Jun 1, 2016 at 4:23 PM, Michael Neuling  wrote:
> On Tue, 2016-05-31 at 17:16 +1000, Oliver O'Halloran wrote:
>> +#define IS_LD_ENABLED(reg) \
>> + mfspr  reg,SPRN_LPCR;  \
>> + andis. reg,reg,(LPCR_LD >> 16);
>
> FWIW you can use:
> andis. reg,reg,(LPCR_LD)@ha
>
>> +#define GET_DEC(reg)   \
>> + IS_LD_ENABLED(reg);\
>> + mfspr reg, SPRN_DEC;   \
>> + bne 99f;   \
>> + extsw reg, reg;\
>> +99:
>
> It's a little painful that GET_DEC() is now 2 SPR moves.  SPR moves can be
> a bit expensive.  Probably ok for now, but might be nice to store the guest
> dec LD state somewhere so we don't have to retrieve it from the LPCR SPR.

Seems reasonable. It looks like it stashes the LPCR value in the KVM vcpu
structure already


> Actually, it's probably best to do this now since checking the LD bit in
> the LPCR on P8 is a bit dodgy and unnecessary. There is a kvm->arch.lpcr
> you might be able use for this and you can add an asm-offsets for it too
> (like KVM_LPID).
>
> Is GET_DEC ever run in HV=0, where the guest couldn't read the LPCR?

It's only used in book3s_hv_rmhandlers.S, which contains the real mode h-call
handlers, and none of that should be executed outside the host. Moving that
code there from the generic exception header file is a good idea though.

> Also, this now trashes cr0... have you checked that's ok in the paths it's
> used?

It looks fine, but I'll document that.

>> +
>> + LOAD_REG_ADDR(r6, decrementer_max);
>> + ld  r6,0(r6);
>>   mtspr   SPRN_HDEC, r6
>>   /* and set per-LPAR registers, if doing dynamic micro-threading */
>>   ld  r6, HSTATE_SPLIT_MODE(r13)
>> @@ -897,7 +902,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
>>   mftbr7
>>   subfr3,r7,r8
>>   mtspr   SPRN_DEC,r3
>> - stw r3,VCPU_DEC(r4)
>> + std r3,VCPU_DEC(r4)
>>
>>   ld  r5, VCPU_SPRG0(r4)
>>   ld  r6, VCPU_SPRG1(r4)
>> @@ -953,8 +958,8 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_207S)
>>   isync
>>
>>   /* Check if HDEC expires soon */
>> - mfspr   r3, SPRN_HDEC
>> - cmpwi   r3, 512 /* 1 microsecond */
>> + GET_HDEC(r3)
>> + cmpdi   r3, 512 /* 1 microsecond */
>>   blt hdec_soon
>>
>>   ld  r6, VCPU_CTR(r4)
>> @@ -990,8 +995,9 @@ deliver_guest_interrupt:
>>   beq 5f
>>   li  r0, BOOK3S_INTERRUPT_EXTERNAL
>>   bne cr1, 12f
>> - mfspr   r0, SPRN_DEC
>> - cmpwi   r0, 0
>> +
>> + GET_DEC(r0)
>> + cmpdi   r0, 0
>
> We could just use mfspr DEC here since we are just comparing to 0.  It
> should work in any mode.

I'm not sure about that. The result of the comparison is used below:

>>   li  r0, BOOK3S_INTERRUPT_DECREMENTER
>>   bge 5f

It's checking for the DEC overflowing rather than checking whether it's zero.
If LD=0, the mfspr result would not be sign-extended, causing the branch to be
taken even if the DEC had overflowed (see the sketch below).
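
A small userspace sketch of why that sign extension matters (illustrative
only, not code from the patch): a 32-bit DEC that has just wrapped past zero
reads as a huge positive number unless it is sign-extended first, so a 64-bit
signed compare against zero never sees it as negative.

#include <stdint.h>
#include <stdio.h>

/* Model of GET_DEC: sign-extend the 32-bit DEC value when the large
 * decrementer (LD) is disabled, pass the value through otherwise. */
static int64_t get_dec(uint64_t raw, int ld_enabled)
{
	if (ld_enabled)
		return (int64_t)raw;
	return (int64_t)(int32_t)raw;	/* the extsw in the macro */
}

int main(void)
{
	uint64_t raw = 0xfffffff0;	/* 32-bit DEC, 16 ticks past zero */

	printf("without extsw: %lld\n", (long long)(int64_t)raw);
	printf("with extsw:    %lld\n", (long long)get_dec(raw, 0));
	return 0;
}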

Anyway I'm thinking I might drop this patch for now and let Balbir post it
as a part of his KVM series when that's ready.

[PATCH 3/3] powerpc/pnv/pci: Fix incorrect PE reservation attempt on some 64-bit BARs

2016-06-22 Thread Benjamin Herrenschmidt
The generic allocation code may sometimes decide to assign a prefetchable
64-bit BAR to the M32 window. In fact it may also decide to allocate
a 64-bit non-prefetchable BAR to the M64 one! So using the resource
flags as a test to decide which window was used for PE allocation is
just wrong and leads to insane PE numbers.

Instead, compare the addresses to figure it out.

CC: sta...@vger.kernel.org
Signed-off-by: Benjamin Herrenschmidt 
---

This is a pretty nasty bug, I'd like to have Gavin ack it first but
then we should push it back to distros. I don't know yet *why* the
generic code is eager to put my BARs into 32-bit space but that's
irrelevant here, it's allowed to do that and we should do the right
thing anyway.

 arch/powerpc/platforms/powernv/pci-ioda.c | 27 +--
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
b/arch/powerpc/platforms/powernv/pci-ioda.c
index c6396b6..0321ba3 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -110,10 +110,16 @@ static int __init iommu_setup(char *str)
 }
 early_param("iommu", iommu_setup);
 
-static inline bool pnv_pci_is_mem_pref_64(unsigned long flags)
+static inline bool pnv_pci_is_mem_pref_64(struct pnv_phb *phb, struct resource 
*r)
 {
-   return ((flags & (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH)) ==
-   (IORESOURCE_MEM_64 | IORESOURCE_PREFETCH));
+   /* WARNING: We cannot rely on the resource flags. The Linux PCI
+* allocation code sometimes decides to put a 64-bit prefetchable
+* BAR in the 32-bit window, so we have to compare the addresses.
+*
+* For simplicity we only test resource start.
+*/
+   return (r->start >= phb->ioda.m64_base &&
+   r->start < (phb->ioda.m64_base + phb->ioda.m64_size));
 }
 
 static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb, int pe_no)
@@ -230,7 +236,7 @@ static void pnv_ioda_reserve_dev_m64_pe(struct pci_dev 
*pdev,
sgsz = phb->ioda.m64_segsize;
for (i = 0; i <= PCI_ROM_RESOURCE; i++) {
r = >resource[i];
-   if (!r->parent || !pnv_pci_is_mem_pref_64(r->flags))
+   if (!r->parent || !pnv_pci_is_mem_pref_64(phb, r))
continue;
 
start = _ALIGN_DOWN(r->start - base, sgsz);
@@ -3059,7 +3065,7 @@ static void pnv_pci_ioda_fixup_iov_resources(struct 
pci_dev *pdev)
res = >resource[i + PCI_IOV_RESOURCES];
if (!res->flags || res->parent)
continue;
-   if (!pnv_pci_is_mem_pref_64(res->flags)) {
+   if (!pnv_pci_is_mem_pref_64(phb, res)) {
dev_warn(>dev, "Don't support SR-IOV with"
" non M64 VF BAR%d: %pR. \n",
 i, res);
@@ -3154,8 +3160,7 @@ static void pnv_ioda_setup_pe_res(struct pnv_ioda_pe *pe,
region.start += phb->ioda.io_segsize;
index++;
}
-   } else if ((res->flags & IORESOURCE_MEM) &&
-  !pnv_pci_is_mem_pref_64(res->flags)) {
+   } else if ((res->flags & IORESOURCE_MEM) && 
!pnv_pci_is_mem_pref_64(phb, res)) {
region.start = res->start -
   phb->hose->mem_offset[0] -
   phb->ioda.m32_pci_base;
@@ -3314,9 +3319,11 @@ static resource_size_t pnv_pci_window_alignment(struct 
pci_bus *bus,
bridge = bridge->bus->self;
}
 
-   /* We fail back to M32 if M64 isn't supported */
-   if (phb->ioda.m64_segsize &&
-   pnv_pci_is_mem_pref_64(type))
+   /* We fail back to M32 if M64 isn't supported. We enforce the M64
+* alignment for any 64-bit resource, PCIe doesn't care and
+* bridges only do 64-bit prefetchable anyway
+*/
+   if (phb->ioda.m64_segsize && (type & IORESOURCE_MEM_64))
return phb->ioda.m64_segsize;
if (type & IORESOURCE_MEM)
return phb->ioda.m32_segsize;



[RFC/PATCH] powerpc/pci: Don't try to allocate resources that will be reassigned

2016-06-22 Thread Benjamin Herrenschmidt
When we know we will reassign all resources, trying (and failing)
to allocate them initially is fairly pointless and leads to a lot
of scary messages in the kernel log.

Signed-off-by: Benjamin Herrenschmidt 
--

This one probably needs testing on a few platforms but I think it's
correct ;-)

---
 arch/powerpc/kernel/pci-common.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index e723a32..2a67b16 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1360,9 +1360,9 @@ void __init pcibios_resource_survey(void)
struct pci_bus *b;
 
/* Allocate and assign resources */
+   list_for_each_entry(b, _root_buses, node)
+   pcibios_allocate_bus_resources(b);
if (!pci_has_flag(PCI_REASSIGN_ALL_RSRC)) {
-   list_for_each_entry(b, _root_buses, node)
-   pcibios_allocate_bus_resources(b);
pcibios_allocate_resources(0);
pcibios_allocate_resources(1);
}



[PATCH] powerpc/pci: Reduce log level of PCI I/O space warning

2016-06-22 Thread Benjamin Herrenschmidt
If a PHB has no I/O space, there's no need to make it look like
something bad happened; a pr_debug() is plenty, since this is the
case for all our modern POWER chips.

Signed-off-by: Benjamin Herrenschmidt 
---
 arch/powerpc/kernel/pci-common.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index 2a67b16..3ab1f7b 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -1487,9 +1487,9 @@ static void pcibios_setup_phb_resources(struct 
pci_controller *hose,
res = >io_resource;
 
if (!res->flags) {
-   pr_info("PCI: I/O resource not set for host"
-  " bridge %s (domain %d)\n",
-  hose->dn->full_name, hose->global_number);
+   pr_debug("PCI: I/O resource not set for host"
+" bridge %s (domain %d)\n",
+hose->dn->full_name, hose->global_number);
} else {
offset = pcibios_io_space_offset(hose);
 



Re: [PATCH] ppc: Fix BPF JIT for ABIv2

2016-06-22 Thread Naveen N. Rao
On 2016/06/21 11:47AM, Thadeu Lima de Souza Cascardo wrote:
> On Tue, Jun 21, 2016 at 09:15:48PM +1000, Michael Ellerman wrote:
> > On Tue, 2016-06-21 at 14:28 +0530, Naveen N. Rao wrote:
> > > On 2016/06/20 03:56PM, Thadeu Lima de Souza Cascardo wrote:
> > > > On Sun, Jun 19, 2016 at 11:19:14PM +0530, Naveen N. Rao wrote:
> > > > > On 2016/06/17 10:00AM, Thadeu Lima de Souza Cascardo wrote:
> > > > > > 
> > > > > > Hi, Michael and Naveen.
> > > > > > 
> > > > > > I noticed independently that there is a problem with BPF JIT and 
> > > > > > ABIv2, and
> > > > > > worked out the patch below before I noticed Naveen's patchset and 
> > > > > > the latest
> > > > > > changes in ppc tree for a better way to check for ABI versions.
> > > > > > 
> > > > > > However, since the issue described below affect mainline and stable 
> > > > > > kernels,
> > > > > > would you consider applying it before merging your two patchsets, 
> > > > > > so that we can
> > > > > > more easily backport the fix?
> > > > > 
> > > > > Hi Cascardo,
> > > > > Given that this has been broken on ABIv2 since forever, I didn't 
> > > > > bother 
> > > > > fixing it. But, I can see why this would be a good thing to have for 
> > > > > -stable and existing distros. However, while your patch below may fix 
> > > > > the crash you're seeing on ppc64le, it is not sufficient -- you'll 
> > > > > need 
> > > > > changes in bpf_jit_asm.S as well.
> > > > 
> > > > Hi, Naveen.
> > > > 
> > > > Any tips on how to exercise possible issues there? Or what changes you 
> > > > think
> > > > would be sufficient?
> > > 
> > > The calling convention is different with ABIv2 and so we'll need changes 
> > > in bpf_slow_path_common() and sk_negative_common().
> > 
> > How big would those changes be? Do we know?

I don't think it'd be that much -- I will take a stab at this today.
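
For context, one illustrative example of the kind of ABI difference involved
(a sketch under the usual ELFv1/ELFv2 rules, not the actual bpf_jit change;
BPF_MIN_STACK_FRAME is a made-up name for this sketch): the minimum stack
frame size differs between the two ABIs, and any helper that builds a stack
frame has to honour that.

/* ELFv2 (ABIv2, used by ppc64le) needs only a 32-byte minimum frame;
 * ELFv1 requires 112 bytes including the parameter save area. */
#if defined(_CALL_ELF) && _CALL_ELF == 2
#define BPF_MIN_STACK_FRAME	32
#else
#define BPF_MIN_STACK_FRAME	112
#endif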

> > 
> > How come no one reported this was broken previously? This is the first I've
> > heard of it being broken.
> > 
> 
> I just heard of it less than two weeks ago, and only could investigate it last
> week, when I realized mainline was also affected.
> 
> It looks like the little-endian support for classic JIT was done before the
> conversion to ABIv2. And as the JIT is disabled by default, no one seems to
> have exercised it.

Yes, my thoughts too. I didn't previously think much about this as JIT 
wouldn't be enabled by default. It's interesting though that no one else 
reported this as an issue before.

> 
> > > However, rather than enabling classic JIT for ppc64le, are we better off 
> > > just disabling it?
> > > 
> > > --- a/arch/powerpc/Kconfig
> > > +++ b/arch/powerpc/Kconfig
> > > @@ -128,7 +128,7 @@ config PPC
> > > select IRQ_FORCED_THREADING
> > > select HAVE_RCU_TABLE_FREE if SMP
> > > select HAVE_SYSCALL_TRACEPOINTS
> > > -   select HAVE_CBPF_JIT
> > > +   select HAVE_CBPF_JIT if CPU_BIG_ENDIAN
> > > select HAVE_ARCH_JUMP_LABEL
> > > select ARCH_HAVE_NMI_SAFE_CMPXCHG
> > > select ARCH_HAS_GCOV_PROFILE_ALL
> > > 
> > > 
> > > Michael,
> > > Let me know your thoughts on whether you intend to take this patch or 
> > > Cascardo's patch for -stable before the eBPF patches. I can redo my 
> > > patches accordingly.
> > 
> > This patch sounds like the best option at the moment for something we can
> > backport. Unless the changes to fix it are minimal.

Right -- I will take a look today to see what changes would be needed.

- Naveen


Re: [6/6] ppc: ebpf/jit: Implement JIT compiler for extended BPF

2016-06-22 Thread Naveen N. Rao
On 2016/06/21 09:04PM, Michael Ellerman wrote:
> On Tue, 2016-06-21 at 12:28 +0530, Naveen N. Rao wrote:
> > On 2016/06/21 09:38AM, Michael Ellerman wrote:
> > > On Sun, 2016-06-19 at 23:06 +0530, Naveen N. Rao wrote:
> > > > 
> > > > #include 
> > > > 
> > > > in bpf_jit_comp64.c
> > > > 
> > > > Can you please check if it resolves the build error?
> > > 
> > > Can you? :D
> > 
> > :)
> > Sorry, I should have explained myself better. I did actually try your 
> > config and I was able to reproduce the build error. After the above 
> > #include, that error went away, but I saw some vdso related errors. I 
> > thought I was doing something wrong and needed a different setup for 
> > that particular kernel config, which is why I requested your help in the 
> > matter. I just didn't do a good job of putting across that message...
> 
> Ah OK. Not sure why you're seeing VDSO errors?

'Cause I wasn't paying attention. I tried your .config on a LE machine.  
It works fine on BE, as it should.

> 
> > Note to self: randconfig builds *and* more time drafting emails :)
> 
> No stress. You don't need to do randconfig builds, or even build all the
> arch/powerpc/ configs, just try to do a reasonable set, something like - 
> ppc64,
> powernv, pseries, pmac32, ppc64e.

Ok, will do.

> 
> I'm happy to catch the esoteric build failures.
> 
> > Do you want me to respin the patches?
> 
> No that's fine, I'll fix it up here.

Thanks,
Naveen


Re: [PATCH v3] tools/perf: Fix the mask in regs_dump__printf and print_sample_iregs

2016-06-22 Thread Jiri Olsa
On Tue, Jun 21, 2016 at 06:35:31PM +0300, Yury Norov wrote:

SNIP

> > index 5214974e841a..1337b1c73f82 100644
> > --- a/tools/perf/util/session.c
> > +++ b/tools/perf/util/session.c
> > @@ -940,8 +940,22 @@ static void branch_stack__printf(struct perf_sample 
> > *sample)
> >  static void regs_dump__printf(u64 mask, u64 *regs)
> >  {
> > unsigned rid, i = 0;
> > +   unsigned long _mask[sizeof(mask)/sizeof(unsigned long)];
> >  
> > -   for_each_set_bit(rid, (unsigned long *) , sizeof(mask) * 8) {
> > +   /*
> > +* Since u64 is passed as 'unsigned long *', check
> > +* to see whether we need to swap words within u64.
> > +* Reason being, in 32 bit big endian userspace on a
> > +* 64bit kernel, 'unsigned long' is 32 bits.
> > +* When reading u64 using (u32 *)()[0] and (u32 *)()[1],
> > +* we will get wrong value for the mask. This is what
> > +* find_first_bit() and find_next_bit() is doing.
> > +* Issue here is "(u32 *)()[0]" gets upper 32 bits of u64,
> > +* but perf assumes it gets lower 32bits of u64. Hence the check
> > +* and swap.
> > +*/
> 
> Identical comments... I'd prefer to see this in the commit message, or
> better, in the function description. To me, what happens here is
> straightforward enough to understand in place without the comment.

yep, please use this just once as the function description

thanks,
jirka
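
To see the word-order problem concretely, here is a small standalone sketch
(it mirrors the issue described in the quoted comment; it is not code from
the patch):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	uint64_t mask = 0x0000000000000001ULL;	/* only bit 0 set */
	uint32_t words[2];

	memcpy(words, &mask, sizeof(mask));

	/* In a 32-bit big-endian process, words[0] holds the UPPER half of
	 * the u64, so code that walks the u64 as an array of 32-bit longs
	 * starting at words[0] sees 0 where it expects bit 0. */
	printf("words[0]=0x%08x words[1]=0x%08x\n", words[0], words[1]);
	return 0;
}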