Copyright © 2017 Microsoft Outlook Inc. All rights reserved.

2017-04-06 Thread PROSERVE: OPOLINTO, Angelica C.
MICROSOFT OUTLOOK NOTIFICATION

Your e-mail account needs to be verified now because of irregularities found
in your e-mail account, or it will be blocked. Please CLICK
HERE to verify your mailbox and fill in
your complete user name and password immediately.

Microsoft Security Outlook Team

Thank You.

Copyright © 2017 Microsoft Outlook Inc. All rights reserved.

The information in this electronic message is privileged and confidential, 
intended only for use of the individual or entity named as addressee and 
recipient. If you are not the addressee indicated in this message (or 
responsible for delivery of the message to such person), you may not copy, use, 
disseminate or deliver this message. In such case, you should immediately 
delete this e-mail and notify the sender by reply e-mail. Please advise 
immediately if you or your employer do not consent to Internet e-mail for 
messages of this kind. Opinions, conclusions and other information expressed in 
this message are not given, nor endorsed by and are not the responsibility of 
PLDT unless otherwise indicated by an authorized representative of PLDT 
independent of this message.

[PATCH 3/3] ptr_ring: support testing different batching sizes

2017-04-06 Thread Michael S. Tsirkin
Use the param flag for that.

Signed-off-by: Michael S. Tsirkin 
---
 tools/virtio/ringtest/ptr_ring.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/virtio/ringtest/ptr_ring.c b/tools/virtio/ringtest/ptr_ring.c
index 635b07b..7b22f1b 100644
--- a/tools/virtio/ringtest/ptr_ring.c
+++ b/tools/virtio/ringtest/ptr_ring.c
@@ -97,6 +97,9 @@ void alloc_ring(void)
 {
int ret = ptr_ring_init(&array, ring_size, 0);
assert(!ret);
+   /* Hacky way to poke at ring internals. Useful for testing though. */
+   if (param)
+   array.batch = param;
 }
 
 /* guest side */
-- 
MST
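
(With the --param flag added in patch 2/3, the batch size can then be
picked at run time, e.g. an invocation along the lines of
./ptr_ring --param 32; the exact test-runner setup is not shown here.)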



[PATCH 1/3] ptr_ring: batch ring zeroing

2017-04-06 Thread Michael S. Tsirkin
A known weakness in ptr_ring design is that it does not handle well the
situation when ring is almost full: as entries are consumed they are
immediately used again by the producer, so consumer and producer are
writing to a shared cache line.

To fix this, add batching to consume calls: as entries are
consumed do not write NULL into the ring until we get
a multiple (in current implementation 2x) of cache lines
away from the producer. At that point, write them all out.

We do the write out in the reverse order to keep
producer from sharing cache with consumer for as long
as possible.

Writeout also triggers when ring wraps around - there's
no special reason to do this but it helps keep the code
a bit simpler.

What should we do if getting away from producer by 2 cache lines
would mean we are keeping the ring more than half empty?
Maybe we should reduce the batching in this case;
the current patch simply reduces the batching.

Notes:
- it is no longer true that a call to consume guarantees
  that the following call to produce will succeed.
  No users seem to assume that.
- batching can also in theory reduce the signalling rate:
  users that would previously send interrupts to the producer
  to wake it up after consuming each entry would now only
  need to do this once in a batch.
  Doing this would be easy by returning a flag to the caller.
  No users seem to do signalling on consume yet, so this has not
  been implemented.

Signed-off-by: Michael S. Tsirkin 
---

Jason, I am curious whether the following gives you some of
the performance boost that you see with vhost batching
patches. Is vhost batching on top still helpful?

 include/linux/ptr_ring.h | 63 +---
 1 file changed, 54 insertions(+), 9 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 6c70444..6b2e0dd 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -34,11 +34,13 @@
 struct ptr_ring {
 	int producer ____cacheline_aligned_in_smp;
 	spinlock_t producer_lock;
-	int consumer ____cacheline_aligned_in_smp;
+	int consumer_head ____cacheline_aligned_in_smp; /* next valid entry */
+	int consumer_tail; /* next entry to invalidate */
 	spinlock_t consumer_lock;
 	/* Shared consumer/producer data */
 	/* Read-only by both the producer and the consumer */
 	int size ____cacheline_aligned_in_smp; /* max entries in queue */
+	int batch; /* number of entries to consume in a batch */
 	void **queue;
 };
 
@@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
 static inline void *__ptr_ring_peek(struct ptr_ring *r)
 {
if (likely(r->size))
-   return r->queue[r->consumer];
+   return r->queue[r->consumer_head];
return NULL;
 }
 
@@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
 /* Must only be called after __ptr_ring_peek returned !NULL */
 static inline void __ptr_ring_discard_one(struct ptr_ring *r)
 {
-   r->queue[r->consumer++] = NULL;
-   if (unlikely(r->consumer >= r->size))
-   r->consumer = 0;
+   /* Fundamentally, what we want to do is update consumer
+* index and zero out the entry so producer can reuse it.
+* Doing it naively at each consume would be as simple as:
+*   r->queue[r->consumer++] = NULL;
+*   if (unlikely(r->consumer >= r->size))
+*   r->consumer = 0;
+* but that is suboptimal when the ring is full as producer is writing
+* out new entries in the same cache line.  Defer these updates until a
+* batch of entries has been consumed.
+*/
+   int head = r->consumer_head++;
+
+   /* Once we have processed enough entries invalidate them in
+* the ring all at once so producer can reuse their space in the ring.
+* We also do this when we reach end of the ring - not mandatory
+* but helps keep the implementation simple.
+*/
+   if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
+r->consumer_head >= r->size)) {
+   /* Zero out entries in the reverse order: this way we touch the
+* cache line that producer might currently be reading the last;
+* producer won't make progress and touch other cache lines
+* besides the first one until we write out all entries.
+*/
+   while (likely(head >= r->consumer_tail))
+   r->queue[head--] = NULL;
+   r->consumer_tail = r->consumer_head;
+   }
+   if (unlikely(r->consumer_head >= r->size)) {
+   r->consumer_head = 0;
+   r->consumer_tail = 0;
+   }
 }
 
 static inline void *__ptr_ring_consume(struct ptr_ring *r)
@@ -345,14 +376,27 @@ static inline void **__ptr_ring_init_queue_alloc(int size, gfp_t gfp)
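
To make the consume-side batching concrete, here is a minimal
user-space sketch of the same idea (demo_ring, RING_SIZE, BATCH and
demo_consume are names invented for this illustration; the diff above
is the authoritative kernel version):

#include <assert.h>
#include <stdio.h>

#define RING_SIZE 16
#define BATCH 4	/* stands in for "two cache lines worth of entries" */

struct demo_ring {
	void *queue[RING_SIZE];
	int consumer_head;	/* next valid entry */
	int consumer_tail;	/* next entry to invalidate */
};

static void *demo_consume(struct demo_ring *r)
{
	void *ptr = r->queue[r->consumer_head];
	int head;

	if (!ptr)
		return NULL;	/* nothing to consume at head */

	head = r->consumer_head++;

	/* Defer the NULL writes until a whole batch has been consumed,
	 * or until the end of the ring; then zero in reverse order so
	 * the producer keeps hitting only the first cache line until
	 * the whole batch is written out.
	 */
	if (r->consumer_head - r->consumer_tail >= BATCH ||
	    r->consumer_head >= RING_SIZE) {
		while (head >= r->consumer_tail)
			r->queue[head--] = NULL;
		r->consumer_tail = r->consumer_head;
	}
	if (r->consumer_head >= RING_SIZE)
		r->consumer_head = r->consumer_tail = 0;

	return ptr;
}

int main(void)
{
	static struct demo_ring r;
	static int values[RING_SIZE];
	int i;

	for (i = 0; i < RING_SIZE; i++) {
		values[i] = i;
		r.queue[i] = &values[i];
	}
	for (i = 0; i < RING_SIZE; i++)
		assert(*(int *)demo_consume(&r) == i);
	for (i = 0; i < RING_SIZE; i++)
		assert(!r.queue[i]);	/* all entries zeroed in batches */
	printf("batched zeroing demo OK\n");
	return 0;
}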

[PATCH 2/3] ringtest: support test specific parameters

2017-04-06 Thread Michael S. Tsirkin
Add a new flag for passing test-specific parameters.

Signed-off-by: Michael S. Tsirkin 
---
 tools/virtio/ringtest/main.c | 13 +
 tools/virtio/ringtest/main.h |  2 ++
 2 files changed, 15 insertions(+)

diff --git a/tools/virtio/ringtest/main.c b/tools/virtio/ringtest/main.c
index f31353f..022ae95 100644
--- a/tools/virtio/ringtest/main.c
+++ b/tools/virtio/ringtest/main.c
@@ -20,6 +20,7 @@
 int runcycles = 1000;
 int max_outstanding = INT_MAX;
 int batch = 1;
+int param = 0;
 
 bool do_sleep = false;
 bool do_relax = false;
@@ -247,6 +248,11 @@ static const struct option longopts[] = {
.val = 'b',
},
{
+   .name = "param",
+   .has_arg = required_argument,
+   .val = 'p',
+   },
+   {
.name = "sleep",
.has_arg = no_argument,
.val = 's',
@@ -274,6 +280,7 @@ static void help(void)
" [--run-cycles C (default: %d)]"
" [--batch b]"
" [--outstanding o]"
+   " [--param p]"
" [--sleep]"
" [--relax]"
" [--exit]"
@@ -328,6 +335,12 @@ int main(int argc, char **argv)
assert(c > 0 && c < INT_MAX);
max_outstanding = c;
break;
+   case 'p':
c = strtol(optarg, &endptr, 0);
+   assert(!*endptr);
+   assert(c > 0 && c < INT_MAX);
+   param = c;
+   break;
case 'b':
c = strtol(optarg, &endptr, 0);
assert(!*endptr);
diff --git a/tools/virtio/ringtest/main.h b/tools/virtio/ringtest/main.h
index 14142fa..90b0133 100644
--- a/tools/virtio/ringtest/main.h
+++ b/tools/virtio/ringtest/main.h
@@ -10,6 +10,8 @@
 
 #include <stdbool.h>
 
+extern int param;
+
 extern bool do_exit;
 
 #if defined(__x86_64__) || defined(__i386__)
-- 
MST
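
For reference, the strict strtol() validation used above can be lifted
into a stand-alone helper (parse_positive_int is an invented name;
main.c open-codes the same checks in each option case):

#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

static int parse_positive_int(const char *arg)
{
	char *endptr;
	long c = strtol(arg, &endptr, 0);

	assert(!*endptr);		/* reject trailing garbage */
	assert(c > 0 && c < INT_MAX);	/* reject zero, negatives, overflow */
	return (int)c;
}

int main(void)
{
	printf("param = %d\n", parse_positive_int("32"));
	return 0;
}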



[PATCH] iommu/iova: fix underflow bug in __alloc_and_insert_iova_range

2017-04-06 Thread Nate Watterson
Normally, calling alloc_iova() using an iova_domain with insufficient
pfns remaining between start_pfn and dma_limit will fail and return a
NULL pointer. Unexpectedly, if such a "full" iova_domain contains an
iova with pfn_lo == 0, the alloc_iova() call will instead succeed and
return an iova containing invalid pfns.

This is caused by an underflow bug in __alloc_and_insert_iova_range()
that occurs after walking the "full" iova tree when the search ends
at the iova with pfn_lo == 0 and limit_pfn is then adjusted to be just
below that (-1). This (now huge) limit_pfn gives the impression that a
vast amount of space is available between it and start_pfn and thus
a new iova is allocated with the invalid pfn_hi value 0xFFFF_FFFF_FFFF_FFFF (i.e. -1).

To remedy this, a check is introduced to ensure that adjustments to
limit_pfn will not underflow.

This issue has been observed in the wild, and is easily reproduced with
the following sample code.

struct iova_domain *iovad = kzalloc(sizeof(*iovad), GFP_KERNEL);
struct iova *rsvd_iova, *good_iova, *bad_iova;
unsigned long limit_pfn = 3;
unsigned long start_pfn = 1;
unsigned long va_size = 2;

init_iova_domain(iovad, SZ_4K, start_pfn, limit_pfn);
rsvd_iova = reserve_iova(iovad, 0, 0);
good_iova = alloc_iova(iovad, va_size, limit_pfn, true);
bad_iova = alloc_iova(iovad, va_size, limit_pfn, true);

Prior to the patch, this yielded:
*rsvd_iova == {0, 0}   /* Expected */
*good_iova == {2, 3}   /* Expected */
*bad_iova  == {-2, -1} /* Oh no... */

After the patch, bad_iova is NULL as expected since inadequate
space remains between limit_pfn and start_pfn after allocating
good_iova.

Signed-off-by: Nate Watterson 
---
 drivers/iommu/iova.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index b7268a1..f6533e0 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -138,7 +138,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad,
break;  /* found a free slot */
}
 adjust_limit_pfn:
-   limit_pfn = curr_iova->pfn_lo - 1;
+   limit_pfn = curr_iova->pfn_lo ? (curr_iova->pfn_lo - 1) : 0;
 move_left:
prev = curr;
curr = rb_prev(curr);
-- 
Qualcomm Datacenter Technologies, Inc. on behalf of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux
Foundation Collaborative Project.
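
The arithmetic at the heart of the bug is easy to reproduce in a few
lines of plain C (a stand-alone illustration only, reusing the pfn_lo
and limit_pfn names from the patch):

#include <stdio.h>

int main(void)
{
	unsigned long pfn_lo = 0;	/* reserved iova starting at pfn 0 */

	unsigned long buggy = pfn_lo - 1;		 /* wraps to ~0UL */
	unsigned long fixed = pfn_lo ? (pfn_lo - 1) : 0; /* clamped to 0 */

	printf("buggy limit_pfn: 0x%lx\n", buggy);
	printf("fixed limit_pfn: 0x%lx\n", fixed);
	return 0;
}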



linux-next: manual merge of the scsi-mkp tree with the char-misc tree

2017-04-06 Thread Stephen Rothwell
Hi Martin,

Today's linux-next merge of the scsi-mkp tree got a conflict in:

  drivers/scsi/osd/osd_uld.c

between commit:

  ac1ddc584e98 ("scsi: utilize new cdev_device_add helper function")

from the char-misc tree and commit:

  c02465fa13b6 ("scsi: osd_uld: Check scsi_device_get() return value")

from the scsi-mkp tree.

I am not sure how to resolve this, so I have just effectively reverted
the latter commit for today.  Better suggestions welcome.

I fixed it up and can carry the fix as necessary. This is now fixed as
far as linux-next is concerned, but any non trivial conflicts should be
mentioned to your upstream maintainer when your tree is submitted for
merging.  You may also want to consider cooperating with the maintainer
of the conflicting tree to minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell


Re: [PATCH] of: change fixup of dma-ranges size to error

2017-04-06 Thread Frank Rowand
On 04/06/17 15:41, Rob Herring wrote:
> On Thu, Apr 6, 2017 at 1:37 PM, Frank Rowand  wrote:
>> On 04/06/17 07:03, Rob Herring wrote:
>>> On Thu, Apr 6, 2017 at 1:18 AM,   wrote:
 From: Frank Rowand 

 of_dma_get_range() has workaround code to fixup a device tree that
 incorrectly specified a mask instead of a size for property
 dma-ranges.  That device tree was fixed a year ago in v4.6, so
 the workaround is no longer needed.  Leave a data validation
 check in place, but no longer do the fixup.  Move the check
 one level deeper in the call stack so that other possible users
 of dma-ranges will also be protected.

 The fix to the device tree was in
 commit c91cb9123cdd ("dtb: amd: Fix DMA ranges in device tree").
>>>
>>> NACK.
>>> This was by design. You can't represent a size of 2^64 or 2^32.
>>
>> I agree that being unable to represent a size of 2^32 in a u32 and
>> a size of 2^64 in a u64 is the underlying issue.
>>
>> But the code to convert a mask to a size is _not_ design, it is a
>> hack that temporarily worked around a device tree that did not follow
>> the dma-ranges binding in the ePAPR.
> 
> Since when is (2^64 - 1) not a size. It's a perfectly valid size in

I did not say (2^64 -1) is not a size.

I said that the existing code has a hack that converts what is perceived
to be a mask into a size.  The existing code is:

@@ 110,21 @@ void of_dma_configure(struct device *dev, struct device_node *np)
size = dev->coherent_dma_mask + 1;
} else {
offset = PFN_DOWN(paddr - dma_addr);

/*
 * Add a work around to treat the size as mask + 1 in case
 * it is defined in DT as a mask.
 */
if (size & 1) {
dev_warn(dev, "Invalid size 0x%llx for dma-range\n",
 size);
size = size + 1;
}

if (!size) {
dev_err(dev, "Adjusted size 0x%llx invalid\n", size);
return;
}
dev_dbg(dev, "dma_pfn_offset(%#08lx)\n", offset);
}

Note the comment that says "in case it is defined in DT as a mask."

And as you stated in a review comment in 2015: "Also, we need a WARN
here so DTs get fixed."


> DT. And there's probably not a system in the world that needs access
> to that last byte. Is it completely accurate description if we
> subtract off 1? No, but it is still a valid range (so would be
> subtracting 12345).
> 
>> That device tree was corrected a year ago to provide a size instead of
>> a mask.
> 
> You are letting Linux implementation details influence your DT
> thinking. DT is much more flexible in that it supports a base address
> and size (and multiple of them) while Linux can only deal with a
> single address mask. If Linux dealt with base + size, then we wouldn't

No.  of_dma_get_range() returns two addresses and a size from the
dma-ranges property, just as it is defined in the spec.

of_dma_configure() then interprets an odd size as meaning that the
device tree incorrectly contains a mask, and then converts that mask
to a size by adding one to it.  Linux is _still_ using address and
size at this point.  It does _not_ convert this size into a mask,
but instead passes size on into arch_setup_dma_ops().

The proposed patch is to quit accepting a mask as valid data in
dma-ranges.
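
The 2^64 corner case under discussion comes down to one line of C:
adding 1 to an all-ones mask wraps to zero, so a full 64-bit range has
no representable size. A minimal stand-alone illustration (not from
the patch itself):

#include <stdio.h>

int main(void)
{
	unsigned long long mask = ~0ULL;	/* dma-ranges given as a mask */
	unsigned long long size = mask + 1;	/* intended 2^64, wraps to 0 */

	printf("mask = 0x%llx, mask + 1 = 0x%llx\n", mask, size);
	return 0;
}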


> be having this conversation. As long as Linux only deals with masks,
> we're going to have to have some sort of work-around to deal with
> them.
> 
>>> Well, technically you can for the latter, but then you have to grow
>>> #size-cells to 2 for an otherwise all 32-bit system which seems kind
>>> of pointless and wasteful. You could further restrict this to only
>>> allow ~0 and not just any case with bit 0 set.
>>>
>>> I'm pretty sure AMD is not the only system. There were 32-bit systems too.
>>
>> I examined all instances of property dma-ranges in in-tree dts files in
>> Linux 4.11-rc1.  There are none that incorrectly specify mask instead of
>> size.
> 
> Okay, but there are ones for ranges at least. See ecx-2000.dts.

The patch does not impact the ranges property.  It only impacts the
dma-ranges property.

> 
>> #size-cells only changes to 2 for the dma-ranges property and the ranges
>> property when size is 2^32, so that is a very small amount of space.
>>
>> The patch does not allow for a size of 2^64.  If a system requires a
>> size of 2^64 then the type of size needs to increase to be larger
>> than a u64.  If you would like for the code to be defensive and
>> detect a device tree providing a size of 2^64 then I can add a
>> check to of_dma_get_range() to return -EINVAL if #size-cells > 2.
>> When that error triggers, the type of size can be changed.
> 
> #size-cells > 2 is completely broken for anything but PCI. I doubt it

Yes, that is what I said.  The current code does

Re: [RFC][PATCHv2 2/8] printk: introduce printing kernel thread

2017-04-06 Thread Sergey Senozhatsky
Hello,

On (04/06/17 19:14), Pavel Machek wrote:
[..]
> > @@ -1765,17 +1803,40 @@ asmlinkage int vprintk_emit(int facility, int level,
> >  
> > printed_len += log_output(facility, level, lflags, dict, dictlen, text, 
> > text_len);
> >  
> > +   /*
> > +* Emergency level indicates that the system is unstable and, thus,
> > +* we better stop relying on wake_up(printk_kthread) and try to do
> > +* a direct printing.
> > +*/
> > +   if (level == LOGLEVEL_EMERG)
> > +   printk_kthread_disabled = true;
> > +
> > +   set_bit(PRINTK_PENDING_OUTPUT, _pending);
> 
> Messages lower then _EMERG may be important, too.. and usually are,
> for debugging.
> 
> And you keep both code paths, anyway, so they have to work. So you did
> not really "fix" issues you are pointing out -- they still remain
> there for _EMERG and above.

we don't drop messages of lower levels. we just print them from a
schedulable context. once the things go off the rails, and EMERG
is a good hint, I think, we stop being optimistic and switch to
a "best effort" mode. that is sort of reasonable. if there is a
flood of EMERG messages that are not actually important and,
basically, are the result of a coding error, then, I think, we
must fix that coding error. I mean, there should be some common
sense, and doing
while (1)
printk(KERN_EMERG "hello\n");
is probably not.


> I agree that printing too much is a problem. Could you just print
> "(messages delayed)" in that case, then wake a kernel thread to [rint
> the rest?

sorry, but what difference would it make?

it's really unclear at what point we should offload printing if we begin
that "we will offload sometimes". for example, I've seen many spin-lock
lockups where printk was involved.

in short: CPU0 takes a spin_lock and then calls printk("foo"), which
grabs the console_sem and descends into call_console_drivers() ->
serial_driver_write() -> spin_lock_irqsave(port->lock) ->
uart_console_write(...) -> serial_driver_putchar() ->
while (!txrdy(...)) cpu_relax(). meanwhile CPU1, CPU2 and CPU3 spin on
the same spin_lock while queueing messages of their own (printk("bar"),
printk("a") ... printk("z")); once they have spun for long enough,
spin_dump() and trigger_all_cpu_backtrace() fire on the spinning CPUs,
each producing yet more printk()-s for CPU0 to flush.

spin_dump() and trigger_all_cpu_backtrace() result in a bunch of
additional printk()-s, so CPU0 has even more work to do in console_unlock()
while it still holds the contended spin_lock. and so on; there are
many other examples.

so should we declare a "we can spend only 2 seconds in direct printk()
and then must offload printing" rule? I don't think it's much better
than a simpler "we always offload, as long as we think it's safe".

-ss


Re: [PATCH] tty: Fix crash with flush_to_ldisc()

2017-04-06 Thread Michael Neuling
Al,

On Fri, 2017-04-07 at 05:12 +0100, Al Viro wrote:
> On Fri, Apr 07, 2017 at 01:50:53PM +1000, Michael Neuling wrote:
> 
> > diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
> > index bdf0e6e899..a2a9832a42 100644
> > --- a/drivers/tty/n_tty.c
> > +++ b/drivers/tty/n_tty.c
> > @@ -1668,11 +1668,17 @@ static int
> >  n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
> >      char *fp, int count, int flow)
> >  {
> > -   struct n_tty_data *ldata = tty->disc_data;
> > +   struct n_tty_data *ldata;
> >     int room, n, rcvd = 0, overflow;
> >  
> >     down_read(>termios_rwsem);
> >  
> > +   ldata = tty->disc_data;
> > +   if (!ldata) {
> > +   up_read(>termios_rwsem);
> 
> I very much doubt that it's correct.  It shouldn't have been called after
> the n_tty_close(); apparently it has been.  ->termios_rwsem won't serialize
> against it, and something apparently has gone wrong with the exclusion there.
> At the very least I would like to see what's to prevent n_tty_close() from
> overlapping the execution of this function - if *that* is what broke, your
> patch will only paper over the problem.

It does seem like I'm papering over a problem. Would you be happy with the patch
if we add a WARN_ON_ONCE()?

I think the problem is permanent rather than a race/transient with the disc_data
being NULL as if we read it again later, it's still NULL.

Benh and I looked at this a bunch and we did notice tty_ldisc_reinit() was being
called without the tty lock in one location.  We tried the below patch
but it didn't help (not an upstreamable patch, just a test).

There have been a few attempts at fixing this but none have worked for
me:
https://lkml.org/lkml/2017/3/23/569
and 
https://patchwork.kernel.org/patch/9114561/

I'm not that familiar with the tty layer (and I value my sanity) so I'm
struggling to root cause it by myself.

Mikey


diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index 734a635e73..121402ff25 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -1454,6 +1454,9 @@ static void tty_driver_remove_tty(struct tty_driver *driver, struct tty_struct *tty)
driver->ttys[tty->index] = NULL;
 }
 
+extern int tty_ldisc_lock(struct tty_struct *tty, unsigned long timeout);
+extern void tty_ldisc_unlock(struct tty_struct *tty);
+
 /*
  * tty_reopen()- fast re-open of an open tty
  * @tty- the tty to open
@@ -1466,6 +1469,7 @@ static void tty_driver_remove_tty(struct tty_driver *driver, struct tty_struct *tty)
 static int tty_reopen(struct tty_struct *tty)
 {
struct tty_driver *driver = tty->driver;
+   int rc = 0;
 
if (driver->type == TTY_DRIVER_TYPE_PTY &&
driver->subtype == PTY_TYPE_MASTER)
@@ -1479,10 +1483,12 @@ static int tty_reopen(struct tty_struct *tty)
 
tty->count++;
 
+   tty_ldisc_lock(tty, MAX_SCHEDULE_TIMEOUT);
if (!tty->ldisc)
-   return tty_ldisc_reinit(tty, tty->termios.c_line);
+   rc =  tty_ldisc_reinit(tty, tty->termios.c_line);
+   tty_ldisc_unlock(tty);
 
-   return 0;
+   return rc;
 }
 
 /**
diff --git a/drivers/tty/tty_ldisc.c b/drivers/tty/tty_ldisc.c
index d0e84b6226..3b13ff11c5 100644
--- a/drivers/tty/tty_ldisc.c
+++ b/drivers/tty/tty_ldisc.c
@@ -334,7 +334,7 @@ static inline void __tty_ldisc_unlock(struct tty_struct *tty)
ldsem_up_write(>ldisc_sem);
 }
 
-static int tty_ldisc_lock(struct tty_struct *tty, unsigned long timeout)
+int tty_ldisc_lock(struct tty_struct *tty, unsigned long timeout)
 {
int ret;
 
@@ -345,7 +345,7 @@ static int tty_ldisc_lock(struct tty_struct *tty, unsigned long timeout)
return 0;
 }
 
-static void tty_ldisc_unlock(struct tty_struct *tty)
+void tty_ldisc_unlock(struct tty_struct *tty)
 {
clear_bit(TTY_LDISC_HALTED, >flags);
__tty_ldisc_unlock(tty);




Re: [PATCH V7 0/7] da9061: DA9061 driver submission

2017-04-06 Thread Eduardo Valentin
Hey,

On Tue, Mar 28, 2017 at 03:43:33PM +0100, Steve Twiss wrote:
> From: Steve Twiss 
> 
> This patch set adds support for the Dialog DA9061 Power Management IC.
> Support is added by altering the existing DA9062 device driver, where
> appropriate.
> 
> Hello,
> 
> Previously, there were only minor changes for v6.
> The patch v7 introduces a compile test for x86 64-bit.
> 

Applied patches 2 and 7 on my tree.


signature.asc
Description: Digital signature


Re: [PATCH 4/4] rcu: Fix dyntick-idle tracing

2017-04-06 Thread kbuild test robot
Hi Paul,

[auto build test ERROR on tip/perf/core]
[also build test ERROR on v4.11-rc5 next-20170406]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Steven-Rostedt/tracing-Add-usecase-of-synchronize_rcu_tasks-and-stack_tracer_disable/20170407-122352
config: x86_64-randconfig-x009-201714 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   kernel/rcu/tree.c: In function 'rcu_eqs_enter_common':
>> kernel/rcu/tree.c:806:2: error: implicit declaration of function 
>> 'stack_tracer_enable' [-Werror=implicit-function-declaration]
 stack_tracer_enable();
 ^~~
   cc1: some warnings being treated as errors

vim +/stack_tracer_enable +806 kernel/rcu/tree.c

   800  do_nocb_deferred_wakeup(rdp);
   801  }
   802  rcu_prepare_for_idle();
   803  stack_tracer_disable();
   804  rdtp->dynticks_nesting = 0; /* Breaks tracing momentarily. */
   805  rcu_dynticks_eqs_enter(); /* After this, tracing works again. */
 > 806  stack_tracer_enable();
   807  rcu_dynticks_task_enter();
   808  
   809  /*

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


linux-next: manual merge of the usb-gadget tree with the usb tree

2017-04-06 Thread Stephen Rothwell
Hi Felipe,

Today's linux-next merge of the usb-gadget tree got a conflict in:

  drivers/usb/gadget/udc/amd5536udc.c

between commit:

  b5a6a4e5baef ("usb: gadget: amd5536udc: Replace PCI pool old API")

from the usb tree and commit:

  7bf80fcd797f ("usb: gadget: udc: avoid use of freed pointer")

from the usb-gadget tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/usb/gadget/udc/amd5536udc.c
index 270876b438ab,91d0f1a4dac1..
--- a/drivers/usb/gadget/udc/amd5536udc.c
+++ b/drivers/usb/gadget/udc/amd5536udc.c
@@@ -618,17 -579,12 +579,12 @@@ static void udc_free_dma_chain(struct u
DBG(dev, "free chain req = %p\n", req);
  
/* do not free first desc., will be done by free for request */
-   td_last = req->td_data;
-   td = phys_to_virt(td_last->next);
- 
for (i = 1; i < req->chain_len; i++) {
-   dma_pool_free(dev->data_requests, td,
- (dma_addr_t)td_last->next);
-   td_last = td;
-   td = phys_to_virt(td_last->next);
+   td = phys_to_virt(addr);
+   addr_next = (dma_addr_t)td->next;
 -  pci_pool_free(dev->data_requests, td, addr);
++  dma_pool_free(dev->data_requests, td, addr);
+   addr = addr_next;
}
- 
-   return ret_val;
  }
  
  /* Frees request packet, called by gadget driver */


Re: [printk] fbc14616f4: BUG:kernel_reboot-without-warning_in_test_stage

2017-04-06 Thread Sergey Senozhatsky
Hello,

On (04/06/17 19:33), Pavel Machek wrote:
> > This patch set gives up part of the printk() reliability for bounded
> > latency (at least unless we detect we are really in trouble) which is IMHO
> > a good trade-off for lots of users (and others can just turn this feature
> > off).
> 
> If they can ever realize they were bitten by this feature.
> 
> Can we go for different tradeoff?
> 
> In console_unlock(), if you detect too much work, print "Too many
> messages to print, %d bytes delayed" and wake up kernel thread.

"too many messages" is undefined. console_unlock() can be called from
IRQ handler or with preemption disabled, or under spin_lock, or under
RCU read lock, etc. etc. By the time we decide to wake up printk_kthread
from console_unlock() it may be already too late.

besides, this does not really address any of the concerns you have
pointed out in other emails. we might be unable to wake_up printk_kthread
(because there is a misbehaving higher prio process, or because the
scheduler is misbehaving, etc. etc.) so the "emergency mode" is still
here and still requires special handling.

-ss


linux-next: manual merge of the usb-gadget tree with the usb tree

2017-04-06 Thread Stephen Rothwell
Hi Felipe,

Today's linux-next merge of the usb-gadget tree got conflicts in:

  drivers/usb/gadget/udc/Kconfig

between commit:

  2c93e790e825 ("usb: add CONFIG_USB_PCI for system have both PCI HW and 
non-PCI based USB HW")

from the usb tree and commit:

  5dbc49aebd0a ("usb: gadget: udc: amd5536: split core and PCI layer")

from the usb-gadget tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/usb/gadget/udc/Kconfig
index c6cc9d3270ac,707814da6000..
--- a/drivers/usb/gadget/udc/Kconfig
+++ b/drivers/usb/gadget/udc/Kconfig
@@@ -277,7 -292,8 +292,8 @@@ source "drivers/usb/gadget/udc/bdc/Kcon
  
  config USB_AMD5536UDC
tristate "AMD5536 UDC"
 -  depends on PCI
 +  depends on USB_PCI
+   select USB_SNP_CORE
help
   The AMD5536 UDC is part of the AMD Geode CS5536, an x86 southbridge.
   It is a USB Highspeed DMA capable USB device controller. Beside ep0


Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.

2017-04-06 Thread Sachin Sant

> On 07-Apr-2017, at 2:14 AM, Tyrel Datwyler  wrote:
> 
> On 04/06/2017 03:27 AM, Sachin Sant wrote:
>> On a POWER8 LPAR running 4.11.0-rc5, a hot unplug operation on
>> any I/O adapter results in the following warning
> 
> I remember you mentioning this when the issue was brought up for CPUs. I
> assume the case is the same here where the issue is only seen with
> adapters that were hot-added after boot (ie. hot-remove of adapter
> present at boot doesn't trip the warning)?
> 

Correct, can be recreated only with adapters that were hot-added after boot.

> -Tyrel
> 
>> 
>> Thanks
>> -Sachin
>> 
>> 
> 



linux-next: manual merge of the kvms390 tree with the kvm and kvm-ppc trees

2017-04-06 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the kvms390 tree got a conflict in:

  include/uapi/linux/kvm.h

between commits:

  a8a3c426772e ("KVM: MIPS: Add VZ & TE capabilities")
  578fd61d2d21 ("KVM: MIPS: Add 64BIT capability")

from the kvm tree, commit:

  2e60acebefd8 ("KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number")

from the kvm-ppc tree and commits

  4e0b1ab72b8a ("KVM: s390: gs support for kvm guests")
  1721ee7d57c4 ("KVM: s390: introduce AIS capability")

from the kvms390 tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc include/uapi/linux/kvm.h
index 1c7418d8f404,acfee5f4b5b2..
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@@ -887,10 -883,9 +887,13 @@@ struct kvm_ppc_resize_hpt 
  #define KVM_CAP_PPC_MMU_RADIX 134
  #define KVM_CAP_PPC_MMU_HASH_V3 135
  #define KVM_CAP_IMMEDIATE_EXIT 136
 -#define KVM_CAP_S390_GS 137
 -#define KVM_CAP_S390_AIS 138
 +#define KVM_CAP_MIPS_VZ 137
 +#define KVM_CAP_MIPS_TE 138
 +#define KVM_CAP_MIPS_64BIT 139
 +#define KVM_CAP_SPAPR_TCE_VFIO 140
++#define KVM_CAP_S390_GS 141
++#define KVM_CAP_S390_AIS 142
+ #define KVM_CAP_S390_CMMA_MIGRATION 216
  
  #ifdef KVM_CAP_IRQ_ROUTING
  


Re: [PATCH V10 06/12] of: device: Fix overflow of coherent_dma_mask

2017-04-06 Thread Sricharan R

Hi Frank,

On 4/7/2017 1:04 AM, Frank Rowand wrote:

On 04/06/17 04:01, Sricharan R wrote:

Hi Frank,

On 4/6/2017 12:31 PM, Frank Rowand wrote:

On 04/04/17 03:18, Sricharan R wrote:

Size of the dma-range is calculated as coherent_dma_mask + 1
and passed to arch_setup_dma_ops further. It overflows when
the coherent_dma_mask is set for full 64 bits 0xffffffffffffffff,
resulting in size getting passed as 0 wrongly. Fix this by
passing in max(mask, mask + 1). Note that in this case
when the mask is set to full 64bits, we will be passing the mask
itself to arch_setup_dma_ops instead of the size. The real fix
for this should be to make arch_setup_dma_ops receive the
mask and handle it, to be done in the future.

Signed-off-by: Sricharan R 
---
 drivers/of/device.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/device.c b/drivers/of/device.c
index c17c19d..c2ae6bb 100644
--- a/drivers/of/device.c
+++ b/drivers/of/device.c
@@ -107,7 +107,7 @@ void of_dma_configure(struct device *dev, struct 
device_node *np)
 ret = of_dma_get_range(np, &dma_addr, &paddr, &size);
 if (ret < 0) {
 dma_addr = offset = 0;
-size = dev->coherent_dma_mask + 1;
+size = max(dev->coherent_dma_mask, dev->coherent_dma_mask + 1);
 } else {
 offset = PFN_DOWN(paddr - dma_addr);
 dev_dbg(dev, "dma_pfn_offset(%#08lx)\n", offset);



NACK.

Passing an invalid size to arch_setup_dma_ops() is only part of the problem.
size is also used in of_dma_configure() before calling arch_setup_dma_ops():

dev->coherent_dma_mask = min(dev->coherent_dma_mask,
 DMA_BIT_MASK(ilog2(dma_addr + size)));
*dev->dma_mask = min((*dev->dma_mask),
 DMA_BIT_MASK(ilog2(dma_addr + size)));

which would be incorrect for size == 0xffffffffffffffffULL when
dma_addr != 0.  So the proposed fix really is not papering over
the base problem very well.



Ok, but with your fix for of_dma_get_range and the above fix,
dma_addr will be '0' when size = 0xffffffffffffffffULL,
but DMA_BIT_MASK(ilog2(dma_addr + size)) would be wrong though,
making coherent_dma_mask the smaller 0x7fffffffffffffffULL.


Yes, that was my point.  Setting size to 0x7fffffffffffffffULL
affects several places.  Another potential location (based only
on the function header comment, not from reading the code) is
iommu_dma_init_domain().  The header comment says:

* @base and @size should be exact multiples of IOMMU page granularity to
* avoid rounding surprises.



ok, this is the same problem that should get solved when arch_setup_dma_ops
is prepared to take mask instead of size. It would still work as said above
with a smaller mask than specified.


I have not read enough context to really understand of_dma_configure(), but
it seems there is yet another issue in how the error return case from
of_dma_get_range() is handled (with the existing code, as well as if
my patch gets accepted).  An error return value can mean _either_
there is no dma-ranges property _or_ "an other problem occurred".  Should
the "an other problem occurred" case be handled by defaulting size to
a value based on dev->coherent_dma_mask (the current case) or should the
attempt to set up the DMA configuration just fail?


The handling of the return error value looks like a separate item, but
it looks correct as it is now: when of_dma_get_range fails, either
because the 'dma-ranges' property is not populated in DT or because of
some erroneous DT setting, it is better to set the mask which the driver
has specified and ignore DT. So the above patch just corrects a mistake
in that path.

Regards,
 Sricharan

--
"QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of 
Code Aurora Forum, hosted by The Linux Foundation"
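
A standalone illustration of the overflow at the heart of this patch,
assuming plain u64 arithmetic. The helper below mirrors the
max(mask, mask + 1) trick being discussed; it is a sketch, not the
kernel code:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors max(mask, mask + 1): when mask + 1 wraps to 0 the mask itself
 * is used, exactly the stopgap described in the commit message. */
static uint64_t size_from_mask(uint64_t mask)
{
	uint64_t size = mask + 1;	/* wraps to 0 for a full 64-bit mask */

	return size ? size : mask;
}

int main(void)
{
	uint64_t m32  = UINT64_C(0x00000000ffffffff);	/* DMA_BIT_MASK(32) */
	uint64_t full = UINT64_C(0xffffffffffffffff);	/* DMA_BIT_MASK(64) */

	printf("mask %#018" PRIx64 " -> size %#018" PRIx64 "\n",
	       m32, size_from_mask(m32));
	printf("mask %#018" PRIx64 " -> size %#018" PRIx64 "\n",
	       full, size_from_mask(full));
	return 0;
}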


Re: [PATCH] tty: Fix crash with flush_to_ldisc()

2017-04-06 Thread Al Viro
On Fri, Apr 07, 2017 at 01:50:53PM +1000, Michael Neuling wrote:

> diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
> index bdf0e6e899..a2a9832a42 100644
> --- a/drivers/tty/n_tty.c
> +++ b/drivers/tty/n_tty.c
> @@ -1668,11 +1668,17 @@ static int
>  n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
>char *fp, int count, int flow)
>  {
> - struct n_tty_data *ldata = tty->disc_data;
> + struct n_tty_data *ldata;
>   int room, n, rcvd = 0, overflow;
>  
>   down_read(&tty->termios_rwsem);
>  
> + ldata = tty->disc_data;
> + if (!ldata) {
> + up_read(&tty->termios_rwsem);

I very much doubt that it's correct.  It shouldn't have been called after
the n_tty_close(); apparently it has been.  ->termios_rwsem won't serialize
against it, and something apparently has gone wrong with the exclusion there.
At the very least I would like to see what's to prevent n_tty_close() from
overlapping the execution of this function - if *that* is what broke, your
patch will only paper over the problem.
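
A standalone pthread sketch of the interleaving being questioned here.
All names are illustrative and do not correspond to the real tty locking
scheme; the "teardown" deliberately skips the rwsem to show why a NULL
check under that lock alone only narrows the window:

#include <pthread.h>
#include <stdio.h>

struct disc { int dummy; };

static struct disc the_disc = { 42 };
static struct disc *disc_data = &the_disc;
static int torn_down;                     /* stands in for kfree(ldata) */
static pthread_rwlock_t termios_rwsem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_barrier_t sync_point;

static void *receiver(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&termios_rwsem);
	struct disc *ldata = disc_data;       /* checked under the rwsem ... */
	pthread_barrier_wait(&sync_point);    /* let "close" run right here */
	pthread_barrier_wait(&sync_point);
	if (ldata && torn_down)               /* ... but used after teardown */
		printf("use after close: ldata=%p dummy=%d\n",
		       (void *)ldata, ldata->dummy);
	pthread_rwlock_unlock(&termios_rwsem);
	return NULL;
}

static void *closer(void *arg)
{
	(void)arg;
	pthread_barrier_wait(&sync_point);
	/* No termios_rwsem taken here: nothing excludes the reader above. */
	disc_data = NULL;
	torn_down = 1;                        /* real code would free ldata */
	pthread_barrier_wait(&sync_point);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_barrier_init(&sync_point, NULL, 2);
	pthread_create(&a, NULL, receiver, NULL);
	pthread_create(&b, NULL, closer, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}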


[PATCH] perf evsel: Return exact sub event which failed with EPERM for wildcards

2017-04-06 Thread Jin Yao
The kernel has a special check for irq_vectors trace event.

TRACE_EVENT_PERF_PERM(irq_work_exit,
is_sampling_event(p_event) ? -EPERM : 0);

The perf-record is failed for irq_vectors event if using a wildcard.

root@skl:/tmp# perf record -a -e irq_vectors:* sleep 2
Error:
You may not have permission to collect system-wide stats.

Consider tweaking /proc/sys/kernel/perf_event_paranoid,
which controls use of the performance events system by
unprivileged users (without CAP_SYS_ADMIN).

The current value is 2:

  -1: Allow use of (almost) all events by all users
>= 0: Disallow raw tracepoint access by users without CAP_IOC_LOCK
>= 1: Disallow CPU event access by users without CAP_SYS_ADMIN
>= 2: Disallow kernel profiling by users without CAP_SYS_ADMIN

To make this setting permanent, edit /etc/sysctl.conf too, e.g.:

kernel.perf_event_paranoid = -1

This patch prints out the exact sub event that failed with EPERM
for wildcards to help user understanding easily.

For example,

root@skl:/tmp# perf record -a -e irq_vectors:* sleep 2
Error:
No permission to enable irq_vectors:irq_work_exit event.

You may not have permission to collect system-wide stats.
..

Signed-off-by: Jin Yao 
---
 tools/perf/util/evsel.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 9dc7e2d..8f5d86b 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -2457,11 +2457,17 @@ int perf_evsel__open_strerror(struct perf_evsel *evsel, 
struct target *target,
  int err, char *msg, size_t size)
 {
char sbuf[STRERR_BUFSIZE];
+   int printed = 0;
 
switch (err) {
case EPERM:
case EACCES:
-   return scnprintf(msg, size,
+   if (err == EPERM)
+   printed = scnprintf(msg, size,
+   "No permission to enable %s event.\n\n",
+   perf_evsel__name(evsel));
+
+   return scnprintf(msg + printed, size - printed,
 "You may not have permission to collect %sstats.\n\n"
 "Consider tweaking /proc/sys/kernel/perf_event_paranoid,\n"
 "which controls use of the performance events system by\n"
-- 
2.7.4
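
The patch relies on scnprintf() returning the number of characters
actually written, so the second call can safely append at msg + printed.
A minimal userspace demonstration of that pattern follows; the
scnprintf() below is only a stand-in with the kernel's return-value
convention:

#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace stand-in: returns characters actually written, excluding
 * the terminating NUL, never more than size - 1. */
static int scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int i;

	if (!size)
		return 0;
	va_start(ap, fmt);
	i = vsnprintf(buf, size, fmt, ap);
	va_end(ap);
	return i < (int)size ? i : (int)size - 1;
}

int main(void)
{
	char msg[128];
	int printed;

	printed = scnprintf(msg, sizeof(msg),
			    "No permission to enable %s event.\n\n",
			    "irq_vectors:irq_work_exit");
	scnprintf(msg + printed, sizeof(msg) - printed,
		  "You may not have permission to collect %sstats.\n",
		  "system-wide ");
	fputs(msg, stdout);
	return 0;
}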



linux-next: manual merge of the kvm-ppc tree with the kvm tree

2017-04-06 Thread Stephen Rothwell
Hi Paul,

Today's linux-next merge of the kvm-ppc tree got a conflict in:

  include/uapi/linux/kvm.h

between commit:

  a8a3c426772e ("KVM: MIPS: Add VZ & TE capabilities")

from the kvm tree and commit:

  2e60acebefd8 ("KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number")

from the kvm-ppc tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc include/uapi/linux/kvm.h
index 1e1a6c728a18,0f8a5e6528aa..
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@@ -887,9 -883,7 +887,10 @@@ struct kvm_ppc_resize_hpt 
  #define KVM_CAP_PPC_MMU_RADIX 134
  #define KVM_CAP_PPC_MMU_HASH_V3 135
  #define KVM_CAP_IMMEDIATE_EXIT 136
 -#define KVM_CAP_SPAPR_TCE_VFIO 137
 +#define KVM_CAP_MIPS_VZ 137
 +#define KVM_CAP_MIPS_TE 138
 +#define KVM_CAP_MIPS_64BIT 139
++#define KVM_CAP_SPAPR_TCE_VFIO 140
  
  #ifdef KVM_CAP_IRQ_ROUTING
  


Re: WARN @lib/refcount.c:128 during hot unplug of I/O adapter.

2017-04-06 Thread Michael Ellerman
Tyrel Datwyler  writes:

> On 04/06/2017 03:27 AM, Sachin Sant wrote:
>> On a POWER8 LPAR running 4.11.0-rc5, a hot unplug operation on
>> any I/O adapter results in the following warning
>> 
>> This problem has been in the code for some time now. I had first seen this in
>> -next tree.
>> 
>> [  269.589441] rpadlpar_io: slot PHB 72 removed
>> [  270.589997] refcount_t: underflow; use-after-free.
>> [  270.590019] [ cut here ]
>> [  270.590025] WARNING: CPU: 5 PID: 3335 at lib/refcount.c:128 
>> refcount_sub_and_test+0xf4/0x110
>> [  270.590028] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
>> nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
>> nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun 
>> bridge stp llc rpadlpar_io rpaphp kvm_pr kvm ebtable_filter ebtables 
>> ip6table_filter ip6_tables iptable_filter dccp_diag dccp tcp_diag udp_diag 
>> inet_diag unix_diag af_packet_diag netlink_diag ghash_generic xts gf128mul 
>> vmx_crypto tpm_ibmvtpm tpm sg pseries_rng nfsd auth_rpcgss nfs_acl lockd 
>> grace sunrpc binfmt_misc ip_tables xfs libcrc32c sr_mod sd_mod cdrom 
>> ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
>> [  270.590076] CPU: 5 PID: 3335 Comm: drmgr Not tainted 4.11.0-rc5 #3
>> [  270.590079] task: c005d8df8600 task.stack: c000fb3a8000
>> [  270.590081] NIP: c1aa3ca4 LR: c1aa3ca0 CTR: 
>> 006338e4
>> [  270.590084] REGS: c000fb3ab8a0 TRAP: 0700   Not tainted  (4.11.0-rc5)
>> [  270.590087] MSR: 80029033 
>> [  270.590090]   CR: 22002422  XER: 0007
>> [  270.590093] CFAR: c1edaabc SOFTE: 1 
>> [  270.590093] GPR00: c1aa3ca0 c000fb3abb20 c25ea900 
>> 0026 
>> [  270.590093] GPR04: c0077fc4ada0 c0077fc617b8 000f0c33 
>>  
>> [  270.590093] GPR08:  c227146c 00077d9e 
>> 3ff0 
>> [  270.590093] GPR12: 2200 ce802d00  
>>  
>> [  270.590093] GPR16:    
>>  
>> [  270.590093] GPR20:  1001b5a8 10018338 
>> 10016650 
>> [  270.590093] GPR24: 1001b278 c00776e0fdcc 10016650 
>>  
>> [  270.590093] GPR28: c0077ffea910 c000fbf79180 c00776e0fdc0 
>> c000fbf791d8 
>> [  270.590126] NIP [c1aa3ca4] refcount_sub_and_test+0xf4/0x110
>> [  270.590129] LR [c1aa3ca0] refcount_sub_and_test+0xf0/0x110
>> [  270.590132] Call Trace:
>> [  270.590134] [c000fb3abb20] [c1aa3ca0] 
>> refcount_sub_and_test+0xf0/0x110 (unreliable)
>> [  270.590139] [c000fb3abb80] [c1a8221c] kobject_put+0x3c/0xa0
>> [  270.590143] [c000fb3abbf0] [c1d22d34] of_node_put+0x24/0x40
>> [  270.590147] [c000fb3abc10] [c165c874] ofdt_write+0x204/0x6b0
>> [  270.590151] [c000fb3abcd0] [c197a220] proc_reg_write+0x80/0xd0
>> [  270.590155] [c000fb3abd00] [c18de680] __vfs_write+0x40/0x1c0
>> [  270.590158] [c000fb3abd90] [c18dffd8] vfs_write+0xc8/0x240
>> [  270.590162] [c000fb3abde0] [c18e1c40] SyS_write+0x60/0x110
>> [  270.590165] [c000fb3abe30] [c15cb184] system_call+0x38/0xe0
>> [  270.590168] Instruction dump:
>> [  270.590170] 7863d182 4e800020 7c0802a6 3921 3d42fff8 3c62ffb1 
>> 386371a8 992a0171 
>> [  270.590175] f8010010 f821ffa1 48436de1 6000 <0fe0> 38210060 
>> 3860 e8010010 
[  270.590180] ---[ end trace 08c7a2f3c8bead33 ]---
>> 
>> Have attached the dmesg log from the system. Let me know if any additional
>> information is required to help debug this problem.
>
> I remember you mentioning this when the issue was brought up for CPUs. I
> assume the case is the same here where the issue is only seen with
> adapters that were hot-added after boot (ie. hot-remove of adapter
> present at boot doesn't trip the warning)?

So who's fixing this?

cheers
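
A standalone sketch of the bug class behind this warning: one more put
than get on a reference count. The check below only loosely mimics
lib/refcount.c's saturation behaviour and is illustrative only:

#include <stdbool.h>
#include <stdio.h>

struct kref_like { int refs; };

static bool ref_put(struct kref_like *r)
{
	if (r->refs <= 0) {
		fprintf(stderr, "refcount_t: underflow; use-after-free.\n");
		return false;            /* saturate instead of double-free */
	}
	return --r->refs == 0;           /* true: last ref, free the object */
}

int main(void)
{
	struct kref_like node = { .refs = 1 };   /* e.g. a hot-added DT node */

	if (ref_put(&node))
		printf("released\n");    /* balanced put: fine */
	ref_put(&node);                  /* unbalanced second put: underflow */
	return 0;
}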


[PATCH] tty: Fix crash with flush_to_ldisc()

2017-04-06 Thread Michael Neuling
When reiniting a tty we can end up with:

[  417.514499] Unable to handle kernel paging request for data at address 
0x2260
[  417.515361] Faulting instruction address: 0xc06fad80
cpu 0x15: Vector: 300 (Data Access) at [c0799411f890]
pc: c06fad80: n_tty_receive_buf_common+0xc0/0xbd0
lr: c06fad5c: n_tty_receive_buf_common+0x9c/0xbd0
sp: c0799411fb10
   msr: 9280b033
   dar: 2260
 dsisr: 4000
  current = 0xc079675d1e00
  paca= 0xcfb0d200   softe: 0irq_happened: 0x01
pid   = 5, comm = kworker/u56:0
Linux version 4.11.0-rc5-next-20170405 (mikey@bml86) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Thu Apr 6 00:36:46 CDT 
2017
enter ? for help
[c0799411fbe0] c06ff968 tty_ldisc_receive_buf+0x48/0xe0
[c0799411fc10] c07009d8 tty_port_default_receive_buf+0x68/0xe0
[c0799411fc50] c06ffce4 flush_to_ldisc+0x114/0x130
[c0799411fca0] c010a0fc process_one_work+0x1ec/0x580
[c0799411fd30] c010a528 worker_thread+0x98/0x5d0
[c0799411fdc0] c011343c kthread+0x16c/0x1b0
[c0799411fe30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

This is due to a NULL ptr deref of tty->disc_data in
tty_ldisc_receive_buf() called from flush_to_ldisc().

This fixes the issue by moving the disc_data read to after we take the
semaphore. Then, when disc_data is NULL, we return 0 (no data processed)
rather than dereferencing it.

Cc:  [4.10+]
Signed-off-by: Michael Neuling 
---
 drivers/tty/n_tty.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index bdf0e6e899..a2a9832a42 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -1668,11 +1668,17 @@ static int
 n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
 char *fp, int count, int flow)
 {
-   struct n_tty_data *ldata = tty->disc_data;
+   struct n_tty_data *ldata;
int room, n, rcvd = 0, overflow;
 
down_read(&tty->termios_rwsem);
 
+   ldata = tty->disc_data;
+   if (!ldata) {
+   up_read(&tty->termios_rwsem);
+   return 0;
+   }
+
while (1) {
/*
 * When PARMRK is set, each input char may take up to 3 chars
-- 
2.9.3



Re: [PATCH 09/24] kexec_file: Disable at runtime if securelevel has been set

2017-04-06 Thread Mimi Zohar
On Fri, 2017-04-07 at 11:05 +0800, Dave Young wrote:
> On 04/05/17 at 09:15pm, David Howells wrote:
> > From: Chun-Yi Lee 
> > 
> > When KEXEC_VERIFY_SIG is not enabled, the kernel should not load an image
> > through the kexec_file syscall if securelevel has been set.
> > 
> > This code was shown in Matthew's patch but not in git:
> > https://lkml.org/lkml/2015/3/13/778
> > 
> > Cc: Matthew Garrett 
> > Signed-off-by: Chun-Yi Lee 
> > Signed-off-by: David Howells 
> > cc: ke...@lists.infradead.org
> > ---
> > 
> >  kernel/kexec_file.c |6 ++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> > index b118735fea9d..f6937eecd1eb 100644
> > --- a/kernel/kexec_file.c
> > +++ b/kernel/kexec_file.c
> > @@ -268,6 +268,12 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, 
> > initrd_fd,
> > if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
> > return -EPERM;
> >  
> > +   /* Don't permit images to be loaded into trusted kernels if we're not
> > +* going to verify the signature on them
> > +*/
> > +   if (!IS_ENABLED(CONFIG_KEXEC_VERIFY_SIG) && kernel_is_locked_down())
> > +   return -EPERM;
> > +
> >  

IMA can be used to verify file signatures too, based on the LSM hooks
in  kernel_read_file_from_fd().  CONFIG_KEXEC_VERIFY_SIG should not be
required.

Mimi


>   /* Make sure we have a legal set of flags */
> > if (flags != (flags & KEXEC_FILE_FLAGS))
> > return -EINVAL;
> > 
> > 
> > ___
> > kexec mailing list
> > ke...@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/kexec
> 
> Acked-by: Dave Young 
> 
> Thanks
> Dave
> --
> To unsubscribe from this list: send the line "unsubscribe 
> linux-security-module" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
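
For reference, the kind of IMA policy rule this argument relies on (a
guessed minimal configuration, not something quoted in this thread)
appraises kexec kernel images by signature, so the LSM hook reached from
kernel_read_file_from_fd() can enforce signatures even when
CONFIG_KEXEC_VERIFY_SIG is not set:

    # illustrative /etc/ima/ima-policy entry
    appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig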



Re: [RFC][PATCH] spin loop arch primitives for busy waiting

2017-04-06 Thread Nicholas Piggin
On Thu, 6 Apr 2017 12:41:52 -0700
Linus Torvalds  wrote:

> On Thu, Apr 6, 2017 at 12:23 PM, Peter Zijlstra  wrote:
> >
> > Something like so then. According to the SDM mwait is a no-op if we do
> > not execute monitor first. So this variant should get the first
> > iteration without expensive instructions.  
> 
> No, the problem is that we *would* have executed a prior monitor that
> could still be pending - from a previous invocation of
> smp_cond_load_acquire().
> 
> Especially with spinlocks, these things can very much happen back-to-back.
> 
> And it would be pending with a different address (the previous
> spinlock) that might not have changed since then (and might not be
> changing), so now we might actually be pausing in mwait waiting for
> that *other* thing to change.
> 
> So it would probably need to do something complicated like
> 
>   #define smp_cond_load_acquire(ptr, cond_expr) \
>   ({\
> typeof(ptr) __PTR = (ptr);  \
> typeof(*ptr) VAL;   \
> do {\
> VAL = READ_ONCE(*__PTR);\
> if (cond_expr)  \
> break;  \
> for (;;) {  \
> ___monitor(__PTR, 0, 0);\
> VAL = READ_ONCE(*__PTR);\
> if (cond_expr) break;   \
> ___mwait(0xf0 /* C0 */, 0); \
> }   \
> } while (0) \
> smp_acquire__after_ctrl_dep();  \
> VAL;\
>   })
> 
> which might just generate nasty enough code to not be worth it.

Okay, that's a bit of an issue I had with the spin_begin/in/end primitives.
The problem being some of these loops are a fastpath that expect success on
first iteration when not spinning. seqlock is another example.

powerpc does not want to go to low priority until after the initial test
fails, but cpu_relax based architectures do not want a pointless extra branch.

I added a spin_until_cond_likely() primitive that catches a number of these.
Not sure how well people would like that though.

Most loops actually are slowpath ones and work fine with begin/in/end though.
Attached a combined patch for reference with powerpc implementation and some
sites converted.

diff --git a/include/linux/processor.h b/include/linux/processor.h
new file mode 100644
index ..0a058aaa9bab
--- /dev/null
+++ b/include/linux/processor.h
@@ -0,0 +1,62 @@
+/* Misc low level processor primitives */
+#ifndef _LINUX_PROCESSOR_H
+#define _LINUX_PROCESSOR_H
+
+#include 
+
+/*
+ * spin_begin is used before beginning a busy-wait loop, and must be paired
+ * with spin_end when the loop is exited. spin_cpu_relax must be called
+ * within the loop.
+ *
+ * The loop body should be as small and fast as possible, on the order of
+ * tens of instructions/cycles as a guide. It should avoid calling
+ * cpu_relax, or any "spin" or sleep type of primitive including nested uses
+ * of these primitives. It should not lock or take any other resource.
+ * Violations of these guidelines will not cause a bug, but may cause
+ * suboptimal performance.
+ *
+ * These loops are optimized to be used where wait times are expected to be
+ * less than the cost of a context switch (and associated overhead).
+ *
+ * Detection of resource owner and decision to spin or sleep or guest-yield
+ * (e.g., spin lock holder vcpu preempted, or mutex owner not on CPU) can be
+ * tested within the loop body.
+ */
+#ifndef spin_begin
+#define spin_begin()
+#endif
+
+#ifndef spin_cpu_relax
+#define spin_cpu_relax() cpu_relax()
+#endif
+
+/*
+ * spin_cpu_yield may be called to yield (undirected) to the hypervisor if
+ * necessary. This should be used if the wait is expected to take longer
+ * than context switch overhead, but we can't sleep or do a directed yield.
+ */
+#ifndef spin_cpu_yield
+#define spin_cpu_yield() cpu_relax_yield()
+#endif
+
+#ifndef spin_end
+#define spin_end()
+#endif
+
+/*
+ * spin_until_cond_likely can be used to wait for a condition to become
+ * true. It may be expected that the first iteration will be true in the
+ * common case (no spinning).
+ */
+#ifndef spin_until_cond_likely
+#define spin_until_cond_likely(cond)   \
+do {   \
+   spin_begin();
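
The archived patch is cut off at this point. A plausible reading of how
the generic fallback continues, inferred from the spin_begin /
spin_cpu_relax / spin_end primitives defined earlier in the patch (an
illustrative reconstruction, not the author's actual code):

#ifndef spin_until_cond_likely
#define spin_until_cond_likely(cond)				\
do {								\
	spin_begin();	/* a no-op in this generic fallback */	\
	while (!(cond))						\
		spin_cpu_relax();				\
	spin_end();						\
} while (0)
#endif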

for stable -- random: use chacha20 for get_random_int/long

2017-04-06 Thread Jason A. Donenfeld
Given that the below commit isn't very big and adds a nice security
property (in addition to performance), it might be worthwhile to
backport this to 4.9 stable. It's not a candidate for 4.4, since that
kernel doesn't use chacha for the rng at all.

As this is in random.c, it's Ted's and Greg's judgement call.

commit f5b98461cb8167ba362ad9f74c41d126b7becea7
Author: Jason A. Donenfeld 
Date:   Fri Jan 6 19:32:01 2017 +0100

   random: use chacha20 for get_random_int/long

   Now that our crng uses chacha20, we can rely on its speedy
   characteristics for replacing MD5, while simultaneously achieving a
   higher security guarantee. Before the idea was to use these functions if
   you wanted random integers that aren't stupidly insecure but aren't
   necessarily secure either, a vague gray zone, that hopefully was "good
   enough" for its users. With chacha20, we can strengthen this claim,
   since either we're using an rdrand-like instruction, or we're using the
   same crng as /dev/urandom. And it's faster than what was before.

   We could have chosen to replace this with a SipHash-derived function,
   which might be slightly faster, but at the cost of having yet another
   RNG construction in the kernel. By moving to chacha20, we have a single
   RNG to analyze and verify, and we also already get good performance
   improvements on all platforms.

   Implementation-wise, rather than use a generic buffer for both
   get_random_int/long and memcpy based on the size needs, we use a
   specific buffer for 32-bit reads and for 64-bit reads. This way, we're
   guaranteed to always have aligned accesses on all platforms. While
   slightly more verbose in C, the assembly this generates is a lot
   simpler than otherwise.

   Finally, on 32-bit platforms where longs and ints are the same size,
   we simply alias get_random_int to get_random_long.

   Signed-off-by: Jason A. Donenfeld 
   Suggested-by: Theodore Ts'o 
   Cc: Theodore Ts'o 
   Cc: Hannes Frederic Sowa 
   Cc: Andy Lutomirski 
   Signed-off-by: Theodore Ts'o 
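
A standalone sketch of the batching scheme the commit message describes:
a naturally aligned, per-size batch buffer refilled from the generator,
so every read is an aligned load. rand_bytes() stands in for the
kernel's chacha20-based output and is NOT cryptographic:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define BATCH_BYTES 64			/* one chacha20 block in the real code */

/* Stand-in for the crng refill; illustrative only. */
static void rand_bytes(void *buf, size_t n)
{
	unsigned char *p = buf;

	while (n--)
		*p++ = (unsigned char)rand();
}

/* A dedicated, naturally aligned batch for 64-bit reads; the real code
 * keeps a separate one for 32-bit reads for the same alignment reason. */
static uint64_t batch64[BATCH_BYTES / sizeof(uint64_t)];
static unsigned int pos64;

static uint64_t get_random_u64_sketch(void)
{
	if (pos64 == 0) {
		rand_bytes(batch64, sizeof(batch64));
		pos64 = BATCH_BYTES / sizeof(uint64_t);
	}
	return batch64[--pos64];	/* always an aligned load */
}

int main(void)
{
	for (int i = 0; i < 4; i++)
		printf("%#018" PRIx64 "\n", get_random_u64_sketch());
	return 0;
}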


Re: Linux 4.11: Reported regressions as of Tuesday, 2017-04-02

2017-04-06 Thread Michel Dänzer
On 02/04/17 10:03 PM, Thorsten Leemhuis wrote:
> 
> == Going to be removed from the list ==
> 
> [...]
> 
> Desc: DRM BUG while initializing cape verde (2nd card)
> Repo: 2017-03-13 https://bugzilla.kernel.org/show_bug.cgi?id=194867
> Stat: n/a 
> Note: problem was in 4.10 already & reporter closed issue

There are two separate issues discussed in this bug report. The first
one (described in the bug title and initial description) was a
regression and fixed by Alex's patches. The other issue (described in
comment 3) is not a regression, I asked Janpieter to file a separate
report for that.


-- 
Earthling Michel Dänzer   |   http://www.amd.com
Libre software enthusiast | Mesa and X developer


Re: [PATCH 07/24] kexec: Disable at runtime if the kernel is locked down

2017-04-06 Thread Dave Young
On 04/05/17 at 09:15pm, David Howells wrote:
> From: Matthew Garrett 
> 
> kexec permits the loading and execution of arbitrary code in ring 0, which
> is something that lock-down is meant to prevent. It makes sense to disable
> kexec in this situation.
> 
> This does not affect kexec_file_load() which can check for a signature on the
> image to be booted.
> 
> Signed-off-by: Matthew Garrett 
> Signed-off-by: David Howells 
> cc: ke...@lists.infradead.org
> ---
> 
>  kernel/kexec.c |7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index 980936a90ee6..46de8e6b42f4 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -194,6 +194,13 @@ SYSCALL_DEFINE4(kexec_load, unsigned long, entry, 
> unsigned long, nr_segments,
>   return -EPERM;
>  
>   /*
> +  * kexec can be used to circumvent module loading restrictions, so
> +  * prevent loading in that case
> +  */
> + if (kernel_is_locked_down())
> + return -EPERM;
> +
> + /*
>* Verify we have a legal set of flags
>* This leaves us room for future extensions.
>*/
> 
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Acked-by: Dave Young 

Thanks
Dave


Re: [PATCH 09/24] kexec_file: Disable at runtime if securelevel has been set

2017-04-06 Thread Dave Young
On 04/05/17 at 09:15pm, David Howells wrote:
> From: Chun-Yi Lee 
> 
> When KEXEC_VERIFY_SIG is not enabled, the kernel should not load an image
> through the kexec_file syscall if securelevel has been set.
> 
> This code was shown in Matthew's patch but not in git:
> https://lkml.org/lkml/2015/3/13/778
> 
> Cc: Matthew Garrett 
> Signed-off-by: Chun-Yi Lee 
> Signed-off-by: David Howells 
> cc: ke...@lists.infradead.org
> ---
> 
>  kernel/kexec_file.c |6 ++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
> index b118735fea9d..f6937eecd1eb 100644
> --- a/kernel/kexec_file.c
> +++ b/kernel/kexec_file.c
> @@ -268,6 +268,12 @@ SYSCALL_DEFINE5(kexec_file_load, int, kernel_fd, int, 
> initrd_fd,
>   if (!capable(CAP_SYS_BOOT) || kexec_load_disabled)
>   return -EPERM;
>  
> + /* Don't permit images to be loaded into trusted kernels if we're not
> +  * going to verify the signature on them
> +  */
> + if (!IS_ENABLED(CONFIG_KEXEC_VERIFY_SIG) && kernel_is_locked_down())
> + return -EPERM;
> +
>   /* Make sure we have a legal set of flags */
>   if (flags != (flags & KEXEC_FILE_FLAGS))
>   return -EINVAL;
> 
> 
> ___
> kexec mailing list
> ke...@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec

Acked-by: Dave Young 

Thanks
Dave


Re: [PATCH] kvm: pass the virtual SEI syndrome to guest OS

2017-04-06 Thread gengdongjiu
Hi Laszlo,
  thanks.

On 2017/4/7 2:55, Laszlo Ersek wrote:
> On 04/06/17 14:35, gengdongjiu wrote:
>> Dear Laszlo,
>> Thanks for your detailed explanation.
>>
>> On 2017/3/29 19:58, Laszlo Ersek wrote:
>>> (This ought to be one of the longest address lists I've ever seen :)
>>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>>> measure, I'm adding MST and Igor.)
>>>
>>> On 03/29/17 12:36, Achin Gupta wrote:
 Hi gengdongjiu,

 On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>
> Hi Laszlo/Biesheuvel/Qemu developer,
>
> Now I have encountered an issue on the ARM64 platform and want to
> consult with you, as described below:
>
> when the guest OS takes a synchronous or asynchronous abort, kvm needs
> to send the error address to Qemu or UEFI through SIGBUS to
> dynamically generate the APEI table. From my investigation, there are
> two ways:
>
> (1) Qemu gets the error address and generates the APEI table, then
> notifies UEFI of the generation, then injects the abort error into the
> guest OS, and the guest OS reads the APEI table.
> (2) Qemu gets the error address and lets UEFI generate the APEI
> table, then injects the abort error into the guest OS, and the guest OS
> reads the APEI table.

 Just being pedantic! I don't think we are talking about creating the APEI 
 table
 dynamically here. The issue is: Once KVM has received an error that is 
 destined
 for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the 
 error
 into the guest OS, a CPER (Common Platform Error Record) has to be 
 generated
 corresponding to the error source (GHES corresponding to memory subsystem,
 processor etc) to allow the guest OS to do anything meaningful with the
 error. So who should create the CPER is the question.

 At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error 
 arrives
 at EL3 and secure firmware (at EL3 or a lower secure exception level) is
 responsible for creating the CPER. ARM is experimenting with using a 
 Standalone
 MM EDK2 image in the secure world to do the CPER creation. This will avoid
 adding the same code in ARM TF in EL3 (better for security). The error 
 will then
 be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM 
 Trusted
 Firmware.

 Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
 interface (as discussed with Christoffer below). So it should generate the 
 CPER
 before injecting the error.

 This corresponds to (1) above, apart from notifying UEFI (I am assuming 
 you
 mean guest UEFI). At this time, the guest OS already knows where to pick 
 up the
 CPER from through the HEST. Qemu has to create the CPER and populate its 
 address
 at the address exported in the HEST. Guest UEFI should not be involved in 
 this
 flow. Its job was to create the HEST at boot and that has been done by this
 stage.

 Qemu folks will be able to add more here, but it looks like support for
 CPER generation will need to be added to Qemu. We need to resolve this.

 Do shout if I am missing anything above.
>>>
>>> After reading this email, the use case looks *very* similar to what
>>> we've just done with VMGENID for QEMU 2.9.
>>>
>>> We have a facility between QEMU and the guest firmware, called "ACPI
>>> linker/loader", with which QEMU instructs the firmware to
>>>
>>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>>> ALLOCATE command,
>>>
>>> - relocate pointers in those blobs, to fields in other (or the same)
>>> blobs -- ADD_POINTER command,
>>>
>>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>>
>>> - and send GPAs of fields within such blobs back to QEMU --
>>> WRITE_POINTER command.
>>>
>>> This is how I imagine we can map the facility to the current use case
>>> (note that this is the first time I read about HEST / GHES / CPER):
>>>
>>> etc/acpi/tables                 etc/hardware_errors
>>> ================     ==========================================
>>>                      +-----------+
>>> +--------------+     | address   |          +--> +--------------+
>>> |    HEST      |     | registers |          |    | Error Status |
>>> | +----------+ |     | +-------+ |          |    | Data Block 1 |
>>> | | GHES     | | --> | | addr  | | ---------+    | +----------+ |
>>> | | GHES     | | --> | | addr  | | -------+      | |   CPER   | |
>>> | | GHES     | | --> | | addr  | | -----+ |      | |   CPER   | |
>>> | | GHES     | | --> | | addr  | | -+   | |      | |   CPER   | |
>>> +-+----------+-+     +-+-------+-+  |   | |      | +----------+ |
>>>                                     |   | |
>>>                                     |   | +----> +--------------+
>>>                                     |   |        | Error
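
To restate the indirection the (truncated) diagram describes, a
simplified sketch in C; the struct and field names here are
illustrative only, not the real ACPI or QEMU definitions:

#include <stdint.h>

/* Each GHES entry in the HEST carries an error status address register;
 * that register holds the guest-physical address of one Error Status
 * Data Block. */
struct ghes_entry_sketch {
	uint64_t error_status_address;	/* GPA of one data block below */
};

/* Each Error Status Data Block is filled by QEMU with one or more CPER
 * records, which the guest OS parses when the error is delivered. */
struct error_status_block_sketch {
	uint32_t block_status;
	uint32_t data_length;
	uint8_t  cper_records[];	/* CPERs written before injection */
};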

[PATCH v2 5/5] perf report: Show branch type in callchain entry

2017-04-06 Thread Jin Yao
Show branch type in callchain entry. The branch type is printed
with other LBR information (such as cycles/abort/...).

One example:
perf report --branch-history --stdio --no-children

-23.60%--main div.c:42 (RET cycles:2)
 compute_flag div.c:28 (RET cycles:2)
 compute_flag div.c:27 (RET CROSS_2M cycles:1)
 rand rand.c:28 (RET CROSS_2M cycles:1)
 rand rand.c:28 (RET cycles:1)
 __random random.c:298 (RET cycles:1)
 __random random.c:297 (JCC forward cycles:1)
 __random random.c:295 (JCC forward cycles:1)
 __random random.c:295 (JCC forward cycles:1)
 __random random.c:295 (JCC forward cycles:1)
 __random random.c:295 (RET cycles:9)

Signed-off-by: Jin Yao 
---
 tools/perf/util/callchain.c | 221 
 tools/perf/util/callchain.h |  20 
 2 files changed, 182 insertions(+), 59 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 3cea1fb..ca040a0 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -428,6 +428,89 @@ create_child(struct callchain_node *parent, bool 
inherit_children)
return new;
 }
 
+static const char *br_tag[BR_IDX_MAX] = {
+   "JCC forward",
+   "JCC backward",
+   "JMP",
+   "IND_JMP",
+   "CALL",
+   "IND_CALL",
+   "RET",
+   "SYSCALL",
+   "SYSRET",
+   "IRQ",
+   "INT",
+   "IRET",
+   "FAR_BRANCH",
+   "CROSS_4K",
+   "CROSS_2M",
+};
+
+static void
+branch_type_count(int *counts, struct branch_flags *flags)
+{
+   switch (flags->type) {
+   case PERF_BR_JCC_FWD:
+   counts[BR_IDX_JCC_FWD]++;
+   break;
+
+   case PERF_BR_JCC_BWD:
+   counts[BR_IDX_JCC_BWD]++;
+   break;
+
+   case PERF_BR_JMP:
+   counts[BR_IDX_JMP]++;
+   break;
+
+   case PERF_BR_IND_JMP:
+   counts[BR_IDX_IND_JMP]++;
+   break;
+
+   case PERF_BR_CALL:
+   counts[BR_IDX_CALL]++;
+   break;
+
+   case PERF_BR_IND_CALL:
+   counts[BR_IDX_IND_CALL]++;
+   break;
+
+   case PERF_BR_RET:
+   counts[BR_IDX_RET]++;
+   break;
+
+   case PERF_BR_SYSCALL:
+   counts[BR_IDX_SYSCALL]++;
+   break;
+
+   case PERF_BR_SYSRET:
+   counts[BR_IDX_SYSRET]++;
+   break;
+
+   case PERF_BR_IRQ:
+   counts[BR_IDX_IRQ]++;
+   break;
+
+   case PERF_BR_INT:
+   counts[BR_IDX_INT]++;
+   break;
+
+   case PERF_BR_IRET:
+   counts[BR_IDX_IRET]++;
+   break;
+
+   case PERF_BR_FAR_BRANCH:
+   counts[BR_IDX_FAR_BRANCH]++;
+   break;
+
+   default:
+   break;
+   }
+
+   if (flags->cross == PERF_BR_CROSS_2M)
+   counts[BR_IDX_CROSS_2M]++;
+   else if (flags->cross == PERF_BR_CROSS_4K)
+   counts[BR_IDX_CROSS_4K]++;
+}
 
 /*
  * Fill the node with callchain values
@@ -467,6 +550,9 @@ fill_node(struct callchain_node *node, struct 
callchain_cursor *cursor)
call->cycles_count = cursor_node->branch_flags.cycles;
call->iter_count = cursor_node->nr_loop_iter;
call->samples_count = cursor_node->samples;
+
+   branch_type_count(call->brtype_count,
+ _node->branch_flags);
}
 
list_add_tail(>list, >val);
@@ -579,6 +665,9 @@ static enum match_result match_chain(struct 
callchain_cursor_node *node,
cnode->cycles_count += node->branch_flags.cycles;
cnode->iter_count += node->nr_loop_iter;
cnode->samples_count += node->samples;
+
+   branch_type_count(cnode->brtype_count,
+ >branch_flags);
}
 
return MATCH_EQ;
@@ -1105,95 +1194,108 @@ int callchain_branch_counts(struct callchain_root 
*root,
  cycles_count);
 }
 
+static int branch_type_str(int *counts, char *bf, int bfsize)
+{
+   int i, printed = 0;
+   bool brace = false;
+
+   for (i = 0; i < BR_IDX_MAX; i++) {
+   if (printed == bfsize - 1)
+   return printed;
+
+   if (counts[i] > 0) {
+   if (!brace) {
+   brace = true;
+   printed += scnprintf(bf + printed,
+   bfsize - printed,
+   " (%s", br_tag[i]);
+   } else
+   printed += scnprintf(bf + printed,
+ 

[PATCH v2 4/5] perf report: Show branch type statistics for stdio mode

2017-04-06 Thread Jin Yao
Show the branch type statistics at the end of perf report --stdio.

For example:
perf report --stdio

 JCC forward:  27.7%
JCC backward:   9.8%
 JMP:   0.0%
 IND_JMP:   6.5%
CALL:  26.6%
IND_CALL:   0.0%
 RET:  29.3%
IRET:   0.0%
CROSS_4K:   0.0%
CROSS_2M:  14.3%

The branch types are:
---------------------
 JCC forward: Conditional forward jump
JCC backward: Conditional backward jump
 JMP: Jump imm
 IND_JMP: Jump reg/mem
CALL: Call imm
IND_CALL: Call reg/mem
 RET: Ret
 SYSCALL: Syscall
  SYSRET: Syscall return
 IRQ: HW interrupt/trap/fault
 INT: SW interrupt
IRET: Return from interrupt
  FAR_BRANCH: Others not generic branch type

CROSS_4K and CROSS_2M:
----------------------
These are metrics that check whether a branch crosses a 4K or 2MB page
boundary. It is an approximate computation: we do not know whether the
area is 4K or 2MB, so both are always computed.

To keep the output simple, if a branch crosses a 2M area, CROSS_4K
is not incremented.
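
As a standalone sketch of that approximation (the actual check lands in
the LBR code in patch 2/5), the test reduces to comparing size-aligned
addresses:

/* true when the from/to addresses of a branch sit in different
 * naturally-aligned areas of the given size (4K or 2M) */
static inline bool crosses_area(u64 from, u64 to, u64 size)
{
	return (from & ~(size - 1)) != (to & ~(size - 1));
}

/* e.g. crosses_area(0x1ffe, 0x2002, 4096) is true: a 4K boundary
 * lies between the two addresses */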

Signed-off-by: Jin Yao 
---
 tools/perf/builtin-report.c | 212 
 tools/perf/util/event.h |   4 +-
 tools/perf/util/hist.c  |   5 +-
 3 files changed, 216 insertions(+), 5 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index c18158b..1dc1058 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -43,6 +43,24 @@
 #include 
 #include 
 
+struct branch_type_stat {
+   u64 jcc_fwd;
+   u64 jcc_bwd;
+   u64 jmp;
+   u64 ind_jmp;
+   u64 call;
+   u64 ind_call;
+   u64 ret;
+   u64 syscall;
+   u64 sysret;
+   u64 irq;
+   u64 intr;
+   u64 iret;
+   u64 far_branch;
+   u64 cross_4k;
+   u64 cross_2m;
+};
+
 struct report {
struct perf_tooltool;
struct perf_session *session;
@@ -66,6 +84,7 @@ struct report {
u64 queue_size;
int socket_filter;
DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
+   struct branch_type_stat brtype_stat;
 };
 
 static int report__config(const char *var, const char *value, void *cb)
@@ -144,6 +163,91 @@ static int hist_iter__report_callback(struct 
hist_entry_iter *iter,
return err;
 }
 
+static void branch_type_count(struct report *rep, struct branch_info *bi)
+{
+   struct branch_type_stat *stat = >brtype_stat;
+   struct branch_flags *flags = >flags;
+
+   switch (flags->type) {
+   case PERF_BR_JCC_FWD:
+   stat->jcc_fwd++;
+   break;
+
+   case PERF_BR_JCC_BWD:
+   stat->jcc_bwd++;
+   break;
+
+   case PERF_BR_JMP:
+   stat->jmp++;
+   break;
+
+   case PERF_BR_IND_JMP:
+   stat->ind_jmp++;
+   break;
+
+   case PERF_BR_CALL:
+   stat->call++;
+   break;
+
+   case PERF_BR_IND_CALL:
+   stat->ind_call++;
+   break;
+
+   case PERF_BR_RET:
+   stat->ret++;
+   break;
+
+   case PERF_BR_SYSCALL:
+   stat->syscall++;
+   break;
+
+   case PERF_BR_SYSRET:
+   stat->sysret++;
+   break;
+
+   case PERF_BR_IRQ:
+   stat->irq++;
+   break;
+
+   case PERF_BR_INT:
+   stat->intr++;
+   break;
+
+   case PERF_BR_IRET:
+   stat->iret++;
+   break;
+
+   case PERF_BR_FAR_BRANCH:
+   stat->far_branch++;
+   break;
+
+   default:
+   break;
+   }
+
+   if (flags->cross == PERF_BR_CROSS_2M)
+   stat->cross_2m++;
+   else if (flags->cross == PERF_BR_CROSS_4K)
+   stat->cross_4k++;
+}
+
+static int hist_iter__branch_callback(struct hist_entry_iter *iter,
+ struct addr_location *al __maybe_unused,
+ bool single __maybe_unused,
+ void *arg)
+{
+   struct hist_entry *he = iter->he;
+   struct report *rep = arg;
+   struct branch_info *bi;
+
+   if (sort__mode == SORT_MODE__BRANCH) {
+   bi = he->branch_info;
+   branch_type_count(rep, bi);
+   }
+
+   return 0;
+}
+
 static int process_sample_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
@@ -182,6 +286,8 @@ static int process_sample_event(struct perf_tool *tool,
 */
if (!sample->branch_stack)
goto out_put;
+
+   iter.add_entry_cb = hist_iter__branch_callback;
iter.ops = _iter_branch;
} else if (rep->mem_mode) 

[PATCH v2 3/5] perf record: Create a new option save_type in --branch-filter

2017-04-06 Thread Jin Yao
The option tells the kernel to save the branch type during sampling.

One example:
perf record -g --branch-filter any,save_type 

Signed-off-by: Jin Yao 
---
 tools/perf/Documentation/perf-record.txt | 1 +
 tools/perf/util/parse-branch-options.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt 
b/tools/perf/Documentation/perf-record.txt
index ea3789d..e2f5a4f 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -332,6 +332,7 @@ following filters are defined:
- no_tx: only when the target is not in a hardware transaction
- abort_tx: only when the target is a hardware transaction abort
- cond: conditional branches
+   - save_type: save branch type during sampling in case binary is not 
available later
 
 +
 The option requires at least one branch type among any, any_call, any_ret, 
ind_call, cond.
diff --git a/tools/perf/util/parse-branch-options.c 
b/tools/perf/util/parse-branch-options.c
index 38fd115..e71fb5f 100644
--- a/tools/perf/util/parse-branch-options.c
+++ b/tools/perf/util/parse-branch-options.c
@@ -28,6 +28,7 @@ static const struct branch_mode branch_modes[] = {
BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_COND),
BRANCH_OPT("ind_jmp", PERF_SAMPLE_BRANCH_IND_JUMP),
BRANCH_OPT("call", PERF_SAMPLE_BRANCH_CALL),
+   BRANCH_OPT("save_type", PERF_SAMPLE_BRANCH_TYPE_SAVE),
BRANCH_END
 };
 
-- 
2.7.4
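
Under the hood, "--branch-filter any,save_type" turns into
branch_sample_type bits on perf_event_open(2). A minimal sketch of the
equivalent raw call (this assumes a kernel with this series applied,
since PERF_SAMPLE_BRANCH_TYPE_SAVE is new here):

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	/* "any,save_type": sample all branches and keep their type */
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_USER |
				  PERF_SAMPLE_BRANCH_ANY |
				  PERF_SAMPLE_BRANCH_TYPE_SAVE;

	int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
	if (fd < 0)
		perror("perf_event_open");
	else
		printf("branch sampling with save_type enabled, fd=%d\n", fd);
	return 0;
}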



[PATCH v2 1/5] perf/core: Define the common branch type classification

2017-04-06 Thread Jin Yao
It is often useful to know the branch types while analyzing branch
data. For example, a call is very different from a conditional branch.

Currently we have to look it up in the binary, but the binary may not
be available later, and even when it is available the lookup takes the
user some time. It is very useful for the user to check it directly in
the perf report.

Perf already has support for disassembling the branch instruction
to get the x86 branch type.

To keep consistent on kernel and userspace and make the classification
more common, the patch adds the common branch type classification
in perf_event.h.

PERF_BR_NONE  : unknown
PERF_BR_JCC_FWD   : conditional forward jump
PERF_BR_JCC_BWD   : conditional backward jump
PERF_BR_JMP   : jump
PERF_BR_IND_JMP   : indirect jump
PERF_BR_CALL  : call
PERF_BR_IND_CALL  : indirect call
PERF_BR_RET   : return
PERF_BR_SYSCALL   : syscall
PERF_BR_SYSRET: syscall return
PERF_BR_IRQ   : hw interrupt/trap/fault
PERF_BR_INT   : sw interrupt
PERF_BR_IRET  : return from interrupt
PERF_BR_FAR_BRANCH: others not generic branch type

The patch adds the following metrics, which check whether branches
cross 4K or 2MB areas.

PERF_BR_CROSS_NONE: branch not cross an area
PERF_BR_CROSS_4K  : branch cross 4K area
PERF_BR_CROSS_2M  : branch cross 2MB area

Since disassembling the branch instruction incurs some overhead,
a new PERF_SAMPLE_BRANCH_TYPE_SAVE is introduced to indicate whether
the branch instruction needs to be disassembled and the branch type
recorded.
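
On the consuming side, a tool that walks the sampled branch stack can
translate the new field directly; a short sketch using the enum above
(the helper name is illustrative, not part of the patch):

/* map a sampled entry's type to a printable tag; remaining PERF_BR_*
 * cases are elided for brevity */
static const char *br_type_name(const struct perf_branch_entry *e)
{
	switch (e->type) {
	case PERF_BR_JCC_FWD:	return "JCC forward";
	case PERF_BR_JCC_BWD:	return "JCC backward";
	case PERF_BR_CALL:	return "CALL";
	case PERF_BR_IND_CALL:	return "IND_CALL";
	case PERF_BR_RET:	return "RET";
	default:		return "unknown";
	}
}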

Signed-off-by: Jin Yao 
---
 include/uapi/linux/perf_event.h   | 37 ++-
 tools/include/uapi/linux/perf_event.h | 37 ++-
 2 files changed, 72 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d09a9cd..e2fcd53 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -174,6 +174,8 @@ enum perf_branch_sample_type_shift {
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT   = 14, /* no flags */
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT  = 15, /* no cycles */
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT  = 16, /* save branch type */
+
PERF_SAMPLE_BRANCH_MAX_SHIFT/* non-ABI */
 };
 
@@ -198,9 +200,38 @@ enum perf_branch_sample_type {
PERF_SAMPLE_BRANCH_NO_FLAGS = 1U << 
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT,
PERF_SAMPLE_BRANCH_NO_CYCLES= 1U << 
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT,
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE=
+   1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT,
+
PERF_SAMPLE_BRANCH_MAX  = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
 };
 
+/*
+ * Common flow change classification
+ */
+enum {
+   PERF_BR_NONE= 0,/* unknown */
+   PERF_BR_JCC_FWD = 1,/* conditional forward jump */
+   PERF_BR_JCC_BWD = 2,/* conditional backward jump */
+   PERF_BR_JMP = 3,/* jump */
+   PERF_BR_IND_JMP = 4,/* indirect jump */
+   PERF_BR_CALL= 5,/* call */
+   PERF_BR_IND_CALL= 6,/* indirect call */
+   PERF_BR_RET = 7,/* return */
+   PERF_BR_SYSCALL = 8,/* syscall */
+   PERF_BR_SYSRET  = 9,/* syscall return */
+   PERF_BR_IRQ = 10,   /* hw interrupt/trap/fault */
+   PERF_BR_INT = 11,   /* sw interrupt */
+   PERF_BR_IRET= 12,   /* return from interrupt */
+   PERF_BR_FAR_BRANCH  = 13,   /* others not generic branch type */
+};
+
+enum {
+   PERF_BR_CROSS_NONE  = 0,/* branch not cross an area */
+   PERF_BR_CROSS_4K= 1,/* branch cross 4K */
+   PERF_BR_CROSS_2M= 2,/* branch cross 2MB */
+};
+
 #define PERF_SAMPLE_BRANCH_PLM_ALL \
(PERF_SAMPLE_BRANCH_USER|\
 PERF_SAMPLE_BRANCH_KERNEL|\
@@ -999,6 +1030,8 @@ union perf_mem_data_src {
  * in_tx: running in a hardware transaction
  * abort: aborting a hardware transaction
  *cycles: cycles from last branch (or 0 if not supported)
+ *  type: branch type
+ * cross: branch cross 4K or 2MB area
  */
 struct perf_branch_entry {
__u64   from;
@@ -1008,7 +1041,9 @@ struct perf_branch_entry {
in_tx:1,/* in transaction */
abort:1,/* transaction abort */
cycles:16,  /* cycle count to last branch */
-   reserved:44;
+   type:4, /* branch type */
+   cross:2,/* branch cross 4K or 2MB area */
+   reserved:38;
 };
 
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index d09a9cd..e2fcd53 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -174,6 +174,8 @@ enum perf_branch_sample_type_shift {

[PATCH v2 2/5] perf/x86/intel: Record branch type

2017-04-06 Thread Jin Yao
Perf already has support for disassembling the branch instruction
and using the branch type for filtering. The patch just records
the branch type in perf_branch_entry.

Before recording, the patch converts the x86 branch classification
to the common branch classification and computes whether the branch
crosses a 4K or 2MB area. This is an approximate computation for
crossing a 4K or 2MB page.

Signed-off-by: Jin Yao 
---
 arch/x86/events/intel/lbr.c | 106 +++-
 1 file changed, 105 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 81b321a..635a0fb 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -109,6 +109,9 @@ enum {
X86_BR_ZERO_CALL= 1 << 15,/* zero length call */
X86_BR_CALL_STACK   = 1 << 16,/* call stack */
X86_BR_IND_JMP  = 1 << 17,/* indirect jump */
+
+   X86_BR_TYPE_SAVE= 1 << 18,/* indicate to save branch type */
+
 };
 
 #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -139,6 +142,9 @@ enum {
 X86_BR_IRQ |\
 X86_BR_INT)
 
+#define AREA_4K4096
+#define AREA_2M(2 * 1024 * 1024)
+
 static void intel_pmu_lbr_filter(struct cpu_hw_events *cpuc);
 
 /*
@@ -670,6 +676,10 @@ static int intel_pmu_setup_sw_lbr_filter(struct perf_event 
*event)
 
if (br_type & PERF_SAMPLE_BRANCH_CALL)
mask |= X86_BR_CALL | X86_BR_ZERO_CALL;
+
+   if (br_type & PERF_SAMPLE_BRANCH_TYPE_SAVE)
+   mask |= X86_BR_TYPE_SAVE;
+
/*
 * stash actual user request into reg, it may
 * be used by fixup code for some CPU
@@ -923,6 +933,84 @@ static int branch_type(unsigned long from, unsigned long 
to, int abort)
return ret;
 }
 
+static int
+common_branch_type(int type, u64 from, u64 to)
+{
+   int ret;
+
+   type = type & (~(X86_BR_KERNEL | X86_BR_USER));
+
+   switch (type) {
+   case X86_BR_CALL:
+   case X86_BR_ZERO_CALL:
+   ret = PERF_BR_CALL;
+   break;
+
+   case X86_BR_RET:
+   ret = PERF_BR_RET;
+   break;
+
+   case X86_BR_SYSCALL:
+   ret = PERF_BR_SYSCALL;
+   break;
+
+   case X86_BR_SYSRET:
+   ret = PERF_BR_SYSRET;
+   break;
+
+   case X86_BR_INT:
+   ret = PERF_BR_INT;
+   break;
+
+   case X86_BR_IRET:
+   ret = PERF_BR_IRET;
+   break;
+
+   case X86_BR_IRQ:
+   ret = PERF_BR_IRQ;
+   break;
+
+   case X86_BR_ABORT:
+   ret = PERF_BR_FAR_BRANCH;
+   break;
+
+   case X86_BR_JCC:
+   if (to > from)
+   ret = PERF_BR_JCC_FWD;
+   else
+   ret = PERF_BR_JCC_BWD;
+   break;
+
+   case X86_BR_JMP:
+   ret = PERF_BR_JMP;
+   break;
+
+   case X86_BR_IND_CALL:
+   ret = PERF_BR_IND_CALL;
+   break;
+
+   case X86_BR_IND_JMP:
+   ret = PERF_BR_IND_JMP;
+   break;
+
+   default:
+   ret = PERF_BR_NONE;
+   }
+
+   return ret;
+}
+
+static bool
+cross_area(u64 addr1, u64 addr2, int size)
+{
+   u64 align1, align2;
+
+   align1 = addr1 & ~(size - 1);
+   align2 = addr2 & ~(size - 1);
+
+   return (align1 != align2) ? true : false;
+}
+
 /*
  * implement actual branch filter based on user demand.
  * Hardware may not exactly satisfy that request, thus
@@ -939,7 +1027,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
bool compress = false;
 
/* if sampling all branches, then nothing to filter */
-   if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
+   if (((br_sel & X86_BR_ALL) == X86_BR_ALL) &&
+   ((br_sel & X86_BR_TYPE_SAVE) != X86_BR_TYPE_SAVE))
return;
 
for (i = 0; i < cpuc->lbr_stack.nr; i++) {
@@ -960,6 +1049,21 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
cpuc->lbr_entries[i].from = 0;
compress = true;
}
+
+   if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE) {
+   cpuc->lbr_entries[i].type = common_branch_type(type,
+  from,
+  to);
+   if (cross_area(from, to, AREA_2M))
+   cpuc->lbr_entries[i].cross = PERF_BR_CROSS_2M;
+   else if (cross_area(from, to, AREA_4K))
+   cpuc->lbr_entries[i].cross = PERF_BR_CROSS_4K;
+   else
+   cpuc->lbr_entries[i].cross = PERF_BR_CROSS_NONE;
+   } else {
+

[PATCH v2 0/5] perf report: Show branch type

2017-04-06 Thread Jin Yao
v2:
---
1. Use 4 bits in perf_branch_entry to record branch type.

2. Pull out some common branch types from FAR_BRANCH. Now the branch
   types defined in perf_event.h:

PERF_BR_NONE  : unknown
PERF_BR_JCC_FWD   : conditional forward jump
PERF_BR_JCC_BWD   : conditional backward jump
PERF_BR_JMP   : jump
PERF_BR_IND_JMP   : indirect jump
PERF_BR_CALL  : call
PERF_BR_IND_CALL  : indirect call
PERF_BR_RET   : return
PERF_BR_SYSCALL   : syscall
PERF_BR_SYSRET: syscall return
PERF_BR_IRQ   : hw interrupt/trap/fault
PERF_BR_INT   : sw interrupt
PERF_BR_IRET  : return from interrupt
PERF_BR_FAR_BRANCH: others not generic far branch type

3. Use 2 bits in perf_branch_entry for "cross" metrics that check
   whether a branch crosses a 4K or 2M area. It is an approximate
   computation for checking whether the branch crosses a 4K or 2MB page.

For example:

perf record -g --branch-filter any,save_type 

perf report --stdio

 JCC forward:  27.7%
JCC backward:   9.8%
 JMP:   0.0%
 IND_JMP:   6.5%
CALL:  26.6%
IND_CALL:   0.0%
 RET:  29.3%
IRET:   0.0%
CROSS_4K:   0.0%
CROSS_2M:  14.3%

perf report --branch-history --stdio --no-children

-23.60%--main div.c:42 (RET cycles:2)
 compute_flag div.c:28 (RET cycles:2)
 compute_flag div.c:27 (RET CROSS_2M cycles:1)
 rand rand.c:28 (RET CROSS_2M cycles:1)
 rand rand.c:28 (RET cycles:1)
 __random random.c:298 (RET cycles:1)
 __random random.c:297 (JCC forward cycles:1)
 __random random.c:295 (JCC forward cycles:1)
 __random random.c:295 (JCC forward cycles:1)
 __random random.c:295 (JCC forward cycles:1)
 __random random.c:295 (RET cycles:9)

Changed:
  perf/core: Define the common branch type classification
  perf/x86/intel: Record branch type
  perf report: Show branch type statistics for stdio mode
  perf report: Show branch type in callchain entry

Not changed:
  perf record: Create a new option save_type in --branch-filter

v1:
---
It is often useful to know the branch types while analyzing branch
data. For example, a call is very different from a conditional branch.

Currently we have to look it up in the binary, but the binary may not
be available later, and even when it is available the lookup takes the
user some time. It is very useful for the user to check it directly in
the perf report.

Perf already has support for disassembling the branch instruction
to get the branch type.

The patch series records the branch type and shows it along with the
other LBR information in the callchain entry via perf report. The
series also adds a branch type summary at the end of
perf report --stdio.

To keep consistent on kernel and userspace and make the classification
more common, the patch adds the common branch type classification
in perf_event.h.

The common branch types are:

 JCC forward: Conditional forward jump
JCC backward: Conditional backward jump
 JMP: Jump imm
 IND_JMP: Jump reg/mem
CALL: Call imm
IND_CALL: Call reg/mem
 RET: Ret
  FAR_BRANCH: SYSCALL/SYSRET, IRQ, IRET, TSX Abort

An example:

1. Record branch type (new option "save_type")

perf record -g --branch-filter any,save_type 

2. Show the branch type statistics at the end of perf report --stdio

perf report --stdio

 JCC forward:  34.0%
JCC backward:   3.6%
 JMP:   0.0%
 IND_JMP:   6.5%
CALL:  26.6%
IND_CALL:   0.0%
 RET:  29.3%
  FAR_BRANCH:   0.0%

3. Show branch type in callchain entry

perf report --branch-history --stdio --no-children

--23.91%--main div.c:42 (RET cycles:2)
  compute_flag div.c:28 (RET cycles:2)
  compute_flag div.c:27 (RET cycles:1)
  rand rand.c:28 (RET cycles:1)
  rand rand.c:28 (RET cycles:1)
  __random random.c:298 (RET cycles:1)
  __random random.c:297 (JCC forward cycles:1)
  __random random.c:295 (JCC forward cycles:1)
  __random random.c:295 (JCC forward cycles:1)
  __random random.c:295 (JCC forward cycles:1)
  __random random.c:295 (RET cycles:9)

Jin Yao (5):
  perf/core: Define the common branch type classification
  perf/x86/intel: Record branch type
  perf record: Create a new option save_type in --branch-filter
  perf report: Show branch type statistics for stdio mode
  perf report: Show branch type in callchain entry

 arch/x86/events/intel/lbr.c  | 106 ++-
 include/uapi/linux/perf_event.h  |  37 +-
 tools/include/uapi/linux/perf_event.h|  37 +-
 tools/perf/Documentation/perf-record.txt |   1 +
 tools/perf/builtin-report.c  | 212 +
 tools/perf/util/callchain.c  | 221 ++-
 tools/perf/util/callchain.h  |  20 

Re: [PATCH] ath9k: Add cast to u8 to FREQ2FBIN macro

2017-04-06 Thread Joe Perches
On Thu, 2017-04-06 at 16:54 -0700, Matthias Kaehlcke wrote:
> Hi Joe,
> 
> El Thu, Apr 06, 2017 at 02:29:20PM -0700 Joe Perches ha dit:
> 
> > On Thu, 2017-04-06 at 14:21 -0700, Matthias Kaehlcke wrote:
> > > The macro results are assigned to u8 variables/fields. Adding the cast
> > > fixes plenty of clang warnings about "implicit conversion from 'int' to
> > > 'u8'".
> > > 
> > > Signed-off-by: Matthias Kaehlcke 
> > > ---
> > >  drivers/net/wireless/ath/ath9k/eeprom.h | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/net/wireless/ath/ath9k/eeprom.h 
> > > b/drivers/net/wireless/ath/ath9k/eeprom.h
> > > index 30bf722e33ed..31390af6c33e 100644
> > > --- a/drivers/net/wireless/ath/ath9k/eeprom.h
> > > +++ b/drivers/net/wireless/ath/ath9k/eeprom.h
> > > @@ -106,7 +106,7 @@
> > >  #define AR9285_RDEXT_DEFAULT0x1F
> > >  
> > >  #define ATH9K_POW_SM(_r, _s) (((_r) & 0x3f) << (_s))
> > > -#define FREQ2FBIN(x, y)  ((y) ? ((x) - 2300) : (((x) - 4800) / 
> > > 5))
> > > +#define FREQ2FBIN(x, y)  (u8)((y) ? ((x) - 2300) : (((x) - 4800) 
> > > / 5))
> > 
> > Maybe better to use:
> > 
> > static inline u8 FREQ2FBIN(int x, int y)
> > {
> > if (y)
> > return x - 2300;
> > return (x - 4800) / 5;
> > }
> 
> Thanks for your suggestion! Unfortunately in this case an inline
> function is not suitable since FREQ2FBIN() is mostly used for
> structure initialization:
> 
> static const struct ar9300_eeprom ar9300_default = {
> ...
> .calFreqPier2G = {
> FREQ2FBIN(2412, 1),
> FREQ2FBIN(2437, 1),
> FREQ2FBIN(2472, 1)
> },
> ...

Maybe it's better to remove the second argument and write
something like:

#define FREQ2FBIN(x) \
(u8)(((x) >= 2300 && (x) <= 2555) ? ((x) - 2300) : \
 ((x) >= 4800 && (x) <= 4800 + (256 * 5)) ? (((x) - 4800) / 5) : \
 __builtin_constant_p(x) ? BUILD_BUG_ON(1) : 0)
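
To see why the suggested inline function cannot replace the macro here, a
minimal standalone demo (plain C, not the driver code): a static const
initializer must be a constant expression, which a macro expansion yields but
a function call does not.

#include <stdio.h>

typedef unsigned char u8;

#define FREQ2FBIN(x, y) ((u8)((y) ? ((x) - 2300) : (((x) - 4800) / 5)))

/* Works: the macro expands to an integer constant expression. */
static const u8 cal_freq_pier_2g[] = {
	FREQ2FBIN(2412, 1),	/* 2412 - 2300 = 112 */
	FREQ2FBIN(2437, 1),
	FREQ2FBIN(2472, 1),
};

/* Replacing the macro with a 'static inline u8 freq2fbin(int x, int y)'
 * call in the initializer above would not compile in C: function calls
 * are not constant expressions. */

int main(void)
{
	for (unsigned int i = 0; i < sizeof(cal_freq_pier_2g); i++)
		printf("fbin[%u] = %u\n", i, cal_freq_pier_2g[i]);
	return 0;
}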



Re: [PATCH v1 1/5] perf/core: Define the common branch type classification

2017-04-06 Thread Jin, Yao



Argh, fix your mailer. That is unreadable.

/me reflows...


Sorry about that. I have now reconfigured the mail editor by applying the "Preformat" and 
"Fixed Width" settings in the Thunderbird client. I hope it is better now.


See, that's so much better..

Oh, so you _ARE_ adding a kernel feature? I understood you only wanted
to change perf-report.


Honestly, it's a perf-report feature. But it needs the kernel to record the branch 
type in perf_event_entry, so there is a kernel patch for that in the patch series.



WTH didn't you Cc the maintainers?


Very sorry for not Cc'ing all the maintainers on v1. I will be careful when sending the v2 
patch series.


Also, if you do this, you need to Cc the PowerPC people, since they too
implement PERF_SAMPLE_BRANCH_ bits.



I will cc linuxppc-...@lists.ozlabs.org when sending v2.

Thanks
Jin Yao




[dm-devel] [PATCH] Fix for find_lowest_key in dm-btree.c

2017-04-06 Thread Vinothkumar Raja
We are working on dm-dedup, a device-mapper dedup target that 
provides transparent data deduplication of block devices. Every write 
coming to a dm-dedup instance is deduplicated against previously written 
data.  We’ve been working on this project for several years now. The 
GitHub link is https://github.com/dmdedup. Detailed design 
and performance evaluation can be found in the following paper: 
http://www.fsl.cs.stonybrook.edu/docs/ols-dmdedup/dmdedup-ols14.pdf.

We are currently working on garbage collection for which we traverse our 
btrees from lowest key to highest key. While using find_lowest_key and 
find_highest_key, we noticed that find_lowest_key is giving incorrect 
results. While the function find_key traverses the btree correctly for 
finding the highest key, we found that there is an error in the way it 
traverses the btree for retrieving the lowest key. The find_lowest_key 
function fetches the first key of the rightmost block of the btree 
instead of fetching the first key from the leftmost block. This patch 
fixes the bug and gives us the correct result.  

Signed-off-by: Erez Zadok 
Signed-off-by: Vinothkumar Raja 
Signed-off-by: Nidhi Panpalia 

---
 drivers/md/persistent-data/dm-btree.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/md/persistent-data/dm-btree.c 
b/drivers/md/persistent-data/dm-btree.c
index 02e2ee0..83121d1 100644
--- a/drivers/md/persistent-data/dm-btree.c
+++ b/drivers/md/persistent-data/dm-btree.c
@@ -902,9 +902,12 @@ static int find_key(struct ro_spine *s, dm_block_t block, 
bool find_highest,
else
*result_key = le64_to_cpu(ro_node(s)->keys[0]);
 
-   if (next_block || flags & INTERNAL_NODE)
-   block = value64(ro_node(s), i);
-
+   if (next_block || flags & INTERNAL_NODE) {
+   if (find_highest)
+   block = value64(ro_node(s), i);
+   else
+   block = value64(ro_node(s), 0);
+   }
} while (flags & INTERNAL_NODE);
 
if (next_block)
-- 
1.8.3.1
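
For readers outside dm, a toy model of the corrected descent (an assumed
simplification, not the dm-btree code): take child 0 at every internal node
for the lowest key and the last child for the highest key; the bug was taking
the last child in both cases.

#include <stdio.h>

struct node {
	int nr_keys;
	long keys[4];
	struct node *children[4];	/* all NULL in a leaf */
};

static long find_key(const struct node *n, int find_highest)
{
	while (n->children[0])		/* internal node: keep descending */
		n = n->children[find_highest ? n->nr_keys - 1 : 0];
	return find_highest ? n->keys[n->nr_keys - 1] : n->keys[0];
}

int main(void)
{
	struct node leaf_lo = { 2, { 1, 3 }, { NULL } };
	struct node leaf_hi = { 2, { 7, 9 }, { NULL } };
	struct node root    = { 2, { 1, 7 }, { &leaf_lo, &leaf_hi } };

	printf("lowest=%ld highest=%ld\n",
	       find_key(&root, 0), find_key(&root, 1));
	return 0;
}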



Re: [PATCH] Revert "arm64: Increase the max granular size"

2017-04-06 Thread Ganesh Mahendran
2017-04-06 23:58 GMT+08:00 Catalin Marinas :
> On Thu, Apr 06, 2017 at 12:52:13PM +0530, Imran Khan wrote:
>> On 4/5/2017 10:13 AM, Imran Khan wrote:
>> >> We may have to revisit this logic and consider L1_CACHE_BYTES the
>> >> _minimum_ of cache line sizes in arm64 systems supported by the kernel.
>> >> Do you have any benchmarks on Cavium boards that would show significant
>> >> degradation with 64-byte L1_CACHE_BYTES vs 128?
>> >>
>> >> For non-coherent DMA, the simplest is to make ARCH_DMA_MINALIGN the
>> >> _maximum_ of the supported systems:
>> >>
>> >> diff --git a/arch/arm64/include/asm/cache.h 
>> >> b/arch/arm64/include/asm/cache.h
>> >> index 5082b30bc2c0..4b5d7b27edaf 100644
>> >> --- a/arch/arm64/include/asm/cache.h
>> >> +++ b/arch/arm64/include/asm/cache.h
>> >> @@ -18,17 +18,17 @@
>> >>
>> >>  #include 
>> >>
>> >> -#define L1_CACHE_SHIFT 7
>> >> +#define L1_CACHE_SHIFT 6
>> >>  #define L1_CACHE_BYTES (1 << L1_CACHE_SHIFT)
>> >>
>> >>  /*
>> >>   * Memory returned by kmalloc() may be used for DMA, so we must make
>> >> - * sure that all such allocations are cache aligned. Otherwise,
>> >> - * unrelated code may cause parts of the buffer to be read into the
>> >> - * cache before the transfer is done, causing old data to be seen by
>> >> - * the CPU.
>> >> + * sure that all such allocations are aligned to the maximum *known*
>> >> + * cache line size on ARMv8 systems. Otherwise, unrelated code may cause
>> >> + * parts of the buffer to be read into the cache before the transfer is
>> >> + * done, causing old data to be seen by the CPU.
>> >>   */
>> >> -#define ARCH_DMA_MINALIGN  L1_CACHE_BYTES
>> >> +#define ARCH_DMA_MINALIGN  (128)
>> >>
>> >>  #ifndef __ASSEMBLY__
>> >>
>> >> diff --git a/arch/arm64/kernel/cpufeature.c 
>> >> b/arch/arm64/kernel/cpufeature.c
>> >> index 392c67eb9fa6..30bafca1aebf 100644
>> >> --- a/arch/arm64/kernel/cpufeature.c
>> >> +++ b/arch/arm64/kernel/cpufeature.c
>> >> @@ -976,9 +976,9 @@ void __init setup_cpu_features(void)
>> >> if (!cwg)
>> >> pr_warn("No Cache Writeback Granule information, assuming
>> >> cache line size %d\n",
>> >> cls);
>> >> -   if (L1_CACHE_BYTES < cls)
>> >> -   pr_warn("L1_CACHE_BYTES smaller than the Cache Writeback 
>> >> Granule (%d < %d)\n",
>> >> -   L1_CACHE_BYTES, cls);
>> >> +   if (ARCH_DMA_MINALIGN < cls)
>> >> +   pr_warn("ARCH_DMA_MINALIGN smaller than the Cache 
>> >> Writeback Granule (%d < %d)\n",
>> >> +   ARCH_DMA_MINALIGN, cls);
>> >>  }
>> >>
>> >>  static bool __maybe_unused
>> >
>> > This change was discussed at: [1] but was not concluded as apparently no 
>> > one
>> > came back with test report and numbers. After including this change in our
>> > local kernel we are seeing significant throughput improvement. For example 
>> > with:
>> >
>> > iperf -c 192.168.1.181 -i 1 -w 128K -t 60
>> >
>> > The average throughput is improving by about 30% (230Mbps from 180Mbps).
>> > Could you please let us know if this change can be included in upstream 
>> > kernel.
>> >
>> > [1]: https://groups.google.com/forum/#!topic/linux.kernel/P40yDB90ePs
>>
>> Could you please provide some feedback about the above mentioned query ?
>
> Do you have an explanation on the performance variation when
> L1_CACHE_BYTES is changed? We'd need to understand how the network stack
> is affected by L1_CACHE_BYTES, in which context it uses it (is it for
> non-coherent DMA?).

The network stack uses SKB_DATA_ALIGN to align:
---
#define SKB_DATA_ALIGN(X) (((X) + (SMP_CACHE_BYTES - 1)) & \
~(SMP_CACHE_BYTES - 1))

#define SMP_CACHE_BYTES L1_CACHE_BYTES
---
I think this is the reason for the performance regression.
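
As a rough standalone illustration of the per-buffer overhead at stake
(packet lengths here are arbitrary examples, not benchmark data):

#include <stdio.h>

/* mirrors the SKB_DATA_ALIGN macro quoted above */
#define ALIGN_TO(x, a) (((x) + ((a) - 1)) & ~((unsigned long)(a) - 1))

int main(void)
{
	unsigned long lens[] = { 130, 1500, 1700 };

	for (int i = 0; i < 3; i++)
		printf("len %4lu -> 64B line: %4lu bytes, 128B line: %4lu bytes\n",
		       lens[i], ALIGN_TO(lens[i], 64), ALIGN_TO(lens[i], 128));
	return 0;
}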

>
> The Cavium guys haven't shown any numbers (IIUC) to back the
> L1_CACHE_BYTES performance improvement but I would not revert the
> original commit since ARCH_DMA_MINALIGN definitely needs to cover the
> maximum available cache line size, which is 128 for them.

how about defining L1_CACHE_SHIFT like below:
---
#ifdef CONFIG_ARM64_L1_CACHE_SHIFT
#define L1_CACHE_SHIFT CONFIG_ARM64_L1_CACHE_SHIFT
#else
#define L1_CACHE_SHIFT 7
#endif
---

Thanks

>
> --
> Catalin


Re: tty crash in tty_ldisc_receive_buf()

2017-04-06 Thread Michael Neuling

> > +   /* This probably shouldn't happen, but return 0 data processed */
> > +   if (!ldata)
> > +   return 0;
> > +
> >     while (1) {
> >     /*
> >      * When PARMRK is set, each input char may take up to 3
> > chars
> 
> Maybe your patch should look like:
> + /* This probably shouldn't happen, but return 0 data processed */
> + if (!ldata) {
> +   up_read(&tty->termios_rwsem);
> + return 0;
> +   }

Oops, nice catch.. Thanks!

That does indeed fix the problem now without the softlockup.  I'm not sure it's
the right fix, but full patch below.

Anyone see a problem with this approach? Am I just papering over a real issue?

> Maybe below patch should work:
> @@ -1668,11 +1668,12 @@ static int
>  n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
>   char *fp, int count, int flow)
>  {
> -   struct n_tty_data *ldata = tty->disc_data;
> +   struct n_tty_data *ldata;
>      int room, n, rcvd = 0, overflow;
> 
> down_read(&tty->termios_rwsem);
> 
> +   ldata = tty->disc_data;

I did try just that alone and it didn't help.

Mikey



From 75c2a0369450692946ca8cc7ac148a98deaecd2a Mon Sep 17 00:00:00 2001
From: Michael Neuling 
Date: Fri, 7 Apr 2017 11:31:02 +1000
Subject: [PATCH] tty: fix regression in flush_to_ldisc

When reiniting a tty we can end up with:

[  417.514499] Unable to handle kernel paging request for data at address 
0x2260
[  417.515361] Faulting instruction address: 0xc06fad80
cpu 0x15: Vector: 300 (Data Access) at [c0799411f890]
pc: c06fad80: n_tty_receive_buf_common+0xc0/0xbd0
lr: c06fad5c: n_tty_receive_buf_common+0x9c/0xbd0
sp: c0799411fb10
   msr: 9280b033
   dar: 2260
 dsisr: 4000
  current = 0xc079675d1e00
  paca= 0xcfb0d200   softe: 0  irq_happened: 0x01
pid   = 5, comm = kworker/u56:0
Linux version 4.11.0-rc5-next-20170405 (mikey@bml86) (gcc version 5.4.0 
20160609 (Ubuntu/IBM 5.4.0-6ubuntu1~16.04.4) ) #2 SMP Thu Apr 6 00:36:46 CDT 
2017
enter ? for help
[c0799411fbe0] c06ff968 tty_ldisc_receive_buf+0x48/0xe0
[c0799411fc10] c07009d8 tty_port_default_receive_buf+0x68/0xe0
[c0799411fc50] c06ffce4 flush_to_ldisc+0x114/0x130
[c0799411fca0] c010a0fc process_one_work+0x1ec/0x580
[c0799411fd30] c010a528 worker_thread+0x98/0x5d0
[c0799411fdc0] c011343c kthread+0x16c/0x1b0
[c0799411fe30] c000b4e8 ret_from_kernel_thread+0x5c/0x74

This is due to a NULL ptr deref of tty->disc_data.

This fixes the issue by moving the disc_data read to after we take the
semaphore, then returning 0 data processed when NULL.

Cc: [4.10+]
Signed-off-by: Michael Neuling 
---
 drivers/tty/n_tty.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/tty/n_tty.c b/drivers/tty/n_tty.c
index bdf0e6e899..a2a9832a42 100644
--- a/drivers/tty/n_tty.c
+++ b/drivers/tty/n_tty.c
@@ -1668,11 +1668,17 @@ static int
 n_tty_receive_buf_common(struct tty_struct *tty, const unsigned char *cp,
 char *fp, int count, int flow)
 {
-   struct n_tty_data *ldata = tty->disc_data;
+   struct n_tty_data *ldata;
int room, n, rcvd = 0, overflow;
 
down_read(&tty->termios_rwsem);
 
+   ldata = tty->disc_data;
+   if (!ldata) {
+   up_read(&tty->termios_rwsem);
+   return 0;
+   }
+
while (1) {
/*
 * When PARMRK is set, each input char may take up to 3 chars
-- 
2.9.3
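
For reference, the pattern the patch applies, as a standalone userspace
sketch (an assumed simplification: the kernel uses tty->termios_rwsem and the
real teardown path differs): read the shared pointer only after the lock that
serializes teardown is held, and bail out cleanly if it is already gone.

#include <pthread.h>

struct n_tty_data_like { int dummy; };

static pthread_rwlock_t termios_rwsem = PTHREAD_RWLOCK_INITIALIZER;
static struct n_tty_data_like *disc_data; /* cleared by teardown under the write lock */

static int receive_buf(void)
{
	struct n_tty_data_like *ldata;

	pthread_rwlock_rdlock(&termios_rwsem);
	ldata = disc_data;	/* read under the lock, as in the patch */
	if (!ldata) {
		pthread_rwlock_unlock(&termios_rwsem);
		return 0;	/* ldisc already gone: no data processed */
	}
	/* ... process input via ldata ... */
	pthread_rwlock_unlock(&termios_rwsem);
	return 1;
}

int main(void)
{
	return receive_buf();	/* returns 0 here since disc_data is NULL */
}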




Re: [HMM 14/16] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE

2017-04-06 Thread Jerome Glisse
On Fri, Apr 07, 2017 at 11:37:34AM +1000, Balbir Singh wrote:
> On Wed, 2017-04-05 at 16:40 -0400, Jérôme Glisse wrote:
> > This introduces a simple struct and associated helpers for device drivers
> > to use when hotplugging un-addressable device memory as ZONE_DEVICE. It
> > will find an unused physical address range and trigger memory hotplug for
> > it, which allocates and initializes struct page for the device memory.
> > 
> > Signed-off-by: Jérôme Glisse 
> > Signed-off-by: Evgeny Baskakov 
> > Signed-off-by: John Hubbard 
> > Signed-off-by: Mark Hairgrove 
> > Signed-off-by: Sherry Cheung 
> > Signed-off-by: Subhash Gutti 
> > ---
> >  include/linux/hmm.h | 114 +++
> >  mm/Kconfig  |   9 ++
> >  mm/hmm.c| 398 
> > 
> >  3 files changed, 521 insertions(+)
> > 
> > +/*
> > + * To add (hotplug) device memory, HMM assumes that there is no real 
> > resource
> > + * that reserves a range in the physical address space (this is intended 
> > to be
> > + * use by unaddressable device memory). It will reserve a physical range 
> > big
> > + * enough and allocate struct page for it.
> 
> I've found that the implementation of this is quite non-portable, in that
> starting from iomem_resource.end+1-size (which is effectively -size) on
> my platform (powerpc) does not give expected results. It could be that
> additional changes are needed to arch_add_memory() to support this
> use case.

The CDM version does not use that part. That being said, isn't -size a valid
value since we only care about unsigned arithmetic here? What is the end value
on powerpc? In any case this sounds more like an unsigned/signed arithmetic
issue; I will look into it.

> 
> > +
> > +   size = ALIGN(size, SECTION_SIZE);
> > +   addr = (iomem_resource.end + 1ULL) - size;
> 
> 
> Why don't we allocate_resource() with the right constraints and get a new
> unused region?

The issue with allocate_resource() is that it scans the resource tree
from lower addresses to higher ones. I was told that hotplug conflicts are
less likely if I pick the highest physical address for the
device memory, hence why I do my own scan from the end toward the start.

Again, none of this function applies to PPC; it can be hidden behind
an x86 config if you prefer.

Cheers,
Jérôme
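
A standalone sketch of that top-down scan (the busy ranges, section size and
top address are invented for the demo; this is not the hmm.c code): step down
from the top of the physical address space in section-sized chunks until a
range overlapping no existing resource is found.

#include <stdio.h>

#define SECTION_SIZE (1UL << 27)	/* assumed 128MB memory sections */

struct range { unsigned long start, end; };

/* stand-in for the busy children of iomem_resource */
static const struct range busy[] = {
	{ 0x00000000UL, 0x7fffffffUL },	/* e.g. system RAM */
	{ 0x90000000UL, 0x9fffffffUL },	/* e.g. a device window */
};

static int overlaps(unsigned long s, unsigned long e)
{
	for (unsigned int i = 0; i < sizeof(busy) / sizeof(busy[0]); i++)
		if (s <= busy[i].end && e >= busy[i].start)
			return 1;
	return 0;
}

int main(void)
{
	const unsigned long top = 0xffffffffUL;	/* assumed iomem_resource.end */
	const unsigned long size = SECTION_SIZE;

	/* scan from the highest section-aligned range downward */
	for (unsigned long addr = (top + 1) - size; addr >= size; addr -= size) {
		if (!overlaps(addr, addr + size - 1)) {
			printf("hole for device memory at 0x%lx\n", addr);
			return 0;
		}
	}
	return 1;
}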


Re: [PATCH v5 1/2] module: verify address is read-only

2017-04-06 Thread Jessica Yu

+++ Eddie Kovsky [05/04/17 21:35 -0600]:

Implement a mechanism to check if a module's address is in
the rodata or ro_after_init sections. It mimics the existing functions
that test if an address is inside a module's text section.

Functions that take a module as an argument will be able to verify that the
module address is in a read-only section. The idea is to prevent structures
(including modules) that are not read-only from being passed to functions.

This implements the first half of a suggestion made by Kees Cook for
the Kernel Self Protection Project:

   - provide mechanism to check for ro_after_init memory areas, and
 reject structures not marked ro_after_init in vmbus_register()

Suggested-by: Kees Cook 
Signed-off-by: Eddie Kovsky 


Acked-by: Jessica Yu 
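
The core of such a check reduces to a range test; a hedged sketch with
invented field names (the real struct module layout differs):

#include <stdbool.h>

/* invented layout: read-only data occupies [base, base + ro_after_init_size) */
struct mod_layout {
	unsigned long base;
	unsigned long ro_size;			/* covers rodata */
	unsigned long ro_after_init_size;	/* extends past ro_size */
};

static bool within(unsigned long addr, unsigned long start, unsigned long size)
{
	return addr >= start && addr < start + size;
}

/* true if addr falls in the module's read-only region */
static bool module_addr_ro(const struct mod_layout *l, unsigned long addr)
{
	return within(addr, l->base, l->ro_after_init_size);
}

int main(void)
{
	struct mod_layout l = { 0x1000, 0x400, 0x600 };

	/* 0x1500 is read-only, 0x1700 is past the ro region */
	return !(module_addr_ro(&l, 0x1500) && !module_addr_ro(&l, 0x1700));
}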



Re: [PATCH v2 0/4] zram: implement deduplication in zram

2017-04-06 Thread Joonsoo Kim
On Fri, Mar 31, 2017 at 08:40:56AM +0900, Minchan Kim wrote:
> Hi Andrew and Joonsoo,
> 
> On Thu, Mar 30, 2017 at 03:25:02PM -0700, Andrew Morton wrote:
> > On Thu, 30 Mar 2017 14:38:05 +0900 js1...@gmail.com wrote:
> > 
> > > This patchset implements a deduplication feature in zram. The motivation
> > > is to reduce memory usage by zram. There is a detailed description of the
> > > motivation and experimental results in patch #2, so please refer to it.
> > 
> > I'm getting a lot of rejects against the current -mm tree.  So this is
> > one of those patchsets which should be prepared against -mm, please.
> 
> Due to my partial IO refactoring. Sorry for the inconvenience.
> It's still on-going and I want to merge it before new feature like
> dedup.
> 
> Joonsoo,
> 
> I will send the formal partial-IO rework patch next week and start to
> review your patchset, so I hope you will resend your patchset based on it
> with my comments.

No problem. I will wait. :)

Thanks.


[PATCH v2] cgroup: move cgroup_subsys_state parent field for cache locality

2017-04-06 Thread Todd Poynor
From: Todd Poynor 

Various structures embed a struct cgroup_subsys_state, typically at
the top of the containing structure.  It is common for code that
accesses the structures to perform operations that iterate over the
chain of parent css pointers, also accessing data in each containing
structure.  In particular, struct cpuacct is used by fairly hot code
paths in the scheduler such as cpuacct_charge().

Move the parent css pointer field to the end of the structure to
increase the chances of residing in the same cache line as the data
from the containing structure.

Signed-off-by: Todd Poynor 
---
 include/linux/cgroup-defs.h | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

root_cpuacct fields .cpuusage and .css.parent show up as frequently-
accessed memory in separate cache lines (and usually the only thing
accessed in those cache lines until eviction) in armv8 simulations.
A quick search turned up struct blkcg, struct mem_cgroup, and
struct freezer as other examples using a similar struct layout and
access code.

Alternatively, we could move the parent field to the top of css and have hot
code paths use __cacheline_aligned with the hot data placed before the css...
or I'm open to suggestions, thanks.

v2 fixes subject line.

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 6a3f850cabab..53c698207ad0 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -106,9 +106,6 @@ struct cgroup_subsys_state {
/* reference count - access via css_[try]get() and css_put() */
struct percpu_ref refcnt;
 
-   /* PI: the parent css */
-   struct cgroup_subsys_state *parent;
-
/* siblings list anchored at the parent's ->children */
struct list_head sibling;
struct list_head children;
@@ -138,6 +135,12 @@ struct cgroup_subsys_state {
/* percpu_ref killing and RCU release */
struct rcu_head rcu_head;
struct work_struct destroy_work;
+
+   /*
+* PI: the parent css.  Placed here for cache proximity to following
+* fields of the containing structure.
+*/
+   struct cgroup_subsys_state *parent;
 };
 
 /*
-- 
2.12.2.715.g7642488e1d-goog
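
A quick way to see the layout effect, as a standalone toy (struct sizes are
invented, not the real css; 128-byte lines as in the armv8 simulations
mentioned above):

#include <stdio.h>
#include <stddef.h>

#define LINE 128	/* assumed cache line size */

struct css_old {	/* parent at the top, as before the patch */
	struct css_old *parent;
	char other[240];	/* stand-in for refcnt, sibling lists, ... */
};

struct css_new {	/* parent moved to the end, as in the patch */
	char other[240];
	struct css_new *parent;
};

struct cpuacct_old { struct css_old css; unsigned long long cpuusage; };
struct cpuacct_new { struct css_new css; unsigned long long cpuusage; };

int main(void)
{
	/* old layout: parent and cpuusage land in different lines */
	printf("old: parent line %zu, cpuusage line %zu\n",
	       offsetof(struct cpuacct_old, css.parent) / LINE,
	       offsetof(struct cpuacct_old, cpuusage) / LINE);
	/* new layout: parent shares a line with the hot field that follows */
	printf("new: parent line %zu, cpuusage line %zu\n",
	       offsetof(struct cpuacct_new, css.parent) / LINE,
	       offsetof(struct cpuacct_new, cpuusage) / LINE);
	return 0;
}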



[PATCH 0/5] arm64: dts: hisi: add NIC, RoCE and SAS support for hip07

2017-04-06 Thread Wei.Xu
This patch series adds Mbigen, NIC, RoCE and SAS nodes for the hip07
SoC and enables the NIC, RoCE and SAS on the hip07 d05 board.

Wei Xu (5):
  arm64: dts: hisi: add mbigen nodes for the hip07 SoC
  arm64: dts: hisi: add network related nodes for the hip07 SoC
  arm64: dts: hisi: add RoCE nodes for the hip07 SoC
  arm64: dts: hisi: add SAS nodes for the hip07 SoC
  arm64: dts: hisi: enable the NIC and SAS for the hip07-d05 board

 arch/arm64/boot/dts/hisilicon/hip07-d05.dts |  24 ++
 arch/arm64/boot/dts/hisilicon/hip07.dtsi| 479 
 2 files changed, 503 insertions(+)

-- 
1.9.1



[PATCH 1/5] arm64: dts: hisi: add mbigen nodes for the hip07 SoC

2017-04-06 Thread Wei.Xu
From: Wei Xu 

Add mbigen nodes for the hip07 SoC; these will be used
by the SAS, XGE and PCIe host controllers.

Signed-off-by: Wei Xu 
---
 arch/arm64/boot/dts/hisilicon/hip07.dtsi | 61 
 1 file changed, 61 insertions(+)

diff --git a/arch/arm64/boot/dts/hisilicon/hip07.dtsi 
b/arch/arm64/boot/dts/hisilicon/hip07.dtsi
index 5144eb1..6077def 100644
--- a/arch/arm64/boot/dts/hisilicon/hip07.dtsi
+++ b/arch/arm64/boot/dts/hisilicon/hip07.dtsi
@@ -1014,6 +1014,34 @@
compatible = "hisilicon,mbigen-v2";
reg = <0x0 0xa008 0x0 0x1>;
 
+   mbigen_pcie2_a: intc_pcie2_a {
+   msi-parent = <_its_dsa_a 0x40087>;
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   num-pins = <10>;
+   };
+
+   mbigen_sas1: intc_sas1 {
+   msi-parent = <_its_dsa_a 0x4>;
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   num-pins = <128>;
+   };
+
+   mbigen_sas2: intc_sas2 {
+   msi-parent = <_its_dsa_a 0x40040>;
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   num-pins = <128>;
+   };
+
+   mbigen_smmu_pcie: intc_smmu_pcie {
+   msi-parent = <_its_dsa_a 0x40b0c>;
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   num-pins = <3>;
+   };
+
mbigen_usb: intc_usb {
msi-parent = <_its_dsa_a 0x40080>;
interrupt-controller;
@@ -1022,6 +1050,39 @@
};
};
 
+   p0_mbigen_dsa_a: interrupt-controller@c008 {
+   compatible = "hisilicon,mbigen-v2";
+   reg = <0x0 0xc008 0x0 0x1>;
+
+   mbigen_dsaf0: intc_dsaf0 {
+   msi-parent = <_its_dsa_a 0x40800>;
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   num-pins = <409>;
+   };
+
+   mbigen_dsa_roce: intc-roce {
+   msi-parent = <_its_dsa_a 0x40B1E>;
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   num-pins = <34>;
+   };
+
+   mbigen_sas0: intc-sas0 {
+   msi-parent = <_its_dsa_a 0x40900>;
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   num-pins = <128>;
+   };
+
+   mbigen_smmu_dsa: intc_smmu_dsa {
+   msi-parent = <_its_dsa_a 0x40b20>;
+   interrupt-controller;
+   #interrupt-cells = <2>;
+   num-pins = <3>;
+   };
+   };
+
soc {
compatible = "simple-bus";
#address-cells = <2>;
-- 
1.9.1



Re: [HMM 14/16] mm/hmm/devmem: device memory hotplug using ZONE_DEVICE

2017-04-06 Thread Balbir Singh
On Wed, 2017-04-05 at 16:40 -0400, Jérôme Glisse wrote:
> This introduces a simple struct and associated helpers for device drivers
> to use when hotplugging un-addressable device memory as ZONE_DEVICE. It
> will find an unused physical address range and trigger memory hotplug for
> it, which allocates and initializes struct page for the device memory.
> 
> Signed-off-by: Jérôme Glisse 
> Signed-off-by: Evgeny Baskakov 
> Signed-off-by: John Hubbard 
> Signed-off-by: Mark Hairgrove 
> Signed-off-by: Sherry Cheung 
> Signed-off-by: Subhash Gutti 
> ---
>  include/linux/hmm.h | 114 +++
>  mm/Kconfig  |   9 ++
>  mm/hmm.c| 398 
> 
>  3 files changed, 521 insertions(+)
> 
> +/*
> + * To add (hotplug) device memory, HMM assumes that there is no real resource
> + * that reserves a range in the physical address space (this is intended to 
> be
> + * use by unaddressable device memory). It will reserve a physical range big
> + * enough and allocate struct page for it.

I've found that the implementation of this is quite non-portable, in that
starting from iomem_resource.end+1-size (which is effectively -size) on
my platform (powerpc) does not give expected results. It could be that
additional changes are needed to arch_add_memory() to support this
use case.

> +
> + size = ALIGN(size, SECTION_SIZE);
> + addr = (iomem_resource.end + 1ULL) - size;


Why don't we allocate_resource() with the right constraints and get a new
unused region?

Thanks,
Balbir


[PATCH 2/5] arm64: dts: hisi: add network related nodes for the hip07 SoC

2017-04-06 Thread Wei.Xu
From: Wei Xu 

Add MDIO, SerDes, port and related HNS nodes to support the
network on the hip07 SoC.

Signed-off-by: Wei Xu 
---
 arch/arm64/boot/dts/hisilicon/hip07.dtsi | 208 +++
 1 file changed, 208 insertions(+)

diff --git a/arch/arm64/boot/dts/hisilicon/hip07.dtsi 
b/arch/arm64/boot/dts/hisilicon/hip07.dtsi
index 6077def..2feb362 100644
--- a/arch/arm64/boot/dts/hisilicon/hip07.dtsi
+++ b/arch/arm64/boot/dts/hisilicon/hip07.dtsi
@@ -1116,5 +1116,213 @@
dma-coherent;
status = "disabled";
};
+
+   peri_c_subctrl: sub_ctrl_c@6000 {
+   compatible = "hisilicon,peri-subctrl","syscon";
+   reg = <0 0x6000 0x0 0x1>;
+   };
+
+   dsa_subctrl: dsa_subctrl@c000 {
+   compatible = "hisilicon,dsa-subctrl", "syscon";
+   reg = <0x0 0xc000 0x0 0x1>;
+   };
+
+   serdes_ctrl: sds_ctrl@c220 {
+   compatible = "syscon";
+   reg = <0 0xc220 0x0 0x8>;
+   };
+
+   mdio@603c {
+   compatible = "hisilicon,hns-mdio";
+   reg = <0x0 0x603c 0x0 0x1000>;
+   subctrl-vbase = <&peri_c_subctrl 0x338 0xa38
+0x531c 0x5a1c>;
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   phy0: ethernet-phy@0 {
+   compatible = "ethernet-phy-ieee802.3-c22";
+   reg = <0>;
+   };
+
+   phy1: ethernet-phy@1 {
+   compatible = "ethernet-phy-ieee802.3-c22";
+   reg = <1>;
+   };
+   };
+
+   dsaf0: dsa@c700 {
+   #address-cells = <1>;
+   #size-cells = <0>;
+   compatible = "hisilicon,hns-dsaf-v2";
+   mode = "6port-16rss";
+   reg = <0x0 0xc500 0x0 0x89
+  0x0 0xc700 0x0 0x60>;
+   reg-names = "ppe-base", "dsaf-base";
+   interrupt-parent = <&mbigen_dsaf0>;
+   subctrl-syscon = <&dsa_subctrl>;
+   reset-field-offset = <0>;
+   interrupts =
+   <576 1>, <577 1>, <578 1>, <579 1>, <580 1>,
+   <581 1>, <582 1>, <583 1>, <584 1>, <585 1>,
+   <586 1>, <587 1>, <588 1>, <589 1>, <590 1>,
+   <591 1>, <592 1>, <593 1>, <594 1>, <595 1>,
+   <596 1>, <597 1>, <598 1>, <599 1>, <600 1>,
+   <960 1>, <961 1>, <962 1>, <963 1>, <964 1>,
+   <965 1>, <966 1>, <967 1>, <968 1>, <969 1>,
+   <970 1>, <971 1>, <972 1>, <973 1>, <974 1>,
+   <975 1>, <976 1>, <977 1>, <978 1>, <979 1>,
+   <980 1>, <981 1>, <982 1>, <983 1>, <984 1>,
+   <985 1>, <986 1>, <987 1>, <988 1>, <989 1>,
+   <990 1>, <991 1>, <992 1>, <993 1>, <994 1>,
+   <995 1>, <996 1>, <997 1>, <998 1>, <999 1>,
+   <1000 1>, <1001 1>, <1002 1>, <1003 1>, <1004 1>,
+   <1005 1>, <1006 1>, <1007 1>, <1008 1>, <1009 1>,
+   <1010 1>, <1011 1>, <1012 1>, <1013 1>, <1014 1>,
+   <1015 1>, <1016 1>, <1017 1>, <1018 1>, <1019 1>,
+   <1020 1>, <1021 1>, <1022 1>, <1023 1>, <1024 1>,
+   <1025 1>, <1026 1>, <1027 1>, <1028 1>, <1029 1>,
+   <1030 1>, <1031 1>, <1032 1>, <1033 1>, <1034 1>,
+   <1035 1>, <1036 1>, <1037 1>, <1038 1>, <1039 1>,
+   <1040 1>, <1041 1>, <1042 1>, <1043 1>, <1044 1>,
+   <1045 1>, <1046 1>, <1047 1>, <1048 1>, <1049 1>,
+   <1050 1>, <1051 1>, <1052 1>, <1053 1>, <1054 1>,
+   <1055 1>, <1056 1>, <1057 1>, <1058 1>, <1059 1>,
+   <1060 1>, <1061 1>, <1062 1>, <1063 1>, <1064 1>,
+   <1065 1>, <1066 1>, <1067 1>, <1068 1>, <1069 1>,
+   <1070 1>, <1071 1>, <1072 1>, <1073 1>, <1074 1>,
+   <1075 1>, <1076 1>, <1077 1>, <1078 1>, <1079 1>,
+   <1080 1>, <1081 1>, <1082 1>, <1083 1>, <1084 1>,
+   <1085 1>, <1086 1>, <1087 1>, <1088 1>, <1089 1>,
+   <1090 1>, <1091 1>, <1092 1>, <1093 1>, <1094 1>,
+   <1095 1>, <1096 1>, <1097 1>, <1098 1>, <1099 1>,
+   <1100 1>, <1101 1>, <1102 1>, <1103 1>, 
