Re: [PATCH v3 1/5] dt-bindings: gpu: add bindings for the ARM Mali Midgard GPU

2017-04-19 Thread Guillaume Tucker

Hi Heiko,

On 19/04/17 10:02, Heiko Stuebner wrote:

On Wednesday, 19 April 2017, 09:06:17 CEST, Guillaume Tucker wrote:

diff --git a/Documentation/devicetree/bindings/gpu/arm,mali-midgard.txt b/Documentation/devicetree/bindings/gpu/arm,mali-midgard.txt
new file mode 100644
index 000000000000..917c4f8d178f
--- /dev/null
+++ b/Documentation/devicetree/bindings/gpu/arm,mali-midgard.txt
@@ -0,0 +1,57 @@
+ARM Mali Midgard GPU
+
+
+Required properties:
+
+- compatible :
+  * Must be one of the following:
+    + "arm,mali-t60x"
+    + "arm,mali-t62x"
+    + "arm,mali-t720"
+    + "arm,mali-t760"
+    + "arm,mali-t820"
+    + "arm,mali-t830"
+    + "arm,mali-t860"
+    + "arm,mali-t880"
+  * And, optionally, one of the vendor specific compatible:
+    + "amlogic,meson-gxm-mali"


Please add a "rockchip,rk3288-mali" as well :-) , as I don't trust that the
generic compatible will be enough for all time and having that already
defined makes fixing the per soc things later a lot easier.


Sure, will do in patch v4.


+
+- reg : Physical base address of the device and length of the register area.
+
+- interrupts : Contains the three IRQ lines required by Mali Midgard devices.
+
+- interrupt-names : Contains the names of IRQ resources in the order they were
+  provided in the interrupts property. Must contain: "job", "mmu", "gpu".
+
+
+Optional properties:
+
+- clocks : Phandle to clock for the Mali Midgard device.
+
+- mali-supply : Phandle to regulator for the Mali device. Refer to
+  Documentation/devicetree/bindings/regulator/regulator.txt for details.
+
+- operating-points : Refer to Documentation/devicetree/bindings/power/opp.txt
+  for details.


So I can simply change that to operating-points-v2.  Both
versions can be used in practice but it sounds like
operating-points can just be ignored in this binding's
documentation.  Could you please confirm?

Thanks,
Guillaume


[PATCH V4 2/7] ARM: TI: Use - instead of @ for DT OPP entries

2017-04-19 Thread Viresh Kumar
When the DT file is compiled with W=1, DTC warns as follows:

Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
unit name, but no reg property

Fix this by replacing '@' with '-' as the OPP nodes will never have a
"reg" property.

Reported-by: Krzysztof Kozlowski 
Reported-by: Masahiro Yamada 
Suggested-by: Mark Rutland 
Signed-off-by: Viresh Kumar 
Acked-by: Rob Herring 
Acked-by: Tony Lindgren 
---
 .../devicetree/bindings/cpufreq/ti-cpufreq.txt   | 20 ++--
 arch/arm/boot/dts/am4372.dtsi| 10 +-
 2 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt 
b/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt
index ba0e15ad5bd9..0c38e4b8fc51 100644
--- a/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt
+++ b/Documentation/devicetree/bindings/cpufreq/ti-cpufreq.txt
@@ -63,64 +63,64 @@ cpu0_opp_table: opp-table {
 * because they can not be enabled simultaneously on a
 * single SoC.
 */
-   opp50@3 {
+   opp50-3 {
opp-hz = /bits/ 64 <3>;
opp-microvolt = <95 931000 969000>;
opp-supported-hw = <0x06 0x0010>;
opp-suspend;
};
 
-   opp100@27500 {
+   opp100-27500 {
opp-hz = /bits/ 64 <27500>;
opp-microvolt = <110 1078000 1122000>;
opp-supported-hw = <0x01 0x00FF>;
opp-suspend;
};
 
-   opp100@3 {
+   opp100-3 {
opp-hz = /bits/ 64 <3>;
opp-microvolt = <110 1078000 1122000>;
opp-supported-hw = <0x06 0x0020>;
opp-suspend;
};
 
-   opp100@5 {
+   opp100-5 {
opp-hz = /bits/ 64 <5>;
opp-microvolt = <110 1078000 1122000>;
opp-supported-hw = <0x01 0x>;
};
 
-   opp100@6 {
+   opp100-6 {
opp-hz = /bits/ 64 <6>;
opp-microvolt = <110 1078000 1122000>;
opp-supported-hw = <0x06 0x0040>;
};
 
-   opp120@6 {
+   opp120-6 {
opp-hz = /bits/ 64 <6>;
opp-microvolt = <120 1176000 1224000>;
opp-supported-hw = <0x01 0x>;
};
 
-   opp120@72000 {
+   opp120-72000 {
opp-hz = /bits/ 64 <72000>;
opp-microvolt = <120 1176000 1224000>;
opp-supported-hw = <0x06 0x0080>;
};
 
-   oppturbo@72000 {
+   oppturbo-72000 {
opp-hz = /bits/ 64 <72000>;
opp-microvolt = <126 1234800 1285200>;
opp-supported-hw = <0x01 0x>;
};
 
-   oppturbo@8 {
+   oppturbo-8 {
opp-hz = /bits/ 64 <8>;
opp-microvolt = <126 1234800 1285200>;
opp-supported-hw = <0x06 0x0100>;
};
 
-   oppnitro@10 {
+   oppnitro-10 {
opp-hz = /bits/ 64 <10>;
opp-microvolt = <1325000 1298500 1351500>;
opp-supported-hw = <0x04 0x0200>;
diff --git a/arch/arm/boot/dts/am4372.dtsi b/arch/arm/boot/dts/am4372.dtsi
index 97fcaf415de1..1532ffe1de63 100644
--- a/arch/arm/boot/dts/am4372.dtsi
+++ b/arch/arm/boot/dts/am4372.dtsi
@@ -60,32 +60,32 @@
cpu0_opp_table: opp_table0 {
compatible = "operating-points-v2";
 
-   opp50@3 {
+   opp50-3 {
opp-hz = /bits/ 64 <3>;
opp-microvolt = <95 931000 969000>;
opp-supported-hw = <0xFF 0x01>;
opp-suspend;
};
 
-   opp100@6 {
+   opp100-6 {
opp-hz = /bits/ 64 <6>;
opp-microvolt = <110 1078000 1122000>;
opp-supported-hw = <0xFF 0x04>;
};
 
-   opp120@72000 {
+   opp120-72000 {
opp-hz = /bits/ 64 <72000>;
opp-microvolt = <120 1176000 1224000>;
opp-supported-hw = <0xFF 0x08>;
};
 
-   oppturbo@8 {
+   oppturbo-8 {
opp-hz = /bits/ 64 <8>;
opp-microvolt = <126 1234800 1285200>;
opp-supported-hw = <0xFF 0x10>;
};
 
-   oppnitro@10 {
+   oppnitro-10 {
opp-hz = /bits/ 64 <10>;
opp-microvolt = <1325000 1298500 

[PATCH 01/11] blk: remove bio_set arg from blk_queue_split()

2017-04-19 Thread NeilBrown
blk_queue_split() is always called with the last arg being q->bio_split,
where 'q' is the first arg.

Also blk_queue_split() sometimes uses the passed-in 'bs' and sometimes uses
q->bio_split.

This is inconsistent and unnecessary.  Remove the last arg and always use
q->bio_split inside blk_queue_split()
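
As a rough userspace sketch of the interface change described above (not
kernel code), the two call shapes can be compared side by side.  The struct
definitions and the split_pool_old()/split_pool_new() helpers are
hypothetical stand-ins invented for illustration; only the names
request_queue, bio_set and bio_split come from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel types; the real ones are far larger. */
struct bio_set {
	int id;
};

struct request_queue {
	struct bio_set *bio_split;
};

/*
 * Old shape: the caller hands in a bio_set, even though every caller
 * passes q->bio_split. Nothing stops a caller from passing another pool.
 */
static struct bio_set *split_pool_old(struct request_queue *q,
				      struct bio_set *bs)
{
	(void)q;	/* the queue is ignored when choosing the pool */
	return bs;
}

/* New shape: the pool is always derived from the queue itself. */
static struct bio_set *split_pool_new(struct request_queue *q)
{
	assert(q != NULL);
	return q->bio_split;
}
```

With the second shape a caller can no longer pass a bio_set that differs
from q->bio_split, which is the inconsistency the patch removes.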

Signed-off-by: NeilBrown 
---
 block/blk-core.c  |2 +-
 block/blk-merge.c |9 -
 block/blk-mq.c|2 +-
 drivers/block/drbd/drbd_req.c |2 +-
 drivers/block/pktcdvd.c   |2 +-
 drivers/block/ps3vram.c   |2 +-
 drivers/block/rsxx/dev.c  |2 +-
 drivers/block/umem.c  |2 +-
 drivers/block/zram/zram_drv.c |2 +-
 drivers/lightnvm/rrpc.c   |2 +-
 drivers/md/md.c   |2 +-
 drivers/s390/block/dcssblk.c  |2 +-
 drivers/s390/block/xpram.c|2 +-
 include/linux/blkdev.h|3 +--
 14 files changed, 17 insertions(+), 19 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 25aea293ee98..f5d64ad75b36 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1662,7 +1662,7 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
 */
blk_queue_bounce(q, &bio);
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, &bio);
 
if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
bio->bi_error = -EIO;
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 3990ae406341..d59074556703 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -202,8 +202,7 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
return do_split ? new : NULL;
 }
 
-void blk_queue_split(struct request_queue *q, struct bio **bio,
-struct bio_set *bs)
+void blk_queue_split(struct request_queue *q, struct bio **bio)
 {
struct bio *split, *res;
unsigned nsegs;
@@ -211,13 +210,13 @@ void blk_queue_split(struct request_queue *q, struct bio 
**bio,
switch (bio_op(*bio)) {
case REQ_OP_DISCARD:
case REQ_OP_SECURE_ERASE:
-   split = blk_bio_discard_split(q, *bio, bs, &nsegs);
+   split = blk_bio_discard_split(q, *bio, q->bio_split, &nsegs);
break;
case REQ_OP_WRITE_ZEROES:
-   split = blk_bio_write_zeroes_split(q, *bio, bs, &nsegs);
+   split = blk_bio_write_zeroes_split(q, *bio, q->bio_split, &nsegs);
break;
case REQ_OP_WRITE_SAME:
-   split = blk_bio_write_same_split(q, *bio, bs, &nsegs);
+   split = blk_bio_write_same_split(q, *bio, q->bio_split, &nsegs);
break;
default:
split = blk_bio_segment_split(q, *bio, q->bio_split, &nsegs);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c496692ecc5b..365cb17308e5 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1549,7 +1549,7 @@ static blk_qc_t blk_mq_make_request(struct request_queue 
*q, struct bio *bio)
return BLK_QC_T_NONE;
}
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, &bio);
 
if (!is_flush_fua && !blk_queue_nomerges(q) &&
blk_attempt_plug_merge(q, bio, _count, _queue_rq))
diff --git a/drivers/block/drbd/drbd_req.c b/drivers/block/drbd/drbd_req.c
index b5730e17b455..fa62dd8a4d46 100644
--- a/drivers/block/drbd/drbd_req.c
+++ b/drivers/block/drbd/drbd_req.c
@@ -1557,7 +1557,7 @@ blk_qc_t drbd_make_request(struct request_queue *q, 
struct bio *bio)
struct drbd_device *device = (struct drbd_device *) q->queuedata;
unsigned long start_jif;
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, &bio);
 
start_jif = jiffies;
 
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 66d846ba85a9..98394d034c29 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -2414,7 +2414,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, 
struct bio *bio)
 
blk_queue_bounce(q, &bio);
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, &bio);
 
pd = q->queuedata;
if (!pd) {
diff --git a/drivers/block/ps3vram.c b/drivers/block/ps3vram.c
index 456b4fe21559..48072c0c1010 100644
--- a/drivers/block/ps3vram.c
+++ b/drivers/block/ps3vram.c
@@ -606,7 +606,7 @@ static blk_qc_t ps3vram_make_request(struct request_queue 
*q, struct bio *bio)
 
dev_dbg(>core, "%s\n", __func__);
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, &bio);
 
spin_lock_irq(>lock);
busy = !bio_list_empty(>list);
diff --git a/drivers/block/rsxx/dev.c b/drivers/block/rsxx/dev.c
index 9c566364ac9c..01624eaefcba 100644
--- a/drivers/block/rsxx/dev.c
+++ b/drivers/block/rsxx/dev.c
@@ -151,7 +151,7 @@ static blk_qc_t rsxx_make_request(struct request_queue *q, 
struct bio *bio)
struct rsxx_bio_meta *bio_meta;
int st = -EINVAL;
 
-   blk_queue_split(q, &bio, q->bio_split);
+   blk_queue_split(q, 

[PATCH 02/11] blk: make the bioset rescue_workqueue optional.

2017-04-19 Thread NeilBrown
This patch converts bioset_create() and
bioset_create_nobvec() to not create a workqueue so
allocations will never trigger punt_bios_to_rescuer().  It
also introduces bioset_create_rescued() and
bioset_create_nobvec_rescued() which preserve the old
behaviour.

All callers of bioset_create() and bioset_create_nobvec(),
that are inside block device drivers, are converted to the
_rescued() version.

biosets used by filesystems or other top-level users do not
need rescuing as the bio can never be queued behind other
bios.  This includes fs_bio_set, blkdev_dio_pool,
btrfs_bioset, xfs_ioend_bioset, drbd_md_io_bio_set,
and one allocated by target_core_iblock.c.

biosets used by md/raid do not need rescuing as
their usage was recently audited and revised to never
risk deadlock.

It is hoped that most, if not all, of the remaining biosets
can end up being the non-rescued version.
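
A minimal userspace model of the split described above may help when
reading the diff.  It is only a sketch under stated assumptions: the
bio_set_model type and the model_*() functions are hypothetical, and a
one-byte allocation stands in for the kernel's alloc_workqueue().  It shows
a common constructor taking a flag, with a plain wrapper and a _rescued
wrapper on top of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Stand-in for struct bio_set; only the field of interest is modelled. */
struct bio_set_model {
	void *rescue_workqueue;	/* NULL when the set has no rescuer */
};

/* Common constructor: the flag decides whether a rescuer is allocated. */
static struct bio_set_model *model_create(bool create_rescue_workqueue)
{
	struct bio_set_model *bs = calloc(1, sizeof(*bs));

	if (bs == NULL)
		return NULL;
	if (!create_rescue_workqueue)
		return bs;

	/* A one-byte allocation stands in for alloc_workqueue(). */
	bs->rescue_workqueue = malloc(1);
	if (bs->rescue_workqueue == NULL) {
		free(bs);
		return NULL;
	}
	return bs;
}

/* Plain variant: no rescuer is created. */
static struct bio_set_model *model_bioset_create(void)
{
	return model_create(false);
}

/* Rescued variant: keeps the old behaviour of always having a rescuer. */
static struct bio_set_model *model_bioset_create_rescued(void)
{
	return model_create(true);
}

/* Release the set and its rescuer, if any. */
static void model_bioset_free(struct bio_set_model *bs)
{
	if (bs == NULL)
		return;
	free(bs->rescue_workqueue);
	free(bs);
}
```

In this model a caller that needs the old behaviour switches to the
_rescued variant, while every other caller gets a set whose
rescue_workqueue stays NULL.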

Signed-off-by: NeilBrown 
---
 block/bio.c   |   28 
 block/blk-core.c  |2 +-
 drivers/md/bcache/super.c |4 ++--
 drivers/md/dm-crypt.c |2 +-
 drivers/md/dm-io.c|2 +-
 drivers/md/dm.c   |5 +++--
 include/linux/bio.h   |2 ++
 7 files changed, 34 insertions(+), 11 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 888e7801c638..b8e304015dc8 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -363,6 +363,8 @@ static void punt_bios_to_rescuer(struct bio_set *bs)
struct bio_list punt, nopunt;
struct bio *bio;
 
+   if (WARN_ON_ONCE(!bs->rescue_workqueue))
+   return;
/*
 * In order to guarantee forward progress we must punt only bios that
 * were allocated from this bio_set; otherwise, if there was a bio on
@@ -474,7 +476,8 @@ struct bio *bio_alloc_bioset(gfp_t gfp_mask, unsigned int 
nr_iovecs,
 
if (current->bio_list &&
(!bio_list_empty(&current->bio_list[0]) ||
-!bio_list_empty(&current->bio_list[1])))
+!bio_list_empty(&current->bio_list[1])) &&
+   bs->rescue_workqueue)
gfp_mask &= ~__GFP_DIRECT_RECLAIM;
 
p = mempool_alloc(bs->bio_pool, gfp_mask);
@@ -1923,7 +1926,8 @@ EXPORT_SYMBOL(bioset_free);
 
 static struct bio_set *__bioset_create(unsigned int pool_size,
   unsigned int front_pad,
-  bool create_bvec_pool)
+  bool create_bvec_pool,
+  bool create_rescue_workqueue)
 {
unsigned int back_pad = BIO_INLINE_VECS * sizeof(struct bio_vec);
struct bio_set *bs;
@@ -1954,6 +1958,9 @@ static struct bio_set *__bioset_create(unsigned int 
pool_size,
goto bad;
}
 
+   if (!create_rescue_workqueue)
+   return bs;
+
bs->rescue_workqueue = alloc_workqueue("bioset", WQ_MEM_RECLAIM, 0);
if (!bs->rescue_workqueue)
goto bad;
@@ -1979,10 +1986,16 @@ static struct bio_set *__bioset_create(unsigned int 
pool_size,
  */
 struct bio_set *bioset_create(unsigned int pool_size, unsigned int front_pad)
 {
-   return __bioset_create(pool_size, front_pad, true);
+   return __bioset_create(pool_size, front_pad, true, false);
 }
 EXPORT_SYMBOL(bioset_create);
 
+struct bio_set *bioset_create_rescued(unsigned int pool_size, unsigned int 
front_pad)
+{
+   return __bioset_create(pool_size, front_pad, true, true);
+}
+EXPORT_SYMBOL(bioset_create_rescued);
+
 /**
  * bioset_create_nobvec  - Create a bio_set without bio_vec mempool
  * @pool_size: Number of bio to cache in the mempool
@@ -1994,10 +2007,17 @@ EXPORT_SYMBOL(bioset_create);
  */
 struct bio_set *bioset_create_nobvec(unsigned int pool_size, unsigned int 
front_pad)
 {
-   return __bioset_create(pool_size, front_pad, false);
+   return __bioset_create(pool_size, front_pad, false, false);
 }
 EXPORT_SYMBOL(bioset_create_nobvec);
 
+struct bio_set *bioset_create_nobvec_rescued(unsigned int pool_size,
+unsigned int front_pad)
+{
+   return __bioset_create(pool_size, front_pad, false, true);
+}
+EXPORT_SYMBOL(bioset_create_nobvec_rescued);
+
 #ifdef CONFIG_BLK_CGROUP
 
 /**
diff --git a/block/blk-core.c b/block/blk-core.c
index f5d64ad75b36..23f20cb84b2f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -728,7 +728,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, 
int node_id)
if (q->id < 0)
goto fail_q;
 
-   q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
+   q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
if (!q->bio_split)
goto fail_id;
 
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index 85e3f21c2514..6cb30792f0ed 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -786,7 +786,7 @@ static int bcache_device_init(struct 

[PATCH V4 5/7] ARM: sun8i: Use - instead of @ for DT OPP entries

2017-04-19 Thread Viresh Kumar
When the DT file is compiled with W=1, DTC warns as follows:

Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
unit name, but no reg property

Fix this by replacing '@' with '-' as the OPP nodes will never have a
"reg" property.

Reported-by: Krzysztof Kozlowski 
Reported-by: Masahiro Yamada 
Suggested-by: Mark Rutland 
Signed-off-by: Viresh Kumar 
Acked-by: Maxime Ripard 
Acked-by: Rob Herring 
---
 arch/arm/boot/dts/sun8i-a33.dtsi | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/arm/boot/dts/sun8i-a33.dtsi b/arch/arm/boot/dts/sun8i-a33.dtsi
index 306af6cadf26..a2c555d6475c 100644
--- a/arch/arm/boot/dts/sun8i-a33.dtsi
+++ b/arch/arm/boot/dts/sun8i-a33.dtsi
@@ -49,19 +49,19 @@
compatible = "operating-points-v2";
opp-shared;
 
-   opp@64800 {
+   opp-64800 {
opp-hz = /bits/ 64 <64800>;
opp-microvolt = <104>;
clock-latency-ns = <244144>; /* 8 32k periods */
};
 
-   opp@81600 {
+   opp-81600 {
opp-hz = /bits/ 64 <81600>;
opp-microvolt = <110>;
clock-latency-ns = <244144>; /* 8 32k periods */
};
 
-   opp@100800 {
+   opp-100800 {
opp-hz = /bits/ 64 <100800>;
opp-microvolt = <120>;
clock-latency-ns = <244144>; /* 8 32k periods */
-- 
2.12.0.432.g71c3a4f4ba37
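
The rename applied by this series can be sketched as a small string
helper.  This is an illustration only: opp_node_rename() is a hypothetical
function, not part of dtc or the kernel, and the node names used with it
are made up for the example.  It replaces the first '@' in a node name with
'-', so the name no longer carries a unit address.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Rewrite a node name of the form "opp@<value>" into "opp-<value>".
 * Only the first '@' is replaced, since a node name has at most one
 * unit-address separator.
 *
 * Returns 1 if the name was changed, 0 if it contained no '@'.
 */
static int opp_node_rename(char *name)
{
	char *at;

	assert(name != NULL);

	at = strchr(name, '@');
	if (at == NULL)
		return 0;

	*at = '-';
	return 1;
}
```

A name that already uses '-' is left untouched, so running the helper
twice over the same name is harmless.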



[PATCH V4 3/7] ARM: exynos: Use - instead of @ for DT OPP entries

2017-04-19 Thread Viresh Kumar
When the DT file is compiled with W=1, DTC warns as follows:

Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
unit name, but no reg property

Fix this by replacing '@' with '-' as the OPP nodes will never have a
"reg" property.

Reported-by: Krzysztof Kozlowski 
Reported-by: Masahiro Yamada 
Suggested-by: Mark Rutland 
Signed-off-by: Viresh Kumar 
Reviewed-by: Chanwoo Choi 
Reviewed-by: Krzysztof Kozlowski 
Acked-by: Rob Herring 
---
 .../devicetree/bindings/devfreq/exynos-bus.txt | 46 +++
 arch/arm/boot/dts/exynos3250.dtsi  | 46 +++
 arch/arm/boot/dts/exynos4210.dtsi  | 32 +--
 arch/arm/boot/dts/exynos4412-prime.dtsi|  4 +-
 arch/arm/boot/dts/exynos4412.dtsi  | 66 +++---
 arch/arm/boot/dts/exynos5420.dtsi  | 40 ++---
 arch/arm/boot/dts/exynos5800.dtsi  | 56 +-
 arch/arm64/boot/dts/exynos/exynos5433-bus.dtsi | 48 
 arch/arm64/boot/dts/exynos/exynos5433.dtsi | 50 
 9 files changed, 194 insertions(+), 194 deletions(-)

diff --git a/Documentation/devicetree/bindings/devfreq/exynos-bus.txt 
b/Documentation/devicetree/bindings/devfreq/exynos-bus.txt
index d085ef90d27c..f8e946471a58 100644
--- a/Documentation/devicetree/bindings/devfreq/exynos-bus.txt
+++ b/Documentation/devicetree/bindings/devfreq/exynos-bus.txt
@@ -202,23 +202,23 @@ is able to support the bus frequency for all Exynos SoCs.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@5000 {
+   opp-5000 {
opp-hz = /bits/ 64 <5000>;
opp-microvolt = <80>;
};
-   opp@1 {
+   opp-1 {
opp-hz = /bits/ 64 <1>;
opp-microvolt = <80>;
};
-   opp@13400 {
+   opp-13400 {
opp-hz = /bits/ 64 <13400>;
opp-microvolt = <80>;
};
-   opp@2 {
+   opp-2 {
opp-hz = /bits/ 64 <2>;
opp-microvolt = <825000>;
};
-   opp@4 {
+   opp-4 {
opp-hz = /bits/ 64 <4>;
opp-microvolt = <875000>;
};
@@ -292,23 +292,23 @@ is able to support the bus frequency for all Exynos SoCs.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@5000 {
+   opp-5000 {
opp-hz = /bits/ 64 <5000>;
opp-microvolt = <90>;
};
-   opp@8000 {
+   opp-8000 {
opp-hz = /bits/ 64 <8000>;
opp-microvolt = <90>;
};
-   opp@1 {
+   opp-1 {
opp-hz = /bits/ 64 <1>;
opp-microvolt = <100>;
};
-   opp@13400 {
+   opp-13400 {
opp-hz = /bits/ 64 <13400>;
opp-microvolt = <100>;
};
-   opp@2 {
+   opp-2 {
opp-hz = /bits/ 64 <2>;
opp-microvolt = <100>;
};
@@ -318,19 +318,19 @@ is able to support the bus frequency for all Exynos SoCs.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@5000 {
+   opp-5000 {
opp-hz = /bits/ 64 <5000>;
};
-   opp@8000 {
+   opp-8000 {
opp-hz = /bits/ 64 <8000>;
};
-   opp@1 {
+   opp-1 {
opp-hz = /bits/ 64 <1>;
};
-   opp@2 {
+   opp-2 {
opp-hz = /bits/ 64 <2>;
};
-   opp@4 {
+   opp-4 {
opp-hz = /bits/ 64 <4>;
};
};
@@ -339,19 +339,19 @@ is able to support the bus frequency for all Exynos SoCs.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@5000 {
+   opp-5000 {
opp-hz = /bits/ 64 <5000>;
};
-   

[PATCH V4 3/7] ARM: exynos: Use - instead of @ for DT OPP entries

2017-04-19 Thread Viresh Kumar
When the DT file is compiled with W=1, DTC warns as follows:

Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
unit name, but no reg property

Fix this by replacing '@' with '-' as the OPP nodes will never have a
"reg" property.

Reported-by: Krzysztof Kozlowski 
Reported-by: Masahiro Yamada 
Suggested-by: Mark Rutland 
Signed-off-by: Viresh Kumar 
Reviewed-by: Chanwoo Choi 
Reviewed-by: Krzysztof Kozlowski 
Acked-by: Rob Herring 
---
 .../devicetree/bindings/devfreq/exynos-bus.txt | 46 +++
 arch/arm/boot/dts/exynos3250.dtsi  | 46 +++
 arch/arm/boot/dts/exynos4210.dtsi  | 32 +--
 arch/arm/boot/dts/exynos4412-prime.dtsi|  4 +-
 arch/arm/boot/dts/exynos4412.dtsi  | 66 +++---
 arch/arm/boot/dts/exynos5420.dtsi  | 40 ++---
 arch/arm/boot/dts/exynos5800.dtsi  | 56 +-
 arch/arm64/boot/dts/exynos/exynos5433-bus.dtsi | 48 
 arch/arm64/boot/dts/exynos/exynos5433.dtsi | 50 
 9 files changed, 194 insertions(+), 194 deletions(-)

diff --git a/Documentation/devicetree/bindings/devfreq/exynos-bus.txt 
b/Documentation/devicetree/bindings/devfreq/exynos-bus.txt
index d085ef90d27c..f8e946471a58 100644
--- a/Documentation/devicetree/bindings/devfreq/exynos-bus.txt
+++ b/Documentation/devicetree/bindings/devfreq/exynos-bus.txt
@@ -202,23 +202,23 @@ is able to support the bus frequency for all Exynos SoCs.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@5000 {
+   opp-5000 {
opp-hz = /bits/ 64 <5000>;
opp-microvolt = <80>;
};
-   opp@1 {
+   opp-1 {
opp-hz = /bits/ 64 <1>;
opp-microvolt = <80>;
};
-   opp@13400 {
+   opp-13400 {
opp-hz = /bits/ 64 <13400>;
opp-microvolt = <80>;
};
-   opp@2 {
+   opp-2 {
opp-hz = /bits/ 64 <2>;
opp-microvolt = <825000>;
};
-   opp@4 {
+   opp-4 {
opp-hz = /bits/ 64 <4>;
opp-microvolt = <875000>;
};
@@ -292,23 +292,23 @@ is able to support the bus frequency for all Exynos SoCs.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@5000 {
+   opp-5000 {
opp-hz = /bits/ 64 <5000>;
opp-microvolt = <90>;
};
-   opp@8000 {
+   opp-8000 {
opp-hz = /bits/ 64 <8000>;
opp-microvolt = <90>;
};
-   opp@1 {
+   opp-1 {
opp-hz = /bits/ 64 <1>;
opp-microvolt = <100>;
};
-   opp@13400 {
+   opp-13400 {
opp-hz = /bits/ 64 <13400>;
opp-microvolt = <100>;
};
-   opp@2 {
+   opp-2 {
opp-hz = /bits/ 64 <2>;
opp-microvolt = <100>;
};
@@ -318,19 +318,19 @@ is able to support the bus frequency for all Exynos SoCs.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@5000 {
+   opp-5000 {
opp-hz = /bits/ 64 <5000>;
};
-   opp@8000 {
+   opp-8000 {
opp-hz = /bits/ 64 <8000>;
};
-   opp@1 {
+   opp-1 {
opp-hz = /bits/ 64 <1>;
};
-   opp@2 {
+   opp-2 {
opp-hz = /bits/ 64 <2>;
};
-   opp@4 {
+   opp-4 {
opp-hz = /bits/ 64 <4>;
};
};
@@ -339,19 +339,19 @@ is able to support the bus frequency for all Exynos SoCs.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@5000 {
+   opp-5000 {
opp-hz = /bits/ 64 <5000>;
};
-   opp@8000 {
+   opp-8000 {
opp-hz = /bits/ 64 <8000>;
};
-   opp@1 {
+   

Re: [PATCH v3] axon_ram: add dax_operations support

2017-04-19 Thread kbuild test robot
Hi Dan,

[auto build test WARNING on powerpc/next]
[also build test WARNING on v4.11-rc7 next-20170419]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Dan-Williams/axon_ram-add-dax_operations-support/20170420-091615
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gnu-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=powerpc 

All warnings (new ones prefixed by >>):

   arch/powerpc/sysdev/axonram.c: In function 'axon_ram_dax_direct_access':
   arch/powerpc/sysdev/axonram.c:176:31: error: implicit declaration of 
function 'dax_get_private' [-Werror=implicit-function-declaration]
 struct axon_ram_bank *bank = dax_get_private(dax_dev);
  ^~~
>> arch/powerpc/sysdev/axonram.c:176:31: warning: initialization makes pointer 
>> from integer without a cast [-Wint-conversion]
   arch/powerpc/sysdev/axonram.c: At top level:
   arch/powerpc/sysdev/axonram.c:181:21: error: variable 'axon_ram_dax_ops' has 
initializer but incomplete type
static const struct dax_operations axon_ram_dax_ops = {
^~
   arch/powerpc/sysdev/axonram.c:182:2: error: unknown field 'direct_access' 
specified in initializer
 .direct_access = axon_ram_dax_direct_access,
 ^
>> arch/powerpc/sysdev/axonram.c:182:19: warning: excess elements in struct 
>> initializer
 .direct_access = axon_ram_dax_direct_access,
  ^~
   arch/powerpc/sysdev/axonram.c:182:19: note: (near initialization for 
'axon_ram_dax_ops')
   arch/powerpc/sysdev/axonram.c: In function 'axon_ram_probe':
   arch/powerpc/sysdev/axonram.c:255:18: error: implicit declaration of 
function 'alloc_dax' [-Werror=implicit-function-declaration]
 bank->dax_dev = alloc_dax(bank, bank->disk->disk_name,
 ^
>> arch/powerpc/sysdev/axonram.c:255:16: warning: assignment makes pointer from 
>> integer without a cast [-Wint-conversion]
 bank->dax_dev = alloc_dax(bank, bank->disk->disk_name,
   ^
   arch/powerpc/sysdev/axonram.c:313:3: error: implicit declaration of function 
'kill_dax' [-Werror=implicit-function-declaration]
  kill_dax(bank->dax_dev);
  ^~~~
   arch/powerpc/sysdev/axonram.c:314:3: error: implicit declaration of function 
'put_dax' [-Werror=implicit-function-declaration]
  put_dax(bank->dax_dev);
  ^~~
   arch/powerpc/sysdev/axonram.c: At top level:
   arch/powerpc/sysdev/axonram.c:181:36: error: storage size of 
'axon_ram_dax_ops' isn't known
static const struct dax_operations axon_ram_dax_ops = {
   ^~~~
   cc1: some warnings being treated as errors

vim +176 arch/powerpc/sysdev/axonram.c

   170  };
   171  
   172  static long
   173  axon_ram_dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, 
long nr_pages,
   174 void **kaddr, pfn_t *pfn)
   175  {
 > 176  struct axon_ram_bank *bank = dax_get_private(dax_dev);
   177  
   178  return __axon_ram_direct_access(bank, pgoff, nr_pages, kaddr, 
pfn);
   179  }
   180  
   181  static const struct dax_operations axon_ram_dax_ops = {
 > 182  .direct_access = axon_ram_dax_direct_access,
   183  };
   184  
   185  /**
   186   * axon_ram_probe - probe() method for platform driver
   187   * @device: see platform_driver method
   188   */
   189  static int axon_ram_probe(struct platform_device *device)
   190  {
   191  static int axon_ram_bank_id = -1;
   192  struct axon_ram_bank *bank;
   193  struct resource resource;
   194  int rc = 0;
   195  
   196  axon_ram_bank_id++;
   197  
   198  dev_info(&device->dev, "Found memory controller on %s\n",
   199  device->dev.of_node->full_name);
   200  
   201  bank = kzalloc(sizeof(struct axon_ram_bank), GFP_KERNEL);
   202  if (bank == NULL) {
   203  dev_err(&device->dev, "Out of memory\n");
   204  rc = -ENOMEM;
   205  goto failed;
   206  }
   207  
   208  device->dev.platform_data = bank;
   209  
   210  bank->device = device;
   211  
   212  if (of_address_to_resource(device->dev.of_node, 0, &resource) != 0) {
   213  dev_err(&device->dev, "Cannot access device tree\n");
   214  rc = -EFAULT;
   215  goto failed;
 

[PATCH 04/11] block: Improvements to bounce-buffer handling

2017-04-19 Thread NeilBrown
Since commit 23688bf4f830 ("block: ensure to split after potentially
bouncing a bio") blk_queue_bounce() is called *before*
blk_queue_split().
This means that:
 1/ the comments in blk_queue_split() about bounce buffers are
irrelevant, and
 2/ a very large bio (more than BIO_MAX_PAGES) will no longer be
split before it arrives at blk_queue_bounce(), leading to the
possibility that bio_clone_bioset() will fail and a NULL
will be dereferenced.

Separately, blk_queue_bounce() shouldn't use fs_bio_set as the bio
being copied could be from the same set, and this could lead to a
deadlock.

So:
 - allocate 2 private biosets for blk_queue_bounce, one for
   splitting enormous bios and one for cloning bios.
 - add code to split a bio that exceeds BIO_MAX_PAGES.
 - Fix up the comments in blk_queue_split()

Signed-off-by: NeilBrown 
---
 block/blk-merge.c |   14 --
 block/bounce.c|   27 ++-
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index d59074556703..51c84540d3bb 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -117,17 +117,11 @@ static struct bio *blk_bio_segment_split(struct 
request_queue *q,
 * each holds at most BIO_MAX_PAGES bvecs because
 * bio_clone() can fail to allocate big bvecs.
 *
-* It should have been better to apply the limit per
-* request queue in which bio_clone() is involved,
-* instead of globally. The biggest blocker is the
-* bio_clone() in bio bounce.
+* Those drivers which will need to use bio_clone()
+* should tell us in some way.  For now, impose the
+* BIO_MAX_PAGES limit on all queues.
 *
-* If bio is splitted by this reason, we should have
-* allowed to continue bios merging, but don't do
-* that now for making the change simple.
-*
-* TODO: deal with bio bounce's bio_clone() gracefully
-* and convert the global limit into per-queue limit.
+* TODO: handle users of bio_clone() differently.
 */
if (bvecs++ >= BIO_MAX_PAGES)
goto split;
diff --git a/block/bounce.c b/block/bounce.c
index 1cb5dd3a5da1..51fb538b504d 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -26,6 +26,7 @@
 #define POOL_SIZE  64
 #define ISA_POOL_SIZE  16
 
+struct bio_set *bounce_bio_set, *bounce_bio_split;
 static mempool_t *page_pool, *isa_page_pool;
 
 #if defined(CONFIG_HIGHMEM) || defined(CONFIG_NEED_BOUNCE_POOL)
@@ -40,6 +41,14 @@ static __init int init_emergency_pool(void)
BUG_ON(!page_pool);
pr_info("pool size: %d pages\n", POOL_SIZE);
 
+   bounce_bio_set = bioset_create(BIO_POOL_SIZE, 0);
+   BUG_ON(!bounce_bio_set);
+   if (bioset_integrity_create(bounce_bio_set, BIO_POOL_SIZE))
+   BUG_ON(1);
+
+   bounce_bio_split = bioset_create_nobvec(BIO_POOL_SIZE, 0);
+   BUG_ON(!bounce_bio_split);
+
return 0;
 }
 
@@ -194,7 +203,23 @@ static void __blk_queue_bounce(struct request_queue *q, 
struct bio **bio_orig,
 
return;
 bounce:
-   bio = bio_clone_bioset(*bio_orig, GFP_NOIO, fs_bio_set);
+   if (bio_segments(*bio_orig) > BIO_MAX_PAGES) {
+   int cnt = 0;
+   int sectors = 0;
+   struct bio_vec bv;
+   struct bvec_iter iter;
+   bio_for_each_segment(bv, *bio_orig, iter) {
+   if (cnt++ < BIO_MAX_PAGES)
+   sectors += bv.bv_len >> 9;
+   else
+   break;
+   }
+   bio = bio_split(*bio_orig, sectors, GFP_NOIO, bounce_bio_split);
+   bio_chain(bio, *bio_orig);
+   generic_make_request(*bio_orig);
+   *bio_orig = bio;
+   }
+   bio = bio_clone_bioset(*bio_orig, GFP_NOIO, bounce_bio_set);
 
bio_for_each_segment_all(to, bio, i) {
struct page *page = to->bv_page;
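
The split point chosen by the new loop in __blk_queue_bounce() is the number of sectors covered by the first BIO_MAX_PAGES segments. A minimal userspace sketch of that arithmetic follows, assuming the segment lengths are available as a plain array; the function name leading_sectors and the max_segs parameter (standing in for BIO_MAX_PAGES) are illustrative and do not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model of the loop added to __blk_queue_bounce(): add up the
 * byte lengths of the first max_segs segments and return the total in
 * 512-byte sectors. seg_len[i] stands in for bv.bv_len of segment i.
 */
unsigned int leading_sectors(const unsigned int *seg_len, size_t nr_segs,
			     size_t max_segs)
{
	unsigned int sectors = 0;
	size_t i;

	assert(seg_len != NULL || nr_segs == 0);

	/* Stop at whichever comes first: end of the bio or the segment cap. */
	for (i = 0; i < nr_segs && i < max_segs; i++)
		sectors += seg_len[i] >> 9;

	return sectors;
}
```

As in the patch, each length is shifted individually, so a segment that is not a multiple of 512 bytes contributes only its whole sectors.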




[PATCH V4 6/7] ARM: uniphier: Use - instead of @ for DT OPP entries

2017-04-19 Thread Viresh Kumar
When the DT file is compiled with W=1, DTC warns as follows:

Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
unit name, but no reg property

Fix this by replacing '@' with '-' as the OPP nodes will never have a
"reg" property.

Reported-by: Krzysztof Kozlowski 
Reported-by: Masahiro Yamada 
Suggested-by: Mark Rutland 
Signed-off-by: Viresh Kumar 
Acked-by: Masahiro Yamada 
Acked-by: Rob Herring 
---
 arch/arm/boot/dts/uniphier-pro5.dtsi | 32 
 arch/arm/boot/dts/uniphier-pxs2.dtsi | 16 ++--
 arch/arm64/boot/dts/socionext/uniphier-ld11.dtsi | 14 +--
 arch/arm64/boot/dts/socionext/uniphier-ld20.dtsi | 32 
 4 files changed, 47 insertions(+), 47 deletions(-)

diff --git a/arch/arm/boot/dts/uniphier-pro5.dtsi 
b/arch/arm/boot/dts/uniphier-pro5.dtsi
index dbc5e5333163..22ef2842be3a 100644
--- a/arch/arm/boot/dts/uniphier-pro5.dtsi
+++ b/arch/arm/boot/dts/uniphier-pro5.dtsi
@@ -77,67 +77,67 @@
compatible = "operating-points-v2";
opp-shared;
 
-   opp@1 {
+   opp-1 {
opp-hz = /bits/ 64 <1>;
clock-latency-ns = <300>;
};
-   opp@116667000 {
+   opp-116667000 {
opp-hz = /bits/ 64 <116667000>;
clock-latency-ns = <300>;
};
-   opp@15000 {
+   opp-15000 {
opp-hz = /bits/ 64 <15000>;
clock-latency-ns = <300>;
};
-   opp@17500 {
+   opp-17500 {
opp-hz = /bits/ 64 <17500>;
clock-latency-ns = <300>;
};
-   opp@2 {
+   opp-2 {
opp-hz = /bits/ 64 <2>;
clock-latency-ns = <300>;
};
-   opp@24000 {
+   opp-24000 {
opp-hz = /bits/ 64 <24000>;
clock-latency-ns = <300>;
};
-   opp@3 {
+   opp-3 {
opp-hz = /bits/ 64 <3>;
clock-latency-ns = <300>;
};
-   opp@35000 {
+   opp-35000 {
opp-hz = /bits/ 64 <35000>;
clock-latency-ns = <300>;
};
-   opp@4 {
+   opp-4 {
opp-hz = /bits/ 64 <4>;
clock-latency-ns = <300>;
};
-   opp@47000 {
+   opp-47000 {
opp-hz = /bits/ 64 <47000>;
clock-latency-ns = <300>;
};
-   opp@6 {
+   opp-6 {
opp-hz = /bits/ 64 <6>;
clock-latency-ns = <300>;
};
-   opp@7 {
+   opp-7 {
opp-hz = /bits/ 64 <7>;
clock-latency-ns = <300>;
};
-   opp@8 {
+   opp-8 {
opp-hz = /bits/ 64 <8>;
clock-latency-ns = <300>;
};
-   opp@94000 {
+   opp-94000 {
opp-hz = /bits/ 64 <94000>;
clock-latency-ns = <300>;
};
-   opp@12 {
+   opp-12 {
opp-hz = /bits/ 64 <12>;
clock-latency-ns = <300>;
};
-   opp@14 {
+   opp-14 {
opp-hz = /bits/ 64 <14>;
clock-latency-ns = <300>;
};
diff --git a/arch/arm/boot/dts/uniphier-pxs2.dtsi 
b/arch/arm/boot/dts/uniphier-pxs2.dtsi
index e9e031d63c1a..acaaa2187843 100644
--- a/arch/arm/boot/dts/uniphier-pxs2.dtsi
+++ b/arch/arm/boot/dts/uniphier-pxs2.dtsi
@@ -97,35 +97,35 @@
compatible = "operating-points-v2";
opp-shared;
 
-   opp@1 {
+   opp-1 {
opp-hz = /bits/ 64 <1>;
clock-latency-ns = <300>;
};
-   opp@15000 {
+   opp-15000 {
opp-hz = /bits/ 64 <15000>;
clock-latency-ns = <300>;
};
-   opp@2 {
+   opp-2 {
opp-hz = /bits/ 64 <2>;
clock-latency-ns = <300>;

[PATCH 07/11] pktcdvd: use bio_clone_fast() instead of bio_clone()

2017-04-19 Thread NeilBrown
pktcdvd doesn't change the bi_io_vec of the clone bio,
so it is more efficient to use bio_clone_fast(), and not clone
the bi_io_vec.
This requires providing a bio_set, and it is safest to
provide a dedicated bio_set rather than sharing
fs_bio_set, which filesystems use.
This new bio_set, pkt_bio_set, can also be used for the bio_split()
call as the two allocations (bio_clone_fast, and bio_split) are
independent: neither can block a bio allocated by the other.

Signed-off-by: NeilBrown 
---
 drivers/block/pktcdvd.c |   12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index 98394d034c29..7a437b5b8804 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -98,6 +98,7 @@ static int write_congestion_on  = PKT_WRITE_CONGESTION_ON;
 static int write_congestion_off = PKT_WRITE_CONGESTION_OFF;
 static struct mutex ctl_mutex; /* Serialize open/close/setup/teardown */
 static mempool_t *psd_pool;
+static struct bio_set *pkt_bio_set;
 
 static struct class*class_pktcdvd = NULL;/* /sys/class/pktcdvd */
 static struct dentry   *pkt_debugfs_root = NULL; /* /sys/kernel/debug/pktcdvd 
*/
@@ -2310,7 +2311,7 @@ static void pkt_end_io_read_cloned(struct bio *bio)
 
 static void pkt_make_request_read(struct pktcdvd_device *pd, struct bio *bio)
 {
-   struct bio *cloned_bio = bio_clone(bio, GFP_NOIO);
+   struct bio *cloned_bio = bio_clone_fast(bio, GFP_NOIO, pkt_bio_set);
struct packet_stacked_data *psd = mempool_alloc(psd_pool, GFP_NOIO);
 
psd->pd = pd;
@@ -2455,7 +2456,7 @@ static blk_qc_t pkt_make_request(struct request_queue *q, 
struct bio *bio)
 
split = bio_split(bio, last_zone -
  bio->bi_iter.bi_sector,
- GFP_NOIO, fs_bio_set);
+ GFP_NOIO, pkt_bio_set);
bio_chain(split, bio);
} else {
split = bio;
@@ -2919,6 +2920,11 @@ static int __init pkt_init(void)
sizeof(struct packet_stacked_data));
if (!psd_pool)
return -ENOMEM;
+   pkt_bio_set = bioset_create(BIO_POOL_SIZE, 0);
+   if (!pkt_bio_set) {
+   mempool_destroy(psd_pool);
+   return -ENOMEM;
+   }
 
ret = register_blkdev(pktdev_major, DRIVER_NAME);
if (ret < 0) {
@@ -2951,6 +2957,7 @@ static int __init pkt_init(void)
unregister_blkdev(pktdev_major, DRIVER_NAME);
 out2:
mempool_destroy(psd_pool);
+   bioset_free(pkt_bio_set);
return ret;
 }
 
@@ -2964,6 +2971,7 @@ static void __exit pkt_exit(void)
 
unregister_blkdev(pktdev_major, DRIVER_NAME);
mempool_destroy(psd_pool);
+   bioset_free(pkt_bio_set);
 }
 
 MODULE_DESCRIPTION("Packet writing layer for CD/DVD drives");
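
The init and exit hunks above follow the usual pattern for a second resource: release the first one if the second allocation fails, and release both on teardown. A minimal userspace sketch of that ordering follows, assuming malloc-style allocators in place of the mempool and bioset calls; struct pkt_resources, pkt_resources_init(), pkt_resources_exit() and always_fail() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Stand-ins for psd_pool and pkt_bio_set in the driver. */
struct pkt_resources {
	void *psd_pool;
	void *bio_set;
};

/* Allocator hook so a caller can force either allocation to fail. */
typedef void *(*alloc_fn)(size_t);

int pkt_resources_init(struct pkt_resources *r, alloc_fn alloc_pool,
		       alloc_fn alloc_set)
{
	assert(r != NULL);

	r->psd_pool = alloc_pool(64);
	if (!r->psd_pool)
		return -ENOMEM;

	r->bio_set = alloc_set(64);
	if (!r->bio_set) {
		/* Second allocation failed: undo the first before returning. */
		free(r->psd_pool);
		r->psd_pool = NULL;
		return -ENOMEM;
	}
	return 0;
}

void pkt_resources_exit(struct pkt_resources *r)
{
	/* Teardown releases both resources. */
	free(r->psd_pool);
	free(r->bio_set);
	r->psd_pool = NULL;
	r->bio_set = NULL;
}

/* Test helper that simulates an allocation failure. */
void *always_fail(size_t size)
{
	(void)size;
	return NULL;
}
```

Passing always_fail as the second allocator exercises the unwind path that the patch adds to pkt_init().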




[PATCH 03/11] blk: use non-rescuing bioset for q->bio_split.

2017-04-19 Thread NeilBrown
A rescuing bioset is only useful if there might be bios from
that same bioset on the bio_list_on_stack queue at a time
when bio_alloc_bioset() is called.  This never applies to
q->bio_split.

Allocations from q->bio_split are only ever made from
blk_queue_split() which is only ever called early in each of
various make_request_fn()s.  The original bio (call this A)
is then passed to generic_make_request() and is placed on
the bio_list_on_stack queue, and the bio that was allocated
from q->bio_split (B) is processed.

The processing of this may cause other bios to be passed to
generic_make_request() or may even cause the bio B itself to
be passed, possibly after some prefix has been split off
(using some other bioset).

generic_make_request() now guarantees that all of these bios
(B and dependants) will be fully processed before the tail
of the original bio A gets handled.  None of these early bios
can possibly trigger an allocation from the original
q->bio_split as they are either too small to require
splitting or (more likely) are destined for a different queue.

The next time that the original q->bio_split might be used
by this thread is when A is processed again, as it might
still be too big to handle directly.  By this time there
cannot be any other bios allocated from q->bio_split in the
generic_make_request() queue.  So no rescuing will ever be
needed.

Signed-off-by: NeilBrown 
---
 block/blk-core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 23f20cb84b2f..f5d64ad75b36 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -728,7 +728,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
if (q->id < 0)
goto fail_q;
 
-   q->bio_split = bioset_create_rescued(BIO_POOL_SIZE, 0);
+   q->bio_split = bioset_create(BIO_POOL_SIZE, 0);
if (!q->bio_split)
goto fail_id;
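
The ordering argument in the commit message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: `toy_bio`, `toy_queue`, `toy_make_request()`, `toy_submit()` and the 4-sector limit are all hypothetical names and values invented for illustration. The `pending` array stands in for the bio_list_on_stack queue, and the model ignores any further bios that processing B might send to other queues. It records how many bios from the split pool are sitting on the pending list each time a new split allocation is made.

```c
#include <stddef.h>

#define TOY_MAX_SECTORS 4	/* hypothetical per-request limit */
#define TOY_MAX_PENDING 8

struct toy_bio {
	unsigned int sectors;
	int from_split_pool;	/* 1 if allocated from the toy q->bio_split */
};

struct toy_queue {
	struct toy_bio pending[TOY_MAX_PENDING];	/* models bio_list_on_stack */
	size_t npending;
	unsigned int split_allocs;	/* allocations made from the split pool */
	unsigned int worst_overlap;	/* most split-pool bios pending at alloc time */
	unsigned int handled_sectors;	/* sectors handed to the toy device */
};

/* Count bios on the pending list that came from the split pool. */
static unsigned int toy_split_bios_pending(const struct toy_queue *q)
{
	unsigned int n = 0;
	size_t i;

	for (i = 0; i < q->npending; i++)
		n += q->pending[i].from_split_pool;
	return n;
}

/* Toy make_request_fn: split early, requeue the tail A, handle the front B. */
static void toy_make_request(struct toy_queue *q, struct toy_bio bio)
{
	if (bio.sectors > TOY_MAX_SECTORS) {
		struct toy_bio front = { TOY_MAX_SECTORS, 1 };	/* B */
		unsigned int overlap = toy_split_bios_pending(q);

		if (overlap > q->worst_overlap)
			q->worst_overlap = overlap;
		q->split_allocs++;

		bio.sectors -= TOY_MAX_SECTORS;		/* A keeps the tail */
		q->pending[q->npending++] = bio;	/* A goes back on the list */
		bio = front;				/* carry on with B */
	}
	/* B, or a bio that was already small enough, is handled directly. */
	q->handled_sectors += bio.sectors;
}

/* Submit one bio and drain the pending list, as the submitting thread would. */
static unsigned int toy_submit(struct toy_queue *q, unsigned int sectors)
{
	struct toy_bio first = { sectors, 0 };

	q->npending = 0;
	q->split_allocs = 0;
	q->worst_overlap = 0;
	q->handled_sectors = 0;
	q->pending[q->npending++] = first;

	while (q->npending > 0)
		toy_make_request(q, q->pending[--q->npending]);

	return q->worst_overlap;
}
```

In this model only the tail A is ever placed on the pending list, and A is the original bio rather than one taken from the split pool, so the recorded overlap stays at zero however many times the bio is split. That is the property the patch relies on when it drops the rescuer.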
 




[PATCH V4 4/7] ARM: pxa: Use - instead of @ for DT OPP entries

2017-04-19 Thread Viresh Kumar
Compiling the DT file with W=1, DTC warns as follows:

Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
unit name, but no reg property

Fix this by replacing '@' with '-' as the OPP nodes will never have a
"reg" property.

Reported-by: Krzysztof Kozlowski 
Reported-by: Masahiro Yamada 
Suggested-by: Mark Rutland 
Signed-off-by: Viresh Kumar 
Acked-by: Rob Herring 
---
 arch/arm/boot/dts/pxa25x.dtsi |  8 
 arch/arm/boot/dts/pxa27x.dtsi | 14 +++---
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/arch/arm/boot/dts/pxa25x.dtsi b/arch/arm/boot/dts/pxa25x.dtsi
index f9f4726396a0..95d59be97213 100644
--- a/arch/arm/boot/dts/pxa25x.dtsi
+++ b/arch/arm/boot/dts/pxa25x.dtsi
@@ -93,22 +93,22 @@
pxa250_opp_table: opp_table0 {
compatible = "operating-points-v2";
 
-   opp@99532800 {
+   opp-99532800 {
opp-hz = /bits/ 64 <99532800>;
opp-microvolt = <100 95 165>;
clock-latency-ns = <20>;
};
-   opp@199065600 {
+   opp-199065600 {
opp-hz = /bits/ 64 <199065600>;
opp-microvolt = <100 95 165>;
clock-latency-ns = <20>;
};
-   opp@298598400 {
+   opp-298598400 {
opp-hz = /bits/ 64 <298598400>;
opp-microvolt = <110 1045000 165>;
clock-latency-ns = <20>;
};
-   opp@398131200 {
+   opp-398131200 {
opp-hz = /bits/ 64 <398131200>;
opp-microvolt = <130 1235000 165>;
clock-latency-ns = <20>;
diff --git a/arch/arm/boot/dts/pxa27x.dtsi b/arch/arm/boot/dts/pxa27x.dtsi
index e0fab48ba6fa..5f1d6da02a4c 100644
--- a/arch/arm/boot/dts/pxa27x.dtsi
+++ b/arch/arm/boot/dts/pxa27x.dtsi
@@ -141,37 +141,37 @@
pxa270_opp_table: opp_table0 {
compatible = "operating-points-v2";
 
-   opp@10400 {
+   opp-10400 {
opp-hz = /bits/ 64 <10400>;
opp-microvolt = <90 90 1705000>;
clock-latency-ns = <20>;
};
-   opp@15600 {
+   opp-15600 {
opp-hz = /bits/ 64 <15600>;
opp-microvolt = <100 100 1705000>;
clock-latency-ns = <20>;
};
-   opp@20800 {
+   opp-20800 {
opp-hz = /bits/ 64 <20800>;
opp-microvolt = <118 118 1705000>;
clock-latency-ns = <20>;
};
-   opp@31200 {
+   opp-31200 {
opp-hz = /bits/ 64 <31200>;
opp-microvolt = <125 125 1705000>;
clock-latency-ns = <20>;
};
-   opp@41600 {
+   opp-41600 {
opp-hz = /bits/ 64 <41600>;
opp-microvolt = <135 135 1705000>;
clock-latency-ns = <20>;
};
-   opp@52000 {
+   opp-52000 {
opp-hz = /bits/ 64 <52000>;
opp-microvolt = <145 145 1705000>;
clock-latency-ns = <20>;
};
-   opp@62400 {
+   opp-62400 {
opp-hz = /bits/ 64 <62400>;
opp-microvolt = <155 155 1705000>;
clock-latency-ns = <20>;
-- 
2.12.0.432.g71c3a4f4ba37



[PATCH 11/11] block: don't check for BIO_MAX_PAGES in blk_bio_segment_split()

2017-04-19 Thread NeilBrown
blk_bio_segment_split() makes sure bios have no more than
BIO_MAX_PAGES entries in the bi_io_vec.
This was done because bio_clone_bioset() (when given a
mempool bioset) could not handle larger io_vecs.

No driver uses bio_clone_bioset() any more; they all
use bio_clone_fast() if anything, and bio_clone_fast()
doesn't clone the bi_io_vec.

The main user of bio_clone_bioset() at this level
is bounce.c, and bouncing now happens before blk_bio_segment_split(),
so that is not of concern.

So remove the big helpful comment and the code.

Signed-off-by: NeilBrown 
---
 block/blk-merge.c |   16 
 1 file changed, 16 deletions(-)

diff --git a/block/blk-merge.c b/block/blk-merge.c
index e7862e9dcc39..cea544ec5d96 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -108,25 +108,9 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
bool do_split = true;
struct bio *new = NULL;
const unsigned max_sectors = get_max_io_size(q, bio);
-   unsigned bvecs = 0;
 
bio_for_each_segment(bv, bio, iter) {
/*
-* With arbitrary bio size, the incoming bio may be very
-* big. We have to split the bio into small bios so that
-* each holds at most BIO_MAX_PAGES bvecs because
-* bio_clone_bioset() can fail to allocate big bvecs.
-*
-* Those drivers which will need to use bio_clone_bioset()
-* should tell us in some way.  For now, impose the
-* BIO_MAX_PAGES limit on all queues.
-*
-* TODO: handle users of bio_clone_bioset() differently.
-*/
-   if (bvecs++ >= BIO_MAX_PAGES)
-   goto split;
-
-   /*
 * If the queue doesn't support SG gaps and adding this
 * offset would create a gap, disallow it.
 */




[PATCH 10/11] block: remove bio_clone() and all references.

2017-04-19 Thread NeilBrown
bio_clone() is no longer used.
Only bio_clone_bioset() and bio_clone_fast() remain in use.
This is for the best, as bio_clone() used fs_bio_set,
and filesystems are unlikely to want to use bio_clone().

So remove bio_clone() and all references.
This includes a fix to some incorrect documentation.

Signed-off-by: NeilBrown 
---
 Documentation/block/biodoc.txt |2 +-
 block/bio.c|2 +-
 block/blk-merge.c  |6 +++---
 drivers/md/md.c|2 +-
 include/linux/bio.h|5 -
 5 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/Documentation/block/biodoc.txt b/Documentation/block/biodoc.txt
index 01ddeaf64b0f..9490f2845f06 100644
--- a/Documentation/block/biodoc.txt
+++ b/Documentation/block/biodoc.txt
@@ -632,7 +632,7 @@ to i/o submission, if the bio fields are likely to be accessed after the
 i/o is issued (since the bio may otherwise get freed in case i/o completion
 happens in the meantime).
 
-The bio_clone() routine may be used to duplicate a bio, where the clone
+The bio_clone_fast() routine may be used to duplicate a bio, where the clone
 shares the bio_vec_list with the original bio (i.e. both point to the
 same bio_vec_list). This would typically be used for splitting i/o requests
 in lvm or md.
diff --git a/block/bio.c b/block/bio.c
index b8e304015dc8..9ef1da1830e4 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -547,7 +547,7 @@ EXPORT_SYMBOL(zero_fill_bio);
  *
  * Description:
  *   Put a reference to a  bio, either one you have gotten with
- *   bio_alloc, bio_get or bio_clone. The last put of a bio will free it.
+ *   bio_alloc, bio_get or bio_clone_*. The last put of a bio will free it.
  **/
 void bio_put(struct bio *bio)
 {
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 51c84540d3bb..e7862e9dcc39 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -115,13 +115,13 @@ static struct bio *blk_bio_segment_split(struct request_queue *q,
 * With arbitrary bio size, the incoming bio may be very
 * big. We have to split the bio into small bios so that
 * each holds at most BIO_MAX_PAGES bvecs because
-* bio_clone() can fail to allocate big bvecs.
+* bio_clone_bioset() can fail to allocate big bvecs.
 *
-* Those drivers which will need to use bio_clone()
+* Those drivers which will need to use bio_clone_bioset()
 * should tell us in some way.  For now, impose the
 * BIO_MAX_PAGES limit on all queues.
 *
-* TODO: handle users of bio_clone() differently.
+* TODO: handle users of bio_clone_bioset() differently.
 */
if (bvecs++ >= BIO_MAX_PAGES)
goto split;
diff --git a/drivers/md/md.c b/drivers/md/md.c
index de37ace40470..8b415e13d2f8 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -185,7 +185,7 @@ static int start_readonly;
 static bool create_on_open = true;
 
 /* bio_clone_mddev
- * like bio_clone, but with a local bio set
+ * like bio_clone_bioset, but with a local bio set
  */
 
 struct bio *bio_alloc_mddev(gfp_t gfp_mask, int nr_iovecs,
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 2eb8bfae5276..5227850592cf 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -394,11 +394,6 @@ static inline struct bio *bio_alloc(gfp_t gfp_mask, unsigned int nr_iovecs)
return bio_alloc_bioset(gfp_mask, nr_iovecs, fs_bio_set);
 }
 
-static inline struct bio *bio_clone(struct bio *bio, gfp_t gfp_mask)
-{
-   return bio_clone_bioset(bio, gfp_mask, fs_bio_set);
-}
-
 static inline struct bio *bio_kmalloc(gfp_t gfp_mask, unsigned int nr_iovecs)
 {
return bio_alloc_bioset(gfp_mask, nr_iovecs, NULL);




[PATCH 05/11] rbd: use bio_clone_fast() instead of bio_clone()

2017-04-19 Thread NeilBrown
bio_clone() makes a copy of the bi_io_vec, but rbd never changes that,
so there is no need for a copy.
bio_clone_fast() can be used instead, which avoids making the copy.

This requires that we provide a bio_set.  bio_clone() uses fs_bio_set,
but it isn't, in general, safe to use the same bio_set at different
levels of the stack, as that can lead to deadlocks.  As filesystems
use fs_bio_set, block devices shouldn't.

As rbd never stacks, it is safe to have a single global bio_set for
all rbd devices to use.  So allocate that when the module is
initialised, and use it with bio_clone_fast().

Signed-off-by: NeilBrown 
---
 drivers/block/rbd.c |   16 +++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 089ac4179919..48eecffc612e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -441,6 +441,8 @@ static DEFINE_SPINLOCK(rbd_client_list_lock);
 static struct kmem_cache   *rbd_img_request_cache;
 static struct kmem_cache   *rbd_obj_request_cache;
 
+static struct bio_set  *rbd_bio_clone;
+
 static int rbd_major;
 static DEFINE_IDA(rbd_dev_id_ida);
 
@@ -1362,7 +1364,7 @@ static struct bio *bio_clone_range(struct bio *bio_src,
 {
struct bio *bio;
 
-   bio = bio_clone(bio_src, gfpmask);
+   bio = bio_clone_fast(bio_src, gfpmask, rbd_bio_clone);
if (!bio)
return NULL;/* ENOMEM */
 
@@ -6342,8 +6344,16 @@ static int rbd_slab_init(void)
if (!rbd_obj_request_cache)
goto out_err;
 
+   rbd_assert(!rbd_bio_clone);
+   rbd_bio_clone = bioset_create(BIO_POOL_SIZE, 0);
+   if (!rbd_bio_clone)
+   goto out_err_clone;
+
return 0;
 
+out_err_clone:
+   kmem_cache_destroy(rbd_obj_request_cache);
+   rbd_obj_request_cache = NULL;
 out_err:
kmem_cache_destroy(rbd_img_request_cache);
rbd_img_request_cache = NULL;
@@ -6359,6 +6369,10 @@ static void rbd_slab_exit(void)
rbd_assert(rbd_img_request_cache);
kmem_cache_destroy(rbd_img_request_cache);
rbd_img_request_cache = NULL;
+
+   rbd_assert(rbd_bio_clone);
+   bioset_free(rbd_bio_clone);
+   rbd_bio_clone = NULL;
 }
 
 static int __init rbd_init(void)
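
The init and teardown ordering that the rbd patch above extends can be sketched in userspace. This is a minimal sketch under stated assumptions, not the driver's code: `toy_create()`, `toy_destroy()`, `toy_slab_init()`, `toy_slab_exit()` and the fault-injection counter are hypothetical stand-ins for kmem_cache_create(), bioset_create() and their destructors. It shows the goto-based unwinding in which each label releases exactly what was set up before the failing step.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel objects; not the real API. */
static int live_objects;	/* resources currently held */
static int created;		/* creations attempted in this init run */
static int fail_at = -1;	/* index of the creation to fail, -1 = never */

static void *toy_create(void)
{
	void *p;

	if (created++ == fail_at)
		return NULL;
	p = malloc(1);
	if (p)
		live_objects++;
	return p;
}

static void toy_destroy(void *p)
{
	if (p) {
		free(p);
		live_objects--;
	}
}

static void *img_cache, *obj_cache, *bio_clone_set;

/* Mirrors the shape of rbd_slab_init() after the patch: three resources. */
static int toy_slab_init(int fail_step)
{
	created = 0;
	fail_at = fail_step;

	img_cache = toy_create();
	if (!img_cache)
		return -ENOMEM;

	obj_cache = toy_create();
	if (!obj_cache)
		goto out_err;

	bio_clone_set = toy_create();
	if (!bio_clone_set)
		goto out_err_clone;

	return 0;

out_err_clone:
	/* the newest label undoes the newest successful step, then falls through */
	toy_destroy(obj_cache);
	obj_cache = NULL;
out_err:
	toy_destroy(img_cache);
	img_cache = NULL;
	return -ENOMEM;
}

static void toy_slab_exit(void)
{
	toy_destroy(obj_cache);
	obj_cache = NULL;
	toy_destroy(img_cache);
	img_cache = NULL;
	toy_destroy(bio_clone_set);
	bio_clone_set = NULL;
}
```

Calling `toy_slab_init(2)` makes the third creation fail, which exercises the new `out_err_clone` path; `toy_slab_init(-1)` succeeds and must be paired with `toy_slab_exit()`.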




[PATCH 09/11] bcache: use kmalloc to allocate bio in bch_data_verify()

2017-04-19 Thread NeilBrown
This function allocates a bio, then a collection
of pages.  It copes with failure.

It currently uses a mempool to allocate the bio,
but alloc_page() to allocate the pages.  These fail
in different ways, so the usage is inconsistent.

Change the bio_clone() to bio_clone_kmalloc()
so that no pool is used either for the bio or the pages.

Signed-off-by: NeilBrown 
---
 drivers/md/bcache/debug.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/md/bcache/debug.c b/drivers/md/bcache/debug.c
index 06f55056aaae..35a5a7210e51 100644
--- a/drivers/md/bcache/debug.c
+++ b/drivers/md/bcache/debug.c
@@ -110,7 +110,7 @@ void bch_data_verify(struct cached_dev *dc, struct bio *bio)
struct bio_vec bv, cbv;
struct bvec_iter iter, citer = { 0 };
 
-   check = bio_clone(bio, GFP_NOIO);
+   check = bio_clone_kmalloc(bio, GFP_NOIO);
if (!check)
return;
check->bi_opf = REQ_OP_READ;




[PATCH V4 7/7] ARM: ZTE: Use - instead of @ for DT OPP entries

2017-04-19 Thread Viresh Kumar
Compiling the DT file with W=1, DTC warns as follows:

Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
unit name, but no reg property

Fix this by replacing '@' with '-' as the OPP nodes will never have a
"reg" property.

Reported-by: Krzysztof Kozlowski 
Reported-by: Masahiro Yamada 
Suggested-by: Mark Rutland 
Signed-off-by: Viresh Kumar 
Acked-by: Rob Herring 
---
 arch/arm64/boot/dts/zte/zx296718.dtsi | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/boot/dts/zte/zx296718.dtsi b/arch/arm64/boot/dts/zte/zx296718.dtsi
index b850b2cd0adc..2c7dc69987df 100644
--- a/arch/arm64/boot/dts/zte/zx296718.dtsi
+++ b/arch/arm64/boot/dts/zte/zx296718.dtsi
@@ -118,27 +118,27 @@
compatible = "operating-points-v2";
opp-shared;
 
-   opp@5 {
+   opp-5 {
opp-hz = /bits/ 64 <5>;
clock-latency-ns = <50>;
};
 
-   opp@64800 {
+   opp-64800 {
opp-hz = /bits/ 64 <64800>;
clock-latency-ns = <50>;
};
 
-   opp@8 {
+   opp-8 {
opp-hz = /bits/ 64 <8>;
clock-latency-ns = <50>;
};
 
-   opp@10 {
+   opp-10 {
opp-hz = /bits/ 64 <10>;
clock-latency-ns = <50>;
};
 
-   opp@118800 {
+   opp-118800 {
opp-hz = /bits/ 64 <118800>;
clock-latency-ns = <50>;
};
-- 
2.12.0.432.g71c3a4f4ba37



[PATCH 06/11] drbd: use bio_clone_fast() instead of bio_clone()

2017-04-19 Thread NeilBrown
drbd does not modify the bi_io_vec of the cloned bio,
so there is no need to clone that part.  So bio_clone_fast()
is the better choice.
For bio_clone_fast() we need to specify a bio_set.
We could use fs_bio_set, which bio_clone() uses, or
drbd_md_io_bio_set, which drbd uses for metadata, but it is
generally best to avoid sharing bio_sets unless you can
be certain that there are no interdependencies.

So create a new bio_set, drbd_io_bio_set, and use bio_clone_fast().

Signed-off-by: NeilBrown 
---
 drivers/block/drbd/drbd_int.h  |3 +++
 drivers/block/drbd/drbd_main.c |9 +
 drivers/block/drbd/drbd_req.h  |2 +-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/drivers/block/drbd/drbd_int.h b/drivers/block/drbd/drbd_int.h
index d5da45bb03a6..f91982515a6b 100644
--- a/drivers/block/drbd/drbd_int.h
+++ b/drivers/block/drbd/drbd_int.h
@@ -1441,6 +1441,9 @@ extern struct bio_set *drbd_md_io_bio_set;
 /* to allocate from that set */
 extern struct bio *bio_alloc_drbd(gfp_t gfp_mask);
 
+/* And a bio_set for cloning */
+extern struct bio_set *drbd_io_bio_set;
+
 extern struct mutex resources_mutex;
 
 extern int conn_lowest_minor(struct drbd_connection *connection);
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 84455c365f57..0cc50a7ca1c8 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -128,6 +128,7 @@ mempool_t *drbd_request_mempool;
 mempool_t *drbd_ee_mempool;
 mempool_t *drbd_md_io_page_pool;
 struct bio_set *drbd_md_io_bio_set;
+struct bio_set *drbd_io_bio_set;
 
 /* I do not use a standard mempool, because:
1) I want to hand out the pre-allocated objects first.
@@ -2098,6 +2099,8 @@ static void drbd_destroy_mempools(void)
 
/* D_ASSERT(device, atomic_read(&drbd_pp_vacant)==0); */
 
+   if (drbd_io_bio_set)
+   bioset_free(drbd_io_bio_set);
if (drbd_md_io_bio_set)
bioset_free(drbd_md_io_bio_set);
if (drbd_md_io_page_pool)
@@ -2115,6 +2118,7 @@ static void drbd_destroy_mempools(void)
if (drbd_al_ext_cache)
kmem_cache_destroy(drbd_al_ext_cache);
 
+   drbd_io_bio_set  = NULL;
drbd_md_io_bio_set   = NULL;
drbd_md_io_page_pool = NULL;
drbd_ee_mempool  = NULL;
@@ -2142,6 +2146,7 @@ static int drbd_create_mempools(void)
drbd_pp_pool = NULL;
drbd_md_io_page_pool = NULL;
drbd_md_io_bio_set   = NULL;
+   drbd_io_bio_set  = NULL;
 
/* caches */
drbd_request_cache = kmem_cache_create(
@@ -2165,6 +2170,10 @@ static int drbd_create_mempools(void)
goto Enomem;
 
/* mempools */
+   drbd_io_bio_set = bioset_create(BIO_POOL_SIZE, 0);
+   if (drbd_io_bio_set == NULL)
+   goto Enomem;
+
drbd_md_io_bio_set = bioset_create(DRBD_MIN_POOL_PAGES, 0);
if (drbd_md_io_bio_set == NULL)
goto Enomem;
diff --git a/drivers/block/drbd/drbd_req.h b/drivers/block/drbd/drbd_req.h
index eb49e7f2da91..a24e870853eb 100644
--- a/drivers/block/drbd/drbd_req.h
+++ b/drivers/block/drbd/drbd_req.h
@@ -263,7 +263,7 @@ enum drbd_req_state_bits {
 static inline void drbd_req_make_private_bio(struct drbd_request *req, struct bio *bio_src)
 {
struct bio *bio;
-   bio = bio_clone(bio_src, GFP_NOIO); /* XXX cannot fail?? */
+   bio = bio_clone_fast(bio_src, GFP_NOIO, drbd_io_bio_set); /* XXX cannot fail!! */
 
req->private_bio = bio;
 




[PATCH V4 1/7] PM / OPP: Use - instead of @ for DT entries

2017-04-19 Thread Viresh Kumar
When compiling the DT file with W=1, DTC warns as follows:

Warning (unit_address_vs_reg): Node /opp_table0/opp@10 has a
unit name, but no reg property

Fix this by replacing '@' with '-' as the OPP nodes will never have a
"reg" property.

Reported-by: Krzysztof Kozlowski 
Reported-by: Masahiro Yamada 
Suggested-by: Mark Rutland 
Signed-off-by: Viresh Kumar 
Acked-by: Rob Herring 
---
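Note (illustration only, not part of the patch): the condition behind this
warning can be sketched in C roughly as below. The helper names
has_unit_address() and unit_address_vs_reg_warns() are made up for this note
and are not dtc's real functions; the sketch only assumes that a unit address
is whatever follows '@' in the node name, and it covers only the case hit
here (unit address present, no "reg").

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: a node name carries a unit address if it contains '@'. */
static bool has_unit_address(const char *node_name)
{
	return strchr(node_name, '@') != NULL;
}

/*
 * Hypothetical model of the case reported in this patch: warn when a node
 * has a unit address in its name but no "reg" property.
 */
static bool unit_address_vs_reg_warns(const char *node_name, bool has_reg)
{
	return has_unit_address(node_name) && !has_reg;
}
```

With this model a node named "opp@10" and no "reg" warns, while "opp-10"
does not, which is why renaming the nodes is enough.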
 Documentation/devicetree/bindings/opp/opp.txt | 38 +--
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/Documentation/devicetree/bindings/opp/opp.txt b/Documentation/devicetree/bindings/opp/opp.txt
index 63725498bd20..e36d261b9ba6 100644
--- a/Documentation/devicetree/bindings/opp/opp.txt
+++ b/Documentation/devicetree/bindings/opp/opp.txt
@@ -186,20 +186,20 @@ Example 1: Single cluster Dual-core ARM cortex A9, switch DVFS states together.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@10 {
+   opp-10 {
opp-hz = /bits/ 64 <10>;
opp-microvolt = <975000 97 985000>;
opp-microamp = <7>;
clock-latency-ns = <30>;
opp-suspend;
};
-   opp@11 {
+   opp-11 {
opp-hz = /bits/ 64 <11>;
opp-microvolt = <100 98 101>;
opp-microamp = <8>;
clock-latency-ns = <31>;
};
-   opp@12 {
+   opp-12 {
opp-hz = /bits/ 64 <12>;
opp-microvolt = <1025000>;
clock-latency-ns = <29>;
@@ -265,20 +265,20 @@ independently.
 * independently.
 */
 
-   opp@10 {
+   opp-10 {
opp-hz = /bits/ 64 <10>;
opp-microvolt = <975000 97 985000>;
opp-microamp = <7>;
clock-latency-ns = <30>;
opp-suspend;
};
-   opp@11 {
+   opp-11 {
opp-hz = /bits/ 64 <11>;
opp-microvolt = <100 98 101>;
opp-microamp = <8>;
clock-latency-ns = <31>;
};
-   opp@12 {
+   opp-12 {
opp-hz = /bits/ 64 <12>;
opp-microvolt = <1025000>;
opp-microamp = <9;
@@ -341,20 +341,20 @@ DVFS state together.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@10 {
+   opp-10 {
opp-hz = /bits/ 64 <10>;
opp-microvolt = <975000 97 985000>;
opp-microamp = <7>;
clock-latency-ns = <30>;
opp-suspend;
};
-   opp@11 {
+   opp-11 {
opp-hz = /bits/ 64 <11>;
opp-microvolt = <100 98 101>;
opp-microamp = <8>;
clock-latency-ns = <31>;
};
-   opp@12 {
+   opp-12 {
opp-hz = /bits/ 64 <12>;
opp-microvolt = <1025000>;
opp-microamp = <9>;
@@ -367,20 +367,20 @@ DVFS state together.
compatible = "operating-points-v2";
opp-shared;
 
-   opp@13 {
+   opp-13 {
opp-hz = /bits/ 64 <13>;
opp-microvolt = <105 1045000 1055000>;
opp-microamp = <95000>;
clock-latency-ns = <40>;
opp-suspend;
};
-   opp@14 {
+   opp-14 {
opp-hz = /bits/ 64 <14>;
opp-microvolt = <1075000>;
opp-microamp = <10>;
clock-latency-ns = <40>;
};
-   opp@15 {
+   opp-15 {
opp-hz = /bits/ 64 <15>;
opp-microvolt = <110 101 111>;
opp-microamp = <95000>;
@@ -409,7 +409,7 @@ Example 4: Handling multiple regulators
 

Re: [PATCH v3 2/5] ARM: dts: rockchip: add ARM Mali GPU node for rk3288

2017-04-19 Thread Guillaume Tucker

Hi Heiko,

On 19/04/17 09:59, Heiko Stuebner wrote:

Am Mittwoch, 19. April 2017, 09:06:18 CEST schrieb Guillaume Tucker:

Add Mali GPU device tree node for the rk3288 SoC, with devfreq
opp table.

Tested-by: Enric Balletbo i Serra 
Signed-off-by: Guillaume Tucker 
---
 arch/arm/boot/dts/rk3288.dtsi | 22 ++
 1 file changed, 22 insertions(+)

diff --git a/arch/arm/boot/dts/rk3288.dtsi b/arch/arm/boot/dts/rk3288.dtsi
index df8a0dbe9d91..187eed528f83 100644
--- a/arch/arm/boot/dts/rk3288.dtsi
+++ b/arch/arm/boot/dts/rk3288.dtsi
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -227,6 +228,27 @@
ports = <_out>, <_out>;
};

+   gpu: mali@ffa3 {


please sort nodes by address. ffa3 should be placed below hdmi@ff98
and above qos@ffaa .


Sure, will fix that in v4.


+   compatible = "arm,mali-t760", "arm,mali-midgard";


As indicated before I don't trust that a generic binding will work for
everything, so I would feel safer if we had a "rockchip,rk3288-mali" in
front for future purposes, making it a

compatible = "rockchip,rk3288-mali", "arm,mali-t760", "arm,mali-midgard";


OK, sorry I overlooked this part.  I'll add it in v4 with a
vendor compatible string in the binding documentation.


+   reg = <0xffa3 0x1>;
+   interrupts = ,
+,
+;
+   interrupt-names = "job", "mmu", "gpu";
+   clocks = < ACLK_GPU>;
+   operating-points = <
+   /* KHz uV */
+   10 95
+   20 95
+   30 100
+   40 110
+   50 120
+   60 125
+   >;


Wasn't there a wish for opp-v2 in a previous version?


Well it wasn't entirely clear to me in Rob's email whether it was
necessary to use opp-v2 now or rather if it would be a potential
option whenever opp-v2 was needed.  If operating-points (v1) are
being deprecated then I can change that in my next patch v4.
Using operating-points-v2 with the Mali driver works as far as I
can tell on rk3288 so that's not an issue.

Thanks,
Guillaume



linux-next: build failure after merge of the rcu tree

2017-04-19 Thread Stephen Rothwell
Hi Paul,

[Also reported by Michael elsewhere]

After merging the rcu tree, today's linux-next build (powerpc
pseries_le_defconfig) failed like this:

arch/powerpc/kvm/book3s_hv_rmhandlers.S: Assembler messages:
arch/powerpc/kvm/book3s_hv_rmhandlers.S:587: Error: operand out of range (0x9ff8 is not between 0x8000 and 0x7fff)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:617: Error: operand out of range (0x9d88 is not between 0x8000 and 0x7fff)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:619: Error: operand out of range (0x9dc0 is not between 0x8000 and 0x7ffc)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:643: Error: operand out of range (0x9df8 is not between 0x8000 and 0x7fff)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:650: Error: operand out of range (0x9d8c is not between 0x8000 and 0x7fff)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:1353: Error: operand out of range (0x9ff8 is not between 0x8000 and 0x7fff)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:1663: Error: operand out of range (0x9db0 is not between 0x8000 and 0x7fff)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:1665: Error: operand out of range (0x9dc8 is not between 0x8000 and 0x7ffc)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:1734: Error: operand out of range (0x9db8 is not between 0x8000 and 0x7ffc)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:1782: Error: operand out of range (0x9ff8 is not between 0x8000 and 0x7fff)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:1837: Error: operand out of range (0x9de0 is not between 0x8000 and 0x7ffc)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:1877: Error: operand out of range (0x9ff8 is not between 0x8000 and 0x7fff)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:1918: Error: operand out of range (0x9de0 is not between 0x8000 and 0x7ffc)
arch/powerpc/kvm/book3s_hv_rmhandlers.S:1943: Error: operand out of range (0xa048 is not between 0x8000 and 0x7ffc)

Caused by commit

  b2bb79507ba1 ("srcu: Parallelize callback handling")

I have left it broken for today.
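
For context, a hedged sketch of what the assembler appears to be rejecting:
the upper bounds in the messages (0x7fff and 0x7ffc) match the signed 16-bit
displacement of PowerPC D-form loads/stores and its word-aligned DS-form
variant. The function names below are invented for illustration and the
sketch says nothing about why the offsets grew, only why values such as
0x9ff8 cannot be encoded.

```c
#include <stdbool.h>

/* D-form: 16-bit signed displacement, i.e. -0x8000 .. 0x7fff. */
static bool fits_d_form(long offset)
{
	return offset >= -0x8000L && offset <= 0x7fffL;
}

/* DS-form: same field, but the low two bits must be zero (max 0x7ffc). */
static bool fits_ds_form(long offset)
{
	return fits_d_form(offset) && (offset & 3) == 0;
}
```

Every offset listed above is larger than 0x7fff, so neither form can hold it.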

-- 
Cheers,
Stephen Rothwell


[GIT PULL] Bugfixes for the Keys subsystem

2017-04-19 Thread James Morris
Please pull these patches for the Keys subsystem, which fix:

- CVE-2017-7472
- CVE-2017-6951
- CVE-2016-9604


The following changes since commit f61143c45077df4fa78e2f1ba455a00bbe1d5b8c:

  Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux (2017-04-19 17:16:18 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security.git for-linus

David Howells (2):
  KEYS: Disallow keyrings beginning with '.' to be joined as session keyrings
  KEYS: Change the name of the dead type to ".dead" to prevent user access

Eric Biggers (1):
  KEYS: fix keyctl_set_reqkey_keyring() to not leak thread keyrings

 security/keys/gc.c   |2 +-
 security/keys/keyctl.c   |   20 ++
 security/keys/process_keys.c |   44 +
 3 files changed, 39 insertions(+), 27 deletions(-)

---

commit b1be815668e2f0c6a1ebb9e5c27e3ae1bf4b9917
Author: Eric Biggers 
Date:   Wed Apr 19 17:13:02 2017 +0100

KEYS: fix keyctl_set_reqkey_keyring() to not leak thread keyrings

This fixes CVE-2017-7472.

Running the following program as an unprivileged user exhausts kernel
memory by leaking thread keyrings:

#include <keyutils.h>

int main()
{
	for (;;)
		keyctl_set_reqkey_keyring(KEY_REQKEY_DEFL_THREAD_KEYRING);
}

Fix it by only creating a new thread keyring if there wasn't one before.
To make things more consistent, make install_thread_keyring_to_cred()
and install_process_keyring_to_cred() both return 0 if the corresponding
keyring is already present.

Fixes: d84f4f992cbd ("CRED: Inaugurate COW credentials")
Cc: sta...@vger.kernel.org
Signed-off-by: Eric Biggers 
Signed-off-by: David Howells 
Signed-off-by: James Morris 
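
The idea of the fix (install only if nothing is there yet, and report success
either way) can be illustrated with a small user-space sketch. This is not
the kernel code: struct key and struct cred below are simplified stand-ins,
install_thread_keyring_sketch() and keyrings_allocated are names invented
for this note, and malloc() merely takes the place of keyring_alloc().

```c
#include <stdlib.h>

/* Simplified stand-ins; the kernel's struct key and struct cred are far richer. */
struct key { int serial; };
struct cred { struct key *thread_keyring; };

/* Counts allocations so that a leak would be visible to the caller. */
static int keyrings_allocated;

/* Returns 0 if a thread keyring is now present, -1 on allocation failure. */
static int install_thread_keyring_sketch(struct cred *new)
{
	struct key *keyring;

	/* Already present: succeed without allocating another one. */
	if (new->thread_keyring)
		return 0;

	keyring = malloc(sizeof(*keyring));
	if (!keyring)
		return -1;

	keyring->serial = ++keyrings_allocated;
	new->thread_keyring = keyring;
	return 0;
}
```

Calling the function repeatedly on the same credentials allocates once, which
is the property the reproducer above was able to violate before the fix.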

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index ab082a2..4ad3212 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -1258,8 +1258,8 @@ long keyctl_reject_key(key_serial_t id, unsigned timeout, unsigned error,
  * Read or set the default keyring in which request_key() will cache keys and
  * return the old setting.
  *
- * If a process keyring is specified then this will be created if it doesn't
- * yet exist.  The old setting will be returned if successful.
+ * If a thread or process keyring is specified then it will be created if it
+ * doesn't yet exist.  The old setting will be returned if successful.
  */
 long keyctl_set_reqkey_keyring(int reqkey_defl)
 {
@@ -1284,11 +1284,8 @@ long keyctl_set_reqkey_keyring(int reqkey_defl)
 
case KEY_REQKEY_DEFL_PROCESS_KEYRING:
ret = install_process_keyring_to_cred(new);
-   if (ret < 0) {
-   if (ret != -EEXIST)
-   goto error;
-   ret = 0;
-   }
+   if (ret < 0)
+   goto error;
goto set;
 
case KEY_REQKEY_DEFL_DEFAULT:
diff --git a/security/keys/process_keys.c b/security/keys/process_keys.c
index b6fdd22..9139b18 100644
--- a/security/keys/process_keys.c
+++ b/security/keys/process_keys.c
@@ -128,13 +128,18 @@ int install_user_keyrings(void)
 }
 
 /*
- * Install a fresh thread keyring directly to new credentials.  This keyring is
- * allowed to overrun the quota.
+ * Install a thread keyring to the given credentials struct if it didn't have
+ * one already.  This is allowed to overrun the quota.
+ *
+ * Return: 0 if a thread keyring is now present; -errno on failure.
  */
 int install_thread_keyring_to_cred(struct cred *new)
 {
struct key *keyring;
 
+   if (new->thread_keyring)
+   return 0;
+
keyring = keyring_alloc("_tid", new->uid, new->gid, new,
KEY_POS_ALL | KEY_USR_VIEW,
KEY_ALLOC_QUOTA_OVERRUN,
@@ -147,7 +152,9 @@ int install_thread_keyring_to_cred(struct cred *new)
 }
 
 /*
- * Install a fresh thread keyring, discarding the old one.
+ * Install a thread keyring to the current task if it didn't have one already.
+ *
+ * Return: 0 if a thread keyring is now present; -errno on failure.
  */
 static int install_thread_keyring(void)
 {
@@ -158,8 +165,6 @@ static int install_thread_keyring(void)
if (!new)
return -ENOMEM;
 
-   BUG_ON(new->thread_keyring);
-
ret = install_thread_keyring_to_cred(new);
if (ret < 0) {
abort_creds(new);
@@ -170,17 +175,17 @@ static int install_thread_keyring(void)
 }
 
 /*
- * Install a process keyring directly to a credentials struct.
+ * Install a process keyring to the given credentials struct if it didn't have
+ * one already.  This is allowed to overrun the quota.
  *
- * Returns -EEXIST if there was already a process keyring, 0 if one installed,
- * and other value on any other error

Re: [PATCH V4 1/9] PM / OPP: Allow OPP table to be used for power-domains

2017-04-19 Thread Viresh Kumar
On 19-04-17, 14:58, Sudeep Holla wrote:
> On 19/04/17 12:47, Viresh Kumar wrote:
> > On 18-04-17, 17:01, Sudeep Holla wrote:
> >> Understood. I would incline towards reusing regulators we that's what is
> > 
> > It can be just a regulator, but it can be anything else as well. That
> > entity may have its own clock/volt/current tunables, etc.
> > 
> >> changed behind the scene. Calling this operating performance point
> >> is misleading and doesn't align well with existing specs/features.
> > 
> > Yeah, but there are no voltage levels available here and that doesn't
> > fit as a regulator then.
> > 
> 
> We can't dismiss just based on that. We do have systems where
> performance index is mapped to clocks though it may not be 1:1 mapping.
> I am not disagreeing here, just trying to understand it better.

@Stephen: Can you answer here please ?

> >> Understood. We have exactly same thing with SCPI but it controls both
> >> frequency and voltage referred as operating points. In general, this OPP
> >> terminology is used in SCPI/ACPI/SCMI specifications as both frequency
> >> and voltage control. I am bit worried that this binding might introduce
> >> confusions on the definitions. But it can be reworded/renamed easily if
> >> required.
> > 
> > Yeah, so far we have been looking at OPPs as freq-voltage pairs ONLY
> > and that is changing. I am not sure if it going in the wrong
> > direction really. Without frequency also it is an operating point for
> > the domain. Isn't it?
> > 
> 
> Yes, I completely agree. I am not saying the direction is wrong. I am
> saying it's confusing and binding needs to be more clear.

What exactly isn't clear? (Yeah, there have been lots of emails and I
want to know what improvements you are looking for).

> On the contrary(playing devil's advocate here), we can treat all
> existing regulators alone as OPP then if you strip the voltages and
> treat it as abstract number.

But then we are going to have lots of platform specific code which
will program the actual hardware, etc. Which is all handled by the
regulator framework. Also note that the regulator core selects the
common voltage selected by all the children, while we want to select
the highest performance point here.
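
To make the "highest performance point" part concrete, here is a rough
sketch of that kind of aggregation. It is only an illustration under
assumptions: the function name aggregate_performance_state() is made up,
states are plain unsigned integers, and 0 is taken to mean "no requirement".

```c
#include <stddef.h>

/*
 * Hypothetical aggregation: the domain is set to the highest performance
 * state requested by any of its children.
 */
static unsigned int aggregate_performance_state(const unsigned int *requests,
						size_t n)
{
	unsigned int highest = 0; /* 0 stands for "no requirement" here */
	size_t i;

	for (i = 0; i < n; i++)
		if (requests[i] > highest)
			highest = requests[i];

	return highest;
}
```

So with children asking for states 2, 5 and 3, the domain would be set to 5.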

Even if we have to configure both clock and voltage for the power
domain using standard clk/regulator frameworks, OPP will work just
fine as it will do that then. So, it's not that we are bypassing the
regulator framework here. It will be used if we have the voltages
available for the power-domain's performance states.

> So if the firmware handles more than just
> regulators, I agree.

I don't know the internals of that really.

> At the same time, I would have preferred firmware
> to even abstract the frequency like ACPI CPPC.

Frequency isn't required to be configured for the cases I know, but it
can be in future implementations.

> It would be good to get
> more information on what exactly that firmware handles.

@Stephen ?

> I am just more cautious here since we are designing generic bindings and
> changing generic code, we need to understand what that firmware supports
> and how it may evolve(so that we can maintain DT compatibility)

Sure, I am fine with more discussions on it :)

> I did a brief check and wanted to check if this is SMD/RPM regulators ?

Yes, Qcom calls the external core as Resource and Power manager (RPM).

-- 
viresh


Re: [PATCH V4 1/9] PM / OPP: Allow OPP table to be used for power-domains

2017-04-19 Thread Viresh Kumar
On 19-04-17, 14:58, Sudeep Holla wrote:
> On 19/04/17 12:47, Viresh Kumar wrote:
> > On 18-04-17, 17:01, Sudeep Holla wrote:
> >> Understood. I would incline towards reusing regulators we that's what is
> > 
> > It can be just a regulator, but it can be anything else as well. That
> > entity may have its own clock/volt/current tunables, etc.
> > 
> >> changed behind the scene. Calling this operating performance point
> >> is misleading and doesn't align well with existing specs/features.
> > 
> > Yeah, but there are no voltage levels available here and that doesn't
> > fit as a regulator then.
> > 
> 
> We can't dismiss just based on that. We do have systems where
> performance index is mapped to clocks though it may not be 1:1 mapping.
> I am not disagreeing here, just trying to understand it better.

@Stephen: Can you answer here please ?

> >> Understood. We have exactly same thing with SCPI but it controls both
> >> frequency and voltage referred as operating points. In general, this OPP
> >> terminology is used in SCPI/ACPI/SCMI specifications as both frequency
> >> and voltage control. I am bit worried that this binding might introduce
> >> confusions on the definitions. But it can be reworded/renamed easily if
> >> required.
> > 
> > Yeah, so far we have been looking at OPPs as freq-voltage pairs ONLY
> > and that is changing. I am not sure if it going in the wrong
> > direction really. Without frequency also it is an operating point for
> > the domain. Isn't it?
> > 
> 
> Yes, I completely agree. I am not saying the direction is wrong. I am
> saying it's confusing and binding needs to be more clear.

What exactly isn't clear? (Yeah, there had been lots of emails and I
want to know what improvements are you looking for).

> On the contrary(playing devil's advocate here), we can treat all
> existing regulators alone as OPP then if you strip the voltages and
> treat it as abstract number.

But then we are going to have lots of platform specific code which
will program the actual hardware, etc. Which is all handled by the
regulator framework. Also note that the regulator core selects the
common voltage selected by all the children, while we want to select
the highest performance point here.

Even if we have to configure both clock and voltage for the power
domain using standard clk/regulator frameworks, OPP will work just
fine as it will do that then. So, it's not that we are bypassing the
regulator framework here. It will be used if we have the voltages
available for the power-domain's performance states.
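
To make the "select the highest performance point" rule concrete, here
is a minimal userspace sketch. It is only an illustration: struct
consumer and aggregate_performance_state() are made-up names, not
existing kernel interfaces, and real code would also need locking.

```c
#include <stddef.h>

/* Hypothetical consumer of a power domain; not a kernel structure. */
struct consumer {
	unsigned int requested_state;	/* 0 means "no requirement" */
};

/*
 * Return the highest performance state requested by any consumer,
 * or 0 when nobody has a requirement.
 */
static unsigned int aggregate_performance_state(const struct consumer *c,
						size_t count)
{
	unsigned int state = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		if (c[i].requested_state > state)
			state = c[i].requested_state;
	}

	return state;
}
```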

> So if the firmware handles more than just
> regulators, I agree.

I don't know the internals of that really.

> At the same time, I would have preferred firmware
> to even abstract the frequency like ACPI CPPC.

Frequency isn't required to be configured for the cases I know, but it
can be in future implementations.

> It would be good to get
> more information on what exactly that firmware handles.

@Stephen ?

> I am just more cautious here since we are designing generic bindings and
> changing generic code, we need to understand what that firmware supports
> and how it may evolve(so that we can maintain DT compatibility)

Sure, I am fine with more discussions on it :)

> I did a brief check and wanted to check if this is SMD/RPM regulators ?

Yes, Qcom calls the external core the Resource and Power Manager (RPM).

-- 
viresh


Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-19 Thread Tyrel Datwyler
On 04/19/2017 07:33 PM, Steven Rostedt wrote:
> On Wed, 19 Apr 2017 16:27:10 -0700
> Tyrel Datwyler  wrote:
> 
>> # echo stacktrace > /sys/kernel/debug/tracing/trace_options
>> # cat trace | grep -A6 "/pci@8002018"
> 
> Just to let you know that there are now stacktrace event triggers, where
> you don't need to stacktrace all events, you can pick and choose. And
> even filter the stack trace on specific fields of the event.

This is great, and I did figure that out this afternoon. One thing I was
still trying to determine though was whether it's possible to set these
triggers at boot? As far as I could tell I'm still limited to
"trace_options=stacktrace" as a kernel boot parameter to get the stack
for event tracepoints.

-Tyrel

> 
>  # cd /sys/kernel/debug/tracing
>  # echo "stacktrace if common_pid == $$ && reason == 3" \
>      > events/tlb/tlb_flush/trigger
> 
>  # cat trace
> bash-1103  [003] ...1  1290.100133: tlb_flush: pages:-1 
> reason:local mm shootdown (3)
> bash-1103  [003] ...2  1290.100140: 
>  => copy_process.part.39
>  => _do_fork
>  => SyS_clone
>  => do_syscall_64
>  => return_from_SYSCALL_64
> 
> -- Steve
> 



Re: [PATCH 0/5] nvme APST fixes/improvements for 4.11

2017-04-19 Thread Christoph Hellwig
On Wed, Apr 19, 2017 at 09:52:17PM -0700, Andy Lutomirski wrote:
> > I can make it so that force_apst=0 means no APST and force_apst=1 means
> > yes APST and we could try again with a quirk list for 4.12.  There's a
> > decent chance that a few more weeks with Ubuntu having APST on will
> > shake out all the problems fairly quickly.
> 
> Here's a more concrete and more sensible proposal:

Can we just have force_apst=on to force it on, force_apst=off to turn
it off, and leave it with that?  And yes, I mean the strings instead
of the weird numbers.
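
A minimal sketch of what string-valued parsing could look like, written
as plain userspace C. The enum, the function name, and the choice that
an unset value means "let the driver decide" are assumptions made for
illustration; this is not the nvme driver's actual code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tri-state; APST_DEFAULT leaves the decision to quirks. */
enum apst_mode { APST_DEFAULT, APST_FORCE_ON, APST_FORCE_OFF };

/*
 * Map a force_apst= string to a mode. Returns 0 on success and -1 for
 * an unrecognised value, leaving *mode untouched in that case.
 */
static int parse_force_apst(const char *val, enum apst_mode *mode)
{
	if (val == NULL || val[0] == '\0') {
		*mode = APST_DEFAULT;
		return 0;
	}
	if (strcmp(val, "on") == 0) {
		*mode = APST_FORCE_ON;
		return 0;
	}
	if (strcmp(val, "off") == 0) {
		*mode = APST_FORCE_OFF;
		return 0;
	}
	return -1;
}
```

Rejecting anything other than "on" and "off" keeps the numeric values
from creeping back in by accident.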


Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-19 Thread Frank Rowand
On 04/19/17 21:43, Frank Rowand wrote:
> On 04/19/17 16:27, Tyrel Datwyler wrote:
>> On 04/18/2017 06:31 PM, Michael Ellerman wrote:

< snip >

>>
>> To get that same info as far as I know is to add a dump_stack() after
>> each pr_debug.
> 
> Here is a patch that I have used.  It is not as user friendly in terms
> of human readable stack traces (though a very small user space program
> should be able to fix that).  The patch is cut and pasted into this
> email, so probably white space damaged.

< snip >

> +
> + if (node) {
> + int k;
> + int refcount = refcount_read(&node->kobj.kref.refcount);
> + pr_err("XXX get 0x%p %3d [0x%08lx 0x%08lx 0x%08lx 0x%08lx 
> 0x%08lx 0x%08lx] ",
> + node, refcount,

If this was a real patch, meant for people other than myself, the
pr_err() would instead be pr_debug().

-Frank

< snip >


[PATCH] arm: Documentation: update a path name

2017-04-19 Thread Perr Zhang
The path in the example command is out of date; the current path is
already mentioned elsewhere in the same file.

Signed-off-by: Perr Zhang 
---
 Documentation/arm/mem_alignment | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/arm/mem_alignment b/Documentation/arm/mem_alignment
index c7c7a11..6335fca 100644
--- a/Documentation/arm/mem_alignment
+++ b/Documentation/arm/mem_alignment
@@ -48,7 +48,7 @@ Note that not all combinations are supported - only values 0 through 5.
 For example, the following will turn on the warnings, but without
 fixing up or sending SIGBUS signals:
 
-   echo 1 > /proc/sys/debug/alignment
+   echo 1 > /proc/cpu/alignment
 
 You can also read the content of the same file to get statistical
 information on unaligned access occurrences plus the current mode of
-- 
1.9.3




[PATCH] arm: alignment: update comments on /proc/cpu/alignment

2017-04-19 Thread Perr Zhang
The path name in the comment is out of date, update it.

Signed-off-by: Perr Zhang 
---
 arch/arm/mm/alignment.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/arch/arm/mm/alignment.c b/arch/arm/mm/alignment.c
index 2c96190..db9c2cf 100644
--- a/arch/arm/mm/alignment.c
+++ b/arch/arm/mm/alignment.c
@@ -33,8 +33,12 @@
 
 /*
  * 32-bit misaligned trap handler (c) 1998 San Mehat (CCC) -July 1998
- * /proc/sys/debug/alignment, modified and integrated into
- * Linux 2.1 by Russell King
+ * /proc/cpu/alignment, modified and integrated into Linux 2.1 by Russell King
+ * Note:
+ * The path name may be different for very old versions of Linux,
+ * i.e, /proc/sys/debug/alignment for Linux 2.1.
+ * This was relocated because it was in conflict with sysctl(/proc/sys/)
+ * and it doesn't contain sysctl information.
  *
  * Speed optimisations and better fault handling by Russell King.
  *
@@ -985,10 +989,8 @@ static int __init noalign_setup(char *__unused)
 __setup("noalign", noalign_setup);
 
 /*
- * This needs to be done after sysctl_init, otherwise sys/ will be
- * overwritten.  Actually, this shouldn't be in sys/ at all since
- * it isn't a sysctl, and it doesn't contain sysctl information.
- * We now locate it in /proc/cpu/alignment instead.
+ * Refer to Documentation/arm/mem_alignment for
+ * usage of /proc/cpu/alignment.
  */
 static int __init alignment_init(void)
 {
-- 
1.9.3




Re: [PATCH] fs:orangefs:orangefs-debug.h: Use ARRAY_SIZE kernel macro

2017-04-19 Thread kbuild test robot
Hi Karim,

[auto build test ERROR on linus/master]
[also build test ERROR on v4.11-rc7 next-20170419]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Karim-Eshapa/fs-orangefs-orangefs-debug-h-Use-ARRAY_SIZE-kernel-macro/20170420-105518
config: x86_64-randconfig-i0-201716 (attached as .config)
compiler: gcc-4.9 (Debian 4.9.4-2) 4.9.4
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   In file included from fs/orangefs/orangefs-debug.h:15:0,
from fs/orangefs/protocol.h:336,
from fs/orangefs/acl.c:7:
   fs/orangefs/orangefs-kernel.h: In function 'is_root_handle':
>> fs/orangefs/orangefs-kernel.h:365:2: error: implicit declaration of function 'gossip_debug' [-Werror=implicit-function-declaration]
 gossip_debug(GOSSIP_DCACHE_DEBUG,
 ^
>> fs/orangefs/orangefs-kernel.h:365:15: error: 'GOSSIP_DCACHE_DEBUG' undeclared (first use in this function)
 gossip_debug(GOSSIP_DCACHE_DEBUG,
  ^
   fs/orangefs/orangefs-kernel.h:365:15: note: each undeclared identifier is reported only once for each function it appears in
   fs/orangefs/orangefs-kernel.h: In function 'match_handle':
   fs/orangefs/orangefs-kernel.h:381:15: error: 'GOSSIP_DCACHE_DEBUG' undeclared (first use in this function)
 gossip_debug(GOSSIP_DCACHE_DEBUG,
  ^
   cc1: some warnings being treated as errors

vim +/gossip_debug +365 fs/orangefs/orangefs-kernel.h

f7ab093f fs/orangefs/pvfs2-kernel.h Mike Marshall 2015-07-17  359  {
f7ab093f fs/orangefs/pvfs2-kernel.h Mike Marshall 2015-07-17  360   return get_ino_from_khandle(dentry->d_parent->d_inode);
f7ab093f fs/orangefs/pvfs2-kernel.h Mike Marshall 2015-07-17  361  }
f7ab093f fs/orangefs/pvfs2-kernel.h Mike Marshall 2015-07-17  362  
f7ab093f fs/orangefs/pvfs2-kernel.h Mike Marshall 2015-07-17  363  static inline int is_root_handle(struct inode *inode)
f7ab093f fs/orangefs/pvfs2-kernel.h Mike Marshall 2015-07-17  364  {
f7ab093f fs/orangefs/pvfs2-kernel.h Mike Marshall 2015-07-17 @365   gossip_debug(GOSSIP_DCACHE_DEBUG,
f7ab093f fs/orangefs/pvfs2-kernel.h Mike Marshall 2015-07-17  366    "%s: root handle: %pU, this handle: %pU:\n",
f7ab093f fs/orangefs/pvfs2-kernel.h Mike Marshall 2015-07-17  367    __func__,
8bb8aefd fs/orangefs/pvfs2-kernel.h Yi Liu        2015-11-24  368    _SB(inode->i_sb)->root_khandle,

:: The code at line 365 was first introduced by commit
:: f7ab093f74bf638ed98fd1115f3efa17e308bb7f Orangefs: kernel client part 1

:: TO: Mike Marshall <hub...@omnibond.com>
:: CC: Mike Marshall <hub...@omnibond.com>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation



Re: [RFC] mm/madvise: Enable (soft|hard) offline of HugeTLB pages at PGD level

2017-04-19 Thread Anshuman Khandual
On 04/19/2017 12:12 PM, Anshuman Khandual wrote:
> On 04/19/2017 11:50 AM, Aneesh Kumar K.V wrote:
>> Anshuman Khandual  writes:
>>
>>> Though migrating gigantic HugeTLB pages does not sound much like real
>>> world use case, they can be affected by memory errors. Hence migration
>>> at the PGD level HugeTLB pages should be supported just to enable soft
>>> and hard offline use cases.
>> In that case do we want to isolate the entire 16GB range? Should we
>> just dequeue the page from the hugepage pool, convert it to regular 64K
>> pages and then isolate the 64K page that had the memory error?
> Though it's a better thing to do, assuming that we can actually dequeue
> the huge page and push it to the buddy allocator as normal 64K pages
> (need to check on this as the original allocation happened from the
> memblock instead of the buddy allocator, guess it should be possible
> given that we do similar stuff during memory hot plug). In that case
> we will also have to consider the same for the PMD based HugeTLB pages
> as well or it should be only for these gigantic huge pages ?

If we look at the code inside the function soft_offline_huge_page(),
if the source huge page has been freed to the active_freelist then
we mark the *entire* hugepage as poisoned, but if the huge page has
been released back to the buddy allocator then only the page in
question is marked poisoned, not the entire huge page. This
part was added with commit a49ecbcd7 ("mm/memory-failure.c:
recheck PageHuge() after hugetlb page migrate successfully"). But
when I look at the migrate_pages() handling of huge pages, it always
calls putback_active_hugepage() after successful migration to release
the huge page back to the active list, not to the buddy allocator. I am
not sure if the second half of the 'if' block is ever executed
at all.

I am starting to wonder what's the point of releasing the huge page
to the active list in migrate_pages() when we will go and mark the
entire huge page as *poisoned* and put it in a dangling state (page->lru
pointing to itself) in which it cannot be allocated anyway.

After migrate_pages() succeeds and the source huge page is released
to the active list, we can just mark the single normal page as
poisoned, get the source page from the active list and free it to
the buddy allocator. This should take care of both PMD and PGD
based huge pages.

--
	ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
				MIGRATE_SYNC, MR_MEMORY_FAILURE);
	if (ret) {
		pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
			pfn, ret, page->flags);
		/*
		 * We know that soft_offline_huge_page() tries to migrate
		 * only one hugepage pointed to by hpage, so we need not
		 * run through the pagelist here.
		 */
		putback_active_hugepage(hpage);
		if (ret > 0)
			ret = -EIO;
	} else {
		/* overcommit hugetlb page will be freed to buddy */
		if (PageHuge(page)) {
			set_page_hwpoison_huge_page(hpage);
			dequeue_hwpoisoned_huge_page(hpage);
			num_poisoned_pages_add(1 << compound_order(hpage));
		} else {
			SetPageHWPoison(page);
			num_poisoned_pages_inc();
		}
	}
--



Re: [PATCH 0/5] nvme APST fixes/improvements for 4.11

2017-04-19 Thread Andy Lutomirski
On Wed, Apr 19, 2017 at 8:55 PM, Andy Lutomirski  wrote:
> On Wed, Apr 19, 2017 at 8:10 PM, Jens Axboe  wrote:
>> On Wed, Apr 19 2017, Andy Lutomirski wrote:
>>> Sorry for waiting so long for this.  I was waiting for feedback from
>>> Samsung, but they haven't root-caused the issue yet, and I should
>>> have just done this from the beginning.
>>>
>>> This series makes APST more debuggable and updates the quirk list.
>>> The quirks I'm aware of are:
>>>
>>>  - Samsung 950 series SSDs in Dell XPS 15 9550 and Precision 5510
>>>laptops (which are essentially the same laptop) can lose their
>>>PCIe link if they're allowed to use the deepest APST state.
>>>Samsung engineers have an affected system and are working on
>>>it.  The same exact SSDs in other machines (even an XPS 13)
>>>seem to work fine.
>>>
>>>  - One Toshiba device malfunctions if APST is used at all.
>>
>> You need to split this series in two, patches 1-3 can wait. For 4.11,
>> all we need to do is turn off APST on any device that potentially has
>> this problem.
>>
>>> One thing that improves my confidence that there aren't too many
>>> more problems with APST is that Ubuntu has backported APST to Zesty,
>>> so it's already gotten a bit of testing in a widely used (if very
>>> new) release.
>>
>> Honestly, I think the best path for 4.11 is to turn off APST by default,
>> make it opt-in instead. I don't share your optimism here, as I made
>> clear back from before we even merged this feature.
>>
>>
>
> I can make it so that force_apst=0 means no APST and force_apst=1 means
> yes APST and we could try again with a quirk list for 4.12.  There's a
> decent chance that a few more weeks with Ubuntu having APST on will
> shake out all the problems fairly quickly.

Here's a more concrete and more sensible proposal:

For 4.11:

force_apst=0: Default.  APST off on all Samsung 950-like devices
regardless of what laptop and on the Toshiba device.
force_apst=1: Use APST except where known bad.  APST deepest state
disabled on Samsung 950-like devices on XPS 15 and Precision 5510.
APST off on the Toshiba device.
force_apst=2: APST fully on regardless of any quirks.
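
Spelled out as a decision function, the 4.11 table above could look
roughly like this. It is a sketch under assumptions: the enum, the
function name and the three boolean inputs are invented for
illustration, devices matching no quirk are assumed to get full APST,
and out-of-range values are treated like 0. None of this is the
driver's real quirk handling.

```c
#include <stdbool.h>

/* Hypothetical outcomes; not the driver's real types. */
enum apst_policy { APST_OFF, APST_NO_DEEPEST, APST_FULL };

/*
 * Proposed 4.11 behaviour of force_apst, as read from the table above.
 * Values outside 0..2 fall back to the default (0); that fallback is
 * an assumption, the proposal does not say.
 */
static enum apst_policy apst_policy_4_11(int force_apst,
					 bool samsung_950_like,
					 bool xps15_or_precision5510,
					 bool broken_toshiba)
{
	switch (force_apst) {
	case 2:
		/* Fully on, regardless of any quirks. */
		return APST_FULL;
	case 1:
		/* Use APST except where known bad. */
		if (broken_toshiba)
			return APST_OFF;
		if (samsung_950_like && xps15_or_precision5510)
			return APST_NO_DEEPEST;
		return APST_FULL;
	default:
		/* Overly broad quirk: off on every possibly affected device. */
		if (samsung_950_like || broken_toshiba)
			return APST_OFF;
		return APST_FULL;
	}
}
```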

For 4.12-rc1: force_apst=0 works like force_apst=1, but we keep both
values for compatibility and in case we need to add another overly
broad quirk some day.

Would something like this make sense?


Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-19 Thread Frank Rowand
On 04/19/17 19:33, Steven Rostedt wrote:
> On Wed, 19 Apr 2017 16:27:10 -0700
> Tyrel Datwyler  wrote:
> 
>> # echo stacktrace > /sys/kernel/debug/tracing/trace_options
>> # cat trace | grep -A6 "/pci@8002018"
> 
> Just to let you know that there are now stacktrace event triggers, where
> you don't need to stacktrace all events, you can pick and choose. And
> even filter the stack trace on specific fields of the event.
> 
>  # cd /sys/kernel/debug/tracing
>  # echo "stacktrace if common_pid == $$ && reason == 3" \
>      > events/tlb/tlb_flush/trigger
> 
>  # cat trace
> bash-1103  [003] ...1  1290.100133: tlb_flush: pages:-1 
> reason:local mm shootdown (3)
> bash-1103  [003] ...2  1290.100140: 
>  => copy_process.part.39
>  => _do_fork
>  => SyS_clone
>  => do_syscall_64
>  => return_from_SYSCALL_64
> 
> -- Steve
> .
> 

Thanks for chiming in.

The power and flexibility of the trace tools is quite amazing.
I need to make room in my schedule to catch up on what has been
added in the last several years.

-Frank


[PATCH V5 7/9] PM / domain: Register for PM QOS performance notifier

2017-04-19 Thread Viresh Kumar
Some platforms have the capability to configure the performance state of
their Power Domains. The performance levels are identified by positive
integer values; a lower value represents a lower performance state. The
power domain driver should be able to retrieve all information required
to configure the performance state of the power domain, with the help of
the performance constraint's target value.

This patch implements performance state management in PM domain core.
The performance QOS uses the common QOS notifier list and we call
__performance_notifier() if the notifier is issued for performance
constraint.

This also allows the power domain drivers to implement a
->set_performance_state() callback, which will be called by the power
domain core from within the notifier routine. If a domain doesn't
implement ->set_performance_state() callback, then it is assumed that
its parents are responsible for performance state configuration. Both
devices and sub-domains are accounted for while finding the highest
performance state requested.
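
To make the aggregation rule above concrete, here is a minimal userspace sketch of it; it is not the genpd code. `struct domain`, `update_performance_state()` and `record_hw_state()` are hypothetical stand-ins, locking is left out, and each domain has at most one parent here, whereas the patch walks a list of parent links.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical, simplified stand-in for struct generic_pm_domain. */
struct domain {
	unsigned int performance_state;          /* last aggregated state */
	unsigned int dev_states[MAX_CHILDREN];   /* per-device requests */
	size_t nr_devs;
	struct domain *subdomains[MAX_CHILDREN];
	size_t nr_subdomains;
	struct domain *parent;                   /* NULL for a root domain */
	/* Optional hook; NULL means "leave it to the parent". */
	void (*set_performance_state)(struct domain *d, unsigned int state);
	unsigned int hw_state;                   /* value the hook last saw */
};

/* Example hook: just remember what would be programmed into hardware. */
void record_hw_state(struct domain *d, unsigned int state)
{
	d->hw_state = state;
}

void update_performance_state(struct domain *d)
{
	unsigned int state = 0;
	size_t i;

	assert(d != NULL);

	/* Highest request among the devices in this domain. */
	for (i = 0; i < d->nr_devs; i++)
		if (d->dev_states[i] > state)
			state = d->dev_states[i];

	/* Sub-domains count too, via their already aggregated state. */
	for (i = 0; i < d->nr_subdomains; i++)
		if (d->subdomains[i]->performance_state > state)
			state = d->subdomains[i]->performance_state;

	if (d->performance_state == state)
		return;		/* nothing changed, nothing to propagate */

	d->performance_state = state;

	if (d->set_performance_state) {
		d->set_performance_state(d, state);
		return;
	}

	/* No hook: the parent is assumed to be responsible. */
	if (d->parent)
		update_performance_state(d->parent);
}
```

With a child domain that has no hook and a parent that has one, a request of 5 on a child device ends up in the parent's hook; lowering that request later propagates the new, lower maximum the same way.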

Signed-off-by: Viresh Kumar 
---
V4->V5:
- drop "notifier" from dev_pm_qos_notifier_is_performance

 drivers/base/power/domain.c | 77 +
 include/linux/pm_domain.h   |  4 +++
 2 files changed, 81 insertions(+)

diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index f6f616ac5cc2..7d35dafe8c97 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -462,6 +462,79 @@ static int genpd_latency_notifier(struct generic_pm_domain_data *gpd_data,
return NOTIFY_DONE;
 }
 
+static void __update_domain_performance_state(struct generic_pm_domain *genpd,
+ int depth)
+{
+   struct generic_pm_domain_data *pd_data;
+   struct generic_pm_domain *subdomain;
+   struct pm_domain_data *pdd;
+   unsigned int state = 0;
+   struct gpd_link *link;
+
+   /* Traverse all devices within the domain */
+   list_for_each_entry(pdd, &genpd->dev_list, list_node) {
+   pd_data = to_gpd_data(pdd);
+
+   if (pd_data->performance_state > state)
+   state = pd_data->performance_state;
+   }
+
+   /* Traverse all subdomains within the domain */
+   list_for_each_entry(link, &genpd->master_links, master_node) {
+   subdomain = link->slave;
+
+   if (subdomain->performance_state > state)
+   state = subdomain->performance_state;
+   }
+
+   if (genpd->performance_state == state)
+   return;
+
+   genpd->performance_state = state;
+
+   if (genpd->set_performance_state) {
+   genpd->set_performance_state(genpd, state);
+   return;
+   }
+
+   /* Propagate to parent power domains */
+   list_for_each_entry(link, &genpd->slave_links, slave_node) {
+   struct generic_pm_domain *master = link->master;
+
+   genpd_lock_nested(master, depth + 1);
+   __update_domain_performance_state(master, depth + 1);
+   genpd_unlock(master);
+   }
+}
+
+static int __performance_notifier(struct generic_pm_domain_data *gpd_data,
+ unsigned long val)
+{
+   struct generic_pm_domain *genpd = ERR_PTR(-ENODATA);
+   struct device *dev = gpd_data->base.dev;
+   struct pm_domain_data *pdd;
+
+   spin_lock_irq(&dev->power.lock);
+
+   pdd = dev->power.subsys_data ?
+   dev->power.subsys_data->domain_data : NULL;
+
+   if (pdd && pdd->dev)
+   genpd = dev_to_genpd(dev);
+
+   spin_unlock_irq(&dev->power.lock);
+
+   if (IS_ERR(genpd))
+   return NOTIFY_DONE;
+
+   genpd_lock(genpd);
+   gpd_data->performance_state = val;
+   __update_domain_performance_state(genpd, 0);
+   genpd_unlock(genpd);
+
+   return NOTIFY_DONE;
+}
+
 static int genpd_dev_pm_qos_notifier(struct notifier_block *nb,
 unsigned long val, void *ptr)
 {
@@ -474,6 +547,9 @@ static int genpd_dev_pm_qos_notifier(struct notifier_block *nb,
if (dev_pm_qos_is_resume_latency(dev, ptr))
return genpd_latency_notifier(gpd_data, val);
 
+   if (dev_pm_qos_is_performance(dev, ptr))
+   return __performance_notifier(gpd_data, val);
+
dev_err(dev, "%s: Unexpected notifier call\n", __func__);
return NOTIFY_BAD;
 }
@@ -1168,6 +1244,7 @@ static struct generic_pm_domain_data *genpd_alloc_dev_data(struct device *dev,
gpd_data->td.constraint_changed = true;
gpd_data->td.effective_constraint_ns = -1;
gpd_data->nb.notifier_call = genpd_dev_pm_qos_notifier;
+   gpd_data->performance_state = 0;
 
spin_lock_irq(&dev->power.lock);
 
diff --git a/include/linux/pm_domain.h b/include/linux/pm_domain.h
index b7803a251044..84ee474e66d0 100644
--- a/include/linux/pm_domain.h
+++ 

[PATCH V5 4/9] PM / QOS: Add DEV_PM_QOS_PERFORMANCE request

2017-04-19 Thread Viresh Kumar
Some platforms have the capability to configure the performance state of
their Power Domains. The performance levels are identified by positive
integer values; a lower value represents a lower performance state. The
power domain driver should be able to retrieve all information required
to configure the performance state of the power domain, with the help of
the performance constraint's target value.

This patch adds a new QOS request type: DEV_PM_QOS_PERFORMANCE to
support runtime performance constraints for the devices. Also allow
notifiers to be registered against it, which will be used by frameworks
like genpd.
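
The diff below initialises the new constraint with type PM_QOS_MAX and a default of 0, so the effective target is the largest outstanding request, or the default when there are none. A minimal standalone sketch of that aggregation follows; `qos_target_value()`, `enum qos_type` and `PERFORMANCE_DEFAULT_VALUE` are hypothetical names for illustration, and a plain array replaces the kernel's priority list.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors PM_QOS_PERFORMANCE_DEFAULT_VALUE from the patch (0). */
#define PERFORMANCE_DEFAULT_VALUE 0

/* Hypothetical stand-in for the PM_QOS_MIN / PM_QOS_MAX constraint types. */
enum qos_type { QOS_MIN, QOS_MAX };

/*
 * Aggregate a set of requests into one target value.
 * With no requests, the "no constraint" value is returned.
 */
int qos_target_value(enum qos_type type, const int *requests, size_t n,
		     int no_constraint_value)
{
	int target;
	size_t i;

	assert(requests != NULL || n == 0);

	if (n == 0)
		return no_constraint_value;

	target = requests[0];
	for (i = 1; i < n; i++) {
		if (type == QOS_MAX ? requests[i] > target
				    : requests[i] < target)
			target = requests[i];
	}
	return target;
}
```

For performance requests of 3, 7 and 5 the QOS_MAX target is 7; the same requests under QOS_MIN (the rule used for resume latency) would give 3.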

Signed-off-by: Viresh Kumar 
---
V4->V5:
- s/ only/
- drop performance_req field
- drop "notifier" from dev_pm_qos_notifier_is_performance

 Documentation/power/pm_qos_interface.txt |  2 +-
 drivers/base/power/qos.c | 21 +
 include/linux/pm_qos.h   |  9 +
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/Documentation/power/pm_qos_interface.txt b/Documentation/power/pm_qos_interface.txt
index 21d2d48f87a2..42870d28fc3c 100644
--- a/Documentation/power/pm_qos_interface.txt
+++ b/Documentation/power/pm_qos_interface.txt
@@ -168,7 +168,7 @@ The per-device PM QoS framework has a per-device notification tree.
 int dev_pm_qos_add_notifier(device, notifier):
 Adds a notification callback function for the device.
 The callback is called when the aggregated value of the device constraints list
-is changed (for resume latency device PM QoS only).
+is changed (for resume latency and performance device PM QoS).
 
 int dev_pm_qos_remove_notifier(device, notifier):
 Removes the notification callback function for the device.
diff --git a/drivers/base/power/qos.c b/drivers/base/power/qos.c
index 654d8a12c2e7..084d26960dae 100644
--- a/drivers/base/power/qos.c
+++ b/drivers/base/power/qos.c
@@ -150,6 +150,10 @@ static int apply_constraint(struct dev_pm_qos_request *req,
req->dev->power.set_latency_tolerance(req->dev, value);
}
break;
+   case DEV_PM_QOS_PERFORMANCE:
+   ret = pm_qos_update_target(&qos->performance, &req->data.pnode,
+  action, value);
+   break;
case DEV_PM_QOS_FLAGS:
ret = pm_qos_update_flags(&qos->flags, &req->data.flr,
  action, value);
@@ -194,6 +198,14 @@ static int dev_pm_qos_constraints_allocate(struct device *dev)
c->no_constraint_value = PM_QOS_LATENCY_TOLERANCE_NO_CONSTRAINT;
c->type = PM_QOS_MIN;
 
+   c = &qos->performance;
+   plist_head_init(&c->list);
+   c->target_value = PM_QOS_PERFORMANCE_DEFAULT_VALUE;
+   c->default_value = PM_QOS_PERFORMANCE_DEFAULT_VALUE;
+   c->no_constraint_value = PM_QOS_PERFORMANCE_DEFAULT_VALUE;
+   c->type = PM_QOS_MAX;
+   c->notifiers = &qos->notifiers;
+
INIT_LIST_HEAD(&qos->flags.list);
 
spin_lock_irq(&dev->power.lock);
@@ -252,6 +264,11 @@ void dev_pm_qos_constraints_destroy(struct device *dev)
apply_constraint(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE);
memset(req, 0, sizeof(*req));
}
+   c = &qos->performance;
+   plist_for_each_entry_safe(req, tmp, &c->list, data.pnode) {
+   apply_constraint(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE);
+   memset(req, 0, sizeof(*req));
+   }
f = &qos->flags;
list_for_each_entry_safe(req, tmp, &f->list, data.flr.node) {
apply_constraint(req, PM_QOS_REMOVE_REQ, PM_QOS_DEFAULT_VALUE);
@@ -362,6 +379,7 @@ static int __dev_pm_qos_update_request(struct dev_pm_qos_request *req,
switch(req->type) {
case DEV_PM_QOS_RESUME_LATENCY:
case DEV_PM_QOS_LATENCY_TOLERANCE:
+   case DEV_PM_QOS_PERFORMANCE:
curr_value = req->data.pnode.prio;
break;
case DEV_PM_QOS_FLAGS:
@@ -571,6 +589,9 @@ static void __dev_pm_qos_drop_user_request(struct device *dev,
req = dev->power.qos->flags_req;
dev->power.qos->flags_req = NULL;
break;
+   case DEV_PM_QOS_PERFORMANCE:
+   dev_err(dev, "Invalid user request (performance)\n");
+   return;
}
__dev_pm_qos_remove_request(req);
kfree(req);
diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
index e546d1a2f237..665f90face40 100644
--- a/include/linux/pm_qos.h
+++ b/include/linux/pm_qos.h
@@ -36,6 +36,7 @@ enum pm_qos_flags_status {
 #define PM_QOS_RESUME_LATENCY_DEFAULT_VALUE 0
 #define PM_QOS_LATENCY_TOLERANCE_DEFAULT_VALUE 0
 #define PM_QOS_LATENCY_TOLERANCE_NO_CONSTRAINT (-1)
+#define PM_QOS_PERFORMANCE_DEFAULT_VALUE   0
 #define PM_QOS_LATENCY_ANY ((s32)(~(__u32)0 >> 1))
 
 #define PM_QOS_FLAG_NO_POWER_OFF   (1 << 0)
@@ -55,6 +56,7 @@ struct pm_qos_flags_request {
 enum 

[PATCH V5 3/9] PM / QOS: Keep common notifier list for genpd constraints

2017-04-19 Thread Viresh Kumar
Only the resume_latency constraint uses the notifiers right now. In
order to prepare for adding new constraint types with notifiers, move to
a common notifier list.

Update pm_qos_update_target() to pass a pointer to the constraint
structure to the notifier callbacks. Also update the notifier callbacks
to error out for unexpected constraints.
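
Since every constraint type now shares one notifier list, a callback has to look at the constraint pointer it is handed to decide whether the event is meant for it. The following is a minimal userspace sketch of that dispatch, under simplifying assumptions: `struct dev_qos`, `struct listener`, `qos_is_resume_latency()` and `qos_notifier()` are hypothetical names, and the two return codes are local stand-ins rather than the kernel's NOTIFY_* values.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the notifier return codes (values are arbitrary). */
#define SKETCH_NOTIFY_DONE 0
#define SKETCH_NOTIFY_BAD  1

/* Hypothetical, stripped-down constraint and per-device QoS objects. */
struct constraints { int target_value; };

struct dev_qos {
	struct constraints resume_latency;
	struct constraints latency_tolerance;
};

/* A listener that only cares about resume latency changes. */
struct listener {
	struct dev_qos *qos;
	int last_latency;
};

/* The event is identified by the address of the constraint that changed. */
int qos_is_resume_latency(const struct dev_qos *qos, const void *ptr)
{
	return ptr == (const void *)&qos->resume_latency;
}

/* Common callback: handle the known constraint, reject anything else. */
int qos_notifier(struct listener *l, unsigned long val, void *ptr)
{
	assert(l != NULL && l->qos != NULL);

	if (qos_is_resume_latency(l->qos, ptr)) {
		l->last_latency = (int)val;
		return SKETCH_NOTIFY_DONE;
	}

	/* Unexpected constraint type on the shared list. */
	return SKETCH_NOTIFY_BAD;
}
```

Adding a new constraint type then means adding one more pointer comparison to the callback instead of allocating a separate notifier head for each type.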

Signed-off-by: Viresh Kumar 
Acked-by: Ulf Hansson 
---
V4->V5:
- s/__resume_latency_notifier/genpd_latency_notifier
- drop "notifier" from dev_pm_qos_notifier_is_resume_latency

 drivers/base/power/domain.c | 26 +++---
 drivers/base/power/qos.c| 15 ---
 include/linux/pm_qos.h  |  7 +++
 kernel/power/qos.c  |  2 +-
 4 files changed, 31 insertions(+), 19 deletions(-)

diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
index da49a8383dc3..f6f616ac5cc2 100644
--- a/drivers/base/power/domain.c
+++ b/drivers/base/power/domain.c
@@ -426,14 +426,10 @@ static int genpd_power_on(struct generic_pm_domain *genpd, unsigned int depth)
return ret;
 }
 
-static int genpd_dev_pm_qos_notifier(struct notifier_block *nb,
-unsigned long val, void *ptr)
+static int genpd_latency_notifier(struct generic_pm_domain_data *gpd_data,
+ unsigned long val)
 {
-   struct generic_pm_domain_data *gpd_data;
-   struct device *dev;
-
-   gpd_data = container_of(nb, struct generic_pm_domain_data, nb);
-   dev = gpd_data->base.dev;
+   struct device *dev = gpd_data->base.dev;
 
for (;;) {
struct generic_pm_domain *genpd;
@@ -466,6 +462,22 @@ static int genpd_dev_pm_qos_notifier(struct notifier_block *nb,
return NOTIFY_DONE;
 }
 
+static int genpd_dev_pm_qos_notifier(struct notifier_block *nb,
+unsigned long val, void *ptr)
+{
+   struct generic_pm_domain_data *gpd_data;
+   struct device *dev;
+
+   gpd_data = container_of(nb, struct generic_pm_domain_data, nb);
+   dev = gpd_data->base.dev;
+
+   if (dev_pm_qos_is_resume_latency(dev, ptr))
+   return genpd_latency_notifier(gpd_data, val);
+
+   dev_err(dev, "%s: Unexpected notifier call\n", __func__);
+   return NOTIFY_BAD;
+}
+
 /**
  * genpd_power_off_work_fn - Power off PM domain whose subdomain count is 0.
  * @work: Work structure used for scheduling the execution of this function.
diff --git a/drivers/base/power/qos.c b/drivers/base/power/qos.c
index f850daeffba4..654d8a12c2e7 100644
--- a/drivers/base/power/qos.c
+++ b/drivers/base/power/qos.c
@@ -172,18 +172,12 @@ static int dev_pm_qos_constraints_allocate(struct device *dev)
 {
struct dev_pm_qos *qos;
struct pm_qos_constraints *c;
-   struct blocking_notifier_head *n;
 
qos = kzalloc(sizeof(*qos), GFP_KERNEL);
if (!qos)
return -ENOMEM;
 
-   n = kzalloc(sizeof(*n), GFP_KERNEL);
-   if (!n) {
-   kfree(qos);
-   return -ENOMEM;
-   }
-   BLOCKING_INIT_NOTIFIER_HEAD(n);
+   BLOCKING_INIT_NOTIFIER_HEAD(&qos->notifiers);
 
c = &qos->resume_latency;
plist_head_init(&c->list);
@@ -191,7 +185,7 @@ static int dev_pm_qos_constraints_allocate(struct device *dev)
c->default_value = PM_QOS_RESUME_LATENCY_DEFAULT_VALUE;
c->no_constraint_value = PM_QOS_RESUME_LATENCY_DEFAULT_VALUE;
c->type = PM_QOS_MIN;
-   c->notifiers = n;
+   c->notifiers = &qos->notifiers;
 
c = &qos->latency_tolerance;
plist_head_init(&c->list);
@@ -268,7 +262,6 @@ void dev_pm_qos_constraints_destroy(struct device *dev)
dev->power.qos = ERR_PTR(-ENODEV);
spin_unlock_irq(&dev->power.lock);
 
-   kfree(qos->resume_latency.notifiers);
kfree(qos);
 
  out:
@@ -487,7 +480,7 @@ int dev_pm_qos_add_notifier(struct device *dev, struct notifier_block *notifier)
ret = dev_pm_qos_constraints_allocate(dev);
 
if (!ret)
-   ret = blocking_notifier_chain_register(dev->power.qos->resume_latency.notifiers,
+   ret = blocking_notifier_chain_register(&dev->power.qos->notifiers,
   notifier);
 
mutex_unlock(&dev_pm_qos_mtx);
@@ -514,7 +507,7 @@ int dev_pm_qos_remove_notifier(struct device *dev,
 
/* Silently return if the constraints object is not present. */
if (!IS_ERR_OR_NULL(dev->power.qos))
-   retval = blocking_notifier_chain_unregister(dev->power.qos->resume_latency.notifiers,
+   retval = blocking_notifier_chain_unregister(&dev->power.qos->notifiers,
notifier);
 
mutex_unlock(&dev_pm_qos_mtx);
diff --git a/include/linux/pm_qos.h b/include/linux/pm_qos.h
index 032b55909145..e546d1a2f237 100644
--- a/include/linux/pm_qos.h
+++ b/include/linux/pm_qos.h
@@ -100,6 

Re: [PATCH] make TIOCSTI ioctl require CAP_SYS_ADMIN

2017-04-19 Thread Matt Brown

On 04/19/2017 07:53 PM, Serge E. Hallyn wrote:

Quoting Matt Brown (m...@nmatt.com):

On 04/19/2017 12:58 AM, Serge E. Hallyn wrote:

On Tue, Apr 18, 2017 at 11:45:26PM -0400, Matt Brown wrote:

This patch reproduces GRKERNSEC_HARDEN_TTY functionality from the grsecurity
project in-kernel.

This will create the Kconfig SECURITY_TIOCSTI_RESTRICT and the corresponding
sysctl kernel.tiocsti_restrict that, when activated, restrict all TIOCSTI
ioctl calls from non CAP_SYS_ADMIN users.

Possible effects on userland:

There could be a few user programs that would be affected by this
change.
See: 
notable programs are: agetty, csh, xemacs and tcsh

However, I still believe that this change is worth it given that the
Kconfig defaults to n. This will be a feature that is turned on for the


It's not worthless, but note that for instance before this was fixed
in lxc, this patch would not have helped with escapes from privileged
containers.



I assume you are talking about this CVE:
https://bugzilla.redhat.com/show_bug.cgi?id=1411256

In retrospect, is there any way that an escape from a privileged
container with this bug could have been prevented?


I don't know, that's what I was probing for.  Detecting that the pgrp
or session - heck, the pid namespace - has changed would seem like a
good indicator that it shouldn't be able to push.



pgrp and session won't do because in the case we are discussing
current->signal->tty is the same as tty.

This is the current check that is already in place:
 | if ((current->signal->tty != tty) && !capable(CAP_SYS_ADMIN))
 |  return -EPERM;

The only thing I could find to detect the tty message coming from a
container is as follows:
 | task_active_pid_ns(current)->level

This will be zero when run on the host, but 1 when run inside a
container. However this is very much a hack and could probably break
some userland stuff where there are multiple levels of namespaces.

The real problem is that there are no TTY namespaces. I don't think we
can solve this problem for CAP_SYS_ADMIN containers unless we want to
introduce a config that allows one to override normal CAP_SYS_ADMIN
functionality by denying TIOCSTI ioctls for processes whose
task_active_pid_ns(current)->level is equal to 0.

In the meantime, I think we can go ahead with this feature to give
people the ability to lock down non CAP_SYS_ADMIN containers/processes.
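
For reference, the decision that tiocsti() makes once the proposed check is added can be modelled as a pure function. This is only a sketch: `tiocsti_check()` and its boolean parameters are hypothetical and stand in for the sysctl value, the `current->signal->tty == tty` comparison and `capable(CAP_SYS_ADMIN)`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Model of the permission decision in tiocsti() with the patch applied.
 * Returns 0 if the ioctl may proceed, -EPERM otherwise.
 */
int tiocsti_check(bool restrict_enabled, bool is_controlling_tty,
		  bool has_cap_sys_admin)
{
	/* New check: with the restriction on, only CAP_SYS_ADMIN may push. */
	if (restrict_enabled && !has_cap_sys_admin)
		return -EPERM;

	/* Existing check: a foreign tty needs CAP_SYS_ADMIN. */
	if (!is_controlling_tty && !has_cap_sys_admin)
		return -EPERM;

	return 0;
}
```

A caller that holds CAP_SYS_ADMIN passes both checks whatever the restriction setting is, which is the privileged-container case described above that this patch does not cover. An unprivileged caller on its own controlling tty is allowed today and denied once the restriction is enabled.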


same reason that people activate it when using grsecurity. Users of this
opt-in feature will realize that they are choosing security over some OS
features like unprivileged TIOCSTI ioctls, as should be clear in the
Kconfig help message.

Threat Model/Patch Rationale:

From grsecurity's config for GRKERNSEC_HARDEN_TTY.

| There are very few legitimate uses for this functionality and it
| has made vulnerabilities in several 'su'-like programs possible in
| the past.  Even without these vulnerabilities, it provides an
| attacker with an easy mechanism to move laterally among other
| processes within the same user's compromised session.

So if one process within a tty session becomes compromised it can follow
that additional processes, that are thought to be in different security
boundaries, can be compromised as a result. When using a program like su
or sudo, these additional processes could be in a tty session where TTY file
descriptors are indeed shared over privilege boundaries.

This is also an excellent writeup about the issue:


Signed-off-by: Matt Brown 
---
drivers/tty/tty_io.c |  4 
include/linux/tty.h  |  2 ++
kernel/sysctl.c  | 12 
security/Kconfig | 13 +
4 files changed, 31 insertions(+)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index e6d1a65..31894e8 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2296,11 +2296,15 @@ static int tty_fasync(int fd, struct file *filp, int on)
 *  FIXME: may race normal receive processing
 */

+int tiocsti_restrict = IS_ENABLED(CONFIG_SECURITY_TIOCSTI_RESTRICT);
+
static int tiocsti(struct tty_struct *tty, char __user *p)
{
char ch, mbz = 0;
struct tty_ldisc *ld;

+   if (tiocsti_restrict && !capable(CAP_SYS_ADMIN))
+   return -EPERM;
if ((current->signal->tty != tty) && !capable(CAP_SYS_ADMIN))
return -EPERM;
if (get_user(ch, p))
diff --git a/include/linux/tty.h b/include/linux/tty.h
index 1017e904..7011102 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -342,6 +342,8 @@ struct tty_file_private {
struct list_head list;
};

+extern int tiocsti_restrict;
+
/* tty magic number */
#define TTY_MAGIC   0x5401

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index acf0a5a..68d1363 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -67,6 +67,7 @@
#include 
#include 
#include 
+#include 

#include 
#include 
@@ 

Re: [PATCH] make TIOCSTI ioctl require CAP_SYS_ADMIN

2017-04-19 Thread Matt Brown

On 04/19/2017 07:53 PM, Serge E. Hallyn wrote:

Quoting Matt Brown (m...@nmatt.com):

On 04/19/2017 12:58 AM, Serge E. Hallyn wrote:

On Tue, Apr 18, 2017 at 11:45:26PM -0400, Matt Brown wrote:

This patch reproduces GRKERNSEC_HARDEN_TTY functionality from the grsecurity
project in-kernel.

This will create the Kconfig SECURITY_TIOCSTI_RESTRICT and the corresponding
sysctl kernel.tiocsti_restrict that, when activated, restrict all TIOCSTI
ioctl calls from non CAP_SYS_ADMIN users.

Possible effects on userland:

There could be a few user programs that would be affected by this
change.
See: 
notable programs are: agetty, csh, xemacs and tcsh

However, I still believe that this change is worth it given that the
Kconfig defaults to n. This will be a feature that is turned on for the


It's not worthless, but note that for instance before this was fixed
in lxc, this patch would not have helped with escapes from privileged
containers.



I assume you are talking about this CVE:
https://bugzilla.redhat.com/show_bug.cgi?id=1411256

In retrospect, is there any way that an escape from a privileged
container with this bug could have been prevented?


I don't know, that's what I was probing for.  Detecting that the pgrp
or session - heck, the pid namespace - has changed would seem like a
good indicator that it shouldn't be able to push.



pgrp and session won't do because in the case we are discussing
current->signal->tty is the same as tty.

This is the current check that is already in place:
 | if ((current->signal->tty != tty) && !capable(CAP_SYS_ADMIN))
 |  return -EPERM;

The only thing I could find to detect the tty message coming from a
container is as follows:
 | task_active_pid_ns(current)->level

This will be zero when run on the host, but 1 when run inside a
container. However this is very much a hack and could probably break
some userland stuff where there are multiple levels of namespaces.

The real problem is that there are no TTY namespaces. I don't think we
can solve this problem for CAP_SYS_ADMIN containers unless we want to
introduce a config that allows one to override normal CAP_SYS_ADMIN
functionality by denying TIOCSTI ioctls for processes whose
task_active_pid_ns(current)->level is not equal to 0.

In the mean time, I think we can go ahead with this feature to give
people the ability to lock down non CAP_SYS_ADMIN containers/processes.
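
To make the two checks concrete, here is a minimal user-space model of the
decision (a sketch only, not kernel code). The function name and the boolean
parameters are made up for illustration: restrict_enabled stands in for the
tiocsti_restrict sysctl, is_own_tty for (current->signal->tty == tty), and
has_cap_sys_admin for capable(CAP_SYS_ADMIN).

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Hypothetical model of the permission logic in tiocsti() with the
 * proposed check added in front of the existing one.
 * Returns 0 if the ioctl would proceed, -EPERM if it would be refused.
 */
static int tiocsti_check_model(bool restrict_enabled, bool is_own_tty,
			       bool has_cap_sys_admin)
{
	/* Proposed check: with the sysctl on, only CAP_SYS_ADMIN may push. */
	if (restrict_enabled && !has_cap_sys_admin)
		return -EPERM;

	/* Existing check: pushing to a foreign tty needs CAP_SYS_ADMIN. */
	if (!is_own_tty && !has_cap_sys_admin)
		return -EPERM;

	return 0;
}
```

In this model a caller with CAP_SYS_ADMIN passes both checks whatever the
sysctl says, which is why the privileged container case is not covered.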


same reason that people activate it when using grsecurity. Users of this
opt-in feature will realize that they are choosing security over some OS
features like unprivileged TIOCSTI ioctls, as should be clear in the
Kconfig help message.

Threat Model/Patch Rationale:

From grsecurity's config for GRKERNSEC_HARDEN_TTY.

| There are very few legitimate uses for this functionality and it
| has made vulnerabilities in several 'su'-like programs possible in
| the past.  Even without these vulnerabilities, it provides an
| attacker with an easy mechanism to move laterally among other
| processes within the same user's compromised session.

So if one process within a tty session becomes compromised, additional
processes that are thought to be in different security boundaries can be
compromised as a result. When using a program like su
or sudo, these additional processes could be in a tty session where TTY file
descriptors are indeed shared over privilege boundaries.
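
For readers unfamiliar with the ioctl, the mechanism being restricted can be
sketched in a few lines of user-space C. TIOCSTI inserts one byte into the
terminal's input queue as if it had been typed. The helper name push_to_tty
is made up for this illustration; only ioctl() and TIOCSTI are real.

```c
#include <stddef.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper: push each byte of s into the input queue of the
 * terminal referred to by fd. Returns 0 on success, or -1 as soon as one
 * ioctl() call fails (errno is left as set by ioctl()).
 */
static int push_to_tty(int fd, const char *s)
{
	for (; *s != '\0'; s++) {
		char c = *s;

		if (ioctl(fd, TIOCSTI, &c) < 0)
			return -1;
	}
	return 0;
}
```

With the proposed sysctl enabled, a call such as push_to_tty(0, "id\n") from
a process without CAP_SYS_ADMIN would be expected to fail with EPERM.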

This is also an excellent writeup about the issue:


Signed-off-by: Matt Brown 
---
drivers/tty/tty_io.c |  4 
include/linux/tty.h  |  2 ++
kernel/sysctl.c  | 12 
security/Kconfig | 13 +
4 files changed, 31 insertions(+)

diff --git a/drivers/tty/tty_io.c b/drivers/tty/tty_io.c
index e6d1a65..31894e8 100644
--- a/drivers/tty/tty_io.c
+++ b/drivers/tty/tty_io.c
@@ -2296,11 +2296,15 @@ static int tty_fasync(int fd, struct file *filp, int on)
 *  FIXME: may race normal receive processing
 */

+int tiocsti_restrict = IS_ENABLED(CONFIG_SECURITY_TIOCSTI_RESTRICT);
+
static int tiocsti(struct tty_struct *tty, char __user *p)
{
char ch, mbz = 0;
struct tty_ldisc *ld;

+   if (tiocsti_restrict && !capable(CAP_SYS_ADMIN))
+   return -EPERM;
if ((current->signal->tty != tty) && !capable(CAP_SYS_ADMIN))
return -EPERM;
if (get_user(ch, p))
diff --git a/include/linux/tty.h b/include/linux/tty.h
index 1017e904..7011102 100644
--- a/include/linux/tty.h
+++ b/include/linux/tty.h
@@ -342,6 +342,8 @@ struct tty_file_private {
struct list_head list;
};

+extern int tiocsti_restrict;
+
/* tty magic number */
#define TTY_MAGIC   0x5401

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index acf0a5a..68d1363 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -67,6 +67,7 @@
#include 
#include 
#include 
+#include 

#include 
#include 
@@ -833,6 +834,17 

Re: [PATCH] of: introduce event tracepoints for dynamic device_node lifecyle

2017-04-19 Thread Frank Rowand
On 04/19/17 16:27, Tyrel Datwyler wrote:
> On 04/18/2017 06:31 PM, Michael Ellerman wrote:
>> Frank Rowand  writes:
>>
>>> On 04/17/17 17:32, Tyrel Datwyler wrote:
>>>> This patch introduces event tracepoints for tracking a device_node's
>>>> reference cycle as well as reconfig notifications generated in response
>>>> to node/property manipulations.
>>>>
>>>> With the recent upstreaming of the refcount API several device_node
>>>> underflows and leaks have come to my attention in the pseries (DLPAR)
>>>> dynamic logical partitioning code (i.e. POWER speak for hotplugging
>>>> virtual and physical resources at runtime such as cpus or IOAs). These
>>>> tracepoints provide an easy and quick mechanism for validating the
>>>> reference counting of device_nodes during their lifetime.
>>>>
>>>> Further, when pseries lpars are migrated to a different machine we
>>>> perform a live update of our device tree to bring it into alignment with
>>>> the configuration of the new machine. The of_reconfig_notify trace point
>>>> provides a mechanism that can be turned on for debugging the device tree
>>>> modifications without having to build a custom kernel to get at the
>>>> DEBUG code introduced by commit 00aa3720.
>>>
>>> I do not like changing individual (or small groups of) printk() style
>>> debugging information to tracepoint style.
>>
>> I'm not quite sure which printks() you're referring to.
>>
>> The only printks that are removed in this series are under #ifdef DEBUG,
>> and so are essentially not there unless you build a custom kernel.
>>
>> They also only cover the reconfig case, which is actually less
>> interesting than the much more common and bug-prone get/put logic.
>>
>>> As far as I know, there is no easy way to combine trace data and printk()
>>> style data to create a single chronology of events.  If some of the
>>> information needed to debug an issue is trace data and some is printk()
>>> style data then it becomes more difficult to understand the overall
>>> situation.
>>
>> If you enable CONFIG_PRINTK_TIME then you should be able to just sort
>> the trace and the printk output by the timestamp. If you're really
>> trying to correlate the two then you should probably just be using
>> trace_printk().
>>
>> But IMO this level of detail, tracing every get/put, does not belong in
>> printk. Trace points are absolutely the right solution for this type of
>> debugging.
> 
> Something else to keep in mind is that while pr_debugs could be used to
> provide feedback on the reference counts and of_reconfig events they
> don't in any way tell us where they are happening in the kernel. The

Yes, that is critical information.  When there are refcount issues, the
root cause is at varying levels back in the call stack.


> trace infrastructure provides the ability to stack trace those events.
> The following example provides me a lot more information about who is
> doing what and where after I hot-add an ethernet adapter:
> 
> # echo stacktrace > /sys/kernel/debug/tracing/trace_options
> # cat trace | grep -A6 "/pci@8002018"
> ...
>drmgr-7349  [006] d...  7138.821875: of_node_get: refcount=8,
> dn->full_name=/pci@8002018
>drmgr-7349  [006] d...  7138.821876: 
>  => .msi_quota_for_device
>  => .rtas_setup_msi_irqs
>  => .arch_setup_msi_irqs
>  => .__pci_enable_msix
>  => .pci_enable_msix_range

Nice!  It is great to have function names in the call stack.


> --
>drmgr-7349  [006] d...  7138.821876: of_node_put: refcount=2,
> dn->full_name=/pci@8002018/ethernet@0
>drmgr-7349  [006] d...  7138.821877: 
>  => .msi_quota_for_device
>  => .rtas_setup_msi_irqs
>  => .arch_setup_msi_irqs
>  => .__pci_enable_msix
>  => .pci_enable_msix_range
> --
>drmgr-7349  [006]   7138.821878: of_node_put: refcount=7,
> dn->full_name=/pci@8002018
>drmgr-7349  [006]   7138.821879: 
>  => .rtas_setup_msi_irqs
>  => .arch_setup_msi_irqs
>  => .__pci_enable_msix
>  => .pci_enable_msix_range
>  => .bnx2x_enable_msix
> --
> 
> As far as I know, the only way to get that same info is to add a
> dump_stack() after each pr_debug.

Here is a patch that I have used.  It is not as user friendly in terms
of human readable stack traces (though a very small user space program
should be able to fix that).  The patch is cut and pasted into this
email, so probably white space damaged.

Instead of dumping the stack, each line in the "report" contains
the top six addresses in the call stack.  If interesting, they
can be post-processed (as I will show in some examples below).
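
The idea of recording a few caller addresses per report line can be sketched
in user space as follows. This is not the kernel patch below; it is a rough
analogue that assumes glibc's backtrace() from <execinfo.h>, and the names
capture_callers, report_event and MAX_CALLERS are made up for illustration.

```c
#include <execinfo.h>
#include <stdio.h>

#define MAX_CALLERS 6	/* same depth as the "top six addresses" */

/*
 * Store up to MAX_CALLERS return addresses of the current call chain,
 * innermost frame first, and return how many were stored.
 */
static int capture_callers(void *addrs[MAX_CALLERS])
{
	return backtrace(addrs, MAX_CALLERS);
}

/* Emit one report line: event name, refcount, then the raw addresses. */
static void report_event(FILE *out, const char *event, int refcount)
{
	void *addrs[MAX_CALLERS];
	int n = capture_callers(addrs);
	int i;

	fprintf(out, "%s refcount=%d", event, refcount);
	for (i = 0; i < n; i++)
		fprintf(out, " %p", addrs[i]);
	fputc('\n', out);
}
```

The raw addresses are cheap to record and can be resolved to symbol names
afterwards by a separate tool.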

---
 drivers/of/dynamic.c |   29 +
 1 file changed, 29 insertions(+)

Index: b/drivers/of/dynamic.c
===
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "of_private.h"
 
@@ -27,6 +28,20 

RE: [PATCH 4/5] nvme: Adjust the Samsung APST quirk

2017-04-19 Thread Judy Brock
[Jens] Do we know for a fact that it only happens on those systems, and isn't 
> purely specific to the device?

[Andy] I have decent evidence.  All of the reports are from XPS 15 9550 or 
Precision 5510, and Dell confirmed that they're basically the same machine and 
run literally the same BIOS.

The answer as per the above as far as we know is "yes".
>
> At this point in time, I'd be much more comfortable completely 
> disabling APST on Samsung, period.
>

1) Why? The answer to the question above was "Yes". This has been reported 
exclusively on the two Dell models with that exact same BIOS. Additionally, 
there are reports of the device acting fine on other systems so it is not 
purely specific to the device.

We request that the quirk should be only on the affected Dell machines - there 
is no reason to completely disable APST on Samsung.
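
As a rough illustration of what "only on the affected Dell machines" means,
the match could be modelled like this. It is a user-space sketch, not the
driver code: the flag and function names are made up, and the "Dell Inc."
vendor string is an assumption about what the platform reports. The PCI ID
144d:a802 and the two model names come from the patch description quoted
below.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define QUIRK_NONE		0u
#define QUIRK_NO_DEEPEST_PS	(1u << 0)	/* hypothetical flag name */

/* True only for the two laptop models named in the reports. */
static bool is_affected_dell(const char *sys_vendor, const char *product)
{
	if (strcmp(sys_vendor, "Dell Inc.") != 0)	/* assumed string */
		return false;

	return strcmp(product, "XPS 15 9550") == 0 ||
	       strcmp(product, "Precision 5510") == 0;
}

/* Apply the quirk only when both the SSD and the host platform match. */
static unsigned int apst_quirks(uint16_t vendor, uint16_t device,
				const char *sys_vendor, const char *product)
{
	if (vendor == 0x144d && device == 0xa802 &&
	    is_affected_dell(sys_vendor, product))
		return QUIRK_NO_DEEPEST_PS;

	return QUIRK_NONE;
}
```

The same SSD in any other system gets no quirk in this model, which matches
the reports of the device working fine elsewhere.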

2) Samsung shared in the more private thread that we are seeing excessive 
recovery attempts on the PCIe bus - no PCIe TLPs seen, just ordered sets.  This 
looks like a signal integrity problem to us on the Dell side. We shared an
excerpt of the PCIe trace on the offline thread.

3) We also shared a more extensive report with Dell today. We've asked them to 
look into it. 

4) There was at least one report of the same symptom on a Toshiba device and Lenovo 
system that seemed to also disappear by avoiding PS4. So it seems it would be 
best to continue to try to get to the bottom of the problem (root cause) and 
quirk judiciously in the meantime.

Thanks,
Judy



-----Original Message-----
From: Linux-nvme [mailto:linux-nvme-boun...@lists.infradead.org] On Behalf Of 
Andy Lutomirski
Sent: Wednesday, April 19, 2017 8:51 PM
To: Jens Axboe
Cc: Sagi Grimberg; linux-kernel@vger.kernel.org; linux-nvme; Keith Busch; 
Kai-Heng Feng; Andy Lutomirski; Christoph Hellwig; Niranjan Sivakumar
Subject: Re: [PATCH 4/5] nvme: Adjust the Samsung APST quirk

On Wed, Apr 19, 2017 at 8:07 PM, Jens Axboe  wrote:
> On Wed, Apr 19 2017, Andy Lutomirski wrote:
>> I got a couple more reports: the Samsung APST issues appears to 
>> affect multiple 950-series devices in Dell XPS 15 9550 and Precision
>> 5510 laptops.  Change the quirk: rather than blacklisting the 
>> firmware on the first problematic SSD that was reported, disable APST 
>> on all 144d:a802 devices if they're installed in the two affected 
>> Dell models.  While we're at it, disable only the deepest sleep state 
>> instead of all of them -- the reporters say that this is sufficient 
>> to fix the problem.
>>
>> (I have a device that appears to be entirely identical to one of the 
>> affected devices, but I have a different Dell laptop, so it's not the 
>> case that all Samsung devices with firmware BXW75D0Q are broken under 
>> all circumstances.)
>>
>> Samsung engineers have an affected system, and hopefully they'll give 
>> us a better workaround some time soon.  In the mean time, this should 
>> minimize regressions.
>
> Do we know for a fact that it only happens on those systems, and isn't 
> purely specific to the device?

I have decent evidence.  All of the reports are from XPS 15 9550 or Precision 
5510, and Dell confirmed that they're basically the same machine and run 
literally the same BIOS.  One of these reports is from a device with exactly 
the same model and firmware as my SSD, and mine is fine.  (I have a different 
laptop.)

>
> At this point in time, I'd be much more comfortable completely 
> disabling APST on Samsung, period.
>

I'd be fine with doing that for 4.11 and then doing this for 4.12-rc1.

___
Linux-nvme mailing list
linux-n...@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme


Re: [PATCH net-next] brcmfmac: fix build without CONFIG_BRCMFMAC_PROTO_BCDC

2017-04-19 Thread Kalle Valo
Arnd Bergmann  writes:

> With CONFIG_BRCMFMAC_PROTO_BCDC unset, we cannot build the fwsignal.c file:
>
> drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c: In function 
> 'brcmf_fws_notify_credit_map':
> drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c:1590:31: error: 
> implicit declaration of function 'drvr_to_fws'; did you mean 'dev_to_psd'? 
> [-Werror=implicit-function-declaration]
>   struct brcmf_fws_info *fws = drvr_to_fws(ifp->drvr);
> drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c:1590:31: error: 
> initialization makes pointer from integer without a cast 
> [-Werror=int-conversion]
> drivers/net/wireless/broadcom/brcm80211/brcmfmac/fwsignal.c:1621:31: error: 
> initialization makes pointer from integer without a cast 
> [-Werror=int-conversion]
>
> However, as pointed out in the changeset description for the patch that caused
> the problem, fwsignal.c is only required when CONFIG_BRCMFMAC_PROTO_BCDC is
> enabled, so we can simply change the Makefile to build it conditionally.
>
> Fixes: acf8ac41dd73 ("brcmfmac: remove reference to fwsignal data from struct 
> brcmf_pub")
> Signed-off-by: Arnd Bergmann 

The fix is actually for wireless-drivers-next, acf8ac41dd73 is not in
net-next yet. And I already applied an identical fix from Arend:

https://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next.git/commit/?id=26ecfe01790381c4caa65ec9cce484c623f092c4

-- 
Kalle Valo


Re: [PATCH V4 4/9] PM / QOS: Add DEV_PM_QOS_PERFORMANCE request

2017-04-19 Thread Viresh Kumar
On 19-04-17, 16:07, Ulf Hansson wrote:
> On 20 March 2017 at 10:32, Viresh Kumar  wrote:
> > @@ -571,6 +589,9 @@ static void __dev_pm_qos_drop_user_request(struct 
> > device *dev,
> > req = dev->power.qos->flags_req;
> > dev->power.qos->flags_req = NULL;
> > break;
> > +   case DEV_PM_QOS_PERFORMANCE:
> > +   dev_err(dev, "Invalid user request (performance)\n");
> > +   return;
> 
> Isn't it possible to drop a performance request?

I am not exposing the performance QOS via sysfs. Should we ? I thought
this has to be worked out within kernel only and so haven't provided
any user interface.

> > @@ -96,9 +98,11 @@ struct pm_qos_flags {
> >  struct dev_pm_qos {
> > struct pm_qos_constraints resume_latency;
> > struct pm_qos_constraints latency_tolerance;
> > +   struct pm_qos_constraints performance;
> > struct pm_qos_flags flags;
> > struct dev_pm_qos_request *resume_latency_req;
> > struct dev_pm_qos_request *latency_tolerance_req;
> > +   struct dev_pm_qos_request *performance_req;
> 
> I didn't find performance_req being used at all...

I just over-copied it seems. The OPP framework creates its own request
structure and so this should be dropped.

-- 
viresh


Re: [PATCH] iommu/arm-smmu: Return IOVA in iova_to_phys when SMMU is bypassed

2017-04-19 Thread Sunil Kovvuri
On Mon, Apr 17, 2017 at 5:27 PM,   wrote:
> From: Sunil Goutham 
>
> For software initiated address translation, when domain type is
> IOMMU_DOMAIN_IDENTITY i.e SMMU is bypassed, mimic HW behavior
> i.e return the same IOVA as translated address.
>
> This patch is an extension to Will Deacon's patchset
> "Implement SMMU passthrough using the default domain".
>
> Signed-off-by: Sunil Goutham 
> ---
>  drivers/iommu/arm-smmu.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index 41afb07..2f4a130 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -1405,6 +1405,9 @@ static phys_addr_t arm_smmu_iova_to_phys(struct 
> iommu_domain *domain,
> struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> struct io_pgtable_ops *ops= smmu_domain->pgtbl_ops;
>
> +   if (domain->type == IOMMU_DOMAIN_IDENTITY)
> +   return iova;
> +
> if (!ops)
> return 0;
>
> --
> 2.7.4
>

Any comments or is this patch accepted ?

Thanks,
Sunil.


Re: [PATCH v2] soc: brcmstb: enable drivers for ARM64 and BMIPS

2017-04-19 Thread Baruch Siach
Hi Markus,

On Wed, Apr 19, 2017 at 03:50:40PM -0700, Markus Mayer wrote:
> From: Markus Mayer 
> 
> We enable the BRCMSTB SoC drivers not only for ARM, but also ARM64 and
> BMIPS.
> 
> Signed-off-by: Markus Mayer 
> ---
> 
> I used (COMPILE_TEST & OF) as condition like Raspberry Pi, since we
> have of_* calls in our code, too.

All of_* calls under drivers/soc/bcm/ either have empty implementations for 
the !CONFIG_OF case, or call only such routines. The RPi OF condition is most 
likely not needed as well.
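
The pattern I am referring to looks roughly like the sketch below. The names
are made up for illustration (HAVE_OF_SUPPORT stands in for CONFIG_OF, and
find_node_by_name() for an of_* lookup); this is not copied from the kernel
headers.

```c
#include <errno.h>
#include <stddef.h>

struct device_node;	/* opaque to callers */

#ifdef HAVE_OF_SUPPORT	/* hypothetical stand-in for CONFIG_OF */
struct device_node *find_node_by_name(const char *name);
#else
/* Empty stub: the caller still compiles and simply finds nothing. */
static inline struct device_node *find_node_by_name(const char *name)
{
	(void)name;
	return NULL;
}
#endif

/* Caller code is identical in both configurations. */
static int probe_example(void)
{
	struct device_node *np = find_node_by_name("example");

	if (!np)
		return -ENODEV;
	return 0;
}
```

Because the stub keeps the callers building, a COMPILE_TEST build does not
need the extra OF dependency just to compile.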

baruch

> Changes since v1:
>   - Add BMIPS
>   - Add COMPILE_TEST
> 
>  drivers/soc/bcm/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/soc/bcm/Kconfig b/drivers/soc/bcm/Kconfig
> index a39b0d58ddd0..9463d6709909 100644
> --- a/drivers/soc/bcm/Kconfig
> +++ b/drivers/soc/bcm/Kconfig
> @@ -11,7 +11,7 @@ config RASPBERRYPI_POWER
>  
>  config SOC_BRCMSTB
>   bool "Broadcom STB SoC drivers"
> - depends on ARM
> + depends on ARM || ARM64 || BMIPS_GENERIC || (COMPILE_TEST && OF)
>   select SOC_BUS
>   help
> Enables drivers for the Broadcom Set-Top Box (STB) series of chips.

-- 
 http://baruch.siach.name/blog/  ~. .~   Tk Open Systems
=}ooO--U--Ooo{=
   - bar...@tkos.co.il - tel: +972.52.368.4656, http://www.tkos.co.il -


[PATCH v6 5/7] perf util: Create branch.c/.h for common branch functions

2017-04-19 Thread Jin Yao
Create new util/branch.c and util/branch.h to contain the common
branch functions, such as:

branch_type_count(): Count the number of branches of each type
branch_type_name(): Return the name of a branch type
branch_type_stat_display(): Display branch type statistics info
branch_type_str(): Construct the branch type string

The branch type is saved in branch_flags.
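
As a worked example of the area-crossing classification done in
branch_type_count(), the following stand-alone sketch mirrors the masking
logic of cross_area() in user space. The enum and the function names here are
made up for illustration and are not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define AREA_4K	4096ULL
#define AREA_2M	(2ULL * 1024 * 1024)

enum cross_kind { CROSS_NONE, CROSS_4K, CROSS_2M };	/* hypothetical */

/* size must be a power of two; compare the size-aligned base addresses. */
static bool crosses_area(uint64_t addr1, uint64_t addr2, uint64_t size)
{
	return (addr1 & ~(size - 1)) != (addr2 & ~(size - 1));
}

/* The 2M test comes first, so a 2M crossing is not also counted as 4K. */
static enum cross_kind classify_branch(uint64_t from, uint64_t to)
{
	if (crosses_area(from, to, AREA_2M))
		return CROSS_2M;
	if (crosses_area(from, to, AREA_4K))
		return CROSS_4K;
	return CROSS_NONE;
}
```

For instance, a branch from 0x1ff8 to 0x2004 leaves the 4K area based at
0x1000 but stays within the first 2M area, so it counts only as a 4K
crossing.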

Change log
--

v6: Move that multiline conditional code inside {} brackets.
Move branch_type_stat_display() from builtin-report.c to
  branch.c.
Move branch_type_str() from callchain.c to branch.c.

v5: It's a new patch in v5 patch series.

Signed-off-by: Jin Yao 
---
 tools/perf/util/Build|   1 +
 tools/perf/util/branch.c | 168 +++
 tools/perf/util/branch.h |  25 +++
 tools/perf/util/event.h  |   3 +-
 4 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 tools/perf/util/branch.c
 create mode 100644 tools/perf/util/branch.h

diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index f0b9e5d..391bf85 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -91,6 +91,7 @@ libperf-y += vsprintf.o
 libperf-y += drv_configs.o
 libperf-y += time-utils.o
 libperf-y += expr-bison.o
+libperf-y += branch.o
 
 libperf-$(CONFIG_LIBBPF) += bpf-loader.o
 libperf-$(CONFIG_BPF_PROLOGUE) += bpf-prologue.o
diff --git a/tools/perf/util/branch.c b/tools/perf/util/branch.c
new file mode 100644
index 0000000..4aa56ad
--- /dev/null
+++ b/tools/perf/util/branch.c
@@ -0,0 +1,168 @@
+#include "perf.h"
+#include "util/util.h"
+#include "util/debug.h"
+#include "util/branch.h"
+
+static bool cross_area(u64 addr1, u64 addr2, int size)
+{
+   u64 align1, align2;
+
+   align1 = addr1 & ~(size - 1);
+   align2 = addr2 & ~(size - 1);
+
+   return (align1 != align2) ? true : false;
+}
+
+#define AREA_4K	4096
+#define AREA_2M	(2 * 1024 * 1024)
+
+void branch_type_count(struct branch_type_stat *stat,
+  struct branch_flags *flags,
+  u64 from, u64 to)
+{
+   if (flags->type == PERF_BR_NONE || from == 0)
+   return;
+
+   stat->counts[flags->type]++;
+
+   if (flags->type == PERF_BR_JCC) {
+   if (to > from)
+   stat->jcc_fwd++;
+   else
+   stat->jcc_bwd++;
+   }
+
+   if (cross_area(from, to, AREA_2M))
+   stat->cross_2m++;
+   else if (cross_area(from, to, AREA_4K))
+   stat->cross_4k++;
+}
+
+const char *branch_type_name(int type)
+{
+   const char *branch_names[PERF_BR_MAX] = {
+   "N/A",
+   "JCC",
+   "JMP",
+   "IND_JMP",
+   "CALL",
+   "IND_CALL",
+   "RET",
+   "SYSCALL",
+   "SYSRET",
+   "IRQ",
+   "INT",
+   "IRET",
+   "FAR_BRANCH",
+   };
+
+   if (type >= 0 && type < PERF_BR_MAX)
+   return branch_names[type];
+
+   return NULL;
+}
+
+void branch_type_stat_display(FILE *fp, struct branch_type_stat *stat)
+{
+   u64 total = 0;
+   int i;
+
+   for (i = 0; i < PERF_BR_MAX; i++)
+   total += stat->counts[i];
+
+   if (total == 0)
+   return;
+
+   fprintf(fp, "\n#");
+   fprintf(fp, "\n# Branch Statistics:");
+   fprintf(fp, "\n#");
+
+   if (stat->jcc_fwd > 0) {
+   fprintf(fp, "\n%12s: %5.1f%%",
+   "JCC forward",
+   100.0 * (double)stat->jcc_fwd / (double)total);
+   }
+
+   if (stat->jcc_bwd > 0) {
+   fprintf(fp, "\n%12s: %5.1f%%",
+   "JCC backward",
+   100.0 * (double)stat->jcc_bwd / (double)total);
+   }
+
+   if (stat->cross_4k > 0) {
+   fprintf(fp, "\n%12s: %5.1f%%",
+   "CROSS_4K",
+   100.0 * (double)stat->cross_4k / (double)total);
+   }
+
+   if (stat->cross_2m > 0) {
+   fprintf(fp, "\n%12s: %5.1f%%",
+   "CROSS_2M",
+   100.0 * (double)stat->cross_2m / (double)total);
+   }
+
+   for (i = 0; i < PERF_BR_MAX; i++) {
+   if (stat->counts[i] > 0)
+   fprintf(fp, "\n%12s: %5.1f%%",
+   branch_type_name(i),
+   100.0 *
+   (double)stat->counts[i] / (double)total);
+   }
+}
+
+static int count_str_printf(int index, const char *str,
+   char *bf, int bfsize)
+{
+   int printed;
+
+   printed = scnprintf(bf, bfsize,
+   "%s%s",
+   (index) ? " " : " (", str);
+
+   return printed;
+}
+
+int branch_type_str(struct branch_type_stat *stat,
+   char *bf, int bfsize)
+{
+   int i, j = 0, 

[PATCH v6 4/7] perf report: Refactor the branch info printing code

2017-04-19 Thread Jin Yao
Branch info such as predicted/cycles/... is printed with the
callchain entries.

For example: perf report --branch-history --no-children --stdio

--1.07%--main div.c:39 (predicted:52.4% cycles:1 iterations:17)
  main div.c:44 (predicted:52.4% cycles:1)
  main div.c:42 (cycles:2)
  compute_flag div.c:28 (cycles:2)
  compute_flag div.c:27 (cycles:1)
  rand rand.c:28 (cycles:1)
  rand rand.c:28 (cycles:1)
  __random random.c:298 (cycles:1)
  __random random.c:297 (cycles:1)
  __random random.c:295 (cycles:1)
  __random random.c:295 (cycles:1)
  __random random.c:295 (cycles:1)

The current code, however, is difficult to maintain and extend. This
patch refactors it to make maintenance easier.
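
The pattern the refactoring moves to (append each field with a small helper, choosing " (" for the first field and " " for later ones) can be sketched outside of perf as follows. This is only an illustration under stated assumptions: sk_scnprintf() is a hypothetical userspace stand-in for scnprintf() built on vsnprintf(), sk_counts_str() is a hypothetical, reduced version of counts_str_build() that handles only "predicted" and "cycles", and the way the closing ")" is appended is this sketch's choice.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Like snprintf(), but returns the number of characters actually stored. */
static int sk_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (size == 0)
		return 0;

	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);

	if (n < 0) {
		buf[0] = '\0';
		return 0;
	}

	return (size_t)n >= size ? (int)(size - 1) : n;
}

/* Build " (predicted:NN.N% cycles:N)" one field at a time. */
static int sk_counts_str(char *bf, size_t size, uint64_t branch_count,
			 uint64_t predicted_count, uint64_t cycles_count)
{
	int printed = 0, i = 0;
	uint64_t cycles;

	assert(bf != NULL && size > 0);
	bf[0] = '\0';

	if (branch_count == 0)
		return sk_scnprintf(bf, size, " (calltrace)");

	if (predicted_count < branch_count) {
		printed += sk_scnprintf(bf + printed, size - printed,
					"%spredicted:%.1f%%",
					i++ ? " " : " (",
					predicted_count * 100.0 / branch_count);
	}

	cycles = cycles_count / branch_count;
	if (cycles) {
		printed += sk_scnprintf(bf + printed, size - printed,
					"%scycles:%" PRIu64,
					i++ ? " " : " (", cycles);
	}

	/* Close the bracket only if at least one field was printed. */
	if (i)
		printed += sk_scnprintf(bf + printed, size - printed, ")");

	return printed;
}
```

Because every helper call returns the number of characters stored, "bf + printed" and "size - printed" stay within the buffer even when the output is truncated.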

Change log
----------

v6: 1. Put the multiline condition code into {} brackets in
       counts_str_build().

    2. Keep the original display order, that is:
       predicted, abort, cycles, iterations.

v5: New patch in the v5 series.

Signed-off-by: Jin Yao 
---
 tools/perf/util/callchain.c | 106 
 1 file changed, 47 insertions(+), 59 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index 0096d45..d44b5ed 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -1106,83 +1106,71 @@ int callchain_branch_counts(struct callchain_root *root,
  cycles_count);
 }
 
+static int count_pri64_printf(int index, const char *str, u64 value,
+   char *bf, int bfsize)
+{
+   int printed;
+
+   printed = scnprintf(bf, bfsize,
+   "%s%s:%" PRId64 "",
+   (index) ? " " : " (", str, value);
+
+   return printed;
+}
+
+static int count_float_printf(int index, const char *str, float value,
+   char *bf, int bfsize)
+{
+   int printed;
+
+   printed = scnprintf(bf, bfsize,
+   "%s%s:%.1f%%",
+   (index) ? " " : " (", str, value);
+
+   return printed;
+}
+
 static int counts_str_build(char *bf, int bfsize,
 u64 branch_count, u64 predicted_count,
 u64 abort_count, u64 cycles_count,
 u64 iter_count, u64 samples_count)
 {
-   double predicted_percent = 0.0;
-   const char *null_str = "";
-   char iter_str[32];
-   char cycle_str[32];
-   char *istr, *cstr;
u64 cycles;
+   int printed = 0, i = 0;
 
if (branch_count == 0)
return scnprintf(bf, bfsize, " (calltrace)");
 
-   cycles = cycles_count / branch_count;
-
-   if (iter_count && samples_count) {
-   if (cycles > 0)
-   scnprintf(iter_str, sizeof(iter_str),
-" iterations:%" PRId64 "",
-iter_count / samples_count);
-   else
-   scnprintf(iter_str, sizeof(iter_str),
-"iterations:%" PRId64 "",
-iter_count / samples_count);
-   istr = iter_str;
-   } else
-   istr = (char *)null_str;
-
-   if (cycles > 0) {
-   scnprintf(cycle_str, sizeof(cycle_str),
- "cycles:%" PRId64 "", cycles);
-   cstr = cycle_str;
-   } else
-   cstr = (char *)null_str;
-
-   predicted_percent = predicted_count * 100.0 / branch_count;
+   if (predicted_count < branch_count) {
+   printed += count_float_printf(i++, "predicted",
+   predicted_count * 100.0 / branch_count,
+   bf + printed, bfsize - printed);
+   }
 
-   if ((predicted_count == branch_count) && (abort_count == 0)) {
-   if ((cycles > 0) || (istr != (char *)null_str))
-   return scnprintf(bf, bfsize, " (%s%s)", cstr, istr);
-   else
-   return scnprintf(bf, bfsize, "%s", (char *)null_str);
+   if (abort_count) {
+   printed += count_float_printf(i++, "abort",
+   abort_count * 100.0 / branch_count,
+   bf + printed, bfsize - printed);
}
 
-   if ((predicted_count < branch_count) && (abort_count == 0)) {
-   if ((cycles > 0) || (istr != (char *)null_str))
-   return scnprintf(bf, bfsize,
-   " (predicted:%.1f%% %s%s)",
-   predicted_percent, cstr, istr);
-   else {
-   return scnprintf(bf, bfsize,
-   " (predicted:%.1f%%)",
-   predicted_percent);
-   }
+   cycles = cycles_count / branch_count;
+   if (cycles) {
+   

[PATCH v6 6/7] perf report: Show branch type statistics for stdio mode

2017-04-19 Thread Jin Yao
Show the branch type statistics at the end of perf report --stdio.

For example:
perf report --stdio

 JCC forward:  27.6%
JCC backward:  10.0%
    CROSS_4K:   0.0%
    CROSS_2M:  14.3%
         JCC:  37.6%
         JMP:   0.0%
     IND_JMP:   6.5%
        CALL:  26.6%
    IND_CALL:   0.0%
         RET:  29.3%

The branch types are:
---------------------
 JCC forward: Conditional forward jump
JCC backward: Conditional backward jump
         JMP: Jump imm
     IND_JMP: Jump reg/mem
        CALL: Call imm
    IND_CALL: Call reg/mem
         RET: Ret
     SYSCALL: Syscall
      SYSRET: Syscall return
         IRQ: HW interrupt/trap/fault
         INT: SW interrupt
        IRET: Return from interrupt
  FAR_BRANCH: Others not generic branch type

CROSS_4K and CROSS_2M:
----------------------
These metrics count branches that cross a 4K or 2MB page boundary.
The computation is approximate: we don't know whether the area is
mapped with 4K or 2MB pages, so both are always computed.

To keep the output simple, CROSS_4K is not incremented when a branch
crosses a 2M area.
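
The crossing check described above can be sketched as a standalone program fragment. This is an illustration only; the sk_ names are hypothetical, and it assumes the area size is a power of two so that masking with ~(size - 1) yields the aligned base of each address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_AREA_4K	4096ULL
#define SK_AREA_2M	(2ULL * 1024 * 1024)

/* Hypothetical holder for the two crossing counters. */
struct sk_cross_stat {
	uint64_t cross_4k;
	uint64_t cross_2m;
};

/* True when the two addresses fall into different size-aligned areas. */
static bool sk_cross_area(uint64_t addr1, uint64_t addr2, uint64_t size)
{
	/* The mask trick only works for power-of-two sizes. */
	assert(size != 0 && (size & (size - 1)) == 0);

	return (addr1 & ~(size - 1)) != (addr2 & ~(size - 1));
}

/* A 2M crossing takes precedence, so it is not also counted as 4K. */
static void sk_count_cross(struct sk_cross_stat *st,
			   uint64_t from, uint64_t to)
{
	if (sk_cross_area(from, to, SK_AREA_2M))
		st->cross_2m++;
	else if (sk_cross_area(from, to, SK_AREA_4K))
		st->cross_4k++;
}
```

For example, a branch from 0x1ff8 to 0x2000 leaves its 4K area but stays within the same 2M area, so only cross_4k is incremented.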

Change log
----------

v6: Remove branch_type_stat_display() since it has moved to branch.c.

v5: Remove the unnecessary sort__mode check in
    hist_iter__branch_callback().

v4: Compared to the previous version, the major change is:

    Compute JCC forward/JCC backward and the page-crossing checks
    from the from and to addresses.

Signed-off-by: Jin Yao 
---
 tools/perf/builtin-report.c | 25 +
 tools/perf/util/hist.c  |  5 +
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 5bbd4b2..ba5026a 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -37,6 +37,7 @@
 #include "arch/common.h"
 #include "util/time-utils.h"
 #include "util/auxtrace.h"
+#include "util/branch.h"
 
 #include 
 #include 
@@ -68,6 +69,7 @@ struct report {
u64 queue_size;
int socket_filter;
DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
+   struct branch_type_stat brtype_stat;
 };
 
 static int report__config(const char *var, const char *value, void *cb)
@@ -146,6 +148,22 @@ static int hist_iter__report_callback(struct 
hist_entry_iter *iter,
return err;
 }
 
+static int hist_iter__branch_callback(struct hist_entry_iter *iter,
+ struct addr_location *al __maybe_unused,
+ bool single __maybe_unused,
+ void *arg)
+{
+   struct hist_entry *he = iter->he;
+   struct report *rep = arg;
+   struct branch_info *bi;
+
+   bi = he->branch_info;
+   branch_type_count(&rep->brtype_stat, &bi->flags,
+ bi->from.addr, bi->to.addr);
+
+   return 0;
+}
+
 static int process_sample_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
@@ -184,6 +202,8 @@ static int process_sample_event(struct perf_tool *tool,
 */
if (!sample->branch_stack)
goto out_put;
+
+   iter.add_entry_cb = hist_iter__branch_callback;
iter.ops = &hist_iter_branch;
} else if (rep->mem_mode) {
iter.ops = &hist_iter_mem;
@@ -406,6 +426,9 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist 
*evlist,
perf_read_values_destroy(&rep->show_threads_values);
}
 
+   if (sort__mode == SORT_MODE__BRANCH)
+   branch_type_stat_display(stdout, &rep->brtype_stat);
+
return 0;
 }
 
@@ -938,6 +961,8 @@ int cmd_report(int argc, const char **argv)
if (has_br_stack && branch_call_mode)
symbol_conf.show_branchflag_count = true;
 
+   memset(&report.brtype_stat, 0, sizeof(struct branch_type_stat));
+
/*
 * Branch mode is a tristate:
 * -1 means default, so decide based on the file having branch data.
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 65d4275..f3a3be5 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -747,12 +747,9 @@ iter_prepare_branch_entry(struct hist_entry_iter *iter, 
struct addr_location *al
 }
 
 static int
-iter_add_single_branch_entry(struct hist_entry_iter *iter,
+iter_add_single_branch_entry(struct hist_entry_iter *iter __maybe_unused,
 struct addr_location *al __maybe_unused)
 {
-   /* to avoid calling callback function */
-   iter->he = NULL;
-
return 0;
 }
 
-- 
2.7.4



[PATCH v6 1/7] perf/core: Define the common branch type classification

2017-04-19 Thread Jin Yao
It is often useful to know the branch types while analyzing branch
data. For example, a call is very different from a conditional branch.

Currently the branch type has to be looked up in the binary, but the
binary may not be available later, and even when it is, the lookup
takes the user some time. It is very useful for the user to be able to
check the type directly in perf report.

Perf already has support for disassembling the branch instruction
to get the x86 branch type.

To keep the kernel and userspace consistent and make the
classification generic, this patch adds a common branch type
classification to perf_event.h.

PERF_BR_NONE      : unknown
PERF_BR_JCC       : conditional jump
PERF_BR_JMP       : jump
PERF_BR_IND_JMP   : indirect jump
PERF_BR_CALL      : call
PERF_BR_IND_CALL  : indirect call
PERF_BR_RET       : return
PERF_BR_SYSCALL   : syscall
PERF_BR_SYSRET    : syscall return
PERF_BR_IRQ       : hw interrupt/trap/fault
PERF_BR_INT       : sw interrupt
PERF_BR_IRET      : return from interrupt
PERF_BR_FAR_BRANCH: not generic far branch type

The patch also adds a new 4-bit field, type, to perf_branch_entry
to record the branch type.

Since disassembling the branch instruction adds some overhead,
a new PERF_SAMPLE_BRANCH_TYPE_SAVE flag is introduced to indicate
whether the kernel needs to disassemble the branch instruction and
record the branch type.

Change log
----------

v6: Not changed.

v5: Not changed. The v5 patch series only changes userspace.

v4: Compared to the previous version, the major changes are:

    1. Remove PERF_BR_JCC_FWD/PERF_BR_JCC_BWD; they will be
       computed later in userspace.

    2. Remove the "cross" field in perf_branch_entry. The page-crossing
       computation will be done later in userspace.

Signed-off-by: Jin Yao 
---
 include/uapi/linux/perf_event.h   | 29 -
 tools/include/uapi/linux/perf_event.h | 29 -
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d09a9cd..69af012 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -174,6 +174,8 @@ enum perf_branch_sample_type_shift {
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT   = 14, /* no flags */
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT  = 15, /* no cycles */
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT  = 16, /* save branch type */
+
PERF_SAMPLE_BRANCH_MAX_SHIFT/* non-ABI */
 };
 
@@ -198,9 +200,32 @@ enum perf_branch_sample_type {
PERF_SAMPLE_BRANCH_NO_FLAGS = 1U << 
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT,
PERF_SAMPLE_BRANCH_NO_CYCLES= 1U << 
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT,
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE=
+   1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT,
+
PERF_SAMPLE_BRANCH_MAX  = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
 };
 
+/*
+ * Common flow change classification
+ */
+enum {
+   PERF_BR_NONE        = 0,    /* unknown */
+   PERF_BR_JCC         = 1,    /* conditional jump */
+   PERF_BR_JMP         = 2,    /* jump */
+   PERF_BR_IND_JMP     = 3,    /* indirect jump */
+   PERF_BR_CALL        = 4,    /* call */
+   PERF_BR_IND_CALL    = 5,    /* indirect call */
+   PERF_BR_RET         = 6,    /* return */
+   PERF_BR_SYSCALL     = 7,    /* syscall */
+   PERF_BR_SYSRET      = 8,    /* syscall return */
+   PERF_BR_IRQ         = 9,    /* hw interrupt/trap/fault */
+   PERF_BR_INT         = 10,   /* sw interrupt */
+   PERF_BR_IRET        = 11,   /* return from interrupt */
+   PERF_BR_FAR_BRANCH  = 12,   /* not generic far branch type */
+   PERF_BR_MAX,
+};
+
 #define PERF_SAMPLE_BRANCH_PLM_ALL \
(PERF_SAMPLE_BRANCH_USER|\
 PERF_SAMPLE_BRANCH_KERNEL|\
@@ -999,6 +1024,7 @@ union perf_mem_data_src {
  * in_tx: running in a hardware transaction
  * abort: aborting a hardware transaction
  *cycles: cycles from last branch (or 0 if not supported)
+ *  type: branch type
  */
 struct perf_branch_entry {
__u64   from;
@@ -1008,7 +1034,8 @@ struct perf_branch_entry {
in_tx:1,/* in transaction */
abort:1,/* transaction abort */
cycles:16,  /* cycle count to last branch */
-   reserved:44;
+   type:4, /* branch type */
+   reserved:40;
 };
 
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/tools/include/uapi/linux/perf_event.h 
b/tools/include/uapi/linux/perf_event.h
index d09a9cd..69af012 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -174,6 +174,8 @@ enum perf_branch_sample_type_shift {
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT   = 14, /* no flags */
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT  = 15, /* no cycles */
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT  = 16, /* save 

[PATCH v6 3/7] perf record: Create a new option save_type in --branch-filter

2017-04-19 Thread Jin Yao
The option tells the kernel to save the branch type during sampling.

One example:
perf record -g --branch-filter any,save_type 
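
How a comma-separated filter string such as "any,save_type" ends up as a bit mask can be sketched as below. This is not perf's parser: the table, the sk_ names and the function are hypothetical, and the bit positions for "any" and "cond" are assumptions made for the sketch (only bit 16 for save_type is taken from patch 1/7 of this series).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch-only mask values; only bit 16 comes from this patch series. */
#define SK_BRANCH_ANY		(1ULL << 3)	/* assumed bit position */
#define SK_BRANCH_COND		(1ULL << 10)	/* assumed bit position */
#define SK_BRANCH_TYPE_SAVE	(1ULL << 16)

struct sk_branch_mode {
	const char *name;
	uint64_t mode;
};

static const struct sk_branch_mode sk_modes[] = {
	{ "any",	SK_BRANCH_ANY },
	{ "cond",	SK_BRANCH_COND },
	{ "save_type",	SK_BRANCH_TYPE_SAVE },
};

/* OR together the masks of all names in str; -1 on an unknown name. */
static int sk_parse_branch_filter(const char *str, uint64_t *mask)
{
	const size_t nr = sizeof(sk_modes) / sizeof(sk_modes[0]);

	assert(str != NULL && mask != NULL);
	*mask = 0;

	while (*str) {
		size_t len = strcspn(str, ",");	/* length of this token */
		size_t i;

		for (i = 0; i < nr; i++) {
			if (strlen(sk_modes[i].name) == len &&
			    strncmp(str, sk_modes[i].name, len) == 0) {
				*mask |= sk_modes[i].mode;
				break;
			}
		}
		if (i == nr)
			return -1;	/* unknown or empty token */

		str += len;
		if (*str == ',')
			str++;
	}

	return 0;
}
```

Adding a new option in this scheme is a one-line table entry, which matches the shape of the one-line BRANCH_OPT() addition in this patch.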

Change log
----------

v6: Not changed.

v5: Not changed.

Signed-off-by: Jin Yao 
---
 tools/perf/Documentation/perf-record.txt | 1 +
 tools/perf/util/parse-branch-options.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt 
b/tools/perf/Documentation/perf-record.txt
index ea3789d..e2f5a4f 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -332,6 +332,7 @@ following filters are defined:
- no_tx: only when the target is not in a hardware transaction
- abort_tx: only when the target is a hardware transaction abort
- cond: conditional branches
+   - save_type: save branch type during sampling in case binary is not 
available later
 
 +
 The option requires at least one branch type among any, any_call, any_ret, 
ind_call, cond.
diff --git a/tools/perf/util/parse-branch-options.c 
b/tools/perf/util/parse-branch-options.c
index 38fd115..e71fb5f 100644
--- a/tools/perf/util/parse-branch-options.c
+++ b/tools/perf/util/parse-branch-options.c
@@ -28,6 +28,7 @@ static const struct branch_mode branch_modes[] = {
BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_COND),
BRANCH_OPT("ind_jmp", PERF_SAMPLE_BRANCH_IND_JUMP),
BRANCH_OPT("call", PERF_SAMPLE_BRANCH_CALL),
+   BRANCH_OPT("save_type", PERF_SAMPLE_BRANCH_TYPE_SAVE),
BRANCH_END
 };
 
-- 
2.7.4



[PATCH v6 6/7] perf report: Show branch type statistics for stdio mode

2017-04-19 Thread Jin Yao
Show the branch type statistics at the end of perf report --stdio.

For example:
perf report --stdio

 JCC forward:  27.6%
JCC backward:  10.0%
CROSS_4K:   0.0%
CROSS_2M:  14.3%
 JCC:  37.6%
 JMP:   0.0%
 IND_JMP:   6.5%
CALL:  26.6%
IND_CALL:   0.0%
 RET:  29.3%

The branch types are:
-
 JCC forward: Conditional forward jump
JCC backward: Conditional backward jump
 JMP: Jump imm
 IND_JMP: Jump reg/mem
CALL: Call imm
IND_CALL: Call reg/mem
 RET: Ret
 SYSCALL: Syscall
  SYSRET: Syscall return
 IRQ: HW interrupt/trap/fault
 INT: SW interrupt
IRET: Return from interrupt
  FAR_BRANCH: Others not generic branch type

CROSS_4K and CROSS_2M:
--
They are the metrics checking for branches cross 4K or 2MB pages.
It's an approximate computing. We don't know if the area is 4K or
2MB, so always compute both.

To make the output simple, if a branch crosses 2M area, CROSS_4K
will not be incremented.

Change log
--

v6: Remove branch_type_stat_display() since it's moved to branch.c.

v5: Remove the unnecessary sort__mode checking in
hist_iter__branch_callback().

v4: Comparing to previous version, the major changes are:

Add the computing of JCC forward/JCC backward and cross page checking
by using the from and to addresses.

Signed-off-by: Jin Yao 
---
 tools/perf/builtin-report.c | 25 +
 tools/perf/util/hist.c  |  5 +
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/tools/perf/builtin-report.c b/tools/perf/builtin-report.c
index 5bbd4b2..ba5026a 100644
--- a/tools/perf/builtin-report.c
+++ b/tools/perf/builtin-report.c
@@ -37,6 +37,7 @@
 #include "arch/common.h"
 #include "util/time-utils.h"
 #include "util/auxtrace.h"
+#include "util/branch.h"
 
 #include 
 #include 
@@ -68,6 +69,7 @@ struct report {
u64 queue_size;
int socket_filter;
DECLARE_BITMAP(cpu_bitmap, MAX_NR_CPUS);
+   struct branch_type_stat brtype_stat;
 };
 
 static int report__config(const char *var, const char *value, void *cb)
@@ -146,6 +148,22 @@ static int hist_iter__report_callback(struct 
hist_entry_iter *iter,
return err;
 }
 
+static int hist_iter__branch_callback(struct hist_entry_iter *iter,
+ struct addr_location *al __maybe_unused,
+ bool single __maybe_unused,
+ void *arg)
+{
+   struct hist_entry *he = iter->he;
+   struct report *rep = arg;
+   struct branch_info *bi;
+
+   bi = he->branch_info;
+   branch_type_count(>brtype_stat, >flags,
+ bi->from.addr, bi->to.addr);
+
+   return 0;
+}
+
 static int process_sample_event(struct perf_tool *tool,
union perf_event *event,
struct perf_sample *sample,
@@ -184,6 +202,8 @@ static int process_sample_event(struct perf_tool *tool,
 */
if (!sample->branch_stack)
goto out_put;
+
+   iter.add_entry_cb = hist_iter__branch_callback;
iter.ops = _iter_branch;
} else if (rep->mem_mode) {
iter.ops = _iter_mem;
@@ -406,6 +426,9 @@ static int perf_evlist__tty_browse_hists(struct perf_evlist 
*evlist,
perf_read_values_destroy(>show_threads_values);
}
 
+   if (sort__mode == SORT_MODE__BRANCH)
+   branch_type_stat_display(stdout, >brtype_stat);
+
return 0;
 }
 
@@ -938,6 +961,8 @@ int cmd_report(int argc, const char **argv)
if (has_br_stack && branch_call_mode)
symbol_conf.show_branchflag_count = true;
 
+   memset(_stat, 0, sizeof(struct branch_type_stat));
+
/*
 * Branch mode is a tristate:
 * -1 means default, so decide based on the file having branch data.
diff --git a/tools/perf/util/hist.c b/tools/perf/util/hist.c
index 65d4275..f3a3be5 100644
--- a/tools/perf/util/hist.c
+++ b/tools/perf/util/hist.c
@@ -747,12 +747,9 @@ iter_prepare_branch_entry(struct hist_entry_iter *iter, 
struct addr_location *al
 }
 
 static int
-iter_add_single_branch_entry(struct hist_entry_iter *iter,
+iter_add_single_branch_entry(struct hist_entry_iter *iter __maybe_unused,
 struct addr_location *al __maybe_unused)
 {
-   /* to avoid calling callback function */
-   iter->he = NULL;
-
return 0;
 }
 
-- 
2.7.4



[PATCH v6 1/7] perf/core: Define the common branch type classification

2017-04-19 Thread Jin Yao
It is often useful to know the branch types while analyzing branch
data. For example, a call is very different from a conditional branch.

Currently we have to look it up in binary while the binary may later
not be available and even the binary is available but user has to take
some time. It is very useful for user to check it directly in perf
report.

Perf already has support for disassembling the branch instruction
to get the x86 branch type.

To keep consistent on kernel and userspace and make the classification
more common, the patch adds the common branch type classification
in perf_event.h.

PERF_BR_NONE  : unknown
PERF_BR_JCC   : conditional jump
PERF_BR_JMP   : jump
PERF_BR_IND_JMP   : indirect jump
PERF_BR_CALL  : call
PERF_BR_IND_CALL  : indirect call
PERF_BR_RET   : return
PERF_BR_SYSCALL   : syscall
PERF_BR_SYSRET: syscall return
PERF_BR_IRQ   : hw interrupt/trap/fault
PERF_BR_INT   : sw interrupt
PERF_BR_IRET  : return from interrupt
PERF_BR_FAR_BRANCH: not generic far branch type

The patch also adds a new field type (4 bits) in perf_branch_entry
to record the branch type.

Since the disassembling of branch instruction needs some overhead,
a new PERF_SAMPLE_BRANCH_TYPE_SAVE is introduced to indicate if it
needs to disassemble the branch instruction and record the branch
type.

Change log
--

v6: Not changed.

v5: Not changed. The v5 patch series just change the userspace.

v4: Comparing to previous version, the major changes are:

1. Remove the PERF_BR_JCC_FWD/PERF_BR_JCC_BWD, they will be
   computed later in userspace.

2. Remove the "cross" field in perf_branch_entry. The cross page
   computing will be done later in userspace.

Signed-off-by: Jin Yao 
---
 include/uapi/linux/perf_event.h   | 29 -
 tools/include/uapi/linux/perf_event.h | 29 -
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index d09a9cd..69af012 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -174,6 +174,8 @@ enum perf_branch_sample_type_shift {
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT   = 14, /* no flags */
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT  = 15, /* no cycles */
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT  = 16, /* save branch type */
+
PERF_SAMPLE_BRANCH_MAX_SHIFT/* non-ABI */
 };
 
@@ -198,9 +200,32 @@ enum perf_branch_sample_type {
PERF_SAMPLE_BRANCH_NO_FLAGS = 1U << 
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT,
PERF_SAMPLE_BRANCH_NO_CYCLES= 1U << 
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT,
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE=
+   1U << PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT,
+
PERF_SAMPLE_BRANCH_MAX  = 1U << PERF_SAMPLE_BRANCH_MAX_SHIFT,
 };
 
+/*
+ * Common flow change classification
+ */
+enum {
+   PERF_BR_NONE        = 0,    /* unknown */
+   PERF_BR_JCC         = 1,    /* conditional jump */
+   PERF_BR_JMP         = 2,    /* jump */
+   PERF_BR_IND_JMP     = 3,    /* indirect jump */
+   PERF_BR_CALL        = 4,    /* call */
+   PERF_BR_IND_CALL    = 5,    /* indirect call */
+   PERF_BR_RET         = 6,    /* return */
+   PERF_BR_SYSCALL     = 7,    /* syscall */
+   PERF_BR_SYSRET      = 8,    /* syscall return */
+   PERF_BR_IRQ         = 9,    /* hw interrupt/trap/fault */
+   PERF_BR_INT         = 10,   /* sw interrupt */
+   PERF_BR_IRET        = 11,   /* return from interrupt */
+   PERF_BR_FAR_BRANCH  = 12,   /* not generic far branch type */
+   PERF_BR_MAX,
+};
+
 #define PERF_SAMPLE_BRANCH_PLM_ALL \
(PERF_SAMPLE_BRANCH_USER|\
 PERF_SAMPLE_BRANCH_KERNEL|\
@@ -999,6 +1024,7 @@ union perf_mem_data_src {
  * in_tx: running in a hardware transaction
  * abort: aborting a hardware transaction
  *cycles: cycles from last branch (or 0 if not supported)
+ *  type: branch type
  */
 struct perf_branch_entry {
__u64   from;
@@ -1008,7 +1034,8 @@ struct perf_branch_entry {
in_tx:1,        /* in transaction */
abort:1,        /* transaction abort */
cycles:16,      /* cycle count to last branch */
-   reserved:44;
+   type:4,     /* branch type */
+   reserved:40;
 };
 
 #endif /* _UAPI_LINUX_PERF_EVENT_H */
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index d09a9cd..69af012 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -174,6 +174,8 @@ enum perf_branch_sample_type_shift {
PERF_SAMPLE_BRANCH_NO_FLAGS_SHIFT   = 14, /* no flags */
PERF_SAMPLE_BRANCH_NO_CYCLES_SHIFT  = 15, /* no cycles */
 
+   PERF_SAMPLE_BRANCH_TYPE_SAVE_SHIFT  = 16, /* save branch type */
+

[PATCH v6 3/7] perf record: Create a new option save_type in --branch-filter

2017-04-19 Thread Jin Yao
The option tells the kernel to save the branch type during sampling.

One example:
perf record -g --branch-filter any,save_type 
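
How a filter string such as "any,save_type" turns into a sample-type
mask can be sketched as below. This is only an illustration under
stated assumptions: the demo_* names and the table are hypothetical,
the bit positions for "any" and "any_call" are assumed, and only the
shift of 16 for save_type is taken from patch 1/7. The real parsing
lives in tools/perf/util/parse-branch-options.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Demo bit values; bit 16 for save_type follows patch 1/7, the rest are assumed. */
#define DEMO_BRANCH_ANY        (1U << 3)
#define DEMO_BRANCH_ANY_CALL   (1U << 4)
#define DEMO_BRANCH_TYPE_SAVE  (1U << 16)

struct demo_branch_mode {
	const char *name;
	uint32_t mask;
};

/* Name-to-mask table, terminated by a NULL name. */
static const struct demo_branch_mode demo_modes[] = {
	{ "any",       DEMO_BRANCH_ANY },
	{ "any_call",  DEMO_BRANCH_ANY_CALL },
	{ "save_type", DEMO_BRANCH_TYPE_SAVE },
	{ NULL, 0 },
};

/*
 * Parse a comma-separated filter list such as "any,save_type".
 * Returns 0 and stores the combined mask on success, or -1 if a
 * name is empty or not in the table (*mask is left untouched).
 */
static int demo_parse_branch_filter(const char *str, uint32_t *mask)
{
	uint32_t out = 0;

	assert(str && mask);

	while (*str) {
		size_t len = strcspn(str, ",");
		const struct demo_branch_mode *m;

		for (m = demo_modes; m->name; m++) {
			if (strlen(m->name) == len && !strncmp(m->name, str, len))
				break;
		}
		if (!m->name)
			return -1;

		out |= m->mask;
		str += len;
		if (*str == ',')
			str++;
	}

	*mask = out;
	return 0;
}
```

With this table, "any,save_type" yields the "any" bit ORed with bit 16,
which is the combination the kernel side checks before recording the
branch type.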

Change log
--

v6: Not changed.

v5: Not changed.

Signed-off-by: Jin Yao 
---
 tools/perf/Documentation/perf-record.txt | 1 +
 tools/perf/util/parse-branch-options.c   | 1 +
 2 files changed, 2 insertions(+)

diff --git a/tools/perf/Documentation/perf-record.txt b/tools/perf/Documentation/perf-record.txt
index ea3789d..e2f5a4f 100644
--- a/tools/perf/Documentation/perf-record.txt
+++ b/tools/perf/Documentation/perf-record.txt
@@ -332,6 +332,7 @@ following filters are defined:
- no_tx: only when the target is not in a hardware transaction
- abort_tx: only when the target is a hardware transaction abort
- cond: conditional branches
+   - save_type: save branch type during sampling in case binary is not available later
 
 +
 The option requires at least one branch type among any, any_call, any_ret, ind_call, cond.
diff --git a/tools/perf/util/parse-branch-options.c b/tools/perf/util/parse-branch-options.c
index 38fd115..e71fb5f 100644
--- a/tools/perf/util/parse-branch-options.c
+++ b/tools/perf/util/parse-branch-options.c
@@ -28,6 +28,7 @@ static const struct branch_mode branch_modes[] = {
BRANCH_OPT("cond", PERF_SAMPLE_BRANCH_COND),
BRANCH_OPT("ind_jmp", PERF_SAMPLE_BRANCH_IND_JUMP),
BRANCH_OPT("call", PERF_SAMPLE_BRANCH_CALL),
+   BRANCH_OPT("save_type", PERF_SAMPLE_BRANCH_TYPE_SAVE),
BRANCH_END
 };
 
-- 
2.7.4



[PATCH v6 7/7] perf report: Show branch type in callchain entry

2017-04-19 Thread Jin Yao
Show branch type in callchain entry. The branch type is printed
with other LBR information (such as cycles/abort/...).

For example:
perf report --branch-history --stdio --no-children

--24.21%--main div.c:42 (RET CROSS_2M cycles:2)
  compute_flag div.c:28 (cycles:2)
  compute_flag div.c:27 (RET CROSS_2M cycles:1)
  rand rand.c:28 (cycles:1)
  rand rand.c:28 (RET CROSS_2M cycles:1)
  __random random.c:298 (cycles:1)
  __random random.c:297 (JCC backward CROSS_2M cycles:1)
  __random random.c:295 (cycles:1)
  __random random.c:295 (JCC backward CROSS_2M cycles:1)
  __random random.c:295 (cycles:1)
  __random random.c:295 (RET CROSS_2M cycles:9)
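
The parenthesised suffix in the output above could be assembled along
the following lines. This is a minimal sketch with a hypothetical
helper, demo_branch_info_str(); the series itself builds the string in
counts_str_build() and branch_type_str(), whose full bodies are not
shown in this message.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format "(TYPE CROSS_2M cycles:N)" into bf. The type name and the
 * CROSS_2M tag are optional. Returns the number of characters
 * written, or -1 if the result did not fit into the buffer.
 */
static int demo_branch_info_str(char *bf, size_t size, const char *type,
				int cross_2m, unsigned long cycles)
{
	int n;

	assert(bf && size > 0);

	n = snprintf(bf, size, "(%s%s%scycles:%lu)",
		     type ? type : "", type ? " " : "",
		     cross_2m ? "CROSS_2M " : "", cycles);

	return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Calling it with ("RET", 1, 2) gives "(RET CROSS_2M cycles:2)", and with
(NULL, 0, 2) it gives "(cycles:2)", matching the two kinds of lines in
the example.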

Change log
--

v6: Remove branch_type_str(), since it has moved to branch.c.

v5: Rewrite the branch info print code in util/callchain.c.

v4: Compared to the previous version, the major changes are:

The JCC forward/JCC backward and cross-page checks have to be
computed in user space from the from and to addresses, but each
callchain entry contains only one ip (either from or to). This
patch therefore appends the branch from address to the callchain
entry that contains just the to ip.
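
The user-space computation mentioned here needs both addresses, which
is why the from address has to travel with the entry. A minimal sketch
follows; the helper names are hypothetical, and treating a target at or
below the source as "backward" is an assumption rather than something
this message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_AREA_4K  4096ULL
#define DEMO_AREA_2M  (2048ULL * 1024)

/*
 * True if 'from' and 'to' lie in different naturally aligned areas
 * of 'size' bytes. 'size' must be a power of two.
 */
static bool demo_cross_area(uint64_t from, uint64_t to, uint64_t size)
{
	assert(size && (size & (size - 1)) == 0);
	return (from & ~(size - 1)) != (to & ~(size - 1));
}

/*
 * Classify a conditional jump as backward when the target is not
 * above the source address (assumed convention).
 */
static bool demo_jcc_backward(uint64_t from, uint64_t to)
{
	return to <= from;
}
```

Neither check can be made from a single ip, so a callchain entry that
stores only the to address cannot produce "JCC backward" or "CROSS_2M"
without the extra branch_from value.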

Signed-off-by: Jin Yao 
---
 tools/perf/util/callchain.c | 38 +-
 tools/perf/util/callchain.h |  5 -
 tools/perf/util/machine.c   | 26 +-
 3 files changed, 50 insertions(+), 19 deletions(-)

diff --git a/tools/perf/util/callchain.c b/tools/perf/util/callchain.c
index d44b5ed..cfae50d 100644
--- a/tools/perf/util/callchain.c
+++ b/tools/perf/util/callchain.c
@@ -23,6 +23,7 @@
 #include "sort.h"
 #include "machine.h"
 #include "callchain.h"
+#include "branch.h"
 
 __thread struct callchain_cursor callchain_cursor;
 
@@ -468,6 +469,11 @@ fill_node(struct callchain_node *node, struct callchain_cursor *cursor)
call->cycles_count = cursor_node->branch_flags.cycles;
call->iter_count = cursor_node->nr_loop_iter;
call->samples_count = cursor_node->samples;
+
+   branch_type_count(&call->brtype_stat,
+ &cursor_node->branch_flags,
+ cursor_node->branch_from,
+ cursor_node->ip);
}
 
list_add_tail(&call->list, &node->val);
@@ -580,6 +586,11 @@ static enum match_result match_chain(struct callchain_cursor_node *node,
cnode->cycles_count += node->branch_flags.cycles;
cnode->iter_count += node->nr_loop_iter;
cnode->samples_count += node->samples;
+
+   branch_type_count(&cnode->brtype_stat,
+ &node->branch_flags,
+ node->branch_from,
+ node->ip);
}
 
return MATCH_EQ;
@@ -814,7 +825,7 @@ merge_chain_branch(struct callchain_cursor *cursor,
list_for_each_entry_safe(list, next_list, &src->val, list) {
callchain_cursor_append(cursor, list->ip,
list->ms.map, list->ms.sym,
-   false, NULL, 0, 0);
+   false, NULL, 0, 0, 0);
list_del(&list->list);
map__zput(list->ms.map);
free(list);
@@ -854,7 +865,7 @@ int callchain_merge(struct callchain_cursor *cursor,
 int callchain_cursor_append(struct callchain_cursor *cursor,
u64 ip, struct map *map, struct symbol *sym,
bool branch, struct branch_flags *flags,
-   int nr_loop_iter, int samples)
+   int nr_loop_iter, int samples, u64 branch_from)
 {
struct callchain_cursor_node *node = *cursor->last;
 
@@ -878,6 +889,7 @@ int callchain_cursor_append(struct callchain_cursor *cursor,
memcpy(&node->branch_flags, flags,
sizeof(struct branch_flags));
 
+   node->branch_from = branch_from;
cursor->nr++;
 
cursor->last = &node->next;
@@ -1133,14 +1145,19 @@ static int count_float_printf(int index, const char *str, float value,
 static int counts_str_build(char *bf, int bfsize,
 u64 branch_count, u64 predicted_count,
 u64 abort_count, u64 cycles_count,
-u64 iter_count, u64 samples_count)
+u64 iter_count, u64 samples_count,
+struct branch_type_stat *brtype_stat)
 {
u64 cycles;
-   int printed = 0, i = 0;
+   int printed, i = 0;
 
if (branch_count == 0)
return scnprintf(bf, bfsize, " (calltrace)");
 
+   printed = branch_type_str(brtype_stat, bf, bfsize);
+ 

[PATCH v6 2/7] perf/x86/intel: Record branch type

2017-04-19 Thread Jin Yao
Perf already has support for disassembling the branch instruction
and using the branch type for filtering. The patch just records
the branch type in perf_branch_entry.

Before recording, the patch converts the x86 branch type to
common branch type.

Change log
--

v6: Not changed.

v5: Just fix the merge error. No other update.

v4: Compared to the previous version, the major changes are:

1. Use a lookup table to convert the x86 branch type to the common
   branch type.

2. Move the JCC forward/JCC backward and cross-page computation to
   user space.

3. Initialize the branch type to 0 in intel_pmu_lbr_read_32 and
   intel_pmu_lbr_read_64.
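
The lookup-table conversion from item 1 can be tried in isolation with
a cut-down table. The sketch below is not the kernel code: the DEMO_*
constants are hypothetical one-hot flags standing in for X86_BR_*, and
only four entries are mapped. It follows the same idea as
common_branch_type() in the diff: skip the two privilege bits, then
return the table entry for the lowest set bit.

```c
#include <assert.h>

/* Common types; the values follow PERF_BR_* from patch 1/7. */
enum {
	DEMO_BR_NONE = 0,
	DEMO_BR_JCC  = 1,
	DEMO_BR_JMP  = 2,
	DEMO_BR_CALL = 4,
	DEMO_BR_RET  = 6,
};

/* Hypothetical arch-specific one-hot flags (stand-ins for X86_BR_*). */
enum {
	DEMO_ARCH_USER   = 1 << 0,
	DEMO_ARCH_KERNEL = 1 << 1,
	DEMO_ARCH_CALL   = 1 << 2,
	DEMO_ARCH_RET    = 1 << 3,
	DEMO_ARCH_JCC    = 1 << 4,
	DEMO_ARCH_JMP    = 1 << 5,
};

#define DEMO_MAP_MAX 4

/* Map an arch-specific flag word to a common branch type. */
static int demo_common_branch_type(int type)
{
	/* Indexed by bit position after the privilege bits are dropped. */
	static const int map[DEMO_MAP_MAX] = {
		DEMO_BR_CALL,	/* DEMO_ARCH_CALL */
		DEMO_BR_RET,	/* DEMO_ARCH_RET */
		DEMO_BR_JCC,	/* DEMO_ARCH_JCC */
		DEMO_BR_JMP,	/* DEMO_ARCH_JMP */
	};
	int i;

	assert(type >= 0);
	type >>= 2; /* skip DEMO_ARCH_USER and DEMO_ARCH_KERNEL */

	for (i = 0; i < DEMO_MAP_MAX; i++) {
		if (type & 1)
			return map[i];
		type >>= 1;
	}

	return DEMO_BR_NONE;
}
```

Because the table is indexed by bit position, its order has to match
the order of the flag definitions exactly; adding a flag in the middle
of the enum would silently shift every later mapping.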

Signed-off-by: Jin Yao 
---
 arch/x86/events/intel/lbr.c | 53 -
 1 file changed, 52 insertions(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index f924629..f10a7ed 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -109,6 +109,9 @@ enum {
X86_BR_ZERO_CALL    = 1 << 15, /* zero length call */
X86_BR_CALL_STACK   = 1 << 16, /* call stack */
X86_BR_IND_JMP      = 1 << 17, /* indirect jump */
+
+   X86_BR_TYPE_SAVE    = 1 << 18, /* indicate to save branch type */
+
 };
 
 #define X86_BR_PLM (X86_BR_USER | X86_BR_KERNEL)
@@ -510,6 +513,7 @@ static void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
cpuc->lbr_entries[i].in_tx  = 0;
cpuc->lbr_entries[i].abort  = 0;
cpuc->lbr_entries[i].cycles = 0;
+   cpuc->lbr_entries[i].type   = 0;
cpuc->lbr_entries[i].reserved   = 0;
}
cpuc->lbr_stack.nr = i;
@@ -596,6 +600,7 @@ static void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
cpuc->lbr_entries[out].in_tx = in_tx;
cpuc->lbr_entries[out].abort = abort;
cpuc->lbr_entries[out].cycles= cycles;
+   cpuc->lbr_entries[out].type  = 0;
cpuc->lbr_entries[out].reserved  = 0;
out++;
}
@@ -673,6 +678,10 @@ static int intel_pmu_setup_sw_lbr_filter(struct perf_event *event)
 
if (br_type & PERF_SAMPLE_BRANCH_CALL)
mask |= X86_BR_CALL | X86_BR_ZERO_CALL;
+
+   if (br_type & PERF_SAMPLE_BRANCH_TYPE_SAVE)
+   mask |= X86_BR_TYPE_SAVE;
+
/*
 * stash actual user request into reg, it may
 * be used by fixup code for some CPU
@@ -926,6 +935,44 @@ static int branch_type(unsigned long from, unsigned long to, int abort)
return ret;
 }
 
+#define X86_BR_TYPE_MAP_MAX 16
+
+static int
+common_branch_type(int type)
+{
+   int i, mask;
+   const int branch_map[X86_BR_TYPE_MAP_MAX] = {
+   PERF_BR_CALL,       /* X86_BR_CALL */
+   PERF_BR_RET,        /* X86_BR_RET */
+   PERF_BR_SYSCALL,    /* X86_BR_SYSCALL */
+   PERF_BR_SYSRET,     /* X86_BR_SYSRET */
+   PERF_BR_INT,        /* X86_BR_INT */
+   PERF_BR_IRET,       /* X86_BR_IRET */
+   PERF_BR_JCC,        /* X86_BR_JCC */
+   PERF_BR_JMP,        /* X86_BR_JMP */
+   PERF_BR_IRQ,        /* X86_BR_IRQ */
+   PERF_BR_IND_CALL,   /* X86_BR_IND_CALL */
+   PERF_BR_NONE,       /* X86_BR_ABORT */
+   PERF_BR_NONE,       /* X86_BR_IN_TX */
+   PERF_BR_NONE,       /* X86_BR_NO_TX */
+   PERF_BR_CALL,       /* X86_BR_ZERO_CALL */
+   PERF_BR_NONE,       /* X86_BR_CALL_STACK */
+   PERF_BR_IND_JMP,    /* X86_BR_IND_JMP */
+   };
+
+   type >>= 2; /* skip X86_BR_USER and X86_BR_KERNEL */
+   mask = ~(~0 << 1);
+
+   for (i = 0; i < X86_BR_TYPE_MAP_MAX; i++) {
+   if (type & mask)
+   return branch_map[i];
+
+   type >>= 1;
+   }
+
+   return PERF_BR_NONE;
+}
+
 /*
  * implement actual branch filter based on user demand.
  * Hardware may not exactly satisfy that request, thus
@@ -942,7 +989,8 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
bool compress = false;
 
/* if sampling all branches, then nothing to filter */
-   if ((br_sel & X86_BR_ALL) == X86_BR_ALL)
+   if (((br_sel & X86_BR_ALL) == X86_BR_ALL) &&
+   ((br_sel & X86_BR_TYPE_SAVE) != X86_BR_TYPE_SAVE))
return;
 
for (i = 0; i < cpuc->lbr_stack.nr; i++) {
@@ -963,6 +1011,9 @@ intel_pmu_lbr_filter(struct cpu_hw_events *cpuc)
cpuc->lbr_entries[i].from = 0;
compress = true;
}
+
+   if ((br_sel & X86_BR_TYPE_SAVE) == X86_BR_TYPE_SAVE)
+   cpuc->lbr_entries[i].type = common_branch_type(type);
}
 
if (!compress)
-- 
2.7.4



[PATCH v6 0/7] perf report: Show branch type

2017-04-19 Thread Jin Yao
v6:
---
   Updated according to the review comments from
   Jiri Olsa. The major modifications are:

   1. Move the multiline conditional code inside {} braces.

   2. Move branch_type_stat_display() from builtin-report.c to
  branch.c. Move branch_type_str() from callchain.c to
  branch.c.

   3. Keep the original branch info display order, that is:
  predicted, abort, cycles, iterations

v5:
---
   The v5 patch series is mainly updated according to
   comments from Jiri Olsa.

   The kernel part has no functional change. It only
   resolves a merge issue.

   In userspace, the functions of branch type counting and
   branch type name resolving are moved to the new files: 
   util/branch.c, util/branch.h.
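
What "branch type counting" and "branch type name resolving" amount to
can be sketched as follows. The contents of util/branch.c are not shown
in this message, so every demo_* name and the name strings other than
"JCC" and "RET" (which appear in the example output of patch 7/7) are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Values follow the PERF_BR_* enum from patch 1/7. */
enum {
	DEMO_BR_NONE = 0,
	DEMO_BR_JCC  = 1,
	DEMO_BR_RET  = 6,
	DEMO_BR_MAX  = 13,
};

/* One counter per common branch type. */
struct demo_branch_type_stat {
	uint64_t counts[DEMO_BR_MAX];
};

/* Resolve a branch type to a printable name; unknown values map to "N/A". */
static const char *demo_branch_type_name(int type)
{
	static const char * const names[DEMO_BR_MAX] = {
		"N/A", "JCC", "JMP", "IND_JMP", "CALL", "IND_CALL", "RET",
		"SYSCALL", "SYSRET", "IRQ", "INT", "IRET", "FAR_BRANCH",
	};

	if (type < 0 || type >= DEMO_BR_MAX)
		return names[DEMO_BR_NONE];
	return names[type];
}

/* Count one sampled branch; unknown or out-of-range types are ignored. */
static void demo_branch_type_count(struct demo_branch_type_stat *st, int type)
{
	assert(st);
	if (type <= DEMO_BR_NONE || type >= DEMO_BR_MAX)
		return;
	st->counts[type]++;
}
```

Keeping both helpers in one place lets the stdio statistics view and
the callchain view share the same type-to-name mapping.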

   The branch info printing code is also refactored for easier
   maintenance.

Not changed (or just fix merge issue):
  perf/core: Define the common branch type classification
  perf/x86/intel: Record branch type
  perf record: Create a new option save_type in --branch-filter

New patches:
  perf report: Refactor the branch info printing code
  perf util: Create branch.c/.h for common branch functions

Changed:
  perf report: Show branch type statistics for stdio mode
  perf report: Show branch type in callchain entry

v4:
---
1. Describe the major changes in the patch description.
   Thanks to Peter Zijlstra for the reminder.

2. Initialize branch type to 0 in intel_pmu_lbr_read_32 and
   intel_pmu_lbr_read_64. Remove the invalid else code in
   intel_pmu_lbr_filter. 

v3:
---
1. Move the JCC forward/backward and cross page computing from
   kernel to userspace.

2. Use lookup table to replace original switch/case processing.

Changed:
  perf/core: Define the common branch type classification
  perf/x86/intel: Record branch type
  perf report: Show branch type statistics for stdio mode
  perf report: Show branch type in callchain entry

Not changed:
  perf record: Create a new option save_type in --branch-filter

v2:
---
1. Use 4 bits in perf_branch_entry to record branch type.

2. Pull out some common branch types from FAR_BRANCH. Now the branch
   types defined in perf_event.h:

Jin Yao (7):
  perf/core: Define the common branch type classification
  perf/x86/intel: Record branch type
  perf record: Create a new option save_type in --branch-filter
  perf report: Refactor the branch info printing code
  perf util: Create branch.c/.h for common branch functions
  perf report: Show branch type statistics for stdio mode
  perf report: Show branch type in callchain entry

 arch/x86/events/intel/lbr.c  |  53 +-
 include/uapi/linux/perf_event.h  |  29 +-
 tools/include/uapi/linux/perf_event.h|  29 +-
 tools/perf/Documentation/perf-record.txt |   1 +
 tools/perf/builtin-report.c  |  25 +
 tools/perf/util/Build|   1 +
 tools/perf/util/branch.c | 168 +++
 tools/perf/util/branch.h |  25 +
 tools/perf/util/callchain.c  | 140 ++
 tools/perf/util/callchain.h  |   5 +-
 tools/perf/util/event.h  |   3 +-
 tools/perf/util/hist.c   |   5 +-
 tools/perf/util/machine.c|  26 +++--
 tools/perf/util/parse-branch-options.c   |   1 +
 14 files changed, 427 insertions(+), 84 deletions(-)
 create mode 100644 tools/perf/util/branch.c
 create mode 100644 tools/perf/util/branch.h

-- 
2.7.4


