Re: KASAN: use-after-free Read in sctp_association_free (2)

2018-03-09 Thread Xin Long
On Sat, Mar 10, 2018 at 6:08 AM, Neil Horman  wrote:
> On Fri, Mar 09, 2018 at 12:59:06PM -0800, syzbot wrote:
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> fd372a7a9e5e9d8011a0222d10edd3523abcd3b1 (Thu Mar 8 19:43:48 2018 +)
>> Merge tag 'mlx5-updates-2018-02-28-2' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
>>
>> So far this crash happened 2 times on net-next.
>> C reproducer is attached.
>> syzkaller reproducer is attached.
>> Raw console output is attached.
>> compiler: gcc (GCC) 7.1.1 20170620
>> .config is attached.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+a4e4112c3aff00c8c...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
>>
>> IPVS: ftp: loaded support on port[0] = 21
>> IPVS: ftp: loaded support on port[0] = 21
>> IPVS: ftp: loaded support on port[0] = 21
>> IPVS: ftp: loaded support on port[0] = 21
>> ==================================================================
>> BUG: KASAN: use-after-free in sctp_association_free+0x7b7/0x930
>> net/sctp/associola.c:332
>> Read of size 8 at addr 8801d8006ae0 by task syzkaller914861/4202
>>
>> CPU: 1 PID: 4202 Comm: syzkaller914861 Not tainted 4.16.0-rc4+ #258
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:17 [inline]
>>  dump_stack+0x194/0x24d lib/dump_stack.c:53
>>  print_address_description+0x73/0x250 mm/kasan/report.c:256
>>  kasan_report_error mm/kasan/report.c:354 [inline]
>>  kasan_report+0x23c/0x360 mm/kasan/report.c:412
>>  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
>>  sctp_association_free+0x7b7/0x930 net/sctp/associola.c:332
>>  sctp_sendmsg+0xc67/0x1a80 net/sctp/socket.c:2075
>>  inet_sendmsg+0x11f/0x5e0 net/ipv4/af_inet.c:763
>>  sock_sendmsg_nosec net/socket.c:629 [inline]
>>  sock_sendmsg+0xca/0x110 net/socket.c:639
>>  SYSC_sendto+0x361/0x5c0 net/socket.c:1748
>>  SyS_sendto+0x40/0x50 net/socket.c:1716
>>  do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
>>  entry_SYSCALL_64_after_hwframe+0x42/0xb7
>> RIP: 0033:0x446d09
>> RSP: 002b:7f5dbac21da8 EFLAGS: 0216 ORIG_RAX: 002c
>> RAX: ffda RBX: 006e29fc RCX: 00446d09
>> RDX: 0001 RSI: 2340 RDI: 0003
>> RBP: 006e29f8 R08: 204d9000 R09: 001c
>> R10:  R11: 0216 R12: 
>> R13: 7fff7b26fb1f R14: 7f5dbac229c0 R15: 006e2b60
>>
> I think we have a corner case with a0ff660058b88d12625a783ce9e5c1371c87951f
> here.  If a peeloff event happens during a wait for sendbuf space, EPIPE will
> be returned, and the code path appears to call sctp_association_put twice,
> leading to the use-after-free situation.  I'll write a patch this weekend.
Hi Neil, you're right.

I didn't expect that peeloff could be done on a NEW asoc, since peeloff
needs an assoc_id, which is only set once connecting has started.

But I realized that:
f84af33 sctp: factor out sctp_sendmsg_to_asoc from sctp_sendmsg

moved sctp_primitive_ASSOCIATE (the connecting step) before
sctp_wait_for_sndbuf (the send-buffer wait). That means peeloff can now
be done on a NEW asoc, so you may want to move it back.

One good thing is that the fix shouldn't touch the conflict on
https://lkml.org/lkml/2018/3/7/1175
I think we can fix it right now, but please double-check it before
submitting the patch; we can't let that fixup keep growing for the
merge into Linus's tree.

Thanks.


[PATCH 2/2] drivers: android: binder: fixed a brace coding style issue

2018-03-09 Thread Vaibhav Murkute
Fixed a coding style issue.

Signed-off-by: Vaibhav Murkute 
---
 drivers/android/binder.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/android/binder.c b/drivers/android/binder.c
index 764b63a5aade..2729bb75ca19 100644
--- a/drivers/android/binder.c
+++ b/drivers/android/binder.c
@@ -2641,11 +2641,11 @@ static bool binder_proc_transaction(struct binder_transaction *t,
binder_node_lock(node);
if (oneway) {
BUG_ON(thread);
-   if (node->has_async_transaction) {
+   if (node->has_async_transaction)
pending_async = true;
-   } else {
+   else
node->has_async_transaction = true;
-   }
+
}
 
binder_inner_proc_lock(proc);
-- 
2.15.1




[PATCH v2] net: ipv6: xfrm6_state: remove VLA usage

2018-03-09 Thread Andreas Christoforou
The kernel would like to have all stack VLA usage removed [1].
Instead of a dynamically sized allocation, just use XFRM_MAX_DEPTH,
as is already done for the "class" array. As per feedback, maxclass is
not dropped, because that would change the behavior: one caller runs
the loop up to 5 classes, the other up to 6.

[1] https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Andreas Christoforou 
---
v2:
- use XFRM_MAX_DEPTH for "count" array (Steffen and Mathias).
---
 net/ipv6/xfrm6_state.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/xfrm6_state.c b/net/ipv6/xfrm6_state.c
index b15075a..270a53a 100644
--- a/net/ipv6/xfrm6_state.c
+++ b/net/ipv6/xfrm6_state.c
@@ -62,7 +62,7 @@ __xfrm6_sort(void **dst, void **src, int n, int (*cmp)(void *p), int maxclass)
 {
int i;
int class[XFRM_MAX_DEPTH];
-   int count[maxclass];
+   int count[XFRM_MAX_DEPTH];
 
memset(count, 0, sizeof(count));
 
-- 
2.7.4






RE: [RFC/RFT][PATCH v3 0/6] sched/cpuidle: Idle loop rework

2018-03-09 Thread Doug Smythies
On 2018.03.09 07:19 Rik van Riel wrote:
> On Fri, 2018-03-09 at 10:34 +0100, Rafael J. Wysocki wrote:
>> Hi All,
>> 
>> Thanks a lot for the discussion and testing so far!
>> 
>> This is a total respin of the whole series, so please look at it
>> afresh.
>> Patches 2 and 3 are the most similar to their previous versions, but
>> still they are different enough.
>
> This series gives no RCU errors on startup,
> and no CPUs seem to be getting stuck any more.

Confirmed on my test server. Boot is normal and no other errors, so far.

Part 1: Idle test:

I was able to repeat Mike's higher power issue under very light load,
well no load in my case, with V2.

V3 is much better.

A one-hour trace on my very idle server was 22 times smaller with V3
than with V2, mainly because idle state 4 no longer exits and re-enters
every tick for long stretches of time.

Disclaimer: From past experience, 1 hour is not nearly long enough
for this test. Issues tend to come in bunches, sometimes many hours
apart.

V2:
Idle State 4: Entries: 1359560
CPU: 0: Entries: 125305
CPU: 1: Entries: 62489
CPU: 2: Entries: 10203
CPU: 3: Entries: 108107
CPU: 4: Entries: 19915
CPU: 5: Entries: 430253
CPU: 6: Entries: 564650
CPU: 7: Entries: 38638

V3:
Idle State 4: Entries: 64505
CPU: 0: Entries: 13060
CPU: 1: Entries: 5266
CPU: 2: Entries: 15744
CPU: 3: Entries: 5574
CPU: 4: Entries: 8425
CPU: 5: Entries: 6270
CPU: 6: Entries: 5592
CPU: 7: Entries: 4574

Kernel 4.16-rc4:
Idle State 4: Entries: 61390
CPU: 0: Entries: 9529
CPU: 1: Entries: 10556
CPU: 2: Entries: 5478
CPU: 3: Entries: 5991
CPU: 4: Entries: 3686
CPU: 5: Entries: 7610
CPU: 6: Entries: 11074
CPU: 7: Entries: 7466

With apologies to those that do not like the term "PowerNightmares",
it has become very ingrained in my tools:

V2:
1 hour idle Summary:

Idle State 0: Total Entries: 113 : PowerNightmares: 56 : Not PN time (seconds): 0.001224 : PN time: 65.543239 : Ratio: 53548.397792
Idle State 1: Total Entries: 1015 : PowerNightmares: 42 : Not PN time (seconds): 0.053986 : PN time: 21.054470 : Ratio: 389.998703
Idle State 2: Total Entries: 1382 : PowerNightmares: 17 : Not PN time (seconds): 0.728686 : PN time: 6.046906 : Ratio: 8.298370
Idle State 3: Total Entries: 113 : PowerNightmares: 13 : Not PN time (seconds): 0.069055 : PN time: 6.021458 : Ratio: 87.198002

V3:
1 hour idle Summary: Average processor package power 3.78 watts

Idle State 0: Total Entries: 134 : PowerNightmares: 109 : Not PN time (seconds): 0.000477 : PN time: 144.719723 : Ratio: 303395.646541
Idle State 1: Total Entries: 1104 : PowerNightmares: 84 : Not PN time (seconds): 0.052639 : PN time: 74.639142 : Ratio: 1417.943768
Idle State 2: Total Entries: 968 : PowerNightmares: 141 : Not PN time (seconds): 0.325953 : PN time: 128.235137 : Ratio: 393.416035
Idle State 3: Total Entries: 295 : PowerNightmares: 103 : Not PN time (seconds): 0.164884 : PN time: 97.159421 : Ratio: 589.259243

Kernel 4.16-rc4: Average processor package power (excluding a few minutes of abnormal power) 3.70 watts.
1 hour idle Summary:

Idle State 0: Total Entries: 168 : PowerNightmares: 59 : Not PN time (seconds): 0.001323 : PN time: 81.802197 : Ratio: 61830.836545
Idle State 1: Total Entries: 1669 : PowerNightmares: 78 : Not PN time (seconds): 0.022003 : PN time: 37.477413 : Ratio: 1703.286509
Idle State 2: Total Entries: 1447 : PowerNightmares: 30 : Not PN time (seconds): 0.502672 : PN time: 0.789344 : Ratio: 1.570296
Idle State 3: Total Entries: 176 : PowerNightmares: 0 : Not PN time (seconds): 0.259425 : PN time: 0.00 : Ratio: 0.00

Part 2: 100% load on one CPU test. Test duration 4 hours

V3: Summary: Average processor package power 26.75 watts

Idle State 0: Total Entries: 10039 : PowerNightmares: 7186 : Not PN time (seconds): 0.067477 : PN time: 6215.220295 : Ratio: 92108.722903
Idle State 1: Total Entries: 17268 : PowerNightmares: 195 : Not PN time (seconds): 0.213049 : PN time: 55.905323 : Ratio: 262.405939
Idle State 2: Total Entries: 5858 : PowerNightmares: 676 : Not PN time (seconds): 2.578006 : PN time: 167.282069 : Ratio: 64.888161
Idle State 3: Total Entries: 1500 : PowerNightmares: 488 : Not PN time (seconds): 0.772463 : PN time: 125.514015 : Ratio: 162.485472

Kernel 4.16-rc4: Summary: Average processor package power 27.41 watts

Idle State 0: Total Entries: 9096 : PowerNightmares: 6540 : Not PN time (seconds): 0.051532 : PN time: 7886.309553 : Ratio: 153037.133492
Idle State 1: Total Entries: 28731 : PowerNightmares: 215 : Not PN time (seconds): 0.211999 : PN time: 77.395467 : Ratio: 365.074679
Idle State 2: Total Entries: 4474 : PowerNightmares: 97 : Not PN time (seconds): 1.959059 : PN time: 0.874112 : Ratio: 0.446190
Idle State 3: Total Entries: 2319 : PowerNightmares: 0 : Not PN time (seconds): 1.663376 : PN time: 0.00 : Ratio: 0.00

Graph of package power versus time: http://fast.smythies.com/rjwv3_100.png

... Doug






Re: [PATCH 4.14 1/4] powerpc/mm/slice: Remove intermediate bitmap copy

2018-03-09 Thread christophe leroy



On 10/03/2018 at 01:10, Greg Kroah-Hartman wrote:

On Fri, Mar 09, 2018 at 04:48:59PM +0100, Christophe Leroy wrote:

Upstream 326691ad4f179e6edc7eb1271e618dd673e4736d


There is no such git commit id in Linus's tree :(

Please fix up and resend the series.


I checked again, it is there

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/arch/powerpc/mm/slice.c?h=next-20180309&id=326691ad4f179e6edc7eb1271e618dd673e4736d

The id seems to be exactly the same.

Christophe



thanks,

greg k-h









[tip:x86/pti] x86/kprobes: Fix kernel crash when probing .entry_trampoline code

2018-03-09 Thread tip-bot for Francis Deslauriers
Commit-ID:  c07a8f8b08ba683ea24f3ac9159f37ae94daf47f
Gitweb: https://git.kernel.org/tip/c07a8f8b08ba683ea24f3ac9159f37ae94daf47f
Author: Francis Deslauriers 
AuthorDate: Thu, 8 Mar 2018 22:18:12 -0500
Committer:  Ingo Molnar 
CommitDate: Fri, 9 Mar 2018 09:58:36 +0100

x86/kprobes: Fix kernel crash when probing .entry_trampoline code

Disable the kprobe probing of the entry trampoline:

.entry_trampoline is a code area that is used to ensure page table
isolation between userspace and kernelspace.

At the beginning of the execution of the trampoline, we load the
kernel's CR3 register. This has the effect of enabling the translation
of the kernel virtual addresses to physical addresses. Before this
happens most kernel addresses can not be translated because the running
process' CR3 is still used.

If a kprobe is placed on the trampoline code before that change of the
CR3 register happens the kernel crashes because int3 handling pages are
not accessible.

To fix this, add the .entry_trampoline section to the kprobe blacklist
to prohibit the probing of code before all the kernel pages are
accessible.

Signed-off-by: Francis Deslauriers 
Reviewed-by: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Denys Vlasenko 
Cc: H. Peter Anvin 
Cc: Josh Poimboeuf 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: mathieu.desnoy...@efficios.com
Cc: mhira...@kernel.org
Link: 
http://lkml.kernel.org/r/1520565492-4637-2-git-send-email-francis.deslauri...@efficios.com
Signed-off-by: Ingo Molnar 
---
 arch/x86/include/asm/sections.h |  1 +
 arch/x86/kernel/kprobes/core.c  | 10 +-
 arch/x86/kernel/vmlinux.lds.S   |  2 ++
 3 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/sections.h b/arch/x86/include/asm/sections.h
index d6baf23782bc..5c019d23d06b 100644
--- a/arch/x86/include/asm/sections.h
+++ b/arch/x86/include/asm/sections.h
@@ -10,6 +10,7 @@ extern struct exception_table_entry __stop___ex_table[];
 
 #if defined(CONFIG_X86_64)
 extern char __end_rodata_hpage_align[];
+extern char __entry_trampoline_start[], __entry_trampoline_end[];
 #endif
 
 #endif /* _ASM_X86_SECTIONS_H */
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index bd36f3c33cd0..0715f827607c 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -1168,10 +1168,18 @@ NOKPROBE_SYMBOL(longjmp_break_handler);
 
 bool arch_within_kprobe_blacklist(unsigned long addr)
 {
+   bool is_in_entry_trampoline_section = false;
+
+#ifdef CONFIG_X86_64
+   is_in_entry_trampoline_section =
+   (addr >= (unsigned long)__entry_trampoline_start &&
+addr < (unsigned long)__entry_trampoline_end);
+#endif
return  (addr >= (unsigned long)__kprobes_text_start &&
 addr < (unsigned long)__kprobes_text_end) ||
(addr >= (unsigned long)__entry_text_start &&
-addr < (unsigned long)__entry_text_end);
+addr < (unsigned long)__entry_text_end) ||
+   is_in_entry_trampoline_section;
 }
 
 int __init arch_init_kprobes(void)
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index 9b138a06c1a4..b854ebf5851b 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -118,9 +118,11 @@ SECTIONS
 
 #ifdef CONFIG_X86_64
. = ALIGN(PAGE_SIZE);
+   VMLINUX_SYMBOL(__entry_trampoline_start) = .;
_entry_trampoline = .;
*(.entry_trampoline)
. = ALIGN(PAGE_SIZE);
+   VMLINUX_SYMBOL(__entry_trampoline_end) = .;
ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
 #endif
 



Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Miguel Ojeda
On Sat, Mar 10, 2018 at 7:10 AM, Miguel Ojeda
 wrote:
> On Sat, Mar 10, 2018 at 4:11 AM, Randy Dunlap  wrote:
>> On 03/09/2018 04:07 PM, Andrew Morton wrote:
>>> On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook  wrote:
>>>
 When max() is used in stack array size calculations from literal values
 (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]"), the compiler
 thinks this is a dynamic calculation due to the single-eval logic, which
 is not needed in the literal case. This change removes several accidental
 stack VLAs from an x86 allmodconfig build:

 $ diff -u before.txt after.txt | grep ^-
 -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids variable length array ‘ids’ [-Wvla]
 -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length array ‘namebuf’ [-Wvla]
 -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ [-Wvla]
 -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ [-Wvla]
 -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ [-Wvla]
 -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array ‘buff64’ [-Wvla]

 Based on an earlier patch from Josh Poimboeuf.
>>>
>>> v1, v2 and v3 of this patch all fail with gcc-4.4.4:
>>>
>>> ./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
>>> ./include/linux/jiffies.h:444: error: first argument to '__builtin_choose_expr' not a constant
>>
>>
>> I'm seeing that problem with
>>> gcc --version
>> gcc (SUSE Linux) 4.8.5
>
> Same here, 4.8.5 fails. gcc 5.4.1 seems to work. I compiled a minimal
> 5.1.0 and it seems to work as well.
>

Just compiled 4.9.0 and it seems to work -- so that would be the
minimum required.

Sigh...

Some enterprise distros are either already shipping gcc >= 5 or will
probably be shipping it soon (e.g. RHEL 8), so how much does it hurt
to ask for a newer gcc? Are there many users/companies out there using
enterprise distributions' gcc to compile and run the very latest
kernels?

Miguel



[PATCH v5 04/11] ext2, dax: introduce ext2_dax_aops

2018-03-09 Thread Dan Williams
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: Jan Kara 
Signed-off-by: Dan Williams 
---
 fs/ext2/ext2.h  |1 +
 fs/ext2/inode.c |   28 
 fs/ext2/namei.c |   18 ++
 3 files changed, 23 insertions(+), 24 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index 032295e1d386..cc40802ddfa8 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -814,6 +814,7 @@ extern const struct inode_operations 
ext2_file_inode_operations;
 extern const struct file_operations ext2_file_operations;
 
 /* inode.c */
+extern void ext2_set_file_ops(struct inode *inode);
 extern const struct address_space_operations ext2_aops;
 extern const struct address_space_operations ext2_nobh_aops;
 extern const struct iomap_ops ext2_iomap_ops;
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 9b2ac55ac34f..09608f7e9e39 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -990,6 +990,13 @@ const struct address_space_operations ext2_nobh_aops = {
.error_remove_page  = generic_error_remove_page,
 };
 
+const struct address_space_operations ext2_dax_aops = {
+   .writepages = ext2_writepages,
+   .direct_IO  = ext2_direct_IO,
+   .set_page_dirty = dax_set_page_dirty,
+   .invalidatepage = dax_invalidatepage,
+};
+
 /*
  * Probably it should be a library function... search for first non-zero word
  * or memcmp with zero_page, whatever is better for particular architecture.
@@ -1388,6 +1395,18 @@ void ext2_set_inode_flags(struct inode *inode)
inode->i_flags |= S_DAX;
 }
 
+void ext2_set_file_ops(struct inode *inode)
+{
+   inode->i_op = &ext2_file_inode_operations;
+   inode->i_fop = &ext2_file_operations;
+   if (IS_DAX(inode))
+   inode->i_mapping->a_ops = &ext2_dax_aops;
+   else if (test_opt(inode->i_sb, NOBH))
+   inode->i_mapping->a_ops = &ext2_nobh_aops;
+   else
+   inode->i_mapping->a_ops = &ext2_aops;
+}
+
 struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
 {
struct ext2_inode_info *ei;
@@ -1480,14 +1499,7 @@ struct inode *ext2_iget (struct super_block *sb, 
unsigned long ino)
ei->i_data[n] = raw_inode->i_block[n];
 
if (S_ISREG(inode->i_mode)) {
-   inode->i_op = &ext2_file_inode_operations;
-   if (test_opt(inode->i_sb, NOBH)) {
-   inode->i_mapping->a_ops = &ext2_nobh_aops;
-   inode->i_fop = &ext2_file_operations;
-   } else {
-   inode->i_mapping->a_ops = &ext2_aops;
-   inode->i_fop = &ext2_file_operations;
-   }
+   ext2_set_file_ops(inode);
} else if (S_ISDIR(inode->i_mode)) {
inode->i_op = &ext2_dir_inode_operations;
inode->i_fop = &ext2_dir_operations;
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index e078075dc66f..55f7caadb093 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -107,14 +107,7 @@ static int ext2_create (struct inode * dir, struct dentry 
* dentry, umode_t mode
if (IS_ERR(inode))
return PTR_ERR(inode);
 
-   inode->i_op = &ext2_file_inode_operations;
-   if (test_opt(inode->i_sb, NOBH)) {
-   inode->i_mapping->a_ops = &ext2_nobh_aops;
-   inode->i_fop = &ext2_file_operations;
-   } else {
-   inode->i_mapping->a_ops = &ext2_aops;
-   inode->i_fop = &ext2_file_operations;
-   }
+   ext2_set_file_ops(inode);
mark_inode_dirty(inode);
return ext2_add_nondir(dentry, inode);
 }
@@ -125,14 +118,7 @@ static int ext2_tmpfile(struct inode *dir, struct dentry 
*dentry, umode_t mode)
if (IS_ERR(inode))
return PTR_ERR(inode);
 
-   inode->i_op = &ext2_file_inode_operations;
-   if (test_opt(inode->i_sb, NOBH)) {
-   inode->i_mapping->a_ops = &ext2_nobh_aops;
-   inode->i_fop = &ext2_file_operations;
-   } else {
-   inode->i_mapping->a_ops = &ext2_aops;
-   inode->i_fop = &ext2_file_operations;
-   }
+   ext2_set_file_ops(inode);
mark_inode_dirty(inode);
d_tmpfile(dentry, inode);
unlock_new_inode(inode);



[PATCH v5 08/11] wait_bit: introduce {wait_on,wake_up}_atomic_one

2018-03-09 Thread Dan Williams
Add a generic facility for awaiting an atomic_t to reach a value of 1.

Page reference counts typically need to reach 0 to be considered a
free / inactive page. However, ZONE_DEVICE pages allocated via
devm_memremap_pages() are never 'onlined', i.e. the put_page() typically
done at init time to assign pages to the page allocator is skipped.

These pages will have their reference count elevated > 1 by
get_user_pages() when they are under DMA. In order to coordinate DMA to
these pages vs filesystem operations like hole-punch and truncate, the
filesystem-dax implementation needs to capture the DMA-idle event (i.e.
the 2-to-1 count transition).

For now, this implementation does not introduce any functional change;
follow-on patches will add waiters for these page-idle events.

Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c  |2 +-
 include/linux/wait_bit.h |   13 ++
 kernel/sched/wait_bit.c  |   59 +++---
 3 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 619b1ed6434c..7e10fa3460e2 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -167,7 +167,7 @@ struct dax_device {
 #if IS_ENABLED(CONFIG_FS_DAX)
 static void generic_dax_pagefree(struct page *page, void *data)
 {
-   /* TODO: wakeup page-idle waiters */
+   wake_up_atomic_one(&page->_refcount);
 }
 
 struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
diff --git a/include/linux/wait_bit.h b/include/linux/wait_bit.h
index 61b39eaf7cad..564c9a0141cd 100644
--- a/include/linux/wait_bit.h
+++ b/include/linux/wait_bit.h
@@ -33,10 +33,15 @@ int __wait_on_bit(struct wait_queue_head *wq_head, struct 
wait_bit_queue_entry *
 int __wait_on_bit_lock(struct wait_queue_head *wq_head, struct 
wait_bit_queue_entry *wbq_entry, wait_bit_action_f *action, unsigned int mode);
 void wake_up_bit(void *word, int bit);
 void wake_up_atomic_t(atomic_t *p);
+static inline void wake_up_atomic_one(atomic_t *p)
+{
+   wake_up_atomic_t(p);
+}
 int out_of_line_wait_on_bit(void *word, int, wait_bit_action_f *action, 
unsigned int mode);
 int out_of_line_wait_on_bit_timeout(void *word, int, wait_bit_action_f 
*action, unsigned int mode, unsigned long timeout);
 int out_of_line_wait_on_bit_lock(void *word, int, wait_bit_action_f *action, 
unsigned int mode);
 int out_of_line_wait_on_atomic_t(atomic_t *p, wait_atomic_t_action_f action, 
unsigned int mode);
+int out_of_line_wait_on_atomic_one(atomic_t *p, wait_atomic_t_action_f action, 
unsigned int mode);
 struct wait_queue_head *bit_waitqueue(void *word, int bit);
 extern void __init wait_bit_init(void);
 
@@ -262,4 +267,12 @@ int wait_on_atomic_t(atomic_t *val, wait_atomic_t_action_f 
action, unsigned mode
return out_of_line_wait_on_atomic_t(val, action, mode);
 }
 
+static inline
+int wait_on_atomic_one(atomic_t *val, wait_atomic_t_action_f action, unsigned 
mode)
+{
+   might_sleep();
+   if (atomic_read(val) == 1)
+   return 0;
+   return out_of_line_wait_on_atomic_one(val, action, mode);
+}
 #endif /* _LINUX_WAIT_BIT_H */
diff --git a/kernel/sched/wait_bit.c b/kernel/sched/wait_bit.c
index 84cb3acd9260..8739b1e50df5 100644
--- a/kernel/sched/wait_bit.c
+++ b/kernel/sched/wait_bit.c
@@ -162,28 +162,47 @@ static inline wait_queue_head_t 
*atomic_t_waitqueue(atomic_t *p)
return bit_waitqueue(p, 0);
 }
 
-static int wake_atomic_t_function(struct wait_queue_entry *wq_entry, unsigned 
mode, int sync,
- void *arg)
+static struct wait_bit_queue_entry *to_wait_bit_q(
+   struct wait_queue_entry *wq_entry)
+{
+   return container_of(wq_entry, struct wait_bit_queue_entry, wq_entry);
+}
+
+static int __wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+   unsigned mode, int sync, void *arg, int target)
 {
struct wait_bit_key *key = arg;
-   struct wait_bit_queue_entry *wait_bit = container_of(wq_entry, struct 
wait_bit_queue_entry, wq_entry);
+   struct wait_bit_queue_entry *wait_bit = to_wait_bit_q(wq_entry);
atomic_t *val = key->flags;
 
if (wait_bit->key.flags != key->flags ||
wait_bit->key.bit_nr != key->bit_nr ||
-   atomic_read(val) != 0)
+   atomic_read(val) != target)
return 0;
return autoremove_wake_function(wq_entry, mode, sync, key);
 }
 
+static int wake_atomic_t_function(struct wait_queue_entry *wq_entry,
+   unsigned mode, int sync, void *arg)
+{
+   return __wake_atomic_t_function(wq_entry, mode, sync, arg, 0);
+}
+
+static int wake_atomic_one_function(struct wait_queue_entry *wq_entry,
+   unsigned mode, int sync, void *arg)
+{
+   return __wake_atomic_t_function(wq_entry, mode, sync, arg, 1);
+}
+
 /*
  * To allow interruptible waiting and asynchronous (i.e. nonblocking) waiting,
 

[PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops

2018-03-09 Thread Dan Williams
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G   O 4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96

...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.

Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Suggested-by: Jan Kara 
Suggested-by: Dave Chinner 
Signed-off-by: Dan Williams 
---
 fs/dax.c|   27 +++
 fs/xfs/xfs_aops.c   |7 +++
 fs/xfs/xfs_aops.h   |1 +
 fs/xfs/xfs_iops.c   |5 -
 include/linux/dax.h |6 ++
 5 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index b646a46e4d12..ba02772fccbc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -46,6 +46,33 @@
 #define PG_PMD_COLOUR  ((PMD_SIZE >> PAGE_SHIFT) - 1)
 #define PG_PMD_NR  (PMD_SIZE >> PAGE_SHIFT)
 
+int dax_set_page_dirty(struct page *page)
+{
+   /*
+* Unlike __set_page_dirty_no_writeback that handles dirty page
+* tracking in the page object, dax does all dirty tracking in
+* the inode address_space in response to mkwrite faults. In the
+* dax case we only need to worry about potentially dirty CPU
+* caches, not dirty page cache pages to write back.
+*
+* This callback is defined to prevent fallback to
+* __set_page_dirty_buffers() in set_page_dirty().
+*/
+   return 0;
+}
+EXPORT_SYMBOL(dax_set_page_dirty);
+
+void dax_invalidatepage(struct page *page, unsigned int offset,
+   unsigned int length)
+{
+   /*
+* There is no page cache to invalidate in the dax case, however
+* we need this callback defined to prevent falling back to
+* block_invalidatepage() in do_invalidatepage().
+*/
+}
+EXPORT_SYMBOL(dax_invalidatepage);
+
 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..5788b680fa01 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1505,3 +1505,10 @@ const struct address_space_operations 
xfs_address_space_operations = {
.is_partially_uptodate  = block_is_partially_uptodate,
.error_remove_page  = generic_error_remove_page,
 };
+
+const struct address_space_operations xfs_dax_aops = {
+   .writepages = xfs_vm_writepages,
+   .direct_IO  = xfs_vm_direct_IO,
+   .set_page_dirty = dax_set_page_dirty,
+   .invalidatepage = dax_invalidatepage,
+};
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 88c85ea63da0..69346d460dfa 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -54,6 +54,7 @@ struct xfs_ioend {
 };
 
 extern const struct address_space_operations xfs_address_space_operations;
+extern const struct address_space_operations xfs_dax_aops;
 
 intxfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fcd76f2..951e84df5576 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1272,7 +1272,10 @@ xfs_setup_iops(
case S_IFREG:
inode->i_op = &xfs_inode_operations;
inode->i_fop = &xfs_file_operations;
-   inode->i_mapping->a_ops = &xfs_address_space_operations;
+   if (IS_DAX(inode))
+   inode->i_mapping->a_ops = &xfs_dax_aops;
+   else
+   inode->i_mapping->a_ops = &xfs_address_space_operations;
break;
case S_IFDIR:
if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..3045c0d9c804 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -57,6 +57,9 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 }
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+int dax_set_page_dirty(struct page *page);
+void dax_invalidatepage(struct page *page, unsigned int offset,

[PATCH v5 05/11] fs, dax: use page->mapping to warn if truncate collides with a busy page

2018-03-09 Thread Dan Williams
Catch cases where extent unmap operations encounter pages that are
pinned / busy. Typically this is pinned pages that are under active dma.
This warning is a canary for potential data corruption as truncated
blocks could be allocated to a new file while the device is still
performing i/o.

Here is an example of a collision that this implementation catches:

 WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
 [..]
 Call Trace:
  __dax_invalidate_mapping_entry+0x6c/0xf0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.12+0x1af/0x200
  truncate_inode_pages_range+0x268/0x970
  ? tlb_gather_mmu+0x10/0x20
  ? up_write+0x1c/0x40
  ? unmap_mapping_range+0x73/0x140
  xfs_free_file_space+0x1b6/0x5b0 [xfs]
  ? xfs_file_fallocate+0x7f/0x320 [xfs]
  ? down_write_nested+0x40/0x70
  ? xfs_ilock+0x21d/0x2f0 [xfs]
  xfs_file_fallocate+0x162/0x320 [xfs]
  ? rcu_read_lock_sched_held+0x3f/0x70
  ? rcu_sync_lockdep_assert+0x2a/0x50
  ? __sb_start_write+0xd0/0x1b0
  ? vfs_fallocate+0x20c/0x270
  vfs_fallocate+0x154/0x270
  SyS_fallocate+0x43/0x80
  entry_SYSCALL_64_fastpath+0x1f/0x96

Cc: Jeff Moyer 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/dax.c |   56 
 1 file changed, 56 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ba02772fccbc..fecf463a1468 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -325,6 +325,56 @@ static void put_unlocked_mapping_entry(struct 
address_space *mapping,
dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+   if (dax_is_zero_entry(entry))
+   return 0;
+   else if (dax_is_empty_entry(entry))
+   return 0;
+   else if (dax_is_pmd_entry(entry))
+   return HPAGE_SIZE;
+   else
+   return PAGE_SIZE;
+}
+
+#define for_each_entry_pfn(entry, pfn, end_pfn) \
+   for (pfn = dax_radix_pfn(entry), \
+   end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
+   pfn < end_pfn; \
+   pfn++)
+
+static void dax_associate_entry(void *entry, struct address_space *mapping)
+{
+   unsigned long pfn, end_pfn;
+
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+   return;
+
+   for_each_entry_pfn(entry, pfn, end_pfn) {
+   struct page *page = pfn_to_page(pfn);
+
+   WARN_ON_ONCE(page->mapping);
+   page->mapping = mapping;
+   }
+}
+
+static void dax_disassociate_entry(void *entry, struct address_space *mapping,
+   bool trunc)
+{
+   unsigned long pfn, end_pfn;
+
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+   return;
+
+   for_each_entry_pfn(entry, pfn, end_pfn) {
+   struct page *page = pfn_to_page(pfn);
+
+   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+   page->mapping = NULL;
+   }
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -431,6 +481,7 @@ static void *grab_mapping_entry(struct address_space 
*mapping, pgoff_t index,
}
 
if (pmd_downgrade) {
+   dax_disassociate_entry(entry, mapping, false);
radix_tree_delete(&mapping->page_tree, index);
mapping->nrexceptional--;
dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -480,6 +531,7 @@ static int __dax_invalidate_mapping_entry(struct 
address_space *mapping,
(radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
goto out;
+   dax_disassociate_entry(entry, mapping, trunc);
radix_tree_delete(page_tree, index);
mapping->nrexceptional--;
ret = 1;
@@ -574,6 +626,10 @@ static void *dax_insert_mapping_entry(struct address_space 
*mapping,
 
spin_lock_irq(&mapping->tree_lock);
new_entry = dax_radix_locked_entry(pfn, flags);
+   if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+   dax_disassociate_entry(entry, mapping, false);
+   dax_associate_entry(new_entry, mapping);
+   }
 
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
/*



[PATCH v5 10/11] xfs: prepare xfs_break_layouts() for another layout type

2018-03-09 Thread Dan Williams
When xfs is operating as the back-end of a pNFS block server, it prevents
collisions between local and remote operations by requiring a lease to
be held for remotely accessed blocks. Local filesystem operations break
those leases before writing or mutating the extent map of the file.

A similar mechanism is needed to prevent operations on pinned dax
mappings, like device-DMA, from colliding with extent unmap operations.

XFS_BREAK_REMOTE and XFS_BREAK_MAPS are introduced as flags to control
the layouts that need to be broken by xfs_break_layouts(). While
XFS_BREAK_REMOTE is invoked in all calls to the new xfs_break_layouts(),
XFS_BREAK_MAPS only needs to be specified when extents may be unmapped,
i.e. xfs_file_fallocate() and xfs_ioc_space(). XFS_BREAK_MAPS also
imposes the additional locking constraint of breaking (awaiting) pinned
dax mappings while holding XFS_MMAPLOCK_EXCL.

There is a small functional change in this rework. For the cases where
XFS_BREAK_MAPS is specified to xfs_break_layouts(), the
XFS_MMAPLOCK_EXCL ilock is held over the break_layouts() loop in
xfs_break_leased_layouts().

Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Reported-by: Dave Chinner 
Reported-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_file.c  |   32 ++--
 fs/xfs/xfs_inode.h |9 +
 fs/xfs/xfs_ioctl.c |9 +++--
 fs/xfs/xfs_iops.c  |   12 +++-
 fs/xfs/xfs_pnfs.c  |8 +++-
 fs/xfs/xfs_pnfs.h  |4 ++--
 6 files changed, 50 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9ea08326f876..f914f0628dc2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -350,7 +350,7 @@ xfs_file_aio_write_checks(
if (error <= 0)
return error;
 
-   error = xfs_break_layouts(inode, iolock);
+   error = xfs_break_layouts(inode, iolock, XFS_BREAK_REMOTE);
if (error)
return error;
 
@@ -752,6 +752,28 @@ xfs_file_write_iter(
return ret;
 }
 
+int
+xfs_break_layouts(
+   struct inode*inode,
+   uint*iolock,
+   unsigned long   flags)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+   uintiolock_assert = 0;
+   int ret = 0;
+
+   if (flags & XFS_BREAK_REMOTE)
+   iolock_assert |= XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL;
+   if (flags & XFS_BREAK_MAPS)
+   iolock_assert |= XFS_MMAPLOCK_EXCL;
+
+   ASSERT(xfs_isilocked(ip, iolock_assert));
+
+   if (flags & XFS_BREAK_REMOTE)
+   ret = xfs_break_leased_layouts(inode, iolock);
+   return ret;
+}
+
#define	XFS_FALLOC_FL_SUPPORTED 
\
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |  \
@@ -768,7 +790,7 @@ xfs_file_fallocate(
struct xfs_inode*ip = XFS_I(inode);
longerror;
enum xfs_prealloc_flags flags = 0;
-   uintiolock = XFS_IOLOCK_EXCL;
+   uintiolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
loff_t  new_size = 0;
booldo_file_insert = false;
 
@@ -778,13 +800,11 @@ xfs_file_fallocate(
return -EOPNOTSUPP;
 
xfs_ilock(ip, iolock);
-   error = xfs_break_layouts(inode, &iolock);
+   error = xfs_break_layouts(inode, &iolock,
+   XFS_BREAK_REMOTE | XFS_BREAK_MAPS);
if (error)
goto out_unlock;
 
-   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-   iolock |= XFS_MMAPLOCK_EXCL;
-
if (mode & FALLOC_FL_PUNCH_HOLE) {
error = xfs_free_file_space(ip, offset, len);
if (error)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3e8dc990d41c..9b73ceb09cb1 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -379,6 +379,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
>> XFS_ILOCK_SHIFT)
 
 /*
+ * Flags for layout breaks
+ */
+#define XFS_BREAK_REMOTE (1<<0) /* break remote layout leases */
+#define XFS_BREAK_MAPS   (1<<1) /* break local direct (dax) mappings */
+
+/*
  * For multiple groups support: if S_ISGID bit is set in the parent
  * directory, group of new file is set to that of the parent, and
  * new subdirectory gets S_ISGID bit from parent.
@@ -447,6 +453,9 @@ int xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
 xfs_fsize_t isize, bool *did_zeroing);
 intxfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
bool *did_zero);
+intxfs_break_layouts(struct inode *inode, uint *iolock,
+   unsigned long flags);
+
 
 /* from xfs_iops.c */
 extern 

[PATCH v5 10/11] xfs: prepare xfs_break_layouts() for another layout type

2018-03-09 Thread Dan Williams
When xfs is operating as the back-end of a pNFS block server, it prevents
collisions between local and remote operations by requiring a lease to
be held for remotely accessed blocks. Local filesystem operations break
those leases before writing or mutating the extent map of the file.

A similar mechanism is needed to prevent operations on pinned dax
mappings, like device-DMA, from colliding with extent unmap operations.

XFS_BREAK_REMOTE and XFS_BREAK_MAPS are introduced as flags to control
the layouts that need to be broken by xfs_break_layouts(). While
XFS_BREAK_REMOTE is invoked in all calls to the new xfs_break_layouts(),
XFS_BREAK_MAPS only needs to be specified when extents may be unmapped,
i.e. xfs_file_fallocate() and xfs_ioc_space(). XFS_BREAK_MAPS also
imposes the additional locking constraint of breaking (awaiting) pinned
dax mappings while holding XFS_MMAPLOCK_EXCL.

There is a small functional change in this rework. For the cases where
XFS_BREAK_MAPS is specified to xfs_break_layouts(), the
XFS_MMAPLOCK_EXCL ilock is held over the break_layouts() loop in
xfs_break_leased_layouts().

Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Reported-by: Dave Chinner 
Reported-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_file.c  |   32 ++--
 fs/xfs/xfs_inode.h |9 +
 fs/xfs/xfs_ioctl.c |9 +++--
 fs/xfs/xfs_iops.c  |   12 +++-
 fs/xfs/xfs_pnfs.c  |8 +++-
 fs/xfs/xfs_pnfs.h  |4 ++--
 6 files changed, 50 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9ea08326f876..f914f0628dc2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -350,7 +350,7 @@ xfs_file_aio_write_checks(
if (error <= 0)
return error;
 
-   error = xfs_break_layouts(inode, iolock);
+   error = xfs_break_layouts(inode, iolock, XFS_BREAK_REMOTE);
if (error)
return error;
 
@@ -752,6 +752,28 @@ xfs_file_write_iter(
return ret;
 }
 
+int
+xfs_break_layouts(
+   struct inode*inode,
+   uint*iolock,
+   unsigned long   flags)
+{
+   struct xfs_inode*ip = XFS_I(inode);
+   uintiolock_assert = 0;
+   int ret = 0;
+
+   if (flags & XFS_BREAK_REMOTE)
+   iolock_assert |= XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL;
+   if (flags & XFS_BREAK_MAPS)
+   iolock_assert |= XFS_MMAPLOCK_EXCL;
+
+   ASSERT(xfs_isilocked(ip, iolock_assert));
+
+   if (flags & XFS_BREAK_REMOTE)
+   ret = xfs_break_leased_layouts(inode, iolock);
+   return ret;
+}
+
 #defineXFS_FALLOC_FL_SUPPORTED 
\
(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |   \
 FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |  \
@@ -768,7 +790,7 @@ xfs_file_fallocate(
struct xfs_inode*ip = XFS_I(inode);
longerror;
enum xfs_prealloc_flags flags = 0;
-   uintiolock = XFS_IOLOCK_EXCL;
+   uintiolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
loff_t  new_size = 0;
booldo_file_insert = false;
 
@@ -778,13 +800,11 @@ xfs_file_fallocate(
return -EOPNOTSUPP;
 
xfs_ilock(ip, iolock);
-   error = xfs_break_layouts(inode, &iolock);
+   error = xfs_break_layouts(inode, &iolock,
+   XFS_BREAK_REMOTE | XFS_BREAK_MAPS);
if (error)
goto out_unlock;
 
-   xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
-   iolock |= XFS_MMAPLOCK_EXCL;
-
if (mode & FALLOC_FL_PUNCH_HOLE) {
error = xfs_free_file_space(ip, offset, len);
if (error)
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 3e8dc990d41c..9b73ceb09cb1 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -379,6 +379,12 @@ static inline void xfs_ifunlock(struct xfs_inode *ip)
>> XFS_ILOCK_SHIFT)
 
 /*
+ * Flags for layout breaks
+ */
+#define XFS_BREAK_REMOTE (1<<0) /* break remote layout leases */
+#define XFS_BREAK_MAPS   (1<<1) /* break local direct (dax) mappings */
+
+/*
  * For multiple groups support: if S_ISGID bit is set in the parent
  * directory, group of new file is set to that of the parent, and
  * new subdirectory gets S_ISGID bit from parent.
@@ -447,6 +453,9 @@ int xfs_zero_eof(struct xfs_inode *ip, xfs_off_t offset,
 xfs_fsize_t isize, bool *did_zeroing);
 intxfs_zero_range(struct xfs_inode *ip, xfs_off_t pos, xfs_off_t count,
bool *did_zero);
+intxfs_break_layouts(struct inode *inode, uint *iolock,
+   unsigned long flags);
+
 
 /* from xfs_iops.c */
 extern void xfs_setup_inode(struct xfs_inode *ip);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 

[PATCH v5 05/11] fs, dax: use page->mapping to warn if truncate collides with a busy page

2018-03-09 Thread Dan Williams
Catch cases where extent unmap operations encounter pages that are
pinned / busy. Typically these are pinned pages under active DMA.
This warning is a canary for potential data corruption as truncated
blocks could be allocated to a new file while the device is still
performing i/o.

Here is an example of a collision that this implementation catches:

 WARNING: CPU: 2 PID: 1286 at fs/dax.c:343 dax_disassociate_entry+0x55/0x80
 [..]
 Call Trace:
  __dax_invalidate_mapping_entry+0x6c/0xf0
  dax_delete_mapping_entry+0xf/0x20
  truncate_exceptional_pvec_entries.part.12+0x1af/0x200
  truncate_inode_pages_range+0x268/0x970
  ? tlb_gather_mmu+0x10/0x20
  ? up_write+0x1c/0x40
  ? unmap_mapping_range+0x73/0x140
  xfs_free_file_space+0x1b6/0x5b0 [xfs]
  ? xfs_file_fallocate+0x7f/0x320 [xfs]
  ? down_write_nested+0x40/0x70
  ? xfs_ilock+0x21d/0x2f0 [xfs]
  xfs_file_fallocate+0x162/0x320 [xfs]
  ? rcu_read_lock_sched_held+0x3f/0x70
  ? rcu_sync_lockdep_assert+0x2a/0x50
  ? __sb_start_write+0xd0/0x1b0
  ? vfs_fallocate+0x20c/0x270
  vfs_fallocate+0x154/0x270
  SyS_fallocate+0x43/0x80
  entry_SYSCALL_64_fastpath+0x1f/0x96

Cc: Jeff Moyer 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Reviewed-by: Jan Kara 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/dax.c |   56 
 1 file changed, 56 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index ba02772fccbc..fecf463a1468 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -325,6 +325,56 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
dax_wake_mapping_entry_waiter(mapping, index, entry, false);
 }
 
+static unsigned long dax_entry_size(void *entry)
+{
+   if (dax_is_zero_entry(entry))
+   return 0;
+   else if (dax_is_empty_entry(entry))
+   return 0;
+   else if (dax_is_pmd_entry(entry))
+   return HPAGE_SIZE;
+   else
+   return PAGE_SIZE;
+}
+
+#define for_each_entry_pfn(entry, pfn, end_pfn) \
+   for (pfn = dax_radix_pfn(entry), \
+   end_pfn = pfn + dax_entry_size(entry) / PAGE_SIZE; \
+   pfn < end_pfn; \
+   pfn++)
+
+static void dax_associate_entry(void *entry, struct address_space *mapping)
+{
+   unsigned long pfn, end_pfn;
+
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+   return;
+
+   for_each_entry_pfn(entry, pfn, end_pfn) {
+   struct page *page = pfn_to_page(pfn);
+
+   WARN_ON_ONCE(page->mapping);
+   page->mapping = mapping;
+   }
+}
+
+static void dax_disassociate_entry(void *entry, struct address_space *mapping,
+   bool trunc)
+{
+   unsigned long pfn, end_pfn;
+
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+   return;
+
+   for_each_entry_pfn(entry, pfn, end_pfn) {
+   struct page *page = pfn_to_page(pfn);
+
+   WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
+   WARN_ON_ONCE(page->mapping && page->mapping != mapping);
+   page->mapping = NULL;
+   }
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -431,6 +481,7 @@ static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index,
}
 
if (pmd_downgrade) {
+   dax_disassociate_entry(entry, mapping, false);
radix_tree_delete(&mapping->page_tree, index);
mapping->nrexceptional--;
dax_wake_mapping_entry_waiter(mapping, index, entry,
@@ -480,6 +531,7 @@ static int __dax_invalidate_mapping_entry(struct address_space *mapping,
(radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
 radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
goto out;
+   dax_disassociate_entry(entry, mapping, trunc);
radix_tree_delete(page_tree, index);
mapping->nrexceptional--;
ret = 1;
@@ -574,6 +626,10 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
 
spin_lock_irq(&mapping->tree_lock);
new_entry = dax_radix_locked_entry(pfn, flags);
+   if (dax_entry_size(entry) != dax_entry_size(new_entry)) {
+   dax_disassociate_entry(entry, mapping, false);
+   dax_associate_entry(new_entry, mapping);
+   }
 
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
/*



[PATCH v5 02/11] xfs, dax: introduce xfs_dax_aops

2018-03-09 Thread Dan Williams
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings like the
following:

 WARNING: CPU: 27 PID: 1783 at fs/xfs/xfs_aops.c:1468
 xfs_vm_set_page_dirty+0xf3/0x1b0 [xfs]
 [..]
 CPU: 27 PID: 1783 Comm: dma-collision Tainted: G   O 4.15.0-rc2+ #984
 [..]
 Call Trace:
  set_page_dirty_lock+0x40/0x60
  bio_set_pages_dirty+0x37/0x50
  iomap_dio_actor+0x2b7/0x3b0
  ? iomap_dio_zero+0x110/0x110
  iomap_apply+0xa4/0x110
  iomap_dio_rw+0x29e/0x3b0
  ? iomap_dio_zero+0x110/0x110
  ? xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_dio_aio_read+0x7c/0x1a0 [xfs]
  xfs_file_read_iter+0xa0/0xc0 [xfs]
  __vfs_read+0xf9/0x170
  vfs_read+0xa6/0x150
  SyS_pread64+0x93/0xb0
  entry_SYSCALL_64_fastpath+0x1f/0x96

...where the default set_page_dirty() handler assumes that dirty state
is being tracked in 'struct page' flags.

Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Suggested-by: Jan Kara 
Suggested-by: Dave Chinner 
Signed-off-by: Dan Williams 
---
 fs/dax.c|   27 +++
 fs/xfs/xfs_aops.c   |7 +++
 fs/xfs/xfs_aops.h   |1 +
 fs/xfs/xfs_iops.c   |5 -
 include/linux/dax.h |6 ++
 5 files changed, 45 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index b646a46e4d12..ba02772fccbc 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -46,6 +46,33 @@
 #define PG_PMD_COLOUR  ((PMD_SIZE >> PAGE_SHIFT) - 1)
 #define PG_PMD_NR  (PMD_SIZE >> PAGE_SHIFT)
 
+int dax_set_page_dirty(struct page *page)
+{
+   /*
+* Unlike __set_page_dirty_no_writeback that handles dirty page
+* tracking in the page object, dax does all dirty tracking in
+* the inode address_space in response to mkwrite faults. In the
+* dax case we only need to worry about potentially dirty CPU
+* caches, not dirty page cache pages to write back.
+*
+* This callback is defined to prevent fallback to
+* __set_page_dirty_buffers() in set_page_dirty().
+*/
+   return 0;
+}
+EXPORT_SYMBOL(dax_set_page_dirty);
+
+void dax_invalidatepage(struct page *page, unsigned int offset,
+   unsigned int length)
+{
+   /*
+* There is no page cache to invalidate in the dax case, however
+* we need this callback defined to prevent falling back to
+* block_invalidatepage() in do_invalidatepage().
+*/
+}
+EXPORT_SYMBOL(dax_invalidatepage);
+
 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 9c6a830da0ee..5788b680fa01 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1505,3 +1505,10 @@ const struct address_space_operations xfs_address_space_operations = {
.is_partially_uptodate  = block_is_partially_uptodate,
.error_remove_page  = generic_error_remove_page,
 };
+
+const struct address_space_operations xfs_dax_aops = {
+   .writepages = xfs_vm_writepages,
+   .direct_IO  = xfs_vm_direct_IO,
+   .set_page_dirty = dax_set_page_dirty,
+   .invalidatepage = dax_invalidatepage,
+};
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 88c85ea63da0..69346d460dfa 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -54,6 +54,7 @@ struct xfs_ioend {
 };
 
 extern const struct address_space_operations xfs_address_space_operations;
+extern const struct address_space_operations xfs_dax_aops;
 
 intxfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
 
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 56475fcd76f2..951e84df5576 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -1272,7 +1272,10 @@ xfs_setup_iops(
case S_IFREG:
inode->i_op = &xfs_inode_operations;
inode->i_fop = &xfs_file_operations;
-   inode->i_mapping->a_ops = &xfs_address_space_operations;
+   if (IS_DAX(inode))
+   inode->i_mapping->a_ops = &xfs_dax_aops;
+   else
+   inode->i_mapping->a_ops = &xfs_address_space_operations;
break;
case S_IFDIR:
if (xfs_sb_version_hasasciici(&XFS_M(inode->i_sb)->m_sb))
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0185ecdae135..3045c0d9c804 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -57,6 +57,9 @@ static inline void fs_put_dax(struct dax_device *dax_dev)
 }
 
 struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev);
+int dax_set_page_dirty(struct page *page);
+void dax_invalidatepage(struct page *page, unsigned int offset,
+   unsigned int length);
 #else
 static inline int bdev_dax_supported(struct super_block *sb, int blocksize)
 {
@@ -76,6 +79,9 @@ 

[PATCH v5 06/11] mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks

2018-03-09 Thread Dan Williams
In order to resolve collisions between filesystem operations and DMA to
DAX mapped pages we need a callback when DMA completes. With a callback
we can hold off filesystem operations while DMA is in-flight and then
resume those operations when the last put_page() occurs on a DMA page.

Recall that the 'struct page' entries for DAX memory are created with
devm_memremap_pages(). That routine arranges for the pages to be
allocated, but never onlined, so a DAX page is DMA-idle when its
reference count reaches one.

Also recall that the HMM sub-system added infrastructure to trap the
page-idle (2-to-1 reference count) transition of the pages allocated by
devm_memremap_pages() and trigger a callback via the 'struct
dev_pagemap' associated with the page range. Whereas the HMM callbacks
go to a device driver to manage bounce pages in device memory, in the
filesystem-dax case we will call back to a filesystem-specified
callback.

Since the callback is not known at devm_memremap_pages() time we arrange
for the filesystem to install it at mount time. No functional changes
are expected as this only registers a nop handler for the ->page_free()
event for device-mapped pages.

Cc: Michal Hocko 
Cc: "Jérôme Glisse" 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c  |   79 --
 drivers/nvdimm/pmem.c|3 +-
 fs/ext2/super.c  |6 ++-
 fs/ext4/super.c  |6 ++-
 fs/xfs/xfs_super.c   |   20 ++--
 include/linux/dax.h  |   17 +-
 include/linux/memremap.h |8 +
 7 files changed, 103 insertions(+), 36 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index 2b2332b605e4..ecefe9f7eb60 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -29,6 +29,7 @@ static struct vfsmount *dax_mnt;
 static DEFINE_IDA(dax_minor_ida);
 static struct kmem_cache *dax_cache __read_mostly;
 static struct super_block *dax_superblock __read_mostly;
+static DEFINE_MUTEX(devmap_lock);
 
 #define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
 static struct hlist_head dax_host_list[DAX_HASH_SIZE];
@@ -62,16 +63,6 @@ int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
 }
 EXPORT_SYMBOL(bdev_dax_pgoff);
 
-#if IS_ENABLED(CONFIG_FS_DAX)
-struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
-{
-   if (!blk_queue_dax(bdev->bd_queue))
-   return NULL;
-   return fs_dax_get_by_host(bdev->bd_disk->disk_name);
-}
-EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
-#endif
-
 /**
  * __bdev_dax_supported() - Check if the device supports dax for filesystem
  * @sb: The superblock of the device
@@ -169,9 +160,66 @@ struct dax_device {
const char *host;
void *private;
unsigned long flags;
+   struct dev_pagemap *pgmap;
const struct dax_operations *ops;
 };
 
+#if IS_ENABLED(CONFIG_FS_DAX)
+static void generic_dax_pagefree(struct page *page, void *data)
+{
+   /* TODO: wakeup page-idle waiters */
+}
+
+struct dax_device *fs_dax_claim_bdev(struct block_device *bdev, void *owner)
+{
+   struct dax_device *dax_dev;
+   struct dev_pagemap *pgmap;
+
+   if (!blk_queue_dax(bdev->bd_queue))
+   return NULL;
+   dax_dev = fs_dax_get_by_host(bdev->bd_disk->disk_name);
+   if (!dax_dev->pgmap)
+   return dax_dev;
+   pgmap = dax_dev->pgmap;
+
+   mutex_lock(&devmap_lock);
+   if ((pgmap->data && pgmap->data != owner) || pgmap->page_free
+   || pgmap->page_fault
+   || pgmap->type != MEMORY_DEVICE_HOST) {
+   put_dax(dax_dev);
+   mutex_unlock(&devmap_lock);
+   return NULL;
+   }
+
+   pgmap->type = MEMORY_DEVICE_FS_DAX;
+   pgmap->page_free = generic_dax_pagefree;
+   pgmap->data = owner;
+   mutex_unlock(&devmap_lock);
+
+   return dax_dev;
+}
+EXPORT_SYMBOL_GPL(fs_dax_claim_bdev);
+
+void fs_dax_release(struct dax_device *dax_dev, void *owner)
+{
+   struct dev_pagemap *pgmap = dax_dev ? dax_dev->pgmap : NULL;
+
+   put_dax(dax_dev);
+   if (!pgmap)
+   return;
+   if (!pgmap->data)
+   return;
+
+   mutex_lock(&devmap_lock);
+   WARN_ON(pgmap->data != owner);
+   pgmap->type = MEMORY_DEVICE_HOST;
+   pgmap->page_free = NULL;
+   pgmap->data = NULL;
+   mutex_unlock(&devmap_lock);
+}
+EXPORT_SYMBOL_GPL(fs_dax_release);
+#endif
+
 static ssize_t write_cache_show(struct device *dev,
struct device_attribute *attr, char *buf)
 {
@@ -499,6 +547,17 @@ struct dax_device *alloc_dax(void *private, const char *__host,
 }
 EXPORT_SYMBOL_GPL(alloc_dax);
 
+struct dax_device *alloc_dax_devmap(void *private, const char *host,
+   const struct dax_operations *ops, struct dev_pagemap *pgmap)
+{
+   struct dax_device *dax_dev = alloc_dax(private, host, ops);
+
+   if (dax_dev)
+   dax_dev->pgmap = pgmap;

[PATCH v5 03/11] ext4, dax: introduce ext4_dax_aops

2018-03-09 Thread Dan Williams
In preparation for the dax implementation to start associating dax pages
to inodes via page->mapping, we need to provide a 'struct
address_space_operations' instance for dax. Otherwise, direct-I/O
triggers incorrect page cache assumptions and warnings.

Cc: "Theodore Ts'o" 
Cc: Andreas Dilger 
Cc: linux-e...@vger.kernel.org
Cc: Jan Kara 
Signed-off-by: Dan Williams 
---
 fs/ext4/inode.c |   11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c94780075b04..ef21f0ad38ff 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3946,6 +3946,13 @@ static const struct address_space_operations ext4_da_aops = {
.error_remove_page  = generic_error_remove_page,
 };
 
+static const struct address_space_operations ext4_dax_aops = {
+   .writepages = ext4_writepages,
+   .direct_IO  = ext4_direct_IO,
+   .set_page_dirty = dax_set_page_dirty,
+   .invalidatepage = dax_invalidatepage,
+};
+
 void ext4_set_aops(struct inode *inode)
 {
switch (ext4_inode_journal_mode(inode)) {
@@ -3958,7 +3965,9 @@ void ext4_set_aops(struct inode *inode)
default:
BUG();
}
-   if (test_opt(inode->i_sb, DELALLOC))
+   if (IS_DAX(inode))
+   inode->i_mapping->a_ops = &ext4_dax_aops;
+   else if (test_opt(inode->i_sb, DELALLOC))
inode->i_mapping->a_ops = &ext4_da_aops;
else
inode->i_mapping->a_ops = &ext4_aops;



[PATCH v5 11/11] xfs, dax: introduce xfs_break_dax_layouts()

2018-03-09 Thread Dan Williams
xfs_break_dax_layouts(), similar to xfs_break_leased_layouts(), scans
for busy / pinned dax pages and waits for those pages to go idle before
any potential extent unmap operation.

dax_layout_busy_page() handles synchronizing against new page-busy
events (get_user_pages). It invalidates all mappings to trigger the
get_user_pages slow path which will eventually block on the xfs inode
lock held in XFS_MMAPLOCK_EXCL mode. If dax_layout_busy_page() finds a
busy page it returns it for xfs to wait for the page-idle event that
will fire when the page reference count reaches 1 (recall ZONE_DEVICE
pages are idle at count 1). While waiting, the XFS_MMAPLOCK_EXCL lock is
dropped in order to not deadlock the process that might be trying to
elevate the page count of more pages before arranging for any of them to
go idle. I.e. the typical case of submitting I/O is that
iov_iter_get_pages() elevates the reference count of all pages in the
I/O before starting I/O on the first page.

Cc: Jan Kara 
Cc: Dave Chinner 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/xfs/xfs_file.c |   68 +++--
 1 file changed, 65 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f914f0628dc2..3e7a69cebf95 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -752,6 +752,55 @@ xfs_file_write_iter(
return ret;
 }
 
+static int xfs_wait_dax_page(
+   atomic_t*count,
+   unsigned intmode)
+{
+   uintiolock = XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
+   struct page *page = refcount_to_page(count);
+   struct address_space*mapping = page->mapping;
+   struct inode*inode = mapping->host;
+   struct xfs_inode*ip = XFS_I(inode);
+
+   ASSERT(xfs_isilocked(ip, XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL));
+
+   if (page_ref_count(page) == 1)
+   return 0;
+
+   xfs_iunlock(ip, iolock);
+   schedule();
+   xfs_ilock(ip, iolock);
+
+   if (signal_pending_state(mode, current))
+   return -EINTR;
+   return 1;
+}
+
+static int
+xfs_break_dax_layouts(
+   struct inode*inode,
+   uintiolock)
+{
+   struct page *page;
+   int ret;
+
+   page = dax_layout_busy_page(inode->i_mapping);
+   if (!page)
+   return 0;
+
+   ret = wait_on_atomic_one(&page->_refcount, xfs_wait_dax_page,
+   TASK_INTERRUPTIBLE);
+
+   if (ret <= 0)
+   return ret;
+
+   /*
+* We slept, so need to retry. Yes, this assumes transient page
+* pins.
+*/
+   return -EBUSY;
+}
+
 int
 xfs_break_layouts(
struct inode*inode,
@@ -765,12 +814,25 @@ xfs_break_layouts(
if (flags & XFS_BREAK_REMOTE)
iolock_assert |= XFS_IOLOCK_SHARED|XFS_IOLOCK_EXCL;
if (flags & XFS_BREAK_MAPS)
-   iolock_assert |= XFS_MMAPLOCK_EXCL;
+   iolock_assert |= XFS_IOLOCK_EXCL|XFS_MMAPLOCK_EXCL;
 
ASSERT(xfs_isilocked(ip, iolock_assert));
 
-   if (flags & XFS_BREAK_REMOTE)
-   ret = xfs_break_leased_layouts(inode, iolock);
+   do {
+   if (flags & XFS_BREAK_REMOTE)
+   ret = xfs_break_leased_layouts(inode, iolock);
+   if (ret)
+   return ret;
+   if (flags & XFS_BREAK_MAPS)
+   ret = xfs_break_dax_layouts(inode, *iolock);
+   /*
+* EBUSY indicates that we dropped locks and waited for
+* the dax layout to be released. When that happens we
+* need to revalidate that no new leases or pinned dax
+* mappings have been established.
+*/
+   } while (ret == -EBUSY);
+
return ret;
 }
 



[PATCH v5 07/11] mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS

2018-03-09 Thread Dan Williams
The HMM sub-system extended dev_pagemap to arrange a callback when a
dev_pagemap-managed page is freed. Since a dev_pagemap page is free /
idle when its reference count is 1, it requires an additional branch to
check the page type at put_page() time. Given put_page() is a hot path,
we do not want to incur that check if HMM is not in use, so a static
branch is used to avoid that overhead when not necessary.

Now, the FS_DAX implementation wants to reuse this mechanism for
receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
static-key into a generic mechanism that either HMM or FS_DAX code paths
can enable.

Cc: "Jérôme Glisse" 
Cc: Michal Hocko 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c  |2 ++
 fs/Kconfig   |1 +
 include/linux/memremap.h |   20 ++-
 include/linux/mm.h   |   61 --
 kernel/memremap.c|   30 ---
 mm/Kconfig   |5 
 mm/hmm.c |   13 ++
 mm/swap.c|3 ++
 8 files changed, 84 insertions(+), 51 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ecefe9f7eb60..619b1ed6434c 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -191,6 +191,7 @@ struct dax_device *fs_dax_claim_bdev(struct block_device 
*bdev, void *owner)
return NULL;
}
 
+   dev_pagemap_get_ops();
pgmap->type = MEMORY_DEVICE_FS_DAX;
pgmap->page_free = generic_dax_pagefree;
pgmap->data = owner;
@@ -215,6 +216,7 @@ void fs_dax_release(struct dax_device *dax_dev, void *owner)
pgmap->type = MEMORY_DEVICE_HOST;
pgmap->page_free = NULL;
pgmap->data = NULL;
+   dev_pagemap_put_ops();
mutex_unlock(_lock);
 }
 EXPORT_SYMBOL_GPL(fs_dax_release);
diff --git a/fs/Kconfig b/fs/Kconfig
index bc821a86d965..1f0832bbc32f 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -38,6 +38,7 @@ config FS_DAX
bool "Direct Access (DAX) support"
depends on MMU
depends on !(ARM || MIPS || SPARC)
+   select DEV_PAGEMAP_OPS
select FS_IOMAP
select DAX
help
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 02d6d042ee7f..9faf25d6abef 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -1,7 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
-#include 
 #include 
 #include 
 
@@ -130,6 +129,9 @@ struct dev_pagemap {
enum memory_type type;
 };
 
+void dev_pagemap_get_ops(void);
+void dev_pagemap_put_ops(void);
+
 #ifdef CONFIG_ZONE_DEVICE
 void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap);
 struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
@@ -137,8 +139,6 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn,
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
-
-static inline bool is_zone_device_page(const struct page *page);
 #else
 static inline void *devm_memremap_pages(struct device *dev,
struct dev_pagemap *pgmap)
@@ -169,20 +169,6 @@ static inline void vmem_altmap_free(struct vmem_altmap 
*altmap,
 }
 #endif /* CONFIG_ZONE_DEVICE */
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-static inline bool is_device_private_page(const struct page *page)
-{
-   return is_zone_device_page(page) &&
-   page->pgmap->type == MEMORY_DEVICE_PRIVATE;
-}
-
-static inline bool is_device_public_page(const struct page *page)
-{
-   return is_zone_device_page(page) &&
-   page->pgmap->type == MEMORY_DEVICE_PUBLIC;
-}
-#endif /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-
 static inline void put_dev_pagemap(struct dev_pagemap *pgmap)
 {
if (pgmap)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..088c76bce360 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -812,27 +812,55 @@ static inline bool is_zone_device_page(const struct page 
*page)
 }
 #endif
 
-#if defined(CONFIG_DEVICE_PRIVATE) || defined(CONFIG_DEVICE_PUBLIC)
-void put_zone_device_private_or_public_page(struct page *page);
-DECLARE_STATIC_KEY_FALSE(device_private_key);
-#define IS_HMM_ENABLED static_branch_unlikely(_private_key)
-static inline bool is_device_private_page(const struct page *page);
-static inline bool is_device_public_page(const struct page *page);
-#else /* CONFIG_DEVICE_PRIVATE || CONFIG_DEVICE_PUBLIC */
-static inline void put_zone_device_private_or_public_page(struct page *page)
+#ifdef CONFIG_DEV_PAGEMAP_OPS
+void __put_devmap_managed_page(struct page *page);
+DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
+static inline bool put_devmap_managed_page(struct page *page)
 {
+   if (!static_branch_unlikely(&devmap_managed_key))
+   return false;
+   if (!is_zone_device_page(page))
+   return 

[PATCH v5 09/11] mm, fs, dax: handle layout changes to pinned dax mappings

2018-03-09 Thread Dan Williams
Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient; if that assumption is violated, the system would
likely have hung on the uncompleted I/O.

Cc: Jan Kara 
Cc: Jeff Moyer 
Cc: Dave Chinner 
Cc: Matthew Wilcox 
Cc: Alexander Viro 
Cc: "Darrick J. Wong" 
Cc: Ross Zwisler 
Cc: Dave Hansen 
Cc: Andrew Morton 
Reported-by: Christoph Hellwig 
Signed-off-by: Dan Williams 
---
 fs/dax.c|   93 +++
 include/linux/dax.h |   30 
 mm/gup.c|5 +++
 3 files changed, 128 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index fecf463a1468..cfaaf31fae85 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -375,6 +375,19 @@ static void dax_disassociate_entry(void *entry, struct 
address_space *mapping,
}
 }
 
+static struct page *dax_busy_page(void *entry)
+{
+   unsigned long pfn, end_pfn;
+
+   for_each_entry_pfn(entry, pfn, end_pfn) {
+   struct page *page = pfn_to_page(pfn);
+
+   if (page_ref_count(page) > 1)
+   return page;
+   }
+   return NULL;
+}
+
 /*
  * Find radix tree entry at given index. If it points to an exceptional entry,
  * return it with the radix tree entry locked. If the radix tree doesn't
@@ -516,6 +529,85 @@ static void *grab_mapping_entry(struct address_space 
*mapping, pgoff_t index,
return entry;
 }
 
+/**
+ * dax_layout_busy_page - find first pinned page in @mapping
+ * @mapping: address space to scan for a page with ref count > 1
+ *
+ * DAX requires ZONE_DEVICE mapped pages. These pages are never
+ * 'onlined' to the page allocator so they are considered idle when
+ * page->count == 1. A filesystem uses this interface to determine if
+ * any page in the mapping is busy, i.e. for DMA, or other
+ * get_user_pages() usages.
+ *
+ * It is expected that the filesystem is holding locks to block the
+ * establishment of new mappings in this address_space. I.e. it expects
+ * to be able to run unmap_mapping_range() and subsequently not race
+ * mapping_mapped() becoming true. It expects that get_user_pages() pte
+ * walks are performed under rcu_read_lock().
+ */
+struct page *dax_layout_busy_page(struct address_space *mapping)
+{
+   pgoff_t indices[PAGEVEC_SIZE];
+   struct page *page = NULL;
+   struct pagevec pvec;
+   pgoff_t index, end;
+   unsigned i;
+
+   /*
+* In the 'limited' case get_user_pages() for dax is disabled.
+*/
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
+   return NULL;
+
+   if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+   return NULL;
+
+   pagevec_init(&pvec);
+   index = 0;
+   end = -1;
+   /*
+* Flush dax_layout_lock() sections to ensure all possible page
+* references have been taken, or otherwise arrange for faults
+* to block on the filesystem lock that is taken for
+* establishing new mappings.
+*/
+   unmap_mapping_range(mapping, 0, 0, 1);
+   synchronize_rcu();
+
+   while (index < end && pagevec_lookup_entries(&pvec, mapping, index,
+   min(end 

[PATCH v5 00/11] dax: fix dma vs truncate/hole-punch

2018-03-09 Thread Dan Williams
Changes since v4 [1]:
* Kill the DEFINE_FSDAX_AOPS macro and just open code new
  address_space_operations instances for each fs (Matthew, Jan, Dave,
  Christoph)

* Rename routines that had a 'dma_' prefix to 'dax_layout_' and merge
  the dax-layout-break into xfs_break_layouts() (Dave, Christoph)

* Rework the implementation to have the fsdax core find the pages, but
  leave the responsibility of waiting on those pages to the filesystem
  (Dave).

* Drop the nfit_test infrastructure for testing this mechanism, I plan
  to investigate better mechanisms for injecting arbitrary put_page()
  delays for dax pages relative to an extent unmap operation. The
  dm_delay target does not do what I want since it operates at whole
  device level. A better test interface would be a mechanism to delay
  I/O completion based on whether a bio referenced a given LBA.

Not changed since v4:
* This implementation still relies on RCU for synchronizing
  get_user_pages() and get_user_pages_fast() against
  dax_layout_busy_page(). We could perform the operation with just
  barriers if we knew at get_user_pages() time that the pages were flagged
  for truncation.  However, dax_layout_busy_page() does not have the
  information to flag that a page is actually going to be truncated, only
  that it *might* be truncated.

[1]: https://lists.01.org/pipermail/linux-nvdimm/2017-December/013704.html



Background:

get_user_pages() in the filesystem pins file backed memory pages for
access by devices performing dma. However, it only pins the memory pages
not the page-to-file offset association. If a file is truncated the
pages are mapped out of the file and dma may continue indefinitely into
a page that is owned by a device driver. This breaks coherency of the
file vs dma, but the assumption is that if userspace wants the
file-space truncated it does not matter what data is inbound from the
device, it is not relevant anymore. The only expectation is that dma can
safely continue while the filesystem reallocates the block(s).

Problem:

This expectation that dma can safely continue while the filesystem
changes the block map is broken by dax. With dax the target dma page
*is* the filesystem block. The model of leaving the page pinned for dma,
but truncating the file block out of the file, means that the filesystem
is free to reallocate a block under active dma to another file and now
the expected data-incoherency situation has turned into active
data-corruption.

Solution:

Defer all filesystem operations (fallocate(), truncate()) on a dax mode
file while any page/block in the file is under active dma. This solution
assumes that dma is transient. Cases where dma operations are known to
not be transient, like RDMA, have been explicitly disabled via
commits like 5f1d43de5416 "IB/core: disable memory registration of
filesystem-dax vmas".

The dax_layout_busy_page() routine is called by filesystems with a lock
held against mm faults (i_mmap_lock) to find pinned / busy dax pages.
The process of looking up a busy page invalidates all mappings
to trigger any subsequent get_user_pages() to block on i_mmap_lock.
The filesystem continues to call dax_layout_busy_page() until it finally
returns no more active pages. This approach assumes that the page
pinning is transient; if that assumption is violated, the system would
likely have hung on the uncompleted I/O.


---

Dan Williams (11):
  dax: store pfns in the radix
  xfs, dax: introduce xfs_dax_aops
  ext4, dax: introduce ext4_dax_aops
  ext2, dax: introduce ext2_dax_aops
  fs, dax: use page->mapping to warn if truncate collides with a busy page
  mm, dax: enable filesystems to trigger dev_pagemap ->page_free callbacks
  mm, dev_pagemap: introduce CONFIG_DEV_PAGEMAP_OPS
  wait_bit: introduce {wait_on,wake_up}_atomic_one
  mm, fs, dax: handle layout changes to pinned dax mappings
  xfs: prepare xfs_break_layouts() for another layout type
  xfs, dax: introduce xfs_break_dax_layouts()


 drivers/dax/super.c  |   96 +++--
 drivers/nvdimm/pmem.c|3 -
 fs/Kconfig   |1 
 fs/dax.c |  259 +-
 fs/ext2/ext2.h   |1 
 fs/ext2/inode.c  |   28 -
 fs/ext2/namei.c  |   18 ---
 fs/ext2/super.c  |6 +
 fs/ext4/inode.c  |   11 ++
 fs/ext4/super.c  |6 +
 fs/xfs/xfs_aops.c|7 +
 fs/xfs/xfs_aops.h|1 
 fs/xfs/xfs_file.c|   94 -
 fs/xfs/xfs_inode.h   |9 ++
 fs/xfs/xfs_ioctl.c   |9 +-
 fs/xfs/xfs_iops.c|   17 ++-
 fs/xfs/xfs_pnfs.c|8 +
 fs/xfs/xfs_pnfs.h|4 -
 fs/xfs/xfs_super.c   |   20 ++--
 include/linux/dax.h  |   45 +++-
 include/linux/memremap.h |   28 ++---
 include/linux/mm.h   |   61 ---
 include/linux/wait_bit.h |   13 ++
 kernel/memremap.c|   30 +
 kernel/sched/wait_bit.c  |   

[PATCH v5 01/11] dax: store pfns in the radix

2018-03-09 Thread Dan Williams
In preparation for examining the busy state of dax pages in the truncate
path, switch from sectors to pfns in the radix.

Cc: Jeff Moyer 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: Ross Zwisler 
Reviewed-by: Jan Kara 
Signed-off-by: Dan Williams 
---
 drivers/dax/super.c |   15 +++--
 fs/dax.c|   83 +++
 2 files changed, 43 insertions(+), 55 deletions(-)

diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index ecdc292aa4e4..2b2332b605e4 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -124,10 +124,19 @@ int __bdev_dax_supported(struct super_block *sb, int 
blocksize)
return len < 0 ? len : -EIO;
}
 
-   if ((IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn))
-   || pfn_t_devmap(pfn))
+   if (IS_ENABLED(CONFIG_FS_DAX_LIMITED) && pfn_t_special(pfn)) {
+   /*
+* An arch that has enabled the pmem api should also
+* have its drivers support pfn_t_devmap()
+*
+* This is a developer warning and should not trigger in
+* production. dax_flush() will crash since it depends
+* on being able to do (page_address(pfn_to_page())).
+*/
+   WARN_ON(IS_ENABLED(CONFIG_ARCH_HAS_PMEM_API));
+   } else if (pfn_t_devmap(pfn)) {
/* pass */;
-   else {
+   } else {
pr_debug("VFS (%s): error: dax support not enabled\n",
sb->s_id);
return -EOPNOTSUPP;
diff --git a/fs/dax.c b/fs/dax.c
index 0276df90e86c..b646a46e4d12 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -73,16 +73,15 @@ fs_initcall(init_dax_wait_table);
 #define RADIX_DAX_ZERO_PAGE(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
 #define RADIX_DAX_EMPTY(1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 
3))
 
-static unsigned long dax_radix_sector(void *entry)
+static unsigned long dax_radix_pfn(void *entry)
 {
return (unsigned long)entry >> RADIX_DAX_SHIFT;
 }
 
-static void *dax_radix_locked_entry(sector_t sector, unsigned long flags)
+static void *dax_radix_locked_entry(unsigned long pfn, unsigned long flags)
 {
return (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY | flags |
-   ((unsigned long)sector << RADIX_DAX_SHIFT) |
-   RADIX_DAX_ENTRY_LOCK);
+   (pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
 }
 
 static unsigned int dax_radix_order(void *entry)
@@ -526,12 +525,13 @@ static int copy_user_dax(struct block_device *bdev, 
struct dax_device *dax_dev,
  */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
  struct vm_fault *vmf,
- void *entry, sector_t sector,
+ void *entry, pfn_t pfn_t,
  unsigned long flags, bool dirty)
 {
struct radix_tree_root *page_tree = &mapping->page_tree;
-   void *new_entry;
+   unsigned long pfn = pfn_t_to_pfn(pfn_t);
pgoff_t index = vmf->pgoff;
+   void *new_entry;
 
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -546,7 +546,7 @@ static void *dax_insert_mapping_entry(struct address_space 
*mapping,
}
 
spin_lock_irq(&mapping->tree_lock);
-   new_entry = dax_radix_locked_entry(sector, flags);
+   new_entry = dax_radix_locked_entry(pfn, flags);
 
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
/*
@@ -657,17 +657,14 @@ static void dax_mapping_entry_mkclean(struct 
address_space *mapping,
i_mmap_unlock_read(mapping);
 }
 
-static int dax_writeback_one(struct block_device *bdev,
-   struct dax_device *dax_dev, struct address_space *mapping,
-   pgoff_t index, void *entry)
+static int dax_writeback_one(struct dax_device *dax_dev,
+   struct address_space *mapping, pgoff_t index, void *entry)
 {
struct radix_tree_root *page_tree = &mapping->page_tree;
-   void *entry2, **slot, *kaddr;
-   long ret = 0, id;
-   sector_t sector;
-   pgoff_t pgoff;
+   void *entry2, **slot;
+   unsigned long pfn;
+   long ret = 0;
size_t size;
-   pfn_t pfn;
 
/*
 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -683,10 +680,10 @@ static int dax_writeback_one(struct block_device *bdev,
goto put_unlocked;
/*
 * Entry got reallocated elsewhere? No need to writeback. We have to
-* compare sectors as we must not bail out due to difference in lockbit
+* compare pfns as we must not bail out due to difference in lockbit
 * or entry type.
 */
-   if 

-   RADIX_DAX_ENTRY_LOCK);
+   (pfn << RADIX_DAX_SHIFT) | RADIX_DAX_ENTRY_LOCK);
 }
 
 static unsigned int dax_radix_order(void *entry)
@@ -526,12 +525,13 @@ static int copy_user_dax(struct block_device *bdev, 
struct dax_device *dax_dev,
  */
 static void *dax_insert_mapping_entry(struct address_space *mapping,
  struct vm_fault *vmf,
- void *entry, sector_t sector,
+ void *entry, pfn_t pfn_t,
  unsigned long flags, bool dirty)
 {
struct radix_tree_root *page_tree = &mapping->page_tree;
-   void *new_entry;
+   unsigned long pfn = pfn_t_to_pfn(pfn_t);
pgoff_t index = vmf->pgoff;
+   void *new_entry;
 
if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -546,7 +546,7 @@ static void *dax_insert_mapping_entry(struct address_space 
*mapping,
}
 
spin_lock_irq(&mapping->tree_lock);
-   new_entry = dax_radix_locked_entry(sector, flags);
+   new_entry = dax_radix_locked_entry(pfn, flags);
 
if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry)) {
/*
@@ -657,17 +657,14 @@ static void dax_mapping_entry_mkclean(struct 
address_space *mapping,
i_mmap_unlock_read(mapping);
 }
 
-static int dax_writeback_one(struct block_device *bdev,
-   struct dax_device *dax_dev, struct address_space *mapping,
-   pgoff_t index, void *entry)
+static int dax_writeback_one(struct dax_device *dax_dev,
+   struct address_space *mapping, pgoff_t index, void *entry)
 {
struct radix_tree_root *page_tree = &mapping->page_tree;
-   void *entry2, **slot, *kaddr;
-   long ret = 0, id;
-   sector_t sector;
-   pgoff_t pgoff;
+   void *entry2, **slot;
+   unsigned long pfn;
+   long ret = 0;
size_t size;
-   pfn_t pfn;
 
/*
 * A page got tagged dirty in DAX mapping? Something is seriously
@@ -683,10 +680,10 @@ static int dax_writeback_one(struct block_device *bdev,
goto put_unlocked;
/*
 * Entry got reallocated elsewhere? No need to writeback. We have to
-* compare sectors as we must not bail out due to difference in lockbit
+* compare pfns as we must not bail out due to difference in lockbit
 * or entry type.
 */
-   if (dax_radix_sector(entry2) != dax_radix_sector(entry))
+   if (dax_radix_pfn(entry2) != dax_radix_pfn(entry))
goto 

[PATCH 2/2] rtc: s5m: Remove VLA usage

2018-03-09 Thread Gustavo A. R. Silva
In preparation for enabling -Wvla, remove VLAs and replace them
with fixed-length arrays instead.

From a security viewpoint, the use of Variable Length Arrays can be
a vector for stack overflow attacks. Also, in general, as the code
evolves it is easy to lose track of how big a VLA can get. Thus, we
can end up having segfaults that are hard to debug.

Also, fixed as part of the directive to remove all VLAs from
the kernel: https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Gustavo A. R. Silva 
---
 drivers/rtc/rtc-s5m.c | 14 --
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/rtc/rtc-s5m.c b/drivers/rtc/rtc-s5m.c
index 4c363de..8428455 100644
--- a/drivers/rtc/rtc-s5m.c
+++ b/drivers/rtc/rtc-s5m.c
@@ -47,6 +47,8 @@ enum {
RTC_MONTH,
RTC_YEAR1,
RTC_YEAR2,
+   /* Make sure this is always the last enum name. */
+   RTC_MAX_NUM_TIME_REGS
 };
 
 /*
@@ -378,7 +380,7 @@ static void s5m8763_tm_to_data(struct rtc_time *tm, u8 
*data)
 static int s5m_rtc_read_time(struct device *dev, struct rtc_time *tm)
 {
struct s5m_rtc_info *info = dev_get_drvdata(dev);
-   u8 data[info->regs->regs_count];
+   u8 data[RTC_MAX_NUM_TIME_REGS];
int ret;
 
if (info->regs->read_time_udr_mask) {
@@ -424,7 +426,7 @@ static int s5m_rtc_read_time(struct device *dev, struct 
rtc_time *tm)
 static int s5m_rtc_set_time(struct device *dev, struct rtc_time *tm)
 {
struct s5m_rtc_info *info = dev_get_drvdata(dev);
-   u8 data[info->regs->regs_count];
+   u8 data[RTC_MAX_NUM_TIME_REGS];
int ret = 0;
 
switch (info->device_type) {
@@ -461,7 +463,7 @@ static int s5m_rtc_set_time(struct device *dev, struct 
rtc_time *tm)
 static int s5m_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *alrm)
 {
struct s5m_rtc_info *info = dev_get_drvdata(dev);
-   u8 data[info->regs->regs_count];
+   u8 data[RTC_MAX_NUM_TIME_REGS];
unsigned int val;
int ret, i;
 
@@ -511,7 +513,7 @@ static int s5m_rtc_read_alarm(struct device *dev, struct 
rtc_wkalrm *alrm)
 
 static int s5m_rtc_stop_alarm(struct s5m_rtc_info *info)
 {
-   u8 data[info->regs->regs_count];
+   u8 data[RTC_MAX_NUM_TIME_REGS];
int ret, i;
struct rtc_time tm;
 
@@ -556,7 +558,7 @@ static int s5m_rtc_stop_alarm(struct s5m_rtc_info *info)
 static int s5m_rtc_start_alarm(struct s5m_rtc_info *info)
 {
int ret;
-   u8 data[info->regs->regs_count];
+   u8 data[RTC_MAX_NUM_TIME_REGS];
u8 alarm0_conf;
struct rtc_time tm;
 
@@ -609,7 +611,7 @@ static int s5m_rtc_start_alarm(struct s5m_rtc_info *info)
 static int s5m_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alrm)
 {
struct s5m_rtc_info *info = dev_get_drvdata(dev);
-   u8 data[info->regs->regs_count];
+   u8 data[RTC_MAX_NUM_TIME_REGS];
int ret;
 
switch (info->device_type) {
-- 
2.7.4




Re: [PATCH v2 3/3] ALSA: hda: Disabled unused audio controller for Dell platforms with Switchable Graphics

2018-03-09 Thread Lukas Wunner
On Fri, Mar 09, 2018 at 05:30:15PM +0800, Kai Heng Feng wrote:
> >On Thursday 08 March 2018 17:10:23 Kai-Heng Feng wrote:
> >>Some Dell platforms (Precision 7510/7710/7520/7720) have a BIOS option
> >>"Switchable Graphics" (SG).
> >>
> >>When SG is enabled, we have:
> >>00:02.0 VGA compatible controller: Intel Corporation Device 591b (rev 04)
> >>00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
> >>01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> >>[AMD/ATI] Ellesmere [Polaris10]
> >>01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere
> >>[Radeon RX 580]
> >>
> >>The Intel Audio outputs all the sound, including HDMI audio. The audio
> >>controller comes with AMD graphics doesn't get used.
> >>
> >>When SG is disabled, we have:
> >>00:1f.3 Audio device: Intel Corporation CM238 HD Audio Controller (rev 31)
> >>01:00.0 VGA compatible controller: Advanced Micro Devices, Inc.
> >>[AMD/ATI] Ellesmere [Polaris10]
> >>01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere
> >>[Radeon RX 580]
> >>
> >>Now it's a typical discrete-only system. HDMI audio comes from AMD audio
> >>controller, others from Intel audio controller.
> >>
> >>When SG is enabled, the unused AMD audio controller still exposes its
> >>sysfs, so userspace still opens the control file and stream. If
> >>userspace tries to output sound through the stream, it hangs when
> >>runtime suspend kicks in:
> >>[ 12.796265] snd_hda_intel :01:00.1: Disabling via vga_switcheroo
> >>[ 12.796367] snd_hda_intel :01:00.1: Cannot lock devices!
> >>
> >>Since the discrete audio controller isn't useful when SG enabled, we
> >>should just disable the device.
> 
> The platform does have a NVIDIA variant, but the discrete NVIDIA have a
> audio controller, hence it doesn't have the issue.

Sorry, I don't quite understand:  The AMD variant *also* has an audio
controller, so what's the difference?  Or did you mean the Nvidia
variant *doesn't* have an audio controller?

Pretty much all modern Nvidia GPUs do have an integrated HDA
controller, however it's possible to hide it by clearing a bit
at offset 0x488 in the GPU's config space.  Some BIOSes hide
the HDA if no external display is attached.

I could imagine that the BIOS of the Dell machines in question
hides the HDA if Switchable Graphics is enabled.  If that is the
case, be aware that there's an ongoing discussion to always expose
the HDA controller because the behavior of some BIOSes to only
expose the HDA when a display is attached causes massive problems
with Linux' HDA driver:
https://bugs.freedesktop.org/show_bug.cgi?id=75985

If we decide to always expose the HDA controller on Nvidia cards,
you may need to also match for the Nvidia vendor ID here.

Thanks,

Lukas



Re: [PATCH] x86, powerpc : pkey-mprotect must allow pkey-0

2018-03-09 Thread Dave Hansen
On 03/09/2018 09:55 PM, Ram Pai wrote:
> On Fri, Mar 09, 2018 at 02:40:32PM -0800, Dave Hansen wrote:
>> On 03/09/2018 12:12 AM, Ram Pai wrote:
>>> Once an address range is associated with an allocated pkey, it cannot be
>>> reverted back to key-0. There is no valid reason for the above behavior.  On
>>> the contrary applications need the ability to do so.
>> Why don't we just set pkey 0 to be allocated in the allocation bitmap by
>> default?
> ok. that will make it allocatable. But it will not be associatable,
> given the bug in the current code. And what will be the
> default key associated with a pte? zero? or something else?

I'm just saying that I think we should try to keep from making it
special as much as possible.

Let's fix the bug that keeps it from being associatable.




[PATCH] vgacon: fix function prototypes

2018-03-09 Thread Joao Moreira
By casting through void pointers, it is possible to indirectly invoke
functions with prototypes that do not match those of the function
pointers used to call them. Despite being widely used to relax function
invocation, this should be avoided when possible, as it defeats
heuristics such as prototype-matching-based Control-Flow Integrity
(CFI), which can be used to prevent ROP-based attacks.

Given the above, the current efforts to improve Linux security, and the
upcoming kernel support for compilers with CFI features, fix the
prototypes in the vgacon console driver.

Another similar fix can be seen in [1].

[1] https://android-review.googlesource.com/c/kernel/common/+/602010

Signed-off-by:  João Moreira 
---
 drivers/video/console/vgacon.c | 18 +-
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/drivers/video/console/vgacon.c b/drivers/video/console/vgacon.c
index a17ba1465815..f00b630f6839 100644
--- a/drivers/video/console/vgacon.c
+++ b/drivers/video/console/vgacon.c
@@ -1407,21 +1407,29 @@ static bool vgacon_scroll(struct vc_data *c, unsigned 
int t, unsigned int b,
  *  The console `switch' structure for the VGA based console
  */
 
-static int vgacon_dummy(struct vc_data *c)
+static int vgacon_clear(struct vc_data *c)
 {
return 0;
 }
 
-#define DUMMY (void *) vgacon_dummy
+static void vgacon_putc(struct vc_data *c, int a, int b, int d)
+{
+   return;
+}
+
+static void vgacon_putcs(struct vc_data *c, ushort *s, int a, int b, int d)
+{
+   return;
+}
 
 const struct consw vga_con = {
.owner = THIS_MODULE,
.con_startup = vgacon_startup,
.con_init = vgacon_init,
.con_deinit = vgacon_deinit,
-   .con_clear = DUMMY,
-   .con_putc = DUMMY,
-   .con_putcs = DUMMY,
+   .con_clear = vgacon_clear,
+   .con_putc = vgacon_putc,
+   .con_putcs = vgacon_putcs,
.con_cursor = vgacon_cursor,
.con_scroll = vgacon_scroll,
.con_switch = vgacon_switch,
-- 
2.13.6





[PATCH 0/2] Remove VLA usage in rtc-s5m

2018-03-09 Thread Gustavo A. R. Silva
This patchset aims to remove VLA usage from rtc-s5m.

The first patch moves an enum from rtc.h to rtc-s5m.c, as this is the
only driver in which such enum is actually being used [1].

The second patch adds the enum name RTC_MAX_NUM_TIME_REGS, which will
be used as a maximum length to the current VLAs, hence turning them
into fixed-length arrays instead.

[1] https://marc.info/?l=linux-rtc&m=152060068925948&w=2

Thanks

Gustavo A. R. Silva (2):
  rtc: s5m: move enum from rtc.h to rtc-s5m.c
  rtc: s5m: Remove VLA usage

 drivers/rtc/rtc-s5m.c   | 25 +++--
 include/linux/mfd/samsung/rtc.h | 11 ---
 2 files changed, 19 insertions(+), 17 deletions(-)

-- 
2.7.4





[PATCH 1/2] rtc: s5m: Move enum from rtc.h to rtc-s5m.c

2018-03-09 Thread Gustavo A. R. Silva
Move this enum to rtc-s5m.c, since it is meaningless to other drivers [1].

[1] https://marc.info/?l=linux-rtc&m=152060068925948&w=2

Signed-off-by: Gustavo A. R. Silva 
---
 drivers/rtc/rtc-s5m.c   | 11 +++
 include/linux/mfd/samsung/rtc.h | 11 ---
 2 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/rtc/rtc-s5m.c b/drivers/rtc/rtc-s5m.c
index 6deae10..4c363de 100644
--- a/drivers/rtc/rtc-s5m.c
+++ b/drivers/rtc/rtc-s5m.c
@@ -38,6 +38,17 @@
  */
 #define UDR_READ_RETRY_CNT 5
 
+enum {
+   RTC_SEC = 0,
+   RTC_MIN,
+   RTC_HOUR,
+   RTC_WEEKDAY,
+   RTC_DATE,
+   RTC_MONTH,
+   RTC_YEAR1,
+   RTC_YEAR2,
+};
+
 /*
  * Registers used by the driver which are different between chipsets.
  *
diff --git a/include/linux/mfd/samsung/rtc.h b/include/linux/mfd/samsung/rtc.h
index 48c3c5b..9ed2871 100644
--- a/include/linux/mfd/samsung/rtc.h
+++ b/include/linux/mfd/samsung/rtc.h
@@ -141,15 +141,4 @@ enum s2mps_rtc_reg {
 #define WTSR_ENABLE_SHIFT  6
 #define WTSR_ENABLE_MASK   (1 << WTSR_ENABLE_SHIFT)
 
-enum {
-   RTC_SEC = 0,
-   RTC_MIN,
-   RTC_HOUR,
-   RTC_WEEKDAY,
-   RTC_DATE,
-   RTC_MONTH,
-   RTC_YEAR1,
-   RTC_YEAR2,
-};
-
 #endif /*  __LINUX_MFD_SEC_RTC_H */
-- 
2.7.4





Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-09 Thread Miguel Ojeda
On Sat, Mar 10, 2018 at 4:11 AM, Randy Dunlap  wrote:
> On 03/09/2018 04:07 PM, Andrew Morton wrote:
>> On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook  wrote:
>>
>>> When max() is used in stack array size calculations from literal values
>>> (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
>>> thinks this is a dynamic calculation due to the single-eval logic, which
>>> is not needed in the literal case. This change removes several accidental
>>> stack VLAs from an x86 allmodconfig build:
>>>
>>> $ diff -u before.txt after.txt | grep ^-
>>> -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids 
>>> variable length array ‘ids’ [-Wvla]
>>> -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length 
>>> array ‘namebuf’ [-Wvla]
>>> -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ 
>>> [-Wvla]
>>> -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array 
>>> ‘buff’ [-Wvla]
>>> -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array 
>>> ‘buff’ [-Wvla]
>>> -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array 
>>> ‘buff64’ [-Wvla]
>>>
>>> Based on an earlier patch from Josh Poimboeuf.
>>
>> v1, v2 and v3 of this patch all fail with gcc-4.4.4:
>>
>> ./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
>> ./include/linux/jiffies.h:444: error: first argument to 
>> '__builtin_choose_expr' not a constant
>
>
> I'm seeing that problem with
>> gcc --version
> gcc (SUSE Linux) 4.8.5

Same here, 4.8.5 fails. gcc 5.4.1 seems to work. I compiled a minimal
5.1.0 and it seems to work as well.

Cheers,
Miguel



Re: [PATCH] x86, powerpc : pkey-mprotect must allow pkey-0

2018-03-09 Thread Ram Pai
On Fri, Mar 09, 2018 at 02:40:32PM -0800, Dave Hansen wrote:
> On 03/09/2018 12:12 AM, Ram Pai wrote:
> > Once an address range is associated with an allocated pkey, it cannot be
> > reverted back to key-0. There is no valid reason for the above behavior.  On
> > the contrary applications need the ability to do so.
> 
> Why don't we just set pkey 0 to be allocated in the allocation bitmap by
> default?

ok. that will make it allocatable. But it will not be associatable,
given the bug in the current code. And what will be the
default key associated with a pte? zero? or something else?

> 
> We *could* also just not let it be special and let it be freed.  An app
> could theoretically be careful and make sure nothing is using it.

Unable to see how this solves the problem. Need some more explanation.


RP





Re: [PATCH v2 0/3] drm: Add LVDS decoder bridge

2018-03-09 Thread Archit Taneja

Hi,

On Friday 09 March 2018 07:21 PM, Jacopo Mondi wrote:

Hello,
after some discussion on the proposed bindings for generic lvds decoder and
Thine THC63LVD1024, I decided to drop the THC63 specific part and just live with
a transparent decoder that does not support any configuration from DT.

Dropping THC63 support to avoid discussion on how to better implement support
for a DRM bridge with 2 input ports and focus on LVDS mode propagation through
bridges as explained in the v1 cover letter (for DRM people: please see [1]
for why I find it difficult to implement support for bridges with multiple
input endpoints)

Same base branch as v1, with same patches for V3M Eagle applied on top.
git://jmondi.org/linux v3m/v4.16-rc3/base

Thanks
j

v1 -> v2:
- Drop support for THC63LVD1024

[1] I had a quick look at how to model a DRM bridge with multiple input
ports, and I see a blocker in how DRM identifies and matches bridges using
the device node in place of the endpoint nodes.

As THC63LVD1024 supports up to 2 LVDS inputs and 2 LVDS outputs, I see only
a few ways to support that:
  1) register 2 drm bridges from the same driver (one for each input/output 
pair)
 but they would both be matched on the same device node when the preceding
 bridge calls "of_drm_find_bridge()".


I think this is the way to go. DRM doesn't say anywhere that we can't
have 2 drm_bridge-s contained in a single device. About the issue with
of_drm_find_bridge(): if you set the 2 bridges' 'of_node' fields to
the bridge1 and bridge2 nodes as shown below, wouldn't that suffice?
From what I know, we don't necessarily need to set a bridge's of_node
to the device node (i.e., thschip) itself.

thschip {
...
ports {
bridge1: port@0 {
...
};

bridge2: port@1 {
...
};
};
};
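The suggestion — register two bridges from one driver and point each bridge's of_node at its own port node so lookups stay unambiguous — can be sketched with stub types (the real struct drm_bridge, drm_bridge_add() and of_drm_find_bridge() live in the DRM core; everything below is a simplified stand-in, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the DRM types, just to sketch the idea; the
 * real definitions are in include/drm/drm_bridge.h. */
struct device_node { const char *name; };
struct drm_bridge {
	struct device_node *of_node;	/* what the lookup matches on */
	struct drm_bridge *next;
};

static struct drm_bridge *bridge_list;

static void model_drm_bridge_add(struct drm_bridge *b)
{
	b->next = bridge_list;
	bridge_list = b;
}

/* Lookup keyed on the per-port node rather than on the device node,
 * so two bridges registered by one driver stay distinguishable. */
static struct drm_bridge *model_of_drm_find_bridge(struct device_node *np)
{
	for (struct drm_bridge *b = bridge_list; b; b = b->next)
		if (b->of_node == np)
			return b;
	return NULL;
}
```

With of_node set to port@0 and port@1 respectively, a preceding bridge looking up its remote endpoint's parent port finds the right drm_bridge instance even though both belong to the same device.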


Thanks,
Archit


  2) register a single bridge with multiple "next bridges", but when the bridge
 gets attached I don't see a way to identify which next bridge to call
 "drm_bridge_attach()" on, as it depends on the endpoint the current bridge
 was attached to first, and we don't have that information.
  3) Register more instances of the same chip in DTS, one for each input/output
 pair. They would share supplies and GPIOs, and I don't like that.

I had a quick look at the bridges currently in mainline and none of them has
multiple input endpoints, except for the HDMI audio endpoint, which I haven't
found in use in any DTS. I guess the problem has already been debated and maybe
solved in the past, so feel free to point me to other sources.

Jacopo Mondi (3):
   dt-bindings: display: bridge: Document LVDS to parallel decoder
   drm: bridge: Add LVDS decoder driver
   arm64: dts: renesas: Add LVDS decoder to R-Car V3M Eagle

  .../bindings/display/bridge/lvds-decoder.txt   |  42 ++
  arch/arm64/boot/dts/renesas/r8a77970-eagle.dts |  31 +++-
  drivers/gpu/drm/bridge/Kconfig |   8 ++
  drivers/gpu/drm/bridge/Makefile|   1 +
  drivers/gpu/drm/bridge/lvds-decoder.c  | 157 +
  5 files changed, 237 insertions(+), 2 deletions(-)
  create mode 100644 
Documentation/devicetree/bindings/display/bridge/lvds-decoder.txt
  create mode 100644 drivers/gpu/drm/bridge/lvds-decoder.c

--
2.7.4




[5/5 V3] tpm: factor out tpm_get_timeouts

2018-03-09 Thread Tomas Winkler
Factor out tpm_get_timeouts into tpm2_get_timeouts
and tpm1_get_timeouts.

Signed-off-by: Tomas Winkler 
---
V2: Rebase
V3: 1. Fix typo tmp->tpm
2. Fix sparse WARNING: line over 80 characters

 drivers/char/tpm/tpm-interface.c | 127 ++-
 drivers/char/tpm/tpm.h   |   5 +-
 drivers/char/tpm/tpm1-cmd.c  | 107 +
 drivers/char/tpm/tpm2-cmd.c  |  22 +++
 4 files changed, 137 insertions(+), 124 deletions(-)

diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
index 40d1770f6b38..7f6968b750c8 100644
--- a/drivers/char/tpm/tpm-interface.c
+++ b/drivers/char/tpm/tpm-interface.c
@@ -402,132 +402,13 @@ EXPORT_SYMBOL_GPL(tpm_getcap);
 
 int tpm_get_timeouts(struct tpm_chip *chip)
 {
-   cap_t cap;
-   unsigned long timeout_old[4], timeout_chip[4], timeout_eff[4];
-   ssize_t rc;
-
if (chip->flags & TPM_CHIP_FLAG_HAVE_TIMEOUTS)
return 0;
 
-   if (chip->flags & TPM_CHIP_FLAG_TPM2) {
-   /* Fixed timeouts for TPM2 */
-   chip->timeout_a = msecs_to_jiffies(TPM2_TIMEOUT_A);
-   chip->timeout_b = msecs_to_jiffies(TPM2_TIMEOUT_B);
-   chip->timeout_c = msecs_to_jiffies(TPM2_TIMEOUT_C);
-   chip->timeout_d = msecs_to_jiffies(TPM2_TIMEOUT_D);
-   chip->duration[TPM_SHORT] =
-   msecs_to_jiffies(TPM2_DURATION_SHORT);
-   chip->duration[TPM_MEDIUM] =
-   msecs_to_jiffies(TPM2_DURATION_MEDIUM);
-   chip->duration[TPM_LONG] =
-   msecs_to_jiffies(TPM2_DURATION_LONG);
-   chip->duration[TPM_LONG_LONG] =
-   msecs_to_jiffies(TPM2_DURATION_LONG_LONG);
-
-   chip->flags |= TPM_CHIP_FLAG_HAVE_TIMEOUTS;
-   return 0;
-   }
-
-   rc = tpm_getcap(chip, TPM_CAP_PROP_TIS_TIMEOUT, &cap, NULL,
-   sizeof(cap.timeout));
-   if (rc == TPM_ERR_INVALID_POSTINIT) {
-   if (tpm_startup(chip))
-   return rc;
-
-   rc = tpm_getcap(chip, TPM_CAP_PROP_TIS_TIMEOUT, &cap,
-   "attempting to determine the timeouts",
-   sizeof(cap.timeout));
-   }
-
-   if (rc) {
-   dev_err(&chip->dev,
-   "A TPM error (%zd) occurred attempting to determine the 
timeouts\n",
-   rc);
-   return rc;
-   }
-
-   timeout_old[0] = jiffies_to_usecs(chip->timeout_a);
-   timeout_old[1] = jiffies_to_usecs(chip->timeout_b);
-   timeout_old[2] = jiffies_to_usecs(chip->timeout_c);
-   timeout_old[3] = jiffies_to_usecs(chip->timeout_d);
-   timeout_chip[0] = be32_to_cpu(cap.timeout.a);
-   timeout_chip[1] = be32_to_cpu(cap.timeout.b);
-   timeout_chip[2] = be32_to_cpu(cap.timeout.c);
-   timeout_chip[3] = be32_to_cpu(cap.timeout.d);
-   memcpy(timeout_eff, timeout_chip, sizeof(timeout_eff));
-
-   /*
-* Provide ability for vendor overrides of timeout values in case
-* of misreporting.
-*/
-   if (chip->ops->update_timeouts != NULL)
-   chip->timeout_adjusted =
-   chip->ops->update_timeouts(chip, timeout_eff);
-
-   if (!chip->timeout_adjusted) {
-   /* Restore default if chip reported 0 */
-   int i;
-
-   for (i = 0; i < ARRAY_SIZE(timeout_eff); i++) {
-   if (timeout_eff[i])
-   continue;
-
-   timeout_eff[i] = timeout_old[i];
-   chip->timeout_adjusted = true;
-   }
-
-   if (timeout_eff[0] != 0 && timeout_eff[0] < 1000) {
-   /* timeouts in msec rather usec */
-   for (i = 0; i != ARRAY_SIZE(timeout_eff); i++)
-   timeout_eff[i] *= 1000;
-   chip->timeout_adjusted = true;
-   }
-   }
-
-   /* Report adjusted timeouts */
-   if (chip->timeout_adjusted) {
-   dev_info(&chip->dev,
-HW_ERR "Adjusting reported timeouts: A %lu->%luus B 
%lu->%luus C %lu->%luus D %lu->%luus\n",
-timeout_chip[0], timeout_eff[0],
-timeout_chip[1], timeout_eff[1],
-timeout_chip[2], timeout_eff[2],
-timeout_chip[3], timeout_eff[3]);
-   }
-
-   chip->timeout_a = usecs_to_jiffies(timeout_eff[0]);
-   chip->timeout_b = usecs_to_jiffies(timeout_eff[1]);
-   chip->timeout_c = usecs_to_jiffies(timeout_eff[2]);
-   chip->timeout_d = usecs_to_jiffies(timeout_eff[3]);
-
-   rc = tpm_getcap(chip, TPM_CAP_PROP_TIS_DURATION, &cap,
-   "attempting to determine the durations",
-   sizeof(cap.duration));
-   
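The net effect of the factoring can be sketched in plain C: tpm_get_timeouts() shrinks to a dispatcher, and the TPM 2.0 branch moves into tpm2_get_timeouts(). Stub types below; the flag bits and millisecond values stand in for the real definitions in drivers/char/tpm/tpm.h and are assumptions for illustration.

```c
#include <assert.h>

/* Stand-ins for kernel definitions; bit positions and the TPM2 default
 * timeouts (in ms) are illustrative, not authoritative. */
#define TPM_CHIP_FLAG_TPM2		(1 << 1)
#define TPM_CHIP_FLAG_HAVE_TIMEOUTS	(1 << 4)

enum { TPM2_TIMEOUT_A = 750, TPM2_TIMEOUT_B = 2000,
       TPM2_TIMEOUT_C = 200, TPM2_TIMEOUT_D = 30 };

struct tpm_chip_model {
	unsigned int flags;
	unsigned long timeout_a, timeout_b, timeout_c, timeout_d;
};

/* Identity here; the kernel converts milliseconds to jiffies via HZ. */
static unsigned long msecs_to_jiffies_model(unsigned long ms)
{
	return ms;
}

/* The factored-out TPM 2.0 half: only fixed timeouts, no capability
 * queries, so it cannot fail. */
static int tpm2_get_timeouts_model(struct tpm_chip_model *chip)
{
	chip->timeout_a = msecs_to_jiffies_model(TPM2_TIMEOUT_A);
	chip->timeout_b = msecs_to_jiffies_model(TPM2_TIMEOUT_B);
	chip->timeout_c = msecs_to_jiffies_model(TPM2_TIMEOUT_C);
	chip->timeout_d = msecs_to_jiffies_model(TPM2_TIMEOUT_D);
	chip->flags |= TPM_CHIP_FLAG_HAVE_TIMEOUTS;
	return 0;
}

/* tpm_get_timeouts() becomes a thin dispatcher over the TPM version. */
static int tpm_get_timeouts_model(struct tpm_chip_model *chip)
{
	if (chip->flags & TPM_CHIP_FLAG_HAVE_TIMEOUTS)
		return 0;
	if (chip->flags & TPM_CHIP_FLAG_TPM2)
		return tpm2_get_timeouts_model(chip);
	return -1;	/* the TPM 1.x path (tpm1_get_timeouts) goes here */
}
```

The TPM 1.x half keeps the capability queries, vendor overrides, and unit fix-ups shown in the removed hunk above.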


[GIT PULL] ARM: at91: DT for 4.17

2018-03-09 Thread Alexandre Belloni
Arnd, Olof,

Not much this cycle, mainly changes on the Axentia boards from Peter and
a cleanup from Bartosz.

The following changes since commit 7928b2cbe55b2a410a0f5c1f154610059c57b1b2:

  Linux 4.16-rc1 (2018-02-11 15:04:29 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git 
tags/at91-ab-4.17-dt

for you to fetch changes up to 6fa65edf87886f85a18680ed7bdedc15cb810065:

  ARM: dts: at91: use 'atmel' as at24 manufacturer for at91sam9263ek 
(2018-02-13 17:15:18 +0100)


AT91 DT for 4.17:

 - use 'atmel' as at24 manufacturer
 - device addition and fixes for axentia boards
 - fix sama5d4 pinctrl compatible


Alexandre Belloni (1):
  ARM: dts: at91: sam9rl: Properly assign copyright

Bartosz Golaszewski (5):
  ARM: dts: at91: use 'atmel' as at24 manufacturer for sama5d34ek
  ARM: dts: at91: use 'atmel' as at24 manufacturer for at91sam9260ek
  ARM: dts: at91: use 'atmel' as at24 manufacturer for at91sam9g20ek
  ARM: dts: at91: use 'atmel' as at24 manufacturer for at91-sama5d2_ptc_ek
  ARM: dts: at91: use 'atmel' as at24 manufacturer for at91sam9263ek

Peter Rosin (5):
  ARM: dts: at91: nattis: use the correct compatible for the eeprom
  ARM: dts: at91: tse850: use the correct compatible for the eeprom
  ARM: dts: at91: nattis: use up-to-date mtd partitions
  ARM: dts: at91: nattis: add lvds-encoder
  ARM: dts: at91: tse850: make the sound dai cell count explicit

Santiago Esteban (1):
  ARM: dts: at91: sama5d4: fix pinctrl compatible string

 arch/arm/boot/dts/at91-nattis-2-natte-2.dts | 60 +
 arch/arm/boot/dts/at91-sama5d2_ptc_ek.dts   |  2 +-
 arch/arm/boot/dts/at91-tse850-3.dts |  3 +-
 arch/arm/boot/dts/at91sam9260ek.dts |  2 +-
 arch/arm/boot/dts/at91sam9263ek.dts |  2 +-
 arch/arm/boot/dts/at91sam9g20ek_common.dtsi |  2 +-
 arch/arm/boot/dts/at91sam9rl.dtsi   |  3 +-
 arch/arm/boot/dts/at91sam9rlek.dts  |  3 +-
 arch/arm/boot/dts/sama5d34ek.dts|  2 +-
 arch/arm/boot/dts/sama5d4.dtsi  |  2 +-
 10 files changed, 57 insertions(+), 24 deletions(-)

-- 
Alexandre Belloni, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com



[PATCH] rbd: Remove VLA usage

2018-03-09 Thread Kyle Spiers
From 4198ebe2e8058ff676d8e2f993d8806d6ca29c11 Mon Sep 17 00:00:00 2001
From: Kyle Spiers 
Date: Fri, 9 Mar 2018 12:34:15 -0800
Subject: [PATCH] rbd: Remove VLA usage

As part of the effort to remove VLAs from the kernel[1], this moves
the literal values into the stack array calculation instead of using a
variable for the sizing. The resulting size can be found from
sizeof(buf).

[1] https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Kyle Spiers 

---
 drivers/block/rbd.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8e40da0..0e94e1f 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -3100,8 +3100,8 @@ static int __rbd_notify_op_lock(struct rbd_device
*rbd_dev,
 {
 struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
 struct rbd_client_id cid = rbd_get_cid(rbd_dev);
-    int buf_size = 4 + 8 + 8 + CEPH_ENCODING_START_BLK_LEN;
-    char buf[buf_size];
+    char buf[4 + 4 + 8 + 8 + CEPH_ENCODING_START_BLK_LEN];
+    int buf_size = sizeof(buf);
 void *p = buf;
 
 dout("%s rbd_dev %p notify_op %d\n", __func__, rbd_dev, notify_op);
@@ -3619,8 +3619,8 @@ static void __rbd_acknowledge_notify(struct
rbd_device *rbd_dev,
              u64 notify_id, u64 cookie, s32 *result)
 {
 struct ceph_osd_client *osdc = &rbd_dev->rbd_client->client->osdc;
-    int buf_size = 4 + CEPH_ENCODING_START_BLK_LEN;
-    char buf[buf_size];
+    char buf[4 + CEPH_ENCODING_START_BLK_LEN];
+    int buf_size = sizeof(buf);
 int ret;
 
 if (result) {
-- 2.7.4
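The pattern in the patch generalizes: when every operand of the size expression is a constant, moving the expression into the array declarator removes the VLA, and sizeof recovers the size as a compile-time constant. A standalone sketch — the CEPH_ENCODING_START_BLK_LEN value (1-byte struct_v + 1-byte struct_compat + 4-byte length) is an assumption here for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value for illustration only; the real constant lives in the
 * kernel's ceph encoding headers. */
#define CEPH_ENCODING_START_BLK_LEN 6

/* Before the patch:
 *     int buf_size = 4 + CEPH_ENCODING_START_BLK_LEN;
 *     char buf[buf_size];      <-- a VLA, despite constant operands,
 *                                  because buf_size is a variable.
 * After: the bound is a constant expression, and so is sizeof(buf). */
static size_t acknowledge_buf_size(void)
{
	char buf[4 + CEPH_ENCODING_START_BLK_LEN];
	size_t buf_size = sizeof(buf);	/* known at compile time */

	(void)buf;			/* buffer would be filled here */
	return buf_size;
}
```

Compiling with -Wvla now stays silent, which is the point of the kernel-wide cleanup referenced in [1].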





[GIT PULL] ARM: at91: SoC for 4.17

2018-03-09 Thread Alexandre Belloni
Arnd, Olof,

It has been two years since Atmel merged with Microchip; rename it where
relevant.

This is based on my fixes PR which is already in next/soc. Tell me if
this is not right.

The following changes since commit c8d5dcf122b194e897d2a6311903eae0c1023325:

  MAINTAINERS: ARM: at91: update my email address (2018-02-22 16:22:15 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git 
tags/at91-ab-4.17-soc

for you to fetch changes up to ed08b63c8b3e23dfc8a32f0b450a23a35a3d91b4:

  ARM: at91: Kconfig: Update company to Microchip (2018-02-28 16:21:51 +0100)


AT91 SoC for 4.17:

 - Rename Atmel to Microchip in MAINTAINERS, Documentation and Kconfig


Nicolas Ferre (3):
  MAINTAINERS: ARM: at91: update entry for ARM/Microchip
  Documentation: at91: Update Microchip SoC documentation
  ARM: at91: Kconfig: Update company to Microchip

 Documentation/arm/{Atmel => Microchip}/README | 52 +--
 MAINTAINERS   | 42 +++---
 arch/arm/mach-at91/Kconfig| 14 
 3 files changed, 53 insertions(+), 55 deletions(-)
 rename Documentation/arm/{Atmel => Microchip}/README (64%)

-- 
Alexandre Belloni, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com



Re: [PATCH 3.18 00/21] 3.18.99-stable review

2018-03-09 Thread Shuah Khan
On 03/09/2018 05:18 PM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 3.18.99 release.
> There are 21 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Mon Mar 12 00:17:44 UTC 2018.
> Anything received after that time might be too late.
> 
> The whole patch series can be found in one patch at:
>   
> https://www.kernel.org/pub/linux/kernel/v3.x/stable-review/patch-3.18.99-rc1.gz
> or in the git tree and branch at:
>   
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-3.18.y
> and the diffstat can be found below.
> 
> thanks,
> 
> greg k-h
> 

Compiled and booted on my test system. No dmesg regressions.

thanks,
-- Shuah




Re: [PATCH 4.4 00/36] 4.4.121-stable review

2018-03-09 Thread Shuah Khan
On 03/09/2018 05:18 PM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.4.121 release.
> There are 36 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Mon Mar 12 00:17:54 UTC 2018.
> Anything received after that time might be too late.
> 
> The whole patch series can be found in one patch at:
>   
> https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.4.121-rc1.gz
> or in the git tree and branch at:
>   
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.4.y
> and the diffstat can be found below.
> 
> thanks,
> 
> greg k-h
> 

Compiled and booted on my test system. No dmesg regressions.

thanks,
-- Shuah




Re: [PATCH 4.9 00/65] 4.9.87-stable review

2018-03-09 Thread Shuah Khan
On 03/09/2018 05:18 PM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.9.87 release.
> There are 65 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Mon Mar 12 00:18:06 UTC 2018.
> Anything received after that time might be too late.
> 
> The whole patch series can be found in one patch at:
>   
> https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.9.87-rc1.gz
> or in the git tree and branch at:
>   
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.9.y
> and the diffstat can be found below.
> 
> thanks,
> 
> greg k-h
> 

Compiled and booted on my test system. No dmesg regressions.

thanks,
-- Shuah




Re: [PATCH 4.14 0/9] 4.14.26-stable review

2018-03-09 Thread Shuah Khan
On 03/09/2018 05:19 PM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.14.26 release.
> There are 9 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Mon Mar 12 00:18:16 UTC 2018.
> Anything received after that time might be too late.
> 
> The whole patch series can be found in one patch at:
>   
> https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.26-rc1.gz
> or in the git tree and branch at:
>   
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.14.y
> and the diffstat can be found below.
> 
> thanks,
> 
> greg k-h
> 

Compiled and booted on my test system. No dmesg regressions.

thanks,
-- Shuah



Re: [PATCH 4.15 00/11] 4.15.9-stable review

2018-03-09 Thread Shuah Khan
On 03/09/2018 05:19 PM, Greg Kroah-Hartman wrote:
> This is the start of the stable review cycle for the 4.15.9 release.
> There are 11 patches in this series, all will be posted as a response
> to this one.  If anyone has any issues with these being applied, please
> let me know.
> 
> Responses should be made by Mon Mar 12 00:18:21 UTC 2018.
> Anything received after that time might be too late.
> 
> The whole patch series can be found in one patch at:
>   
> https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.15.9-rc1.gz
> or in the git tree and branch at:
>   
> git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
> linux-4.15.y
> and the diffstat can be found below.
> 
> thanks,
> 
> greg k-h
> 

Compiled and booted on my test system. No dmesg regressions.

thanks,
-- Shuah




RE: [RFC 3/3] arch/x86/kvm: SVM: Introduce pause loop exit logic in SVM

2018-03-09 Thread Moger, Babu
Radim,
 Thanks for the comments. I have taken care of most of them.
 I have a few questions/comments; please see inline.

> -Original Message-
> From: Radim Krčmář 
> Sent: Friday, March 9, 2018 12:13 PM
> To: Moger, Babu 
> Cc: j...@8bytes.org; t...@linutronix.de; mi...@redhat.com;
> h...@zytor.com; x...@kernel.org; pbonz...@redhat.com;
> k...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: Re: [RFC 3/3] arch/x86/kvm: SVM: Introduce pause loop exit logic in
> SVM
> 
> 2018-03-02 11:17-0500, Babu Moger:
> > Bring the PLE(pause loop exit) logic to AMD svm driver.
> > We have noticed it help in situations where numerous pauses are
> generated
> > due to spinlock or other scenarios. Tested it with idle=poll and noticed
> > pause interceptions go down considerably.
> >
> > Signed-off-by: Babu Moger 
> > ---
> >  arch/x86/kvm/svm.c | 114
> -
> >  arch/x86/kvm/x86.h |   1 +
> >  2 files changed, 114 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index 50a4e95..30bc851 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -263,6 +263,55 @@ struct amd_svm_iommu_ir {
> >  static bool npt_enabled;
> >  #endif
> >
> > +/*
> > + * These 2 parameters are used to config the controls for Pause-Loop
> Exiting:
> > + * pause_filter_thresh: On processors that support Pause
> filtering(indicated
> > + * by CPUID Fn8000_000A_EDX), the VMCB provides a 16 bit pause filter
> > + * count value. On VMRUN this value is loaded into an internal counter.
> > + * Each time a pause instruction is executed, this counter is
> decremented
> > + * until it reaches zero at which time a #VMEXIT is generated if pause
> > + * intercept is enabled. Refer to  AMD APM Vol 2 Section 15.14.4 Pause
> > + * Intercept Filtering for more details.
> > + * This also indicate if ple logic enabled.
> > + *
> > + * pause_filter_count: In addition, some processor families support
> advanced
> 
> The comment has thresh/count flipped.

Good catch. Thanks

> 
> > + * pause filtering (indicated by CPUID Fn8000_000A_EDX) upper bound
> on
> > + * the amount of time a guest is allowed to execute in a pause loop.
> > + * In this mode, a 16-bit pause filter threshold field is added in the
> > + * VMCB. The threshold value is a cycle count that is used to reset the
> > + * pause counter. As with simple pause filtering, VMRUN loads the
> pause
> > + * count value from VMCB into an internal counter. Then, on each
> pause
> > + * instruction the hardware checks the elapsed number of cycles since
> > + * the most recent pause instruction against the pause filter threshold.
> > + * If the elapsed cycle count is greater than the pause filter threshold,
> > + * then the internal pause count is reloaded from the VMCB and
> execution
> > + * continues. If the elapsed cycle count is less than the pause filter
> > + * threshold, then the internal pause count is decremented. If the
> count
> > + * value is less than zero and PAUSE intercept is enabled, a #VMEXIT is
> > + * triggered. If advanced pause filtering is supported and pause filter
> > + * threshold field is set to zero, the filter will operate in the simpler,
> > + * count only mode.
> > + */
> > +
> > +static int pause_filter_thresh = KVM_DEFAULT_PLE_GAP;
> > +module_param(pause_filter_thresh, int, S_IRUGO);
> 
> I think it was a mistake to put signed values in VMX ...
> Please use unsigned variants and also properly sized.
> (The module param type would be "ushort" instead of "int".)

Sure. Will take care.
> 
> > +static int pause_filter_count = KVM_DEFAULT_PLE_WINDOW;
> > +module_param(pause_filter_count, int, S_IRUGO);
> 
> We are going to want a different default for pause_filter_count, because
> they have a different meaning.  On Intel, it's the number of cycles, on
> AMD, it's the number of PAUSE instructions.
> 
> The AMD's 3k is a bit high in comparison to Intel's 4k, but I'd keep 3k
> unless we have other benchmark results.

Ok. Testing with pause_filter_count = 3k for AMD. If everything goes fine, will 
make these changes.

> 
> > +static int ple_window_grow = KVM_DEFAULT_PLE_WINDOW_GROW;
> 
> The naming would be nicer with a consistent prefix.  We're growing
> pause_filter_count, so pause_filter_count_grow is easier to understand.
> (Albeit unwieldy.)

Sure. Will take care.

> 
> > +module_param(ple_window_grow, int, S_IRUGO);
> 
> (This is better as unsigned too ... VMX should have had that.)

Yes. Will fix it.

> 
> > @@ -1046,6 +1095,58 @@ static int avic_ga_log_notifier(u32 ga_tag)
> > return 0;
> >  }
> >
> > +static void grow_ple_window(struct kvm_vcpu *vcpu)
> > +{
> > +   struct vcpu_svm *svm = to_svm(vcpu);
> > +   struct vmcb_control_area *control = &svm->vmcb->control;
> > +   int old = control->pause_filter_count;
> > +
> > +   control->pause_filter_count = __grow_ple_window(old,
> > +   pause_filter_count,
> > +

Re: [PATCH 1/2] watchdog: dw: RMW the control register

2018-03-09 Thread Brian Norris
On Fri, Mar 9, 2018 at 8:02 PM, Guenter Roeck  wrote:
> On 03/09/2018 07:28 PM, Brian Norris wrote:
>> I guess I could mention it. I was assuming that was an intended behavior
>> of the existing driver: that we set resp_mode=0 (via clobber), so we
>> always get a system reset (we don't try to handle any interrupt in this
>> driver).
>>
> I don't think it was intended behavior. We don't even know for sure (or at
> least I don't know) if all implementations of this IP have the same
> configuration bit layout. All we can do is hope for the best.

Huh, OK. I did try to look for any sort of generic DesignWare register
documentation, and I couldn't find one easily (even with a proper
Synopsys account -- maybe I wasn't looking in the right place). But
besides the Rockchip TRMs, I did find some openly accessible Altera
SoCFPGA docs [1] which also use this, and they have a few things to
add:
(1) they have the same 'reset pulse length' field, except it's labeled RO
(2) they have the same 'response mode' field
(3) the docs for the entire register say:

"The value of a reserved bit must be maintained in software. When you
modify registers containing reserved bit fields, you must use a
read-modify-write operation to preserve state and prevent
indeterminate system behavior."

So, that pretty well corroborates my patch. Nice.

> Still, clobbering just 1 bit is better than clobbering 30 bit.

Yeah, that's the idea. Well, as long as it's only the 1 bit I want to clobber ;)

I guess if we really find that any of this becomes more problematic
(and varies enough from IP to IP), then we'll need chip-specific
compatible properties.

Brian

[1] e.g. 
https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/hb/arria-10/a10_5v4.pdf



Re: [RFC/RFT][PATCH v3 0/6] sched/cpuidle: Idle loop rework

2018-03-09 Thread Mike Galbraith
On Fri, 2018-03-09 at 10:34 +0100, Rafael J. Wysocki wrote:
> Hi All,
> 
> Thanks a lot for the discussion and testing so far!
> 
> This is a total respin of the whole series, so please look at it afresh.
> Patches 2 and 3 are the most similar to their previous versions, but
> still they are different enough.

Respin of testdrive...

i4790 booted nopti nospectre_v2

30 sec tbench
4.16.0.g1b88acc-master (virgin)
Throughput 559.279 MB/sec  1 clients  1 procs  max_latency=0.046 ms
Throughput 997.119 MB/sec  2 clients  2 procs  max_latency=0.246 ms
Throughput 1693.04 MB/sec  4 clients  4 procs  max_latency=4.309 ms
Throughput 3597.2 MB/sec  8 clients  8 procs  max_latency=6.760 ms
Throughput 3474.55 MB/sec  16 clients  16 procs  max_latency=6.743 ms

4.16.0.g1b88acc-master (+ v2)
Throughput 588.929 MB/sec  1 clients  1 procs  max_latency=0.291 ms
Throughput 1080.93 MB/sec  2 clients  2 procs  max_latency=0.639 ms
Throughput 1826.3 MB/sec  4 clients  4 procs  max_latency=0.647 ms
Throughput 3561.01 MB/sec  8 clients  8 procs  max_latency=1.279 ms
Throughput 3382.98 MB/sec  16 clients  16 procs  max_latency=4.817 ms

4.16.0.g1b88acc-master (+ v3)
Throughput 588.711 MB/sec  1 clients  1 procs  max_latency=0.067 ms
Throughput 1077.71 MB/sec  2 clients  2 procs  max_latency=0.298 ms
Throughput 1803.47 MB/sec  4 clients  4 procs  max_latency=0.667 ms
Throughput 3591.4 MB/sec  8 clients  8 procs  max_latency=4.999 ms
Throughput 3444.74 MB/sec  16 clients  16 procs  max_latency=1.995 ms

4.16.0.g1b88acc-master (+ my local patches)
Throughput 722.559 MB/sec  1 clients  1 procs  max_latency=0.087 ms
Throughput 1208.59 MB/sec  2 clients  2 procs  max_latency=0.289 ms
Throughput 2071.94 MB/sec  4 clients  4 procs  max_latency=0.654 ms
Throughput 3784.91 MB/sec  8 clients  8 procs  max_latency=0.974 ms
Throughput 3644.4 MB/sec  16 clients  16 procs  max_latency=5.620 ms

turbostat -q -- firefox /root/tmp/video/BigBuckBunny-DivXPlusHD.mkv & sleep 
300;killall firefox

PkgWatt
  1 2 3
4.16.0.g1b88acc-master 6.95  7.03  6.91 (virgin)
4.16.0.g1b88acc-master 7.20  7.25  7.26 (+v2)
4.16.0.g1b88acc-master 7.04  6.97  7.07 (+v3)
4.16.0.g1b88acc-master 6.90  7.06  6.95 (+my patches)

No change wrt nohz high frequency cross core scheduling overhead, but
the light load power consumption oddity did go away.

(btw, don't read anything into max_latency numbers, that's GUI noise)

-Mike


RE: [PATCH 2/2] e1000e: Fix link check race condition

2018-03-09 Thread Brown, Aaron F
> From: netdev-ow...@vger.kernel.org [mailto:netdev-
> ow...@vger.kernel.org] On Behalf Of Benjamin Poirier
> Sent: Monday, March 5, 2018 5:56 PM
> To: Kirsher, Jeffrey T 
> Cc: Alexander Duyck ; Lennart Sorensen
> ; intel-wired-...@lists.osuosl.org;
> net...@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [PATCH 2/2] e1000e: Fix link check race condition
> 
> Alex reported the following race condition:
> 
> /* link goes up... interrupt... schedule watchdog */
> \ e1000_watchdog_task
>   \ e1000e_has_link
>   \ hw->mac.ops.check_for_link() === e1000e_check_for_copper_link
>   \ e1000e_phy_has_link_generic(..., &link)
>   link = true
> 
>/* link goes down... interrupt */
>\ e1000_msix_other
>hw->mac.get_link_status = true
> 
>   /* link is up */
>   mac->get_link_status = false
> 
>   link_active = true
>   /* link_active is true, wrongly, and stays so because
>* get_link_status is false */
> 
> Avoid this problem by making sure that we don't set get_link_status = false
> after having checked the link.
> 
> It seems this problem has been present since the introduction of e1000e.
> 
> Link: https://lkml.org/lkml/2018/1/29/338
> Reported-by: Alexander Duyck 
> Signed-off-by: Benjamin Poirier 
> ---
>  drivers/net/ethernet/intel/e1000e/ich8lan.c | 31 ---
> --
>  drivers/net/ethernet/intel/e1000e/mac.c | 14 ++---
>  2 files changed, 24 insertions(+), 21 deletions(-)

Tested-by: Aaron Brown 


RE: [Intel-wired-lan] [PATCH 1/2] Revert "e1000e: Separate signaling for link check/link up"

2018-03-09 Thread Brown, Aaron F
> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> Behalf Of Benjamin Poirier
> Sent: Monday, March 5, 2018 5:56 PM
> To: Kirsher, Jeffrey T 
> Cc: net...@vger.kernel.org; intel-wired-...@lists.osuosl.org; linux-
> ker...@vger.kernel.org; Lennart Sorensen 
> Subject: [Intel-wired-lan] [PATCH 1/2] Revert "e1000e: Separate signaling for
> link check/link up"
> 
> This reverts commit 19110cfbb34d4af0cdfe14cd243f3b09dc95b013.
> This reverts commit 4110e02eb45ea447ec6f5459c9934de0a273fb91.
> This reverts commit d3604515c9eda464a92e8e67aae82dfe07fe3c98.
> 
> Commit 19110cfbb34d ("e1000e: Separate signaling for link check/link up")
> changed what happens to the link status when there is an error which
> happens after "get_link_status = false" in the copper check_for_link
> callbacks. Previously, such an error would be ignored and the link
> considered up. After that commit, any error implies that the link is down.
> 
> Revert commit 19110cfbb34d ("e1000e: Separate signaling for link check/link
> up") and its followups. After reverting, the race condition described in
> the log of commit 19110cfbb34d is reintroduced. It may still be triggered
> by LSC events but this should keep the link down in case the link is
> electrically unstable, as discussed. The race may no longer be
> triggered by RXO events because commit 4aea7a5c5e94 ("e1000e: Avoid
> receiver overrun interrupt bursts") restored reading icr in the Other
> handler.
> 
> Link: https://lkml.org/lkml/2018/3/1/789
> Signed-off-by: Benjamin Poirier 
> ---
>  drivers/net/ethernet/intel/e1000e/ich8lan.c | 13 -
>  drivers/net/ethernet/intel/e1000e/mac.c | 13 -
>  drivers/net/ethernet/intel/e1000e/netdev.c  |  2 +-
>  3 files changed, 9 insertions(+), 19 deletions(-)
> 

Tested-by: Aaron Brown 


Re: [PATCH v4 03/10] drivers: qcom: rpmh-rsc: log RPMH requests in FTRACE

2018-03-09 Thread Lina Iyer

On Fri, Mar 09 2018 at 16:52 -0700, Steven Rostedt wrote:

On Fri,  9 Mar 2018 16:25:36 -0700
Lina Iyer  wrote:


Log sent RPMH requests and interrupt responses in FTRACE.

Cc: Steven Rostedt 
Signed-off-by: Lina Iyer 
---

Changes in v4:
- fix compilation issues, use __assign_str
- use %#x instead of 0x%08x


Hmm, I don't believe libtraceevent (used by trace-cmd and perf)
supports "%#x". But that needs to be fixed in libtraceevent and you
don't need to modify this patch.

+__field(bool, wait)

Usually I would recommend against 'bool' in structures, but it
shouldn't affect the tracing code. Might want to look at how it
converts it in the /sys/kernel/tracing/events/rpmh/rpmh_send_msg/format
file. It probably makes no difference if it was an int.

Other than that... Looks good.


field:bool wait;offset:32;  size:1; signed:0;

-- Lina


Reviewed-by: Steven Rostedt (VMware) 

-- Steve



Changes in v3:
- Use __string() instead of char *
- fix TRACE_INCLUDE_PATH
---


Re: [PATCH] rtc: s5m: Remove VLA usage

2018-03-09 Thread Gustavo A. R. Silva


Hi Krzysztof,

On 03/09/2018 07:04 AM, Krzysztof Kozlowski wrote:

On Thu, Mar 8, 2018 at 7:03 PM, Gustavo A. R. Silva
 wrote:



On 03/08/2018 11:58 AM, Kees Cook wrote:


On Thu, Mar 8, 2018 at 9:20 AM, Gustavo A. R. Silva
 wrote:


In preparation to enabling -Wvla, remove VLAs and replace them
with fixed-length arrays instead.

  From a security viewpoint, the use of Variable Length Arrays can be
a vector for stack overflow attacks. Also, in general, as the code
evolves it is easy to lose track of how big a VLA can get. Thus, we
can end up having segfaults that are hard to debug.

Also, fixed as part of the directive to remove all VLAs from
the kernel: https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Gustavo A. R. Silva 
---
   drivers/rtc/rtc-s5m.c | 15 +--
   1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/rtc/rtc-s5m.c b/drivers/rtc/rtc-s5m.c
index 6deae10..2b5f4f7 100644
--- a/drivers/rtc/rtc-s5m.c
+++ b/drivers/rtc/rtc-s5m.c
@@ -38,6 +38,9 @@
*/
   #define UDR_READ_RETRY_CNT 5

+/* Maximum number of registers for setting time/alarm0/alarm1 */
+#define MAX_NUM_TIME_REGS  8



I would adjust the various const struct s5m_rtc_reg_config's
.regs_count to be represented by this new define, so the stack and the
structures stay in sync. Something like:

static const struct s5m_rtc_reg_config s2mps13_rtc_regs = {
  .regs_count = MAX_NUM_TIME_REGS - 1,

?



Yep. I thought about that and decided to wait for some feedback first. But
yeah, I think that'd be a good change.


The define and these assignments should somehow be connected with the
enum defining the offsets for data[] (from
include/linux/mfd/samsung/rtc.h); otherwise we define the same thing in
two places. The enum itself could be moved (in a separate patch) into
the driver, since it is meaningless to others.



I got it.

I'll move the enum to rtc-s5m.c and add RTC_MAX_NUM_TIME_REGS at the end 
of it. I'll send a patch series for this.


Thanks for the feedback.
--
Gustavo


Best regards,
Krzysztof





Re: possible deadlock in get_user_pages_unlocked

2018-03-09 Thread Eric Biggers
On Fri, Feb 09, 2018 at 07:19:25PM -0800, Eric Biggers wrote:
> Hi Al,
> 
> On Sat, Feb 10, 2018 at 01:36:40AM +, Al Viro wrote:
> > On Fri, Feb 02, 2018 at 09:57:27AM +0100, Dmitry Vyukov wrote:
> > 
> > > syzbot tests for up to 5 minutes. However, if there is a race involved
> > > then you may need more time because the crash is probabilistic.
> > > But from what I see most of the time, if one can't reproduce it
> > > easily, it's usually due to some differences in setup that just don't
> > > allow the crash to happen at all.
> > > FWIW syzbot re-runs each reproducer on a freshly booted dedicated VM
> > > and what it provided is the kernel output it got during run of the
> > > provided program. So we have reasonably high assurance that this
> > > reproducer worked in at least one setup.
> > 
> > Could you guys check if the following fixes the reproducer?
> > 
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 61015793f952..058a9a8e4e2e 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -861,6 +861,9 @@ static __always_inline long 
> > __get_user_pages_locked(struct task_struct *tsk,
> > BUG_ON(*locked != 1);
> > }
> >  
> > +   if (flags & FOLL_NOWAIT)
> > +   locked = NULL;
> > +
> > if (pages)
> > flags |= FOLL_GET;
> >  
> 
> Yes that fixes the reproducer for me.
> 

Just to follow up on this: it seems that Al's suggested fix didn't go anywhere,
but someone else eventually ran into this bug (which was a real deadlock) and a
slightly different fix was merged, commit 96312e61282ae.  It fixes the
reproducer for me too.  Telling syzbot so that it can close the bug:

#syz fix: mm/gup.c: teach get_user_pages_unlocked to handle FOLL_NOWAIT

- Eric



Re: [PATCH v2 1/2] watchdog: dw: RMW the control register

2018-03-09 Thread Guenter Roeck

On 03/09/2018 07:46 PM, Brian Norris wrote:

RK3399 has rst_pulse_length in CONTROL_REG[4:2], determining the length
of pulse to issue for system reset. We shouldn't clobber this value,
because that might make the system reset ineffective. On RK3399, we're
seeing that a value of 000b (meaning 2 cycles) yields an unreliable
(partial?) reset, and so we only fully reset after the watchdog fires a
second time. If we retain the system default (010b, or 8 clock cycles),
then the watchdog reset is much more reliable.

Read-modify-write retains the system value and improves reset
reliability.

It seems we were intentionally clobbering the response mode previously,
to ensure we performed a system reset (we don't support an interrupt
notification), so retain that explicitly.

Signed-off-by: Brian Norris 


Reviewed-by: Guenter Roeck 


---
v2:
  * factor out helper
  * handle both start() and restart() cases
  * note the RESP_MODE handling in commit message
---
  drivers/watchdog/dw_wdt.c | 23 +++
  1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/watchdog/dw_wdt.c b/drivers/watchdog/dw_wdt.c
index c2f4ff516230..918357bccf5e 100644
--- a/drivers/watchdog/dw_wdt.c
+++ b/drivers/watchdog/dw_wdt.c
@@ -34,6 +34,7 @@
  
  #define WDOG_CONTROL_REG_OFFSET		0x00

  #define WDOG_CONTROL_REG_WDT_EN_MASK  0x01
+#define WDOG_CONTROL_REG_RESP_MODE_MASK	0x02
   #define WDOG_TIMEOUT_RANGE_REG_OFFSET	0x04
   #define WDOG_TIMEOUT_RANGE_TOPINIT_SHIFT	4
   #define WDOG_CURRENT_COUNT_REG_OFFSET	0x08
@@ -121,14 +122,23 @@ static int dw_wdt_set_timeout(struct watchdog_device 
*wdd, unsigned int top_s)
return 0;
  }
  
+static void dw_wdt_arm_system_reset(struct dw_wdt *dw_wdt)

+{
+   u32 val = readl(dw_wdt->regs + WDOG_CONTROL_REG_OFFSET);
+
+   /* Disable interrupt mode; always perform system reset. */
+   val &= ~WDOG_CONTROL_REG_RESP_MODE_MASK;
+   /* Enable watchdog. */
+   val |= WDOG_CONTROL_REG_WDT_EN_MASK;
+   writel(val, dw_wdt->regs + WDOG_CONTROL_REG_OFFSET);
+}
+
  static int dw_wdt_start(struct watchdog_device *wdd)
  {
struct dw_wdt *dw_wdt = to_dw_wdt(wdd);
  
  	dw_wdt_set_timeout(wdd, wdd->timeout);

-
-   writel(WDOG_CONTROL_REG_WDT_EN_MASK,
-  dw_wdt->regs + WDOG_CONTROL_REG_OFFSET);
+   dw_wdt_arm_system_reset(dw_wdt);
  
  	return 0;

  }
@@ -152,16 +162,13 @@ static int dw_wdt_restart(struct watchdog_device *wdd,
  unsigned long action, void *data)
  {
struct dw_wdt *dw_wdt = to_dw_wdt(wdd);
-   u32 val;
  
  	writel(0, dw_wdt->regs + WDOG_TIMEOUT_RANGE_REG_OFFSET);

-   val = readl(dw_wdt->regs + WDOG_CONTROL_REG_OFFSET);
-   if (val & WDOG_CONTROL_REG_WDT_EN_MASK)
+   if (dw_wdt_is_enabled(dw_wdt))
writel(WDOG_COUNTER_RESTART_KICK_VALUE,
   dw_wdt->regs + WDOG_COUNTER_RESTART_REG_OFFSET);
else
-   writel(WDOG_CONTROL_REG_WDT_EN_MASK,
-  dw_wdt->regs + WDOG_CONTROL_REG_OFFSET);
+   dw_wdt_arm_system_reset(dw_wdt);
  
  	/* wait for reset to assert... */

mdelay(500);





Re: [PATCH 1/2] watchdog: dw: RMW the control register

2018-03-09 Thread Guenter Roeck

Hi Brian,

On 03/09/2018 07:28 PM, Brian Norris wrote:

Hi,

On Fri, Mar 09, 2018 at 07:20:38PM -0800, Guenter Roeck wrote:

On 03/09/2018 06:44 PM, Brian Norris wrote:

RK3399 has rst_pulse_length in CONTROL_REG[4:2], determining the length
of pulse to issue for system reset. We shouldn't clobber this value,
because that might make the system reset ineffective. On RK3399, we're
seeing that a value of 000b (meaning 2 cycles) yields an unreliable
(partial?) reset, and so we only fully reset after the watchdog fires a
second time. If we retain the system default (010b, or 8 clock cycles),
then the watchdog reset is much more reliable.

Read-modify-write retains the system value and improves reset
reliability.

Signed-off-by: Brian Norris 
---
   drivers/watchdog/dw_wdt.c | 10 --
   1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/watchdog/dw_wdt.c b/drivers/watchdog/dw_wdt.c
index c2f4ff516230..6925d3e6c6b3 100644
--- a/drivers/watchdog/dw_wdt.c
+++ b/drivers/watchdog/dw_wdt.c
@@ -34,6 +34,7 @@
   #define WDOG_CONTROL_REG_OFFSET  0x00
   #define WDOG_CONTROL_REG_WDT_EN_MASK 0x01
+#define WDOG_CONTROL_REG_RESP_MODE_MASK	0x02
    #define WDOG_TIMEOUT_RANGE_REG_OFFSET	0x04
    #define WDOG_TIMEOUT_RANGE_TOPINIT_SHIFT	4
    #define WDOG_CURRENT_COUNT_REG_OFFSET	0x08
@@ -124,11 +125,16 @@ static int dw_wdt_set_timeout(struct watchdog_device 
*wdd, unsigned int top_s)
   static int dw_wdt_start(struct watchdog_device *wdd)
   {
struct dw_wdt *dw_wdt = to_dw_wdt(wdd);
+   u32 val;
dw_wdt_set_timeout(wdd, wdd->timeout);
-   writel(WDOG_CONTROL_REG_WDT_EN_MASK,
-  dw_wdt->regs + WDOG_CONTROL_REG_OFFSET);
+   val = readl(dw_wdt->regs + WDOG_CONTROL_REG_OFFSET);
+   /* Disable interrupt mode; always perform system reset. */
+   val &= ~WDOG_CONTROL_REG_RESP_MODE_MASK;


You don't talk about this change in the description.


I guess I could mention it. I was assuming that was an intended behavior
of the existing driver: that we set resp_mode=0 (via clobber), so we
always get a system reset (we don't try to handle any interrupt in this
driver).


I don't think it was intended behavior. We don't even know for sure (or at least
I don't know) if all implementations of this IP have the same configuration bit
layout. All we can do is hope for the best.

Still, clobbering just 1 bit is better than clobbering 30 bits.

Thanks,
Guenter


I'll include something along those lines in the commit message.


+   /* Enable watchdog. */
+   val |= WDOG_CONTROL_REG_WDT_EN_MASK;
+   writel(val, dw_wdt->regs + WDOG_CONTROL_REG_OFFSET);
return 0;
   }



Similar code is in dw_wdt_restart(), where it may be equally or even
more important. Granted, only if the watchdog isn't running, but still...


Oh, I misread that code. It looked like a read/modify/write already,
but it was actually just a read/check/write. I should fix that, since
otherwise the restart will clobber the very thing I'm trying to fix
here, which might actually make the intended machine restart quite
ineffective.

Thanks,
Brian





[PATCH v2] Input: stmpe-keypad - remove VLA usage

2018-03-09 Thread Gustavo A. R. Silva
In preparation to enabling -Wvla, remove VLA and replace it
with a fixed-length array instead.

Fixed as part of the directive to remove all VLAs from
the kernel: https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Gustavo A. R. Silva 
---
Changes in v2:
 - Update the code based on Dmitry Torokhov comments. Thanks Dmitry.

 drivers/input/keyboard/stmpe-keypad.c | 16 
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/drivers/input/keyboard/stmpe-keypad.c 
b/drivers/input/keyboard/stmpe-keypad.c
index 8c6c0b9..d69e631 100644
--- a/drivers/input/keyboard/stmpe-keypad.c
+++ b/drivers/input/keyboard/stmpe-keypad.c
@@ -48,6 +48,14 @@
 #define STMPE_KEYPAD_KEYMAP_MAX_SIZE \
(STMPE_KEYPAD_MAX_ROWS * STMPE_KEYPAD_MAX_COLS)
 
+
+#define STMPE1601_NUM_DATA 5
+#define STMPE2401_NUM_DATA 3
+#define STMPE2403_NUM_DATA 5
+
+/* Make sure it covers all cases above */
+#define MAX_NUM_DATA   5
+
 /**
  * struct stmpe_keypad_variant - model-specific attributes
  * @auto_increment: whether the KPC_DATA_BYTE register address
@@ -74,7 +82,7 @@ struct stmpe_keypad_variant {
 static const struct stmpe_keypad_variant stmpe_keypad_variants[] = {
[STMPE1601] = {
.auto_increment = true,
-   .num_data   = 5,
+   .num_data   = STMPE1601_NUM_DATA,
.num_normal_data= 3,
.max_cols   = 8,
.max_rows   = 8,
@@ -84,7 +92,7 @@ static const struct stmpe_keypad_variant 
stmpe_keypad_variants[] = {
[STMPE2401] = {
.auto_increment = false,
.set_pullup = true,
-   .num_data   = 3,
+   .num_data   = STMPE2401_NUM_DATA,
.num_normal_data= 2,
.max_cols   = 8,
.max_rows   = 12,
@@ -94,7 +102,7 @@ static const struct stmpe_keypad_variant 
stmpe_keypad_variants[] = {
[STMPE2403] = {
.auto_increment = true,
.set_pullup = true,
-   .num_data   = 5,
+   .num_data   = STMPE2403_NUM_DATA,
.num_normal_data= 3,
.max_cols   = 8,
.max_rows   = 12,
@@ -156,7 +164,7 @@ static irqreturn_t stmpe_keypad_irq(int irq, void *dev)
struct stmpe_keypad *keypad = dev;
struct input_dev *input = keypad->input;
const struct stmpe_keypad_variant *variant = keypad->variant;
-   u8 fifo[variant->num_data];
+   u8 fifo[MAX_NUM_DATA];
int ret;
int i;
 
-- 
2.7.4



Re: [PATCH] Input: stmpe-keypad - remove VLA usage

2018-03-09 Thread Gustavo A. R. Silva

Hi Dmitry,

On 03/09/2018 05:32 PM, Dmitry Torokhov wrote:

On Fri, Mar 09, 2018 at 04:42:08PM -0600, Gustavo A. R. Silva wrote:

In preparation to enabling -Wvla, remove VLA and replace it
with a fixed-length array instead.

Fixed as part of the directive to remove all VLAs from
the kernel: https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Gustavo A. R. Silva 
---
  drivers/input/keyboard/stmpe-keypad.c | 10 ++
  1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/input/keyboard/stmpe-keypad.c 
b/drivers/input/keyboard/stmpe-keypad.c
index 8c6c0b9..cfa1dbe 100644
--- a/drivers/input/keyboard/stmpe-keypad.c
+++ b/drivers/input/keyboard/stmpe-keypad.c
@@ -48,6 +48,8 @@
  #define STMPE_KEYPAD_KEYMAP_MAX_SIZE \
(STMPE_KEYPAD_MAX_ROWS * STMPE_KEYPAD_MAX_COLS)
  
+#define MAX_NUM_DATA			5

+
  /**
   * struct stmpe_keypad_variant - model-specific attributes
   * @auto_increment: whether the KPC_DATA_BYTE register address
@@ -74,7 +76,7 @@ struct stmpe_keypad_variant {
  static const struct stmpe_keypad_variant stmpe_keypad_variants[] = {
[STMPE1601] = {
.auto_increment = true,
-   .num_data   = 5,
+   .num_data   = MAX_NUM_DATA,
.num_normal_data= 3,
.max_cols   = 8,
.max_rows   = 8,
@@ -84,7 +86,7 @@ static const struct stmpe_keypad_variant 
stmpe_keypad_variants[] = {
[STMPE2401] = {
.auto_increment = false,
.set_pullup = true,
-   .num_data   = 3,
+   .num_data   = MAX_NUM_DATA - 2,


Logically MAX_NUM_DATA - 2 does not mean anything, it is simply a way
for you to get to 3, so I'd rather we did not do that.



Yeah. I agree.


Can we do

#define STMPE1601_NUM_DATA  5
#define STMPE2401_NUM_DATA  3
#define STMPE2403_NUM_DATA  5
#define MAX_NUM_DATA	max3(STMPE1601_NUM_DATA, \
			     STMPE2401_NUM_DATA, \
			     STMPE2403_NUM_DATA)


or simply

/* Make sure it covers all cases above */
#define MAX_NUM_DATA	5


This one works just fine.

I'll send v2 of this patch shortly.

Thanks for the feedback.
I appreciate it.
--
Gustavo




.num_normal_data= 2,
.max_cols   = 8,
.max_rows   = 12,
@@ -94,7 +96,7 @@ static const struct stmpe_keypad_variant 
stmpe_keypad_variants[] = {
[STMPE2403] = {
.auto_increment = true,
.set_pullup = true,
-   .num_data   = 5,
+   .num_data   = MAX_NUM_DATA,
.num_normal_data= 3,
.max_cols   = 8,
.max_rows   = 12,
@@ -156,7 +158,7 @@ static irqreturn_t stmpe_keypad_irq(int irq, void *dev)
struct stmpe_keypad *keypad = dev;
struct input_dev *input = keypad->input;
const struct stmpe_keypad_variant *variant = keypad->variant;
-   u8 fifo[variant->num_data];
+   u8 fifo[MAX_NUM_DATA];
int ret;
int i;
  
--

2.7.4



Thanks.






[PATCH] EDAC, sb_edac: Remove VLA usage

2018-03-09 Thread Gustavo A. R. Silva
In preparation to enabling -Wvla, remove VLA and replace it
with a fixed-length array instead.

Fixed as part of the directive to remove all VLAs from
the kernel: https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Gustavo A. R. Silva 
---

Notice that due to this change, the field max_interleave is no longer
used after it has been initialized. Maybe it should be removed?

Thanks
--
Gustavo
---
 drivers/edac/sb_edac.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/edac/sb_edac.c b/drivers/edac/sb_edac.c
index 8721002..11cb2c9 100644
--- a/drivers/edac/sb_edac.c
+++ b/drivers/edac/sb_edac.c
@@ -110,6 +110,7 @@ static const u32 knl_interleave_list[] = {
0xdc, 0xe4, 0xec, 0xf4, 0xfc, /* 15-19 */
0x104, 0x10c, 0x114, 0x11c,   /* 20-23 */
 };
+#define MAX_INTERLEAVE ARRAY_SIZE(knl_interleave_list)
 
 struct interleave_pkg {
unsigned char start;
@@ -1899,7 +1900,7 @@ static int get_memory_error_data(struct mem_ctl_info *mci,
int n_rir, n_sads, n_tads, sad_way, sck_xch;
int sad_interl, idx, base_ch;
int interleave_mode, shiftup = 0;
-   unsigned	sad_interleave[pvt->info.max_interleave];
+   unsigned int	sad_interleave[MAX_INTERLEAVE];
u32 reg, dram_rule;
u8  ch_way, sck_way, pkg, sad_ha = 0;
u32 tad_offset;
-- 
2.7.4


