Re: 5.10 LTS Kernel: 2 or 6 years?

2021-02-22 Thread Nishanth Aravamudan
Hi Greg,

On 26.01.2021 [08:29:25 +0100], Greg Kroah-Hartman wrote:
> On Mon, Jan 25, 2021 at 11:55:11AM -0800, Scott Branden wrote:
> > Hi All,
> > 
> > The 5.10 LTS kernel being officially LTS supported for 2 years
> > presents a problem: why would anyone select a 5.10 kernel with 2
> > year LTS when 5.4 kernel has a 6 year LTS.



> > If 5.10 is "actually" going to be supported for 6 years it would be
> > quite valuable to make such a declaration.
> > https://www.kernel.org/category/releases.html
> 
> Why?  What would that change?
> 
> Ok, seriously, this happens every year, and every year we go through
> the same thing, it's not like this is somehow new, right?
> 
> I want to see companies _using_ the kernel, and most importantly,
> _updating_ their devices with it, to know if it is worth to keep
> around for longer than 2 years.  I also, hopefully, want to see how
> those companies will help me out in the testing and maintenance of
> that kernel version in order to make supporting it for 6 years
> actually possible.
> 
> So, are you planning on using 5.10?  Will you be willing to help
> out in testing the -rc releases I make to let me know if there are any
> problems, and to help in pointing out and backporting any specific
> patches that your platforms need for that kernel release?
> 
> When I get this kind of promises and support from companies, then I am
> glad to bump up the length of the kernel support from 2 to 6 years,
> and I mark it on the web site.  Traditionally this happens in
> February/March once I hear from enough companies.  Can I count on your
> support in this endeavor?

I am very sorry for the long delay on my end (I had privately e-mailed
Greg on January 28) -- DigitalOcean also intends to provide feedback and
testing on the 5.10 series. We intend to use it as the basis for our
next-next kernel and are very invested in ensuring the stability and
performance of the kernel.
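
(For reference, a typical stable -rc test loop looks roughly like the sketch
below; the stable-rc tree URL and branch name are the usual locations as I
recall them, and the build/report steps are illustrative rather than an exact
pipeline:)

git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
cd linux-stable-rc && git checkout linux-5.10.y   # branch tracking the 5.10 -rc queue
make olddefconfig && make -j"$(nproc)"
# install and boot the -rc build, run regression/performance tests, then
# report the results (pass/fail plus any regressions) to stable@vger.kernel.org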

Thanks,
Nish


Re: [RFC 00/60] Coscheduling for Linux

2018-11-02 Thread Nishanth Aravamudan
On 17.09.2018 [13:33:15 +0200], Peter Zijlstra wrote:
> On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:
> > On 09/14/2018 01:12 PM, Peter Zijlstra wrote:
> > > On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:
> 
> > >> B) Why would I want this?
> > > 
> > >>In the L1TF context, it prevents other applications from
> > >>loading additional data into the L1 cache, while one
> > >>application tries to leak data.
> > > 
> > > That is the whole and only reason you did this;
> > It really isn't. But as your mind seems made up, I'm not going to
> > bother to argue.
> 
> > >> D) What can I *not* do with this?
> > >> -
> > >>
> > >> Besides the missing load-balancing within coscheduled
> > >> task-groups, this implementation has the following properties,
> > >> which might be considered short-comings.
> > >>
> > >> This particular implementation focuses on SCHED_OTHER tasks
> > >> managed by CFS and allows coscheduling them. Interrupts as well
> > >> as tasks in higher scheduling classes are currently out-of-scope:
> > >> they are assumed to be negligible interruptions as far as
> > >> coscheduling is concerned and they do *not* cause a preemption of
> > >> a whole group. This implementation could be extended to cover
> > >> higher scheduling classes. Interrupts, however, are an orthogonal
> > >> issue.
> > >>
> > >> The collective context switch from one coscheduled set of tasks
> > >> to another -- while fast -- is not atomic. If a use-case needs
> > >> the absolute guarantee that all tasks of the previous set have
> > >> stopped executing before any task of the next set starts
> > >> executing, an additional hand-shake/barrier needs to be added.
> > > 
> > > IOW it's completely friggin useless for L1TF.
> > 
> > Do you believe me now, that L1TF is not "the whole and only reason"
> > I did this? :D
> 
> You did mention this work first to me in the context of L1TF, so I might
> have jumped to conclusions here.
> 
> Also, I have, of course, been looking at (SMT) co-scheduling,
> specifically in the context of L1TF, myself. I came up with a vastly
> different approach. Tim - where are we on getting some of that posted?
> 
> Note; that even though I wrote much of that code, I don't particularly
> like it either :-)

Did your approach ever get posted to LKML? I don't think I ever saw it,
and I don't see it on lore. Could it be posted as an RFC, even if it is
not yet suitable for upstreaming, just for comparison?

Thanks!
-Nish


Re: Kernel panic when enabling cgroup2 io controller at runtime

2018-11-01 Thread Nishanth Aravamudan
On 01.11.2018 [12:03:40 -0700], Nishanth Aravamudan wrote:
> Hi,
> 
> tl;dr: I see a kernel NULL pointer dereference with Linus' master
> (7c6c54b5) when enabling the IO cgroup2 controller at runtime. Is this
> PEBKAC and if so what config option am I missing?

Actually, this might be totally unrelated to my cgroup testing, and just
happened to be exacerbated by it? Adding LKML to the CC, preserving the
prior oops below and pasting another oops I just got after waiting a bit
during a normal boot.
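
(For context, enabling a cgroup v2 controller at runtime is normally done
through cgroup.subtree_control; a minimal sketch follows, with an illustrative
unified-hierarchy mount point rather than the exact layout used here:)

# mount -t cgroup2 none /sys/fs/cgroup/unified        # if not already mounted
# cat /sys/fs/cgroup/unified/cgroup.controllers       # 'io' must be listed
# echo "+io" > /sys/fs/cgroup/unified/cgroup.subtree_control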

[   38.450985] BUG: unable to handle kernel NULL pointer dereference at 

[   38.458879] PGD 0 P4D 0 
[   38.461444] Oops:  [#1] SMP PTI
[   38.464964] CPU: 27 PID: 2159 Comm: auditd Kdump: loaded Tainted: G  
 O  4.19.0+ #3
[   38.473713] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 
10/19/2017
[   38.481298] RIP: 0010:get_request+0x133/0x8b0
[   38.485674] Code: ff ff ff 41 f7 d4 48 89 85 78 ff ff ff 4c 01 f8 41 83 c4 
02 48 89 45 90 44 89 a5 74 ff ff ff 4d 8b 27 48 85 db 49 8b 44 24 18 <48> 8b 00 
48 89 855
[   38.504489] RSP: 0018:b59e5c3bb9c0 EFLAGS: 00010086
[   38.509722] RAX:  RBX: a0424bd78e00 RCX: 0001
[   38.516888] RDX: 355bbf83dbb0 RSI: 0800 RDI: a041eb1a6c80
[   38.524047] RBP: b59e5c3bba68 R08: 0060 R09: 9fe264871360
[   38.531188] R10: b59e5c3bbb28 R11: 1000 R12: 9fe2635d9360
[   38.538340] R13: 0001 R14: 0040 R15: a041eb1a6c40
[   38.545490] FS:  7fec3109c700() GS:a0427f54() 
knlGS:
[   38.553618] CS:  0010 DS:  ES:  CR0: 80050033
[   38.559381] CR2:  CR3: 00beaf27a002 CR4: 007606e0
[   38.566524] DR0:  DR1:  DR2: 
[   38.573680] DR3:  DR6: fffe0ff0 DR7: 0400
[   38.580830] PKRU: 5554
[   38.583543] Call Trace:
[   38.586013]  ? wait_woken+0x80/0x80
[   38.589543]  blk_queue_bio+0x131/0x460
[   38.593304]  generic_make_request+0x1a4/0x410
[   38.597673]  raid10_unplug+0x112/0x1b0 [raid10]
[   38.602211]  ? raid10_unplug+0x112/0x1b0 [raid10]
[   38.606927]  blk_flush_plug_list+0xce/0x250
[   38.611123]  blk_finish_plug+0x2c/0x40
[   38.614892]  ext4_writepages+0x635/0xe90
[   38.618837]  do_writepages+0x4b/0xe0
[   38.622424]  ? ext4_mark_inode_dirty+0x1d0/0x1d0
[   38.627068]  ? do_writepages+0x4b/0xe0
[   38.630838]  ? call_rcu+0x10/0x20
[   38.634168]  ? inode_switch_wbs+0x15d/0x190
[   38.638363]  __filemap_fdatawrite_range+0xc1/0x100
[   38.643161]  ? __filemap_fdatawrite_range+0xc1/0x100
[   38.648137]  file_write_and_wait_range+0x5a/0xb0
[   38.652767]  ext4_sync_file+0x111/0x3b0
[   38.656611]  vfs_fsync_range+0x48/0x80
[   38.660375]  ? __fget_light+0x54/0x60
[   38.664049]  do_fsync+0x3d/0x70
[   38.667203]  __x64_sys_fsync+0x14/0x20
[   38.670965]  do_syscall_64+0x5a/0x120
[   38.674639]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   38.679710] RIP: 0033:0x7fec320eeb07
[   38.683764] Code: 00 00 0f 05 48 3d 00 f0 ff ff 77 3f f3 c3 0f 1f 44 00 00 
53 89 fb 48 83 ec 10 e8 04 f5 ff ff 89 df 89 c2 b8 4a 00 00 00 0f 05 <48> 3d 00 
f0 ff ff4
[   38.703360] RSP: 002b:7fec3109be40 EFLAGS: 0293 ORIG_RAX: 
004a
[   38.711331] RAX: ffda RBX: 0005 RCX: 7fec320eeb07
[   38.718882] RDX:  RSI:  RDI: 0005
[   38.726428] RBP:  R08:  R09: 
[   38.733936] R10:  R11: 0293 R12: 7fec3109bfc0
[   38.741467] R13:  R14:  R15: 7ffdfe1da3e0
[   38.749025] Modules linked in: ebtable_filter ebtables ip6table_filter 
iptable_filter nbd vport_stt(O) openvswitch(O) nf_nat_ipv6 nf_nat_ipv4 
nf_conncount nf_nat u0
[   38.749064]  multipath linear mlx5_ib raid1 raid10 ses enclosure 
scsi_transport_sas ib_uverbs ib_core mlx5_core mgag200 i2c_algo_bit mlxfw ttm 
devlink drm_kms_helpi
[   38.861715] CR2: 
[0.061107] do_IRQ: 0.35 No irq handler for vector
[0.103225] [Firmware Bug]: the BIOS has corrupted hw-PMU resources (MSR 38d 
is b0)
[2.894501] scsi 0:0:32:0: Wrong diagnostic page; asked for 10 got 0
[3.263453] Out of memory: Kill process 222 (systemd-udevd) score 10 or 
sacrifice child
[3.271482] Killed process 222 (systemd-udevd) total-vm:26388kB, 
anon-rss:1244kB, file-rss:3088kB, shmem-rss:0kB
[3.325928] Out of memory: Kill process 225 (systemd-udevd) score 10 or 
sacrifice child
[3.333960] Killed process 386 (mdadm) total-vm:7236kB, anon-rss:120kB, 
file-rss:1788kB, shmem-rss:0kB
[3.981778] Out of memory: Kill process 450 (loadkeys) score 5 or sacrifice 
child
[3.989311] Killed process 450 (loadkeys) total-vm:4708kB, anon-rss:272kB, 
file-rss:1780kB, shmem-rss:0kB
[3.999073] Out of memory: Kill


Re: [RFC 61/60] cosched: Accumulated fixes and improvements

2018-09-26 Thread Nishanth Aravamudan
On 26.09.2018 [10:25:19 -0700], Nishanth Aravamudan wrote:
> On 13.09.2018 [21:19:38 +0200], Jan H. Schönherr wrote:
> > Here is an "extra" patch containing bug fixes and warning removals,
> > that I have accumulated up to this point.
> > 
> > It goes on top of the other 60 patches. (When it is time for v2,
> > these fixes will be integrated into the appropriate patches within
> > the series.)
> 
> I found another issue today, while attempting to test (with 61/60
> applied) separate coscheduling cgroups for vcpus and emulator threads
> [the default configuration with libvirt].



> Serial console output (I apologize that some lines got truncated)

I got a non-truncated log as well:

[  764.132461] BUG: unable to handle kernel NULL pointer dereference at 
0040
[  764.141001] PGD 0 P4D 0 
[  764.144020] Oops:  [#1] SMP PTI
[  764.147988] CPU: 70 PID: 0 Comm: swapper/70 Tainted: G   OE 
4.19-0rc3.ag-generic #4+1536951040do~8680a1b
[  764.159086] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 
10/19/2017
[  764.166968] RIP: 0010:set_next_entity+0x15/0x1d0
[  764.171887] Code: c8 48 8b 7d d0 eb 96 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 
00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 <8b> 46 40 
48 89 f30
[  764.191276] RSP: 0018:b97158cdfd78 EFLAGS: 00010046
[  764.196888] RAX:  RBX: 9806c0ee2d80 RCX: 
[  764.204403] RDX: 0022 RSI:  RDI: 9806c0ee2e00
[  764.211918] RBP: b97158cdfda0 R08: b97178cd8000 R09: 6080
[  764.219412] R10:  R11: 0001 R12: 9806c0ee2e00
[  764.226903] R13:  R14: 9806c0ee2e00 R15: 
[  764.234433] FS:  () GS:9806c0ec() 
knlGS:
[  764.242919] CS:  0010 DS:  ES:  CR0: 80050033
[  764.249045] CR2: 0040 CR3: 0002d720a004 CR4: 007626e0
[  764.256558] DR0:  DR1:  DR2: 
[  764.264108] DR3:  DR6: fffe0ff0 DR7: 0400
[  764.271663] PKRU: 5554
[  764.274784] Call Trace:
[  764.277633]  pick_next_task_fair+0x8a7/0xa20
[  764.282292]  __schedule+0x13a/0x8e0
[  764.286184]  schedule_idle+0x2c/0x40
[  764.290161]  do_idle+0x169/0x280
[  764.293816]  cpu_startup_entry+0x73/0x80
[  764.298151]  start_secondary+0x1ab/0x200
[  764.302513]  secondary_startup_64+0xa4/0xb0
[  764.307127] Modules linked in: act_police cls_basic ebtable_filter ebtables 
ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw 
ip_tables r
[  764.381780]  coretemp lp parport btrfs zstd_compress raid456 
async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
libcrc32c raid0 multipath linei
[  764.414567] CR2: 0040
[  764.418596] ---[ end trace 9b35e3cb99f8eacb ]---
[  764.437343] RIP: 0010:set_next_entity+0x15/0x1d0
[  764.442748] Code: c8 48 8b 7d d0 eb 96 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 
00 00 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54 49 89 fc 53 <8b> 46 40 
48 89 f30
[  764.462845] RSP: 0018:b97158cdfd78 EFLAGS: 00010046
[  764.468788] RAX:  RBX: 9806c0ee2d80 RCX: 
[  764.476633] RDX: 0022 RSI:  RDI: 9806c0ee2e00
[  764.484476] RBP: b97158cdfda0 R08: b97178cd8000 R09: 6080
[  764.492322] R10:  R11: 0001 R12: 9806c0ee2e00
[  764.500143] R13:  R14: 9806c0ee2e00 R15: 
[  764.507988] FS:  () GS:9806c0ec() 
knlGS:
[  764.516801] CS:  0010 DS:  ES:  CR0: 80050033
[  764.523258] CR2: 0040 CR3: 0002d720a004 CR4: 007626e0
[  764.531084] DR0:  DR1:  DR2: 
[  764.538987] DR3:  DR6: fffe0ff0 DR7: 0400
[  764.546813] PKRU: 5554
[  764.550185] Kernel panic - not syncing: Attempted to kill the idle task!
[  764.557615] Kernel Offset: 0x1f40 from 0x8100 (relocation 
range: 0x8000-0xbfff)
[  764.581890] ---[ end Kernel panic - not syncing: Attempted to kill the idle 
task! ]---
[  764.590574] WARNING: CPU: 70 PID: 0 at 
/build/linux-4.19-0rc3.ag.4/kernel/sched/core.c:1187 set_task_cpu+0x193/0x1a0
[  764.601740] Modules linked in: act_police cls_basic ebtable_filter ebtables 
ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw 
ip_tables r
[  764.677788]  coretemp lp parport btrfs zstd_compress raid456 
async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq 
libcrc32c raid0 multipath linei
[  764.711018] CPU: 70 PID: 0 Comm: swapper/70 Tainted: G  DOE 
4.19-0rc3.ag-generic #4+1536951040do~8680a1b
[ 


Re: [RFC 61/60] cosched: Accumulated fixes and improvements

2018-09-26 Thread Nishanth Aravamudan
On 13.09.2018 [21:19:38 +0200], Jan H. Schönherr wrote:
> Here is an "extra" patch containing bug fixes and warning removals,
> that I have accumulated up to this point.
> 
> It goes on top of the other 60 patches. (When it is time for v2,
> these fixes will be integrated into the appropriate patches within
> the series.)

I found another issue today, while attempting to test (with 61/60
applied) separate coscheduling cgroups for vcpus and emulator threads
[the default configuration with libvirt].

/sys/fs/cgroup/cpu# cat cpu.scheduled 
1
/sys/fs/cgroup/cpu# cd machine/
/sys/fs/cgroup/cpu/machine# cat cpu.scheduled
0
/sys/fs/cgroup/cpu/machine# cd VM-1.libvirt-qemu/
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu# cat cpu.scheduled
0
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu# cd vcpu0/
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu/vcpu0# cat cpu.scheduled
0
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu/vcpu0# echo 1 > cpu.scheduled
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu/vcpu0# cd ../emulator/
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu/emulator# echo 1 > cpu.scheduled
/sys/fs/cgroup/cpu/machine/VM-1.libvirt-qemu/emulator# 

Serial console output (I apologize that some lines got truncated)

[ 1060.840120] BUG: unable to handle kernel NULL pointer dere0
[ 1060.848782] PGD 0 P4D 0 
[ 1060.852068] Oops:  [#1] SMP PTI
[ 1060.856207] CPU: 44 PID: 0 Comm: swapper/44 Tainted: G   OE 4.19b
[ 1060.867029] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 10/17
[ 1060.874872] RIP: 0010:set_next_entity+0x15/0x1d0
[ 1060.879770] Code: c8 48 8b 7d d0 eb 96 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
[ 1060.899165] RSP: 0018:aa2b98c0fd78 EFLAGS: 00010046
[ 1060.904720] RAX:  RBX: 996940ba2d80 RCX: 
[ 1060.912199] RDX: 0008 RSI:  RDI: 996940ba2e00
[ 1060.919678] RBP: aa2b98c0fda0 R08:  R09: 
[ 1060.927174] R10:  R11: 0001 R12: 996940ba2e00
[ 1060.934655] R13:  R14: 996940ba2e00 R15: 
[ 1060.942134] FS:  () GS:996940b8() knlGS:0
[ 1060.950572] CS:  0010 DS:  ES:  CR0: 80050033
[ 1060.956673] CR2: 0040 CR3: 0064af40a006 CR4: 007626e0
[ 1060.964172] DR0:  DR1:  DR2: 
[ 1060.971677] DR3:  DR6: fffe0ff0 DR7: 0400
[ 1060.979191] PKRU: 5554
[ 1060.982282] Call Trace:
[ 1060.985126]  pick_next_task_fair+0x8a7/0xa20
[ 1060.989794]  __schedule+0x13a/0x8e0
[ 1060.993691]  ? update_ts_time_stats+0x59/0x80
[ 1060.998439]  schedule_idle+0x2c/0x40
[ 1061.002410]  do_idle+0x169/0x280
[ 1061.006032]  cpu_startup_entry+0x73/0x80
[ 1061.010348]  start_secondary+0x1ab/0x200
[ 1061.014673]  secondary_startup_64+0xa4/0xb0
[ 1061.019265] Modules linked in: act_police cls_basic ebtable_filter ebtables i
[ 1061.093145]  mac_hid coretemp lp parport btrfs zstd_compress raid456 async_ri
[ 1061.126494] CR2: 0040
[ 1061.130467] ---[ end trace 3462ef57e3394c4f ]---
[ 1061.147237] RIP: 0010:set_next_entity+0x15/0x1d0
[ 1061.152510] Code: c8 48 8b 7d d0 eb 96 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00
[ 1061.172573] RSP: 0018:aa2b98c0fd78 EFLAGS: 00010046
[ 1061.178482] RAX:  RBX: 996940ba2d80 RCX: 
[ 1061.186309] RDX: 0008 RSI:  RDI: 996940ba2e00
[ 1061.194109] RBP: aa2b98c0fda0 R08:  R09: 
[ 1061.201908] R10:  R11: 0001 R12: 996940ba2e00
[ 1061.209698] R13:  R14: 996940ba2e00 R15: 
[ 1061.217490] FS:  () GS:996940b8() knlGS:0
[ 1061.226236] CS:  0010 DS:  ES:  CR0: 80050033
[ 1061.232622] CR2: 0040 CR3: 0064af40a006 CR4: 007626e0
[ 1061.240405] DR0:  DR1:  DR2: 
[ 1061.248168] DR3:  DR6: fffe0ff0 DR7: 0400
[ 1061.255909] PKRU: 5554
[ 1061.259221] Kernel panic - not syncing: Attempted to kill the idle task!
[ 1062.345087] Shutting down cpus with NMI
[ 1062.351037] Kernel Offset: 0x3340 from 0x8100 (relocation ra)
[ 1062.374645] ---[ end Kernel panic - not syncing: Attempted to kill the idle -
[ 1062.383218] WARNING: CPU: 44 PID: 0 at /build/linux-4.19-0rc3.ag.4/kernel/sc0
[ 1062.394380] Modules linked in: act_police cls_basic ebtable_filter ebtables i
[ 1062.469725]  mac_hid coretemp lp parport btrfs zstd_compress raid456 async_ri
[ 1062.503656] CPU: 44 PID: 0 Comm: swapper/44 Tainted: G  DOE 4.19b
[ 1062.514972] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.2.11 10/17
[ 1062.523357] RIP: 0010:set_task_cpu+0x193/0x1a0
[ 1062.528624] Code: 00 00 04 e9 36 ff ff ff 0f 0b e9 be fe ff ff f7 43 60 fd f5
[ 1062.549066] 


Re: [RFC 00/60] Coscheduling for Linux

2018-09-13 Thread Nishanth Aravamudan
On 13.09.2018 [13:31:36 +0200], Jan H. Schönherr wrote:
> On 09/13/2018 01:15 AM, Nishanth Aravamudan wrote:
> > [...] if I just try to set machine's
> > cpu.scheduled to 1, with no other changes (not even changing any child
> > cgroup's cpu.scheduled yet), I get the following trace:
> > 
> > [16052.164259] [ cut here ]
> > [16052.168973] rq->clock_update_flags < RQCF_ACT_SKIP
> > [16052.168991] WARNING: CPU: 59 PID: 59533 at kernel/sched/sched.h:1303 
> > assert_clock_updated.isra.82.part.83+0x15/0x18
> [snip]
> 
> This goes away with the change below (which fixes patch 58/60).

Yep, I can confirm this one as well, is now fixed.

-Nish


Re: [RFC 00/60] Coscheduling for Linux

2018-09-12 Thread Nishanth Aravamudan
On 13.09.2018 [01:18:14 +0200], Jan H. Schönherr wrote:
> On 09/12/2018 09:34 PM, Jan H. Schönherr wrote:
> > That said, I see a hang, too. It seems to happen, when there is a
> > cpu.scheduled!=0 group that is not a direct child of the root task group.
> > You seem to have "/sys/fs/cgroup/cpu/machine" as an intermediate group.
> > (The case ==0 within !=0 within the root task group works for me.)
> > 
> > I'm going to dive into the code.
> 
> With the patch below (which technically changes patch 55/60), the hang
> I experienced is gone.
> 
> Please let me know, if it works for you as well.

Yep, this does fix the soft lockups for me, thanks! However, if I do a:

# find /sys/fs/cgroup/cpu/machine -mindepth 2 -maxdepth 2 -name cpu.scheduled 
-exec /bin/sh -c "echo 1 > {} " \;

which should co-schedule all the cgroups for emulator and vcpu threads,
I see the same warning I mentioned in my other e-mail:

[10469.832822] [ cut here ]
[10469.837555] rq->clock_update_flags < RQCF_ACT_SKIP
[10469.837574] WARNING: CPU: 89 PID: 49630 at kernel/sched/sched.h:1303 
assert_clock_updated.isra.82.part.83+0x15/0x18
[10469.853042] Modules linked in: act_police cls_basic ebtable_filter ebtables 
ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw 
ip_tables s
[10469.924590]  xxhash raid10 raid0 multipath linear raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq ses libcrc32c raid1 
enclosure scsi
[10469.945010] CPU: 89 PID: 49630 Comm: sh Tainted: G   O  
4.19.0-rc2-amazon-cosched+ #2
[10469.960061] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 
06/29/2018
[10469.967657] RIP: 0010:assert_clock_updated.isra.82.part.83+0x15/0x18
[10469.974126] Code: 0f 85 75 ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 
5f c3 48 c7 c7 28 30 eb 8d 31 c0 c6 05 67 18 27 01 01 e8 14 e0 fb ff <0f> 0b c3 
48 8b 970
[10469.993018] RSP: 0018:abc0b534fca8 EFLAGS: 00010096
[10469.998341] RAX: 0026 RBX: 9d74d12ede00 RCX: 0006
[10470.005559] RDX:  RSI: 0096 RDI: 9d74dfb16620
[10470.012780] RBP: 9d74df562e00 R08: 0796 R09: abc0b534fc48
[10470.020005] R10:  R11:  R12: 9d74d2849800
[10470.027226] R13: 0001 R14: 9d74df562e00 R15: 0001
[10470.034445] FS:  7fea86812740() GS:9d74dfb0() 
knlGS:
[10470.042678] CS:  0010 DS:  ES:  CR0: 80050033
[10470.048511] CR2: 5620f00314d8 CR3: 002cc55ea004 CR4: 007626e0
[10470.055739] DR0:  DR1:  DR2: 
[10470.062965] DR3:  DR6: fffe0ff0 DR7: 0400
[10470.070186] PKRU: 5554
[10470.072976] Call Trace:
[10470.075508]  update_curr+0x19f/0x1c0
[10470.079211]  dequeue_entity+0x21/0x8c0
[10470.083056]  dequeue_entity_fair+0x46/0x1c0
[10470.087321]  sdrq_update_root+0x35d/0x480
[10470.091420]  cosched_set_scheduled+0x80/0x1c0
[10470.095892]  cpu_scheduled_write_u64+0x26/0x30
[10470.100427]  cgroup_file_write+0xe3/0x140
[10470.104523]  kernfs_fop_write+0x110/0x190
[10470.108624]  __vfs_write+0x26/0x170
[10470.112236]  ? __audit_syscall_entry+0x101/0x130
[10470.116943]  ? _cond_resched+0x15/0x30
[10470.120781]  ? __sb_start_write+0x41/0x80
[10470.124871]  vfs_write+0xad/0x1a0
[10470.128268]  ksys_write+0x42/0x90
[10470.131668]  do_syscall_64+0x55/0x110
[10470.135421]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10470.140558] RIP: 0033:0x7fea863253c0
[10470.144213] Code: 73 01 c3 48 8b 0d c8 2a 2d 00 f7 d8 64 89 01 48 83 c8 ff 
c3 66 0f 1f 44 00 00 83 3d bd 8c 2d 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 
f0 ff ff4
[10470.163114] RSP: 002b:7ffe7cb22d18 EFLAGS: 0246 ORIG_RAX: 
0001
[10470.170783] RAX: ffda RBX: 5620f002f4d0 RCX: 7fea863253c0
[10470.178002] RDX: 0002 RSI: 5620f002f4d0 RDI: 0001
[10470.185222] RBP: 0002 R08: 0001 R09: 006b
[10470.192486] R10: 0008 R11: 0246 R12: 0001
[10470.199705] R13: 0002 R14: 7fff R15: 0002
[10470.206923] ---[ end trace fbf46e2c721c7acb ]---

Thanks,
Nish


Re: [RFC 00/60] Coscheduling for Linux

2018-09-12 Thread Nishanth Aravamudan
On 12.09.2018 [21:34:14 +0200], Jan H. Schönherr wrote:
> On 09/12/2018 02:24 AM, Nishanth Aravamudan wrote:
> > [ I am not subscribed to LKML, please keep me CC'd on replies ]
> > 
> > I tried a simple test with several VMs (in my initial test, I have 48
> > idle 1-cpu 512-mb VMs and 2 idle 2-cpu, 2-gb VMs) using libvirt, none
> > pinned to any CPUs. When I tried to set all of the top-level libvirt cpu
> > cgroups' to be co-scheduled (/bin/echo 1 >
> > /sys/fs/cgroup/cpu/machine/.libvirt-qemu/cpu.scheduled), the
> > machine hangs. This is using cosched_max_level=1.
> > 
> > There are several moving parts there, so I tried narrowing it down, by
> > only coscheduling one VM, and things seemed fine:
> > 
> > /sys/fs/cgroup/cpu/machine/.libvirt-qemu# echo 1 > cpu.scheduled 
> > /sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat cpu.scheduled 
> > 1
> > 
> > One thing that is not entirely obvious to me (but might be completely
> > intentional) is that since by default the top-level libvirt cpu cgroups
> > are empty:
> > 
> > /sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat tasks 
> > 
> > the result of this should be a no-op, right? [This becomes relevant
> > below] Specifically, all of the threads of qemu are in sub-cgroups,
> > which do not indicate they are co-scheduling:
> > 
> > /sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat emulator/cpu.scheduled 
> > 0
> > /sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat vcpu0/cpu.scheduled 
> > 0
> > 
> 
> This setup *should* work. It should be possible to set cpu.scheduled
> independent of the cpu.scheduled values of parent and child task groups.
> Any intermediate regular task group (i.e. cpu.scheduled==0) will still
> contribute the group fairness aspects.

Ah I see, that makes sense, thank you.

> That said, I see a hang, too. It seems to happen, when there is a
> cpu.scheduled!=0 group that is not a direct child of the root task group.
> You seem to have "/sys/fs/cgroup/cpu/machine" as an intermediate group.
> (The case ==0 within !=0 within the root task group works for me.)
>
> I'm going to dive into the code.
> 
> [...]
> > I am happy to do any further debugging I can do, or try patches on top
> > of those posted on the mailing list.
> 
> If you're willing, you can try to get rid of the intermediate "machine"
> cgroup in your setup for the moment. This might tell us, whether we're
> looking at the same issue.

Yep I will do this now. Note that if I just try to set machine's
cpu.scheduled to 1, with no other changes (not even changing any child
cgroup's cpu.scheduled yet), I get the following trace:

[16052.164259] [ cut here ]
[16052.168973] rq->clock_update_flags < RQCF_ACT_SKIP
[16052.168991] WARNING: CPU: 59 PID: 59533 at kernel/sched/sched.h:1303 
assert_clock_updated.isra.82.part.83+0x15/0x18
[16052.184424] Modules linked in: act_police cls_basic ebtable_filter ebtables 
ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw 
ip_tables s
[16052.255653]  xxhash raid10 raid0 multipath linear raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor raid6_pq ses libcrc32c raid1 
enclosure scsi
[16052.276029] CPU: 59 PID: 59533 Comm: bash Tainted: G   O  
4.19.0-rc2-amazon-cosched+ #1
[16052.291142] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 
06/29/2018
[16052.298728] RIP: 0010:assert_clock_updated.isra.82.part.83+0x15/0x18
[16052.305166] Code: 0f 85 75 ff ff ff 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 
5f c3 48 c7 c7 28 30 eb 94 31 c0 c6 05 47 18 27 01 01 e8 f4 df fb ff <0f> 0b c3 
48 8b 970
[16052.324050] RSP: 0018:9cada610bca8 EFLAGS: 00010096
[16052.329361] RAX: 0026 RBX: 8f06d65bae00 RCX: 0006
[16052.336580] RDX:  RSI: 0096 RDI: 8f1edf756620
[16052.343799] RBP: 8f06e0462e00 R08: 079b R09: 9cada610bc48
[16052.351018] R10:  R11:  R12: 8f06e0462e80
[16052.358237] R13: 0001 R14: 8f06e0462e00 R15: 0001
[16052.365458] FS:  7ff07ab02740() GS:8f1edf74() 
knlGS:
[16052.373647] CS:  0010 DS:  ES:  CR0: 80050033
[16052.379480] CR2: 7ff07ab139d8 CR3: 002ca2aea002 CR4: 007626e0
[16052.386698] DR0:  DR1:  DR2: 
[16052.393917] DR3:  DR6: fffe0ff0 DR7: 0400
[16052.401137] PKRU: 5554
[16052.403927] Call Trace:
[16052.406460]  update_curr+0x19f/0x1c0
[16052.410116]  dequeue_entity+0x21/0x8c0
[16052.413950]  ? terminate_walk+0x55/0xb0
[16052.417871]  dequeue_entity_fair+0x46/0x1c0
[16052.422136]  sdrq_update_root+0


Re: [RFC 00/60] Coscheduling for Linux

2018-09-11 Thread Nishanth Aravamudan
[ I am not subscribed to LKML, please keep me CC'd on replies ]

On 07.09.2018 [23:39:47 +0200], Jan H. Schönherr wrote:
> This patch series extends CFS with support for coscheduling. The
> implementation is versatile enough to cover many different
> coscheduling use-cases, while at the same time being non-intrusive, so
> that behavior of legacy workloads does not change.

I tried a simple test with several VMs (in my initial test, I have 48
idle 1-cpu 512-mb VMs and 2 idle 2-cpu, 2-gb VMs) using libvirt, none
pinned to any CPUs. When I tried to set all of the top-level libvirt cpu
cgroups to be co-scheduled (/bin/echo 1 >
/sys/fs/cgroup/cpu/machine/.libvirt-qemu/cpu.scheduled), the
machine hangs. This is using cosched_max_level=1.
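
(Setting cpu.scheduled on all of the top-level libvirt cpu cgroups amounts to
a loop along these lines; the glob is illustrative, since the per-VM directory
names are elided above:)

# for g in /sys/fs/cgroup/cpu/machine/*.libvirt-qemu; do echo 1 > "$g/cpu.scheduled"; done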

There are several moving parts there, so I tried narrowing it down, by
only coscheduling one VM, and things seemed fine:

/sys/fs/cgroup/cpu/machine/.libvirt-qemu# echo 1 > cpu.scheduled 
/sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat cpu.scheduled 
1

One thing that is not entirely obvious to me (but might be completely
intentional) is that since by default the top-level libvirt cpu cgroups
are empty:

/sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat tasks 

the result of this should be a no-op, right? [This becomes relevant
below] Specifically, all of the threads of qemu are in sub-cgroups,
which do not indicate they are co-scheduling:

/sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat emulator/cpu.scheduled 
0
/sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat vcpu0/cpu.scheduled 
0

When I then try to coschedule the second VM, the machine hangs.

/sys/fs/cgroup/cpu/machine/.libvirt-qemu# echo 1 > cpu.scheduled 
Timeout, server  not responding.

On the console, I see the same backtraces I see when I try to set all of
the VMs to be coscheduled:

[  144.494091] watchdog: BUG: soft lockup - CPU#87 stuck for 22s! [CPU 
0/KVM:25344]
[  144.507629] Modules linked in: act_police cls_basic ebtable_filter ebtables 
ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw 
ip_tables s
[  144.578858]  xxhash raid10 raid0 multipath linear raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor ses raid6_pq enclosure libcrc32c 
raid1 scsi
[  144.599227] CPU: 87 PID: 25344 Comm: CPU 0/KVM Tainted: G   O  
4.19.0-rc2-amazon-cosched+ #1
[  144.608819] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 
06/29/2018
[  144.616403] RIP: 0010:smp_call_function_single+0xa7/0xd0
[  144.621818] Code: 01 48 89 d1 48 89 f2 4c 89 c6 e8 64 fe ff ff c9 c3 48 89 
d1 48 89 f2 48 89 e6 e8 54 fe ff ff 8b 54 24 18 83 e2 01 74 0b f3 90 <8b> 54 24 
18 83 e25
[  144.640703] RSP: 0018:b2a4a75abb40 EFLAGS: 0202 ORIG_RAX: 
ff13
[  144.648390] RAX:  RBX: 0057 RCX: 
[  144.655607] RDX: 0001 RSI: 00fb RDI: 0202
[  144.662826] RBP: b2a4a75abb60 R08:  R09: 0f39
[  144.670073] R10:  R11:  R12: 8a9c03fc8000
[  144.677301] R13: 8ab4589dc100 R14: 0057 R15: 
[  144.684519] FS:  7f51cd41a700() GS:8ab45fac() 
knlGS:
[  144.692710] CS:  0010 DS:  ES:  CR0: 80050033
[  144.698542] CR2: 00c4203c CR3: 00178a97e005 CR4: 007626e0
[  144.705771] DR0:  DR1:  DR2: 
[  144.712989] DR3:  DR6: fffe0ff0 DR7: 0400
[  144.720215] PKRU: 5554
[  144.723016] Call Trace:
[  144.725553]  ? vmx_sched_in+0xc0/0xc0 [kvm_intel]
[  144.730341]  vmx_vcpu_load+0x244/0x310 [kvm_intel]
[  144.735220]  ? __switch_to_asm+0x40/0x70
[  144.739231]  ? __switch_to_asm+0x34/0x70
[  144.743235]  ? __switch_to_asm+0x40/0x70
[  144.747240]  ? __switch_to_asm+0x34/0x70
[  144.751243]  ? __switch_to_asm+0x40/0x70
[  144.755246]  ? __switch_to_asm+0x34/0x70
[  144.759250]  ? __switch_to_asm+0x40/0x70
[  144.763272]  ? __switch_to_asm+0x34/0x70
[  144.767284]  ? __switch_to_asm+0x40/0x70
[  144.771296]  ? __switch_to_asm+0x34/0x70
[  144.775299]  ? __switch_to_asm+0x40/0x70
[  144.779313]  ? __switch_to_asm+0x34/0x70
[  144.783317]  ? __switch_to_asm+0x40/0x70
[  144.787338]  kvm_arch_vcpu_load+0x40/0x270 [kvm]
[  144.792056]  finish_task_switch+0xe2/0x260
[  144.796238]  __schedule+0x316/0x890
[  144.799810]  schedule+0x32/0x80
[  144.803039]  kvm_vcpu_block+0x7a/0x2e0 [kvm]
[  144.807399]  kvm_arch_vcpu_ioctl_run+0x1a7/0x1990 [kvm]
[  144.812705]  ? futex_wake+0x84/0x150
[  144.816368]  kvm_vcpu_ioctl+0x3ab/0x5d0 [kvm]
[  144.820810]  ? wake_up_q+0x70/0x70
[  144.824311]  do_vfs_ioctl+0x92/0x600
[  144.827985]  ? syscall_trace_enter+0x1ac/0x290
[  144.832517]  ksys_ioctl+0x60/0x90
[  144.835913]  ? exit_to_usermode_loop+0xa6/0xc2
[  144.840436]  __x64_sys_ioctl+0x16/0x20
[  144.844267]  do_syscall_64+0x55/0x110
[  144.848012]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  

Re: [RFC 00/60] Coscheduling for Linux

2018-09-11 Thread Nishanth Aravamudan
[ I am not subscribed to LKML, please keep me CC'd on replies ]

On 07.09.2018 [23:39:47 +0200], Jan H. Schönherr wrote:
> This patch series extends CFS with support for coscheduling. The
> implementation is versatile enough to cover many different
> coscheduling use-cases, while at the same time being non-intrusive, so
> that behavior of legacy workloads does not change.

I tried a simple test with several VMs (in my initial test, I have 48
idle 1-cpu 512-mb VMs and 2 idle 2-cpu, 2-gb VMs) using libvirt, none
pinned to any CPUs. When I tried to set all of the top-level libvirt cpu
cgroups' to be co-scheduled (/bin/echo 1 >
/sys/fs/cgroup/cpu/machine/.libvirt-qemu/cpu.scheduled), the
machine hangs. This is using cosched_max_level=1.

There are several moving parts there, so I tried narrowing it down, by
only coscheduling one VM, and things seemed fine:

/sys/fs/cgroup/cpu/machine/.libvirt-qemu# echo 1 > cpu.scheduled 
/sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat cpu.scheduled 
1

One thing that is not entirely obvious to me (but might be completely
intentional) is that since by default the top-level libvirt cpu cgroups
are empty:

/sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat tasks 

the result of this should be a no-op, right? [This becomes relevant
below] Specifically, all of the threads of qemu are in sub-cgroups,
which do not indicate they are co-scheduling:

/sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat emulator/cpu.scheduled 
0
/sys/fs/cgroup/cpu/machine/.libvirt-qemu# cat vcpu0/cpu.scheduled 
0

When I then try to coschedule the second VM, the machine hangs.

/sys/fs/cgroup/cpu/machine/.libvirt-qemu# echo 1 > cpu.scheduled 
Timeout, server  not responding.

On the console, I see the same backtraces I see when I try to set all of
the VMs to be coscheduled:

[  144.494091] watchdog: BUG: soft lockup - CPU#87 stuck for 22s! [CPU 
0/KVM:25344]
[  144.507629] Modules linked in: act_police cls_basic ebtable_filter ebtables 
ip6table_filter iptable_filter nbd ip6table_raw ip6_tables xt_CT iptable_raw 
ip_tables s
[  144.578858]  xxhash raid10 raid0 multipath linear raid456 async_raid6_recov 
async_memcpy async_pq async_xor async_tx xor ses raid6_pq enclosure libcrc32c 
raid1 scsi
[  144.599227] CPU: 87 PID: 25344 Comm: CPU 0/KVM Tainted: G   O  
4.19.0-rc2-amazon-cosched+ #1
[  144.608819] Hardware name: Dell Inc. PowerEdge R640/0W23H8, BIOS 1.4.9 
06/29/2018
[  144.616403] RIP: 0010:smp_call_function_single+0xa7/0xd0
[  144.621818] Code: 01 48 89 d1 48 89 f2 4c 89 c6 e8 64 fe ff ff c9 c3 48 89 
d1 48 89 f2 48 89 e6 e8 54 fe ff ff 8b 54 24 18 83 e2 01 74 0b f3 90 <8b> 54 24 
18 83 e25
[  144.640703] RSP: 0018:b2a4a75abb40 EFLAGS: 0202 ORIG_RAX: 
ff13
[  144.648390] RAX:  RBX: 0057 RCX: 
[  144.655607] RDX: 0001 RSI: 00fb RDI: 0202
[  144.662826] RBP: b2a4a75abb60 R08:  R09: 0f39
[  144.670073] R10:  R11:  R12: 8a9c03fc8000
[  144.677301] R13: 8ab4589dc100 R14: 0057 R15: 
[  144.684519] FS:  7f51cd41a700() GS:8ab45fac() 
knlGS:
[  144.692710] CS:  0010 DS:  ES:  CR0: 80050033
[  144.698542] CR2: 00c4203c CR3: 00178a97e005 CR4: 007626e0
[  144.705771] DR0:  DR1:  DR2: 
[  144.712989] DR3:  DR6: fffe0ff0 DR7: 0400
[  144.720215] PKRU: 5554
[  144.723016] Call Trace:
[  144.725553]  ? vmx_sched_in+0xc0/0xc0 [kvm_intel]
[  144.730341]  vmx_vcpu_load+0x244/0x310 [kvm_intel]
[  144.735220]  ? __switch_to_asm+0x40/0x70
[  144.739231]  ? __switch_to_asm+0x34/0x70
[  144.743235]  ? __switch_to_asm+0x40/0x70
[  144.747240]  ? __switch_to_asm+0x34/0x70
[  144.751243]  ? __switch_to_asm+0x40/0x70
[  144.755246]  ? __switch_to_asm+0x34/0x70
[  144.759250]  ? __switch_to_asm+0x40/0x70
[  144.763272]  ? __switch_to_asm+0x34/0x70
[  144.767284]  ? __switch_to_asm+0x40/0x70
[  144.771296]  ? __switch_to_asm+0x34/0x70
[  144.775299]  ? __switch_to_asm+0x40/0x70
[  144.779313]  ? __switch_to_asm+0x34/0x70
[  144.783317]  ? __switch_to_asm+0x40/0x70
[  144.787338]  kvm_arch_vcpu_load+0x40/0x270 [kvm]
[  144.792056]  finish_task_switch+0xe2/0x260
[  144.796238]  __schedule+0x316/0x890
[  144.799810]  schedule+0x32/0x80
[  144.803039]  kvm_vcpu_block+0x7a/0x2e0 [kvm]
[  144.807399]  kvm_arch_vcpu_ioctl_run+0x1a7/0x1990 [kvm]
[  144.812705]  ? futex_wake+0x84/0x150
[  144.816368]  kvm_vcpu_ioctl+0x3ab/0x5d0 [kvm]
[  144.820810]  ? wake_up_q+0x70/0x70
[  144.824311]  do_vfs_ioctl+0x92/0x600
[  144.827985]  ? syscall_trace_enter+0x1ac/0x290
[  144.832517]  ksys_ioctl+0x60/0x90
[  144.835913]  ? exit_to_usermode_loop+0xa6/0xc2
[  144.840436]  __x64_sys_ioctl+0x16/0x20
[  144.844267]  do_syscall_64+0x55/0x110
[  144.848012]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  

Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size

2015-11-06 Thread Nishanth Aravamudan
On 05.11.2015 [11:58:39 -0800], Christoph Hellwig wrote:
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig 
> 
> ... but I doubt we'll ever bother updating it.  Most architectures
> with larger page sizes also have iommus and would need different settings
> for different iommus vs direct mapping for very little gain.  There's a
> reason why we never bothered for RDMA either.

FWIW, whose tree should this go through? The bug only appears on Power,
afaik, but the patch is now just an NVMe change.

Thanks,
Nish



Re: [PATCH 1/1 v4] drivers/nvme: default to 4k device page size

2015-11-05 Thread Nishanth Aravamudan
On 05.11.2015 [11:58:39 -0800], Christoph Hellwig wrote:
> Looks fine,
> 
> Reviewed-by: Christoph Hellwig 
> 
> ... but I doubt we'll ever bother updating it.  Most architectures
> with larger page sizes also have iommus and would need different settings
> for different iommus vs direct mapping for very little gain.  There's a
> reason why we never bothered for RDMA either.

Fair enough :) Thanks for all your reviews and comments.

-Nish



[PATCH 1/1 v4] drivers/nvme: default to 4k device page size

2015-11-05 Thread Nishanth Aravamudan
On 03.11.2015 [13:46:25 +], Keith Busch wrote:
> On Tue, Nov 03, 2015 at 05:18:24AM -0800, Christoph Hellwig wrote:
> > On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> > > diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
> > > index ccc0c1f93daa..a9a5285bdb39 100644
> > > --- a/drivers/block/nvme-core.c
> > > +++ b/drivers/block/nvme-core.c
> > > @@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct 
> > > nvme_dev *dev)
> > >   u32 aqa;
> > >   u64 cap = readq(&dev->bar->cap);
> > >   struct nvme_queue *nvmeq;
> > > - unsigned page_shift = PAGE_SHIFT;
> > > + /*
> > > +  * default to a 4K page size, with the intention to update this
> > > +  * path in the future to accomodate architectures with differing
> > > +  * kernel and IO page sizes.
> > > +  */
> > > + unsigned page_shift = 12;
> > >   unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
> > >   unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
> > 
> > Looks good as a start.  Note that all the MPSMIN/MAX checking could
> > be removed as NVMe devices must support 4k pages.
> 
> MAX can go, and while it's probably the case that all devices support 4k,
> it's not a spec requirement, so we should keep the dev_page_min check.

Ok, here's an updated patch.

We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).
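
To put numbers on that example (illustration only, not part of the patch):
0xF000 is 61440 bytes, which divides evenly by 4096 but leaves a remainder of
4096 when divided by 8192. A trivial userspace check:

#include <stdio.h>

int main(void)
{
	unsigned long dma_len = 0xF000;	/* the example dma_len from the bug report */

	printf("%% 4096 -> %lu\n", dma_len % 4096);	/* prints 0: 4K-aligned */
	printf("%% 8192 -> %lu\n", dma_len % 8192);	/* prints 4096: not 8K-aligned */
	return 0;
}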

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size. There is not currently an
API to obtain the IOMMU's page size across all architectures and in the
interest of a stop-gap fix to this functional issue, default the NVMe
device page size to 4K, with the intent of adding such an API and
implementation across all architectures in the next merge window.

With the functionally equivalent v3 of this patch, our hardware test
exerciser survives when using 32-bit DMA; without the patch, the kernel
will BUG within a few minutes.

Signed-off-by: Nishanth Aravamudan 

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

v2 -> v3:
  In the interest of fixing the functional problem in the short-term,
  just force the device page size to 4K and work on adding the new API
  in the next merge window.

v3 -> v4:
  Rebase to the 4.3, including the new code locations.
  Based upon feedback from Keith Busch and Christoph Hellwig, remove the
  device max check, as the spec requires MPSMAX >= 4K.

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index e878590e71b6..00ca45bb0bc0 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -1701,9 +1701,13 @@ static int nvme_configure_admin_queue(struct nvme_dev 
*dev)
u32 aqa;
u64 cap = readq(&dev->bar->cap);
struct nvme_queue *nvmeq;
-   unsigned page_shift = PAGE_SHIFT;
+   /*
+* default to a 4K page size, with the intention to update this
+* path in the future to accomodate architectures with differing
+* kernel and IO page sizes.
+*/
+   unsigned page_shift = 12;
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
-   unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 
if (page_shift < dev_page_min) {
dev_err(dev->dev,
@@ -1712,13 +1716,6 @@ static int nvme_configure_admin_queue(struct nvme_dev 
*dev)
1 << page_shift);
return -ENODEV;
}
-   if (page_shift > dev_page_max) {
-   dev_info(dev->dev,
-   "Device maximum page size (%u) smaller than "
-   "host (%u); enabling work-around\n",
-   1 << dev_page_max, 1 << page_shift);
-   page_shift = dev_page_max;
-   }
 
dev->subsystem = readl(&dev->bar->vs) >= NVME_VS(1, 1) ?
NVME_CAP_NSSRC(cap) : 0;




Re: [PATCH 1/1 v3] drivers/nvme: default to 4k device page size

2015-10-30 Thread Nishanth Aravamudan
On 30.10.2015 [21:48:48 +], Keith Busch wrote:
> On Fri, Oct 30, 2015 at 02:35:11PM -0700, Nishanth Aravamudan wrote:
> > Given that it's 4K just about everywhere by default (and sort of
> > implicitly expected to be, I guess), I think I'd prefer we default to
> > 4K. That should mitigate the performance impact (I'll ask our IO team to
> > do some runs, but since this impacts functionality on some hardware, I
> > don't think it's too relevant for now). Unless there are NVMe devices
> > with an MPSMAX < 4K?
> 
> Right, I assumed MPSMIN was always 4k for the same reason you mentioned,
> but you can hard code it like you've done in your patch.
> 
> The spec defines MPSMAX such that it's impossible to find a device
> with MPSMAX < 4k.
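
Right -- and the encoding itself guarantees that: CAP.MPSMIN/MPSMAX are
exponents on a fixed 2^12 base (the "+ 12" in the driver code above), so even a
raw field value of 0 already means 4K. A one-line sketch of the decoding:

/* sketch only, mirroring the "+ 12" used by the driver above */
static unsigned int nvme_mps_bytes(unsigned int mps_field)
{
	return 1u << (12 + mps_field);	/* mps_field == 0  =>  4096 bytes */
}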

Great, thanks!

-Nish



Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-30 Thread Nishanth Aravamudan
On 29.10.2015 [18:49:55 -0700], David Miller wrote:
> From: Nishanth Aravamudan 
> Date: Thu, 29 Oct 2015 08:57:01 -0700
> 
> > So, would that imply changing just the NVMe driver code rather than
> > adding the dma_page_shift API at all? What about
> > architectures that can support the larger page sizes? There is an
> > implied performance impact, at least, of shifting the IO size down.
> > 
> > Sorry for the continuing questions -- I got lots of conflicting feedback
> > on the last series and want to make sure v4 is more acceptable.
> 
> In the long term I would be very happy to see us having a real interface
> for this stuff, just my opinion...

Yep, I think I'll try and balance the two -- fix NVMe for now with a 4K
page size as suggested by Christoph, and then work on the more complete
API for the next merge.

Thanks,
Nish



[PATCH 1/1 v3] drivers/nvme: default to 4k device page size

2015-10-30 Thread Nishanth Aravamudan
On 29.10.2015 [17:20:43 +], Busch, Keith wrote:
> On Thu, Oct 29, 2015 at 08:57:01AM -0700, Nishanth Aravamudan wrote:
> > On 29.10.2015 [04:55:36 -0700], Christoph Hellwig wrote:
> > > We had a quick chat about this issue and I think we simply should
> > > default to an NVMe controller page size of 4k everywhere as that's the
> > > safe default.  This is also what we do for RDMA memory registrations and
> > > it works fine there for SRP and iSER.
> > 
> > So, would that imply changing just the NVMe driver code rather than
> > adding the dma_page_shift API at all? What about
> > architectures that can support the larger page sizes? There is an
> > implied performance impact, at least, of shifting the IO size down.
> 
> It is the safe option, but you're right that it might have a
> measurable performance impact (can you run an experiment?). Maybe we
> should just change the driver to always use MPSMIN for the moment in
> the interest of time, and you can flush out the new API before the
> next merge window.

Given that it's 4K just about everywhere by default (and sort of
implicitly expected to be, I guess), I think I'd prefer we default to
4K. That should mitigate the performance impact (I'll ask our IO team to
do some runs, but since this impacts functionality on some hardware, I
don't think it's too relevant for now). Unless there are NVMe devices
with an MPSMAX < 4K?

Something like the following?



We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size. There is not currently an
API to obtain the IOMMU's page size across all architectures and in the
interest of a stop-gap fix to this functional issue, default the NVMe
device page size to 4K, with the intent of adding such an API and
implementation across all architectures in the next merge window.

Signed-off-by: Nishanth Aravamudan 

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

v2 -> v3:
  In the interest of fixing the functional problem in the short-term,
  just force the device page size to 4K and work on adding the new API
  in the next merge window.

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index ccc0c1f93daa..a9a5285bdb39 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -1717,7 +1717,12 @@ static int nvme_configure_admin_queue(struct nvme_dev 
*dev)
u32 aqa;
u64 cap = readq(&dev->bar->cap);
struct nvme_queue *nvmeq;
-   unsigned page_shift = PAGE_SHIFT;
+   /*
+* default to a 4K page size, with the intention to update this
+* path in the future to accomodate architectures with differing
+* kernel and IO page sizes.
+*/
+   unsigned page_shift = 12;
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 




-Nish




Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-29 Thread Nishanth Aravamudan
On 29.10.2015 [04:55:36 -0700], Christoph Hellwig wrote:
> On Wed, Oct 28, 2015 at 01:59:23PM +, Busch, Keith wrote:
> > The "new" interface for all the other architectures is the same as the
> > old one we've been using for the last 5 years.
> > 
> > I welcome x86 maintainer feedback to confirm virtual and DMA addresses
> > have the same offset at 4k alignment, but I have to insist we don't
> > break my currently working hardware to force their attention.
> 
> We had a quick chat about this issue and I think we simply should
> default to an NVMe controller page size of 4k everywhere as that's the
> safe default.  This is also what we do for RDMA memory registrations and
> it works fine there for SRP and iSER.

So, would that imply changing just the NVMe driver code rather than
adding the dma_page_shift API at all? What about
architectures that can support the larger page sizes? There is an
implied performance impact, at least, of shifting the IO size down.

Sorry for the continuing questions -- I got lots of conflicting feedback
on the last series and want to make sure v4 is more acceptable.

Thanks,
Nish




Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 28.10.2015 [11:20:05 +0900], Benjamin Herrenschmidt wrote:
> On Tue, 2015-10-27 at 18:54 -0700, Nishanth Aravamudan wrote:
> > 
> > In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K?
> 
> None :-) The TCEs are completely bypassed. You get a N:M linear mapping
> of all memory starting at 1<<59 PCI side.

Err, duh, sorry! Ok, so in that case, DMA page shift is PAGE_SHIFT,
then?

-Nish



Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 28.10.2015 [12:00:20 +1100], Alexey Kardashevskiy wrote:
> On 10/28/2015 09:27 AM, Nishanth Aravamudan wrote:
> >On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:
> >>On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
> >>>On Power, the kernel's page size can differ from the IOMMU's page size,
> >>>so we need to override the generic implementation, which always returns
> >>>the kernel's page size. Lookup the IOMMU's page size from struct
> >>>iommu_table, if available. Fallback to the kernel's page size,
> >>>otherwise.
> >>>
> >>>Signed-off-by: Nishanth Aravamudan 
> >>>---
> >>>  arch/powerpc/include/asm/dma-mapping.h | 3 +++
> >>>  arch/powerpc/kernel/dma.c  | 9 +
> >>>  2 files changed, 12 insertions(+)
> >>>
> >>>diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> >>>b/arch/powerpc/include/asm/dma-mapping.h
> >>>index 7f522c0..c5638f4 100644
> >>>--- a/arch/powerpc/include/asm/dma-mapping.h
> >>>+++ b/arch/powerpc/include/asm/dma-mapping.h
> >>>@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, 
> >>>dma_addr_t off)
> >>>  #define HAVE_ARCH_DMA_SET_MASK 1
> >>>  extern int dma_set_mask(struct device *dev, u64 dma_mask);
> >>>
> >>>+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
> >>>+extern unsigned long dma_get_page_shift(struct device *dev);
> >>>+
> >>>  #include 
> >>>
> >>>  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
> >>>diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> >>>index 59503ed..e805af2 100644
> >>>--- a/arch/powerpc/kernel/dma.c
> >>>+++ b/arch/powerpc/kernel/dma.c
> >>>@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
> >>>  }
> >>>  EXPORT_SYMBOL(dma_set_mask);
> >>>
> >>>+unsigned long dma_get_page_shift(struct device *dev)
> >>>+{
> >>>+  struct iommu_table *tbl = get_iommu_table_base(dev);
> >>>+  if (tbl)
> >>>+  return tbl->it_page_shift;
> >>
> >>
> >>All PCI devices have this initialized on POWER (at least, our, IBM's
> >>POWER) so 4K will always be returned here while in the case of
> >>(get_dma_ops(dev)==&dma_direct_ops) it could actually return
> >>PAGE_SHIFT. Is 4K still preferred value to return here?
> >
> >Right, so the logic of my series, goes like this:
> >
> >a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is
> >PAGE_SHIFT everywhere, including Power.
> >
> >b) After 2/7, the Power code will return either the IOMMU table's shift
> >value, if set, or PAGE_SHIFT (I guess this would be the case if
> >get_dma_ops(dev) == &dma_direct_ops, as you said). That is no different
> >than we have now, except we can return the accurate IOMMU value if
> >available.
> 
> If it is not available, then something went wrong and BUG_ON(!tbl ||
> !tbl->it_page_shift) makes more sense here than pretending that this
> function can ever return PAGE_SHIFT. imho.

That's a good point, thanks!
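
Something like this, then (untested sketch of 2/7 with your suggestion folded
in)?

unsigned long dma_get_page_shift(struct device *dev)
{
	struct iommu_table *tbl = get_iommu_table_base(dev);

	/* a missing or unset table is a bug here, not a fallback case */
	BUG_ON(!tbl || !tbl->it_page_shift);
	return tbl->it_page_shift;
}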

> >3) After 3/7, the platform can override the generic Power
> >get_dma_page_shift().
> >
> >4) After 4/7, pseries will return the DDW value, if available, then
> >fallback to the IOMMU table's value. I think in the case of
> >get_dma_ops(dev)==&dma_direct_ops, the only way that can happen is if we
> >are using DDW, right?
> 
> This is for pseries guests; for the powernv host it is a "bypass"
> mode which does 64bit direct DMA mapping and there is no additional
> window for that (i.e. DIRECT64_PROPNAME, etc).

You're right! I should update the code to handle both cases.

In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K?

Seems like this would be a different platform implementation I'd put in
for 'powernv', is that right?
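
Roughly along these lines, perhaps (purely a sketch -- the function name and
how it would get wired up are my guesses):

static unsigned long dma_get_page_shift_powernv(struct device *dev)
{
	/* 64-bit bypass window: direct mapping, so the kernel page size applies */
	if (get_dma_ops(dev) == &dma_direct_ops)
		return PAGE_SHIFT;
	/* otherwise honour the TCE size of the 32-bit IOMMU window */
	return get_iommu_table_base(dev)->it_page_shift;
}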

My apologies for missing that, and thank you for the review!

-Nish



Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-27 Thread Nishanth Aravamudan
On 27.10.2015 [17:53:22 -0700], David Miller wrote:
> From: Nishanth Aravamudan 
> Date: Tue, 27 Oct 2015 15:20:10 -0700
> 
> > Well, looks like I should spin up a v4 anyways for the powerpc changes.
> > So, to make sure I understand your point, should I make the generic
> > dma_get_page_shift a compile-error kind of thing? It will only fail on
> > architectures that actually build the NVME driver (as the only caller).
> > But I'm not sure how exactly to achieve that, if you could give a bit
> > more detail I'd appreciate it!
> 
> Yes, I am basically suggesting to simply not provide a default at all.

For my own edification -- what is the way that gets resolved? I guess I
mean it seems like linux-next would cease to compile because of my new
series. Would my patches just get kicked out of -next for introducing
that (or even via the 0-day notifications), or should I put something
into the commit message indicating it is an API introduction?

Sorry for the tentativeness, I have not introduced a cross-architecture
API like this before.

Thanks,
Nish

> 



Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-27 Thread Nishanth Aravamudan
On 28.10.2015 [09:57:48 +1100], Julian Calaby wrote:
> Hi Nishanth,
> 
> On Wed, Oct 28, 2015 at 9:20 AM, Nishanth Aravamudan
>  wrote:
> > On 26.10.2015 [18:27:46 -0700], David Miller wrote:
> >> From: Nishanth Aravamudan 
> >> Date: Fri, 23 Oct 2015 13:54:20 -0700
> >>
> >> > 1) add a generic dma_get_page_shift implementation that just returns
> >> > PAGE_SHIFT
> >>
> >> I won't object to this patch series, but if I had implemented this I
> >> would have required the architectures to implement this explicitly,
> >> one-by-one.  I think it is less error prone and more likely to end
> >> up with all the architectures setting this correctly.
> >
> > Well, looks like I should spin up a v4 anyways for the powerpc changes.
> > So, to make sure I understand your point, should I make the generic
> > dma_get_page_shift a compile-error kind of thing? It will only fail on
> > architectures that actually build the NVME driver (as the only caller).
> > But I'm not sure how exactly to achieve that, if you could give a bit
> > more detail I'd appreciate it!
> 
> He's suggesting that you _don't_ put a generic implementation in
> /include/linux/dma-mapping.h and instead add it to _every_
> architecture.

Ah, I see! Well, I don't know much about the DMA internals of most
architectures -- and my approach kept things functionally the same
everywhere (using PAGE_SHIFT) except:

a) Power, where I know it doesn't work as-is
and
b) sparc, where the code implied that a different value than PAGE_SHIFT
should be used.

Thanks,
Nish



Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:
> On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
> >On Power, the kernel's page size can differ from the IOMMU's page size,
> >so we need to override the generic implementation, which always returns
> >the kernel's page size. Lookup the IOMMU's page size from struct
> >iommu_table, if available. Fallback to the kernel's page size,
> >otherwise.
> >
> >Signed-off-by: Nishanth Aravamudan 
> >---
> >  arch/powerpc/include/asm/dma-mapping.h | 3 +++
> >  arch/powerpc/kernel/dma.c  | 9 +
> >  2 files changed, 12 insertions(+)
> >
> >diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> >b/arch/powerpc/include/asm/dma-mapping.h
> >index 7f522c0..c5638f4 100644
> >--- a/arch/powerpc/include/asm/dma-mapping.h
> >+++ b/arch/powerpc/include/asm/dma-mapping.h
> >@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, 
> >dma_addr_t off)
> >  #define HAVE_ARCH_DMA_SET_MASK 1
> >  extern int dma_set_mask(struct device *dev, u64 dma_mask);
> >
> >+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
> >+extern unsigned long dma_get_page_shift(struct device *dev);
> >+
> >  #include 
> >
> >  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
> >diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> >index 59503ed..e805af2 100644
> >--- a/arch/powerpc/kernel/dma.c
> >+++ b/arch/powerpc/kernel/dma.c
> >@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
> >  }
> >  EXPORT_SYMBOL(dma_set_mask);
> >
> >+unsigned long dma_get_page_shift(struct device *dev)
> >+{
> >+struct iommu_table *tbl = get_iommu_table_base(dev);
> >+if (tbl)
> >+return tbl->it_page_shift;
> 
> 
> All PCI devices have this initialized on POWER (at least, our, IBM's
> POWER) so 4K will always be returned here while in the case of
> (get_dma_ops(dev)==&dma_direct_ops) it could actually return
> PAGE_SHIFT. Is 4K still preferred value to return here?

Right, so the logic of my series, goes like this:

a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is
PAGE_SHIFT everywhere, including Power.

b) After 2/7, the Power code will return either the IOMMU table's shift
value, if set, or PAGE_SHIFT (I guess this would be the case if
get_dma_ops(dev) == &dma_direct_ops, as you said). That is no different
than we have now, except we can return the accurate IOMMU value if
available.

3) After 3/7, the platform can override the generic Power
get_dma_page_shift().

4) After 4/7, pseries will return the DDW value, if available, then
fallback to the IOMMU table's value. I think in the case of
get_dma_ops(dev)==&dma_direct_ops, the only way that can happen is if we
are using DDW, right?

-Nish



Re: [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 27.10.2015 [16:56:10 +1100], Alexey Kardashevskiy wrote:
> On 10/24/2015 07:59 AM, Nishanth Aravamudan wrote:
> >When DDW (Dynamic DMA Windows) are present for a device, we have stored
> >the TCE (Translation Control Entry) size in a special device tree
> >property. Check if we have enabled DDW for the device and return the TCE
> >size from that property if present. If the property isn't present,
> >fallback to looking the value up in struct iommu_table. If we don't find
> >a iommu_table, fallback to the kernel's page size.
> >
> >Signed-off-by: Nishanth Aravamudan 
> >---
> >  arch/powerpc/platforms/pseries/iommu.c | 36 
> > ++
> >  1 file changed, 36 insertions(+)
> >
> >diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> >b/arch/powerpc/platforms/pseries/iommu.c
> >index 0946b98..1bf6471 100644
> >--- a/arch/powerpc/platforms/pseries/iommu.c
> >+++ b/arch/powerpc/platforms/pseries/iommu.c
> >@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct 
> >device *dev)
> > return dma_iommu_ops.get_required_mask(dev);
> >  }
> >
> >+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
> >+{
> >+struct iommu_table *tbl;
> >+
> >+if (!disable_ddw && dev_is_pci(dev)) {
> >+struct pci_dev *pdev = to_pci_dev(dev);
> >+struct device_node *dn;
> >+
> >+dn = pci_device_to_OF_node(pdev);
> >+
> >+/* search upwards for ibm,dma-window */
> >+for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
> >+dn = dn->parent)
> >+if (of_get_property(dn, "ibm,dma-window", NULL))
> >+break;
> >+/*
> >+ * if there is a DDW configuration, the TCE shift is stored in
> >+ * the property
> >+ */
> >+if (dn && PCI_DN(dn)) {
> >+const struct dynamic_dma_window_prop *direct64 =
> >+of_get_property(dn, DIRECT64_PROPNAME, NULL);
> 
> 
> This DIRECT64_PROPNAME property is only present under pHyp, QEMU/KVM
> does not set it as 64bit windows are dynamic there so something like
> find_existing_ddw() needs to be used here.

DIRECT64_PROPNAME is a Linux thing, not a pHyp or QEMU/KVM thing -- it's
created by the Linux DDW logic and left in the device-tree when we
successfully configure DDW.

You're right, though, that logically find_existing_ddw() would be better
to use here. I'll spin up a new version.

-Nish



Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-27 Thread Nishanth Aravamudan
On 26.10.2015 [18:27:46 -0700], David Miller wrote:
> From: Nishanth Aravamudan 
> Date: Fri, 23 Oct 2015 13:54:20 -0700
> 
> > 1) add a generic dma_get_page_shift implementation that just returns
> > PAGE_SHIFT
> 
> I won't object to this patch series, but if I had implemented this I
> would have required the architectures to implement this explicitly,
> one-by-one.  I think it is less error prone and more likely to end
> up with all the architectures setting this correctly.

Well, looks like I should spin up a v4 anyways for the powerpc changes.
So, to make sure I understand your point, should I make the generic
dma_get_page_shift a compile-error kind of thing? It will only fail on
architectures that actually build the NVME driver (as the only caller).
But I'm not sure how exactly to achieve that, if you could give a bit
more detail I'd appreciate it!
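
For reference, the generic fallback in question (patch 1) is nothing more than
the sketch below, guarded like the other HAVE_ARCH_* hooks; dropping it would
leave any architecture that builds the NVMe driver with a build error until it
provides its own implementation:

#ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT
/* generic fallback: no IOMMU information available, assume the kernel page size */
static inline unsigned long dma_get_page_shift(struct device *dev)
{
	return PAGE_SHIFT;
}
#endif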

Thanks,
Nish



Re: [PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 27.10.2015 [16:56:10 +1100], Alexey Kardashevskiy wrote:
> On 10/24/2015 07:59 AM, Nishanth Aravamudan wrote:
> >When DDW (Dynamic DMA Windows) are present for a device, we have stored
> >the TCE (Translation Control Entry) size in a special device tree
> >property. Check if we have enabled DDW for the device and return the TCE
> >size from that property if present. If the property isn't present,
> >fallback to looking the value up in struct iommu_table. If we don't find
> >a iommu_table, fallback to the kernel's page size.
> >
> >Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> >---
> >  arch/powerpc/platforms/pseries/iommu.c | 36 
> > ++
> >  1 file changed, 36 insertions(+)
> >
> >diff --git a/arch/powerpc/platforms/pseries/iommu.c 
> >b/arch/powerpc/platforms/pseries/iommu.c
> >index 0946b98..1bf6471 100644
> >--- a/arch/powerpc/platforms/pseries/iommu.c
> >+++ b/arch/powerpc/platforms/pseries/iommu.c
> >@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct 
> >device *dev)
> > return dma_iommu_ops.get_required_mask(dev);
> >  }
> >
> >+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
> >+{
> >+struct iommu_table *tbl;
> >+
> >+if (!disable_ddw && dev_is_pci(dev)) {
> >+struct pci_dev *pdev = to_pci_dev(dev);
> >+struct device_node *dn;
> >+
> >+dn = pci_device_to_OF_node(pdev);
> >+
> >+/* search upwards for ibm,dma-window */
> >+for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
> >+dn = dn->parent)
> >+if (of_get_property(dn, "ibm,dma-window", NULL))
> >+break;
> >+/*
> >+ * if there is a DDW configuration, the TCE shift is stored in
> >+ * the property
> >+ */
> >+if (dn && PCI_DN(dn)) {
> >+const struct dynamic_dma_window_prop *direct64 =
> >+of_get_property(dn, DIRECT64_PROPNAME, NULL);
> 
> 
> This DIRECT64_PROPNAME property is only present under pHyp, QEMU/KVM
> does not set it as 64bit windows are dynamic there so something like
> find_existing_ddw() needs to be used here.

DIRECT64_PROPNAME is a Linux thing, not a pHyp or QEMU/KVM thing -- it's
created by the Linux DDW logic and left in the device-tree when we
successfully configure DDW.

You're right, though, that logically find_existing_ddw() would be better
to use here. I'll spin up a new version.

-Nish

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:
> On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
> >On Power, the kernel's page size can differ from the IOMMU's page size,
> >so we need to override the generic implementation, which always returns
> >the kernel's page size. Lookup the IOMMU's page size from struct
> >iommu_table, if available. Fallback to the kernel's page size,
> >otherwise.
> >
> >Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> >---
> >  arch/powerpc/include/asm/dma-mapping.h | 3 +++
> >  arch/powerpc/kernel/dma.c  | 9 +
> >  2 files changed, 12 insertions(+)
> >
> >diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> >b/arch/powerpc/include/asm/dma-mapping.h
> >index 7f522c0..c5638f4 100644
> >--- a/arch/powerpc/include/asm/dma-mapping.h
> >+++ b/arch/powerpc/include/asm/dma-mapping.h
> >@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, 
> >dma_addr_t off)
> >  #define HAVE_ARCH_DMA_SET_MASK 1
> >  extern int dma_set_mask(struct device *dev, u64 dma_mask);
> >
> >+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
> >+extern unsigned long dma_get_page_shift(struct device *dev);
> >+
> >  #include 
> >
> >  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
> >diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> >index 59503ed..e805af2 100644
> >--- a/arch/powerpc/kernel/dma.c
> >+++ b/arch/powerpc/kernel/dma.c
> >@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
> >  }
> >  EXPORT_SYMBOL(dma_set_mask);
> >
> >+unsigned long dma_get_page_shift(struct device *dev)
> >+{
> >+struct iommu_table *tbl = get_iommu_table_base(dev);
> >+if (tbl)
> >+return tbl->it_page_shift;
> 
> 
> All PCI devices have this initialized on POWER (at least, our, IBM's
> POWER) so 4K will always be returned here while in the case of
> (get_dma_ops(dev)==_direct_ops) it could actually return
> PAGE_SHIFT. Is 4K still preferred value to return here?

Right, so the logic of my series, goes like this:

a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is
PAGE_SHIFT everywhere, including Power.

b) After 2/7, the Power code will return either the IOMMU table's shift
value, if set, or PAGE_SHIFT (I guess this would be the case if
get_dma_ops(dev) == _direct_ops, as you said). That is no different
than we have now, except we can return the accurate IOMMU value if
available.

3) After 3/7, the platform can override the generic Power
get_dma_page_shift().

4) After 4/7, pseries will return the DDW value, if available, then
fallback to the IOMMU table's value. I think in the case of
get_dma_ops(dev)==_direct_ops, the only way that can happen is if we
are using DDW, right?

-Nish

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-27 Thread Nishanth Aravamudan
On 26.10.2015 [18:27:46 -0700], David Miller wrote:
> From: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> Date: Fri, 23 Oct 2015 13:54:20 -0700
> 
> > 1) add a generic dma_get_page_shift implementation that just returns
> > PAGE_SHIFT
> 
> I won't object to this patch series, but if I had implemented this I
> would have required the architectures to implement this explicitly,
> one-by-one.  I think it is less error prone and more likely to end
> up with all the architectures setting this correctly.

Well, looks like I should spin up a v4 anyways for the powerpc changes.
So, to make sure I understand your point, should I make the generic
dma_get_page_shift a compile-error kind of thing? It will only fail on
architectures that actually build the NVME driver (as the only caller).
But I'm not sure how exactly to achieve that, if you could give a bit
more detail I'd appreciate it!

Thanks,
Nish

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 28.10.2015 [12:00:20 +1100], Alexey Kardashevskiy wrote:
> On 10/28/2015 09:27 AM, Nishanth Aravamudan wrote:
> >On 27.10.2015 [17:02:16 +1100], Alexey Kardashevskiy wrote:
> >>On 10/24/2015 07:57 AM, Nishanth Aravamudan wrote:
> >>>On Power, the kernel's page size can differ from the IOMMU's page size,
> >>>so we need to override the generic implementation, which always returns
> >>>the kernel's page size. Lookup the IOMMU's page size from struct
> >>>iommu_table, if available. Fallback to the kernel's page size,
> >>>otherwise.
> >>>
> >>>Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> >>>---
> >>>  arch/powerpc/include/asm/dma-mapping.h | 3 +++
> >>>  arch/powerpc/kernel/dma.c  | 9 +
> >>>  2 files changed, 12 insertions(+)
> >>>
> >>>diff --git a/arch/powerpc/include/asm/dma-mapping.h 
> >>>b/arch/powerpc/include/asm/dma-mapping.h
> >>>index 7f522c0..c5638f4 100644
> >>>--- a/arch/powerpc/include/asm/dma-mapping.h
> >>>+++ b/arch/powerpc/include/asm/dma-mapping.h
> >>>@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, 
> >>>dma_addr_t off)
> >>>  #define HAVE_ARCH_DMA_SET_MASK 1
> >>>  extern int dma_set_mask(struct device *dev, u64 dma_mask);
> >>>
> >>>+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
> >>>+extern unsigned long dma_get_page_shift(struct device *dev);
> >>>+
> >>>  #include 
> >>>
> >>>  extern int __dma_set_mask(struct device *dev, u64 dma_mask);
> >>>diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
> >>>index 59503ed..e805af2 100644
> >>>--- a/arch/powerpc/kernel/dma.c
> >>>+++ b/arch/powerpc/kernel/dma.c
> >>>@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
> >>>  }
> >>>  EXPORT_SYMBOL(dma_set_mask);
> >>>
> >>>+unsigned long dma_get_page_shift(struct device *dev)
> >>>+{
> >>>+  struct iommu_table *tbl = get_iommu_table_base(dev);
> >>>+  if (tbl)
> >>>+  return tbl->it_page_shift;
> >>
> >>
> >>All PCI devices have this initialized on POWER (at least, our, IBM's
> >>POWER) so 4K will always be returned here while in the case of
> >>(get_dma_ops(dev)==_direct_ops) it could actually return
> >>PAGE_SHIFT. Is 4K still preferred value to return here?
> >
> >Right, so the logic of my series, goes like this:
> >
> >a) We currently are assuming DMA_PAGE_SHIFT (conceptual constant) is
> >PAGE_SHIFT everywhere, including Power.
> >
> >b) After 2/7, the Power code will return either the IOMMU table's shift
> >value, if set, or PAGE_SHIFT (I guess this would be the case if
> >get_dma_ops(dev) == _direct_ops, as you said). That is no different
> >than we have now, except we can return the accurate IOMMU value if
> >available.
> 
> If it is not available, then something went wrong and BUG_ON(!tbl ||
> !tbl->it_page_shift) make more sense here than pretending that this
> function can ever return PAGE_SHIFT. imho.

That's a good point, thanks!
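
For v4, then, the powerpc helper from 2/7 would become something like
this (just a sketch, untested, capturing your suggestion):

unsigned long dma_get_page_shift(struct device *dev)
{
        struct iommu_table *tbl = get_iommu_table_base(dev);

        /* no silent fallback -- a missing or unset table is a bug here */
        BUG_ON(!tbl || !tbl->it_page_shift);
        return tbl->it_page_shift;
}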

> >3) After 3/7, the platform can override the generic Power
> >get_dma_page_shift().
> >
> >4) After 4/7, pseries will return the DDW value, if available, then
> >fallback to the IOMMU table's value. I think in the case of
> >get_dma_ops(dev) == &dma_direct_ops, the only way that can happen is if we
> >are using DDW, right?
> 
> This is for pseries guests; for the powernv host it is a "bypass"
> mode which does 64bit direct DMA mapping and there is no additional
> window for that (i.e. DIRECT64_PROPNAME, etc).

You're right! I should update the code to handle both cases.

In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K?

Seems like this would be a different platform implementation I'd put in
for 'powernv', is that right?

My apologies for missing that, and thank you for the review!

-Nish



Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-27 Thread Nishanth Aravamudan
On 27.10.2015 [17:53:22 -0700], David Miller wrote:
> From: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> Date: Tue, 27 Oct 2015 15:20:10 -0700
> 
> > Well, looks like I should spin up a v4 anyways for the powerpc changes.
> > So, to make sure I understand your point, should I make the generic
> > dma_get_page_shift a compile-error kind of thing? It will only fail on
> > architectures that actually build the NVME driver (as the only caller).
> > But I'm not sure how exactly to achieve that, if you could give a bit
> > more detail I'd appreciate it!
> 
> Yes, I am basically suggesting to simply not provide a default at all.

For my own edification -- how does that get resolved? It seems like
linux-next would cease to compile because of my new
series. Would my patches just get kicked out of -next for introducing
that (or even via the 0-day notifications), or should I put something
into the commit message indicating it is an API introduction?
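
To make sure I follow: with no default in include/linux/dma-mapping.h at
all, every architecture that can build a caller would need to provide its
own definition, e.g. a trivial stub along these lines (hypothetical,
purely to illustrate):

static inline unsigned long dma_get_page_shift(struct device *dev)
{
        return PAGE_SHIFT;      /* no distinct IOMMU page size on this arch */
}

and any architecture that builds the NVMe driver without providing one
would simply fail to compile -- which I gather is the point.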

Sorry for the tentativeness; I have not introduced a cross-architecture
API like this before.

Thanks,
Nish

> 



Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-27 Thread Nishanth Aravamudan
On 28.10.2015 [09:57:48 +1100], Julian Calaby wrote:
> Hi Nishanth,
> 
> On Wed, Oct 28, 2015 at 9:20 AM, Nishanth Aravamudan
> <n...@linux.vnet.ibm.com> wrote:
> > On 26.10.2015 [18:27:46 -0700], David Miller wrote:
> >> From: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> >> Date: Fri, 23 Oct 2015 13:54:20 -0700
> >>
> >> > 1) add a generic dma_get_page_shift implementation that just returns
> >> > PAGE_SHIFT
> >>
> >> I won't object to this patch series, but if I had implemented this I
> >> would have required the architectures to implement this explicitly,
> >> one-by-one.  I think it is less error prone and more likely to end
> >> up with all the architectures setting this correctly.
> >
> > Well, looks like I should spin up a v4 anyways for the powerpc changes.
> > So, to make sure I understand your point, should I make the generic
> > dma_get_page_shift a compile-error kind of thing? It will only fail on
> > architectures that actually build the NVME driver (as the only caller).
> > But I'm not sure how exactly to achieve that, if you could give a bit
> > more detail I'd appreciate it!
> 
> He's suggesting that you _don't_ put a generic implementation in
> /include/linux/dma-mapping.h and instead add it to _every_
> architecture.

Ah, I see! Well, I don't know much about the DMA internals of most
architectures -- and my approach kept things functionally the same
everywhere (using PAGE_SHIFT) except:

a) Power, where I know it doesn't work as-is
and
b) sparc, where the code implied that a different value than PAGE_SHIFT
should be used.

Thanks,
Nish



Re: [PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-27 Thread Nishanth Aravamudan
On 28.10.2015 [11:20:05 +0900], Benjamin Herrenschmidt wrote:
> On Tue, 2015-10-27 at 18:54 -0700, Nishanth Aravamudan wrote:
> > 
> > In "bypass" mode, what TCE size is used? Is it guaranteed to be 4K?
> 
> None :-) The TCEs are completely bypassed. You get a N:M linear mapping
> of all memory starting at 1<<59 PCI side.

Err, duh, sorry! Ok, so in that case, DMA page shift is PAGE_SHIFT,
then?
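
If so, the powernv override for v4 could be something like the sketch
below (hypothetical and untested, assuming bypass always means the direct
linear window), wired up via ppc_md.dma_get_page_shift just like the
pseries version in 4/7:

static unsigned long dma_get_page_shift_powernv(struct device *dev)
{
        struct iommu_table *tbl;

        /* 64-bit bypass window: no TCEs, so the kernel page size applies */
        if (get_dma_ops(dev) == &dma_direct_ops)
                return PAGE_SHIFT;

        tbl = get_iommu_table_base(dev);
        return tbl ? tbl->it_page_shift : PAGE_SHIFT;
}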

-Nish



Re: [PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h

2015-10-23 Thread Nishanth Aravamudan
[Apologies for the subject line, should just have the [RFC PATCH 5/7]]

On 23.10.2015 [14:00:08 -0700], Nishanth Aravamudan wrote:
> In order to cleanly expose the desired IOMMU page shift via the new
> dma_get_page_shift API, we need to have the sparc constants available in
> a more typical location. There should be no functional impact to this
> move, but it is untested.
> 
> Signed-off-by: Nishanth Aravamudan 
> ---
>  arch/sparc/include/asm/iommu_common.h | 51 
> +++
>  arch/sparc/kernel/iommu.c |  2 +-
>  arch/sparc/kernel/iommu_common.h  | 51 
> ---
>  arch/sparc/kernel/pci_psycho.c|  2 +-
>  arch/sparc/kernel/pci_sabre.c |  2 +-
>  arch/sparc/kernel/pci_schizo.c|  2 +-
>  arch/sparc/kernel/pci_sun4v.c |  2 +-
>  arch/sparc/kernel/psycho_common.c |  2 +-
>  arch/sparc/kernel/sbus.c  |  3 +--
>  9 files changed, 58 insertions(+), 59 deletions(-)
>  create mode 100644 arch/sparc/include/asm/iommu_common.h
>  delete mode 100644 arch/sparc/kernel/iommu_common.h
> 
> diff --git a/arch/sparc/include/asm/iommu_common.h 
> b/arch/sparc/include/asm/iommu_common.h
> new file mode 100644
> index 000..b40cec2
> --- /dev/null
> +++ b/arch/sparc/include/asm/iommu_common.h
> @@ -0,0 +1,51 @@
> +/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
> + *
> + * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
> + */
> +
> +#ifndef _IOMMU_COMMON_H
> +#define _IOMMU_COMMON_H
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +/*
> + * These give mapping size of each iommu pte/tlb.
> + */
> +#define IO_PAGE_SHIFT13
> +#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT)
> +#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1))
> +#define IO_PAGE_ALIGN(addr)  ALIGN(addr, IO_PAGE_SIZE)
> +
> +#define IO_TSB_ENTRIES   (128*1024)
> +#define IO_TSB_SIZE  (IO_TSB_ENTRIES * 8)
> +
> +/*
> + * This is the hardwired shift in the iotlb tag/data parts.
> + */
> +#define IOMMU_PAGE_SHIFT 13
> +
> +#define SG_ENT_PHYS_ADDRESS(SG)  (__pa(sg_virt((SG
> +
> +static inline int is_span_boundary(unsigned long entry,
> +unsigned long shift,
> +unsigned long boundary_size,
> +struct scatterlist *outs,
> +struct scatterlist *sg)
> +{
> + unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
> + int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
> +  IO_PAGE_SIZE);
> +
> + return iommu_is_span_boundary(entry, nr, shift, boundary_size);
> +}
> +
> +#endif /* _IOMMU_COMMON_H */
> diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
> index 5320689..f9b6077 100644
> --- a/arch/sparc/kernel/iommu.c
> +++ b/arch/sparc/kernel/iommu.c
> @@ -20,8 +20,8 @@
>  #endif
>  
>  #include 
> +#include 
>  
> -#include "iommu_common.h"
>  #include "kernel.h"
>  
>  #define STC_CTXMATCH_ADDR(STC, CTX)  \
> diff --git a/arch/sparc/kernel/iommu_common.h 
> b/arch/sparc/kernel/iommu_common.h
> deleted file mode 100644
> index b40cec2..000
> --- a/arch/sparc/kernel/iommu_common.h
> +++ /dev/null
> @@ -1,51 +0,0 @@
> -/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
> - *
> - * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
> - */
> -
> -#ifndef _IOMMU_COMMON_H
> -#define _IOMMU_COMMON_H
> -
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -
> -#include 
> -
> -/*
> - * These give mapping size of each iommu pte/tlb.
> - */
> -#define IO_PAGE_SHIFT13
> -#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT)
> -#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1))
> -#define IO_PAGE_ALIGN(addr)  ALIGN(addr, IO_PAGE_SIZE)
> -
> -#define IO_TSB_ENTRIES   (128*1024)
> -#define IO_TSB_SIZE  (IO_TSB_ENTRIES * 8)
> -
> -/*
> - * This is the hardwired shift in the iotlb tag/data parts.
> - */
> -#define IOMMU_PAGE_SHIFT 13
> -
> -#define SG_ENT_PHYS_ADDRESS(SG)  (__pa(sg_virt((SG
> -
> -static inline int is_span_boundary(unsigned long entry,
> -unsigned long shift,
> -  

[PATCH 7/7 v2] drivers/nvme: default to the IOMMU page size

2015-10-23 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size, as
mentioned above, can be 4K, while the device page size can be 8K and the
kernel page size 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size.

With this patch, an NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

Signed-off-by: Nishanth Aravamudan 

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

 drivers/block/nvme-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 6f04771..5a79106 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/dma-mapping.h>
 #include 
 #include 
 #include 
@@ -1711,7 +1712,7 @@ static int nvme_configure_admin_queue(struct nvme_dev 
*dev)
u32 aqa;
u64 cap = readq(&dev->bar->cap);
struct nvme_queue *nvmeq;
-   unsigned page_shift = PAGE_SHIFT;
+   unsigned page_shift = dma_get_page_shift(dev->dev);
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 
-- 
1.9.1



[RFC PATCH 6/7] sparc/dma-mapping: override dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
On sparc, the kernel's page size differs from the IOMMU's page size, so
override the generic implementation, which always returns the kernel's
page size, and return IOMMU_PAGE_SHIFT instead.

Signed-off-by: Nishanth Aravamudan 

---
I know very little about sparc, so please correct me if this patch is
invalid.

 arch/sparc/include/asm/dma-mapping.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/sparc/include/asm/dma-mapping.h 
b/arch/sparc/include/asm/dma-mapping.h
index a21da59..76ba470 100644
--- a/arch/sparc/include/asm/dma-mapping.h
+++ b/arch/sparc/include/asm/dma-mapping.h
@@ -52,6 +52,14 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
return -EINVAL;
 }
 
+/* for IOMMU_PAGE_SHIFT */
+#include <asm/iommu_common.h>
+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+   return IOMMU_PAGE_SHIFT;
+}
+
 #include 
 
 #endif
-- 
1.9.1



[PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h

2015-10-23 Thread Nishanth Aravamudan
In order to cleanly expose the desired IOMMU page shift via the new
dma_get_page_shift API, we need to have the sparc constants available in
a more typical location. There should be no functional impact to this
move, but it is untested.

Signed-off-by: Nishanth Aravamudan 
---
 arch/sparc/include/asm/iommu_common.h | 51 +++
 arch/sparc/kernel/iommu.c |  2 +-
 arch/sparc/kernel/iommu_common.h  | 51 ---
 arch/sparc/kernel/pci_psycho.c|  2 +-
 arch/sparc/kernel/pci_sabre.c |  2 +-
 arch/sparc/kernel/pci_schizo.c|  2 +-
 arch/sparc/kernel/pci_sun4v.c |  2 +-
 arch/sparc/kernel/psycho_common.c |  2 +-
 arch/sparc/kernel/sbus.c  |  3 +--
 9 files changed, 58 insertions(+), 59 deletions(-)
 create mode 100644 arch/sparc/include/asm/iommu_common.h
 delete mode 100644 arch/sparc/kernel/iommu_common.h

diff --git a/arch/sparc/include/asm/iommu_common.h 
b/arch/sparc/include/asm/iommu_common.h
new file mode 100644
index 000..b40cec2
--- /dev/null
+++ b/arch/sparc/include/asm/iommu_common.h
@@ -0,0 +1,51 @@
+/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
+ *
+ * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
+ */
+
+#ifndef _IOMMU_COMMON_H
+#define _IOMMU_COMMON_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+/*
+ * These give mapping size of each iommu pte/tlb.
+ */
+#define IO_PAGE_SHIFT  13
+#define IO_PAGE_SIZE   (1UL << IO_PAGE_SHIFT)
+#define IO_PAGE_MASK   (~(IO_PAGE_SIZE-1))
+#define IO_PAGE_ALIGN(addr)ALIGN(addr, IO_PAGE_SIZE)
+
+#define IO_TSB_ENTRIES (128*1024)
+#define IO_TSB_SIZE(IO_TSB_ENTRIES * 8)
+
+/*
+ * This is the hardwired shift in the iotlb tag/data parts.
+ */
+#define IOMMU_PAGE_SHIFT   13
+
+#define SG_ENT_PHYS_ADDRESS(SG)    (__pa(sg_virt((SG))))
+
+static inline int is_span_boundary(unsigned long entry,
+  unsigned long shift,
+  unsigned long boundary_size,
+  struct scatterlist *outs,
+  struct scatterlist *sg)
+{
+   unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
+   int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
+IO_PAGE_SIZE);
+
+   return iommu_is_span_boundary(entry, nr, shift, boundary_size);
+}
+
+#endif /* _IOMMU_COMMON_H */
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index 5320689..f9b6077 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -20,8 +20,8 @@
 #endif
 
 #include 
+#include 
 
-#include "iommu_common.h"
 #include "kernel.h"
 
 #define STC_CTXMATCH_ADDR(STC, CTX)\
diff --git a/arch/sparc/kernel/iommu_common.h b/arch/sparc/kernel/iommu_common.h
deleted file mode 100644
index b40cec2..000
--- a/arch/sparc/kernel/iommu_common.h
+++ /dev/null
@@ -1,51 +0,0 @@
-/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
- *
- * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
- */
-
-#ifndef _IOMMU_COMMON_H
-#define _IOMMU_COMMON_H
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-
-/*
- * These give mapping size of each iommu pte/tlb.
- */
-#define IO_PAGE_SHIFT  13
-#define IO_PAGE_SIZE   (1UL << IO_PAGE_SHIFT)
-#define IO_PAGE_MASK   (~(IO_PAGE_SIZE-1))
-#define IO_PAGE_ALIGN(addr)ALIGN(addr, IO_PAGE_SIZE)
-
-#define IO_TSB_ENTRIES (128*1024)
-#define IO_TSB_SIZE(IO_TSB_ENTRIES * 8)
-
-/*
- * This is the hardwired shift in the iotlb tag/data parts.
- */
-#define IOMMU_PAGE_SHIFT   13
-
-#define SG_ENT_PHYS_ADDRESS(SG)    (__pa(sg_virt((SG))))
-
-static inline int is_span_boundary(unsigned long entry,
-  unsigned long shift,
-  unsigned long boundary_size,
-  struct scatterlist *outs,
-  struct scatterlist *sg)
-{
-   unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
-   int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
-IO_PAGE_SIZE);
-
-   return iommu_is_span_boundary(entry, nr, shift, boundary_size);
-}
-
-#endif /* _IOMMU_COMMON_H */
diff --git a/arch/sparc/kernel/pci_psycho.c b/arch/sparc/kernel/pci_psycho.c
index 7dce27b..332fd6f 100644
--- a/arch/sparc/kernel/pci_psycho.c
+++ b/arch/sparc/kernel/pci_psycho.c
@@ -15,13 +15,13 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 
 #include "pci_impl.h"
-#include "iommu_common.h"
 #include

[PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
When DDW (Dynamic DMA Windows) are present for a device, we have stored
the TCE (Translation Control Entry) size in a special device tree
property. Check if we have enabled DDW for the device and return the TCE
size from that property if present. If the property isn't present, fall
back to looking the value up in struct iommu_table. If we don't find an
iommu_table, fall back to the kernel's page size.

Signed-off-by: Nishanth Aravamudan 
---
 arch/powerpc/platforms/pseries/iommu.c | 36 ++
 1 file changed, 36 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 0946b98..1bf6471 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device 
*dev)
return dma_iommu_ops.get_required_mask(dev);
 }
 
+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
+{
+   struct iommu_table *tbl;
+
+   if (!disable_ddw && dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct device_node *dn;
+
+   dn = pci_device_to_OF_node(pdev);
+
+   /* search upwards for ibm,dma-window */
+   for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
+   dn = dn->parent)
+   if (of_get_property(dn, "ibm,dma-window", NULL))
+   break;
+   /*
+* if there is a DDW configuration, the TCE shift is stored in
+* the property
+*/
+   if (dn && PCI_DN(dn)) {
+   const struct dynamic_dma_window_prop *direct64 =
+   of_get_property(dn, DIRECT64_PROPNAME, NULL);
+   if (direct64)
+   return be32_to_cpu(direct64->tce_shift);
+   }
+   }
+
+   tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+
+   return PAGE_SHIFT;
+}
+
 #else  /* CONFIG_PCI */
 #define pci_dma_bus_setup_pSeries  NULL
 #define pci_dma_dev_setup_pSeries  NULL
@@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device 
*dev)
 #define pci_dma_dev_setup_pSeriesLPNULL
 #define dma_set_mask_pSeriesLP NULL
 #define dma_get_required_mask_pSeriesLPNULL
+#define dma_get_page_shift_pSeriesLP   NULL
 #endif /* !CONFIG_PCI */
 
 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
@@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void)
pseries_pci_controller_ops.dma_dev_setup = 
pci_dma_dev_setup_pSeriesLP;
ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
+   ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP;
} else {
pseries_pci_controller_ops.dma_bus_setup = 
pci_dma_bus_setup_pSeries;
pseries_pci_controller_ops.dma_dev_setup = 
pci_dma_dev_setup_pSeries;
-- 
1.9.1



[PATCH 3/7 v2] powerpc/dma: implement per-platform dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
The IOMMU page size is not always stored in struct iommu_table on Power.
Specifically, if a device is configured for DDW (Dynamic DMA Windows,
a.k.a. 64-bit direct DMA), the TCE (Translation Control Entry) size in
use is stored in a special device property created at run-time by the DDW
configuration code. DDW is a pseries-specific feature, so allow
platforms to override the implementation of dma_get_page_shift if
desired.

Signed-off-by: Nishanth Aravamudan 
---
 arch/powerpc/include/asm/machdep.h | 3 ++-
 arch/powerpc/kernel/dma.c  | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index 3f191f5..9dd03cf 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -83,9 +83,10 @@ struct machdep_calls {
 #endif
 #endif /* CONFIG_PPC64 */
 
-   /* Platform set_dma_mask and dma_get_required_mask overrides */
+   /* Platform overrides */
int (*dma_set_mask)(struct device *dev, u64 dma_mask);
u64 (*dma_get_required_mask)(struct device *dev);
+   unsigned long   (*dma_get_page_shift)(struct device *dev);
 
int (*probe)(void);
void(*setup_arch)(void); /* Optional, may be NULL */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e805af2..c363896 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -338,6 +338,8 @@ EXPORT_SYMBOL(dma_set_mask);
 unsigned long dma_get_page_shift(struct device *dev)
 {
struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (ppc_md.dma_get_page_shift)
+   return ppc_md.dma_get_page_shift(dev);
if (tbl)
return tbl->it_page_shift;
return PAGE_SHIFT;
-- 
1.9.1



Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-23 Thread Nishanth Aravamudan
[Sorry, subject should have been 0/7!]

On 23.10.2015 [13:54:20 -0700], Nishanth Aravamudan wrote:
> We received a bug report recently when DDW (64-bit direct DMA on Power)
> is not enabled for NVMe devices. In that case, we fall back to 32-bit
> DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
> Entries).
> 
> The NVMe device driver, though, assumes that the DMA alignment for the
> PRP entries will match the device's page size, and that the DMA aligment
> matches the kernel's page aligment. On Power, the the IOMMU page size,
> as mentioned above, can be 4K, while the device can have a page size of
> 8K, while the kernel has a page size of 64K. This eventually trips the
> BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
> of 4K but not 8K (e.g., 0xF000).
> 
> In this particular case, and generally, we want to use the IOMMU's page
> size for the default device page size, rather than the kernel's page
> size.
> 
> This series consists of five patches:
> 
> 1) add a generic dma_get_page_shift implementation that just returns
> PAGE_SHIFT
> 2) override the generic implementation on Power to use the IOMMU table's
> page shift if available
> 3) allow further specific overriding on power with machdep platform
> overrides
> 4) use the machdep override on pseries, as the DDW code puts the TCE
> shift in a special property and there is no IOMMU table available
> 5) move some sparc code around to make IOMMU_PAGE_SHIFT available in
> include/asm
> 6) override the generic implementation on sparce to use IOMMU_PAGE_SHIFT
> 7) leverage the new API in the NVMe driver
> 
> With these patches, a NVMe device survives our internal hardware
> exerciser; the kernel BUGs within a few seconds without the patch.
> 
>  arch/powerpc/include/asm/dma-mapping.h |  3 +++
>  arch/powerpc/include/asm/machdep.h |  3 ++-
>  arch/powerpc/kernel/dma.c  | 11 +++
>  arch/powerpc/platforms/pseries/iommu.c | 36 
> 
>  arch/sparc/include/asm/dma-mapping.h   |  8 
>  arch/sparc/include/asm/iommu_common.h  | 51 
> +++
>  arch/sparc/kernel/iommu.c  |  2 +-
>  arch/sparc/kernel/iommu_common.h   | 51 
> ---
>  arch/sparc/kernel/pci_psycho.c |  2 +-
>  arch/sparc/kernel/pci_sabre.c  |  2 +-
>  arch/sparc/kernel/pci_schizo.c |  2 +-
>  arch/sparc/kernel/pci_sun4v.c  |  2 +-
>  arch/sparc/kernel/psycho_common.c  |  2 +-
>  arch/sparc/kernel/sbus.c   |  3 +--
>  drivers/block/nvme-core.c  |  3 ++-
>  include/linux/dma-mapping.h|  7 +++
>  16 files changed, 127 insertions(+), 61 deletions(-)
> 
> v1 -> v2:
>   Based upon feedback from Christoph Hellwig, rather than using an
>   arch-specific hack, expose the DMA page shift via a generic DMA API and
>   override it on Power as needed.
> v2 -> v3:
>   Based upon feedback from Christoph Hellwig, put the generic
>   implementation in include/linux/dma-mapping.h, since not all archs use
>   include/asm-generic/dma-mapping-common.h.
>   Add sparc implementation, as that arch seems to have a different IOMMU
>   page size.



[PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
On Power, the kernel's page size can differ from the IOMMU's page size,
so we need to override the generic implementation, which always returns
the kernel's page size. Look up the IOMMU's page size from struct
iommu_table, if available. Fall back to the kernel's page size
otherwise.

Signed-off-by: Nishanth Aravamudan 
---
 arch/powerpc/include/asm/dma-mapping.h | 3 +++
 arch/powerpc/kernel/dma.c  | 9 +
 2 files changed, 12 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h 
b/arch/powerpc/include/asm/dma-mapping.h
index 7f522c0..c5638f4 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, 
dma_addr_t off)
 #define HAVE_ARCH_DMA_SET_MASK 1
 extern int dma_set_mask(struct device *dev, u64 dma_mask);
 
+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+extern unsigned long dma_get_page_shift(struct device *dev);
+
 #include 
 
 extern int __dma_set_mask(struct device *dev, u64 dma_mask);
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 59503ed..e805af2 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 }
 EXPORT_SYMBOL(dma_set_mask);
 
+unsigned long dma_get_page_shift(struct device *dev)
+{
+   struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+   return PAGE_SHIFT;
+}
+EXPORT_SYMBOL(dma_get_page_shift);
+
 u64 __dma_get_required_mask(struct device *dev)
 {
struct dma_map_ops *dma_ops = get_dma_ops(dev);
-- 
1.9.1



[PATCH 1/7 v3] dma-mapping: add generic dma_get_page_shift API

2015-10-23 Thread Nishanth Aravamudan
Drivers like NVMe need to be able to determine the page size used for
DMA transfers. Add a new API that defaults to returning PAGE_SHIFT on all
architectures.

Signed-off-by: Nishanth Aravamudan 

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.
v2 -> v3:
  Based upon feedback from Christoph Hellwig that not all architectures
  have moved to dma-mapping-common.h, so move the #ifdef and default to
  linux/dma-mapping.h.
---
 include/linux/dma-mapping.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index ac07ff0..7eaba8d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -95,6 +95,13 @@ static inline u64 dma_get_mask(struct device *dev)
return DMA_BIT_MASK(32);
 }
 
+#ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+   return PAGE_SHIFT;
+}
+#endif
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 #else
-- 
1.9.1



[PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-23 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size, as
mentioned above, can be 4K, while the device page size can be 8K and the
kernel page size 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).
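
To make the failure mode concrete (illustrative only, not the driver's
exact check): with 4K TCEs and an 8K device page size, an IOMMU-mapped
segment can legitimately end on a 4K boundary that is not 8K-aligned,
e.g.:

        unsigned dev_page_size = 8192;  /* 8K device page */
        unsigned dma_len = 0xF000;      /* 60K: a multiple of 4K */

        /* 0xF000 % 0x2000 == 0x1000, so a sanity check of the form
         * BUG_ON(dma_len % dev_page_size) in nvme_setup_prps() fires. */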

In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.

This series consists of seven patches:

1) add a generic dma_get_page_shift implementation that just returns
PAGE_SHIFT
2) override the generic implementation on Power to use the IOMMU table's
page shift if available
3) allow further specific overriding on power with machdep platform
overrides
4) use the machdep override on pseries, as the DDW code puts the TCE
shift in a special property and there is no IOMMU table available
5) move some sparc code around to make IOMMU_PAGE_SHIFT available in
include/asm
6) override the generic implementation on sparc to use IOMMU_PAGE_SHIFT
7) leverage the new API in the NVMe driver

With these patches, an NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patches.

 arch/powerpc/include/asm/dma-mapping.h |  3 +++
 arch/powerpc/include/asm/machdep.h |  3 ++-
 arch/powerpc/kernel/dma.c  | 11 +++
 arch/powerpc/platforms/pseries/iommu.c | 36 

 arch/sparc/include/asm/dma-mapping.h   |  8 
 arch/sparc/include/asm/iommu_common.h  | 51 
+++
 arch/sparc/kernel/iommu.c  |  2 +-
 arch/sparc/kernel/iommu_common.h   | 51 
---
 arch/sparc/kernel/pci_psycho.c |  2 +-
 arch/sparc/kernel/pci_sabre.c  |  2 +-
 arch/sparc/kernel/pci_schizo.c |  2 +-
 arch/sparc/kernel/pci_sun4v.c  |  2 +-
 arch/sparc/kernel/psycho_common.c  |  2 +-
 arch/sparc/kernel/sbus.c   |  3 +--
 drivers/block/nvme-core.c  |  3 ++-
 include/linux/dma-mapping.h|  7 +++
 16 files changed, 127 insertions(+), 61 deletions(-)

v1 -> v2:
  Based upon feedback from Christoph Hellwig, rather than using an
  arch-specific hack, expose the DMA page shift via a generic DMA API and
  override it on Power as needed.
v2 -> v3:
  Based upon feedback from Christoph Hellwig, put the generic
  implementation in include/linux/dma-mapping.h, since not all archs use
  include/asm-generic/dma-mapping-common.h.
  Add sparc implementation, as that arch seems to have a different IOMMU
  page size.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-23 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.

This series consists of five patches:

1) add a generic dma_get_page_shift implementation that just returns
PAGE_SHIFT
2) override the generic implementation on Power to use the IOMMU table's
page shift if available
3) allow further specific overriding on power with machdep platform
overrides
4) use the machdep override on pseries, as the DDW code puts the TCE
shift in a special property and there is no IOMMU table available
5) move some sparc code around to make IOMMU_PAGE_SHIFT available in
include/asm
6) override the generic implementation on sparce to use IOMMU_PAGE_SHIFT
7) leverage the new API in the NVMe driver

With these patches, a NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

 arch/powerpc/include/asm/dma-mapping.h |  3 +++
 arch/powerpc/include/asm/machdep.h |  3 ++-
 arch/powerpc/kernel/dma.c  | 11 +++
 arch/powerpc/platforms/pseries/iommu.c | 36 

 arch/sparc/include/asm/dma-mapping.h   |  8 
 arch/sparc/include/asm/iommu_common.h  | 51 
+++
 arch/sparc/kernel/iommu.c  |  2 +-
 arch/sparc/kernel/iommu_common.h   | 51 
---
 arch/sparc/kernel/pci_psycho.c |  2 +-
 arch/sparc/kernel/pci_sabre.c  |  2 +-
 arch/sparc/kernel/pci_schizo.c |  2 +-
 arch/sparc/kernel/pci_sun4v.c  |  2 +-
 arch/sparc/kernel/psycho_common.c  |  2 +-
 arch/sparc/kernel/sbus.c   |  3 +--
 drivers/block/nvme-core.c  |  3 ++-
 include/linux/dma-mapping.h|  7 +++
 16 files changed, 127 insertions(+), 61 deletions(-)

v1 -> v2:
  Based upon feedback from Christoph Hellwig, rather than using an
  arch-specific hack, expose the DMA page shift via a generic DMA API and
  override it on Power as needed.
v2 -> v3:
  Based upon feedback from Christoph Hellwig, put the generic
  implementation in include/linux/dma-mapping.h, since not all archs use
  include/asm-generic/dma-mapping-common.h.
  Add sparc implementation, as that arch seems to have a different IOMMU
  page size.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h

2015-10-23 Thread Nishanth Aravamudan
In order to cleanly expose the desired IOMMU page shift via the new
dma_get_page_shift API, we need to have the sparc constants available in
a more typical location. There should be no functional impact to this
move, but it is untested.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
 arch/sparc/include/asm/iommu_common.h | 51 +++
 arch/sparc/kernel/iommu.c |  2 +-
 arch/sparc/kernel/iommu_common.h  | 51 ---
 arch/sparc/kernel/pci_psycho.c|  2 +-
 arch/sparc/kernel/pci_sabre.c |  2 +-
 arch/sparc/kernel/pci_schizo.c|  2 +-
 arch/sparc/kernel/pci_sun4v.c |  2 +-
 arch/sparc/kernel/psycho_common.c |  2 +-
 arch/sparc/kernel/sbus.c  |  3 +--
 9 files changed, 58 insertions(+), 59 deletions(-)
 create mode 100644 arch/sparc/include/asm/iommu_common.h
 delete mode 100644 arch/sparc/kernel/iommu_common.h

diff --git a/arch/sparc/include/asm/iommu_common.h 
b/arch/sparc/include/asm/iommu_common.h
new file mode 100644
index 000..b40cec2
--- /dev/null
+++ b/arch/sparc/include/asm/iommu_common.h
@@ -0,0 +1,51 @@
+/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
+ *
+ * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
+ */
+
+#ifndef _IOMMU_COMMON_H
+#define _IOMMU_COMMON_H
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+/*
+ * These give mapping size of each iommu pte/tlb.
+ */
+#define IO_PAGE_SHIFT  13
+#define IO_PAGE_SIZE   (1UL << IO_PAGE_SHIFT)
+#define IO_PAGE_MASK   (~(IO_PAGE_SIZE-1))
+#define IO_PAGE_ALIGN(addr)ALIGN(addr, IO_PAGE_SIZE)
+
+#define IO_TSB_ENTRIES (128*1024)
+#define IO_TSB_SIZE(IO_TSB_ENTRIES * 8)
+
+/*
+ * This is the hardwired shift in the iotlb tag/data parts.
+ */
+#define IOMMU_PAGE_SHIFT   13
+
+#define SG_ENT_PHYS_ADDRESS(SG)(__pa(sg_virt((SG
+
+static inline int is_span_boundary(unsigned long entry,
+  unsigned long shift,
+  unsigned long boundary_size,
+  struct scatterlist *outs,
+  struct scatterlist *sg)
+{
+   unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
+   int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
+IO_PAGE_SIZE);
+
+   return iommu_is_span_boundary(entry, nr, shift, boundary_size);
+}
+
+#endif /* _IOMMU_COMMON_H */
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index 5320689..f9b6077 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -20,8 +20,8 @@
 #endif
 
 #include 
+#include 
 
-#include "iommu_common.h"
 #include "kernel.h"
 
 #define STC_CTXMATCH_ADDR(STC, CTX)\
diff --git a/arch/sparc/kernel/iommu_common.h b/arch/sparc/kernel/iommu_common.h
deleted file mode 100644
index b40cec2..000
--- a/arch/sparc/kernel/iommu_common.h
+++ /dev/null
@@ -1,51 +0,0 @@
-/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
- *
- * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
- */
-
-#ifndef _IOMMU_COMMON_H
-#define _IOMMU_COMMON_H
-
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include 
-
-/*
- * These give mapping size of each iommu pte/tlb.
- */
-#define IO_PAGE_SHIFT  13
-#define IO_PAGE_SIZE   (1UL << IO_PAGE_SHIFT)
-#define IO_PAGE_MASK   (~(IO_PAGE_SIZE-1))
-#define IO_PAGE_ALIGN(addr)ALIGN(addr, IO_PAGE_SIZE)
-
-#define IO_TSB_ENTRIES (128*1024)
-#define IO_TSB_SIZE(IO_TSB_ENTRIES * 8)
-
-/*
- * This is the hardwired shift in the iotlb tag/data parts.
- */
-#define IOMMU_PAGE_SHIFT   13
-
-#define SG_ENT_PHYS_ADDRESS(SG)(__pa(sg_virt((SG
-
-static inline int is_span_boundary(unsigned long entry,
-  unsigned long shift,
-  unsigned long boundary_size,
-  struct scatterlist *outs,
-  struct scatterlist *sg)
-{
-   unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
-   int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
-IO_PAGE_SIZE);
-
-   return iommu_is_span_boundary(entry, nr, shift, boundary_size);
-}
-
-#endif /* _IOMMU_COMMON_H */
diff --git a/arch/sparc/kernel/pci_psycho.c b/arch/sparc/kernel/pci_psycho.c
index 7dce27b..332fd6f 100644
--- a/arch/sparc/kernel/pci_psycho.c
+++ b/arch/sparc/kernel/pci_psycho.c
@@ -15,13 +15,13 @@
 #include 
 
 #include 
+#include 
 #include 
 #include 
 #include 
 #include 
 
 #include "pci_impl.h"
-#include "i

[RFC PATCH 6/7] sparc/dma-mapping: override dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
On sparc, the kernel's page size differs from the IOMMU's page size, so
override the generic implementation, which always returns the kernel's
page size, and return IOMMU_PAGE_SHIFT instead.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
I know very little about sparc, so please correct me if this patch is
invalid.

 arch/sparc/include/asm/dma-mapping.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/sparc/include/asm/dma-mapping.h 
b/arch/sparc/include/asm/dma-mapping.h
index a21da59..76ba470 100644
--- a/arch/sparc/include/asm/dma-mapping.h
+++ b/arch/sparc/include/asm/dma-mapping.h
@@ -52,6 +52,14 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
return -EINVAL;
 }
 
+/* for IOMMU_PAGE_SHIFT */
+#include 
+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+   return IOMMU_PAGE_SHIFT;
+}
+
 #include 
 
 #endif
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/7 v3] dma-mapping: add generic dma_get_page_shift API

2015-10-23 Thread Nishanth Aravamudan
Drivers like NVMe need to be able to determine the page size used for
DMA transfers. Add a new API that defaults to return PAGE_SHIFT on all
architectures.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.
v2 -> v3:
  Based upon feedback from Christoph Hellwig that not all architectures
  have moved to dma-mapping-common.h, so move the #ifdef and default to
  linux/dma-mapping.h.
---
 include/linux/dma-mapping.h | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index ac07ff0..7eaba8d 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -95,6 +95,13 @@ static inline u64 dma_get_mask(struct device *dev)
return DMA_BIT_MASK(32);
 }
 
+#ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+   return PAGE_SHIFT;
+}
+#endif
+
 #ifdef CONFIG_ARCH_HAS_DMA_SET_COHERENT_MASK
 int dma_set_coherent_mask(struct device *dev, u64 mask);
 #else
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 5/7] [RFC PATCH 5/7] sparc: rename kernel/iommu_common.h -> include/asm/iommu_common.h

2015-10-23 Thread Nishanth Aravamudan
[Apologies for the subject line, should just have the [RFC PATCH 5/7]]

On 23.10.2015 [14:00:08 -0700], Nishanth Aravamudan wrote:
> In order to cleanly expose the desired IOMMU page shift via the new
> dma_get_page_shift API, we need to have the sparc constants available in
> a more typical location. There should be no functional impact to this
> move, but it is untested.
> 
> Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
> ---
>  arch/sparc/include/asm/iommu_common.h | 51 
> +++
>  arch/sparc/kernel/iommu.c |  2 +-
>  arch/sparc/kernel/iommu_common.h  | 51 
> ---
>  arch/sparc/kernel/pci_psycho.c|  2 +-
>  arch/sparc/kernel/pci_sabre.c |  2 +-
>  arch/sparc/kernel/pci_schizo.c|  2 +-
>  arch/sparc/kernel/pci_sun4v.c |  2 +-
>  arch/sparc/kernel/psycho_common.c |  2 +-
>  arch/sparc/kernel/sbus.c  |  3 +--
>  9 files changed, 58 insertions(+), 59 deletions(-)
>  create mode 100644 arch/sparc/include/asm/iommu_common.h
>  delete mode 100644 arch/sparc/kernel/iommu_common.h
> 
> diff --git a/arch/sparc/include/asm/iommu_common.h 
> b/arch/sparc/include/asm/iommu_common.h
> new file mode 100644
> index 000..b40cec2
> --- /dev/null
> +++ b/arch/sparc/include/asm/iommu_common.h
> @@ -0,0 +1,51 @@
> +/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
> + *
> + * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
> + */
> +
> +#ifndef _IOMMU_COMMON_H
> +#define _IOMMU_COMMON_H
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +
> +/*
> + * These give mapping size of each iommu pte/tlb.
> + */
> +#define IO_PAGE_SHIFT13
> +#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT)
> +#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1))
> +#define IO_PAGE_ALIGN(addr)  ALIGN(addr, IO_PAGE_SIZE)
> +
> +#define IO_TSB_ENTRIES   (128*1024)
> +#define IO_TSB_SIZE  (IO_TSB_ENTRIES * 8)
> +
> +/*
> + * This is the hardwired shift in the iotlb tag/data parts.
> + */
> +#define IOMMU_PAGE_SHIFT 13
> +
> +#define SG_ENT_PHYS_ADDRESS(SG)  (__pa(sg_virt((SG
> +
> +static inline int is_span_boundary(unsigned long entry,
> +unsigned long shift,
> +unsigned long boundary_size,
> +struct scatterlist *outs,
> +struct scatterlist *sg)
> +{
> + unsigned long paddr = SG_ENT_PHYS_ADDRESS(outs);
> + int nr = iommu_num_pages(paddr, outs->dma_length + sg->length,
> +  IO_PAGE_SIZE);
> +
> + return iommu_is_span_boundary(entry, nr, shift, boundary_size);
> +}
> +
> +#endif /* _IOMMU_COMMON_H */
> diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
> index 5320689..f9b6077 100644
> --- a/arch/sparc/kernel/iommu.c
> +++ b/arch/sparc/kernel/iommu.c
> @@ -20,8 +20,8 @@
>  #endif
>  
>  #include 
> +#include 
>  
> -#include "iommu_common.h"
>  #include "kernel.h"
>  
>  #define STC_CTXMATCH_ADDR(STC, CTX)  \
> diff --git a/arch/sparc/kernel/iommu_common.h 
> b/arch/sparc/kernel/iommu_common.h
> deleted file mode 100644
> index b40cec2..000
> --- a/arch/sparc/kernel/iommu_common.h
> +++ /dev/null
> @@ -1,51 +0,0 @@
> -/* iommu_common.h: UltraSparc SBUS/PCI common iommu declarations.
> - *
> - * Copyright (C) 1999, 2008 David S. Miller (da...@davemloft.net)
> - */
> -
> -#ifndef _IOMMU_COMMON_H
> -#define _IOMMU_COMMON_H
> -
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -#include 
> -
> -#include 
> -
> -/*
> - * These give mapping size of each iommu pte/tlb.
> - */
> -#define IO_PAGE_SHIFT13
> -#define IO_PAGE_SIZE (1UL << IO_PAGE_SHIFT)
> -#define IO_PAGE_MASK (~(IO_PAGE_SIZE-1))
> -#define IO_PAGE_ALIGN(addr)  ALIGN(addr, IO_PAGE_SIZE)
> -
> -#define IO_TSB_ENTRIES   (128*1024)
> -#define IO_TSB_SIZE  (IO_TSB_ENTRIES * 8)
> -
> -/*
> - * This is the hardwired shift in the iotlb tag/data parts.
> - */
> -#define IOMMU_PAGE_SHIFT 13
> -
> -#define SG_ENT_PHYS_ADDRESS(SG)  (__pa(sg_virt((SG
> -
> -static inline int is_span_boundary(unsigned long entry,
> -unsigned long sh

[PATCH 7/7 v2] drivers/nvme: default to the IOMMU page size

2015-10-23 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case of page sizes, we clearly want to use the
IOMMU's page size in the driver. And generally, the NVMe driver in this
function should be using the IOMMU's page size for the default device
page size, rather than the kernel's page size.

With this patch, a NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

 drivers/block/nvme-core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 6f04771..5a79106 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1711,7 +1712,7 @@ static int nvme_configure_admin_queue(struct nvme_dev 
*dev)
u32 aqa;
u64 cap = readq(>bar->cap);
struct nvme_queue *nvmeq;
-   unsigned page_shift = PAGE_SHIFT;
+   unsigned page_shift = dma_get_page_shift(dev->dev);
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/7 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
On Power, the kernel's page size can differ from the IOMMU's page size,
so we need to override the generic implementation, which always returns
the kernel's page size. Lookup the IOMMU's page size from struct
iommu_table, if available. Fallback to the kernel's page size,
otherwise.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/dma-mapping.h | 3 +++
 arch/powerpc/kernel/dma.c  | 9 +
 2 files changed, 12 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h 
b/arch/powerpc/include/asm/dma-mapping.h
index 7f522c0..c5638f4 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, 
dma_addr_t off)
 #define HAVE_ARCH_DMA_SET_MASK 1
 extern int dma_set_mask(struct device *dev, u64 dma_mask);
 
+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+extern unsigned long dma_get_page_shift(struct device *dev);
+
 #include 
 
 extern int __dma_set_mask(struct device *dev, u64 dma_mask);
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 59503ed..e805af2 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 }
 EXPORT_SYMBOL(dma_set_mask);
 
+unsigned long dma_get_page_shift(struct device *dev)
+{
+   struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+   return PAGE_SHIFT;
+}
+EXPORT_SYMBOL(dma_get_page_shift);
+
 u64 __dma_get_required_mask(struct device *dev)
 {
struct dma_map_ops *dma_ops = get_dma_ops(dev);
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 0/5 v3] Fix NVMe driver support on Power with 32-bit DMA

2015-10-23 Thread Nishanth Aravamudan
[Sorry, subject should have been 0/7!]

On 23.10.2015 [13:54:20 -0700], Nishanth Aravamudan wrote:
> We received a bug report recently when DDW (64-bit direct DMA on Power)
> is not enabled for NVMe devices. In that case, we fall back to 32-bit
> DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
> Entries).
> 
> The NVMe device driver, though, assumes that the DMA alignment for the
> PRP entries will match the device's page size, and that the DMA aligment
> matches the kernel's page aligment. On Power, the the IOMMU page size,
> as mentioned above, can be 4K, while the device can have a page size of
> 8K, while the kernel has a page size of 64K. This eventually trips the
> BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
> of 4K but not 8K (e.g., 0xF000).
> 
> In this particular case, and generally, we want to use the IOMMU's page
> size for the default device page size, rather than the kernel's page
> size.
> 
> This series consists of five patches:
> 
> 1) add a generic dma_get_page_shift implementation that just returns
> PAGE_SHIFT
> 2) override the generic implementation on Power to use the IOMMU table's
> page shift if available
> 3) allow further specific overriding on power with machdep platform
> overrides
> 4) use the machdep override on pseries, as the DDW code puts the TCE
> shift in a special property and there is no IOMMU table available
> 5) move some sparc code around to make IOMMU_PAGE_SHIFT available in
> include/asm
> 6) override the generic implementation on sparce to use IOMMU_PAGE_SHIFT
> 7) leverage the new API in the NVMe driver
> 
> With these patches, a NVMe device survives our internal hardware
> exerciser; the kernel BUGs within a few seconds without the patch.
> 
>  arch/powerpc/include/asm/dma-mapping.h |  3 +++
>  arch/powerpc/include/asm/machdep.h |  3 ++-
>  arch/powerpc/kernel/dma.c  | 11 +++
>  arch/powerpc/platforms/pseries/iommu.c | 36 
> 
>  arch/sparc/include/asm/dma-mapping.h   |  8 
>  arch/sparc/include/asm/iommu_common.h  | 51 
> +++
>  arch/sparc/kernel/iommu.c  |  2 +-
>  arch/sparc/kernel/iommu_common.h   | 51 
> ---
>  arch/sparc/kernel/pci_psycho.c |  2 +-
>  arch/sparc/kernel/pci_sabre.c  |  2 +-
>  arch/sparc/kernel/pci_schizo.c |  2 +-
>  arch/sparc/kernel/pci_sun4v.c  |  2 +-
>  arch/sparc/kernel/psycho_common.c  |  2 +-
>  arch/sparc/kernel/sbus.c   |  3 +--
>  drivers/block/nvme-core.c  |  3 ++-
>  include/linux/dma-mapping.h|  7 +++
>  16 files changed, 127 insertions(+), 61 deletions(-)
> 
> v1 -> v2:
>   Based upon feedback from Christoph Hellwig, rather than using an
>   arch-specific hack, expose the DMA page shift via a generic DMA API and
>   override it on Power as needed.
> v2 -> v3:
>   Based upon feedback from Christoph Hellwig, put the generic
>   implementation in include/linux/dma-mapping.h, since not all archs use
>   include/asm-generic/dma-mapping-common.h.
>   Add sparc implementation, as that arch seems to have a different IOMMU
>   page size.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/7 v2] pseries/iommu: implement DDW-aware dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
When DDW (Dynamic DMA Windows) are present for a device, we have stored
the TCE (Translation Control Entry) size in a special device tree
property. Check if we have enabled DDW for the device and return the TCE
size from that property if present. If the property isn't present,
fallback to looking the value up in struct iommu_table. If we don't find
a iommu_table, fallback to the kernel's page size.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
 arch/powerpc/platforms/pseries/iommu.c | 36 ++
 1 file changed, 36 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/iommu.c 
b/arch/powerpc/platforms/pseries/iommu.c
index 0946b98..1bf6471 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device 
*dev)
return dma_iommu_ops.get_required_mask(dev);
 }
 
+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
+{
+   struct iommu_table *tbl;
+
+   if (!disable_ddw && dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct device_node *dn;
+
+   dn = pci_device_to_OF_node(pdev);
+
+   /* search upwards for ibm,dma-window */
+   for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
+   dn = dn->parent)
+   if (of_get_property(dn, "ibm,dma-window", NULL))
+   break;
+   /*
+* if there is a DDW configuration, the TCE shift is stored in
+* the property
+*/
+   if (dn && PCI_DN(dn)) {
+   const struct dynamic_dma_window_prop *direct64 =
+   of_get_property(dn, DIRECT64_PROPNAME, NULL);
+   if (direct64)
+   return be32_to_cpu(direct64->tce_shift);
+   }
+   }
+
+   tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+
+   return PAGE_SHIFT;
+}
+
 #else  /* CONFIG_PCI */
 #define pci_dma_bus_setup_pSeries  NULL
 #define pci_dma_dev_setup_pSeries  NULL
@@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device 
*dev)
 #define pci_dma_dev_setup_pSeriesLPNULL
 #define dma_set_mask_pSeriesLP NULL
 #define dma_get_required_mask_pSeriesLPNULL
+#define dma_get_page_shift_pSeriesLP   NULL
 #endif /* !CONFIG_PCI */
 
 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
@@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void)
pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
+   ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP;
} else {
pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries;
pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries;
-- 
1.9.1



[PATCH 3/7 v2] powerpc/dma: implement per-platform dma_get_page_shift

2015-10-23 Thread Nishanth Aravamudan
The IOMMU page size is not always stored in struct iommu_table on Power.
Specifically if a device is configured for DDW (Dynamic DMA Windows aka.
64-bit direct DMA), the used TCE (Translation Control Entry) size is
stored in a special device property created at run-time by the DDW
configuration code. DDW is a pseries-specific feature, so allow
platforms to override the implementation of dma_get_page_shift if
desired.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/machdep.h | 3 ++-
 arch/powerpc/kernel/dma.c  | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 3f191f5..9dd03cf 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -83,9 +83,10 @@ struct machdep_calls {
 #endif
 #endif /* CONFIG_PPC64 */
 
-   /* Platform set_dma_mask and dma_get_required_mask overrides */
+   /* Platform overrides */
int (*dma_set_mask)(struct device *dev, u64 dma_mask);
u64 (*dma_get_required_mask)(struct device *dev);
+   unsigned long   (*dma_get_page_shift)(struct device *dev);
 
int (*probe)(void);
void(*setup_arch)(void); /* Optional, may be NULL */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e805af2..c363896 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -338,6 +338,8 @@ EXPORT_SYMBOL(dma_set_mask);
 unsigned long dma_get_page_shift(struct device *dev)
 {
struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (ppc_md.dma_get_page_shift)
+   return ppc_md.dma_get_page_shift(dev);
if (tbl)
return tbl->it_page_shift;
return PAGE_SHIFT;
-- 
1.9.1



Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-19 Thread Nishanth Aravamudan
On 15.10.2015 [15:52:19 -0700], Nishanth Aravamudan wrote:
> On 14.10.2015 [08:42:51 -0700], Christoph Hellwig wrote:
> > Hi Nishanth,
> > 
> > sorry for the late reply.
> > 
> > > > On Power, since it's technically variable, we'd need a function. So are
> > > > you suggesting define'ing it to a function just on Power and leaving it
> > > > a constant elsewhere?
> > > > 
> > > > I noticed that sparc has a IOMMU_PAGE_SHIFT already, fwiw.
> > > 
> > > Sorry, I should have been more specific -- I'm ready to spin out a v3,
> > > with a sparc-specific function.
> > > 
> > > Are you ok with leaving it a function for now (the only caller is in
> > > NVMe obviously).
> > 
> > 
> > I guess we do indeed need a function then.  I'll take a look at your
> > patch, but as long you found a way to avoid adding too much boilerplate
> > code it should be fine.
> 
> Ok, so I've got the moved function (include/linux/dma-mapping.h
> instead of dma-mapping-common.h) ready to go, which should only
> involve changing the first patch in the series. But I'm really
> mystified by what to do for sparc, which defines IOMMU_PAGE_SHIFT and
> IO_PAGE_SHIFT in arch/sparc/kernel/iommu_common.h.
> 
> 1) Which constant reflects the value we mean for this function on
> sparc?  I assume it should be IOMMU_PAGE_SHIFT, but they are the same
> value and I want to make sure I get the semantics right.
> 
> 2) Where would I put sparc's definition of dma_get_page_shift()?
> Should it be in a asm/dma-mapping.h? Should we move some of the
> constants from arch/sparc/kernel/iommu_common.h to
> arch/sparc/include/asm/iommu_common.h and then #include that in
> asm/dma-mapping.h?
> 
> Dave M., any opinions/insights? Essentially, this helper function
> assists the NVMe driver in determining what page size it should use to
> satisfy both the device and IOMMU's requirements. Maybe I
> misunderstand the constants on sparc and PAGE_SHIFT is fine there too?

Anyone with sparc knowledge have an opinion/input?

-Nish



Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-15 Thread Nishanth Aravamudan
On 14.10.2015 [08:42:51 -0700], Christoph Hellwig wrote:
> Hi Nishanth,
> 
> sorry for the late reply.
> 
> > > On Power, since it's technically variable, we'd need a function. So are
> > > you suggesting define'ing it to a function just on Power and leaving it
> > > a constant elsewhere?
> > > 
> > > I noticed that sparc has a IOMMU_PAGE_SHIFT already, fwiw.
> > 
> > Sorry, I should have been more specific -- I'm ready to spin out a v3,
> > with a sparc-specific function.
> > 
> > Are you ok with leaving it a function for now (the only caller is in
> > NVMe obviously).
> 
> 
> I guess we do indeed need a function then.  I'll take a look at your
> patch, but as long you found a way to avoid adding too much boilerplate
> code it should be fine.

Ok, so I've got the moved function (include/linux/dma-mapping.h instead
of dma-mapping-common.h) ready to go, which should only involve changing
the first patch in the series. But I'm really mystified by what to do
for sparc, which defines IOMMU_PAGE_SHIFT and IO_PAGE_SHIFT in
arch/sparc/kernel/iommu_common.h.

1) Which constant reflects the value we mean for this function on sparc?
I assume it should be IOMMU_PAGE_SHIFT, but they are the same value and
I want to make sure I get the semantics right.

2) Where would I put sparc's definition of dma_get_page_shift()? Should
it be in asm/dma-mapping.h? Should we move some of the constants from
arch/sparc/kernel/iommu_common.h to
arch/sparc/include/asm/iommu_common.h and then #include that in
asm/dma-mapping.h?

Dave M., any opinions/insights? Essentially, this helper function
assists the NVMe driver in determining what page size it should use to
satisfy both the device and IOMMU's requirements. Maybe I misunderstand
the constants on sparc and PAGE_SHIFT is fine there too?
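
For concreteness, the sort of sparc-specific override I have in mind is
just a sketch like this (purely illustrative -- it assumes
IOMMU_PAGE_SHIFT is the right constant and asm/dma-mapping.h the right
home, which are exactly the open questions above):

    /* hypothetical sparc override, mirroring the generic #ifndef fallback */
    #define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
    static inline unsigned long dma_get_page_shift(struct device *dev)
    {
            return IOMMU_PAGE_SHIFT;
    }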

-Nish



Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-14 Thread Nishanth Aravamudan
Hi Christoph,

On 12.10.2015 [14:06:51 -0700], Nishanth Aravamudan wrote:
> On 06.10.2015 [02:51:36 -0700], Christoph Hellwig wrote:
> > Do we need a function here or can we just have a IOMMU_PAGE_SHIFT define
> > with an #ifndef in common code?
> 
> On Power, since it's technically variable, we'd need a function. So are
> you suggesting define'ing it to a function just on Power and leaving it
> a constant elsewhere?
> 
> I noticed that sparc has a IOMMU_PAGE_SHIFT already, fwiw.

Sorry, I should have been more specific -- I'm ready to spin out a v3,
with a sparc-specific function.

Are you OK with leaving it a function for now (the only caller is in
NVMe, obviously)?

-Nish



Re: [PATCH 1/2] powerpc/iommu: expose IOMMU page shift

2015-10-12 Thread Nishanth Aravamudan
On 12.10.2015 [09:03:52 -0700], Nishanth Aravamudan wrote:
> On 06.10.2015 [14:19:43 +1100], David Gibson wrote:
> > On Fri, Oct 02, 2015 at 10:18:00AM -0700, Nishanth Aravamudan wrote:
> > > We will leverage this macro in the NVMe driver, which needs to know the
> > > configured IOMMU page shift to properly configure its device's page
> > > size.
> > > 
> > > Signed-off-by: Nishanth Aravamudan 
> > > 
> > > ---
> > > Given this is available, it seems reasonable to expose -- and it doesn't
> > > really make sense to make the driver do a log2 call on the existing
> > > IOMMU_PAGE_SIZE() value.
> > > 
> > > diff --git a/arch/powerpc/include/asm/iommu.h 
> > > b/arch/powerpc/include/asm/iommu.h
> > > index ca18cff..6fdf857 100644
> > > --- a/arch/powerpc/include/asm/iommu.h
> > > +++ b/arch/powerpc/include/asm/iommu.h
> > > @@ -36,6 +36,7 @@
> > >  #define IOMMU_PAGE_MASK_4K   (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
> > >  #define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)
> > >  
> > > +#define IOMMU_PAGE_SHIFT(tblptr) (tblptr)->it_page_shift
> > >  #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift)
> > 
> > Seems like it would be a touch safer to alter IOMMU_PAGE_SIZE so it
> > uses the new IOMMU_PAGE_SHIFT macro.
> 
> Yes absolutely! Sorry, I initially didn't add the first macro, so didn't
> think that through. Will update.

Err, replied too quickly -- just got back from vacation -- this is
from an old version of the patchset, we no longer introduce this macro.

-Nish



Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-12 Thread Nishanth Aravamudan
On 06.10.2015 [02:51:36 -0700], Christoph Hellwig wrote:
> Do we need a function here or can we just have a IOMMU_PAGE_SHIFT define
> with an #ifndef in common code?

On Power, since it's technically variable, we'd need a function. So are
you suggesting define'ing it to a function just on Power and leaving it
a constant elsewhere?

I noticed that sparc has a IOMMU_PAGE_SHIFT already, fwiw.

-Nish



Re: [PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-12 Thread Nishanth Aravamudan
On 06.10.2015 [02:51:36 -0700], Christoph Hellwig wrote:
> Do we need a function here or can we just have a IOMMU_PAGE_SHIFT define
> with an #ifndef in common code?

I suppose we could do that -- I wasn't sure if the macro would be
palatable.

> Also not all architectures use dma-mapping-common.h yet, so you either
> need to update all of those as well, or just add the #ifndef directly
> to linux/dma-mapping.h.

Ok, I will take a look at that and spin up a v3.

Thanks for the feedback!

-Nish



Re: [PATCH 1/2] powerpc/iommu: expose IOMMU page shift

2015-10-12 Thread Nishanth Aravamudan
On 06.10.2015 [14:19:43 +1100], David Gibson wrote:
> On Fri, Oct 02, 2015 at 10:18:00AM -0700, Nishanth Aravamudan wrote:
> > We will leverage this macro in the NVMe driver, which needs to know the
> > configured IOMMU page shift to properly configure its device's page
> > size.
> > 
> > Signed-off-by: Nishanth Aravamudan 
> > 
> > ---
> > Given this is available, it seems reasonable to expose -- and it doesn't
> > really make sense to make the driver do a log2 call on the existing
> > IOMMU_PAGE_SIZE() value.
> > 
> > diff --git a/arch/powerpc/include/asm/iommu.h 
> > b/arch/powerpc/include/asm/iommu.h
> > index ca18cff..6fdf857 100644
> > --- a/arch/powerpc/include/asm/iommu.h
> > +++ b/arch/powerpc/include/asm/iommu.h
> > @@ -36,6 +36,7 @@
> >  #define IOMMU_PAGE_MASK_4K   (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
> >  #define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)
> >  
> > +#define IOMMU_PAGE_SHIFT(tblptr) (tblptr)->it_page_shift
> >  #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift)
> 
> Seems like it would be a touch safer to alter IOMMU_PAGE_SIZE so it
> uses the new IOMMU_PAGE_SHIFT macro.

Yes absolutely! Sorry, I initially didn't add the first macro, so didn't
think that through. Will update.

-Nish



Re: [PATCH 0/5 v2] Fix NVMe driver support on Power with 32-bit DMA

2015-10-02 Thread Nishanth Aravamudan
On 03.10.2015 [07:35:09 +1000], Benjamin Herrenschmidt wrote:
> On Fri, 2015-10-02 at 14:04 -0700, Nishanth Aravamudan wrote:
> > Right, I did start with your advice and tried that approach, but it
> > turned out I was wrong about the actual issue at the time. The problem
> > for NVMe isn't actually the starting address alignment (which it can
> > handle not being aligned to the device's page size). It doesn't handle
> > (addr + len % dev_page_size != 0). That is, it's really a length
> > alignment issue.
> > 
> > It seems incredibly device specific to have a an API into the DMA code
> > to request an end alignment -- no other device seems to have this
> > issue/design. If you think that's better, I can fiddle with that
> > instead.
> > 
> > Sorry, I should have called this out better as an alternative
> > consideration.
> 
> Nah it's fine. Ok. Also adding the alignment requirement to the API
> would have been a much more complex patch since it would have had to
> be implemented for all archs.
> 
> I think your current solution is fine.

Great, thanks. Also, while it's possible an alignment API would be more
performant, we're already not using DDW on Power in this case, so
performance is not a primary concern. We simply want to be
functional/correct in this configuration.

-Nish



Re: [PATCH 0/5 v2] Fix NVMe driver support on Power with 32-bit DMA

2015-10-02 Thread Nishanth Aravamudan
On 03.10.2015 [06:51:06 +1000], Benjamin Herrenschmidt wrote:
> On Fri, 2015-10-02 at 13:09 -0700, Nishanth Aravamudan wrote:
> 
> > 1) add a generic dma_get_page_shift implementation that just returns
> > PAGE_SHIFT
> 
> So you chose to return the granularity of the iommu to the driver
> rather than providing a way for the driver to request a specific
> alignment for DMA mappings. Any specific reason ?

Right, I did start with your advice and tried that approach, but it
turned out I was wrong about the actual issue at the time. The problem
for NVMe isn't actually the starting address alignment (which it can
handle not being aligned to the device's page size). It doesn't handle
((addr + len) % dev_page_size != 0). That is, it's really a length
alignment issue.
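
To put rough numbers on it (illustration only): a mapping can start on a
device-page boundary and still end in the middle of one, e.g.

    addr = 0x0, len = 0xF000, dev_page_size = 0x2000 (8K)
    addr % dev_page_size == 0
    (addr + len) % dev_page_size == 0x1000

and that trailing partial device page is what trips the BUG_ON in
nvme_setup_prps().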

It seems incredibly device-specific to have an API into the DMA code
to request an end alignment -- no other device seems to have this
issue/design. If you think that's better, I can fiddle with that
instead.

Sorry, I should have called this out better as an alternative
consideration.

-Nish




[PATCH 4/5 v2] pseries/iommu: implement DDW-aware dma_get_page_shift

2015-10-02 Thread Nishanth Aravamudan
When DDW (Dynamic DMA Windows) are present for a device, we have stored
the TCE (Translation Control Entry) size in a special device tree
property. Check if we have enabled DDW for the device and return the TCE
size from that property if present. If the property isn't present, fall
back to looking up the value in struct iommu_table. If we don't find an
iommu_table, fall back to the kernel's page size.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0946b98..1bf6471 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
return dma_iommu_ops.get_required_mask(dev);
 }
 
+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
+{
+   struct iommu_table *tbl;
+
+   if (!disable_ddw && dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct device_node *dn;
+
+   dn = pci_device_to_OF_node(pdev);
+
+   /* search upwards for ibm,dma-window */
+   for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
+   dn = dn->parent)
+   if (of_get_property(dn, "ibm,dma-window", NULL))
+   break;
+   /*
+* if there is a DDW configuration, the TCE shift is stored in
+* the property
+*/
+   if (dn && PCI_DN(dn)) {
+   const struct dynamic_dma_window_prop *direct64 =
+   of_get_property(dn, DIRECT64_PROPNAME, NULL);
+   if (direct64)
+   return be32_to_cpu(direct64->tce_shift);
+   }
+   }
+
+   tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+
+   return PAGE_SHIFT;
+}
+
 #else  /* CONFIG_PCI */
 #define pci_dma_bus_setup_pSeries  NULL
 #define pci_dma_dev_setup_pSeries  NULL
@@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
 #define pci_dma_dev_setup_pSeriesLPNULL
 #define dma_set_mask_pSeriesLP NULL
 #define dma_get_required_mask_pSeriesLPNULL
+#define dma_get_page_shift_pSeriesLP   NULL
 #endif /* !CONFIG_PCI */
 
 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
@@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void)
pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
+   ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP;
} else {
pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries;
pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries;



[PATCH 3/5 v2] powerpc/dma: implement per-platform dma_get_page_shift

2015-10-02 Thread Nishanth Aravamudan
The IOMMU page size is not always stored in struct iommu_table on Power.
Specifically if a device is configured for DDW (Dynamic DMA Windows aka.
64-bit direct DMA), the used TCE (Translation Control Entry) size is
stored in a special device property created at run-time by the DDW
configuration code. DDW is a pseries-specific feature, so allow
platforms to override the implementation of dma_get_page_shift if
desired.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index cab6753..5c372e3 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,9 +78,10 @@ struct machdep_calls {
 #endif
 #endif /* CONFIG_PPC64 */
 
-   /* Platform set_dma_mask and dma_get_required_mask overrides */
+   /* Platform overrides */
int (*dma_set_mask)(struct device *dev, u64 dma_mask);
u64 (*dma_get_required_mask)(struct device *dev);
+   unsigned long   (*dma_get_page_shift)(struct device *dev);
 
int (*probe)(void);
void(*setup_arch)(void); /* Optional, may be NULL */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e805af2..c363896 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -338,6 +338,8 @@ EXPORT_SYMBOL(dma_set_mask);
 unsigned long dma_get_page_shift(struct device *dev)
 {
struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (ppc_md.dma_get_page_shift)
+   return ppc_md.dma_get_page_shift(dev);
if (tbl)
return tbl->it_page_shift;
return PAGE_SHIFT;



[PATCH 2/5 v2] powerpc/dma-mapping: override dma_get_page_shift

2015-10-02 Thread Nishanth Aravamudan
On Power, the kernel's page size can differ from the IOMMU's page size,
so we need to override the generic implementation, which always returns
the kernel's page size. Look up the IOMMU's page size from struct
iommu_table, if available. Fallback to the kernel's page size,
otherwise.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index 7f522c0..c5638f4 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -125,6 +125,9 @@ static inline void set_dma_offset(struct device *dev, dma_addr_t off)
 #define HAVE_ARCH_DMA_SET_MASK 1
 extern int dma_set_mask(struct device *dev, u64 dma_mask);
 
+#define HAVE_ARCH_DMA_GET_PAGE_SHIFT 1
+extern unsigned long dma_get_page_shift(struct device *dev);
+
 #include 
 
 extern int __dma_set_mask(struct device *dev, u64 dma_mask);
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 59503ed..e805af2 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -335,6 +335,15 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
 }
 EXPORT_SYMBOL(dma_set_mask);
 
+unsigned long dma_get_page_shift(struct device *dev)
+{
+   struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+   return PAGE_SHIFT;
+}
+EXPORT_SYMBOL(dma_get_page_shift);
+
 u64 __dma_get_required_mask(struct device *dev)
 {
struct dma_map_ops *dma_ops = get_dma_ops(dev);



[PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-02 Thread Nishanth Aravamudan
Drivers like NVMe need to be able to determine the page size used for
DMA transfers. Add a new API that defaults to returning PAGE_SHIFT on all
architectures.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/include/asm-generic/dma-mapping-common.h b/include/asm-generic/dma-mapping-common.h
index b1bc954..86e4e97 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -355,4 +355,11 @@ static inline int dma_set_mask(struct device *dev, u64 mask)
 }
 #endif
 
+#ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+   return PAGE_SHIFT;
+}
+#endif
+
 #endif



[PATCH 0/5 v2] Fix NVMe driver support on Power with 32-bit DMA

2015-10-02 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).
 
The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size, as
mentioned above, can be 4K, while the device can have a page size of 8K
and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).
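
(Put differently, and purely as an illustration: 0xF000 is 15 4K TCEs but
only 7.5 8K device pages, so such a mapping necessarily ends in the middle
of a device page.)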
 
In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.
 
This series consists of five patches:

1) add a generic dma_get_page_shift implementation that just returns
PAGE_SHIFT
2) override the generic implementation on Power to use the IOMMU table's
page shift if available
3) allow further specific overriding on power with machdep platform
overrides
4) use the machdep override on pseries, as the DDW code puts the TCE
shift in a special property and there is no IOMMU table available
5) leverage the new API in the NVMe driver
 
With these patches, a NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

 arch/powerpc/include/asm/dma-mapping.h   |  3 +++
 arch/powerpc/include/asm/machdep.h       |  3 ++-
 arch/powerpc/kernel/dma.c                | 11 +++
 arch/powerpc/platforms/pseries/iommu.c   | 36 ++++
 drivers/block/nvme-core.c                |  3 ++-
 include/asm-generic/dma-mapping-common.h |  7 +++
 6 files changed, 61 insertions(+), 2 deletions(-)

v1 -> v2:
  Based upon feedback from Christoph Hellwig, rather than using an
  arch-specific hack, expose the DMA page shift via a generic DMA API and
  override it on Power as needed.



Re: [PATCH 2/2] drivers/nvme: default to the IOMMU page size on Power

2015-10-02 Thread Nishanth Aravamudan
On 02.10.2015 [10:25:44 -0700], Christoph Hellwig wrote:
> Hi Nishanth,
> 
> please expose this value through the generic DMA API instead of adding
> architecture specific hacks to drivers.

Ok, I'm happy to do that instead -- what I struggled with is that I
don't have enough knowledge of the various architectures to provide the
right default implementation. It should be sufficient for the default to
return PAGE_SHIFT, and on Power just override that to return the IOMMU
table's page size? Since the only user will be the NVMe driver
currently, that should be fine?
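
Roughly, I'm thinking of something with this shape (sketch only -- it is
essentially what patches 1/5 and 2/5 of the v2 series elsewhere in this
thread end up doing):

    /* generic fallback in a common dma-mapping header */
    #ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT
    static inline unsigned long dma_get_page_shift(struct device *dev)
    {
            return PAGE_SHIFT;
    }
    #endif

    /* Power override: prefer the IOMMU table's page shift when one exists */
    unsigned long dma_get_page_shift(struct device *dev)
    {
            struct iommu_table *tbl = get_iommu_table_base(dev);

            if (tbl)
                    return tbl->it_page_shift;
            return PAGE_SHIFT;
    }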

Sorry for the less-than-ideal patch!

-Nish



[PATCH 2/2] drivers/nvme: default to the IOMMU page size on Power

2015-10-02 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size, as
mentioned above, can be 4K, while the device can have a page size of 8K
and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.

With this patch, a NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 7920c27..969a95e 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define NVME_MINORS(1U << MINORBITS)
 #define NVME_Q_DEPTH   1024
@@ -1680,6 +1681,11 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
unsigned page_shift = PAGE_SHIFT;
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
+#ifdef CONFIG_PPC64
+   struct iommu_table *tbl = get_iommu_table_base(dev->dev);
+   if (tbl)
+   page_shift = IOMMU_PAGE_SHIFT(tbl);
+#endif
 
if (page_shift < dev_page_min) {
dev_err(dev->dev,



[PATCH 0/2] Fix NVMe driver support on Power with 32-bit DMA

2015-10-02 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA alignment
matches the kernel's page alignment. On Power, the IOMMU page size, as
mentioned above, can be 4K, while the device can have a page size of 8K
and the kernel a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.

This series consists of two patches, one of which exposes the IOMMU's
page shift on Power (currently only the page size is exposed, and it
seems unnecessary to ilog2 that value in the driver). The second patch
leverages this value on Power in the NVMe driver.

With these patches, a NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.



[PATCH 1/2] powerpc/iommu: expose IOMMU page shift

2015-10-02 Thread Nishanth Aravamudan
We will leverage this macro in the NVMe driver, which needs to know the
configured IOMMU page shift to properly configure its device's page
size.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
Given this is available, it seems reasonable to expose -- and it doesn't
really make sense to make the driver do a log2 call on the existing
IOMMU_PAGE_SIZE() value.

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index ca18cff..6fdf857 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -36,6 +36,7 @@
 #define IOMMU_PAGE_MASK_4K   (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
 #define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)
 
+#define IOMMU_PAGE_SHIFT(tblptr) (tblptr)->it_page_shift
 #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift)
 #define IOMMU_PAGE_MASK(tblptr) (~((1 << (tblptr)->it_page_shift) - 1))
 #define IOMMU_PAGE_ALIGN(addr, tblptr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE(tblptr))



[PATCH 0/2] Fix NVMe driver support on Power with 32-bit DMA

2015-10-02 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.

This series consists of two patches, one of which exposes the IOMMU's
page shift on Power (currently only the page size is exposed, and it
seems unnecessary to ilog2 that value in the driver). The second patch
leverages this value on Power in the NVMe driver.

With these patches, a NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/2] powerpc/iommu: expose IOMMU page shift

2015-10-02 Thread Nishanth Aravamudan
We will leverage this macro in the NVMe driver, which needs to know the
configured IOMMU page shift to properly configure its device's page
size.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

---
Given this is available, it seems reasonable to expose -- and it doesn't
really make sense to make the driver do a log2 call on the existing
IOMMU_PAGE_SIZE() value.

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index ca18cff..6fdf857 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -36,6 +36,7 @@
 #define IOMMU_PAGE_MASK_4K   (~((1 << IOMMU_PAGE_SHIFT_4K) - 1))
 #define IOMMU_PAGE_ALIGN_4K(addr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE_4K)
 
+#define IOMMU_PAGE_SHIFT(tblptr) (tblptr)->it_page_shift
 #define IOMMU_PAGE_SIZE(tblptr) (ASM_CONST(1) << (tblptr)->it_page_shift)
 #define IOMMU_PAGE_MASK(tblptr) (~((1 << (tblptr)->it_page_shift) - 1))
 #define IOMMU_PAGE_ALIGN(addr, tblptr) _ALIGN_UP(addr, IOMMU_PAGE_SIZE(tblptr))

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 2/2] drivers/nvme: default to the IOMMU page size on Power

2015-10-02 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).

In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.

With this patch, a NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index 7920c27..969a95e 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -42,6 +42,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define NVME_MINORS(1U << MINORBITS)
 #define NVME_Q_DEPTH   1024
@@ -1680,6 +1681,11 @@ static int nvme_configure_admin_queue(struct nvme_dev 
*dev)
unsigned page_shift = PAGE_SHIFT;
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
+#ifdef CONFIG_PPC64
+   struct iommu_table *tbl = get_iommu_table_base(dev->dev);
+   if (tbl)
+   page_shift = IOMMU_PAGE_SHIFT(tbl);
+#endif
 
if (page_shift < dev_page_min) {
dev_err(dev->dev,

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 2/2] drivers/nvme: default to the IOMMU page size on Power

2015-10-02 Thread Nishanth Aravamudan
On 02.10.2015 [10:25:44 -0700], Christoph Hellwig wrote:
> Hi Nishanth,
> 
> please expose this value through the generic DMA API instead of adding
> architecture specific hacks to drivers.

Ok, I'm happy to do that instead -- what I struggled with is that I
don't have enough knowledge of the various architectures to provide the
right default implementation. It should be sufficient for the default to
return PAGE_SHIFT, and on Power just override that to return the IOMMU
table's page size? Since the only user will be the NVMe driver
currently, that should be fine?

Sorry for the less-than-ideal patch!

-Nish

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 3/5 v2] powerpc/dma: implement per-platform dma_get_page_shift

2015-10-02 Thread Nishanth Aravamudan
The IOMMU page size is not always stored in struct iommu on Power.
Specifically if a device is configured for DDW (Dynamic DMA Windows aka.
64-bit direct DMA), the used TCE (Translation Control Entry) size is
stored in a special device property created at run-time by the DDW
configuration code. DDW is a pseries-specific feature, so allow
platforms to override the implementation of dma_get_page_shift if
desired.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/include/asm/machdep.h 
b/arch/powerpc/include/asm/machdep.h
index cab6753..5c372e3 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -78,9 +78,10 @@ struct machdep_calls {
 #endif
 #endif /* CONFIG_PPC64 */
 
-   /* Platform set_dma_mask and dma_get_required_mask overrides */
+   /* Platform overrides */
int (*dma_set_mask)(struct device *dev, u64 dma_mask);
u64 (*dma_get_required_mask)(struct device *dev);
+   unsigned long   (*dma_get_page_shift)(struct device *dev);
 
int (*probe)(void);
void(*setup_arch)(void); /* Optional, may be NULL */
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index e805af2..c363896 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -338,6 +338,8 @@ EXPORT_SYMBOL(dma_set_mask);
 unsigned long dma_get_page_shift(struct device *dev)
 {
struct iommu_table *tbl = get_iommu_table_base(dev);
+   if (ppc_md.dma_get_page_shift)
+   return ppc_md.dma_get_page_shift(dev);
if (tbl)
return tbl->it_page_shift;
return PAGE_SHIFT;

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 0/5 v2] Fix NVMe driver support on Power with 32-bit DMA

2015-10-02 Thread Nishanth Aravamudan
We received a bug report recently when DDW (64-bit direct DMA on Power)
is not enabled for NVMe devices. In that case, we fall back to 32-bit
DMA via the IOMMU, which is always done via 4K TCEs (Translation Control
Entries).
 
The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA aligment
matches the kernel's page aligment. On Power, the the IOMMU page size,
as mentioned above, can be 4K, while the device can have a page size of
8K, while the kernel has a page size of 64K. This eventually trips the
BUG_ON in nvme_setup_prps(), as we have a 'dma_len' that is a multiple
of 4K but not 8K (e.g., 0xF000).
 
In this particular case, and generally, we want to use the IOMMU's page
size for the default device page size, rather than the kernel's page
size.
 
This series consists of five patches:

1) add a generic dma_get_page_shift implementation that just returns
PAGE_SHIFT
2) override the generic implementation on Power to use the IOMMU table's
page shift if available
3) allow further specific overriding on power with machdep platform
overrides
4) use the machdep override on pseries, as the DDW code puts the TCE
shift in a special property and there is no IOMMU table available
5) leverage the new API in the NVMe driver
 
With these patches, a NVMe device survives our internal hardware
exerciser; the kernel BUGs within a few seconds without the patch.

 arch/powerpc/include/asm/dma-mapping.h   |  3 +++
 arch/powerpc/include/asm/machdep.h   |  3 ++-
 arch/powerpc/kernel/dma.c| 11 +++
 arch/powerpc/platforms/pseries/iommu.c   | 36 

 drivers/block/nvme-core.c|  3 ++-
 include/asm-generic/dma-mapping-common.h |  7 +++
 6 files changed, 61 insertions(+), 2 deletions(-)

v1 -> v2:
  Based upon feedback from Christoph Hellwig, rather than using an
  arch-specific hack, expose the DMA page shift via a generic DMA API and
  override it on Power as needed.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 1/5 v2] dma-mapping: add generic dma_get_page_shift API

2015-10-02 Thread Nishanth Aravamudan
Drivers like NVMe need to be able to determine the page size used for
DMA transfers. Add a new API that defaults to return PAGE_SHIFT on all
architectures.

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/include/asm-generic/dma-mapping-common.h 
b/include/asm-generic/dma-mapping-common.h
index b1bc954..86e4e97 100644
--- a/include/asm-generic/dma-mapping-common.h
+++ b/include/asm-generic/dma-mapping-common.h
@@ -355,4 +355,11 @@ static inline int dma_set_mask(struct device *dev, u64 
mask)
 }
 #endif
 
+#ifndef HAVE_ARCH_DMA_GET_PAGE_SHIFT
+static inline unsigned long dma_get_page_shift(struct device *dev)
+{
+   return PAGE_SHIFT;
+}
+#endif
+
 #endif

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH 4/5 v2] pseries/iommu: implement DDW-aware dma_get_page_shift

2015-10-02 Thread Nishanth Aravamudan
When DDW (Dynamic DMA Windows) are present for a device, we have stored
the TCE (Translation Control Entry) size in a special device tree
property. Check if we have enabled DDW for the device and return the TCE
size from that property if present. If the property isn't present,
fall back to looking the value up in struct iommu_table. If we don't
find an iommu_table, fall back to the kernel's page size.
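
For reference, the DDW property consulted below stores the TCE shift in
the dynamic_dma_window_prop layout that the pseries DDW code already
uses; roughly the following (field comments are mine, the definition in
the tree is authoritative):

struct dynamic_dma_window_prop {
	__be32	liobn;		/* TCE table number */
	__be64	dma_base;	/* base address of the DMA window */
	__be32	tce_shift;	/* ilog2(TCE page size); what we return below */
	__be32	window_shift;	/* ilog2(DMA window size) */
};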

Signed-off-by: Nishanth Aravamudan <n...@linux.vnet.ibm.com>

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 0946b98..1bf6471 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1292,6 +1292,40 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
return dma_iommu_ops.get_required_mask(dev);
 }
 
+static unsigned long dma_get_page_shift_pSeriesLP(struct device *dev)
+{
+   struct iommu_table *tbl;
+
+   if (!disable_ddw && dev_is_pci(dev)) {
+   struct pci_dev *pdev = to_pci_dev(dev);
+   struct device_node *dn;
+
+   dn = pci_device_to_OF_node(pdev);
+
+   /* search upwards for ibm,dma-window */
+   for (; dn && PCI_DN(dn) && !PCI_DN(dn)->table_group;
+   dn = dn->parent)
+   if (of_get_property(dn, "ibm,dma-window", NULL))
+   break;
+   /*
+* if there is a DDW configuration, the TCE shift is stored in
+* the property
+*/
+   if (dn && PCI_DN(dn)) {
+   const struct dynamic_dma_window_prop *direct64 =
+   of_get_property(dn, DIRECT64_PROPNAME, NULL);
+   if (direct64)
+   return be32_to_cpu(direct64->tce_shift);
+   }
+   }
+
+   tbl = get_iommu_table_base(dev);
+   if (tbl)
+   return tbl->it_page_shift;
+
+   return PAGE_SHIFT;
+}
+
 #else  /* CONFIG_PCI */
 #define pci_dma_bus_setup_pSeries  NULL
 #define pci_dma_dev_setup_pSeries  NULL
@@ -1299,6 +1333,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
 #define pci_dma_dev_setup_pSeriesLPNULL
 #define dma_set_mask_pSeriesLP NULL
 #define dma_get_required_mask_pSeriesLPNULL
+#define dma_get_page_shift_pSeriesLP   NULL
 #endif /* !CONFIG_PCI */
 
 static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
@@ -1395,6 +1430,7 @@ void iommu_init_early_pSeries(void)
pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
+   ppc_md.dma_get_page_shift = dma_get_page_shift_pSeriesLP;
} else {
pseries_pci_controller_ops.dma_bus_setup = pci_dma_bus_setup_pSeries;
pseries_pci_controller_ops.dma_dev_setup = pci_dma_dev_setup_pSeries;



[PATCH 5/5 v2] drivers/nvme: default to the IOMMU page size

2015-10-02 Thread Nishanth Aravamudan
We recently received a bug report about the case where DDW (64-bit
direct DMA on Power) is not enabled for NVMe devices. In that case, we
fall back to 32-bit DMA via the IOMMU, which is always done via 4K TCEs
(Translation Control Entries).

The NVMe device driver, though, assumes that the DMA alignment for the
PRP entries will match the device's page size, and that the DMA
alignment matches the kernel's page alignment. On Power, the IOMMU page
size, as mentioned above, can be 4K, while the device page size is 8K
and the kernel page size is 64K. This eventually trips the BUG_ON in
nvme_setup_prps(), as we end up with a 'dma_len' that is a multiple of
4K but not of 8K (e.g., 0xF000).

In this particular case, and generally, the NVMe driver should use the
IOMMU's page size rather than the kernel's page size as the default
device page size in this function.
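
For context, nvme_configure_admin_queue() already clamps page_shift to
the device's advertised range (dev_page_min/dev_page_max are visible in
the hunk below); the existing logic is roughly the following sketch.
With page_shift coming from the IOMMU, a 4K shift typically falls
inside that range, instead of starting from the 64K kernel page size
and then being capped to the 8K device maximum, which is what produced
the mismatch:

	/* sketch of the existing min/max handling, not part of this patch */
	if (page_shift < dev_page_min) {
		dev_err(dev->dev, "device minimum page size (%u) too large for host (%u)\n",
			1 << dev_page_min, 1 << page_shift);
		return -ENODEV;
	}
	if (page_shift > dev_page_max) {
		dev_info(dev->dev, "device maximum page size (%u) smaller than host (%u)\n",
			1 << dev_page_max, 1 << page_shift);
		page_shift = dev_page_max;
	}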

With this patch, an NVMe device survives our internal hardware
exerciser; without it, the kernel BUGs within a few seconds.

---
v1 -> v2:
  Based upon feedback from Christoph Hellwig, implement the IOMMU page
  size lookup as a generic DMA API, rather than an architecture-specific
  hack.

diff --git a/drivers/block/nvme-core.c b/drivers/block/nvme-core.c
index b97fc3f..c561137 100644
--- a/drivers/block/nvme-core.c
+++ b/drivers/block/nvme-core.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1713,7 +1714,7 @@ static int nvme_configure_admin_queue(struct nvme_dev *dev)
u32 aqa;
u64 cap = readq(&dev->bar->cap);
struct nvme_queue *nvmeq;
-   unsigned page_shift = PAGE_SHIFT;
+   unsigned page_shift = dma_get_page_shift(dev->dev);
unsigned dev_page_min = NVME_CAP_MPSMIN(cap) + 12;
unsigned dev_page_max = NVME_CAP_MPSMAX(cap) + 12;
 


