Re: [PATCH RFC 01/15] MIPS: replace **** with a hug

2018-11-30 Thread Mike Galbraith
On Fri, 2018-11-30 at 11:27 -0800, Jarkko Sakkinen wrote:
> In order to comply with the CoC, replace **** with a hug.
> 
> Signed-off-by: Jarkko Sakkinen 
> ---
>  arch/mips/pci/ops-bridge.c  | 24 
>  arch/mips/sgi-ip22/ip22-setup.c |  2 +-
>  2 files changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/mips/pci/ops-bridge.c b/arch/mips/pci/ops-bridge.c
> index a1d2c4ae0d1b..c755c4c92fa9 100644
> --- a/arch/mips/pci/ops-bridge.c
> +++ b/arch/mips/pci/ops-bridge.c
> @@ -56,7 +56,7 @@ static int pci_conf0_read_config(struct pci_bus *bus, 
> unsigned int devfn,
>   return PCIBIOS_DEVICE_NOT_FOUND;
>  
>   /*
> -  * IOC3 is fucking fucked beyond belief ...  Don't even give the
> +  * IOC3 is hugging hugged beyond belief ...  Don't even give the

This obfuscation is a complete waste of reader brain cycles, as it will
transparently be reverted to the original in order to restore meaning.

-Mike 


Re: memcg oops: memcg_kmem_charge_memcg()->try_charge()->page_counter_try_charge()->BOOM

2018-10-29 Thread Mike Galbraith
On Mon, 2018-10-29 at 21:49 +, Roman Gushchin wrote:
> On Mon, Oct 29, 2018 at 09:46:54PM +0100, Mike Galbraith wrote:
> 
> > Ah, I have cgroup_disable=memory on the command line, which turns out
> > to be why your box doesn't explode, while mine does.
> 
> Yeah, here it is. I'll send the fix in few minutes. Please,
> test it on your setup. Your tested-by will be appreciated.

Yup, all-better-by:/me


Re: memcg oops: memcg_kmem_charge_memcg()->try_charge()->page_counter_try_charge()->BOOM

2018-10-29 Thread Mike Galbraith
On Mon, 2018-10-29 at 18:54 +, Roman Gushchin wrote:
> 
> Hi Mike!
> 
> Thank you for the report!
> 
> Do you see it reliable every time you boot up the machine?

Yeah.

> How do you run kvm?

My VMs are full SW/data clones of my i7-4790/openSUSE  box.

>  Is there something special about your cgroup setup?

No, I generally have no use for cgroups.

> I've made several attempts to reproduce the issue, but haven't got anything
> so far. I've used your config, and played with different cgroups setups.

Ah, I have cgroup_disable=memory on the command line, which turns out
to be why your box doesn't explode, while mine does.

> Do you know where in the page_counter_try_charge() it fails?
> 
> Also, can you, please, check if the following patch mitigates the problem?

Yeah, that plugs it up.

-Mike


Re: memcg oops: memcg_kmem_charge_memcg()->try_charge()->page_counter_try_charge()->BOOM

2018-10-29 Thread Mike Galbraith
On Mon, 2018-10-29 at 14:20 +0100, Michal Hocko wrote:
> 
> > [4.420976] Code: f3 c3 0f 1f 00 0f 1f 44 00 00 48 85 ff 0f 84 a8 00 00 
> > 00 41 56 48 89 f8 41 55 49 89 fe 41 54 49 89 d5 55 49 89 f4 53 48 89 f3 
> >  48 0f c1 1f 48 01 f3 48 39 5f 18 48 89 fd 73 17 eb 41 48 89 e8
> > [4.424162] RSP: 0018:b27840c57cb0 EFLAGS: 00010202
> > [4.425236] RAX: 00f8 RBX: 0020 RCX: 
> > 0200
> > [4.426467] RDX: b27840c57d08 RSI: 0020 RDI: 
> > 00f8
> > [4.427652] RBP: 0001 R08:  R09: 
> > b278410bc000
> > [4.428883] R10: b27840c57ed0 R11: 0040 R12: 
> > 0020
> > [4.430168] R13: b27840c57d08 R14: 00f8 R15: 
> > 006000c0
> > [4.431411] FS:  7f79081a3940() GS:92a4b7bc() 
> > knlGS:
> > [4.432748] CS:  0010 DS:  ES:  CR0: 80050033
> > [4.433836] CR2: 00f8 CR3: 0002310ac002 CR4: 
> > 001606e0
> > [4.435500] Call Trace:
> > [4.436319]  try_charge+0x92/0x7b0
> > [4.437284]  ? unlazy_walk+0x4c/0xb0
> > [4.438676]  ? terminate_walk+0x91/0x100
> > [4.439984]  memcg_kmem_charge_memcg+0x28/0x80
> > [4.441059]  memcg_kmem_charge+0x88/0x1d0
> > [4.442105]  copy_process.part.37+0x23a/0x2070
> 
> Could you faddr2line this please?

homer:/usr/local/src/kernel/linux-master # ./scripts/faddr2line vmlinux copy_process.part.37+0x23a
copy_process.part.37+0x23a/0x2070:
memcg_charge_kernel_stack at kernel/fork.c:401
(inlined by) dup_task_struct at kernel/fork.c:850
(inlined by) copy_process at kernel/fork.c:1750

I bisected it this afternoon, and confirmed the result via revert.

9b6f7e163cd0f468d1b9696b785659d3c27c8667 is the first bad commit
commit 9b6f7e163cd0f468d1b9696b785659d3c27c8667
Author: Roman Gushchin 
Date:   Fri Oct 26 15:03:19 2018 -0700

mm: rework memcg kernel stack accounting

If CONFIG_VMAP_STACK is set, kernel stacks are allocated using
__vmalloc_node_range() with __GFP_ACCOUNT.  So kernel stack pages are
charged against corresponding memory cgroups on allocation and uncharged
on releasing them.

The problem is that we do cache kernel stacks in small per-cpu caches and
do reuse them for new tasks, which can belong to different memory cgroups.

Each stack page still holds a reference to the original cgroup, so the
cgroup can't be released until the vmap area is released.

To make this happen we need more than two subsequent exits without forks
in between on the current cpu, which makes it very unlikely to happen.  As
a result, I saw a significant number of dying cgroups (in theory, up to 2
* number_of_cpu + number_of_tasks), which can't be released even by
significant memory pressure.

As a cgroup structure can take a significant amount of memory (first of
all, per-cpu data like memcg statistics), it leads to a noticeable waste
of memory.

Link: http://lkml.kernel.org/r/20180827162621.30187-1-g...@fb.com
Fixes: ac496bf48d97 ("fork: Optimize task creation by caching two thread 
stacks per CPU if CONFIG_VMAP_STACK=y")
Signed-off-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Andy Lutomirski 
Cc: Konstantin Khlebnikov 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

:040000 040000 19a916f067fb987c6b15ce04f0e656c590db39dd edde98ce70d28e03f623f86f54887720516fcd91 M  include
:040000 040000 04213da714a8a10580baccd0b0977a6744fa2374 9204198e8eb4043b059f2a4eeaa4e19679fd3ddb M  kernel

git bisect start
# good: [e5f6d9afa3415104e402cd69288bb03f7165eeba] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc
git bisect good e5f6d9afa3415104e402cd69288bb03f7165eeba
# bad: [345671ea0f9258f410eb057b9ced9cefbbe5dc78] Merge branch 'akpm' (patches 
from Andrew)
git bisect bad 345671ea0f9258f410eb057b9ced9cefbbe5dc78
# bad: [ae2b01f37044c10e975d22116755df56252b09d8] mm: remove vm_insert_pfn()
git bisect bad ae2b01f37044c10e975d22116755df56252b09d8
# good: [9703fc8caf36ac65dca1538b23dd137de0b53233] Merge tag 'usb-4.20-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
git bisect good 9703fc8caf36ac65dca1538b23dd137de0b53233
# good: [bf58e8820c48805394ec9e76339f0c4646050432] nvmem: change the signature 
of nvmem_unregister()
git bisect good bf58e8820c48805394ec9e76339f0c4646050432
# good: [cccb3b19e762edc8ef0481be506967555cb9e317] nvmem: fix 
nvmem_cell_get_from_lookup()
git bisect good cccb3b19e762edc8ef0481be506967555cb9e317
# good: [18d0eae30e6a4f8644d589243d7ac1d70d29203d] Merge tag 
'char-misc-4.20-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect good 18d0eae30e6a4f8644d589243d7ac1d70d29203d
# bad: [9b6f7e163cd0f468d1b9696b785659d3c27c8667] mm: rework memcg kernel 

Re: [PATCH RT 08/22] Revert "x86: UV: raw_spinlock conversion"

2018-09-06 Thread Mike Galbraith
On Thu, 2018-09-06 at 09:35 +0200, Sebastian Andrzej Siewior wrote:
> On 2018-09-05 08:28:02 [-0400], Steven Rostedt wrote:
> > 4.14.63-rt41-rc1 stable review patch.
> > If anyone has any objections, please let me know.
> > 
> > --
> > 
> > From: Sebastian Andrzej Siewior 
> > 
> > [ Upstream commit 2a9c45d8f89112458364285cbe2b0729561953f1 ]
> > 
> > Drop the Ultraviolet patch. UV looks broken upstream for PREEMPT, too.
> > Mike is the only person I know that has such a thing and he isn't going
> > to fix this upstream (from 1526977462.6491.1.ca...@gmx.de):
> 
> I don't think that we need to propagate that revert for stable. I
> reverted it in the devel tree because nobody wanted this upstream and I
> couldn't test it. For that reason I didn't see the point for having it
> in the RT tree.
> However, if you want to revert it for stable, be my guest. It probably
> will have no impact and if it will people might step forward and fix it
> properly / upstream.

I'm in favor of reverting it as useless cruft.  UV has been broken
forever wrt PREEMPT, and nobody cares.  The original interest in UV RT
support evaporated while 2.6.33-rt was still current (and when getting
it working took a bit more than a spinlock conversion). 

-Mike


Re: [regression/bisected] 4.19 cycle boot time IO stalls

2018-09-05 Thread Mike Galbraith
On Wed, 2018-09-05 at 07:39 -0600, Jens Axboe wrote:
> 
> I bet it's the host busy change from Ming, which I already
> reported as being the culprit for another test failure I had. For
> some reason it's not merged yet, nudge nudge Martin. You can test
> by reverting:
> 
> commit 328728630d9f2bf14b82ca30b5e47489beefe361
> Author: Ming Lei 
> Date:   Sun Jun 24 22:03:27 2018 +0800
> 
> scsi: core: avoid host-wide host_busy counter for scsi_mq
> 
> which was slightly modified by 265d59aacbce, so you'll want to
> yank that one first.

Bingo (woohoo, no dog slow take 2 required).

> BTW, that suse email for me hasn't worked in 12 years :-)

Yeah, I must have double tapped mouse, and picked up one of the many
ancient artifacts in Evolution's DB.  Sorry about that.

-Mike


[regression/bisected] 4.19 cycle boot time IO stalls

2018-09-05 Thread Mike Galbraith
Greetings,

I've been seeing $subject, decided to take the time to try to bisect
the little bugger.  The hangs are not 100% repeatable, and while
bisection with a 5 boot go/nogo threshold seemed to go smoothly, it
ended up fingering a merge commit (sigh).

Box has an SSD (used only by the Windows 10 the box came with) and 3 spinning
rust buckets that I normally use with BFQ via a udev rule, but CFQ does
the same, so scheduler is seemingly irrelevant.  However, in 7 crash
dumps, all of which look about like the data below the bisect log,
there is something relevant, namely the hung 'tlp' task (powersaving
script of some sort for laptops according to the package description). 
That knob twiddling script is present/hung in all, making me a tad
suspicious, and indeed, testing with the final (bad) kernel, all I have
to do to eliminate hangs is to remove the 'tlp' package.  Verified via
remove, 5 boots work fine, reinstall, 2 of 5 hang, remove again, 10 of
10 work fine, reinstall, 2 in a row hang.

Seems pretty certain that tlp script is what inspires bug to raise its
ugly head.  WRT bisection result itself, munged merge seems far less
likely than a false negative having knocked bisection off course.
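
For a feel of how easily a 5 boot go/nogo threshold can hand back a
false "good", here is a rough sketch in C.  It assumes boots are
independent and that a bad kernel hangs on a fixed fraction of boots;
the function names are made up for illustration, and the 0.4 used in
the note below is only a guess read off the "2 of 5 hang" observation.

```c
/*
 * Chance that a kernel which hangs on a fraction p_hang of boots gets
 * through `boots` consecutive clean boots and is wrongly marked "good".
 * Assumes boots are independent, which real hardware may not honor.
 */
double false_good_probability(double p_hang, int boots)
{
	double p_clean = 1.0;
	int i;

	for (i = 0; i < boots; i++)
		p_clean *= 1.0 - p_hang;

	return p_clean;
}

/*
 * Smallest number of clean boots needed before the false "good" chance
 * drops to max_false_good or below.  Returns -1 if p_hang is not in
 * (0, 1] or max_false_good is negative, since no number of boots would
 * ever be enough.
 */
int boots_needed(double p_hang, double max_false_good)
{
	int boots = 0;

	if (p_hang <= 0.0 || p_hang > 1.0 || max_false_good < 0.0)
		return -1;

	while (false_good_probability(p_hang, boots) > max_false_good)
		boots++;

	return boots;
}
```

With p_hang taken as 0.4, false_good_probability(0.4, 5) comes out near
0.078, i.e. roughly one truly bad kernel in thirteen would pass a 5 boot
test, and boots_needed(0.4, 0.01) says about 10 boots per step would be
needed to push that under 1%.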

72f02ba66bd83b54054da20eae550123de84da6f is the first bad commit
git bisect start
# good: [94710cac0ef4ee177a63b5227664b38c95bbf703] Linux 4.18
git bisect good 94710cac0ef4ee177a63b5227664b38c95bbf703
# bad: [60c1f89241d49bacf71035470684a8d7b4bb46ea] Merge tag 
'dma-mapping-4.19-2' of git://git.infradead.org/users/hch/dma-mapping
git bisect bad 60c1f89241d49bacf71035470684a8d7b4bb46ea
# good: [54dbe75bbf1e189982516de179147208e90b5e45] Merge tag 
'drm-next-2018-08-15' of git://anongit.freedesktop.org/drm/drm
git bisect good 54dbe75bbf1e189982516de179147208e90b5e45
# bad: [d5acba26bfa097a618be425522b1ec4269d3edaf] Merge tag 
'char-misc-4.19-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad d5acba26bfa097a618be425522b1ec4269d3edaf
# bad: [9bd553929f68921be0f2014dd06561e0c8249a0d] Merge tag 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
git bisect bad 9bd553929f68921be0f2014dd06561e0c8249a0d
# bad: [f91e654474d413201ae578820fb63f8a811f6c4e] Merge branch 'next-integrity' 
of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect bad f91e654474d413201ae578820fb63f8a811f6c4e
# good: [c1c2ad82c772966d3cdb9a4852329fa2cf71853a] Merge tag 'edac_for_4.19' of 
git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
git bisect good c1c2ad82c772966d3cdb9a4852329fa2cf71853a
# good: [51372570ac3c919b036e760f4ca449e81cf8e995] scsi: core: use 
blk_mq_run_hw_queues in scsi_kick_queue
git bisect good 51372570ac3c919b036e760f4ca449e81cf8e995
# good: [ac7da1b787d9ea43680c487613269742c48d8747] Merge branches 
'clk-actions-s700', 'clk-exynos-unused', 'clk-qcom-dispcc-845', 
'clk-scmi-round' and 'clk-cs2000-spdx' into clk-next
git bisect good ac7da1b787d9ea43680c487613269742c48d8747
# good: [6ff0497402ef7269ee6a72f62eb85adaa7a4768e] gpiolib: Fix of_node 
inconsistency
git bisect good 6ff0497402ef7269ee6a72f62eb85adaa7a4768e
# bad: [2b2f2aedba985108cbc92a761ac0d9fc4c774616] Merge tag 'gfs2-4.19.fixes' 
of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
git bisect bad 2b2f2aedba985108cbc92a761ac0d9fc4c774616
# good: [1d45bb7f9d2a5cbae1e5d9a5f72adad84db4d318] gfs2: Use iomap for stuffed 
direct I/O reads
git bisect good 1d45bb7f9d2a5cbae1e5d9a5f72adad84db4d318
# good: [db06f826ec12bf0701ea7fc0a3c0aa00b84417c8] Merge tag 'clk-for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
git bisect good db06f826ec12bf0701ea7fc0a3c0aa00b84417c8
# good: [3f30f929bb17877ebc1653c6f3ff41863f1ba524] gfs2: cleanup: call 
gfs2_rgrp_ondisk2lvb from gfs2_rgrp_out
git bisect good 3f30f929bb17877ebc1653c6f3ff41863f1ba524
# good: [dffe12a82826082d2129ef91b17b257254cb60fc] gfs2: Fix gfs2_testbit to 
use clone bitmaps
git bisect good dffe12a82826082d2129ef91b17b257254cb60fc
# good: [f5580d0f8bf60993a5fbc73ee04678070ffbba57] gfs2: eliminate 
update_rgrp_lvb_unlinked
git bisect good f5580d0f8bf60993a5fbc73ee04678070ffbba57
# bad: [72f02ba66bd83b54054da20eae550123de84da6f] Merge tag 'scsi-misc' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad 72f02ba66bd83b54054da20eae550123de84da6f
# first bad commit: [72f02ba66bd83b54054da20eae550123de84da6f] Merge tag 
'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

crash> ps | grep UN
417  2   7  8803f8992940  UN   0.0   0  0  [kworker/u16:4]
476  2   7  8803f02e9b80  UN   0.0   0  0  [jbd2/sdb3-8]
   3078   3072   6  8803fb70e040  UN   1.0 1706708 177212  sddm-greeter
   3204   3155   0  8803fb6e3700  UN   0.0   13784   3520  tlp
crash> bt 417
PID: 417TASK: 8803f8992940  CPU: 7   COMMAND: "kworker/u16:4"
 #0 [8803f04636f0] __schedule at 815cca93
 #1 [8803f0463780] schedule at 815cd0a8
 #2 

Re: bisected - arm64 kvm unit test failures

2018-08-22 Thread Mike Galbraith
On Wed, 2018-08-22 at 14:50 +0100, Marc Zyngier wrote:
> On 22/08/18 14:38, Mike Galbraith wrote:
> > On Tue, 2018-08-21 at 16:34 +0100, Marc Zyngier wrote:
> >> Could you give that patchlet[1] a go? It solves a similar issue for me
> >> on a different platform.
> >>
> >> [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032469.html
> > 
> > Yup, all better.
> 
> Now I'm not sure it actually fixes anything. Alex's test case seem to
> have opened a new can of worms... In tracing hell at the moment.

I'll keep the test box reserved a while longer.  When you get its
other leg stuffed into the casket, if you want testing, holler.

-Mike


Re: bisected - arm64 kvm unit test failures

2018-08-22 Thread Mike Galbraith
On Tue, 2018-08-21 at 16:34 +0100, Marc Zyngier wrote:
> Could you give that patchlet[1] a go? It solves a similar issue for me
> on a different platform.
> 
> [1] https://lists.cs.columbia.edu/pipermail/kvmarm/2018-August/032469.html

Yup, all better.

-Mike


Re: [BUG v4.14-rt] kernel BUG at /work/rt/stable-rt.git/kernel/sched/core.c:1639!

2018-08-19 Thread Mike Galbraith
On Sat, 2018-08-18 at 15:13 +0200, Mike Galbraith wrote:
> seems it has to be something from the 4.17 cycle that went back to
> 4.14-stable after 4.1[56]-stable trees went extinct.

See ("sched/core: Require cpu_active() in select_task_rq(), for user tasks")

Fix it like so?

sched: Allow pinned user tasks to be awakened to the CPU they pinned

Since 7af443ee16976, select_fallback_rq() will BUG() if the CPU to
which a task has pinned itself becomes !cpu_active() while it slept.
Serving a 10 megaton eviction notice is neither helpful nor required;
the task will migrate when it can do so.

Signed-off-by: Mike Galbraith 
---
 kernel/sched/core.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -980,7 +980,7 @@ static inline bool is_cpu_allowed(struct
if (!cpumask_test_cpu(cpu, p->cpus_ptr))
return false;
 
-   if (is_per_cpu_kthread(p))
+   if (is_per_cpu_kthread(p) || __migrate_disabled(p))
return cpu_online(cpu);
 
return cpu_active(cpu);
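
The effect of the one-liner above can be sketched in user space.  This
is only a toy model under stated assumptions: struct cpu_state, struct
task_model and is_cpu_allowed_model() are hypothetical stand-ins, not
kernel API, and the CPU mask is squeezed into a single unsigned long.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the per-CPU state the check reads. */
struct cpu_state {
	bool online;	/* models cpu_online(cpu) */
	bool active;	/* models cpu_active(cpu) */
};

/* Hypothetical stand-in for the bits of task_struct that matter here. */
struct task_model {
	unsigned long allowed_mask;	/* models p->cpus_ptr, one bit per CPU */
	bool per_cpu_kthread;		/* models is_per_cpu_kthread(p) */
	bool migrate_disabled;		/* models __migrate_disabled(p) */
};

/* Models is_cpu_allowed() with the proposed change applied. */
bool is_cpu_allowed_model(const struct task_model *p, int cpu,
			  const struct cpu_state *cpus)
{
	if (!(p->allowed_mask & (1UL << cpu)))
		return false;

	/* A migrate-disabled task is treated like a per-CPU kthread. */
	if (p->per_cpu_kthread || p->migrate_disabled)
		return cpus[cpu].online;

	return cpus[cpu].active;
}
```

For a CPU that is still online but no longer active (on its way down
during hotplug), the model lets a migrate-disabled task wake there and
turns an ordinary user task away, which is the distinction the patch
is after.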


Re: [BUG v4.14-rt] kernel BUG at /work/rt/stable-rt.git/kernel/sched/core.c:1639!

2018-08-18 Thread Mike Galbraith
On Sat, 2018-08-18 at 12:29 +0200, Mike Galbraith wrote:
> On Fri, 2018-08-17 at 16:23 -0400, Steven Rostedt wrote:
> > Pulling in stable releases into v4.14-rt I triggered this with my CPU
> > hotplug test:
> > 
> > [ cut here ]
> > kernel BUG at /work/rt/stable-rt.git/kernel/sched/core.c:1639!
> > invalid opcode:  [#1] PREEMPT SMP PTI
> > Modules linked in: sunrpc ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 
> > nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt 
> > snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep 
> > snd_seq snd_seq_device snd_pcm snd_timer snd shpchp i2c_i801 soundcore 
> > floppy i915 drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect 
> > syscopyarea i2c_algo_bit iosf_mbi video [last unloaded: speedstep_lib]
> > CPU: 1 PID: 2944 Comm: mkdumprd Not tainted 4.14.63-test-rt40+ #782
> > Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled 
> > by O.E.M., BIOS SDBLI944.86P 05/08/2007
> > task: 880037888d80 task.stack: c9538000
> > RIP: 0010:select_fallback_rq+0xc3/0x122
> 
> I noticed this upstream, and had started hunting for the origin, but
> had thought that 4.14-rt was OK.  Clearly not the case, but it's not
> 4.14.60.. stable changes interacting badly either, virgin 4.14.59-rt37
> just reproduced in a vm clone of my workstation.

4.15.18-rt37 (4.14-rt rolled forward) does not reproduce, nor does
4.16.18-rt12, but 4.17.0-rt5 (v4.16.12-rt5 rolled forward) does, so
seems it has to be something from the 4.17 cycle that went back to
4.14-stable after 4.1[56]-stable trees went extinct.

-Mike


Re: [BUG v4.14-rt] kernel BUG at /work/rt/stable-rt.git/kernel/sched/core.c:1639!

2018-08-18 Thread Mike Galbraith
On Fri, 2018-08-17 at 16:23 -0400, Steven Rostedt wrote:
> Pulling in stable releases into v4.14-rt I triggered this with my CPU
> hotplug test:
> 
> [ cut here ]
> kernel BUG at /work/rt/stable-rt.git/kernel/sched/core.c:1639!
> invalid opcode:  [#1] PREEMPT SMP PTI
> Modules linked in: sunrpc ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 
> nf_defrag_ipv6 ip6table_filter ip6_tables uinput snd_hda_codec_idt 
> snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep 
> snd_seq snd_seq_device snd_pcm snd_timer snd shpchp i2c_i801 soundcore floppy 
> i915 drm_kms_helper drm fb_sys_fops sysimgblt sysfillrect syscopyarea 
> i2c_algo_bit iosf_mbi video [last unloaded: speedstep_lib]
> CPU: 1 PID: 2944 Comm: mkdumprd Not tainted 4.14.63-test-rt40+ #782
> Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by 
> O.E.M., BIOS SDBLI944.86P 05/08/2007
> task: 880037888d80 task.stack: c9538000
> RIP: 0010:select_fallback_rq+0xc3/0x122

I noticed this upstream, and had started hunting for the origin, but
had thought that 4.14-rt was OK.  Clearly not the case, but it's not
4.14.60.. stable changes interacting badly either, virgin 4.14.59-rt37
just reproduced in a vm clone of my workstation.

-Mike


[PATCH] rcu: Convert rcu_state.ofl_lock to raw_spinlock_t

2018-08-15 Thread Mike Galbraith


1e64b15a4b10 ("rcu: Fix grace-period hangs due to race with CPU offline")
added spinlock_t ofl_lock to the rcu_state structure, then takes it with
preemption disabled during CPU offline, giving RT sleeping lock heartburn.

Convert it to raw_spinlock_t.

Signed-off-by: Mike Galbraith 
---
 kernel/rcu/tree.c |   12 ++--
 kernel/rcu/tree.h |2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -111,7 +111,7 @@ struct rcu_state sname##_state = { \
.abbr = sabbr, \
.exp_mutex = __MUTEX_INITIALIZER(sname##_state.exp_mutex), \
.exp_wake_mutex = __MUTEX_INITIALIZER(sname##_state.exp_wake_mutex), \
-   .ofl_lock = __SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
+   .ofl_lock = __RAW_SPIN_LOCK_UNLOCKED(sname##_state.ofl_lock), \
 }
 
 RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
@@ -1962,13 +1962,13 @@ static bool rcu_gp_init(struct rcu_state
 */
rsp->gp_state = RCU_GP_ONOFF;
rcu_for_each_leaf_node(rsp, rnp) {
-   spin_lock(&rsp->ofl_lock);
+   raw_spin_lock(&rsp->ofl_lock);
raw_spin_lock_irq_rcu_node(rnp);
if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
!rnp->wait_blkd_tasks) {
/* Nothing to do on this leaf rcu_node structure. */
raw_spin_unlock_irq_rcu_node(rnp);
-   spin_unlock(&rsp->ofl_lock);
+   raw_spin_unlock(&rsp->ofl_lock);
continue;
}
 
@@ -2004,7 +2004,7 @@ static bool rcu_gp_init(struct rcu_state
}
 
raw_spin_unlock_irq_rcu_node(rnp);
-   spin_unlock(&rsp->ofl_lock);
+   raw_spin_unlock(&rsp->ofl_lock);
}
rcu_gp_slow(rsp, gp_preinit_delay); /* Races with CPU hotplug. */
 
@@ -3892,7 +3892,7 @@ static void rcu_cleanup_dying_idle_cpu(i
 
/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
mask = rdp->grpmask;
-   spin_lock(&rsp->ofl_lock);
+   raw_spin_lock(&rsp->ofl_lock);
raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order 
guarantee. */
rdp->rcu_ofl_gp_seq = READ_ONCE(rsp->gp_seq);
rdp->rcu_ofl_gp_flags = READ_ONCE(rsp->gp_flags);
@@ -3903,7 +3903,7 @@ static void rcu_cleanup_dying_idle_cpu(i
}
rnp->qsmaskinitnext &= ~mask;
raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
-   spin_unlock(&rsp->ofl_lock);
+   raw_spin_unlock(&rsp->ofl_lock);
 }
 
 /*
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -368,7 +368,7 @@ struct rcu_state {
char abbr;  /* Abbreviated name. */
struct list_head flavors;   /* List of RCU flavors. */
 
-   spinlock_t ofl_lock cacheline_internodealigned_in_smp;
+   raw_spinlock_t ofl_lock cacheline_internodealigned_in_smp;
/* Synchronize offline with */
/*  GP pre-initialization. */
 };


Re: [PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-14 Thread Mike Galbraith
On Wed, 2018-08-15 at 11:59 +0800, Dave Young wrote:
> > Does this improve things, and plug the no boot hole?
> 
> Would you mind to tune my patch with some acpi_rsdp checking and add
> some error message in case kexec load failure? Eg. suggest people to use
> append acpi_rsdp for noefi booting etc.

Yeah, -ENODEV is better than hanging, but not very informative.

> I'm still not very satisfied with the code cleanup..

Not surprising, I didn't like it much either (ergo interrogative).

-Mike


Re: [PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-10 Thread Mike Galbraith
On Fri, 2018-08-10 at 18:28 +0800, Dave Young wrote:
> 
> > @@ -250,8 +253,10 @@ setup_boot_parameters(struct kimage *image, struct 
> > boot_params *params,
> >  
> >  #ifdef CONFIG_EFI
> > /* Setup EFI state */
> > -   setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
> > +   ret = setup_efi_state(params, params_load_addr, efi_map_offset, 
> > efi_map_sz,
> > efi_setup_data_offset);
> > +   if (ret)
> 
> Here should check efi_enabled(EFI_BOOT) && ret

Patch with that works for me.

> In case efi boot we need the efi info set correctly,  or one need pass
> acpi_rsdp= in kernel cmdline param.
> 
> Still not sure how to allow one to workaround it by using acpi_rsdp=
> param with kexec_file_load..

Does this improve things, and plug the no boot hole?

x86, kdump: cleanup efi setup data handling a bit

1. Remove efi specific variables from bzImage64_load() other than the
one it needs, efi_map_sz, passing it and params_cmdline_sz on to efi
setup functions, giving them all they need without duplication.

2. Only allocate space for efi setup data when a 1:1 mapping is available.
Bail early with -ENODEV if not available, but is required to boot, and
acpi_rsdp= was not passed on the command line. 

3. Use the proper config dependency to isolate efi setup functions,
adding a !EFI_RUNTIME_MAP stub for setup_efi_state().

4. Change efi functions that cannot fail to void. 

Signed-off-by: Mike Galbraith 
---
 arch/x86/kernel/kexec-bzimage64.c |   99 +-
 1 file changed, 45 insertions(+), 54 deletions(-)

--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -112,35 +112,32 @@ static int setup_e820_entries(struct boo
return 0;
 }
 
-#ifdef CONFIG_EFI
-static int setup_efi_info_memmap(struct boot_params *params,
+#ifdef CONFIG_EFI_RUNTIME_MAP
+static void setup_efi_info_memmap(struct boot_params *params,
  unsigned long params_load_addr,
- unsigned int efi_map_offset,
+ unsigned int params_cmdline_sz,
  unsigned int efi_map_sz)
 {
-   void *efi_map = (void *)params + efi_map_offset;
-   unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
+   void *efi_map = (void *)params + params_cmdline_sz;
+   unsigned long efi_map_phys_addr = params_load_addr + params_cmdline_sz;
struct efi_info *ei = &params->efi_info;
 
-   if (!efi_map_sz)
-   return -EINVAL;
-
efi_runtime_map_copy(efi_map, efi_map_sz);
 
ei->efi_memmap = efi_map_phys_addr & 0xffffffff;
ei->efi_memmap_hi = efi_map_phys_addr >> 32;
ei->efi_memmap_size = efi_map_sz;
-
-   return 0;
 }
 
-static int
+static void
 prepare_add_efi_setup_data(struct boot_params *params,
-  unsigned long params_load_addr,
-  unsigned int efi_setup_data_offset)
+  unsigned long params_load_addr,
+  unsigned int params_cmdline_sz,
+  unsigned int efi_map_sz)
 {
+   unsigned int data_offset = params_cmdline_sz + ALIGN(efi_map_sz, 16);
unsigned long setup_data_phys;
-   struct setup_data *sd = (void *)params + efi_setup_data_offset;
+   struct setup_data *sd = (void *)params + data_offset;
struct efi_setup_data *esd = (void *)sd + sizeof(struct setup_data);
 
esd->fw_vendor = efi.fw_vendor;
@@ -152,33 +149,20 @@ prepare_add_efi_setup_data(struct boot_p
sd->len = sizeof(struct efi_setup_data);
 
/* Add setup data */
-   setup_data_phys = params_load_addr + efi_setup_data_offset;
+   setup_data_phys = params_load_addr + data_offset;
sd->next = params->hdr.setup_data;
params->hdr.setup_data = setup_data_phys;
-
-   return 0;
 }
 
 static int
 setup_efi_state(struct boot_params *params, unsigned long params_load_addr,
-   unsigned int efi_map_offset, unsigned int efi_map_sz,
-   unsigned int efi_setup_data_offset)
+   unsigned int params_cmdline_sz, unsigned int efi_map_sz)
 {
struct efi_info *current_ei = &boot_params.efi_info;
struct efi_info *ei = &params->efi_info;
-   int ret;
-
-   if (!current_ei->efi_memmap_size)
-   return -EINVAL;
 
-   /*
-* If 1:1 mapping is not enabled, second kernel can not setup EFI
-* and use EFI run time services. User space will have to pass
-* acpi_rsdp= on kernel command line to make second kernel boot
-* without efi.
-*/
-   if (efi_enabled(EFI_OLD_MEMMAP) || !efi_enabled(EFI_RUNTIME_SERVICES))
-   return -ENODEV;
+   if (!efi_map_sz || !current_ei->efi_memmap_size)
+   r

Re: [PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-10 Thread Mike Galbraith
On Fri, 2018-08-10 at 16:45 +0800, Dave Young wrote:
> 
> BTW, this patch only fix the kexec load phase problem,  even if kexec
> load successfully with the fix, the 2nd kernel can not boot because efi
> memmap info is not correct and usable.

Hm.  I didn't do anything else with kexec, but did crashdump my box
both w/wo efi=noruntime.

> So we should go with some fix similar to below, and do the cleanup we
> mentioned with a separate patch later.

Ah, you mean the one I had _just_ built when I saw this :)

> Also user space kexec-tools need a similar patch to error out in case
> no runtime maps.  It would be good to fix both userspace and kernel
> load.
> 
> diff --git a/arch/x86/kernel/kexec-bzimage64.c 
> b/arch/x86/kernel/kexec-bzimage64.c
> index 7326078eaa7a..e34ba2f53cfb 100644
> --- a/arch/x86/kernel/kexec-bzimage64.c
> +++ b/arch/x86/kernel/kexec-bzimage64.c
> @@ -123,7 +123,7 @@ static int setup_efi_info_memmap(struct boot_params 
> *params,
>   struct efi_info *ei = &params->efi_info;
>  
>   if (!efi_map_sz)
> - return 0;
> + return -EINVAL;
>  
>   efi_runtime_map_copy(efi_map, efi_map_sz);
>  
> @@ -166,9 +166,10 @@ setup_efi_state(struct boot_params *params, unsigned 
> long params_load_addr,
>  {
>   struct efi_info *current_ei = &boot_params.efi_info;
>   struct efi_info *ei = &params->efi_info;
> + int ret;
>  
>   if (!current_ei->efi_memmap_size)
> - return 0;
> + return -EINVAL;
>  
>   /*
>* If 1:1 mapping is not enabled, second kernel can not setup EFI
> @@ -176,8 +177,8 @@ setup_efi_state(struct boot_params *params, unsigned long 
> params_load_addr,
>* acpi_rsdp= on kernel command line to make second kernel boot
>* without efi.
>*/
> - if (efi_enabled(EFI_OLD_MEMMAP))
> - return 0;
> + if (efi_enabled(EFI_OLD_MEMMAP) || !efi_enabled(EFI_RUNTIME_SERVICES))
> + return -ENODEV;
>  
>   ei->efi_loader_signature = current_ei->efi_loader_signature;
>   ei->efi_systab = current_ei->efi_systab;
> @@ -186,8 +187,10 @@ setup_efi_state(struct boot_params *params, unsigned 
> long params_load_addr,
>   ei->efi_memdesc_version = current_ei->efi_memdesc_version;
>   ei->efi_memdesc_size = efi_get_runtime_map_desc_size();
>  
> - setup_efi_info_memmap(params, params_load_addr, efi_map_offset,
> + ret = setup_efi_info_memmap(params, params_load_addr, efi_map_offset,
> efi_map_sz);
> + if (ret)
> + return ret;
>   prepare_add_efi_setup_data(params, params_load_addr,
>  efi_setup_data_offset);
>   return 0;
> @@ -250,8 +253,10 @@ setup_boot_parameters(struct kimage *image, struct 
> boot_params *params,
>  
>  #ifdef CONFIG_EFI
>   /* Setup EFI state */
> - setup_efi_state(params, params_load_addr, efi_map_offset, efi_map_sz,
> + ret = setup_efi_state(params, params_load_addr, efi_map_offset, 
> efi_map_sz,
>   efi_setup_data_offset);
> + if (ret)
> + return ret;
>  #endif
>  
>   /* Setup EDD info */
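
The order of the checks in the patch quoted above can be modelled in
user space roughly like so.  This is a sketch, not the kernel code:
struct efi_state_model and setup_efi_state_model() are hypothetical
names, and the efi_enabled() tests are reduced to plain booleans.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state the quoted checks look at. */
struct efi_state_model {
	unsigned int cur_memmap_size;	/* models current_ei->efi_memmap_size */
	unsigned int efi_map_sz;	/* size handed to setup_efi_info_memmap() */
	bool old_memmap;		/* models efi_enabled(EFI_OLD_MEMMAP) */
	bool runtime_services;		/* models efi_enabled(EFI_RUNTIME_SERVICES) */
};

/* Returns 0 when EFI state can be passed on, a negative errno otherwise. */
int setup_efi_state_model(const struct efi_state_model *s)
{
	/* No memmap recorded for the running kernel. */
	if (!s->cur_memmap_size)
		return -EINVAL;

	/* No usable 1:1 mapping: the second kernel cannot set up EFI. */
	if (s->old_memmap || !s->runtime_services)
		return -ENODEV;

	/* Mirrors the check in setup_efi_info_memmap(): nothing to copy. */
	if (!s->efi_map_sz)
		return -EINVAL;

	return 0;
}
```

In this model an efi=noruntime boot (runtime_services false) comes back
with -ENODEV before any map copy is attempted, so the load fails
cleanly instead of tripping over a missing map.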


[PATCH] x86, kdump: Fix efi=noruntime NULL pointer dereference

2018-08-08 Thread Mike Galbraith
When booting with efi=noruntime, we call efi_runtime_map_copy() while
loading the kdump kernel, and trip over a NULL efi.memmap.map.  Avoid
that and a useless allocation when the only mapping we can use (1:1)
is not available.

Signed-off-by: Mike Galbraith 
---
 arch/x86/kernel/kexec-bzimage64.c |   22 +++---
 1 file changed, 11 insertions(+), 11 deletions(-)

--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -122,9 +122,6 @@ static int setup_efi_info_memmap(struct
unsigned long efi_map_phys_addr = params_load_addr + efi_map_offset;
struct efi_info *ei = &params->efi_info;
 
-   if (!efi_map_sz)
-   return 0;
-
efi_runtime_map_copy(efi_map, efi_map_sz);
 
ei->efi_memmap = efi_map_phys_addr & 0xffffffff;
@@ -176,7 +173,7 @@ setup_efi_state(struct boot_params *para
 * acpi_rsdp= on kernel command line to make second kernel boot
 * without efi.
 */
-   if (efi_enabled(EFI_OLD_MEMMAP))
+   if (efi_enabled(EFI_OLD_MEMMAP) || !efi_enabled(EFI_MEMMAP))
return 0;
 
ei->efi_loader_signature = current_ei->efi_loader_signature;
@@ -338,7 +335,7 @@ static void *bzImage64_load(struct kimag
struct kexec_entry64_regs regs64;
void *stack;
unsigned int setup_hdr_offset = offsetof(struct boot_params, hdr);
-   unsigned int efi_map_offset, efi_map_sz, efi_setup_data_offset;
+   unsigned int efi_map_offset = 0, efi_map_sz = 0, efi_setup_data_offset 
= 0;
struct kexec_buf kbuf = { .image = image, .buf_max = ULONG_MAX,
  .top_down = true };
struct kexec_buf pbuf = { .image = image, .buf_min = MIN_PURGATORY_ADDR,
@@ -397,19 +394,22 @@ static void *bzImage64_load(struct kimag
 * have to create separate segment for each. Keeps things
 * little bit simple
 */
-   efi_map_sz = efi_get_runtime_map_size();
params_cmdline_sz = sizeof(struct boot_params) + cmdline_len +
MAX_ELFCOREHDR_STR_LEN;
params_cmdline_sz = ALIGN(params_cmdline_sz, 16);
-   kbuf.bufsz = params_cmdline_sz + ALIGN(efi_map_sz, 16) +
-   sizeof(struct setup_data) +
-   sizeof(struct efi_setup_data);
+   kbuf.bufsz = params_cmdline_sz + sizeof(struct setup_data);
+
+   /* Now add space for the efi stuff if we have a useable 1:1 mapping. */
+   if (!efi_enabled(EFI_OLD_MEMMAP) && efi_enabled(EFI_MEMMAP)) {
+   efi_map_sz = efi_get_runtime_map_size();
+   kbuf.bufsz += ALIGN(efi_map_sz, 16) + sizeof(struct efi_setup_data);
+   efi_map_offset = params_cmdline_sz;
+   efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16);
+   }
 
params = kzalloc(kbuf.bufsz, GFP_KERNEL);
if (!params)
return ERR_PTR(-ENOMEM);
-   efi_map_offset = params_cmdline_sz;
-   efi_setup_data_offset = efi_map_offset + ALIGN(efi_map_sz, 16);
 
/* Copy setup header onto bootparams. Documentation/x86/boot.txt */
setup_header_size = 0x0202 + kernel[0x0201] - setup_hdr_offset;
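
For readers following the offset arithmetic in the patch above, here is a minimal userspace sketch of the same layout decision. It is not kernel code: struct params_layout, plan_params_layout() and ALIGN_UP() are hypothetical names, and the sizes are passed in as plain numbers rather than taken from the real structures. The point is only that the EFI map and setup data get space and non-zero offsets when a usable 1:1 mapping exists, and otherwise everything EFI related stays zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Round x up to a multiple of a (a must be a power of two). */
#define ALIGN_UP(x, a) (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/* Hypothetical stand-in for the values bzImage64_load() computes. */
struct params_layout {
	size_t bufsz;                 /* total bytes to allocate */
	size_t efi_map_sz;            /* 0 when no EFI map is copied */
	size_t efi_map_offset;        /* 0 when no EFI map is copied */
	size_t efi_setup_data_offset; /* 0 when no EFI map is copied */
};

static struct params_layout
plan_params_layout(size_t params_cmdline_sz, size_t setup_data_sz,
		   size_t efi_setup_data_sz, size_t runtime_map_sz,
		   bool have_1to1_map)
{
	struct params_layout l = { 0 };

	/* The caller is assumed to have aligned this already. */
	assert(params_cmdline_sz % 16 == 0);

	/* Always needed: boot params + command line + one setup_data header. */
	l.bufsz = params_cmdline_sz + setup_data_sz;

	/* Only reserve room for the EFI bits when they can be used. */
	if (have_1to1_map) {
		l.efi_map_sz = runtime_map_sz;
		l.bufsz += ALIGN_UP(runtime_map_sz, 16) + efi_setup_data_sz;
		l.efi_map_offset = params_cmdline_sz;
		l.efi_setup_data_offset = l.efi_map_offset +
					  ALIGN_UP(runtime_map_sz, 16);
	}
	return l;
}
```

With have_1to1_map false, the runtime map size is ignored entirely, which mirrors the patch not calling efi_get_runtime_map_size() in that case.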


Re: [rt-patch 4/3] arm,KVM: Move phys_timer handling to hard irq context

2018-08-04 Thread Mike Galbraith
On Sat, 2018-08-04 at 14:25 +0200, Mike Galbraith wrote:
> 
> Besides, there are more interesting fish in the arm64 sea than kvm.
> 
> virgin 4.16.18-rt12-rt
> 
> [  537.236131] ITS queue timeout (65440 65504 4640)
> [  537.236150] ITS cmd its_build_inv_cmd failed

FWIW, I had thought that something 4.16..master had made that business
go entirely away, but it only became much more infrequent.  So much for
plan to slowly sneak up on whatever that was via ~bisect.



Re: [rt-patch 4/3] arm,KVM: Move phys_timer handling to hard irq context

2018-08-04 Thread Mike Galbraith
On Thu, 2018-08-02 at 19:43 +0200, Mike Galbraith wrote:
> On Thu, 2018-08-02 at 18:50 +0200, Mike Galbraith wrote:
> > On Thu, 2018-08-02 at 12:31 -0400, Steven Rostedt wrote:
> > > On Thu, 02 Aug 2018 08:56:20 +0200
> > > Mike Galbraith  wrote:
> > > 
> > > > (arm-land adventures 1/3 take2 will have to wait, my cup runeth over) 
> > > > 
> > > > v4.14..v4.15 timer handling changes including calling 
> > > > kvm_timer_vcpu_load()
> > > 
> > > I take it that this should be added to v4.16-rt and marked stable-rt?
> > 
> > Yeah, barring way sexier ideas of course.
> 
> Gah, wait.  WRT marking for stable-rt, only the pmu allocation splat
> fix is applicable to @stable-rt, kvm locking woes start at 4.15.

None of that series got picked up, so the pmu patch is moot as well. 
The warning is harmless anyway.

Besides, there are more interesting fish in the arm64 sea than kvm.

virgin 4.16.18-rt12-rt

[  537.236131] ITS queue timeout (65440 65504 4640)
[  537.236150] ITS cmd its_build_inv_cmd failed
[  537.236160] sched: RT throttling activated
[ 1229.180124] ITS queue timeout (65440 65504 3744)
[ 1229.180134] ITS cmd its_build_inv_cmd failed

(converts those to trace_printk [invalidating virgin source warranty])

# tracer: nop
#
#  _-=> irqs-off
# / _=> need-resched
#| /  _=> need-resched_lazy
#|| / _---=> hardirq/softirq
#||| / _--=> preempt-depth
# / delay
#   TASK-PID   CPU#  |TIMESTAMP  FUNCTION
#  | |   |   |   | |
  <idle>-0 [000] d..h2..    84.967153: its_wait_for_range_completion+0xa8/0xe0: ITS queue timeout (65440 65504 96)
  <idle>-0 [000] d..h2..    84.967155: its_send_single_command+0xd4/0x130: ITS cmd its_build_inv_cmd failed

(starts cyclictest+kbuild)

[ 9609.602489] NOHZ: local_softirq_pending 10
[10089.311713] NOHZ: local_softirq_pending 80
[10089.311735] NOHZ: local_softirq_pending 80
[10097.315809] NOHZ: local_softirq_pending 80
[10112.262436] NOHZ: local_softirq_pending 10
[10118.966066] NOHZ: local_softirq_pending 80
[10121.211106] NOHZ: local_softirq_pending 80
[10121.287093] NOHZ: local_softirq_pending 80
[10329.501202] NOHZ: local_softirq_pending 80
[10329.501231] NOHZ: local_softirq_pending 80

T:60 (17291) P:99 I:31000 C:  18128 Min:  5 Act:   12 Avg:  107 Max: 1581163 <== Thar she blows!

 irq/464-eth0-Tx-3955  [060] d...111 11015.059743: 
its_wait_for_range_completion+0xa8/0xe0: ITS queue timeout (65440 65504 5280)
 irq/464-eth0-Tx-3955  [060] d...111 11015.059749: 
 => irq_chip_unmask_parent+0x2c/0x38
 => its_unmask_msi_irq+0x28/0x38
 => unmask_irq.part.4+0x30/0x50
 => unmask_threaded_irq+0x4c/0x58
 => irq_finalize_oneshot.part.2+0x9c/0x100
 => irq_forced_thread_fn+0x60/0x98
 => irq_thread+0x148/0x1e0
 => kthread+0x140/0x148
 => ret_from_fork+0x10/0x18
 irq/464-eth0-Tx-3955  [060] d...111 11015.059751: 
its_send_single_command+0xd4/0x130: ITS cmd its_build_inv_cmd failed
 irq/464-eth0-Tx-3955  [060] d...111 11015.059752: 
 => its_unmask_msi_irq+0x28/0x38
 => unmask_irq.part.4+0x30/0x50
 => unmask_threaded_irq+0x4c/0x58
 => irq_finalize_oneshot.part.2+0x9c/0x100
 => irq_forced_thread_fn+0x60/0x98
 => irq_thread+0x148/0x1e0
 => kthread+0x140/0x148
 => ret_from_fork+0x10/0x18
 irq/464-eth0-Tx-3955  [060] d..h311 11015.059764: update_curr_rt+0x280/0x320: 
sched: RT throttling activated
 irq/464-eth0-Tx-3955  [060] d..h311 11015.059765: 
 => update_process_times+0x48/0x60
 => tick_sched_handle.isra.5+0x34/0x70
 => tick_sched_timer+0x54/0xa8
 => __hrtimer_run_queues+0xf8/0x388
 => hrtimer_interrupt+0xe4/0x258
 => arch_timer_handler_phys+0x38/0x58
 => handle_percpu_devid_irq+0xa8/0x2c8
 => generic_handle_irq+0x34/0x50
 => __handle_domain_irq+0x8c/0x100
 => gic_handle_irq+0x80/0x18c
 => el1_irq+0xb4/0x13c
 => _raw_spin_unlock_irq+0x2c/0x78
 => irq_finalize_oneshot.part.2+0x6c/0x100
 => irq_forced_thread_fn+0x60/0x98
 => irq_thread+0x148/0x1e0
 => kthread+0x140/0x148
 => ret_from_fork+0x10/0x18
  <idle>-0 [008] d..h3.. 11015.374565: sched_rt_period_timer+0x2f0/0x420: sched: RT throttling deactivated
  <idle>-0 [008] d..h3.. 11015.374570: 
 => arch_timer_handler_phys+0x38/0x58
 => handle_percpu_devid_irq+0xa8/0x2c8
 => generic_handle_irq+0x34/0x50
 => __handle_domain_irq+0x8c/0x100
 => gic_handle_irq+0x80/0x18c
 => el1_irq+0xb4/0x13c
 => arch_cpu_idle+0x2c/0x1d0
 => default_idle_call+0x30/0x48
 => do_idle+0x1c4/0x218
 => cpu_startup_entry+0x2c/0x30
 => secondary_start_kernel+0x184/0x1d0


Re: [rt-patch 4/3] arm,KVM: Move phys_timer handling to hard irq context

2018-08-02 Thread Mike Galbraith
On Thu, 2018-08-02 at 18:50 +0200, Mike Galbraith wrote:
> On Thu, 2018-08-02 at 12:31 -0400, Steven Rostedt wrote:
> > On Thu, 02 Aug 2018 08:56:20 +0200
> > Mike Galbraith  wrote:
> > 
> > > (arm-land adventures 1/3 take2 will have to wait, my cup runeth over) 
> > > 
> > > v4.14..v4.15 timer handling changes including calling 
> > > kvm_timer_vcpu_load()
> > 
> > I take it that this should be added to v4.16-rt and marked stable-rt?
> 
> Yeah, barring way sexier ideas of course.

Gah, wait.  WRT marking for stable-rt, only the pmu allocation splat
fix is applicable to @stable-rt, kvm locking woes start at 4.15.

-Mike


Re: [rt-patch 4/3] arm,KVM: Move phys_timer handling to hard irq context

2018-08-02 Thread Mike Galbraith
On Thu, 2018-08-02 at 12:31 -0400, Steven Rostedt wrote:
> On Thu, 02 Aug 2018 08:56:20 +0200
> Mike Galbraith  wrote:
> 
> > (arm-land adventures 1/3 take2 will have to wait, my cup runeth over) 
> > 
> > v4.14..v4.15 timer handling changes including calling kvm_timer_vcpu_load()
> 
> I take it that this should be added to v4.16-rt and marked stable-rt?

Yeah, barring way sexier ideas of course.

-Mike


[rt-patch 1/3 v2] arm64/acpi/perf: move pmu allocation to an early CPU up hook

2018-08-02 Thread Mike Galbraith
(bah, make a clean The End to adventures in arm-land)

RT cannot allocate while irqs are disabled.

  BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:974
  in_atomic(): 0, irqs_disabled(): 128, pid: 25, name: cpuhp/0
  CPU: 0 PID: 25 Comm: cpuhp/0 Not tainted 4.16.18-rt10-rt #2
  Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.32 08/22/2017
  Call trace:
   dump_backtrace+0x0/0x188
   show_stack+0x24/0x30
   dump_stack+0x9c/0xd0
   ___might_sleep+0x124/0x188
   rt_spin_lock+0x40/0x80
   pcpu_alloc+0x104/0x7a0
   __alloc_percpu_gfp+0x38/0x48
   __armpmu_alloc+0x44/0x168
   armpmu_alloc_atomic+0x1c/0x28
   arm_pmu_acpi_cpu_starting+0x1cc/0x210
   cpuhp_invoke_callback+0xb8/0x820
   cpuhp_thread_fun+0xc0/0x1e0
   smpboot_thread_fn+0x1ac/0x2c8
   kthread+0x134/0x138
   ret_from_fork+0x10/0x18

Do the allocation and other preparation for probe along with the other
CPUHP_PERF_{ARCH}_PREPARE stages, where we'll be preemptible, thus no
longer requiring a GFP_ATOMIC allocation either.

Signed-off-by: Mike Galbraith 
---
 drivers/perf/arm_pmu_acpi.c |   12 ++--
 include/linux/cpuhotplug.h  |2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

--- a/drivers/perf/arm_pmu_acpi.c
+++ b/drivers/perf/arm_pmu_acpi.c
@@ -135,10 +135,10 @@ static struct arm_pmu *arm_pmu_acpi_find
return pmu;
}
 
-   pmu = armpmu_alloc_atomic();
+   pmu = armpmu_alloc();
if (!pmu) {
pr_warn("Unable to allocate PMU for CPU%d\n",
-   smp_processor_id());
+   raw_smp_processor_id());
return NULL;
}
 
@@ -185,7 +185,7 @@ static bool pmu_irq_matches(struct arm_p
  * coming up. The perf core won't open events while a hotplug event is in
  * progress.
  */
-static int arm_pmu_acpi_cpu_starting(unsigned int cpu)
+static int arm_pmu_acpi_cpu_prepare(unsigned int cpu)
 {
struct arm_pmu *pmu;
struct pmu_hw_events __percpu *hw_events;
@@ -283,9 +283,9 @@ static int arm_pmu_acpi_init(void)
if (ret)
return ret;
 
-   ret = cpuhp_setup_state(CPUHP_AP_PERF_ARM_ACPI_STARTING,
-   "perf/arm/pmu_acpi:starting",
-   arm_pmu_acpi_cpu_starting, NULL);
+   ret = cpuhp_setup_state(CPUHP_PERF_ARM_PMU_ACPI_PREPARE,
+   "perf/arm/pmu_acpi:prepare",
+   arm_pmu_acpi_cpu_prepare, NULL);
 
return ret;
 }
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -29,6 +29,7 @@ enum cpuhp_state {
CPUHP_PERF_PREPARE,
CPUHP_PERF_X86_PREPARE,
CPUHP_PERF_X86_AMD_UNCORE_PREP,
+   CPUHP_PERF_ARM_PMU_ACPI_PREPARE,
CPUHP_PERF_BFIN,
CPUHP_PERF_POWER,
CPUHP_PERF_SUPERH,
@@ -114,7 +115,6 @@ enum cpuhp_state {
CPUHP_AP_ARM_VFP_STARTING,
CPUHP_AP_ARM64_DEBUG_MONITORS_STARTING,
CPUHP_AP_PERF_ARM_HW_BREAKPOINT_STARTING,
-   CPUHP_AP_PERF_ARM_ACPI_STARTING,
CPUHP_AP_PERF_ARM_STARTING,
CPUHP_AP_ARM_L2X0_STARTING,
CPUHP_AP_ARM_ARCH_TIMER_STARTING,
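
As a rough illustration of why the stage matters, here is a toy userspace model. It is explicitly not what the patch does (the patch moves the whole callback to a PREPARE state); it only shows the general idea that a sleeping allocation has to happen in a preemptible stage, while an irqs-off stage may at most consume what was set up earlier. Every toy_* name is hypothetical and none of this is the cpuhp API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define TOY_NR_CPUS 4

static bool toy_irqs_disabled;      /* models irqs_disabled() */
static int *toy_pmu[TOY_NR_CPUS];   /* per-CPU object set up at hotplug */

/* Stand-in for a sleeping allocation: refuses to run in atomic context. */
static void *toy_alloc_may_sleep(size_t size)
{
	if (toy_irqs_disabled)
		return NULL;
	return calloc(1, size);
}

/* "prepare" stage: preemptible, runs before the CPU comes up. */
static int toy_cpu_prepare(unsigned int cpu)
{
	assert(cpu < TOY_NR_CPUS);
	if (!toy_pmu[cpu])
		toy_pmu[cpu] = toy_alloc_may_sleep(sizeof(int));
	return toy_pmu[cpu] ? 0 : -1;
}

/* "starting" stage: irqs off, so it only uses what prepare left behind. */
static int toy_cpu_starting(unsigned int cpu)
{
	int ok;

	assert(cpu < TOY_NR_CPUS);
	toy_irqs_disabled = true;
	ok = toy_pmu[cpu] != NULL;
	if (ok)
		*toy_pmu[cpu] = (int)cpu;
	toy_irqs_disabled = false;
	return ok ? 0 : -1;
}
```

In the model, calling toy_cpu_starting() without a prior toy_cpu_prepare() fails, which is the toy equivalent of the splat above.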


Re: cpu stopper threads and setaffinity leads to deadlock

2018-08-02 Thread Mike Galbraith
On Thu, 2018-08-02 at 10:12 +0200, Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 06:34:40PM -0700, Sodagudi Prasad wrote:
> > diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> > index e190d1e..f932e1e 100644
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -87,9 +87,9 @@ static bool cpu_stop_queue_work(unsigned int cpu, struct cpu_stop_work *work)
> > __cpu_stop_queue_work(stopper, work, &wakeq);
> > else if (work->done)
> > cpu_stop_signal_done(work->done);
> > -   raw_spin_unlock_irqrestore(&stopper->lock, flags);
> > 
> > wake_up_q(&wakeq);
> > +   raw_spin_unlock_irqrestore(&stopper->lock, flags);
> > 
> > 
> 
> That puts the wakeup back under stopper lock, which causes another
> deadlock iirc.

Yup, one you fixed.

0b26351b910fb (Peter Zijlstra 2018-04-20 11:50:05 +0200 92) wake_up_q(&wakeq);
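
For anyone unfamiliar with the pattern being defended here, a toy single-threaded sketch of "collect wakeups under the lock, issue them after dropping it" follows. The toy_* names are made up for illustration and are not the kernel's wake_q API; the model only counts whether a wakeup was ever issued while the lock was still held.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_WAKE 8

struct toy_lock { bool held; };

struct toy_wake_q {
	int tasks[MAX_WAKE];
	int count;
};

static int woken_total;       /* wakeups issued */
static int woken_under_lock;  /* wakeups issued while the lock was held */

static void toy_lock_acquire(struct toy_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Under the lock: only remember who needs waking. */
static void toy_wake_q_add(struct toy_wake_q *q, int task)
{
	assert(q->count < MAX_WAKE);
	q->tasks[q->count++] = task;
}

/* Issue the deferred wakeups and note if the lock is still held. */
static void toy_wake_up_q(struct toy_wake_q *q, const struct toy_lock *l)
{
	for (int i = 0; i < q->count; i++) {
		if (l->held)
			woken_under_lock++;
		woken_total++;
	}
	q->count = 0;
}

static void toy_queue_work(struct toy_lock *l, int task)
{
	struct toy_wake_q q = { .count = 0 };

	toy_lock_acquire(l);
	toy_wake_q_add(&q, task);
	toy_lock_release(l);

	toy_wake_up_q(&q, l);  /* wakeup happens outside the lock */
}
```

Swapping the last two steps of toy_queue_work() is the toy version of the quoted diff: the wakeup would then run with the lock held.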


[rt-patch 4/3] arm,KVM: Move phys_timer handling to hard irq context

2018-08-02 Thread Mike Galbraith
(arm-land adventures 1/3 take2 will have to wait, my cup runeth over) 

v4.14..v4.15 timer handling changes including calling kvm_timer_vcpu_load()
during kvm_preempt_ops.sched_in and taking vgic_dist.lpi_list_lock in the
timer interrupt handler required locks for which locking rules/context had
been changed be converted to raw_spinlock_t...

Quoting virt/kvm/arm/vgic/vgic.c:
 * Locking order is always:
 * kvm->lock (mutex)
 *   its->cmd_lock (mutex)
 *     its->its_lock (mutex)
 *       vgic_cpu->ap_list_lock         must be taken with IRQs disabled
 *         kvm->lpi_list_lock           must be taken with IRQs disabled
 *           vgic_irq->irq_lock         must be taken with IRQs disabled
 *
 * As the ap_list_lock might be taken from the timer interrupt handler,
 * we have to disable IRQs before taking this lock and everything lower
 * than it.

...and fixed the obvious bricking consequence of those changes for RT,
but left an RT specific kvm unit test timer failure in its wake.  Handling
phys_timer in hard interrupt context as expected cures that failure.

Pre:
PASS selftest-setup (2 tests)
PASS selftest-vectors-kernel (2 tests)
PASS selftest-vectors-user (2 tests)
PASS selftest-smp (65 tests)
PASS pci-test (1 tests)
PASS pmu (3 tests)
PASS gicv2-ipi (3 tests)
PASS gicv3-ipi (3 tests)
PASS gicv2-active (1 tests)
PASS gicv3-active (1 tests)
PASS psci (4 tests)
FAIL timer (8 tests, 1 unexpected failures)

Post:
PASS selftest-setup (2 tests)
PASS selftest-vectors-kernel (2 tests)
PASS selftest-vectors-user (2 tests)
PASS selftest-smp (65 tests)
PASS pci-test (1 tests)
PASS pmu (3 tests)
PASS gicv2-ipi (3 tests)
PASS gicv3-ipi (3 tests)
PASS gicv2-active (1 tests)
PASS gicv3-active (1 tests)
PASS psci (4 tests)
PASS timer (8 tests)

Signed-off-by: Mike Galbraith 
---
 virt/kvm/arm/arch_timer.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/virt/kvm/arm/arch_timer.c
+++ b/virt/kvm/arm/arch_timer.c
@@ -634,7 +634,7 @@ void kvm_timer_vcpu_init(struct kvm_vcpu
hrtimer_init(&timer->bg_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
timer->bg_timer.function = kvm_bg_timer_expire;
 
-   hrtimer_init(&timer->phys_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+   hrtimer_init(&timer->phys_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS_HARD);
timer->phys_timer.function = kvm_phys_timer_expire;
 
vtimer->irq.irq = default_vtimer_irq.irq;


Re: bisected - arm64 kvm unit test failures

2018-08-01 Thread Mike Galbraith
On Wed, 2018-08-01 at 08:22 +0100, Marc Zyngier wrote:
> On Wed, 01 Aug 2018 07:02:25 +0100,
> Mike Galbraith  wrote:
> > 
> > On Wed, 2018-08-01 at 06:35 +0100, Marc Zyngier wrote:
> > > 
> > > Is it something that is reproducible with the current mainline (non-RT)?
> > 
> > These waters are a bit muddy, it's config dependent.  I'm trying to
> > generate a reproducing !RT config for -rc7 as we speak.  If I build
> > openSUSE/master-default, it does NOT reproduce.  That with the bisect
> > config just finished building, and yup, it reproduced
> > (attached).
> 
> Thanks for that.

Cheers.  The dependency is THP, which is disabled in RT.

--- config.save 2018-05-18 17:59:44.729165480 +0200
+++ .config 2018-08-01 11:00:24.484148316 +0200
@@ -632,7 +632,10 @@ CONFIG_DEFAULT_MMAP_MIN_ADDR=65536
 CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
 CONFIG_MEMORY_FAILURE=y
 # CONFIG_HWPOISON_INJECT is not set
-# CONFIG_TRANSPARENT_HUGEPAGE is not set
+CONFIG_TRANSPARENT_HUGEPAGE=y
+CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
+# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
+CONFIG_TRANSPARENT_HUGE_PAGECACHE=y
 CONFIG_CLEANCACHE=y
 CONFIG_FRONTSWAP=y
 CONFIG_CMA=y
@@ -4306,6 +4309,7 @@ CONFIG_RAS=y
 # CONFIG_ANDROID is not set
 # CONFIG_LIBNVDIMM is not set
 CONFIG_DAX=y
+# CONFIG_DEV_DAX is not set
 CONFIG_NVMEM=y
 # CONFIG_MTK_EFUSE is not set
 # CONFIG_QCOM_QFPROM is not set
@@ -5175,6 +5179,7 @@ CONFIG_DECOMPRESS_XZ=y
 CONFIG_DECOMPRESS_LZO=y
 CONFIG_DECOMPRESS_LZ4=y
 CONFIG_GENERIC_ALLOCATOR=y
+CONFIG_RADIX_TREE_MULTIORDER=y
 CONFIG_ASSOCIATIVE_ARRAY=y
 CONFIG_HAS_IOMEM=y
 CONFIG_HAS_IOPORT_MAP=y



Re: bisected - arm64 kvm unit test failures

2018-08-01 Thread Mike Galbraith
On Wed, 2018-08-01 at 08:22 +0100, Marc Zyngier wrote:
> 
> > Box is a 4 node/64 core TaiShan 2280.
> 
> Is that what is also known as D05/HIP07, with 64 Cortex-A72?

No idea, our rent-a-box web client shows nothing informative.

-Mike


Re: bisected - arm64 kvm unit test failures

2018-08-01 Thread Mike Galbraith
On Wed, 2018-08-01 at 06:35 +0100, Marc Zyngier wrote:
> 
> Is it something that is reproducible with the current mainline (non-RT)?

These waters are a bit muddy, it's config dependent.  I'm trying to
generate a reproducing !RT config for -rc7 as we speak.  If I build
openSUSE/master-default, it does NOT reproduce.  That with the bisect
config just finished building, and yup, it reproduced (attached). 

Every RT tree 4.16..master.today reproduces, and is fixed up (modulo
the RT specific failure) by reverting the fingered commit.

> Pretty worrying. What HW is that on?

Box is a 4 node/64 core TaiShan 2280.

-Mike

config-4.18.0-rc7-bisect.xz
Description: application/xz


bisected - arm64 kvm unit test failures

2018-07-31 Thread Mike Galbraith
On Mon, 2018-07-30 at 18:24 +0200, Mike Galbraith wrote:
> On Sun, 2018-07-29 at 13:47 +0200, Mike Galbraith wrote:
> > FYI, per kvm unit tests, 4.16-rt definitely has more kvm issues.

But it's not RT, or rather most of it isn't...

> > huawei5:/abuild/mike/kvm-unit-tests # uname -r
> > 4.16.18-rt11-rt
> > huawei5:/abuild/mike/kvm-unit-tests # ./run_tests.sh
> > PASS selftest-setup (2 tests)
> > FAIL selftest-vectors-kernel 
> > FAIL selftest-vectors-user 
> > PASS selftest-smp (65 tests)
> > PASS pci-test (1 tests)
> > PASS pmu (3 tests)
> > FAIL gicv2-ipi 
> > FAIL gicv3-ipi 
> > FAIL gicv2-active 
> > FAIL gicv3-active 
> > PASS psci (4 tests)
> > FAIL timer 
> > huawei5:/abuild/mike/kvm-unit-tests #
> > 
> > 4.14-rt passes all tests.  The above is with the kvm raw_spinlock_t
> > conversion patch applied, but the 4.12 based SLERT tree I cloned to
> > explore arm-land in the first place shows only one timer failure, and
> > has/needs it applied as well, which would seem to vindicate it.
> > 
> > huawei5:/abuild/mike/kvm-unit-tests # uname -r
> > 4.12.14-0.gec0b559-rt
> > huawei5:/abuild/mike/kvm-unit-tests # ./run_tests.sh
> > PASS selftest-setup (2 tests)
> > PASS selftest-vectors-kernel (2 tests)
> > PASS selftest-vectors-user (2 tests)
> > PASS selftest-smp (65 tests)
> > PASS pci-test (1 tests)
> > PASS pmu (3 tests)
> > PASS gicv2-ipi (3 tests)
> > PASS gicv3-ipi (3 tests)
> > PASS gicv2-active (1 tests)
> > PASS gicv3-active (1 tests)
> > PASS psci (4 tests)
> > FAIL timer (8 tests, 1 unexpected failures)
> 
> FWIW, this single timer failure was inspired by something in the 4.15
> merge window.

As noted, the single timer failure is an RT issue of some sort, and
remains.  The rest I bisected in @stable with the attached config, and
confirmed that revert fixes up 4.16-rt as well (modulo singleton).

a9c0e12ebee56ef06b7eccdbc73bab71d0018df8 is the first bad commit
commit a9c0e12ebee56ef06b7eccdbc73bab71d0018df8
Author: Marc Zyngier 
Date:   Mon Oct 23 17:11:20 2017 +0100

KVM: arm/arm64: Only clean the dcache on translation fault

The only case where we actually need to perform a dcache maintenance
is when we map the page for the first time, and subsequent permission
faults do not require cache maintenance. Let's make it conditional
on not being a permission fault (and thus a translation fault).

Reviewed-by: Christoffer Dall 
Signed-off-by: Marc Zyngier 
Signed-off-by: Christoffer Dall 

:040000 040000 951e77e6ec8df405f4bad59086ad1416c480ce3c 946dd071aa755606eaa5103d870bc471cc07258f M  virt

git bisect start
# good: [a8ec862fd39d9adb88469eb8b9125daccc1c8335] Linux 4.15.18
git bisect good a8ec862fd39d9adb88469eb8b9125daccc1c8335
# bad: [62e9ccfaaedffd057e921fca976f9f7f71c9b254] Linux 4.16.18
git bisect bad 62e9ccfaaedffd057e921fca976f9f7f71c9b254
# good: [d8a5b80568a9cb66810e75b182018e9edb68e8ff] Linux 4.15
git bisect good d8a5b80568a9cb66810e75b182018e9edb68e8ff
# good: [fe53d1443a146326b49d57fe6336b5c2a725223f] Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect good fe53d1443a146326b49d57fe6336b5c2a725223f
# good: [9a61df9e5f7471fe5be3e02bd0bed726b2761a54] Merge tag 'kbuild-v4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
git bisect good 9a61df9e5f7471fe5be3e02bd0bed726b2761a54
# bad: [c4f4d2f917729e9b7b8bb452bf4971be93e7a15f] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect bad c4f4d2f917729e9b7b8bb452bf4971be93e7a15f
# bad: [97ace515f01439d4cf6e898b4094040dc12d36e7] Merge tag 'armsoc-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad 97ace515f01439d4cf6e898b4094040dc12d36e7
# bad: [9f416319f40cd857d2bb517630e5855a905ef3fb] arm64: fix unwind_frame() for filtered out fn for function graph tracing
git bisect bad 9f416319f40cd857d2bb517630e5855a905ef3fb
# bad: [405cacc947f7b58969b2a8ab1568c2d98b245308] drm/i915/vlv: Add cdclk workaround for DSI
git bisect bad 405cacc947f7b58969b2a8ab1568c2d98b245308
# good: [810f4600ec5ee79c68dcbb136ed26a652df46348] Merge tag 'kvm-s390-next-4.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux
git bisect good 810f4600ec5ee79c68dcbb136ed26a652df46348
# bad: [1ab03c072feb579c9fd116de25be2b211e6bff6a] Merge tag 'kvm-ppc-next-4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc
git bisect bad 1ab03c072feb579c9fd116de25be2b211e6bff6a
# bad: [87cedc6be55954c6efd6eca2e694132513f65a2a] kvm: x86: remove efer_reload entry in kvm_vcpu_stat
git bisect bad 87cedc6be55954c6efd6eca2e694132513f65a2a
# good: [fefb876b9b96fa7e4ed3d906979ea45b4cf07349] arm64: KVM: PTE/PMD S2 XN bit definition
git bisect good fefb876b9b9
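
The good/bad narrowing in a log like the one above boils down to a binary search. Below is a minimal sketch under a simplifying assumption: history is a straight line of numbered revisions and, once bad, stays bad. Real git bisect works on a commit graph, so this is an illustration of the idea and not of git's implementation. first_bad(), is_bad_fn and bad_from() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical test callback: returns nonzero if revision `rev` is bad. */
typedef int (*is_bad_fn)(size_t rev, void *ctx);

/*
 * Return the first bad revision in (good, bad], assuming `good` tests
 * good, `bad` tests bad, and every revision after the first bad one is
 * bad too.  `steps`, if non-NULL, counts how many revisions were tested.
 */
static size_t first_bad(size_t good, size_t bad, is_bad_fn is_bad,
			void *ctx, unsigned int *steps)
{
	assert(good < bad);
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (is_bad(mid, ctx))
			bad = mid;   /* culprit is at mid or earlier */
		else
			good = mid;  /* culprit is after mid */
		if (steps)
			(*steps)++;
	}
	return bad;
}

/* Example predicate: everything from *ctx onward is bad. */
static int bad_from(size_t rev, void *ctx)
{
	return rev >= *(size_t *)ctx;
}
```

Each test halves the remaining range, so a window of 100 revisions needs about seven builds and test runs.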

Re: candidates for @devel-rt localversion-rt++

2018-07-30 Thread Mike Galbraith
On Sun, 2018-07-29 at 13:47 +0200, Mike Galbraith wrote:
> FYI, per kvm unit tests, 4.16-rt definitely has more kvm issues.
> 
> huawei5:/abuild/mike/kvm-unit-tests # uname -r
> 4.16.18-rt11-rt
> huawei5:/abuild/mike/kvm-unit-tests # ./run_tests.sh
> PASS selftest-setup (2 tests)
> FAIL selftest-vectors-kernel 
> FAIL selftest-vectors-user 
> PASS selftest-smp (65 tests)
> PASS pci-test (1 tests)
> PASS pmu (3 tests)
> FAIL gicv2-ipi 
> FAIL gicv3-ipi 
> FAIL gicv2-active 
> FAIL gicv3-active 
> PASS psci (4 tests)
> FAIL timer 
> huawei5:/abuild/mike/kvm-unit-tests #
> 
> 4.14-rt passes all tests.  The above is with the kvm raw_spinlock_t
> conversion patch applied, but the 4.12 based SLERT tree I cloned to
> explore arm-land in the first place shows only one timer failure, and
> has/needs it applied as well, which would seem to vindicate it.
> 
> huawei5:/abuild/mike/kvm-unit-tests # uname -r
> 4.12.14-0.gec0b559-rt
> huawei5:/abuild/mike/kvm-unit-tests # ./run_tests.sh
> PASS selftest-setup (2 tests)
> PASS selftest-vectors-kernel (2 tests)
> PASS selftest-vectors-user (2 tests)
> PASS selftest-smp (65 tests)
> PASS pci-test (1 tests)
> PASS pmu (3 tests)
> PASS gicv2-ipi (3 tests)
> PASS gicv3-ipi (3 tests)
> PASS gicv2-active (1 tests)
> PASS gicv3-active (1 tests)
> PASS psci (4 tests)
> FAIL timer (8 tests, 1 unexpected failures)

FWIW, this single timer failure was inspired by something in the 4.15
merge window.  A 4.14-rt based 4.15-rt updated to include recent fixes 
reproduces the exact same (ie my colleagues imported it into SLE).  The
rest landed in 4.16.. staring at which is not proving the least bit
enlightening.

-Mike


Re: [rt-patch 3/3] arm, KVM: convert vgic_irq.irq_lock to raw_spinlock_t

2018-07-30 Thread Mike Galbraith
On Mon, 2018-07-30 at 11:27 +0200, Peter Zijlstra wrote:
> 
> The thing missing from the Changelog is the analysis that all the work
> done under these locks is indeed properly bounded and cannot cause
> excessive latencies.

True, I have no idea what worst case hold times are.  Nothing poked me
dead in the eye when looking around in completely alien code, nor did
cyclictest inspire concern running on box with no base of comparison.

I do know that latency is now < infinity, a modest improvement ;-)

-Mike


Re: candidates for @devel-rt localversion-rt++

2018-07-29 Thread Mike Galbraith
FYI, per kvm unit tests, 4.16-rt definitely has more kvm issues.

huawei5:/abuild/mike/kvm-unit-tests # uname -r
4.16.18-rt11-rt
huawei5:/abuild/mike/kvm-unit-tests # ./run_tests.sh
PASS selftest-setup (2 tests)
FAIL selftest-vectors-kernel 
FAIL selftest-vectors-user 
PASS selftest-smp (65 tests)
PASS pci-test (1 tests)
PASS pmu (3 tests)
FAIL gicv2-ipi 
FAIL gicv3-ipi 
FAIL gicv2-active 
FAIL gicv3-active 
PASS psci (4 tests)
FAIL timer 
huawei5:/abuild/mike/kvm-unit-tests #

4.14-rt passes all tests.  The above is with the kvm raw_spinlock_t
conversion patch applied, but the 4.12 based SLERT tree I cloned to
explore arm-land in the first place shows only one timer failure, and
has/needs it applied as well, which would seem to vindicate it.

huawei5:/abuild/mike/kvm-unit-tests # uname -r
4.12.14-0.gec0b559-rt
huawei5:/abuild/mike/kvm-unit-tests # ./run_tests.sh
PASS selftest-setup (2 tests)
PASS selftest-vectors-kernel (2 tests)
PASS selftest-vectors-user (2 tests)
PASS selftest-smp (65 tests)
PASS pci-test (1 tests)
PASS pmu (3 tests)
PASS gicv2-ipi (3 tests)
PASS gicv3-ipi (3 tests)
PASS gicv2-active (1 tests)
PASS gicv3-active (1 tests)
PASS psci (4 tests)
FAIL timer (8 tests, 1 unexpected failures)
huawei5:/abuild/mike/kvm-unit-tests #


Re: candidates for @devel-rt localversion-rt++

2018-07-29 Thread Mike Galbraith
On Sat, 2018-07-28 at 11:07 +0200, Mike Galbraith wrote:
> 1. arm64/acpi/perf: move pmu allocation to an early CPU up hook

Nope, it's an ex-candidate.  Having found/run kvm unit tests, I
discovered that while the above fixes boot time splat, it somehow
manages to break kvm pmu tests, so needs staring at.

-Mike


[rt-patch 1/3] arm64/acpi/perf: move pmu allocation to an early CPU up hook

2018-07-28 Thread Mike Galbraith


RT cannot allocate while irqs are disabled.

  BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:974
  in_atomic(): 0, irqs_disabled(): 128, pid: 25, name: cpuhp/0
  CPU: 0 PID: 25 Comm: cpuhp/0 Not tainted 4.16.18-rt10-rt #2
  Hardware name: Huawei TaiShan 2280 /BC11SPCD, BIOS 1.32 08/22/2017
  Call trace:
   dump_backtrace+0x0/0x188
   show_stack+0x24/0x30
   dump_stack+0x9c/0xd0
   ___might_sleep+0x124/0x188
   rt_spin_lock+0x40/0x80
   pcpu_alloc+0x104/0x7a0
   __alloc_percpu_gfp+0x38/0x48
   __armpmu_alloc+0x44/0x168
   armpmu_alloc_atomic+0x1c/0x28
   arm_pmu_acpi_cpu_starting+0x1cc/0x210
   cpuhp_invoke_callback+0xb8/0x820
   cpuhp_thread_fun+0xc0/0x1e0
   smpboot_thread_fn+0x1ac/0x2c8
   kthread+0x134/0x138
   ret_from_fork+0x10/0x18

Move the allocation to CPUHP_BP_PREPARE_DYN, where we'll be preemptible,
thus no longer needing GFP_ATOMIC.

Signed-off-by: Mike Galbraith 
---
 drivers/perf/arm_pmu_acpi.c |6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/drivers/perf/arm_pmu_acpi.c
+++ b/drivers/perf/arm_pmu_acpi.c
@@ -135,10 +135,10 @@ static struct arm_pmu *arm_pmu_acpi_find
return pmu;
}
 
-   pmu = armpmu_alloc_atomic();
+   pmu = armpmu_alloc();
if (!pmu) {
pr_warn("Unable to allocate PMU for CPU%d\n",
-   smp_processor_id());
+   raw_smp_processor_id());
return NULL;
}
 
@@ -283,7 +283,7 @@ static int arm_pmu_acpi_init(void)
if (ret)
return ret;
 
-   ret = cpuhp_setup_state(CPUHP_AP_PERF_ARM_ACPI_STARTING,
+   ret = cpuhp_setup_state(CPUHP_BP_PREPARE_DYN,
"perf/arm/pmu_acpi:starting",
arm_pmu_acpi_cpu_starting, NULL);
 


candidates for @devel-rt localversion-rt++

2018-07-28 Thread Mike Galbraith
1. arm64/acpi/perf: move pmu allocation to an early CPU up hook
2. sched: Introduce raw_cond_resched_lock()
3. arm, KVM: convert vgic_irq.irq_lock to raw_spinlock_t

With these applied, 4 socket TaiShan 2280 box boots shiny new -rt11
gripe free, and has been tossed into SUSE's kvm build-bot slave pit,
where it is presumably performing acceptably, given its boss keeps
giving it more work to do.  (I see only lack of smoke/flame)

1 should fly, 2 and 3 may well die.. as box does without them.

-Mike


[rt-patch 2/3] sched: Introduce raw_cond_resched_lock()

2018-07-28 Thread Mike Galbraith


Add raw_cond_resched_lock() infrastructure.

Signed-off-by: Mike Galbraith 
---
 include/linux/sched.h |   15 +++
 kernel/sched/core.c   |   20 
 2 files changed, 35 insertions(+)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1779,12 +1779,18 @@ static inline int _cond_resched(void) {
 })
 
 extern int __cond_resched_lock(spinlock_t *lock);
+extern int __raw_cond_resched_lock(raw_spinlock_t *lock);
 
 #define cond_resched_lock(lock) ({ \
___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
__cond_resched_lock(lock);  \
 })
 
+#define raw_cond_resched_lock(lock) ({ \
+   ___might_sleep(__FILE__, __LINE__, PREEMPT_LOCK_OFFSET);\
+   __raw_cond_resched_lock(lock);  \
+})
+
 #ifndef CONFIG_PREEMPT_RT_FULL
 extern int __cond_resched_softirq(void);
 
@@ -1817,6 +1823,15 @@ static inline int spin_needbreak(spinloc
 #else
return 0;
 #endif
+}
+
+static inline int raw_spin_needbreak(raw_spinlock_t *lock)
+{
+#ifdef CONFIG_PREEMPT
+   return raw_spin_is_contended(lock);
+#else
+   return 0;
+#endif
 }
 
 static __always_inline bool need_resched(void)
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5065,6 +5065,26 @@ int __cond_resched_lock(spinlock_t *lock
 }
 EXPORT_SYMBOL(__cond_resched_lock);
 
+int __raw_cond_resched_lock(raw_spinlock_t *lock)
+{
+   int resched = should_resched(PREEMPT_LOCK_OFFSET);
+   int ret = 0;
+
+   lockdep_assert_held(lock);
+
+   if (raw_spin_needbreak(lock) || resched) {
+   raw_spin_unlock(lock);
+   if (resched)
+   preempt_schedule_common();
+   else
+   cpu_relax();
+   ret = 1;
+   raw_spin_lock(lock);
+   }
+   return ret;
+}
+EXPORT_SYMBOL(__raw_cond_resched_lock);
+
 #ifndef CONFIG_PREEMPT_RT_FULL
 int __sched __cond_resched_softirq(void)
 {
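
To make the control flow of __raw_cond_resched_lock() above easier to follow outside the kernel, here is a toy single-threaded model. Everything prefixed toy_ is hypothetical: the flags stand in for should_resched() and raw_spin_is_contended(), and the counters stand in for preempt_schedule_common() and cpu_relax(). It models only the decision logic, not real locking.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_raw_lock {
	bool held;
	bool contended;  /* models another CPU spinning on the lock */
};

static bool toy_need_resched;   /* models should_resched() */
static int toy_schedule_calls;  /* models preempt_schedule_common() */
static int toy_relax_calls;     /* models cpu_relax() */

static void toy_lock(struct toy_raw_lock *l)
{
	assert(!l->held);
	l->held = true;
}

static void toy_unlock(struct toy_raw_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Returns 1 if the lock was dropped and retaken, 0 if nothing happened. */
static int toy_cond_resched_lock(struct toy_raw_lock *l)
{
	bool resched = toy_need_resched;
	int ret = 0;

	assert(l->held);  /* caller must hold the lock */

	if (l->contended || resched) {
		toy_unlock(l);
		if (resched) {
			toy_schedule_calls++;
			toy_need_resched = false;
		} else {
			toy_relax_calls++;
		}
		l->contended = false;  /* assume the waiter got its turn */
		ret = 1;
		toy_lock(l);
	}
	return ret;
}
```

A caller in a long loop would treat a return of 1 as "the lock was dropped, revalidate whatever it protects".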


[rt-patch 3/3] arm, KVM: convert vgic_irq.irq_lock to raw_spinlock_t

2018-07-28 Thread Mike Galbraith


b103cc3f10c0 ("KVM: arm/arm64: Avoid timer save/restore in vcpu entry/exit")
requires vgic_irq.irq_lock be converted to raw_spinlock_t.

Problem: kvm_preempt_ops.sched_in = kvm_sched_in;
   kvm_sched_in()
      kvm_arch_vcpu_load()
         kvm_timer_vcpu_load() <- b103cc3f10c0 addition
            kvm_timer_vcpu_load_gic()
               kvm_vgic_map_is_active()
                  spin_lock_irqsave(&irq->irq_lock, flags);

Quoting virt/kvm/arm/vgic/vgic.c, locking order is...

  kvm->lock (mutex)
    its->cmd_lock (mutex)
      its->its_lock (mutex)
        vgic_cpu->ap_list_lock         must be taken with IRQs disabled
          kvm->lpi_list_lock           must be taken with IRQs disabled
            vgic_irq->irq_lock         must be taken with IRQs disabled

...meaning vgic_dist.lpi_list_lock and vgic_cpu.ap_list_lock must be
converted as well.

Signed-off-by: Mike Galbraith 
---
 include/kvm/arm_vgic.h   |6 -
 virt/kvm/arm/vgic/vgic-debug.c   |4 -
 virt/kvm/arm/vgic/vgic-init.c|8 +-
 virt/kvm/arm/vgic/vgic-its.c |   22 +++
 virt/kvm/arm/vgic/vgic-mmio-v2.c |   14 ++--
 virt/kvm/arm/vgic/vgic-mmio-v3.c |   10 +--
 virt/kvm/arm/vgic/vgic-mmio.c|   34 +--
 virt/kvm/arm/vgic/vgic-v2.c  |4 -
 virt/kvm/arm/vgic/vgic-v3.c  |8 +-
 virt/kvm/arm/vgic/vgic.c |  120 +++
 10 files changed, 115 insertions(+), 115 deletions(-)

--- a/include/kvm/arm_vgic.h
+++ b/include/kvm/arm_vgic.h
@@ -96,7 +96,7 @@ enum vgic_irq_config {
 };
 
 struct vgic_irq {
-   spinlock_t irq_lock;/* Protects the content of the struct */
+   raw_spinlock_t irq_lock;/* Protects the content of the struct */
struct list_head lpi_list;  /* Used to link all LPIs together */
struct list_head ap_list;
 
@@ -243,7 +243,7 @@ struct vgic_dist {
u64 propbaser;
 
/* Protects the lpi_list and the count value below. */
-   spinlock_t  lpi_list_lock;
+   raw_spinlock_t  lpi_list_lock;
struct list_headlpi_list_head;
int lpi_list_count;
 
@@ -296,7 +296,7 @@ struct vgic_cpu {
unsigned int used_lrs;
struct vgic_irq private_irqs[VGIC_NR_PRIVATE_IRQS];
 
-   spinlock_t ap_list_lock;/* Protects the ap_list */
+   raw_spinlock_t ap_list_lock;/* Protects the ap_list */
 
/*
 * List of IRQs that this VCPU should consider because they are either
--- a/virt/kvm/arm/vgic/vgic-debug.c
+++ b/virt/kvm/arm/vgic/vgic-debug.c
@@ -228,9 +228,9 @@ static int vgic_debug_show(struct seq_fi
irq = &kvm->arch.vgic.spis[iter->intid - VGIC_NR_PRIVATE_IRQS];
}
 
-   spin_lock_irqsave(&irq->irq_lock, flags);
+   raw_spin_lock_irqsave(&irq->irq_lock, flags);
print_irq_state(s, irq, vcpu);
-   spin_unlock_irqrestore(&irq->irq_lock, flags);
+   raw_spin_unlock_irqrestore(&irq->irq_lock, flags);
 
return 0;
 }
--- a/virt/kvm/arm/vgic/vgic-init.c
+++ b/virt/kvm/arm/vgic/vgic-init.c
@@ -64,7 +64,7 @@ void kvm_vgic_early_init(struct kvm *kvm
struct vgic_dist *dist = &kvm->arch.vgic;
 
INIT_LIST_HEAD(&dist->lpi_list_head);
-   spin_lock_init(&dist->lpi_list_lock);
+   raw_spin_lock_init(&dist->lpi_list_lock);
 }
 
 /**
@@ -80,7 +80,7 @@ void kvm_vgic_vcpu_early_init(struct kvm
int i;
 
INIT_LIST_HEAD(&vgic_cpu->ap_list_head);
-   spin_lock_init(&vgic_cpu->ap_list_lock);
+   raw_spin_lock_init(&vgic_cpu->ap_list_lock);
 
/*
 * Enable and configure all SGIs to be edge-triggered and
@@ -90,7 +90,7 @@ void kvm_vgic_vcpu_early_init(struct kvm
struct vgic_irq *irq = &vgic_cpu->private_irqs[i];
 
INIT_LIST_HEAD(&irq->ap_list);
-   spin_lock_init(&irq->irq_lock);
+   raw_spin_lock_init(&irq->irq_lock);
irq->intid = i;
irq->vcpu = NULL;
irq->target_vcpu = vcpu;
@@ -214,7 +214,7 @@ static int kvm_vgic_dist_init(struct kvm
 
irq->intid = i + VGIC_NR_PRIVATE_IRQS;
INIT_LIST_HEAD(&irq->ap_list);
-   spin_lock_init(&irq->irq_lock);
+   raw_spin_lock_init(&irq->irq_lock);
irq->vcpu = NULL;
irq->target_vcpu = vcpu0;
kref_init(&irq->refcount);
--- a/virt/kvm/arm/vgic/vgic-its.c
+++ b/virt/kvm/arm/vgic/vgic-its.c
@@ -65,14 +65,14 @@ static struct vgic_irq *vgic_add_lpi(str
 
INIT_LIST_HEAD(&irq->lpi_list);
INIT_LIST_HEAD(&irq->ap_list);
-   spin_lock_init(&irq->irq_lock);
+   raw_spin_lock_init(&irq->irq_lock);
 
irq->config = VGIC_CONFIG_EDGE;
kref_init(&irq->refcount);
irq->intid = intid;
irq->target_vcpu = vcpu;
 
-   spin_lock_irqsave(&dist->lpi_list_lock, flags);
+   raw_spin_lock_irqsave(&dist->lpi_list_

Re: [PATCH RT v3] arm64: fpsimd: use preemp_disable in addition to local_bh_disable()

2018-07-26 Thread Mike Galbraith
On Thu, 2018-07-26 at 17:06 +0200, Sebastian Andrzej Siewior wrote:
> 
> @@ -1115,6 +1139,7 @@ void kernel_neon_begin(void)
>  
>   BUG_ON(!may_use_simd());
>  
> + preempt_disable();
>   local_bh_disable();
>  
>   __this_cpu_write(kernel_neon_busy, true);
> @@ -1131,6 +1156,7 @@ void kernel_neon_begin(void)
>   preempt_disable();

Nit: this preempt_disable() could be removed...
 
>   local_bh_enable();
> + preempt_enable();
>  }
>  EXPORT_SYMBOL(kernel_neon_begin);

...instead of adding this one.

-Mike


Re: [PATCH RT v2] arm64: fpsimd: use a local_lock() in addition to local_bh_disable()

2018-07-18 Thread Mike Galbraith
See pseudo-patch below.  That cures the reported gcc gripeage.

On Sun, 2018-07-15 at 09:22 +0200, Mike Galbraith wrote:
> On Sat, 2018-07-14 at 00:03 +0200, Mike Galbraith wrote:
> > On Fri, 2018-07-13 at 19:49 +0200, Sebastian Andrzej Siewior wrote:
> > > In v4.16-RT I noticed a number of warnings from task_fpsimd_load(). The
> > > code disables BH and expects that it is not preemptible. On -RT the
> > > task remains preemptible but remains the same CPU. This may corrupt the
> > > content of the SIMD registers if the task is preempted during
> > > saving/restoring those registers.
> > > Add a locallock around this process. This avoids that the any function
> > > within the locallock block is invoked more than once on the same CPU.
> > > 
> > > The kernel_neon_begin() can't be kept preemptible. If the task-switch notices
> > > TIF_FOREIGN_FPSTATE then it would restore task's SIMD state and we lose the
> > > state of registers used for in-kernel-work. We would require additional storage
> > > for the in-kernel copy of the registers. But then the NEON-crypto checks for
> > > the need-resched flag so it shouldn't that bad.
> > > The preempt_disable() avoids the context switch while the kernel uses the SIMD
> > > registers. Unfortunately we have to balance out the migrate_disable() counter
> > > because local_lock_bh() is invoked in different context compared to its unlock
> > > counterpart.
> > > 
> > > __efi_fpsimd_begin() should not use kernel_fpu_begin() due to its
> > > preempt_disable() context and instead save the registers always in its
> > > extra spot on RT.
> > > 
> > > Signed-off-by: Sebastian Andrzej Siewior 
> > > ---
> > > 
> > > This seems to make work (crypto chacha20-neon + cyclictest). I have no
> > > EFI so I have no clue if saving SIMD while calling to EFI works.
> > 
> > All is not well on cavium test box.  I'm seeing random errors ala...
> > 
> > ./include/linux/fs.h:3137:11: internal compiler error: Segmentation fault
> > > ./include/linux/bio.h:175:1: internal compiler error: in grokdeclarator, at c/c-decl.c:7023
> > 
> > ...during make -j96 (2*cpus) kbuild.  Turns out 4.14-rt has this issue
> > as well, which is unsurprising if it's related to fpsimd woes.  Box
> > does not exhibit the issue with NONRT kernels, PREEMPT or NOPREEMPT.
> 
> Verified to be SIMD woes.  I backported your V2 to 4.14-rt, and the
> CPUS*2 kbuild still reliably reproduced the corruption issue.  I then
> did the below to both 4.14-rt and 4.16-rt, and the corruption is gone.
> 
> (this looks a bit like a patch, but is actually a functional yellow
> sticky should I need to come back for another poke at it later)
> 
> arm64: fpsimd: disable preemption for RT where that is assumed
> 
> 1. Per Sebastian's analysis, kernel_neon_begin() can't be made preemptible:
> If the task-switch notices TIF_FOREIGN_FPSTATE then it would restore task's
> SIMD state and we lose the state of registers used for in-kernel-work.  We
> would require additional storage for the in-kernel copy of the registers.
> But then the NEON-crypto checks for the need-resched flag so it shouldn't
> be that bad.
> 
> 2. arch_efi_call_virt_setup/teardown() encapsulate __efi_fpsimd_begin/end()
> in preempt disabled sections via efi_virtmap_load/unload().  That could be
> fixed, but... 
> 
> 3. A local lock solution which left preempt disabled sections 1 & 2 intact
> failed, CPUS*2 parallel kbuild reliably reproduced memory corruption.
> 
> Given the two non-preemptible sections which could encapsulate something
> painful remained intact with the local lock solution, and the fact that
> the remaining BH disabled sections are all small, with empirical evidence
> at hand that at LEAST one truly does require preemption be disabled,
> the best solution for both RT and !RT is to simply disable preemption for
> RT where !RT assumes preemption has been disabled.  That adds no cycles
> to the !RT case, fewer cycles to the RT case, and requires no (ugly) work
> around for the consequences of local_unlock() under preempt_disable().
> 
> Signed-off-by: Mike Galbraith 
> ---
>  arch/arm64/kernel/fpsimd.c |   18 +++---
>  1 file changed, 15 insertions(+), 3 deletions(-)
> 
> --- a/arch/arm64/kernel/fpsimd.c
> +++ b/arch/arm64/kernel/fpsimd.c
> @@ -594,6 +594,7 @@ int sve_set_vector_length(struct task_st
>* non-SVE thread.
>*/
>   if (task == current) {
> + preempt_di

Re: [PATCH RT v2] arm64: fpsimd: use a local_lock() in addition to local_bh_disable()

2018-07-18 Thread Mike Galbraith
On Wed, 2018-07-18 at 11:27 +0200, Sebastian Andrzej Siewior wrote:
> On 2018-07-14 00:03:44 [+0200], Mike Galbraith wrote:
> > > This seems to make work (crypto chacha20-neon + cyclictest). I have no
> > > EFI so I have no clue if saving SIMD while calling to EFI works.
> > 
> > All is not well on cavium test box.  I'm seeing random errors ala...
> > 
> > ./include/linux/fs.h:3137:11: internal compiler error: Segmentation fault
> > ./include/linux/bio.h:175:1: internal compiler error: in grokdeclarator, at c/c-decl.c:7023
> > 
> > ...during make -j96 (2*cpus) kbuild.  Turns out 4.14-rt has this issue
> > as well, which is unsurprising if it's related to fpsimd woes.  Box
> > does not exhibit the issue with NONRT kernels, PREEMPT or NOPREEMPT.
> > 
> > To file under FWIW, arm64 configured SLE15-RT, 4.12 based kernel
> > containing virgin @stable arch/arm64/kernel/fpsimd.c, does not exhibit
> > the problem. (relevant? dunno, it may be unrelated to fpsimd.c).
> 
> Okay, so you did not test this because you can't compile.

Nope, the running kernel, the one that is doing the segfaulting etc,
has the patches applied.

It is exhibiting that symptom because those patches do not cure this
symptom, one which I verified to be present in virgin 4.14-rt as well. 
The pseudo-patch I sent, disabling preemption where it is assumed to be
disabled instead, does cure it.  With preemption so disabled, I can
beat on affected kernels (>=4.14-rt) as long as I like.

This particular 48 core Cavium is very slow, maybe that makes it easier
to reproduce, dunno.  According to pipe-test, the thing is essentially
a dozen RPi super-glued together.  pipe-test pinned to a single core
can only context switch at ~40KHz with PREEMPT_RT, or ~90KHz with
NOPREEMPT, comparable to measurements taken on a real RPi.

-Mike
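
For anyone wanting to poke at the idea outside the kernel, here is a minimal userspace sketch of "disable preemption on RT where !RT already assumes it is disabled". Every sim_* name and the SIM_PREEMPT_RT switch are made up for illustration; the RT tree's real preempt_disable_rt()/preempt_enable_rt() definitions may differ in detail. The point is only that the RT variant bumps the preempt count around the section, while the !RT variant compiles to nothing.

```c
#include <assert.h>

/* Simulated preempt counter (hypothetical stand-in for the kernel's). */
static int sim_preempt_count;

#define sim_preempt_disable()	do { sim_preempt_count++; } while (0)
#define sim_preempt_enable()	do { sim_preempt_count--; } while (0)

/* Hypothetical build switch: 1 models an RT kernel, 0 models !RT. */
#ifndef SIM_PREEMPT_RT
#define SIM_PREEMPT_RT 1
#endif

#if SIM_PREEMPT_RT
/* On RT, disabling BH no longer implies non-preemptible: do it explicitly. */
#define sim_preempt_disable_rt()	sim_preempt_disable()
#define sim_preempt_enable_rt()		sim_preempt_enable()
#else
/* On !RT the section is already non-preemptible, so this adds no cycles. */
#define sim_preempt_disable_rt()	do { } while (0)
#define sim_preempt_enable_rt()		do { } while (0)
#endif

/*
 * Models a section like the one in sve_set_vector_length(): returns how
 * far the preempt count was raised while the "register save" ran.
 */
static int sim_save_simd_state(void)
{
	int before = sim_preempt_count;
	int seen;

	sim_preempt_disable_rt();
	/* local_bh_disable() and the SIMD register save would sit here */
	seen = sim_preempt_count;
	/* local_bh_enable() would sit here */
	sim_preempt_enable_rt();

	/* The section must leave the count balanced. */
	assert(sim_preempt_count == before);
	return seen - before;
}
```

With SIM_PREEMPT_RT set to 1, sim_save_simd_state() reports the count raised by one inside the section; built with -DSIM_PREEMPT_RT=0 it reports zero, which mirrors the "no cycles added to !RT" argument.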


Re: [PATCH RT v2] arm64: fpsimd: use a local_lock() in addition to local_bh_disable()

2018-07-15 Thread Mike Galbraith
On Sat, 2018-07-14 at 00:03 +0200, Mike Galbraith wrote:
> On Fri, 2018-07-13 at 19:49 +0200, Sebastian Andrzej Siewior wrote:
> > In v4.16-RT I noticed a number of warnings from task_fpsimd_load(). The
> > code disables BH and expects that it is not preemptible. On -RT the
> > task remains preemptible but remains the same CPU. This may corrupt the
> > content of the SIMD registers if the task is preempted during
> > saving/restoring those registers.
> > Add a locallock around this process. This avoids that the any function
> > within the locallock block is invoked more than once on the same CPU.
> > 
> > The kernel_neon_begin() can't be kept preemptible. If the task-switch notices
> > TIF_FOREIGN_FPSTATE then it would restore task's SIMD state and we lose the
> > state of registers used for in-kernel-work. We would require additional storage
> > for the in-kernel copy of the registers. But then the NEON-crypto checks for
> > the need-resched flag so it shouldn't that bad.
> > The preempt_disable() avoids the context switch while the kernel uses the SIMD
> > registers. Unfortunately we have to balance out the migrate_disable() counter
> > because local_lock_bh() is invoked in different context compared to its unlock
> > counterpart.
> > 
> > __efi_fpsimd_begin() should not use kernel_fpu_begin() due to its
> > preempt_disable() context and instead save the registers always in its
> > extra spot on RT.
> > 
> > Signed-off-by: Sebastian Andrzej Siewior 
> > ---
> > 
> > This seems to make work (crypto chacha20-neon + cyclictest). I have no
> > EFI so I have no clue if saving SIMD while calling to EFI works.
> 
> All is not well on cavium test box.  I'm seeing random errors ala...
> 
> ./include/linux/fs.h:3137:11: internal compiler error: Segmentation fault
> ./include/linux/bio.h:175:1: internal compiler error: in grokdeclarator, at c/c-decl.c:7023
> 
> ...during make -j96 (2*cpus) kbuild.  Turns out 4.14-rt has this issue
> as well, which is unsurprising if it's related to fpsimd woes.  Box
> does not exhibit the issue with NONRT kernels, PREEMPT or NOPREEMPT.

Verified to be SIMD woes.  I backported your V2 to 4.14-rt, and the
CPUS*2 kbuild still reliably reproduced the corruption issue.  I then
did the below to both 4.14-rt and 4.16-rt, and the corruption is gone.

(this looks a bit like a patch, but is actually a functional yellow
sticky should I need to come back for another poke at it later)

arm64: fpsimd: disable preemption for RT where that is assumed

1. Per Sebastian's analysis, kernel_neon_begin() can't be made preemptible:
If the task-switch notices TIF_FOREIGN_FPSTATE then it would restore task's
SIMD state and we lose the state of registers used for in-kernel-work.  We
would require additional storage for the in-kernel copy of the registers.
But then the NEON-crypto checks for the need-resched flag so it shouldn't
be that bad.

2. arch_efi_call_virt_setup/teardown() encapsulate __efi_fpsimd_begin/end()
in preempt disabled sections via efi_virtmap_load/unload().  That could be
fixed, but... 

3. A local lock solution which left preempt disabled sections 1 & 2 intact
failed, CPUS*2 parallel kbuild reliably reproduced memory corruption.

Given the two non-preemptible sections which could encapsulate something
painful remained intact with the local lock solution, and the fact that
the remaining BH disabled sections are all small, with empirical evidence
at hand that at LEAST one truly does require preemption be disabled,
the best solution for both RT and !RT is to simply disable preemption for
RT where !RT assumes preemption has been disabled.  That adds no cycles
to the !RT case, fewer cycles to the RT case, and requires no (ugly) work
around for the consequences of local_unlock() under preempt_disable().

Signed-off-by: Mike Galbraith 
---
 arch/arm64/kernel/fpsimd.c |   18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

--- a/arch/arm64/kernel/fpsimd.c
+++ b/arch/arm64/kernel/fpsimd.c
@@ -594,6 +594,7 @@ int sve_set_vector_length(struct task_st
 * non-SVE thread.
 */
if (task == current) {
+   preempt_disable_rt();
local_bh_disable();
 
task_fpsimd_save();
@@ -604,8 +605,10 @@ int sve_set_vector_length(struct task_st
if (test_and_clear_tsk_thread_flag(task, TIF_SVE))
sve_to_fpsimd(task);
 
-   if (task == current)
+   if (task == current) {
local_bh_enable();
+   preempt_enable_rt();
+   }
 
/*
 * Force reallocation of task SVE state to the correct size
@@ -837,6 +840,7 @@ asmlinkage void do_sve_acc(unsigned int
 
sve_alloc(current);
 
+   pree

Re: [PATCH RT v2] arm64: fpsimd: use a local_lock() in addition to local_bh_disable()

2018-07-13 Thread Mike Galbraith
On Fri, 2018-07-13 at 19:49 +0200, Sebastian Andrzej Siewior wrote:
> In v4.16-RT I noticed a number of warnings from task_fpsimd_load(). The
> code disables BH and expects that it is not preemptible. On -RT the
> task remains preemptible but remains the same CPU. This may corrupt the
> content of the SIMD registers if the task is preempted during
> saving/restoring those registers.
> Add a locallock around this process. This avoids that the any function
> within the locallock block is invoked more than once on the same CPU.
> 
> The kernel_neon_begin() can't be kept preemptible. If the task-switch notices
> TIF_FOREIGN_FPSTATE then it would restore task's SIMD state and we lose the
> state of registers used for in-kernel-work. We would require additional storage
> for the in-kernel copy of the registers. But then the NEON-crypto checks for
> the need-resched flag so it shouldn't that bad.
> The preempt_disable() avoids the context switch while the kernel uses the SIMD
> registers. Unfortunately we have to balance out the migrate_disable() counter
> because local_lock_bh() is invoked in different context compared to its unlock
> counterpart.
> 
> __efi_fpsimd_begin() should not use kernel_fpu_begin() due to its
> preempt_disable() context and instead save the registers always in its
> extra spot on RT.
> 
> Signed-off-by: Sebastian Andrzej Siewior 
> ---
> 
> This seems to make work (crypto chacha20-neon + cyclictest). I have no
> EFI so I have no clue if saving SIMD while calling to EFI works.

All is not well on cavium test box.  I'm seeing random errors ala...

./include/linux/fs.h:3137:11: internal compiler error: Segmentation fault
./include/linux/bio.h:175:1: internal compiler error: in grokdeclarator, at c/c-decl.c:7023

...during make -j96 (2*cpus) kbuild.  Turns out 4.14-rt has this issue
as well, which is unsurprising if it's related to fpsimd woes.  Box
does not exhibit the issue with NONRT kernels, PREEMPT or NOPREEMPT.

To file under FWIW, arm64 configured SLE15-RT, 4.12 based kernel
containing virgin @stable arch/arm64/kernel/fpsimd.c, does not exhibit
the problem. (relevant? dunno, it may be unrelated to fpsimd.c).

-Mike


Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids

2018-07-09 Thread Mike Galbraith
On Mon, 2018-07-09 at 17:38 -0400, Rik van Riel wrote:
> 
> I added your code, and Signed-off-By in patch
> 1 for version 5 of the series.

No objection, but no need (like taking credit for fixing a typo:).


Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids

2018-07-08 Thread Mike Galbraith
BTW, a second gripe ala the first, but wrt mm_init_cpumask(&efi_mm):

In function ‘bitmap_zero’,
inlined from ‘cpumask_clear’ at ./include/linux/cpumask.h:378:2,
inlined from ‘mm_init_cpumask’ at ./include/linux/mm_types.h:504:2,
inlined from ‘efi_alloc_page_tables’ at arch/x86/platform/efi/efi_64.c:235:2:
./include/linux/bitmap.h:208:3: warning: ‘memset’ writing 64 bytes into a region of size 0 overflows the destination [-Wstringop-overflow=]
   memset(dst, 0, len);
   ^~~



Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids

2018-07-08 Thread Mike Galbraith
On Sat, 2018-07-07 at 17:25 -0400, Rik van Riel wrote:
> 
> > ./include/linux/bitmap.h:208:3: warning: ‘memset’ writing 64 bytes
> > into a region of size 0 overflows the destination [-Wstringop-overflow=]
> >memset(dst, 0, len);
> >^~~
> 
> I don't understand this one.
> 
> Inside init_mm we have this line:
> .cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
> 
> which is the way the documentation suggests statically
> allocated variable size arrays should be allocated 
> and initialized.
> 
> How does that result in a memset of the same size,
> on the same array, to throw an error like above?

Compiler knows that ->cpu_bitmap is 64 bits of storage, and with
!CPUMASK_OFFSTACK, nr_cpumask_bits = NR_CPUS.  With NR_CPUS > 64,
compiler gripes, with NR_CPUS <= 64 it's a happy camper.

> What am I doing wrong?

Below is what I did to get box to both STHU, and to boot with the
openSUSE master branch config I sent.  Without the efi_mm hunk, boot
hangs early with or without the other hunk.

I built and boot tested the openSUSE config, a NOPREEMPT+MAXSMP config,
my local config w. NR_CPUS=8, and master-rt w. NR_CPUS=256, which is
the only one that got any real exercise (building the others).

---
 drivers/firmware/efi/efi.c |1 +
 include/linux/mm_types.h   |5 -
 2 files changed, 5 insertions(+), 1 deletion(-)

--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -82,6 +82,7 @@ struct mm_struct efi_mm = {
.mmap_sem   = __RWSEM_INITIALIZER(efi_mm.mmap_sem),
.page_table_lock= __SPIN_LOCK_UNLOCKED(efi_mm.page_table_lock),
.mmlist = LIST_HEAD_INIT(efi_mm.mmlist),
+   .cpu_bitmap = { [BITS_TO_LONGS(NR_CPUS)] = 0},
 };
 
 static bool disable_runtime;
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -501,7 +501,10 @@ extern struct mm_struct init_mm;
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
 {
-   cpumask_clear((struct cpumask *)&mm->cpu_bitmap);
+   unsigned long cpu_bitmap = (unsigned long)mm;
+
+   cpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+   cpumask_clear((struct cpumask *)cpu_bitmap);
 }
 
 /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
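
The mm_types.h hunk above can be tried in userspace. What follows is a rough sketch, assuming a toy struct with a trailing flexible array in place of mm_struct; struct sim_mm, the SIM_* macros and both helpers are invented for illustration and are not kernel API. It computes the bitmap address from the struct base plus offsetof(), as the hunk does, so the clear is sized by the caller's CPU count rather than by the declared (zero-length) member.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIM_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
/* Number of longs needed to hold n bits, rounded up. */
#define SIM_BITS_TO_LONGS(n) \
	(((n) + SIM_BITS_PER_LONG - 1) / SIM_BITS_PER_LONG)

/* Toy stand-in for mm_struct: the bitmap is a trailing flexible array. */
struct sim_mm {
	int users;
	unsigned long cpu_bitmap[];	/* must stay the last member */
};

/* Allocate the struct plus room for nr_cpus bits, bitmap poisoned. */
static struct sim_mm *sim_mm_alloc(unsigned int nr_cpus)
{
	size_t bytes = SIM_BITS_TO_LONGS(nr_cpus) * sizeof(unsigned long);
	struct sim_mm *mm = malloc(sizeof(*mm) + bytes);

	if (!mm)
		return NULL;
	mm->users = 1;
	memset(mm->cpu_bitmap, 0xff, bytes);	/* make the clear observable */
	return mm;
}

/* Clear the bitmap via base address + offsetof(), as in the hunk above. */
static void sim_mm_init_cpumask(struct sim_mm *mm, unsigned int nr_cpus)
{
	uintptr_t addr;

	assert(mm != NULL);
	addr = (uintptr_t)mm + offsetof(struct sim_mm, cpu_bitmap);
	memset((void *)addr, 0,
	       SIM_BITS_TO_LONGS(nr_cpus) * sizeof(unsigned long));
}
```

Calling sim_mm_alloc(256) followed by sim_mm_init_cpumask(mm, 256) leaves every word of the bitmap zero. Whether a given gcc version warns on the direct &mm->cpu_bitmap form depends on the compiler and config, so the sketch makes no claim about that.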


Re: [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs

2018-07-07 Thread Mike Galbraith
(bah, I see I replied to wrong patch version, but it's still valid)

On Sat, 2018-07-07 at 14:26 +0200, Mike Galbraith wrote:
> On Fri, 2018-06-29 at 10:29 -0400, Rik van Riel wrote:
> > diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> > index e59214ec52b1..c4073367219d 100644
> > --- a/arch/x86/mm/tlb.c
> > +++ b/arch/x86/mm/tlb.c
> > @@ -718,14 +718,47 @@ void tlb_flush_remove_tables_local(void *arg)
> > }
> >  }
> >  
> > +static void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm,
> > + struct cpumask* lazy_cpus)
> > +{
> > +   int cpu;
> > +
> > +   for_each_cpu(cpu, mm_cpumask(mm)) {
> > +   if (!per_cpu(cpu_tlbstate.is_lazy, cpu))
> > +   cpumask_set_cpu(cpu, lazy_cpus);
> > +   }
> > +}
> > +
> >  void tlb_flush_remove_tables(struct mm_struct *mm)
> >  {
> > int cpu = get_cpu();
> > +   cpumask_var_t lazy_cpus;
> > +
> > +   if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids)
> > +   return;
> 
> A put_cpu() went missing.
> 
> > +
> > +   if (!zalloc_cpumask_var(&lazy_cpus, GFP_ATOMIC)) {
> > +   /*
> > +* If the cpumask allocation fails, do a brute force flush
> > +* on all the CPUs that have this mm loaded.
> > +*/
> > +   smp_call_function_many(mm_cpumask(mm),
> > +   tlb_flush_remove_tables_local, (void *)mm, 1);
> > +   return;
> > +   }
> 
> Ditto.
> 
>   -Mike


Re: [PATCH 5/7] x86,tlb: only send page table free TLB flush to lazy TLB CPUs

2018-07-07 Thread Mike Galbraith
On Fri, 2018-06-29 at 10:29 -0400, Rik van Riel wrote:
> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
> index e59214ec52b1..c4073367219d 100644
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -718,14 +718,47 @@ void tlb_flush_remove_tables_local(void *arg)
>   }
>  }
>  
> +static void mm_fill_lazy_tlb_cpu_mask(struct mm_struct *mm,
> +   struct cpumask* lazy_cpus)
> +{
> + int cpu;
> +
> + for_each_cpu(cpu, mm_cpumask(mm)) {
> + if (!per_cpu(cpu_tlbstate.is_lazy, cpu))
> + cpumask_set_cpu(cpu, lazy_cpus);
> + }
> +}
> +
>  void tlb_flush_remove_tables(struct mm_struct *mm)
>  {
>   int cpu = get_cpu();
> + cpumask_var_t lazy_cpus;
> +
> + if (cpumask_any_but(mm_cpumask(mm), cpu) >= nr_cpu_ids)
> + return;

A put_cpu() went missing.

> +
> + if (!zalloc_cpumask_var(&lazy_cpus, GFP_ATOMIC)) {
> + /*
> +  * If the cpumask allocation fails, do a brute force flush
> +  * on all the CPUs that have this mm loaded.
> +  */
> + smp_call_function_many(mm_cpumask(mm),
> + tlb_flush_remove_tables_local, (void *)mm, 1);
> + return;
> + }

Ditto.

-Mike
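
To illustrate the two missing put_cpu() calls noted above, here is a small userspace sketch of one common way to keep get_cpu()/put_cpu() balanced: route every exit through a single label. The sim_* functions and the two boolean parameters are hypothetical stand-ins, not the code from the patch, and this is not necessarily how the series ended up fixing it.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated preempt counter: get_cpu() raises it, put_cpu() drops it. */
static int sim_preempt_count;

static int sim_get_cpu(void)
{
	sim_preempt_count++;
	return 0;		/* pretend we are always on CPU 0 */
}

static void sim_put_cpu(void)
{
	assert(sim_preempt_count > 0);
	sim_preempt_count--;
}

/*
 * Shape of tlb_flush_remove_tables() with both early exits funnelled
 * through one label. Returns true if a remote flush was "sent".
 */
static bool sim_flush_remove_tables(bool other_cpus_use_mm, bool alloc_ok)
{
	bool sent = false;
	int cpu = sim_get_cpu();

	(void)cpu;

	if (!other_cpus_use_mm)
		goto out;	/* nothing to flush, still drop the CPU ref */

	if (!alloc_ok) {
		sent = true;	/* brute force flush on all CPUs using mm */
		goto out;
	}

	sent = true;		/* targeted flush of the selected CPUs */
out:
	sim_put_cpu();
	return sent;
}
```

Whichever path is taken, sim_preempt_count is back at zero afterwards, which is the property the bare return statements in the quoted patch break.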


Re: [PATCH 1/7] mm: allocate mm_cpumask dynamically based on nr_cpu_ids

2018-07-07 Thread Mike Galbraith
On Fri, 2018-07-06 at 17:56 -0400, Rik van Riel wrote:
> The mm_struct always contains a cpumask bitmap, regardless of
> CONFIG_CPUMASK_OFFSTACK. That means the first step can be to
> simplify things, and simply have one bitmask at the end of the
> mm_struct for the mm_cpumask.

Otherwise virgin master.today grumbles.

  CC      kernel/bounds.s
  UPD     include/generated/timeconst.h
  UPD     include/generated/bounds.h
  CC      arch/x86/kernel/asm-offsets.s
  UPD     include/generated/asm-offsets.h
  CALL    scripts/checksyscalls.sh
  CHK     include/generated/compile.h
  HOSTCC  usr/gen_init_cpio
  UPD     include/generated/compile.h
  CC      init/main.o
In file included from ./include/linux/cpumask.h:12:0,
 from ./arch/x86/include/asm/cpumask.h:5,
 from ./arch/x86/include/asm/msr.h:11,
 from ./arch/x86/include/asm/processor.h:21,
 from ./arch/x86/include/asm/cpufeature.h:5,
 from ./arch/x86/include/asm/thread_info.h:53,
 from ./include/linux/thread_info.h:38,
 from ./arch/x86/include/asm/preempt.h:7,
 from ./include/linux/preempt.h:81,
 from ./include/linux/spinlock.h:51,
 from ./include/linux/seqlock.h:36,
 from ./include/linux/time.h:6,
 from ./include/linux/stat.h:19,
 from ./include/linux/module.h:10,
 from init/main.c:16:
In function ‘bitmap_zero’,
inlined from ‘cpumask_clear’ at ./include/linux/cpumask.h:378:2,
inlined from ‘mm_init_cpumask’ at ./include/linux/mm_types.h:504:2,
inlined from ‘start_kernel’ at init/main.c:560:2:
./include/linux/bitmap.h:208:3: warning: ‘memset’ writing 64 bytes into a region of size 0 overflows the destination [-Wstringop-overflow=]
   memset(dst, 0, len);
   ^~~


config.gz
Description: application/gzip


Re: [tip:x86/pti] x86/asm: Pad assembly functions with INT3 instructions

2018-06-17 Thread Mike Galbraith
On Sun, 2018-06-17 at 21:47 +0200, Borislav Petkov wrote:
> On Sun, Jun 17, 2018 at 04:02:58PM +0200, Mike Galbraith wrote:
> > (/me does that.. all better)
> > 
> > From 6ac281ee69f4cb5b581d5f49662fb56b6326155a Mon Sep 17 00:00:00 2001
> > From: Borislav Petkov 
> > Date: Sun, 17 Jun 2018 13:57:42 +0200
> > Subject: [PATCH] x86/crypto: Add a missing RET
> > 
> > crypto_aegis128_aesni_enc_tail() needs to return too.
> > 
> > Signed-off-by: Borislav Petkov 
> 
> [ Mike: took care of the other tail calls. ]
> Signed-off-by: Mike Galbraith 

I didn't think a sign-off was needed... it was your brain that wiggled
my fingers after all ;-)

-Mike


Re: [tip:x86/pti] x86/asm: Pad assembly functions with INT3 instructions

2018-06-17 Thread Mike Galbraith
On Sun, 2018-06-17 at 15:38 +0200, Mike Galbraith wrote:
> On Sun, 2018-06-17 at 14:00 +0200, Borislav Petkov wrote:
> > On Sun, Jun 17, 2018 at 01:40:13PM +0200, Mike Galbraith wrote:
> > > On Mon, 2018-05-14 at 05:53 -0700, tip-bot for Alexey Dobriyan wrote:
> > > > Commit-ID:  51bad67ffbce0aaa44579f84ef5d05597054ec6a
> > > > Gitweb: 
> > > > https://git.kernel.org/tip/51bad67ffbce0aaa44579f84ef5d05597054ec6a
> > > > Author: Alexey Dobriyan 
> > > > AuthorDate: Tue, 8 May 2018 00:37:55 +0300
> > > > Committer:  Ingo Molnar 
> > > > CommitDate: Mon, 14 May 2018 11:43:03 +0200
> > > > 
> > > > x86/asm: Pad assembly functions with INT3 instructions
> > > > 
> > > > Use INT3 instead of NOP. All that padding between functions is
> > > > an illegal area, no legitimate code should jump into it.
> > > 
> > > Is dinky patchlet suggesting cryptomgr is being naughty?
> > > 
> > > (revert silences spew, but..)
> > > 
> > > ...
> > > [   21.041608] int3:  [#1] SMP PTI
> > > [   21.041754] CPU: 3 PID: 935 Comm: cryptomgr_test Tainted: G
> > > E 4.17.0.g075a1d3-tip-default #146
> > > [   21.041888] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
> > > 09/23/2013
> > > [   21.042035] RIP: 0010:crypto_aegis128_aesni_enc_tail+0x74/0x80 
> > > [aegis128_aesni]
> > > [   21.042171] Code: 38 dc ca 66 0f 38 dc d3 66 0f 38 dc de 66 0f ef e5 
> > > f3 0f 7f 27 f3 0f 7f 47 10 f3 0f 7f 4f 20 f3 0f 7f 57 30 f3 0f 7f 5f 40 
> > > cc  cc cc cc cc cc cc cc cc cc cc cc 48 83 fe 10 0f 82 c3 03 00 00
> > 
> > Looks like it misses a RET:
> 
> Bingo.
> 
> [   28.751069] RIP: 0010:crypto_aegis128l_aesni_enc_tail+0xcd/0xd0 
> [aegis128l_aesni]
> 
> Next next next..

(/me does that.. all better)

From 6ac281ee69f4cb5b581d5f49662fb56b6326155a Mon Sep 17 00:00:00 2001
From: Borislav Petkov 
Date: Sun, 17 Jun 2018 13:57:42 +0200
Subject: [PATCH] x86/crypto: Add a missing RET

crypto_aegis128_aesni_enc_tail() needs to return too.

Signed-off-by: Borislav Petkov 
---
 arch/x86/crypto/aegis128-aesni-asm.S  |1 +
 arch/x86/crypto/aegis128l-aesni-asm.S |1 +
 arch/x86/crypto/aegis256-aesni-asm.S  |1 +
 arch/x86/crypto/morus1280-avx2-asm.S  |1 +
 arch/x86/crypto/morus1280-sse2-asm.S  |1 +
 arch/x86/crypto/morus640-sse2-asm.S   |1 +
 6 files changed, 6 insertions(+)

--- a/arch/x86/crypto/aegis128-aesni-asm.S
+++ b/arch/x86/crypto/aegis128-aesni-asm.S
@@ -535,6 +535,7 @@ ENTRY(crypto_aegis128_aesni_enc_tail)
movdqu STATE3, 0x40(STATEP)
 
FRAME_END
+   ret
 ENDPROC(crypto_aegis128_aesni_enc_tail)
 
 .macro decrypt_block a s0 s1 s2 s3 s4 i
--- a/arch/x86/crypto/aegis128l-aesni-asm.S
+++ b/arch/x86/crypto/aegis128l-aesni-asm.S
@@ -645,6 +645,7 @@ ENTRY(crypto_aegis128l_aesni_enc_tail)
state_store0
 
FRAME_END
+   ret
 ENDPROC(crypto_aegis128l_aesni_enc_tail)
 
 /*
--- a/arch/x86/crypto/aegis256-aesni-asm.S
+++ b/arch/x86/crypto/aegis256-aesni-asm.S
@@ -543,6 +543,7 @@ ENTRY(crypto_aegis256_aesni_enc_tail)
state_store0
 
FRAME_END
+   ret
 ENDPROC(crypto_aegis256_aesni_enc_tail)
 
 /*
--- a/arch/x86/crypto/morus1280-avx2-asm.S
+++ b/arch/x86/crypto/morus1280-avx2-asm.S
@@ -453,6 +453,7 @@ ENTRY(crypto_morus1280_avx2_enc_tail)
vmovdqu STATE4, (4 * 32)(%rdi)
 
FRAME_END
+   ret
 ENDPROC(crypto_morus1280_avx2_enc_tail)
 
 /*
--- a/arch/x86/crypto/morus1280-sse2-asm.S
+++ b/arch/x86/crypto/morus1280-sse2-asm.S
@@ -652,6 +652,7 @@ ENTRY(crypto_morus1280_sse2_enc_tail)
movdqu STATE4_HI, (9 * 16)(%rdi)
 
FRAME_END
+   ret
 ENDPROC(crypto_morus1280_sse2_enc_tail)
 
 /*
--- a/arch/x86/crypto/morus640-sse2-asm.S
+++ b/arch/x86/crypto/morus640-sse2-asm.S
@@ -437,6 +437,7 @@ ENTRY(crypto_morus640_sse2_enc_tail)
movdqu STATE4, (4 * 16)(%rdi)
 
FRAME_END
+   ret
 ENDPROC(crypto_morus640_sse2_enc_tail)
 
 /*
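
The oops "Code:" lines in this thread show the symptom directly: the function's last store is followed by a run of cc bytes (INT3 padding) where a c3 (RET) should be. As a rough aid for eyeballing such dumps, here is a userspace sketch that finds the first 0xcc in a space-separated hex string. sim_first_int3() is a made-up helper, and the check is deliberately naive: a cc byte can also be part of an operand, so a hit is a hint, not proof.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Return the index of the first 0xcc byte in a space-separated hex dump,
 * or -1 if there is none or the input is not plain hex.
 */
static long sim_first_int3(const char *dump)
{
	long idx = 0;

	while (*dump) {
		char *end;
		unsigned long byte;

		/* Skip separators between bytes. */
		while (isspace((unsigned char)*dump))
			dump++;
		if (!*dump)
			break;

		byte = strtoul(dump, &end, 16);
		if (end == dump)
			return -1;	/* not a hex token, give up */
		if (byte == 0xcc)
			return idx;

		idx++;
		dump = end;
	}
	return -1;
}
```

Fed the tail of the dump quoted earlier ("f3 0f 7f 5f 40 cc cc cc"), it points at index 5, the byte right after the final movdqu store.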


Re: [tip:x86/pti] x86/asm: Pad assembly functions with INT3 instructions

2018-06-17 Thread Mike Galbraith
On Sun, 2018-06-17 at 14:00 +0200, Borislav Petkov wrote:
> On Sun, Jun 17, 2018 at 01:40:13PM +0200, Mike Galbraith wrote:
> > On Mon, 2018-05-14 at 05:53 -0700, tip-bot for Alexey Dobriyan wrote:
> > > Commit-ID:  51bad67ffbce0aaa44579f84ef5d05597054ec6a
> > > Gitweb: 
> > > https://git.kernel.org/tip/51bad67ffbce0aaa44579f84ef5d05597054ec6a
> > > Author: Alexey Dobriyan 
> > > AuthorDate: Tue, 8 May 2018 00:37:55 +0300
> > > Committer:  Ingo Molnar 
> > > CommitDate: Mon, 14 May 2018 11:43:03 +0200
> > > 
> > > x86/asm: Pad assembly functions with INT3 instructions
> > > 
> > > Use INT3 instead of NOP. All that padding between functions is
> > > an illegal area, no legitimate code should jump into it.
> > 
> > Is dinky patchlet suggesting cryptomgr is being naughty?
> > 
> > (revert silences spew, but..)
> > 
> > ...
> > [   21.041608] int3:  [#1] SMP PTI
> > [   21.041754] CPU: 3 PID: 935 Comm: cryptomgr_test Tainted: GE 
> > 4.17.0.g075a1d3-tip-default #146
> > [   21.041888] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
> > 09/23/2013
> > [   21.042035] RIP: 0010:crypto_aegis128_aesni_enc_tail+0x74/0x80 
> > [aegis128_aesni]
> > [   21.042171] Code: 38 dc ca 66 0f 38 dc d3 66 0f 38 dc de 66 0f ef e5 f3 
> > 0f 7f 27 f3 0f 7f 47 10 f3 0f 7f 4f 20 f3 0f 7f 57 30 f3 0f 7f 5f 40 cc 
> >  cc cc cc cc cc cc cc cc cc cc cc 48 83 fe 10 0f 82 c3 03 00 00
> 
> Looks like it misses a RET:

Bingo.

[   28.751069] RIP: 0010:crypto_aegis128l_aesni_enc_tail+0xcd/0xd0 
[aegis128l_aesni]

Next next next..

> ---
> From 6ac281ee69f4cb5b581d5f49662fb56b6326155a Mon Sep 17 00:00:00 2001
> From: Borislav Petkov 
> Date: Sun, 17 Jun 2018 13:57:42 +0200
> Subject: [PATCH] x86/crypto: Add a missing RET
> 
> crypto_aegis128_aesni_enc_tail() needs to return too.
> 
> Signed-off-by: Borislav Petkov 
> ---
>  arch/x86/crypto/aegis128-aesni-asm.S | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/crypto/aegis128-aesni-asm.S 
> b/arch/x86/crypto/aegis128-aesni-asm.S
> index 9254e0b6cc06..717bf0776421 100644
> --- a/arch/x86/crypto/aegis128-aesni-asm.S
> +++ b/arch/x86/crypto/aegis128-aesni-asm.S
> @@ -535,6 +535,7 @@ ENTRY(crypto_aegis128_aesni_enc_tail)
>   movdqu STATE3, 0x40(STATEP)
>  
>   FRAME_END
> + ret
>  ENDPROC(crypto_aegis128_aesni_enc_tail)
>  
>  .macro decrypt_block a s0 s1 s2 s3 s4 i
> -- 
> 2.17.0.582.gccdcbd54c
> 


Re: [tip:x86/pti] x86/asm: Pad assembly functions with INT3 instructions

2018-06-17 Thread Mike Galbraith
On Mon, 2018-05-14 at 05:53 -0700, tip-bot for Alexey Dobriyan wrote:
> Commit-ID:  51bad67ffbce0aaa44579f84ef5d05597054ec6a
> Gitweb: 
> https://git.kernel.org/tip/51bad67ffbce0aaa44579f84ef5d05597054ec6a
> Author: Alexey Dobriyan 
> AuthorDate: Tue, 8 May 2018 00:37:55 +0300
> Committer:  Ingo Molnar 
> CommitDate: Mon, 14 May 2018 11:43:03 +0200
> 
> x86/asm: Pad assembly functions with INT3 instructions
> 
> Use INT3 instead of NOP. All that padding between functions is
> an illegal area, no legitimate code should jump into it.

Is dinky patchlet suggesting cryptomgr is being naughty?

(revert silences spew, but..)

...
[   21.041608] int3:  [#1] SMP PTI
[   21.041754] CPU: 3 PID: 935 Comm: cryptomgr_test Tainted: GE 
4.17.0.g075a1d3-tip-default #146
[   21.041888] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[   21.042035] RIP: 0010:crypto_aegis128_aesni_enc_tail+0x74/0x80 
[aegis128_aesni]
[   21.042171] Code: 38 dc ca 66 0f 38 dc d3 66 0f 38 dc de 66 0f ef e5 f3 0f 
7f 27 f3 0f 7f 47 10 f3 0f 7f 4f 20 f3 0f 7f 57 30 f3 0f 7f 5f 40 cc  cc cc 
cc cc cc cc cc cc cc cc cc 48 83 fe 10 0f 82 c3 03 00 00 
[   21.042333] RSP: 0018:963f81ee79b8 EFLAGS: 0246
[   21.042485] RAX: c0985950 RBX: 0001 RCX: 8a3ab90d6000
[   21.042640] RDX: 8a3ab90d6000 RSI: 0001 RDI: 963f81ee7af0
[   21.042792] RBP: 963f81ee7a90 R08: 0001 R09: 8a3ab90d6000
[   21.042953] R10: c1267690ad7d2d9e R11: ffe0 R12: 8a3ab90d6000
[   21.043100] R13: c0987040 R14: 963f81ee7af0 R15: 8a3ab90d6000
[   21.043250] FS:  () GS:8a3adecc() 
knlGS:
[   21.043405] CS:  0010 DS:  ES:  CR0: 80050033
[   21.043554] CR2: 7f2e169c4010 CR3: 0001f700a005 CR4: 001606e0
[   21.043704] Call Trace:
[   21.043854]  ? crypto_aegis128_aesni_process_crypt+0x8a/0xc0 [aegis128_aesni]
[   21.044004]  ? crypto_aegis128_aesni_crypt+0x238/0x440 [aegis128_aesni]
[   21.044156]  ? crypto_aegis128_aesni_crypt+0x238/0x440 [aegis128_aesni]
[   21.044311]  ? crypto_aegis128_aesni_encrypt+0x62/0xb0 [aegis128_aesni]
[   21.044454]  ? crypto_aegis128_aesni_encrypt+0x62/0xb0 [aegis128_aesni]
[   21.044597]  ? crypto_aead_setauthsize+0x23/0x40
[   21.044739]  ? __test_aead+0x632/0x15d0
[   21.044884]  ? crypto_aegis128_aesni_crypt+0x440/0x440 [aegis128_aesni]
[   21.045026]  ? __test_aead+0x632/0x15d0
[   21.045167]  ? crypto_alloc_tfm+0x52/0xf0
[   21.045308]  ? crypto_acomp_scomp_free_ctx+0x30/0x30
[   21.045449]  ? crypto_create_tfm+0x32/0xe0
[   21.045594]  ? crypto_acomp_scomp_free_ctx+0x30/0x30
[   21.045734]  ? crypto_acomp_scomp_free_ctx+0x30/0x30
[   21.045877]  ? test_aead+0x21/0xa0
[   21.046015]  ? alg_test_aead+0x3f/0xa0
[   21.046154]  ? alg_test.part.13+0x170/0x370
[   21.046291]  ? pick_next_task_fair+0x134/0x5d0
[   21.046426]  ? __switch_to+0x92/0x4b0
[   21.046565]  ? finish_task_switch+0x7f/0x2d0
[   21.046701]  ? __schedule+0x2b8/0x860
[   21.046833]  ? crypto_acomp_scomp_free_ctx+0x30/0x30
[   21.046963]  ? cryptomgr_test+0x40/0x50
[   21.047092]  ? kthread+0x11e/0x140
[   21.047221]  ? kthread_associate_blkcg+0xb0/0xb0
[   21.047350]  ? ret_from_fork+0x3a/0x50
[   21.047478] Modules linked in: aegis128_aesni(E+) snd_timer(E) 
crct10dif_pclmul(E) r8169(E) snd(E) crc32_pclmul(E) mii(E) iTCO_wdt(E) 
ghash_clmulni_intel(E) iTCO_vendor_support(E) pcbc(E) gpio_ich(E) 
aesni_intel(E) soundcore(E) aes_x86_64(E) lpc_ich(E) crypto_simd(E) mei_me(E) 
cryptd(E) mfd_core(E) i2c_i801(E) mei(E) glue_helper(E) pcspkr(E) thermal(E) 
intel_smartconnect(E) fan(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E) 
grace(E) sunrpc(E) sch_fq_codel(E) sr_mod(E) cdrom(E) hid_logitech_hidpp(E) 
hid_logitech_dj(E) uas(E) usb_storage(E) hid_generic(E) usbhid(E) nouveau(E) 
wmi(E) i2c_algo_bit(E) drm_kms_helper(E) syscopyarea(E) xhci_pci(E) 
sysfillrect(E) sysimgblt(E) fb_sys_fops(E) ahci(E) ttm(E) ehci_pci(E) 
libahci(E) xhci_hcd(E) ehci_hcd(E) libata(E) drm(E) usbcore(E) video(E) 
button(E) sd_mod(E)
[   21.048064]  vfat(E) fat(E) virtio_blk(E) virtio_mmio(E) virtio_pci(E) 
virtio_ring(E) virtio(E) ext4(E) crc32c_intel(E) crc16(E) mbcache(E) jbd2(E) 
loop(E) sg(E) dm_multipath(E) dm_mod(E) scsi_dh_rdac(E) scsi_dh_emc(E) 
scsi_dh_alua(E) scsi_mod(E) efivarfs(E)
[   21.048396] Dumping ftrace buffer:
[   21.048556](ftrace buffer empty)
[   21.048726] ---[ end trace 8cdd2dd0a107e807 ]---
[   21.048901] RIP: 0010:crypto_aegis128_aesni_enc_tail+0x74/0x80 
[aegis128_aesni]
[   21.049051] Code: 38 dc ca 66 0f 38 dc d3 66 0f 38 dc de 66 0f ef e5 f3 0f 
7f 27 f3 0f 7f 47 10 f3 0f 7f 4f 20 f3 0f 7f 57 30 f3 0f 7f 5f 40 cc  cc cc 
cc cc cc cc cc cc cc cc cc 48 83 fe 10 0f 82 c3 03 00 00 
[   21.049224] RSP: 0018:963f81ee79b8 EFLAGS: 0246
[   21.049390] RAX: c0985950 RBX: 0001 RCX: 8a3ab90d6000
[   21.049579] RDX: 8a3ab90d6000 RSI: 

v4.14.21+: ATOMIC_SLEEP splat bisected to 9428088c90b6 ("drm/qxl: reapply cursor after resetting primary")

2018-06-16 Thread Mike Galbraith
Greetings,

Running a kernel with ATOMIC_SLEEP enabled in one of my VMs, I met the
splat below.  I tracked it back to 4.14-stable, and bisected it there.

[   35.748479] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[   37.302172] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[   43.719223] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[   45.284748] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[   47.544198] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[   49.024251] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[   50.222626] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[   67.521590] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[  864.956846] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[  866.478807] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[ 2245.113210] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[ 2308.323698] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[ 2325.967740] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[ 3355.291413] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[ 3367.545378] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[ 3395.581055] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[ 3405.144002] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239

[   35.748479] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:239
[   35.750518] in_atomic(): 1, irqs_disabled(): 0, pid: 2482, name: X
[   35.752119] CPU: 0 PID: 2482 Comm: X Kdump: loaded Tainted: GE     4.17.0.g4c5e8fc-default #846
[   35.754276] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
[   35.756381] Call Trace:
[   35.756921]  dump_stack+0x85/0xcb
[   35.757415]  ___might_sleep+0xd8/0x130
[   35.757818]  mutex_lock+0x1c/0x40
[   35.758202]  qxl_surface_evict+0x23/0x60 [qxl]
[   35.758756]  qxl_gem_object_free+0x27/0x40 [qxl]
[   35.759247]  qxl_bo_unref+0x1d/0x30 [qxl]
[   35.759721]  qxl_cursor_atomic_update+0x251/0x2a0 [qxl]
[   35.760314]  drm_atomic_helper_commit_planes+0xdf/0x220 [drm_kms_helper]
[   35.761065]  drm_atomic_helper_commit_tail+0x26/0x60 [drm_kms_helper]
[   35.761722]  commit_tail+0x5f/0x70 [drm_kms_helper]
[   35.762160]  drm_atomic_helper_commit+0xfc/0x110 [drm_kms_helper]
[   35.762712]  drm_atomic_helper_update_plane+0xf0/0x110 [drm_kms_helper]
[   35.763277]  __setplane_internal+0x196/0x240 [drm]
[   35.763698]  drm_mode_cursor_universal+0xec/0x1d0 [drm]
[   35.764143]  drm_mode_cursor_common+0x16a/0x1d0 [drm]
[   35.764570]  ? drm_mode_cursor_ioctl+0x50/0x50 [drm]
[   35.765011]  drm_ioctl_kernel+0x81/0xd0 [drm]
[   35.765425]  drm_ioctl+0x2a8/0x350 [drm]
[   35.765795]  ? drm_mode_cursor_ioctl+0x50/0x50 [drm]
[   35.766256]  do_vfs_ioctl+0x91/0x6a0
[   35.766770]  ? __do_page_fault+0x27e/0x4f0
[   35.767116]  ksys_ioctl+0x70/0x80
[   35.767397]  __x64_sys_ioctl+0x16/0x20
[   35.767732]  do_syscall_64+0x60/0x180
[   35.768042]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   35.768470] RIP: 0033:0x7fe99fc0a477
[   35.768783] Code: b3 66 90 48 8b 05 21 4a 2c 00 64 c7 00 26 00 00 00 48 c7 
c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d f1 49 2c 00 f7 d8 64 89 01 48 
[   35.770453] RSP: 002b:7fff8dcc3ca8 EFLAGS: 3246 ORIG_RAX: 
0010
[   35.771092] RAX: ffda RBX: 5594217685e0 RCX: 7fe99fc0a477
[   35.771690] RDX: 7fff8dcc3ce0 RSI: c02464bb RDI: 0018
[   35.772298] RBP: 7fff8dcc3ce0 R08: 0040 R09: 0004
[   35.773105] R10: 0040 R11: 3246 R12: c02464bb
[   35.773758] R13: 0018 R14: 0040 R15: 7fff8dcc3db4

git bisect start
# good: [569dbb88e80deb68974ef6fdd6a13edb9d686261] Linux 4.13
git bisect good 569dbb88e80deb68974ef6fdd6a13edb9d686261
# bad: [cda6fd4d9382205bb792255cd56a91062d404bc0] Linux 4.14.50
git bisect bad cda6fd4d9382205bb792255cd56a91062d404bc0
# good: [fbd01410e89a66f346ba1b3c0161e1198449b746] Merge 
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
git bisect good fbd01410e89a66f346ba1b3c0161e1198449b746
# good: [e4b57c4bc11bfaa2588557b52ad5abd65fb9390e] sparc64: mmu_context: Add 
missing include files
git bisect good e4b57c4bc11bfaa2588557b52ad5abd65fb9390e
# bad: [703fca31ac31ed81455a8642a6988cbca47f0f07] tpm: st33zp24: fix potential 
buffer overruns caused by bit glitches on the 

[Fwd: avahi-daemon.service startup failure post kernel commit f396922d862a]

2018-06-13 Thread Mike Galbraith
Well, the folks at "To:" below apparently don't want bug reports from
non-subscribers (no mediation, simply rejected).  Posting here simply
because it may save some other busy person a bisection. 

-------- Forwarded Message --------
From: Mike Galbraith 
To: av...@lists.freedesktop.org
Subject: avahi-daemon.service startup failure post kernel commit
f396922d862a
Date: Wed, 13 Jun 2018 13:32:25 +0200

Greetings,

Service startup failure bisected to a kernel commit, but that commit
points the finger at userspace, ergo an attempt to report it.  Let's
see if it bounces.

homer:~ # systemctl status avahi-daemon
● avahi-daemon.service - Avahi mDNS/DNS-SD Stack
   Loaded: loaded (/usr/lib/systemd/system/avahi-daemon.service; enabled; 
vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2018-06-13 09:49:58 CEST; 1min 
54s ago
  Process: 1930 ExecStart=/usr/sbin/avahi-daemon -s (code=exited, status=255)
 Main PID: 1930 (code=exited, status=255)
   Status: "avahi-daemon 0.6.32 exiting."

Jun 13 09:49:58 homer systemd[1]: Started Avahi mDNS/DNS-SD Stack.
Jun 13 09:49:58 homer avahi-daemon[1930]: Loading service file 
/etc/avahi/services/sftp-ssh.service.
Jun 13 09:49:58 homer avahi-daemon[1930]: Loading service file 
/etc/avahi/services/ssh.service.
Jun 13 09:49:58 homer avahi-daemon[1930]: SO_REUSEADDR failed: Structure needs 
cleaning
Jun 13 09:49:58 homer avahi-daemon[1930]: SO_REUSEADDR failed: Structure needs 
cleaning
Jun 13 09:49:58 homer avahi-daemon[1930]: Failed to create server: No suitable 
network protocol available
Jun 13 09:49:58 homer avahi-daemon[1930]: avahi-daemon 0.6.32 exiting.
Jun 13 09:49:58 homer systemd[1]: avahi-daemon.service: Main process exited, 
code=exited, status=255/n/a
Jun 13 09:49:58 homer systemd[1]: avahi-daemon.service: Unit entered failed 
state.
Jun 13 09:49:58 homer systemd[1]: avahi-daemon.service: Failed with result 
'exit-code'.
homer:~ #

f396922d862aa05b53ad740596652691a723ee23 is the first bad commit
commit f396922d862aa05b53ad740596652691a723ee23
Author: Maciej Żenczykowski 
Date:   Sun Jun 3 10:47:05 2018 -0700

net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets

It is not safe to do so because such sockets are already in the
hash tables and changing these options can result in invalidating
the tb->fastreuse(port) caching.

This can have later far reaching consequences wrt. bind conflict checks
which rely on these caches (for optimization purposes).

Not to mention that you can currently end up with two identical
non-reuseport listening sockets bound to the same local ip:port
by clearing reuseport on them after they've already both been bound.

There is unfortunately no EISBOUND error or anything similar,
and EISCONN seems to be misleading for a bound-but-not-connected
socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
is the closest you can get to meaning 'socket in bad state'.
(although perhaps EINVAL wouldn't be a bad choice either?)

This does unfortunately run the risk of breaking buggy
userspace programs...

Signed-off-by: Maciej Żenczykowski 
Cc: Eric Dumazet 
Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
Reviewed-by: Eric Dumazet 
Signed-off-by: David S. Miller 

:04 04 39b702bc132c8aa812fbd452822a7047331553a1 
e0ed7194986fd828073702d5346a4f91fbd6ea01 M  net


Re: [PATCH] Revert "debugfs: inode: debugfs_create_dir uses mode permission from parent"

2018-06-11 Thread Mike Galbraith
On Mon, 2018-06-11 at 11:12 -0700, Laura Abbott wrote:
> On 06/11/2018 02:28 AM, Thomas Richter wrote:
> > This reverts commit 95cde3c59966f6371b6bcd9e4e2da2ba64ee9775.
> > It breaks the ioctl(KVM_CREATE_VM) interface.
> > 
> 
> Can you elaborate a little more on how this breaks? Fedora has
> gotten at least one report of a failure in this ioctl and
> I'd like know if it's the same issue.

What I reported is here https://lkml.org/lkml/2018/6/8/79

-Mike


regression: "95cde3c59966 debugfs: inode: debugfs_create_dir uses mode permission from parent" terminally annoys libvirt

2018-06-08 Thread Mike Galbraith
Greetings,

$subject bisected and verified via revert.  Box is garden variety
i4790, distro is openSUSE Leap 15.0.

Error starting domain: internal error: process exited while connecting to monitor: ioctl(KVM_CREATE_VM) failed: 12 Cannot allocate memory
2018-06-08T03:18:00.453006Z qemu-system-x86_64: failed to initialize KVM: Cannot allocate memory

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 89, in cb_wrapper
callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 125, in tmpcb
callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/libvirtobject.py", line 82, in newfn
ret = fn(self, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/domain.py", line 1508, in startup
self._backend.create()
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 1069, in create
if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirt.libvirtError: internal error: process exited while connecting to 
monitor: ioctl(KVM_CREATE_VM) failed: 12 Cannot allocate memory
2018-06-08T03:18:00.453006Z qemu-system-x86_64: failed to initialize KVM: 
Cannot allocate memory

95cde3c59966f6371b6bcd9e4e2da2ba64ee9775 is the first bad commit
commit 95cde3c59966f6371b6bcd9e4e2da2ba64ee9775
Author: Thomas Richter 
Date:   Fri Apr 27 14:35:47 2018 +0200

debugfs: inode: debugfs_create_dir uses mode permission from parent

git bisect start
# good: [29dcea88779c856c7dc92040a0c01233263101d4] Linux 4.17
git bisect good 29dcea88779c856c7dc92040a0c01233263101d4
# bad: [ba1b7309fc2e909a5828c36a7cd187e5d7df6f53] Merge branch 'next-smack' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security
git bisect bad ba1b7309fc2e909a5828c36a7cd187e5d7df6f53
# bad: [135c5504a600ff9b06e321694fbcac78a9530cd4] Merge tag 
'drm-next-2018-06-06-1' of git://anongit.freedesktop.org/drm/drm
git bisect bad 135c5504a600ff9b06e321694fbcac78a9530cd4
# good: [5231804cf9e584f3e7e763a0d6d2fffe011c1bce] Merge tag 
'leds_for_4.18-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds
git bisect good 5231804cf9e584f3e7e763a0d6d2fffe011c1bce
# good: [315852b422972e6ebb1dfddaadada09e46a2681a] drm: rcar-du: Fix build 
failure
git bisect good 315852b422972e6ebb1dfddaadada09e46a2681a
# bad: [ec064d3c6b40697fd72f4b1eeabbf293b7947a04] Merge tag 
'driver-core-4.18-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
git bisect bad ec064d3c6b40697fd72f4b1eeabbf293b7947a04
# good: [a941fc3957113df977e7396c2cf1679e87a6] USB: typec: tcpm: no need to 
check return value of debugfs_create_dir()
git bisect good a941fc3957113df977e7396c2cf1679e87a6
# good: [07c4dd3435aa387d3b58f4e941dc516513f14507] Merge tag 'usb-4.18-rc1' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
git bisect good 07c4dd3435aa387d3b58f4e941dc516513f14507
# good: [e55077307d6f7ce86f9a468408fa5613c4a5cb5d] nvmem: meson-efuse: add 
write support
git bisect good e55077307d6f7ce86f9a468408fa5613c4a5cb5d
# good: [fdff4053d51be4850185aa895813405decd6e956] fpga: clarify that 
unregister functions also free
git bisect good fdff4053d51be4850185aa895813405decd6e956
# good: [1bc4d68b06ab913d392c8ad6481b9729bd58b8d5] ath10k: re-enable the 
firmware fallback mechanism for testmode
git bisect good 1bc4d68b06ab913d392c8ad6481b9729bd58b8d5
# bad: [085aa2de568493d7cde52126512d37260077811a] mm: memory_hotplug: use 
put_device() if device_register fail
git bisect bad 085aa2de568493d7cde52126512d37260077811a
# good: [f0a462970ee19930518b03798a21cfdb3fd35877] Documentation: clarify 
firmware_class provenance and why we can't rename the module
git bisect good f0a462970ee19930518b03798a21cfdb3fd35877
# bad: [95cde3c59966f6371b6bcd9e4e2da2ba64ee9775] debugfs: inode: 
debugfs_create_dir uses mode permission from parent
git bisect bad 95cde3c59966f6371b6bcd9e4e2da2ba64ee9775
# good: [964f8363a1aba6cb4198bfaaac538b08f1c538f1] debugfs: Re-use 
kstrtobool_from_user()
git bisect good 964f8363a1aba6cb4198bfaaac538b08f1c538f1
# first bad commit: [95cde3c59966f6371b6bcd9e4e2da2ba64ee9775] debugfs: inode: 
debugfs_create_dir uses mode permission from parent



Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm

2018-06-01 Thread Mike Galbraith
On Fri, 2018-06-01 at 13:03 -0700, Andy Lutomirski wrote:
> 
> Mike, you never did say: do you have PCID on your CPU?

Yes.

>   Also, what is
> your workload doing to cause so many switches back and forth between
> init_mm and a task.

pipe-test measures pipe round trip, does nearly nothing but schedule.  

-Mike


Re: [PATCH] x86,switch_mm: skip atomic operations for init_mm

2018-06-01 Thread Mike Galbraith
On Fri, 2018-06-01 at 14:22 -0400, Rik van Riel wrote:
> On Fri, 2018-06-01 at 08:11 -0700, Andy Lutomirski wrote:
> > On Fri, Jun 1, 2018 at 5:28 AM Rik van Riel  wrote:
> > > 
> > > Song noticed switch_mm_irqs_off taking a lot of CPU time in recent
> > > > kernels, using 2.4% of a 48 CPU system during a netperf to localhost
> > > run.
> > > Digging into the profile, we noticed that cpumask_clear_cpu and
> > > cpumask_set_cpu together take about half of the CPU time taken by
> > > switch_mm_irqs_off.
> > > 
> > > However, the CPUs running netperf end up switching back and forth
> > > between netperf and the idle task, which does not require changes
> > > to the mm_cpumask. Furthermore, the init_mm cpumask ends up being
> > > > the most heavily contended one in the system.
> > > 
> > > Skipping cpumask_clear_cpu and cpumask_set_cpu for init_mm
> > > (mostly the idle task) reduced CPU use of switch_mm_irqs_off
> > > from 2.4% of the CPU to 1.9% of the CPU, with the following
> > > netperf commandline:
> > 
> > > I'm conceptually fine with this change.  Does mm_cpumask(&init_mm)
> > > end up in a deterministic state?
> > 
> > Given that we do not touch mm_cpumask(&init_mm)
> any more, and that bitmask never appears to be
> used for things like tlb shootdowns (kernel TLB
> shootdowns simply go to everybody), I suspect
> it ends up in whatever state it is initialized
> to on startup.
> 
> I had not looked into this much, because it does
> not appear to be used for anything.
> 
> > Mike, depending on exactly what's going on with your benchmark, this
> > might help recover a bit of your performance, too.
> 
> It will be interesting to know how this change
> impacts others.

previous pipe-test numbers
4.13.16 2.024978 usecs/loop -- avg 2.045250 977.9 KHz
4.14.47 2.234518 usecs/loop -- avg 2.227716 897.8 KHz
4.15.18 2.287815 usecs/loop -- avg 2.295858 871.1 KHz
4.16.13 2.286036 usecs/loop -- avg 2.279057 877.6 KHz
4.17.0.g88a8676 2.288231 usecs/loop -- avg 2.288917 873.8 KHz

new numbers
4.17.0.g0512e01 2.268629 usecs/loop -- avg 2.269493 881.3 KHz
4.17.0.g0512e01 2.035401 usecs/loop -- avg 2.038341 981.2 KHz +andy
4.17.0.g0512e01 2.238701 usecs/loop -- avg 2.231828 896.1 KHz -andy+rik

There might be something there with your change Rik, but it's small
enough to be wary of variance.  Andy's "invert the return of
tlb_defer_switch_to_init_mm()" is OTOH pretty clear.

-Mike


4.13..4.14 scheduling overhead regression (bisected - b956575bed91)

2018-06-01 Thread Mike Galbraith
Greetings,

While dusting off regression testing trees, I noticed a substantial
pipe-test dent at 4.14, and bisected it to b956575bed91.  Log below.

skew_tick=1 audit=0 nodelayacct cgroup_disable=memory nopti nospectre_v2 nospec_store_bypass_disable
gov performance
taskset 0xc pipe-test 1

4.4.134 2.125751 usecs/loop -- avg 2.138286 935.3 KHz
4.5.7   2.111431 usecs/loop -- avg 2.141415 934.0 KHz
4.6.7   2.060672 usecs/loop -- avg 2.059993 970.9 KHz
4.7.10  2.086366 usecs/loop -- avg 2.090943 956.5 KHz
4.8.17  2.008318 usecs/loop -- avg 2.005780 997.1 KHz + ~2.00-ish   1.000
4.9.104 1.989143 usecs/loop -- avg 2.010945 994.6 KHz
4.10.17 2.010789 usecs/loop -- avg 2.006868 996.6 KHz
4.11.12 2.032123 usecs/loop -- avg 2.037766 981.5 KHz - ~2.04-ish   1.015   1.000
4.12.14 2.055745 usecs/loop -- avg 2.065894 968.1 KHz
4.13.16 2.024978 usecs/loop -- avg 2.045250 977.9 KHz
4.14.47 2.234518 usecs/loop -- avg 2.227716 897.8 KHz -- >2.22      1.110   1.093   1.000
4.15.18 2.287815 usecs/loop -- avg 2.295858 871.1 KHz --- >2.28     1.144   1.126   1.030
4.16.13 2.286036 usecs/loop -- avg 2.279057 877.6 KHz
4.17.0.g88a8676 2.288231 usecs/loop -- avg 2.288917 873.8 KHz

b956575bed91ecfb136a8300742ecbbf451471ab is the first bad commit
commit b956575bed91ecfb136a8300742ecbbf451471ab
Author: Andy Lutomirski 
Date:   Mon Oct 9 09:50:49 2017 -0700

x86/mm: Flush more aggressively in lazy TLB mode

git bisect start
# good: [569dbb88e80deb68974ef6fdd6a13edb9d686261] Linux 4.13
git bisect good 569dbb88e80deb68974ef6fdd6a13edb9d686261
# bad: [57a3ca7835962109d94533465a75e8c716b26845] Linux 4.14.47
git bisect bad 57a3ca7835962109d94533465a75e8c716b26845
# good: [fbf4432ff71b7a25bef993a5312906946d27f446] Merge branch 'akpm' (patches 
from Andrew)
git bisect good fbf4432ff71b7a25bef993a5312906946d27f446
# bad: [3fefc31843cfe2b5f072efe11ed9ccaf6a7a5092] Merge tag 'pm-final-4.14' of 
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect bad 3fefc31843cfe2b5f072efe11ed9ccaf6a7a5092
# good: [8d93c7a4315711ea0f7a95ca353a89c4ed0763fb] Merge tag 
'pci-v4.14-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
git bisect good 8d93c7a4315711ea0f7a95ca353a89c4ed0763fb
# good: [0f380715e51f5ff418cfccb4cd0d4fe4c48c3241] Merge tag 'sound-4.14-rc4' 
of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect good 0f380715e51f5ff418cfccb4cd0d4fe4c48c3241
# bad: [e5f468b3f23313994c5e6c356135f9b0d76bcb94] Merge branch 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
git bisect bad e5f468b3f23313994c5e6c356135f9b0d76bcb94
# good: [3d7882769b5dc929690f96e0c318c29b97f51018] Merge tag 'devprop-4.14-rc5' 
of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 3d7882769b5dc929690f96e0c318c29b97f51018
# bad: [ae7df8f985f1b0445366ae6f6324cd08a218526e] Merge tag 
'char-misc-4.14-rc5' of 
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad ae7df8f985f1b0445366ae6f6324cd08a218526e
# good: [be1f16ba35d97aff4d85c0daba0a02da51b7c83c] Merge branch '4.14-fixes' of 
git://git.linux-mips.org/pub/scm/ralf/upstream-linus
git bisect good be1f16ba35d97aff4d85c0daba0a02da51b7c83c
# good: [a339b351304d5e6b02c7cf8eed895d181e64bce0] Merge branch 
'sched-urgent-for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good a339b351304d5e6b02c7cf8eed895d181e64bce0
# bad: [7a23c5abb930cefcef85df6dc0c8fb3e8961980c] Merge tag 
'dmaengine-fix-4.14-rc5' of git://git.infradead.org/users/vkoul/slave-dma
git bisect bad 7a23c5abb930cefcef85df6dc0c8fb3e8961980c
# good: [ab7ff471aa5db670197070760f022622793da7e5] x86/hyperv: Fix hypercalls 
with extended CPU ranges for TLB flushing
git bisect good ab7ff471aa5db670197070760f022622793da7e5
# bad: [1f161f67a272cc4f29f27934dd3f74cb657eb5c4] x86/microcode: Do the family 
check first
git bisect bad 1f161f67a272cc4f29f27934dd3f74cb657eb5c4
# good: [cc6afe2240298049585e86b1ade85efc8a7f225d] x86/apic: Silence "FW_BUG 
TSC_DEADLINE disabled due to Errata" on hypervisors
git bisect good cc6afe2240298049585e86b1ade85efc8a7f225d
# bad: [b956575bed91ecfb136a8300742ecbbf451471ab] x86/mm: Flush more 
aggressively in lazy TLB mode
git bisect bad b956575bed91ecfb136a8300742ecbbf451471ab
# good: [616dd5872e52493863b0202632703eebd51243dc] x86/apic: Update 
TSC_DEADLINE quirk with additional SKX stepping
git bisect good 616dd5872e52493863b0202632703eebd51243dc
# first bad commit: [b956575bed91ecfb136a8300742ecbbf451471ab] x86/mm: Flush 
more aggressively in lazy TLB mode


Re: [PATCH] x86: UV: raw_spinlock conversion

2018-05-22 Thread Mike Galbraith
On Tue, 2018-05-22 at 11:46 +0200, Mike Galbraith wrote:
> On Tue, 2018-05-22 at 11:14 +0200, Sebastian Andrzej Siewior wrote:
> 
> >  If you suggest that I
> > should stop caring about UV then I do so. Please post a patch that adds
> > a dependency to UV on PREEMPT so that part of the architecture is
> > documented.
> 
> Will do.

On second thought, no I won't.  It's either already known, or it should
be, making any such submission smell funny.

-Mike


Re: [PATCH] x86: UV: raw_spinlock conversion

2018-05-22 Thread Mike Galbraith
On Tue, 2018-05-22 at 11:14 +0200, Sebastian Andrzej Siewior wrote:
> On 2018-05-22 10:24:22 [+0200], Mike Galbraith wrote:
> 
> > If I were in your shoes, I think I'd just stop caring about UV until a
> > real user appears.  AFAIK, I'm the only guy who ever ran RT on UV, and
> > I only did so because SUSE asked me to look into it.. years ago now.
> 
> Okay. The problem I have with this patch is that it remains RT only
> while the problem it addresses is not RT-only and PREEMPT kernels are
> very much affected.

Ah, but when RT gets merged (someday... maybe), that patch will apply,
and instantly make all.. zero.. UV-RT users happy campers :)

> The thing is that *you* are my only UV user :)

Crash-test-dummies don't really qualify as users :)

>  If you suggest that I
> should stop caring about UV then I do so. Please post a patch that adds
> a dependency to UV on PREEMPT so that part of the architecture is
> documented.

Will do.

-Mike


Re: [PATCH] x86: UV: raw_spinlock conversion

2018-05-22 Thread Mike Galbraith
On Tue, 2018-05-22 at 08:50 +0200, Sebastian Andrzej Siewior wrote:
> 
> Regarding the preempt_disable() in the original patch in uv_read_rtc():
> This looks essential for PREEMPT configs. Is it possible to get this
> tested by someone or else get rid of the UV code? It looks broken for
> "uv_get_min_hub_revision_id() != 1".

I suspect SGI cares not one whit about PREEMPT.

> Why does PREEMPT_RT require migrate_disable() but PREEMPT only is fine
> as-is? This does not look right.

UV is not ok with a PREEMPT config, it's just that for RT it's dirt
simple to shut it up, whereas for PREEMPT, preempt_disable() across
uv_bau_init() doesn't cut it due to allocations, and whatever else I
would have met before ending the whack-a-mole game.

If I were in your shoes, I think I'd just stop caring about UV until a
real user appears.  AFAIK, I'm the only guy who ever ran RT on UV, and
I only did so because SUSE asked me to look into it.. years ago now.

-Mike


Re: [PATCH] x86: UV: raw_spinlock conversion

2018-05-19 Thread Mike Galbraith
On Mon, 2018-05-07 at 09:39 +0200, Sebastian Andrzej Siewior wrote:
> On 2018-05-06 12:59:19 [+0200], Mike Galbraith wrote:
> > On Sun, 2018-05-06 at 12:26 +0200, Thomas Gleixner wrote:
> > > On Fri, 4 May 2018, Sebastian Andrzej Siewior wrote:
> > > 
> > > > From: Mike Galbraith <umgwanakikb...@gmail.com>
> > > > 
> > > > Shrug.  Lots of hobbyists have a beast in their basement, right?
> > > 
> > > This hardly qualifies as a proper changelog ...
> > 
> > Hm, that wasn't intended to be a changelog.
> > 
> > This patch may not be current either, I haven't tested RT on a UV box
> > in quite some time.
> 
> That last hunk looks like something that would be required even for !RT. 
> Would you mind to check that patch and write a changelog? If it doesn't
> work for RT there is no need to carry this in -RT.

None of that patch is needed for a UV3000, but the below is.  It's
likely still valid for now ancient UV boxen, but the UV100 the patch
was originally written for (2011/2.6.33-rt) has apparently wandered off
to become a beer keg or something meanwhile, so I can't test.

UV: Fix uv_bau_init() check_preemption_disabled() gripeage

[2.851947] BUG: using smp_processor_id() in preemptible [] code: swapper/0/1
[2.851951] caller is uv_bau_init+0x28/0xb62
[2.851954] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.17.0-rc5-0.g3e3e37b-rt_debug
[2.851956] Hardware name: SGI UV3000/UV3000, BIOS SGI UV 3000 series BIOS 01/15/2015
[2.851957] Call Trace:
[2.851964]  dump_stack+0x85/0xcb
[2.851969]  check_preemption_disabled+0x10c/0x120
[2.851972]  ? init_per_cpu+0x88c/0x88c
[2.851974]  uv_bau_init+0x28/0xb62
[2.851979]  ? lapic_cal_handler+0xbb/0xbb
[2.851982]  ? rt_mutex_unlock+0x35/0x50
[2.851985]  ? init_per_cpu+0x88c/0x88c
[2.851988]  ? set_debug_rodata+0x11/0x11
[2.851991]  do_one_initcall+0x46/0x249
[2.851995]  kernel_init_freeable+0x207/0x29c
[2.851999]  ? rest_init+0xd0/0xd0
[2.852000]  kernel_init+0xa/0x110
[2.852000]  ret_from_fork+0x3a/0x50

(gdb) list *uv_bau_init+0x28
0x824a4d96 is in uv_bau_init (./arch/x86/include/asm/uv/uv_hub.h:212).
207 return (struct uv_hub_info_s *)__uv_hub_info_list[node];
208 }
209
210 static inline struct uv_hub_info_s *_uv_hub_info(void)
211 {
212 return (struct uv_hub_info_s *)uv_cpu_info->p_uv_hub_info;
213 }
214 #define uv_hub_info _uv_hub_info()
215
216 static inline struct uv_hub_info_s *uv_cpu_hub_info(int cpu)
(gdb)

arch/x86/include/asm/uv/uv_hub.h:
197 #define uv_cpu_info this_cpu_ptr(&__uv_cpu_info)

This and other substitutions make uv_bau_init() annoying for a PREEMPT
kernel, but PREEMPT_RT can silence the lot with one migrate_disable().

Signed-off-by: Mike Galbraith <efa...@gmx.de>
---
 arch/x86/platform/uv/tlb_uv.c |5 +
 1 file changed, 5 insertions(+)

--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -2213,6 +2213,8 @@ static int __init uv_bau_init(void)
if (!is_uv_system())
return 0;
 
+   migrate_disable();
+
if (is_uv4_hub())
ops = uv4_bau_ops;
else if (is_uv3_hub())
@@ -2269,6 +2271,8 @@ static int __init uv_bau_init(void)
}
}
 
+   migrate_enable();
+
return 0;
 
 err_bau_disable:
@@ -2276,6 +2280,7 @@ static int __init uv_bau_init(void)
for_each_possible_cpu(cur_cpu)
free_cpumask_var(per_cpu(uv_flush_tlb_mask, cur_cpu));
 
+   migrate_enable();
set_bau_off();
nobau_perm = 1;
 


Re: cpu stopper threads and load balancing leads to deadlock

2018-05-17 Thread Mike Galbraith
On Thu, 2018-05-17 at 07:03 -0700, Paul E. McKenney wrote:
> On Tue, May 15, 2018 at 06:30:26AM +0200, Mike Galbraith wrote:
> 
> > > Something like so perhaps? Mike, can you play around with that? Could
> > > burn your granny and eat your cookies.
> > 
> > Did this get queued anywhere?
> 
> I have not queued it, but given Peter's Signed-off-by and your Tested-by
> I would be happy to do so.

Here's the latter. Tested-by: Mike Galbraith <efa...@gmx.de>

> > > diff --git a/arch/x86/kernel/cpu/mtrr/main.c 
> > > b/arch/x86/kernel/cpu/mtrr/main.c
> > > index 7468de429087..07360523c3ce 100644
> > > --- a/arch/x86/kernel/cpu/mtrr/main.c
> > > +++ b/arch/x86/kernel/cpu/mtrr/main.c
> > > @@ -793,6 +793,9 @@ void mtrr_ap_init(void)
> > >  
> > >   if (!use_intel() || mtrr_aps_delayed_init)
> > >   return;
> > > +
> > > + rcu_cpu_starting(smp_processor_id());
> > > +
> > >   /*
> > >* Ideally we should hold mtrr_mutex here to avoid mtrr entries
> > >* changed, but this routine will be called in cpu boot time,
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 2a734692a581..4dab46950fdb 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
> > >   return 0;
> > >  }
> > >  
> > > +static DEFINE_PER_CPU(int, rcu_cpu_started);
> > > +
> > >  /*
> > >   * Mark the specified CPU as being online so that subsequent grace 
> > > periods
> > >   * (both expedited and normal) will wait on it.  Note that this means 
> > > that
> > > @@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
> > >   struct rcu_node *rnp;
> > >   struct rcu_state *rsp;
> > >  
> > > + if (per_cpu(rcu_cpu_started, cpu))
> > > + return;
> > > +
> > > + per_cpu(rcu_cpu_started, cpu) = 1;
> > > +
> > >   for_each_rcu_flavor(rsp) {
> > >   rdp = per_cpu_ptr(rsp->rda, cpu);
> > >   rnp = rdp->mynode;
> > > @@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
> > >   preempt_enable();
> > >   for_each_rcu_flavor(rsp)
> > >   rcu_cleanup_dying_idle_cpu(cpu, rsp);
> > > +
> > > + per_cpu(rcu_cpu_started, cpu) = 0;
> > >  }
> > >  
> > >  /* Migrate the dead CPU's callbacks to the current CPU. */
> > 
> 


Re: cpu stopper threads and load balancing leads to deadlock

2018-05-14 Thread Mike Galbraith
On Thu, 2018-05-03 at 18:45 +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 09:12:31AM -0700, Paul E. McKenney wrote:
> > On Thu, May 03, 2018 at 04:44:50PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 03, 2018 at 04:16:55PM +0200, Mike Galbraith wrote:
> > > > On Thu, 2018-05-03 at 15:56 +0200, Peter Zijlstra wrote:
> > > > > On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
> > > > > 
> > > > > > Dang.  With $subject fix applied as well..
> > > > > 
> > > > > That's a NO then... :-(
> > > > 
> > > > Could say who cares about oddball offline wakeup stat. 
> > > 
> > > Yeah, nobody.. but I don't want to have to change the wakeup code to
> > > deal with this if at all possible. That'd just add conditions that are
> > > 'always' false, except in this exceedingly rare circumstance.
> > > 
> > > So ideally we manage to tell RCU that it needs to pay attention while
> > > we're doing this here thing, which is what I thought RCU_NONIDLE() was
> > > about.
> > 
> > One straightforward approach would be to provide an arch-specific
> > Kconfig option that tells notify_cpu_starting() not to bother invoking
> > rcu_cpu_starting().  Then x86 selects this Kconfig option and invokes
> > rcu_cpu_starting() itself early enough to avoid splats.
> > 
> > See the (untested, probably does not even build) patch below.
> > 
> > I have no idea where to insert either the "select" or the call to
> > rcu_cpu_starting(), so I left those out.  I know that putting the
> > call too early will cause trouble, but I have no idea what constitutes
> > "too early".  :-/
> 
> Something like so perhaps? Mike, can you play around with that? Could
> burn your granny and eat your cookies.

Did this get queued anywhere?

> diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> index 7468de429087..07360523c3ce 100644
> --- a/arch/x86/kernel/cpu/mtrr/main.c
> +++ b/arch/x86/kernel/cpu/mtrr/main.c
> @@ -793,6 +793,9 @@ void mtrr_ap_init(void)
>  
>   if (!use_intel() || mtrr_aps_delayed_init)
>   return;
> +
> + rcu_cpu_starting(smp_processor_id());
> +
>   /*
>* Ideally we should hold mtrr_mutex here to avoid mtrr entries
>* changed, but this routine will be called in cpu boot time,
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 2a734692a581..4dab46950fdb 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
>   return 0;
>  }
>  
> +static DEFINE_PER_CPU(int, rcu_cpu_started);
> +
>  /*
>   * Mark the specified CPU as being online so that subsequent grace periods
>   * (both expedited and normal) will wait on it.  Note that this means that
> @@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
>   struct rcu_node *rnp;
>   struct rcu_state *rsp;
>  
> + if (per_cpu(rcu_cpu_started, cpu))
> + return;
> +
> + per_cpu(rcu_cpu_started, cpu) = 1;
> +
>   for_each_rcu_flavor(rsp) {
>   rdp = per_cpu_ptr(rsp->rda, cpu);
>   rnp = rdp->mynode;
> @@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
>   preempt_enable();
>   for_each_rcu_flavor(rsp)
>   rcu_cleanup_dying_idle_cpu(cpu, rsp);
> +
> + per_cpu(rcu_cpu_started, cpu) = 0;
>  }
>  
>  /* Migrate the dead CPU's callbacks to the current CPU. */


Re: [patch] swiotlb: fix ignored DMA_ATTR_NO_WARN request

2018-05-12 Thread Mike Galbraith
To conclude this snail-like thread (/me=walking wounded): with the
v4.16.8 hunk below applied, traces showing that swiotlb_alloc_coherent()
was being asked not to bother warning started showing up after the box
had been flogged for a while.

Whatever finally happens with swiotlb (seems to be in flux), other
folks meeting annoying gripeage can find bandaids in the interim.

The End

v4.16.8 !DMA_DIRECT_OPS
Xorg-3105  [001]   2156.711471: swiotlb_alloc_coherent+0xa7/0x1e0: yup
Xorg-3105  [001]   2156.711497: 
 => ttm_dma_populate+0x23c/0x310 [ttm]
 => ttm_tt_bind+0x31/0x60 [ttm]
 => ttm_bo_handle_move_mem+0x527/0x580 [ttm]
 => ttm_bo_validate+0xfb/0x110 [ttm]
 => ttm_bo_init_reserved+0x289/0x450 [ttm]
 => ttm_bo_init+0x77/0xd0 [ttm]
 => nouveau_bo_new+0x3fc/0x5e0 [nouveau]
 => nouveau_gem_new+0x66/0x110 [nouveau]
 => nouveau_gem_ioctl_new+0x48/0xc0 [nouveau]
 => drm_ioctl_kernel+0x66/0xb0 [drm]
 => drm_ioctl+0x2a4/0x360 [drm]
 => nouveau_drm_ioctl+0x50/0xb0 [nouveau]
 => do_vfs_ioctl+0x92/0x5e0
 => SyS_ioctl+0x3b/0x70
 => do_syscall_64+0x74/0x1a0
 => entry_SYSCALL_64_after_hwframe+0x3d/0xa2

--- a/arch/x86/kernel/pci-swiotlb.c
+++ b/arch/x86/kernel/pci-swiotlb.c
@@ -28,10 +28,8 @@ void *x86_swiotlb_alloc_coherent(struct
 * swiotlb_alloc_coherent() will print a warning when the DMA
 * memory allocation ultimately failed.
 */
-   flags |= __GFP_NOWARN;
-
-   vaddr = dma_generic_alloc_coherent(hwdev, size, dma_handle, flags,
-  attrs);
+   vaddr = dma_generic_alloc_coherent(hwdev, size, dma_handle,
+  flags | __GFP_NOWARN, attrs);
if (vaddr)
return vaddr;
 


[patch] swiotlb: fix ignored DMA_ATTR_NO_WARN request

2018-05-11 Thread Mike Galbraith

In the trace below, swiotlb_alloc() is called with __GFP_NOWARN, it ors
attrs with DMA_ATTR_NO_WARN and passes it to swiotlb_alloc_buffer(),
which does NOT pass it on to swiotlb_tbl_map_single(), leading to an
ever repeating warning that the caller of swiotlb_alloc() explicitly
asked to be squelched.  Pass the caller's request for silence onward.

 Xorg-3170  [006]    963.866098: swiotlb_alloc+0x1d/0x1a0: gfp & __GFP_NOWARN
 Xorg-3170  [006]    963.866101: 
 => ttm_dma_populate+0x250/0x310 [ttm]
 => ttm_tt_populate+0x28/0x70 [ttm]
 => ttm_tt_bind+0x26/0x60 [ttm]
 => ttm_bo_handle_move_mem+0x51a/0x580 [ttm]
 => ttm_bo_validate+0xfa/0x110 [ttm]
 => ttm_bo_init_reserved+0x296/0x450 [ttm]
 => ttm_bo_init+0x73/0xd0 [ttm]
 => nouveau_bo_new+0x3eb/0x5c0 [nouveau]
 => nouveau_gem_new+0x66/0x110 [nouveau]
 => nouveau_gem_ioctl_new+0x48/0xc0 [nouveau]
 => drm_ioctl_kernel+0x66/0xb0 [drm]
 => drm_ioctl+0x28d/0x340 [drm]
 => nouveau_drm_ioctl+0x50/0xb0 [nouveau]
 => do_vfs_ioctl+0x92/0x5e0
 => ksys_ioctl+0x3a/0x70
 => __x64_sys_ioctl+0x16/0x20
 => do_syscall_64+0x5b/0x180
 => entry_SYSCALL_64_after_hwframe+0x44/0xa9
 Xorg-3170  [006]    963.866917: swiotlb_tbl_map_single+0x29b/0x2d0: swiotlb buffer is full (sz: 2097152 bytes)

Signed-off-by: Mike Galbraith <efa...@gmx.de>
---
 lib/swiotlb.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -714,7 +714,7 @@ swiotlb_alloc_buffer(struct device *dev,
 
phys_addr = swiotlb_tbl_map_single(dev,
__phys_to_dma(dev, io_tlb_start),
-   0, size, DMA_FROM_DEVICE, 0);
+   0, size, DMA_FROM_DEVICE, attrs);
if (phys_addr == SWIOTLB_MAP_ERROR)
goto out_warn;
 


Re: kernel spew from nouveau/ swiotlb

2018-05-11 Thread Mike Galbraith
On Thu, 2018-05-10 at 12:28 +0200, Mike Galbraith wrote:
> On Thu, 2018-05-10 at 11:10 +0200, Mike Galbraith wrote:
> > Greetings,
> > 
> > When box is earning its keep, nouveau/swiotlb grumble.. a LOT.  The
> > below is from master.today.
> > 
> > [12594.640959] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> > bytes)
> > [12594.693000] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> > bytes)
> > [12594.713787] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> > bytes)
> > [12594.743413] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> > bytes)
> > [12594.796740] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> > bytes)
> > [12607.000774] swiotlb_tbl_map_single: 54 callbacks suppressed
> > [12607.000776] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> > bytes)
> > [12607.347941] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> > bytes)
> > [12608.677038] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> > bytes)
> > homer:/novell/ssh # dmesg|grep 'swiotlb buffer is full'|wc -l
> > 2052
> > homer:/novell/ssh # dmesg|grep 'callbacks suppressed'|wc -l
> > 171
> > 
> > lib/swiotlb.c:
> >  573 not_found:
> >  574 spin_unlock_irqrestore(_tlb_lock, flags);
> >  575 if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit())
> >  576 dev_warn(hwdev, "swiotlb buffer is full (sz: %zd 
> > bytes)\n", size);
> > 
> > Does nouveau perhaps want one of those DMA_ATTR_NO_WARN thingies?
> 
> Or should ttm perhaps always use the one on hand?  (seems to work)

No it didn't, I just didn't wait long enough for spew to start...

> ---
>  drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> +++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
> @@ -342,7 +342,7 @@ static struct dma_page *__ttm_dma_alloc_
>   if (!d_page)
>   return NULL;
>  
> - if (pool->type & IS_HUGE)
> + if (1 || pool->type & IS_HUGE)
>   attrs = DMA_ATTR_NO_WARN;
>  
>   vaddr = dma_alloc_attrs(pool->dev, pool->size, &d_page->dma,

While IS_HUGE is indeed false on my box, it just doesn't matter,
because when we get to either the old or the new alloc(), it calls
swiotlb_alloc_buffer(), which drops attrs passed to it on the floor,
making it unlikely that alloc() caller wishes are granted.

-Mike
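
The attrs hand-off described above can be mimicked in user space.  This is
only a sketch: every demo_* name and DEMO_* flag is made up for illustration
and none of it is the swiotlb code.  The middle layer either forwards the
caller's attribute or passes 0, and only the lowest layer decides whether to
gripe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for __GFP_NOWARN and DMA_ATTR_NO_WARN. */
#define DEMO_GFP_NOWARN   (1u << 0)
#define DEMO_ATTR_NO_WARN (1ul << 0)

static unsigned int demo_warnings;	/* counts warnings that got printed */

/* Lowest layer: the only place that decides whether to warn. */
static int demo_map_single(size_t size, unsigned long attrs)
{
	assert(size > 0);
	if (!(attrs & DEMO_ATTR_NO_WARN)) {
		demo_warnings++;
		fprintf(stderr, "demo: buffer is full (sz: %zu bytes)\n", size);
	}
	return -1;	/* always "full", so the warning path is exercised */
}

/* Middle layer: forwards attrs when pass_attrs is true, else passes 0. */
static int demo_alloc_buffer(size_t size, unsigned long attrs, bool pass_attrs)
{
	return demo_map_single(size, pass_attrs ? attrs : 0);
}

/* Top layer: translates the caller's gfp flag into an attribute. */
static int demo_alloc(size_t size, unsigned int gfp, bool pass_attrs)
{
	unsigned long attrs = (gfp & DEMO_GFP_NOWARN) ? DEMO_ATTR_NO_WARN : 0;

	return demo_alloc_buffer(size, attrs, pass_attrs);
}
```

Calling demo_alloc(2097152, DEMO_GFP_NOWARN, false) still bumps
demo_warnings, while the same call with pass_attrs set to true stays quiet.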


Re: [Nouveau] kernel spew from nouveau/ swiotlb

2018-05-10 Thread Mike Galbraith
On Thu, 2018-05-10 at 17:31 +0200, Mike Galbraith wrote:
> On Thu, 2018-05-10 at 10:31 -0400, Jerome Glisse wrote:
> > 
> > Could you bisect ? I would love to point finger upstream to the DMA
> > folk who made changes to that API without testing with GPU.
> 
> Rummaging a bit, it might be...
> 

(unsend, whack duplicate line, munge, send;)

> nouveau_bo_new()
> ...
> ttm_dma_pool_alloc_new_pages()
>   dma_alloc_attrs()
> ops->alloc() == x86_swiotlb_alloc_coherent()
> x86_swiotlb_alloc_coherent() flags |= __GFP_NOWARN;
>   swiotlb_alloc_coherent(..flags)
> swiotlb_alloc_coherent(..flags) attrs = (flags & __GFP_NOWARN) ? 
> DMA_ATTR_NO_WARN : 0;
>   swiotlb_alloc_buffer(..attrs)
*  swiotlb_tbl_map_single(..0) passed 0 vs attrs, gripeage follows

Or something like that.


Re: [Nouveau] kernel spew from nouveau/ swiotlb

2018-05-10 Thread Mike Galbraith
On Thu, 2018-05-10 at 10:31 -0400, Jerome Glisse wrote:
> 
> Could you bisect ? I would love to point finger upstream to the DMA
> folk who made changes to that API without testing with GPU.

Rummaging a bit, it might be...

nouveau_bo_new()
...
ttm_dma_pool_alloc_new_pages()
  dma_alloc_attrs()
ops->alloc() == x86_swiotlb_alloc_coherent()
x86_swiotlb_alloc_coherent() flags |= __GFP_NOWARN;
  swiotlb_alloc_coherent(..flags)
swiotlb_alloc_coherent(..flags) attrs = (flags & __GFP_NOWARN) ? 
DMA_ATTR_NO_WARN : 0;
  swiotlb_alloc_buffer(..attr)
swiotlb_alloc_buffer(..0)  <== hm, pass zero instead of attr?
  swiotlb_tbl_map_single() gripeage

...that?

-Mike


Re: kernel spew from nouveau/ swiotlb

2018-05-10 Thread Mike Galbraith
On Thu, 2018-05-10 at 11:10 +0200, Mike Galbraith wrote:
> Greetings,
> 
> When box is earning its keep, nouveau/swiotlb grumble.. a LOT.  The
> below is from master.today.
> 
> [12594.640959] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> bytes)
> [12594.693000] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> bytes)
> [12594.713787] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> bytes)
> [12594.743413] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> bytes)
> [12594.796740] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> bytes)
> [12607.000774] swiotlb_tbl_map_single: 54 callbacks suppressed
> [12607.000776] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> bytes)
> [12607.347941] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> bytes)
> [12608.677038] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 
> bytes)
> homer:/novell/ssh # dmesg|grep 'swiotlb buffer is full'|wc -l
> 2052
> homer:/novell/ssh # dmesg|grep 'callbacks suppressed'|wc -l
> 171
> 
> lib/swiotlb.c:
>  573 not_found:
>  574 spin_unlock_irqrestore(&io_tlb_lock, flags);
>  575 if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit())
>  576 dev_warn(hwdev, "swiotlb buffer is full (sz: %zd 
> bytes)\n", size);
> 
> Does nouveau perhaps want one of those DMA_ATTR_NO_WARN thingies?

Or should ttm perhaps always use the one on hand?  (seems to work)

---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -342,7 +342,7 @@ static struct dma_page *__ttm_dma_alloc_
if (!d_page)
return NULL;
 
-   if (pool->type & IS_HUGE)
+   if (1 || pool->type & IS_HUGE)
attrs = DMA_ATTR_NO_WARN;
 
vaddr = dma_alloc_attrs(pool->dev, pool->size, &d_page->dma,


kernel spew from nouveau/ swiotlb

2018-05-10 Thread Mike Galbraith
Greetings,

When box is earning its keep, nouveau/swiotlb grumble.. a LOT.  The
below is from master.today.

[12594.640959] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
[12594.693000] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
[12594.713787] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
[12594.743413] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
[12594.796740] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
[12607.000774] swiotlb_tbl_map_single: 54 callbacks suppressed
[12607.000776] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
[12607.347941] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
[12608.677038] nouveau :01:00.0: swiotlb buffer is full (sz: 2097152 bytes)
homer:/novell/ssh # dmesg|grep 'swiotlb buffer is full'|wc -l
2052
homer:/novell/ssh # dmesg|grep 'callbacks suppressed'|wc -l
171

lib/swiotlb.c:
 573 not_found:
 574         spin_unlock_irqrestore(&io_tlb_lock, flags);
 575         if (!(attrs & DMA_ATTR_NO_WARN) && printk_ratelimit())
 576                 dev_warn(hwdev, "swiotlb buffer is full (sz: %zd bytes)\n", size);

Does nouveau perhaps want one of those DMA_ATTR_NO_WARN thingies?

-Mike
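
For anyone wondering where the "callbacks suppressed" lines come from, here
is a rough user-space sketch of a windowed rate limiter of the kind
printk_ratelimit() provides.  It is not the kernel implementation: the
struct, the function name, and the interval/burst values used with it are
placeholders chosen for the demo.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space limiter; times are in seconds. */
struct demo_ratelimit {
	long interval;		/* window length */
	unsigned int burst;	/* messages allowed per window */
	long begin;		/* start of the current window */
	unsigned int printed;	/* messages allowed so far in this window */
	unsigned int missed;	/* messages suppressed in this window */
	bool started;		/* false until the first call */
};

/*
 * Returns true when the caller may print.  When a new window opens,
 * *suppressed receives the number of calls dropped in the previous one
 * (the "N callbacks suppressed" figure); otherwise it is set to 0.
 */
static bool demo_ratelimit_ok(struct demo_ratelimit *rl, long now,
			      unsigned int *suppressed)
{
	assert(rl && suppressed && rl->interval > 0);
	*suppressed = 0;

	if (!rl->started || now - rl->begin >= rl->interval) {
		*suppressed = rl->missed;
		rl->begin = now;
		rl->printed = 0;
		rl->missed = 0;
		rl->started = true;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

With interval 5 and burst 2, calls at t=0 and t=1 pass, calls at t=2 and t=3
are dropped, and the call at t=5 passes again while reporting 2 suppressed.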


Re: bug in tag handling in blk-mq?

2018-05-09 Thread Mike Galbraith
On Wed, 2018-05-09 at 13:50 -0600, Jens Axboe wrote:
> On 5/9/18 12:31 PM, Mike Galbraith wrote:
> > On Wed, 2018-05-09 at 11:01 -0600, Jens Axboe wrote:
> >> On 5/9/18 10:57 AM, Mike Galbraith wrote:
> >>
> >>>>> Confirmed.  Impressive high speed bug stomping.
> >>>>
> >>>> Well, that's good news. Can I get you to try this patch?
> >>>
> >>> Sure thing.  The original hang (minus provocation patch) being
> >>> annoyingly non-deterministic, this will (hopefully) take a while.
> >>
> >> You can verify with the provocation patch as well first, if you wish.
> > 
> > Done, box still seems fine.
> 
> Omar had some (valid) complaints, can you try this one as well? You
> can also find it as a series here:
> 
> http://git.kernel.dk/cgit/linux-block/log/?h=bfq-cleanups
> 
> I'll repost the series shortly, need to check if it actually builds and
> boots.

I applied the series (+ provocation), all is well.

-Mike


Re: bug in tag handling in blk-mq?

2018-05-09 Thread Mike Galbraith
On Wed, 2018-05-09 at 11:01 -0600, Jens Axboe wrote:
> On 5/9/18 10:57 AM, Mike Galbraith wrote:
> 
> >>> Confirmed.  Impressive high speed bug stomping.
> >>
> >> Well, that's good news. Can I get you to try this patch?
> > 
> > Sure thing.  The original hang (minus provocation patch) being
> > annoyingly non-deterministic, this will (hopefully) take a while.
> 
> You can verify with the provocation patch as well first, if you wish.

Done, box still seems fine.

-Mike


Re: bug in tag handling in blk-mq?

2018-05-09 Thread Mike Galbraith
On Wed, 2018-05-09 at 09:18 -0600, Jens Axboe wrote:
> On 5/8/18 10:11 PM, Mike Galbraith wrote:
> > On Tue, 2018-05-08 at 19:09 -0600, Jens Axboe wrote:
> >>
> >> Alright, I managed to reproduce it. What I think is happening is that
> >> BFQ is limiting the inflight case to something less than the wake
> >> batch for sbitmap, which can lead to stalls. I don't have time to test
> >> this tonight, but perhaps you can give it a go when you are back at it.
> >> If not, I'll try tomorrow morning.
> >>
> >> If this is the issue, I can turn it into a real patch. This is just to
> >> confirm that the issue goes away with the below.
> > 
> > Confirmed.  Impressive high speed bug stomping.
> 
> Well, that's good news. Can I get you to try this patch?

Sure thing.  The original hang (minus provocation patch) being
annoyingly non-deterministic, this will (hopefully) take a while.

-Mike


Re: bug in tag handling in blk-mq?

2018-05-08 Thread Mike Galbraith
On Tue, 2018-05-08 at 14:37 -0600, Jens Axboe wrote:
> 
> - sdd has nothing pending, yet has 6 active waitqueues.

sdd is where ccache storage lives, which should have been the only
activity on that drive, as I built source in sdb, and was doing nothing
else that utilizes sdd.

-Mike


Re: bug in tag handling in blk-mq?

2018-05-08 Thread Mike Galbraith
On Tue, 2018-05-08 at 19:09 -0600, Jens Axboe wrote:
> 
> Alright, I managed to reproduce it. What I think is happening is that
> BFQ is limiting the inflight case to something less than the wake
> batch for sbitmap, which can lead to stalls. I don't have time to test
> this tonight, but perhaps you can give it a go when you are back at it.
> If not, I'll try tomorrow morning.
> 
> If this is the issue, I can turn it into a real patch. This is just to
> confirm that the issue goes away with the below.

Confirmed.  Impressive high speed bug stomping.

> diff --git a/lib/sbitmap.c b/lib/sbitmap.c
> index e6a9c06ec70c..94ced15b6428 100644
> --- a/lib/sbitmap.c
> +++ b/lib/sbitmap.c
> @@ -272,6 +272,7 @@ EXPORT_SYMBOL_GPL(sbitmap_bitmap_show);
>  
>  static unsigned int sbq_calc_wake_batch(unsigned int depth)
>  {
> +#if 0
>   unsigned int wake_batch;
>  
>   /*
> @@ -284,6 +285,9 @@ static unsigned int sbq_calc_wake_batch(unsigned int 
> depth)
>   wake_batch = max(1U, depth / SBQ_WAIT_QUEUES);
>  
>   return wake_batch;
> +#else
> + return 1;
> +#endif
>  }
>  
>  int sbitmap_queue_init_node(struct sbitmap_queue *sbq, unsigned int depth,
> 
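
To make the suspected stall condition concrete, here is a user-space sketch.
Assumptions are labeled: the diff above only shows the max(1U, depth /
SBQ_WAIT_QUEUES) line, so the cap and both constant values below are guesses
on my part, and demo_can_stall() is a made-up predicate for the theory, not
anything in lib/sbitmap.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values; the diff only shows SBQ_WAIT_QUEUES by name. */
#define DEMO_WAIT_QUEUES 8u
#define DEMO_WAKE_BATCH  8u

/* Batch size without the "#if 0" hack: capped, and at least 1. */
static unsigned int demo_calc_wake_batch(unsigned int depth)
{
	unsigned int wake_batch = DEMO_WAKE_BATCH;

	if (wake_batch > depth / DEMO_WAIT_QUEUES) {
		wake_batch = depth / DEMO_WAIT_QUEUES;
		if (wake_batch < 1)
			wake_batch = 1;
	}
	return wake_batch;
}

/*
 * If waiters are only woken after wake_batch tags have been freed, a
 * scheduler that never lets more than inflight_limit tags be in use at
 * once cannot free a full batch, so sleepers may never be woken.
 */
static bool demo_can_stall(unsigned int depth, unsigned int inflight_limit)
{
	assert(depth > 0);
	return inflight_limit < demo_calc_wake_batch(depth);
}
```

Under those assumptions a depth of 256 gives a batch of 8, so an inflight
limit of 1 could stall, whereas forcing the batch to 1 (as the hack does)
makes the predicate false for any nonzero limit.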


Re: bug in tag handling in blk-mq?

2018-05-08 Thread Mike Galbraith
On Tue, 2018-05-08 at 08:55 -0600, Jens Axboe wrote:
> 
> All the block debug files are empty...

Sigh.  Take 2, this time cat debug files, having turned block tracing
off before doing anything else (so trace bits in dmesg.txt should end
AT the stall).

-Mike

dmesg.xz
Description: application/xz


dmesg.txt.xz
Description: application/xz


block_debug.xz
Description: application/xz


Re: bug in tag handling in blk-mq?

2018-05-08 Thread Mike Galbraith
On Tue, 2018-05-08 at 06:51 +0200, Mike Galbraith wrote:
> 
> I'm deadlined ATM, but will get to it.

(Bah, even a zombie can type ccache -C; make -j8 and stare...)

kbuild again hung on the first go (yay), and post hang data written to
sdd1 survived (kernel source lives in sdb3).  Full ftrace buffer (echo
1 > events/block/enable) available off list if desired.  dmesg.txt.xz
is dmesg from post hang crashdump, attached because it contains the
tail of trace buffer, so _might_ be useful.

homer:~ # df|grep sd
/dev/sdb3  959074776 785342824 172741072  82% /
/dev/sdc3  959074776 455464912 502618984  48% /backup
/dev/sdb1 159564  7980    151584   6% /boot/efi
/dev/sdd1  961301832 393334868 519112540  44% /abuild

Kernel is virgin modulo these...

patches/remove_irritating_plus.diff
patches/add-scm-version-to-EXTRAVERSION.patch
patches/block-bfq:-postpone-rq-preparation-to-insert-or-merge.patch
patches/block-bfq:-test.patch  (hang provocation hack from Paolo)

-Mike

block_debug.tar.xz
Description: application/xz-compressed-tar


dmesg.xz
Description: application/xz


dmesg.txt.xz
Description: application/xz


Re: bug in tag handling in blk-mq?

2018-05-07 Thread Mike Galbraith
On Mon, 2018-05-07 at 20:02 +0200, Paolo Valente wrote:
> 
> 
> > Is there a reproducer?

Just building fat config kernels works for me.  It was highly
non-deterministic, but reproduced quickly twice in a row with Paolo's hack.
  
> Ok Mike, I guess it's your turn now, for at least a stack trace.

Sure.  I'm deadlined ATM, but will get to it.

-Mike


Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-07 Thread Mike Galbraith
On Mon, 2018-05-07 at 11:27 +0200, Paolo Valente wrote:
> 
> 
> Where is the bug?

Hm, seems potent pain-killers and C don't mix all that well.



Re: [PATCH] x86: UV: raw_spinlock conversion

2018-05-07 Thread Mike Galbraith
On Mon, 2018-05-07 at 09:39 +0200, Sebastian Andrzej Siewior wrote:
> On 2018-05-06 12:59:19 [+0200], Mike Galbraith wrote:
> > On Sun, 2018-05-06 at 12:26 +0200, Thomas Gleixner wrote:
> > > On Fri, 4 May 2018, Sebastian Andrzej Siewior wrote:
> > > 
> > > > From: Mike Galbraith <umgwanakikb...@gmail.com>
> > > > 
> > > > Shrug.  Lots of hobbyists have a beast in their basement, right?
> > > 
> > > This hardly qualifies as a proper changelog ...
> > 
> > Hm, that wasn't intended to be a changelog.
> > 
> > This patch may not be current either, I haven't tested RT on a UV box
> > in quite some time.
> 
> That last hunk looks like something that would be required even for !RT. 
> Would you mind to check that patch and write a changelog? If it doesn't
> work for RT there is no need to carry this in -RT.

Yeah, I'll try to reserve a box.

-Mike


Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-06 Thread Mike Galbraith
On Sun, 2018-05-06 at 09:42 +0200, Paolo Valente wrote:
> 
> diff --git a/block/bfq-mq-iosched.c b/block/bfq-mq-iosched.c
> index 118f319af7c0..6662efe29b69 100644
> --- a/block/bfq-mq-iosched.c
> +++ b/block/bfq-mq-iosched.c
> @@ -525,8 +525,13 @@ static void bfq_limit_depth(unsigned int op, struct 
> blk_mq_alloc_data *data)
> if (unlikely(bfqd->sb_shift != bt->sb.shift))
> bfq_update_depths(bfqd, bt);
>  
> +#if 0
> data->shallow_depth =
> bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
^

Q: why doesn't the top of this function look like so?

---
 block/bfq-iosched.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -539,7 +539,7 @@ static void bfq_limit_depth(unsigned int
struct bfq_data *bfqd = data->q->elevator->elevator_data;
struct sbitmap_queue *bt;
 
-   if (op_is_sync(op) && !op_is_write(op))
+   if (!op_is_write(op))
return;
 
if (data->flags & BLK_MQ_REQ_RESERVED) {

It looks a bit odd that these elements exist...

+   /*
+    * no more than 75% of tags for sync writes (25% extra tags
+    * w.r.t. async I/O, to prevent async I/O from starving sync
+    * writes)
+    */
+   bfqd->word_depths[0][1] = max(((1U<<bfqd->sb_shift) * 3)>>2, 1U);

+   /* no more than ~37% of tags for sync writes (~20% extra tags) */
+   bfqd->word_depths[1][1] = max(((1U<<bfqd->sb_shift) * 6)>>4, 1U);

...yet we index via and log a guaranteed zero.

-Mike
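
For reference, a worked example of what those table entries come to.  This
is a user-space sketch, not bfq code: the struct and helper names are made
up, the sync-write rows follow the 75% and ~37% figures in the quoted
comments, and the async rows (50% and ~18%) are my assumption about the
remaining two entries.

```c
#include <assert.h>

/* Hypothetical mirror of the 2x2 depth table: [weight-raised?][sync?]. */
struct demo_depths {
	unsigned int word_depths[2][2];
};

static unsigned int demo_max(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* Fill the table for a bitmap word holding (1 << sb_shift) tags. */
static void demo_update_depths(struct demo_depths *d, unsigned int sb_shift)
{
	unsigned int word;

	assert(d && sb_shift < 28);	/* keeps word * 6 from wrapping */
	word = 1U << sb_shift;

	d->word_depths[0][0] = demo_max(word >> 1, 1U);		/* 50%, assumed */
	d->word_depths[0][1] = demo_max((word * 3) >> 2, 1U);	/* 75% */
	d->word_depths[1][0] = demo_max((word * 3) >> 4, 1U);	/* ~18%, assumed */
	d->word_depths[1][1] = demo_max((word * 6) >> 4, 1U);	/* ~37% */
}
```

With sb_shift of 6 (64 tags per word) that works out to 32, 48, 12 and 24
respectively, and the max() floor keeps every entry at 1 or more for tiny
words.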




Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-06 Thread Mike Galbraith
On Mon, 2018-05-07 at 04:43 +0200, Mike Galbraith wrote:
> On Sun, 2018-05-06 at 09:42 +0200, Paolo Valente wrote:
> > 
> > I've attached a compressed patch (to avoid possible corruption from my
> > mailer).  I'm little confident, but no pain, no gain, right?
> > 
> > If possible, apply this patch on top of the fix I proposed in this
> > thread, just to eliminate possible further noise. Finally, the
> > patch content follows.
> > 
> > Hoping for a stroke of luck,
> 
> FWIW, box didn't survive the first full build of the morning.

Nor the second.

-Mike


Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-06 Thread Mike Galbraith
On Sun, 2018-05-06 at 09:42 +0200, Paolo Valente wrote:
> 
> I've attached a compressed patch (to avoid possible corruption from my
> mailer).  I'm little confident, but no pain, no gain, right?
> 
> If possible, apply this patch on top of the fix I proposed in this
> thread, just to eliminate possible further noise. Finally, the
> patch content follows.
> 
> Hoping for a stroke of luck,

FWIW, box didn't survive the first full build of the morning.

> Paolo
> 
> diff --git a/block/bfq-mq-iosched.c b/block/bfq-mq-iosched.c
> index 118f319af7c0..6662efe29b69 100644
> --- a/block/bfq-mq-iosched.c
> +++ b/block/bfq-mq-iosched.c

That doesn't exist in master, so I applied it like so.

---
 block/bfq-iosched.c |4 
 1 file changed, 4 insertions(+)

--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -554,8 +554,12 @@ static void bfq_limit_depth(unsigned int
if (unlikely(bfqd->sb_shift != bt->sb.shift))
bfq_update_depths(bfqd, bt);
 
+#if 0
data->shallow_depth =
bfqd->word_depths[!!bfqd->wr_busy_queues][op_is_sync(op)];
+#else
+   data->shallow_depth = 1;
+#endif
 
bfq_log(bfqd, "[%s] wr_busy %d sync %d depth %u",
__func__, bfqd->wr_busy_queues, op_is_sync(op),


Re: [PATCH] x86: UV: raw_spinlock conversion

2018-05-06 Thread Mike Galbraith
On Sun, 2018-05-06 at 12:26 +0200, Thomas Gleixner wrote:
> On Fri, 4 May 2018, Sebastian Andrzej Siewior wrote:
> 
> > From: Mike Galbraith <umgwanakikb...@gmail.com>
> > 
> > Shrug.  Lots of hobbyists have a beast in their basement, right?
> 
> This hardly qualifies as a proper changelog ...

Hm, that wasn't intended to be a changelog.

This patch may not be current either, I haven't tested RT on a UV box
in quite some time.

-Mike



Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-05 Thread Mike Galbraith
On Sat, 2018-05-05 at 12:39 +0200, Paolo Valente wrote:
> 
> BTW, if you didn't run out of patience with this permanent issue yet,
> I was thinking of two or three changes to retry to trigger your failure
> reliably.

Sure, fire away, I'll happily give the annoying little bugger
opportunities to show its tender belly.

-Mike



Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-05 Thread Mike Galbraith
On Fri, 2018-05-04 at 21:46 +0200, Mike Galbraith wrote:
> Tentatively, I suspect you've just fixed the nasty stalls I reported a
> while back.

Oh well, so much for optimism.  It took a lot, but just hung.


Re: [PATCH BUGFIX] block, bfq: postpone rq preparation to insert or merge

2018-05-04 Thread Mike Galbraith
Tentatively, I suspect you've just fixed the nasty stalls I reported a
while back.  Not a hint of stall as yet (should have shown itself by
now), spinning rust buckets are being all they can be, box feels good.

Later mq-deadline (I hope to eventually forget the module dependency
eternities we've spent together;), welcome back bfq (maybe.. I hope).

-Mike


[patch-rt] sched,fair: Fix CFS bandwidth control lockdep DEADLOCK report

2018-05-04 Thread Mike Galbraith
CFS bandwidth control yields the inversion gripe below; moving
handling quells it.


WARNING: possible irq lock inversion dependency detected
4.16.7-rt1-rt #2 Tainted: GE   

sirq-hrtimer/0/15 just changed the state of lock:
 (&cfs_b->lock){+...}, at: [<9adb5cf7>] sched_cfs_period_timer+0x28/0x140
but this lock was taken by another, HARDIRQ-safe lock in the past:
 (&rq->lock){-...}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
 Possible interrupt unsafe locking scenario:
       CPU0                    CPU1
       ----                    ----
  lock(&cfs_b->lock);
                               local_irq_disable();
                               lock(&rq->lock);
                               lock(&cfs_b->lock);
  <Interrupt>
    lock(&rq->lock);
 *** DEADLOCK ***
1 lock held by sirq-hrtimer/0/15:
 #0:  (&per_cpu(local_softirq_locks[i], __cpu).lock){+.+.}, at: [<61d5600a>] do_current_softirqs+0x170/0x660
the shortest dependencies between 2nd lock and 1st lock:
 -> (&rq->lock){-...} ops: 67919540 {
IN-HARDIRQ-W at:
  _raw_spin_lock+0x38/0x50
  scheduler_tick+0x4c/0x110
  update_process_times+0x21/0x50
  tick_periodic+0x2b/0x100
  tick_handle_periodic+0x1f/0x60
  timer_interrupt+0x14/0x20
  __handle_irq_event_percpu+0x5f/0x3f0
  handle_irq_event_percpu+0x37/0x70
  handle_irq_event+0x37/0x60
  handle_edge_irq+0xbe/0x1e0
  handle_irq+0x1f/0x30
  do_IRQ+0x65/0x130
  ret_from_intr+0x0/0x22
  timer_irq_works+0x60/0x10e
  setup_IO_APIC+0x620/0x7e3
  x86_late_time_init+0x17/0x1c
  start_kernel+0x410/0x4b3
  secondary_startup_64+0xa5/0xb0
INITIAL USE at:
 _raw_spin_lock_irqsave+0x4f/0x70
 rq_attach_root+0x18/0xe0
 sched_init+0x2ea/0x413
 start_kernel+0x282/0x4b3
 secondary_startup_64+0xa5/0xb0
  }
  ... key  at: [<0ab3ac7a>] __key.69727+0x0/0x8
  ... acquired at:
   lock_acquire+0xbd/0x250
   _raw_spin_lock+0x38/0x50
   rq_online_fair+0x9a/0x190
   set_rq_online+0x4c/0x60
   rq_attach_root+0xac/0xe0
   sched_init+0x2ea/0x413
   start_kernel+0x282/0x4b3
   secondary_startup_64+0xa5/0xb0 
-> (&cfs_b->lock){+...} ops: 56 {
   HARDIRQ-ON-W at:
_raw_spin_lock+0x38/0x50
sched_cfs_period_timer+0x28/0x140
__hrtimer_run_queues+0x10e/0x5f0
hrtimer_run_softirq+0x83/0xc0
do_current_softirqs+0x292/0x660
run_ksoftirqd+0x27/0x70
smpboot_thread_fn+0x27f/0x330
kthread+0x103/0x140
ret_from_fork+0x3a/0x50
   INITIAL USE at:
   _raw_spin_lock+0x38/0x50
   rq_online_fair+0x9a/0x190
   set_rq_online+0x4c/0x60
   rq_attach_root+0xac/0xe0
   sched_init+0x2ea/0x413
   start_kernel+0x282/0x4b3
   secondary_startup_64+0xa5/0xb0
 }
 ... key  at: [<bf5d5ec7>] __key.47691+0x0/0x8
 ... acquired at:
   __lock_acquire+0x1e6/0x770
   lock_acquire+0xbd/0x250
   _raw_spin_lock+0x38/0x50
   sched_cfs_period_timer+0x28/0x140
   __hrtimer_run_queues+0x10e/0x5f0
   hrtimer_run_softirq+0x83/0xc0
   do_current_softirqs+0x292/0x660
   run_ksoftirqd+0x27/0x70
   smpboot_thread_fn+0x27f/0x330
   kthread+0x103/0x140
   ret_from_fork+0x3a/0x50 
stack backtrace:
CPU: 0 PID: 15 Comm: sirq-hrtimer/0 Tainted: GE4.16.7-rt1-rt #2
Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
Call Trace:
 dump_stack+0x78/0xab
 print_irq_inversion_bug.part.38+0x19f/0x1aa
 check_usage_backwards+0x11b/0x120
 ? check_usage_forwards+0x130/0x130
 mark_lock+0x17c/0x280
 __lock_acquire+0x1e6/0x770
 lock_acquire+0xbd/0x250
 ? sched_cfs_period_timer+0x28/0x140
 _raw_spin_lock+0x38/0x50
 ? sched_cfs_period_timer+0x28/0x140
 sched_cfs_period_timer+0x28/0x140
 ? sched_cfs_slack_timer+0xc0/0xc0
 __hrtimer_run_queues+0x10e/0x5f0
 hrtimer_run_softirq+0x83/0xc0
 do_current_softirqs+0x292/0x660
 run_ksoftirqd+0x27/0x70
 smpboot_thread_fn+0x27f/0x330
 kthread+0x103/0x140
 ? smpboot_register_percpu_thread_cpumask+0x100/0x100
 ? kthread_delayed_work_timer_fn+0x90/0x90
 ret_from_fork+0x3a/0x50

Signed-off-by: Mike Galbraith <efa...@gmx.de>
---
 kernel/sched/fair.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@

Re: cpu stopper threads and load balancing leads to deadlock

2018-05-03 Thread Mike Galbraith
On Thu, 2018-05-03 at 18:45 +0200, Peter Zijlstra wrote:
> 
> Something like so perhaps? Mike, can you play around with that? Could
> burn your granny and eat your cookies.

That worked, and nothing entertaining has happened.. yet.  Hm, I could
use this kernel to update my backup drive, if there's a cookie monster
lurking, that might get its attention :)
 
> diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
> index 7468de429087..07360523c3ce 100644
> --- a/arch/x86/kernel/cpu/mtrr/main.c
> +++ b/arch/x86/kernel/cpu/mtrr/main.c
> @@ -793,6 +793,9 @@ void mtrr_ap_init(void)
>  
>   if (!use_intel() || mtrr_aps_delayed_init)
>   return;
> +
> + rcu_cpu_starting(smp_processor_id());
> +
>   /*
>* Ideally we should hold mtrr_mutex here to avoid mtrr entries
>* changed, but this routine will be called in cpu boot time,
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 2a734692a581..4dab46950fdb 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3775,6 +3775,8 @@ int rcutree_dead_cpu(unsigned int cpu)
>   return 0;
>  }
>  
> +static DEFINE_PER_CPU(int, rcu_cpu_started);
> +
>  /*
>   * Mark the specified CPU as being online so that subsequent grace periods
>   * (both expedited and normal) will wait on it.  Note that this means that
> @@ -3796,6 +3798,11 @@ void rcu_cpu_starting(unsigned int cpu)
>   struct rcu_node *rnp;
>   struct rcu_state *rsp;
>  
> + if (per_cpu(rcu_cpu_started, cpu))
> + return;
> +
> + per_cpu(rcu_cpu_started, cpu) = 1;
> +
>   for_each_rcu_flavor(rsp) {
>   rdp = per_cpu_ptr(rsp->rda, cpu);
>   rnp = rdp->mynode;
> @@ -3852,6 +3859,8 @@ void rcu_report_dead(unsigned int cpu)
>   preempt_enable();
>   for_each_rcu_flavor(rsp)
>   rcu_cleanup_dying_idle_cpu(cpu, rsp);
> +
> + per_cpu(rcu_cpu_started, cpu) = 0;
>  }
>  
>  /* Migrate the dead CPU's callbacks to the current CPU. */
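
The guard in the quoted patch makes rcu_cpu_starting() safe to call more than once for the same CPU: once early from mtrr_ap_init(), and again later on the normal bring-up path. A minimal userspace sketch of that pattern follows. It is not kernel code: plain arrays stand in for per-CPU variables, and the names model_cpu_starting(), model_report_dead() and online_marks are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_CPUS 8

/* Stand-in for the per-CPU rcu_cpu_started flag in the quoted patch. */
static bool cpu_started[MODEL_NR_CPUS];

/* Counts how often the "mark this CPU online" body actually ran. */
static int online_marks[MODEL_NR_CPUS];

/* Idempotent start: a second call for an already started CPU is a no-op. */
void model_cpu_starting(unsigned int cpu)
{
    if (cpu_started[cpu])
        return;

    cpu_started[cpu] = true;
    online_marks[cpu]++;    /* placeholder for the real online bookkeeping */
}

/* Clearing the flag on the way down lets the next bring-up run the body again. */
void model_report_dead(unsigned int cpu)
{
    cpu_started[cpu] = false;
}
```

Calling model_cpu_starting(2) twice bumps online_marks[2] only once; after model_report_dead(2), the next start bumps it again.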


Re: cpu stopper threads and load balancing leads to deadlock

2018-05-03 Thread Mike Galbraith
On Thu, 2018-05-03 at 15:56 +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 03:32:39PM +0200, Mike Galbraith wrote:
> 
> > Dang.  With $subject fix applied as well..
> 
> That's a NO then... :-(

Could say who cares about oddball offline wakeup stat. 


Re: cpu stopper threads and load balancing leads to deadlock

2018-05-03 Thread Mike Galbraith
On Thu, 2018-05-03 at 14:49 +0200, Peter Zijlstra wrote:
> On Thu, May 03, 2018 at 02:40:21PM +0200, Mike Galbraith wrote:
> > On Thu, 2018-05-03 at 14:28 +0200, Peter Zijlstra wrote:
> > > 
> > > Hurm.. I don't see how this is 'new'. We moved the wakeup out from under
> > > stopper lock, but that should not affect the RCU state.
> > 
> > > No, not new, just additional woes from the same spot.
> 
> Ah, ok. Does something like this make it go away?

Dang.  With $subject fix applied as well..

[  151.103732] smpboot: Booting Node 0 Processor 2 APIC 0x4
[  151.104908] =
[  151.104909] WARNING: suspicious RCU usage
[  151.104910] 4.17.0.g66d489e-tip-default #84 Tainted: GE
[  151.104911] -
[  151.104912] kernel/sched/core.c:1625 suspicious rcu_dereference_check() 
usage!
[  151.104913] 
   other info that might help us debug this:

[  151.104914] 
   RCU used illegally from offline CPU!
   rcu_scheduler_active = 2, debug_locks = 0
[  151.104916] 3 locks held by swapper/2/0:
[  151.104916]  #0: 560adb60 (stop_cpus_mutex){+.+.}, at: 
stop_machine_from_inactive_cpu+0x86/0x140
[  151.104923]  #1: e4fb0238 (&p->pi_lock){-.-.}, at: 
try_to_wake_up+0x2d/0x5f0
[  151.104929]  #2: 3341403b (rcu_read_lock){}, at: 
rcu_read_lock+0x0/0x80
[  151.104934] 
   stack backtrace:
[  151.104937] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G   
 E 4.17.0.g66d489e-tip-default #84
[  151.104938] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[  151.104938] Call Trace:
[  151.104942]  dump_stack+0x78/0xb3
[  151.104945]  ttwu_stat+0x121/0x130
[  151.104949]  try_to_wake_up+0x2c2/0x5f0
[  151.104953]  ? cpu_stop_park+0x30/0x30
[  151.104956]  wake_up_q+0x4a/0x70
[  151.104959]  cpu_stop_queue_work+0x6b/0xa0
[  151.104963]  queue_stop_cpus_work+0x61/0xb0
[  151.104968]  stop_machine_from_inactive_cpu+0xd8/0x140
[  151.104970]  ? mtrr_restore+0x80/0x80
[  151.104976]  mtrr_ap_init+0x62/0x70
[  151.104979]  identify_secondary_cpu+0x18/0x80
[  151.104982]  smp_store_cpu_info+0x44/0x50
[  151.104985]  start_secondary+0x9a/0x1e0
[  151.104988]  secondary_startup_64+0xa5/0xb0

> diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
> index f89014a2c238..a32518c2ba4a 100644
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -650,8 +650,10 @@ int stop_machine_from_inactive_cpu(cpu_stop_fn_t fn, 
> void *data,
>   /* Schedule work on other CPUs and execute directly for local CPU */
>   set_state(&msdata, MULTI_STOP_PREPARE);
>   cpu_stop_init_done(&done, num_active_cpus());
> - queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
> -  &done);
> +
> + RCU_NONIDLE(queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop,
> +  &msdata, &done));
> +
>   ret = multi_cpu_stop(&msdata);
>  
>   /* Busy wait for completion. */


Re: cpu stopper threads and load balancing leads to deadlock

2018-05-03 Thread Mike Galbraith
On Thu, 2018-05-03 at 14:28 +0200, Peter Zijlstra wrote:
> 
> Hurm.. I don't see how this is 'new'. We moved the wakeup out from under
> stopper lock, but that should not affect the RCU state.

No, not new, just additional woes from the same spot.

-Mike
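
The fix Peter mentions, moving the wakeup out from under the stopper lock, follows a common pattern: note which tasks need waking while the lock is held, then wake them only after dropping it. The single-threaded C model below illustrates just that ordering. The lock is a plain flag, and every name in it (wake_list, wake_task(), queue_work_locked(), queue_work_deferred()) is hypothetical rather than taken from kernel/stop_machine.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAKE 4

/* Tasks collected under the lock, to be woken once it is released. */
struct wake_list {
    int    task[MAX_WAKE];
    size_t n;
};

static bool lock_held;          /* models "stopper lock is held" */
static int  wakes_under_lock;   /* wakeups issued while the lock was held */
static int  wakes_total;

static void wake_task(int task)
{
    (void)task;
    if (lock_held)
        wakes_under_lock++;     /* the ordering the fix is meant to avoid */
    wakes_total++;
}

/* Old ordering: the wakeup happens inside the locked region. */
void queue_work_locked(int task)
{
    lock_held = true;
    wake_task(task);
    lock_held = false;
}

/* New ordering: only record the task under the lock, wake after unlock. */
void queue_work_deferred(int task)
{
    struct wake_list wl = { .n = 0 };

    lock_held = true;
    wl.task[wl.n++] = task;     /* queueing the work itself would go here */
    lock_held = false;

    for (size_t i = 0; i < wl.n; i++)
        wake_task(wl.task[i]);
}
```

In this model queue_work_locked() increments wakes_under_lock, while queue_work_deferred() never does, so a waker that has to spin on the woken task is no longer spinning with the lock held.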



Re: cpu stopper threads and load balancing leads to deadlock

2018-05-03 Thread Mike Galbraith
On Tue, 2018-04-24 at 14:33 +0100, Matt Fleming wrote:
> On Fri, 20 Apr, at 11:50:05AM, Peter Zijlstra wrote:
> > On Tue, Apr 17, 2018 at 03:21:19PM +0100, Matt Fleming wrote:
> > > Hi guys,
> > > 
> > > We've seen a bug in one of our SLE kernels where the cpu stopper
> > > thread ("migration/15") is entering idle balance. This then triggers
> > > active load balance.
> > > 
> > > At the same time, a task on another CPU triggers a page fault and NUMA
> > > balancing kicks in to try and migrate the task closer to the NUMA node
> > > for that page (we're inside stop_two_cpus()). This faulting task is
> > > spinning in try_to_wake_up() (inside smp_cond_load_acquire(&p->on_cpu,
> > > !VAL)), waiting for "migration/15" to context switch.
> > > 
> > > Unfortunately, because "migration/15" is doing active load balance
> > > it's spinning waiting for the NUMA-page-faulting CPU's stopper lock,
> > > which is already held (since it's inside stop_two_cpus()).
> > > 
> > > Deadlock ensues.
> > 
> > 
> > So if I read that right, something like the following happens:
> > 
> > CPU0                                CPU1
> > 
> > schedule(.prev=migrate/0)
> >   pick_next_task                    ...
> >     idle_balance                      migrate_swap()
> >       active_balance                    stop_two_cpus()
> >                                           spin_lock(stopper0->lock)
> >                                           spin_lock(stopper1->lock)
> >                                           ttwu(migrate/0)
> >                                             smp_cond_load_acquire() -- waits for schedule()
> >         stop_one_cpu(1)
> >           spin_lock(stopper1->lock) -- waits for stopper lock
> 
> Yep, that's exactly right.
> 
> > Fix _this_ deadlock by taking out the wakeups from under stopper->lock.
> > I'm not entirely sure there isn't more dragons here, but this particular
> > one seems fixable by doing that.
> > 
> > Is there any way you can reproduce/test this?
> 
> I'm afraid I don't have any way to test this, but I can ask the
> customer that reported it if they can.
> 
> Either way, this fix looks good to me.

Seems there's another problem there with hotplug.  Virgin tip...

[  122.147601] smpboot: CPU 4 is now offline
[  122.189701] smpboot: CPU 5 is now offline
[  122.225612] smpboot: CPU 6 is now offline
[  122.257760] smpboot: CPU 7 is now offline
[  124.172418] smpboot: CPU 2 is now offline
[  124.209121] smpboot: CPU 3 is now offline
[  124.215810] smpboot: Booting Node 0 Processor 2 APIC 0x4

[  124.216939] =
[  124.216939] WARNING: suspicious RCU usage
[  124.216941] 4.17.0.g66d489e-tip-default #82 Tainted: GE
[  124.216941] -
[  124.216943] kernel/sched/core.c:1614 suspicious rcu_dereference_check() 
usage!
[  124.216944] 
   other info that might help us debug this:

[  124.216945] 
   RCU used illegally from offline CPU!
   rcu_scheduler_active = 2, debug_locks = 0
[  124.216946] 4 locks held by swapper/2/0:
[  124.216947]  #0: 1f9fa447 (stop_cpus_mutex){+.+.}, at: 
stop_machine_from_inactive_cpu+0x86/0x130
[  124.216953]  #1: 4cb07b3b (&stopper->lock){..-.}, at: 
cpu_stop_queue_work+0x2d/0x80
[  124.216958]  #2: d3a46b90 (&p->pi_lock){-.-.}, at: 
try_to_wake_up+0x2d/0x5f0
[  124.216964]  #3: f360767b (rcu_read_lock){}, at: 
rcu_read_lock+0x0/0x80
[  124.216969] 
   stack backtrace:
[  124.216971] CPU: 2 PID: 0 Comm: swapper/2 Kdump: loaded Tainted: G   
 E 4.17.0.g66d489e-tip-default #82
[  124.216972] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 
09/23/2013
[  124.216973] Call Trace:
[  124.216977]  dump_stack+0x78/0xb3
[  124.216979]  ttwu_stat+0x121/0x130
[  124.216983]  try_to_wake_up+0x2c2/0x5f0
[  124.216988]  ? cpu_stop_park+0x30/0x30
[  124.216990]  cpu_stop_queue_work+0x7c/0x80
[  124.216993]  queue_stop_cpus_work+0x61/0xb0
[  124.216997]  stop_machine_from_inactive_cpu+0xd3/0x130
[  124.216999]  ? mtrr_restore+0x80/0x80
[  124.217005]  mtrr_ap_init+0x62/0x70
[  124.217008]  identify_secondary_cpu+0x18/0x80
[  124.217011]  smp_store_cpu_info+0x44/0x50
[  124.217014]  start_secondary+0x9a/0x1e0
[  124.217017]  secondary_startup_64+0xa5/0xb0

[  124.218433] ==
[  124.218433] WARNING: possible circular locking dependency detected
[  124.218433] 4.17.0.g66d489e-tip-default #82 Tainted: GE
[  124.218434] --
[  124.218434] swapper/2/0 is trying to acquire lock:
[  124.218434] 6b311cf8 ((console_sem).lock){-...}, at: 
down_trylock+0xf/0x30

[  124.218436] but task is already holding lock:
[  124.218436] d3a46b90 (&p->pi_lock){-.-.}, at: 
try_to_wake_up+0x2d/0x5f0

[  124.218438] which lock already depends on the new lock.


[  124.218438] the existing 

Re: [PATCH v7 2/5] cpuset: Add cpuset.sched_load_balance to v2

2018-05-02 Thread Mike Galbraith
On Wed, 2018-05-02 at 16:02 +0200, Peter Zijlstra wrote:
> On Wed, May 02, 2018 at 09:47:00AM -0400, Waiman Long wrote:
> 
> > > I've read half of the next patch that adds the isolation thing. And
> > > > while that kludges around the whole root cgroup is magic thing, it
> > > > doesn't help if you move the above scenario one level down:
> > >
> > >
> > > >         R
> > > >       /   \
> > > >      A     B
> > > >           / \
> > > >          C   D
> > >
> > >
> > > R: cpus=0-7, load_balance=0
> > > A: cpus=0-1, load_balance=1
> > > B: cpus=2-7, load_balance=0
> > > C: cpus=2-3, load_balance=1
> > > D: cpus=4-7, load_balance=1
> > >
> > >
> > > Also, I feel we should strive to have a minimal amount of tasks that
> > > cannot be moved out of the root group; the current set is far too large.
> > 
> > > What exactly is the use case you have in mind with load balancing
> > > disabled in B, but enabled in C and D? We would like to support some
> > > sensible use cases, but not every possible combination.
> 
> Suppose A is your system group, and C and D are individual RT workloads
> or something.

Yeah, it does have a distinct "640K ought to be enough for anybody"
flavor to it.

-Mike


Re: [RFC/RFT patch 0/7] timekeeping: Unify clock MONOTONIC and clock BOOTTIME

2018-04-26 Thread Mike Galbraith
On Wed, 2018-04-25 at 15:03 +0200, Thomas Gleixner wrote:
> Right, it does not matter. The real interesting one is d6ed449afdb3.

FWIW, three boxen here suspend/resume fine, but repeatably exhibit the
below after a very few minute suspend, and a short bisect fingered your
suspect.  Distro is opensuse 42.3.

[  211.113902] Restarting tasks ... done.
[  211.114817] PM: suspend exit
[  212.312993] systemd-journald[7266]: File 
/var/log/journal/016627c3c4784cd4812d4b7e96a34226/system.journal corrupted or 
uncleanly shut down, renaming and replacing.
[  212.313363] systemd-coredump[7264]: Detected coredump of the journal daemon 
itself, diverted to 
/var/lib/systemd/coredump/core.systemd-journal.0.0aa39276decf4f1ab6fda3464e31f9dd.582.152472095400.



Re: DOS by unprivileged user

2018-04-25 Thread Mike Galbraith
On Wed, 2018-04-25 at 15:54 +0100, Alan Cox wrote:
> 
> Classical Unix systems never had this problem because they respond to
> thrashing by ensuring that all processes consumed CPU and made some
> progress. Linux handles it by thrashing itself to death while BSD always
> handled it by moving from paging more towards swapping and behaving like
> a swap bound batch machine.

Memcg constrained the gitk hog nicely, forcing it to dig into swap
after hitting its limit.  Dunno if there are any userspace bits that
can use it wisely though.

-Mike


Re: DOS by unprivileged user

2018-04-25 Thread Mike Galbraith
On Wed, 2018-04-25 at 15:54 +0100, Alan Cox wrote:
> > > I think memory allocation and io waits can't be decoupled from
> > > scheduling as they are now.  
> > 
> > The scheduler is not decoupled from either, it is intimately involved
> > in both.  However, none of the decision making smarts for either reside
> > in the scheduler, nor should they.
> 
> It belongs in both.

If mm decision making belongs within the process scheduler, it follows
that IO requests, dirty page writeback etc. do as well.  Nope, I don't
think we want to create a squid-uler, with tentacles extending all over
the dang kernel.

The thrashing problem could use some attention, but we'll have to agree
to disagree about the scheduler growing mm, io (etc) smarts.

-Mike


Re: [PATCH] sched: fix typo in error message

2018-04-25 Thread Mike Galbraith
On Wed, 2018-04-25 at 13:41 +0800, Li Bin wrote:
> Signed-off-by: Li Bin 
> ---
>  kernel/sched/topology.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 64cc564..cf15c1c 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1618,7 +1618,7 @@ static struct sched_domain *build_sched_domain(struct 
> sched_domain_topology_leve
>  
>   if (!cpumask_subset(sched_domain_span(child),
>   sched_domain_span(sd))) {
> - pr_err("BUG: arch topology borken\n");
> + pr_err("BUG: arch topology broken\n");

That's not a typo, it's a kernel-speak synonym, as are "borked",
"b0rked", "b0rken", "busted", "hosed", and lord knows how many other
colorful variants.  Non-native English speakers beware.

-Mike


Re: DOS by unprivileged user

2018-04-23 Thread Mike Galbraith
On Sun, 2018-04-22 at 21:37 +0200, Ferry Toth wrote:
> > Yes your memory hog scenario thoroughly wrecks the user experience, but
> > the process scheduler is not the source of that wreckage, it's a memory
> > management issue.  With no constraints in place, anybody can just keep
> > on allocating until the entire system starts grinding itself to dust.
> > 
> > -Mike
>  
> That is exactly the issue I think. It is not just a user experience,
> there is no distinction between crashing the kernel and grinding it to
> dust. The effect we have is: any user on a multi user system can
> crash the system.
>  
> Memory management / constraints can't solve the problem either: it
> might be intentional to alloc such amounts of memory.

Constraints definitely can solve this particular problem instance. 
Plain old ulimit can stop the memory hog (wish) dead in its tracks, or
you can use memory cgroup controller to constrain it in a more friendly
manner.  I started gitk in a 4G constrained cgroup, which redirected
its greedy fingers to the swap bin for the remainder of its needs.

But yes, there is currently no wonderful one size fits all fully
automatic solution to running low on memory that doesn't involve
running to the store.

> I think memory allocation and io waits can't be decoupled from
> scheduling as they are now.

The scheduler is not decoupled from either, it is intimately involved
in both.  However, none of the decision making smarts for either reside
in the scheduler, nor should they.

-Mike
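
As a rough illustration of the "plain old ulimit" approach mentioned above, the sketch below caps a child's address space with setrlimit(RLIMIT_AS) and checks that an oversized allocation is refused. Which limit was meant is not stated in the thread, so RLIMIT_AS (what "ulimit -v" sets in bash) is an assumption, as are the helper name alloc_refused_under_limit() and the sizes used; the behavior described assumes Linux.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child, cap its address space at limit_bytes, and try to allocate
 * want_bytes there. Returns 1 if the allocation was refused, 0 if it
 * succeeded, and -1 if fork(), setrlimit() or waitpid() failed.
 */
int alloc_refused_under_limit(rlim_t limit_bytes, size_t want_bytes)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        struct rlimit rl = { .rlim_cur = limit_bytes, .rlim_max = limit_bytes };

        if (setrlimit(RLIMIT_AS, &rl) != 0)
            _exit(2);

        void *p = malloc(want_bytes);

        if (p != NULL)
            *(volatile char *)p = 1;    /* touch it so the allocation is real */
        _exit(p == NULL ? 1 : 0);
    }

    int status;

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;

    int code = WEXITSTATUS(status);

    return code == 2 ? -1 : code;
}
```

For example, alloc_refused_under_limit(256UL << 20, 1UL << 30) asks for 1 GiB under a 256 MiB cap. Only the child is limited, so the parent keeps running normally. A memory cgroup differs in that it limits resident memory and can push the excess to swap instead of failing the allocation outright, which is the friendlier behavior described for gitk.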

