Re: [GIT PULL] xfs: fixes for v4.20-rc6

2018-12-08 Thread Linus Torvalds
On Sat, Dec 8, 2018 at 8:36 AM Darrick J. Wong  wrote:
>
> Finally, the most important fix is to the pipe splicing code (aka the
> generic copy_file_range fallback) to avoid pointless short directio
> reads by only asking the filesystem for as much data as there are
> available pages in the pipe buffer.  Our previous fix (simulated short
> directio reads because the number of pages didn't match the length of
> the read requested) caused subtle problems on overlayfs, so that part is
> reverted.

Honestly, I really wish you simply wouldn't send me "xfs" fixes that
aren't really xfs-specific at all.

All the splice patches (and honestly, I feel some of the iomap ones
too) that have come in through the xfs tree should have been handled
separately as actual VFS patches. Or at least had acks from Al or
something.

I'm looking at that splice patch, and my initial reaction was "Hmm.
but that breaks 0 vs -EAGAIN". But then I realized that that's why
you're validating pipe->nrbufs, because the special temporary
per-thread pipe is always supposed to be empty on entry, so a zero
length can't happen.

But just the fact that I felt like I had to go and look at one of
these commits makes me go "this is not an XFS fix at all, it shouldn't
have been treated as an XFS patch, and the original commit that
*caused* the problem shouldn't have been treated as one either".

So please. I want to feel like when I get an XFS pull from the XFS
maintainer, I don't need to worry about it, and I can just do the git
pull without having to check details.

But that means that when you do changes outside of XFS code, those
changes need to be handled _differently_. And they shouldn't be
bypassing Al etc. And even if you can't get an Ack from Al, send them
separately, so that I can check them without there being any XFS
issues that are mixed up with the pull.

So the patch looks good, and I'm merging this, but I really really
don't want to have to feel like I need to look at xfs pulls this way.

 Linus


Re: [patch for-4.20] Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask"

2018-12-07 Thread Linus Torvalds
On Fri, Dec 7, 2018 at 2:27 PM David Rientjes  wrote:
>
> I noticed the race in 89c83fb539f9 ("mm, thp: consolidate THP gfp handling
> into alloc_hugepage_direct_gfpmask") that is fixed by the revert, but as
> you noted it didn't cleanup the second part which is the balancing act for
> gfp flags between alloc_hugepage_direct_gfpmask() and alloc_pages_vma().
> Syzbot found this to trigger the WARN_ON_ONCE() you mention.

Can you rewrite the commit message to explain this better rather than
the misleading "race" description?

 Linus


Re: [PATCH v2] x86/fault: Decode and print #PF oops in human readable form

2018-12-07 Thread Linus Torvalds
On Fri, Dec 7, 2018 at 2:06 PM Sean Christopherson
 wrote:
>
> Looking at it again, my own personal preference would be to swap the order
> of the #PF lines.

Yeah, probably.

Also:

> [  160.246820] BUG: unable to handle kernel paging request at beef
> [  160.247517] #PF: supervisor-privileged instruction fetch from kernel code
> [  160.248085] #PF: error_code(0x0010) - not-present page

With this form, I think the "kernel" in the first line is actually
misleading. Yes, it's a #PF for the kernel, but then the "kernel" on
the second line talks about what mode we were in when it happened, so
we have two different meanings of "kernel" on two adjacent lines.

So maybe that "BUG: unable to handle kernel paging request" message
should be something like

  "BUG: unable to handle page fault for address beef"

instead? Does that make sense to people?

Anyway, enough bike-shedding from me, I'll just shut up about this,
since I don't really care all that deeply, and I wasn't really the
target audience anyway. Sorry for the noise, and I'll leave the
decision to the people who actually wanted this.

  Linus


Re: [PATCH v2] x86/fault: Decode and print #PF oops in human readable form

2018-12-07 Thread Linus Torvalds
On Fri, Dec 7, 2018 at 11:52 AM Sean Christopherson
 wrote:
>
> Remove the per-bit decoding of the error code and instead print:

The patch looks fine to me, so feel free to add an acked-by, but:

 (a) I'm not the one who wanted the human-legible version in the first
place, since I'm also perfectly ok with just the error code, so my
judgement is obviously garbage wrt this whole thing

 (b) it would be good to have a couple of actual examples of the
printout to judge.

I can certainly imagine how it looks just from the patch, but maybe if
I actually see reality I'd go "eww". Or somebody else goes "eww".

  Linus


Re: [PATCH] x86/fault: Decode and print #PF oops in human readable form

2018-12-07 Thread Linus Torvalds
On Fri, Dec 7, 2018 at 10:44 AM Sean Christopherson
 wrote:
>
> Remove the per-bit decoding of the error code and instead print the raw
> error code followed by a brief description of what caused the fault, the
> effective privilege level of the faulting access, and whether the fault
> originated in user code or kernel code.

This doesn't quite work as-is, though.

For example, at least the PK bit is independent of the other bits and
would be interesting in the human-legible version, but doesn't show up
there at all.

That said, I think the end result might be more legible than the
previous version, so this approach may well be good, it just needs at
least that "permissions violation" part to be extended with whether
it was PK or not.

Also, shouldn't we show the SGX bit too as some kind of "during SGX"
extension on the "in user/kernel space" part?
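
As a rough illustration of the points above, here is a minimal user-space sketch of a decoder that keeps the raw error code and also reports PK and SGX. The bit positions follow the x86 page fault error code layout; the helper name pf_describe(), the PF_* macro names, and the exact output wording are assumptions made for this sketch, not what the patch under review prints.

```c
#include <stdio.h>
#include <string.h>

/* Bit positions of the x86 page fault error code. */
#define PF_PROT  (1UL << 0)   /* 0: not-present page, 1: protection fault */
#define PF_WRITE (1UL << 1)
#define PF_USER  (1UL << 2)
#define PF_RSVD  (1UL << 3)
#define PF_INSTR (1UL << 4)
#define PF_PK    (1UL << 5)   /* protection keys violation */
#define PF_SGX   (1UL << 15)  /* fault during SGX enclave access */

/*
 * Hypothetical helper: formats one line into buf and returns buf.
 * from_user_mode is meant to come from the register state, not from
 * the U/S bit of the error code.
 */
static const char *pf_describe(unsigned long error_code, int from_user_mode,
                               char *buf, size_t len)
{
    const char *cause;

    if (!(error_code & PF_PROT))
        cause = "not-present page";
    else if (error_code & PF_RSVD)
        cause = "reserved bit violation";
    else if (error_code & PF_PK)
        cause = "protection keys violation";
    else
        cause = "permissions violation";

    /* The SGX bit is independent, so it is appended as an extension. */
    snprintf(buf, len, "#PF: error_code(0x%04lx) - %s, in %s space%s",
             error_code, cause,
             from_user_mode ? "user" : "kernel",
             (error_code & PF_SGX) ? " during SGX" : "");
    return buf;
}
```

The raw number always stays in the output, so anyone who prefers to decode the bits by hand loses nothing.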

   Linus


Re: MADV_HUGEPAGE vs. NUMA semantic (was: Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression)

2018-12-06 Thread Linus Torvalds
[ Oops. different thread for me due to edited subject, so I saw this
after replying to the earlier email by David ]

On Thu, Dec 6, 2018 at 1:14 AM Michal Hocko  wrote:
>
> MADV_HUGEPAGE changes the picture because the caller expressed a need
> for THP and is willing to go extra mile to get it.

Actually, I think MADV_HUGEPAGE should just be
"TRANSPARENT_HUGEPAGE_ALWAYS but only for this vma".

So MADV_HUGEPAGE shouldn't change any behavior at all, if the kernel
was built with TRANSPARENT_HUGEPAGE_ALWAYS.

Put another way: even if you decide to run a kernel that does *not*
have that "always THP" (because you presumably think that it's too
blunt an instrument), then MADV_HUGEPAGE says "for _this_ vma, do the
'always THP' behavior"

I think those semantics would be a whole lot easier to explain to
people, and perhaps more importantly, starting off from that kind of
mindset also gives good guidance to what MADV_HUGEPAGE behavior should
be: it should be sane enough that it makes sense as the _default_
behavior for the TRANSPARENT_HUGEPAGE_ALWAYS configuration.

But that also means that no, MADV_HUGEPAGE doesn't really change the
picture. All it does is say "I know that for this vma, THP really
does make sense as a default".

It doesn't say "I _have_ to have THP", exactly like
TRANSPARENT_HUGEPAGE_ALWAYS does not mean that every allocation should
strive to be THP.

> I believe that something like the below would be sensible
> 1) THP on a local node with compaction not giving up too early
> 2) THP on a remote node in NOWAIT mode - so no direct
>    compaction/reclaim (trigger kswapd/kcompactd only for
>    defrag=defer+madvise)
> 3) fallback to the base page allocation

That doesn't sound insane to me. That said, the numbers David quoted
do fairly strongly imply that local small-pages are actually preferred
to any remote THP pages.

But *that* in turn makes for other possible questions:

 - if the reason we couldn't get a local hugepage is that we're simply
out of local memory (huge *or* small), then maybe a remote hugepage is
better.

   Note that this now implies that the choice can be an issue of "did
the hugepage allocation fail due to fragmentation, or due to the node
being low on memory"

and there is the other question that I asked in the other thread
(before subject edit):

 - how local is the load to begin with?

   Relatively shortlived processes - or processes that are explicitly
bound to a node - might have different preferences than some
long-lived process where the CPU bounces around, and might have
different trade-offs for the local vs remote question too.

So just based on David's numbers, and some wild handwaving on my part,
a slightly more complex, but still very sensible default might be
something like

 1) try to do a cheap local node hugepage allocation

Rationale: everybody agrees this is the best case.

But if that fails:

 2) look at compacting the local node, but not very hard.

If there's lots of memory on the local node, but synchronous
compaction doesn't do anything easily, just fall back to small pages.

Rationale: local memory is generally more important than THP.

If that fails (ie local node is simply low on memory):

 3) Try to do remote THP allocation

 Rationale: Ok, we simply didn't have a lot of local memory, so
it's not just a question of fragmentation. If it *had* been
fragmentation, lots of small local pages would have been better than a
remote THP page.

 Oops, remote THP allocation failed (possibly after synchronous
remote compaction, but maybe this is where we do kcompactd).

 4) Just do any small page, and do reclaim etc. THP isn't happening,
and it's not a priority when you're starting to feel memory pressure.
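
The four steps above can be read as a small decision function. The following is only a sketch of that ordering in plain user-space C; the struct, its fields, and the function name are hypothetical inputs invented for illustration and do not correspond to real allocator state.

```c
/* Possible outcomes of the fallback order described above. */
enum thp_choice { LOCAL_THP, LOCAL_SMALL, REMOTE_THP, ANY_SMALL };

/* Hypothetical summary of what the allocator would have to find out. */
struct node_state {
    int local_thp_free;       /* a local hugepage is cheaply available */
    int local_memory_low;     /* local node is short on memory, huge or small */
    int light_compaction_ok;  /* cheap local compaction yields a hugepage */
    int remote_thp_available; /* some remote node can provide a hugepage */
};

static enum thp_choice thp_fallback(const struct node_state *s)
{
    /* 1) cheap local hugepage: the case everybody agrees on */
    if (s->local_thp_free)
        return LOCAL_THP;

    /*
     * 2) plenty of local memory, so the failure is fragmentation:
     *    compact a little, otherwise take local small pages, because
     *    local memory matters more than THP.
     */
    if (!s->local_memory_low)
        return s->light_compaction_ok ? LOCAL_THP : LOCAL_SMALL;

    /* 3) local node is simply low on memory: a remote THP is worth it */
    if (s->remote_thp_available)
        return REMOTE_THP;

    /* 4) THP isn't happening: any small page, with reclaim etc. */
    return ANY_SMALL;
}
```

The interesting input is local_memory_low: it is what distinguishes "failed due to fragmentation" from "failed because the node is out of memory", and it flips the preference between local small pages and a remote hugepage.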

In general, I really would want to avoid magic kernel command lines
(or sysfs settings, or whatever) making a huge difference in behavior.
So I really wish people would see the whole
'transparent_hugepage_flags' thing as a way for kernel developers to
try different settings, not as a way for users to tune their loads.

Our default should work as sane defaults, we shouldn't have a "ok,
let's have this sysfs tunable and let people make their own
decisions". That's a cop-out.

Btw, don't get me wrong: I'm not suggesting removing the sysfs knob.
As a debug tool, it's great, where you can ask "ok, do things work
better if you set THP-defrag to defer+madvise".

I'm just saying that we should *not* use that sysfs flag as an excuse
for "ok, if we get the default wrong, people can make their own
defaults". We should strive to do well enough that it really shouldn't
be an issue in normal situations.

 Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-06 Thread Linus Torvalds
On Thu, Dec 6, 2018 at 3:43 PM David Rientjes  wrote:
>
> On Broadwell, the access latency to local small pages was +5.6%, remote
> hugepages +16.4%, and remote small pages +19.9%.
>
> On Naples, the access latency to local small pages was +4.9%, intrasocket
> hugepages +10.5%, intrasocket small pages +19.6%, intersocket small pages
> +26.6%, and intersocket hugepages +29.2%

Are those two last numbers transposed?

Or why would small page accesses be *faster* than hugepages for the
intersocket case?

Of course, depending on testing, maybe the page itself was remote, but
the page tables were random, and you happened to get a remote page
table for the hugepage case?

> The results on Murano were similar, which is why I suspect Aneesh
> introduced the __GFP_THISNODE requirement for thp in 4.0, which preferred,
> in order, local small pages, remote 1-hop hugepages, remote 2-hop
> hugepages, remote 1-hop small pages, remote 2-hop small pages.

It sounds like on the whole the TLB advantage of hugepages is smaller
than the locality advantage.

Which doesn't surprise me on x86, because TLB costs really are fairly
low. Very good TLB fills, relative to what I've seen elsewhere.

> So it *appears* from the x86 platforms that NUMA matters much more
> significantly than hugeness, but remote hugepages are a slight win over
> remote small pages.  PPC appeared the same wrt the local node but then
> prefers hugeness over affinity when it comes to remote pages.

I do think POWER at least historically has much weaker TLB fills, but
also very costly page table creation/teardown. Constant-time O(1)
arguments about hash lookups are only worth so much when the constant
time is pretty big. They've been working on it.

So at least on POWER, afaik one issue is literally that hugepages made
the hash setup and teardown situation much better.

One thing that might be worth looking at is whether the process itself
is all that node-local. Maybe we could aim for a policy that says
"prefer local memory, but if we notice that the accesses to this vma
aren't all that local, then who cares?".

IOW, the default could be something more dynamic than just "always use
__GFP_THISNODE". It could be more along the lines of "start off using
__GFP_THISNODE, but for longer-lived processes that bounce around
across nodes, maybe relax it?"

I don't think we have that kind of information right now, though, do we?

Honestly, I think things like vm_policy etc should not be the solution
- yes, some people may know *exactly* what access patterns they want,
but for most situations, I think the policy should be that defaults
"just work".

In fact, I wish even MADV_HUGEPAGE itself were to approach being a
no-op with THP.

We already have TRANSPARENT_HUGEPAGE_ALWAYS being the default kconfig
option (but I think it's a bit debatable, because I'm not sure
everybody always agrees about memory use), so on the whole
MADV_HUGEPAGE shouldn't really *do* anything.

 Linus


Re: [PATCH] x86/mm/fault: Streamline the fault error_code decoder some more

2018-12-06 Thread Linus Torvalds
On Thu, Dec 6, 2018 at 12:28 PM Andy Lutomirski  wrote:
>
> "read" isn't an actual bit in the error code, so I thought it would be
> polite to make it look a little bit different.

If you care about the bits in the error code, then just look at the number.

And if you care about what the numbers mean, it doesn't matter how it's encoded.

I don't think you should mix up the two concepts.

> Sure.  Although it's extremely odd for us to OOPS from user mode, so
> maybe the OOPS code in general should print a big fat warning, and
> we'll just otherwise assume it was from kernel mode.

Yeah, the "from user mode" case is generally really something horribly
bad (reserved bits being set or us screwing up LDT's etc), so yeah,
maybe a better model is indeed to point that out explicitly, the same
way the "user/kernel doesn't match U/S bit" is pointed out.

  Linus


Re: [PATCH] x86/mm/fault: Streamline the fault error_code decoder some more

2018-12-06 Thread Linus Torvalds
On Thu, Dec 6, 2018 at 11:07 AM Andy Lutomirski  wrote:
>
> How do you like the attached patch?

I agree with whoever thought it's odd that "read" is in lower case
when everything else is in upper case.

And honestly, I'd just suggest making the err_text simply have the
real user/kernel difference in it too, using something like

    pr_alert("#PF error: 0x%04lx%s from %s mode\n", error_code, err_txt,
        user_mode(regs) ? "user" : "kernel" );

The "oh, PF_USER and user_mode differs, so let's point that out
explicitly" thing is a good thing regardless.

 Linus


Re: siginfo pid not populated from ptrace?

2018-12-06 Thread Linus Torvalds
On Thu, Dec 6, 2018 at 6:40 AM Eric W. Biederman  wrote:
>
> We have in the past had ptrace users that weren't just about debugging
> so I don't know that it is fair to just dismiss it as debugging
> infrastructure.

Absolutely.

Some uses are more than just debug. People occasionally use ptrace
because it's the only way to do what they want, so you'll find people
who do it for sandboxing, for example. It's not necessarily designed
for that, or particularly fast or well-suited for it, but I've
definitely seen it used that way.

So I don't think the behavioral test breakage like this is necessarily
a huge deal, and until some "real use" actually shows that it cares it
might be something we dismiss as "just test", but it very much has the
potential to hit real uses.

The fact that a behavioral test broke is definitely interesting.

And maybe some of the siginfo allocations could depend on whether the
signal is actually ever caught or not.

For example, a terminal signal (or one that is ignored) might not need
siginfo. But if the process is ptraced, maybe that terminal signal
isn't actually terminal? So we might have situations where we want to
simply check "is the signal target being ptraced"..

  Linus


Re: [RFC] avoid indirect calls for DMA direct mappings

2018-12-06 Thread Linus Torvalds
On Thu, Dec 6, 2018 at 10:28 AM Linus Torvalds
 wrote:
>
> Put another way, you made the fast case unnecessarily slow.

Side note: the code seems to be a bit confused about it, because
*some* cases test the fast case first, and some do it after they've
already accessed the pointer for the slow case.

So even aside from the performance and code generation issue (and
possible future "use a special bit pattern for the fast case"), it
would be good for _consistency_ to just always do the fast-case test
first.

Linus


Re: [PATCH] x86/mm/fault: Streamline the fault error_code decoder some more

2018-12-06 Thread Linus Torvalds
On Wed, Dec 5, 2018 at 11:34 PM Ingo Molnar  wrote:
>
> Yeah, so I don't like the overly long 'SUPERVISOR' and the somewhat
> inconsistent, sporadic handling of negatives. Here's our error code bits:
>
> /*
>  * Page fault error code bits:
>  *
>  *   bit 0 ==    0: no page found       1: protection fault
>  *   bit 1 ==    0: read access         1: write access
>  *   bit 2 ==    0: kernel-mode access  1: user-mode access

No. Really not at all.

Bit 2 is *not* "kernel vs user".  Never has been. Never will be.

It's a single bit that mixes up *three* different cases:

 - regular user mode access (value: 1)

 - regular CPL0 access (value: 0)

 - CPU system access (value: 0)

and that third case really is important and relevant. And importantly,
it can happen from user space.

In fact, these days we possibly have a fourth case:

 - kernel access using wruss (value: 1)

and I'd rather see just the numbers (which you have to know to decode)
than see the simplified AND WRONG decoding of those numbers.

Please don't ever confuse the fault U/S bit with "user vs kernel".
It's just not true, and people should be very very aware of it not
being true.

If you care whether a page fault happened in user mode or not, you
have to look at the register state (ie "user_mode(regs)").

Please call the U/S bit something else than "user" or "kernel".
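
To make the distinction concrete, here is a small user-space sketch that classifies a fault from both inputs: the U/S bit of the error code and the mode taken from the saved registers. Everything named fake_* and the enum are stand-ins invented for this sketch; in the kernel the mode check is user_mode(regs).

```c
#include <stdbool.h>

#define PF_USER (1UL << 2)   /* the U/S bit of the error code */

/* Hypothetical stand-in for the saved register state. */
struct fake_regs {
    unsigned int cs;         /* low two bits hold the privilege level */
};

/* Stand-in for user_mode(regs): CPL 3 means the CPU was in user mode. */
static bool fake_user_mode(const struct fake_regs *regs)
{
    return (regs->cs & 3) == 3;
}

enum us_case {
    USER_ACCESS,             /* U/S=1, in user mode: regular user access */
    CPL0_ACCESS,             /* U/S=0, in kernel mode: regular CPL0 access */
    SYSTEM_ACCESS_FROM_USER, /* U/S=0, in user mode: CPU system access */
    KERNEL_USER_ACCESS       /* U/S=1, in kernel mode: e.g. wruss */
};

static enum us_case classify_fault(unsigned long error_code,
                                   const struct fake_regs *regs)
{
    bool us_bit = (error_code & PF_USER) != 0;
    bool in_user = fake_user_mode(regs);

    if (us_bit)
        return in_user ? USER_ACCESS : KERNEL_USER_ACCESS;
    return in_user ? SYSTEM_ACCESS_FROM_USER : CPL0_ACCESS;
}
```

Two of the four outcomes are exactly the ones where the U/S bit and the actual mode disagree, which is why the bit alone cannot be labeled "user" or "kernel".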

   Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Linus Torvalds
On Wed, Dec 5, 2018 at 3:51 PM Linus Torvalds
 wrote:
>
> Ok, I've applied David's latest patch.
>
> I'm not at all objecting to tweaking this further, I just didn't want
> to have this regression stand.

Hmm. Can somebody (David?) also perhaps try to state what the
different latency impacts end up being? I suspect it's been mentioned
several times during the argument, but it would be nice to have a
"going forward, this is what I care about" kind of setup for good
default behavior.

How much of the problem ends up being about the cost of compaction vs
the cost of getting a remote node bigpage?

That would seem to be a fairly major issue, but __GFP_THISNODE affects
both. It limits compaction to just this node, in addition to obviously
limiting the allocation result.

I realize that we probably do want to just have explicit policies that
do not exist right now, but what are (a) sane defaults, and (b) sane
policies?

For example, if we cannot get a hugepage on this node, but we *do* get
a node-local small page, is the local memory advantage simply better
than the possible TLB advantage?

Because if that's the case (at least commonly), then that in itself is
a fairly good argument for "hugepage allocations should always be
THISNODE".

But David also did mention the actual allocation overhead itself in
the commit, and maybe the math is more "try to get a local hugepage,
but if no such thing exists, see if you can get a remote hugepage
_cheaply_".

So another model can be "do local-only compaction, but allow non-local
allocation if the local node doesn't have anything". IOW, if other
nodes have hugepages available, pick them up, but don't try to compact
other nodes to do so?

And yet another model might be "do a least-effort thing, give me a
local hugepage if it exists, otherwise fall back to small pages".

So there are different combinations of "try compaction" vs "local-remote".

  Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Linus Torvalds
On Wed, Dec 5, 2018 at 3:36 PM Andrea Arcangeli  wrote:
>
> Like said earlier still better to apply __GFP_COMPACT_ONLY or David's
> patch than to return to v4.18 though.

Ok, I've applied David's latest patch.

I'm not at all objecting to tweaking this further, I just didn't want
to have this regression stand.

   Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-05 Thread Linus Torvalds
On Wed, Dec 5, 2018 at 12:40 PM Andrea Arcangeli  wrote:
>
> So ultimately we decided that the saner behavior that gives the least
> risk of regression for the short term, until we can do something
> better, was the one that is already applied upstream.

You're ignoring the fact that people *did* report things regressed.

That's the part I find unacceptable. You're saying "we picked
something that minimized regressions".

No it didn't. The regression is present and real, and is on a real
load, not a benchmark.

So that argument is clearly bogus.

I'm going to revert the commit since people apparently seem to be
ignoring this fundamental issue.

Real workloads regressed.  The regressions got reported. Ignoring that
isn't acceptable.

Linus


Re: [PATCH] x86/fault: Print "SUPERVISOR" and "READ" when decoding #PF oops

2018-12-05 Thread Linus Torvalds
On Wed, Dec 5, 2018 at 11:27 AM Randy Dunlap  wrote:
>
> BTW, what does PK mean?

"Protection Key"

   Linus


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Linus Torvalds
On Tue, Dec 4, 2018 at 11:49 AM Linus Torvalds
 wrote:
>
> because honestly, the *only* reason we hold on to that lock is for the
> insane and not really interesting case of "somebody tried to use
> ptrace to change the creds in-flight during the exec".

No, sorry, me confused. Not somebody trying to change them, it's just
ptrace_attach() trying to change _our_ state during this sequence, and
relying on it all being atomic.

So taking a ref is unnecessary and pointless. It's not the creds that
change, it's that we really want to delay ptrace_attach().

We could maybe set that "we're busy now" flag, and have
ptrace_attach() do something like

    if (task_is_busy(task)) {
        sched_yield();
        return -ERESTARTSYS;
    }

or something like that.

  Linus


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Linus Torvalds
On Tue, Dec 4, 2018 at 11:33 AM Linus Torvalds
 wrote:
>
> Looking at this, I'm agreeing that it would be better to just try to
> narrow down the cred_guard_mutex use a lot.

Ho humm. This is a crazy idea, but I don't see why it wouldn't work.

How about we:

 - stop holding on to cred_guard_mutex entirely in the exec path

and instead just do:

 - prepare_bprm_creds takes a ref to our old creds, and saves it off in the bprm

 - security_bprm_{committing,committed}_creds() can do its "is this a
valid transition" using the saved-off old creds instead of the current
creds

because honestly, the *only* reason we hold on to that lock is for the
insane and not really interesting case of "somebody tried to use
ptrace to change the creds in-flight during the exec".

Or maybe we could just add a task state flag that says "in exec, you
can't modify the creds in this window, because we're about to switch
to new creds".

Again, no *normal* situation will even notice or care, I think. We
hold the cred lock purely to make sure that the sequence from
prepare_exec_creds -> install_exec_creds is "atomic" wrt credentials,
and it already is for all the normal cases since this is all inside a
single execve system call.

   Linus


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Linus Torvalds
On Tue, Dec 4, 2018 at 10:17 AM Michal Hocko  wrote:
>
> > How about something like we set PF_NOFREEZE when we set PF_EXITING? At
> > that point we've pretty much turned into a kernel thread, no?
>
> Hmm, that doesn't sound like a bad idea but I am not sure it will
> help because those threads we are waiting for might be blocked way before
> they reached PF_EXITING.

Yeah, looks that way. We've got the whole "zap_other_threads() ->
actually starting the exit" window, which is probably much bigger than
the "start the exit -> release_task" window.

So we'd have to mark things non-freezable at zap time, not at exit
time, and that's a lot more questionable.

Looking at this, I'm agreeing that it would be better to just try to
narrow down the cred_guard_mutex use a lot.

Oleg, if you had a patch that got push-back for that, maybe this problem
is now the impetus for people to say "yeah, that's not nice but we
clearly need to do it".

I'm not finding any old emails on this, but considering I redid my
email setup recently, that doesn't necessarily mean much.

   Linus


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-04 Thread Linus Torvalds
On Tue, Dec 4, 2018 at 1:58 AM Michal Hocko  wrote:
>
> AFAIU both suspend and hibernation require the system to enter quiescent
> state with no task potentially interfering with suspended devices. And
> in this particular case those de-thread-ed threads will certainly not
> interfere so silencing the lockdep sounds like a reasonable workaround.

I still think it would be better to simply not freeze killed user processes.

We already have things like

    if (test_tsk_thread_flag(p, TIF_MEMDIE))
        return false;

exactly because we do not want to freeze processes that are about to
die due to being killed. Very similar situation: we don't want to
freeze those processes, because doing so would halt them from freeing
the resources that may be needed for suspend or hibernate.

How about something like we set PF_NOFREEZE when we set PF_EXITING? At
that point we've pretty much turned into a kernel thread, no?
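
A toy model of that idea, assuming made-up flag values and fake_* helper names (none of this is the kernel's actual freezer code): the exit path sets the no-freeze flag together with the exiting flag, and the freeze check then skips the task the same way it skips a TIF_MEMDIE one.

```c
#include <stdbool.h>

/* Flag values are arbitrary for this sketch, not the kernel's. */
#define FAKE_PF_EXITING  0x1u
#define FAKE_PF_NOFREEZE 0x2u

/* Hypothetical, heavily reduced task structure. */
struct fake_task {
    unsigned int flags;
    bool memdie;             /* stands in for the TIF_MEMDIE thread flag */
};

/* The suggestion: mark the task non-freezable when it starts exiting. */
static void fake_start_exit(struct fake_task *p)
{
    p->flags |= FAKE_PF_EXITING | FAKE_PF_NOFREEZE;
}

/* Reduced freeze check: dying or no-freeze tasks are left alone. */
static bool fake_should_freeze(const struct fake_task *p)
{
    if (p->memdie)
        return false;
    if (p->flags & FAKE_PF_NOFREEZE)
        return false;
    return true;
}
```

The sketch only covers tasks that have already reached the exit path; a thread that is still blocked somewhere before setting the exiting flag would remain freezable under this model.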

Linus


Re: [patch V2 27/28] x86/speculation: Add seccomp Spectre v2 user space protection mode

2018-12-04 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 5:38 PM Tim Chen  wrote:
>
> To make the usage of STIBP and its working principles clear,
> here are some additional explanations of STIBP from our Intel
> HW architects.  This should also help answer some of the questions
> from Thomas and others on STIBP's usages with IBPB and IBRS.
>
> Thanks.
>
> Tim
>
> ---
>
> STIBP
> ^^^^^
> Implementations of STIBP on existing Core-family processors (where STIBP
> functionality was added through a microcode update) work by disabling
> branch predictors that both:
>
> 1. Contain indirect branch predictions for both hardware threads, and
> 2. Do not contain a dedicated thread ID bit

Honestly, it still feels entirely misguided to me.

The above is not STIBP. It's just "disable IB". There's nothing "ST" about it.

So on processors where there is no thread ID bit (or per-thread
predictors), Intel simply SHOULD NOT EXPOSE this at all.

As it is, I refuse to call this shit "STIBP", because on current CPU's
that's simply a lie.

Being "technically correct" is not an excuse. It's just lying. I would
really hope that we restrict the lying to politicians, and not do it
in technical documentation.

   Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 2:04 PM Linus Torvalds
 wrote:
>
> so I think all of David's patch is somewhat sensible, even if that
> specific "order == pageblock_order" test really looks like it might
> want to be clarified.

Side note: I think maybe people should just look at that whole
compaction logic for that block, because it doesn't make much sense to
me:

        /*
         * Checks for costly allocations with __GFP_NORETRY, which
         * includes THP page fault allocations
         */
        if (costly_order && (gfp_mask & __GFP_NORETRY)) {
                /*
                 * If compaction is deferred for high-order allocations,
                 * it is because sync compaction recently failed. If
                 * this is the case and the caller requested a THP
                 * allocation, we do not want to heavily disrupt the
                 * system, so we fail the allocation instead of entering
                 * direct reclaim.
                 */
                if (compact_result == COMPACT_DEFERRED)
                        goto nopage;

                /*
                 * Looks like reclaim/compaction is worth trying, but
                 * sync compaction could be very expensive, so keep
                 * using async compaction.
                 */
                compact_priority = INIT_COMPACT_PRIORITY;
        }

this is where David wants to add *his* odd test, and I think everybody
looks at that added case

+               if (order == pageblock_order &&
+                               !(current->flags & PF_KTHREAD))
+                       goto nopage;

and just goes "Eww".

But I think the real problem is that it's the "goto nopage" thing that
makes _sense_, and the current cases for "let's try compaction" that
are the odd ones, and then David adds one new special case for the
sensible behavior.

For example, why would COMPACT_DEFERRED mean "don't bother", but not
all the other reasons it didn't really make sense?

So does it really make sense to fall through AT ALL to that "retry"
case, when we explicitly already had (gfp_mask & __GFP_NORETRY)?

Maybe the real fix is to instead of adding yet another special case
for "goto nopage", it should just be unconditional: simply don't try
to compact large-pages if __GFP_NORETRY was set.

Hmm? I dunno. Right now - for 4.20, I'd obviously want to keep changes
smallish, so a hacky added special case might be the right thing to
do. But the code does look odd, doesn't it?
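
The difference between the existing check and the unconditional variant floated above can be written down as two predicates. This is a sketch only: the reduced enum and both function names are hypothetical, and the real slow path has many more inputs.

```c
#include <stdbool.h>

/* Hypothetical, reduced set of compaction outcomes for this sketch. */
enum fake_compact_result {
    FAKE_COMPACT_SKIPPED,
    FAKE_COMPACT_DEFERRED,
    FAKE_COMPACT_OTHER
};

/* Existing behavior: only a deferred compaction result bails out. */
static bool current_goto_nopage(bool costly_order, bool noretry,
                                enum fake_compact_result result)
{
    return costly_order && noretry && result == FAKE_COMPACT_DEFERRED;
}

/*
 * The idea above: __GFP_NORETRY on a costly order always bails out,
 * whatever the reason compaction did not succeed.
 */
static bool proposed_goto_nopage(bool costly_order, bool noretry,
                                 enum fake_compact_result result)
{
    (void)result;            /* deliberately ignored in this variant */
    return costly_order && noretry;
}
```

With the second predicate there is no special case left to add: the pageblock_order test from the patch would be covered by the general rule.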

I think part of it comes from the fact that we *used* to do the
compaction first, and then we did the reclaim, and then it was
re-organized to do reclaim first, but it tried to keep semantic
changes minimal and some of the above comes from that re-org.

I think.

Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 12:12 PM Andrea Arcangeli  wrote:
>
> On Mon, Dec 03, 2018 at 11:28:07AM -0800, Linus Torvalds wrote:
> >
> > One is the patch posted by Andrea earlier in this thread, which seems
> > to target just this known regression.
>
> For the short term the important thing is to fix the VM regression one
> way or another, I don't personally mind which way.
>
> > The other seems to be to revert commit ac5b2c1891 and instead apply
> >
> >   https://lore.kernel.org/lkml/alpine.deb.2.21.1810081303060.221...@chino.kir.corp.google.com/
> >
> > which also seems to be sensible.
>
> In my earlier review of David's patch, it looked runtime equivalent to
> the __GFP_COMPACT_ONLY solution. It has the only advantage of adding a

I think there's a missing "not" in the above.

> new gfpflag until we're sure we need it but it's the worst solution
> available for the long term in my view. It'd be ok to apply it as
> stop-gap measure though.

So I have no really strong opinions either way.

Looking at the two options, I think I'd personally have a slight
preference for that patch by David, not so much because it doesn't add
a new GFP flag, but because it seems to make it a lot more explicit
that GFP_TRANSHUGE_LIGHT automatically implies __GFP_NORETRY.

I think that makes a whole lot of conceptual sense with the whole
meaning of GFP_TRANSHUGE_LIGHT. It's all about "no
reclaim/compaction", but honestly, doesn't __GFP_NORETRY make sense?

So I look at David's patch, and I go "that makes sense", and then I
compare it with ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
MADV_HUGEPAGE mappings") and that makes me go "ok, that's a hack".

So *if* reverting ac5b2c18911f and applying David's patch instead
fixes the KVM latency issues (which I assume it really should do,
simply thanks to __GFP_NORETRY), then I think that makes more sense.

That said, I do agree that the

if (order == pageblock_order ...)

test in __alloc_pages_slowpath() in David's patch then argues for
"that looks hacky".  But that code *is* inside the test for

if (costly_order && (gfp_mask & __GFP_NORETRY)) {

so within the context of that (not visible in the patch itself), it
looks like a sensible model. The whole point of that block is, as the
comment above it says

/*
 * Checks for costly allocations with __GFP_NORETRY, which
 * includes THP page fault allocations
 */

so I think all of David's patch is somewhat sensible, even if that
specific "order == pageblock_order" test really looks like it might
want to be clarified.

BUT.

With all that said, I really don't mind that __GFP_COMPACT_ONLY
approach either. I think David's patch makes sense in a bigger
context, while the __GFP_COMPACT_ONLY patch makes sense in the context
of "let's just fix this _particular_ special case".

As long as both work (and apparently they do), either is perfectly fine by me.

Some kind of "Thunderdome for patches" is needed, with an epic soundtrack.

   "Two patches enter, one patch leaves!"

I don't so much care which one.

 Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 10:59 AM Michal Hocko  wrote:
>
> You are misinterpreting my words. I haven't dismissed anything. I do
> recognize both usecases under discussion.
>
> I have merely said that a better THP locality needs more work and during
> the review discussion I have even volunteered to work on that.

We have two known patches that seem to have no real downsides.

One is the patch posted by Andrea earlier in this thread, which seems
to target just this known regression.

The other seems to be to revert commit ac5b2c1891  and instead apply

  
https://lore.kernel.org/lkml/alpine.deb.2.21.1810081303060.221...@chino.kir.corp.google.com/

which also seems to be sensible.

I'm not seeing why the KVM load would react badly to either of those
models, and they are known to fix the google local-node issue.

  Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 10:30 AM Michal Hocko  wrote:
>
> I do not get it. 5265047ac301 which this patch effectively reverts has
> regressed kvm workloads. People started to notice only later because
> they were not running on kernels with that commit until later. We have
> 4.4 based kernels reports. What do you propose to do for those people?

We have at least two patches that others claim to fix things.

You dismissed them and said "can't be done".

As a result, I'm not really interested in this discussion.

Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 10:15 AM Michal Hocko  wrote:
>
> The thing is that there is no universal win here. There are two
> different types of workloads and we cannot satisfy both.

Ok, if that's the case, then I'll just revert the commit.

Michal, our rules are very simple: we don't generate regressions. It's
better to have old reliable behavior than to start creating *new*
problems.

 Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-12-03 Thread Linus Torvalds
On Wed, Nov 28, 2018 at 8:48 AM Linus Torvalds
 wrote:
>
> On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying  wrote:
> >
> > In general, memory allocation fairness among processes should be a good
> > thing.  So I think the report should have been a "performance
> > improvement" instead of "performance regression".
>
> Hey, when you put it that way...
>
> Let's ignore this issue for now, and see if it shows up in some real
> workload and people complain.

Well, David Rientjes points out that it *does* cause real problems in
real workloads, so it's not just this benchmark run that shows the
issue.

So I guess we should revert, or at least fix. David, please post your
numbers again in public along with your suggested solution...

   Linus


Re: [RFC PATCH 1/1] epoll: use rwlock in order to reduce ep_poll_callback() contention

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 3:03 AM Roman Penyaev  wrote:
>
> Also I'm not quite sure where to put very special lockless variant
> of adding element to the list (list_add_tail_lockless() in this
> patch).  Seems keeping it locally is safer.

That function is scary, and can be mis-used so easily that I
definitely don't want to see it anywhere else.

Afaik, it's *really* important that only "add_tail" operations can be
done in parallel.

This also ends up making the memory ordering of "xchg()" very very
important. Yes, we've documented it as being an ordering op, but I'm
not sure we've relied on it this directly before.

I also note that now we do more/different locking in the waitqueue
handling, because the code now takes both that rwlock _and_ the
waitqueue spinlock for wakeup. That also makes me worried that the
"waitqueue_active()" games are no longer reliable. I think they're
fine (looks like they are only done under the write-lock, so it's
effectively the same serialization anyway), but the upshot of all of
this is that I *really* want others to look at this patch too. A lot
of small subtle things here.

Linus


Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-03 Thread Linus Torvalds
On Mon, Dec 3, 2018 at 6:17 AM Michal Hocko  wrote:
>
> This argument just doesn't make any sense. Rare bugs are maybe even more
> annoying because you do not expect them to happen.

Absolutely.

And valid lockdep complaints are a real issue too.

So I don't think there's any question that this should be reverted,
the only question is whether I take the revert directly or get it from
the PM tree.

It does sound like the de_thread() behavior needs more work. Maybe,
for example, we need to make it clear that zapped threads are *not*
frozen, and instead the freezer will wait for them to exit?

Hmm?

  Linus


Linux 4.20-rc5

2018-12-02 Thread Linus Torvalds
  KVM: nVMX/nSVM: Fix bug which sets vcpu->arch.tsc_offset to L1 tsc_offset

Li Zhijian (2):
  kselftests/bpf: use ping6 as the default ipv6 ping binary when it exists
  initramfs: clean old path before creating a hardlink

Lihong Yang (1):
  i40e: Fix deletion of MAC filters

Linus Torvalds (3):
  test_hexdump: use memcpy instead of strncpy
  unifdef: use memcpy instead of strncpy
  Linux 4.20-rc5

Liran Alon (5):
  KVM: nVMX: Fix kernel info-leak when enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS more than once
  KVM: x86: Fix kernel info-leak in KVM_HC_CLOCK_PAIRING hypercall
  KVM: VMX: Update shared MSRs to be saved/restored on MSR_EFER.LMA changes
  KVM: nVMX: Verify eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD
  KVM: nVMX: vmcs12 revision_id is always VMCS12_REVISION even when copied from eVMCS

Lorenzo Bianconi (1):
  net: thunderx: fix NULL pointer dereference in nic_remove

Lorenzo Pieralisi (1):
  ACPI/IORT: Fix iort_get_platform_device_domain() uninitialized pointer value

Luis Chamberlain (2):
  MAINTAINERS: name change for Luis
  lib/test_kmod.c: fix rmmod double free

Luiz Capitulino (1):
  KVM: VMX: re-add ple_gap module parameter

Lyude Paul (6):
  drm/amd/dm: Don't forget to attach MST encoders
  drm/amd/dm: Understand why attaching path/tile properties are needed
  drm/dp_mst: Skip validating ports during destruction, just ref
  drm/meson: Enable fast_io in meson_dw_hdmi_regmap_config
  drm/meson: Fix OOB memory accesses in meson_viu_set_osd_lut()
  Revert "drm/dp_mst: Skip validating ports during destruction, just ref"

Majd Dibbiny (1):
  RDMA/mlx5: Fix fence type for IB_WR_LOCAL_INV WR

Manu Gautam (2):
  phy: qcom-qusb2: Use HSTX_TRIM fused value as is
  phy: qcom-qusb2: Fix HSTX_TRIM tuning with fused value for SDM845

Marek Szyprowski (1):
  usb: gadget: u_ether: fix unsafe list iteration

Martin Kelly (1):
  iio:st_magn: Fix enable device after trigger

Martin Schwidefsky (1):
  s390/mm: correct pgtable_bytes on page table downgrade

Martynas Pumputis (1):
  bpf: fix check of allowed specifiers in bpf_trace_printk

Masami Hiramatsu (1):
  arm64: ftrace: Fix to enable syscall events on arm64

Mathias Kresin (1):
  MIPS: ralink: Fix mt7620 nd_sd pinmux

Matias Bjørling (1):
  ia64: export node_distance function

Max Filippov (3):
  xtensa: enable coprocessors that are being flushed
  xtensa: fix coprocessor context offset definitions
  xtensa: fix coprocessor part of ptrace_{get,set}xregs

Maximilian Heyne (1):
  fs: fix lost error code in dio_complete

Michael Guralnik (1):
  IB/mlx5: Avoid load failure due to unknown link width

Michael Niewöhner (1):
  usb: core: quirks: add RESET_RESUME quirk for Cherry G230 Stream series

Michael Roth (1):
  KVM: PPC: Book3S HV: Fix handling for interrupted H_ENTER_NESTED

Mika Westerberg (1):
  thunderbolt: Prevent root port runtime suspend during NVM upgrade

Mikulas Patocka (1):
  PCI: Fix incorrect value returned from pcie_get_speed_cap()

Ming Lei (1):
  block: fix single range discard merge

Nathan Chancellor (2):
  ARM: OMAP2+: prm44xx: Fix section annotation on omap44xx_prm_enable_io_wakeup
  cachefiles: Explicitly cast enumerated type in put_object

Neil Armstrong (1):
  drm/meson: Fixes for drm_crtc_vblank_on/off support

NeilBrown (1):
  fscache: fix race between enablement and dropping of object

Nicolin Chen (2):
  hwmon (ina2xx) Fix NULL id pointer in probe()
  hwmon: (ina2xx) Fix current value calculation

Nikolay Borisov (1):
  btrfs: Always try all copies when reading extent buffers

Pan Bian (9):
  btrfs: relocation: set trans to be NULL after ending transaction
  exportfs: do not read dentry after free
  ext2: fix potential use after free
  rapidio/rionet: do not free skb before reading its length
  net: hisilicon: remove unexpected free_netdev
  pvcalls-front: fixes incorrect error handling
  hfs: do not free node before using
  hfsplus: do not free node before using
  ocfs2: fix potential use after free

Parav Pandit (1):
  RDMA/core: Add GIDs while changing MAC addr only for registered ndev

Paul Burton (1):
  MAINTAINERS: Update linux-mips mailing list address

Paul Moore (1):
  selinux: add support for RTM_NEWCHAIN, RTM_DELCHAIN, and RTM_GETCHAIN

Pavankumar Kondeti (1):
  sched, trace: Fix prev_state output in sched_switch tracepoint

Pavel Tikhomirov (1):
  mm: cleancache: fix corruption on missed inode invalidation

Peter Ujfalusi (4):
  ASoC: omap-abe-twl6040: Fix missing audio card caused by deferred probing
  ASoC: omap-mcbsp: Fix latency value calculation for pm_qos
  ASoC: omap-mcpdm: Add pm_qos handling to avoid under/overruns with CPU_IDLE
  ASoC: omap-dmic: Add pm_qos handling to avoid overruns with CPU_IDLE

Pet

Re: [GIT PULL] Driver core fix for 4.20-rc5

2018-11-30 Thread Linus Torvalds
On Fri, Nov 30, 2018 at 8:06 AM Greg KH  wrote:
>
> It resolves an issue with the data alignment in 'struct devres' for the
> ARC platform.  The full details are in the commit changelog, but the
> short summary is the change is a single line:
>
> -   unsigned long long  data[]; /* guarantee ull alignment */
> +   u8 __aligned(ARCH_KMALLOC_MINALIGN) data[];

Hmm.

Are you aware that this is up to 128 bytes? Including on common
architectures like ARM64?

I've done the pull, but honestly, that seems a bit excessive, when a
fair amount of devres users seem to have fairly small data (ie looking
at "size", I see things like

p = devres_alloc(dmam_device_release, sizeof(void *), GFP_KERNEL);

or

dr = devres_alloc(devm_gpio_release, sizeof(unsigned), GFP_KERNEL);

that have allocations of a couple of bytes, and the new model means
that those allocations will be aligned to 128-byte boundaries, and
then (because ARCH_KMALLOC_MINALIGN, again) you'll end up actually
wasting 256 bytes for a tiny structure on ARM64.
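
To put rough numbers on that, here is a minimal sketch. It assumes ARCH_KMALLOC_MINALIGN is 128 and that the allocator rounds every request up to a multiple of it; `struct devres_sketch`, `round_up_pow2()` and `devres_footprint()` are made-up names for illustration, not the real devres layout or any kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value: the "up to 128 bytes" figure for ARCH_KMALLOC_MINALIGN. */
#define SKETCH_MINALIGN 128

/* Hypothetical stand-in for struct devres, not the real layout. */
struct devres_sketch {
	void *next;               /* list linkage (simplified) */
	void *prev;
	void (*release)(void *);  /* release callback */
	/* Same shape as the one-line change: payload aligned to MINALIGN. */
	unsigned char __attribute__((aligned(SKETCH_MINALIGN))) data[];
};

/* Round size up to a power-of-two alignment. */
static size_t round_up_pow2(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}

/* Bytes consumed when a caller asks for 'payload' bytes of data. */
static size_t devres_footprint(size_t payload)
{
	return round_up_pow2(sizeof(struct devres_sketch) + payload,
			     SKETCH_MINALIGN);
}
```

Under those assumptions the small header gets padded out to 128 bytes before data[] even starts, so `devres_footprint(sizeof(void *))` comes out as 256.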

Maybe it doesn't matter. But it does seem somewhat excessive to do
things like this.

Yeah, on x86, the alignment isn't even noticeable, being just 8 bytes.

   Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-30 Thread Linus Torvalds
On Fri, Nov 30, 2018 at 10:39 AM Josh Poimboeuf  wrote:
>
> AFAICT, all the other proposed options seem to have major issues.

I still absolutely detest this patch, and in fact it got worse from
the test of the config variable.

Honestly, the entry code being legible and simple is more important
than the extra cycle from branching to a trampoline for static calls.

Just don't do the inline case if it causes this much confusion.

 Linus


Re: [PATCH 0/2] [GIT PULL] tracing: More fixes for 4.20

2018-11-30 Thread Linus Torvalds
On Fri, Nov 30, 2018 at 9:41 AM Linus Torvalds
 wrote:
>
> I went back and merged things [..]

Note that I did this as two merges, even if one would have done (since
the second pull request was just adding new commits on top of the
first).

This way I got the matching diffstat from your pull requests, but more
importantly also the independent merge messages.

The history looks slightly odd this way (with two adjacent merges of
continuous history), but I thought I'd explain the reason.

   Linus


Re: [PATCH 0/2] [GIT PULL] tracing: More fixes for 4.20

2018-11-30 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 7:19 PM Steven Rostedt  wrote:
>
> Note, this is on top of a previous git pull that I have submitted:
>
>   http://lkml.kernel.org/r/20181127224031.76681...@vmware.local.home

Hmm.

I had dismissed that, because the patch descriptors for that series
had had "for-next" in them.

https://lore.kernel.org/lkml/20181122002801.501220...@goodmis.org/

so I dismissed that pull request as not being for this release at
all.

I went back and merged things, but in general, please try to avoid
confusing me. I'm easily confused when I get mixed messages about the
patches and the pull requests, and will then generally default to
"ignore, this is informational".

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 12:25 PM Josh Poimboeuf  wrote:
>
> On Thu, Nov 29, 2018 at 11:27:00AM -0800, Andy Lutomirski wrote:
> >
> > I propose a different solution:
> >
> > As in this patch set, we have a direct and an indirect version.  The
> > indirect version remains exactly the same as in this patch set.  The
> > direct version just only does the patching when all seems well: the
> > call instruction needs to be 0xe8, and we only do it when the thing
> > doesn't cross a cache line.  Does that work?  In the rare case where
> > the compiler generates something other than 0xe8 or crosses a cache
> > line, then the thing just remains as a call to the out of line jmp
> > trampoline.  Does that seem reasonable?  It's a very minor change to
> > the patch set.
>
> Maybe that would be ok.  If my math is right, we would use the
> out-of-line version almost 5% of the time due to cache misalignment of
> the address.

Note that I don't think cache-line alignment is necessarily sufficient.

The I$ fetch from the cacheline can happen in smaller chunks, because
the bus between the I$ and the instruction decode isn't a full
cacheline (well, it is _now_ in modern big cores, but it hasn't always
been).

So even if the cacheline is updated atomically, I could imagine seeing
a partial fetch from the I$ (old values) and then a second partial
fetch (new values).

It would be interesting to know what the exact fetch rules are.
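
As a rough check on the numbers, here is a small sketch that counts how many start offsets make a field straddle a fetch-unit boundary. It assumes a 5-byte "call rel32" (the 0xe8 opcode followed by a 4-byte displacement) and a 64-byte cacheline; that only the 4-byte displacement has to stay within one unit is a guess at how the "almost 5%" figure was reached, and `straddling_offsets()` is a made-up helper.

```c
#include <assert.h>

/*
 * Count the start offsets within one fetch unit of 'unit_size' bytes at
 * which the byte range [start + field_off, start + field_off + field_len)
 * straddles a unit boundary.
 */
static unsigned int straddling_offsets(unsigned int unit_size,
				       unsigned int field_off,
				       unsigned int field_len)
{
	unsigned int start, count = 0;

	assert(unit_size > 0 && field_len > 0);
	for (start = 0; start < unit_size; start++) {
		unsigned int first = start + field_off;
		unsigned int last = first + field_len - 1;

		/* First and last byte land in different units. */
		if (first / unit_size != last / unit_size)
			count++;
	}
	return count;
}
```

With a 64-byte unit the displacement straddles at 3 of 64 offsets (about 4.7%), and the whole 5-byte instruction at 4 of 64. If the fetch unit were only 16 bytes, the same 3 offsets would be out of 16, which is a lot more than 5%.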

 Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 11:16 AM Steven Rostedt  wrote:
>
> But then we need to implement all numbers of parameters.

Oh, I agree, it's nasty.

But it's actually a nastiness that we've solved before. In particular,
with the system call mappings, which have pretty much the exact same
issue of "map unknown number of arguments to registers".

Yes, it's different - there you map the unknown number of arguments to
a structure access instead. And yes, the macros are unbelievably ugly.
See

arch/x86/include/asm/syscall_wrapper.h

and the __MAP() macro from

include/linux/syscalls.h

so it's not pretty. But it would solve all the problems.

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 11:08 AM Linus Torvalds
 wrote:
>
> What you can do then is basically add a single-byte prefix to the
> "call" instruction that does nothing (say, cs override), and then
> replace *that* with a 'int3' instruction.

Hmm. The segment prefixes are documented as being "reserved" for
branch instructions. I *think* that means just conditional branches
(Intel at one point used the prefixes for static prediction
information), not "call", but who knows..

It might be better to use an empty REX prefix on x86-64 or something like that.

   Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 10:58 AM Linus Torvalds
 wrote:
>
> In contrast, if the call was wrapped in an inline asm, we'd *know* the
> compiler couldn't turn a "call wrapper(%rip)" into anything else.

Actually, I think I have a better model - if the caller is done with inline asm.

What you can do then is basically add a single-byte prefix to the
"call" instruction that does nothing (say, cs override), and then
replace *that* with a 'int3' instruction.

Boom. Done.

Now, the "int3" handler can just update the instruction in-place, but
leave the "int3" in place, and then return to the next instruction
byte (which is just the normal branch instruction without the prefix
byte).

The cross-CPU case continues to work, because the 'int3' remains in
place until after the IPI.

But that would require that we'd mark those call instruction with

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 10:47 AM Steven Rostedt  wrote:
>
> Note, we do have a bit of control at what is getting called. The patch
> set requires that the callers are wrapped in macros. We should not
> allow just any random callers (like from asm).

Actually, I'd argue that asm is often more controlled than C code.

Right now you can do odd things if you really want to, and have the
compiler generate indirect calls to those wrapper functions.

For example, I can easily imagine a pre-retpoline compiler turning

 if (cond)
   fn1(a,b);
 else
   fn2(a,b);

into a function pointer conditional

(cond ? fn1 : fn2)(a,b);

and honestly, the way "static_call()" works now, can you guarantee
that the call-site doesn't end up doing that, and calling the
trampoline function for two different static calls from one indirect
call?
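
A minimal sketch of why the compiler is free to do that: the two forms are indistinguishable at the C level. Here `fn1`, `fn2`, `call_direct()` and `call_indirect()` are placeholder names, and nothing in this is static_call-specific.

```c
#include <assert.h>

static int fn1(int a, int b) { return a + b; }
static int fn2(int a, int b) { return a - b; }

/* Two direct calls, as written in the source. */
static int call_direct(int cond, int a, int b)
{
	if (cond)
		return fn1(a, b);
	else
		return fn2(a, b);
}

/* One indirect call through a selected pointer: same observable result. */
static int call_indirect(int cond, int a, int b)
{
	return (cond ? fn1 : fn2)(a, b);
}
```

Both functions return the same value for every input, so a call site that was meant to be two patchable direct calls could legally be emitted as a single indirect one.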

See what I'm talking about? Saying "callers are wrapped in macros"
doesn't actually protect you from the compiler doing things like that.

In contrast, if the call was wrapped in an inline asm, we'd *know* the
compiler couldn't turn a "call wrapper(%rip)" into anything else.

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 10:00 AM Andy Lutomirski  wrote:
> > then it really sounds pretty safe to just say "ok, just make it
> > aligned and update the instruction with an atomic cmpxchg or
> > something".
>
> And how do we do that?  With a gcc plugin and some asm magic?

Asm magic.

You already have to mark the call sites with

static_call(fn, arg1, arg2, ...);

and right now it just magically depends on gcc outputting the
right code to call the trampoline. But it could do it as a jmp
instruction (tail-call), and maybe that works right, maybe it doesn't.
And maybe some gcc switch makes it output it as a indirect call due to
instrumentation or something. Doing it with asm magic would, I feel,
be safer anyway, so that we'd know *exactly* how that call gets done.

For example, if gcc does it as a jmp due to a tail-call, the
compiler/linker could in theory turn the jump into a short jump if it
sees that the trampoline is close enough. Does that happen? Probably
not. But I don't see why it *couldn't* happen in the current patch
series. The trampoline is just a regular function, even if it has been
defined by global asm.

Putting the trampoline in a different code section could fix things
like that (maybe there was a patch that did that and I missed it?) but
I do think that doing the call with an asm would *also* fix it.

But the "just always use a trampoline" is certainly the simpler model.

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 9:59 AM Steven Rostedt  wrote:
>
> Do you realize that the cmpxchg used by the first attempts of the
> dynamic modification of code by ftrace was the source of the e1000e
> NVRAM corruption bug.

If you have a static call in IO memory, you have bigger problems than that.

What's your point?

Again - I will point out that the things you guys have tried to come
up with have been *WORSE*. Much worse.

   Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 9:50 AM Linus Torvalds
 wrote:
>
>  - corrupt random registers because we "know" they aren't in use

Just to clarify: I think that's a completely unacceptable model.

We already have lots of special calling conventions, including ones
that do not have any call-clobbered registers at all, because we have
special magic calls in inline asm.

Some of those might be prime material for doing static calls (ie PV-op
stuff, where the native model does *not* change any registers).

So no. Don't do ugly hacks like that.

   Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 9:44 AM Steven Rostedt  wrote:
>
> Well, the current method (as Jiri mentioned) did get the OK from at
> least Intel (and that was with a lot of arm twisting to do so).

Guys, when the comparison is to:

 - create a huge honking security hole by screwing up the stack frame

or

 - corrupt random registers because we "know" they aren't in use

then it really sounds pretty safe to just say "ok, just make it
aligned and update the instruction with an atomic cmpxchg or
something".

Of course, another option is to just say "we don't do the inline case,
then", and only ever do a call to a stub that does a "jmp"
instruction.

Problem solved, at the cost of some I$. Emulating a "jmp" is trivial,
in ways emulating a "call" is not.

  Linus


Re: remove the ->mapping_error method from dma_map_ops V2

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 8:23 AM Christoph Hellwig  wrote:
>
> We can.  At least in theory.  The problem is that depending on the
> crazy mapping from physical and kernel virtual address to dma addresses
> these might be pages at pretty random places.  Look at fun like
> arch/x86/pci/sta2x11-fixup.c for how ugly these mappings could look.
>
> It also means that we might have setup swiotlb on just about every
> 32-bit architecture, even if it has no real addressing limit except for
> the one we imposed.

No. Really. If there's no iotlb, then you just mark that one page
reserved. It simply doesn't get used. It doesn't mean you suddenly
need a swiotlb.

If there *is* a iotlb, none of this should matter, because you'd just
never map anything into that page.

But whatever. It's independent from the patch series under discussion.
Make dma_mapping_error() at least return a real error (eg -EINVAL, or
whatever is the common error), and we can maybe do this later.

Or, better yet, plan on removing the single-page dma mapping entirely
at a later date, and make the issue moot.

  Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 9:13 AM Steven Rostedt  wrote:
>
> No, we really do need to sync after we change the second part of the
> command with the int3 on it. Unless there's another way to guarantee
> that the full instruction gets seen when we replace the int3 with the
> finished command.

Making sure the call instruction is aligned with the I$ fetch boundary
should do that.

It's not in the SDM, but neither was our current behavior - we
were/are just relying on "it will work".

 Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 9:02 AM Andy Lutomirski  wrote:
> >
> > - just restart the instruction (with the suggested "ptregs->rip --")
> >
> > - to avoid any "oh, we're not making progress" issues, just fix the
> > instruction yourself to be the right call, by looking it up in the
> > "what needs to be fixed" tables.
>
> I thought that too.  I think it deadlocks. CPU A does text_poke_bp().  CPU B 
> is waiting for a spinlock with IRQs off.  CPU C holds the spinlock and hits 
> the int3.  The int3 never goes away because CPU A is waiting for CPU B to 
> handle the sync_core IPI.
>
> Or do you think we can avoid the IPI while the int3 is there?

I'm handwaving and thinking that CPU C that hits the int3 can just fix
up the instruction directly in its own caches, and return.

Yes, it does what the "text_poke" *will* do (so now the instruction
gets rewritten _twice_), but who cares? It's idempotent.

And no, I don't have code, just "maybe some handwaving like this"

   Linus


Re: [PATCH v2 4/4] x86/static_call: Add inline static call implementation for x86-64

2018-11-29 Thread Linus Torvalds
On Thu, Nov 29, 2018 at 8:33 AM Josh Poimboeuf  wrote:
>
> This seems to work...
>
> +   .if \create_gap == 1
> +   .rept 6
> +   pushq 5*8(%rsp)
> +   .endr
> +   .endif
> +
> -idtentry int3  do_int3 has_error_code=0
> +idtentry int3  do_int3 has_error_code=0  create_gap=1

Ugh. Doesn't this entirely screw up the stack layout, which then
screws up  task_pt_regs(), which then breaks ptrace and friends?

... and you'd only notice it for users that use int3 in user space,
which now writes random locations on the kernel stack, which is then a
huge honking security hole.

It's possible that I'm confused, but let's not play random games with
the stack like this. The entry code is sacred, in scary ways.

So no. Do *not* try to change %rsp on the stack in the bp handler.
Instead, I'd suggest:

 - just restart the instruction (with the suggested "ptregs->rip --")

 - to avoid any "oh, we're not making progress" issues, just fix the
instruction yourself to be the right call, by looking it up in the
"what needs to be fixed" tables.

No?

  Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-28 Thread Linus Torvalds
On Tue, Nov 27, 2018 at 7:20 PM Huang, Ying  wrote:
>
> From the above data, for the parent commit 3 processes exited within
> 14s, another 3 exited within 100s.  For this commit, the first process
> exited at 203s.  That is, this commit makes memory allocation more fair
> among processes, so that processes proceeded at more similar speed.  But
> this raises system memory footprint too, so triggered much more swap,
> thus lower benchmark score.
>
> In general, memory allocation fairness among processes should be a good
> thing.  So I think the report should have been a "performance
> improvement" instead of "performance regression".

Hey, when you put it that way...

Let's ignore this issue for now, and see if it shows up in some real
workload and people complain.

 Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Linus Torvalds
On Tue, Nov 27, 2018 at 12:57 PM Andrea Arcangeli  wrote:
>
> This difference can only happen with defrag=always, and that's not the
> current upstream default.

Ok, thanks. That makes it a bit less critical.

> That MADV_HUGEPAGE causes fights with NUMA balancing is not great
> indeed, qemu needs NUMA locality too, but then the badness caused by
> __GFP_THISNODE was a larger regression in the worst case for qemu.
[...]
> So the short term alternative again would be the alternate patch that
> does __GFP_THISNODE|GFP_ONLY_COMPACT appended below.

Sounds like we should probably do this. Particularly since Vlastimil
pointed out that we'd otherwise have issues with the back-port for 4.4
where that "defrag=always" was the default.

The patch doesn't look horrible, and it directly addresses this
particular issue.

Is there some reason we wouldn't want to do it?

   Linus


Re: [LKP] [mm] ac5b2c1891: vm-scalability.throughput -61.3% regression

2018-11-27 Thread Linus Torvalds
On Mon, Nov 26, 2018 at 10:24 PM kernel test robot
 wrote:
>
> FYI, we noticed a -61.3% regression of vm-scalability.throughput due
> to commit ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for
> MADV_HUGEPAGE mappings")

Well, that's certainly noticeable and not good.

Andrea, I suspect it might be causing fights with auto numa migration..

Lots more system time, but also look at this:

>    1122389 ±  9%     +17.2%    1315380 ±  4%  proc-vmstat.numa_hit
>     214722 ±  5%     +21.6%     261076 ±  3%  proc-vmstat.numa_huge_pte_updates
>    1108142 ±  9%     +17.4%    1300857 ±  4%  proc-vmstat.numa_local
>     145368 ± 48%     +63.1%     237050 ± 17%  proc-vmstat.numa_miss
>     159615 ± 44%     +57.6%     251573 ± 16%  proc-vmstat.numa_other
>     185.50 ± 81%   +8278.6%      15542 ± 40%  proc-vmstat.numa_pages_migrated

Should the commit be reverted? Or perhaps at least modified?

 Linus


Re: [PATCHi v2] mm: put_and_wait_on_page_locked() while page is migrated

2018-11-27 Thread Linus Torvalds
On Tue, Nov 27, 2018 at 8:49 AM Christopher Lameter  wrote:
>
> A process has no refcount on a page struct and is waiting for it to become
> unlocked? Why? Should it not simply ignore that page and continue?

The problem isn't that you can just "continue".

You need to *retry*.

And you can't just busy-loop. You want to wait until the page state
has changed, and _then_ retry.

  Linus


Re: [PATCH 4/8] HID: input: use the Resolution Multiplier for high-resolution scrolling

2018-11-26 Thread Linus Torvalds
On Thu, Nov 22, 2018 at 3:28 PM Peter Hutterer  wrote:
>
> The device sends hi-res values of 4, so it should end up as REL_WHEEL_HI_RES
> 30. We are getting 28 instead which doesn't add up to a nice 120.

I think you're just doing the math in the wrong order.

Why don't you just do

update = val * 120 / multiplier

which gives you the expected "30".

It seems you have done the "120 / multiplier" too early, and you force
that value into "wheel_factor". Don't. Do all the calculations
(including all the accumulated ones) in the original values, and only
do the "multiply by 120 and divide by multiplier" at the very end.
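
A minimal sketch of the difference, assuming a multiplier of 16 (the value that turns a hi-res delta of 4 into the 28 and 30 quoted above); `scale_early()` and `scale_late()` are made-up names, not the functions in the patch.

```c
#include <assert.h>

/* Divide first: 120 / multiplier is truncated before it is used. */
static int scale_early(int val, int multiplier)
{
	int wheel_factor;

	assert(multiplier > 0);
	wheel_factor = 120 / multiplier;   /* 120 / 16 truncates to 7 */
	return val * wheel_factor;
}

/* Multiply first, divide last: truncation happens once, at the end. */
static int scale_late(int val, int multiplier)
{
	assert(multiplier > 0);
	return val * 120 / multiplier;
}
```

With val = 4 and multiplier = 16, the early division gives 4 * 7 = 28 while the late one gives 480 / 16 = 30, so four such events add up to 112 in one case and a clean 120 in the other.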

Hmm?

  Linus


Re: [PATCHi v2] mm: put_and_wait_on_page_locked() while page is migrated

2018-11-26 Thread Linus Torvalds
On Mon, Nov 26, 2018 at 11:27 AM Hugh Dickins  wrote:
>
> +enum behavior {
> +   EXCLUSIVE,  /* Hold ref to page and take the bit when woken, like
> +                * __lock_page() waiting on then setting PG_locked.
> +                */
> +   SHARED,     /* Hold ref to page and check the bit when woken, like
> +                * wait_on_page_writeback() waiting on PG_writeback.
> +                */
> +   DROP,       /* Drop ref to page before wait, no check when woken,
> +                * like put_and_wait_on_page_locked() on PG_locked.
> +                */
> +};

Ack, thanks.

Linus


Linux 4.20-rc4

2018-11-25 Thread Linus Torvalds
 tools cpupower: Override CFLAGS assignments

Johan Hovold (3):
  gnss: serial: fix synchronous write timeout
  gnss: sirf: fix synchronous write timeout
  mtd: rawnand: atmel: fix OF child-node lookup

John Stultz (1):
  wlcore: Fixup "Add support for optional wakeirq"

Jon Maloy (2):
  tipc: fix lockdep warning when reinitilaizing sockets
  tipc: don't assume linear buffer when reading ancillary data

Juliet Kim (1):
  net/ibmnvic: Fix deadlock problem in reset

Kai-Heng Feng (4):
  USB: Wait for extra delay time after USB_PORT_FEAT_RESET for quirky hub
  USB: quirks: Add no-lpm quirk for Raydium touchscreens
  HID: multitouch: Add pointstick support for Cirque Touchpad
  HID: i2c-hid: Disable runtime PM for LG touchscreen

Karsten Graul (1):
  net/smc: use queue pair number when matching link group

Keerthy (2):
  opp: ti-opp-supply: Dynamically update u_volt_min
  opp: ti-opp-supply: Correct the supply in _get_optimal_vdd_voltage call

Kenneth Feng (1):
  drm/amdgpu: Enable HDP memory light sleep

Konstantin Khlebnikov (1):
  tools/power/cpupower: fix compilation with STATIC=true

Kuppuswamy Sathyanarayanan (1):
  usb: dwc3: Fix NULL pointer exception in dwc3_pci_remove()

Linus Torvalds (1):
  Linux 4.20-rc4

Lorenzo Bianconi (3):
  mt76: fix uninitialized mutex access setting rts threshold
  net: thunderx: set xdp_prog to NULL if bpf_prog_add fails
  net: thunderx: set tso_hdrs pointer to NULL in nicvf_free_snd_queue

Lu Baolu (1):
  iommu/vt-d: Fix NULL pointer dereference in prq_event_thread()

Luc Van Oostenryck (1):
  MAINTAINERS: change Sparse's maintainer

Luca Coelho (1):
  iwlwifi: mvm: don't use SAR Geo if basic SAR is not used

Lucas Bates (1):
  tc-testing: tdc.py: ignore errors when decoding stdout/stderr

Lukas Wunner (1):
  can: hi311x: Use level-triggered interrupt

Maarten Jacobs (1):
  usb: cdc-acm: add entry for Hiro (Conexant) modem

Marc Kleine-Budde (5):
  can: flexcan: remove not needed struct flexcan_priv::tx_mb and
struct flexcan_priv::tx_mb_idx
  can: dev: can_get_echo_skb(): factor out non sending code to
__can_get_echo_skb()
  can: dev: __can_get_echo_skb(): replace struct can_frame by
canfd_frame to access frame length
  can: dev: __can_get_echo_skb(): Don't crash the kernel if
can_priv::echo_skb is accessed out of bounds
  can: dev: __can_get_echo_skb(): print error message, if trying
to echo non existing skb

Martin Schiller (2):
  net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs
  net: phy: mdio-gpio: Fix working over slow can_sleep GPIOs

Mathias Nyman (3):
  xhci: Fix leaking USB3 shared_hcd at xhci removal
  xhci: handle port status events for removed USB3 hcd
  usb: xhci: Prevent bus suspend if a port connect change or
polling state is detected

Matt Chen (1):
  iwlwifi: fix wrong WGDS_WIFI_DATA_SIZE

Matthew Cover (1):
  tuntap: fix multiqueue rx

Matthew Wilcox (19):
  XArray: Fix xa_for_each with a single element at 0
  XArray: Export __xa_foo to non-GPL modules
  nilfs2: Use xa_erase_irq
  XArray: Regularise xa_reserve
  XArray: Unify xa_cmpxchg and __xa_cmpxchg
  XArray: Turn xa_erase into an exported function
  XArray: Add xa_store_bh() and xa_store_irq()
  XArray: Unify xa_store and __xa_store
  XArray: Handle NULL pointers differently for allocation
  XArray: Fix Documentation
  XArray: Correct xa_store_range
  XArray tests: Correct some 64-bit assumptions
  dax: Remove optimisation from dax_lock_mapping_entry
  dax: Make sure the unlocking entry isn't locked
  dax: Reinstate RCU protection of inode
  dax: Fix dax_unlock_mapping_entry for PMD pages
  dax: Fix huge page faults
  dax: Avoid losing wakeup in dax_lock_mapping_entry
  XArray tests: Add missing locking

Mattias Jacobsson (1):
  USB: misc: appledisplay: add 20" Apple Cinema Display

Mauro Carvalho Chehab (2):
  v4l2-controls: add a missing include
  media: dm365_ipipeif: better annotate a fall though

Maxime Chevallier (1):
  net: mvneta: Don't advertise 2.5G modes

Michael Chan (5):
  bnxt_en: Fix RSS context allocation.
  bnxt_en: Fix rx_l4_csum_errors counter on 57500 devices.
  bnxt_en: Disable RDMA support on the 57500 chips.
  bnxt_en: Workaround occasional TX timeout on 57500 A0.
  bnxt_en: Add software "missed_irqs" counter.

Michal Kalderon (1):
  qed: Fix rdma_info structure allocation

Moshe Shemesh (1):
  net/mlx5e: RX, verify received packet size in Linear Striding RQ

Nathan Chancellor (2):
  media: tc358743: Remove unnecessary self assignment
  misc: atmel-ssc: Fix section annotation on atmel_ssc_get_driver_data

Nicholas Kazlauskas (2):
  drm/amdgpu: Add amdgpu "max bpc" connector property (v2)
  drm/amd/display: Support amdgpu "m

Re: [patch V2 27/28] x86/speculation: Add seccomp Spectre v2 user space protection mode

2018-11-25 Thread Linus Torvalds
[ You forgot to fix your quilt setup.. ]

On Sun, 25 Nov 2018, Thomas Gleixner wrote:
>
> The mitigation guide documents how STIBP works:
>
>Setting bit 1 (STIBP) of the IA32_SPEC_CTRL MSR on a logical processor
>prevents the predicted targets of indirect branches on any logical
>processor of that core from being controlled by software that executes
>(or executed previously) on another logical processor of the same core.

Can we please just fix this stupid lie?

Yes, Intel calls it "STIBP" and tries to make it out to be about the
indirect branch predictor being per-SMT thread.

But the reason it is unacceptable is apparently because in reality it just
disables indirect branch prediction entirely. So yes, *technically* it's
true that that limits indirect branch prediction to just a single SMT
core, but in reality it is just a "go really slow" mode.

If STIBP had actually just keyed off the logical SMT thread, we wouldn't
need to have worried about it in the first place.

So let's document reality rather than Intel's Pollyanna world-view.

Reality matters. It's why we had to do all this. Lying about things
and making it appear like it's not a big deal was why the original
patch made it through without people noticing.

   Linus


Re: [PATCH] mm: put_and_wait_on_page_locked() while page is migrated

2018-11-25 Thread Linus Torvalds
On Sat, Nov 24, 2018 at 7:21 PM Hugh Dickins  wrote:
>
> Linus, I'm addressing this patch to you because I see from Tim Chen's
> thread that it would interest you, and you were disappointed not to
> root cause the issue back then.  I'm not pushing for you to fast-track
> this into 4.20-rc, but I expect Andrew will pick it up for mmotm, and
> thence linux-next.  Or you may spot a terrible defect, but I hope not.

The only terrible defect I spot is that I wish the change to the
'lock' argument in wait_on_page_bit_common() came with a comment
explaining the new semantics.

The old semantics were somewhat obvious (even if not documented): if
'lock' was set, we'd make the wait exclusive, and we'd lock the page
before returning. That kind of matches the intuitive meaning for the
function prototype, and it's pretty obvious in the callers too.

The new semantics don't have the same kind of really intuitive
meaning, I feel. That "-1" doesn't mean "unlock", it means "drop page
reference", so there is no longer a fairly intuitive and direct
mapping between the argument name and type and the behavior of the
function.

So I don't hate the concept of the patch at all, but I do ask for:

 - better documentation.

   This might not be "documentation" at all, maybe that "lock"
variable should just be renamed (because it's not about just locking
any more), and would be better off as a tristate enum called
"behavior" that has "LOCK, DROP, WAIT" values?

 - while it sounds likely that this is indeed the same issue that
plagues us with the insanely long wait-queues, it would be *really*
nice to have that actually confirmed.

   Does somebody still have access to the customer load that triggered
the horrible scaling issues before?

In particular, on that second issue: the "fixes" that went in for the
wait-queues didn't really fix any real scalability problem, it really
just fixed the excessive irq latency issues due to the long traversal
holding a lock.

If this really fixes the fundamental issue, that should show up as an
actual performance difference, I'd expect..

End result: I like and approve of the patch, but I'd like it a lot
more if the code behavior was clarified a bit, and I'd really like to
close the loop on that old nasty page wait queue issue...

   Linus


Re: [GIT PULL] XArray updates

2018-11-24 Thread Linus Torvalds
On Sat, Nov 24, 2018 at 7:00 PM Matthew Wilcox  wrote:
>
> I pushed it out to hkps://hkps.pool.sks-keyservers.net ... as I recall,
> pgp.mit.edu is frequently not synchronised well with the other keyservers,
> but I'm surprised that one of the sks-keyservers didn't have it.

I got it now, so it was just some propagation delay or something.

Thanks,
  Linus


Re: [GIT PULL] XArray updates

2018-11-24 Thread Linus Torvalds
On Sat, Nov 24, 2018 at 6:38 PM Matthew Wilcox  wrote:
>
> I generated a new key 5EC42E41545C1F5E and signed a new tag
> xarray-4.20-rc4

Hmm. Did you publicize it on any keyservers? I'm not finding the key
on pgp.mit.edu or on the sks-keyservers.net pool..

> I've signed that key with my old DSA key 2218C81E8E7C03FF which has
> about 400 signatures on it, but I understand is not terribly trustable
> these days.

.. but at least I can find that one.  I think a 1024-bit DSA key may
be borderline in theory, but it's a lot better than no key at all.

   Linus


Re: [GIT PULL] XArray updates

2018-11-24 Thread Linus Torvalds
On Sat, Nov 24, 2018 at 9:32 AM Matthew Wilcox  wrote:
>
>   git://git.infradead.org/users/willy/linux-dax.git xarray

Can you *please* make that a signed tag.

I don't trust infradead.org implicitly, so I really want signed tag
pull requests.  I may not always notice, but when I do, I abort the
pull.

The only site that doesn't need signed tags is kernel.org, which I can
ssh into and I know and trust the security model. And even there I
really prefer signed tags.

   Linus


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-23 Thread Linus Torvalds
On Fri, Nov 23, 2018 at 10:39 AM Andy Lutomirski  wrote:
>
> What is memcpy_to_io even supposed to do?  I’m guessing it’s defined as 
> something like “copy this data to IO space using at most long-sized writes, 
> all aligned, and writing each byte exactly once, in order.”  That sounds... 
> dubiously useful.

We've got hundreds of users of it, so it's fairly common..

> I could see a function that writes to aligned memory in specified-sized 
> chunks.

We have that. It's called "__iowrite{32,64}_copy()". It has very few users.

  Linus


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-23 Thread Linus Torvalds
On Fri, Nov 23, 2018 at 8:36 AM Linus Torvalds
 wrote:
>
> Let me write a generic routine in lib/iomap_copy.c (which already does
> the "user specifies chunk size" cases), and hook it up for x86.

Something like this?

ENTIRELY UNTESTED! It might not compile. Seriously. And if it does
compile, it might not work.

And this doesn't actually do the memset_io() function at all, just the
memcpy ones.

Finally, it's worth noting that on x86, we have this:

  /*
   * override generic version in lib/iomap_copy.c
   */
  ENTRY(__iowrite32_copy)
  movl %edx,%ecx
  rep movsd
  ret
  ENDPROC(__iowrite32_copy)

because back in 2006, we did this:

[PATCH] Add faster __iowrite32_copy routine for x86_64

This assembly version is measurably faster than the generic version in
lib/iomap_copy.c.

which actually implies that "rep movsd" is faster than doing
__raw_writel() by hand.

So it is possible that this should all be arch-specific code rather
than that butt-ugly "generic" code I wrote in this patch.

End result: I'm not really all that happy about this patch, but it's
perhaps worth testing, and it's definitely worth discussing. Because
our current memcpy_{to,from}io() is truly broken garbage.

   Linus
 arch/x86/include/asm/io.h |   6 ++
 include/linux/io.h        |   2 +
 lib/iomap_copy.c          | 153 ++
 3 files changed, 161 insertions(+)

diff --git a/arch/x86/include/asm/io.h b/arch/x86/include/asm/io.h
index 832da8229cc7..3b9206ee25b8 100644
--- a/arch/x86/include/asm/io.h
+++ b/arch/x86/include/asm/io.h
@@ -92,6 +92,12 @@ build_mmio_write(__writel, "l", unsigned int, "r", )
 
 #define mmiowb() barrier()
 
+void __iowrite_copy(void __iomem *to, const void *from, size_t count);
+void __ioread_copy(void *to, const void __iomem *from, size_t count);
+
+#define memcpy_toio __iowrite_copy
+#define memcpy_fromio __ioread_copy
+
 #ifdef CONFIG_X86_64
 
 build_mmio_read(readq, "q", u64, "=r", :"memory")
diff --git a/include/linux/io.h b/include/linux/io.h
index 32e30e8fb9db..642f78970018 100644
--- a/include/linux/io.h
+++ b/include/linux/io.h
@@ -28,6 +28,8 @@
 struct device;
 struct resource;
 
+void __ioread_copy(void *to, const void __iomem *from, size_t count);
+void __iowrite_copy(void __iomem *to, const void *from, size_t count);
 __visible void __iowrite32_copy(void __iomem *to, const void *from, size_t count);
 void __ioread32_copy(void *to, const void __iomem *from, size_t count);
 void __iowrite64_copy(void __iomem *to, const void *from, size_t count);
diff --git a/lib/iomap_copy.c b/lib/iomap_copy.c
index b8f1d6cbb200..8edc359dda62 100644
--- a/lib/iomap_copy.c
+++ b/lib/iomap_copy.c
@@ -17,6 +17,159 @@
 
 #include 
 #include 
+#include 
+
+static inline bool iomem_align(const void __iomem *ptr, int size, int count)
+{
+	return count >= size && (__force unsigned long)ptr & size;
+}
+
+
+/**
+ * __iowrite_copy - copy data to MMIO space
+ * @to: destination, in MMIO space
+ * @from: source
+ * @count: number of bytes to copy.
+ *
+ * Copy arbitrarily aligned data from kernel space to MMIO space,
+ * using reasonable chunking.
+ */
+void __attribute__((weak)) __iowrite_copy(void __iomem *to,
+	  const void *from,
+	  size_t count)
+{
+	if (iomem_align(to, 1, count)) {
+		unsigned char data = *(unsigned char *)from;
+		__raw_writeb(data, to);
+		from++;
+		to++;
+		count--;
+	}
+	if (iomem_align(to, 2, count)) {
+		unsigned short data = get_unaligned((unsigned short *)from);
+		__raw_writew(data, to);
+		from += 2;
+		to += 2;
+		count -= 2;
+	}
+#ifdef CONFIG_64BIT
+	if (iomem_align(to, 4, count)) {
+		unsigned int data = get_unaligned((unsigned int *)from);
+		__raw_writel(data, to);
+		from += 4;
+		to += 4;
+		count -= 4;
+	}
+#endif
+	while (count >= sizeof(unsigned long)) {
+		unsigned long data = get_unaligned((unsigned long *)from);
+#ifdef CONFIG_64BIT
+		__raw_writeq(data, to);
+#else
+		__raw_writel(data, to);
+#endif
+		from += sizeof(unsigned long);
+		to += sizeof(unsigned long);
+		count -= sizeof(unsigned long);
+	}
+
+#ifdef CONFIG_64BIT
+	if (count >= 4) {
+		unsigned int data = get_unaligned((unsigned int *)from);
+		__raw_writel(data, to);
+		from += 4;
+		to += 4;
+		count -= 4;
+	}
+#endif
+
+	if (count >= 2) {
+		unsigned short data = get_unaligned((unsigned short *)from);
+		__raw_writew(data, to);
+		from += 2;
+		to += 2;
+		count -= 2;
+	}
+
+	if (count) {
+		unsigned char data = *(unsigned char *)from;
+		__raw_writeb(data, to);
+	}
+}
+EXPORT_SYMBOL_GPL(__iowrite_copy);
+
+/**
+ * __ioread_copy - copy data from MMIO space
+ * @to: destination
+ * @from: source, in MMIO space
+ * @count: number of bytes to copy.
+ *
+ * Copy arbitrarily aligned data from MMIO space to kernel space,
+ * using reasonable chunking.
+ */
+void __attribute__((weak)) __ioread_copy(void *to,
+	

Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-23 Thread Linus Torvalds
On Fri, Nov 23, 2018 at 2:12 AM David Laight  wrote:
>
> I've just patched my driver and redone the test on a 4.13 (ubuntu) kernel.
> Calling memcpy_fromio(kernel_buffer, PCIe_address, length)
> generates a lot of single byte TLP.

I just tested it too - it turns out that the __inline_memcpy() code
never triggers, and "memcpy_toio()" just generates a memcpy.

So that code seems entirely dead.

And, in fact, the codebase I looked at was the historical one, because
I had been going back and looking at the history. The modern tree
*does* have the "__inline_memcpy()" function I pointed at, but it's
not actually hooked up to anything!

This actually has been broken for _ages_. The breakage goes back to
2010, and commit 6175ddf06b61 ("x86: Clean up mem*io functions"), and
it seems nobody really ever noticed - or thought that it was ok.

That commit claims that iomem has no special significance on x86, but
that really really isn't true, exactly because the access size does
matter.

And as mentioned, the generic memory copy routines are not at all
appropriate, and that has nothing to do with ERMS. Our "do it by hand"
memory copy routine does things like this:

  .Lless_16bytes:
        cmpl $8,%edx
        jb   .Lless_8bytes
        /*
         * Move data from 8 bytes to 15 bytes.
         */
        movq 0*8(%rsi), %r8
        movq -1*8(%rsi, %rdx),  %r9
        movq %r8,   0*8(%rdi)
        movq %r9,   -1*8(%rdi, %rdx)
        retq

and note how for a 8-byte copy, it will do *two* reads of the same 8
bytes, and *two* writes of the same 8 byte destination. That's
perfectly ok for regular memory, and it means that the code can handle
an arbitrary 8-15 byte copy without any conditionals or loop counts,
but it is *not* ok for iomem.
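
A minimal user-space model of that 8..15 byte path may make the overlap
easier to see. It is only a sketch: the function name is made up, and
the writes[] counter is a hypothetical stand-in for what a bus analyzer
would show, not anything the kernel keeps:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Two 8-byte loads, then two 8-byte stores, with the second pair
 * anchored at the end of the buffer. writes[i] counts how many stores
 * touched destination byte i.
 */
static void copy_8_to_15(unsigned char *dst, const unsigned char *src,
			 size_t len, unsigned int *writes)
{
	uint64_t head, tail;
	size_t i;

	assert(len >= 8 && len <= 15);

	memcpy(&head, src, 8);			/* movq 0*8(%rsi), %r8 */
	memcpy(&tail, src + len - 8, 8);	/* movq -1*8(%rsi, %rdx), %r9 */

	memcpy(dst, &head, 8);			/* movq %r8, 0*8(%rdi) */
	for (i = 0; i < 8; i++)
		writes[i]++;

	memcpy(dst + len - 8, &tail, 8);	/* movq %r9, -1*8(%rdi, %rdx) */
	for (i = len - 8; i < len; i++)
		writes[i]++;
}
```

For len == 8 every destination byte is stored twice; for len == 12
bytes 4..7 are stored twice. Regular memory does not care, but a device
register with side effects on write might.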

Of course, in practice it all just happens to work in almost all
situations (because a lot of iomem devices simply won't care), and
manual access to iomem is basically extremely rare these days anyway,
but it's definitely entirely and utterly broken.

End result: we *used* to do this right. For the last eight years our
"memcpy_{to,from}io()" has been entirely broken, and apparently even
the people who noticed oddities like David, never reported it as
breakage but instead just worked around it in drivers.

Ho humm.

Let me write a generic routine in lib/iomap_copy.c (which already does
the "user specifies chunk size" cases), and hook it up for x86.

David, are you using a bus analyzer or something to do your testing?
I'll have a trial patch for you asap.

   Linus


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-22 Thread Linus Torvalds
On Thu, Nov 22, 2018 at 10:07 AM Andy Lutomirski  wrote:
>
> I'm not personally volunteering, but I suspect we can do much better
> than we do now:
>
>  - The new MOVDIRI and MOVDIR64B instructions can do big writes to WC
> and UC memory.
>
>  - MOVNTDQA can, I think, do 64-byte loads, but only from WC memory.

No, performance isn't the _primary_ issue. Nobody uses MMIO and
expects high performance from the generic functions (but people may
then tweak individual drivers to do tricks).

And we've historically had various broken hardware that cares deeply
about access size. Trying to be clever and do big accesses could
easily break something.

The fact that nobody has complained about the generic memcpy routines
probably means that the broken hardware isn't in use any more, or it
just works anyway. And nobody has complained about performance either,
so it's clearly not a huge issue. "rep movs" probably works ok on WC
memory writes anyway, it's the UC case that is bad, but I don't think
anybody uses UC and then does the "memcpy_to/fromio()" things. If you
have UC memory, you tend to do the accesses properly.

So I suspect we should just write memcpy_{to,from}io() in terms of writel/readl.
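
As a rough illustration of what "in terms of writel" could look like,
here is a user-space sketch. Everything in it is an assumption made for
the example: struct mmio_model, model_writeb() and model_writel() are
hypothetical stand-ins for a device window and for writeb()/writel(),
and the head/tail handling uses plain byte stores for simplicity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REGION_SIZE 64

/* Hypothetical stand-in for a device window; not a kernel structure. */
struct mmio_model {
	unsigned char bytes[REGION_SIZE];
	unsigned int writes[REGION_SIZE];	/* stores seen per byte */
};

/* One store of exactly one byte. */
static void model_writeb(struct mmio_model *m, size_t off, uint8_t val)
{
	assert(off < REGION_SIZE);
	m->bytes[off] = val;
	m->writes[off]++;
}

/* One aligned store of exactly four bytes. */
static void model_writel(struct mmio_model *m, size_t off, uint32_t val)
{
	size_t i;

	assert(off % 4 == 0 && off + 4 <= REGION_SIZE);
	memcpy(&m->bytes[off], &val, 4);
	for (i = 0; i < 4; i++)
		m->writes[off + i]++;
}

static void model_memcpy_toio(struct mmio_model *m, size_t off,
			      const void *from, size_t count)
{
	const unsigned char *src = from;
	uint32_t word;

	/* Head: byte stores until the destination is 4-byte aligned. */
	while (count && (off % 4)) {
		model_writeb(m, off++, *src++);
		count--;
	}
	/* Body: aligned 32-bit stores; the source may be unaligned. */
	while (count >= 4) {
		memcpy(&word, src, 4);
		model_writel(m, off, word);
		off += 4;
		src += 4;
		count -= 4;
	}
	/* Tail: whatever is left, one byte at a time. */
	while (count) {
		model_writeb(m, off++, *src++);
		count--;
	}
}
```

The point of the model is the invariant, not the speed: every
destination byte is stored exactly once, in order, with a known access
size.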

Oh, and I just noticed that on x86 we expressly use our old "safe and
sane" functions: see __inline_memcpy(), and its use in
__memcpy_{from,to}io().

So the "falls back to memcpy" was always a red herring. We don't
actually do that.

Which explains why things work.

Linus


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-22 Thread Linus Torvalds
On Thu, Nov 22, 2018 at 9:36 AM David Laight  wrote:
>
> The other problem with the ERMS copy is that it gets used
> for copy_to/from_io() - and the 'rep movsb' on uncached
> locations has to do byte copies.

Ugh. I thought we changed that *long* ago, because even our non-ERMS
copy is broken for PCI (it does overlapping stores for the small tail
cases).

But looking at "memcpy_{from,to}io()", I don't see x86 overriding it
with anything better.

I suspect nobody uses those functions for anything critical any more.
The fbcon people have their own copy functions, iirc.

But we definitely should fix this. *NONE* of the regular memcpy
functions actually work right for PCI space any more, and haven't for
a long time.

 Linus


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-22 Thread Linus Torvalds
On Thu, Nov 22, 2018 at 9:26 AM Andy Lutomirski  wrote:
>
> So I think your patch is viable.  Also, with that patch applied,
> put_user_ex() should become worse than worthless

Yes. I hate those special-case _ex variants.

I guess I should just properly forward-port my patch series where the
different steps are separated out (not jumbled up like that patch I
actually posted).

Linus


Re: [PATCH 8/8] HID: logitech: Enable high-resolution scrolling on Logitech mice

2018-11-22 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 10:35 PM Peter Hutterer
 wrote:
>
> This patch is a combination of the now-reverted commits 1ff2e1a44e0,
> d56ca9855bf9, 5fe2ccbef9d, 044ee89028 together with some extra bits for the
> directional and timeout-based reset.

Instead of using an actual timer (which is quite expensive), how about
just saving the "last time" data along with the remainder?

We have a fairly low-overhead 'sched_clock()' function that gives a
clock approximation in nanoseconds (64 bits). Note that it doesn't
actually give nanosecond precision - it might fall back to jiffies for
hardware that doesn't have anything better, but for timeouts on the
order of a second, it's fine.
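
A minimal sketch of the "save the last time along with the remainder"
idea, written as user-space C so it can be tried standalone. The struct,
its field names and the one-second RESET_TIMEOUT_NS are assumptions for
illustration; in the kernel the timestamp would come from sched_clock()
rather than being passed in:

```c
#include <assert.h>
#include <stdint.h>

#define RESET_TIMEOUT_NS 1000000000ULL	/* assumed: roughly one second */

/* Hypothetical per-device scroll state; field names are illustrative. */
struct scroll_state {
	int remainder;		/* hi-res units not yet reported */
	uint64_t last_ns;	/* timestamp of the previous event */
};

/*
 * Accumulate one scroll event and return the number of whole steps to
 * report. No timer is armed: a stale remainder is simply discarded the
 * next time an event arrives.
 */
static int scroll_event(struct scroll_state *st, int delta, int threshold,
			uint64_t now_ns)
{
	int steps;

	assert(threshold > 0);

	/* The user paused for longer than the timeout: start over. */
	if (now_ns - st->last_ns > RESET_TIMEOUT_NS)
		st->remainder = 0;
	st->last_ns = now_ns;

	st->remainder += delta;
	steps = st->remainder / threshold;
	st->remainder -= steps * threshold;
	return steps;
}
```

The cost per event is one subtraction and one compare, and nothing has
to be cancelled or rearmed when the device goes away.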

  Linus


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-22 Thread Linus Torvalds
On Thu, Nov 22, 2018 at 2:32 AM Ingo Molnar  wrote:
> * Linus Torvalds  wrote:
> >
> > Random patch (with my "asm goto" hack included) attached, in case
> > people want to play with it.
>
> Doesn't even look all that hacky to me. Any hack in it that I didn't
> notice? :-)

The code to use asm goto sadly doesn't have any fallback at all for
the "no asm goto available".

I guess we're getting close to "we require asm goto support", but I
don't think we're there yet.

Also, while "unsafe_put_user()" has been converted to use asm goto
(and yes, it really does generate much nicer code), the same is not
true of "unsafe_get_user()". Because sadly, gcc does not support asm
goto with output values.

So, realistically, my patch is not _technically_ hacky, but it's
simply not viable as things stand, and it's more of a tech
demonstration than anything else.

   Linus


Re: [PATCH] x86/speculation: Revert turning on STIBP all the time

2018-11-21 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 12:51 PM Jiri Kosina  wrote:
>
> For -rc, I don't think we need to do this at this moment, given the
> prctl+seccomp fixup is basically ready, do we?

Agreed.

   Linus


Re: [patch 01/24] x86/speculation: Update the TIF_SSBD comment

2018-11-21 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 12:28 PM Linus Torvalds
 wrote:
>
> Ugh. Now you're using the broken quilt thing that makes a mush of emails for 
> me.

Reading the series in alpine makes it look fine. No testing, but each
patch seems sensible.

And yes, triggering on seccomp makes more sense than dumpable to me.

   Linus


Re: [patch 01/24] x86/speculation: Update the TIF_SSBD comment

2018-11-21 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 12:18 PM Thomas Gleixner  wrote:
>
> From: Tim Chen "Reduced Data Speculation" is an obsolete term.

Ugh. Now you're using the broken quilt thing that makes a mush of emails for me.

  Linus


Re: [Patch v6 14/16] x86/speculation: Use STIBP to restrict speculation on non-dumpable task

2018-11-21 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 12:07 PM Dave Hansen  wrote:
>
> Repurposing dumpable is really screwy and surely imprecise, but it
> really is the closest thing that we have without the new ABI.

But we *have* a new ABI.

So that's not a valid argument.

It's more like "this other thing that some other users use for
something *entirely* different has in _one_ case the semantics you'd
want, but in most cases not at all".

Because gpg really is the odd man out.

And it's not at all obvious that you can attack gpg using the hole
that STIBP opens, when there are other timing attacks that are likely
as good or better, and when we know that people who really care about
the issue are already just disabling SMT entirely.

That's really the basic issue here: STIBP has horrible overhead, _and_
it's not even targeting the people who really want it, so we'd better
be very targeted in how it's used.

Because we already know how badly things messed up when the use of
STIBP wasn't targeted.

The _only_ very real and direct advantage "dumpable" has is that it
hides the problem from benchmarks. Because benchmarks don't test
non-dumpable processes.

But honestly, that sounds like a disadvantage to me. It smells like
"let's hide the overhead dishonestly".

 Linus


Re: [Patch v6 14/16] x86/speculation: Use STIBP to restrict speculation on non-dumpable task

2018-11-21 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 9:41 AM Tim Chen  wrote:
>
> When STIBP is on, it will prevent not only untrusted code from attacking,
> but also trusted code from getting attacked.  So non-dumpable task running
> with STIBP will protect itself from attacks from code running on sibling CPU.

I understand.

You didn't read my email about why "dumpable" is not sensible.

 Linus


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-21 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 10:16 AM Linus Torvalds
 wrote:
>
> It might be interesting to just change raw_copy_to/from_user() to
> handle a lot more cases (in particular, handle cases where 'size' is
> 8-byte aligned). The special cases we *do* have may not be the right
> ones (the 10-byte case in particular looks odd).
>
> For example, instead of having a "if constant size is 8 bytes, do one
> get/put_user()" case, we might have a "if constant size is < 64 just
> unroll it into get/put_user()" calls.

Actually, x86 doesn't even set INLINE_COPY_TO_USER, so I don't think
the constant size cases ever trigger at all the way they are set up
now.

I do have a random patch that makes "unsafe_put_user()" actually use
"asm goto" for the error case, and that, together with the attached
patch seems to generate fairly nice code, but even then it would
depend on gcc actually unrolling things (which we do *not* want in
general).

But for a 32-byte user copy (cp_old_stat), and that
INLINE_COPY_TO_USER, it generates this:

        stac
        movl    $32, %edx       #, size
        movq    %rsp, %rax      #, src
.L201:
        movq    (%rax), %rcx    # MEM[base: src_155, offset: 0B], MEM[base: src_155, offset: 0B]
1:      movq %rcx,0(%rbp)       # MEM[base: src_155, offset: 0B], MEM[(struct __large_struct *)dst_156]
ASM_EXTABLE_HANDLE from=1b to=.L200 handler="ex_handler_uaccess"        #

        addq    $8, %rax        #, src
        addq    $8, %rbp        #, statbuf
        subq    $8, %rdx        #, size
        jne     .L201           #,
        clac

which is actually fairly close to "optimal".
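
For readers who prefer C to compiler output, the loop in the listing
above corresponds roughly to the following user-space model. It is a
sketch under stated assumptions: struct user_buf, put8() and
copy_words() are hypothetical names, and the "fault" is simulated with
an offset limit rather than a real exception table entry:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fault model: stores reaching past fault_at fail. */
struct user_buf {
	unsigned char *base;
	size_t fault_at;
};

/* Stand-in for one 8-byte unsafe_put_user(); nonzero means "faulted". */
static int put8(struct user_buf *u, size_t off, uint64_t val)
{
	if (off + 8 > u->fault_at)
		return -1;
	memcpy(u->base + off, &val, 8);
	return 0;
}

/*
 * Copy size bytes (a multiple of 8) in 8-byte steps and stop at the
 * first fault. Returns the number of bytes NOT copied, the same
 * convention raw_copy_to_user() uses.
 */
static size_t copy_words(struct user_buf *u, const void *src, size_t size)
{
	const unsigned char *s = src;
	size_t off = 0;
	uint64_t v;

	assert(size % 8 == 0);
	while (size) {
		memcpy(&v, s + off, 8);
		if (put8(u, off, v))
			break;		/* the asm jumps to .L200 here */
		off += 8;
		size -= 8;
	}
	return size;
}
```

The real code pays for stac/clac once around the whole loop, which is
the part a per-call put_user() cannot do.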

Random patch (with my "asm goto" hack included) attached, in case
people want to play with it.

Impressively, it actually removes more lines of code than it adds. But
I didn't actually check whether the end result *works*, so hey..

  Linus
 arch/x86/include/asm/uaccess.h    |  96 +--
 arch/x86/include/asm/uaccess_64.h | 191 ++
 fs/readdir.c                      |  22 +++--
 3 files changed, 149 insertions(+), 160 deletions(-)

diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index b5e58cc0c5e7..3f4c89deb7a1 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -12,6 +12,9 @@
 #include 
 #include 
 
+#define INLINE_COPY_TO_USER
+#define INLINE_COPY_FROM_USER
+
 /*
  * The fs value determines whether argument validity checking should be
  * performed or not.  If get_fs() == USER_DS, checking is performed, with
@@ -189,19 +192,14 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL))
 
 
 #ifdef CONFIG_X86_32
-#define __put_user_asm_u64(x, addr, err, errret)			\
-	asm volatile("\n"		\
-		 "1:	movl %%eax,0(%2)\n"			\
-		 "2:	movl %%edx,4(%2)\n"			\
-		 "3:"		\
-		 ".section .fixup,\"ax\"\n"\
-		 "4:	movl %3,%0\n"\
-		 "	jmp 3b\n"	\
-		 ".previous\n"	\
-		 _ASM_EXTABLE_UA(1b, 4b)\
-		 _ASM_EXTABLE_UA(2b, 4b)\
-		 : "=r" (err)	\
-		 : "A" (x), "r" (addr), "i" (errret), "0" (err))
+#define __put_user_goto_u64(x, addr, label)			\
+	asm volatile("\n"	\
+		 "1:	movl %%eax,0(%2)\n"		\
+		 "2:	movl %%edx,4(%2)\n"		\
+		 _ASM_EXTABLE_UA(1b, %2l)			\
+		 _ASM_EXTABLE_UA(2b, %2l)			\
+		 : : "A" (x), "r" (addr)			\
+		 : : label)
 
 #define __put_user_asm_ex_u64(x, addr)	\
 	asm volatile("\n"		\
@@ -216,8 +214,8 @@ __typeof__(__builtin_choose_expr(sizeof(x) > sizeof(0UL), 0ULL, 0UL))
 	asm volatile("call __put_user_8" : "=a" (__ret_pu)	\
 		 : "A" ((typeof(*(ptr)))(x)), "c" (ptr) : "ebx")
 #else
-#define __put_user_asm_u64(x, ptr, retval, errret) \
-	__put_user_asm(x, ptr, retval, "q", "", "er", errret)
+#define __put_user_goto_u64(x, ptr, label) \
+	__put_user_goto(x, ptr, "q", "", "er", label)
 #define __put_user_asm_ex_u64(x, addr)	\
 	__put_user_asm_ex(x, addr, "q", "", "er")
 #define __put_user_x8(x, ptr, __ret_pu) __put_user_x(8, x, ptr, __ret_pu)
@@ -278,23 +276,21 @@ extern void __put_user_8(void);
 	__builtin_expect(__ret_pu, 0);\
 })
 
-#define __put_user_size(x, ptr, size, retval, errret)			\
+#define __put_user_size(x, ptr, size, label)\
 do {	\
-	retval = 0;			\
 	__chk_user_ptr(ptr);		\
 	switch (size) {			\
 	case 1:\
-		__put_user_asm(x, ptr, retval, "b", "b", "iq", errret);	\
+		__put_user_goto(x, ptr, "b", "b", "iq", label);	\
 		br

Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-21 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 10:26 AM Andy Lutomirski  wrote:
>
> Can we maybe use this as an excuse to ask for some reasonable instructions to 
> access user memory?

I did that long ago. It's why we have CLAC/STAC today. I was told that
what I actually asked for (get an instruction to access user space - I
suggested using a segment override prefix) was not viable.

 Linus


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-21 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 9:27 AM Linus Torvalds
 wrote:
>
> It would be interesting to know exactly which copy it is that matters
> so much...  *inlining* the erms case might show that nicely in
> profiles.

Side note: the fact that Jens' patch (which I don't like in that form)
allegedly shrunk the resulting kernel binary would seem to indicate
that there's a *lot* of compile-time constant-sized memcpy calls that
we are missing, and that fall back to copy_user_generic().

It might be interesting to just change raw_copy_to/from_user() to
handle a lot more cases (in particular, handle cases where 'size' is
8-byte aligned). The special cases we *do* have may not be the right
ones (the 10-byte case in particular looks odd).

For example, instead of having a "if constant size is 8 bytes, do one
get/put_user()" case, we might have a "if constant size is < 64 just
unroll it into get/put_user()" calls.

  Linus


Re: [PATCH] x86: only use ERMS for user copies for larger sizes

2018-11-21 Thread Linus Torvalds
On Wed, Nov 21, 2018 at 5:45 AM Paolo Abeni  wrote:
>
> In my experiments 64 bytes was the break even point for all the CPUs I
> had handy, but I guess that may change with other models.

Note that experiments with memcpy speed are almost invariably broken.
microbenchmarks don't show the impact of I$, but they also don't show
the impact of _behavior_.

For example, there might be things like "repeat strings do cacheline
optimizations" that end up meaning that cachelines stay in L2, for
example, and are never brought into L1. That can be a really good
thing, but it can also mean that now the result isn't as close to the
CPU, and the subsequent use of the cacheline can be costlier.

I say "go for upping the limit to 128 bytes".

That said, if the aio user copy is _so_ critical that it's this
noticeable, there may be other issues. Often the _real_ cost of small
user copies is the STAC/CLAC, more so than the "rep movs".

It would be interesting to know exactly which copy it is that matters
so much...  *inlining* the erms case might show that nicely in
profiles.

   Linus


Re: [Patch v6 14/16] x86/speculation: Use STIBP to restrict speculation on non-dumpable task

2018-11-20 Thread Linus Torvalds
On Tue, Nov 20, 2018 at 4:33 PM Tim Chen  wrote:
>
> Implements arch_update_spec_restriction() for x86.  Use STIBP to
> restrict speculative execution when running a task set to non-dumpable,
> or clear the restriction if the task is set to dumpable.

I don't think this necessarily makes sense.

If the new "auto" behavior is that we aim to restrict untrusted code (and
the loader of such code uses prctl to set that flag), then this whole
"set STIBP for non-dumpable" makes little sense.

A non-dumpable app by definition is *more* trusted, not less trusted.

So this model of "let's disable prediction for system processes" not
only doesn't make sense, but it also unnecessarily penalizes those
potentially very important system processes.

Also, "dumpable" in general is pretty oddly defined to be used for this.

The same (privileged) process can be dumpable or not depending on how
it was started (ie if it was started by a regular user and became
trusted through suid, it's not dumpable, but if it was started from a
root process it remains dumpable).
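
To make the flag itself concrete, here is a minimal sketch using the existing prctl() options PR_GET_DUMPABLE and PR_SET_DUMPABLE. The wrapper names get_dumpable() and set_dumpable() are made up for this example. It only shows that the flag is per-process state that a process can flip on its own, whatever its privilege level.

```c
#include <sys/prctl.h>

/* Returns the calling process's dumpable flag, or -1 on error. */
static int get_dumpable(void)
{
	return prctl(PR_GET_DUMPABLE, 0UL, 0UL, 0UL, 0UL);
}

/* Sets the dumpable flag (0 or 1); returns 0 on success, -1 on error. */
static int set_dumpable(int dumpable)
{
	return prctl(PR_SET_DUMPABLE, (unsigned long)dumpable, 0UL, 0UL, 0UL);
}
```

An ordinary unprivileged process can call set_dumpable(0) and later set_dumpable(1), so the value says little about how trusted the code is.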

So I'm just not convinced "dumpability" is meaningful for STIBP.

  Linus


Re: STIBP by default.. Revert?

2018-11-18 Thread Linus Torvalds
On Sun, Nov 18, 2018 at 2:17 PM Jiri Kosina  wrote:
> Which gets us back to Tim's fixup patch. Do you still prefer the revert,
> given the existence of that?

I don't think the code needs to be reverted, but the *behavior* of
just unconditionally enabling STIBP needs to be reverted.

Because it was clearly way more expensive than people were told.

Linus


Linux 4.20-rc3

2018-11-18 Thread Linus Torvalds
PCI IDs for KabyLake and
CoffeeLake CPUs
  perf/x86/intel/uncore: Support CoffeeLake 8th CBOX

Laurent Pinchart (4):
  drm/omap: Populate DSS children in omapdss driver
  drm/omap: hdmi4: Ensure the device is active during bind
  drm/omap: dsi: Ensure the device is active during probe
  drm/omap: Move DISPC runtime PM handling to omapdrm

Linus Torvalds (1):
  Linux 4.20-rc3

Lionel Landwerlin (1):
  drm/i915: fix broadwell EU computation

Lukas Czerner (1):
  fuse: fix use-after-free in fuse_direct_IO()

Lyude Paul (2):
  drm/i915: Fix possible race in intel_dp_add_mst_connector()
  drm/i915: Fix NULL deref when re-enabling HPD IRQs on systems with MST

Maarten Lankhorst (1):
  drm/i915: Move programming plane scaler to its own function.

Maciej W. Rozycki (1):
  rtc: hctosys: Add missing range error reporting

Martin K. Petersen (1):
  Revert "scsi: ufs: Disable blk-mq for now"

Masanari Iida (1):
  scsi: qla2xxx: Fix a typo in MODULE_PARM_DESC

Masayoshi Mizuma (1):
  tools/testing/nvdimm: Fix the array size for dimm devices.

Max Filippov (2):
  xtensa: make sure bFLT stack is 16 byte aligned
  xtensa: fix boot parameters address translation

Michael Ellerman (6):
  powerpc/mm/64s: Consolidate SLB assertions
  powerpc/mm/64s: Use PPC_SLBFEE macro
  powerpc/mm/64s: Only use slbfee on CPUs that support it
  powerpc/mm/64s: Fix preempt warning in slb_allocate_kernel()
  powerpc/io: Fix the IO workarounds code to work with Radix
  selftests/powerpc: Fix wild_bctr test to work on ppc64

Michal Hocko (2):
  mm, memory_hotplug: check zone_movable in has_unmovable_pages
  mm, page_alloc: check for max order in hot path

Mika Kuoppala (1):
  drm/i915/icl: Drop spurious register read from icl_dbuf_slices_update

Mike Kravetz (1):
  hugetlbfs: fix kernel BUG at fs/hugetlbfs/inode.c:444!

Mike Rapoport (1):
  mm/gup.c: fix follow_page_mask() kerneldoc comment

Miklos Szeredi (2):
  fuse: fix leaked notify reply
  fuse: fix possibly missed wake-up after abort

Mimi Zohar (1):
  integrity: support new struct public_key_signature encoding field

Ming Lei (1):
  SCSI: fix queue cleanup race before queue initialization is done

Nicholas Piggin (1):
  powerpc/64: Fix kernel stack 16-byte alignment

Olof Johansson (3):
  RISC-V: lib: Fix build error for 64-bit
  RISC-V: Silence some module warnings on 32-bit
  kernel/sched/psi.c: simplify cgroup_move_task()

Omar Sandoval (1):
  kyber: fix wrong strlcpy() size in trace_kyber_latency()

Ondrej Mosnacek (1):
  selinux: check length properly in SCTP bind hook

Patrick Bellasi (1):
  sched/fair: Fix cpu_util_wake() for 'execl' type workloads

Paul Moore (1):
  selinux: fix non-MLS handling in mls_context_to_sid()

Philip Yang (1):
  drm/amdgpu: fix bug with IH ring setup

Prarit Bhargava (1):
  kdb: Use strscpy with destination buffer size

Quinn Tran (1):
  scsi: qla2xxx: Initialize port speed to avoid setting lower speed

Randy Dunlap (1):
  scripts/faddr2line: fix location of start_kernel in comment

Rex Zhu (1):
  drm/amd/pp: Fix truncated clock value when set watermark

Roman Gushchin (1):
  mm: don't reclaim inodes with many attached pages

Russell King (5):
  ARM: make lookup_processor_type() non-__init
  ARM: split out processor lookup
  ARM: clean up per-processor check_bugs method call
  ARM: add PROC_VTABLE and PROC_TABLE macros
  ARM: spectre-v2: per-CPU vtables to work around big.Little systems

Satheesh Rajendran (1):
  powerpc/numa: Suppress "VPHN is not supported" messages

Scott Mayhew (1):
  nfsd: COPY and CLONE operations require the saved filehandle to be set

Scott Wood (1):
  KVM: PPC: Move and undef TRACE_INCLUDE_PATH/FILE

Sean Paul (1):
  drm: Fix htmldocs warnings in drm_fourcc.c

Stanislav Lisovskiy (1):
  drm/dp_mst: Check if primary mstb is null

Sudeep Holla (1):
  dt-bindings: cpufreq: remove stale arm_big_little_dt entry

Tony Lindgren (1):
  drm/omap: dsi: Fix missing of_platform_depopulate()

Trond Myklebust (5):
  NFSv4: Don't exit the state manager without clearing NFS4CLNT_MANAGER_RUNNING
  NFSv4: Ensure that the state manager exits the loop on SIGKILL
  SUNRPC: Fix a Oops when destroying the RPCSEC_GSS credential cache
  SUNRPC: Fix a bogus get/put in generic_key_to_expire()
  NFSv4: Fix an Oops during delegation callbacks

Ulf Hansson (2):
  ARM: cpuidle: Don't register the driver when back-end init returns -ENXIO
  ARM: cpuidle: Convert to use cpuidle_register|unregister()

Uwe Kleine-König (1):
  scripts/spdxcheck.py: make python3 compliant

Vasily Averin (1):
  mm/swapfile.c: use kvzalloc for swap_info_struct allocation

Ville Syrjälä (3):
  drm/i915: Fix hpd handling for pins with two encoders
  drm/i915: Clean up skl_program_scaler()
 

Re: STIBP by default.. Revert?

2018-11-18 Thread Linus Torvalds
On Sun, Nov 18, 2018 at 1:49 PM Jiri Kosina  wrote:
>
> > So why do that STIBP slow-down by default when the people who *really*
> > care already disabled SMT?
>
> BTW for them, there is no impact at all.

Right. People who really care about security and are anal about it do
not see *any* advantage of the patch.

But people who aren't that worried suddenly see potentially huge slowdowns.

In other words, the behavior of the patch is essentially
exactly the reverse of what you'd want. You penalize the people who
don't even want it and don't care.

> STIBP is only activated on systems with HT on; plus odds are that people
> who don't care about spectrev2 already have 'nospectre_v2' on their
> command-line, so they are fine as well.

I'm talking about *normal* people. People who simply aren't all that
invested in this all. People who just want to get their work done.

> So, I think it's as theoretical as any other spectrev2 (only with the
> extra "HT" condition added on top).

What? No.

It's *way* more theoretical than something like meltdown, which could
be trivially used to get data from another protection domain.

Have you seen any actual realistic attacks for normal human users?
Things where the *kernel* should actually care?

The javascript thing is for the browser to fix up, not for the kernel
to say "now everything should run up to 50% slower".

   Linus


STIBP by default.. Revert?

2018-11-18 Thread Linus Torvalds
This was marked for stable, and honestly, nowhere in the discussion
did I see any mention of just *how* bad the performance impact of this
was.

When performance goes down by 50% on some loads, people need to start
asking themselves whether it was worth it. It's apparently better to
just disable SMT entirely, which is what security-conscious people do
anyway.

So why do that STIBP slow-down by default when the people who *really*
care already disabled SMT?

I think we should use the same logic as for L1TF: we default to
something that doesn't kill performance. Warn once about it, and let
the crazy people say "I'd rather take a 50% performance hit than
worry about a theoretical issue".

  Linus


Re: Oops: 0003 [#1] PREEMPT SMP NOPTI

2018-11-16 Thread Linus Torvalds
On Thu, Nov 15, 2018 at 8:29 PM Kyle Sanderson  wrote:
>
> 2008(!) dual-core Atom box.
> [1027541.963573] BUG: unable to handle kernel paging request at b0428a44
> [1027541.963647] IP: format_decode+0x20/0x3d0

The code decodes to:

   0:   55                      push   %rbp
   1:   48 8d 2e                lea    (%rsi),%rbp
   4:   53                      push   %rbx
   5:   48 8d 1f                lea    (%rdi),%rbx
   8:   48 8d 64 24 f8          lea    -0x8(%rsp),%rsp
   d:   0f b6 06                movzbl (%rsi),%eax
  10:   48 89 3c 24             mov    %rdi,(%rsp)
  14:   3c 01                   cmp    $0x1,%al
  16:   74 4c                   je     0x64
  18:   3c 02                   cmp    $0x2,%al
  1a:   0f 84 a2 01 00 00       je     0x1c2
  20:*  c6 06 00                movb   $0x0,(%rsi)      <-- trapping instruction
  23:   0f b6 07                movzbl (%rdi),%eax
  26:   84 c0                   test   %al,%al
  28:   0f 84 db 02 00 00       je     0x309

and that trapping instruction is, as far as I can tell, this one:

    /* By default */
    spec->type = FORMAT_TYPE_NONE;

and the fault seems to be a protection fault due to a write to a
read-only area (and yes, we *have* read from that 'spec' pointer
before that write).

> [1027541.965114] RIP: 0010:format_decode+0x20/0x3d0
> [1027541.965463] Call Trace:
> [1027541.965501]  vsnprintf+0x56/0x4d0

This is all very odd, because that "spec" pointer points to an
automatic variable on the stack of the vsnprintf() function, but we
have:

RSP: 9e8c02267ba0
RSI: b0428a44

so it looks like some completely crazy register state corruption.
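
The sanity check behind that conclusion can be sketched in a few lines. This is only an illustration: points_into_stack() is a made-up helper, the 16 KiB naturally aligned stack is an assumption about a typical x86-64 configuration, and the register values are used exactly as printed above.

```c
#include <stdint.h>

/* Assumption: 16 KiB, naturally aligned kernel stacks. */
#define SKETCH_THREAD_SIZE 0x4000ULL

/* A pointer to a live automatic variable should sit at or above the
 * current stack pointer and below the top of the same stack. */
static int points_into_stack(uint64_t rsp, uint64_t ptr)
{
	uint64_t top = (rsp & ~(SKETCH_THREAD_SIZE - 1)) + SKETCH_THREAD_SIZE;

	return ptr >= rsp && ptr < top;
}
```

With RSP = 9e8c02267ba0 and RSI = b0428a44 the check fails, which is why a 'spec' pointer in RSI cannot be the stack variable that vsnprintf() passed down.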

Is this repeatable at all? Do you see other random faults?

   Linus


Linux 4.20-rc2

2018-11-11 Thread Linus Torvalds
   drm/amd/display: Cleanup MST non-atomic code workaround
  drm/amd/display: Drop reusing drm connector for MST

Jin Yao (1):
  perf top: Display the LBR stats in callchain entry

Jiri Kosina (1):
  HID: moving to group maintainership model

Jiri Olsa (1):
  perf tools: Do not zero sample_id_all for group members

Jiri Slaby (1):
  netfilter: bridge: define INT_MIN & INT_MAX in userspace

Johannes Thumshirn (1):
  block: respect virtual boundary mask in bvecs

John Garry (1):
  of, numa: Validate some distance map rules

Jon Maloy (1):
  tipc: fix link re-establish failure

Jozsef Kadlecsik (2):
  netfilter: ipset: Correct rcu_dereference() call in ip_set_put_comment()
  netfilter: ipset: Fix calling ip_set() macro at dumping

Juergen Gross (3):
  x86/xen: fix pv boot
  xen: fix xen_qlock_wait()
  xen: remove size limit of privcmd-buf mapping interface

Julian Sax (1):
  HID: i2c-hid: add Direkt-Tek DTLAPY133-1 to descriptor override

Julian Wiedmann (6):
  s390/qeth: sanitize strings in debug messages
  s390/qeth: fix HiperSockets sniffer
  s390/qeth: unregister netdevice only when registered
  s390/qeth: fix initial operstate
  s390/qeth: sanitize ARP requests
  s390/qeth: report 25Gbit link speed

Juri Lelli (1):
  posix-cpu-timers: Remove useless call to check_dl_overrun()

Justin M. Forbes (1):
  s390/mm: Fix ERROR: "__node_distance" undefined!

Kai-Heng Feng (1):
  HID: i2c-hid: Add a small delay after sleep command for Raydium touchpanel

Keith Busch (1):
  block: Clear kernel memory before copying to user

Kirill A. Shutemov (3):
  x86/mm: Move LDT remap out of KASLR region on 5-level paging
  x86/ldt: Unmap PTEs for the slot before freeing LDT pages
  x86/ldt: Remove unused variable in map_ldt_struct()

Kuninori Morimoto (2):
  arm64: dts: renesas: r8a7795: add missing dma-names on hscif2
  sata_rcar: convert to SPDX identifiers

Leo Li (1):
  drm/amd: Update atom_smu_info_v3_3 structure

Leonard Crestez (1):
  ARM: dts: imx6sx-sdb: Fix enet phy regulator

Liam Merwick (1):
  xen/grant-table: Fix incorrect gnttab_dma_free_pages() pr_debug message

Linus Torvalds (1):
  Linux 4.20-rc2

Linus Walleij (1):
  HID: fix up .raw_event() documentation

Longhe Zheng (1):
  drm/i915/gvt: Handle values of EDP_PSR_IMR and EDP_PSR_IIR

Lu Fengqi (1):
  btrfs: fix pinned underflow after transaction aborted

Lucas Stach (1):
  drm/etnaviv: fix bogus fence complete check in timeout handler

Luis Henriques (2):
  ceph: add destination file data sync before doing any remote copy
  ceph: quota: fix null pointer dereference in quota check

Lyude Paul (1):
  drm/amd/amdgpu/dm: Fix dm_dp_create_fake_mst_encoder()

Maciej W. Rozycki (5):
  MIPS: Fix `dma_alloc_coherent' returning a non-coherent allocation
  FDDI: defza: Fix SPDX annotation
  FDDI: defza: Add missing comment closing
  FDDI: defza: Move SMT Tx data buffer declaration next to its skb
  FDDI: defza: Make the driver version string constant

Manasi Navare (1):
  drm/i915/icl: Fix the macros for DFLEXDPMLE register bits

Manjunath Patil (1):
  xen-blkfront: fix kernel panic with negotiate_mq error path

Martin Schwidefsky (5):
  mm: make the __PAGETABLE_PxD_FOLDED defines non-empty
  mm: introduce mm_[p4d|pud|pmd]_folded
  mm: add mm_pxd_folded checks to pgtable_bytes accounting functions
  s390/mm: fix mis-accounting of pgtable_bytes
  compiler: remove __no_sanitize_address_or_inline again

Masahiro Yamada (4):
  kbuild: rpm-pkg: fix binrpm-pkg breakage when O= is used
  kbuild: deb-pkg: fix bindeb-pkg breakage when O= is used
  kconfig: merge_config: avoid false positive matches from comment lines
  kbuild: deb-pkg: fix too low build version number

Masami Hiramatsu (1):
  tracing/kprobes: Fix strpbrk() argument order

Mathieu Malaterre (2):
  net: document skb parameter in function 'skb_gso_size_check'
  watchdog/core: Add missing prototypes for weak functions

Matwey V. Kornilov (1):
  net: core: netpoll: Enable netconsole IPv6 link local address

Md Fahad Iqbal Polash (1):
  ice: Fix flags for port VLAN

Michael Kelley (2):
  clockevents/drivers/i8253: Add support for PIT shutdown quirk
  x86/hyper-v: Enable PIT shutdown quirk

Michał Mirosław (2):
  ibmvnic: fix accelerated VLAN handling
  qlcnic: remove assumption that vlan_tci != 0

Miguel Ojeda (1):
  Compiler Attributes: improve explanation of header

Mikulas Patocka (1):
  vt: fix broken display when running aptitude

Ming Lei (3):
  block: make sure discard bio is aligned with logical block size
  block: cleanup __blkdev_issue_discard()
  block: make sure writesame bio is aligned with logical block size

Miroslav Lichvar (1):
  igb: shorten maximum PHC timecounter update interval

Nathan Chancellor (1)

Re: [GIT pull] scheduler fixes for 4.20

2018-11-11 Thread Linus Torvalds
On Sun, Nov 11, 2018 at 2:11 AM Thomas Gleixner  wrote:
>
>    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched-urgent-for-linus

Hmm. I get

   Already up to date.

with top commit being 993f0b0510da ("sched/topology: Fix off by one
bug") that I already merged earlier.

Did you forget to push out?

I did find the commits in the 'sched/urgent' branch, so I pulled that.

Linus


Re: [PULL REQUEST] i2c for 4.20

2018-11-10 Thread Linus Torvalds
On Sat, Nov 10, 2018 at 5:20 AM Wolfram Sang  wrote:
>
> I hope the rc1 rule still applies for this new driver. It consists of the I2C
> master part and UCSI client part which has the needed ack from Heikki.

I've pulled it, but please don't do this again.

The "New driver" rule has become pretty much historical, simply
because releases have been regular enough that it doesn't make
much sense, and we just haven't had the kinds of "basic support"
drivers that make sense pushing out as early as possible just to get
people able to use a machine at all. So in practice, the "new driver"
rule has become a "new ID or quirk" rule.

The nVidia i2c support doesn't sound so special that it couldn't have
waited until the merge window.

Linus


Re: [GIT PULL] Devicetree fixes for 4.20-rc, round 2

2018-11-09 Thread Linus Torvalds
On Fri, Nov 9, 2018 at 3:39 PM Rob Herring  wrote:
>
> Devicetree fixes for 4.20-rc:

This pull request should be getting an automated reply once I've
pushed my merge out (soon), so I won't be doing the manual "pulled"
ack emails any more.

If you don't see the automated reply, or you have any issues with it,
let me know (or just cut out the middle man and talk to Konstantin
directly, he's the mind behind the machine).

Linus


Re: [GIT PULL] Ceph fixes for 4.20-rc2

2018-11-09 Thread Linus Torvalds
On Fri, Nov 9, 2018 at 10:48 AM Ilya Dryomov  wrote:
>
> Two CephFS fixes (copy_file_range and quota) and a small feature bit
> cleanup.

.. I'm doing a few final manual "pulled" ack emails just to let people
know that I'll be stopping them, because Konstantin's pr-tracker-bot
automation should now be in place if the message is cc'd to lkml.

NOTE! It's currently only lkml, afaik, so other people who don't cc
lkml for their pulls won't get the automation, but I also won't be
looking at "which lists were cc'd", so I won't be doing the manual
ones.

And if you want to use lkml but don't want to see the automated "it
has been merged" emails, there's opt-out too. I know some people hate
getting automated emails.

 Linus


Re: [PATCH 3/3] lockdep: Use line-buffered printk() for lockdep messages.

2018-11-09 Thread Linus Torvalds
On Fri, Nov 9, 2018 at 12:12 AM Sergey Senozhatsky
 wrote:
>
> Dunno. I guess we still haven't heard from Linus because he did quite a good
> job setting up his 'email filters' ;)

Not filters, just long threads that I lurk on.

I don't actually care too much about this - the part I care about is
that when panics etc happen, things go out with a true best effort.

And "best effort" actually means "reality", not "theory". I don't care
one whit for some broken odd serial console that has a lock and
deadlocks if you get a panic just in the right place. I care about the
main printk/tty code doing the right thing, and avoiding the locks
with the scheduler and timers etc. So the timestamping and wakeup code
needing locks - or thinking you can delay things and print them out
later (when no later happens because you're panicking in an NMI) -
*that* is what I care deeply about.

Something like having a line buffering interface for random debugging
messages etc, I just don't get excited about. It just needs to be
simple enough and robust enough. You guys seem to be talking it out
ok.

 Linus


Re: [GIT PULL] nds32 new features and bug fix for 4.20

2018-11-09 Thread Linus Torvalds
On Fri, Nov 9, 2018 at 4:01 AM Greentime Hu  wrote:
>
> nds32 patches for 4.20

Much much too late for 4.20.

Send these the next merge window please.

 Linus


Re: [GIT PULL] s390 patches for 4.20 #2

2018-11-09 Thread Linus Torvalds
On Fri, Nov 9, 2018 at 1:14 AM Martin Schwidefsky
 wrote:
>
> s390 updates for 4.20-rc2

Pulled.

>  - A fix for the pgtable_bytes misaccounting on s390. The patch changes
>    common code part in regard to page table folding and adds extra
>    checks to mm_[inc|dec]_nr_[pmds|puds].

Ugh. This is somewhat invasive, I worry that some header include or
architecture doesn't pick up on the subtle __PAGETABLE_XYZ_FOLDED
things (if you don't get the includes, the mm_xyz_folded() macros will
be mis-defined).
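
The kind of hazard meant here can be shown with a small standalone sketch. The DEMO_ marker and both helper names are made up for this example and are not the real kernel definitions. The point is only that a helper which tests a marker gets whatever was visible where it was parsed, and nothing warns when the marker is missing.

```c
/* Hypothetical stand-in for a helper in a common header that tests
 * the marker at the point where the header is parsed. */
static int pmd_folded_seen_early(void)
{
#ifdef DEMO_PAGETABLE_PMD_FOLDED
	return 1;
#else
	return 0;	/* marker not visible yet: silently "not folded" */
#endif
}

/* The header that defines the marker only arrives here. */
#define DEMO_PAGETABLE_PMD_FOLDED 1

static int pmd_folded_seen_late(void)
{
#ifdef DEMO_PAGETABLE_PMD_FOLDED
	return 1;
#else
	return 0;
#endif
}
```

Both helpers compile cleanly, yet the first reports 0 and the second reports 1 for the same configuration.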

Has this been in linux-next or any other wide testing? The changes
aren't _new_, but...

 Linus


Re: [GIT PULL] xfs: fixes for v4.20-rc2

2018-11-08 Thread Linus Torvalds
On Thu, Nov 8, 2018 at 4:40 PM Darrick J. Wong  wrote:
>
> - fix incorrect dropping of error code from bmap
>
> - print buffer offsets instead of useless hashed pointers when dumping
>   corrupt metadata
>
> - fix integer overflow in attribute verifier

Pulled,

  Linus


Re: [GIT PULL] LED fixes for 4.20-rc2

2018-11-08 Thread Linus Torvalds
On Thu, Nov 8, 2018 at 1:29 PM Jacek Anaszewski
 wrote:
>
>
> All three fixes are related to the newly added pattern trigger:
>
> - remove mutex_lock() from timer callback, which would trigger problems
>   related to sleeping in atomic context, the removal is harmless since
>   mutex protection turned out to be redundant in this case
> - fix pattern parsing to properly handle intervals with brightness == 0
> - fix typos in the ABI documentation

Pulled,

Linus


Re: [GIT PULL] sound fixes for 4.20-rc2

2018-11-08 Thread Linus Torvalds
On Thu, Nov 8, 2018 at 10:05 AM Takashi Iwai  wrote:
>
> sound fixes for 4.20-rc2
>
> Two small regression fixes for HD-audio: one about vga_switcheroo and
> runtime PM, and another about Oops on some Thinkpads.

Pulled,

  Linus


Re: [GIT PULL] Compiler Attributes for v4.20-rc2

2018-11-08 Thread Linus Torvalds
On Thu, Nov 8, 2018 at 6:00 AM Miguel Ojeda
 wrote:
>
> A small patch for compiler-gcc.h and a trivial one for compiler_attributes.h.

Pulled,

Linus


Re: [GIT PULL] mtd: Fixes for 4.20-rc2

2018-11-08 Thread Linus Torvalds
On Thu, Nov 8, 2018 at 5:25 AM Boris Brezillon
 wrote:
>
> Here is the MTD fixes PR for 4.20-rc2.

Pulled,

Linus


Re: [GIT PULL] ARM: SoC fixes

2018-11-07 Thread Linus Torvalds
On Wed, Nov 7, 2018 at 9:10 AM Olof Johansson  wrote:
>
> ARM: SoC fixes

Pulled.

> I was a bit too trigger happy to enable PREEMPT on multi_v7_defconfig,
> and it ended up regressing at least BeagleBone XM boards.

Odd. Did it hit some "might_sleep()" test in a driver that is hidden by
preempt being off? Otherwise I don't see how/why preempt should fail
in a board-specific manner..

  Linus


Re: [GIT PULL] HID fixes

2018-11-07 Thread Linus Torvalds
On Wed, Nov 7, 2018 at 2:31 AM Jiri Kosina  wrote:
>
> HID subsystem fixes

Pulled,

   Linus


Re: [RFC][PATCH] tree-wide: Remove __inline__ and __inline usage

2018-11-06 Thread Linus Torvalds
On Tue, Nov 6, 2018 at 11:42 AM Peter Zijlstra  wrote:
>
> Do you want me to do that patch, or have you already just done it?

I'd rather see it go through something like -tip than doing it myself
directly, and get at least some of the automated testing before
unleashing it on an unsuspecting world.

I don't think it will affect much, but there *could* be situations
where there are some crusty __inline__ users that actually want
__always_inline behavior.

And we *may* actually have cases where we want the "let compiler make
a judgement call" behavior, so with this change people will have that
as an option. But yes, in the short term, it has the possibility of
regressions due to missed inlining.

Linus


Re: [RFC][PATCH] tree-wide: Remove __inline__ and __inline usage

2018-11-06 Thread Linus Torvalds
On Tue, Nov 6, 2018 at 2:02 AM Peter Zijlstra  wrote:
>
> Therefore I'm proposing to run:
>
>   git grep -l "\<__inline\(\|__\)\>" | while read file
>   do
>     sed -i -e 's/\<__inline\(\|__\)\>/inline/g' $file
>   done
>
> On your current tree, and apply the below fixup patch on top of that
> result.

So I started doing this, and in fact fixed up a few more issues by
hand on top of your patch, but then realized that it's somewhat
dangerous and possibly broken.

For the uapi header files in particular, __inline__ may actually be
required. Depending on use, and compiler settings, "inline" can be a
word reserved for the user, and shouldn't be used by system headers.

Now, several uapi headers obviously *do* use "inline", and I think in
this day and age that's fine, but I don't actually want to break
possible valid uses.

So I'd argue that we don't actually want to get rid of "__inline__" at
all, because we may need it.

But we *could* get rid of these two lines in include/linux/compiler_types.h

  #define __inline__ inline
  #define __inline   inline

and just say that "inline" for the kernel means "always_inline", but
if you use __inline__ or __inline then you get the "raw" compiler
inlining.
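
A small sketch of what that split would look like from the user's side, assuming GCC or Clang. The macro name sketch_kernel_inline is made up here and stands in for what plain "inline" would mean inside the kernel; it is not the real definition in compiler_types.h.

```c
/* Hypothetical name for the kernel meaning of "inline": inlining is
 * forced through the always_inline attribute. */
#define sketch_kernel_inline inline __attribute__((__always_inline__))

/* Forced inline: the compiler has no say. */
static sketch_kernel_inline int add_forced(int a, int b)
{
	return a + b;
}

/* Raw compiler keyword: the compiler makes the judgement call. */
static __inline__ int add_hint(int a, int b)
{
	return a + b;
}
```

Both functions compute the same thing; the only difference is in code generation, where the second one may end up as an out-of-line call.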

Then people can decide to get rid of __inline__ on a case-by-case basis.

 Linus


Re: [GIT PULL] tracing/kprobes: Fix strpbrk() argument order

2018-11-06 Thread Linus Torvalds
On Tue, Nov 6, 2018 at 6:12 AM Steven Rostedt  wrote:
>
> Masami found a slight bug in his code where he transposed the arguments of a
> call to strpbrk.

Pulled,

  Linus

