Re: [Xen-devel] [PATCH 00/13] mmu_notifier kill invalidate_page callback

2017-08-29 Thread Linus Torvalds
On Tue, Aug 29, 2017 at 4:54 PM, Jérôme Glisse  wrote:
>
> Note this is barely tested. I intend to do more testing of next few days
> but i do not have access to all hardware that make use of the mmu_notifier
> API.

Thanks for doing this.

> First 2 patches convert existing call of mmu_notifier_invalidate_page()
> to mmu_notifier_invalidate_range() and bracket those call with call to
> mmu_notifier_invalidate_range_start()/end().

Ok, those two patches are a bit more complex than I was hoping for,
but not *too* bad.

And the final end result certainly looks nice:

>  16 files changed, 74 insertions(+), 214 deletions(-)

Yeah, removing all those invalidate_page() notifiers certainly makes
for a nice patch.

And I actually think you missed some more lines that can now be
removed: kvm_arch_mmu_notifier_invalidate_page() should no longer be
needed either, so you can remove all of those too (most of them are
empty inline functions, but x86 has one that actually does something.

So there's an added 30 or so dead lines that should be removed in the
kvm patch, I think.

But from a _very_ quick read-through this looks fine. But it obviously
needs testing.

People - *especially* the people who saw issues under KVM - can you
try out Jérôme's patch-series? I aded some people to the cc, the full
series is on lkml. Jérôme - do you have a git branch for people to
test that they could easily pull and try out?

Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization

2017-08-24 Thread Linus Torvalds
On Thu, Aug 24, 2017 at 2:13 PM, Thomas Garnier  wrote:
>
> My original performance testing was done with an Ubuntu generic
> configuration. This configuration has the CONFIG_FUNCTION_TRACER
> option which was incompatible with PIE. The tracer failed to replace
> the __fentry__ call by a nop slide on each traceable function because
> the instruction was not the one expected. If PIE is enabled, gcc
> generates a difference call instruction based on the GOT without
> checking the visibility options (basically call *__fentry__@GOTPCREL).

Gah.

Don't we actually have *more* address bits for randomization at the
low end, rather than getting rid of -mcmodel=kernel?

Has anybody looked at just moving kernel text by smaller values than
the page size? Yeah, yeah, the kernel has several sections that need
page alignment, but I think we could relocate normal text by just the
cacheline size, and that sounds like it would give several bits of
randomness with little downside.

Or has somebody already looked at it and I just missed it?

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] x86: enable RCU based table free when PARAVIRT

2017-08-23 Thread Linus Torvalds
On Wed, Aug 23, 2017 at 3:36 PM, Kirill A. Shutemov
 wrote:
>
> Below is test cases that allocates a lot of page tables and measuare
> fork/exit time. (I'm not entirely sure it's the best way to stress the
> codepath.)

Looks ok to me. Doing a profile (without the RCU freeing, obviously) gives me

   0.77%  a.out[kernel.vmlinux]  [k] free_pgd_range


  ▒

so it does seem to spend time in the page directory code.

> Unpatched:  average 4.8322s, stddev 0.114s
> Patched:average 4.8362s, stddev 0.111s

Ok, I vote for avoiding the complexity of two different behaviors, and
just making the page table freeing use RCU unconditionally.

If actively trying to trigger that code doesn't show a real measurable
difference, I don't think it matters, and the fewer different code
paths we have, the better.

  Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] x86: enable RCU based table free when PARAVIRT

2017-08-23 Thread Linus Torvalds
On Wed, Aug 23, 2017 at 12:59 PM, Kirill A. Shutemov
 wrote:
>
> In this case we need performance numbers for !PARAVIRT kernel.

Yes.

> Numbers for tight loop of "mmap(MAP_POPULATE); munmap()" might be
> interesting too for worst case scenario.

Actually, I don't think you want to populate all the pages. You just
want to populate *one* page, in order to build up the page directory
structure, not allocate all the final points.

And we only free the actual page tables when there is nothing around,
so it should be at least a 2MB-aligned region etc.

So you should do a *big* allocation, and then touch a single page in
the middle, and then minmap it - that should give you maximal page
table activity. Otherwise the page tables will generally just stay
around.

Realistically, it's mainly exit() that frees page tables. Yes, you may
have a few page tables free'd by a normal munmap(), but it's usually
very limited. Which is why I suggested that script-heavy thing with
lots of small executables. That tends to be the main realistic load
that really causes a ton of page directory activity.

  Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] x86: enable RCU based table free when PARAVIRT

2017-08-23 Thread Linus Torvalds
On Wed, Aug 23, 2017 at 6:45 AM, Vitaly Kuznetsov  wrote:
>
> Solve the issue by enabling RCU-based table free mechanism when PARAVIRT
> is selected in config. Testing with kernbench doesn't show any notable
> performance impact:

I wonder if we should just make it unconditional if it doesn't really
show any performance difference. One less config complexity to worry
about (and in this case I'm not so much worried about Kconfig itself,
as just "oh, you have totally different paths in the core VM depending
on PARAVIRT".

That said, the thing to test for these kinds of things is often
heavily scripted loads that just run thousands and thousands of really
small processes, and build up and tear down page tables all the time
because of fork/exit.

The load I've used occasionally is just "make test" in the git source
tree. Tons and tons of trivial fork/exec/exit things for all those
small tests and shell scripts.

I think 'kernbench' just does kernel compiles. Which is not very
kernel or VM intensive at all. It's mostly just user mode compilers in
parallel.

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [GIT PULL] xen: features and fixes for 4.13-rc2

2017-07-21 Thread Linus Torvalds
On Fri, Jul 21, 2017 at 3:17 AM, Juergen Gross  wrote:
>  drivers/xen/pvcalls-back.c | 1236 
> 

This really doesn't look like a fix.

The merge window is over.

So I'm not pulling this without way more explanations of why I should.

 Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Lots of new warnings with gcc-7.1.1

2017-07-11 Thread Linus Torvalds
On Tue, Jul 11, 2017 at 8:17 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> If that's the case, I'd prefer just turning off the format-truncation
> (but not overflow) warning with '-Wno-format-trunction".

Doing

 KBUILD_CFLAGS  += $(call cc-disable-warning, format-truncation)

in the main Makefile certainly cuts down on the warnings.

We still have some overflow warnings, including the crazy one where
gcc doesn't see that the number of max7315 boards is very limited.

But those could easily be converted to just snprintf() instead, and
then the truncation warning disabling takes care of it. Maybe that's
the right answer.

We also have about a bazillion

warning: ‘*’ in boolean context, suggest ‘&&’ instead

warnings in drivers/ata/libata-core.c, all due to a single macro that
uses a pattern that gcc-7.1.1 doesn't like. The warning looks a bit
debatable, but I suspect the macro could easily be changed too.

Tejun, would you hate just moving the "multiply by 1000" part _into_
that EZ() macro? Something like the attached (UNTESTED!) patch?

  Linus
 drivers/ata/libata-core.c | 20 ++--
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index 8453f9a4682f..4c7d5a138495 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -3231,19 +3231,19 @@ static const struct ata_timing ata_timing[] = {
 };
 
 #define ENOUGH(v, unit)(((v)-1)/(unit)+1)
-#define EZ(v, unit)((v)?ENOUGH(v, unit):0)
+#define EZ(v, unit)((v)?ENOUGH((v)*1000, unit):0)
 
 static void ata_timing_quantize(const struct ata_timing *t, struct ata_timing 
*q, int T, int UT)
 {
-   q->setup= EZ(t->setup  * 1000,  T);
-   q->act8b= EZ(t->act8b  * 1000,  T);
-   q->rec8b= EZ(t->rec8b  * 1000,  T);
-   q->cyc8b= EZ(t->cyc8b  * 1000,  T);
-   q->active   = EZ(t->active * 1000,  T);
-   q->recover  = EZ(t->recover* 1000,  T);
-   q->dmack_hold   = EZ(t->dmack_hold * 1000,  T);
-   q->cycle= EZ(t->cycle  * 1000,  T);
-   q->udma = EZ(t->udma   * 1000, UT);
+   q->setup= EZ(t->setup,  T);
+   q->act8b= EZ(t->act8b,  T);
+   q->rec8b= EZ(t->rec8b,  T);
+   q->cyc8b= EZ(t->cyc8b,  T);
+   q->active   = EZ(t->active, T);
+   q->recover  = EZ(t->recover,T);
+   q->dmack_hold   = EZ(t->dmack_hold, T);
+   q->cycle= EZ(t->cycle,  T);
+   q->udma = EZ(t->udma,   UT);
 }
 
 void ata_timing_merge(const struct ata_timing *a, const struct ata_timing *b,
___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] Lots of new warnings with gcc-7.1.1

2017-07-11 Thread Linus Torvalds
On Tue, Jul 11, 2017 at 8:10 PM, Guenter Roeck  wrote:
>
> The hwmon warnings are all about supporting no more than 9,999 sensors
> (applesmc) to 999,999,999 sensors (scpi) of a given type.

Yeah, I think that's enough.

> Easy "fix" would be to replace snprintf() with scnprintf(), presumably
> because gcc doesn't know about scnprintf().

If that's the case, I'd prefer just turning off the format-truncation
(but not overflow) warning with '-Wno-format-trunction".

But maybe we can at least start it on a subsystem-by-subsystem basis
after people have verified their own subsusystem?

  Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


[Xen-devel] Lots of new warnings with gcc-7.1.1

2017-07-11 Thread Linus Torvalds
[ Very random list of maintainers and mailing lists, at least
partially by number of warnings generated by gcc-7.1.1 that is then
correlated with the get_maintainers script ]

So I upgraded one of my boxes to F26, which upgraded the compiler to gcc-7.1.1

Which in turn means that my nice clean allmodconfig compile is not an
unholy mess of annoying new warnings.

Normally I hate the stupid new warnings, but this time around they are
actually exactly the kinds of warnings you'd want to see and that are
hard for humans to pick out errors: lots of format errors wrt limited
buffer sizes.

At the same time, many of them *are* annoying. We have various limited
buffers that are limited for a good reason, and some of the format
truncation warnings are about numbers in the range {0-MAX_INT], where
we definitely know that we don't need to worry about the really big
ones.

After all, we're using "snprintf()" for a reason - we *want* to
truncate if the buffer is too small.

But a lot of the warnings look reasonable, and at least the warnings
are nice in how they actually explain why the warning is happening.
Example:

  arch/x86/platform/intel-mid/device_libs/platform_max7315.c: In
function ‘max7315_platform_data’:
  arch/x86/platform/intel-mid/device_libs/platform_max7315.c:41:35:
warning: ‘%d’ directive writing between 1 and 11 bytes into a region
of size 9 [-Wformat-overflow=]
 sprintf(base_pin_name, "max7315_%d_base", nr);
 ^~
  arch/x86/platform/intel-mid/device_libs/platform_max7315.c:41:26:
note: directive argument in the range [-2147483647, 2147483647]

Yeah, the compiler is technically correct, but we already made sure we
have at most MAX7315_NUM of those adapters, so no, "nr" is really not
going to be a 10-digit number.

So the warning is kind of bogus.

At the same time, others aren't quite as insane, and in many cases the
warnings might be easy to just fix.

And some actually look valid, although they might still require odd input:

  net/bluetooth/smp.c: In function ‘le_max_key_size_read’:
  net/bluetooth/smp.c:3372:29: warning: ‘snprintf’ output may be
truncated before the last format character [-Wformat-truncation=]
snprintf(buf, sizeof(buf), "%2u\n", SMP_DEV(hdev)->max_key_size);
   ^~~
  net/bluetooth/smp.c:3372:2: note: ‘snprintf’ output between 4 and 5
bytes into a destination of size 4

yeah, "max_key_size" is unsigned char, but if it's larger than 99 it
really does need 5 bytes for "%2u\n" with the terminating NUL
character.

Of course, the "%2d" implies that people expect it to be < 100, but at
the same time it doesn't sound like a bad idea to just make the buffer
be one byte bigger. So..

Anyway, it would be lovely if some of the more affected developers
would take a look at gcc-7.1.1 warnings. Right now I get about three
*thousand* lines of warnings from a "make allmodconfig" build, which
makes them a bit overwhelming.

I do suspect I'll make "-Wformat-truncation" (as opposed to
"-Wformat-overflow") be a "V=1" kind of warning.  But let's see how
many of these we can fix, ok?

  Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [GIT PULL] xen: xenfs fixes for 4.9-rc2

2016-10-24 Thread Linus Torvalds
On Mon, Oct 24, 2016 at 9:37 AM, David Vrabel  wrote:
>
> I think the changes are trivial and uncontroversial.

Hmm. Sadly, they are also buggy.

This:

if (files->mode & S_IFLNK) {

is simply wrong. The correct test for S_IFLNK is to do

if ((files->mode & S_IFMT) == S_IFLNK) {

and quite frankly, the right model is almost certainly to just do a
switch-statement that does something like

switch (files->mode & S_IFMT) {
case S_IFLNK:
...
case S_IFREG:
case 0:

default:
..error..

because maybe somebody wants to add other cases later (and even if
not, it's just wrong to randomly change any other mode into S_IFREG).

And while I could easily do an evil merge and fix that part up, I
really don't want to do things like that. So I'm not going to pull
this.

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [GIT PULL] xen: features and fixes for 4.8-rc0

2016-07-27 Thread Linus Torvalds
On Wed, Jul 27, 2016 at 4:09 PM, Rafael J. Wysocki  wrote:
>
> The STAO definition document:
>
> http://wiki.xenproject.org/mediawiki/images/0/02/Status-override-table.pdf
>
> requires as to "operate as if that device does not exist", quite literally.

Well, first off, documentation is one thing, actually changing
behavior is something entirely different.

Theory and practice are *not* the same.

The other worry I have is that I'd be happier if it's still visible in
/sys/bus/acpi/ etc. Again, it's one thing to not react to it
programmatically, and another thing entirely to actually hide the
information from the rest of the system.

If I read that patch right, it will be hidden from sysfs too. But
Maybe I'm mistaken.

 Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [GIT PULL] xen: features and fixes for 4.8-rc0

2016-07-27 Thread Linus Torvalds
On Wed, Jul 27, 2016 at 6:45 AM, David Vrabel  wrote:
>
> Shannon Zhao (16):
>   Xen: ACPI: Hide UART used by Xen

So this caused a trivial conflict. No biggie, it wasn't bad and the
patch was acked by Rafael. However, looking at it made me somewhat
unhappy.

Should the device entry in ACPI really be hidden unconditionally? In
particular, if we are *not* running under virtualization, it sounds
wrong to hide it.

Comments? Am I missing something?

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


Re: [Xen-devel] [GIT PULL] xen: bug fixes for 4.7-rc0

2016-06-22 Thread Linus Torvalds
Having upgraded one of my machines to F24, I get a few new warnings
during the kernel compile due to a new compiler.

Some of them are just annoying and wrong, but one of them points to a
real Xen bug:

  arch/x86/xen/mmu.c:1116:57: warning: array subscript is above array
bounds [-Warray-bounds]
for (; vaddr <= vaddr_end && (pmd < (level2_kernel_pgt + PAGE_SIZE));
~~~^~

because that is definitely completely wrong.

Yes, level2_kernel_pgt is one page in size, but it only has 512
entries, because each entry is 8 bytes.

So that "+ PAGE_SIZE" is entirely bogus. Either it should be

  (level2_kernel_pgt + 512)

or it should be

  ((void *)level2_kernel_pgt + PAGE_SIZE)

but as it stands now it's a bug.

This harkens back to 2012, commit 7f9140626c757 ("xen/mmu: Copy and
revector the P2M tree").

It may be that we end up never actually crossing the pmd boundary
anyway due to vaddr limit - I didn't check. But please fix the code
regardless.

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v5 4/9] x86/traps: Enable all exception handler callbacks early

2016-04-03 Thread Linus Torvalds
On Sun, Apr 3, 2016 at 3:07 AM, Borislav Petkov  wrote:
>
> I'm wondering whether making it try to EFAULT correctly is the right
> thing to do... We're certainly more conservative if we panic and not
> allow some silently failed attempt at recovery which looks successful,
> to continue.

No, please don't fail at early boot.

Early boot is just about the *worst* situation to try to debug odd
failures, exactly since things like printk may not be reliable, and
things won't get logged etc.

So particularly during early boot we should try as hard as possible
not to crash - even if it means not being able to log about a problem.
At least that way you have a hopefully working machine and can *maybe*
debug things.

Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v5 0/9] Improve non-"safe" MSR access failure handling

2016-04-02 Thread Linus Torvalds
On Sat, Apr 2, 2016 at 10:13 AM, Andy Lutomirski  wrote:
>
> I also tried a bad wrmsrl at a couple early points.  Very very early
> it just works with not warning.  A little later and it prints the
> warning.

Ok, that sounds like the correct behavior - I'm sure the very very
early ones "warned" too, it just got dropped on the floor due to being
before the printing/logging was up and running.

And even then, it's obviously much better than having machines
mysteriously just reboot.

Thanks,
 Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v5 0/9] Improve non-"safe" MSR access failure handling

2016-04-02 Thread Linus Torvalds
This patch series looks much nicer than the last one. I assume you
tested that the early-trap handling actually worked too? I only looked
at the patches..

Ack to it all,

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4 2/5] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2016-03-14 Thread Linus Torvalds
On Mon, Mar 14, 2016 at 11:24 AM, Andy Lutomirski  wrote:
>
> The code in my queue is, literally:
>
> bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
>  struct pt_regs *regs, int trapnr)
> {
> WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x",
>   (unsigned int)regs->cx);
>
> /* Pretend that the read succeeded and returned 0. */
> regs->ip = ex_fixup_addr(fixup);
> regs->ax = 0;
> regs->dx = 0;
> return true;
> }
> EXPORT_SYMBOL(ex_handler_rdmsr_unsafe);

I guess I can live with this, as long as we also extend the
early-fault handling to work with the special exception handlers.

And as long as people start understanding that killing the machine is
a bad bad bad thing. It's a debugging nightmare.

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4 2/5] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2016-03-14 Thread Linus Torvalds
On Mon, Mar 14, 2016 at 11:10 AM, Andy Lutomirski  wrote:
>
> A couple of the wrmsr users actually care about performance.  These
> are the ones involved in context switching and, to a lesser extent, in
> switching in and out of guest mode.

.. ok, see the crossed emails.

I'd *much* rather special case the special cases. Not make the generic
case something complex.

And *particularly* not make the generic case be something where people
think it's sane to oops and kill the machine. Christ.

  Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4 2/5] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2016-03-14 Thread Linus Torvalds
On Mon, Mar 14, 2016 at 11:04 AM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> None of this insane complicated crap that buys us exactly *nothing*,
> and depends on fancy new exception handling support etc etc.

Actually, the one _new_ thing we could do is to instead of removing
the old crappy rdmsr()/wrmsr() implementation entirely, we'll just
rename it to "rd/wrmsr_unsafe()", and not have any exception table
stuff for that at all (so now you *will* get an oops and panic if you
use that).

The only reason to do that is for performance-critical MSR's that
really don't want any overhead. Sure, they could just be changed to
use "wrmsr_safe()", but for things like the APIC accesses, or
update_debugctlmsr() (that is supposed to check for processor version)
that can be truly critical, an explicitly _unsafe_ version may be the
right thing to do.

The fact is, the problem with rd/wrmsr is that we just did the
defaults the wrong way around. Making the simple and obvious version
be unsafe is just wrong.

  Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4 2/5] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2016-03-14 Thread Linus Torvalds
On Mon, Mar 14, 2016 at 10:17 AM, Andy Lutomirski  wrote:
>
> So yes, let's please warn.  I'm okay with removing the panic_on_oops
> thing though.  (But if anyone suggests that we should stop OOPSing on
> bad kernel page faults, I *will* fight back.)

Bad kernel page faults are something completely different. They would
be actual bugs regardless.

The MSR thing has *often* been just silly "this CPU doesn't do that
MSR". Generic bootup setup code etc that just didn't know or care
about the particular badly documented rule for one particular random
CPU version and stepping.

In fact, when I say "often", I suspect I should really just say
"always". I don't think we've ever found a case where oopsing would
have been the right thing. But it has definitely caused lots of
problems, especially in the early paths where your code doesn't even
work right now.

Now, when it comes to the warning, I guess I could live with it, but I
think it's stupid to make this a low-level exception handler thing.

So what I think should be done:

 - make sure that wr/rdmsr_safe() actually works during very early
init. At some point it didn't.

 - get rid of the current wrmsr/rdmsr *entirely*. It's shit.

 - Add this wrapper:

  #define wrmsr(msr, low, high) \
WARN_ON_ONCE(wrmsr_safe(msr, low, high))

and be done with it. We could even decide to make that WARN_ON_ONCE()
be something we could configure out, because it's really a debugging
thing and isn't like it should be all that fatal.

None of this insane complicated crap that buys us exactly *nothing*,
and depends on fancy new exception handling support etc etc.

So what's the downside to just doing this simple thing?

  Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v4 2/5] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2016-03-14 Thread Linus Torvalds
On Mar 14, 2016 10:05 AM, "Andy Lutomirski"  wrote:
>
> We could probably remove that check and let custom fixups run early.
> I don't see any compelling reason to keep them disabled.  That should
> probably be a separate change, though.

Or we could just use the existing wrmsr_safe() code and not add this new
special code at all.

Look, why are you doing this? We should get rid of the difference between
wrmsr and wrmsr_safe(), not make it bigger.

Just make everything safe. There has never in the history of anything been
an advantage to making things oops and to making things more complicated.

Why is rd/wrmsr() so magically important?

Linus
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v3 3/5] x86/paravirt: Add paravirt_{read, write}_msr

2016-03-14 Thread Linus Torvalds
On Mar 14, 2016 9:53 AM, "Andy Lutomirski"  wrote:
>
> Can you clarify?  KVM uses the native version, and the native version
> only oopses with this series applied if panic_on_oops is set.

Can we please remove that idiocy?

There is no reason to panic whatsoever. Seriously. What's the upside of
that logic?

Linus
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-27 Thread Linus Torvalds
On Jan 26, 2016 23:51, "Peter Zijlstra"  wrote:
>
> So for a moment it looked like MIPS wanted to equal or surpass Alpha in
> this respect.

If there is an architecture that I'd expect to try to take the "sucks more"
crown, MIPS would be it. They've already done the "worst cache award"
thing, and are proud members of the "stupid branch delay slot" crowd. MIPS
historically even did the "delayed load slot".

The only mistake they've never done, AFAIK, is to have a rotating register
file.

At the same time, the "load-to-dependent-store" thing you really have to
*work* at doing wrong. It's not enough to just have bad taste and
incompetent architects. You really have to spend real effort to screw up
that badly. It's really hard to do by mistake.

So I suspect that even MIPS can't get it wrong.

   Linus
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 9:22 AM, Peter Zijlstra  wrote:
>
> This is distinct from:

That may be distinct, but:

> struct foo *x = READ_ONCE(*ptr);
> smp_read_barrier_depends();
> x->bar = 5;

This case is complete BS. Stop perpetuating it. I already removed a
number of bogus cases of it, and I removed the incorrect documentation
that had this crap.

It's called "smp_READ_barrier_depends()" for a reason.

Alpha is the only one that needs it, and alpha needs it only for
dependent READS.

It's not called smp_read_write_barrier_depends(). It's not called
"smp_mb_depends()". It's a weaker form of "smp_rmb()", nothing else.

So alpha does have an implied dependency chain from a read to a
subsequent dependent write, and does not need any extra barriers.

Alpha does *not* have a dependency chain from a read to a subsequent
read, which is why we need that horrible crappy
smp_read_barrier_depends(). But it's the only reason.

This is the alpha reference manual wrt read-to-write dependency:

  5.6.1.7 Definition of Dependence Constraint

The depends relation (DP) is defined as follows. Given u and v
issued by processor Pi, where u
is a read or an instruction fetch and v is a write, u precedes v
in DP order (written u DP v, that
is, v depends on u) in either of the following situations:

 • u determines the execution of v, the location accessed by v, or
the value written by v.
 • u determines the execution or address or value of another
memory access z that precedes

v or might precede v (that is, would precede v in some execution
path depending
on the value read by u) by processor issue constraint (see Section 5.6.1.3).

Note that the dependence barrier honors not only control flow, but
address and data values too.  This is a different syntax than we use,
but 'u' is the READ_ONCE, and 'v' is the write. Any data, address or
conditional dependency between the two implies an ordering.

So no, "smp_read_barrier_depends()" is *ONLY* about two reads, where
the second read is data-dependent on the first. Nothing else.

So if you _ever_ see a "smp_read_barrier_depends()" that isn't about a
barrier between two reads, then that is a bug.

The above code is crap.  It's exactly as much crap as

   a = READ_ONCE(x);
   smp_rmb();
   WRITE_ONCE(b, y);

because a "rmb()" simply doesn't have anything to do with
read-vs-subsequent-write ordering.

 Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 12:10 PM, Paul E. McKenney
<paul...@linux.vnet.ibm.com> wrote:
> On Tue, Jan 26, 2016 at 11:44:46AM -0800, Linus Torvalds wrote:
>>
>> > struct foo *x = READ_ONCE(*ptr);
>> > smp_read_barrier_depends();
>> > x->bar = 5;
>>
>> This case is complete BS. Stop perpetuating it. I already removed a
>> number of bogus cases of it, and I removed the incorrect documentation
>> that had this crap.
>
> If I understand your objection correctly, you want the above pattern
> expressed either like this:
>
> struct foo *x = rcu_dereference(*ptr);
> x->bar = 5;
>
> Or like this:
>
> struct foo *x = lockless_dereference(*ptr);
> x->bar = 5;
>
> Or am I missing your point?

You are entirely missing the point.

You might as well just write it as

struct foo x = READ_ONCE(*ptr);
x->bar = 5;

because that "smp_read_barrier_depends()" does NOTHING wrt the second write.

So what I am saying is simple: anybody who writes that
"smp_read_barrier_depends()" in there is just ttoally and completely
WRONG, and the fact that Peter wrote it out after I removed several
instances of that bloody f*cking idiocy is disturbing.

Don't do it. It's BS. It's wrong. Don't make excuses for it.

 Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> You might as well just write it as
>
> struct foo x = READ_ONCE(*ptr);
> x->bar = 5;
>
> because that "smp_read_barrier_depends()" does NOTHING wrt the second write.

Just to clarify: on alpha it adds a memory barrier, but that memory
barrier is useless.

On non-alpha, it is a no-op, and obviously does nothing simply because
it generates no code.

So if anybody believes that the "smp_read_barrier_depends()" does
something, they are *wrong*.

And if anybody sends out an email with that smp_read_barrier_depends()
in an example, they are actively just confusing other people, which is
even worse than just being wrong. Which is why I jumped in.

So stop perpetuating the myth that smp_read_barrier_depends() does
something here. It does not. It's a bug, and it has become this "mind
virus" for some people that seem to believe that it does something.

I had to remove this crap once from the kernel already, see commit
105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and
atomic*_read_ctrl()").

I don't want to ever see that broken construct again. And I want to
make sure that everybody is educated about how broken it was. I'm
extremely unhappy that it came up again.

If it turns out that some architecture does actually need a barrier
between a read and a dependent write, then that will mean that

 (a) we'll have to make up a _new_ barrier, because
"smp_read_barrier_depends()" is not that barrier. We'll presumably
then have to make that new barrier part of "rcu_derefence()" and
friends.

 (b) we will have found an architecture with even worse memory
ordering semantics than alpha, and we'll have to stop castigating
alpha for being the worst memory ordering ever.

but I sincerely hope that we'll never find that kind of broken architecture.

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [v3,11/41] mips: reuse asm-generic/barrier.h

2016-01-26 Thread Linus Torvalds
On Tue, Jan 26, 2016 at 3:29 PM, Paul E. McKenney
 wrote:
>
> No trailing data-dependent read, so agreed, no smp_read_barrier_depends()
> needed.  That said, I believe that we should encourage rcu_dereference*()
> or lockless_dereference() instead of READ_ONCE() for documentation
> reasons, though.

I agree that that is likely the right thing to do in pretty much all situations.

In theory, there might be performance situations where we'd want to
actively avoid the smp_read_barrier_depends() inherent in those, but
considering that it's only a performance issue on alpha, and we
probably have all of two or three people using Linux on alpha, it's a
pretty theoretical performance worry.

  Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-21 Thread Linus Torvalds
On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnar  wrote:
>
> Linus, what's your preference?

So quite frankly, is there any reason we don't just implement
native_read_msr() as just

   unsigned long long native_read_msr(unsigned int msr)
   {
  int err;
  unsigned long long val;

  val = native_read_msr_safe(msr, );
  WARN_ON_ONCE(err);
  return val;
   }

Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be
done with it. I don't see the downside.

How many msr reads are so critical that the function call
overhead would matter? Get rid of the inline version of the _safe()
thing too, and put that thing there too.

  Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-21 Thread Linus Torvalds
On Mon, Sep 21, 2015 at 9:49 AM, Arjan van de Ven  wrote:
>>
>> How many msr reads are so critical that the function call
>> overhead would matter?
>
> if anything qualifies it'd be switch_to() and friends.

Is there anything else than the FS/GS_BASE thing (possibly hidden
behind inlines etc that I didn't get from a quick grep)? And why is
that sometimes using the "safe" version (in do_arch_prctl()), and
sometimes not (switch_to())?

I'm not convinced that mess is a good argument for the status quo ;)

> note that I'm not entirely happy about the notion of "safe" MSRs.
> They're safe as in "won't fault".

I wouldn't object to renaming them.

Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-21 Thread Linus Torvalds
On Mon, Sep 21, 2015 at 11:16 AM, Andy Lutomirski  wrote:
>
> In the interest of sanity, I want to drop the "native_", too

Yes. I think the only reason it exists is to have that wrapper layer
for PV. And that argument just goes away if you just make the
non-inline helper function do all the PV logic directly.

I really suspect we should do this for a *lot* of the PV ops. Yeah,
some are so performance-critical that we probably do have a good
reason for the inline indirections etc (historical example: native
spin-unlock, which traditionally could be done as a single store
instruction), but I suspect a lot of the PV indirection is for this
kind of "historical wrapper model" reason, and it often makes it
really hard to see what is going on because you have to go through
several layers of indirection, often in different files.

  Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops

2015-09-20 Thread Linus Torvalds
On Sun, Sep 20, 2015 at 5:02 PM, Andy Lutomirski  wrote:
> This demotes an OOPS and likely panic due to a failed non-"safe" MSR
> access to a WARN_ON_ONCE and a return of zero (in the RDMSR case).
> We still write a pr_info entry unconditionally for debugging.

No, this is wrong.

If you really want to do something like this, then just make all MSR
reads safe. So the only difference between "safe" and "unsafe" is that
the unsafe version just doesn't check the return value, and silently
just returns zero for reads (or writes nothing).

To quote Obi-Wan: "Use the exception table, Luke".

Because decoding instructions is just too ugly. We'll do it for CPU
errata where we might have to do it for user space code too (ie the
AMD prefetch mess), but for code that _we_ control? Hell no.

So NAK on this.

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] Patch x86/ldt: Make modify_ldt synchronous has been added to the 4.1-stable tree

2015-08-14 Thread Linus Torvalds
On Fri, Aug 14, 2015 at 12:16 AM, Ingo Molnar mi...@kernel.org wrote:

 I just sent it to Linus, so barring a catastrophy it should be upstream soon.

Well, I just rejected that pull as completely broken, so I guess the
catastrophy happened... Or, alternatively, I mis-read the code, which
does happen, but my giant ego tells me that's very rare..

Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg

2015-04-29 Thread Linus Torvalds
On Wed, Apr 29, 2015 at 11:11 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote:
 In the pv_scan_next() function, the slow cmpxchg atomic operation is
 performed even if the other CPU is not even close to being halted. This
 extra cmpxchg can harm slowpath performance.

 This patch introduces the new mayhalt flag to indicate if the other
 spinning CPU is close to being halted or not. The current threshold
 for x86 is 2k cpu_relax() calls. If this flag is not set, the other
 spinning CPU will have at least 2k more cpu_relax() calls before
 it can enter the halt state. This should give enough time for the
 setting of the locked flag in struct mcs_spinlock to propagate to
 that CPU without using atomic op.

 Yuck! I'm not at all sure you can make assumptions like that. And the
 worst part is, if it goes wrong the borkage is subtle and painful.\

I have to agree with Peter.

But it goes beyond this particular patch. Patterns like this:

   xchg(pn-mayhalt, true);

are just evil and disgusting. Even befoe this patch, that code had

(void)xchg(pn-state, vcpu_halted);

which is *wrong* and should never be done.

If you want it to be set_mb() (which sets a value and has a memory
barrier), then use set_mb(). Yes, it happens to use a xchg() to do
so, but dammit, it documents that whole this is a memory barrier in
the name.

Also, anybody who does this should damn well document why the memory
barrier is needed. The xchg(pn-state, vcpu_halted) at least is
preceded by a comment about the barriers. The new mayhalt has no sane
comment in it, and the reason seems to be that no sane comment is
possible. The xchg() seems to be some black magic thing.

Let's not introduce magic stuff in our locking primitives. At least
not undocumented magic that makes no sense.

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [RFC PATCH] x86/asm/irq: Don't use POPF but STI

2015-04-21 Thread Linus Torvalds
On Tue, Apr 21, 2015 at 5:45 AM, Ingo Molnar mi...@kernel.org wrote:

 Totally untested and not signed off yet: because we'd first have to
 make sure (via irq flags debugging) that it's not used in reverse, to
 re-disable interrupts:

Not only might that happen in some place, I *really* doubt that a
conditional 'sti' is actually any faster. The only way it's going to
be measurably faster is if you run some microbenchmark so that the
code is hot and the branch predicts well.

popf is fast for the no changes to IF case, and is a smaller
instruction anyway. I'd really hate to make this any more complex
unless somebody has some real numbers for performance improvement
(that is *not* just some cycle timing from a bogus test-case, but real
measurements on a real load).

And even *with* real measurements, I'd worry about the use popf to
clear IF case.

   Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] NUMA_BALANCING and Xen PV guest regression in 3.20-rc0

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 5:05 PM, Kirill A. Shutemov
kir...@shutemov.name wrote:

 I'm feeling I miss very basic background on how Xen works, but why does it
 set _PAGE_GLOBAL on userspace entries? It sounds strange to me.

It is definitely strange. I'm guessing that it's some ancient Xen hack
for the early Intel virtualization that used to have absolutely
horrendous vmenter/exit costs, including very much the TLB overhead. \

These days, Intel has address space identifiers, and doesn't flush the
whole TLB on VM entry/exit, so it's probably pointless to play games
with the global bit.

I get the feeling that a lot of Xen stuff is that kind of legacy
hacks that should just be cleaned up, but nobody has the energy or
the interest.  There was the whole odd crazy SHARED_KERNEL_PMD hackery
too.

 Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] NUMA_BALANCING and Xen PV guest regression in 3.20-rc0

2015-02-19 Thread Linus Torvalds
On Thu, Feb 19, 2015 at 5:06 AM, David Vrabel david.vra...@citrix.com wrote:

 The NUMA_BALANCING series beginning with 5d833062139d (mm: numa: do not
 dereference pmd outside of the lock during NUMA hinting fault) and
 specifically 8a0516ed8b90 (mm: convert p[te|md]_numa users to
 p[te|md]_protnone_numa) breaks Xen 64-bit PV guests.

 Any fault on a present userspace mapping (e.g., a write to a read-only
 mapping) is being misinterpreted as a NUMA hinting fault and not handled
 correctly.  All userspace programs end up continuously  faulting.

 This is because the hypervisor sets _PAGE_GLOBAL (== _PAGE_PROTNONE) on
 all present userspace page table entries.

That's some crazy stuff, but whatever. The patch is clearly good. Applied,

 Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-11 Thread Linus Torvalds
On Feb 11, 2015 3:15 PM, Jeremy Fitzhardinge jer...@goop.org wrote:

 Right now it needs to be a locked operation to prevent read-reordering.
 x86 memory ordering rules state that all writes are seen in a globally
 consistent order, and are globally ordered wrt reads *on the same
 addresses*, but reads to different addresses can be reordered wrt to
writes.

The modern x86 rules are actually much tighter than that.

Every store is a release, and every load is an acquire. So a non-atomic
store is actually a perfectly fine unlock. All preceding stores will be
seen by other cpu's before the unlock, and while reads can pass stores,
they only pass *earlier* stores.

For *taking* a lock you need an atomic access, because otherwise loads
inside the locked region could bleed out to before the store that takes the
lock.

 Linus
___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH] x86 spinlock: Fix memory corruption on completing completions

2015-02-09 Thread Linus Torvalds
On Mon, Feb 9, 2015 at 4:02 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Mon, Feb 09, 2015 at 03:04:22PM +0530, Raghavendra K T wrote:
 So we have 3 choices,
 1. xadd
 2. continue with current approach.
 3. a read before unlock and also after that.

 For the truly paranoid we have probe_kernel_address(), suppose the lock
 was in module space and the module just got unloaded under us.

That's much too expensive.

The xadd shouldn't be noticeably more expensive than the current
add_smp(). Yes, lock xadd used to be several cycles slower than
just lock add on some early cores, but I think these days it's down
to a single-cycle difference, which is not really different from doing
a separate load after the add.

The real problem with xadd used to be that we always had to do magic
special-casing for i386, but that's one of the reasons we dropped
support for original 80386.

So I think Raghavendra's last version (which hopefully fixes the
lockup problem that Sasha reported) together with changing that

add_smp(lock-tickets.head, TICKET_LOCK_INC);
if (READ_ONCE(lock-tickets.tail)  TICKET_SLOWPATH_FLAG) ..

into something like

val = xadd((lock-ticket.head_tail, TICKET_LOCK_INC  TICKET_SHIFT);
if (unlikely(val  TICKET_SLOWPATH_FLAG)) ...

would be the right thing to do. Somebody should just check that I got
that shift right, and that the tail is in the high bytes (head really
needs to be high to work, if it's in the low byte(s) the xadd would
overflow from head into tail which would be wrong).

 Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] linux-next: missing merge fix patch for the merge of the xen-tip tree with the arm-soc tree

2014-12-22 Thread Linus Torvalds
On Mon, Dec 22, 2014 at 6:26 PM, Stephen Rothwell s...@canb.auug.org.au wrote:
 Hi Linus,

 I have been carrying this merge fix patch for some time that is now
 needed in your tree:

No, it's not. Look more closely.

My merge put the

dev-archdata.dma_coherent = coherent;

line at the top of the function, like the way it was in the original commit.

Linus

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel