Re: [Xen-devel] [PATCH 00/13] mmu_notifier kill invalidate_page callback
On Tue, Aug 29, 2017 at 4:54 PM, Jérôme Glissewrote: > > Note this is barely tested. I intend to do more testing of next few days > but i do not have access to all hardware that make use of the mmu_notifier > API. Thanks for doing this. > First 2 patches convert existing call of mmu_notifier_invalidate_page() > to mmu_notifier_invalidate_range() and bracket those call with call to > mmu_notifier_invalidate_range_start()/end(). Ok, those two patches are a bit more complex than I was hoping for, but not *too* bad. And the final end result certainly looks nice: > 16 files changed, 74 insertions(+), 214 deletions(-) Yeah, removing all those invalidate_page() notifiers certainly makes for a nice patch. And I actually think you missed some more lines that can now be removed: kvm_arch_mmu_notifier_invalidate_page() should no longer be needed either, so you can remove all of those too (most of them are empty inline functions, but x86 has one that actually does something. So there's an added 30 or so dead lines that should be removed in the kvm patch, I think. But from a _very_ quick read-through this looks fine. But it obviously needs testing. People - *especially* the people who saw issues under KVM - can you try out Jérôme's patch-series? I aded some people to the cc, the full series is on lkml. Jérôme - do you have a git branch for people to test that they could easily pull and try out? Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] x86: PIE support and option to extend KASLR randomization
On Thu, Aug 24, 2017 at 2:13 PM, Thomas Garnierwrote: > > My original performance testing was done with an Ubuntu generic > configuration. This configuration has the CONFIG_FUNCTION_TRACER > option which was incompatible with PIE. The tracer failed to replace > the __fentry__ call by a nop slide on each traceable function because > the instruction was not the one expected. If PIE is enabled, gcc > generates a difference call instruction based on the GOT without > checking the visibility options (basically call *__fentry__@GOTPCREL). Gah. Don't we actually have *more* address bits for randomization at the low end, rather than getting rid of -mcmodel=kernel? Has anybody looked at just moving kernel text by smaller values than the page size? Yeah, yeah, the kernel has several sections that need page alignment, but I think we could relocate normal text by just the cacheline size, and that sounds like it would give several bits of randomness with little downside. Or has somebody already looked at it and I just missed it? Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] x86: enable RCU based table free when PARAVIRT
On Wed, Aug 23, 2017 at 3:36 PM, Kirill A. Shutemovwrote: > > Below is test cases that allocates a lot of page tables and measuare > fork/exit time. (I'm not entirely sure it's the best way to stress the > codepath.) Looks ok to me. Doing a profile (without the RCU freeing, obviously) gives me 0.77% a.out[kernel.vmlinux] [k] free_pgd_range ▒ so it does seem to spend time in the page directory code. > Unpatched: average 4.8322s, stddev 0.114s > Patched:average 4.8362s, stddev 0.111s Ok, I vote for avoiding the complexity of two different behaviors, and just making the page table freeing use RCU unconditionally. If actively trying to trigger that code doesn't show a real measurable difference, I don't think it matters, and the fewer different code paths we have, the better. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] x86: enable RCU based table free when PARAVIRT
On Wed, Aug 23, 2017 at 12:59 PM, Kirill A. Shutemovwrote: > > In this case we need performance numbers for !PARAVIRT kernel. Yes. > Numbers for tight loop of "mmap(MAP_POPULATE); munmap()" might be > interesting too for worst case scenario. Actually, I don't think you want to populate all the pages. You just want to populate *one* page, in order to build up the page directory structure, not allocate all the final points. And we only free the actual page tables when there is nothing around, so it should be at least a 2MB-aligned region etc. So you should do a *big* allocation, and then touch a single page in the middle, and then minmap it - that should give you maximal page table activity. Otherwise the page tables will generally just stay around. Realistically, it's mainly exit() that frees page tables. Yes, you may have a few page tables free'd by a normal munmap(), but it's usually very limited. Which is why I suggested that script-heavy thing with lots of small executables. That tends to be the main realistic load that really causes a ton of page directory activity. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] x86: enable RCU based table free when PARAVIRT
On Wed, Aug 23, 2017 at 6:45 AM, Vitaly Kuznetsovwrote: > > Solve the issue by enabling RCU-based table free mechanism when PARAVIRT > is selected in config. Testing with kernbench doesn't show any notable > performance impact: I wonder if we should just make it unconditional if it doesn't really show any performance difference. One less config complexity to worry about (and in this case I'm not so much worried about Kconfig itself, as just "oh, you have totally different paths in the core VM depending on PARAVIRT". That said, the thing to test for these kinds of things is often heavily scripted loads that just run thousands and thousands of really small processes, and build up and tear down page tables all the time because of fork/exit. The load I've used occasionally is just "make test" in the git source tree. Tons and tons of trivial fork/exec/exit things for all those small tests and shell scripts. I think 'kernbench' just does kernel compiles. Which is not very kernel or VM intensive at all. It's mostly just user mode compilers in parallel. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [GIT PULL] xen: features and fixes for 4.13-rc2
On Fri, Jul 21, 2017 at 3:17 AM, Juergen Grosswrote: > drivers/xen/pvcalls-back.c | 1236 > This really doesn't look like a fix. The merge window is over. So I'm not pulling this without way more explanations of why I should. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] Lots of new warnings with gcc-7.1.1
On Tue, Jul 11, 2017 at 8:17 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > If that's the case, I'd prefer just turning off the format-truncation > (but not overflow) warning with '-Wno-format-trunction". Doing KBUILD_CFLAGS += $(call cc-disable-warning, format-truncation) in the main Makefile certainly cuts down on the warnings. We still have some overflow warnings, including the crazy one where gcc doesn't see that the number of max7315 boards is very limited. But those could easily be converted to just snprintf() instead, and then the truncation warning disabling takes care of it. Maybe that's the right answer. We also have about a bazillion warning: ‘*’ in boolean context, suggest ‘&&’ instead warnings in drivers/ata/libata-core.c, all due to a single macro that uses a pattern that gcc-7.1.1 doesn't like. The warning looks a bit debatable, but I suspect the macro could easily be changed too. Tejun, would you hate just moving the "multiply by 1000" part _into_ that EZ() macro? Something like the attached (UNTESTED!) patch? Linus drivers/ata/libata-core.c | 20 ++-- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c index 8453f9a4682f..4c7d5a138495 100644 --- a/drivers/ata/libata-core.c +++ b/drivers/ata/libata-core.c @@ -3231,19 +3231,19 @@ static const struct ata_timing ata_timing[] = { }; #define ENOUGH(v, unit)(((v)-1)/(unit)+1) -#define EZ(v, unit)((v)?ENOUGH(v, unit):0) +#define EZ(v, unit)((v)?ENOUGH((v)*1000, unit):0) static void ata_timing_quantize(const struct ata_timing *t, struct ata_timing *q, int T, int UT) { - q->setup= EZ(t->setup * 1000, T); - q->act8b= EZ(t->act8b * 1000, T); - q->rec8b= EZ(t->rec8b * 1000, T); - q->cyc8b= EZ(t->cyc8b * 1000, T); - q->active = EZ(t->active * 1000, T); - q->recover = EZ(t->recover* 1000, T); - q->dmack_hold = EZ(t->dmack_hold * 1000, T); - q->cycle= EZ(t->cycle * 1000, T); - q->udma = EZ(t->udma * 1000, UT); + q->setup= EZ(t->setup, T); + q->act8b= EZ(t->act8b, T); + q->rec8b= EZ(t->rec8b, T); + q->cyc8b= EZ(t->cyc8b, T); + q->active = EZ(t->active, T); + q->recover = EZ(t->recover,T); + q->dmack_hold = EZ(t->dmack_hold, T); + q->cycle= EZ(t->cycle, T); + q->udma = EZ(t->udma, UT); } void ata_timing_merge(const struct ata_timing *a, const struct ata_timing *b, ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] Lots of new warnings with gcc-7.1.1
On Tue, Jul 11, 2017 at 8:10 PM, Guenter Roeckwrote: > > The hwmon warnings are all about supporting no more than 9,999 sensors > (applesmc) to 999,999,999 sensors (scpi) of a given type. Yeah, I think that's enough. > Easy "fix" would be to replace snprintf() with scnprintf(), presumably > because gcc doesn't know about scnprintf(). If that's the case, I'd prefer just turning off the format-truncation (but not overflow) warning with '-Wno-format-trunction". But maybe we can at least start it on a subsystem-by-subsystem basis after people have verified their own subsusystem? Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
[Xen-devel] Lots of new warnings with gcc-7.1.1
[ Very random list of maintainers and mailing lists, at least partially by number of warnings generated by gcc-7.1.1 that is then correlated with the get_maintainers script ] So I upgraded one of my boxes to F26, which upgraded the compiler to gcc-7.1.1 Which in turn means that my nice clean allmodconfig compile is not an unholy mess of annoying new warnings. Normally I hate the stupid new warnings, but this time around they are actually exactly the kinds of warnings you'd want to see and that are hard for humans to pick out errors: lots of format errors wrt limited buffer sizes. At the same time, many of them *are* annoying. We have various limited buffers that are limited for a good reason, and some of the format truncation warnings are about numbers in the range {0-MAX_INT], where we definitely know that we don't need to worry about the really big ones. After all, we're using "snprintf()" for a reason - we *want* to truncate if the buffer is too small. But a lot of the warnings look reasonable, and at least the warnings are nice in how they actually explain why the warning is happening. Example: arch/x86/platform/intel-mid/device_libs/platform_max7315.c: In function ‘max7315_platform_data’: arch/x86/platform/intel-mid/device_libs/platform_max7315.c:41:35: warning: ‘%d’ directive writing between 1 and 11 bytes into a region of size 9 [-Wformat-overflow=] sprintf(base_pin_name, "max7315_%d_base", nr); ^~ arch/x86/platform/intel-mid/device_libs/platform_max7315.c:41:26: note: directive argument in the range [-2147483647, 2147483647] Yeah, the compiler is technically correct, but we already made sure we have at most MAX7315_NUM of those adapters, so no, "nr" is really not going to be a 10-digit number. So the warning is kind of bogus. At the same time, others aren't quite as insane, and in many cases the warnings might be easy to just fix. And some actually look valid, although they might still require odd input: net/bluetooth/smp.c: In function ‘le_max_key_size_read’: net/bluetooth/smp.c:3372:29: warning: ‘snprintf’ output may be truncated before the last format character [-Wformat-truncation=] snprintf(buf, sizeof(buf), "%2u\n", SMP_DEV(hdev)->max_key_size); ^~~ net/bluetooth/smp.c:3372:2: note: ‘snprintf’ output between 4 and 5 bytes into a destination of size 4 yeah, "max_key_size" is unsigned char, but if it's larger than 99 it really does need 5 bytes for "%2u\n" with the terminating NUL character. Of course, the "%2d" implies that people expect it to be < 100, but at the same time it doesn't sound like a bad idea to just make the buffer be one byte bigger. So.. Anyway, it would be lovely if some of the more affected developers would take a look at gcc-7.1.1 warnings. Right now I get about three *thousand* lines of warnings from a "make allmodconfig" build, which makes them a bit overwhelming. I do suspect I'll make "-Wformat-truncation" (as opposed to "-Wformat-overflow") be a "V=1" kind of warning. But let's see how many of these we can fix, ok? Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [GIT PULL] xen: xenfs fixes for 4.9-rc2
On Mon, Oct 24, 2016 at 9:37 AM, David Vrabelwrote: > > I think the changes are trivial and uncontroversial. Hmm. Sadly, they are also buggy. This: if (files->mode & S_IFLNK) { is simply wrong. The correct test for S_IFLNK is to do if ((files->mode & S_IFMT) == S_IFLNK) { and quite frankly, the right model is almost certainly to just do a switch-statement that does something like switch (files->mode & S_IFMT) { case S_IFLNK: ... case S_IFREG: case 0: default: ..error.. because maybe somebody wants to add other cases later (and even if not, it's just wrong to randomly change any other mode into S_IFREG). And while I could easily do an evil merge and fix that part up, I really don't want to do things like that. So I'm not going to pull this. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [GIT PULL] xen: features and fixes for 4.8-rc0
On Wed, Jul 27, 2016 at 4:09 PM, Rafael J. Wysockiwrote: > > The STAO definition document: > > http://wiki.xenproject.org/mediawiki/images/0/02/Status-override-table.pdf > > requires as to "operate as if that device does not exist", quite literally. Well, first off, documentation is one thing, actually changing behavior is something entirely different. Theory and practice are *not* the same. The other worry I have is that I'd be happier if it's still visible in /sys/bus/acpi/ etc. Again, it's one thing to not react to it programmatically, and another thing entirely to actually hide the information from the rest of the system. If I read that patch right, it will be hidden from sysfs too. But Maybe I'm mistaken. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [GIT PULL] xen: features and fixes for 4.8-rc0
On Wed, Jul 27, 2016 at 6:45 AM, David Vrabelwrote: > > Shannon Zhao (16): > Xen: ACPI: Hide UART used by Xen So this caused a trivial conflict. No biggie, it wasn't bad and the patch was acked by Rafael. However, looking at it made me somewhat unhappy. Should the device entry in ACPI really be hidden unconditionally? In particular, if we are *not* running under virtualization, it sounds wrong to hide it. Comments? Am I missing something? Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel
Re: [Xen-devel] [GIT PULL] xen: bug fixes for 4.7-rc0
Having upgraded one of my machines to F24, I get a few new warnings during the kernel compile due to a new compiler. Some of them are just annoying and wrong, but one of them points to a real Xen bug: arch/x86/xen/mmu.c:1116:57: warning: array subscript is above array bounds [-Warray-bounds] for (; vaddr <= vaddr_end && (pmd < (level2_kernel_pgt + PAGE_SIZE)); ~~~^~ because that is definitely completely wrong. Yes, level2_kernel_pgt is one page in size, but it only has 512 entries, because each entry is 8 bytes. So that "+ PAGE_SIZE" is entirely bogus. Either it should be (level2_kernel_pgt + 512) or it should be ((void *)level2_kernel_pgt + PAGE_SIZE) but as it stands now it's a bug. This harkens back to 2012, commit 7f9140626c757 ("xen/mmu: Copy and revector the P2M tree"). It may be that we end up never actually crossing the pmd boundary anyway due to vaddr limit - I didn't check. But please fix the code regardless. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v5 4/9] x86/traps: Enable all exception handler callbacks early
On Sun, Apr 3, 2016 at 3:07 AM, Borislav Petkovwrote: > > I'm wondering whether making it try to EFAULT correctly is the right > thing to do... We're certainly more conservative if we panic and not > allow some silently failed attempt at recovery which looks successful, > to continue. No, please don't fail at early boot. Early boot is just about the *worst* situation to try to debug odd failures, exactly since things like printk may not be reliable, and things won't get logged etc. So particularly during early boot we should try as hard as possible not to crash - even if it means not being able to log about a problem. At least that way you have a hopefully working machine and can *maybe* debug things. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v5 0/9] Improve non-"safe" MSR access failure handling
On Sat, Apr 2, 2016 at 10:13 AM, Andy Lutomirskiwrote: > > I also tried a bad wrmsrl at a couple early points. Very very early > it just works with not warning. A little later and it prints the > warning. Ok, that sounds like the correct behavior - I'm sure the very very early ones "warned" too, it just got dropped on the floor due to being before the printing/logging was up and running. And even then, it's obviously much better than having machines mysteriously just reboot. Thanks, Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v5 0/9] Improve non-"safe" MSR access failure handling
This patch series looks much nicer than the last one. I assume you tested that the early-trap handling actually worked too? I only looked at the patches.. Ack to it all, Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4 2/5] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mon, Mar 14, 2016 at 11:24 AM, Andy Lutomirskiwrote: > > The code in my queue is, literally: > > bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup, > struct pt_regs *regs, int trapnr) > { > WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x", > (unsigned int)regs->cx); > > /* Pretend that the read succeeded and returned 0. */ > regs->ip = ex_fixup_addr(fixup); > regs->ax = 0; > regs->dx = 0; > return true; > } > EXPORT_SYMBOL(ex_handler_rdmsr_unsafe); I guess I can live with this, as long as we also extend the early-fault handling to work with the special exception handlers. And as long as people start understanding that killing the machine is a bad bad bad thing. It's a debugging nightmare. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4 2/5] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mon, Mar 14, 2016 at 11:10 AM, Andy Lutomirskiwrote: > > A couple of the wrmsr users actually care about performance. These > are the ones involved in context switching and, to a lesser extent, in > switching in and out of guest mode. .. ok, see the crossed emails. I'd *much* rather special case the special cases. Not make the generic case something complex. And *particularly* not make the generic case be something where people think it's sane to oops and kill the machine. Christ. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4 2/5] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mon, Mar 14, 2016 at 11:04 AM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > None of this insane complicated crap that buys us exactly *nothing*, > and depends on fancy new exception handling support etc etc. Actually, the one _new_ thing we could do is to instead of removing the old crappy rdmsr()/wrmsr() implementation entirely, we'll just rename it to "rd/wrmsr_unsafe()", and not have any exception table stuff for that at all (so now you *will* get an oops and panic if you use that). The only reason to do that is for performance-critical MSR's that really don't want any overhead. Sure, they could just be changed to use "wrmsr_safe()", but for things like the APIC accesses, or update_debugctlmsr() (that is supposed to check for processor version) that can be truly critical, an explicitly _unsafe_ version may be the right thing to do. The fact is, the problem with rd/wrmsr is that we just did the defaults the wrong way around. Making the simple and obvious version be unsafe is just wrong. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4 2/5] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mon, Mar 14, 2016 at 10:17 AM, Andy Lutomirskiwrote: > > So yes, let's please warn. I'm okay with removing the panic_on_oops > thing though. (But if anyone suggests that we should stop OOPSing on > bad kernel page faults, I *will* fight back.) Bad kernel page faults are something completely different. They would be actual bugs regardless. The MSR thing has *often* been just silly "this CPU doesn't do that MSR". Generic bootup setup code etc that just didn't know or care about the particular badly documented rule for one particular random CPU version and stepping. In fact, when I say "often", I suspect I should really just say "always". I don't think we've ever found a case where oopsing would have been the right thing. But it has definitely caused lots of problems, especially in the early paths where your code doesn't even work right now. Now, when it comes to the warning, I guess I could live with it, but I think it's stupid to make this a low-level exception handler thing. So what I think should be done: - make sure that wr/rdmsr_safe() actually works during very early init. At some point it didn't. - get rid of the current wrmsr/rdmsr *entirely*. It's shit. - Add this wrapper: #define wrmsr(msr, low, high) \ WARN_ON_ONCE(wrmsr_safe(msr, low, high)) and be done with it. We could even decide to make that WARN_ON_ONCE() be something we could configure out, because it's really a debugging thing and isn't like it should be all that fatal. None of this insane complicated crap that buys us exactly *nothing*, and depends on fancy new exception handling support etc etc. So what's the downside to just doing this simple thing? Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v4 2/5] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mar 14, 2016 10:05 AM, "Andy Lutomirski"wrote: > > We could probably remove that check and let custom fixups run early. > I don't see any compelling reason to keep them disabled. That should > probably be a separate change, though. Or we could just use the existing wrmsr_safe() code and not add this new special code at all. Look, why are you doing this? We should get rid of the difference between wrmsr and wrmsr_safe(), not make it bigger. Just make everything safe. There has never in the history of anything been an advantage to making things oops and to making things more complicated. Why is rd/wrmsr() so magically important? Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v3 3/5] x86/paravirt: Add paravirt_{read, write}_msr
On Mar 14, 2016 9:53 AM, "Andy Lutomirski"wrote: > > Can you clarify? KVM uses the native version, and the native version > only oopses with this series applied if panic_on_oops is set. Can we please remove that idiocy? There is no reason to panic whatsoever. Seriously. What's the upside of that logic? Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v3,11/41] mips: reuse asm-generic/barrier.h
On Jan 26, 2016 23:51, "Peter Zijlstra"wrote: > > So for a moment it looked like MIPS wanted to equal or surpass Alpha in > this respect. If there is an architecture that I'd expect to try to take the "sucks more" crown, MIPS would be it. They've already done the "worst cache award" thing, and are proud members of the "stupid branch delay slot" crowd. MIPS historically even did the "delayed load slot". The only mistake they've never done, AFAIK, is to have a rotating register file. At the same time, the "load-to-dependent-store" thing you really have to *work* at doing wrong. It's not enough to just have bad taste and incompetent architects. You really have to spend real effort to screw up that badly. It's really hard to do by mistake. So I suspect that even MIPS can't get it wrong. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v3,11/41] mips: reuse asm-generic/barrier.h
On Tue, Jan 26, 2016 at 9:22 AM, Peter Zijlstrawrote: > > This is distinct from: That may be distinct, but: > struct foo *x = READ_ONCE(*ptr); > smp_read_barrier_depends(); > x->bar = 5; This case is complete BS. Stop perpetuating it. I already removed a number of bogus cases of it, and I removed the incorrect documentation that had this crap. It's called "smp_READ_barrier_depends()" for a reason. Alpha is the only one that needs it, and alpha needs it only for dependent READS. It's not called smp_read_write_barrier_depends(). It's not called "smp_mb_depends()". It's a weaker form of "smp_rmb()", nothing else. So alpha does have an implied dependency chain from a read to a subsequent dependent write, and does not need any extra barriers. Alpha does *not* have a dependency chain from a read to a subsequent read, which is why we need that horrible crappy smp_read_barrier_depends(). But it's the only reason. This is the alpha reference manual wrt read-to-write dependency: 5.6.1.7 Definition of Dependence Constraint The depends relation (DP) is defined as follows. Given u and v issued by processor Pi, where u is a read or an instruction fetch and v is a write, u precedes v in DP order (written u DP v, that is, v depends on u) in either of the following situations: • u determines the execution of v, the location accessed by v, or the value written by v. • u determines the execution or address or value of another memory access z that precedes v or might precede v (that is, would precede v in some execution path depending on the value read by u) by processor issue constraint (see Section 5.6.1.3). Note that the dependence barrier honors not only control flow, but address and data values too. This is a different syntax than we use, but 'u' is the READ_ONCE, and 'v' is the write. Any data, address or conditional dependency between the two implies an ordering. So no, "smp_read_barrier_depends()" is *ONLY* about two reads, where the second read is data-dependent on the first. Nothing else. So if you _ever_ see a "smp_read_barrier_depends()" that isn't about a barrier between two reads, then that is a bug. The above code is crap. It's exactly as much crap as a = READ_ONCE(x); smp_rmb(); WRITE_ONCE(b, y); because a "rmb()" simply doesn't have anything to do with read-vs-subsequent-write ordering. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v3,11/41] mips: reuse asm-generic/barrier.h
On Tue, Jan 26, 2016 at 12:10 PM, Paul E. McKenney <paul...@linux.vnet.ibm.com> wrote: > On Tue, Jan 26, 2016 at 11:44:46AM -0800, Linus Torvalds wrote: >> >> > struct foo *x = READ_ONCE(*ptr); >> > smp_read_barrier_depends(); >> > x->bar = 5; >> >> This case is complete BS. Stop perpetuating it. I already removed a >> number of bogus cases of it, and I removed the incorrect documentation >> that had this crap. > > If I understand your objection correctly, you want the above pattern > expressed either like this: > > struct foo *x = rcu_dereference(*ptr); > x->bar = 5; > > Or like this: > > struct foo *x = lockless_dereference(*ptr); > x->bar = 5; > > Or am I missing your point? You are entirely missing the point. You might as well just write it as struct foo x = READ_ONCE(*ptr); x->bar = 5; because that "smp_read_barrier_depends()" does NOTHING wrt the second write. So what I am saying is simple: anybody who writes that "smp_read_barrier_depends()" in there is just ttoally and completely WRONG, and the fact that Peter wrote it out after I removed several instances of that bloody f*cking idiocy is disturbing. Don't do it. It's BS. It's wrong. Don't make excuses for it. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v3,11/41] mips: reuse asm-generic/barrier.h
On Tue, Jan 26, 2016 at 2:15 PM, Linus Torvalds <torva...@linux-foundation.org> wrote: > > You might as well just write it as > > struct foo x = READ_ONCE(*ptr); > x->bar = 5; > > because that "smp_read_barrier_depends()" does NOTHING wrt the second write. Just to clarify: on alpha it adds a memory barrier, but that memory barrier is useless. On non-alpha, it is a no-op, and obviously does nothing simply because it generates no code. So if anybody believes that the "smp_read_barrier_depends()" does something, they are *wrong*. And if anybody sends out an email with that smp_read_barrier_depends() in an example, they are actively just confusing other people, which is even worse than just being wrong. Which is why I jumped in. So stop perpetuating the myth that smp_read_barrier_depends() does something here. It does not. It's a bug, and it has become this "mind virus" for some people that seem to believe that it does something. I had to remove this crap once from the kernel already, see commit 105ff3cbf225 ("atomic: remove all traces of READ_ONCE_CTRL() and atomic*_read_ctrl()"). I don't want to ever see that broken construct again. And I want to make sure that everybody is educated about how broken it was. I'm extremely unhappy that it came up again. If it turns out that some architecture does actually need a barrier between a read and a dependent write, then that will mean that (a) we'll have to make up a _new_ barrier, because "smp_read_barrier_depends()" is not that barrier. We'll presumably then have to make that new barrier part of "rcu_derefence()" and friends. (b) we will have found an architecture with even worse memory ordering semantics than alpha, and we'll have to stop castigating alpha for being the worst memory ordering ever. but I sincerely hope that we'll never find that kind of broken architecture. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [v3,11/41] mips: reuse asm-generic/barrier.h
On Tue, Jan 26, 2016 at 3:29 PM, Paul E. McKenneywrote: > > No trailing data-dependent read, so agreed, no smp_read_barrier_depends() > needed. That said, I believe that we should encourage rcu_dereference*() > or lockless_dereference() instead of READ_ONCE() for documentation > reasons, though. I agree that that is likely the right thing to do in pretty much all situations. In theory, there might be performance situations where we'd want to actively avoid the smp_read_barrier_depends() inherent in those, but considering that it's only a performance issue on alpha, and we probably have all of two or three people using Linux on alpha, it's a pretty theoretical performance worry. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mon, Sep 21, 2015 at 1:46 AM, Ingo Molnarwrote: > > Linus, what's your preference? So quite frankly, is there any reason we don't just implement native_read_msr() as just unsigned long long native_read_msr(unsigned int msr) { int err; unsigned long long val; val = native_read_msr_safe(msr, ); WARN_ON_ONCE(err); return val; } Note: no inline, no nothing. Just put it in arch/x86/lib/msr.c, and be done with it. I don't see the downside. How many msr reads are so critical that the function call overhead would matter? Get rid of the inline version of the _safe() thing too, and put that thing there too. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mon, Sep 21, 2015 at 9:49 AM, Arjan van de Venwrote: >> >> How many msr reads are so critical that the function call >> overhead would matter? > > if anything qualifies it'd be switch_to() and friends. Is there anything else than the FS/GS_BASE thing (possibly hidden behind inlines etc that I didn't get from a quick grep)? And why is that sometimes using the "safe" version (in do_arch_prctl()), and sometimes not (switch_to())? I'm not convinced that mess is a good argument for the status quo ;) > note that I'm not entirely happy about the notion of "safe" MSRs. > They're safe as in "won't fault". I wouldn't object to renaming them. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Mon, Sep 21, 2015 at 11:16 AM, Andy Lutomirskiwrote: > > In the interest of sanity, I want to drop the "native_", too Yes. I think the only reason it exists is to have that wrapper layer for PV. And that argument just goes away if you just make the non-inline helper function do all the PV logic directly. I really suspect we should do this for a *lot* of the PV ops. Yeah, some are so performance-critical that we probably do have a good reason for the inline indirections etc (historical example: native spin-unlock, which traditionally could be done as a single store instruction), but I suspect a lot of the PV indirection is for this kind of "historical wrapper model" reason, and it often makes it really hard to see what is going on because you have to go through several layers of indirection, often in different files. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v2 1/2] x86/msr: Carry on after a non-"safe" MSR access fails without !panic_on_oops
On Sun, Sep 20, 2015 at 5:02 PM, Andy Lutomirskiwrote: > This demotes an OOPS and likely panic due to a failed non-"safe" MSR > access to a WARN_ON_ONCE and a return of zero (in the RDMSR case). > We still write a pr_info entry unconditionally for debugging. No, this is wrong. If you really want to do something like this, then just make all MSR reads safe. So the only difference between "safe" and "unsafe" is that the unsafe version just doesn't check the return value, and silently just returns zero for reads (or writes nothing). To quote Obi-Wan: "Use the exception table, Luke". Because decoding instructions is just too ugly. We'll do it for CPU errata where we might have to do it for user space code too (ie the AMD prefetch mess), but for code that _we_ control? Hell no. So NAK on this. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] Patch x86/ldt: Make modify_ldt synchronous has been added to the 4.1-stable tree
On Fri, Aug 14, 2015 at 12:16 AM, Ingo Molnar mi...@kernel.org wrote: I just sent it to Linus, so barring a catastrophy it should be upstream soon. Well, I just rejected that pull as completely broken, so I guess the catastrophy happened... Or, alternatively, I mis-read the code, which does happen, but my giant ego tells me that's very rare.. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH v16 13/14] pvqspinlock: Improve slowpath performance by avoiding cmpxchg
On Wed, Apr 29, 2015 at 11:11 AM, Peter Zijlstra pet...@infradead.org wrote: On Fri, Apr 24, 2015 at 02:56:42PM -0400, Waiman Long wrote: In the pv_scan_next() function, the slow cmpxchg atomic operation is performed even if the other CPU is not even close to being halted. This extra cmpxchg can harm slowpath performance. This patch introduces the new mayhalt flag to indicate if the other spinning CPU is close to being halted or not. The current threshold for x86 is 2k cpu_relax() calls. If this flag is not set, the other spinning CPU will have at least 2k more cpu_relax() calls before it can enter the halt state. This should give enough time for the setting of the locked flag in struct mcs_spinlock to propagate to that CPU without using atomic op. Yuck! I'm not at all sure you can make assumptions like that. And the worst part is, if it goes wrong the borkage is subtle and painful.\ I have to agree with Peter. But it goes beyond this particular patch. Patterns like this: xchg(pn-mayhalt, true); are just evil and disgusting. Even befoe this patch, that code had (void)xchg(pn-state, vcpu_halted); which is *wrong* and should never be done. If you want it to be set_mb() (which sets a value and has a memory barrier), then use set_mb(). Yes, it happens to use a xchg() to do so, but dammit, it documents that whole this is a memory barrier in the name. Also, anybody who does this should damn well document why the memory barrier is needed. The xchg(pn-state, vcpu_halted) at least is preceded by a comment about the barriers. The new mayhalt has no sane comment in it, and the reason seems to be that no sane comment is possible. The xchg() seems to be some black magic thing. Let's not introduce magic stuff in our locking primitives. At least not undocumented magic that makes no sense. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [RFC PATCH] x86/asm/irq: Don't use POPF but STI
On Tue, Apr 21, 2015 at 5:45 AM, Ingo Molnar mi...@kernel.org wrote: Totally untested and not signed off yet: because we'd first have to make sure (via irq flags debugging) that it's not used in reverse, to re-disable interrupts: Not only might that happen in some place, I *really* doubt that a conditional 'sti' is actually any faster. The only way it's going to be measurably faster is if you run some microbenchmark so that the code is hot and the branch predicts well. popf is fast for the no changes to IF case, and is a smaller instruction anyway. I'd really hate to make this any more complex unless somebody has some real numbers for performance improvement (that is *not* just some cycle timing from a bogus test-case, but real measurements on a real load). And even *with* real measurements, I'd worry about the use popf to clear IF case. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] NUMA_BALANCING and Xen PV guest regression in 3.20-rc0
On Thu, Feb 19, 2015 at 5:05 PM, Kirill A. Shutemov kir...@shutemov.name wrote: I'm feeling I miss very basic background on how Xen works, but why does it set _PAGE_GLOBAL on userspace entries? It sounds strange to me. It is definitely strange. I'm guessing that it's some ancient Xen hack for the early Intel virtualization that used to have absolutely horrendous vmenter/exit costs, including very much the TLB overhead. \ These days, Intel has address space identifiers, and doesn't flush the whole TLB on VM entry/exit, so it's probably pointless to play games with the global bit. I get the feeling that a lot of Xen stuff is that kind of legacy hacks that should just be cleaned up, but nobody has the energy or the interest. There was the whole odd crazy SHARED_KERNEL_PMD hackery too. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] NUMA_BALANCING and Xen PV guest regression in 3.20-rc0
On Thu, Feb 19, 2015 at 5:06 AM, David Vrabel david.vra...@citrix.com wrote: The NUMA_BALANCING series beginning with 5d833062139d (mm: numa: do not dereference pmd outside of the lock during NUMA hinting fault) and specifically 8a0516ed8b90 (mm: convert p[te|md]_numa users to p[te|md]_protnone_numa) breaks Xen 64-bit PV guests. Any fault on a present userspace mapping (e.g., a write to a read-only mapping) is being misinterpreted as a NUMA hinting fault and not handled correctly. All userspace programs end up continuously faulting. This is because the hypervisor sets _PAGE_GLOBAL (== _PAGE_PROTNONE) on all present userspace page table entries. That's some crazy stuff, but whatever. The patch is clearly good. Applied, Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] x86 spinlock: Fix memory corruption on completing completions
On Feb 11, 2015 3:15 PM, Jeremy Fitzhardinge jer...@goop.org wrote: Right now it needs to be a locked operation to prevent read-reordering. x86 memory ordering rules state that all writes are seen in a globally consistent order, and are globally ordered wrt reads *on the same addresses*, but reads to different addresses can be reordered wrt to writes. The modern x86 rules are actually much tighter than that. Every store is a release, and every load is an acquire. So a non-atomic store is actually a perfectly fine unlock. All preceding stores will be seen by other cpu's before the unlock, and while reads can pass stores, they only pass *earlier* stores. For *taking* a lock you need an atomic access, because otherwise loads inside the locked region could bleed out to before the store that takes the lock. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] [PATCH] x86 spinlock: Fix memory corruption on completing completions
On Mon, Feb 9, 2015 at 4:02 AM, Peter Zijlstra pet...@infradead.org wrote: On Mon, Feb 09, 2015 at 03:04:22PM +0530, Raghavendra K T wrote: So we have 3 choices, 1. xadd 2. continue with current approach. 3. a read before unlock and also after that. For the truly paranoid we have probe_kernel_address(), suppose the lock was in module space and the module just got unloaded under us. That's much too expensive. The xadd shouldn't be noticeably more expensive than the current add_smp(). Yes, lock xadd used to be several cycles slower than just lock add on some early cores, but I think these days it's down to a single-cycle difference, which is not really different from doing a separate load after the add. The real problem with xadd used to be that we always had to do magic special-casing for i386, but that's one of the reasons we dropped support for original 80386. So I think Raghavendra's last version (which hopefully fixes the lockup problem that Sasha reported) together with changing that add_smp(lock-tickets.head, TICKET_LOCK_INC); if (READ_ONCE(lock-tickets.tail) TICKET_SLOWPATH_FLAG) .. into something like val = xadd((lock-ticket.head_tail, TICKET_LOCK_INC TICKET_SHIFT); if (unlikely(val TICKET_SLOWPATH_FLAG)) ... would be the right thing to do. Somebody should just check that I got that shift right, and that the tail is in the high bytes (head really needs to be high to work, if it's in the low byte(s) the xadd would overflow from head into tail which would be wrong). Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel
Re: [Xen-devel] linux-next: missing merge fix patch for the merge of the xen-tip tree with the arm-soc tree
On Mon, Dec 22, 2014 at 6:26 PM, Stephen Rothwell s...@canb.auug.org.au wrote: Hi Linus, I have been carrying this merge fix patch for some time that is now needed in your tree: No, it's not. Look more closely. My merge put the dev-archdata.dma_coherent = coherent; line at the top of the function, like the way it was in the original commit. Linus ___ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel