Re: [PATCH v2] tools/power turbostat: Fix RAPL summary collection on AMD processors
On 4/20/21 1:15 PM, Chen Yu wrote:

On Tue, Apr 20, 2021 at 10:07:01AM +0200, Borislav Petkov wrote:

On Tue, Apr 20, 2021 at 10:03:36AM +0800, Chen Yu wrote:

On Mon, Apr 19, 2021 at 02:58:12PM -0500, Terry Bowman wrote:

Turbostat fails to correctly collect and display RAPL summary information on Family 17h and 19h AMD processors. Running turbostat on these processors returns immediately. If turbostat is working correctly, RAPL summary data is displayed until the user-provided command completes; if no command is provided by the user, turbostat is designed to continuously display RAPL information until interrupted.

The issue is that offset_to_idx() and idx_to_offset() are missing support for AMD MSR addresses/offsets: offset_to_idx()'s switch statement is missing cases for the AMD MSRs, and idx_to_offset() does not include a path to return AMD MSR(s) for any idx. The solution is to add AMD MSR support to offset_to_idx() and idx_to_offset(). These functions are split out and renamed along architecture vendor lines to support both AMD and Intel MSRs.

Fixes: 9972d5d84d76 ("tools/power turbostat: Enable accumulate RAPL display")
Signed-off-by: Terry Bowman

Thanks for fixing, Terry. Previously there was a patch for this from Bas Nieuwenhuizen: https://lkml.org/lkml/2021/3/12/682 and it is expected to have been merged in Len's branch already.

Expected? So is it or is it not?

This patch was sent to Len and it is not in a public repo yet. He is preparing a new release of turbostat as the merge window is approaching.

And can you folks agree on a patch already and give it to Artem for testing (CCed), because he's triggering it too: https://bugzilla.kernel.org/show_bug.cgi?id=212357

Okay. I would vote for the patch from Bas as it was combined work from two authors and has been tested by several AMD users.
But let me paste it here too for Artem to see if this also works for him:

From 00e0622b1b693a5c7dc343aeb3aa51614a9e125e Mon Sep 17 00:00:00 2001
From: Bas Nieuwenhuizen
Date: Fri, 12 Mar 2021 21:27:40 +0800
Subject: [PATCH] tools/power/turbostat: Fix turbostat for AMD Zen CPUs

It was reported that on a Zen+ system turbostat started exiting, which was tracked down to the MSR_PKG_ENERGY_STAT read failing because offset_to_idx wasn't returning a non-negative index. This patch combines the modifications from Bingsong Si and Bas Nieuwenhuizen and adds the MSR to the index system as an alternative for MSR_PKG_ENERGY_STATUS.

Fixes: 9972d5d84d76 ("tools/power turbostat: Enable accumulate RAPL display")
Reported-by: youling257
Tested-by: youling257
Tested-by: sibingsong
Tested-by: Kurt Garloff
Co-developed-by: Bingsong Si
Signed-off-by: Chen Yu
---
 tools/power/x86/turbostat/turbostat.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index a7c4f0772e53..a7c965734fdf 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -297,7 +297,10 @@ int idx_to_offset(int idx)
 	switch (idx) {
 	case IDX_PKG_ENERGY:
-		offset = MSR_PKG_ENERGY_STATUS;
+		if (do_rapl & RAPL_AMD_F17H)
+			offset = MSR_PKG_ENERGY_STAT;
+		else
+			offset = MSR_PKG_ENERGY_STATUS;
 		break;
 	case IDX_DRAM_ENERGY:
 		offset = MSR_DRAM_ENERGY_STATUS;
@@ -326,6 +329,7 @@ int offset_to_idx(int offset)
 	switch (offset) {
 	case MSR_PKG_ENERGY_STATUS:
+	case MSR_PKG_ENERGY_STAT:
 		idx = IDX_PKG_ENERGY;
 		break;
 	case MSR_DRAM_ENERGY_STATUS:
@@ -353,7 +357,7 @@ int idx_valid(int idx)
 {
 	switch (idx) {
 	case IDX_PKG_ENERGY:
-		return do_rapl & RAPL_PKG;
+		return do_rapl & (RAPL_PKG | RAPL_AMD_F17H);
 	case IDX_DRAM_ENERGY:
 		return do_rapl & RAPL_DRAM;
 	case IDX_PP0_ENERGY:

The patch works for me.

Tested-by: Artem S. Tashkinov
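To make the failure mode concrete, here is a minimal, self-contained C sketch — not the actual turbostat code; the MSR constants are the published Intel/AMD package-energy MSR numbers, and IDX_PKG_ENERGY is a stand-in index — of the lookup the patch repairs. Without the AMD case, offset_to_idx() returns -1 for MSR_PKG_ENERGY_STAT and turbostat bails out immediately on Family 17h/19h:

```c
#include <assert.h>

/* Stand-ins for turbostat's defines; MSR numbers are the documented ones. */
#define MSR_PKG_ENERGY_STATUS 0x611u        /* Intel package energy MSR */
#define MSR_PKG_ENERGY_STAT   0xc001029bu   /* AMD Fam 17h/19h package energy MSR */
#define IDX_PKG_ENERGY 0

/* Map an MSR offset back to its accumulator index; -1 means "unknown MSR",
 * which is exactly what made turbostat exit on AMD before the fix. */
int offset_to_idx(unsigned int offset)
{
	switch (offset) {
	case MSR_PKG_ENERGY_STATUS:
	case MSR_PKG_ENERGY_STAT:	/* the case that was missing */
		return IDX_PKG_ENERGY;
	default:
		return -1;
	}
}
```

The fix is symmetric: idx_to_offset() must also pick the AMD MSR when do_rapl indicates RAPL_AMD_F17H, so the offset-to-idx-to-offset round trip stays consistent.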
A long standing issue with RAM usage reporting
Hello everyone,

I'd love to bring kernel developers' attention to this long-standing issue: https://bugzilla.kernel.org/show_bug.cgi?id=201675

It would be great if something were done about it, because otherwise htop, top, free and numerous other utilities in Linux have to implement hacks and workarounds to properly report free/used RAM:

https://github.com/htop-dev/htop/issues/556
https://gitlab.com/procps-ng/procps/-/issues/196

There's also another related issue: https://bugzilla.kernel.org/show_bug.cgi?id=201673 but it will be automatically solved once the initial bug report has been dealt with.

Best regards,
Artem
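For the record, the workaround the tools above resort to boils down to trusting the kernel's own MemAvailable estimate from /proc/meminfo (exported since kernel 3.14) instead of deriving "free" memory themselves. A hedged sketch of such a parser — the function name and shape are mine, not htop's or procps':

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Extract a field value (in kB) from /proc/meminfo-style text; -1 if absent.
 * Sketch only: real tools read /proc/meminfo and prefer MemAvailable
 * when the kernel provides it, falling back to MemFree arithmetic. */
long parse_meminfo_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *p = text;

	while (p && *p) {
		/* A field only counts at the start of a line, e.g. "MemAvailable:". */
		if (strncmp(p, key, klen) == 0 && p[klen] == ':')
			return strtol(p + klen + 1, NULL, 10);
		p = strchr(p, '\n');	/* advance to the next line */
		if (p)
			p++;
	}
	return -1;
}
```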
[PATCH] Kconfig: default to CC_OPTIMIZE_FOR_PERFORMANCE_O3 for gcc >= 10
> GCC 10 appears to have changed -O2 in order to make compilation time faster when using -flto, seemingly at the expense of performance, in particular with regards to how the inliner works. Since -O3 these days shouldn't have the same set of bugs as 10 years ago, this commit defaults new kernel compiles to -O3 when using gcc >= 10.

It's a strong "no" from me.

1) Aside from rare Gentoo users, no one has extensively tested -O3 with the kernel - even Gentoo defaults to -O2 for kernel compilation.
2) -O3 _always_ bloats the code by a large amount, which means both vmlinux/bzImage and modules will become bigger and slower to load from disk.
3) -O3 does _not_ necessarily make the code run faster.
4) If GCC 10 has removed certain options from the -O2 optimization level, you could just re-add them as compilation flags without forcing -O3 by default on everyone.
5) If you still insist on -O3, I guess everyone would be happy if you just made two Kconfig options: OPTIMIZE_O2 (-O2) and OPTIMIZE_O3_EVEN_MOAR (-O3).

Best regards,
Artem
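Point 5 would amount to extending the existing optimization-level choice block in init/Kconfig rather than changing a default; something like the following sketch (illustrative only, not a tested patch - option naming follows the existing CC_OPTIMIZE_FOR_* convention):

```kconfig
choice
	prompt "Compiler optimization level"
	default CC_OPTIMIZE_FOR_PERFORMANCE

config CC_OPTIMIZE_FOR_PERFORMANCE
	bool "Optimize for performance (-O2)"

config CC_OPTIMIZE_FOR_PERFORMANCE_O3
	bool "Optimize more for performance (-O3)"

endchoice
```

This way -O3 stays strictly opt-in and -O2 remains the default for everyone who doesn't explicitly select otherwise.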
Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
On 8/5/19 9:05 AM, Hillf Danton wrote:

On Sun, 4 Aug 2019 09:23:17 + "Artem S. Tashkinov" wrote:

Hello,

There's this bug which has been bugging many people for many years already and which is reproducible in less than a few minutes under the latest and greatest kernel, 5.2.6. All the kernel parameters are set to defaults.

Thanks for the report!

Steps to reproduce:
1) Boot with mem=4G
2) Disable swap to make everything faster (sudo swapoff -a)
3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
4) Start opening tabs in either of them and watch your free RAM decrease

We saw another corner-case CPU hog report under memory pressure, also with swap disabled. In that report the xfs filesystem was a factor, with CONFIG_MEMCG enabled. Anything special, say like

kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [leaker1:7193]

or

[ 3225.313209] Xorg: page allocation failure: order:4, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0

in your kernel log?

I'm running ext4 only, without LVM, encryption or anything like that. Plain GPT/MBR partitions with plenty of free space and no disk errors.

Once you hit a situation where opening a new tab requires more RAM than is currently available, the system will stall hard. You will barely be able to move the mouse pointer. Your disk LED will be flashing incessantly (I'm not entirely sure why). You will not be able to run new applications or close currently running ones.

A CPU hog may come on top of the memory hog in some scenarios.

It might have happened as well - I couldn't know, since I wasn't able to open a terminal. Once the system recovered, there was no trace of anything extraordinary.

This little crisis may continue for minutes or even longer. I think that's not how the system should behave in this situation. I believe something must be done about that to avoid this stall.

Yes, Sir.

I'm almost sure some sysctl parameters could be changed to avoid this situation, but something tells me this should be done for everyone and made the default, because some non-tech-savvy users will just give up on Linux if they ever get into a situation like this, and they won't be keen, or even able, to Google for solutions.

I am not willing to repeat that it is hard to produce a pill for all patients, but the info you post will help solve the crisis sooner.

Hillf

In case you have trouble reproducing this bug report I can publish a VM image - still, everything is quite mundane: Fedora 30 + XFCE + a web browser. Nothing else, nothing fancy.

Regards,
Artem
Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
Hello,

There's this bug which has been bugging many people for many years already and which is reproducible in less than a few minutes under the latest and greatest kernel, 5.2.6. All the kernel parameters are set to defaults.

Steps to reproduce:
1) Boot with mem=4G
2) Disable swap to make everything faster (sudo swapoff -a)
3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
4) Start opening tabs in either of them and watch your free RAM decrease

Once you hit a situation where opening a new tab requires more RAM than is currently available, the system will stall hard. You will barely be able to move the mouse pointer. Your disk LED will be flashing incessantly (I'm not entirely sure why). You will not be able to run new applications or close currently running ones. This little crisis may continue for minutes or even longer.

I think that's not how the system should behave in this situation. I believe something must be done about that to avoid this stall.

I'm almost sure some sysctl parameters could be changed to avoid this situation, but something tells me this should be done for everyone and made the default, because some non-tech-savvy users will just give up on Linux if they ever get into a situation like this, and they won't be keen, or even able, to Google for solutions.

Best regards,
Artem
On the issue of CPU model-specific registers write protection in UEFI secure boot mode
Hello LKML,

Is there a serious reason why CPU MSRs are write-protected in UEFI secure boot mode in Linux?

* In order to even use MSRs you have to be root to `modprobe msr`.
* In order to read/write from/to MSRs you have to be root, as /dev/cpu/*/msr is accessible only by root.
* CPU registers don't survive reboots/power cycles.
* I'm not a CPU designer, but if I'm not mistaken MSRs cannot be used to create any sort of stealth malware.

I'm asking this question because these registers allow fine-tuning Intel CPU power parameters (https://github.com/georgewhewell/undervolt), like voltage and others, and make it possible to run your system both faster and cooler - and right now that's not possible under Linux while being perfectly possible under competing proprietary OSes.

Of course, the user can:
* fetch his distro kernel sources
* apply a patch from https://github.com/intel/intel-cmt-cat/wiki/UEFI-Secure-Boot-Compatibility
* install his own UEFI certificate
* compile, sign and install a patched msr kernel module

However, all of this has to be done for each new kernel release, and many Linux users simply cannot do anything on this list.

Best regards,
Artem
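The access path described above (root, `modprobe msr`, then /dev/cpu/*/msr) is simple enough to show; a minimal C sketch, where the MSR number is passed as the pread() offset - this is how the msr character device works - with error handling reduced to the bare minimum:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one 64-bit MSR via the msr driver (requires root and `modprobe msr`).
 * Returns 0 on success, -1 on failure. Sketch only: tools like undervolt
 * and msr-tools layer capability checks and error reporting on top. */
int read_msr(int cpu, uint32_t reg, uint64_t *val)
{
	char path[64];
	int fd;
	ssize_t n;

	snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;	/* no msr module, no such CPU, or no permission */
	n = pread(fd, val, sizeof(*val), reg);	/* file offset = MSR number */
	close(fd);
	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}
```

Writing is the mirror image with pwrite() on an O_WRONLY descriptor - and that write path is precisely what gets locked down under secure boot.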
Disabling CPU vulnerabilities workarounds (second try)
Hello,

I'm resending my last email since the first one didn't draw enough attention despite the gravity of the situation, and the issue has been exacerbated by the recent kernel 4.20 changes, which incur an even larger performance loss - up to 50% according to the most recent Phoronix testing: https://www.phoronix.com/scan.php?page=article=linux-420-stibp It looks like only pure compute loads are unaffected by the new code, which renders hyper-threading in Intel CPUs almost useless.

The original email follows:

***

As time goes by, more and more fixes for Intel/AMD/ARM CPU vulnerabilities are added to the Linux kernel without a simple way to disable them *all*. Disabling is a good option for strictly confined environments where no third-party untrusted code is ever to be run, e.g. a rendering farm, a supercomputer, or even a home server which runs a Samba/SSH server and nothing else.

I wonder if someone could write a patch which implements the following two options for the kernel:

* A boot option which allows disabling *most* protections/workarounds/fixes (as far as I understand some of them can't be reverted since they are compiled in or use certain GCC workarounds), e.g. let's call it "insecure" or "insecurecpumode"
* A compile-time config option which disables the said fixes _permanently_, without a way to turn them back on.

Right now linux/Documentation/admin-guide/kernel-parameters.txt is a mess of various things which takes ages to sift through, and there's zero certainty whether you've found everything and correctly disabled it.

***

Addendum: I can imagine that writing such a patch is not trivial and no one is eager to do it. In that case people would love to see an extra file in the kernel documentation, e.g. CPU-vulnerabilities.txt, which lists all the existing protections in the kernel and the boot options to disable them. The Internet is already rife with questions about how to disable the said protections, and the answers are quite different.

In short, it would be great to have some organization in regard to this issue.

Regards,
Artem
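As a stopgap, the closest thing to such an inventory is the sysfs interface the kernel has exported since 4.15: one file per known vulnerability under /sys/devices/system/cpu/vulnerabilities/, each reporting whether and how it is mitigated. A small hedged C sketch that dumps it (returning 0 silently on kernels or systems without that directory):

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/* Print mitigation status per vulnerability as the kernel reports it.
 * Returns the number of entries printed (0 if the sysfs dir is absent). */
int list_vulnerabilities(void)
{
	const char *dirpath = "/sys/devices/system/cpu/vulnerabilities";
	struct dirent *e;
	int count = 0;
	DIR *d = opendir(dirpath);

	if (!d)
		return 0;	/* pre-4.15 kernel, or not Linux */
	while ((e = readdir(d)) != NULL) {
		char path[512], line[256];
		FILE *f;

		if (e->d_name[0] == '.')
			continue;	/* skip "." and ".." */
		snprintf(path, sizeof(path), "%s/%s", dirpath, e->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(line, sizeof(line), f))
			printf("%-24s %s", e->d_name, line);
		fclose(f);
		count++;
	}
	closedir(d);
	return count;
}
```

This only reports state; it still doesn't tell you which boot parameter disables which mitigation, which is the documentation gap argued above.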
Re: Disabling CPU vulnerabilities workarounds
On 08/25/2018 06:39 PM, Casey Schaufler wrote:

On 8/25/2018 3:42 AM, Artem S. Tashkinov wrote:

Hello LKML,

As time goes by more and more fixes of Intel/AMD/ARM CPU vulnerabilities are added to the Linux kernel without a simple way to disable them all in one fell swoop.

Many of the mitigations are unrelated to each other. There is no one aspect of the system that identifies a behavior as a security issue. I don't know anyone who could create a list of all the "fixes" that have gone in over the years. Realize that features like speculative execution have had security issues that are unrelated to obscure attacks like side-channels. While you may think that you don't care, some of those flaws affect correctness. My bet is you wouldn't want to disable those.

As far as I know, mitigations started to appear in January 2018, and kernels released prior to that date all work just fine without any issues with "correctness", so I'm not sure what you're talking about. I'm quite sure at least Intel knows perfectly well, as does Linus Torvalds, who coordinates everything.

Also

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f73fa6f6d85e..e6362717c895 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -991,7 +991,7 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 {
 	u64 ia32_cap = 0;

-	if (x86_match_cpu(cpu_no_speculation))
+	//if (x86_match_cpu(cpu_no_speculation))
 		return;

 	setup_force_cpu_bug(X86_BUG_SPECTRE_V1);

and setting this in .config:

CONFIG_RETPOLINE=n
CONFIG_PAGE_TABLE_ISOLATION=n

ostensibly disables all mitigations, and everything continues to work just fine.

Disabling is a good option for strictly confined environments where no third-party untrusted code is ever to be run, e.g. a rendering farm, a supercomputer, or even a home server which runs a Samba/SSH server and nothing else.

Like maybe the software in centrifuges in a nuclear fuel processing plant? All the examples you've cited are network connected and are vulnerable to attack. And don't try the "no untrusted code" argument. You'll have code on those systems that has been known vulnerable for decades.

I'm not sure 1) why you're trying to mix unrelated classes of vulnerabilities - of course there are vulnerabilities other than the ones caused by speculative execution; 2) why you're insisting that my argument, that someone may never run untrusted code, has no merit. I may perfectly well have a standard Linux distro installed on my PC/server and never run a web browser or any similar applications other than the ones provided by my distro in the form of various packages - which means I will never run any untrusted code. I will also never run any scriptable applications (bash/python/php/ruby/etc.) from the net either. How might such a configuration be susceptible to speculative execution attacks?

I wonder if someone could write a patch which implements the following two options for the kernel:

* A boot option which allows disabling most runtime protections/workarounds/fixes (as far as I understand some of them can't be reverted since they are compiled in or use certain GCC flags), e.g. let's call it "insecure" or "insecurecpumode".

That would be an interesting exercise for the opposite case. A boot option that enables all the runtime protections would certainly be interesting to some people. If you could implement one, you could do the other. I would be happy to review such a patch. Go for it.

I'd love to leave that task to those who are more proficient in writing kernel code and whose work is more likely to be merged. My patch might never be streamlined for totally unrelated reasons (and we've seen too many examples of that already).

* A compile-time CONFIG_ option which disables all these fixes _permanently_, without a way to turn them back on later during runtime.

This suffers from all the challenges previously mentioned, but would be equally interesting, again for the opposite case.

Again, I see no challenges since, for instance, RHEL has gone as far as backporting all the patches to previously released, officially unmaintained kernels, so all these patches could easily be disabled if one really wanted to.

Right now linux/Documentation/admin-guide/kernel-parameters.txt is a mess of various things which takes ages to sift through, and there's zero certainty whether you've found everything and correctly disabled it.

I can't argue with you on that. Again, I believe the greater value would come from documenting how to turn everything on.

I guess you meant "turn everything off".

Best regards,
Artem
Disabling CPU vulnerabilities workarounds
Hello LKML,

As time goes by more and more fixes of Intel/AMD/ARM CPU vulnerabilities are added to the Linux kernel without a simple way to disable them all in one fell swoop. Disabling is a good option for strictly confined environments where no third-party untrusted code is ever to be run, e.g. a rendering farm, a supercomputer, or even a home server which runs a Samba/SSH server and nothing else.

I wonder if someone could write a patch which implements the following two options for the kernel:

* A boot option which allows disabling most runtime protections/workarounds/fixes (as far as I understand some of them can't be reverted since they are compiled in or use certain GCC flags), e.g. let's call it "insecure" or "insecurecpumode".
* A compile-time CONFIG_ option which disables all these fixes _permanently_, without a way to turn them back on later during runtime.

Right now linux/Documentation/admin-guide/kernel-parameters.txt is a mess of various things which takes ages to sift through, and there's zero certainty whether you've found everything and correctly disabled it.

Best regards,
Artem
On the kernel numbering scheme
Hello all,

I know this proposal has already been made a great many times, but I'd like to repeat it and have a healthy discussion about it.

The current kernel numbering scheme makes no sense because the first two numbers don't represent anything at all. They had some meaning back in the 1.x/2.x/3.x days, but with the introduction of the new rolling development model they became worthless.

I'd love to change the kernel numbering scheme to this: YEAR.RELEASE.PATCH_LEVEL

So the first kernel to be released in 2019 will be numbered 2019.0(.0), its subsequent releases will be 2019.1, 2019.2, 2019.3, etc., and its stable patches will be 2019.0.1, 2019.0.2, 2019.0.3, 2019.0.4, etc. With this scheme you can easily see how fresh your kernel is, and there's no need to arbitrarily raise the first number because it always matches the current year.

There's one minor detail which might raise some questions: there are release candidates and then there's a release, so for development which starts before the year's end we might start with e.g. 2018.5-rc1 and then, if the actual release crosses the new year mark, simply turn 2018.5-rc7 into 2019.0.0.

Best regards,
Artem S. Tashkinov
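The proposed scheme is mechanical enough to pin down in a few lines; a toy C sketch of one possible reading of it (my interpretation, not part of the proposal: the patch level is printed for stable releases and for the year's .0 release, and omitted for later mainline releases):

```c
#include <assert.h>
#include <stdio.h>

/* Format a version under the proposed YEAR.RELEASE[.PATCH_LEVEL] scheme.
 * Returns the number of characters written (snprintf semantics).
 * 2019.0.0 -> first release of the year; 2019.1 -> next mainline release;
 * 2019.0.2 -> second stable update of 2019.0. */
int format_version(char *buf, size_t len, int year, int release, int patch)
{
	if (patch > 0 || release == 0)
		return snprintf(buf, len, "%d.%d.%d", year, release, patch);
	return snprintf(buf, len, "%d.%d", year, release);
}
```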
Trying to understand General protection fault/hrtimer_active
Hello,

After stopping mariadb on our database server, the server physically crashed and required a hard reset in order to get back online. Fortunately the system was able to dump the kernel error:

Aug 11 09:22:44 mariadb mysqld[1229]: 2017-08-11 9:22:44 140417868658432 [ERROR] mysqld: Deadlock found when trying to get lock; try restarting transaction
Aug 11 09:24:03 mariadb kernel: [225113.038696] general protection fault: [#1] SMP
Aug 11 09:24:03 mariadb kernel: [225113.038709] Modules linked in: ppdev intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul joydev input_leds glue_helper ablk_helper cryptd serio_raw shpchp lpc_ich parport_pc 8250_fintek parport tpm_infineon mac_hid nct6775 hwmon_vid coretemp autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid0 multipath linear raid1 mxm_wmi ahci psmouse r8169 libahci mii wmi video fjes
Aug 11 09:24:03 mariadb kernel: [225113.038836] CPU: 3 PID: 3570 Comm: mysqld Not tainted 4.4.0-89-generic #112-Ubuntu
Aug 11 09:24:03 mariadb kernel: [225113.038853] Hardware name: MSI MS-7816/H87-G43 (MS-7816), BIOS V2.14B6 08/23/2013
Aug 11 09:24:03 mariadb kernel: [225113.038868] task: 8807f6f88e00 ti: 8807f6534000 task.ti: 8807f6534000
Aug 11 09:24:03 mariadb kernel: [225113.038881] RIP: 0010:[] [] hrtimer_active+0x9/0x60
Aug 11 09:24:03 mariadb kernel: [225113.038899] RSP: 0018:8807f65379e0 EFLAGS: 00010246
Aug 11 09:24:03 mariadb kernel: [225113.038909] RAX: RBX: ffbf8807f6537a30 RCX:
Aug 11 09:24:03 mariadb kernel: [225113.038922] RDX: RSI: 8807f6f88e00 RDI: ffbf8807f6537a30
Aug 11 09:24:03 mariadb kernel: [225113.038947] RBP: 8807f65379e0 R08: 8807f6534000 R09:
Aug 11 09:24:03 mariadb kernel: [225113.038982] R10: 000103599c14 R11: R12:
Aug 11 09:24:03 mariadb kernel: [225113.039018] R13: 0001 R14: 8807f6537b58 R15:
Aug 11 09:24:03 mariadb kernel: [225113.039053] FS: 7fb69edc5700() GS:88081eac() knlGS:
Aug 11 09:24:03 mariadb kernel: [225113.039091] CS: 0010 DS: ES: CR0: 80050033
Aug 11 09:24:03 mariadb kernel: [225113.039112] CR2: 7fb59e1e7e88 CR3: 0007f943f000 CR4: 001406e0
Aug 11 09:24:03 mariadb kernel: [225113.039148] Stack:
Aug 11 09:24:03 mariadb kernel: [225113.039164] 8807f6537a18 810efba9 8807f6537b58 2cf88ace51220a81
Aug 11 09:24:03 mariadb kernel: [225113.039202] ffbf8807f6537a30 0001 8807f6537ac0
Aug 11 09:24:03 mariadb kernel: [225113.039240] 81841341 05f5e100 88071ab63a30
Aug 11 09:24:03 mariadb kernel: [225113.039278] Call Trace:
Aug 11 09:24:03 mariadb kernel: [225113.039297] [] hrtimer_try_to_cancel+0x29/0x130
Aug 11 09:24:03 mariadb kernel: [225113.039321] [] schedule_hrtimeout_range_clock+0xd1/0x1b0
Aug 11 09:24:03 mariadb kernel: [225113.039346] [] ? __hrtimer_init+0x90/0x90
Aug 11 09:24:03 mariadb kernel: [225113.039369] [] ? schedule_hrtimeout_range_clock+0xb9/0x1b0
Aug 11 09:24:03 mariadb kernel: [225113.039405] [] schedule_hrtimeout_range+0x13/0x20
Aug 11 09:24:03 mariadb kernel: [225113.039430] [] poll_schedule_timeout+0x44/0x70
Aug 11 09:24:03 mariadb kernel: [225113.039453] [] do_sys_poll+0x4af/0x560
Aug 11 09:24:03 mariadb kernel: [225113.039477] [] ? __alloc_skb+0x5b/0x1f0
Aug 11 09:24:03 mariadb kernel: [225113.039500] [] ? __kmalloc_node_track_caller+0x249/0x310
Aug 11 09:24:03 mariadb kernel: [225113.039525] [] ? __alloc_skb+0x87/0x1f0
Aug 11 09:24:03 mariadb kernel: [225113.039548] [] ? poll_select_copy_remaining+0x140/0x140
Aug 11 09:24:03 mariadb kernel: [225113.039572] [] ? _raw_spin_unlock_bh+0x1e/0x20
Aug 11 09:24:03 mariadb kernel: [225113.039596] [] ? release_sock+0x111/0x160
Aug 11 09:24:03 mariadb kernel: [225113.039620] [] ? tcp_recvmsg+0x3fc/0xbe0
Aug 11 09:24:03 mariadb kernel: [225113.039644] [] ? inet_recvmsg+0x7e/0xb0
Aug 11 09:24:03 mariadb kernel: [225113.039666] [] ? sock_recvmsg+0x3d/0x50
Aug 11 09:24:03 mariadb kernel: [225113.039688] [] ? SYSC_recvfrom+0x13d/0x150
Aug 11 09:24:03 mariadb kernel: [225113.039711] [] ? __schedule+0x3b6/0xa30
Aug 11 09:24:03 mariadb kernel: [225113.039734] [] ? ktime_get_ts64+0x49/0xf0
Aug 11 09:24:03 mariadb kernel: [225113.039756] [] SyS_poll+0x71/0x130
Aug 11 09:24:03 mariadb kernel: [225113.039778] [] entry_SYSCALL_64_fastpath+0x16/0x71
Aug 11 09:24:03 mariadb kernel: [225113.039801] Code: 00 00 0f 1f 44 00 00 55 48 c7 47 28 70 f9 0e 81 48 89 77 58 48 89 e5 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 57 30 eb 1d 80 7f 38 00 75 32 48 3b 78 08 74 2c 39 50 04
Re: The most insane proposal in regard to the Linux kernel development
On 2016-04-07 01:05, Greg KH wrote:
> On Sat, Apr 02, 2016 at 05:43:47PM +0500, Artem S. Tashkinov wrote:
>> One very big justification of this proposal is that core Linux development (I'm talking about various subsystems like mm/ ipc/ and interfaces under block/ fs/ security/ sound/ etc.) has slowed down significantly over the past years, so radical changes which warrant a new kernel API/ABI are less likely to be introduced.
> That's not true at all, the change is constant, and increasing, just look at the tree for proof of that.
>> Please, share your opinion.
> Please read Documentation/stable_api_nonsense.txt for my opinion, and that of the current developers. If you don't agree with this, that's fine, you are welcome to fork the kernel at any specific point and keep that api stable, just like many companies do and make money from it (SuSE, Red Hat, etc.). Best of luck with your kernel project.

Tell me, why is no one in the Linux kernel dev team concerned that:

1) There are up to a hundred regressions in each kernel release, where a big chunk of them are caused by internal API changes?
2) API changes sometimes require drastic changes in every related hardware driver, and since there's no way you can realistically test the code or the hardware, people later discover that their hardware has stopped working?
3) The core kernel developers do not have enough expertise to correctly update the entire kernel source tree, so little things get broken?
4) Developing drivers for a moving target is a Herculean job?
5) You cannot easily bisect kernel regressions, because regressions are often caused by things _outside_ of the problem you're experiencing?
6) You cannot use new drivers for your hardware on your old kernel, because new drivers are incompatible with an old source tree (don't remind me of RHEL's kernel - it's a rare exception and they usually port only the drivers their respective clients use)?
7) Tech-unsavvy people cannot realistically debug the kernel?
Hey, please, do not tell me that you're doing a great job following postings in LKML or resolving bugs filed in bugzilla. You do a very lousy job indeed - multiple postings in LKML get zero replies because the corresponding developer is either not subscribed to LKML at all, or he has missed the message. There are literally hundreds(!) of bugs in bugzilla which have ZERO replies. What's more, a great number of kernel developers do not have accounts in bugzilla and they don't read the corresponding mailing lists. What the hell is wrong with you guys? You're developing the kernel like it's your toy project.

1) There's no accountability whatsoever.
2) There are no unit tests. Not a single one.
3) There's no surefire way to contact developers who have committed "bad" code.
4) There's no sense of direction.
5) There's no easy way to debug the kernel.

For instance, let's talk about the revoke() call. Right now, if a certain IO device is removed while files on it are still open (there are multiple ways of opening files in Linux, starting from fopen() and ending with mmap()), the kernel state is basically undefined(!). Great! The corresponding mount point cannot be reused(!). Whatever program has its file descriptors on this accidentally removed device usually cannot gracefully quit or continue working. How on Earth does this syscall not get the utmost attention? Then we have bug 12309(1). My last comment to this bug gives a very simple way of reproducing it on all Android devices. Then we have bug 15875(2), which will probably take just ten man-hours to be resolved, yet there is no interest at all, yet thousands of people have very real problems due to it. Tell me, are you really proud of yourselves? Tell me, do you develop the kernel for your amusement, ego, your employer, or for average people to use? Tell me, are you really interested in more people migrating from stable, long-term-supported OSes to Linux? I want some truly honest answers.
And let's not repeat this mantra "we don't have enough resources". You have enough resources to break APIs/ABIs in a huge way, you have enough resources to introduce regressions - you just don't have enough resources for any semblance of a responsible development process.

Best regards,
Artem

1) https://bugzilla.kernel.org/show_bug.cgi?id=12309
2) https://bugzilla.kernel.org/show_bug.cgi?id=15875
The most insane proposal in regard to the Linux kernel development
Hello all,

It's not a secret that there are two basic ways of running a Linux distribution on your hardware. Either you use a stable distro which has quite an outdated kernel release that might not support your hardware, or you run the most recent stable version but you lose stability and are prone to regressions. This problem can be solved by decoupling drivers from the kernel and supplying them separately, so that you could enjoy stable kernel version X with brand new drivers, like it's done in most other proprietary OSes. I've been thinking of asking Linus about this decoupling for years already but I'm hesitant 'cause I'm 99.9% sure he will downright reject this proposal. Still, I'm gonna risk asking 'cause there are multiple pluses to this proposal:

1) We might have truly stable, really long-term-supported kernels (3-5 years or more).
2) The kernel size will be reduced by two orders of magnitude.
3) The user will be free to try different kernel driver versions without leaving his/her stable kernel.
4) Drivers will become easier to develop, debug and maintain (usually the developer will just have two kernel trees to target and test against).
5) There will be a sense of QA/QC and accountability (nothing like that exists at the moment, as reflected by a very long list of regressions for every kernel release).
6) Driver regressions will be easier to spot ('cause you can be sure that no other kernel changes have had undesired consequences/conflicts - right now driver A might break, and does occasionally break, because unrelated feature B has been reworked/tweaked/etc.).
7) There will be a lot fewer kernel releases and no constant rush to update them.
8) Kernel release numbers will become meaningful again. Right now no one can quickly say what's the difference between kernel 4.5.0 and 4.1.0.
This means kernel development must be changed to accommodate this proposal:

1) Yeah, I know, you all hate that, but stable APIs and ABIs must be introduced and supported for, let's say, at least three to five years.
2) Like we used to have during the 2.2.x and 2.4.x development cycles, unstable kernels with new APIs must be developed in parallel to stable ones.
3) Of course that means drivers for every kernel tree (stable/unstable) must be developed in parallel. In the future, perhaps, several parallel driver versions will have to be developed, e.g. drivers for kernels 1.0.x (stable), 1.2.x (next stable) and 1.3.x (unstable). However, taking into consideration that these three kernel releases span the range of 3..5 * 3 years = 9..15 years, older kernels will stop being supported eventually.

In short, I'm offering the concept of Windows NT kernel development. They have very rare stable kernel releases (e.g. XP SP0, SP1, SP2, 2003, 2003 R2 - all binary compatible), then the Vista kernel began development, and after its release six years later, hardware vendors had to support just two kernel releases. Not that it is a big issue. One very big justification of this proposal is that core Linux development (I'm talking about various subsystems like mm/ ipc/ and interfaces under block/ fs/ security/ sound/ etc.) has slowed down significantly over the past years, so radical changes which warrant a new kernel API/ABI are less likely to be introduced. Please, share your opinion.

--
Best regards,
Artem
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 10:55, Kent Overstreet wrote:
> On Tue, Dec 22, 2015 at 10:52:37AM +0500, Artem S. Tashkinov wrote:
>> On 2015-12-22 10:38, Kent Overstreet wrote:
>>> On Tue, Dec 22, 2015 at 05:26:12AM +, Junichi Nomura wrote:
>>>> On 12/22/15 12:59, Kent Overstreet wrote:
>>>>> reproduced it with 32 bit pae:
>>>>>> 1. Exclude memory above 4G line with boot param "max_addr=4G".
>>>>> doesn't work - max_addr=1G doesn't work either
>>>>>> 2. Disable highmem with "highmem=0".
>>>>> works!
>>>>>> 3. Try booting 64bit kernel.
>>>>> works
>>>> blk_queue_bio() does split then bounce, which makes the segment counting based on pages before bouncing and could go wrong.
>>>> What do you think of a patch like this?
>>> Artem, can you give this patch a try?
>> This patch ostensibly fixes the issue - at least I cannot immediately reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov"
> Let's all contemplate the fact that blk_segment_map_sg() _overrunning the end of the provided sglist_ was this much of a clusterfuck to debug.

From the look of it this fix has nothing to do with PAE, so why were only PAE users like me affected by the original (b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c) patch?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 10:38, Kent Overstreet wrote:
> On Tue, Dec 22, 2015 at 05:26:12AM +, Junichi Nomura wrote:
>> On 12/22/15 12:59, Kent Overstreet wrote:
>>> reproduced it with 32 bit pae:
>>>> 1. Exclude memory above 4G line with boot param "max_addr=4G".
>>> doesn't work - max_addr=1G doesn't work either
>>>> 2. Disable highmem with "highmem=0".
>>> works!
>>>> 3. Try booting 64bit kernel.
>>> works
>> blk_queue_bio() does split then bounce, which makes the segment counting based on pages before bouncing and could go wrong.
>> What do you think of a patch like this?
> Artem, can you give this patch a try?

This patch ostensibly fixes the issue - at least I cannot immediately reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov"

--
Jun'ichi Nomura, NEC Corporation

diff --git a/block/blk-core.c b/block/blk-core.c
index 5131993b..1d1c3c7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1689,8 +1689,6 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	struct request *req;
 	unsigned int request_count = 0;

-	blk_queue_split(q, &bio, q->bio_split);
-
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
@@ -1698,6 +1696,8 @@
 	 */
 	blk_queue_bounce(q, &bio);

+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
 		bio->bi_error = -EIO;
 		bio_endio(bio);
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 01:07, Tejun Heo wrote:
> Hello, Artem.
>
> Can you please apply the following patch on top and see whether anything changes? If it does make the issue go away, can you please revert the ".can_queue" part and test again?
>
> Thanks.
>
> ---
>  drivers/ata/ahci.h    | 2 +-
>  drivers/ata/libahci.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> --- a/drivers/ata/ahci.h
> +++ b/drivers/ata/ahci.h
> @@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde
>   */
>  #define AHCI_SHT(drv_name) \
>  	ATA_NCQ_SHT(drv_name), \
> -	.can_queue = AHCI_MAX_CMDS - 1, \
> +	.can_queue = 1 /*AHCI_MAX_CMDS - 1*/, \
>  	.sg_tablesize = AHCI_MAX_SG, \
>  	.dma_boundary = AHCI_DMA_BOUNDARY, \
>  	.shost_attrs = ahci_shost_attrs, \
> --- a/drivers/ata/libahci.c
> +++ b/drivers/ata/libahci.c
> @@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev
>  		hpriv->saved_cap2 = cap2 = 0;
>
>  	/* some chips have errata preventing 64bit use */
> -	if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
> +	if ((cap & HOST_CAP_64)/* && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)*/) {
>  		dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n");
>  		cap &= ~HOST_CAP_64;
>  	}

With the ".can_queue" part left intact the bug resurfaced. Full dmesg output is attached.

[Attachment: dmesg.xz (application/xz)]
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 01:07, Tejun Heo wrote:
> Hello, Artem.
>
> Can you please apply the following patch on top and see whether anything changes? If it does make the issue go away, can you please revert the ".can_queue" part and test again?
>
> Thanks.
>
> ---
>  drivers/ata/ahci.h    | 2 +-
>  drivers/ata/libahci.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> --- a/drivers/ata/ahci.h
> +++ b/drivers/ata/ahci.h
> @@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde
>   */
>  #define AHCI_SHT(drv_name) \
>  	ATA_NCQ_SHT(drv_name), \
> -	.can_queue = AHCI_MAX_CMDS - 1, \
> +	.can_queue = 1 /*AHCI_MAX_CMDS - 1*/, \
>  	.sg_tablesize = AHCI_MAX_SG, \
>  	.dma_boundary = AHCI_DMA_BOUNDARY, \
>  	.shost_attrs = ahci_shost_attrs, \
> --- a/drivers/ata/libahci.c
> +++ b/drivers/ata/libahci.c
> @@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev
>  		hpriv->saved_cap2 = cap2 = 0;
>
>  	/* some chips have errata preventing 64bit use */
> -	if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
> +	if ((cap & HOST_CAP_64)/* && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)*/) {
>  		dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n");
>  		cap &= ~HOST_CAP_64;
>  	}

This patch fixes the issue for me. Now rechecking without the .can_queue part.

BTW, since I left debugging on, here's the part you wanted:

[0.613851] XXX port 0 dma_sz=91392 mem=c002 mem_dma=0002 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[0.613865] XXX port 1 dma_sz=91392 mem=eea0 mem_dma=2ea0 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[0.620464] XXX port 2 dma_sz=91392 mem=eea2 mem_dma=2ea2 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[0.627121] XXX port 3 dma_sz=91392 mem=eea4 mem_dma=2ea4 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[0.633791] XXX port 4 dma_sz=91392 mem=eea6 mem_dma=2ea6 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[0.640445] XXX port 5 dma_sz=91392 mem=eea8 mem_dma=2ea8 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 10:23, Linus Torvalds wrote:
> On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvalds wrote:
>> That said, we obviously need to figure out this current problem regardless first..
> ... although maybe it *would* be interesting to hear what happens if you just compile a 64-bit kernel instead?

Under x86-64 I cannot reproduce this problem. It seems like it's PAE specific (Kent Overstreet says he has reproduced it).

> Do you still see the problem? Because if not, then we should look very specifically for some 32-bit PAE issue. For example, maybe we use "unsigned long" somewhere where we should use "phys_addr_t". On x86-64, they obviously end up being the same. On normal non-PAE x86-32, they are also the same. But ..
>
> Linus
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 01:07, Tejun Heo wrote: Hello, Artem. Can you please apply the following patch on top and see whether anything changes? If it does make the issue go away, can you please revert the ".can_queue" part and test again? Thanks. --- drivers/ata/ahci.h|2 +- drivers/ata/libahci.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) --- a/drivers/ata/ahci.h +++ b/drivers/ata/ahci.h @@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde */ #define AHCI_SHT(drv_name) \ ATA_NCQ_SHT(drv_name), \ - .can_queue = AHCI_MAX_CMDS - 1,\ + .can_queue = 1/*AHCI_MAX_CMDS - 1*/, \ .sg_tablesize = AHCI_MAX_SG, \ .dma_boundary = AHCI_DMA_BOUNDARY,\ .shost_attrs= ahci_shost_attrs, \ --- a/drivers/ata/libahci.c +++ b/drivers/ata/libahci.c @@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev hpriv->saved_cap2 = cap2 = 0; /* some chips have errata preventing 64bit use */ - if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) { + if ((cap & HOST_CAP_64)/* && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)*/) { dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n"); cap &= ~HOST_CAP_64; } With the ".can_queue" part left intact the bug resurfaced. Full dmesg output is attached. dmesg.xz Description: application/xz
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 10:23, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvaldswrote: That said, we obviously need to figure out this current problem regardless first.. ... although maybe it *would* be interesting to hear what happens if you just compile a 64-bit kernel instead? Under x86-64 I cannot reproduce this problem. It seems like it's PAE specific (Kent Overstreet says he has reproduced it). Do you still see the problem? Because if not, then we should look very specifically for some 32-bit PAE issue. For example, maybe we use "unsigned long" somewhere where we should use "phys_addr_t". On x86-64, they obviously end up being the same. On normal non-PAE x86-32, they are also the same. But .. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 10:55, Kent Overstreet wrote: On Tue, Dec 22, 2015 at 10:52:37AM +0500, Artem S. Tashkinov wrote: On 2015-12-22 10:38, Kent Overstreet wrote: >On Tue, Dec 22, 2015 at 05:26:12AM +, Junichi Nomura wrote: >>On 12/22/15 12:59, Kent Overstreet wrote: >>> reproduced it with 32 bit pae: >>> >>>> 1. Exclude memory above 4G line with boot param "max_addr=4G". >>> >>> doesn't work - max_addr=1G doesn't work either >>> >>>> 2. Disable highmem with "highmem=0". >>> >>> works! >>> >>>> 3. Try booting 64bit kernel. >>> >>> works >> >>blk_queue_bio() does split then bounce, which makes the segment >>counting based on pages before bouncing and could go wrong. >> >>What do you think of a patch like this? > >Artem, can you give this patch a try? This patch ostensibly fixes the issue - at least I cannot immediately reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov" Let's all contemplate the fact that blk_segment_map_sg() _overrunning the end of the provided sglist_ was this much of a clusterfuck to debug. From the look of it this fix has nothing to do with PAE, so then why only PAE users like me were affected by the original (b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c) patch?
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 10:38, Kent Overstreet wrote: On Tue, Dec 22, 2015 at 05:26:12AM +, Junichi Nomura wrote: On 12/22/15 12:59, Kent Overstreet wrote: > reproduced it with 32 bit pae: > >> 1. Exclude memory above 4G line with boot param "max_addr=4G". > > doesn't work - max_addr=1G doesn't work either > >> 2. Disable highmem with "highmem=0". > > works! > >> 3. Try booting 64bit kernel. > > works blk_queue_bio() does split then bounce, which makes the segment counting based on pages before bouncing and could go wrong. What do you think of a patch like this? Artem, can you give this patch a try? This patch ostensibly fixes the issue - at least I cannot immediately reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov" -- Jun'ichi Nomura, NEC Corporation diff --git a/block/blk-core.c b/block/blk-core.c index 5131993b..1d1c3c7 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1689,8 +1689,6 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) struct request *req; unsigned int request_count = 0; - blk_queue_split(q, &bio, q->bio_split); - /* * low level driver can indicate that it wants pages above a * certain limit bounced to low memory (ie for highmem, or even @@ -1698,6 +1696,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) */ blk_queue_bounce(q, &bio); + blk_queue_split(q, &bio, q->bio_split); + if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) { bio->bi_error = -EIO; bio_endio(bio);
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 01:07, Tejun Heo wrote: Hello, Artem. Can you please apply the following patch on top and see whether anything changes? If it does make the issue go away, can you please revert the ".can_queue" part and test again? Thanks. --- drivers/ata/ahci.h | 2 +- drivers/ata/libahci.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) --- a/drivers/ata/ahci.h +++ b/drivers/ata/ahci.h @@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde */ #define AHCI_SHT(drv_name) \ ATA_NCQ_SHT(drv_name), \ - .can_queue = AHCI_MAX_CMDS - 1, \ + .can_queue = 1/*AHCI_MAX_CMDS - 1*/, \ .sg_tablesize = AHCI_MAX_SG, \ .dma_boundary = AHCI_DMA_BOUNDARY, \ .shost_attrs = ahci_shost_attrs, \ --- a/drivers/ata/libahci.c +++ b/drivers/ata/libahci.c @@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev hpriv->saved_cap2 = cap2 = 0; /* some chips have errata preventing 64bit use */ - if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) { + if ((cap & HOST_CAP_64)/* && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)*/) { dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n"); cap &= ~HOST_CAP_64; } This patch fixes the issue for me. Now rechecking without .can_queue part.
BTW, since I left debugging on, here's the part you wanted: [0.613851] XXX port 0 dma_sz=91392 mem=c002 mem_dma=0002 cmd_slot=0 rx_fis=1024 cmd_tbl=1280 [0.613865] XXX port 1 dma_sz=91392 mem=eea0 mem_dma=2ea0 cmd_slot=0 rx_fis=1024 cmd_tbl=1280 [0.620464] XXX port 2 dma_sz=91392 mem=eea2 mem_dma=2ea2 cmd_slot=0 rx_fis=1024 cmd_tbl=1280 [0.627121] XXX port 3 dma_sz=91392 mem=eea4 mem_dma=2ea4 cmd_slot=0 rx_fis=1024 cmd_tbl=1280 [0.633791] XXX port 4 dma_sz=91392 mem=eea6 mem_dma=2ea6 cmd_slot=0 rx_fis=1024 cmd_tbl=1280 [0.640445] XXX port 5 dma_sz=91392 mem=eea8 mem_dma=2ea8 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 10:23, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvalds wrote: That said, we obviously need to figure out this current problem regardless first.. ... although maybe it *would* be interesting to hear what happens if you just compile a 64-bit kernel instead? Do you still see the problem? Because if not, then we should look very specifically for some 32-bit PAE issue. For example, maybe we use "unsigned long" somewhere where we should use "phys_addr_t". On x86-64, they obviously end up being the same. On normal non-PAE x86-32, they are also the same. But .. Let's wait for what Tejun Heo might say - I've applied his debugging patch and sent back the results. Building an x86_64 kernel here involves installing a 64bit Linux VM, so I'd like it to be the last resort.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 11:55, Tejun Heo wrote: Artem, can you please reproduce the issue with the following patch applied and attach the kernel log? Thanks. I've applied this patch on top of vanilla 4.3.3 kernel (without Linus'es revert). Hopefully it's how you intended it to be. Here's the result (I skipped the beginning of dmesg - it's the same as always - see bugzilla).[ 60.387407] Corrupted low memory at c0001000 (1000 phys) = cba3d25f [ 60.387411] Corrupted low memory at c0001004 (1004 phys) = e8f17ba7 [ 60.387413] Corrupted low memory at c0001008 (1008 phys) = 61cfa79a [ 60.387415] Corrupted low memory at c000100c (100c phys) = dc4d5d71 [ 60.387417] Corrupted low memory at c0001010 (1010 phys) = adbdc15b [ 60.387418] Corrupted low memory at c0001014 (1014 phys) = dee76bdc [ 60.387420] Corrupted low memory at c0001018 (1018 phys) = 827dee31 [ 60.387422] Corrupted low memory at c000101c (101c phys) = ef70cf7b [ 60.387423] Corrupted low memory at c0001020 (1020 phys) = 82fdee4d [ 60.387425] Corrupted low memory at c0001024 (1024 phys) = 77533c7b [ 60.387427] Corrupted low memory at c0001028 (1028 phys) = ddd4cf35 [ 60.387428] Corrupted low memory at c000102c (102c phys) = 7beea149 [ 60.387430] Corrupted low memory at c0001030 (1030 phys) = 798fe878 [ 60.387432] Corrupted low memory at c0001034 (1034 phys) = 4283a7a8 [ 60.387434] Corrupted low memory at c0001038 (1038 phys) = 4dee093d [ 60.387435] Corrupted low memory at c000103c (103c phys) = ee21ef73 [ 60.387437] Corrupted low memory at c0001040 (1040 phys) = fe3dc93d [ 60.387439] Corrupted low memory at c0001044 (1044 phys) = b8e7cf0d [ 60.387440] Corrupted low memory at c0001048 (1048 phys) = af3c9977 [ 60.387442] Corrupted low memory at c000104c (104c phys) = b80b7b8b [ 60.387444] Corrupted low memory at c0001050 (1050 phys) = b6f73d77 [ 60.387445] Corrupted low memory at c0001054 (1054 phys) = f7276f70 [ 60.387447] Corrupted low memory at c0001058 (1058 phys) = c62f70f6 [ 60.387449] Corrupted low memory at c000105c 
(105c phys) = 3ef734bd [ 60.387451] Corrupted low memory at c0001060 (1060 phys) = 1ef79f40 [ 60.387452] Corrupted low memory at c0001064 (1064 phys) = f1cf9f65 [ 60.387454] Corrupted low memory at c0001068 (1068 phys) = 297a5390 [ 60.387456] Corrupted low memory at c000106c (106c phys) = a7f14fbc [ 60.387457] Corrupted low memory at c0001070 (1070 phys) = 57ef71af [ 60.387459] Corrupted low memory at c0001074 (1074 phys) = 219d15e4 [ 60.387461] Corrupted low memory at c0001078 (1078 phys) = 7b99a2af [ 60.387462] Corrupted low memory at c000107c (107c phys) = c56d281b [ 60.387464] Corrupted low memory at c0001080 (1080 phys) = 3c84de6e [ 60.387466] Corrupted low memory at c0001084 (1084 phys) = edee56ec [ 60.387468] Corrupted low memory at c0001088 (1088 phys) = 49b557a7 [ 60.387469] Corrupted low memory at c000108c (108c phys) = 01baeb6a [ 60.387471] Corrupted low memory at c0001090 (1090 phys) = b775acde [ 60.387473] Corrupted low memory at c0001094 (1094 phys) = 30dd6851 [ 60.387474] Corrupted low memory at c0001098 (1098 phys) = f328fd0f [ 60.387476] Corrupted low memory at c000109c (109c phys) = 17ad185c [ 60.387478] Corrupted low memory at c00010a0 (10a0 phys) = b83985f5 [ 60.387479] Corrupted low memory at c00010a4 (10a4 phys) = 775b8af5 [ 60.387481] Corrupted low memory at c00010a8 (10a8 phys) = 3d35e4bc [ 60.387483] Corrupted low memory at c00010ac (10ac phys) = bf4d7b90 [ 60.387485] Corrupted low memory at c00010b0 (10b0 phys) = 1db6fd99 [ 60.387486] Corrupted low memory at c00010b4 (10b4 phys) = 3b94bf2f [ 60.387488] Corrupted low memory at c00010b8 (10b8 phys) = 5f447e55 [ 60.387490] Corrupted low memory at c00010bc (10bc phys) = dcfe6395 [ 60.387491] Corrupted low memory at c00010c0 (10c0 phys) = fc0b7a23 [ 60.387493] Corrupted low memory at c00010c4 (10c4 phys) = 32fa23aa [ 60.387495] Corrupted low memory at c00010c8 (10c8 phys) = e88ef3f8 [ 60.387496] Corrupted low memory at c00010cc (10cc phys) = 1ed7e14b [ 60.387498] Corrupted low memory at 
c00010d0 (10d0 phys) = 9fc3d7d1 [ 60.387500] Corrupted low memory at c00010d4 (10d4 phys) = 015f447f [ 60.387501] Corrupted low memory at c00010d8 (10d8 phys) = 7d11c17f [ 60.387503] Corrupted low memory at c00010dc (10dc phys) = 4785fc2d [ 60.387505] Corrupted low memory at c00010e0 (10e0 phys) = 5fe16bf4 [ 60.387507] Corrupted low memory at c00010e4 (10e4 phys) = 4de3fcc5 [ 60.387508] Corrupted low memory at c00010e8 (10e8 phys) = 4f477297 [ 60.387510] Corrupted low memory at c00010ec (10ec phys) = 59a47d35 [ 60.387512] Corrupted low memory at c00010f0 (10f0 phys) = c97c78df [ 60.387513] Corrupted low memory at c00010f4 (10f4 phys) = e3aafa4b [ 60.387515] Corrupted low memory at c00010f8 (10f8 phys) = 658bd8cb [ 60.387517] Corrupted low memory at c00010fc (10fc phys) = 6f5eb91f [ 60.387518] Corrupted low memory at c0001100 (1100 phys) = ca66ce3a [
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 09:32, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 5:50 PM, Artem S. Tashkinov wrote: P.S. I know Linus doesn't condone PAE but I still find it more preferable than running a mixed environment with almost zero benefit in regard to performance and quite obvious performance regressions related to an increased number of libraries being loaded (i686 + x86_64) and slightly bloated code which sometimes cannot fit in the CPU cache. Call me old fashioned but I won't upgrade to x86_64 until most of the things that I run locally are available for x86_64 and that won't happen any time soon. Don't upgrade *user* land. User land doesn't use the braindamage that is PAE. Just run a 64-bit kernel. Keep all your 32-bit userland apps and libraries. Trust me, that *will* be faster. PAE works really horribly badly, because all your really important data structures like your inodes and directory cache will all be in the low 1GB even if you have 16GB of RAM. Of course, I'd also like more people to run things that way just to get more coverage of the whole "yes, we do all the compat stuff correctly". So I have some other reasons to prefer people running 64-bit kernels with 32-bit user land. But PAE really is a disaster. In the past I happily ran an x86_64 kernel together with 32bit userland for quite some time but then I hit a wall: VirtualBox expects its kernel modules to have the same bitness as the application itself so I had to revert back to an i686 PAE setup. It's probably high time to try qemu however last time I looked at it a few years ago it lacked several crucial features I need from a VM.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 08:21, Ming Lei wrote: On Mon, Dec 21, 2015 at 10:25 AM, Artem S. Tashkinov wrote: # cat /sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size} 32767 32767 168 65536 Looks it is fine, then maybe it is related with BIOVEC_PHYS_MERGEABLE(), BIOVEC_SEG_BOUNDARY() or sort of thing, because dma_addr_t and phys_addr_t turn to 64-bit with PAE, but 'unsigned long' and 'void *' are still 32-bit. It was confirmed that there isn't the issue if PAE is disabled. Dumping both sata/ahci hw sg table and bio's bvec might be helpful. Um, sorry, what exact variables/files do you want to see? I'm not an expert in /sys. On Mon, Dec 21, 2015 at 10:32 AM, Kent Overstreet wrote: oy vey. WTF's been happening in blk-merge.c? They're not the same bug. The bug in your thread was introduced by Jens in 5014c311ba "block: fix bogus compiler warnings in blk-merge.c", where he screwed up the bvprv handling - but that patch comes after the patch Artem bisected to. blk_bio_segment_split() looks correct in b54ffb73ca. Yes, that is why reverting 578270bfb (block: fix segment split) can make the issue disappear, because 5014c311ba "block: fix bogus compiler warnings in blk-merge.c" basically disables sg-merge and prevents the issue from being triggered.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 07:18, Ming Lei wrote: On Mon, Dec 21, 2015 at 9:50 AM, Artem S. Tashkinov wrote: BTW, I have posted very similar issue in the link: http://marc.info/?l=linux-ide&m=145066119623811&w=2 Artem, I noticed from bugzilla that the hardware is i386, just wondering if PAE is enabled? If yes, I am more confident that both the two kinds of report are similar or same. Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically mentioned in the corresponding bug report. OK, could you dump value of the following files under /sys/block/sdN/queue/ ? max_hw_sectors_kb max_sectors_kb max_segments max_segment_size 'sdN' is the faulted disk name. # cat /sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size} 32767 32767 168 65536
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 06:38, Ming Lei wrote: On Mon, Dec 21, 2015 at 1:51 AM, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. (Also Tejun - maybe you can see what's up - maybe that error message tells you something) I'm not sure what's up with his machine, the disk doesn't seem to be anything particularly unusual, it looks like a 1TB Seagate Barracuda: ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133 which doesn't strike me as odd. Looking at the dmesg, it also looks like it's a pretty normal Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI ID for the AHCI chip seems to be (INTEL, 0x1c02). Any ideas? Anybody? BTW, I have posted very similar issue in the link: http://marc.info/?l=linux-ide&m=145066119623811&w=2 Artem, I noticed from bugzilla that the hardware is i386, just wondering if PAE is enabled? If yes, I am more confident that both the two kinds of report are similar or same. Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically mentioned in the corresponding bug report. P.S. I know Linus doesn't condone PAE but I still find it more preferable than running a mixed environment with almost zero benefit in regard to performance and quite obvious performance regressions related to an increased number of libraries being loaded (i686 + x86_64) and slightly bloated code which sometimes cannot fit in the CPU cache. Call me old fashioned but I won't upgrade to x86_64 until most of the things that I run locally are available for x86_64 and that won't happen any time soon.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 04:42, Kent Overstreet wrote: On Mon, Dec 21, 2015 at 04:25:12AM +0500, Artem S. Tashkinov wrote: On 2015-12-20 23:18, Christoph Hellwig wrote: >On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: >>Kent, Jens, Christoph et al, >> please see this bugzilla: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=109661 >> >>where Artem Tashkinov bisected his problems with 4.3 down to commit >>b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all >>signed off on. > >Artem, > >can you re-check the commits around this series again? I would be >extremely surprised if it's really this particular commit and not >one just before it causing the problem - it just allocates bios >to the biggest possible instead of only allocating up to what >bio_add_page would accept. I'm positive about this particular commit. Of course, it might be another GCC 4.7.4 miscompilation which causes the errors which shouldn't be there but I'm not an expert, so. I believe you on the commit, and I doubt this has anything to do with gcc - the errors you're getting are exactly what you normally get when you send the device an sglist to dma to/from that it doesn't like. The queue limits stuff is annoyingly fragile, you'd think we'd be able to check directly in the driver that the stuff we're sending the device is sane but we don't. If I came up with a debug patch could you try it out? I don't have any ideas for one yet, but if someone who knows the ATA code doesn't jump in I'll call up Tejun and make him walk me through it. No problem, I just hope that this particular access mode (and your debug patch) won't decrease the lifespan of my HDD. Seagate HDDs have been very fragile (read atrociously unreliable) for the past five years.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:44, Kent Overstreet wrote: On Sun, Dec 20, 2015 at 07:18:01PM +0100, Christoph Hellwig wrote: On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: > Kent, Jens, Christoph et al, > please see this bugzilla: > > https://bugzilla.kernel.org/show_bug.cgi?id=109661 > > where Artem Tashkinov bisected his problems with 4.3 down to commit > b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all > signed off on. Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. pretty sure it's something with how blk_bio_segment_split() decides what segments are mergable and not. bio_get_nr_vecs() was just returning nr_pages == queue_max_segments (ignoring sectors for the moment) - so wait, wtf? that's basically assuming no segment merging can ever happen, if it does then this was causing us to send smaller requests to the device than we could have been. so actually two possibilities I can see: - in blk_bio_segment_split(), something's screwed up with how it decides what segments are going to be mergable or not. but I don't think that's likely since it's doing the exact same thing the rest of the segment merging code does. - or, the driver was lying in its queue limits, using queue_max_segments for "the maximum number of pages I can possibly take", and that bug lurked undiscovered because of the screwed-upness in bio_get_nr_vecs(). Offhand I don't know where to start digging in the driver code to look into the second theory though. Tejun, you got any ideas?
Here's an actual bisect log which Linus was missing: git bisect start # bad: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3 git bisect bad 6a13feb9c82803e2b815eca72fa7a9f5561d7861 # good: [64291f7db5bd8150a74ad2036f1037e6a0428df2] Linux 4.2 git bisect good 64291f7db5bd8150a74ad2036f1037e6a0428df2 # bad: [807249d3ada1ff28a47c4054ca4edd479421b671] Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus git bisect bad 807249d3ada1ff28a47c4054ca4edd479421b671 # good: [102178108e2246cb4b329d3fb7872cd3d7120205] Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc git bisect good 102178108e2246cb4b329d3fb7872cd3d7120205 # good: [62da98656b62a5ca57f22263705175af8ded5aa1] netfilter: nf_conntrack: make nf_ct_zone_dflt built-in git bisect good 62da98656b62a5ca57f22263705175af8ded5aa1 # good: [f1a3c0b933e7ff856223d6fcd7456d403e54e4e5] Merge tag 'devicetree-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux git bisect good f1a3c0b933e7ff856223d6fcd7456d403e54e4e5 # bad: [9cbf22b37ae0592dea809cb8d424990774c21786] Merge tag 'dlm-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm git bisect bad 9cbf22b37ae0592dea809cb8d424990774c21786 # good: [8bdc69b764013a9b5ebeef7df8f314f1066c5d79] Merge branch 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup git bisect good 8bdc69b764013a9b5ebeef7df8f314f1066c5d79 # good: [df910390e2db07a76c87f258475f6c96253cee6c] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi git bisect good df910390e2db07a76c87f258475f6c96253cee6c # bad: [d975f309a8b250e67b66eabeb56be6989c783629] Merge branch 'for-4.3/sg' of git://git.kernel.dk/linux-block git bisect bad d975f309a8b250e67b66eabeb56be6989c783629 # bad: [89e2a8404e4415da1edbac6ca4f7332b4a74fae2] crypto/omap-sham: remove an open coded access to ->page_link git bisect bad 89e2a8404e4415da1edbac6ca4f7332b4a74fae2 # good: 
[0e28997ec476bad4c7dbe0a08775290051325f53] btrfs: remove bio splitting and merge_bvec_fn() calls git bisect good 0e28997ec476bad4c7dbe0a08775290051325f53 # bad: [2ec3182f9c20a9eef0dacc0512cf2ca2df7be5ad] Documentation: update notes in biovecs about arbitrarily sized bios git bisect bad 2ec3182f9c20a9eef0dacc0512cf2ca2df7be5ad # good: [7140aafce2fc14c5af02fdb7859b6bea0108be3d] md/raid5: get rid of bio_fits_rdev() git bisect good 7140aafce2fc14c5af02fdb7859b6bea0108be3d # good: [6cf66b4caf9c71f64a5486cadbd71ab58d0d4307] fs: use helper bio_add_page() instead of open coding on bi_io_vec git bisect good 6cf66b4caf9c71f64a5486cadbd71ab58d0d4307 # bad: [b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c] block: remove bio_get_nr_vecs() git bisect bad b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c And like he said since the step before the last one was good and the very last one was bad there was no way I could have made a mistake.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:41, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 10:18 AM, Christoph Hellwig wrote: Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. Judging by Artem's bisect log, the last commit he tested before the bad one was the commit before: commit 6cf66b4caf9c ("fs: use helper bio_add_page() instead of open coding on bi_io_vec") and he marked that one good. Sadly, without CONFIG_LOCALVERSION_AUTO, there's no way to match up the dmesg files (in the same bisection tar-file as the bisection log) with the actual versions. Also, Artem's bisect.log isn't actually the .git/BISECT_LOG file that contains the full information about what was marked good and bad, so it's a bit hard to read (ie I can tell that Artem had to mark commit 6cf66b4caf9c as "good" not because his log says so, but because that explains the next commit to be tested). Of course, it's fairly easy to make a mistake while bisecting (just doing a thinko), but usually bisection mistakes end up causing you to go into some "all good" or "all bad" region of commits, and the fact that Artem seems to have marked the previous commit good and the final commit bad does seem to imply the bisection was successful. But yes, it is always nice to double-check the bisection results. The best way to do it is generally to try to revert the bad commit and verify that things work after that, but that commit doesn't revert cleanly on top of 4.3 due to other changes. Attached is a *COMPLETELY*UNTESTED* revertish patch for 4.3. It's basically a revert of b54ffb73cadc, but with a few fixups to make the revert work on top of 4.3.
So Artem, if you can test whether 4.3 works with that revert, and/or double-check booting that b54ffb73cadc again (to verify that it's really bad), and its parent (to double-check that it's really good), that would be a good way to verify that yes, it is really that *one* commit that breaks things for you. After reverting (applying) this patch on top of 4.3.3 everything is back to normal. It's indeed a guilty commit.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:18, Christoph Hellwig wrote: On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. I'm positive about this particular commit. Of course, it might be another GCC 4.7.4 miscompilation which causes the errors which shouldn't be there but I'm not an expert, so.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 22:51, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. (Also Tejun - maybe you can see what's up - maybe that error message tells you something) I'm not sure what's up with his machine, the disk doesn't seem to be anything particularly unusual, it looks like a 1TB Seagate Barracuda: ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133 which doesn't strike me as odd. Looking at the dmesg, it also looks like it's a pretty normal Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI ID for the AHCI chip seems to be (INTEL, 0x1c02). Any ideas? Anybody? That's correct. That's a very usual Asus P8P67 Pro motherboard (Intel P67 chipset) in AHCI mode and run of the mill HDD which is the one you identified.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 04:42, Kent Overstreet wrote: On Mon, Dec 21, 2015 at 04:25:12AM +0500, Artem S. Tashkinov wrote: On 2015-12-20 23:18, Christoph Hellwig wrote: >On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: >>Kent, Jens, Christoph et al, >> please see this bugzilla: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=109661 >> >>where Artem Tashkinov bisected his problems with 4.3 down to commit >>b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all >>signed off on. > >Artem, > >can you re-check the commits around this series again? I would be >extremtly surprised if it's really this particular commit and not >one just before it causing the problem - it just allocates bios >to the biggest possible instead of only allocating up to what >bio_add_page would accept. I'm positive about this particular commit. Of course, it might be another GCC 4.7.4 miscompilation which causes the errors which shouldn't be there but I'm not an expert, so. I believe you on the commit, and I doubt this has anything to do with gcc - the errors you're getting are exactly what you normally get when you send the device an sglist to dma to/from that it doesn't like. The queue limits stuff is annoyingly fragile, you'd think we'd be able to check directly in the driver that the stuff we're sending the device is sane but we don't. If I came up with a debug patch could you try it out? I don't have any ideas for one yet, but if someone who knows the ATA code doesn't jump in I'll call up Tejun and make him walk me through it. No problem, I just hope that this particular access mode (and you debug patch) won't decrease the lifespan of my HDD. Seagate HDDs have been very fragile (read atrociously unreliable) for the past five years. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:41, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 10:18 AM, Christoph Hellwig wrote: Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. Judging by Artem's bisect log, the last commit he tested before the bad one was the commit before: commit 6cf66b4caf9c ("fs: use helper bio_add_page() instead of open coding on bi_io_vec") and he marked that one good. Sadly, without CONFIG_LOCALVERSION_AUTO, there's no way to match up the dmesg files (in the same bisection tar-file as the bisection log) with the actual versions. Also, Artem's bisect.log isn't actually the .git/BISECT_LOG file that contains the full information about what was marked good and bad, so it's a bit hard to read (ie I can tell that Artem had to mark commit 6cf66b4caf9c as "good" not because his log says so, but because that explains the next commit to be tested). Of course, it's fairly easy to make a mistake while bisecting (just doing a thinko), but usually bisection mistakes end up causing you to go into some "all good" or "all bad" region of commits, and the fact that Artem seems to have marked the previous commit good and the final commit bad does seem to imply the bisection was successful. But yes, it is always nice to double-check the bisection results. The best way to do it is generally to try to revert the bad commit and verify that things work after that, but that commit doesn't revert cleanly on top of 4.3 due to other changes. Attached is a *COMPLETELY*UNTESTED* revertish patch for 4.3. It's basically a revert of b54ffb73cadc, but with a few fixups to make the revert work on top of 4.3.
So Artem, if you can test whether 4.3 works with that revert, and/or double-check booting that b54ffb73cadc again (to verify that it's really bad), and its parent (to double-check that it's really good), that would be a good way to verify that yes, it is really that *one* commit that breaks things for you. After reverting (applying) this patch on top of 4.3.3 everything is back to normal. It's indeed a guilty commit.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 08:21, Ming Lei wrote: On Mon, Dec 21, 2015 at 10:25 AM, Artem S. Tashkinov wrote: # cat /sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size} 32767 32767 168 65536 Looks it is fine, then maybe it is related with BIOVEC_PHYS_MERGEABLE(), BIOVEC_SEG_BOUNDARY() or sort of thing, because dma_addr_t and phys_addr_t turn to 64-bit with PAE, but 'unsigned long' and 'void *' is still 32bit. It was confirmed that there isn't the issue if PAE is disabled. Dumping both sata/ahci hw sg table and bio's bvec might be helpful. Um, sorry, what exact variables/files do you want to see? I'm not an expert in /sys. On Mon, Dec 21, 2015 at 10:32 AM, Kent Overstreet wrote: oy vey. WTF's been happening in blk-merge.c? They're not the same bug. The bug in your thread was introduced by Jens in 5014c311ba "block: fix bogus compiler warnings in blk-merge.c", where he screwed up the bvprv handling - but that patch comes after the patch Artem bisected to. blk_bio_segment_split() looks correct in b54ffb73ca. Yes, that is why reverting 578270bfb ("block: fix segment split") can make the issue disappear, because 5014c311ba "block: fix bogus compiler warnings in blk-merge.c" basically disables sg-merge and prevents the issue from being triggered.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 09:32, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 5:50 PM, Artem S. Tashkinov wrote: P.S. I know Linus doesn't condone PAE but I still find it more preferable than running a mixed environment with almost zero benefit in regard to performance and quite obvious performance regressions related to an increased number of libraries being loaded (i686 + x86_64) and slightly bloated code which sometimes cannot fit in the CPU cache. Call me old fashioned but I won't upgrade to x86_64 until most of the things that I run locally are available for x86_64 and that won't happen any time soon. Don't upgrade *user* land. User land doesn't use the braindamage that is PAE. Just run a 64-bit kernel. Keep all your 32-bit userland apps and libraries. Trust me, that *will* be faster. PAE works really horribly badly, because all your really important data structures like your inodes and directory cache will all be in the low 1GB even if you have 16GB of RAM. Of course, I'd also like more people to run things that way just to get more coverage of the whole "yes, we do all the compat stuff correctly". So I have some other reasons to prefer people running 64-bit kernels with 32-bit user land. But PAE really is a disaster. In the past I happily ran an x86_64 kernel together with 32-bit userland for quite some time but then I hit a wall: VirtualBox expects its kernel modules to have the same bitness as the application itself so I had to revert back to an i686 PAE setup. It's probably high time to try qemu however last time I looked at it a few years ago it lacked several crucial features I need from a VM.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:44, Kent Overstreet wrote: On Sun, Dec 20, 2015 at 07:18:01PM +0100, Christoph Hellwig wrote: On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: > Kent, Jens, Christoph et al, > please see this bugzilla: > > https://bugzilla.kernel.org/show_bug.cgi?id=109661 > > where Artem Tashkinov bisected his problems with 4.3 down to commit > b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all > signed off on. Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. pretty sure it's something with how blk_bio_segment_split() decides what segments are mergable and not. bio_get_nr_vecs() was just returning nr_pages == queue_max_segments (ignoring sectors for the moment) - so wait, wtf? that's basically assuming no segment merging can ever happen, if it does then this was causing us to send smaller requests to the device than we could have been. so actually two possibilities I can see: - in blk_bio_segment_split(), something's screwed up with how it decides what segments are going to be mergable or not. but I don't think that's likely since it's doing the exact same thing the rest of the segment merging code does. - or, the driver was lying in its queue limits, using queue_max_segments for "the maximum number of pages I can possibly take", and that bug lurked undiscovered because of the screwed-upness in bio_get_nr_vecs(). Offhand I don't know where to start digging in the driver code to look into the second theory though. Tejun, you got any ideas?
Here's an actual bisect log which Linus was missing: git bisect start # bad: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3 git bisect bad 6a13feb9c82803e2b815eca72fa7a9f5561d7861 # good: [64291f7db5bd8150a74ad2036f1037e6a0428df2] Linux 4.2 git bisect good 64291f7db5bd8150a74ad2036f1037e6a0428df2 # bad: [807249d3ada1ff28a47c4054ca4edd479421b671] Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus git bisect bad 807249d3ada1ff28a47c4054ca4edd479421b671 # good: [102178108e2246cb4b329d3fb7872cd3d7120205] Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc git bisect good 102178108e2246cb4b329d3fb7872cd3d7120205 # good: [62da98656b62a5ca57f22263705175af8ded5aa1] netfilter: nf_conntrack: make nf_ct_zone_dflt built-in git bisect good 62da98656b62a5ca57f22263705175af8ded5aa1 # good: [f1a3c0b933e7ff856223d6fcd7456d403e54e4e5] Merge tag 'devicetree-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux git bisect good f1a3c0b933e7ff856223d6fcd7456d403e54e4e5 # bad: [9cbf22b37ae0592dea809cb8d424990774c21786] Merge tag 'dlm-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm git bisect bad 9cbf22b37ae0592dea809cb8d424990774c21786 # good: [8bdc69b764013a9b5ebeef7df8f314f1066c5d79] Merge branch 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup git bisect good 8bdc69b764013a9b5ebeef7df8f314f1066c5d79 # good: [df910390e2db07a76c87f258475f6c96253cee6c] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi git bisect good df910390e2db07a76c87f258475f6c96253cee6c # bad: [d975f309a8b250e67b66eabeb56be6989c783629] Merge branch 'for-4.3/sg' of git://git.kernel.dk/linux-block git bisect bad d975f309a8b250e67b66eabeb56be6989c783629 # bad: [89e2a8404e4415da1edbac6ca4f7332b4a74fae2] crypto/omap-sham: remove an open coded access to ->page_link git bisect bad 89e2a8404e4415da1edbac6ca4f7332b4a74fae2 # good: 
[0e28997ec476bad4c7dbe0a08775290051325f53] btrfs: remove bio splitting and merge_bvec_fn() calls git bisect good 0e28997ec476bad4c7dbe0a08775290051325f53 # bad: [2ec3182f9c20a9eef0dacc0512cf2ca2df7be5ad] Documentation: update notes in biovecs about arbitrarily sized bios git bisect bad 2ec3182f9c20a9eef0dacc0512cf2ca2df7be5ad # good: [7140aafce2fc14c5af02fdb7859b6bea0108be3d] md/raid5: get rid of bio_fits_rdev() git bisect good 7140aafce2fc14c5af02fdb7859b6bea0108be3d # good: [6cf66b4caf9c71f64a5486cadbd71ab58d0d4307] fs: use helper bio_add_page() instead of open coding on bi_io_vec git bisect good 6cf66b4caf9c71f64a5486cadbd71ab58d0d4307 # bad: [b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c] block: remove bio_get_nr_vecs() git bisect bad b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c And like he said since the step before the last one was good and the very last one was bad there was no way I could have made a mistake.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 22:51, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. (Also Tejun - maybe you can see what's up - maybe that error message tells you something) I'm not sure what's up with his machine, the disk doesn't seem to be anything particularly unusual, it looks like a 1TB Seagate Barracuda: ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133 which doesn't strike me as odd. Looking at the dmesg, it also looks like it's a pretty normal Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI ID for the AHCI chip seems to be (INTEL, 0x1c02). Any ideas? Anybody? That's correct. That's a very usual Asus P8P67 Pro motherboard (Intel P67 chipset) in AHCI mode and a run of the mill HDD which is the one you identified.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:18, Christoph Hellwig wrote: On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. I'm positive about this particular commit. Of course, it might be another GCC 4.7.4 miscompilation which causes the errors which shouldn't be there but I'm not an expert, so.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 06:38, Ming Lei wrote: On Mon, Dec 21, 2015 at 1:51 AM, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. (Also Tejun - maybe you can see what's up - maybe that error message tells you something) I'm not sure what's up with his machine, the disk doesn't seem to be anything particularly unusual, it looks like a 1TB Seagate Barracuda: ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133 which doesn't strike me as odd. Looking at the dmesg, it also looks like it's a pretty normal Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI ID for the AHCI chip seems to be (INTEL, 0x1c02). Any ideas? Anybody? BTW, I have posted a very similar issue in the link: http://marc.info/?l=linux-ide&m=145066119623811&w=2 Artem, I noticed from bugzilla that the hardware is i386, just wondering if PAE is enabled? If yes, I am more confident that the two kinds of reports are similar or the same. Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically mentioned in the corresponding bug report. P.S. I know Linus doesn't condone PAE but I still find it more preferable than running a mixed environment with almost zero benefit in regard to performance and quite obvious performance regressions related to an increased number of libraries being loaded (i686 + x86_64) and slightly bloated code which sometimes cannot fit in the CPU cache. Call me old fashioned but I won't upgrade to x86_64 until most of the things that I run locally are available for x86_64 and that won't happen any time soon.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 07:18, Ming Lei wrote: On Mon, Dec 21, 2015 at 9:50 AM, Artem S. Tashkinov wrote: BTW, I have posted a very similar issue in the link: http://marc.info/?l=linux-ide&m=145066119623811&w=2 Artem, I noticed from bugzilla that the hardware is i386, just wondering if PAE is enabled? If yes, I am more confident that the two kinds of reports are similar or the same. Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically mentioned in the corresponding bug report. OK, could you dump the values of the following files under /sys/block/sdN/queue/ ? max_hw_sectors_kb max_sectors_kb max_segments max_segment_size 'sdN' is the faulted disk name. # cat /sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size} 32767 32767 168 65536
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 11:55, Tejun Heo wrote: Artem, can you please reproduce the issue with the following patch applied and attach the kernel log? Thanks. I've applied this patch on top of vanilla 4.3.3 kernel (without Linus'es revert). Hopefully it's how you intended it to be. Here's the result (I skipped the beginning of dmesg - it's the same as always - see bugzilla).[ 60.387407] Corrupted low memory at c0001000 (1000 phys) = cba3d25f [ 60.387411] Corrupted low memory at c0001004 (1004 phys) = e8f17ba7 [ 60.387413] Corrupted low memory at c0001008 (1008 phys) = 61cfa79a [ 60.387415] Corrupted low memory at c000100c (100c phys) = dc4d5d71 [ 60.387417] Corrupted low memory at c0001010 (1010 phys) = adbdc15b [ 60.387418] Corrupted low memory at c0001014 (1014 phys) = dee76bdc [ 60.387420] Corrupted low memory at c0001018 (1018 phys) = 827dee31 [ 60.387422] Corrupted low memory at c000101c (101c phys) = ef70cf7b [ 60.387423] Corrupted low memory at c0001020 (1020 phys) = 82fdee4d [ 60.387425] Corrupted low memory at c0001024 (1024 phys) = 77533c7b [ 60.387427] Corrupted low memory at c0001028 (1028 phys) = ddd4cf35 [ 60.387428] Corrupted low memory at c000102c (102c phys) = 7beea149 [ 60.387430] Corrupted low memory at c0001030 (1030 phys) = 798fe878 [ 60.387432] Corrupted low memory at c0001034 (1034 phys) = 4283a7a8 [ 60.387434] Corrupted low memory at c0001038 (1038 phys) = 4dee093d [ 60.387435] Corrupted low memory at c000103c (103c phys) = ee21ef73 [ 60.387437] Corrupted low memory at c0001040 (1040 phys) = fe3dc93d [ 60.387439] Corrupted low memory at c0001044 (1044 phys) = b8e7cf0d [ 60.387440] Corrupted low memory at c0001048 (1048 phys) = af3c9977 [ 60.387442] Corrupted low memory at c000104c (104c phys) = b80b7b8b [ 60.387444] Corrupted low memory at c0001050 (1050 phys) = b6f73d77 [ 60.387445] Corrupted low memory at c0001054 (1054 phys) = f7276f70 [ 60.387447] Corrupted low memory at c0001058 (1058 phys) = c62f70f6 [ 60.387449] Corrupted low memory at c000105c 
(105c phys) = 3ef734bd [ 60.387451] Corrupted low memory at c0001060 (1060 phys) = 1ef79f40 [ 60.387452] Corrupted low memory at c0001064 (1064 phys) = f1cf9f65 [ 60.387454] Corrupted low memory at c0001068 (1068 phys) = 297a5390 [ 60.387456] Corrupted low memory at c000106c (106c phys) = a7f14fbc [ 60.387457] Corrupted low memory at c0001070 (1070 phys) = 57ef71af [ 60.387459] Corrupted low memory at c0001074 (1074 phys) = 219d15e4 [ 60.387461] Corrupted low memory at c0001078 (1078 phys) = 7b99a2af [ 60.387462] Corrupted low memory at c000107c (107c phys) = c56d281b [ 60.387464] Corrupted low memory at c0001080 (1080 phys) = 3c84de6e [ 60.387466] Corrupted low memory at c0001084 (1084 phys) = edee56ec [ 60.387468] Corrupted low memory at c0001088 (1088 phys) = 49b557a7 [ 60.387469] Corrupted low memory at c000108c (108c phys) = 01baeb6a [ 60.387471] Corrupted low memory at c0001090 (1090 phys) = b775acde [ 60.387473] Corrupted low memory at c0001094 (1094 phys) = 30dd6851 [ 60.387474] Corrupted low memory at c0001098 (1098 phys) = f328fd0f [ 60.387476] Corrupted low memory at c000109c (109c phys) = 17ad185c [ 60.387478] Corrupted low memory at c00010a0 (10a0 phys) = b83985f5 [ 60.387479] Corrupted low memory at c00010a4 (10a4 phys) = 775b8af5 [ 60.387481] Corrupted low memory at c00010a8 (10a8 phys) = 3d35e4bc [ 60.387483] Corrupted low memory at c00010ac (10ac phys) = bf4d7b90 [ 60.387485] Corrupted low memory at c00010b0 (10b0 phys) = 1db6fd99 [ 60.387486] Corrupted low memory at c00010b4 (10b4 phys) = 3b94bf2f [ 60.387488] Corrupted low memory at c00010b8 (10b8 phys) = 5f447e55 [ 60.387490] Corrupted low memory at c00010bc (10bc phys) = dcfe6395 [ 60.387491] Corrupted low memory at c00010c0 (10c0 phys) = fc0b7a23 [ 60.387493] Corrupted low memory at c00010c4 (10c4 phys) = 32fa23aa [ 60.387495] Corrupted low memory at c00010c8 (10c8 phys) = e88ef3f8 [ 60.387496] Corrupted low memory at c00010cc (10cc phys) = 1ed7e14b [ 60.387498] Corrupted low memory at 
c00010d0 (10d0 phys) = 9fc3d7d1 [ 60.387500] Corrupted low memory at c00010d4 (10d4 phys) = 015f447f [ 60.387501] Corrupted low memory at c00010d8 (10d8 phys) = 7d11c17f [ 60.387503] Corrupted low memory at c00010dc (10dc phys) = 4785fc2d [ 60.387505] Corrupted low memory at c00010e0 (10e0 phys) = 5fe16bf4 [ 60.387507] Corrupted low memory at c00010e4 (10e4 phys) = 4de3fcc5 [ 60.387508] Corrupted low memory at c00010e8 (10e8 phys) = 4f477297 [ 60.387510] Corrupted low memory at c00010ec (10ec phys) = 59a47d35 [ 60.387512] Corrupted low memory at c00010f0 (10f0 phys) = c97c78df [ 60.387513] Corrupted low memory at c00010f4 (10f4 phys) = e3aafa4b [ 60.387515] Corrupted low memory at c00010f8 (10f8 phys) = 658bd8cb [ 60.387517] Corrupted low memory at c00010fc (10fc phys) = 6f5eb91f [ 60.387518] Corrupted low memory at c0001100 (1100 phys) = ca66ce3a [
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 10:23, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvalds wrote: That said, we obviously need to figure out this current problem regardless first.. ... although maybe it *would* be interesting to hear what happens if you just compile a 64-bit kernel instead? Do you still see the problem? Because if not, then we should look very specifically for some 32-bit PAE issue. For example, maybe we use "unsigned long" somewhere where we should use "phys_addr_t". On x86-64, they obviously end up being the same. On normal non-PAE x86-32, they are also the same. But .. Let's wait for what Tejun Heo might say - I've applied his debugging patch and sent back the results. Building an x86_64 kernel here involves installing a 64-bit Linux VM, so I'd like it to be the last resort.
Not being able to reread the partition table - why is Linux so 90x?
Hello, I wonder why in 2013 I still cannot modify _unused_ partitions on the fly, yeah, the Internet is full of: # hdparm -z /dev/sda /dev/sda: re-reading partition table BLKRRPART failed: Device or resource busy # fdisk (after adding a new partition using unused space on my hdd) ... Command (m for help): w The partition table has been altered. Calling ioctl() to re-read partition table. Re-reading the partition table failed.: Device or resource busy The SCSI rescan command doesn't work either. I do understand that the Linux kernel doesn't have any form of revoke() but then it prevents me from altering the partitions which are not used - it's 100% counter intuitive. Windows, for instance, has allowed modifying even a system partition on the fly since 2006; Linux doesn't allow adding partitions without rebooting the system. Could anyone elaborate, please? Best regards, Artem
Re: Disabling in-memory write cache for x86-64 in Linux II
Oct 30, 2013 02:41:01 AM, Jack wrote: On Fri 25-10-13 19:37:53, Ted Tso wrote: >> Sure, although I wonder if it would be worth it to calculate some kind of >> rolling average of the write bandwidth while we are doing writeback, >> so if it turns out we got unlucky with the contents of the first 100MB >> of dirty data (it could be either highly random or highly sequential) >> then we'll eventually correct to the right level. > We already do average measured throughput over a longer time window and >have a kind of rolling average algorithm doing some averaging. > >> This means that VM would have to keep dirty page counters for each BDI >> --- which I thought we weren't doing right now, which is why we have a >> global vm.dirty_ratio/vm.dirty_background_ratio threshold. (Or do I >> have cause and effect reversed? :-) > And we do currently keep the number of dirty & under writeback pages per >BDI. We have global limits because mm wants to limit the total number of dirty >pages (as those are harder to free). It doesn't care as much to which device >these pages belong (although it probably should care a bit more because >there are huge differences between how quickly can different devices get rid >of dirty pages). This might sound like an absolutely stupid question which makes no sense at all, so I want to apologize for it in advance, but since the Linux kernel lacks revoke(), does that mean that dirty buffers will always occupy the kernel memory if I for instance remove my USB stick before the kernel has had the time to flush these buffers?
Re: Disabling in-memory write cache for x86-64 in Linux II
Oct 26, 2013 02:44:07 AM, neil wrote: On Fri, 25 Oct 2013 18:26:23 + (UTC) "Artem S. Tashkinov" >> >> Exactly. And not being able to use applications which show you IO performance >> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine >> my life without being able to see the progress of a copying operation. With >> the current >> dirty cache there's no way to understand how your storage media actually >> behaves. > >So fix Midnight Commander. If you want the copy to be actually finished when >it says it is finished, then it needs to call 'fsync()' at the end. This sounds like a very bad joke. How are applications supposed to show and calculate an _average_ write speed if there are no kernel calls/ioctls to actually make the kernel flush dirty buffers _during_ copying? Actually it's a good way to solve this problem in user space - alas, even if such calls are implemented, user space will start using them only in 2018 if not further from that. >> >> Per device dirty cache seems like a nice idea, I, for one, would like to >> disable it >> altogether or make it an absolute minimum for things like USB flash drives - >> because >> I don't care about multithreaded performance or delayed allocation on such >> devices - >> I'm interested in my data reaching my USB stick ASAP - because it's how most >> people >> use them. >> > >As has already been said, you can substantially disable the cache by tuning >down various values in /proc/sys/vm/. >Have you tried? I don't understand who you are replying to. I asked about per device settings, you are again referring me to system wide settings - they don't look that good if we're talking about a 3MB/sec flash drive and a 500MB/sec SSD drive. Besides it makes no sense to allocate 20% of physical RAM for things which don't belong to it in the first place. I don't know any other OS which has a similar behaviour.
And like people (including me) have already mentioned, such a huge dirty cache can stall their PCs/servers for a considerable amount of time. Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also not everyone in this world has a UPS - which means such a huge buffer can lead to a serious data loss in case of a power blackout. Regards, Artem
Re: Disabling in-memory write cache for x86-64 in Linux II
Oct 25, 2013 05:26:45 PM, david wrote: On Fri, 25 Oct 2013, NeilBrown wrote: > >> >> What exactly is bothering you about this? The amount of memory used or the >> time until data is flushed? > >actually, I think the problem is more the impact of the huge write later on. Exactly. And not being able to use applications which show you IO performance like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine my life without being able to see the progress of a copying operation. With the current dirty cache there's no way to understand how your storage media actually behaves. Hopefully this issue won't dissolve into obscurity and someone will actually make up a plan (and a patch) for how to make the dirty write cache behave in a sane manner, considering the fact that there are devices with very different write speeds and requirements. It'd be even better if I could specify the dirty cache as a mount option (though sane defaults or semi-automatic values based on runtime estimates won't hurt). Per device dirty cache seems like a nice idea; I, for one, would like to disable it altogether or make it an absolute minimum for things like USB flash drives - because I don't care about multithreaded performance or delayed allocation on such devices - I'm interested in my data reaching my USB stick ASAP - because it's how most people use them. Regards, Artem
Re: Disabling in-memory write cache for x86-64 in Linux II
Oct 25, 2013 02:18:50 PM, Linus Torvalds wrote: On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov wrote: >> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel >> built for the i686 (with PAE) and x86-64 architectures. What's really >> troubling me >> is that the x86-64 kernel has the following problem: >> >> When I copy large files to any storage device, be it my HDD with ext4 >> partitions >> or flash drive with FAT32 partitions, the kernel first caches them in memory >> entirely >> then flushes them some time later (quite unpredictably though) or >> immediately upon >> invoking "sync". > >Yeah, I think we default to a 10% "dirty background memory" (and >allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB >of dirty memory for writeout before we even start writing, and twice >that before we start *waiting* for it. > >On 32-bit x86, we only count the memory in the low 1GB (really >actually up to about 890MB), so "10% dirty" really means just about >90MB of buffering (and a "hard limit" of ~180MB of dirty). > >And that "up to 3.2GB of dirty memory" is just crazy. Our defaults >come from the old days of less memory (and perhaps servers that don't >much care), and the fact that x86-32 ends up having much lower limits >even if you end up having more memory. > >You can easily tune it: > >echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes >echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes > >or similar. But you're right, we need to make the defaults much saner. > >Wu? Andrew? Comments? > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or more) this value becomes unrealistic (13GB) and I've already had some unpleasant effects due to it. I.e. 
when I dump a large MySQL database (its dump weighs around 10GB) - it appears on the disk almost immediately, but then, later, when the kernel decides to flush it to the disk, the server almost stalls and other IO requests take a lot more time to complete, even though mysqldump is run with ionice -c3 - so the use of ionice has no real effect. Artem
Disabling in-memory write cache for x86-64 in Linux II
Hello! On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel built for the i686 (with PAE) and x86-64 architectures. What's really troubling me is that the x86-64 kernel has the following problem: When I copy large files to any storage device, be it my HDD with ext4 partitions or a flash drive with FAT32 partitions, the kernel first caches them in memory entirely, then flushes them some time later (quite unpredictably though) or immediately upon invoking "sync". How can I disable this memory cache altogether (or at least minimize caching)? When running the i686 kernel with the same configuration I don't observe this effect - files get written out almost immediately (for instance "sync" takes less than a second, whereas on x86-64 it can take a dozen _minutes_ depending on file size and storage performance). I'm _not_ talking about disabling the write cache on the storage itself (hdparm -W 0 /dev/XXX) - firstly, this command is detrimental to the performance of my PC; secondly, it won't help in this instance. Swap is totally disabled, and usually my memory is entirely free. My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531 Please advise. Best regards, Artem
Re: Disabling in-memory write cache for x86-64 in Linux II
Oct 26, 2013 02:44:07 AM, neil wrote: On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov" wrote: >> Exactly. And not being able to use applications which show you IO performance, like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine my life without being able to see the progress of a copying operation. With the current dirty cache there's no way to understand how your storage media actually behaves. > >So fix Midnight Commander. If you want the copy to be actually finished when it says it is finished, then it needs to call 'fsync()' at the end. This sounds like a very bad joke. How are applications supposed to show and calculate an _average_ write speed if there are no kernel calls/ioctls to actually make the kernel flush dirty buffers _during_ copying? Actually that would be a good way to solve this problem in user space - alas, even if such calls were implemented, user space would only start using them by 2018, if not later. >> Per-device dirty cache seems like a nice idea. I, for one, would like to disable it altogether or set it to an absolute minimum for things like USB flash drives - because I don't care about multithreaded performance or delayed allocation on such devices - I'm interested in my data reaching my USB stick ASAP, because that's how most people use them. > >As has already been said, you can substantially disable the cache by tuning down various values in /proc/sys/vm/. Have you tried? I don't understand who you are replying to. I asked about per-device settings, yet you are again referring me to system-wide settings - they don't look that good if we're talking about a 3MB/sec flash drive and a 500MB/sec SSD. Besides, it makes no sense to allocate 20% of physical RAM for things which don't belong in it in the first place. I don't know of any other OS which has similar behaviour. And like people (including me) have already mentioned, such a huge dirty cache can stall their PCs/servers for a considerable amount of time. 
Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also, not everyone in this world has a UPS - which means such a huge buffer can lead to serious data loss in case of a power blackout. Regards, Artem
Disabling in-memory write cache for x86-64 in Linux 3.11
Hello, On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel built for the i686 (with PAE) and x86-64 architectures. What's really troubling me is that the x86-64 kernel has the following problem: When I copy large files to any storage device, be it my HDD with ext4 partitions or a flash drive with FAT32 partitions, the kernel first caches them in memory entirely, then flushes them some time later (quite unpredictably though) or immediately upon running "sync". How can I disable this memory cache altogether? When running the i686 kernel with the same configuration I don't observe this effect - files get written out almost immediately (for instance "sync" takes less than a second, whereas on x86-64 it can take a dozen _minutes_ depending on file size and storage performance). I'm _not_ talking about disabling the write cache on the storage itself (hdparm -W 0 /dev/XXX) - firstly, this command is detrimental to the performance of my PC; secondly, it won't help in this instance. Swap is totally disabled, and usually my memory is entirely free. My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531 Please advise. Best regards, Artem
Re: A call to revise sockets behaviour
Jul 29, 2013 11:43:00 PM, Eric wrote: On Mon, 2013-07-29 at 15:47 +0000, Artem S. Tashkinov wrote: > >> A wine developer clearly showed that this option simply doesn't work. >> >> http://bugs.winehq.org/show_bug.cgi?id=26031#c21 >> >> Output of strace: >> getsockopt(24, SOL_SOCKET, SO_REUSEADDR, [0], [4]) = 0 >> setsockopt(24, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 >> bind(24, {sa_family=AF_INET, sin_port=htons(43012), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) > >It's clear that some other socket did not use SO_REUSEADDR > >All sockets using a given port _must_ have used SO_REUSEADDR to allow this >port being reused. > It's exactly what's been tried. A program that ran with SO_REUSEADDR, once it is no longer running, consequently fails to regain the rights to the port.
Re: A call to revise sockets behaviour
Jul 29, 2013 11:27:00 PM, rick wrote: >> A wine developer clearly showed that this option simply doesn't work. >> >> http://bugs.winehq.org/show_bug.cgi?id=26031#c21 >> >> Output of strace: >> getsockopt(24, SOL_SOCKET, SO_REUSEADDR, [0], [4]) = 0 >> setsockopt(24, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 >> bind(24, {sa_family=AF_INET, sin_port=htons(43012), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) > >The output of netstat -an didn't by any chance happen to still show an >endpoint in the LISTEN state for that port number did it? > >rick jones > By chance - no, nothing is/was listening. You can recreate this test in a matter of minutes without ever trusting my word.
Re: A call to revise sockets behaviour
Jul 29, 2013 09:35:25 PM, Stephen wrote: On Mon, 29 Jul 2013 15:10:34 +0000 (UTC) >"Artem S. Tashkinov" wrote: > >> Hello, >> >> Currently the Linux kernel disallows listening on a TCP/UDP port if there are >> open connections against it, regardless of their status. So even if _all_ you have is >> some stale (i.e. no longer active) connections pending destruction, the kernel will >> not allow the socket to be reused. >> >> Stephen Hemminger argues that this behaviour is expected, even though it's 100% >> counterproductive, it defies common sense and I cannot think of any security >> implications should this feature be allowed. >> >> Besides, when discussing this bug on Wine's bugzilla I have shown that this behavior >> affects not only Windows applications running under Wine, but also native POSIX >> applications. >> >> If nothing else is listening for incoming connections, how can _old_ _stale_ connections >> prevent an application from listening on the port? Windows has no qualms about allowing >> that, so why does the Linux kernel work differently? >> >> I want to hear how the current, apparently _broken_ behaviour - "The current socket API >> behavior is unlikely to be changed because so many applications expect it" - can be >> expected. >> >> Also I'd like to know which applications depend on this "feature". >> >> Imagine a situation: >> >> you have an Apache server serving connections on port 80. For some reason a crash in >> one of its modules brings the daemon down, but during the crash Apache had some open >> connections on this port. >> >> According to Stephen Hemminger I cannot relaunch Apache until the kernel waits an >> arbitrary time in order to clean up stale connections from its networking pool. >> >> I fail to see how this behaviour can be "expected". 
>> >> More on it here: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=45571 >> http://bugs.winehq.org/show_bug.cgi?id=26031 > >I understand your problem; people have been dealing with it for 30 years. >The attitude in your response makes it seem like you just discovered fire; >read a book like Stevens' network programming if you need more info. > >If you don't use SO_REUSEADDR then yes, the application has to wait for the time-wait >period. > >If you do enable SO_REUSEADDR then it is possible to bind to a port with existing >stale connections. > A wine developer clearly showed that this option simply doesn't work. http://bugs.winehq.org/show_bug.cgi?id=26031#c21 Output of strace: getsockopt(24, SOL_SOCKET, SO_REUSEADDR, [0], [4]) = 0 setsockopt(24, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(24, {sa_family=AF_INET, sin_port=htons(43012), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) Artem
A call to revise sockets behaviour
Hello, Currently the Linux kernel disallows listening on a TCP/UDP port if there are open connections against it, regardless of their status. So even if _all_ you have is some stale (i.e. no longer active) connections pending destruction, the kernel will not allow the socket to be reused. Stephen Hemminger argues that this behaviour is expected, even though it's 100% counterproductive, it defies common sense and I cannot think of any security implications should this feature be allowed. Besides, when discussing this bug on Wine's bugzilla I have shown that this behavior affects not only Windows applications running under Wine, but also native POSIX applications. If nothing else is listening for incoming connections, how can _old_ _stale_ connections prevent an application from listening on the port? Windows has no qualms about allowing that, so why does the Linux kernel work differently? I want to hear how the current, apparently _broken_ behaviour - "The current socket API behavior is unlikely to be changed because so many applications expect it" - can be expected. Also I'd like to know which applications depend on this "feature". Imagine a situation: you have an Apache server serving connections on port 80. For some reason a crash in one of its modules brings the daemon down, but during the crash Apache had some open connections on this port. According to Stephen Hemminger I cannot relaunch Apache until the kernel waits an arbitrary time in order to clean up stale connections from its networking pool. I fail to see how this behaviour can be "expected". More on it here: https://bugzilla.kernel.org/show_bug.cgi?id=45571 http://bugs.winehq.org/show_bug.cgi?id=26031 Artem
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
May 8, 2013 04:25:43 AM, Patrik Jakobsson wrote: On Wed, May 8, 2013 at 12:02 AM, Bjorn Helgaas wrote: >> On Tue, May 7, 2013 at 2:48 PM, Patrik Jakobsson wrote: >>> On Tue, May 7, 2013 at 10:20 PM, Bjorn Helgaas wrote: > I'm not sure if reading /proc/mtrr actually reads the registers out of > the CPU each time, or whether we just return the cached values we read > out during initial boot-up. If the latter, then this output isn't > really useful as there's no guarantee the values are still intact. Good point. From what I can tell, on Artem's system with "CPU0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz," we would be using generic_mtrr_ops, and generic_get_mtrr() appears to read from the MSRs, so I think it should be useful. >>> >>> FWIW, that motherboard suffers from a PCI to PCIE bridge problem. It might >>> have been fixed by BIOS upgrades by now but not sure. >>> >>> It might also suffer (depending on the revision) from the Sandy Bridge SATA >>> issue. So if affected, SATA controller is a ticking bomb. >>> >>> I have a P8H67-V motherboard but I haven't seen any suspend related issues. >>> >>> If this is totally unrelated I'm sorry for wasting your time. Just thought >>> it >>> might be good to know. >> >> Thanks for chiming in. I'm not familiar with either of the issues you >> mentioned. Do you have any references where I could read up on them? > >I think this is the official statement from Intel on the SATA issue: >http://newsroom.intel.com/community/intel_newsroom/blog/2011/01/31/intel-identifies-chipset-design-error-implementing-solution My motherboard has a new fixed B3 revision, so this issue doesn't affect me. Besides, this SATA port degradation issue is constantly present - it has no relationship to suspend. 
> >And here's a link to a discussion about the PCIe-to-PCI bridge stuff: >https://lkml.org/lkml/2012/1/30/216 > >> Artem's system has a PCIe-to-PCI bridge (not a PCI-to-PCIe bridge) at >> 05:00.0, but it leads to [bus 06] and there's nothing on bus 06, so I >> don't think that's the problem. > >I meant what you said ;) and yes, it seems unrelated. Both my P8H67 and a >P8P67 I've built behave nicely if nothing is connected. Have you tried suspending more than three times? In the absence of UEFI boot this bug emerges only on a third or even fourth resume attempt. UEFI boot triggers it immediately on a first resume though. >> And the issue affects both USB and a hard drive, so I suspect it's >> more than just SATA. Artem, did you identify the PCI devices leading >> to your USB and hard drive? I can't remember if I've actually seen >> that.
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
May 8, 2013 04:03:18 AM, Bjorn Helgaas wrote: On Tue, May 7, 2013 at 2:48 PM, Patrik Jakobsson wrote: >> On Tue, May 7, 2013 at 10:20 PM, Bjorn Helgaas wrote: I'm not sure if reading /proc/mtrr actually reads the registers out of the CPU each time, or whether we just return the cached values we read out during initial boot-up. If the latter, then this output isn't really useful as there's no guarantee the values are still intact. >>> >>> Good point. From what I can tell, on Artem's system with "CPU0: >>> Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz," we would be using >>> generic_mtrr_ops, and generic_get_mtrr() appears to read from the >>> MSRs, so I think it should be useful. >> >> FWIW, that motherboard suffers from a PCI to PCIE bridge problem. It might >> have been fixed by BIOS upgrades by now but not sure. >> >> It might also suffer (depending on the revision) from the Sandy Bridge SATA >> issue. So if affected, SATA controller is a ticking bomb. >> >> I have a P8H67-V motherboard but I haven't seen any suspend related issues. >> >> If this is totally unrelated I'm sorry for wasting your time. Just thought it >> might be good to know. > >Thanks for chiming in. I'm not familiar with either of the issues you >mentioned. Do you have any references where I could read up on them? > >Artem's system has a PCIe-to-PCI bridge (not a PCI-to-PCIe bridge) at >05:00.0, but it leads to [bus 06] and there's nothing on bus 06, so I >don't think that's the problem. > >And the issue affects both USB and a hard drive, so I suspect it's >more than just SATA. Artem, did you identify the PCI devices leading >to your USB and hard drive? I can't remember if I've actually seen >that. I posted my lspci information here https://bugzilla.kernel.org/show_bug.cgi?id=53551 If that's not enough, please tell me how I can collect it. The SATA issue is discussed here: https://bugzilla.kernel.org/show_bug.cgi?id=43229 According to Intel and Linux kernel developers it poses no threat. 
Best regards, Artem -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
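Bjorn's suggestion above — collecting /proc/mtrr contents before and after suspending — can be scripted so the two states are easy to diff. This is a minimal sketch, not from the thread; the helper names and /tmp paths are my own, and it assumes MTRR support is compiled in (otherwise /proc/mtrr is absent and an empty snapshot is kept):

```shell
#!/bin/sh
# Hypothetical helpers for comparing MTRR state across a suspend cycle.

snapshot_mtrr() {
    # $1: a label such as "before" or "after"; prints the snapshot path
    out="/tmp/mtrr.$1"
    if [ -r /proc/mtrr ]; then
        cat /proc/mtrr > "$out"
    else
        : > "$out"   # no MTRR support exposed; leave an empty snapshot
    fi
    echo "$out"
}

compare_mtrr() {
    # Diff two snapshots; empty output means the ranges look unchanged
    diff -u "$1" "$2"
}

# Usage across a suspend cycle (the suspend command varies by system):
#   b=$(snapshot_mtrr before); systemctl suspend; a=$(snapshot_mtrr after)
#   compare_mtrr "$b" "$a"
```

Note this only catches changes visible through /proc/mtrr; if the kernel merely replays cached values there, the diff would stay empty even when the hardware registers changed, which is exactly the caveat Bjorn raises.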
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
May 8, 2013 04:25:43 AM, Patrik Jakobsson wrote: On Wed, May 8, 2013 at 12:02 AM, Bjorn Helgaas wrote: On Tue, May 7, 2013 at 2:48 PM, Patrik Jakobsson wrote: On Tue, May 7, 2013 at 10:20 PM, Bjorn Helgaas wrote: I'm not sure if reading /proc/mtrr actually reads the registers out of the CPU each time, or whether we just return the cached values we read out during initial boot-up. If the latter, then this output isn't really useful as there's no guarantee the values are still intact. Good point. From what I can tell, on Artem's system with CPU0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, we would be using generic_mtrr_ops, and generic_get_mtrr() appears to read from the MSRs, so I think it should be useful. FWIW, that motherboard suffers from a PCI to PCIe bridge problem. It might have been fixed by BIOS upgrades by now but I'm not sure. It might also suffer (depending on the revision) from the Sandy Bridge SATA issue. So if affected, the SATA controller is a ticking bomb. I have a P8H67-V motherboard but I haven't seen any suspend related issues. If this is totally unrelated I'm sorry for wasting your time. Just thought it might be good to know. Thanks for chiming in. I'm not familiar with either of the issues you mentioned. Do you have any references where I could read up on them? I think this is the official statement from Intel on the SATA issue: http://newsroom.intel.com/community/intel_newsroom/blog/2011/01/31/intel-identifies-chipset-design-error-implementing-solution My motherboard has a new fixed B3 revision so this issue doesn't affect me. Besides, this SATA port degradation issue is constantly present - it has no relationship to suspend. And here's a link to a discussion about the PCIe-to-PCI bridge stuff: https://lkml.org/lkml/2012/1/30/216 Artem's system has a PCIe-to-PCI bridge (not a PCI-to-PCIe bridge) at 05:00.0, but it leads to [bus 06] and there's nothing on bus 06, so I don't think that's the problem. I meant what you said ;) and yes, it seems unrelated. 
Both my P8H67 and a P8P67 I've built behave nicely if nothing is connected. Have you tried suspending more than three times? In the absence of UEFI boot this bug emerges only on a third or even fourth resume attempt. UEFI boot triggers it immediately on the first resume though. And the issue affects both USB and a hard drive, so I suspect it's more than just SATA. Artem, did you identify the PCI devices leading to your USB and hard drive? I can't remember if I've actually seen that.
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
May 7, 2013 10:27:30 PM, Bjorn Helgaas wrote: On Tue, May 7, 2013 at 8:59 AM, Artem S. Tashkinov wrote: >> May 7, 2013 09:25:40 PM, Bjorn Helgaas wrote: >>> [+cc Phillip] >>> >>>> I would suspect that Windows' complaint about the BIOS mucking up the MTRRs >>>> is likely the best hint. Likely Windows is detecting the problem and fixing >>>> it up on resume, thus it only complains about "reduced resume performance". >>>> If the MTRRs are messed up, then quite likely parts of RAM have become >>>> uncacheable, causing performance to get randomly slaughtered in various >>>> ways. >>>> >>>> From looking at the code it's not clear if we are checking/restoring the >>>> MTRR contents after resume. If not, maybe we should be. >>> >>>I agree; the MTRR warning is a good hint. Artem? >>> >>>Phillip, I cc'd you because you have similar hardware and your >>>https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1131468 report is >>>slightly similar. Have you seen anything like this "reduced >>>performance after resume" issue? If so, can you collect /proc/mtrr >>>contents before and after suspending? >>> >> >> Like Robert Hancock correctly noted, the Linux kernel lacks the code to check >> for MTRR changes after resume - I'm not a kernel hacker able to write such code >> ;-) >> >> Likewise there's no code to see if RAM pages have become uncacheable - i.e. >> I've no idea how to check that either. >> >> According to /proc/mtrr nothing changes on resume - only Windows detects >> the discrepancy between MTRR regions on resume. dmesg contains no warnings >> or errors (aside from the usual ACPI SATA warnings - but they happen right on >> boot - so I highly doubt the ACPI or SATA layers can be the culprit, since >> USB >> exhibits a similar performance degradation). >> >> In short, there's little to nothing that I can check. > >I'm not trying to be ungrateful, but maybe you could actually collect >the info we've asked for and attach it to the bugzilla. 
It's hard for >me to get excited about digging into this when all I see is "nothing >changes in MTRR" and "it's probably not X." I really need some >concrete data to help rule things out and suggest other things to >investigate. > >Maybe we won't be able to make progress on this until other people >start hitting similar issues and we can find patterns. The pattern is very easy to spot - Linus once said that desktop PCs are not meant to work properly with suspend. That's kind of strange for me as I have yet to encounter a PC where Windows fails to work properly after resume - maybe I'm lucky - who knows. Taking into consideration that only a few people use Linux, most Linux users avoid UEFI, and very few of them actually use suspend/resume, it gets very easy to understand why such bug reports are vanishingly rare. Asus themselves could have easily debugged this issue if they were slightly interested in fixing it, yet their policy is that they only support Windows, and Linux is not their concern. Best regards
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
May 7, 2013 09:25:40 PM, Bjorn Helgaas wrote: > [+cc Phillip] > >> I would suspect that Windows' complaint about the BIOS mucking up the MTRRs >> is likely the best hint. Likely Windows is detecting the problem and fixing >> it up on resume, thus it only complains about "reduced resume performance". >> If the MTRRs are messed up, then quite likely parts of RAM have become >> uncacheable, causing performance to get randomly slaughtered in various >> ways. >> >> From looking at the code it's not clear if we are checking/restoring the >> MTRR contents after resume. If not, maybe we should be. > >I agree; the MTRR warning is a good hint. Artem? > >Phillip, I cc'd you because you have similar hardware and your >https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1131468 report is >slightly similar. Have you seen anything like this "reduced >performance after resume" issue? If so, can you collect /proc/mtrr >contents before and after suspending? > Like Robert Hancock correctly noted, the Linux kernel lacks the code to check for MTRR changes after resume - I'm not a kernel hacker able to write such code ;-) Likewise there's no code to see if RAM pages have become uncacheable - i.e. I've no idea how to check that either. According to /proc/mtrr nothing changes on resume - only Windows detects the discrepancy between MTRR regions on resume. dmesg contains no warnings or errors (aside from the usual ACPI SATA warnings - but they happen right on boot - so I highly doubt the ACPI or SATA layers can be the culprit, since USB exhibits a similar performance degradation). In short, there's little to nothing that I can check. That bug report has nothing to do with my problem - my PC suspends and resumes more or less correctly - everything works (albeit some parts don't work as they should). That person also has a very outdated BIOS - 1904 from 08/15/2011. I wouldn't be surprised if a BIOS update solved his problem. 
Best regards, Artem
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
> >Did this problem ever get resolved? > Hello, Unfortunately, no. Out of curiosity I've tried booting kernel 3.9-rc8 in UEFI mode but it exhibits the same problem. Right after the boot: [root@localhost ~]# dd if=/dev/zero of=test bs=64M count=3 3+0 records in 3+0 records out 201326592 bytes (201 MB) copied, 1.08544 s, 185 MB/s After suspend/resume: # dd if=/dev/zero of=test bs=64M count=3 3+0 records in 3+0 records out 201326592 bytes (201 MB) copied, 66.5392 s, 3.0 MB/s That's for my primary SATA-3 HDD. Forgive my impudence, but I believe debugging the USB stack is tangential to this problem. Something far deeper than USB support breaks, but so far no one has come up with even the slightest clue of what that might be. And like I mentioned before, this problem doesn't affect Windows - once I suspended it seven times in a row and it kept on chugging happily. According to hdparm nothing changes after suspend/resume: Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Advanced power management level: disabled Recommended acoustic management value: 208, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns 3 MB/sec matches PIO mode 0, which is ridiculous and implausible given that this HDD is attached via SATA. Besides, hdparm says that: # hdparm -tT --direct /dev/sda /dev/sda: Timing O_DIRECT cached reads: 862 MB in 2.00 seconds = 430.77 MB/sec Timing O_DIRECT disk reads: 520 MB in 3.01 seconds = 173.03 MB/sec So, only writes are affected. 
My dmesg is here: http://ompldr.org/vaThpcA/dmesg
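The before/after dd figures quoted above can be reproduced with a small helper. This is an illustrative sketch, not part of the thread; write_probe and the target path are made-up names. conv=fsync flushes the data to the device before dd reports a rate, so the number reflects the disk rather than the page cache (on bare metal, oflag=direct is an even stricter variant):

```shell
#!/bin/sh
# write_probe FILE [MIB]: writes MIB mebibytes of zeros to FILE and
# prints dd's summary line (bytes copied, elapsed time, rate), then
# removes the test file. Names and defaults are illustrative.
write_probe() {
    file="$1"
    mib="${2:-64}"
    # conv=fsync makes dd flush before reporting; drop it for tmpfs tests
    dd if=/dev/zero of="$file" bs=1M count="$mib" conv=fsync 2>&1 | tail -n 1
    rm -f "$file"
}

# Example: run once after boot and once after resume, then compare rates
# write_probe /mnt/disk-under-test/ddtest.bin 64
```

A drop from hundreds of MB/s to single digits between the two runs is the symptom being discussed.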
CONFIG_X86_INTEL_PSTATE disables CPU frequency transition stats, many governors and other standard features
Hello, Just wanted to let everyone know that CONFIG_X86_INTEL_PSTATE wreaks havoc with the CPU frequency subsystem in the Linux kernel. With this option enabled: 1) All governors except performance and powersave are gone: ondemand, userspace, conservative 2) scaling_cur_freq is gone, thus user space utilities monitoring the CPU frequency have stopped working 3) CPU frequency transition stats are gone, there's no "stats" directory anywhere 4) scaling_available_frequencies is gone, so I cannot set a desired constant CPU frequency (the userspace governor is not available anyway) Is this intended behavior? I shudder to think that's the case. The bug report is filed here: https://bugzilla.kernel.org/show_bug.cgi?id=57141 Best regards, Artem
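The missing sysfs attributes listed above can be checked directly. The sketch below is my own illustration (probe_cpufreq is a hypothetical name); it walks the standard cpufreq sysfs layout for cpu0, where an intel_pstate system would typically report scaling_available_frequencies and the stats directory as missing:

```shell
#!/bin/sh
# Hypothetical probe for the cpufreq attributes discussed above:
# report which ones the active scaling driver actually exposes for cpu0.
probe_cpufreq() {
    cpufreq=/sys/devices/system/cpu/cpu0/cpufreq
    if [ ! -d "$cpufreq" ]; then
        echo "cpufreq sysfs not available"
        return 0
    fi
    for f in scaling_driver scaling_available_governors \
             scaling_cur_freq scaling_available_frequencies; do
        if [ -r "$cpufreq/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$cpufreq/$f")"
        else
            printf '%s: missing\n' "$f"
        fi
    done
    # Transition statistics live in a "stats" subdirectory when available
    if [ -d "$cpufreq/stats" ]; then
        echo "stats: present"
    else
        echo "stats: missing"
    fi
}

# probe_cpufreq
```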
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
Feb 27, 2013 12:47:01 AM, Bjorn Helgaas wrote: On Mon, Feb 25, 2013 at 11:35 PM, Artem S. Tashkinov wrote: >> Feb 26, 2013 03:57:52 AM, Bjorn Helgaas wrote: >>> >>>Where are we at with this, Artem? I assume it's still a problem. >>> >> >> Yes, it is, Bjorn. >> >> In order to eliminate this problem I switched back to MBR yesterday, because >> so far I haven't received any instructions or guidance as to how I can debug >> it further. I'm absolutely sure USB write speed is just another >> manifestation of >> it so I decided not to debug USB specifically (it just doesn't make too much >> sense). >> >> What I see is that something terribly wrong is going on but if Linus has no >> ideas >> I, as an average Joe, don't have the slightest clue as to what I can do. >> >> The bug report with necessary, but seemingly useless information, can be >> found here: https://bugzilla.kernel.org/show_bug.cgi?id=53551 >> >> If anyone comes up with new ideas I can quickly try UEFI again now that I >> have two HDDs at my disposal (the old one is formatted as GPT, the new one is >> MBR). > >The ideas I saw are: > >1) Figure out whether it ever worked. If an older kernel worked >correctly and a newer one is broken, bisection is at least a >possibility. You mentioned that it did work before (Feb 12), but in >the past you never suspended twice in one boot session, whereas maybe >you did when seeing the problem? This is difficult to say since the first kernel I tried to run in UEFI mode was 3.7.x, so I've no idea if any previous ones ever worked. > >2) Try "setpci" to set the MSI address back to the original value >to see if it makes a difference (see my Feb 12 message). I will try it soon and report back to you. > >3) Collect "lspci -vvv -" output to investigate the XHCI >Unsupported Request errors. > >4) Use usbmon to collect traces before and after the suspend. Likewise. 
Still I don't quite understand why you are persistent in your desire to investigate USB controllers specifically - my problem affects all storage devices that I have. > >I googled around a bit looking for similar reports. I found lots of >suspend issues, mostly with Windows, but no leads yet. It looks like >the board has been around for a while, so you would think we'd have >some other reports of a problem this bad. But maybe it really is >related to UEFI and nobody really uses that yet? 99% of people around me don't use UEFI, and the ones who use it do it because they want to run Hackintosh (it's quite complicated to run a UEFI OS from a non-UEFI BIOS). That's the main reason you don't see similar reports. UEFI so far hasn't proven its supremacy and efficiency over BIOS. When 3TB and larger HDDs become more widespread people will have to use UEFI. They will simply have no choice (unless of course you have two HDDs, where one is MBR formatted to boot your system, and another one is GPT partitioned in order to support > 2.2TB space). Best regards, Artem
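Idea (2) above — putting the MSI address back with setpci — might be sketched like this. Everything here is an assumption for illustration: msi_dump is a made-up helper, the example device address and the write-back value are placeholders, and the commands need root plus pciutils. setpci's CAP_MSI+4 form addresses the Message Address dword inside the device's MSI capability, wherever it sits in config space:

```shell
#!/bin/sh
# Hypothetical helper: print a device's 32-bit MSI message address.
# Run as root; the device address comes from lspci output.
msi_dump() {
    dev="$1"
    if [ -z "$dev" ]; then
        echo "usage: msi_dump <bus:dev.fn>  (e.g. 00:1f.2)"
        return 0
    fi
    # CAP_MSI+4 names the Message Address dword in the MSI capability
    setpci -s "$dev" CAP_MSI+4.l
}

# To write back a value captured before suspend (placeholder value):
#   setpci -s 00:1f.2 CAP_MSI+4.l=fee0f00c
```

Capturing the value right after boot and rewriting it after resume would show whether the changed MSI address is a cause or just a symptom.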
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
Feb 26, 2013 03:57:52 AM, Bjorn Helgaas wrote: > >Where are we at with this, Artem? I assume it's still a problem. > Yes, it is, Bjorn. In order to eliminate this problem I switched back to MBR yesterday, because so far I haven't received any instructions or guidance as to how I can debug it further. I'm absolutely sure USB write speed is just another manifestation of it, so I decided not to debug USB specifically (it just doesn't make too much sense). What I see is that something terribly wrong is going on, but if Linus has no ideas I, as an average Joe, don't have the slightest clue as to what I can do. The bug report with necessary, but seemingly useless, information can be found here: https://bugzilla.kernel.org/show_bug.cgi?id=53551 If anyone comes up with new ideas I can quickly try UEFI again now that I have two HDDs at my disposal (the old one is formatted as GPT, the new one is MBR). Best regards, Artem
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
Feb 13, 2013 01:32:53 AM, Linus Torvalds wrote: On Tue, Feb 12, 2013 at 10:29 AM, Artem S. Tashkinov wrote: >> Feb 12, 2013 11:30:20 PM, Linus Torvalds wrote: >>> >>>A few things to try to pinpoint: >>> >>> (a) Is it *only* write performance that suffers, or is it other >>>performance too? Networking (DMA? Perhaps only writing *to* the >>>network?)? CPU? >> >> I've tested hdparm -tT --direct and the output on boot and after suspend >> is quite similar. >> >> I've also checked my network read/write speed, and it's the same >> ~ 100MBit/sec (I have no 1Gbit computers on my network >> unfortunately). > >Ok. So it really sounds like just USB and HD writes. Which is quite >odd, since they have basically nothing in common I can think of >(except the obvious block layer issues). > >>> (b) the fact that it apparently happens with both SATA and USB >>>implies that it's neither, and is more likely something core like >>>memory speed (mtrr, caching) or PCI (DMA, burst sizes, whatever). >> >> I've no idea, please check my bug report where I've just added lots of >> information including a diff between on boot and after suspend. > >I'm not seeing anything particularly interesting there. > >Except why/how did the MSI address/data change for the SATA >controller? The irq itself hasn't changed.. There's probably some sane >reason for that too (it's an odd encoding, maybe they code for the >same thing), and there's nothing like that for USB, so... > >And if it was irq problems, I'd expect you to see it more for reads >than for writes anyway. Along with a few messages about missed irqs >and whatever. > >I'm stumped, and have no ideas. I can't even begin to guess how this >would happen. One thing to try is if it happens for all USB ports (you >have multiple controllers) and I assume performance doesn't come back >if you unplug and replug the USB disk.. 
I've just plugged and unplugged my USB stick into all available hubs (including a USB3 one, that is xhci_hcd) and I've got the same write speed on all of them - around 930KB/sec (quite a weird number - as if I'm on USB 1.1) - lsusb says I'm happily running ehci_hcd/2p, 480M and xhci_hcd/2p, 5000M. The only pattern that I see here is that write speed to real devices degrades, while tmpfs write speed stays the same: $ dd if=/dev/zero of=test bs=32M count=32 32+0 records in 32+0 records out 1073741824 bytes (1.1 GB) copied, 0.296323 s, 3.6 GB/s Best regards, Artem
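Note that the dd run above writes through the page cache, which is why tmpfs reports 3.6 GB/s regardless of any device problem. A minimal sketch of a cache-honest variant (file name and sizes are illustrative) forces the data to the device before dd reports its rate:

```shell
# conv=fsync makes dd fsync() the output file before printing throughput,
# so the figure reflects the device rather than RAM. Adding oflag=direct
# would bypass the page cache entirely, but some filesystems (tmpfs among
# them) reject O_DIRECT, so it is left out of this sketch.
dd if=/dev/zero of=ddtest.bin bs=1M count=16 conv=fsync 2>&1 | tail -n 1
rm -f ddtest.bin
```

Run in a directory on the device under test; on a healthy SATA disk the fsync-honest number should be far below the tmpfs figure but nowhere near 930KB/sec.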
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
Feb 12, 2013 11:30:20 PM, Linus Torvalds wrote: >On Mon, Feb 11, 2013 at 10:25 PM, Artem S. Tashkinov wrote: >> Hello Linus, >> >> I've already posted a bug report >> (https://bugzilla.kernel.org/show_bug.cgi?id=53551), >> a message to LKML >> (http://lkml.indiana.edu/hypermail/linux/kernel/1302.1/00837.html) >> and so far I've received zero response even though the bug is quite >> critical as it prevents >> me from using suspend altogether. >> >> I wonder if you could tell me who is responsible for this problem and who I >> need to CC in >> bugzilla. > >According to your bugzilla it doesn't really seem to be strictly >UEFI-specific, and it's hard to tell what subsystem is to blame. > >A few things to try to pinpoint: > > (a) Is it *only* write performance that suffers, or is it other >performance too? Networking (DMA? Perhaps only writing *to* the >network?)? CPU? I've tested hdparm -tT --direct and the output on boot and after suspend is quite similar. I've also checked my network read/write speed, and it's the same ~ 100MBit/sec (I have no 1Gbit computers on my network unfortunately). > > (b) the fact that it apparently happens with both SATA and USB >implies that it's neither, and is more likely something core like >memory speed (mtrr, caching) or PCI (DMA, burst sizes, whatever). I've no idea; please check my bug report where I've just added lots of information including a diff between on boot and after suspend. lspci outputs differ quite substantially, but the things that have changed say nothing to me - you'll want to see it for yourself. I see changes like: - Changed: MRL- PresDet- LinkState- + Changed: MRL- PresDet+ LinkState- i.e. PresDet minus to PresDet plus. - Address: fee0f00c Data: 41e1 + Address: Data: - Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- TAbort- > (c) can you find anything that changes over the suspend/resume? 
IOW, >look at things like "lspci -vvxxx" before-and-after, and see what >changed on the bridges leading to both things etc. > >The performance drop sounds extreme enough that it sounds like caches >got disabled or something, but that should show up as CPU performance >in general being slow, not just writes to disk. But basically, I think >we need more clues about which sub-area is actually the culprit. My >*guess* would be some core PCI thing not being initialized, but I >don't see how you could even make PCI go that slow. Interrupt >problems? DMA failures? I have no idea. > >Has it ever worked? Suspend on desktop motherboards used to be quite >spotty (nobody ever used it, manufacturers didn't care), but it >generally has gotten better since people use it more these days.. I remember it used to work before, but I've never suspended more than once during one boot session before (this time I did it out of pure curiosity) and I've never run Linux from UEFI. > >Added lkml and Bjorn to the participants, in case anybody has any ideas.. > I'll gladly provide any information you need. Thanks a lot, Artem
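The before-and-after comparison Linus suggests can be scripted so nothing is eyeballed by hand; a rough sketch follows. The snapshot paths are illustrative, and lspci comes from the pciutils package (run as root to get full config-space dumps):

```shell
# Snapshot full PCI config space before suspend, again after resume,
# and diff the two. diff exits nonzero when the files differ, which is
# exactly the interesting case here, so that exit code is swallowed.
before=/tmp/lspci-before.txt
after=/tmp/lspci-after.txt
if command -v lspci >/dev/null 2>&1; then
    lspci -vvxxx > "$before"      # run this before suspending
    # systemctl suspend           # suspend/resume happens between snapshots
    lspci -vvxxx > "$after"       # run this after resume
    diff -u "$before" "$after" || true
else
    echo "lspci not installed (pciutils)"
fi
```

The hex dump lines from -xxx make the diff noisy but catch register-level changes (MSI addresses, bridge control bits) that the decoded output can miss.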
Abysmal HDD/USB write speed after sleep on a UEFI system
Hello, I have a P8P67 Pro motherboard made by ASUS and recently I decided to switch to UEFI boot. Maybe it's a coincidence, or maybe Linux kernel 3.7.6 (vanilla) has some serious bug, but after waking up from sleep write performance becomes intolerable. On boot I have: HDD write performance: ~120MB/sec USB write performance: ~18MB/sec After sleep: HDD write performance: ~7MB/sec (i.e. 17 times slower) USB write performance: ~0.5MB/sec (i.e. 36 times slower) This is totally unacceptable; the computer becomes unusable. I'm open to suggestions on how to debug this extremely serious problem. P.S. Since I'm still using an x86 kernel, on boot it switches x86-64 UEFI off: [0.00] efi: EFI v2.31 by American Megatrends [0.00] efi: ACPI=0xdf385000 ACPI 2.0=0xdf385000 SMBIOS=0xdec28e98 MPS=0xfc9a0 [0.00] efi: No EFI runtime due to 32/64-bit mismatch with kernel ... [0.00] efi: Setup done, disabling due to 32/64-bit mismatch Best regards, Artem
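As a quick way to confirm such a 32/64-bit mismatch without digging through the boot log, kernels newer than the 3.7.6 used above expose the firmware word size in sysfs; the sketch below assumes that file (it is absent on BIOS/MBR boots and on kernels that predate it):

```shell
# Compare the firmware's word size against the kernel's. A 32-bit kernel
# on 64-bit UEFI gets no EFI runtime services, as the boot log shows.
if [ -r /sys/firmware/efi/fw_platform_size ]; then
    echo "UEFI firmware: $(cat /sys/firmware/efi/fw_platform_size)-bit"
else
    echo "no UEFI firmware info (BIOS boot, or kernel lacks fw_platform_size)"
fi
echo "kernel: $(uname -m)"   # e.g. i686 vs x86_64
```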
A vague, murky topic of "Buffer I/O error on device sdb6, logical block NNNNNNNNN" and an ext4/VFS oops
Hello, When I was copying a lot of information (tens of gigabytes) from my primary HDD to a secondary HDD I got gazillions of errors like these ones: [19568.964762] EXT4-fs warning (device sdb6): ext4_end_bio:250: I/O error writing to inode 6029369 (offset 8036352 size 524288 starting block 51946549) [19568.964767] sd 2:0:0:0: [sdb] [19568.964768] Result: hostbyte=0x00 driverbyte=0x08 [19568.964770] sd 2:0:0:0: [sdb] [19568.964771] Sense Key : 0xb [current] [descriptor] [19568.964774] Descriptor sense data with sense descriptors (in hex): [19568.964775] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 [19568.964784] 00 00 00 00 [19568.964788] sd 2:0:0:0: [sdb] [19568.964789] ASC=0x0 ASCQ=0x0 [19568.964791] sd 2:0:0:0: [sdb] CDB: [19568.964792] cdb[0]=0x2a: 2a 00 18 c5 25 a8 00 00 70 00 [19568.964804] Buffer I/O error on device sdb6, logical block 13727786 [19568.964806] Buffer I/O error on device sdb6, logical block 13727787 [19568.964808] Buffer I/O error on device sdb6, logical block 13727788 [19568.964810] Buffer I/O error on device sdb6, logical block 13727789 [19568.964812] Buffer I/O error on device sdb6, logical block 13727790 along with: [19568.964832] EXT4-fs warning (device sdb6): ext4_end_bio:250: I/O error writing to inode 6029369 (offset 8560640 size 57344 starting block 51946677) [19568.964843] ata3: EH complete [19624.635176] ata3.00: exception Emask 0x0 SAct 0x3fff SErr 0x4 action 0x6 frozen [19624.635181] ata3: SError: { CommWake } [19624.635184] ata3.00: failed command: WRITE FPDMA QUEUED [19624.635190] ata3.00: cmd 61/00:00:48:ee:cb/04:00:18:00:00/40 tag 0 ncq 524288 out [19624.635190] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [19624.635193] ata3.00: status: { DRDY } [19624.635196] ata3.00: failed command: WRITE FPDMA QUEUED [19624.635201] ata3.00: cmd 61/08:08:f0:65:bd/00:00:1d:00:00/40 tag 1 ncq 4096 out [19624.635201] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [19624.635203] ata3.00: status: { DRDY } 
[19624.635206] ata3.00: failed command: WRITE FPDMA QUEUED [19624.635211] ata3.00: cmd 61/00:10:48:f2:cb/04:00:18:00:00/40 tag 2 ncq 524288 out [19624.635211] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [19624.635213] ata3.00: status: { DRDY } [19624.635215] ata3.00: failed command: WRITE FPDMA QUEUED [19624.635220] ata3.00: cmd 61/00:18:48:f6:cb/04:00:18:00:00/40 tag 3 ncq 524288 out [19624.635220] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [19624.635223] ata3.00: status: { DRDY } [19624.635225] ata3.00: failed command: WRITE FPDMA QUEUED along with: [19624.635320] ata3: hard resetting link [19624.954880] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [19624.956101] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120711/psargs-359) [19624.956109] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT2._GTF] (Node ef0307b0), AE_NOT_FOUND (20120711/psparse-536) [19624.958006] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120711/psargs-359) [19624.958011] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT2._GTF] (Node ef0307b0), AE_NOT_FOUND (20120711/psparse-536) [19624.958366] ata3.00: configured for UDMA/133 [19624.960763] ata3.00: device reported invalid CHS sector 0 [19624.960765] ata3.00: device reported invalid CHS sector 0 [19624.960767] ata3.00: device reported invalid CHS sector 0 [19624.960769] ata3.00: device reported invalid CHS sector 0 [19624.960771] ata3.00: device reported invalid CHS sector 0 [19624.960773] ata3.00: device reported invalid CHS sector 0 [19624.960775] ata3.00: device reported invalid CHS sector 0 [19624.960777] ata3.00: device reported invalid CHS sector 0 [19624.960779] ata3.00: device reported invalid CHS sector 0 [19624.960781] ata3.00: device reported invalid CHS sector 0 [19624.960782] ata3.00: device reported invalid CHS sector 0 [19624.960784] ata3.00: device reported invalid CHS sector 0 [19624.960786] ata3.00: device reported invalid CHS 
sector 0 [19624.960788] ata3.00: device reported invalid CHS sector 0 and also this: [19624.961128] Buffer I/O error on device sdb6, logical block 13783485 [19624.961132] EXT4-fs warning (device sdb6): ext4_end_bio:250: I/O error writing to inode 6029369 (offset 236183552 size 524288 starting block 52002249) [19624.961142] sd 2:0:0:0: [sdb] [19624.961144] Result: hostbyte=0x00 driverbyte=0x08 [19624.961146] sd 2:0:0:0: [sdb] [19624.961147] Sense Key : 0xb [current] [descriptor] [19624.961149] Descriptor sense data with sense descriptors (in hex): [19624.961151] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 [19624.961160] 00 00 00 00 [19624.961164] sd 2:0:0:0: [sdb] [19624.961165] ASC=0x0 ASCQ=0x0 [19624.961167] sd 2:0:0:0: [sdb] CDB: [19624.961168] cdb[0]=0x2a: 2a 00 1d bd 65 f0 00 00 08 00 [19624.961176] end_request: I/O error, dev sdb, sector 498951664 [19624.961179] Buffer I/O error on device
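For what it is worth, the "Sense Key : 0xb" in the log decodes to ABORTED COMMAND in the standard SCSI sense-key table; together with ASC/ASCQ 0x0/0x0 and the "hard resetting link" line, that points at in-flight NCQ writes being killed by a link reset rather than at the disk reporting a media error. A tiny illustrative decoder for the common keys (the function name is made up for this sketch):

```shell
# Map the SCSI sense-key nibble to its standard name (T10 SPC table).
decode_sense_key() {
    case "$1" in
        0x00) echo "NO SENSE";;
        0x01) echo "RECOVERED ERROR";;
        0x02) echo "NOT READY";;
        0x03) echo "MEDIUM ERROR";;
        0x04) echo "HARDWARE ERROR";;
        0x05) echo "ILLEGAL REQUEST";;
        0x06) echo "UNIT ATTENTION";;
        0x0b) echo "ABORTED COMMAND";;
        *)    echo "other ($1)";;
    esac
}
decode_sense_key 0x0b   # the key from the log above, zero-padded
```

A MEDIUM ERROR (0x3) here would instead suggest the drive itself, which is why the distinction matters when deciding whether to suspect the disk or the SATA link/controller.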