Re: [PATCH v2] tools/power turbostat: Fix RAPL summary collection on AMD processors
On 4/20/21 1:15 PM, Chen Yu wrote:

On Tue, Apr 20, 2021 at 10:07:01AM +0200, Borislav Petkov wrote:

On Tue, Apr 20, 2021 at 10:03:36AM +0800, Chen Yu wrote:

On Mon, Apr 19, 2021 at 02:58:12PM -0500, Terry Bowman wrote:

Turbostat fails to correctly collect and display RAPL summary information on Family 17h and 19h AMD processors. Running turbostat on these processors returns immediately. If turbostat is working correctly, RAPL summary data is displayed until the user-provided command completes; if no command is provided by the user, turbostat is designed to continuously display RAPL information until interrupted.

The issue is that offset_to_idx() and idx_to_offset() are missing support for AMD MSR addresses/offsets: offset_to_idx()'s switch statement is missing cases for the AMD MSRs, and idx_to_offset() does not include a path to return AMD MSR(s) for any idx. The solution is to add AMD MSR support to offset_to_idx() and idx_to_offset(). These functions are split out and renamed along architecture vendor lines to support both AMD and Intel MSRs.

Fixes: 9972d5d84d76 ("tools/power turbostat: Enable accumulate RAPL display")
Signed-off-by: Terry Bowman

Thanks for fixing, Terry. Previously there was a patch for this from Bas Nieuwenhuizen: https://lkml.org/lkml/2021/3/12/682 and it is expected to have been merged in Len's branch already.

Expected? So is it or is it not?

This patch was sent to Len and it is not in a public repo yet. He is preparing a new release of turbostat as the merge window is approaching.

And can you folks agree on a patch already and give it to Artem for testing (CCed), because he's triggering it too: https://bugzilla.kernel.org/show_bug.cgi?id=212357

Okay. I would vote for the patch from Bas as it was combined work from two authors and has been tested by several AMD users.
But let me paste it here too for Artem to see if this also works for him:

From 00e0622b1b693a5c7dc343aeb3aa51614a9e125e Mon Sep 17 00:00:00 2001
From: Bas Nieuwenhuizen
Date: Fri, 12 Mar 2021 21:27:40 +0800
Subject: [PATCH] tools/power/turbostat: Fix turbostat for AMD Zen CPUs

It was reported that on a Zen+ system turbostat started exiting, which was tracked down to the MSR_PKG_ENERGY_STAT read failing because offset_to_idx wasn't returning a non-negative index. This patch combines the modifications from Bingsong Si and Bas Nieuwenhuizen and adds the MSR to the index system as an alternative for MSR_PKG_ENERGY_STATUS.

Fixes: 9972d5d84d76 ("tools/power turbostat: Enable accumulate RAPL display")
Reported-by: youling257
Tested-by: youling257
Tested-by: sibingsong
Tested-by: Kurt Garloff
Co-developed-by: Bingsong Si
Signed-off-by: Chen Yu
---
 tools/power/x86/turbostat/turbostat.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tools/power/x86/turbostat/turbostat.c b/tools/power/x86/turbostat/turbostat.c
index a7c4f0772e53..a7c965734fdf 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -297,7 +297,10 @@ int idx_to_offset(int idx)
 	switch (idx) {
 	case IDX_PKG_ENERGY:
-		offset = MSR_PKG_ENERGY_STATUS;
+		if (do_rapl & RAPL_AMD_F17H)
+			offset = MSR_PKG_ENERGY_STAT;
+		else
+			offset = MSR_PKG_ENERGY_STATUS;
 		break;
 	case IDX_DRAM_ENERGY:
 		offset = MSR_DRAM_ENERGY_STATUS;
@@ -326,6 +329,7 @@ int offset_to_idx(int offset)
 	switch (offset) {
 	case MSR_PKG_ENERGY_STATUS:
+	case MSR_PKG_ENERGY_STAT:
 		idx = IDX_PKG_ENERGY;
 		break;
 	case MSR_DRAM_ENERGY_STATUS:
@@ -353,7 +357,7 @@ int idx_valid(int idx)
 {
 	switch (idx) {
 	case IDX_PKG_ENERGY:
-		return do_rapl & RAPL_PKG;
+		return do_rapl & (RAPL_PKG | RAPL_AMD_F17H);
 	case IDX_DRAM_ENERGY:
 		return do_rapl & RAPL_DRAM;
 	case IDX_PP0_ENERGY:

The patch works for me.

Tested-by: Artem S. Tashkinov
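To make the failure mode concrete, here is a minimal, self-contained C sketch — not the actual turbostat code; the MSR constants are the published Intel/AMD package-energy MSR numbers, and IDX_PKG_ENERGY is a stand-in index — of the lookup the patch repairs. Without the AMD case, offset_to_idx() returns -1 for MSR_PKG_ENERGY_STAT and turbostat bails out immediately on Family 17h/19h:

```c
#include <assert.h>

/* Stand-ins for turbostat's defines; MSR numbers are the documented ones. */
#define MSR_PKG_ENERGY_STATUS 0x611u        /* Intel package energy MSR */
#define MSR_PKG_ENERGY_STAT   0xc001029bu   /* AMD Fam 17h/19h package energy MSR */
#define IDX_PKG_ENERGY 0

/* Map an MSR offset back to its accumulator index; -1 means "unknown MSR",
 * which is exactly what made turbostat exit on AMD before the fix. */
int offset_to_idx(unsigned int offset)
{
	switch (offset) {
	case MSR_PKG_ENERGY_STATUS:
	case MSR_PKG_ENERGY_STAT:	/* the case that was missing */
		return IDX_PKG_ENERGY;
	default:
		return -1;
	}
}
```

The fix is symmetric: idx_to_offset() must also pick the AMD MSR when do_rapl indicates RAPL_AMD_F17H, so the offset-to-idx-to-offset round trip stays consistent.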
A long standing issue with RAM usage reporting
Hello everyone,

I'd love to bring kernel developers' attention to this long-standing issue: https://bugzilla.kernel.org/show_bug.cgi?id=201675

It would be great if something were done about it, because otherwise htop, top, free and numerous other utilities in Linux have to implement hacks and workarounds to properly report free/used RAM:

https://github.com/htop-dev/htop/issues/556
https://gitlab.com/procps-ng/procps/-/issues/196

There's also another related issue: https://bugzilla.kernel.org/show_bug.cgi?id=201673 but it will be automatically solved once the initial bug report has been dealt with.

Best regards,
Artem
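For the record, the workaround the tools above resort to boils down to trusting the kernel's own MemAvailable estimate from /proc/meminfo (exported since kernel 3.14) instead of deriving "free" memory themselves. A hedged sketch of such a parser — the function name and shape are mine, not htop's or procps':

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Extract a field value (in kB) from /proc/meminfo-style text; -1 if absent.
 * Sketch only: real tools read /proc/meminfo and prefer MemAvailable
 * when the kernel provides it, falling back to MemFree arithmetic. */
long parse_meminfo_kb(const char *text, const char *key)
{
	size_t klen = strlen(key);
	const char *p = text;

	while (p && *p) {
		/* A field only counts at the start of a line, e.g. "MemAvailable:". */
		if (strncmp(p, key, klen) == 0 && p[klen] == ':')
			return strtol(p + klen + 1, NULL, 10);
		p = strchr(p, '\n');	/* advance to the next line */
		if (p)
			p++;
	}
	return -1;
}
```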
[PATCH] Kconfig: default to CC_OPTIMIZE_FOR_PERFORMANCE_O3 for gcc >= 10
> GCC 10 appears to have changed -O2 in order to make compilation time faster when using -flto, seemingly at the expense of performance, in particular with regards to how the inliner works. Since -O3 these days shouldn't have the same set of bugs as 10 years ago, this commit defaults new kernel compiles to -O3 when using gcc >= 10.

It's a strong "no" from me.

1) Aside from rare Gentoo users, no one has extensively tested -O3 with the kernel - even Gentoo defaults to -O2 for kernel compilation.
2) -O3 _always_ bloats the code by a large amount, which means both vmlinux/bzImage and modules will become bigger and slower to load from disk.
3) -O3 does _not_ necessarily make the code run faster.
4) If GCC 10 has removed certain options from the -O2 optimization level, you could just re-add them as compilation flags without forcing -O3 by default on everyone.
5) If you still insist on -O3, I guess everyone would be happy if you just made two Kconfig options: OPTIMIZE_O2 (-O2) and OPTIMIZE_O3_EVEN_MOAR (-O3).

Best regards,
Artem
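Point 5 would amount to extending the existing optimization-level choice block in init/Kconfig rather than changing a default; something like the following sketch (illustrative only, not a tested patch - option naming follows the existing CC_OPTIMIZE_FOR_* convention):

```kconfig
choice
	prompt "Compiler optimization level"
	default CC_OPTIMIZE_FOR_PERFORMANCE

config CC_OPTIMIZE_FOR_PERFORMANCE
	bool "Optimize for performance (-O2)"

config CC_OPTIMIZE_FOR_PERFORMANCE_O3
	bool "Optimize more for performance (-O3)"

endchoice
```

This way -O3 stays strictly opt-in and -O2 remains the default for everyone who doesn't explicitly select otherwise.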
Re: Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
On 8/5/19 9:05 AM, Hillf Danton wrote:

On Sun, 4 Aug 2019 09:23:17 + "Artem S. Tashkinov" wrote:

Hello,

There's this bug which has been bugging many people for many years already and which is reproducible in less than a few minutes under the latest and greatest kernel, 5.2.6. All the kernel parameters are set to defaults.

Thanks for the report!

Steps to reproduce:
1) Boot with mem=4G
2) Disable swap to make everything faster (sudo swapoff -a)
3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
4) Start opening tabs in either of them and watch your free RAM decrease

We saw another corner-case CPU hog report under memory pressure, also with swap disabled. In that report the xfs filesystem was a factor, with CONFIG_MEMCG enabled. Anything special, say like

kernel:watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [leaker1:7193]

or

[ 3225.313209] Xorg: page allocation failure: order:4, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0

in your kernel log?

I'm running ext4 only, without LVM, encryption or anything like that. Plain GPT/MBR partitions with plenty of free space and no disk errors.

Once you hit a situation where opening a new tab requires more RAM than is currently available, the system will stall hard. You will barely be able to move the mouse pointer. Your disk LED will be flashing incessantly (I'm not entirely sure why). You will not be able to run new applications or close currently running ones.

A CPU hog may come on top of the memory hog in some scenarios.

It might have happened as well - I couldn't know, since I wasn't able to open a terminal. Once the system recovered, there was no trace of anything extraordinary.

This little crisis may continue for minutes or even longer. I think that's not how the system should behave in this situation. I believe something must be done about that to avoid this stall.

Yes, Sir.

I'm almost sure some sysctl parameters could be changed to avoid this situation, but something tells me this should be done for everyone and made the default, because some non-tech-savvy users will just give up on Linux if they ever get into a situation like this, and they won't be keen, or even able, to Google for solutions.

I am not willing to repeat that it is hard to produce a pill for all patients, but the info you post will help solve the crisis sooner.

Hillf

In case you have trouble reproducing this bug report I can publish a VM image - still, everything is quite mundane: Fedora 30 + XFCE + a web browser. Nothing else, nothing fancy.

Regards,
Artem
Let's talk about the elephant in the room - the Linux kernel's inability to gracefully handle low memory pressure
Hello,

There's this bug which has been bugging many people for many years already and which is reproducible in less than a few minutes under the latest and greatest kernel, 5.2.6. All the kernel parameters are set to defaults.

Steps to reproduce:
1) Boot with mem=4G
2) Disable swap to make everything faster (sudo swapoff -a)
3) Launch a web browser, e.g. Chrome/Chromium or/and Firefox
4) Start opening tabs in either of them and watch your free RAM decrease

Once you hit a situation where opening a new tab requires more RAM than is currently available, the system will stall hard. You will barely be able to move the mouse pointer. Your disk LED will be flashing incessantly (I'm not entirely sure why). You will not be able to run new applications or close currently running ones. This little crisis may continue for minutes or even longer.

I think that's not how the system should behave in this situation. I believe something must be done about that to avoid this stall.

I'm almost sure some sysctl parameters could be changed to avoid this situation, but something tells me this should be done for everyone and made the default, because some non-tech-savvy users will just give up on Linux if they ever get into a situation like this, and they won't be keen, or even able, to Google for solutions.

Best regards,
Artem
On the issue of CPU model-specific registers write protection in UEFI secure boot mode
Hello LKML,

Is there a serious reason why CPU MSRs are write-protected in UEFI secure boot mode in Linux?

* In order to even use MSRs you have to be root to `modprobe msr`.
* In order to read/write from/to MSRs you have to be root, as /dev/cpu/*/msr is accessible only by root.
* CPU registers don't survive reboots/power cycles.
* I'm not a CPU designer, but if I'm not mistaken MSRs cannot be used to create any sort of stealth malware.

I'm asking this question because these registers allow fine-tuning Intel CPU power parameters (https://github.com/georgewhewell/undervolt), like voltage and others, and make it possible to run your system both faster and cooler - and right now that's not possible under Linux while being perfectly possible under competing proprietary OSes.

Of course, the user can:
* fetch his distro kernel sources
* apply a patch from https://github.com/intel/intel-cmt-cat/wiki/UEFI-Secure-Boot-Compatibility
* install his own UEFI certificate
* compile, sign and install a patched msr kernel module

However, all of this has to be done for each new kernel release, and many Linux users simply cannot do anything on this list.

Best regards,
Artem
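The access path described above (root, `modprobe msr`, then /dev/cpu/*/msr) is simple enough to show; a minimal C sketch, where the MSR number is passed as the pread() offset - this is how the msr character device works - with error handling reduced to the bare minimum:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Read one 64-bit MSR via the msr driver (requires root and `modprobe msr`).
 * Returns 0 on success, -1 on failure. Sketch only: tools like undervolt
 * and msr-tools layer capability checks and error reporting on top. */
int read_msr(int cpu, uint32_t reg, uint64_t *val)
{
	char path[64];
	int fd;
	ssize_t n;

	snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;	/* no msr module, no such CPU, or no permission */
	n = pread(fd, val, sizeof(*val), reg);	/* file offset = MSR number */
	close(fd);
	return n == (ssize_t)sizeof(*val) ? 0 : -1;
}
```

Writing is the mirror image with pwrite() on an O_WRONLY descriptor - and that write path is precisely what gets locked down under secure boot.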
Disabling CPU vulnerabilities workarounds (second try)
Hello,

I'm resending my last email since the first one didn't draw enough attention despite the gravity of the situation, and the issue has been exacerbated by the recent kernel 4.20 changes, which incur an even larger performance loss - up to 50% according to the most recent Phoronix testing: https://www.phoronix.com/scan.php?page=article=linux-420-stibp It looks like only pure compute loads are unaffected by the new code, which renders hyper-threading in Intel CPUs almost useless.

The original email follows:

***

As time goes by, more and more fixes for Intel/AMD/ARM CPU vulnerabilities are added to the Linux kernel without a simple way to disable them *all*. Disabling is a good option for strictly confined environments where no third-party untrusted code is ever to be run, e.g. a rendering farm, a supercomputer, or even a home server which runs a Samba/SSH server and nothing else.

I wonder if someone could write a patch which implements the following two options for the kernel:

* A boot option which allows disabling *most* protections/workarounds/fixes (as far as I understand some of them can't be reverted since they are compiled in or use certain GCC workarounds), e.g. let's call it "insecure" or "insecurecpumode"
* A compile-time config option which disables the said fixes _permanently_, without a way to turn them back on.

Right now linux/Documentation/admin-guide/kernel-parameters.txt is a mess of various things which takes ages to sift through, and there's zero certainty whether you've found everything and correctly disabled it.

***

Addendum: I can imagine that writing such a patch is not trivial and no one is eager to do it. In that case people would love to see an extra file in the kernel documentation, e.g. CPU-vulnerabilities.txt, which lists all the existing protections in the kernel and the boot options to disable them. The Internet is already rife with questions about how to disable the said protections, and the answers are quite different.

In short, it would be great to have some organization in regard to this issue.

Regards,
Artem
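As a stopgap, the closest thing to such an inventory is the sysfs interface the kernel has exported since 4.15: one file per known vulnerability under /sys/devices/system/cpu/vulnerabilities/, each reporting whether and how it is mitigated. A small hedged C sketch that dumps it (returning 0 silently on kernels or systems without that directory):

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/* Print mitigation status per vulnerability as the kernel reports it.
 * Returns the number of entries printed (0 if the sysfs dir is absent). */
int list_vulnerabilities(void)
{
	const char *dirpath = "/sys/devices/system/cpu/vulnerabilities";
	struct dirent *e;
	int count = 0;
	DIR *d = opendir(dirpath);

	if (!d)
		return 0;	/* pre-4.15 kernel, or not Linux */
	while ((e = readdir(d)) != NULL) {
		char path[512], line[256];
		FILE *f;

		if (e->d_name[0] == '.')
			continue;	/* skip "." and ".." */
		snprintf(path, sizeof(path), "%s/%s", dirpath, e->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(line, sizeof(line), f))
			printf("%-24s %s", e->d_name, line);
		fclose(f);
		count++;
	}
	closedir(d);
	return count;
}
```

This only reports state; it still doesn't tell you which boot parameter disables which mitigation, which is the documentation gap argued above.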
Re: Disabling CPU vulnerabilities workarounds
On 08/25/2018 06:39 PM, Casey Schaufler wrote:

On 8/25/2018 3:42 AM, Artem S. Tashkinov wrote:

Hello LKML,

As time goes by more and more fixes of Intel/AMD/ARM CPU vulnerabilities are added to the Linux kernel without a simple way to disable them all in one fell swoop.

Many of the mitigations are unrelated to each other. There is no one aspect of the system that identifies a behavior as a security issue. I don't know anyone who could create a list of all the "fixes" that have gone in over the years. Realize that features like speculative execution have had security issues that are unrelated to obscure attacks like side-channels. While you may think that you don't care, some of those flaws affect correctness. My bet is you wouldn't want to disable those.

As far as I know, mitigations started to appear in January 2018, and kernels released prior to that date all work just fine without any issues with "correctness", so I'm not sure what you're talking about. I'm quite sure at least Intel knows perfectly well, as does Linus Torvalds, who coordinates everything.

Also

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f73fa6f6d85e..e6362717c895 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -991,7 +991,7 @@ static void __init cpu_set_bug_bits(struct cpuinfo_x86 *c)
 {
 	u64 ia32_cap = 0;

-	if (x86_match_cpu(cpu_no_speculation))
+	//if (x86_match_cpu(cpu_no_speculation))
 		return;

 	setup_force_cpu_bug(X86_BUG_SPECTRE_V1);

and setting this in .config:

CONFIG_RETPOLINE=n
CONFIG_PAGE_TABLE_ISOLATION=n

ostensibly disables all mitigations, and everything continues to work just fine.

Disabling is a good option for strictly confined environments where no third-party untrusted code is ever to be run, e.g. a rendering farm, a supercomputer, or even a home server which runs a Samba/SSH server and nothing else.

Like maybe the software in centrifuges in a nuclear fuel processing plant? All the examples you've cited are network connected and are vulnerable to attack. And don't try the "no untrusted code" argument. You'll have code on those systems that has been known vulnerable for decades.

I'm not sure 1) why you're trying to mix unrelated classes of vulnerabilities - of course there are vulnerabilities other than the ones caused by speculative execution; 2) why you're insisting that my argument, that someone may never run untrusted code, has no merit. I may perfectly well have a standard Linux distro installed on my PC/server and never run a web browser or any similar applications other than the ones provided by my distro in the form of various packages - which means I will never run any untrusted code. I will also never run any scriptable applications (bash/python/php/ruby/etc.) from the net either. How might such a configuration be susceptible to speculative execution attacks?

I wonder if someone could write a patch which implements the following two options for the kernel:

* A boot option which allows disabling most runtime protections/workarounds/fixes (as far as I understand some of them can't be reverted since they are compiled in or use certain GCC flags), e.g. let's call it "insecure" or "insecurecpumode".

That would be an interesting exercise for the opposite case. A boot option that enables all the runtime protections would certainly be interesting to some people. If you could implement one, you could do the other. I would be happy to review such a patch. Go for it.

I'd love to leave that task to those who are more proficient in writing kernel code and whose work is more likely to be merged. My patch might never be streamlined for totally unrelated reasons (and we've seen too many examples of that already).

* A compile-time CONFIG_ option which disables all these fixes _permanently_, without a way to turn them back on later during runtime.

This suffers from all the challenges previously mentioned, but would be equally interesting, again for the opposite case.

Again, I see no challenges since, for instance, RHEL has gone as far as backporting all the patches to previously released, officially unmaintained kernels, so all these patches could easily be disabled if one really wanted to.

Right now linux/Documentation/admin-guide/kernel-parameters.txt is a mess of various things which takes ages to sift through, and there's zero certainty whether you've found everything and correctly disabled it.

I can't argue with you on that. Again, I believe the greater value would come from documenting how to turn everything on.

I guess you meant "turn everything off".

Best regards,
Artem
Disabling CPU vulnerabilities workarounds
Hello LKML,

As time goes by more and more fixes of Intel/AMD/ARM CPU vulnerabilities are added to the Linux kernel without a simple way to disable them all in one fell swoop. Disabling is a good option for strictly confined environments where no third-party untrusted code is ever to be run, e.g. a rendering farm, a supercomputer, or even a home server which runs a Samba/SSH server and nothing else.

I wonder if someone could write a patch which implements the following two options for the kernel:

* A boot option which allows disabling most runtime protections/workarounds/fixes (as far as I understand some of them can't be reverted since they are compiled in or use certain GCC flags), e.g. let's call it "insecure" or "insecurecpumode".
* A compile-time CONFIG_ option which disables all these fixes _permanently_, without a way to turn them back on later during runtime.

Right now linux/Documentation/admin-guide/kernel-parameters.txt is a mess of various things which takes ages to sift through, and there's zero certainty whether you've found everything and correctly disabled it.

Best regards,
Artem
On the kernel numbering scheme
Hello all,

I know this proposal has already been made a great many times, but I'd like to repeat it and have a healthy discussion about it.

The current kernel numbering scheme makes no sense because the first two numbers don't represent anything at all. They had some meaning back in the 1.x/2.x/3.x days, but with the introduction of the new rolling development model they became worthless.

I'd love to change the kernel numbering scheme to this: YEAR.RELEASE.PATCH_LEVEL

So the first kernel to be released in 2019 will be numbered 2019.0(.0), its subsequent releases will be 2019.1, 2019.2, 2019.3, etc., and its stable patches will be 2019.0.1, 2019.0.2, 2019.0.3, 2019.0.4, etc. With this scheme you can easily see how fresh your kernel is, and there's no need to arbitrarily raise the first number because it always matches the current year.

There's one minor detail which might raise some questions: there are release candidates and then there's a release, so for development which starts before the year's end we might start with e.g. 2018.5-rc1 and then, if the actual release crosses the new year mark, simply turn 2018.5-rc7 into 2019.0.0.

Best regards,
Artem S. Tashkinov
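The proposed scheme is mechanical enough to pin down in a few lines; a toy C sketch of one possible reading of it (my interpretation, not part of the proposal: the patch level is printed for stable releases and for the year's .0 release, and omitted for later mainline releases):

```c
#include <assert.h>
#include <stdio.h>

/* Format a version under the proposed YEAR.RELEASE[.PATCH_LEVEL] scheme.
 * Returns the number of characters written (snprintf semantics).
 * 2019.0.0 -> first release of the year; 2019.1 -> next mainline release;
 * 2019.0.2 -> second stable update of 2019.0. */
int format_version(char *buf, size_t len, int year, int release, int patch)
{
	if (patch > 0 || release == 0)
		return snprintf(buf, len, "%d.%d.%d", year, release, patch);
	return snprintf(buf, len, "%d.%d", year, release);
}
```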
Trying to understand General protection fault/hrtimer_active
Hello,

After stopping mariadb on our database server, the server physically crashed and required a hard reset in order to get back online. Fortunately the system was able to dump the kernel error:

Aug 11 09:22:44 mariadb mysqld[1229]: 2017-08-11 9:22:44 140417868658432 [ERROR] mysqld: Deadlock found when trying to get lock; try restarting transaction
Aug 11 09:24:03 mariadb kernel: [225113.038696] general protection fault: [#1] SMP
Aug 11 09:24:03 mariadb kernel: [225113.038709] Modules linked in: ppdev intel_rapl x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul joydev input_leds glue_helper ablk_helper cryptd serio_raw shpchp lpc_ich parport_pc 8250_fintek parport tpm_infineon mac_hid nct6775 hwmon_vid coretemp autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid0 multipath linear raid1 mxm_wmi ahci psmouse r8169 libahci mii wmi video fjes
Aug 11 09:24:03 mariadb kernel: [225113.038836] CPU: 3 PID: 3570 Comm: mysqld Not tainted 4.4.0-89-generic #112-Ubuntu
Aug 11 09:24:03 mariadb kernel: [225113.038853] Hardware name: MSI MS-7816/H87-G43 (MS-7816), BIOS V2.14B6 08/23/2013
Aug 11 09:24:03 mariadb kernel: [225113.038868] task: 8807f6f88e00 ti: 8807f6534000 task.ti: 8807f6534000
Aug 11 09:24:03 mariadb kernel: [225113.038881] RIP: 0010:[] [] hrtimer_active+0x9/0x60
Aug 11 09:24:03 mariadb kernel: [225113.038899] RSP: 0018:8807f65379e0 EFLAGS: 00010246
Aug 11 09:24:03 mariadb kernel: [225113.038909] RAX: RBX: ffbf8807f6537a30 RCX:
Aug 11 09:24:03 mariadb kernel: [225113.038922] RDX: RSI: 8807f6f88e00 RDI: ffbf8807f6537a30
Aug 11 09:24:03 mariadb kernel: [225113.038947] RBP: 8807f65379e0 R08: 8807f6534000 R09:
Aug 11 09:24:03 mariadb kernel: [225113.038982] R10: 000103599c14 R11: R12:
Aug 11 09:24:03 mariadb kernel: [225113.039018] R13: 0001 R14: 8807f6537b58 R15:
Aug 11 09:24:03 mariadb kernel: [225113.039053] FS: 7fb69edc5700() GS:88081eac() knlGS:
Aug 11 09:24:03 mariadb kernel: [225113.039091] CS: 0010 DS: ES: CR0: 80050033
Aug 11 09:24:03 mariadb kernel: [225113.039112] CR2: 7fb59e1e7e88 CR3: 0007f943f000 CR4: 001406e0
Aug 11 09:24:03 mariadb kernel: [225113.039148] Stack:
Aug 11 09:24:03 mariadb kernel: [225113.039164] 8807f6537a18 810efba9 8807f6537b58 2cf88ace51220a81
Aug 11 09:24:03 mariadb kernel: [225113.039202] ffbf8807f6537a30 0001 8807f6537ac0
Aug 11 09:24:03 mariadb kernel: [225113.039240] 81841341 05f5e100 88071ab63a30
Aug 11 09:24:03 mariadb kernel: [225113.039278] Call Trace:
Aug 11 09:24:03 mariadb kernel: [225113.039297] [] hrtimer_try_to_cancel+0x29/0x130
Aug 11 09:24:03 mariadb kernel: [225113.039321] [] schedule_hrtimeout_range_clock+0xd1/0x1b0
Aug 11 09:24:03 mariadb kernel: [225113.039346] [] ? __hrtimer_init+0x90/0x90
Aug 11 09:24:03 mariadb kernel: [225113.039369] [] ? schedule_hrtimeout_range_clock+0xb9/0x1b0
Aug 11 09:24:03 mariadb kernel: [225113.039405] [] schedule_hrtimeout_range+0x13/0x20
Aug 11 09:24:03 mariadb kernel: [225113.039430] [] poll_schedule_timeout+0x44/0x70
Aug 11 09:24:03 mariadb kernel: [225113.039453] [] do_sys_poll+0x4af/0x560
Aug 11 09:24:03 mariadb kernel: [225113.039477] [] ? __alloc_skb+0x5b/0x1f0
Aug 11 09:24:03 mariadb kernel: [225113.039500] [] ? __kmalloc_node_track_caller+0x249/0x310
Aug 11 09:24:03 mariadb kernel: [225113.039525] [] ? __alloc_skb+0x87/0x1f0
Aug 11 09:24:03 mariadb kernel: [225113.039548] [] ? poll_select_copy_remaining+0x140/0x140
Aug 11 09:24:03 mariadb kernel: [225113.039572] [] ? _raw_spin_unlock_bh+0x1e/0x20
Aug 11 09:24:03 mariadb kernel: [225113.039596] [] ? release_sock+0x111/0x160
Aug 11 09:24:03 mariadb kernel: [225113.039620] [] ? tcp_recvmsg+0x3fc/0xbe0
Aug 11 09:24:03 mariadb kernel: [225113.039644] [] ? inet_recvmsg+0x7e/0xb0
Aug 11 09:24:03 mariadb kernel: [225113.039666] [] ? sock_recvmsg+0x3d/0x50
Aug 11 09:24:03 mariadb kernel: [225113.039688] [] ? SYSC_recvfrom+0x13d/0x150
Aug 11 09:24:03 mariadb kernel: [225113.039711] [] ? __schedule+0x3b6/0xa30
Aug 11 09:24:03 mariadb kernel: [225113.039734] [] ? ktime_get_ts64+0x49/0xf0
Aug 11 09:24:03 mariadb kernel: [225113.039756] [] SyS_poll+0x71/0x130
Aug 11 09:24:03 mariadb kernel: [225113.039778] [] entry_SYSCALL_64_fastpath+0x16/0x71
Aug 11 09:24:03 mariadb kernel: [225113.039801] Code: 00 00 0f 1f 44 00 00 55 48 c7 47 28 70 f9 0e 81 48 89 77 58 48 89 e5 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 <48> 8b 57 30 eb 1d 80 7f 38 00 75 32 48 3b 78 08 74 2c 39 50 04
Re: The most insane proposal in regard to the Linux kernel development
On 2016-04-07 01:05, Greg KH wrote:
> On Sat, Apr 02, 2016 at 05:43:47PM +0500, Artem S. Tashkinov wrote:
>> One very big justification of this proposal is that core Linux development (I'm talking about various subsystems like mm/ ipc/ and interfaces under block/ fs/ security/ sound/ etc.) has slowed down significantly over the past years, so radical changes which warrant a new kernel API/ABI are less likely to be introduced.
> That's not true at all, the change is constant, and increasing, just look at the tree for proof of that.
>> Please, share your opinion.
> Please read Documentation/stable_api_nonsense.txt for my opinion, and that of the current developers. If you don't agree with this, that's fine, you are welcome to fork the kernel at any specific point and keep that api stable, just like many companies do and make money from it (SuSE, Red Hat, etc.). Best of luck with your kernel project.

Tell me, why is no one in the Linux kernel dev team concerned that:

1) There are up to a hundred regressions in each kernel release, where a big chunk of them are caused by internal API changes?
2) API changes sometimes require drastic changes in every related hardware driver, and since there's no way you can realistically test the code or the hardware, people later discover that their hardware has stopped working?
3) The core kernel developers do not have enough expertise to correctly update the entire kernel source tree, so little things get broken?
4) Developing drivers for a moving target is a Herculean job?
5) You cannot easily bisect kernel regressions, because regressions are often caused by things _outside_ of the problem you're experiencing?
6) You cannot use new drivers for your hardware on your old kernel, because new drivers are incompatible with an old source tree (don't remind me of RHEL's kernel - it's a rare exception and they usually port only the drivers their respective clients use)?
7) Tech-unsavvy people cannot realistically debug the kernel?
Hey, please, do not tell me that you're doing a great job following postings in LKML or resolving bugs filed in bugzilla. You do a very lousy job indeed - multiple postings in LKML get zero replies because the corresponding developer is either not subscribed to LKML at all, or he has missed the message. There are literally hundreds(!) of bugs in bugzilla which have ZERO replies. What's more, a great number of kernel developers do not have accounts in bugzilla and they don't read the corresponding mailing lists. What the hell is wrong with you guys? You're developing the kernel like it's your toy project.

1) There's no accountability whatsoever.
2) There are no unit tests. Not a single one.
3) There's no surefire way to contact developers who have committed "bad" code.
4) There's no sense of direction.
5) There's no easy way to debug the kernel.

For instance, let's talk about the revoke() call. Right now, if a certain IO device is removed while files on it are still open (there are multiple ways of opening files in Linux, starting from fopen() and ending with mmap()), the kernel state is basically undefined(!). Great! The corresponding mount point cannot be reused(!). Whatever program has its file descriptors on this accidentally removed device usually cannot gracefully quit or continue working. How on Earth does this syscall not get the utmost attention? Then we have bug 12309(1). My last comment to this bug gives a very simple way of reproducing it on all Android devices. Then we have bug 15875(2), which will probably take just ten man-hours to be resolved, yet there is no interest at all, yet thousands of people have very real problems due to it. Tell me, are you really proud of yourselves? Tell me, do you develop the kernel for your amusement, ego, your employer, or for average people to use? Tell me, are you really interested in more people migrating from stable, long-term-supported OSes to Linux? I want some truly honest answers.
And let's not repeat this mantra "we don't have enough resources". You have enough resources to break APIs/ABIs in a huge way, you have enough resources to introduce regressions - you just don't have enough resources for any semblance of a responsible development process.

Best regards,
Artem

1) https://bugzilla.kernel.org/show_bug.cgi?id=12309
2) https://bugzilla.kernel.org/show_bug.cgi?id=15875
The most insane proposal in regard to the Linux kernel development
Hello all,

It's not a secret that there are two basic ways of running a Linux distribution on your hardware. Either you use a stable distro which has quite an outdated kernel release that might not support your hardware, or you run the most recent stable version but you lose stability and are prone to regressions. This problem can be solved by decoupling drivers from the kernel and supplying them separately, so that you could enjoy stable kernel version X with brand new drivers, like it's done in most other proprietary OSes. I've been thinking of asking Linus about this decoupling for years already but I'm hesitant 'cause I'm 99.9% sure he will downright reject this proposal. Still, I'm gonna risk asking 'cause there are multiple pluses to this proposal:

1) We might have truly stable, really long-term-supported kernels (3-5 years or more).
2) The kernel size will be reduced by two orders of magnitude.
3) The user will be free to try different kernel driver versions without leaving his/her stable kernel.
4) Drivers will become easier to develop, debug and maintain (usually the developer will just have two kernel trees to target and test against).
5) There will be a sense of QA/QC and accountability (nothing like that exists at the moment, as reflected by a very long list of regressions for every kernel release).
6) Driver regressions will be easier to spot ('cause you can be sure that no other kernel changes have had undesired consequences/conflicts - right now driver A might break, and does occasionally break, because unrelated feature B has been reworked/tweaked/etc.).
7) There will be a lot fewer kernel releases and no constant rush to update them.
8) Kernel release numbers will become meaningful again. Right now no one can quickly say what's the difference between kernel 4.5.0 and 4.1.0.
This means kernel development must be changed to accommodate this proposal:

1) Yeah, I know, you all hate that, but stable APIs and ABIs must be introduced and supported for, let's say, at least three to five years.
2) Like we used to have during the 2.2.x and 2.4.x development cycles, unstable kernels with new APIs must be developed in parallel to stable ones.
3) Of course that means drivers for every kernel tree (stable/unstable) must be developed in parallel. In the future, perhaps, several parallel driver versions will have to be developed, e.g. drivers for kernels 1.0.x (stable), 1.2.x (next stable) and 1.3.x (unstable). However, taking into consideration that these three kernel releases span the range of 3..5 * 3 years = 9..15 years, older kernels will stop being supported eventually.

In short, I'm offering the concept of Windows NT kernel development. They have very rare stable kernel releases (e.g. XP SP0, SP1, SP2, 2003, 2003 R2 - all binary compatible), then the Vista kernel began development, and after its release six years later, hardware vendors had to support just two kernel releases. Not that it is a big issue. One very big justification of this proposal is that core Linux development (I'm talking about various subsystems like mm/ ipc/ and interfaces under block/ fs/ security/ sound/ etc.) has slowed down significantly over the past years, so radical changes which warrant a new kernel API/ABI are less likely to be introduced. Please, share your opinion.

--
Best regards,
Artem
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 10:55, Kent Overstreet wrote:
> On Tue, Dec 22, 2015 at 10:52:37AM +0500, Artem S. Tashkinov wrote:
>> On 2015-12-22 10:38, Kent Overstreet wrote:
>>> On Tue, Dec 22, 2015 at 05:26:12AM +, Junichi Nomura wrote:
>>>> On 12/22/15 12:59, Kent Overstreet wrote:
>>>>> reproduced it with 32 bit pae:
>>>>>> 1. Exclude memory above 4G line with boot param "max_addr=4G".
>>>>> doesn't work - max_addr=1G doesn't work either
>>>>>> 2. Disable highmem with "highmem=0".
>>>>> works!
>>>>>> 3. Try booting 64bit kernel.
>>>>> works
>>>> blk_queue_bio() does split then bounce, which makes the segment counting based on pages before bouncing and could go wrong.
>>>> What do you think of a patch like this?
>>> Artem, can you give this patch a try?
>> This patch ostensibly fixes the issue - at least I cannot immediately reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov"
> Let's all contemplate the fact that blk_segment_map_sg() _overrunning the end of the provided sglist_ was this much of a clusterfuck to debug.

From the look of it this fix has nothing to do with PAE, so why were only PAE users like me affected by the original (b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c) patch?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 10:38, Kent Overstreet wrote:
> On Tue, Dec 22, 2015 at 05:26:12AM +, Junichi Nomura wrote:
>> On 12/22/15 12:59, Kent Overstreet wrote:
>>> reproduced it with 32 bit pae:
>>>> 1. Exclude memory above 4G line with boot param "max_addr=4G".
>>> doesn't work - max_addr=1G doesn't work either
>>>> 2. Disable highmem with "highmem=0".
>>> works!
>>>> 3. Try booting 64bit kernel.
>>> works
>> blk_queue_bio() does split then bounce, which makes the segment counting based on pages before bouncing and could go wrong.
>> What do you think of a patch like this?
> Artem, can you give this patch a try?

This patch ostensibly fixes the issue - at least I cannot immediately reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov"

--
Jun'ichi Nomura, NEC Corporation

diff --git a/block/blk-core.c b/block/blk-core.c
index 5131993b..1d1c3c7 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1689,8 +1689,6 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio)
 	struct request *req;
 	unsigned int request_count = 0;

-	blk_queue_split(q, &bio, q->bio_split);
-
 	/*
 	 * low level driver can indicate that it wants pages above a
 	 * certain limit bounced to low memory (ie for highmem, or even
@@ -1698,6 +1696,8 @@
 	 */
 	blk_queue_bounce(q, &bio);

+	blk_queue_split(q, &bio, q->bio_split);
+
 	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
 		bio->bi_error = -EIO;
 		bio_endio(bio);
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 01:07, Tejun Heo wrote:
> Hello, Artem.
>
> Can you please apply the following patch on top and see whether anything changes? If it does make the issue go away, can you please revert the ".can_queue" part and test again?
>
> Thanks.
>
> ---
>  drivers/ata/ahci.h    | 2 +-
>  drivers/ata/libahci.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> --- a/drivers/ata/ahci.h
> +++ b/drivers/ata/ahci.h
> @@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde
>   */
>  #define AHCI_SHT(drv_name) \
>  	ATA_NCQ_SHT(drv_name), \
> -	.can_queue = AHCI_MAX_CMDS - 1, \
> +	.can_queue = 1 /*AHCI_MAX_CMDS - 1*/, \
>  	.sg_tablesize = AHCI_MAX_SG, \
>  	.dma_boundary = AHCI_DMA_BOUNDARY, \
>  	.shost_attrs = ahci_shost_attrs, \
> --- a/drivers/ata/libahci.c
> +++ b/drivers/ata/libahci.c
> @@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev
>  		hpriv->saved_cap2 = cap2 = 0;
>
>  	/* some chips have errata preventing 64bit use */
> -	if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
> +	if ((cap & HOST_CAP_64)/* && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)*/) {
>  		dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n");
>  		cap &= ~HOST_CAP_64;
>  	}

With the ".can_queue" part left intact the bug resurfaced. Full dmesg output is attached.

[Attachment: dmesg.xz (application/xz)]
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 01:07, Tejun Heo wrote:
> Hello, Artem.
>
> Can you please apply the following patch on top and see whether anything changes? If it does make the issue go away, can you please revert the ".can_queue" part and test again?
>
> Thanks.
>
> ---
>  drivers/ata/ahci.h    | 2 +-
>  drivers/ata/libahci.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> --- a/drivers/ata/ahci.h
> +++ b/drivers/ata/ahci.h
> @@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde
>   */
>  #define AHCI_SHT(drv_name) \
>  	ATA_NCQ_SHT(drv_name), \
> -	.can_queue = AHCI_MAX_CMDS - 1, \
> +	.can_queue = 1 /*AHCI_MAX_CMDS - 1*/, \
>  	.sg_tablesize = AHCI_MAX_SG, \
>  	.dma_boundary = AHCI_DMA_BOUNDARY, \
>  	.shost_attrs = ahci_shost_attrs, \
> --- a/drivers/ata/libahci.c
> +++ b/drivers/ata/libahci.c
> @@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev
>  		hpriv->saved_cap2 = cap2 = 0;
>
>  	/* some chips have errata preventing 64bit use */
> -	if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) {
> +	if ((cap & HOST_CAP_64)/* && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)*/) {
>  		dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n");
>  		cap &= ~HOST_CAP_64;
>  	}

This patch fixes the issue for me. Now rechecking without the .can_queue part.

BTW, since I left debugging on, here's the part you wanted:

[0.613851] XXX port 0 dma_sz=91392 mem=c002 mem_dma=0002 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[0.613865] XXX port 1 dma_sz=91392 mem=eea0 mem_dma=2ea0 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[0.620464] XXX port 2 dma_sz=91392 mem=eea2 mem_dma=2ea2 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[0.627121] XXX port 3 dma_sz=91392 mem=eea4 mem_dma=2ea4 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[0.633791] XXX port 4 dma_sz=91392 mem=eea6 mem_dma=2ea6 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
[0.640445] XXX port 5 dma_sz=91392 mem=eea8 mem_dma=2ea8 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 10:23, Linus Torvalds wrote:
> On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvalds wrote:
>> That said, we obviously need to figure out this current problem regardless first..
> ... although maybe it *would* be interesting to hear what happens if you just compile a 64-bit kernel instead?

Under x86-64 I cannot reproduce this problem. It seems like it's PAE specific (Kent Overstreet says he has reproduced it).

> Do you still see the problem? Because if not, then we should look very specifically for some 32-bit PAE issue. For example, maybe we use "unsigned long" somewhere where we should use "phys_addr_t". On x86-64, they obviously end up being the same. On normal non-PAE x86-32, they are also the same. But ..
>
> Linus
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 01:07, Tejun Heo wrote: Hello, Artem. Can you please apply the following patch on top and see whether anything changes? If it does make the issue go away, can you please revert the ".can_queue" part and test again? Thanks. --- drivers/ata/ahci.h|2 +- drivers/ata/libahci.c |2 +- 2 files changed, 2 insertions(+), 2 deletions(-) --- a/drivers/ata/ahci.h +++ b/drivers/ata/ahci.h @@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde */ #define AHCI_SHT(drv_name) \ ATA_NCQ_SHT(drv_name), \ - .can_queue = AHCI_MAX_CMDS - 1,\ + .can_queue = 1/*AHCI_MAX_CMDS - 1*/, \ .sg_tablesize = AHCI_MAX_SG, \ .dma_boundary = AHCI_DMA_BOUNDARY,\ .shost_attrs= ahci_shost_attrs, \ --- a/drivers/ata/libahci.c +++ b/drivers/ata/libahci.c @@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev hpriv->saved_cap2 = cap2 = 0; /* some chips have errata preventing 64bit use */ - if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) { + if ((cap & HOST_CAP_64)/* && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)*/) { dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n"); cap &= ~HOST_CAP_64; } With the ".can_queue" part left intact the bug resurfaced. Full dmesg output is attached. dmesg.xz Description: application/xz
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 10:23, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvaldswrote: That said, we obviously need to figure out this current problem regardless first.. ... although maybe it *would* be interesting to hear what happens if you just compile a 64-bit kernel instead? Under x86-64 I cannot reproduce this problem. It seems like it's PAE specific (Kent Overstreet says he has reproduced it). Do you still see the problem? Because if not, then we should look very specifically for some 32-bit PAE issue. For example, maybe we use "unsigned long" somewhere where we should use "phys_addr_t". On x86-64, they obviously end up being the same. On normal non-PAE x86-32, they are also the same. But .. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 10:55, Kent Overstreet wrote: On Tue, Dec 22, 2015 at 10:52:37AM +0500, Artem S. Tashkinov wrote: On 2015-12-22 10:38, Kent Overstreet wrote: >On Tue, Dec 22, 2015 at 05:26:12AM +, Junichi Nomura wrote: >>On 12/22/15 12:59, Kent Overstreet wrote: >>> reproduced it with 32 bit pae: >>> >>>> 1. Exclude memory above 4G line with boot param "max_addr=4G". >>> >>> doesn't work - max_addr=1G doesn't work either >>> >>>> 2. Disable highmem with "highmem=0". >>> >>> works! >>> >>>> 3. Try booting 64bit kernel. >>> >>> works >> >>blk_queue_bio() does split then bounce, which makes the segment >>counting based on pages before bouncing and could go wrong. >> >>What do you think of a patch like this? > >Artem, can you give this patch a try? This patch ostensibly fixes the issue - at least I cannot immediately reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov" Let's all contemplate the fact that blk_segment_map_sg() _overrunning the end of the provided sglist_ was this much of a clusterfuck to debug. From the look of it this fix has nothing to do with PAE, so then why only PAE users like me were affected by the original (b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c) patch?
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 10:38, Kent Overstreet wrote: On Tue, Dec 22, 2015 at 05:26:12AM +, Junichi Nomura wrote: On 12/22/15 12:59, Kent Overstreet wrote: > reproduced it with 32 bit pae: > >> 1. Exclude memory above 4G line with boot param "max_addr=4G". > > doesn't work - max_addr=1G doesn't work either > >> 2. Disable highmem with "highmem=0". > > works! > >> 3. Try booting 64bit kernel. > > works blk_queue_bio() does split then bounce, which makes the segment counting based on pages before bouncing and could go wrong. What do you think of a patch like this? Artem, can you give this patch a try? This patch ostensibly fixes the issue - at least I cannot immediately reproduce it. You can count me in as "Tested-by: Artem S. Tashkinov" -- Jun'ichi Nomura, NEC Corporation diff --git a/block/blk-core.c b/block/blk-core.c index 5131993b..1d1c3c7 100644 --- a/block/blk-core.c +++ b/block/blk-core.c @@ -1689,8 +1689,6 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) struct request *req; unsigned int request_count = 0; - blk_queue_split(q, &bio, q->bio_split); - /* * low level driver can indicate that it wants pages above a * certain limit bounced to low memory (ie for highmem, or even @@ -1698,6 +1696,8 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, struct bio *bio) */ blk_queue_bounce(q, &bio); + blk_queue_split(q, &bio, q->bio_split); + if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) { bio->bi_error = -EIO; bio_endio(bio);
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-22 01:07, Tejun Heo wrote: Hello, Artem. Can you please apply the following patch on top and see whether anything changes? If it does make the issue go away, can you please revert the ".can_queue" part and test again? Thanks. --- drivers/ata/ahci.h | 2 +- drivers/ata/libahci.c | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) --- a/drivers/ata/ahci.h +++ b/drivers/ata/ahci.h @@ -365,7 +365,7 @@ extern struct device_attribute *ahci_sde */ #define AHCI_SHT(drv_name) \ ATA_NCQ_SHT(drv_name), \ - .can_queue = AHCI_MAX_CMDS - 1, \ + .can_queue = 1/*AHCI_MAX_CMDS - 1*/, \ .sg_tablesize = AHCI_MAX_SG, \ .dma_boundary = AHCI_DMA_BOUNDARY, \ .shost_attrs = ahci_shost_attrs, \ --- a/drivers/ata/libahci.c +++ b/drivers/ata/libahci.c @@ -420,7 +420,7 @@ void ahci_save_initial_config(struct dev hpriv->saved_cap2 = cap2 = 0; /* some chips have errata preventing 64bit use */ - if ((cap & HOST_CAP_64) && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)) { + if ((cap & HOST_CAP_64)/* && (hpriv->flags & AHCI_HFLAG_32BIT_ONLY)*/) { dev_info(dev, "controller can't do 64bit DMA, forcing 32bit\n"); cap &= ~HOST_CAP_64; } This patch fixes the issue for me. Now rechecking without .can_queue part.
BTW, since I left debugging on, here's the part you wanted: [0.613851] XXX port 0 dma_sz=91392 mem=c002 mem_dma=0002 cmd_slot=0 rx_fis=1024 cmd_tbl=1280 [0.613865] XXX port 1 dma_sz=91392 mem=eea0 mem_dma=2ea0 cmd_slot=0 rx_fis=1024 cmd_tbl=1280 [0.620464] XXX port 2 dma_sz=91392 mem=eea2 mem_dma=2ea2 cmd_slot=0 rx_fis=1024 cmd_tbl=1280 [0.627121] XXX port 3 dma_sz=91392 mem=eea4 mem_dma=2ea4 cmd_slot=0 rx_fis=1024 cmd_tbl=1280 [0.633791] XXX port 4 dma_sz=91392 mem=eea6 mem_dma=2ea6 cmd_slot=0 rx_fis=1024 cmd_tbl=1280 [0.640445] XXX port 5 dma_sz=91392 mem=eea8 mem_dma=2ea8 cmd_slot=0 rx_fis=1024 cmd_tbl=1280
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 10:23, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvalds wrote: That said, we obviously need to figure out this current problem regardless first.. ... although maybe it *would* be interesting to hear what happens if you just compile a 64-bit kernel instead? Do you still see the problem? Because if not, then we should look very specifically for some 32-bit PAE issue. For example, maybe we use "unsigned long" somewhere where we should use "phys_addr_t". On x86-64, they obviously end up being the same. On normal non-PAE x86-32, they are also the same. But .. Let's wait for what Tejun Heo might say - I've applied his debugging patch and sent back the results. Building an x86_64 kernel here involves installing a 64bit Linux VM, so I'd like it to be the last resort.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 11:55, Tejun Heo wrote: Artem, can you please reproduce the issue with the following patch applied and attach the kernel log? Thanks. I've applied this patch on top of vanilla 4.3.3 kernel (without Linus'es revert). Hopefully it's how you intended it to be. Here's the result (I skipped the beginning of dmesg - it's the same as always - see bugzilla).[ 60.387407] Corrupted low memory at c0001000 (1000 phys) = cba3d25f [ 60.387411] Corrupted low memory at c0001004 (1004 phys) = e8f17ba7 [ 60.387413] Corrupted low memory at c0001008 (1008 phys) = 61cfa79a [ 60.387415] Corrupted low memory at c000100c (100c phys) = dc4d5d71 [ 60.387417] Corrupted low memory at c0001010 (1010 phys) = adbdc15b [ 60.387418] Corrupted low memory at c0001014 (1014 phys) = dee76bdc [ 60.387420] Corrupted low memory at c0001018 (1018 phys) = 827dee31 [ 60.387422] Corrupted low memory at c000101c (101c phys) = ef70cf7b [ 60.387423] Corrupted low memory at c0001020 (1020 phys) = 82fdee4d [ 60.387425] Corrupted low memory at c0001024 (1024 phys) = 77533c7b [ 60.387427] Corrupted low memory at c0001028 (1028 phys) = ddd4cf35 [ 60.387428] Corrupted low memory at c000102c (102c phys) = 7beea149 [ 60.387430] Corrupted low memory at c0001030 (1030 phys) = 798fe878 [ 60.387432] Corrupted low memory at c0001034 (1034 phys) = 4283a7a8 [ 60.387434] Corrupted low memory at c0001038 (1038 phys) = 4dee093d [ 60.387435] Corrupted low memory at c000103c (103c phys) = ee21ef73 [ 60.387437] Corrupted low memory at c0001040 (1040 phys) = fe3dc93d [ 60.387439] Corrupted low memory at c0001044 (1044 phys) = b8e7cf0d [ 60.387440] Corrupted low memory at c0001048 (1048 phys) = af3c9977 [ 60.387442] Corrupted low memory at c000104c (104c phys) = b80b7b8b [ 60.387444] Corrupted low memory at c0001050 (1050 phys) = b6f73d77 [ 60.387445] Corrupted low memory at c0001054 (1054 phys) = f7276f70 [ 60.387447] Corrupted low memory at c0001058 (1058 phys) = c62f70f6 [ 60.387449] Corrupted low memory at c000105c 
(105c phys) = 3ef734bd [ 60.387451] Corrupted low memory at c0001060 (1060 phys) = 1ef79f40 [ 60.387452] Corrupted low memory at c0001064 (1064 phys) = f1cf9f65 [ 60.387454] Corrupted low memory at c0001068 (1068 phys) = 297a5390 [ 60.387456] Corrupted low memory at c000106c (106c phys) = a7f14fbc [ 60.387457] Corrupted low memory at c0001070 (1070 phys) = 57ef71af [ 60.387459] Corrupted low memory at c0001074 (1074 phys) = 219d15e4 [ 60.387461] Corrupted low memory at c0001078 (1078 phys) = 7b99a2af [ 60.387462] Corrupted low memory at c000107c (107c phys) = c56d281b [ 60.387464] Corrupted low memory at c0001080 (1080 phys) = 3c84de6e [ 60.387466] Corrupted low memory at c0001084 (1084 phys) = edee56ec [ 60.387468] Corrupted low memory at c0001088 (1088 phys) = 49b557a7 [ 60.387469] Corrupted low memory at c000108c (108c phys) = 01baeb6a [ 60.387471] Corrupted low memory at c0001090 (1090 phys) = b775acde [ 60.387473] Corrupted low memory at c0001094 (1094 phys) = 30dd6851 [ 60.387474] Corrupted low memory at c0001098 (1098 phys) = f328fd0f [ 60.387476] Corrupted low memory at c000109c (109c phys) = 17ad185c [ 60.387478] Corrupted low memory at c00010a0 (10a0 phys) = b83985f5 [ 60.387479] Corrupted low memory at c00010a4 (10a4 phys) = 775b8af5 [ 60.387481] Corrupted low memory at c00010a8 (10a8 phys) = 3d35e4bc [ 60.387483] Corrupted low memory at c00010ac (10ac phys) = bf4d7b90 [ 60.387485] Corrupted low memory at c00010b0 (10b0 phys) = 1db6fd99 [ 60.387486] Corrupted low memory at c00010b4 (10b4 phys) = 3b94bf2f [ 60.387488] Corrupted low memory at c00010b8 (10b8 phys) = 5f447e55 [ 60.387490] Corrupted low memory at c00010bc (10bc phys) = dcfe6395 [ 60.387491] Corrupted low memory at c00010c0 (10c0 phys) = fc0b7a23 [ 60.387493] Corrupted low memory at c00010c4 (10c4 phys) = 32fa23aa [ 60.387495] Corrupted low memory at c00010c8 (10c8 phys) = e88ef3f8 [ 60.387496] Corrupted low memory at c00010cc (10cc phys) = 1ed7e14b [ 60.387498] Corrupted low memory at 
c00010d0 (10d0 phys) = 9fc3d7d1 [ 60.387500] Corrupted low memory at c00010d4 (10d4 phys) = 015f447f [ 60.387501] Corrupted low memory at c00010d8 (10d8 phys) = 7d11c17f [ 60.387503] Corrupted low memory at c00010dc (10dc phys) = 4785fc2d [ 60.387505] Corrupted low memory at c00010e0 (10e0 phys) = 5fe16bf4 [ 60.387507] Corrupted low memory at c00010e4 (10e4 phys) = 4de3fcc5 [ 60.387508] Corrupted low memory at c00010e8 (10e8 phys) = 4f477297 [ 60.387510] Corrupted low memory at c00010ec (10ec phys) = 59a47d35 [ 60.387512] Corrupted low memory at c00010f0 (10f0 phys) = c97c78df [ 60.387513] Corrupted low memory at c00010f4 (10f4 phys) = e3aafa4b [ 60.387515] Corrupted low memory at c00010f8 (10f8 phys) = 658bd8cb [ 60.387517] Corrupted low memory at c00010fc (10fc phys) = 6f5eb91f [ 60.387518] Corrupted low memory at c0001100 (1100 phys) = ca66ce3a [
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 09:32, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 5:50 PM, Artem S. Tashkinov wrote: P.S. I know Linus doesn't condone PAE but I still find it more preferable than running a mixed environment with almost zero benefit in regard to performance and quite obvious performance regressions related to an increased number of libraries being loaded (i686 + x86_64) and slightly bloated code which sometimes cannot fit in the CPU cache. Call me old fashioned but I won't upgrade to x86_64 until most of the things that I run locally are available for x86_64 and that won't happen any time soon. Don't upgrade *user* land. User land doesn't use the braindamage that is PAE. Just run a 64-bit kernel. Keep all your 32-bit userland apps and libraries. Trust me, that *will* be faster. PAE works really horribly badly, because all your really important data structures like your inodes and directory cache will all be in the low 1GB even if you have 16GB of RAM. Of course, I'd also like more people to run things that way just to get more coverage of the whole "yes, we do all the compat stuff correctly". So I have some other reasons to prefer people running 64-bit kernels with 32-bit user land. But PAE really is a disaster. In the past I happily ran an x86_64 kernel together with 32bit userland for quite some time but then I hit a wall: VirtualBox expects its kernel modules to have the same bitness as the application itself so I had to revert back to an i686 PAE setup. It's probably high time to try qemu however last time I looked at it a few years ago it lacked several crucial features I need from a VM.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 08:21, Ming Lei wrote: On Mon, Dec 21, 2015 at 10:25 AM, Artem S. Tashkinov wrote: # cat /sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size} 32767 32767 168 65536 Looks it is fine, then maybe it is related with BIOVEC_PHYS_MERGEABLE(), BIOVEC_SEG_BOUNDARY() or sort of thing, because dma_addr_t and phys_addr_t turn to 64-bit with PAE, but 'unsigned long' and 'void *' are still 32-bit. It was confirmed that there isn't the issue if PAE is disabled. Dumping both sata/ahci hw sg table and bio's bvec might be helpful. Um, sorry, what exact variables/files do you want to see? I'm not an expert in /sys. On Mon, Dec 21, 2015 at 10:32 AM, Kent Overstreet wrote: oy vey. WTF's been happening in blk-merge.c? They're not the same bug. The bug in your thread was introduced by Jens in 5014c311ba "block: fix bogus compiler warnings in blk-merge.c", where he screwed up the bvprv handling - but that patch comes after the patch Artem bisected to. blk_bio_segment_split() looks correct in b54ffb73ca. Yes, that is why reverting 578270bfb (block: fix segment split) can make the issue disappear, because 5014c311ba "block: fix bogus compiler warnings in blk-merge.c" basically disables sg-merge and prevents the issue from being triggered.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 07:18, Ming Lei wrote: On Mon, Dec 21, 2015 at 9:50 AM, Artem S. Tashkinov wrote: BTW, I have posted very similar issue in the link: http://marc.info/?l=linux-ide&m=145066119623811&w=2 Artem, I noticed from bugzilla that the hardware is i386, just wondering if PAE is enabled? If yes, I am more confident that both the two kinds of report are similar or same. Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically mentioned in the corresponding bug report. OK, could you dump value of the following files under /sys/block/sdN/queue/ ? max_hw_sectors_kb max_sectors_kb max_segments max_segment_size 'sdN' is the faulted disk name. # cat /sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size} 32767 32767 168 65536
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 06:38, Ming Lei wrote: On Mon, Dec 21, 2015 at 1:51 AM, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. (Also Tejun - maybe you can see what's up - maybe that error message tells you something) I'm not sure what's up with his machine, the disk doesn't seem to be anything particularly unusual, it looks like a 1TB Seagate Barracuda: ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133 which doesn't strike me as odd. Looking at the dmesg, it also looks like it's a pretty normal Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI ID for the AHCI chip seems to be (INTEL, 0x1c02). Any ideas? Anybody? BTW, I have posted very similar issue in the link: http://marc.info/?l=linux-ide&m=145066119623811&w=2 Artem, I noticed from bugzilla that the hardware is i386, just wondering if PAE is enabled? If yes, I am more confident that both the two kinds of report are similar or same. Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically mentioned in the corresponding bug report. P.S. I know Linus doesn't condone PAE but I still find it more preferable than running a mixed environment with almost zero benefit in regard to performance and quite obvious performance regressions related to an increased number of libraries being loaded (i686 + x86_64) and slightly bloated code which sometimes cannot fit in the CPU cache. Call me old fashioned but I won't upgrade to x86_64 until most of the things that I run locally are available for x86_64 and that won't happen any time soon.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 04:42, Kent Overstreet wrote: On Mon, Dec 21, 2015 at 04:25:12AM +0500, Artem S. Tashkinov wrote: On 2015-12-20 23:18, Christoph Hellwig wrote: >On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: >>Kent, Jens, Christoph et al, >> please see this bugzilla: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=109661 >> >>where Artem Tashkinov bisected his problems with 4.3 down to commit >>b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all >>signed off on. > >Artem, > >can you re-check the commits around this series again? I would be >extremely surprised if it's really this particular commit and not >one just before it causing the problem - it just allocates bios >to the biggest possible instead of only allocating up to what >bio_add_page would accept. I'm positive about this particular commit. Of course, it might be another GCC 4.7.4 miscompilation which causes the errors which shouldn't be there but I'm not an expert, so. I believe you on the commit, and I doubt this has anything to do with gcc - the errors you're getting are exactly what you normally get when you send the device an sglist to dma to/from that it doesn't like. The queue limits stuff is annoyingly fragile, you'd think we'd be able to check directly in the driver that the stuff we're sending the device is sane but we don't. If I came up with a debug patch could you try it out? I don't have any ideas for one yet, but if someone who knows the ATA code doesn't jump in I'll call up Tejun and make him walk me through it. No problem, I just hope that this particular access mode (and your debug patch) won't decrease the lifespan of my HDD. Seagate HDDs have been very fragile (read atrociously unreliable) for the past five years.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:44, Kent Overstreet wrote: On Sun, Dec 20, 2015 at 07:18:01PM +0100, Christoph Hellwig wrote: On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: > Kent, Jens, Christoph et al, > please see this bugzilla: > > https://bugzilla.kernel.org/show_bug.cgi?id=109661 > > where Artem Tashkinov bisected his problems with 4.3 down to commit > b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all > signed off on. Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. pretty sure it's something with how blk_bio_segment_split() decides what segments are mergable and not. bio_get_nr_vecs() was just returning nr_pages == queue_max_segments (ignoring sectors for the moment) - so wait, wtf? that's basically assuming no segment merging can ever happen, if it does then this was causing us to send smaller requests to the device than we could have been. so actually two possibilities I can see: - in blk_bio_segment_split(), something's screwed up with how it decides what segments are going to be mergable or not. but I don't think that's likely since it's doing the exact same thing the rest of the segment merging code does. - or, the driver was lying in its queue limits, using queue_max_segments for "the maximum number of pages I can possibly take", and that bug lurked undiscovered because of the screwed-upness in bio_get_nr_vecs(). Offhand I don't know where to start digging in the driver code to look into the second theory though. Tejun, you got any ideas?
Here's an actual bisect log which Linus was missing: git bisect start # bad: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3 git bisect bad 6a13feb9c82803e2b815eca72fa7a9f5561d7861 # good: [64291f7db5bd8150a74ad2036f1037e6a0428df2] Linux 4.2 git bisect good 64291f7db5bd8150a74ad2036f1037e6a0428df2 # bad: [807249d3ada1ff28a47c4054ca4edd479421b671] Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus git bisect bad 807249d3ada1ff28a47c4054ca4edd479421b671 # good: [102178108e2246cb4b329d3fb7872cd3d7120205] Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc git bisect good 102178108e2246cb4b329d3fb7872cd3d7120205 # good: [62da98656b62a5ca57f22263705175af8ded5aa1] netfilter: nf_conntrack: make nf_ct_zone_dflt built-in git bisect good 62da98656b62a5ca57f22263705175af8ded5aa1 # good: [f1a3c0b933e7ff856223d6fcd7456d403e54e4e5] Merge tag 'devicetree-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux git bisect good f1a3c0b933e7ff856223d6fcd7456d403e54e4e5 # bad: [9cbf22b37ae0592dea809cb8d424990774c21786] Merge tag 'dlm-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm git bisect bad 9cbf22b37ae0592dea809cb8d424990774c21786 # good: [8bdc69b764013a9b5ebeef7df8f314f1066c5d79] Merge branch 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup git bisect good 8bdc69b764013a9b5ebeef7df8f314f1066c5d79 # good: [df910390e2db07a76c87f258475f6c96253cee6c] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi git bisect good df910390e2db07a76c87f258475f6c96253cee6c # bad: [d975f309a8b250e67b66eabeb56be6989c783629] Merge branch 'for-4.3/sg' of git://git.kernel.dk/linux-block git bisect bad d975f309a8b250e67b66eabeb56be6989c783629 # bad: [89e2a8404e4415da1edbac6ca4f7332b4a74fae2] crypto/omap-sham: remove an open coded access to ->page_link git bisect bad 89e2a8404e4415da1edbac6ca4f7332b4a74fae2 # good: 
[0e28997ec476bad4c7dbe0a08775290051325f53] btrfs: remove bio splitting and merge_bvec_fn() calls git bisect good 0e28997ec476bad4c7dbe0a08775290051325f53 # bad: [2ec3182f9c20a9eef0dacc0512cf2ca2df7be5ad] Documentation: update notes in biovecs about arbitrarily sized bios git bisect bad 2ec3182f9c20a9eef0dacc0512cf2ca2df7be5ad # good: [7140aafce2fc14c5af02fdb7859b6bea0108be3d] md/raid5: get rid of bio_fits_rdev() git bisect good 7140aafce2fc14c5af02fdb7859b6bea0108be3d # good: [6cf66b4caf9c71f64a5486cadbd71ab58d0d4307] fs: use helper bio_add_page() instead of open coding on bi_io_vec git bisect good 6cf66b4caf9c71f64a5486cadbd71ab58d0d4307 # bad: [b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c] block: remove bio_get_nr_vecs() git bisect bad b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c And like he said since the step before the last one was good and the very last one was bad there was no way I could have made a mistake.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:41, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 10:18 AM, Christoph Hellwig wrote: Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. Judging by Artem's bisect log, the last commit he tested before the bad one was the commit before: commit 6cf66b4caf9c ("fs: use helper bio_add_page() instead of open coding on bi_io_vec") and he marked that one good. Sadly, without CONFIG_LOCALVERSION_AUTO, there's no way to match up the dmesg files (in the same bisection tar-file as the bisection log) with the actual versions. Also, Artem's bisect.log isn't actually the .git/BISECT_LOG file that contains the full information about what was marked good and bad, so it's a bit hard to read (ie I can tell that Artem had to mark commit 6cf66b4caf9c as "good" not because his log says so, but because that explains the next commit to be tested). Of course, it's fairly easy to make a mistake while bisecting (just doing a thinko), but usually bisection mistakes end up causing you to go into some "all good" or "all bad" region of commits, and the fact that Artem seems to have marked the previous commit good and the final commit bad does seem to imply the bisection was successful. But yes, it is always nice to double-check the bisection results. The best way to do it is generally to try to revert the bad commit and verify that things work after that, but that commit doesn't revert cleanly on top of 4.3 due to other changes. Attached is a *COMPLETELY*UNTESTED* revertish patch for 4.3. It's basically a revert of b54ffb73cadc, but with a few fixups to make the revert work on top of 4.3.
So Artem, if you can test whether 4.3 works with that revert, and/or double-check booting that b54ffb73cadc again (to verify that it's really bad), and its parent (to double-check that it's really good), that would be a good way to verify that yes, it is really that *one* commit that breaks things for you. After reverting (applying) this patch on top of 4.3.3 everything is back to normal. It's indeed a guilty commit.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:18, Christoph Hellwig wrote: On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. I'm positive about this particular commit. Of course, it might be another GCC 4.7.4 miscompilation which causes the errors which shouldn't be there but I'm not an expert, so.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 22:51, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. (Also Tejun - maybe you can see what's up - maybe that error message tells you something) I'm not sure what's up with his machine, the disk doesn't seem to be anything particularly unusual, it looks like a 1TB Seagate Barracuda: ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133 which doesn't strike me as odd. Looking at the dmesg, it also looks like it's a pretty normal Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI ID for the AHCI chip seems to be (INTEL, 0x1c02). Any ideas? Anybody? That's correct. That's a very usual Asus P8P67 Pro motherboard (Intel P67 chipset) in AHCI mode and run of the mill HDD which is the one you identified.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 04:42, Kent Overstreet wrote: On Mon, Dec 21, 2015 at 04:25:12AM +0500, Artem S. Tashkinov wrote: On 2015-12-20 23:18, Christoph Hellwig wrote: >On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: >>Kent, Jens, Christoph et al, >> please see this bugzilla: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=109661 >> >>where Artem Tashkinov bisected his problems with 4.3 down to commit >>b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all >>signed off on. > >Artem, > >can you re-check the commits around this series again? I would be >extremtly surprised if it's really this particular commit and not >one just before it causing the problem - it just allocates bios >to the biggest possible instead of only allocating up to what >bio_add_page would accept. I'm positive about this particular commit. Of course, it might be another GCC 4.7.4 miscompilation which causes the errors which shouldn't be there but I'm not an expert, so. I believe you on the commit, and I doubt this has anything to do with gcc - the errors you're getting are exactly what you normally get when you send the device an sglist to dma to/from that it doesn't like. The queue limits stuff is annoyingly fragile, you'd think we'd be able to check directly in the driver that the stuff we're sending the device is sane but we don't. If I came up with a debug patch could you try it out? I don't have any ideas for one yet, but if someone who knows the ATA code doesn't jump in I'll call up Tejun and make him walk me through it. No problem, I just hope that this particular access mode (and you debug patch) won't decrease the lifespan of my HDD. Seagate HDDs have been very fragile (read atrociously unreliable) for the past five years. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:41, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 10:18 AM, Christoph Hellwig wrote: Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. Judging by Artem's bisect log, the last commit he tested before the bad one was the commit before: commit 6cf66b4caf9c ("fs: use helper bio_add_page() instead of open coding on bi_io_vec") and he marked that one good. Sadly, without CONFIG_LOCALVERSION_AUTO, there's no way to match up the dmesg files (in the same bisection tar-file as the bisection log) with the actual versions. Also, Artem's bisect.log isn't actually the .git/BISECT_LOG file that contains the full information about what was marked good and bad, so it's a bit hard to read (ie I can tell that Artem had to mark commit 6cf66b4caf9c as "good" not because his log says so, but because that explains the next commit to be tested). Of course, it's fairly easy to make a mistake while bisecting (just doing a thinko), but usually bisection mistakes end up causing you to go into some "all good" or "all bad" region of commits, and the fact that Artem seems to have marked the previous commit good and the final commit bad does seem to imply the bisection was successful. But yes, it is always nice to double-check the bisection results. The best way to do it is generally to try to revert the bad commit and verify that things work after that, but that commit doesn't revert cleanly on top of 4.3 due to other changes. Attached is a *COMPLETELY*UNTESTED* revertish patch for 4.3. It's basically a revert of b54ffb73cadc, but with a few fixups to make the revert work on top of 4.3.
So Artem, if you can test whether 4.3 works with that revert, and/or double-check booting that b54ffb73cadc again (to verify that it's really bad), and its parent (to double-check that it's really good), that would be a good way to verify that yes, it is really that *one* commit that breaks things for you. After reverting (applying) this patch on top of 4.3.3 everything is back to normal. It's indeed a guilty commit.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 08:21, Ming Lei wrote: On Mon, Dec 21, 2015 at 10:25 AM, Artem S. Tashkinov wrote: # cat /sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size} 32767 32767 168 65536 Looks it is fine, then maybe it is related with BIOVEC_PHYS_MERGEABLE(), BIOVEC_SEG_BOUNDARY() or sort of thing, because dma_addr_t and phys_addr_t turn to 64-bit with PAE, but 'unsigned long' and 'void *' is still 32bit. It was confirmed that there isn't the issue if PAE is disabled. Dumping both sata/ahci hw sg table and bio's bvec might be helpful. Um, sorry, what exact variables/files do you want to see? I'm not an expert in /sys. On Mon, Dec 21, 2015 at 10:32 AM, Kent Overstreet wrote: oy vey. WTF's been happening in blk-merge.c? They're not the same bug. The bug in your thread was introduced by Jens in 5014c311ba "block: fix bogus compiler warnings in blk-merge.c", where he screwed up the bvprv handling - but that patch comes after the patch Artem bisected to. blk_bio_segment_split() looks correct in b54ffb73ca. Yes, that is why reverting 578270bfb ("block: fix segment split") can make the issue disappear, because 5014c311ba "block: fix bogus compiler warnings in blk-merge.c" basically disables sg-merge and prevents the issue from being triggered.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 09:32, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 5:50 PM, Artem S. Tashkinov wrote: P.S. I know Linus doesn't condone PAE but I still find it more preferable than running a mixed environment with almost zero benefit in regard to performance and quite obvious performance regressions related to an increased number of libraries being loaded (i686 + x86_64) and slightly bloated code which sometimes cannot fit in the CPU cache. Call me old fashioned but I won't upgrade to x86_64 until most of the things that I run locally are available for x86_64 and that won't happen any time soon. Don't upgrade *user* land. User land doesn't use the braindamage that is PAE. Just run a 64-bit kernel. Keep all your 32-bit userland apps and libraries. Trust me, that *will* be faster. PAE works really horribly badly, because all your really important data structures like your inodes and directory cache will all be in the low 1GB even if you have 16GB of RAM. Of course, I'd also like more people to run things that way just to get more coverage of the whole "yes, we do all the compat stuff correctly". So I have some other reasons to prefer people running 64-bit kernels with 32-bit user land. But PAE really is a disaster. In the past I happily ran an x86_64 kernel together with 32-bit userland for quite some time but then I hit a wall: VirtualBox expects its kernel modules to have the same bitness as the application itself so I had to revert back to an i686 PAE setup. It's probably high time to try qemu however last time I looked at it a few years ago it lacked several crucial features I need from a VM.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:44, Kent Overstreet wrote: On Sun, Dec 20, 2015 at 07:18:01PM +0100, Christoph Hellwig wrote: On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: > Kent, Jens, Christoph et al, > please see this bugzilla: > > https://bugzilla.kernel.org/show_bug.cgi?id=109661 > > where Artem Tashkinov bisected his problems with 4.3 down to commit > b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all > signed off on. Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. pretty sure it's something with how blk_bio_segment_split() decides what segments are mergable and not. bio_get_nr_vecs() was just returning nr_pages == queue_max_segments (ignoring sectors for the moment) - so wait, wtf? that's basically assuming no segment merging can ever happen, if it does then this was causing us to send smaller requests to the device than we could have been. so actually two possibilities I can see: - in blk_bio_segment_split(), something's screwed up with how it decides what segments are going to be mergable or not. but I don't think that's likely since it's doing the exact same thing the rest of the segment merging code does. - or, the driver was lying in its queue limits, using queue_max_segments for "the maximum number of pages I can possibly take", and that bug lurked undiscovered because of the screwed-upness in bio_get_nr_vecs(). Offhand I don't know where to start digging in the driver code to look into the second theory though. Tejun, you got any ideas?
Here's an actual bisect log which Linus was missing: git bisect start # bad: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3 git bisect bad 6a13feb9c82803e2b815eca72fa7a9f5561d7861 # good: [64291f7db5bd8150a74ad2036f1037e6a0428df2] Linux 4.2 git bisect good 64291f7db5bd8150a74ad2036f1037e6a0428df2 # bad: [807249d3ada1ff28a47c4054ca4edd479421b671] Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus git bisect bad 807249d3ada1ff28a47c4054ca4edd479421b671 # good: [102178108e2246cb4b329d3fb7872cd3d7120205] Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc git bisect good 102178108e2246cb4b329d3fb7872cd3d7120205 # good: [62da98656b62a5ca57f22263705175af8ded5aa1] netfilter: nf_conntrack: make nf_ct_zone_dflt built-in git bisect good 62da98656b62a5ca57f22263705175af8ded5aa1 # good: [f1a3c0b933e7ff856223d6fcd7456d403e54e4e5] Merge tag 'devicetree-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux git bisect good f1a3c0b933e7ff856223d6fcd7456d403e54e4e5 # bad: [9cbf22b37ae0592dea809cb8d424990774c21786] Merge tag 'dlm-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm git bisect bad 9cbf22b37ae0592dea809cb8d424990774c21786 # good: [8bdc69b764013a9b5ebeef7df8f314f1066c5d79] Merge branch 'for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup git bisect good 8bdc69b764013a9b5ebeef7df8f314f1066c5d79 # good: [df910390e2db07a76c87f258475f6c96253cee6c] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi git bisect good df910390e2db07a76c87f258475f6c96253cee6c # bad: [d975f309a8b250e67b66eabeb56be6989c783629] Merge branch 'for-4.3/sg' of git://git.kernel.dk/linux-block git bisect bad d975f309a8b250e67b66eabeb56be6989c783629 # bad: [89e2a8404e4415da1edbac6ca4f7332b4a74fae2] crypto/omap-sham: remove an open coded access to ->page_link git bisect bad 89e2a8404e4415da1edbac6ca4f7332b4a74fae2 # good: 
[0e28997ec476bad4c7dbe0a08775290051325f53] btrfs: remove bio splitting and merge_bvec_fn() calls git bisect good 0e28997ec476bad4c7dbe0a08775290051325f53 # bad: [2ec3182f9c20a9eef0dacc0512cf2ca2df7be5ad] Documentation: update notes in biovecs about arbitrarily sized bios git bisect bad 2ec3182f9c20a9eef0dacc0512cf2ca2df7be5ad # good: [7140aafce2fc14c5af02fdb7859b6bea0108be3d] md/raid5: get rid of bio_fits_rdev() git bisect good 7140aafce2fc14c5af02fdb7859b6bea0108be3d # good: [6cf66b4caf9c71f64a5486cadbd71ab58d0d4307] fs: use helper bio_add_page() instead of open coding on bi_io_vec git bisect good 6cf66b4caf9c71f64a5486cadbd71ab58d0d4307 # bad: [b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c] block: remove bio_get_nr_vecs() git bisect bad b54ffb73cadcdcff9cc1ae0e11f502407e3e2e4c And like he said since the step before the last one was good and the very last one was bad there was no way I could have made a mistake.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 22:51, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. (Also Tejun - maybe you can see what's up - maybe that error message tells you something) I'm not sure what's up with his machine, the disk doesn't seem to be anything particularly unusual, it looks like a 1TB Seagate Barracuda: ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133 which doesn't strike me as odd. Looking at the dmesg, it also looks like it's a pretty normal Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI ID for the AHCI chip seems to be (INTEL, 0x1c02). Any ideas? Anybody? That's correct. That's a very usual Asus P8P67 Pro motherboard (Intel P67 chipset) in AHCI mode and a run of the mill HDD which is the one you identified.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-20 23:18, Christoph Hellwig wrote: On Sun, Dec 20, 2015 at 09:51:14AM -0800, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. Artem, can you re-check the commits around this series again? I would be extremely surprised if it's really this particular commit and not one just before it causing the problem - it just allocates bios to the biggest possible instead of only allocating up to what bio_add_page would accept. I'm positive about this particular commit. Of course, it might be another GCC 4.7.4 miscompilation which causes the errors which shouldn't be there but I'm not an expert, so.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 06:38, Ming Lei wrote: On Mon, Dec 21, 2015 at 1:51 AM, Linus Torvalds wrote: Kent, Jens, Christoph et al, please see this bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=109661 where Artem Tashkinov bisected his problems with 4.3 down to commit b54ffb73cadc ("block: remove bio_get_nr_vecs()") that you've all signed off on. (Also Tejun - maybe you can see what's up - maybe that error message tells you something) I'm not sure what's up with his machine, the disk doesn't seem to be anything particularly unusual, it looks like a 1TB Seagate Barracuda: ata1.00: ATA-8: ST1000DM003-1CH162, CC44, max UDMA/133 which doesn't strike me as odd. Looking at the dmesg, it also looks like it's a pretty normal Sandybridge setup with Intel chipset. Artem, can you confirm? The PCI ID for the AHCI chip seems to be (INTEL, 0x1c02). Any ideas? Anybody? BTW, I have posted a very similar issue in the link: http://marc.info/?l=linux-ide&m=145066119623811&w=2 Artem, I noticed from bugzilla that the hardware is i386, just wondering if PAE is enabled? If yes, I am more confident that the two kinds of reports are similar or the same. Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically mentioned in the corresponding bug report. P.S. I know Linus doesn't condone PAE but I still find it more preferable than running a mixed environment with almost zero benefit in regard to performance and quite obvious performance regressions related to an increased number of libraries being loaded (i686 + x86_64) and slightly bloated code which sometimes cannot fit in the CPU cache. Call me old fashioned but I won't upgrade to x86_64 until most of the things that I run locally are available for x86_64 and that won't happen any time soon.
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 07:18, Ming Lei wrote: On Mon, Dec 21, 2015 at 9:50 AM, Artem S. Tashkinov wrote: BTW, I have posted a very similar issue in the link: http://marc.info/?l=linux-ide&m=145066119623811&w=2 Artem, I noticed from bugzilla that the hardware is i386, just wondering if PAE is enabled? If yes, I am more confident that the two kinds of reports are similar or the same. Yes, I'm on i686 with PAE (16GB of RAM here) - it's specifically mentioned in the corresponding bug report. OK, could you dump the values of the following files under /sys/block/sdN/queue/ ? max_hw_sectors_kb max_sectors_kb max_segments max_segment_size 'sdN' is the faulted disk name. # cat /sys/block/sda/queue/{max_hw_sectors_kb,max_sectors_kb,max_segments,max_segment_size} 32767 32767 168 65536
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 11:55, Tejun Heo wrote: Artem, can you please reproduce the issue with the following patch applied and attach the kernel log? Thanks. I've applied this patch on top of vanilla 4.3.3 kernel (without Linus'es revert). Hopefully it's how you intended it to be. Here's the result (I skipped the beginning of dmesg - it's the same as always - see bugzilla).[ 60.387407] Corrupted low memory at c0001000 (1000 phys) = cba3d25f [ 60.387411] Corrupted low memory at c0001004 (1004 phys) = e8f17ba7 [ 60.387413] Corrupted low memory at c0001008 (1008 phys) = 61cfa79a [ 60.387415] Corrupted low memory at c000100c (100c phys) = dc4d5d71 [ 60.387417] Corrupted low memory at c0001010 (1010 phys) = adbdc15b [ 60.387418] Corrupted low memory at c0001014 (1014 phys) = dee76bdc [ 60.387420] Corrupted low memory at c0001018 (1018 phys) = 827dee31 [ 60.387422] Corrupted low memory at c000101c (101c phys) = ef70cf7b [ 60.387423] Corrupted low memory at c0001020 (1020 phys) = 82fdee4d [ 60.387425] Corrupted low memory at c0001024 (1024 phys) = 77533c7b [ 60.387427] Corrupted low memory at c0001028 (1028 phys) = ddd4cf35 [ 60.387428] Corrupted low memory at c000102c (102c phys) = 7beea149 [ 60.387430] Corrupted low memory at c0001030 (1030 phys) = 798fe878 [ 60.387432] Corrupted low memory at c0001034 (1034 phys) = 4283a7a8 [ 60.387434] Corrupted low memory at c0001038 (1038 phys) = 4dee093d [ 60.387435] Corrupted low memory at c000103c (103c phys) = ee21ef73 [ 60.387437] Corrupted low memory at c0001040 (1040 phys) = fe3dc93d [ 60.387439] Corrupted low memory at c0001044 (1044 phys) = b8e7cf0d [ 60.387440] Corrupted low memory at c0001048 (1048 phys) = af3c9977 [ 60.387442] Corrupted low memory at c000104c (104c phys) = b80b7b8b [ 60.387444] Corrupted low memory at c0001050 (1050 phys) = b6f73d77 [ 60.387445] Corrupted low memory at c0001054 (1054 phys) = f7276f70 [ 60.387447] Corrupted low memory at c0001058 (1058 phys) = c62f70f6 [ 60.387449] Corrupted low memory at c000105c 
(105c phys) = 3ef734bd [ 60.387451] Corrupted low memory at c0001060 (1060 phys) = 1ef79f40 [ 60.387452] Corrupted low memory at c0001064 (1064 phys) = f1cf9f65 [ 60.387454] Corrupted low memory at c0001068 (1068 phys) = 297a5390 [ 60.387456] Corrupted low memory at c000106c (106c phys) = a7f14fbc [ 60.387457] Corrupted low memory at c0001070 (1070 phys) = 57ef71af [ 60.387459] Corrupted low memory at c0001074 (1074 phys) = 219d15e4 [ 60.387461] Corrupted low memory at c0001078 (1078 phys) = 7b99a2af [ 60.387462] Corrupted low memory at c000107c (107c phys) = c56d281b [ 60.387464] Corrupted low memory at c0001080 (1080 phys) = 3c84de6e [ 60.387466] Corrupted low memory at c0001084 (1084 phys) = edee56ec [ 60.387468] Corrupted low memory at c0001088 (1088 phys) = 49b557a7 [ 60.387469] Corrupted low memory at c000108c (108c phys) = 01baeb6a [ 60.387471] Corrupted low memory at c0001090 (1090 phys) = b775acde [ 60.387473] Corrupted low memory at c0001094 (1094 phys) = 30dd6851 [ 60.387474] Corrupted low memory at c0001098 (1098 phys) = f328fd0f [ 60.387476] Corrupted low memory at c000109c (109c phys) = 17ad185c [ 60.387478] Corrupted low memory at c00010a0 (10a0 phys) = b83985f5 [ 60.387479] Corrupted low memory at c00010a4 (10a4 phys) = 775b8af5 [ 60.387481] Corrupted low memory at c00010a8 (10a8 phys) = 3d35e4bc [ 60.387483] Corrupted low memory at c00010ac (10ac phys) = bf4d7b90 [ 60.387485] Corrupted low memory at c00010b0 (10b0 phys) = 1db6fd99 [ 60.387486] Corrupted low memory at c00010b4 (10b4 phys) = 3b94bf2f [ 60.387488] Corrupted low memory at c00010b8 (10b8 phys) = 5f447e55 [ 60.387490] Corrupted low memory at c00010bc (10bc phys) = dcfe6395 [ 60.387491] Corrupted low memory at c00010c0 (10c0 phys) = fc0b7a23 [ 60.387493] Corrupted low memory at c00010c4 (10c4 phys) = 32fa23aa [ 60.387495] Corrupted low memory at c00010c8 (10c8 phys) = e88ef3f8 [ 60.387496] Corrupted low memory at c00010cc (10cc phys) = 1ed7e14b [ 60.387498] Corrupted low memory at 
c00010d0 (10d0 phys) = 9fc3d7d1 [ 60.387500] Corrupted low memory at c00010d4 (10d4 phys) = 015f447f [ 60.387501] Corrupted low memory at c00010d8 (10d8 phys) = 7d11c17f [ 60.387503] Corrupted low memory at c00010dc (10dc phys) = 4785fc2d [ 60.387505] Corrupted low memory at c00010e0 (10e0 phys) = 5fe16bf4 [ 60.387507] Corrupted low memory at c00010e4 (10e4 phys) = 4de3fcc5 [ 60.387508] Corrupted low memory at c00010e8 (10e8 phys) = 4f477297 [ 60.387510] Corrupted low memory at c00010ec (10ec phys) = 59a47d35 [ 60.387512] Corrupted low memory at c00010f0 (10f0 phys) = c97c78df [ 60.387513] Corrupted low memory at c00010f4 (10f4 phys) = e3aafa4b [ 60.387515] Corrupted low memory at c00010f8 (10f8 phys) = 658bd8cb [ 60.387517] Corrupted low memory at c00010fc (10fc phys) = 6f5eb91f [ 60.387518] Corrupted low memory at c0001100 (1100 phys) = ca66ce3a [
Re: IO errors after "block: remove bio_get_nr_vecs()"
On 2015-12-21 10:23, Linus Torvalds wrote: On Sun, Dec 20, 2015 at 8:47 PM, Linus Torvalds wrote: That said, we obviously need to figure out this current problem regardless first.. ... although maybe it *would* be interesting to hear what happens if you just compile a 64-bit kernel instead? Do you still see the problem? Because if not, then we should look very specifically for some 32-bit PAE issue. For example, maybe we use "unsigned long" somewhere where we should use "phys_addr_t". On x86-64, they obviously end up being the same. On normal non-PAE x86-32, they are also the same. But .. Let's wait for what Tejun Heo might say - I've applied his debugging patch and sent back the results. Building an x86_64 kernel here involves installing a 64-bit Linux VM, so I'd like it to be the last resort.
Not being able to reread the partition table - why is Linux so 90x?
Hello, I wonder why in 2013 I still cannot modify _unused_ partitions on the fly, yeah, the Internet is full of: # hdparm -z /dev/sda /dev/sda: re-reading partition table BLKRRPART failed: Device or resource busy # fdisk (after adding a new partition using unused space on my hdd) ... Command (m for help): w The partition table has been altered. Calling ioctl() to re-read partition table. Re-reading the partition table failed.: Device or resource busy The SCSI rescan command doesn't work either. I do understand that the Linux kernel doesn't have any form of revoke() but then it prevents me from altering the partitions which are not used - it's 100% counter intuitive. Windows, for instance, has allowed modifying even a system partition on the fly since 2006; Linux doesn't allow adding partitions without rebooting the system. Could anyone elaborate, please? Best regards, Artem
Re: Disabling in-memory write cache for x86-64 in Linux II
Oct 30, 2013 02:41:01 AM, Jack wrote: On Fri 25-10-13 19:37:53, Ted Tso wrote: >> Sure, although I wonder if it would be worth it to calculate some kind of >> rolling average of the write bandwidth while we are doing writeback, >> so if it turns out we got unlucky with the contents of the first 100MB >> of dirty data (it could be either highly random or highly sequential) >> then we'll eventually correct to the right level. > We already do average measured throughput over a longer time window and >have a kind of rolling average algorithm doing some averaging. > >> This means that VM would have to keep dirty page counters for each BDI >> --- which I thought we weren't doing right now, which is why we have a >> global vm.dirty_ratio/vm.dirty_background_ratio threshold. (Or do I >> have cause and effect reversed? :-) > And we do currently keep the number of dirty & under writeback pages per >BDI. We have global limits because mm wants to limit the total number of dirty >pages (as those are harder to free). It doesn't care as much to which device >these pages belong (although it probably should care a bit more because >there are huge differences between how quickly can different devices get rid >of dirty pages). This might sound like an absolutely stupid question which makes no sense at all, so I want to apologize for it in advance, but since the Linux kernel lacks revoke(), does that mean that dirty buffers will always occupy the kernel memory if I for instance remove my USB stick before the kernel has had the time to flush these buffers?
Re: Disabling in-memory write cache for x86-64 in Linux II
Oct 26, 2013 02:44:07 AM, neil wrote: On Fri, 25 Oct 2013 18:26:23 + (UTC) "Artem S. Tashkinov" >> >> Exactly. And not being able to use applications which show you IO performance >> like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine >> my life without being able to see the progress of a copying operation. With >> the current >> dirty cache there's no way to understand how your storage media actually >> behaves. > >So fix Midnight Commander. If you want the copy to be actually finished when >it says it is finished, then it needs to call 'fsync()' at the end. This sounds like a very bad joke. How are applications supposed to show and calculate an _average_ write speed if there are no kernel calls/ioctls to actually make the kernel flush dirty buffers _during_ copying? Actually it's a good way to solve this problem in user space - alas, even if such calls are implemented, user space will start using them only in 2018 if not further from that. >> >> Per device dirty cache seems like a nice idea, I, for one, would like to >> disable it >> altogether or make it an absolute minimum for things like USB flash drives - >> because >> I don't care about multithreaded performance or delayed allocation on such >> devices - >> I'm interested in my data reaching my USB stick ASAP - because it's how most >> people >> use them. >> > >As has already been said, you can substantially disable the cache by tuning >down various values in /proc/sys/vm/. >Have you tried? I don't understand who you are replying to. I asked about per device settings, you are again referring me to system wide settings - they don't look that good if we're talking about a 3MB/sec flash drive and a 500MB/sec SSD drive. Besides it makes no sense to allocate 20% of physical RAM for things which don't belong to it in the first place. I don't know any other OS which has a similar behaviour.
And like people (including me) have already mentioned, such a huge dirty cache can stall their PCs/servers for a considerable amount of time. Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also not everyone in this world has a UPS - which means such a huge buffer can lead to a serious data loss in case of a power blackout. Regards, Artem
Re: Disabling in-memory write cache for x86-64 in Linux II
Oct 25, 2013 05:26:45 PM, david wrote: On Fri, 25 Oct 2013, NeilBrown wrote: > >> >> What exactly is bothering you about this? The amount of memory used or the >> time until data is flushed? > >actually, I think the problem is more the impact of the huge write later on. Exactly. And not being able to use applications which show you IO performance like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine my life without being able to see the progress of a copying operation. With the current dirty cache there's no way to understand how your storage media actually behaves. Hopefully this issue won't dissolve into obscurity and someone will actually make up a plan (and a patch) for how to make the dirty write cache behave in a sane manner, considering the fact that there are devices with very different write speeds and requirements. It'd be even better if I could specify the dirty cache as a mount option (though sane defaults or semi-automatic values based on runtime estimates won't hurt). Per device dirty cache seems like a nice idea; I, for one, would like to disable it altogether or make it an absolute minimum for things like USB flash drives - because I don't care about multithreaded performance or delayed allocation on such devices - I'm interested in my data reaching my USB stick ASAP - because it's how most people use them. Regards, Artem
Re: Disabling in-memory write cache for x86-64 in Linux II
Oct 25, 2013 02:18:50 PM, Linus Torvalds wrote: On Fri, Oct 25, 2013 at 8:25 AM, Artem S. Tashkinov wrote: >> >> On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel >> built for the i686 (with PAE) and x86-64 architectures. What's really >> troubling me >> is that the x86-64 kernel has the following problem: >> >> When I copy large files to any storage device, be it my HDD with ext4 >> partitions >> or flash drive with FAT32 partitions, the kernel first caches them in memory >> entirely >> then flushes them some time later (quite unpredictably though) or >> immediately upon >> invoking "sync". > >Yeah, I think we default to a 10% "dirty background memory" (and >allows up to 20% dirty), so on your 16GB machine, we allow up to 1.6GB >of dirty memory for writeout before we even start writing, and twice >that before we start *waiting* for it. > >On 32-bit x86, we only count the memory in the low 1GB (really >actually up to about 890MB), so "10% dirty" really means just about >90MB of buffering (and a "hard limit" of ~180MB of dirty). > >And that "up to 3.2GB of dirty memory" is just crazy. Our defaults >come from the old days of less memory (and perhaps servers that don't >much care), and the fact that x86-32 ends up having much lower limits >even if you end up having more memory. > >You can easily tune it: > >echo $((16*1024*1024)) > /proc/sys/vm/dirty_background_bytes >echo $((48*1024*1024)) > /proc/sys/vm/dirty_bytes > >or similar. But you're right, we need to make the defaults much saner. > >Wu? Andrew? Comments? > My feeling is that vm.dirty_ratio/vm.dirty_background_ratio should _not_ be percentage based, 'cause for PCs/servers with a lot of memory (say 64GB or more) this value becomes unrealistic (13GB) and I've already had some unpleasant effects due to it. I.e. 
when I dump a large MySQL database (its dump weighs around 10GB) - it appears on the disk almost immediately, but then, later, when the kernel decides to flush it to the disk, the server almost stalls and other IO requests take a lot more time to complete, even though mysqldump is run with ionice -c3 - so the use of ionice has no real effect. Artem
Disabling in-memory write cache for x86-64 in Linux II
Hello! On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel built for the i686 (with PAE) and x86-64 architectures. What's really troubling me is that the x86-64 kernel has the following problem: When I copy large files to any storage device, be it my HDD with ext4 partitions or a flash drive with FAT32 partitions, the kernel first caches them in memory entirely, then flushes them some time later (quite unpredictably though) or immediately upon invoking "sync". How can I disable this memory cache altogether (or at least minimize caching)? When running the i686 kernel with the same configuration I don't observe this effect - files get written out almost immediately (for instance "sync" takes less than a second, whereas on x86-64 it can take a dozen _minutes_ depending on file size and storage performance). I'm _not_ talking about disabling the write cache on the storage itself (hdparm -W 0 /dev/XXX) - firstly, this command is detrimental to the performance of my PC; secondly, it won't help in this instance. Swap is totally disabled, and usually my memory is entirely free. My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531 Please advise. Best regards, Artem
Re: Disabling in-memory write cache for x86-64 in Linux II
Oct 26, 2013 02:44:07 AM, neil wrote: On Fri, 25 Oct 2013 18:26:23 +0000 (UTC) "Artem S. Tashkinov" wrote: >> Exactly. And not being able to use applications which show you IO performance, like Midnight Commander. You might prefer to use "cp -a" but I cannot imagine my life without being able to see the progress of a copying operation. With the current dirty cache there's no way to understand how your storage media actually behaves. > >So fix Midnight Commander. If you want the copy to be actually finished when it says it is finished, then it needs to call 'fsync()' at the end. This sounds like a very bad joke. How are applications supposed to show and calculate an _average_ write speed if there are no kernel calls/ioctls to actually make the kernel flush dirty buffers _during_ copying? Actually that would be a good way to solve this problem in user space - alas, even if such calls were implemented, user space would only start using them by 2018, if not later. >> Per-device dirty cache seems like a nice idea. I, for one, would like to disable it altogether or set it to an absolute minimum for things like USB flash drives - because I don't care about multithreaded performance or delayed allocation on such devices - I'm interested in my data reaching my USB stick ASAP, because that's how most people use them. > >As has already been said, you can substantially disable the cache by tuning down various values in /proc/sys/vm/. Have you tried? I don't understand who you are replying to. I asked about per-device settings, yet you are again referring me to system-wide settings - they don't look that good if we're talking about a 3MB/sec flash drive and a 500MB/sec SSD. Besides, it makes no sense to allocate 20% of physical RAM for things which don't belong in it in the first place. I don't know of any other OS which has similar behaviour. And like people (including me) have already mentioned, such a huge dirty cache can stall their PCs/servers for a considerable amount of time. 
Of course, if you don't use Linux on the desktop you don't really care - well, I do. Also, not everyone in this world has a UPS - which means such a huge buffer can lead to serious data loss in case of a power blackout. Regards, Artem
Disabling in-memory write cache for x86-64 in Linux 3.11
Hello, On my x86-64 PC (Intel Core i5 2500, 16GB RAM), I have the same 3.11 kernel built for the i686 (with PAE) and x86-64 architectures. What's really troubling me is that the x86-64 kernel has the following problem: When I copy large files to any storage device, be it my HDD with ext4 partitions or a flash drive with FAT32 partitions, the kernel first caches them in memory entirely, then flushes them some time later (quite unpredictably though) or immediately upon running "sync". How can I disable this memory cache altogether? When running the i686 kernel with the same configuration I don't observe this effect - files get written out almost immediately (for instance "sync" takes less than a second, whereas on x86-64 it can take a dozen _minutes_ depending on file size and storage performance). I'm _not_ talking about disabling the write cache on the storage itself (hdparm -W 0 /dev/XXX) - firstly, this command is detrimental to the performance of my PC; secondly, it won't help in this instance. Swap is totally disabled, and usually my memory is entirely free. My kernel configuration can be fetched here: https://bugzilla.kernel.org/show_bug.cgi?id=63531 Please advise. Best regards, Artem
Re: A call to revise sockets behaviour
Jul 29, 2013 11:43:00 PM, Eric wrote: On Mon, 2013-07-29 at 15:47 +0000, Artem S. Tashkinov wrote: > >> A wine developer clearly showed that this option simply doesn't work. >> >> http://bugs.winehq.org/show_bug.cgi?id=26031#c21 >> >> Output of strace: >> getsockopt(24, SOL_SOCKET, SO_REUSEADDR, [0], [4]) = 0 >> setsockopt(24, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 >> bind(24, {sa_family=AF_INET, sin_port=htons(43012), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) > >It's clear that some other socket did not use SO_REUSEADDR > >All sockets using a given port _must_ have used SO_REUSEADDR to allow this >port being reused. > It's exactly what's been tried. A program that ran with SO_REUSEADDR, once it is no longer running, consequently fails to regain the rights to the port.
Re: A call to revise sockets behaviour
Jul 29, 2013 11:27:00 PM, rick wrote: >> A wine developer clearly showed that this option simply doesn't work. >> >> http://bugs.winehq.org/show_bug.cgi?id=26031#c21 >> >> Output of strace: >> getsockopt(24, SOL_SOCKET, SO_REUSEADDR, [0], [4]) = 0 >> setsockopt(24, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 >> bind(24, {sa_family=AF_INET, sin_port=htons(43012), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) > >The output of netstat -an didn't by any chance happen to still show an >endpoint in the LISTEN state for that port number did it? > >rick jones > By chance - no, nothing is/was listening. You can recreate this test in a matter of minutes without ever trusting my word.
Re: A call to revise sockets behaviour
Jul 29, 2013 09:35:25 PM, Stephen wrote: On Mon, 29 Jul 2013 15:10:34 +0000 (UTC) >"Artem S. Tashkinov" wrote: > >> Hello, >> >> Currently the Linux kernel disallows listening on a TCP/UDP port if there are >> open connections against it, regardless of their status. So even if _all_ you have is >> some stale (i.e. no longer active) connections pending destruction, the kernel will >> not allow the socket to be reused. >> >> Stephen Hemminger argues that this behaviour is expected, even though it's 100% >> counterproductive, it defies common sense and I cannot think of any security >> implications should this feature be allowed. >> >> Besides, when discussing this bug on Wine's bugzilla I have shown that this behavior >> affects not only Windows applications running under Wine, but also native POSIX >> applications. >> >> If nothing else is listening for incoming connections, how can _old_ _stale_ connections >> prevent an application from listening on the port? Windows has no qualms about allowing >> that, so why does the Linux kernel work differently? >> >> I want to hear how the current, apparently _broken_ behaviour - "The current socket API >> behavior is unlikely to be changed because so many applications expect it" - can be >> expected. >> >> Also I'd like to know which applications depend on this "feature". >> >> Imagine a situation: >> >> you have an Apache server serving connections on port 80. For some reason a crash in >> one of its modules brings the daemon down, but during the crash Apache had some open >> connections on this port. >> >> According to Stephen Hemminger I cannot relaunch Apache until the kernel waits an >> arbitrary time in order to clean up stale connections from its networking pool. >> >> I fail to see how this behaviour can be "expected". 
>> >> More on it here: >> >> https://bugzilla.kernel.org/show_bug.cgi?id=45571 >> http://bugs.winehq.org/show_bug.cgi?id=26031 > >I understand your problem; people have been dealing with it for 30 years. >The attitude in your response makes it seem like you just discovered fire; >read a book like Stevens' network programming if you need more info. > >If you don't use SO_REUSEADDR then yes, the application has to wait for the time-wait >period. > >If you do enable SO_REUSEADDR then it is possible to bind to a port with existing >stale connections. > A wine developer clearly showed that this option simply doesn't work. http://bugs.winehq.org/show_bug.cgi?id=26031#c21 Output of strace: getsockopt(24, SOL_SOCKET, SO_REUSEADDR, [0], [4]) = 0 setsockopt(24, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(24, {sa_family=AF_INET, sin_port=htons(43012), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use) Artem
A call to revise sockets behaviour
Hello, Currently the Linux kernel disallows listening on a TCP/UDP port if there are open connections against it, regardless of their status. So even if _all_ you have is some stale (i.e. no longer active) connections pending destruction, the kernel will not allow the socket to be reused. Stephen Hemminger argues that this behaviour is expected, even though it's 100% counterproductive, it defies common sense and I cannot think of any security implications should this feature be allowed. Besides, when discussing this bug on Wine's bugzilla I have shown that this behavior affects not only Windows applications running under Wine, but also native POSIX applications. If nothing else is listening for incoming connections, how can _old_ _stale_ connections prevent an application from listening on the port? Windows has no qualms about allowing that, so why does the Linux kernel work differently? I want to hear how the current, apparently _broken_ behaviour - "The current socket API behavior is unlikely to be changed because so many applications expect it" - can be expected. Also I'd like to know which applications depend on this "feature". Imagine a situation: you have an Apache server serving connections on port 80. For some reason a crash in one of its modules brings the daemon down, but during the crash Apache had some open connections on this port. According to Stephen Hemminger I cannot relaunch Apache until the kernel waits an arbitrary time in order to clean up stale connections from its networking pool. I fail to see how this behaviour can be "expected". More on it here: https://bugzilla.kernel.org/show_bug.cgi?id=45571 http://bugs.winehq.org/show_bug.cgi?id=26031 Artem
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
May 8, 2013 04:25:43 AM, Patrik Jakobsson wrote: On Wed, May 8, 2013 at 12:02 AM, Bjorn Helgaas wrote: >> On Tue, May 7, 2013 at 2:48 PM, Patrik Jakobsson wrote: >>> On Tue, May 7, 2013 at 10:20 PM, Bjorn Helgaas wrote: > I'm not sure if reading /proc/mtrr actually reads the registers out of > the CPU each time, or whether we just return the cached values we read > out during initial boot-up. If the latter, then this output isn't > really useful as there's no guarantee the values are still intact. Good point. From what I can tell, on Artem's system with "CPU0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz," we would be using generic_mtrr_ops, and generic_get_mtrr() appears to read from the MSRs, so I think it should be useful. >>> >>> FWIW, that motherboard suffers from a PCI to PCIE bridge problem. It might >>> have been fixed by BIOS upgrades by now but not sure. >>> >>> It might also suffer (depending on the revision) from the Sandy Bridge SATA >>> issue. So if affected, SATA controller is a ticking bomb. >>> >>> I have a P8H67-V motherboard but I haven't seen any suspend related issues. >>> >>> If this is totally unrelated I'm sorry for wasting your time. Just thought >>> it >>> might be good to know. >> >> Thanks for chiming in. I'm not familiar with either of the issues you >> mentioned. Do you have any references where I could read up on them? > >I think this is the official statement from Intel on the SATA issue: >http://newsroom.intel.com/community/intel_newsroom/blog/2011/01/31/intel-identifies-chipset-design-error-implementing-solution My motherboard has a new fixed B3 revision, so this issue doesn't affect me. Besides, this SATA port degradation issue is constantly present - it has no relationship to suspend. 
> >And here's a link to a discussion about the PCIe-to-PCI bridge stuff: >https://lkml.org/lkml/2012/1/30/216 > >> Artem's system has a PCIe-to-PCI bridge (not a PCI-to-PCIe bridge) at >> 05:00.0, but it leads to [bus 06] and there's nothing on bus 06, so I >> don't think that's the problem. > >I meant what you said ;) and yes, it seems unrelated. Both my P8H67 and a >P8P67 I've built behave nicely if nothing is connected. Have you tried suspending more than three times? In the absence of UEFI boot this bug emerges only on a third or even fourth resume attempt. UEFI boot triggers it immediately on a first resume though. >> And the issue affects both USB and a hard drive, so I suspect it's >> more than just SATA. Artem, did you identify the PCI devices leading >> to your USB and hard drive? I can't remember if I've actually seen >> that.
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
May 8, 2013 04:03:18 AM, Bjorn Helgaas wrote: On Tue, May 7, 2013 at 2:48 PM, Patrik Jakobsson wrote: >> On Tue, May 7, 2013 at 10:20 PM, Bjorn Helgaas wrote: I'm not sure if reading /proc/mtrr actually reads the registers out of the CPU each time, or whether we just return the cached values we read out during initial boot-up. If the latter, then this output isn't really useful as there's no guarantee the values are still intact. >>> >>> Good point. From what I can tell, on Artem's system with "CPU0: >>> Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz," we would be using >>> generic_mtrr_ops, and generic_get_mtrr() appears to read from the >>> MSRs, so I think it should be useful. >> >> FWIW, that motherboard suffers from a PCI to PCIE bridge problem. It might >> have been fixed by BIOS upgrades by now but not sure. >> >> It might also suffer (depending on the revision) from the Sandy Bridge SATA >> issue. So if affected, SATA controller is a ticking bomb. >> >> I have a P8H67-V motherboard but I haven't seen any suspend related issues. >> >> If this is totally unrelated I'm sorry for wasting your time. Just thought it >> might be good to know. > >Thanks for chiming in. I'm not familiar with either of the issues you >mentioned. Do you have any references where I could read up on them? > >Artem's system has a PCIe-to-PCI bridge (not a PCI-to-PCIe bridge) at >05:00.0, but it leads to [bus 06] and there's nothing on bus 06, so I >don't think that's the problem. > >And the issue affects both USB and a hard drive, so I suspect it's >more than just SATA. Artem, did you identify the PCI devices leading >to your USB and hard drive? I can't remember if I've actually seen >that. I posted my lspci information here https://bugzilla.kernel.org/show_bug.cgi?id=53551 If that's not enough, please tell me how I can collect it. The SATA issue is discussed here: https://bugzilla.kernel.org/show_bug.cgi?id=43229 According to Intel and Linux kernel developers it poses no threat. 
Best regards, Artem -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
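Bjorn's suggestion above — collecting /proc/mtrr contents before and after suspending — can be scripted so the two states are easy to diff. This is a minimal sketch, not from the thread; the helper names and /tmp paths are my own, and it assumes MTRR support is compiled in (otherwise /proc/mtrr is absent and an empty snapshot is kept):

```shell
#!/bin/sh
# Hypothetical helpers for comparing MTRR state across a suspend cycle.

snapshot_mtrr() {
    # $1: a label such as "before" or "after"; prints the snapshot path
    out="/tmp/mtrr.$1"
    if [ -r /proc/mtrr ]; then
        cat /proc/mtrr > "$out"
    else
        : > "$out"   # no MTRR support exposed; leave an empty snapshot
    fi
    echo "$out"
}

compare_mtrr() {
    # Diff two snapshots; empty output means the ranges look unchanged
    diff -u "$1" "$2"
}

# Usage across a suspend cycle (the suspend command varies by system):
#   b=$(snapshot_mtrr before); systemctl suspend; a=$(snapshot_mtrr after)
#   compare_mtrr "$b" "$a"
```

Note this only catches changes visible through /proc/mtrr; if the kernel merely replays cached values there, the diff would stay empty even when the hardware registers changed, which is exactly the caveat Bjorn raises.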
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
May 8, 2013 04:25:43 AM, Patrik Jakobsson wrote: On Wed, May 8, 2013 at 12:02 AM, Bjorn Helgaas wrote: On Tue, May 7, 2013 at 2:48 PM, Patrik Jakobsson wrote: On Tue, May 7, 2013 at 10:20 PM, Bjorn Helgaas wrote: I'm not sure if reading /proc/mtrr actually reads the registers out of the CPU each time, or whether we just return the cached values we read out during initial boot-up. If the latter, then this output isn't really useful as there's no guarantee the values are still intact. Good point. From what I can tell, on Artem's system with CPU0: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz, we would be using generic_mtrr_ops, and generic_get_mtrr() appears to read from the MSRs, so I think it should be useful. FWIW, that motherboard suffers from a PCI to PCIe bridge problem. It might have been fixed by BIOS upgrades by now but I'm not sure. It might also suffer (depending on the revision) from the Sandy Bridge SATA issue. So if affected, the SATA controller is a ticking bomb. I have a P8H67-V motherboard but I haven't seen any suspend related issues. If this is totally unrelated I'm sorry for wasting your time. Just thought it might be good to know. Thanks for chiming in. I'm not familiar with either of the issues you mentioned. Do you have any references where I could read up on them? I think this is the official statement from Intel on the SATA issue: http://newsroom.intel.com/community/intel_newsroom/blog/2011/01/31/intel-identifies-chipset-design-error-implementing-solution My motherboard has a new fixed B3 revision so this issue doesn't affect me. Besides, this SATA port degradation issue is constantly present - it has no relationship to suspend. And here's a link to a discussion about the PCIe-to-PCI bridge stuff: https://lkml.org/lkml/2012/1/30/216 Artem's system has a PCIe-to-PCI bridge (not a PCI-to-PCIe bridge) at 05:00.0, but it leads to [bus 06] and there's nothing on bus 06, so I don't think that's the problem. I meant what you said ;) and yes, it seems unrelated. 
Both my P8H67 and a P8P67 I've built behave nicely if nothing is connected. Have you tried suspending more than three times? In the absence of UEFI boot this bug emerges only on a third or even fourth resume attempt. UEFI boot triggers it immediately on the first resume though. And the issue affects both USB and a hard drive, so I suspect it's more than just SATA. Artem, did you identify the PCI devices leading to your USB and hard drive? I can't remember if I've actually seen that.
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
May 7, 2013 10:27:30 PM, Bjorn Helgaas wrote: On Tue, May 7, 2013 at 8:59 AM, Artem S. Tashkinov wrote: >> May 7, 2013 09:25:40 PM, Bjorn Helgaas wrote: >>> [+cc Phillip] >>> >>>> I would suspect that Windows' complaint about the BIOS mucking up the MTRRs >>>> is likely the best hint. Likely Windows is detecting the problem and fixing >>>> it up on resume, thus it only complains about "reduced resume performance". >>>> If the MTRRs are messed up, then quite likely parts of RAM have become >>>> uncacheable, causing performance to get randomly slaughtered in various >>>> ways. >>>> >>>> From looking at the code it's not clear if we are checking/restoring the >>>> MTRR contents after resume. If not, maybe we should be. >>> >>>I agree; the MTRR warning is a good hint. Artem? >>> >>>Phillip, I cc'd you because you have similar hardware and your >>>https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1131468 report is >>>slightly similar. Have you seen anything like this "reduced >>>performance after resume" issue? If so, can you collect /proc/mtrr >>>contents before and after suspending? >>> >> >> Like Robert Hancock correctly noted, the Linux kernel lacks the code to check >> for MTRR changes after resume - I'm not a kernel hacker able to write such code >> ;-) >> >> Likewise there's no code to see if RAM pages have become uncacheable - i.e. >> I've no idea how to check that either. >> >> According to /proc/mtrr nothing changes on resume - only Windows detects >> the discrepancy between MTRR regions on resume. dmesg contains no warnings >> or errors (aside from the usual ACPI SATA warnings - but they happen right on >> boot - so I highly doubt the ACPI or SATA layers can be the culprit, since >> USB >> exhibits a similar performance degradation). >> >> In short, there's little to nothing that I can check. > >I'm not trying to be ungrateful, but maybe you could actually collect >the info we've asked for and attach it to the bugzilla. 
It's hard for >me to get excited about digging into this when all I see is "nothing >changes in MTRR" and "it's probably not X." I really need some >concrete data to help rule things out and suggest other things to >investigate. > >Maybe we won't be able to make progress on this until other people >start hitting similar issues and we can find patterns. The pattern is very easy to spot - Linus once said that desktop PCs are not meant to work properly with suspend. That's kind of strange for me as I have yet to encounter a PC where Windows fails to work properly after resume - maybe I'm lucky - who knows. Taking into consideration that only a few people use Linux, most Linux users avoid UEFI, and very few of them actually use suspend/resume, it gets very easy to understand why such bug reports are vanishingly rare. Asus themselves could have easily debugged this issue if they were slightly interested in fixing it, yet their policy is that they only support Windows, and Linux is not their concern. Best regards
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
May 7, 2013 09:25:40 PM, Bjorn Helgaas wrote: > [+cc Phillip] > >> I would suspect that Windows' complaint about the BIOS mucking up the MTRRs >> is likely the best hint. Likely Windows is detecting the problem and fixing >> it up on resume, thus it only complains about "reduced resume performance". >> If the MTRRs are messed up, then quite likely parts of RAM have become >> uncacheable, causing performance to get randomly slaughtered in various >> ways. >> >> From looking at the code it's not clear if we are checking/restoring the >> MTRR contents after resume. If not, maybe we should be. > >I agree; the MTRR warning is a good hint. Artem? > >Phillip, I cc'd you because you have similar hardware and your >https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1131468 report is >slightly similar. Have you seen anything like this "reduced >performance after resume" issue? If so, can you collect /proc/mtrr >contents before and after suspending? > Like Robert Hancock correctly noted, the Linux kernel lacks the code to check for MTRR changes after resume - I'm not a kernel hacker able to write such code ;-) Likewise there's no code to see if RAM pages have become uncacheable - i.e. I've no idea how to check that either. According to /proc/mtrr nothing changes on resume - only Windows detects the discrepancy between MTRR regions on resume. dmesg contains no warnings or errors (aside from the usual ACPI SATA warnings - but they happen right on boot - so I highly doubt the ACPI or SATA layers can be the culprit, since USB exhibits a similar performance degradation). In short, there's little to nothing that I can check. That bug report has nothing to do with my problem - my PC suspends and resumes more or less correctly - everything works (albeit some parts don't work as they should). That person also has a very outdated BIOS - 1904 from 08/15/2011. I wouldn't be surprised if a BIOS update solved his problem. 
Best regards, Artem
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
> >Did this problem ever get resolved? > Hello, Unfortunately, no. Out of curiosity I've tried booting kernel 3.9-rc8 in UEFI mode but it exhibits the same problem. Right after the boot: [root@localhost ~]# dd if=/dev/zero of=test bs=64M count=3 3+0 records in 3+0 records out 201326592 bytes (201 MB) copied, 1.08544 s, 185 MB/s After suspend/resume: # dd if=/dev/zero of=test bs=64M count=3 3+0 records in 3+0 records out 201326592 bytes (201 MB) copied, 66.5392 s, 3.0 MB/s That's for my primary SATA-3 HDD. Forgive my impudence, but I believe debugging the USB stack is tangential to this problem. Something far deeper than USB support breaks, but so far no one has come up with even the slightest clue of what that might be. And like I mentioned before, this problem doesn't affect Windows - once I suspended it seven times in a row and it kept on chugging happily. According to hdparm nothing changes after suspend/resume: Capabilities: LBA, IORDY(can be disabled) Queue depth: 32 Standby timer values: spec'd by Standard, no device specific minimum R/W multiple sector transfer: Max = 16 Current = ? Advanced power management level: disabled Recommended acoustic management value: 208, current value: 0 DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6 Cycle time: min=120ns recommended=120ns PIO: pio0 pio1 pio2 pio3 pio4 Cycle time: no flow control=120ns IORDY flow control=120ns 3 MB/sec matches PIO mode 0, which is ridiculous and implausible given that this HDD is attached via SATA. Besides, hdparm says that: # hdparm -tT --direct /dev/sda /dev/sda: Timing O_DIRECT cached reads: 862 MB in 2.00 seconds = 430.77 MB/sec Timing O_DIRECT disk reads: 520 MB in 3.01 seconds = 173.03 MB/sec So, only writes are affected. 
My dmesg is here: http://ompldr.org/vaThpcA/dmesg
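The before/after dd figures quoted above can be reproduced with a small helper. This is an illustrative sketch, not part of the thread; write_probe and the target path are made-up names. conv=fsync flushes the data to the device before dd reports a rate, so the number reflects the disk rather than the page cache (on bare metal, oflag=direct is an even stricter variant):

```shell
#!/bin/sh
# write_probe FILE [MIB]: writes MIB mebibytes of zeros to FILE and
# prints dd's summary line (bytes copied, elapsed time, rate), then
# removes the test file. Names and defaults are illustrative.
write_probe() {
    file="$1"
    mib="${2:-64}"
    # conv=fsync makes dd flush before reporting; drop it for tmpfs tests
    dd if=/dev/zero of="$file" bs=1M count="$mib" conv=fsync 2>&1 | tail -n 1
    rm -f "$file"
}

# Example: run once after boot and once after resume, then compare rates
# write_probe /mnt/disk-under-test/ddtest.bin 64
```

A drop from hundreds of MB/s to single digits between the two runs is the symptom being discussed.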
CONFIG_X86_INTEL_PSTATE disables CPU frequency transition stats, many governors and other standard features
Hello, Just wanted to let everyone know that CONFIG_X86_INTEL_PSTATE wreaks havoc with the CPU frequency subsystem in the Linux kernel. With this option enabled: 1) All governors except performance and powersave are gone: ondemand, userspace, conservative 2) scaling_cur_freq is gone, thus user space utilities monitoring the CPU frequency have stopped working 3) CPU frequency transition stats are gone, there's no "stats" directory anywhere 4) scaling_available_frequencies is gone, so I cannot set a desired constant CPU frequency (the userspace governor is not available anyway) Is this intended behavior? I shudder to think that's the case. The bug report is filed here: https://bugzilla.kernel.org/show_bug.cgi?id=57141 Best regards, Artem
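The missing sysfs attributes listed above can be checked directly. The sketch below is my own illustration (probe_cpufreq is a hypothetical name); it walks the standard cpufreq sysfs layout for cpu0, where an intel_pstate system would typically report scaling_available_frequencies and the stats directory as missing:

```shell
#!/bin/sh
# Hypothetical probe for the cpufreq attributes discussed above:
# report which ones the active scaling driver actually exposes for cpu0.
probe_cpufreq() {
    cpufreq=/sys/devices/system/cpu/cpu0/cpufreq
    if [ ! -d "$cpufreq" ]; then
        echo "cpufreq sysfs not available"
        return 0
    fi
    for f in scaling_driver scaling_available_governors \
             scaling_cur_freq scaling_available_frequencies; do
        if [ -r "$cpufreq/$f" ]; then
            printf '%s: %s\n' "$f" "$(cat "$cpufreq/$f")"
        else
            printf '%s: missing\n' "$f"
        fi
    done
    # Transition statistics live in a "stats" subdirectory when available
    if [ -d "$cpufreq/stats" ]; then
        echo "stats: present"
    else
        echo "stats: missing"
    fi
}

# probe_cpufreq
```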
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
Feb 27, 2013 12:47:01 AM, Bjorn Helgaas wrote: On Mon, Feb 25, 2013 at 11:35 PM, Artem S. Tashkinov wrote: >> Feb 26, 2013 03:57:52 AM, Bjorn Helgaas wrote: >>> >>>Where are we at with this, Artem? I assume it's still a problem. >>> >> >> Yes, it is, Bjorn. >> >> In order to eliminate this problem I switched back to MBR yesterday, because >> so far I haven't received any instructions or guidance as to how I can debug >> it further. I'm absolutely sure USB write speed is just another >> manifestation of >> it so I decided not to debug USB specifically (it just doesn't make too much >> sense). >> >> What I see is that something terribly wrong is going on but if Linus has no >> ideas >> I, as an average Joe, don't have the slightest clue as to what I can do. >> >> The bug report with necessary, but seemingly useless information, can be >> found here: https://bugzilla.kernel.org/show_bug.cgi?id=53551 >> >> If anyone comes up with new ideas I can quickly try UEFI again now that I >> have two HDDs at my disposal (the old one is formatted as GPT, the new one is >> MBR). > >The ideas I saw are: > >1) Figure out whether it ever worked. If an older kernel worked >correctly and a newer one is broken, bisection is at least a >possibility. You mentioned that it did work before (Feb 12), but in >the past you never suspended twice in one boot session, whereas maybe >you did when seeing the problem? This is difficult to say since the first kernel I tried to run in UEFI mode was 3.7.x, so I've no idea if any previous ones ever worked. > >2) Try "setpci" to set the MSI address back to the original value >to see if it makes a difference (see my Feb 12 message). I will try it soon and report back to you. > >3) Collect "lspci -vvv -" output to investigate the XHCI >Unsupported Request errors. > >4) Use usbmon to collect traces before and after the suspend. Likewise. 
Still I don't quite understand why you are persistent in your desire to investigate USB controllers specifically - my problem affects all storage devices that I have. > >I googled around a bit looking for similar reports. I found lots of >suspend issues, mostly with Windows, but no leads yet. It looks like >the board has been around for a while, so you would think we'd have >some other reports of a problem this bad. But maybe it really is >related to UEFI and nobody really uses that yet? 99% of people around me don't use UEFI, and the ones who use it do it because they want to run Hackintosh (it's quite complicated to run a UEFI OS from a non-UEFI BIOS). That's the main reason you don't see similar reports. UEFI so far hasn't proven its supremacy and efficiency over BIOS. When 3TB and larger HDDs become more widespread people will have to use UEFI. They will simply have no choice (unless of course you have two HDDs, where one is MBR formatted to boot your system, and another one is GPT partitioned in order to support > 2.2TB space). Best regards, Artem
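Idea (2) above — putting the MSI address back with setpci — might be sketched like this. Everything here is an assumption for illustration: msi_dump is a made-up helper, the example device address and the write-back value are placeholders, and the commands need root plus pciutils. setpci's CAP_MSI+4 form addresses the Message Address dword inside the device's MSI capability, wherever it sits in config space:

```shell
#!/bin/sh
# Hypothetical helper: print a device's 32-bit MSI message address.
# Run as root; the device address comes from lspci output.
msi_dump() {
    dev="$1"
    if [ -z "$dev" ]; then
        echo "usage: msi_dump <bus:dev.fn>  (e.g. 00:1f.2)"
        return 0
    fi
    # CAP_MSI+4 names the Message Address dword in the MSI capability
    setpci -s "$dev" CAP_MSI+4.l
}

# To write back a value captured before suspend (placeholder value):
#   setpci -s 00:1f.2 CAP_MSI+4.l=fee0f00c
```

Capturing the value right after boot and rewriting it after resume would show whether the changed MSI address is a cause or just a symptom.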
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
Feb 26, 2013 03:57:52 AM, Bjorn Helgaas wrote: > >Where are we at with this, Artem? I assume it's still a problem. > Yes, it is, Bjorn. In order to eliminate this problem I switched back to MBR yesterday, because so far I haven't received any instructions or guidance as to how I can debug it further. I'm absolutely sure USB write speed is just another manifestation of it, so I decided not to debug USB specifically (it just doesn't make too much sense). What I see is that something terribly wrong is going on, but if Linus has no ideas I, as an average Joe, don't have the slightest clue as to what I can do. The bug report with necessary, but seemingly useless, information can be found here: https://bugzilla.kernel.org/show_bug.cgi?id=53551 If anyone comes up with new ideas I can quickly try UEFI again now that I have two HDDs at my disposal (the old one is formatted as GPT, the new one is MBR). Best regards, Artem
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
Feb 13, 2013 01:32:53 AM, Linus Torvalds wrote: On Tue, Feb 12, 2013 at 10:29 AM, Artem S. Tashkinov wrote: >> Feb 12, 2013 11:30:20 PM, Linus Torvalds wrote: >>> >>>A few things to try to pinpoint: >>> >>> (a) Is it *only* write performance that suffers, or is it other >>>performance too? Networking (DMA? Perhaps only writing *to* the >>>network?)? CPU? >> >> I've tested hdparm -tT --direct and the output on boot and after suspend >> is quite similar. >> >> I've also checked my network read/write speed, and it's the same >> ~ 100MBit/sec (I have no 1Gbit computers on my network >> unfortunately). > >Ok. So it really sounds like just USB and HD writes. Which is quite >odd, since they have basically nothing in common I can think of >(except the obvious block layer issues). > >>> (b) the fact that it apparently happens with both SATA and USB >>>implies that it's neither, and is more likely something core like >>>memory speed (mtrr, caching) or PCI (DMA, burst sizes, whatever). >> >> I've no idea, please check my bug report where I've just added lots of >> information including a diff between on boot and after suspend. > >I'm not seeing anything particularly interesting there. > >Except why/how did the MSI address/data change for the SATA >controller? The irq itself hasn't changed.. There's probably some sane >reason for that too (it's an odd encoding, maybe they code for the >same thing), and there's nothing like that for USB, so... > >And if it was irq problems, I'd expect you to see it more for reads >than for writes anyway. Along with a few messages about missed irqs >and whatever. > >I'm stumped, and have no ideas. I can't even begin to guess how this >would happen. One thing to try is if it happens for all USB ports (you >have multiple controllers) and I assume performance doesn't come back >if you unplug and replug the USB disk.. 
I've just plugged and unplugged my USB stick into all available hubs (including a USB3 one, that is xhci_hcd) and I've got the same write speed on all of them - around 930KB/sec (quite a weird number - as if I'm on USB 1.1) - lsusb says I'm happily running ehci_hcd/2p, 480M and xhci_hcd/2p, 5000M. The only pattern that I see here is that write speed to real devices degrades, while tmpfs write speed stays the same: $ dd if=/dev/zero of=test bs=32M count=32 32+0 records in 32+0 records out 1073741824 bytes (1.1 GB) copied, 0.296323 s, 3.6 GB/s Best regards, Artem
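Note that the dd run above writes through the page cache, which is why tmpfs reports 3.6 GB/s regardless of any device problem. A minimal sketch of a cache-honest variant (file name and sizes are illustrative) forces the data to the device before dd reports its rate:

```shell
# conv=fsync makes dd fsync() the output file before printing throughput,
# so the figure reflects the device rather than RAM. Adding oflag=direct
# would bypass the page cache entirely, but some filesystems (tmpfs among
# them) reject O_DIRECT, so it is left out of this sketch.
dd if=/dev/zero of=ddtest.bin bs=1M count=16 conv=fsync 2>&1 | tail -n 1
rm -f ddtest.bin
```

Run in a directory on the device under test; on a healthy SATA disk the fsync-honest number should be far below the tmpfs figure but nowhere near 930KB/sec.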
Re: Abysmal HDD/USB write speed after sleep on a UEFI system
Feb 12, 2013 11:30:20 PM, Linus Torvalds wrote: >On Mon, Feb 11, 2013 at 10:25 PM, Artem S. Tashkinov wrote: >> Hello Linus, >> >> I've already posted a bug report >> (https://bugzilla.kernel.org/show_bug.cgi?id=53551), >> a message to LKML >> (http://lkml.indiana.edu/hypermail/linux/kernel/1302.1/00837.html) >> and so far I've received zero response even though the bug is quite >> critical as it prevents >> me from using suspend altogether. >> >> I wonder if you could tell me who is responsible for this problem and who I >> need to CC in >> bugzilla. > >According to your bugzilla it doesn't really seem to be strictly >UEFI-specific, and it's hard to tell what subsystem is to blame. > >A few things to try to pinpoint: > > (a) Is it *only* write performance that suffers, or is it other >performance too? Networking (DMA? Perhaps only writing *to* the >network?)? CPU? I've tested hdparm -tT --direct and the output on boot and after suspend is quite similar. I've also checked my network read/write speed, and it's the same ~ 100MBit/sec (I have no 1Gbit computers on my network unfortunately). > > (b) the fact that it apparently happens with both SATA and USB >implies that it's neither, and is more likely something core like >memory speed (mtrr, caching) or PCI (DMA, burst sizes, whatever). I've no idea; please check my bug report where I've just added lots of information including a diff between on boot and after suspend. lspci outputs differ quite substantially, but the things that have changed say nothing to me - you'll want to see it for yourself. I see changes like: - Changed: MRL- PresDet- LinkState- + Changed: MRL- PresDet+ LinkState- i.e. PresDet minus to PresDet plus. - Address: fee0f00c Data: 41e1 + Address: Data: - Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- TAbort- > (c) can you find anything that changes over the suspend/resume? 
IOW, >look at things like "lspci -vvxxx" before-and-after, and see what >changed on the bridges leading to both things etc. > >The performance drop sounds extreme enough that it sounds like caches >got disabled or something, but that should show up as CPU performance >in general being slow, not just writes to disk. But basically, I think >we need more clues about which sub-area is actually the culprit. My >*guess* would be some core PCI thing not being initialized, but I >don't see how you could even make PCI go that slow. Interrupt >problems? DMA failures? I have no idea. > >Has it ever worked? Suspend on desktop motherboards used to be quite >spotty (nobody ever used it, manufacturers didn't care), but it >generally has gotten better since people use it more these days.. I remember it used to work before, but I've never suspended more than once during one boot session before (this time I did it out of pure curiosity) and I've never run Linux from UEFI. > >Added lkml and Bjorn to the participants, in case anybody has any ideas.. > I'll gladly provide any information you need. Thanks a lot, Artem
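The before-and-after comparison Linus suggests can be scripted so nothing is eyeballed by hand; a rough sketch follows. The snapshot paths are illustrative, and lspci comes from the pciutils package (run as root to get full config-space dumps):

```shell
# Snapshot full PCI config space before suspend, again after resume,
# and diff the two. diff exits nonzero when the files differ, which is
# exactly the interesting case here, so that exit code is swallowed.
before=/tmp/lspci-before.txt
after=/tmp/lspci-after.txt
if command -v lspci >/dev/null 2>&1; then
    lspci -vvxxx > "$before"      # run this before suspending
    # systemctl suspend           # suspend/resume happens between snapshots
    lspci -vvxxx > "$after"       # run this after resume
    diff -u "$before" "$after" || true
else
    echo "lspci not installed (pciutils)"
fi
```

The hex dump lines from -xxx make the diff noisy but catch register-level changes (MSI addresses, bridge control bits) that the decoded output can miss.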
Abysmal HDD/USB write speed after sleep on a UEFI system
Hello, I have a P8P67 Pro motherboard made by ASUS and recently I decided to switch to UEFI boot. Maybe it's a coincidence, or maybe Linux kernel 3.7.6 (vanilla) has some serious bug, but after waking up from sleep write performance becomes intolerable. On boot I have: HDD write performance: ~120MB/sec USB write performance: ~18MB/sec After sleep: HDD write performance: ~7MB/sec (i.e. 17 times slower) USB write performance: ~0.5MB/sec (i.e. 36 times slower) This is totally unacceptable; the computer becomes unusable. I'm open to suggestions on how to debug this extremely serious problem. P.S. Since I'm still using an x86 kernel, on boot it switches x86-64 UEFI off: [0.00] efi: EFI v2.31 by American Megatrends [0.00] efi: ACPI=0xdf385000 ACPI 2.0=0xdf385000 SMBIOS=0xdec28e98 MPS=0xfc9a0 [0.00] efi: No EFI runtime due to 32/64-bit mismatch with kernel ... [0.00] efi: Setup done, disabling due to 32/64-bit mismatch Best regards, Artem
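As a quick way to confirm such a 32/64-bit mismatch without digging through the boot log, kernels newer than the 3.7.6 used above expose the firmware word size in sysfs; the sketch below assumes that file (it is absent on BIOS/MBR boots and on kernels that predate it):

```shell
# Compare the firmware's word size against the kernel's. A 32-bit kernel
# on 64-bit UEFI gets no EFI runtime services, as the boot log shows.
if [ -r /sys/firmware/efi/fw_platform_size ]; then
    echo "UEFI firmware: $(cat /sys/firmware/efi/fw_platform_size)-bit"
else
    echo "no UEFI firmware info (BIOS boot, or kernel lacks fw_platform_size)"
fi
echo "kernel: $(uname -m)"   # e.g. i686 vs x86_64
```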
A vague, murky topic of "Buffer I/O error on device sdb6, logical block NNNNNNNNN" and an ext4/VFS oops
Hello, When I was copying a lot of information (tens of gigabytes) from my primary HDD to a secondary HDD I got gazillions of errors like these ones: [19568.964762] EXT4-fs warning (device sdb6): ext4_end_bio:250: I/O error writing to inode 6029369 (offset 8036352 size 524288 starting block 51946549) [19568.964767] sd 2:0:0:0: [sdb] [19568.964768] Result: hostbyte=0x00 driverbyte=0x08 [19568.964770] sd 2:0:0:0: [sdb] [19568.964771] Sense Key : 0xb [current] [descriptor] [19568.964774] Descriptor sense data with sense descriptors (in hex): [19568.964775] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 [19568.964784] 00 00 00 00 [19568.964788] sd 2:0:0:0: [sdb] [19568.964789] ASC=0x0 ASCQ=0x0 [19568.964791] sd 2:0:0:0: [sdb] CDB: [19568.964792] cdb[0]=0x2a: 2a 00 18 c5 25 a8 00 00 70 00 [19568.964804] Buffer I/O error on device sdb6, logical block 13727786 [19568.964806] Buffer I/O error on device sdb6, logical block 13727787 [19568.964808] Buffer I/O error on device sdb6, logical block 13727788 [19568.964810] Buffer I/O error on device sdb6, logical block 13727789 [19568.964812] Buffer I/O error on device sdb6, logical block 13727790 along with: [19568.964832] EXT4-fs warning (device sdb6): ext4_end_bio:250: I/O error writing to inode 6029369 (offset 8560640 size 57344 starting block 51946677) [19568.964843] ata3: EH complete [19624.635176] ata3.00: exception Emask 0x0 SAct 0x3fff SErr 0x4 action 0x6 frozen [19624.635181] ata3: SError: { CommWake } [19624.635184] ata3.00: failed command: WRITE FPDMA QUEUED [19624.635190] ata3.00: cmd 61/00:00:48:ee:cb/04:00:18:00:00/40 tag 0 ncq 524288 out [19624.635190] res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [19624.635193] ata3.00: status: { DRDY } [19624.635196] ata3.00: failed command: WRITE FPDMA QUEUED [19624.635201] ata3.00: cmd 61/08:08:f0:65:bd/00:00:1d:00:00/40 tag 1 ncq 4096 out [19624.635201] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [19624.635203] ata3.00: status: { DRDY } 
[19624.635206] ata3.00: failed command: WRITE FPDMA QUEUED [19624.635211] ata3.00: cmd 61/00:10:48:f2:cb/04:00:18:00:00/40 tag 2 ncq 524288 out [19624.635211] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [19624.635213] ata3.00: status: { DRDY } [19624.635215] ata3.00: failed command: WRITE FPDMA QUEUED [19624.635220] ata3.00: cmd 61/00:18:48:f6:cb/04:00:18:00:00/40 tag 3 ncq 524288 out [19624.635220] res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout) [19624.635223] ata3.00: status: { DRDY } [19624.635225] ata3.00: failed command: WRITE FPDMA QUEUED along with: [19624.635320] ata3: hard resetting link [19624.954880] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300) [19624.956101] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120711/psargs-359) [19624.956109] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT2._GTF] (Node ef0307b0), AE_NOT_FOUND (20120711/psparse-536) [19624.958006] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20120711/psargs-359) [19624.958011] ACPI Error: Method parse/execution failed [\_SB_.PCI0.SAT0.SPT2._GTF] (Node ef0307b0), AE_NOT_FOUND (20120711/psparse-536) [19624.958366] ata3.00: configured for UDMA/133 [19624.960763] ata3.00: device reported invalid CHS sector 0 [19624.960765] ata3.00: device reported invalid CHS sector 0 [19624.960767] ata3.00: device reported invalid CHS sector 0 [19624.960769] ata3.00: device reported invalid CHS sector 0 [19624.960771] ata3.00: device reported invalid CHS sector 0 [19624.960773] ata3.00: device reported invalid CHS sector 0 [19624.960775] ata3.00: device reported invalid CHS sector 0 [19624.960777] ata3.00: device reported invalid CHS sector 0 [19624.960779] ata3.00: device reported invalid CHS sector 0 [19624.960781] ata3.00: device reported invalid CHS sector 0 [19624.960782] ata3.00: device reported invalid CHS sector 0 [19624.960784] ata3.00: device reported invalid CHS sector 0 [19624.960786] ata3.00: device reported invalid CHS 
sector 0 [19624.960788] ata3.00: device reported invalid CHS sector 0 and also this: [19624.961128] Buffer I/O error on device sdb6, logical block 13783485 [19624.961132] EXT4-fs warning (device sdb6): ext4_end_bio:250: I/O error writing to inode 6029369 (offset 236183552 size 524288 starting block 52002249) [19624.961142] sd 2:0:0:0: [sdb] [19624.961144] Result: hostbyte=0x00 driverbyte=0x08 [19624.961146] sd 2:0:0:0: [sdb] [19624.961147] Sense Key : 0xb [current] [descriptor] [19624.961149] Descriptor sense data with sense descriptors (in hex): [19624.961151] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00 [19624.961160] 00 00 00 00 [19624.961164] sd 2:0:0:0: [sdb] [19624.961165] ASC=0x0 ASCQ=0x0 [19624.961167] sd 2:0:0:0: [sdb] CDB: [19624.961168] cdb[0]=0x2a: 2a 00 1d bd 65 f0 00 00 08 00 [19624.961176] end_request: I/O error, dev sdb, sector 498951664 [19624.961179] Buffer I/O error on device
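For what it is worth, the "Sense Key : 0xb" in the log decodes to ABORTED COMMAND in the standard SCSI sense-key table; together with ASC/ASCQ 0x0/0x0 and the "hard resetting link" line, that points at in-flight NCQ writes being killed by a link reset rather than at the disk reporting a media error. A tiny illustrative decoder for the common keys (the function name is made up for this sketch):

```shell
# Map the SCSI sense-key nibble to its standard name (T10 SPC table).
decode_sense_key() {
    case "$1" in
        0x00) echo "NO SENSE";;
        0x01) echo "RECOVERED ERROR";;
        0x02) echo "NOT READY";;
        0x03) echo "MEDIUM ERROR";;
        0x04) echo "HARDWARE ERROR";;
        0x05) echo "ILLEGAL REQUEST";;
        0x06) echo "UNIT ATTENTION";;
        0x0b) echo "ABORTED COMMAND";;
        *)    echo "other ($1)";;
    esac
}
decode_sense_key 0x0b   # the key from the log above, zero-padded
```

A MEDIUM ERROR (0x3) here would instead suggest the drive itself, which is why the distinction matters when deciding whether to suspect the disk or the SATA link/controller.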