XZ compressed kernel broken?
I could not get kernel 4.17 or later to boot on my laptop. I tried various things, including starting a new config from scratch on a freshly downloaded source tree (not patched up from an older one), to no avail.

Today I tried switching kernel compression from XZ to GZIP, and now it boots fine.

Consider this a PSA in case your kernel resets instantly after being loaded from GRUB. It might be some kind of freak interdependency; I didn't analyze it further.

Felix
Re: getting mysterious (to me) EINVAL from inotify_rm_watch
Thus spake Peter Meerwald-Stadler (pme...@pmeerw.net):
> > I am trying to add inotify support to my tail implementation (for -F).
> > This is what happens:
> >
> > inotify_init() = 4
> > inotify_add_watch(4, "/tmp/foo", IN_MODIFY) = 1
> > inotify_rm_watch(4, 1) = -1 EINVAL (Invalid argument)
> > inotify_add_watch(4, "/tmp/foo", IN_MODIFY) = 2
> >
> > There is also some polling, some reading and some statting going on here
> > (but those are on other descriptors than 4 so they should not matter).
> >
> > Can somebody explain the EINVAL I'm getting from inotify_rm_watch to me?
> > This is a stock kernel 4.5.0.

> #include <stdio.h>
> #include <sys/inotify.h>
> int main() {
>   int fd, i, j;
>   printf("init %d\n", fd=inotify_init()); // 3
>   printf("add %d\n", i=inotify_add_watch(fd, "/tmp/foo", IN_MODIFY)); // 1
>   printf("rm %d\n", inotify_rm_watch(fd, i)); // 0
>   printf("add %d\n", j=inotify_add_watch(fd, "/tmp/foo", IN_MODIFY)); // 2
>   return 0;
> }

> Ubuntu kernel x86_64 4.4.0-21, seems to work here
> so we have to guess what's going on between _add and _rm?

Wait! It just occurred to me that this does not make any sense at all. You use the name of the file with inotify_add_watch, not the descriptor to the file. Why would closing the file matter?

My "load generator" test program is:

  #include <assert.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main() {
    int fd=open("/tmp/foo",O_WRONLY|O_CREAT|O_TRUNC,0600);
    assert(fd>-1);
    sleep(1);
    write(fd,"1\n",2);
    sleep(1);
    write(fd,"2\n",2);
    int fd2=open("/tmp/bar",O_WRONLY|O_CREAT|O_TRUNC,0600);
    assert(fd2>-1);
    write(fd2,"3\n",2);
    rename("/tmp/bar","/tmp/foo");
    close(fd);
    sleep(1);
    write(fd2,"4\n",2);
    close(fd2);
  }

I touch /tmp/foo first, then I run my inotify tail -F on it, and I expect the output to be

  1\n2\n3\n4\n

It is. Then I press Ctrl-C.

Here is the strace of the tail:

  execve("./bin-x86_64/tail", ["./bin-x86_64/tail", "-F", "/tmp/foo"], [/* 57 vars */]) = 0
  arch_prctl(ARCH_SET_FS, 0x7fff1b1e2920) = 0
  rt_sigaction(SIGPIPE, {SIG_IGN, [PIPE], SA_RESTORER|SA_NODEFER, 0x4018d0}, {SIG_DFL, [], 0}, 8) = 0
  open("/tmp/foo", O_RDONLY) = 3
  fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f251cf9f000
  read(3, "", 32768) = 0
  inotify_init() = 4
  inotify_add_watch(4, "/tmp/foo", IN_MODIFY) = 1
  fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  stat("/tmp/foo", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  poll([{fd=4, events=POLLIN}], 1, 1000) = 0 (Timeout)
  fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  stat("/tmp/foo", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  poll([{fd=4, events=POLLIN}], 1, 1000) = 1 ([{fd=4, revents=POLLIN}])
  read(4, "\1\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0", 2048) = 16
  fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  stat("/tmp/foo", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
  poll([{fd=4, events=POLLIN}], 1, 1000) = 1 ([{fd=4, revents=POLLIN}])
  read(4, "\1\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0", 2048) = 16
  fstat(3, {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
  read(3, "1\n", 8192) = 2
  write(1, "1\n", 2) = 2
  read(3, "", 8192) = 0
  poll([{fd=4, events=POLLIN}], 1, 1000) = 1 ([{fd=4, revents=POLLIN}])
  read(4, "\1\0\0\0\2\0\0\0\0\0\0\0\0\0\0\0", 2048) = 16
  fstat(3, {st_mode=S_IFREG|0644, st_size=4, ...}) = 0
  read(3, "2\n", 8192) = 2
  write(1, "2\n", 2) = 2
  read(3, "", 8192) = 0
  poll([{fd=4, events=POLLIN}], 1, 1000) = 0 (Timeout)
  fstat(3, {st_mode=S_IFREG|0644, st_size=4, ...}) = 0
  stat("/tmp/foo", {st_mode=S_IFREG|0600, st_size=4, ...}) = 0
  close(3) = 0
  inotify_rm_watch(4, 1) = -1 EINVAL (Invalid argument)
  open("/tmp/foo", O_RDONLY) = 3
  inotify_add_watch(4, "/tmp/foo", IN_MODIFY) = 2
  open("/tmp", O_RDONLY) = 5
  read(3, "3\n4\n", 8192) = 4
  write(1, "3\n4\n", 4) = 4
  read(3, "", 8192) = 0
  poll([{fd=4, events=POLLIN}], 1, 1000) = 1 ([{fd=4, revents=POLLIN}])
  read(4, "\1\0\0\0\0\200\0\0\0\0\0\0\0\0\0\0", 2048) = 16
  fstat(3, {st_mode=S_IFREG|0600, st_size=4, ...}) = 0
  stat("/tmp/foo", {st_mode=S_IFREG|0600, st_size=4, ...}) = 0
  poll([{fd=4, events=POLLIN}], 1, 1000) = 0 (Timeout)
  fstat(3, {st_mode=S_IFREG|0600, st_size=4, ...}) = 0
  stat("/tmp/foo", {st_mode=S_IFREG|0600, st_size=4, ...}) = 0
  poll([{fd=4, events=POLLIN}], 1, 1000) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
  --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
  +++ killed by SIGINT +++

As you can see, I do close(3) and then inotify_rm_watch, and it returns EINVAL. If I do the inotify_rm_watch first and then
Re: getting mysterious (to me) EINVAL from inotify_rm_watch
Thus spake Peter Meerwald-Stadler (pme...@pmeerw.net):
> > I am trying to add inotify support to my tail implementation (for -F).
> > This is what happens:
> >
> > inotify_init() = 4
> > inotify_add_watch(4, "/tmp/foo", IN_MODIFY) = 1
> > inotify_rm_watch(4, 1) = -1 EINVAL (Invalid argument)
> > inotify_add_watch(4, "/tmp/foo", IN_MODIFY) = 2
> >
> > There is also some polling, some reading and some statting going on here
> > (but those are on other descriptors than 4 so they should not matter).
> >
> > Can somebody explain the EINVAL I'm getting from inotify_rm_watch to me?
> > This is a stock kernel 4.5.0.

> #include <stdio.h>
> #include <sys/inotify.h>
> int main() {
>   int fd, i, j;
>   printf("init %d\n", fd=inotify_init()); // 3
>   printf("add %d\n", i=inotify_add_watch(fd, "/tmp/foo", IN_MODIFY)); // 1
>   printf("rm %d\n", inotify_rm_watch(fd, i)); // 0
>   printf("add %d\n", j=inotify_add_watch(fd, "/tmp/foo", IN_MODIFY)); // 2
>   return 0;
> }

> Ubuntu kernel x86_64 4.4.0-21, seems to work here
> so we have to guess what's going on between _add and _rm?

Oh, it turns out to be my fault. I called close() on the file first, then did inotify_rm_watch. It was not clear to me from the documentation that that automatically removes the inotify watch.

Sorry for the noise,

Felix
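The sequence described above can be tried in isolation. What follows is only a minimal sketch, not the code from the tail implementation: the helper name rm_watch_after_close and the scratch path passed to it are made up for illustration. It assumes, as in the strace earlier in the thread, that the watched name no longer refers to the open file by the time close() is called (here the file is simply unlinked instead of being replaced via rename). An inotify watch is attached to the inode, so once the last reference to a deleted inode is dropped, the kernel is expected to discard the watch itself and a later inotify_rm_watch() should fail with EINVAL.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno reported by inotify_rm_watch()
 * (0 if it succeeded), or -1 if the setup itself failed. */
static int rm_watch_after_close(const char *path)
{
    int result = -1;
    int file = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (file < 0)
        return -1;

    int ino = inotify_init();
    if (ino < 0) {
        close(file);
        unlink(path);
        return -1;
    }

    int wd = inotify_add_watch(ino, path, IN_MODIFY);
    if (wd >= 0) {
        unlink(path);  /* the inode now lives only through "file" */
        close(file);   /* last reference gone; the watch should go with it */
        file = -1;
        result = (inotify_rm_watch(ino, wd) == 0) ? 0 : errno;
    }

    if (file >= 0) {   /* only reached if adding the watch failed */
        close(file);
        unlink(path);
    }
    close(ino);
    return result;
}
```

Calling rm_watch_after_close() with a path in a writable directory should return EINVAL on kernels that behave like the one in the strace. Removing the watch before closing the file avoids the error.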
getting mysterious (to me) EINVAL from inotify_rm_watch
Hi,

I am trying to add inotify support to my tail implementation (for -F).
This is what happens:

  inotify_init() = 4
  inotify_add_watch(4, "/tmp/foo", IN_MODIFY) = 1
  inotify_rm_watch(4, 1) = -1 EINVAL (Invalid argument)
  inotify_add_watch(4, "/tmp/foo", IN_MODIFY) = 2

There is also some polling, some reading and some statting going on here (but those are on other descriptors than 4, so they should not matter).

Can somebody explain the EINVAL I'm getting from inotify_rm_watch to me?
This is a stock kernel 4.5.0.

Thanks,

Felix
Re: fork on processes with lots of memory
> Dear Linux kernel devs,

> I talked to someone who uses large Linux based hardware to run a
> process with huge memory requirements (think 4 GB), and he told me that
> if they do a fork() syscall on that process, the whole system comes to
> standstill. And not just for a second or two. He said they measured a 45
> minute (!) delay before the system became responsive again.

I'm sorry, I meant 4 TB, not 4 GB. I'm not used to working with memory sizes of that kind.

> Their working theory is that all the pages need to be marked copy-on-write
> in both processes, and if you touch one page, a copy needs to be made,
> and that just takes a while if you have a billion pages.

> I was wondering if there is any advice for such situations from the
> memory management people on this list.

> In this case the fork was for an execve afterwards, but I was going to
> recommend fork to them for something else that can not be tricked around
> with vfork.

> Can anyone comment on whether the 45 minute number sounds like it could
> be real? When I heard it, I was flabbergasted. But the other person
> swore it was real. Can a fork cause this much of a delay? Is there a way
> to work around it?

> I was going to recommend the fork to create a boundary between the
> processes, so that you can recover from memory corruption in one
> process. In fact, after the fork I would want to munmap almost all of
> the shared pages anyway, but there is no way to tell fork that.

> Thanks,

> Felix

> PS: Please put me on Cc if you reply, I'm not subscribed to this mailing
> list.
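The correction from 4 GB to 4 TB is what makes the "billion pages" figure in the quoted text add up. The arithmetic can be sketched as below, assuming 4 KiB pages and 8-byte page-table entries (typical for x86_64 without huge pages; both numbers are assumptions, not something stated in the thread). The function names are made up for illustration.

```c
#include <stdint.h>

/* Number of pages needed to map `bytes` of memory with pages of
 * `page_size` bytes (rounded up). */
static uint64_t page_count(uint64_t bytes, uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Size in bytes of the last-level page-table entries for `pages` pages,
 * given `pte_size` bytes per entry. */
static uint64_t pte_bytes(uint64_t pages, uint64_t pte_size)
{
    return pages * pte_size;
}
```

With these assumptions, page_count(4ULL << 40, 4096) is 1073741824, that is 2^30 or roughly a billion pages, and pte_bytes() of that with 8-byte entries is 8 GiB of page-table entries that fork() has to walk. For 4 GB the same calculation gives only 1048576 pages.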
fork on processes with lots of memory
Dear Linux kernel devs,

I talked to someone who uses large Linux based hardware to run a process with huge memory requirements (think 4 GB), and he told me that if they do a fork() syscall on that process, the whole system comes to standstill. And not just for a second or two. He said they measured a 45 minute (!) delay before the system became responsive again.

Their working theory is that all the pages need to be marked copy-on-write in both processes, and if you touch one page, a copy needs to be made, and that just takes a while if you have a billion pages.

I was wondering if there is any advice for such situations from the memory management people on this list.

In this case the fork was for an execve afterwards, but I was going to recommend fork to them for something else that can not be tricked around with vfork.

Can anyone comment on whether the 45 minute number sounds like it could be real? When I heard it, I was flabbergasted. But the other person swore it was real. Can a fork cause this much of a delay? Is there a way to work around it?

I was going to recommend the fork to create a boundary between the processes, so that you can recover from memory corruption in one process. In fact, after the fork I would want to munmap almost all of the shared pages anyway, but there is no way to tell fork that.

Thanks,

Felix

PS: Please put me on Cc if you reply, I'm not subscribed to this mailing list.
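For the fork-followed-by-execve case mentioned above, one alternative to calling vfork() by hand is posix_spawn(). The sketch below assumes a C library that implements posix_spawn() without duplicating the parent's page tables (recent glibc versions use a vfork-like clone for it, but that is an implementation detail, not a guarantee of the interface). The helper name run_command is hypothetical, and it does nothing for the second use case, the fork as a fault boundary.

```c
#include <spawn.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;

/* Hypothetical helper: runs `sh -c command` in a child created with
 * posix_spawn() and returns its exit status, or -1 on failure. */
static int run_command(const char *command)
{
    char *const argv[] = { "sh", "-c", (char *)command, NULL };
    pid_t pid;
    int status;

    if (posix_spawn(&pid, "/bin/sh", NULL, NULL, argv, environ) != 0)
        return -1;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, run_command("exit 7") returns 7 on a system where /bin/sh exists.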
Re: security problem with seccomp-filter
> What you're describing should work correctly (it's part of the
> regression test suite we use). So, given that, I'd love to get to the
> bottom of what you're seeing. Do you have a URL to your code? What
> architecture are you running on?

Well, I must be doing something wrong then.

I extracted a test case from my program. I put it on http://ptrace.fefe.de/seccompfail.c

It installs three seccomp filters, the last one containing this:

  DISALLOW_SYSCALL(prctl),

with

  #define DISALLOW_SYSCALL(name) \
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_##name, 0, 1), \
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL)

It is my understanding that that should then kill the process if the prctl syscall is called again. I test this by attempting to install the very same seccomp filter again, which calls prctl, but the process is not killed.

What am I doing wrong?

Thanks,

Felix

#include <stddef.h>
#include <features.h>
#include <inttypes.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip_icmp.h>
#include <arpa/inet.h>
#include <sys/poll.h>
#include <unistd.h>
#include <time.h>
#include <netdb.h>
#include <alloca.h>
#include <signal.h>
#include <errno.h>
#include <sys/prctl.h>
#include <linux/unistd.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

#ifndef SECCOMP_MODE_FILTER
# define SECCOMP_MODE_FILTER 2 /* uses user-supplied filter. */
# define SECCOMP_RET_KILL 0x00000000U /* kill the task immediately */
# define SECCOMP_RET_TRAP 0x00030000U /* disallow and force a SIGSYS */
# define SECCOMP_RET_ALLOW 0x7fff0000U /* allow */
struct seccomp_data {
  int nr;
  __u32 arch;
  __u64 instruction_pointer;
  __u64 args[6];
};
#endif
#ifndef SYS_SECCOMP
# define SYS_SECCOMP 1
#endif

#define syscall_nr (offsetof(struct seccomp_data, nr))

#if defined(__i386__)
# define REG_SYSCALL REG_EAX
# define ARCH_NR AUDIT_ARCH_I386
#elif defined(__x86_64__)
# define REG_SYSCALL REG_RAX
# define ARCH_NR AUDIT_ARCH_X86_64
#else
# error "Platform does not support seccomp filter yet"
#endif

#define ALLOW_SYSCALL(name) \
  BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_##name, 0, 1), \
  BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW)

static int install_syscall_filter(void) {
  /* Linux allows a process to restrict itself (and potential children)
   * in what syscalls can be issued. The mechanism is called
   * seccomp-filter or "seccomp mode 2". It works by reusing the
   * Berkeley Packet Filter, which is meant for PCAP-style packet
   * filtering expressions like "only TCP packets, please". But it is
   * really a bytecode that has to be passed inside an array, and each
   * instruction is constructed using scary looking macros. The basics
   * are not so bad, however. We have two registers, one accumulator
   * and one index register (which is not used in this part of the
   * code), and instead of a network packet we are operating on a
   * certain struct with the syscall info, which is called seccomp_data
   * (reproduced above). */
  struct sock_filter filter[] = {
    /* validate architecture to avoid x32-on-x86_64 syscall aliasing shenanigans */
    /* BPF_LD = load, BPF_W = word, BPF_ABS = absolute offset */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, arch)),
    /* BPF_JMP+BPF_JEQ+BPF_K = compare accumulator to constant (in our
     * case, ARCH_NR), and skip the next instruction if equal */
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, ARCH_NR, 1, 0),
    /* "return SECCOMP_RET_KILL", tell seccomp to kill the process */
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),

    /* load the syscall number */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, nr)),

    /* and now a list of allowed syscalls */
    ALLOW_SYSCALL(rt_sigreturn),
#ifdef __NR_sigreturn
    ALLOW_SYSCALL(sigreturn),
#endif
    ALLOW_SYSCALL(exit_group),
    ALLOW_SYSCALL(exit),
#ifdef __NR_socketcall
    ALLOW_SYSCALL(socketcall),
#else
    ALLOW_SYSCALL(socket),
    ALLOW_SYSCALL(sendto),
    ALLOW_SYSCALL(recvfrom),
#endif
    ALLOW_SYSCALL(poll),

    /* so we can further restrict allowed syscalls */
    ALLOW_SYSCALL(prctl),

    /* so gethostbyname can open /etc/resolv.conf */
    ALLOW_SYSCALL(open),
    ALLOW_SYSCALL(read),
    ALLOW_SYSCALL(mmap),
    ALLOW_SYSCALL(mmap2),
    ALLOW_SYSCALL(munmap),
    ALLOW_SYSCALL(lseek),
    ALLOW_SYSCALL(_llseek),
    ALLOW_SYSCALL(close),

    /* for our time keeping */
    ALLOW_SYSCALL(gettimeofday), // x86_64 uses a vsyscall for this, so this filter will never trigger

    /* for when buffer writes the output; since we only write to stdout, filter for fd==1 */
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, __NR_write, 0, 4),
    /* it's write(2). Load first argument into accumulator */
    BPF_STMT(BPF_LD+BPF_W+BPF_ABS, offsetof(struct seccomp_data, args[0])),
    /* if it's 1 (stdout), skip 1 instruction */
    BPF_JUMP(BPF_JMP+BPF_JEQ+BPF_K, 1, 1, 0),
    /* "return SECCOMP_RET_KILL", tell seccomp to kill the process */
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_KILL),
    /* "return SECCOMP_RET_ALLOW", tell seccomp to allow the syscall */
    BPF_STMT(BPF_RET+BPF_K, SECCOMP_RET_ALLOW),

    /* if none of these syscalls matched, kill the process */
security problem with seccomp-filter
Hi,

I have had some great success with seccomp-filter a while ago, so I decided to use it to add some defense in depth to a ping program I wrote.

The premise is, like for all ping programs I assume, that it starts setuid root, gets a raw socket, drops privileges, parses the command line, potentially does a DNS lookup, and then it sends and receives packets, using gettimeofday and poll.

So I added a seccomp filter that allows this. But where do you put it? Ideally you'd want the filter installed right away after dropping privileges, so the command line parsing and the DNS routines are secured, too. But then you'd allow unnecessary attack surface (why allow open after the DNS routines are done parsing /etc/resolv.conf, for example?).

The documentation says you can add more than one seccomp filter, just call prctl multiple times and allow prctl initially. So that's what I did.

But when I added the secondary filters (which would blacklist open and setsockopt), and for double checking tried installing the last one twice (after the last one was supposed to blacklist prctl), to my surprise my attempt did not lead to process termination but to a success return value.

I think this is a serious security breach. Maybe I am the first one to attempt to install multiple seccomp filters in the same process? The observed behavior is consistent with only the first filter being consulted.

I'm using stock kernel 3.19 for what it's worth.

Thanks,

Felix
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
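For reference, the kernel documentation describes stacked seccomp filters as follows: every installed filter is evaluated for each syscall, and the action with the highest precedence (the most restrictive one) is the one that takes effect. The following is only a user-space sketch of that rule and of how the jump offsets in a DISALLOW_SYSCALL-style macro behave. It does not install any filter and makes no claim about the cause of the behavior reported above. The tiny interpreter run_filter and the helper verdict_for are hypothetical and handle just the three instruction forms used in this thread.

```c
#include <linux/filter.h>   /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/seccomp.h>  /* struct seccomp_data, SECCOMP_RET_* */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>    /* __NR_* syscall numbers */

#define DISALLOW_SYSCALL(name) \
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_##name, 0, 1), \
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_KILL)

/* Interpret one classic-BPF program against a seccomp_data record.
 * Anything outside the three supported instructions counts as KILL. */
static uint32_t run_filter(const struct sock_filter *prog, size_t len,
                           const struct seccomp_data *sd)
{
    uint32_t acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *ins = &prog[pc];
        switch (ins->code) {
        case BPF_LD + BPF_W + BPF_ABS:      /* acc = 32-bit word at offset k */
            if (ins->k > sizeof *sd - sizeof acc)
                return SECCOMP_RET_KILL;
            memcpy(&acc, (const char *)sd + ins->k, sizeof acc);
            break;
        case BPF_JMP + BPF_JEQ + BPF_K:     /* skip jt or jf instructions */
            pc += (acc == ins->k) ? ins->jt : ins->jf;
            break;
        case BPF_RET + BPF_K:               /* return the constant k */
            return ins->k;
        default:
            return SECCOMP_RET_KILL;
        }
    }
    return SECCOMP_RET_KILL;                /* fell off the end */
}

/* Verdict for syscall `nr` with two stacked filters: an early
 * allow-everything filter and a later one that disallows prctl. */
static uint32_t verdict_for(int nr)
{
    static const struct sock_filter first[] = {
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    };
    static const struct sock_filter later[] = {
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
        DISALLOW_SYSCALL(prctl),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    };
    const struct sock_filter *progs[] = { first, later };
    const size_t lens[] = { sizeof first / sizeof first[0],
                            sizeof later / sizeof later[0] };
    struct seccomp_data sd;
    uint32_t ret = SECCOMP_RET_ALLOW;

    memset(&sd, 0, sizeof sd);
    sd.nr = nr;
    for (size_t i = 0; i < 2; i++) {
        uint32_t cur = run_filter(progs[i], lens[i], &sd);
        /* the numerically smallest action is the most restrictive */
        if ((cur & SECCOMP_RET_ACTION) < (ret & SECCOMP_RET_ACTION))
            ret = cur;
    }
    return ret;
}
```

In this model verdict_for(__NR_prctl) yields SECCOMP_RET_KILL even though the first filter allows everything, while verdict_for(__NR_read) yields SECCOMP_RET_ALLOW. Note that the later filter has to load the syscall number into the accumulator itself before comparing it.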
Re: bizarre network timing problem
Thus spake Rick Jones ([EMAIL PROTECTED]):
> Past performance is no guarantee of current correctness :) And over an
> Ethernet, there will be a very different set of both timings and TCP
> segment sizes compared to loopback.

> My guess is that you will find setting the lo mtu to 1500 a very
> interesting experiment.

Setting the MTU on lo to 1500 eliminates the problem and gives me
double-digit MB/sec throughput.

Felix
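For anyone who wants to repeat this experiment from a test program rather than from the command line, here is a minimal sketch using the standard SIOCGIFMTU and SIOCSIFMTU ioctls. The helper names get_mtu and set_mtu are made up for illustration, and changing the MTU needs CAP_NET_ADMIN (typically root).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <net/if.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: return the MTU of interface `name`, or -1. */
int get_mtu(const char *name)
{
    struct ifreq ifr;
    int mtu = -1;
    int s;

    assert(name != NULL);
    s = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket works for the ioctl */
    if (s < 0)
        return -1;
    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    if (ioctl(s, SIOCGIFMTU, &ifr) == 0)
        mtu = ifr.ifr_mtu;
    close(s);
    return mtu;
}

/* Hypothetical helper: set the MTU of interface `name`.
 * Returns 0 on success, -1 on error (e.g. EPERM without CAP_NET_ADMIN). */
int set_mtu(const char *name, int mtu)
{
    struct ifreq ifr;
    int rc, s;

    assert(name != NULL);
    s = socket(AF_INET, SOCK_DGRAM, 0);
    if (s < 0)
        return -1;
    memset(&ifr, 0, sizeof ifr);
    strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
    ifr.ifr_mtu = mtu;
    rc = ioctl(s, SIOCSIFMTU, &ifr);
    close(s);
    return rc == 0 ? 0 : -1;
}
```

With these, the experiment amounts to calling get_mtu("lo") to record the old value, set_mtu("lo", 1500), rerunning the transfer, and restoring the old value afterwards.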
Re: bizarre network timing problem
Thus spake Rick Jones ([EMAIL PROTECTED]):
> >Oh I'm pretty sure it's not my application, because my application performs
> >well over ethernet, which is after all its purpose. Also I see the
> >write, the TCP uncork, then a pause, and then the packet leaving.

> Well, a wise old engineer tried to teach me that the proper spelling is
> ass-u-me :) so just for grins, you might try the TCP_RR test anyway :) And
> even if your application is correct (although I wonder why the receiver
> isn't sucking data-out very quickly...) if you can reproduce the problem
> with netperf it will be easier for others to do so.

My application is only the server, the receiver is smbget from Samba, so
I don't feel responsible for it :-)

Still, when run over Ethernet, it works fine without waiting for
timeouts to expire.

To reproduce this:

  - smbget is from Samba, you probably already have this
  - gatling (my server) can be gotten from
    cvs -d :pserver:[EMAIL PROTECTED]:/cvs -z9 co dietlibc libowfat gatling

dietlibc is not strictly needed, but it's my environment. First build
dietlibc, then libowfat, then gatling.

Felix
Re: TCP_DEFER_ACCEPT issues
Thus spake Eric Dumazet ([EMAIL PROTECTED]):
> 1) Setting a timeout in a millisecond range (< 1000) is not very good
> because some clients may need much more time to send your server the data
> (very long distance). So a second granularity is OK.

I want millisecond accuracy for consistency. select and poll have it,
we have a 1000 Hz timer, we should also expose that accuracy. I don't
want to have sub-second timeouts, in case you were wondering.

> 2) After timeout is elapsed, the server tcp stack has no socket associated
> to your client attempt. So closing the server listening socket wont be able
> to send RST. I agree a RST *should* be sent by the server once the timeout
> is triggered.

I don't see any evidence for a timeout happening at all. I passed 1 as
argument to the setsockopt, so I'd expect a timeout to happen pretty
quickly. There was no connection reset until I Ctrl-C'd the server 15
minutes (!) later.

> A typical tcpdump of what is happening for a tcp_defer_accept timeout of 20
> seconds is :

> [1]08:52:47.480291 IP client.60930 > server.http: S 2498995442:2498995442(0) win 5840 <mss 1460,sackOK,timestamp 2685904595 0,nop,wscale 2>
> [2]08:52:47.480302 IP server.http > client.60930: S 1173302644:1173302644(0) ack 2498995443 win 5840 <mss 1460>
> [3]08:52:47.481669 IP client.60930 > server.http: . ack 1 win 5840
> [4]08:52:50.757543 IP server.http > client.60930: S 1173302644:1173302644(0) ack 2498995443 win 5840 <mss 1460>
> [5]08:52:50.758953 IP client.60930 > server.http: . ack 1 win 5840
> [6]08:52:56.760611 IP server.http > client.60930: S 1173302644:1173302644(0) ack 2498995443 win 5840 <mss 1460>
> [7]08:52:56.761886 IP client.60930 > server.http: . ack 1 win 5840
> [8]08:53:08.771254 IP server.http > client.60930: S 1173302644:1173302644(0) ack 2498995443 win 5840 <mss 1460>
> [9]08:53:08.772514 IP client.60930 > server.http: . ack 1 win 5840
> [10]08:53:32.782488 IP server.http > client.60930: S 1173302644:1173302644(0) ack 2498995443 win 5840 <mss 1460>
> [11]08:53:32.783754 IP client.60930 > server.http: . ack 1 win 5840
> <a very long time, then client finally sends 2 bytes>
> [12]08:59:30.509097 IP client.60930 > server.http: P 1:3(2) ack 1 win 5840
> [13]08:59:30.509125 IP server.http > client.60930: R 1173302645:1173302645(0) win 0

I see this, too. If I connect and do not send anything, I expect the
kernel to drop the connection when the timeout is reached. Nothing like
that happens.

> So TCP_DEFER_ACCEPT might send way more packets than needed.

Only in the face of attackers, and after the handshake. I could live
with that. If the timeout happened.

> We only should wait for the data coming from the client to be able to pass
> the new socket to the listening application.

Yes. And we should send a RST if no data is coming in within the
timeout, which is not happening for me (2.6.23).

Felix
Re: bizarre network timing problem
Thus spake Rick Jones ([EMAIL PROTECTED]):
> >How could I test this theory?

> Can you take another trace that isn't so "cooked?" One that just sticks
> with TCP-level and below stuff?

Sorry for taking so long. Here is a tcpdump. The side on port 445 is
the SMB server using TCP_CORK.

23:03:20.283772 IP 127.0.0.1.33230 > 127.0.0.1.445: S 1503927325:1503927325(0) win 32792 <mss 16396,sackOK,timestamp 9451736 0,nop,wscale 7>
23:03:20.283774 IP 127.0.0.1.445 > 127.0.0.1.33230: S 1513925692:1513925692(0) ack 1503927326 win 32768 <mss 16396,sackOK,timestamp 9451737 9451736,nop,wscale 7>
23:03:20.283797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 1 win 257 <nop,nop,timestamp 9451737 9451737>
23:03:20.295851 IP 127.0.0.1.33230 > 127.0.0.1.445: P 1:195(194) ack 1 win 257 <nop,nop,timestamp 9451740 9451737>
23:03:20.295881 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 195 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.295959 IP 127.0.0.1.445 > 127.0.0.1.33230: P 1:87(86) ack 195 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.295998 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 87 win 256 <nop,nop,timestamp 9451740 9451740>
23:03:20.296063 IP 127.0.0.1.33230 > 127.0.0.1.445: P 195:287(92) ack 87 win 256 <nop,nop,timestamp 9451740 9451740>
23:03:20.296096 IP 127.0.0.1.445 > 127.0.0.1.33230: P 87:181(94) ack 287 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296135 IP 127.0.0.1.33230 > 127.0.0.1.445: P 287:373(86) ack 181 win 255 <nop,nop,timestamp 9451740 9451740>
23:03:20.296163 IP 127.0.0.1.445 > 127.0.0.1.33230: P 181:239(58) ack 373 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296201 IP 127.0.0.1.33230 > 127.0.0.1.445: P 373:459(86) ack 239 win 255 <nop,nop,timestamp 9451740 9451740>
23:03:20.296245 IP 127.0.0.1.445 > 127.0.0.1.33230: P 239:309(70) ack 459 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296286 IP 127.0.0.1.33230 > 127.0.0.1.445: P 459:535(76) ack 309 win 254 <nop,nop,timestamp 9451740 9451740>
23:03:20.296314 IP 127.0.0.1.445 > 127.0.0.1.33230: P 309:461(152) ack 535 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.296361 IP 127.0.0.1.33230 > 127.0.0.1.445: P 535:594(59) ack 461 win 253 <nop,nop,timestamp 9451740 9451740>
23:03:20.296400 IP 127.0.0.1.445 > 127.0.0.1.33230: . 461:16845(16384) ack 594 win 265 <nop,nop,timestamp 9451740 9451740>
23:03:20.335748 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 16845 win 125 <nop,nop,timestamp 9451750 9451740>

[note the .2 sec pause]

23:03:20.547763 IP 127.0.0.1.445 > 127.0.0.1.33230: P 16845:32845(16000) ack 594 win 265 <nop,nop,timestamp 9451803 9451750>
23:03:20.547797 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 0 <nop,nop,timestamp 9451803 9451803>
23:03:20.547855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 32845 win 96 <nop,nop,timestamp 9451803 9451803>
23:03:20.547863 IP 127.0.0.1.445 > 127.0.0.1.33230: P 32845:33229(384) ack 594 win 265 <nop,nop,timestamp 9451803 9451803>
23:03:20.547890 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 33229 win 96 <nop,nop,timestamp 9451803 9451803>

[note the .2 sec pause]

23:03:20.755775 IP 127.0.0.1.445 > 127.0.0.1.33230: P 33229:45517(12288) ack 594 win 265 <nop,nop,timestamp 9451855 9451803>
23:03:20.755855 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 45517 win 96 <nop,nop,timestamp 9451855 9451855>
23:03:20.755868 IP 127.0.0.1.445 > 127.0.0.1.33230: P 45517:49613(4096) ack 594 win 265 <nop,nop,timestamp 9451855 9451855>
23:03:20.755898 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 49613 win 96 <nop,nop,timestamp 9451855 9451855>

[another one]

23:03:20.963789 IP 127.0.0.1.445 > 127.0.0.1.33230: P 49613:61901(12288) ack 594 win 265 <nop,nop,timestamp 9451907 9451855>
23:03:20.963871 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 61901 win 96 <nop,nop,timestamp 9451907 9451907>
23:03:20.963885 IP 127.0.0.1.445 > 127.0.0.1.33230: P 61901:64525(2624) ack 594 win 265 <nop,nop,timestamp 9451907 9451907>
23:03:20.963909 IP 127.0.0.1.33230 > 127.0.0.1.445: . ack 64525 win 96 <nop,nop,timestamp 9451907 9451907>
23:03:20.964101 IP 127.0.0.1.33230 > 127.0.0.1.445: P 594:653(59) ack 64525 win 96 <nop,nop,timestamp 9451907 9451907>
23:03:21.003790 IP 127.0.0.1.445 > 127.0.0.1.33230: . ack 653 win 265 <nop,nop,timestamp 9451917 9451907>
23:03:21.171811 IP 127.0.0.1.445 > 127.0.0.1.33230: P 64525:76813(12288) ack 653 win 265 <nop,nop,timestamp 9451959 9451907>

You get the idea.

Anyway, now THIS is the interesting case, because we have two packets
in the answer, and you see the first half of the answer leaving
immediately (when I wanted the whole answer to be sent) but the second
only leaving after the .2 sec delay.

> If SMB is a one-request-at-a-time protocol (I can never remember),

It is.

> you
> could simulate it with a netperf TCP_RR test by passing suitable values to
> the test-specific -r option:

> netperf -H <remote> -t TCP_RR -- -r <req>,<rsp>

> If that shows similar behaviour then you can ass-u-me it isn't your
> application.

Oh I'm pretty sure it's not my application, because my application
performs well over ethernet, which is after all its purpose. Also I see
the write, the TCP uncork, then a pause, and then the packet leaving.

Felix
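The "write, the TCP uncork" sequence mentioned here corresponds to the cork, write, sendfile, uncork pattern. A minimal sketch of that pattern follows, assuming a blocking TCP socket. The helper names set_cork and send_corked are made up for illustration and are not taken from gatling. IPPROTO_TCP is the same level as the SOL_TCP that strace prints.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: turn TCP_CORK on (1) or off (0).
 * Returns 0 on success, -1 on error. */
int set_cork(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Hypothetical helper: send a header followed by `count` bytes of
 * file_fd starting at offset `off`, corked so the kernel can coalesce
 * them, then uncork to flush. Assumes a blocking socket.
 * Returns 0 on success, -1 on error. */
int send_corked(int sock, const void *hdr, size_t hdr_len,
                int file_fd, off_t off, size_t count)
{
    int rc = 0;

    assert(hdr != NULL);
    if (set_cork(sock, 1) != 0)
        return -1;
    if (write(sock, hdr, hdr_len) != (ssize_t)hdr_len)
        rc = -1;
    while (rc == 0 && count > 0) {
        ssize_t n = sendfile(sock, file_fd, &off, count);
        if (n <= 0) {          /* error or unexpected end of file */
            rc = -1;
            break;
        }
        count -= (size_t)n;
    }
    /* Uncorking is expected to push out any partial segment at once. */
    if (set_cork(sock, 0) != 0)
        rc = -1;
    return rc;
}
```

In the trace above, the interesting question is whether the last partial segment leaves when set_cork(sock, 0) runs or only after the 200 ms timer.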
TCP_DEFER_ACCEPT issues
I am trying to use TCP_DEFER_ACCEPT in my web server.

There are some operational problems. First of all: timeout handling. I
would like to be able to set a timeout in seconds (or better:
milliseconds) for how long the socket is allowed to sit there without
data coming in. For high load situations, I have been enforcing timeouts
in the range of 15 seconds, otherwise someone can DoS the server by
opening a lot of connections and tying up data structures. It is still
possible, of course, to tie up kernel memory this way, by not reacting
to the FIN or RST packets and running into a timeout there, too, but
that is partially tunable via sysctl.

According to tcp(7) the int argument to TCP_DEFER_ACCEPT is in seconds.
In the kernel code, it's converted to TCP timeout units.

When I ran my server, and connected without sending any data, nothing
happened. No timeout. Minutes later, the connection was still there.

Even worse: when I killed (!) the server process (thus closing the
server socket), the client did not get a reset. Only when I typed
something into the telnet session did I get a reset.

This appears to be very broken.

My suggestion:

  1. make the argument to the setsockopt be in seconds, or milliseconds.
  2. if the server socket is closed, reset all pending connections.

Comments?

Felix
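For context, a minimal sketch of the two pieces discussed in this message: setting TCP_DEFER_ACCEPT on the listening socket, and the kind of user-space idle timeout a server can enforce itself when the kernel timeout does not fire. All three helper names are made up for illustration, and the 15 second figure is just the range mentioned above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel to delay accept() until data
 * arrives. tcp(7) documents the argument as a number of seconds.
 * Returns 0 on success, -1 on error. */
int set_defer_accept(int listen_fd, int seconds)
{
    assert(seconds >= 0);
    return setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                      &seconds, sizeof seconds);
}

/* Hypothetical helper: returns 1 if fd has an event (data, EOF or
 * error) within timeout_ms, 0 on timeout, -1 if poll itself fails. */
int wait_for_data(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int rc = poll(&p, 1, timeout_ms);

    if (rc < 0)
        return -1;
    return rc > 0 ? 1 : 0;
}

/* Hypothetical helper: keep a connection that sends something in time,
 * close one that stays silent. Returns 1 if kept, 0 if closed. */
int accept_or_drop(int conn_fd, int timeout_ms)
{
    if (wait_for_data(conn_fd, timeout_ms) == 1)
        return 1;
    close(conn_fd);
    return 0;
}
```

A server would call set_defer_accept(listen_fd, 15) once after listen(), and could still use accept_or_drop(conn_fd, 15000) as a fallback on each accepted connection. A real event-driven server would fold this timeout into its epoll loop rather than block in poll per connection.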
Re: bizarre network timing problem
> the packet trace was a bit too cooked perhaps, but there were indications
> that at times the TCP window was going to zero - perhaps something with
> window updates or persist timers?

Does TCP use different window sizes on loopback? Why is this not
happening on ethernet?

How could I test this theory?

My initial idea was that it has something to do with the different MTU
on loopback. My initial block size was 16k, but the problem stayed when
I changed it to 64k.

Felix
Re: bizarre network timing problem
Thus spake Chuck Ebbert ([EMAIL PROTECTED]):
> > Any ideas what could cause this?

> (cc: netdev)

Maybe I should mention this, too:

accept(5, {sa_family=AF_INET6, sin6_port=htons(59821), inet_pton(AF_INET6, ":::127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [18446744069414584348]) = 8
setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0

And if it were the Nagle algorithm, it should also affect the Ethernet
case.

Felix
bizarre network timing problem
I wrote a small read-only SMB server, and wanted to see how fast it was.
So I used smbget to download a moderately large file from it via
localhost. smbget only got ~70 KB/sec.

This is what the view from strace -tt on the server is:

22:44:58.812467 read(8, "\0\0\0007\377SMB.\0\0\0\0\10\1\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 8192) = 59
22:44:58.812619 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b46b8e5e000
22:44:58.812729 fcntl(9, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE)
22:44:58.812847 epoll_ctl(7, EPOLL_CTL_DEL, 8, {0, {u32=8, u64=13323248792850399240}}) = 0
22:44:58.812946 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLOUT, {u32=8, u64=18251433459580936}}) = 0
22:44:58.813039 epoll_wait(7, {{EPOLLOUT, {u32=8, u64=18251433459580936}}}, 100, 442) = 1
22:44:58.813132 setsockopt(8, SOL_TCP, TCP_CORK, [1], 4) = 0
22:44:58.813215 write(8, "\0\0\372<\377SMB.\0\0\0\0\200A\300\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 64) = 64
22:44:58.813323 sendfile(8, 9, [128000], 64000) = 64000
22:44:58.813430 setsockopt(8, SOL_TCP, TCP_CORK, [0], 4) = 0
22:44:58.813511 munmap(0x2b46b8e5e000, 8192) = 0
22:44:58.813600 epoll_wait(7, {{EPOLLOUT, {u32=8, u64=18251433459580936}}}, 100, 442) = 1
22:44:58.813693 epoll_ctl(7, EPOLL_CTL_DEL, 8, {0, {u32=8, u64=8}}) = 0
22:44:58.813778 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, u64=18252000395264008}}) = 0
22:44:58.813869 epoll_wait(7, {}, 100, 441) = 0
22:44:59.255789 epoll_wait(7, {{EPOLLIN, {u32=8, u64=18252000395264008}}}, 100, 999) = 1
22:44:59.688519 read(8, "\0\0\0007\377SMB.\0\0\0\0\10\1\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3"..., 8192) = 59

As you can see, the time difference between reading the query and
writing the result is very small, but there is a big delay before
receiving the next request.

This is the view from a sniffer on the lo interface:

1192653899.688385 127.0.0.1 -> 127.0.0.1 SMB Read AndX Request, FID: 0x0001, 64000 bytes at offset 192000
1192653899.688399 127.0.0.1 -> 127.0.0.1 TCP 445 > 42990 [ACK] Seq=192660 Ack=779 Win=33920 Len=0 TSV=359208 TSER=359208
1192653899.895725 127.0.0.1 -> 127.0.0.1 SMB [TCP Window Full] Read AndX Response, FID: 0x0001, 64000 bytes
1192653899.895793 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=204948 Win=12288 Len=0 TSV=359260 TSER=359260
1192653899.895805 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
1192653899.935725 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=209044 Win=12288 Len=0 TSV=359270 TSER=359260
1192653900.147739 127.0.0.1 -> 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
1192653900.147767 127.0.0.1 -> 127.0.0.1 TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=221332 Win=0 Len=0 TSV=359323 TSER=359323
1192653900.147807 127.0.0.1 -> 127.0.0.1 TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=221332 Win=12288 Len=0 TSV=359323 TSER=359323
1192653900.147815 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
1192653900.147837 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=225428 Win=12288 Len=0 TSV=359323 TSER=359323
1192653900.355754 127.0.0.1 -> 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
1192653900.355782 127.0.0.1 -> 127.0.0.1 TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=237716 Win=0 Len=0 TSV=359375 TSER=359375
1192653900.355820 127.0.0.1 -> 127.0.0.1 TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=237716 Win=12288 Len=0 TSV=359375 TSER=359375
1192653900.355829 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
1192653900.355849 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=241812 Win=12288 Len=0 TSV=359375 TSER=359375
1192653900.563766 127.0.0.1 -> 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
1192653900.563794 127.0.0.1 -> 127.0.0.1 TCP [TCP ZeroWindow] 42990 > 445 [ACK] Seq=779 Ack=254100 Win=0 Len=0 TSV=359427 TSER=359427
1192653900.563831 127.0.0.1 -> 127.0.0.1 TCP [TCP Window Update] 42990 > 445 [ACK] Seq=779 Ack=254100 Win=12288 Len=0 TSV=359427 TSER=359427
1192653900.563839 127.0.0.1 -> 127.0.0.1 NBSS NBSS Continuation Message
1192653900.563858 127.0.0.1 -> 127.0.0.1 TCP 42990 > 445 [ACK] Seq=779 Ack=256724 Win=12288 Len=0 TSV=359427 TSER=359427
1192653900.56 127.0.0.1 -> 127.0.0.1 SMB Read AndX Request, FID: 0x0001, 64000 bytes at offset 256000

Note the delay between sending the response and getting the reply. Also
note that there is almost no delay between getting the reply and sending
the next request.

My understanding of TCP_CORK from the tcp(7) man page is that it should
flush out the data immediately, but the network trace seems to suggest
that there is a 200 ms delay between the request and the outgoing data.
tcp(7) says there is a 200 ms delay for sending out data when the socket
is in corked mode, so uncorking does not appear to work.

Now for the strange part: the same code works without a 200 ms delay
bizarre network timing problem
I wrote a small read-only SMB server, and wanted to see how fast it was. So I used smbget to download a moderately large file from it via localhost. smbget only got ~70 KB/sec. This is what the view from strace -tt on the server is: 22:44:58.812467 read(8, \0\0\0007\377SMB.\0\0\0\0\10\1\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3..., 8192) = 59 22:44:58.812619 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b46b8e5e000 22:44:58.812729 fcntl(9, F_GETFL) = 0x8000 (flags O_RDONLY|O_LARGEFILE) 22:44:58.812847 epoll_ctl(7, EPOLL_CTL_DEL, 8, {0, {u32=8, u64=13323248792850399240}}) = 0 22:44:58.812946 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLOUT, {u32=8, u64=18251433459580936}}) = 0 22:44:58.813039 epoll_wait(7, {{EPOLLOUT, {u32=8, u64=18251433459580936}}}, 100, 442) = 1 22:44:58.813132 setsockopt(8, SOL_TCP, TCP_CORK, [1], 4) = 0 22:44:58.813215 write(8, \0\0\372\377SMB.\0\0\0\0\200A\300\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3..., 64) = 64 22:44:58.813323 sendfile(8, 9, [128000], 64000) = 64000 22:44:58.813430 setsockopt(8, SOL_TCP, TCP_CORK, [0], 4) = 0 22:44:58.813511 munmap(0x2b46b8e5e000, 8192) = 0 22:44:58.813600 epoll_wait(7, {{EPOLLOUT, {u32=8, u64=18251433459580936}}}, 100, 442) = 1 22:44:58.813693 epoll_ctl(7, EPOLL_CTL_DEL, 8, {0, {u32=8, u64=8}}) = 0 22:44:58.813778 epoll_ctl(7, EPOLL_CTL_ADD, 8, {EPOLLIN, {u32=8, u64=18252000395264008}}) = 0 22:44:58.813869 epoll_wait(7, {}, 100, 441) = 0 22:44:59.255789 epoll_wait(7, {{EPOLLIN, {u32=8, u64=18252000395264008}}}, 100, 999) = 1 22:44:59.688519 read(8, \0\0\0007\377SMB.\0\0\0\0\10\1\310\0\0\0\0\0\0\0\0\0\0\0\0\0\0\232\3..., 8192) = 59 As you can see, the time difference between reading the query and writing the result is very small, but there is a big delay before receiving the next request. 
This is the view from a sniffer on the lo interface:

1192653899.688385 127.0.0.1 - 127.0.0.1 SMB Read AndX Request, FID: 0x0001, 64000 bytes at offset 192000
1192653899.688399 127.0.0.1 - 127.0.0.1 TCP 445 42990 [ACK] Seq=192660 Ack=779 Win=33920 Len=0 TSV=359208 TSER=359208
1192653899.895725 127.0.0.1 - 127.0.0.1 SMB [TCP Window Full] Read AndX Response, FID: 0x0001, 64000 bytes
1192653899.895793 127.0.0.1 - 127.0.0.1 TCP 42990 445 [ACK] Seq=779 Ack=204948 Win=12288 Len=0 TSV=359260 TSER=359260
1192653899.895805 127.0.0.1 - 127.0.0.1 NBSS NBSS Continuation Message
1192653899.935725 127.0.0.1 - 127.0.0.1 TCP 42990 445 [ACK] Seq=779 Ack=209044 Win=12288 Len=0 TSV=359270 TSER=359260
1192653900.147739 127.0.0.1 - 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
1192653900.147767 127.0.0.1 - 127.0.0.1 TCP [TCP ZeroWindow] 42990 445 [ACK] Seq=779 Ack=221332 Win=0 Len=0 TSV=359323 TSER=359323
1192653900.147807 127.0.0.1 - 127.0.0.1 TCP [TCP Window Update] 42990 445 [ACK] Seq=779 Ack=221332 Win=12288 Len=0 TSV=359323 TSER=359323
1192653900.147815 127.0.0.1 - 127.0.0.1 NBSS NBSS Continuation Message
1192653900.147837 127.0.0.1 - 127.0.0.1 TCP 42990 445 [ACK] Seq=779 Ack=225428 Win=12288 Len=0 TSV=359323 TSER=359323
1192653900.355754 127.0.0.1 - 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
1192653900.355782 127.0.0.1 - 127.0.0.1 TCP [TCP ZeroWindow] 42990 445 [ACK] Seq=779 Ack=237716 Win=0 Len=0 TSV=359375 TSER=359375
1192653900.355820 127.0.0.1 - 127.0.0.1 TCP [TCP Window Update] 42990 445 [ACK] Seq=779 Ack=237716 Win=12288 Len=0 TSV=359375 TSER=359375
1192653900.355829 127.0.0.1 - 127.0.0.1 NBSS NBSS Continuation Message
1192653900.355849 127.0.0.1 - 127.0.0.1 TCP 42990 445 [ACK] Seq=779 Ack=241812 Win=12288 Len=0 TSV=359375 TSER=359375
1192653900.563766 127.0.0.1 - 127.0.0.1 NBSS [TCP Window Full] NBSS Continuation Message
1192653900.563794 127.0.0.1 - 127.0.0.1 TCP [TCP ZeroWindow] 42990 445 [ACK] Seq=779 Ack=254100 Win=0 Len=0 TSV=359427 TSER=359427
1192653900.563831 127.0.0.1 - 127.0.0.1 TCP [TCP Window Update] 42990 445 [ACK] Seq=779 Ack=254100 Win=12288 Len=0 TSV=359427 TSER=359427
1192653900.563839 127.0.0.1 - 127.0.0.1 NBSS NBSS Continuation Message
1192653900.563858 127.0.0.1 - 127.0.0.1 TCP 42990 445 [ACK] Seq=779 Ack=256724 Win=12288 Len=0 TSV=359427 TSER=359427
1192653900.56 127.0.0.1 - 127.0.0.1 SMB Read AndX Request, FID: 0x0001, 64000 bytes at offset 256000

Note the delay between sending the response and getting the reply. Also note that there is almost no delay between getting the reply and sending the next request.

My understanding of TCP_CORK from the tcp(7) man page is that uncorking should flush out the data immediately, but the network trace seems to suggest that there is a 200 ms delay between the request and the outgoing data. tcp(7) says there is a 200 ms ceiling on how long output stays corked, so uncorking does not appear to work.

Now for the strange part: the same code works without a 200 ms delay
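For reference, the cork/write/sendfile/uncork sequence visible in the strace output can be sketched in C roughly like this. This is not the server's actual code: the helper names set_cork and send_header_and_file are made up for illustration, and the sketch assumes a blocking socket (the real server drives a non-blocking one through epoll).

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set or clear TCP_CORK on a TCP socket; returns 0 on success, -1 on error. */
int set_cork(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/*
 * Hypothetical helper (name and signature are assumptions): send a protocol
 * header followed by `len` bytes of `filefd` starting at `offset`, corked so
 * that header and payload can share TCP segments.
 * On error the socket may be left corked; the caller is expected to close it.
 */
int send_header_and_file(int sock, const void *hdr, size_t hdrlen,
                         int filefd, off_t offset, size_t len)
{
    if (set_cork(sock, 1) < 0)
        return -1;
    if (write(sock, hdr, hdrlen) != (ssize_t)hdrlen)
        return -1;
    while (len > 0) {
        ssize_t n = sendfile(sock, filefd, &offset, len);
        if (n <= 0)
            return -1;          /* error or unexpected end of file */
        len -= (size_t)n;
    }
    /* Clearing the option asks the kernel to send any queued partial frame. */
    return set_cork(sock, 0);
}
```

With the values from the trace this would be called as send_header_and_file(8, hdr, 64, 9, 128000, 64000).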
Re: bizarre network timing problem
Thus spake Chuck Ebbert ([EMAIL PROTECTED]):
Any ideas what could cause this? (cc: netdev)

Maybe I should mention this, too:

accept(5, {sa_family=AF_INET6, sin6_port=htons(59821), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [18446744069414584348]) = 8
setsockopt(8, SOL_TCP, TCP_NODELAY, [1], 4) = 0

And if it were the Nagle algorithm, it should also affect the Ethernet case.

Felix
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
nforce 4 audio has no s/pdif out
My shiny new nforce 4 main board has sound that is detected OK by ALSA:

intel8x0_measure_ac97_clock: measured 49970 usecs
intel8x0: clocking to 46877
ALSA device list:
  #0: NVidia CK804 with ALC850 at 0xd2003000, irq 185

but I can't get my stereo to play. It is connected via optical S/PDIF. Works fine under Windoze, so the hardware is ok. Any idea what I could do?

Felix
Re: 2.6.11: USB broken on nforce4, ipv6 still broken, centrino speedstep even more broken than in 2.6.10
Thus spake Jeremy Fitzhardinge ([EMAIL PROTECTED]):
> Unfortunately, the Dothans *REQUIRE* some degree of ACPI support; the
> speedfreq-centrino needs to extract a table from ACPI to know what are
> valid operating (voltage/frequency) points to use for the CPU. The
> patch you're using is definitely wrong in principle, though if it works
> for you in practice then by all means use it.

I enabled these:

CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_STAT=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_TABLE=y
CONFIG_X86_SPEEDSTEP_CENTRINO=y
CONFIG_X86_SPEEDSTEP_CENTRINO_ACPI=y
CONFIG_X86_SPEEDSTEP_CENTRINO_TABLE=y

It should have worked, shouldn't it? Well, it did not. You can look at the kernel messages at http://dl.fefe.de/dmesg.gz if that helps. No cpufreq, and as far as I can see, no speedstep. The fan is running, that's all I can tell.

Felix
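One way to check from user space whether a cpufreq driver actually registered is to look for the per-CPU sysfs attributes. The following is only a sketch: it assumes sysfs is mounted on /sys and uses the usual scaling_governor attribute path; the helper names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/*
 * Read the first line of a small text file (such as a sysfs attribute) into
 * buf and strip the trailing newline. Returns 0 on success, -1 if the file
 * cannot be opened or is empty.
 */
int read_first_line(const char *path, char *buf, size_t size)
{
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (!fgets(buf, (int)size, f)) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* Print the current governor of cpu0, or report that cpufreq is absent. */
int print_cpu0_governor(void)
{
    char gov[64];
    /* Assumed location of the attribute; requires sysfs mounted on /sys. */
    const char *path =
        "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor";

    if (read_first_line(path, gov, sizeof gov) < 0) {
        printf("no cpufreq interface at %s\n", path);
        return -1;
    }
    printf("cpu0 governor: %s\n", gov);
    return 0;
}
```

If print_cpu0_governor() reports that the file is missing, no cpufreq driver bound to the CPU, which would match the "No cpufreq" observation above.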
Re: 2.6.11: USB broken on nforce4, ipv6 still broken, centrino speedstep even more broken than in 2.6.10
Thus spake Adam Belay ([EMAIL PROTECTED]):
> > > Why not use ACPI for CPU scaling?
> > Felix, did you try this?
> ACPI is the preferred (and only standardized) method of controlling cpu
> throttling on x86 systems.

1. I don't trust ACPI
2. my battery runs out quicker with ACPI compared to cpufreq

I _really_ _really_ don't want ACPI. No, really not. This is no idle decision. My current notebook is the only hardware I have ever seen where enabling ACPI does not completely break Linux, out of all my 10+ machines, including the other three that are actually in use.

Which ACPI way do you mean, by the way? Just enabling ACPI with thermal and CPU, or the cpufreq ACPI driver? I think I tried that driver and did not get the /sys interface to switch frequencies and governors. If I must, I can try again with 2.6.11, but I really really really do not want to use ACPI, unless someone with a big shotgun is standing behind me.

> Also, as I said earlier, I wanted to see an lspci for the usb issues.

0000:00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
0000:00:01.0 ISA bridge: nVidia Corporation: Unknown device 0050 (rev a3)
0000:00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
0000:00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
0000:00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
0000:00:04.0 Multimedia audio controller: nVidia Corporation CK804 AC'97 Audio Controller (rev a2)
0000:00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev a2)
0000:00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev a3)
0000:00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev a3)
0000:00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
0000:00:0a.0 Bridge: nVidia Corporation CK804 Ethernet Controller (rev a3)
0000:00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
0000:00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
0000:00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
0000:00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
0000:00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
0000:01:00.0 VGA compatible controller: nVidia Corporation: Unknown device 0141 (rev a2)
0000:05:0b.0 FireWire (IEEE 1394): Texas Instruments TSB43AB22/A IEEE-1394a-2000 Controller (PHY/Link)
0000:05:0c.0 Ethernet controller: Marvell Technology Group Ltd. Gigabit Ethernet Controller (rev 13)

This kernel is stock 2.6.11 with CONFIG_USB_DEBUG=y. When I put in my USB hub with my USB webcam, I get this:

Mar 22 23:25:40 demilich kernel: hub 1-0:1.0: state 5 ports 10 chg evt 0400
Mar 22 23:25:40 demilich kernel: ehci_hcd 0000:00:02.1: GetStatus port 10 status 001803 POWER sig=j CSC CONNECT
Mar 22 23:25:40 demilich kernel: hub 1-0:1.0: port 10, status 0501, change 0001, 480 Mb/s
Mar 22 23:25:40 demilich kernel: hub 1-0:1.0: debounce: port 10: total 100ms stable 100ms status 0x501
Mar 22 23:25:40 demilich kernel: ehci_hcd 0000:00:02.1: port 10 high speed
Mar 22 23:25:40 demilich kernel: ehci_hcd 0000:00:02.1: GetStatus port 10 status 001005 POWER sig=se0 PE CONNECT
Mar 22 23:25:40 demilich kernel: usb 1-10: new high speed USB device using ehci_hcd and address 3
Mar 22 23:25:40 demilich kernel: ehci_hcd 0000:00:02.1: port 10 reset error -110
Mar 22 23:25:40 demilich kernel: hub 1-0:1.0: hub_port_status failed (err = -32)
Mar 22 23:25:40 demilich kernel: hub 1-0:1.0: port 10 not enabled, trying reset again...
Mar 22 23:25:40 demilich kernel: ehci_hcd 0000:00:02.1: port 10 reset error -110
Mar 22 23:25:40 demilich kernel: hub 1-0:1.0: hub_port_status failed (err = -32)
Mar 22 23:25:40 demilich kernel: hub 1-0:1.0: port 10 not enabled, trying reset again...
Mar 22 23:25:40 demilich kernel: ehci_hcd 0000:00:02.1: port 10 reset error -110
Mar 22 23:25:40 demilich kernel: hub 1-0:1.0: hub_port_status failed (err = -32)
Mar 22 23:25:40 demilich kernel: hub 1-0:1.0: port 10 not enabled, trying reset again...
Mar 22 23:25:41 demilich kernel: ehci_hcd 0000:00:02.1: port 10 high speed
Mar 22 23:25:41 demilich kernel: ehci_hcd 0000:00:02.1: GetStatus port 10 status 001005 POWER sig=se0 PE CONNECT
Mar 22 23:25:41 demilich kernel: usb 1-10: new device strings: Mfr=0, Product=0, SerialNumber=0
Mar 22 23:25:41 demilich kernel: usb 1-10: hotplug
Mar 22 23:25:41 demilich kernel: usb 1-10: adding 1-10:1.0 (config #1, interface 0)
Mar 22 23:25:41 demilich kernel: usb 1-10:1.0: hotplug

(the line with Mfr=0 looks wrong to me).

Now pulling the device and putting it on through my USB hub (same hardware port on the
Re: 2.6.11: USB broken on nforce4, ipv6 still broken, centrino speedstep even more broken than in 2.6.10
Thus spake Andrew Morton ([EMAIL PROTECTED]):
> > Finally Centrino SpeedStep.
> > I have a "Intel(R) Pentium(R) M processor 1.80GHz" in my notebook.
> > Linux does not support it. This architecture has been out there for
> > months now, and there even was a patch to support it posted here in
> > October last year or so. Linux still does not include it. Until
> > 2.6.11-rc4-bk8 or so, the old patched file from back then still worked.
> > Now it doesn't. Because some interface changed. Now what? Using a
> > Centrino notebook without CPU throttling is completely out of the
> > question. Linux might as well not boot on it at all.
> Could you please dig out the old patch, send it?

I didn't keep the patch, but I kept the patched C file. I'll attach it.

Felix

Attachment: centrino-speedstep.tar.gz (binary data)
Re: 2.6.11: USB broken on nforce4, ipv6 still broken, centrino speedstep even more broken than in 2.6.10
Thus spake Andrew Morton ([EMAIL PROTECTED]):
> > My new nForce 4 mainboard has 10 or so USB 2.0 outlets. In Windows,
> > they all work. In Linux, two of them work. Putting my USB stick or
> > anything else in one of the others produces nothing in Linux.
> > Apparently no IRQ getting through or something?
> Did it work correctly on any earlier kernel? If so, which one(s)?

It turns out the ports do work with 2.6.11; I was running rc4 when I last observed it break. Sorry for the bad bug report.

Felix
2.6.11: USB broken on nforce4, ipv6 still broken, centrino speedstep even more broken than in 2.6.10
Linux is getting less and less usable for me. :-(

My new nForce 4 mainboard has 10 or so USB 2.0 outlets. In Windows, they all work. In Linux, two of them work. Putting my USB stick or anything else in one of the others produces nothing in Linux. Apparently no IRQ getting through or something?

This is what /proc/interrupts has to say:

177: 9503618 IO-APIC-level ohci_hcd, eth0

These are the USB boot messages:

usbcore: registered new driver usbfs
usbcore: registered new driver hub
ehci_hcd 0000:00:02.1: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:02.1: USB 2.0 initialized, EHCI 1.00, driver 26 Oct 2004
hub 1-0:1.0: USB hub found
ohci_hcd: 2004 Nov 08 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
ohci_hcd 0000:00:02.0: new USB bus registered, assigned bus number 2
hub 2-0:1.0: USB hub found
usbcore: registered new driver usblp
drivers/usb/class/usblp.c: v0.13: USB Printer Device Class driver
Initializing USB Mass Storage driver...
usb 2-4: new low speed USB device using ohci_hcd and address 2
usbcore: registered new driver usb-storage
USB Mass Storage support registered.
input: USB HID v1.10 Mouse [B16_b_02 USB-PS/2 Optical Mouse] on usb-0000:00:02.0-4
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.0:USB HID core driver
HUB0 XVR0 XVR1 XVR2 XVR3 USB0 USB2 MMAC MMCI UAR1

As you can see, it appears to work in principle.

Now about IPv6: npush and npoll are two applications I wrote. npush sends multicast announcements and opens a TCP socket. npoll receives the multicast announcement and connects to the source IP/port/scope_id of the announcement. If both are run on the same machine, npoll sees the link local address of eth0 as source IP, and the interface number of eth0 as scope_id. So far so good. Trying to connect() however hangs. This has been broken in different ways for as long as I can remember in Linux, and I keep complaining about it every half a year or so. Can't someone fix this once and for all? IPv4 checks whether we are connecting to our own address and reroutes through loopback, why can't IPv6?

Finally Centrino SpeedStep. I have an "Intel(R) Pentium(R) M processor 1.80GHz" in my notebook. Linux does not support it. This architecture has been out there for months now, and there even was a patch to support it posted here in October last year or so. Linux still does not include it. Until 2.6.11-rc4-bk8 or so, the old patched file from back then still worked. Now it doesn't. Because some interface changed. Now what? Using a Centrino notebook without CPU throttling is completely out of the question. Linux might as well not boot on it at all.

Did I mention that I'm really tired of you putting stones into ATI's way? You might believe you have a right to piss everyone off, after all people get what they paid for. Or maybe you think you are on a crusade to promote open source software. But if you keep alienating me (I'm a software developer) like this, I spend more time working around this bullshit and less time writing free software. In the end, everyone loses.

I sincerely hope some day you people are done pissing in the pool and can create at least some semblance of semi-stable APIs. This house is never going to be safe for living until you stop digging around the foundation. You know, people are actually spending time (and money!) to learn how to write Linux kernel modules. And all this API shifting makes sure their knowledge is completely obsolete a few months down the road. That's not how you create a community of people working on a shared goal.

Enough ranting for today. Sigh.

Felix
Linux kernel headers violate RFC2553
glibc works around this, but the diet libc uses the kernel headers and thus exports the wrong API to user land.

Here is what RFC2553 mandates:

struct ipv6_mreq {
    struct in6_addr ipv6mr_multiaddr; /* IPv6 multicast addr */
    unsigned int    ipv6mr_interface; /* interface index */
};

...and here is what include/linux/in6.h declares:

struct ipv6_mreq {
    /* IPv6 multicast address of group */
    struct in6_addr ipv6mr_multiaddr;

    /* local IPv6 address of interface */
    int             ipv6mr_ifindex;
};

Note the ipv6mr_ifindex instead of the correct ipv6mr_interface. This wrong name is only used twice in net/ipv6/ipv6_sockglue.c, so it should be trivial to fix.

Felix
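To show why the field name matters to user land, here is a minimal sketch of a multicast join written against the RFC 2553 names. It compiles only if the libc headers export ipv6mr_interface; the helper names fill_ipv6_mreq and join_ipv6_group are made up for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Fill an RFC 2553 style ipv6_mreq for `group` on interface index `ifindex`. */
int fill_ipv6_mreq(struct ipv6_mreq *mreq, const char *group,
                   unsigned int ifindex)
{
    memset(mreq, 0, sizeof *mreq);
    if (inet_pton(AF_INET6, group, &mreq->ipv6mr_multiaddr) != 1)
        return -1;                     /* not a valid IPv6 address */
    mreq->ipv6mr_interface = ifindex;  /* the field name the RFC specifies */
    return 0;
}

/* Join `group` on `ifindex` for an AF_INET6 datagram socket `sock`. */
int join_ipv6_group(int sock, const char *group, unsigned int ifindex)
{
    struct ipv6_mreq mreq;

    if (fill_ipv6_mreq(&mreq, group, ifindex) < 0)
        return -1;
    return setsockopt(sock, IPPROTO_IPV6, IPV6_JOIN_GROUP,
                      &mreq, sizeof mreq);
}
```

A portable program written this way fails to compile against headers that only declare ipv6mr_ifindex.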
diff for ipv6 RFC compatibility
I have been told that I should send a diff rather than complain and expect others to make a diff. Oops ;)

So attached is a diff. Oh boy oh boy will I now become part of the Linux Changelog? ;)

Felix

--- linux/include/linux/in6.h Sat May 19 02:45:08 2001
+++ linux.fefe/include/linux/in6.h Fri Jun 8 20:37:13 2001
@@ -53,7 +53,7 @@
     struct in6_addr ipv6mr_multiaddr;
 
     /* local IPv6 address of interface */
-    int ipv6mr_ifindex;
+    int ipv6mr_interface;
 };
 
 struct in6_flowlabel_req
--- linux/net/ipv6/ipv6_sockglue.c Mon Mar 26 04:14:25 2001
+++ linux.fefe/net/ipv6/ipv6_sockglue.c Fri Jun 8 20:37:01 2001
@@ -346,9 +346,9 @@
         break;
 
     if (optname == IPV6_ADD_MEMBERSHIP)
-        retv = ipv6_sock_mc_join(sk, mreq.ipv6mr_ifindex, &mreq.ipv6mr_multiaddr);
+        retv = ipv6_sock_mc_join(sk, mreq.ipv6mr_interface,
+            &mreq.ipv6mr_multiaddr);
     else
-        retv = ipv6_sock_mc_drop(sk, mreq.ipv6mr_ifindex, &mreq.ipv6mr_multiaddr);
+        retv = ipv6_sock_mc_drop(sk, mreq.ipv6mr_interface,
+            &mreq.ipv6mr_multiaddr);
     break;
 }
 case IPV6_ROUTER_ALERT:
ipv6: can't connect to myself?!
I can't connect() to my own link-local address. connect just hangs. Before some wise guy now tells me I should be connecting to ::1 instead: "oh, really!" ;)

The application is npush/npoll from my ncp program suite, which can be found at http://www.fefe.de/ncp/. Basically, the sender sends UDP announcements and the receiver connects to the IP of the announcement on the interface of the announcement. strace of the receiver reveals that it hangs in the connect() call.

Any takers? Why does this not work?

Felix
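For readers who want to reproduce this, the connect step can be sketched as below. This is not the npoll source: the helper names are made up, and the interface is looked up by name with if_nametoindex(), whereas npoll takes the scope id from the received announcement.

```c
#include <arpa/inet.h>
#include <net/if.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Build a sockaddr_in6 for a link-local address. Link-local addresses are
 * only meaningful together with an interface, carried in sin6_scope_id.
 * Returns 0 on success, -1 on a bad address or unknown interface.
 */
int make_scoped_sockaddr(struct sockaddr_in6 *sa, const char *addr,
                         const char *ifname, unsigned short port)
{
    memset(sa, 0, sizeof *sa);
    sa->sin6_family = AF_INET6;
    sa->sin6_port = htons(port);
    if (inet_pton(AF_INET6, addr, &sa->sin6_addr) != 1)
        return -1;
    sa->sin6_scope_id = if_nametoindex(ifname);
    return sa->sin6_scope_id ? 0 : -1;
}

/* Open a TCP connection to addr%ifname; returns the socket or -1. */
int connect_link_local(const char *addr, const char *ifname,
                       unsigned short port)
{
    struct sockaddr_in6 sa;
    int fd;

    if (make_scoped_sockaddr(&sa, addr, ifname, port) < 0)
        return -1;
    fd = socket(AF_INET6, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling connect_link_local() with the machine's own fe80:: address and "eth0" is the case described above as hanging.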
include/asm-sparc/ptrace.h is broken
On line 76, it includes <asm/asm_offsets.h>, which does not exist. This is critical because this include file does not work when used from a libc.

ptrace.h is from 1997 on my 2.4.5 kernel, so this is not something that broke recently. My suggestion is to remove the offending line altogether or at least protect it with #ifdef __KERNEL__.

Felix
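The guard suggested above would look roughly like this. It is a sketch only: the rest of ptrace.h is elided, and the PTRACE_SKETCH_HAVE_OFFSETS macro is a made-up marker that exists purely to show which branch was taken.

```c
/* Sketch of the suggested guard; the surrounding header is elided. */
#ifdef __KERNEL__
/* Assumed to be available only when building the kernel itself. */
#include <asm/asm_offsets.h>
#define PTRACE_SKETCH_HAVE_OFFSETS 1   /* hypothetical marker, illustration only */
#else
/* User land (for example a libc) never sees the missing include. */
#define PTRACE_SKETCH_HAVE_OFFSETS 0
#endif
```

Compiled from user land, where __KERNEL__ is not defined, the include is skipped and the header stays usable.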
problem: reading from (rivafb) framebuffer is really slow
When benchmarking DirectFB, I found that a typical software alpha blending rectangle fill is completely dominated (I'm talking 90% of the CPU cycles here) by the time it takes to read pixels from the framebuffer. The pixels are read linearly in chunks of aligned 32-bit words.

It's a GeForce 2 GTS in 1024x768 with 32-bit color depth using rivafb.

This looks quite extreme to me. Any ideas? Maybe rivafb does not initialize AGP and the card is in PCI mode or something?

Felix
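The access pattern described above can be timed in isolation with something like the following. This is a rough sketch, not the DirectFB code: the function names are made up, and it measures one linear pass over whatever memory it is given.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Read n aligned 32-bit words linearly; volatile forces a real load per word. */
uint32_t read_words(const volatile uint32_t *src, size_t n)
{
    uint32_t acc = 0;

    for (size_t i = 0; i < n; i++)
        acc += src[i];
    return acc;
}

/* Time one linear pass; returns MB/s, or 0.0 if the pass was too fast to measure. */
double read_throughput_mb_s(const volatile uint32_t *src, size_t n)
{
    struct timespec t0, t1;
    double secs;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    (void)read_words(src, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    secs = (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return secs > 0.0 ? (double)(n * sizeof(uint32_t)) / secs / 1e6 : 0.0;
}
```

Running it once on an mmap()ed framebuffer device and once on an ordinary malloc()ed buffer of the same size gives a direct comparison between framebuffer reads and main memory reads.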
vfat large file support
I can't copy a file larger than 2 gigs to my vfat partition. What gives? 2.4.4-ac5 kernel. My cp copies 2 gigs and then aborts.

$ echo foo >> file_on_vfat_partition

causes the shell to become unresponsive and consume lots of CPU time.

Felix
chown bug
The man page says:

  If the owner or group is specified as -1, then that ID is not changed.

If a non-root user calls chown("/usr",-1,-1), he gets EPERM. Why? He explicitly told the kernel that he does not actually want to change anything. Why would the kernel say EPERM?

Felix
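A small test helper for this case might look like the following. It is a sketch; the function name noop_chown is made up, and the result for a file owned by someone else depends on the kernel in use.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Call chown() asking to change neither the owner nor the group.
 * Returns 0 on success, otherwise the errno value chown() reported.
 */
int noop_chown(const char *path)
{
    if (chown(path, (uid_t)-1, (gid_t)-1) == 0)
        return 0;
    return errno;
}
```

Run as a non-root user, noop_chown("/usr") returning EPERM reproduces the behavior questioned above.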
USB and 2.4.2: "uhci: host system error, PCI problems?"
This is the log.

Feb 23 14:35:53 hellhound kernel: usb.c: registered new driver usb_mouse
Feb 23 14:35:53 hellhound kernel: PCI: Found IRQ 12 for device 00:07.2
Feb 23 14:35:53 hellhound kernel: PCI: The same IRQ used for device 00:07.3
Feb 23 14:35:53 hellhound kernel: PCI: The same IRQ used for device 00:0b.0
Feb 23 14:35:53 hellhound kernel: uhci.c: USB UHCI at I/O 0xa400, IRQ 12
Feb 23 14:35:53 hellhound kernel: uhci.c: detected 2 ports
Feb 23 14:35:53 hellhound kernel: usb.c: new USB bus registered, assigned bus number 1
Feb 23 14:35:53 hellhound kernel: hub.c: USB hub found
Feb 23 14:35:53 hellhound kernel: hub.c: 2 ports detected
Feb 23 14:35:53 hellhound kernel: PCI: Found IRQ 12 for device 00:07.3
Feb 23 14:35:53 hellhound kernel: PCI: The same IRQ used for device 00:07.2
Feb 23 14:35:53 hellhound kernel: PCI: The same IRQ used for device 00:0b.0
Feb 23 14:35:53 hellhound kernel: uhci.c: USB UHCI at I/O 0xa800, IRQ 12
Feb 23 14:35:53 hellhound kernel: uhci.c: detected 2 ports
Feb 23 14:35:53 hellhound kernel: usb.c: new USB bus registered, assigned bus number 2
Feb 23 14:35:53 hellhound kernel: hub.c: USB hub found
Feb 23 14:35:53 hellhound kernel: hub.c: 2 ports detected
Feb 23 14:35:53 hellhound usbmgr[2819]: start 0.4.4
Feb 23 14:35:53 hellhound kernel: usb.c: registered new driver hid
Feb 23 14:35:53 hellhound kernel: mice: PS/2 mouse device common for all mice
Feb 23 14:35:53 hellhound insmod: Note: /etc/modules.conf is more recent than /lib/modules/2.4.2-fefe1/modules.dep
Feb 23 14:35:53 hellhound usbmgr[2821]: "hid" was loaded
Feb 23 14:35:53 hellhound usbmgr[2821]: "mousedev" was loaded
Feb 23 14:35:53 hellhound usbmgr[2821]: open error "host"
Feb 23 14:35:53 hellhound usbmgr[2824]: mount /proc/bus/usb
Feb 23 14:35:53 hellhound kernel: uhci.c: root-hub INT complete: port1: 5ab port2: 58a data: 6
Feb 23 14:35:53 hellhound kernel: hub.c: USB new device connect on bus1/1, assigned device number 22
Feb 23 14:35:53 hellhound kernel: uhci.c: root-hub INT complete: port1: 58a port2: 58a data: 6
Feb 23 14:35:53 hellhound kernel: mouse0: PS/2 mouse device for input0
Feb 23 14:35:53 hellhound kernel: input0: Logitech USB Mouse on usb1:22.0
Feb 23 14:35:53 hellhound kernel: uhci.c: root-hub INT complete: port1: 5a5 port2: 588 data: 4
Feb 23 14:35:54 hellhound kernel: uhci.c: root-hub INT complete: port1: 58a port2: 58a data: 6
Feb 23 14:35:54 hellhound usbmgr[2821]: class:0x9 subclass:0x0 protocol:0x0
Feb 23 14:35:54 hellhound kernel: uhci.c: root-hub INT complete: port1: 588 port2: 588 data: 6
Feb 23 14:35:54 hellhound usbmgr[2821]: USB device is matched the configuration
Feb 23 14:35:54 hellhound usbmgr[2821]: "none" isn't loaded
Feb 23 14:35:54 hellhound usbmgr[2821]: vendor:0x46d product:0xc00c
Feb 23 14:35:54 hellhound usbmgr[2821]: class:0x3 subclass:0x1 protocol:0x2
Feb 23 14:35:54 hellhound usbmgr[2821]: USB device is matched the configuration
Feb 23 14:35:54 hellhound kernel: uhci: host system error, PCI problems?
Feb 23 14:35:54 hellhound kernel: uhci: host controller halted. very bad

Any ideas? It's a VIA based Athlon board. Worked fine with 2.4.0 and 2.4.1. The only change was that I added rivafb, which finally adds Geforce support in 2.4.2. /proc/interrupts does not show any interrupts assigned to rivafb, maybe there is a conflict?

           CPU0
  0:     457839          XT-PIC  timer
  1:      24705          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  5:     156420          XT-PIC  eth0
  8:          0          XT-PIC  rtc
 11:         26          XT-PIC  ncr53c8xx
 12:       5232          XT-PIC  usb-uhci, usb-uhci
 14:      17610          XT-PIC  ide0
 15:       2441          XT-PIC  ide1
NMI:          0
ERR:          0

Felix
Re: [rfc] Near-constant time directory index for Ext2
Thus spake Alan Cox ([EMAIL PROTECTED]):
> > > There will be a lot fewer metadata index
> > > blocks in your directory file, for one thing.
> > Oh yes, another thing: a B-tree directory structure does not need
> > metadata index blocks.
> Before people get excited about complex tree directory indexes, remember to
> solve the other 95% before implementation - recovering from lost blocks,
> corruption and the like

And don't forget the trouble with NFS handles after the tree was rebalanced. Trees are nice only theoretically. In practice, the benefits are outweighed by the nastiness in the form of fsck and NFS and bigger code (normally: more complex -> less reliable).

Felix
sendfile64?
Why isn't there a sendfile64?

Felix
Re: Linux stifles innovation...
Thus spake Dennis ([EMAIL PROTECTED]):
> You are confusing "progress" with "innovation". If there is only 1 choice,
> thats not innovation. Expanding on a bad idea, or even a good one, is not
> innovation.

This is bizarre. Please name one innovation in the history of mankind that could not be seen as expanding on a different idea or even cloning an idea from someone else (for example, nature).

Dennis, do you have a single argument or are you going to post bizarre statements like this forever? Please just say so, so people can killfile you now.

Felix
Re: [LONG RANT] Re: Linux stifles innovation...
Thus spake Henning P . Schmiedehausen ([EMAIL PROTECTED]):
> "If a company does not write a driver which works on all hardware
> platforms in all cases and gives us the source, then it is better,
> that the company writes no drivers at all."
> "If I can't force a company to write a driver for everyone, then I
> don't want to write them any driver at all."
> IMHO you're like a spoiled kid: "If I can't have it, noone should have it".

Henning, what is the matter with you? I bought the hardware. Why should I pay for the driver? Not even on Windows do you pay extra for a driver!

Please state your intentions. Why would you want to split the Linux user base into people who pay companies to screw them (I get a driver for hardware I already paid for, but the driver will work with exactly one kernel version on one hardware) and people who think they deserve support when they buy hardware?

Why do we even have to discuss drivers? A company that actively hinders developing a good driver with patents, NDAs or other legal crap does not deserve my money. If you throw your money at such people, you deserve everything you get.

Felix
Re: [reiserfs-list] ReiserFS Oops (2.4.1, deterministic, symlink related)
Thus spake J . A . Magallon ([EMAIL PROTECTED]):
> > How about a simple patch to the top level makefile that checks the gcc
> > version then prints a distinct message ..'this compiler hasn't been approved
> > for compiling the kernel', sleeping for one second, then continuing on. This
> > solution doesn't stop compiling and makes a visible indicator without forcing
> > anything.
> Or a config option like CONFIG_TRUSTED_COMPILER, and everyone that wants
> can bracket his code in 'if [ $TRUSTED = "y" ] ... fi', so if some driver-fs
> fails with untrusted compilers it is just not selectable.

What kind of crap is this? It is not the kernel's job to work around RedHat bugs. If RedHat ships a broken compiler, it is their responsibility to tell their customers and provide a working one.

This kind of compatibility crap has caused commercial Unices to suffocate in their own bloat. We don't need this. And we don't want this.

Felix
Choosing Linux NICs (was: Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN))
Thus spake Felix von Leitner ([EMAIL PROTECTED]):
> What is missing here is a good authoritative web resource that tells
> people which NIC to buy.

I started one now. It's at http://www.fefe.de/linuxeth/, but there is not much content yet. Please contribute!

Felix
Re: sendfile+zerocopy: fairly sexy (nothing to do with ECN)
Thus spake Andrew Morton ([EMAIL PROTECTED]):
> Conclusions:
> For a NIC which cannot do scatter/gather/checksums, the zerocopy
> patch makes no change in throughput in all case.
> For a NIC which can do scatter/gather/checksums, sendfile()
> efficiency is improved by 40% and send() efficiency is decreased by
> 10%. The increase and decrease caused by the zerocopy patch will in
> fact be significantly larger than these two figures, because the
> measurements here include a constant base load caused by the device
> driver.

What is missing here is a good authoritative web resource that tells people which NIC to buy.

I have a tulip NIC because a few years ago that apparently was the NIC of choice. It has good multicast (which is important to me), but AFAIK it has neither scatter-gather nor hardware checksumming.

Is there such a web page already? If not, I volunteer to create and maintain one.

Felix
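For readers who have not used the call being benchmarked, here is a rough sketch of a sendfile() loop on Linux. It is not the code from Andrew's measurements. The helper names, the scratch file and the AF_UNIX socket pair are assumptions made only to keep the example self-contained, and a real server would pass a TCP socket as out_fd.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: push the whole file behind in_fd to out_fd.
 * Returns the number of bytes sent, or -1 on error. */
ssize_t send_whole_file(int out_fd, int in_fd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(in_fd, &st) < 0)
        return -1;
    while (offset < st.st_size) {
        /* sendfile() advances offset by the number of bytes it sent */
        ssize_t n = sendfile(out_fd, in_fd, &offset,
                             (size_t)(st.st_size - offset));
        if (n < 0)
            return -1;
        if (n == 0)     /* file shrank underneath us */
            break;
    }
    return (ssize_t)offset;
}

/* Self-check: send a small scratch file through a socket pair and
 * compare what arrives. Returns 0 if the bytes match, -1 otherwise. */
int sendfile_demo(void)
{
    static const char msg[] = "hello, sendfile\n";
    const ssize_t len = (ssize_t)(sizeof msg - 1);
    char path[] = "/tmp/sendfile_demo_XXXXXX";
    char buf[sizeof msg];
    int sv[2], ok = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);       /* the open descriptor keeps the file alive */

    if (write(fd, msg, (size_t)len) == len &&
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == 0) {
        if (send_whole_file(sv[1], fd) == len &&
            read(sv[0], buf, sizeof buf) == len &&
            memcmp(buf, msg, (size_t)len) == 0)
            ok = 0;
        close(sv[0]);
        close(sv[1]);
    }
    close(fd);
    return ok;
}
```

The loop only does the bookkeeping. Whether the data path is actually zero-copy depends on the NIC and driver features discussed in the quoted conclusions.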
Off-Topic: how do I trace a PID over double-forks?
This is more a Unix API question than a Linux question. I hope the issue is interesting enough to be of interest to some of you.

Basically, I am writing an init which features process watching capabilities. My init has a management channel with which you can tell it "the PID of the ssh process is really 123 instead of 12". When init forks a getty and that getty exits, it is restarted. So far so good.

But I want my init to be able to restart uncooperative processes like sendmail that fork into the background. sendmail may be a bad example because the sources are available, but please imagine you didn't have the sources to sendmail or didn't want to touch them.

Now, the back channel for my init has a function that allows setting the PID of a process. The idea is that the init does not start sendmail but a wrapper. The wrapper forks, runs sendmail, does some magic trickery to find the real PID of the daemonized sendmail and tells init this PID so init will know it has to restart sendmail when it exits and won't restart the wrapper when that exits. Follow me this far? Great!

The real problem at hand is: what kind of trickery can I employ in that wrapper? I was hoping for something that is not Linux specific, but I haven't found anything yet. I was also hoping that I could find a method that does not rely on /proc being there or on any filesystem being mounted read-write (yes, my back channel works if the filesystem is mounted read-only). So, using /proc and relying on something like /var/run/sendmail.pid are out.

Someone suggested using fcntl to create a lock and then using fcntl again to see who holds the lock. That sounded good at first, but locks do not seem to be inherited across fork(). Does anyone have another idea?

In case I made you wonder: http://www.fefe.de/minit/

Felix
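The fcntl observation can be checked with a small sketch like the one below. The helper name and the lock file path are made up for illustration. The parent takes a write lock, forks, and the child asks F_GETLK who holds the lock. If the lock were inherited, the child would see no conflicting lock; with POSIX record locks it should see the parent's PID instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: lock path in this process, fork, and return the
 * PID that the child sees as the lock holder. Returns 0 if the child
 * sees no conflicting lock and -1 on error. */
pid_t lock_holder_seen_by_child(const char *path)
{
    assert(path != NULL);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* l_start = 0 and l_len = 0 lock the whole file */
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int p[2];
    if (fcntl(fd, F_SETLK, &fl) < 0 || pipe(p) < 0) {
        close(fd);
        return -1;
    }

    pid_t holder = -1;
    pid_t child = fork();
    if (child == 0) {
        /* Child: ask which lock would block a write lock of our own */
        struct flock probe = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
        if (fcntl(fd, F_GETLK, &probe) == 0)
            holder = (probe.l_type == F_UNLCK) ? 0 : probe.l_pid;
        if (write(p[1], &holder, sizeof holder) != (ssize_t)sizeof holder)
            _exit(1);
        _exit(0);
    }

    close(p[1]);        /* parent only reads */
    if (child > 0) {
        if (read(p[0], &holder, sizeof holder) != (ssize_t)sizeof holder)
            holder = -1;
        waitpid(child, NULL, 0);
    }
    close(p[0]);
    close(fd);          /* closing the descriptor also drops the lock */
    return holder;
}
```

Since the lock stays with the process that took it, a daemon that forks and lets its parent exit ends up holding no lock at all, which is why this trick does not identify the daemonized PID.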
Re: Documenting stat(2)
Thus spake Eric S. Raymond ([EMAIL PROTECTED]):
> Here is what I think I know about stat(2) that isn't in the
> Linux man pages:
> * For a symlink (S_IFLNK) it reports the size of the link file, not the
> size of the file the link points to.

I think you are confusing stat and lstat here.

Felix
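The difference is easy to see in a short sketch. The helper name and the two scratch paths under /tmp are assumptions chosen for illustration. stat() follows the link and reports the size of the target file, while lstat() reports on the link itself, whose size is the length of the path stored in it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: create a 10-byte file and a symlink to it, then
 * report st_size as seen through stat() and through lstat().
 * Returns 0 on success, -1 on error. */
int symlink_size_demo(off_t *stat_size, off_t *lstat_size)
{
    const char *target = "/tmp/stat_demo_target";
    const char *link = "/tmp/stat_demo_link";
    struct stat st;
    int rc = -1;

    assert(stat_size != NULL && lstat_size != NULL);
    unlink(link);               /* clean up leftovers from earlier runs */
    unlink(target);

    FILE *f = fopen(target, "w");
    if (f == NULL)
        return -1;
    fputs("0123456789", f);     /* 10 bytes of content */
    fclose(f);

    if (symlink(target, link) == 0 && stat(link, &st) == 0) {
        *stat_size = st.st_size;        /* size of the target file */
        if (lstat(link, &st) == 0) {
            *lstat_size = st.st_size;   /* length of the stored path */
            rc = 0;
        }
    }
    unlink(link);
    unlink(target);
    return rc;
}
```

So the quoted statement describes lstat(2). With stat(2) the caller never sees S_IFLNK for a resolvable link in the first place.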
Re: O_ANY [was: Re: 'native files', 'object fingerprints' [was: sendpath()]]
Thus spake Ingo Molnar ([EMAIL PROTECTED]):
> if you read my (radical) proposal, the identification is based on a kernel
> pointer and a 256-bit random integer. So non-negative integers are not
> needed. (file-IO system-calls would be modified to detect if 'Unix file
> descriptors' or pointers to 'native file descriptors' are passed to them,
> so this is truly radical.)

Yuck, don't pass kernel-space pointers to user space! NT does it, and look what kernel call argument verification havoc it wrought on them!

Felix
Re: 'native files', 'object fingerprints' [was: sendpath()]
Thus spake Ingo Molnar ([EMAIL PROTECTED]):
> But even user-space code could use 'native files', via the following, safe
> mechanizm: [something reminiscient of a token from a capability system]
> (this 'fingerprint' mechanizm can be used for any object, not only files.)

One good thing about tokens is that file handles can be implemented on top of them in user space. On the other hand, there already are mechanisms to pass file descriptors around and so on, so you don't gain anything tangible from your effort.

I would advise reading some textbooks about capability systems; there is a lot to be learned here. But retrofitting something like this onto an existing kernel is probably not a very good idea. Experience shows that you can't "un-bloat" a piece of software by introducing a few elegant concepts. The compatibility stuff eats most of the benefits.

Felix
Re: Is sendfile all that sexy?
Thus spake Jamie Lokier ([EMAIL PROTECTED]):
> You would need to use a new open() flag: O_ANYFD.
> The requirement comes from this like this:
> close (0);
> close (1);
> close (2);
> open ("/dev/console", O_RDWR);
> dup ();
> dup ();

So it's not actually part of POSIX, it's just to get around fixing legacy code? ;-)

Felix
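The behaviour the quoted snippet relies on is that open() and dup() hand out the lowest descriptor number not currently in use. A small single-threaded sketch of that rule follows. The helper name is made up, and /dev/null is used only because it can always be opened.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical check: close the lower of two descriptors and see whether
 * the next open() returns that same number again.
 * Returns 1 if it does, 0 if not, -1 on error. */
int reopen_reuses_lowest_fd(void)
{
    int a = open("/dev/null", O_RDONLY);
    int b = open("/dev/null", O_RDONLY);
    if (a < 0 || b < 0) {
        if (a >= 0)
            close(a);
        return -1;
    }
    assert(a < b);      /* b was allocated while a was still in use */

    close(a);           /* a is now the lowest free descriptor */
    int c = open("/dev/null", O_RDONLY);
    int same = (c == a);

    close(b);
    if (c >= 0)
        close(c);
    return same;
}
```

This is what lets the close(0); close(1); close(2); open(); dup(); dup() idiom end up with the console on descriptors 0, 1 and 2.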
Re: Is sendfile all that sexy?
Thus spake Ingo Molnar ([EMAIL PROTECTED]):
> > I don't know how Linux does it, but returning the first free file
> > descriptor can be implemented as O(1) operation.
> to put it more accurately: the requirement is to be able to open(), use
> and close() an unlimited number of file descriptors with O(1) overhead,
> under any allocation pattern, with only RAM limiting the number of files.
> Both of my proposals attempt to provide this. It's possible to open() O(1)
> but do a O(log(N)) close(), but that is of no practical value IMO.

I cheated. I was only talking about open(). close() is of course more expensive then.

Other than that: where does the requirement come from? Can't we just use a free list where we prepend closed fds and always use the first one on open()? That would even increase spatial locality and be good for the CPU caches.

Felix
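A minimal user-space sketch of the free-list idea follows. The struct and function names are hypothetical, it has a fixed capacity, and it says nothing about how the kernel actually allocates descriptors. Closed numbers are pushed onto the front of a list and allocation pops the front, so both operations are O(1). The trade-off is that the number handed out is the most recently released one, not the lowest free one.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical free list of descriptor numbers: next[i] is the number
 * that follows i on the list, -1 marks the end. */
struct fd_pool {
    int *next;
    int head;       /* first free number, or -1 if none is left */
    int capacity;
};

/* Put the numbers 0..capacity-1 on the list. Returns 0, or -1 on OOM. */
int fd_pool_init(struct fd_pool *p, int capacity)
{
    assert(p != NULL && capacity > 0);
    p->next = malloc(sizeof *p->next * (size_t)capacity);
    if (p->next == NULL)
        return -1;
    for (int i = 0; i < capacity; i++)
        p->next[i] = (i + 1 < capacity) ? i + 1 : -1;
    p->head = 0;
    p->capacity = capacity;
    return 0;
}

/* "open": pop the head of the list in O(1). Returns -1 if exhausted. */
int fd_pool_alloc(struct fd_pool *p)
{
    int fd = p->head;
    if (fd >= 0)
        p->head = p->next[fd];
    return fd;
}

/* "close": prepend the number to the list in O(1). */
void fd_pool_release(struct fd_pool *p, int fd)
{
    assert(fd >= 0 && fd < p->capacity);
    p->next[fd] = p->head;
    p->head = fd;
}

void fd_pool_destroy(struct fd_pool *p)
{
    free(p->next);
    p->next = NULL;
    p->head = -1;
}
```

Because a just-closed number is reused first, recently touched table slots tend to be reused as well, which is the cache locality argument made above.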
Re: Is sendfile all that sexy?
Thus spake Albert D. Cahalan ([EMAIL PROTECTED]):
> Rather than combining open() with sendfile(), it could be combined
> with stat(). Since the syscall would be new anyway, it could skip
> the normal requirement about returning the next free file descriptor
> in favor of returning whatever can be most quickly found.

I don't know how Linux does it, but returning the first free file descriptor can be implemented as an O(1) operation.

Felix
Re: Abysmal RAID 0 performance on 2.4.0-test10 for IDE?
Thus spake Felix von Leitner ([EMAIL PROTECTED]):
> Here is the result of my test program on the strip set:
> # rb < /dev/md/0
> 30.3 meg/sec
> #

One more detail: top says the CPU is 50% system when reading from either one of the disk or raid devices. That seems awfully high considering that the Promise controller claims to do UDMA.

Any comments?

Felix
Abysmal RAID 0 performance on 2.4.0-test10 for IDE?
Hi,

I bought 4 ATA-100 Maxtor drives and put them on a Promise Ultra100 controller to make a single striping RAID of them to increase throughput. I wrote a small test program that simply reads stdin linearly and displays the throughput. The block size is 100k.

This is the result:

# cat /etc/raidtab
raiddev /dev/md/0
raid-level 0
nr-raid-disks 4
persistent-superblock 1
chunk-size 32
device /dev/ide/host2/bus0/target0/lun0/part1
raid-disk 0
device /dev/ide/host2/bus0/target1/lun0/part1
raid-disk 2
device /dev/ide/host2/bus1/target0/lun0/part1
raid-disk 1
device /dev/ide/host2/bus1/target1/lun0/part1
raid-disk 3

Here are the results of my test program on the disk devices:

# rb < /dev/ide/host2/bus0/target0/lun0/part1
27.8 meg/sec
# rb < /dev/ide/host2/bus0/target0/lun0/part1
26.8 meg/sec

The other two disks have approximately the same numbers.

Here is the result of my test program on the stripe set:

# rb < /dev/md/0
30.3 meg/sec
#

While this is faster than linear mode, I would have expected much better performance.

These are the boot messages of the Promise adapter:

PDC20267: IDE controller on PCI bus 00 dev 60
PDC20267: chipset revision 2
PDC20267: not 100% native mode: will probe irqs later
PDC20267: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
ide2: BM-DMA at 0xec00-0xec07, BIOS settings: hde:pio, hdf:pio
ide3: BM-DMA at 0xec08-0xec0f, BIOS settings: hdg:pio, hdh:pio
ide2 at 0xdc00-0xdc07,0xe002 on irq 10
ide3 at 0xe400-0xe407,0xe802 on irq 10
hde: 160086528 sectors (81964 MB) w/2048KiB Cache, CHS=158816/16/63, UDMA(100)
hdf: 160086528 sectors (81964 MB) w/2048KiB Cache, CHS=158816/16/63, UDMA(100)
hdg: 160086528 sectors (81964 MB) w/2048KiB Cache, CHS=158816/16/63, UDMA(100)
hdh: 160086528 sectors (81964 MB) w/2048KiB Cache, CHS=158816/16/63, UDMA(100)

I tuned the devices with hdparm -c 1 -a 32 -m 16 -p -u 1, for what it's worth (did not increase throughput but appeared to lessen the CPU usage).

To verify that this is not an issue of the Promise controller, I started two instances of my test tool at the same time, one working on hde, the other on hdg (the two channels). Both yielded approximately 25 meg/sec, so it does not appear to be a hardware or driver issue.

Is the RAID code really this slow? Any ideas what I can do? I am using the user space tools from raidtools-19990421-0.90.tar.bz2, but that should not have any influence, right? I heard that there is a new, faster RAID code somewhere, but it only claimed to be faster on RAID level 5, not on striping. Any tuning advice?

By the way, I noticed another thing: one of the Maxtor hard disks was broken. It caused the whole box to freeze solid (no numlock, no console switches, no sysrq). That to me severely limits the usefulness of IDE RAID. While SCSI problems cause trouble, too, I have never seen one cause a complete freeze. How am I supposed to hot-swap the disks? I am using VESA framebuffer, so maybe there was a panic and it simply did not appear on my screen (or in the logs).

Hope to hear from you soon (the RAID is needed on Dec 27). Should I use LVM instead of the MD code?

Felix
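The source of the rb test program is not part of this post. A rough sketch of what such a tool could look like, assuming only what the message states (linear reads in 100k blocks, throughput reported in meg/sec), is shown here. The function names read_all and meg_per_sec are made up for the illustration, and the timing via clock_gettime() is a choice of this sketch, not necessarily what rb does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE (100 * 1024) /* "The block size is 100k" */

/* Read fd linearly until EOF in BLOCK_SIZE chunks.
 * Returns the number of bytes read, or -1 on error.
 * The elapsed wall-clock time is stored in *seconds. */
static long long read_all(int fd, double *seconds)
{
    struct timespec t0, t1;
    long long total = 0;
    ssize_t n;
    char *buf = malloc(BLOCK_SIZE);

    if (buf == NULL)
        return -1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, buf, BLOCK_SIZE)) > 0)
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(buf);
    if (n < 0)
        return -1;
    *seconds = (double)(t1.tv_sec - t0.tv_sec)
             + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return total;
}

/* Convert a byte count and a duration into mebibytes per second. */
static double meg_per_sec(long long bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / (1024.0 * 1024.0) / seconds;
}
```

A main() that calls read_all(0, &secs) and prints meg_per_sec(total, secs) with a "%.1f meg/sec" format would produce output in the shape quoted in the message when run as rb < /dev/md/0.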
question about pread
I am trying to implement pread for my diet libc. This is my test program:

#include <unistd.h>
main() {
  char buf[1024];
  int fd=open("/etc/passwd",0);
  pread(fd,buf,30,32);
  close(fd);
  write(1,buf,32);
}

I compiled it against diet libc and glibc and ran it on a powerpc box. t is the test program linked against diet libc, t1 is the test program linked against glibc.

Here is the result:

$ strace ./t1
execve("./t1", ["./t1"], [/* 19 vars */]) = 0
brk(0) = 0x100106a8
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x30014000
open("/etc/ld.so.preload", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/usr/local/lib", {st_mode=S_IFDIR|S_ISGID|0775, st_size=4096, ...}) = 0
open("/usr/X11R6/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
stat("/usr/X11R6/lib", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=9729, ...}) = 0
mmap(NULL, 9729, PROT_READ, MAP_PRIVATE, 3, 0) = 0x30015000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0755, st_size=992080, ...}) = 0
read(3, "\177ELF\1\2\1\0\0\0\0\0\0\0\0\0\0\3\0\24\0\0\0\1\0\2(\340"..., 4096) = 4096
mmap(0xfeea000, 1072860, PROT_READ|PROT_EXEC, MAP_PRIVATE, 3, 0) = 0xfeea000
mprotect(0xffcb000, 151260, PROT_NONE) = 0
mmap(0xffda000, 69632, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED, 3, 0xe) = 0xffda000
mmap(0xffeb000, 20188, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xffeb000
close(3) = 0
munmap(0x30015000, 9729) = 0
getpid() = 11304
open("/etc/passwd", O_RDONLY) = 3
pread(3, "daemon:x:1:1:daemon:/usr/sbin:", 30, 137438953472) = 30
close(3) = 0
write(1, "daemon:x:1:1:daemon:/usr/sbin:j ", 32daemon:x:1:1:daemon:/usr/sbin:j ) = 32
exit(32) = ?

$ strace ./t
execve("./t", ["./t"], [/* 19 vars */]) = 0
open("/etc/passwd", O_RDONLY) = 3
pread(3, "", 30, 137438953472) = 0
close(3) = 0
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 32) = 32
exit(32) = ?
$

How can this be? Both open the same file and call pread with the same arguments, yet pread returns 30 for the glibc program and 0 for the diet libc one?! Can anyone shed some light on this? What exactly is the calling convention for pread?

The diet libc pread code appears to work on x86 and sparc but not on mips and ppc. I used kernel 2.4.0-test10 on x86 and 2.2.17 on sparc and ppc, for what it's worth.

stumped,

Felix
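One observation on the traces: the offset strace prints, 137438953472, is exactly 32 * 2^32, that is, the requested offset 32 sitting in the upper 32 bits of a 64-bit value. That hints that the question is about how the two halves of the 64-bit offset are placed in the argument slots. The sketch below only illustrates this arithmetic. The helper names are made up, and it is an assumption, not something confirmed in this thread, that a misplaced half is why the diet libc call returns 0. Both traces show the same decoded value, so the strace output alone does not tell the two calls apart.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: a 32-bit syscall stub has to pass a 64-bit
 * file offset as two 32-bit words. */
static uint32_t offset_high(uint64_t off) { return (uint32_t)(off >> 32); }
static uint32_t offset_low(uint64_t off)  { return (uint32_t)(off & 0xffffffffu); }

/* Reassemble an offset from two 32-bit words, high word first. */
static uint64_t offset_join(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

/* What a reader sees if the two words are taken in the opposite order. */
static uint64_t offset_join_swapped(uint64_t off)
{
    return offset_join(offset_low(off), offset_high(off));
}

/* Self-check: offset 32 with its halves exchanged is the value
 * that appears in both strace outputs. */
static void check_strace_value(void)
{
    assert(offset_join(offset_high(32), offset_low(32)) == 32);
    assert(offset_join_swapped(32) == 137438953472ULL);
}
```

Some 32-bit ABIs pass 64-bit arguments in aligned register pairs, in which case a stub needs a padding slot before the offset. Whether that applies to the ppc and mips cases described here would have to be checked against the kernel's syscall entry code for each architecture.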
Re: Linux's implementation of poll() not scalable?
Thus spake Linus Torvalds ([EMAIL PROTECTED]):
> I disagree.
> Let's just face it, poll() is a bad interface scalability-wise.

Is that a reason to implement it badly?

Felix
Re: Proposal: driver initialization pipelining
Thus spake Andre Hedrick ([EMAIL PROTECTED]):
> > Some of the initialization can definitely be done in parallel, but there
> > are all sorts of special cases, like devices which turn off interrupts
> > during init (IDE), and other fun tricks... Some of the delays during
> > init are timing sensitive, where you don't want to have to wait for the
> > tasklet to be called for completion.
> I will be happy to break the IRQ code for a demo for Felix.
> But do backup your data first, because it will not be there when you boot
> again!

I don't get it. If you say that IDE disables interrupts during init, does that mean that it disables _all_ interrupts or just that you mask the IDE IRQs?

Actually, I was thinking more along the lines of SCSI bus scan, because the Linux IDE reset is already barely noticeable. Does "timing sensitive" mean "don't come again too early" or "be 100% punctual"?

There ought to be _some_ initializations that don't require interrupts? Registering the file systems and network protocols, stuff like that?

Felix
Re: bind() allowed to non-local addresses
Thus spake David S. Miller ([EMAIL PROTECTED]):
> I'll say it again, if you have to make changes to apps/servers the
> feature does not make any sense. It must operate transparently or
> not at all.

There once was a socket file system which solved exactly this problem in a nice and obvious way. If you wanted to allow user joe to bind to port 80, you just do "chown joe /socks/80".

Whatever happened to that neat idea? If it was under /proc, I would be happy.

Felix
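To make the ownership idea concrete, here is a user-space sketch of the check that "chown joe /socks/80" implies: a user may bind to a port if that user owns the file named after the port. The function names and the configurable root directory are assumptions of this example. The message does not say how the socket file system performed the check, and nothing here reflects actual kernel code.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical check: returns 1 if uid owns <root>/<port>, else 0.
 * root would be "/socks" in the layout described in the message. */
static int uid_may_bind(const char *root, unsigned port, uid_t uid)
{
    char path[4096];
    struct stat st;
    int n;

    assert(root != NULL);
    n = snprintf(path, sizeof path, "%s/%u", root, port);
    if (n < 0 || (size_t)n >= sizeof path)
        return 0; /* path did not fit */
    if (stat(path, &st) != 0)
        return 0; /* no entry for this port */
    return st.st_uid == uid;
}

/* Convenience wrapper for the calling process. */
static int i_may_bind(const char *root, unsigned port)
{
    return uid_may_bind(root, port, geteuid());
}
```

With this layout, granting a port is an ordinary chown and revoking it is another chown, so no application has to change.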
Proposal: driver initialization pipelining
Linux already boots fairly quickly, but there seems to be one straightforward way to speed it up a little more: pipelining.

The idea is to split the initialization of drivers into two routines. This is only useful for drivers that reset hardware and then wait a while before continuing. My thought is: during that time, other drivers could work. If we split the initialization into one "trigger the reset" routine and one "do the rest" routine, we could interleave initializations by first calling all the reset routines, then doing some static initializations and then call all the second halves of the initialization.

Particularly SCSI and IDE scans need noticeable time and could possibly be done in parallel with the USB init, right?

This is just a quick idea. If the whole concept is broken, please just say so. No need to start a monster thread about this.

Felix
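The call order proposed above can be sketched in plain C. Everything here is hypothetical: struct two_phase_driver, init_pipelined and the demo callbacks are invented for the illustration and do not correspond to an existing kernel interface. The demo callbacks only record the order in which they run.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-phase driver description. */
struct two_phase_driver {
    const char *name;
    void (*trigger_reset)(void); /* start the hardware reset, return at once */
    void (*finish_init)(void);   /* "do the rest" once the reset has settled */
};

/* First all resets, then static work, then all second halves.
 * A real finish_init would still have to wait out any remaining delay. */
static void init_pipelined(const struct two_phase_driver *drv, size_t n,
                           void (*static_init)(void))
{
    size_t i;

    for (i = 0; i < n; i++)
        if (drv[i].trigger_reset)
            drv[i].trigger_reset();
    if (static_init)
        static_init();
    for (i = 0; i < n; i++) {
        assert(drv[i].finish_init != NULL);
        drv[i].finish_init();
    }
}

/* Demo: log one character per call to make the order visible. */
static char event_log[16];
static size_t event_len;

static void log_event(char c)
{
    if (event_len < sizeof event_log - 1)
        event_log[event_len++] = c;
}

static void scsi_reset(void)   { log_event('s'); }
static void scsi_finish(void)  { log_event('S'); }
static void usb_reset(void)    { log_event('u'); }
static void usb_finish(void)   { log_event('U'); }
static void static_inits(void) { log_event('-'); }

static const struct two_phase_driver demo_drivers[] = {
    { "scsi", scsi_reset, scsi_finish },
    { "usb",  usb_reset,  usb_finish  },
};
```

Calling init_pipelined(demo_drivers, 2, static_inits) leaves "su-SU" in event_log: both resets are started before any driver begins its second half, which is where the overlap would come from.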
Need help with SPARC fork()
I need help with fork() on SPARC Linux.

I am trying to port my diet libc to SPARC Linux but can't get fork() to work. Even when I copy the fork() code from glibc verbatim, the tasks have a corrupted stack frame. I tried to strip the init code and it looks like I broke fork in the process. Does anyone have a pointer about fork() constraints that I might have failed to notice?

Felix

PS: In case anyone is interested: the Intel-only version of diet libc is at http://www.fefe.de/dietlibc/