Re: [patch v2] epoll use a single inode ...
On Sat, 10 Mar 2007, Pavel Machek wrote: > > Heh, this is what Al was saying ;) > > I'm fine with that, but how about counter cycles (going back to zero)? > > Just use u64? Yeah, the second patch was using an u64. I ended up using a "class" name (signalfd, timerfd, asyncfd) as dname entry. An incremental counter would not add any useful information, and noone will ever care about dentry name of those objects. Prolly the class name is the only thing you might want to know, or we can drop even that and use a shared dentry for everything. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Hi! > > > @@ -763,15 +767,17 @@ > > >* using the inode number. > > >*/ > > > error = -ENOMEM; > > > - sprintf(name, "[%lu]", inode->i_ino); > > > this.name = name; > > > - this.len = strlen(name); > > > - this.hash = inode->i_ino; > > > + this.len = sprintf(name, "[%p]", ep); > > > + this.hash = 0; > > > > Please don't expose kernel pointers to user space. > > > > It's much better to do something like > > > > static unsigned int epoll_inode; > > > > this.len = sprintf(name, "[%u]", ++epoll_inode); > > > > if you just need some pseudo-unique name to distinguish two epoll things > > from each other (vs from a dup'ed fd). > > Heh, this is what Al was saying ;) > I'm fine with that, but how about counter cycles (going back to zero)? Just use u64? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Hi! @@ -763,15 +767,17 @@ * using the inode number. */ error = -ENOMEM; - sprintf(name, [%lu], inode-i_ino); this.name = name; - this.len = strlen(name); - this.hash = inode-i_ino; + this.len = sprintf(name, [%p], ep); + this.hash = 0; Please don't expose kernel pointers to user space. It's much better to do something like static unsigned int epoll_inode; this.len = sprintf(name, [%u], ++epoll_inode); if you just need some pseudo-unique name to distinguish two epoll things from each other (vs from a dup'ed fd). Heh, this is what Al was saying ;) I'm fine with that, but how about counter cycles (going back to zero)? Just use u64? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Sat, 10 Mar 2007, Pavel Machek wrote: Heh, this is what Al was saying ;) I'm fine with that, but how about counter cycles (going back to zero)? Just use u64? Yeah, the second patch was using an u64. I ended up using a class name (signalfd, timerfd, asyncfd) as dname entry. An incremental counter would not add any useful information, and noone will ever care about dentry name of those objects. Prolly the class name is the only thing you might want to know, or we can drop even that and use a shared dentry for everything. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tue, Mar 06, 2007 at 21:20:33 +0100, Eric Dumazet wrote: ... > I rewrote the reciprocal_div() for i386 so that one multiply is used. > > static inline u32 reciprocal_divide(u32 A, u32 R) > { > #if __i386 > unsigned int edx, eax; > asm("mul %2":"=a" (eax), "=d" (edx):"rm" (R), "0" (A)); ^^^ mul does not work if R is memory operand. mull should be used instead. -- - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tue, Mar 06, 2007 at 21:20:33 +0100, Eric Dumazet wrote: ... I rewrote the reciprocal_div() for i386 so that one multiply is used. static inline u32 reciprocal_divide(u32 A, u32 R) { #if __i386 unsigned int edx, eax; asm(mul %2:=a (eax), =d (edx):rm (R), 0 (A)); ^^^ mul does not work if R is memory operand. mull should be used instead. -- - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Eric Dumazet wrote: Linus Torvalds a écrit : On Tue, 6 Mar 2007, Eric Dumazet wrote: I did a user space program, attached to this mail. I rewrote the reciprocal_div() for i386 so that one multiply is used. Ok, this is definitely faster on Core 2 as well, so "numbers talk, bullshit walks". No more objections. And the numbers were ? :) (That said, I bet you could do even better for octal and hex numbers, so if you *really* want to speed things up, you should just make a special-case routine for each base (there's just three of them), and you can then also optimize the base-10 thing much better (you can do two digits at a time by dividing by 100, etc) Well, given that sprintf() is frequently called only for pipe/sockets creation, we probably better : 1) wait a very clever idea to suppress individual dentry per pipe/sockets (no more sprintf() at pipe/socket setup) 2) delay the sprintf() only if needed as you mentioned in a previous mail (when someone wants ls -l /proc/pid/fd/), since their dentries are not anymore inserted in the global dcache hash, they could stay with a (nul) dname. Yes, the right thing to do is probably to only generate these strings when someone tries to list them, not on every socket/pipe/epoll creation. One can assign a counter and keep it as a binary value at the start, but create the strings when necessary. -hpa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Linus Torvalds a écrit : On Tue, 6 Mar 2007, Eric Dumazet wrote: I did a user space program, attached to this mail. I rewrote the reciprocal_div() for i386 so that one multiply is used. Ok, this is definitely faster on Core 2 as well, so "numbers talk, bullshit walks". No more objections. And the numbers were ? :) (That said, I bet you could do even better for octal and hex numbers, so if you *really* want to speed things up, you should just make a special-case routine for each base (there's just three of them), and you can then also optimize the base-10 thing much better (you can do two digits at a time by dividing by 100, etc) Well, given that sprintf() is frequently called only for pipe/sockets creation, we probably better : 1) wait a very clever idea to suppress individual dentry per pipe/sockets (no more sprintf() at pipe/socket setup) 2) delay the sprintf() only if needed as you mentioned in a previous mail (when someone wants ls -l /proc/pid/fd/), since their dentries are not anymore inserted in the global dcache hash, they could stay with a (nul) dname. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Linus Torvalds wrote: On Tue, 6 Mar 2007, Eric Dumazet wrote: I did a user space program, attached to this mail. I rewrote the reciprocal_div() for i386 so that one multiply is used. Ok, this is definitely faster on Core 2 as well, so "numbers talk, bullshit walks". No more objections. (That said, I bet you could do even better for octal and hex numbers, so if you *really* want to speed things up, you should just make a special-case routine for each base (there's just three of them), and you can then also optimize the base-10 thing much better (you can do two digits at a time by dividing by 100, etc) Of course you can do better for octal and hex -- it's just shift and mask. Decimal is trickier; however, at least on i386 it might make sense to divide by 100 and then use the AAM instruction, or a table lookup, to split it into individual digits. -hpa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tue, 6 Mar 2007, Eric Dumazet wrote: > > I did a user space program, attached to this mail. > > I rewrote the reciprocal_div() for i386 so that one multiply is used. Ok, this is definitely faster on Core 2 as well, so "numbers talk, bullshit walks". No more objections. (That said, I bet you could do even better for octal and hex numbers, so if you *really* want to speed things up, you should just make a special-case routine for each base (there's just three of them), and you can then also optimize the base-10 thing much better (you can do two digits at a time by dividing by 100, etc) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Eric Dumazet a écrit : On Tuesday 06 March 2007 18:28, Eric Dumazet wrote: On Tuesday 06 March 2007 18:19, Linus Torvalds wrote: Using reciprocal divides permits to change each divide by two multiplies, less expensive on current CPUS. Are you sure? I am going to test this, but at least on Opterons, the reciprocal divide I added into mm/slab.c gave me a nice speedup. Linus, I did a user space program, attached to this mail. I rewrote the reciprocal_div() for i386 so that one multiply is used. static inline u32 reciprocal_divide(u32 A, u32 R) { #if __i386 unsigned int edx, eax; asm("mul %2":"=a" (eax), "=d" (edx):"rm" (R), "0" (A)); return edx; #else return (u32)(((u64)A * R) >> 32); #endif } Results are really good on 32bit. On 64bit/Opteron they are impressive. $ gcc -O2 -o divide_bench divide_bench.c First result gives the number of cycles to perform old number() routine using plain do_div() Second result gives the number of cycles using reciprocal_div trick results on a Intel Pentium III 866 MHz $ ./divide_bench 413.453 cycles per call, last res=9901 132.746 cycles per call, last res=9901 $ ./divide_bench 411.833 cycles per call, last res=9901 129.652 cycles per call, last res=9901 $ ./divide_bench 480.645 cycles per call, last res=9901 158.642 cycles per call, last res=9901 $ ./divide_bench 412.769 cycles per call, last res=9901 129.643 cycles per call, last res=9901 $ ./divide_bench 410.809 cycles per call, last res=9901 129.609 cycles per call, last res=9901 results on AMD 246 (2GHz) Sorry this machine is quite loaded... I dont have a dev x86_64 machine. $ gcc -O2 -m32 -o divide_bench32 divide_bench.c $ ./divide_bench32 412.181 cycles per call, last res=9901 112.314 cycles per call, last res=9901 $ ./divide_bench32 444.008 cycles per call, last res=9901 114.314 cycles per call, last res=9901 $ ./divide_bench32 423.168 cycles per call, last res=9901 112.318 cycles per call, last res=9901 $ ./divide_bench32 427.73 cycles per call, last res=9901 110.712 cycles per call, last res=9901 $ ./divide_bench32 410.529 cycles per call, last res=9901 114.068 cycles per call, last res=9901 $ ./divide_bench32 489.856 cycles per call, last res=9901 124.889 cycles per call, last res=9901 $ ./divide_bench32 389.278 cycles per call, last res=9901 104.697 cycles per call, last res=9901 With a 64bit prog : $ gcc -O2 -m64 -o divide_bench64 divide_bench.c $ ./divide_bench64 826.136 cycles per call, last res=9901 105.912 cycles per call, last res=9901 $ ./divide_bench64 627.096 cycles per call, last res=9901 76.2473 cycles per call, last res=9901 $ ./divide_bench64 604.524 cycles per call, last res=9901 76.1405 cycles per call, last res=9901 $ ./divide_bench64 621.013 cycles per call, last res=9901 76.0963 cycles per call, last res=9901 $ ./divide_bench64 836.799 cycles per call, last res=9901 103.967 cycles per call, last res=9901 $ ./divide_bench64 982.718 cycles per call, last res=9901 127.945 cycles per call, last res=9901 $ ./divide_bench64 609.346 cycles per call, last res=9901 76.0768 cycles per call, last res=9901 #include #include #include #include #ifdef __x86_64 #define rdtscll(val) do { \ unsigned int __a,__d; \ asm volatile("rdtsc" : "=a" (__a), "=d" (__d)); \ (val) = ((unsigned long)__a) | (((unsigned long)__d)<<32); \ } while(0) # define do_div(n,base) ({ \ uint32_t __base = (base); \ uint32_t __rem; \ __rem = ((uint64_t)(n)) % __base; \ (n) = ((uint64_t)(n)) / __base; \ __rem; \ }) #elif __i386 #define rdtscll(val) \ __asm__ __volatile__("rdtsc" : "=A" (val)) #define do_div(n,base) ({ \ unsigned long __upper, __low, __high, __mod, __base; \ __base = (base); \ asm("":"=a" (__low), "=d" (__high):"A" (n)); \ __upper = __high; \ if (__high) { \ __upper = __high % (__base); \ __high = __high / (__base); \ } \ asm("divl %2":"=a" (__low), "=d" (__mod):"rm" (__base), "0" (__low), "1" (__upper)); \ asm("":"=A" (n):"a" (__low),"d" (__high)); \ __mod; \ }) #endif static const char digits[] = "0123456789abcdefghijklmnopqrstuvwxyz"; char *number_divides(unsigned long long num, int base, char *result) { if (num == 0) *result++ = '0'; else while (num) *result++ = digits[do_div(num,base)]; *result = 0; return result; } typedef unsigned int u32; typedef unsigned long long u64; #define CONSTANT_RECIPROCAL_VALUE(k) \ (u32)(((1LL << 32) + (k - 1)) / k) const u32 reciprocal_values[36 + 1] = { 0,
Re: [patch v2] epoll use a single inode ...
On Tuesday 06 March 2007 18:28, Eric Dumazet wrote: > On Tuesday 06 March 2007 18:19, Linus Torvalds wrote: > > > > Using reciprocal divides permits to change each divide by two > > > multiplies, less expensive on current CPUS. > > > > Are you sure? > > I am going to test this, but at least on Opterons, the reciprocal divide I > added into mm/slab.c gave me a nice speedup. > With attached test program (one million calls to pipe()), I got about 0.1 s of speedup on my 1.6 GHz Pentium(M) dell D610 machine (3.350 s instead of 3.450 s, on many runs) Thats about 100 ns saved per number() call But then I realized that on ia32, gcc compilers is not very smart : static inline u32 reciprocal_divide(u32 A, u32 R) { return (u32)(((u64)A * R) >> 32); } It generates two multiplies... arg... //begin of reciprocal_divide() 4b0: 8b 4c 24 28 mov0x28(%esp),%ecx 4b4: 89 f0 mov%esi,%eax 4b6: f7 64 24 24 mull 0x24(%esp) 4ba: 0f af ceimul %esi,%ecx 4bd: 8d 14 11lea(%ecx,%edx,1),%edx // end of reciprocal_divide() 4c0: 8b 8c 24 8c 00 00 00mov0x8c(%esp),%ecx 4c7: 89 d0 mov%edx,%eax 4c9: 0f af c8imul %eax,%ecx 4cc: 29 ce sub%ecx,%esi 4ce: 8b 4c 24 1c mov0x1c(%esp),%ecx 4d2: 0f b6 34 31 movzbl (%ecx,%esi,1),%esi 4d6: 89 f1 mov%esi,%ecx 4d8: 89 c6 mov%eax,%esi 4da: 88 0f mov%cl,(%edi) 4dc: 47 inc%edi 4dd: ff 44 24 20 incl 0x20(%esp) 4e1: 85 c0 test %eax,%eax 4e3: 75 cb jne4b0 So even with a total of 3 multiplies per digit, we win... Maybe some bit of x86 asm is needed to make gcc be smarter (using only one multiply for reciprocal_divide()) /* * micro benchmark to time calls to pipe()/close() */ main() { int fd[100*2]; unsigned int l, i; for (l = 0 ; l < 1 ; l++) { for (i = 0 ; i < 100*2 ; i+=2) pipe(fd + i); for (i = 0 ; i < 100*2 ; i++) close(fd[i]); } }
Re: [patch v2] epoll use a single inode ...
On Tue, 6 Mar 2007, H. Peter Anvin wrote: > Eric Dumazet wrote: > > > > For epoll, I suspect this is harmless : Programs dont allocate epolls fd > > every milli second, but at startup only. > > > > For pipes/sockets, using a 64 bits would be problematic, because sprintf() > > uses a divide for each digit. And a divide is slow. Ten divides are *very* > > slow. > > > > That's true for *any* sprintf(), though. sprintf() converts all its arguments > to 64 bits. Sigh! We *always* do do_div(), even for the more trivial %x case. Anyway, it does not really matter for the type of dfs will be applied to. - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Linus Torvalds wrote: On Tue, 6 Mar 2007, Eric Dumazet wrote: Something like : [PATCH] : Use reciprocal divides in sprintf() Try this on Core 2, and I suspect that you'll find that the hardware is actually *faster* than doing the shift/test, function call and the two multiplies. Using reciprocal divides permits to change each divide by two multiplies, less expensive on current CPUS. Are you sure? For base 8 and 16, this is shift and mask, respectively, so it's bound to be faster (although modern hardware can often optimize this, embedded hardware definitely can't.) Base 10, which even in the Linux kernel is almost certainly the most common case, is a lot iffier. -hpa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tuesday 06 March 2007 18:19, Linus Torvalds wrote: > On Tue, 6 Mar 2007, Eric Dumazet wrote: > > Something like : > > > > [PATCH] : Use reciprocal divides in sprintf() > > Try this on Core 2, and I suspect that you'll find that the hardware is > actually *faster* than doing the shift/test, function call and the > two multiplies. > Where do you see a function call ? 448: 44 89 d0mov%r10d,%eax 44b: 44 89 eamov%r13d,%edx 44e: 48 0f af c1 imul %rcx,%rax 452: 48 c1 e8 20 shr$0x20,%rax 456: 0f af d0imul %eax,%edx 459: 49 29 d2sub%rdx,%r10 45c: 43 0f b6 14 16 movzbl (%r14,%r10,1),%edx 461: 41 89 c2mov%eax,%r10d 464: 41 88 13mov%dl,(%r11) 467: 49 ff c3inc%r11 46a: 85 c0 test %eax,%eax 46c: 75 da jne448 > > Using reciprocal divides permits to change each divide by two multiplies, > > less expensive on current CPUS. > > Are you sure? I am going to test this, but at least on Opterons, the reciprocal divide I added into mm/slab.c gave me a nice speedup. I am going to bench some stupid loop : for (i = 0 ; i < 1000*1000 ; i++) { pipe(fds); close(fds[0]); close(fds[1]); } - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tue, 6 Mar 2007, Eric Dumazet wrote: > > Something like : > > [PATCH] : Use reciprocal divides in sprintf() Try this on Core 2, and I suspect that you'll find that the hardware is actually *faster* than doing the shift/test, function call and the two multiplies. > Using reciprocal divides permits to change each divide by two multiplies, > less > expensive on current CPUS. Are you sure? Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Linus Torvalds wrote: However, this could be optimized. I think right now sprintf() uses a generic divide-by-base, but a divide by 8 and 16 can of course be handled with a shift, and divide by 10 can be replaced with a multiplication by 0x1999ULL on most architectures. Nope. You need both the result of the division *and* the remainder, and you can't do that with a single multiply. Oh, right. *D'oh*. -hpa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tuesday 06 March 2007 17:28, H. Peter Anvin wrote: > Eric Dumazet wrote: > > For epoll, I suspect this is harmless : Programs dont allocate epolls fd > > every milli second, but at startup only. > > > > For pipes/sockets, using a 64 bits would be problematic, because > > sprintf() uses a divide for each digit. And a divide is slow. Ten > > divides are *very* slow. > > That's true for *any* sprintf(), though. sprintf() converts all its > arguments to 64 bits. > > However, this could be optimized. I think right now sprintf() uses a > generic divide-by-base, but a divide by 8 and 16 can of course be > handled with a shift, and divide by 10 can be replaced with a > multiplication by 0x1999ULL on most architectures. Or maybe just use reciprocal division, to keep the 35 bases available in number() Something like : [PATCH] : Use reciprocal divides in sprintf() pipe() syscalls spend a noticeable amount of time in sprintf() to format their dentry name. One name may want to print 9 or 10 digits, using one divide per digit. Using reciprocal divides permits to change each divide by two multiplies, less expensive on current CPUS. Signed-off-by: Eric Dumazet <[EMAIL PROTECTED]> diff --git a/include/linux/reciprocal_div.h b/include/linux/reciprocal_div.h index f9c90b3..55a79e0 100644 --- a/include/linux/reciprocal_div.h +++ b/include/linux/reciprocal_div.h @@ -23,7 +23,10 @@ #include * or else the performance is slower than a normal divide. */ extern u32 reciprocal_value(u32 B); - +/* + * precomputed reciprocal values of integers from 0 to 36 + */ +extern const u32 reciprocal_values[36 + 1]; static inline u32 reciprocal_divide(u32 A, u32 R) { diff --git a/lib/reciprocal_div.c b/lib/reciprocal_div.c index 6a3bd48..2dcea45 100644 --- a/lib/reciprocal_div.c +++ b/lib/reciprocal_div.c @@ -1,6 +1,31 @@ #include #include +#define CONSTANT_RECIPROCAL_VALUE(k) \ + (u32)(((1LL << 32) + (k - 1)) / k) + +const u32 reciprocal_values[36 + 1] = { + 0, + CONSTANT_RECIPROCAL_VALUE(1), CONSTANT_RECIPROCAL_VALUE(2), + CONSTANT_RECIPROCAL_VALUE(3), CONSTANT_RECIPROCAL_VALUE(4), + CONSTANT_RECIPROCAL_VALUE(5), CONSTANT_RECIPROCAL_VALUE(6), + CONSTANT_RECIPROCAL_VALUE(7), CONSTANT_RECIPROCAL_VALUE(8), + CONSTANT_RECIPROCAL_VALUE(9), CONSTANT_RECIPROCAL_VALUE(10), + CONSTANT_RECIPROCAL_VALUE(11), CONSTANT_RECIPROCAL_VALUE(12), + CONSTANT_RECIPROCAL_VALUE(13), CONSTANT_RECIPROCAL_VALUE(14), + CONSTANT_RECIPROCAL_VALUE(15), CONSTANT_RECIPROCAL_VALUE(16), + CONSTANT_RECIPROCAL_VALUE(17), CONSTANT_RECIPROCAL_VALUE(18), + CONSTANT_RECIPROCAL_VALUE(19), CONSTANT_RECIPROCAL_VALUE(20), + CONSTANT_RECIPROCAL_VALUE(21), CONSTANT_RECIPROCAL_VALUE(22), + CONSTANT_RECIPROCAL_VALUE(23), CONSTANT_RECIPROCAL_VALUE(24), + CONSTANT_RECIPROCAL_VALUE(25), CONSTANT_RECIPROCAL_VALUE(26), + CONSTANT_RECIPROCAL_VALUE(27), CONSTANT_RECIPROCAL_VALUE(28), + CONSTANT_RECIPROCAL_VALUE(29), CONSTANT_RECIPROCAL_VALUE(30), + CONSTANT_RECIPROCAL_VALUE(31), CONSTANT_RECIPROCAL_VALUE(32), + CONSTANT_RECIPROCAL_VALUE(33), CONSTANT_RECIPROCAL_VALUE(34), + CONSTANT_RECIPROCAL_VALUE(35), CONSTANT_RECIPROCAL_VALUE(36), +}; + u32 reciprocal_value(u32 k) { u64 val = (1LL << 32) + (k - 1); diff --git a/lib/vsprintf.c b/lib/vsprintf.c index b025864..9c98cc4 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -22,6 +22,7 @@ #include #include #include #include +#include #include /* for PAGE_SIZE */ #include @@ -180,8 +181,16 @@ static char * number(char * buf, char * i = 0; if (num == 0) tmp[i++]='0'; - else while (num != 0) - tmp[i++] = digits[do_div(num,base)]; + else { + while (num >> 32) + tmp[i++] = digits[do_div(num,base)]; + while (num != 0) { + u32 next = reciprocal_divide((u32)num, + reciprocal_values[base]); + tmp[i++] = digits[num - next*base]; + num = next; + } + } if (i > precision) precision = i; size -= precision;
Re: [patch v2] epoll use a single inode ...
On Tue, 6 Mar 2007, H. Peter Anvin wrote: > > That's true for *any* sprintf(), though. sprintf() converts all its arguments > to 64 bits. Well, it very much uses "do_div()", so that it can do a 64 / 32 -> (div64,mod32) divide, which is often faster than a full 64:64 divide. > However, this could be optimized. I think right now sprintf() uses a generic > divide-by-base, but a divide by 8 and 16 can of course be handled with a > shift, and divide by 10 can be replaced with a multiplication by > 0x1999ULL on most architectures. Nope. You need both the result of the division *and* the remainder, and you can't do that with a single multiply. Also, with modern hardware, divides are usually fairly cheap, with early exit logic, so that the common case of small integers is fairly cheap. Yeah, generating a full 64-bit number printout is still expensive, of course (both because you need to do many divides *and* because only the last few divides will be able to do any appreciable early exit logic. Anyway, I think a full integer divide on Core 2 is something like 22 cycles. Yes, the multiply is much fasster (at 4 cycles), but I think that 22 cycles is actually worst-case. Somebody who has a benchmark could try. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Eric Dumazet wrote: For epoll, I suspect this is harmless : Programs dont allocate epolls fd every milli second, but at startup only. For pipes/sockets, using a 64 bits would be problematic, because sprintf() uses a divide for each digit. And a divide is slow. Ten divides are *very* slow. That's true for *any* sprintf(), though. sprintf() converts all its arguments to 64 bits. However, this could be optimized. I think right now sprintf() uses a generic divide-by-base, but a divide by 8 and 16 can of course be handled with a shift, and divide by 10 can be replaced with a multiplication by 0x1999ULL on most architectures. -hpa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Eric Dumazet wrote: For epoll, I suspect this is harmless : Programs dont allocate epolls fd every milli second, but at startup only. For pipes/sockets, using a 64 bits would be problematic, because sprintf() uses a divide for each digit. And a divide is slow. Ten divides are *very* slow. That's true for *any* sprintf(), though. sprintf() converts all its arguments to 64 bits. However, this could be optimized. I think right now sprintf() uses a generic divide-by-base, but a divide by 8 and 16 can of course be handled with a shift, and divide by 10 can be replaced with a multiplication by 0x1999ULL on most architectures. -hpa - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tue, 6 Mar 2007, H. Peter Anvin wrote: That's true for *any* sprintf(), though. sprintf() converts all its arguments to 64 bits. Well, it very much uses do_div(), so that it can do a 64 / 32 - (div64,mod32) divide, which is often faster than a full 64:64 divide. However, this could be optimized. I think right now sprintf() uses a generic divide-by-base, but a divide by 8 and 16 can of course be handled with a shift, and divide by 10 can be replaced with a multiplication by 0x1999ULL on most architectures. Nope. You need both the result of the division *and* the remainder, and you can't do that with a single multiply. Also, with modern hardware, divides are usually fairly cheap, with early exit logic, so that the common case of small integers is fairly cheap. Yeah, generating a full 64-bit number printout is still expensive, of course (both because you need to do many divides *and* because only the last few divides will be able to do any appreciable early exit logic. Anyway, I think a full integer divide on Core 2 is something like 22 cycles. Yes, the multiply is much fasster (at 4 cycles), but I think that 22 cycles is actually worst-case. Somebody who has a benchmark could try. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tuesday 06 March 2007 17:28, H. Peter Anvin wrote: Eric Dumazet wrote: For epoll, I suspect this is harmless : Programs dont allocate epolls fd every milli second, but at startup only. For pipes/sockets, using a 64 bits would be problematic, because sprintf() uses a divide for each digit. And a divide is slow. Ten divides are *very* slow. That's true for *any* sprintf(), though. sprintf() converts all its arguments to 64 bits. However, this could be optimized. I think right now sprintf() uses a generic divide-by-base, but a divide by 8 and 16 can of course be handled with a shift, and divide by 10 can be replaced with a multiplication by 0x1999ULL on most architectures. Or maybe just use reciprocal division, to keep the 35 bases available in number() Something like : [PATCH] : Use reciprocal divides in sprintf() pipe() syscalls spend a noticeable amount of time in sprintf() to format their dentry name. One name may want to print 9 or 10 digits, using one divide per digit. Using reciprocal divides permits to change each divide by two multiplies, less expensive on current CPUS. Signed-off-by: Eric Dumazet [EMAIL PROTECTED] diff --git a/include/linux/reciprocal_div.h b/include/linux/reciprocal_div.h index f9c90b3..55a79e0 100644 --- a/include/linux/reciprocal_div.h +++ b/include/linux/reciprocal_div.h @@ -23,7 +23,10 @@ #include linux/types.h * or else the performance is slower than a normal divide. */ extern u32 reciprocal_value(u32 B); - +/* + * precomputed reciprocal values of integers from 0 to 36 + */ +extern const u32 reciprocal_values[36 + 1]; static inline u32 reciprocal_divide(u32 A, u32 R) { diff --git a/lib/reciprocal_div.c b/lib/reciprocal_div.c index 6a3bd48..2dcea45 100644 --- a/lib/reciprocal_div.c +++ b/lib/reciprocal_div.c @@ -1,6 +1,31 @@ #include asm/div64.h #include linux/reciprocal_div.h +#define CONSTANT_RECIPROCAL_VALUE(k) \ + (u32)(((1LL 32) + (k - 1)) / k) + +const u32 reciprocal_values[36 + 1] = { + 0, + CONSTANT_RECIPROCAL_VALUE(1), CONSTANT_RECIPROCAL_VALUE(2), + CONSTANT_RECIPROCAL_VALUE(3), CONSTANT_RECIPROCAL_VALUE(4), + CONSTANT_RECIPROCAL_VALUE(5), CONSTANT_RECIPROCAL_VALUE(6), + CONSTANT_RECIPROCAL_VALUE(7), CONSTANT_RECIPROCAL_VALUE(8), + CONSTANT_RECIPROCAL_VALUE(9), CONSTANT_RECIPROCAL_VALUE(10), + CONSTANT_RECIPROCAL_VALUE(11), CONSTANT_RECIPROCAL_VALUE(12), + CONSTANT_RECIPROCAL_VALUE(13), CONSTANT_RECIPROCAL_VALUE(14), + CONSTANT_RECIPROCAL_VALUE(15), CONSTANT_RECIPROCAL_VALUE(16), + CONSTANT_RECIPROCAL_VALUE(17), CONSTANT_RECIPROCAL_VALUE(18), + CONSTANT_RECIPROCAL_VALUE(19), CONSTANT_RECIPROCAL_VALUE(20), + CONSTANT_RECIPROCAL_VALUE(21), CONSTANT_RECIPROCAL_VALUE(22), + CONSTANT_RECIPROCAL_VALUE(23), CONSTANT_RECIPROCAL_VALUE(24), + CONSTANT_RECIPROCAL_VALUE(25), CONSTANT_RECIPROCAL_VALUE(26), + CONSTANT_RECIPROCAL_VALUE(27), CONSTANT_RECIPROCAL_VALUE(28), + CONSTANT_RECIPROCAL_VALUE(29), CONSTANT_RECIPROCAL_VALUE(30), + CONSTANT_RECIPROCAL_VALUE(31), CONSTANT_RECIPROCAL_VALUE(32), + CONSTANT_RECIPROCAL_VALUE(33), CONSTANT_RECIPROCAL_VALUE(34), + CONSTANT_RECIPROCAL_VALUE(35), CONSTANT_RECIPROCAL_VALUE(36), +}; + u32 reciprocal_value(u32 k) { u64 val = (1LL 32) + (k - 1); diff --git a/lib/vsprintf.c b/lib/vsprintf.c index b025864..9c98cc4 100644 --- a/lib/vsprintf.c +++ b/lib/vsprintf.c @@ -22,6 +22,7 @@ #include linux/types.h #include linux/string.h #include linux/ctype.h #include linux/kernel.h +#include linux/reciprocal_div.h #include asm/page.h /* for PAGE_SIZE */ #include asm/div64.h @@ -180,8 +181,16 @@ static char * number(char * buf, char * i = 0; if (num == 0) tmp[i++]='0'; - else while (num != 0) - tmp[i++] = digits[do_div(num,base)]; + else { + while (num 32) + tmp[i++] = digits[do_div(num,base)]; + while (num != 0) { + u32 next = reciprocal_divide((u32)num, + reciprocal_values[base]); + tmp[i++] = digits[num - next*base]; + num = next; + } + } if (i precision) precision = i; size -= precision;
Re: [patch v2] epoll use a single inode ...
Linus Torvalds wrote: However, this could be optimized. I think right now sprintf() uses a generic divide-by-base, but a divide by 8 and 16 can of course be handled with a shift, and divide by 10 can be replaced with a multiplication by 0x1999ULL on most architectures. Nope. You need both the result of the division *and* the remainder, and you can't do that with a single multiply. Oh, right. *D'oh*. -hpa - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tue, 6 Mar 2007, Eric Dumazet wrote: Something like : [PATCH] : Use reciprocal divides in sprintf() Try this on Core 2, and I suspect that you'll find that the hardware is actually *faster* than doing the shift/test, function call and the two multiplies. Using reciprocal divides permits to change each divide by two multiplies, less expensive on current CPUS. Are you sure? Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Linus Torvalds wrote: On Tue, 6 Mar 2007, Eric Dumazet wrote: Something like : [PATCH] : Use reciprocal divides in sprintf() Try this on Core 2, and I suspect that you'll find that the hardware is actually *faster* than doing the shift/test, function call and the two multiplies. Using reciprocal divides permits to change each divide by two multiplies, less expensive on current CPUS. Are you sure? For base 8 and 16, this is shift and mask, respectively, so it's bound to be faster (although modern hardware can often optimize this, embedded hardware definitely can't.) Base 10, which even in the Linux kernel is almost certainly the most common case, is a lot iffier. -hpa - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tuesday 06 March 2007 18:19, Linus Torvalds wrote: On Tue, 6 Mar 2007, Eric Dumazet wrote: Something like : [PATCH] : Use reciprocal divides in sprintf() Try this on Core 2, and I suspect that you'll find that the hardware is actually *faster* than doing the shift/test, function call and the two multiplies. Where do you see a function call ? 448: 44 89 d0mov%r10d,%eax 44b: 44 89 eamov%r13d,%edx 44e: 48 0f af c1 imul %rcx,%rax 452: 48 c1 e8 20 shr$0x20,%rax 456: 0f af d0imul %eax,%edx 459: 49 29 d2sub%rdx,%r10 45c: 43 0f b6 14 16 movzbl (%r14,%r10,1),%edx 461: 41 89 c2mov%eax,%r10d 464: 41 88 13mov%dl,(%r11) 467: 49 ff c3inc%r11 46a: 85 c0 test %eax,%eax 46c: 75 da jne448 number+0x138 Using reciprocal divides permits to change each divide by two multiplies, less expensive on current CPUS. Are you sure? I am going to test this, but at least on Opterons, the reciprocal divide I added into mm/slab.c gave me a nice speedup. I am going to bench some stupid loop : for (i = 0 ; i 1000*1000 ; i++) { pipe(fds); close(fds[0]); close(fds[1]); } - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tue, 6 Mar 2007, H. Peter Anvin wrote: Eric Dumazet wrote: For epoll, I suspect this is harmless : Programs dont allocate epolls fd every milli second, but at startup only. For pipes/sockets, using a 64 bits would be problematic, because sprintf() uses a divide for each digit. And a divide is slow. Ten divides are *very* slow. That's true for *any* sprintf(), though. sprintf() converts all its arguments to 64 bits. Sigh! We *always* do do_div(), even for the more trivial %x case. Anyway, it does not really matter for the type of dfs will be applied to. - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tuesday 06 March 2007 18:28, Eric Dumazet wrote: On Tuesday 06 March 2007 18:19, Linus Torvalds wrote: Using reciprocal divides permits to change each divide by two multiplies, less expensive on current CPUS. Are you sure? I am going to test this, but at least on Opterons, the reciprocal divide I added into mm/slab.c gave me a nice speedup. With attached test program (one million calls to pipe()), I got about 0.1 s of speedup on my 1.6 GHz Pentium(M) dell D610 machine (3.350 s instead of 3.450 s, on many runs) Thats about 100 ns saved per number() call But then I realized that on ia32, gcc compilers is not very smart : static inline u32 reciprocal_divide(u32 A, u32 R) { return (u32)(((u64)A * R) 32); } It generates two multiplies... arg... //begin of reciprocal_divide() 4b0: 8b 4c 24 28 mov0x28(%esp),%ecx 4b4: 89 f0 mov%esi,%eax 4b6: f7 64 24 24 mull 0x24(%esp) 4ba: 0f af ceimul %esi,%ecx 4bd: 8d 14 11lea(%ecx,%edx,1),%edx // end of reciprocal_divide() 4c0: 8b 8c 24 8c 00 00 00mov0x8c(%esp),%ecx 4c7: 89 d0 mov%edx,%eax 4c9: 0f af c8imul %eax,%ecx 4cc: 29 ce sub%ecx,%esi 4ce: 8b 4c 24 1c mov0x1c(%esp),%ecx 4d2: 0f b6 34 31 movzbl (%ecx,%esi,1),%esi 4d6: 89 f1 mov%esi,%ecx 4d8: 89 c6 mov%eax,%esi 4da: 88 0f mov%cl,(%edi) 4dc: 47 inc%edi 4dd: ff 44 24 20 incl 0x20(%esp) 4e1: 85 c0 test %eax,%eax 4e3: 75 cb jne4b0 number+0x160 So even with a total of 3 multiplies per digit, we win... Maybe some bit of x86 asm is needed to make gcc be smarter (using only one multiply for reciprocal_divide()) /* * micro benchmark to time calls to pipe()/close() */ main() { int fd[100*2]; unsigned int l, i; for (l = 0 ; l 1 ; l++) { for (i = 0 ; i 100*2 ; i+=2) pipe(fd + i); for (i = 0 ; i 100*2 ; i++) close(fd[i]); } }
Re: [patch v2] epoll use a single inode ...
Eric Dumazet a écrit : On Tuesday 06 March 2007 18:28, Eric Dumazet wrote: On Tuesday 06 March 2007 18:19, Linus Torvalds wrote: Using reciprocal divides permits to change each divide by two multiplies, less expensive on current CPUS. Are you sure? I am going to test this, but at least on Opterons, the reciprocal divide I added into mm/slab.c gave me a nice speedup. Linus, I did a user space program, attached to this mail. I rewrote the reciprocal_div() for i386 so that one multiply is used. static inline u32 reciprocal_divide(u32 A, u32 R) { #if __i386 unsigned int edx, eax; asm(mul %2:=a (eax), =d (edx):rm (R), 0 (A)); return edx; #else return (u32)(((u64)A * R) 32); #endif } Results are really good on 32bit. On 64bit/Opteron they are impressive. $ gcc -O2 -o divide_bench divide_bench.c First result gives the number of cycles to perform old number() routine using plain do_div() Second result gives the number of cycles using reciprocal_div trick results on a Intel Pentium III 866 MHz $ ./divide_bench 413.453 cycles per call, last res=9901 132.746 cycles per call, last res=9901 $ ./divide_bench 411.833 cycles per call, last res=9901 129.652 cycles per call, last res=9901 $ ./divide_bench 480.645 cycles per call, last res=9901 158.642 cycles per call, last res=9901 $ ./divide_bench 412.769 cycles per call, last res=9901 129.643 cycles per call, last res=9901 $ ./divide_bench 410.809 cycles per call, last res=9901 129.609 cycles per call, last res=9901 results on AMD 246 (2GHz) Sorry this machine is quite loaded... I dont have a dev x86_64 machine. $ gcc -O2 -m32 -o divide_bench32 divide_bench.c $ ./divide_bench32 412.181 cycles per call, last res=9901 112.314 cycles per call, last res=9901 $ ./divide_bench32 444.008 cycles per call, last res=9901 114.314 cycles per call, last res=9901 $ ./divide_bench32 423.168 cycles per call, last res=9901 112.318 cycles per call, last res=9901 $ ./divide_bench32 427.73 cycles per call, last res=9901 110.712 cycles per call, last res=9901 $ ./divide_bench32 410.529 cycles per call, last res=9901 114.068 cycles per call, last res=9901 $ ./divide_bench32 489.856 cycles per call, last res=9901 124.889 cycles per call, last res=9901 $ ./divide_bench32 389.278 cycles per call, last res=9901 104.697 cycles per call, last res=9901 With a 64bit prog : $ gcc -O2 -m64 -o divide_bench64 divide_bench.c $ ./divide_bench64 826.136 cycles per call, last res=9901 105.912 cycles per call, last res=9901 $ ./divide_bench64 627.096 cycles per call, last res=9901 76.2473 cycles per call, last res=9901 $ ./divide_bench64 604.524 cycles per call, last res=9901 76.1405 cycles per call, last res=9901 $ ./divide_bench64 621.013 cycles per call, last res=9901 76.0963 cycles per call, last res=9901 $ ./divide_bench64 836.799 cycles per call, last res=9901 103.967 cycles per call, last res=9901 $ ./divide_bench64 982.718 cycles per call, last res=9901 127.945 cycles per call, last res=9901 $ ./divide_bench64 609.346 cycles per call, last res=9901 76.0768 cycles per call, last res=9901 #include stdio.h #include stdlib.h #include stdint.h #include math.h #ifdef __x86_64 #define rdtscll(val) do { \ unsigned int __a,__d; \ asm volatile(rdtsc : =a (__a), =d (__d)); \ (val) = ((unsigned long)__a) | (((unsigned long)__d)32); \ } while(0) # define do_div(n,base) ({ \ uint32_t __base = (base); \ uint32_t __rem; \ __rem = ((uint64_t)(n)) % __base; \ (n) = ((uint64_t)(n)) / __base; \ __rem; \ }) #elif __i386 #define rdtscll(val) \ __asm__ __volatile__(rdtsc : =A (val)) #define do_div(n,base) ({ \ unsigned long __upper, __low, __high, __mod, __base; \ __base = (base); \ asm(:=a (__low), =d (__high):A (n)); \ __upper = __high; \ if (__high) { \ __upper = __high % (__base); \ __high = __high / (__base); \ } \ asm(divl %2:=a (__low), =d (__mod):rm (__base), 0 (__low), 1 (__upper)); \ asm(:=A (n):a (__low),d (__high)); \ __mod; \ }) #endif static const char digits[] = 0123456789abcdefghijklmnopqrstuvwxyz; char *number_divides(unsigned long long num, int base, char *result) { if (num == 0) *result++ = '0'; else while (num) *result++ = digits[do_div(num,base)]; *result = 0; return result; } typedef unsigned int u32; typedef unsigned long long u64; #define CONSTANT_RECIPROCAL_VALUE(k) \ (u32)(((1LL 32) + (k - 1)) / k) const u32 reciprocal_values[36 + 1] = { 0, CONSTANT_RECIPROCAL_VALUE(1),
Re: [patch v2] epoll use a single inode ...
On Tue, 6 Mar 2007, Eric Dumazet wrote: I did a user space program, attached to this mail. I rewrote the reciprocal_div() for i386 so that one multiply is used. Ok, this is definitely faster on Core 2 as well, so numbers talk, bullshit walks. No more objections. (That said, I bet you could do even better for octal and hex numbers, so if you *really* want to speed things up, you should just make a special-case routine for each base (there's just three of them), and you can then also optimize the base-10 thing much better (you can do two digits at a time by dividing by 100, etc) Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Linus Torvalds wrote: On Tue, 6 Mar 2007, Eric Dumazet wrote: I did a user space program, attached to this mail. I rewrote the reciprocal_div() for i386 so that one multiply is used. Ok, this is definitely faster on Core 2 as well, so numbers talk, bullshit walks. No more objections. (That said, I bet you could do even better for octal and hex numbers, so if you *really* want to speed things up, you should just make a special-case routine for each base (there's just three of them), and you can then also optimize the base-10 thing much better (you can do two digits at a time by dividing by 100, etc) Of course you can do better for octal and hex -- it's just shift and mask. Decimal is trickier; however, at least on i386 it might make sense to divide by 100 and then use the AAM instruction, or a table lookup, to split it into individual digits. -hpa - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Linus Torvalds a écrit : On Tue, 6 Mar 2007, Eric Dumazet wrote: I did a user space program, attached to this mail. I rewrote the reciprocal_div() for i386 so that one multiply is used. Ok, this is definitely faster on Core 2 as well, so numbers talk, bullshit walks. No more objections. And the numbers were ? :) (That said, I bet you could do even better for octal and hex numbers, so if you *really* want to speed things up, you should just make a special-case routine for each base (there's just three of them), and you can then also optimize the base-10 thing much better (you can do two digits at a time by dividing by 100, etc) Well, given that sprintf() is frequently called only for pipe/sockets creation, we probably better : 1) wait a very clever idea to suppress individual dentry per pipe/sockets (no more sprintf() at pipe/socket setup) 2) delay the sprintf() only if needed as you mentioned in a previous mail (when someone wants ls -l /proc/pid/fd/), since their dentries are not anymore inserted in the global dcache hash, they could stay with a (nul) dname. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Eric Dumazet wrote: Linus Torvalds a écrit : On Tue, 6 Mar 2007, Eric Dumazet wrote: I did a user space program, attached to this mail. I rewrote the reciprocal_div() for i386 so that one multiply is used. Ok, this is definitely faster on Core 2 as well, so numbers talk, bullshit walks. No more objections. And the numbers were ? :) (That said, I bet you could do even better for octal and hex numbers, so if you *really* want to speed things up, you should just make a special-case routine for each base (there's just three of them), and you can then also optimize the base-10 thing much better (you can do two digits at a time by dividing by 100, etc) Well, given that sprintf() is frequently called only for pipe/sockets creation, we probably better : 1) wait a very clever idea to suppress individual dentry per pipe/sockets (no more sprintf() at pipe/socket setup) 2) delay the sprintf() only if needed as you mentioned in a previous mail (when someone wants ls -l /proc/pid/fd/), since their dentries are not anymore inserted in the global dcache hash, they could stay with a (nul) dname. Yes, the right thing to do is probably to only generate these strings when someone tries to list them, not on every socket/pipe/epoll creation. One can assign a counter and keep it as a binary value at the start, but create the strings when necessary. -hpa - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, Davide Libenzi wrote: > This would be used for epoll, signalfd and timerfd. And the printf format > is %llu ;) Today is really one on those days ... %llx - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tue, 6 Mar 2007, Eric Dumazet wrote: > Davide Libenzi a écrit : > > On Mon, 5 Mar 2007, H. Peter Anvin wrote: > > > > > Davide Libenzi wrote: > > > > Right now is using: > > > > > > > > this.len = sprintf(name, "[%u.%d]", current->pid, fd); > > > > > > > > That should be unique and not have the wraparound problem. Ack? > > > > > > > NAK, very much NAK. > > > > > > File descriptors aren't file structures, they're *pointers* to file > > > structures. > > > > > > It's perfectly possible -- downright common -- for a file descriptor to be > > > inherited by another process, and then the pid is recycled -- collision. > > > > Ugh! Right! 64 bit counter it is ... :) > > For epoll, I suspect this is harmless : Programs dont allocate epolls fd every > milli second, but at startup only. > > For pipes/sockets, using a 64 bits would be problematic, because sprintf() > uses a divide for each digit. And a divide is slow. Ten divides are *very* > slow. This would be used for epoll, signalfd and timerfd. And the printf format is %llu ;) - Davide
Re: [patch v2] epoll use a single inode ...
Davide Libenzi a écrit : On Mon, 5 Mar 2007, H. Peter Anvin wrote: Davide Libenzi wrote: Right now is using: this.len = sprintf(name, "[%u.%d]", current->pid, fd); That should be unique and not have the wraparound problem. Ack? NAK, very much NAK. File descriptors aren't file structures, they're *pointers* to file structures. It's perfectly possible -- downright common -- for a file descriptor to be inherited by another process, and then the pid is recycled -- collision. Ugh! Right! 64 bit counter it is ... :) For epoll, I suspect this is harmless : Programs dont allocate epolls fd every milli second, but at startup only. For pipes/sockets, using a 64 bits would be problematic, because sprintf() uses a divide for each digit. And a divide is slow. Ten divides are *very* slow. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, H. Peter Anvin wrote: > Davide Libenzi wrote: > > > > Right now is using: > > > > this.len = sprintf(name, "[%u.%d]", current->pid, fd); > > > > That should be unique and not have the wraparound problem. Ack? > > > > NAK, very much NAK. > > File descriptors aren't file structures, they're *pointers* to file > structures. > > It's perfectly possible -- downright common -- for a file descriptor to be > inherited by another process, and then the pid is recycled -- collision. Ugh! Right! 64 bit counter it is ... :) - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Davide Libenzi wrote: Right now is using: this.len = sprintf(name, "[%u.%d]", current->pid, fd); That should be unique and not have the wraparound problem. Ack? NAK, very much NAK. File descriptors aren't file structures, they're *pointers* to file structures. It's perfectly possible -- downright common -- for a file descriptor to be inherited by another process, and then the pid is recycled -- collision. -hpa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, H. Peter Anvin wrote: > Linus Torvalds wrote: > > > > Since this is not actually *used* for anything but showing the fd's in > > /proc//fd/ etc, no. In fact, an integer will wrap a *lot* less than a > > kernel data structure will be re-used, so even with the simple "wraps every > > 4G uses", you're still better off. > > > > ... and if that worries you, use a 64-bit counter. They're cheap (compared to > an sprintf), and even if they advance once a nanosecond they won't wrap around > for over 584 years. Right now is using: this.len = sprintf(name, "[%u.%d]", current->pid, fd); That should be unique and not have the wraparound problem. Ack? - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Linus Torvalds wrote: Since this is not actually *used* for anything but showing the fd's in /proc//fd/ etc, no. In fact, an integer will wrap a *lot* less than a kernel data structure will be re-used, so even with the simple "wraps every 4G uses", you're still better off. ... and if that worries you, use a 64-bit counter. They're cheap (compared to an sprintf), and even if they advance once a nanosecond they won't wrap around for over 584 years. IOW, if the thing actually _mattered_ we should use some bitmap allocator or similar (eg pidmaps etc), but with something where the only reason really is as a visible indicator of difference for a user that happens to look, simple-and-stupid is better. That only makes sense if it matters that the numbers are kept small. This is in fact a highly suboptimal constraint, as it is almost guaranteed to create wraparounds much sooner, but there are situations in which that's the only option, e.g. for pty number allocations due to glibc braindamage. -hpa - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, Davide Libenzi wrote: > On Mon, 5 Mar 2007, Linus Torvalds wrote: > > > > It's much better to do something like > > > > static unsigned int epoll_inode; > > > > this.len = sprintf(name, "[%u]", ++epoll_inode); > > > > if you just need some pseudo-unique name to distinguish two epoll things > > from each other (vs from a dup'ed fd). > > Heh, this is what Al was saying ;) > I'm fine with that, but how about counter cycles (going back to zero)? > Should we care to handle them correctly? Since this is not actually *used* for anything but showing the fd's in /proc//fd/ etc, no. In fact, an integer will wrap a *lot* less than a kernel data structure will be re-used, so even with the simple "wraps every 4G uses", you're still better off. IOW, if the thing actually _mattered_ we should use some bitmap allocator or similar (eg pidmaps etc), but with something where the only reason really is as a visible indicator of difference for a user that happens to look, simple-and-stupid is better. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, Davide Libenzi wrote: > On Mon, 5 Mar 2007, Linus Torvalds wrote: > > > On Mon, 5 Mar 2007, Davide Libenzi wrote: > > > @@ -763,15 +767,17 @@ > > >* using the inode number. > > >*/ > > > error = -ENOMEM; > > > - sprintf(name, "[%lu]", inode->i_ino); > > > this.name = name; > > > - this.len = strlen(name); > > > - this.hash = inode->i_ino; > > > + this.len = sprintf(name, "[%p]", ep); > > > + this.hash = 0; > > > > Please don't expose kernel pointers to user space. > > > > It's much better to do something like > > > > static unsigned int epoll_inode; > > > > this.len = sprintf(name, "[%u]", ++epoll_inode); > > > > if you just need some pseudo-unique name to distinguish two epoll things > > from each other (vs from a dup'ed fd). > > Heh, this is what Al was saying ;) > I'm fine with that, but how about counter cycles (going back to zero)? > Should we care to handle them correctly? Maybe a [TID.FD] format? - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, Linus Torvalds wrote: > On Mon, 5 Mar 2007, Davide Libenzi wrote: > > @@ -763,15 +767,17 @@ > > * using the inode number. > > */ > > error = -ENOMEM; > > - sprintf(name, "[%lu]", inode->i_ino); > > this.name = name; > > - this.len = strlen(name); > > - this.hash = inode->i_ino; > > + this.len = sprintf(name, "[%p]", ep); > > + this.hash = 0; > > Please don't expose kernel pointers to user space. > > It's much better to do something like > > static unsigned int epoll_inode; > > this.len = sprintf(name, "[%u]", ++epoll_inode); > > if you just need some pseudo-unique name to distinguish two epoll things > from each other (vs from a dup'ed fd). Heh, this is what Al was saying ;) I'm fine with that, but how about counter cycles (going back to zero)? Should we care to handle them correctly? - Davide - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, Davide Libenzi wrote: > @@ -763,15 +767,17 @@ >* using the inode number. >*/ > error = -ENOMEM; > - sprintf(name, "[%lu]", inode->i_ino); > this.name = name; > - this.len = strlen(name); > - this.hash = inode->i_ino; > + this.len = sprintf(name, "[%p]", ep); > + this.hash = 0; Please don't expose kernel pointers to user space. It's much better to do something like static unsigned int epoll_inode; this.len = sprintf(name, "[%u]", ++epoll_inode); if you just need some pseudo-unique name to distinguish two epoll things from each other (vs from a dup'ed fd). Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch v2] epoll use a single inode ...
ChangeLog: - v2) Properly size the buffer. Avoid extra strlen() (Eric Dumazet) Epoll does not keep any private data attached to its inode, so there'd be no need to allocate one inode per fd. For epoll, the inode is just a placeholder for the file operations and could be shared by all instances. I'd like to use the same optimization even for the upcoming file-based objects, so if you see problems let me know. One that Al was pointing out was that an fstat(2) over an epoll fd would show the same st_ino. IMO that should be fine since an fstat(2) over an epoll fd is not something you want to do in any case and expecting meaningfull results. This patch also avoids the dentry associated with the file* to be hashed inside the global dentry hash. Signed-off-by: Davide Libenzi - Davide eventpoll.c | 39 +-- 1 file changed, 33 insertions(+), 6 deletions(-) Index: linux-2.6.20.ep2/fs/eventpoll.c === --- linux-2.6.20.ep2.orig/fs/eventpoll.c2007-03-04 14:40:01.0 -0800 +++ linux-2.6.20.ep2/fs/eventpoll.c 2007-03-05 15:26:16.0 -0800 @@ -258,6 +258,7 @@ int maxevents, long timeout); static int eventpollfs_delete_dentry(struct dentry *dentry); static struct inode *ep_eventpoll_inode(void); +static struct inode *ep_create_inode(void); static int eventpollfs_get_sb(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, struct vfsmount *mnt); @@ -279,6 +280,9 @@ /* Virtual fs used to allocate inodes for eventpoll files */ static struct vfsmount *eventpoll_mnt __read_mostly; +/* Placeholder inode for eventpoll fds */ +static struct inode *eventpoll_inode; + /* File callbacks that implement the eventpoll file behaviour */ static const struct file_operations eventpoll_fops = { .release= ep_eventpoll_close, @@ -733,7 +737,7 @@ struct eventpoll *ep) { struct qstr this; - char name[32]; + char name[2 * sizeof(void *) + 8]; struct dentry *dentry; struct inode *inode; struct file *file; @@ -763,15 +767,17 @@ * using the inode number. */ error = -ENOMEM; - sprintf(name, "[%lu]", inode->i_ino); this.name = name; - this.len = strlen(name); - this.hash = inode->i_ino; + this.len = sprintf(name, "[%p]", ep); + this.hash = 0; dentry = d_alloc(eventpoll_mnt->mnt_sb->s_root, ); if (!dentry) goto eexit_4; dentry->d_op = _dentry_operations; - d_add(dentry, inode); + /* Do not publish this dentry inside the global dentry hash table */ + dentry->d_flags &= ~DCACHE_UNHASHED; + d_instantiate(dentry, inode); + file->f_path.mnt = mntget(eventpoll_mnt); file->f_path.dentry = dentry; file->f_mapping = inode->i_mapping; @@ -1555,6 +1561,11 @@ static int eventpollfs_delete_dentry(struct dentry *dentry) { + /* +* We faked vfs to believe the dentry was hashed when we created it. +* Now we restore the flag so that dput() will work correctly. +*/ + dentry->d_flags |= DCACHE_UNHASHED; return 1; } @@ -1562,6 +1573,17 @@ static struct inode *ep_eventpoll_inode(void) { + + return igrab(eventpoll_inode); +} + +/* + * A single inode exist for all eventpoll files. On the contrary of pipes, + * eventpoll inodes has no per-instance data associated, so we can avoid + * the allocation of multiple of them. + */ +static struct inode *ep_create_inode(void) +{ int error = -ENOMEM; struct inode *inode = new_inode(eventpoll_mnt->mnt_sb); @@ -1626,10 +1648,14 @@ /* Mount the above commented virtual file system */ eventpoll_mnt = kern_mount(_fs_type); - error = PTR_ERR(eventpoll_mnt); if (IS_ERR(eventpoll_mnt)) goto epanic; + /* Create the single instance of inode for all eventpoll fds */ + eventpoll_inode = ep_create_inode(); + if (IS_ERR(eventpoll_inode)) + goto epanic; + DNPRINTK(3, (KERN_INFO "[%p] eventpoll: successfully initialized.\n", current)); return 0; @@ -1642,6 +1668,7 @@ static void __exit eventpoll_exit(void) { /* Undo all operations done inside eventpoll_init() */ + iput(eventpoll_inode); unregister_filesystem(_fs_type); mntput(eventpoll_mnt); kmem_cache_destroy(pwq_cache); - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch v2] epoll use a single inode ...
ChangeLog: - v2) Properly size the buffer. Avoid extra strlen() (Eric Dumazet) Epoll does not keep any private data attached to its inode, so there'd be no need to allocate one inode per fd. For epoll, the inode is just a placeholder for the file operations and could be shared by all instances. I'd like to use the same optimization even for the upcoming file-based objects, so if you see problems let me know. One that Al was pointing out was that an fstat(2) over an epoll fd would show the same st_ino. IMO that should be fine since an fstat(2) over an epoll fd is not something you want to do in any case and expecting meaningfull results. This patch also avoids the dentry associated with the file* to be hashed inside the global dentry hash. Signed-off-by: Davide Libenzi davidel@xmailserver.org - Davide eventpoll.c | 39 +-- 1 file changed, 33 insertions(+), 6 deletions(-) Index: linux-2.6.20.ep2/fs/eventpoll.c === --- linux-2.6.20.ep2.orig/fs/eventpoll.c2007-03-04 14:40:01.0 -0800 +++ linux-2.6.20.ep2/fs/eventpoll.c 2007-03-05 15:26:16.0 -0800 @@ -258,6 +258,7 @@ int maxevents, long timeout); static int eventpollfs_delete_dentry(struct dentry *dentry); static struct inode *ep_eventpoll_inode(void); +static struct inode *ep_create_inode(void); static int eventpollfs_get_sb(struct file_system_type *fs_type, int flags, const char *dev_name, void *data, struct vfsmount *mnt); @@ -279,6 +280,9 @@ /* Virtual fs used to allocate inodes for eventpoll files */ static struct vfsmount *eventpoll_mnt __read_mostly; +/* Placeholder inode for eventpoll fds */ +static struct inode *eventpoll_inode; + /* File callbacks that implement the eventpoll file behaviour */ static const struct file_operations eventpoll_fops = { .release= ep_eventpoll_close, @@ -733,7 +737,7 @@ struct eventpoll *ep) { struct qstr this; - char name[32]; + char name[2 * sizeof(void *) + 8]; struct dentry *dentry; struct inode *inode; struct file *file; @@ -763,15 +767,17 @@ * using the inode number. */ error = -ENOMEM; - sprintf(name, [%lu], inode-i_ino); this.name = name; - this.len = strlen(name); - this.hash = inode-i_ino; + this.len = sprintf(name, [%p], ep); + this.hash = 0; dentry = d_alloc(eventpoll_mnt-mnt_sb-s_root, this); if (!dentry) goto eexit_4; dentry-d_op = eventpollfs_dentry_operations; - d_add(dentry, inode); + /* Do not publish this dentry inside the global dentry hash table */ + dentry-d_flags = ~DCACHE_UNHASHED; + d_instantiate(dentry, inode); + file-f_path.mnt = mntget(eventpoll_mnt); file-f_path.dentry = dentry; file-f_mapping = inode-i_mapping; @@ -1555,6 +1561,11 @@ static int eventpollfs_delete_dentry(struct dentry *dentry) { + /* +* We faked vfs to believe the dentry was hashed when we created it. +* Now we restore the flag so that dput() will work correctly. +*/ + dentry-d_flags |= DCACHE_UNHASHED; return 1; } @@ -1562,6 +1573,17 @@ static struct inode *ep_eventpoll_inode(void) { + + return igrab(eventpoll_inode); +} + +/* + * A single inode exist for all eventpoll files. On the contrary of pipes, + * eventpoll inodes has no per-instance data associated, so we can avoid + * the allocation of multiple of them. + */ +static struct inode *ep_create_inode(void) +{ int error = -ENOMEM; struct inode *inode = new_inode(eventpoll_mnt-mnt_sb); @@ -1626,10 +1648,14 @@ /* Mount the above commented virtual file system */ eventpoll_mnt = kern_mount(eventpoll_fs_type); - error = PTR_ERR(eventpoll_mnt); if (IS_ERR(eventpoll_mnt)) goto epanic; + /* Create the single instance of inode for all eventpoll fds */ + eventpoll_inode = ep_create_inode(); + if (IS_ERR(eventpoll_inode)) + goto epanic; + DNPRINTK(3, (KERN_INFO [%p] eventpoll: successfully initialized.\n, current)); return 0; @@ -1642,6 +1668,7 @@ static void __exit eventpoll_exit(void) { /* Undo all operations done inside eventpoll_init() */ + iput(eventpoll_inode); unregister_filesystem(eventpoll_fs_type); mntput(eventpoll_mnt); kmem_cache_destroy(pwq_cache); - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, Davide Libenzi wrote: @@ -763,15 +767,17 @@ * using the inode number. */ error = -ENOMEM; - sprintf(name, [%lu], inode-i_ino); this.name = name; - this.len = strlen(name); - this.hash = inode-i_ino; + this.len = sprintf(name, [%p], ep); + this.hash = 0; Please don't expose kernel pointers to user space. It's much better to do something like static unsigned int epoll_inode; this.len = sprintf(name, [%u], ++epoll_inode); if you just need some pseudo-unique name to distinguish two epoll things from each other (vs from a dup'ed fd). Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, Linus Torvalds wrote: On Mon, 5 Mar 2007, Davide Libenzi wrote: @@ -763,15 +767,17 @@ * using the inode number. */ error = -ENOMEM; - sprintf(name, [%lu], inode-i_ino); this.name = name; - this.len = strlen(name); - this.hash = inode-i_ino; + this.len = sprintf(name, [%p], ep); + this.hash = 0; Please don't expose kernel pointers to user space. It's much better to do something like static unsigned int epoll_inode; this.len = sprintf(name, [%u], ++epoll_inode); if you just need some pseudo-unique name to distinguish two epoll things from each other (vs from a dup'ed fd). Heh, this is what Al was saying ;) I'm fine with that, but how about counter cycles (going back to zero)? Should we care to handle them correctly? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, Davide Libenzi wrote: On Mon, 5 Mar 2007, Linus Torvalds wrote: On Mon, 5 Mar 2007, Davide Libenzi wrote: @@ -763,15 +767,17 @@ * using the inode number. */ error = -ENOMEM; - sprintf(name, [%lu], inode-i_ino); this.name = name; - this.len = strlen(name); - this.hash = inode-i_ino; + this.len = sprintf(name, [%p], ep); + this.hash = 0; Please don't expose kernel pointers to user space. It's much better to do something like static unsigned int epoll_inode; this.len = sprintf(name, [%u], ++epoll_inode); if you just need some pseudo-unique name to distinguish two epoll things from each other (vs from a dup'ed fd). Heh, this is what Al was saying ;) I'm fine with that, but how about counter cycles (going back to zero)? Should we care to handle them correctly? Maybe a [TID.FD] format? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, Davide Libenzi wrote: On Mon, 5 Mar 2007, Linus Torvalds wrote: It's much better to do something like static unsigned int epoll_inode; this.len = sprintf(name, [%u], ++epoll_inode); if you just need some pseudo-unique name to distinguish two epoll things from each other (vs from a dup'ed fd). Heh, this is what Al was saying ;) I'm fine with that, but how about counter cycles (going back to zero)? Should we care to handle them correctly? Since this is not actually *used* for anything but showing the fd's in /proc/pid/fd/ etc, no. In fact, an integer will wrap a *lot* less than a kernel data structure will be re-used, so even with the simple wraps every 4G uses, you're still better off. IOW, if the thing actually _mattered_ we should use some bitmap allocator or similar (eg pidmaps etc), but with something where the only reason really is as a visible indicator of difference for a user that happens to look, simple-and-stupid is better. Linus - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Linus Torvalds wrote: Since this is not actually *used* for anything but showing the fd's in /proc/pid/fd/ etc, no. In fact, an integer will wrap a *lot* less than a kernel data structure will be re-used, so even with the simple wraps every 4G uses, you're still better off. ... and if that worries you, use a 64-bit counter. They're cheap (compared to an sprintf), and even if they advance once a nanosecond they won't wrap around for over 584 years. IOW, if the thing actually _mattered_ we should use some bitmap allocator or similar (eg pidmaps etc), but with something where the only reason really is as a visible indicator of difference for a user that happens to look, simple-and-stupid is better. That only makes sense if it matters that the numbers are kept small. This is in fact a highly suboptimal constraint, as it is almost guaranteed to create wraparounds much sooner, but there are situations in which that's the only option, e.g. for pty number allocations due to glibc braindamage. -hpa - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, H. Peter Anvin wrote: Linus Torvalds wrote: Since this is not actually *used* for anything but showing the fd's in /proc/pid/fd/ etc, no. In fact, an integer will wrap a *lot* less than a kernel data structure will be re-used, so even with the simple wraps every 4G uses, you're still better off. ... and if that worries you, use a 64-bit counter. They're cheap (compared to an sprintf), and even if they advance once a nanosecond they won't wrap around for over 584 years. Right now is using: this.len = sprintf(name, [%u.%d], current-pid, fd); That should be unique and not have the wraparound problem. Ack? - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Davide Libenzi wrote: Right now is using: this.len = sprintf(name, [%u.%d], current-pid, fd); That should be unique and not have the wraparound problem. Ack? NAK, very much NAK. File descriptors aren't file structures, they're *pointers* to file structures. It's perfectly possible -- downright common -- for a file descriptor to be inherited by another process, and then the pid is recycled -- collision. -hpa - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, H. Peter Anvin wrote: Davide Libenzi wrote: Right now is using: this.len = sprintf(name, [%u.%d], current-pid, fd); That should be unique and not have the wraparound problem. Ack? NAK, very much NAK. File descriptors aren't file structures, they're *pointers* to file structures. It's perfectly possible -- downright common -- for a file descriptor to be inherited by another process, and then the pid is recycled -- collision. Ugh! Right! 64 bit counter it is ... :) - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
Davide Libenzi a écrit : On Mon, 5 Mar 2007, H. Peter Anvin wrote: Davide Libenzi wrote: Right now is using: this.len = sprintf(name, [%u.%d], current-pid, fd); That should be unique and not have the wraparound problem. Ack? NAK, very much NAK. File descriptors aren't file structures, they're *pointers* to file structures. It's perfectly possible -- downright common -- for a file descriptor to be inherited by another process, and then the pid is recycled -- collision. Ugh! Right! 64 bit counter it is ... :) For epoll, I suspect this is harmless : Programs dont allocate epolls fd every milli second, but at startup only. For pipes/sockets, using a 64 bits would be problematic, because sprintf() uses a divide for each digit. And a divide is slow. Ten divides are *very* slow. - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch v2] epoll use a single inode ...
On Tue, 6 Mar 2007, Eric Dumazet wrote: Davide Libenzi a écrit : On Mon, 5 Mar 2007, H. Peter Anvin wrote: Davide Libenzi wrote: Right now is using: this.len = sprintf(name, [%u.%d], current-pid, fd); That should be unique and not have the wraparound problem. Ack? NAK, very much NAK. File descriptors aren't file structures, they're *pointers* to file structures. It's perfectly possible -- downright common -- for a file descriptor to be inherited by another process, and then the pid is recycled -- collision. Ugh! Right! 64 bit counter it is ... :) For epoll, I suspect this is harmless : Programs dont allocate epolls fd every milli second, but at startup only. For pipes/sockets, using a 64 bits would be problematic, because sprintf() uses a divide for each digit. And a divide is slow. Ten divides are *very* slow. This would be used for epoll, signalfd and timerfd. And the printf format is %llu ;) - Davide
Re: [patch v2] epoll use a single inode ...
On Mon, 5 Mar 2007, Davide Libenzi wrote: This would be used for epoll, signalfd and timerfd. And the printf format is %llu ;) Today is really one on those days ... %llx - Davide - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/