Re: Syslets, Threadlets, generic AIO support, v6
* Eric Dumazet <[EMAIL PROTECTED]> wrote:

> I tried your bench and found two problems :
> - You scan half of the bitmap [...]
> Try to close not a 'middle fd', but a really low one (10 for example),
> and latency is doubled.

that was intentional. I really didn't want to fabricate a worst-case result but something more representative: in real apps the bitmap isn't fully filled all the time and most of the find-bit sequences are short. Hence the two fds, one of which goes from the middle of the range.

> - You incorrectly divide best_delta and worst_delta by LOOPS (5)

ah, indeed, that's a bug - victim of a last-minute edit :) Since the dividend is constant it doesn't really matter to the validity of the relative nature of the slowdown (which is what i was interested in), but you are right - i have fixed the download and have redone the numbers. Here are the correct results from my box:

 # ./fd-scale-bench 1000000 0
 checking the cache-hot performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 6.00 us, worst cost: 8.00 us
 num_fds: 2, best cost: 6.00 us, worst cost: 7.00 us
 ...
 num_fds: 31586, best cost: 7.00 us, worst cost: 8.00 us
 num_fds: 39483, best cost: 8.00 us, worst cost: 8.00 us
 num_fds: 49354, best cost: 7.00 us, worst cost: 9.00 us
 num_fds: 61693, best cost: 8.00 us, worst cost: 10.00 us
 num_fds: 77117, best cost: 8.00 us, worst cost: 13.00 us
 num_fds: 96397, best cost: 9.00 us, worst cost: 11.00 us
 num_fds: 120497, best cost: 10.00 us, worst cost: 14.00 us
 num_fds: 150622, best cost: 11.00 us, worst cost: 13.00 us
 num_fds: 188278, best cost: 12.00 us, worst cost: 15.00 us
 num_fds: 235348, best cost: 14.00 us, worst cost: 20.00 us
 num_fds: 294186, best cost: 16.00 us, worst cost: 22.00 us
 num_fds: 367733, best cost: 19.00 us, worst cost: 35.00 us
 num_fds: 459667, best cost: 22.00 us, worst cost: 37.00 us
 num_fds: 574584, best cost: 26.00 us, worst cost: 40.00 us
 num_fds: 718231, best cost: 31.00 us, worst cost: 62.00 us
 num_fds: 897789, best cost: 37.00 us, worst cost: 54.00 us
 num_fds: 1000000, best cost: 41.00 us, worst cost: 59.00 us

and cache-cold:

 # ./fd-scale-bench 1000000 1
 checking the cache-cold performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 24.00 us, worst cost: 32.00 us
 ...
 num_fds: 49354, best cost: 26.00 us, worst cost: 28.00 us
 num_fds: 61693, best cost: 25.00 us, worst cost: 30.00 us
 num_fds: 77117, best cost: 27.00 us, worst cost: 30.00 us
 num_fds: 96397, best cost: 27.00 us, worst cost: 31.00 us
 num_fds: 120497, best cost: 31.00 us, worst cost: 43.00 us
 num_fds: 150622, best cost: 31.00 us, worst cost: 34.00 us
 num_fds: 188278, best cost: 33.00 us, worst cost: 36.00 us
 num_fds: 235348, best cost: 35.00 us, worst cost: 42.00 us
 num_fds: 294186, best cost: 36.00 us, worst cost: 41.00 us
 num_fds: 367733, best cost: 40.00 us, worst cost: 43.00 us
 num_fds: 459667, best cost: 44.00 us, worst cost: 46.00 us
 num_fds: 574584, best cost: 48.00 us, worst cost: 65.00 us
 num_fds: 718231, best cost: 54.00 us, worst cost: 59.00 us
 num_fds: 897789, best cost: 60.00 us, worst cost: 62.00 us
 num_fds: 1000000, best cost: 65.00 us, worst cost: 68.00 us

> with a corrected bench; cache-cold numbers are > 100 us on this Intel
> Pentium-M
>
> num_fds: 1000000, best cost: 120.00 us, worst cost: 131.00 us
>
> On an Opteron x86_64 machine, results are better :)
>
> num_fds: 1000000, best cost: 28.00 us, worst cost: 106.00 us

yeah. I quoted the full range because i was really more interested in our current 'limit' range (which is somewhere between 50K and 100K open fds), where the scanning cost becomes directly measurable, and in the nature of the slowdown.

	Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Re: Syslets, Threadlets, generic AIO support, v6
On Thu, 31 May 2007 11:02:52 +0200 Ingo Molnar <[EMAIL PROTECTED]> wrote:

> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> > it's both a flexibility and a speedup thing as well:
> >
> > flexibility: for libraries to be able to open files and keep them open
> > comes up regularly. For example currently glibc is quite wasteful in a
> > number of common networking related functions (Ulrich, please correct
> > me if i'm wrong), which could be optimized if glibc could just keep a
> > netlink channel fd open and could poll() it for changes and cache the
> > results if there are no changes (or something like that).
> >
> > speedup: i suggested O_ANY 6 years ago as a speedup to Apache -
> > non-linear fds are cheaper to allocate/map:
> >
> >   http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html
> >
> > (i definitely remember having written code for that too, but i cannot
> > find that in the archives. hm.) In theory we could avoid _all_
> > fd-bitmap overhead as well and use a per-process list/pool of struct
> > file buffers plus a maximum-fd field as the 'non-linear fd allocator'
> > (at the price of only deallocating them at process exit time).
>
> to measure this i've written fd-scale-bench.c:
>
>   http://redhat.com/~mingo/fd-scale-patches/fd-scale-bench.c
>
> which tests the (cache-hot or cache-cold) cost of open()-ing two fds
> while there are N other fds already open: one is from the 'middle' of
> the range, one is from the end of it.
>
> Let's check our current 'extreme high end' performance with 1 million
> fds (which is not realistic right now, but there certainly are systems
> with over a hundred thousand open fds). Results from a fast CPU with
> 2MB of cache:
>
> cache-hot:
>
>  # ./fd-scale-bench 1000000 0
>  checking the cache-hot performance of open()-ing 1000000 fds.
>  num_fds: 1, best cost: 1.40 us, worst cost: 2.00 us
>  num_fds: 2, best cost: 1.40 us, worst cost: 1.40 us
>  num_fds: 3, best cost: 1.40 us, worst cost: 2.00 us
>  num_fds: 4, best cost: 1.40 us, worst cost: 1.40 us
>  ...
>  num_fds: 77117, best cost: 1.60 us, worst cost: 2.00 us
>  num_fds: 96397, best cost: 2.00 us, worst cost: 2.20 us
>  num_fds: 120497, best cost: 2.20 us, worst cost: 2.40 us
>  num_fds: 150622, best cost: 2.20 us, worst cost: 3.00 us
>  num_fds: 188278, best cost: 2.60 us, worst cost: 3.00 us
>  num_fds: 235348, best cost: 2.80 us, worst cost: 3.80 us
>  num_fds: 294186, best cost: 3.40 us, worst cost: 4.20 us
>  num_fds: 367733, best cost: 4.00 us, worst cost: 5.00 us
>  num_fds: 459667, best cost: 4.60 us, worst cost: 6.00 us
>  num_fds: 574584, best cost: 5.60 us, worst cost: 8.20 us
>  num_fds: 718231, best cost: 6.40 us, worst cost: 10.00 us
>  num_fds: 897789, best cost: 7.60 us, worst cost: 11.80 us
>  num_fds: 1000000, best cost: 8.20 us, worst cost: 9.60 us
>
> cache-cold:
>
>  # ./fd-scale-bench 1000000 1
>  checking the performance of open()-ing 1000000 fds.
>  num_fds: 1, best cost: 4.60 us, worst cost: 7.00 us
>  num_fds: 2, best cost: 5.00 us, worst cost: 6.60 us
>  ...
>  num_fds: 77117, best cost: 5.60 us, worst cost: 7.40 us
>  num_fds: 96397, best cost: 5.60 us, worst cost: 7.40 us
>  num_fds: 120497, best cost: 6.20 us, worst cost: 6.80 us
>  num_fds: 150622, best cost: 6.40 us, worst cost: 7.60 us
>  num_fds: 188278, best cost: 6.80 us, worst cost: 9.20 us
>  num_fds: 235348, best cost: 7.20 us, worst cost: 8.80 us
>  num_fds: 294186, best cost: 8.00 us, worst cost: 9.40 us
>  num_fds: 367733, best cost: 8.80 us, worst cost: 11.60 us
>  num_fds: 459667, best cost: 9.20 us, worst cost: 12.20 us
>  num_fds: 574584, best cost: 10.00 us, worst cost: 12.40 us
>  num_fds: 718231, best cost: 11.00 us, worst cost: 13.40 us
>  num_fds: 897789, best cost: 12.80 us, worst cost: 15.80 us
>  num_fds: 1000000, best cost: 13.60 us, worst cost: 15.40 us
>
> we are pretty good at the moment: the open() cost starts to increase at
> around 100K open fds, both in the cache-cold and the cache-hot case.
> (that roughly corresponds to the fd bitmap falling out of the 32K L1
> cache.) At 1 million open fds in a single process, the fd bitmap has a
> size of 128K.
>
> so while it's certainly not 'urgent' to improve this, private fds are an
> easier target for optimizations in this area, because they don't have
> the continuity requirement anymore, so the fd bitmap is not a 'forced'
> property of them.

Your numbers do not match mine (mine were more than two years old, so I redid a test before replying).

I tried your bench and found two problems :

- You scan half of the bitmap
- You incorrectly divide best_delta and worst_delta by LOOPS (5)

Try to close not a 'middle fd', but a really low one (10 for example), and latency is doubled.

With a corrected bench, cache-cold numbers are > 100 us on this Intel Pentium-M:

num_fds: 1000000, best cost: 120.00 us, worst cost: 131.00 us

On an Opteron x86_64 machine, results are better :)

num_fds: 1000000, best cost: 28.00 us, worst cost: 106.00 us
Re: Syslets, Threadlets, generic AIO support, v6
* Albert Cahalan <[EMAIL PROTECTED]> wrote:

> Ingo Molnar writes:
>
> > looking over the list of our new generic APIs (see further below) i
> > think there are three important things that are needed for an API to
> > become widely used:
> >
> >  1) it should solve a real problem (ha ;-), it should be intuitive to
> >     humans and it should fit into existing things naturally.
> >
> >  2) it should be ubiquitous. (if it's about IO it should cover block IO,
> >     network IO, timers, signals and everything) Even if it might look
> >     silly in some of the cases, having complete, utter, no compromises,
> >     100% coverage for everything massively helps the uptake of an API,
> >     because it allows the user-space coder to pick just one paradigm
> >     that is closest to his application and stick to it and only to it.
> >
> >  3) it should be end-to-end supported by glibc.
>
> 4) At least slightly portable.
>
> Anything supported by any similar OS is already ahead, even if it
> isn't the perfect API of our dreams. [...]

it might have been so a few years ago, but it's changing slowly but surely - BSD is becoming more and more irrelevant. What matters mostly to app writers these days is "is it in most Linux distros?" - and the key to that is upstream kernel support and glibc support. The days of BSD (and UNIX) are pretty much numbered. (I'm not against standardizing APIs in POSIX of course - the BSDs tend to follow the Linux APIs in that area with a few years' lag.)

	Ingo
Re: Syslets, Threadlets, generic AIO support, v6
On Thu, May 31 2007, Ingo Molnar wrote:

> * Ingo Molnar <[EMAIL PROTECTED]> wrote:
>
> > (i definitely remember having written code for that too, but i cannot
> > find that in the archives. hm.) In theory we could avoid _all_
> > fd-bitmap overhead as well and use a per-process list/pool of struct
> > file buffers plus a maximum-fd field as the 'non-linear fd allocator'
> > (at the price of only deallocating them at process exit time).
>
> btw., this also allows mostly-lockless fd allocation, which would
> probably benefit threaded apps too. (we can just recycle it from a
> per-CPU list of cached fds for that process)

See also:

  http://lkml.org/lkml/2006/6/16/144

which originates from a much simpler patch I did to fix performance regressions in this area for the SLES10 kernel.

-- 
Jens Axboe
Re: Syslets, Threadlets, generic AIO support, v6
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> (i definitely remember having written code for that too, but i cannot
> find that in the archives. hm.) In theory we could avoid _all_
> fd-bitmap overhead as well and use a per-process list/pool of struct
> file buffers plus a maximum-fd field as the 'non-linear fd allocator'
> (at the price of only deallocating them at process exit time).

btw., this also allows mostly-lockless fd allocation, which would probably benefit threaded apps too. (we can just recycle it from a per-CPU list of cached fds for that process)

	Ingo
Re: Syslets, Threadlets, generic AIO support, v6
* Eric Dumazet <[EMAIL PROTECTED]> wrote:

> > speedup: i suggested O_ANY 6 years ago as a speedup to Apache -
> > non-linear fds are cheaper to allocate/map:
> >
> >   http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html
> >
> > (i definitely remember having written code for that too, but i
> > cannot find that in the archives. hm.) In theory we could avoid
> > _all_ fd-bitmap overhead as well and use a per-process list/pool of
> > struct file buffers plus a maximum-fd field as the 'non-linear fd
> > allocator' (at the price of only deallocating them at process exit
> > time).
>
> Only very few apps need to open more than 100.000 files.

yes. I did not list it as a primary reason for private fds, it's just a nice side-effect. As long as the other apps are not hurt, i see no problem in improving the >100K open files case.

> As these files are likely sockets, O_ANY is not a solution.

why not? It would be a natural thing to extend sys_socket() with a 'flags' parameter and pass in O_ANY (along with any other possible fd parameter like O_NDELAY, which could be inherited over connect()).

> A trick is to try to keep the first 64 handles freed, so that the
> kernel won't consume too much cpu time and cache in get_unused_fd()
>
>   http://lkml.org/lkml/2005/9/15/307

this is basically a user-space front-end cache to fd allocation - which duplicates data needlessly. I don't see any problem with doing this in the kernel. (Also, obviously 'first 64 handles' could easily break with certain types of apps, so glibc cannot do this.)

	Ingo
Re: Syslets, Threadlets, generic AIO support, v6
* Ingo Molnar <[EMAIL PROTECTED]> wrote:

> it's both a flexibility and a speedup thing as well:
>
> flexibility: for libraries to be able to open files and keep them open
> comes up regularly. For example currently glibc is quite wasteful in a
> number of common networking related functions (Ulrich, please correct
> me if i'm wrong), which could be optimized if glibc could just keep a
> netlink channel fd open and could poll() it for changes and cache the
> results if there are no changes (or something like that).
>
> speedup: i suggested O_ANY 6 years ago as a speedup to Apache -
> non-linear fds are cheaper to allocate/map:
>
>   http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html
>
> (i definitely remember having written code for that too, but i cannot
> find that in the archives. hm.) In theory we could avoid _all_
> fd-bitmap overhead as well and use a per-process list/pool of struct
> file buffers plus a maximum-fd field as the 'non-linear fd allocator'
> (at the price of only deallocating them at process exit time).

to measure this i've written fd-scale-bench.c:

  http://redhat.com/~mingo/fd-scale-patches/fd-scale-bench.c

which tests the (cache-hot or cache-cold) cost of open()-ing two fds while there are N other fds already open: one is from the 'middle' of the range, one is from the end of it.

Let's check our current 'extreme high end' performance with 1 million fds (which is not realistic right now, but there certainly are systems with over a hundred thousand open fds). Results from a fast CPU with 2MB of cache:

cache-hot:

 # ./fd-scale-bench 1000000 0
 checking the cache-hot performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 1.40 us, worst cost: 2.00 us
 num_fds: 2, best cost: 1.40 us, worst cost: 1.40 us
 num_fds: 3, best cost: 1.40 us, worst cost: 2.00 us
 num_fds: 4, best cost: 1.40 us, worst cost: 1.40 us
 ...
 num_fds: 77117, best cost: 1.60 us, worst cost: 2.00 us
 num_fds: 96397, best cost: 2.00 us, worst cost: 2.20 us
 num_fds: 120497, best cost: 2.20 us, worst cost: 2.40 us
 num_fds: 150622, best cost: 2.20 us, worst cost: 3.00 us
 num_fds: 188278, best cost: 2.60 us, worst cost: 3.00 us
 num_fds: 235348, best cost: 2.80 us, worst cost: 3.80 us
 num_fds: 294186, best cost: 3.40 us, worst cost: 4.20 us
 num_fds: 367733, best cost: 4.00 us, worst cost: 5.00 us
 num_fds: 459667, best cost: 4.60 us, worst cost: 6.00 us
 num_fds: 574584, best cost: 5.60 us, worst cost: 8.20 us
 num_fds: 718231, best cost: 6.40 us, worst cost: 10.00 us
 num_fds: 897789, best cost: 7.60 us, worst cost: 11.80 us
 num_fds: 1000000, best cost: 8.20 us, worst cost: 9.60 us

cache-cold:

 # ./fd-scale-bench 1000000 1
 checking the performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 4.60 us, worst cost: 7.00 us
 num_fds: 2, best cost: 5.00 us, worst cost: 6.60 us
 ...
 num_fds: 77117, best cost: 5.60 us, worst cost: 7.40 us
 num_fds: 96397, best cost: 5.60 us, worst cost: 7.40 us
 num_fds: 120497, best cost: 6.20 us, worst cost: 6.80 us
 num_fds: 150622, best cost: 6.40 us, worst cost: 7.60 us
 num_fds: 188278, best cost: 6.80 us, worst cost: 9.20 us
 num_fds: 235348, best cost: 7.20 us, worst cost: 8.80 us
 num_fds: 294186, best cost: 8.00 us, worst cost: 9.40 us
 num_fds: 367733, best cost: 8.80 us, worst cost: 11.60 us
 num_fds: 459667, best cost: 9.20 us, worst cost: 12.20 us
 num_fds: 574584, best cost: 10.00 us, worst cost: 12.40 us
 num_fds: 718231, best cost: 11.00 us, worst cost: 13.40 us
 num_fds: 897789, best cost: 12.80 us, worst cost: 15.80 us
 num_fds: 1000000, best cost: 13.60 us, worst cost: 15.40 us

we are pretty good at the moment: the open() cost starts to increase at around 100K open fds, both in the cache-cold and the cache-hot case. (that roughly corresponds to the fd bitmap falling out of the 32K L1 cache.) At 1 million open fds in a single process, the fd bitmap has a size of 128K.

so while it's certainly not 'urgent' to improve this, private fds are an easier target for optimizations in this area, because they don't have the continuity requirement anymore, so the fd bitmap is not a 'forced' property of them.

	Ingo
Re: Syslets, Threadlets, generic AIO support, v6
Ingo Molnar writes:

> looking over the list of our new generic APIs (see further below) i
> think there are three important things that are needed for an API to
> become widely used:
>
>  1) it should solve a real problem (ha ;-), it should be intuitive to
>     humans and it should fit into existing things naturally.
>
>  2) it should be ubiquitous. (if it's about IO it should cover block IO,
>     network IO, timers, signals and everything) Even if it might look
>     silly in some of the cases, having complete, utter, no compromises,
>     100% coverage for everything massively helps the uptake of an API,
>     because it allows the user-space coder to pick just one paradigm
>     that is closest to his application and stick to it and only to it.
>
>  3) it should be end-to-end supported by glibc.

4) At least slightly portable.

Anything supported by any similar OS is already ahead, even if it isn't the perfect API of our dreams. This means kqueue and doors. If it's not on any BSD or UNIX, then most app developers won't touch it. Worse yet, it won't appear in programming books, so even the Linux-only app programmers won't know about it.

Running ideas by the FreeBSD and OpenSolaris developers wouldn't be a bad idea. Agreement leads to standardization, which leads to interfaces getting used.

BTW, wrapper libraries that bury the new API under a layer of gunk are not helpful. One might as well just use the old API.
Re: Syslets, Threadlets, generic AIO support, v6
On Thu, 31 May 2007 08:13:03 +0200 Ingo Molnar <[EMAIL PROTECTED]> wrote:

> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
> > > I agree. What would be a good interface to allocate fds in such
> > > area? We don't want to replicate syscalls, so maybe a special new
> > > dup function?
> >
> > I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or
> > similar, and just have NONLINEAR_FD be some magic value (for example,
> > make it be 0x40000000 - the bit that says "private, nonlinear" in the
> > first place).
> >
> > But what's gotten lost in the current discussion is that we probably
> > don't actually _need_ such a private space. I'm just saying that if
> > the *choice* is between memory-mapped interfaces and a private
> > fd-space, we should probably go for the latter. "Everything is a file"
> > is the UNIX way, after all. But there's little reason to introduce
> > private fd's otherwise.
>
> it's both a flexibility and a speedup thing as well:
>
> flexibility: for libraries to be able to open files and keep them open
> comes up regularly. For example currently glibc is quite wasteful in a
> number of common networking related functions (Ulrich, please correct me
> if i'm wrong), which could be optimized if glibc could just keep a
> netlink channel fd open and could poll() it for changes and cache the
> results if there are no changes (or something like that).
>
> speedup: i suggested O_ANY 6 years ago as a speedup to Apache -
> non-linear fds are cheaper to allocate/map:
>
>   http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html
>
> (i definitely remember having written code for that too, but i cannot
> find that in the archives. hm.) In theory we could avoid _all_ fd-bitmap
> overhead as well and use a per-process list/pool of struct file buffers
> plus a maximum-fd field as the 'non-linear fd allocator' (at the price
> of only deallocating them at process exit time).

Only very few apps need to open more than 100.000 files.

As these files are likely sockets, O_ANY is not a solution.

A trick is to try to keep the first 64 handles freed, so that the kernel won't consume too much cpu time and cache in get_unused_fd():

  http://lkml.org/lkml/2005/9/15/307

This trick is portable (not Linux-centric).
Re: Syslets, Threadlets, generic AIO support, v6
* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> > I agree. What would be a good interface to allocate fds in such
> > area? We don't want to replicate syscalls, so maybe a special new
> > dup function?
>
> I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or
> similar, and just have NONLINEAR_FD be some magic value (for example,
> make it be 0x40000000 - the bit that says "private, nonlinear" in the
> first place).
>
> But what's gotten lost in the current discussion is that we probably
> don't actually _need_ such a private space. I'm just saying that if
> the *choice* is between memory-mapped interfaces and a private
> fd-space, we should probably go for the latter. "Everything is a file"
> is the UNIX way, after all. But there's little reason to introduce
> private fd's otherwise.

it's both a flexibility and a speedup thing as well:

flexibility: for libraries to be able to open files and keep them open comes up regularly. For example currently glibc is quite wasteful in a number of common networking related functions (Ulrich, please correct me if i'm wrong), which could be optimized if glibc could just keep a netlink channel fd open and could poll() it for changes and cache the results if there are no changes (or something like that).

speedup: i suggested O_ANY 6 years ago as a speedup to Apache - non-linear fds are cheaper to allocate/map:

  http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html

(i definitely remember having written code for that too, but i cannot find that in the archives. hm.) In theory we could avoid _all_ fd-bitmap overhead as well and use a per-process list/pool of struct file buffers plus a maximum-fd field as the 'non-linear fd allocator' (at the price of only deallocating them at process exit time).
	Ingo
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, May 30, 2007 at 01:00:30PM -0700, Linus Torvalds wrote:
>> Which *could* be something as simple as saying "bit 30 in the file
>> descriptor specifies a separate fd space" along with some flags to make
>> open and friends return those separate fd's. That makes them useless for
>> "select()" (which assumes a flat address space, of course), but would be
>> useful for just about anything else.

On Wed, May 30, 2007 at 05:27:15PM -0500, Matt Mackall wrote:
> Or.. we could have a method of swizzling in and out an entire FD
> array, similar to UML's trick for swizzling MMs.

I like that notion even better than randomization. I think it should happen. I like SKAS, too, of course.

-- 
wli
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, May 30, 2007 at 01:00:30PM -0700, Linus Torvalds wrote:
> Which *could* be something as simple as saying "bit 30 in the file
> descriptor specifies a separate fd space" along with some flags to make
> open and friends return those separate fd's. That makes them useless for
> "select()" (which assumes a flat address space, of course), but would be
> useful for just about anything else.

Or.. we could have a method of swizzling in and out an entire FD array, similar to UML's trick for swizzling MMs.

-- 
Mathematics is the supreme nostalgia of our time.
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, May 30, 2007 at 02:27:52PM -0700, Linus Torvalds wrote:
> Well, don't think of it as a special case at all: think of bit 30 as a
> "the user asked for a non-linear fd".
> In fact, to make it effective, I'd suggest literally scrambling the low
> bits (using, for example, some silly per-boot xor value to actually
> generate the "true" index - the equivalent of a really stupid randomizer).
> That way you'd have the legacy "linear" space, and a separate "non-linear
> space" where people simply *cannot* make assumptions about contiguous fd
> allocations. There's no special case there - it's just an extension which
> explicitly allows us to say "if you do that, your fd's won't be allocated
> the traditional way any more, but you *can* mix the traditional and the
> non-linear allocation".

One could always stuff a seed or per-cpu seeds in the files_struct and use a PRNG. The only trick would be cacheline bounces and/or space consumption of seeds.

Another possibility would be bit-reversed contiguity or otherwise a bit permutation of some contiguous range, modulo (of course) the high bit used to tag the randomized range.

With "truly" random/sparse fd numbers it may be meaningful to use a different data structure from a bitmap to track them in-kernel, though xor and other easily-computed mappings to/from contiguous ranges won't need such in earnest.

-- 
wli
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007 14:27:52 -0700 (PDT) Linus Torvalds <[EMAIL PROTECTED]> wrote:

> Well, don't think of it as a special case at all: think of bit 30 as
> a "the user asked for a non-linear fd".

If the sole point is to protect an fd from being closed or operated on outside of a certain context, why not just provide the ability to "protect" an fd to prevent its use. Maybe a pair of syscalls like "fdprotect" and "fdunprotect" that take an fd and an integer key. Protected fds would return EBADF or something if accessed. The same integer key must be provided to fdunprotect in order to gain access to it again.

Then glibc or valgrind or whatever would just unprotect the fd before operating on it.

- DML
Re: Syslets, Threadlets, generic AIO support, v6
Davide Libenzi wrote:

> On Wed, 30 May 2007, Linus Torvalds wrote:
>
> > > And then the semantics: should these descriptors show up in
> > > /proc/self/fd? Are there separate directories for each namespace?
> > > Do they count against the rlimit?
> >
> > Oh, absolutely. They'd be real fd's in every way. People could use
> > them 100% equivalently (and concurrently) with the traditional ones.
> > The whole, and the _only_ point, would be that it breaks the legacy
> > guarantees of a dense fd space. Most apps don't actually *need* that
> > dense fd space in any case. But by defaulting to it, we wouldn't
> > break those (few) apps that actually depend on it.
>
> I agree. What would be a good interface to allocate fds in such area?
> We don't want to replicate syscalls, so maybe a special new dup
> function?

If the deal is to be able to get faster open()/socket()/pipe()/... calls by not finding the first 0 bit in a huge bitmap, a better way would be to have a flag in struct task, reset to 0 at exec time. A new syscall would say: this process is OK to receive *random* fds.
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Ulrich Drepper wrote:

> You also have to be aware that open() is just one piece of the puzzle.
> What about socket()? I've cursed this interface many times before and
> now it's biting you: there is no parameter to pass a flag. What about
> transferring file descriptors via Unix domain sockets? How can I decide
> the transferred descriptor should be in the private namespace?

Well, we can't just replicate/change every system call that creates a file descriptor. So I'm for something like:

	int sys_fdup(int fd, int flags);

So you basically create your fds with their native/existing system calls, and then you dup/move them into the preferred fd space.

- Davide
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Davide Libenzi wrote:
>
> I agree. What would be a good interface to allocate fds in such area? We
> don't want to replicate syscalls, so maybe a special new dup function?

I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or similar, and just have NONLINEAR_FD be some magic value (for example, make it be 0x40000000 - the bit that says "private, nonlinear" in the first place).

But what's gotten lost in the current discussion is that we probably don't actually _need_ such a private space. I'm just saying that if the *choice* is between memory-mapped interfaces and a private fd-space, we should probably go for the latter. "Everything is a file" is the UNIX way, after all. But there's little reason to introduce private fd's otherwise.

		Linus
Re: Syslets, Threadlets, generic AIO support, v6
Linus Torvalds wrote:

> On Wed, 30 May 2007, Eric Dumazet wrote:
> > > No, Davide, the problem is that some applications depend on getting
> > > _specific_ file descriptors.
> >
> > Fix the application, rather than adding kernel bloat?
>
> No. The application is _correct_. It's how file descriptors are defined to
> work.

> > Then you can also exclude multi-threading, since a thread (even not inside
> > glibc) can also use socket()/pipe()/open()/whatever and take the zero file
> > descriptor as well.
>
> Totally different. That's an application internal issue. It does *not* mean
> that we can break existing standards.

> > The only hardcoded thing in Unix is the 0, 1 and 2 fds.
>
> Wrong. I already gave an example of real code that just didn't bother to
> keep track of which fd's it had open, and closed them all. Partly, in fact,
> because you can't even _know_ which fd's you have open when somebody else
> just execve's you.

If someone really cares, /proc/self/fd can help. But one shouldn't care at all. Among the things that a process can do before exec()ing another process, file descriptors outside of 0, 1 and 2 are the most obvious, but you also have alarm(), or stupid rlimits.

> You can call it buggy, but the fact is, if you do, you're SIMPLY WRONG. You
> cannot just change years and years of coding practice, and standard
> documentation. The behaviour of file descriptors is a fact. Ignoring that
> fact because you don't like it is naïve and simply not realistic.

I want to change nothing. The current situation is fine and well documented, thank you. If a program does "for (i = 0; i < NR_OPEN; i++) close(i);", this *will*/*should* work as intended: close all file descriptors from 0 to NR_OPEN. Big deal. But you won't find this in a program:

    FILE *fp = fopen("somefile", "r");
    for (i = 0; i < NR_OPEN; i++)
        close(i);
    while (fgets(buff, sizeof(buff), fp)) {
    }

You and/or others want to add fd namespaces and other hacks. I saw suspicious examples on this thread; I am waiting for a real one justifying all this stuff.
After file descriptor separation, I guess we'll need memory space separation as well, signal separation (SIGALRM comes to mind), uid/gid separation, cpu time separation, and so on... setrlimit() layered for every shared lib.
Re: Syslets, Threadlets, generic AIO support, v6
Linus Torvalds wrote:

> Side note: it might not even be a "close-on-exec by default" thing: it
> might well be an *always* close-on-exec.
>
> That COE is pretty horrid to do, we need to scan a bitmap of those things
> on each exec. So it might be totally sensible to just declare that the
> non-linear fd's would simply always be "local", and never bleed across an
> execve.

Hm, I wouldn't limit the mechanism prematurely. Using Valgrind as an example of an alternate user of this mechanism, it would be useful to use a pipe to transmit out-of-band information from an exec-er to an exec-ee process. At the moment there's a lot of mucking around with execve() to transmit enough information from the parent valgrind to its successor.

J
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Linus Torvalds wrote:

> Sure. I think there are things we can do (like make the non-linear fd's
> appear somewhere else, and make them close-on-exec by default etc).

Side note: it might not even be a "close-on-exec by default" thing: it might well be an *always* close-on-exec.

That COE is pretty horrid to do, we need to scan a bitmap of those things on each exec. So it might be totally sensible to just declare that the non-linear fd's would simply always be "local", and never bleed across an execve.

Linus
Re: Syslets, Threadlets, generic AIO support, v6
Linus Torvalds wrote:

> Well, don't think of it as a special case at all: think of bit 30 as a
> "the user asked for a non-linear fd".

This sounds easy but doesn't really solve all the issues. Let me repeat your example and the solution currently in use.

Problem: the application wants to close all file descriptors except a select few, cleaning up what is currently open. It doesn't know all the descriptors that are open. Maybe all this is in preparation for an exec call.

Today the best method to do this is to readdir() /proc/self/fd and exclude the descriptors on the whitelist. If the special, non-sequential descriptors are also listed in that directory, the runtimes still cannot use them since they are visible. If you go ahead with this, then at the very least add a flag which causes the descriptor to not show up in /proc/*/fd.

You also have to be aware that open() is just one piece of the puzzle. What about socket()? I've cursed this interface many times before and now it's biting you: there is no parameter to pass a flag. What about transferring file descriptors via Unix domain sockets? How can I decide whether the transferred descriptor should be in the private namespace?

There are likely many many more problems and corner cases like this.

-- Ulrich Drepper, Red Hat, Inc.
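The readdir()-of-/proc/self/fd method Ulrich describes looks roughly like the sketch below. The helper name and the lowfd cutoff are illustrative; note that the scan must skip the descriptor readdir() itself holds open on the directory, and that closing descriptors the runtime secretly owns is exactly the hazard he is pointing at:

```c
#include <dirent.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Close every descriptor >= lowfd except those whitelisted, by scanning
 * /proc/self/fd (Linux-specific) instead of looping up to NR_OPEN.
 * The fd that readdir() holds open on the directory must be skipped. */
static void close_from_except(int lowfd, const int *keep, int nkeep)
{
    DIR *d = opendir("/proc/self/fd");
    struct dirent *e;

    if (!d)
        return;
    while ((e = readdir(d)) != NULL) {
        int fd, i, keep_it = 0;

        if (e->d_name[0] == '.')
            continue;                    /* skip "." and ".." */
        fd = atoi(e->d_name);
        if (fd < lowfd || fd == dirfd(d))
            continue;                    /* below cutoff, or our own DIR fd */
        for (i = 0; i < nkeep; i++)
            if (keep[i] == fd)
                keep_it = 1;
        if (!keep_it)
            close(fd);
    }
    closedir(d);
}
```

This is strictly better than the NR_OPEN loop, but it still closes any descriptor a library quietly opened, which is why he asks for runtime-private fds to be invisible in /proc/*/fd.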
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Linus Torvalds wrote:

> > And then the semantics: should these descriptors show up in
> > /proc/self/fd? Are there separate directories for each namespace? Do
> > they count against the rlimit?
>
> Oh, absolutely. They'd be real fd's in every way. People could use them
> 100% equivalently (and concurrently) with the traditional ones. The whole,
> and the _only_, point would be that it breaks the legacy guarantees of a
> dense fd space.
>
> Most apps don't actually *need* that dense fd space in any case. But by
> defaulting to it, we wouldn't break those (few) apps that actually depend
> on it.

I agree. What would be a good interface to allocate fds in such an area? We don't want to replicate syscalls, so maybe a special new dup function?

- Davide
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Jeremy Fitzhardinge wrote:

> Some programs - legitimately, I think - scan /proc/self/fd to close
> everything. The question is whether the glibc-private fds should appear
> there. And something like a "close-on-fork" flag might be useful,
> though I guess glibc can keep track of its own fds closely enough to not
> need something like that.

Sure. I think there are things we can do (like make the non-linear fd's appear somewhere else, and make them close-on-exec by default etc).

And it's not like it's necessarily at all the only way to do things. I just threw it out as a possible solution - and one that is almost certainly *superior* to trying to work around the fd thing with some shared memory area, which has tons of much more serious problems of its own (*).

Linus

(*) Ranging from: specialized-only interfaces, inability to pass it around, lack of any abstraction interfaces, and almost impossible to debug. The security implications of kernel and user space sharing read-write access to some shared area are also legion!
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Ulrich Drepper wrote:

> Linus Torvalds wrote:
> > for (i = 0; i < NR_OPEN; i++)
> >     close(i);
> >
> > to clean up all file descriptors before doing something new. And yes, I
> > think it was bash that used to *literally* do something like that a long
> > time ago.
>
> Indeed. It was not only bash, though; I fixed probably a dozen
> applications. But even the new and better solution (readdir of
> /proc/self/fd) does not prevent the problem of closing descriptors the
> system might still need and the application doesn't know about.

Please, do not drop me out of the Cc list. If you have a valid point, you should be able to carry it forward regardless, no?

- Davide
Re: Syslets, Threadlets, generic AIO support, v6
Ulrich Drepper wrote:

> I don't like special cases. For me things better come in quantities 0,
> 1, and unlimited (well, a reasonably high limit). Otherwise, who gets to
> use that special namespace? The C library is not the only body of code
> which would want to use descriptors.

Valgrind could certainly make use of it. It currently reserves a set of fds "high enough", and tries hard to hide them from apps, but /proc/self/fd makes it intractable in general (there was only so much simulation I was willing to do in Valgrind).

J
Re: Syslets, Threadlets, generic AIO support, v6
Linus Torvalds wrote:

> Which *could* be something as simple as saying "bit 30 in the file
> descriptor specifies a separate fd space" along with some flags to make
> open and friends return those separate fd's. That makes them useless for
> "select()" (which assumes a flat address space, of course), but would be
> useful for just about anything else.

Some programs - legitimately, I think - scan /proc/self/fd to close everything. The question is whether the glibc-private fds should appear there. And something like a "close-on-fork" flag might be useful, though I guess glibc can keep track of its own fds closely enough to not need something like that.

J
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Eric Dumazet wrote:

> > So library routines *must not* open file descriptors in the normal space.
> >
> > (The same is true of real applications doing the equivalent of
> >
> >     for (i = 0; i < NR_OPEN; i++)
> >         close(i);
>
> Quite buggy IMHO

Looking at it now, I'd agree (although I think I have that somewhere in my old code too). Consider, though, that such code is also contained in reference books like W. Richard Stevens' "UNIX Network Programming".

- Davide
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Ulrich Drepper wrote:

> I don't like special cases. For me things better come in quantities 0,
> 1, and unlimited (well, a reasonably high limit). Otherwise, who gets to
> use that special namespace? The C library is not the only body of code
> which would want to use descriptors.

Well, don't think of it as a special case at all: think of bit 30 as "the user asked for a non-linear fd".

In fact, to make it effective, I'd suggest literally scrambling the low bits (using, for example, some silly per-boot xor value to actually generate the "true" index - the equivalent of a really stupid randomizer). That way you'd have the legacy "linear" space, and a separate "non-linear" space where people simply *cannot* make assumptions about contiguous fd allocations.

There's no special case there - it's just an extension which explicitly allows us to say "if you do that, your fd's won't be allocated the traditional way any more, but you *can* mix the traditional and the non-linear allocation".

> And then the semantics: should these descriptors show up in
> /proc/self/fd? Are there separate directories for each namespace? Do
> they count against the rlimit?

Oh, absolutely. They'd be real fd's in every way. People could use them 100% equivalently (and concurrently) with the traditional ones. The whole, and the _only_, point would be that it breaks the legacy guarantees of a dense fd space.

Most apps don't actually *need* that dense fd space in any case. But by defaulting to it, we wouldn't break those (few) apps that actually depend on it.

Linus
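The bit-30-plus-scrambling idea can be sketched in user space. Everything here is a stand-in: the key is a fixed constant only so the example is deterministic (the real thing would be a per-boot random value living inside the kernel's fd allocator), and the names are made up:

```c
#include <stdint.h>

#define FD_NONLINEAR_BIT (1 << 30)   /* "the user asked for a non-linear fd" */

/* Stand-in for a per-boot random value; bit 30 deliberately clear so the
 * xor never disturbs the tag bit. */
static const uint32_t fd_scramble_key = 0x1b5a3c9d;

/* Map a kernel-internal table index to the cookie handed to user space:
 * set bit 30 and xor the low bits, so applications cannot assume the
 * traditional dense, lowest-available allocation. */
static int index_to_nonlinear_fd(uint32_t index)
{
    return (int)(FD_NONLINEAR_BIT | (index ^ fd_scramble_key));
}

/* Inverse mapping, as the kernel would do on every fd lookup. */
static uint32_t nonlinear_fd_to_index(int fd)
{
    return ((uint32_t)fd & ~(uint32_t)FD_NONLINEAR_BIT) ^ fd_scramble_key;
}
```

Because xor is its own inverse, the lookup stays O(1), yet consecutive table indices produce non-consecutive user-visible fds, which is the whole point of the "really stupid randomizer".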
Re: Syslets, Threadlets, generic AIO support, v6
Linus Torvalds wrote:

> for (i = 0; i < NR_OPEN; i++)
>     close(i);
>
> to clean up all file descriptors before doing something new. And yes, I
> think it was bash that used to *literally* do something like that a long
> time ago.

Indeed. It was not only bash, though; I fixed probably a dozen applications. But even the new and better solution (readdir of /proc/self/fd) does not prevent the problem of closing descriptors the system might still need and the application doesn't know about.

> Which *could* be something as simple as saying "bit 30 in the file
> descriptor specifies a separate fd space" along with some flags to make
> open and friends return those separate fd's.

I don't like special cases. For me things better come in quantities 0, 1, and unlimited (well, a reasonably high limit). Otherwise, who gets to use that special namespace? The C library is not the only body of code which would want to use descriptors.

And then the semantics: should these descriptors show up in /proc/self/fd? Are there separate directories for each namespace? Do they count against the rlimit?

This seems to me like a shot from the hip without thinking about other possibilities.

-- Ulrich Drepper, Red Hat, Inc.
Re: Syslets, Threadlets, generic AIO support, v6
Linus Torvalds wrote:

> On Wed, 30 May 2007, Eric Dumazet wrote:
> > So yes, reimplementing sendfile() should help to find the last splice()
> > bugs, and as a bonus it could add non-blocking disk io (O_NONBLOCK on
> > input file -> socket)
>
> Well, to get those kinds of advantages, you'd have to use splice directly,
> since sendfile() hasn't supported nonblocking disk IO, and the interface
> doesn't really allow for it.

The sendfile() interface doesn't allow it, but if you open("somediskfile", O_RDONLY | O_NONBLOCK), then a splice()-based sendfile() can perform non-blocking disk io (while starting the io with readahead). I actually use this trick myself :)

    (splice(disk -> pipe, NONBLOCK), splice(pipe -> worker))

Non-blocking disk io, plus zero copy :)

> In fact, since nonblocking accesses also require some *polling* method, and
> we don't have that for files, I suspect the best option for those things is
> to simply mix AIO and splice(). AIO tends to be the right thing for disk
> waits (read: short, often cached), and if we can improve AIO performance for
> the cached accesses (which is exactly what the threadlets should hopefully
> allow us to do), I would seriously suggest going that route.
>
> But the pure "use splice to _implement_ sendfile()" thing is worth doing for
> all the other reasons, even if nonblocking file access is not likely one of
> them.
>
> Linus
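Eric's splice(disk -> pipe) + splice(pipe -> worker) pattern can be sketched as a blocking file-to-file copy. The destination is a plain file here purely for testability; a sendfile()-style caller would pass a socket, and a non-blocking variant would need the polling Linus mentions. Linux-only, and the helper name is made up:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Copy `len` bytes from from_fd to to_fd through a pipe, using splice()
 * on both legs so the data never passes through a user-space buffer.
 * One end of every splice() must be a pipe, hence the intermediate. */
static long splice_copy(int from_fd, int to_fd, size_t len)
{
    int p[2];
    long total = 0;

    if (pipe(p) < 0)
        return -1;
    while (len > 0) {
        /* disk -> pipe: fills the pipe from the source file's offset */
        ssize_t n = splice(from_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
        if (n <= 0)
            break;
        len -= (size_t)n;
        /* pipe -> destination: drain everything we just spliced in */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, to_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (m <= 0)
                goto out;
            n -= m;
            total += m;
        }
    }
out:
    close(p[0]);
    close(p[1]);
    return total;
}
```

The same loop with SPLICE_F_NONBLOCK on the first leg is, roughly, the non-blocking readahead trick Eric describes.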
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Eric Dumazet wrote:

> > No, Davide, the problem is that some applications depend on getting
> > _specific_ file descriptors.
>
> Fix the application, rather than adding kernel bloat?

No. The application is _correct_. It's how file descriptors are defined to work.

> Then you can also exclude multi-threading, since a thread (even not inside
> glibc) can also use socket()/pipe()/open()/whatever and take the zero file
> descriptor as well.

Totally different. That's an application internal issue. It does *not* mean that we can break existing standards.

> The only hardcoded thing in Unix is the 0, 1 and 2 fds.

Wrong. I already gave an example of real code that just didn't bother to keep track of which fd's it had open, and closed them all. Partly, in fact, because you can't even _know_ which fd's you have open when somebody else just execve's you.

You can call it buggy, but the fact is, if you do, you're SIMPLY WRONG. You cannot just change years and years of coding practice, and standard documentation. The behaviour of file descriptors is a fact. Ignoring that fact because you don't like it is naïve and simply not realistic.

Linus
Re: Syslets, Threadlets, generic AIO support, v6
Linus Torvalds wrote:

> On Wed, 30 May 2007, Davide Libenzi wrote:
> > Here I think we are forgetting that glibc is userspace and there's no
> > separation between the application code and glibc code. An application
> > linking to glibc can break glibc in a thousand ways, independently from
> > fds or not fds. Like complaining that glibc is broken because printf()
> > suddenly does not work anymore ;)
>
> No, Davide, the problem is that some applications depend on getting
> _specific_ file descriptors.

Fix the application, rather than adding kernel bloat?

> For example, if you do
>
>     close(0);
>     .. something else ..
>     if (open("myfile", O_RDONLY) < 0)
>         exit(1);
>
> you can (and should) depend on the open returning zero.

Then you can also exclude multi-threading, since a thread (even not inside glibc) can also use socket()/pipe()/open()/whatever and take the zero file descriptor as well. Frankly, I don't buy this fd namespace stuff. The only hardcoded thing in Unix is the 0, 1 and 2 fds. People usually take care of these, or should use a Microsoft OS. POSIX mandates that open() returns the lowest available fd. But this obviously works only if you don't have another thread messing with fds, or if you don't call a library function that opens a file. That's all.

> So library routines *must not* open file descriptors in the normal space.
>
> (The same is true of real applications doing the equivalent of
>
>     for (i = 0; i < NR_OPEN; i++)
>         close(i);

Quite buggy IMHO. This hack was to avoid bugs coming from ancestor applications forking/execing a shell, at a time when one process could not open more than 20 files (AT&T Unix, 21 years ago). Unix has fcntl(fd, F_SETFD, FD_CLOEXEC). A library should use this to make sure an fd is not propagated at exec() time.

> to clean up all file descriptors before doing something new. And yes, I
> think it was bash that used to *literally* do something like that a long
> time ago.)
> Another example of the same thing: people open file descriptors and know
> that they'll be "dense" in the result, and then use "select()" on them.

poll() is nice. Even AT&T Unix had it 21 years ago :)

> So it's true that file descriptors can't be used randomly by the standard
> libraries - they'd need to have some kind of separate "private space".
>
> Which *could* be something as simple as saying "bit 30 in the file
> descriptor specifies a separate fd space" along with some flags to make
> open and friends return those separate fd's. That makes them useless for
> "select()" (which assumes a flat address space, of course), but would be
> useful for just about anything else.

Please don't do that. Second-class fds. Then what about having ten different shared libraries? Third-class fds?
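The fcntl(fd, F_SETFD, FD_CLOEXEC) Eric mentions covers half of the library's problem today: the kernel closes the fd automatically at execve() time. (It does nothing about the application close()-ing or dup2()-ing over the fd while the process runs, which is what the non-linear-fd proposal addresses.) A minimal sketch, helper name illustrative:

```c
#include <fcntl.h>

/* Mark an fd close-on-exec without disturbing its other fd flags,
 * the standard read-modify-write idiom for F_SETFD. */
static int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```

A library would call this immediately after creating any internal descriptor, so at least an execve()'d child never sees it.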
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Eric Dumazet wrote:

> So yes, reimplementing sendfile() should help to find the last splice()
> bugs, and as a bonus it could add non-blocking disk io (O_NONBLOCK on
> input file -> socket)

Well, to get those kinds of advantages, you'd have to use splice directly, since sendfile() hasn't supported nonblocking disk IO, and the interface doesn't really allow for it.

In fact, since nonblocking accesses also require some *polling* method, and we don't have that for files, I suspect the best option for those things is to simply mix AIO and splice(). AIO tends to be the right thing for disk waits (read: short, often cached), and if we can improve AIO performance for the cached accesses (which is exactly what the threadlets should hopefully allow us to do), I would seriously suggest going that route.

But the pure "use splice to _implement_ sendfile()" thing is worth doing for all the other reasons, even if nonblocking file access is not likely one of them.

Linus
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Linus Torvalds wrote: > On Wed, 30 May 2007, Davide Libenzi wrote: > > > > Here I think we are forgetting that glibc is userspace and there's no > > separation between the application code and glibc code. An application > > linking to glibc can break glibc in thousand ways, indipendently from fds > > or not fds. Like complaining that glibc is broken because printf() > > suddendly does not work anymore ;) > > No, Davide, the problem is that some applications depend on getting > _specific_ file descriptors. > > For example, if you do > > close(0); > .. something else .. > if (open("myfile", O_RDONLY) < 0) > exit(1); > > you can (and should) depend on the open returning zero. > > So library routines *must not* open file descriptors in the normal space. > > (The same is true of real applications doing the equivalent of > > for (i = 0; i < NR_OPEN; i++) > close(i); > > to clean up all file descriptors before doing something new. And yes, I > think it was bash that used to *literally* do something like that a long > time ago. Right. I misunderstood Uli and Ingo. I thought it was like trying to protect glibc from intentional application mis-behaviour. > Another example of the same thing: people open file descriptors and know > that they'll be "dense" in the result, and then use "select()" on them. > > So it's true that file descriptors can't be used randomly by the standard > libraries - they'd need to have some kind of separate "private space". > > Which *could* be something as simple as saying "bit 30 in the file > descriptor specifies a separate fd space" along with some flags to make > open and friends return those separate fd's. That makes them useless for > "select()" (which assumes a flat address space, of course), but would be > useful for just about anything else. I think it can be solved in a few ways. Yours or Ingo's (or something else) can work, to solve the above "legacy" fd space expectations. 
- Davide
Re: Syslets, Threadlets, generic AIO support, v6
Linus Torvalds wrote:

> On Wed, 30 May 2007, Mark Lord wrote:
> > I wonder how useful it would be to reimplement sendfile() using splice(),
> > either in glibc or inside the kernel itself?
>
> I'd like that, if only because right now we have two separate paths that
> kind of do the same thing, and splice really is the only one that is
> generic. I thought Jens even had some experimental patches for it.
>
> It might be worth to "just do it" - there's some internal overhead, but on
> the other hand, it's also likely the best way to make sure any issues get
> sorted out.

Last time I played with splice(), I found a bug in the readahead logic, most probably because nobody but me had tried it before. (Corrected by Fengguang Wu in commit 9ae9d68cbf3fe0ec17c17c9ecaa2188ffb854a66.)

So yes, reimplementing sendfile() should help to find the last splice() bugs, and as a bonus it could add non-blocking disk io (O_NONBLOCK on input file -> socket).
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Davide Libenzi wrote:

> Here I think we are forgetting that glibc is userspace and there's no
> separation between the application code and glibc code. An application
> linking to glibc can break glibc in a thousand ways, independently from
> fds or not fds. Like complaining that glibc is broken because printf()
> suddenly does not work anymore ;)

No, Davide, the problem is that some applications depend on getting _specific_ file descriptors.

For example, if you do

    close(0);
    .. something else ..
    if (open("myfile", O_RDONLY) < 0)
        exit(1);

you can (and should) depend on the open returning zero.

So library routines *must not* open file descriptors in the normal space.

(The same is true of real applications doing the equivalent of

    for (i = 0; i < NR_OPEN; i++)
        close(i);

to clean up all file descriptors before doing something new. And yes, I think it was bash that used to *literally* do something like that a long time ago.)

Another example of the same thing: people open file descriptors and know that they'll be "dense" in the result, and then use "select()" on them.

So it's true that file descriptors can't be used randomly by the standard libraries - they'd need to have some kind of separate "private space".

Which *could* be something as simple as saying "bit 30 in the file descriptor specifies a separate fd space" along with some flags to make open and friends return those separate fd's. That makes them useless for "select()" (which assumes a flat address space, of course), but would be useful for just about anything else.

Linus
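The close(0)/open() idiom above works because POSIX guarantees open() returns the lowest-numbered unused descriptor. A minimal demonstration of the guarantee, with an illustrative helper name:

```c
#include <fcntl.h>
#include <unistd.h>

/* Free a specific descriptor slot, then show the very next open()
 * landing in it: POSIX requires open() to return the lowest unused fd,
 * which is exactly what the close(0)/open() pattern relies on. */
static int reopen_into_slot(int slot, const char *path)
{
    close(slot);                  /* make `slot` the lowest free fd */
    return open(path, O_RDONLY);  /* POSIX: returns that slot */
}
```

This lowest-available rule is the "legacy guarantee of a dense fd space" the non-linear proposal would deliberately opt out of.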
Re: Syslets, Threadlets, generic AIO support, v6
Davide Libenzi wrote:

> An application linking to glibc can break glibc in a thousand ways,
> independently from fds or not fds.

It's not (only/mainly) about breaking. File descriptors are a resource which has to be used under the control of the program. The runtime cannot just steal some for itself. This indirectly leads to breaking code. We've seen this many times and I keep repeating the same issue over and over again: why do we have MAP_ANON instead of keeping a file descriptor with /dev/null open? Why is mmap made more complicated by allowing the file descriptor to be closed after the mmap() call is done?

Take a look at a process running your favorite shell. Ever wonder why there is this stray file descriptor with a high number?

    $ cat /proc/3754/cmdline
    bash
    $ ll /proc/3754/fd/
    total 0
    lrwx------ 1 drepper drepper 64 2007-05-30 12:50 0 -> /dev/pts/19
    lrwx------ 1 drepper drepper 64 2007-05-30 12:50 1 -> /dev/pts/19
    lrwx------ 1 drepper drepper 64 2007-05-30 12:49 2 -> /dev/pts/19
    lrwx------ 1 drepper drepper 64 2007-05-30 12:50 255 -> /dev/pts/19

File descriptors must be requested explicitly and cannot be implicitly consumed. All that, and the other problem I mentioned earlier today about auxiliary data. File descriptors are not the ideal interface. Elegant: yes, ideal: no. From physics and math you might have learned that not every result that looks clean and beautiful is correct.

-- Ulrich Drepper, Red Hat, Inc.
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Ingo Molnar wrote:

> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > > To echo Uli and paraphrase an ad, "it's the interface, silly."
> >
> > THERE IS NO INTERFACE! You're just making that up, and glossing over
> > the most important part of the whole thing!
> >
> > If you could actually point to something specific that matches what
> > everybody needs, and is architecture-neutral, it would be a different
> > issue. As is, you're just saying "memory-mapped interfaces" without
> > actually going into enough detail to show HOW MUCH IT SUCKS.
> >
> > There really are very few programs that would use them. [...]
>
> looking over the list of our new generic APIs (see further below) i
> think there are three important things that are needed for an API to
> become widely used:
>
> 1) it should solve a real problem (ha ;-), it should be intuitive to
>    humans and it should fit into existing things naturally.
>
> 2) it should be ubiquitous. (if it's about IO it should cover block IO,
>    network IO, timers, signals and everything) Even if it might look
>    silly in some of the cases, having complete, utter, no-compromises,
>    100% coverage for everything massively helps the uptake of an API,
>    because it allows the user-space coder to pick just one paradigm
>    that is closest to his application and stick to it and only to it.
>
> 3) it should be end-to-end supported by glibc.
>
> our failed API attempts so far were:
>
> - sendfile(). This API mainly failed on #2. It partly failed on #1 too.
>   (couldn't be used in certain types of scenarios so was unintuitive.)
>   splice() fixes this almost completely.
>
> - KAIO. It fails on #2 and #3.
>
> our more successful new APIs:
>
> - futexes. After some hiccups they form the base of all modern
>   user-space locking.
>
> - splice. (a bit too early to tell but it's looking good so far. Would
>   be nice if someone did a brute-force memcpy() based vmsplice to user
>   memory, just to make usage fully symmetric.)
> partially successful, not yet failed new APIs:
>
> - epoll. It currently fails at #2 (v2.6.22 mostly fills the gaps but
>   not completely). Despite the non-complete coverage of event domains a
>   good number of apps are using it, and in particular a couple of really
>   'high end' apps with massive amounts of event sources - which apps
>   would have no chance with poll, select or threads.
>
> - inotify. It's being used quite happily on the desktop, despite some
>   of its limitations. (Possibly integratable into epoll?)

I think, as Linus pointed out (as I did a few months ago), that there's confusion about the term "Unification" or "Single Interface". Unification is not about fetching all the data coming from the most diverse sources into a single interface. That is just broken, because each data source wants a different data structure to be reported. This is ABI-hell 101.

Unification is the ability to uniformly wait for readiness, and then fetch data with source-dependent collectors (read(2), io_getevents(2), ...). That way you have ABI isolation on the single data source, and no monster structures trying to blob together the most diverse data formats.

AFAIK, inotify works with select/poll/epoll as is.

- Davide

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Ingo Molnar wrote:
> yeah - this is a fundamental design question for Linus i guess :-) glibc
> (and other infrastructure libraries) have a fundamental problem: they
> cannot (and do not) presently use persistent file descriptors to make
> use of kernel functionality, due to ABI side-effects. [applications can
> dup into an fd used by glibc, applications can close it - shells close
> fds blindly for example, etc.] Today glibc simply cannot open a file
> descriptor and keep it open while application code is running due to
> these problems.

Here I think we are forgetting that glibc is userspace and there's no separation between the application code and glibc code. An application linking to glibc can break glibc in a thousand ways, independently of fds. Like complaining that glibc is broken because printf() suddenly does not work anymore ;)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	close(fileno(stdout));
	printf("Whiskey Tango Foxtrot?\n");
	return 0;
}

- Davide
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, May 30 2007, Linus Torvalds wrote: > > > On Wed, 30 May 2007, Mark Lord wrote: > > > > I wonder how useful it would be to reimplement sendfile() > > using splice(), either in glibc or inside the kernel itself? > > I'd like that, if only because right now we have two separate paths that > kind of do the same thing, and splice really is the only one that is > generic. > > I thought Jens even had some experimental patches for it. It might be > worth to "just do it" - there's some internal overhead, but on the other > hand, it's also likely the best way to make sure any issues get sorted > out. I do, this is a one year old patch that does that: http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=f8f550e027fd07ad8d87110178803dc63b544d89 I'll update it, test, and submit for 2.6.23. -- Jens Axboe
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Mark Lord wrote: > > I wonder how useful it would be to reimplement sendfile() > using splice(), either in glibc or inside the kernel itself? I'd like that, if only because right now we have two separate paths that kind of do the same thing, and splice really is the only one that is generic. I thought Jens even had some experimental patches for it. It might be worth to "just do it" - there's some internal overhead, but on the other hand, it's also likely the best way to make sure any issues get sorted out. Linus
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, May 30 2007, Mark Lord wrote: > Ingo Molnar wrote: > > > > - sendfile(). This API mainly failed on #2. It partly failed on #1 too. > > (couldnt be used in certain types of scenarios so was unintuitive.) > > splice() fixes this almost completely. > > > > - KAIO. It fails on #2 and #3. > > I wonder how useful it would be to reimplement sendfile() > using splice(), either in glibc or inside the kernel itself? > > sendfile() does get used a fair bit, but I really doubt that anyone > outside of a handful of people on this list actually use splice(). It's indeed the plan, I even have a git branch for it. Just never took the time to actually finish it. http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=splice-sendfile -- Jens Axboe
Re: Syslets, Threadlets, generic AIO support, v6
Ingo Molnar wrote: - sendfile(). This API mainly failed on #2. It partly failed on #1 too. (couldnt be used in certain types of scenarios so was unintuitive.) splice() fixes this almost completely. - KAIO. It fails on #2 and #3. I wonder how useful it would be to reimplement sendfile() using splice(), either in glibc or inside the kernel itself? sendfile() does get used a fair bit, but I really doubt that anyone outside of a handful of people on this list actually use splice(). Cheers
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, May 30 2007, Ingo Molnar wrote:
> - splice. (a bit too early to tell but it's looking good so far. Would
>   be nice if someone did a brute-force memcpy() based vmsplice to user
>   memory, just to make usage fully symmetric.)

Heh, I actually agree, at least then the interface is complete! We can always replace it with something more clever, should someone feel so inclined.

Here's a rough patch to do that, it's totally untested (but it compiles). sparse will warn about the __user removal, though. I'm sure viro would shoot me dead on the spot, should he see this...

diff --git a/fs/splice.c b/fs/splice.c
index 12f2828..5023c01 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -657,9 +657,9 @@ out_ret:
  * key here is the 'actor' worker passed in that actually moves the data
  * to the wanted destination. See pipe_to_file/pipe_to_sendpage above.
  */
-ssize_t __splice_from_pipe(struct pipe_inode_info *pipe,
-			   struct file *out, loff_t *ppos, size_t len,
-			   unsigned int flags, splice_actor *actor)
+ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, void *actor_priv,
+			   loff_t *ppos, size_t len, unsigned int flags,
+			   splice_actor *actor)
 {
 	int ret, do_wakeup, err;
 	struct splice_desc sd;
@@ -669,7 +669,7 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe,
 	sd.total_len = len;
 	sd.flags = flags;
-	sd.file = out;
+	sd.file = actor_priv;
 	sd.pos = *ppos;

 	for (;;) {
@@ -1240,28 +1240,104 @@ static int get_iovec_page_array(const struct iovec __user *iov,
 	return error;
 }

+static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+			struct splice_desc *sd)
+{
+	int ret;
+
+	ret = buf->ops->pin(pipe, buf);
+	if (!ret) {
+		void __user *dst = sd->userptr;
+		/*
+		 * use non-atomic map, can be optimized to map atomically if we
+		 * prefault the user memory.
+		 */
+		char *src = buf->ops->map(pipe, buf, 0);
+
+		if (copy_to_user(dst, src, sd->len))
+			ret = -EFAULT;
+
+		buf->ops->unmap(pipe, buf, src);
+
+		if (!ret)
+			return sd->len;
+	}
+
+	return ret;
+}
+
+/*
+ * For lack of a better implementation, implement vmsplice() to userspace
+ * as a simple copy of the pipes pages to the user iov.
+ */
+static long vmsplice_to_user(struct file *file, const struct iovec __user *iov,
+			     unsigned long nr_segs, unsigned int flags)
+{
+	struct pipe_inode_info *pipe;
+	ssize_t size;
+	int error;
+	long ret;
+
+	pipe = pipe_info(file->f_path.dentry->d_inode);
+	if (!pipe)
+		return -EBADF;
+	if (!nr_segs)
+		return 0;
+
+	if (pipe->inode)
+		mutex_lock(&pipe->inode->i_mutex);
+
+	ret = 0;
+	while (nr_segs) {
+		void __user *base;
+		size_t len;
+
+		/*
+		 * Get user address base and length for this iovec.
+		 */
+		error = get_user(base, &iov->iov_base);
+		if (unlikely(error))
+			break;
+		error = get_user(len, &iov->iov_len);
+		if (unlikely(error))
+			break;
+
+		/*
+		 * Sanity check this iovec. 0 read succeeds.
+		 */
+		if (unlikely(!len))
+			break;
+		error = -EFAULT;
+		if (unlikely(!base))
+			break;
+
+		size = __splice_from_pipe(pipe, (void *) base, NULL, len,
+					  flags, pipe_to_user);
+		if (size < 0) {
+			if (!ret)
+				ret = size;
+
+			break;
+		}
+
+		nr_segs--;
+		iov++;
+		ret += size;
+	}
+
+	if (pipe->inode)
+		mutex_unlock(&pipe->inode->i_mutex);
+
+	return ret;
+}
+
 /*
  * vmsplice splices a user address range into a pipe. It can be thought of
  * as splice-from-memory, where the regular splice is splice-from-file (or
  * to file). In both cases the output is a pipe, naturally.
- *
- * Note that vmsplice only supports splicing _from_ user memory to a pipe,
- * not the other way around. Splicing from user memory is a simple operation
- * that can be supported without any funky alignment restrictions or nasty
- * vm tricks. We simply map in the user memory and fill them into a pipe.
- * The reverse isn't quite as easy, though. There are two possible solutions
- * for that:
- *
- * - memcpy() the data internally, at which point we might as well just
- *   do a regular read
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, May 30 2007, Zach Brown wrote: > > Yeah, it'll confuse CFQ a lot actually. The threads either need to share > > an io context (clean approach, however will introduce locking for things > > that were previously lockless), or CFQ needs to get better support for > > cooperating processes. > > Do let me know if I can be of any help in this. Thanks, it should not be a lot of work though. > > For the fio testing, we can make some improvements there. Right now you > > don't get any concurrency of the io requests if you set eg iodepth=32, > > as the 32 requests will be submitted as a linked chain of atoms. For io > > saturation, that's not really what you want. > > Just to be clear: I'm currently focusing on supporting sys_io_*() so I'm > using fio's libaio engine. I'm not testing the syslet syscall interface > yet. Ah ok, then there's no issue from that end! -- Jens Axboe
Re: Syslets, Threadlets, generic AIO support, v6
> due to the added syscall. (Maybe we can just get that reserved
> upstream now?)

Maybe, but we'd have to agree on the bare syslet interface that is being supported :). Personally, I'd like that to be the simplest thing that works for people and I'm not convinced that the current syslet-specific syscalls are that. Certainly not the atom interface, anyway.

+asmlinkage __attribute__((weak)) long
+sys_umem_add(unsigned long __user *uptr, unsigned long inc)
+{
+	unsigned long val, new_val;
+
+	if (get_user(val, uptr))
+		return -EFAULT;
+	/*
+	 * inc == 0 means 'read memory value':
+	 */
+	if (!inc)
+		return val;
+
+	new_val = val + inc;
+	if (__put_user(new_val, uptr))
+		return -EFAULT;
+
+	return new_val;
+}

A syscall for *long addition* strikes me as a bit much, I have to admit. Where do we stop? (Where's the compat wrapper? :)) Maybe this would be fine for some wildly aggressive optimization some number of years in the future when we have millions of syslet interface users complaining about the cycle overhead of their syslet engines, but it seems like we can do something much less involved in the first pass without harming the possibility of promising to support this complex optimization in the future.

- z
Re: Syslets, Threadlets, generic AIO support, v6
> Yeah, it'll confuse CFQ a lot actually. The threads either need to share > an io context (clean approach, however will introduce locking for things > that were previously lockless), or CFQ needs to get better support for > cooperating processes. Do let me know if I can be of any help in this. > For the fio testing, we can make some improvements there. Right now you > don't get any concurrency of the io requests if you set eg iodepth=32, > as the 32 requests will be submitted as a linked chain of atoms. For io > saturation, that's not really what you want. Just to be clear: I'm currently focusing on supporting sys_io_*() so I'm using fio's libaio engine. I'm not testing the syslet syscall interface yet. - z
Re: Syslets, Threadlets, generic AIO support, v6
* Linus Torvalds <[EMAIL PROTECTED]> wrote: > > To echo Uli and paraphrase an ad, "it's the interface, silly." > > THERE IS NO INTERFACE! You're just making that up, and glossing over > the most important part of the whole thing! > > If you could actually point to something specific that matches what > everybody needs, and is architecture-neutral, it would be a different > issue. As is, you're just saying "memory-mapped interfaces" without > actually going into enough detail to show HOW MUCH IT SUCKS. > > There really are very few programs that would use them. [...] looking over the list of our new generic APIs (see further below) i think there are three important things that are needed for an API to become widely used: 1) it should solve a real problem (ha ;-), it should be intuitive to humans and it should fit into existing things naturally. 2) it should be ubiquitous. (if it's about IO it should cover block IO, network IO, timers, signals and everything) Even if it might look silly in some of the cases, having complete, utter, no compromises, 100% coverage for everything massively helps the uptake of an API, because it allows the user-space coder to pick just one paradigm that is closest to his application and stick to it and only to it. 3) it should be end-to-end supported by glibc. our failed API attempts so far were: - sendfile(). This API mainly failed on #2. It partly failed on #1 too. (couldnt be used in certain types of scenarios so was unintuitive.) splice() fixes this almost completely. - KAIO. It fails on #2 and #3. our more successful new APIs: - futexes. After some hickups they form the base of all modern user-space locking. - splice. (a bit too early to tell but it's looking good so far. Would be nice if someone did a brute-force memcpy() based vmsplice to user memory, just to make usage fully symmetric.) partially successful, not yet failed new APIs: - epoll. It currently fails at #2 (v2.6.22 mostly fills the gaps but not completely). 
Despite the non-complete coverage of event domains a good number of apps are using it, and in particular a couple really 'high end' apps with massive amounts of event sources - which apps would have no chance with poll, select or threads. - inotify. It's being used quite happily on the desktop, despite some of its limitations. (Possibly integratable into epoll?) Ingo
Re: Syslets, Threadlets, generic AIO support, v6
Ingo Molnar wrote: > we should perhaps enable glibc to have its separate fd namespace (or > 'hidden' file descriptors at the upper end of the fd space) so that it > can transparently listen to netlink events (or do epoll), Something like this would only work reliably if you have actual protection coming with it. Also, there are still reasons why an application might want to see, close, handle, whatever these descriptors in a separate namespace. I think such namespaces are a broken concept. How many do you want to introduce? Plus, then you get away from the normal file descriptor interfaces anyway. If you'd represent these alternative namespace descriptors with ordinary ints you gain nothing. You'd have to use tuples (namespace, descriptor) and then you need a whole set of new interfaces or some sticky namespace selection which will only cause problems (think signal delivery). > without > impacting the application fd namespace - instead of ducking to a memory > based API as a workaround. It's not "ducking". Memory mapping is one of the most natural interfaces. Just because Unix/Linux is built around the concept of file descriptors does not mean this is the ultimate in usability. File descriptors are in fact clumsy: if you have a file descriptor to read and write data, all auxiliary data for that communication must be transferred out-of-band (e.g., fcntl) or in very magical and hard to use ways (recvmsg, sendmsg). With a memory based event mechanism this auxiliary data can be stored in memory along with the event notification. > it is a serious flexibility issue that should not be ignored. The > unified fd space is a blessing on one hand because it's simple and Too simple. -- ➧ Ulrich Drepper ➧ Red Hat, Inc.
➧ 444 Castro St ➧ Mountain View, CA ❖
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Jeff Garzik wrote: > > You snipped the key part of my response, so I'll say it again: > > Event rings (a) most closely match what is going on in the hardware and (b) > often closely match what is going on in multi-socket, event-driven software > application. I have rather strong counter-arguments: (a) yes, it's how hardware does it, but if you actually look at hardware, you quickly realize that every single piece of hardware uses a *different* ring interface. This should really tell you something. In fact, it may not be rings at all, but structures with more complex formats (eg the USB descriptors). (b) yes, event-driven software tends to use some data structures that are sometimes approximated by event rings, but they all use *different* software structures. There simply *is* no common "event" structure: each program tends to have its own issues, its own allocation policies, and its own "ring" structures. They may not be rings at all. They can be priority queues/heaps or other much more complex structures. > To echo Uli and paraphrase an ad, "it's the interface, silly." THERE IS NO INTERFACE! You're just making that up, and glossing over the most important part of the whole thing! If you could actually point to something specific that matches what everybody needs, and is architecture-neutral, it would be a different issue. As is, you're just saying "memory-mapped interfaces" without actually going into enough detail to show HOW MUCH IT SUCKS. There really are very few programs that would use them. We had a trivial benchmark, the only function of which was to show usage, and here Ingo and Evgeniy are (once more) talking about bugs in that one months later. THAT should tell you something. Make poll/select/aio/read etc faster. THAT is where the payoffs are. In fact, if somebody wants to look at a standard interface that could be speeded up, the prime thing to look at is "readdir()" (aka getdents).
Making _that_ thing go faster and scale better and do read-ahead is likely to be a lot more important for performance. It was one of the bottle-necks for samba several years ago, and nobody has really tried to improve it. And yes, that's because it's hard - people would rather make up new interfaces that are largely irrelevant even before they are born, than actually try to improve important existing ones. Linus
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, 30 May 2007, Ingo Molnar wrote: > > * Ulrich Drepper <[EMAIL PROTECTED]> wrote: > > > > I'm not going to judge your tests but saying there are no significant > > advantages is too one-sided. There is one huge advantage: the > > interface. A memory-based interface is simply the best form. File > > descriptors are a resource the runtime cannot transparently consume. > > yeah - this is a fundamental design question for Linus i guess :-) Well, quite frankly, to me, the most important part of syslets is that if they are done right, they introduce _no_ new interfaces at all that people actually use. Over the years, we've done lots of nice "extended functionality" stuff. Nobody ever uses them. The only thing that gets used is the standard stuff that everybody else does too. So when it comes to syslets, the most important interface will be the existing aio_read() etc interfaces _without_ any in-memory stuff at all, and everything done by the kernel to just make it look exactly like it used to look. And the biggest advantage is that it simplifies the internal kernel code, and makes us use the same code for aio and non-aio (and I think we have a good possibility of improving performance too, if only because we will get much more natural and fine-grained scheduling points!) Any extended "direct syslets" use is technically _interesting_, but ultimately almost totally pointless. Which was why I was pushing really really hard for a simple interface and not being too clever or exposing internal designs too much. An in-memory thing tends to be the absolute _worst_ interface when it comes to abstraction layers and future changes. > glibc (and other infrastructure libraries) have a fundamental problem: > they cannot (and do not) presently use persistent file descriptors to > make use of kernel functionality, due to ABI side-effects. glibc has a more fundamental problem: the "fun" stuff is generally not worth it. 
For example, any AIO thing that requires glibc to be rewritten is almost totally uninteresting. It should work with _existing_ binaries, and _existing_ ABI's to be useful - since 99% of all AIO users are binary-only and won't recompile for some experimental library. The whole epoll/kevent flame-wars have ignored a huge issue: almost nobody uses either. People still use poll and select, to such an _overwhelming_ degree that it almost doesn't even matter if you were to make the alternatives orders of magnitude faster - it wouldn't even be visible. > we should perhaps enable glibc to have its separate fd namespace (or > 'hidden' file descriptors at the upper end of the fd space) so that it > can transparently listen to netlink events (or do epoll), without > impacting the application fd namespace - instead of ducking to a memory > based API as a workaround. Yeah, I don't think it would be at all wrong to have "private file descriptors". I'd prefer that over memory-based (for all the abstraction issues, and because a lot of things really *are* about file descriptors!). Linus
Re: Syslets, Threadlets, generic AIO support, v6
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > epoll is very much capable of doing it - but why bother if > something more flexible than a ring can be used and the performance > difference is negligible? (Read my other reply in this thread for > further points.) in particular i'd like to (re-)stress this point: Thirdly, our main problem was not the structure of epoll, our main problem was that event APIs were not widely available, so applications couldnt go to a pure event based design - they always had to handle certain types of event domains specially, due to lack of coverage. The latest epoll patches largely address that. This was a huge barrier against adoption of epoll. starting with putting limits into the design by going to over-smart data structures like rings is just stupid. Lets fix, enhance and speed up what we have now (epoll) so that it becomes ubiquitous, and _then_ we can extend epoll to maybe fill events into rings. We should have our priorities right and should stop rewriting the whole world, especially when it comes to user APIs. Right now we have _no_ event API with complete coverage, and that's far more of a problem than the actual micro-structure of the API. Ingo
Re: Syslets, Threadlets, generic AIO support, v6
* Jeff Garzik <[EMAIL PROTECTED]> wrote: > >>You should pick up the kevent work :) > > > >3 months ago i verified the published kevent vs. epoll benchmark and > >found that benchmark to be fatally flawed. When i redid it properly > >kevent showed no significant advantage over epoll. Note that i did > >those measurements _before_ the recent round of epoll speedups. So > >unless someone does believable benchmarks i consider kevent an > >over-hyped, mis-benchmarked complication to do something that epoll > >is perfectly capable of doing. > > You snipped the key part of my response, so I'll say it again: > > Event rings (a) most closely match what is going on in the hardware > and (b) often closely match what is going on in multi-socket, > event-driven software application. event rings are just pure data structures that describe a set of data, and they have advantages and disadvantages. For the record, we've already got direct experience with rings as software APIs: they were used for KAIO and they were an implementational and maintenance nightmare and nobody used them. Kevent might be better, but you make it sound as if it was a trivial design choice while it certainly isnt! Sure, for hardware interfaces like networking cards tx and rx rings are the best thing but that is apples to oranges: hardware itself is about _limited_ physical resources, matching a _limited_ data structure like a ring quite well. But for software APIs, the built-in limit of rings makes it a baroque data structure that has a fair share of disadvantages in addition to its obvious advantages. > This is not something epoll is capable of doing, at the present time. epoll is very much capable of doing it - but why bother if something more flexible than a ring can be used and the performance difference is negligible? (Read my other reply in this thread for further points.) but, for the record, syslets very much use a completion ring, so i'm not fundamentally opposed to the idea.
I just think it's seriously over-hyped, just like most other bits of the kevent approach. (Nor do we have to attach this to syslets and threadlets - kevents are an orthogonal approach not directly related to asynchronous syscalls - syslets/threadlets can make use of epoll just as much as they can make use of kevent APIs.) Ingo
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, May 30, 2007 at 10:54:00AM +0200, Ingo Molnar ([EMAIL PROTECTED]) wrote: > > * Evgeniy Polyakov <[EMAIL PROTECTED]> wrote: > > > I did not want to start with another round of ping-pong insults :), > > but, Ingo, you did not show that kevent works worse. I did show that > > sometimes it works better. It ranged from 0 to a 30% win in those tests; > > in the results Johann Bork presented, kevent and epoll behaved the same. In > > results I posted earlier, I said, that sometimes epoll behaved better, > > sometimes kevent. [...] > > let me refresh your recollection: > > http://lkml.org/lkml/2007/2/25/116 > > where you said: > > "But note, that on my athlon64 3500 test machine kevent is about 7900 > requests per second compared to 4000+ epoll, so expect a challenge." You can also find in those threads that I managed to run an epoll server on that machine with 9k requests per second, although that was not reproducible again. > for a long time you made much fuss about how kevents is so much better > and how epoll cannot perform and scale as well (you said various > arguments why that is supposedly so), and some people bought into the > performance argument and advocated kevent due to its supposed > performance and scalability advantages - while now we are down to "epoll > and kevent are break-even"? You just draw a picture you want to see. Even on the kevent page I have links to other people's benchmarks, which show how kevent behaves compared to epoll in their loads. _My_ tests showed a kevent performance win; you tuned my (possibly broken) epoll code and the results changed - this is the development process, where things are not obtained from thin air. > in my book that is way too much of a difference, it is (best-case) a way > too sloppy approach to something as fundamental as Linux's basic event > model and design, and it is also compounded by your continued "nothing > happened, really, lets move on" stance. Losing trust is easy, winning it > back is hard.
Let me reuse a phrase of yours: "expect a challenge". Well, I do not care much about what people think I did wrong or right. There are obviously bad and good ideas and implementations. I might be absolutely wrong about something, but that is the process of solving problems, which I really enjoy. I just want there to be no personal insults; if I made such things, shame on me :) > Ingo -- Evgeniy Polyakov
Re: Syslets, Threadlets, generic AIO support, v6
Ingo Molnar wrote: * Jeff Garzik <[EMAIL PROTECTED]> wrote: You should pick up the kevent work :) 3 months ago i verified the published kevent vs. epoll benchmark and found that benchmark to be fatally flawed. When i redid it properly kevent showed no significant advantage over epoll. Note that i did those measurements _before_ the recent round of epoll speedups. So unless someone does believable benchmarks i consider kevent an over-hyped, mis-benchmarked complication to do something that epoll is perfectly capable of doing. You snipped the key part of my response, so I'll say it again: Event rings (a) most closely match what is going on in the hardware and (b) often closely match what is going on in multi-socket, event-driven software application. To echo Uli and paraphrase an ad, "it's the interface, silly." This is not something epoll is capable of doing, at the present time. Jeff
Re: Syslets, Threadlets, generic AIO support, v6
* Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

> On Wed, May 30, 2007 at 10:42:52AM +0200, Ingo Molnar ([EMAIL PROTECTED]) wrote:
> > it is a serious flexibility issue that should not be ignored. The
> > unified fd space is a blessing on one hand because it's simple and
> > powerful, but it's also a curse because nested use of the fd space for
> > libraries is currently not possible. But it should be detached from any
> > fundamental question of kevent vs. epoll. (By improving library use of
> > file descriptors we'll improve the utility of all syscalls - by ducking
> > to a memory based API we only solve that particular event based usage.)
>
> There is another issue with file descriptors - userspace must dig into
> the kernel each time it wants to get a new set of events, while with a
> memory based approach it has them without doing so. After it has
> returned from the kernel and knows that there are some events, the
> kernel can add more of them into the ring (if there is room) and
> userspace will process them without additional syscalls.

Firstly, this is not a fundamental property of epoll. If we wanted to, it would be possible to extend epoll to fill in a ring of events from the wakeup handler. It's an incremental add-on to epoll that should not impact the design. How much info to put into a single event is another incremental thing - for most of the high-performance cases all the information we need is the type of the event and the fd it occurred on. Currently epoll supports that minimal approach.

Secondly, our current syscall overhead is below 0.1 usecs on recent hardware:

  dione:~/l> ./lat_syscall null
  Simple syscall: 0.0911 microseconds

so you need millions of events _per cpu_ for the syscall overhead to show up.
Thirdly, our main problem was not the structure of epoll; our main problem was that event APIs were not widely available, so applications couldn't go to a pure event based design - they always had to handle certain types of event domains specially, due to lack of coverage. That was a huge barrier against adoption of epoll. The latest epoll patches largely address it.

	Ingo
Re: Syslets, Threadlets, generic AIO support, v6
On Wed, May 30, 2007 at 10:42:52AM +0200, Ingo Molnar ([EMAIL PROTECTED]) wrote:
> it is a serious flexibility issue that should not be ignored. The
> unified fd space is a blessing on one hand because it's simple and
> powerful, but it's also a curse because nested use of the fd space for
> libraries is currently not possible. But it should be detached from any
> fundamental question of kevent vs. epoll. (By improving library use of
> file descriptors we'll improve the utility of all syscalls - by ducking
> to a memory based API we only solve that particular event based usage.)

There is another issue with file descriptors - userspace must dig into the kernel each time it wants to get a new set of events, while with a memory based approach it has them without doing so. After it has returned from the kernel and knows that there are some events, the kernel can add more of them into the ring (if there is room) and userspace will process them without additional syscalls.

Although the syscall overhead is very small, it does exist and should not be ignored in the design.

-- 
	Evgeniy Polyakov
Re: Syslets, Threadlets, generic AIO support, v6
* Evgeniy Polyakov <[EMAIL PROTECTED]> wrote:

> I did not want to start with another round of ping-pong insults :),
> but, Ingo, you did not show that kevent works worse. I did show that
> sometimes it works better. It ranged from a 0 to 30% win in those
> tests; in the results Johann Bork presented, kevent and epoll behaved
> the same. In results I posted earlier, I said that sometimes epoll
> behaved better, sometimes kevent. [...]

let me refresh your recollection:

  http://lkml.org/lkml/2007/2/25/116

where you said:

  "But note, that on my athlon64 3500 test machine kevent is about 7900
  requests per second compared to 4000+ epoll, so expect a challenge."

for a long time you made much fuss about how kevent is so much better and how epoll cannot perform and scale as well (you gave various arguments why that is supposedly so), and some people bought into the performance argument and advocated kevent due to its supposed performance and scalability advantages - while now we are down to "epoll and kevent are break-even"?

in my book that is way too much of a difference, it is (best-case) a way too sloppy approach to something as fundamental as Linux's basic event model and design, and it is also compounded by your continued "nothing happened, really, lets move on" stance. Losing trust is easy, winning it back is hard.

Let me reuse a phrase of yours: "expect a challenge".

	Ingo
Re: Syslets, Threadlets, generic AIO support, v6
* Ulrich Drepper <[EMAIL PROTECTED]> wrote:

> Ingo Molnar wrote:
> > 3 months ago i verified the published kevent vs. epoll benchmark and
> > found that benchmark to be fatally flawed. When i redid it properly
> > kevent showed no significant advantage over epoll.
>
> I'm not going to judge your tests but saying there are no significant
> advantages is too one-sided. There is one huge advantage: the
> interface. A memory-based interface is simply the best form. File
> descriptors are a resource the runtime cannot transparently consume.

yeah - this is a fundamental design question for Linus i guess :-)

glibc (and other infrastructure libraries) have a fundamental problem: they cannot (and do not) presently use persistent file descriptors to make use of kernel functionality, due to ABI side-effects. [applications can dup into an fd used by glibc, applications can close it - shells close fds blindly for example, etc.] Today glibc simply cannot open a file descriptor and keep it open while application code is running, due to these problems.

we should perhaps enable glibc to have its separate fd namespace (or 'hidden' file descriptors at the upper end of the fd space) so that it can transparently listen to netlink events (or do epoll) without impacting the application fd namespace - instead of ducking to a memory based API as a workaround.

it is a serious flexibility issue that should not be ignored. The unified fd space is a blessing on one hand because it's simple and powerful, but it's also a curse because nested use of the fd space for libraries is currently not possible. But it should be detached from any fundamental question of kevent vs. epoll. (By improving library use of file descriptors we'll improve the utility of all syscalls - by ducking to a memory based API we only solve that particular event based usage.)
	Ingo
Re: Syslets, Threadlets, generic AIO support, v6
Hi Ingo, developers.

On Wed, May 30, 2007 at 09:20:55AM +0200, Ingo Molnar ([EMAIL PROTECTED]) wrote:
> * Jeff Garzik <[EMAIL PROTECTED]> wrote:
>
> > You should pick up the kevent work :)
>
> 3 months ago i verified the published kevent vs. epoll benchmark and
> found that benchmark to be fatally flawed. When i redid it properly
> kevent showed no significant advantage over epoll. Note that i did those
> measurements _before_ the recent round of epoll speedups. So unless
> someone does believable benchmarks i consider kevent an over-hyped,
> mis-benchmarked complication to do something that epoll is perfectly
> capable of doing.

I did not want to start with another round of ping-pong insults :), but, Ingo, you did not show that kevent works worse. I did show that sometimes it works better. It ranged from a 0 to 30% win in those tests; in the results Johann Bork presented, kevent and epoll behaved the same. In results I posted earlier, I said that sometimes epoll behaved better, sometimes kevent.

What does that say? Just the fact that under that given workload the result was the one we saw. Nothing more, nothing less. It does not show that something is broken, and definitely not that it is:

citation1:
  we're heading to yet-another monolithic interface, we're heading with
  no valid reasons given other than some handwaving.

citation2:
  consider kevent an over-hyped, mis-benchmarked complication to do
  something that epoll is perfectly [capable of doing]

Taking into account the other features kevent has - and what it was originally designed for: network AIO, which is quite hard (if ever possible) with files and epoll; I'm not talking about syslets as AIO, that is a different and likely simpler approach, and even the network AIO part alone is already very good - it is not what people said in the above citations.

It looks like you have some personal issues with that, which I do not understand.
But that has nothing to do with the technical side of the problem, so let's stop such rhetoric, concentrate on the real problem, and forget any possible personal issues which might be raised sometimes :). Although I closed the kevent and eventfs projects, I would gladly continue if we can and want to have progress in that area.

Thanks.

> 	Ingo

-- 
	Evgeniy Polyakov
Re: Syslets, Threadlets, generic AIO support, v6
On Tue, May 29 2007, Zach Brown wrote:

Thanks for picking this up, Zach!

> - cfq gets confused, share io_context amongst threads?

Yeah, it'll confuse CFQ a lot actually. The threads either need to share an io context (the clean approach, but it will introduce locking for things that were previously lockless), or CFQ needs to get better support for cooperating processes. The problem is that CFQ will wait for a dependent IO from a given process, which may instead arrive from a totally unrelated process.

For the fio testing, we can make some improvements there. Right now you don't get any concurrency of the io requests if you set e.g. iodepth=32, as the 32 requests will be submitted as a linked chain of atoms. For io saturation, that's not really what you want.

I'll take a stab at improving both of the above.

-- 
Jens Axboe
Re: Syslets, Threadlets, generic AIO support, v6
Ingo Molnar wrote:
> 3 months ago i verified the published kevent vs. epoll benchmark and
> found that benchmark to be fatally flawed. When i redid it properly
> kevent showed no significant advantage over epoll.

I'm not going to judge your tests but saying there are no significant advantages is too one-sided. There is one huge advantage: the interface. A memory-based interface is simply the best form. File descriptors are a resource the runtime cannot transparently consume.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: Syslets, Threadlets, generic AIO support, v6
* Zach Brown <[EMAIL PROTECTED]> wrote:

> > Having async request and response rings would be quite useful, and
> > most closely match what is going on under the hood in the kernel and
> > hardware.
>
> Yeah, but I have lots of competing thoughts about this.

note that async request and response rings are implemented already in essence: that's how FIO uses syslets. The linked list of syslet atoms is the 'request ring' (it's just that 'ring' is not a hard-enforced data structure - you can use other request formats too), and the completion ring is the 'response ring'.

	Ingo
Re: Syslets, Threadlets, generic AIO support, v6
* Jeff Garzik <[EMAIL PROTECTED]> wrote:

> You should pick up the kevent work :)

3 months ago i verified the published kevent vs. epoll benchmark and found that benchmark to be fatally flawed. When i redid it properly kevent showed no significant advantage over epoll. Note that i did those measurements _before_ the recent round of epoll speedups. So unless someone does believable benchmarks i consider kevent an over-hyped, mis-benchmarked complication to do something that epoll is perfectly capable of doing.

	Ingo
Re: Syslets, Threadlets, generic AIO support, v6
On Tue, May 29, 2007 at 04:20:04PM -0700, Ulrich Drepper wrote:
> Zach Brown wrote:
> > That todo item about producing documentation and distro kernels is
> > specifically to bait Uli into trying to implement posix aio on top
> > of syslets in glibc.
>
> Get DaveJ to pick up the code for Fedora kernels and I'll get to it.

With F7 out the door, I'm looking at getting devel/ back in shape again, so I can get something done there soon-ish. With the usual caveat that if this isn't upstream by the time we do a release, we'll have to drop it due to the added syscall. (Maybe we can just get that reserved upstream now?)

	Dave

-- 
http://www.codemonkey.org.uk
Re: Syslets, Threadlets, generic AIO support, v6
Zach Brown wrote:
> That todo item about producing documentation and distro kernels is
> specifically to bait Uli into trying to implement posix aio on top of
> syslets in glibc.

Get DaveJ to pick up the code for Fedora kernels and I'll get to it.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: Syslets, Threadlets, generic AIO support, v6
> You should pick up the kevent work :)

I haven't looked at it in a while but yes, it's "on the radar" :).

> Having async request and response rings would be quite useful, and most
> closely match what is going on under the hood in the kernel and hardware.

Yeah, but I have lots of competing thoughts about this.

For the time being I'm focusing on simplifying the mechanisms that support the sys_io_*() interface so I never ever have to debug fs/aio.c (also known as chewing glass to those of us with the scars) again.

That said, I'll gladly work closely with developers who are seriously considering putting some next gen interface to the test. That todo item about producing documentation and distro kernels is specifically to bait Uli into trying to implement posix aio on top of syslets in glibc.

'cause we can go back and forth about potential interfaces for, well, how long has it been? years? I want non-trivial users who we can measure so we can *stop* designing and implementing the moment something is good enough for them.

- z
Re: Syslets, Threadlets, generic AIO support, v6
> .. so don't keep us in suspense. Do you have any numbers for anything
> (like Oracle, to pick a random thing out of thin air ;) that might
> actually indicate whether this actually works or not?

I haven't gotten to running Oracle's database against it. It is going to be Very Cranky if O_DIRECT writes aren't concurrent, and that's going to take a bit of work in fs/direct-io.c.

I've done initial micro-benchmarking runs for basic sanity testing with fio. They haven't wildly regressed; that's about as much as can be said with confidence so far :).

Take a streaming O_DIRECT read. 1meg requests, 64 in flight.

  str: (g=0): rw=read, bs=1M-1M/1M-1M, ioengine=libaio, iodepth=64

  mainline:
    read : io=3,405MiB, bw=97,996KiB/s, iops=93, runt= 36434msec
  aio+syslets:
    read : io=3,452MiB, bw=99,115KiB/s, iops=94, runt= 36520msec

That's on an old gigabit copper FC array with 10 drives behind a, no seriously, qla2100. The real test is the change in memory and cpu consumption, and I haven't modified fio to take reasonably precise measurements of those yet. Once I get O_DIRECT writes concurrent that'll be the next step.

I was pleased to see my motivation for the patches, avoiding having to add specific support for operations to be called from fs/aio.c, work out.
Take the case of 4k random buffered reads from a block device with a cold cache:

  read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64

  mainline:
    read : io=16,116KiB, bw=457KiB/s, iops=111, runt= 36047msec
      slat (msec): min=4, max= 629, avg=563.17, stdev=71.92
      clat (msec): min=0, max=0, avg= 0.00, stdev= 0.00
  aio+syslets:
    read : io=125MiB, bw=3,634KiB/s, iops=887, runt= 36147msec
      slat (msec): min=0, max=3, avg= 0.00, stdev= 0.08
      clat (msec): min=2, max= 643, avg=71.59, stdev=74.25
  aio+syslets w/o cfq:
    read : io=208MiB, bw=6,057KiB/s, iops=1,478, runt= 36071msec
      slat (msec): min=0, max= 15, avg= 0.00, stdev= 0.09
      clat (msec): min=2, max= 758, avg=42.75, stdev=37.33

Everyone step back and thank Jens for writing a tool that gives us interesting data without us always having to craft some stupid specific test each and every time. Thanks, Jens!

In the mainline numbers fio clearly shows the buffered read submissions being handled synchronously. The mainline buffered IO paths don't know to identify and work with iocbs, so requests are handled in series. In the +syslet numbers we see __async_schedule() catching the blocking buffered read, letting the submission proceed asynchronously. We get async behaviour without having to touch any of the buffered IO paths. Then we turn off cfq and we actually start to saturate the (relatively ancient) drives :).

I need to mail Jens about that cfq behaviour, but I'm guessing it's expected behaviour of a sort - each syslet thread gets its own io_context instead of inheriting it from its parent.

- z
Re: Syslets, Threadlets, generic AIO support, v6
Zach Brown wrote:
> I'm pleased to announce the availability of version 6 of the syslet
> subsystem. Ingo and I agreed that I'll handle syslet releases while
> he's busy with CFS.
>
> [full v6 announcement snipped]
You should pick up the kevent work :)

Having async request and response rings would be quite useful, and most closely match what is going on under the hood in the kernel and hardware.

	Jeff
Syslets, Threadlets, generic AIO support, v6
I'm pleased to announce the availability of version 6 of the syslet subsystem. Ingo and I agreed that I'll handle syslet releases while he's busy with CFS. I copied the cc: list from Ingo's v5 announcement. If you'd like to be dropped (or added), please let me know.

The v6 patch series against 2.6.21 can be downloaded from:

  http://oss.oracle.com/~zab/syslets/v6/

Example applications and previous syslet releases can be found at:

  http://people.redhat.com/~mingo/syslet-patches/

The syslet subsystem aims to provide user-space with an efficient interface for managing the asynchronous submission and completion of existing system calls. The only changes since v5 are small changes that I made to support the experimental aio patch described below.

My syslet subsystem todo list is as follows, in no particular order:

 - replace WARN_ON() calls with error handling or avoidance
 - split the x86_64-async.patch into more specific patches
 - investigate integration with ptrace
 - investigate rare ./syslet-test cpu spinning
 - provide distro kernel rpms and documentation for developers
 - compat design problems, still? http://lkml.org/lkml/2007/3/7/523

Included in this patch series is an experimental patch which reworks fs/aio.c to reuse the syslet subsystem to process iocb requests from user space. The intent of this work is to simplify the code and broaden aio functionality. Many issues need to be addressed before this aio work could be merged:

 - support cancellation by sending signals to async_threads
 - figure out what to do about signals from handlers, like SIGXFSZ
 - verify that heavy loads do not consume excessive cpu or memory
 - concurrent dio writes
 - cfq gets confused, share io_context amongst threads?
 - restrict allowed operations like the .aio_{r,w} methods used to

More details on this work in progress can be found in the patch. Any and all feedback is welcome and encouraged!
- z
Re: Syslets, Threadlets, generic AIO support, v6
On Tue, 29 May 2007, Zach Brown wrote:
>
> Included in this patch series is an experimental patch which reworks
> fs/aio.c to reuse the syslet subsystem to process iocb requests from
> user space. The intent of this work is to simplify the code and
> broaden aio functionality.

.. so don't keep us in suspense. Do you have any numbers for anything (like Oracle, to pick a random thing out of thin air ;) that might actually indicate whether this actually works or not?

Or is it just so experimental that no real program that uses aio can actually work yet?

		Linus