Re: 2.6.24-rc3: find complains about /proc/net
Roland McGrath wrote:
> Oh, it seems it has indeed been that way for a very long time, so I
> was mistaken. It still seems a little odd to me. Ulrich can say
> definitively whether the kind of concern I mentioned really matters
> one way or the other for glibc.

glibc (at least NPTL) cannot survive if somebody uses funny CLONE_* flags to separate various pieces of information, e.g., file descriptors. So all the information in each thread's /proc/self should be identical. When the information is not the same, the current semantics seem to be more useful. So I guess no change is the way to go here.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
network interface state
Just FYI, with the current getaddrinfo code it is even more critical to get to a point where I can cache network interface information and query the kernel whether it changed. We now have to read the RTM_GETADDR tables for every lookup. It was more limited with the old, incomplete implementation. Even something as simple as an RTM_SEQUENCE request, which returns a number that is bumped at every interface change, would do.

Related: I need to know the device type (the ARPHRD_* values) to determine whether a device is for a native transport or a tunnel. What I currently do is:

- at the beginning I get information about all interfaces using RTM_GETADDR

- then later I have to find the device type by

  + reading the RTM_GETLINK data to get to the device name

  + then using the name and ioctl(SIOCGIFHWADDR) to get the device type

It would be so much nicer if the device type were part of the RTM_GETADDR data, or at least the RTM_GETLINK data.

Any help on any of these issues?

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: network interface state
David Miller wrote:
> Most daemons handle this by listening for events on the netlink
> socket, but I understand how that might not be practical for glibc.

Right, this cannot work. I have no inner loop which I can control. I cannot install a listener. At some point, when we have non-sequential, hidden file descriptors, I'll be able to leave a socket file descriptor open. But that's about it. Even then the generation counter interface is likely to be the best choice.

> It's part of the link information. Look in ifinfomsg->ifi_type.

Great, I fixed up the code. I guess in future, once I can cache the data, I'll simply read the RTM_GETADDR and RTM_GETLINK data all at once and be done with it.

BTW, is it possible to send both these requests out before starting to read the results? This would reduce the amount of code quite a bit.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
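The ifi_type lookup mentioned above can be sketched as follows. This is a minimal illustration, not the glibc code: find_link_type is a hypothetical helper name, and it assumes the caller has already received the RTM_GETLINK dump replies into a buffer. It only uses the standard NLMSG_* macros and struct ifinfomsg from the kernel headers.

```c
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical helper: scan a buffer of netlink replies (as received
   after an RTM_GETLINK dump request) and return the ARPHRD_* device type
   of the interface with index IFINDEX, or -1 if it is not present.  */
int
find_link_type (const void *buf, int len, int ifindex)
{
  const struct nlmsghdr *nh = buf;

  for (; NLMSG_OK (nh, len); nh = NLMSG_NEXT (nh, len))
    {
      if (nh->nlmsg_type == NLMSG_DONE)
        break;
      /* Skip anything that is not a complete link message.  */
      if (nh->nlmsg_type != RTM_NEWLINK
          || nh->nlmsg_len < NLMSG_LENGTH (sizeof (struct ifinfomsg)))
        continue;

      const struct ifinfomsg *ifi = NLMSG_DATA (nh);
      if (ifi->ifi_index == ifindex)
        return ifi->ifi_type;
    }
  return -1;
}
```

With this, the device name and the ioctl(SIOCGIFHWADDR) round trip are not needed just to learn the type; the interface index from the address data is enough to look it up in the link data.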
Re: bind and O_NONBLOCK
Evgeniy Polyakov wrote:
> So, did I understand you correctly, that you want to introduce network
> AIO here? (for example on behalf of work queue or something else?)

See Alan's mail. All this was his proposal, I just got it accepted upstream. The problem to solve is the case of a distributed network port set. Apparently NetBIOS has one, but I could also imagine this being useful in cluster implementations which have to appear as one machine. In this case, before binding to a given port, you have to make sure no other machine already handles this port.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: bind and O_NONBLOCK
Evgeniy Polyakov wrote:
> Could you point to the original Alan's proposal, I only found short
> note (as in you original mail) at opengroup.org and failed to
> correctly googlify it in the web.

There was no public mail. I asked RH engineering for proposals for changes to the POSIX spec and Alan replied.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
bind and O_NONBLOCK
Some time back Alan asked about adding O_NONBLOCK support to bind in the POSIX spec. I brought this up and the following text will be in the next revision of the POSIX spec:

===
If the socket address cannot be assigned immediately and O_NONBLOCK is set for the file descriptor for the socket, bind() shall fail and set errno to [EINPROGRESS], but the assignment request shall not be aborted, and the assignment shall be completed asynchronously. Subsequent calls to bind() for the same socket, before the assignment is completed, shall fail and set errno to [EALREADY].

When the assignment has been performed asynchronously, pselect(), select(), and poll() shall indicate that the file descriptor for the socket is ready for reading and writing.
===

It would be ideal if we had such an implementation in the next few months so that we can, at least in theory, check whether the text in the specification makes sense.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
drop association of connection-less socket
The Linux man page for connect(2) currently says:

    Connectionless sockets may dissolve the association by connecting to an address with the sa_family member of sockaddr set to AF_UNSPEC.

No such wording is in the POSIX definition, which only says:

    If address is a null address for the protocol, the socket's peer address shall be reset.

This is not the same but seems to be what Linux implements.

The problem is that I tried to reuse a socket which has been associated with an IPv6 address to later connect to an IPv4 address. This is part of the getaddrinfo implementation and an effort to make it more efficient. strace's output looks like this:

connect(3, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2001:11b8:1:0:207:e94f:ee7c:4b72", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 ENETUNREACH (Network is unreachable)
connect(3, {sa_family=AF_UNSPEC, sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 28) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.1.72")}, 16) = 0

I.e., despite what the man page says, the second connect only reset the address, as required by the POSIX spec. It did not reset the address family of the socket.

What I ideally would like to see is what the Linux man page says. I.e., if the .sa_family field is AF_UNSPEC, everything (the address and the address family) is reset. Otherwise only the address association itself is reset.

Is this functionality which got lost over time? Or is the man page wrong and this never was the case? Is this a worthwhile change?

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: drop association of connection-less socket
I guess the request is not that useful. The family of the socket is determined earlier, so undoing this takes more of an effort. I managed to get by for most cases without this change, so no action is needed.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
follow-up: discrepancy with POSIX
As a follow-up to my question from yesterday on the netdev list, here is what I think is a real problem, either in the kernel or in the POSIX spec.

The POSIX spec currently says this about SOCK_DGRAM sockets:

    If address is a null address for the protocol, the socket's peer address shall be reset.

The term "null address" is not further specified but it will usually be read to allow the following scenario to work out:

    fd = socket(AF_INET6, ...)
    connect(fd, ...some IPv6 address...)
    struct sockaddr_in6 sin6 = { .sin6_family = AF_INET6 };
    connect(fd, &sin6, sizeof (sin6));
    connect(fd, ...some new IPv6 address...)

This does not work on Linux at the moment. The socket remains connected to the old IPv6 address but the second connect() call does succeed (this does not sound OK).

What does work is if the connect call to disassociate the address uses AF_UNSPEC instead of AF_INET6.

The question is: do people here think this is a problem in the POSIX spec? Binding to :: and 0.0.0.0 isn't possible, so maybe the Linux implementation should allow this?

If you think the POSIX spec is wrong (and can point to other implementations doing the same as Linux) let me know and I'll work on getting the spec changed.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: follow-up: discrepancy with POSIX
Andi Kleen wrote:
> The standard way to undo connect is to use AF_UNSPEC. Code to handle
> that for dgram sockets is there. It's the same code for v4 and v6.

I quoted the standard and it does not say anything about AF_UNSPEC. So you cannot simply make such broad statements.

I also don't say that this behavior should be removed. It's certainly useful, very much so in fact. But the spec calls for a null address to be used and that's in my understanding something different from using AF_UNSPEC.

I looked through Stevens' TCP/IP Illustrated Vol 2 and it seems not to mention resetting the address at all. The POSIX spec certainly got this text from .1g.

I cannot test it on other systems. If somebody has access to some certified systems (and maybe others), write a bit of code which creates a DGRAM socket, connects to one address, calls connect with a null address, then connects to another address (which likely has to use a different interface since otherwise the connect will just succeed, it seems).

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: follow-up: discrepancy with POSIX
Ulrich Drepper wrote:
> Yes, but for IPv4/6 it's not an issue. Some implementations might
> handle all-zeros and the spec _currently_ calls for it. In this case
> an alignment would be good.

Searching the web shows up this:

http://developer.apple.com/documentation/Darwin/Reference/ManPages/man2/connect.2.html

    Datagram sockets may dissolve the association by connecting to an invalid address, such as a null address or an address with the address family set to AF_UNSPEC (the error EAFNOSUPPORT will be harmlessly returned).

I.e., at least Apple implements both variants.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: follow-up: discrepancy with POSIX
Andi Kleen wrote:
>> But the spec calls for a null address to be used and that's in my
>> understanding something different from using AF_UNSPEC.
>
> memset(sockaddr, 0, sizeof(sockaddr)) should give you AF_UNSPEC

But the spec calls for a "null address for the protocol". That means the family for the null address is the same as the family of the socket.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: follow-up: discrepancy with POSIX
Andi Kleen wrote:
> Spec doesn't match traditional behaviour then.

Well, determining whether that's the case is part of this exercise.

> IPv4 0.0.0.0 is traditionally an synonym for old style all broadcast
> (255.255.255.255) on UDP/RAW and it's certainly possible to connect()
> to that.

Where do you get this from? And where is this implemented? I don't doubt it but I have to convince people to change the standard and possibly introduce incompatibility.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: follow-up: discrepancy with POSIX
David Miller wrote:
> It just occured to me that AF_UNSPEC might be used simply because all
> zeros might be a valid real bindable address for some address family.
> And using AF_UNSPEC avoids that problem entirely.

Yes, but for IPv4/6 it's not an issue. Some implementations might handle all-zeros and the spec _currently_ calls for it. In this case an alignment would be good.

I guess I'll just go ahead and file a problem report with the spec. Maybe the Unix vendors will test their implementations and provide feedback.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
[PATCH] O_CLOEXEC for SCM_RIGHTS
Part two in the O_CLOEXEC saga: adding support for file descriptors received through Unix domain sockets.

The patch is once again pretty minimal: it introduces a new flag for recvmsg and passes it just like the existing MSG_CMSG_COMPAT flag. I think this bit is not used otherwise but the networking people will know better.

This new flag is not recognized by recvfrom and recv. These functions cannot be used for that purpose and the asymmetry this introduces is not worse than the already existing MSG_CMSG_COMPAT situation.

The patch must be applied on top of the patch which introduced O_CLOEXEC. It has to remove static from the new get_unused_fd_flags function, but since scm.c cannot live in a module the function still does not have to be exported.

Here's a test program to make sure the code works. It's so much longer than the actual patch...

#include <errno.h>
#include <error.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#ifndef O_CLOEXEC
# define O_CLOEXEC 02000000
#endif
#ifndef MSG_CMSG_CLOEXEC
# define MSG_CMSG_CLOEXEC 0x40000000
#endif

int
main (int argc, char *argv[])
{
  if (argc > 1)
    {
      /* Re-executed copy: the descriptor must not have survived exec.  */
      int fd = atol (argv[1]);
      printf ("child: fd = %d\n", fd);
      if (fcntl (fd, F_GETFD) == 0 || errno != EBADF)
        {
          puts ("file descriptor valid in child");
          return 1;
        }
      return 0;
    }

  struct sockaddr_un sun;
  strcpy (sun.sun_path, "./testsocket");
  sun.sun_family = AF_UNIX;

  char databuf[] = "hello";
  struct iovec iov[1];
  iov[0].iov_base = databuf;
  iov[0].iov_len = sizeof (databuf);

  union
  {
    struct cmsghdr hdr;
    char bytes[CMSG_SPACE (sizeof (int))];
  } buf;
  struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 1,
                        .msg_control = buf.bytes,
                        .msg_controllen = sizeof (buf) };
  struct cmsghdr *cmsg = CMSG_FIRSTHDR (&msg);

  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN (sizeof (int));

  msg.msg_controllen = cmsg->cmsg_len;

  pid_t child = fork ();
  if (child == -1)
    error (1, errno, "fork");
  if (child == 0)
    {
      /* Server side: accept one connection and pass a descriptor.  */
      int sock = socket (PF_UNIX, SOCK_STREAM, 0);
      if (sock < 0)
        error (1, errno, "socket");

      if (bind (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0)
        error (1, errno, "bind");
      if (listen (sock, SOMAXCONN) < 0)
        error (1, errno, "listen");

      int conn = accept (sock, NULL, NULL);
      if (conn == -1)
        error (1, errno, "accept");

      *(int *) CMSG_DATA (cmsg) = sock;
      if (sendmsg (conn, &msg, MSG_NOSIGNAL) < 0)
        error (1, errno, "sendmsg");

      return 0;
    }

  /* For a test suite this should be more robust like a
     barrier in shared memory.  */
  sleep (1);

  int sock = socket (PF_UNIX, SOCK_STREAM, 0);
  if (sock < 0)
    error (1, errno, "socket");

  if (connect (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0)
    error (1, errno, "connect");
  unlink (sun.sun_path);

  *(int *) CMSG_DATA (cmsg) = -1;

  if (recvmsg (sock, &msg, MSG_CMSG_CLOEXEC) < 0)
    error (1, errno, "recvmsg");

  int fd = *(int *) CMSG_DATA (cmsg);
  if (fd == -1)
    error (1, 0, "no descriptor received");

  char fdname[20];
  snprintf (fdname, sizeof (fdname), "%d", fd);
  execl ("/proc/self/exe", argv[0], fdname, NULL);
  puts ("execl failed");
  return 1;
}

Signed-off-by: Ulrich Drepper [EMAIL PROTECTED]

--- a/fs/open.c
+++ b/fs/open.c
@@ -855,7 +855,7 @@
 /*
  * Find an empty file descriptor entry, and mark it busy.
  */
-static int get_unused_fd_flags(int flags)
+int get_unused_fd_flags(int flags)
 {
 	struct files_struct * files = current->files;
 	int fd, error;
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -73,6 +73,7 @@ extern struct file * FASTCALL(fget_light(unsigned int fd, int *fput_needed));
 extern void FASTCALL(set_close_on_exec(unsigned int fd, int flag));
 extern void put_filp(struct file *);
 extern int get_unused_fd(void);
+extern int FASTCALL(get_unused_fd_flags(int flags));
 extern void FASTCALL(put_unused_fd(unsigned int fd));
 
 struct kmem_cache;
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -253,6 +253,9 @@ struct ucred {
 
 #define MSG_EOF         MSG_FIN
 
+#define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exit for file
+					   descriptor received through
+					   SCM_RIGHTS */
 #if defined(CONFIG_COMPAT)
 #define MSG_CMSG_COMPAT	0x80000000	/* This message needs 32 bit fixups */
 #else
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -228,7 +228,8 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
 		err = security_file_receive(fp[i]);
 		if (err)
 			break;
-		err = get_unused_fd();
+		err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
+					  ? O_CLOEXEC : 0
[PATCH] V2: O_CLOEXEC for SCM_RIGHTS
Take two: I forgot to change the compat code. This has now happened. Only one additional line changed. Everything else from the first patch remains the same. I try to avoid clogging the list unnecessarily by not resending the test program.

Signed-off-by: Ulrich Drepper [EMAIL PROTECTED]

--- a/fs/open.c
+++ b/fs/open.c
@@ -855,7 +855,7 @@
 /*
  * Find an empty file descriptor entry, and mark it busy.
  */
-static int get_unused_fd_flags(int flags)
+int get_unused_fd_flags(int flags)
 {
 	struct files_struct * files = current->files;
 	int fd, error;
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -73,6 +73,7 @@ extern struct file * FASTCALL(fget_light(unsigned int fd, int *fput_needed));
 extern void FASTCALL(set_close_on_exec(unsigned int fd, int flag));
 extern void put_filp(struct file *);
 extern int get_unused_fd(void);
+extern int FASTCALL(get_unused_fd_flags(int flags));
 extern void FASTCALL(put_unused_fd(unsigned int fd));
 
 struct kmem_cache;
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -253,6 +253,9 @@ struct ucred {
 
 #define MSG_EOF         MSG_FIN
 
+#define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exit for file
+					   descriptor received through
+					   SCM_RIGHTS */
 #if defined(CONFIG_COMPAT)
 #define MSG_CMSG_COMPAT	0x80000000	/* This message needs 32 bit fixups */
 #else
--- a/net/compat.c
+++ b/net/compat.c
@@ -276,7 +276,8 @@ void scm_detach_fds_compat(struct msghdr *kmsg, struct scm_cookie *scm)
 		err = security_file_receive(fp[i]);
 		if (err)
 			break;
-		err = get_unused_fd();
+		err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
+					  ? O_CLOEXEC : 0);
 		if (err < 0)
 			break;
 		new_fd = err;
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -228,7 +228,8 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
 		err = security_file_receive(fp[i]);
 		if (err)
 			break;
-		err = get_unused_fd();
+		err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
+					  ? O_CLOEXEC : 0);
 		if (err < 0)
 			break;
 		new_fd = err;
--- a/net/socket.c
+++ b/net/socket.c
@@ -1939,9 +1939,7 @@ asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg,
 	total_len = err;
 
 	cmsg_ptr = (unsigned long)msg_sys.msg_control;
-	msg_sys.msg_flags = 0;
-	if (MSG_CMSG_COMPAT & flags)
-		msg_sys.msg_flags = MSG_CMSG_COMPAT;
+	msg_sys.msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
 
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
[PATCH] V3: O_CLOEXEC for SCM_RIGHTS
Take two: I forgot to change the compat code. This has now happened. Only one additional line changed. Everything else from the first patch remains the same. I try to avoid clogging the list unnecessarily by not resending the test program.

Signed-off-by: Ulrich Drepper [EMAIL PROTECTED]

--- a/fs/open.c
+++ b/fs/open.c
@@ -855,7 +855,7 @@
 /*
  * Find an empty file descriptor entry, and mark it busy.
  */
-static int get_unused_fd_flags(int flags)
+int get_unused_fd_flags(int flags)
 {
 	struct files_struct * files = current->files;
 	int fd, error;
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -73,6 +73,7 @@ extern struct file * FASTCALL(fget_light(unsigned int fd, int *fput_needed));
 extern void FASTCALL(set_close_on_exec(unsigned int fd, int flag));
 extern void put_filp(struct file *);
 extern int get_unused_fd(void);
+extern int FASTCALL(get_unused_fd_flags(int flags));
 extern void FASTCALL(put_unused_fd(unsigned int fd));
 
 struct kmem_cache;
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -253,6 +253,9 @@ struct ucred {
 
 #define MSG_EOF         MSG_FIN
 
+#define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exit for file
+					   descriptor received through
+					   SCM_RIGHTS */
 #if defined(CONFIG_COMPAT)
 #define MSG_CMSG_COMPAT	0x80000000	/* This message needs 32 bit fixups */
 #else
--- a/net/compat.c
+++ b/net/compat.c
@@ -276,7 +276,8 @@ void scm_detach_fds_compat(struct msghdr *kmsg, struct scm_cookie *scm)
 		err = security_file_receive(fp[i]);
 		if (err)
 			break;
-		err = get_unused_fd();
+		err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
+					  ? O_CLOEXEC : 0);
 		if (err < 0)
 			break;
 		new_fd = err;
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -228,7 +228,8 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
 		err = security_file_receive(fp[i]);
 		if (err)
 			break;
-		err = get_unused_fd();
+		err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
+					  ? O_CLOEXEC : 0);
 		if (err < 0)
 			break;
 		new_fd = err;
--- a/net/socket.c
+++ b/net/socket.c
@@ -1939,9 +1939,7 @@ asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg,
 	total_len = err;
 
 	cmsg_ptr = (unsigned long)msg_sys.msg_control;
-	msg_sys.msg_flags = 0;
-	if (MSG_CMSG_COMPAT & flags)
-		msg_sys.msg_flags = MSG_CMSG_COMPAT;
+	msg_sys.msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
 
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;
Re: [RFC] [GIT PATCH net-2.6.23] IPV6: Configurable IPv6 address selection policy table (RFC3484)
David Miller wrote:
> One idea is to have glibc have some kind of socket open, subscribed to
> a group which gets sticky events.

I don't quite yet know the context but I have to intervene: keeping sockets open is not good. This will only cause problems. Any interface must be memory based. Something like "register a word which is set when an event arrives" is a much better interface. How you then go and retrieve messages is another issue. If this is a rare event then opening a new netlink socket is no problem.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC] [GIT PATCH net-2.6.23] IPV6: Configurable IPv6 address selection policy table (RFC3484)
David Miller wrote:
> Something more scalable has to be used.

This is where the shared-memory based event notification comes in. It was always also meant to be used for things like this.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [take35 0/10] kevent: Generic event handling mechanism.
Evgeniy Polyakov wrote:
> I think that mean that everybody is happy with APi, design and set of
> features.

No comment means that I still have not been able to test anything since, regardless of what version I tried, it failed to build.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [take24 0/6] kevent: Generic event handling mechanism.
Evgeniy Polyakov wrote:
> Why do we want to inject _ready_ event, when it is possible to mark
> event as ready and wakeup thread parked in syscall?

Going back to this old one:

How do you want to mark an event ready if you don't want to introduce yet another layer of data structures? The event notification happens through entries in the ring buffer. Userlevel code should never add anything to the ring buffer directly; this would mean huge synchronization problems. Yes, one could add additional data structures accompanying the ring buffer which can specify userlevel-generated events. But this is a) clumsy and b) a pain to use when the same ring buffer is used in multiple threads (you'd have to have another shared memory segment).

It's much cleaner if the userlevel code can get the kernel to inject a userlevel-generated event. This is the equivalent of userlevel code generating a signal with kill().

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: Kevent POSIX timers support.
Evgeniy Polyakov wrote:
>> We need to pass the data in the sigev_value member of the struct
>> sigevent structure passed to timer_create to the caller. I don't see
>> it being done here nor when the timer is created. Do I miss
>> something? The sigev_value value should be stored in the user/ptr
>> member of struct ukevent.
>
> sigev_value was stored in k_itimer structure, I just do not know where
> to put it in the ukevent provided to userspace - it can be placed in
> pointer value if you like.

sigev_value is a union and the largest element is a pointer. So, transporting the pointer value is sufficient and it should be passed up to the user in the ptr member of struct ukevent.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [take24 0/6] kevent: Generic event handling mechanism.
Evgeniy Polyakov wrote:
> With provided patch it is possible to wakeup 'for-free' - just call
> kevent_ctl(ready) with zero number of ready events, so thread will be
> awakened if it was in poll(kevent_fd), kevent_wait() or
> kevent_get_events().

Yes, I realize that. But I wrote something else:

>> Rather than mark an existing entry as ready, how about a call to
>> inject a new ready event?
>>
>> This would be useful to implement functionality at userlevel and
>> still use an event queue to announce the availability. Without this
>> type of functionality we'd need to use indirect notification via
>> signal or pipe or something like that.

This is still something which is wanted.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: Kevent POSIX timers support.
David Miller wrote:
> Now we'll have to have a compat layer for 32-bit/64-bit environments
> thanks to POSIX timers, which is rediculious.

We already have compat_sys_timer_create. It should be sufficient just to add the conversion (if anything new is needed) there. The pointer value can be passed to userland in one or two int fields, I don't really care. When reporting the event to the user code we cannot just point into the ring buffer anyway. So while copying the data we can rewrite it if necessary. I see no need to complicate the code more than it already is.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [take24 0/6] kevent: Generic event handling mechanism.
. But the signal mask is something completely different and completely independent from the signal queue. There is nothing in the kevent interface to replace that functionality. Nor should this be possible with the events; only a sigset_t parameter to kevent_wait makes sense.

> Having sigmask parameter is the same as creating kevent signal
> delivery.

No, no, no. Not at all.

>> Surely you don't suggest keeping your original timer patch?
>
> Of course not - kevent timers are more scalable than posix timers (the
> latter uses idr, which is slower than balanced binary tree, since it
> looks like it uses similar to radix tree algo), POSIX interface is
> much-much-much more unconvenient to use than simple add/wait.

I assume you misread the question. You agree to drop the patch and then go on listing reasons why you think it's better to keep it. I don't think these arguments are in any way sufficient. The interface is already too big and this is 100% duplicate functionality. If there are performance problems with the POSIX timer implementation (and I have yet to see indications) it should be fixed instead of worked around.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [take25 1/6] kevent: Description.
Evgeniy Polyakov wrote:
> If kernel has put data asynchronously it will setup special flag, thus kevent_wait() will not sleep and will return, so thread will check new entries and process them.

This is not sufficient. The userlevel code does not commit the events until they are processed. So assume two threads at userlevel and one event which is posted asynchronously. The first thread picks it up, the second calls kevent_wait. With your scheme the second thread will not be put to sleep and returns to userlevel unnecessarily.

What I propose, and what has been proven to work in many situations, is to pass as part of the kevent_wait syscall the information "I am aware of all events up to XX; wake me only if anything beyond that is added". Please take a look at how futexes work, it's really the same concept. And it's really also simpler for the implementation. Having such a flag is much more complicated than adding a simple index comparison before going to sleep.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
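The index comparison described in this message can be sketched in a few lines of C. This is only an illustration of the futex-like idea, not code from the kevent patch: `struct ring_state`, `kevent_pending()` and `kevent_should_sleep()` are hypothetical names, and the running counter of queued events is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ring state: 'tail' counts every event ever queued. */
struct ring_state {
    uint32_t tail;
};

/* Number of events queued beyond the index the caller has already seen.
 * Unsigned subtraction stays correct when the counter wraps around. */
static uint32_t kevent_pending(const struct ring_state *ring, uint32_t seen)
{
    assert(ring != NULL);
    return ring->tail - seen;
}

/* Futex-style check done before going to sleep: the caller says
 * "I know about everything up to 'seen'", and it only sleeps if
 * nothing was added beyond that point. */
static bool kevent_should_sleep(const struct ring_state *ring, uint32_t seen)
{
    return kevent_pending(ring, seen) == 0;
}
```

In the two-thread scenario above, the second thread would pass the index of the event the first thread already picked up, so the check tells it to sleep instead of returning to userlevel for nothing.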
Re: [take25 1/6] kevent: Description.
Evgeniy Polyakov wrote: It _IS_ how previous interface worked. EXACTLY! No, the old interface committed everything not only up to a given index. This is the huge difference which makes or breaks it. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [take24 0/6] kevent: Generic event handling mechanism.
on a specific CPU then the wakeup function should take this into account. I.e., if any of the waiting threads was/will be scheduled on the same CPU it should be preferred.

With the current simple form of a ring buffer this isn't sufficient, though. Reading all entries in the ring buffer until finding the one written by the CPU in question is not helpful. We'd need a mechanism to point the thread to the entry in question. One possibility to do this is to return the ring buffer entry as the return value of the kevent_wait() syscall. This works fine if the thread only works for one event (which I guess will be 99.999% of all uses). An extension could be to extend the ukevent structure to contain an index of the next entry written by the same CPU.

Another problem this entails is false sharing of the ring buffer entries. This would probably require padding the ukevent structure to 64 bytes. It's not that much more, 40 bytes so far, and it's also more future-safe. The alternative is to have per-CPU regions in the ring buffer. With hotplug CPUs this is just plain silly.

I think this optimization has the potential to help quite a bit, especially for large machines.

===

- we absolutely need an interface to signal the kernel that a thread, just woken from kevent_wait, cannot handle the events. I.e., the events are in the ring buffer but all the other threads are in the kernel in their kevent_wait calls. The new syscall would wake up one or more threads to handle the events. This syscall is for instance necessary if the thread calling kevent_wait is canceled. It might also be needed when a thread requested more than one event and realizes processing an entry takes a long time and that another thread might work on the other items in the meantime.

Al Viro pointed out another possible solution which also could solve the handled flag problem and concurrency in use of the ring buffer.
The idea is to require the kevent_wait() syscall to signal which entry in the ring buffer is handled or not handled. This means:

+ the kernel knows at any time which entries in the buffer are free and which are not

+ concurrent filling of the ring buffer is no problem anymore since entries are not discarded until told

+ by not waiting for an event (num parameter == 0) the syscall can be used to discard entries to free up the ring buffer before continuing to work on more entries. And, as per the requirement above, it can be used to tell the kernel that certain entries are *NOT* handled and need to be sent to another thread. This would be useful in the thread cancellation case.

This seems like a nice approach.

===

- why no syscall to create kevent queue? With dynamic /dev this might be a problem and it's really not much additional code. What about programs which want to use these interfaces before /dev is set up?

===

- still: the syscall should use a struct timespec* timeout parameter and not nanosecs. There are at least three timeout modes which are wanted:

+ relative, unconditionally wait that long

+ relative, aborted in case of large enough settimeofday() or NTP adjustment

+ absolute timeout. Probably even with selecting which clock to use. This mode requires a timespec value parameter

We have all this code already in the futex syscall. It just needs to be generalized or copied and adjusted.

===

- still: no signal mask parameter in the kevent_wait (and get_event) syscall. Regardless of what one thinks about signals, they are used and integrating the kevent interface into existing code requires this functionality. And it's not only about receiving signals. The signal mask parameter can also be used to _prevent_ signals from being delivered in that time.

===

- the KEVENT_REQ_WAKEUP_ONE functionality is good and needed. But I would reverse the default. I cannot see many places where you want all threads to be woken. Introduce KEVENT_REQ_WAKEUP_ALL instead.
===

- there is really no reason to invent yet another timer implementation. We have the POSIX timers which are feature rich and nicely implemented. All that is needed is to implement SIGEV_KEVENT as a notification mechanism. The timer is registered as part of the timer_create() syscall.

===

I haven't yet looked at the other event sources. I think the above is enough for now.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
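The padding suggestion from this message (one ring buffer entry per 64-byte cache line, to avoid false sharing) might look roughly like the following C11 sketch. The 40-byte payload is a stand-in taken from the size quoted above; `struct ukevent_payload`, `struct ring_slot` and `CACHE_LINE` are hypothetical names, not the layout used by the kevent patch.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed cache line size */

/* Hypothetical stand-in for the 40 bytes of event data. */
struct ukevent_payload {
    uint32_t words[10];
};

/* One slot per cache line: the alignment requirement on the member
 * forces the compiler to pad the slot from 40 to 64 bytes, so two CPUs
 * writing neighbouring slots never touch the same line. */
struct ring_slot {
    alignas(CACHE_LINE) struct ukevent_payload ev;
};

/* Compile-time check that the padding really happened. */
static_assert(sizeof(struct ring_slot) == CACHE_LINE,
              "ring slot must occupy exactly one cache line");
```

The 24 spare bytes per slot are also what makes the layout more future-safe: new fields can be added without changing the slot size.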
Re: [take19 0/4] kevent: Generic event handling mechanism.
Evgeniy Polyakov wrote: One can set number of events before the syscall and do not remove them after syscall. It can be updated if there is need for that. Nobody doubts that it is possible. But it is a) potentially much expensive and b) an alien concept to have the signal mask to set during the wait call implicitly. Conceptually it doesn't even make sense. This is no event to wait for. It a parameter for the specific wait call, just like the timeout. And I fortunately haven't seen you proposing to pass the timeout value implicitly. Not good enough? It does exactly what it is supposed to do. What can there be not good enough? Not to move signals into special case of events. If poll() can not work with them it does not mean, that they need to be specified as additional syscall parameter, instead change poll() to work with them, which can be easily done with kevents. You still seem to be completely missing the point. The signal mask is no event to wait for. It has nothing to do with this that ppoll() takes the signal mask as a parameter. The signal mask is a parameter for the wait call just like the timeout, not more and not less. Do not mix warm and soft - waiting for some period is not equal to syscall timeout. Waiting is possible with timer kevent user (although only relative timeout, can be changed to support both, not a big problem). That's what I'm saying all the time. Of course it can be supported. But for this the timeout parameter must be a timespec pointer. Whatever you could possibly mean by do not mix warm and soft I cannot possibly imagine. Fact is that both relative and absolute timeouts are useful. And that for absolute timeouts the change of the clock has to be taken into account. I'm quite sure that absolute timeouts are very usefull, but not as in the case of waiting for syscall completeness. In any way, kevent can be extended to support absolute timeouts in it's timer notifications. That's not the same. 
If you argue that then the syscall should have no timeout parameter at all. Fact is that setting up a timer is not for free. Since the timeout is used all the time having a timeout parameter is the right answer. And if you do this then do it right just like every other syscall other than poll: use a timespec object. This gives flexibility without measurable cost. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [take19 1/4] kevent: Core files.
Evgeniy Polyakov wrote: The whole idea of mmap buffer seems to be broken, since those who asked for creation do not like existing design and do not show theirs... What kind of argumentation is that? Because my attempt to implement it doesn't work and nobody right away has a better suggestion this means the idea is broken. Nonsense. It just means that time should be spend on thinking about this. You cut all this short by rushing out your attempt without any discussions. Unfortunately nobody else really looked at the approach so it lingered around for some weeks. Well, now it is clear that it is not the right approach and we can start thinking about it again. You seems to not checked the code - each event can be marked as ready only one time, which means only one copy and so on. It was done _specially_. And it is not limitation, but new approach. I know that it is done deliberately and I tell you that this is wrong and unacceptable. Realtime signals are one event which need to have more than one event queued. This is no description of what you have implemented, it's a description of the reality of realtime signals. RT signals are queued. They carry a data value (the sigval_t object) which can be unique for each signal delivery. Coalescing the signal events therefore leads to information loss. Therefore, at the very least for signal we need to have the ability to queue more than one event for each event source. Not having this functionality means that signals and likely other types of events cannot be implemented using kevent queues. Queue of the same signals or any other events has fundamental flawness (as any other ring buffer implementation, which has queue size) - it's size of the queue and extremely bad case of the overflow. Of course there are additional problems. Overflows need to be handled. But this is nothing which is unsolvable. So, the same event may not be ready several times. 
Any design which allows to create infinite number of events generated for the same case is broken, since consumer can be in situation, when it can not handle that flow. That's complete nonsense. Again, for RT signals it is very reasonable and not broken to have multiple outstanding signals. That is why poll() returns only POLLIN when data is ready in network stack, but is not trying to generate some kind of a signal for each byte/packet/MTU/MSS received. It makes no sense to drag poll() into this discussion. poll() is a very limited interface. The new event handling is supposed to be the opposite, namely, usable for all kinds of events. Arguing that because poll() does it like this just means you don't see what big step is needed to get to the goal of a unified event handling. The shackles of poll() must be left behind. RT signals have design problems, and I will not repeate the same error with similar limits in kevent. I don't know what to say. You claim to be the source of all wisdom is OS design. Maybe you should design your own OS, from ground up. I wonder how many people would like that since all your arguments are squarely geared towards optimizing the implementation. But: the implementation is irrelevant without users. The functionality users (= programmers) want and need is what must drive the implementation. And RT signals are definitely heavily used and liked by programmers. You have to accept that you try to modify an OS which has that functionality regardless of how much you hate it and want to fight it. Mmap implementation can be added separately, since it does not affect kevent core. That I doubt very much and it is why I would not want the kevent stuff go into any released kernel until that detail is resolved. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. 
➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [take19 0/4] kevent: Generic event handling mechanism.
Evgeniy Polyakov wrote:
> In context you have cut, one updated signal mask between calls to event delivery mechanism (using for example signal()), so it has exactly the same price.

No, it does not. If the signal mask is recomputed by the program for each new wait call then you have a lot more work to do when the signal mask is implicitly specified.

> I created it just because I think that POSIX workaround to add signals into the syscall parameters is not good enough.

Not good enough? It does exactly what it is supposed to do. What can there be not good enough?

> You again cut my explanation on why just pure timeout is used. We start a syscall, which can block forever, so we want to limit its time, and we add special parameter to show how long this syscall should run. Timeout is not about how long we should sleep (which indeed can be absolute), but how long syscall should run - which is related to the time syscall started.

I know very well what a timeout is. But the way the timeout can be specified can vary. It is often useful (as for select, poll) to specify relative timeouts. But there are equally useful uses where the timeout is needed at a specific point in time. Without a syscall interface which can have an absolute timeout parameter we'd have to write as a poor approximation at userlevel

    clock_gettime (CLOCK_REALTIME, &ts);
    struct timespec rel;
    rel.tv_sec = abstmo.tv_sec - ts.tv_sec;
    rel.tv_nsec = abstmo.tv_nsec - ts.tv_nsec;
    if (rel.tv_nsec < 0) {
      rel.tv_nsec += 1000000000;
      --rel.tv_sec;
    }
    if (rel.tv_sec < 0)
      inttmo = -1; // or whatever is used for return immediately
    else
      inttmo = rel.tv_sec * UINT64_C(1000000000) + rel.tv_nsec;
    wait(..., inttmo, ...)

Not only is this much more expensive to do at userlevel, it is also inadequate because calls to settimeofday() do not cause a recomputation of the timeout. See Ingo's RT futex stuff as an example for a kernel interface which does it right.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
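For reference, the userlevel approximation from this message can be written as a self-contained C helper. This is a sketch only: `abs_to_rel_ns()` is a hypothetical name, the current time is passed in instead of being read with clock_gettime() so the arithmetic can be checked in isolation, and -1 stands in for "deadline already passed" as in the fragment above. It inherits the weakness the message points out: a later settimeofday() does not cause the result to be recomputed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define NSEC_PER_SEC INT64_C(1000000000)

/* Convert an absolute deadline into a relative timeout in nanoseconds,
 * measured from 'now'. Returns -1 if the deadline has already passed. */
static int64_t abs_to_rel_ns(const struct timespec *abstmo,
                             const struct timespec *now)
{
    assert(abstmo != NULL && now != NULL);

    int64_t sec  = (int64_t)abstmo->tv_sec  - (int64_t)now->tv_sec;
    int64_t nsec = (int64_t)abstmo->tv_nsec - (int64_t)now->tv_nsec;

    if (nsec < 0) {              /* borrow one second */
        nsec += NSEC_PER_SEC;
        --sec;
    }
    if (sec < 0)
        return -1;               /* deadline lies in the past */

    return sec * NSEC_PER_SEC + nsec;
}
```

A caller would fill `now` with clock_gettime(CLOCK_REALTIME, &now) immediately before the wait call, which is exactly the extra work an absolute timespec parameter in the syscall would make unnecessary.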
Re: [take19 1/4] kevent: Core files.
Evgeniy Polyakov wrote: Existing design does not allow overflow. And I've pointed out a number of times that this is not practical at best. There are event sources which can create events which cannot be coalesced into one single event as it would be required with your design. Signals are one example, specifically realtime signals. If we do not want the design to be limited from the start this approach has to be thought over. So zap mmap() support completely, since it is not usable at all. We wont discuss on it. Initial implementation did not have it. But I was requested to do it, and it is ready now. No one likes it, but no one provides an alternative implementation. We are stuck. We need the mapped ring buffer. The current design (before it was removed) was broken but this does not mean it shouldn't be implemented. We just need more time to figure out how to implement it correctly. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [take19 0/4] kevent: Generic event handling mechanism.
Evgeniy Polyakov wrote: And you can add/remove signal events using existing kevent api between calls. That's far more expensive than using a mask under control of the program. And creating special cases for usual events is bad. There is unified way to deal with events in kevent - add/remove/modify/wait on them, signals are just usual events. How can this be unified? The installment of the temporary signal mask is unlike the handling of signal for the purpose of reporting them through the signal queue. It's equally completely new functionality. Don't kid yourself in thinking that because this is signal stuff, too, you're unifying something. The way this signal mask is used has nothing whatsoever to do with the delivering signals via the event queue. For the latter the signals always must be blocked (similar to sigwait's requirement). As a result it means you want to introduce a new mechanism for the event queue instead of using the well known and often used method of optionally passing a signal mask to the syscall. That's just insane. I think you wanted to say, that 'all event mechanism except the most commonly used poll/select/epoll use timespec'. Get your facts straight. select uses timeval which is just the predecessor of of timespec. And epoll is just (badly) designed after poll. Fact is therefore that poll plus its spawn is the only interface using such a timeout method. I designed it to be similar to poll(), it is really good interface. Not many people agree. All the interfaces designed (not derived) in the last years take a timespec parameter. Plus, you chose to ignore all the nice things using a timespec allow you like absolute timeout modes etc. See the clock_nanosleep() interface for a way this can be useful. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖ signature.asc Description: OpenPGP digital signature
Re: [take19 0/4] kevent: Generic event handling mechanism.
On 9/22/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote: The only two things missed in patchset after his suggestions are new POSIX-like interface, which I personally consider as very unconvenient, This means you really do not know at all what this is about. We already have these interfaces. Several of them and there will likely be more. These are interfaces for functionality which needs the new event notification. There is *NO* reason whatsoever to not make this - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [take19 0/4] kevent: Generic event handling mechanism.
[Bah, sent too early]

On 9/22/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:
> The only two things missed in patchset after his suggestions are new POSIX-like interface, which I personally consider as very inconvenient,

This means you really do not know at all what this is about. We already have these interfaces. Several of them and there will likely be more. These are interfaces for functionality which needs the new event notification. There is *NO* reason whatsoever to not add this extension and instead invent new interfaces to have notification sent to the event queue.
Re: [take19 1/4] kevent: Core files.
On 9/20/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote: This patch includes core kevent files: [...] I tried to look at the example programs before and failed. I tried again. Where can I find up-to-date example code? Some other points: - I really would prefer not to rush all this into the upstream kernel. The main problem is that the ring buffer interface is a shared data structure. These are always tricky. We need to find the right combination between size (as small as possible) and supporting all the interfaces. - so far only the timer and aio notification is speced out. What about the rest? Are we sure all aspects can be expressed? I am not yet. - we need an interface to add an event from userlevel. I.e., we need to be able to synthesize events. There are events (like, for instance the async DNS functionality) which come from userlevel code. I would very much prefer we look at the other events before setting the data structures in stone. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [take19 0/4] kevent: Generic event handling mechanism.
Evgeniy Polyakov wrote:
> When we enter sys_ppoll() we specify needed signals as syscall parameter, with kevents we will add them into the queue.

No, this is not sufficient as I said in the last mail. Why do you completely ignore what others say? The code which depends on the signal does not have to have access to the event queue. If a library sets up an interrupt handler then it expects the signal to be delivered this way. In such situations ppoll etc allow the signal to be generally blocked and enabled only and *ATOMICALLY* around the delays. This is not possible with the current wait interface. We need this signal mask interface and the appropriate setup code. Being able to get signal notifications does not mean this is always the way it can and must happen.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [take19 0/4] kevent: Generic event handling mechanism.
Evgeniy Polyakov wrote:
> It is completely possible to do what you describe without special syscall parameters.

First of all, I don't see how this is efficiently possible. The mask might change from call to call.

Second, hasn't it sunk in that inventing new ways to pass parameters is bad? Programmers don't want to learn new ways for every new interface. Reuse is good!

This applies to the signal mask here. But there is another parameter falling into that category and I meant to mention it before: the timeout value. All other calls except poll and especially all modern interfaces use a timespec pointer. This is the way times are kept in userland code. Don't try to force people to do something else.

Using a timespec also has the advantage that we can add an absolute timeout value mode (optional) instead of the relative timeout value. In this context, we should/must be able to specify which clock the timeout is for (not as part of the wait call, but another control operation perhaps). It's important to distinguish between CLOCK_REALTIME and CLOCK_MONOTONIC. Both have their use.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [take19 1/4] kevent: Core files.
On 10/3/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote: http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c These are simple programs which by themselves have problems. For instance, I consider a very bad idea to hardcode the size of the ring buffer. Specifying macros in the header file counts as hardcoding. Systems grow over time and so will the demand of connections. I have no problem with the kernel hardcoding the value internally (or having a /proc entry to select it) but programs should be able to dynamically learn about the value so they don't have to be recompiled. But more problematic is that I don't see how the interfaces can be efficiently used in multi-threaded (or multi-process) programs. How would multiple threads using the same kevent queue and running in the same kevent_get_events() loop work out? How do they guarantee that each request is only handled once? From what I see now this means a second data structure is needed to keep track of the state of each entry. But even then, how do we even recognized used ring buffer entries? For instance, assume two threads. Both call get_events, one event is reported, both threads are woken up (which is another thing to consider, more later). One thread uses ring buffer entry, the other goes back to sleep in get_events. Now, how does the kernel know when the other thread is done working on the ring buffer entry? There might be lots of entries coming in overflowing the entire buffer. Heck, you don't even need two threads for this scenario. When I was thinking about this (and discussing it in Ottawa) I was always assuming that we have a status field in the ring buffer entry which lets the userlevel code indicate whether the entry is free again or not. This requires a writable mapping, yes, and potentially causes cache line ping-pong. I think Zach mentioned he has some ideas about this. As for the multiple thread wakeup, I mentioned this before. 
We have to avoid the trampling herd problem. We cannot wakeup all waiters. But we also cannot assume that, without protocols, waking up just one for each available entry is sufficient. So the first question is: what is the current policy? AIO was removed from patchset by request of Cristoph. Timers, network AIO, fs AIO, socket nortifications and poll/select events work well with existing structures. Well, excuse me if I don't take your word for it. I agree, the AIO code should not be submitted along with this. The same for any other code using the event handling. But we need to check whether the interface is generic enough to accomodate them in a way which actually makes sense. Again, think highly threaded processes or multiple processes sharing the same event queue. It is even possible to create variable sized kevents - each kevent contain pointer to user's data, which can be considered as pointer to additional area (it's size kernel implementation for given kevent type can determine from other parameters or use predefined one and fetch additional data in -enqueue() callback). That sounds interesting and certainly helps with securing the interface for the future. But if there is anything we can do to avoid unnecessary costs we should do it, even if this means investigation all this further. - To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
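The status field idea from this message (userlevel marks in the ring buffer entry whether it is free again) can be sketched with C11 atomics. Everything here is hypothetical: the state names, `struct ring_entry`, `entry_claim()` and `entry_release()` are illustrations of the concept, not part of the kevent patch, and the event payload is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-entry states in a writable ring buffer mapping. */
enum { ENTRY_FREE = 0, ENTRY_READY = 1, ENTRY_CLAIMED = 2 };

struct ring_entry {
    atomic_int status;
    /* the event payload would follow here */
};

/* Try to take ownership of a ready entry. The compare-and-swap ensures
 * that of several threads woken for the same entry exactly one wins. */
static bool entry_claim(struct ring_entry *e)
{
    int expected = ENTRY_READY;

    assert(e != NULL);
    return atomic_compare_exchange_strong(&e->status, &expected,
                                          ENTRY_CLAIMED);
}

/* Mark the entry as processed so the producer may reuse the slot. */
static void entry_release(struct ring_entry *e)
{
    assert(e != NULL);
    atomic_store(&e->status, ENTRY_FREE);
}
```

This answers the "handled only once" question at the cost named above: the mapping must be writable, and threads on different CPUs writing the same status word cause cache line ping-pong.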
Re: [take19 0/4] kevent: Generic event handling mechanism.
On 9/27/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:
> I have been told in private what is signal masks about - just to wait until either signal or given condition is ready, but in that case just add additional kevent user like AIO complete or network notification and wait until either requested events are ready or signal is triggered.

No, this won't work. Yes, I want signal notification as part of the event handling. But there are situations when this is not suitable. Only if the signal is expected in the same code using the event handling can you do this. But this is not always possible. Especially when the signal handling code is used in other parts of the code than the event handling. E.g., signal handling in a library, event handling in the main code. You cannot assume that all the code is completely integrated.
Re: [take14 0/3] kevent: Generic event handling mechanism.
On 8/31/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote: Sorry ofr long delay - I was on small vacations. No vacation here, but travel nontheless. - one point of critique which applied to many proposals over the years: multiplexer syscalls a bad, really bad. [...] Can you convince Christoph? I do not care about interfaces, but until several people agree on it, I will not change anything. I hope that Linus and/or Andrew simply decree that multiplexers are bad. glibc and probably strace are the two most affected programs so their maintainers should have a say. My opinion os clear. Also for analysis tools the multiplexers are bad since different numbers of parameters are used and maybe even with different types. You completely miss AIO here (I talk not about POSIX AIO). Sure, I should have mentioned it. But I was assuming this all along. I use there only id provided by user, it is not his cookie, but it was done to make strucutre as small as possible. Think about size of the mapped buffer when there are several kevent queues - it is all mapped and thus pinned memory. It of course can be extended. It being what? The problem is that the structure of the ring buffer elements cannot easily be changed later. So we have to get it right now which means being a bit pessimistic about future requirements. Add padding, there will certainly be future uses which need more space. Next, the current interfaces once again fail to learn from a mistake we made and which got corrected for the other interfaces. We need to be able to change the signal mask around the delay atomically. Just like we have ppoll for poll, pselect for select (and hopefully soon also epoll_pwait for epoll_wait) we need to have this feature in the new interfaces. We able to change kevents atomically. I don't understand. Or you don't understand. I was talking about changing the signal mask atomically around the wait call. 
I.e., the call needs an additional optional parameter specifying the signal mask to use (for the kernel: two parameters, pointer and length). This parameter is not available in the version of the patch I looked at and should be added if it's still missing in the latest version of the patch. Again, look at the difference between poll() and ppoll() and do the same. Well, I rarely talk about what other people want, but if you strongly feel, that all posix crap is better than epoll interface, then I can not agree with you. You miss the point entirely like DaveM before you. What I ask for is simply a uniform and well established form to tell an interface to use the kevent notification mechanism and not sue signals etc. Look at the mail I sent in reply to DaveM's mail. It is possible to create additional one using any POSIX API you like, but I strongly insist on having possibility to use lightweight syscall interface too. Again, missing the point. We can without any significant change enable POSIX interfaces and GNU extensions like the timer, AIO, the async DNS code, etc use kevents. For the latter, which is entirely implemented at userlevel, we need interfaces to queue kevents from userlevel. I think this is already supported. The other two definitely benefit from using kevent notification and since they are/will be handled in the kernel the completion events should be queued in a kevent queue as specified in the sigevent structure passed to the system call. Ring buffer _always_ has space for new events until queue is not filled. So if userspace do not read for too much time it's events and eventually tries to add new one, it will fail early. Sorry, I don't understand this at all. If the ring buffer always has enough room then events must be preregistered. Is this the case? Seems very inflexible and who would this work with event sources like timers which can trigger many times? 
I hope you don't mean that ring buffers probably won't overflow since programs have to handle events fast enough. That's not acceptable. There is no overflow - I do not want to introduce another signal queue overflow crap here. And once again - no signals. Well, signals are the only asynchronous notification mechanism we have. But more to the point: why cannot there be overflows? You basically want to deliver the same event to several users. But how do you want to achive it with network buffers for example. When several threads reads from the same socket, they do not obtain the same data. That's not what I am after. I'm perfectly fine with waking only one thread. In fact, this is how it must be to avoid the trampling herd effects. But there is the problem that if the woken thread is not working on the issue for which it was woken (e.g., if the thread got canceled) then it must be able to wake another thread. In affect, there should be a syscall which causes a given number of other waiters (make the number a parameter to the syscall) is woken. They would start running and if nothing
Re: [take14 0/3] kevent: Generic event handling mechanism.
fail? Will mremap() work to increase/decrease the size? Will mremap() be allowed to be called with MREMAP_MAYMOVE? What if mmap() is called from different processes (in the POSIX sense, i.e., from different address spaces)? Either mmap(...) or

    int kevent_map_ringbuf (int kfd, size_t num)

- one interface to set additional parameters. This is likely mostly to make the interfaces safe for the future. Perhaps the number of events needed per delay call should be set this way.

    int kevent_ctl (int kfd, int cmd, ...)

- one interface to shut the kevent down. This might be overkill. We should be able to use munmap() and close(). If a real interface for this would be created it should look like this

    int kevent_destroy (int kfd, void *ringbuf, size_t num)

  I find this rather more cumbersome. Just use close and munmap.

- one interface to submit requests.

    int kevent_submit (int kfd, struct kevent_event *ev, int flags, struct timespec *timeout)

  Maybe the flags parameter isn't needed, it's just another way to make sure we won't regret the design later. If the ring buffer can fill up and this is detected by the kernel (unlike what happens in take 14) then the calling thread could be delayed indefinitely. Maybe we even have a deadlock if there is only one thread. If only a wait/no-wait mode is needed, then use only a flags parameter and no timeout parameter. A special variant should be if ev == NULL the call is taken as a request to wake one or more delayed threads.

- one interface to delay threads until the next event becomes available. No data is transferred along with the call. The event data must be read from the ring buffer:

    int kevent_wait (int kfd, unsigned ringstate, const struct timespec *timeout, const sigset_t *sigmask)

  Wait-mode can be implemented by recognizing timeout==NULL. no-wait mode is implemented using timeout->tv_sec==timeout->tv_nsec==0. If sigmask is NULL the signal mask is not changed. The ringstate parameter is also not present in the take 14 proposal.
Something like it is necessary to prevent the thread from going to sleep while there are events in the ring buffer. It would be very wasteful if the kernel would have to keep track of outstanding events. This would also mean that handling events would require a system call, exactly what the ring buffer approach should prevent. I think the sequence for waiting for an event should be like this:

+ get current ring state
+ check whether any outstanding event in ring buffer
+ if yes, copy data out of ring buffer, mark ring buffer record as unused (atomically).
+ if no, call kevent_wait with ring state value

When the kernel delivers a new event it does:

+ find place to store event
+ change ring state (might be a simple counter)

The kevent_wait implementation in the kernel would then as the first thing determine whether the ring state changed. If yes, the syscall returns immediately with -EWOULDBLOCK. Otherwise it is queued for waiting. With these steps and the requirement that all ring buffer entries are processed FIFO we can

a) avoid syscalls for freeing ring buffer entries
b) detect overflows in the ring buffer
c) maintain the read pointer at userlevel while the kernel can maintain the write pointer into the buffer

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
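The waiting sequence described above can be sketched in plain C. This is only a userlevel simulation under stated assumptions: ring_deliver() and fake_kevent_wait() are hypothetical stand-ins for the kernel side and for the proposed kevent_wait() syscall, the ring state is taken to be a simple counter, RING_SIZE is arbitrary, and there is a single consumer (so the read position needs no atomic update). Overflow detection here peeks at the userlevel read position, which a real kernel could not do this directly.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8u              /* hypothetical capacity */

struct ring_event { int id; };

struct ring {
    struct ring_event slots[RING_SIZE];
    atomic_uint state;            /* ring state: counter bumped on every delivery */
    unsigned read_pos;            /* read pointer, maintained at userlevel only */
};

/* Stand-in for the kernel: find a place to store the event, then change
 * the ring state.  Returns -ENOSPC when the ring would overflow. */
static int ring_deliver(struct ring *r, int id)
{
    unsigned w = atomic_load(&r->state);
    if (w - r->read_pos >= RING_SIZE)
        return -ENOSPC;
    r->slots[w % RING_SIZE].id = id;
    atomic_store(&r->state, w + 1);
    return 0;
}

/* Stand-in for kevent_wait(): if the ring state changed since the caller
 * sampled it, return -EWOULDBLOCK at once.  A real syscall would otherwise
 * queue the thread for waiting; this simulation just returns 0. */
static int fake_kevent_wait(struct ring *r, unsigned ringstate)
{
    if (atomic_load(&r->state) != ringstate)
        return -EWOULDBLOCK;
    return 0;
}

/* Returns true and fills *out if an event was taken from the ring without
 * any "syscall"; returns false where a real caller would have slept. */
static bool ring_get_event(struct ring *r, struct ring_event *out)
{
    assert(r != NULL && out != NULL);
    for (;;) {
        unsigned state = atomic_load(&r->state);    /* get current ring state */
        if (r->read_pos != state) {                 /* outstanding event? */
            *out = r->slots[r->read_pos % RING_SIZE];
            r->read_pos++;                          /* record is unused again (FIFO) */
            return true;
        }
        if (fake_kevent_wait(r, state) == 0)        /* wait with sampled state */
            return false;
        /* -EWOULDBLOCK: an event arrived in the meantime, look again */
    }
}
```

The point of passing the sampled state is the same as with futexes: an event delivered between the check and the wait changes the counter, so the wait refuses to sleep and the loop picks the event up from the buffer.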
Re: [take14 0/3] kevent: Generic event handling mechanism.
David Miller wrote: SigEvent, and signals in general, are crap. They are complex and userland gets it wrong more often than not. Interfaces for userland should be simple, signals are not simple.

You miss the point. sigevent has nothing necessarily to do with signals. I don't want signals. I just want the same interface to specify the action to be used. If I'm using

    struct sigevent sigev;
    int kfd;

    kfd = kevent_create (...);

    sigev.sigev_notify = SIGEV_KEVENT;
    sigev.sigev_kfd = kfd;
    sigev.sigev_value.sival_ptr = some_data;

then I can use this sigev variable in an unmodified timer_create call. The kernel would see SIGEV_KEVENT (as opposed to SIGEV_SIGNAL etc) and **not** generate a signal but instead create the event in the kevent queue. The proposal to use sigevent has nothing to do with signals. It's just about the interface and to have smooth integration with existing functionality.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
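To make the dispatch idea concrete, here is a small userlevel sketch. Everything in it is a hypothetical stand-in: SIGEV_KEVENT and sigev_kfd are only proposed in this thread and do not exist in any released header, so the sketch uses its own types (notify_spec, kevent_queue_sim) instead of struct sigevent. It only shows that one notification descriptor can select between "raise a signal" and "queue an event".

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 16              /* hypothetical queue capacity */

/* Stand-ins for SIGEV_SIGNAL and the proposed SIGEV_KEVENT. */
enum notify_kind { NOTIFY_SIGNAL, NOTIFY_KEVENT };

/* Plays the role of the kevent queue behind a kfd. */
struct kevent_queue_sim {
    void  *events[QUEUE_CAP];
    size_t count;
};

/* Plays the role of struct sigevent. */
struct notify_spec {
    enum notify_kind         kind;
    int                      signo;  /* used only for NOTIFY_SIGNAL */
    struct kevent_queue_sim *queue;  /* role of sigev_kfd */
    void                    *data;   /* role of sigev_value.sival_ptr */
};

/* Called when a timer expires or a request completes.
 * Returns 1 if a signal would be sent, 0 if an event was queued,
 * -1 if the queue is full. */
static int notify_complete(const struct notify_spec *spec)
{
    assert(spec != NULL);
    if (spec->kind == NOTIFY_SIGNAL)
        return 1;                 /* a real kernel would send spec->signo here */
    if (spec->queue->count == QUEUE_CAP)
        return -1;
    spec->queue->events[spec->queue->count++] = spec->data;
    return 0;
}
```

The caller fills in the descriptor once and hands it to whatever interface generates completions; no signal is involved in the NOTIFY_KEVENT path.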
Re: [take12 0/3] kevent: Generic event handling mechanism.
I so far also haven't taken the time to look exactly at the interface. I plan to do it asap since this is IMO our big chance to get it right. I want to have a unifying interface which can handle all the different events we need and which come up today and tomorrow. We have to be able to handle not only file descriptors and AIO but also timers, signals, message queues (OK, they are file descriptors but let's make it official), futexes. I'm probably missing the one or the other thing now. DaveM says there are example programs for the current interfaces. I must admit I haven't seen those either. So if possible, point the world to them again. If you do that now I'll review everything and write up my recommendations re the interface before Monday. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
Suparna Bhattacharya wrote: Is there a (remote) possibility that the thread could have died and its pid got reused by a new thread in another process ? Or is there a mechanism that prevents such a possibility from arising (not just in NPTL library, but at the kernel level) ? The UID/GID won't help you with dying processes. What if the same user creates a process with the same PID? That process will not expect the notification and mustn't receive it. If you cannot detect whether the issuing process died you have problems which cannot be solved with a uid/gid pair. AIO for pipes should not be a problem - Chris Mason had a patch, so we can just bring it up to the current levels, possibly with some additional improvements. Good. I'm not sure what would be the right thing to do for the sockets case. While we could put together a patch for basic aio_read/write (based on the same model used for files), given the whole ongoing kevent effort, it's not yet clear to me what would make the most sense ... Ben had a patch to do a fallback to kernel threads for AIO operations that are not yet supported natively. I had some concerns about the approach, but I guess he had intended it as an interim path for cases like this. A fallback solution would be sufficient. Nobody _should_ use POSIX AIO for networking but people do and just giving them something that works is good enough. It cannot really be worse than the userlevel emulation we have now. The alternative, separately and sequentially handling network sockets at userlevel, is horrible. We'd have to go over every file descriptor and check whether it's a socket and then take it out of the request list for the kernel. Then they need to be handled separately before or after the kernel AIO code. This would unduly punish the 99.9% of programs which don't use POSIX AIO for network I/O. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)
Suparna Bhattacharya wrote: I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not part of the ABI, but only internal to the kernel implementation. I think Zach had suggested inferring THREAD_ID notification if the pid specified is not zero. But, I don't see why ->sigev_notify couldn't be used directly (just like the POSIX timers code does) thus doing away with the new constants altogether. Sébastien/Laurent, do you recall? I suggest to model the implementation after the timer code which does exactly what we need. I'm guessing they are being used for validation of permissions at the time of sending the signal, but maybe saving the task pointer in the iocb instead of the pid would suffice ? Why should any verification be necessary? The requests are generated in the same process which will receive the notification. Even if the POSIX process (aka, kernel process group) changes the IDs the notifications should be sent. The key is that notifications cannot be sent to another POSIX process. Adding this as a feature just makes things so much more complicated. So I think the intended behaviour is as you describe it should be Then the documentation needs to be adjusted. The way it works (and better ideas are welcome) is that, since the io_submit() syscall already accepts an array of iocbs[], no new syscall was introduced. To implement lio_listio, one has to set up such an array, with the first iocb in the array having the special (new) grouping opcode of IOCB_CMD_GROUP which specifies the sigev notification to be associated with group completion (a NULL value of the sigev notification pointer would imply equivalent of LIO_WAIT). OK, this seems OK. We have to construct the iocb arrays dynamically anyway. My thought here was that it should be possible to include M as a parameter to the IOCB_CMD_GROUP opcode iocb, and thus incorporated in the lio control block ... then whatever semantics are agreed upon can be implemented.
If you have room for the parameter this is fine. For the beginning we can enforce the number to be the same as the total number of requests. Let us know what you think about the listio interface ... hopefully the other issues are mostly simple to resolve. It should be fine and I would support adding all this assuming the normal file support (as opposed to direct I/O only) is added, too. But I have one last question: sockets, pipes and the like are already supported, right? If this is not the case we have a problem with the currently proposed lio_listio interface. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [take6 1/3] kevent: Core files.
Evgeniy Polyakov wrote: The main disadvantage is that all memory is allocated on the start even if it will not be used later. I think dynamic grow is appropriate solution, since user will have that memory used anyway, since kevents are allocated, If you _allocate_ memory at startup you're doing something wrong. All you should do is allocate address space. Memory should be allocated when it is needed. Growing a memory region is always hard because it means you cannot keep any addresses around and always have to reload a base pointer. That's not ideal. Especially on 64-bit machines address space really is no limitation anymore. So, allocate as much as needed, allocate memory when it's needed, and don't resize. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
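The distinction between allocating address space and allocating memory can be sketched at userlevel with mmap() and mprotect() on Linux. This is only an illustration of the principle, not how the kevent patch does or should do it; reserve_region() and commit_prefix() are hypothetical helper names. The region is reserved once with PROT_NONE, so its base address never changes, and pages only get backing memory after a prefix is made accessible and touched.

```c
#define _DEFAULT_SOURCE           /* for MAP_ANONYMOUS / MAP_NORESERVE on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Reserve `size` bytes of address space without committing memory.
 * Returns NULL on failure. */
static void *reserve_region(size_t size)
{
    assert(size > 0);
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Make the first `len` bytes of a reserved region readable and writable.
 * The kernel still allocates each page only when it is first touched.
 * Returns 0 on success, -1 on failure (as mprotect does). */
static int commit_prefix(void *base, size_t len)
{
    return mprotect(base, len, PROT_READ | PROT_WRITE);
}
```

Since the base pointer stays valid for the lifetime of the mapping, no pointers into the region have to be reloaded when more of it is put to use, which is the problem with growing a region via remapping.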
Re: [3/4] kevent: AIO, aio_sendfile() implementation.
Sébastien Dugué wrote: aio completion notification I looked over this now but I don't think I understand everything. Or I don't see how it all is integrated. And no, I'm not looking at the proposed glibc code since that would mean being tainted. Details: --- A struct sigevent *aio_sigeventp is added to struct iocb in include/linux/aio_abi.h An enum {IO_NOTIFY_SIGNAL = 0, IO_NOTIFY_THREAD_ID = 1} is added in include/linux/aio.h: - IO_NOTIFY_SIGNAL means that the signal is to be sent to the requesting thread - IO_NOTIFY_THREAD_ID means that the signal is to be sent to a specific thread. This has been proved to be sufficient in the timer code which basically has the same problem. But why do you need separate constants? We have the various SIGEV_* constants, among them SIGEV_THREAD_ID. Just use these constants for the values of ki_notify. The following fields are added to struct kiocb in include/linux/aio.h: - pid_t ki_pid: target of the signal - __u16 ki_signo: signal number - __u16 ki_notify: kind of notification, IO_NOTIFY_SIGNAL or IO_NOTIFY_THREAD_ID - uid_t ki_uid, ki_euid: filled with the submitter credentials These two fields aren't needed for the POSIX interfaces. Where does the requirement come from? I don't say they should be removed, they might be useful, but if the costs are non-negligible then they could go away. - check whether the submitting thread wants to be notified directly (sigevent->sigev_notify_thread_id is 0) or wants the signal to be sent to another thread. In the latter case a check is made to assert that the target thread is in the same thread group Is this really how it's implemented? This is not how it should be. Either a signal is sent to a specific thread in the same process (this is what SIGEV_THREAD_ID is for) or the signal is sent to the calling process. Sending a signal to the process means that from the kernel's POV any thread which doesn't have the signal blocked can receive it. The final decision is made by the kernel.
There is no mechanism to send the signal to another process. So, for the purpose of the POSIX AIO code the ki_pid value is only needed when the SIGEV_THREAD_ID bit is set. It could be an extension and I don't mind it being introduced. But again, it's not necessary and if it adds costs then it could be left out. It is something which could easily be introduced later if the need arises. listio support I really don't understand the kernel interface for this feature. Details: --- An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h A struct lio_event is added in include/linux/aio.h A struct lio_event *ki_lio is added to struct iocb in include/linux/aio.h So you have a pointer in the structure for the individual requests. I assume you use the atomic counter to trigger the final delivery. I further assume that if lio_wait is set the calling thread is suspended until all requests are handled and that the final notification in this case means that thread gets woken. This is all fine. But how do you pass the requests to the kernel? If you have a new lio_listio-like syscall it'll be easy. But I haven't seen anything like this mentioned. The alternative is to pass the requests one-by-one in which case I don't see how you create the reference to the lio_listio control block. This approach seems to be slower. If all requests are passed at once, do you have the equivalent of LIO_NOP entries? How can we support the extension where we wait for a number of requests which need not be all of them? I.e., I submit N requests and want to be notified when at least M (M <= N) have completed. I am not yet clear about the actual semantics we should implement (e.g., do we send another notification after the first one?) but it's something which IMO should be taken into account in the design. Finally, and this is very important, does your code send out the individual request notifications and then in the end the lio_listio completion?
I think Suparna wrote this is the case but I want to make sure. Overall, this looks much better than the old code. If the answers to my questions show that the behavior is compatible with the POSIX AIO code I'm certainly very much in favor of adding the kernel code. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
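The "notify when at least M of N requests are done" idea discussed in this message can be pictured with an atomic counter in the group control block. This is a sketch under assumptions, not the code from the patch: struct lio_group and both functions are hypothetical names, and the semantics chosen here (exactly one group notification, fired by the M-th completion, none afterwards) is just one of the possibilities the message leaves open.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical group control block shared by all requests of one
 * lio_listio-style submission. */
struct lio_group {
    atomic_int remaining;         /* completions still needed before notifying */
};

/* Arm the group so that the m-th completion triggers the notification.
 * With m equal to the number of requests this is plain lio_listio. */
static void lio_group_init(struct lio_group *g, int m)
{
    assert(m > 0);
    atomic_init(&g->remaining, m);
}

/* Called once for every completed request, after its own per-request
 * notification has been delivered.  Returns true for exactly one caller:
 * the one whose completion is the m-th. */
static bool lio_group_complete_one(struct lio_group *g)
{
    return atomic_fetch_sub(&g->remaining, 1) == 1;
}
```

Because atomic_fetch_sub() returns the previous value, only the completion that moves the counter from 1 to 0 sees 1, so concurrent completions on different CPUs cannot both fire the group notification.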
Re: [take5 0/4] kevent: Generic event handling mechanism.
Evgeniy Polyakov wrote: Question with kevents removal from syscall stays open until Ulrich accepts or declines mapped buffer implementation. It was my idea in the first place to use the ring buffer. I'm sure others had the same idea but that's what I presented. So, I see no reason you should delay making this change because of me. The only important thing is that we need to get a useful semantics for fork and exec. For fork, it must be possible to dequeue entries from the ring buffer in a thread-safe way. For exec (where a file descriptor might survive) we likely need a mechanism to mmap the ring buffer only based on the file descriptor. I'm not sure about this, though. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC 1/4] kevent: core files.
Herbert Xu wrote: The other to consider is that events don't come from the hardware. Events are written by the kernel. So if user-space is just reading the events that we've written, then there are no cache misses at all. Not quite true. The ring buffer can be written to from another processor. The kernel thread responsible for generating the event (receiving data from network or disk, expired timer) can run independently on another CPU. This is the case to keep in mind here. I thought Zach and the others involved in the discussions in Ottawa said this has been shown to be a problem and that a ring buffer implementation with something other than simple front and back pointers is preferable. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC 1/4] kevent: core files.
Nicholas Miell wrote: [...] and was wondering if you were familiar with the Solaris port APIs* and, I wasn't. if so, you could please comment on how your proposed event channels are different/better. There indeed is not much difference. The differences are in the details. The way those ports are specified doesn't allow much room for further optimizations. E.g., the userlevel ring buffer isn't possible. But mostly it's the same semantics. The ec_t type in my text is also better a file descriptor since otherwise it cannot be transported via Unix stream sockets. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC 1/4] kevent: core files.
Evgeniy Polyakov wrote: Btw, why do we want mapped ring of ready events? If user requested some event, he definitely wants to get them back when they are ready, and not to check and then get them? Could you please explain more on this issue? It of course makes no sense to enter the kernel to actually get the event. This should be done by storing the event in the ring buffer. I.e., there are two ways to get an event: - with a syscall. This can report as many events at once as the caller provides space for. And no event which is reported in the ring buffer should be reported this way - if there is space, report it in the ring buffer. Yes, the buffer can be optional, then all events are reported by the system call. So the use case would be like this: wait_and_get_event: is buffer empty ? yes - make syscall no - get event from buffer To avoid races, the syscall needs to take a parameter indicating the last event checked out from the buffer. If in the meantime the kernel put another event in the buffer the syscall immediately returns. Similar to what we do in the futex syscall. The question is how to best represent the ring buffer. Zach and some others had some ready responses in Ottawa. The important thing is to avoid cache line ping pong when possible. Is the ring buffer absolutely necessary? Probably not. But it has the potential to help quite a bit. Don't look at the problem to solve in the context of heavy I/O operations when another syscall here and there doesn't matter. With this single event mechanism for every possible event the kernel can generate programming can look quite different. E.g., every read() call can implicitly be changed into an async read call followed by a user-level reschedule. This rescheduling allows another thread of execution to run while the read request is processed. I.e., it's basically a setjmp() followed by a goto into the inner loop to get the next event.
And now suddenly the event notification mechanism really should be as fast as possible. If we submit basically every request asynchronously and are not creating dedicated threads for specific tasks anymore we a) have a lot more event notifications b) the probability of an event being reported when we want to receive the next one is higher (i.e., the case where no syscall vs syscall makes a difference) Yes, all this will require changes in the way programs are written but we shouldn't limit the way we can write programs unnecessarily. I think that given increasing discrepancies in relative speed/latency of the peripherals and the CPU this is one possible solution to keep the CPUs busy without resorting to a gazillion separate threads in each program. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [RFC 1/4] kevent: core files.
Zach Brown wrote: Ulrich, would you be satisfied if we didn't have the userspace mapped ring on the first pass and only had a collection syscall? I'm not the one to make a call but why rush things? Let's do it right from the start. Later changes can only lead to problems with users of the earlier interface. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [3/4] kevent: AIO, aio_sendfile() implementation.
Badari Pulavarty wrote: Before we spend too much time cleaning up and merging into mainline - I would like an agreement that what we add is good enough for glibc POSIX AIO. I haven't seen a description of the interface so far. Would be good if it existed. But I briefly mentioned one quirk in the interface about which Suparna wasn't sure whether it's implemented/implementable in the current interface. If a lio_listio call is made the individual requests are handled just as if they'd been issued separately. I.e., the notification specified in the individual aiocb is performed when the specific request is done. Then, once all requests are done, another notification is made, this time controlled by the sigevent parameter of lio_listio. Another feature which I always wanted: the current lio_listio call returns in blocking mode only if all requests are done. In non-blocking mode it returns immediately and the program needs to poll the aiocbs. What is needed is something in the middle. For instance, if multiple read requests are issued the program might be able to start working as soon as one request is satisfied. I.e., a call similar to lio_listio would be nice which also takes another parameter specifying how many of the NENT aiocbs have to finish before the call returns. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
Re: [3/4] kevent: AIO, aio_sendfile() implementation.
Christoph Hellwig wrote: My personal opinion on existing AIO is that it is not the right design. Benjamin LaHaise agree with me (if I understood him right), I completely agree with that as well. I agree, too, but the current code is not the last of the line. Suparna has a set of patches which make the current kernel aio code work much better and especially make it really usable to implement POSIX AIO. In Ottawa we were talking about submitting it and Suparna will. We just thought about a little longer timeframe. I guess it could be accelerated since she mostly has the patch done. But I don't know her schedule. Important here is, don't base any decision on the current aio implementation. -- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖