Re: 2.6.24-rc3: find complains about /proc/net

2007-11-20 Thread Ulrich Drepper

Roland McGrath wrote:
> Oh, it seems it has indeed been that way for a very long time, so I was
> mistaken.  It still seems a little odd to me.  Ulrich can say definitively
> whether the kind of concern I mentioned really matters one way or the other
> for glibc.

glibc cannot survive (at least NPTL) if somebody uses funny CLONE_*
flags to separate various pieces of information, e.g., file descriptors.
 So, all the information in each thread's /proc/self should be identical.

When the information is not the same, the current semantics seems to be
more useful.  So I guess, no change is the way to go here.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


network interface state

2007-11-14 Thread Ulrich Drepper

Just FYI, with the current getaddrinfo code it is even more critical to
get to a point where I can cache network interface information and query
the kernel whether it changed.  We now have to read the RTM_GETADDR
tables for every lookup.  It was more limited with the old, incomplete
implementation.

Even if it's something as simple as a RTM_SEQUENCE request which returns
a number that is bumped at every interface change.

Related: I need to know about the device type (the ARPHRD_* values) to
determine whether a device is for a native transport or a tunnel.  What
I currently do is:

- at the beginning I get information about all interfaces using
  RTM_GETADDR

- then later I have to find the device type by

  + reading the RTM_GETLINK data to get to the device name

  + then using the name and ioctl(SIOCGIFHWADDR) I get the device type


It would be so much nicer if the device type were part of the
RTM_GETADDR data, or at least the RTM_GETLINK data.


Any help on any of these issues?

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: network interface state

2007-11-14 Thread Ulrich Drepper

David Miller wrote:
> Most daemons handle this by listening for events on the netlink
> socket, but I understand how that might not be practical for
> glibc.

Right, this cannot work.  I have no inner loop which I can control.  I
cannot install a listener.

At some point, when we have non-sequential, hidden file descriptors,
I'll be able to leave a socket file descriptor open.  But that's about
it.  Even then the generation counter interface is likely to be the best
choice.


> It's part of the link information.  Look in ifinfomsg->ifi_type.

Great, I fixed up the code.  I guess in future, once I can cache the
data, I'll simply read the RTM_GETADDR and RTM_GETLINK data all at once
and be done with it.
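
A minimal sketch of pulling the device type out of one RTM_NEWLINK reply
(the answer to an RTM_GETLINK request), assuming the Linux rtnetlink
headers.  The helper names link_type_from_msg and is_tunnel_type are
made up for this illustration, and which ARPHRD_* values count as
tunnels is an assumption of the sketch, not something taken from the
getaddrinfo code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <net/if_arp.h>

/* Return the ARPHRD_* device type carried in one RTM_NEWLINK message,
   or -1 if the message has another type or is too short.  */
static int
link_type_from_msg (const struct nlmsghdr *nlh)
{
  assert (nlh != NULL);
  if (nlh->nlmsg_type != RTM_NEWLINK
      || nlh->nlmsg_len < NLMSG_LENGTH (sizeof (struct ifinfomsg)))
    return -1;

  /* The ifinfomsg payload directly follows the netlink header.  */
  const struct ifinfomsg *ifi = NLMSG_DATA (nlh);
  return ifi->ifi_type;
}

/* ASSUMPTION: this list of tunnel types is illustrative only.  */
static bool
is_tunnel_type (int type)
{
  switch (type)
    {
    case ARPHRD_TUNNEL:
    case ARPHRD_TUNNEL6:
    case ARPHRD_SIT:
    case ARPHRD_IPGRE:
      return true;
    default:
      return false;
    }
}
```

With this the extra ioctl(SIOCGIFHWADDR) round trip per interface is not
needed; the type is read while walking the RTM_GETLINK dump.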

BTW, is it possible to send both these requests out before starting to
read the results?  This would reduce the amount of code quite a bit.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: bind and O_NONBLOCK

2007-09-22 Thread Ulrich Drepper

Evgeniy Polyakov wrote:
> So, did I understand you correctly, that you want to introduce network
> AIO here? (for example on behalf of work queue or something else?)

See Alan's mail.  All this was his proposal, I just got it accepted
upstream.

The problem to solve arises when you have a distributed network port set.
Apparently NetBIOS has it but I could also imagine this to be useful in
cluster implementations which have to appear as one machine.  In this
case, before binding to a given port, you have to make sure no other
machine already handles this port.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: bind and O_NONBLOCK

2007-09-22 Thread Ulrich Drepper

Evgeniy Polyakov wrote:
> Could you point to Alan's original proposal?  I only found a short note
> (as in your original mail) at opengroup.org and failed to find it
> elsewhere on the web.

There was no public mail.  I asked RH engineering for proposals for
changes to the POSIX spec and Alan replied.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


bind and O_NONBLOCK

2007-09-21 Thread Ulrich Drepper

Some time back Alan asked about adding O_NONBLOCK support to bind in the
POSIX spec.  I brought this up and the following text will be in the
next revision of the POSIX spec:

===
If the socket address cannot be assigned immediately and O_NONBLOCK is
set for the file descriptor for the socket, bind() shall fail and set
errno to [EINPROGRESS], but the assignment request shall not be aborted,
and the assignment shall be completed asynchronously. Subsequent calls
to bind() for the same socket, before the assignment is completed, shall
fail and set errno to [EALREADY].

When the assignment has been performed asynchronously, pselect(),
select(), and poll() shall indicate that the file descriptor for the
socket is ready for reading and writing.
===

It would be ideal if we had such an implementation in the next few
months so that we, in theory, can check whether the text in the
specification makes sense.
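
For illustration, a sketch of how an application might use the proposed
semantics.  The function names are made up for this example.  The quoted
text does not say how the result of an asynchronous assignment is
reported; fetching it with SO_ERROR, as is done for connect(), is purely
an assumption here.  On kernels without such support bind() simply
completes immediately:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Bind FD (already O_NONBLOCK) and wait for an asynchronous assignment.
   Returns 0 on success, otherwise an errno value.  */
static int
bind_nonblocking (int fd, const struct sockaddr *addr, socklen_t len,
                  int timeout_ms)
{
  assert (addr != NULL);
  if (bind (fd, addr, len) == 0)
    return 0;                   /* Assigned immediately.  */
  if (errno != EINPROGRESS)
    return errno;

  /* Per the quoted text the descriptor becomes ready for reading and
     writing once the assignment has been performed.  */
  struct pollfd pfd = { .fd = fd, .events = POLLIN | POLLOUT };
  int n = poll (&pfd, 1, timeout_ms);
  if (n < 0)
    return errno;
  if (n == 0)
    return ETIMEDOUT;

  /* ASSUMPTION: the outcome is reported via SO_ERROR, as for connect().  */
  int err = 0;
  socklen_t errlen = sizeof (err);
  if (getsockopt (fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
    return errno;
  return err;
}

/* Usage: bind a non-blocking UDP socket to an ephemeral loopback port.  */
static int
example_bind_loopback (void)
{
  int fd = socket (AF_INET, SOCK_DGRAM, 0);
  if (fd < 0)
    return errno;

  int rc = EINVAL;
  int fl = fcntl (fd, F_GETFL);
  if (fl >= 0 && fcntl (fd, F_SETFL, fl | O_NONBLOCK) == 0)
    {
      struct sockaddr_in sin = { .sin_family = AF_INET };
      sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
      rc = bind_nonblocking (fd, (struct sockaddr *) &sin, sizeof (sin),
                             1000);
    }
  close (fd);
  return rc;
}
```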

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


drop association of connection-less socket

2007-09-19 Thread Ulrich Drepper

The Linux man page for connect(2) currently says:

  Connectionless sockets may dissolve the association by connecting to
  an address with the sa_family member of sockaddr set to AF_UNSPEC.


No such wording is in the POSIX definition, which only says:

  If  address is a null address for the protocol, the socket’s peer
  address shall be reset.


This is not the same but seems to be what Linux implements.


The problem is that I tried to reuse a socket which has been associated
with an IPv6 address to later connect to an IPv4 address.  This is part
of the getaddrinfo implementation and an effort to make it more
efficient.  strace's output  looks like this:

connect(3, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6,
"2001:11b8:1:0:207:e94f:ee7c:4b72", &sin6_addr), sin6_flowinfo=0,
sin6_scope_id=0}, 28) = -1 ENETUNREACH (Network is unreachable)

connect(3, {sa_family=AF_UNSPEC,
sa_data="\0\0\0\0\0\0\0\0\0\0\0\0\0\0"}, 28) = 0

connect(3, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("192.168.1.72")}, 16) = 0


I.e., despite what the man page says, the second connect only reset the
address, as required by the POSIX spec.  It did not reset the address
family of the socket.


What I ideally would like to see is what the Linux man page says.  I.e.,
if the .sa_family field is AF_UNSPEC, everything (the address and the
address family) is reset.  Otherwise only the address association itself
is reset.

Is this functionality which got lost over time?  Or is the man page
wrong and this never was the case?  Is this a worthwhile change?

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: drop association of connection-less socket

2007-09-19 Thread Ulrich Drepper

I guess the request is not that useful.  The family of the socket is
determined earlier so to undo this it takes more of an effort.  I
managed to get by for most cases without this change so no action needed.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


follow-up: discrepancy with POSIX

2007-09-19 Thread Ulrich Drepper

As a follow-up to my question from yesterday on the netdev list, here is
what I think is a real problem, either in the kernel or in the POSIX spec.

The POSIX spec currently says this about SOCK_DGRAM sockets:

  If  address is a null address for the protocol, the socket’s peer
  address shall be reset.

The term null address is not further specified but it will usually be
read to allow the following scenario to work out:

   fd = socket(AF_INET6, ...)

   connect(fd, ...some IPv6 address...)

   struct sockaddr_in6 sin6 = { .sin6_family = AF_INET6 };
   connect(fd, &sin6, sizeof (sin6));

   connect(fd, ...some new IPv6 address...)

This does not work on Linux at the moment.  The second connect() call
succeeds, yet the socket remains connected to the old IPv6 address (this
does not sound OK).  What does work is if the connect call to
disassociate the address uses AF_UNSPEC instead of AF_INET6.
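
The AF_UNSPEC variant can be sketched as follows.  This is an
illustration only: it uses IPv4 over loopback so it runs without
external connectivity, the port (9) is arbitrary, and the helper names
are made up.  getpeername() failing with ENOTCONN is used as the sign
that the association is gone:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Return 1 if FD has no peer address, 0 if it is connected.  */
static int
has_no_peer (int fd)
{
  assert (fd >= 0);
  struct sockaddr_storage ss;
  socklen_t len = sizeof (ss);
  return (getpeername (fd, (struct sockaddr *) &ss, &len) < 0
          && errno == ENOTCONN);
}

/* Dissolve the association of a connected datagram socket by
   connecting to an address whose family is AF_UNSPEC.  */
static int
dissolve_association (int fd)
{
  struct sockaddr sa;
  memset (&sa, 0, sizeof (sa));
  sa.sa_family = AF_UNSPEC;
  return connect (fd, &sa, sizeof (sa));
}

/* Connect a UDP socket over loopback, dissolve the association and
   report the state: 1 if the peer is gone afterwards, 0 if it is still
   there, -1 if the setup itself failed.  */
static int
demo_unspec (void)
{
  int fd = socket (AF_INET, SOCK_DGRAM, 0);
  if (fd < 0)
    return -1;

  struct sockaddr_in sin = { .sin_family = AF_INET, .sin_port = htons (9) };
  sin.sin_addr.s_addr = htonl (INADDR_LOOPBACK);

  int result = -1;
  if (connect (fd, (struct sockaddr *) &sin, sizeof (sin)) == 0
      && !has_no_peer (fd)
      && dissolve_association (fd) == 0)
    result = has_no_peer (fd);

  close (fd);
  return result;
}
```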


The question is: do people here think this is a problem in the POSIX
spec?  Binding to :: and 0.0.0.0 isn't possible, so maybe the Linux
implementation should allow this?

If you think the POSIX spec is wrong (and can point to other
implementations doing the same as Linux) let me know and I'll work on
getting the spec changed.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: follow-up: discrepancy with POSIX

2007-09-19 Thread Ulrich Drepper

Andi Kleen wrote:
> The standard way to undo connect is to use AF_UNSPEC. Code to handle
> that for dgram sockets is there. It's the same code for v4 and v6.

I quoted the standard and it does not say anything about AF_UNSPEC.  So
you cannot simply make such broad statements.

I also don't say that this behavior should be removed.  It's certainly
useful, very much so in fact.

But the spec calls for a null address to be used and that's in my
understanding something different from using AF_UNSPEC.

I looked through Stevens' TCP/IP Illustrated, Vol. 2 and it seems not to
mention resetting the address at all.  The POSIX spec certainly got this
text from .1g.

I cannot test it on other systems.  If somebody has access to some
certified systems (and maybe others), write a bit of code which creates
a DGRAM socket, connects to one address, calls connect with a null
address, then connects to another address (which likely has to use a
different interface since otherwise the connect will just succeed, it
seems).
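
A rough starting point for such a probe, as a sketch only.  It covers
the first two steps over IPv4 loopback and leaves out the final connect
to an address on a different interface, which needs a real network
setup.  The enum and function names are invented for this example, and
getpeername() is just one way to observe the outcome; it may not
distinguish every case on every system:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

enum probe_result { PROBE_ERROR = -1, PEER_REPORTED = 0, NO_PEER_REPORTED = 1 };

static enum probe_result
probe_null_address (void)
{
  int fd = socket (AF_INET, SOCK_DGRAM, 0);
  if (fd < 0)
    return PROBE_ERROR;

  /* Step 1: connect to some address (arbitrary port on loopback).  */
  struct sockaddr_in first = { .sin_family = AF_INET,
                               .sin_port = htons (9) };
  first.sin_addr.s_addr = htonl (INADDR_LOOPBACK);
  if (connect (fd, (struct sockaddr *) &first, sizeof (first)) < 0)
    {
      close (fd);
      return PROBE_ERROR;
    }

  /* Step 2: "null address for the protocol", read here as the family
     of the socket with everything else zero.  */
  struct sockaddr_in null_addr = { .sin_family = AF_INET };
  int rc = connect (fd, (struct sockaddr *) &null_addr, sizeof (null_addr));
  int saved_errno = rc < 0 ? errno : 0;

  /* Observe what the system reports as the peer afterwards.  */
  struct sockaddr_in peer;
  socklen_t len = sizeof (peer);
  enum probe_result res;
  if (getpeername (fd, (struct sockaddr *) &peer, &len) == 0)
    {
      assert (len <= sizeof (peer));
      res = PEER_REPORTED;
    }
  else
    res = errno == ENOTCONN ? NO_PEER_REPORTED : PROBE_ERROR;

  printf ("connect(null) = %d (errno %d), peer %s\n", rc, saved_errno,
          res == PEER_REPORTED ? "still reported" : "not reported");
  close (fd);
  return res;
}
```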

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: follow-up: discrepancy with POSIX

2007-09-19 Thread Ulrich Drepper

Ulrich Drepper wrote:
> Yes, but for IPv4/6 it's not an issue.  Some implementations might
> handle all-zeros and the spec _currently_ calls for it.  In this case an
> alignment would be good.

Searching the web turns up this:

http://developer.apple.com/documentation/Darwin/Reference/ManPages/man2/connect.2.html


  Datagram sockets may dissolve the association by connecting to an
  invalid address, such as a null address or an address with the address
  family set to AF_UNSPEC (the error EAFNOSUPPORT will be harmlessly
  returned).


I.e., at least Apple implements both variants.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: follow-up: discrepancy with POSIX

2007-09-19 Thread Ulrich Drepper

Andi Kleen wrote:
>> But the spec calls for a null address to be used and that's in my
>> understanding something different from using AF_UNSPEC.
>
> memset(sockaddr, 0, sizeof(sockaddr)) should give you AF_UNSPEC

But the spec calls for a "null address for the protocol".

That means the family for the null address is the same as the family of
the socket.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: follow-up: discrepancy with POSIX

2007-09-19 Thread Ulrich Drepper

Andi Kleen wrote:
> Spec doesn't match traditional behaviour then.

Well, determining whether that's the case is part of this exercise.


> IPv4 0.0.0.0 is
> traditionally a synonym for old style all broadcast (255.255.255.255)
> on UDP/RAW and it's certainly possible to connect() to that.

Where do you get this from?  And where is this implemented?  I don't
doubt it but I have to convince people to change the standard and
possibly introduce incompatibility.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: follow-up: discrepancy with POSIX

2007-09-19 Thread Ulrich Drepper

David Miller wrote:
> It just occurred to me that AF_UNSPEC might be used simply
> because all zeros might be a valid real bindable address
> for some address family.  And using AF_UNSPEC avoids that
> problem entirely.

Yes, but for IPv4/6 it's not an issue.  Some implementations might
handle all-zeros and the spec _currently_ calls for it.  In this case an
alignment would be good.

I guess I'll just go ahead and file a problem report with the spec.
Maybe the Unix vendors will test their implementations and provide feedback.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


[PATCH] O_CLOEXEC for SCM_RIGHTS

2007-06-02 Thread Ulrich Drepper
Part two in the O_CLOEXEC saga: adding support for file descriptors received
through Unix domain sockets.

The patch is once again pretty minimal, it introduces a new flag for recvmsg
and passes it just like the existing MSG_CMSG_COMPAT flag.  I think this bit
is not used otherwise but the networking people will know better.

This new flag is not recognized by recvfrom and recv.  These functions cannot
be used for that purpose and the asymmetry this introduces is not worse than
the already existing MSG_CMSG_COMPAT situations.
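
As a single-process illustration of what the flag is for (separate from
the fork/exec test program in this mail), the sketch below passes a
descriptor across a socketpair and checks FD_CLOEXEC on the received
copy.  It assumes a system whose <sys/socket.h> already provides
MSG_CMSG_CLOEXEC; all helper names are made up for the example:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/uio.h>

union ctl_buf
{
  struct cmsghdr hdr;           /* Forces suitable alignment.  */
  char bytes[CMSG_SPACE (sizeof (int))];
};

/* Send descriptor FD_TO_SEND over the Unix socket VIA.  */
static int
send_fd (int via, int fd_to_send)
{
  assert (via >= 0);
  char byte = 'x';
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  union ctl_buf ctl;
  memset (&ctl, 0, sizeof (ctl));
  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = ctl.bytes,
                        .msg_controllen = sizeof (ctl.bytes) };
  struct cmsghdr *cmsg = CMSG_FIRSTHDR (&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN (sizeof (int));
  memcpy (CMSG_DATA (cmsg), &fd_to_send, sizeof (int));
  return sendmsg (via, &msg, 0) < 0 ? -1 : 0;
}

/* Receive one descriptor from VIA using the given recvmsg FLAGS.
   Returns the new descriptor, or -1 on failure.  */
static int
recv_fd (int via, int flags)
{
  char byte;
  struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
  union ctl_buf ctl;
  memset (&ctl, 0, sizeof (ctl));
  struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                        .msg_control = ctl.bytes,
                        .msg_controllen = sizeof (ctl.bytes) };
  if (recvmsg (via, &msg, flags) < 0)
    return -1;
  struct cmsghdr *cmsg = CMSG_FIRSTHDR (&msg);
  if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET
      || cmsg->cmsg_type != SCM_RIGHTS)
    return -1;
  int fd;
  memcpy (&fd, CMSG_DATA (cmsg), sizeof (int));
  return fd;
}

/* Returns 1 if a descriptor received with MSG_CMSG_CLOEXEC has
   FD_CLOEXEC set, 0 if not, -1 on error.  */
static int
received_fd_is_cloexec (void)
{
  int sv[2];
  if (socketpair (AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
    return -1;
  int result = -1;
  if (send_fd (sv[0], sv[0]) == 0)
    {
      int fd = recv_fd (sv[1], MSG_CMSG_CLOEXEC);
      if (fd >= 0)
        {
          int fl = fcntl (fd, F_GETFD);
          result = fl < 0 ? -1 : (fl & FD_CLOEXEC) != 0;
          close (fd);
        }
    }
  close (sv[0]);
  close (sv[1]);
  return result;
}
```

Setting the flag at receive time closes the window between recvmsg and a
later fcntl(F_SETFD) in which another thread could fork and exec.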

The patch must be applied on top of the patch which introduced O_CLOEXEC.
It has to remove static from the new get_unused_fd_flags function but since
scm.c cannot live in a module the function still does not have to be exported.

Here's a test program to make sure the code works.  It's so much longer than
the actual patch...

#include <errno.h>
#include <error.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

#ifndef O_CLOEXEC
# define O_CLOEXEC 02000000
#endif
#ifndef MSG_CMSG_CLOEXEC
# define MSG_CMSG_CLOEXEC 0x40000000
#endif


int
main (int argc, char *argv[])
{
  /* Re-executed child: the descriptor number is passed as argv[1] and
     must no longer be valid after the exec.  */
  if (argc > 1)
    {
      int fd = atol (argv[1]);
      printf ("child: fd = %d\n", fd);
      if (fcntl (fd, F_GETFD) == 0 || errno != EBADF)
        {
          puts ("file descriptor valid in child");
          return 1;
        }
      return 0;

    }

  struct sockaddr_un sun;
  strcpy (sun.sun_path, "./testsocket");
  sun.sun_family = AF_UNIX;

  char databuf[] = "hello";
  struct iovec iov[1];
  iov[0].iov_base = databuf;
  iov[0].iov_len = sizeof (databuf);

  union
  {
    struct cmsghdr hdr;
    char bytes[CMSG_SPACE (sizeof (int))];
  } buf;
  struct msghdr msg = { .msg_iov = iov, .msg_iovlen = 1,
                        .msg_control = buf.bytes,
                        .msg_controllen = sizeof (buf) };
  struct cmsghdr *cmsg = CMSG_FIRSTHDR (&msg);

  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN (sizeof (int));

  msg.msg_controllen = cmsg->cmsg_len;

  pid_t child = fork ();
  if (child == -1)
    error (1, errno, "fork");
  if (child == 0)
    {
      /* Sender: accept one connection and pass a descriptor over it.  */
      int sock = socket (PF_UNIX, SOCK_STREAM, 0);
      if (sock < 0)
        error (1, errno, "socket");

      if (bind (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0)
        error (1, errno, "bind");
      if (listen (sock, SOMAXCONN) < 0)
        error (1, errno, "listen");

      int conn = accept (sock, NULL, NULL);
      if (conn == -1)
        error (1, errno, "accept");

      *(int *) CMSG_DATA (cmsg) = sock;
      if (sendmsg (conn, &msg, MSG_NOSIGNAL) < 0)
        error (1, errno, "sendmsg");

      return 0;
    }

  /* For a test suite this should be more robust like a
     barrier in shared memory.  */
  sleep (1);

  int sock = socket (PF_UNIX, SOCK_STREAM, 0);
  if (sock < 0)
    error (1, errno, "socket");

  if (connect (sock, (struct sockaddr *) &sun, sizeof (sun)) < 0)
    error (1, errno, "connect");
  unlink (sun.sun_path);

  *(int *) CMSG_DATA (cmsg) = -1;

  /* Receiver: ask for close-on-exec on the received descriptor.  */
  if (recvmsg (sock, &msg, MSG_CMSG_CLOEXEC) < 0)
    error (1, errno, "recvmsg");

  int fd = *(int *) CMSG_DATA (cmsg);
  if (fd == -1)
    error (1, 0, "no descriptor received");

  char fdname[20];
  snprintf (fdname, sizeof (fdname), "%d", fd);
  execl ("/proc/self/exe", argv[0], fdname, NULL);
  puts ("execl failed");
  return 1;
}



Signed-off-by: Ulrich Drepper [EMAIL PROTECTED]

--- a/fs/open.c
+++ b/fs/open.c
@@ -855,7 +855,7 @@
 /*
  * Find an empty file descriptor entry, and mark it busy.
  */
-static int get_unused_fd_flags(int flags)
+int get_unused_fd_flags(int flags)
 {
 	struct files_struct * files = current->files;
 	int fd, error;
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -73,6 +73,7 @@ extern struct file * FASTCALL(fget_light(unsigned int fd, int *fput_needed));
 extern void FASTCALL(set_close_on_exec(unsigned int fd, int flag));
 extern void put_filp(struct file *);
 extern int get_unused_fd(void);
+extern int FASTCALL(get_unused_fd_flags(int flags));
 extern void FASTCALL(put_unused_fd(unsigned int fd));
 struct kmem_cache;
 
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -253,6 +253,9 @@ struct ucred {
 
 #define MSG_EOF MSG_FIN
 
+#define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exit for file
+					   descriptor received through
+					   SCM_RIGHTS */
 #if defined(CONFIG_COMPAT)
 #define MSG_CMSG_COMPAT	0x80000000	/* This message needs 32 bit fixups */
 #else
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -228,7 +228,8 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
 		err = security_file_receive(fp[i]);
 		if (err)
 			break;
-		err = get_unused_fd();
+		err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
+					  ? O_CLOEXEC : 0

[PATCH] V2: O_CLOEXEC for SCM_RIGHTS

2007-06-02 Thread Ulrich Drepper
Take two: I forgot to change the compat code.  This has now happened.  Only one
additional line changed.

Everything else from the first patch remains the same.  I try to avoid clogging
the list unnecessarily by not resending the test program.


Signed-off-by: Ulrich Drepper [EMAIL PROTECTED]

--- a/fs/open.c
+++ b/fs/open.c
@@ -855,7 +855,7 @@
 /*
  * Find an empty file descriptor entry, and mark it busy.
  */
-static int get_unused_fd_flags(int flags)
+int get_unused_fd_flags(int flags)
 {
 	struct files_struct * files = current->files;
 	int fd, error;
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -73,6 +73,7 @@ extern struct file * FASTCALL(fget_light(unsigned int fd, int *fput_needed));
 extern void FASTCALL(set_close_on_exec(unsigned int fd, int flag));
 extern void put_filp(struct file *);
 extern int get_unused_fd(void);
+extern int FASTCALL(get_unused_fd_flags(int flags));
 extern void FASTCALL(put_unused_fd(unsigned int fd));
 struct kmem_cache;
 
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -253,6 +253,9 @@ struct ucred {
 
 #define MSG_EOF MSG_FIN
 
+#define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exit for file
+					   descriptor received through
+					   SCM_RIGHTS */
 #if defined(CONFIG_COMPAT)
 #define MSG_CMSG_COMPAT	0x80000000	/* This message needs 32 bit fixups */
 #else
--- a/net/compat.c
+++ b/net/compat.c
@@ -276,7 +276,8 @@ void scm_detach_fds_compat(struct msghdr *kmsg, struct scm_cookie *scm)
 		err = security_file_receive(fp[i]);
 		if (err)
 			break;
-		err = get_unused_fd();
+		err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
+					  ? O_CLOEXEC : 0);
 		if (err < 0)
 			break;
 		new_fd = err;
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -228,7 +228,8 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
 		err = security_file_receive(fp[i]);
 		if (err)
 			break;
-		err = get_unused_fd();
+		err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
+					  ? O_CLOEXEC : 0);
 		if (err < 0)
 			break;
 		new_fd = err;
--- a/net/socket.c
+++ b/net/socket.c
@@ -1939,9 +1939,7 @@ asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg,
 	total_len = err;
 
 	cmsg_ptr = (unsigned long)msg_sys.msg_control;
-	msg_sys.msg_flags = 0;
-	if (MSG_CMSG_COMPAT & flags)
-		msg_sys.msg_flags = MSG_CMSG_COMPAT;
+	msg_sys.msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
 
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;


[PATCH] V3: O_CLOEXEC for SCM_RIGHTS

2007-06-02 Thread Ulrich Drepper
Take two: I forgot to change the compat code.  This has now happened.  Only one
additional line changed.

Everything else from the first patch remains the same.  I try to avoid clogging
the list unnecessarily by not resending the test program.


Signed-off-by: Ulrich Drepper [EMAIL PROTECTED]

--- a/fs/open.c
+++ b/fs/open.c
@@ -855,7 +855,7 @@
 /*
  * Find an empty file descriptor entry, and mark it busy.
  */
-static int get_unused_fd_flags(int flags)
+int get_unused_fd_flags(int flags)
 {
 	struct files_struct * files = current->files;
 	int fd, error;
--- a/include/linux/file.h
+++ b/include/linux/file.h
@@ -73,6 +73,7 @@ extern struct file * FASTCALL(fget_light(unsigned int fd, int *fput_needed));
 extern void FASTCALL(set_close_on_exec(unsigned int fd, int flag));
 extern void put_filp(struct file *);
 extern int get_unused_fd(void);
+extern int FASTCALL(get_unused_fd_flags(int flags));
 extern void FASTCALL(put_unused_fd(unsigned int fd));
 struct kmem_cache;
 
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -253,6 +253,9 @@ struct ucred {
 
 #define MSG_EOF MSG_FIN
 
+#define MSG_CMSG_CLOEXEC 0x40000000	/* Set close_on_exit for file
+					   descriptor received through
+					   SCM_RIGHTS */
 #if defined(CONFIG_COMPAT)
 #define MSG_CMSG_COMPAT	0x80000000	/* This message needs 32 bit fixups */
 #else
--- a/net/compat.c
+++ b/net/compat.c
@@ -276,7 +276,8 @@ void scm_detach_fds_compat(struct msghdr *kmsg, struct scm_cookie *scm)
 		err = security_file_receive(fp[i]);
 		if (err)
 			break;
-		err = get_unused_fd();
+		err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
+					  ? O_CLOEXEC : 0);
 		if (err < 0)
 			break;
 		new_fd = err;
--- a/net/core/scm.c
+++ b/net/core/scm.c
@@ -228,7 +228,8 @@ void scm_detach_fds(struct msghdr *msg, struct scm_cookie *scm)
 		err = security_file_receive(fp[i]);
 		if (err)
 			break;
-		err = get_unused_fd();
+		err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
+					  ? O_CLOEXEC : 0);
 		if (err < 0)
 			break;
 		new_fd = err;
--- a/net/socket.c
+++ b/net/socket.c
@@ -1939,9 +1939,7 @@ asmlinkage long sys_recvmsg(int fd, struct msghdr __user *msg,
 	total_len = err;
 
 	cmsg_ptr = (unsigned long)msg_sys.msg_control;
-	msg_sys.msg_flags = 0;
-	if (MSG_CMSG_COMPAT & flags)
-		msg_sys.msg_flags = MSG_CMSG_COMPAT;
+	msg_sys.msg_flags = flags & (MSG_CMSG_CLOEXEC|MSG_CMSG_COMPAT);
 
 	if (sock->file->f_flags & O_NONBLOCK)
 		flags |= MSG_DONTWAIT;


Re: [RFC] [GIT PATCH net-2.6.23] IPV6: Configurable IPv6 address selection policy table (RFC3484)

2007-04-29 Thread Ulrich Drepper
David Miller wrote:
> One idea is to have glibc have some kind of socket open, subscribed
> to a group which gets sticky events.

I don't quite yet know the context but I have to intervene: keeping
sockets open is not good.  This will only cause problems.

Any interface must be memory based.  Something like registering a word
which is set when an event arrives is a much better interface.  How you
then go and retrieve messages is another issue.  If this is a rare event
then opening a new netlink socket is no problem.
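
To make the idea concrete, a sketch of the consumer side of such a word.
Everything here is hypothetical: no such kernel interface exists in this
thread, an ordinary atomic variable stands in for the registered word,
and signal_change() stands in for whatever would set it.  The point is
only that the common case costs one memory read and no system call:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* HYPOTHETICAL: a word that the event source bumps on every change.
   In the proposal it would be registered with the kernel; here a
   process-local atomic stands in for it.  */
static _Atomic unsigned int change_word;

struct iface_cache
{
  unsigned int seen;    /* Value of the word when the cache was filled.  */
  bool valid;           /* False until the first fill.  */
  int refills;          /* How often the data had to be re-read.  */
};

/* Stand-in for the event source.  */
static void
signal_change (void)
{
  atomic_fetch_add_explicit (&change_word, 1, memory_order_release);
}

/* Return true if the cached data may be reused.  Otherwise refill the
   cache and return false.  */
static bool
cache_lookup (struct iface_cache *c)
{
  assert (c != NULL);
  unsigned int now = atomic_load_explicit (&change_word,
                                           memory_order_acquire);
  if (c->valid && c->seen == now)
    return true;        /* Nothing happened since the last fill.  */

  /* Here the real code would open a netlink socket and re-read the
     interface tables; that is a rare event.  */
  c->seen = now;
  c->valid = true;
  c->refills++;
  return false;
}
```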

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] [GIT PATCH net-2.6.23] IPV6: Configurable IPv6 address selection policy table (RFC3484)

2007-04-29 Thread Ulrich Drepper
David Miller wrote:
> Something more scalable has to be used.

This is where the shared-memory based event notification comes in.  It
was always also meant to be used for things like this.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [take35 0/10] kevent: Generic event handling mechanism.

2007-02-12 Thread Ulrich Drepper
Evgeniy Polyakov wrote:
> I think that means that everybody is happy with the API, design and set of
> features.

No comment means that I still have not been able to test anything since
regardless of what version I tried, it failed to build.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [take24 0/6] kevent: Generic event handling mechanism.

2006-12-27 Thread Ulrich Drepper
Evgeniy Polyakov wrote:
> Why do we want to inject _ready_ event, when it is possible to mark
> event as ready and wakeup thread parked in syscall?

Going back to this old one:

How do you want to mark an event ready if you don't want to introduce
yet another layer of data structures?  The event notification happens
through entries in the ring buffer.  Userlevel code should never add
anything to the ring buffer directly, this would mean huge
synchronization problems.  Yes, one could add additional data structures
accompanying the ring buffer which can specify userlevel-generated
events.  But this is a) clumsy and b) a pain to use when the same ring
buffer is used in multiple threads (you'd have to have another shared
memory segment).

It's much cleaner if the userlevel code can get the kernel to inject a
userlevel-generated event.  This is the equivalent of userlevel code
generating a signal with kill().

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: Kevent POSIX timers support.

2006-11-27 Thread Ulrich Drepper

Evgeniy Polyakov wrote:
>> We need to pass the data in the sigev_value member of the struct
>> sigevent structure passed to timer_create to the caller.  I don't see it
>> being done here nor when the timer is created.  Do I miss something?
>> The sigev_value value should be stored in the user/ptr member of struct
>> ukevent.
>
> sigev_value was stored in k_itimer structure, I just do not know where
> to put it in the ukevent provided to userspace - it can be placed in
> pointer value if you like.


sigev_value is a union and the largest element is a pointer.  So, 
transporting the pointer value is sufficient and it should be passed up 
to the user in the ptr member of struct ukevent.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take24 0/6] kevent: Generic event handling mechanism.

2006-11-27 Thread Ulrich Drepper

Evgeniy Polyakov wrote:


> With provided patch it is possible to wakeup 'for-free' - just call
> kevent_ctl(ready) with zero number of ready events, so thread will be
> awakened if it was in poll(kevent_fd), kevent_wait() or
> kevent_get_events().


Yes, I realize that.  But I wrote something else:

 Rather than mark an existing entry as ready, how about a call to
 inject a new ready event?

 This would be useful to implement functionality at userlevel and
 still use an event queue to announce the availability.  Without this
 type of functionality we'd need to use indirect notification via
 signal or pipe or something like that.

This is still something which is wanted.
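
For comparison, the indirect notification via a pipe mentioned above
looks roughly like the following sketch.  The function names are
illustrative only; this is the workaround a userlevel event source needs
today, not part of any kevent interface.

```c
#define _POSIX_C_SOURCE 200809L
#include <poll.h>
#include <unistd.h>

/* Userlevel event source: wake whoever polls the read end by writing
 * a single byte. */
static int notify_via_pipe(int write_fd)
{
    const char token = 'x';
    return write(write_fd, &token, 1) == 1 ? 0 : -1;
}

/* Consumer: returns 1 if a notification was pending (and consumes it),
 * 0 on timeout, -1 on error. */
static int poll_notification(int read_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = read_fd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    if (n <= 0)
        return n;

    char token;
    return read(read_fd, &token, 1) == 1 ? 1 : -1;
}
```

Every such source costs two file descriptors and an extra syscall per
event, and the byte carries no payload, which is the overhead an
"inject a ready event" call would avoid.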

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: Kevent POSIX timers support.

2006-11-27 Thread Ulrich Drepper

David Miller wrote:

Now we'll have to have a compat layer for 32-bit/64-bit environments
thanks to POSIX timers, which is ridiculous.


We already have compat_sys_timer_create.  It should be sufficient just 
to add the conversion (if anything new is needed) there.  The pointer 
value can be passed to userland in one or two int fields, I don't really 
care.  When reporting the event to the user code we cannot just point 
into the ring buffer anyway.  So while copying the data we can rewrite 
it if necessary.  I see no need to complicate the code more than it 
already is.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take24 0/6] kevent: Generic event handling mechanism.

2006-11-27 Thread Ulrich Drepper
.


But the signal mask is something completely different and completely 
independent from the signal queue.  There is nothing in the kevent 
interface to replace that functionality.  Nor should this be possible 
with the events; only a sigset_t parameter to kevent_wait makes sense.




Having sigmask parameter is the same as creating kevent signal delivery.


No, no, no.  Not at all.


Surely you don't suggest keeping your original timer patch?


Of course not - kevent timers are more scalable than POSIX timers (the
latter use idr, which is slower than a balanced binary tree, since it
appears to use an algorithm similar to a radix tree), and the POSIX
interface is much, much more inconvenient to use than a simple add/wait.


I assume you misread the question.  You agree to drop the patch and then
go on listing reasons why you think it's better to keep it.  I don't
think these arguments are in any way sufficient.  The interface is
already too big and this is 100% duplicate functionality.  If there are 
performance problems with the POSIX timer implementation (and I have yet 
to see indications) it should be fixed instead of worked around.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take25 1/6] kevent: Description.

2006-11-27 Thread Ulrich Drepper

Evgeniy Polyakov wrote:

If the kernel has put data asynchronously it will set up a special flag,
thus kevent_wait() will not sleep and will return, so the thread will
check new entries and process them.


This is not sufficient.

The userlevel code does not commit the events until they are processed.
So assume two threads at userlevel, one event is asynchronously
posted.  The first thread picks it up, the second calls kevent_wait.


With your scheme it will not be put to sleep and unnecessarily returns 
to userlevel.


What I propose and what has been proven to work in many situations is to
have as part of the kevent_wait syscall the information "I am aware
of all events up to XX; wake me only if anything beyond that is added".


Please take a look at how futexes work, it's really the same concept. 
And it's really also simpler for the implementation.  Having such a flag 
is much more complicated than adding a simple index comparison before 
going to sleep.
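
The index comparison can be sketched as follows.  This is a userlevel
model only: struct evq_sketch and both functions are hypothetical names,
and the `tail` counter stands in for whatever index the kernel would
keep for the ring buffer.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical queue state: `tail` is bumped once for every event that
 * is added to the ring buffer. */
struct evq_sketch {
    atomic_uint tail;
};

/* Producer side: record that one more event exists. */
static void evq_post(struct evq_sketch *q)
{
    atomic_fetch_add(&q->tail, 1);
}

/* The check a kevent_wait-like call would make before sleeping: the
 * caller passes the index up to which it knows about events, and it
 * sleeps only if nothing beyond that has been added. */
static bool evq_should_sleep(struct evq_sketch *q, unsigned seen)
{
    return atomic_load(&q->tail) == seen;
}
```

In the two-thread scenario above, the second thread passes seen == 1
after the first thread has picked up the only event, so it goes to sleep
instead of returning to userlevel for nothing.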


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take25 1/6] kevent: Description.

2006-11-27 Thread Ulrich Drepper

Evgeniy Polyakov wrote:

It _IS_ how previous interface worked.

EXACTLY!


No, the old interface committed everything, not only up to a given index.
This is the huge difference which makes or breaks it.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take24 0/6] kevent: Generic event handling mechanism.

2006-11-11 Thread Ulrich Drepper
 on a specific CPU then the wakeup function
  should take this into account.  I.e., if any of the threads
  waiting was/will be scheduled on the same CPU it should be
  preferred.

  With the current simple form of a ring buffer this isn't sufficient,
  though.  Reading all entries in the ring buffer until finding the
  one written by the CPU in question is not helpful.  We'd need a
  mechanism to point the thread to the entry in question.  One
  possibility to do this is to return the ring buffer entry as the
  return value of the kevent_wait() syscall.  This works fine if the
  thread only works for one event (which I guess will be 99.999% of
  all uses).  An extension could be to extend the ukevent structure to
  contain an index of the next entry written by the same CPU.

  Another problem this entails is false sharing of the ring buffer
  entries.  This would probably require padding the ukevent structure
  to 64 bytes.  It's not that much more, 40 bytes so far, and it's
  also more future-safe.  The alternative is to have per-CPU
  regions in the ring buffer.  With hotplug CPUs this is just plain
  silly.

  I think this optimization has the potential to help quite a bit,
  especially for large machines.
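
  A sketch of the padding idea, assuming a 64-byte cache line.  The
  40-byte payload is an opaque stand-in for the current ukevent fields;
  all type and function names here are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

#define CACHE_LINE 64   /* assumed cache-line size */

/* Opaque 40-byte stand-in for the existing ukevent contents. */
struct ukevent_payload {
    uint8_t bytes[40];
};

/* Aligning the member raises the struct's alignment, and therefore its
 * size, to one full cache line per ring buffer entry. */
struct ukevent_padded {
    alignas(CACHE_LINE) struct ukevent_payload ev;
};

static_assert(sizeof(struct ukevent_padded) == CACHE_LINE,
              "one ring buffer entry per cache line");

/* True if two addresses fall into the same cache line. */
static int same_cache_line(const void *a, const void *b)
{
    return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

  With this layout two CPUs writing neighbouring entries never touch the
  same line, and the 24 spare bytes remain available for later fields.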

===

- we absolutely need an interface to signal the kernel that a thread,
  just woken from kevent_wait, cannot handle the events.  I.e., the
  events are in the ring buffer but all the other threads are in the
  kernel in their kevent_wait calls.  The new syscall would wake up
  one or more threads to handle the events.

  This syscall is for instance necessary if the thread calling
  kevent_wait is canceled.  It might also be needed when a thread
  requested more than one event and realizes processing an entry
  takes a long time and that another thread might work on the other
  items in the meantime.


  Al Viro pointed out another possible solution which also could solve
  the handled flag problem and concurrency in use of the ring buffer.

  The idea is to require the kevent_wait() syscall to signal which entry
  in the ring buffer is handled or not handled.  This means:

  + the kernel knows at any time which entries in the buffer are free
and which are not

  + concurrent filling of the ring buffer is no problem anymore since
entries are not discarded until told

  + by not waiting for event (num parameter == 0) the syscall can be
used to discard entries to free up the ring buffer before continuing
to work on more entries.  And, as per the requirement above, it can
be used to tell the kernel that certain entries are *NOT* handled
and need to be sent to another thread.  This would be useful in the
thread cancellation case.

  This seems like a nice approach.

===

- why no syscall to create kevent queue?  With dynamic /dev this might
  be a problem and it's really not much additional code.  What about
  programs which want to use these interfaces before /dev is set up?

===

- still: the syscall should use a struct timespec* timeout parameter
  and not nanosecs.  There are at least three timeout modes which
  are wanted:

  + relative, unconditionally wait that long

  + relative, aborted in case of large enough settimeofday() or NTP
adjustment

  + absolute timeout.  Probably even with selecting which clock to use.
This mode requires a timespec value parameter


  We have all this code already in the futex syscall.  It just needs to
  be generalized or copied and adjusted.

===

- still: no signal mask parameter in the kevent_wait (and get_event)
  syscall.  Regardless of what one thinks about signals, they are used
  and integrating the kevent interface into existing code requires
  this functionality.  And it's not only about receiving signals.
  The signal mask parameter can also be used to _prevent_ signals from
  being delivered in that time.

===

- the KEVENT_REQ_WAKEUP_ONE functionality is good and needed.  But I
  would reverse the default.  I cannot see many places where you want
  all threads to be woken.  Introduce KEVENT_REQ_WAKEUP_ALL instead.

===

- there is really no reason to invent yet another timer implementation.
  We have the POSIX timers which are feature rich and nicely
  implemented.  All that is needed is to implement SIGEV_KEVENT as a
  notification mechanism.  The timer is registered as part of the
  timer_create() syscalls.

===


I haven't yet looked at the other event sources.  I think the above is
enough for now.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-16 Thread Ulrich Drepper

Evgeniy Polyakov wrote:

One can set number of events before the syscall and do not remove them
after syscall. It can be updated if there is need for that.


Nobody doubts that it is possible.  But it is

a) potentially much more expensive

and

b) an alien concept

to have the signal mask which is set during the wait call specified
implicitly.  Conceptually it doesn't even make sense.  This is no event
to wait for.  It is a parameter for the specific wait call, just like
the timeout.  And I fortunately haven't seen you proposing to pass the
timeout value implicitly.



Not good enough?  It does exactly what it is supposed to do.  What can 
there be not good enough?


Not to move signals into special case of events. If poll() can not work
with them it does not mean, that they need to be specified as additional
syscall parameter, instead change poll() to work with them, which can be
easily done with kevents.


You still seem to be completely missing the point.  The signal mask is 
no event to wait for.  It has nothing to do with this that ppoll() takes 
the signal mask as a parameter.  The signal mask is a parameter for the 
wait call just like the timeout, not more and not less.




Do not mix warm and soft - waiting for some period is not equal to
syscall timeout. Waiting is possible with timer kevent user (although
only relative timeout, can be changed to support both, not a big
problem).


That's what I'm saying all the time.  Of course it can be supported.
But for this the timeout parameter must be a timespec pointer.  Whatever
you could possibly mean by "do not mix warm and soft" I cannot possibly
imagine.  Fact is that both relative and absolute timeouts are useful.
And that for absolute timeouts the change of the clock has to be taken
into account.




I'm quite sure that absolute timeouts are very useful, but not in
the case of waiting for syscall completion. In any case, kevent can be
extended to support absolute timeouts in its timer notifications.


That's not the same.  If you argue that then the syscall should have no 
timeout parameter at all.  Fact is that setting up a timer is not for 
free.  Since the timeout is used all the time having a timeout parameter 
is the right answer.  And if you do this then do it right just like 
every other syscall other than poll: use a timespec object.  This gives 
flexibility without measurable cost.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take19 1/4] kevent: Core files.

2006-10-16 Thread Ulrich Drepper

Evgeniy Polyakov wrote:

The whole idea of mmap buffer seems to be broken, since those who asked
for creation do not like existing design and do not show theirs...


What kind of argumentation is that?

   "Because my attempt to implement it doesn't work and nobody right
away has a better suggestion, this means the idea is broken."

Nonsense.

It just means that time should be spent on thinking about this.  You cut
all this short by rushing out your attempt without any discussions. 
Unfortunately nobody else really looked at the approach so it lingered 
around for some weeks.  Well, now it is clear that it is not the right 
approach and we can start thinking about it again.



You seem not to have checked the code - each event can be marked as ready
only one time, which means only one copy and so on.

It was done _specially_. And it is not a limitation, but a new approach.


I know that it is done deliberately and I tell you that this is wrong
and unacceptable.  Realtime signals are one kind of event which needs to
have more than one event queued.  This is no description of what you have
implemented, it's a description of the reality of realtime signals.


RT signals are queued.  They carry a data value (the sigval_t object) 
which can be unique for each signal delivery.  Coalescing the signal 
events therefore leads to information loss.
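
The queueing behaviour can be observed with the existing POSIX calls.
The sketch below queues several instances of SIGRTMIN, each with its own
value, and collects them one by one; the function name is illustrative
and a single-threaded caller is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <time.h>
#include <unistd.h>

/* Queue `n` instances of SIGRTMIN carrying the values 1..n, then drain
 * them.  Returns the number of separate deliveries received and stores
 * the value carried by each one in values_out[]. */
static int queue_and_drain(int n, int *values_out)
{
    sigset_t set, old;
    int got = 0;

    sigemptyset(&set);
    sigaddset(&set, SIGRTMIN);
    /* Block the signal so every instance stays queued until collected. */
    if (sigprocmask(SIG_BLOCK, &set, &old) != 0)
        return -1;

    for (int i = 1; i <= n; i++) {
        union sigval v;
        v.sival_int = i;
        if (sigqueue(getpid(), SIGRTMIN, v) != 0)
            break;
    }

    /* Zero timeout: only fetch what is already pending. */
    const struct timespec no_wait = { .tv_sec = 0, .tv_nsec = 0 };
    siginfo_t info;
    while (got < n && sigtimedwait(&set, &info, &no_wait) == SIGRTMIN)
        values_out[got++] = info.si_value.sival_int;

    sigprocmask(SIG_SETMASK, &old, NULL);
    return got;
}
```

If the instances were coalesced into one "signal is ready" event, only
one of the values would survive, which is the information loss described
above.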


Therefore, at the very least for signals we need to have the ability to
queue more than one event for each event source.  Not having this 
functionality means that signals and likely other types of events cannot 
be implemented using kevent queues.




A queue of the same signals or any other events has a fundamental flaw
(like any other ring buffer implementation with a fixed queue size):
the size of the queue and the extremely bad case of overflow.


Of course there are additional problems.  Overflows need to be handled. 
 But this is nothing which is unsolvable.




So, the same event may not be ready several times. Any design which
allows an infinite number of events to be generated for the same case
is broken, since the consumer can be in a situation where it cannot
handle that flow.


That's complete nonsense.  Again, for RT signals it is very reasonable 
and not broken to have multiple outstanding signals.




That is why poll() returns only POLLIN when data is ready in
network stack, but is not trying to generate some kind of a signal for 
each byte/packet/MTU/MSS received.


It makes no sense to drag poll() into this discussion.  poll() is a very 
limited interface.  The new event handling is supposed to be the 
opposite, namely, usable for all kinds of events.  Arguing that because 
poll() does it like this just means you don't see what big step is 
needed to get to the goal of a unified event handling.  The shackles of 
poll() must be left behind.




RT signals have design problems, and I will not repeat the same error
with similar limits in kevent.


I don't know what to say.  You claim to be the source of all wisdom in
OS design.  Maybe you should design your own OS, from ground up.  I 
wonder how many people would like that since all your arguments are 
squarely geared towards optimizing the implementation.  But: the 
implementation is irrelevant without users.  The functionality users (= 
programmers) want and need is what must drive the implementation.  And 
RT signals are definitely heavily used and liked by programmers.  You 
have to accept that you try to modify an OS which has that functionality 
regardless of how much you hate it and want to fight it.




Mmap implementation can be added separately, since it does not affect
kevent core.


That I doubt very much and it is why I would not want the kevent stuff 
go into any released kernel until that detail is resolved.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-15 Thread Ulrich Drepper

Evgeniy Polyakov wrote:

In context you have cut, one updated signal mask between calls to event
delivery mechanism (using for example signal()), so it has exactly the
same price.


No, it does not.  If the signal mask is recomputed by the program for 
each new wait call then you have a lot more work to do when the signal 
mask is implicitly specified.




I created it just because I think that POSIX workaround to add signals
into the syscall parameters is not good enough.


Not good enough?  It does exactly what it is supposed to do.  What can 
there be not good enough?




You again cut my explanation on why just pure timeout is used.
We start a syscall, which can block forever, so we want to limit its
time, and we add special parameter to show how long this syscall should
run. Timeout is not about how long we should sleep (which indeed can be
absolute), but how long syscall should run - which is related to the 
time syscall started.


I know very well what a timeout is.  But the way the timeout can be 
specified can vary.  It is often useful (as for select, poll) to specify 
relative timeouts.


But there are equally useful uses where the timeout is needed at a
specific point in time.  Without a syscall interface which can have an
absolute timeout parameter we'd have to write as a poor approximation at
userlevel:


struct timespec ts, rel;
clock_gettime (CLOCK_REALTIME, &ts);
rel.tv_sec = abstmo.tv_sec - ts.tv_sec;
rel.tv_nsec = abstmo.tv_nsec - ts.tv_nsec;
if (rel.tv_nsec < 0) {
  rel.tv_nsec += 1000000000;  // borrow one second
  --rel.tv_sec;
}
if (rel.tv_sec < 0)
  inttmo = -1;  // or whatever is used for return immediately
else
  inttmo = rel.tv_sec * UINT64_C(1000000000) + rel.tv_nsec;

wait(..., inttmo, ...)


Not only is this much more expensive to do at userlevel, it is also 
inadequate because calls to settimeofday() do  not cause a recomputation 
of the timeout.


See Ingo's RT futex stuff as an example for a kernel interface which 
does it right.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take19 1/4] kevent: Core files.

2006-10-15 Thread Ulrich Drepper

Evgeniy Polyakov wrote:

Existing design does not allow overflow.


And I've pointed out a number of times that this is not practical at 
best.  There are event sources which can create events which cannot be 
coalesced into one single event as it would be required with your design.


Signals are one example, specifically realtime signals.  If we do not 
want the design to be limited from the start this approach has to be 
thought over.



So zap mmap() support completely, since it is not usable at all. We won't
discuss it.


Initial implementation did not have it.
But I was requested to do it, and it is ready now.
No one likes it, but no one provides an alternative implementation.
We are stuck.


We need the mapped ring buffer.  The current design (before it was 
removed) was broken but this does not mean it shouldn't be implemented. 
 We just need more time to figure out how to implement it correctly.


--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-05 Thread Ulrich Drepper
Evgeniy Polyakov wrote:
 And you can add/remove signal events using existing kevent api between
 calls.

That's far more expensive than using a mask under control of the program.


 And creating special cases for usual events is bad.
 There is unified way to deal with events in kevent -
 add/remove/modify/wait on them, signals are just usual events.

How can this be unified?  The installment of the temporary signal mask
is unlike the handling of signal for the purpose of reporting them
through the signal queue.  It's equally completely new functionality.
Don't kid yourself in thinking that because this is signal stuff, too,
you're unifying something.  The way this signal mask is used has
nothing whatsoever to do with the delivering signals via the event
queue.  For the latter the signals always must be blocked (similar to
sigwait's requirement).

As a result it means you want to introduce a new mechanism for the event
queue instead of using the well known and often used method of
optionally passing a signal mask to the syscall.  That's just insane.


 I think you wanted to say, that 'all event mechanism except the most
 commonly used poll/select/epoll use timespec'.

Get your facts straight.  select uses timeval which is just the
predecessor of timespec.  And epoll is just (badly) designed after
poll.  Fact is therefore that poll plus its spawn is the only interface
using such a timeout method.


 I designed it to be similar to poll(), it is really good interface.

Not many people agree.  All the interfaces designed (not derived) in the
last years take a timespec parameter.

Plus, you chose to ignore all the nice things using a timespec allows,
like absolute timeout modes etc.  See the clock_nanosleep() interface
for a way this can be useful.
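
As a reminder of what that interface offers, here is a small sketch of
an absolute-deadline wait with clock_nanosleep().  The wrapper name and
its parameters are illustrative only; the call itself and TIMER_ABSTIME
are standard POSIX.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <time.h>

/* Sleep until `delay_ns` nanoseconds from now, expressed as an absolute
 * deadline on `clock`.  The deadline is written to *deadline.  Returns
 * 0 on success or an errno value. */
static int sleep_until_offset(clockid_t clock, long delay_ns,
                              struct timespec *deadline)
{
    if (clock_gettime(clock, deadline) != 0)
        return errno;

    deadline->tv_sec += delay_ns / 1000000000L;
    deadline->tv_nsec += delay_ns % 1000000000L;
    if (deadline->tv_nsec >= 1000000000L) {
        deadline->tv_nsec -= 1000000000L;
        deadline->tv_sec++;
    }

    /* With TIMER_ABSTIME an interrupted sleep can simply be restarted
     * with the same deadline; nothing has to be recomputed. */
    int rc;
    do {
        rc = clock_nanosleep(clock, TIMER_ABSTIME, deadline, NULL);
    } while (rc == EINTR);
    return rc;
}
```

The same timespec argument serves both modes; only the flag changes,
which is the flexibility a plain relative timeout value cannot provide.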

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-04 Thread Ulrich Drepper

On 9/22/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:

The only two things missing from the patchset after his suggestions are
a new POSIX-like interface, which I personally consider very inconvenient,


This means you really do not know at all what this is about.  We
already have these interfaces.  Several of them and there will likely
be more.  These are interfaces for functionality which needs the new
event notification.  There is *NO* reason whatsoever to not make this


Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-04 Thread Ulrich Drepper

[Bah, sent too early]

On 9/22/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:

The only two things missing from the patchset after his suggestions are
a new POSIX-like interface, which I personally consider very inconvenient,


This means you really do not know at all what this is about.  We
already have these interfaces.  Several of them and there will likely
be more.  These are interfaces for functionality which needs the new
event notification.  There is *NO* reason whatsoever not to add
this extension and instead invent new interfaces to have notification
sent to the event queue.


Re: [take19 1/4] kevent: Core files.

2006-10-04 Thread Ulrich Drepper

On 9/20/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:

This patch includes core kevent files:
[...]


I tried to look at the example programs before and failed.  I tried
again.  Where can I find up-to-date example code?

Some other points:

- I really would prefer not to rush all this into the upstream kernel.
The main problem is that the ring buffer interface is a shared data
structure.  These are always tricky.  We need to find the right
combination between size (as small as possible) and supporting all the
interfaces.

- so far only the timer and aio notification is speced out.  What
about the rest?  Are we sure all aspects can be expressed?  I am not
yet.

- we need an interface to add an event from userlevel.  I.e., we need
to be able to synthesize events.  There are events (like, for instance
the async DNS functionality) which come from userlevel code.

I would very much prefer we look at the other events before setting
the data structures in stone.


Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-04 Thread Ulrich Drepper
Evgeniy Polyakov wrote:
 When we enter sys_ppoll() we specify needed signals as syscall
 parameter, with kevents we will add them into the queue.

No, this is not sufficient as I said in the last mail.  Why do you
completely ignore what others say?  The code which depends on the signal
does not have to have access to the event queue.  If a library sets up
an interrupt handler then it expects the signal to be delivered this way.
In such situations ppoll etc allow the signal to be generally blocked
and enabled only and *ATOMICALLY* around the delays.  This is not
possible with the current wait interface.  We need this signal mask
interface and the appropriate setup code.

Being able to get signal notifications does not mean this is always the
way it can and must happen.
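
The pattern in question, sketched with pselect() (the POSIX sibling of
ppoll, used here because it needs no GNU extensions).  The function
names are illustrative; a kevent wait call with a signal mask parameter
would be used the same way.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <time.h>

static volatile sig_atomic_t handler_runs;

/* Stands in for a handler installed by some library. */
static void on_usr1(int signo)
{
    (void)signo;
    handler_runs++;
}

/* Install the handler and keep SIGUSR1 blocked in normal operation. */
static int setup_blocked_usr1(void)
{
    struct sigaction sa;
    sigset_t block;

    sa.sa_handler = on_usr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    return sigprocmask(SIG_BLOCK, &block, NULL);
}

/* Wait with SIGUSR1 enabled only for the duration of the call.  The
 * mask switch and the wait are one atomic step inside the kernel.
 * Returns 1 if the handler interrupted the wait, 0 on timeout, -1 on
 * any other error. */
static int wait_with_signal_window(long timeout_ms)
{
    sigset_t during_wait;
    struct timespec ts = { .tv_sec = timeout_ms / 1000,
                           .tv_nsec = (timeout_ms % 1000) * 1000000L };

    sigprocmask(SIG_SETMASK, NULL, &during_wait);   /* query current mask */
    sigdelset(&during_wait, SIGUSR1);

    int rc = pselect(0, NULL, NULL, NULL, &ts, &during_wait);
    if (rc == 0)
        return 0;
    return (rc == -1 && errno == EINTR) ? 1 : -1;
}
```

Unblocking with sigprocmask() before a wait call without a mask
parameter would leave a window in which the signal arrives before the
wait starts, and the wakeup would be lost.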

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-04 Thread Ulrich Drepper
Evgeniy Polyakov wrote:
 It is completely possible to do what you describe without special
 syscall parameters.

First of all, I don't see how this is efficiently possible.  The mask
might change from call to call.

Second, hasn't it sunk in that inventing new ways to pass parameters is
bad?  Programmers don't want to learn new ways for every new interface.
 Reuse is good!

This applies to the signal mask here.

But there is another parameter falling into that category and I meant to
mention it before: the timeout value.  All other calls except poll and
especially all modern interfaces use a timespec pointer.  This is the
way times are kept in userland code.  Don't try to force people to do
something else.

Using a timespec also has the advantage that we can add an absolute
timeout value mode (optional) instead of the relative timeout value.

In this context, we should/must be able to specify which clock the
timeout is for (not as part of the wait call, but another control
operation perhaps).  It's important to distinguish between
CLOCK_REALTIME and CLOCK_MONOTONIC.  Both have their use.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [take19 1/4] kevent: Core files.

2006-10-04 Thread Ulrich Drepper

On 10/3/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:

http://tservice.net.ru/~s0mbre/archive/kevent/evserver_kevent.c
http://tservice.net.ru/~s0mbre/archive/kevent/evtest.c


These are simple programs which by themselves have problems.  For
instance, I consider it a very bad idea to hardcode the size of the ring
buffer.  Specifying macros in the header file counts as hardcoding.
Systems grow over time and so will the demand of connections.  I have
no problem with the kernel hardcoding the value internally (or having
a /proc entry to select it) but programs should be able to dynamically
learn about the value so they don't have to be recompiled.

But more problematic is that I don't see how the interfaces can be
efficiently used in multi-threaded (or multi-process) programs.  How
would multiple threads using the same kevent queue and running in the
same kevent_get_events() loop work out?  How do they guarantee that
each request is only handled once?


From what I see now this means a second data structure is needed to
keep track of the state of each entry.  But even then, how do we even
recognize used ring buffer entries?

For instance, assume two threads.  Both call get_events, one event is
reported, both threads are woken up (which is another thing to
consider, more later).  One thread uses ring buffer entry, the other
goes back to sleep in get_events.  Now, how does the kernel know when
the other thread is done working on the ring buffer entry?  There
might be lots of entries coming in overflowing the entire buffer.
Heck, you don't even need two threads for this scenario.

When I was thinking about this (and discussing it in Ottawa) I was
always assuming that we have a status field in the ring buffer entry
which lets the userlevel code indicate whether the entry is free again
or not.  This requires a writable mapping, yes, and potentially causes
cache line ping-pong.  I think Zach mentioned he has some ideas about
this.
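
A rough userlevel model of such a status field, purely as a sketch.  The
three states, the struct, and the function names are assumptions made
for illustration; the thread only establishes that userlevel needs some
way to mark an entry as free again.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed entry states for this sketch. */
enum { ENTRY_FREE, ENTRY_READY, ENTRY_CLAIMED };

struct ring_entry_sketch {
    atomic_int status;   /* written by both sides: needs a writable mapping */
    void *ptr;           /* event payload */
};

/* Producer (the kernel in the real design): refuse to overwrite an
 * entry that userlevel has not released yet. */
static bool entry_publish(struct ring_entry_sketch *e, void *ptr)
{
    if (atomic_load(&e->status) != ENTRY_FREE)
        return false;
    e->ptr = ptr;
    atomic_store(&e->status, ENTRY_READY);
    return true;
}

/* Consumer thread: exactly one caller wins the compare-and-swap, so
 * each entry is handled only once. */
static bool entry_claim(struct ring_entry_sketch *e)
{
    int expected = ENTRY_READY;
    return atomic_compare_exchange_strong(&e->status, &expected,
                                          ENTRY_CLAIMED);
}

/* Consumer thread: processing finished, the slot may be reused. */
static void entry_release(struct ring_entry_sketch *e)
{
    atomic_store(&e->status, ENTRY_FREE);
}
```

The cost is visible in the model: every claim and release writes to the
entry, which is where the cache line ping-pong comes from.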


As for the multiple thread wakeup, I mentioned this before.  We have
to avoid the thundering herd problem.  We cannot wakeup all waiters.
But we also cannot assume that, without protocols, waking up just one
for each available entry is sufficient.  So the first question is:
what is the current policy?



AIO was removed from the patchset at the request of Christoph.
Timers, network AIO, fs AIO, socket notifications and poll/select
events work well with existing structures.


Well, excuse me if I don't take your word for it.  I agree, the AIO
code should not be submitted along with this.  The same for any other
code using the event handling.  But we need to check whether the
interface is generic enough to accommodate them in a way which actually
makes sense.  Again, think highly threaded processes or multiple
processes sharing the same event queue.



It is even possible to create variable sized kevents - each kevent
contains a pointer to the user's data, which can be considered a pointer
to an additional area (the kernel implementation for the given kevent
type can determine its size from other parameters, or use a predefined
one, and fetch the additional data in the ->enqueue() callback).


That sounds interesting and certainly helps with securing the
interface for the future.  But if there is anything we can do to avoid
unnecessary costs we should do it, even if this means investigating
all this further.


Re: [take19 0/4] kevent: Generic event handling mechanism.

2006-10-03 Thread Ulrich Drepper

On 9/27/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:
I have been told in private what signal masks are about - just to wait
until either signal or given condition is ready, but in that case just
add additional kevent user like AIO complete or network notification
and wait until either requested events are ready or signal is triggered.


No, this won't work.  Yes, I want signal notification as part of the
event handling.  But there are situations when this is not suitable.
Only if the signal is expected in the same code using the event
handling can you do this.  But this is not always possible.
Especially when the signal handling code is used in other parts of the
code than the event handling.  E.g., signal handling in a library,
event handling in the main code.  You cannot assume that all the code
is completely integrated.


Re: [take14 0/3] kevent: Generic event handling mechanism.

2006-09-09 Thread Ulrich Drepper

On 8/31/06, Evgeniy Polyakov [EMAIL PROTECTED] wrote:

Sorry for the long delay - I was on a short vacation.


No vacation here, but travel nonetheless.


 - one point of critique which applied to many proposals over the years:
   multiplexer syscalls are bad, really bad. [...]

Can you convince Christoph?
I do not care about interfaces, but until several people agree on it, I
will not change anything.


I hope that Linus and/or Andrew simply decree that multiplexers are
bad. glibc and probably strace are the two most affected programs so
their maintainers should have a say.  My opinion is clear.  Also for
analysis tools the multiplexers are bad since different numbers of
parameters are used and maybe even with different types.



You completely miss AIO here (I talk not about POSIX AIO).


Sure, I should have mentioned it.  But I was assuming this all along.



I use there only id provided by user, it is not his cookie, but it was
done to make the structure as small as possible.
Think about size of the mapped buffer when there are several kevent
queues - it is all mapped and thus pinned memory.
It of course can be extended.


It being what?  The problem is that the structure of the ring buffer
elements cannot easily be changed later.  So we have to get it right
now which means being a bit pessimistic about future requirements.
Add padding, there will certainly be future uses which need more
space.



 Next, the current interfaces once again fail to learn from a mistake we
 made and which got corrected for the other interfaces.  We need to be
 able to change the signal mask around the delay atomically.  Just like
 we have ppoll for poll, pselect for select (and hopefully soon also
 epoll_pwait for epoll_wait) we need to have this feature in the new
 interfaces.

We are able to change kevents atomically.


I don't understand.  Or you don't understand.  I was talking about
changing the signal mask atomically around the wait call.  I.e., the
call needs an additional optional parameter specifying the signal mask
to use (for the kernel: two parameters, pointer and length).  This
parameter is not available in the version of the patch I looked at and
should be added if it's still missing in the latest version of the
patch.  Again, look at the difference between poll() and ppoll() and
do the same.
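
The poll()/ppoll() difference referred to here can be illustrated with pselect(), which follows the same pattern and is specified by POSIX.  This is only a sketch of the existing interfaces, not kevent code; the helper names wait_readable and demo_signal_during_wait are made up for the illustration.  A signal is kept blocked while the program runs and is let through only while the thread sleeps, so there is no window between checking for the signal and going to sleep:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t got_signal;

static void on_signal(int signo) { (void)signo; got_signal = 1; }

/* Wait until fd is readable.  `waitmask` replaces the thread's signal
 * mask only for the duration of the wait; the kernel swaps the mask and
 * starts sleeping as one atomic step.  Returns pselect()'s result. */
int wait_readable(int fd, const sigset_t *waitmask, long timeout_ms)
{
    fd_set rfds;
    struct timespec ts;

    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    return pselect(fd + 1, &rfds, NULL, NULL, &ts, waitmask);
}

/* Returns 1 if a signal that was pending (blocked) before the wait is
 * delivered during the wait and interrupts it with EINTR. */
int demo_signal_during_wait(void)
{
    int p[2], before, rc, interrupted;
    sigset_t block, old, waitmask;
    struct sigaction sa;

    if (pipe(p) != 0)
        return -1;

    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR1, &sa, NULL);

    sigemptyset(&block);
    sigaddset(&block, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block, &old);   /* SIGUSR1 blocked from now on */
    waitmask = old;
    sigdelset(&waitmask, SIGUSR1);          /* ...except while waiting */

    got_signal = 0;
    raise(SIGUSR1);                         /* stays pending: it is blocked */
    before = got_signal;                    /* handler has not run yet */

    rc = wait_readable(p[0], &waitmask, 1000);
    interrupted = (rc == -1 && errno == EINTR && before == 0 && got_signal == 1);

    sigprocmask(SIG_SETMASK, &old, NULL);
    close(p[0]);
    close(p[1]);
    return interrupted;
}
```

A kevent wait call taking a sigset_t pointer (plus its length on the kernel side) would give callers the same guarantee.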



Well, I rarely talk about what other people want, but if you strongly
feel, that all posix crap is better than epoll interface, then I can not
agree with you.


You miss the point entirely like DaveM before you.  What I ask for is
simply a uniform and well established form to tell an interface to use
the kevent notification mechanism and not use signals etc.  Look at
the mail I sent in reply to DaveM's mail.



It is possible to create additional one using any POSIX API you like,
but I strongly insist on having possibility to use lightweight syscall
interface too.


Again, missing the point.  We can without any significant change
enable POSIX interfaces and GNU extensions like the timer, AIO, the
async DNS code, etc. to use kevents.  For the latter, which is entirely
implemented at userlevel, we need interfaces to queue kevents from
userlevel.  I think this is already supported.  The other two
definitely benefit from using kevent notification and since they
are/will be handled in the kernel the completion events should be
queued in a kevent queue as specified in the sigevent structure passed
to the system call.



Ring buffer _always_ has space for new events until queue is not filled.
So if userspace does not read its events for too long and eventually
tries to add a new one, it will fail early.


Sorry, I don't understand this at all.

If the ring buffer always has enough room then events must be
preregistered.  Is this the case?  Seems very inflexible and how would
this work with event sources like timers which can trigger many times?

I hope you don't mean that ring buffers probably won't overflow since
programs have to handle events fast enough.  That's not acceptable.



There is no overflow - I do not want to introduce another signal queue
overflow crap here.
And once again - no signals.


Well, signals are the only asynchronous notification mechanism we
have.  But more to the point: why cannot there be overflows?



You basically want to deliver the same event to several users.
But how do you want to achieve it with network buffers for example.
When several threads reads from the same socket, they do not obtain the
same data.


That's not what I am after.  I'm perfectly fine with waking only one
thread.  In fact, this is how it must be to avoid the thundering herd
effects.  But there is the problem that if the woken thread is not
working on the issue for which it was woken (e.g., if the thread got
canceled) then it must be able to wake another thread.  In effect,
there should be a syscall which causes a given number of other waiters
(make the number a parameter to the syscall) to be woken.  They would
start running and if nothing 

Re: [take14 0/3] kevent: Generic event handling mechanism.

2006-08-27 Thread Ulrich Drepper
  fail?  Will mremap() work to increase/decrease the size?  Will
  mremap() be allowed to be called with MREMAP_MAYMOVE?  What if mmap()
  is called from different processes (in the POSIX sense, i.e., from
  different address spaces)?

  Either

   mmap(...)

  Or

   int kevent_map_ringbuf (int kfd, size_t num)


- one interface to set additional parameters.  This is likely mostly to
  make the interfaces safe for the future.  Perhaps the number of events
  needed per delay call should be set this way.

int kevent_ctl (int kfd, int cmd, ...)


- one interface to shut the kevent down.  This might be overkill.  We
  should be able to use munmap() and close().  If a real interface for
  this would be created it should look like this

   int kevent_destroy (int kfd, void *ringbuf, size_t num)

  I find this rather more cumbersome.  Just use close and munmap.


- one interface to submit requests.

int kevent_submit (int kfd, struct kevent_event *ev, int flags,
   struct timespec *timeout)

  Maybe the flags parameter isn't needed, it's just another way to make
  sure we won't regret the design later.  If the ring buffer can fill up
  and this is detected by the kernel (unlike what happens in take 14)
  then the calling thread could be delayed indefinitely.  Maybe we even
  have a deadlock if there is only one thread.  If only a wait/no-wait
  mode is needed, then use only a flags parameter and no timeout
  parameter.

  A special variant should be if ev == NULL the call is taken as a
  request to wake one or more delayed threads.


- one interface to delay threads until the next event becomes available.
  No data is transferred along with the call.  The event data must be
  read from the ring buffer:

int kevent_wait (int kfd, unsigned ringstate,
 const struct timespec *timeout,
 const sigset_t *sigmask)

  Wait-mode can be implemented by recognizing timeout==NULL.  no-wait
  mode is implemented using timeout->tv_sec==timeout->tv_nsec==0.  If
  sigmask is NULL the signal mask is not changed.

  The ringstate parameter is also not present in the take 14 proposal.
  Something like it is necessary to prevent the thread from going to
  sleep while there are events in the ring buffer.  It would be very
  wasteful if the kernel would have to keep track of outstanding
  events.  This would also mean that handling events would require
  a system call, exactly what the ring buffer approach should prevent.

  I think the sequence for waiting for an event should be like this:

+ get current ring state
+ check whether any outstanding event in ring buffer
+ if yes, copy data out of ring buffer, mark ring buffer record
  as unused (atomically).
+ if no, call kevent_wait with ring state value

  When the kernel delivers a new event it does:

+ find place to store event
+ change ring state (might be a simple counter)

  The kevent_wait implementation in the kernel would then as the first
  thing determine whether the ring state changed.  If yes, the syscall
  returns immediately with -EWOULDBLOCK.  Otherwise it is queued for
  waiting.

  With these steps and the requirement that all ring buffer entries are
  processed FIFO we can
  a) avoid syscalls for freeing ring buffer entries
  b) detect overflows in the ring buffer
  c) maintain the read pointer at userlevel while the kernel
     maintains the write pointer into the buffer
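
The sequence above can be modelled at userlevel roughly as follows.  This is a single-threaded sketch under stated assumptions, not the proposed kernel interface: struct ring, RING_SIZE, ring_deliver, ring_try_get and sim_kevent_wait are hypothetical names, the event payload is reduced to an int, and sim_kevent_wait only reports whether the real call would have gone to sleep.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define RING_SIZE 8u                 /* hypothetical ring size */

struct slot {
    atomic_int used;                 /* 1 while the record holds an event */
    int data;
};

struct ring {
    atomic_uint state;               /* ring state: a simple counter */
    unsigned write_pos;              /* maintained by the "kernel" side */
    unsigned read_pos;               /* maintained at userlevel */
    struct slot slot[RING_SIZE];
};

/* "Kernel" side: find a place to store the event, then change the ring
 * state.  Returns -1 when the next record is still in use (overflow). */
int ring_deliver(struct ring *r, int data)
{
    struct slot *s = &r->slot[r->write_pos % RING_SIZE];

    if (atomic_load_explicit(&s->used, memory_order_acquire))
        return -1;
    s->data = data;
    atomic_store_explicit(&s->used, 1, memory_order_release);
    r->write_pos++;
    atomic_fetch_add_explicit(&r->state, 1, memory_order_release);
    return 0;
}

/* Stand-in for the wait syscall: refuse to sleep if the ring state moved
 * since the caller sampled it.  Returning 0 means "would sleep now". */
int sim_kevent_wait(struct ring *r, unsigned ringstate)
{
    if (atomic_load_explicit(&r->state, memory_order_acquire) != ringstate)
        return -EWOULDBLOCK;
    return 0;
}

/* Userlevel side: sample the ring state, take an event without a syscall
 * if one is outstanding, otherwise wait with the sampled state.
 * Returns 1 with *data set, 0 if it would sleep, or -EWOULDBLOCK. */
int ring_try_get(struct ring *r, int *data)
{
    unsigned seen = atomic_load_explicit(&r->state, memory_order_acquire);
    struct slot *s = &r->slot[r->read_pos % RING_SIZE];

    if (atomic_load_explicit(&s->used, memory_order_acquire)) {
        *data = s->data;
        atomic_store_explicit(&s->used, 0, memory_order_release);
        r->read_pos++;               /* FIFO: records are consumed in order */
        return 1;
    }
    return sim_kevent_wait(r, seen);
}
```

Because the record itself carries the used marker, the producer can detect an overflow without ever reading the consumer's read pointer, and freeing a record needs no system call.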


-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [take14 0/3] kevent: Generic event handling mechanism.

2006-08-27 Thread Ulrich Drepper
David Miller wrote:
 SigEvent, and signals in general, are crap.  They are complex
 and userland gets it wrong more often than not.  Interfaces
 for userland should be simple, signals are not simple.

You miss the point.

sigevent has nothing necessarily to do with signals.  I don't want
signals.  I just want the same interface to specify the action to be used.

If I'm using

  struct sigevent sigev;
  int kfd;

  kfd = kevent_create (...);

  sigev.sigev_notify = SIGEV_KEVENT;
  sigev.sigev_kfd = kfd;
  sigev.sigev_value.sival_ptr = some_data;


then I can use this sigev variable in an unmodified timer_create call.
The kernel would see SIGEV_KEVENT (as opposed to SIGEV_SIGNAL etc) and
**not** generate a signal but instead create the event in the kevent queue.


The proposal to use sigevent has nothing to do with signals.  It's just
about the interface and to have smooth integration with existing
functionality.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [take12 0/3] kevent: Generic event handling mechanism.

2006-08-22 Thread Ulrich Drepper
I so far also haven't taken the time to look exactly at the interface.
I plan to do it asap since this is IMO our big chance to get it right.
I want to have a unifying interface which can handle all the different
events we need and which come up today and tomorrow.  We have to be able
to handle not only file descriptors and AIO but also timers, signals,
message queues (OK, they are file descriptors but let's make it
official), futexes.  I'm probably missing the one or the other thing now.

DaveM says there are example programs for the current interfaces.  I
must admit I haven't seen those either.  So if possible, point the world
to them again.  If you do that now I'll review everything and write up
my recommendations re the interface before Monday.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)

2006-08-14 Thread Ulrich Drepper
Suparna Bhattacharya wrote:
 Is there a (remote) possibility that the thread could have died and its
 pid got reused by a new thread in another process ? Or is there a mechanism
 that prevents such a possibility from arising (not just in NPTL library,
 but at the kernel level) ?

The UID/GID won't help you with dying processes.  What if the same user
creates a process with the same PID?  That process will not expect the
notification and mustn't receive it.  If you cannot detect whether the
issuing process died you have problems which cannot be solved with a
uid/gid pair.


 AIO for pipes should not be a problem - Chris Mason had a patch, so we can
 just bring it up to the current levels, possibly with some additional
 improvements.

Good.


 I'm not sure what would be the right thing to do for the sockets case. While
 we could put together a patch for basic aio_read/write (based on the same
 model used for files), given the whole ongoing kevent effort, it's not yet
 clear to me what would make the most sense ...  
 
 Ben had a patch to do a fallback to kernel threads for AIO operations that
 are not yet supported natively. I had some concerns about the approach, but
 I guess he had intended it as an interim path for cases like this.

A fallback solution would be sufficient.  Nobody _should_ use POSIX AIO
for networking but people do and just giving them something that works
is good enough.  It cannot really be worse than the userlevel emulation
we have now.

The alternative, separately and sequentially handling network sockets at
userlevel is horrible.  We'd have to go over every file descriptor and
check whether it's a socket and then take it out of the request list for
the kernel.  Then they need to be handled separately before or after the
kernel AIO code.  This would unduly punish the 99.9% of programs
which don't use POSIX AIO for network I/O.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: Kernel patches enabling better POSIX AIO (Was Re: [3/4] kevent: AIO, aio_sendfile)

2006-08-12 Thread Ulrich Drepper
Suparna Bhattacharya wrote:
 I am wondering about that too. IIRC, the IO_NOTIFY_* constants are not
 part of the ABI, but only internal to the kernel implementation. I think
 Zach had suggested inferring THREAD_ID notification if the pid specified
 is not zero. But, I don't see why ->sigev_notify couldn't be used directly
 (just like the POSIX timers code does) thus doing away with the
 new constants altogether. Sébastien/Laurent, do you recall?

I suggest to model the implementation after the timer code which does
exactly what we need.


 I'm guessing they are being used for validation of permissions at the time
 of sending the signal, but maybe saving the task pointer in the iocb instead
 of the pid would suffice ?

Why should any verification be necessary?  The requests are generated in
the same process which will receive the notification.  Even if the POSIX
process (aka, kernel process group) changes the IDs the notifications
should be sent.  The key is that notifications cannot be sent to another
POSIX process.

Adding this as a feature just makes things so much more complicated.


 So I think the
 intended behaviour is as you describe it should be

Then the documentation needs to be adjusted.


 The way it works (and better ideas are welcome) is that, since the io_submit()
 syscall already accepts an array of iocbs[], no new syscall was introduced.
 To implement lio_listio, one has to set up such an array, with the first iocb
 in the array having the special (new) grouping opcode of IOCB_CMD_GROUP which
 specifies the sigev notification to be associated with group completion
 (a NULL value of the sigev notification pointer would imply equivalent of
 LIO_WAIT).

OK, this seems OK.  We have to construct the iocb arrays dynamically anyway.


 My thought here was that it should be possible to include M as a parameter
 to the IOCB_CMD_GROUP opcode iocb, and thus incorporated in the lio control
 block ... then whatever semantics are agreed upon can be implemented.

If you have room for the parameter this is fine.  For the beginning we
can enforce the number to be the same as the total number of requests.


 Let us know what you think about the listio interface ... hopefully the
 other issues are mostly simple to resolve.

It should be fine and I would support adding all this assuming the
normal file support (as opposed to direct I/O only) is added, too.


But I have one last question: sockets, pipes and the like are already
supported, right?  If this is not the case we have a problem with the
currently proposed lio_listio interface.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [take6 1/3] kevent: Core files.

2006-08-11 Thread Ulrich Drepper
Evgeniy Polyakov wrote:
 The main disadvantage is that all memory is allocated on the start even
 if it will not be used later. I think dynamic grow is appropriate
 solution, since user will have that memory used anyway, since kevents
 are allocated,

If you _allocate_ memory at startup you're doing something wrong.  All
you should do is allocate address space.  Memory should be allocated
when it is needed.

Growing a memory region is always hard because it means you cannot keep
any addresses around and always have to reload a base pointer.  That's
not ideal.

Especially on 64-bit machines address space really is no limitation
anymore.  So, allocate as much as needed, allocate memory when it's
needed, and don't resize.
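
As an illustration of "allocate address space, not memory" from userlevel, the following sketch reserves a region once and makes a prefix usable later, so the base address never changes.  It is an assumption about how one might do this with mmap()/mprotect() on Linux, not part of the kevent patches; reserve_region and commit_prefix are made-up names, and MAP_ANONYMOUS and MAP_NORESERVE are not in POSIX.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Reserve max_bytes of address space without making it accessible.
 * PROT_NONE plus MAP_NORESERVE means no pages are committed or
 * accounted yet.  Returns NULL on failure. */
void *reserve_region(size_t max_bytes)
{
    void *p = mmap(NULL, max_bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Make the first `bytes` of the region readable and writable.  Physical
 * pages are still only allocated on first touch, and `base` stays valid,
 * so no pointer into the region ever has to be reloaded. */
int commit_prefix(void *base, size_t bytes)
{
    return mprotect(base, bytes, PROT_READ | PROT_WRITE);
}
```

A caller would reserve the largest size it could ever need up front and call commit_prefix with a growing length as more of the region comes into use.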

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [3/4] kevent: AIO, aio_sendfile() implementation.

2006-08-11 Thread Ulrich Drepper
Sébastien Dugué wrote:
aio completion notification

I looked over this now but I don't think I understand everything.  Or I
don't see how it all is integrated.  And no, I'm not looking at the
proposed glibc code since that would mean being tainted.


 Details:
 ---
 
   A struct sigevent *aio_sigeventp is added to struct iocb in
 include/linux/aio_abi.h
 
   An enum {IO_NOTIFY_SIGNAL = 0, IO_NOTIFY_THREAD_ID = 1} is added in
 include/linux/aio.h:
 
   - IO_NOTIFY_SIGNAL means that the signal is to be sent to the
 requesting thread 
 
   - IO_NOTIFY_THREAD_ID means that the signal is to be sent to a
 specific thread.

This has been proved to be sufficient in the timer code which basically
has the same problem.  But why do you need separate constants?  We have
the various SIGEV_* constants, among them SIGEV_THREAD_ID.  Just use
these constants for the values of ki_notify.


   The following fields are added to struct kiocb in include/linux/aio.h:
 
   - pid_t ki_pid: target of the signal
 
   - __u16 ki_signo: signal number
 
   - __u16 ki_notify: kind of notification, IO_NOTIFY_SIGNAL or
  IO_NOTIFY_THREAD_ID
 
   - uid_t ki_uid, ki_euid: filled with the submitter credentials

These two fields aren't needed for the POSIX interfaces.  Where does the
requirement come from?  I don't say they should be removed, they might
be useful, but if the costs are non-negligible then they could go away.


   - check whether the submitting thread wants to be notified directly
 (sigevent->sigev_notify_thread_id is 0) or wants the signal to be sent
 to another thread.
 In the latter case a check is made to assert that the target thread
 is in the same thread group

Is this really how it's implemented?  This is not how it should be.
Either a signal is sent to a specific thread in the same process (this
is what SIGEV_THREAD_ID is for) or the signal is sent to the calling
process.  Sending a signal to the process means that from the kernel's
POV any thread which doesn't have the signal blocked can receive it.
The final decision is made by the kernel.  There is no mechanism to send
the signal to another process.

So, for the purpose of the POSIX AIO code the ki_pid value is only
needed when the SIGEV_THREAD_ID bit is set.

It could be an extension and I don't mind it being introduced.  But
again, it's not necessary and if it adds costs then it could be left
out.  It is something which could easily be introduced later if the need
arises.


   listio support
 

I really don't understand the kernel interface for this feature.


 Details:
 ---
 
   An IOCB_CMD_GROUP is added to the IOCB_CMD enum in include/linux/aio_abi.h
 
   A struct lio_event is added in include/linux/aio.h
 
   A struct lio_event *ki_lio is added to struct iocb in include/linux/aio.h

So you have a pointer in the structure for the individual requests.  I
assume you use the atomic counter to trigger the final delivery.  I
further assume that if lio_wait is set the calling thread is suspended
until all requests are handled and that the final notification in this
case means that thread gets woken.

This is all fine.

But how do you pass the requests to the kernel?  If you have a new
lio_listio-like syscall it'll be easy.  But I haven't seen anything like
this mentioned.

The alternative is to pass the requests one-by-one in which case I don't
see how you create the reference to the lio_listio control block.  This
approach seems to be slower.

If all requests are passed at once, do you have the equivalent of
LIO_NOP entries?


How can we support the extension where we wait for a number of requests
which need not be all of them?  I.e., I submit N requests and want to be
notified when at least M (M <= N) have completed.  I am not yet clear about
the actual semantics we should implement (e.g., do we send another
notification after the first one?) but it's something which IMO should
be taken into account in the design.


Finally, and this is very important, does your code send out the
individual requests notification and then in the end the lio_listio
completion?  I think Suparna wrote this is the case but I want to make sure.


Overall, this looks much better than the old code.  If the answers to my
questions show that the behavior is compatible with the POSIX AIO code
I'm certainly very much in favor of adding the kernel code.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [take5 0/4] kevent: Generic event handling mechanism.

2006-08-09 Thread Ulrich Drepper
Evgeniy Polyakov wrote:
 Question with kevents removal from syscall stays open until Ulrich
 accepts or declines mapped buffer implementation.

It was my idea in the first place to use the ring buffer.  I'm sure
others had the same idea but that's what I presented.  So, I see no
reason you should delay making this change because of me.

The only important thing is that we need to get a useful semantics for
fork and exec.  For fork, it must be possible to dequeue entries from
the ring buffer in a thread-safe way.  For exec (where a file descriptor
might survive) we likely need a mechanism to mmap the ring buffer only
based on the file descriptor.  I'm not sure about this, though.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC 1/4] kevent: core files.

2006-08-01 Thread Ulrich Drepper
Herbert Xu wrote:
 The other to consider is that events don't come from the hardware.
 Events are written by the kernel.  So if user-space is just reading
 the events that we've written, then there are no cache misses at all.

Not quite true.  The ring buffer can be written to from another
processor.  The kernel thread responsible for generating the event
(receiving data from network or disk, expired timer) can run
independently on another CPU.

This is the case to keep in mind here.  I thought Zach and the others
involved in the discussions in Ottawa said this has been shown to be a
problem and that a ring buffer implementation with something other than
simple front and back pointers is preferable.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC 1/4] kevent: core files.

2006-07-30 Thread Ulrich Drepper
Nicholas Miell wrote:
 [...] and was wondering
 if you were familiar with the Solaris port APIs* and,

I wasn't.


 if so, you could
 please comment on how your proposed event channels are different/better.

There indeed is not much difference.  The differences are in the
details.  The way those ports are specified doesn't allow much room for
further optimizations.  E.g., the userlevel ring buffer isn't possible.
 But mostly it's the same semantics.  The ec_t type in my text is also
better a file descriptor since otherwise it cannot be transported via
Unix stream sockets.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC 1/4] kevent: core files.

2006-07-29 Thread Ulrich Drepper
Evgeniy Polyakov wrote:
 Btw, why do we want mapped ring of ready events?
 If user requested some event, he definitely wants to get them back when
 they are ready, and not to check and then get them?
 Could you please explain more on this issue?

It of course makes no sense to enter the kernel to actually get the
event.  This should be done by storing the event in the ring buffer.
I.e., there are two ways to get an event:

- with a syscall.  This can report as many events at once as the caller
  provides space for.  And no event which is reported in the ring buffer
  should be reported this way

- if there is space, report it in the ring buffer.  Yes, the buffer
  can be optional, then all events are reported by the system call.


So the use case would be like this:


wait_and_get_event:

  is buffer empty ?

yes - make syscall

no - get event from buffer


To avoid races, the syscall needs to take a parameter indicating the
last event checked out from the buffer.  If in the meantime the kernel
put another event in the buffer the syscall immediately returns.
Similar to what we do in the futex syscall.

The question is how to best represent the ring buffer.  Zach and some
others had some ready responses in Ottawa.  The important thing is to
avoid cache line ping pong when possible.


Is the ring buffer absolutely necessary?  Probably not.  But it has the
potential to help quite a bit.  Don't look at the problem to solve in
the context of heavy I/O operations when another syscall here and there
doesn't matter.  With this single event mechanism for every possible
event the kernel can generate programming can look quite different.
E.g., every read() call can implicitly be changed into an async read
call followed by a user-level reschedule.  This rescheduling allows
another thread of execution to run while the read request is processed.
 I.e., it's basically a setjmp() followed by a goto into the inner loop
to get the next event.  And now suddenly the event notification
mechanism really should be as fast as possible.  If we submit basically
every request asynchronously and are not creating dedicated threads for
specific tasks anymore we

a) have a lot more event notifications

b) the probability of an event being reported when we want to receive
   the next one is higher (i.e., the case where no syscall vs syscall
   makes a difference)

Yes, all this will require changes in the way programs are written but we
shouldn't limit the way we can write programs unnecessarily.  I think
that given increasing discrepancies in relative speed/latency of the
peripherals and the CPU this is one possible solution to keep the CPUs
busy without resorting to a gazillion separate threads in each program.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC 1/4] kevent: core files.

2006-07-28 Thread Ulrich Drepper
Zach Brown wrote:
 Ulrich, would you be satisfied if we didn't
 have the userspace mapped ring on the first pass and only had a
 collection syscall?

I'm not the one to make a call but why rush things?  Let's do it right
from the start.  Later changes can only lead to problems with users of
the earlier interface.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [3/4] kevent: AIO, aio_sendfile() implementation.

2006-07-27 Thread Ulrich Drepper
Badari Pulavarty wrote:
 Before we spend too much time cleaning up and merging into mainline -
 I would like an agreement that what we add is good enough for glibc
 POSIX AIO.

I haven't seen a description of the interface so far.  Would be good if
it existed.  But I briefly mentioned one quirk in the interface about
which Suparna wasn't sure whether it's implemented/implementable in the
current interface.

If a lio_listio call is made the individual requests are handled just as
if they'd been issued separately.  I.e., the notification specified in the
individual aiocb is performed when the specific request is done.  Then,
once all requests are done, another notification is made, this time
controlled by the sigevent parameter of lio_listio.


Another feature which I always wanted: the current lio_listio call
returns in blocking mode only if all requests are done.  In non-blocking
mode it returns immediately and the program needs to poll the aiocbs.
What is needed is something in the middle.  For instance, if multiple
read requests are issued the program might be able to start working as
soon as one request is satisfied.  I.e., a call similar to lio_listio
would be nice which also takes another parameter specifying how many of
the NENT aiocbs have to finish before the call returns.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [3/4] kevent: AIO, aio_sendfile() implementation.

2006-07-26 Thread Ulrich Drepper
Christoph Hellwig wrote:
 My personal opinion on existing AIO is that it is not the right design.
 Benjamin LaHaise agrees with me (if I understood him right),
 
 I completely agree with that as well.

I agree, too, but the current code is not the last of the line.  Suparna
has a set of patches which make the current kernel aio code work much
better and especially make it really usable to implement POSIX AIO.

In Ottawa we were talking about submitting it and Suparna will.  We just
thought about a little longer timeframe.  I guess it could be
accelerated since she mostly has the patch done.  But I don't know her
schedule.

Important here is, don't base any decision on the current aio
implementation.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖


