Re: PROBLEM: Failure to deliver SIGCHLD

2005-07-22 Thread Edgar Toernig
Michael Harris wrote:
>
> [2.] The problem occurs in a forking server similar in function to
> inetd.  The server employs a very simple SIGCHLD handler that loops on
> wait(2), until all zombie processes have been collected.  For no
> immediately apparent reason, the parent process behaves as if it no
> longer receives SIGCHLD.  Manually sending the signal has no effect.

Sounds like a blocked signal.

> [6.] This is the code for the signal handler in the server application. 
> 
> void reaper_man (int signum)
> {
> int stat;
> while ( waitpid(-1, &stat, WNOHANG) > 0 );
> }
> 
> signal (SIGCHLD, reaper_man);  /* from main() */
>
> I dare say it contains no bugs (famous last words)

It does - it clobbers errno :-)

My suggestions: use sigaction with defined restart/mask/etc behaviour
instead of signal.  Save and restore errno in the signal handler.
Make sure SIGCHLD isn't blocked.

But if your only interest is to get rid of the zombies, the simplest
solution would be to set the SIGCHLD disposition to SIG_IGN.

Ciao, ET.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] Bad rounding in timeval_to_jiffies [was: Re: Odd Timer behavior in 2.6 vs 2.4 (1 extra tick)]

2005-04-21 Thread Edgar Toernig
On Thu, 21 Apr 2005, Chris Friesen wrote:
>
> Does mainline have a high precision monotonic wallclock that is not 
> affected by time-of-day changes?  Something like "nano/micro seconds 
> since boot"?

On newer kernels with the posix timers (I think 2.6 - not sure though)
there's clock_gettime(CLOCK_MONOTONIC, ...).

Linus Torvalds wrote:
>
> Getting "approximate uptime" really really _really_ fast
> might be useful for some things, but I don't know how many.

I bet most users of gettimeofday actually want a strictly monotonically
increasing clock where the actual base time is irrelevant.  Just strace
some apps - those issuing hundreds and thousands of gettimeofday calls
are most likely in this class.  Those that only call gettimeofday once
or twice are the ones that really want the wall clock time.

How often does the kernel use jiffies (the monotonic clock) and how often
xtime (the wall clock)?

Ciao, ET.


Re: problem with select() - 2.4.5

2001-06-22 Thread Edgar Toernig

Thomas Speck wrote:
> 
> tio.c_cflag = baud | CLOCAL;

How about adding CREAD?
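
For illustration, a sketch of a termios setup with the receiver enabled.
The function name make_raw_8n1 and the raw 8N1 settings are assumptions;
the original post only shows the single c_cflag assignment, so the rest
of its configuration is unknown:

```c
#include <string.h>
#include <termios.h>

/* Fill *tio for a raw 8N1 line at the given speed.
 * CLOCAL ignores the modem control lines; CREAD enables the receiver,
 * without which no input is delivered and select() never reports the
 * descriptor as readable. */
int make_raw_8n1(struct termios *tio, speed_t baud)
{
    memset(tio, 0, sizeof *tio);
    tio->c_cflag = CS8 | CLOCAL | CREAD;
    tio->c_cc[VMIN] = 1;    /* read() returns once one byte is there */
    tio->c_cc[VTIME] = 0;   /* no inter-byte timeout */
    if (cfsetispeed(tio, baud) == -1)
        return -1;
    return cfsetospeed(tio, baud);
}
```

The result would be applied with tcsetattr(fd, TCSANOW, &tio) on the
opened serial device.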

Ciao, ET.




Re: symlink_prefix

2001-06-06 Thread Edgar Toernig

Alexander Viro wrote:
> 
> On Thu, 7 Jun 2001, Edgar Toernig wrote:
> 
> > Alexander Viro wrote:
> > > ...
> > > dir = open("/usr/local", O_DIRECTORY);
> > > /* error handling */
> > > new_mount(dir, MNT_SET, fs_fd); /* closes dir and fs_fd */
> >
> > Do you really want to start using fds instead of strings for tree
> > modifying commands (link, unlink, symlink, rename, mount and umount)?
> > Even if it were possible in the new_mount case it wouldn't have the
> > atomic lookup+act nature of the old mount.  And then, _I_ would
> > prefer a uniform interface for tree management commands - strings.
> 
> You have exactly the same atomicity warranties. That is to say, none.
> Mountpoint can be renamed between the lookup and mounting.

Ok.  I thought mounting was an atomic operation (though that is normally
not required).  Hmm... but judging by your last batch of VFS patches sent
to lkml, you expect mount to be a more frequently used call in the
future ;-)  Maybe it would be better to have stricter rules for mount
if, e.g., each login performs a dozen of them...

> Moreover, even after mount(2) you can rename() parent of mountpoint. On
> all Unices I've seen (well, aside of v7 which didn't have rename(2)).
> So if you rely on anything of that kind - you are screwed. Portably
> screwed, at that.

I was thinking more of a rename of, e.g., "/usr/local" between the open
and the new_mount call.  I guess an unlink("/usr/local") after the open
will make the new_mount fail.  Btw, what happens in the case of two
concurrent mounts?

fd1=open("/foo")
fd2=open("/foo")
new_mount(fd1...)
new_mount(fd2...)   // or vice versa, first fd2 then fd1

>[...] but even if your argument makes sense, it only makes sense for
> "dir" argument. "device" is nothing but a filesystem-specific option.

Sure.  I only meant the "dir" argument.

Maybe I've just an uneasy feeling about such a change because it exposes
and depends on internal implementation details of the kernel (the dcache).
On other systems it's normally not possible to associate a unique name
with a file descriptor.  Newer Linux versions may support this for
directories due to the dcache (not sure if this is really always the case).
And this calling convention for new_mount would be the first one that
makes this visible in userspace.  And it would depend on this feature.
This may limit future changes of the kernel VFS implementation (maybe
someone really adds some kind of hardlinked directories or something
else that makes it impossible to get a unique name for a dir fd).

Ciao, ET.



Re: symlink_prefix

2001-06-06 Thread Edgar Toernig

Alexander Viro wrote:
> ...
> dir = open("/usr/local", O_DIRECTORY);
> /* error handling */
> new_mount(dir, MNT_SET, fs_fd); /* closes dir and fs_fd */

Do you really want to start using fds instead of strings for tree
modifying commands (link, unlink, symlink, rename, mount and umount)?
Even if it were possible in the new_mount case it wouldn't have the
atomic lookup+act nature of the old mount.  And then, _I_ would
prefer a uniform interface for tree management commands - strings.

Ciao, ET.




Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-27 Thread Edgar Toernig

Daniel Phillips wrote:
> 
> It won't, the open for "." is handled in the VFS, not the filesystem -
> it will open the directory.  (Without needing to be told it's a
> directory via O_DIRECTORY.)  If you do open("magicdev") you'll get the
> device, because that's handled by magicdevfs.

You really mean that "magicdev" is a directory and:

open("magicdev/.", O_RDONLY);
open("magicdev", O_RDONLY);

would both succeed but open different objects?
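
The invariant in question can be checked from user space.  A sketch,
with a made-up helper name, that compares the device and inode numbers
behind two opens:

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if both paths open the same object, 0 if they differ,
 * and -1 if either open() or fstat() fails. */
int opens_same_object(const char *a, const char *b)
{
    struct stat sa, sb;
    int result = -1;
    int fda = open(a, O_RDONLY);
    int fdb = open(b, O_RDONLY);

    if (fda >= 0 && fdb >= 0 &&
        fstat(fda, &sa) == 0 && fstat(fdb, &sb) == 0)
        result = (sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino);

    if (fda >= 0)
        close(fda);
    if (fdb >= 0)
        close(fdb);
    return result;
}
```

On an ordinary directory, opens_same_object("dir", "dir/.") yields 1;
under the proposed scheme the two opens would yield different objects.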

> I'm not claiming there isn't breakage somewhere,

you break UNIX fundamentals.  But I'm quite relieved now because I'm
pretty sure that something like that will never go into the kernel.

Ciao, ET.



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-25 Thread Edgar Toernig

Daniel Phillips wrote:
> 
> Oops, oh wait, there's already another open point: your breakage
> examples both rely on opening ".".  You're right, "." should always be
> a directory and I believe that's enforced by the VFS.  So we don't have
> an example of breakage yet.

That's just because I did a simple "ls".  But it doesn't make a
difference.  The magicdevs _are_ directories and

chdir("magicdev");
open(".", O_RDONLY);

shouldn't open the device.

Ciao, ET.




Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-23 Thread Edgar Toernig

Daniel Phillips wrote:
> On Wednesday 23 May 2001 06:19, Edgar Toernig wrote:
> > Daniel Phillips wrote:
> > > On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote:
> > > > On Mon, 21 May 2001, Daniel Phillips wrote:
> > > > > On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > > > > > What I'd like to see:
> > > > > >
> > > > > > - An interface for registering an array of related devices
> > > > > > (almost always two: raw and ctl) and their legacy device
> > > > > > numbers with a single userspace callout that does whatever
> > > > > > /dev/ creation needs to be done. Thus, naming and permissions
> > > > > > live in user space. No "device node is also a directory"
> > > > > > weirdness...
> > > > >
> > > > > Could you be specific about what is weird about it?
> > > >
> > > > *boggle*
> > > >
> > > >[general sense of unease]
> >
> > I fully agree with Oliver.  It's an abomination.
> 
> We are, or at least, I am, investigating this question purely on
> technical grounds - name calling is a noop.

Right.  But sometimes new ideas raise this kind of feeling ;)

> > > It's going to be marked 'd', it's a directory, not a file.
> >
> > Aha.  So you lose the S_ISCHR/BLK attribute.
> 
> Readdir fills in a directory type, so ls sees it as a directory and does
> the right thing.  On the other hand, we know we're on a device
> filesystem so we will next open the name as a regular file, and find
> ISCHR or ISBLK: good.

??? The kernel may know it, but the app?  Or do you really want to
give different stat data on stat(2) and fstat(2)?  These flags are
currently used by archive/backup programs.  It's a hint that these files
are not regular files and shouldn't be opened for reading.
Having a 'd' would mean that they would really try to enter the
directory and save its contents.  Don't know what happens in this
case to your "special" files ;-)
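
Roughly the decision an archiver makes from the file type bits.  This is
a sketch only: the enum and the function name are invented, and real
tools such as tar or cpio have considerably more cases:

```c
#include <sys/stat.h>

enum entry_action {
    ENTRY_ERROR,          /* lstat() failed */
    ENTRY_DESCEND,        /* directory: enter it and save its contents */
    ENTRY_READ_FILE,      /* regular file: open and read the data */
    ENTRY_SKIP_CONTENTS   /* device, FIFO, socket, symlink: record the
                             node itself, never open it for reading */
};

/* Decide what to do with a path based only on its file type. */
enum entry_action classify_entry(const char *path)
{
    struct stat st;

    if (lstat(path, &st) == -1)
        return ENTRY_ERROR;
    if (S_ISDIR(st.st_mode))
        return ENTRY_DESCEND;
    if (S_ISREG(st.st_mode))
        return ENTRY_READ_FILE;
    return ENTRY_SKIP_CONTENTS;
}
```

A device node that reports S_IFDIR would land in the ENTRY_DESCEND
branch, and the S_ISCHR/S_ISBLK hint would never be seen.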

> The rule for this filesystem is: if you open with O_DIRECTORY then
> directory operations are permitted, nothing else.  If you open without
> O_DIRECTORY then directory operations are forbidden (as
> usual) and normal device semantics apply.

As usual?  I think you've just changed the rules for O_DIRECTORY.  Up
to now it's only a flag that tells open it should fail if the name
does not refer to a directory.  Nothing else.  It was introduced to
remove a race condition in user space applications.  In particular, it
is optional - everything works the same whether you give the flag
or not (except the race avoidance of course).  And there are a lot
of programs that do not use O_DIRECTORY (it's a Linux private flag,
not even mentioned in POSIX).  Every program that does:

fd = open(foo, O_RDONLY);
fchdir(fd);
x = opendir(".");

will break.  And that is POSIX conformant.  And I know that there are
programs that use this when recursively scanning directories (avoids
name mangling and repeated name lookups of the directory on later
stat calls).
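
A self-contained sketch of that pattern, including the "remember the
start directory and fchdir back" step.  The function name is
hypothetical and error handling is kept minimal:

```c
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

/* Count the entries of `dir` (including "." and "..") using the
 * open() + fchdir() + opendir(".") pattern, then return to the
 * starting directory.  Returns -1 on error. */
long count_entries_via_fchdir(const char *dir)
{
    long count = -1;
    DIR *d;
    int start = open(".", O_RDONLY);   /* remember where we are */
    int fd = open(dir, O_RDONLY);      /* note: no O_DIRECTORY */

    if (start >= 0 && fd >= 0 && fchdir(fd) == 0) {
        d = opendir(".");
        if (d != NULL) {
            count = 0;
            while (readdir(d) != NULL)
                count++;
            closedir(d);
        }
        if (fchdir(start) == -1)       /* could not get back */
            count = -1;
    }
    if (fd >= 0)
        close(fd);
    if (start >= 0)
        close(start);
    return count;
}
```

Both open() calls rely on a plain O_RDONLY open of a directory yielding
the directory itself.  The working directory is process-wide, so this
is not safe to run concurrently from several threads.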

> > Directories are not allowed to be read from/written to.  The VFS may
> > support it, but it's not (current) UNIX.
> 
> Here, we obey this rule: if you open it with O_DIRECTORY then you
> can't read from or write to it.

IMHO you've just invented opendir(2).

> Nothing breaks here, ls works as it always did.
> 
> This is what ls does:
> 
> open("foobar", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
> fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
> fcntl64(0x3, 0x2, 0x1, 0x2) = -1 ENOSYS (Function not implemented)
> fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
> brk(0x805b000)  = 0x805b000
> getdents64(0x3, 0x8058270, 0x1000, 0x26) = -1 ENOSYS (Function not implemented)
> getdents(3, /* 2 entries */, 2980)  = 28
> getdents(3, /* 0 entries */, 2980)  = 0
> close(3)= 0
> 
> Note that ls doesn't do anything as inconvenient as opening
> foobar as a normal file first, expecting that operation to fail.

Well, your ls does not work "as it always did".  Here's an strace of
my libc5 system ls:

open(".", O_RDONLY) = 3
fcntl(3, F_SETFD, FD_CLOEXEC)   = 0
getdents(3, /* 64 entries */, 4096) = 1216
getdents(3, /* 9 entries */, 4096)  = 168
getdents(3, /* 0 entries */, 4096)  = 0
close(3)= 0

And my find(1) does:

open(".", O_RDONLY) = 3
[scan all dirs]
fchdir(3)   = 0

to return to its initial dir.  Will break too.

> No, you would get side effects only if you open as a regular file.

IMHO your assumption that opening a dir _requires_ O_DIRECTORY is
wrong.  You've put in a new semantic that has not been there and
that will break programs and POSIX conformance.

> Please, if you know something that actually breaks, tell me.

Yeah, see above ;)

Ciao, ET.



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-22 Thread Edgar Toernig

Daniel Phillips wrote:
> 
> On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote:
> > On Mon, 21 May 2001, Daniel Phillips wrote:
> > > On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
> > > > What I'd like to see:
> > > >
> > > > - An interface for registering an array of related devices
> > > > (almost always two: raw and ctl) and their legacy device numbers
> > > > with a single userspace callout that does whatever /dev/ creation
> > > > needs to be done. Thus, naming and permissions live in user
> > > > space. No "device node is also a directory" weirdness...
> > >
> > > Could you be specific about what is weird about it?
> >
> > *boggle*
> >
> >[general sense of unease]

I fully agree with Oliver.  It's an abomination.

> > I don't think it's likely to be even workable. Just consider the
> > directory entry for a moment - is it going to be marked d or [cb]?
> 
> It's going to be marked 'd', it's a directory, not a file.

Aha.  So you lose the S_ISCHR/BLK attribute.

> > If it doesn't have the directory bit set, Midnight commander won't
> > let me look at it, and I wouldn't blame cd or ls for complaining. If it
> > does have the 'd' bit set, I wouldn't blame cp, tar, find, or a
> > million other programs if they did the wrong thing. They've had 30
> > years to expect that files aren't directories. They're going to act
> > weird.
> 
> No problem, it's a directory.

Directories are not allowed to be read from/written to.  The VFS may
support it, but it's not (current) UNIX.

> > Linus has been kicking this idea around for a couple years now and
> > it's still a cute solution looking for a problem. It just doesn't
> > belong in UNIX.
> 
> Hmm, ok, do we still have any *technical* reasons?

So with your definition, I have a fs-object that is marked as a directory
but opening it opens a device.  Pretty nice.  How am I supposed to list
its contents?  open+readdir?  But the open has nasty side effects.
So you have a directory that you are not allowed to list (because of the
possible side effects) but that may be read from/written to, and maybe
even have ioctls issued on it?  And you call that sane???

IMO the whole idea of arguments following the device name is junk (incl
a "/ctrl").

Just think about the implications of the original "/dev/ttyS0/19200"
suggestion.  It sounds nice and tempting.  But which programs will
benefit?  Which get confused?  What will be cleaned up?  After some
thought you'll find out that it's useless ;-)

And with special "ctrl" devices (i.e. /dev/ttyS0 and /dev/ttyS0ctrl):
This _may_ work for some kinds of devices.  But serial ports are one
example where it simply will _not_.  It requires that you know the
name of the device.  For ttys this is often not the case.  Even if
you manage to get some name for stdin, for example - now I should
simply append a "ctrl" to that name to get a control channel???
Dangerous, at the least.  If I'm lucky I only get an EPERM...

Ciao, ET.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-22 Thread Edgar Toernig

Daniel Phillips wrote:
 
 On Tuesday 22 May 2001 17:24, Oliver Xymoron wrote:
  On Mon, 21 May 2001, Daniel Phillips wrote:
   On Monday 21 May 2001 19:16, Oliver Xymoron wrote:
What I'd like to see:
   
- An interface for registering an array of related devices
(almost always two: raw and ctl) and their legacy device numbers
with a single userspace callout that does whatever /dev/ creation
needs to be done. Thus, naming and permissions live in user
space. No device node is also a directory weirdness...
  
   Could you be specific about what is weird about it?
 
  *boggle*
 
 [general sense of unease]

I fully agree with Oliver.  It's an abomination.

  I don't think it's likely to be even workable. Just consider the
  directory entry for a moment - is it going to be marked d or [cb]?
 
 It's going to be marked 'd', it's a directory, not a file.

Aha.  So you lose the S_ISCHR/BLK attribute.

  If it doesn't have the directory bit set, Midnight commander won't
  let me look at it, and I wouldn't blame cd or ls for complaining. If it
  does have the 'd' bit set, I wouldn't blame cp, tar, find, or a
  million other programs if they did the wrong thing. They've had 30
  years to expect that files aren't directories. They're going to act
  weird.
 
 No problem, it's a directory.

Directories are not allowed to be read from/written to.  The VFS may
support it, but it's not (current) UNIX.

  Linus has been kicking this idea around for a couple years now and
  it's still a cute solution looking for a problem. It just doesn't
  belong in UNIX.
 
 Hmm, ok, do we still have any *technical* reasons?

So with your definition, I have a fs-object that is marked as a directory
but opening it opens a device.  Pretty nice.  How I'm supposed to list
it's contents?  open+readdir?  But the open has nasty side effects.
So you have a directory that you are not allowed to list (because of the
possible side effects) but is allowed to be read from/written to maybe
even issue ioctls to?.  And you call that sane???

IMO the whole idea of arguments following the device name is junk (incl
a /ctrl).

Just think about the implications of the original /dev/ttyS0/19200
suggestion.  It sounds nice and tempting.  But which programs will
benefit.  Which gets confused.  What will be cleaned up.  After some
thoughts you'll find out that it's useless ;-)

And with special ctrl devices (ie /dev/ttyS0 and /dev/ttyS0ctrl):
This _may_ work for some kind of devices.  But serial ports are one
example where it simply will _not_.  It requires that you know the
name of the device.  For ttys this is often not the case.  Even if
you manage to get some name for stdin for example - now I should
simply attach a ctrl to that name to get a control channel???
At least dangerous.  If I'm lucky I only get an EPERM...

Ciao, ET.




Re: F_CTRLFD (was Re: Why side-effects on open(2) are evil.)

2001-05-20 Thread Edgar Toernig

Alexander Viro wrote:
> 
> On Sun, 20 May 2001, Edgar Toernig wrote:
> 
> > IMHO any scheme that requires a special name to perform ioctl like
> > functions will not work.  Often you don't know the name of the
> > device you're talking to and then you're lost.
> 
> ls -l /proc/self/fd/<n>

Oh come on.  You made most of the VFS and should know better.  Since when
is it possible to always get a "usable" name for an fd???  The ls -l will
give me "deleted", "socket", "...".  If I try to access the name given
by procfs I may get EPERM, etc etc.  And then, it's pretty strange to append
a "ctl" to some arbitrary name and I get a control device for that name???
No.  Using names is __wrong__!
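
To make the point about unusable names concrete, here is a minimal sketch (Linux only, assuming /proc is mounted) that reads back what procfs reports for a descriptor and checks whether that string can be used as a path again.  The helper names fd_name, name_is_usable and name_is_deleted are made up for this illustration.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Read the symlink target procfs reports for an open descriptor.
 * Returns 0 on success, -1 on failure (e.g. /proc not mounted). */
int fd_name(int fd, char *buf, size_t len)
{
    char link[64];
    ssize_t n;

    if (len == 0)
        return -1;
    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    n = readlink(link, buf, len - 1);
    if (n < 0)
        return -1;
    buf[n] = '\0';
    return 0;
}

/* Returns 1 if the reported name exists as a path that could be reopened. */
int name_is_usable(const char *name)
{
    return access(name, F_OK) == 0;
}

/* Returns 1 if the name carries the " (deleted)" suffix.  Note that a
 * file really named "x (deleted)" looks exactly the same. */
int name_is_deleted(const char *name)
{
    static const char tag[] = " (deleted)";
    size_t n = strlen(name), t = sizeof(tag) - 1;

    return n >= t && strcmp(name + n - t, tag) == 0;
}
```

For a pipe the reported name has the form "pipe:[inode]", and for an unlinked file it ends in " (deleted)"; neither can be handed back to open(2), let alone be extended with a "ctl" suffix.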

> [not going to happen:]
> 1) sys_ioctl() going away from syscall table.

I would never suggest that.

> 2) semi-automatic conversion of existing applications.

Same.  Much too dangerous.

> To hell with
> the way we are finding descriptor, we need to deal with arguments themselves.
> And no extra logics in libc will help - the whole problem is that ioctls
> have rather irregular arguments.

Don Quijote II.? ;-)

IMHO any similarly powerful (and versatile) interface will see the same
problems.  Enforcing a read/write like interface (and rejecting drivers
that pass ptrs through this interface) may give you some knowledge about
the kernel/userspace communication.  But the data that flows around will
become the same mess that is present with the current ioctl.  Every driver
invents its own set of commands, its own rules of argument parsing, ...
Maybe it's no longer strange binary data but readable ASCII strings but
that's all.  Look at how many different "styles" of /proc files there are.

> What we need is "make it sane", not "inherit as many things from the
> old API as possible". And obvious first target is Linux-specific
> device ioctls, simply because they have fewer programs using them.

You can impose some rules like "must support" commands, some rules on
how arguments are encoded, errors reported and so on.  But I wouldn't
like to see an SNMP like mess...

IMHO what's needed is a definition for "sane" in this context.  Trying
to limit the kind of actions performed by ioctls is not "sane".  Then
people will always revert back to old ioctl.  "Sane" could be: network
transparent, architecture independent, usable with generic tools and non
C-like languages.

Ciao, ET.



Re: F_CTRLFD (was Re: Why side-effects on open(2) are evil.)

2001-05-20 Thread Edgar Toernig

Alexander Viro wrote:
> 
> For the latter, though,
> we need to write commands into files and here your miscdevices (or procfs
> files, or /dev/foo/ctl - whatever) is needed.

IMHO any scheme that requires a special name to perform ioctl like
functions will not work.  Often you don't know the name of the
device you're talking to and then you're lost.

So, if you want an additional communication channel to a device why
not introduce an fcntl or system call like

ctrlfd = fcntl(fd, F_CTRLFD)   or   openctrl(fd)  ?

That way you can always get access to the control channel and use
regular read/write for communication [1].  To make it more versatile,
you may want to extend the shell syntax, e.g. a '@' in redirection
operators gets the control fd:

echo "eject" >@/dev/cdrom
{ echo "b19200,onlcr" >@1 ; echo "Hello World!" ; } >/dev/ttyS0

Yes, requires support in user space apps but doesn't mess around
with the file namespace.  It's too precious to sacrifice ;-)

I don't know how much infrastructure in the kernel is required for this 
- i.e. add readctrl/writectrl methods or create virtual inodes/devices
on the fly?  There are more capable people than me to judge on that...

Ciao, ET.


[1] If you want you can even allow this flag as an open mode to
open the ctrl channel without opening the dev.



Re: Why side-effects on open(2) are evil. (was Re: [RFD w/info-PATCH]device arguments from lookup)

2001-05-19 Thread Edgar Toernig

nitpicking: a system call without side effects would be pretty useless.

Alexander Viro wrote:
> A lot of stuff relies on the fact that close(open(foo, O_RDONLY)) is a
> no-op. Breaking that assumption is a Bad Thing(tm).

That assumption is totally bogus.  Even for regular files you have side
effects (atime); for anything else they're unpredictable.

Ciao, ET.



Re: Wow! Is memory ever cheap!

2001-05-09 Thread Edgar Toernig

Larry McVoy wrote:
> 
> Let's review:  ECC is nice, but it doesn't solve all data corruption
> problems.  Applications which do their own end to end data integrity
> checks will catch many more error cases than what ECC catches.

I think you have a wrong idea why the ECC is there.  ECC deals with
the inherent shortcomings of DRAM.

DRAMs are not perfect.  They have a probability to lose a bit.
Normally this probability is low enough to live with.  Let's say
you have a system with 1MByte and let's say the probability for a
single bit error is around 1 error in 100 years.  Good enough.
Now put 1GByte in the system.  With a thousand times the memory you'll
get about 10 errors per year.  Maybe good enough for a Windows box but
not acceptable for your server.  So you put in ECC to bring this
probability back into reasonable numbers.  ECC can correct the single
bit errors.  You only have to deal with double bit errors.  The chance
for them is much, much lower.
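
The arithmetic behind those numbers, as a small sketch.  It assumes the error rate scales linearly with the amount of memory, which is a simplification made for this illustration; the function name is made up as well.

```c
/* Expected single-bit errors per year for a given memory size, scaled
 * linearly from a known base rate (the linear scaling is an assumption
 * of this sketch, not a measured property of any DRAM). */
double errors_per_year(double base_errors_per_year, double base_mbytes,
                       double mbytes)
{
    return base_errors_per_year * (mbytes / base_mbytes);
}
```

With 1 error in 100 years for 1MByte, errors_per_year(0.01, 1.0, 1024.0) gives 0.01 * 1024 = 10.24, i.e. roughly the 10 errors per year quoted for 1GByte.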

Sure, it doesn't solve all data corruption problems - only simple
errors in DRAMs.  But it keeps systems with huge amounts of RAM up
and running much longer.  And btw, your integrity checks over data will
not protect against a corrupted kernel or application...

Ciao, ET.

PS: Just let your app run long enough.  I'm sure it will detect a
checksum error some day ;-)




Re: Real Time Traffic Flow Measurement - anybody working on it?

2001-04-19 Thread Edgar Toernig

Hi,

Michael Clark wrote:
> 
> An obvious kernel improvement for userspace meters like NeTraMet would
> be to give libpcap's pcap_read a kernel interface that can return more
> than one packet at a time (the libpcap interface has this capability).

It's already there - the turbo packet interface (PACKET_RX_RING sockopt).
Very nice and fast.  Direct transfer to mmapped memory.

> An additional feature for network devices that could support it (not
> sure if this is feasible) would be to switch to an 'interrupt when
> packet buffer full' when in promiscuous mode.

With the RX_RING you can poll a memory location in the mmapped memory
to detect whether there are new packets.  You basically only perform
a system call (poll/select) if there's nothing more to do.
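
A minimal sketch of that receive loop, assuming the original (version 1) struct tpacket_hdr layout from <linux/if_packet.h> and frames laid out back to back (block size a multiple of the frame size).  drain_ring and its parameters are names made up for this example, and a production loop would also need the appropriate memory barriers around the status word.

```c
#include <linux/if_packet.h>
#include <stddef.h>

/* Walk the ring starting at *idx, pass every frame the kernel has filled
 * to the callback, then hand the frame back to the kernel.  Returns the
 * number of frames consumed; 0 means nothing new, so the caller can go
 * and sleep in poll(). */
unsigned drain_ring(void *ring, unsigned frame_nr, unsigned frame_size,
                    unsigned *idx,
                    void (*handle)(const unsigned char *data, unsigned len))
{
    unsigned n = 0;

    while (n < frame_nr) {
        volatile struct tpacket_hdr *hdr = (volatile struct tpacket_hdr *)
            ((char *)ring + (size_t)*idx * frame_size);

        if (!(hdr->tp_status & TP_STATUS_USER))
            break;                          /* still owned by the kernel */

        if (handle)
            handle((const unsigned char *)hdr + hdr->tp_mac,
                   hdr->tp_snaplen);

        hdr->tp_status = TP_STATUS_KERNEL;  /* give the frame back */
        *idx = (*idx + 1) % frame_nr;
        n++;
    }
    return n;
}
```

The caller would invoke drain_ring() in a loop and only fall back to poll() on the packet socket when it returns 0.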

Ciao, ET.




Re: CONFIG_PACKET_MMAP help

2001-04-19 Thread Edgar Toernig

Hi,

[EMAIL PROTECTED] wrote:
> 
> 1. for tp_frame_size, I dont want to truncate any data on ethernet, I
> need 1514 bytes, is this the best way to do it and not waste space?
> 
> static const int TURBO_FRAME_SIZE=
>  TPACKET_ALIGN(TPACKET_ALIGN(sizeof(tpacket_hdr)) +
>TPACKET_ALIGN(sizeof(struct sockaddr_ll)+ETH_HLEN) + 1500);

Looks OK.  Maybe instead of ETH_HLEN min(ETH_HLEN,16)?  The framesize
calculation is really strange...

> 2. what is tp_block_nr for?  I dont understand it, I just set it to 1
> and make tp_block_size big enough for all the frames I need, so its
> just one contiguous space, all I need is about a megabyte I think.

Better go the other way around - set tp_block_size to PAGE_SIZE and
tp_block_nr appropriately.  tp_block_size is the contiguous physical memory
the kernel tries to allocate.  Anything above PAGE_SIZE is likely to fail.
For you that would mean only 2 packets per 4k-page.  You could try to
start with bigger (power of 2) block sizes and go down to smaller ones if
it fails (ENOMEM) [1].  Btw, there's an implicit limit on tp_block_nr.
The vector to manage the blocks is kmalloc'ed and may not be larger than
128kb giving max 32768 blocks.  Hmm... moment... seems there's a similar
limit for tp_frame_nr (max 32768 frames).  I'm pretty sure _that_ limit
was not there when I worked with this during 2.3.  Not so nice on gigabit
ethernet :-(
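
The "start big, fall back on ENOMEM" strategy could look roughly like this.  It is only a sketch: fill_ring_req and setup_rx_ring are hypothetical helper names, max_block_size is assumed to be a power-of-two multiple of the page size, and frame_size is assumed to be already TPACKET_ALIGNed.

```c
#include <errno.h>
#include <linux/if_packet.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fill a tpacket_req for the given block size so that at least
 * want_frames frames fit.  Returns 0, or -1 if no frame fits a block. */
int fill_ring_req(struct tpacket_req *req, unsigned block_size,
                  unsigned frame_size, unsigned want_frames)
{
    unsigned per_block = block_size / frame_size;

    if (per_block == 0)
        return -1;
    req->tp_block_size = block_size;
    req->tp_frame_size = frame_size;
    req->tp_block_nr   = (want_frames + per_block - 1) / per_block;
    req->tp_frame_nr   = per_block * req->tp_block_nr;
    return 0;
}

/* Try progressively smaller power-of-two block sizes until the kernel
 * accepts the ring.  Returns 0 on success, -1 with errno set otherwise. */
int setup_rx_ring(int sock, struct tpacket_req *req, unsigned frame_size,
                  unsigned want_frames, unsigned max_block_size)
{
    unsigned page = (unsigned)sysconf(_SC_PAGESIZE);
    unsigned bs;

    for (bs = max_block_size; bs >= page; bs /= 2) {
        if (fill_ring_req(req, bs, frame_size, want_frames) < 0) {
            errno = EINVAL;     /* frame does not fit into a block */
            return -1;
        }
        if (setsockopt(sock, SOL_PACKET, PACKET_RX_RING,
                       req, sizeof(*req)) == 0)
            return 0;
        if (errno != ENOMEM)
            return -1;          /* smaller blocks will not help here */
    }
    errno = ENOMEM;
    return -1;
}
```

With a 1600 byte frame and 4096 byte blocks, fill_ring_req yields 2 frames per block, matching the "only 2 packets per 4k-page" remark; with 8192 byte blocks it yields 5.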

> 3. is this the general approach for the api?
> [...]

Looks OK too.

>if (tp->status == 0) poll() for pollin on the socket  /* is there a
>race here? */

No race.

> 4. what does the copy threshold setsockopt tuning accomplish? doesnt it always
> have to copy anyway, to the mmaped area?

I haven't used it myself.  Reading the sources, it does something different.
Afaics, when it is active and a packet has been truncated to the
framesize, the packet is additionally stored in the socket's receive queue
to be fetched by a normal read/recv.  It notifies you about this by setting
the TP_STATUS_COPY bit.  So it seems to mean: copy to socket if threshold
(framesize) exceeded.

Ciao, ET.


[1] The PACKET_RX_RING sockopt accepts all block sizes that are a multiple
of PAGE_SIZE but always allocates a power of 2 size chunk.  So using non
power of 2 sizes will waste locked kernel memory.




Re: PROBLEM: select() on TCP socket sleeps for 1 tick even if data available

2001-01-20 Thread Edgar Toernig

Michael Lindner wrote:
>[...]
> send(s, ".", 1, 0);
>[...]
> while (select(r+1, &readfds, 0, 0, 0) > 0) {
>[...]
>[select returns only after about 1 HZ]

Ever heard of nagle?  (If not, there's a long thread about
it on the mailing list *g*)

It's not the select that waits. It's a delay in the tcp send
path waiting for more data.  Try disabling it:

int f=1;
setsockopt(s, SOL_TCP, TCP_NODELAY, &f, sizeof(f));
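
A slightly fuller sketch of the same call with error reporting and a way to read the setting back.  It uses IPPROTO_TCP, the portable spelling (on Linux it has the same value as SOL_TCP); the function names are made up for this example.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Turn off Nagle's algorithm on a TCP socket.  Returns 0 on success,
 * -1 with errno set on failure. */
int disable_nagle(int s)
{
    int on = 1;

    return setsockopt(s, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}

/* Report whether TCP_NODELAY is currently set: 1, 0, or -1 on error. */
int nagle_disabled(int s)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(s, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

Calling disable_nagle(s) right after socket() (or accept()) is enough; the option stays in effect for the lifetime of the socket.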

Ciao, ET.




Re: Linux's implementation of poll() not scalable?

2000-10-24 Thread Edgar Toernig

Linus Torvalds wrote:
> 
> The point they disagree is when the event gets removed from the event
> queue. For edge triggered, this one is trivial: when a get_events() thing
> happens and moves it into user land. This is basically a one-liner, and it
> is local to get_events() and needs absolutely no help from anybody else.
> So obviously event removal is _very_ simple for edge-triggered events -
> the INTACK basically removes the event (and also re-arms the trigger
> logic: which is different from most interrupt controllers, so the analogy
> falls down here).

And IMHO here's a problem.  The events are no longer events.  They are
just hints saying: after the previous get_events() something has happened.
You don't know if you've already handled this event.  There's no synchron-
ization between what the app does and the triggering of 'hints'.

For example your waitpid-loop: you get the event, start the waitpid-loop.
While processing another process dies.  You handle it too (still in the
loop).  But a new 'hint' has already been registered.  So on the next
get_event you'll be notified again.  I just hope, every event-generator
has a WNOHANG flag...
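
To illustrate why the WNOHANG matters here: a reaper that may be woken by a stale hint has to cope with finding nothing to do.  A minimal sketch, with reap_children being a name made up for this example:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Collect every child that has already exited, without blocking.
 * Safe to call on a stale or duplicate notification: it then simply
 * returns 0.  errno is saved and restored around the loop. */
int reap_children(void)
{
    int saved_errno = errno;
    int reaped = 0;
    int status;
    pid_t pid;

    for (;;) {
        pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            reaped++;
            continue;
        }
        if (pid < 0 && errno == EINTR)
            continue;
        break;  /* 0: no exited child right now; -1/ECHILD: no children */
    }
    errno = saved_errno;
    return reaped;
}
```

If two children die while the loop is running, both are collected in the first call, and the call triggered by the second (now stale) hint just returns 0 instead of blocking.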

It could even be possible that you are unable to perform some actions
without triggering hints, despite the fact that the conditions will
already be gone before the next get_event.  That may generate a lot of
bogus hints.

At least the current semantic of for example "POLL_IN on fd was signaled
so I may read without blocking" gets lost.

Maybe (don't know kernel wise) it makes sense to check in the kernel
if the events to be returned to userspace are still valid.  The user
space has to do it anyway...  But that way you get a more level-based
design ;)


Another thing: while toying with cooperative userspace multithreading
I found it much more versatile to have a req_type/req_data tuple in
the request structure (ie READ/<fd>, TIMEOUT/<ms>, WAKEUP/<handle>).
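
One possible shape for such a request structure, as a sketch only; the type and field names are invented for this illustration and are not taken from any existing interface.

```c
/* Kinds of things a cooperative thread can wait for. */
enum req_type { REQ_READ, REQ_TIMEOUT, REQ_WAKEUP };

/* A request is a type tag plus type-specific data. */
struct request {
    enum req_type type;
    union {
        int   fd;       /* REQ_READ: descriptor to wait on     */
        long  ms;       /* REQ_TIMEOUT: delay in milliseconds  */
        void *handle;   /* REQ_WAKEUP: opaque wakeup handle    */
    } data;
};

/* Convenience constructors for the three request kinds. */
struct request req_read(int fd)
{
    struct request r;
    r.type = REQ_READ;
    r.data.fd = fd;
    return r;
}

struct request req_timeout(long ms)
{
    struct request r;
    r.type = REQ_TIMEOUT;
    r.data.ms = ms;
    return r;
}

struct request req_wakeup(void *handle)
{
    struct request r;
    r.type = REQ_WAKEUP;
    r.data.handle = handle;
    return r;
}
```

The scheduler can then switch on the type and treat descriptors, timers and explicit wakeups uniformly in one wait queue.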

Ciao, ET.
