Re: console font limits

2007-05-01 Thread Albert Cahalan

On 5/1/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:
> Antonino A. Daplas wrote:
>>
>> And this will entail a lot of work to change (Is it worth it to rework
>> the code and remove the limitation?). The linux-console project
>> (http://linuxconsole.sourceforge.net/) might have , but I don't know its
>> current status.
>
> Well, I think the consensus is that anything beyond that should be done
> in userspace; the main such console daemon was Kon2 last I checked.


Font size is not a sane place to draw the line. Features are.
The levels of support go something like this:

0. 7-bit ASCII
1. Simple direct-to-font VGA characters.
2. UTF-8 and large fonts, but no compositing or wide characters.
3. Simple compositing and double-wide characters. (like xterm)
4. Right-to-left. (like Kermit95)
5. Complex shaping, glyph substitution, and vertical text.

Without large fonts, UTF-8 is 90% pointless bloat.

Userspace console daemons are rotten to the core. There is no safe and
reliable way to make kernel messages pass through the userspace console.
You'd either be in graphics mode or you'd still be subject to the limit
of 256 simultaneous glyphs while normal VGA attributes are in use. This
is so defective that one might as well just run X with a fullscreen xterm.
If userspace is your answer, then let's rip out the UTF-8 code.

Personally I don't even need #1, but I think anything less than #3 is
really rude toward people outside of Europe+Americas. I especially hate
to hear Europeans argue against this when they have 100% precomposed
characters for themselves and appear to have played a role (via ISO votes)
in denying stuff like the mere 12 precomposed characters needed to use
the Yoruba language with simple renderers.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


console font limits

2007-04-30 Thread Albert Cahalan

I'm having problems with a font I just created. It's a rather big one,
intended for a framebuffer console in UTF-8 mode. The strace program
reports that /bin/setfont fails on a KDFONTOP ioctl with EINVAL.
In reading the kernel code, I find this:

vt.c:static int con_font_set(struct vc_data *vc, struct console_font_op *op)
vt.c-{
vt.c-   struct console_font font;
vt.c-   int rc = -EINVAL;
vt.c-   int size;
vt.c-
vt.c-   if (vc->vc_mode != KD_TEXT)
vt.c-           return -EINVAL;
vt.c-   if (!op->data)
vt.c-           return -EINVAL;
vt.c-   if (op->charcount > 512)
vt.c-           return -EINVAL;

Ouch. Why is the old VGA limit being applied to the framebuffer console?
Could this just get removed? I dearly hope we aren't still storing the
framebuffer data as two bytes per character+attribute pair.

I nearly hit the 32-pixel height limit as well, yet another relic from
the VGA hardware. I also nearly hit the 64 KB font size limit.

Currently I'm doing a 15x30 font with 870 glyphs to represent 978
different Unicode code points. This is for a 200 DPI display with
an anti-aliasing filter, so fonts need to be big. I'm considering 15x36
so that I'll have more room for double-accented letters, but clearly
the kernel would block that too.
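
The limits mentioned above can be checked before handing a font to
setfont. This is only a rough sketch: it assumes PSF2-style storage
(each glyph row padded to whole bytes) and uses the 512-glyph, 32-row
and 64 KB figures quoted in this message. The names font_bytes and
font_limit_problems are made up for illustration, and the kernel's own
size accounting may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Limits as described in this message; not read from kernel headers. */
enum {
    MAX_GLYPHS     = 512,        /* op->charcount > 512 gives EINVAL */
    MAX_HEIGHT     = 32,         /* VGA-era row limit */
    MAX_FONT_BYTES = 64 * 1024   /* 64 KB font size limit */
};

enum {
    TOO_MANY_GLYPHS = 1 << 0,
    TOO_TALL        = 1 << 1,
    TOO_BIG         = 1 << 2
};

/* Bytes of bitmap data, assuming each row is padded to a whole byte. */
static size_t font_bytes(unsigned width, unsigned height, unsigned glyphs)
{
    assert(width > 0 && height > 0);
    size_t row_bytes = (width + 7) / 8;
    return row_bytes * height * glyphs;
}

/* Returns a bitmask of the limits a font would exceed (0 means none). */
static int font_limit_problems(unsigned width, unsigned height, unsigned glyphs)
{
    int problems = 0;
    if (glyphs > MAX_GLYPHS)
        problems |= TOO_MANY_GLYPHS;
    if (height > MAX_HEIGHT)
        problems |= TOO_TALL;
    if (font_bytes(width, height, glyphs) > MAX_FONT_BYTES)
        problems |= TOO_BIG;
    return problems;
}
```

Under these assumptions the 15x30 font with 870 glyphs needs
2 * 30 * 870 = 52200 bytes, so it trips only the glyph count. A 15x36
version would need 62640 bytes and would also exceed the height limit.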

BTW, the PSF font format documentation seems to suggest that
there is a way to make the kernel handle combining accents:
http://www.win.tue.nl/~aeb/linux/kbd/font-formats-1.html
Does anybody know if that really works? I could sure use that.


Re: [PATCH] Only send pdeath_signal when getppid changes.

2007-04-10 Thread Albert Cahalan

On 4/10/07, Roland McGrath <[EMAIL PROTECTED]> wrote:
>> Does a parent death signal make most sense between separately written
>> programs?
>
> I don't think it does.  It has always seemed an utterly cockamamy feature
> to me, and I've never understood what actually motivated it.

It's useful, but the other case is more important.

>> Does a parent death signal make most sense between processes that are part of
>> a larger program?
>
> That is the only way I can really see it being used.  The only actual
> example of use I know is what Albert Cahalan reported.  To my mind, the
> only semantics that matter for pdeath_signal are what previous uses
> expected in the past and still need for compatibility.  If we started with
> a fresh rationale from the ground up on what the feature is good for, I am
> rather skeptical it would pass muster to be added today.


Until inotify and dnotify work on /proc/12345/task, there really isn't
an alternative for some of us. Polling is unusable.

Ideally one could pick any container, session, process group,
mm, task group, or task for notification of state change.
State change means various things like destruction, addition
of something new, exec, etc. (stuff one can see in /proc)
With appropriate privs, having the debug-related stuff would be
good as well.
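
For reference, the feature under discussion is requested with
prctl(PR_SET_PDEATHSIG). Below is a minimal sketch of how a child might
arm it, assuming Linux. The helper name arm_parent_death_signal is
invented here, and the getppid() re-check is a common precaution against
the parent exiting before the prctl call, not something taken from this
thread.

```c
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the kernel to send `sig` to the calling task when its parent dies.
 * `expected_parent` is the parent's PID as recorded before fork().
 * Returns 0 on success, -1 if prctl() failed, and 1 if the parent had
 * already gone away (getppid() no longer matches), in which case the
 * signal would never arrive and the caller should clean up by itself.
 */
static int arm_parent_death_signal(int sig, pid_t expected_parent)
{
    assert(sig > 0);
    if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig) == -1)
        return -1;
    if (getppid() != expected_parent)
        return 1;
    return 0;
}
```

A child would typically call arm_parent_death_signal(SIGTERM, parent_pid)
right after fork(), with parent_pid saved from getpid() in the parent
before forking.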


Re: PID entries in /proc sorted by number, not start time in 2.6.19

2007-02-28 Thread Albert Cahalan

On 2/28/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
> Chuck Ebbert <[EMAIL PROTECTED]> writes:
>
>> Starting with kernel 2.6.19, the process directories in
>> /proc are sorted by number. They were sorted by process
>> start time in 2.6.18 and earlier. This makes the output
>> of procps come out in that order too, pissing off users
>> who are used to the old way.

ps --sort=start_time

I've always just assumed the order to be random.
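
A program that wants a particular order can sort explicitly instead of
depending on the order in which /proc hands out directory entries. The
sketch below assumes the documented /proc/PID/stat layout, in which the
start time is field 22. The struct and function names are illustrative,
and this is not a description of how procps itself does it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct proc_entry {
    int pid;
    unsigned long long start_time;   /* field 22 of /proc/PID/stat */
};

/*
 * Extract the start time (field 22) from one /proc/PID/stat line.
 * The command name (field 2) may contain spaces and parentheses, so
 * counting starts after the last ')'. Returns 0 on success, -1 on error.
 */
static int parse_start_time(const char *stat_line, unsigned long long *out)
{
    assert(stat_line != NULL && out != NULL);
    const char *p = strrchr(stat_line, ')');
    if (p == NULL)
        return -1;
    p++;                                  /* now just before field 3 */
    for (int field = 3; field < 22; field++) {
        p += strspn(p, " ");              /* skip separators */
        p += strcspn(p, " ");             /* skip one field */
    }
    p += strspn(p, " ");
    if (*p < '0' || *p > '9')
        return -1;
    *out = strtoull(p, NULL, 10);
    return 0;
}

/* qsort comparator: oldest process first, PID as tie-breaker. */
static int by_start_time(const void *a, const void *b)
{
    const struct proc_entry *x = a, *y = b;
    if (x->start_time != y->start_time)
        return x->start_time < y->start_time ? -1 : 1;
    return (x->pid > y->pid) - (x->pid < y->pid);
}
```

With an array of struct proc_entry filled from /proc, a call such as
qsort(entries, n, sizeof entries[0], by_start_time) gives the old
start-time order regardless of what the kernel returns.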


Re: kernel + gcc 4.1 = several problems

2007-01-04 Thread Albert Cahalan

On 1/4/07, Segher Boessenkool <[EMAIL PROTECTED]> wrote:
>> Adjusting gcc flags to eliminate optimizations is another way to go.
>> Adding -fwrapv would be an excellent start. Lack of this flag breaks
>> most code which checks for integer wrap-around.
>
> Lack of the flag does not break any valid C code, only code
> making unwarranted assumptions (i.e., buggy code).

Right, if "C" means "strictly conforming ISO C" to you.
(in which case, nearly all real-world code is broken)

FYI, the kernel also assumes that a "char" is 8 bits.
Maybe you should run away screaming.

>> The compiler "knows"
>> that signed integers don't ever wrap, and thus eliminates any code
>> which checks for values going negative after a wrap-around.
>
> You cannot assume it eliminates such code; the compiler is free
> to do whatever it wants in such a case.
>
> You should typically write such a computation using unsigned
> types, FWIW.
>
> Anyway, with 4.1 you shouldn't see frequent problems due to

Right, it gets much worse with the current gcc snapshots.

IMHO you should play such games with "g++ -O9", but that's
a discussion for a different mailing list.

> "not using -fwrapv while my code is broken WRT signed overflow"
> yet; and if/when problems start to happen, the "correct" action
> to take is not to add the compiler flag, but to fix the code.

Nope, unless we decide that the performance advantages of
a language change are worth the risk and pain.


Re: kernel + gcc 4.1 = several problems

2007-01-03 Thread Albert Cahalan

Linus Torvalds writes:

[probably Mikael Pettersson] writes:



The suggestions I've had so far which I have not yet tried:

- Select a different x86 CPU in the config.
  - Unfortunately the C3-2 flags seem to simply tell GCC to
    schedule for ppro (like i686) and enabled MMX and SSE
  - Probably useless


Actually, try this one. Try using something that doesn't like "cmov".
Maybe the C3-2 simply has some internal cmov bugginess.


Of course that changes register usage, register spilling,
and thus ultimately even the stack layout. :-(

Adjusting gcc flags to eliminate optimizations is another way to go.
Adding -fwrapv would be an excellent start. Lack of this flag breaks
most code which checks for integer wrap-around. The compiler "knows"
that signed integers don't ever wrap, and thus eliminates any code
which checks for values going negative after a wrap-around. I could
imagine this affecting a switch() or other jump table.
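
To make the wrap-around issue concrete, here is a small sketch of the
kind of check at stake. It is not taken from the kernel, and the
function names are made up. A test written as "a + b < a" on signed
ints depends on wrapping, which is exactly what the compiler may assume
away without -fwrapv. The versions below avoid that dependence.

```c
#include <assert.h>
#include <limits.h>

/*
 * Well defined for any inputs: compare against the limits before
 * adding, so no signed overflow ever happens.
 */
static int add_would_overflow(int a, int b)
{
    if (b > 0)
        return a > INT_MAX - b;
    return a < INT_MIN - b;
}

/* Wrapping addition done in unsigned arithmetic, which is defined to wrap. */
static unsigned add_wrapping(unsigned a, unsigned b)
{
    return a + b;
}

/* Signed addition that insists the caller ruled out overflow first. */
static int checked_add(int a, int b)
{
    assert(!add_would_overflow(a, b));
    return a + b;
}
```

Whether existing code should be rewritten this way or simply built with
-fwrapv is the disagreement in this thread. The sketch only shows what
the two styles look like.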


nasty thread-related bugs, maybe in exit

2006-12-20 Thread Albert Cahalan

There are big nasty bugs related to threaded processes exiting,
especially when involving: zombie leaders, clone w/o SIGCHLD,
and ptrace. I can make tasks that remain until reboot. I've seen
things stuck in "X" state. I've seen pending SIGKILL and even
blocked SIGKILL. I've seen "D" state pretending to dump core
for eternity, despite having core dumps disabled.

Does this not bother anybody? I posted this twice already:

http://lkml.org/lkml/2006/12/18/312
http://lkml.org/lkml/2006/12/19/335

Killing the parent does NOT always clear these zombies. Well,
perhaps it would, but PID 1 is protected.

The source code included below is cloninator.c minus SIGCHLD.
Run it in a loop, periodically sending it SIGKILL, like this:

gcc -m32 -O2 -std=gnu99 -o foo foo.c
while true; do killall -9 foo; ./foo; sleep 1; done

Note: it's NOT an unlimited fork bomb.

The original has SIGCHLD in the clone flags. Things go very
badly if you rapidly SIGKILL things while ptracing. You can
cause this with "strace" and "killall", but a more reliable
method is to have the ptracer use tgkill to SIGKILL all the
tasks as fast as possible.
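
For anyone wanting to try the tgkill variant, this is roughly what the
call looks like. It is a sketch only: the helper name kill_task is
invented, it goes through syscall() on the assumption that the C library
has no tgkill wrapper, and it does not by itself reproduce the bug.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for syscall() in <unistd.h> */
#endif
#include <assert.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Send `sig` to one specific task (thread) `tid` in thread group `tgid`.
 * Returns 0 on success, -1 with errno set on failure.
 * A `sig` of 0 performs only the existence and permission checks.
 */
static int kill_task(pid_t tgid, pid_t tid, int sig)
{
    assert(tgid > 0 && tid > 0);
    return (int)syscall(SYS_tgkill, tgid, tid, sig);
}
```

A ptracer would loop over the task IDs it is tracing and call
kill_task(tgid, tid, SIGKILL) for each one.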

Tested: both mainline 2.6.19 and the latest Fedora Core 5 kernel

///
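#include <sys/mman.h>
#include <signal.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <asm/unistd.h>

#include <sys/ipc.h>
#include <sys/shm.h>

#include <stdbool.h>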

static void early_write(int fd, const void *buf, size_t count)
{
#if 0
   unsigned long eax = __NR_write;
   /* push and pop because -fPIC probably
  needs ebx for the GOT base pointer */
   __asm__ __volatile__(
   "push %%ebx ; "
   "push %1 ; pop %%ebx ; int $0x80"
   "; pop %%ebx"
   :"=a"(eax)
   :"r"(fd),"c"(buf),"d"(count),"0"(eax)
   :"memory"
   );
#endif
}

static void p_str(char *s)
{
   size_t count = strlen(s);
   early_write(STDERR_FILENO,s,count);
}

static void p_hex(unsigned long u)
{
   char buf[9];
   char x[] = "0123456789abcdef";
   char *s = buf;
   s[8] = '\0';
   int i = 8;
   while(i--)
      buf[7-i] = x[(u>>(i*4))&15];
   early_write(STDERR_FILENO,buf,8);
}

static void p_dec(unsigned long u)
{
   char buf[11];
   char *s = buf+10;
   *s-- = '\0';
   int count = 0;
   while(u || !count)
   {
      *s-- = u%10 + '0';
      u /= 10;
      count++;
   }
   early_write(STDERR_FILENO,s+1,count);
}


#define FUTEX_WAIT  0
#define FUTEX_WAKE  1


typedef int lock_t;

#define LOCK_INITIALIZER 0
static inline void init_lock(lock_t* l) { *l = 0; }

// lock_add performs an atomic add
// and returns the resulting value
static inline int lock_add(lock_t* l, int val)
{
   int result = val;
   __asm__ __volatile__ (
   "lock; xaddl %1, %0;"
   : "=m" (*l), "=r" (result)
   : "1" (result), "m" (*l)
   : "memory");
   return result + val;
   // Returns the value written to memory
}

// lock_bts_high_bit atomically tests and
// sets the high bit and returns
// true if the bit was clear initially
static inline bool lock_bts_high_bit(lock_t* l)
{
   bool result;
   __asm__ __volatile__ (
   "lock; btsl $31, %0;\n\t"
   "setnc %1;"
   : "=m" (*l), "=q" (result)
   : "m" (*l)
   : "memory");
   return result;
}

static int futex(int* uaddr, int op, int val,
const struct timespec*timeout, int*uaddr2, int val3)
{
   (void)timeout;
   (void)uaddr2;
   (void)val3;
   int eax = __NR_futex;
   __asm__ __volatile__(
   "push %%ebx ; push %1 ; pop %%ebx"
   " ; int $0x80; pop %%ebx"
   :"=a"(eax)
   :"r"(uaddr),"c"(op),"d"(val),"0"(eax)
   :"memory"
   );
   return eax;
}

// lock will wait for and lock a mutex
static void lock(lock_t* l)
{
   // Check the mutex and set held bit
   if (lock_bts_high_bit(l))
   {
      // Got the mutex
      return;
   }
   // Increment wait count
   lock_add(l, 1);

   while (true)
   {
      // Check the mutex and set held bit
      if (lock_bts_high_bit(l))
      {
         // Got mutex, decrement wait count
         lock_add(l, -1);
         return;
      }

      int val = *l;
      // Ensure mutex not given up since check
      // (bit 31 is the held bit set by lock_bts_high_bit)
      if (!(val & 0x80000000))
         continue;

      // Wait for the mutex
      futex(l, FUTEX_WAIT, val, NULL, NULL, 0);
   }
}

// unlock will release a mutex
static void unlock(lock_t* l)
{
   // Turn off lock held bit and check for waiters
   // (adding 0x80000000 wraps bit 31 back to zero)
   if (lock_add(l, 0x80000000) == 0)
   {
      // No waiters
      return;
 

Re: [BUG] daemon.c blows up on OSX

2006-12-20 Thread Albert Cahalan

Linus Torvalds writes:


So it would appear that for OS X, the

  #define _XOPEN_SOURCE_EXTENDED 1 /* AIX 5.3L needs this */
  #define _GNU_SOURCE
  #define _BSD_SOURCE
sequence actually _disables_ those things.


Yes, of course. The odd one here is glibc.

Normal systems enable everything by default. As soon as you
specify a feature define, you get ONLY what you asked for.
I'm not sure why glibc is broken, but I suspect that somebody
wants to make everyone declare their code to be GNU source.
(despite many "GNU" things not working on the HURD at all)

Define _APPLE_C_SOURCE to make MacOS X give you everything.
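
One way to act on this is to define the glibc-oriented feature macros
only where they help. The following is a rough sketch, assuming the
compiler predefines __APPLE__ on OS X. copy_string is a hypothetical
helper used to show a name whose visibility depends on the macros, and
the sketch leaves the _APPLE_C_SOURCE route aside.

```c
/* Must come before any system header is included. */
#if !defined(__APPLE__)
#  ifndef _GNU_SOURCE
#    define _GNU_SOURCE                /* ask glibc for its extensions */
#  endif
#  ifndef _XOPEN_SOURCE_EXTENDED
#    define _XOPEN_SOURCE_EXTENDED 1   /* AIX 5.3L needs this */
#  endif
#endif
/* On OS X no feature macro is defined here, so the headers keep the
   default behaviour described in the message above. */

#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* strdup() is not part of ISO C99; it is one of the names whose
   visibility depends on the feature macros in effect. */
static char *copy_string(const char *s)
{
    assert(s != NULL);
    return strdup(s);
}
```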


Re: [PATCH] procfs: export context switch counts in /proc/*/stat

2006-12-20 Thread Albert Cahalan

On 12/20/06, David Wragg <[EMAIL PROTECTED]> wrote:
> "Albert Cahalan" <[EMAIL PROTECTED]> writes:
>> On Mon, Dec 18, 2006 at 11:50:08PM +, David Wragg wrote:
>>> This patch (against 2.6.19/2.6.19.1) adds the four context
>>> switch values (voluntary context switches, involuntary
>>> context switches, and the same values accumulated from
>>> terminated child processes) to the end of /proc/*/stat,
>>> similarly to min_flt, maj_flt and the time used values.
>>
>> Hmmm, OK, do people have a use for these values?
>
> My reason for writing the patch was to track which processes are
> active (i.e. got scheduled to run) by polling these context switch
> values.  The time used values are not a reliable way to detect process
> activity on fast machines.  So for example, when sorting by %CPU, top
> often shows many processes using 0% CPU, despite the fact that these
> processes are running occasionally.  If top sorted by (%CPU, context
> switch count delta), it might give a more useful display of which
> processes are active on the system.


Oh, that'd be great.

The cumulative ones are still not justified though, and I fear they
may be 64-bit even on i386. It turns out that an i386 procps spends
much of its time doing 64-bit division to parse the damn ASCII crap.
I suppose I could just skip those fields, but generating them isn't
too cheap and probably I'd get stuck parsing them for some other
reason -- having them separate is probably a good idea.
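
The (%CPU, context switch count delta) ordering suggested in the quoted
text could look something like this. It is purely a sketch: struct
sample and its fields are hypothetical and have nothing to do with the
actual top or procps data structures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-process sample; field names are illustrative. */
struct sample {
    int pid;
    unsigned cpu_permille;        /* CPU use over the interval, 0..1000 */
    unsigned long ctxsw_prev;     /* voluntary + involuntary, last poll */
    unsigned long ctxsw_now;      /* voluntary + involuntary, this poll */
};

/* Switches since the last poll; unsigned math keeps a counter wrap harmless. */
static unsigned long ctxsw_delta(const struct sample *s)
{
    assert(s != NULL);
    return s->ctxsw_now - s->ctxsw_prev;
}

/* qsort comparator: busiest first by CPU, ties broken by switch delta. */
static int by_activity(const void *a, const void *b)
{
    const struct sample *x = a, *y = b;
    if (x->cpu_permille != y->cpu_permille)
        return x->cpu_permille > y->cpu_permille ? -1 : 1;
    unsigned long dx = ctxsw_delta(x), dy = ctxsw_delta(y);
    return (dx < dy) - (dx > dy);
}
```

Processes that all show 0% CPU then still sort by how often they were
scheduled, which is the effect described above.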


Re: util-linux: orphan

2006-12-20 Thread Albert Cahalan

On 12/20/06, Jan Engelhardt <[EMAIL PROTECTED]> wrote:
>>> I've originally thought about util-linux upstream fork,
>>> but as usually an fork is bad step. So.. I'd like to start
>>> some discussion before this step.
>> ...
>>> after few weeks I'm pleased to announce a new "util-linux-ng"
>>> project. This project is a fork of the original util-linux (2.13-pre7).
>>
>> Well, how about giving me a chunk of it? I'd like /bin/kill please.
>> I already ship a nicer one in procps anyway, so you can just delete
>> the files and call that done. (just today I was working on a Fedora
>> system and /bin/kill annoyed me)
>
> How can you ship a "nicer" kill, given that its sole purpose is to accept
>
>   kill { -l | -t | {-s SIGNUM | -SIGNAME } somepid [morepids] }
>
> ?


I checked compatibility with Solaris, Tru64, probably a few BSDs,
and man pages of many others.

Fedora Core 5 doesn't seem to like this command:

/bin/kill -l 17 19

(which reminds me, I need to add sigqueue support and
maybe tgkill support)


> What about merging util-linux and procps?


How? Which way?

As I mentioned before, I was twice disappointed in missing
announcements of util-linux maintainership being up for grabs.
I certainly have a track record for keeping things stable.

Prior to me, procps has a history of being abandoned and
broken. Procps is a fork of the long-dead kmem-ps project.
Procps was then passed to someone who added color and
then disappeared. The prior maintainer picked up the old
code again, no doubt under influence of his employer Red Hat.
I rewrote much of it then, but had trouble getting in all of
my changes. Debian started using my code, which slowly
turned into a fork. Maintainership was passed to somebody
else, without even telling me. That person and his immediate
successor added numerous serious bugs. Inexperience with
the code and the lack of a test suite soon led to that group
being bogged down in problems. One by one, the various
Linux distributions switched over to my version of the code.

So as you may imagine, I'd be rather nervous about letting
procps get into that situation again. Bugs are yucky. Having
multiple committers and no testing is a sure path to ruin.


nasty thread-related bugs, maybe in exit

2006-12-20 Thread Albert Cahalan

There are big nasty bugs related to threaded processes exiting,
especially when involving: zombie leaders, clone w/o SIGCHLD,
and ptrace. I can make tasks that remain until reboot. I've seen
things stuck in X state. I've seen pending SIGKILL and even
blocked SIGKILL. I've seen D state pretending to dump core
for eternity, despite having core dumps disabled.

Does this not bother anybody? I posted this twice already:

http://lkml.org/lkml/2006/12/18/312
http://lkml.org/lkml/2006/12/19/335

Killing the parent does NOT always clear these zombies. Well,
perhaps it would, but PID 1 is protected.

The source code included below is cloninator.c minus SIGCHLD.
Run it in a loop, periodically sending it SIGKILL, like this:

gcc -m32 -O2 -std=gnu99 -o foo foo.c
while true; do killall -9 foo; ./foo; sleep 1; done

Note: it's NOT an unlimited fork bomb.

The original has SIGCHLD in the clone flags. Things go very
badly if you rapidly SIGKILL things while ptracing. You can
cause this with strace and killall, but a more reliable
method is to have the ptracer use tgkill to SIGKILL all the
tasks as fast as possible.
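
A rough sketch of the tgkill step, assuming a Linux system where the call is reached through syscall() (older glibc ships no wrapper). The helper name send_to_task is hypothetical; a real ptracer would call it with SIGKILL for every task ID it finds under /proc/PID/task.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Send `sig` to one specific task `tid` inside thread group `tgid`.
   Returns 0 on success, -1 with errno set on failure. */
int send_to_task(pid_t tgid, pid_t tid, int sig)
{
    return (int)syscall(SYS_tgkill, tgid, tid, sig);
}
```

Passing 0 as the signal only checks that the task exists, which is a safe way to try the helper.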

Tested: both mainline 2.6.19 and the latest Fedora Core 5 kernel

///
#include <sys/mman.h>
#include <signal.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <asm/unistd.h>

#include <sys/ipc.h>
#include <sys/shm.h>

#include <stdbool.h>

static void early_write(int fd, const void *buf, size_t count)
{
#if 0
   unsigned long eax = __NR_write;
   /* push and pop because -fPIC probably
  needs ebx for the GOT base pointer */
   __asm__ __volatile__(
   "push %%ebx ; "
   "push %1 ; pop %%ebx ; int $0x80"
   "; pop %%ebx"
   :"=a"(eax)
   :"r"(fd),"c"(buf),"d"(count),"0"(eax)
   :"memory"
   );
#endif
}

static void p_str(char *s)
{
   size_t count = strlen(s);
   early_write(STDERR_FILENO,s,count);
}

static void p_hex(unsigned long u)
{
   char buf[9];
   char x[] = "0123456789abcdef";
   char *s = buf;
   s[8] = '\0';
   int i = 8;
   while(i--)
   buf[7-i] = x[(u>>(i*4))&15];
   early_write(STDERR_FILENO,buf,8);
}

static void p_dec(unsigned long u)
{
   char buf[11];
   char *s = buf+10;
   *s-- = '\0';
   int count = 0;
   while(u || !count)
   {
   *s-- = u%10 + '0';
   u /= 10;
   count++;
   }
   early_write(STDERR_FILENO,s+1,count);
}


#define FUTEX_WAIT  0
#define FUTEX_WAKE  1


typedef int lock_t;

#define LOCK_INITIALIZER 0
static inline void init_lock(lock_t* l) { *l = 0; }

// lock_add performs an atomic add
// and returns the resulting value
static inline int lock_add(lock_t* l, int val)
{
   int result = val;
   __asm__ __volatile__ (
   "lock; xaddl %1, %0;"
   : "=m" (*l), "=r" (result)
   : "1" (result), "m" (*l)
   : "memory");
   return result + val;
   // Returns the value written to memory
}

// lock_bts_high_bit atomically tests and
// sets the high bit and returns
// true if the bit was clear initially
static inline bool lock_bts_high_bit(lock_t* l)
{
   bool result;
   __asm__ __volatile__ (
   "lock; btsl $31, %0;\n\t"
   "setnc %1;"
   : "=m" (*l), "=q" (result)
   : "m" (*l)
   : "memory");
   return result;
}

static int futex(int* uaddr, int op, int val,
const struct timespec*timeout, int*uaddr2, int val3)
{
   (void)timeout;
   (void)uaddr2;
   (void)val3;
   int eax = __NR_futex;
   __asm__ __volatile__(
   "push %%ebx ; push %1 ; pop %%ebx"
   " ; int $0x80; pop %%ebx"
   :"=a"(eax)
   :"r"(uaddr),"c"(op),"d"(val),"0"(eax)
   :"memory"
   );
   return eax;
}

// lock will wait for and lock a mutex
static void lock(lock_t* l)
{
   // Check the mutex and set held bit
   if (lock_bts_high_bit(l))
   {
   // Got the mutex
   return;
   }
   // Increment wait count
   lock_add(l, 1);

   while (true)
   {
   // Check the mutex and set held bit
   if (lock_bts_high_bit(l))
   {
   // Got mutex, decrement wait count
   lock_add(l, -1);
   return;
   }

   int val = *l;
   // Ensure mutex not given up since check
   if (!(val & 0x80000000))
   continue;

   // Wait for the mutex
   futex(l, FUTEX_WAIT, val, NULL, NULL, 0);
   }
}

// unlock will release a mutex
static void unlock(lock_t* l)
{
   // Turn off lock held bit and check for waiters
   if (lock_add(l, 0x80000000) == 0)
   {
  

Re: util-linux: orphan

2006-12-19 Thread Albert Cahalan

Karel Zak writes:


I've originally thought about util-linux upstream fork,
but as usually an fork is bad step. So.. I'd like to start
some discussion before this step.

...

after few weeks I'm pleased to announce a new "util-linux-ng"
project. This project is a fork of the original util-linux (2.13-pre7).


Aw damn, I missed it again. LKML gets about 300 posts/day. The last
time util-linux was offered, I missed out. Bummer.

Well, how about giving me a chunk of it? I'd like /bin/kill please.
I already ship a nicer one in procps anyway, so you can just delete
the files and call that done. (just today I was working on a Fedora
system and /bin/kill annoyed me)

VERY STRONG SUGGESTION: build a full test suite before you mess with
the source. This isn't some cute toy like xeyes or a silly game.
This is util-linux, which MUST work.


Re: BUG: wedged processes, test program supplied

2006-12-19 Thread Albert Cahalan

On 12/20/06, Mike Galbraith <[EMAIL PROTECTED]> wrote:

On Tue, 2006-12-19 at 21:46 -0500, Albert Cahalan wrote:
> Somebody PLEASE try this...

I was having enough fun with cloninator (which was whitespace munged
btw).


Anything stuck? Besides refusing to die, that beast slays debuggers
left and right. I just need to add execve of /proc/self/exe and a massive
storm of signals on the alternate stack.

In the original post, I also mangled the recommended ps command:
ps -Ccloninator -mwostat,ppid,pid,tid,nlwp,pending,sigmask,sigignore,caught,wchan

Leave out pid,tid,nlwp if you need to save screen space, like so:
ps -Ccloninator -mwostat,ppid,pending,sigmask,sigignore,caught,wchan

(note: procps versions prior to 3.2.7 are mostly fine, but will mess
up the PENDING column for any single-threaded processes you get)

This is fun to look at:
watch ps -Ccloninator fostat,ppid,wchan:9,comm


> Normally, when a process dies it becomes a zombie.
> If the parent dies (before or after the child), the child
> is adopted by init. Init will reap the child.
>
> The program included below DOES NOT get reaped.

While true wasn't a great test recommendation :)


Oh. I wanted to be sure you'd see the problem. Did you have
some... difficulty? A plain old ^C should make things stop.
The second test program is like the first, but missing SIGCHLD
from the clone flags, and hopefully not whitespace-mangled.

Note that the test program is not normally a fork bomb.
It limits itself to 42 tasks via a lock in shared memory.
If things are working OK, you should see no more than
about 60 tasks.
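
For readers who only want the idea of a shared task cap, here is a minimal sketch. It is not the test program's mechanism: the program uses SysV shared memory and a hand-rolled futex lock, while this sketch swaps in an anonymous shared mapping and GCC's __atomic builtins. All names (task_budget, budget_create, budget_take, budget_give) are hypothetical.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Lives in memory shared by every forked or cloned task. */
struct task_budget {
    unsigned nprocs;  /* tasks currently counted */
    unsigned limit;   /* cap, e.g. 42 */
};

/* Map the budget so that children created later share it. */
struct task_budget *budget_create(unsigned limit)
{
    struct task_budget *b = mmap(NULL, sizeof *b, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (b == MAP_FAILED)
        return NULL;
    b->nprocs = 0;
    b->limit = limit;
    return b;
}

/* Claim one slot; returns false once the cap is reached. */
bool budget_take(struct task_budget *b)
{
    unsigned n = __atomic_load_n(&b->nprocs, __ATOMIC_RELAXED);
    while (n < b->limit) {
        /* On failure the builtin reloads n, so the cap is rechecked. */
        if (__atomic_compare_exchange_n(&b->nprocs, &n, n + 1, false,
                                        __ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
            return true;
    }
    return false;
}

/* Return a slot when a task exits. */
void budget_give(struct task_budget *b)
{
    __atomic_fetch_sub(&b->nprocs, 1, __ATOMIC_ACQ_REL);
}
```

A task would call budget_take() before each fork or clone and skip the call when it returns false.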


Re: [PATCH] procfs: export context switch counts in /proc/*/stat

2006-12-19 Thread Albert Cahalan

David Wragg writes:

Benjamin LaHaise <[EMAIL PROTECTED]> writes:

On Mon, Dec 18, 2006 at 11:50:08PM +0000, David Wragg wrote:



This patch (against 2.6.19/2.6.19.1) adds the four context
switch values (voluntary context switches, involuntary
context switches, and the same values accumulated from
terminated child processes) to the end of /proc/*/stat,
similarly to min_flt, maj_flt and the time used values.


Hmmm, OK, do people have a use for these values?


Please put these into new files, as the stat files in /proc are
horribly overloaded and have always been somewhat problematic
when it comes to changing how things are reported due to internal
changes to the kernel.  Cheers,


No thanks. Yours truly, the maintainer of "ps", "top", "vmstat", etc.


The delay accounting value was added to the end of /proc/pid/stat back
in July without discussion, so I assumed this approach was still
considered satisfactory.


/proc/*/stat is the very best place in /proc for any per-process
data that will be commonly needed. Unlike /proc/*/status, few
people are tempted to screw with the formatting and/or spelling.
Unlike the /sys crap, it doesn't take 3 syscalls PER VALUE to
get at the data.
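
To make the syscall count concrete, here is a minimal sketch that pulls two fields out of /proc/PID/stat with one open, one read and one close. It assumes the layout documented in proc(5): pid, then comm in parentheses, then state and ppid. The function name read_stat is hypothetical, and this is not how procps itself does it.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Fetch the state letter and parent PID of `pid`.
   Returns 0 on success, -1 on any failure. */
int read_stat(pid_t pid, char *state, pid_t *ppid)
{
    char path[64], buf[1024];
    snprintf(path, sizeof path, "/proc/%ld/stat", (long)pid);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, sizeof buf - 1);
    close(fd);
    if (n <= 0)
        return -1;
    buf[n] = '\0';

    /* comm may itself contain spaces or ')', so search for the LAST ')'. */
    char *p = strrchr(buf, ')');
    if (!p)
        return -1;

    int parent;
    if (sscanf(p + 1, " %c %d", state, &parent) != 2)
        return -1;
    *ppid = (pid_t)parent;
    return 0;
}
```

Every further field comes out of the same buffer at no extra syscall cost, which is the point being made against one-value-per-file layouts.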

The things to ask are of course: will this really be used, and
does it really belong in /proc at all?


Putting just these four values into a new file would seem a little
odd, since they have a lot in common with the other getrusage values
that are already in /proc/pid/stat.  One possibility is to add
/proc/pid/rusage, mirroring the full struct rusage in text form, since
struct rusage is already part of the kernel ABI (though Linux doesn't
fill in half of the values).


Since we already have a struct defined and all...

sys_get_rusage(int pid)
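
For comparison, a short sketch of what the existing getrusage(2) call already offers. It only covers the caller (or its waited-for children), which is why a variant taking a PID is being suggested; sys_get_rusage above is a proposal, not an existing call. The helper name own_ctx_switches is hypothetical.

```c
#include <sys/resource.h>

/* Report the caller's voluntary and involuntary context-switch counts.
   Returns 0 on success, -1 on error. */
int own_ctx_switches(long *voluntary, long *involuntary)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    *voluntary = ru.ru_nvcsw;    /* gave up the CPU, e.g. blocked on I/O */
    *involuntary = ru.ru_nivcsw; /* preempted by the scheduler */
    return 0;
}
```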


Or perhaps it makes sense to reorganize all the values from
/proc/pid/stat and its siblings into a sysfs-like one-value-per-file
structure, though that might introduce atomicity and efficiency issues
(calculating some of the values involves iterating over the threads in
the process; with everything in one file, these loops are folded
together).


Yeah, big time. Things are quite bad in /proc, but /sys is a joke.


BUG: wedged processes, test program supplied

2006-12-19 Thread Albert Cahalan

Somebody PLEASE try this...

Normally, when a process dies it becomes a zombie.
If the parent dies (before or after the child), the child
is adopted by init. Init will reap the child.
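
For reference, the normal life cycle can be sketched in a few lines: the child exits, sits as a zombie, and disappears once its parent calls waitpid(). The function name spawn_and_reap is hypothetical, and this is only an illustration of the expected behaviour, not part of the test program.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`, then reap it.
   Returns the child's exit status, or -1 on error. */
int spawn_and_reap(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                     /* child: a zombie until reaped */

    int status;
    if (waitpid(pid, &status, 0) != pid) /* parent: reap the zombie */
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```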

The program included below DOES NOT get reaped.

Do like so:

gcc -m32 -O2 -std=gnu99 -o foo foo.c
while true; do killall -9 foo; ./foo; sleep 1; done

BTW, it gets even better if you start playing with ptrace.
Use the "strace" program (following children) and/or start
sending rapid-fire SIGKILL to all the various _threads_ in
the processes. You can get processes wedged in a wide
variety of interesting states. I've seen "X" state, processes
sitting around with pending SIGKILL, a process stuck in
"D" state supposedly core dumping despite ulimit 0 on
the core size, etc.

/

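#include <sys/mman.h>
#include <signal.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <asm/unistd.h>

#include <sys/ipc.h>
#include <sys/shm.h>

#include <stdbool.h>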

static void early_write(int fd, const void *buf, size_t count)
{
#if 0
   unsigned long eax = __NR_write;
   /* push and pop because -fPIC probably
  needs ebx for the GOT base pointer */
   __asm__ __volatile__(
   "push %%ebx ; "
   "push %1 ; pop %%ebx ; int $0x80"
   "; pop %%ebx"
   :"=a"(eax)
   :"r"(fd),"c"(buf),"d"(count),"0"(eax)
   :"memory"
   );
#endif
}

static void p_str(char *s)
{
   size_t count = strlen(s);
   early_write(STDERR_FILENO,s,count);
}

static void p_hex(unsigned long u)
{
   char buf[9];
   char x[] = "0123456789abcdef";
   char *s = buf;
   s[8] = '\0';
   int i = 8;
   while(i--)
   buf[7-i] = x[(u>>(i*4))&15];
   early_write(STDERR_FILENO,buf,8);
}

static void p_dec(unsigned long u)
{
   char buf[11];
   char *s = buf+10;
   *s-- = '\0';
   int count = 0;
   while(u || !count)
   {
   *s-- = u%10 + '0';
   u /= 10;
   count++;
   }
   early_write(STDERR_FILENO,s+1,count);
}

#define FUTEX_WAIT  0
#define FUTEX_WAKE  1


typedef int lock_t;

#define LOCK_INITIALIZER 0

static inline void init_lock(lock_t* l) { *l = 0; }

// lock_add performs an atomic add
// and returns the resulting value
static inline int lock_add(lock_t* l, int val)
{
   int result = val;
   __asm__ __volatile__ (
   "lock; xaddl %1, %0;"
   : "=m" (*l), "=r" (result)
   : "1" (result), "m" (*l)
   : "memory");
   return result + val;
   // Returns the value written to memory
}

// lock_bts_high_bit atomically tests and
// sets the high bit and returns
// true if the bit was clear initially
static inline bool lock_bts_high_bit(lock_t* l)
{
   bool result;
   __asm__ __volatile__ (
   "lock; btsl $31, %0;\n\t"
   "setnc %1;"
   : "=m" (*l), "=q" (result)
   : "m" (*l)
   : "memory");
   return result;
}

static int futex(int* uaddr, int op, int val,
const struct timespec*timeout, int*uaddr2, int val3)
{
   (void)timeout;
   (void)uaddr2;
   (void)val3;
   int eax = __NR_futex;
   __asm__ __volatile__(
   "push %%ebx ; push %1 ; pop %%ebx"
   " ; int $0x80; pop %%ebx"
   :"=a"(eax)
   :"r"(uaddr),"c"(op),"d"(val),"0"(eax)
   :"memory"
   );
   return eax;
}

// lock will wait for and lock a mutex
static void lock(lock_t* l)
{
   // Check the mutex and set held bit
   if (lock_bts_high_bit(l))
   {
   // Got the mutex
   return;
   }

   // Increment wait count
   lock_add(l, 1);

   while (true)
   {
   // Check the mutex and set held bit
   if (lock_bts_high_bit(l))
   {
   // Got mutex, decrement wait count
   lock_add(l, -1);
   return;
   }

   int val = *l;
   // Ensure mutex not given up since check
   if (!(val & 0x80000000))
   continue;

   // Wait for the mutex
   futex(l, FUTEX_WAIT, val, NULL, NULL, 0);
   }
}

// unlock will release a mutex
static void unlock(lock_t* l)
{
   // Turn off lock held bit and check for waiters
   if (lock_add(l, 0x80000000) == 0)
   {
   // No waiters
   return;
   }

   // Waiters found, wake up one of them
   futex(l, FUTEX_WAKE, 1, NULL, NULL, 0);
}

unsigned toomany = 42;

struct data {
   unsigned nprocs;
   lock_t lock;
   unsigned count;
};

struct data *data;

static struct data *get_shm(void)
{
   void *addr;
   int shmid;

   // create
   shmid = shmget(IPC_PRIVATE,42,IPC_CREAT|0666);
   // attach
   addr = shmat(shmid, NULL, 0);
   // don't want it to 

unreapable zombies, maybe futex+ptrace+exit

2006-12-18 Thread Albert Cahalan

I have a fun little test program for people to try. It creates zombies
that persist until reboot, despite being reparented to init. Sometimes
it creates processes that block SIGKILL, sit around with pending SIGKILL,
or both.

You'll want:

a. either assembly skills or the ability to run 32-bit x86 code
b. the procps-3.2.7 release, so you can easily view the results
c. the strace program, or some other ptrace-based debugger
d. a recent kernel -- updated Fedora 5 or mainline 2.6.19 will do

Compile like this:
gcc -m32 -std=gnu99 -O2 -o cloninator cloninator.c

Run like this:
strace -f -F ./cloninator

Let the program run for a bit, then do one of a few fun things:

a. hit ^C to stop it
b. run "killall -9 cloninator" to stop it
c. send SIGKILL to the process group (the negative as PID)
d. send SIGKILL to all your processes (use -1 as PID)

View the results:
ps -Ccloninator -mwostat,ppid,pid,tid,nlwp,pending,sigmask,sigignore,caught,wch

I suggest trying other debuggers. Under a debugger I can't share,
thousands of messed-up zombies get created in under a minute.
With strace, you'll probably get a half dozen after a couple of tries.
You might try gdb, fenris, nightview, and anything else which
uses ptrace to observe something. (Ideas?) Be sure to specify any
options needed to follow child processes; you may need to comment
out the CLONE_VFORK case for wimpy debuggers.

BTW, we can probably now answer this question:

$ egrep -i 'todo.*safe' kernel/*.c
kernel/exit.c:  // TODO: is this safe?
kernel/exit.c:  // TODO: is this safe?

///

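#include <sys/mman.h>
#include <signal.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <asm/unistd.h>

#include <sys/ipc.h>
#include <sys/shm.h>

#include <stdbool.h>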

static void early_write(int fd, const void *buf, size_t count)
{
#if 0
   unsigned long eax = __NR_write;
   // push and pop because -fPIC probably needs ebx for the GOT
   // base pointer
   __asm__ __volatile__(
   "push %%ebx ; push %1 ; pop %%ebx ; int $0x80; pop %%ebx"
   :"=a"(eax)
   :"r"(fd),"c"(buf),"d"(count),"0"(eax)
   :"memory"
   );
#endif
}

static void p_str(char *s)
{
   size_t count = strlen(s);
   early_write(STDERR_FILENO,s,count);
}

static void p_hex(unsigned long u)
{
   char buf[9];
   char x[] = "0123456789abcdef";
   char *s = buf;
   s[8] = '\0';
   int i = 8;
   while(i--)
   buf[7-i] = x[(u>>(i*4))&15];
   early_write(STDERR_FILENO,buf,8);
}

static void p_dec(unsigned long u)
{
   char buf[11];
   char *s = buf+10;
   *s-- = '\0';
   int count = 0;
   while(u || !count)
   {
   *s-- = u%10 + '0';
   u /= 10;
   count++;
   }
   early_write(STDERR_FILENO,s+1,count);
}


#define FUTEX_WAIT  0
#define FUTEX_WAKE  1


typedef int lock_t;

#define LOCK_INITIALIZER 0

static inline void init_lock(lock_t* l) { *l = 0; }

// lock_add performs an atomic add and returns the resulting value
static inline int lock_add(lock_t* l, int val)
{
   int result = val;
   __asm__ __volatile__ (
   "lock; xaddl %1, %0;"
   : "=m" (*l), "=r" (result)
   : "1" (result), "m" (*l)
   : "memory");
   return result + val; // Return the value written to memory
}

// lock_bts_high_bit atomically tests and sets the high bit and returns
// true if the bit was clear initially
static inline bool lock_bts_high_bit(lock_t* l)
{
   bool result;
   __asm__ __volatile__ (
   "lock; btsl $31, %0;\n\t"
   "setnc %1;"
   : "=m" (*l), "=q" (result)
   : "m" (*l)
   : "memory");
   return result;
}

static int futex(int* uaddr, int op, int val, const struct
timespec*timeout, int*uaddr2, int val3)
{
   (void)timeout;
   (void)uaddr2;
   (void)val3;
   int eax = __NR_futex;
   __asm__ __volatile__(
   "push %%ebx ; push %1 ; pop %%ebx ; int $0x80; pop %%ebx"
   :"=a"(eax)
   :"r"(uaddr),"c"(op),"d"(val),"0"(eax)
   :"memory"
   );
   return eax;
}


// lock will wait for and lock a mutex
static void lock(lock_t* l)
{
   // Check the mutex and set held bit
   if (lock_bts_high_bit(l))
   {
   // Got the mutex
   return;
   }

   // Increment wait count
   lock_add(l, 1);

   while (true)
   {
   // Check the mutex and set held bit
   if (lock_bts_high_bit(l))
   {
   // Got the mutex, decrement wait count
   lock_add(l, -1);
   return;
   }

   int val = *l;
   // Ensure the mutex wasn't given up since the check
   if (!(val & 0x80000000))
   continue;

   

Re: new procfs memory analysis feature

2006-12-11 Thread Albert Cahalan

David Singleton writes:


Add variation of /proc/PID/smaps called /proc/PID/pagemaps.
Shows reference counts for individual pages instead of aggregate totals.
Allows more detailed memory usage information for memory analysis tools.
An example of the output shows the shared text VMA for ld.so and
the share depths of the pages in the VMA.

a7f4b000-a7f65000 r-xp  00:0d 19185826   /lib/ld-2.5.90.so
 11 11 11 11 11 11 11 11 11 13 13 13 13 13 13 13 8 8 8 13 13 13 13 13 13 13


Arrrgh! Not another ghastly maps file!

The original was mildly defective. Somebody thought " (deleted)" was
a reserved filename extension. Somebody thought "/SYSV*" was also
some kind of reserved namespace. Nobody ever thought to bother with
a properly specified grammar; it's more fun to blame application
developers for guessing as best they can. The use of %08lx is quite
a wart too, looking ridiculous on 64-bit systems.
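To make the parsing complaint concrete, here is a minimal sketch of a line parser in C. It assumes the commonly seen field order (address range, permissions, offset, device, inode, then an optional path); since no grammar is specified, that layout is itself a guess. The struct and function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of a maps-style file (field layout is an assumption). */
struct map_entry {
    unsigned long long start, end, offset;
    char perms[5];
    unsigned int dev_major, dev_minor;
    unsigned long long inode;
    char path[256];
};

/* Parse one line; returns 0 on success, -1 if the fixed fields do not match. */
int parse_maps_line(const char *line, struct map_entry *e)
{
    int consumed = 0;

    e->path[0] = '\0';
    int n = sscanf(line, "%llx-%llx %4s %llx %x:%x %llu %n",
                   &e->start, &e->end, e->perms, &e->offset,
                   &e->dev_major, &e->dev_minor, &e->inode, &consumed);
    if (n < 7)
        return -1;

    /*
     * Whatever follows the inode is copied verbatim as the path.
     * A trailing " (deleted)" cannot be told apart from a file whose
     * name really ends that way, so it is left in place.
     */
    if (consumed > 0) {
        strncpy(e->path, line + consumed, sizeof e->path - 1);
        e->path[sizeof e->path - 1] = '\0';
        e->path[strcspn(e->path, "\n")] = '\0';
    }
    return 0;
}
```

Reading the addresses with %llx rather than a fixed eight-digit width keeps the same parser usable on 64-bit systems.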

Now we have /proc/*/smaps, which should make decent programmers cry.
Really now, WTF? It has compact non-obvious parts, which would be a
nice choice for performance if not for being MIXED with wordy bloated
parts of a completely different nature. Parsing is terribly painful.

Supposedly there is a NUMA version too.

Along the way, nobody bothered to add support for describing the
page size (IMHO your format ***severely*** needs this) or for the
various VMA flags to indicate if memory is locked, randomized, etc.

There can be a million pages in a mapping for a 32-bit process.
If my guess (since you too failed to document your format) is right,
you propose to have one decimal value per page. In other words,
the lines of this file can be megabytes long without even getting
to the issue of 64-bit hardware. This is no text file!
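For a sense of scale, the arithmetic can be sketched in C. The 4 KiB page size, the 4 GiB mapping, and the three bytes per entry (two digits plus a space, as in the sample output) are assumptions for illustration, and the function names are made up.

```c
#include <stdint.h>

/* Number of pages needed to cover `bytes` of address space (rounds up). */
uint64_t pages_in_mapping(uint64_t bytes, uint64_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Approximate length of one text line holding one decimal count per page. */
uint64_t line_bytes(uint64_t pages, uint64_t bytes_per_entry)
{
    return pages * bytes_per_entry + 1;   /* +1 for the trailing newline */
}
```

Under those assumptions, a full 32-bit address space is 1,048,576 pages, and a single line describing it runs to about 3 MB.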

How about a proper system call? Enough is enough already. Take a
look at the mincore system call. Imagine it taking a PID. The 7
available bits probably won't do, so expand that a bit. Just take
the user-allowed parts of the VMA and/or PTE (both variants are
good to have) and put them in a struct. There may be some value
in having both low-privilege and high-privilege versions of this.
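For reference, this is roughly what the existing mincore call looks like when used on the caller's own address space. It is a minimal sketch, not the proposed PID-taking interface; the helper name is hypothetical. Only bit 0 of each result byte is defined, which is where the "7 available bits" come from.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many pages of [addr, addr+len) are resident in memory,
 * using mincore(2). addr must be page-aligned. Returns -1 on error.
 */
long resident_pages(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len == 0)
        return -1;

    size_t npages = (len + (size_t)page - 1) / (size_t)page;
    unsigned char *vec = malloc(npages);
    if (!vec)
        return -1;

    if (mincore(addr, len, vec) != 0) {
        free(vec);
        return -1;
    }

    long resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;   /* bit 0: page is resident */

    free(vec);
    return resident;
}
```

A richer call along the lines suggested above would return a struct per page instead of a single byte.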

BTW, you might wish to ensure that Wine can implement VirtualQueryEx
perfectly based on this.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC][PATCH] Simple privacy enhancement for /proc/pid

2005-04-12 Thread Albert Cahalan
On Sun, 2005-04-10 at 17:38 +0200, Rene Scharfe wrote:

> Albert, allowing access based on tty sounds nice, but it _is_ expensive.
> More importantly, perhaps, it would "virtualize" /proc: every user would
> see different permissions for certain files in there.  That's too complex
> for my taste.

If you really can't allow access based on tty, then at least allow
access if any UID value matches any UID value. Without this, a user
can not always see a setuid program they are running.
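The "any UID matches any UID" rule could look something like the following sketch. The struct and function are hypothetical stand-ins, assuming each process carries real, effective, saved, and filesystem UIDs; this is not taken from any patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* The UID values a process carries (hypothetical stand-in for kernel state). */
struct uid_set {
    uid_t ruid, euid, suid, fsuid;
};

/* True if any UID of the viewer equals any UID of the target. */
bool any_uid_matches(const struct uid_set *viewer, const struct uid_set *target)
{
    const uid_t v[] = { viewer->ruid, viewer->euid, viewer->suid, viewer->fsuid };
    const uid_t t[] = { target->ruid, target->euid, target->suid, target->fsuid };

    for (size_t i = 0; i < 4; i++)
        for (size_t j = 0; j < 4; j++)
            if (v[i] == t[j])
                return true;
    return false;
}
```

With this rule, a user with UID 1000 still sees a setuid-root program they started, because its real UID remains 1000 even though the effective UID is 0.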

> First, configuring via kernel parameters is sufficient.  It simplifies
> implementation a lot because we know the settings cannot change.  And we
> don't need the added flexibility of sysctls anyway -- I assume these
> parameters are set at installation time and never touched again.

This means mucking with boot parameters, which can be a pain.
The various boot loaders do not all use the same config file.

> Then I suppose we don't need to be able to fine-tune the permissions for
> each file in /proc/pid/.  All that we need is a distinction between
> "normal" users (which are to be restricted) and admins (which need to
> see everything).

The /proc/*/maps file sure is different from the /proc/*/status file.
The same for all the others, really.

> This patch introduces two kernel parameters: proc.privacy and proc.gid.
> The group ID attribute of all files below /proc/pid is set to
> proc.gid, but only if you activate the feature by setting proc.privacy
> to a non-zero value.

This is very bad. Please do not change the GID as seen by
the stat() call. This value is used.




Re: Kernel SCM saga..

2005-04-09 Thread Albert Cahalan
Linus Torvalds writes:

> NOTE! I detest the centralized SCM model, but if push comes to shove,
> and we just _can't_ get a reasonable parallel merge thing going in
> the short timeframe (ie month or two), I'll use something like SVN
> on a trusted site with just a few committers, and at least try to
> distribute the merging out over a few people rather than making _me_
> be the throttle.
>
> The reason I don't really want to do that is once we start doing
> it that way, I suspect we'll have a _really_ hard time stopping.
> I think it's a broken model. So I'd much rather try to have some
> pain in the short run and get a better model running, but I just
> wanted to let people know that I'm pragmatic enough that I realize
> that we may not have much choice.

I think you at least instinctively know this, but...

Centralized SCM means you have to grant and revoke commit access,
which means that Linux gets the disease of ugly BSD politics.

Under both the old pre-BitKeeper patch system and under BitKeeper,
developer rank is fuzzy. Everyone knows that some developers are
more central than others, but it isn't fully public and well-defined.
You can change things day by day without having to demote anyone.
While Linux development isn't completely without jealousy and pride,
few have stormed off (mostly IDE developers AFAIK) and none have
forked things as severely as OpenBSD and DragonflyBSD.

You may rank developer X higher than developer Y, but they have
only a guess as to how things are. Perhaps developer X would be
a prideful jerk if he knew. Perhaps developer Y would quit in
resentment if he knew.

Whatever you do, please avoid the BSD-style politics.

(the MAINTAINERS file is bad enough; it has caused problems)




Re: Can't use SYSFS for "Proprietry" driver modules !!!.

2005-03-27 Thread Albert Cahalan
greg k-h writes:
> On Sat, Mar 26, 2005 at 10:30:20PM -0500, Lee Revell wrote:

>> That's the problem, it's not spelled out explicitly anywhere.
>> That file does not address the issue of whether a driver is
>> a "derived work". This is the part he should talk to a lawyer
>> about, right?
>
> How about the fact that when you load a kernel module, it is
> linked into the main kernel image?  The GPL explicitly states
> what needs to be done for code linked in.

This probably fails. Obviously, it's not over until the courts
say so, but...

First of all, the GPL might not be as infectious as you and RMS
wish it to be. There is a limit to what can be a derived work
in copyright law.

Second of all, module loading is not the same as "linking" in
the traditional sense. The GPL was written before Linux had
kernel modules. Don't be so sure a court would rule as you
would like it to rule.

> Also, realize that you have to use GPL licensed header files
> to build your kernel module...

Um, like the printer cartridges and game cartridges with code
in them? Courts have held that it was OK to copy because it was
needed to implement an interface.

Whatever your lawyer may have said was undoubtedly influenced
by your biased attempt to describe the technical issues.

Not that I care for proprietary stuff, being a PowerPC user
myself, but spreading unjustified FUD isn't proper behavior.
Neither is it proper to be marking key driver interfaces as
GPL-only. It's far better to just ignore the proprietary stuff.




Re: [PATCH][0/6] Change proc file permissions with sysctls

2005-03-19 Thread Albert Cahalan
On Sun, 2005-03-20 at 01:22 +0100, Rene Scharfe wrote:

> The permissions of files in /proc/1 (usually belonging to init) are
> kept as they are.  The idea is to let system processes be freely
> visible by anyone, just as before.  Especially interesting in this
> regard would be instances of login.  I don't know how to easily
> discriminate between system processes and "normal" processes inside
> the kernel (apart from pid == 1 and uid == 0 (which is too broad)).
> Any ideas?

The ideal would be to allow viewing:

1. killable processes (that is, YOU can kill them)
2. processes sharing a tty with a killable process

Optionally, add:

3. processes controlling a tty master of a killable process
4. ancestors of all of the above
5. children of killable processes

This is of course expensive, but maybe you can get some of
it cheaply. For example, allow viewing a process if the session
leader, group leader, parent, or tpgid process is killable.
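A userspace approximation of the cheap variant might look like this sketch. It leans on the fact that kill() with signal 0 performs the permission and existence checks without delivering a signal. The struct and function names are hypothetical, and an in-kernel version would use the kernel's own permission check instead.

```c
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* True if the caller would be permitted to signal `pid`. */
bool is_killable(pid_t pid)
{
    if (pid <= 0)
        return false;          /* 0 and negative values address process groups */
    return kill(pid, 0) == 0;  /* signal 0: check only, nothing is delivered */
}

/* IDs related to one process (hypothetical; e.g. read from /proc/PID/stat). */
struct proc_ids {
    pid_t pid, ppid, pgid, sid, tpgid;
};

/* Allow viewing if the process or one of its close relatives is killable. */
bool may_view(const struct proc_ids *p)
{
    return is_killable(p->pid)  || is_killable(p->sid)  ||
           is_killable(p->pgid) || is_killable(p->ppid) ||
           is_killable(p->tpgid);
}
```

Each check is a constant-time lookup, which is what keeps this cheaper than walking all ancestors or every process on a tty.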




Re: [RFC][PATCH] new timeofday core subsystem (v. A3)

2005-03-17 Thread Albert Cahalan
On Thu, 2005-03-17 at 16:55 +, Russell King wrote:
> On Tue, Mar 15, 2005 at 10:23:54AM -0500, Albert Cahalan wrote:
> > On Mon, 2005-03-14 at 19:22 -0800, Christoph Lameter wrote:
> > > On Mon, 14 Mar 2005, Albert Cahalan wrote:
> > > 
> > > > When the vsyscall page is created, copy the one needed function
> > > > into it. The kernel is already self-modifying in many places; this
> > > > is nothing new.
> > > 
> > > AFAIK this will only work on ia32 and x86_64 and definitely not
> > > on ia64. Who knows about the other platforms
> > 
> > I'll bet it does work fine on IA-64. If it didn't, you would
> > be unable to load the kernel or load an executable.
> > 
> > I know it works for PowerPC. You'll need an isync instruction
> > of course. You may also want a sync instruction and some code
> > to invalidate the cache.
> > 
> > Setting up the page content should be a 1-time operation done
> > at boot. Check your processor manuals as needed.
> 
> Won't work on ARM.  We have XIP kernels, which prevents the use of
> self-modifying code.

Does the ARM kernel provide a special page of code for
apps to execute? If not, then ARM is irrelevant.

Doesn't ARM always have an MMU? If you have an MMU, then
it is no problem to have one single page of non-XIP code
for this purpose.

Supposing that you do support the vsyscall hack and you don't
have an MMU, you can just place the tiny code fragment on the
stack (or anywhere else) when an exec is performed.

So, as far as I can see, ARM is fully capable of supporting this.




Re: [PATCH][RFC] /proc umask and gid [was: Make /proc/ chmod'able]

2005-03-15 Thread Albert Cahalan
Better interface:

/sbin/sysctl -w proc.maps=0440
/sbin/sysctl -w proc.cmdline=0444
/sbin/sysctl -w proc.status=0444

The /etc/sysctl.conf file can be used to set these
at boot time.
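The values above are ordinary octal permission modes. As a small sketch of how such a setting could be validated, assuming the proposed proc.* sysctls take a plain octal string (the function name is hypothetical):

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Parse an octal permission string such as "0440".
 * Returns 0 and stores the mode on success, -1 if the text is not
 * a valid octal mode in the range 0000..0777.
 */
int parse_mode(const char *text, mode_t *out)
{
    char *end;

    errno = 0;
    unsigned long v = strtoul(text, &end, 8);
    if (errno != 0 || end == text || *end != '\0' || v > 0777)
        return -1;

    *out = (mode_t)v;
    return 0;
}
```

Here 0440 means readable by owner and group only, while 0444 adds read access for everyone else.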




Re: [PATCH][RFC] /proc umask and gid [was: Make /proc/ chmod'able]

2005-03-15 Thread Albert Cahalan
On Wed, 2005-03-16 at 03:39 +0100, Rene Scharfe wrote:
> So, I gather from the feedback I've got that chmod'able /proc/
> would be a bit over the top. 8-)  While providing the easiest and most
> intuitive user interface for changing the permissions on those
> directories, it is overkill.  Paul is right when he says that such a
> feature should be turned on or off for all sessions at once, and that's
> it.
> 
> My patch had at least one other problem: the contents of each
> /proc/ directory became chmod'able, too, which was not intended.
> 
> Instead of fixing it up I took two steps back, dusted off the umask
> kernel parameter patch and added the "special gid" feature I mentioned.
> 
> Without the new kernel parameters behaviour is unchanged.  Add
> proc.umask=077 and all /proc/ will get a permission mode of 500.
> This breaks pstree (no output), as Bodo already noted, because this
> program needs access to /proc/1.  It also breaks w -- it shows the
> correct number of users but it lists X even for sessions owned
> by the user running it.
> 
> Use proc.umask=007 and proc.gid=50 instead and all /proc/ dirs
> will have a mode of 550 and their group attribute will be set to 50
> (that's "staff" on my Debian system).  Pstree will work for all members
> of that special group (just like top, ps and w -- which also show
> everything in that case).  Normal users will still have a restricted
> view.
> 
> Albert, would you take fixes for w even though you despise the feature
> that makes them necessary?

I will take patches if they are not too messy and they do not
cause tools to report garbage output. For example, I do not
wish to have tools reporting -1, 0, or uninitialized data in
place of correct data.
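One way a tool can avoid printing garbage is to carry an explicit status next to every value it reads, so "permission denied" never turns into a 0 or an uninitialized buffer. A minimal sketch follows; the enum and function names are hypothetical and not taken from procps.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

enum read_status { READ_OK, READ_DENIED, READ_GONE, READ_ERROR };

/*
 * Read up to size-1 bytes of `path` into buf and NUL-terminate it.
 * The returned status says why data is missing, so the caller can
 * print a placeholder instead of a made-up value.
 */
enum read_status read_proc_field(const char *path, char *buf,
                                 size_t size, size_t *len)
{
    *len = 0;
    if (size == 0)
        return READ_ERROR;
    buf[0] = '\0';

    FILE *f = fopen(path, "r");
    if (!f) {
        if (errno == EACCES || errno == EPERM)
            return READ_DENIED;     /* hidden by permissions */
        if (errno == ENOENT || errno == ESRCH)
            return READ_GONE;       /* process exited or file absent */
        return READ_ERROR;
    }

    *len = fread(buf, 1, size - 1, f);
    buf[*len] = '\0';

    int failed = ferror(f);
    fclose(f);
    return failed ? READ_ERROR : READ_OK;
}
```

A caller would then show something like "-" or skip the row for READ_DENIED rather than formatting whatever happens to be in the buffer.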

Distinct controls for the various files could be useful.
I might want to make /proc/*/cmdline be public, or make
/proc/*/maps be private. This is particularly helpful if
a low-security file is added for bare-bones ps operation.

You might make a special exception for built-in kernel tasks
and init.




Re: Capabilities across execve

2005-03-15 Thread Albert Cahalan
Russell King, the latest person to notice defects, writes:

> However, the way the kernel is setup today, this seems
> impossible to achieve, which tends to make the whole
> idea of capabilities completely and utterly useless.
>
> How is this stuff supposed to work?  Are my ideas of
> what's supposed to be achievable completely wrong,
> although they look completely reasonable to me.
>
> Don't get me wrong - the capability system seems great at
> permanently revoking capabilities via /proc/sys/kernel/cap-bound,
> and dropping them within an application provided it remains UID0.
> Apart from that, capabilities seem completely useless.
...
> it seems to be something of a lost cause.
...
> my goal of running the script with minimal capabilities
> was completely *impossible* to achieve.

Uh huh. First, some history.

Capability bits were implemented in DG-UX and IRIX.
The two systems did not agree on operation. The draft
POSIX standard, withdrawn for good reason, greatly
changed between draft 16 and draft 17. Settings that
work for one draft are horribly insecure on the other.
Linux capabilities were partly done by the IRIX crew,
working from draft 16. Everyone else had draft 17 or
even draft 13. (and DG-UX had a better system anyway)

Tytso put things well when he wrote: "A lot of innocent
bits have been deforested  while trying work out the
differences between what Linux is doing (which is basically
following Draft 17), and what Trusted Irix is doing (which 
apparently is following Draft 16)."

Then along comes a sendmail exploit. An emergency fix
was produced, breaking an already-defective capability
design.

Note that, unlike DG-UX, our IRIX-inspired design did
not reserve any capability bits for non-kernel use.
This causes an inconsistent security model, with things
like the X server relying on UID. Inconsistency is bad.

OK, so that's how we got into this mess.

Now, how do we get out?

We will always have to deal with old-style apps. Those
few apps that handle capabilities can handle the bad
system we have now, and can handle a system without the
capability syscalls. (for old kernels) These apps can
not handle a changed setup though; to change things we
must make the old syscalls return failure. ANYTHING ELSE
IS VERY UNSAFE.

There is exactly one capability system in popular use.
That would be the one that comes with Solaris. Moving
toward that, via a kernel config option, appears to be
a sane way to get ourselves unstuck from this big mess.
An added advantage is that the Solaris-style method
instantly becomes the standard, especially if Linux is
strongly compatible. This helps with admin training and
portable software.

See if you can find any holes:
http://docs.sun.com/app/docs/doc/816-5175/6mbba7f39?a=view




Re: [PATCH][RFC] Make /proc/ chmod'able

2005-03-15 Thread Albert Cahalan
On Tue, 2005-03-15 at 15:31 +0100, Bodo Eggert wrote:
> (snipped the CC list - hope that's ok)
> 
> On Mon, 14 Mar 2005, Albert Cahalan wrote:
> > On Tue, 2005-03-15 at 00:08 +0100, Bodo Eggert wrote:
> > > On Mon, 14 Mar 2005, Albert Cahalan wrote:

> > This really isn't about security.
> 
> Information leakage is a security aspect.

If you will go to such extremes, Linux is poorly suited.
A user can detect activity on the computer by examining
the performance of their own activity.

> > Privacy may be undesirable.
> 
> May. That's why I suggested the min/max sysctl.
> 
> > With privacy comes anti-social behavior.
> 
> With anti-social behavior comes the admin and his LART.
> 
> BTW: If the users want to be anti-social, they'll just rename setiathome 
> to something like -bash or soffice.

This does not matter: "Rene, your soffice program is eating
too much CPU time. Find some other place to run it."

> > Supposing that the
> > users do get privacy, perhaps because they have paid for it:
> 
> Vservers,
> > Xen, UML, VM, VMware, separate computers
> > 
> > Going with separate computers is best.
> 
> If you like wasting space and energy. If the user's demands don't exceed 
> one percent of a historic PC, there is no point in buying more hardware.

Sure there is:

a. info leakage (way more than just /proc)
b. admin control
c. budget control
d. downtime hits fewer users

> > Don't forget to use
> > network traffic control to keep users from being able to
> > detect the network activity of other users.
> 
> Like that:?
> 
> $ netstat
> Active Internet connections (w/o servers)
> Proto Recv-Q Send-Q Local Address   Foreign Address State
> /proc/net/tcp: Permission denied

Nope. If you really care about information leakage, you'll
be concerned about the ability to detect network congestion.

Example #1

A spy sends packets from time to time. He measures the delay
and packet loss to determine if the network is busy. When the
network suddenly becomes busy, he can guess that you have
started some operation that requires heavy network traffic.

Example #2

A spy sends packets from time to time. He measures the delay
and packet loss to determine if the network is busy. Over time,
he learns when workers are busy. From this he can determine an
appropriate time to sneak into your building.

Hey, if you're going to be paranoid about %CPU and %MEM, you
have to be paranoid about %NET too. This requires traffic
control unless you have separate networks. Assign a fixed
portion of bandwidth to any user that you wish to hide info
from. Be sure to consider latency as well.

> > > > Users who want privacy can get their
> > > > own computer. So, these need to work:
> > > > 
> > > > ps [...]
> > > > w
> > > > top
> > > 
> > > Works as intended. Only pstree breaks, if init isn't visible.
> > 
> > They work like they do with a rootkit installed.
> > Traditional behavior has been broken.
> 
> They are as broken as finger or ls are if the home directory is chmodded.

Probably something should be done to deal with the problem of
a chmodded home directory. It's not ls that matters though.
It's du that matters. On a normal shared system, a user should
be able to see where all the disk blocks and inodes are going.
Filenames need not be visible. Then: "Rene, you're being kind
of greedy about the disk space aren't you? You're using 666 GB."




Re: [RFC][PATCH] new timeofday core subsystem (v. A3)

2005-03-15 Thread Albert Cahalan
On Mon, 2005-03-14 at 19:22 -0800, Christoph Lameter wrote:
> On Mon, 14 Mar 2005, Albert Cahalan wrote:
> 
> > When the vsyscall page is created, copy the one needed function
> > into it. The kernel is already self-modifying in many places; this
> > is nothing new.
> 
> AFAIK this will only works on ia32 and x86_64 and not definitely not
> on ia64. Who knows about the other platforms 

I'll bet it does work fine on IA-64. If it didn't, you would
be unable to load the kernel or load an executable.

I know it works for PowerPC. You'll need an isync instruction
of course. You may also want a sync instruction and some code
to invalidate the cache.

Setting up the page content should be a 1-time operation done
at boot. Check your processor manuals as needed.
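
For readers who want to see the shape of "copy the function, then
synchronize" outside the kernel, a userspace sketch in C follows. It is
an analogy, not the vsyscall code: install_code_page is a made-up name,
the caller supplies bytes that are assumed to be valid
position-independent machine code, and the GCC/Clang builtin
__builtin___clear_cache stands in for the architecture-specific
sequence (the sync/isync and cache invalidation mentioned above for
PowerPC).

```c
#define _DEFAULT_SOURCE 1   /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Copy `len` bytes of already-built machine code into a fresh mapping
 * and make it executable. Returns the mapping, or NULL on failure.
 * Hypothetical helper: nothing here generates or validates the code.
 */
void *install_code_page(const void *code, size_t len, size_t page_size)
{
    if (len == 0 || len > page_size)
        return NULL;

    /* Start writable but not executable. */
    void *page = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return NULL;

    memcpy(page, code, len);

    /* Make instruction fetch see the bytes just written. The compiler
       emits whatever the target needs; on targets that need no explicit
       flush this may compile to nothing. */
    __builtin___clear_cache((char *)page, (char *)page + len);

    /* One-time switch to read+execute; the page is not written again. */
    if (mprotect(page, page_size, PROT_READ | PROT_EXEC) != 0) {
        munmap(page, page_size);
        return NULL;
    }
    return page;
}
```

The page is written once and then made read-only and executable, which
matches the 1-time setup described above.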




Re: Capabilities across execve

2005-03-15 Thread Albert Cahalan
Russell King, the latest person to notice defects, writes:

> However, the way the kernel is setup today, this seems
> impossible to achieve, which tends to make the whole
> idea of capabilities completely and utterly useless.
>
> How is this stuff supposed to work?  Are my ideas of
> what's supposed to be achievable completely wrong,
> although they look completely reasonable to me.
>
> Don't get me wrong - the capability system seems great at
> permanently revoking capabilities via /proc/sys/kernel/cap-bound,
> and dropping them within an application provided it remains UID0.
> Apart from that, capabilities seem completely useless.
...
> it seems to be something of a lost cause.
...
> my goal of running the script with minimal capabilities
> was completely *impossible* to achieve.

Uh huh. First, some history.

Capability bits were implemented in DG-UX and IRIX.
The two systems did not agree on operation. The draft
POSIX standard, withdrawn for good reason, greatly
changed between draft 16 and draft 17. Settings that
work for one draft are horribly insecure on the other.
Linux capabilities were partly done by the IRIX crew,
working from draft 16. Everyone else had draft 17 or
even draft 13. (and DG-UX had a better system anyway)

Tytso put things well when he wrote: "A lot of innocent
bits have been deforested while trying work out the
differences between what Linux is doing (which is basically
following Draft 17), and what Trusted Irix is doing (which
apparently is following Draft 16)."

Then along comes a sendmail exploit. An emergency fix
was produced, breaking an already-defective capability
design.

Note that, unlike DG-UX, our IRIX-inspired design did
not reserve any capability bits for non-kernel use.
This causes an inconsistent security model, with things
like the X server relying on UID. Inconsistency is bad.

OK, so that's how we got into this mess.

Now, how do we get out?

We will always have to deal with old-style apps. Those
few apps that handle capabilities can handle the bad
system we have now, and can handle a system without the
capability syscalls. (for old kernels) These apps can
not handle a changed setup though; to change things we
must make the old syscalls return failure. ANYTHING ELSE
IS VERY UNSAFE.

There is exactly one capability system in popular use.
That would be the one that comes with Solaris. Moving
toward that, via a kernel config option, appears to be
a sane way to get ourselves unstuck from this big mess.
An added advantage is that the Solaris-style method
instantly becomes the standard, especially if Linux is
strongly compatible. This helps with admin training and
portable software.

See if you can find any holes:
http://docs.sun.com/app/docs/doc/816-5175/6mbba7f39?a=view




Re: [PATCH][RFC] /proc umask and gid [was: Make /proc/<pid> chmod'able]

2005-03-15 Thread Albert Cahalan
On Wed, 2005-03-16 at 03:39 +0100, Rene Scharfe wrote:
> So, I gather from the feedback I've got that chmod'able /proc/<pid>
> would be a bit over the top. 8-)  While providing the easiest and most
> intuitive user interface for changing the permissions on those
> directories, it is overkill.  Paul is right when he says that such a
> feature should be turned on or off for all sessions at once, and that's
> it.
>
> My patch had at least one other problem: the contents of each
> /proc/<pid> directory became chmod'able, too, which was not intended.
>
> Instead of fixing it up I took two steps back, dusted off the umask
> kernel parameter patch and added the special gid feature I mentioned.
>
> Without the new kernel parameters behaviour is unchanged.  Add
> proc.umask=077 and all /proc/<pid> will get a permission mode of 500.
> This breaks pstree (no output), as Bodo already noted, because this
> program needs access to /proc/1.  It also breaks w -- it shows the
> correct number of users but it lists X even for sessions owned
> by the user running it.
>
> Use proc.umask=007 and proc.gid=50 instead and all /proc/<pid> dirs
> will have a mode of 550 and their group attribute will be set to 50
> (that's "staff" on my Debian system).  Pstree will work for all members
> of that special group (just like top, ps and w -- which also show
> everything in that case).  Normal users will still have a restricted
> view.
>
> Albert, would you take fixes for w even though you despise the feature
> that makes them necessary?

I will take patches if they are not too messy and they do not
cause tools to report garbage output. For example, I do not
wish to have tools reporting -1, 0, or uninitialized data in
place of correct data.

Distinct controls for the various files could be useful.
I might want to make /proc/*/cmdline be public, or make
/proc/*/maps be private. This is particularly helpful if
a low-security file is added for bare-bones ps operation.

You might make a special exception for built-in kernel tasks
and init.




Re: [PATCH][RFC] /proc umask and gid [was: Make /proc/<pid> chmod'able]

2005-03-15 Thread Albert Cahalan
Better interface:

/sbin/sysctl -w proc.maps=0440
/sbin/sysctl -w proc.cmdline=0444
/sbin/sysctl -w proc.status=0444

The /etc/sysctl.conf file can be used to set these
at boot time.
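
To make the proposed interface concrete, the sketch below models it as
a lookup table in C. The proc.maps, proc.cmdline, and proc.status names
are the proposal from this message, not existing sysctls; the struct,
the table, and proc_mode_for are invented for illustration, and the
fallback argument is an assumption about what an unlisted file would
get.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical per-file policy entry. */
struct proc_file_mode {
    const char *name;   /* file name under /proc/<pid>/ */
    mode_t mode;        /* permission bits the admin configured */
};

/* Values mirror the three example settings in the message above. */
static const struct proc_file_mode proc_modes[] = {
    { "maps",    0440 },
    { "cmdline", 0444 },
    { "status",  0444 },
};

/* Return the configured mode for `name`, or `fallback` if unlisted. */
mode_t proc_mode_for(const char *name, mode_t fallback)
{
    size_t n = sizeof proc_modes / sizeof proc_modes[0];

    for (size_t i = 0; i < n; i++)
        if (strcmp(proc_modes[i].name, name) == 0)
            return proc_modes[i].mode;
    return fallback;
}
```

With this table, proc_mode_for("maps", 0400) yields 0440, while a file
with no entry, such as "environ", keeps whatever fallback the caller
passes.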




Re: [PATCH][RFC] Make /proc/<pid> chmod'able

2005-03-14 Thread Albert Cahalan
On Tue, 2005-03-15 at 00:08 +0100, Bodo Eggert wrote:
> On Mon, 14 Mar 2005, Albert Cahalan wrote:
> > On Mon, 2005-03-14 at 10:42 +0100, Rene Scharfe wrote:
> > > Albert Cahalan wrote:
> 
> > > Why do you think users should not be allowed to chmod their processes' 
> > > /proc directories?  Isn't it similar to being able to chmod their home 
> > > directories?  They own both objects, after all (both conceptually and as 
> > > attributed in the filesystem).
> > 
> > This is, to use your own word, "cloaking". This would let
> > a bad user or even an unauthorized user hide from the admin.
> 
> NACK, the admin (and with the new inherited capabilities all users with 
> cap_???_override) can see all processes. Only users who don't need to know
> won't see the other user's processes.

Capabilities are too broken for most people to use. Normal users
do not get CAP_DAC_OVERRIDE by default anyway, for good reason.

> > Note that the admin hopefully does not normally run as root.
> 
> su1 and sudo exist.

This is a pain. Now every user will need sudo access,
and the sudoers file will have to disable requesting
passwords so that scripts will work without hassle.

> > Even if the admin were not running as a normal user, it is
> > expected that normal users can keep tabs on each other.
> > The admin may be sleeping. Social pressure is important to
> > prevent one user from sucking up all the memory and CPU time.
> 
> Privacy is important, too. Imagine each user can see the CEO (or the
> admin) executing "ee nakedgirl.jpg".

Obviously, he likes to have users see him do this.
He'd use a private machine if he wanted privacy.

> > > > Note: I'm the procps (ps, top, w, etc.) maintainer.
> > > > 
> > > > Probably I'd have to make /bin/ps run setuid root
> > > > to deal with this. (minor changes needed) The same
> > > > goes for /usr/bin/top, which I know is currently
> > > > unsafe and difficult to fix.
> 
> I used unpatched procps 3.1.11, and it worked for me, except pstree.

It does not work correctly.

Look, patches with this "feature" are called rootkits.
Think of the headlines: "Linux now with built-in rootkit".

> > > Why do ps and top need to be setuid root to deal with a resticted /proc? 
> > > What information in /proc/<pid> needs to be available to any and all
> > > users?
> > 
> > Anything provided by traditional UNIX and BSD systems
> > should be available.
> 
> e.g. the buffer overflow in sendmail? Or all the open relays? :)
> 
> The demands to security and privacy have increased. Linux should be able 
> to provide the requested privacy.

This really isn't about security. Privacy may be undesirable.
With privacy comes anti-social behavior. Supposing that the
users do get privacy, perhaps because they have paid for it:

Xen, UML, VM, VMware, separate computers

Going with separate computers is best. Don't forget to use
network traffic control to keep users from being able to
detect the network activity of other users.

> > Users who want privacy can get their
> > own computer. So, these need to work:
> > 
> > ps -ef
> > ps -el
> > ps -ej
> > ps axu
> > ps axl
> > ps axj
> > ps axv
> > w
> > top
> 
> Works as intended. Only pstree breaks, if init isn't visible.

They work like they do with a rootkit installed.
Traditional behavior has been broken.




Re: [RFC][PATCH] new timeofday core subsystem (v. A3)

2005-03-14 Thread Albert Cahalan
On Mon, 2005-03-14 at 12:27 -0800, Matt Mackall wrote:
> On Mon, Mar 14, 2005 at 12:04:07PM -0800, john stultz wrote:
> > > > > > > > +static inline cycle_t read_timesource(struct timesource_t* ts)
> > > > > > > > +{
> > > > > > > > +       switch (ts->type) {
> > > > > > > > +       case TIMESOURCE_MMIO_32:
> > > > > > > > +               return (cycle_t)readl(ts->mmio_ptr);
> > > > > > > > +       case TIMESOURCE_MMIO_64:
> > > > > > > > +               return (cycle_t)readq(ts->mmio_ptr);
> > > > > > > > +       case TIMESOURCE_CYCLES:
> > > > > > > > +               return (cycle_t)get_cycles();
> > > > > > > > +       default: /* case: TIMESOURCE_FUNCTION */
> > > > > > > > +               return ts->read_fnct();
> > > > > > > > +       }
> > > > > > > > +}
> > > Well where we'd read an MMIO address, we'd simply set read_fnct to
> > > generic_timesource_mmio32 or so. And that function just does the read.
> > > So both that function and read_timesource become one-liners and we
> > > drop the conditional branches in the switch.
> > 
> > However the vsyscall/fsyscall bits cannot call in-kernel functions (as
> > they execute in userspace or a sudo-userspace). As it stands now in my
> > design TIMESOURCE_FUNCTION timesources will not be usable for
> > vsyscall/fsyscall implementations, so I'm not sure if that's doable.
> > 
> > I'd be interested you've got a way around that.
> 
> We can either stick all the generic mmio timer functions in the
> vsyscall page (they're tiny) or leave the vsyscall using type/ptr but
> have the kernel internally use only the function pointer. Someone
> who's more familiar with the vsyscall timer code should chime in here.

When the vsyscall page is created, copy the one needed function
into it. The kernel is already self-modifying in many places; this
is nothing new.
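
A small userspace model of the "pick the one needed function at setup"
idea, in C, may help. It is a sketch only: cycle_t and the
TIMESOURCE_MMIO_* names come from the quoted patch, but struct
timesource, the read member, and timesource_setup are invented here,
and plain memory stands in for an MMIO register. It shows the choice
being made once; copying the chosen function's code into the vsyscall
page is not modeled.

```c
#include <stdint.h>

typedef uint64_t cycle_t;

enum timesource_type { TIMESOURCE_MMIO_32, TIMESOURCE_MMIO_64 };

struct timesource {
    enum timesource_type type;
    const volatile void *mmio_ptr;  /* counter register (plain memory here) */
    cycle_t (*read)(const struct timesource *ts);  /* chosen at setup */
};

static cycle_t read_mmio32(const struct timesource *ts)
{
    return *(const volatile uint32_t *)ts->mmio_ptr;
}

static cycle_t read_mmio64(const struct timesource *ts)
{
    return *(const volatile uint64_t *)ts->mmio_ptr;
}

/* Done once, when the timesource is registered. */
void timesource_setup(struct timesource *ts)
{
    ts->read = (ts->type == TIMESOURCE_MMIO_64) ? read_mmio64 : read_mmio32;
}

/* Hot path: a single indirect call, no switch on the type. */
cycle_t read_timesource(const struct timesource *ts)
{
    return ts->read(ts);
}
```

After timesource_setup() the type field is never consulted again, which
is the property that makes it possible to install just one reader.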





Re: [PATCH][RFC] Make /proc/<pid> chmod'able

2005-03-14 Thread Albert Cahalan
On Mon, 2005-03-14 at 10:42 +0100, Rene Scharfe wrote:
> Albert Cahalan wrote:
> > This is a bad idea. Users should not be allowed to
> > make this decision. This is rightly a decision for
> > the admin to make.
> 
> Why do you think users should not be allowed to chmod their processes' 
> /proc directories?  Isn't it similar to being able to chmod their home 
> directories?  They own both objects, after all (both conceptually and as 
> attributed in the filesystem).

This is, to use your own word, "cloaking". This would let
a bad user or even an unauthorized user hide from the admin.
Why should someone be able to hide a suspicious CPU hog?
Maybe they are cracking passwords or selling your CPU time.

Note that the admin hopefully does not normally run as root.
The admin should be using a normal user account most of the
time, to reduce the damage caused by his accidents.

Even if the admin were not running as a normal user, it is
expected that normal users can keep tabs on each other.
The admin may be sleeping. Social pressure is important to
prevent one user from sucking up all the memory and CPU time.

> > Note: I'm the procps (ps, top, w, etc.) maintainer.
> > 
> > Probably I'd have to make /bin/ps run setuid root
> > to deal with this. (minor changes needed) The same
> > goes for /usr/bin/top, which I know is currently
> > unsafe and difficult to fix.
> > 
> > Let's not go there, OK?
> 
> I have to admit to not having done any real testing with those 
> utilities.  My excuse is this isn't such a new feature, Openwall had 
> something similar for at least four years now and GrSecurity contains 
> yet another flavour of it.  Openwall also provides one patch for 
> procps-2.0.6, so I figured that problem (whatever their patch is about) 
> got fixed in later versions.

If I haven't seen that patch, to Hell with 'em.

It appears that Openwall is using procps-2.0.7 now. Oooh, they've
upgraded to something that's only 4.5 years old! Anybody using a
4-year-old procps is uninterested in security.

> Why do ps and top need to be setuid root to deal with a resticted /proc? 
> What information in /proc/<pid> needs to be available to any and all
> users?

Anything provided by traditional UNIX and BSD systems
should be available. Users who want privacy can get their
own computer. So, these need to work:

ps -ef
ps -el
ps -ej
ps axu
ps axl
ps axj
ps axv
w
top

Note that /proc does provide a bit more info than required.
This could be changed; it requires new /proc files or a
non-proc source of data.

> > If you restricted this new ability to root, then I'd
> > have much less of an objection. (not that I'd like it)
> 
> How about a boot parameter or sysctl to enable the chmod'ability of 
> /proc/<pid>, defaulting to off?  But I'd like to resolve your more
> general objections above first, if possible. :)

This at least avoids breaking the traditional ability of
non-root users to spot abuse.




Re: [PATCH][RFC] Make /proc/<pid> chmod'able

2005-03-13 Thread Albert Cahalan
> OK, folks, another try to enhance privacy by hiding
> process details from other users.  Why not simply use
> chmod to set the permissions of /proc/<pid> directories?
> This patch implements it.
>
> Children processes inherit their parents' proc
> permissions on fork.  You can only set (and remove)
> read and execute permissions, the bits for write,
> suid etc. are not changable.  A user would add
>
> chmod 500 /proc/$$
>
> or something similar to his .profile to cloak his processes.
>
> What do you think about that one?

This is a bad idea. Users should not be allowed to
make this decision. This is rightly a decision for
the admin to make.

Note: I'm the procps (ps, top, w, etc.) maintainer.

Probably I'd have to make /bin/ps run setuid root
to deal with this. (minor changes needed) The same
goes for /usr/bin/top, which I know is currently
unsafe and difficult to fix.

Let's not go there, OK?

If you restricted this new ability to root, then I'd
have much less of an objection. (not that I'd like it)





Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)

2005-03-11 Thread Albert Cahalan
On Fri, 2005-03-11 at 19:15 +0000, Alan Cox wrote:
> > You forgot the PCI domain (a.k.a. hose, phb...) number.
> > Also, you might encode bus,slot,function according to
> > the PCI spec. So that gives:
> > 
> > long usr_pci_open(unsigned pcidomain, unsigned devspec, __u64 dmamask);
> 
> Still insufficient because the device might be hotplugged on you. You
> need a file handle that has the expected revocation effects on unplug
> and refcounts

I was under the impression that a file handle would be returned.

I'm not so sure that is a sane way to handle hot-plug though.
First of all, in general, it's going to be like this:

Fan, meet shit.
Shit, meet fan.

Those who care might best be served by SIGBUS with si_code
and si_info set appropriately. Perhaps a revoke() syscall
that handled mmap() would work the same way.
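
For reference, a process that wants the details of a SIGBUS asks for
them with SA_SIGINFO. What follows is only a minimal sketch, assuming
the kernel really did report device removal this way; on_sigbus,
install_sigbus_handler and the three globals are names made up for
illustration. POSIX siginfo_t carries si_code and si_addr; it has no
member literally called si_info.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Hypothetical bookkeeping; a real driver would pick its own names. */
static volatile sig_atomic_t sigbus_seen;  /* set to 1 once SIGBUS arrives */
static volatile sig_atomic_t sigbus_code;  /* copy of info->si_code */
static void *volatile sigbus_addr;         /* copy of info->si_addr */

/* Three-argument handler, selected by SA_SIGINFO below. */
static void on_sigbus(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    sigbus_code = info->si_code;  /* e.g. BUS_ADRERR for a real bus fault */
    sigbus_addr = info->si_addr;  /* faulting address, when the kernel sets it */
    sigbus_seen = 1;
    /* After a real fault, returning would re-run the faulting access;
       a driver would have to unmap, siglongjmp() away, or exit here. */
}

/* Returns 0 on success, -1 with errno set on failure (as sigaction does). */
int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    return sigaction(SIGBUS, &sa, NULL);
}
```

Calling install_sigbus_handler() once at startup is enough; the handler
only records what it was told and leaves the recovery policy to the
caller.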





Re: User mode drivers: part 2: PCI device handling (patch 1/2 for 2.6.11)

2005-03-10 Thread Albert Cahalan
Peter Chubb writes:

> There are three new system calls:
>
>   long   usr_pci_open(int bus, int slot, int function, __u64 dma_mask);
>  Returns a filedescriptor for the PCI device described 
>  by bus,slot,function.  It also enables the device, and sets it 
>  up as a bus-mastering DMA device, with the specified dma mask.

You forgot the PCI domain (a.k.a. hose, phb...) number.
Also, you might encode bus,slot,function according to
the PCI spec. So that gives:

long usr_pci_open(unsigned pcidomain, unsigned devspec, __u64 dmamask);

(with the user library returning an int instead of long)
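
The message does not spell out how devspec would be packed. One
plausible reading, sketched here purely as an assumption, is the usual
PCI configuration-address layout: 8 bits of bus, 5 bits of device
(slot), 3 bits of function. The helper names are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: bits 15..8 bus, bits 7..3 slot, bits 2..0 function. */
static inline uint32_t pci_devspec(unsigned bus, unsigned slot, unsigned func)
{
    return ((bus & 0xffu) << 8) | ((slot & 0x1fu) << 3) | (func & 0x07u);
}

/* Inverse helpers, so a driver can print the address back out. */
static inline unsigned pci_devspec_bus(uint32_t d)  { return (d >> 8) & 0xffu; }
static inline unsigned pci_devspec_slot(uint32_t d) { return (d >> 3) & 0x1fu; }
static inline unsigned pci_devspec_func(uint32_t d) { return d & 0x07u; }
```

With this packing, bus 1, slot 2, function 3 becomes 0x0113, and the
PCI domain stays a separate argument as proposed above.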




Re: binary drivers and development

2005-03-10 Thread Albert Cahalan
Lennart Sorensen writes:

> You forgot the very important:
>- Only works on architecture it was compiled for.  So anyone not
>  using i386 (and maybe later x86-64) is simply out of luck.  What do
>  nvidia users that want accelerated nvidia drivers for X DRI do
>  right now if they have a powerpc or a sparc or an alpha?  How about
>  porting Linux to a new architecture.  With binary drivers you now
>  start out with no drivers on the new architecture except for the
>  ones you have source for.  Not very productive.

Rik van Riel writes:

> No, it wouldn't.  I can use a source code driver on x86,
> x86-64 and PPC64 systems, but a binary driver is only
> usable on the architecture it was compiled for.
>
> Source code is way more portable than binary anything.

The kernel already has an AML interpreter for ACPI. **duck**

As for portability, AML would do the job. It beats typical
vendor source code IMHO, because endianness and integer size
are well-defined. (like the Java VM and .net)

For the x86 and ia64 users, the AML interpreter is probably
already compiled into the kernel. Most people need it to
set up SMP or power management. So, no added bloat even.

AML code is fairly well controlled and isolated. There is
of course the backdoor via DMA for the truly determined
evil author, but such paranoia is rather extreme. AML is
really designed for this sort of task.

As with any interpreter, there are ways (JIT) to make the
AML interpreter go faster if need be.




Re: [patch] inotify for 2.6.11

2005-03-06 Thread Albert Cahalan
Christoph Hellwig writes:
> On Sat, Mar 05, 2005 at 07:40:06PM -0500, Robert Love wrote:
>> On Sun, 2005-03-06 at 00:04 +, Christoph Hellwig wrote:
 
>>> The user interface is still bogus.
>>
>> I presume you are talking about the ioctl.  I have tried to engage you
>> and others on what exactly you prefer instead.  I have said that moving
>> to a write interface is fine but I don't see how it is _any_ better than
>> the ioctl.  Write is less typed, in fact, since we lose the command
>> versus argument delineation.
>> 
>> But if it is a unanimous decision, I'll switch it.  Or take patches. ;-)
>> It isn't a big deal.
>
> See the review I sent.  Write is exactly the right interface for that kind
> of thing.  For command vs argument either put the number first so we don't
> have the problem of finding a delimiter that isn't a valid filename, or
> use '\0' as such.

That's just putrid. You've proposed an interface that
combines the worst of ASCII with the worst of binary.

It is now well-established that ASCII interfaces are
horribly slow. This one will be no exception... but
with the '\0' in there, you have a binary interface.
So, it's an evil hybrid.

An ioctl() is a syscall with scope restricting it to a
single fd. This is a fine user interface, not a bogus one.
(keep 32-on-64 operation in mind to be polite)
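
On the 32-on-64 point, the usual trick is to give the ioctl argument
only fixed-width members, so that a 32-bit process on a 64-bit kernel
sees the same layout. A rough sketch under that assumption follows;
struct watch_req and watch_req_init are invented for illustration and
are not taken from the inotify patch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical ioctl argument: no bare pointers or longs, so 32-bit
   and 64-bit callers agree on member sizes and offsets. */
struct watch_req {
    uint64_t path_ptr;  /* user-space pointer, widened to 64 bits */
    uint32_t mask;      /* event mask */
    uint32_t reserved;  /* explicit padding, must be zero */
};

/* Fill in a request; the pointer is carried as an integer. */
void watch_req_init(struct watch_req *req, const char *path, uint32_t mask)
{
    memset(req, 0, sizeof *req);
    req->path_ptr = (uint64_t)(uintptr_t)path;
    req->mask = mask;
}
```

The structure is 16 bytes for both kinds of caller, so the kernel side
would need no separate compat translation for it.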

If you'd rather have a normal (global) system call though,
that'll do too, likely leading to a bit more type checking
in the glibc-provided headers.

Adding plain old syscalls is rather nice actually.
It's only a pain at first, while waiting for glibc
to catch up.




Re: [PATCH] audit: handle loginuid through proc

2005-02-25 Thread Albert Cahalan
On Thu, 2005-02-24 at 22:49 -0800, Chris Wright wrote:
> * Albert Cahalan ([EMAIL PROTECTED]) wrote:

> > Assuming you'd like ps to print the LUID, how about
> > putting it with all the others? There are "Uid:"
> > lines in the /proc/*/status files.
> 
> It's also set (written) via /proc, so it should probably stay separate.

Gross. Please rip this out before it hits the streets.
(it's an interface change that might need eternal support)
Consider that:

1. Every other UID is handled by system calls:
   getuid, setuid, geteuid, setreuid,
   setresuid, getresuid, setfsuid

2. HP's Tru64 has getluid() and setluid() system calls
   that Linux should be compatible with. SecureWare has a
   version too, which looks more-or-less compatible with
   what HP is offering. (the descriptions do not conflict,
   but one has more details) It looks like ssh, apache,
   and sendmail (huh?) already know to use these system
   calls even. 

The <prot.h> header is used. The prototypes are the obvious ones.
The setluid() call returns 0 on success.

Tru64 notes that the login UID is sometimes called the
audit UID (AUID) because it is recorded with most audit
events.

getluid() returns an error if the LUID (AUID) is unset.

SecureWare additionally notes that setuid() and setgid() will
also fail when the luid is unset, to ensure that the LUID
is set before any other identity changes. (probably Linux
should just disable setting LUID after that point)



Just to be complete, here's what Sun did:

Sun has getauid() and setauid() syscalls which are
somewhat similar. They take pointers to the ID, and they
require privilege (PRIV_SYS_AUDIT and PRIV_PROC_AUDIT
for setauid, or just PRIV_PROC_AUDIT for getauid)
These calls have been superseded by getaudit_addr() and
setaudit_addr(), which use structs containing:

au_id_t       ai_auid;    // audit user ID
au_mask_t     ai_mask;    // preselection mask
au_tid_addr_t ai_termid;  // terminal ID
au_asid_t     ai_asid;    // audit session ID

(the terminal ID is variable length, containing a
network address and a length value for it)




Re: [PATCH] audit: handle loginuid through proc

2005-02-24 Thread Albert Cahalan
Assuming you'd like ps to print the LUID, how about
putting it with all the others? There are "Uid:"
lines in the /proc/*/status files.
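
For context, the "Uid:" and "Gid:" lines each carry four numbers
(real, effective, saved and filesystem IDs). A minimal sketch of how a
tool might pull them out of a status file that has already been read
into memory is given here; struct status_ids and parse_status_ids are
hypothetical names, and the plain prefix comparison is a stand-in, not
the way procps itself does the lookup.

```c
#include <stdio.h>
#include <string.h>

struct status_ids {
    long uid[4];   /* real, effective, saved, filesystem */
    long gid[4];
    int have_uid;  /* 1 if a complete Uid: line was found */
    int have_gid;  /* 1 if a complete Gid: line was found */
};

/* buf holds the whole text of /proc/<pid>/status, NUL-terminated.
   Returns 0 if both lines were found and parsed, -1 otherwise. */
int parse_status_ids(const char *buf, struct status_ids *out)
{
    const char *p = buf;

    memset(out, 0, sizeof *out);
    while (p && *p) {
        if (strncmp(p, "Uid:", 4) == 0)
            out->have_uid = sscanf(p + 4, "%ld %ld %ld %ld",
                                   &out->uid[0], &out->uid[1],
                                   &out->uid[2], &out->uid[3]) == 4;
        else if (strncmp(p, "Gid:", 4) == 0)
            out->have_gid = sscanf(p + 4, "%ld %ld %ld %ld",
                                   &out->gid[0], &out->gid[1],
                                   &out->gid[2], &out->gid[3]) == 4;
        p = strchr(p, '\n');  /* advance to the start of the next line */
        if (p)
            p++;
    }
    return (out->have_uid && out->have_gid) ? 0 : -1;
}
```

Unknown keys are skipped, so a parser written this way keeps working
when new lines are added to the file.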




Re: [PATCH] A new entry for /proc

2005-02-24 Thread Albert Cahalan
[quoting various people...]

> Here is a new entry developed for /proc that prints, for each process
> memory area (VMA), the size of its rss. The maps file in the original
> kernel can present the virtual size of each vma, but not the physical
> size (rss). This entry can provide additional information for tools
> that analyze memory consumption. You can see the physical memory
> size of each library used by a process, and also of the executable file.
>
> Take a look at the output:
> # cat /proc/877/smaps 
> 08048000-08132000 r-xp  /usr/bin/xmms
> Size: 936 kB
> Rss: 788 kB 
> 08132000-0813a000 rw-p  /usr/bin/xmms
> Size:  32 kB
> Rss:  32 kB 
> 0813a000-081dd000 rw-p
> Size: 652 kB
> Rss: 616 kB

The most important thing about a /proc file format is that it has
a documented means of being extended in the future. Without such
documentation, it is impossible to write a reliable parser.

The "Name: value" stuff is rather slow. Right now procps (ps, top, etc.)
is using a perfect hash function to parse the /proc/*/status files.
("man gperf") This is just plain gross, but needed for decent performance.

Extending the /proc/*/maps file might be possible. It is commonly used
by debuggers I think, so you'd better at least verify that gdb is OK.
The procps "pmap" tool uses it too. To satisfy the procps parser:

a. no more than 31 flags
b. no '/' prior to the filename
c. nothing after the filename
d. no new fields inserted prior to the inode number
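
To make the constraints concrete, here is a rough sketch of a parser
for one maps line. It is not the procps code; struct map_entry and
parse_maps_line are hypothetical, and the 32-byte flags buffer is
simply one way a limit like rule (a) could arise. It treats everything
after the inode as the filename, which is why rules (c) and (d) matter
to a parser of this shape.

```c
#include <stdio.h>
#include <string.h>

struct map_entry {
    unsigned long start, end, offset, inode;
    unsigned dev_major, dev_minor;
    char flags[32];  /* room for 31 flag characters plus the NUL */
    char path[256];  /* empty for anonymous mappings */
};

/* Parse one line of /proc/<pid>/maps. Returns 0 on success, -1 otherwise. */
int parse_maps_line(const char *line, struct map_entry *e)
{
    int consumed = 0;
    size_t len;

    memset(e, 0, sizeof *e);
    if (sscanf(line, "%lx-%lx %31s %lx %x:%x %lu %n",
               &e->start, &e->end, e->flags, &e->offset,
               &e->dev_major, &e->dev_minor, &e->inode, &consumed) < 7)
        return -1;

    /* Everything after the inode, up to the newline, is the filename. */
    len = strcspn(line + consumed, "\n");
    if (len >= sizeof e->path)
        len = sizeof e->path - 1;
    memcpy(e->path, line + consumed, len);
    e->path[len] = '\0';
    return 0;
}
```

The virtual size of a mapping is then just (end - start), which is the
figure the proposed "Size:" line repeats; only the rss is new
information.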

> If there were a use for it, that use might want to distinguish between
> the "shared rss" of pagecache pages from a file, and the "anon rss" of
> private pages copied from file or originally zero - would need to get
> the struct page and check PageAnon.  And might want to count swap
> entries too.  Hard to say without real uses in mind.
...
> It's a mixture of two different styles, the /proc/<pid>/maps
> many-hex-fields one-vma-per-line style and the /proc/meminfo
> one-decimal-kB-per-line style.  I think it would be better following
> the /proc/<pid>/maps style, but replacing the major,minor,ino fields
> by size and rss (anon_rss? swap?) fields (decimal kB? I suppose so).

The more info the better. See the pmap "-x" option, currently missing
some data that the kernel does not supply. There are numerous other 
pmap options that are completely unimplemented because of the lack of   
info. See the Solaris 10 man page for pmap, available on Sun's web site.

