page table isolation alternative mechanism

2018-01-03 Thread Albert Cahalan
We got into the current situation for performance reasons, avoiding the costly
reload of CR3 that a hardware task switch would cause. It seems we'll be
loading CR3 now anyway, so it might be time to reconsider hardware
task switches.

The recent patches leave kernel entry/exit code mapped. Hardware task switches
wouldn't need that. All they need is a single entry in a reduced-size IDT (for
the double fault), a minimal GDT, and a TSS. Taking the fault switches CR3. That
then gets you a proper IDT and GDT because those are virtually mapped.
Not a single byte of kernel code would need to be mapped while user code runs.


18-year-old bug

2016-01-06 Thread Albert Cahalan
This bug was introduced with SE Linux, 18 years ago. People have been
adding hacks to work around it as the bug bites them, but really the
bug ought to be fixed. Signals related to a tty are supposed to come
from the kernel. This got broken for pty devices. We now act as if
the signal is sent from the process on the master side of the pty.
That isn't right; the signal is supposed to come from the tty itself
and thus have a kernel identity.

How to reproduce:

Copy /bin/sleep to /tmp/work and /tmp/fail. Start up xterm, run
/tmp/work in the window, close the window, and see the process gone.
Now repeat that for /tmp/fail, but run "su -" in the window first.
Meanwhile, to view the problem, run this in another window:

ps -Cwork -Cfail -o tty,pid,ppid,tpgid,pgid,sid,ruid,euid,comm

(so like "/tmp/fail 100" or however much time you need)

I first saw the problem when I was maintaining top. People would
run top as root, close the window, and then find that top got stuck
spinning on select. Eventually top was hacked up to work around the
kernel bug, but really we shouldn't have userspace trying to work
around kernel bugs. I tried to fix it back then, but got a bit lost
in the then-new code. Sorry. Since then, I've become insanely busy
with ten kids. I'd really appreciate if somebody could take a shot
at fixing this bug. It seems to have hit a coworker a few months back,
and he is just living with it. (ouch)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree

2007-11-29 Thread Albert Cahalan
On Nov 29, 2007 4:40 PM, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
> "Albert Cahalan" <[EMAIL PROTECTED]> writes:
>
> > On Nov 28, 2007 6:31 AM, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
> >> Ingo Molnar <[EMAIL PROTECTED]> writes:
> >> > * Albert Cahalan <[EMAIL PROTECTED]> wrote:
> >> >> On Nov 27, 2007 7:49 PM, Guillaume Chazarain <[EMAIL PROTECTED]> wrote:

> Linux tasks when used in one particular way can fulfill the posix
> requirements for single threaded processes.
>
> Linux task groups when used in one particular way can fulfill the
> posix requirements for processes.

Right. Once you leave this, weirdness happens.
POSIX defines things in terms of processes and threads.
POSIX defines many of our interfaces. That includes
kernel behavior, the C library, and numerous programs.

> As for where /proc/self points given that procps seems to read
> files like /proc/self/stat.  It looks to me like we have a clear
> case of a user space application that cares about the current
> behavior and would break if we changed things.

I wasn't saying procps would break, though it would if
/proc/self/task went away. I'm more concerned about
multi-threaded things that look in their own /proc/self
directory. The procps programs are single-threaded.

In procps, the self link is used:

a. to see if the wchan file exists
b. to see if the task directory exists
c. to find the tty number

(that last one: there might not be a file descriptor
for the tty, and anyway I need it with the bits in all
the same places as what I get for the other processes)

I'll bet that something reads /proc/self/stat to see
CPU usage.
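
The three uses listed above could be sketched roughly as follows. This is
Python rather than the C that procps is written in, and the helper names
(tty_nr_from_stat, split_tty_nr, probe_self) are made up for illustration.
The position of tty_nr (field 7 of the stat file) comes from proc(5), and the
major/minor split assumes the usual kernel dev_t packing; neither is taken
from the procps source.

```python
import os


def tty_nr_from_stat(stat_text):
    """Return tty_nr (field 7) from the text of a /proc/<pid>/stat file."""
    # comm (field 2) is wrapped in parentheses and may itself contain
    # spaces or ')', so cut at the last ')' instead of splitting blindly.
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    # rest[0] is field 3 (state), so field 7 is rest[4].
    return int(rest[4])


def split_tty_nr(tty_nr):
    """Split tty_nr into (major, minor), assuming the kernel's dev_t packing."""
    major = (tty_nr >> 8) & 0xFFF
    minor = (tty_nr & 0xFF) | ((tty_nr >> 12) & 0xFFF00)
    return major, minor


def probe_self(proc_root="/proc"):
    """Do the three checks through the self link and return them as a dict."""
    self_dir = os.path.join(proc_root, "self")
    with open(os.path.join(self_dir, "stat")) as f:
        tty_nr = tty_nr_from_stat(f.read())
    return {
        "has_wchan": os.path.exists(os.path.join(self_dir, "wchan")),   # (a)
        "has_task_dir": os.path.isdir(os.path.join(self_dir, "task")),  # (b)
        "tty": split_tty_nr(tty_nr),                                    # (c)
    }
```

Because the tty number comes from the stat file, it has the same bit layout
as the value read for any other process, and no open file descriptor on the
tty is needed.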

> > Note that it was intended that non-legacy additions
> > would normally be added to either the process directory
> > or the thread directory, not both. I think somebody may
> > have ripped out the ability to do this; at the very least
> > there have been numerous illogical additions.
>
> The rationale was not conveyed and the policy you describe
> seems like deprecating the /proc/<tgid> directory in favor
> of the /proc/<tgid>/task/<pid>/.  Which was a pattern
> never established and it doesn't seem to make anything better
> so I don't see the point there.

For the stuff that is logically per-task, yes.
For the rest, no. Oh well...

It does make things better because redundant info
is a source of confusion.

> >> I'm still trying to understand which will break user space more,
> >> adding /proc/task or changing /proc/self.
> >
> > Changing /proc/self makes you get per-thread data
> > when you asked for per-process data. That's bad.
>
> /proc/self used to ask for per task data.  Which is why there
> is some confusion.

Heh. Well, /proc/self used to ask for per process data.
It was all the same. I think it matters that /proc/self was
always documented as being per-process.

> >> >> This one is probably best:
> >> >> /proc/task -> 123/task/456
> >> >> (with both numbers showing)
> >> >
> >> > this sounds good to me. If it's a symlink then there's not much other
> >> > choice because the thread PIDs do not even show up under /proc anymore.
> >>
> >> The name sounds good to me.
>
> I will see about writing the patch for this in a bit and sending
> it to Andrew.

Nice.

> Nope.  /proc/mounts was a symlink to /proc/self/mounts long before
> /proc/self was modified to stop pointing at the task directory and
> changed to point at the new task group directory.

Having the filesystem namespace be per-process is wild enough.
We really don't need it to be per-thread. (and yes, I'm using the
POSIX terms on purpose)

> Frankly from what I have seen of the code the task-group work
> seems to be a larger source of bugs, and complications, because
> people have a darn hard time wrapping their head around how it
> is supposed to behave, and all of the corner cases were not
> resolved at the time it was developed.

People look at me like I have two heads when I explain to
them that the Linux kernel source uses "pid" to mean
a thread. The bad terminology probably promotes bad thinking.
It would be lovely if that could somehow get fixed.

> My favorite ongoing issue is what is needed to allow a threaded
> init to actually function properly.  I think enough fixes have
> gone in that it might even work.

My "favorite" is the multi-threaded debugger. By this I
mean the debugger itself wants to be multi-threaded,
issuing ptrace commands from multiple threads.


Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree

2007-11-28 Thread Albert Cahalan
On Nov 28, 2007 5:46 AM, Ingo Molnar <[EMAIL PROTECTED]> wrote:
> * Albert Cahalan <[EMAIL PROTECTED]> wrote:
> > On Nov 27, 2007 7:49 PM, Guillaume Chazarain <[EMAIL PROTECTED]> wrote:
> > > [EMAIL PROTECTED] wrote:
> > >
> > > > We may be stuck with the current broken behavior for backwards
> > > > compatibility reasons but let's try fixing our ancient bug for the 2.6.25
> > > > time frame and see if anyone screams.
> >
> > It's not broken. It's just not the feature you're looking for.
>
> well it's quite broken at the moment and we are looking for solutions
> not for a blame game :-) You might have read the thread where i describe
> what i had to go through to do something fairly trivial.

In some ways that is NOT trivial, given that a high-level
language is free to use N:M threading.

If we assume that isn't allowed though, blaming the library
for not using native Linux thread IDs is entirely reasonable.
Linus picked sane ID numbering, not Solaris-style. Normal
app developers are unable to take advantage of Linus'
wise decision.

> > Changing /proc/self is somewhat risky, and probably
> > undesirable anyway. That file has always been used
> > to represent the process; at one time this also meant
> > the task. Documentation everywhere says "process".
>
> in Linux we never truly had a notion of "process" when your change was
> done - "process" always meant the task itself. That's why all the
> task_struct parameters/variables used to be named 'p', not 't'. So when
> NPTL came around this became a poorly defined notion.

We were sort of settling on "struct signal" as the process.

> > This one is probably best:
> > /proc/task -> 123/task/456
> > (with both numbers showing)
>
> this sounds good to me. If it's a symlink then there's not much other
> choice because the thread PIDs do not even show up under /proc anymore.
>
> > The problem with /proc/self/task/self is that it
> > makes /proc/789/task/self be ill-defined when
> > the observer is not tgid 789. If the directory can
> > only show up in the observer's own task directory,
> > then this solution is good.
>
> agreed.
>
> > I really don't want to see anything that would encourage
> > more use of the gdb backdoor. For those that don't
> > remember, gdb broke when access to threads via the
> > top-level /proc directory was temporarily removed.
> > We need that back door, unfortunately, but having it
> > show up in symlink targets is quite nasty.
> >
> > As for the history:
> >
> > I left it out. At the time it would have been fairly useless.
> > Back then, glibc didn't make things painful by pulling
> > phony thread IDs out of its ass. Shell scripts sure didn't
> > deal in threads. Monitoring tools like "ps" didn't need it.
> > If nothing needs it, well, why have it?
>
> sound, future-proof API design, with a little bit of foresight?

Yes, in a way. Adding stuff is usually easier than removing
stuff. I couldn't decide between /proc/self/task/self and /proc/task,
so I left the decision for later. I wasn't sure that I'd thought of
all the issues.

> I am
> faced with incidents on an almost daily basis that show how much we
> kernel folks suck at defining new APIs. The only luck is that the set of
> system calls is fairly complete already - but in the rare case where we
> touch an API it's a catastrophe most of the time. With such an API track
> record we'd probably never survive as a user-space project.

Most of user-space is worse.

What shocks me is that people keep designing ABIs with structs
that contain holes. (data leaks, waste, portability trouble, etc.)
This happens in kernel ABIs all the time. It ought to be blocked
by some sort of build tool. (with a whitelist for old stuff)
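
As a rough illustration of what such a check might look for, here is a small
Python sketch that uses ctypes to lay out a struct the way the platform C ABI
would and reports any padding. The function find_holes and both example
structs are hypothetical, it ignores bitfields, and it is not an existing
kernel build tool.

```python
import ctypes


def find_holes(struct_type):
    """Return (offset, length) for each padding gap in a ctypes.Structure."""
    holes = []
    end_of_previous = 0
    for entry in struct_type._fields_:
        field = getattr(struct_type, entry[0])  # descriptor with offset/size
        if field.offset > end_of_previous:
            holes.append((end_of_previous, field.offset - end_of_previous))
        end_of_previous = field.offset + field.size
    total = ctypes.sizeof(struct_type)
    if total > end_of_previous:
        # Tail padding also counts: it is copied along with the real data.
        holes.append((end_of_previous, total - end_of_previous))
    return holes


class Leaky(ctypes.Structure):
    # Hypothetical ABI struct: 3 bytes of padding follow 'flag'.
    _fields_ = [("flag", ctypes.c_uint8), ("value", ctypes.c_uint32)]


class Tight(ctypes.Structure):
    # Same data, reordered, with the spare bytes named explicitly.
    _fields_ = [
        ("value", ctypes.c_uint32),
        ("flag", ctypes.c_uint8),
        ("reserved", ctypes.c_uint8 * 3),
    ]
```

On common ABIs, where a 32-bit integer is 4-byte aligned, find_holes(Leaky)
reports one 3-byte gap at offset 1 and find_holes(Tight) reports none. A
whitelist for old structs would just be a set of type names to skip.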

Another shocker is /proc/*/smaps, which should make you cry.
At the time I was working too much overtime to post about it,
and I figured that nobody would allow that into the kernel anyway...

Speaking of which, that's one that has no need to be in the task
directories. I put a maps file there to make porting old code easier,
but neither one really belongs. It's per-mm, which was in a 1:n
relationship with struct signal last I checked.


Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree

2007-11-28 Thread Albert Cahalan
On Nov 28, 2007 6:31 AM, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
> Ingo Molnar <[EMAIL PROTECTED]> writes:
> > * Albert Cahalan <[EMAIL PROTECTED]> wrote:
> >> On Nov 27, 2007 7:49 PM, Guillaume Chazarain <[EMAIL PROTECTED]> wrote:

> In a lot of ways if you access /proc/self and you get back information
> that does not correspond to yourself the result is nonsense.  Which
> is a fairly mighty problem.

In general, this is not a problem we have.

/proc/self points to the process, not the task group leader.

They are different. Look at /proc/*/stat, where the
per-process info is summary data. The per-thread
stat file is not summary data. This is intended to be
true for all files in /proc; there may be some with bugs.
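
To see the difference, one can read the CPU time fields from both kinds of
stat file. The sketch below is Python with made-up helper names; the field
numbers (utime is field 14, stime is field 15) are taken from proc(5). It
only reads the numbers and makes no claim about exactly how the kernel
computes the per-process totals.

```python
import os


def cpu_ticks(stat_text):
    """Return (utime, stime) in clock ticks from the text of a stat file."""
    # Cut at the last ')' because comm (field 2) may contain spaces.
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    # rest[0] is field 3, so fields 14 and 15 are rest[11] and rest[12].
    return int(rest[11]), int(rest[12])


def read_cpu_ticks(path):
    with open(path) as f:
        return cpu_ticks(f.read())


def process_and_thread_ticks(pid="self", proc_root="/proc"):
    """Return the per-process ticks and a {tid: ticks} dict for each thread."""
    base = os.path.join(proc_root, str(pid))
    whole = read_cpu_ticks(os.path.join(base, "stat"))
    per_thread = {}
    for tid in os.listdir(os.path.join(base, "task")):
        try:
            per_thread[int(tid)] = read_cpu_ticks(
                os.path.join(base, "task", tid, "stat"))
        except FileNotFoundError:
            pass  # the thread exited between listdir() and open()
    return whole, per_thread
```

For a busy multi-threaded program, the per-process pair would be expected to
be at least as large as any single entry in the per-thread dict.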

Some of the data can not be summed up and will not
always be shared. This is legacy crud. Don't use it,
and don't try to "fix" it. It's there so that old programs
can continue to work as long as weird threading isn't
in use.

Note that it was intended that non-legacy additions
would normally be added to either the process directory
or the thread directory, not both. I think somebody may
have ripped out the ability to do this; at the very least
there have been numerous illogical additions.

> I'm still trying to understand which will break user space more,
> adding /proc/task or changing /proc/self.

Changing /proc/self makes you get per-thread data
when you asked for per-process data. That's bad.

> >> This one is probably best:
> >> /proc/task -> 123/task/456
> >> (with both numbers showing)
> >
> > this sounds good to me. If it's a symlink then there's not much other
> > choice because the thread PIDs do not even show up under /proc anymore.
>
> The name sounds good to me.
>
> I am not certain the two components make sense as we have a possible
> permission problem where it is remotely possible that a task will
> have permission to access /proc/<tid> but not /proc/<tgid>.

If it hurts, don't do that. We allow foot shooting.

> The reason I care is that we need to fix /proc/mounts.  So once we
> have /proc/task we can also change /proc/mounts to
> be a symlink to /proc/task/mounts.
>
> Once we get the /proc/mounts thing sorted out.  There are several
> other entries in /proc that need to follow in its wake
> as they also become per namespace.  /proc/net and /proc/sysvipc for
> starters.

As I predicted, the container bloat would be a never-ending
source of bugs. You're discovering bugs where there were none.
You'll never run out of this sort of problem. Keeping Linux lean
and simple would be far better.


Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree

2007-11-28 Thread Albert Cahalan
On Nov 27, 2007 7:49 PM, Guillaume Chazarain <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
>
> > We may be stuck with the current broken behavior for backwards
> > compatibility reasons but let's try fixing our ancient bug for the 2.6.25
> > time frame and see if anyone screams.

It's not broken. It's just not the feature you're looking for.

> I'm not screaming because of this change, but I screamed when I
> discovered I could not have a replacement for gettid() in Java, or any
> other high level environment.

Java is so high-level that it seems inappropriate to touch /proc.
It is allowed for Java to do N:M threading you know.

> So, instead of making /proc/self an unstable interface that changed in
> 2.6.0 and 2.6.25, I'll vote for /proc/self/task/self. A new interface
> that can trivially be detected for existence, and programs relying on
> this interface will loudly break on older kernels, unlike with the
> proposed interface change.
>
> Ccing Albert Cahalan as he made the change to /proc/self in the first
> place:

Changing /proc/self is somewhat risky, and probably
undesirable anyway. That file has always been used
to represent the process; at one time this also meant
the task. Documentation everywhere says "process".

This one is probably best:
/proc/task -> 123/task/456
(with both numbers showing)
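
For illustration only, the string such a link would hold can be built in
user space. This Python sketch assumes os.getpid() returns the TGID and
threading.get_native_id() (Python 3.8 or later) returns the kernel thread
ID, as they do on Linux. The function names are made up, and nothing here
reads an actual /proc/task entry.

```python
import os
import threading


def proposed_task_link_target():
    """Build the '<tgid>/task/<tid>' string for the calling thread."""
    tgid = os.getpid()               # POSIX process ID, i.e. the TGID
    tid = threading.get_native_id()  # kernel ID of the calling thread
    return f"{tgid}/task/{tid}"


def target_seen_from_new_thread():
    """Return the target as computed inside a freshly started thread."""
    result = []
    worker = threading.Thread(
        target=lambda: result.append(proposed_task_link_target()))
    worker.start()
    worker.join()
    return result[0]
```

Two threads of one process get the same first number and different second
numbers, which is the point of showing both.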

The problem with /proc/self/task/self is that it
makes /proc/789/task/self be ill-defined when
the observer is not tgid 789. If the directory can
only show up in the observer's own task directory,
then this solution is good.

I really don't want to see anything that would encourage
more use of the gdb backdoor. For those that don't
remember, gdb broke when access to threads via the
top-level /proc directory was temporarily removed.
We need that back door, unfortunately, but having it
show up in symlink targets is quite nasty.

As for the history:

I left it out. At the time it would have been fairly useless.
Back then, glibc didn't make things painful by pulling
phony thread IDs out of its ass. Shell scripts sure didn't
deal in threads. Monitoring tools like "ps" didn't need it.
If nothing needs it, well, why have it?

Regarding some of the discussion on LKML, I don't see
how unshare matters. If you unshare to the point where
you get a new TGID, then /proc/self must reflect that.


Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree

2007-11-28 Thread Albert Cahalan
On Nov 27, 2007 7:49 PM, Guillaume Chazarain [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:

  We may be stuck with the current broken behavior for backwards
  compatibility reasons but lets try fixing our ancient bug for the 2.6.25
  time frame and see if anyone screams.

It's not broken. It's just not the feature you're looking for.

 I'm not screaming because of this change, but I screamed when I
 discovered I could not have a replacement for gettid() in Java, or any
 other high level environment.

Java is so high-level that it seems inappropriate to touch /proc.
It is allowed for Java to do N:M threading you know.

 So, instead of making /proc/self an unstable interface that changed in
 2.6.0 and 2.6.25, I'll vote for /proc/self/task/self. A new interface
 that can trivially be detected for existence, and programs relying on
 this interface will loudly break on older kernels, unlike with the
 proposed interface change.

 Ccing Albert Cahalan as he made the change to /proc/self in the first
 place:

Changing /proc/self is somewhat risky, and probably
undesirable anyway. That file has always been used
to represent the process; at one time this also meant
the task. Documentation everywhere says process.

This one is probably best:
/proc/task - 123/task/456
(with both numbers showing)

The problem with /proc/self/task/self is that it
makes /proc/789/task/self be ill-defined when
the observer is not tgid 789. If the directory can
only show up in the observer's own task directory,
then this solution is good.

I really don't want to see anything that would encourage
more use of the gdb backdoor. For those that don't
remember, gdb broke when access to threads via the
top-level /proc directory was temporarily removed.
We need that back door, unfortunately, but having it
show up in symlink targets is quite nasty.

As for the history:

I left it out. At the time it would have been fairly useless.
Back then, glibc didn't make things painful by pulling
phony thread IDs out of its ass. Shell scripts sure didn't
deal in threads. Monitoring tools like ps didn't need it.
If nothing needs it, well, why have it?

Regarding some of the discusison on LKML, I don't see
how unshare matters. If you unshare to the point where
you get a new TGID, then /proc/self must reflect that.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree

2007-11-28 Thread Albert Cahalan
On Nov 28, 2007 6:31 AM, Eric W. Biederman [EMAIL PROTECTED] wrote:
 Ingo Molnar [EMAIL PROTECTED] writes:
  * Albert Cahalan [EMAIL PROTECTED] wrote:
  On Nov 27, 2007 7:49 PM, Guillaume Chazarain [EMAIL PROTECTED] wrote:

 In a lot of ways if you access /proc/self and you get back information
 that does not correspond to yourself the result is nonsense.  Which
 is a fairly mighty problem.

In general, this is not a problem we have.

/proc/self points to the process, not the task group leader.

They are different. Look at /proc/*/stat, where the
per-process info is summary data. The per-thread
stat file is not summary data. This is intended to be
true for all files in /proc; there may be some with bugs.

Some of the data can not be summed up and will not
always be shared. This is legacy crud. Don't use it,
and don't try to fix it. It's there so that old programs
can continute to work as long as weird threading isn't
in use.

Note that it was intended that non-legacy additions
would normally be added to either the process directory
or the thread directory, not both. I think somebody may
have ripped out the ability to do this; at the very least
there have been numerous illogical additions.

 I'm still trying to understand which will break user space more,
 adding /proc/task or changing /proc/self.

Changing /proc/self makes you get per-thread data
when you asked for per-process data. That's bad.

  This one is probably best:
  /proc/task - 123/task/456
  (with both numbers showing)
 
  this sounds good to me. If it's a symlink then there's not much other
  choice because the thread PIDs do not even show up under /proc anymore.

 The name sounds good to me.

 I am not certain the two components make sense as we have a possible
 permission problem where it is remotely possible that a task will
 have permission to access /proc/tid but not /proc/tgid.

If it hurts, don't do that. We allow foot shooting.

 The reason I care is that we need to fix /proc/mounts.  So once we
 have /proc/task we can also have change /proc/mounts to
 be a symlink to /proc/task/mounts.

 Once we get the /proc/mounts thing sorted out.  There are several
 other entries in /proc that need to that need to follow in it's wake
 as they also become per namespace.  /proc/net and /proc/sysvipc for
 starters.

As I predicted, the container bloat would be a never-ending
source of bugs. You're discovering bugs where there were none.
You'll never run out of this sort of problem. Keeping Linux lean
and simple would be far better.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: + proc-fix-the-threaded-proc-self.patch added to -mm tree

2007-11-28 Thread Albert Cahalan
On Nov 28, 2007 5:46 AM, Ingo Molnar [EMAIL PROTECTED] wrote:
 * Albert Cahalan [EMAIL PROTECTED] wrote:
  On Nov 27, 2007 7:49 PM, Guillaume Chazarain [EMAIL PROTECTED] wrote:
   [EMAIL PROTECTED] wrote:
  
We may be stuck with the current broken behavior for backwards
compatibility reasons but lets try fixing our ancient bug for the 2.6.25
time frame and see if anyone screams.
 
  It's not broken. It's just not the feature you're looking for.

 well it's quite broken at the moment and we are looking for solutions
 not for a blame game :-) You might have read the thread where i describe
 what i had to go through to do something fairly trivial.

In some ways that is NOT trivial, given that a high-level
language is free to use N:M threading.

If we assume that isn't allowed though, blaming the library
for not using native Linux thread IDs is entirely reasonable.
Linus picked sane ID numbering, not Solaris-style. Normal
app developers are unable to take advantage of Linus'
wise decision.

  Changing /proc/self is somewhat risky, and probably
  undesirable anyway. That file has always been used
  to represent the process; at one time this also meant
  the task. Documentation everywhere says process.

 in Linux we never truly had a notion of process when your change was
 done - process always meant the task itself. That's why all the
 task_struct parameters/variables used to be named 'p', not 't'. So when
 NTPL came around this became a poorly defined notion.

We were sort of settling on struct signal as the process.

  This one is probably best:
  /proc/task - 123/task/456
  (with both numbers showing)

 this sounds good to me. If it's a symlink then there's not much other
 choice because the thread PIDs do not even show up under /proc anymore.

  The problem with /proc/self/task/self is that it
  makes /proc/789/task/self be ill-defined when
  the observer is not tgid 789. If the directory can
  only show up in the observer's own task directory,
  then this solution is good.

 agreed.

  I really don't want to see anything that would encourage
  more use of the gdb backdoor. For those that don't
  remember, gdb broke when access to threads via the
  top-level /proc directory was temporarily removed.
  We need that back door, unfortunately, but having it
  show up in symlink targets is quite nasty.
 
  As for the history:
 
  I left it out. At the time it would have been fairly useless.
  Back then, glibc didn't make things painful by pulling
  phony thread IDs out of its ass. Shell scripts sure didn't
  deal in threads. Monitoring tools like ps didn't need it.
  If nothing needs it, well, why have it?

 sound, future-proof API design, with a little bit of foresight?

Yes, in a way. Adding stuff is usually easier than removing
stuff. I couldn't decide between /proc/self/task/self and /proc/task,
so I left the decision for later. I wasn't sure that I'd thought of
all the issues.

 I am
 faced with incidents on an almost daily basis that show how much we
 kernel folks suck at defining new APIs. The only luck is that the set of
 system calls is fairly complete already - but in the rare case where we
 touch an API it's a catastrophe most of the time. With such an API track
 record we'd probably never survive as a user-space project.

Most of user-space is worse.

What shocks me is that people keep designing ABIs with structs
that contain holes. (data leaks, waste, portability trouble, etc.)
This happens in kernel ABIs all the time. It ought to be blocked
by some sort of build tool. (with a whitelist for old stuff)
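
To illustrate the kind of check a build tool could run, here is a minimal Python sketch that uses ctypes to report padding holes in a struct layout. The find_holes helper and both example structs are hypothetical and not taken from any kernel header; a real tool would more likely read the compiler's debug info.

```python
import ctypes


def find_holes(struct_type):
    """Return (offset, length) byte ranges of a ctypes struct not covered by any field."""
    holes = []
    cursor = 0  # first byte not yet accounted for
    for name, _ctype in struct_type._fields_:
        field = getattr(struct_type, name)  # descriptor carrying .offset and .size
        if field.offset > cursor:
            holes.append((cursor, field.offset - cursor))
        cursor = max(cursor, field.offset + field.size)
    total = ctypes.sizeof(struct_type)
    if total > cursor:
        holes.append((cursor, total - cursor))  # tail padding
    return holes


class Leaky(ctypes.Structure):
    # Hypothetical ABI struct: a 1-byte field followed by a 4-byte field.
    _fields_ = [("flag", ctypes.c_uint8), ("value", ctypes.c_uint32)]


class Explicit(ctypes.Structure):
    # Same data, reordered, with the padding spelled out as a named field.
    _fields_ = [
        ("value", ctypes.c_uint32),
        ("flag", ctypes.c_uint8),
        ("reserved", ctypes.c_uint8 * 3),
    ]
```

On the usual ABIs, where a 32-bit integer is 4-byte aligned, find_holes(Leaky) reports three hidden bytes after the flag, while find_holes(Explicit) reports none. A whitelist for old structs would simply be a set of type names to skip.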

Another shocker is /proc/*/smaps, which should make you cry.
At the time I was working too much overtime to post about it,
and I figured that nobody would allow that into the kernel anyway...

Speaking of which, that's one that has no need to be in the task
directories. I put a maps file there to make porting old code easier,
but neither one really belongs. It's per-mm, which was in a 1:n
relationship with struct signal last I checked.
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] remove PAGE_SIZE from headers_install

2007-07-14 Thread Albert Cahalan

On 7/14/07, David Miller <[EMAIL PROTECTED]> wrote:

From: "Albert Cahalan" <[EMAIL PROTECTED]>
Date: Sat, 14 Jul 2007 22:48:57 -0400

> A real constant-value PAGE_SIZE is useful and doable.

It's bogus to use it.  The kernel can get recompiled
to arbitrary page sizes on some architectures, so a constant
page size assumption cannot work.


Sure it can work. The ABI specifies limits on such things.
Probably the most appropriate size is the one specified
for alignment of ELF sections.

If I remember right, it's 64 K for the PowerPC ABI. This allows
for 64 K pages, even though many chips offer 4 K pages.


Re: [PATCH] remove PAGE_SIZE from headers_install

2007-07-14 Thread Albert Cahalan

Olaf Hering writes:

On Sat, Jul 14, H. Peter Anvin wrote:

Olaf Hering wrote:



Declare PAGE_SIZE as getpagesize() for userspace.
PAGE_SIZE is used in resource.h and shm.h


I would think it would be better to not define it at all.
Several architectures already don't have PAGE_SIZE visible
to userspace in any way.


i386 has it, so everyone uses it.


Since i386 was the first architecture and is still probably the
most common architecture (x86_64 being 30% AFAIK), i386 sets the
standard for the Linux API. Several architectures are broken and
thus suffering from incompatibility.

A real constant-value PAGE_SIZE is useful and doable.

It's useful because a getpagesize() call can't be used for numerous
things, such as setting the size of an array.

It's doable, even on architectures that support multiple page
sizes, because ABIs specify alignment requirements. There are
two alignments of interest here:

a. the smallest that mmap() will ever naturally return on any
  correct implementation of the architecture's ABI ("naturally"
  meaning that MAP_FIXED was not used)

b. the smallest that mprotect() will tolerate on all
  correct implementations of the architecture

Pick either to be the Linux definition of PAGE_SIZE.

For example, if an architecture is specified to have a page size
of at least 4 K but no more than 64 K, then mprotect() will only
tolerate 64 K on all correct implementations of the architecture.
The ABI might allow mmap() to naturally return 4 K aligned data,
but might instead require 64 K alignment. Assuming 4 K, then the
mmap() value doesn't match the mprotect() value. Either one will
do as the value of PAGE_SIZE, as long as this is standardized in
the way that breaks the least code.
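
The 4 K / 64 K example can be worked through as arithmetic. This is a minimal Python sketch under the assumption that the ABI permits page sizes from 4 K to 64 K; the constant and function names are made up for illustration.

```python
import mmap

ABI_MIN_PAGE = 4 * 1024    # assumed smallest page size the ABI permits
ABI_MAX_PAGE = 64 * 1024   # assumed largest page size the ABI permits


def round_down(value, alignment):
    """Round value down to a multiple of alignment (a power of two)."""
    return value & ~(alignment - 1)


def round_up(value, alignment):
    """Round value up to a multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def portable_protect_range(start, length, alignment=ABI_MAX_PAGE):
    """Widen [start, start + length) to boundaries that are page-aligned
    for every page size up to alignment."""
    begin = round_down(start, alignment)
    end = round_up(start + length, alignment)
    return begin, end - begin


def runtime_page_fits(alignment=ABI_MAX_PAGE):
    """True if the page size of the running system divides the chosen constant."""
    return alignment % mmap.PAGESIZE == 0
```

With the 64 K constant, a 0x100-byte range starting at 0x12345 widens to the full 64 K block at 0x10000, which is aligned whether the kernel was built for 4 K or 64 K pages. With the 4 K constant the same range widens only to 0x12000..0x13000, which is not aligned on a 64 K-page kernel.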


Re: partially mounted cifs filesystem

2007-07-08 Thread Albert Cahalan

On 7/7/07, Satyam Sharma <[EMAIL PROTECTED]> wrote:

On 7/7/07, Albert Cahalan <[EMAIL PROTECTED]> wrote:



I had one share mounted, from XP to Linux, and wanted another.
At first I had an incorrect setting on the XP box, almost
certainly related to permissions. The mount failed of course.
Running "mount" showed that the filesystem was not mounted,
but apparently it didn't remain fully unmounted either.
There was also nothing under the mount point, and the "ls -l"
data (directory size and link count) looked like ext3.


That means nothing was mounted there ...


I changed settings on the XP box numerous times. After many
frustrating attempts, I ran "umount" on the mount point and
then successfully mounted the filesystem.


... but still umount succeeded? Didn't it complain about nothing
being mounted there in the first place? Surprising that it actually
resolved the problem ...


It complained, and it resolved the problem.


I'll guess that the kernel returned an error for my early
attempts at mounting, but left open a CIFS connection.

I suppose the cifs error handling is buggy.


Yes, that could be the case. Could you please:

1. Tell us which kernel version was it? .config?
2. Was there some dmesg output from the failed mount(2) attempt?
3. What was the mount command line / options?


Server: Windows XP service pack 2, recently updated
Client: Fedora kernel 2.6.20-1.3094.fc7, mount.cifs version 1.10

My xterm still had the commands in the scrollback buffer.
I added a few, grepping dmesg and /etc/fstab, and chopped
out the unrelated stuff. Note that the number in my command
prompt is the exit code of the previous command; these are
all correct despite editing out the unrelated commands.

There are some interesting error messages, plus a lock order
warning that mentions cifs. Note that I have numerous cifs
shares mounted, so not every log message relates to this one.


Then:

1. Rebuild kernel with CIFS_DEBUG2.
2. Revert back (on the XP share export side) to the buggy / incorrect
settings -- so that you can try and reproduce the problem.
3. Let us know if you could reproduce, if so, any debug output / etc?


I probably spent a week messing with Windows settings. I switched
back and forth between simple file sharing and not, adjusted many
registry settings related to anonymous/guest treatment, redid the
ACLs more times than I care to think about... There really isn't
any hope I could get back to the original settings. My best guess
would be something related to an ACL for guest, everybody, SYSTEM,
or anonymous, or something related to the checkboxes for client
permissions in the file sharing dialog. At one time I had a deny ACL.

Here you go. The fstab lines will be word wrapped in this email,
but are not word wrapped in the file.

--
proc 0 # mount /mnt/vm/sc
Password:
mount error 11 = Resource temporarily unavailable
Refer to the mount.cifs(8) manual page (e.g.man mount.cifs)
proc 255 # smbclient -L //192.168.1.141
Password:
Domain=[ALBERTXP] OS=[Windows 5.1] Server=[Windows 2000 LAN Manager]

   Sharename       Type      Comment
   ---------       ----      -------
   IPC$            IPC       Remote IPC
   sourcecode      Disk
   ADMIN$          Disk      Remote Admin
   C$              Disk      Default share
   homedir         Disk
session request to 192.168.1.141 failed (Called name not present)
session request to 192 failed (Called name not present)
Domain=[ALBERTXP] OS=[Windows 5.1] Server=[Windows 2000 LAN Manager]

   Server               Comment
   ---------            -------

   Workgroup            Master
   ---------            -------
proc 0 # smbclient  //192.168.1.141/sourcecode
Password:
Domain=[ALBERTXP] OS=[Windows 5.1] Server=[Windows 2000 LAN Manager]
smb: \> ls
  .             D        0  Wed Dec  6 18:12:30 2006
  ..            D        0  Wed Dec  6 18:12:30 2006
  development   D        0  Mon Jul  2 15:10:15 2007
  legacy        D        0  Wed Dec  6 22:29:42 2006
  libraries     D        0  Mon Jul  2 16:03:25 2007
  mmm           D        0  Mon Jul  2 16:53:27 2007
  re            D        0  Mon Jul  2 17:39:34 2007
  s             D        0  Mon Jul  2 17:46:23 2007
  thirdparty    D        0  Mon Jul  2 18:05:05 2007

   40931 blocks of size 524288. 18955 blocks available
smb: \> q
proc 0 # mount /mnt/vm/sc
Password:
mount error 11 = Resource temporarily unavailable
Refer to the mount.cifs(8) manual page (e.g.man mount.cifs)
proc 255 # ls -l /mnt/vm/sc
total 0
proc 0 # ls -l /mnt/vm
total 2
drwxr-xr-x 1 root root    0 2007-07-03 17:43 homedir
drwxr-xr-x 2 root root 1024 2007-07-03 13:30 sc
proc 0 # ls -al /mnt/vm/sc
total 4
drwxr-xr-x 2 root root 1024 2007-07-03 13:30 .
drwxr-xr-x 4 root root 1024 2007-07-03 13:30 ..
proc 0

partially mounted cifs filesystem

2007-07-06 Thread Albert Cahalan

I had one share mounted, from XP to Linux, and wanted another.
At first I had an incorrect setting on the XP box, almost
certainly related to permissions. The mount failed of course.
Running "mount" showed that the filesystem was not mounted,
but apparently it didn't remain fully unmounted either.
There was also nothing under the mount point, and the "ls -l"
data (directory size and link count) looked like ext3.

I changed settings on the XP box numerous times. After many
frustrating attempts, I ran "umount" on the mount point and
then successfully mounted the filesystem.

I'll guess that the kernel returned an error for my early
attempts at mounting, but left open a CIFS connection.

I suppose the cifs error handling is buggy.
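
The check described above, asking whether the kernel lists anything at the mount point, can be sketched in Python. This is an illustrative sketch only: the helper names are made up, and it reports what the mount table says, which by this report may not reveal a half-open CIFS connection.

```python
import os


def parse_mounts(text):
    """Parse /proc/mounts-style text into (source, mount_point, fstype) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        source, mount_point, fstype = fields[:3]
        # The kernel writes a space inside a path as the octal escape \040.
        entries.append((source, mount_point.replace("\\040", " "), fstype))
    return entries


def filesystems_at(mount_point, mounts_text):
    """Return the filesystem types listed at exactly this mount point."""
    return [
        fstype
        for _source, where, fstype in parse_mounts(mounts_text)
        if where == mount_point
    ]


def looks_mounted(path):
    """Cross-check without the mount table: os.path.ismount() compares the
    directory's device with that of its parent."""
    return os.path.ismount(path)
```

On a live system one would pass the contents of /proc/mounts as mounts_text. An empty result for /mnt/vm/sc together with an ext3-looking directory matches the "nothing was mounted there" reading of the symptoms.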


Re: JIT emulator needs

2007-06-22 Thread Albert Cahalan

On 6/22/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:


> > > > and these methods also destroy yourself on any machine with a looser
> > > > cache coherency between I and D-cache
> > > >
> > > > for all but x86 you pretty much have to do the mprotect() between the
> > > > two states to deal with the cache flushing properly...
> > >
> > > If the instructions to force data write-back and/or to
> > > invalidate the instruction cache are privileged, yes.
> > > AFAIK, only ARM is that lame.
> >
> > and your program executes this on all the cpus in the system?

no I meant that you had to call your userspace instruction on all cpus,
so on all-but-arm (from the Intel side I know IA64 needs such a flush,
but I'm pretty sure PPC does too)


I understood.

AFAIK, it is common to propagate this via a special
bus cycle. Section 5.1.5.2.1 of the PowerPC manual
states that this is so. Section 5.1.5.2 lists the requirements
for both uniprocessor and multiprocessor. Note that
Linux uses the coherent memory model for PowerPC SMP.
See also the "icbi" instruction description, where the use
of an address-only broadcast is mentioned.


> I don't recall seeing such code in the libgcc trampoline
> setup for PowerPC. Either it's not required, or this is
> a rather popular bug.

I suspect it'll be playing under the assumption that going from "no
code" to "code" is fine since the icache is cold.


A previous trampoline would ruin that.

Fortunately, PowerPC is not as brain-dead as ARM and IA64.
(not that I'm writing code for any of these)


Re: [TOMOYO 5/9] Memory and pathname management functions.

2007-06-22 Thread Albert Cahalan

On 6/21/07, Pavel Machek <[EMAIL PROTECTED]> wrote:


> >> It's really not worth getting bothered by. Truth is, big
> >> giant
> >> pathnames break lots of stuff already, both kernel and
> >> userspace.
> >
> >> Just look in /proc for some nice juicy kernel breakage:
> >> cwd, exe, fd/*, maps, mounts, mountstats, root, smaps
> >
> >Well, but we should be fixing that, not adding more. And /proc is
> >info-only, while this is security related code.
>
> Security tools read from /proc, so /proc is security-related.

If some tool relies on pathnames in /proc, that tool is broken... as
is /proc. We should be fixing that.


Running TOMOYO or AppArmor fixes the bug. :-)
You can't get long paths that break /proc if you are
running either. Therefore, one of those is required.


Re: JIT emulator needs

2007-06-22 Thread Albert Cahalan

On 6/22/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:

On Fri, 2007-06-22 at 01:56 -0400, Albert Cahalan wrote:
> On 6/21/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:
> > On Fri, 2007-06-08 at 02:35 -0400, Albert Cahalan wrote:
> > > Right now, Linux isn't all that friendly to JIT emulators.
> > > Here are the problems and suggestions to improve the situation.
> > >
> > > There is an SE Linux execmem restriction that enforces W^X.
> > > Assuming you don't wish to just disable SE Linux, there are
> > > two ugly ways around the problem. You can mmap a file twice,
> > > or you can abuse SysV shared memory. The mmap method requires
> > > that you know of a filesystem mounted rw,exec where you can
> > > write a very large temporary file. This arbitrary filesystem,
> > > rather than swap space, will be the backing store. The SysV
> > > shared memory method requires an undocumented flag and is
> > > subject to some annoying size limits. Both methods create
> > > objects that will fail to be deleted if the program dies
> > > before marking the objects for deletion.
> >
> > and these methods also destroy yourself on any machine with a looser
> > cache coherency between I and D-cache
> >
> > for all but x86 you pretty much have to do the mprotect() between the
> > two states to deal with the cache flushing properly...
>
> If the instructions to force data write-back and/or to
> invalidate the instruction cache are privileged, yes.
> AFAIK, only ARM is that lame.

and your program executes this on all the cpus in the system?


I'll remember that if I ever run a JIT on the SMP ARM box.
(there's like one, at the manufacturer, right?)

I don't recall seeing such code in the libgcc trampoline
setup for PowerPC. Either it's not required, or this is
a rather popular bug.

Perhaps ARM needs syscalls for this, or emulation for
the privileged instructions. This may already exist; it
sure is required. So this would be another need for
properly supporting JIT emulators.


Re: JIT emulator needs

2007-06-21 Thread Albert Cahalan

On 6/21/07, Arjan van de Ven <[EMAIL PROTECTED]> wrote:

On Fri, 2007-06-08 at 02:35 -0400, Albert Cahalan wrote:
> Right now, Linux isn't all that friendly to JIT emulators.
> Here are the problems and suggestions to improve the situation.
>
> There is an SE Linux execmem restriction that enforces W^X.
> Assuming you don't wish to just disable SE Linux, there are
> two ugly ways around the problem. You can mmap a file twice,
> or you can abuse SysV shared memory. The mmap method requires
> that you know of a filesystem mounted rw,exec where you can
> write a very large temporary file. This arbitrary filesystem,
> rather than swap space, will be the backing store. The SysV
> shared memory method requires an undocumented flag and is
> subject to some annoying size limits. Both methods create
> objects that will fail to be deleted if the program dies
> before marking the objects for deletion.

and these methods also destroy yourself on any machine with a looser
cache coherency between I and D-cache

for all but x86 you pretty much have to do the mprotect() between the
two states to deal with the cache flushing properly...


If the instructions to force data write-back and/or to
invalidate the instruction cache are privileged, yes.
AFAIK, only ARM is that lame.

For example, PowerPC lets unprivileged code run
the required instructions.
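
The "mmap a file twice" workaround quoted above can be sketched in Python. This is a minimal illustration, not the emulator's code: dual_map and its parameters are made-up names, and the second view is mapped read-only by default, since asking for PROT_EXEC only works when the backing filesystem is mounted exec. The sketch does no cache maintenance, so as discussed in this thread it is only safe as written where the instruction and data caches are coherent.

```python
import mmap
import os
import tempfile


def dual_map(size, want_exec=False, directory=None):
    """Map one temporary file twice: a writable view for the code generator
    and a separate non-writable view for running or reading the result."""
    # On POSIX, TemporaryFile() has no directory entry once created, so
    # nothing is left behind if the program dies.
    with tempfile.TemporaryFile(dir=directory) as backing:
        os.ftruncate(backing.fileno(), size)
        writer = mmap.mmap(
            backing.fileno(), size,
            flags=mmap.MAP_SHARED,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
        view_prot = mmap.PROT_READ | (mmap.PROT_EXEC if want_exec else 0)
        view = mmap.mmap(
            backing.fileno(), size,
            flags=mmap.MAP_SHARED,
            prot=view_prot,
        )
    # Both mappings stay valid after the file object is closed.
    return writer, view
```

Bytes stored through the writer are visible through the other view because both are MAP_SHARED mappings of the same file. The backing store is whatever filesystem the temporary directory lives on rather than swap, which is the drawback the original post points out.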


Re: JIT emulator needs

2007-06-21 Thread Albert Cahalan

On 6/20/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:

Albert Cahalan wrote:



> Look, let's back up a bit here. At a high level, what exactly do
> you imagine that this behavior was intended for? I suggest you
> list some examples of the attacks that are blocked.
>
> Can you come up with a reasonable argument that the current behavior
> is the least painful restriction required to block those attacks?
> Does the current behavior block any attack that the proposed behavior
> would not? (list the attacks please)

See above.


Nope. I asked you to justify the existing behavior. Apparently you
are unable to do so. This should be a hint.


Re: JIT emulator needs

2007-06-20 Thread Albert Cahalan

On 6/20/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:

Albert Cahalan wrote:
> Putting this into the security policy was an error born of
> laziness to begin with. Abuse of the security mechanism
> was easier than hacking the toolchain, ELF loader, etc.
>
> Either a binary needs self-modification, or it doesn't. This is
> determined by the author of the code. If you don't trust an
> executable that needs this ability, then you simply can not
> run it in a useful way.

That's fine.  That's a policy decision.  That's what a security policy
*is*.  The owner of the system has decided, by security policy, that
that is not allowed.  Bypassing that is not acceptable.


Fixing a bug should be acceptable.

Look, let's back up a bit here. At a high level, what exactly do
you imagine that this behavior was intended for? I suggest you
list some examples of the attacks that are blocked.

Can you come up with a reasonable argument that the current behavior
is the least painful restriction required to block those attacks?
Does the current behavior block any attack that the proposed behavior
would not? (list the attacks please)


Re: JIT emulator needs

2007-06-20 Thread Albert Cahalan

On 6/20/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

If the policy forbidding self-modifying code lacks a method of
exempting programs such as JIT interpreters (which I doubt) then
it's a problem. I'm with Alan on this one.


On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:

It does and it doesn't. There is not a reasonable way for a
user to mark an app as needing full self-modifying ability.
It's not like the executable stack, which can be set via the
ELF note markings on the executable. (ELF note markings are
ideal because they can not be used via a ret-to-libc attack)
With admin privs, one can change SE Linux settings. Mark the
executable, disable the protection system-wide, generate a
completely new SE Linux policy, or just turn SE Linux off.
Normally we don't expect/require admin privs to install an
executable in one's own ~/bin directory. This is broken.
It ought to be easier to get a JIT working well without
enabling arbitrary mprotect. This would allow a JIT to
partially benefit from the recent security enhancements.
(think of all the buggy browser-based JIT things!)


I presumed an ELF note or extended filesystem attributes were already
in place for this sort of affair. It may be that the model implemented
is so restrictive that users are forbidden to create new executables,
in which case using a different model is certainly in order. Otherwise
the ELF note or attributes need to be implemented.


Users can create executables. Some will be non-functional
unless specially marked by an admin.

What is the goal here? I see no reasonable goal that would
result in such a policy.


On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

This sort of logic might be appropriate for a sort of parametrized
and specialized vma allocator setting the policy in /proc/ along
with various sorts of limits. There are limits to such and at some
point things will have to manually manage their own process address
spaces in a platform-specific fashion. If kernel assistance here is
rejected they may have to do so in all cases.


On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:

I prefer ELF notes (for start-up allocations) and prctl,
plus a mmap flag for per-allocation behavior.


Beware that the kernel (upstream of me) will likely refuse to support
exotic mmap() placement policies. At that point userspace will have
to implement them itself with a front-end to mmap().

Userspace can actually live without kernel placement support for
everything but the executable itself, which is already implemented via
ELF loading standards. This is not to downplay the tremendous amounts
of pain involved for moving the stack, getting ld.so to land in the
right place, and so on. Actually I'm less sure about .interp placement.
In any event, exotic virtualspace allocation policies are largely yet
another "simple matter of programming" implementable entirely in
userspace.


When you go that route, you may need to abandon libc. I've done exactly
that for one emulator. It was not easy. Nearly nobody will want to go
down that path.

Things improve a bit if MAP_ANONYMOUS and SysV shared mem allocations
can be made to ignore the available memory checking. If I could allocate
a 2 GB chunk on a system with 1 GB total swap+RAM, then I could use
that as an area in which to perform MAP_FIXED allocations. As of now
this would require either adding the swap space or disabling the
available memory checking system-wide via sysctl.
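
The reserve-then-carve pattern described above might look roughly like
this. It is only a sketch: the function names are invented here, and
whether the kernel's memory accounting lets a large reservation through
depends on the overcommit settings, which is exactly the problem being
raised.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve `len` bytes of address space without
 * asking for usable memory. */
static void *reserve_region(size_t len)
{
    assert(len > 0);
    void *p = mmap(NULL, len, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: turn part of the reservation into real
 * read/write memory. `offset` must be a multiple of the page size. */
static void *carve(void *base, size_t offset, size_t len)
{
    void *p = mmap((char *)base + offset, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

An emulator would call reserve_region() once at start-up and then
carve() pieces out of the region as the guest asks for memory.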


On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

This is a bad idea. The standard semantics are needed for programs
relying upon them.


On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:

I didn't mean that the default default :-) setting would change.
I meant that people could change the behavior from a boot script.
Things that break are really foul and nasty anyway, probably with
serious problems that ought to get fixed.


It's actually not a good idea to make it the default even via sysctl.
People won't realize something will break until it does, and what will
break is likely to be a database responsible for data integrity. The
IPC_RMID creation flag should suffice.


It's highly unlikely that such breakage would cause corruption.
Most likely it would cause the database to exit with an error
about failing to attach to a SysV shared memory segment.

I believe that a major cause of reboots is that admins are
unaware of SysV shared memory cruft left behind by apps that
crashed at the wrong moment or had other bugs. If something
is eating memory and you don't know what it is, you reboot.
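
For reference, the usual way for an application to avoid leaving such
cruft behind is to mark the segment for removal as soon as it is
attached. A minimal sketch, with an invented function name:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Hypothetical helper: create a private SysV segment, attach it, and
 * mark it for removal right away so that it is freed once the last
 * attachment goes away. */
static void *shm_alloc_autoremove(size_t len)
{
    assert(len > 0);
    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0)
        return NULL;
    /* A crash between shmget() and shmctl() still leaks the segment. */
    void *p = shmat(id, NULL, 0);
    shmctl(id, IPC_RMID, NULL);
    return p == (void *)-1 ? NULL : p;
}
```

The window noted in the comment is the part that an application cannot
close on its own.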


On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

This is MADV_REMOVE, though most filesystems don't support it. Do you
need it for more than tmpfs?


On Tue, Jun 19, 2007 at 11:16:29PM -0400, Albert Cahalan wrote:

Yes and no. It's painful to be restricted to one backing store.
Covering MAP_ANONYMOUS

Re: JIT emulator needs

2007-06-20 Thread Albert Cahalan

On 6/20/07, H. Peter Anvin <[EMAIL PROTECTED]> wrote:

William Lee Irwin III wrote:



> I presumed an ELF note or extended filesystem attributes were already
> in place for this sort of affair. It may be that the model implemented
> is so restrictive that users are forbidden to create new executables,
> in which case using a different model is certainly in order. Otherwise
> the ELF note or attributes need to be implemented.

Another thing to keep in mind, since we're talking about security
policies in the first place, is that anything like this *MUST* be
"opt-in" on the part of the security policy, because what we're talking
about is circumventing an explicit security policy just based on a
user-provided binary saying, in effect, "don't worry, I know what I'm
doing."

Changing the meaning of an established explicit security policy is not
acceptable.


Not in this case. If an attacker can CHANGE THE BINARY then
it's already game over.

Putting this into the security policy was an error born of
laziness to begin with. Abuse of the security mechanism
was easier than hacking the toolchain, ELF loader, etc.

Either a binary needs self-modification, or it doesn't. This is
determined by the author of the code. If you don't trust an
executable that needs this ability, then you simply can not
run it in a useful way.


Re: JIT emulator needs

2007-06-19 Thread Albert Cahalan

On 6/19/07, William Lee Irwin III <[EMAIL PROTECTED]> wrote:

On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:



Right now, Linux isn't all that friendly to JIT emulators.
Here are the problems and suggestions to improve the situation.
There is an SE Linux execmem restriction that enforces W^X.
Assuming you don't wish to just disable SE Linux, there are
two ugly ways around the problem. You can mmap a file twice,
or you can abuse SysV shared memory. The mmap method requires
that you know of a filesystem mounted rw,exec where you can
write a very large temporary file. This arbitrary filesystem,
rather than swap space, will be the backing store. The SysV
shared memory method requires an undocumented flag and is
subject to some annoying size limits. Both methods create
objects that will fail to be deleted if the program dies
before marking the objects for deletion.


If the policy forbidding self-modifying code lacks a method of
exempting programs such as JIT interpreters (which I doubt) then
it's a problem. I'm with Alan on this one.


It does and it doesn't. There is not a reasonable way for a
user to mark an app as needing full self-modifying ability.
It's not like the executable stack, which can be set via the
ELF note markings on the executable. (ELF note markings are
ideal because they can not be used via a ret-to-libc attack)

With admin privs, one can change SE Linux settings. Mark the
executable, disable the protection system-wide, generate a
completely new SE Linux policy, or just turn SE Linux off.

Normally we don't expect/require admin privs to install an
executable in one's own ~/bin directory. This is broken.

It ought to be easier to get a JIT working well without
enabling arbitrary mprotect. This would allow a JIT to
partially benefit from the recent security enhancements.
(think of all the buggy browser-based JIT things!)


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

Processors often have annoying limits on the immediate values
in instructions. An x86 or x86_64 JIT can go a bit faster if
all allocations are kept to the low 2 GB of address space.
There are also reasons for a 32bit-to-x86_64 JIT to choose
a nearly arbitrary 2 GB region that lies above 4 GB.
Other archs have other limits, such as 32 MB or 256 MB.


This sort of logic might be appropriate for a sort of parametrized
and specialized vma allocator setting the policy in /proc/ along
with various sorts of limits. There are limits to such and at some
point things will have to manually manage their own process address
spaces in a platform-specific fashion. If kernel assistance here is
rejected they may have to do so in all cases.


I prefer ELF notes (for start-up allocations) and prctl,
plus a mmap flag for per-allocation behavior.


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

Additions to better support JIT emulators:
a. sysctl to set IPC_RMID by default


This is a bad idea. The standard semantics are needed for programs
relying upon them.


I didn't mean that the default default :-) setting would change.
I meant that people could change the behavior from a boot script.
Things that break are really foul and nasty anyway, probably with
serious problems that ought to get fixed.


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

c. open() flag to unlink a file before returning the fd


You probably want a tmpfile(3) -like affair which never has a pathname
to begin with. It could be useful for security purposes more generally.


Yes, exactly. I think there are some possible optimizations
available too, particularly with the cifs filesystem.


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

d. mremap() flag to always keep the old mapping


This sounds vaguely like another syscall, like mdup(). This is
particularly meaningful in the context of anonymous memory, for
which there is no method of replicating mappings within a single
process address space.


Yes, mdup() and probably mdup2(). It could be mremap flags or not.

JIT emulators generally need a second mapping so that they can
have both read/write and execute for the same physical memory.
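
With today's interfaces, the "mmap a file twice" workaround from the
original post looks roughly like the sketch below. The function name,
the directory argument and the file name pattern are placeholders made
up for illustration, and the read-only fallback exists only so that
the sketch still runs on a filesystem mounted noexec.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one unlinked temporary file twice: *rw is writable, *rx is the
 * second view of the same pages. Returns 1 if *rx is read/exec, 0 if
 * PROT_EXEC was refused and *rx is read-only, -1 on error. */
static int map_twice(const char *dir, size_t len, void **rw, void **rx)
{
    char path[4096];
    int executable = 1;

    assert(dir != NULL && len > 0);
    int n = snprintf(path, sizeof path, "%s/jit-XXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);   /* dying before this line leaves the file behind */

    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return -1; }
    *rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (*rw == MAP_FAILED) { close(fd); return -1; }
    *rx = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    if (*rx == MAP_FAILED) {
        executable = 0;
        *rx = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    }
    close(fd);      /* the mappings keep the file alive */
    if (*rx == MAP_FAILED) { munmap(*rw, len); return -1; }
    return executable;
}
```

Both the dependence on a suitable filesystem and the gap between
mkstemp() and unlink() are visible here; those are the warts the
proposed mremap flags and the unlink-on-open flag are meant to remove.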

It is somewhat tolerable to have SE Linux enforce that the second
mapping be randomized. (it helps security greatly, but slows the
emulator by a tiny bit)


On Fri, Jun 08, 2007 at 02:35:22AM -0400, Albert Cahalan wrote:

e. mremap() flag to get a read/write mapping of a read/exec one
f. mremap() flag to get a read/exec mapping of a read/write one


Presumably to be used in conjunction with keeping the old mapping.
A composite mdup()/mremap() and mprotect(), presumably saving a TLB
flush or other sorts of overhead, may make some sort of sense here.
Odds are it'll get rejected as the sequence of syscalls is a rather
precise equivalent, though it would optimize things (as would other
composite syscalls, e.g. ones combining fork() and execve() etc.).


A few mremap flags ought to do the job I think.



Re: [TOMOYO 5/9] Memory and pathname management functions.

2007-06-16 Thread Albert Cahalan

On 6/15/07, Pavel Machek <[EMAIL PROTECTED]> wrote:

[Albert Cahalan]



> It's really not worth getting bothered by. Truth is, big
> giant
> pathnames break lots of stuff already, both kernel and
> userspace.

> Just look in /proc for some nice juicy kernel breakage:
> cwd, exe, fd/*, maps, mounts, mountstats, root, smaps

Well, but we should be fixing that, not adding more. And /proc is
info-only, while this is security related code.


Security tools read from /proc, so /proc is security-related.

The limit imposed by TOMOYO (or AppArmor) is fine,
despite being security-related. It just needs to fail in
the safe direction: access denied.


Re: [TOMOYO 5/9] Memory and pathname management functions.

2007-06-15 Thread Albert Cahalan

Christoph Hellwig writes:

On Thu, Jun 14, 2007 at 04:36:09PM +0900, Kentaro Takeda wrote:



We limit the maximum length of any string data (such as
domainname and pathnames) to TOMOYO_MAX_PATHNAME_LEN
(which is 4000) bytes to fit within a single page.

Userland programs can obtain the amount of RAM currently
used by TOMOYO from /proc interface.


Same NACK for this as for AppArmor, on exactly the same grounds.
Please stop wasting your time on pathname-based non-solutions.


This issue is a very very small wart on an otherwise fine idea.
It's really not worth getting bothered by. Truth is, big giant
pathnames break lots of stuff already, both kernel and userspace.

Just look in /proc for some nice juicy kernel breakage:
cwd, exe, fd/*, maps, mounts, mountstats, root, smaps

So, is that a NACK for the /proc filesystem too? :-)

We even limit filenames to 255 chars; just the other day
a Russian guy was complaining that his monstrous filenames
on a vfat filesystem could not be represented in UTF-8 mode.

Both TOMOYO and AppArmor are good ideas. At minimum, one of
them ought to be accepted. My preference would be TOMOYO,
having origins untainted by Novell's Microsoft dealings.


Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS

2007-06-14 Thread Albert Cahalan

On 6/13/07, Chris Mason <[EMAIL PROTECTED]> wrote:

On Wed, Jun 13, 2007 at 12:14:40PM -0400, Albert Cahalan wrote:
> On 6/13/07, Chris Mason <[EMAIL PROTECTED]> wrote:
> >On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote:



> >> * secure delete via destruction of per-file or per-block random crypto keys
> >
> >I'd rather keep secure delete as a userland problem (or a layered FS
> >problem).  When you take backups and other copies of the file into
> >account, it's a bigger problem than btrfs wants to tackle right now.
>
> It can't be a userland problem if you allow disk blocks to move.
> Volume resizing, logging/journalling, etc. -- they combine to make
> the userland solution essentially impossible. (one could wipe the
> whole partition, or maybe fill ALL space on the volume)

Right about here is where I would insert a long story about ecryptfs, or
encryption solutions that happen all in userland.  At any rate, it is
outside the scope of v1.0, even though I definitely agree it is an
important problem for some people.


I'm sure you do have a nice long story, and I'm sure it seems
correct, but there is something not quite right about the add-on
hacks.

BTW, I'm suggesting that this be about deletion, not protection
of data you wish to keep. It covers more than just file bodies.
It covers inode data, block allocations, etc.


> >> * atomic creation of copy-on-write directory trees
> >
> >Do you mean something more fine grained than the current snapshotting
> >system?
>
> I believe so. Example: I have a linux-2.6 directory. It's not
> a mount point or anything special like that. I want to copy
> it to a new directory called wip, without actually copying
> all the blocks. To all the normal POSIX API stuff, this copy
> should look like the result of "cp -a", not hard links.

This would be a snapshot, which has to be done on a subvolume right now.
It is not as nice as being able to pick a random directory, but I've
only been able to get this far by limiting the feature scope
significantly.  What I did do was make subvolumes very cheap...just make
a bunch of them.


Can a regular user create and use a subvolume? If not, then
this doesn't work. (if so, then I have other concerns...)


Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS

2007-06-13 Thread Albert Cahalan

On 6/13/07, Chris Mason <[EMAIL PROTECTED]> wrote:

On Wed, Jun 13, 2007 at 01:45:28AM -0400, Albert Cahalan wrote:



> The usual wishlist:
>
> * inode-to-pathnames mapping

This one I'll code, it will help with inode link count verification.  I
want to be able to detect at run time that an inode with a link count of
zero is still actually in a directory. So there will be back pointers
from the inode to the directory.


Great, but fsck improvement wasn't on my mind. This is
a desirable feature for the NFS server, and for regular users.
Think about a backup program trying to maintain hard links.


Also, the incremental backup code will be able to walk the btree to find
inodes that have changed, and the backpointers will help make a list of
file names that need to be rsync'd or whatever.

> * a subvolume that is a single file (disk image, database, etc.)

subvolumes can be made that have a single file in them, but they have to
be directories right now.  Doing otherwise would complicate mounts and
other management tools (inside the btree, it doesn't really matter).


Bummer. As I understand it, ZFS provides this. :-)


> * directory indexes to better support Wine and Samba
> * secure delete via destruction of per-file or per-block random crypto keys

I'd rather keep secure delete as a userland problem (or a layered FS
problem).  When you take backups and other copies of the file into
account, it's a bigger problem than btrfs wants to tackle right now.


It can't be a userland problem if you allow disk blocks to move.
Volume resizing, logging/journalling, etc. -- they combine to make
the userland solution essentially impossible. (one could wipe the
whole partition, or maybe fill ALL space on the volume)

I think it needs to be per-extent.

At each level in the btree, you place a randomly generated key
for the more leafward nodes. This means that secure deletion is
merely the act of wiping the key... which can itself occur by
wiping the key of the more rootward node.
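
A toy sketch of that key hierarchy, to make the idea concrete. Everything here is an assumption made for illustration: the names (`struct knode`, `wrap_key`, `wipe_key`, `KEY_LEN`) are hypothetical, the XOR "cipher" is only a stand-in for a real one, and none of it reflects actual btrfs code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define KEY_LEN 16

/* Hypothetical tree node: it stores its own key only in wrapped form,
 * i.e. encrypted under the key of the node one step rootward. */
struct knode {
    uint8_t wrapped_key[KEY_LEN];
};

/* Stand-in cipher: XOR with the parent key. A real design would use a
 * proper cipher; XOR just keeps the sketch short. Wrapping and
 * unwrapping are the same operation here. */
static void wrap_key(const uint8_t parent[KEY_LEN],
                     const uint8_t in[KEY_LEN], uint8_t out[KEY_LEN])
{
    assert(parent != NULL && in != NULL && out != NULL);
    for (int i = 0; i < KEY_LEN; i++)
        out[i] = in[i] ^ parent[i];
}

/* Secure delete of a whole subtree: destroy the one key above it.
 * The volatile pointer keeps the compiler from dropping the stores. */
static void wipe_key(uint8_t key[KEY_LEN])
{
    volatile uint8_t *p = key;
    for (int i = 0; i < KEY_LEN; i++)
        p[i] = 0;
}
```

With this layout, the data under a node stays readable only as long as every key on the path to the root survives, so wiping one rootward key is enough to make everything leafward of it unrecoverable.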


> * atomic creation of copy-on-write directory trees

Do you mean something more fine grained than the current snapshotting
system?


I believe so. Example: I have a linux-2.6 directory. It's not
a mount point or anything special like that. I want to copy
it to a new directory called wip, without actually copying
all the blocks. To all the normal POSIX API stuff, this copy
should look like the result of "cp -a", not hard links.


> * insert/delete ability (add/remove a chunk in the middle of a file)

The disk format makes this O(extent records past the chunk).  It's
possible to code but it would not be optimized.


That's understandable, but note that Reiserfs can support this.


Re: [ANNOUNCE] Btrfs: a copy on write, snapshotting FS

2007-06-12 Thread Albert Cahalan

Neat! It's great to see somebody else waking up to the idea that
storage media is NOT to be trusted.

Judging by the design paper, it looks like your structs have some
alignment problems.

The usual wishlist:

* inode-to-pathnames mapping
* a subvolume that is a single file (disk image, database, etc.)
* directory indexes to better support Wine and Samba
* secure delete via destruction of per-file or per-block random crypto keys
* fast (seekless) access to normal-sized SE Linux data
* atomic creation of copy-on-write directory trees
* immutable bits like UFS has
* hole punch ability
* insert/delete ability (add/remove a chunk in the middle of a file)


Re: JIT emulator needs

2007-06-08 Thread Albert Cahalan

On 6/8/07, Alan Cox <[EMAIL PROTECTED]> wrote:

> There is an SE Linux execmem restriction that enforces W^X.

This depends on whatever SELinux rulesets you are running. It's just a
good rule to have present that most programs shouldn't be self patching,
and then label those that do differently.


A marking in the executable would have made more sense.
It is really broken having an unprivileged user being able to
create whole new executables but unable to lift this restriction
on those executables.

In any case, the restriction is common and troublesome.


> Sometimes it is very helpful to have the read/write mapping
> be a fixed offset from the read/exec mapping. A power of 2
> can be especially desirable.

mmap MAP_FIXED can do this but you need to know a lot about the memory
layout of the system so it gets a bit platform specific.


Yes. There are unportable programs, and UNPORTABLE ones.
Memory layout can vary between vendor kernels, between normal
and 32-on-64 situations, between two different C libraries...


> Emulators often need a cheap way to change page permissions.

mprotect(, range) rather than a page at a time. The kernel will do
merging.


Nope. This can happen rapidly and repeatedly to pages
that are essentially random. The median length of a range
will be a page or two. Merging won't do very much at all.


> a. sysctl to set IPC_RMID by default
> b. shmget() flag to set IPC_RMID by default

Use POSIX shared memory


That appears to have the exact same problem.


> c. open() flag to unlink a file before returning the fd

Is it really that costly to create a blank file, why do you need to do it
a lot in a JIT ?


This part isn't about cost. It's about not leaving around
debris when the JIT crashes.


> e. mremap() flag to get a read/write mapping of a read/exec one
> f. mremap() flag to get a read/exec mapping of a read/write one
> g. mremap() flag to make the 5th arg (new addr) be the upper limit

This is all mprotect and munmap.


That won't get me a second mapping. Supposing that I had
a second mapping, SE Linux would deny the mprotect.
I'm looking for a mapping that is born executable or a mapping
that is born writable, as needed, so that no transition is needed.


> h. 6-bit wide mremap() "flag" to set the upper limit above given base
> i. support the prot argument to remap_file_pages
> j. a documented way (madvise?) to punch same-VMA zero-page holes

mmap (although you get more VMAs from that) so memset() is probably
genuinely cheaper if the permissions are not changing.


Well cost is the problem here. I sure can find some way to
get the operation done, but it isn't cheap. For some usages,
the current setup is costly enough that one must consider
abandoning the hardware MMU in favor of a software one
emitted as part of the JIT. :-(


Re: JIT emulator needs

2007-06-08 Thread Albert Cahalan

On 6/8/07, Eric Dumazet <[EMAIL PROTECTED]> wrote:

Albert Cahalan wrote:



> Additions to better support JIT emulators:
>
> a. sysctl to set IPC_RMID by default

Not very good, this will break some apps.


As a sysctl, the admin gets to choose between
compatibility and sanity.

I can see such a sysctl also being really helpful for a
shared computer used for an Operating Systems or
System Programming course.


> b. shmget() flag to set IPC_RMID by default

This is better :)


Both are good. This one requires that all apps using
SysV shared memory be modified to use the flag.
The other requires that a very few apps be modified
to tolerate a behavior change.
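
For reference, a minimal sketch of what an application has to do today to get a self-cleaning SysV segment, assuming Linux behavior. The function name `shm_alloc_self_cleaning` is made up for this example; the calls themselves are the standard `shmget()`, `shmat()` and `shmctl()`.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/* Create a private SysV segment, attach it, and mark it for removal
 * right away. Returns the attached address, or NULL on failure.
 * If the process dies between shmget() and shmctl(IPC_RMID), the
 * segment stays behind until someone removes it by hand. */
static void *shm_alloc_self_cleaning(size_t len)
{
    assert(len > 0);

    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0)
        return NULL;

    void *p = shmat(id, NULL, 0);

    /* The segment lives on until the last detach, so marking it for
     * removal here is safe; it is done even if shmat() failed. */
    shmctl(id, IPC_RMID, NULL);

    return p == (void *)-1 ? NULL : p;
}
```

The gap between `shmget()` and `shmctl()` is the window that either proposal (a or b) would close.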


> c. open() flag to unlink a file before returning the fd


Well, I assume you would like fd = open("/path/somefile", O_RDWR | O_CREAT |
O_UNLINK, 0644)

(ie allocate a file handle but no name ?)


Yes.


Quite difficult to implement this atomically with current vfs, maybe a new
syscall would be better. (Linus will kill me for that :) )

(We don't need to insert "somefile" in one directory, then unlink it, we only
need to allocate an unnamed inode to get some backing store)


I suspect that SMB/CIFS has a native call for this. There is
some sort of tmpfile flag defined over in that world.
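
A hedged sketch of the "file handle but no name" idea. The `O_UNLINK` flag discussed above does not exist; Linux later gained `O_TMPFILE` (3.11 and newer), which the sketch tries first, falling back to the create-then-unlink idiom. The helper name `open_unnamed` and the `jitXXXXXX` template are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Return an fd for a file that has no name in `dir`, or -1 on error. */
static int open_unnamed(const char *dir)
{
    assert(dir != NULL);

#ifdef O_TMPFILE
    /* Linux 3.11+: the inode is created without ever being linked. */
    int fd = open(dir, O_TMPFILE | O_RDWR, 0600);
    if (fd >= 0)
        return fd;
#endif

    /* Fallback: create, then unlink. A crash between the two calls
     * leaves the file behind, which is the debris problem. */
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/jitXXXXXX", dir);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    int tfd = mkstemp(path);
    if (tfd >= 0)
        unlink(path);
    return tfd;
}
```

The fallback path is what a JIT had to do at the time of this thread; only the first path avoids the window in which a crash leaves a named file around.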


Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid

2007-06-08 Thread Albert Cahalan

On 6/8/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:

"Albert Cahalan" <[EMAIL PROTECTED]> writes:
> On 6/7/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:



>> So it looks to me like we need to do three things:
>> - Fix the inode number
>> - Fix the name on the hugetlbfs dentry to hold the key
>> - Add a big fat comment that user space programs depend on this
>>   behavior of both the dentry name and the inode number.
>
> Assuming that this proposed fix goes in:
>
> Since the inode number is the shmid, and this is a number
> that the kernel randomly chooses AFAIK, there should be
> no need to have different shm segments sharing the same
> inode number.

Where we run into inode number confusion is that all of
these shm segments are actually files on a tmpfs filesystem
somewhere, and by making the inode number the shmid we lose
the tmpfs inode number.  So it is possible we get tmpfs inode
number conflicts.  However the inode number is not used for
anything, and the files are not visible in any other way except
as shm segments so it doesn't matter.


Eh, the kernel chooses both shmid and tmpfs inode number.
You could set a high bit in one or the other.


There is another case with ipc namespaces where we ultimately need
to support duplicate shmids on the same machine (so migration
is a possibility).  However by and large the user space
processes with duplicate ids should be invisible to each other.


On the bright side, this only screws up people who get the
crazy idea that processes can be migrated.


> The situation with the key is a bit more disturbing, though
> we already hit that anyway when IPC_PRIVATE is used.
> (why anybody would NOT use IPC_PRIVATE is a mystery)
> So having the key in the name doesn't make things worse.

Having "SYSV" in the name appears mandatory.  Otherwise you
don't even know it is a shm file. Although I may be confused.


It's mandatory for a different reason: to satisfy parsers.

It is nearly useless for identifying shm files. Look what I can do:
   touch /SYSV
   touch '/SYSV (deleted)'

(so pmap creates a shm, looks for the address in /proc/self/maps,
determines the device major/minor in use, and then uses that)
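
The lookup described in that parenthetical can be sketched roughly as follows. This is not the procps source, only an illustration; the function name `maps_dev_for_addr` is made up, and the line format assumed is the usual `start-end perms offset major:minor inode path`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Find the /proc/self/maps entry that contains `addr` and report the
 * device major/minor of the backing object. Returns 0 on success,
 * -1 if the address is unmapped or the file cannot be read. */
static int maps_dev_for_addr(const void *addr,
                             unsigned *major, unsigned *minor)
{
    assert(major != NULL && minor != NULL);

    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    unsigned long target = (unsigned long)(uintptr_t)addr;
    char line[4096 + 256];
    int rc = -1;

    while (fgets(line, sizeof line, f)) {
        unsigned long start, end;
        unsigned maj, min;

        /* Skip perms and offset; keep the range and the device. */
        if (sscanf(line, "%lx-%lx %*s %*s %x:%x",
                   &start, &end, &maj, &min) != 4)
            continue;
        if (target >= start && target < end) {
            *major = maj;
            *minor = min;
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}
```

A tool following the approach above would pass the address returned by `shmat()` and then treat any mapping with the same major:minor as a SysV segment, rather than trusting the "SYSV" prefix in the name.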


Hmm.  Thinking about this I have just realized that we may want
to approach this a little differently.  Currently I am reusing
the dentry and inode structure that hugetlbfs and tmpfs return
me, and simply have a distinct struct file for each shm mapping.

There is a little more cost but it may actually make sense to have
a dentry and inode that is specific to shm.c so we can do whatever
we need to without adding requirements to the normal tmpfs or hugetlb
code.


Piggybacking on tmpfs has always seemed a bit dirty to me.


JIT emulator needs

2007-06-08 Thread Albert Cahalan

Right now, Linux isn't all that friendly to JIT emulators.
Here are the problems and suggestions to improve the situation.

There is an SE Linux execmem restriction that enforces W^X.
Assuming you don't wish to just disable SE Linux, there are
two ugly ways around the problem. You can mmap a file twice,
or you can abuse SysV shared memory. The mmap method requires
that you know of a filesystem mounted rw,exec where you can
write a very large temporary file. This arbitrary filesystem,
rather than swap space, will be the backing store. The SysV
shared memory method requires an undocumented flag and is
subject to some annoying size limits. Both methods create
objects that will fail to be deleted if the program dies
before marking the objects for deletion.
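
A minimal sketch of the "mmap a file twice" workaround, under the assumption that /tmp is mounted rw,exec and can hold the file. The function name `map_twice` and the `/tmp/jit-XXXXXX` template are made up for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of a fresh temporary file twice: once read/write and
 * once read/exec. Returns 0 on success, -1 on failure. */
static int map_twice(size_t len, void **rw, void **rx)
{
    assert(len > 0 && rw != NULL && rx != NULL);

    /* Assumption: /tmp is mounted rw,exec and has room for the file. */
    char path[] = "/tmp/jit-XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    /* Until this unlink() runs, a crash leaves the file behind. */
    unlink(path);

    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    *rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    *rx = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    close(fd);  /* the mappings keep the file alive */

    if (*rw == MAP_FAILED || *rx == MAP_FAILED) {
        if (*rw != MAP_FAILED)
            munmap(*rw, len);
        if (*rx != MAP_FAILED)
            munmap(*rx, len);
        return -1;
    }
    return 0;
}
```

Code is emitted through `rw` and executed through `rx`; neither mapping is ever both writable and executable, and the file on /tmp, not swap, is the backing store.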

Processors often have annoying limits on the immediate values
in instructions. An x86 or x86_64 JIT can go a bit faster if
all allocations are kept to the low 2 GB of address space.
There are also reasons for a 32bit-to-x86_64 JIT to choose
a nearly arbitrary 2 GB region that lies above 4 GB.
Other archs have other limits, such as 32 MB or 256 MB.

Sometimes it is very helpful to have the read/write mapping
be a fixed offset from the read/exec mapping. A power of 2
can be especially desirable.
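
One way to get a fixed offset with today's interfaces, sketched under assumptions: reserve a single range with PROT_NONE and then place both views inside it with MAP_FIXED. The function name `map_pair_at_offset` is made up, and `fd` is assumed to refer to a file of at least `len` bytes on an exec-capable filesystem.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `fd` so that the read/exec view sits at `base` and the
 * read/write view at `base + offset`. `offset` must be a multiple of
 * the page size and at least `len`. Returns `base`, or NULL on failure. */
static void *map_pair_at_offset(int fd, size_t len, size_t offset)
{
    assert(len > 0 && offset >= len);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || offset % (size_t)page != 0)
        return NULL;

    /* Reserve one contiguous range; the kernel picks the address, so
     * nothing about the process layout has to be known in advance. */
    char *base = mmap(NULL, offset + len, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        perror("mmap reserve");
        return NULL;
    }

    /* MAP_FIXED only replaces parts of the reservation we own. */
    void *rx = mmap(base, len, PROT_READ | PROT_EXEC,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    void *rw = mmap(base + offset, len, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_FIXED, fd, 0);
    if (rx == MAP_FAILED || rw == MAP_FAILED) {
        munmap(base, offset + len);
        return NULL;
    }
    return base;
}
```

With `offset` set to a power of 2 such as `1 << 20`, the writable alias of any executable address is that address plus the offset. The cost is that the whole range, including the unused gap, has to be reserved as address space.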

Emulators often need a cheap way to change page permissions.
One VMA per page is no good. Besides taking up space and making
many things generally slower, having one VMA per page causes
a huge performance loss for snapshot roll-back operations.
Just tearing down all those VMAs takes a good while.
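
A small experiment that shows the VMA growth, assuming a Linux system with /proc mounted. The helper names `count_vmas` and `vmas_added_by_striping` are made up for this sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the lines (one per VMA) in /proc/self/maps, or -1 on error.
 * Uses read() and a stack buffer so that counting does not itself
 * allocate memory and change the result. */
static long count_vmas(void)
{
    int fd = open("/proc/self/maps", O_RDONLY);
    if (fd < 0)
        return -1;

    char buf[4096];
    long lines = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        for (ssize_t i = 0; i < n; i++)
            if (buf[i] == '\n')
                lines++;
    close(fd);
    return n < 0 ? -1 : lines;
}

/* Make every other page of a fresh `pages`-page anonymous mapping
 * read-only and return how many VMAs that added, or -1 on error. */
static long vmas_added_by_striping(size_t pages)
{
    assert(pages >= 2);

    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, pages * psz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    long before = count_vmas();
    int failed = 0;
    for (size_t i = 0; i < pages; i += 2)
        if (mprotect(p + i * psz, psz, PROT_READ) < 0)
            failed = 1;
    long after = count_vmas();

    munmap(p, pages * psz);
    if (failed || before < 0 || after < 0)
        return -1;
    return after - before;
}
```

Neighbouring pages with different permissions cannot share a VMA, so a striped region ends up with about one VMA per page. An emulator that flips permissions on scattered pages gets the same effect across its whole guest address space.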

Additions to better support JIT emulators:

a. sysctl to set IPC_RMID by default
b. shmget() flag to set IPC_RMID by default
c. open() flag to unlink a file before returning the fd
d. mremap() flag to always keep the old mapping
e. mremap() flag to get a read/write mapping of a read/exec one
f. mremap() flag to get a read/exec mapping of a read/write one
g. mremap() flag to make the 5th arg (new addr) be the upper limit
h. 6-bit wide mremap() "flag" to set the upper limit above given base
i. support the prot argument to remap_file_pages
j. a documented way (madvise?) to punch same-VMA zero-page holes


Re: JIT emulator needs

2007-06-08 Thread Albert Cahalan

On 6/8/07, Alan Cox [EMAIL PROTECTED] wrote:

> > There is an SE Linux execmem restriction that enforces W^X.
>
> This depends on whatever SELinux rulesets you are running. Its just a
> good rule to have present that most programs shouldn't be self patching,
> and then label those that do differently.

A marking in the executable would have made more sense.
It is really broken having an unprivileged user being able to
create whole new executables but unable to lift this restriction
on those executables.

In any case, the restriction is common and troublesome.

> > Sometimes it is very helpful to have the read/write mapping
> > be a fixed offset from the read/exec mapping. A power of 2
> > can be especially desirable.
>
> mmap MAP_FIXED can do this but you need to know a lot about the memory
> layout of the system so it gets a bit platform specific.

Yes. There are unportable programs, and UNPORTABLE ones.
Memory layout can vary between vendor kernels, between normal
and 32-on-64 situations, between two different C libraries...


> > Emulators often need a cheap way to change page permissions.
>
> mprotect(, range) rather than a page at a time. The kernel will do
> merging.

Nope. This can happen rapidly and repeatedly to pages
that are essentially random. The median length of a range
will be a page or two. Merging won't do very much at all.
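
To make the access pattern concrete, here is a rough sketch of what "a page at a time" looks like from userspace. The helper names are hypothetical, and alternating pages merely stand in for the "essentially random" pages described above: every change is its own mprotect() call, and neighbouring pages with different protections cannot share a VMA.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Change the protection of the single page that contains addr. */
int protect_page(void *addr, int prot)
{
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t base;

    assert(page > 0);
    base = (uintptr_t)addr & ~((uintptr_t)page - 1);
    return mprotect((void *)base, (size_t)page, prot);
}

/*
 * Hypothetical driver: make every other page of an anonymous region
 * read-only, then writable again. Returns the number of mprotect()
 * calls issued, or -1 on failure.
 */
int toggle_alternate_pages(size_t npages)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *region = mmap(NULL, npages * page, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int calls = 0;

    if (region == MAP_FAILED)
        return -1;

    for (int pass = 0; pass < 2; pass++) {
        int prot = pass == 0 ? PROT_READ : PROT_READ | PROT_WRITE;

        for (size_t i = 0; i < npages; i += 2) {
            if (protect_page(region + i * page, prot) != 0) {
                munmap(region, npages * page);
                return -1;
            }
            calls++;               /* one syscall per page touched */
        }
    }
    munmap(region, npages * page);
    return calls;
}
```

With 8 pages this issues 8 separate calls; a range-based mprotect() would only help if the pages to change were adjacent.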


> > a. sysctl to set IPC_RMID by default
> > b. shmget() flag to set IPC_RMID by default
>
> Use POSIX shared memory

That appears to have the exact same problem.
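
For context, the existing idiom that items a. and b. would turn into the default looks roughly like the sketch below (`shm_alloc_autoremove` is a hypothetical name). On Linux a segment marked with IPC_RMID stays usable until its last detach. The window between shmget() and shmctl() is where a crash leaks the segment, and a POSIX shm_open()/shm_unlink() pair has the same kind of window.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/ipc.h>
#include <sys/shm.h>

/*
 * Create a private SysV segment, attach it, and mark it for removal
 * right away. Returns the mapped address, or NULL on failure.
 */
void *shm_alloc_autoremove(size_t size)
{
    int id;
    void *p;

    assert(size > 0);
    id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0)
        return NULL;

    p = shmat(id, NULL, 0);

    /* Mark for destruction whether or not the attach worked. */
    shmctl(id, IPC_RMID, NULL);

    if (p == (void *)-1)
        return NULL;
    return p;
}
```

The caller releases the memory with shmdt(); nothing shows up in "ipcs -m" afterwards.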


> > c. open() flag to unlink a file before returning the fd
>
> Is it really that costly to create a blank file, why do you need to do it
> a lot in a JIT ?

This part isn't about cost. It's about not leaving around
debris when the JIT crashes.

> > e. mremap() flag to get a read/write mapping of a read/exec one
> > f. mremap() flag to get a read/exec mapping of a read/write one
> > g. mremap() flag to make the 5th arg (new addr) be the upper limit
>
> This is all mprotect and munmap.

That won't get me a second mapping. Supposing that I had
a second mapping, SE Linux would deny the mprotect.
I'm looking for a mapping that is born executable or a mapping
that is born writable, as needed, so that no transition is needed.
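
One way to get two mappings that are each "born" with their final permissions, without the proposed mremap() flags, is to map the same file twice. This is a sketch under stated assumptions, not the mechanism requested above: `dual_map` and the `/tmp/dualmapXXXXXX` path are hypothetical, the kernel chooses both addresses (so there is no fixed power-of-2 offset), the executable mapping can still be refused by a noexec mount or by policy, and it needs the unnamed backing file that item c. is about.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Map the same unnamed file twice: *rw is read/write, *rx is
 * read/exec. Neither mapping ever changes protection.
 * Returns 0 on success, -1 on failure.
 */
int dual_map(size_t len, unsigned char **rw, unsigned char **rx)
{
    char path[] = "/tmp/dualmapXXXXXX";   /* hypothetical location */
    int fd = mkstemp(path);
    void *w, *x;

    assert(rw != NULL && rx != NULL);
    if (fd < 0)
        return -1;
    unlink(path);                         /* no name left behind */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    w = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    x = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_SHARED, fd, 0);
    close(fd);                            /* the mappings keep the file alive */

    if (w == MAP_FAILED || x == MAP_FAILED) {
        if (w != MAP_FAILED)
            munmap(w, len);
        if (x != MAP_FAILED)
            munmap(x, len);
        return -1;
    }
    *rw = w;
    *rx = x;
    return 0;
}
```

Bytes stored through `*rw` become visible through `*rx` because both views share the same pages.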


> > h. 6-bit wide mremap() flag to set the upper limit above given base
> > i. support the prot argument to remap_file_pages
> > j. a documented way (madvise?) to punch same-VMA zero-page holes
>
> mmap (although you get more VMAs from that) so memset() is probably
> genuinely cheaper if the permissions are not changing.

Well cost is the problem here. I sure can find some way to
get the operation done, but it isn't cheap. For some usages,
the current setup is costly enough that one must consider
abandoning the hardware MMU in favor of a software one
emitted as part of the JIT. :-(
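
As an illustration of item j., Linux's madvise(MADV_DONTNEED) on a private anonymous mapping discards the pages so that the next access sees zero-filled memory, without changing the mapping's protection. Whether that counts as the documented, cheap interface asked for here is not settled by this thread; the helper names below are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Discard [addr, addr+len) of a private anonymous mapping. */
int punch_zero_hole(void *addr, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    assert((uintptr_t)addr % page == 0);  /* madvise() needs an aligned start */
    return madvise(addr, len, MADV_DONTNEED);
}

/* Returns 1 if the hole reads as zero and its neighbours are intact. */
int demo_hole_reads_zero(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    unsigned char *p = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int ok;

    if (p == MAP_FAILED)
        return -1;
    memset(p, 0xff, 4 * page);

    if (punch_zero_hole(p + page, page) != 0) {   /* drop the second page */
        munmap(p, 4 * page);
        return -1;
    }
    ok = p[page] == 0 && p[2 * page - 1] == 0     /* hole is zero-filled */
      && p[0] == 0xff && p[2 * page] == 0xff;     /* neighbours untouched */
    munmap(p, 4 * page);
    return ok;
}
```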


Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid

2007-06-07 Thread Albert Cahalan

On 6/7/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:


> So it looks to me like we need to do three things:
> - Fix the inode number
> - Fix the name on the hugetlbfs dentry to hold the key
> - Add a big fat comment that user space programs depend on this
>   behavior of both the dentry name and the inode number.


Assuming that this proposed fix goes in:

Since the inode number is the shmid, and this is a number
that the kernel randomly chooses AFAIK, there should be
no need to have different shm segments sharing the same
inode number.

The situation with the key is a bit more disturbing, though
we already hit that anyway when IPC_PRIVATE is used.
(why anybody would NOT use IPC_PRIVATE is a mystery)
So having the key in the name doesn't make things worse.

I have some concern about the device minor number.
This should be the same for all shm mappings; I do not
know if the behavior changed.


Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid

2007-06-07 Thread Albert Cahalan

On 6/7/07, Badari Pulavarty <[EMAIL PROTECTED]> wrote:


> BTW, I agree with Eric that its would be nice to use shmid as part
> of name instead of forcing to be as inode number. It should be
> possible for pmap to workout shmid from "key" or name. Isn't it ?


It is not at all nice.

1. it's incompatible ABI breakage
2. where will you put the key then, in the inode? :-)

Changing to "SYSVID%d" is no good either. Look, people
are ***parsing*** this stuff in /proc. The /proc filesystem
is not some random sandbox to be playing in.

Before you go messing with it, note that the device number
also matters. (it's per-boot dynamic, but that's OK)
That's how one knows that /SYSV is not just
a regular file; sadly these didn't get a non-/ prefix.
(and no you can't fix that now; it's way too late)

Next time you feel like breaking an ABI, mind putting
"LET'S BREAK AN ABI!" in the subject of your email?

BTW, I suspect this kind of thing also breaks:
a. fuser, lsof, and other resource usage display tools
b. various obscure emulators (similar to valgrind)
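
To show what "parsing this stuff" means in practice, here is a minimal sketch of a maps-line parser that recovers a shmid from the inode field. It is not pmap's actual source. The function name and the sample line in the usage note are made up for illustration, and checking only for device major 0 is a simplification: as noted above, the minor is dynamic per boot, so a real tool has to discover it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one /proc/<pid>/maps line. Returns 1 and stores the inode
 * number in *shmid when the line looks like a SysV shm mapping,
 * 0 for any other mapping, -1 if the line cannot be parsed.
 */
int sysv_shmid_from_maps_line(const char *line, unsigned long *shmid)
{
    unsigned long start, end, offset, inode;
    unsigned int major, minor;
    char perms[5];
    char path[256] = "";
    int n;

    assert(line != NULL && shmid != NULL);
    n = sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %255[^\n]",
               &start, &end, perms, &offset, &major, &minor, &inode, path);
    if (n < 7)
        return -1;                 /* the path is optional, the rest is not */

    /* Assumption: shm segments live on an anonymous device (major 0). */
    if (major != 0 || strncmp(path, "/SYSV", 5) != 0)
        return 0;

    *shmid = inode;                /* the ABI under discussion: inode == shmid */
    return 1;
}
```

Fed a line such as `31076000-310d6000 rw-s 00000000 00:09 1061322752 /SYSV00000000 (deleted)`, it yields 0x3f428000. Moving the shmid into the name, or changing the `/SYSV` prefix, breaks every parser written this way.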


Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid

2007-06-06 Thread Albert Cahalan

On 6/6/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

> On Wed, 6 Jun 2007 23:27:01 -0400 "Albert Cahalan" <[EMAIL PROTECTED]> wrote:
> > Eric W. Biederman writes:
> > > Badari Pulavarty <[EMAIL PROTECTED]> writes:
> >
> > > > Your recent cleanup to shm code, namely
> > > >
> > > > [PATCH] shm: make sysv ipc shared memory use stacked files
> > > >
> > > > took away one of the debugging feature for shm segments.
> > > > Originally, shmid were forced to be the inode numbers and
> > > > they show up in /proc/pid/maps for the process which mapped
> > > > this shared memory segments (vma listing). That way, its easy
> > > > to find out who all mapped this shared memory segment. Your
> > > > patchset, took away the inode# setting. So, we can't easily
> > > > match the shmem segments to /proc/pid/maps easily. (It was
> > > > really useful in tracking down a customer problem recently).
> > > > Is this done deliberately ? Anything wrong in setting this back ?
> > >
> > > Theoretically it makes the stacked file concept more brittle,
> > > because it means the lower layers can't care about their inode
> > > number.
> > >
> > > We do need something to tie these things together.
> > >
> > > So I suspect what makes most sense is to simply rename the
> > > dentry SYSVID<segmentid>
> >
> > Please stop breaking things in /proc. The pmap command relies
> > on the old behavior.
>
> What effect did this change have upon the pmap command?  Details, please.
>
> > It's time to revert.
>
> Probably true, but we'd need to understand what the impact was.


Very simply, pmap reports the shmid.

albert 0 ~$ pmap `pidof X` | egrep -2 shmid
30050000  16384K rw-s-  /dev/fb0
31050000    152K rw---    [ anon ]
31076000    384K rw-s-    [ shmid=0x3f428000 ]
310d6000    384K rw-s-    [ shmid=0x3f430001 ]
31136000    384K rw-s-    [ shmid=0x3f438002 ]
31196000    384K rw-s-    [ shmid=0x3f440003 ]
311f6000    384K rw-s-    [ shmid=0x3f448004 ]
31256000    384K rw-s-    [ shmid=0x3f450005 ]
312b6000    384K rw-s-    [ shmid=0x3f460006 ]
31316000    384K rw-s-    [ shmid=0x3f870007 ]
31491000    140K r----  /usr/share/fonts/type1/gsfonts/n021003l.pfb
3150e000   9496K rw---    [ anon ]


Re: [RFC][PATCH] /proc/pid/maps doesn't match "ipcs -m" shmid

2007-06-06 Thread Albert Cahalan

Eric W. Biederman writes:

> Badari Pulavarty <[EMAIL PROTECTED]> writes:
> > Your recent cleanup to shm code, namely
> >
> > [PATCH] shm: make sysv ipc shared memory use stacked files
> >
> > took away one of the debugging feature for shm segments.
> > Originally, shmid were forced to be the inode numbers and
> > they show up in /proc/pid/maps for the process which mapped
> > this shared memory segments (vma listing). That way, its easy
> > to find out who all mapped this shared memory segment. Your
> > patchset, took away the inode# setting. So, we can't easily
> > match the shmem segments to /proc/pid/maps easily. (It was
> > really useful in tracking down a customer problem recently).
> > Is this done deliberately ? Anything wrong in setting this back ?
>
> Theoretically it makes the stacked file concept more brittle,
> because it means the lower layers can't care about their inode
> number.
>
> We do need something to tie these things together.
>
> So I suspect what makes most sense is to simply rename the
> dentry SYSVID<segmentid>


Please stop breaking things in /proc. The pmap command relies
on the old behavior. It's time to revert. Put back the segment ID
where it belongs, and leave the key where it belongs too.

Containers are NOT worth breaking our ABIs left and right.
We don't need to leap off that bridge just because Solaris did,
unless you can explain why complexity and bloat are desirable.
We already have SE Linux, chroot, KVM, and several more!


RE: slow open() calls and o_nonblock

2007-06-03 Thread Albert Cahalan

David Schwartz writes:

[Aaron Wiebe]



open("/somefile", O_WRONLY|O_NONBLOCK|O_CREAT, 0644) = 1621 <0.415147>


How could they make any difference? I can't think of any
conceivable way they could.


Now, I'm a userspace guy so I can be pretty dense, but shouldn't a
call with a nonblocking flag return EAGAIN if its going to take
anywhere near 415ms?  Is there a way I can force opens to EAGAIN if
they take more than 10ms?


There is no way you can re-try the request. The open must either
succeed or not return a handle. It is not like a 'read' operation
that has an "I didn't do anything, and you can retry this request"
option.

If 'open' returns a file handle, you can't retry it (since it must
succeed in order to do that, failure must not return a handle).
If 'open' doesn't return a file handle, you can't retry it
(because, without a handle, there is no way to associate a future
request with this one, if it creates a file, the file must not be
created if you don't call 'open' again).

The 'open' function must, at minimum, confirm that the file exists
(or doesn't exist and can be created, or whatever). This takes
however long it takes on NFS.


This is not the case, though we might need to allocate a new
flag to avoid breaking things.

Let open() with O_UNCHECKED always return a file descriptor,
except perhaps when failure can be identified without doing IO.
The "real" open then proceeds in the background.


From poll() or select(), you can see that the file descriptor
is not ready for anything. Eventually it becomes ready for IO
or reports an error condition. Both select() and poll() are
capable of reporting errors. If the "real" (background) open()
fails, then the only valid operation is close(). Attempts to
do anything else get EBADFD or ESTALE.

You'll also need a background close().
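
The caller's side of this proposal can be sketched with today's poll(). O_UNCHECKED is only the flag proposed above and does not exist, so the usage note uses an ordinary pipe as a stand-in for a descriptor that is "not ready yet". `check_fd` and the `fd_state` names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <unistd.h>

enum fd_state { FD_PENDING = 0, FD_READY = 1, FD_FAILED = 2 };

/*
 * Non-blocking check of fd for the events in `want` (POLLIN and/or
 * POLLOUT). FD_FAILED means the only sensible next step is close().
 */
enum fd_state check_fd(int fd, short want)
{
    struct pollfd p = { .fd = fd, .events = want, .revents = 0 };
    int n;

    assert(fd >= 0);
    n = poll(&p, 1, 0);            /* timeout 0: report and return at once */
    if (n < 0)
        return FD_FAILED;
    if (n == 0)
        return FD_PENDING;         /* nothing to report yet */
    if (p.revents & (POLLERR | POLLNVAL))
        return FD_FAILED;          /* error condition reported by poll() */
    return FD_READY;
}
```

On the read end of an empty pipe this reports FD_PENDING, and FD_READY once a byte has been written. Under the proposal, a failed background open would surface as FD_FAILED in the same way.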


Re: Syslets, Threadlets, generic AIO support, v6

2007-05-31 Thread Albert Cahalan

Ingo Molnar writes:


> looking over the list of our new generic APIs (see further below) i
> think there are three important things that are needed for an API to
> become widely used:
>
>  1) it should solve a real problem (ha ;-), it should be intuitive to
>     humans and it should fit into existing things naturally.
>
>  2) it should be ubiquitous. (if it's about IO it should cover block IO,
>     network IO, timers, signals and everything) Even if it might look
>     silly in some of the cases, having complete, utter, no compromises,
>     100% coverage for everything massively helps the uptake of an API,
>     because it allows the user-space coder to pick just one paradigm
>     that is closest to his application and stick to it and only to it.
>
>  3) it should be end-to-end supported by glibc.


4) At least slightly portable.

Anything supported by any similar OS is already ahead, even if it
isn't the perfect API of our dreams. This means kqueue and doors.

If it's not on any BSD or UNIX, then most app developers won't
touch it. Worse yet, it won't appear in programming books, so even
the Linux-only app programmers won't know about it.

Running ideas by the FreeBSD and OpenSolaris developers wouldn't
be a bad idea. Agreement leads to standardization, which leads to
interfaces getting used.

BTW, wrapper libraries that bury the new API under a layer of
gunk are not helpful. One might as well just use the old API.


Re: [RFC, PATCH 1/3] introduce SYS_CLONE_MASK

2007-05-29 Thread Albert Cahalan

On 5/29/07, Eric W. Biederman <[EMAIL PROTECTED]> wrote:
> "Albert Cahalan" <[EMAIL PROTECTED]> writes:
> > Jan Engelhardt writes:
> > > -if(self_pid==1 && ADOPTED(processes[i]) && forest_type!='u')
> > > +if(ADOPTED(processes[i]) && forest_type!='u')
> >
> > That's not compatible because init's children are now in the
> > logical place. Since the days of procps-1.x.x or earlier,
> > such processes have been listed at top level.
> >
> > BTW, what does "ps -ejH" do for you, with and without the patch?
>
> ps -ejH displays everything.

That's not what I mean. (the "-e" causes that of course)
I'm asking about the parent-child relationships shown.
The "-H" option is a bit different from the "f" option.

> > I'd be a lot happier about breaking compatibility in this area
> > if I could get a functional adoption flag. That is, I really
> > would like to show a process as child of init if it naturally
> > was created as a child of init. It's less informative to have
> > fake children showing up the same as real ones. The original
> > parent PID would do. (BTW, the original parent name and/or
> > grandparent PID would be great to have) As a bonus, the kernel
> > could reap these processes more quickly than init can... and
> > then maybe we can stop caring if init is alive.
>
> Having the kernel not reparent user processes to init is an interesting
> idea, especially when those processes have not existed.  I'm not
> certain that is POSIX compliant and otherwise backwards compatible.


I'm not suggesting that this be visible via POSIX APIs.

It's almost certainly a given that getppid() must return 1, and
probably /proc needs to show this as well. Without question,
any process created by init must be reaped by init.

Processes NOT created by init could be silently reaped by
the kernel. They need to see their own PPID as 1, but there
need not be any parent-child relationship in the kernel data
structures. The kernel can fake the whole thing, which is nice
because then the kernel isn't depending on userspace to
correctly perform the pointless action of playing with zombies.
(might setting the death signal to 0 be useful here?)

For "ps fax" and such, I'd like to distinguish between init's
real and adopted children. Right now the adopted children
look like they were created by init, which is not true. I only
need a simple boolean flag, set upon reparenting, to tell me.
Such a flag may also be useful for optimizing away the whole
wait/waitpid/wait4/waitid/wait3 nonsense when an adopted
child dies.
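
A tiny sketch of how a ps-like tool might use such a flag, purely for illustration. The `adopted` field is the hypothetical boolean requested above (nothing in /proc exports it), and drawing adopted processes at top level is an assumption about presentation, not something this thread decides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process record as a ps-like tool might hold it. */
struct proc_entry {
    int pid;
    int ppid;      /* what getppid() reports; 1 after reparenting */
    int adopted;   /* hypothetical flag, set upon reparenting */
};

/*
 * Return the pid under which the entry is drawn in a forest view,
 * or 0 for top level.
 */
int forest_parent(const struct proc_entry *p)
{
    assert(p != NULL);
    if (p->adopted)
        return 0;          /* assumption: orphans are listed at top level */
    return p->ppid;        /* real children, including init's own */
}
```

With this rule a getty started by init is drawn under PID 1, while a daemon that was reparented to init is not, even though both report a PPID of 1.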


Re: [RFC, PATCH 1/3] introduce SYS_CLONE_MASK

2007-05-28 Thread Albert Cahalan

Jan Engelhardt writes:

> On Apr 10 2007 17:47, Jan Engelhardt wrote:
> > On Apr 8 2007 20:57, Oleg Nesterov wrote:
> > > Anyway, re-parenting to swapper breaks pstree, it doesn't
> > > show kernel threads. And if ->parent == /sbin/init, we can't
> > > remove us from ->children (unless we forbid sub-thread-of-init
> > > exec). So the only safe change is set ->exit_state = -1.
> >
> > Then we have to fix pstree and all that. (In fact, I'm
> > trying to patch `ps f` to DTRT ;p)
>
> Done that and the result is that `ps afwx` now looks like:
>
>   PID TTY      STAT   TIME COMMAND
>  2722 ?        S      0:00 [lockd]
> ...
>     3 ?        S<     0:00 [events/0]
>     2 ?        SN     0:00 [ksoftirqd/0]
>     1 ?        Ss     0:02 init [3]
>   537 ?        S
> ...
>
> -if(self_pid==1 && ADOPTED(processes[i]) && forest_type!='u')
> +if(ADOPTED(processes[i]) && forest_type!='u')


That's not compatible because init's children are now in the
logical place. Since the days of procps-1.x.x or earlier,
such processes have been listed at top level.

BTW, what does "ps -ejH" do for you, with and without the patch?

I'd be a lot happier about breaking compatibility in this area
if I could get a functional adoption flag. That is, I really
would like to show a process as child of init if it naturally
was created as a child of init. It's less informative to have
fake children showing up the same as real ones. The original
parent PID would do. (BTW, the original parent name and/or
grandparent PID would be great to have) As a bonus, the kernel
could reap these processes more quickly than init can... and
then maybe we can stop caring if init is alive.


Re: [RFC, PATCH 1/3] introduce SYS_CLONE_MASK

2007-05-28 Thread Albert Cahalan

Robin Holt writes:

> On Mon, Apr 09, 2007 at 08:36:21AM -0600, Eric W. Biederman wrote:
> > Robin Holt <[EMAIL PROTECTED]> writes:
> > > I would say this is more a benefit than a problem.  With a couple
> > > of these systems we are testing, the number of kernel threads is
> > > far greater than the number of user processes and having pstree
> > > not normally show them, but maybe have an option we add later to
> > > show them again would be beneficial.
> >
> > Sure.
> >
> > Robin how many kernel thread per cpu are you seeing?
>
> 10.


This has long been rotten. Mind fixing it for us? :-)

We have N types of thread on M CPUs. Pick something, N or M,
to be at the top level in /proc. The other goes below, in the
per-process task directories.

You then have either N or M things showing up in ps, not N*M.

Note that both ps and top can print the CPU number just fine.
Abusing the task name for this is just retarded. This suggests
that the top level should be the type of task, with the lower
level in /proc/*/task being per-CPU and not needing distinct
naming at all.
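
As a rough illustration of the N-versus-N*M point, the Python sketch below groups per-CPU kernel thread names of the form "type/cpu" by their type. The thread names and the `group_by_type` helper are assumptions made for the example; nothing here reflects an actual /proc layout change.

```python
def group_by_type(task_names):
    """Group names like 'events/0' by thread type; values are CPU numbers."""
    groups = {}
    for name in task_names:
        kind, sep, cpu = name.rpartition("/")
        if sep and cpu.isdigit():
            # Per-CPU thread: file it under its type.
            groups.setdefault(kind, []).append(int(cpu))
        else:
            # Not a per-CPU thread: keep it as its own entry.
            groups.setdefault(name, [])
    return {kind: sorted(cpus) for kind, cpus in groups.items()}


if __name__ == "__main__":
    names = ["events/0", "events/1", "ksoftirqd/0", "ksoftirqd/1", "lockd"]
    for kind, cpus in group_by_type(names).items():
        print(kind, cpus)
```

With two CPUs, five names collapse to three top-level entries; the per-CPU instances would then live one level down, where the CPU number is already available without encoding it in the name.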


setting all 3 file times

2007-05-20 Thread Albert Cahalan

Why can we still not do this?

It's a stupid restriction. Security isn't a reason;
we have SE Linux policy and auditing to take
care of any issues. Heck, SE Linux policy could
even deny this feature for the truly paranoid.

Writing to /dev/* to update timestamps is surely
a worse security situation. (See the "dump" program.)

Ideally we'd have atomic update in some way.
That might mean feeding the old times into the
system call, so that the kernel can fail it if any
changes have happened meanwhile. Maybe the
syscall could take a pair of "struct stat" even,
making the operation really easy and powerful.
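
The "feed the old times in" idea can be approximated in user space, as in the Python sketch below. Two caveats apply. The check and the update are separate steps here, so this is racy in exactly the way a kernel-side implementation would not be. Also, os.utime() sets only the access and modification times; the change time still moves to "now". The helper names are invented for the example.

```python
import os
import tempfile


def snapshot(path):
    """Return (atime, mtime, ctime) in nanoseconds for path."""
    st = os.stat(path)
    return (st.st_atime_ns, st.st_mtime_ns, st.st_ctime_ns)


def set_times_if_unchanged(path, expected, atime_ns, mtime_ns):
    """Set atime/mtime only if the file still has the expected old times."""
    if snapshot(path) != expected:
        return False  # somebody touched the file in the meantime
    os.utime(path, ns=(atime_ns, mtime_ns))
    return True


def demo():
    with tempfile.NamedTemporaryFile() as f:
        old = snapshot(f.name)
        stamp = 1_000_000_000 * 10**9  # whole seconds, to survive coarse timestamps
        first = set_times_if_unchanged(f.name, old, stamp, stamp)
        # The snapshot is now stale, so a second attempt must be refused.
        second = set_times_if_unchanged(f.name, old, 0, 0)
        return first, second, os.stat(f.name).st_mtime_ns == stamp
```

A syscall taking the old and new values together could close the window between the comparison and the update, which this sketch cannot do.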


Re: [PATCH 2.6.21-rt2] PowerPC: decrementer clockevent driver

2007-05-19 Thread Albert Cahalan

On 5/19/07, Segher Boessenkool <[EMAIL PROTECTED]> wrote:

[Albert Cahalan]



> Set MMCR0[TBEE], set MMCR0[PMXE], and choose a TBL bit via
> MMCR0[TBSEL].

That's the performance monitor, which could very well be
in use already (for performance monitoring stuff, who
would have guessed).


It is the performance monitor, which sadly cannot be used
very well unless the decrementer is disabled. The hardware
is buggy. As long as we use the decrementer for timekeeping,
we cannot safely generate performance monitor interrupts.

I'd like to have the performance monitor available. It's NOT
available unless we use part of it for timekeeping. That's the
choice the hardware gives us.

We can get TBL bit flip interrupts for free. We don't even need
to give up one of the event counters. If we do give up one of the
event counters (a rather reasonable idea), then we can count
one of those TBL bit flips or the cycle counter.


Re: [PATCH 2.6.21-rt2] PowerPC: decrementer clockevent driver

2007-05-18 Thread Albert Cahalan

On 5/18/07, Sergei Shtylyov <[EMAIL PROTECTED]> wrote:

Albert Cahalan wrote:



>>> Sure, but is there any utility in registering more than the
>>> decrementer on PPC?

>> Not yet. I'm not sure I know any other PPC CPU facility fitting
>> for clockevents. In theory, FIT could be used -- but its period
>> is measured in powers of 2, IIRC.

> I'd really like to have that as an option. It would allow oprofile
> to safely use hardware events on the MPC74xx "G4" processors.
> Alternately it would allow thermal events. It is safe to use at
> most one of the three (decrementer,profiling,thermal) interrupts.
> If two were to hit at the same time, badness happens.

Unfortunately, FIT exists only on Book E CPUs and MPC74xx aren't Book E, IIUC.


By the name "FIT" perhaps, but MPC74xx has essentially
the same thing.


> It's possible to wrapper the interrupt in something that divides
> down, calling the normal code only some of the time. I think one
> of the FIT choices is about 4 kHz on my system, which would be OK.

Erm, are you sure you have FIT (or is your system not MPC74xx based)?


Set MMCR0[TBEE], set MMCR0[PMXE], and choose a TBL bit via MMCR0[TBSEL].
TBSEL is a 2-bit field which selects a timebase bit to use. The timebase
bits that can be chosen are numbered 15, 19, 23, and 31. In the notation
used by every other CPU vendor those would be bits 16, 12, 8, and 0,
respectively.

Example: My system uses a TBL frequency of 24907667 Hz. This gives choices
of 12453833, 48648, 3040, and 190 Hz. The lowest three of those could
be useful, with 48648 only for profiling and extreme real-time.
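
The arithmetic behind those four rates can be checked with a few lines of Python. The sketch assumes one interrupt per full period of the selected bit, that is, a rate of f / 2^(n+1) for bit n counted from the least significant end. That rule is inferred from the numbers quoted above rather than taken from a processor manual, and the function names are made up.

```python
def ibm_to_lsb0(ibm_bit, width=32):
    """Convert IBM bit numbering (bit 0 = MSB) to LSB-0 numbering."""
    return width - 1 - ibm_bit


def interrupt_rate_hz(tbl_hz, lsb0_bit):
    """Rate of one event per full period of the chosen timebase bit."""
    return tbl_hz / 2 ** (lsb0_bit + 1)


TBL_HZ = 24907667  # timebase frequency from the example above

if __name__ == "__main__":
    for ibm_bit in (31, 23, 19, 15):
        bit = ibm_to_lsb0(ibm_bit)
        rate = interrupt_rate_hz(TBL_HZ, bit)
        print(f"IBM bit {ibm_bit} = LSB-0 bit {bit}: {rate:.1f} Hz")
```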

It's also possible to trigger on the CPU cycle counter, but this would
cost one of the performance counters. MPC7400 has 4, later CPUs have 6
or more, and I think xPC7x0 had only 2. This method is a bit nicer,
since then one could trigger interrupts on arbitrary clock cycles
without needing to write the timebase register.


Re: [PATCH 2.6.21-rt2] PowerPC: decrementer clockevent driver

2007-05-17 Thread Albert Cahalan

Sergei Shtylyov writes:

Kumar Gala wrote:

[Sergei Shtylyov]

Kumar Gala wrote:



I haven't looked at all the new clock/timer code, is there any
utility in having support for more than one clock source?


Of course, you may register as many as you like.


Sure, but is there any utility in registering more than the
decrementer on PPC?


Not yet. I'm not sure I know any other PPC CPU facility fitting
for clockevents. In theory, FIT could be used -- but its period
is measured in powers of 2, IIRC.


I'd really like to have that as an option. It would allow oprofile
to safely use hardware events on the MPC74xx "G4" processors.
Alternately it would allow thermal events. It is safe to use at
most one of the three (decrementer,profiling,thermal) interrupts.
If two were to hit at the same time, badness happens.

It's possible to wrap the interrupt in something that divides
down, calling the normal code only some of the time. I think one
of the FIT choices is about 4 kHz on my system, which would be OK.
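
The divide-down wrapper is easy to picture in a few lines. The Python sketch below is only a model of the idea: `DividedHandler` and the divisor of 4 are illustrative, and a real version would sit in the kernel's interrupt path rather than in a Python object.

```python
class DividedHandler:
    """Call the wrapped handler on every Nth tick only."""

    def __init__(self, handler, divisor):
        if divisor < 1:
            raise ValueError("divisor must be at least 1")
        self.handler = handler
        self.divisor = divisor
        self.count = 0

    def __call__(self):
        self.count += 1
        if self.count >= self.divisor:
            self.count = 0
            self.handler()  # the "normal code" runs only here


if __name__ == "__main__":
    ticks = []
    timer = DividedHandler(lambda: ticks.append(1), divisor=4)
    for _ in range(4000):  # stand-in for one second of a ~4 kHz source
        timer()
    print(len(ticks))
```

Dividing a roughly 4 kHz source by 4 leaves the normal tick code running at about 1 kHz, while the raw interrupt stays available for profiling.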

Full oprofile functionality would be wonderful.


Re: [PATCH] LogFS take three

2007-05-15 Thread Albert Cahalan

Please don't forget the immutable bit. ("man lsattr")
Having both the privileged and the unprivileged immutable
bits, BSD-style, would be even better.
The immutable bit is important for working around
software bugs and "features" that damage files.

I also can't find xattr support.


Re: Long file names in VFAT broken with iocharset=utf8

2007-05-09 Thread Albert Cahalan

On 5/9/07, Andrey Borzenkov <[EMAIL PROTECTED]> wrote:

On Wednesday 09 May 2007, Albert Cahalan wrote:

...

On May 8 2007 00:43, Albert Cahalan wrote:



Fix: the vfat driver should use the 8.3 name for such files.

...

It's not appropriate for vfat, HPFS, JFS, or NTFS. All of those
have built-in support for 8.3 aliases. Normally the 8.3 names
are like hidden hard links, except that deletion of either name
will wipe out the other. (same as case differences too)
So the names are there, and they should already work.
They just need to be reported for directory listings when the
long names would be too long.


several problems associated with it

1. those names are rather meaningless. How do you find out which file they
refer to? It is OK for trivial cases but not in a directory full of long
names; nor am I sure how many unique short names can be generated.


If a short name can not be generated, then no OS could
create the file at all. The vfat and iso9660 filesystems require
short names. Any OS writing to such a filesystem MUST
generate short names in addition to any long names.
Mount your vfat as filesystem type "msdos" to see.

By default, Windows will also generate short names on NTFS.

Note that you can't put your files on a CD-ROM in a way
that Windows could read the filenames. Windows limits
CD-ROM filenames to 63 characters; you get at most 103
if you violate the spec.


2. directory contents is effectively invalidated upon backup and restore (tar
c; rm -rf; tar x). It is impossible to infer long names from short ones.


It may be that tar fails to use the vfat ioctl calls to save
and restore short names. You could try using Wine to
run a Windows-native backup program. This shouldn't
really matter though; you'd only be getting short names
for files that had truly unreasonable long names anyway.

I suppose somebody should check to see if there is a
danger of overwrite when the short-named files get
written back. The safest thing might be to mount the
filesystem as type "msdos".


3. this still does not answer how can I *create* long name from within Linux.


WTF? These names are too annoying to use, even if there
weren't this limit. Anything over about 29 characters is in
need of a rename. (that'd be 58 bytes for you, which is OK)
The limit is already 4 times larger than what is reasonable.


Re: Long file names in VFAT broken with iocharset=utf8

2007-05-09 Thread Albert Cahalan

On 5/8/07, Jan Engelhardt <[EMAIL PROTECTED]> wrote:

On May 8 2007 00:43, Albert Cahalan wrote:



> Fix: the vfat driver should use the 8.3 name for such files.

Or the 31-character ISO Level 1(?).


That might be appropriate for a similar problem on CD-ROM
filesystems. (when the CD is Rock Ridge KOI8 and you want UTF-8)
It may even be appropriate for Joliet, though 8.3 may be
the better choice in that case.

It's not appropriate for vfat, HPFS, JFS, or NTFS. All of those
have built-in support for 8.3 aliases. Normally the 8.3 names
are like hidden hard links, except that deletion of either name
will wipe out the other. (same as case differences too)
So the names are there, and they should already work.
They just need to be reported for directory listings when the
long names would be too long.


Re: [PATCH 0/2] LogFS take two

2007-05-07 Thread Albert Cahalan

You seem to be missing the immutable bit. This is really useful
for dealing with buggy or badly-designed things running as root.
I've used it to protect /dev/null from becoming a normal file
filled with junk, and to protect /etc/resolv.conf from "helpful"
network management daemons that don't know my DNS servers.
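
For readers who want to check whether a file currently carries the immutable bit, here is a hedged Python sketch using the inode-flags ioctl that lsattr relies on. The request number is computed with the generic Linux ioctl encoding, which is an assumption: it holds on x86 and ARM, but some architectures (PowerPC among them) encode ioctl directions differently. The filesystem must also support the call.

```python
import fcntl
import os
import struct

FS_IMMUTABLE_FL = 0x00000010  # value from <linux/fs.h>

# _IOR('f', 1, long) under the generic encoding (assumption, see above).
FS_IOC_GETFLAGS = (2 << 30) | (struct.calcsize("l") << 16) | (ord("f") << 8) | 1


def is_immutable_flag_set(flags):
    """True if the immutable bit is present in an inode flags word."""
    return bool(flags & FS_IMMUTABLE_FL)


def read_inode_flags(path):
    """Read the inode flags of path; raises OSError if unsupported."""
    fd = os.open(path, os.O_RDONLY)
    try:
        buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
    finally:
        os.close(fd)
    # The kernel stores a 32-bit int at the start of the buffer.
    return struct.unpack("i", buf[:4])[0]
```

A call such as `is_immutable_flag_set(read_inode_flags("/etc/resolv.conf"))` would then report whether the protection described above is in place, on filesystems that implement the flag.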

Anything else missing?

BTW, BSD offers an unprivileged immutable bit as well. I'm sure
it's useful for the apps that trash their own config files.
Actually, this bit alone would do fine, and we could really use
a way to protect writable device files from deletion or permission
bit changes.


Re: Long file names in VFAT broken with iocharset=utf8

2007-05-07 Thread Albert Cahalan

Andrey Borzenkov writes:


This was posted in one of Russian forums. It was not possible to
archive (under Linux, using tar) vfat directory where files had
long Russian names (really long - over 150 - 170 characters) - tar
returned stat failure. When looking with plain ls, file names
appeared truncated.


I have an idea to deal with this, but first a rant...

At two bytes per character, you get 127 characters in a filename.
That's wider than the standard 80-column display, and far wider
than the 28 or 29 characters that an "ls -l" has room for. In a
GUI file manager or file dialog box, you'll have to scroll sideways.
In a web browser directory listing, you'll almost certainly have
to scroll sideways. Most of this even applies to Windows tools.
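
The 127-character figure follows from a 255-byte limit on one filename component, which is the usual Linux value and is assumed here. The short Python check below shows where Cyrillic names stop fitting once they are encoded as UTF-8; the helper names are made up.

```python
NAME_MAX = 255  # bytes per path component, the usual Linux limit (assumed)


def utf8_len(name):
    """Length of a filename in bytes once encoded as UTF-8."""
    return len(name.encode("utf-8"))


def fits(name):
    """True if the UTF-8 encoded name fits within NAME_MAX bytes."""
    return utf8_len(name) <= NAME_MAX


if __name__ == "__main__":
    for chars in (127, 128, 150):
        name = "я" * chars  # each Cyrillic letter takes two bytes in UTF-8
        print(chars, utf8_len(name), fits(name))
```

A 150-character Russian name needs 300 bytes, so it cannot be returned through an interface limited to 255 bytes, whatever the on-disk format allows.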

In other words, this is user error. Somebody thought that a filename
was a place to store a document, probably a README file. What next,
shall we MIME-encode an icon into the filename?

Fix: the vfat driver should use the 8.3 name for such files.


Re: Broken process startup times after suspend (regression)

2007-05-05 Thread Albert Cahalan

john stultz writes:


Indeed. The monotonic clock's behavior around suspend and resume
is poorly defined. When we increased it, folks didn't like the
fact that uptime would increase while a system was suspended.


The uptime really does need to increase during suspend. Otherwise,
things get really weird with devices like the OLPC XO which will be
sleeping between keystrokes. You could run the device for hours,
yet get an uptime of only a few minutes. Suspended time should get
counted as stolen time, same as when a hypervisor takes away time.
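
On kernels much newer than the one under discussion, the gap between a clock that counts suspended time and one that does not can be observed directly. The Python sketch below compares CLOCK_BOOTTIME, which was added to Linux later (2.6.39) and so was not available when this was written, with CLOCK_MONOTONIC. It returns None where the clock is missing; the function name is invented.

```python
import time


def suspended_seconds():
    """Approximate time spent suspended since boot, or None if unknown."""
    boottime_id = getattr(time, "CLOCK_BOOTTIME", None)
    if boottime_id is None:
        return None  # platform or Python build without CLOCK_BOOTTIME
    monotonic = time.clock_gettime(time.CLOCK_MONOTONIC)
    boottime = time.clock_gettime(boottime_id)
    # On Linux, CLOCK_MONOTONIC stops during suspend; CLOCK_BOOTTIME does not.
    return boottime - monotonic


if __name__ == "__main__":
    print(suspended_seconds())
```

On a machine that sleeps between keystrokes, this difference would account for most of the wall-clock time since boot.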


Re: Ext3 vs NTFS performance

2007-05-05 Thread Albert Cahalan

Andrew Morton writes:

"Cabot, Mason B" <[EMAIL PROTECTED]> wrote:



I've been testing the NAS performance of ext3/Openfiler 2.2 against
NTFS/WinXP and have found that NTFS significantly outperforms ext3 for
video workloads. The Windows CIFS client will attempt a poor-man's
pre-allocation of the file on the server by sending 1-byte writes at
128K-byte strides, breaking block allocation on ext3 and leading to
fragmentation and poor performance. This will happen for many
applications (including iTunes) as the CIFS client issues these
pre-allocates under the application layer.


Oh my gawd, what a stupid hack.  Now we know what the
MS interoperability lab has been working on.


Stupid or not, this is their protocol. The cifs filesystem
driver needs a patch to do this. Probably that'll help get
better performance when Linux is writing to a Windows server.
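
The write pattern described in the report can be reproduced against a local file for testing, as in the Python sketch below. Only the "1-byte writes at 128K-byte strides" part comes from the report. Placing the byte at the last position of each stride is an assumption, and the function names are made up. It does not touch the CIFS protocol itself.

```python
import os
import tempfile

STRIDE = 128 * 1024  # stride size from the report


def stride_offsets(size, stride=STRIDE):
    """Offsets of the single bytes written to reserve `size` bytes."""
    # Assumption: the byte lands at the end of each stride.
    return list(range(stride - 1, size, stride))


def preallocate_by_strides(path, size, stride=STRIDE):
    """Extend an existing file with one 1-byte write per stride."""
    with open(path, "r+b") as f:
        for offset in stride_offsets(size, stride):
            f.seek(offset)
            f.write(b"\0")


def demo(size=512 * 1024):
    with tempfile.NamedTemporaryFile() as f:
        preallocate_by_strides(f.name, size)
        return os.path.getsize(f.name)
```

Running this against an ext3 mount and then inspecting the file's block layout is one way to observe the fragmentation the report describes.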


Re: Broken process startup times after suspend (regression)

2007-05-05 Thread Albert Cahalan

john stultz writes:


> Indeed. The monotonic clock's behavior around suspend and resume
> is poorly defined. When we increased it, folks didn't like the
> fact that uptime would increase while a system was suspended.


The uptime really does need to increase during suspend. Otherwise,
things get really weird with devices like the OLPC XO which will be
sleeping between keystrokes. You could run the device for hours,
yet get an uptime of only a few minutes. Suspended time should get
counted as stolen time, same as when a hypervisor takes away time.
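
A small sketch of how the difference shows up on later kernels. This
assumes a Linux kernel that provides CLOCK_BOOTTIME (added in 2.6.39,
after this thread) and a Python that exposes it as time.CLOCK_BOOTTIME;
the function name is made up for the example. On such systems
CLOCK_MONOTONIC does not advance during suspend while CLOCK_BOOTTIME
does, so the gap between the two approximates the time spent suspended.

```python
import time


def suspended_seconds():
    """Return CLOCK_BOOTTIME - CLOCK_MONOTONIC in seconds, or None.

    None means the platform does not expose both clocks. The result is
    clamped at zero because the two reads are not atomic.
    """
    boottime_id = getattr(time, "CLOCK_BOOTTIME", None)
    monotonic_id = getattr(time, "CLOCK_MONOTONIC", None)
    if boottime_id is None or monotonic_id is None:
        return None
    try:
        # Read the monotonic clock first so boottime is never behind it
        # merely because of the delay between the two calls.
        mono = time.clock_gettime(monotonic_id)
        boot = time.clock_gettime(boottime_id)
    except OSError:
        return None
    return max(0.0, boot - mono)


if __name__ == "__main__":
    gap = suspended_seconds()
    if gap is None:
        print("CLOCK_BOOTTIME not available on this platform")
    else:
        print(f"time not counted by the monotonic clock: {gap:.1f} s")
```

On a machine that sleeps between keystrokes, the printed gap would be
most of the wall-clock time since boot, which is the weirdness
described above.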


Re: console font limits

2007-05-03 Thread Albert Cahalan

On 5/3/07, Jan Engelhardt <[EMAIL PROTECTED]> wrote:

> On May 3 2007 02:17, Albert Cahalan wrote:
>
> > Those sizes are unreadable on the 200 dpi OLPC XO screen,
>
> Hm that should have read, for you:
> I don't object implementing support for larger sizes.
> (But I wonder how that should work without FB/CVIDIX/SVGA/VESA extensions.)
>
> Note that I was assuming that no FB is used:


I'm assuming that the FB is used. Neither of my two
computers can do VGA text mode. Even for computers
which can do VGA text mode, if you want large fonts
(either by number of characters or by character width)
you need to use FB. That's just a requirement; anything
else would be insane.


> For everything beyond Latin, fbiterm should work a lot better.


Then, as with X, you have problems with kernel messages.
Reliably sending printk through a userspace console is not
even possible. (consider a panic, OOM, or runaway RT task)


Re: console font limits

2007-05-03 Thread Albert Cahalan

On 5/2/07, Jan Engelhardt <[EMAIL PROTECTED]> wrote:

> On May 1 2007 11:49, Albert Cahalan wrote:
>>
>> Well, I think the consensus is that anything beyond that should be done
>> in userspace; the main such console daemon was Kon2 last I checked.
>
> Font size is not a sane place to draw the line. Features are.
> The levels of support go something like this:
>
> 0. 7-bit ASCII
> 1. Simple direct-to-font VGA characters.
> 2. UTF-8 and large fonts, but no compositing or wide characters.
> 3. Simple compositing and double-wide characters. (like xterm)
> 4. Right-to-left. (like Kermit95)
> 5. Complex shaping, glyph substitution, and vertical text.
>
> Without large fonts, UTF-8 is 90% pointless bloat.

> Personally I don't even need #1, but I think anything less than #3 is
> really rude toward people outside of Europe+Americas. I especially hate
> to hear Europeans argue against this when they have 100% precomposed
> characters for themselves and appear to have played a role (via ISO votes)
> in denying stuff like the mere 12 precomposed characters needed to use
> the Yoruba language with simple renderers.


Note: I never suggested going beyond #3.


> 0. yes we want that
>
> 1. can't tell
>
> 2. utf8 yes, many text files are in that encoding.
>    large fonts - can't tell, I am fine with the regular vga
>    font infrastructure (8x16, 8x8)


Those sizes are unreadable on the 200 dpi OLPC XO screen,
and kind of icky on some of the really big desktop displays
when in native (framebuffer) mode. 200 dpi may be in your future.
Even the 32-pixel height limit is starting to be a problem.
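
To put numbers on that, here is the arithmetic for the physical size of
a character cell, written as a short script. The helper name is invented
for the example; the only inputs are the cell heights mentioned in this
thread and the 200 dpi figure, and the result is the height of the whole
cell, not of the visible glyph.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def cell_size(pixels, dpi):
    """Return the physical extent of `pixels` at `dpi` as (points, mm)."""
    inches = pixels / dpi
    return inches * POINTS_PER_INCH, inches * MM_PER_INCH


if __name__ == "__main__":
    for height in (8, 16, 32):
        pt, mm = cell_size(height, 200)
        print(f"{height:2d} px at 200 dpi = {pt:.2f} pt ({mm:.2f} mm)")
```

A 16-pixel cell comes out under 6 pt at 200 dpi, and even the 32-pixel
maximum is only about 11.5 pt.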


> 3. compositing - no, don't need that,
>    wide characters - does not even work in vga. just display a '??'
>    and everything is fine.


It's been shown to be workable, and it allows support for
some additional languages.
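
As a rough illustration of what level 3 asks of a renderer, the sketch
below assigns each code point zero, one, or two character cells using
Python's unicodedata module. It is a simplification and an assumption,
not how any console or xterm actually does it: real width tables also
handle combining marks whose combining class is zero, control
characters, and more. The Yoruba-style sequence shows why compositing
matters: e + dot below + acute has no single precomposed code point, so
normalization alone cannot reduce it to one glyph.

```python
import unicodedata


def cell_width(ch):
    """Approximate number of character cells occupied by one code point."""
    if unicodedata.combining(ch) != 0:
        return 0  # combining mark: drawn over the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide or fullwidth: needs a double-wide cell
    return 1


def text_columns(text):
    """Columns needed for `text` after NFC normalization."""
    return sum(cell_width(ch) for ch in unicodedata.normalize("NFC", text))


if __name__ == "__main__":
    samples = {
        "e + acute": "e\u0301",
        "e + dot below + acute": "e\u0323\u0301",
        "two CJK ideographs": "\u4e2d\u6587",
    }
    for label, s in samples.items():
        composed = unicodedata.normalize("NFC", s)
        print(f"{label}: {len(composed)} code point(s), "
              f"{text_columns(s)} column(s)")
```

The first sample collapses to a precomposed character and needs no
compositing; the second still carries a combining mark after
normalization; the third needs double-wide cells.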


> 4. I do not really think this has a future on VC.
>    You would also 'need' kerning and that serif combiner thing (complex
>    shaping?) for Arabic. At best, Arabic would look as horrible on VC
>    as it does in xterm today (no RTL, no serif combiner)


I agree. Hebrew is more doable, but probably not worth the effort
because of the rarity and because of the general lack of support
in text mode apps for such odd behavior. Very few emulators
support this; kermit95 is one of the few.


> 5. Vertical text - who else supports this please? Webpages in languages
>    that want to do TTB (top-to-bottom) scripting use html workarounds -
>    probably because TTB availability is not even guaranteed in a
>    web browser.


I hope you didn't think I was suggesting this. It's quite absurd.
"Complex shaping, glyph substitution, and vertical text." was the
full item listed. Vertical is the least troublesome of those issues,
and as far as I know has never been implemented.


> In short, the current console is very much OK.


I wouldn't say that. We suffer the bloat of all this UTF-8 stuff
without being able to load a decent-sized font to go with it.
We're stuck at 256 characters really, with the very lame option
of trading foreground color intensity control for an extra 256.

I think one could make a reasonable argument that all the
internationalization is bloat, and that thus UTF-8 should go.
Given that we do support UTF-8 though, allowing a font with
more than 256 characters (with foreground intensity control)
is obviously sensible.
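
The 256-versus-512 trade-off comes from the layout of a VGA-style text
cell: eight bits of character code plus an attribute byte, where
attribute bit 3 is either foreground intensity or, in 512-character
mode, the selector for the second font bank. The sketch below decodes
such a cell both ways. The function name and the returned dictionary are
made up for the example, and it models the classic VGA layout only, not
the internals of the Linux console code.

```python
def decode_vga_cell(cell, font_512=False):
    """Split a 16-bit VGA text cell into glyph index and colour fields.

    cell: low byte is the character code, high byte is the attribute.
    font_512: if True, attribute bit 3 extends the glyph index to 9 bits
    and is no longer available as foreground intensity.
    """
    code = cell & 0xFF           # character code, 0..255
    attr = (cell >> 8) & 0xFF    # attribute byte
    bit3 = (attr >> 3) & 1       # intensity bit or font-bank select
    if font_512:
        glyph = code | (bit3 << 8)   # 0..511
        bright = False               # intensity control is given up
    else:
        glyph = code
        bright = bool(bit3)
    return {
        "glyph": glyph,
        "fg": attr & 0x07,           # 3-bit foreground colour
        "bright": bright,
        "bg": (attr >> 4) & 0x07,    # 3-bit background colour
    }


if __name__ == "__main__":
    cell = 0x0F41  # 'A' with attribute 0x0F (bright white on black)
    print(decode_vga_cell(cell))
    print(decode_vga_cell(cell, font_512=True))
```

The same stored cell is either a bright 'A' from a 256-glyph font or
glyph 321 of a 512-glyph font at normal intensity, which is exactly the
trade being complained about.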

