Re: init's children list is long and slows reaping children.

2007-04-11 Thread Bill Davidsen

Eric W. Biederman wrote:

Bill Davidsen <[EMAIL PROTECTED]> writes:


As long as the original parent is preserved for getppid(). There are programs
out there which communicate between the parent and child with signals, and if
the original parent dies, it is undesirable to have the child call getppid() and
start sending signals to a program not expecting them. This invites undefined
behavior.


Then the programs are broken.  getppid is defined to change if the process
is reparented to init.


The short answer is that kthreads don't do this so it doesn't matter.

But user programs are NOT broken: currently getppid() returns either the 
original parent or init, and a program can identify init. Reparenting to 
another pid would not be easily noticed, and as SUS notes, no return 
values are reserved to indicate an error. So there's no way to check, and 
no need to worry about kthreads; I was prematurely paranoid.


Presumably user processes will still be reparented to init, so that's not 
an issue. Since there's no atomic signal_parent(), the ppid could change 
between getppid() and kill(), but that's never actually been a problem 
AFAIK.


Related: is there a benefit from having separate queues for original 
children of init and tasks reparented to init? Even on a server, would 
there be enough of them to save anything?


--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-11 Thread Oleg Nesterov
On 04/11, Bill Davidsen wrote:
>
> Oleg Nesterov wrote:
> >On 04/10, Eric W. Biederman wrote:
> >
> >>I'm trying to remember what the story is now.  There is a nasty
> >>race somewhere with reparenting, a threaded parent setting SIGCHLD to
> >>SIG_IGN, and non-default signals that results in a zombie that no one
> >>can wait for and reap.  It requires being reparented twice to trigger.
> >
> >reparent_thread:
> >
> > ...
> >
> >/* If we'd notified the old parent about this child's death,
> > * also notify the new parent.
> > */
> >        if (!traced && p->exit_state == EXIT_ZOMBIE &&
> >            p->exit_signal != -1 && thread_group_empty(p))
> >                do_notify_parent(p, p->exit_signal);
> >
> >We notified /sbin/init. If it ignores SIGCHLD, we should release the task.
> >We don't do this.
> >
> >The best fix I believe is to clean up the
> >forget_original_parent/reparent_thread interaction and factor out this
> >"exit_state == EXIT_ZOMBIE && exit_signal == -1" check.
> >
> As long as the original parent is preserved for getppid(). There are 
> programs out there which communicate between the parent and child with 
> signals, and if the original parent dies, it is undesirable to have the 
> child call getppid() and start sending signals to a program not expecting 
> them. This invites undefined behavior.

Sorry, I can't understand.

If p->exit_signal == -1 after do_notify_parent() above, the task is completely
dead. Nobody else can release it; we should do so (if EXIT_ZOMBIE). At this point
"p" has already been re-parented, but this (and getppid) doesn't matter at all.

OK. Most likely you meant something else. Could you clarify?

Oleg.



Re: init's children list is long and slows reaping children.

2007-04-11 Thread Eric W. Biederman
Bill Davidsen <[EMAIL PROTECTED]> writes:

> As long as the original parent is preserved for getppid(). There are programs
> out there which communicate between the parent and child with signals, and if
> the original parent dies, it is undesirable to have the child call getppid()
> and start sending signals to a program not expecting them. This invites
> undefined behavior.

Then the programs are broken.  getppid is defined to change if the process
is reparented to init.

Eric


Re: init's children list is long and slows reaping children.

2007-04-11 Thread Bill Davidsen

Oleg Nesterov wrote:

On 04/10, Eric W. Biederman wrote:


I'm trying to remember what the story is now.  There is a nasty
race somewhere with reparenting, a threaded parent setting SIGCHLD to
SIG_IGN, and non-default signals that results in a zombie that no one
can wait for and reap.  It requires being reparented twice to trigger.


reparent_thread:

...

/* If we'd notified the old parent about this child's death,
 * also notify the new parent.
 */
        if (!traced && p->exit_state == EXIT_ZOMBIE &&
            p->exit_signal != -1 && thread_group_empty(p))
                do_notify_parent(p, p->exit_signal);

We notified /sbin/init. If it ignores SIGCHLD, we should release the task.
We don't do this.

The best fix I believe is to clean up the forget_original_parent/reparent_thread
interaction and factor out this "exit_state == EXIT_ZOMBIE && exit_signal == -1"
check.

As long as the original parent is preserved for getppid(). There are 
programs out there which communicate between the parent and child with 
signals, and if the original parent dies, it is undesirable to have the 
child call getppid() and start sending signals to a program not expecting 
them. This invites undefined behavior.


--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



Re: init's children list is long and slows reaping children.

2007-04-11 Thread Nick Piggin

Jeff Garzik wrote:

Linus Torvalds wrote:



On Fri, 6 Apr 2007, Jeff Garzik wrote:


I would rather change the implementation under the hood to start per-CPU
threads on demand, similar to a thread-pool implementation.

Boxes with $BigNum CPUs probably won't ever use half of those threads.



The counter-argument is that boxes with $BigNum CPUs really don't 
hurt from it either, and having per-process data structures is often 
simpler and more efficient than trying to have some thread pool.



Two points here:

* A lot of the users in the current kernel tree don't rely on the 
per-CPU qualities.  They just need multiple threads running.


* Even with per-CPU data structures and code, you don't necessarily have 
to keep a thread alive and running for each CPU.  Reap the ones that 
haven't been used in $TimeFrame, and add thread creation to the slow 
path that already exists in the bowels of schedule_work().


Or if some kernel hacker is really motivated, all workqueue users in the 
kernel would benefit from a "thread audit", looking at working 
conditions to decide if the new kthread APIs are more appropriate.


Spawn on demand would require heuristics and complexity, though. And
I think there is barely any positive tradeoff to weigh against it.

IOW, once we get the processes off the global list, there just isn't 
any downside from them. Sure, they use some memory, but people who buy 
1024-cpu machines won't care about a few kB per CPU..


So the *only* downside is literally the process list, and one 
suggested patch already just removes kernel threads entirely from the 
parenthood lists.


The other potential downside could be "ps is slow", but on the other 
hand, having the things stick around and have things like CPU-time 
accumulate is probably worth it - if there are some issues, they'd 
show up properly accounted for in a way that process pools would have 
a hard time doing.



Regardless of how things are shuffled about internally, there will 
always be annoying overhead /somewhere/ when you have a metric ton of 
kernel threads.  I think that people should also be working on ways to 
make the kernel threads a bit more manageable for the average human.


There are a few per CPU, but they should need no human management to
speak of.

Presumably if you have a 1024 CPU system, you'd generally want to be
running at least 1024 of your own processes there, so you already need
some tools to handle that magnitude of processes anyway.

So I really don't think this is worth changing things over, apart from 
literally removing them from process lists, which I think everybody 
agrees we should just do - it just never even came up before!



I think there is a human downside.  For an admin you have to wade 
through a ton of processes on your machine, if you are attempting to 
evaluate the overall state of the machine.  Just google around for all 
the admins complaining about the explosion of kernel threads on 
production machines :)


User tools should be improved. It shouldn't be too hard to be able to
aggregate kernel thread stats into a single top entry, for example.

I'm not saying the number of threads couldn't be cut down, but there
would still be an order-of-magnitude problem there...

--
SUSE Labs, Novell Inc.


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Oeser
On Tuesday 10 April 2007, Jeff Garzik wrote:
> Thus, rather than forcing authors to make their code more complex, we 
> should find another solution.

What about something like the "pre-forking" concept? Just have a thread-creator
thread which checks the number of unused threads and keeps it within certain
limits.

Anything which needs a thread then simply queues up the work and specifies
that it wants a new thread, if possible.

One problem seems to be that a thread is nothing but a statement about which
other tasks I can wait on before doing my current one (e.g., I don't want to
msleep() twice on the same reset timeout). But we usually use locking to order
that.

Do I miss anything fundamental here?

Regards

Ingo Oeser


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Oeser
On Tuesday 10 April 2007, Jeff Garzik wrote:
> That's why I feel thread creation -- cheap under Linux -- is quite 
> appropriate for many of these situations.

Maybe that (thread creation) can be done at open(), socket creation, 
service request, syscall, or whatever event triggers a driver/subsystem 
to actually queue work into a thread.

And when there is a close(), socket destruction, service completion,
or whatever, these threads can be marked for destruction and destroyed
by a timer or even immediately.

Regards

Ingo Oeser

-- 
If something is getting cheap, it is getting wasted just because it is cheap.


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Davide Libenzi
On Tue, 10 Apr 2007, Bill Davidsen wrote:

> Davide Libenzi wrote:
> > On Mon, 9 Apr 2007, Linus Torvalds wrote:
> > 
> > > On Mon, 9 Apr 2007, Kyle Moffett wrote:
> > > > Maybe "struct posix_process" is more descriptive?  "struct process_posix"?
> > > > "Ugly POSIX process semantics data" seems simple enough to stick in a
> > > > struct name.  "struct uglyposix_process"?
> > > Guys, you didn't read my message.
> > > 
> > > It's *not* about "process" stuff.  Anything that tries to call it a
> > > "process" is *by*definition* worse than what it is now. Processes have all
> > > the things that we've cleanly separated out for filesystem, VM, SysV
> > > semaphore state, namespaces etc.
> > > 
> > > The "struct signal_struct" is the random *leftovers* from all the other
> > > stuff. It's *not* about "processes". Never has been, and never will be. 
> > 
> > I proposed "struct task_shared_ctx" but you ducked :)
> > 
> Descriptive, correct, I like it!

He's stubborn, he'll never accept patches. Must be a seasonal thing ;)



- Davide




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Matt Mackall
On Tue, Apr 10, 2007 at 03:05:56AM -0400, Jeff Garzik wrote:
> Andrew Morton wrote:
> >: root 3  0.0  0.0  0 0 ?  S  18:51   0:00 [watchdog/0]
> >
> >That's the softlockup detector.  Confusingly named to look like a, err,
> >watchdog.  Could probably use keventd.
> 
> I would think this would run into the keventd "problem", where $N 
> processes can lock out another?
> 
> IMO a lot of these could potentially be simply started as brand new 
> threads, when an exception arises.
> 
> 
> >: root 5  0.0  0.0  0 0 ?  S  18:51   0:00 [khelper]
> >
> >That's there to parent the kthread_create()d threads.  Could presumably use
> >khelper.
> >
> >: root   152  0.0  0.0  0 0 ?  S  18:51   0:00 [ata/0]
> >
> >Does it need to be per-cpu?
> 
> No, it does not.
> 
> It is used for PIO data transfer, so it merely has to respond quickly, 
> which rules out keventd.  You also don't want PIO data xfer for port A 
> blocked, sitting around waiting for PIO data xfer to complete on port C.
> 
> So, we merely need fast-reacting threads that keventd will not block. 
> We do not need per-CPU threads.
> 
> Again, I think a model where threads are created on demand, and reaped 
> after inactivity, would fit here.  As I feel it would fit with many 
> other non-libata drivers.
> 
> 
> >: root   153  0.0  0.0  0 0 ?  S  18:51   0:00 [ata_aux]
> >
> >That's a single-threaded workqueue handler.  Perhaps could use keventd.
> 
> That is used by libata exception handler, for hotpug and such.  My main 
> worry with keventd is that we might get stuck behind an unrelated 
> process for an undefined length of time.
> 
> IMO the best model would be to create ata_aux thread on demand, and kill 
> it if it hasn't been used recently.
> 
> 
> 
> >: root   299  0.0  0.0  0 0 ?  S  18:51   0:00 [scsi_eh_0]
> >: root   300  0.0  0.0  0 0 ?  S  18:51   0:00 [scsi_eh_1]
> >: root   305  0.0  0.0  0 0 ?  S  18:51   0:00 [scsi_eh_2]
> >: root   306  0.0  0.0  0 0 ?  S  18:51   0:00 [scsi_eh_3]
> >
> >This machine has one CPU, one sata disk and one DVD drive.  The above is
> >hard to explain.
> 
> Nod.  I've never thought we needed this many threads.  At least it 
> doesn't scale out of control for $BigNum-CPU boxen.
> 
> As the name implies, this is SCSI exception handling thread.  Although 
> some synchronization is required, this could probably work with an 
> on-demand thread creation model too.
> 
> 
> >: root   319  0.0  0.0  0 0 ?  S  18:51   0:00 [pccardd]
> >
> >hm.
> >
> >: root   331  0.0  0.0  0 0 ?  S  18:51   0:00 [kpsmoused]
> >
> >hm.
> 
> This kernel thread is used as a "bottom half" handler for the PS2 mouse 
> interrupt.  This one is a bit more justifiable.
> 
> 
> >: root   337  0.0  0.0  0 0 ?  S  18:51   0:00 [kedac]
> >
> >hm.  I didn't know that the Vaio had EDAC.
> >
> >: root  1173  0.0  0.0  0 0 ?  S  18:51   0:00 [khpsbpkt]
> >
> >I can't even pronounce that.
> >
> >: root  1354  0.0  0.0  0 0 ?  S  18:51   0:00 [knodemgrd_0]
> >
> >OK, I do have 1394 hardware, but it hasn't been used.
> >
> >: root  1636  0.0  0.0  0 0 ?  S  18:52   0:00 [kondemand/0]
> >
> >I blame davej.
> >
> >> > otoh, a lot of these inefficiencies are probably down in scruffy
> >> > drivers rather than in core or top-level code.
> >>
> >>You say scruffy, but most of the proliferation of kthreads comes
> >>from code written in the last few years.  Compare the explosion of 
> >>kthreads
> >>we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
> >>slowing down at all.
> >>
> >>On the 2-way box I grabbed the above ps output from, I end up with 69 
> >>kthreads.
> >>It doesn't surprise me at all that bigger iron is starting to see issues.
> >>
> >
> >Sure.
> >
> >I don't think it's completely silly to object to all this.  Sure, a kernel
> >thread is worth 4k in the best case, but I bet they have associated unused
> >resources and as we've seen, they can cause overhead.
> 
> Agreed.

There are some upsides to persistent kernel threads that we might want
to keep in mind:

- they can be reniced, made RT, etc. as needed
- they can be bound to CPUs
- they collect long-term CPU usage information

Most of the above doesn't matter for the average kernel thread that's
handling the occasional hardware reset, but for others it could.

-- 
Mathematics is the supreme nostalgia of our time.


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Oleg Nesterov
On 04/10, Eric W. Biederman wrote:

> I'm trying to remember what the story is now.  There is a nasty
> race somewhere with reparenting, a threaded parent setting SIGCHLD to
> SIG_IGN, and non-default signals that results in a zombie that no one
> can wait for and reap.  It requires being reparented twice to trigger.

reparent_thread:

...

/* If we'd notified the old parent about this child's death,
 * also notify the new parent.
 */
        if (!traced && p->exit_state == EXIT_ZOMBIE &&
            p->exit_signal != -1 && thread_group_empty(p))
                do_notify_parent(p, p->exit_signal);

We notified /sbin/init. If it ignores SIGCHLD, we should release the task.
We don't do this.

The best fix I believe is to clean up the forget_original_parent/reparent_thread
interaction and factor out this "exit_state == EXIT_ZOMBIE && exit_signal == -1"
check.

Oleg.



Re: init's children list is long and slows reaping children.

2007-04-10 Thread Eric W. Biederman
Ingo Molnar <[EMAIL PROTECTED]> writes:

> * Eric W. Biederman <[EMAIL PROTECTED]> wrote:
>
>> > so ... is anyone pursuing this? This would allow us to make 
>> > sys_wait4() faster and more scalable: no tasklist_lock bouncing for 
>> > example.
>> 
>> which part?
>
> all of it :) Everything you mentioned makes sense quite a bit. The 
> thread signal handling of do_wait was added in a pretty arbitrary 
> fashion so i doubt there are strong requirements in that area. Apps 
> might have grown to get used to it meanwhile though, so we've got to do 
> it carefully.

I'm looking at it, if only because there is a reasonable chance doing this
will fix the races with a threaded init.

However I just found something nasty.  The wait __WNOTHREAD flag.

And my quick search seems to find at least one user-space application
that uses it, and it is widely documented, so I suspect there are
others :(

I played with moving the lists into signal_struct, and short of
architecture specific users of task->children all I had to touch
were:

 include/linux/init_task.h |    2 +-
 include/linux/sched.h     |    5 +-
 kernel/exit.c             |  159 +
 kernel/fork.c             |    2 +-
 mm/oom_kill.c             |    4 +-

So it should be relatively easy to change this child lists around...

Eric


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Molnar

* Eric W. Biederman <[EMAIL PROTECTED]> wrote:

> > so ... is anyone pursuing this? This would allow us to make 
> > sys_wait4() faster and more scalable: no tasklist_lock bouncing for 
> > example.
> 
> which part?

all of it :) Everything you mentioned makes sense quite a bit. The 
thread signal handling of do_wait was added in a pretty arbitrary 
fashion so i doubt there are strong requirements in that area. Apps 
might have grown to get used to it meanwhile though, so we've got to do 
it carefully.

Ingo


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Eric W. Biederman
Ingo Molnar <[EMAIL PROTECTED]> writes:

> * Eric W. Biederman <[EMAIL PROTECTED]> wrote:
>
>> > on a second thought: the p->children list is needed for the whole 
>> > child/parent task tree, which is needed for sys_getppid().
>> 
>> Yes, something Oleg said made me realize that.
>> 
>> As long as the reparenting isn't too complex, it isn't required that we 
>> have exactly one list.
>> 
>> > The question is, does anything require us to reparent to within the 
>> > same thread group?
>> 
>> I think my head is finally on straight about this question.
>> 
>> Currently there is the silly Linux-specific parent death signal 
>> (pdeath_signal).  Oleg's memory was better than mine on this score.
>> 
>> However there is no indication that the parent death signal being sent 
>> when a thread leader dies is actually correct, or even interesting. It 
>> probably should only be sent when getppid changes.
>> 
>> So with pdeath_signal fixed there is nothing that requires us to 
>> reparent within the same thread group.
>> 
>> I'm trying to remember what the story is now.  There is a nasty race 
>> somewhere with reparenting, a threaded parent setting SIGCHLD to 
>> SIG_IGN, and non-default signals that results in a zombie that no one 
>> can wait for and reap.  It requires being reparented twice to trigger.
>> 
>> Anyway it is a real mess and if we can remove the stupid multi-headed 
>> child lists things would become much simpler and the problem could not 
>> occur.
>> 
>> Plus the code would become much simpler...
>> 
>> utrace appears to have removed the ptrace_children list and the 
>> special cases that entailed.
>
> so ... is anyone pursuing this? This would allow us to make sys_wait4() 
> faster and more scalable: no tasklist_lock bouncing for example.

which part?

Eric


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Bill Davidsen

Davide Libenzi wrote:

On Mon, 9 Apr 2007, Linus Torvalds wrote:


On Mon, 9 Apr 2007, Kyle Moffett wrote:

Maybe "struct posix_process" is more descriptive?  "struct process_posix"?
"Ugly POSIX process semantics data" seems simple enough to stick in a struct
name.  "struct uglyposix_process"?

Guys, you didn't read my message.

It's *not* about "process" stuff.  Anything that tries to call it a 
"process" is *by*definition* worse than what it is now. Processes have all 
the things that we've cleanly separated out for filesystem, VM, SysV 
semaphore state, namespaces etc.


The "struct signal_struct" is the random *leftovers* from all the other 
stuff. It's *not* about "processes". Never has been, and never will be. 


I proposed "struct task_shared_ctx" but you ducked :)


Descriptive, correct, I like it!

--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Molnar

* Eric W. Biederman <[EMAIL PROTECTED]> wrote:

> > on a second thought: the p->children list is needed for the whole 
> > child/parent task tree, which is needed for sys_getppid().
> 
> Yes, something Oleg said made me realize that.
> 
> As long as the reparent isn't too complex, it isn't required that we 
> have exactly one list.
> 
> > The question is, does anything require us to reparent to within the 
> > same thread group?
> 
> I think my head is finally on straight about this question.
> 
> Currently there is the silly linux specific parent death signal 
> (pdeath_signal).  Oleg's memory was better than mine on this score.
> 
> However there is no indication that the parent death signal being sent 
> when a thread leader dies is actually correct, or even interesting. It 
> probably should only be sent when getppid changes.
> 
> So with pdeath_signal fixed there is nothing that requires us to 
> reparent within the same thread group.
> 
> I'm trying to remember what the story is now.  There is a nasty race 
> somewhere with reparenting, a threaded parent setting SIGCHLD to 
> SIG_IGN, and non-default signals that results in a zombie that no one 
> can wait for and reap.  It requires being reparented twice to trigger.
> 
> Anyway it is a real mess and if we can remove the stupid multi-headed 
> child lists things would become much simpler and the problem could not 
> occur.
> 
> Plus the code would become much simpler...
> 
> utrace appears to have removed the ptrace_children list and the 
> special cases that entailed.

so ... is anyone pursuing this? This would allow us to make sys_wait4() 
faster and more scalable: no tasklist_lock bouncing for example.

Ingo


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Eric W. Biederman
Oleg Nesterov <[EMAIL PROTECTED]> writes:

> No! That is why I suggested (long ago, in fact) moving ->children into
> ->signal_struct. When a sub-thread forks, we set ->parent = group_leader.
> We don't need forget_original_parent() until the last thread exits. This
> also simplifies do_wait().
>
> However, this breaks the current ->pdeath_signal behaviour. In fact (and
> Eric thinks the same) this _fixes_ this behaviour, but may break things.

Thinking about this.  As contingency planning, if there is something in user
space that actually cares about pdeath_signal from a threaded
parent, we can add a pdeath_signal list to the task_struct and get
rid of the rest of the complexity.

I think I want to wait until someone screams first.

This very much means that we can remove the complexity of
a per-thread ->children without fear of having to revert everything.

Eric


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Eric W. Biederman
Ingo Molnar <[EMAIL PROTECTED]> writes:

> * Eric W. Biederman <[EMAIL PROTECTED]> wrote:
>
>> Ingo Molnar <[EMAIL PROTECTED]> writes:
>> 
>> > no. Two _completely separate_ lists.
>> >
>> > i.e. a to-be-reaped task will still be on the main list _too_. The 
>> > main list is for all the PID semantics rules. The reap-list is just 
>> > for wait4() processing. The two would be completely separate.
>> 
>> And what pray tell except for heuristics is the list of children used 
>> for?
>
> on a second thought: the p->children list is needed for the whole 
> child/parent task tree, which is needed for sys_getppid().

Yes, something Oleg said made me realize that.

As long as the reparent isn't too complex, it isn't required that we
have exactly one list.

> The question 
> is, does anything require us to reparent to within the same thread 
> group?

I think my head is finally on straight about this question.

Currently there is the silly linux specific parent death signal
(pdeath_signal).  Oleg's memory was better than mine on this score.

However there is no indication that the parent death signal being sent
when a thread leader dies is actually correct, or even interesting.
It probably should only be sent when getppid changes.

So with pdeath_signal fixed there is nothing that requires us to
reparent within the same thread group.

I'm trying to remember what the story is now.  There is a nasty
race somewhere with reparenting, a threaded parent setting SIGCHLD to
SIG_IGN, and non-default signals that results in a zombie that no one
can wait for and reap.  It requires being reparented twice to trigger.

Anyway it is a real mess and if we can remove the stupid multi-headed
child lists things would become much simpler and the problem could
not occur.

Plus the code would become much simpler...

utrace appears to have removed the ptrace_children list and the
special cases that entailed.

Eric



Re: init's children list is long and slows reaping children.

2007-04-10 Thread Oleg Nesterov
On 04/10, Ingo Molnar wrote:
> 
> * Eric W. Biederman <[EMAIL PROTECTED]> wrote:
> 
> > Ingo Molnar <[EMAIL PROTECTED]> writes:
> > 
> > > no. Two _completely separate_ lists.
> > >
> > > i.e. a to-be-reaped task will still be on the main list _too_. The 
> > > main list is for all the PID semantics rules. The reap-list is just 
> > > for wait4() processing. The two would be completely separate.
> > 
> > And what pray tell except for heuristics is the list of children used 
> > for?
> 
> on a second thought: the p->children list is needed for the whole 
> child/parent task tree, which is needed for sys_getppid(). The question 
> is, does anything require us to reparent to within the same thread 
> group?

No! That is why I suggested (long ago, in fact) moving ->children into
->signal_struct. When a sub-thread forks, we set ->parent = group_leader.
We don't need forget_original_parent() until the last thread exits. This
also simplifies do_wait().

However, this breaks the current ->pdeath_signal behaviour. In fact (and
Eric thinks the same) this _fixes_ this behaviour, but may break things.

Oleg.



Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Molnar

* Eric W. Biederman <[EMAIL PROTECTED]> wrote:

> Ingo Molnar <[EMAIL PROTECTED]> writes:
> 
> > no. Two _completely separate_ lists.
> >
> > i.e. a to-be-reaped task will still be on the main list _too_. The 
> > main list is for all the PID semantics rules. The reap-list is just 
> > for wait4() processing. The two would be completely separate.
> 
> And what pray tell except for heuristics is the list of children used 
> for?

on a second thought: the p->children list is needed for the whole 
child/parent task tree, which is needed for sys_getppid(). The question 
is, does anything require us to reparent to within the same thread 
group?

Ingo


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Robin Holt
On Mon, Apr 09, 2007 at 06:48:54PM -0600, Eric W. Biederman wrote:
> Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> > I suspect there are quite a few kernel threads which don't really need to
> > be threads at all: the code would quite happily work if it was changed to
> > use keventd, via schedule_work() and friends.  But kernel threads are
> > somewhat easier to code for.
> >
> > I also suspect that there are a number of workqueue threads which
> > could/should have used create_singlethread_workqueue().  Often this is
> > because the developer just didn't think to do it.
> >
> > otoh, a lot of these inefficiencies are probably down in scruffy drivers
> > rather than in core or top-level code.
> >
> > <wonders where these presumably-not-using-kthread kernel threads are coming from>
> 
> 
> From another piece of this thread.
> 
> > > Robin how many kernel thread per cpu are you seeing?
> > 
> > 10.
> > 
> > FYI, pid 1539 is kthread.
> > 
> > a01:~ # ps -ef | egrep "\[.*\/255\]" 
> > root   512 1  0 Apr08 ?00:00:00 [migration/255]
> > root   513 1  0 Apr08 ?00:00:00 [ksoftirqd/255]
> > root  1281 1  0 Apr08 ?00:00:02 [events/255]
> > root  2435  1539  0 Apr08 ?00:00:00 [kblockd/255]
> > root  3159  1539  0 Apr08 ?00:00:00 [aio/255]
> > root  4007  1539  0 Apr08 ?00:00:00 [cqueue/255]
> > root  8653  1539  0 Apr08 ?00:00:00 [ata/255]
> > root 17438  1539  0 Apr08 ?00:00:00 [xfslogd/255]
> > root 17950  1539  0 Apr08 ?00:00:00 [xfsdatad/255]
> > root 18426  1539  0 Apr08 ?00:00:00 [rpciod/255]
> 
> 
> So it looks like there were about 1500 kernel threads that started up before
> kthread started.

I should have been more clear: this is from that 4096-cpu system broken
down into a 512-cpu partition.  This is the configuration in which the
customer will receive the machine.  The 4096 was just to see if it worked.

> So the kernel threads appear to have init as their parent because
> they started before kthread, for the most part.
> 
> At 10 kernel threads per cpu there may be a little bloat but it isn't
> out of control.  It is mostly that we are observing the kernel as
> NR_CPUS approaches infinity.  4096 isn't infinity yet but it's easily
> a 1000-fold bigger than most people are used to :)
> 
> Eric


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Ingo Molnar wrote:

* Russell King <[EMAIL PROTECTED]> wrote:

One per PC card socket to avoid the sysfs locking crappyness that 
would otherwise deadlock, and to convert from the old unreadable state 
machine implementation to a much more readable linearly coded 
implementation.


Could probably be eliminated if we had some mechanism to spawn a 
helper thread to do some task as required which didn't block other 
helper threads until it completes.


looks like the perfect usecase for threadlets. (threadlets only use up a 
separate context if necessary and can be coded in the familiar 
sequential/linear model)


Same response as to Andrew:  AFAICS that just increases complexity.

The simple path for programmers is writing straightforward code that 
does something like


blah
msleep()
blah

or in pccardd's case,

mutex_lock()
blah
mutex_unlock()

to permit sleeping without having to write more-complex code that deals 
with context transitions.


For slow-path, infrequently executed code, it is best to keep it as 
simple as possible.


Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Molnar

* Russell King <[EMAIL PROTECTED]> wrote:

> One per PC card socket to avoid the sysfs locking crappyness that 
> would otherwise deadlock, and to convert from the old unreadable state 
> machine implementation to a much more readable linearly coded 
> implementation.
> 
> Could probably be eliminated if we had some mechanism to spawn a 
> helper thread to do some task as required which didn't block other 
> helper threads until it completes.

looks like the perfect usecase for threadlets. (threadlets only use up a 
separate context if necessary and can be coded in the familiar 
sequential/linear model)

(btw., threadlets could in theory be executed in irq context too, and if 
we block on anything it gets bounced off to a real context - although 
this certainly pushes the limits and there would still be some deadlock 
potential for things like irq-unsafe non-sleeping locks (spinlocks, 
rwlocks).)

Ingo


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Andrew Morton wrote:

Well that obviously would be a dumb way to use keventd.  One would need
to do schedule_work(), kick off the reset then do schedule_delayed_work()
to wait (or poll) for its termination.


Far too complex.  See what Russell wrote, for instance.

When you are in a kernel thread, you can write more simple, 
straightforward, easy-to-debug code that does


blah
msleep()
blah

rather than creating an insanely complicated state machine for the same 
thing.


ESPECIALLY if you are already inside a state machine (the case with 
libata PIO data xfer), you do not want to add another state machine 
inside of that.


A great many uses of kernel threads are to simplify device reset and 
polling in this way.  I know; a year ago I audited every kernel thread, 
because I was so frustrated with the per-CPU thread explosion.


Thus, rather than forcing authors to make their code more complex, we 
should find another solution.


Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Andrew Morton
On Tue, 10 Apr 2007 04:33:57 -0400 Jeff Garzik <[EMAIL PROTECTED]> wrote:

> Andrew Morton wrote:
> > On Tue, 10 Apr 2007 03:05:56 -0400 Jeff Garzik <[EMAIL PROTECTED]> wrote:
> > 
> >> My main 
> >> worry with keventd is that we might get stuck behind an unrelated 
> >> process for an undefined length of time.
> > 
> > I don't think it has ever been demonstrated that keventd latency is
> > excessive, or a problem.  I guess we could instrument it and fix stuff
> > easily enough.
> 
> It's simple math, combined with user expectations.
> 
> On a 1-CPU or 2-CPU box, if you have three or more tasks, all of which 
> are doing hardware reset tasks that could take 30-60 seconds (realistic 
> for libata, SCSI and network drivers, at least), then OBVIOUSLY you have 
> other tasks blocked for that length of time.

Well that obviously would be a dumb way to use keventd.  One would need
to do schedule_work(), kick off the reset then do schedule_delayed_work()
to wait (or poll) for its termination.


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Andrew Morton wrote:

On Tue, 10 Apr 2007 03:05:56 -0400 Jeff Garzik <[EMAIL PROTECTED]> wrote:

My main 
worry with keventd is that we might get stuck behind an unrelated 
process for an undefined length of time.


I don't think it has ever been demonstrated that keventd latency is
excessive, or a problem.  I guess we could instrument it and fix stuff
easily enough.


It's simple math, combined with user expectations.

On a 1-CPU or 2-CPU box, if you have three or more tasks, all of which 
are doing hardware reset tasks that could take 30-60 seconds (realistic 
for libata, SCSI and network drivers, at least), then OBVIOUSLY you have 
other tasks blocked for that length of time.


Since the cause of the latency is msleep() -- the entire reason why the 
driver wanted to use a kernel thread in the first place -- it would seem 
to me that the simple fix is to start a new thread, possibly exceeding 
the number of CPUs in the box.




The main problem with keventd has been flush_scheduled_work() deadlocks: the


That's been a problem in the past, yes, but a minor one.

I'm talking about a key conceptual problem with keventd.

It is easy to see how an extra-long tg3 hardware reset might prevent a 
disk hotplug event from being processed for 30-60 seconds.  And as 
hardware gets more complex -- see the Intel IOP storage card which runs 
Linux -- the reset times get longer, too.


So the issue is /not/ deadlocks.



The thing to concentrate on here is the per-cpu threads, which is where the
proliferation appears to be coming from.


Strongly agreed.

Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Russell King wrote:

Could probably be eliminated if we had some mechanism to spawn a helper
thread to do some task as required which didn't block other helper
threads until it completes.


kthread_run() should do that for you.  Creates a new thread with 
kthread_create(), and wakes it up immediately.  Goes away when you're 
done with it.


Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Russell King
On Mon, Apr 09, 2007 at 07:30:56PM -0700, Andrew Morton wrote:
> : root   319  0.0  0.0  0 0 ?S18:51   0:00 [pccardd]
> 
> hm.

One per PC card socket to avoid the sysfs locking crappyness that would
otherwise deadlock, and to convert from the old unreadable state machine
implementation to a much more readable linearly coded implementation.

Could probably be eliminated if we had some mechanism to spawn a helper
thread to do some task as required which didn't block other helper
threads until it completes.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Andrew Morton
On Tue, 10 Apr 2007 03:05:56 -0400 Jeff Garzik <[EMAIL PROTECTED]> wrote:

> My main 
> worry with keventd is that we might get stuck behind an unrelated 
> process for an undefined length of time.

I don't think it has ever been demonstrated that keventd latency is
excessive, or a problem.  I guess we could instrument it and fix stuff
easily enough.

The main problem with keventd has been flush_scheduled_work() deadlocks: the
caller of flush_scheduled_work() wants to flush work item A, but holds some
lock which is also needed by unrelated work item B.  Most of the time, it
works.  But if item B happens to be queued the flush_scheduled_work() will
deadlock.

The fix is to flush-and-cancel just item A: if it's not running yet, cancel
it.  If it is running, wait until it has finished.  Oleg's

void cancel_work_sync(struct work_struct *work)

is queued for 2.6.22 and should permit some kthread->keventd conversions
which would previously been deadlocky.


The thing to concentrate on here is the per-cpu threads, which is where the
proliferation appears to be coming from.  Conversions to
schedule_work()+cancel_work_sync() and conversions to
create_singlethread_workqueue().




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Torsten Kaiser wrote:

One thread per port, not per device.

 796 ?S  0:00  \_ [scsi_eh_0]
 797 ?S  0:00  \_ [scsi_eh_1]
 798 ?S  0:00  \_ [scsi_eh_2]
 819 ?S  0:00  \_ [scsi_eh_3]
 820 ?S  0:00  \_ [scsi_eh_4]
 824 ?S  0:00  \_ [scsi_eh_5]
 825 ?S  0:14  \_ [scsi_eh_6]

bardioc ~ # lsscsi -d
[0:0:0:0]diskATA  ST3160827AS  3.42  /dev/sda[8:0]
[1:0:0:0]diskATA  ST3160827AS  3.42  /dev/sdb[8:16]
[5:0:0:0]diskATA  IBM-DHEA-36480   HE8O  /dev/sdc[8:32]
[5:0:1:0]diskATA  Maxtor 6L160P0   BAH4  /dev/sdd[8:48]
[6:0:0:0]cd/dvd  HL-DT-ST DVDRAM GSA-4081B A100  /dev/sr0[11:0]
bardioc ~ # lsscsi -H
[0]sata_promise
[1]sata_promise
[2]sata_promise
[3]sata_via
[4]sata_via
[5]pata_via
[6]pata_via

The bad part is that there is always a thread, even if the hardware is not
even hotplug capable.
Don't know if the thread is even needed for hotplug...


Nope, it's not.  At least for SATA (your chosen examples), hotplug is 
handled by a libata-specific thread.


The SCSI EH threads are there purely for SCSI exception handling.  For 
the majority of SAS and SATA, we replace the entire SCSI EH handling 
code with our own, making those threads less useful than on older (read: 
majority) of SCSI drivers.


Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Linus Torvalds wrote:


On Mon, 9 Apr 2007, Andrew Morton wrote:

   10 ?S< 0:00 [khelper]

That one's needed to parent the call_usermodehelper() apps.  I don't think
it does anything else.  We used to use keventd for this but that had some
problem whcih I forget.


I think it was one of a long series of deadlocks. 

Using a "keventd" for many different things sounds clever and nice, but 
then sucks horribly when one event triggers another event, and they depend 
on each other. Solution: use independent threads for the events.


Nod.  That's the key problem with keventd.  Independent things must wait 
on each other.


That's why I feel thread creation -- cheap under Linux -- is quite 
appropriate for many of these situations.


Jeff





Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Andrew Morton wrote:

: root 3  0.0  0.0  0 0 ?S18:51   0:00 [watchdog/0]

That's the softlockup detector.  Confusingly named to look like a, err,
watchdog.  Could probably use keventd.


I would think this would run into the keventd "problem", where $N 
processes can lock out another?


IMO a lot of these could potentially be simply started as brand new 
threads, when an exception arises.




: root 5  0.0  0.0  0 0 ?S18:51   0:00 [khelper]

That's there to parent the kthread_create()d threads.  Could presumably use
keventd.

: root   152  0.0  0.0  0 0 ?S18:51   0:00 [ata/0]

Does it need to be per-cpu?


No, it does not.

It is used for PIO data transfer, so it merely has to respond quickly, 
which rules out keventd.  You also don't want PIO data xfer for port A 
blocked, sitting around waiting for PIO data xfer to complete on port C.


So, we merely need fast-reacting threads that keventd will not block. 
We do not need per-CPU threads.


Again, I think a model where threads are created on demand, and reaped 
after inactivity, would fit here.  As I feel it would fit with many 
other non-libata drivers.




: root   153  0.0  0.0  0 0 ?S18:51   0:00 [ata_aux]

That's a single-threaded workqueue handler.  Perhaps could use keventd.


That is used by libata exception handler, for hotpug and such.  My main 
worry with keventd is that we might get stuck behind an unrelated 
process for an undefined length of time.


IMO the best model would be to create ata_aux thread on demand, and kill 
it if it hasn't been used recently.





: root   299  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_0]
: root   300  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_1]
: root   305  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_2]
: root   306  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_3]

This machine has one CPU, one sata disk and one DVD drive.  The above is
hard to explain.


Nod.  I've never thought we needed this many threads.  At least it 
doesn't scale out of control for $BigNum-CPU boxen.


As the name implies, this is SCSI exception handling thread.  Although 
some synchronization is required, this could probably work with an 
on-demand thread creation model too.




: root   319  0.0  0.0  0 0 ?S18:51   0:00 [pccardd]

hm.

: root   331  0.0  0.0  0 0 ?S18:51   0:00 [kpsmoused]

hm.


This kernel thread is used as a "bottom half" handler for the PS2 mouse 
interrupt.  This one is a bit more justifiable.




: root   337  0.0  0.0  0 0 ?S18:51   0:00 [kedac]

hm.  I didn't know that the Vaio had EDAC.

: root  1173  0.0  0.0  0 0 ?S18:51   0:00 [khpsbpkt]

I can't even pronounce that.

: root  1354  0.0  0.0  0 0 ?S18:51   0:00 [knodemgrd_0]

OK, I do have 1394 hardware, but it hasn't been used.

: root  1636  0.0  0.0  0 0 ?S18:52   0:00 [kondemand/0]

I blame davej.


 > otoh, a lot of these inefficiencies are probably down in scruffy drivers
 > rather than in core or top-level code.

You say scruffy, but most of the proliferation of kthreads comes
from code written in the last few years.  Compare the explosion of kthreads
we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
slowing down at all.

On the 2-way box I grabbed the above ps output from, I end up with 69 kthreads.
It doesn't surprise me at all that bigger iron is starting to see issues.



Sure.

I don't think it's completely silly to object to all this.  Sure, a kernel
thread is worth 4k in the best case, but I bet they have associated unused
resources and as we've seen, they can cause overhead.


Agreed.

Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Eric W. Biederman wrote:

At 10 kernel threads per cpu there may be a little bloat but it isn't
out of control.  It is mostly that we are observing the kernel as
NR_CPUS approaches infinity.  4096 isn't infinity yet but it's easily
a 1000-fold bigger than most people are used to :)



I disagree that there is only a little bloat:  the current mechanism in place 
does not scale as NR_CPUS increases, as this thread demonstrates.


Beyond a certain point, on an 8-CPU box, it gets silly.  You certainly 
don't need eight kblockd threads or eight ata threads.


Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Torsten Kaiser

On 4/10/07, Andrew Morton <[EMAIL PROTECTED]> wrote:

: root   299  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_0]
: root   300  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_1]
: root   305  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_2]
: root   306  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_3]

This machine has one CPU, one sata disk and one DVD drive.  The above is
hard to explain.


One thread per port, not per device.

 796 ?S  0:00  \_ [scsi_eh_0]
 797 ?S  0:00  \_ [scsi_eh_1]
 798 ?S  0:00  \_ [scsi_eh_2]
 819 ?S  0:00  \_ [scsi_eh_3]
 820 ?S  0:00  \_ [scsi_eh_4]
 824 ?S  0:00  \_ [scsi_eh_5]
 825 ?S  0:14  \_ [scsi_eh_6]

bardioc ~ # lsscsi -d
[0:0:0:0]diskATA  ST3160827AS  3.42  /dev/sda[8:0]
[1:0:0:0]diskATA  ST3160827AS  3.42  /dev/sdb[8:16]
[5:0:0:0]diskATA  IBM-DHEA-36480   HE8O  /dev/sdc[8:32]
[5:0:1:0]diskATA  Maxtor 6L160P0   BAH4  /dev/sdd[8:48]
[6:0:0:0]cd/dvd  HL-DT-ST DVDRAM GSA-4081B A100  /dev/sr0[11:0]
bardioc ~ # lsscsi -H
[0]sata_promise
[1]sata_promise
[2]sata_promise
[3]sata_via
[4]sata_via
[5]pata_via
[6]pata_via

The bad part is that there is always a thread, even if the hardware is not
even hotplug capable.
Don't know if the thread is even needed for hotplug...


I don't think it's completely silly to object to all this.  Sure, a kernel
thread is worth 4k in the best case, but I bet they have associated unused
resources and as we've seen, they can cause overhead.


For me it's not the 4k that annoys me, but the clutter in ps or top.

Torsten




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Eric W. Biederman wrote:

At 10 kernel threads per cpu there may be a little bloat but it isn't
out of control.  It is mostly that we are observing the kernel as
NR_CPUS approaches infinity.  4096 isn't infinity yet but it's easily
a 1000-fold bigger than most people are used to :)



I disagree there is only a little bloat:  the current mechanism in place 
does not scale as NR_CPUS increases, as this thread demonstrates.


Beyond a certain point, on an 8-CPU box, it gets silly.  You certainly 
don't need eight kblockd threads or eight ata threads.


Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Andrew Morton wrote:

: root 3  0.0  0.0  0 0 ?  S  18:51   0:00 [watchdog/0]

That's the softlockup detector.  Confusingly named to look like a, err,
watchdog.  Could probably use keventd.


I would think this would run into the keventd problem, where $N 
processes can lock out another?


IMO a lot of these could potentially be simply started as brand new 
threads, when an exception arises.




: root 5  0.0  0.0  0 0 ?  S  18:51   0:00 [kthread]

That's there to parent the kthread_create()d threads.  Could presumably use
khelper.

: root   152  0.0  0.0  0 0 ?  S  18:51   0:00 [ata/0]

Does it need to be per-cpu?


No, it does not.

It is used for PIO data transfer, so it merely has to respond quickly, 
which rules out keventd.  You also don't want PIO data xfer for port A 
blocked, sitting around waiting for PIO data xfer to complete on port C.


So, we merely need fast-reacting threads that keventd will not block. 
We do not need per-CPU threads.


Again, I think a model where threads are created on demand, and reaped 
after inactivity, would fit here.  As I feel it would fit with many 
other non-libata drivers.




: root   153  0.0  0.0  0 0 ?  S  18:51   0:00 [ata_aux]

That's a single-threaded workqueue handler.  Perhaps could use keventd.


That is used by the libata exception handler, for hotplug and such.  My main 
worry with keventd is that we might get stuck behind an unrelated 
process for an undefined length of time.


IMO the best model would be to create ata_aux thread on demand, and kill 
it if it hasn't been used recently.





: root   299  0.0  0.0  0 0 ?  S  18:51   0:00 [scsi_eh_0]
: root   300  0.0  0.0  0 0 ?  S  18:51   0:00 [scsi_eh_1]
: root   305  0.0  0.0  0 0 ?  S  18:51   0:00 [scsi_eh_2]
: root   306  0.0  0.0  0 0 ?  S  18:51   0:00 [scsi_eh_3]

This machine has one CPU, one sata disk and one DVD drive.  The above is
hard to explain.


Nod.  I've never thought we needed this many threads.  At least it 
doesn't scale out of control for $BigNum-CPU boxen.


As the name implies, this is SCSI exception handling thread.  Although 
some synchronization is required, this could probably work with an 
on-demand thread creation model too.




: root   319  0.0  0.0  0 0 ?  S  18:51   0:00 [pccardd]

hm.

: root   331  0.0  0.0  0 0 ?  S  18:51   0:00 [kpsmoused]

hm.


This kernel thread is used as a bottom half handler for the PS2 mouse 
interrupt.  This one is a bit more justifiable.




: root   337  0.0  0.0  0 0 ?  S  18:51   0:00 [kedac]

hm.  I didn't know that the Vaio had EDAC.

: root  1173  0.0  0.0  0 0 ?  S  18:51   0:00 [khpsbpkt]

I can't even pronounce that.

: root  1354  0.0  0.0  0 0 ?  S  18:51   0:00 [knodemgrd_0]

OK, I do have 1394 hardware, but it hasn't been used.

: root  1636  0.0  0.0  0 0 ?  S  18:52   0:00 [kondemand/0]

I blame davej.


  otoh, a lot of these inefficiencies are probably down in scruffy drivers
  rather than in core or top-level code.

You say scruffy, but most of the proliferation of kthreads comes
from code written in the last few years.  Compare the explosion of kthreads
we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
slowing down at all.

On the 2-way box I grabbed the above ps output from, I end up with 69 kthreads.
It doesn't surprise me at all that bigger iron is starting to see issues.



Sure.

I don't think it's completely silly to object to all this.  Sure, a kernel
thread is worth 4k in the best case, but I bet they have associated unused
resources and as we've seen, they can cause overhead.


Agreed.

Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Linus Torvalds wrote:


On Mon, 9 Apr 2007, Andrew Morton wrote:

   10 ?  S  0:00 [khelper]

That one's needed to parent the call_usermodehelper() apps.  I don't think
it does anything else.  We used to use keventd for this but that had some
problem which I forget.


I think it was one of a long series of deadlocks. 

Using a keventd for many different things sounds clever and nice, but 
then sucks horribly when one event triggers another event, and they depend 
on each other. Solution: use independent threads for the events.


Nod.  That's the key problem with keventd.  Independent things must wait 
on each other.


That's why I feel thread creation -- cheap under Linux -- is quite 
appropriate for many of these situations.


Jeff





Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Torsten Kaiser wrote:

One thread per port, not per device.

 796 ?  S  0:00  \_ [scsi_eh_0]
 797 ?  S  0:00  \_ [scsi_eh_1]
 798 ?  S  0:00  \_ [scsi_eh_2]
 819 ?  S  0:00  \_ [scsi_eh_3]
 820 ?  S  0:00  \_ [scsi_eh_4]
 824 ?  S  0:00  \_ [scsi_eh_5]
 825 ?  S  0:14  \_ [scsi_eh_6]

bardioc ~ # lsscsi -d
[0:0:0:0]diskATA  ST3160827AS  3.42  /dev/sda[8:0]
[1:0:0:0]diskATA  ST3160827AS  3.42  /dev/sdb[8:16]
[5:0:0:0]diskATA  IBM-DHEA-36480   HE8O  /dev/sdc[8:32]
[5:0:1:0]diskATA  Maxtor 6L160P0   BAH4  /dev/sdd[8:48]
[6:0:0:0]cd/dvd  HL-DT-ST DVDRAM GSA-4081B A100  /dev/sr0[11:0]
bardioc ~ # lsscsi -H
[0]sata_promise
[1]sata_promise
[2]sata_promise
[3]sata_via
[4]sata_via
[5]pata_via
[6]pata_via

The bad part is that there is always a thread, even if the hardware is not
hotplug capable.
I don't know if the thread is even needed for hotplug...


Nope, it's not.  At least for SATA (your chosen examples), hotplug is 
handled by a libata-specific thread.


The SCSI EH threads are there purely for SCSI exception handling.  For 
the majority of SAS and SATA, we replace the entire SCSI EH handling 
code with our own, making those threads less useful than on older (read: 
majority) of SCSI drivers.


Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Andrew Morton
On Tue, 10 Apr 2007 03:05:56 -0400 Jeff Garzik [EMAIL PROTECTED] wrote:

 My main 
 worry with keventd is that we might get stuck behind an unrelated 
 process for an undefined length of time.

I don't think it has ever been demonstrated that keventd latency is
excessive, or a problem.  I guess we could instrument it and fix stuff
easily enough.

The main problem with keventd has been flush_scheduled_work() deadlocks: the
caller of flush_scheduled_work() wants to flush work item A, but holds some
lock which is also needed by unrelated work item B.  Most of the time, it
works.  But if item B happens to be queued, flush_scheduled_work() will
deadlock.

The fix is to flush-and-cancel just item A: if it's not running yet, cancel
it.  If it is running, wait until it has finished.  Oleg's

void cancel_work_sync(struct work_struct *work)

is queued for 2.6.22 and should permit some kthread-keventd conversions
which would previously have been deadlocky.


The thing to concentrate on here is the per-cpu threads, which is where the
proliferation appears to be coming from.  Conversions to
schedule_work()+cancel_work_sync() and conversions to
create_singlethread_workqueue().




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Russell King
On Mon, Apr 09, 2007 at 07:30:56PM -0700, Andrew Morton wrote:
 : root   319  0.0  0.0  0 0 ?  S  18:51   0:00 [pccardd]
 
 hm.

One per PC card socket to avoid the sysfs locking crappyness that would
otherwise deadlock, and to convert from the old unreadable state machine
implementation to a much more readable linearly coded implementation.

Could probably be eliminated if we had some mechanism to spawn a helper
thread to do some task as required which didn't block other helper
threads until it completes.

-- 
Russell King
 Linux kernel2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Russell King wrote:

Could probably be eliminated if we had some mechanism to spawn a helper
thread to do some task as required which didn't block other helper
threads until it completes.


kthread_run() should do that for you.  Creates a new thread with 
kthread_create(), and wakes it up immediately.  Goes away when you're 
done with it.


Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Andrew Morton wrote:

On Tue, 10 Apr 2007 03:05:56 -0400 Jeff Garzik [EMAIL PROTECTED] wrote:

My main 
worry with keventd is that we might get stuck behind an unrelated 
process for an undefined length of time.


I don't think it has ever been demonstrated that keventd latency is
excessive, or a problem.  I guess we could instrument it and fix stuff
easily enough.


It's simple math, combined with user expectations.

On a 1-CPU or 2-CPU box, if you have three or more tasks, all of which 
are doing hardware reset tasks that could take 30-60 seconds (realistic 
for libata, SCSI and network drivers, at least), then OBVIOUSLY you have 
other tasks blocked for that length of time.


Since the cause of the latency is msleep() -- the entire reason why the 
driver wanted to use a kernel thread in the first place -- it would seem 
to me that the simple fix is to start a new thread, possibly exceeding 
the number of CPUs in the box.




The main problem with keventd has been flush_scheduled_work() deadlocks: the


That's been a problem in the past, yes, but a minor one.

I'm talking about a key conceptual problem with keventd.

It is easy to see how an extra-long tg3 hardware reset might prevent a 
disk hotplug event from being processed for 30-60 seconds.  And as 
hardware gets more complex -- see the Intel IOP storage card which runs 
Linux -- the reset times get longer, too.


So the issue is /not/ deadlocks.



The thing to concentrate on here is the per-cpu threads, which is where the
proliferation appears to be coming from.


Strongly agreed.

Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Andrew Morton
On Tue, 10 Apr 2007 04:33:57 -0400 Jeff Garzik [EMAIL PROTECTED] wrote:

 Andrew Morton wrote:
  On Tue, 10 Apr 2007 03:05:56 -0400 Jeff Garzik [EMAIL PROTECTED] wrote:
  
  My main 
  worry with keventd is that we might get stuck behind an unrelated 
  process for an undefined length of time.
  
  I don't think it has ever been demonstrated that keventd latency is
  excessive, or a problem.  I guess we could instrument it and fix stuff
  easily enough.
 
 It's simple math, combined with user expectations.
 
 On a 1-CPU or 2-CPU box, if you have three or more tasks, all of which 
 are doing hardware reset tasks that could take 30-60 seconds (realistic 
 for libata, SCSI and network drivers, at least), then OBVIOUSLY you have 
 other tasks blocked for that length of time.

Well that obviously would be a dumb way to use keventd.  One would need
to do schedule_work(), kick off the reset then do schedule_delayed_work()
to wait (or poll) for its termination.


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Andrew Morton wrote:

Well that obviously would be a dumb way to use keventd.  One would need
to do schedule_work(), kick off the reset then do schedule_delayed_work()
to wait (or poll) for its termination.


Far too complex.  See what Russell wrote, for instance.

When you are in a kernel thread, you can write more simple, 
straightforward, easy-to-debug code that does


blah
msleep()
blah

rather than creating an insanely complicated state machine for the same 
thing.


ESPECIALLY if you are already inside a state machine (the case with 
libata PIO data xfer), you do not want to add another state machine 
inside of that.


A great many uses of kernel threads are to simplify device reset and 
polling in this way.  I know; a year ago I audited every kernel thread, 
because I was so frustrated with the per-CPU thread explosion.


Thus, rather than forcing authors to make their code more complex, we 
should find another solution.


Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Molnar

* Russell King [EMAIL PROTECTED] wrote:

 One per PC card socket to avoid the sysfs locking crappyness that 
 would otherwise deadlock, and to convert from the old unreadable state 
 machine implementation to a much more readable linearly coded 
 implementation.
 
 Could probably be eliminated if we had some mechanism to spawn a 
 helper thread to do some task as required which didn't block other 
 helper threads until it completes.

looks like the perfect usecase for threadlets. (threadlets only use up a 
separate context if necessary and can be coded in the familiar 
sequential/linear model)

(btw., threadlets could in theory be executed in irq context too, and if 
we block on anything it gets bounced off to a real context - although 
this certainly pushes the limits and there would still be some deadlock 
potential for things like irq-unsafe non-sleeping locks (spinlocks, 
rwlocks).)

Ingo


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Jeff Garzik

Ingo Molnar wrote:

* Russell King [EMAIL PROTECTED] wrote:

One per PC card socket to avoid the sysfs locking crappyness that 
would otherwise deadlock, and to convert from the old unreadable state 
machine implementation to a much more readable linearly coded 
implementation.


Could probably be eliminated if we had some mechanism to spawn a 
helper thread to do some task as required which didn't block other 
helper threads until it completes.


looks like the perfect usecase for threadlets. (threadlets only use up a 
separate context if necessary and can be coded in the familiar 
sequential/linear model)


Same response as to Andrew:  AFAICS that just increases complexity.

The simple path for programmers is writing straightforward code that 
does something like


blah
msleep()
blah

or in pccardd's case,

mutex_lock()
blah
mutex_unlock()

to permit sleeping without having to write more-complex code that deals 
with context transitions.


For slow-path, infrequently executed code, it is best to keep it as 
simple as possible.


Jeff




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Robin Holt
On Mon, Apr 09, 2007 at 06:48:54PM -0600, Eric W. Biederman wrote:
 Andrew Morton [EMAIL PROTECTED] writes:
 
  I suspect there are quite a few kernel threads which don't really need to
  be threads at all: the code would quite happily work if it was changed to
  use keventd, via schedule_work() and friends.  But kernel threads are
  somewhat easier to code for.
 
  I also suspect that there are a number of workqueue threads which
  could/should have used create_singlethread_workqueue().  Often this is
  because the developer just didn't think to do it.
 
  otoh, a lot of these inefficiencies are probably down in scruffy drivers
  rather than in core or top-level code.
 
  I also wonder where all these parented-by-init,
  presumably-not-using-kthread kernel threads are coming from
 
 
 From another piece of this thread.
 
   Robin how many kernel thread per cpu are you seeing?
  
  10.
  
  FYI, pid 1539 is kthread.
  
  a01:~ # ps -ef | egrep \[.*\/255\] 
  root   512 1  0 Apr08 ?  00:00:00 [migration/255]
  root   513 1  0 Apr08 ?  00:00:00 [ksoftirqd/255]
  root  1281 1  0 Apr08 ?  00:00:02 [events/255]
  root  2435  1539  0 Apr08 ?  00:00:00 [kblockd/255]
  root  3159  1539  0 Apr08 ?  00:00:00 [aio/255]
  root  4007  1539  0 Apr08 ?  00:00:00 [cqueue/255]
  root  8653  1539  0 Apr08 ?  00:00:00 [ata/255]
  root 17438  1539  0 Apr08 ?  00:00:00 [xfslogd/255]
  root 17950  1539  0 Apr08 ?  00:00:00 [xfsdatad/255]
  root 18426  1539  0 Apr08 ?  00:00:00 [rpciod/255]
 
 
 So it looks like there were about 1500 kernel threads that started up before
 kthread started.

I should have been more clear: this is from that 4096 broken down to a
512 cpu partition.  This is the configuration in which the customer will
receive the machine.  The 4096 was just to see if it worked.

 So the kernel threads appear to have init as their parent because
 they started before kthread, for the most part.
 
 At 10 kernel threads per cpu there may be a little bloat but it isn't
 out of control.  It is mostly that we are observing the kernel as
 NR_CPUS approaches infinity.  4096 isn't infinity yet but it's easily
 a 1000-fold bigger than most people are used to :)
 
 Eric


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Molnar

* Eric W. Biederman [EMAIL PROTECTED] wrote:

 Ingo Molnar [EMAIL PROTECTED] writes:
 
  no. Two _completely separate_ lists.
 
  i.e. a to-be-reaped task will still be on the main list _too_. The 
  main list is for all the PID semantics rules. The reap-list is just 
  for wait4() processing. The two would be completely separate.
 
 And what pray tell except for heuristics is the list of children used 
 for?

on a second thought: the p->children list is needed for the whole 
child/parent task tree, which is needed for sys_getppid(). The question 
is, does anything require us to reparent to within the same thread 
group?

Ingo


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Oleg Nesterov
On 04/10, Ingo Molnar wrote:
 
 * Eric W. Biederman [EMAIL PROTECTED] wrote:
 
  Ingo Molnar [EMAIL PROTECTED] writes:
  
   no. Two _completely separate_ lists.
  
   i.e. a to-be-reaped task will still be on the main list _too_. The 
   main list is for all the PID semantics rules. The reap-list is just 
   for wait4() processing. The two would be completely separate.
  
  And what pray tell except for heuristics is the list of children used 
  for?
 
 on a second thought: the p->children list is needed for the whole 
 child/parent task tree, which is needed for sys_getppid(). The question 
 is, does anything require us to reparent to within the same thread 
 group?

No! That is why I suggested (long ago, in fact) to move ->children into
->signal_struct. When a sub-thread forks, we set ->parent = group_leader.
We don't need forget_original_parent() until the last thread exits. This
also simplifies do_wait().

However, this breaks the current ->pdeath_signal behaviour. In fact (and
Eric thinks the same) this _fixes_ this behaviour, but may break things.

Oleg.



Re: init's children list is long and slows reaping children.

2007-04-10 Thread Eric W. Biederman
Ingo Molnar [EMAIL PROTECTED] writes:

 * Eric W. Biederman [EMAIL PROTECTED] wrote:

 Ingo Molnar [EMAIL PROTECTED] writes:
 
  no. Two _completely separate_ lists.
 
  i.e. a to-be-reaped task will still be on the main list _too_. The 
  main list is for all the PID semantics rules. The reap-list is just 
  for wait4() processing. The two would be completely separate.
 
 And what pray tell except for heuristics is the list of children used 
 for?

 on a second thought: the p->children list is needed for the whole 
 child/parent task tree, which is needed for sys_getppid().

Yes, something Oleg said made me realize that.

As long as the reparenting isn't too complex, it isn't required that we
have exactly one list.

 The question 
 is, does anything require us to reparent to within the same thread 
 group?

I think my head is finally on straight about this question.

Currently there is the silly linux specific parent death signal
(pdeath_signal).  Oleg's memory was better than mine on this score.

However there is no indication that the parent death signal being sent
when a thread leader dies is actually correct, or even interesting.
It probably should only be sent when getppid changes.

So with pdeath_signal fixed, there is nothing that requires us to
reparent within the same thread group.

I'm trying to remember what the story is now.  There is a nasty
race somewhere with reparenting, a threaded parent setting SIGCHLD to
SIG_IGN, and non-default signals that results in a zombie that no one
can wait for and reap.  It requires being reparented twice to trigger.

Anyway it is a real mess and if we can remove the stupid multi-headed
child lists things would become much simpler and the problem could
not occur.

Plus the code would become much simpler...

utrace appears to have removed the ptrace_children list and the
special cases that entailed.

Eric



Re: init's children list is long and slows reaping children.

2007-04-10 Thread Eric W. Biederman
Oleg Nesterov [EMAIL PROTECTED] writes:

 No! That is why I suggested (long ago, in fact) to move ->children into
 ->signal_struct. When a sub-thread forks, we set ->parent = group_leader.
 We don't need forget_original_parent() until the last thread exits. This
 also simplifies do_wait().

 However, this breaks the current ->pdeath_signal behaviour. In fact (and
 Eric thinks the same) this _fixes_ this behaviour, but may break things.

Thinking about this.  As contingency planning, if there is something in user
space that actually somehow cares about pdeath_signal from a threaded
parent, we can add a pdeath_signal list to the task_struct and get
rid of the rest of the complexity.

I think I want to wait until someone screams first.

This does very much mean that we can remove the complexity of
a per-thread ->children without fear of having to revert everything.

Eric


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Molnar

* Eric W. Biederman [EMAIL PROTECTED] wrote:

  on a second thought: the p->children list is needed for the whole 
  child/parent task tree, which is needed for sys_getppid().
 
 Yes, something Oleg said made me realize that.
 
 As long as the reparenting isn't too complex, it isn't required that we 
 have exactly one list.
 
  The question is, does anything require us to reparent to within the 
  same thread group?
 
 I think my head is finally on straight about this question.
 
 Currently there is the silly linux specific parent death signal 
 (pdeath_signal).  Oleg's memory was better than mine on this score.
 
 However there is no indication that the parent death signal being sent 
 when a thread leader dies is actually correct, or even interesting. It 
 probably should only be sent when getppid changes.
 
 So with pdeath_signal fixed, there is nothing that requires us to 
 reparent within the same thread group.
 
 I'm trying to remember what the story is now.  There is a nasty race 
 somewhere with reparenting, a threaded parent setting SIGCHLD to 
 SIG_IGN, and non-default signals that results in a zombie that no one 
 can wait for and reap.  It requires being reparented twice to trigger.
 
 Anyway it is a real mess and if we can remove the stupid multi-headed 
 child lists things would become much simpler and the problem could not 
 occur.
 
 Plus the code would become much simpler...
 
 utrace appears to have removed the ptrace_children list and the 
 special cases that entailed.

so ... is anyone pursuing this? This would allow us to make sys_wait4() 
faster and more scalable: no tasklist_lock bouncing for example.

Ingo


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Bill Davidsen

Davide Libenzi wrote:

On Mon, 9 Apr 2007, Linus Torvalds wrote:


On Mon, 9 Apr 2007, Kyle Moffett wrote:

Maybe struct posix_process is more descriptive?  struct process_posix?
Ugly POSIX process semantics data seems simple enough to stick in a struct
name.  struct uglyposix_process?

Guys, you didn't read my message.

It's *not* about process stuff.  Anything that tries to call it a 
process is *by*definition* worse than what it is now. Processes have all 
the things that we've cleanly separated out for filesystem, VM, SysV 
semaphore state, namespaces etc.


The struct signal_struct is the random *leftovers* from all the other 
stuff. It's *not* about processes. Never has been, and never will be. 


I proposed struct task_shared_ctx but you ducked :)


Descriptive, correct, I like it!

--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot



Re: init's children list is long and slows reaping children.

2007-04-10 Thread Eric W. Biederman
Ingo Molnar [EMAIL PROTECTED] writes:

 * Eric W. Biederman [EMAIL PROTECTED] wrote:

  on a second thought: the p->children list is needed for the whole 
  child/parent task tree, which is needed for sys_getppid().
 
 Yes, something Oleg said made me realize that.
 
 As long as the reparenting isn't too complex, it isn't required that we 
 have exactly one list.
 
  The question is, does anything require us to reparent to within the 
  same thread group?
 
 I think my head is finally on straight about this question.
 
 Currently there is the silly linux specific parent death signal 
 (pdeath_signal).  Oleg's memory was better than mine on this score.
 
 However there is no indication that the parent death signal being sent 
 when a thread leader dies is actually correct, or even interesting. It 
 probably should only be sent when getppid changes.
 
 So with pdeath_signal fixed there is nothing that requires us to 
 reparent within the same thread group.
 
 I'm trying to remember what the story is now.  There is a nasty race 
 somewhere with reparenting, a threaded parent setting SIGCHLD to 
 SIGIGN, and non-default signals that results in a zombie that no one 
 can wait for and reap.  It requires being reparented twice to trigger.
 
 Anyway it is a real mess and if we can remove the stupid multi-headed 
 child lists things would become much simpler and the problem could not 
 occur.
 
 Plus the code would become much simpler...
 
 utrace appears to have removed the ptrace_children list and the 
 special cases that entailed.

 so ... is anyone pursuing this? This would allow us to make sys_wait4() 
 faster and more scalable: no tasklist_lock bouncing for example.

which part?

Eric


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Molnar

* Eric W. Biederman [EMAIL PROTECTED] wrote:

  so ... is anyone pursuing this? This would allow us to make 
  sys_wait4() faster and more scalable: no tasklist_lock bouncing for 
  example.
 
 which part?

all of it :) Everything you mentioned makes sense quite a bit. The 
thread signal handling of do_wait was added in a pretty arbitrary 
fashion so i doubt there are strong requirements in that area. Apps 
might have grown to get used to it meanwhile though, so we've got to do 
it carefully.

Ingo


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Eric W. Biederman
Ingo Molnar [EMAIL PROTECTED] writes:

 * Eric W. Biederman [EMAIL PROTECTED] wrote:

  so ... is anyone pursuing this? This would allow us to make 
  sys_wait4() faster and more scalable: no tasklist_lock bouncing for 
  example.
 
 which part?

 all of it :) Everything you mentioned makes sense quite a bit. The 
 thread signal handling of do_wait was added in a pretty arbitrary 
 fashion so i doubt there are strong requirements in that area. Apps 
 might have grown to get used to it meanwhile though, so we've got to do 
 it carefully.

I'm looking at it.  If only because there is a reasonable chance doing this
will fix the races with a threaded init.

However I just found something nasty.  The wait __WNOTHREAD flag.

And my quick search seems to find at least one user space application
that uses it, and it is widely documented so I suspect there are
others :(

I played with moving the lists into signal_struct, and short of
architecture-specific users of task->children all I had to touch
were:

 include/linux/init_task.h |2 +-
 include/linux/sched.h |5 +-
 kernel/exit.c |  159 +
 kernel/fork.c |2 +-
 mm/oom_kill.c |4 +-

So it should be relatively easy to change these child lists around...

Eric


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Oleg Nesterov
On 04/10, Eric W. Biederman wrote:

 I'm trying to remember what the story is now.  There is a nasty
 race somewhere with reparenting, a threaded parent setting SIGCHLD to
 SIGIGN, and non-default signals that results in a zombie that no one
 can wait for and reap.  It requires being reparented twice to trigger.

reparent_thread:

...

/* If we'd notified the old parent about this child's death,
 * also notify the new parent.
 */
	if (!traced && p->exit_state == EXIT_ZOMBIE &&
	    p->exit_signal != -1 && thread_group_empty(p))
		do_notify_parent(p, p->exit_signal);

We notified /sbin/init. If it ignores SIGCHLD, we should release the task.
We don't do this.

The best fix I believe is to clean up the forget_original_parent/reparent_thread
interaction and factor out these exit_state == EXIT_ZOMBIE && exit_signal == -1
checks.
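
One possible shape of the missing release, as untested kernel-style pseudocode
against the 2.6-era code quoted above (not a patch; it relies on
do_notify_parent() marking the task self-reaping, exit_signal = -1, when the
new parent ignores SIGCHLD):

```c
/* Sketch only: after notifying the new parent, release the zombie
 * ourselves if the notification made it self-reaping. */
if (!traced && p->exit_state == EXIT_ZOMBIE &&
    p->exit_signal != -1 && thread_group_empty(p)) {
	do_notify_parent(p, p->exit_signal);
	if (p->exit_signal == -1) {
		/* do_notify_parent() saw SIGCHLD ignored: nobody will
		 * ever wait for this task, so reap it right here. */
		p->exit_state = EXIT_DEAD;
		release_task(p);
	}
}
```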

Oleg.



Re: init's children list is long and slows reaping children.

2007-04-10 Thread Matt Mackall
On Tue, Apr 10, 2007 at 03:05:56AM -0400, Jeff Garzik wrote:
 Andrew Morton wrote:
  : root 3  0.0  0.0  0 0 ?  S  18:51  0:00 [watchdog/0]
 
 That's the softlockup detector.  Confusingly named to look like a, err,
 watchdog.  Could probably use keventd.
 
 I would think this would run into the keventd problem, where $N 
 processes can lock out another?
 
 IMO a lot of these could potentially be simply started as brand new 
 threads, when an exception arises.
 
 
  : root 5  0.0  0.0  0 0 ?  S  18:51  0:00 [kthread]
 
 That's there to parent the kthread_create()d threads.  Could presumably use
 khelper.
 
  : root   152  0.0  0.0  0 0 ?  S  18:51  0:00 [ata/0]
 
 Does it need to be per-cpu?
 
 No, it does not.
 
 It is used for PIO data transfer, so it merely has to respond quickly, 
 which rules out keventd.  You also don't want PIO data xfer for port A 
 blocked, sitting around waiting for PIO data xfer to complete on port C.
 
 So, we merely need fast-reacting threads that keventd will not block. 
 We do not need per-CPU threads.
 
 Again, I think a model where threads are created on demand, and reaped 
 after inactivity, would fit here.  As I feel it would fit with many 
 other non-libata drivers.
 
 
  : root   153  0.0  0.0  0 0 ?  S  18:51  0:00 [ata_aux]
 
 That's a single-threaded workqueue handler.  Perhaps could use keventd.
 
 That is used by libata exception handler, for hotplug and such.  My main 
 worry with keventd is that we might get stuck behind an unrelated 
 process for an undefined length of time.
 
 IMO the best model would be to create ata_aux thread on demand, and kill 
 it if it hasn't been used recently.
 
 
 
  : root   299  0.0  0.0  0 0 ?  S  18:51  0:00 [scsi_eh_0]
  : root   300  0.0  0.0  0 0 ?  S  18:51  0:00 [scsi_eh_1]
  : root   305  0.0  0.0  0 0 ?  S  18:51  0:00 [scsi_eh_2]
  : root   306  0.0  0.0  0 0 ?  S  18:51  0:00 [scsi_eh_3]
 
 This machine has one CPU, one sata disk and one DVD drive.  The above is
 hard to explain.
 
 Nod.  I've never thought we needed this many threads.  At least it 
 doesn't scale out of control for $BigNum-CPU boxen.
 
 As the name implies, this is SCSI exception handling thread.  Although 
 some synchronization is required, this could probably work with an 
 on-demand thread creation model too.
 
 
  : root   319  0.0  0.0  0 0 ?  S  18:51  0:00 [pccardd]
 
 hm.
 
  : root   331  0.0  0.0  0 0 ?  S  18:51  0:00 [kpsmoused]
 
 hm.
 
 This kernel thread is used as a bottom half handler for the PS2 mouse 
 interrupt.  This one is a bit more justifiable.
 
 
  : root   337  0.0  0.0  0 0 ?  S  18:51  0:00 [kedac]
 
 hm.  I didn't know that the Vaio had EDAC.
 
  : root  1173  0.0  0.0  0 0 ?  S  18:51  0:00 [khpsbpkt]
 
 I can't even pronounce that.
 
  : root  1354  0.0  0.0  0 0 ?  S  18:51  0:00 [knodemgrd_0]
 
 OK, I do have 1394 hardware, but it hasn't been used.
 
  : root  1636  0.0  0.0  0 0 ?  S  18:52  0:00 [kondemand/0]
 
 I blame davej.
 
  otoh, a lot of these inefficiencies are probably down in scruffy drivers
  rather than in core or top-level code.
 
 You say scruffy, but most of the proliferation of kthreads comes
 from code written in the last few years.  Compare the explosion of 
 kthreads
 we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
 slowing down at all.
 
 On the 2-way box I grabbed the above ps output from, I end up with 69 
 kthreads.
 It doesn't surprise me at all that bigger iron is starting to see issues.
 
 
 Sure.
 
 I don't think it's completely silly to object to all this.  Sure, a kernel
 thread is worth 4k in the best case, but I bet they have associated unused
 resources and as we've seen, they can cause overhead.
 
 Agreed.

There are some upsides to persistent kernel threads that we might want
to keep in mind:

- they can be reniced, made RT, etc. as needed
- they can be bound to CPUs
- they collect long-term CPU usage information

Most of the above doesn't matter for the average kernel thread that's
handling the occasional hardware reset, but for others it could.

-- 
Mathematics is the supreme nostalgia of our time.


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Davide Libenzi
On Tue, 10 Apr 2007, Bill Davidsen wrote:

 Davide Libenzi wrote:
  On Mon, 9 Apr 2007, Linus Torvalds wrote:
  
   On Mon, 9 Apr 2007, Kyle Moffett wrote:
    Maybe "struct posix_process" is more descriptive?  "struct process_posix"?
    "Ugly POSIX process semantics data" seems simple enough to stick in a
    struct name.  "struct uglyposix_process"?
   Guys, you didn't read my message.
   
   It's *not* about process stuff.  Anything that tries to call it a
   process is *by*definition* worse than what it is now. Processes have all
   the things that we've cleanly separated out for filesystem, VM, SysV
   semaphore state, namespaces etc.
   
   The struct signal_struct is the random *leftovers* from all the other
   stuff. It's *not* about processes. Never has been, and never will be. 
  
  I proposed struct task_shared_ctx but you ducked :)
  
 Descriptive, correct, I like it!

He's stubborn, he'll never accept patches. Must be a seasonal thing ;)



- Davide




Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Oeser
On Tuesday 10 April 2007, Jeff Garzik wrote:
 That's why I feel thread creation -- cheap under Linux -- is quite 
 appropriate for many of these situations.

Maybe that (thread creation) can be done at open(), socket-creation, 
service request, syscall or whatever event triggers a driver/subsystem 
to actually queue work into a thread.

And when there is a close(), socket-destruction, service completion
or whatever these threads can be marked for destruction and destroyed
by a timer or even immediately.

Regards

Ingo Oeser

-- 
If something is getting cheap, it is getting wasted just because it is cheap.


Re: init's children list is long and slows reaping children.

2007-04-10 Thread Ingo Oeser
On Tuesday 10 April 2007, Jeff Garzik wrote:
 Thus, rather than forcing authors to make their code more complex, we 
 should find another solution.

What about something like the pre-forking concept? So just have a thread
creator thread, which checks the number of unused threads and keeps it
within certain limits.

So that anything which needs a thread now simply queues up the work and
specifies that it wants a new thread, if possible.

One problem seems to be that a thread is nothing else but a statement
of which other tasks I can wait on before doing my current one (e.g. I
don't want to msleep() twice on the same reset timeout).
But we usually use locking to order that.

Do I miss anything fundamental here?

Regards

Ingo Oeser


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Dave Jones
On Tue, Apr 10, 2007 at 09:07:54AM +0400, Alexey Dobriyan wrote:
 > On Mon, Apr 09, 2007 at 07:30:56PM -0700, Andrew Morton wrote:
 > > On Mon, 9 Apr 2007 21:59:12 -0400 Dave Jones <[EMAIL PROTECTED]> wrote:
 > 
 > [possible topic for KS2007]
 > 
 > > >   164 ?  S<  0:00 [cqueue/0]
 > > >   165 ?  S<  0:00 [cqueue/1]
 > > >
 > > > I'm not even sure wth these are.
 > >
 > > Me either.
 > 
 > drivers/connector/connector.c:
 > 455  dev->cbdev = cn_queue_alloc_dev("cqueue", dev->nls);

Maybe I have apps relying on the connector stuff that I don't
even realise, but ttbomk, nothing actually *needs* this
for 99% of users if I'm not mistaken.

* wonders why he never built this modular..

config PROC_EVENTS
	boolean "Report process events to userspace"
	depends on CONNECTOR=y


Yay.

Dave

-- 
http://www.codemonkey.org.uk


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Alexey Dobriyan
On Mon, Apr 09, 2007 at 07:30:56PM -0700, Andrew Morton wrote:
> On Mon, 9 Apr 2007 21:59:12 -0400 Dave Jones <[EMAIL PROTECTED]> wrote:

[possible topic for KS2007]

> >   164 ?  S<  0:00 [cqueue/0]
> >   165 ?  S<  0:00 [cqueue/1]
> >
> > I'm not even sure wth these are.
>
> Me either.

drivers/connector/connector.c:
455 dev->cbdev = cn_queue_alloc_dev("cqueue", dev->nls);



Re: init's children list is long and slows reaping children.

2007-04-09 Thread Linus Torvalds


On Mon, 9 Apr 2007, Andrew Morton wrote:
> 
> >    10 ?  S<  0:00 [khelper]
> 
> That one's needed to parent the call_usermodehelper() apps.  I don't think
> it does anything else.  We used to use keventd for this but that had some
> problem which I forget. 

I think it was one of a long series of deadlocks. 

Using a "keventd" for many different things sounds clever and nice, but 
then sucks horribly when one event triggers another event, and they depend 
on each other. Solution: use independent threads for the events.

Linus


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Andrew Morton
On Mon, 9 Apr 2007 21:59:12 -0400 Dave Jones <[EMAIL PROTECTED]> wrote:

> On Mon, Apr 09, 2007 at 05:23:39PM -0700, Andrew Morton wrote:
> 
>  > I suspect there are quite a few kernel threads which don't really need to
>  > be threads at all: the code would quite happily work if it was changed to
>  > use keventd, via schedule_work() and friends.  But kernel threads are
>  > somewhat easier to code for.
> 
> Perhaps too easy.  We have a bunch of kthreads sitting around that afaict,
> are there 'just in case', not because they're actually in use.

This is fun.

>    10 ?  S<  0:00 [khelper]

That one's needed to parent the call_usermodehelper() apps.  I don't think
it does anything else.  We used to use keventd for this but that had some
problem which I forget.  (Who went and misnamed keventd to "events", too? 
Nobody calls it "events", and with good reason)

> Why doesn't this get spawned when it needs to?
> 
>   164 ?  S<  0:00 [cqueue/0]
>   165 ?  S<  0:00 [cqueue/1]
> 
> I'm not even sure wth these are.

Me either.

> ...
>

: root 3  0.0  0.0  0 0 ?  S  18:51  0:00 [watchdog/0]

That's the softlockup detector.  Confusingly named to look like a, err,
watchdog.  Could probably use keventd.

: root 5  0.0  0.0  0 0 ?  S  18:51  0:00 [kthread]

That's there to parent the kthread_create()d threads.  Could presumably use
khelper.

: root   152  0.0  0.0  0 0 ?  S  18:51  0:00 [ata/0]

Does it need to be per-cpu?

: root   153  0.0  0.0  0 0 ?  S  18:51  0:00 [ata_aux]

That's a single-threaded workqueue handler.  Perhaps could use keventd.

: root   299  0.0  0.0  0 0 ?  S  18:51  0:00 [scsi_eh_0]
: root   300  0.0  0.0  0 0 ?  S  18:51  0:00 [scsi_eh_1]
: root   305  0.0  0.0  0 0 ?  S  18:51  0:00 [scsi_eh_2]
: root   306  0.0  0.0  0 0 ?  S  18:51  0:00 [scsi_eh_3]

This machine has one CPU, one sata disk and one DVD drive.  The above is
hard to explain.

: root   319  0.0  0.0  0 0 ?  S  18:51  0:00 [pccardd]

hm.

: root   331  0.0  0.0  0 0 ?  S  18:51  0:00 [kpsmoused]

hm.

: root   337  0.0  0.0  0 0 ?  S  18:51  0:00 [kedac]

hm.  I didn't know that the Vaio had EDAC.

: root  1173  0.0  0.0  0 0 ?  S  18:51  0:00 [khpsbpkt]

I can't even pronounce that.

: root  1354  0.0  0.0  0 0 ?  S  18:51  0:00 [knodemgrd_0]

OK, I do have 1394 hardware, but it hasn't been used.

: root  1636  0.0  0.0  0 0 ?  S  18:52  0:00 [kondemand/0]

I blame davej.

>  > otoh, a lot of these inefficiencies are probably down in scruffy drivers
>  > rather than in core or top-level code.
> 
> You say scruffy, but most of the proliferation of kthreads comes
> from code written in the last few years.  Compare the explosion of kthreads
> we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
> slowing down at all.
> 
> On the 2-way box I grabbed the above ps output from, I end up with 69 
> kthreads.
> It doesn't surprise me at all that bigger iron is starting to see issues.
> 

Sure.

I don't think it's completely silly to object to all this.  Sure, a kernel
thread is worth 4k in the best case, but I bet they have associated unused
resources and as we've seen, they can cause overhead.



Re: init's children list is long and slows reaping children.

2007-04-09 Thread Dave Jones
On Mon, Apr 09, 2007 at 05:23:39PM -0700, Andrew Morton wrote:

 > I suspect there are quite a few kernel threads which don't really need to
 > be threads at all: the code would quite happily work if it was changed to
 > use keventd, via schedule_work() and friends.  But kernel threads are
 > somewhat easier to code for.

Perhaps too easy.  We have a bunch of kthreads sitting around that afaict,
are there 'just in case', not because they're actually in use.

   10 ?  S<  0:00 [khelper]

Why doesn't this get spawned when it needs to?

  164 ?  S<  0:00 [cqueue/0]
  165 ?  S<  0:00 [cqueue/1]

I'm not even sure wth these are.

  166 ?  S<  0:00 [ksuspend_usbd]

Just in case I decide to suspend ? Sounds.. handy.
But why not spawn this just after we start a suspend?

  209 ?  S<  0:00 [aio/0]
  210 ?  S<  0:00 [aio/1]

I'm sure I'd appreciate these a lot more if I had any AIO using apps.

  364 ?  S<  0:00 [kpsmoused]

I never did figure out why this was a thread.

  417 ?  S<  0:00 [scsi_eh_1]
  418 ?  S<  0:00 [scsi_eh_2]
  419 ?  S<  5:28 [scsi_eh_3]
  426 ?  S<  0:00 [scsi_eh_4]
  427 ?  S<  0:00 [scsi_eh_5]
  428 ?  S<  0:00 [scsi_eh_6]
  429 ?  S<  0:00 [scsi_eh_7]

Just in case my 7-in-1 card reader gets an error.
(Which is unlikely on at least 6 of the slots, as evidenced by the runtime
column -- though now I'm curious as to why the error handler was running so
much given I've not experienced any errors.)
This must be a fun one on huge diskfarms.

  884 ?  S<  0:00 [kgameportd]

Just in case I ever decide to plug something into my soundcard.

 2179 ?  S<  0:00 [kmpathd/0]
 2180 ?  S<  0:00 [kmpathd/1]
 2189 ?  S<  0:00 [kmirrord]

Just loading the modules starts up the threads, regardless
of whether you use them. (Not sure why they're getting loaded,
something for me to look into)

 3246 ?  S<  0:00 [krfcommd]

I don't even *have* bluetooth hardware.
(Yes, the module shouldn't have loaded, but that's another battle..)

 > otoh, a lot of these inefficiencies are probably down in scruffy drivers
 > rather than in core or top-level code.

You say scruffy, but most of the proliferation of kthreads comes
from code written in the last few years.  Compare the explosion of kthreads
we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
slowing down at all.

On the 2-way box I grabbed the above ps output from, I end up with 69 kthreads.
It doesn't surprise me at all that bigger iron is starting to see issues.

Dave

-- 
http://www.codemonkey.org.uk


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Andrew Morton
On Mon, 09 Apr 2007 18:48:54 -0600
[EMAIL PROTECTED] (Eric W. Biederman) wrote:

> Andrew Morton <[EMAIL PROTECTED]> writes:
> 
> > I suspect there are quite a few kernel threads which don't really need to
> > be threads at all: the code would quite happily work if it was changed to
> > use keventd, via schedule_work() and friends.  But kernel threads are
> > somewhat easier to code for.
> >
> > I also suspect that there are a number of workqueue threads which
> > could/should have used create_singlethread_workqueue().  Often this is
> > because the developer just didn't think to do it.
> >
> > otoh, a lot of these inefficiencies are probably down in scruffy drivers
> > rather than in core or top-level code.
> >
> > <wonders where all these presumably-not-using-kthread kernel threads are coming from>
> 
> 
From another piece of this thread.

Yeah, sorry.  Without mentioning any names, [EMAIL PROTECTED] broke the
threading so I came in halfway.

> > > Robin how many kernel thread per cpu are you seeing?
> > 
> > 10.
> > 
> > FYI, pid 1539 is kthread.
> > 
> > a01:~ # ps -ef | egrep "\[.*\/255\]" 
> > root   512 1  0 Apr08 ?  00:00:00 [migration/255]
> > root   513 1  0 Apr08 ?  00:00:00 [ksoftirqd/255]
> > root  1281 1  0 Apr08 ?  00:00:02 [events/255]
> > root  2435  1539  0 Apr08 ?  00:00:00 [kblockd/255]
> > root  3159  1539  0 Apr08 ?  00:00:00 [aio/255]
> > root  4007  1539  0 Apr08 ?  00:00:00 [cqueue/255]
> > root  8653  1539  0 Apr08 ?  00:00:00 [ata/255]
> > root 17438  1539  0 Apr08 ?  00:00:00 [xfslogd/255]
> > root 17950  1539  0 Apr08 ?  00:00:00 [xfsdatad/255]
> > root 18426  1539  0 Apr08 ?  00:00:00 [rpciod/255]
> 
> 
> So it looks like there were about 1500 kernel threads that started up before
> kthread started.
> 
> So the kernel threads appear to have init as their parent is because
> they started before kthread for the most part.
> 
> At 10 kernel threads per cpu there may be a little bloat but it isn't
> out of control.  It is mostly that we are observing the kernel as
> NR_CPUS approaches infinity.  4096 isn't infinity yet but it's easily
> a 1000-fold bigger than most people are used to :)

Yes, I expect we could run init_workqueues() much earlier, from process 0
rather than from process 1.  Something _might_ blow up as it often does when
we change startup ordering, but in this case it would be somewhat surprising.



Re: init's children list is long and slows reaping children.

2007-04-09 Thread Eric W. Biederman
Andrew Morton <[EMAIL PROTECTED]> writes:

> I suspect there are quite a few kernel threads which don't really need to
> be threads at all: the code would quite happily work if it was changed to
> use keventd, via schedule_work() and friends.  But kernel threads are
> somewhat easier to code for.
>
> I also suspect that there are a number of workqueue threads which
> could/should have used create_singlethread_workqueue().  Often this is
> because the developer just didn't think to do it.
>
> otoh, a lot of these inefficiencies are probably down in scruffy drivers
> rather than in core or top-level code.
>
> <wonders where all these presumably-not-using-kthread kernel threads are coming from>


From another piece of this thread.

> > Robin how many kernel thread per cpu are you seeing?
> 
> 10.
> 
> FYI, pid 1539 is kthread.
> 
> a01:~ # ps -ef | egrep "\[.*\/255\]" 
> root   512 1  0 Apr08 ?  00:00:00 [migration/255]
> root   513 1  0 Apr08 ?  00:00:00 [ksoftirqd/255]
> root  1281 1  0 Apr08 ?  00:00:02 [events/255]
> root  2435  1539  0 Apr08 ?  00:00:00 [kblockd/255]
> root  3159  1539  0 Apr08 ?  00:00:00 [aio/255]
> root  4007  1539  0 Apr08 ?  00:00:00 [cqueue/255]
> root  8653  1539  0 Apr08 ?  00:00:00 [ata/255]
> root 17438  1539  0 Apr08 ?  00:00:00 [xfslogd/255]
> root 17950  1539  0 Apr08 ?  00:00:00 [xfsdatad/255]
> root 18426  1539  0 Apr08 ?  00:00:00 [rpciod/255]


So it looks like there were about 1500 kernel threads that started up before
kthread started.

So the kernel threads appear to have init as their parent is because
they started before kthread for the most part.

At 10 kernel threads per cpu there may be a little bloat but it isn't
out of control.  It is mostly that we are observing the kernel as
NR_CPUS approaches infinity.  4096 isn't infinity yet but it's easily
a 1000-fold bigger than most people are used to :)

Eric


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Andrew Morton
On Fri, 06 Apr 2007 18:38:40 -0400
Jeff Garzik <[EMAIL PROTECTED]> wrote:

> Robin Holt wrote:
> > We have been testing a new larger configuration and we are seeing a very
> > large scan time of init's tsk->children list.  In the cases we are seeing,
> > there are numerous kernel processes created for each cpu (ie: events/0
> > ... events/, xfslogd/0 ... xfslogd/).  These are
> > all on the list ahead of the processes we are currently trying to reap.
> 
> What about attacking the explosion of kernel threads?
> 
> As CPU counts increase, the number of per-CPU kernel threads gets really 
> ridiculous.
> 
> I would rather change the implementation under the hood to start per-CPU 
> threads on demand, similar to a thread-pool implementation.
> 
> Boxes with $BigNum CPUs probably won't ever use half of those threads.

I suspect there are quite a few kernel threads which don't really need to
be threads at all: the code would quite happily work if it was changed to
use keventd, via schedule_work() and friends.  But kernel threads are
somewhat easier to code for.

I also suspect that there are a number of workqueue threads which
could/should have used create_singlethread_workqueue().  Often this is
because the developer just didn't think to do it.

otoh, a lot of these inefficeincies are probably down in scruffy drivers
rather than in core or top-level code.





Re: init's children list is long and slows reaping children.

2007-04-09 Thread Davide Libenzi
On Mon, 9 Apr 2007, Linus Torvalds wrote:

> On Mon, 9 Apr 2007, Kyle Moffett wrote:
> > 
> > Maybe "struct posix_process" is more descriptive?  "struct process_posix"?
> > "Ugly POSIX process semantics data" seems simple enough to stick in a struct
> > name.  "struct uglyposix_process"?
> 
> Guys, you didn't read my message.
> 
> It's *not* about "process" stuff.  Anything that tries to call it a 
> "process" is *by*definition* worse than what it is now. Processes have all 
> the things that we've cleanly separated out for filesystem, VM, SysV 
> semaphore state, namespaces etc.
> 
> The "struct signal_struct" is the random *leftovers* from all the other 
> stuff. It's *not* about "processes". Never has been, and never will be. 

I proposed "struct task_shared_ctx" but you ducked :)


- Davide




Re: init's children list is long and slows reaping children.

2007-04-09 Thread Eric W. Biederman
Kyle Moffett <[EMAIL PROTECTED]> writes:

> Maybe "struct posix_process" is more descriptive?  "struct process_posix"?
> "Ugly POSIX process semantics data" seems simple  enough to stick in a struct
> name.  "struct uglyposix_process"?

Nack.

Linux internally doesn't have processes it has tasks with different
properties.  Anything that talks about processes will be even more
confusing.

The only thing really wrong with struct signal is that it is easy to
confuse with struct sighand.

Eric


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Linus Torvalds


On Mon, 9 Apr 2007, Kyle Moffett wrote:
> 
> Maybe "struct posix_process" is more descriptive?  "struct process_posix"?
> "Ugly POSIX process semantics data" seems simple enough to stick in a struct
> name.  "struct uglyposix_process"?

Guys, you didn't read my message.

It's *not* about "process" stuff.  Anything that tries to call it a 
"process" is *by*definition* worse than what it is now. Processes have all 
the things that we've cleanly separated out for filesystem, VM, SysV 
semaphore state, namespaces etc.

The "struct signal_struct" is the random *leftovers* from all the other 
stuff. It's *not* about "processes". Never has been, and never will be. 

It's mainly about signal cruft, because that's the nastiest part of POSIX 
threads behaviour, and that can't be clearly separated as one clear 
structure. 

So
 - it really *is* mostly about signal handling and signal sources. 
 - it has some random *cruft* in it that isn't about signals, but even 
   that is mostly a matter of "it was random cruft in the original task 
   structure too, and it DID NOT MERIT a flag of its own"
 - if you wanted to clean things up, you'd actually make things like 
   the "rlimit" info structures of their own, and have pointers to them. 

So that cruft largely got put into "signal_struct" just because they were 
the last thing to be moved out, along with the signal stuff (which was the 
big and painful part). NOT because "struct signal_struct" is somehow about 
"process state".

So stop blathering about processes. It has *nothing* to do with processes. 
It's primarily about signals, but it has "cruft" in it.

So an accurate name is

struct 
signal_struct_with_some_cruft_in_it_that_did_not_merit_a_struct_of_its_own

but that's actually fairly unwieldy to type, and so in the name of sanity 
and clear source code, it's called

struct signal_struct

and that's it.

And people who have argued for renaming it don't even seem to understand 
what it's *about*, so the arguments for renaming it have been very weak 
indeed so far.

IT IS NOT ABOUT "PROCESSES".

To be a "posix process", you have to share *everything*. The signal-struct 
isn't even a very important part of that sharing. In fact, it's quite 
arguably the *least* important part to share, which is why it was 
separated out as a separate thing absolutely *last*, and which is why we 
could do without it for quite long, with just some annoyingly subtle 
non-POSIX semantics.

Get it? It's the structure that is *least* important for a "process".

So stop blathering about "processes". Anybody who does that, just shows 
that they haven't thought it through!

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Kyle Moffett

On Apr 09, 2007, at 14:09:51, Bill Davidsen wrote:

> Ingo Molnar wrote:
> > * Linus Torvalds <[EMAIL PROTECTED]> wrote:
> > > On Fri, 6 Apr 2007, Davide Libenzi wrote:
> > > > > or lets just face it and name it what it is: process_struct ;-)
> > > > That'd be fine too! Wonder if Linus would swallow a rename patch
> > > > like that...
> > > I don't really see the point. It's not even *true*. A "process"
> > > includes more than the shared signal-handling - it would include
> > > files and fs etc too.
> > >
> > > So it's actually *more* correct to call it the shared signal
> > > state than it would be to call it "process" state.
> > we could call it "structure for everything that we know to be ugly
> > about POSIX process semantics" ;-) The rest, like files and fs
> > we've abstracted out already.
>
> So are you voting for ugly_struct? ;-)
>
> I do think this is still waiting for a more descriptive name, like
> proc_misc_struct or some such. Kernel code should be treated as
> literature, intended to be both read and readable.


Maybe "struct posix_process" is more descriptive?  "struct  
process_posix"?  "Ugly POSIX process semantics data" seems simple  
enough to stick in a struct name.  "struct uglyposix_process"?


Cheers,
Kyle Moffett

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Bill Davidsen

Ingo Molnar wrote:

> * Linus Torvalds <[EMAIL PROTECTED]> wrote:
>
> > On Fri, 6 Apr 2007, Davide Libenzi wrote:
> >
> > > > or lets just face it and name it what it is: process_struct ;-)
> > >
> > > That'd be fine too! Wonder if Linus would swallow a rename patch like
> > > that...
> >
> > I don't really see the point. It's not even *true*. A "process"
> > includes more than the shared signal-handling - it would include files
> > and fs etc too.
> >
> > So it's actually *more* correct to call it the shared signal state
> > than it would be to call it "process" state.
>
> we could call it "structure for everything that we know to be ugly about
> POSIX process semantics" ;-) The rest, like files and fs we've
> abstracted out already.
>
> Ingo

So are you voting for ugly_struct? ;-)

I do think this is still waiting for a more descriptive name, like 
proc_misc_struct or some such. Kernel code should be treated as 
literature, intended to be both read and readable.


--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Chris Snook

Eric W. Biederman wrote:

Linus Torvalds <[EMAIL PROTECTED]> writes:
I'm not sure anybody would really be unhappy with pptr pointing to some 
magic and special task that has pid 0 (which makes it clear to everybody 
that the parent is something special), and that has SIGCHLD set to SIG_IGN 
(which should make the exit case not even go through the zombie phase).


I can't even imagine *how* you'd make a tool unhappy with that, since even 
tools like "ps" (and even more "pstree") won't read all the process states
atomically, so they invariably will see parent pointers that don't even 
exist any more, because by the time they get to the parent, it has exited 
already.


Right.  pid == 1 being missing might cause some confusion, but having
ppid == 0 should be fine.  Heck pid == 1 already has ppid == 0, so it
is a value user space has had to deal with for a while.

In addition there was a period in 2.6 where most kernel threads
and init had a pgid == 0 and a session  == 0, and nothing seemed
to complain.

We should probably make all of the kernel threads children of
init_task.  The initial idle thread on the first cpu that is the
parent of pid == 1.   That will give the ppid == 0 naturally because
the idle thread has pid == 0.


Linus, Eric, thanks for the history lesson.  I think it's safe to say 
that anything that breaks because of this sort of change was already 
broken anyway.


If we're going to scale to an obscene number of CPUs (which I believe 
was the original motivation on this thread) then putting the dead 
children on their own list will probably scale better.


-- Chris
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Ingo Molnar

* Linus Torvalds <[EMAIL PROTECTED]> wrote:

> On Fri, 6 Apr 2007, Davide Libenzi wrote:
> > > 
> > > or lets just face it and name it what it is: process_struct ;-)
> > 
> > That'd be fine too! Wonder if Linus would swallow a rename patch like 
> > that...
> 
> I don't really see the point. It's not even *true*. A "process" 
> includes more than the shared signal-handling - it would include files 
> and fs etc too.
> 
> So it's actually *more* correct to call it the shared signal state 
> than it would be to call it "process" state.

we could call it "structure for everything that we know to be ugly about 
POSIX process semantics" ;-) The rest, like files and fs we've 
abstracted out already.

Ingo
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: init's children list is long and slows reaping children.

2007-04-09 Thread Eric W. Biederman
Kyle Moffett <[EMAIL PROTECTED]> writes:

> Maybe "struct posix_process" is more descriptive?  "struct process_posix"?
> "Ugly POSIX process semantics data" seems simple enough to stick in a struct
> name.  "struct uglyposix_process"?

Nack.

Linux internally doesn't have processes it has tasks with different
properties.  Anything that talks about processes will be even more
confusing.

The only thing really wrong with struct signal is that it is easy to
confuse with struct sighand.

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Davide Libenzi
On Mon, 9 Apr 2007, Linus Torvalds wrote:

> On Mon, 9 Apr 2007, Kyle Moffett wrote:
> >
> > Maybe "struct posix_process" is more descriptive?  "struct process_posix"?
> > "Ugly POSIX process semantics data" seems simple enough to stick in a struct
> > name.  "struct uglyposix_process"?
>
> Guys, you didn't read my message.
>
> It's *not* about "process" stuff.  Anything that tries to call it a
> "process" is *by*definition* worse than what it is now. Processes have all
> the things that we've cleanly separated out for filesystem, VM, SysV
> semaphore state, namespaces etc.
>
> The "struct signal_struct" is the random *leftovers* from all the other
> stuff. It's *not* about "processes". Never has been, and never will be.

I proposed struct task_shared_ctx but you ducked :)


- Davide


-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Andrew Morton
On Fri, 06 Apr 2007 18:38:40 -0400
Jeff Garzik <[EMAIL PROTECTED]> wrote:

> Robin Holt wrote:
> > We have been testing a new larger configuration and we are seeing a very
> > large scan time of init's tsk->children list.  In the cases we are seeing,
> > there are numerous kernel processes created for each cpu (ie: events/0
> > ... events/big number, xfslogd/0 ... xfslogd/big number).  These are
> > all on the list ahead of the processes we are currently trying to reap.
>
> What about attacking the explosion of kernel threads?
>
> As CPU counts increase, the number of per-CPU kernel threads gets really
> ridiculous.
>
> I would rather change the implementation under the hood to start per-CPU
> threads on demand, similar to a thread-pool implementation.
>
> Boxes with $BigNum CPUs probably won't ever use half of those threads.

I suspect there are quite a few kernel threads which don't really need to
be threads at all: the code would quite happily work if it was changed to
use keventd, via schedule_work() and friends.  But kernel threads are
somewhat easier to code for.

I also suspect that there are a number of workqueue threads which
could/should have used create_singlethread_workqueue().  Often this is
because the developer just didn't think to do it.

otoh, a lot of these inefficiencies are probably down in scruffy drivers
rather than in core or top-level code.

I also wonder where all these parented-by-init,
presumably-not-using-kthread kernel threads are coming from

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Eric W. Biederman
Andrew Morton <[EMAIL PROTECTED]> writes:

> I suspect there are quite a few kernel threads which don't really need to
> be threads at all: the code would quite happily work if it was changed to
> use keventd, via schedule_work() and friends.  But kernel threads are
> somewhat easier to code for.
>
> I also suspect that there are a number of workqueue threads which
> could/should have used create_singlethread_workqueue().  Often this is
> because the developer just didn't think to do it.
>
> otoh, a lot of these inefficiencies are probably down in scruffy drivers
> rather than in core or top-level code.
>
> I also wonder where all these parented-by-init,
> presumably-not-using-kthread kernel threads are coming from


From another piece of this thread.

> > Robin how many kernel thread per cpu are you seeing?
>
> 10.
>
> FYI, pid 1539 is kthread.
>
> a01:~ # ps -ef | egrep "\[.*\/255\]"
> root   512 1  0 Apr08 ?00:00:00 [migration/255]
> root   513 1  0 Apr08 ?00:00:00 [ksoftirqd/255]
> root  1281 1  0 Apr08 ?00:00:02 [events/255]
> root  2435  1539  0 Apr08 ?00:00:00 [kblockd/255]
> root  3159  1539  0 Apr08 ?00:00:00 [aio/255]
> root  4007  1539  0 Apr08 ?00:00:00 [cqueue/255]
> root  8653  1539  0 Apr08 ?00:00:00 [ata/255]
> root 17438  1539  0 Apr08 ?00:00:00 [xfslogd/255]
> root 17950  1539  0 Apr08 ?00:00:00 [xfsdatad/255]
> root 18426  1539  0 Apr08 ?00:00:00 [rpciod/255]


So it looks like there were about 1500 kernel threads that started up before
kthread started.

So the kernel threads appear to have init as their parent because
they started before kthread, for the most part.

At 10 kernel threads per cpu there may be a little bloat but it isn't
out of control.  It is mostly that we are observing the kernel as
NR_CPUS approaches infinity.  4096 isn't infinity yet but it's easily
a 1000-fold bigger than most people are used to :)

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Andrew Morton
On Mon, 09 Apr 2007 18:48:54 -0600
[EMAIL PROTECTED] (Eric W. Biederman) wrote:

> Andrew Morton <[EMAIL PROTECTED]> writes:
>
> > I suspect there are quite a few kernel threads which don't really need to
> > be threads at all: the code would quite happily work if it was changed to
> > use keventd, via schedule_work() and friends.  But kernel threads are
> > somewhat easier to code for.
> >
> > I also suspect that there are a number of workqueue threads which
> > could/should have used create_singlethread_workqueue().  Often this is
> > because the developer just didn't think to do it.
> >
> > otoh, a lot of these inefficiencies are probably down in scruffy drivers
> > rather than in core or top-level code.
> >
> > I also wonder where all these parented-by-init,
> > presumably-not-using-kthread kernel threads are coming from
>
>
> From another piece of this thread.

Yeah, sorry.  Without mentioning any names, [EMAIL PROTECTED] broke the
threading so I came in halfway.

> > > Robin how many kernel thread per cpu are you seeing?
> >
> > 10.
> >
> > FYI, pid 1539 is kthread.
> >
> > a01:~ # ps -ef | egrep "\[.*\/255\]"
> > root   512 1  0 Apr08 ?00:00:00 [migration/255]
> > root   513 1  0 Apr08 ?00:00:00 [ksoftirqd/255]
> > root  1281 1  0 Apr08 ?00:00:02 [events/255]
> > root  2435  1539  0 Apr08 ?00:00:00 [kblockd/255]
> > root  3159  1539  0 Apr08 ?00:00:00 [aio/255]
> > root  4007  1539  0 Apr08 ?00:00:00 [cqueue/255]
> > root  8653  1539  0 Apr08 ?00:00:00 [ata/255]
> > root 17438  1539  0 Apr08 ?00:00:00 [xfslogd/255]
> > root 17950  1539  0 Apr08 ?00:00:00 [xfsdatad/255]
> > root 18426  1539  0 Apr08 ?00:00:00 [rpciod/255]
 
 
> So it looks like there were about 1500 kernel threads that started up before
> kthread started.
>
> So the kernel threads appear to have init as their parent because
> they started before kthread, for the most part.
>
> At 10 kernel threads per cpu there may be a little bloat but it isn't
> out of control.  It is mostly that we are observing the kernel as
> NR_CPUS approaches infinity.  4096 isn't infinity yet but it's easily
> a 1000-fold bigger than most people are used to :)

Yes, I expect we could run init_workqueues() much earlier, from process 0
rather than from process 1.  Something _might_ blow up as it often does when
we change startup ordering, but in this case it would be somewhat surprising.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Dave Jones
On Mon, Apr 09, 2007 at 05:23:39PM -0700, Andrew Morton wrote:

> I suspect there are quite a few kernel threads which don't really need to
> be threads at all: the code would quite happily work if it was changed to
> use keventd, via schedule_work() and friends.  But kernel threads are
> somewhat easier to code for.

Perhaps too easy.  We have a bunch of kthreads sitting around that, afaict,
are there 'just in case', not because they're actually in use.

   10 ?S 0:00 [khelper]

Why doesn't this get spawned when it needs to?

  164 ?S 0:00 [cqueue/0]
  165 ?S 0:00 [cqueue/1]

I'm not even sure wth these are.

  166 ?S 0:00 [ksuspend_usbd]

Just in case I decide to suspend ? Sounds.. handy.
But why not spawn this just after we start a suspend?

  209 ?S 0:00 [aio/0]
  210 ?S 0:00 [aio/1]

I'm sure I'd appreciate these a lot more if I had any AIO using apps.

  364 ?S 0:00 [kpsmoused]

I never did figure out why this was a thread.

  417 ?S 0:00 [scsi_eh_1]
  418 ?S 0:00 [scsi_eh_2]
  419 ?S 5:28 [scsi_eh_3]
  426 ?S 0:00 [scsi_eh_4]
  427 ?S 0:00 [scsi_eh_5]
  428 ?S 0:00 [scsi_eh_6]
  429 ?S 0:00 [scsi_eh_7]

Just in case my 7-in-1 card reader gets an error.
(Which is unlikely on at least 6 of the slots, as evidenced by the runtime
column -- though now I'm curious as to why the error handler was running
so much given I've not experienced any errors.)
This must be a fun one on huge diskfarms.

  884 ?S 0:00 [kgameportd]

Just in case I ever decide to plug something into my soundcard.

 2179 ?S 0:00 [kmpathd/0]
 2180 ?S 0:00 [kmpathd/1]
 2189 ?S 0:00 [kmirrord]

Just loading the modules starts up the threads, regardless
of whether you use them. (Not sure why they're getting loaded,
something for me to look into)

 3246 ?S 0:00 [krfcommd]

I don't even *have* bluetooth hardware.
(Yes, the module shouldn't have loaded, but that's another battle..)

  otoh, a lot of these inefficeincies are probably down in scruffy drivers
  rather than in core or top-level code.

You say scruffy, but most of the proliferation of kthreads comes
from code written in the last few years.  Compare the explosion of kthreads
we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
slowing down at all.

On the 2-way box I grabbed the above ps output from, I end up with 69 kthreads.
It doesn't surprise me at all that bigger iron is starting to see issues.

Dave

-- 
http://www.codemonkey.org.uk
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Andrew Morton
On Mon, 9 Apr 2007 21:59:12 -0400 Dave Jones [EMAIL PROTECTED] wrote:

> On Mon, Apr 09, 2007 at 05:23:39PM -0700, Andrew Morton wrote:
>
> > I suspect there are quite a few kernel threads which don't really need to
> > be threads at all: the code would quite happily work if it was changed to
> > use keventd, via schedule_work() and friends.  But kernel threads are
> > somewhat easier to code for.
>
> Perhaps too easy.  We have a bunch of kthreads sitting around that, afaict,
> are there 'just in case', not because they're actually in use.

This is fun.

10 ?S 0:00 [khelper]

That one's needed to parent the call_usermodehelper() apps.  I don't think
it does anything else.  We used to use keventd for this but that had some
problem which I forget.  (Who went and misnamed keventd to "events", too?
Nobody calls it "events", and with good reason)

> Why doesn't this get spawned when it needs to?
>
> 164 ?S 0:00 [cqueue/0]
> 165 ?S 0:00 [cqueue/1]
>
> I'm not even sure wth these are.

Me either.

 ...


: root 3  0.0  0.0  0 0 ?S18:51   0:00 [watchdog/0]

That's the softlockup detector.  Confusingly named to look like a, err,
watchdog.  Could probably use keventd.

: root 5  0.0  0.0  0 0 ?S18:51   0:00 [kthread]

That's there to parent the kthread_create()d threads.  Could presumably use
khelper.

: root   152  0.0  0.0  0 0 ?S18:51   0:00 [ata/0]

Does it need to be per-cpu?

: root   153  0.0  0.0  0 0 ?S18:51   0:00 [ata_aux]

That's a single-threaded workqueue handler.  Perhaps could use keventd.

: root   299  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_0]
: root   300  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_1]
: root   305  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_2]
: root   306  0.0  0.0  0 0 ?S18:51   0:00 [scsi_eh_3]

This machine has one CPU, one sata disk and one DVD drive.  The above is
hard to explain.

: root   319  0.0  0.0  0 0 ?S18:51   0:00 [pccardd]

hm.

: root   331  0.0  0.0  0 0 ?S18:51   0:00 [kpsmoused]

hm.

: root   337  0.0  0.0  0 0 ?S18:51   0:00 [kedac]

hm.  I didn't know that the Vaio had EDAC.

: root  1173  0.0  0.0  0 0 ?S18:51   0:00 [khpsbpkt]

I can't even pronounce that.

: root  1354  0.0  0.0  0 0 ?S18:51   0:00 [knodemgrd_0]

OK, I do have 1394 hardware, but it hasn't been used.

: root  1636  0.0  0.0  0 0 ?S18:52   0:00 [kondemand/0]

I blame davej.

> > otoh, a lot of these inefficiencies are probably down in scruffy drivers
> > rather than in core or top-level code.
>
> You say scruffy, but most of the proliferation of kthreads comes
> from code written in the last few years.  Compare the explosion of kthreads
> we see coming from 2.4 to 2.6. It's disturbing, and I don't see it
> slowing down at all.
>
> On the 2-way box I grabbed the above ps output from, I end up with 69 kthreads.
> It doesn't surprise me at all that bigger iron is starting to see issues.
 

Sure.

I don't think it's completely silly to object to all this.  Sure, a kernel
thread is worth 4k in the best case, but I bet they have associated unused
resources and as we've seen, they can cause overhead.

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Linus Torvalds


On Mon, 9 Apr 2007, Andrew Morton wrote:
 
> 10 ?S 0:00 [khelper]
>
> That one's needed to parent the call_usermodehelper() apps.  I don't think
> it does anything else.  We used to use keventd for this but that had some
> problem which I forget.

I think it was one of a long series of deadlocks. 

Using a keventd for many different things sounds clever and nice, but 
then sucks horribly when one event triggers another event, and they depend 
on each other. Solution: use independent threads for the events.

Linus
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Alexey Dobriyan
On Mon, Apr 09, 2007 at 07:30:56PM -0700, Andrew Morton wrote:
> On Mon, 9 Apr 2007 21:59:12 -0400 Dave Jones <[EMAIL PROTECTED]> wrote:

[possible topic for KS2007]

> > 164 ?S 0:00 [cqueue/0]
> > 165 ?S 0:00 [cqueue/1]
> >
> > I'm not even sure wth these are.
>
> Me either.

drivers/connector/connector.c:
455 dev->cbdev = cn_queue_alloc_dev("cqueue", dev->nls);

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: init's children list is long and slows reaping children.

2007-04-09 Thread Dave Jones
On Tue, Apr 10, 2007 at 09:07:54AM +0400, Alexey Dobriyan wrote:
  On Mon, Apr 09, 2007 at 07:30:56PM -0700, Andrew Morton wrote:
   On Mon, 9 Apr 2007 21:59:12 -0400 Dave Jones [EMAIL PROTECTED] wrote:
  
  [possible topic for KS2007]
  
  164 ?S 0:00 [cqueue/0]
  165 ?S 0:00 [cqueue/1]
   
I'm not even sure wth these are.
  
   Me either.
  
  drivers/connector/connector.c:
  455  dev->cbdev = cn_queue_alloc_dev("cqueue", dev->nls);

Maybe I have apps relying on the connector stuff that I don't
even realise, but ttbomk, nothing actually *needs* this
for 99% of users if I'm not mistaken.

* wonders why he never built this modular..

config PROC_EVENTS
boolean "Report process events to userspace"
depends on CONNECTOR=y


Yay.

Dave

-- 
http://www.codemonkey.org.uk


Re: init's children list is long and slows reaping children.

2007-04-08 Thread Oleg Nesterov
On 04/07, Eric W. Biederman wrote:
>
> Oleg Nesterov <[EMAIL PROTECTED]> writes:
>
> > On 04/06, Oleg Nesterov wrote:
> >>
> >> @@ -275,10 +275,7 @@ static void reparent_to_init(void)
> >>remove_parent(current);
> >>current->parent = child_reaper(current);
> >>current->real_parent = child_reaper(current);
> >> -  add_parent(current);
> >> -
> >> -  /* Set the exit signal to SIGCHLD so we signal init on exit */
> >> -  current->exit_signal = SIGCHLD;
> >> +  current->exit_signal = -1;
> >>
> >>if (!has_rt_policy(current) && (task_nice(current) < 0))
> >>set_user_nice(current, 0);
> >>
> >> is enough. init is still our parent (make ps happy), but it can't see us,
> >> we are not on ->children list.
> >
> > OK, this doesn't work. A multi-threaded init may do execve().
>
> Good catch.  daemonize must die!

Well, this is not the problem of daemonize(), but I agree, daemonize() should
die with the changes you proposed.

> > So, we can re-parent a kernel thread to swapper. In that case it doesn't 
> > matter
> > if we put task on ->children list or not.
>
> Yes.  We can.
>
> > User-visible change. Acceptable?
>
> We would have user visible changes when we changed to kthread_create anyway,
> and it's none of user space's business so certainly.

OK, I'll try to do a "just for review" patch. I'm afraid pstree will be confused.

> >> Of course, we also need to add reparent_to_init() to kthread() and
> >> (say) stopmachine(). Or we can create kernel_thread_detached() and
> >> modify callers to use it.
> >
> > It would be very nice to introduce CLONE_KERNEL_THREAD instead, then
>
> If we are going to do something to copy_process and the like let's take
> this one step farther.  Let's pass in the value of the task to copy.

Yes! I thought about that too, but see below.

> Then we can add a wrapper around copy_process to build kernel_thread
> something like:
>
> struct task_struct *__kernel_thread(int (*fn)(void *), void * arg,
>   unsigned long flags)
> {
>   struct task_struct *task;
>   struct pt_regs regs, *reg;
>
> reg = kernel_thread_regs(&regs, fn, arg);
>   task = copy_process(&init_task, flags, 0, reg, 0, NULL, NULL, NULL, 0);
>   if (!IS_ERR(task))
>   wake_up_new_task(task, flags);
>
>   return task;
> }

First, we still need some additional flag/parameter to set ->exit_signal = -1.
Let's use CLONE_KERNEL_THREAD for discussion.

Now. Can we do copy_process(an_arbitrary_task) ? If yes, we need additional
locking in copy_process(). Just think about ->signal,->parent access. Nasty,
and probably not so useful.

If we can only use "current" or init_task, CLONE_KERNEL_THREAD is enough.
Just now it only

- sets ->exit_signal = -1

- sets ->parent = &init_task

(note that the second change could be omitted, we can use CLONE_PARENT
 if we know that all users of CLONE_KERNEL_THREAD are kernel threads).

> After that daemonize just becomes:
>
> void daemonize(const char *name, ...)
> {
>   va_list args;
>
>   va_start(args, name);
>   vsnprintf(current->comm, sizeof(current->comm), name, args);
>   va_end(args);
> }
>
> And kthread_create becomes:
>
> struct task_struct *kthread_create(int (*threadfn)(void *data),
>  void *data,
>  const char namefmt[],
>  ...)
> {
>   struct kthread_create_info create;
>   struct task_struct *task;
>
>   create.threadfn = threadfn;
>   create.data = data;
>
>   /* We want our own signal handler (we take no signals by default). */
> task = __kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | 
> SIGCHLD);
>   if (!IS_ERR(task)) {
>   va_list args;
>   va_start(args, namefmt);
>   vsnprintf(task->comm, sizeof(task->comm),  namefmt, args);
>   va_end(args);
>   }
> return task;
> }

Yes, nice. This is not complete, we still have to wait for completion,
because kthread_create() promises that a new thread has no CPU on
return (so we can use kthread_bind() for example), but nice anyway.

> If we are willing to go that far.  I think it is worth it to touch the
> architecture specific code.  As that removes an unnecessary wait
> queue, and ensures that kernel threads always start with a consistent
> state.  So it is a performance, scalability and simplicity boost.

Yes.

But note that CLONE_KERNEL_THREAD allows us to achieve the same step by
step. For example, we can extend the meaning of CLONE_KERNEL_THREAD to
use init_task.mm in copy_mm(), then copy_fs(), etc.

Eric, I am far from sure I am right though. Probably it is better to do
one big change as you suggested.

However, I like the idea of for-in-kernel-only CLONE_ flags. For example,
we can kill the ugly "pid" parameter of copy_process() and use CLONE_IDLE
instead.

Also, we can't use __kernel_thread() you propose if we want do_wait(),
using CLONE_ flags is a bit more flexible.

BTW. I think your idea of kernel_thread_regs() is very nice in any case. Is
it enough (and possible) to implement an

Re: init's children list is long and slows reaping children.

2007-04-07 Thread Eric W. Biederman
Oleg Nesterov <[EMAIL PROTECTED]> writes:

> On 04/06, Oleg Nesterov wrote:
>> 
>> Perhaps,
>> 
>> --- t/kernel/exit.c~ 2007-04-06 23:31:31.0 +0400
>> +++ t/kernel/exit.c  2007-04-06 23:31:57.0 +0400
>> @@ -275,10 +275,7 @@ static void reparent_to_init(void)
>>  remove_parent(current);
>>  current->parent = child_reaper(current);
>>  current->real_parent = child_reaper(current);
>> -add_parent(current);
>> -
>> -/* Set the exit signal to SIGCHLD so we signal init on exit */
>> -current->exit_signal = SIGCHLD;
>> +current->exit_signal = -1;
>>  
>>  if (!has_rt_policy(current) && (task_nice(current) < 0))
>>  set_user_nice(current, 0);
>> 
>> is enough. init is still our parent (make ps happy), but it can't see us,
>> we are not on ->children list.
>
> OK, this doesn't work. A multi-threaded init may do execve(). 

Good catch.  daemonize must die! 

> So, we can re-parent a kernel thread to swapper. In that case it doesn't 
> matter
> if we put task on ->children list or not.

Yes.  We can.

> User-visible change. Acceptable?

We would have user visible changes when we changed to kthread_create anyway,
and it's none of user space's business so certainly.

>> Off course, we also need to add preparent_to_init() to kthread() and
>> (say) stopmachine(). Or we can create kernel_thread_detached() and
>> modify callers to use it.
>
> It would be very nice to introduce CLONE_KERNEL_THREAD instead, then

If we are going to do something to copy_process and the like let's take
this one step farther.  Let's pass in the value of the task to copy.

Then we can add a wrapper around copy_process to build kernel_thread
something like:

struct task_struct *__kernel_thread(int (*fn)(void *), void * arg,
unsigned long flags)
{
struct task_struct *task;
struct pt_regs regs, *reg;

reg = kernel_thread_regs(&regs, fn, arg);
task = copy_process(&init_task, flags, 0, reg, 0, NULL, NULL, NULL, 0);
if (!IS_ERR(task))
wake_up_new_task(task, flags);

return task;
}

long kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
struct task_struct *task;
task = __kernel_thread(fn, arg, flags);
if (IS_ERR(task))
return PTR_ERR(task);
return task->pid;   
}

After that daemonize just becomes:

void daemonize(const char *name, ...)
{
va_list args;

va_start(args, name);
vsnprintf(current->comm, sizeof(current->comm), name, args);
va_end(args);
}

And kthread_create becomes:

struct task_struct *kthread_create(int (*threadfn)(void *data),
   void *data,
   const char namefmt[],
   ...)
{
struct kthread_create_info create;
struct task_struct *task;

create.threadfn = threadfn;
create.data = data;

/* We want our own signal handler (we take no signals by default). */
task = __kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | 
SIGCHLD);
if (!IS_ERR(task)) {
va_list args;
va_start(args, namefmt);
vsnprintf(task->comm, sizeof(task->comm),  namefmt, args);
va_end(args);
}
return task;
}

If we are willing to go that far.  I think it is worth it to touch the
architecture specific code.  As that removes an unnecessary wait
queue, and ensures that kernel threads always start with a consistent
state.  So it is a performance, scalability and simplicity boost.

Otherwise we should just put the code in the callers of kernel_thread
like we do today.

I don't have the energy to rework all of the architectures or even a
noticeable fraction of them right now.  I have too much on my plate.
A non-architecture specific solution looks like a fairly simple
patch and I might get around to it one of these days.

Eric


Re: init's children list is long and slows reaping children.

2007-04-07 Thread Oleg Nesterov
On 04/06, Oleg Nesterov wrote:
> 
> Perhaps,
> 
> --- t/kernel/exit.c~  2007-04-06 23:31:31.0 +0400
> +++ t/kernel/exit.c   2007-04-06 23:31:57.0 +0400
> @@ -275,10 +275,7 @@ static void reparent_to_init(void)
>   remove_parent(current);
>   current->parent = child_reaper(current);
>   current->real_parent = child_reaper(current);
> - add_parent(current);
> -
> - /* Set the exit signal to SIGCHLD so we signal init on exit */
> - current->exit_signal = SIGCHLD;
> + current->exit_signal = -1;
>  
>   if (!has_rt_policy(current) && (task_nice(current) < 0))
>   set_user_nice(current, 0);
> 
> is enough. init is still our parent (make ps happy), but it can't see us,
> we are not on ->children list.

OK, this doesn't work. A multi-threaded init may do execve(). Can we forbid
exec() from non-main thread? init is special anyway. We can simplify de_thread()
in this case, and kill tasklist_lock around zap_other_threads().

So, we can re-parent a kernel thread to swapper. In that case it doesn't matter
if we put task on ->children list or not.

User-visible change. Acceptable?

> Of course, we also need to add reparent_to_init() to kthread() and
> (say) stopmachine(). Or we can create kernel_thread_detached() and
> modify callers to use it.

It would be very nice to introduce CLONE_KERNEL_THREAD instead, then

--- kernel/fork.c~  2007-04-07 20:11:14.0 +0400
+++ kernel/fork.c   2007-04-07 23:40:35.0 +0400
@@ -1159,7 +1159,8 @@ static struct task_struct *copy_process(
p->parent_exec_id = p->self_exec_id;
 
/* ok, now we should be set up.. */
-   p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL);
+   p->exit_signal = (clone_flags & (CLONE_THREAD|CLONE_KERNEL_THREAD))
+   ? -1 : (clone_flags & CSIGNAL);
p->pdeath_signal = 0;
p->exit_state = 0;
 
@@ -1196,6 +1197,8 @@ static struct task_struct *copy_process(
/* CLONE_PARENT re-uses the old parent */
if (clone_flags & (CLONE_PARENT|CLONE_THREAD))
p->parent = current->parent;
+   else if (unlikely(clone_flags & CLONE_KERNEL_THREAD))
p->parent = &init_task;
else
p->parent = current;

That is all, very simple. However, in that case we should introduce

#define SYS_CLONE_MASK  (~CLONE_KERNEL_THREAD)

and change every sys_clone() implementation to filter out non-SYS_CLONE_MASK
flags. This is trivial (and imho useful), but

arch/sparc/kernel/process.c
arch/sparc64/kernel/process.c
arch/m68knommu/kernel/process.c
arch/m68k/kernel/process.c
arch/alpha/kernel/entry.S
arch/h8300/kernel/process.c
arch/v850/kernel/process.c
arch/frv/kernel/kernel_thread.S
arch/powerpc/kernel/misc_32.S
arch/powerpc/kernel/misc_64.S

implement kthread_create() via sys_clone(). This means that without additional
effort from maintainers these arches can't take advantage of CLONE_KERNEL_THREAD.

Dear CC list, any advice?

Oleg.


