Re: Module loading/unloading and "The Stop Machine"

2008-02-22 Thread Max Krasnyanskiy

Hi Andi,


> Max Krasnyanskiy <[EMAIL PROTECTED]> writes:
>>
>> static struct module *load_module(void __user *umod,
>>                                   unsigned long len,
>>                                   const char __user *uargs)
>> {
>>         ...
>>
>>         /* Now sew it into the lists so we can get lockdep and oops
>>          * info during argument parsing.  Noone should access us, since
>>          * strong_try_module_get() will fail. */
>>         stop_machine_run(__link_module, mod, NR_CPUS);
>>         ...
>> }
>
> Wow you found some really bad code. I bet it wouldn't be that
> difficult to fix the code to allow oops safe list insertion
> without using the big stop machine overkill hammer.

Let me know if you have something in mind. When I get a chance I'll stare
some more at that code and try to come up with an alternative solution.

Thanx
Max
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Module loading/unloading and "The Stop Machine"

2008-02-22 Thread Andi Kleen
Max Krasnyanskiy <[EMAIL PROTECTED]> writes:
>
> static struct module *load_module(void __user *umod,
>                                   unsigned long len,
>                                   const char __user *uargs)
> {
>         ...
>
>         /* Now sew it into the lists so we can get lockdep and oops
>          * info during argument parsing.  Noone should access us, since
>          * strong_try_module_get() will fail. */
>         stop_machine_run(__link_module, mod, NR_CPUS);
>         ...
> }

Wow you found some really bad code. I bet it wouldn't be that
difficult to fix the code to allow oops safe list insertion
without using the big stop machine overkill hammer.

-Andi
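
For illustration, one shape an "oops safe list insertion" could take is to publish a fully initialized entry with a single pointer store, so that a reader holding no lock sees either the old list or the new one, never a half-linked entry. The sketch below is userspace C11, not kernel code. `mod_entry`, `link_entry()` and `find_entry()` are hypothetical names, and it assumes writers are already serialized by a lock that is not shown. The kernel's RCU list helpers such as list_add_rcu() serve a similar purpose, but whether they fit this call site is not something the thread settles.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct module; only what the sketch needs. */
struct mod_entry {
    const char *name;
    struct mod_entry *_Atomic next;
};

/* Head of a singly linked list of loaded modules (zero-initialized). */
static struct mod_entry *_Atomic mod_list_head;

/*
 * Writer side. Assumes callers are serialized by some lock (not shown),
 * so the only possible race is against lock-free readers.
 */
static void link_entry(struct mod_entry *m)
{
    assert(m != NULL && m->name != NULL);
    struct mod_entry *old =
        atomic_load_explicit(&mod_list_head, memory_order_relaxed);
    atomic_store_explicit(&m->next, old, memory_order_relaxed);
    /* Release store: the entry is complete before it becomes reachable. */
    atomic_store_explicit(&mod_list_head, m, memory_order_release);
}

/* Reader side: takes no lock, so it could run from an oops-like context. */
static const struct mod_entry *find_entry(const char *name)
{
    struct mod_entry *m =
        atomic_load_explicit(&mod_list_head, memory_order_acquire);
    while (m != NULL) {
        if (strcmp(m->name, name) == 0)
            return m;
        m = atomic_load_explicit(&m->next, memory_order_acquire);
    }
    return NULL;
}
```

The sketch covers insertion only. Removal is the harder half, because a reader may still be walking an entry that is being unlinked, and that needs a grace-period or similar scheme on top.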


Re: Module loading/unloading and "The Stop Machine"

2008-02-21 Thread Max Krasnyanskiy

Tejun Heo wrote:
> Max Krasnyanskiy wrote:
>> Tejun Heo wrote:
>>> Max Krasnyanskiy wrote:
>>>> Thanks for the info. I guess I missed that from the code. In any case
>>>> that seems like a pretty heavy refcounting mechanism. In a sense that
>>>> every time something is loaded or unloaded entire machine freezes,
>>>> potentially for several milliseconds. Normally it's not a big deal. But
>>>> once you get more and more CPUs and/or start using realtime apps this
>>>> becomes a big deal.
>>>
>>> Module loading doesn't involve stop_machine last time I checked.  It's a
>>> big deal when unloading a module but it's actually a very good trade off
>>> because it makes much hotter path (module_get/put) much cheaper.  If
>>> your application can't stand stop_machine, simply don't unload a module.
>>
>> static struct module *load_module(void __user *umod,
>>                                   unsigned long len,
>>                                   const char __user *uargs)
>> {
>>         ...
>>
>>         /* Now sew it into the lists so we can get lockdep and oops
>>          * info during argument parsing.  Noone should access us, since
>>          * strong_try_module_get() will fail. */
>>         stop_machine_run(__link_module, mod, NR_CPUS);
>>         ...
>> }
>
> Ah... right.  That part doesn't have anything to do with module
> reference counting as the comment suggests and can probably be removed
> by updating how kallsyms synchronize against module load/unload.

That list (updated by __link_module) is accessed in a couple of other
places, i.e. outside the symbol lookup stuff used for kallsyms.

Max


Re: Module loading/unloading and "The Stop Machine"

2008-02-21 Thread Tejun Heo
Max Krasnyanskiy wrote:
> Tejun Heo wrote:
>> Max Krasnyanskiy wrote:
>>> Thanks for the info. I guess I missed that from the code. In any case
>>> that seems like a pretty heavy refcounting mechanism. In a sense that
>>> every time something is loaded or unloaded entire machine freezes,
>>> potentially for several milliseconds. Normally it's not a big deal. But
>>> once you get more and more CPUs and/or start using realtime apps this
>>> becomes a big deal.
>>
>> Module loading doesn't involve stop_machine last time I checked.  It's a
>> big deal when unloading a module but it's actually a very good trade off
>> because it makes much hotter path (module_get/put) much cheaper.  If
>> your application can't stand stop_machine, simply don't unload a module.
> 
> static struct module *load_module(void __user *umod,
>                                   unsigned long len,
>                                   const char __user *uargs)
> {
>         ...
>
>         /* Now sew it into the lists so we can get lockdep and oops
>          * info during argument parsing.  Noone should access us, since
>          * strong_try_module_get() will fail. */
>         stop_machine_run(__link_module, mod, NR_CPUS);
>         ...
> }

Ah... right.  That part doesn't have anything to do with module
reference counting as the comment suggests and can probably be removed
by updating how kallsyms synchronize against module load/unload.

> I actually rarely unload modules. The way I noticed the problem in the
> first place was that things started hanging when the tun driver was
> autoloaded or when fs automounts triggered some auto loading.
> These days it's kind of hard to have a semi-general purpose machine
> without module loading :).

Yeap, agreed.

-- 
tejun


Re: Module loading/unloading and "The Stop Machine"

2008-02-21 Thread Max Krasnyanskiy

Tejun Heo wrote:
> Max Krasnyanskiy wrote:
>> Thanks for the info. I guess I missed that from the code. In any case
>> that seems like a pretty heavy refcounting mechanism. In a sense that
>> every time something is loaded or unloaded entire machine freezes,
>> potentially for several milliseconds. Normally it's not a big deal. But
>> once you get more and more CPUs and/or start using realtime apps this
>> becomes a big deal.
>
> Module loading doesn't involve stop_machine last time I checked.  It's a
> big deal when unloading a module but it's actually a very good trade off
> because it makes much hotter path (module_get/put) much cheaper.  If
> your application can't stand stop_machine, simply don't unload a module.

static struct module *load_module(void __user *umod,
                                  unsigned long len,
                                  const char __user *uargs)
{
        ...

        /* Now sew it into the lists so we can get lockdep and oops
         * info during argument parsing.  Noone should access us, since
         * strong_try_module_get() will fail. */
        stop_machine_run(__link_module, mod, NR_CPUS);
        ...
}

I actually rarely unload modules. The way I noticed the problem in the
first place was that things started hanging when the tun driver was
autoloaded or when fs automounts triggered some auto loading.

These days it's kind of hard to have a semi-general purpose machine
without module loading :).

Max


Re: Module loading/unloading and "The Stop Machine"

2008-02-21 Thread Tejun Heo
Max Krasnyanskiy wrote:
> Thanks for the info. I guess I missed that from the code. In any case
> that seems like a pretty heavy refcounting mechanism. In a sense that
> every time something is loaded or unloaded entire machine freezes,
> potentially for several milliseconds. Normally it's not a big deal. But
> once you get more and more CPUs and/or start using realtime apps this
> becomes a big deal.

Module loading doesn't involve stop_machine last time I checked.  It's a
big deal when unloading a module but it's actually a very good trade off
because it makes much hotter path (module_get/put) much cheaper.  If
your application can't stand stop_machine, simply don't unload a module.

> And it's plain broken for the use case that I mentioned during CPU
> isolation discussions, i.e. when user-space thread(s) prevent the
> stopmachine kthread from running, in which case the machine simply
> hangs until those user-space threads exit.

This I don't know nothing about. :-)

-- 
tejun
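
For contrast, here is a userspace sketch of the lock-based scheme that the trade-off above avoids: every get takes the same lock as the final put, so the hot path pays for synchronization it almost never needs. The names `obj`, `obj_try_get()` and `obj_put()` are hypothetical, and a pthread mutex stands in for the lookup lock. It is also simplified relative to the kernel's atomic_dec_and_lock(), which only takes the lock when the count may drop to zero.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical refcounted object that can be found through a lookup. */
struct obj {
    int refcnt;        /* protected by lookup_lock */
    bool registered;   /* true while lookups may still hand it out */
};

static pthread_mutex_t lookup_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hot path: every caller serializes on lookup_lock. */
static bool obj_try_get(struct obj *o)
{
    bool ok;

    pthread_mutex_lock(&lookup_lock);
    ok = o->registered;
    if (ok)
        o->refcnt++;
    pthread_mutex_unlock(&lookup_lock);
    return ok;
}

/*
 * Cold path: dropping the last reference unregisters the object under
 * the same lock, so no get can slip in between "count hit zero" and
 * "object is gone". Returns true if this was the final put.
 */
static bool obj_put(struct obj *o)
{
    bool last;

    pthread_mutex_lock(&lookup_lock);
    assert(o->refcnt > 0);
    last = (--o->refcnt == 0);
    if (last)
        o->registered = false;
    pthread_mutex_unlock(&lookup_lock);
    return last;
}
```

The point of the comparison is only where the cost lands: here it is in obj_try_get(), which runs constantly, whereas the stop_machine approach moves it into the rare unload path.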


Re: Module loading/unloading and "The Stop Machine"

2008-02-21 Thread Max Krasnyanskiy

Hi Tejun,


> Max Krasnyansky wrote:
>> I was hoping you could answer a couple of questions about module
>> loading/unloading and the stop machine.
>> There was a recent discussion on LKML about CPU isolation patches I'm
>> working on. One of the patches makes stop machine ignore the isolated
>> CPUs. People of course had questions about that. So I started looking
>> into more details and got this silly, crazy idea that maybe we do not
>> need the stop machine any more :)
>>
>> As far as I can tell the stop machine is basically a safety net in
>> case some locking and refcounting mechanisms aren't bullet proof. In
>> other words if a subsystem can actually handle
>> registration/unregistration in a robust way, module loader/unloader
>> does not necessarily have to halt entire machine in order to
>> load/unload a module that belongs to that subsystem. I may of course
>> be completely wrong on that.
>
> Nope, it's an integral part of module reference counting.  When using
> refcnt for object lifetime management, the last put should be atomic
> against initial get of the object.  This is usually achieved by
> acquiring the lock used for object lookup before putting or using
> atomic_dec_and_lock().
>
> For module reference counts, this means that try_module_get() and
> try_stop_module() should be atomic.  Note that modules don't use simple
> refcnt so the latter part isn't module_put() but the analogy still
> works.  There are two ways to synchronize try_module_get() against
> try_stop_module() - the traditional is to grab lock in try_module_get()
> and use atomic_dec_and_lock() in try_stop_module(), which works but is
> bad performance-wise because try_module_get() is used way more than
> try_stop_module() is.  For example, an IO command can go through several
> try_module_get()'s.
>
> So, all the burden of synchronization is put onto try_stop_module().
> Because all of the cpus on the machine are stopped and none of them has
> been stopped in the middle of non-preemptible code, __try_stop_module()
> is synchronized against try_module_get() even though all the
> synchronization try_module_get() does is get_cpu().

Thanks for the info. I guess I missed that from the code. In any case
that seems like a pretty heavy refcounting mechanism. In a sense that
every time something is loaded or unloaded entire machine freezes,
potentially for several milliseconds. Normally it's not a big deal. But
once you get more and more CPUs and/or start using realtime apps this
becomes a big deal. And it's plain broken for the use case that I
mentioned during CPU isolation discussions, i.e. when user-space
thread(s) prevent the stopmachine kthread from running, in which case
the machine simply hangs until those user-space threads exit.

Initially I assumed that it had to do with subsystem
registration/unregistration being potentially unsafe. If it's only for
module ref counting, there's gotta be a less expensive way.
I'll think some more about it.

>> The problem with the stop machine is that it's a very very big gun :).
>> In a sense that it totally kills all the latencies and stuff since the
>> entire machine gets halted while module is being (un)loaded. Which is
>> a major issue for any realtime apps. Specifically for CPU isolation
>> the issue is that high-priority rt user-space thread prevents stop
>> machine threads from running and entire box just hangs waiting for it.
>> I'm kind of surprised that folks who use monster boxes with over 100
>> CPUs have not complained. It must be a huge hit for those machines to
>> halt the entire thing.
>>
>> It seems that over the last few years most subsystems got much better
>> at locking and refcounting. And I'm hoping that we can avoid halting
>> the entire machine these days.
>> For CPU isolation in particular the solution is simple. We can just
>> ignore isolated CPUs. What I'm trying to figure out is how safe it is
>> and whether we can avoid full halt altogether.
>
> Without the stop_machine call, there's no synchronization between
> initial get and final put.  Things will break.

Got it.
Thanks again for the explanation. I'll stare at the module code some
more with what you said in mind.

Max


Re: Module loading/unloading and "The Stop Machine"

2008-02-13 Thread Tejun Heo
Hello, Max.

Max Krasnyansky wrote:
> I was hoping you could answer a couple of questions about module
> loading/unloading and the stop machine.
> There was a recent discussion on LKML about CPU isolation patches I'm
> working on. One of the patches makes stop machine ignore the isolated
> CPUs. People of course had questions about that. So I started looking
> into more details and got this silly, crazy idea that maybe we do not
> need the stop machine any more :)
>
> As far as I can tell the stop machine is basically a safety net in
> case some locking and refcounting mechanisms aren't bullet proof. In
> other words if a subsystem can actually handle
> registration/unregistration in a robust way, module loader/unloader
> does not necessarily have to halt entire machine in order to
> load/unload a module that belongs to that subsystem. I may of course
> be completely wrong on that.

Nope, it's an integral part of module reference counting.  When using
refcnt for object lifetime management, the last put should be atomic
against initial get of the object.  This is usually achieved by
acquiring the lock used for object lookup before putting or using
atomic_dec_and_lock().

For module reference counts, this means that try_module_get() and
try_stop_module() should be atomic.  Note that modules don't use simple
refcnt so the latter part isn't module_put() but the analogy still
works.  There are two ways to synchronize try_module_get() against
try_stop_module() - the traditional is to grab lock in try_module_get()
and use atomic_dec_and_lock() in try_stop_module(), which works but is
bad performance-wise because try_module_get() is used way more than
try_stop_module() is.  For example, an IO command can go through several
try_module_get()'s.

So, all the burden of synchronization is put onto try_stop_module().
Because all of the cpus on the machine are stopped and none of them has
been stopped in the middle of non-preemptible code, __try_stop_module()
is synchronized against try_module_get() even though all the
synchronization try_module_get() does is get_cpu().

> The problem with the stop machine is that it's a very very big gun :).
> In a sense that it totally kills all the latencies and stuff since the
> entire machine gets halted while module is being (un)loaded. Which is
> a major issue for any realtime apps. Specifically for CPU isolation
> the issue is that high-priority rt user-space thread prevents stop
> machine threads from running and entire box just hangs waiting for it.
> I'm kind of surprised that folks who use monster boxes with over 100
> CPUs have not complained. It must be a huge hit for those machines to
> halt the entire thing.
>
> It seems that over the last few years most subsystems got much better
> at locking and refcounting. And I'm hoping that we can avoid halting
> the entire machine these days.
> For CPU isolation in particular the solution is simple. We can just
> ignore isolated CPUs. What I'm trying to figure out is how safe it is
> and whether we can avoid full halt altogether.

Without the stop_machine call, there's no synchronization between
initial get and final put.  Things will break.

Thanks.

-- 
tejun
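
The scheme described above can be sketched as a small single-threaded model: the get path touches only a per-CPU counter with no lock or atomic operation, and the stop path sums all counters. The sketch does not implement the "every other CPU is stopped" guarantee; it simply assumes it, which is exactly the part stop_machine provides in the real code. All names (`sim_module`, `sim_try_get()`, `sim_put()`, `sim_try_stop()`, `SIM_NR_CPUS`) are hypothetical, and the `cpu` argument stands in for what get_cpu() would return.

```c
#include <assert.h>
#include <stdbool.h>

#define SIM_NR_CPUS 4

enum sim_state { SIM_LIVE, SIM_GOING };

/* Hypothetical model of a module with one reference counter per CPU. */
struct sim_module {
    enum sim_state state;
    unsigned int ref[SIM_NR_CPUS];
};

/*
 * Hot path. Assumed to run with preemption disabled on `cpu`, so nothing
 * else touches ref[cpu] while we are here and no lock is needed.
 */
static bool sim_try_get(struct sim_module *m, int cpu)
{
    assert(cpu >= 0 && cpu < SIM_NR_CPUS);
    if (m->state != SIM_LIVE)
        return false;
    m->ref[cpu]++;
    return true;
}

/* A put may land on a different CPU than the get; only the sum matters. */
static void sim_put(struct sim_module *m, int cpu)
{
    assert(cpu >= 0 && cpu < SIM_NR_CPUS);
    m->ref[cpu]--;   /* unsigned wrap-around is fine, see sim_try_stop() */
}

/*
 * Cold path. Only correct if no other CPU can be inside sim_try_get()
 * or sim_put() right now, i.e. if the rest of the machine is stopped.
 */
static bool sim_try_stop(struct sim_module *m)
{
    unsigned int total = 0;

    for (int cpu = 0; cpu < SIM_NR_CPUS; cpu++)
        total += m->ref[cpu];   /* modular sum cancels wrapped counters */
    if (total != 0)
        return false;
    m->state = SIM_GOING;
    return true;
}
```

If sim_try_stop() ran while another CPU was between the state check and the increment in sim_try_get(), it could see a zero sum and mark the module as going while a reference is about to be taken. Stopping every CPU outside non-preemptible code rules that window out.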


Re: Module loading/unloading and "The Stop Machine"

2008-02-13 Thread Tejun Heo
Jike Song wrote:
> On 2/8/08, Max Krasnyansky <[EMAIL PROTECTED]> wrote:
>> Hi Rusty,
>>
>> I was hoping you could answer a couple of questions about module
>> loading/unloading and the stop machine.
> 
> I'm curious to know why it is called `stop machine', which is a queer
> name without any relationship with its function.

I guess it's "stop the rest of the machine" and run this.  Maybe it's
misnamed but stop_machine is kind of cool.  :-)

-- 
tejun
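
The "stop the rest of the machine and run this" reading can be illustrated with a userspace simulation in which worker threads play the CPUs. This is only a sketch under stated assumptions: the real mechanism uses per-CPU kernel threads and also deals with interrupts and scheduling priority, none of which is modeled here. `worker()`, `stop_workers_and_run()`, `bump()` and `run_demo()` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NWORKERS 3

static atomic_bool stop_requested;  /* set while the "machine" must stand still */
static atomic_int parked;           /* number of workers currently parked */
static atomic_bool quit;            /* tells workers to exit */

/* Each worker stands in for one CPU and parks whenever a stop is requested. */
static void *worker(void *arg)
{
    (void)arg;
    while (!atomic_load(&quit)) {
        if (atomic_load(&stop_requested)) {
            atomic_fetch_add(&parked, 1);
            while (atomic_load(&stop_requested))
                sched_yield();      /* stopped: do nothing until released */
            atomic_fetch_sub(&parked, 1);
        }
        sched_yield();              /* placeholder for normal work */
    }
    return NULL;
}

/* Park every worker, run fn(data) alone, then release the workers. */
static int stop_workers_and_run(int (*fn)(void *), void *data)
{
    atomic_store(&stop_requested, true);
    while (atomic_load(&parked) != NWORKERS)
        sched_yield();              /* waits forever if a worker never parks */
    int ret = fn(data);
    atomic_store(&stop_requested, false);
    while (atomic_load(&parked) != 0)
        sched_yield();
    return ret;
}

/* Example payload: increments a plain int with no locking of its own. */
static int bump(void *data)
{
    int *counter = data;
    return ++*counter;
}

/* Starts the workers, runs bump() once while they are parked, cleans up. */
static int run_demo(void)
{
    pthread_t threads[NWORKERS];
    int counter = 0;

    atomic_store(&quit, false);
    for (int i = 0; i < NWORKERS; i++) {
        int rc = pthread_create(&threads[i], NULL, worker, NULL);
        assert(rc == 0);
        (void)rc;
    }
    int ret = stop_workers_and_run(bump, &counter);
    atomic_store(&quit, true);
    for (int i = 0; i < NWORKERS; i++)
        pthread_join(threads[i], NULL);
    return ret;
}
```

The first wait loop in stop_workers_and_run() also shows why the design is fragile: a worker that never reaches its check keeps the caller spinning indefinitely, which is loosely analogous to the hang reported elsewhere in this thread when a high-priority user-space thread keeps the stop machine thread off a CPU.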


Re: Module loading/unloading and "The Stop Machine"

2008-02-13 Thread Jike Song
On 2/8/08, Max Krasnyansky <[EMAIL PROTECTED]> wrote:
> Hi Rusty,
>
> I was hoping you could answer a couple of questions about module
> loading/unloading and the stop machine.

I'm curious to know why it is called `stop machine', which is a queer
name without any relationship with its function.

Regards,
Jike


Re: Module loading/unloading and The Stop Machine

2008-02-13 Thread Jike Song
On 2/8/08, Max Krasnyansky [EMAIL PROTECTED] wrote:
 Hi Rusty,

 I was hopping you could answer a couple of questions about module 
 loading/unloading
 and the stop machine.

I'm curious to know why it is called `stop machine', which is a queer
name without any relationship with its function.

Regards,
Jike
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Module loading/unloading and The Stop Machine

2008-02-13 Thread Tejun Heo
Hello, Max.

Max Krasnyansky wrote:
> I was hoping you could answer a couple of questions about module
> loading/unloading and the stop machine.
> There was a recent discussion on LKML about CPU isolation patches I'm
> working on. One of the patches makes stop machine ignore the isolated
> CPUs. People of course had questions about that. So I started looking
> into more details and got this silly, crazy idea that maybe we do not
> need the stop machine any more :)
>
> As far as I can tell the stop machine is basically a safety net in case
> some locking and refcounting mechanisms aren't bulletproof. In other
> words, if a subsystem can actually handle registration/unregistration
> in a robust way, the module loader/unloader does not necessarily have
> to halt the entire machine in order to load/unload a module that
> belongs to that subsystem. I may of course be completely wrong on that.

Nope, it's an integral part of module reference counting.  When using
a refcnt for object lifetime management, the last put must be atomic
with respect to the initial get of the object.  This is usually achieved
by acquiring the lock used for object lookup before the final put, or by
using atomic_dec_and_lock().
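
The "last put atomic against initial get" rule can be illustrated with a minimal userspace sketch. It is only an analogy, not kernel code: struct obj, lookup_lock, obj_lookup_get(), obj_put() and dec_and_lock() are hypothetical names, and a pthread mutex plus C11 atomics stand in for the kernel's spinlock and atomic_t.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical object that is found through a table guarded by lookup_lock. */
struct obj {
    atomic_int refcnt;
    bool in_table;              /* protected by lookup_lock */
};

static pthread_mutex_t lookup_lock = PTHREAD_MUTEX_INITIALIZER;

/* Initial get: hand out a reference only while the object is still listed. */
static bool obj_lookup_get(struct obj *o)
{
    bool ok;

    pthread_mutex_lock(&lookup_lock);
    ok = o->in_table;
    if (ok)
        atomic_fetch_add(&o->refcnt, 1);
    pthread_mutex_unlock(&lookup_lock);
    return ok;
}

/*
 * Userspace stand-in for atomic_dec_and_lock(): drop one reference and
 * return true, with the lock held, only if the count reached zero.
 */
static bool dec_and_lock(atomic_int *cnt, pthread_mutex_t *lock)
{
    int old = atomic_load(cnt);

    /* Fast path: clearly not the last reference, no lock needed. */
    while (old > 1) {
        if (atomic_compare_exchange_weak(cnt, &old, old - 1))
            return false;
    }

    /* Slow path: possibly the last reference, serialize with lookups. */
    pthread_mutex_lock(lock);
    if (atomic_fetch_sub(cnt, 1) == 1)
        return true;
    pthread_mutex_unlock(lock);
    return false;
}

/* Put: the final put unlinks the object before any new get can see it. */
static void obj_put(struct obj *o)
{
    if (dec_and_lock(&o->refcnt, &lookup_lock)) {
        o->in_table = false;
        pthread_mutex_unlock(&lookup_lock);
    }
}

/* Single-threaded walk through the life cycle. */
static void refcnt_demo(void)
{
    struct obj o = { .refcnt = 1, .in_table = true };

    assert(obj_lookup_get(&o));     /* 1 -> 2 */
    obj_put(&o);                    /* 2 -> 1, fast path */
    obj_put(&o);                    /* 1 -> 0 under the lock, unlinked */
    assert(!obj_lookup_get(&o));    /* lookup now fails */
}
```

Because the drop to zero and the unlink both happen under the lock that lookups take, a new get either sees the object with a non-zero count or does not see it at all.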

For module reference counts, this means that try_module_get() and
try_stop_module() must be atomic with respect to each other.  Note that
modules don't use a simple refcnt, so the latter part isn't module_put(),
but the analogy still works.  There are two ways to synchronize
try_module_get() against try_stop_module().  The traditional way is to
grab a lock in try_module_get() and use atomic_dec_and_lock() in
try_stop_module(), which works but performs badly because
try_module_get() is called far more often than try_stop_module().  For
example, a single IO command can go through several try_module_get()
calls.

So, instead, the whole burden of synchronization is put onto
try_stop_module().  Because all of the cpus on the machine are stopped
and none of them has been stopped in the middle of non-preemptible code,
__try_stop_module() is synchronized against try_module_get() even though
the only synchronization try_module_get() does is get_cpu().
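
The asymmetric scheme described above can be sketched as a small single-threaded model. This is a simplification under stated assumptions, not the kernel implementation: NCPUS, struct model_module and the model_* functions are hypothetical names, and the "all other CPUs are stopped" condition is simply assumed by the caller of model_try_stop() rather than enforced.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPUS 4                 /* hypothetical CPU count for the model */

enum mod_state { MOD_LIVE, MOD_GOING };

struct model_module {
    enum mod_state state;
    int ref[NCPUS];             /* per-CPU counters; one may go negative */
};

/*
 * Cheap side: assumed to run on `cpu` with preemption disabled (what
 * get_cpu() provides in the kernel). No lock, no shared counter.
 */
static bool model_try_get(struct model_module *m, int cpu)
{
    if (m->state != MOD_LIVE)
        return false;
    m->ref[cpu]++;
    return true;
}

/* A put may happen on a different CPU than the matching get. */
static void model_put(struct model_module *m, int cpu)
{
    m->ref[cpu]--;
}

/*
 * Expensive side: assumed to run while every other CPU is parked outside
 * any non-preemptible section, so no model_try_get() is half finished.
 */
static bool model_try_stop(struct model_module *m)
{
    int sum = 0;

    for (int i = 0; i < NCPUS; i++)
        sum += m->ref[i];
    if (sum != 0)
        return false;           /* still in use */
    m->state = MOD_GOING;       /* later gets will fail */
    return true;
}

static void model_demo(void)
{
    struct model_module m = { .state = MOD_LIVE };

    assert(model_try_get(&m, 0));   /* ref[0] == 1 */
    assert(!model_try_stop(&m));    /* in use, unload refused */
    model_put(&m, 2);               /* ref[2] == -1, sum is 0 again */
    assert(model_try_stop(&m));     /* unload may proceed */
    assert(!model_try_get(&m, 1));  /* module is going away */
}
```

The point of the design is that the frequent operation touches only CPU-local data, while the rare one pays for a consistent view of all counters.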

> The problem with the stop machine is that it's a very, very big gun :).
> In the sense that it totally kills all the latencies and stuff, since
> the entire machine gets halted while a module is being (un)loaded, which
> is a major issue for any realtime apps. Specifically for CPU isolation,
> the issue is that a high-priority rt user-space thread prevents the stop
> machine threads from running and the entire box just hangs waiting for
> it. I'm kind of surprised that folks who use monster boxes with over
> 100 CPUs have not complained. It must be a huge hit for those machines
> to halt the entire thing.
>
> It seems that over the last few years most subsystems got much better
> at locking and refcounting. And I'm hoping that we can avoid halting
> the entire machine these days. For CPU isolation in particular the
> solution is simple. We can just ignore isolated CPUs. What I'm trying
> to figure out is how safe it is and whether we can avoid a full halt
> altogether.

Without the stop_machine call, there's no synchronization between
initial get and final put.  Things will break.
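
One way to picture the breakage is to replay a bad interleaving by hand. The sketch below is a hypothetical single-threaded model (struct race_module and the race_* functions are made-up names, and a single counter replaces the per-CPU ones). It splits the get into its check and its increment so that an unloader that runs without stopping the other CPUs can slip in between.

```c
#include <assert.h>
#include <stdbool.h>

enum race_state { RACE_LIVE, RACE_GOING };

struct race_module {
    enum race_state state;
    int refs;                   /* one counter is enough for the model */
};

/* First half of a get: is the module still live? */
static bool race_get_check(const struct race_module *m)
{
    return m->state == RACE_LIVE;
}

/* Second half of a get: take the reference. */
static void race_get_commit(struct race_module *m)
{
    m->refs++;
}

/* Unloader called directly, without stopping the other CPUs first. */
static bool race_try_stop(struct race_module *m)
{
    if (m->refs != 0)
        return false;
    m->state = RACE_GOING;
    return true;
}

/* Returns true if a reference ends up held on a module that is going away. */
static bool race_demo(void)
{
    struct race_module m = { RACE_LIVE, 0 };

    bool saw_live = race_get_check(&m);  /* CPU 0: module looks live */
    bool stopped = race_try_stop(&m);    /* CPU 1: refs == 0, unload starts */
    if (saw_live)
        race_get_commit(&m);             /* CPU 0: reference taken too late */

    return stopped && m.refs == 1 && m.state == RACE_GOING;
}
```

With the other CPUs stopped outside non-preemptible code, the unloader cannot run between the check and the increment. The window is only a few instructions wide, so a stress test may plausibly run for a long time without ever hitting it.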

Thanks.

-- 
tejun


Re: Module loading/unloading and "The Stop Machine"

2008-02-08 Thread Max Krasnyanskiy

Max Krasnyansky wrote:
> Hi Rusty,
>
> I was hoping you could answer a couple of questions about module
> loading/unloading and the stop machine.
> There was a recent discussion on LKML about CPU isolation patches I'm
> working on. One of the patches makes stop machine ignore the isolated
> CPUs. People of course had questions about that. So I started looking
> into more details and got this silly, crazy idea that maybe we do not
> need the stop machine any more :)
>
> As far as I can tell the stop machine is basically a safety net in case
> some locking and refcounting mechanisms aren't bulletproof. In other
> words, if a subsystem can actually handle registration/unregistration
> in a robust way, the module loader/unloader does not necessarily have
> to halt the entire machine in order to load/unload a module that
> belongs to that subsystem. I may of course be completely wrong on that.
>
> The problem with the stop machine is that it's a very, very big gun :).
> In the sense that it totally kills all the latencies and stuff, since
> the entire machine gets halted while a module is being (un)loaded, which
> is a major issue for any realtime apps. Specifically for CPU isolation,
> the issue is that a high-priority rt user-space thread prevents the stop
> machine threads from running and the entire box just hangs waiting for
> it. I'm kind of surprised that folks who use monster boxes with over
> 100 CPUs have not complained. It must be a huge hit for those machines
> to halt the entire thing.
>
> It seems that over the last few years most subsystems got much better
> at locking and refcounting. And I'm hoping that we can avoid halting
> the entire machine these days. For CPU isolation in particular the
> solution is simple. We can just ignore isolated CPUs. What I'm trying
> to figure out is how safe it is and whether we can avoid a full halt
> altogether.
>
> So. Here is what I tried today on my Core2 Duo laptop:
>
> > --- a/kernel/stop_machine.c
> > +++ b/kernel/stop_machine.c
> > @@ -204,11 +204,14 @@ int stop_machine_run(int (*fn)(void *), void *data, unsigned int cpu)
> >  
> >  	/* No CPUs can come up or down during this. */
> >  	lock_cpu_hotplug();
> > +/*
> >  	p = __stop_machine_run(fn, data, cpu);
> >  	if (!IS_ERR(p))
> >  		ret = kthread_stop(p);
> >  	else
> >  		ret = PTR_ERR(p);
> > +*/
> > +	ret = fn(data);
> >  	unlock_cpu_hotplug();
> >  
> >  	return ret;
>
> i.e. completely disabled the stop machine. It just loads/unloads modules
> without a full halt.
> I then ran three scripts:
>
> while true; do
>     /sbin/modprobe -r uhci_hcd
>     /sbin/modprobe uhci_hcd
>     sleep 10
> done
>
> while true; do
>     /sbin/modprobe -r tg3
>     /sbin/modprobe tg3
>     sleep 2
> done
>
> while true; do
>     /usr/sbin/tcpdump -i eth0
> done
>
> The machine has a bunch of USB devices connected to it. The two most
> interesting are a Bluetooth dongle and a USB mouse. By loading/unloading
> the UHCI driver we're touching sysfs, the USB stack, the Bluetooth
> stack, the HID layer and the input layer. X is running and is using that
> USB mouse. The Bluetooth services are running too.
>
> By loading/unloading the TG3 driver we're touching sysfs and the network
> stack (a bunch of layers). The machine is running NetworkManager and
> tcpdumping on eth0, which is registered by TG3.
> This is a pretty good stress test in general, let alone with the stop
> machine disabled.
>
> I left all that running for the whole day while doing normal day-to-day
> things: compiling a bunch of things, emails, office apps, etc. That's
> where I'm writing this email from :). It's still running all that :)
>
> So the question is: do we still need stop machine? I must be missing
> something obvious. But things seem to be working pretty well without it.
> I certainly feel much better about at least ignoring isolated CPUs
> during stop machine execution, which btw I've been doing for a couple of
> years now on a wide range of machines where people are inserting modules
> left and right.
>
> What do you think?
>
> Thanx
> Max


Quick update on this.
I've also run

while true; do
    sudo mount -o loop loopfs loopmnt && dd if=/dev/zero of=loopmnt/dummy bs=1M
    sudo umount loopmnt
    sleep 2
done

and

while true; do
    /sbin/modprobe -r loop
    /sbin/modprobe loop
    sleep 1
done

in parallel on the Core2 Quad box for about 6 hours now. Same thing. No
signs of problems whatsoever, with the "stop machine" completely
disabled. Everything is working as expected.

Here we're exercising the sysfs, block and fs layers.
So I'm now even more eager to see your response :).

btw, does anyone else have a module load/unload scenario that definitely
requires stop machine?

Max


Module loading/unloading and "The Stop Machine"

2008-02-07 Thread Max Krasnyansky
Hi Rusty,

I was hoping you could answer a couple of questions about module
loading/unloading and the stop machine.
There was a recent discussion on LKML about CPU isolation patches I'm
working on. One of the patches makes stop machine ignore the isolated
CPUs. People of course had questions about that. So I started looking
into more details and got this silly, crazy idea that maybe we do not
need the stop machine any more :)

As far as I can tell the stop machine is basically a safety net in case
some locking and refcounting mechanisms aren't bulletproof. In other
words, if a subsystem can actually handle registration/unregistration in
a robust way, the module loader/unloader does not necessarily have to
halt the entire machine in order to load/unload a module that belongs to
that subsystem. I may of course be completely wrong on that.

The problem with the stop machine is that it's a very, very big gun :).
In the sense that it totally kills all the latencies and stuff, since
the entire machine gets halted while a module is being (un)loaded, which
is a major issue for any realtime apps. Specifically for CPU isolation,
the issue is that a high-priority rt user-space thread prevents the stop
machine threads from running and the entire box just hangs waiting for
it. I'm kind of surprised that folks who use monster boxes with over 100
CPUs have not complained. It must be a huge hit for those machines to
halt the entire thing.

It seems that over the last few years most subsystems got much better at
locking and refcounting. And I'm hoping that we can avoid halting the
entire machine these days. For CPU isolation in particular the solution
is simple. We can just ignore isolated CPUs. What I'm trying to figure
out is how safe it is and whether we can avoid a full halt altogether.

So. Here is what I tried today on my Core2 Duo laptop:
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -204,11 +204,14 @@ int stop_machine_run(int (*fn)(void *), void *data, unsigned int cpu)
>  
>  	/* No CPUs can come up or down during this. */
>  	lock_cpu_hotplug();
> +/*
>  	p = __stop_machine_run(fn, data, cpu);
>  	if (!IS_ERR(p))
>  		ret = kthread_stop(p);
>  	else
>  		ret = PTR_ERR(p);
> +*/
> +	ret = fn(data);
>  	unlock_cpu_hotplug();
>  
>  	return ret;

i.e. completely disabled the stop machine. It just loads/unloads modules
without a full halt.
I then ran three scripts:

while true; do
    /sbin/modprobe -r uhci_hcd
    /sbin/modprobe uhci_hcd
    sleep 10
done

while true; do
    /sbin/modprobe -r tg3
    /sbin/modprobe tg3
    sleep 2
done

while true; do
    /usr/sbin/tcpdump -i eth0
done

The machine has a bunch of USB devices connected to it. The two most
interesting are a Bluetooth dongle and a USB mouse. By loading/unloading
the UHCI driver we're touching sysfs, the USB stack, the Bluetooth
stack, the HID layer and the input layer. X is running and is using that
USB mouse. The Bluetooth services are running too.
By loading/unloading the TG3 driver we're touching sysfs and the network
stack (a bunch of layers). The machine is running NetworkManager and
tcpdumping on eth0, which is registered by TG3.
This is a pretty good stress test in general, let alone with the stop
machine disabled.

I left all that running for the whole day while doing normal day-to-day
things: compiling a bunch of things, emails, office apps, etc. That's
where I'm writing this email from :). It's still running all that :)

So the question is: do we still need stop machine? I must be missing
something obvious. But things seem to be working pretty well without it.
I certainly feel much better about at least ignoring isolated CPUs
during stop machine execution, which btw I've been doing for a couple of
years now on a wide range of machines where people are inserting modules
left and right.

What do you think?

Thanx
Max

