Re: vm_zeropage priority problems.

2001-12-26 Thread Terry Lambert

Luigi Rizzo wrote:
> 
> Don't know how interesting this can be, but i am writing
> (no plans to commit it, unless people find it interesting)
> some code to implement a weight-based instead of priority-based
> scheduler. The code is basically the WF2Q+ scheme which is
> already part of dummynet, adapted to processes.
> It is quite compact, and i think i can make it reasonably
> compatible with the old scheme, i.e. a sysctl var can be
> used to switch between one and the other with reasonably
> little overhead.
> 
> This would help remove the ugly property that priority-based schedulers
> have, which is that one process can starve the rest of the system.

Look for "QLINUX".  There's a nice paper on "Weighted Fair Share"
scheduling referenced on the page, as well as a couple of LRP
references.

Note that someone recently did a port of the LRP + Resource Containers
code to FreeBSD 4.4 (Terrible Rice University License, though).

-- Terry




Re: vm_zeropage priority problems.

2001-12-22 Thread Bruce Evans

On Sat, 22 Dec 2001, Jake Burkholder wrote:

> Apparently, On Sat, Dec 22, 2001 at 06:48:26PM +1100,
>   Bruce Evans said words to the effect of;
> > Index: kern_synch.c
> > ===================================================================
> > RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
> > retrieving revision 1.167
> > diff -u -2 -r1.167 kern_synch.c
> > --- kern_synch.c    18 Dec 2001 00:27:17 -0000    1.167
> > +++ kern_synch.c    19 Dec 2001 16:01:26 -0000
> > @@ -936,18 +1058,18 @@
> > struct thread *td;
> >  {
> > -   struct kse *ke = td->td_kse;
> > -   struct ksegrp *kg = td->td_ksegrp;
> > +   struct ksegrp *kg;
> >
> > -   if (td) {
> > -   ke->ke_cpticks++;
> > -   kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);
> > -   if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
> > -   resetpriority(td->td_ksegrp);
> > -   if (kg->kg_pri.pri_level >= PUSER)
> > -   kg->kg_pri.pri_level = kg->kg_pri.pri_user;
> > -   }
> > -   } else {
> > +   if (td == NULL)
> > panic("schedclock");
> > -   }
> > +   td->td_kse->ke_cpticks++;
> > +   kg = td->td_ksegrp;
> > +#ifdef NEW_SCHED
> > +   kg->kg_estcpu += niceweights[kg->kg_nice - PRIO_MIN];
> > +#else
> > +   kg->kg_estcpu++;
> > +#endif
> > +   resetpriority(kg);
> > +   if (kg->kg_pri.pri_level >= PUSER)
> > +   kg->kg_pri.pri_level = kg->kg_pri.pri_user;
> >  }
>
> I'm curious why you removed the ESTCPULIM and INVERSE_ESTCPU_WEIGHT
> calculations even in the OLD_SCHED case.  Do these turn out to have
> no effect in general?

ESTCPULIM basically breaks scheduling if it is hit (clipping to it
prevents accumulation of hog points that would cause cpu hogs to be
run less).  This is a problem in practice.  I use dynamic limits even
in the !NEW_SCHED case.  I forgot that I did this or I would have included
more context to show it (see below).  kg->kg_estcpu is allowed to grow
without explicit limit and is scaled to fit in the priority range.  This
requires fixing sorcerer's-apprentice growth of kg_estcpu in fork()
and exit().  kg_estcpu has natural limits but they are quite large
(a constant multiple of the load average).

INVERSE_ESTCPU_WEIGHT is not used because it goes with static scaling,
and the "% INVERSE_ESTCPU_WEIGHT" optimization (which depends on the internals
of resetpriority()) is not so easy to do.

Here are the corresponding changes for resetpriority():

%%%
Index: kern_synch.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.167
diff -u -2 -r1.167 kern_synch.c
--- kern_synch.c    18 Dec 2001 00:27:17 -0000    1.167
+++ kern_synch.c    22 Dec 2001 07:34:15 -0000
@@ -844,15 +949,32 @@
register struct ksegrp *kg;
 {
+   u_int estcpu;
register unsigned int newpriority;

mtx_lock_spin(&sched_lock);
if (kg->kg_pri.pri_class == PRI_TIMESHARE) {
-   newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +
+   estcpu = kg->kg_estcpu;
+   if (estcpu > estcpumax)
+   estcpu = estcpumax;
+#ifdef NEW_SCHED
+   newpriority = PUSER +
+   (((u_int64_t)estcpu * estcpumul) >> ESTCPU_SHIFT);
+#else
+   newpriority = PUSER +
+   (((u_int64_t)estcpu * estcpumul) >> ESTCPU_SHIFT) +
NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);
-   newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),
-   PRI_MAX_TIMESHARE);
+#endif
+   if (newpriority < PUSER)
+   newpriority = PUSER;
+   if (newpriority > PRI_MAX_TIMESHARE) {
+   Debugger("newpriority botch");
+   newpriority = PRI_MAX_TIMESHARE;
+   }
kg->kg_pri.pri_user = newpriority;
-   }
-   maybe_resched(kg);
+   maybe_resched(kg, newpriority);
+   } else
+   /* XXX doing anything here is dubious. */
+   /* XXX was: need_resched(). */
+   maybe_resched(kg, kg->kg_pri.pri_user);
mtx_unlock_spin(&sched_lock);
 }
%%%

> > Most of the changes here are to fix style bugs.  In the NEW_SCHED case,
> > the relative weights for each priority are determined by the niceweights[]
> > table.  kg->kg_estcpu is limited only by INT_MAX and priorities are
> > assigned according to relative values of kg->kg_estcpu (code for this is
> > not shown).  The NEW_SCHED case has not been tried since before SMPng
> > broke scheduling some more by compressing the priority ranges.
>
> It is relatively easy to uncompress the priority ranges if that is
> desirable.  What range is best?

The original algorithm works best with something close to the old range
of 50-127 (PUSER = 50, MAXPRI = 127) for positively niced processes alone.
This gives unniced processes a priority ran

Re: vm_zeropage priority problems.

2001-12-22 Thread Bruce Evans

On Sat, 22 Dec 2001, Luigi Rizzo wrote:

> On Sat, Dec 22, 2001 at 06:48:26PM +1100, Bruce Evans wrote:
> > Most of the changes here are to fix style bugs.  In the NEW_SCHED case,
> > the relative weights for each priority are determined by the niceweights[]
> > table.  kg->kg_estcpu is limited only by INT_MAX and priorities are
> > assigned according to relative values of kg->kg_estcpu (code for this is
> > not shown).
>
> i guess the latter is the hard part... what kind of complexity does
> it have ?

Not too bad.  I use an extra loop in schedcpu() to find the current maximum
of all kg->kg_estcpu, and convert the division by this maximum (for scaling
individual kg->kg_estcpu's) to a multiplication and a shift.  This can
probably be done better in loadav().
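
For what it's worth, here is a standalone sketch of that trick (plain
userland C with made-up constants; only the names estcpumax, estcpumul and
ESTCPU_SHIFT correspond to the resetpriority() patch posted earlier in this
thread, everything else is illustrative):

%%%
/*
 * Illustrative only: one pass finds the current maximum estcpu and
 * precomputes a fixed-point multiplier, so scaling each process's estcpu
 * into the timeshare priority range needs a multiply and a shift instead
 * of a division.
 */
#include <stdint.h>
#include <stdio.h>

#define PUSER                   160     /* made-up values; the real ones   */
#define PRI_MAX_TIMESHARE       223     /* live in sys/priority.h          */
#define ESTCPU_SHIFT            20      /* fixed-point fraction bits       */

int
main(void)
{
        unsigned estcpu[] = { 3, 400, 12000, 51, 0 };   /* per-process hog points */
        unsigned estcpumax = 1;                         /* avoid divide by zero */
        uint64_t estcpumul;
        size_t i;

        /* Extra loop (as in schedcpu()): find the current maximum estcpu. */
        for (i = 0; i < sizeof(estcpu) / sizeof(estcpu[0]); i++)
                if (estcpu[i] > estcpumax)
                        estcpumax = estcpu[i];

        /* Precompute (range << shift) / max once per pass... */
        estcpumul = ((uint64_t)(PRI_MAX_TIMESHARE - PUSER) << ESTCPU_SHIFT) /
            estcpumax;

        /* ...so each per-process scaling is just a multiply and a shift. */
        for (i = 0; i < sizeof(estcpu) / sizeof(estcpu[0]); i++) {
                unsigned pri;

                pri = PUSER + (unsigned)
                    (((uint64_t)estcpu[i] * estcpumul) >> ESTCPU_SHIFT);
                printf("estcpu %5u -> priority %u\n", estcpu[i], pri);
        }
        return (0);
}
%%%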

Bruce





Re: vm_zeropage priority problems.

2001-12-22 Thread Bruce Evans

On Fri, 21 Dec 2001, John Baldwin wrote:

> On 21-Dec-01 Bruce Evans wrote:
> > On Fri, 21 Dec 2001, Luigi Rizzo wrote:
> >> the original priority should be somewhere and accessible,
> >> either directly or through some function. Otherwise how
> >> do we know what to pass to tsleep() ?
> >
> > It's whatever the thread set itself.  There is no good way of setting
> > this either (vm_pagezero() and poll_idle() hack it into
> > td->td_ksegrp->kg_pri).  Userland would use rtprio(2) instead.
> > Unfortunately, this gives priorities in different units than the ones
> > for tsleep().
>
> pri_level is the current priority of the thread.  The actual priority level is
> going to move back into the thread and out of the KSE group so that tsleep and
> priority propagation work properly, but pri_native, pri_user, and nice will
> stay in the KSE group.  The "normal" priorities for tsleep() are just a subset

This will make encapsulating priority stuff in a struct more obviously wrong.

> of the priorities available to a thread.  Thus, they are using the same unit,
> but perhaps a wider range.

They are offset by PRI_MIN_IDLE too, due to vestiges of the rtprio()
misimplementation.

Bruce





Re: vm_zeropage priority problems.

2001-12-22 Thread Jake Burkholder

Apparently, On Sat, Dec 22, 2001 at 06:48:26PM +1100,
Bruce Evans said words to the effect of;

> On Fri, 21 Dec 2001, Luigi Rizzo wrote:
> 
> > Don't know how interesting this can be, but i am writing
> > (no plans to commit it, unless people find it interesting)
> > some code to implement a weight-based instead of priority-based
> > scheduler. The code is basically the WF2Q+ scheme which is
> > already part of dummynet, adapted to processes.
> > It is quite compact, and i think i can make it reasonably
> > compatible with the old scheme, i.e. a sysctl var can be
> > used to switch between one and the other with reasonably
> > little overhead.
> >
> > This would help remove the ugly property that priority-based schedulers
> > have, which is that one process can starve the rest of the system.
> 
> Only broken priority-based schedulers have that property.  One of
> my incomplete fixes uses weights:
> 
> Index: kern_synch.c
> ===================================================================
> RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
> retrieving revision 1.167
> diff -u -2 -r1.167 kern_synch.c
> --- kern_synch.c    18 Dec 2001 00:27:17 -0000    1.167
> +++ kern_synch.c    19 Dec 2001 16:01:26 -0000
> @@ -936,18 +1058,18 @@
>   struct thread *td;
>  {
> - struct kse *ke = td->td_kse;
> - struct ksegrp *kg = td->td_ksegrp;
> + struct ksegrp *kg;
> 
> - if (td) {
> - ke->ke_cpticks++;
> - kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);
> - if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
> - resetpriority(td->td_ksegrp);
> - if (kg->kg_pri.pri_level >= PUSER)
> - kg->kg_pri.pri_level = kg->kg_pri.pri_user;
> - }
> - } else {
> + if (td == NULL)
>   panic("schedclock");
> - }
> + td->td_kse->ke_cpticks++;
> + kg = td->td_ksegrp;
> +#ifdef NEW_SCHED
> + kg->kg_estcpu += niceweights[kg->kg_nice - PRIO_MIN];
> +#else
> + kg->kg_estcpu++;
> +#endif
> + resetpriority(kg);
> + if (kg->kg_pri.pri_level >= PUSER)
> + kg->kg_pri.pri_level = kg->kg_pri.pri_user;
>  }

I'm curious why you removed the ESTCPULIM and INVERSE_ESTCPU_WEIGHT
calculations even in the OLD_SCHED case.  Do these turn out to have
no effect in general?

> 
> Most of the changes here are to fix style bugs.  In the NEW_SCHED case,
> the relative weights for each priority are determined by the niceweights[]
> table.  kg->kg_estcpu is limited only by INT_MAX and priorities are
> assigned according to relative values of kg->kg_estcpu (code for this is
> not shown).  The NEW_SCHED case has not been tried since before SMPng
> broke scheduling some more by compressing the priority ranges.

It is relatively easy to uncompress the priority ranges if that is
desirable.  What range is best?

Jake




Re: vm_zeropage priority problems.

2001-12-22 Thread Luigi Rizzo

On Sat, Dec 22, 2001 at 06:48:26PM +1100, Bruce Evans wrote:
> On Fri, 21 Dec 2001, Luigi Rizzo wrote:
...
> > This would help remove the ugly property that priority-based schedulers
> > have, which is that one process can starve the rest of the system.
> 
> Only broken priority-based schedulers have that property.  One of
> my incomplete fixes uses weights:

which makes it weight based, doesn't it :)

> Most of the changes here are to fix style bugs.  In the NEW_SCHED case,
> the relative weights for each priority are determined by the niceweights[]
> table.  kg->kg_estcpu is limited only by INT_MAX and priorities are
> assigned according to relative values of kg->kg_estcpu (code for this is
> not shown).

i guess the latter is the hard part... what kind of complexity does
it have ?

The nice feature of the scheduling code used in dummynet/WF2Q+ is
that it has O(log N) complexity where N is the number of active
flows (processes in this case) and the basic scheduling operation
is just one or two heap insert/delete, so it's really fast.
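
For illustration, a heavily simplified standalone sketch of those heap
mechanics (my own toy code, not the dummynet implementation, and it omits
the WF2Q+ eligibility test against the system virtual time): each runnable
entity carries a virtual finish time that advances in inverse proportion to
its weight, and the scheduler repeatedly pops the smallest one, so a
pick-plus-requeue is one or two O(log N) heap operations.

%%%
/*
 * Toy weighted-fair pick loop: a binary min-heap keyed on virtual finish
 * time.  The weights and the mapping to nice values are made up; real
 * WF2Q+ also tracks eligibility, which is omitted here.
 */
#include <stdio.h>

#define MAXPROC 64

struct wproc {
        const char *name;
        unsigned weight;                /* larger weight => larger CPU share */
        unsigned long long finish;      /* virtual finish time */
};

static struct wproc *heap[MAXPROC + 1]; /* 1-based min-heap on finish */
static int nheap;

static void
heap_push(struct wproc *p)
{
        int i = ++nheap;

        while (i > 1 && heap[i / 2]->finish > p->finish) {
                heap[i] = heap[i / 2];
                i /= 2;
        }
        heap[i] = p;
}

static struct wproc *
heap_pop(void)
{
        struct wproc *top = heap[1], *last = heap[nheap--];
        int i = 1, c;

        while ((c = 2 * i) <= nheap) {
                if (c < nheap && heap[c + 1]->finish < heap[c]->finish)
                        c++;
                if (last->finish <= heap[c]->finish)
                        break;
                heap[i] = heap[c];
                i = c;
        }
        heap[i] = last;
        return (top);
}

int
main(void)
{
        struct wproc p[] = {
                { "nice+20", 1, 0 }, { "normal", 10, 0 }, { "nice-20", 40, 0 },
        };
        int i, slot;

        for (i = 0; i < 3; i++)
                heap_push(&p[i]);
        /* 20 quanta; each pick + requeue is one or two O(log N) heap ops. */
        for (slot = 0; slot < 20; slot++) {
                struct wproc *cur = heap_pop();

                printf("%s ", cur->name);
                cur->finish += 1000 / cur->weight;      /* charge one quantum */
                heap_push(cur);
        }
        printf("\n");
        return (0);
}
%%%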

cheers
luigi




Re: vm_zeropage priority problems.

2001-12-21 Thread Bruce Evans

On Fri, 21 Dec 2001, Luigi Rizzo wrote:

> Don't know how interesting this can be, but i am writing
> (no plans to commit it, unless people find it interesting)
> some code to implement a weight-based instead of priority-based
> scheduler. The code is basically the WF2Q+ scheme which is
> already part of dummynet, adapted to processes.
> It is quite compact, and i think i can make it reasonably
> compatible with the old scheme, i.e. a sysctl var can be
> used to switch between one and the other with reasonably
> little overhead.
>
> This would help remove the ugly property that priority-based schedulers
> have, which is that one process can starve the rest of the system.

Only broken priority-based schedulers have that property.  One of
my incomplete fixes uses weights:

Index: kern_synch.c
===================================================================
RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v
retrieving revision 1.167
diff -u -2 -r1.167 kern_synch.c
--- kern_synch.c    18 Dec 2001 00:27:17 -0000    1.167
+++ kern_synch.c    19 Dec 2001 16:01:26 -0000
@@ -936,18 +1058,18 @@
struct thread *td;
 {
-   struct kse *ke = td->td_kse;
-   struct ksegrp *kg = td->td_ksegrp;
+   struct ksegrp *kg;

-   if (td) {
-   ke->ke_cpticks++;
-   kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);
-   if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {
-   resetpriority(td->td_ksegrp);
-   if (kg->kg_pri.pri_level >= PUSER)
-   kg->kg_pri.pri_level = kg->kg_pri.pri_user;
-   }
-   } else {
+   if (td == NULL)
panic("schedclock");
-   }
+   td->td_kse->ke_cpticks++;
+   kg = td->td_ksegrp;
+#ifdef NEW_SCHED
+   kg->kg_estcpu += niceweights[kg->kg_nice - PRIO_MIN];
+#else
+   kg->kg_estcpu++;
+#endif
+   resetpriority(kg);
+   if (kg->kg_pri.pri_level >= PUSER)
+   kg->kg_pri.pri_level = kg->kg_pri.pri_user;
 }

Most of the changes here are to fix style bugs.  In the NEW_SCHED case,
the relative weights for each priority are determined by the niceweights[]
table.  kg->kg_estcpu is limited only by INT_MAX and priorities are
assigned according to relative values of kg->kg_estcpu (code for this is
not shown).  The NEW_SCHED case has not been tried since before SMPng
broke scheduling some more by compressing the priority ranges.

Bruce





Re: vm_zeropage priority problems.

2001-12-21 Thread Luigi Rizzo

Don't know how interesting this can be, but i am writing
(no plans to commit it, unless people find it interesting)
some code to implement a weight-based instead of priority-based
scheduler. The code is basically the WF2Q+ scheme which is
already part of dummynet, adapted to processes.
It is quite compact, and i think i can make it reasonably
compatible with the old scheme, i.e. a sysctl var can be
used to switch between one and the other with reasonably
little overhead.

This would help remove the ugly property that priority-based schedulers
have, which is that one process can starve the rest of the system.

cheers
luigi

On Fri, Dec 21, 2001 at 09:07:05AM -0800, John Baldwin wrote:
...
> > It's whatever the thread set itself.  There is no good way of setting
> > this either (vm_pagezero() and poll_idle() hack it into
> > td->td_ksegrp->kg_pri).  Userland would use rtprio(2) instead.
> > Unfortunately, this gives priorities in different units than the ones
> > for tsleep().
> 
> pri_level is the current priority of the thread.  The actual priority level is
> going to move back into the thread and out of the KSE group so that tsleep and
> priority propagation work properly, but pri_native, pri_user, and nice will
> stay in the KSE group.  The "normal" priorities for tsleep() are just a subset
> of the priorities available to a thread.  Thus, they are using the same unit,
> but perhaps a wider range.
> 
> >> In any case I wonder if this is a bug new in -current; -stable
> >> uses three separate data structures for realtime, user and idle tasks
> >> so even specifying the wrong priority in tsleep should not cause
> >> crossing classes there. -current has only one array, hence the
> >> chance of doing the wrong thing.
> > 
> > The 3 classes are a design bug in -stable.  Crossing classes is sometimes
> > right and 3 classes mainly make it harder and force triplication of code.
> 
> Agreed.  In current we basically have one large priority array.  At the top
> (low priorities) are realtime kernel priorities including interrupt thread
> priorities.  Below those are other top-half kernel priorities including the
> normal sleep priorities.  Below that are real-time user priorities, followed by
> time sharing user priorities, and finally idle user priorities.
> 
> Possibly real-time user priorities should move up into the real-time kernel
> range, and it should be noted that the idle priorites are for idle kernel
> threads, not just user threads (so sys/priority.h may need a bit of updating).
> 
> This is a simpler model esp. when you consider a thread of one priority class
> blocking on a lock held by a thread of a "lower" priority class as you only
> have to update pri_level for priority propagation rather than comparing classes
> and having to manage that extra gunk.
> 
> -- 
> 
> John Baldwin <[EMAIL PROTECTED]>  <><  http://www.FreeBSD.org/~jhb/
> "Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/




Re: vm_zeropage priority problems.

2001-12-21 Thread John Baldwin


On 21-Dec-01 Bruce Evans wrote:
> On Fri, 21 Dec 2001, Luigi Rizzo wrote:
> 
>> On Sat, Dec 22, 2001 at 12:46:40AM +1100, Bruce Evans wrote:
>> > I think pri_native is just an implementation detail which shouldn't
>> > be used or visible to threads.  It is used by the priority propagation
>> > mechanism to hold the original pri_level.  Threads should just use their
>> > original priority (or a different one if they want to temporarily change
>> > their priority).  Even pri_level probably shouldn't be used or visible
>> > to threads.
>>
>> the original priority should be somewhere and accessible,
>> either directly or through some function. Otherwise how
>> do we know what to pass to tsleep() ?
> 
> It's whatever the thread set itself.  There is no good way of setting
> this either (vm_pagezero() and poll_idle() hack it into
> td->td_ksegrp->kg_pri).  Userland would use rtprio(2) instead.
> Unfortunately, this gives priorities in different units than the ones
> for tsleep().

pri_level is the current priority of the thread.  The actual priority level is
going to move back into the thread and out of the KSE group so that tsleep and
priority propagation work properly, but pri_native, pri_user, and nice will
stay in the KSE group.  The "normal" priorities for tsleep() are just a subset
of the priorities available to a thread.  Thus, they are using the same unit,
but perhaps a wider range.

>> In any case I wonder if this is a bug new in -current; -stable
>> uses three separate data structures for realtime, user and idle tasks
>> so even specifying the wrong priority in tsleep should not cause
>> crossing classes there. -current has only one array, hence the
>> chance of doing the wrong thing.
> 
> The 3 classes are a design bug in -stable.  Crossing classes is sometimes
> right and 3 classes mainly make it harder and force triplication of code.

Agreed.  In current we basically have one large priority array.  At the top
(low priorities) are realtime kernel priorities including interrupt thread
priorities.  Below those are other top-half kernel priorities including the
normal sleep priorities.  Below that are real-time user priorities, followed by
time sharing user priorities, and finally idle user priorities.

Possibly real-time user priorities should move up into the real-time kernel
range, and it should be noted that the idle priorities are for idle kernel
threads, not just user threads (so sys/priority.h may need a bit of updating).

This is a simpler model esp. when you consider a thread of one priority class
blocking on a lock held by a thread of a "lower" priority class as you only
have to update pri_level for priority propagation rather than comparing classes
and having to manage that extra gunk.

-- 

John Baldwin <[EMAIL PROTECTED]>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/




Re: vm_zeropage priority problems.

2001-12-21 Thread Bruce Evans

On Fri, 21 Dec 2001, Luigi Rizzo wrote:

> On Sat, Dec 22, 2001 at 12:46:40AM +1100, Bruce Evans wrote:
> > I think pri_native is just an implementation detail which shouldn't
> be used or visible to threads.  It is used by the priority propagation
> mechanism to hold the original pri_level.  Threads should just use their
> original priority (or a different one if they want to temporarily change
> their priority).  Even pri_level probably shouldn't be used or visible
> > to threads.
>
> the original priority should be somewhere and accessible,
> either directly or through some function. Otherwise how
> do we know what to pass to tsleep() ?

It's whatever the thread set itself.  There is no good way of setting
this either (vm_pagezero() and poll_idle() hack it into
td->td_ksegrp->kg_pri).  Userland would use rtprio(2) instead.
Unfortunately, this gives priorities in different units than the ones
for tsleep().

> In any case I wonder if this is a bug new in -current; -stable
> uses three separate data structures for realtime, user and idle tasks
> so even specifying the wrong priority in tsleep should not cause
> crossing classes there. -current has only one array, hence the
> chance of doing the wrong thing.

The 3 classes are a design bug in -stable.  Crossing classes is sometimes
right and 3 classes mainly make it harder and force triplication of code.

Bruce





Re: vm_zeropage priority problems.

2001-12-21 Thread Luigi Rizzo

On Sat, Dec 22, 2001 at 12:46:40AM +1100, Bruce Evans wrote:
> I think pri_native is just an implementation detail which shouldn't
> be used or visible to threads.  It is used by the priority propagation
> mechanism to hold the original pri_level.  Threads should just use their
> original priority (or a different one if they want to temporarily change
> their priority).  Even pri_level probably shouldn't be used or visible
> to threads.

the original priority should be somewhere and accessible,
either directly or through some function. Otherwise how
do we know what to pass to tsleep() ?

In any case I wonder if this is a bug new in -current; -stable
uses three separate data structures for realtime, user and idle tasks
so even specifying the wrong priority in tsleep should not cause
crossing classes there. -current has only one array, hence the
chance of doing the wrong thing.

cheers
luigi




Re: vm_zeropage priority problems.

2001-12-21 Thread Bruce Evans

On Thu, 20 Dec 2001, John Baldwin wrote:

> On 20-Dec-01 Luigi Rizzo wrote:
> > On Thu, Dec 20, 2001 at 12:16:03PM -0800, John Baldwin wrote:
> >> However, kthreads should tsleep() with their current priority, not PPAUSE.
> >
> > "current" meaning pri_level or pri_native ? What if one tries to
> > tsleep() while holding a lock and so its pri_level is raised ?
>
> pri_level.  Calling tsleep() while holding a lock is a bug though. :)  Unless
> you are calling msleep() with a lock that will be released.
>
> > In the device polling code i did a tsleep on the "original" pri_level,
> > but maybe pri_native is good enough.
>
> pri_level is more correct.

I think pri_native is just an implementation detail which shouldn't
be used or visible to threads.  It is used by the priority propagation
mechanism to hold the original pri_level.  Threads should just use their
original priority (or a different one if they want to temporarily change
their priority).  Even pri_level probably shouldn't be used or visible
to threads.

Bruce





Re: vm_zeropage priority problems.

2001-12-20 Thread John Baldwin


On 20-Dec-01 Luigi Rizzo wrote:
> On Thu, Dec 20, 2001 at 12:16:03PM -0800, John Baldwin wrote:
> ...
>> Priority propagation will already handle things ok.  We drop to pri_native
>> after we drop a lock (although if we still hold a contested lock we bump our
>> priority to the min(nativepri, highest priority of threads on contested locks
>> we hold) and drop to nativepri after dropping the last contested lock).
> 
> ok, thanks for the clarification
> 
>> However, kthreads should tsleep() with their current priority, not PPAUSE.
> 
> "current" meaning pri_level or pri_native ? What if one tries to
> tsleep() while holding a lock and so its pri_level is raised ?

pri_level.  Calling tsleep() while holding a lock is a bug though. :)  Unless
you are calling msleep() with a lock that will be released.

> In the device polling code i did a tsleep on the "original" pri_level,
> but maybe pri_native is good enough.

pri_level is more correct.

-- 

John Baldwin <[EMAIL PROTECTED]>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/




Re: vm_zeropage priority problems.

2001-12-20 Thread Luigi Rizzo

On Thu, Dec 20, 2001 at 12:16:03PM -0800, John Baldwin wrote:
...
> Priority propagation will already handle things ok.  We drop to pri_native
> after we drop a lock (although if we still hold a contested lock we bump our
> priority to the min(nativepri, highest priority of threads on contested locks
> we hold) and drop to nativepri after dropping the last contested lock).

ok, thanks for the clarification

> However, kthreads should tsleep() with their current priority, not PPAUSE.

"current" meaning pri_level or pri_native ? What if one tries to
tsleep() while holding a lock and so its pri_level is raised ?

In the device polling code i did a tsleep on the "original" pri_level,
but maybe pri_native is good enough.

cheers
luigi




Re: vm_zeropage priority problems.

2001-12-20 Thread John Baldwin


On 20-Dec-01 Luigi Rizzo wrote:
> On Thu, Dec 20, 2001 at 11:13:27AM -0800, Peter Wemm wrote:
> ...
>> Excellent catch!  This particular problem was one of the main reasons
>> why this is still defaulting to 'off'.  I have a couple of other changes
>> to it pending commit to fix some of Bruce's complaints, but I hadn't
>> noticed the cause of this.
>> 
>> Part of the problem is that tsleep temporarily elevates the priority for
>> wakeup, and it is normally returned back to "normal" level when the process
>> returns to userland (see the *_userpri stuff).
> 
> ah, ok, kernel threads never return to userland...
> 
> so, i presume one should also make sure that after the process is
> scheduled, the priority is restored to the 'native' level; this
> should also solve the problem with the priority propagation mechanism
> (though... I have the feeling that if you have nested locks, this
> cannot scale...)

Priority propagation will already handle things ok.  We drop to pri_native
after we drop a lock (although if we still hold a contested lock we bump our
priority to the min(nativepri, highest priority of threads on contested locks
we hold) and drop to nativepri after dropping the last contested lock).
However, kthreads should tsleep() with their current priority, not PPAUSE.
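
A toy model of that rule (standalone C, purely illustrative, not the real
kernel mutex code; remember that numerically lower values are better
priorities):

%%%
/*
 * Toy model only: when a thread releases a lock, its new priority is the
 * minimum of its native priority and the best priority of any thread still
 * blocked on locks it continues to hold.
 */
#include <stdio.h>

struct toy_thread {
        int pri_native;         /* priority the thread set for itself */
        int pri_level;          /* current, possibly propagated, priority */
};

struct toy_lock {
        struct toy_thread *owner;
        int best_waiter;        /* lowest pri_level among waiters, or -1 */
};

static void
toy_unlock(struct toy_thread *td, struct toy_lock *released,
    struct toy_lock *held[], int nheld)
{
        int newpri = td->pri_native;
        int i;

        released->owner = NULL;
        /* Still-held contested locks keep us boosted to their best waiter. */
        for (i = 0; i < nheld; i++)
                if (held[i] != released && held[i]->owner == td &&
                    held[i]->best_waiter != -1 && held[i]->best_waiter < newpri)
                        newpri = held[i]->best_waiter;
        td->pri_level = newpri;
}

int
main(void)
{
        struct toy_thread td = { 140, 140 };
        struct toy_lock a = { &td, 96 };        /* contested by a pri 96 waiter */
        struct toy_lock b = { &td, -1 };        /* held but uncontested */
        struct toy_lock *held[] = { &a, &b };

        td.pri_level = 96;                      /* boosted while 'a' is contested */
        toy_unlock(&td, &b, held, 2);
        printf("after dropping b: pri %d\n", td.pri_level);    /* still 96 */
        toy_unlock(&td, &a, held, 2);
        printf("after dropping a: pri %d\n", td.pri_level);    /* native 140 */
        return (0);
}
%%%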

-- 

John Baldwin <[EMAIL PROTECTED]>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve!"  -  http://www.FreeBSD.org/




Re: vm_zeropage priority problems.

2001-12-20 Thread Luigi Rizzo

On Thu, Dec 20, 2001 at 11:13:27AM -0800, Peter Wemm wrote:
...
> Excellent catch!  This particular problem was one of the main reasons
> why this is still defaulting to 'off'.  I have a couple of other changes
> to it pending commit to fix some of Bruce's complaints, but I hadn't
> noticed the cause of this.
> 
> Part of the problem is that tsleep temporarily elevates the priority for
> wakeup, and it is normally returned back to "normal" level when the process
> returns to userland (see the *_userpri stuff).

ah, ok, kernel threads never return to userland...

so, i presume one should also make sure that after the process is
scheduled, the priority is restored to the 'native' level; this
should also solve the problem with the priority propagation mechanism
(though... I have the feeling that if you have nested locks, this
cannot scale...)

cheers
luigi




Re: vm_zeropage priority problems.

2001-12-20 Thread Peter Wemm

Luigi Rizzo wrote:
> [Cc peter because he introduced this code]
> 
> Hi,
> i was trying the following code in -current (basically copied from
> vm_zeropage.c), to implement device polling in the idle loop, and
> noticed that the process would take all of the CPU time. Being
> suspicious that something was wrong with priorities, I added a
> couple of printfs, and to my surprise, the priority of the process
> (pri_level) after the first tsleep() becomes 104 instead of staying
> at 255 as it was originally initialized.
> 
> A quick scan of the files in sys/sys shows that 104 is PPAUSE,
> and this is the source of the problem.

Excellent catch!  This particular problem was one of the main reasons
why this is still defaulting to 'off'.  I have a couple of other changes
to it pending commit to fix some of Bruce's complaints, but I hadn't
noticed the cause of this.

Part of the problem is that tsleep temporarily elevates the priority for
wakeup, and it is normally returned back to "normal" level when the process
returns to userland (see the *_userpri stuff).

Changing the tsleep PPAUSE value isn't necessarily the right thing either
because the priority propagation mechanism will elevate us if we hold a lock
that something else *needs*.  (This is so that when it yields, we will run
and do what we need to do and presumably release the lock.)

> I guess this is a bug in vm_zeropage.c, the second parameter
> in tsleep should be td->td_ksegrp->kg_pri.pri_level instead
> of PPAUSE, or at least the priority should be reset to the
> original value after returning from the tsleep ?

Possibly both of these.  Let me do some checking.
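
A rough sketch of what doing both might look like (illustration only,
written against the KSE interfaces visible in the quoted code below;
poll_idle_sketch and the mypri variable are mine, not committed code):

%%%
/*
 * Sketch, not the committed fix: sleep at the priority the kthread chose
 * for itself rather than PPAUSE, and put pri_level back after tsleep()
 * returns, since a kernel thread never takes the return-to-userland path
 * that would normally restore it.  Names mirror the quoted poll_idle().
 */
static void
poll_idle_sketch(void)
{
        struct thread *td = curthread;
        struct rtprio rtp;
        u_char mypri;

        rtp.prio = RTP_PRIO_MAX;        /* lowest priority */
        rtp.type = RTP_PRIO_IDLE;
        mtx_lock_spin(&sched_lock);
        rtp_to_pri(&rtp, &td->td_ksegrp->kg_pri);
        mypri = td->td_ksegrp->kg_pri.pri_level;        /* remember our level */
        mtx_unlock_spin(&sched_lock);

        for (;;) {
                /* ... idle-time work, or yield via mi_switch(), as above ... */
                tsleep(&idlepoll_sleeping, mypri, "pollid", hz * 3);

                /* tsleep()/propagation may have left us elevated; drop back. */
                mtx_lock_spin(&sched_lock);
                td->td_ksegrp->kg_pri.pri_level = mypri;
                mtx_unlock_spin(&sched_lock);
        }
}
%%%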

>   cheers
>   luigi
> 
> static void
> poll_idle(void)
> {
> struct thread *td = curthread;
> struct rtprio rtp;
>  
> rtp.prio = RTP_PRIO_MAX;/* lowest priority */
> rtp.type = RTP_PRIO_IDLE;
> mtx_lock_spin(&sched_lock);
> rtp_to_pri(&rtp, &td->td_ksegrp->kg_pri);
> mtx_unlock_spin(&sched_lock);
>
> printf("idlepoll start and sleep, pri is %d native %d user %d\n",
> td->td_ksegrp->kg_pri.pri_level,
> td->td_ksegrp->kg_pri.pri_native,
> td->td_ksegrp->kg_pri.pri_user);
> for (;;) {
> if (poll_in_idle && poll_handlers > 0) {
> idlepoll_sleeping = 0;
> mtx_lock(&Giant);
> ether_poll(poll_each_burst);
> mtx_unlock(&Giant);
> mtx_assert(&Giant, MA_NOTOWNED);
> mtx_lock_spin(&sched_lock);
> setrunqueue(td);
> td->td_proc->p_stats->p_ru.ru_nvcsw++;
> mi_switch();
> mtx_unlock_spin(&sched_lock);
> } else {
>   printf("idlepoll goes to sleep, "
>   "pri is %d native %d user %d\n",
>   td->td_ksegrp->kg_pri.pri_level,
>   td->td_ksegrp->kg_pri.pri_native,
>   td->td_ksegrp->kg_pri.pri_user);
> idlepoll_sleeping = 1;
> tsleep(&idlepoll_sleeping, PPAUSE, "pollid", hz * 3);
> }
> }
> }
> 
> static struct proc *idlepoll;
> static struct kproc_desc idlepoll_kp = {
>  "idlepoll",
>  poll_idle,
>  &idlepoll
> };
> SYSINIT(idlepoll, SI_SUB_KTHREAD_VM, SI_ORDER_ANY, kproc_start, &idlepoll_kp)
> 
> 
> 

Cheers,
-Peter
--
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
"All of this is for nothing if we don't go to the stars" - JMS/B5

