Propose New Linux Scheduling Policies
July 2019, Version 0.9

./uapi/linux/sched.h:

#define SCHED_INTERACTIVE 7
#define SCHED_UNEQUAL     8

Each of these new scheduling policies has been implemented in one or more other UNIX operating systems, past and present, to support a new scheduling behaviour. Isn't it about time that Linux supported them in the mainline OS?

SCHED_INTERACTIVE is meant to be used for tasks that are infrequently runnable but, once runnable, should run with an absolute minimum of latency. An example of this is the mouse icon. Currently this behaviour is determined from past running behaviour, where a minimum percentage of use of the task's runnable time must have occurred, so a task can lose its interactive treatment as it runs. Common sense says that this type of process should always be interactive.

SCHED_UNEQUAL is meant to support an unfair scheduling policy for tasks that SHOULD run infrequently but, when runnable, SHOULD run to completion. This is more of a scalable flag: the number of runnable tasks may increase, limiting the task's runnable time within a specific time window, and/or the objects the task is dealing with may increase such that almost two times or more the executing cycles are necessary for the task to run to completion. An example of this is a routing task that processes Link State Advertisements (LSAs) so that a change to the routing table can be accomplished in a minimum of time. Without this, network routing loops and black holes can exist for short periods of time, losing packets and/or generating unnecessary control and data packet exchanges.

Manual pages: Any manual page such as SCHED_SETATTR(2) SHOULD be updated to acknowledge these new scheduling policies.

Applications: Tasks that display output about tasks / scheduling minimally need to be recompiled and made aware of these new scheduling policies.

Inheritance: It is not believed that these new policies SHOULD be inherited at the time of a clone(2) / fork().
Weighting: The currently suggested / proposed change is to implement a simplified differential weighting policy when SCHED_UNEQUAL is specified. A possible future change COULD support multiple UNEQUAL scheduling policies by combining an existing metric; that is for a later proposal, sometime after this proposed change is accepted.

Security / Denial of Service (DoS) issues:

SCHED_INTERACTIVE: Code must be implemented to prevent repeatedly inserting SCHED_INTERACTIVE tasks onto CPU cores and thereby denying other tasks the chance to execute.

SCHED_UNEQUAL: Currently, only when multiple tasks are runnable and consume the entire window of CPU cycles, running this task two or three times in a row SHOULD in the worst case add less than 8% CPU cycles. To mitigate any scheduling disturbances, setting these flags should be limited to an administrator, aka a root-type user with the specific capabilities.

TBDs: If Linux's kernel.org sees that this feature is wanted in the near and/or distant future, then "UNKNOWNs / TBDs" or issues should be listed here. This notice is not a proposal for kernel.org code integration at this time; it is intended as a Copyright Notice and to gauge general consensus on whether this feature is beneficial for future inclusion into Linux's kernel.org.

Copyright Notice: While this two-page introduction to these new scheduling policies is being proposed, it is actually a minimal awareness document, intended to establish that a code change has been implemented within public-domain source in the past and to determine whether this feature is wanted by kernel.org. The actual implementation code that uses these flags is proprietary; as most OS code is modified, it is coded / changed by multiple engineers, and without general acceptance of this change, anything beyond an inception / concept proposal to kernel.org specifying new scheduling policy flags is unnecessary at this time.
Warranty: Usage, etc., and incorporation of minimally tested / unknown / untested code is "AS-IS", with no warranties suggested or implied.

Reference for a past scheduler: SunOS within SVR4.x used SCHED_INTERACTIVE. VMware and others support a weighted process scheduler to allow UNEQUAL scheduling.

Mitchell Erblich
UNIX Kernel Engineer / Developer
erbli...@earthlink.net
mm/page_alloc.c : Intent to clone zone_watermark_ok()
Group,

This is a notification that local changes are believed to be needed to zone_watermark_ok() in mm/page_alloc.c for a family of embedded devices. The embedded devices have no secondary storage, so cleaning dirty pages is not an option, and crossing below the min watermark levels is unwarranted. This function seems to contain the necessary logic, with minor modifications needed, and thus the intent is to duplicate most of its functionality. The local change is to create a new family of zone_watermark_num() functions, based on the family of zone_watermark_ok(), that instead returns an int: the number of free pages relative to the watermark at the given order, where a positive number means the free pages are above the watermark and a negative number means below it.

Please inform me whether an equivalent function is already present, and in what release.

Thank you,
Mitchell Erblich
OS Engineer
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Linux Kernel Scheduling Addition Notification : Hybrid Sleepers and Unfair scheduling
Group,

As a contractor or employee, the code that I write while employed is owned by the company that I work for. It is then up to them / legal / management, etc., whether they offer that code implementation to kernel.org or an ISP, insert it into their release, or whatever. My notification is to say: here is a minimal TOI, I will explain further where necessary, I never saw a patch offered, the code exists, and I would be willing to repeat the code with a different implementation for FREE as an individual. If the code is within a networking area, then maybe a simple Request For Enhancement (RFE) that explains functionality, probably with some minimal API, but the direction is NOT to offer an implementation of code.

This is just a legal statement that needs repeating each and every time that I make my offer. How rude is that? Thus, Mike, your statement is totally uncalled for and inappropriate, and being behind an email address does not excuse that.

Mitchell Erblich
Kernel Engineer

On Mar 18, 2015, at 8:59 PM, Mike Galbraith wrote:
> On Wed, 2015-03-18 at 20:43 -0700, Mitchell Erblich wrote:
>
>> This proposal was ONLY to resolve the legal issue with public domain
>> code of notification when a patch was not offered.…
>
> Ah, so completely off topic here.. but then you knew that. How rude.
>
> -Mike
Re: Linux Kernel Scheduling Addition Notification : Hybrid Sleepers and Unfair scheduling
On Mar 18, 2015, at 7:38 PM, Mike Galbraith wrote:
> On Wed, 2015-03-18 at 16:25 -0700, Mitchell Erblich wrote:
>
>> SCHED_IA
>> Over 10 years ago, System V Release 4 was enhanced with additional
>> features by Sun Microsystems. One of the more minor extensions dealt
>> with the subdivision of a process's scheduling characteristics and was
>> known as the INTERACTIVE / IA scheduling class. This scheduling class
>> was targeted at frequent sleepers, with the mouse icon being one of the
>> first processes/tasks.
>>
>> Linux has no explicit SCHED_IA scheduling policy, but does alter
>> scheduling characteristics based on some sleep behavior (example:
>> GENTLE_FAIR_SLEEPERS) within the fair share scheduling configuration
>> option.
>
> That's about fairness, it levels the playing field sleepers vs hogs.
>
>> Processes / tasks that are CPU bound but fit into a SLEEPER behavior
>> can have a hybrid behavior over time where, during any one scheduling
>> period, a task may consume its variable length allocated time. This can
>> alter its expected short latency to become current / ONPROC. To simplify
>> the implementation, it is suggested that SCHED_IA be a sub scheduling
>> policy of SCHED_NORMAL. Shouldn't an administrator be able to classify
>> the NORMAL long term behavior of a task as that of a FIXED sleeper?
>
> Nope, we definitely don't want a SCHED_IA class.
>
> Your box can't tell if you're interacting or not even if you explicitly
> define something as 'interactive'. If I stare in fascination at an
> eye-candy screen-saver, it becomes 'interactive'. If I'm twiddling my
> thumbs waiting for a kbuild or whatever other CPU intensive job to finish,
> that job becomes the 'interactive' thing of import. The last thing I
> want is to have to squabble with every crack-head programmer on the
> planet who thinks his/her cool app should get special treatment.
>
> You can't get there from here.
This proposal was ONLY to resolve the legal issue, with public-domain code, of giving notification when a patch was not offered.

Any "crack-head programmer" can still, if he has the CAPABILITY, change the scheduling policy to RT-FIFO or RT-RR to give his app special treatment. So this proposal does not mitigate or change that treatment, assuming that the same CAPABILITY is checked. POSIX ONLY specifies RT tasks as creating a level of unfairness, whereas in some APPLICATION-SPECIFIC uses of Linux, an infrequently executed task needs to execute more than what is currently FAIR.

So, the Linux scheduler ALREADY determines whether a task slept during a time window, and if it DIDN'T, it is penalized versus the tasks that did sleep. This is effectively a dynamic scheduling behavior, where sometimes your interactive task did not fully sleep. Thus, why not still treat it as an IA task, because of the nature/functionality of the task? It could be generating tones/music through the audio/speaker driver of the system, where you want a consistently minimal latency, else you generate warble in your music. Thus, if an admin/task KNOWS that a task is essentially INTERACTIVE, aka a mouse icon driver, audio driver, etc., and if the admin or a startup script has the CAPABILITY, then he can set it so that the INTERACTIVE task is ALWAYS/CONSISTENTLY treated as an INTERACTIVE task. This change allows one or more tasks to BETTER behave the same, independently of the number of tasks being scheduled during the time window and the number of CPUs/cores, without any other special scheduling. This was a SVR4.x feature within the SunOS kernel and a number of other SVR4.x UNIX kernels, settable via priocntl(1) (which Linux does not support) or a start script.

>> Thus, the first Proposal is to explicitly support the SCHED_IA
>> scheduling policy within the Linux kernel.
>> After kernel support, any
>> application that has the same functionality as priocntl(1) then needs
>> to be altered to support this new scheduling policy.
>>
>> Note: Administrator in this context should be a task with a UID, EUID,
>> GID, EGID, etc., that has the proper CAPABILITY to alter scheduling
>> behavior.
>
>> SCHED_UNFAIR
>
> Snip.. we already have truckloads of bandwidth control.
>
> -Mike

This is NOT bandwidth control. It allocates an increase in a task's time slice versus other tasks. SCHED_UNFAIR pertains to a task that can sometimes run and consume an UN-fair number of CPU cycles. With the routing protocols OSPFv2/v3, a well-known link-state task is specified by function, but regular tasks that INFREQUENTLY execute, like a file system check (fsck), become a boot bottleneck if actually doing work and unable to do that work without a number of context switches. With this functionality a latency decrease of 30% or more can occur, depending on the number of tasks contending for the time window. Yes, if a TASK_UNFAIR policy always executes in every time
Linux Kernel Scheduling Addition Notification : Hybrid Sleepers and Unfair scheduling
Please note that this proposal is from this engineer and not from the company he works for. This SHOULD also fulfill any legal notification of work done, but not submitted to the Linux kernel.

Transfer of Information: Notification & Proposal of Feasibility to Support System V Release 4 De-facto Standard Scheduling Extensions, etc., within Linux
——-

SCHED_IA

Over 10 years ago, System V Release 4 was enhanced with additional features by Sun Microsystems. One of the more minor extensions dealt with the subdivision of a process's scheduling characteristics and was known as the INTERACTIVE / IA scheduling class. This scheduling class was targeted at frequent sleepers, with the mouse icon being one of the first processes/tasks.

Linux has no explicit SCHED_IA scheduling policy, but it does alter scheduling characteristics based on some sleep behavior (example: GENTLE_FAIR_SLEEPERS) within the fair-share scheduling configuration option. Processes / tasks that are CPU bound but fit a SLEEPER behavior can have a hybrid behavior over time, where during any one scheduling period a task may consume its variable-length allocated time. This can alter its expected short latency to become current / ONPROC. To simplify the implementation, it is suggested that SCHED_IA be a sub scheduling policy of SCHED_NORMAL. Shouldn't an administrator be able to classify the NORMAL long-term behavior of a task as that of a FIXED sleeper?

Thus, the first Proposal is to explicitly support the SCHED_IA scheduling policy within the Linux kernel. After kernel support, any application that has the same functionality as priocntl(1) then needs to be altered to support this new scheduling policy.

Note: Administrator in this context should be a task with a UID, EUID, GID, EGID, etc., that has the proper CAPABILITY to alter scheduling behavior.
SCHED_UNFAIR

UNIX / Linux scheduling has for the most part attempted to achieve some level of process / task scheduling fairness, within the Linux scheduler using the fair-share scheduling class. Exceptions do exist, but are not discussed below. In general this type of scheduling is acceptable in a generic implementation, but it has weaknesses when UNIX / Linux is moved into a different environment. Many companies use UNIX / Linux in heavy networking environments where one or more tasks can infrequently attempt to consume more than their fair share within a window scheduling period. This proposal acknowledges that "nice" and a few other scheduling workarounds do not always suffice to allow this temporary inequality to exist. Yes, a cpumask could be set that allows only certain tasks to run on specific nodes; however, the implied assumption is that these tasks only infrequently need to have inequality, i.e. a greater "time slice" than their fair share.

A network protocol task example is a convergence task that needs to run until it is finished; until that happens, needed routing changes will not occur. The time window in which all tasks need to run SHOULD not need this task in consecutive time windows; thus, over longer periods of time, it still fulfills the fair scheduling policy. Again, the proper CAPABILITY needs to be specified with the priocntl(1)-like application, as running many such tasks per CPU COULD then affect the performance of the system.

Thus, explicit support for a new SCHED_UNFAIR scheduling policy is proposed / suggested within the Linux kernel. Again, it can be a sub scheduling policy of SCHED_NORMAL. If there is an expressed need / want for this type of functionality to be patched into a git tree, please inform this engineer whether any additional information should be provided, since this minimal TOI document is architecture / enhancement based and does not go into details of a possible implementation of the above additional functionality.
Thank you,
Mitchell Erblich
UNIX Kernel Engineer
TOI: introducing Split Lists hybrids of doubly linked lists within the Linux Kernel
in this suggestion. For the skeptical, add elements into a tree with the values from 1 to 100: is the tree balanced without doing extra work? Dynamic upgrades from the split lists into the skip list should be a simple matter and can be configured via a /proc tunable if wanted. A proper split list or skip list should show 16x, 32x, 64x, or more, higher performance in finding the location to search, add, or delete an element than the currently supported linked list. Of course, all new functions will co-exist, and only engineers who are comfortable using the new functions will then possibly see performance improvements. Yes, only if these lists are frequently walked (possibly a bottleneck) will the performance improvements be noticeable.

This engineer is willing to submit an initial prototype of the above logic functions to be incorporated within the Linux kernel if/when there is a request to do so.

Sincerely,
Mitchell Erblich
Kernel Engineer
Re: maturity and status and attributes, oh my!
"Robert P. J. Day" wrote:
>
> at the risk of driving everyone here totally bonkers, i'm going to
> take one last shot at explaining what i was thinking of when i first
> proposed this whole "maturity level" thing. and, just so you know,
> the major reason i'm so cranked up about this is that i'm feeling just
> a little territorial -- i was the one who first started nagging people
> to consider this idea, so i'm a little edgy when i see folks finally
> giving it some serious thought but appearing to get ready to implement
> it entirely incorrectly in a way that's going to ruin it irreparably
> and make it utterly useless.
>
> this isn't just about defining a single feature called "maturity".
> it's about defining a general mechanism so that you can add entirely
> new (what i call) "attributes" to kernel features. one attribute
> could be "maturity", which could take one of a number of possible
> values. another could be "status", with the same restrictions.
> heck, you could define the attribute "colour", and decide that various
> kernel features could be labelled as (at most) one of "red", "green"
> and "chartreuse." that's what i mean by an "attribute", and
> attributes would have two critical and non-negotiable properties:

<<< snip >>>

> but i hope i've flogged this thoroughly to the point where people
> can see what i'm driving at. once you see (as in simon's patch) how
> to add the first attribute, it's trivial to simply duplicate that code
> to add as many more as you want.
>
> rday
>
> --
> Robert P. J. Day
> Linux Consulting, Training and Annoying Kernel Pedantry
> Waterloo, Ontario, CANADA
>
> http://crashcourse.ca

Robert Day,

If I can interpret what you are asking about, and change it a bit: don't you think that Maturity can ALSO be defined as the number of known bugs and their priority / severity against an architecture-dependent or -independent item? Would this suffice, and wouldn't it be easier to maintain?
Mitchell Erblich - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] : mm : / Patch / code : Suggestion :snip kswapd & get_page_from_freelist() : No more no page failures. (WHY????)
Nick Piggin wrote:
> [EMAIL PROTECTED] wrote:
> > [EMAIL PROTECTED]
> > Sent: Friday, August 24, 2007 3:11 PM
> > Subject: Re: [RFC] : mm : / Patch / code : Suggestion :snip kswapd &
> > get_page_from_freelist() : No more no page failures.
> >
> > Mailer added a HTML subpart and chopped the earlier email :^(
>
> Hi Mitchell,
>
> Is it possible to send suggestions in the form of a unified diff, even
> if you haven't even compiled it (just add a note to let people know).
>
> Secondly, we already have a (supposedly working) system of asynch
> reclaim, with buffering and hysteresis. I don't exactly understand
> what problem you think it has that would be solved by rechecking
> watermarks after allocating a page.
>
> When we're in the (min,low) watermark range, we'll wake up kswapd
> _before_ allocating anything, so what is better about the change to
> wake up kswapd after allocating? Can you perhaps come up with an
> example situation also to make this more clear?
>
> Overhead of wakeup_kswapd isn't too much of a problem: if we _should_
> be waking it up when we currently aren't, then we should be calling
> it. However the extra checking in the allocator fastpath is something
> we want to avoid if possible, because this can be a really hot path.
>
> Thanks,
> Nick
>
> --
> SUSE Labs, Novell Inc.

Nick Piggin, et al,

First, diffs would generate a lot of noise, since I rip out and insert a lot of code based on whether I think the code is REALLY needed for MY TEST environment. These suggestions are basically minimal merge suggestions between my development environment and the public Linux tree.

Now the why for this SUGGESTION/PATCH...

> When we're in the (min,low) watermark range, we'll wake up kswapd
> _before_ allocating anything, so what is better about the change to
> wake up kswapd after allocating? Can you perhaps come up with an
> example situation also to make this more clear?

Answer: Will GFP_ATOMIC allocs be failing at that point?
If yes, then why not allow kswapd to attempt to prevent this condition from occurring?

The existing code reads that the first call to get_page_from_freelist() has returned no page. Now you are going to start up something that is at best going to take millisecs to start helping out. Won't it first grab some pages to do its work? So we are going to be lower in free memory right when it starts up. Right?

So, before the change, with high memory consumption/pressure, various GFP_xxx allocations would fail or take an excessive amount of time due to the simple fact of low memory and/or slub/slab consumption and/or a first failure of get_page_from_freelist() in a low free memory condition. Once the above condition occurs, the perception is that the current mainline Linux code then, on demand, increases its effort to find some memory. However, while this is happening the system is in a low memory bind, various performance parameters are being affected, and some allocations are sleeping, being delayed, or outright failing.

What I could see is that the CURRENT suggestions allow a new class of GFP_xxx allocations to succeed while in low memory. The try-again philosophy, waking up kswapd, etc, are all AFTER the fact, while something is WAITING for the memory. This wait is in effect a SYNCHRONOUS wait for memory. Assuming that kswapd is really what is mostly needed, execute it BEFORE (JUST IN TIME) to PREVENT low memory, since I/O needs pages, GFP_ATOMIC allocs fail, and other GFP allocs sleep.

The SUGGESTION is to take the fraction of a microsec longer in the fast path to see if kswapd needs to be started up, and to ATTEMPT to prevent the SLOW-PATH and low/min memory from occurring. The 2x low memory is to allow some scalability and to give kswapd ENOUGH time to do what it needs to do, since I expect a minimum number of millisecs before it can move us away from low free memory. As the amount of memory in a system increases, this could probably be decreased somewhat, to maybe 1.25x.
IF the above is good, then the issue is how to optimize the heck out of the check.

Mitchell Erblich
Re: RFC: issues concerning the next NAPI interface
Jan-Bernd Themann,

IMO, a box must be aware of the speed of all its interfaces, whether the interface implements tail-drop or RED or XYZ, the latency to access the packet, etc. Then when a packet arrives, a timer is started for interrupt coalescing, and to process awaiting packets. If tail-drop is implemented, it is possible to wait until the input FIFO fills to a specific point before starting a timer. This may maximize the number of packets per interrupt. And realize that the worst case of an interrupt per packet is wirespeed pings (echo request/reply) of 64 bytes per packet.

Mitchell Erblich
Re: [RFC] : mm : / Patch / code : Suggestion :snip kswapd & get_page_from_freelist() : No more no page failures.
From: Mitchell Erblich
To: Peter Zijlstra
Cc: Andrew Morton; [EMAIL PROTECTED]; linux-kernel@vger.kernel.org; [EMAIL PROTECTED]
Sent: Friday, August 24, 2007 3:11 PM
Subject: Re: [RFC] : mm : / Patch / code : Suggestion :snip kswapd & get_page_from_freelist() : No more no page failures.

Mailer added a HTML subpart and chopped the earlier email :^(

Peter Zijlstra wrote:
> On Thu, 2007-08-23 at 02:35 -0700, Mitchell Erblich wrote:
> > Group,
> >
> > On the infrequent condition of failing to receive a page from the
> > freelists, one of the things you do is call wakeup_kswapd() (exception of
> > NUMA or GFP_THISNODE).
> >
> > Assuming that wakeup_kswapd() does what we want, this call is
> > such a high overhead call that you want to make sure that the
> > call is infrequent.
>
> It just wakes up a thread, it doesn't actually wait for anything.
> So the function is actually rather cheap.
>
> > My initial guess is that it REALLY needs to re-populate the
> > freelists just before they are used up. However, the simple change
> > is being suggested NOW.
>
> kswapd will only stop once it has reached the high watermarks.
>
> > Assuming that on avg the order value will be used, you should
> > increase the order to cover two allocs of that same level of order,
> > thus the +1. If on the chance that later page_alloc() calls need
> > fewer pages (smaller order) then the extra pages will be available
> > for more page_allocs(). If later calls have larger orders, hopefully
> > the latency between the calls is great enough that other parts of
> > the system will respond to the low memory / on the freelist(s).
>
> By virtue of kswapd only stopping reclaim when it reaches the high
> watermark you already have that it will free more than one page (it's
> started when we're below the low watermark, so it'll free at least
> high-min pages).
>
> Changing the order has quite a different impact, esp now that we have
> lumpy reclaim.
> > Line 1265 within function __alloc_pages(), mm/page_alloc.c
> >
> >     wakeup_kswapd(*z, order);
> > to
> >     wakeup_kswapd(*z, order + 1);
> >
> > In addition, isn't a call needed to determine that the
> > freelist(s) are almost empty, but are still returning a page?
>
> Didn't we just do that by finding out that ALLOC_WMARK_LOW fails to
> return a page?

Peter Zijlstra, et al,

This reply is long. In summary, the order only affects whether the max_order is larger or smaller, and this makes sure that we are getting enough memory. When I get to kswapd, I will look at whether I really think, IMO, that it SHOULD quit before it reaches high mem, as you seem to say above.

IMO, cached memory is good. Freelists are good. No page returns from get_page_from_freelist() are bad. And doing something after a bad thing happens doesn't seem to be the right thing to do.

> Didn't we just do that by finding out that ALLOC_WMARK_LOW fails to
> return a page?

Yes, but I am talking about being premature in my checking while we are still returning pages. When we first start not returning a page, things are starting to fail. Thus, this fairly KISS code is there to stop us from getting to the no-page part! It calls two modified, renamed functions just a few lines above the current call to wakeup_kswapd, in what I term the pre-call sequence. And the code itself is shorter than all this explanation, because it is a different way of thinking about the problem, IMO.

Please realize that this is a prototype and is intended to be very close, but needs to be reviewed by multiple people to identify whether the extra overhead in the fast path is worth the tradeoff of waking up kswapd BEFORE free memory drops to the point where it is NORMALLY awoken. I will assume that 2x low memory is more than ample time to wake up kswapd and allow it to clean any dirty pages, thus preventing any low memory condition from occurring.
Group,

Summary:

> > In addition, isn't a call needed to determine that the
> > freelist(s) are almost empty, but are still returning a page?

These suggestions are intended to decrease the chance of dropping below low memory by waking up kswapd before we run out of pages on the page freelist and/or affect MOST normal page allocations. The simplistic fix is to add a lightweight check, prototyped/suggested below, that checks whether we are about to drop below the low watermark (free_pages / 2). If we are within 2 * lowmem_reserve, then we should start up kswapd with the shortened wakeup_short_swapd(), which removes the call to zone_watermark_ok() because it was already done. So the shortened zone_water_check() is done because we aren't in the middle of an allocation. Thus, if we can start kswapd soon enough, the chance of going below the goto got_pg should be diminished.
[RFC] : mm : / Patch / Suggestion : Add 1 order of aggressiveness to wakeup_kswapd() : 1 line / 1 arg change
Group,

On the infrequent condition of failing to receive a page from the freelists, one of the things you do is call wakeup_kswapd() (exception of NUMA or GFP_THISNODE). Assuming that wakeup_kswapd() does what we want, this call is such a high overhead call that you want to make sure that the call is infrequent. My initial guess is that it REALLY needs to re-populate the freelists just before they are used up. However, the simple change is being suggested NOW.

Assuming that on avg the order value will be used, you should increase the order to cover two allocs of that same level of order, thus the +1. If on the chance that later page_alloc() calls need fewer pages (smaller order), then the extra pages will be available for more page_allocs(). If later calls have larger orders, hopefully the latency between the calls is great enough that other parts of the system will respond to the low memory on the freelist(s).

Line 1265 within function __alloc_pages(), mm/page_alloc.c:

	wakeup_kswapd(*z, order);
to
	wakeup_kswapd(*z, order + 1);

In addition, isn't a call needed to determine that the freelist(s) are almost empty, but are still returning a page? Thus, couldn't a lightweight call be done after a NORMAL page is received at line 1250 (if (page)) from get_page_from_freelist()? Or it could be embedded within the function call path that get_page_from_freelist() uses. We could call wakeup_kswapd() or an equivalent pro-actively. The idea is that the call should check for 2x of the LOW_MEMORY equivalent on the freelists and re-populate them. Then, HOPEFULLY, the 2nd call to get_page_from_freelist() would be obsolete. When I come up with it, I will suggest it to the group.

Mitchell Erblich

FYI: My kernel is different enough that I can not validate this change for the 2.6.2x git.
Re: QUESTION: RT & SCHED & fork: ?MISSING EQUIV of task_new_fair for RT tasks.
Mike Galbraith wrote:
> On Tue, 2007-08-14 at 12:28 -0700, Mitchell Erblich wrote:
> > Group, Ingo Molnar, etc,
> >
> > Why does the rt sched_class contain fewer elements than fair?
> > Missing is the RT for .task_new.
>
> No class specific initialization needs to be done for RT tasks.
>
> -Mike

Mike, et al,

One time: I was told that this group likes bottom posts.

The logic of class-independent code calling class-dependent scheduling code assumes that all functions are present in ALL the class dependent sections. Minimally, if I agree with your above statement, I would assume that the function should still exist as a null type function. However, in reality, a lot of RT class specific init is done; just currently none of it is done in this non-existent function.

Mitchell Erblich
QUESTION: RT & SCHED & fork: ?MISSING EQUIV of task_new_fair for RT tasks.
Group, Ingo Molnar, etc,

Why does the rt sched_class contain fewer elements than fair? Missing is the RT for .task_new.

It is called by do_fork() in fork.c, which calls wake_up_new_task() via:

	if (!(clone_flags & CLONE_STOPPED))
		wake_up_new_task(p, clone_flags);

which is in sched.c and calls:

	if (!p->sched_class->task_new || ...
and
	p->sched_class->task_new(rq, p, now);

sched_rt.c:

	static struct sched_class rt_sched_class __read_mostly = {
		.enqueue_task		= enqueue_task_rt,
		.dequeue_task		= dequeue_task_rt,
		.yield_task		= yield_task_rt,

		.check_preempt_curr	= check_preempt_curr_rt,

		.pick_next_task		= pick_next_task_rt,
		.put_prev_task		= put_prev_task_rt,

		.load_balance		= load_balance_rt,

		.task_tick		= task_tick_rt,
	};

sched_fair.c:

	struct sched_class fair_sched_class __read_mostly = {
		.enqueue_task		= enqueue_task_fair,
		.dequeue_task		= dequeue_task_fair,
		.yield_task		= yield_task_fair,

		.check_preempt_curr	= check_preempt_curr_fair,

		.pick_next_task		= pick_next_task_fair,
		.put_prev_task		= put_prev_task_fair,

		.load_balance		= load_balance_fair,

		.set_curr_task		= set_curr_task_fair,
		.task_tick		= task_tick_fair,
		.task_new		= task_new_fair,
	};

The missing rt equivalent item is:

	/*
	 * Share the fairness runtime between parent and child, thus the
	 * total amount of pressure for CPU stays equal - new tasks
	 * get a chance to run but frequent forkers are not allowed to
	 * monopolize the CPU. Note: the parent runqueue is locked,
	 * the child is not running yet.
	 */
	static void task_new_fair(struct rq *rq, struct task_struct *p)
	{

Mitchell Erblich
minor Suggested cleanup: RT / sched : Have RT tasks use PF_LESS_THROTTLE flag
Group, Ingo Molnar, and Dmitry, et al,

task mm/page-writeback.c : get_dirty_limits()

Shouldn't the PF_LESS_THROTTLE flag be scheduling class independent? The below suggestion is to use the flag for RT tasks -- to drop the rt_task() call within the function:

	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {

1) becomes

	if (tsk->flags & PF_LESS_THROTTLE) {

2) Then in kernel/sched.c, set p->flags |= PF_LESS_THROTTLE before the break:

	case SCHED_FIFO:
	case SCHED_RR:
		p->sched_class = &rt_sched_class;
		p->flags |= PF_LESS_THROTTLE;
		break;

3) Unset it in case the sched class changed, before the break:

	case SCHED_NORMAL:
	case SCHED_BATCH:
	case SCHED_IDLE:
		p->sched_class = &fair_sched_class;
		p->flags &= ~PF_LESS_THROTTLE;
		break;

4) Set the flag in rt_mutex_setprio, with braces:

	if (rt_prio(prio)) {
		p->sched_class = &rt_sched_class;
		p->flags |= PF_LESS_THROTTLE;
	} else

5) Am I missing anything?

Mitchell Erblich
minor Suggested cleanup: RT / sched : Have RT tasks use PF_LESS_THROTTLE flag
Group, Ingo Molnar, and Dmitry, et al, task mm/page-writeback.c : get_dirty_limits() Shouldn't the PF_LESS_THROTTLE flag be scheduling class independent? The below is a suggestion is to use the flag for RT tasks. -- to drop the rt_task() call within the function if (tsk-flags PF_LESS_THROTTLE || rt_task(tsk)) { 1) becomes if (tsk-flags PF_LESS_THROTTLE) Then 2) in kernel/sched.c set: p-flags |= PF_LESS_THROTTLE before the break; case SCHED_FIFO: case SCHED_RR: p-sched_class = rt_sched_class; p-flags |= PF_LESS_THROTTLE; break; 3) Unset it in case sched class changed; before the break case SCHED_NORMAL: case SCHED_BATCH: case SCHED_IDLE: p-sched_class = fair_sched_class; p-flags = ~PF_LESS_THROTTLE; break; 4) set the flag rt_mutex_setprio with braces if (rt_prio(prio)) { p-sched_class = rt_sched_class; p-flags |= PF_LESS_THROTTLE; } else 5) Am I missing anything? Mitchell Erblich - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
QUESTION: RT sched fork: missing equivalent of task_new_fair for RT tasks?

Group, Ingo Molnar, et al,

Why does the rt sched_class contain fewer elements than fair? Missing is the RT equivalent of .task_new.

It is called from do_fork() in fork.c, which calls wake_up_new_task():

	if (!(clone_flags & CLONE_STOPPED))
		wake_up_new_task(p, clone_flags);

wake_up_new_task() is in sched.c and does:

	if (!p->sched_class->task_new || ...
	...
	p->sched_class->task_new(rq, p, now);

sched_rt.c:

	static struct sched_class rt_sched_class __read_mostly = {
		.enqueue_task		= enqueue_task_rt,
		.dequeue_task		= dequeue_task_rt,
		.yield_task		= yield_task_rt,

		.check_preempt_curr	= check_preempt_curr_rt,

		.pick_next_task		= pick_next_task_rt,
		.put_prev_task		= put_prev_task_rt,

		.load_balance		= load_balance_rt,

		.task_tick		= task_tick_rt,
	};

sched_fair.c:

	struct sched_class fair_sched_class __read_mostly = {
		.enqueue_task		= enqueue_task_fair,
		.dequeue_task		= dequeue_task_fair,
		.yield_task		= yield_task_fair,

		.check_preempt_curr	= check_preempt_curr_fair,

		.pick_next_task		= pick_next_task_fair,
		.put_prev_task		= put_prev_task_fair,

		.load_balance		= load_balance_fair,

		.set_curr_task		= set_curr_task_fair,
		.task_tick		= task_tick_fair,
		.task_new		= task_new_fair,
	};

The missing rt equivalent is:

	/*
	 * Share the fairness runtime between parent and child, thus the
	 * total amount of pressure for CPU stays equal - new tasks
	 * get a chance to run but frequent forkers are not allowed to
	 * monopolize the CPU. Note: the parent runqueue is locked,
	 * the child is not running yet.
	 */
	static void task_new_fair(struct rq *rq, struct task_struct *p)
	{

Mitchell Erblich
Question: sched_rt.c : loss of stats? requeue_task_rt() does not call update_curr_rt(), which updates stats

sched_rt.c : requeue_task_rt()

The comment states the intent: requeue, no dequeue.

	/*
	 * Put task to the end of the run list without the overhead of
	 * dequeue followed by enqueue.
	 */

dequeue_task_rt() updates stats; by not calling it, the stat update is skipped. Thus, shouldn't requeue_task_rt() call update_curr_rt(rq)?

Mitchell Erblich
Question: sched_rt.c : is an RT check needed within an RT func? dequeue_task_rt() calls update_curr_rt(), which checks for a policy of RR or FIFO

1) Possible wasted stats overhead during dequeue.

sched_rt.c: Is an RT check needed within an RT function? dequeue_task_rt() calls update_curr_rt(), which checks for a policy of RR or FIFO. Within

	static inline void update_curr_rt(struct rq *rq)

are the two lines

	if (!task_has_rt_policy(curr))
		return;

If I am reading this right, what RT task is neither FIFO nor RR? Thus, I think those two lines could be removed.

2) Nit: the comment within sched_rt.c,

	/* Adding/removing a task to/from a priority array: */

is placed before dequeue_task_rt(), while enqueue_task_rt() is placed above the comment. Thus, the comment should be moved above enqueue_task_rt().

Mitchell Erblich
Re: Question: RT scheduler : task_tick_rt(struct rq *rq, struct task_struct *p) : decreases overhead when rq->nr_running == 1

Ingo Molnar and group,

First, RT/RR tasks are not deprecated. This change simply removes sched overhead while only one RT/RR task is runnable per rq. The code size increase is minor and it is a very straightforward change.

Thus, my assumption is/was to hand you, Ingo Molnar, this change because:

* you have the latest scheduler git tree and do periodic pull requests, so I consider you the default scheduler maintainer :^)
* my scheduler tree is more developmental,
* and I wish to first suggest minor changes over time to see what scheduler changes are more acceptable.

Mitchell Erblich

Ingo Molnar wrote:
>
> * Mitchell Erblich <[EMAIL PROTECTED]> wrote:
>
> > After
> >         p->time_slice = static_prio_timeslice(p->static_prio);
> >
> > why isn't there a check like
> >         if (rq->nr_running == 1)
> >                 return;
> >
> > which would remove the need for any rescheduling or requeueing...
>
> your change is a possible optimization, but this is a pretty rare
> codepath because the overwhelming majority of RT apps uses SCHED_FIFO.
> Plus, the time_slice going down to 0 is a relatively rare event even for
> SCHED_RR tasks. And if we have only a single SCHED_RR task, why is it
> SCHED_RR to begin with? So this is on several levels an uncommon
> workload and by adding a check like that we'd increase the codesize. But
> ... no strong feelings against this optimization - if you send a proper
> patch we can apply it, it certainly makes sense from a logic POV.
>
>         Ingo
Question: RT scheduler : task_tick_rt(struct rq *rq, struct task_struct *p) : decreases overhead when rq->nr_running == 1

Ingo Molnar and group,

If there is a single RT task on this rq, then why not just reset the timeslice and return, just maybe decreasing rescheduling overhead?

Thus, after

	p->time_slice = static_prio_timeslice(p->static_prio);

why isn't there a check like

	if (rq->nr_running == 1)
		return;

which would remove the need for any rescheduling or requeueing...

Mitchell Erblich

FYI: below is believed to be a snapshot of the current/original function:

+static void task_tick_rt(struct rq *rq, struct task_struct *p)
+{
+	/*
+	 * RR tasks need a special form of timeslice management.
+	 * FIFO tasks have no timeslices.
+	 */
+	if (p->policy != SCHED_RR)
+		return;
+
+	if (--p->time_slice)
+		return;
+
+	p->time_slice = static_prio_timeslice(p->static_prio);
+	set_tsk_need_resched(p);
+
+	/* put it at the end of the queue: */
+	requeue_task_rt(rq, p);
+}
Re: about modularization

Rene,

Of the uni-processor systems currently able to run Linux, I would not doubt that 99+% are uni-cores. It will probably be 3-5 years minimum before multi-core processors make up any decent percentage of systems. And I am not suggesting not supporting them.

I am only suggesting, wrt the scheduler: bring the system up with a default scheduler, and then load additional functionality based on the hardware/software requirements of the system.

Thus, the fallout MIGHT be a uni-processor CFS that would not migrate tasks between multiple CPUs; as additional processors are brought online, migration could be enabled, and gang-type scheduling or whatever could then be used.

IMO, if there is a fault (because of heat, etc.) the user would rather bring up the system in a degraded mode. The same reason applies to... boot -s.

Mitchell Erblich

Rene Herman wrote:
>
> On 08/06/2007 10:20 PM, Mitchell Erblich wrote:
>
> > Thus, a hybrid scheduler approach could be taken
> > that would default to a single uni-processor scheduler
>
> What a brilliant idea in a world where buying a non multi core CPU is
> getting to be only somewhat easier than a non SMT one...
>
> Rene.
Re: about modularization
Ingo Molnar and group,

If we just concentrate on CPU schedulers... IMO, POSIX requirements almost guarantee support for modularization. The different task scheds allow a set of task-class-specific funcs to be generated. The question is whether those modular schedulers will ALWAYS consume kernel footprint space.

The argument for modularization is really that it targets optional hardware and decreases kernel footprint size. So here is an argument to support only one default scheduler and modularize the rest of the sched code.

IMO, ONLY some envs REQUIRE RT sched and some envs REQUIRE MP (multi-core and multi-processor) scheduling. I question whether the core kernel needs this support. This additional capability could be removed from the growing kernel footprint, and additional schedulers could be kept in the src code base with increasingly minimal effort if full modularization support were added.

Thus, a hybrid scheduler approach could be taken that would default to a single uni-processor scheduler (ex: CFS) and the other schedulers could be modularized.

Mitchell Erblich

Ingo Molnar wrote:
>
> * T. J. Brumfield <[EMAIL PROTECTED]> wrote:
>
> > 1 - Can someone please explain why the kernel can be modular in every
> > other aspect, including offering a choice of IO schedulers, but not
> > kernel schedulers?
>
> that's a fundamental misconception. If you boot into a distro kernel on
> a typical PC, about half of the kernel code that the box runs in any
> moment will be in modules, half of it is in the "kernel core". For
> example, on a random laptop:
>
>   $ echo `lsmod | cut -c1-30 | cut -d' ' -f2-` | sed 's/Size //' | sed 's/ /+/g' | bc
>   2513784
>
> i.e. 2.5 MB of modules. The core kernel's size:
>
>   $ dmesg | grep 'kernel code'
>   Memory: 2053212k/2087808k available (2185k kernel code, 33240k reserved, 1174k data, 244k init, 1170304k highmem)
>
> 2.1 MB of kernel core code. (of course the total body of "possible
> drivers" is 10 times larger than that of the core kernel - but the
> fundamental 'variety' is not.)
>
> most of the modules are for stuff where there is a significant physical
> difference between the components they support. Drivers for different
> pieces of hardware. Filesystem drivers for different on-disk physical
> layouts. Network protocol drivers for different on-wire formats. The
> sanest technological decision there is clearly to modularize.
>
> And note that often it's not even about choice there: the user's system
> has a particular piece of hardware, to which there is usually one
> primary driver. The user does not have any real 'choice' over the
> modularization here, it's largely a technological act to make the
> kernel's footprint smaller.
>
> But the kernel core, which does not depend as much on the physical
> properties of the stuff it supports (it depends on the physics of the
> machine of course, but those rules are mostly shared between all
> machines of that architecture), and is fundamentally influenced by the
> syscall API (which is not modular either) and by our OS design
> decisions, has much less reason to be modularized.
>
> The core kernel was always non-modular, and it depends on the technical
> details whether we want to or _have to_ modularize something so that it
> becomes modular to the user too. For example we dont have 'competing',
> modular versions of the IPv4 stack. Neither of the VFS. Nor of timers,
> futexes, nor of locking code or of the CPU scheduler. But we can switch
> out any of those implementations from the core kernel, and did so
> numerous times in the past and will do so in the future.
>
> CPU schedulers are as core kernel code as it gets - you cannot even boot
> without having a CPU scheduler. IO schedulers, although similar in name,
> are quite different beasts from CPU schedulers, and they are somewhere
> between the core kernel and drivers. They are not 'physical drivers' (an
> IO scheduler can drive any disk), nor are they fully 'core kernel code'
> in the sense of a kernel not even being able to boot without them. Also,
> disks are physically different from CPUs, in a way which works _against_
> the user-modularization of CPU schedulers. (there are also many other
> differences which have been pointed out in the past)
>
> In any case, the IO subsystem maintainers decided to modularize IO
> schedulers, and that's their decision. One of the authors of the IO
> scheduler code said it on lkml recently that while modularization of IO
> scheduler had advantages too, in retrospect he wishes they would not
> have made IO schedulers modular and now that decision cannot be undone.
> So even that much different situation was far from a clear decision, and
> some negative effects can be felt today too, in form of having two
> primary IO schedulers but not having one IO scheduler that works well in
> all cases. For CPU schedulers
Re: Question: Scheduler 'Exit' or modularization of scheduler?
Paul Robinson,

Solaris's SunOS SVR4.x has a modular scheduler / dispatcher; however, I believe that the time-share dispatcher is actually a frozen base and is not replaceable. I do believe that some of its scheduling characteristics are modifiable.

It is my understanding that CFS is the new scheduler; however, it is separate from the RT (real-time) FIFO and RR scheduler.

Currently, I have an initial suggestion to lkml to see if there is any support for an interactive task class. It allows a root user to classify a set of tasks as interactive. If there is acceptance, I plan to propose a patch / set of changes to support this new task class by the end of Aug. IFF, then maybe in the future...

Paul Robinson, if you feel that you are up to the task of modularizing the scheduler / dispatcher to do what you think should be done, I would suggest submitting a prototype to Ingo, et al, and seeing what the response is.

Mitchell Erblich

Paul Robinson wrote:
>
> There has been a considerable amount of talk and many news articles on
> some websites because of the inclusion of the CFS scheduler either as a
> replacement for the old scheduler or instead of using the SD scheduler;
> some people apparently feel that one or the other of these is not right
> in some contexts or in some environments. I'm not completely clear on
> what is going on or exactly what the complaint is. But I personally
> would like to try to toss in my 0.02 Euro in an attempt to offer some
> light and less heat to the dilemma and offer a suggestion.
>
> If my ignorance of the subject is too obvious, please excuse me, I might
> not have that much experience in the subject. I've only been
> programming for 27 years and I hope to get better at it with more practice.
>
> So, there are two questions about which I am wondering. They may be
> somewhat related but the methodology for each are different and the
> method of implementation would be different.
>
> 1. Could it be possible to design the interface to the scheduler either
> that there is an 'exit' (my mainframe history precedes me) in which one
> can issue a monitor call such that the system changes to an alternate
> scheduler, possibly as part of the boot process? Thus there might be a
> default scheduler but it is possible to invoke an alternative one.
>
> 2. Could the scheduler be such that it be designed as a system loadable
> module rather than as a monolithic part of the code, such that the
> particular scheduler is a specific file and is simply installed at boot
> time, and if someone wants a different scheduler, they can simply create
> a new one, rename the existing one to something else, name theirs to
> whatever the scheduler's name is, then shutdown and reboot the machine?
>
> I am thinking that the system scheduler is an integral part of a
> time-shared operating system; it would be memory resident while the
> machine is operating, thus it only has to be on disk during start up and
> is not in use while the system is in normal operation and could be
> replaced at any time (subject to the usual caveat that the system has to
> be shutdown and rebooted to cause the scheduling mechanism to be changed
> to the new one).
>
> If such a capacity were available, or perhaps if such capacity can be
> implemented at some point in the future, this would solve one of the
> more critical issues, since people needing more finely tuned scheduling
> facilities can use one different from the common one, or 'roll their
> own' if they need something really special.
>
> I am also thinking this sort of a capacity would be extremely useful
> either in virtualization issues, in running other operating systems (or
> copies of Linux) as guest operating systems under Linux vis-a-vis Xen,
> or in respect to real-time versions of Linux, such that if someone needs
> to grant certain processes high priority, and the rest everything that's
> left over, then they could do that simply by writing the scheduler to
> the interface definition.
>
> Of course, I could be completely wrong on this point and this is not a
> partitionable feature, that it's not possible to have the job scheduler
> loaded from a secondary module at boot time. (This may be one of the
> reasons why there have been problems with non-monolithic kernels being
> unavailable for general use except in extremely limited cases.)
>
> Or I could be wrong in that this issue isn't that important and most of
> the noise over the issue is a small and vocal minority complaining about
> a marginal and unimportant issue. Of course, this sort of situation is
> probably the case with 90% of all traffic on usenet, newsgroups and
> mailing lists, so what else is new? Or, and this is the big one, that
> this feature already exists in the Linux kernel and the method of
> scheduler invocation is already modularized for boot-time
Re: Question: Scheduler 'Exit' or modularization of scheduler?
Paul Robinson, Solaris's SunOS SVR4.x has a modular schedular / dispatcher, however, I believe that the time share dispatcher is actually a frozen-base and is not replaceable. I do believe that some of its scheduling characteristics are modifyable. IT is my understanding that CFS is the new schedular, however it is separated from the RT (real-time) FIFO and RR schedular. Currently, I have a initial suggestion to lkma to see if their is any support of a interactive task class. It allows a root user to classify a set of tasks as interactive. If their is acceptance, I plan to propose a patch/ set of changes to support this new task class by the end of Aug. IFF, then maybe in the future... Paul Robinson, if you feel that you are up to the task of modularizing the schedular / dispatcher to do what you think should be done, I would suggest submitting a prototype to Ingo, et al and see what the response is.. Mitchell Erblich Paul Robinson wrote: There has been a considerable amount of talk and many news articles on some websites because of the inclusion of the CFS scheduler either as a replacement for the old scheduler or instead of using the SD scheduler, some people apparently feel that one or the other of these is not right in some contexts or in some environments. I'm not completely clear on what is going on or exactly what the complaint is. But I personally would like to try to toss in my 0.02 Euro in an attempt to offer some light and less heat to the dilemma and offer a suggestion. If my ignorance of the subject is too obvious, please excuse me, I might not have that much experience in the subject. I've only been programming for 27 years and I hope to get better at it with more practice. So, there are two questions about which I am wondering. They may be somewhat related but the methodology for each are different and the method of implementation would be different. 1. 
Could it be possible to design the interface to the scheduler either that there is an 'exit' (my mainframe history precedes me) in which one can issue a monitor call such that the system changes to an alternate scheduler, possibly as part of the boot process? Thus there might be a default scheduler but it is possible to invoke an alternative one. 2. Could the scheduler be such that it be designed as a system loadable module rather than as a monolithic part of the code, such that the particular scheduler is a specific file and is simply installed at boot time, and if someone wants a different scheduler, they can simply create a new one, rename the existing one to something else, name theirs to whatever the scheduler's name is, then shutdown and reboot the machine? I am thinking that the system scheduler is an integral part of a time-shared operating system, it would be memory resident while the machine is operating, thus it only has to be on disk during start up and is not in use while the system is in normal operation and could be replaced at any time (subject to the usual caveat that the system has to be shutdown and rebooted to cause the scheduling mechanism to be changed to the new one.). If such a capacity were available, or perhaps if such capacity can be implemented at some point in the future, this would solve one of the more critical issues, since people needing more finely tuned scheduling facilities can use one different from the common one, or 'roll their own' if they need something really special. I am also thinking this sort of a capacity would be extremely useful either in virtualization issues, in running other operating systems (or copies of Linux) as guest operating systems under Linux vis-a-vis Xen, or in respect to real-time versions of Linux, such that if someone needs to grant certain processes high priority, and the rest everything that's left over, then they could do that simply by writing the scheduler to the interface definition. 
Of course, I could be completely wrong on this point and this may not be a partitionable feature; it may not be possible to have the job scheduler loaded from a secondary module at boot time. (This may be one of the reasons why non-monolithic kernels have been unavailable for general use except in extremely limited cases.) Or I could be wrong in that this issue isn't that important, and most of the noise over it is a small and vocal minority complaining about a marginal and unimportant issue. Of course, this sort of situation is probably the case with 90% of all traffic on Usenet, newsgroups, and mailing lists, so what else is new? Or, and this is the big one, this feature may already exist in the Linux kernel, and the method of scheduler invocation may already be modularized for boot-time
scheduler : No code : New interactive (ia) sched class : Part 1
- Comments for or against the SCHED_IA class in the generic Linux kernel source tree.
- General comments

With Ingo Molnar's latest 2.6.22 inclusions into the Linux source base, he has opened an opportunity to add additional task classes. This informal 1-pager is to try to identify whether there is support for an interactive (ia) task class. POSIX explicitly states only three classes; however, it states that implementations may define additional schedulers/classes. This 1-pager addresses support for an additional scheduler class, SCHED_IA. Currently, interactive tasks are derived from SCHED_NORMAL based on sleep factors and generate bonuses/credits to alter their scheduling behaviour. However, this developer believes that other tasks with different behaviours exist that COULD also be classified as ia tasks. This developer's belief is that the ia class could generally be supported with a minimal amount of effort, would be a benefit to the Linux kernel, and would generally support the following rules:

* allow root / SUSER (Super USER) to determine that a task should be classified as an ia (interactive) task, without eviction based on scheduling behaviours,
* allow SCHED_IA priorities to be placed between the real-time schedulers and SCHED_NORMAL/OTHER,
* derive the interactive behaviour from the task's nice value and/or its initial combined priority,
* TBD: identify whether the priority should be fixed or have limited variability within the ia class,
* user-level command support to identify that a task is an interactive (ia) task,
* etc. (fork behaviours).

Note: this 1-pager is not suggesting the removal or adjustment of interactive support within the SCHED_NORMAL/OTHER, currently supported CFS environment.
Mitchell Erblich

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/