Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Evgeniy Polyakov <[EMAIL PROTECTED]> Date: Fri, 1 Dec 2006 12:53:07 +0300 > Isn't it a step in the direction of full TCP processing bound to process > context? :) :-) Rather, it is just finer-grained locking. - To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
On Thu, Nov 30, 2006 at 12:14:43PM -0800, David Miller ([EMAIL PROTECTED]) wrote: > > It steals timeslices from other processes to complete tcp_recvmsg() > > task, and only when it does it for too long, it will be preempted. > > Processing backlog queue on behalf of need_resched() will break > > fairness too - processing itself can take a lot of time, so process > > can be scheduled away in that part too. > > Yes, at this point I agree with this analysis. > > Currently I am therefore advocating some way to allow > full input packet handling even amidst tcp_recvmsg() > processing. Isn't it a step in the direction of full TCP processing bound to process context? :) -- Evgeniy Polyakov
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Ingo Molnar <[EMAIL PROTECTED]> Date: Thu, 30 Nov 2006 21:49:08 +0100 > So i dont support the scheme proposed here, the blatant bending of the > priority scale towards the TCP workload. I don't support this scheme either ;-) That's why my proposal is to find a way to allow input packet processing even during tcp_recvmsg() work. It is a solution that would give the TCP task exactly its time slice, no more, no less, without the erroneous behavior of sleeping with packets held in the socket backlog.
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* Ingo Molnar <[EMAIL PROTECTED]> wrote: > [...] Instead what i'd like to see is more TCP performance (and a > nicer over-the-wire behavior - no retransmits for example) /with the > same 10% CPU time used/. Are we in rough agreement? Put another way: I'd like to see the "TCP bytes transferred per CPU time spent by the TCP stack" ratio maximized in a load-independent way (part of which is the sender host too: not causing unnecessary retransmits is important as well). In a high-load scenario this means that any measure that purely improves TCP throughput by giving it more cycles is not a real improvement. So the focus should be on throttling intelligently and without causing extra work on the sender side either - not on trying to circumvent throttling measures. Ingo
RE: [patch 1/4] - Potential performance bottleneck for Linux TCP
>if you still have the test-setup, could you nevertheless try setting the >priority of the receiving TCP task to nice -20 and see what kind of >performance you get? A process with a nice value of -20 can easily gain interactive status. When it expires, it still goes back to the active array. That just hides the TCP problem instead of solving it. A process with a nice value of -20 has the following advantages over other processes: (1) its timeslice is 800ms, while the timeslice of a process with a nice value of 0 is 100ms; (2) it has higher priority than other processes; (3) it gains interactive status more easily. The chance that the process expires and moves to the expired array with packets in the backlog is much reduced, but it can still happen. wenji
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* David Miller <[EMAIL PROTECTED]> wrote: > > disk I/O is typically not CPU bound, and i believe these TCP tests > > /are/ CPU-bound. Otherwise there would be no expiry of the timeslice > > to begin with and the TCP receiver task would always be boosted to > > 'interactive' status by the scheduler and would happily chug along > > at 500 mbits ... > > It's about the prioritization of the work. > > If all disk I/O were shut off and frozen while we copy file data into > userspace, you'd see the same problem for disk I/O. Well, it's an issue of how much processing is done in non-prioritized contexts. TCP is a bit more sensitive to process context being throttled - but disk I/O is not immune either: if nothing submits new IO, or if the task does short reads+writes, then any process-level throttling immediately shows up in IO throughput. But in the general sense it is /unfair/ that certain processing such as disk and network IO can get a disproportionate amount of CPU time from the system - just because they happen to have some of their processing in IRQ and softirq context (which is essentially prioritized at SCHED_FIFO 100). A system can easily spend 80% CPU time in softirq context. (And that is easily visible in something like an -rt kernel, where various softirq contexts are separate threads and you can see 30% net-rx and 20% net-tx CPU utilization in 'top'.) How is this kind of processing different from purely process-context based subsystems? So I agree with you that by tweaking the TCP stack to be less sensitive to process throttling you /will/ improve the relative performance of the TCP receiver task - but in general system design and scheduler design terms it's not a win. I'd also agree with the notion that the current 'throttling' of process contexts can be abrupt and uncooperative, and hence the TCP stack could get more out of the same amount of CPU time if it used it in a smarter way.
As I pointed out in the first mail, I'd support the TCP stack getting the ability to query how much timeslice it has - or even the scheduler notifying the TCP stack via some downcall if current->timeslice reaches 1 (or something like that). So I don't support the scheme proposed here, the blatant bending of the priority scale towards the TCP workload. Instead what I'd like to see is more TCP performance (and a nicer over-the-wire behavior - no retransmits for example) /with the same 10% CPU time used/. Are we in rough agreement? Ingo
RE: [patch 1/4] - Potential performance bottleneck for Linux TCP
> It steals timeslices from other processes to complete tcp_recvmsg() > task, and only when it does it for too long, it will be preempted. > Processing backlog queue on behalf of need_resched() will break > fairness too - processing itself can take a lot of time, so process > can be scheduled away in that part too. It does steal timeslices from other processes to complete the tcp_recvmsg() task. But I do not think it will take long. When the backlog is processed, the processed packets go to the receive buffer, and TCP flow control takes effect to slow down the sender. The data-receiving process might be preempted by higher-priority processes. As long as the data-receiving process stays in the active array, the problem is not that bad, because the process might resume its execution soon. The worst case is that it expires and is moved to the expired array with packets still in the backlog queue. wenji
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* David Miller <[EMAIL PROTECTED]> wrote: > I want to point out something which is slightly misleading about this > kind of analysis. > > Your disk I/O speed doesn't go down by a factor of 10 just because 9 > other non disk I/O tasks are running, yet for TCP that's seemingly OK > :-) Disk I/O is typically not CPU bound, and I believe these TCP tests /are/ CPU-bound. Otherwise there would be no expiry of the timeslice to begin with, and the TCP receiver task would always be boosted to 'interactive' status by the scheduler and would happily chug along at 500 mbits ... (And I grant you, if a disk IO test is 20% CPU bound in process context and system load is 10, then the scheduler will throttle that task quite effectively.) Ingo
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Ingo Molnar <[EMAIL PROTECTED]> Date: Thu, 30 Nov 2006 21:30:26 +0100 > disk I/O is typically not CPU bound, and i believe these TCP tests /are/ > CPU-bound. Otherwise there would be no expiry of the timeslice to begin > with and the TCP receiver task would always be boosted to 'interactive' > status by the scheduler and would happily chug along at 500 mbits ... It's about the prioritization of the work. If all disk I/O were shut off and frozen while we copy file data into userspace, you'd see the same problem for disk I/O.
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* Wenji Wu <[EMAIL PROTECTED]> wrote: > >The solution is really simple and needs no kernel change at all: if > >you want the TCP receiver to get a larger share of timeslices then > >either renice it to -20 or renice the other tasks to +19. > > Simply giving a larger share of timeslices to the TCP receiver won't > solve the problem. No matter what the timeslice is, if the TCP > receiving process has packets within backlog, and the process is > expired and moved to the expired array, RTO might happen in the TCP > sender. If you still have the test-setup, could you nevertheless try setting the priority of the receiving TCP task to nice -20 and see what kind of performance you get? Ingo
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Ingo Molnar <[EMAIL PROTECTED]> Date: Thu, 30 Nov 2006 11:32:40 +0100 > Note that even without the change the TCP receiving task is already > getting a disproportionate share of cycles due to softirq processing! > Under a load of 10.0 it went from 500 mbits to 74 mbits, while the > 'fair' share would be 50 mbits. So the TCP receiver /already/ has an > unfair advantage. The patch only deepens that unfairness. I want to point out something which is slightly misleading about this kind of analysis. Your disk I/O speed doesn't go down by a factor of 10 just because 9 other non disk I/O tasks are running, yet for TCP that's seemingly OK :-) Not looking at input TCP packets enough to send out the ACKs is the same as "forgetting" to queue some I/O requests that can go to the controller right now. That's the problem: TCP performance is intimately tied to ACK feedback. So we should find a way to make sure ACK feedback goes out, in preference to other tcp_recvmsg() processing. What really should pace the TCP sender in this kind of situation is the advertised window, not the lack of ACKs. Lack of an ACK means the packet didn't get there, which is the wrong signal in this kind of situation, whereas a closing window means "application can't keep up with the data rate, hold on..." and is the proper flow control signal in this high-load scenario. If you don't send ACKs, packets are retransmitted when there is no reason for it, and that borders on illegal. :-)
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Evgeniy Polyakov <[EMAIL PROTECTED]> Date: Thu, 30 Nov 2006 13:22:06 +0300 > It steals timeslices from other processes to complete tcp_recvmsg() > task, and only when it does it for too long, it will be preempted. > Processing backlog queue on behalf of need_resched() will break > fairness too - processing itself can take a lot of time, so process > can be scheduled away in that part too. Yes, at this point I agree with this analysis. Currently I am therefore advocating some way to allow full input packet handling even amidst tcp_recvmsg() processing.
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Wenji Wu <[EMAIL PROTECTED]> Date: Thu, 30 Nov 2006 10:08:22 -0600 > If the higher priority processes become runnable (e.g., interactive > processes), you had better yield the CPU instead of continuing this process. If > it is the case that the process within tcp_recvmsg() is expiring, then you > can let the process go ahead and process the backlog. Yes, I understand this, and I made that point in one of my replies to Ingo Molnar last night. The only seemingly remaining possibility is to find a way to allow input packet processing, at least enough to emit ACKs, during tcp_recvmsg() processing.
RE: [patch 1/4] - Potential performance bottleneck for Linux TCP
>The solution is really simple and needs no kernel change at all: if you >want the TCP receiver to get a larger share of timeslices then either >renice it to -20 or renice the other tasks to +19. Simply giving a larger share of timeslices to the TCP receiver won't solve the problem. No matter what the timeslice is, if the TCP receiving process has packets in its backlog and the process is expired and moved to the expired array, an RTO might happen in the TCP sender. The solution does not look that simple. wenji
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
On Thu, 2006-11-30 at 09:33 +, Christoph Hellwig wrote: > On Wed, Nov 29, 2006 at 07:56:58PM -0600, Wenji Wu wrote: > > Yes, when CONFIG_PREEMPT is disabled, the "problem" won't happen. That is > > why I put "for 2.6 desktop, low-latency desktop" in the uploaded paper. > > This "problem" happens in the 2.6 Desktop and Low-latency Desktop. > > CONFIG_PREEMPT is only for people that are in it for the feeling. There is no > real world advantage to it and we should probably remove it again. There certainly is a real world advantage for many applications. Of course it would be better if the latency requirements could be met without kernel preemption, but that's not the case now. Lee
RE: [patch 1/4] - Potential performance bottleneck for Linux TCP
>We can make explicit preemption checks in the main loop of >tcp_recvmsg(), and release the socket and run the backlog if >need_resched() is TRUE. >This is the simplest and most elegant solution to this problem. I am not sure whether this approach will work. How can you make the explicit preemption checks? For the Desktop case, yes, you can make explicit preemption checks at some points to see whether need_resched() is true. But when need_resched() is true, you cannot decide whether it was triggered by higher-priority processes becoming runnable, or by the process within tcp_recvmsg() expiring. If higher-priority processes become runnable (e.g., interactive processes), you had better yield the CPU instead of continuing this process. If it is the case that the process within tcp_recvmsg() is expiring, then you can let the process go ahead and process the backlog. For the Low-latency Desktop case, I believe it is very hard to make the checks. We do not know when the process is going to expire, or when a higher-priority process will become runnable. The process could expire at any moment, or a higher-priority process could become runnable at any moment. If we do not want to trade off system responsiveness, where do you want to make the check? And if you just make the check and need_resched() turns out TRUE, what are you going to do in that case? wenji
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* Evgeniy Polyakov <[EMAIL PROTECTED]> wrote: > > David's line of thinking for a solution sounds better to me. This > > patch does not prevent the process from being preempted (for > > potentially a long time), by any means. > > It steals timeslices from other processes to complete tcp_recvmsg() > task, and only when it does it for too long, it will be preempted. > Processing backlog queue on behalf of need_resched() will break > fairness too - processing itself can take a lot of time, so process > can be scheduled away in that part too. Correct - it's just the wrong thing to do. The '10% performance win' that was measured was against _9 other tasks who contended for the same CPU resource_. I.e. it's /not/ an absolute 'performance win' AFAICS, it's a simple shift of CPU cycles away from the other 9 tasks and towards the task that does TCP receive. Note that even without the change the TCP receiving task is already getting a disproportionate share of cycles due to softirq processing! Under a load of 10.0 it went from 500 mbits to 74 mbits, while the 'fair' share would be 50 mbits. So the TCP receiver /already/ has an unfair advantage. The patch only deepens that unfairness. The solution is really simple and needs no kernel change at all: if you want the TCP receiver to get a larger share of timeslices then either renice it to -20 or renice the other tasks to +19. The other disadvantage, even ignoring that it's the wrong thing to do, is the crudeness of preempt_disable() that I mentioned in the other post: --> independently of the issue at hand, in general the explicit use of preempt_disable() in non-infrastructure code is quite a heavy tool. Its effects are heavy and global: it disables /all/ preemption (even on PREEMPT_RT).
Furthermore, when preempt_disable() is used for per-CPU data structures, then [unlike, for example, with a spin-lock] the connection between the 'data' and the 'lock' is not explicit - causing all kinds of grief when trying to convert such code to a different preemption model. (Such as PREEMPT_RT :-) So my plan is to remove all "open-coded" use of preempt_disable() [and raw use of local_irq_save/restore] from the kernel and replace it with some facility that connects data and lock. (Note that this will not result in any actual changes on the instruction level, because internally every such facility still maps to preempt_disable() on non-PREEMPT_RT kernels, so on non-PREEMPT_RT kernels such code will still be the same as before.) Ingo
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
On Thu, Nov 30, 2006 at 09:07:42PM +1100, Nick Piggin ([EMAIL PROTECTED]) wrote: > >Isn't the provided solution just an in-kernel variant of > >SCHED_FIFO set from userspace? Why should the kernel be able to mark some > >users as having higher priority? > >What if the workload of the system targets not maximum TCP > >performance but maximum other-task performance, which the provided patch > >will break? > > David's line of thinking for a solution sounds better to me. This patch > does not prevent the process from being preempted (for potentially a long > time), by any means. It steals timeslices from other processes to complete the tcp_recvmsg() task, and only when it does it for too long will it be preempted. Processing the backlog queue on behalf of need_resched() will break fairness too - the processing itself can take a lot of time, so the process can be scheduled away in that part too. > -- > SUSE Labs, Novell Inc. -- Evgeniy Polyakov
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
Evgeniy Polyakov wrote: > On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar ([EMAIL PROTECTED]) wrote: > Isn't the provided solution just an in-kernel variant of SCHED_FIFO set from userspace? Why should the kernel be able to mark some users as having higher priority? > What if the workload of the system targets not maximum TCP performance but maximum other-task performance, which the provided patch will break? David's line of thinking for a solution sounds better to me. This patch does not prevent the process from being preempted (for potentially a long time), by any means. -- SUSE Labs, Novell Inc.
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
On Thu, Nov 30, 2006 at 08:35:04AM +0100, Ingo Molnar ([EMAIL PROTECTED]) wrote: > what was observed here were the effects of completely throttling TCP > processing for a given socket. I think such throttling can in fact be > desirable: there is a /reason/ why the process context was preempted: in > that load scenario there was 10 times more processing requested from the > CPU than it can possibly service. It's a serious overload situation and > it's the scheduler's task to prioritize between workloads! > > normally such kind of "throttling" of the TCP stack for this particular > socket does not happen. Note that there's no performance lost: we dont > do TCP processing because there are /9 other tasks for this CPU to run/, > and the scheduler has a tough choice. > > Now i agree that there are more intelligent ways to throttle and less > intelligent ways to throttle, but the notion to allow a given workload > 'steal' CPU time from other workloads by allowing it to push its > processing into a softirq is i think unfair. (and this issue is > partially addressed by my softirq threading patches in -rt :-) Isn't the provided solution just an in-kernel variant of SCHED_FIFO set from userspace? Why should the kernel be able to mark some users as having higher priority? What if the workload of the system targets not maximum TCP performance but maximum other-task performance, which the provided patch will break? > Ingo -- Evgeniy Polyakov
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
On Wed, Nov 29, 2006 at 07:56:58PM -0600, Wenji Wu wrote: > Yes, when CONFIG_PREEMPT is disabled, the "problem" won't happen. That is why > I put "for 2.6 desktop, low-latency desktop" in the uploaded paper. This > "problem" happens in the 2.6 Desktop and Low-latency Desktop. CONFIG_PREEMPT is only for people that are in it for the feeling. There is no real world advantage to it and we should probably remove it again.
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* David Miller <[EMAIL PROTECTED]> wrote: > > furthermore, the tweak allows the shifting of processing from a > > prioritized process context into a highest-priority softirq context. > > (it's not proven that there is any significant /net win/ of > > performance: all that was proven is that if we shift TCP processing > > from process context into softirq context then TCP throughput of > > that otherwise penalized process context increases.) > > If we preempt with any packets in the backlog, we send no ACKs and the > sender cannot send thus the pipe empties. That's the problem, this > has nothing to do with scheduler priorities or stuff like that IMHO. > The argument goes that if the reschedule is delayed long enough, the > ACKs will exceed the round trip time and trigger retransmits which > will absolutely kill performance. Yes, but I disagree a bit about the characterisation of the problem. The question in my opinion is: how is TCP processing prioritized for this particular socket, which is attached to the process context that was preempted? Normally, quite a bit of TCP processing happens in a softirq context (in fact most of it happens there), and softirq contexts have no fairness whatsoever - they preempt whatever processing is going on, regardless of any priority preferences of the user! What was observed here were the effects of completely throttling TCP processing for a given socket. I think such throttling can in fact be desirable: there is a /reason/ why the process context was preempted: in that load scenario there was 10 times more processing requested from the CPU than it can possibly service. It's a serious overload situation and it's the scheduler's task to prioritize between workloads! Normally such "throttling" of the TCP stack for this particular socket does not happen. Note that there's no performance lost: we don't do TCP processing because there are /9 other tasks for this CPU to run/, and the scheduler has a tough choice.
Now I agree that there are more intelligent ways to throttle and less intelligent ways to throttle, but the notion of allowing a given workload to 'steal' CPU time from other workloads by pushing its processing into a softirq is, I think, unfair. (And this issue is partially addressed by my softirq threading patches in -rt :-) Ingo
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
From: Ingo Molnar <[EMAIL PROTECTED]> Date: Thu, 30 Nov 2006 07:47:58 +0100 > furthermore, the tweak allows the shifting of processing from a > prioritized process context into a highest-priority softirq context. > (it's not proven that there is any significant /net win/ of performance: > all that was proven is that if we shift TCP processing from process > context into softirq context then TCP throughput of that otherwise > penalized process context increases.) If we preempt with any packets in the backlog, we send no ACKs and the sender cannot send, thus the pipe empties. That's the problem, this has nothing to do with scheduler priorities or stuff like that IMHO. The argument goes that if the reschedule is delayed long enough, the ACKs will exceed the round trip time and trigger retransmits which will absolutely kill performance. The only reason we block input packet processing while we hold this lock is because we don't want the receive queue changing from underneath us while we're copying data to userspace. Furthermore, once you preempt in this particular way, no input packet processing occurs on that socket still, exacerbating the situation. Anyways, even if we somehow unlocked the socket and ran the backlog at preemption points, by hand, since we've thus deferred the whole work of processing whatever is in the backlog until the preemption point, we've lost our quantum already, so it's perhaps not legal to do the deferred processing at the preemption signalling point from a fairness perspective. It would be different if we really did the packet processing at the original moment (where we had to queue to the socket backlog because it was locked, in softirq) because then we'd return from the softirq and hit the preemption point earlier or whatever. Therefore, perhaps the best would be to see if there is a way we can still allow input packet processing even while running the majority of TCP's recvmsg().
It won't be easy :)
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* David Miller <[EMAIL PROTECTED]> wrote: > This is why my suggestion is to preempt_disable() as soon as we grab > the socket lock, [...] Independently of the issue at hand, in general the explicit use of preempt_disable() in non-infrastructure code is quite a heavy tool. Its effects are heavy and global: it disables /all/ preemption (even on PREEMPT_RT). Furthermore, when preempt_disable() is used for per-CPU data structures, then [unlike, for example, with a spin-lock] the connection between the 'data' and the 'lock' is not explicit - causing all kinds of grief when trying to convert such code to a different preemption model. (Such as PREEMPT_RT :-) So my plan is to remove all "open-coded" use of preempt_disable() [and raw use of local_irq_save/restore] from the kernel and replace it with some facility that connects data and lock. (Note that this will not result in any actual changes on the instruction level, because internally every such facility still maps to preempt_disable() on non-PREEMPT_RT kernels, so on non-PREEMPT_RT kernels such code will still be the same as before.) Ingo
Re: [patch 1/4] - Potential performance bottleneck for Linux TCP
* David Miller <[EMAIL PROTECTED]> wrote: > > yeah, i like this one. If the problem is "too long locked section", > > then the most natural solution is to "break up the lock", not to > > "boost the priority of the lock-holding task" (which is what the > > proposed patch does). > > Ingo, you've misread the problem :-) Yeah, the problem isn't a too-long locked section but "too much time spent holding a lock", and hence opening ourselves up to possible negative side-effects of the scheduler's fairness algorithm when it forces a preemption of that process context with the lock held (forcing all subsequent packets to be backlogged). But please read my last mail - I think I'm slowly starting to wake up ;-) I don't think there is any real problem: a tweak to the scheduler that in essence gives TCP-using tasks a preference changes the balance of workloads. Such an explicit tweak is possible already. Furthermore, the tweak allows the shifting of processing from a prioritized process context into a highest-priority softirq context. (It's not proven that there is any significant /net win/ of performance: all that was proven is that if we shift TCP processing from process context into softirq context then TCP throughput of that otherwise penalized process context increases.) Ingo
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
From: Ingo Molnar <[EMAIL PROTECTED]>
Date: Thu, 30 Nov 2006 07:17:58 +0100

> * David Miller <[EMAIL PROTECTED]> wrote:
>
> > We can make explicit preemption checks in the main loop of
> > tcp_recvmsg(), and release the socket and run the backlog if
> > need_resched() is TRUE.
> >
> > This is the simplest and most elegant solution to this problem.
>
> yeah, i like this one. If the problem is "too long locked section",
> then the most natural solution is to "break up the lock", not to
> "boost the priority of the lock-holding task" (which is what the
> proposed patch does).

Ingo you've mis-read the problem :-) The issue is that we actually
don't hold any locks that prevent preemption, so we can take
preemption points which the TCP code wasn't designed with in mind.

Normally, we control the sleep point very carefully in the TCP
sendmsg/recvmsg code, such that when we sleep we drop the socket lock
and process the backlog packets that accumulated while the socket was
locked. With preemption we can't control that properly.

The problem is that we really do need to run the backlog any time we
give up the cpu in the sendmsg/recvmsg path, or things get real
erratic. ACKs don't go out as early as we'd like them to, etc.

It isn't easy to do generically, perhaps, because we can only drop the
socket lock at certain points and we need to do that to run the
backlog.

This is why my suggestion is to preempt_disable() as soon as we grab
the socket lock, and explicitly test need_resched() at places where it
is absolutely safe, like this:

	if (need_resched()) {
		/* Run packet backlog... */
		release_sock(sk);
		schedule();
		lock_sock(sk);
	}

The socket lock is just a by-hand binary semaphore, so it doesn't
block preemption. We have to be able to sleep while holding it.
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
* Wenji Wu <[EMAIL PROTECTED]> wrote:

> > That yield() will need to be removed - yield()'s behaviour is truly
> > awful if the system is otherwise busy. What is it there for?
>
> Please read the uploaded paper, which has a detailed description.

do you have any URL for that?

	Ingo
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
* David Miller <[EMAIL PROTECTED]> wrote:

> We can make explicit preemption checks in the main loop of
> tcp_recvmsg(), and release the socket and run the backlog if
> need_resched() is TRUE.
>
> This is the simplest and most elegant solution to this problem.

yeah, i like this one. If the problem is "too long locked section",
then the most natural solution is to "break up the lock", not to
"boost the priority of the lock-holding task" (which is what the
proposed patch does).

[ Also note that "sprinkle the code with preempt_disable()" kinds of
  solutions, besides hurting interactivity, are also a pain to resolve
  in something like PREEMPT_RT. (unlike say a spinlock,
  preempt_disable() is quite opaque in what data structure it
  protects, etc., making it hard to convert it to a preemptible
  primitive) ]

> The one suggested in your patch and paper is way overkill, there is
> no reason to solve a TCP specific problem inside of the generic
> scheduler.

agreed. What we could also add is a /reverse/ mechanism to the
scheduler: a task could query whether it has just a small amount of
time left in its timeslice, and could in that case voluntarily drop
its current lock and yield - giving up its current timeslice and
waiting for a new, full timeslice - instead of being forcibly
preempted, due to lack of timeslice, with a possibly critical lock
still held.

But the solution suggested here, to "prolong the running of this task
just a little bit longer", only starts a perpetual arms race between
users of such a facility and other kernel subsystems. (besides not
being adequate anyway: there can always be /so/ long lock-hold times
that the scheduler has no option but to preempt the task)

	Ingo
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
On Wed, 2006-11-29 at 17:08 -0800, Andrew Morton wrote:

> +	if (p->backlog_flag == 0) {
> +		if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
> +			enqueue_task(p, rq->expired);
> +			if (p->static_prio < rq->best_expired_prio)
> +				rq->best_expired_prio = p->static_prio;
> +		} else
> +			enqueue_task(p, rq->active);
> +	} else {
> +		if (expired_starving(rq)) {
> +			enqueue_task(p, rq->expired);
> +			if (p->static_prio < rq->best_expired_prio)
> +				rq->best_expired_prio = p->static_prio;
> +		} else {
> +			if (!TASK_INTERACTIVE(p))
> +				p->extrarun_flag = 1;
> +			enqueue_task(p, rq->active);
> +		}
> +	}

(oh my, doing that to the scheduler upsets my tummy, but that aside...)

I don't see how that can really solve anything. "Interactive" tasks
starting to use cpu heftily can still preempt and keep the
special-cased cpu hog off the cpu for ages.

It also only takes one task in the expired array to trigger the forced
array switch with a fully loaded cpu, and once any task hits the
expired array, a stream of wakeups can prevent the switch from
completing for as long as you can keep wakeups happening.

	-Mike
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
From: Wenji Wu <[EMAIL PROTECTED]>
Date: Wed, 29 Nov 2006 19:56:58 -0600

> > We could also pepper tcp_recvmsg() with some very carefully placed
> > preemption disable/enable calls to deal with this even with
> > CONFIG_PREEMPT enabled.
>
> I also thought about this approach. But since the "problem" happens
> in the 2.6 Desktop and Low-latency Desktop (not Server), system
> responsiveness is a key feature; simply placing preemption
> disable/enable calls might not work. If you want to place preemption
> disable/enable calls within tcp_recvmsg, you have to put them at the
> very beginning and end of the call. Disabling preemption would
> degrade system responsiveness.

We can make explicit preemption checks in the main loop of
tcp_recvmsg(), and release the socket and run the backlog if
need_resched() is TRUE.

This is the simplest and most elegant solution to this problem.

The one suggested in your patch and paper is way overkill; there is no
reason to solve a TCP-specific problem inside of the generic
scheduler.
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
> That yield() will need to be removed - yield()'s behaviour is truly
> awful if the system is otherwise busy. What is it there for?

Please read the uploaded paper, which has a detailed description.

thanks,

wenji

- Original Message -
From: Andrew Morton <[EMAIL PROTECTED]>
Date: Wednesday, November 29, 2006 7:08 pm
Subject: Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

[ Andrew's reply, with his consolidated tcp-speedup patch, quoted in
  full - snipped here; it appears in its original form elsewhere in
  this thread. ]
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
Yes, when CONFIG_PREEMPT is disabled, the "problem" won't happen. That
is why I put "for 2.6 Desktop, Low-latency Desktop" in the uploaded
paper. This "problem" happens in the 2.6 Desktop and Low-latency
Desktop.

> We could also pepper tcp_recvmsg() with some very carefully placed
> preemption disable/enable calls to deal with this even with
> CONFIG_PREEMPT enabled.

I also thought about this approach. But since the "problem" happens in
the 2.6 Desktop and Low-latency Desktop (not Server), system
responsiveness is a key feature; simply placing preemption
disable/enable calls might not work. If you want to place preemption
disable/enable calls within tcp_recvmsg, you have to put them at the
very beginning and end of the call. Disabling preemption would degrade
system responsiveness.

wenji

- Original Message -
From: David Miller <[EMAIL PROTECTED]>
Date: Wednesday, November 29, 2006 7:13 pm
Subject: Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP

[ David's message quoted in full - snipped here; it appears in its
  original form elsewhere in this thread. ]
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
From: Andrew Morton <[EMAIL PROTECTED]>
Date: Wed, 29 Nov 2006 17:08:35 -0800

> On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
> David Miller <[EMAIL PROTECTED]> wrote:
>
> > Please, it is very difficult to review your work the way you have
> > submitted this patch as a set of 4 patches. [...]
>
> Here you go - joined up, cleaned up, ported to mainline and
> test-compiled.
>
> That yield() will need to be removed - yield()'s behaviour is truly
> awful if the system is otherwise busy. What is it there for?

What about simply turning off CONFIG_PREEMPT to fix this "problem"?

We always properly run the backlog (by doing a release_sock()) before
going to sleep otherwise, except for the specific case of taking a
page fault during the copy to userspace. It is only CONFIG_PREEMPT
that can cause this situation to occur in other circumstances, as far
as I can see.

We could also pepper tcp_recvmsg() with some very carefully placed
preemption disable/enable calls to deal with this even with
CONFIG_PREEMPT enabled.
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
On Wed, 29 Nov 2006 16:53:11 -0800 (PST)
David Miller <[EMAIL PROTECTED]> wrote:

> Please, it is very difficult to review your work the way you have
> submitted this patch as a set of 4 patches. These patches have not
> been split up "logically", but rather they have been split up "per
> file" with the same exact changelog message in each patch posting.
> This is very clumsy, impossible to review, and wastes a lot of
> mailing list bandwidth.
>
> We have an excellent file, called Documentation/SubmittingPatches, in
> the kernel source tree, which explains exactly how to do this
> correctly.
>
> By splitting your patch into 4 patches, one for each file touched,
> it is impossible to review your patch as a logical whole.
>
> Please also provide your patch inline so people can just hit reply
> in their mail reader client to quote your patch and comment on it.
> This is impossible with the attachments you've used.

Here you go - joined up, cleaned up, ported to mainline and
test-compiled.

That yield() will need to be removed - yield()'s behaviour is truly
awful if the system is otherwise busy. What is it there for?


From: Wenji Wu <[EMAIL PROTECTED]>

For Linux TCP, when a network application makes a system call to move
data from the socket's receive buffer to user space by calling
tcp_recvmsg(), the socket is locked. During this period, all incoming
packets for the TCP socket go to the backlog queue without being
TCP-processed.

Since Linux 2.6 can be preempted mid-task, if the network
application's timeslice expires and it is moved to the expired array
with the socket locked, none of the packets within the backlog queue
are TCP-processed until the network application resumes execution. If
the system is heavily loaded, TCP can easily RTO on the sender side.

 include/linux/sched.h |    2 ++
 kernel/fork.c         |    3 +++
 kernel/sched.c        |   24 ++++++++++++++++++------
 net/ipv4/tcp.c        |    9 +++++++++
 4 files changed, 32 insertions(+), 6 deletions(-)

diff -puN net/ipv4/tcp.c~tcp-speedup net/ipv4/tcp.c
--- a/net/ipv4/tcp.c~tcp-speedup
+++ a/net/ipv4/tcp.c
@@ -1109,6 +1109,8 @@ int tcp_recvmsg(struct kiocb *iocb, stru
 	struct task_struct *user_recv = NULL;
 	int copied_early = 0;
 
+	current->backlog_flag = 1;
+
 	lock_sock(sk);
 
 	TCP_CHECK_TIMER(sk);
@@ -1468,6 +1470,13 @@ skip_copy:
 
 	TCP_CHECK_TIMER(sk);
 	release_sock(sk);
+
+	current->backlog_flag = 0;
+	if (current->extrarun_flag == 1) {
+		current->extrarun_flag = 0;
+		yield();
+	}
+
 	return copied;
 
 out:
diff -puN include/linux/sched.h~tcp-speedup include/linux/sched.h
--- a/include/linux/sched.h~tcp-speedup
+++ a/include/linux/sched.h
@@ -1023,6 +1023,8 @@ struct task_struct {
 #ifdef	CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info *delays;
 #endif
+	int backlog_flag;	/* packets wait in tcp backlog queue flag */
+	int extrarun_flag;	/* extra run flag for TCP performance */
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -puN kernel/sched.c~tcp-speedup kernel/sched.c
--- a/kernel/sched.c~tcp-speedup
+++ a/kernel/sched.c
@@ -3099,12 +3099,24 @@ void scheduler_tick(void)
 
 		if (!rq->expired_timestamp)
 			rq->expired_timestamp = jiffies;
-		if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
-			enqueue_task(p, rq->expired);
-			if (p->static_prio < rq->best_expired_prio)
-				rq->best_expired_prio = p->static_prio;
-		} else
-			enqueue_task(p, rq->active);
+		if (p->backlog_flag == 0) {
+			if (!TASK_INTERACTIVE(p) || expired_starving(rq)) {
+				enqueue_task(p, rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else
+				enqueue_task(p, rq->active);
+		} else {
+			if (expired_starving(rq)) {
+				enqueue_task(p, rq->expired);
+				if (p->static_prio < rq->best_expired_prio)
+					rq->best_expired_prio = p->static_prio;
+			} else {
+				if (!TASK_INTERACTIVE(p))
+					p->extrarun_flag = 1;
+				enqueue_task(p, rq->active);
+			}
+		}
 	} else {
 		/*
 		 * Prevent a too long timeslice allowing a task to monopolize
diff -puN kernel/fork.c~tcp-speedup kernel/fork.c
--- a/kernel/fork.c~tcp-speedup
+++ a/kernel/fork.c
@@ -1032,6 +1032,9 @@ static struct task_struct *copy_process(
 	clear_tsk_thread_flag(p,
Re: [patch 1/4] - Potential performance bottleneck for Linxu TCP
Please, it is very difficult to review your work the way you have
submitted this patch as a set of 4 patches. These patches have not
been split up "logically", but rather they have been split up "per
file" with the same exact changelog message in each patch posting.
This is very clumsy, impossible to review, and wastes a lot of mailing
list bandwidth.

We have an excellent file, called Documentation/SubmittingPatches, in
the kernel source tree, which explains exactly how to do this
correctly.

By splitting your patch into 4 patches, one for each file touched, it
is impossible to review your patch as a logical whole.

Please also provide your patch inline so people can just hit reply in
their mail reader client to quote your patch and comment on it. This
is impossible with the attachments you've used.

Thanks.
[patch 1/4] - Potential performance bottleneck for Linxu TCP
From: Wenji Wu <[EMAIL PROTECTED]>

Greetings,

For Linux TCP, when a network application makes a system call to move
data from the socket's receive buffer to user space by calling
tcp_recvmsg(), the socket is locked. During that period, all incoming
packets for the TCP socket go to the backlog queue without being
TCP-processed.

Since Linux 2.6 can be preempted mid-task, if the network
application's timeslice expires and it is moved to the expired array
with the socket locked, none of the packets within the backlog queue
are TCP-processed until the network application resumes execution. If
the system is heavily loaded, TCP can easily RTO on the sender side.

Attached is patch 1/4.

best regards,

wenji

Wenji Wu
Network Researcher
Fermilab, MS-368
P.O. Box 500
Batavia, IL, 60510
(Email): [EMAIL PROTECTED]
(O): 001-630-840-4541

tcp.c.patch
Description: Binary data