Re: [PATCH] NET: Multiqueue network device support.
On Tue, 2007-26-06 at 13:57 -0700, David Miller wrote:
> From: jamal [EMAIL PROTECTED]
> Date: Tue, 26 Jun 2007 09:27:28 -0400
>
> > Back to the question: Do you recall how this number was arrived at?
> > 128 packets will be sent out at GigE in about 80 microsecs, so from
> > a feel-the-wind-direction perspective it seems reasonable.
>
> I picked it out of a hat.

It is not a bad value for GigE; doubt it will be a good one for 10/100
or even 10GE. But you could say that about the ring sizes too.

cheers,
jamal

-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] NET: Multiqueue network device support.
From: jamal [EMAIL PROTECTED]
Date: Wed, 27 Jun 2007 18:32:45 -0400

> On Tue, 2007-26-06 at 13:57 -0700, David Miller wrote:
> > From: jamal [EMAIL PROTECTED]
> > Date: Tue, 26 Jun 2007 09:27:28 -0400
> >
> > > Back to the question: Do you recall how this number was arrived
> > > at? 128 packets will be sent out at GigE in about 80 microsecs, so
> > > from a feel-the-wind-direction perspective it seems reasonable.
> >
> > I picked it out of a hat.
>
> It is not a bad value for GigE; doubt it will be a good one for 10/100
> or even 10GE. But you could say that about the ring sizes too.

The thing that's really important is that the value is not so large
such that the TX ring can become empty.
Re: [PATCH] NET: Multiqueue network device support.
On Wed, 2007-27-06 at 15:54 -0700, David Miller wrote:
> The thing that's really important is that the value is not so large
> such that the TX ring can become empty.

In the case of batching, varying the values makes a difference. The
logic is that if you can tune it so that the driver stays closed
sufficiently long, the more packets you accumulate at the qdisc and the
more you can batch to the driver (when it opens up). Deciding what
"sufficiently long" is is an art - and i am sure speed dependent. With
e1000 at GigE, 128 seems to be a good value; going above or below that
gave lesser performance.

cheers,
jamal
Re: [PATCH] NET: Multiqueue network device support.
From: jamal [EMAIL PROTECTED]
Date: Wed, 27 Jun 2007 20:15:47 -0400

> On Wed, 2007-27-06 at 15:54 -0700, David Miller wrote:
> > The thing that's really important is that the value is not so large
> > such that the TX ring can become empty.
>
> In the case of batching, varying the values makes a difference. The
> logic is that if you can tune it so that the driver stays closed
> sufficiently long, the more packets you accumulate at the qdisc and
> the more you can batch to the driver (when it opens up). Deciding what
> "sufficiently long" is is an art - and i am sure speed dependent. With
> e1000 at GigE, 128 seems to be a good value; going above or below that
> gave lesser performance.

Right.

And another thing you want to moderate is lock hold times, perhaps even
at the slight expense of performance.
Re: [PATCH] NET: Multiqueue network device support.
On Fri, 2007-22-06 at 09:26 +0800, Zhu Yi wrote:
> On Thu, 2007-06-21 at 11:39 -0400, jamal wrote:
>
> It sounds stupid that I'm still trying to convince you why we need
> multiqueue support in Qdisc when everybody else is already working on
> the code,

If you go back historically (maybe 2 years ago on netdev?) - i was a
big fan of the scheme used in those patches ;-> In a Xcracy like Linux,
you have to agree to disagree at some point and move on. I dont need
any core changes to deploy what i am suggesting.

> fixing bugs and preparing for merge. The only reason I keep the
> conversation going is that I think you _might_ have some really good
> points that are buried under everybody else's positive support for
> multiqueue. But as the conversation goes on, it turns out that is not
> the case.

We have come a long way and maybe you just didnt understand me
initially.

> We don't have THL and THH in our driver. They are what you suggested.
> The queue wakeup number is 1/4 of the ring size.

So how did you pick 1/4? Experimentation? If you look at tg3 its much
higher for example.

> > The timer fires only if a ring shuts down the interface. Where is
> > the busy loop? If packets go out, there is no timer.
>
> The busy loop happens in the period after the ring is shut down and
> before it is opened again. During this period, the Qdisc will keep
> dequeuing and requeuing PL packets in the Tx SoftIRQ, which is where
> the busy loop happens.

Ok, sure - this boils down to what Patrick pointed out as well. I would
see this as being similar to any other corner case you meet. You may be
able to convince me otherwise if you can show me some numbers, for
example: how often this would happen, and how long a LP would be
disallowed from sending on the wire when the ring is full. I have read
a few papers (i posted one or two on the list) and none seem to have
come across this as an issue. You may have better insight. So to me
this is a corner case which is resolvable.

I wouldnt consider this to be any different than, say, dealing with
failed allocs. You have to deal with them. So lets make this an
engineering challenge and try to see how many ways we can solve it ...
Here's one: use an exponential backoff timer. i.e if you decide that
you will open the path every 1 sec or X packets, then the next time it
turns out to be a false positive, increase the timer, up to an upper
bound (a lot of protocols do this). That should cut down substantially
how many times you open up and find no HP packets.

> > I dont think you understood: Whatever value you choose for THL and
> > THH today, keep those. OTOH, the wake threshold is what i was
> > refering to.
>
> I don't even care about the threshold. Even if you set it to 1, there
> is still a busy loop during the period before this first packet is
> sent out in the air. But you cannot ignore this small time, because it
> could be longer when the wireless medium is congested with high prio
> packets.

Give me some numbers and you may be able to convince me that this may
not be so good for wireless. I have had no problems with the prescribed
scheme for multiqueue ethernet chips.

cheers,
jamal
Re: [PATCH] NET: Multiqueue network device support.
From: jamal [EMAIL PROTECTED]
Date: Mon, 25 Jun 2007 12:47:31 -0400

> On Fri, 2007-22-06 at 09:26 +0800, Zhu Yi wrote:
> > We don't have THL and THH in our driver. They are what you
> > suggested. The queue wakeup number is 1/4 of the ring size.
>
> So how did you pick 1/4? Experimentation? If you look at tg3 its much
> higher for example.

tg3 uses 1/4:

#define TG3_TX_WAKEUP_THRESH(tp)	((tp)->tx_pending / 4)

tp->tx_pending is the current configured ring size, configurable via
ethtool.
Re: [PATCH] NET: Multiqueue network device support.
I gave you two opportunities to bail out of this discussion; i am gonna
take it that your rejection of that offer implies you, my friend, want
to get to the bottom of this, i.e you are on a mission to find the
truth. So lets continue this.

On Wed, 2007-20-06 at 13:51 +0800, Zhu Yi wrote:
> No, because this is over-engineered. Furthermore, don't you think the
> algorithm is complicated and unnecessary (i.e. one timer per h/w
> queue)?

The (one-shot) timer is only necessary when a ring shuts down the
driver. This is only for the case of wireless media. Standard wired
ethernet doesnt need it.
Note: You are not going to convince me by throwing cliches like "this
is over-engineering" around. Because it leads to a response like "Not
at all. I think sending flow control messages back to the stack is
over-engineering." And where do we go then?

> Do you think the driver maintainer will accept such kind of
> workaround patch?

Give me access to your manual for the chip on my laptop wireless which
is 3945ABG and i can produce a very simple patch for you. Actually if
you answer some questions for me, it may be good enough to produce such
a patch.

> You did too much to keep the Qdisc interface untouched!

What metric do you want to define for "too much" - lines of code?
Complexity? I consider architecture cleanliness to be more important.

> Besides, the lower THL you choose, the more CPU time is wasted in
> busy loop for the only PL case;

Your choice of THL and THH has nothing to do with what i am proposing.
I am not proposing you even touch that. What numbers do you have today?
What i am saying is you use _some_ value for opening up the driver;
some enlightened drivers such as the tg3 (and the e1000 - for which i
unashamedly take credit) do have such parametrization. This has already
been proven to be valuable. The timer fires only if a ring shuts down
the interface. Where is the busy loop? If packets go out, there is no
timer.

> the higher THL you choose, the slower the PH packets will be sent out
> than expected (the driver doesn't fully utilize the device function
> -- multiple rings,

I dont think you understood: Whatever value you choose for THL and THH
today, keep those. OTOH, the wake threshold is what i was refering to.

> which conflicts with a device driver's intention).

I dont see how, given i am talking about wake thresholds.

> You can never make a good trade off in this model.

Refer to above.

> I think I have fully understood you,

Thanks for coming such a long way - you stated it couldnt be done
before unless you sent feedback to the stack.

> but your point is invalid. The Qdisc must be changed to have the
> hardware queue information to support multiple hardware queue
> devices.

Handwaving as above doesnt add value to a discussion. If you want
meaningful discussions, stop these cliches.

cheers,
jamal
Re: [PATCH] NET: Multiqueue network device support.
On Thu, 2007-06-21 at 11:39 -0400, jamal wrote:
> I gave you two opportunities to bail out of this discussion; i am
> gonna take it that your rejection of that offer implies you, my
> friend, want to get to the bottom of this, i.e you are on a mission to
> find the truth. So lets continue this.

It sounds stupid that I'm still trying to convince you why we need
multiqueue support in Qdisc when everybody else is already working on
the code, fixing bugs and preparing for merge. The only reason I keep
the conversation going is that I think you _might_ have some really
good points that are buried under everybody else's positive support for
multiqueue. But as the conversation goes on, it turns out that is not
the case. Let me snip the nonsense part below and only focus on the
technical.

> > Besides, the lower THL you choose, the more CPU time is wasted in
> > busy loop for the only PL case;
>
> Your choice of THL and THH has nothing to do with what i am proposing.
> I am not proposing you even touch that. What numbers do you have
> today?

We don't have THL and THH in our driver. They are what you suggested.
The queue wakeup number is 1/4 of the ring size.

> What i am saying is you use _some_ value for opening up the driver;
> some enlightened drivers such as the tg3 (and the e1000 - for which i
> unashamedly take credit) do have such parametrization. This has
> already been proven to be valuable. The timer fires only if a ring
> shuts down the interface. Where is the busy loop? If packets go out,
> there is no timer.

The busy loop happens in the period after the ring is shut down and
before it is opened again. During this period, the Qdisc will keep
dequeuing and requeuing PL packets in the Tx SoftIRQ, which is where
the busy loop happens.

> > the higher THL you choose, the slower the PH packets will be sent
> > out than expected (the driver doesn't fully utilize the device
> > function -- multiple rings,
>
> I dont think you understood: Whatever value you choose for THL and THH
> today, keep those. OTOH, the wake threshold is what i was refering to.

I don't even care about the threshold. Even if you set it to 1, there
is still a busy loop during the period before this first packet is sent
out in the air. But you cannot ignore this small time, because it could
be longer when the wireless medium is congested with high prio packets.

Thanks,
-yi
Re: [PATCH] NET: Multiqueue network device support.
On Tue, 2007-19-06 at 10:12 +0800, Zhu Yi wrote:
> Mine was much simpler. We don't need to consider the wireless dynamic
> priority change case at this time. Just tell me what you suppose the
> driver to do (stop|start queue) when the hardware PHL is full but PHH
> is empty?

I already responded to this a few emails back. My suggestion then was:
Pick between a timer and a number of packets X transmitted, whichever
comes first. [In e1000 for example, the opening strategy is every time
32 packets get transmitted, you open up]. In the case of wireless, pick
two numbers XHH and XHL with XHL > XHH. The timers would be similar in
nature (THH < THL). All these variables are only valid if you shut down
the ring. So in the case HL shuts down the ring, you fire THL. If
either XHL packets are transmitted or THL expires, you netif_wake. Did
that make sense?

BTW, this thread is going back and forth on the same recycled
arguments. As an example, i have responded to this specific question.
Can we drop the discussion?

cheers,
jamal
Re: [PATCH] NET: Multiqueue network device support.
Hello Yi,

On Mon, 2007-18-06 at 09:18 +0800, Zhu Yi wrote:
> Would you respond the question I asked early,

I thought i did respond to all questions you asked but some may have
been lost in the noise.

> in your model how to define the queue wakeup strategy in the driver
> to deal with the PHL full situation? Consider about 1) both high prio
> and low prio packets could come (you cannot predict it beforehand)

I am assuming by "come" you mean from the stack (example an ssh packet)
as opposed to from the outside.

> 2) the time for PHL to send out a packet to the wireless medium is
> relative long (given the medium is congested). If you can resolve it
> in an elegant way, I'm all ears.

Congestion periods are the only time any of this stuff makes sense.
Ok, so let me repeat what i said earlier: Once a packet is in the DMA
ring, we dont take it out. If a high prio packet is blocking a low prio
one, i consider that to be fine. If otoh you receive a management
detail from the AP indicating that LP has its priority bumped or HP has
its prio lowered, then by all means use that info to open up the path
again. Again, that is an example; you could use that or other schemes
(refer to my expression on cats earlier).

Anyways, you will have to forgive me - this thread is getting too long
and i dont have much time to follow up on this topic for about a week;
and given we are not meeting anywhere in the middle i am having a hard
time continuing to repeat the same arguments over and over again. It is
ok for rational people to agree to disagree for the sake of progress.

cheers,
jamal
Re: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-06-18 at 11:16 -0400, jamal wrote:
> > in your model how to define the queue wakeup strategy in the driver
> > to deal with the PHL full situation? Consider about 1) both high
> > prio and low prio packets could come (you cannot predict it
> > beforehand)
>
> I am assuming by "come" you mean from the stack (example an ssh
> packet) as opposed to from the outside.

Right.

> > 2) the time for PHL to send out a packet to the wireless medium is
> > relative long (given the medium is congested). If you can resolve
> > it in an elegant way, I'm all ears.
>
> Congestion periods are the only time any of this stuff makes sense.

We are talking about the period from the time PHL is full to the time
it can accept more packets again. How to design the queue wakeup policy
in this period is the question.

> Ok, so let me repeat what i said earlier: Once a packet is in the DMA
> ring, we dont take it out. If a high prio packet is blocking a low
> prio one, i consider that to be fine. If otoh you receive a management
> detail from the AP indicating that LP has its priority bumped or HP
> has its prio lowered, then by all means use that info to open up the
> path again. Again, that is an example; you could use that or other
> schemes (refer to my expression on cats earlier).

No, this is not my question. Mine was much simpler. We don't need to
consider the wireless dynamic priority change case at this time. Just
tell me what you suppose the driver to do (stop|start queue) when the
hardware PHL is full but PHH is empty?

Thanks,
-yi
Re: [PATCH] NET: Multiqueue network device support.
On Fri, 2007-06-15 at 06:49 -0400, jamal wrote:
> Hello Yi,
>
> On Fri, 2007-15-06 at 09:27 +0800, Zhu Yi wrote:
> > 1. driver becomes complicated (as it is too elaborate in the queue
> > wakeup strategies design)
>
> I am not sure i see the complexity in the wireless driver's wakeup
> strategy. I just gave some suggestions to use management frames - they
> dont have to be literally that way.
>
> > 2. duplicated code among drivers (otherwise you put all the queue
> > management logics in a new layer?)
>
> There will be some shared code on drivers of same media on the
> netif_stop/wake strategy perhaps, but not related to queue management.
>
> > 3. it's not 100% accurate. there has to be some overhead, more or
> > less depending on the queue wakeup strategy the driver selected.
>
> Why is it not accurate for wireless? I can see the corner case Patrick
> mentioned in wired ethernet, but then wired ethernet doesnt have other
> events such as management frames (actually DCE does) to help.

Would you respond to the question I asked earlier: in your model, how
to define the queue wakeup strategy in the driver to deal with the PHL
full situation? Consider that 1) both high prio and low prio packets
could come (you cannot predict it beforehand) 2) the time for PHL to
send out a packet to the wireless medium is relatively long (given the
medium is congested). If you can resolve it in an elegant way, I'm all
ears.

Thanks,
-yi
Re: [PATCH] NET: Multiqueue network device support.
Hello Yi,

On Fri, 2007-15-06 at 09:27 +0800, Zhu Yi wrote:
> 1. driver becomes complicated (as it is too elaborate in the queue
> wakeup strategies design)

I am not sure i see the complexity in the wireless driver's wakeup
strategy. I just gave some suggestions to use management frames - they
dont have to be literally that way.

> 2. duplicated code among drivers (otherwise you put all the queue
> management logics in a new layer?)

There will be some shared code on drivers of same media on the
netif_stop/wake strategy perhaps, but not related to queue management.

> 3. it's not 100% accurate. there has to be some overhead, more or
> less depending on the queue wakeup strategy the driver selected.

Why is it not accurate for wireless? I can see the corner case Patrick
mentioned in wired ethernet, but then wired ethernet doesnt have other
events such as management frames (actually DCE does) to help.

cheers,
jamal
Re: [PATCH] NET: Multiqueue network device support.
Hi Yi,

On Thu, 2007-14-06 at 10:44 +0800, Zhu Yi wrote:
> On Wed, 2007-06-13 at 08:32 -0400, jamal wrote:
> > The key argument i make (from day one actually) is to leave the
> > majority of the work to the driver.
>
> But it seems not feasible the Qdisc needs to know nothing about the
> hardware rings.

This discussion is addressing whether it is feasible to do it without
the qdisc knowing anything about the hardware ring.

> > My view of wireless WMM etc is it is a different media behavior
> > (compared to wired ethernet) which means a different view of
> > strategy for when it opens the valve to allow in more packets.
> > 802.11 media has embedded signalling which is usable. Guy Cohen
> > gave a good use case which i responded to. Do you wanna look at
> > that and respond?
>
> The key to supporting multi-ring hardware in software is to put
> packets into hardware as much/early as possible. Guy gave a good VO
> vs. BK example. To achieve this in your model, you have to keep the TX
> ring running (in the case of PHL full) and requeue. But when there are
> only BK packets coming, you do want to stop the ring, right? AFAICS,
> the driver is not the best place to make the decision (it only knows
> the current and previous packets, but not the _next_), the Qdisc is
> the best place.

I dont have much time to followup for sometime to come. I have left my
answer above. To clarify, in case i wasnt clear, I am saying:
a) It is better to have the driver change via some strategy of when to
open the tx path than trying to be generic. This shifts the burden to
the driver.
b) given the behavior of wireless media (which is very different from
wired ethernet media), you need a different strategy. In response to
Guy's question, I gave the example of being able to use management
frames to open up the tx path for VO (even when you dont know VO
packets are sitting on the qdisc); alternatively you could use a timer
etc. Theres many ways to skin the cat (with apologies to cat
lovers/owners). i.e you need to look at the media and be creative.
Peter's DCE for example could also be handled by having a specific
strategy.

I will try to continue participating in the discussion (if CCed) but
much less for about a week. In any case I think i have had the
discussion i was hoping for and trust Patrick understands both sides.
This thread has run for too long folks, eh?

cheers,
jamal
Re: [PATCH] NET: Multiqueue network device support.
On Thu, 2007-06-14 at 07:48 -0400, jamal wrote:
> I dont have much time to followup for sometime to come. I have left my
> answer above. To clarify, in case i wasnt clear, I am saying:
> a) It is better to have the driver change via some strategy of when to
> open the tx path than trying to be generic. This shifts the burden to
> the driver.
> b) given the behavior of wireless media (which is very different from
> wired ethernet media), you need a different strategy. In response to
> Guy's question, I gave the example of being able to use management
> frames to open up the tx path for VO (even when you dont know VO
> packets are sitting on the qdisc); alternatively you could use a timer
> etc. Theres many ways to skin the cat (with apologies to cat
> lovers/owners). i.e you need to look at the media and be creative.
> Peter's DCE for example could also be handled by having a specific
> strategy.

OK. You tried so much to guess the traffic flow pattern in the low
level driver, which could be implemented straightforwardly in the
Qdisc. The pro is the Qdisc API is untouched. But the cons are:
1. driver becomes complicated (as it is too elaborate in the queue
wakeup strategies design)
2. duplicated code among drivers (otherwise you put all the queue
management logics in a new layer?)
3. it's not 100% accurate. there has to be some overhead, more or less
depending on the queue wakeup strategy the driver selected.

Time for voting?

Thanks,
-yi
Re: [PATCH] NET: Multiqueue network device support.
Zhu Yi wrote:
> On Tue, 2007-06-12 at 23:17 +0200, Patrick McHardy wrote:
> > I've hacked up a small multiqueue simulator device and to my big
> > surprise my testing showed that Jamal's suggestion of using a single
> > queue state seems to work better than I expected. But I've been
> > doing mostly testing of the device itself up to now with very simple
> > traffic patterns (mostly just flood all queues), so I'll try to get
> > some real results tomorrow.
>
> The key argument for Jamal's solution is the NIC will send out 32
> packets in the full PHL in a reasonably short time (a few microsecs
> per Jamal's calculation). But for wireless, the PHL hardware has a low
> probability of seizing the wireless medium when the air is full of
> high priority frames. That is, the chance for transmission in PHL and
> PHH is not equal. Queuing packets in software will starve high
> priority packets more than putting them into PHH as early as possible.

Well, the key result of our discussion was that it makes no difference
wrt. queuing behaviour if the queue wakeup strategy is suitably chosen
for the specific queueing discipline, but it might add some overhead.

> Patrick, I don't think your testing considered the above scenario,
> right?

No, as stated my testing so far has been very limited. I'll try to get
some better results later.
Re: [PATCH] NET: Multiqueue network device support.
On Wed, 2007-13-06 at 13:56 +0800, Zhu Yi wrote:
> The key argument for Jamal's solution is the NIC will send out 32
> packets in the full PHL in a reasonably short time (a few microsecs
> per Jamal's calculation). But for wireless, the PHL hardware has a low
> probability of seizing the wireless medium when the air is full of
> high priority frames. That is, the chance for transmission in PHL and
> PHH is not equal. Queuing packets in software will starve high
> priority packets more than putting them into PHH as early as possible.

The key argument i make (from day one actually) is to leave the
majority of the work to the driver. My view of wireless WMM etc is it
is a different media behavior (compared to wired ethernet) which means
a different view of strategy for when it opens the valve to allow in
more packets. 802.11 media has embedded signalling which is usable. Guy
Cohen gave a good use case which i responded to. Do you wanna look at
that and respond?

cheers,
jamal
Re: [PATCH] NET: Multiqueue network device support.
jamal writes:

> The key argument i make (from day one actually) is to leave the
> majority of the work to the driver. My view of wireless WMM etc is it
> is a different media behavior (compared to wired ethernet) which means
> a different view of strategy for when it opens the valve to allow in
> more packets. 802.11 media has embedded signalling which is usable.
> Guy Cohen gave a good use case which i responded to. Do you wanna look
> at that and respond?

Hello,

Haven't got all details. IMO we need to support some bonding-like
scenario too, where one CPU is feeding just one TX-ring (and TX-buffers
cleared by same CPU). We probably don't want to stall all queuing when
one ring is full.

The scenario I see is to support parallelism in forwarding/firewalling
etc. For example when RX load via HW gets split into different CPU's
and for cache reasons we want to process in same CPU even with TX. If
RX HW split keeps packets from the same flow on same CPU we shouldn't
get reordering within flows.

Cheers
--ro
Re: [PATCH] NET: Multiqueue network device support.
Wow - Robert in the house; i cant resist, i have to say something
before i run out ;->

On Wed, 2007-13-06 at 15:12 +0200, Robert Olsson wrote:
> Haven't got all details. IMO we need to support some bonding-like
> scenario too, where one CPU is feeding just one TX-ring (and
> TX-buffers cleared by same CPU). We probably don't want to stall all
> queuing when one ring is full.

For newer NICs - the kind that Leonid Grossman was talking about - this
makes a lot of sense in a non-virtual environment. I think the one
described by Leonid has not just 8 tx/rx rings but also a separate
register set, MSI binding etc iirc. The only shared resources as far as
i understood Leonid are the bus and the ethernet wire. So in such a
case (assuming 8 rings), one model is creating 4 netdev devices, each
based on a single tx/rx ring and register set, and then having a mother
netdev (what you call the bond) that feeds these children netdevs based
on some qos parametrization - very sensible. Each of the children
netdevices (by virtue of how we do things today) could be tied to a CPU
for effectiveness (because our per-CPU work is based on netdevs). In
virtual environments, the supervisor will be in charge of the bond-like
parent device. Another model is creating a child netdev based on more
than one ring, example 2 tx and 2 rcv rings for two netdevices etc.

> The scenario I see is to support parallelism in forwarding/firewalling
> etc. For example when RX load via HW gets split into different CPU's
> and for cache reasons we want to process in same CPU even with TX. If
> RX HW split keeps packets from the same flow on same CPU we shouldn't
> get reordering within flows.

For the Leonid-NIC (for lack of a better name) it may be harder to do
parallelization on rcv if you use what i said above. But you could use
a different model on receive - such as create a single netdev with 8
rcv rings and MSI tied on rcv to 8 different CPUs.

Anyways, it is an important discussion to have. ttl.

cheers,
jamal
RE: [PATCH] NET: Multiqueue network device support.
-Original Message-
From: J Hadi Salim [mailto:[EMAIL PROTECTED] On Behalf Of jamal

> For the Leonid-NIC (for lack of a better name) it may be harder to do
> parallelization on rcv if you use what i said above. But you could use
> a different model on receive - such as create a single netdev with 8
> rcv rings and MSI tied on rcv to 8 different CPUs.
>
> Anyways, it is an important discussion to have. ttl.

Call it IOV-style NIC :-) Or something like that; it's a bit too early
to talk about full IOV compliance... From what I see in Intel's new
pci-e 10GbE driver, they have quite a few of the same attributes, and
the category is likely to grow further. In the IOV world, hw channel
requirements are pretty brutal; in a nutshell, each channel could be
owned by a separate OS instance (and the OS instances do not even have
to be the same type). For a non-virtualized OS some of these
capabilities are not a must-have, but they are/will be there and Linux
may as well take advantage of it.

Leonid
Re: [PATCH] NET: Multiqueue network device support.
jamal writes: I think the one described by Leonid has not just 8 tx/rx rings but also a separate register set, MSI binding etc iirc. The only shared resources as far as i understood Leonid are the bus and the ethernet wire. AFAIK most new NICs will look like this... I still lack a lot of crucial hardware understanding. What will happen if we for some reason are not capable of serving one TX ring? The NIC is still working, so do we continue filling/sending/clearing on the other rings? So in such a case (assuming 8 rings), One model is creating 4 netdev devices each based on single tx/rx ring and register set and then having a mother netdev (what you call the bond) that feeds these children netdev based on some qos parametrization is very sensible. Each of the children netdevices (by virtue of how we do things today) could be tied to a CPU for effectiveness (because our per CPU work is based on netdevs). Some kind of supervising function for the TX is probably needed, as we still want to see the device as one entity. But if upcoming HW supports parallelism straight to the TX-ring we would of course like to use it to get minimal cache effects. It's up to how this master netdev or queue supervisor can be designed. For the Leonid-NIC (for lack of better name) it may be harder to do parallelization on rcv if you use what i said above. But you could use a different model on receive - such as create a single netdev and with 8 rcv rings and MSI tied on rcv to 8 different CPUs Yes, that should be the way to do it... with ethtool or something to hint to the NIC how the incoming data is classified wrt the available CPUs. Maybe something more dynamic for the brave ones. Cheers -ro
Re: [PATCH] NET: Multiqueue network device support.
I'm starting to wonder how a multi-queue NIC differs from a bunch of bonded single-queue NICs, and if there is leverage opportunity there. rick jones
Re: [PATCH] NET: Multiqueue network device support.
From: jamal [EMAIL PROTECTED] Date: Wed, 13 Jun 2007 09:33:22 -0400 So in such a case (assuming 8 rings), One model is creating 4 netdev devices each based on single tx/rx ring and register set and then having a mother netdev (what you call the bond) that feeds these children netdev based on some qos parametrization is very sensible. Why all of this layering and overhead for something so BLOODY SIMPLE?!?!?
RE: [PATCH] NET: Multiqueue network device support.
From: jamal [EMAIL PROTECTED] Date: Wed, 13 Jun 2007 09:33:22 -0400 So in such a case (assuming 8 rings), One model is creating 4 netdev devices each based on single tx/rx ring and register set and then having a mother netdev (what you call the bond) that feeds these children netdev based on some qos parametrization is very sensible. Why all of this layering and overhead for something so BLOODY SIMPLE?!?!? I am currently packing up the newest patches against 2.6.23, with feedback from Patrick. The delay in posting them was a weird panic with the loopback device, which I just found. Let me run a test cycle or two, and I'll send them today for review, including an e1000 patch to show how to use the API. Cheers, -PJ Waskiewicz
RE: [PATCH] NET: Multiqueue network device support.
PJ Waskiewicz wrote:

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f28bb2d..b9dc2a6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -123,7 +123,8 @@ static inline int qdisc_restart(struct net_device *dev)
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
-	if (!netif_queue_stopped(dev)) {
+	if (!netif_queue_stopped(dev) &&
+	    !netif_subqueue_stopped(dev, skb->queue_mapping)) {
 		int ret;
 		ret = dev_hard_start_xmit(skb, dev);

Your patch doesn't update any other users of netif_queue_stopped(). The assumption that they can pass packets to the driver when the queue is running is no longer valid, since they don't know whether the subqueue the packet will end up in is active (it might be different from queue 0 if packets were redirected from a multiqueue-aware qdisc through TC actions). So they need to be changed to check the subqueue state as well. The cases I found were net/core/netpoll.c, net/core/pktgen.c, and the software device case in net/core/dev.c. In all cases, the value of skb->queue_mapping will be zero, but they don't initialize the subqueue lock of the single allocated queue (hence the panic when trying to use it...). I also don't think it makes sense for them to care, since ->enqueue() doesn't get called as far as I can tell, therefore the classification won't happen. Did I miss something in looking at this? Thanks, -PJ
Re: [PATCH] NET: Multiqueue network device support.
On Wed, 2007-13-06 at 11:20 -0700, David Miller wrote: From: jamal [EMAIL PROTECTED] Date: Wed, 13 Jun 2007 09:33:22 -0400 So in such a case (assuming 8 rings), One model is creating 4 netdev devices each based on single tx/rx ring and register set and then having a mother netdev (what you call the bond) that feeds these children netdev based on some qos parametrization is very sensible. Why all of this layering and overhead for something so BLOODY SIMPLE?!?!? Are we still talking about the same thing? ;- This was about NICs which have multiple register sets and tx/rx rings; the only shared resource is the bus and the wire. The e1000 can't do that. The thread is too long, so you may be talking about the same thing. cheers, jamal
Re: [PATCH] NET: Multiqueue network device support.
On Wed, 2007-06-13 at 13:34 +0200, Patrick McHardy wrote: The key argument for Jamal's solution is that the NIC will send out the 32 packets in the full PHL in a reasonably short time (a few microsecs per Jamal's calculation). But for wireless, the PHL hardware has a low probability of seizing the wireless medium when the air is full of high priority frames. That is, the chance for transmission in PHL and PHH is not equal. Queuing packets in software will starve high priority packets, compared to putting them into PHH as early as possible. Well, the key result of our discussion was that it makes no difference wrt. queueing behaviour if the queue wakeup strategy is suitably chosen for the specific queueing discipline, but it might add some overhead. My point is that the overhead is huge for the wireless case, which makes it unacceptable. Given the above example in the wireless medium, which queue wakeup strategy would you choose? I guess it might be the not stop tx ring + requeue? If this is selected, when there is a low priority packet coming (PHL is full), the Qdisc will keep dequeueing and requeueing the same packet for a long time (given the nature of the wireless medium) and chew tons of CPU. We met this problem before in our driver and this (not stop tx ring + requeue) is not a good thing to do. Thanks, -yi
Re: [PATCH] NET: Multiqueue network device support.
On Wed, 2007-06-13 at 08:32 -0400, jamal wrote: The key argument i make (from day one actually) is to leave the majority of the work to the driver. But it seems infeasible for the Qdisc to know nothing about the hardware rings. My view of wireless WMM etc. is that it is a different media behavior (compared to wired ethernet), which means a different strategy for when it opens the valve to allow in more packets. 802.11 media has embedded signalling which is usable. Guy Cohen gave a good use case which i responded to. Do you wanna look at that and respond? The key to supporting multi-ring hardware in software is to put packets into hardware as much/as early as possible. Guy gave a good VO vs. BK example. To achieve this in your model, you have to keep the TX ring running (in the case of PHL full) and requeue. But when there are only BK packets coming, you do want to stop the ring, right? AFAICS, the driver is not the best place to make the decision (it only knows the current and previous packets, but not the _next_); the Qdisc is the best place. Thanks, -yi
Re: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-06-11 at 08:23 -0400, jamal wrote: On Mon, 2007-11-06 at 13:58 +0200, Patrick McHardy wrote: That's not true. Assume PSL has lots of packets, PSH is empty. We fill the PHL queue until there is no room left, so the driver has to stop the queue. Sure. Packets stashed on any DMA ring are considered gone to the wire. That is a very valid assumption to make. Not at all! Packets could be on the DMA queue forever if you're feeding out more packets. Heck, on most wireless hardware packets can even be *expired* from the DMA queue and you get an indication that it was impossible to send them. johannes
Re: [PATCH] NET: Multiqueue network device support.
On Tue, 2007-12-06 at 11:19 +0200, Johannes Berg wrote: On Mon, 2007-06-11 at 08:23 -0400, jamal wrote: Sure. Packets stashed on any DMA ring are considered gone to the wire. That is a very valid assumption to make. Not at all! Packets could be on the DMA queue forever if you're feeding out more packets. Heck, on most wireless hardware packets can even be *expired* from the DMA queue and you get an indication that it was impossible to send them. The spirit of the discussion you are quoting was much higher level than that. Yes, what you describe can happen on any DMA (to hard-disk etc). A simpler example: if you tcpdump on an outgoing packet you see it on its way to the driver - it is accounted for as gone[1]. In any case, read the rest of the thread. cheers, jamal [1] Current Linux tcpdumping is not that accurate, but i don't wanna go into that discussion
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote: the qdisc has a chance to hand out either a packet of the same priority or higher priority, but at the cost of at worst (n - 1) * m unnecessary dequeues+requeues in case there is only a packet of lowest priority and we need to fully serve all higher priority HW queues before it can actually be dequeued. yes, i see that. [It actually is related to the wake threshold you use in the driver. tg3 and e1000 for example will do it after 30 or so packets. But i get your point - what you are trying to describe is a worst case scenario]. Yes. Using a higher threshold reduces the overhead, but leads to lower priority packets getting out even if higher priority packets are present in the qdisc. Note that if we use the threshold with multiple queue states (threshold per ring) this doesn't happen. The other possibility would be to activate the queue again once all rings can take packets again, but that wouldn't fix the problem, which you can easily see if you go back to my example and assume we still have a low priority packet within the qdisc when the lowest priority ring fills up (and the queue is stopped), and after we tried to wake it and stopped it again the higher priority packet arrives. In your use case, only low prio packets are available on the stack. Above you mention arrival of high prio - assuming that's intentional and not it being late over there ;- If higher prio packets are arriving on the qdisc when you open up, then given strict prio those packets get to go to the driver first until there are no more left; followed of course by low prio, which then shuts down the path again... What's happening is: the lowest priority ring fills up, the queue is stopped. We have more packets for it in the qdisc. A higher priority packet is transmitted, the queue is woken up again, the lowest priority packet goes to the driver and hits the full ring, the packet is requeued and the queue shut down until the ring frees up again. Now a high priority packet arrives. 
It won't get to the driver anymore. But it's not very important since having two different wakeup strategies would be a bit strange anyway, so let's just rule out this possibility. Considering your proposal in combination with RR, you can see the same problem of unnecessary dequeues+requeues. Well, we haven't really extended the use case from prio to RR. But this is as good a start as any, since all sorts of work-conserving schedulers will behave in a similar fashion .. Since there is no priority for waking the queue when an equal or higher priority ring got dequeued, as in the prio case, I presume you would wake the queue whenever a packet was sent. I suppose that is a viable approach if the hardware is RR based. Actually in the case of e1000 it is WRR, not plain RR, but that is a moot point which doesn't affect the discussion. For the RR qdisc, dequeue after requeue should hand out the same packet, independently of newly enqueued packets (which doesn't happen and is a bug in Peter's RR version), so in the worst case the HW has to make the entire round before a packet can get dequeued in case the corresponding HW queue is full. This is a bit better than prio, but still up to n - 1 unnecessary requeues+dequeues. I think it can happen more often than for prio though. I think what would be better to use is DRR. I pointed Peter at the code i did a long time ago. With DRR, a deficit is viable to be carried forward. If both driver and HW do it, it's probably OK for the short term, but it shouldn't grow too large since short-term fairness is also important. But the unnecessary dequeues+requeues can still happen. Forgetting about things like multiple qdisc locks and just looking at queueing behaviour, the question seems to come down to whether the unnecessary dequeues/requeues are acceptable (which I don't think they are, since they are easily avoidable). As i see it, the worst case scenario would have a finite time. 
A 100Mbps NIC should be able to dish out, depending on packet size, 148Kpps to 8.6Kpps; a GigE 10x that. So i think the phase in general won't last that long, given the assumption that packets are coming in from the stack to the driver at about the packet rate equivalent to wire rate (for the case of all work-conserving schedulers). In the general case there should be no contention at all. It does have finite time, but it's still undesirable. The average case would probably have been more interesting, but it's also harder :) I also expect to see lots of requeues under normal load that doesn't resemble the worst-case, but only tests can confirm that. OTOH you could turn it around and argue that the patches won't do much harm since ripping them out again (modulo queue mapping) should result in the same behaviour with just more overhead. I am not sure i understood - but note that i have asked for a middle ground from the beginning. I just mean that we could rip the patches out at any point again without user visible impact.
RE: [PATCH] NET: Multiqueue network device support.
Hi Jamal, Here is a simple scenario (nothing here is a rare or extreme case): - Busy wireless environment - FTP TX on BE queue (low priority) - Skype TX on VO queue (high priority) The channel is busy with high priority packets, hence the BE packets are transmitted to the air rarely, so the DMA/HW queue of the BE access category gets full and the qdisc is stopped. Now periodic VO-tagged Skype packets arrive. I would expect that they get the priority (and pass) in all stages of the stack and reach the HW ASAP and compete there on the medium with the other access categories and the other clients on the channel. But now such a packet will be stuck in the qdisc and wait there until a BE packet is transmitted, which can take a long time. This is a real problem. There is also a problem with the queues that will be dedicated to TX aggregation in 11n (currently implemented) - the packets will be classified to queues by the destination MAC address and not only by the priority class, but I don't want to get into that now. I think that there are enough arguments now for why the patch that started this thread is needed... Please see below some replies to your questions. Regards, Guy. jamal wrote: It could be estimated well by the host sw; but lets defer that to later in case i am clueless on something or you misunderstood something i said. It cannot be estimated well by the host SW. This is one of the main issues - we can't put it aside... I understand. Please correct me if I am wrong: The only reason an AC_BK packet will go out instead of AC_VO when contending in hardware is because of a statistical opportunity, not the firmware intentionally trying to allow AC_BK out, i.e. it is influenced by the three variables: 1) the contention window, 2) the backoff timer and 3) the tx opportunity. And if you look at the default IEEE parameters as in that url slide 43, the only time AC_BK will win is luck. 
In most scenarios BK packets will be transmitted and will win the medium against VO packets (though, in some non-favored ratio). Here's a really dated paper from before the standard was ratified: http://www.mwnl.snu.ac.kr/~schoi/publication/Conferences/02-EW.pdf Sorry, I'm really overloaded - I won't be able to review the docs you sent (I really apologize for that). So essentially the test you mention changes priorities in real time. What is the purpose of this test? Is WMM expected to change its priorities in real time? The WMM parameters of the AC are set and controlled by the network/BSS (access point) administrator and can be used in any way. There are the default parameters, but they can be changed. Regards, Guy.
Re: [PATCH] NET: Multiqueue network device support.
On Tue, 2007-12-06 at 15:21 +0200, Patrick McHardy wrote: jamal wrote: Yes. Using a higher threshold reduces the overhead, but leads to lower priority packets getting out even if higher priority packets are present in the qdisc. As per the earlier discussion, the packets already given to hardware should be fine to go out first. If they get overridden by the chance arrival of higher prio packets from the stack, then that is fine. Note that if we use the threshold with multiple queue states (threshold per ring) this doesn't happen. I think if you do the math, you'll find that (n - 1) * m is actually not that unreasonable given the parameters typically used in the drivers. Let's for example take the parameters from e1000: the tx ring is around 256, the wake threshold is 32 packets (although i have found a better number is 1/2 the tx ring size and have that changed in my batching patches). Assume such a driver with the above parameters doing Gige exists and it implements 4 queues (n = 4); in such a case, (n-1)*m/32 is 3*256/32 = 3*8 = 24 times. You have to admit your use case is a real corner case, but let's be conservative since we are doing a worst case scenario, and from that perspective consider that gige can be achieved at pkt levels of 86Kpps to 1.48Mpps; if you are non-work conserving you will be running at that rate, and let's pick the low end of 86Kpps - what that means is there is a blip (remember again this is a corner case) for a few microsecs once in a while, with some probability of what you described actually occurring... Ok, so then update the threshold to 1/2 the tx ring etc. and it is even less. You get the message. If both driver and HW do it, it's probably OK for the short term, but it shouldn't grow too large since short-term fairness is also important. But the unnecessary dequeues+requeues can still happen. In a corner case, yes, there is a probability that will happen. I think it is extremely low. It does have finite time, but it's still undesirable. 
The average case would probably have been more interesting, but it's also harder :) I also expect to see lots of requeues under normal load that doesn't resemble the worst-case, but only tests can confirm that. And that is what i was asking of Peter. Some testing. Clearly the subqueueing is more complex; what i am asking for is for the driver to bear the brunt and not for it to be an impacting architectural change. I am not sure i understood - but note that i have asked for a middle ground from the beginning. I just mean that we could rip the patches out at any point again without user visible impact aside from more overhead. So even if they turn out to be a mistake it's easily correctable. That is a good compromise i think. The reason i am spending my time discussing this is that i believe this to be a very important subsystem. You know i have been vociferous for years on this topic. What i was worried about is these patches making it in and becoming engrained with hot lava on stone. I've also looked into moving all multiqueue specific handling to the top-level qdisc out of sch_generic; unfortunately that leads to races unless all subqueue state operations take dev->qdisc_lock. Besides the overhead, I think it would lead to ABBA deadlocks. I am confident you can handle that. So how do we move forward? What you described above is a good compromise IMO. I don't have much time to chase this path at the moment, but what it does is give me the freedom to revisit later on with data points. More importantly, you understand my view ;- And of course you did throw a lot of rocks, but it is a definite alternative ;- cheers, jamal
RE: [PATCH] NET: Multiqueue network device support.
Guy, I apologize for not responding immediately - i promise to in a few hours when i get back (and read it over some good coffee) - seems like you have some good stuff there; thanks for taking the time despite the overload. cheers, jamal On Tue, 2007-12-06 at 17:04 +0300, Cohen, Guy wrote: Hi Jamal, Here is a simple scenario (nothing here is a rare or extreme case): - Busy wireless environment - FTP TX on BE queue (low priority) - Skype TX on VO queue (high priority)
Re: [PATCH] NET: Multiqueue network device support.
From: Patrick McHardy [EMAIL PROTECTED] Date: Tue, 12 Jun 2007 15:21:54 +0200 So how do we move forward? We're going to put hw multiqueue support in; all of this discussion has been pointless. I just watch this thread and basically laugh at the resistance to hw multiqueue support :-)
Re: [PATCH] NET: Multiqueue network device support.
If hardware w/ multiple queues will have the capability for different MAC addresses, different RX filters, etc., does it make sense to add that below the net_device level? We will have to add all the configuration machinery at the per-queue level that already exists at the per-netdev level. Jeff
Re: [PATCH] NET: Multiqueue network device support.
Jeff Garzik wrote: If hardware w/ multiple queues will have the capability for different MAC addresses, different RX filters, etc., does it make sense to add that below the net_device level? We will have to add all the configuration machinery at the per-queue level that already exists at the per-netdev level. Perhaps the mac-vlan patch would be a good fit. Currently it is all software based, but if the hardware can filter on MAC, it can basically do mac-vlan acceleration. The mac-vlan devices are just like 'real' ethernet devices, so they can be used with whatever schemes work with regular devices. Thanks, Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] NET: Multiqueue network device support.
From: Ben Greear [EMAIL PROTECTED] Date: Tue, 12 Jun 2007 14:17:44 -0700 Jeff Garzik wrote: If hardware w/ multiple queues will have the capability for different MAC addresses, different RX filters, etc., does it make sense to add that below the net_device level? We will have to add all the configuration machinery at the per-queue level that already exists at the per-netdev level. Perhaps the mac-vlan patch would be a good fit. Currently it is all software based, but if the hardware can filter on MAC, it can basically do mac-vlan acceleration. The mac-vlan devices are just like 'real' ethernet devices, so they can be used with whatever schemes work with regular devices. Interesting. But to answer Jeff's question, that's not really the model being used to implement multiple queues. The MAC is still very much centralized in most designs. So one way they'll do it is to support assigning N MAC addresses, and you configure the input filters of the chip to push packets for each MAC to the proper receive queue. So the MAC will accept any of those N MAC addresses as its own, then you use the filtering facilities to steer frames to the correct RX queue. The TX and RX queues can be so isolated as to be able to be exported to virtualization nodes. You can give them full access to the DMA queues and associated mailboxes. So instead of all of this bogus virtualized device overhead, you just give the guest access to the real device. So you can use multiple queues either for better single node SMP performance, or better virtualization performance.
Re: [PATCH] NET: Multiqueue network device support.
David Miller wrote: From: Ben Greear [EMAIL PROTECTED] Date: Tue, 12 Jun 2007 14:17:44 -0700 Jeff Garzik wrote: If hardware w/ multiple queues will have the capability for different MAC addresses, different RX filters, etc., does it make sense to add that below the net_device level? We will have to add all the configuration machinery at the per-queue level that already exists at the per-netdev level. Perhaps the mac-vlan patch would be a good fit. Currently it is all software based, but if the hardware can filter on MAC, it can basically do mac-vlan acceleration. The mac-vlan devices are just like 'real' ethernet devices, so they can be used with whatever schemes work with regular devices. Interesting. But to answer Jeff's question, that's not really the model being used to implement multiple queues. The MAC is still very much centralized in most designs. So one way they'll do it is to support assigning N MAC addresses, and you configure the input filters of the chip to push packets for each MAC to the proper receive queue. So the MAC will accept any of those N MAC addresses as its own, then you use the filtering facilities to steer frames to the correct RX queue. Not quite... You'll have to deal with multiple Rx filters, not just the current one-filter-for-all model present in today's NICs. Pools of queues will have separate configured characteristics. The steer portion you mention is a bottleneck that wants to be eliminated. Jeff
Re: [PATCH] NET: Multiqueue network device support.
David Miller wrote: From: Ben Greear [EMAIL PROTECTED] Date: Tue, 12 Jun 2007 14:17:44 -0700 Jeff Garzik wrote: If hardware w/ multiple queues will have the capability for different MAC addresses, different RX filters, etc., does it make sense to add that below the net_device level? We will have to add all the configuration machinery at the per-queue level that already exists at the per-netdev level. Perhaps the mac-vlan patch would be a good fit. Currently it is all software based, but if the hardware can filter on MAC, it can basically do mac-vlan acceleration. The mac-vlan devices are just like 'real' ethernet devices, so they can be used with whatever schemes work with regular devices. Interesting. But to answer Jeff's question, that's not really the model being used to implement multiple queues. The MAC is still very much centralized in most designs. So one way they'll do it is to support assigning N MAC addresses, and you configure the input filters of the chip to push packets for each MAC to the proper receive queue. So the MAC will accept any of those N MAC addresses as its own, then you use the filtering facilities to steer frames to the correct RX queue. The TX and RX queues can be so isolated as to be able to be exported to virtualization nodes. You can give them full access to the DMA queues and associated mailboxes. So instead of all of this bogus virtualized device overhead, you just give the guest access to the real device. So you can use multiple queues either for better single node SMP performance, or better virtualization performance. That sounds plausible for many uses, but it may also be useful to have the virtual devices. Having 802.1Q VLANs be 'real' devices has worked out quite well, so I think there is a place for a 'mac-vlan' as well. With your description above, the 'correct RX queue' could be the only queue that the mac-vlan sees, so it would behave somewhat like a vanilla ethernet driver. 
When the mac-vlan transmits, it could transmit directly into its particular TX queue on the underlying device. In a non-guest environment, I believe the mac-vlan will act somewhat like a more flexible form of an ip-alias. When namespaces are implemented, the mac-vlan would very easily allow the different namespaces to share the same physical hardware. The overhead should be minimal, and it's likely that using a 'real' network device will be a lot easier to maintain than trying to directly share separate queues on a single device that is somehow visible in multiple namespaces. And, since the mac-vlan can work as pure software on top of any NIC that can go promisc and send with an arbitrary source MAC, it will already work with virtually all wired ethernet devices currently in existence. Thanks, Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com
Re: [PATCH] NET: Multiqueue network device support.
The MAC is still very much centralized in most designs. So one way they'll do it is to support assigning N MAC addresses, and you configure the input filters of the chip to push packets for each MAC to the proper receive queue. So the MAC will accept any of the N MAC addresses as its own, then you use the filtering facilities to steer frames to the correct RX queue. Not quite... You'll have to deal with multiple Rx filters, not just the current one-filter-for-all model present in today's NICs. Pools of queues will have separate configured characteristics. The steer portion you mention is a bottleneck that wants to be eliminated. I think you're misunderstanding. These NICs still have only one physical port, so sending or receiving real packets onto a physical wire is fundamentally serialized. The steering of packets to receive queues is done right after the packets are received from the wire -- in fact it can be done as soon as the NIC has parsed enough of the headers to make a decision, which might be before the full packet has even been received. The steering is no more of a bottleneck than the physical link is. - R.
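A minimal sketch of the filtering model described above (N MAC addresses, each steered to its own RX queue). The table contents, function names, and default-queue behavior are illustrative assumptions, not taken from any real driver:

```python
# Hypothetical NIC filter table: the MAC accepts N unicast addresses as
# its own, and each address is steered to a dedicated RX queue.
mac_filters = {
    "02:00:00:00:00:01": 0,  # e.g. guest A's queue
    "02:00:00:00:00:02": 1,  # e.g. guest B's queue
}

def steer_rx(dst_mac, default_queue=0):
    # In hardware this decision is made as soon as the Ethernet header
    # has been parsed, possibly before the full frame has arrived.
    return mac_filters.get(dst_mac, default_queue)

print(steer_rx("02:00:00:00:00:02"))  # frame for guest B -> queue 1
```

The point of the sketch is that steering happens at parse time per frame, so it adds no serialization beyond the wire itself.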
Re: [PATCH] NET: Multiqueue network device support.
From: Jeff Garzik [EMAIL PROTECTED] Date: Tue, 12 Jun 2007 17:46:20 -0400 Not quite... You'll have to deal with multiple Rx filters, not just the current one-filter-for-all model present in today's NICs. Pools of queues will have separate configured characteristics. The steer portion you mention is a bottleneck that wants to be eliminated. It runs in hardware at wire speed, what's the issue? :-)
Re: [PATCH] NET: Multiqueue network device support.
From: Ben Greear [EMAIL PROTECTED] Date: Tue, 12 Jun 2007 14:46:50 -0700 And, since the mac-vlan can work as pure software on top of any NIC that can go promisc and send with arbitrary source MAC, it will already work with virtually all wired ethernet devices currently in existence. Absolutely, I'm not against something like mac-vlan at all.
Re: [PATCH] NET: Multiqueue network device support.
From: Jason Lunz [EMAIL PROTECTED] Date: Tue, 12 Jun 2007 17:47:53 -0400 Are you aware of any hardware designs that allow other ways to map packets onto rx queues? I can think of several scenarios where it could be advantageous to map packets by IP 3- or 5-tuple to get cpu locality all the way up the stack on a flow-by-flow basis. But doing this would require some way to request this mapping from the hardware. These chips allow this too, Microsoft defined a standard for RX queue interrupt hashing by flow so everyone puts it, or something like it, in hardware. In the extreme case it would be cool if it were possible to push a bpf-like classifier down into the hardware to allow arbitrary kinds of flow distribution. Maybe not full bpf, but many chips allow something close.
Re: [PATCH] NET: Multiqueue network device support.
Roland Dreier wrote: The MAC is still very much centralized in most designs. So one way they'll do it is to support assigning N MAC addresses, and you configure the input filters of the chip to push packets for each MAC to the proper receive queue. So the MAC will accept any of the N MAC addresses as its own, then you use the filtering facilities to steer frames to the correct RX queue. Not quite... You'll have to deal with multiple Rx filters, not just the current one-filter-for-all model present in today's NICs. Pools of queues will have separate configured characteristics. The steer portion you mention is a bottleneck that wants to be eliminated. I think you're misunderstanding. These NICs still have only one physical port, so sending or receiving real packets onto a physical wire is fundamentally serialized. The steering of packets to receive queues is done right after the packets are received from the wire -- in fact it can be done as soon as the NIC has parsed enough of the headers to make a decision, which might be before the full packet has even been received. The steering is no more of a bottleneck than the physical link is. No, you're misreading. People are putting in independent configurable Rx filters because a single Rx filter setup for all queues was a bottleneck. Not a performance bottleneck but a configuration and flexibility limitation that's being removed. And where shall we put the configuration machinery, to support sub-queues? Shall we duplicate the existing configuration code for sub-queues? What will ifconfig/ip usage look like? How will it differ from configuring full net_devices, if you are assigning the same types of parameters? Jeff
Re: [PATCH] NET: Multiqueue network device support.
From: Roland Dreier [EMAIL PROTECTED] Date: Tue, 12 Jun 2007 14:52:11 -0700 I think you're misunderstanding. These NICs still have only one physical port, so sending or receiving real packets onto a physical wire is fundamentally serialized. The steering of packets to receive queues is done right after the packets are received from the wire -- in fact it can be done as soon as the NIC has parsed enough of the headers to make a decision, which might be before the full packet has even been received. The steering is no more of a bottleneck than the physical link is. Yep, that's right.
Re: [PATCH] NET: Multiqueue network device support.
From: Jeff Garzik [EMAIL PROTECTED] Date: Tue, 12 Jun 2007 17:59:43 -0400 And where shall we put the configuration machinery, to support sub-queues? Shall we duplicate the existing configuration code for sub-queues? What will ifconfig/ip usage look like? How will it differ from configuring full net_devices, if you are assigning the same types of parameters? If you're asking about the virtualization scenario, the control node (dom0 or whatever) is the only entity which can get at programming the filters and will set it up properly based upon which parts of the physical device are being exported to which guest nodes. For the non-virtualized case, it's a good question. But really the current hardware is just about simple queue steering, and simple static DRR/WRED fairness algorithms applied to the queues in hardware. We don't need to add support for configuring anything fancy from the start just to get something working. Especially the important bits such as the virtualization case and the interrupt and queue distribution case on SMP. The latter can even be configured automatically by the driver, and that's in fact what I expect drivers to do initially.
Re: [PATCH] NET: Multiqueue network device support.
On Tue, Jun 12, 2007 at 02:55:34PM -0700, David Miller wrote: These chips allow this too, Microsoft defined a standard for RX queue interrupt hashing by flow so everyone puts it, or something like it, in hardware. I think you're referring to RSS? http://www.microsoft.com/whdc/device/network/NDIS_RSS.mspx http://msdn2.microsoft.com/en-us/library/ms795609.aspx Jason
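The standard referenced here, Microsoft's RSS, has the NIC hash the flow tuple and look the result up in an indirection table of RX queue indices, so all packets of one flow land on one CPU. A rough model of that lookup, with crc32 standing in for the Toeplitz hash that RSS actually specifies, and table/queue sizes picked arbitrarily:

```python
import struct
import zlib

# Indirection table: hash buckets -> RX queue indices (sizes are made up;
# real tables are typically 128 entries per the NDIS RSS spec).
INDIRECTION = [0, 1, 2, 3, 0, 1, 2, 3]

def rx_queue_for_flow(src_ip, dst_ip, src_port, dst_port, proto):
    # Serialize the 5-tuple the way a NIC would see it on the wire
    key = struct.pack("!4s4sHHB",
                      bytes(map(int, src_ip.split("."))),
                      bytes(map(int, dst_ip.split("."))),
                      src_port, dst_port, proto)
    h = zlib.crc32(key)  # stand-in for the Toeplitz hash
    return INDIRECTION[h % len(INDIRECTION)]

q = rx_queue_for_flow("10.0.0.1", "10.0.0.2", 40000, 80, 6)
print(q)
```

Because the hash is deterministic, every packet of a flow maps to the same queue, which is exactly the cpu-locality property Jason asked about.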
Re: [PATCH] NET: Multiqueue network device support.
David Miller wrote: If you're asking about the virtualization scenario, the control node (dom0 or whatever) is the only entity which can get at programming the filters and will set it up properly based upon which parts of the physical device are being exported to which guest nodes. You're avoiding the question. Clearly guest VMs must contact the host VM (dom0) to get real work done. They are ultimately going to have to pass the same configuration info as the non-virt case. For the non-virtualized case, it's a good question. ... But really the current hardware is just about simple queue steering, and simple static DRR/WRED fairness algorithms applied to the queues in hardware. We don't need to add support for configuring anything fancy from the start just to get something working. Correct. But if we don't plan for the future that's currently in the silicon pipeline, our ass will be in a sling WHEN we must figure out the best configuration points for sub-queues. Or are we prepared to rip out sub-queues for a non-experimental solution, when confronted with the obvious necessity of configuring them? You know I want multi-queue and the increased parallelism it provides. A lot. But let's not dig ourselves into a hole we must climb out of in 6-12 months. We need to think about configuration issues -now-. Jeff
Re: [PATCH] NET: Multiqueue network device support.
Ben Greear wrote: That sounds plausible for many uses, but it may also be useful to have the virtual devices. Having 802.1Q VLANs be 'real' devices has worked out quite well, so I think there is a place for a 'mac-vlan' as well. Virtual devices are pretty much the only solution we have right now, both in terms of available control points, and in terms of mapping to similar existing solutions (like wireless and its multiple net devices). Jeff
Re: [PATCH] NET: Multiqueue network device support.
On Tue, Jun 12, 2007 at 02:26:58PM -0700, David Miller wrote: The MAC is still very much centralized in most designs. So one way they'll do it is to support assigning N MAC addresses, and you configure the input filters of the chip to push packets for each MAC to the proper receive queue. So the MAC will accept any of the N MAC addresses as its own, then you use the filtering facilities to steer frames to the correct RX queue. The TX and RX queues can be so isolated as to be able to be exported to virtualization nodes. You can give them full access to the DMA queues and associated mailboxes. So instead of all of this bogus virtualized device overhead, you just give the guest access to the real device. So you can use multiple queues either for better single node SMP performance, or better virtualization performance. Are you aware of any hardware designs that allow other ways to map packets onto rx queues? I can think of several scenarios where it could be advantageous to map packets by IP 3- or 5-tuple to get cpu locality all the way up the stack on a flow-by-flow basis. But doing this would require some way to request this mapping from the hardware. In the extreme case it would be cool if it were possible to push a bpf-like classifier down into the hardware to allow arbitrary kinds of flow distribution. Jason
Re: [PATCH] NET: Multiqueue network device support.
Jeff Garzik wrote: Ben Greear wrote: That sounds plausible for many uses, but it may also be useful to have the virtual devices. Having 802.1Q VLANs be 'real' devices has worked out quite well, so I think there is a place for a 'mac-vlan' as well. Virtual devices are pretty much the only solution we have right now, both in terms of available control points, and in terms of mapping to similar existing solutions (like wireless and its multiple net devices). I believe Patrick is working on cleaning up mac-vlans and converting them to use the new netlink configuration API, so there should be a patch for these hitting the list shortly. Thanks, Ben -- Ben Greear [EMAIL PROTECTED] Candela Technologies Inc http://www.candelatech.com
RE: [PATCH] NET: Multiqueue network device support.
Hi Guy, On Tue, 2007-12-06 at 17:04 +0300, Cohen, Guy wrote: Hi Jamal, Here is a simple scenario (nothing here is a rare or extreme case): - Busy wireless environment - FTP TX on BE queue (low priority) - Skype TX on VO queue (high priority) The channel is busy with high priority packets hence the BE packets are transmitted to the air rarely so the DMA/HW queue of the BE access category gets full and the qdisc is stopped. Now periodic VO-tagged Skype packets arrive. I would expect that they get the priority (and pass) in all stages of the stack and reach the HW ASAP and compete there on the medium with the other access categories and the other clients on the channel. Now this packet will be stuck in the qdisc and wait there until a BE packet is transmitted, which can take a long time. This is a real problem. Understood. My take is that this is resolvable by understanding the nature of the beast. IOW, the strategy of when to open up on such a medium is not the conventional one of a wired netdev. You can use signalling from the media such as an AP giving you signals for different ACs to open up; example: if the AC_BE is not being allowed out and it is just rotting because the AP is favoring VO, then you need to occasionally open up the tx path for the driver etc. There is also a problem with the queues that will be dedicated to TX aggregation in 11n (currently implemented) - the packets will be classified to queues by the destination MAC address and not only by the priority class, but I don't want to get into that now. We have an infrastructure at the qdisc level for selecting queues based on literally anything you can think of in a packet as well as metadata. So I think this aspect should be fine. I think that there are enough arguments now why the patch that started this thread is needed... 
Sorry Guy, I don't see it that way - unfortunately I don't think anybody else other than Patrick understood what I said, and this thread is going on for too long; I doubt 99% of the people are following any more ;- In most scenarios BK packets will be transmitted and will win the medium against VO packets (though, in some non-favored ratio). So if I understand you correctly: over a period of time, yes BK will make it out but under contention it will lose; is that always? Is there some mathematics behind this stuff? Sorry, I'm really overloaded - I won't be able to review the docs you sent (really apologize for that). No problem. I totally understand. The WMM parameters of the AC are set and controlled by the network/BSS (access point) administrator and can be used in any way. There are the default parameters but they can be changed. It would certainly lead to unexpected behavior if you start favoring BE over VO, no? Would that ever happen by adjusting the WMM parameters? cheers, jamal
RE: [PATCH] NET: Multiqueue network device support.
-Original Message- From: [EMAIL PROTECTED] [mailto:netdev- [EMAIL PROTECTED] On Behalf Of Jason Lunz Sent: Tuesday, June 12, 2007 2:48 PM To: David Miller Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [PATCH] NET: Multiqueue network device support. On Tue, Jun 12, 2007 at 02:26:58PM -0700, David Miller wrote: The MAC is still very much centralized in most designs. So one way they'll do it is to support assigning N MAC addresses, and you configure the input filters of the chip to push packets for each MAC to the proper receive queue. So the MAC will accept any of the N MAC addresses as its own, then you use the filtering facilities to steer frames to the correct RX queue. The TX and RX queues can be so isolated as to be able to be exported to virtualization nodes. You can give them full access to the DMA queues and associated mailboxes. So instead of all of this bogus virtualized device overhead, you just give the guest access to the real device. So you can use multiple queues either for better single node SMP performance, or better virtualization performance. Are you aware of any hardware designs that allow other ways to map packets onto rx queues? I can think of several scenarios where it could be advantageous to map packets by IP 3- or 5-tuple to get cpu locality all the way up the stack on a flow-by-flow basis. But doing this would require some way to request this mapping from the hardware. 10GbE Xframe NICs do that, as well as rx steering by MAC address, VLAN, MS RSS, generic hashing and a bunch of other criteria (there is actually a decent chapter on rx steering in the ASIC manual at the www.neterion.com support page). The caveat is that in the current products the tuple table is limited to 256 entries only. The next ASIC bumps this number to 64k. 
In the extreme case it would be cool if it were possible to push a bpf-like classifier down into the hardware to allow arbitrary kinds of flow distribution. Jason
Re: [PATCH] NET: Multiqueue network device support.
On Tue, 2007-06-12 at 23:17 +0200, Patrick McHardy wrote: I've hacked up a small multiqueue simulator device and to my big surprise my testing showed that Jamal's suggestion of using a single queue state seems to work better than I expected. But I've been doing mostly testing of the device itself up to now with very simple traffic patterns (mostly just flood all queues), so I'll try to get some real results tomorrow. The key argument for Jamal's solution is that the NIC will send out 32 packets in the full PHL in a reasonably short time (a few microsecs per Jamal's calculation). But for wireless, the PHL hardware has a low probability of seizing the wireless medium when the air is full of high priority frames. That is, the chance for transmission in PHL and PHH is not equal. Queuing packets in software will starve high priority packets, compared to putting them into PHH as early as possible. Patrick, I don't think your testing considered the above scenario, right? Thanks, -yi
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote: On Wed, 2007-06-06 at 17:11 +0200, Patrick McHardy wrote: [...] The problem is the premise is _inaccurate_. Since you haven't followed the discussion, I will try to be brief (which is hard). If you want verbosity it is in my previous emails: Consider a simple example of a strict prio qdisc which is a mirror configuration of a specific hardware. Then for sake of discussion, assume two prio queues in the qdisc - PSL and PSH - and two hardware queues/rings in a NIC which does strict prio with queues PHL and PHH. The mapping is as follows: PSL --- maps to --- PHL PSH --- maps to --- PHH Assume the PxH has a higher prio than PxL. Strict prio will always favor H over L. Two scenarios: a) a lot of packets for PSL arriving on the stack. They only get sent from PSL -> PHL if and only if there are no packets from PSH -> PHH. b) a lot of packets for PSH arriving from the stack. They will always be favored over PSL in sending to the hardware. From the above: The only way PHL will ever shut down the path to the hardware is when there are sufficient PHL packets. Corollary: the only way PSL will ever shut down the path to the hardware is when there are _NO_ PSH packets. That's not true. Assume PSL has lots of packets, PSH is empty. We fill the PHL queue until there is no room left, so the driver has to stop the queue. Now some PSH packets arrive, but the queue is stopped, no packets will be sent. Now, you can argue that as soon as the first PHL packet is sent there is room for more and the queue will be activated again and we'll take PSH packets, so it doesn't matter because we can't send two packets at once anyway. Fine. Take three HW queues, prio 0-2. The prio 2 queue is entirely full, prio 1 has some packets queued and prio 0 is empty. Now, because prio 2 is completely full, the driver has to stop the queue. Before it can start it again it has to send all prio 1 packets and then at least one packet of prio 2. Until this happens, no packets can be queued to prio 0. 
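Patrick's three-queue example can be played out in a toy model. The ring sizes and contents here are made up; the point is only the head-of-line blocking that a single queue state produces:

```python
# Toy model: one netdev queue state gated on *any* HW ring being full.
MAX_QLEN = 4
# prio 0 empty, prio 1 partly filled, prio 2 completely full
rings = {0: [], 1: ["p1"] * 2, 2: ["p2"] * MAX_QLEN}

def queue_stopped():
    # single queue state: the whole device is stopped while any ring is full
    return any(len(r) >= MAX_QLEN for r in rings.values())

def hw_send_one():
    # strict-prio HW scheduler: always drain the highest-priority non-empty ring
    for prio in sorted(rings):
        if rings[prio]:
            return rings[prio].pop(0)
    return None

sends = 0
while queue_stopped():
    hw_send_one()
    sends += 1
# During these transmissions no prio-0 packet could be queued, even
# though the prio-0 ring was empty the entire time.
print(sends)
```

With these numbers the device stays stopped for three transmissions: both prio-1 packets must drain before the HW scheduler sends the one prio-2 packet that frees ring space and reopens the queue.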
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote: On Wed, 2007-06-06 at 15:35 -0700, David Miller wrote: The problem with this line of thinking is that it ignores the fact that it is bad to not queue to the device when there is space available, _even_ for lower priority packets. So use a different scheduler. Don't use strict prio. Strict prio will guarantee starvation of low prio packets as long as there are high prio packets. That's the intent. With a single queue state _any_ full HW queue will starve all other queues, independent of the software queueing discipline.
Re: [PATCH] NET: Multiqueue network device support.
Waskiewicz Jr, Peter P wrote: If they have multiple TX queues, independently programmable, that single lock is stupid. We could use per-queue TX locks for such hardware, but we can't support that currently. There could be bad packet reordering with this (like some SMP routers used to do). My original multiqueue patches I submitted actually had a per-queue Tx lock, but it was removed since the asymmetry in the stack for locking was something people didn't like. Locking a queue for ->enqueue(), unlocking, then locking for ->dequeue(), unlocking, was something people didn't like very much. Also knowing what queue to lock on ->enqueue() was where the original ->map_queue() idea came from, since we wanted to lock before calling ->enqueue(). I guess there were a few more reasons why people (at least me) didn't like it. IIRC it didn't include any sch_api locking changes, so it was completely broken wrt. concurrent configuration changes (easily fixable though). Additionally it assumed that classification was deterministic and two classify calls would return the same result, which is not necessarily true and might have resulted in locking the wrong queue, and it didn't deal with TC actions doing stuff to a packet during the first classification.
Re: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-11-06 at 13:58 +0200, Patrick McHardy wrote: That's not true. Assume PSL has lots of packets, PSH is empty. We fill the PHL queue until there is no room left, so the driver has to stop the queue. Sure. Packets stashed on any DMA ring are considered gone to the wire. That is a very valid assumption to make. Now some PSH packets arrive, but the queue is stopped, no packets will be sent. Now, you can argue that as soon as the first PHL packet is sent there is room for more and the queue will be activated again and we'll take PSH packets, _exactly_ ;- so it doesn't matter because we can't send two packets at once anyway. Fine. I can see your thought process building - you are actually following what I am saying ;- Take three HW queues, prio 0-2. The prio 2 queue is entirely full, prio 1 has some packets queued and prio 0 is empty. Now, because prio 2 is completely full, the driver has to stop the queue. Before it can start it again it has to send all prio 1 packets and then at least one packet of prio 2. Until this happens, no packets can be queued to prio 0. The assumption is packets gone to the DMA are gone to the wire, that's it. If you have a strict prio scheduler, contention from the stack is only valid if they both arrive at the same time. If that happens then (assuming 0 is more important than 1 which is more important than 2) 0 always wins over 1 which wins over 2. Same thing if you queue into hardware and the prioritization is the same. cheers, jamal
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote: On Mon, 2007-11-06 at 13:58 +0200, Patrick McHardy wrote: That's not true. Assume PSL has lots of packets, PSH is empty. We fill the PHL queue until there is no room left, so the driver has to stop the queue. Sure. Packets stashed on any DMA ring are considered gone to the wire. That is a very valid assumption to make. I disagree, it's obviously not true and leads to the behaviour I described. If it were true there would be no reason to use multiple HW TX queues to begin with. [...] I can see your thought process building - you are actually following what I am saying ;- I am :)
Re: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-11-06 at 14:39 +0200, Patrick McHardy wrote: jamal wrote: On Mon, 2007-11-06 at 13:58 +0200, Patrick McHardy wrote: Sure. Packets stashed on any DMA ring are considered gone to the wire. That is a very valid assumption to make. I disagree, it's obviously not true Patrick, you are making too strong a statement. Take a step back: When you put a packet on the DMA ring, are you ever going to take it away at some point before it goes to the wire? and leads to the behaviour I described. If it were true there would be no reason to use multiple HW TX queues to begin with. In the general case, they are totally useless. They are useful when there's contention/congestion. Even in a shared media like wireless. And if there is contention, the qdisc scheduler will do the right thing. cheers, jamal
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote: On Mon, 2007-11-06 at 14:39 +0200, Patrick McHardy wrote: Sure. Packets stashed on any DMA ring are considered gone to the wire. That is a very valid assumption to make. I disagree, it's obviously not true Patrick, you are making too strong a statement. Well, it's not. Take a step back: When you put a packet on the DMA ring, are you ever going to take it away at some point before it goes to the wire? No, but it's nevertheless not on the wire yet and the HW scheduler controls when it will get there. It might in theory even never get there if higher priority queues are continuously active. and leads to the behaviour I described. If it were true there would be no reason to use multiple HW TX queues to begin with. In the general case, they are totally useless. They are useful when there's contention/congestion. Even in a shared media like wireless. The same is true for any work-conserving queue, software or hardware. And if there is contention, the qdisc scheduler will do the right thing. That ignores a few points that were raised in this thread: - you can treat each HW queue as an individual network device - you can avoid synchronizing on a single queue lock for multiple TX queues - it is desirable to keep all queues full
Re: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-11-06 at 15:03 +0200, Patrick McHardy wrote: jamal wrote: Well, it's not. I don't wanna go into those old style debates again, so let's drop this point. Take a step back: When you put a packet on the DMA ring, are you ever going to take it away at some point before it goes to the wire? No, but it's nevertheless not on the wire yet and the HW scheduler controls when it will get there. It might in theory even never get there if higher priority queues are continuously active. Sure - but what is wrong with that? What would be wrong is if, in the case of contention for a resource like a wire between a less important packet and a more important packet, the less important packet got favored. Nothing like that ever happens in what I described. Remember there is no issue if there is no congestion or contention for local resources. And if there is contention, the qdisc scheduler will do the right thing. That ignores a few points that were raised in this thread: - you can treat each HW queue as an individual network device You can treat a pair of tx/rx as a netdev. In which case none of this is important. You instantiate a different netdev and it only holds the appropriate locks. - you can avoid synchronizing on a single queue lock for multiple TX queues Unneeded if you do what I described. Zero changes to the qdisc code. - it is desirable to keep all queues full It is desirable to keep resources fully utilized. Sometimes that is achieved by keeping _all_ queues full. If I fill a single queue full and transmit at wire rate, there is no issue. cheers, jamal
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote: On Mon, 2007-11-06 at 15:03 +0200, Patrick McHardy wrote: Take a step back: When you put a packet on the DMA ring, are you ever going to take it away at some point before it goes to the wire? No, but it's nevertheless not on the wire yet and the HW scheduler controls when it will get there. It might in theory even never get there if higher priority queues are continuously active. Sure - but what is wrong with that? Nothing, this was just to illustrate why I disagree with the assumption that the packet has hit the wire. On second thought I do agree with your assumption for the single HW queue case, at the point we hand the packet to the HW the packet order is determined and is unchangeable. But this is not the case if the hardware includes its own scheduler. The qdisc is simply not fully in charge anymore. What would be wrong is if, in the case of contention for a resource like a wire between a less important packet and a more important packet, the less important packet got favored. Read again what I wrote about the n > 2 case. Low priority queues might starve high priority queues when using a single queue state for a maximum of the time it takes to service n - 2 queues with max_qlen - 1 packets queued plus the time for a single packet. That's assuming the worst case of n - 2 queues with max_qlen - 1 packets and the lowest priority queue full, so the queue is stopped until we can send at least one lowest priority packet, which requires fully servicing all higher priority queues previously. Nothing like that ever happens in what I described. Remember there is no issue if there is no congestion or contention for local resources. Your basic assumption seems to be that the qdisc is still in charge of when packets get sent. This isn't the case if there is another scheduler after the qdisc and there is contention in the second queue. 
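Patrick's worst-case bound can be made concrete with illustrative numbers (the queue count, ring size, and per-packet wire time below are assumptions for the sake of arithmetic, not measurements):

```python
# Worst-case head-of-line stall for the highest-priority queue under a
# single queue state, following the bound stated above.
def worst_case_stall_packets(n_queues, max_qlen):
    # n - 2 intermediate queues each holding max_qlen - 1 packets,
    # plus one packet from the full lowest-priority queue
    return (n_queues - 2) * (max_qlen - 1) + 1

pkts = worst_case_stall_packets(n_queues=4, max_qlen=256)
usec_per_pkt = 12.3  # ~1538 bytes on the wire at 1 Gb/s (assumed figure)
print(pkts, round(pkts * usec_per_pkt))  # packets, microseconds stalled
```

With 4 queues of 256 entries that is 511 packet times, several milliseconds' worth of stall at 1 Gb/s, during which the empty highest-priority queue cannot even be fed.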
RE: [PATCH] NET: Multiqueue network device support.
Patrick McHardy wrote: jamal wrote: Sure - but what is wrong with that? Nothing, this was just to illustrate why I disagree with the assumption that the packet has hit the wire. On second thought I do agree with your assumption for the single HW queue case: at the point we hand the packet to the HW, the packet order is determined and is unchangeable. But this is not the case if the hardware includes its own scheduler. The qdisc is simply not fully in charge anymore. For WiFi devices the HW often implements the scheduling, especially when QoS (WMM/11e/11n) is implemented. There are a few traffic queues defined by the specs, and the selection of the next queue to transmit a packet from is determined in real time, just when there is a tx opportunity. This cannot be predicted in advance since it depends on the medium usage of other stations. Hence, to make it possible for wireless devices to use the qdisc mechanism properly, the HW queues should _ALL_ be non-empty at all times, whenever data is available in the upper layers. Or in other words, the upper layers should not block a specific queue because of the usage of any other queue. Your basic assumption seems to be that the qdisc is still in charge of when packets get sent. This isn't the case if there is another scheduler after the qdisc and there is contention in the second queue. Which is often the case in wireless devices - transmission scheduling is done in HW.
Re: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-11-06 at 16:03 +0200, Patrick McHardy wrote: jamal wrote: Sure - but what is wrong with that? Nothing, this was just to illustrate why I disagree with the assumption that the packet has hit the wire. Fair enough. On second thought I do agree with your assumption for the single HW queue case: at the point we hand the packet to the HW, the packet order is determined and is unchangeable. But this is not the case if the hardware includes its own scheduler. The qdisc is simply not fully in charge anymore. I am making the case that it does not affect the overall results as long as you use the same parameterization on qdisc and hardware. If in fact the qdisc's high-priority packets made it to the driver before they make it out onto the wire, it is probably a good thing that the hardware scheduler starves the low-priority packets. Read again what I wrote about the n > 2 case. Low-priority queues might starve high-priority queues when using a single queue state for a maximum of the time it takes to service n - 2 queues with max_qlen - 1 packets queued, plus the time for a single packet. That's assuming the worst case of n - 2 queues with max_qlen - 1 packets and the lowest-priority queue full, so the queue is stopped until we can send at least one lowest-priority packet, which requires fully servicing all higher-priority queues first. I didn't quite follow the above - I will try re-reading your other email to see if I can make sense of it. Your basic assumption seems to be that the qdisc is still in charge of when packets get sent. This isn't the case if there is another scheduler after the qdisc and there is contention in the second queue. My basic assumption is that if you use the same scheduler in both the hardware and the qdisc, configured with the same number of queues and mapped to the same priorities, then you don't need to make any changes to the qdisc code.
If I have a series of routers through which a packet traverses to its destination with the same QoS parameters, I also achieve the same results. cheers, jamal
Re: [PATCH] NET: Multiqueue network device support.
Cohen, Guy wrote: Patrick McHardy wrote: jamal wrote: Sure - but what is wrong with that? Nothing, this was just to illustrate why I disagree with the assumption that the packet has hit the wire. On second thought I do agree with your assumption for the single HW queue case: at the point we hand the packet to the HW, the packet order is determined and is unchangeable. But this is not the case if the hardware includes its own scheduler. The qdisc is simply not fully in charge anymore. For WiFi devices the HW often implements the scheduling, especially when QoS (WMM/11e/11n) is implemented. There are a few traffic queues defined by the specs, and the selection of the next queue to transmit a packet from is determined in real time, just when there is a tx opportunity. This cannot be predicted in advance since it depends on the medium usage of other stations. Hence, to make it possible for wireless devices to use the qdisc mechanism properly, the HW queues should _ALL_ be non-empty at all times, whenever data is available in the upper layers. Or in other words, the upper layers should not block a specific queue because of the usage of any other queue. That's exactly what I'm saying. And it's not possible with a single queue state, as I tried to explain in my last mail.
RE: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-11-06 at 17:30 +0300, Cohen, Guy wrote: For WiFi devices the HW often implements the scheduling, especially when QoS (WMM/11e/11n) is implemented. There are a few traffic queues defined by the specs, and the selection of the next queue to transmit a packet from is determined in real time, just when there is a tx opportunity. This cannot be predicted in advance since it depends on the medium usage of other stations. WMM is a strict prio mechanism. The parametrization very much favors the high prio packets when the tx opportunity to send shows up. Hence, to make it possible for wireless devices to use the qdisc mechanism properly, the HW queues should _ALL_ be non-empty at all times, whenever data is available in the upper layers. agreed. Or in other words, the upper layers should not block a specific queue because of the usage of any other queue. This is where we are going to disagree. There is no way the stack will send the driver packets which are low prio if there are some which are high prio. Therefore, on contention between low and high prio, there is no way for low prio packets to obstruct the high prio packets; however, it is feasible that high prio packets will obstruct low prio packets (which is fine). cheers, jamal
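jamal's claim above — that the stack never hands the driver a low-priority packet while a high-priority one is queued — is just strict-priority dequeueing at the qdisc. A toy userspace model of that selection (illustrative sketch only, not kernel code):

```c
#include <assert.h>

/* Toy model of strict-priority dequeue: with packets waiting in several
 * bands (0 = highest priority), the qdisc always hands the driver the
 * head packet of the lowest-numbered non-empty band, so low-priority
 * traffic can never get in front of high-priority traffic at the qdisc
 * level. */
#define BANDS 4

static int prio_dequeue_band(const int qlen[BANDS])
{
	for (int band = 0; band < BANDS; band++)
		if (qlen[band] > 0)
			return band;	/* this band's head packet goes next */
	return -1;			/* nothing queued at all */
}
```

This is the behavior Guy objects to for WMM: the selection here depends only on local band occupancy, while WMM's per-access-category contention also depends on what other stations on the medium are doing.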
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote: On Mon, 2007-11-06 at 16:03 +0200, Patrick McHardy wrote: Read again what I wrote about the n > 2 case. Low-priority queues might starve high-priority queues when using a single queue state for a maximum of the time it takes to service n - 2 queues with max_qlen - 1 packets queued, plus the time for a single packet. That's assuming the worst case of n - 2 queues with max_qlen - 1 packets and the lowest-priority queue full, so the queue is stopped until we can send at least one lowest-priority packet, which requires fully servicing all higher-priority queues first. I didn't quite follow the above - I will try re-reading your other email to see if I can make sense of it. Let me explain with some ASCII art :)

We have n empty HW queues with a maximum length of m packets per queue:

[0]   empty
[1]   empty
[2]   empty
..
[n-1] empty

Now we receive m - 1 packets for all priorities >= 1 and < n - 1, so we have:

[0]   empty
[1]   m - 1 packets
[2]   m - 1 packets
..
[n-2] m - 1 packets
[n-1] empty

Since no queue is completely full, the queue is still active. Now we receive m packets of priority n:

[0]   empty
[1]   m - 1 packets
[2]   m - 1 packets
..
[n-2] m - 1 packets
[n-1] m packets

At this point the queue needs to be stopped since the highest priority queue is entirely full. To start it again at least one packet of queue n - 1 needs to be sent, which (assuming strict priority) requires that queues 1 to n - 2 are serviced first. So any prio 0 packets arriving during this period will sit in the qdisc and will not reach the device for a possibly quite long time. With multiple queue states we'd know that queue 0 can still take packets. Your basic assumption seems to be that the qdisc is still in charge of when packets get sent. This isn't the case if there is another scheduler after the qdisc and there is contention in the second queue.
My basic assumption is that if you use the same scheduler in both the hardware and the qdisc, configured with the same number of queues and mapped to the same priorities, then you don't need to make any changes to the qdisc code. If I have a series of routers through which a packet traverses to its destination with the same QoS parameters, I also achieve the same results. Did my example above convince you that this is not the case?
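The worst case in the scenario above can be counted directly. A small userspace sketch (illustrative only, not driver code) of how many transmissions must complete before the single queue state can be re-enabled:

```c
#include <assert.h>

/* Model of the scenario above: n strict-priority HW rings (0 = highest),
 * each at most m packets deep, and a single device queue state that is
 * stopped as soon as any ring fills up.  With rings 1..n-2 holding
 * m - 1 packets each and ring n-1 full, the device stays stopped until
 * ring n-1 sends one packet, which under strict priority means draining
 * every higher-priority ring first. */
static long txs_until_wake(long n, long m)
{
	long higher = (n - 2) * (m - 1);	/* rings 1..n-2 drain first */
	return higher + 1;			/* plus one packet from ring n-1 */
}
```

With, say, n = 4 rings of m = 256 descriptors, this gives 2 * 255 + 1 = 511 full-sized packet transmission times during which a newly arriving prio-0 packet sits in the qdisc.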
Re: [PATCH] NET: Multiqueue network device support.
On 6/11/07, jamal [EMAIL PROTECTED] wrote: On Mon, 2007-11-06 at 17:30 +0300, Cohen, Guy wrote: For WiFi devices the HW often implements the scheduling, especially when QoS (WMM/11e/11n) is implemented. There are a few traffic queues defined by the specs, and the selection of the next queue to transmit a packet from is determined in real time, just when there is a tx opportunity. This cannot be predicted in advance since it depends on the medium usage of other stations. WMM is a strict prio mechanism. The parametrization very much favors the high prio packets when the tx opportunity to send shows up. This is not true; there is no simple priority order from 1 to 4, rather a set of parameters that determines access to the medium. You have to emulate medium behavior to schedule packets in the correct order. That's why this is pushed to HW; otherwise nobody would invest money in this part of the silicon :) Hence, to make it possible for wireless devices to use the qdisc mechanism properly, the HW queues should _ALL_ be non-empty at all times, whenever data is available in the upper layers. agreed. Or in other words, the upper layers should not block a specific queue because of the usage of any other queue. This is where we are going to disagree. There is no way the stack will send the driver packets which are low prio if there are some which are high prio. Therefore, on contention between low and high prio, there is no way for low prio packets to obstruct the high prio packets; however, it is feasible that high prio packets will obstruct low prio packets (which is fine). cheers, jamal
Re: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-11-06 at 16:49 +0200, Patrick McHardy wrote: Let me explain with some ASCII art :) Ok ;-> We have n empty HW queues with a maximum length of m packets per queue:

[0]   empty
[1]   empty
[2]   empty
..
[n-1] empty

Assuming 0, I take it, is higher prio than n-1. Now we receive m - 1 packets for all priorities >= 1 and < n - 1, so we have:

[0]   empty
[1]   m - 1 packets
[2]   m - 1 packets
..
[n-2] m - 1 packets
[n-1] empty

Since no queue is completely full, the queue is still active. and packets are being fired on the wire by the driver etc ... Now we receive m packets of priority n: n-1 (i think?)

[0]   empty
[1]   m - 1 packets
[2]   m - 1 packets
..
[n-2] m - 1 packets
[n-1] m packets

At this point the queue needs to be stopped since the highest priority queue is entirely full. ok, so 0 is lower prio than n-1 To start it again at least one packet of queue n - 1 needs to be sent, following so far ... which (assuming strict priority) requires that queues 1 to n - 2 are serviced first. Ok, so let me revert that; 0 is higher prio than n-1. So any prio 0 packets arriving during this period will sit in the qdisc and will not reach the device for a possibly quite long time. "possibly long time" is where we diverge ;-> If you throw the burden to the driver (as I am recommending in all my arguments so far), it should open up sooner based on priorities. I didn't want to bring this up earlier because it may take the discussion in the wrong direction. So in your example, if n-1 shuts down the driver, then it is up to the driver to open it up if any higher prio packet makes it out. With multiple queue states we'd know that queue 0 can still take packets. And with what I described you don't make any such changes to the core; the burden is on the driver. cheers, jamal
Re: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-11-06 at 18:00 +0300, Tomas Winkler wrote: On 6/11/07, jamal [EMAIL PROTECTED] wrote: On Mon, 2007-11-06 at 17:30 +0300, Cohen, Guy wrote: For WiFi devices the HW often implements the scheduling, especially when QoS (WMM/11e/11n) is implemented. There are a few traffic queues defined by the specs, and the selection of the next queue to transmit a packet from is determined in real time, just when there is a tx opportunity. This cannot be predicted in advance since it depends on the medium usage of other stations. WMM is a strict prio mechanism. The parametrization very much favors the high prio packets when the tx opportunity to send shows up. This is not true; there is no simple priority order from 1 to 4, rather a set of parameters that determines access to the medium. You have to emulate medium behavior to schedule packets in the correct order. That's why this is pushed to HW; otherwise nobody would invest money in this part of the silicon :) I don't have the specs, nor am I arguing the value of having the scheduler in hardware. (I think the over-the-radio contention clearly needs the scheduler in hardware.) But I have read a couple of papers on people simulating this in software. And I have seen people describe the default parametrization; for example slide 43 of: http://madwifi.org/attachment/wiki/ChipsetFeatures/WMM/qos11e.pdf?format=raw seems to indicate the default parameters for the different timers are clearly strictly in your favor if you have higher prio. If the info quoted is correct, it doesn't change anything I have said so far, i.e. it is strict prio scheduling with some statistical chance that a low prio packet will make it. cheers, jamal
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote: On Mon, 2007-11-06 at 16:49 +0200, Patrick McHardy wrote: We have n empty HW queues with a maximum length of m packets per queue:

[0]   empty
[1]   empty
[2]   empty
..
[n-1] empty

Assuming 0, I take it, is higher prio than n-1. Yes. Now we receive m - 1 packets for all priorities >= 1 and < n - 1, so we have:

[0]   empty
[1]   m - 1 packets
[2]   m - 1 packets
..
[n-2] m - 1 packets
[n-1] empty

Since no queue is completely full, the queue is still active. Now we receive m packets of priority n: n-1 (i think?) Right.

[0]   empty
[1]   m - 1 packets
[2]   m - 1 packets
..
[n-2] m - 1 packets
[n-1] m packets

At this point the queue needs to be stopped since the highest priority queue is entirely full. ok, so 0 is lower prio than n-1 Higher priority. But we don't know what the priority of the next packet is going to be, so we have to stop the entire qdisc anyway. To start it again at least one packet of queue n - 1 needs to be sent, following so far ... which (assuming strict priority) requires that queues 1 to n - 2 are serviced first. Ok, so let me revert that; 0 is higher prio than n-1. Yes. So any prio 0 packets arriving during this period will sit in the qdisc and will not reach the device for a possibly quite long time. "possibly long time" is where we diverge ;-> Worst case is (n - 2) * (m - 1) + 1 full-sized packet transmission times. You can do the math yourself, but we're talking about potentially a lot of packets. If you throw the burden to the driver (as I am recommending in all my arguments so far), it should open up sooner based on priorities. I didn't want to bring this up earlier because it may take the discussion in the wrong direction. So in your example, if n-1 shuts down the driver, then it is up to the driver to open it up if any higher prio packet makes it out. How could it do that? n-1 is still completely full and you don't know what the next packet is going to be.
Are you proposing to simply throw the packet away in the driver even though it's within the configured limits of the qdisc?
Re: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-11-06 at 17:12 +0200, Patrick McHardy wrote: Ok, so let me revert that; 0 is higher prio than n-1. Yes. Ok, gotcha. "possibly long time" is where we diverge ;-> Worst case is (n - 2) * (m - 1) + 1 full-sized packet transmission times. You can do the math yourself, but we're talking about potentially a lot of packets. I agree, if you use the strategy that a ring shutdown implies "don't wake up until the ring that caused the shutdown opens up". What I am saying below is to make a change to that strategy. If you throw the burden to the driver (as I am recommending in all my arguments so far), it should open up sooner based on priorities. I didn't want to bring this up earlier because it may take the discussion in the wrong direction. So in your example, if n-1 shuts down the driver, then it is up to the driver to open it up if any higher prio packet makes it out. How could it do that? n-1 is still completely full and you don't know what the next packet is going to be. Are you proposing to simply throw the packet away in the driver even though it's within the configured limits of the qdisc? No no Patrick - I am just saying the following:

- let the driver shut down whenever a ring is full. Remember which ring X shut it down.
- when you get a tx interrupt or prune tx descriptors, if a ring <= X has transmitted a packet (or a threshold of packets), then wake up the driver (i.e. open up).

In the meantime packets from the stack are sitting in the qdisc and will be sent when the driver opens up. Anyways, I have to run to work; thanks for keeping the discussion at the level you did. cheers, jamal
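The two steps jamal lists can be sketched as a userspace model (names invented for illustration; a real driver would call netif_stop_queue()/netif_wake_queue() at the marked points):

```c
#include <assert.h>

/* Sketch of the strategy above: remember which ring X caused the stop,
 * and on TX completion wake the queue as soon as any ring of equal or
 * higher priority (index <= X) has moved a packet, instead of waiting
 * for ring X itself to drain. */
struct mq_state {
	int stopped;	/* is the device queue stopped? */
	int stop_ring;	/* X: the ring whose fill caused the stop */
};

static void ring_full(struct mq_state *s, int ring)
{
	s->stopped = 1;		/* netif_stop_queue() would go here */
	s->stop_ring = ring;
}

/* Returns 1 if this completion woke the queue. */
static int tx_complete(struct mq_state *s, int ring)
{
	if (s->stopped && ring <= s->stop_ring) {
		s->stopped = 0;	/* netif_wake_queue() would go here */
		return 1;
	}
	return 0;
}
```

Patrick's follow-up points at the remaining gap in this scheme: after the wake, the qdisc may again dequeue a packet destined for the still-full ring X.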
RE: [PATCH] NET: Multiqueue network device support.
Some more details inside regarding wireless QoS. jamal wrote: On Mon, 2007-11-06 at 17:30 +0300, Cohen, Guy wrote: For WiFi devices the HW often implements the scheduling, especially when QoS (WMM/11e/11n) is implemented. There are a few traffic queues defined by the specs, and the selection of the next queue to transmit a packet from is determined in real time, just when there is a tx opportunity. This cannot be predicted in advance since it depends on the medium usage of other stations. WMM is a strict prio mechanism. The parametrization very much favors the high prio packets when the tx opportunity to send shows up. Sorry, but this is not as simple as you describe it. WMM is much more complicated. WMM defines the HW queues as virtually multiple clients that compete for medium access individually. Each implements contention-based medium access. The Access Point publishes to the clients the medium access parameters (e.g. backoff parameters), which are different for each access category (virtual client). There is _not_ a strict priority assigned to each access category. The behavior of each access category totally depends on the medium usage of other clients and is totally different for each access category. This cannot be predicted by the host SW. Hence, to make it possible for wireless devices to use the qdisc mechanism properly, the HW queues should _ALL_ be non-empty at all times, whenever data is available in the upper layers. agreed. Or in other words, the upper layers should not block a specific queue because of the usage of any other queue. This is where we are going to disagree. There is no way the stack will send the driver packets which are low prio if there are some which are high prio. Therefore, on contention between low and high prio, there is no way for low prio packets to obstruct the high prio packets; And this is not the right behavior for a WLAN stack.
QoS in WLAN doesn't favor strictly one access category over another, but defines some softer and smarter prioritization. This is implemented in the HW/firmware. I just think that providing per-queue controls (start/stop) will allow WLAN drivers/firmware/HW to do that while still using qdisc (and it will work properly even when one queue is full and others are empty). however, it is feasible that high prio packets will obstruct low prio packets (which is fine). No, this is _not_ fine. Just to emphasize again, WMM doesn't define priority the way it is implemented in airplane boarding (pilots first, business passengers next, coach passengers at the end), but more like a _distributed_ weighted prioritization (between all the multiple queues of all the clients on the channel). As a side note, in one of the WFA WMM certification tests, the AP changes the medium access parameters of the access categories in a way that favors a lower access category. This is something very soft that cannot be reflected in any intuitive way in the host SW. cheers, jamal
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote: On Mon, 2007-11-06 at 17:12 +0200, Patrick McHardy wrote: Worst case is (n - 2) * (m - 1) + 1 full-sized packet transmission times. You can do the math yourself, but we're talking about potentially a lot of packets. I agree, if you use the strategy that a ring shutdown implies "don't wake up until the ring that caused the shutdown opens up". What I am saying below is to make a change to that strategy. Glad we agree on something. Now all I have to do is convince you that a change to this strategy is not a good idea :) If you throw the burden to the driver (as I am recommending in all my arguments so far), it should open up sooner based on priorities. I didn't want to bring this up earlier because it may take the discussion in the wrong direction. So in your example, if n-1 shuts down the driver, then it is up to the driver to open it up if any higher prio packet makes it out. How could it do that? n-1 is still completely full and you don't know what the next packet is going to be. Are you proposing to simply throw the packet away in the driver even though it's within the configured limits of the qdisc? No no Patrick - I am just saying the following: - let the driver shut down whenever a ring is full. Remember which ring X shut it down. - when you get a tx interrupt or prune tx descriptors, if a ring <= X has transmitted a packet (or a threshold of packets), then wake up the driver (i.e. open up). At this point the qdisc might send new packets. What do you do when a packet for a full ring arrives? I see three choices:

- drop it, even though it's still within the qdisc's configured limits
- requeue it, which does not work because the qdisc is still active and might just hand you the same packet over and over again in a busy loop, until the ring has more room (which has the same worst case, just that we're sitting in a busy loop now)
- requeue and stop the queue: we're back to where we started, since now higher priority packets will not get passed to the driver.
Re: [PATCH] NET: Multiqueue network device support.
PJ Waskiewicz wrote:

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f28bb2d..b9dc2a6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -123,7 +123,8 @@ static inline int qdisc_restart(struct net_device *dev)
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
-	if (!netif_queue_stopped(dev)) {
+	if (!netif_queue_stopped(dev) &&
+	    !netif_subqueue_stopped(dev, skb->queue_mapping)) {
 		int ret;
 		ret = dev_hard_start_xmit(skb, dev);

Your patch doesn't update any other users of netif_queue_stopped(). The assumption that they can pass packets to the driver when the queue is running is no longer valid, since they don't know whether the subqueue the packet will end up in is active (it might be different from queue 0 if packets were redirected from a multiqueue-aware qdisc through TC actions). So they need to be changed to check the subqueue state as well. BTW, I couldn't find anything but a single netif_wake_subqueue in your (old) e1000 patch. Why doesn't it stop subqueues?
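The condition Patrick is reviewing — and which every other netif_queue_stopped() caller would also need — can be modeled in userspace as follows (a hypothetical model of the semantics, not the kernel API itself):

```c
#include <assert.h>

/* Userspace model of the patched condition: one global stop bit for the
 * device (netif_queue_stopped) plus one stop bit per subqueue
 * (netif_subqueue_stopped).  A packet may be handed to the driver only
 * if neither the device as a whole nor the subqueue selected by its
 * queue_mapping is stopped. */
struct model_dev {
	unsigned int dev_stopped;	/* global queue state */
	unsigned int subq_stopped;	/* bitmask: bit i set = subqueue i stopped */
};

static int tx_allowed(const struct model_dev *d, unsigned int queue_mapping)
{
	return !d->dev_stopped &&
	       !(d->subq_stopped & (1u << queue_mapping));
}
```

The point of the review comment is that any caller testing only the global bit can now wrongly conclude transmission is allowed while the packet's subqueue bit is set.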
Re: [PATCH] NET: Multiqueue network device support.
PJ Waskiewicz wrote:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e7367c7..8bcd870 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -215,6 +215,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@pkt_type: Packet class
  *	@fclone: skbuff clone status
  *	@ip_summed: Driver fed us an IP checksum
+ *	@queue_mapping: Queue mapping for multiqueue devices
  *	@priority: Packet queueing priority
  *	@users: User count - see {datagram,tcp}.c
  *	@protocol: Packet protocol from driver
@@ -269,6 +270,7 @@ struct sk_buff {
 			__u16	csum_offset;
 		};
 	};
+	__u16			queue_mapping;
 	__u32			priority;
 	__u8			local_df:1,
 				cloned:1,

I think we can reuse skb->priority. Assuming only real hardware devices use multiqueue support, there should be no user of skb->priority after egress qdisc classification. The only reason to preserve it in the qdisc layer is for software devices. Grepping through drivers/net shows a few users, but most seem to be using it on the RX path and some use it to store internal data.
RE: [PATCH] NET: Multiqueue network device support.
PJ Waskiewicz wrote:

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index e7367c7..8bcd870 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -215,6 +215,7 @@ typedef unsigned char *sk_buff_data_t;
  *	@pkt_type: Packet class
  *	@fclone: skbuff clone status
  *	@ip_summed: Driver fed us an IP checksum
+ *	@queue_mapping: Queue mapping for multiqueue devices
  *	@priority: Packet queueing priority
  *	@users: User count - see {datagram,tcp}.c
  *	@protocol: Packet protocol from driver
@@ -269,6 +270,7 @@ struct sk_buff {
 			__u16	csum_offset;
 		};
 	};
+	__u16			queue_mapping;
 	__u32			priority;
 	__u8			local_df:1,
 				cloned:1,

I think we can reuse skb->priority. Assuming only real hardware devices use multiqueue support, there should be no user of skb->priority after egress qdisc classification. The only reason to preserve it in the qdisc layer is for software devices. That would be outstanding. Grepping through drivers/net shows a few users, but most seem to be using it on the RX path and some use it to store internal data. Thank you for hunting this down. I will test in my little environment here to see if I run into any issues. -PJ
RE: [PATCH] NET: Multiqueue network device support.
PJ Waskiewicz wrote:

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index f28bb2d..b9dc2a6 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -123,7 +123,8 @@ static inline int qdisc_restart(struct net_device *dev)
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
-	if (!netif_queue_stopped(dev)) {
+	if (!netif_queue_stopped(dev) &&
+	    !netif_subqueue_stopped(dev, skb->queue_mapping)) {
 		int ret;
 		ret = dev_hard_start_xmit(skb, dev);

Your patch doesn't update any other users of netif_queue_stopped(). The assumption that they can pass packets to the driver when the queue is running is no longer valid, since they don't know whether the subqueue the packet will end up in is active (it might be different from queue 0 if packets were redirected from a multiqueue-aware qdisc through TC actions). So they need to be changed to check the subqueue state as well. I will look at all these cases and change them accordingly. Thanks for catching that. BTW, I couldn't find anything but a single netif_wake_subqueue in your (old) e1000 patch. Why doesn't it stop subqueues? A previous e1000 patch stopped subqueues. The last e1000 patch I sent to the list doesn't stop them, and that's a problem with that patch; it was sent purely to show how the alloc_etherdev_mq() stuff worked, but I missed the subqueue control. I can fix that and send an updated patch if you'd like. The reason I missed it is we maintain an out-of-tree driver and an in-tree driver, and mixing/matching code between the two becomes a bit of a juggling act sometimes when doing little engineering snippets. Thanks for reviewing these. I'll repost something with updates from your feedback. Cheers, -PJ Waskiewicz
Re: [PATCH] NET: Multiqueue network device support.
Waskiewicz Jr, Peter P wrote: I think we can reuse skb->priority. Assuming only real hardware devices use multiqueue support, there should be no user of skb->priority after egress qdisc classification. The only reason to preserve it in the qdisc layer is for software devices. That would be outstanding. Grepping through drivers/net shows a few users, but most seem to be using it on the RX path and some use it to store internal data. Thank you for hunting this down. I will test in my little environment here to see if I run into any issues. I think grepping will help more than testing :) The only issue I can see is that packets going to a multiqueue device that doesn't have a multiqueue-aware qdisc attached will get a random value. So you would have to conditionally reset it before ->enqueue. Another question is what to do about other hard_start_xmit callers. Independent of which field is used, should the classification that may have happened on a different device be retained (TC actions again)?
Re: [PATCH] NET: Multiqueue network device support.
Waskiewicz Jr, Peter P wrote:

BTW, I couldn't find anything but a single netif_wake_subqueue in your (old) e1000 patch. Why doesn't it stop subqueues?

A previous e1000 patch stopped subqueues. The last e1000 patch I sent to the list doesn't stop them, and that's a problem with that patch; it was sent purely to show how the alloc_etherdev_mq() stuff worked, but I missed the subqueue control. I can fix that and send an updated patch if you'd like. The reason I missed it is that we maintain an out-of-tree driver and an in-tree driver, and mixing/matching code between the two becomes a bit of a juggling act sometimes when doing little engineering snippets. Thanks for reviewing these. I'll repost something with updates from your feedback.

Thanks. I do have some more comments, but a repost with the patches split up into infrastructure changes, qdisc changes (one patch per qdisc) and the e1000 patch would make that easier.
RE: [PATCH] NET: Multiqueue network device support.
I think grepping will help more than testing :) The only issue I can see is that packets going to a multiqueue device that doesn't have a multiqueue-aware qdisc attached will get a random value. So you would have to conditionally reset it before ->enqueue.

I currently clear queue_mapping before ->enqueue(). Perhaps keeping queue_mapping in there might solve needing a conditional in the hotpath. Let me think about this one.

Another question is what to do about other hard_start_xmit callers. Independent of which field is used, should the classification that may have happened on a different device be retained (TC actions again)?

The two cases I can think of here are ip forwarding and bonding. In the case of bonding, things should be fine since the bonded device will only have one ring; therefore, even if the underlying slave devices are heterogeneous, there shouldn't be a problem retaining the previous TC classification. If the device has its own qdisc and classifiers, it can override it. For ip forwarding, I believe it will also be ok, since ultimately the device doing the last transmit will have its classifiers applied and remap skbs if necessary. Either way, before it gets enqueued through dev_queue_xmit(), the value will get cleared, so having an artifact laying around won't be possible. If that's not what you're referring to, please let me know.

Thanks,
-PJ Waskiewicz
Re: [PATCH] NET: Multiqueue network device support.
Waskiewicz Jr, Peter P wrote:

Another question is what to do about other hard_start_xmit callers. Independent of which field is used, should the classification that may have happened on a different device be retained (TC actions again)?

[...]

Either way, before it gets enqueued through dev_queue_xmit(), the value will get cleared, so having an artifact laying around won't be possible.

You're right, I was thinking of a case where a packet would be redirected from a multiqueue device to another one and then not go through dev_queue_xmit but some other path to hard_start_xmit that doesn't update the classification. But there is no case like this.
Re: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-11-06 at 17:44 +0200, Patrick McHardy wrote:

jamal wrote: [..]
- let the driver shut down whenever a ring is full. Remember which ring X shut it down.
- when you get a tx interrupt or prune tx descriptors, if a ring >= X has transmitted a packet (or threshold of packets), then wake up the driver (i.e. open up). At this point the qdisc might send new packets.

What do you do when a packet for a full ring arrives?

Hrm... ok, is this a trick question or am i missing the obvious? ;-> What is wrong with what any driver would do today - which is: netif_stop and return BUSY; the core requeues the packet?

I see three choices:
- drop it, even though it's still within the qdisc's configured limits
- requeue it, which does not work because the qdisc is still active and might just hand you the same packet over and over again in a busy loop, until the ring has more room (which has the same worst case, just that we're sitting in a busy loop now).
- requeue and stop the queue: we're back to where we started since now higher priority packets will not get passed to the driver.

Refer to choice #4 above. The patches are trivial - really; as soon as Peter posts the e1000 change for his version i should be able to cut-n-paste and produce one that will work with what i am saying. I am going to try my best to do that this week - i am going to be a little busy and have a few outstanding items (like the pktgen thing) that i want to get out of the way...

cheers, jamal
RE: [PATCH] NET: Multiqueue network device support.
On Mon, 2007-11-06 at 18:34 +0300, Cohen, Guy wrote:

jamal wrote: [..] WMM is a strict prio mechanism. The parametrization very much favors the high prio packets when the tx opportunity to send shows up.

Sorry, but this is not as simple as you describe it. WMM is much more complicated. WMM defines the HW queues as virtually multiple clients that compete on the medium access individually. Each implements a contention-based medium access. The Access Point publishes to the clients the medium access parameters (e.g. backoff parameters) that are different for each access category (virtual client). There is _not_ a strict priority assigned to each access category.

You sound like you know this stuff well, so please bear with me. I am actually hoping i will learn from you. I don't have access to the IEEE docs, but i have been reasonably following up on the qos aspect and i have a good feel for how the parameters work. I posted a url to a pdf earlier which describes the WMM default parameterization for each AC you refer to above - do you wanna comment on the accuracy of that?

The behavior of each access category totally depends on the medium usage of other clients and is totally different for each access category. This cannot be predicted at the host SW.

It could be estimated well by the host sw; but let's defer that to later in case i am clueless on something or you misunderstood something i said.

QoS in WLAN doesn't favor strictly one access category over another, but defines some softer and smarter prioritization. This is implemented in the HW/Firmware.

I understand.
Please correct me if i am wrong: the only reason an AC_BK packet will go out instead of AC_VO when contending in hardware is because of a statistical opportunity, not the firmware intentionally trying to allow AC_BK out; i.e. it is influenced by the three variables: 1) the contention window, 2) the backoff timer and 3) the tx opportunity. And if you look at the default IEEE parameters as in that url, slide 43, the only time AC_BK will win is luck.

I just think that providing per-queue controls (start/stop) will allow WLAN drivers/Firmware/HW to do that while still using a qdisc (and it will work properly even when one queue is full and others are empty).

I don't see it the same way. But i am willing to see wireless in a different light than wired, more below.

however, it is feasible that high prio packets will obstruct low prio packets (which is fine).

No, this is _not_ fine. Just to emphasize again, WMM doesn't define priority in the way it is implemented in airplane boarding (pilots first, business passengers next, coach passengers at the end), but more like _distributed_ weighted prioritization (between all the multiple queues of all the clients on the channel).

I am not trying to be obtuse in any way - but let me ask this for wireless contention resolution: when a business passenger is trying to get into the plane at the same time as a coach passenger and the attendant notices, i.e. to resolve the contention, who gets preferential treatment? There is the case of the attendant statistically not noticing (but that accounts for luck)...

Here's a really dated paper from before the standard was ratified: http://www.mwnl.snu.ac.kr/~schoi/publication/Conferences/02-EW.pdf

a) looking at table 1 at the AIFS, CWmin/max and PF used in the experiment, I don't see how a low prio or mid prio AC will ever beat something in the high prio, just by virtue that they have longer AIFS + CW values.
Maybe you can explain (trust me, i am trying to resolve this in my mind and not trying to be difficult in any way; i am a geek and these sorts of things intrigue me; i may curse but that's ok). The only way it would happen is if there is no collision, i.e. statistical luck.

b) The paragraph between fig 4 and fig 5 talks about virtual collision between two TCs within a station as _always_ favoring the higher prio.

Slide 43 on: http://madwifi.org/attachment/wiki/ChipsetFeatures/WMM/qos11e.pdf?format=raw also seems to indicate the default parameters for the different timers are clearly strictly in favor of you if you have higher prio. Do those numbers cross-reference with the IEEE doc you may have?

As a side note, in one of the WFA WMM certification tests, the AP changes the medium access parameters of the access categories in a way that favors a lower access category. This is something very soft that cannot be reflected in any intuitive way in the host SW.

So essentially the test you mention changes priorities in real time. What is the purpose of this test? Is WMM expected to change its priorities in real time?

cheers, jamal
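[Editor's sketch] The contention behaviour being argued over can be checked with a toy model. This is a deliberately simplified simulation (one fresh contention round per access, ignoring backoff-counter freezing and resumption that real EDCA performs), using the published 802.11e/WMM default AIFSN/CWmin values for AC_VO and AC_BK:

```python
import random

# 802.11e/WMM default EDCA parameters (in slot units) for two categories:
# AC_VO (voice): AIFSN=2, CWmin=3;  AC_BK (background): AIFSN=7, CWmin=15.
AC_VO = {"aifsn": 2, "cwmin": 3}
AC_BK = {"aifsn": 7, "cwmin": 15}

def access_delay(ac, rng):
    """Slots an AC waits before transmitting: AIFS plus a random
    backoff drawn uniformly from [0, CWmin]."""
    return ac["aifsn"] + rng.randint(0, ac["cwmin"])

def contend(rounds=10000, seed=1):
    """Count how often AC_BK beats AC_VO in a head-to-head round."""
    rng = random.Random(seed)
    bk_wins = 0
    for _ in range(rounds):
        if access_delay(AC_BK, rng) < access_delay(AC_VO, rng):
            bk_wins += 1
    return bk_wins

# With the defaults, AC_VO's worst case (2 + 3 = 5 slots) is shorter
# than AC_BK's best case (7 + 0 = 7 slots): AC_BK never wins outright.
print(contend())  # 0
```

In this simplified head-to-head model the lower category only gets through when the higher one has nothing queued, which matches the "statistical luck" reading of the default parameters; real EDCA freezes and resumes backoff counters across busy periods, which softens this somewhat.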
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote:

On Mon, 2007-11-06 at 17:44 +0200, Patrick McHardy wrote:

jamal wrote: [..]
- let the driver shut down whenever a ring is full. Remember which ring X shut it down.
- when you get a tx interrupt or prune tx descriptors, if a ring >= X has transmitted a packet (or threshold of packets), then wake up the driver (i.e. open up). At this point the qdisc might send new packets.

What do you do when a packet for a full ring arrives?

Hrm... ok, is this a trick question or am i missing the obvious? ;-> What is wrong with what any driver would do today - which is: netif_stop and return BUSY; the core requeues the packet?

That doesn't fix the problem, high priority queues may be starved by low priority queues if you do that.

BTW, I missed something you said before:

--quote--
i am making the case that it does not affect the overall results as long as you use the same parameterization on qdisc and hardware.
--end quote--

I agree that multiple queue states wouldn't be necessary if they were parameterized the same, in that case we wouldn't even need the qdisc at all (as you're saying). But one of the parameters is the maximum queue length, and we want to be able to parameterize the qdisc differently than the hardware here. Which is the only reason for the possible starvation.
Re: [PATCH] NET: Multiqueue network device support.
jamal wrote:

On Mon, 2007-11-06 at 17:44 +0200, Patrick McHardy wrote:

At this point the qdisc might send new packets.

What do you do when a packet for a full ring arrives?

Hrm... ok, is this a trick question or am i missing the obvious? ;-> What is wrong with what any driver would do today - which is: netif_stop and return BUSY; the core requeues the packet?

I see three choices:
- drop it, even though it's still within the qdisc's configured limits
- requeue it, which does not work because the qdisc is still active and might just hand you the same packet over and over again in a busy loop, until the ring has more room (which has the same worst case, just that we're sitting in a busy loop now).
- requeue and stop the queue: we're back to where we started since now higher priority packets will not get passed to the driver.

Refer to choice #4 above.

Replying again so we can hopefully move forward soon. Your choice #4 is exactly what I proposed as choice number 3. Let me repeat my example why it doesn't work (well) without multiple queue states (with typos etc. fixed) and describe the possibilities. If you still disagree I hope you can just change my example to show how it gets fixed. As a thank you I will actually understand that your solution works as well :)

We have n empty HW queues served in ascending priority order with a maximum length of m packets per queue:

[0] empty
[1] empty
[2] empty
..
[n-1] empty

Now we receive m - 1 packets for all priorities >= 1 and < n - 1, so we have:

[0] empty
[1] m - 1 packets
[2] m - 1 packets
..
[n-2] m - 1 packets
[n-1] empty

Since no HW queue is completely full, the queue is still active. Now we receive m packets of priority n - 1:

[0] empty
[1] m - 1 packets
[2] m - 1 packets
..
[n-2] m - 1 packets
[n-1] m packets

At this point the queue needs to be stopped since queue n - 1 is entirely full. To start it again at least one packet of queue n - 1 needs to be sent, which requires that queues 1 to n - 2 are serviced first.
So any prio 0 packet arriving during this period will sit in the qdisc and will not reach the device for up to the time of (n - 2) * (m - 1) + 1 full-sized packet transmissions. With multiple queue states we'd know that queue 0 can still take packets.

You agreed that this is a problem, and instead of keeping the queue stopped until all rings can take at least one packet again you proposed:

- let the driver shut down whenever a ring is full. Remember which ring X shut it down.
- when you get a tx interrupt or prune tx descriptors, if a ring >= X has transmitted a packet (or threshold of packets), then wake up the driver (i.e. open up).

At this point the queue is active, but at least one ring is already full and the qdisc can still pass packets for it to the driver. When this happens we can:

- drop it. This makes the qdisc's configured limit meaningless since the qdisc can't anticipate when the packet will make it through or get dropped.
- requeue it: this might result in a busy loop if the qdisc decides to hand out the packet again. The loop will be terminated once the ring has more room available and can eat the packet, which has the same worst-case behaviour I described above.
- requeue (return BUSY) and stop the queue: that's what you proposed as option #4. The question is when to wake the queue again. Your suggestion was to wake it when some other queue with equal or higher priority got dequeued. Correcting my previous statement, you are correct that this will fix the starvation of higher priority queues, because the qdisc has a chance to hand out either a packet of the same priority or higher priority, but at the cost of at worst (n - 1) * m unnecessary dequeues+requeues in case there is only a packet of lowest priority and we need to fully serve all higher priority HW queues before it can actually be dequeued.
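[Editor's sketch] The worst case stated in the example above can be verified with a toy model: n rings served in strict ascending-index order, rings 1..n-2 holding m-1 packets each and ring n-1 full with m packets, counting transmissions until the full ring frees a slot:

```python
def sends_until_full_ring_drains_one(n, m):
    """Transmissions needed before ring n-1 (full with m packets)
    frees one slot, when rings are served lowest index first and
    rings 1..n-2 each hold m-1 packets (ring 0 is empty)."""
    rings = [0] + [m - 1] * (n - 2) + [m]
    sent = 0
    while rings[n - 1] == m:           # until the full ring frees a slot
        for i in range(n):             # strict priority: lowest index first
            if rings[i] > 0:
                rings[i] -= 1          # transmit one packet from this ring
                sent += 1
                break
    return sent

# Matches the closed form from the example: (n-2)*(m-1) + 1 transmissions
# before a prio-0 packet arriving now could even reach the device.
for n, m in [(4, 8), (8, 32), (16, 256)]:
    assert sends_until_full_ring_drains_one(n, m) == (n - 2) * (m - 1) + 1
```

For e1000-like numbers (n = 4 rings, m = 256 descriptors) that is 2 * 255 + 1 = 511 full-sized packet transmissions of head-of-line delay.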
The other possibility would be to activate the queue again once all rings can take packets again, but that wouldn't fix the problem, which you can easily see if you go back to my example and assume we still have a low priority packet within the qdisc when the lowest priority ring fills up (and the queue is stopped), and after we tried to wake it and stopped it again the higher priority packet arrives.

Considering your proposal in combination with RR, you can see the same problem of unnecessary dequeues+requeues. Since there is no priority for waking the queue when an equal or higher priority ring got dequeued, as in the prio case, I presume you would wake the queue whenever a packet was sent. For the RR qdisc, dequeue after requeue should hand out the same packet, independently of newly enqueued packets (which doesn't happen and is a bug in Peter's RR version), so in the worst case the HW has to make the entire round before a packet can get dequeued in case the corresponding HW queue is full. This is a bit better than prio, but still up to n - 1 unnecessary requeues+dequeues.
Re: [PATCH] NET: Multiqueue network device support.
Sorry - i was distracted elsewhere and didn't respond to your earlier email; this one seems a superset.

On Tue, 2007-12-06 at 02:58 +0200, Patrick McHardy wrote:

jamal wrote: On Mon, 2007-11-06 at 17:44 +0200, Patrick McHardy wrote: [use case abbreviated..]

the use case is sensible.

the qdisc has a chance to hand out either a packet of the same priority or higher priority, but at the cost of at worst (n - 1) * m unnecessary dequeues+requeues in case there is only a packet of lowest priority and we need to fully serve all higher priority HW queues before it can actually be dequeued.

yes, i see that. [It actually is related to the wake threshold you use in the driver. tg3 and e1000 for example will do it after 30 or so packets. But i get your point - what you are trying to describe is a worst case scenario].

The other possibility would be to activate the queue again once all rings can take packets again, but that wouldn't fix the problem, which you can easily see if you go back to my example and assume we still have a low priority packet within the qdisc when the lowest priority ring fills up (and the queue is stopped), and after we tried to wake it and stopped it again the higher priority packet arrives.

In your use case, only low prio packets are available on the stack. Above you mention arrival of high prio - assuming that's intentional and not it being late over there ;-> If higher prio packets are arriving on the qdisc when you open up, then given strict prio those packets get to go to the driver first until there are no more left; followed of course by low prio, which then shuts down the path again...

Considering your proposal in combination with RR, you can see the same problem of unnecessary dequeues+requeues.

Well, we haven't really extended the use case from prio to RR. But this is as good a start as any since all sorts of work-conserving schedulers will behave in a similar fashion ..
Since there is no priority for waking the queue when an equal or higher priority ring got dequeued, as in the prio case, I presume you would wake the queue whenever a packet was sent.

I suppose that is a viable approach if the hardware is RR based. Actually in the case of e1000 it is WRR not plain RR, but that is a moot point which doesn't affect the discussion.

For the RR qdisc, dequeue after requeue should hand out the same packet, independently of newly enqueued packets (which doesn't happen and is a bug in Peter's RR version), so in the worst case the HW has to make the entire round before a packet can get dequeued in case the corresponding HW queue is full. This is a bit better than prio, but still up to n - 1 unnecessary requeues+dequeues. I think it can happen more often than for prio though.

I think what would be better to use is DRR. I pointed Peter to the code i did a long time ago. With DRR, a deficit is viable to be carried forward.

Forgetting about things like multiple qdisc locks and just looking at queueing behaviour, the question seems to come down to whether the unnecessary dequeues/requeues are acceptable (which I don't think they are, since they are easily avoidable).

As i see it, the worst case scenario would have a finite time. A 100Mbps NIC should be able to dish out, depending on packet size, 148Kpps to 8.6Kpps; a GigE 10x that. So i think the phase in general won't last that long, given the assumption is packets are coming in from the stack to the driver at about the packet rate equivalent to wire rate (for the case of all work-conserving schedulers). In the general case there should be no contention at all.

OTOH you could turn it around and argue that the patches won't do much harm since ripping them out again (modulo queue mapping) should result in the same behaviour with just more overhead.

I am not sure i understood - but note that i have asked for a middle ground from the beginning.

Thanks again for the patience and taking the time to go over this.
cheers, jamal
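[Editor's sketch] The wire-rate figures cited above can be reproduced from standard Ethernet framing overhead (a back-of-the-envelope calculation assuming 8 bytes of preamble plus a 12-byte inter-frame gap per frame):

```python
def pps(link_bps, frame_bytes, overhead_bytes=20):
    """Packets per second at line rate, counting 8 bytes of preamble
    and a 12-byte inter-frame gap (20 bytes total) per frame."""
    return link_bps // ((frame_bytes + overhead_bytes) * 8)

# 100 Mbit/s: ~148.8 kpps for minimum 64-byte frames and ~8.1 kpps for
# full-sized 1518-byte frames -- close to the 148K/8.6K figures above;
# GigE is roughly 10x that.
print(pps(100_000_000, 64))     # 148809
print(pps(100_000_000, 1518))   # 8127
print(pps(1_000_000_000, 64))   # 1488095
```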
RE: [PATCH] NET: Multiqueue network device support.
-----Original Message-----
From: J Hadi Salim [mailto:[EMAIL PROTECTED] On Behalf Of jamal
Sent: Saturday, June 09, 2007 8:03 PM
To: Leonid Grossman
Cc: Waskiewicz Jr, Peter P; Patrick McHardy; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; Kok, Auke-jan H; Ramkrishna Vepa; Alex Aizman
Subject: RE: [PATCH] NET: Multiqueue network device support.

our definition of channel on linux so far is a netdev (not a DMA ring). A netdev is the entity that can be bound to a CPU. Link layer flow control terminates (and emanates) from the netdev.

I think we are saying the same thing. Link layer flow control frames are generated (and terminated) by the hardware; the hardware gets configured by the netdev. And if a hw channel has enough resources, it could be configured as a separate netdev and handle its flow control the same way single-channel NICs do now. I'm not advocating flow control on a per-DMA-ring basis. This is not what I'm saying :-).

The IEEE link you sent shows that per-link flow control is a separate effort, and it will likely take time to become a standard.

Ok, my impression was it was happening already or it will happen tomorrow morning ;->

the proposal you mentioned is dated 2005, but something like that will probably happen sooner or later in IEEE. Some non-standard options, including ours, are already here - but as we just discussed, in any case flow control is arguably a netdev property, not a queue property. The multi-queue patch itself though (and possibly some additional per-queue properties) is a good thing :-)
Re: [PATCH] NET: Multiqueue network device support.
On Fri, Jun 08, 2007 at 09:12:52AM -0400, jamal wrote:

To mimic that behavior in LLTX, a driver needs to use the same lock on both tx and receive. e1000 holds a different lock on the tx path from the rx path. Maybe there's something clever i am missing; but it seems to be a bug in e1000.

It's both actually :) It takes the tx_lock in the xmit routine as well as in the clean-up routine. However, the lock is only taken when it updates the queue status. Thanks to the ring buffer structure the rest of the clean-up/xmit code will run concurrently just fine.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmVHI~} [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] NET: Multiqueue network device support.
On Sat, 2007-09-06 at 21:08 +1000, Herbert Xu wrote:

It takes the tx_lock in the xmit routine as well as in the clean-up routine. However, the lock is only taken when it updates the queue status. Thanks to the ring buffer structure the rest of the clean-up/xmit code will run concurrently just fine.

I know you are a patient man Herbert - so please explain slowly (if that doesn't make sense on email, then bear with me as usual) ;->

- it seems the cleverness is that some parts of the ring description are written to on tx but not rx (and vice-versa), correct? e.g. the next_to_watch/use bits. If that's a yes - there at least should have been a big fat comment in the code so nobody changes it;
- and even if that's the case, a) then the tx_lock sounds unneeded, correct? (given the RUNNING atomicity). b) do you even need the adapter lock? ;-> given the nature of the NAPI poll, only one CPU can prune the descriptors.

I have tested with just getting rid of the tx_lock and it worked fine. I haven't tried removing the adapter lock.

cheers, jamal
RE: [PATCH] NET: Multiqueue network device support.
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:netdev-[EMAIL PROTECTED] On Behalf Of Waskiewicz Jr, Peter P
Sent: Wednesday, June 06, 2007 3:31 PM
To: [EMAIL PROTECTED]; Patrick McHardy
Cc: [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; Kok, Auke-jan H
Subject: RE: [PATCH] NET: Multiqueue network device support.

[Which of course leads to the complexity (and not optimizing for the common - which is single ring NICs)].

The common case for 100 Mbit and older 1Gbit is single-ring NICs. Newer PCI-X and PCIe NICs from 1Gbit to 10Gbit support multiple rings in the hardware, and it's all headed in that direction, so it's becoming the common case.

IMHO, in addition to current Intel and Neterion NICs, some/most upcoming NICs are likely to be multiqueue, since virtualization is emerging as a major driver for hw designs (there are other things of course that drive hw, but these are complementary to multiqueue). The PCI-SIG IOV extensions for the pci spec are almost done, and a typical NIC (at least, a typical 10GbE NIC that supports some subset of IOV) in the near future is likely to have at least 8 independent channels, each with its own tx/rx queue, MAC address, msi-x vector(s), reset that doesn't affect other channels, etc. Basically, each channel could be used as an independent NIC that just happens to share the pci bus and 10GbE PHY with other channels (but has per-channel QoS and throughput guarantees).

In a non-virtualized system, such NICs could be used in a mode where each channel runs on one core; this may eliminate some locking... This mode will btw require deterministic session steering; the current hashing approach in the patch is not sufficient. This is something we can contribute once Peter's code is in.

In general, a consensus on kernel support for multiqueue NICs will be beneficial since multiqueue HW is here and other stacks are already taking advantage of it.
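[Editor's sketch] The "hashing approach" being contrasted with deterministic session steering can be illustrated as a flow hash over the connection 4-tuple. This is an invented model, not the patch's actual code; the function name and tuple layout are assumptions:

```python
import struct
import zlib

def select_tx_queue(src_ip, dst_ip, src_port, dst_port, num_queues):
    """Pick a tx ring by hashing the flow 4-tuple, so every packet
    of a session lands on the same ring (zlib.crc32 keeps the result
    deterministic across runs)."""
    key = struct.pack("!IIHH", src_ip, dst_ip, src_port, dst_port)
    return zlib.crc32(key) % num_queues

# Same flow -> same queue on every call, but nothing ties a flow's
# queue to the CPU its socket runs on -- which is the deterministic
# session-steering gap pointed out above.
q = select_tx_queue(0x0A000001, 0x0A000002, 40000, 80, 8)
assert q == select_tx_queue(0x0A000001, 0x0A000002, 40000, 80, 8)
assert 0 <= q < 8
```

A hash spreads flows across rings but gives no control over *which* ring (and hence which CPU) a given session maps to, which is why per-core steering needs something stronger than a plain hash.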
RE: [PATCH] NET: Multiqueue network device support.
On Sat, 2007-09-06 at 10:58 -0400, Leonid Grossman wrote:

IMHO, in addition to current Intel and Neterion NICs, some/most upcoming NICs are likely to be multiqueue, since virtualization is emerging as a major driver for hw designs (there are other things of course that drive hw, but these are complementary to multiqueue). PCI-SIG IOV extensions for the pci spec are almost done, and a typical NIC (at least, a typical 10GbE NIC that supports some subset of IOV) in the near future is likely to have at least 8 independent channels, each with its own tx/rx queue, MAC address, msi-x vector(s), reset that doesn't affect other channels, etc.

Leonid - any relation between that and data center ethernet? i.e. http://www.ieee802.org/3/ar/public/0503/wadekar_1_0503.pdf It seems to desire to do virtualization as well. Is there any open spec for PCI-SIG IOV?

Basically, each channel could be used as an independent NIC that just happens to share the pci bus and 10GbE PHY with other channels (but has per-channel QoS and throughput guarantees).

Sounds very similar to data centre ethernet - except data centre ethernet seems to map channels to rings, whereas the scheme you describe maps a channel essentially to a virtual nic, which seems to read in the common case as a single tx, single rx ring. Is that right? If yes, we should be able to do the virtual nics today without any changes really, since each one appears as a separate NIC. It will be a matter of probably boot-time partitioning and parametrization to create virtual nics (e.g. priorities of each virtual NIC, etc.).

In a non-virtualized system, such NICs could be used in a mode where each channel runs on one core; this may eliminate some locking... This mode will btw require deterministic session steering; the current hashing approach in the patch is not sufficient. This is something we can contribute once Peter's code is in.
I can actually see how the PCI-SIG virtual NIC approach could run on multiple CPUs (since each is no different from a NIC that we have today). And our current Linux steering would also work just fine. In the case of non-virtual NICs, i am afraid i don't think it is as easy as simple session steering - if you want to be generic, that is; you may wanna consider a more complex connection tracking, i.e. a grouping of sessions as the basis for steering to a tx ring (and therefore tying to a specific CPU). If you are an ISP or a data center with customers partitioned based on simple subnets, then i can see a simple classification based on subnets being tied to a hw ring/CPU. And in such cases simple flow control on a per-ring basis makes sense. Have you guys experimented with the non-virtual case? And are you doing the virtual case as a pair of tx/rx being a single virtual nic?

In general, a consensus on kernel support for multiqueue NICs will be beneficial since multiqueue HW is here and other stacks are already taking advantage of it.

My main contention with Peter's approach has been to do with the propagating of flow control back to the qdisc queues. However, if this PCI-SIG standard is also desiring such an approach then it will shed a different light.

cheers, jamal
RE: [PATCH] NET: Multiqueue network device support.
-----Original Message-----
From: J Hadi Salim [mailto:[EMAIL PROTECTED] On Behalf Of jamal
Sent: Saturday, June 09, 2007 12:23 PM
To: Leonid Grossman
Cc: Waskiewicz Jr, Peter P; Patrick McHardy; [EMAIL PROTECTED]; netdev@vger.kernel.org; [EMAIL PROTECTED]; Kok, Auke-jan H; Ramkrishna Vepa; Alex Aizman
Subject: RE: [PATCH] NET: Multiqueue network device support.

On Sat, 2007-09-06 at 10:58 -0400, Leonid Grossman wrote:

IMHO, in addition to current Intel and Neterion NICs, some/most upcoming NICs are likely to be multiqueue, since virtualization is emerging as a major driver for hw designs (there are other things of course that drive hw, but these are complementary to multiqueue). PCI-SIG IOV extensions for the pci spec are almost done, and a typical NIC (at least, a typical 10GbE NIC that supports some subset of IOV) in the near future is likely to have at least 8 independent channels, each with its own tx/rx queue, MAC address, msi-x vector(s), reset that doesn't affect other channels, etc.

Leonid - any relation between that and data center ethernet? i.e. http://www.ieee802.org/3/ar/public/0503/wadekar_1_0503.pdf It seems to desire to do virtualization as well.

Not really. This is a very old presentation; you probably saw some newer PR on Convergence Enhanced Ethernet, Congestion Free Ethernet etc. These efforts are in very early stages and arguably orthogonal to virtualization, but in general having per-channel QoS (flow control is just a part of it) is a good thing.

Is there any open spec for PCI-SIG IOV?

I don't think so; the actual specs and event presentations at www.pcisig.org are members-only, although there are many PRs about early IOV support that may shed some light on the features. But my point was that while the virtualization capabilities of upcoming NICs may not even be relevant to Linux, the multi-channel hw designs (a side effect of the virtualization push, if you will) will be there, and a non-virtualized stack can take advantage of them.
Actually, our current 10GbE NICs have most of such multichannel framework already shipping (in pre-IOV fashion), so the programming manual on the website can probably give you a pretty good idea of what multi-channel 10GbE NICs may look like. Basically, each channel could be used as an independent NIC that just happens to share the pci bus and 10GbE PHY with other channels (but has per-channel QoS and throughput guarantees).

Sounds very similar to data centre ethernet - except data centre ethernet seems to map channels to rings; whereas the scheme you describe maps a channel essentially to a virtual nic, which seems to read in the common case as a single tx, single rx ring. Is that right? If yes, we should be able to do the virtual nics today without any changes really, since each one appears as a separate NIC. It will be a matter of probably boot-time partitioning and parametrization to create virtual nics (ex: priorities of each virtual NIC etc).

Right, this is one deployment scenario for a multi-channel NIC, and it will require very few changes in the stack (a couple of extra IOCTLs would be nice). There are two reasons why you still may want to have generic multi-channel support/awareness in the stack:
1. Some users may want to have a single ip interface with multiple channels.
2. While multi-channel NICs will likely be many, only best-in-class will make the hw channels completely independent and able to operate as separate nics. Other implementations may have some limitations, and will work as multi-channel API compliant devices but not necessarily as independent mac devices.
I agree though that supporting multi-channel APIs is a bigger effort.

In a non-virtualized system, such NICs could be used in a mode where each channel runs on one core; this may eliminate some locking... This mode will require, btw, deterministic session steering; the current hashing approach in the patch is not sufficient. This is something we can contribute once Peter's code is in.
I can actually see how the PCI-SIG virtual NIC approach could run on multiple CPUs (since each is no different from a NIC that we have today). And our current Linux steering would also work just fine. In the case of non-virtual NICs, i am afraid i don't think it is as easy as simple session steering - if you want to be generic, that is; you may wanna consider more complex connection tracking, i.e a grouping of sessions as the basis for steering to a tx ring (and therefore tying to a specific CPU). If you are an ISP or a data center with customers partitioned based on simple subnets, then i can see a simple classification based on subnets being tied to a hw ring/CPU. And in such cases simple flow control on a per-ring basis makes sense.

Have you guys experimented on the non-virtual case? And are you doing the virtual case as a pair of tx/rx being a single virtual nic?

To a degree. We have quite a bit
Re: [PATCH] NET: Multiqueue network device support.
Leonid Grossman wrote:

But my point was that while virtualization capabilities of upcoming NICs may be not even relevant to Linux, the multi-channel hw designs (a side effect of virtualization push, if you will) will be there and a non-virtualized stack can take advantage of them.

I'm looking at the current hardware virtualization efforts, and often grimacing. A lot of these efforts assume that virtual PCI devices will be wonderful virtualization solutions, without stopping to think about global events that affect all such devices, such as silicon resets or errata workarounds. In the real world, you wind up having to un-virtualize to deal with certain exceptional events.

But as you point out, these hardware virt efforts can bestow benefits on non-virtualized stacks.

	Jeff

-
To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH] NET: Multiqueue network device support.
On Sat, 2007-09-06 at 17:23 -0400, Leonid Grossman wrote:

Not really. This is a very old presentation; you probably saw some newer PR on Convergence Enhanced Ethernet, Congestion Free Ethernet etc.

Not been keeping up to date in that area.

These efforts are in very early stages and arguably orthogonal to virtualization, but in general having per-channel QoS (flow control is just a part of it) is a good thing.

our definition of a channel on linux so far is a netdev (not a DMA ring). A netdev is the entity that can be bound to a CPU. Link layer flow control terminates (and emanates) from the netdev.

But my point was that while virtualization capabilities of upcoming NICs may be not even relevant to Linux, the multi-channel hw designs (a side effect of virtualization push, if you will) will be there and a non-virtualized stack can take advantage of them.

Makes sense...

Actually, our current 10GbE NICs have most of such multichannel framework already shipping (in pre-IOV fashion), so the programming manual on the website can probably give you a pretty good idea of what multi-channel 10GbE NICs may look like.

Ok, thanks.

Right, this is one deployment scenario for a multi-channel NIC, and it will require very few changes in the stack (a couple of extra IOCTLs would be nice).

Essentially a provisioning interface.

There are two reasons why you still may want to have generic multi-channel support/awareness in the stack: 1. Some users may want to have a single ip interface with multiple channels. 2. While multi-channel NICs will likely be many, only best-in-class will make the hw channels completely independent and able to operate as separate nics. Other implementations may have some limitations, and will work as multi-channel API compliant devices but not necessarily as independent mac devices. I agree though that supporting multi-channel APIs is a bigger effort.
IMO, the challenges you describe above are solvable via a parent netdevice (similar to bonding) with the children being the virtual NICs. The IP address is attached to the parent. Of course, the other model is not to show the parent device at all.

To a degree. We have quite a bit of testing done in a non-virtual OS (not in Linux though), using channels with tx/rx rings, msi-x etc as independent NICs. Flow control was not a focus since the fabric typically was not congested in these tests, but in theory per-channel flow control should work reasonably well. Of course, flow control is only part of the resource sharing problem.

In the current model - flow control to the s/ware queueing level (qdisc) is implicit. i.e hardware receives pause frames - stops sending; the ring becomes full as hardware sends, and the netdev tx path gets shut until things open up.

This is not what I'm saying :-). The IEEE link you sent shows that per-link flow control is a separate effort, and it will likely take time to become a standard.

Ok, my impression was it was happening already or it will happen tomorrow morning ;-

Also, (besides the shared link) the channels will share the pci bus. One solution could be to provide a generic API for QoS level to a channel (and also to a generic NIC!). Internally, the device driver can translate QoS requirements into flow control, pci bus bandwidth, and whatever else is shared on the physical NIC between the channels. As always, as some of that code becomes common between the drivers it can migrate up.

indeed.

cheers,
jamal
Re: [PATCH] NET: Multiqueue network device support.
On Thu, Jun 07, 2007 at 09:35:36PM -0400, jamal wrote:

On Thu, 2007-07-06 at 17:31 -0700, Sridhar Samudrala wrote:

If the QDISC_RUNNING flag guarantees that only one CPU can call dev->hard_start_xmit(), then why do we need to hold netif_tx_lock for non-LLTX drivers?

I haven't stared at other drivers, but for e1000 it seems to me that even if you got rid of LLTX, netif_tx_lock is unnecessary. Herbert?

It would guard against the poll routine, which would acquire this lock when cleaning the TX ring.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] NET: Multiqueue network device support.
On Fri, 2007-08-06 at 20:39 +1000, Herbert Xu wrote:

It would guard against the poll routine which would acquire this lock when cleaning the TX ring.

Ok, then i suppose we can conclude it is a bug in e1000 (holds tx_lock on the tx side and the adapter queue lock on rx). Adding that lock will certainly bring down the performance numbers on a send/recv profile. The bizarre thing is things run just fine even under the heavy tx/rx traffic i was testing under. I guess i didn't hit hard enough.

cheers,
jamal
Re: [PATCH] NET: Multiqueue network device support.
On Fri, Jun 08, 2007 at 07:34:57AM -0400, jamal wrote:

On Fri, 2007-08-06 at 20:39 +1000, Herbert Xu wrote:

It would guard against the poll routine which would acquire this lock when cleaning the TX ring.

Ok, then i suppose we can conclude it is a bug in e1000 (holds tx_lock on the tx side and the adapter queue lock on rx). Adding that lock will certainly bring down the performance numbers on a send/recv profile. The bizarre thing is things run just fine even under the heavy tx/rx traffic i was testing under. I guess i didn't hit hard enough.

Hmm, I wasn't describing how it works now. I'm talking about how it would work if we removed LLTX and replaced the private tx_lock with netif_tx_lock.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu [EMAIL PROTECTED]
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt