Re: Per-CPU Queueing for QoS
2017-11-13 18:05 GMT-08:00 Michael Ma <make0...@gmail.com>:
[earlier quoted discussion trimmed]
>> I think one of the key points was in: "do busy polling from a
>> per-class thread to drain these queues."
>>
>> I mentioned this idea in the TX path of:
>>
>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>
> Right - this part is the key difference. With mqprio we still don't
> have the ability to exploit parallelism at the level of a class. The
> parallelism is restricted to the way flows are partitioned across
> queues.

Eric - do you think that if we do busy polling frequently enough, the
out-of-order problem will effectively be mitigated? I'll take a look at
your slides as well.
Re: Per-CPU Queueing for QoS
2017-11-13 15:08 GMT-08:00 Eric Dumazet <eric.duma...@gmail.com>:
[earlier quoted discussion trimmed]
>> The question I would have is how would this differ from using XPS w/
>> mqprio? Would this be a classful qdisc like HTB or a classless one
>> like mqprio?
>>
>> From what I can tell XPS would be able to get you your per-cpu
>> functionality; the benefit of it, though, would be that it would
>> avoid out-of-order issues for sockets originating on the local
>> system. The only issue I see right now is that rate limiting with
>> mqprio is assumed to be handled in hardware, due to mechanisms such
>> as DCB.
>
> I think one of the key points was in: "do busy polling from a
> per-class thread to drain these queues."
>
> I mentioned this idea in the TX path of:
>
> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

Right - this part is the key difference. With mqprio we still don't
have the ability to exploit parallelism at the level of a class. The
parallelism is restricted to the way flows are partitioned across
queues.
Re: Per-CPU Queueing for QoS
2017-11-13 14:47 GMT-08:00 Alexander Duyck <alexander.du...@gmail.com>:
[earlier quoted discussion trimmed]
> The question I would have is how would this differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?

Yes, XPS + mqprio will achieve a similar effect. However, XPS + mqprio
requires quite some static configuration to make things work. Also, all
the overhead associated with the qdisc root lock still exists, even
though contention might be reduced. With a dedicated thread per CPU
queue we can avoid acquiring a lock for enqueue/dequeue, since each
queue effectively has a single producer and a single consumer. In the
case that multiple threads do share the same flow, lock contention
still exists, and we don't have per-class parallelism for rate limiting
if classes ever share the same cores - which is avoidable by having a
per-class thread run something like a leaky bucket. We still plan to
implement it as a classful qdisc.
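The single-producer/single-consumer property claimed above is what makes a lockless design plausible. A minimal sketch of such an SPSC ring in C11 atomics (illustrative only - the names, the fixed size, and the int payload are assumptions, not the proposed qdisc's code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Single-producer/single-consumer ring: the enqueue path (one CPU)
 * only writes `head`, the per-class drain thread only writes `tail`,
 * so neither side needs a lock. */
#define RING_SIZE 8 /* must be a power of two */

struct spsc_ring {
    int item[RING_SIZE];
    _Atomic size_t head; /* written only by the producer */
    _Atomic size_t tail; /* written only by the consumer */
};

static bool spsc_push(struct spsc_ring *r, int v)
{
    size_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t t = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (h - t == RING_SIZE)
        return false; /* full */
    r->item[h & (RING_SIZE - 1)] = v;
    /* release: publish the item before advancing head */
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
    return true;
}

static bool spsc_pop(struct spsc_ring *r, int *v)
{
    size_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t h = atomic_load_explicit(&r->head, memory_order_acquire);

    if (t == h)
        return false; /* empty */
    *v = r->item[t & (RING_SIZE - 1)];
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
    return true;
}
```

The acquire/release pairing on head/tail is all the synchronization an SPSC queue needs; the moment a second producer (a migrated task, or two threads on one flow) touches the same ring, this guarantee is lost - which is exactly the contention/ordering trade-off discussed in the thread.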
> From what I can tell XPS would be able to get you your per-cpu
> functionality; the benefit of it, though, would be that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only issue I see right now is that rate limiting with mqprio is
> assumed to be handled in hardware, due to mechanisms such as DCB.

mqprio can also have qdiscs like HTB attached to it, so this can
actually work without DCB.

> - Alex
Re: Per-CPU Queueing for QoS
2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
> On Sun, 12 Nov 2017 13:43:13 -0800
> Michael Ma <make0...@gmail.com> wrote:
>
>> Any comments? We plan to implement this as a qdisc and appreciate any
>> early feedback.
[quoted proposal trimmed]
>
> In general, there are no meta design discussions in Linux
> development. Several developers have tried to do lockless qdiscs and
> similar things in the past.
>
> The devil is in the details, show us the code.

Thanks for the response, Stephen. The code is fairly straightforward;
we have a per-cpu per-class queue defined like this:

struct bandwidth_group
{
    struct skb_list queues[MAX_CPU_COUNT];
    struct skb_list drain;
};

The "drain" queue is used to aggregate the per-cpu queues belonging to
the same class. In the enqueue function, we determine the cpu where the
packet is processed and enqueue it to the corresponding per-cpu queue:

int cpu;
struct bandwidth_group *bwg = &_rx_groups[bwgid];

cpu = get_cpu();
skb_list_append(&bwg->queues[cpu], skb);

Here we don't check the flow of the packet, so if there is task
migration, or multiple threads sending packets through the same flow,
we can theoretically have packets enqueued to different queues and
aggregated to the "drain" queue out of order.

Also, AFAIK there is no lockless htb-like qdisc implementation
currently; if there is already a similar effort ongoing, please let me
know.
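To make the shape of this concrete, here is a hedged userspace model of the structure above - skb_list is reduced to a fixed array of ints, MAX_CPU_COUNT and QUEUE_DEPTH are arbitrary constants, and synchronization is omitted; the real kernel code would of course carry skbs:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPU_COUNT 4
#define QUEUE_DEPTH   64

/* Stand-in for the kernel's skb list: a bounded array of "packets". */
struct skb_list {
    int buf[QUEUE_DEPTH];
    size_t len;
};

struct bandwidth_group {
    struct skb_list queues[MAX_CPU_COUNT]; /* one producer per CPU */
    struct skb_list drain;                 /* filled by the per-class thread */
};

/* Enqueue on the queue of the CPU the sender runs on. */
static void bwg_enqueue(struct bandwidth_group *bwg, int cpu, int skb)
{
    struct skb_list *q = &bwg->queues[cpu];

    if (q->len < QUEUE_DEPTH)
        q->buf[q->len++] = skb;
}

/* One polling pass of the per-class thread: move everything from the
 * per-CPU queues into the shared "drain" queue. Ordering across CPUs
 * is whatever the scan order gives us - the out-of-order risk the
 * thread discusses. */
static size_t bwg_drain_pass(struct bandwidth_group *bwg)
{
    size_t moved = 0;

    for (int cpu = 0; cpu < MAX_CPU_COUNT; cpu++) {
        struct skb_list *q = &bwg->queues[cpu];

        for (size_t i = 0; i < q->len && bwg->drain.len < QUEUE_DEPTH; i++) {
            bwg->drain.buf[bwg->drain.len++] = q->buf[i];
            moved++;
        }
        q->len = 0;
    }
    return moved;
}
```

Note that within one CPU's queue order is preserved; only packets of the same flow that land on *different* per-CPU queues (task migration, shared flows) can be reordered by the drain pass.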
Re: Per-CPU Queueing for QoS
Any comments? We plan to implement this as a qdisc and appreciate any
early feedback.

Thanks,
Michael

> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>
[quoted proposal trimmed]
Per-CPU Queueing for QoS
Currently txq/qdisc selection is based on flow hash, so packets from
the same flow keep their order when they enter the qdisc/txq, which
avoids the out-of-order problem.

To improve the concurrency of the QoS algorithm, we plan to have
multiple per-cpu queues for a single TC class and do busy polling from
a per-class thread to drain these queues. If we can do this frequently
enough, the out-of-order situation in this polling thread should not be
that bad.

To give more details - in the send path we introduce per-cpu per-class
queues so that packets from the same class and same core will be
enqueued to the same place. Then a per-class thread polls the queues
belonging to its class from all the cpus and aggregates them into
another per-class queue. This can effectively reduce contention, but
inevitably introduces a potential out-of-order issue.

Any concern/suggestion for working towards this direction?
Multiple Txqs with One Qdisc
Hi -

As I mentioned in a previous thread, we're implementing a qdisc similar
to mqprio which can associate multiple txqs with one qdisc, so that we
can work around the root qdisc bottleneck. I think this should in
general be fine, since without MQ the root qdisc is associated with
multiple txqs anyway. However, we encountered a kernel panic, as I
mentioned in the "corrupted skb" thread.

Any thoughts on potential issues with doing this? I'm asking because
mqprio only associates one txq with one qdisc, which I think must have
been done this way for a reason. Would appreciate it if anyone could
shed some light on this.

Thanks,
Michael
Re: Corrupted SKB
2017-04-18 21:46 GMT-07:00 Michael Ma <make0...@gmail.com>:
> 2017-04-18 16:12 GMT-07:00 Cong Wang <xiyou.wangc...@gmail.com>:
[earlier quoted discussion trimmed]
>> Either dropwatch or perf could do the work to instrument kfree_skb().
>
> Thanks - will try it out.

I'm using perf to collect the call stack for kfree_skb and trying to
correlate that with the corrupted SKB address. However, when the system
crashes, the perf.data file is also corrupted - how can I view this
file if the system crashes before perf exits?

>> How do you implement ->attach()? How does it work with
>> netdev_pick_tx()?
>
> attach() essentially grafts the default qdisc (pfifo) to each "txq
> group" represented by a TC class. For netdev_pick_tx() we use the
> classid of the socket to select a class, based on a "class id base"
> and the class-to-txq mapping defined together with this glue qdisc -
> it's pretty much the same as mqprio, with the difference of mapping
> one class to multiple txqs and selecting the txq through a hash.
Re: Corrupted SKB
2017-04-18 16:12 GMT-07:00 Cong Wang <xiyou.wangc...@gmail.com>: > On Mon, Apr 17, 2017 at 5:39 PM, Michael Ma <make0...@gmail.com> wrote: >> Hi - >> >> We've implemented a "glue" qdisc similar to mqprio which can associate >> one qdisc to multiple txqs as the root qdisc. Reference count of the >> child qdiscs have been adjusted properly in this case so that it >> represents the number of txqs it has been attached to. However when >> sending packets we saw the skb from dequeue_skb() corrupted with the >> following call stack: >> >> [exception RIP: netif_skb_features+51] >> RIP: 815292b3 RSP: 8817f6987940 RFLAGS: 00010246 >> >> #9 [8817f6987968] validate_xmit_skb at 815294aa >> #10 [8817f69879a0] validate_xmit_skb at 8152a0d9 >> #11 [8817f69879b0] __qdisc_run at 8154a193 >> #12 [8817f6987a00] dev_queue_xmit at 81529e03 >> >> It looks like the skb has already been released since its dev pointer >> field is invalid. >> >> Any clue on how this can be investigated further? My current thought >> is to add some instrumentation to the place where skb is released and >> analyze whether there is any race condition happening there. However > > Either dropwatch or perf could do the work to instrument kfree_skb(). Thanks - will try it out. > >> by looking through the existing code I think the case where one root >> qdisc is associated with multiple txqs already exists (when mqprio is >> not used) so not sure why it won't work when we group txqs and assign >> each group a root qdisc. Any insight on this issue would be much >> appreciated! > > How do you implement ->attach()? How does it work with netdev_pick_tx()? attach() essentially grafts the default qdisc(pfifo) to each "txq group" represented by a TC class. 
For netdev_pick_txq() we use classid of the socket to select a class based on a "class id base" and the class to txq mapping defined together with this glue qdisc - it's pretty much the same as mqprio with the difference of mapping one class to multiple txqs and selecting the txq through a hash.
Corrupted SKB
Hi - We've implemented a "glue" qdisc similar to mqprio which can associate one qdisc with multiple txqs as the root qdisc. Reference counts of the child qdiscs have been adjusted properly in this case so that they represent the number of txqs each has been attached to. However when sending packets we saw the skb from dequeue_skb() corrupted with the following call stack: [exception RIP: netif_skb_features+51] RIP: 815292b3 RSP: 8817f6987940 RFLAGS: 00010246 #9 [8817f6987968] validate_xmit_skb at 815294aa #10 [8817f69879a0] validate_xmit_skb at 8152a0d9 #11 [8817f69879b0] __qdisc_run at 8154a193 #12 [8817f6987a00] dev_queue_xmit at 81529e03 It looks like the skb has already been released since its dev pointer field is invalid. Any clue on how this can be investigated further? My current thought is to add some instrumentation to the place where the skb is released and analyze whether there is any race condition happening there. However, by looking through the existing code I think the case where one root qdisc is associated with multiple txqs already exists (when mqprio is not used), so I'm not sure why it won't work when we group txqs and assign each group a root qdisc. Any insight on this issue would be much appreciated! Thanks, Michael
Re: Why do we need tasklet in IFB?
2016-11-01 4:38 GMT-07:00 Jamal Hadi Salim <j...@mojatatu.com>: > On 16-10-31 02:10 PM, David Miller wrote: >> >> From: Michael Ma <make0...@gmail.com> >> Date: Mon, 31 Oct 2016 11:02:28 -0700 >> >>> 2016-10-28 14:52 GMT-07:00 Michael Ma <make0...@gmail.com>: >>>> >>>> 2016-10-28 14:48 GMT-07:00 Stephen Hemminger >>>> <step...@networkplumber.org>: >>>>> >>>>> On Fri, 28 Oct 2016 14:45:07 -0700 >>>>> Michael Ma <make0...@gmail.com> wrote: >>>>> >>>>>> 2016-10-28 14:38 GMT-07:00 Stephen Hemminger >>>>>> <step...@networkplumber.org>: >>>>>>> >>>>>>> On Fri, 28 Oct 2016 14:36:27 -0700 >>>>>>> Michael Ma <make0...@gmail.com> wrote: >>>>>>> >>>>>>>> Hi - >>>>>>>> >>>>>>>> Currently IFB uses tasklet to process tx/rx on the interface that >>>>>>>> forwarded the packet to IFB. My understanding on why we're doing >>>>>>>> this >>>>>>>> is that since dev_queue_xmit() can be invoked in interrupt, we want >>>>>>>> to >>>>>>>> defer the processing of original tx/rx in case ifb_xmit() is called >>>>>>>> from interrupt. >>>>>>> >>>>>>> >>>>>>> dev_queue_xmit is only called from interrupt if doing netconsole. >>>>>> >>>>>> >>> >>> In fact this doesn't seem to explain since if the original path is tx >>> and the context is interrupt, IFB will call dev_queue_xmit as well so >>> the context can be interrupt in that case. >>> >>> Then tasklet is still unnecessary. >> >> >> Perhaps it's necessary to deal with potential recursion, that's my only >> guess at this point. >> >> Jamal, do you remember what went through your mind back in 2005? :-) >> > > Certainly recursions (which were a huge issue in the original > implementations) as well as the need to have qdisc which shape traffic > (which implies they will sit on the queue for some period of time and > where accuracy of timing is important; therefore low latency of > scheduling was needed). > Thanks for the explanation - I can get the point of avoiding recursion. 
However, latency-wise, if we don't have a queue/tasklet in IFB and rely on the upstream/downstream part (either qdisc or tx/rx) to do the queueing, latency wouldn't be a problem anyway, or did I miss anything here? > I didn't follow the context of "tasklets are unnecessary": > Are tasklets bad or is the OP interested in replacing them with > something s/he thinks is better? IIRC, at some point the idea was to > in fact use a softirq but it became obvious it was overkill. I was thinking of directly forwarding the packet in the thread context of IFB since it's really an intermediate forwarding component and additional queueing/tasklet doesn't seem to be necessary (setting the deep call stack/recursion concern aside). The problem I have found is that queue selection on the NIC will actually be inherited in IFB, and that affects how packets are handled in the tasklet since a tasklet is associated with each queue. We don't really have a strict one-to-one mapping of cpu core to queue on the NIC, whereas in the IFB case the queue is strictly mapped to a tasklet/cpu core, which can cause undesirable contention. The question here was raised mostly because I wanted to adopt the thread context of the packet in IFB. > > cheers, > jamal
Re: Why do we need tasklet in IFB?
2016-10-31 11:02 GMT-07:00 Michael Ma <make0...@gmail.com>: > 2016-10-28 14:52 GMT-07:00 Michael Ma <make0...@gmail.com>: >> 2016-10-28 14:48 GMT-07:00 Stephen Hemminger <step...@networkplumber.org>: >>> On Fri, 28 Oct 2016 14:45:07 -0700 >>> Michael Ma <make0...@gmail.com> wrote: >>> >>>> 2016-10-28 14:38 GMT-07:00 Stephen Hemminger <step...@networkplumber.org>: >>>> > On Fri, 28 Oct 2016 14:36:27 -0700 >>>> > Michael Ma <make0...@gmail.com> wrote: >>>> > >>>> >> Hi - >>>> >> >>>> >> Currently IFB uses tasklet to process tx/rx on the interface that >>>> >> forwarded the packet to IFB. My understanding on why we're doing this >>>> >> is that since dev_queue_xmit() can be invoked in interrupt, we want to >>>> >> defer the processing of original tx/rx in case ifb_xmit() is called >>>> >> from interrupt. >>>> > >>>> > dev_queue_xmit is only called from interrupt if doing netconsole. >>>> > > In fact this doesn't seem to explain since if the original path is tx > and the context is interrupt, IFB will call dev_queue_xmit as well so > the context can be interrupt in that case. > > Then tasklet is still unnecessary. > >>>> OK - so the reason is that netif_receive_skb() can only be invoked >>>> from softirq and we have to use tasklet in IFB to guarantee this. >>>> >>>> Then if the original path is rx, tasklet is unnecessary because >>>> ifb_xmit() is invoked from netif_receive_skb() which is always in the >>>> softirq context, right? >>> >>> The other reason is to avoid deep kernel callstacks I see - this seems to be the ultimate reason but in case we know the actual IFB configuration is simple then tasklet can still be avoided, right?
Re: Why do we need tasklet in IFB?
2016-10-28 14:52 GMT-07:00 Michael Ma <make0...@gmail.com>: > 2016-10-28 14:48 GMT-07:00 Stephen Hemminger <step...@networkplumber.org>: >> On Fri, 28 Oct 2016 14:45:07 -0700 >> Michael Ma <make0...@gmail.com> wrote: >> >>> 2016-10-28 14:38 GMT-07:00 Stephen Hemminger <step...@networkplumber.org>: >>> > On Fri, 28 Oct 2016 14:36:27 -0700 >>> > Michael Ma <make0...@gmail.com> wrote: >>> > >>> >> Hi - >>> >> >>> >> Currently IFB uses tasklet to process tx/rx on the interface that >>> >> forwarded the packet to IFB. My understanding on why we're doing this >>> >> is that since dev_queue_xmit() can be invoked in interrupt, we want to >>> >> defer the processing of original tx/rx in case ifb_xmit() is called >>> >> from interrupt. >>> > >>> > dev_queue_xmit is only called from interrupt if doing netconsole. >>> In fact this doesn't seem to explain since if the original path is tx and the context is interrupt, IFB will call dev_queue_xmit as well so the context can be interrupt in that case. Then tasklet is still unnecessary. >>> OK - so the reason is that netif_receive_skb() can only be invoked >>> from softirq and we have to use tasklet in IFB to guarantee this. >>> >>> Then if the original path is rx, tasklet is unnecessary because >>> ifb_xmit() is invoked from netif_receive_skb() which is always in the >>> softirq context, right? >> >> The other reason is to avoid deep kernel callstacks
Why do we need tasklet in IFB?
Hi - Currently IFB uses a tasklet to process tx/rx on the interface that forwarded the packet to IFB. My understanding of why we're doing this is that since dev_queue_xmit() can be invoked in interrupt context, we want to defer the processing of the original tx/rx in case ifb_xmit() is called from interrupt context. However, if the packet originally came from rx, the calling context should already be a tasklet and there is no need to queue the processing to another tasklet anymore. Even if the packet originates from tx, we can rely on the deferred processing of the "original device" or TC if necessary. Did I miss anything here? Furthermore, looking at the bonding device's code, there isn't this kind of buffering/tasklet handling for packet forwarding even though bond also has its own txq/rxq configured separately from the actual nic, which is very similar to IFB. So why do we need a tasklet in IFB? Thanks, Michael
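For context, the usual way IFB enters the picture (and hence where ifb_xmit() and the tasklet path discussed above run) is via a mirred redirect from an ingress qdisc; a minimal example, assuming eth0 and the ifb module are available:

```shell
# Redirect all ingress traffic on eth0 into ifb0, where a normal egress
# qdisc can shape it. The match-anything u32 filter is the standard idiom.
modprobe ifb numifbs=1
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
    action mirred egress redirect dev ifb0
```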
Re: Modification to skb->queue_mapping affecting performance
2016-09-16 15:00 GMT-07:00 Michael Ma <make0...@gmail.com>: > 2016-09-16 12:53 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >> On Fri, 2016-09-16 at 10:57 -0700, Michael Ma wrote: >> >>> This is actually the problem - if flows from different RX queues are >>> switched to the same RX queue in IFB, they'll use different processor >>> context with the same tasklet, and the processor context of different >>> tasklets might be the same. So multiple tasklets in IFB competes for >>> the same core when queue is switched. >>> >>> The following simple fix proved this - with this change even switching >>> the queue won't affect small packet bandwidth/latency anymore: >>> >>> in ifb.c: >>> >>> - struct ifb_q_private *txp = dp->tx_private + >>> skb_get_queue_mapping(skb); >>> + struct ifb_q_private *txp = dp->tx_private + >>> (smp_processor_id() % dev->num_tx_queues); >>> >>> This should be more efficient since we're not sending the task to a >>> different processor, instead we try to queue the packet to an >>> appropriate tasklet based on the processor ID. Will this cause any >>> packet out-of-order problem? If packets from the same flow are queued >>> to the same RX queue due to RSS, and processor affinity is set for RX >>> queues, I assume packets from the same flow will end up in the same >>> core when tasklet is scheduled. But I might have missed some uncommon >>> cases here... Would appreciate if anyone can provide more insights. >> >> Wait, don't you have proper smp affinity for the RX queues on your NIC ? >> >> ( Documentation/networking/scaling.txt RSS IRQ Configuration ) >> > Yes - what I was trying to say is that this change will be more > efficient than using smp_call_function_single() to schedule the > tasklet to a different processor. > > RSS IRQ should be set properly already. 
The issue here is that I'll > need to switch the queue mapping for NIC RX to a different TXQ on IFB, > which allows me to classify the flows at the IFB TXQ layer and avoid > qdisc lock contention. > > When that switch happens, ideally processor core shouldn't be switched > because all the thread context isn't changed. The work in tasklet > should be scheduled to the same processor as well. That's why I tried > this change. Also conceptually IFB is a software device which should > be able to schedule its workload independent from how NIC is > configured for the interrupt handling. > >> A driver ndo_start_xmit() MUST use skb_get_queue_mapping(skb), because >> the driver queue is locked before ndo_start_xmit()) (for non >> NETIF_F_LLTX drivers at least) >> > > Thanks a lot for pointing out this! I was expecting this kind of > guidance... Then the options would be: > > 1. Use smp_call_function_single() to schedule the tasklet to a core > statically mapped to the IFB TXQ, which is very similar to how TX/RX > IRQ is configured. This actually won't help with the throughput because ultimately load will still be concentrated to some particular cores after packets are concentrated to a TXQ due to queue level classification. > 2. As you suggested below add some additional action to do the > rescheduling before entering IFB - for example when receiving the > packet we could just use RSS to redirect to the desired RXQ, however > this doesn't seem to be easy, especially compared with the way how > mqprio chooses the queue. The challenge here is that IFB queue > selection is based on queue_mapping when skb arrives at IFB and core > selection is based on RXQ on NIC and so it's also based on > queue_mapping when skb arrives at NIC. Then these two queue_mappings > must be the same so that there is no core conflict of processing two > TXQs of IFB. Then this essentially means we have to change queue > mapping of the NIC on the receiver side which can't be achieved using > TC. 
> I tried to explore this further - there is actually XPS on ifb which can be used to specify the processor cores that will be used to process each TXQ of ifb, however the problem is similar to the above - eventually I'll have a few cores processing these queues instead of having all the cores processing together with relatively light contention. And this again reduces the throughput. So there isn't a good place to do this. The ultimate problem is that we're trying to work around the qdisc spin lock problem by leveraging the independence of TXQs, but at the same time after the qdisc phase we also want to maximize the utilization of cores across whatever TXQs are used. >> In case of ifb, __skb_queue_tail(&txp->rq, skb); could corrupt the skb >> list.
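For reference, the XPS experiment mentioned above can be expressed through the standard sysfs interface; the device name and mask values here are illustrative:

```shell
# Pin ifb0's tx-0 to CPUs 0-1 and tx-1 to CPUs 2-3 (hex CPU masks).
# As discussed above, this concentrates each queue's tasklet work on a
# few cores instead of spreading it across all of them.
echo 3 > /sys/class/net/ifb0/queues/tx-0/xps_cpus
echo c > /sys/class/net/ifb0/queues/tx-1/xps_cpus
```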
Re: Modification to skb->queue_mapping affecting performance
2016-09-16 12:53 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: > On Fri, 2016-09-16 at 10:57 -0700, Michael Ma wrote: > >> This is actually the problem - if flows from different RX queues are >> switched to the same RX queue in IFB, they'll use different processor >> context with the same tasklet, and the processor context of different >> tasklets might be the same. So multiple tasklets in IFB competes for >> the same core when queue is switched. >> >> The following simple fix proved this - with this change even switching >> the queue won't affect small packet bandwidth/latency anymore: >> >> in ifb.c: >> >> - struct ifb_q_private *txp = dp->tx_private + >> skb_get_queue_mapping(skb); >> + struct ifb_q_private *txp = dp->tx_private + >> (smp_processor_id() % dev->num_tx_queues); >> >> This should be more efficient since we're not sending the task to a >> different processor, instead we try to queue the packet to an >> appropriate tasklet based on the processor ID. Will this cause any >> packet out-of-order problem? If packets from the same flow are queued >> to the same RX queue due to RSS, and processor affinity is set for RX >> queues, I assume packets from the same flow will end up in the same >> core when tasklet is scheduled. But I might have missed some uncommon >> cases here... Would appreciate if anyone can provide more insights. > > Wait, don't you have proper smp affinity for the RX queues on your NIC ? > > ( Documentation/networking/scaling.txt RSS IRQ Configuration ) > Yes - what I was trying to say is that this change will be more efficient than using smp_call_function_single() to schedule the tasklet to a different processor. RSS IRQ should be set properly already. The issue here is that I'll need to switch the queue mapping for NIC RX to a different TXQ on IFB, which allows me to classify the flows at the IFB TXQ layer and avoid qdisc lock contention. 
When that switch happens, ideally the processor core shouldn't be switched because the thread context isn't changed. The work in the tasklet should be scheduled to the same processor as well. That's why I tried this change. Also, conceptually IFB is a software device which should be able to schedule its workload independently of how the NIC is configured for interrupt handling. > A driver ndo_start_xmit() MUST use skb_get_queue_mapping(skb), because > the driver queue is locked before ndo_start_xmit()) (for non > NETIF_F_LLTX drivers at least) > Thanks a lot for pointing this out! I was expecting this kind of guidance... Then the options would be: 1. Use smp_call_function_single() to schedule the tasklet to a core statically mapped to the IFB TXQ, which is very similar to how TX/RX IRQ is configured. 2. As you suggested below, add some additional action to do the rescheduling before entering IFB - for example when receiving the packet we could just use RSS to redirect to the desired RXQ, however this doesn't seem to be easy, especially compared with the way mqprio chooses the queue. The challenge here is that IFB queue selection is based on queue_mapping when the skb arrives at IFB, and core selection is based on the RXQ on the NIC, so it's also based on queue_mapping when the skb arrives at the NIC. Then these two queue_mappings must be the same so that there is no core conflict in processing two TXQs of IFB. This essentially means we have to change the queue mapping of the NIC on the receiver side, which can't be achieved using TC. > In case of ifb, __skb_queue_tail(&txp->rq, skb); could corrupt the skb > list. > > In any case, you could have an action to do this before reaching IFB. > > >
Re: Modification to skb->queue_mapping affecting performance
2016-09-15 17:51 GMT-07:00 Michael Ma <make0...@gmail.com>: > 2016-09-14 10:46 GMT-07:00 Michael Ma <make0...@gmail.com>: >> 2016-09-13 22:22 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >>> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote: >>> >>>> I don't intend to install multiple qdisc - the only reason that I'm >>>> doing this now is to leverage MQ to workaround the lock contention, >>>> and based on the profile this all worked. However to simplify the way >>>> to setup HTB I wanted to use TXQ to partition HTB classes so that a >>>> HTB class only belongs to one TXQ, which also requires mapping skb to >>>> TXQ using some rules (here I'm using priority but I assume it's >>>> straightforward to use other information such as classid). And the >>>> problem I found here is that when using priority to infer the TXQ so >>>> that queue_mapping is changed, bandwidth is affected significantly - >>>> the only thing I can guess is that due to queue switch, there are more >>>> cache misses assuming processor cores have a static mapping to all the >>>> queues. Any suggestion on what to do next for the investigation? >>>> >>>> I would also guess that this should be a common problem if anyone >>>> wants to use MQ+IFB to workaround the qdisc lock contention on the >>>> receiver side and classful qdisc is used on IFB, but haven't really >>>> found a similar thread here... >>> >>> But why are you changing the queue ? >>> >>> NIC already does the proper RSS thing, meaning all packets of one flow >>> should land on one RX queue. No need to ' classify yourself and risk >>> lock contention' >>> >>> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no >>> problem. >>> >>> Do you really need to rate limit flows ? Not clear what are your goals, >>> why for example you use HTB to begin with. >>> >> Yes. My goal is to set different min/max bandwidth limits for >> different processes, so we started with HTB. 
However with HTB the >> qdisc root lock contention caused some unintended correlation between >> flows in different classes. For example if some flows belonging to one >> class have large amount of small packets, other flows in a different >> class will get their effective bandwidth reduced because they'll wait >> longer for the root lock. Using MQ this can be avoided because I'll >> just put flows belonging to one class to its dedicated TXQ. Then >> classes within one HTB on a TXQ will still have the lock contention >> problem but classes in different HTB will use different root locks so >> the contention doesn't exist. >> >> This also means that I'll need to classify packets to different >> TXQ/HTB based on some skb metadata (essentially similar to what mqprio >> is doing). So TXQ might need to be switched to achieve this. > > My current theory to this problem is that tasklets in IFB might be > scheduled to the same cpu core if the RXQ happens to be the same for > two different flows. When queue_mapping is modified and multiple flows > are concentrated to the same IFB TXQ because they need to be > controlled by the same HTB, they'll have to use the same tasklet > because of the way IFB is implemented. So if other flows belonging to > a different TXQ/tasklet happens to be scheduled on the same core, that > core can be overloaded and becomes the bottleneck. Without modifying > the queue_mapping the chance of this contention is much lower. > > This is a speculation based on the increased si time in softirqd > process. I'll try to affinitize each tasklet with a cpu core to verify > whether this is the problem. I also noticed that in the past there was > a similar proposal of scheduling the tasklet to a dedicated core which > was not committed(https://patchwork.ozlabs.org/patch/38486/). I'll try > something similar to verify this theory. 
This is actually the problem - if flows from different RX queues are switched to the same RX queue in IFB, they'll use different processor contexts with the same tasklet, and the processor context of different tasklets might be the same. So multiple tasklets in IFB compete for the same core when the queue is switched. The following simple fix proved this - with this change even switching the queue won't affect small packet bandwidth/latency anymore: in ifb.c: - struct ifb_q_private *txp = dp->tx_private + skb_get_queue_mapping(skb); + struct ifb_q_private *txp = dp->tx_private + (smp_processor_id() % dev->num_tx_queues); This should be more efficient since we're not sending the task to a different processor; instead we try to queue the packet to an appropriate tasklet based on the processor ID. Will this cause any packet out-of-order problem? If packets from the same flow are queued to the same RX queue due to RSS, and processor affinity is set for RX queues, I assume packets from the same flow will end up on the same core when the tasklet is scheduled. But I might have missed some uncommon cases here... Would appreciate it if anyone could provide more insights.
Re: Modification to skb->queue_mapping affecting performance
2016-09-14 10:46 GMT-07:00 Michael Ma <make0...@gmail.com>: > 2016-09-13 22:22 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote: >> >>> I don't intend to install multiple qdisc - the only reason that I'm >>> doing this now is to leverage MQ to workaround the lock contention, >>> and based on the profile this all worked. However to simplify the way >>> to setup HTB I wanted to use TXQ to partition HTB classes so that a >>> HTB class only belongs to one TXQ, which also requires mapping skb to >>> TXQ using some rules (here I'm using priority but I assume it's >>> straightforward to use other information such as classid). And the >>> problem I found here is that when using priority to infer the TXQ so >>> that queue_mapping is changed, bandwidth is affected significantly - >>> the only thing I can guess is that due to queue switch, there are more >>> cache misses assuming processor cores have a static mapping to all the >>> queues. Any suggestion on what to do next for the investigation? >>> >>> I would also guess that this should be a common problem if anyone >>> wants to use MQ+IFB to workaround the qdisc lock contention on the >>> receiver side and classful qdisc is used on IFB, but haven't really >>> found a similar thread here... >> >> But why are you changing the queue ? >> >> NIC already does the proper RSS thing, meaning all packets of one flow >> should land on one RX queue. No need to ' classify yourself and risk >> lock contention' >> >> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no >> problem. >> >> Do you really need to rate limit flows ? Not clear what are your goals, >> why for example you use HTB to begin with. >> > Yes. My goal is to set different min/max bandwidth limits for > different processes, so we started with HTB. However with HTB the > qdisc root lock contention caused some unintended correlation between > flows in different classes. 
For example if some flows belonging to one > class have large amount of small packets, other flows in a different > class will get their effective bandwidth reduced because they'll wait > longer for the root lock. Using MQ this can be avoided because I'll > just put flows belonging to one class to its dedicated TXQ. Then > classes within one HTB on a TXQ will still have the lock contention > problem but classes in different HTB will use different root locks so > the contention doesn't exist. > > This also means that I'll need to classify packets to different > TXQ/HTB based on some skb metadata (essentially similar to what mqprio > is doing). So TXQ might need to be switched to achieve this. My current theory to this problem is that tasklets in IFB might be scheduled to the same cpu core if the RXQ happens to be the same for two different flows. When queue_mapping is modified and multiple flows are concentrated to the same IFB TXQ because they need to be controlled by the same HTB, they'll have to use the same tasklet because of the way IFB is implemented. So if other flows belonging to a different TXQ/tasklet happens to be scheduled on the same core, that core can be overloaded and becomes the bottleneck. Without modifying the queue_mapping the chance of this contention is much lower. This is a speculation based on the increased si time in softirqd process. I'll try to affinitize each tasklet with a cpu core to verify whether this is the problem. I also noticed that in the past there was a similar proposal of scheduling the tasklet to a dedicated core which was not committed(https://patchwork.ozlabs.org/patch/38486/). I'll try something similar to verify this theory.
Re: Modification to skb->queue_mapping affecting performance
2016-09-13 22:22 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: > On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote: > >> I don't intend to install multiple qdisc - the only reason that I'm >> doing this now is to leverage MQ to workaround the lock contention, >> and based on the profile this all worked. However to simplify the way >> to setup HTB I wanted to use TXQ to partition HTB classes so that a >> HTB class only belongs to one TXQ, which also requires mapping skb to >> TXQ using some rules (here I'm using priority but I assume it's >> straightforward to use other information such as classid). And the >> problem I found here is that when using priority to infer the TXQ so >> that queue_mapping is changed, bandwidth is affected significantly - >> the only thing I can guess is that due to queue switch, there are more >> cache misses assuming processor cores have a static mapping to all the >> queues. Any suggestion on what to do next for the investigation? >> >> I would also guess that this should be a common problem if anyone >> wants to use MQ+IFB to workaround the qdisc lock contention on the >> receiver side and classful qdisc is used on IFB, but haven't really >> found a similar thread here... > > But why are you changing the queue ? > > NIC already does the proper RSS thing, meaning all packets of one flow > should land on one RX queue. No need to ' classify yourself and risk > lock contention' > > I use IFB + MQ + netem every day, and it scales to 10 Mpps with no > problem. > > Do you really need to rate limit flows ? Not clear what are your goals, > why for example you use HTB to begin with. > Yes. My goal is to set different min/max bandwidth limits for different processes, so we started with HTB. However with HTB the qdisc root lock contention caused some unintended correlation between flows in different classes. 
For example if some flows belonging to one class carry a large number of small packets, other flows in a different class will get their effective bandwidth reduced because they'll wait longer for the root lock. Using MQ this can be avoided because I'll just put flows belonging to one class on its dedicated TXQ. Then classes within one HTB on a TXQ will still have the lock contention problem, but classes in different HTB instances will use different root locks, so that contention doesn't exist. This also means that I'll need to classify packets to different TXQ/HTB based on some skb metadata (essentially similar to what mqprio is doing). So the TXQ might need to be switched to achieve this.
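The MQ-with-per-TXQ-HTB layout described above can be sketched with tc as follows (the rates and the two-queue ifb device are illustrative; ifb0 would need to be created multiqueue, e.g. `ip link add ifb0 numtxqueues 2 type ifb`):

```shell
# Each MQ child gets its own HTB root, so classes under different TXQs
# contend on different root qdisc locks.
tc qdisc add dev ifb0 root handle 1: mq
tc qdisc add dev ifb0 parent 1:1 handle 10: htb default 1
tc class add dev ifb0 parent 10: classid 10:1 htb rate 1gbit ceil 2gbit
tc qdisc add dev ifb0 parent 1:2 handle 20: htb default 1
tc class add dev ifb0 parent 20: classid 20:1 htb rate 500mbit ceil 1gbit
```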
Re: Modification to skb->queue_mapping affecting performance
2016-09-13 22:13 GMT-07:00 Michael Ma <make0...@gmail.com>: > 2016-09-13 18:18 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >> On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote: >> >>> If I understand correctly this is still to associate a qdisc with each >>> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need >>> to divide the bandwidth of each class in HTB by the number of TX >>> queues for each individual HTB qdisc associated? >>> >>> My original idea was to attach a HTB qdisc for each ifb queue >>> representing a set of flows not sharing bandwidth with others so that >>> root lock contention still happens but only affects flows in the same >>> HTB. Did I understand the root lock contention issue incorrectly for >>> ifb? I do see some comments in __dev_queue_xmit() about using a >>> different code path for software devices which bypasses >>> __dev_xmit_skb(). Does this mean ifb won't go through >>> __dev_xmit_skb()? >> >> You can install HTB on all of your MQ children for sure. >> >> Again, there is no qdisc lock contention if you properly use MQ. >> >> Now if you _need_ to install a single qdisc for whatever reason, then >> maybe you want to use a single rx queue on the NIC, to reduce lock >> contention ;) Yes - this might reduce lock contention but there would still be contention and I'm really looking for more concurrency... >> >> > I don't intend to install multiple qdisc - the only reason that I'm > doing this now is to leverage MQ to workaround the lock contention, > and based on the profile this all worked. However to simplify the way > to setup HTB I wanted to use TXQ to partition HTB classes so that a > HTB class only belongs to one TXQ, which also requires mapping skb to > TXQ using some rules (here I'm using priority but I assume it's > straightforward to use other information such as classid). 
And the > problem I found here is that when using priority to infer the TXQ so > that queue_mapping is changed, bandwidth is affected significantly - > the only thing I can guess is that due to queue switch, there are more > cache misses assuming processor cores have a static mapping to all the > queues. Any suggestion on what to do next for the investigation? > > I would also guess that this should be a common problem if anyone > wants to use MQ+IFB to workaround the qdisc lock contention on the > receiver side and classful qdisc is used on IFB, but haven't really > found a similar thread here... Hi Cong - I saw quite a few threads from you regarding ingress qdisc + MQ and queue_mapping issues. Do you by any chance have a similar setup (classful qdiscs associated with the queues of IFB, which requires queue_mapping modification so that qdisc selection is done at queue-selection time based on information such as skb priority/classid)? Would appreciate any suggestions.
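One way to check the static core-to-queue-mapping guess above is to dump each TX queue's XPS CPU mask: if queue_mapping rewrites move a flow onto a TXQ whose mask excludes the sending CPU, every transmit crosses cores. A diagnostic sketch (the device name is an example; ifb may not expose XPS on all kernels, hence the error suppression):

```shell
# Print the XPS CPU affinity of every TX queue on the device, to see
# whether the priority-derived queues align with the sending CPUs.
IFB=ifb0   # example device name
for q in /sys/class/net/$IFB/queues/tx-*; do
    echo "$q: cpus=$(cat $q/xps_cpus 2>/dev/null)"
done
```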
Re: Modification to skb->queue_mapping affecting performance
2016-09-13 18:18 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: > On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote: > >> If I understand correctly this is still to associate a qdisc with each >> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need >> to divide the bandwidth of each class in HTB by the number of TX >> queues for each individual HTB qdisc associated? >> >> My original idea was to attach a HTB qdisc for each ifb queue >> representing a set of flows not sharing bandwidth with others so that >> root lock contention still happens but only affects flows in the same >> HTB. Did I understand the root lock contention issue incorrectly for >> ifb? I do see some comments in __dev_queue_xmit() about using a >> different code path for software devices which bypasses >> __dev_xmit_skb(). Does this mean ifb won't go through >> __dev_xmit_skb()? > > You can install HTB on all of your MQ children for sure. > > Again, there is no qdisc lock contention if you properly use MQ. > > Now if you _need_ to install a single qdisc for whatever reason, then > maybe you want to use a single rx queue on the NIC, to reduce lock > contention ;) > > I don't intend to install multiple qdisc - the only reason that I'm doing this now is to leverage MQ to workaround the lock contention, and based on the profile this all worked. However to simplify the way to setup HTB I wanted to use TXQ to partition HTB classes so that a HTB class only belongs to one TXQ, which also requires mapping skb to TXQ using some rules (here I'm using priority but I assume it's straightforward to use other information such as classid). And the problem I found here is that when using priority to infer the TXQ so that queue_mapping is changed, bandwidth is affected significantly - the only thing I can guess is that due to queue switch, there are more cache misses assuming processor cores have a static mapping to all the queues. Any suggestion on what to do next for the investigation? 
I would also guess that this should be a common problem if anyone wants to use MQ+IFB to workaround the qdisc lock contention on the receiver side and classful qdisc is used on IFB, but haven't really found a similar thread here...
Re: Modification to skb->queue_mapping affecting performance
2016-09-13 17:09 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: > On Tue, 2016-09-13 at 16:30 -0700, Michael Ma wrote: > >> The RX queue number I found from "ls /sys/class/net/eth0/queues" is >> 64. (is this the correct way of identifying the queue number on NIC?) >> I setup ifb with 24 queues which is equal to the TX queue number of >> eth0 and also the number of CPU cores. > > Please do not drop netdev@ from this mail exchange. Sorry that I accidentally dropped that. > > ethtool -l eth0 > >> >> > There is no qdisc lock contention anymore AFAIK, since each cpu will use >> > a dedicate IFB queue and tasklet. >> > >> How is this achieved? I thought qdisc on ifb will still be protected >> by the qdisc root lock in __dev_xmit_skb() so essentially all threads >> processing qdisc are still serialized without using MQ? > > You have to properly setup ifb/mq like in : > > # netem based setup, installed at receiver side only > ETH=eth0 > IFB=ifb10 > #DELAY="delay 100ms" > EST="est 1sec 4sec" > #REORDER=1000us > #LOSS="loss 2.0" > TXQ=24 # change this to number of TX queues on the physical NIC > > ip link add $IFB numtxqueues $TXQ type ifb > ip link set dev $IFB up > > tc qdisc del dev $ETH ingress 2>/dev/null > tc qdisc add dev $ETH ingress 2>/dev/null > > tc filter add dev $ETH parent : \ >protocol ip u32 match u32 0 0 flowid 1:1 \ > action mirred egress redirect dev $IFB > > tc qdisc del dev $IFB root 2>/dev/null > > tc qdisc add dev $IFB root handle 1: mq > for i in `seq 1 $TXQ` > do > slot=$( printf %x $(( i )) ) > tc qd add dev $IFB parent 1:$slot $EST netem \ > limit 10 $DELAY $REORDER $LOSS > done > > If I understand correctly this is still to associate a qdisc with each ifb TXQ. How should I do this if I want to use HTB? I guess I'll need to divide the bandwidth of each class in HTB by the number of TX queues for each individual HTB qdisc associated? 
My original idea was to attach a HTB qdisc for each ifb queue representing a set of flows not sharing bandwidth with others so that root lock contention still happens but only affects flows in the same HTB. Did I understand the root lock contention issue incorrectly for ifb? I do see some comments in __dev_queue_xmit() about using a different code path for software devices which bypasses __dev_xmit_skb(). Does this mean ifb won't go through __dev_xmit_skb()?
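Eric's answer ("install HTB on all of your MQ children") can be sketched by adapting his netem script above, replacing netem with a per-queue HTB. The aggregate rate, queue count, and the choice of an equal static split per queue are assumptions, not from the thread:

```shell
# Sketch: one HTB per ifb TX queue, statically splitting a 10 Gbit
# aggregate evenly across the queues. Names and rates are illustrative.
IFB=ifb10
TXQ=24                           # number of TX queues on the ifb
PER_Q_RATE=$(( 10000 / TXQ ))    # Mbit per queue, static partition

tc qdisc del dev $IFB root 2>/dev/null
tc qdisc add dev $IFB root handle 1: mq
for i in `seq 1 $TXQ`
do
    slot=$( printf %x $(( i )) )
    tc qdisc add dev $IFB parent 1:$slot handle ${slot}1: htb default 10
    tc class add dev $IFB parent ${slot}1: classid ${slot}1:10 \
        htb rate ${PER_Q_RATE}Mbit
done
```

Each queue then has its own root lock, at the cost of queues not being able to borrow unused bandwidth from their siblings — which is the inconvenience discussed earlier in the thread.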
Modification to skb->queue_mapping affecting performance
Hi - We currently use mqprio on ifb to work around the qdisc root lock contention on the receiver side. The problem that we found was that queue_mapping is already set when redirecting from ingress qdisc to ifb (based on RX selection, I guess?) so the TX queue selection is not based on priority. Then we implemented a filter which can set skb->queue_mapping to 0 so that TX queue selection can be done as expected and flows with different priorities will go through different TX queues. However with the queue_mapping recomputed, we found the achievable bandwidth with small packets (512 bytes) dropped significantly if they're targeting different queues. From perf profile I don't see any bottleneck from CPU perspective. Any thoughts on why modifying queue_mapping will have this kind of effect? Also is there any better way of achieving receiver side throttling using HTB while avoiding the qdisc root lock on ifb? Thanks, Michael
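The "filter which can set skb->queue_mapping to 0" described above could look roughly like the following with the stock skbedit action chained before the mirred redirect (whether this matches the author's actual filter is an assumption; queue_mapping 0 means "no queue recorded", so the ifb side re-selects the queue):

```shell
# Sketch: clear the RX-derived queue_mapping while redirecting into ifb,
# so ifb's own TX queue selection (e.g. mqprio's priority map) runs.
ETH=eth0
IFB=ifb0    # example device names
tc qdisc add dev $ETH handle ffff: ingress
tc filter add dev $ETH parent ffff: protocol ip u32 match u32 0 0 \
    action skbedit queue_mapping 0 \
    action mirred egress redirect dev $IFB
```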
Re: qdisc spin lock
2016-04-21 15:12 GMT-07:00 Michael Ma <make0...@gmail.com>: > 2016-04-21 5:41 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >> On Wed, 2016-04-20 at 22:51 -0700, Michael Ma wrote: >>> 2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >>> > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote: >>> >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >>> >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: >>> >> >> I didn't really know that multiple qdiscs can be isolated using MQ so >>> >> >> that each txq can be associated with a particular qdisc. Also we don't >>> >> >> really have multiple interfaces... >>> >> >> >>> >> >> With this MQ solution we'll still need to assign transmit queues to >>> >> >> different classes by doing some math on the bandwidth limit if I >>> >> >> understand correctly, which seems to be less convenient compared with >>> >> >> a solution purely within HTB. >>> >> >> >>> >> >> I assume that with this solution I can still share qdisc among >>> >> >> multiple transmit queues - please let me know if this is not the case. >>> >> > >>> >> > Note that this MQ + HTB thing works well, unless you use a bonding >>> >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing >>> >> > tokens between the slaves) >>> >> >>> >> Actually MQ+HTB works well for small packets - like flow of 512 byte >>> >> packets can be throttled by HTB using one txq without being affected >>> >> by other flows with small packets. However I found using this solution >>> >> large packets (10k for example) will only achieve very limited >>> >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets >>> >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by >>> >> using 30 threads. But sending 10k packets using 10 threads has only 10 >>> >> Mbit/s with the same TC configuration. 
If I increase burst and cburst >>> >> of HTB to some extreme large value (like 50MB) the ceiling rate can be >>> >> hit. >>> >> >>> >> The strange thing is that I don't see this problem when using HTB as >>> >> the root. So txq number seems to be a factor here - however it's >>> >> really hard to understand why would it only affect larger packets. Is >>> >> this a known issue? Any suggestion on how to investigate the issue >>> >> further? Profiling shows that the cpu utilization is pretty low. >>> > >>> > You could try >>> > >>> > perf record -a -g -e skb:kfree_skb sleep 5 >>> > perf report >>> > >>> > So that you see where the packets are dropped. >>> > >>> > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are >>> > dropped at qdisc enqueue time, instead of having backpressure. >>> > >>> >>> Thanks for the hint - how should I read the perf report? Also we're >>> using TCP socket in this testing - TCP window size is set to 70kB. >> >> But how are you telling TCP to send 10k packets ? >> > We just write to the socket with 10k buffer and wait for a response > from the server (using read()) before the next write. Using tcpdump I > can see the 10k write is actually sent through 3 packets > (7.3k/1.5k/1.3k). > >> AFAIK you can not : TCP happily aggregates packets in write queue >> (see current MSG_EOR discussion) >> >> I suspect a bug in your tc settings. >> >> > > Could you help to check my tc setting? > > sudo tc qdisc add dev eth0 root mqprio num_tc 6 map 0 1 2 3 4 5 0 0 > queues 19@0 1@19 1@20 1@21 1@22 1@23 hw 0 > sudo tc qdisc add dev eth0 parent 805a:1a handle 8001:0 htb default 10 > sudo tc class add dev eth0 parent 8001: classid 8001:10 htb rate 1000Mbit > > I didn't set r2q/burst/cburst/mtu/mpu so the default value should be used. 
Just to circle back on this - it seems there is sometimes a ~200ms delay during the data push which stalls the sending:

01:34:44.046232 IP (tos 0x0, ttl 64, id 2863, offset 0, flags [DF], proto: TCP (6), length: 8740) 10.101.197.75.59126 > 10.101.197.105.redwood-broker: . 250025:258713(8688) ack 1901 win 58 <nop,nop,timestamp 507571833 196626529>
01:34:44.046304 IP (tos 0x0, ttl 64, id 15420, offset 0, flags [DF], prot
Re: qdisc spin lock
2016-04-21 5:41 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: > On Wed, 2016-04-20 at 22:51 -0700, Michael Ma wrote: >> 2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >> > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote: >> >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >> >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: >> >> >> I didn't really know that multiple qdiscs can be isolated using MQ so >> >> >> that each txq can be associated with a particular qdisc. Also we don't >> >> >> really have multiple interfaces... >> >> >> >> >> >> With this MQ solution we'll still need to assign transmit queues to >> >> >> different classes by doing some math on the bandwidth limit if I >> >> >> understand correctly, which seems to be less convenient compared with >> >> >> a solution purely within HTB. >> >> >> >> >> >> I assume that with this solution I can still share qdisc among >> >> >> multiple transmit queues - please let me know if this is not the case. >> >> > >> >> > Note that this MQ + HTB thing works well, unless you use a bonding >> >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing >> >> > tokens between the slaves) >> >> >> >> Actually MQ+HTB works well for small packets - like flow of 512 byte >> >> packets can be throttled by HTB using one txq without being affected >> >> by other flows with small packets. However I found using this solution >> >> large packets (10k for example) will only achieve very limited >> >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets >> >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by >> >> using 30 threads. But sending 10k packets using 10 threads has only 10 >> >> Mbit/s with the same TC configuration. If I increase burst and cburst >> >> of HTB to some extreme large value (like 50MB) the ceiling rate can be >> >> hit. 
>> >> >> >> The strange thing is that I don't see this problem when using HTB as >> >> the root. So txq number seems to be a factor here - however it's >> >> really hard to understand why would it only affect larger packets. Is >> >> this a known issue? Any suggestion on how to investigate the issue >> >> further? Profiling shows that the cpu utilization is pretty low. >> > >> > You could try >> > >> > perf record -a -g -e skb:kfree_skb sleep 5 >> > perf report >> > >> > So that you see where the packets are dropped. >> > >> > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are >> > dropped at qdisc enqueue time, instead of having backpressure. >> > >> >> Thanks for the hint - how should I read the perf report? Also we're >> using TCP socket in this testing - TCP window size is set to 70kB. > > But how are you telling TCP to send 10k packets ? > We just write to the socket with 10k buffer and wait for a response from the server (using read()) before the next write. Using tcpdump I can see the 10k write is actually sent through 3 packets (7.3k/1.5k/1.3k). > AFAIK you can not : TCP happily aggregates packets in write queue > (see current MSG_EOR discussion) > > I suspect a bug in your tc settings. > > Could you help to check my tc setting? sudo tc qdisc add dev eth0 root mqprio num_tc 6 map 0 1 2 3 4 5 0 0 queues 19@0 1@19 1@20 1@21 1@22 1@23 hw 0 sudo tc qdisc add dev eth0 parent 805a:1a handle 8001:0 htb default 10 sudo tc class add dev eth0 parent 8001: classid 8001:10 htb rate 1000Mbit I didn't set r2q/burst/cburst/mtu/mpu so the default value should be used.
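The burst/cburst sensitivity reported above (10k packets stuck at 10 Mbit/s until burst is raised to ~50MB) is consistent with HTB needing a burst at least rate × (worst-case gap between dequeue opportunities). A back-of-envelope sketch — the 10 ms service gap is an assumed figure, not a value measured in this thread:

```shell
# Bytes HTB must be allowed to burst to sustain rate_bits if the qdisc
# is only serviced every gap_ms milliseconds (e.g. due to txq scheduling).
rate_bits=1000000000          # 1 Gbit/s, as in the tc class above
gap_ms=10                     # assumed worst-case service gap
burst_bytes=$(( rate_bits / 8 * gap_ms / 1000 ))
echo $burst_bytes             # bytes of burst needed to ride out the gap
```

With larger packets each dequeue consumes tokens in bigger chunks, so an undersized burst throttles them much harder than a stream of 512-byte packets.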
Re: qdisc spin lock
2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote: >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: >> >> I didn't really know that multiple qdiscs can be isolated using MQ so >> >> that each txq can be associated with a particular qdisc. Also we don't >> >> really have multiple interfaces... >> >> >> >> With this MQ solution we'll still need to assign transmit queues to >> >> different classes by doing some math on the bandwidth limit if I >> >> understand correctly, which seems to be less convenient compared with >> >> a solution purely within HTB. >> >> >> >> I assume that with this solution I can still share qdisc among >> >> multiple transmit queues - please let me know if this is not the case. >> > >> > Note that this MQ + HTB thing works well, unless you use a bonding >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing >> > tokens between the slaves) >> >> Actually MQ+HTB works well for small packets - like flow of 512 byte >> packets can be throttled by HTB using one txq without being affected >> by other flows with small packets. However I found using this solution >> large packets (10k for example) will only achieve very limited >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by >> using 30 threads. But sending 10k packets using 10 threads has only 10 >> Mbit/s with the same TC configuration. If I increase burst and cburst >> of HTB to some extreme large value (like 50MB) the ceiling rate can be >> hit. >> >> The strange thing is that I don't see this problem when using HTB as >> the root. So txq number seems to be a factor here - however it's >> really hard to understand why would it only affect larger packets. Is >> this a known issue? Any suggestion on how to investigate the issue >> further? 
Profiling shows that the cpu utilization is pretty low. > > You could try > > perf record -a -g -e skb:kfree_skb sleep 5 > perf report > > So that you see where the packets are dropped. > > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are > dropped at qdisc enqueue time, instead of having backpressure. > Thanks for the hint - how should I read the perf report? Also we're using TCP socket in this testing - TCP window size is set to 70kB.

- 35.88% init [kernel.kallsyms] [k] intel_idle
    intel_idle
- 15.83% strings libc-2.5.so [.] __GI___connect_internal
  - __GI___connect_internal
    - 50.00% get_mapping
        __nscd_get_map_ref
      50.00% __nscd_open_socket
- 13.19% strings libc-2.5.so [.] __GI___libc_recvmsg
  - __GI___libc_recvmsg
    + 64.52% getifaddrs
    + 35.48% __check_pf
- 10.55% strings libc-2.5.so [.] __sendto_nocancel
  - __sendto_nocancel
      100.00% 0
Re: qdisc spin lock
2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: >> I didn't really know that multiple qdiscs can be isolated using MQ so >> that each txq can be associated with a particular qdisc. Also we don't >> really have multiple interfaces... >> >> With this MQ solution we'll still need to assign transmit queues to >> different classes by doing some math on the bandwidth limit if I >> understand correctly, which seems to be less convenient compared with >> a solution purely within HTB. >> >> I assume that with this solution I can still share qdisc among >> multiple transmit queues - please let me know if this is not the case. > > Note that this MQ + HTB thing works well, unless you use a bonding > device. (Or you need the MQ+HTB on the slaves, with no way of sharing > tokens between the slaves) Actually MQ+HTB works well for small packets - like flow of 512 byte packets can be throttled by HTB using one txq without being affected by other flows with small packets. However I found using this solution large packets (10k for example) will only achieve very limited bandwidth. In my test I used MQ to assign one txq to a HTB which sets rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by using 30 threads. But sending 10k packets using 10 threads has only 10 Mbit/s with the same TC configuration. If I increase burst and cburst of HTB to some extreme large value (like 50MB) the ceiling rate can be hit. The strange thing is that I don't see this problem when using HTB as the root. So txq number seems to be a factor here - however it's really hard to understand why would it only affect larger packets. Is this a known issue? Any suggestion on how to investigate the issue further? Profiling shows that the cpu utilization is pretty low. 
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3 > > bonding can not really be used as a true MQ device yet. > > I might send a patch to disable this 'bonding feature' if no slave sets > a queue_id. > >
Re: qdisc spin lock
2016-04-15 15:54 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: > On Fri, 2016-04-15 at 15:46 -0700, Michael Ma wrote: >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: >> >> I didn't really know that multiple qdiscs can be isolated using MQ so >> >> that each txq can be associated with a particular qdisc. Also we don't >> >> really have multiple interfaces... >> >> >> >> With this MQ solution we'll still need to assign transmit queues to >> >> different classes by doing some math on the bandwidth limit if I >> >> understand correctly, which seems to be less convenient compared with >> >> a solution purely within HTB. >> >> >> >> I assume that with this solution I can still share qdisc among >> >> multiple transmit queues - please let me know if this is not the case. >> > >> > Note that this MQ + HTB thing works well, unless you use a bonding >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing >> > tokens between the slaves) >> > >> > >> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3 >> > >> > bonding can not really be used as a true MQ device yet. >> > >> > I might send a patch to disable this 'bonding feature' if no slave sets >> > a queue_id. >> > >> > >> So there is no way of using this MQ solution when bonding interface is >> used, right? Then modifying HTB might be the only solution? > > I probably can submit a bonding patch very soon if there is interest. Would definitely appreciate that. If you can share the patch it will be helpful as well. Let me know if I can help with this... > > Modifying HTB is way more complicated :( > > >
Re: qdisc spin lock
2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>: > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: >> I didn't really know that multiple qdiscs can be isolated using MQ so >> that each txq can be associated with a particular qdisc. Also we don't >> really have multiple interfaces... >> >> With this MQ solution we'll still need to assign transmit queues to >> different classes by doing some math on the bandwidth limit if I >> understand correctly, which seems to be less convenient compared with >> a solution purely within HTB. >> >> I assume that with this solution I can still share qdisc among >> multiple transmit queues - please let me know if this is not the case. > > Note that this MQ + HTB thing works well, unless you use a bonding > device. (Or you need the MQ+HTB on the slaves, with no way of sharing > tokens between the slaves) > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3 > > bonding can not really be used as a true MQ device yet. > > I might send a patch to disable this 'bonding feature' if no slave sets > a queue_id. > > So there is no way of using this MQ solution when bonding interface is used, right? Then modifying HTB might be the only solution?
Re: qdisc spin lock
2016-03-31 20:44 GMT-07:00 John Fastabend <john.fastab...@gmail.com>: > On 16-03-31 04:48 PM, Michael Ma wrote: >> I didn't really know that multiple qdiscs can be isolated using MQ so >> that each txq can be associated with a particular qdisc. Also we don't >> really have multiple interfaces... > > MQ will assign a default qdisc to each txq and the default qdisc can > be changed to htb or any other qdisc of your choice. > >> >> With this MQ solution we'll still need to assign transmit queues to >> different classes by doing some math on the bandwidth limit if I >> understand correctly, which seems to be less convenient compared with >> a solution purely within HTB. >> > > Agreed. > >> I assume that with this solution I can still share qdisc among >> multiple transmit queues - please let me know if this is not the case. > > Nope sorry doesn't work that way unless you employ some sort of stacked > netdevice strategy which does start to get a bit complex. The basic hint > would be to stack some type of virtual netdev on top of a device and > run the htb qdisc there. Push traffic onto the netdev depending on the > class it belongs to. Its ugly yes. Thanks for the input - is it possible to implement a MQ to allow one qdisc to be associated with multiple TXQ belonging to one inner class? Does that require any additional synchronization in qdisc routines? I think it shouldn't because the common use of TC already associates one qdisc with all the TXQ. This would simplify the way that qdisc is crafted since it doesn't need to be created for each queue. Also is it possible to allow multiple qdiscs to be associated with one TXQ? I can see one issue with this - the TC interface for MQ today doesn't allow one TXQ to be associated with multiple classes, but this seems to be a restriction only from the tc command interface perspective. 
MQ anyway only establishes a connection between TXQ and qdisc so theoretically we can associate one TXQ to multiple classes/qdiscs and use priority/classid of the packet to do the pre-selection of qdisc. Again, I might have missed some potential synchronization issue when this actually happens because today TXQ is not shared by multiple qdiscs in any case. > > Noting all that I posted an RFC patch some time back to allow writing > qdiscs that do not require taking the lock. I'll try to respin these > and submit them when net-next opens again. The next logical step is to > write a "better" HTB probably using a shared counter and dropping the > requirement that it be exact. > > Sorry I didn't get a chance to look at the paper in your post so not > sure if they suggest something similar or not. > > Thanks, > John > >> >> 2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangc...@gmail.com>: >>> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0...@gmail.com> wrote: >>>> As far as I understand the design of TC is to simplify locking schema >>>> and minimize the work in __qdisc_run so that throughput won’t be >>>> affected, especially with large packets. However if the scenario is >>>> that multiple classes in the queueing discipline only have the shaping >>>> limit, there isn’t really a necessary correlation between different >>>> classes. The only synchronization point should be when the packet is >>>> dequeued from the qdisc queue and enqueued to the transmit queue of >>>> the device. My question is – is it worth investing on avoiding the >>>> locking contention by partitioning the queue/lock so that this >>>> scenario is addressed with relatively smaller latency? >>> >>> If your HTB classes don't share bandwidth, why do you still make them >>> under the same hierarchy? IOW, you can just isolate them either with some >>> other qdisc or just separated interfaces. >
Re: qdisc spin lock
2016-03-31 19:19 GMT-07:00 David Miller <da...@davemloft.net>: > From: Michael Ma <make0...@gmail.com> > Date: Thu, 31 Mar 2016 16:48:43 -0700 > >> I didn't really know that multiple qdiscs can be isolated using MQ so > ... > > Please stop top-posting. Sorry that I wasn't aware of this...
Re: qdisc spin lock
I didn't really know that multiple qdiscs can be isolated using MQ so that each txq can be associated with a particular qdisc. Also we don't really have multiple interfaces... With this MQ solution we'll still need to assign transmit queues to different classes by doing some math on the bandwidth limit if I understand correctly, which seems to be less convenient compared with a solution purely within HTB. I assume that with this solution I can still share qdisc among multiple transmit queues - please let me know if this is not the case. 2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangc...@gmail.com>: > On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0...@gmail.com> wrote: >> As far as I understand the design of TC is to simplify locking schema >> and minimize the work in __qdisc_run so that throughput won’t be >> affected, especially with large packets. However if the scenario is >> that multiple classes in the queueing discipline only have the shaping >> limit, there isn’t really a necessary correlation between different >> classes. The only synchronization point should be when the packet is >> dequeued from the qdisc queue and enqueued to the transmit queue of >> the device. My question is – is it worth investing on avoiding the >> locking contention by partitioning the queue/lock so that this >> scenario is addressed with relatively smaller latency? > > If your HTB classes don't share bandwidth, why do you still make them > under the same hierarchy? IOW, you can just isolate them either with some > other qdisc or just separated interfaces.
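The "math on the bandwidth limit" mentioned above is just a static split of each class's aggregate rate across the TXQs it is replicated on. A sketch, with all numbers and handles as examples only:

```shell
# Split a 900 Mbit class limit evenly across 3 MQ children; each
# per-queue HTB instance then uses the per-queue share as its rate.
CLASS_RATE=900   # Mbit, example aggregate limit for the class
NQUEUES=3        # TX queues serving this class
PER_Q=$(( CLASS_RATE / NQUEUES ))
for i in 1 2 3
do
    echo "tc class add dev eth0 parent ${i}1: classid ${i}1:10 htb rate ${PER_Q}Mbit"
done
```

The drawback, as noted in this exchange, is exactly that this partition is static: a queue cannot borrow unused tokens from its siblings the way classes under one HTB hierarchy can.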
Re: qdisc spin lock
Thanks for the suggestion - I'll try the MQ solution out. It seems to be able to solve the problem well with the assumption that bandwidth can be statically partitioned. 2016-03-31 12:18 GMT-07:00 Jesper Dangaard Brouer <bro...@redhat.com>: > > On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0...@gmail.com> wrote: > >> I know this might be an old topic so bare with me – what we are facing >> is that applications are sending small packets using hundreds of >> threads so the contention on spin lock in __dev_xmit_skb increases the >> latency of dev_queue_xmit significantly. We’re building a network QoS >> solution to avoid interference of different applications using HTB. > > Yes, as you have noticed with HTB there is a single qdisc lock, and > congestion obviously happens :-) > > It is possible with different tricks to make it scale. I believe > Google is using a variant of HTB, and it scales for them. They have > not open source their modifications to HTB (which likely also involves > a great deal of setup tricks). > > If your purpose it to limit traffic/bandwidth per "cloud" instance, > then you can just use another TC setup structure. Like using MQ and > assigning a HTB per MQ queue (where the MQ queues are bound to each > CPU/HW queue)... But you have to figure out this setup yourself... 
> > >> But in this case when some applications send massive small packets in >> parallel, the application to be protected will get its throughput >> affected (because it’s doing synchronous network communication using >> multiple threads and throughput is sensitive to the increased latency) >> >> Here is the profiling from perf: >> >> - 67.57% iperf [kernel.kallsyms] [k] _spin_lock >> - 99.94% dev_queue_xmit >> - 96.91% _spin_lock >> - 2.62% __qdisc_run >> - 98.98% sch_direct_xmit >> - 99.98% _spin_lock >> >> As far as I understand the design of TC is to simplify locking schema >> and minimize the work in __qdisc_run so that throughput won’t be >> affected, especially with large packets. However if the scenario is >> that multiple classes in the queueing discipline only have the shaping >> limit, there isn’t really a necessary correlation between different >> classes. The only synchronization point should be when the packet is >> dequeued from the qdisc queue and enqueued to the transmit queue of >> the device. My question is – is it worth investing on avoiding the >> locking contention by partitioning the queue/lock so that this >> scenario is addressed with relatively smaller latency? > > Yes, there is a lot go gain, but it is not easy ;-) > >> I must have oversimplified a lot of details since I’m not familiar >> with the TC implementation at this point – just want to get your input >> in terms of whether this is a worthwhile effort or there is something >> fundamental that I’m not aware of. If this is just a matter of quite >> some additional work, would also appreciate helping to outline the >> required work here. >> >> Also would appreciate if there is any information about the latest >> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf > > This article seems to be very low quality... spelling errors, only 5 > pages, no real code, etc. 
> > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer
qdisc spin lock
Hi - I know this might be an old topic so bear with me – what we are facing is that applications are sending small packets using hundreds of threads, so the contention on the spin lock in __dev_xmit_skb increases the latency of dev_queue_xmit significantly. We’re building a network QoS solution to avoid interference between different applications using HTB. But in this case, when some applications send massive numbers of small packets in parallel, the application to be protected gets its throughput affected (because it’s doing synchronous network communication using multiple threads and its throughput is sensitive to the increased latency).

Here is the profiling from perf:

- 67.57% iperf [kernel.kallsyms] [k] _spin_lock
  - 99.94% dev_queue_xmit
      96.91% _spin_lock
    - 2.62% __qdisc_run
      - 98.98% sch_direct_xmit
          99.98% _spin_lock
        1.01% _spin_lock

As far as I understand, the design of TC is to simplify the locking schema and minimize the work in __qdisc_run so that throughput won’t be affected, especially with large packets. However, if the scenario is that multiple classes in the queueing discipline only have a shaping limit, there isn’t really a necessary correlation between different classes. The only synchronization point should be when the packet is dequeued from the qdisc queue and enqueued to the transmit queue of the device. My question is – is it worth investing in avoiding the locking contention by partitioning the queue/lock so that this scenario is addressed with relatively smaller latency?

I must have oversimplified a lot of details since I’m not familiar with the TC implementation at this point – I just want to get your input on whether this is a worthwhile effort or whether there is something fundamental that I’m not aware of. If this is just a matter of quite some additional work, I would also appreciate help outlining the required work.
Also would appreciate if there is any information about the latest status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf Thanks, Ke Ma