Re: Per-CPU Queueing for QoS

2017-11-14 Thread Michael Ma
2017-11-13 20:53 GMT-08:00 Tom Herbert <t...@herbertland.com>:
> On Mon, Nov 13, 2017 at 7:45 PM, John Fastabend
> <john.fastab...@gmail.com> wrote:
>> On 11/13/2017 06:18 PM, Michael Ma wrote:
>>> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.du...@gmail.com>:
>>>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.duma...@gmail.com> 
>>>> wrote:
>>>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>>>>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger 
>>>>>>> <step...@networkplumber.org>:
>>>>>>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>>>>>>> Michael Ma <make0...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Any comments? We plan to implement this as a qdisc and appreciate any 
>>>>>>>>> early feedback.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Michael
>>>>>>>>>
>>>>>>>>>> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Currently txq/qdisc selection is based on the flow hash, so packets from
>>>>>>>>>> the same flow keep their order when they enter the qdisc/txq, which
>>>>>>>>>> avoids the out-of-order problem.
>>>>>>>>>>
>>>>>>>>>> To improve the concurrency of the QoS algorithm we plan to have multiple
>>>>>>>>>> per-cpu queues for a single TC class and do busy polling from a
>>>>>>>>>> per-class thread to drain these queues. If we can do this frequently
>>>>>>>>>> enough, the out-of-order situation in this polling thread should not be
>>>>>>>>>> that bad.
>>>>>>>>>>
>>>>>>>>>> To give more details - in the send path we introduce per-cpu per-class
>>>>>>>>>> queues so that packets from the same class and same core will be
>>>>>>>>>> enqueued to the same place. Then a per-class thread polls the queues
>>>>>>>>>> belonging to its class from all the CPUs and aggregates them into
>>>>>>>>>> another per-class queue. This can effectively reduce contention but
>>>>>>>>>> inevitably introduces a potential out-of-order issue.
>>>>>>>>>>
>>>>>>>>>> Any concern/suggestion for working towards this direction?
>>>>>>>>
>>>>>>>> In general, there are no meta design discussions in Linux development.
>>>>>>>> Several developers have tried to do a lockless
>>>>>>>> qdisc and similar things in the past.
>>>>>>>>
>>>>>>>> The devil is in the details, show us the code.
>>>>>>>
>>>>>>> Thanks for the response, Stephen. The code is fairly straightforward;
>>>>>>> we have a per-CPU, per-class queue defined like this:
>>>>>>>
>>>>>>> struct bandwidth_group
>>>>>>> {
>>>>>>>     struct skb_list queues[MAX_CPU_COUNT];
>>>>>>>     struct skb_list drain;
>>>>>>> };
>>>>>>>
>>>>>>> The "drain" queue is used to aggregate the per-cpu queues belonging to the
>>>>>>> same class. In the enqueue function, we determine the cpu where the
>>>>>>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>>>>>>
>>>>>>> int cpu;
>>>>>>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>>>>>
>>>>>>> cpu = get_cpu();
>>>>>>> skb_list_append(&bwg->queues[cpu], skb);
>>>>>>>
>>>>>>> Here we don't check the flow of the packet, so if there is task
>>>>>>> migration, or multiple threads sending packets through the same flow, we
>>>>>>> theoretically can have packets enqueued to different queues and
>>>>>>> aggregated to the "drain" queue out of order.

Re: Per-CPU Queueing for QoS

2017-11-14 Thread Michael Ma
2017-11-13 19:45 GMT-08:00 John Fastabend <john.fastab...@gmail.com>:
> On 11/13/2017 06:18 PM, Michael Ma wrote:
>> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.du...@gmail.com>:
>>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.duma...@gmail.com> 
>>> wrote:
>>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>>>>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger 
>>>>>> <step...@networkplumber.org>:
>>>>>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>>>>>> Michael Ma <make0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Any comments? We plan to implement this as a qdisc and appreciate any 
>>>>>>>> early feedback.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Michael
>>>>>>>>
>>>>>>>>> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Currently txq/qdisc selection is based on the flow hash, so packets from
>>>>>>>>> the same flow keep their order when they enter the qdisc/txq, which
>>>>>>>>> avoids the out-of-order problem.
>>>>>>>>>
>>>>>>>>> To improve the concurrency of the QoS algorithm we plan to have multiple
>>>>>>>>> per-cpu queues for a single TC class and do busy polling from a
>>>>>>>>> per-class thread to drain these queues. If we can do this frequently
>>>>>>>>> enough, the out-of-order situation in this polling thread should not be
>>>>>>>>> that bad.
>>>>>>>>>
>>>>>>>>> To give more details - in the send path we introduce per-cpu per-class
>>>>>>>>> queues so that packets from the same class and same core will be
>>>>>>>>> enqueued to the same place. Then a per-class thread polls the queues
>>>>>>>>> belonging to its class from all the CPUs and aggregates them into
>>>>>>>>> another per-class queue. This can effectively reduce contention but
>>>>>>>>> inevitably introduces a potential out-of-order issue.
>>>>>>>>>
>>>>>>>>> Any concern/suggestion for working towards this direction?
>>>>>>>
>>>>>>> In general, there are no meta design discussions in Linux development.
>>>>>>> Several developers have tried to do a lockless
>>>>>>> qdisc and similar things in the past.
>>>>>>>
>>>>>>> The devil is in the details, show us the code.
>>>>>>
>>>>>> Thanks for the response, Stephen. The code is fairly straightforward;
>>>>>> we have a per-CPU, per-class queue defined like this:
>>>>>>
>>>>>> struct bandwidth_group
>>>>>> {
>>>>>>     struct skb_list queues[MAX_CPU_COUNT];
>>>>>>     struct skb_list drain;
>>>>>> };
>>>>>>
>>>>>> The "drain" queue is used to aggregate the per-cpu queues belonging to the
>>>>>> same class. In the enqueue function, we determine the cpu where the
>>>>>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>>>>>
>>>>>> int cpu;
>>>>>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>>>>
>>>>>> cpu = get_cpu();
>>>>>> skb_list_append(&bwg->queues[cpu], skb);
>>>>>>
>>>>>> Here we don't check the flow of the packet, so if there is task
>>>>>> migration, or multiple threads sending packets through the same flow, we
>>>>>> theoretically can have packets enqueued to different queues and
>>>>>> aggregated to the "drain" queue out of order.
>>>>>>
>>>>>> Also, AFAIK there is no lockless HTB-like qdisc implementation
>>>>>> currently; however, if there is already a similar effort ongoing, please
>>>>>> let me know.
>>>>>
>>>>> The question I would have is how would this differ from using XPS w/
>>>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>>>> like mqprio?

Re: Per-CPU Queueing for QoS

2017-11-14 Thread Michael Ma
2017-11-13 18:18 GMT-08:00 Michael Ma <make0...@gmail.com>:
> 2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.du...@gmail.com>:
>> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger 
>>>> > <step...@networkplumber.org>:
>>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>>> >> Michael Ma <make0...@gmail.com> wrote:
>>>> >>
>>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any 
>>>> >>> early feedback.
>>>> >>>
>>>> >>> Thanks,
>>>> >>> Michael
>>>> >>>
>>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>>> >>> >
>>>> >>> > Currently txq/qdisc selection is based on the flow hash, so packets from
>>>> >>> > the same flow keep their order when they enter the qdisc/txq, which
>>>> >>> > avoids the out-of-order problem.
>>>> >>> >
>>>> >>> > To improve the concurrency of the QoS algorithm we plan to have multiple
>>>> >>> > per-cpu queues for a single TC class and do busy polling from a
>>>> >>> > per-class thread to drain these queues. If we can do this frequently
>>>> >>> > enough, the out-of-order situation in this polling thread should not be
>>>> >>> > that bad.
>>>> >>> >
>>>> >>> > To give more details - in the send path we introduce per-cpu per-class
>>>> >>> > queues so that packets from the same class and same core will be
>>>> >>> > enqueued to the same place. Then a per-class thread polls the queues
>>>> >>> > belonging to its class from all the CPUs and aggregates them into
>>>> >>> > another per-class queue. This can effectively reduce contention but
>>>> >>> > inevitably introduces a potential out-of-order issue.
>>>> >>> >
>>>> >>> > Any concern/suggestion for working towards this direction?
>>>> >>
>>>> >> In general, there are no meta design discussions in Linux development.
>>>> >> Several developers have tried to do a lockless
>>>> >> qdisc and similar things in the past.
>>>> >>
>>>> >> The devil is in the details, show us the code.
>>>> >
>>>> > Thanks for the response, Stephen. The code is fairly straightforward;
>>>> > we have a per-CPU, per-class queue defined like this:
>>>> >
>>>> > struct bandwidth_group
>>>> > {
>>>> >     struct skb_list queues[MAX_CPU_COUNT];
>>>> >     struct skb_list drain;
>>>> > };
>>>> >
>>>> > The "drain" queue is used to aggregate the per-cpu queues belonging to the
>>>> > same class. In the enqueue function, we determine the cpu where the
>>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>>> >
>>>> > int cpu;
>>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>>> >
>>>> > cpu = get_cpu();
>>>> > skb_list_append(&bwg->queues[cpu], skb);
>>>> >
>>>> > Here we don't check the flow of the packet, so if there is task
>>>> > migration, or multiple threads sending packets through the same flow, we
>>>> > theoretically can have packets enqueued to different queues and
>>>> > aggregated to the "drain" queue out of order.
>>>> >
>>>> > Also, AFAIK there is no lockless HTB-like qdisc implementation
>>>> > currently; however, if there is already a similar effort ongoing, please
>>>> > let me know.
>>>>
>>>> The question I would have is how would this differ from using XPS w/
>>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>>> like mqprio?
>>>>
>>>> From what I can tell XPS would be able to get you your per-cpu
>>>> functionality, the benefit of it though would be that it would avoid
>>>> out-of-order issues for sockets originating on the local system. The
>>>> only thing I see as an issue right now is that the rate limiting with
>>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>>> DCB.

Re: Per-CPU Queueing for QoS

2017-11-13 Thread Michael Ma
2017-11-13 16:13 GMT-08:00 Alexander Duyck <alexander.du...@gmail.com>:
> On Mon, Nov 13, 2017 at 3:08 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>> >> Michael Ma <make0...@gmail.com> wrote:
>>> >>
>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any 
>>> >>> early feedback.
>>> >>>
>>> >>> Thanks,
>>> >>> Michael
>>> >>>
>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>> >>> >
>>> >>> > Currently txq/qdisc selection is based on the flow hash, so packets from
>>> >>> > the same flow keep their order when they enter the qdisc/txq, which
>>> >>> > avoids the out-of-order problem.
>>> >>> >
>>> >>> > To improve the concurrency of the QoS algorithm we plan to have multiple
>>> >>> > per-cpu queues for a single TC class and do busy polling from a
>>> >>> > per-class thread to drain these queues. If we can do this frequently
>>> >>> > enough, the out-of-order situation in this polling thread should not be
>>> >>> > that bad.
>>> >>> >
>>> >>> > To give more details - in the send path we introduce per-cpu per-class
>>> >>> > queues so that packets from the same class and same core will be
>>> >>> > enqueued to the same place. Then a per-class thread polls the queues
>>> >>> > belonging to its class from all the CPUs and aggregates them into
>>> >>> > another per-class queue. This can effectively reduce contention but
>>> >>> > inevitably introduces a potential out-of-order issue.
>>> >>> >
>>> >>> > Any concern/suggestion for working towards this direction?
>>> >>
>>> >> In general, there are no meta design discussions in Linux development.
>>> >> Several developers have tried to do a lockless
>>> >> qdisc and similar things in the past.
>>> >>
>>> >> The devil is in the details, show us the code.
>>> >
>>> > Thanks for the response, Stephen. The code is fairly straightforward;
>>> > we have a per-CPU, per-class queue defined like this:
>>> >
>>> > struct bandwidth_group
>>> > {
>>> >     struct skb_list queues[MAX_CPU_COUNT];
>>> >     struct skb_list drain;
>>> > };
>>> >
>>> > The "drain" queue is used to aggregate the per-cpu queues belonging to the
>>> > same class. In the enqueue function, we determine the cpu where the
>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>> >
>>> > int cpu;
>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>> >
>>> > cpu = get_cpu();
>>> > skb_list_append(&bwg->queues[cpu], skb);
>>> >
>>> > Here we don't check the flow of the packet, so if there is task
>>> > migration, or multiple threads sending packets through the same flow, we
>>> > theoretically can have packets enqueued to different queues and
>>> > aggregated to the "drain" queue out of order.
>>> >
>>> > Also, AFAIK there is no lockless HTB-like qdisc implementation
>>> > currently; however, if there is already a similar effort ongoing, please
>>> > let me know.
>>>
>>> The question I would have is how would this differ from using XPS w/
>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>> like mqprio?
>>>
>>> From what I can tell XPS would be able to get you your per-cpu
>>> functionality, the benefit of it though would be that it would avoid
>>> out-of-order issues for sockets originating on the local system. The
>>> only thing I see as an issue right now is that the rate limiting with
>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>> DCB.
>>
>> I think one of the key points was in: "do busy polling from a per-class
>> thread to drain these queues."
>>
>> I mentioned this idea in the TX path of:
>>
>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf

Re: Per-CPU Queueing for QoS

2017-11-13 Thread Michael Ma
2017-11-13 18:05 GMT-08:00 Michael Ma <make0...@gmail.com>:
> 2017-11-13 15:08 GMT-08:00 Eric Dumazet <eric.duma...@gmail.com>:
>> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
>>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>>> >> Michael Ma <make0...@gmail.com> wrote:
>>> >>
>>> >>> Any comments? We plan to implement this as a qdisc and appreciate any 
>>> >>> early feedback.
>>> >>>
>>> >>> Thanks,
>>> >>> Michael
>>> >>>
>>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>> >>> >
>>> >>> > Currently txq/qdisc selection is based on the flow hash, so packets from
>>> >>> > the same flow keep their order when they enter the qdisc/txq, which
>>> >>> > avoids the out-of-order problem.
>>> >>> >
>>> >>> > To improve the concurrency of the QoS algorithm we plan to have multiple
>>> >>> > per-cpu queues for a single TC class and do busy polling from a
>>> >>> > per-class thread to drain these queues. If we can do this frequently
>>> >>> > enough, the out-of-order situation in this polling thread should not be
>>> >>> > that bad.
>>> >>> >
>>> >>> > To give more details - in the send path we introduce per-cpu per-class
>>> >>> > queues so that packets from the same class and same core will be
>>> >>> > enqueued to the same place. Then a per-class thread polls the queues
>>> >>> > belonging to its class from all the CPUs and aggregates them into
>>> >>> > another per-class queue. This can effectively reduce contention but
>>> >>> > inevitably introduces a potential out-of-order issue.
>>> >>> >
>>> >>> > Any concern/suggestion for working towards this direction?
>>> >>
>>> >> In general, there are no meta design discussions in Linux development.
>>> >> Several developers have tried to do a lockless
>>> >> qdisc and similar things in the past.
>>> >>
>>> >> The devil is in the details, show us the code.
>>> >
>>> > Thanks for the response, Stephen. The code is fairly straightforward;
>>> > we have a per-CPU, per-class queue defined like this:
>>> >
>>> > struct bandwidth_group
>>> > {
>>> >     struct skb_list queues[MAX_CPU_COUNT];
>>> >     struct skb_list drain;
>>> > };
>>> >
>>> > The "drain" queue is used to aggregate the per-cpu queues belonging to the
>>> > same class. In the enqueue function, we determine the cpu where the
>>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>>> >
>>> > int cpu;
>>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>> >
>>> > cpu = get_cpu();
>>> > skb_list_append(&bwg->queues[cpu], skb);
>>> >
>>> > Here we don't check the flow of the packet, so if there is task
>>> > migration, or multiple threads sending packets through the same flow, we
>>> > theoretically can have packets enqueued to different queues and
>>> > aggregated to the "drain" queue out of order.
>>> >
>>> > Also, AFAIK there is no lockless HTB-like qdisc implementation
>>> > currently; however, if there is already a similar effort ongoing, please
>>> > let me know.
>>>
>>> The question I would have is how would this differ from using XPS w/
>>> mqprio? Would this be a classful qdisc like HTB or a classless one
>>> like mqprio?
>>>
>>> From what I can tell XPS would be able to get you your per-cpu
>>> functionality, the benefit of it though would be that it would avoid
>>> out-of-order issues for sockets originating on the local system. The
>>> only thing I see as an issue right now is that the rate limiting with
>>> mqprio is assumed to be handled via hardware due to mechanisms such as
>>> DCB.
>>
>> I think one of the key points was in: "do busy polling from a per-class
>> thread to drain these queues."
>>
>> I mentioned this idea in the TX path of:
>>
>> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>>
>>
>
> Right - this part is the key difference. With mqprio we still don't
> have the ability to exploit parallelism at the class level. The
> parallelism is restricted to how flows are partitioned across
> queues.
>
>>

Eric - do you think that if we do busy polling frequently enough, the
out-of-order problem will effectively be mitigated? I'll take a look
at your slides as well.


Re: Per-CPU Queueing for QoS

2017-11-13 Thread Michael Ma
2017-11-13 15:08 GMT-08:00 Eric Dumazet <eric.duma...@gmail.com>:
> On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
>> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
>> >> On Sun, 12 Nov 2017 13:43:13 -0800
>> >> Michael Ma <make0...@gmail.com> wrote:
>> >>
>> >>> Any comments? We plan to implement this as a qdisc and appreciate any 
>> >>> early feedback.
>> >>>
>> >>> Thanks,
>> >>> Michael
>> >>>
>> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>> >>> >
>> >>> > Currently txq/qdisc selection is based on the flow hash, so packets from
>> >>> > the same flow keep their order when they enter the qdisc/txq, which
>> >>> > avoids the out-of-order problem.
>> >>> >
>> >>> > To improve the concurrency of the QoS algorithm we plan to have multiple
>> >>> > per-cpu queues for a single TC class and do busy polling from a
>> >>> > per-class thread to drain these queues. If we can do this frequently
>> >>> > enough, the out-of-order situation in this polling thread should not be
>> >>> > that bad.
>> >>> >
>> >>> > To give more details - in the send path we introduce per-cpu per-class
>> >>> > queues so that packets from the same class and same core will be
>> >>> > enqueued to the same place. Then a per-class thread polls the queues
>> >>> > belonging to its class from all the CPUs and aggregates them into
>> >>> > another per-class queue. This can effectively reduce contention but
>> >>> > inevitably introduces a potential out-of-order issue.
>> >>> >
>> >>> > Any concern/suggestion for working towards this direction?
>> >>
>> >> In general, there are no meta design discussions in Linux development.
>> >> Several developers have tried to do a lockless
>> >> qdisc and similar things in the past.
>> >>
>> >> The devil is in the details, show us the code.
>> >
>> > Thanks for the response, Stephen. The code is fairly straightforward;
>> > we have a per-CPU, per-class queue defined like this:
>> >
>> > struct bandwidth_group
>> > {
>> >     struct skb_list queues[MAX_CPU_COUNT];
>> >     struct skb_list drain;
>> > };
>> >
>> > The "drain" queue is used to aggregate the per-cpu queues belonging to the
>> > same class. In the enqueue function, we determine the cpu where the
>> > packet is processed and enqueue it to the corresponding per-cpu queue:
>> >
>> > int cpu;
>> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>> >
>> > cpu = get_cpu();
>> > skb_list_append(&bwg->queues[cpu], skb);
>> >
>> > Here we don't check the flow of the packet, so if there is task
>> > migration, or multiple threads sending packets through the same flow, we
>> > theoretically can have packets enqueued to different queues and
>> > aggregated to the "drain" queue out of order.
>> >
>> > Also, AFAIK there is no lockless HTB-like qdisc implementation
>> > currently; however, if there is already a similar effort ongoing, please
>> > let me know.
>>
>> The question I would have is how would this differ from using XPS w/
>> mqprio? Would this be a classful qdisc like HTB or a classless one
>> like mqprio?
>>
>> From what I can tell XPS would be able to get you your per-cpu
>> functionality, the benefit of it though would be that it would avoid
>> out-of-order issues for sockets originating on the local system. The
>> only thing I see as an issue right now is that the rate limiting with
>> mqprio is assumed to be handled via hardware due to mechanisms such as
>> DCB.
>
> I think one of the key points was in: "do busy polling from a per-class
> thread to drain these queues."
>
> I mentioned this idea in the TX path of:
>
> https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
>
>

Right - this part is the key difference. With mqprio we still don't
have the ability to exploit parallelism at the class level. The
parallelism is restricted to how flows are partitioned across
queues.

>


Re: Per-CPU Queueing for QoS

2017-11-13 Thread Michael Ma
2017-11-13 14:47 GMT-08:00 Alexander Duyck <alexander.du...@gmail.com>:
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
>> 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
>>> On Sun, 12 Nov 2017 13:43:13 -0800
>>> Michael Ma <make0...@gmail.com> wrote:
>>>
>>>> Any comments? We plan to implement this as a qdisc and appreciate any 
>>>> early feedback.
>>>>
>>>> Thanks,
>>>> Michael
>>>>
>>>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>>>> >
>>>> > Currently txq/qdisc selection is based on the flow hash, so packets from
>>>> > the same flow keep their order when they enter the qdisc/txq, which
>>>> > avoids the out-of-order problem.
>>>> >
>>>> > To improve the concurrency of the QoS algorithm we plan to have multiple
>>>> > per-cpu queues for a single TC class and do busy polling from a
>>>> > per-class thread to drain these queues. If we can do this frequently
>>>> > enough, the out-of-order situation in this polling thread should not be
>>>> > that bad.
>>>> >
>>>> > To give more details - in the send path we introduce per-cpu per-class
>>>> > queues so that packets from the same class and same core will be
>>>> > enqueued to the same place. Then a per-class thread polls the queues
>>>> > belonging to its class from all the CPUs and aggregates them into
>>>> > another per-class queue. This can effectively reduce contention but
>>>> > inevitably introduces a potential out-of-order issue.
>>>> >
>>>> > Any concern/suggestion for working towards this direction?
>>>
>>> In general, there are no meta design discussions in Linux development.
>>> Several developers have tried to do a lockless
>>> qdisc and similar things in the past.
>>>
>>> The devil is in the details, show us the code.
>>
>> Thanks for the response, Stephen. The code is fairly straightforward;
>> we have a per-CPU, per-class queue defined like this:
>>
>> struct bandwidth_group
>> {
>>     struct skb_list queues[MAX_CPU_COUNT];
>>     struct skb_list drain;
>> };
>>
>> The "drain" queue is used to aggregate the per-cpu queues belonging to the
>> same class. In the enqueue function, we determine the cpu where the
>> packet is processed and enqueue it to the corresponding per-cpu queue:
>>
>> int cpu;
>> struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
>>
>> cpu = get_cpu();
>> skb_list_append(&bwg->queues[cpu], skb);
>>
>> Here we don't check the flow of the packet, so if there is task
>> migration, or multiple threads sending packets through the same flow, we
>> theoretically can have packets enqueued to different queues and
>> aggregated to the "drain" queue out of order.
>>
>> Also, AFAIK there is no lockless HTB-like qdisc implementation
>> currently; however, if there is already a similar effort ongoing, please
>> let me know.
>
> The question I would have is how would this differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
>

Yes, XPS + mqprio will achieve a similar effect to this. However, XPS +
mqprio requires quite a bit of static configuration to make things work.
Also, all the overhead associated with the qdisc root lock still exists
even though the contention might be reduced. With a dedicated per-class
thread draining the per-CPU queues, we can avoid acquiring a lock for
enqueue/dequeue since each queue is effectively single-producer,
single-consumer. In the case that multiple threads do share the same
flow, the lock contention still exists, and we don't have per-class
parallelism for rate limiting if classes ever share the same cores;
this is avoidable by having a per-class thread run things like a leaky
bucket.

We still plan to implement it as a classful qdisc.

> From what I can tell XPS would be able to get you your per-cpu
> functionality, the benefit of it though would be that it would avoid
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled via hardware due to mechanisms such as
> DCB.
>

Qdiscs like HTB can also be attached under mqprio, so this can actually
work without DCB.

> - Alex
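
For concreteness, the XPS + mqprio + HTB combination discussed above can be configured along these lines. This is only a sketch: the device name, queue layout, CPU masks and rates are placeholder assumptions, and details vary by NIC and kernel.

```shell
DEV=eth0

# mqprio as root: 2 traffic classes, each mapped onto 2 hardware queues.
# "hw 0" keeps things in software rather than offloading to DCB.
tc qdisc add dev $DEV root handle 1: mqprio num_tc 2 \
    map 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 \
    queues 2@0 2@2 hw 0

# Software rate limiting: attach HTB under one of the per-queue leaves.
tc qdisc add dev $DEV parent 1:1 handle 10: htb default 1
tc class add dev $DEV parent 10: classid 10:1 htb rate 500mbit

# XPS: statically pin TX queues to CPUs so traffic from a given CPU
# always lands on the same queue/qdisc (the static configuration
# mentioned above).
echo 1 > /sys/class/net/$DEV/queues/tx-0/xps_cpus
echo 2 > /sys/class/net/$DEV/queues/tx-1/xps_cpus
```

Note how the qdisc-per-queue structure only partitions work by txq/CPU; there is still no per-class parallelism, which is the gap the proposal targets.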


Re: Per-CPU Queueing for QoS

2017-11-13 Thread Michael Ma
2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
> On Sun, 12 Nov 2017 13:43:13 -0800
> Michael Ma <make0...@gmail.com> wrote:
>
>> Any comments? We plan to implement this as a qdisc and appreciate any early 
>> feedback.
>>
>> Thanks,
>> Michael
>>
>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
>> >
>> > Currently txq/qdisc selection is based on the flow hash, so packets from
>> > the same flow keep their order when they enter the qdisc/txq, which
>> > avoids the out-of-order problem.
>> >
>> > To improve the concurrency of the QoS algorithm we plan to have multiple
>> > per-cpu queues for a single TC class and do busy polling from a
>> > per-class thread to drain these queues. If we can do this frequently
>> > enough, the out-of-order situation in this polling thread should not be
>> > that bad.
>> >
>> > To give more details - in the send path we introduce per-cpu per-class
>> > queues so that packets from the same class and same core will be
>> > enqueued to the same place. Then a per-class thread polls the queues
>> > belonging to its class from all the CPUs and aggregates them into
>> > another per-class queue. This can effectively reduce contention but
>> > inevitably introduces a potential out-of-order issue.
>> >
>> > Any concern/suggestion for working towards this direction?
>
> In general, there are no meta design discussions in Linux development.
> Several developers have tried to do a lockless
> qdisc and similar things in the past.
>
> The devil is in the details, show us the code.

Thanks for the response, Stephen. The code is fairly straightforward;
we have a per-CPU, per-class queue defined like this:

struct bandwidth_group
{
    struct skb_list queues[MAX_CPU_COUNT];
    struct skb_list drain;
};

The "drain" queue is used to aggregate the per-cpu queues belonging to the
same class. In the enqueue function, we determine the cpu where the
packet is processed and enqueue it to the corresponding per-cpu queue:

int cpu;
struct bandwidth_group *bwg = &bw_rx_groups[bwgid];

cpu = get_cpu();
skb_list_append(&bwg->queues[cpu], skb);

Here we don't check the flow of the packet, so if there is task
migration, or multiple threads sending packets through the same flow, we
theoretically can have packets enqueued to different queues and
aggregated to the "drain" queue out of order.

Also, AFAIK there is no lockless HTB-like qdisc implementation
currently; however, if there is already a similar effort ongoing, please
let me know.


Re: Per-CPU Queueing for QoS

2017-11-12 Thread Michael Ma
Any comments? We plan to implement this as a qdisc and appreciate any early 
feedback.

Thanks,
Michael

> On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
> 
> Currently txq/qdisc selection is based on the flow hash, so packets from
> the same flow keep their order when they enter the qdisc/txq, which
> avoids the out-of-order problem.
>
> To improve the concurrency of the QoS algorithm we plan to have multiple
> per-cpu queues for a single TC class and do busy polling from a
> per-class thread to drain these queues. If we can do this frequently
> enough, the out-of-order situation in this polling thread should not be
> that bad.
>
> To give more details - in the send path we introduce per-cpu per-class
> queues so that packets from the same class and same core will be
> enqueued to the same place. Then a per-class thread polls the queues
> belonging to its class from all the CPUs and aggregates them into
> another per-class queue. This can effectively reduce contention but
> inevitably introduces a potential out-of-order issue.
> 
> Any concern/suggestion for working towards this direction?


Per-CPU Queueing for QoS

2017-11-09 Thread Michael Ma
Currently txq/qdisc selection is based on the flow hash, so packets from
the same flow keep their order when they enter the qdisc/txq, which
avoids the out-of-order problem.

To improve the concurrency of the QoS algorithm we plan to have multiple
per-cpu queues for a single TC class and do busy polling from a
per-class thread to drain these queues. If we can do this frequently
enough, the out-of-order situation in this polling thread should not be
that bad.

To give more details - in the send path we introduce per-cpu per-class
queues so that packets from the same class and same core will be
enqueued to the same place. Then a per-class thread polls the queues
belonging to its class from all the CPUs and aggregates them into
another per-class queue. This can effectively reduce contention but
inevitably introduces a potential out-of-order issue.

Any concern/suggestion for working towards this direction?


Multiple Txqs with One Qdisc

2017-04-25 Thread Michael Ma
Hi -

As I mentioned in a previous thread, we're implementing a qdisc
similar to mqprio which can associate multiple txqs with one qdisc, so
that we can work around the root qdisc bottleneck. I think this should
be fine in general since without MQ, the root qdisc is associated with
multiple txqs anyway. However, we encountered a kernel panic, as I
mentioned in the "corrupted skb" thread.

Any thoughts on potential issues with doing this? I'm asking because
mqprio only associates one txq with one qdisc, which I think must have
been done this way for a reason.

I would appreciate it if anyone could shed some light on this.

Thanks,
Michael


Re: Corrupted SKB

2017-04-25 Thread Michael Ma
2017-04-18 21:46 GMT-07:00 Michael Ma <make0...@gmail.com>:
> 2017-04-18 16:12 GMT-07:00 Cong Wang <xiyou.wangc...@gmail.com>:
>> On Mon, Apr 17, 2017 at 5:39 PM, Michael Ma <make0...@gmail.com> wrote:
>>> Hi -
>>>
>>> We've implemented a "glue" qdisc similar to mqprio which can associate
>>> one qdisc to multiple txqs as the root qdisc. Reference count of the
>>> child qdiscs have been adjusted properly in this case so that it
>>> represents the number of txqs it has been attached to. However when
>>> sending packets we saw the skb from dequeue_skb() corrupted with the
>>> following call stack:
>>>
>>> [exception RIP: netif_skb_features+51]
>>> RIP: 815292b3  RSP: 8817f6987940  RFLAGS: 00010246
>>>
>>>  #9 [8817f6987968] validate_xmit_skb at 815294aa
>>> #10 [8817f69879a0] validate_xmit_skb at 8152a0d9
>>> #11 [8817f69879b0] __qdisc_run at 8154a193
>>> #12 [8817f6987a00] dev_queue_xmit at 81529e03
>>>
>>> It looks like the skb has already been released since its dev pointer
>>> field is invalid.
>>>
>>> Any clue on how this can be investigated further? My current thought
>>> is to add some instrumentation to the place where skb is released and
>>> analyze whether there is any race condition happening there. However
>>
>> Either dropwatch or perf could do the work to instrument kfree_skb().
>
> Thanks - will try it out.

I'm using perf to collect the call stack for kfree_skb and trying to
correlate that with the corrupted SKB address. However, when the
system crashes, the perf.data file is also corrupted - how can I view
this file if the system crashes before perf exits?
>>
>>> by looking through the existing code I think the case where one root
>>> qdisc is associated with multiple txqs already exists (when mqprio is
>>> not used) so not sure why it won't work when we group txqs and assign
>>> each group a root qdisc. Any insight on this issue would be much
>>> appreciated!
>>
>> How do you implement ->attach()? How does it work with netdev_pick_tx()?
>
> attach() essentially grafts the default qdisc(pfifo) to each "txq
> group" represented by a TC class. For netdev_pick_txq() we use classid
> of the socket to select a class based on a "class id base" and the
> class to txq mapping defined together with this glue qdisc - it's
> pretty much the same as mqprio with the difference of mapping one
> class to multiple txqs and selecting the txq through a hash.


Re: Corrupted SKB

2017-04-18 Thread Michael Ma
2017-04-18 16:12 GMT-07:00 Cong Wang <xiyou.wangc...@gmail.com>:
> On Mon, Apr 17, 2017 at 5:39 PM, Michael Ma <make0...@gmail.com> wrote:
>> Hi -
>>
>> We've implemented a "glue" qdisc similar to mqprio which can associate
>> one qdisc to multiple txqs as the root qdisc. Reference count of the
>> child qdiscs have been adjusted properly in this case so that it
>> represents the number of txqs it has been attached to. However when
>> sending packets we saw the skb from dequeue_skb() corrupted with the
>> following call stack:
>>
>> [exception RIP: netif_skb_features+51]
>> RIP: 815292b3  RSP: 8817f6987940  RFLAGS: 00010246
>>
>>  #9 [8817f6987968] validate_xmit_skb at 815294aa
>> #10 [8817f69879a0] validate_xmit_skb at 8152a0d9
>> #11 [8817f69879b0] __qdisc_run at 8154a193
>> #12 [8817f6987a00] dev_queue_xmit at 81529e03
>>
>> It looks like the skb has already been released since its dev pointer
>> field is invalid.
>>
>> Any clue on how this can be investigated further? My current thought
>> is to add some instrumentation to the place where skb is released and
>> analyze whether there is any race condition happening there. However
>
> Either dropwatch or perf could do the work to instrument kfree_skb().

Thanks - will try it out.
>
>> by looking through the existing code I think the case where one root
>> qdisc is associated with multiple txqs already exists (when mqprio is
>> not used) so not sure why it won't work when we group txqs and assign
>> each group a root qdisc. Any insight on this issue would be much
>> appreciated!
>
> How do you implement ->attach()? How does it work with netdev_pick_tx()?

attach() essentially grafts the default qdisc (pfifo) to each "txq
group" represented by a TC class. For netdev_pick_tx() we use the
classid of the socket to select a class, based on a "class id base"
and the class-to-txq mapping defined together with this glue qdisc -
it's pretty much the same as mqprio, with the difference of mapping
one class to multiple txqs and selecting the txq through a hash.
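The classid-to-txq selection described above can be sketched in plain C. This is a userspace model under assumed names (`pick_txq`, the classid base, and the per-class mapping table are all illustrative, not the actual qdisc code): the socket classid minus a configured base selects a class, each class owns a contiguous range of txqs, and a flow hash picks one txq within that range.

```c
#include <stdint.h>

/* Illustrative model of the described selection: each TC class owns a
 * contiguous txq range; a flow hash spreads flows within the class. */
struct class_txq_map { uint32_t first_txq; uint32_t num_txq; };

static uint32_t pick_txq(uint32_t sk_classid, uint32_t classid_base,
                         const struct class_txq_map *map, uint32_t flow_hash)
{
    uint32_t class_idx = sk_classid - classid_base;   /* class from classid */
    const struct class_txq_map *m = &map[class_idx];
    return m->first_txq + flow_hash % m->num_txq;     /* hash within class  */
}
```

One consequence, which matters for the corruption question in this thread: unlike mqprio's one-to-one mapping, several txqs here share one root qdisc, so any per-txq assumption in the qdisc (locking, refcounts) has to hold across the whole range.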


Corrupted SKB

2017-04-17 Thread Michael Ma
Hi -

We've implemented a "glue" qdisc similar to mqprio which can associate
one qdisc with multiple txqs as the root qdisc. The reference count of
each child qdisc has been adjusted properly so that it represents the
number of txqs the qdisc is attached to. However, when sending packets
we saw the skb from dequeue_skb() corrupted, with the following call
stack:

[exception RIP: netif_skb_features+51]
RIP: 815292b3  RSP: 8817f6987940  RFLAGS: 00010246

 #9 [8817f6987968] validate_xmit_skb at 815294aa
#10 [8817f69879a0] validate_xmit_skb at 8152a0d9
#11 [8817f69879b0] __qdisc_run at 8154a193
#12 [8817f6987a00] dev_queue_xmit at 81529e03

It looks like the skb has already been released, since its dev pointer
field is invalid.

Any clue on how this can be investigated further? My current thought
is to add some instrumentation to the place where the skb is released
and analyze whether any race condition is happening there. However,
looking through the existing code, the case where one root qdisc is
associated with multiple txqs already exists (when mqprio is not
used), so I'm not sure why it won't work when we group txqs and assign
each group a root qdisc. Any insight on this issue would be much
appreciated!

Thanks,
Michael


Re: Why do we need tasklet in IFB?

2016-11-01 Thread Michael Ma
2016-11-01 4:38 GMT-07:00 Jamal Hadi Salim <j...@mojatatu.com>:
> On 16-10-31 02:10 PM, David Miller wrote:
>>
>> From: Michael Ma <make0...@gmail.com>
>> Date: Mon, 31 Oct 2016 11:02:28 -0700
>>
>>> 2016-10-28 14:52 GMT-07:00 Michael Ma <make0...@gmail.com>:
>>>>
>>>> 2016-10-28 14:48 GMT-07:00 Stephen Hemminger
>>>> <step...@networkplumber.org>:
>>>>>
>>>>> On Fri, 28 Oct 2016 14:45:07 -0700
>>>>> Michael Ma <make0...@gmail.com> wrote:
>>>>>
>>>>>> 2016-10-28 14:38 GMT-07:00 Stephen Hemminger
>>>>>> <step...@networkplumber.org>:
>>>>>>>
>>>>>>> On Fri, 28 Oct 2016 14:36:27 -0700
>>>>>>> Michael Ma <make0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi -
>>>>>>>>
>>>>>>>> Currently IFB uses tasklet to process tx/rx on the interface that
>>>>>>>> forwarded the packet to IFB. My understanding on why we're doing
>>>>>>>> this
>>>>>>>> is that since dev_queue_xmit() can be invoked in interrupt, we want
>>>>>>>> to
>>>>>>>> defer the processing of original tx/rx in case ifb_xmit() is called
>>>>>>>> from interrupt.
>>>>>>>
>>>>>>>
>>>>>>> dev_queue_xmit is only called from interrupt if doing netconsole.
>>>>>>
>>>>>>
>>>
>>> In fact this doesn't seem to explain since if the original path is tx
>>> and the context is interrupt, IFB will call dev_queue_xmit as well so
>>> the context can be interrupt in that case.
>>>
>>> Then tasklet is still unnecessary.
>>
>>
>> Perhaps it's necessary to deal with potential recursion, that's my only
>> guess at this point.
>>
>> Jamal, do you remember what went through your mind back in 2005? :-)
>>
>
> Certainly recursions (which were a huge issue in the original
> implementations) as well as the need to have qdisc which shape traffic
> (which implies they will sit on the queue for some period of time and
> where accuracy of timing is important; therefore low latency of
> scheduling was  needed).
>
Thanks for the explanation - I see the point of avoiding recursion.
Latency-wise, however, if we don't have the queue/tasklet in IFB and
instead rely on the upstream/downstream parts (either the qdisc or
tx/rx) to do the queueing, latency wouldn't be a problem anyway - or
did I miss something here?

> I didnt follow the context of "tasklets are unnecessary":
> Are tasklets bad or is the OP interested in replacing them with
> something s/he thinks is better? IIRC, at some point the idea was to
> infact use a softirq but it became obvious it was overkill.

I was thinking of directly forwarding the packet in the calling thread
context of IFB, since it's really an intermediate forwarding component
and the additional queueing/tasklet doesn't seem necessary (setting
aside the deep call stack/recursion concern).

The problem I have found is that the queue selection on the NIC is
actually inherited by IFB, and that affects how packets are handled in
the tasklet, since a tasklet is associated with each queue. We don't
really have a strict one-to-one mapping of cpu core to queue on the
NIC, whereas in the IFB case the queue is strictly mapped to a
tasklet/cpu core, which can cause undesirable contention.

The question was raised mostly because I wanted to process the packet
in IFB in the original thread context.

>
> cheers,
> jamal


Re: Why do we need tasklet in IFB?

2016-10-31 Thread Michael Ma
2016-10-31 11:02 GMT-07:00 Michael Ma <make0...@gmail.com>:
> 2016-10-28 14:52 GMT-07:00 Michael Ma <make0...@gmail.com>:
>> 2016-10-28 14:48 GMT-07:00 Stephen Hemminger <step...@networkplumber.org>:
>>> On Fri, 28 Oct 2016 14:45:07 -0700
>>> Michael Ma <make0...@gmail.com> wrote:
>>>
>>>> 2016-10-28 14:38 GMT-07:00 Stephen Hemminger <step...@networkplumber.org>:
>>>> > On Fri, 28 Oct 2016 14:36:27 -0700
>>>> > Michael Ma <make0...@gmail.com> wrote:
>>>> >
>>>> >> Hi -
>>>> >>
>>>> >> Currently IFB uses tasklet to process tx/rx on the interface that
>>>> >> forwarded the packet to IFB. My understanding on why we're doing this
>>>> >> is that since dev_queue_xmit() can be invoked in interrupt, we want to
>>>> >> defer the processing of original tx/rx in case ifb_xmit() is called
>>>> >> from interrupt.
>>>> >
>>>> > dev_queue_xmit is only called from interrupt if doing netconsole.
>>>>
>
> In fact this doesn't seem to explain since if the original path is tx
> and the context is interrupt, IFB will call dev_queue_xmit as well so
> the context can be interrupt in that case.
>
> Then tasklet is still unnecessary.
>
>>>> OK - so the reason is that netif_receive_skb() can only be invoked
>>>> from softirq and we have to use tasklet in IFB to guarantee this.
>>>>
>>>> Then if the original path is rx, tasklet is unnecessary because
>>>> ifb_xmit() is invoked from netif_receive_skb() which is always in the
>>>> softirq context, right?
>>>
>>> The other reason is to avoid deep kernel callstacks

I see - this seems to be the ultimate reason, but if we know the
actual IFB configuration is simple, the tasklet can still be avoided,
right?


Re: Why do we need tasklet in IFB?

2016-10-31 Thread Michael Ma
2016-10-28 14:52 GMT-07:00 Michael Ma <make0...@gmail.com>:
> 2016-10-28 14:48 GMT-07:00 Stephen Hemminger <step...@networkplumber.org>:
>> On Fri, 28 Oct 2016 14:45:07 -0700
>> Michael Ma <make0...@gmail.com> wrote:
>>
>>> 2016-10-28 14:38 GMT-07:00 Stephen Hemminger <step...@networkplumber.org>:
>>> > On Fri, 28 Oct 2016 14:36:27 -0700
>>> > Michael Ma <make0...@gmail.com> wrote:
>>> >
>>> >> Hi -
>>> >>
>>> >> Currently IFB uses tasklet to process tx/rx on the interface that
>>> >> forwarded the packet to IFB. My understanding on why we're doing this
>>> >> is that since dev_queue_xmit() can be invoked in interrupt, we want to
>>> >> defer the processing of original tx/rx in case ifb_xmit() is called
>>> >> from interrupt.
>>> >
>>> > dev_queue_xmit is only called from interrupt if doing netconsole.
>>>

In fact this doesn't seem to explain it, since if the original path is
tx and the context is interrupt, IFB will call dev_queue_xmit() as
well, so the context can still be interrupt in that case.

The tasklet is then still unnecessary.

>>> OK - so the reason is that netif_receive_skb() can only be invoked
>>> from softirq and we have to use tasklet in IFB to guarantee this.
>>>
>>> Then if the original path is rx, tasklet is unnecessary because
>>> ifb_xmit() is invoked from netif_receive_skb() which is always in the
>>> softirq context, right?
>>
>> The other reason is to avoid deep kernel callstacks


Why do we need tasklet in IFB?

2016-10-28 Thread Michael Ma
Hi -

Currently IFB uses a tasklet to process tx/rx on the interface that
forwarded the packet to IFB. My understanding of why we're doing this
is that since dev_queue_xmit() can be invoked in interrupt context, we
want to defer the processing of the original tx/rx in case ifb_xmit()
is called from an interrupt.

However, if the packet originally came from rx, the calling context
should already be a tasklet and there is no need to queue the
processing to another tasklet. Even if the packet originated from tx,
we can rely on the deferred processing of the "original device" or TC
if necessary. Did I miss anything here?

Furthermore, looking at the bonding device's code, there is no such
buffering/tasklet handling for packet forwarding, even though bonding
also has its own txq/rxq configured separately from the actual NIC,
which is very similar to IFB.

So why do we need tasklet in IFB?

Thanks,
Michael


Re: Modification to skb->queue_mapping affecting performance

2016-09-23 Thread Michael Ma
2016-09-16 15:00 GMT-07:00 Michael Ma <make0...@gmail.com>:
> 2016-09-16 12:53 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>> On Fri, 2016-09-16 at 10:57 -0700, Michael Ma wrote:
>>
>>> This is actually the problem - if flows from different RX queues are
>>> switched to the same RX queue in IFB, they'll use different processor
>>> context with the same tasklet, and the processor context of different
>>> tasklets might be the same. So multiple tasklets in IFB competes for
>>> the same core when queue is switched.
>>>
>>> The following simple fix proved this - with this change even switching
>>> the queue won't affect small packet bandwidth/latency anymore:
>>>
>>> in ifb.c:
>>>
>>> -   struct ifb_q_private *txp = dp->tx_private + 
>>> skb_get_queue_mapping(skb);
>>> +   struct ifb_q_private *txp = dp->tx_private +
>>> (smp_processor_id() % dev->num_tx_queues);
>>>
>>> This should be more efficient since we're not sending the task to a
>>> different processor, instead we try to queue the packet to an
>>> appropriate tasklet based on the processor ID. Will this cause any
>>> packet out-of-order problem? If packets from the same flow are queued
>>> to the same RX queue due to RSS, and processor affinity is set for RX
>>> queues, I assume packets from the same flow will end up in the same
>>> core when tasklet is scheduled. But I might have missed some uncommon
>>> cases here... Would appreciate if anyone can provide more insights.
>>
>> Wait, don't you have proper smp affinity for the RX queues on your NIC ?
>>
>> ( Documentation/networking/scaling.txt RSS IRQ Configuration )
>>
> Yes - what I was trying to say is that this change will be more
> efficient than using smp_call_function_single() to schedule the
> tasklet to a different processor.
>
> RSS IRQ should be set properly already. The issue here is that I'll
> need to switch the queue mapping for NIC RX to a different TXQ on IFB,
> which allows me to classify the flows at the IFB TXQ layer and avoid
> qdisc lock contention.
>
> When that switch happens, ideally processor core shouldn't be switched
> because all the thread context isn't changed. The work in tasklet
> should be scheduled to the same processor as well. That's why I tried
> this change. Also conceptually IFB is a software device which should
> be able to schedule its workload independent from how NIC is
> configured for the interrupt handling.
>
>> A driver ndo_start_xmit() MUST use skb_get_queue_mapping(skb), because
>> the driver queue is locked before ndo_start_xmit())  (for non
>> NETIF_F_LLTX drivers at least)
>>
>
> Thanks a lot for pointing out this! I was expecting this kind of
> guidance... Then the options would be:
>
> 1. Use smp_call_function_single() to schedule the tasklet to a core
> statically mapped to the IFB TXQ, which is very similar to how TX/RX
> IRQ is configured.

This actually won't help with throughput, because the load will
ultimately still be concentrated on particular cores once packets are
concentrated to a TXQ by queue-level classification.

> 2. As you suggested below add some additional action to do the
> rescheduling before entering IFB - for example when receiving the
> packet we could just use RSS to redirect to the desired RXQ, however
> this doesn't seem to be easy, especially compared with the way how
> mqprio chooses the queue. The challenge here is that IFB queue
> selection is based on queue_mapping when skb arrives at IFB and core
> selection is based on RXQ on NIC and so it's also based on
> queue_mapping when skb arrives at NIC. Then these two queue_mappings
> must be the same so that there is no core conflict of processing two
> TXQs of IFB. Then this essentially means we have to change queue
> mapping of the NIC on the receiver side which can't be achieved using
> TC.
>

I tried to explore this further. There is actually XPS on ifb, which
can be used to specify the processor cores that process each TXQ of
ifb. However, the problem is similar to the above - eventually a few
cores process these queues instead of all the cores processing
together with relatively light contention, and this again reduces the
throughput. So there isn't a good place to do this. The ultimate
problem is that we're trying to work around the qdisc spin lock
problem by leveraging the independence of TXQs, but after the qdisc
phase we also want to maximize the utilization of cores across
whatever TXQs are used.

>> In case of ifb, __skb_queue_tail(&txp->rq, skb); could corrupt the skb
>> list.

Re: Modification to skb->queue_mapping affecting performance

2016-09-16 Thread Michael Ma
2016-09-16 12:53 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
> On Fri, 2016-09-16 at 10:57 -0700, Michael Ma wrote:
>
>> This is actually the problem - if flows from different RX queues are
>> switched to the same RX queue in IFB, they'll use different processor
>> context with the same tasklet, and the processor context of different
>> tasklets might be the same. So multiple tasklets in IFB competes for
>> the same core when queue is switched.
>>
>> The following simple fix proved this - with this change even switching
>> the queue won't affect small packet bandwidth/latency anymore:
>>
>> in ifb.c:
>>
>> -   struct ifb_q_private *txp = dp->tx_private + 
>> skb_get_queue_mapping(skb);
>> +   struct ifb_q_private *txp = dp->tx_private +
>> (smp_processor_id() % dev->num_tx_queues);
>>
>> This should be more efficient since we're not sending the task to a
>> different processor, instead we try to queue the packet to an
>> appropriate tasklet based on the processor ID. Will this cause any
>> packet out-of-order problem? If packets from the same flow are queued
>> to the same RX queue due to RSS, and processor affinity is set for RX
>> queues, I assume packets from the same flow will end up in the same
>> core when tasklet is scheduled. But I might have missed some uncommon
>> cases here... Would appreciate if anyone can provide more insights.
>
> Wait, don't you have proper smp affinity for the RX queues on your NIC ?
>
> ( Documentation/networking/scaling.txt RSS IRQ Configuration )
>
Yes - what I was trying to say is that this change will be more
efficient than using smp_call_function_single() to schedule the
tasklet on a different processor.

RSS IRQ affinity should be set properly already. The issue here is
that I'll need to switch the queue mapping for NIC RX to a different
TXQ on IFB, which allows me to classify the flows at the IFB TXQ layer
and avoid qdisc lock contention.

When that switch happens, ideally the processor core shouldn't be
switched, because none of the thread context changes; the work in the
tasklet should be scheduled to the same processor as well. That's why
I tried this change. Also, conceptually IFB is a software device which
should be able to schedule its workload independently of how the NIC
is configured for interrupt handling.

> A driver ndo_start_xmit() MUST use skb_get_queue_mapping(skb), because
> the driver queue is locked before ndo_start_xmit())  (for non
> NETIF_F_LLTX drivers at least)
>

Thanks a lot for pointing this out! I was expecting this kind of
guidance... The options would then be:

1. Use smp_call_function_single() to schedule the tasklet on a core
statically mapped to the IFB TXQ, which is very similar to how the
TX/RX IRQ is configured.
2. As you suggested below, add some additional action to do the
rescheduling before entering IFB - for example, when receiving the
packet we could just use RSS to redirect to the desired RXQ. However,
this doesn't seem to be easy, especially compared with the way mqprio
chooses the queue. The challenge is that IFB queue selection is based
on queue_mapping when the skb arrives at IFB, while core selection is
based on the RXQ on the NIC, and thus also on queue_mapping when the
skb arrives at the NIC. These two queue_mappings must then be the same
so that no core ends up processing two TXQs of IFB. That essentially
means we have to change the queue mapping of the NIC on the receiver
side, which can't be achieved using TC.

> In case of ifb, __skb_queue_tail(&txp->rq, skb); could corrupt the skb
> list.
>
> In any case, you could have an action to do this before reaching IFB.
>
>
>


Re: Modification to skb->queue_mapping affecting performance

2016-09-16 Thread Michael Ma
2016-09-15 17:51 GMT-07:00 Michael Ma <make0...@gmail.com>:
> 2016-09-14 10:46 GMT-07:00 Michael Ma <make0...@gmail.com>:
>> 2016-09-13 22:22 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>>> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:
>>>
>>>> I don't intend to install multiple qdisc - the only reason that I'm
>>>> doing this now is to leverage MQ to workaround the lock contention,
>>>> and based on the profile this all worked. However to simplify the way
>>>> to setup HTB I wanted to use TXQ to partition HTB classes so that a
>>>> HTB class only belongs to one TXQ, which also requires mapping skb to
>>>> TXQ using some rules (here I'm using priority but I assume it's
>>>> straightforward to use other information such as classid). And the
>>>> problem I found here is that when using priority to infer the TXQ so
>>>> that queue_mapping is changed, bandwidth is affected significantly -
>>>> the only thing I can guess is that due to queue switch, there are more
>>>> cache misses assuming processor cores have a static mapping to all the
>>>> queues. Any suggestion on what to do next for the investigation?
>>>>
>>>> I would also guess that this should be a common problem if anyone
>>>> wants to use MQ+IFB to workaround the qdisc lock contention on the
>>>> receiver side and classful qdisc is used on IFB, but haven't really
>>>> found a similar thread here...
>>>
>>> But why are you changing the queue ?
>>>
>>> NIC already does the proper RSS thing, meaning all packets of one flow
>>> should land on one RX queue. No need to ' classify yourself and risk
>>> lock contention'
>>>
>>> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
>>> problem.
>>>
>>> Do you really need to rate limit flows ? Not clear what are your goals,
>>> why for example you use HTB to begin with.
>>>
>> Yes. My goal is to set different min/max bandwidth limits for
>> different processes, so we started with HTB. However with HTB the
>> qdisc root lock contention caused some unintended correlation between
>> flows in different classes. For example if some flows belonging to one
>> class have large amount of small packets, other flows in a different
>> class will get their effective bandwidth reduced because they'll wait
>> longer for the root lock. Using MQ this can be avoided because I'll
>> just put flows belonging to one class to its dedicated TXQ. Then
>> classes within one HTB on a TXQ will still have the lock contention
>> problem but classes in different HTB will use different root locks so
>> the contention doesn't exist.
>>
>> This also means that I'll need to classify packets to different
>> TXQ/HTB based on some skb metadata (essentially similar to what mqprio
>> is doing). So TXQ might need to be switched to achieve this.
>
> My current theory to this problem is that tasklets in IFB might be
> scheduled to the same cpu core if the RXQ happens to be the same for
> two different flows. When queue_mapping is modified and multiple flows
> are concentrated to the same IFB TXQ because they need to be
> controlled by the same HTB, they'll have to use the same tasklet
> because of the way IFB is implemented. So if other flows belonging to
> a different TXQ/tasklet happens to be scheduled on the same core, that
> core can be overloaded and becomes the bottleneck. Without modifying
> the queue_mapping the chance of this contention is much lower.
>
> This is a speculation based on the increased si time in softirqd
> process. I'll try to affinitize each tasklet with a cpu core to verify
> whether this is the problem. I also noticed that in the past there was
> a similar proposal of scheduling the tasklet to a dedicated core which
> was not committed(https://patchwork.ozlabs.org/patch/38486/). I'll try
> something similar to verify this theory.

This is actually the problem - if flows from different RX queues are
switched to the same queue in IFB, they'll be processed in different
processor contexts with the same tasklet, and the processor contexts
of different tasklets might be the same. So multiple tasklets in IFB
compete for the same core when the queue is switched.

The following simple fix proved this - with this change even switching
the queue won't affect small packet bandwidth/latency anymore:

in ifb.c:

-   struct ifb_q_private *txp = dp->tx_private + skb_get_queue_mapping(skb);
+   struct ifb_q_private *txp = dp->tx_private +
(smp_processor_id() % dev->num_tx_queues);

This should be more efficient, since we're not sending the task to a
different processor; instead we queue the packet to an appropriate
tasklet based on the processor ID. Will this cause any packet
out-of-order problems? If packets from the same flow are queued to the
same RX queue due to RSS, and processor affinity is set for the RX
queues, I assume packets from the same flow will end up on the same
core when the tasklet is scheduled. But I might have missed some
uncommon cases here... I would appreciate it if anyone could provide
more insights.
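The per-flow ordering argument behind the change can be sketched as a userspace model. All names and the static affinity assumption here are illustrative (the real mapping depends on RSS configuration and `smp_affinity` settings): RSS pins a flow to one RX queue, IRQ affinity pins that RX queue to one cpu, and the proposed change picks the ifb tasklet by cpu, so one flow always lands on one tasklet and stays in order.

```c
#include <stdint.h>

/* Userspace model of the chain flow -> RX queue -> cpu -> ifb tasklet,
 * assuming static RSS and IRQ affinity.  Stand-ins for RSS, the IRQ
 * affinity table, and the proposed smp_processor_id()-based selection. */
#define NRXQ 8
#define NCPU 8
#define NTXQ 4

static unsigned int rss_rxq(uint32_t flow_hash) { return flow_hash % NRXQ; }
static unsigned int irq_cpu(unsigned int rxq)   { return rxq % NCPU; }  /* static affinity */
static unsigned int ifb_txp(unsigned int cpu)   { return cpu % NTXQ; }  /* proposed fix    */

/* Tasklet (per-queue context) chosen for a packet of a given flow:
 * a pure function of the flow hash, hence stable per flow. */
static unsigned int tasklet_for_flow(uint32_t flow_hash)
{
    return ifb_txp(irq_cpu(rss_rxq(flow_hash)));
}
```

If the affinity is not static (e.g. irqbalance moves IRQs), the composition is no longer a pure function of the flow and reordering becomes possible - which is exactly the uncommon case the question above worries about.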


Re: Modification to skb->queue_mapping affecting performance

2016-09-15 Thread Michael Ma
2016-09-14 10:46 GMT-07:00 Michael Ma <make0...@gmail.com>:
> 2016-09-13 22:22 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:
>>
>>> I don't intend to install multiple qdisc - the only reason that I'm
>>> doing this now is to leverage MQ to workaround the lock contention,
>>> and based on the profile this all worked. However to simplify the way
>>> to setup HTB I wanted to use TXQ to partition HTB classes so that a
>>> HTB class only belongs to one TXQ, which also requires mapping skb to
>>> TXQ using some rules (here I'm using priority but I assume it's
>>> straightforward to use other information such as classid). And the
>>> problem I found here is that when using priority to infer the TXQ so
>>> that queue_mapping is changed, bandwidth is affected significantly -
>>> the only thing I can guess is that due to queue switch, there are more
>>> cache misses assuming processor cores have a static mapping to all the
>>> queues. Any suggestion on what to do next for the investigation?
>>>
>>> I would also guess that this should be a common problem if anyone
>>> wants to use MQ+IFB to workaround the qdisc lock contention on the
>>> receiver side and classful qdisc is used on IFB, but haven't really
>>> found a similar thread here...
>>
>> But why are you changing the queue ?
>>
>> NIC already does the proper RSS thing, meaning all packets of one flow
>> should land on one RX queue. No need to ' classify yourself and risk
>> lock contention'
>>
>> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
>> problem.
>>
>> Do you really need to rate limit flows ? Not clear what are your goals,
>> why for example you use HTB to begin with.
>>
> Yes. My goal is to set different min/max bandwidth limits for
> different processes, so we started with HTB. However with HTB the
> qdisc root lock contention caused some unintended correlation between
> flows in different classes. For example if some flows belonging to one
> class have large amount of small packets, other flows in a different
> class will get their effective bandwidth reduced because they'll wait
> longer for the root lock. Using MQ this can be avoided because I'll
> just put flows belonging to one class to its dedicated TXQ. Then
> classes within one HTB on a TXQ will still have the lock contention
> problem but classes in different HTB will use different root locks so
> the contention doesn't exist.
>
> This also means that I'll need to classify packets to different
> TXQ/HTB based on some skb metadata (essentially similar to what mqprio
> is doing). So TXQ might need to be switched to achieve this.

My current theory about this problem is that tasklets in IFB might be
scheduled on the same cpu core if the RXQ happens to be the same for
two different flows. When queue_mapping is modified and multiple flows
are concentrated to the same IFB TXQ because they need to be
controlled by the same HTB, they'll have to use the same tasklet
because of the way IFB is implemented. So if other flows belonging to
a different TXQ/tasklet happen to be scheduled on the same core, that
core can be overloaded and become the bottleneck. Without modifying
the queue_mapping, the chance of this contention is much lower.

This is speculation based on the increased si time in the ksoftirqd
process. I'll try to affinitize each tasklet with a cpu core to verify
whether this is the problem. I also noticed that in the past there was
a similar proposal to schedule the tasklet to a dedicated core, which
was not committed (https://patchwork.ozlabs.org/patch/38486/). I'll
try something similar to verify this theory.


Re: Modification to skb->queue_mapping affecting performance

2016-09-14 Thread Michael Ma
2016-09-13 22:22 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
> On Tue, 2016-09-13 at 22:13 -0700, Michael Ma wrote:
>
>> I don't intend to install multiple qdisc - the only reason that I'm
>> doing this now is to leverage MQ to workaround the lock contention,
>> and based on the profile this all worked. However to simplify the way
>> to setup HTB I wanted to use TXQ to partition HTB classes so that a
>> HTB class only belongs to one TXQ, which also requires mapping skb to
>> TXQ using some rules (here I'm using priority but I assume it's
>> straightforward to use other information such as classid). And the
>> problem I found here is that when using priority to infer the TXQ so
>> that queue_mapping is changed, bandwidth is affected significantly -
>> the only thing I can guess is that due to queue switch, there are more
>> cache misses assuming processor cores have a static mapping to all the
>> queues. Any suggestion on what to do next for the investigation?
>>
>> I would also guess that this should be a common problem if anyone
>> wants to use MQ+IFB to workaround the qdisc lock contention on the
>> receiver side and classful qdisc is used on IFB, but haven't really
>> found a similar thread here...
>
> But why are you changing the queue ?
>
> NIC already does the proper RSS thing, meaning all packets of one flow
> should land on one RX queue. No need to ' classify yourself and risk
> lock contention'
>
> I use IFB + MQ + netem every day, and it scales to 10 Mpps with no
> problem.
>
> Do you really need to rate limit flows ? Not clear what are your goals,
> why for example you use HTB to begin with.
>
Yes. My goal is to set different min/max bandwidth limits for
different processes, so we started with HTB. However with HTB the
qdisc root lock contention caused some unintended correlation between
flows in different classes. For example, if some flows belonging to one
class carry a large number of small packets, other flows in a different
class will get their effective bandwidth reduced because they'll wait
longer for the root lock. With MQ this can be avoided because I'll
just put flows belonging to one class to its dedicated TXQ. Then
classes within one HTB on a TXQ will still have the lock contention
problem but classes in different HTB will use different root locks so
the contention doesn't exist.

This also means that I'll need to classify packets to different
TXQ/HTB based on some skb metadata (essentially similar to what mqprio
is doing). So TXQ might need to be switched to achieve this.
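A minimal sketch of that layout (hypothetical device name, handles and
rates — not the poster's actual configuration) could look like:

```shell
# Sketch only: give each traffic class its own TXQ and its own HTB root,
# so that each HTB hierarchy has a private root lock. Device, handles
# and rates below are assumptions for illustration.
ETH=eth0

# mq root: one independent child qdisc per hardware TX queue
tc qdisc add dev $ETH root handle 1: mq

# Class A lives entirely on TXQ 0
tc qdisc add dev $ETH parent 1:1 handle 10: htb default 1
tc class add dev $ETH parent 10: classid 10:1 htb rate 500Mbit ceil 1Gbit

# Class B lives entirely on TXQ 1, under a separate root lock
tc qdisc add dev $ETH parent 1:2 handle 20: htb default 1
tc class add dev $ETH parent 20: classid 20:1 htb rate 2Gbit ceil 4Gbit
```

Packets still have to be steered to the right TXQ (e.g. by rewriting
skb->queue_mapping from skb priority or classid), which is exactly the
step where the bandwidth regression in this thread was observed.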


Re: Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Michael Ma
2016-09-13 22:13 GMT-07:00 Michael Ma <make0...@gmail.com>:
> 2016-09-13 18:18 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>> On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:
>>
>>> If I understand correctly this is still to associate a qdisc with each
>>> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
>>> to divide the bandwidth of each class in HTB by the number of TX
>>> queues for each individual HTB qdisc associated?
>>>
>>> My original idea was to attach a HTB qdisc for each ifb queue
>>> representing a set of flows not sharing bandwidth with others so that
>>> root lock contention still happens but only affects flows in the same
>>> HTB. Did I understand the root lock contention issue incorrectly for
>>> ifb? I do see some comments in __dev_queue_xmit() about using a
>>> different code path for software devices which bypasses
>>> __dev_xmit_skb(). Does this mean ifb won't go through
>>> __dev_xmit_skb()?
>>
>> You can install HTB on all of your MQ children for sure.
>>
>> Again, there is no qdisc lock contention if you properly use MQ.
>>
>> Now if you _need_ to install a single qdisc for whatever reason, then
>> maybe you want to use a single rx queue on the NIC, to reduce lock
>> contention ;)

Yes - this might reduce lock contention but there would still be
contention and I'm really looking for more concurrency...

>>
>>
> I don't intend to install multiple qdisc - the only reason that I'm
> doing this now is to leverage MQ to workaround the lock contention,
> and based on the profile this all worked. However to simplify the way
> to setup HTB I wanted to use TXQ to partition HTB classes so that a
> HTB class only belongs to one TXQ, which also requires mapping skb to
> TXQ using some rules (here I'm using priority but I assume it's
> straightforward to use other information such as classid). And the
> problem I found here is that when using priority to infer the TXQ so
> that queue_mapping is changed, bandwidth is affected significantly -
> the only thing I can guess is that due to queue switch, there are more
> cache misses assuming processor cores have a static mapping to all the
> queues. Any suggestion on what to do next for the investigation?
>
> I would also guess that this should be a common problem if anyone
> wants to use MQ+IFB to workaround the qdisc lock contention on the
> receiver side and classful qdisc is used on IFB, but haven't really
> found a similar thread here...

Hi Cong - I saw quite a few threads from you regarding ingress qdisc
+ MQ and queue_mapping issues. Do you by any chance have a similar
setup (classful qdiscs associated with the queues of IFB, which requires
queue_mapping modification so that qdisc selection is done at
queue-selection time based on information such as skb
priority/classid)? Would appreciate any suggestions.


Re: Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Michael Ma
2016-09-13 18:18 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
> On Tue, 2016-09-13 at 17:23 -0700, Michael Ma wrote:
>
>> If I understand correctly this is still to associate a qdisc with each
>> ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
>> to divide the bandwidth of each class in HTB by the number of TX
>> queues for each individual HTB qdisc associated?
>>
>> My original idea was to attach a HTB qdisc for each ifb queue
>> representing a set of flows not sharing bandwidth with others so that
>> root lock contention still happens but only affects flows in the same
>> HTB. Did I understand the root lock contention issue incorrectly for
>> ifb? I do see some comments in __dev_queue_xmit() about using a
>> different code path for software devices which bypasses
>> __dev_xmit_skb(). Does this mean ifb won't go through
>> __dev_xmit_skb()?
>
> You can install HTB on all of your MQ children for sure.
>
> Again, there is no qdisc lock contention if you properly use MQ.
>
> Now if you _need_ to install a single qdisc for whatever reason, then
> maybe you want to use a single rx queue on the NIC, to reduce lock
> contention ;)
>
>
I don't intend to install multiple qdisc - the only reason that I'm
doing this now is to leverage MQ to workaround the lock contention,
and based on the profile this all worked. However to simplify the way
to setup HTB I wanted to use TXQ to partition HTB classes so that a
HTB class only belongs to one TXQ, which also requires mapping skb to
TXQ using some rules (here I'm using priority but I assume it's
straightforward to use other information such as classid). And the
problem I found here is that when using priority to infer the TXQ so
that queue_mapping is changed, bandwidth is affected significantly -
the only thing I can guess is that due to queue switch, there are more
cache misses assuming processor cores have a static mapping to all the
queues. Any suggestion on what to do next for the investigation?

I would also guess that this should be a common problem if anyone
wants to use MQ+IFB to workaround the qdisc lock contention on the
receiver side and classful qdisc is used on IFB, but haven't really
found a similar thread here...


Re: Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Michael Ma
2016-09-13 17:09 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
> On Tue, 2016-09-13 at 16:30 -0700, Michael Ma wrote:
>
>> The RX queue number I found from "ls /sys/class/net/eth0/queues" is
>> 64. (is this the correct way of identifying the queue number on NIC?)
>> I setup ifb with 24 queues which is equal to the TX queue number of
>> eth0 and also the number of CPU cores.
>
> Please do not drop netdev@ from this mail exchange.

Sorry that I accidentally dropped that.
>
> ethtool -l eth0
>
>>
>> > There is no qdisc lock contention anymore AFAIK, since each cpu will use
>> > a dedicate IFB queue and tasklet.
>> >
>> How is this achieved? I thought qdisc on ifb will still be protected
>> by the qdisc root lock in __dev_xmit_skb() so essentially all threads
>> processing qdisc are still serialized without using MQ?
>
> You have to properly setup ifb/mq like in :
>
> # netem based setup, installed at receiver side only
> ETH=eth0
> IFB=ifb10
> #DELAY="delay 100ms"
> EST="est 1sec 4sec"
> #REORDER=1000us
> #LOSS="loss 2.0"
> TXQ=24  # change this to number of TX queues on the physical NIC
>
> ip link add $IFB numtxqueues $TXQ type ifb
> ip link set dev $IFB up
>
> tc qdisc del dev $ETH ingress 2>/dev/null
> tc qdisc add dev $ETH ingress 2>/dev/null
>
> tc filter add dev $ETH parent ffff: \
>    protocol ip u32 match u32 0 0 flowid 1:1 \
> action mirred egress redirect dev $IFB
>
> tc qdisc del dev $IFB root 2>/dev/null
>
> tc qdisc add dev $IFB root handle 1: mq
> for i in `seq 1 $TXQ`
> do
>  slot=$( printf %x $(( i )) )
>  tc qd add dev $IFB parent 1:$slot $EST netem \
> limit 10 $DELAY $REORDER $LOSS
> done
>
>
If I understand correctly this is still to associate a qdisc with each
ifb TXQ. How should I do this if I want to use HTB? I guess I'll need
to divide the bandwidth of each class in HTB by the number of TX
queues for each individual HTB qdisc associated?

My original idea was to attach a HTB qdisc for each ifb queue
representing a set of flows not sharing bandwidth with others so that
root lock contention still happens but only affects flows in the same
HTB. Did I understand the root lock contention issue incorrectly for
ifb? I do see some comments in __dev_queue_xmit() about using a
different code path for software devices which bypasses
__dev_xmit_skb(). Does this mean ifb won't go through
__dev_xmit_skb()?


Modification to skb->queue_mapping affecting performance

2016-09-13 Thread Michael Ma
Hi -

We currently use mqprio on ifb to work around the qdisc root lock
contention on the receiver side. The problem that we found was that
queue_mapping is already set when redirecting from ingress qdisc to
ifb (based on RX selection, I guess?) so the TX queue selection is not
based on priority.

Then we implemented a filter which can set skb->queue_mapping to 0 so
that TX queue selection can be done as expected and flows with
different priorities will go through different TX queues. However with
the queue_mapping recomputed, we found the achievable bandwidth with
small packets (512 bytes) dropped significantly if they're targeting
different queues. From perf profile I don't see any bottleneck from
CPU perspective.
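For reference, a filter of roughly that shape can be expressed with the
stock skbedit action chained before the mirred redirect (a sketch only,
not the custom filter actually used here; the device names, the
catch-all u32 match and queue index 0 are placeholder assumptions):

```shell
# Sketch: at eth0 ingress, reset queue_mapping before mirred redirects
# the packet into ifb0, so that the TX queue selection on the ifb is
# not stuck with the value inherited from RX queue selection.
tc filter add dev eth0 parent ffff: protocol ip u32 match u32 0 0 \
  action skbedit queue_mapping 0 \
  action mirred egress redirect dev ifb0
```

Whether the rewritten value survives into the ifb's queue selection
depends on the kernel version and path, so treat this as a starting
point rather than a drop-in equivalent of the filter described above.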

Any thoughts on why modifying queue_mapping will have this kind of
effect? Also is there any better way of achieving receiver side
throttling using HTB while avoiding the qdisc root lock on ifb?

Thanks,
Michael


Re: qdisc spin lock

2016-04-25 Thread Michael Ma
2016-04-21 15:12 GMT-07:00 Michael Ma <make0...@gmail.com>:
> 2016-04-21 5:41 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>> On Wed, 2016-04-20 at 22:51 -0700, Michael Ma wrote:
>>> 2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>>> > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
>>> >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>>> >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>>> >> >> I didn't really know that multiple qdiscs can be isolated using MQ so
>>> >> >> that each txq can be associated with a particular qdisc. Also we don't
>>> >> >> really have multiple interfaces...
>>> >> >>
>>> >> >> With this MQ solution we'll still need to assign transmit queues to
>>> >> >> different classes by doing some math on the bandwidth limit if I
>>> >> >> understand correctly, which seems to be less convenient compared with
>>> >> >> a solution purely within HTB.
>>> >> >>
>>> >> >> I assume that with this solution I can still share qdisc among
>>> >> >> multiple transmit queues - please let me know if this is not the case.
>>> >> >
>>> >> > Note that this MQ + HTB thing works well, unless you use a bonding
>>> >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
>>> >> > tokens between the slaves)
>>> >>
>>> >> Actually MQ+HTB works well for small packets - like flow of 512 byte
>>> >> packets can be throttled by HTB using one txq without being affected
>>> >> by other flows with small packets. However I found using this solution
>>> >> large packets (10k for example) will only achieve very limited
>>> >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
>>> >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
>>> >> using 30 threads. But sending 10k packets using 10 threads has only 10
>>> >> Mbit/s with the same TC configuration. If I increase burst and cburst
>>> >> of HTB to some extreme large value (like 50MB) the ceiling rate can be
>>> >> hit.
>>> >>
>>> >> The strange thing is that I don't see this problem when using HTB as
>>> >> the root. So txq number seems to be a factor here - however it's
>>> >> really hard to understand why would it only affect larger packets. Is
>>> >> this a known issue? Any suggestion on how to investigate the issue
>>> >> further? Profiling shows that the cpu utilization is pretty low.
>>> >
>>> > You could try
>>> >
>>> > perf record -a -g -e skb:kfree_skb sleep 5
>>> > perf report
>>> >
>>> > So that you see where the packets are dropped.
>>> >
>>> > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are
>>> > dropped at qdisc enqueue time, instead of having backpressure.
>>> >
>>>
>>> Thanks for the hint - how should I read the perf report? Also we're
>>> using TCP socket in this testing - TCP window size is set to 70kB.
>>
>> But how are you telling TCP to send 10k packets ?
>>
> We just write to the socket with 10k buffer and wait for a response
> from the server (using read()) before the next write. Using tcpdump I
> can see the 10k write is actually sent through 3 packets
> (7.3k/1.5k/1.3k).
>
>> AFAIK you can not : TCP happily aggregates packets in write queue
>> (see current MSG_EOR discussion)
>>
>> I suspect a bug in your tc settings.
>>
>>
>
> Could you help to check my tc setting?
>
> sudo tc qdisc add dev eth0 root mqprio num_tc 6 map 0 1 2 3 4 5 0 0
> queues 19@0 1@19 1@20 1@21 1@22 1@23 hw 0
> sudo tc qdisc add dev eth0 parent 805a:1a handle 8001:0 htb default 10
> sudo tc class add dev eth0 parent 8001: classid 8001:10 htb rate 1000Mbit
>
> I didn't set r2q/burst/cburst/mtu/mpu so the default value should be used.

Just to circle back on this - it seems there is 200ms delay sometimes
during data push which stalled the sending:

01:34:44.046232 IP (tos 0x0, ttl  64, id 2863, offset 0, flags [DF],
proto: TCP (6), length: 8740) 10.101.197.75.59126 >
10.101.197.105.redwood-broker: . 250025:258713(8688) ack 1901 win 58
<nop,nop,timestamp 507571833 196626529>
01:34:44.046304 IP (tos 0x0, ttl  64, id 15420, offset 0, flags [DF],
prot

Re: qdisc spin lock

2016-04-21 Thread Michael Ma
2016-04-21 5:41 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
> On Wed, 2016-04-20 at 22:51 -0700, Michael Ma wrote:
>> 2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>> > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
>> >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>> >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> >> >> I didn't really know that multiple qdiscs can be isolated using MQ so
>> >> >> that each txq can be associated with a particular qdisc. Also we don't
>> >> >> really have multiple interfaces...
>> >> >>
>> >> >> With this MQ solution we'll still need to assign transmit queues to
>> >> >> different classes by doing some math on the bandwidth limit if I
>> >> >> understand correctly, which seems to be less convenient compared with
>> >> >> a solution purely within HTB.
>> >> >>
>> >> >> I assume that with this solution I can still share qdisc among
>> >> >> multiple transmit queues - please let me know if this is not the case.
>> >> >
>> >> > Note that this MQ + HTB thing works well, unless you use a bonding
>> >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
>> >> > tokens between the slaves)
>> >>
>> >> Actually MQ+HTB works well for small packets - like flow of 512 byte
>> >> packets can be throttled by HTB using one txq without being affected
>> >> by other flows with small packets. However I found using this solution
>> >> large packets (10k for example) will only achieve very limited
>> >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
>> >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
>> >> using 30 threads. But sending 10k packets using 10 threads has only 10
>> >> Mbit/s with the same TC configuration. If I increase burst and cburst
>> >> of HTB to some extreme large value (like 50MB) the ceiling rate can be
>> >> hit.
>> >>
>> >> The strange thing is that I don't see this problem when using HTB as
>> >> the root. So txq number seems to be a factor here - however it's
>> >> really hard to understand why would it only affect larger packets. Is
>> >> this a known issue? Any suggestion on how to investigate the issue
>> >> further? Profiling shows that the cpu utilization is pretty low.
>> >
>> > You could try
>> >
>> > perf record -a -g -e skb:kfree_skb sleep 5
>> > perf report
>> >
>> > So that you see where the packets are dropped.
>> >
>> > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are
>> > dropped at qdisc enqueue time, instead of having backpressure.
>> >
>>
>> Thanks for the hint - how should I read the perf report? Also we're
>> using TCP socket in this testing - TCP window size is set to 70kB.
>
> But how are you telling TCP to send 10k packets ?
>
We just write to the socket with 10k buffer and wait for a response
from the server (using read()) before the next write. Using tcpdump I
can see the 10k write is actually sent through 3 packets
(7.3k/1.5k/1.3k).

> AFAIK you can not : TCP happily aggregates packets in write queue
> (see current MSG_EOR discussion)
>
> I suspect a bug in your tc settings.
>
>

Could you help to check my tc setting?

sudo tc qdisc add dev eth0 root mqprio num_tc 6 map 0 1 2 3 4 5 0 0
queues 19@0 1@19 1@20 1@21 1@22 1@23 hw 0
sudo tc qdisc add dev eth0 parent 805a:1a handle 8001:0 htb default 10
sudo tc class add dev eth0 parent 8001: classid 8001:10 htb rate 1000Mbit

I didn't set r2q/burst/cburst/mtu/mpu so the default value should be used.
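When in doubt about a setup like this, the statistics views show what
was actually installed and where packets queue or drop (standard tc
invocations; eth0 as in the commands above):

```shell
# Show the installed qdisc hierarchy with per-qdisc sent/drop/backlog counters
tc -s qdisc show dev eth0
# Show HTB class-level counters (rate, tokens, drops, borrows)
tc -s class show dev eth0
```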


Re: qdisc spin lock

2016-04-20 Thread Michael Ma
2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
> On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
>> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> >> I didn't really know that multiple qdiscs can be isolated using MQ so
>> >> that each txq can be associated with a particular qdisc. Also we don't
>> >> really have multiple interfaces...
>> >>
>> >> With this MQ solution we'll still need to assign transmit queues to
>> >> different classes by doing some math on the bandwidth limit if I
>> >> understand correctly, which seems to be less convenient compared with
>> >> a solution purely within HTB.
>> >>
>> >> I assume that with this solution I can still share qdisc among
>> >> multiple transmit queues - please let me know if this is not the case.
>> >
>> > Note that this MQ + HTB thing works well, unless you use a bonding
>> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
>> > tokens between the slaves)
>>
>> Actually MQ+HTB works well for small packets - like flow of 512 byte
>> packets can be throttled by HTB using one txq without being affected
>> by other flows with small packets. However I found using this solution
>> large packets (10k for example) will only achieve very limited
>> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
>> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
>> using 30 threads. But sending 10k packets using 10 threads has only 10
>> Mbit/s with the same TC configuration. If I increase burst and cburst
>> of HTB to some extreme large value (like 50MB) the ceiling rate can be
>> hit.
>>
>> The strange thing is that I don't see this problem when using HTB as
>> the root. So txq number seems to be a factor here - however it's
>> really hard to understand why would it only affect larger packets. Is
>> this a known issue? Any suggestion on how to investigate the issue
>> further? Profiling shows that the cpu utilization is pretty low.
>
> You could try
>
> perf record -a -g -e skb:kfree_skb sleep 5
> perf report
>
> So that you see where the packets are dropped.
>
> Chances are that your UDP sockets SO_SNDBUF is too big, and packets are
> dropped at qdisc enqueue time, instead of having backpressure.
>

Thanks for the hint - how should I read the perf report? Also we're
using TCP socket in this testing - TCP window size is set to 70kB.

-  35.88%     init  [kernel.kallsyms]  [k] intel_idle
   - intel_idle
-  15.83%  strings  libc-2.5.so        [.] __GI___connect_internal
   - __GI___connect_internal
      - 50.00% get_mapping
           __nscd_get_map_ref
        50.00% __nscd_open_socket
-  13.19%  strings  libc-2.5.so        [.] __GI___libc_recvmsg
   - __GI___libc_recvmsg
      + 64.52% getifaddrs
      + 35.48% __check_pf
-  10.55%  strings  libc-2.5.so        [.] __sendto_nocancel
   - __sendto_nocancel
        100.00% 0
>
>


Re: qdisc spin lock

2016-04-20 Thread Michael Ma
2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
> On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> I didn't really know that multiple qdiscs can be isolated using MQ so
>> that each txq can be associated with a particular qdisc. Also we don't
>> really have multiple interfaces...
>>
>> With this MQ solution we'll still need to assign transmit queues to
>> different classes by doing some math on the bandwidth limit if I
>> understand correctly, which seems to be less convenient compared with
>> a solution purely within HTB.
>>
>> I assume that with this solution I can still share qdisc among
>> multiple transmit queues - please let me know if this is not the case.
>
> Note that this MQ + HTB thing works well, unless you use a bonding
> device. (Or you need the MQ+HTB on the slaves, with no way of sharing
> tokens between the slaves)

Actually MQ+HTB works well for small packets - like flow of 512 byte
packets can be throttled by HTB using one txq without being affected
by other flows with small packets. However I found using this solution
large packets (10k for example) will only achieve very limited
bandwidth. In my test I used MQ to assign one txq to a HTB which sets
rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
using 30 threads. But sending 10k packets using 10 threads has only 10
Mbit/s with the same TC configuration. If I increase burst and cburst
of HTB to some extreme large value (like 50MB) the ceiling rate can be
hit.

The strange thing is that I don't see this problem when using HTB as
the root. So txq number seems to be a factor here - however it's
really hard to understand why would it only affect larger packets. Is
this a known issue? Any suggestion on how to investigate the issue
further? Profiling shows that the cpu utilization is pretty low.

>
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3
>
> bonding can not really be used as a true MQ device yet.
>
> I might send a patch to disable this 'bonding feature' if no slave sets
> a queue_id.
>
>


Re: qdisc spin lock

2016-04-15 Thread Michael Ma
2016-04-15 15:54 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
> On Fri, 2016-04-15 at 15:46 -0700, Michael Ma wrote:
>> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
>> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> >> I didn't really know that multiple qdiscs can be isolated using MQ so
>> >> that each txq can be associated with a particular qdisc. Also we don't
>> >> really have multiple interfaces...
>> >>
>> >> With this MQ solution we'll still need to assign transmit queues to
>> >> different classes by doing some math on the bandwidth limit if I
>> >> understand correctly, which seems to be less convenient compared with
>> >> a solution purely within HTB.
>> >>
>> >> I assume that with this solution I can still share qdisc among
>> >> multiple transmit queues - please let me know if this is not the case.
>> >
>> > Note that this MQ + HTB thing works well, unless you use a bonding
>> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
>> > tokens between the slaves)
>> >
>> >
>> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3
>> >
>> > bonding can not really be used as a true MQ device yet.
>> >
>> > I might send a patch to disable this 'bonding feature' if no slave sets
>> > a queue_id.
>> >
>> >
>> So there is no way of using this MQ solution when bonding interface is
>> used, right? Then modifying HTB might be the only solution?
>
> I probably can submit a bonding patch very soon if there is interest.

Would definitely appreciate that. If you can share the patch it will
be helpful as well. Let me know if I can help with this...
>
> Modifying HTB is way more complicated :(
>
>
>


Re: qdisc spin lock

2016-04-15 Thread Michael Ma
2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.duma...@gmail.com>:
> On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> I didn't really know that multiple qdiscs can be isolated using MQ so
>> that each txq can be associated with a particular qdisc. Also we don't
>> really have multiple interfaces...
>>
>> With this MQ solution we'll still need to assign transmit queues to
>> different classes by doing some math on the bandwidth limit if I
>> understand correctly, which seems to be less convenient compared with
>> a solution purely within HTB.
>>
>> I assume that with this solution I can still share qdisc among
>> multiple transmit queues - please let me know if this is not the case.
>
> Note that this MQ + HTB thing works well, unless you use a bonding
> device. (Or you need the MQ+HTB on the slaves, with no way of sharing
> tokens between the slaves)
>
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3
>
> bonding can not really be used as a true MQ device yet.
>
> I might send a patch to disable this 'bonding feature' if no slave sets
> a queue_id.
>
>
So there is no way of using this MQ solution when bonding interface is
used, right? Then modifying HTB might be the only solution?


Re: qdisc spin lock

2016-04-13 Thread Michael Ma
2016-03-31 20:44 GMT-07:00 John Fastabend <john.fastab...@gmail.com>:
> On 16-03-31 04:48 PM, Michael Ma wrote:
>> I didn't really know that multiple qdiscs can be isolated using MQ so
>> that each txq can be associated with a particular qdisc. Also we don't
>> really have multiple interfaces...
>
> MQ will assign a default qdisc to each txq and the default qdisc can
> be changed to htb or any other qdisc of your choice.
>
>>
>> With this MQ solution we'll still need to assign transmit queues to
>> different classes by doing some math on the bandwidth limit if I
>> understand correctly, which seems to be less convenient compared with
>> a solution purely within HTB.
>>
>
> Agreed.
>
>> I assume that with this solution I can still share qdisc among
>> multiple transmit queues - please let me know if this is not the case.
>
> Nope sorry doesn't work that way unless you employ some sort of stacked
> netdevice strategy which does start to get a bit complex. The basic hint
> would be to stack some type of virtual netdev on top of a device and
> run the htb qdisc there. Push traffic onto the netdev depending on the
> class it belongs to. Its ugly yes.

Thanks for the input - is it possible to implement an MQ variant that
allows one qdisc to be associated with multiple TXQs belonging to one
inner class? Does that require any additional synchronization in qdisc
routines? I think it shouldn't, because the common use of TC already
associates one qdisc with all the TXQs. This would simplify the qdisc
setup since a qdisc doesn't need to be created for each queue.

Also is it possible to allow multiple qdiscs to be associated with one
TXQ? I can see one issue with this - the TC interface for MQ today
doesn't allow one TXQ to be associated with multiple classes, but this
seems to be a restriction only from the tc command interface
perspective. MQ anyway only establishes a connection between TXQ and
qdisc so theoretically we can associate one TXQ to multiple
classes/qdiscs and use priority/classid of the packet to do the
pre-selection of qdisc. Again, I might have missed some potential
synchronization issue when this actually happens because today TXQ is
not shared by multiple qdiscs in any case.

>
> Noting all that I posted an RFC patch some time back to allow writing
> qdiscs that do not require taking the lock. I'll try to respin these
> and submit them when net-next opens again. The next logical step is to
> write a "better" HTB probably using a shared counter and dropping the
> requirement that it be exact.
>
> Sorry I didn't get a chance to look at the paper in your post so not
> sure if they suggest something similar or not.
>
> Thanks,
> John
>
>>
>> 2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangc...@gmail.com>:
>>> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0...@gmail.com> wrote:
>>>> As far as I understand the design of TC is to simplify locking schema
>>>> and minimize the work in __qdisc_run so that throughput won’t be
>>>> affected, especially with large packets. However if the scenario is
>>>> that multiple classes in the queueing discipline only have the shaping
>>>> limit, there isn’t really a necessary correlation between different
>>>> classes. The only synchronization point should be when the packet is
>>>> dequeued from the qdisc queue and enqueued to the transmit queue of
>>>> the device. My question is – is it worth investing on avoiding the
>>>> locking contention by partitioning the queue/lock so that this
>>>> scenario is addressed with relatively smaller latency?
>>>
>>> If your HTB classes don't share bandwidth, why do you still make them
>>> under the same hierarchy? IOW, you can just isolate them either with some
>>> other qdisc or just separated interfaces.
>


Re: qdisc spin lock

2016-04-01 Thread Michael Ma
2016-03-31 19:19 GMT-07:00 David Miller <da...@davemloft.net>:
> From: Michael Ma <make0...@gmail.com>
> Date: Thu, 31 Mar 2016 16:48:43 -0700
>
>> I didn't really know that multiple qdiscs can be isolated using MQ so
>  ...
>
> Please stop top-posting.

Sorry that I wasn't aware of this...


Re: qdisc spin lock

2016-03-31 Thread Michael Ma
I didn't really know that multiple qdiscs can be isolated using MQ so
that each txq can be associated with a particular qdisc. Also we don't
really have multiple interfaces...

With this MQ solution we'll still need to assign transmit queues to
different classes by doing some math on the bandwidth limit if I
understand correctly, which seems to be less convenient compared with
a solution purely within HTB.

I assume that with this solution I can still share qdisc among
multiple transmit queues - please let me know if this is not the case.

2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangc...@gmail.com>:
> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0...@gmail.com> wrote:
>> As far as I understand the design of TC is to simplify locking schema
>> and minimize the work in __qdisc_run so that throughput won’t be
>> affected, especially with large packets. However if the scenario is
>> that multiple classes in the queueing discipline only have the shaping
>> limit, there isn’t really a necessary correlation between different
>> classes. The only synchronization point should be when the packet is
>> dequeued from the qdisc queue and enqueued to the transmit queue of
>> the device. My question is – is it worth investing on avoiding the
>> locking contention by partitioning the queue/lock so that this
>> scenario is addressed with relatively smaller latency?
>
> If your HTB classes don't share bandwidth, why do you still make them
> under the same hierarchy? IOW, you can just isolate them either with some
> other qdisc or just separated interfaces.


Re: qdisc spin lock

2016-03-31 Thread Michael Ma
Thanks for the suggestion - I'll try the MQ solution out. It seems to
be able to solve the problem well with the assumption that bandwidth
can be statically partitioned.

2016-03-31 12:18 GMT-07:00 Jesper Dangaard Brouer <bro...@redhat.com>:
>
> On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0...@gmail.com> wrote:
>
>> I know this might be an old topic so bear with me – what we are facing
>> is that applications are sending small packets using hundreds of
>> threads so the contention on spin lock in __dev_xmit_skb increases the
>> latency of dev_queue_xmit significantly. We’re building a network QoS
>> solution to avoid interference of different applications using HTB.
>
> Yes, as you have noticed with HTB there is a single qdisc lock, and
> congestion obviously happens :-)
>
> It is possible with different tricks to make it scale.  I believe
> Google is using a variant of HTB, and it scales for them.  They have
> not open sourced their modifications to HTB (which likely also involve
> a great deal of setup tricks).
>
> If your purpose is to limit traffic/bandwidth per "cloud" instance,
> then you can just use another TC setup structure.  Like using MQ and
> assigning a HTB per MQ queue (where the MQ queues are bound to each
> CPU/HW queue)... But you have to figure out this setup yourself...
>
>
>> But in this case when some applications send massive small packets in
>> parallel, the application to be protected will get its throughput
>> affected (because it’s doing synchronous network communication using
>> multiple threads and throughput is sensitive to the increased latency).
>>
>> Here is the profiling from perf:
>>
>> -  67.57%   iperf  [kernel.kallsyms] [k] _spin_lock
>>   - 99.94% dev_queue_xmit
>>   -  96.91% _spin_lock
>>  - 2.62%  __qdisc_run
>>   - 98.98% sch_direct_xmit
>> - 99.98% _spin_lock
>>
>> As far as I understand, the design of TC is to simplify the locking
>> scheme and minimize the work in __qdisc_run so that throughput won’t
>> be affected, especially with large packets. However, if the classes in
>> the queueing discipline only have a shaping limit, there isn’t really
>> any necessary correlation between different classes. The only
>> synchronization point should be when a packet is dequeued from the
>> qdisc queue and enqueued to the transmit queue of the device. My
>> question is – is it worth investing in avoiding the lock contention by
>> partitioning the queue/lock so that this scenario is handled with
>> lower latency?
>
> Yes, there is a lot to gain, but it is not easy ;-)
>
>> I must have oversimplified a lot of details since I’m not familiar
>> with the TC implementation at this point – I just want to get your
>> input on whether this is a worthwhile effort or whether there is
>> something fundamental that I’m not aware of. If it is just a matter of
>> substantial additional work, I would also appreciate help outlining
>> the required work here.
>>
>> I would also appreciate any information about the latest status of
>> this work: http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf
>
> This article seems to be very low quality... spelling errors, only 5
> pages, no real code, etc.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer


qdisc spin lock

2016-03-30 Thread Michael Ma
Hi -

I know this might be an old topic so bear with me – what we are facing
is that applications are sending small packets using hundreds of
threads so the contention on spin lock in __dev_xmit_skb increases the
latency of dev_queue_xmit significantly. We’re building a network QoS
solution to avoid interference of different applications using HTB.
But in this case when some applications send massive small packets in
parallel, the application to be protected will get its throughput
affected (because it’s doing synchronous network communication using
multiple threads and throughput is sensitive to the increased latency).

Here is the profiling from perf:

-  67.57%   iperf  [kernel.kallsyms] [k] _spin_lock
   - 99.94% dev_queue_xmit
      - 96.91% _spin_lock
      - 2.62%  __qdisc_run
         - 98.98% sch_direct_xmit
            - 99.98% _spin_lock
         - 1.01%  _spin_lock
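
For reference, a call-graph profile of this shape can be collected with
perf roughly as follows (the sampling scope and duration are arbitrary
choices):

```shell
# Record on-CPU stacks system-wide for 10 seconds while iperf runs,
# then print the call graph to stdout.
perf record -a -g -- sleep 10
perf report --stdio
```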

As far as I understand, the design of TC is to simplify the locking
scheme and minimize the work in __qdisc_run so that throughput won’t
be affected, especially with large packets. However, if the classes in
the queueing discipline only have a shaping limit, there isn’t really
any necessary correlation between different classes. The only
synchronization point should be when a packet is dequeued from the
qdisc queue and enqueued to the transmit queue of the device. My
question is – is it worth investing in avoiding the lock contention by
partitioning the queue/lock so that this scenario is handled with
lower latency?

I must have oversimplified a lot of details since I’m not familiar
with the TC implementation at this point – I just want to get your
input on whether this is a worthwhile effort or whether there is
something fundamental that I’m not aware of. If it is just a matter of
substantial additional work, I would also appreciate help outlining
the required work here.

I would also appreciate any information about the latest status of
this work: http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf

Thanks,
Ke Ma