Re: [dpdk-users] Query on handling packets

2018-11-24 Thread Wiles, Keith



> On Nov 22, 2018, at 9:54 AM, Harsh Patel  wrote:
> 
> Hi
> 
> Thank you so much for the reply and for the solution.
> 
> We used the given code. We were amazed by the pointer arithmetic you used, 
> got to learn something new.
> 
> But still we are underperforming. The same bottleneck of ~2.5 Mbps is seen.
> 
> We also checked whether the raw socket was using any more (logical) cores than 
> the DPDK version. We found that the raw socket has 2 logical threads running on 
> 2 logical CPUs, whereas the DPDK version has 6 logical threads on 2 logical 
> CPUs. We also ran the 6 threads on 4 logical CPUs and still see the same bottleneck.
> 
> We have updated our code (you can use the same links from previous mail). It 
> would be helpful if you could help us in finding what causes the bottleneck.

I looked at the code for a few seconds and noticed your TX_TIMEOUT is a macro 
that calls (rte_get_timer_hz()/2014). Just to be safe I would not call 
rte_get_timer_hz() each time, but grab the value once, store the hz locally and 
use that variable instead. My guess is this will not improve performance, and I 
would have to look at the code for that routine to see if storing the value 
locally buys you anything. If getting the hz is just a simple read of a 
variable then good, but you should still keep a local variable within the 
object holding the (rte_get_timer_hz()/2048) result instead of doing the call 
and divide each time.
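
Something like this minimal sketch, where the function and variable names and the 
2048 divisor are only placeholders, not the actual code:

    #include <rte_cycles.h>

    /* Computed once at setup time instead of on every packet. */
    static uint64_t tx_timeout_cycles;

    static void
    setup_tx_timeout(void)
    {
        tx_timeout_cycles = rte_get_timer_hz() / 2048;
    }

    /* The data path then just reads tx_timeout_cycles instead of calling
     * rte_get_timer_hz() and dividing each time. */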

> 
> Thanks and Regards, 
> Harsh and Hrishikesh 
> 
> 
> On Mon, Nov 19, 2018, 19:19 Wiles, Keith  wrote:
> 
> 
> > On Nov 17, 2018, at 4:05 PM, Kyle Larose  wrote:
> > 
> > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel  
> > wrote:
> >> 
> >> Hello,
> >> Thanks a lot for going through the code and providing us with so much
> >> information.
> >> We removed all the memcpy/malloc from the data path as you suggested and
> > ...
> >> After removing this, we are able to see a performance gain but not as good
> >> as raw socket.
> >> 
> > 
> > You're using an unordered_map to map your buffer pointers back to the
> > mbufs. While it may not do a memcpy all the time, it will likely end
> > up doing a malloc arbitrarily when you insert or remove entries from
> > the map. If it needs to resize the table, it'll be even worse. You may
> > want to consider using librte_hash:
> > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even better,
> > see if you can design the system to avoid needing to do a lookup like
> > this. Can you return a handle with the mbuf pointer and the data
> > together?
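
A rough sketch of what the rte_hash version could look like (the table name, size 
and key choice here are only illustrative, not taken from the actual code):

    #include <rte_hash.h>
    #include <rte_jhash.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    /* Hash table keyed by the raw buffer pointer, storing the owning mbuf. */
    static struct rte_hash *
    create_buf_map(void)
    {
        struct rte_hash_parameters params = {
            .name = "buf2mbuf",
            .entries = 2048,            /* illustrative upper bound */
            .key_len = sizeof(void *),  /* the key is the buffer pointer */
            .hash_func = rte_jhash,
            .socket_id = rte_socket_id(),
        };

        return rte_hash_create(&params);
    }

    /* insert:  rte_hash_add_key_data(map, &buf, mbuf);
     * lookup:  rte_hash_lookup_data(map, &buf, (void **)&mbuf);
     * remove:  rte_hash_del_key(map, &buf);                       */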
> > 
> > You're also using floating point math where it's unnecessary (the
> > timing check). Just multiply the numerator by 100 prior to doing
> > the division. I doubt you'll overflow a uint64_t with that. It's not
> > as efficient as integer math, though I'm not sure offhand it'd cause a
> > major perf problem.
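
For example, a half-second check can stay entirely in integer cycles (the variable 
names here are assumptions, since the original timing code is not shown in the 
thread):

    uint64_t hz = rte_get_timer_hz();
    uint64_t elapsed = rte_get_timer_cycles() - last_flush_cycles;

    /* "elapsed / hz >= 0.5" without floating point: scale before comparing. */
    if (elapsed * 2 >= hz) {
        /* timeout expired */
    }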
> > 
> > One final thing: using a raw socket, the kernel will take over
> > transmitting and receiving to the NIC itself. That means it is free to
> > use multiple CPUs for the rx and tx. I notice that you only have one
> > rx/tx queue, meaning at most one CPU can send and receive packets.
> > When running your performance test with the raw socket, you may want
> > to see how busy the system is doing packet sends and receives. Is it
> > using more than one CPU's worth of processing? Is it using less, but
> > when combined with your main application's usage, the overall system
> > is still using more than one?
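
If you do want more than one core doing I/O with DPDK, the port has to be set up 
with multiple queues, each polled by its own lcore. A sketch of that setup (the 
queue and descriptor counts and the use of RSS are illustrative only; 'pool' is an 
existing mbuf mempool):

    #include <rte_ethdev.h>
    #include <rte_mempool.h>

    /* Configure a port with two RX/TX queue pairs so two lcores can do I/O. */
    static int
    setup_port_two_queues(uint16_t port_id, struct rte_mempool *pool)
    {
        struct rte_eth_conf conf = {
            .rxmode = { .mq_mode = ETH_MQ_RX_RSS },  /* spread RX over queues */
        };
        uint16_t q;

        if (rte_eth_dev_configure(port_id, 2, 2, &conf) < 0)
            return -1;
        for (q = 0; q < 2; q++) {
            if (rte_eth_rx_queue_setup(port_id, q, 512,
                    rte_eth_dev_socket_id(port_id), NULL, pool) < 0)
                return -1;
            if (rte_eth_tx_queue_setup(port_id, q, 512,
                    rte_eth_dev_socket_id(port_id), NULL) < 0)
                return -1;
        }
        return rte_eth_dev_start(port_id);
    }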
> 
> Along with removing the floating point math, I would use the rte_rdtsc() 
> function and work in cycles. Using something like:
> 
> uint64_t cur_tsc, next_tsc, timo = (rte_get_timer_hz() / 16);   /* One 16th 
> of a second; use 2/4/8/16/32 (powers of two) to keep the divide simple */
> 
> cur_tsc = rte_rdtsc();
> 
> next_tsc = cur_tsc + timo; /* next_tsc is the next time to flush */
> 
> while(1) {
>     cur_tsc = rte_rdtsc();
>     if (cur_tsc >= next_tsc) {
>         flush();
>         next_tsc += timo;
>     }
>     /* Do other stuff */
> }
> 
> For the m_bufPktMap I would use the rte_hash, or skip the hash entirely by 
> taking the buffer address and subtracting back to the mbuf:
> mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + 
> RTE_MAX_HEADROOM);
> 
> 
> DpdkNetDevice::Write(uint8_t *buffer, size_t length)
> {
>     struct rte_mbuf *pkt;
>     uint64_t cur_tsc;
> 
>     pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer,
>               sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM);
> 
>     /* No need to test pkt, but buffer may be tested above to make sure
>      * it is not null before the math */
> 
>     pkt->pkt_len = length;
>     pkt->data_len = length;
> 
>     rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt);
> 
>     cur_tsc = rte_rdtsc();
> 
>     /* next_tsc is a private variable */
>     if (cur_tsc >= next_tsc) {
>         rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer);   /* 
> hardcoded 

Re: [dpdk-users] Query on handling packets

2018-11-24 Thread Wiles, Keith



> On Nov 24, 2018, at 9:43 AM, Wiles, Keith  wrote:
> 
> 
> 
>> On Nov 22, 2018, at 9:54 AM, Harsh Patel  wrote:
>> 
>> Hi
>> 
>> Thank you so much for the reply and for the solution.
>> 
>> We used the given code. We were amazed by the pointer arithmetic you used, 
>> got to learn something new.
>> 
>> But still we are underperforming. The same bottleneck of ~2.5 Mbps is seen.
> 
> Make sure the cores you are using are on the same NUMA node or socket as the 
> PCI devices.
> 
> If you have two CPUs or sockets in your system, the cpu_layout.py script will 
> help you understand the layout of the cores and/or lcores in the system.
> 
> On my machine the PCI bus is connected to socket 1 and not socket 0, which 
> means I have to use lcores only on socket 1. Some systems have two PCI buses, 
> one on each socket. Accessing data from one NUMA zone or socket to another 
> can affect performance and should be avoided.
> 
> HTH
>> 
>> We also checked whether the raw socket was using any more (logical) cores than 
>> the DPDK version. We found that the raw socket has 2 logical threads running on 
>> 2 logical CPUs, whereas the DPDK version has 6 logical threads on 2 logical 
>> CPUs. We also ran the 6 threads on 4 logical CPUs and still see the same 
>> bottleneck.

Not sure what you are trying to tell me here, but a picture could help me a lot.


Re: [dpdk-users] Query on handling packets

2018-11-24 Thread Wiles, Keith



> On Nov 22, 2018, at 9:54 AM, Harsh Patel  wrote:
> 
> Hi
> 
> Thank you so much for the reply and for the solution.
> 
> We used the given code. We were amazed by the pointer arithmetic you used, 
> got to learn something new.
> 
> But still we are underperforming. The same bottleneck of ~2.5 Mbps is seen.

Make sure the cores you are using are on the same NUMA node or socket as the PCI 
devices.

If you have two CPUs or sockets in your system, the cpu_layout.py script will 
help you understand the layout of the cores and/or lcores in the system.

On my machine the PCI bus is connected to socket 1 and not socket 0, which means 
I have to use lcores only on socket 1. Some systems have two PCI buses, one on 
each socket. Accessing data from one NUMA zone or socket to another can affect 
performance and should be avoided.
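
You can also check this from the application itself; a small sketch (the port id 
is simply whatever port you are polling):

    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <stdio.h>

    /* Warn when the current lcore and the NIC sit on different sockets. */
    static void
    check_numa(uint16_t port_id)
    {
        int dev_socket = rte_eth_dev_socket_id(port_id);   /* -1 if unknown */
        int lcore_socket = rte_lcore_to_socket_id(rte_lcore_id());

        if (dev_socket >= 0 && dev_socket != lcore_socket)
            printf("lcore %u (socket %d) is remote to port %u (socket %d)\n",
                   rte_lcore_id(), lcore_socket,
                   (unsigned)port_id, dev_socket);
    }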

HTH
> 
> We also checked whether the raw socket was using any more (logical) cores than 
> the DPDK version. We found that the raw socket has 2 logical threads running on 
> 2 logical CPUs, whereas the DPDK version has 6 logical threads on 2 logical 
> CPUs. We also ran the 6 threads on 4 logical CPUs and still see the same bottleneck.
> 
> We have updated our code (you can use the same links from previous mail). It 
> would be helpful if you could help us in finding what causes the bottleneck.
> 
> Thanks and Regards, 
> Harsh and Hrishikesh 
> 
> 
> On Mon, Nov 19, 2018, 19:19 Wiles, Keith  wrote:
> 
> 
> > On Nov 17, 2018, at 4:05 PM, Kyle Larose  wrote:
> > 
> > On Sat, Nov 17, 2018 at 5:22 AM Harsh Patel  
> > wrote:
> >> 
> >> Hello,
> >> Thanks a lot for going through the code and providing us with so much
> >> information.
> >> We removed all the memcpy/malloc from the data path as you suggested and
> > ...
> >> After removing this, we are able to see a performance gain but not as good
> >> as raw socket.
> >> 
> > 
> > You're using an unordered_map to map your buffer pointers back to the
> > mbufs. While it may not do a memcpy all the time, it will likely end
> > up doing a malloc arbitrarily when you insert or remove entries from
> > the map. If it needs to resize the table, it'll be even worse. You may
> > want to consider using librte_hash:
> > https://doc.dpdk.org/api/rte__hash_8h.html instead. Or, even better,
> > see if you can design the system to avoid needing to do a lookup like
> > this. Can you return a handle with the mbuf pointer and the data
> > together?
> > 
> > You're also using floating point math where it's unnecessary (the
> > timing check). Just multiply the numerator by 100 prior to doing
> > the division. I doubt you'll overflow a uint64_t with that. It's not
> > as efficient as integer math, though I'm not sure offhand it'd cause a
> > major perf problem.
> > 
> > One final thing: using a raw socket, the kernel will take over
> > transmitting and receiving to the NIC itself. That means it is free to
> > use multiple CPUs for the rx and tx. I notice that you only have one
> > rx/tx queue, meaning at most one CPU can send and receive packets.
> > When running your performance test with the raw socket, you may want
> > to see how busy the system is doing packet sends and receives. Is it
> > using more than one CPU's worth of processing? Is it using less, but
> > when combined with your main application's usage, the overall system
> > is still using more than one?
> 
> Along with removing the floating point math, I would use the rte_rdtsc() 
> function and work in cycles. Using something like:
> 
> uint64_t cur_tsc, next_tsc, timo = (rte_get_timer_hz() / 16);   /* One 16th 
> of a second; use 2/4/8/16/32 (powers of two) to keep the divide simple */
> 
> cur_tsc = rte_rdtsc();
> 
> next_tsc = cur_tsc + timo; /* next_tsc is the next time to flush */
> 
> while(1) {
>     cur_tsc = rte_rdtsc();
>     if (cur_tsc >= next_tsc) {
>         flush();
>         next_tsc += timo;
>     }
>     /* Do other stuff */
> }
> 
> For the m_bufPktMap I would use the rte_hash, or skip the hash entirely by 
> taking the buffer address and subtracting back to the mbuf:
> mbuf = (struct rte_mbuf *)RTE_PTR_SUB(buf, sizeof(struct rte_mbuf) + 
> RTE_MAX_HEADROOM);
> 
> 
> DpdkNetDevice::Write(uint8_t *buffer, size_t length)
> {
>     struct rte_mbuf *pkt;
>     uint64_t cur_tsc;
> 
>     pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer,
>               sizeof(struct rte_mbuf) + RTE_MAX_HEADROOM);
> 
>     /* No need to test pkt, but buffer may be tested above to make sure
>      * it is not null before the math */
> 
>     pkt->pkt_len = length;
>     pkt->data_len = length;
> 
>     rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt);
> 
>     cur_tsc = rte_rdtsc();
> 
>     /* next_tsc is a private variable */
>     if (cur_tsc >= next_tsc) {
>         rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer);   /* hardcoded 
>         the queue id, should be fixed */
>         next_tsc = cur_tsc + timo; /* timo is a fixed
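
Putting the pieces above together, a self-contained sketch of the whole Write() 
path could look like the following. The member names (m_portId, m_txBuffer, 
m_nextTsc, m_timeoutCycles), the hardcoded queue id 0 and the use of 
RTE_PKTMBUF_HEADROOM (the standard DPDK headroom macro, standing in for the 
RTE_MAX_HEADROOM name used above) are illustrative assumptions, not the actual 
ns-3 code:

    #include <rte_common.h>
    #include <rte_cycles.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    void
    DpdkNetDevice::Write(uint8_t *buffer, size_t length)
    {
        struct rte_mbuf *pkt;
        uint64_t cur_tsc;

        /* Step back from the data area to the mbuf header. This assumes the
         * mempool was created with zero private area; otherwise that size
         * must be subtracted as well. */
        pkt = (struct rte_mbuf *)RTE_PTR_SUB(buffer,
                  sizeof(struct rte_mbuf) + RTE_PKTMBUF_HEADROOM);

        pkt->pkt_len = length;
        pkt->data_len = length;

        /* Queue the packet; it is sent when the TX buffer fills up ... */
        rte_eth_tx_buffer(m_portId, 0, m_txBuffer, pkt);

        /* ... or when the periodic timeout expires. */
        cur_tsc = rte_rdtsc();
        if (cur_tsc >= m_nextTsc) {
            rte_eth_tx_buffer_flush(m_portId, 0, m_txBuffer);
            m_nextTsc = cur_tsc + m_timeoutCycles;  /* e.g. hz / 16, cached */
        }
    }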