> On Feb 5, 2019, at 12:37 AM, Harsh Patel <[email protected]> wrote:
> 
> Hi, 
> 
> We would like to inform you that our code is working as expected and we are 
> able to obtain a 95-98 Mbps data rate for a 100 Mbps application rate. We are 
> now working on testing the code. Thanks a lot, especially to Keith, for all 
> the help you provided.
> 
> We have two main queries:
> 1) We wanted to calculate the backlog at the NIC Tx descriptors, but were not 
> able to find anything in the documentation. Can you help us with how to 
> calculate this backlog? (A rough sketch of what we have in mind follows after 
> these queries.)
> 2) We searched for how to use Byte Queue Limits (BQL) on the NIC queue, but 
> couldn't find anything like that in DPDK. Does DPDK support BQL? If so, can 
> you help us with how to use it for our project?
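> 
> To make query 1 concrete, this is roughly what we are trying to compute (a 
> minimal sketch only; estimate_tx_backlog and nb_txd are our own placeholder 
> names, and it assumes the PMD implements rte_eth_tx_descriptor_status(), 
> otherwise the call returns -ENOTSUP):
> 
> #include <stdint.h>
> #include <rte_ethdev.h>
> 
> /* Count Tx descriptors still owned by the NIC, i.e. queued but not yet
>  * transmitted. nb_txd is the ring size given to rte_eth_tx_queue_setup().
>  * Scanning the whole ring is slow, so this is for occasional diagnostics. */
> static uint16_t
> estimate_tx_backlog(uint16_t port_id, uint16_t queue_id, uint16_t nb_txd)
> {
>     uint16_t backlog = 0;
> 
>     for (uint16_t off = 0; off < nb_txd; off++) {
>         int st = rte_eth_tx_descriptor_status(port_id, queue_id, off);
> 
>         if (st == RTE_ETH_TX_DESC_FULL)     /* still waiting to be sent */
>             backlog++;
>         else if (st < 0)                    /* -ENOTSUP, -EINVAL, ... */
>             return 0;
>     }
>     return backlog;
> }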

What was the last set of problems, if I may ask?
> 
> Thanks & Regards
> Harsh & Hrishikesh
> 
> On Thu, 31 Jan 2019 at 22:28, Wiles, Keith <[email protected]> wrote:
> 
> 
> Sent from my iPhone
> 
> On Jan 30, 2019, at 5:36 PM, Harsh Patel <[email protected]> wrote:
> 
>> Hello, 
>> 
>> This mail is to inform you that the integration of DPDK with ns-3 is working 
>> at a basic level. The model is running. 
>> For UDP traffic we are getting throughput the same as or better than the raw 
>> socket version (around 100 Mbps).
>> But unfortunately, for TCP there are bursts of packet loss which drastically 
>> reduce the throughput after some point in time. The bandwidth of the link 
>> used was 100 Mbps. 
>> We have obtained cwnd and ssthresh graphs which show that once the flow gets 
>> out of slow start, there are so many packet losses that the congestion 
>> window and the slow start threshold are not able to go above 4-5 
>> packets. 
> 
> Can you determine where the packets are being dropped?
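> 
> If it helps, one low-effort check (a sketch only, not tested against your 
> tree; port_id is whatever port the DpdkNetDevice opened) is to read the NIC 
> counters and see whether imissed or rx_nombuf is climbing. That would mean 
> the Rx ring or the mempool is overflowing on the receive side rather than 
> packets being lost on the wire:
> 
> #include <inttypes.h>
> #include <stdio.h>
> #include <rte_ethdev.h>
> 
> /* Dump the port-level counters kept by the PMD. */
> static void
> dump_port_stats(uint16_t port_id)
> {
>     struct rte_eth_stats st;
> 
>     if (rte_eth_stats_get(port_id, &st) != 0)
>         return;
>     printf("rx=%" PRIu64 " tx=%" PRIu64 " imissed=%" PRIu64
>            " ierrors=%" PRIu64 " oerrors=%" PRIu64 " rx_nombuf=%" PRIu64 "\n",
>            st.ipackets, st.opackets, st.imissed,
>            st.ierrors, st.oerrors, st.rx_nombuf);
> }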
>> We have attached the graphs with this mail.
>> 
> 
> I do not see the graphs attached but that’s OK. 
>> We would like to know if there is any reason for this, or how we can fix it. 
> 
> I think we have to find out where the packets are being dropped; that is the 
> only explanation for the case you are referring to. 
>> 
>> Thanks & Regards
>> Harsh & Hrishikesh
>> 
>> On Wed, 16 Jan 2019 at 19:25, Harsh Patel <[email protected]> wrote:
>> Hi
>> 
>> We were able to optimise the DPDK version. There were a couple of things we 
>> needed to do.
>> 
>> We were using a Tx timeout of 1s/2048, which we found to be far too small. 
>> Then we increased the timeout, but we were getting a lot of retransmissions.
>> 
>> So we removed the timeout and sent each packet as soon as we got it. This 
>> increased the throughput.
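>> 
>> Roughly what the transmit path looks like now (a simplified sketch; send_now 
>> and the variable names are placeholders, not the actual ns-3 code):
>> 
>> #include <rte_ethdev.h>
>> #include <rte_mbuf.h>
>> 
>> /* Transmit one packet immediately instead of batching behind a timeout.
>>  * rte_eth_tx_burst() may accept fewer packets than requested, so retry
>>  * until the descriptor ring has room. */
>> static inline void
>> send_now(uint16_t tx_port, uint16_t tx_queue, struct rte_mbuf *pkt)
>> {
>>     while (rte_eth_tx_burst(tx_port, tx_queue, &pkt, 1) == 0)
>>         ;   /* ring full: busy-wait and retry */
>> }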
>> 
>> Then we used the DPDK feature to launch a function on a core, and gave Rx a 
>> dedicated core. This increased the throughput further.
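>> 
>> The dedicated Rx core is along these lines (again just a sketch; keep_running 
>> and HandleRx are placeholders for our shutdown flag and for the ns-3 side 
>> that consumes the packets):
>> 
>> #include <stdbool.h>
>> #include <rte_eal.h>
>> #include <rte_lcore.h>
>> #include <rte_ethdev.h>
>> #include <rte_mbuf.h>
>> 
>> static volatile bool keep_running = true;   /* cleared on shutdown */
>> 
>> void HandleRx(struct rte_mbuf *m);          /* hands the packet to ns-3 */
>> 
>> /* Poll loop pinned to its own lcore. */
>> static int
>> rx_core_main(void *arg)
>> {
>>     uint16_t port_id = *(uint16_t *)arg;
>>     struct rte_mbuf *bufs[32];
>> 
>>     while (keep_running) {
>>         uint16_t n = rte_eth_rx_burst(port_id, 0, bufs, 32);
>>         for (uint16_t i = 0; i < n; i++)
>>             HandleRx(bufs[i]);
>>     }
>>     return 0;
>> }
>> 
>> /* During setup, launch it on the first worker lcore:
>>  *   rte_eal_remote_launch(rx_core_main, &port_id,
>>  *                         rte_get_next_lcore(-1, 1, 0));
>>  */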
>> 
>> The code is working really well for low bandwidth (below ~50 Mbps) and is 
>> outperforming the raw socket version.
>> But for high bandwidth, we are getting packet length mismatches for some 
>> reason. We are investigating it.
>> 
>> Thank you very much for your suggestions and also for your patience over the 
>> last couple of months. 
>> 
>> Thank you
>> 
>> Regards, 
>> Harsh & Hrishikesh 
>> 
>> On Fri, Jan 4, 2019, 11:27 Harsh Patel <[email protected]> wrote:
>> Yes, that would be helpful. 
>> It'd be OK for now to use the same DPDK version to overcome the build 
>> issues. 
>> We will look into updating the code for the latest version once we get past 
>> this problem. 
>> 
>> Thank you very much. 
>> 
>> Regards, 
>> Harsh & Hrishikesh
>> 
>> On Fri, Jan 4, 2019, 04:13 Wiles, Keith <[email protected]> wrote:
>> 
>> 
>> > On Jan 3, 2019, at 12:12 PM, Harsh Patel <[email protected]> wrote:
>> > 
>> > Hi
>> > 
>> > We applied your suggestion of removing the `IsLinkUp()` call, but the 
>> > performance is even worse. We could only get around 340 kbit/s.
>> > 
>> > The Top Hotspots are:
>> > 
>> > Function                          Module                                CPU Time
>> > eth_em_recv_pkts                  librte_pmd_e1000.so                   15.106s
>> > rte_delay_us_block                librte_eal.so.6.1                      7.372s
>> > ns3::DpdkNetDevice::Read          libns3.28.1-fd-net-device-debug.so     5.080s
>> > rte_eth_rx_burst                  libns3.28.1-fd-net-device-debug.so     3.558s
>> > ns3::DpdkNetDeviceReader::DoRead  libns3.28.1-fd-net-device-debug.so     3.364s
>> > [Others]                                                                 4.760s
>> 
>> Performance was reduced by removing that link status check; that is weird.
>> > 
>> > Upon checking the callers of `rte_delay_us_block`, we found that most of 
>> > the time (92%) spent in this function is during initialization, so it does 
>> > not cost us processing time during communication. So it's a good start to 
>> > our optimization.
>> > 
>> > Callers                                   CPU Time: Total   CPU Time: Self
>> > rte_delay_us_block                        100.0%            7.372s
>> >   e1000_enable_ulp_lpt_lp                  92.3%            6.804s
>> >   e1000_write_phy_reg_mdic                  1.8%            0.136s
>> >   e1000_reset_hw_ich8lan                    1.7%            0.128s
>> >   e1000_read_phy_reg_mdic                   1.4%            0.104s
>> >   eth_em_link_update                        1.4%            0.100s
>> >   e1000_get_cfg_done_generic                0.7%            0.052s
>> >   e1000_post_phy_reset_ich8lan.part.18      0.7%            0.048s
>> 
>> I guess you are having VTune start your application, and that is why you have 
>> init-time items in your log. I normally start my application and then attach 
>> VTune to it; one of the options in the VTune project configuration is to 
>> attach to a running application. Maybe it would help here.
>> 
>> Looking at the data you provided, it was OK. The problem is that it would not 
>> load the source files, as I did not have the same build or executable. I tried 
>> to build the code, but it failed to build and I did not go further. I guess I 
>> would need to see the full source tree and the executable you used to really 
>> look at the problem. I have limited time, but I can try if you like. 
>> > 
>> > 
>> > Effective CPU Utilization:    21.4% (0.856 out of 4)
>> > 
>> > Here is the link to vtune profiling results. 
>> > https://drive.google.com/open?id=1M6g2iRZq2JGPoDVPwZCxWBo7qzUhvWi5
>> > 
>> > Thank you
>> > 
>> > Regards
>> > 
>> > On Sun, Dec 30, 2018, 06:00 Wiles, Keith <[email protected]> wrote:
>> > 
>> > 
>> > > On Dec 29, 2018, at 4:03 PM, Harsh Patel <[email protected]> 
>> > > wrote:
>> > > 
>> > > Hello,
>> > > As suggested, we tried profiling the application using Intel VTune 
>> > > Amplifier. We aren't sure how to use these results, so we are attaching 
>> > > them to this email.
>> > > 
>> > > The things we understood were 'Top Hotspots' and 'Effective CPU 
>> > > utilization'. Following is our understanding of each:
>> > > 
>> > > Top Hotspots
>> > > 
>> > > Function                          Module                                CPU Time
>> > > rte_delay_us_block                librte_eal.so.6.1                     15.042s
>> > > eth_em_recv_pkts                  librte_pmd_e1000.so                    9.544s
>> > > ns3::DpdkNetDevice::Read          libns3.28.1-fd-net-device-debug.so     3.522s
>> > > ns3::DpdkNetDeviceReader::DoRead  libns3.28.1-fd-net-device-debug.so     2.470s
>> > > rte_eth_rx_burst                  libns3.28.1-fd-net-device-debug.so     2.456s
>> > > [Others]                                                                 6.656s
>> > > 
>> > > We were familiar with the other functions, but not `rte_delay_us_block`, 
>> > > so we investigated its callers:
>> > > 
>> > > Callers                                 Effective Time  Spin Time  Overhead Time  Effective Time  Spin Time  Overhead Time  Wait Time: Total  Wait Time: Self
>> > > e1000_enable_ulp_lpt_lp                 45.6%   0.0%    0.0%    6.860s  0usec   0usec
>> > > e1000_write_phy_reg_mdic                32.7%   0.0%    0.0%    4.916s  0usec   0usec
>> > > e1000_read_phy_reg_mdic                 19.4%   0.0%    0.0%    2.922s  0usec   0usec
>> > > e1000_reset_hw_ich8lan                   1.0%   0.0%    0.0%    0.143s  0usec   0usec
>> > > eth_em_link_update                       0.7%   0.0%    0.0%    0.100s  0usec   0usec
>> > > e1000_post_phy_reset_ich8lan.part.18     0.4%   0.0%    0.0%    0.064s  0usec   0usec
>> > > e1000_get_cfg_done_generic               0.2%   0.0%    0.0%    0.037s  0usec   0usec
>> > > 
>> > > We lack sufficient knowledge to investigate more than this.
>> > > 
>> > > Effective CPU utilization
>> > > 
>> > > Interestingly, the effective CPU utilization was 20.8% (0.832 out of 4 
>> > > logical CPUs). We thought this was low, so we compared it with the 
>> > > raw-socket version of the code, which was even lower at 8.0% (0.318 out of 
>> > > 4 logical CPUs), and yet that version performs way better.
>> > > 
>> > > It would be helpful if you could give us some insight into how to use 
>> > > these results, or point us to some resources for doing so. 
>> > > 
>> > > Thank you 
>> > > 
>> > 
>> > BTW, I was able to build ns-3 with DPDK 18.11; it required a couple of 
>> > changes in the DPDK init code in ns-3 plus one hack in the rte_mbuf.h file.
>> > 
>> > I did have a problem including the rte_mbuf.h file into your code. It 
>> > appears the g++ compiler did not like referencing the struct rte_mbuf_sched 
>> > inside the rte_mbuf structure; rte_mbuf_sched is defined inside the big 
>> > union. As a hack I moved the struct definition outside of the rte_mbuf 
>> > structure and replaced the struct in the union with 'struct rte_mbuf_sched 
>> > sched;', but I am guessing you are missing some compiler options in your 
>> > build system, as DPDK builds just fine without that hack.
>> > 
>> > The next place was the rxmode and the txq_flags. The rxmode structure has 
>> > changed, so I commented out the inits in ns-3 and then commented out the 
>> > txq_flags init code, as these are now the defaults.
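>> > 
>> > For reference, the 18.11-style port init ends up looking roughly like this 
>> > (a sketch only, not your actual ns-3 code; port_id, nb_rxd, nb_txd and 
>> > mbuf_pool are placeholders). The old rxmode bit-fields and txq_flags no 
>> > longer exist, so a zeroed config takes the driver defaults:
>> > 
>> > #include <string.h>
>> > #include <rte_ethdev.h>
>> > #include <rte_ether.h>
>> > 
>> > static int
>> > setup_port(uint16_t port_id, uint16_t nb_rxd, uint16_t nb_txd,
>> >            struct rte_mempool *mbuf_pool)
>> > {
>> >     struct rte_eth_conf port_conf;
>> > 
>> >     memset(&port_conf, 0, sizeof(port_conf));
>> >     port_conf.rxmode.max_rx_pkt_len = ETHER_MAX_LEN;
>> > 
>> >     if (rte_eth_dev_configure(port_id, 1, 1, &port_conf) < 0)
>> >         return -1;
>> >     if (rte_eth_rx_queue_setup(port_id, 0, nb_rxd,
>> >                                rte_eth_dev_socket_id(port_id),
>> >                                NULL, mbuf_pool) < 0)
>> >         return -1;
>> >     /* txq_flags is gone; a NULL txconf takes the defaults. */
>> >     if (rte_eth_tx_queue_setup(port_id, 0, nb_txd,
>> >                                rte_eth_dev_socket_id(port_id), NULL) < 0)
>> >         return -1;
>> >     return rte_eth_dev_start(port_id);
>> > }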
>> > 
>> > Regards,
>> > Keith
>> > 
>> 
>> Regards,
>> Keith
>> 
>> <Ssthresh.png>
>> <Cwnd.png>

Regards,
Keith
