Stephen, thanks! That is it! Not sure if there is any workaround.
So, essentially, what I am doing is: core 0 gets a burst of my_packet(s)
from its pre-allocated mempool and then bulk-enqueues them into a rte_ring.
Core 1 then bulk-dequeues from this ring, and when it accesses the data
pointed to by a ring element (i.e. my_packet->tag1), this memory access
latency issue shows up. I cannot advance the prefetch any earlier.

Is there any clever workaround (or hack) to overcome this issue, other than
running all the functions on the same core? For example, can I prefetch the
packets on core 0 into core 1's cache (could be a dumb question!)?
(A rough sketch of my core-1 loop is at the bottom of this mail.)

Thanks,
Arvind

On Tue, Sep 11, 2018 at 1:07 PM Stephen Hemminger <[email protected]> wrote:

> On Tue, 11 Sep 2018 12:18:42 -0500
> Arvind Narayanan <[email protected]> wrote:
>
> > If I don't do any processing, I easily get 10G. It is only when I access
> > the tag that the throughput drops.
> > What confuses me is that if I use the following snippet, it works at
> > line rate.
> >
> > ```
> > int temp_key = 1; // declared outside of the for loop
> >
> > for (i = 0; i < pkt_count; i++) {
> >     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
> >     }
> > }
> > ```
> >
> > But as soon as I replace `temp_key` with `my_packet->tag1`, I see a drop
> > in throughput (which in a way confirms the issue is due to cache misses).
>
> Your packet data is not in cache.
>
> Doing prefetch can help, but it is very timing sensitive. If prefetch is
> done before data is available, it won't help. And if prefetch is done just
> before data is used, then there aren't enough cycles to get it from memory
> into the cache.
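P.S. In case a concrete pattern helps: below is a rough sketch of what core 1
does, with the prefetch issued a few entries ahead of the hash lookup. The
struct layout, the consumer_loop name, and the BURST_SIZE / PREFETCH_OFFSET
values are simplified placeholders rather than my exact code.

```
#include <stdint.h>
#include <rte_ring.h>
#include <rte_hash.h>
#include <rte_prefetch.h>

#define BURST_SIZE      32
#define PREFETCH_OFFSET 4   /* how far ahead to prefetch; needs tuning */

/* Placeholder for my per-packet metadata structure. */
struct my_packet {
        uint32_t tag1;
        /* ... other fields ... */
};

/* Core 1: dequeue a burst from the ring and look up each packet's tag in
 * rx_table, prefetching a few entries ahead so the tag1 cache line is
 * (hopefully) resident by the time rte_hash_lookup_data() reads it. */
static void
consumer_loop(struct rte_ring *ring, struct rte_hash *rx_table)
{
        struct my_packet *pkts[BURST_SIZE];
        void *val[BURST_SIZE];
        unsigned int i, n;

        for (;;) {
                n = rte_ring_dequeue_burst(ring, (void **)pkts,
                                           BURST_SIZE, NULL);
                if (n == 0)
                        continue;

                /* Prime the pipeline: prefetch the first few packets. */
                for (i = 0; i < PREFETCH_OFFSET && i < n; i++)
                        rte_prefetch0(&pkts[i]->tag1);

                for (i = 0; i < n; i++) {
                        /* Prefetch the packet PREFETCH_OFFSET entries ahead
                         * of the one being processed now. */
                        if (i + PREFETCH_OFFSET < n)
                                rte_prefetch0(&pkts[i + PREFETCH_OFFSET]->tag1);

                        if (rte_hash_lookup_data(rx_table, &pkts[i]->tag1,
                                                 &val[i]) < 0) {
                                /* lookup miss: handle or drop */
                        }
                }
        }
}
```

Even with the offset, the earliest the prefetch can be issued is right after
rte_ring_dequeue_burst() returns, which is what I meant by not being able to
advance it any earlier.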
