On Tue, 11 Sep 2018 13:39:24 -0500 Arvind Narayanan <[email protected]> wrote:
> Stephen, thanks!
>
> That is it! Not sure if there is any workaround.
>
> So, essentially, what I am doing is -- core 0 gets a burst of my_packet(s)
> from its pre-allocated mempool, and then (bulk) enqueues it into a
> rte_ring. Core 1 then (bulk) dequeues from this ring, and when it accesses
> the data pointed to by the ring's element (i.e. my_packet->tag1), this
> memory access latency issue is seen. I cannot advance the prefetch any
> earlier. Is there any clever workaround (or hack) to overcome this issue,
> other than using the same core for all the functions? E.g., can I prefetch
> the packets on core 0 into core 1's cache (could be a dumb question!)?
>
> Thanks,
> Arvind
>
> On Tue, Sep 11, 2018 at 1:07 PM Stephen Hemminger
> <[email protected]> wrote:
>
> > On Tue, 11 Sep 2018 12:18:42 -0500
> > Arvind Narayanan <[email protected]> wrote:
> >
> > > If I don't do any processing, I easily get 10G. It is only when I
> > > access the tag that the throughput drops.
> > > What confuses me is that if I use the following snippet, it works
> > > at line rate.
> > >
> > > ```
> > > int temp_key = 1; // declared outside of the for loop
> > >
> > > for (i = 0; i < pkt_count; i++) {
> > >     if (rte_hash_lookup_data(rx_table, &(temp_key), (void **)&val[i]) < 0) {
> > >     }
> > > }
> > > ```
> > >
> > > But as soon as I replace `temp_key` with `my_packet->tag1`, I
> > > experience a fall in throughput (which in a way confirms the issue
> > > is due to cache misses).
> >
> > Your packet data is not in cache.
> > Doing prefetch can help, but it is very timing sensitive. If the prefetch
> > is done before the data is available, it won't help. And if the prefetch
> > is done just before the data is used, then there aren't enough cycles to
> > get it from memory into the cache.

In my experience, if you want performance then don't pass packets between
cores. It is slightly less bad if the core that does the passing does not
access the packet.
It is really bad if the handling core writes the packet, and worse still
between cores with greater cache distance (i.e. across NUMA nodes). If you
have to pass packets between cores, use cores which share a hyper-thread
(siblings on the same physical core), since those share cache.
