Re: [dpdk-dev] [PATCH v12 0/6] add Tx preparation

2016-12-02 Thread Yong Wang
> -Original Message-
> From: Thomas Monjalon [mailto:thomas.monja...@6wind.com]
> Sent: Wednesday, November 30, 2016 10:27 AM
> To: Ananyev, Konstantin <konstantin.anan...@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v12 0/6] add Tx preparation
> 
> 2016-11-30 17:42, Ananyev, Konstantin:
> > > >Please, we need a comment for each driver saying
> > > >"it is OK, we do not need any checksum preparation for TSO"
> > > >or
> > > >"yes we have to implement tx_prepare or TSO will not work in this mode"
> > > >
> > >
> > > The qede PMD does not support TSO yet; it only supports Tx TCP/UDP/IP
> > > csum offloads.
> > > So Tx preparation isn't applicable. So as of now:
> > > "it is OK, we do not need any checksum preparation for TSO"
> >
> > Thanks for the answer.
> > Though please note that it is not only for TSO.
> 
> Oh yes, sorry, my wording was incorrect.
> We need to know if any checksum preparation is needed prior to
> offloading its final computation to the hardware or driver.
> So the question applies to TSO and simple checksum offload.
> 
> We are still awaiting answers for
>   bnxt, cxgbe, ena, nfp, thunderx, virtio and vmxnet3.

The case for a virtual device is a little more complicated, as packets
offloaded from a virtual device can eventually be delivered to another
virtual NIC or to different physical NICs that have different offload
requirements. In ESX, the hypervisor enforces that the offloaded packets
are something the hardware expects. The contract for vmxnet3 is that the
guest needs to fill in the pseudo-header checksum for both the L4-checksum-only
case and the TSO + L4 checksum offload case.
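
A minimal sketch (not part of the patch series) of the vmxnet3 contract
described above: the guest fills in the TCP pseudo-header checksum before
requesting L4 checksum or TSO offload. Struct and flag names follow the DPDK
release contemporary with this thread (16.11), and a plain Ethernet/IPv4/TCP
frame is assumed.

#include <rte_mbuf.h>
#include <rte_ether.h>
#include <rte_ip.h>
#include <rte_tcp.h>

static void
seed_tcp_phdr_cksum(struct rte_mbuf *m, int tso)
{
	struct ipv4_hdr *ip;
	struct tcp_hdr *tcp;

	m->l2_len = sizeof(struct ether_hdr);
	ip = rte_pktmbuf_mtod_offset(m, struct ipv4_hdr *, m->l2_len);
	m->l3_len = (ip->version_ihl & 0x0f) * 4;
	tcp = rte_pktmbuf_mtod_offset(m, struct tcp_hdr *,
				      m->l2_len + m->l3_len);

	m->ol_flags |= PKT_TX_IPV4 | PKT_TX_TCP_CKSUM;
	if (tso) {
		m->ol_flags |= PKT_TX_TCP_SEG;
		m->l4_len = (tcp->data_off >> 4) * 4;
		m->tso_segsz = 1448;	/* example MSS */
	}
	/* The device computes the final L4 checksum; it must be seeded with
	 * the pseudo-header checksum.  rte_ipv4_phdr_cksum() leaves the
	 * length out when PKT_TX_TCP_SEG is set, as TSO requires. */
	tcp->cksum = rte_ipv4_phdr_cksum(ip, m->ol_flags);
}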

> > This is for any TX offload for which the upper layer SW would have
> > to modify the contents of the packet.
> > Though as I can see, for qede neither PKT_TX_IP_CKSUM nor
> > PKT_TX_TCP_CKSUM exhibits any extra requirements for the user.
> > Is that correct?



Re: [dpdk-dev] [PATCH v12 0/6] add Tx preparation

2016-12-01 Thread Ananyev, Konstantin

Hi Adrien,

> 
> Hi Konstantin,
> 
> On Wed, Nov 30, 2016 at 10:54:50AM +, Ananyev, Konstantin wrote:
> [...]
> > > Something is definitely needed here, and only PMDs can provide it. I think
> > > applications should not have to clear checksum fields or initialize them 
> > > to
> > > some magic value, same goes for any other offload or hardware limitation
> > > that needs to be worked around.
> > >
> > > tx_prep() is one possible answer to this issue, however as mentioned in 
> > > the
> > > original patch it can be very expensive if exposed by the PMD.
> > >
> > > Another issue I'm more concerned about is the way limitations are managed
> > > (struct rte_eth_desc_lim). While not officially tied to tx_prep(), this
> > > structure contains new fields that are only relevant to a few devices, 
> > > and I
> > > fear it will keep growing with each new hardware quirk to manage, breaking
> > > ABIs in the process.
> >
> > Well, if some new HW capability/limitation would arise and we'd like to 
> > support
> > it in DPDK, then yes we probably would need to think how to incorporate it 
> > here.
> > Do you have anything particular in mind here?
> 
> Nothing in particular, so for the sake of the argument, let's suppose that I
> would like to add a field to expose some limitation that only applies to my
> PMD during TX but looks generic enough to make sense, e.g. maximum packet
> size when VLAN tagging is requested.

Hmm, I haven't heard of such a limitation so far, but if it is a real case,
then sure, feel free to submit a patch.

> PMDs are free to set that field to some
> special value (say, 0) if they do not care.
> 
> Since that field exists however, conscious applications should check its
> value for each packet that needs to be transmitted. This extra code causes a
> slowdown just by sitting in the data path. Since it is not the only field in
> that structure, the performance impact can be significant.
> 
> Even though this code is inside applications, it remains unfair to PMDs for
> which these tests are irrelevant. This problem is identified and addressed
> by tx_prepare().

I suppose the question is why do we need:
uint16_t nb_seg_max;
uint16_t nb_mtu_seg_max;
as we now have tx_prepare(), right?

For two reasons:
1. Some people might feel that tx_prepare() is not good (smart/fast) enough
for them and would prefer to do the necessary preparations for TX offloads
themselves.
2. Even if people do use tx_prepare(), they still should take this information
into account.
As an example, ixgbe can't TX packets with more than 40 segments.
Obviously ixgbe_tx_prep() performs that check and returns an error.
But it wouldn't try to merge/reallocate mbufs for you.
The user still has to do that himself, or just prevent creating such long
chains somehow.
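
Sketch (not from the patch) of how an application that builds long mbuf
chains can honour the limits exposed in struct rte_eth_desc_lim instead of
relying solely on tx_prepare() to reject the packet. Exact types (e.g. the
port id width) differ between DPDK releases.

#include <rte_ethdev.h>
#include <rte_mbuf.h>

static int
tx_chain_fits(uint16_t port_id, const struct rte_mbuf *m)
{
	struct rte_eth_dev_info dev_info;

	rte_eth_dev_info_get(port_id, &dev_info);

	/* Assumes the PMD fills in a meaningful nb_seg_max. */
	if (m->nb_segs > dev_info.tx_desc_lim.nb_seg_max)
		return 0;	/* caller must merge/linearize the chain */
	return 1;
}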

> 
> Thanks to tx_prepare(), these checks are moved back into PMDs where they
> belong. PMDs that do not need them do not have to provide support for
> tx_prepare() and do not suffer any performance impact as result;
> applications only have to make sure tx_prepare() is always called at some
> point before tx_burst().
> 
> Once you reach this stage, you've effectively made tx_prepare() mandatory
> before tx_burst(). If some bug occurs, then perhaps you forgot to call
> tx_prepare(), you just need to add it. The total cost for doing TX is
> therefore tx_prepare() + tx_burst().
> 
> I'm perhaps a bit pessimistic mind you, but I do not think tx_prepare() will
> remain optional for long. Sure, PMDs that do not implement it do not care,
> I'm focusing on applications, for which the performance impact of calling
> tx_prepare() followed by tx_burst() is higher than a single tx_burst()
> performing all the necessary preparation at once.
> 
> [...]
> > > Following the same logic, why can't such a thing be made part of the TX
> > > burst function as well (through a direct call to rte_phdr_cksum_fix()
> > > whenever necessary). From an application standpoint, what are the 
> > > advantages
> > > of having to:
> > >
> > >  if (tx_prep()) // iterate and update mbufs as needed
> > >  tx_burst(); // iterate and send
> > >
> > > Compared to:
> > >
> > >  tx_burst(); // iterate, update as needed and send
> >
> > I think that was discussed extensively quite a lot previously here:
> > As Thomas already replied - main motivation is to allow user
> > to execute them on different stages of packet TX pipeline,
> > and probably on different cores.
> > I think that gives the user better flexibility as to when/where to
> > do these preparations and hopefully leads to better performance.
> 
> And I agree, I think this use case is valid but does not warrant such a high
> penalty when your application does not need that much flexibility. Simple
> (yet conscious) applications need the highest performance. Complex ones as
> you described already suffer quite a bit from IPCs and won't mind a couple
> of extra CPU cycles right?

It would mean an extra cache-miss for every packet, so I think the
performance hit would be significant.

Re: [dpdk-dev] [PATCH v12 0/6] add Tx preparation

2016-12-01 Thread Jerin Jacob
On Thu, Dec 01, 2016 at 09:58:31AM +0100, Thomas Monjalon wrote:
> 2016-12-01 08:15, Adrien Mazarguil:
> > I'm perhaps a bit pessimistic mind you, but I do not think tx_prepare() will
> > remain optional for long. Sure, PMDs that do not implement it do not care,
> > I'm focusing on applications, for which the performance impact of calling
> > tx_prepare() followed by tx_burst() is higher than a single tx_burst()
> > performing all the necessary preparation at once.
> 
> I agree that tx_prepare() should become mandatory shortly.

I agree. tx_prepare() has to be mandatory. Applications will have no
idea how PMD drivers use this hook to fix up PMD Tx-side limitations.
On the other hand, if it turns out to be mandatory, what real benefit does it
have compared to the existing scheme of just tx_burst()?
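
A minimal sketch of the two-step sequence being debated: tx_prepare() fixes
up whatever the PMD requires (and may run on a different core or pipeline
stage), tx_burst() then only transmits. This uses the API added by this
patch series; the exact port id width differs between DPDK releases.

#include <stdio.h>
#include <rte_ethdev.h>
#include <rte_errno.h>

static uint16_t
send_burst(uint16_t port_id, uint16_t queue_id,
	   struct rte_mbuf **pkts, uint16_t nb_pkts)
{
	uint16_t nb_prep, nb_tx;

	nb_prep = rte_eth_tx_prepare(port_id, queue_id, pkts, nb_pkts);
	if (nb_prep != nb_pkts)
		/* pkts[nb_prep] violates the device's requirements */
		printf("tx_prepare failed: %s\n", rte_strerror(rte_errno));

	nb_tx = rte_eth_tx_burst(port_id, queue_id, pkts, nb_prep);
	return nb_tx;	/* caller handles the un-sent tail */
}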

> 
> > [...]
> > > > Following the same logic, why can't such a thing be made part of the TX
> > > > burst function as well (through a direct call to rte_phdr_cksum_fix()
> > > > whenever necessary). From an application standpoint, what are the 
> > > > advantages
> > > > of having to:
> > > > 
> > > >  if (tx_prep()) // iterate and update mbufs as needed
> > > >  tx_burst(); // iterate and send
> > > > 
> > > > Compared to:
> > > > 
> > > >  tx_burst(); // iterate, update as needed and send
> > > 
> > > I think that was discussed extensively quite a lot previously here:
> > > As Thomas already replied - main motivation is to allow user
> > > to execute them on different stages of packet TX pipeline,
> > > and probably on different cores.
> > > I think that gives the user better flexibility as to when/where to
> > > do these preparations and hopefully leads to better performance.
> > 
> > And I agree, I think this use case is valid but does not warrant such a high
> > penalty when your application does not need that much flexibility. Simple
> > (yet conscious) applications need the highest performance. Complex ones as
> > you described already suffer quite a bit from IPCs and won't mind a couple
> > of extra CPU cycles right?
> > 
> > Yes they will, therefore we need a method that satisfies both cases.
> > 
> > As a possible solution, a special mbuf flag could be added to each mbuf
> > having gone through tx_prepare(). That way, tx_burst() could skip some of
> > the checks and fix-ups it would otherwise have to do.
> 
> I like this idea!
>
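
A hypothetical sketch of the suggestion above; PKT_TX_PREPARED does not
exist in DPDK and the bit value below is invented purely for illustration.
The idea is that tx_prepare() marks each mbuf it has fixed up, and the PMD's
tx_burst() skips the inline fix-ups for marked packets.

#include <rte_mbuf.h>

#define PKT_TX_PREPARED (1ULL << 50)	/* hypothetical flag */

static inline void
pmd_tx_fixup(struct rte_mbuf *m)
{
	if (m->ol_flags & PKT_TX_PREPARED)
		return;	/* already handled by tx_prepare() */
	/* otherwise do the checksum/header fix-ups inline here */
}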