Hi Fiona and Pathak,

Thanks!
Changchun (Alex)

-----Original Message-----
From: Pathak, Pravin [mailto:[email protected]] 
Sent: Friday, January 18, 2019 9:29 AM
To: Trahe, Fiona <[email protected]>; Changchun Zhang 
<[email protected]>; [email protected]
Cc: Trahe, Fiona <[email protected]>
Subject: RE: [dpdk-users] Run-to-completion or Pipe-line for QAT PMD in DPDK

Hi Alex -

-----Original Message-----
From: users [mailto:[email protected]] On Behalf Of Trahe, Fiona
Sent: Friday, January 18, 2019 8:14 AM
To: Changchun Zhang <[email protected]>; [email protected]
Cc: Trahe, Fiona <[email protected]>
Subject: Re: [dpdk-users] Run-to-completion or Pipe-line for QAT PMD in DPDK

Hi Alex,

> -----Original Message-----
> From: users [mailto:[email protected]] On Behalf Of Changchun 
> Zhang
> Sent: Thursday, January 17, 2019 11:01 PM
> To: [email protected]
> Subject: [dpdk-users] Run-to-completion or Pipe-line for QAT PMD in 
> DPDK
> 
> Hi,
> 
> 
> 
> I have a user question about using the QAT device in DPDK.
> 
> In a real design, after calling enqueue_burst() on a given queue pair 
> from one lcore, which of the following is usually done?
> 
> 1.     Run to completion: call dequeue_burst() and wait for the device 
> to finish the crypto operation,
> 
> 2.     or pipeline: return right after enqueue_burst() and release the 
> CPU, then call dequeue_burst() from another thread function?
> 
> Option 1 is more synchronous and appears in all the DPDK crypto 
> examples, while option 2 is asynchronous and I have never seen it in 
> any reference design, unless I missed something.
[Fiona]
Option 2 is not possible with QAT - the dequeue must be called in the same 
thread as the enqueue. This is optimised without atomics for best performance - 
if this is a problem let us know. 
However, the best performance does not come from using option 1 as a 
synchronous blocking method either. 
If you enqueue and then go straight to dequeue, you're not getting the best 
advantage from the cycles freed up by offloading. 
I.e. it is best to enqueue a burst, then go do some other work, like maybe 
collecting more requests for the next enqueue or other processing, then 
dequeue. Take and process whatever ops are dequeued - this will not 
necessarily match up with the number you've enqueued - it depends on how 
quickly you call the dequeue.
Don't wait until all the enqueued ops are dequeued before enqueuing the next 
batch.
So it's asynchronous, but in the same thread.
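
A minimal sketch of the loop shape described above: enqueue a burst, do other work, then take whatever has completed, all in one thread. Everything here is a hypothetical mock (struct mock_qp, mock_enqueue_burst(), mock_dequeue_burst(), mock_device_progress(), run_async_loop()) so that it compiles without DPDK; in a real application the two burst calls would be rte_cryptodev_enqueue_burst() and rte_cryptodev_dequeue_burst(), and the device would make progress on its own.

```c
#include <assert.h>
#include <stddef.h>

#define QP_DEPTH 64

/* Hypothetical mock of one queue pair; NOT the DPDK API. */
struct mock_qp {
    int    ring[QP_DEPTH]; /* op ids in submission order */
    size_t head;           /* total ops dequeued so far */
    size_t tail;           /* total ops enqueued so far */
    size_t done;           /* total ops the "device" has finished */
};

static size_t mock_enqueue_burst(struct mock_qp *qp, const int *ops, size_t n)
{
    size_t i;
    for (i = 0; i < n && qp->tail - qp->head < QP_DEPTH; i++)
        qp->ring[qp->tail++ % QP_DEPTH] = ops[i];
    return i; /* may be fewer than n if the ring is full */
}

static size_t mock_dequeue_burst(struct mock_qp *qp, int *ops, size_t n)
{
    size_t i;
    for (i = 0; i < n && qp->head < qp->done; i++)
        ops[i] = qp->ring[qp->head++ % QP_DEPTH];
    return i; /* may be 0 if nothing has finished yet */
}

/* Stand-in for the hardware finishing up to n ops while the core is busy. */
static void mock_device_progress(struct mock_qp *qp, size_t n)
{
    size_t pending = qp->tail - qp->done;
    qp->done += (n < pending) ? n : pending;
}

/* Push `total` ops through the mock in bursts; returns the number completed. */
static size_t run_async_loop(size_t total, size_t burst)
{
    struct mock_qp qp = {0};
    int in[32], out[32];
    size_t submitted = 0, completed = 0;

    assert(burst > 0 && burst <= 32);
    while (completed < total) {
        /* 1. Enqueue the next burst without waiting for earlier ones. */
        size_t left = total - submitted;
        size_t want = left < burst ? left : burst;
        for (size_t i = 0; i < want; i++)
            in[i] = (int)(submitted + i);
        submitted += mock_enqueue_burst(&qp, in, want);

        /* 2. Other work would go here; meanwhile the device finishes some ops. */
        mock_device_progress(&qp, burst / 2 + 1);

        /* 3. Take whatever is ready; the count need not match what was enqueued. */
        size_t got = mock_dequeue_burst(&qp, out, burst);
        for (size_t i = 0; i < got; i++)
            assert(out[i] == (int)(completed + i)); /* this mock is FIFO */
        completed += got;
    }
    return completed;
}
```

Note that the loop never blocks on the dequeue: a short or empty dequeue just means the next iteration picks up the remainder.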
[changchun] In the same thread, but how about dequeuing at the beginning of 
the thread loop each time: if data is present, process it; if there is no data, 
just do other work, and enqueue the packets at some point without waiting.
For example:
while (1)
{
        nb_ops = dequeue();
        if (nb_ops > 0)
        {
                process_dequeued_data();
                continue;
        }

        other_work();
        if (ipsec)
                enqueue();
}
Does it make sense?

You'll get the best throughput when you keep the input filled up so the device 
has operations to work on, and regularly dequeue a burst. Dequeuing too often 
will waste cycles on the overhead of calling the API; dequeuing too slowly will 
cause the device to back up. Ideally tune for your application to find the 
sweet spot in between these 2 extremes.
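
To illustrate the two extremes, here is a toy model, not a measurement of QAT. It assumes the application submits one op per tick, the device finishes one op per tick, and the application dequeues up to `burst` ops every `interval` ticks. The names struct poll_stats and simulate(), and all the numbers, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct poll_stats {
    size_t dequeue_calls;  /* proxy for API-call overhead */
    size_t enqueue_stalls; /* ticks where the ring was full (device backed up) */
};

/* Toy model: `depth` is the ring size, shared by in-flight and finished ops. */
static struct poll_stats simulate(size_t ticks, size_t depth,
                                  size_t interval, size_t burst)
{
    struct poll_stats s = {0, 0};
    size_t in_flight = 0; /* submitted, not yet finished by the device */
    size_t ready = 0;     /* finished, not yet dequeued */

    assert(interval > 0 && burst > 0);
    for (size_t t = 0; t < ticks; t++) {
        /* Application tries to submit one op. */
        if (in_flight + ready < depth)
            in_flight++;
        else
            s.enqueue_stalls++;

        /* Device finishes one op per tick if it has any. */
        if (in_flight > 0) {
            in_flight--;
            ready++;
        }

        /* Application polls every `interval` ticks. */
        if ((t + 1) % interval == 0) {
            size_t take = ready < burst ? ready : burst;
            s.dequeue_calls++;
            ready -= take;
        }
    }
    return s;
}
```

In this model, with 1024 ticks, a ring depth of 64 and a burst of 32, polling every tick costs 1024 dequeue calls, polling every 16 ticks costs 64 calls with no stalls, and polling every 128 ticks costs only 8 calls but the ring fills and submissions stall. Real numbers depend on the device and workload, so the interval has to be tuned per application.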
[Pravin]
I faced the exact same issue while moving from software crypto to HW. I 
implemented the option Fiona suggested.  
The thread enqueues to the crypto engine and goes back to other work. It 
periodically polls the crypto device to see if work is finished.
As we have a single thread running, it keeps enqueuing as work arrives and 
dequeuing as results are ready, while doing other stuff in between.
To keep track of packets, I put an ID into the crypto operation's private data.
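
A rough sketch of the ID-in-private-data idea, assuming each op carries a small application-owned byte area. The types and helpers here (struct mock_op, struct op_priv, struct pkt_ctx, tag_op(), complete_op()) are hypothetical stand-ins, not DPDK definitions; in DPDK the private area would normally be sized through the priv_size argument when the crypto op pool is created.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PKTS 8

/* Hypothetical stand-in for a crypto op with trailing private data. */
struct mock_op {
    int     status;   /* unused in this sketch */
    uint8_t priv[16]; /* application-owned bytes that travel with the op */
};

struct op_priv {      /* what the application stores in priv[] */
    uint32_t pkt_id;
};

struct pkt_ctx {      /* per-packet state kept by the application */
    int in_flight;
    int completed;
};

static struct pkt_ctx pkt_table[MAX_PKTS];

/* Before enqueue: stamp the op with the packet's ID. */
static void tag_op(struct mock_op *op, uint32_t pkt_id)
{
    struct op_priv p = { .pkt_id = pkt_id };

    assert(pkt_id < MAX_PKTS);
    memcpy(op->priv, &p, sizeof(p)); /* memcpy avoids alignment assumptions */
    pkt_table[pkt_id].in_flight = 1;
}

/* After dequeue: recover the ID and update that packet's state. */
static uint32_t complete_op(const struct mock_op *op)
{
    struct op_priv p;

    memcpy(&p, op->priv, sizeof(p));
    assert(p.pkt_id < MAX_PKTS && pkt_table[p.pkt_id].in_flight);
    pkt_table[p.pkt_id].in_flight = 0;
    pkt_table[p.pkt_id].completed = 1;
    return p.pkt_id;
}
```

Because the ID travels with the op, the thread can match each dequeued op to its packet without assuming that a dequeue returns the same set of ops as the preceding enqueue.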
