Re: crypto: caam from tasklet to threadirq

2016-09-20 Thread Russell King - ARM Linux
Okay, I've re-tested, using a different way of measuring, because using
openssl speed is impractical for off-loaded engines.  I've decided to
use this way to measure the performance:

dd if=/dev/zero bs=1048576 count=128 | /usr/bin/time openssl dgst -md5

The threaded IRQs case gives:

0.05user 2.74system 0:05.30elapsed 52%CPU (0avgtext+0avgdata 2400maxresident)k
0.06user 2.52system 0:05.18elapsed 49%CPU (0avgtext+0avgdata 2404maxresident)k
0.12user 2.60system 0:05.61elapsed 48%CPU (0avgtext+0avgdata 2460maxresident)k
=> 5.36s => 25.0MB/s

and the tasklet case:

0.08user 2.53system 0:04.83elapsed 54%CPU (0avgtext+0avgdata 2468maxresident)k
0.09user 2.47system 0:05.16elapsed 49%CPU (0avgtext+0avgdata 2368maxresident)k
0.10user 2.51system 0:04.87elapsed 53%CPU (0avgtext+0avgdata 2460maxresident)k
=> 4.95s => 27.1MB/s

which corresponds to an 8% slowdown for the threaded IRQ case.  So,
tasklets are indeed faster than threaded IRQs.
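
The averages above can be rechecked with a short script. This is only a
sketch of the arithmetic, not part of the original measurement: the
elapsed fields are copied from the /usr/bin/time output above, and it
assumes the MB/s figures use decimal megabytes (10^6 bytes), which is
what reproduces the posted 25.0 and 27.1. The helper names are made up
for illustration.

```python
from statistics import mean

# dd bs=1048576 count=128 feeds this many bytes into openssl
PAYLOAD_BYTES = 128 * 1048576


def parse_elapsed(field):
    """Convert a GNU time 'M:SS.ss' elapsed field into seconds."""
    minutes, seconds = field.split(":")
    return int(minutes) * 60 + float(seconds)


def summarize(elapsed_fields):
    """Return (mean elapsed seconds, throughput in decimal MB/s)."""
    secs = mean(parse_elapsed(f) for f in elapsed_fields)
    return secs, PAYLOAD_BYTES / secs / 1e6


# Elapsed fields copied from the three runs of each case
threaded = ["0:05.30", "0:05.18", "0:05.61"]
tasklet = ["0:04.83", "0:05.16", "0:04.87"]

threaded_secs, threaded_rate = summarize(threaded)
tasklet_secs, tasklet_rate = summarize(tasklet)

# Slowdown expressed as extra elapsed time relative to the tasklet case
slowdown_pct = (threaded_secs / tasklet_secs - 1) * 100

print(f"threaded IRQ: {threaded_secs:.2f}s {threaded_rate:.1f}MB/s")
print(f"tasklet:      {tasklet_secs:.2f}s {tasklet_rate:.1f}MB/s")
print(f"slowdown:     {slowdown_pct:.1f}%")
```

With only three runs per case and an overlap between the two sets (5.18s
versus 5.16s), the 8% figure is an average over a small sample.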

I guess the reason is that tasklets are much simpler, being able to
run just before we return to userspace without involving scheduler
overheads, but that's speculation.

I've tried to perf it, but...

Samples: 31K of event 'cycles', Event count (approx.): 3552246846
  Overhead  Command          Shared Object     Symbol
+   33.22%  kworker/0:1      [kernel.vmlinux]  [k] __do_softirq
+   15.78%  irq/311-2101000  [kernel.vmlinux]  [k] __do_softirq
+    7.49%  irqbalance       [kernel.vmlinux]  [k] __do_softirq
+    7.26%  openssl          [kernel.vmlinux]  [k] __do_softirq
+    5.71%  ksoftirqd/0      [kernel.vmlinux]  [k] __do_softirq
+    3.64%  kworker/0:2      [kernel.vmlinux]  [k] __do_softirq
+    3.52%  swapper          [kernel.vmlinux]  [k] __do_softirq
+    3.14%  kworker/0:1      [kernel.vmlinux]  [k] _raw_spin_unlock_irq
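
Since the same symbol appears under several commands, it can help to add
the rows up per symbol. The following is a rough sketch that does this
for the rows shown above only (the percentages are copied from the
listing and the function name is hypothetical); it says nothing about
the rest of the profile that was not posted. For these rows,
__do_softirq adds up to about 76.6% of the samples.

```python
from collections import defaultdict

# Rows copied from the perf listing above: overhead, command, object, symbol
PERF_ROWS = """\
33.22%  kworker/0:1      [kernel.vmlinux]  [k] __do_softirq
15.78%  irq/311-2101000  [kernel.vmlinux]  [k] __do_softirq
7.49%   irqbalance       [kernel.vmlinux]  [k] __do_softirq
7.26%   openssl          [kernel.vmlinux]  [k] __do_softirq
5.71%   ksoftirqd/0      [kernel.vmlinux]  [k] __do_softirq
3.64%   kworker/0:2      [kernel.vmlinux]  [k] __do_softirq
3.52%   swapper          [kernel.vmlinux]  [k] __do_softirq
3.14%   kworker/0:1      [kernel.vmlinux]  [k] _raw_spin_unlock_irq
"""


def overhead_by_symbol(report):
    """Sum the overhead percentages per symbol across all commands."""
    totals = defaultdict(float)
    for line in report.splitlines():
        fields = line.split()
        if not fields:
            continue
        percent = float(fields[0].rstrip("%"))
        symbol = fields[-1]  # the symbol name is the last column
        totals[symbol] += percent
    return dict(totals)


totals = overhead_by_symbol(PERF_ROWS)
for symbol, percent in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{percent:6.2f}%  {symbol}")
```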

I was going to try to get the threaded IRQ case, but I've ended up with
perf getting buggered because of the iMX6 SMP perf dysfunction:

[ 3448.810416] irq 24: nobody cared (try booting with the "irqpoll" option)
...
[ 3448.824528] Disabling IRQ #24

caused by FSL's utterly brain-dead idea of routing all the perf
interrupts to a single non-CPU-local interrupt input, and the refusal of
kernel folk to find an acceptable solution to support this.

So, sorry, I'm not going to bother trying to get any further with this.
If the job was not made harder by stupid hardware design and kernel
politics, then I might be more inclined to do deeper investigation, but
right now I'm finding that I'm not interested in trying to jump through
these stupid hoops.

I think I've proven from the above that this patch needs to be reverted
due to the performance regression, and that there _is_ most definitely
a detrimental effect of switching from tasklets to threaded IRQs.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: crypto: caam from tasklet to threadirq

2016-09-16 Thread Russell King - ARM Linux
On Fri, Sep 16, 2016 at 02:01:00PM +, Cata Vasile wrote:
> Hi,
> 
> We've tried to test and benchmark your submitted work[1].
> 
> Cryptographic offloading is also used in IPsec in the Linux kernel. In
> heavy traffic scenarios, the NIC driver competes with the crypto device
> driver. Most NICs use the NAPI context, which is one of the most
> prioritized context types. In IPsec scenarios the performance is
> trashed because, although raw data gets into the device and is
> encrypted/decrypted, the dequeue code in the CAAM driver has a hard time
> being scheduled to actually call the callback that notifies the
> networking stack it can continue working with that data.

Having received a reply from Thomas Gleixner today, there appears to be
some disagreement with your findings, and a suggestion that the problem
needs proper and more in-depth investigation.

Thomas indicates that the NAPI processing shows an improvement when
moved to the same context that threaded interrupts run in, as opposed
to the current softirq context - which also would run the tasklets.

What I would say is that if threaded IRQs are causing harm, then there
seems to be something very wrong somewhere.

> In this scenario, at heavy load, the kernel warns about RCU stalls and
> the forwarding path has a lot of latency.  Have you tried benchmarking
> the board you used for testing?

It's way too long ago for me to remember - these patches were created
almost a year ago - October 20th 2015, which is when I'd have tested
them.  So, I'm afraid I can't help very much at this point, apart from
trying to re-run some benchmarks.

I'd suggest testing with openssl (with AF_ALG support), which is probably
what I tested and benchmarked.  However, as I say, it's far too long
ago for me to really remember at this point.
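
For reference, AF_ALG is the socket interface through which userspace
reaches the kernel crypto API. The sketch below only illustrates that
interface using Python's socket module on Linux; it is not the openssl
setup that was benchmarked, and the function name is made up. Which
driver actually services the "md5" request (CAAM or a software
implementation) is decided by the kernel and is not visible here. The
function returns None where AF_ALG or the algorithm is not available.

```python
import hashlib
import socket


def md5_via_af_alg(data):
    """Hash data through the kernel crypto API, or return None if unavailable."""
    if not hasattr(socket, "AF_ALG"):
        return None  # not a Linux build of Python
    try:
        with socket.socket(socket.AF_ALG, socket.SOCK_SEQPACKET, 0) as alg:
            # Select the algorithm type and name as registered in the kernel
            alg.bind(("hash", "md5"))
            # accept() returns the operation socket used to feed data
            op, _ = alg.accept()
            with op:
                op.sendall(data)
                return op.recv(16)  # an MD5 digest is 16 bytes
    except OSError:
        return None  # AF_ALG or md5 not provided by this kernel


payload = b"\0" * 4096
digest = md5_via_af_alg(payload)
reference = hashlib.md5(payload).digest()

if digest is None:
    print("AF_ALG md5 not available on this system")
else:
    print("kernel digest matches hashlib:", digest == reference)
```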

> I have run some on our other platforms. The after benchmark fails to
> run at the top level of the before results.

Sorry, that last sentence doesn't make any sense to me.

I don't have the bandwidth to look at this, and IPsec doesn't interest
me one bit - I've never been able to work out how to set up IPsec
locally.  From what I remember when I looked into it many years ago,
you had to have significant information about IPsec to get it up and
running.  Maybe things have changed since then, I don't know.

If you want me to reproduce it, please send me a step-by-step idiot's
guide on setting up a working test scenario which reproduces your
problem.

Thanks.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.


crypto: caam from tasklet to threadirq

2016-09-16 Thread Cata Vasile
Hi,

We've tried to test and benchmark your submitted work[1].

Cryptographic offloading is also used in IPsec in the Linux kernel. In heavy
traffic scenarios, the NIC driver competes with the crypto device driver. Most
NICs use the NAPI context, which is one of the most prioritized context types.
In IPsec scenarios the performance is trashed because, although raw data gets
into the device and is encrypted/decrypted, the dequeue code in the CAAM
driver has a hard time being scheduled to actually call the callback that
notifies the networking stack it can continue working with that data.

In this scenario, at heavy load, the kernel warns about RCU stalls and the
forwarding path has a lot of latency.
Have you tried benchmarking the board you used for testing?

I have run some on our other platforms. The after benchmark fails to run at the
top level of the before results. The RCU stall does not always occur in the
same place. The after ping latency is greater, and oscillates a lot.

It might be a good idea for the codebase to change to a threaded IRQ, but from
a pragmatic perspective, the whole system has to suffer. That is one of the
reasons most crypto accelerators try to run dequeue primitives in
high-priority contexts.


Regards,
Catalin Vasile


[1] https://git.kernel.org/cgit/linux/kernel/git/herbert/cryptodev-2.6.git/commit/?id=66d2e2028091a074aa1290d2eeda5ddb1a6c329c