Michael Nelson created CASSANDRA-7931:
-----------------------------------------

             Summary: Single Host CPU profiling data dump
                 Key: CASSANDRA-7931
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7931
             Project: Cassandra
          Issue Type: Improvement
         Environment: 2.1.0 on my local machine
            Reporter: Michael Nelson
            Priority: Minor
         Attachments: traces.txt

At Boot Camp today I did some CPU profiling and wanted to turn in my findings 
somehow. So here they are in a JIRA.

I ran a cassandra-stress read test against my local machine (single node: SSD, 8 
procs, plenty of RAM). I'm using straight 2.1.0 with the "Lightweight Java 
Profiler" by Jeremy Manson 
(https://code.google.com/p/lightweight-java-profiler/). This is a lower-level 
sampling profiler that reports what the CPU is actually doing, rather than where 
wall-clock time is being spent.

Attached is the report. Here are a few high points with some commentary from 
some investigation I did today:

1. The number one _consumer_of_CPU_ is this stack trace:
sun.misc.Unsafe.park(Unsafe.java:-1)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:349)
org.apache.cassandra.concurrent.SEPWorker.doWaitSpin(SEPWorker.java:236)
org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:74)
java.lang.Thread.run(Thread.java:745)

There is a CPU cost to going into and out of Unsafe.park(), and the SEPWorkers 
do it so often while spinning that they are the largest single consumer of CPU. 
Note that this profiler samples the CPU, so it isn't the _blocking_ this call 
does that is the issue; it is the CPU actually consumed as the thread prepares 
to block and then later comes back out. This kind of lower-level, fine-grained 
control of threads is tricky, and there appears to be room for improvement 
here. I tried a few approaches that didn't pan out. There really needs to be a 
way for threads to block waiting on the executor queues. That way they become 
available immediately (rather than work waiting for them to poll the executors) 
and consume no CPU when they aren't needed. Maybe block for some short period 
of time and then become available to other queues after that?
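
Here is a rough, hypothetical sketch of the alternative I have in mind (this is 
not SEPWorker's actual code; BlockingWorker and its queue are made up for 
illustration): instead of spin-parking with LockSupport.parkNanos(), the worker 
blocks directly on a task queue and wakes only when work arrives.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch, not SEPWorker's real code: instead of repeatedly
// calling LockSupport.parkNanos() in a spin loop (each park/unpark cycle
// burns CPU), the worker blocks on the task queue and wakes only when
// work actually arrives.
public class BlockingWorker implements Runnable
{
    private final BlockingQueue<Runnable> tasks = new LinkedBlockingQueue<>();
    private volatile boolean shutdown = false;

    public void submit(Runnable task)
    {
        tasks.add(task); // wakes a blocked worker immediately, no polling
    }

    public void run()
    {
        while (!shutdown)
        {
            try
            {
                // Block for a bounded period; after the timeout the thread
                // could offer itself to other executors' queues instead.
                Runnable task = tasks.poll(10, TimeUnit.MILLISECONDS);
                if (task != null)
                    task.run();
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}

With this shape, submit() hands work to a waiting thread right away, and an 
idle worker burns no CPU at all.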

2. Second is Netty writing to sockets. I didn't investigate this; Netty is 
already pretty well optimized. Someone mentioned there are native transport 
options for Netty, which are probably worth trying.
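
For what it's worth, here is roughly what opting into the native transport 
could look like (assuming Netty 4.x with the netty-transport-native-epoll 
artifact available; this is illustrative, not Cassandra's actual server 
bootstrap code):

import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.EventLoopGroup;
import io.netty.channel.ServerChannel;
import io.netty.channel.epoll.Epoll;
import io.netty.channel.epoll.EpollEventLoopGroup;
import io.netty.channel.epoll.EpollServerSocketChannel;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;

// Illustrative only: pick Netty's native epoll transport when the JNI
// library is available (Linux), otherwise fall back to the NIO transport.
public class NativeTransportExample
{
    public static ServerBootstrap bootstrap()
    {
        boolean useEpoll = Epoll.isAvailable();
        EventLoopGroup group = useEpoll ? new EpollEventLoopGroup()
                                        : new NioEventLoopGroup();
        Class<? extends ServerChannel> channelClass =
                useEpoll ? EpollServerSocketChannel.class
                         : NioServerSocketChannel.class;

        return new ServerBootstrap().group(group).channel(channelClass);
    }
}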

3. Third is similar to #1:
sun.misc.Unsafe.park(Unsafe.java:-1)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:349)
org.apache.cassandra.utils.concurrent.WaitQueue$AbstractSignal.awaitUntil(WaitQueue.java:303)
org.apache.cassandra.utils.concurrent.SimpleCondition.await(SimpleCondition.java:63)
org.apache.cassandra.service.ReadCallback.await(ReadCallback.java:90)
org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:100)
org.apache.cassandra.service.AbstractReadExecutor.get(AbstractReadExecutor.java:144)
org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:1226)

I looked at this a little. The asynchronous callback goes through Unsafe.park() 
even for local reads (like on my laptop), and it only rarely actually blocked. 
It just paid the CPU cost of going into the park and immediately coming back 
out, because the read was already satisfied from the buffer cache. :-(
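
A sketch of the kind of fast path I mean (hypothetical, single-waiter, not the 
real SimpleCondition/WaitQueue): check whether the condition is already 
signalled before registering and parking, so the common already-satisfied local 
read never touches Unsafe.park().

import java.util.concurrent.locks.LockSupport;

// Hypothetical single-waiter condition with a cheap fast path: if the read
// has already completed (the common case for local reads served from the
// buffer cache), await() returns immediately without ever parking.
public class FastPathCondition
{
    private volatile boolean signaled = false;
    private volatile Thread waiter = null;

    public void await()
    {
        if (signaled)                 // fast path: already satisfied, no park
            return;

        waiter = Thread.currentThread();
        while (!signaled)
            LockSupport.park(this);   // slow path only when we really wait
        waiter = null;
    }

    public void signal()
    {
        signaled = true;
        Thread t = waiter;
        if (t != null)
            LockSupport.unpark(t);
    }
}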

4. Fourth is Netty doing epoll. Again, worth considering Netty's native 
transport (see the sketch under #2) to help here.

The others are interesting but, for this little benchmark, not as big a deal. 
For example, ArrayList.grow() shows up more than I would have expected.
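
The usual fix when ArrayList.grow() is hot is to size lists up front wherever 
the element count is known ahead of time. A trivial illustration (the class and 
counts here are made up):

import java.util.ArrayList;
import java.util.List;

// ArrayList.grow() does an Arrays.copyOf() each time the backing array
// fills up, so sizing the list up front when the element count is known
// avoids the repeated copies that showed up in the profile.
public class PresizedList
{
    public static List<Integer> collect(int expectedCount)
    {
        List<Integer> rows = new ArrayList<>(expectedCount); // no grow() calls
        for (int i = 0; i < expectedCount; i++)
            rows.add(i);
        return rows;
    }
}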

The Boot Camp was great. Feel free to contact me if I can help somehow.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
