I tested the new patch out and am seeing comparable CPU usage to the
previous patch. As far as I can see, heap usage is also comparable between
the two patches, though I will say that both look significantly better than
0.8.1.1 (~250MB vs. ~1GB).
I'll report back if any new issues come up as I ...
Jun,
I updated our brokers earlier today with the mentioned patch. A week ago
our brokers used ~380% CPU (out of 400%) quite consistently, and now
they're varying between 250-325% (probably running a bit high right now as
we have some consumers catching up on quite a bit of lag), so there's
definitely an improvement.
I'm checking into this on our side. The version we're working on jumping to
right now is not the 0.8.2 release version, but it is significantly ahead
of 0.8.1.1. We've got it deployed on one cluster and I'm making sure it's
balanced right now before I take a look at all the metrics. I'll fill in ...
We can reproduce this issue, have a theory as to the cause, and are working
on a fix. Here is the ticket to track it:
https://issues.apache.org/jira/browse/KAFKA-1952
I would recommend people hold off on 0.8.2 upgrades until we have a handle
on this.
-Jay
On Fri, Feb 13, 2015 at 1:47 PM, Solon Gordon wrote:
Thanks for the fast response. I did a quick test and initial results look
promising. When I swapped in the patched version, CPU usage dropped from
~150% to ~65%. Still a bit higher than what I see with 0.8.1.1 but much
more reasonable.
I'll do more testing on Monday but wanted to get you some ...
This is a serious issue, we'll take a look.
-Jay
On Thu, Feb 12, 2015 at 3:19 PM, Solon Gordon so...@knewton.com wrote:
I saw a very similar jump in CPU usage when I tried upgrading from 0.8.1.1
to 0.8.2.0 today in a test environment. The Kafka cluster there is two
m1.larges handling 2,000 ...
Jun,
I re-ran the hprof test, again for about 30 minutes, for 0.8.2.0-rc2 with
the same version of snappy that 0.8.1.1 used, and attached the logs.
Unfortunately there wasn't any improvement; the node running 0.8.2.0-rc2
still had higher load and CPU usage.
Best regards,
Mathias
On Tue Feb 03 ...
Mathias,
The new hprof output doesn't reveal anything new to me. We did fix the
logic around the use of Purgatory in 0.8.2, which could potentially drive
up the CPU usage a bit. To verify that, could you run your test on a single
broker (with replication factor 1) on both 0.8.1 and 0.8.2 and see if there
is any ...
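For that single-broker test, a topic with replication factor 1 can be
created with the stock tooling, e.g. (topic name is just an example):

bin/kafka-topics.sh --create --zookeeper localhost:2181 --partitions 1 \
    --replication-factor 1 --topic cpu-test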
Hi all,
I ran the same hprof test on 0.8.1.1, and also did a re-run on 0.8.2.0-rc2;
logs from both runs are attached. Both runs lasted for 30-40 minutes. The
configurations used can be found over here:
https://gist.github.com/mthssdrbrg/5fcb9fbdb851d8cc66a2. The configuration
used for the first run ...
Jun,
Yeah, sure, I'll take it for a spin tomorrow.
On Mon Feb 02 2015 at 11:08:42 PM Jun Rao j...@confluent.io wrote:
Mathias,
Thanks for the info. I took a quick look. The biggest difference I saw is
the org.xerial.snappy.SnappyNative.rawCompress() call. In 0.8.1.1, it uses
about 0.05% of the CPU. In 0.8.2.0, it uses about 0.10% of the CPU. We did
upgrade snappy from 1.0.5 in 0.8.1.1 to 1.1.1.6 in 0.8.2.0.
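For anyone who wants to compare the two snappy versions in isolation, a
rough micro-benchmark along these lines should exercise the same
rawCompress() path (payload size and iteration count are arbitrary; swap
snappy-java 1.0.5 / 1.1.1.6 on the classpath between runs):

import java.util.Random;
import org.xerial.snappy.Snappy;

public class SnappyBench {
    public static void main(String[] args) throws Exception {
        byte[] data = new byte[64 * 1024];
        new Random(0).nextBytes(data); // random payload; real batches compress better
        long start = System.nanoTime();
        for (int i = 0; i < 10000; i++) {
            Snappy.compress(data); // delegates to SnappyNative.rawCompress()
        }
        System.out.printf("10000 compressions took %.1f ms%n",
                (System.nanoTime() - start) / 1e6);
    }
}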
On Monday 02 February 2015 11:03 PM, Jun Rao wrote:
Jaikiran,
The fix you provided is probably unnecessary. The channel that we use in
SimpleConsumer (BlockingChannel) is configured to be blocking. So even
though the read from the socket is in a loop, each read blocks if there are
no bytes ...
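To illustrate the point (a sketch only, not Kafka's actual code): with a
blocking socket, a read loop parks the thread instead of spinning:

import java.io.DataInputStream;
import java.net.Socket;

public class BlockingReadSketch {
    public static void main(String[] args) throws Exception {
        Socket socket = new Socket("localhost", 9092); // placeholder address
        socket.setSoTimeout(30000); // each read blocks up to 30s, burning no CPU
        DataInputStream in = new DataInputStream(socket.getInputStream());
        byte[] sizeBuf = new byte[4];
        in.readFully(sizeBuf); // parks here until 4 bytes arrive; no busy-wait
    }
}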
Actually that fetch call blocks on the server side. That is, if there is no
data, the server will wait until data arrives or the timeout occurs to send
a response. This is done to help simplify client development. If that
isn't happening, it is likely a bug or a configuration change in the ...
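For illustration, the 0.8-era SimpleConsumer API exposes this long-poll
behaviour through maxWait and minBytes on the fetch request; a minimal
sketch with placeholder host, topic, and offset:

import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class LongPollFetch {
    public static void main(String[] args) {
        SimpleConsumer consumer =
                new SimpleConsumer("broker-host", 9092, 30000, 64 * 1024, "cpu-test");
        FetchRequest req = new FetchRequestBuilder()
                .clientId("cpu-test")
                .addFetch("my-topic", 0, 0L, 100000) // topic, partition, offset, fetchSize
                .maxWait(1000) // server parks the request up to 1s if there is no data
                .minBytes(1)   // respond as soon as at least one byte is available
                .build();
        FetchResponse resp = consumer.fetch(req);
        System.out.println("fetch error code: " + resp.errorCode("my-topic", 0));
        consumer.close();
    }
}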
Ah, yeah, you're right. That is just wait time, not CPU time. We should
check that profile; it must be something else on the list.
-Jay
On Mon, Feb 2, 2015 at 9:33 AM, Jun Rao j...@confluent.io wrote:
Hi, Mathias,
From the hprof output, it seems that the top CPU consumers are
socketAccept() and epollWait(). As far as I am aware, there haven't been
any significant changes in the socket server code between 0.8.1 and 0.8.2.
Could
you run the same hprof test on 0.8.1 so that we can see the difference?
Hi Mathias,
Looking at that thread dump, I think the potential culprit is this one:
TRACE 303545: (thread=200049)
sun.nio.ch.EPollArrayWrapper.epollWait(EPollArrayWrapper.java:Unknown line)
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
Hi Neha,
I sent an e-mail earlier today, but noticed now that it didn't actually go
through.
Anyhow, I've attached two files, one with output from a 10-minute run and
one with output from a 30-minute run. I realized that maybe I should've
done one or two runs with 0.8.1.1 as well, but ...
Hi Neha,
Yeah, sure. I'm not familiar with hprof, so are there any particular
options I should include, or should I just run with the defaults?
Best regards,
Mathias
On Mon Dec 08 2014 at 7:41:32 PM Neha Narkhede n...@confluent.io wrote:
Thanks for reporting the issue. Would you mind running hprof and sending
the output?
The following should be sufficient:

java -agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof classname
You would need to start the Kafka server with the settings above for some
time, until you observe the problem.
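For the broker itself, the agent flag can be passed through the stock
start script via KAFKA_OPTS, along these lines:

KAFKA_OPTS='-agentlib:hprof=cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=kafka.hprof' \
    bin/kafka-server-start.sh config/server.properties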
On Tue, Dec 9, 2014 at 3:47 AM, Mathias Söderberg wrote:
Good day,
I upgraded a Kafka cluster from v0.8.1.1 to v0.8.2-beta and noticed that
the CPU usage on the broker machines went up by roughly 40%, from ~60% to
~100%, and am wondering if anyone else has experienced something similar?
The load average also went up by 2x-3x.
We're running on EC2 and ...
Thanks for reporting the issue. Would you mind running hprof and sending
the output?