Becket/Jason,
So, it turns out the server where we saw the recurring FD issue was not
patched correctly, which is why we saw the deadlock again. We caught that,
and after testing over the last few days, we feel pretty confident, I'd say
99% sure, that the patch in KAFKA-3994 does fix the problem for us.
Thanks Becket,
We should get a full thread dump the next time, so I'll send it as soon
as that happens.
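For reference, the plan is to grab it with plain jstack; a rough sketch,
assuming the broker is the only kafka.Kafka process on the box:

  # find the broker PID and dump all threads, including lock/deadlock info
  BROKER_PID=$(pgrep -f kafka.Kafka)
  jstack -l "$BROKER_PID" > /tmp/kafka-thread-dump-$(date +%s).txt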
Marcos
On Fri, Nov 11, 2016 at 11:27 AM, Becket Qin wrote:
> Hi Marcos,
>
> Thanks for the update. It looks like the deadlock you saw was another one.
> Do you mind sending us a full stack trace after this happens?
Hi Marcos,
Thanks for the update. It looks like the deadlock you saw was another one. Do
you mind sending us a full stack trace after this happens?
Regarding the downgrade, the steps would be the following:
1. change inter.broker.protocol.version to 0.10.0 (see the sketch after these steps)
2. rolling bounce the cluster
3. deploy the 0.10.0.1 build and do another rolling bounce
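For step 1, a rough sketch of the change, assuming the stock config layout
(paths are illustrative):

  # step 1: pin the protocol version on every broker
  echo 'inter.broker.protocol.version=0.10.0' >> config/server.properties
  # step 2: then bounce one broker at a time
  bin/kafka-server-stop.sh
  bin/kafka-server-start.sh -daemon config/server.properties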
Becket/Jason,
We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994 patch,
but we're seeing the exact same issue. It doesn't seem like the patch
fixes the problem we're seeing.
At this point, we're considering downgrading our prod clusters back to
0.10.0.1. Is there any recommended procedure for downgrading?
Thanks Becket.
I was working on that today. I have a working jar, created from the
0.10.1.0 branch with that specific KAFKA-3994 patch applied. I've left it
running on one test broker today; tomorrow I'll try to trigger the issue
with both the patched and unpatched versions.
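In case it helps anyone reproduce this, the jar came from roughly the
following; the patch file name is just what I saved the KAFKA-3994 diff as,
so treat it as illustrative:

  git clone https://github.com/apache/kafka.git && cd kafka
  git checkout 0.10.1
  git apply KAFKA-3994.patch   # the diff from the JIRA
  gradle                       # bootstrap the wrapper on a fresh clone
  ./gradlew clean releaseTarGz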
Hi Marcos,
Is it possible for you to apply the patch of KAFKA-3994 and see if the
issue is still there? The current patch of KAFKA-3994 should work; the only
reason we haven't checked it in is that when we ran stress tests it showed
a noticeable performance impact while producers are producing.
We ran into this issue several more times over the weekend. Basically, FDs
are exhausted so fast now that we can't even get to the server in time; the
JVM goes down in less than 5 minutes.
I can send the whole thread dumps if needed, but for brevity's sake, I just
copied over the relevant deadlock sections.
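For anyone trimming their own dumps: jstack prints a "Found one Java-level
deadlock:" header when it detects one, so a sed range over that section is
enough to cut the report out:

  # keep only the deadlock report from a full jstack dump
  sed -n '/Found one Java-level deadlock/,/Found .* deadlock/p' kafka-thread-dump.txt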
That's great, thanks Jason.
We'll try to apply the patch in the meantime, and wait for the official
0.10.1.1 release.
Please let us know if you need more details about the deadlocks on our side.
Thanks again!
Marcos
On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson wrote:
Hi Marcos,
I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since
we're now seeing users hit this in practice, I'll definitely bump up the
priority on a fix. I can't say for sure when the release will be, but we'll
merge the fix into the 0.10.1 branch and you can build from there.
Jason,
Thanks for that link. It does appear to be a very similar issue, if not
identical. In our case, the deadlock is reported across three threads, one
of them being a group_metadata_manager thread. Otherwise, it looks the same.
On your questions:
- We did not see this in prior releases, but…
Hey Marcos,
Thanks for the report. Can you check out
https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches? At
a glance, it looks like the same problem. We tried pretty hard to get the
fix into the release, but it didn't quite make it. A few questions:
1. Did you not see this in previous releases?
Just to expand on Lawrence's answer: file descriptor usage goes from 2-3K
under normal conditions to 64K+ under deadlock, which it hits within a
couple of hours, at which point the broker goes down, because 64K is our
OS-defined limit.
If it was only a 33% increase from the new timestamp indexes, we wouldn't
be anywhere near that limit.
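For reference, those counts come from something like the following on the
broker host (the process match is illustrative):

  BROKER_PID=$(pgrep -f kafka.Kafka)
  ls /proc/$BROKER_PID/fd | wc -l             # FDs currently open
  grep 'open files' /proc/$BROKER_PID/limits  # the per-process ceiling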
We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1.
We’re now running on 0.10.1.0, and the current FD increase is due to a
deadlock, not to new functionality or features.
Lawrence Weikum | Software Engineer | Pandora
1426 Pearl Street, Suite 100, Boulder CO 80302
m 720.203.1578
The 0.10.1 broker will use more file descriptors than previous releases
because of the new timestamp indexes. You should expect and plan for ~33%
more file descriptors to be open.
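A quick way to see the effect on a broker is to compare the new .timeindex
files against the offset indexes in your log dirs; a sketch, assuming
/var/kafka-logs is the log directory:

  find /var/kafka-logs -name '*.timeindex' | wc -l  # new in 0.10.1
  find /var/kafka-logs -name '*.index' | wc -l      # pre-existing offset indexes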
-hans
/**
* Hans Jespersen, Principal Systems Engineer, Confluent Inc.
* h...@confluent.io (650)924-2670
*/
We're running into a recurrent deadlock issue in both our production and
staging clusters, both using the latest 0.10.1 release. The symptom we
noticed was that, on servers where Kafka producer connections are
short-lived, every other day or so we'd see file descriptors being
exhausted, until the broker went down.
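To see whether the connection churn lines up with the FD climb, a cheap
check on the broker, assuming the default port 9092:

  # log open FDs vs. established client connections once a minute
  PID=$(pgrep -f kafka.Kafka)
  while true; do
    echo "$(date)  fds=$(ls /proc/$PID/fd | wc -l)  conns=$(ss -tn sport = :9092 | wc -l)"
    sleep 60
  done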