Re: Deadlock using latest 0.10.1 Kafka release

2016-11-18 Thread Marcos Juarez
Becket/Jason, So, it turns out the server where we saw the recurring FD issue was not patched correctly, which is why we saw the deadlock again. We caught that, and after testing over the last few days, we feel pretty confident, I'd say 99% sure, that the patch in KAFKA-3994 does fix the problem for

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-11 Thread Marcos Juarez
Thanks Becket, We should get a full thread dump the next time, so I'll send it as soon as that happens. Marcos On Fri, Nov 11, 2016 at 11:27 AM, Becket Qin wrote: > Hi Marcos, > > Thanks for the update. It looks like the deadlock you saw was another one. Do > you mind sending us

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-11 Thread Becket Qin
Hi Marcos, Thanks for the update. It looks like the deadlock you saw was another one. Do you mind sending us a full stack trace after this happens? Regarding the downgrade, the steps would be the following: 1. change the inter.broker.protocol to 0.10.0 2. rolling bounce the cluster 3. deploy the

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-11 Thread Marcos Juarez
Becket/Jason, We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994 patch, but we're seeing the exact same issue. It doesn't seem like the patch fixes the problem we're seeing. At this point, we're considering downgrading our prod clusters back to 0.10.0.1. Is there any

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-07 Thread Marcos Juarez
Thanks Becket. I was working on that today. I have a working jar, created from the 0.10.1.0 branch, and that specific KAFKA-3994 patch applied to it. I've left it running in one test broker today, will try tomorrow to trigger the issue, and try it with both the patched and un-patched versions.

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-07 Thread Becket Qin
Hi Marcos, Is it possible for you to apply the KAFKA-3994 patch and see if the issue is still there? The current patch for KAFKA-3994 should work; the only reason we haven't checked it in is that, when we ran a stress test, it showed a noticeable performance impact when producers are producing

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-07 Thread Marcos Juarez
We ran into this issue several more times over the weekend. Basically, FDs are exhausted so fast now that we can't even get to the server in time; the JVM goes down in less than 5 minutes. I can send the whole thread dumps if needed, but for brevity's sake, I just copied over the relevant deadlock

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-04 Thread Marcos Juarez
That's great, thanks Jason. We'll try and apply the patch in the meantime, and wait for the official release for 0.10.1.1. Please let us know if you need more details about the deadlocks on our side. Thanks again! Marcos On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-04 Thread Jason Gustafson
Hi Marcos, I think we'll try to get this into 0.10.1.1 (I updated the JIRA). Since we're now seeing users hit this in practice, I'll definitely bump up the priority on a fix. I can't say for sure when the release will be, but we'll merge the fix into the 0.10.1 branch and you can build from there

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-04 Thread Marcos Juarez
Jason, Thanks for that link. It does appear to be a very similar issue, if not identical. In our case, the deadlock is reported across 3 threads, one of them being a group_metadata_manager thread. Otherwise, it looks the same. On your questions: - We did not see this in prior releases, but

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-03 Thread Jason Gustafson
Hey Marcos, Thanks for the report. Can you check out https://issues.apache.org/jira/browse/KAFKA-3994 and see if it matches? At a glance, it looks like the same problem. We tried pretty hard to get the fix into the release, but it didn't quite make it. A few questions: 1. Did you not see this in

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-03 Thread Marcos Juarez
Just to expand on Lawrence's answer: The increase in file descriptor usage goes from 2-3K under normal conditions, to 64K+ under deadlock, which it hits within a couple of hours, at which point the broker goes down, because that's our OS-defined limit. If it was only a 33% increase from the new
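To make the numbers in this message concrete, here is a minimal sketch in Python of the arithmetic involved. It uses only figures quoted in this thread (2-3K descriptors under normal load, a roughly 64K limit, and the ~33% overhead attributed to the timestamp indexes). The function names are hypothetical, and treating "64K" as exactly 64,000 is an assumption.

```python
# Back-of-the-envelope check using only the figures quoted in this thread.
NORMAL_FDS = (2_000, 3_000)  # "2-3K under normal conditions"
OS_FD_LIMIT = 64_000         # "64K+" OS-defined limit; exact value is an assumption
INDEX_OVERHEAD = 0.33        # "~33% more file descriptors" from timestamp indexes


def expected_fds(baseline: int, overhead: float = INDEX_OVERHEAD) -> int:
    """Descriptors expected if the only change were the index overhead."""
    return round(baseline * (1 + overhead))


def headroom(baseline: int, limit: int = OS_FD_LIMIT) -> float:
    """Fraction of the limit still free after applying the index overhead."""
    return 1 - expected_fds(baseline) / limit


if __name__ == "__main__":
    for base in NORMAL_FDS:
        print(base, expected_fds(base), f"{headroom(base):.1%}")
```

Under these assumptions, a 33% increase alone would leave the broker far below the limit, which is consistent with the FD exhaustion coming from the deadlock rather than from the new indexes.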

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-03 Thread Lawrence Weikum
We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1. We’re now running on 0.10.1.0, and the FD increase is due to a deadlock, not functionality or new features. Lawrence Weikum

Re: Deadlock using latest 0.10.1 Kafka release

2016-11-03 Thread Hans Jespersen
The 0.10.1 broker will use more file descriptors than previous releases because of the new timestamp indexes. You should expect and plan for ~33% more file descriptors to be open. -hans

Deadlock using latest 0.10.1 Kafka release

2016-11-03 Thread Marcos Juarez
We're running into a recurrent deadlock issue in both our production and staging clusters, both using the latest 0.10.1 release. The symptom we noticed was that, on servers where Kafka producer connections are short-lived, every other day or so, we'd see file descriptors being exhausted, until