Becket/Jason,

So, it turns out the server where we saw the recurring FD issue was not
patched correctly, which is why we saw the deadlock again. We caught that,
and after testing over the last few days, we feel pretty confident, I'd say
99% sure, that the patch in KAFKA-3994 does fix the problem for good. All
the evidence right now indicates that any broker that was patched correctly
never experienced the deadlock again. If we put stock Kafka 0.10.1.0 on
those brokers again, we get deadlocks within a day or two.
Any idea of roughly when this patch will make it into a Kafka release?

Thanks for your help!

Marcos

On Fri, Nov 11, 2016 at 11:56 AM, Marcos Juarez <mjua...@gmail.com> wrote:

> Thanks Becket,
>
> We should get a full thread dump the next time, so I'll send it as soon as
> that happens.
>
> Marcos
>
> On Fri, Nov 11, 2016 at 11:27 AM, Becket Qin <becket....@gmail.com> wrote:
>
>> Hi Marcos,
>>
>> Thanks for the update. It looks like the deadlock you saw was another one.
>> Do you mind sending us a full stack trace after this happens?
>>
>> Regarding the downgrade, the steps would be the following:
>> 1. Change inter.broker.protocol.version to 0.10.0
>> 2. Rolling bounce the cluster
>> 3. Deploy the 0.10.0.1 code
>>
>> There might be a bunch of .timeindex files left over, but that should be
>> fine.
>>
>> Thanks,
>>
>> Jiangjie (Becket) Qin
>>
>>
>> On Fri, Nov 11, 2016 at 9:51 AM, Marcos Juarez <mjua...@gmail.com> wrote:
>>
>>> Becket/Jason,
>>>
>>> We deployed a jar with the base 0.10.1.0 release plus the KAFKA-3994
>>> patch, but we're seeing the same exact issue. It doesn't seem like the
>>> patch fixes the problem we're seeing.
>>>
>>> At this point, we're considering downgrading our prod clusters back to
>>> 0.10.0.1. Are there any concerns/issues we should be aware of when
>>> downgrading the cluster like that?
>>>
>>> Thanks,
>>>
>>> Marcos Juarez
>>>
>>>
>>> On Mon, Nov 7, 2016 at 5:47 PM, Marcos Juarez <mjua...@gmail.com> wrote:
>>>
>>>> Thanks Becket.
>>>>
>>>> I was working on that today. I have a working jar, created from the
>>>> 0.10.1.0 branch with that specific KAFKA-3994 patch applied to it. I've
>>>> left it running on one test broker today, and will try tomorrow to
>>>> trigger the issue with both the patched and un-patched versions.
>>>>
>>>> I'll let you know what we find.
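[Editor's note: the downgrade steps Becket lists above are mostly operational, but step 1 can be sketched in code. The config path below is a placeholder, not from the thread.]

```python
# Sketch of step 1 of the downgrade quoted above: pin the inter-broker
# protocol version before the rolling bounce. "server.properties" is a
# placeholder path; point it at your broker's actual config file.
from pathlib import Path

props = Path("server.properties")
lines = props.read_text().splitlines() if props.exists() else []
# Replace any existing setting rather than appending a duplicate.
lines = [l for l in lines if not l.startswith("inter.broker.protocol.version=")]
lines.append("inter.broker.protocol.version=0.10.0")
props.write_text("\n".join(lines) + "\n")

# Steps 2 and 3 are operational: rolling-bounce the cluster so every
# broker picks up the pinned protocol, and only then deploy 0.10.0.1.
print(props.read_text().strip())
```

The ordering matters: brokers still running 0.10.1.0 must already be speaking the 0.10.0 inter-broker protocol before the first 0.10.0.1 broker comes back up, otherwise the downgraded broker cannot talk to its peers.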
>>>>
>>>> Thanks,
>>>>
>>>> Marcos
>>>>
>>>> On Mon, Nov 7, 2016 at 11:25 AM, Becket Qin <becket....@gmail.com> wrote:
>>>>
>>>>> Hi Marcos,
>>>>>
>>>>> Is it possible for you to apply the patch of KAFKA-3994 and see if the
>>>>> issue is still there? The current patch of KAFKA-3994 should work; the
>>>>> only reason we haven't checked it in is that when we ran a stress test,
>>>>> it showed a noticeable performance impact when producers are producing
>>>>> with acks=all. So if you are blocked on this issue, maybe you can pick
>>>>> up the patch as a short-term solution. Meanwhile, we will prioritize
>>>>> the ticket.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jiangjie (Becket) Qin
>>>>>
>>>>> On Mon, Nov 7, 2016 at 9:47 AM, Marcos Juarez <mjua...@gmail.com> wrote:
>>>>>
>>>>>> We ran into this issue several more times over the weekend.
>>>>>> Basically, FDs are exhausted so fast now that we can't even get to the
>>>>>> server in time; the JVM goes down in less than 5 minutes.
>>>>>>
>>>>>> I can send the whole thread dumps if needed, but for brevity's sake, I
>>>>>> just copied over the relevant deadlock segments and concatenated them
>>>>>> all together in the attached text file.
>>>>>>
>>>>>> Do you think this is something I should add to the KAFKA-3994 ticket?
>>>>>> Or is the information in that ticket enough for now?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Marcos
>>>>>>
>>>>>> On Fri, Nov 4, 2016 at 2:05 PM, Marcos Juarez <mjua...@gmail.com> wrote:
>>>>>>
>>>>>>> That's great, thanks Jason.
>>>>>>>
>>>>>>> We'll try and apply the patch in the meantime, and wait for the
>>>>>>> official release of 0.10.1.1.
>>>>>>>
>>>>>>> Please let us know if you need more details about the deadlocks on
>>>>>>> our side.
>>>>>>>
>>>>>>> Thanks again!
>>>>>>>
>>>>>>> Marcos
>>>>>>>
>>>>>>> On Fri, Nov 4, 2016 at 1:02 PM, Jason Gustafson <ja...@confluent.io> wrote:
>>>>>>>
>>>>>>>> Hi Marcos,
>>>>>>>>
>>>>>>>> I think we'll try to get this into 0.10.1.1 (I updated the JIRA).
>>>>>>>> Since we're now seeing users hit this in practice, I'll definitely
>>>>>>>> bump up the priority on a fix. I can't say for sure when the release
>>>>>>>> will be, but we'll merge the fix into the 0.10.1 branch and you can
>>>>>>>> build from there if you need something in a hurry.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Jason
>>>>>>>>
>>>>>>>> On Fri, Nov 4, 2016 at 9:48 AM, Marcos Juarez <mjua...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Jason,
>>>>>>>>>
>>>>>>>>> Thanks for that link. It does appear to be a very similar issue, if
>>>>>>>>> not identical. In our case, the deadlock is reported as being
>>>>>>>>> across 3 threads, one of them being a group_metadata_manager
>>>>>>>>> thread. Otherwise, it looks the same.
>>>>>>>>>
>>>>>>>>> On your questions:
>>>>>>>>>
>>>>>>>>> - We did not see this in prior releases, but we have been ramping
>>>>>>>>> up usage of our Kafka clusters lately, so maybe we didn't have the
>>>>>>>>> volume needed to trigger it before.
>>>>>>>>>
>>>>>>>>> - Across our multiple staging and production clusters, we're seeing
>>>>>>>>> the problem roughly once or twice a day.
>>>>>>>>>
>>>>>>>>> - Our clusters are small at the moment. The two that are
>>>>>>>>> experiencing the issue are 5 and 8 brokers, respectively. The
>>>>>>>>> number of consumers is small, I'd say less than 20 at the moment.
>>>>>>>>> The amount of data being produced is small also, in the tens of
>>>>>>>>> megabytes per second range, but the number of connects/disconnects
>>>>>>>>> is really high, because they're usually short-lived processes. Our
>>>>>>>>> guess at the moment is that this is what's triggering the bug. We
>>>>>>>>> have a separate cluster where we don't have short-lived producers,
>>>>>>>>> and that one has been rock solid.
>>>>>>>>>
>>>>>>>>> We'll look into applying the patch, which could help reduce the
>>>>>>>>> problem. The ticket says it's being targeted for the 0.10.2
>>>>>>>>> release. Any rough estimate of a timeline for that to come out?
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> Marcos
>>>>>>>>>
>>>>>>>>> On Thu, Nov 3, 2016 at 5:19 PM, Jason Gustafson <ja...@confluent.io> wrote:
>>>>>>>>>
>>>>>>>>>> Hey Marcos,
>>>>>>>>>>
>>>>>>>>>> Thanks for the report. Can you check out
>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-3994 and see if it
>>>>>>>>>> matches? At a glance, it looks like the same problem. We tried
>>>>>>>>>> pretty hard to get the fix into the release, but it didn't quite
>>>>>>>>>> make it. A few questions:
>>>>>>>>>>
>>>>>>>>>> 1. Did you not see this in prior releases? As far as we can tell,
>>>>>>>>>> it is possible going back to 0.9.0.0, but there could be a subtle
>>>>>>>>>> difference in 0.10.1.0 which increases its likelihood.
>>>>>>>>>> 2. How often are you hitting this problem? You might try the patch
>>>>>>>>>> that's available if it's occurring frequently and you can't
>>>>>>>>>> downgrade.
>>>>>>>>>> I think the existing patch is incomplete, but it should still
>>>>>>>>>> reduce the likelihood.
>>>>>>>>>> 3. Out of curiosity, what is the size of your cluster, and how
>>>>>>>>>> many consumers do you have in your cluster?
>>>>>>>>>>
>>>>>>>>>> Thanks!
>>>>>>>>>> Jason
>>>>>>>>>>
>>>>>>>>>> On Thu, Nov 3, 2016 at 1:32 PM, Marcos Juarez <mjua...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Just to expand on Lawrence's answer: the increase in file
>>>>>>>>>>> descriptor usage goes from 2-3K under normal conditions to 64K+
>>>>>>>>>>> under deadlock, which it hits within a couple of hours, at which
>>>>>>>>>>> point the broker goes down, because that's our OS-defined limit.
>>>>>>>>>>>
>>>>>>>>>>> If it were only a 33% increase from the new timestamp indexes, we
>>>>>>>>>>> should be going to at most 4K-5K file descriptors in use, not
>>>>>>>>>>> 64K+.
>>>>>>>>>>>
>>>>>>>>>>> Marcos
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 3, 2016 at 1:53 PM, Lawrence Weikum <lwei...@pandora.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We saw this increase when upgrading from 0.9.0.1 to 0.10.0.1.
>>>>>>>>>>>> We're now running on 0.10.1.0, and the FD increase is due to a
>>>>>>>>>>>> deadlock, not functionality or new features.
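[Editor's note: one way to tell the two cases apart in practice — a steady ~33% baseline increase from the new indexes vs. runaway growth toward the OS limit — is to sample a broker's open-FD count over time. The sketch below is an assumption for illustration, not something from the thread, and it is Linux-specific (it reads /proc).]

```python
# Minimal sketch: sample a process's open file descriptor count (Linux).
# Samples this Python process by default; point `pid` at a broker's PID
# to watch for the runaway growth described above. A steady count ~33%
# higher than before the upgrade is expected from the timestamp indexes;
# a count climbing toward the OS limit is the deadlock symptom.
import os
import resource

def open_fd_count(pid=None):
    """Number of open file descriptors for a process (self by default)."""
    pid = pid if pid is not None else os.getpid()
    return len(os.listdir(f"/proc/{pid}/fd"))

soft_limit, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open={open_fd_count()} soft_limit={soft_limit}")
```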
>>>>>>>>>>>>
>>>>>>>>>>>> Lawrence Weikum | Software Engineer | Pandora
>>>>>>>>>>>> 1426 Pearl Street, Suite 100, Boulder CO 80302
>>>>>>>>>>>> m 720.203.1578 | lwei...@pandora.com
>>>>>>>>>>>>
>>>>>>>>>>>> On 11/3/16, 12:42 PM, "Hans Jespersen" <h...@confluent.io> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>     The 0.10.1 broker will use more file descriptors than
>>>>>>>>>>>>     previous releases because of the new timestamp indexes. You
>>>>>>>>>>>>     should expect and plan for ~33% more file descriptors to be
>>>>>>>>>>>>     open.
>>>>>>>>>>>>
>>>>>>>>>>>>     -hans
>>>>>>>>>>>>
>>>>>>>>>>>>     /**
>>>>>>>>>>>>      * Hans Jespersen, Principal Systems Engineer, Confluent Inc.
>>>>>>>>>>>>      * h...@confluent.io (650)924-2670
>>>>>>>>>>>>      */
>>>>>>>>>>>>
>>>>>>>>>>>>     On Thu, Nov 3, 2016 at 10:02 AM, Marcos Juarez <mjua...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We're running into a recurrent deadlock issue in both our
>>>>>>>>>>>>> production and staging clusters, both using the latest 0.10.1
>>>>>>>>>>>>> release. The symptom we noticed was that, on servers where
>>>>>>>>>>>>> Kafka producer connections are short-lived, every other day or
>>>>>>>>>>>>> so we'd see file descriptors being exhausted, until the broker
>>>>>>>>>>>>> is restarted, or the broker runs out of file descriptors and
>>>>>>>>>>>>> goes down.
>>>>>>>>>>>>> None of the clients are on 0.10.1 kafka jars; they're all
>>>>>>>>>>>>> using previous versions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> When diagnosing the issue, we found that when the system is in
>>>>>>>>>>>>> that state, using up file descriptors at a really fast rate,
>>>>>>>>>>>>> the JVM is actually in a deadlock. We did a thread dump from
>>>>>>>>>>>>> both jstack and VisualVM, and attached those to this email.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is the interesting bit from the jstack thread dump:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Found one Java-level deadlock:
>>>>>>>>>>>>> =============================
>>>>>>>>>>>>> "executor-Heartbeat":
>>>>>>>>>>>>>   waiting to lock monitor 0x00000000016c8138 (object
>>>>>>>>>>>>> 0x000000062732a398, a kafka.coordinator.GroupMetadata),
>>>>>>>>>>>>>   which is held by "group-metadata-manager-0"
>>>>>>>>>>>>>
>>>>>>>>>>>>> "group-metadata-manager-0":
>>>>>>>>>>>>>   waiting to lock monitor 0x00000000011ddaa8 (object
>>>>>>>>>>>>> 0x000000063f1b0cc0, a java.util.LinkedList),
>>>>>>>>>>>>>   which is held by "kafka-request-handler-3"
>>>>>>>>>>>>>
>>>>>>>>>>>>> "kafka-request-handler-3":
>>>>>>>>>>>>>   waiting to lock monitor 0x00000000016c8138 (object
>>>>>>>>>>>>> 0x000000062732a398, a kafka.coordinator.GroupMetadata),
>>>>>>>>>>>>>   which is held by "group-metadata-manager-0"
>>>>>>>>>>>>>
>>>>>>>>>>>>> I also noticed the background heartbeat thread (I'm guessing
>>>>>>>>>>>>> the one called "executor-Heartbeat" above) is new for this
>>>>>>>>>>>>> release, under the KAFKA-3888 ticket:
>>>>>>>>>>>>> https://issues.apache.org/jira/browse/KAFKA-3888
>>>>>>>>>>>>>
>>>>>>>>>>>>> We haven't noticed this problem with earlier Kafka broker
>>>>>>>>>>>>> versions, so I'm guessing maybe this new background heartbeat
>>>>>>>>>>>>> thread is what introduced the deadlock problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> That same broker is still in the deadlock scenario; we haven't
>>>>>>>>>>>>> restarted it, so let me know if you'd like more info/logs/stats
>>>>>>>>>>>>> from the system before we restart it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Marcos Juarez