Hi Ismael,

Thanks again for sending that link yesterday! I tried it this AM and the change totally fixed the problem! The manifestation we observed was not increased CPU usage, but rather a MUCH larger memory heap requirement. The fix was simply to change log.message.format.version to the version our clients speak.
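In case it helps anyone who finds this thread, the change amounted to one line in server.properties (a minimal sketch; 0.9.0 matches the version our clients run, so substitute your own). It's a static broker config, so it takes a broker restart to apply:

  # server.properties: keep the on-disk message format at the clients'
  # version so the broker can hand fetches off without copying them into
  # the JVM heap, instead of down-converting each batch in-heap
  log.message.format.version=0.9.0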
Once that change was in place, the following occurred:

1. ISRs went to full replication for each partition
2. Memory heap usage went down by a factor of 6
3. Storm throughput went up by a factor of 5

Our cluster looks great now--thanks again for pointing me to the docs where the config issue was described--much, much appreciated!

--John

On Mon, Jul 10, 2017 at 12:26 PM, Ismael Juma <ism...@juma.me.uk> wrote:

> Hi John,
>
> Yes, down-conversion when consuming messages does increase JVM heap
> usage, as we have to load the data into the JVM heap to convert it. If
> down-conversion is not needed, we are able to send the data without
> copying it to the JVM heap.
>
> Ismael
>
> On Sun, Jul 9, 2017 at 4:23 PM, John Yost <hokiege...@gmail.com> wrote:
>
> > Hi Ismael,
> >
> > Gotcha, will do. Okay, reading the docs you linked, that may explain
> > what we're seeing. When we upgraded to 0.10.0, we did not upgrade the
> > clients from 0.9.0.1, so while the message format is the default--in
> > this case, 0.10.0--the message format expected by the consumers is
> > pre-0.10.0. While I am not seeing increased CPU utilization, it
> > appears that the memory requirements for the brokers have changed
> > with the upgrade, given that I had to increase the broker memory heap
> > size from 6 to 10GB to prevent out-of-memory errors from occurring.
> >
> > Would the message format difference result in consumed and/or
> > produced messages piling up in a buffer and, consequently, increase
> > the broker memory heap size requirement due to the format mismatch?
> > That would be awesome, because it would mean we just need to set
> > log.message.format.version to 0.9.0 until we upgrade the clients.
> >
> > --John
> >
> > On Sun, Jul 9, 2017 at 10:46 AM, Ismael Juma <ism...@juma.me.uk> wrote:
> >
> > > Hi John,
> > >
> > > Please read the upgrade documentation for the relevant versions:
> > >
> > > http://kafka.apache.org/documentation.html#upgrade
> > >
> > > Also, let's try to keep the discussion in one thread. I asked some
> > > questions in the related "0.10.1 memory and garbage collection
> > > issues" thread that you started.
> > >
> > > Ismael
> > >
> > > On Sun, Jul 9, 2017 at 3:30 PM, John Yost <hokiege...@gmail.com>
> > > wrote:
> > >
> > > > Hi Everyone,
> > > >
> > > > Ever since we upgraded from 0.9.0.1 to 0.10.0, our five-node
> > > > Kafka cluster has been unstable. Specifically, whereas a 6GB
> > > > memory heap worked fine before, following the upgrade all five
> > > > brokers crashed with out-of-memory errors within an hour. I
> > > > boosted the memory heap to 10GB, which fixed the OOM problem, but
> > > > now it appears the GC pauses are preventing the cluster from
> > > > maintaining more than one ISR. I realize I could raise the
> > > > replica lag settings to improve the ISR numbers, but that would
> > > > be treating the symptom rather than the root problem.
> > > >
> > > > There appears to be a change in the memory requirements somewhere
> > > > in the Kafka stack, which could be on the producer side as well,
> > > > but I want to rule out any configuration issues on the broker
> > > > side. Are there any 0.9 defaults in particular that anyone is
> > > > aware of that I should change for 0.10.x to resolve the root
> > > > problem(s) behind these observations?
> > > >
> > > > --John
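P.S. For anyone finding this in the archives: once the clients are
upgraded, the upgrade notes (http://kafka.apache.org/documentation.html#upgrade)
describe finishing the transition by bumping the format back up. A rough
sketch of the two stages as the docs lay them out, using our version
numbers (substitute your own):

  # stage 1: brokers run 0.10.0 code and inter-broker protocol, but keep
  # the 0.9.0 on-disk message format so pre-0.10 consumers need no
  # down-conversion
  inter.broker.protocol.version=0.10.0
  log.message.format.version=0.9.0

  # stage 2, once ALL producers and consumers are on 0.10.x clients:
  # raise the format (one more rolling restart) to pick up the 0.10.0
  # message features such as timestamps
  # log.message.format.version=0.10.0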