We ran into an incident a while back where one of our broker machines
abruptly went down (AWS is fun). While the leadership transitions and
so forth seemed to work correctly with the remaining brokers, our
producers hung shortly thereafter. I should point out that we are using
the old Scala producer in async mode. What happened was that the
producer's queue filled up while the SyncProducer on the other end was
blocked in a write() call, waiting for ACKs that would never come. My
understanding of blocking IO on the JVM is that this call will block
until the OS gives up on the TCP connection, which could take as long
as 30 minutes.

As a remedy, we're first going to set queue.enqueue.timeout.ms to some
positive value, as we're willing to lose some of these particular
messages to avoid blocking user requests. But this won't actually make
the producer recover more quickly. Is lowering the OS-level TCP
keepalive time the right thing here? Also, can someone comment on
whether this behavior would also happen with the new producer? We want
to get there, but it hasn't been a priority.
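
For reference, here is a minimal sketch of the change described above,
written as plain java.util.Properties so it runs without Kafka on the
classpath. The class name, broker addresses, and the 500 ms value are
placeholders I made up for illustration, not what we actually run; the
property keys are the old producer's documented config names.

```java
import java.util.Properties;

public class AsyncProducerConfigSketch {

    // Builds properties intended for the old Scala producer's ProducerConfig.
    // brokerList and enqueueTimeoutMs are illustrative inputs (assumptions).
    public static Properties build(String brokerList, long enqueueTimeoutMs) {
        Properties props = new Properties();
        props.setProperty("metadata.broker.list", brokerList);
        // Async mode: send() puts messages on an in-memory queue.
        props.setProperty("producer.type", "async");
        // -1 blocks indefinitely when the queue is full, 0 never blocks,
        // and a positive value waits that many ms before giving up on the
        // message.
        props.setProperty("queue.enqueue.timeout.ms",
                Long.toString(enqueueTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical broker addresses and timeout, for illustration only.
        Properties props = build("broker1:9092,broker2:9092", 500L);
        props.list(System.out);
    }
}
```

As noted, this only bounds how long callers wait on a full queue. It
does nothing to unblock the sender thread stuck in write().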

--


    Tommy Becker

    Senior Software Engineer

    O +1 919.460.4747

    tivo.com

