Re: Producer connect timeouts
Yes, this is something we could consider fixing in Kafka itself. Pretty much all of the timeouts can be customized if the OS/network defaults are larger than makes sense for the system. And given the large default values for some of these timeouts, we probably don't want to rely on the defaults.

-Ewen

On Mon, Dec 19, 2016 at 8:23 AM, Luke Steensen wrote:
> Makes sense, thanks Ewen.
>
> Is this something we could consider fixing in Kafka itself? I don't think
> the producer is necessarily doing anything wrong, but the end result is
> certainly very surprising behavior.
> [snip]
Re: Producer connect timeouts
Makes sense, thanks Ewen.

Is this something we could consider fixing in Kafka itself? I don't think the producer is necessarily doing anything wrong, but the end result is certainly very surprising behavior. It would also be nice not to have to coordinate request timeouts, retries, and the max block configuration with system-level configs.

On Sat, Dec 17, 2016 at 6:55 PM, Ewen Cheslack-Postava wrote:
> Without having dug back into the code to check, this sounds right.
> [snip]
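The coordination problem above can be made concrete with a plain producer config fragment covering the settings Luke names (the values are illustrative assumptions, not recommendations). The key point is that none of these client-side settings bounds TCP connection establishment itself, which is why the OS-level timeout still matters:

```java
import java.util.Properties;

public class ProducerTimeoutConfig {
    // Sketch of the producer-side timeout knobs; this Properties object
    // would be passed to a KafkaProducer constructor.
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
        // How long to wait for a response to an in-flight request.
        props.put("request.timeout.ms", "30000");
        // How many times to retry a failed send.
        props.put("retries", "3");
        // How long send()/partitionsFor() may block (e.g. on metadata).
        props.put("max.block.ms", "60000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("request.timeout.ms"));
    }
}
```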
Re: Producer connect timeouts
Without having dug back into the code to check, this sounds right. Connection management just fires off a request to connect, and then subsequent poll() calls handle any successful/failed connections. Timeouts for requests are handled somewhat differently: the connection attempt isn't explicitly tied to the request that triggered it, so when the latter times out, we don't follow up and time out the connection attempt either.

So yes, you currently will have connection attempts bounded only by your underlying TCP connection timeout. This tends to be much more of a problem in public clouds, where the handshake will be silently dropped due to firewall rules.

The metadata.max.age.ms approach is a workable solution, but agreed that it's not great. If possible, reducing the default TCP connection timeout isn't unreasonable either -- the defaults are set for WAN connections (and arguably for the WAN connections of long ago), so much more aggressive timeouts are reasonable for Kafka clusters.

-Ewen

On Fri, Dec 16, 2016 at 1:41 PM, Luke Steensen wrote:
> Hello,
>
> Is it correct that producers do not fail new connection establishment
> when it exceeds the request timeout?
> [snip]
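The "defaults are set for WAN connections" point lines up with the 2-3 minute hang reported in the thread. A rough back-of-the-envelope model (assuming Linux's usual exponential SYN retransmission with a 1 s initial timeout and the default net.ipv4.tcp_syn_retries of 6; real kernels and OSes vary):

```java
public class SynBackoff {
    // Models an initial SYN plus `synRetries` retransmissions, with the
    // retransmission timeout starting at 1 s and doubling each time.
    // This is an assumed simplification of Linux behavior, not a spec.
    public static int totalWaitSeconds(int synRetries) {
        int total = 0;
        int rto = 1; // assumed initial retransmission timeout, seconds
        for (int i = 0; i <= synRetries; i++) {
            total += rto;
            rto *= 2;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalWaitSeconds(6)); // ~127 s, i.e. roughly 2 minutes
    }
}
```

Under the same assumptions, lowering net.ipv4.tcp_syn_retries to 3 gives roughly 15 s, which is one concrete way to apply the "more aggressive timeouts" suggestion above.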
Producer connect timeouts
Hello,

Is it correct that producers do not fail new connection establishment when it exceeds the request timeout?

Running on AWS, we've encountered a problem where certain very low volume producers end up with metadata that's sufficiently stale that they attempt to establish a connection to a broker instance that has already been terminated as part of a maintenance operation. I would expect this to fail and be retried normally, but it appears to hang until the system-level TCP connection timeout is reached (2-3 minutes), with the writes themselves being expired before even a single attempt is made to send them.

We've worked around the issue by setting `metadata.max.age.ms` extremely low, such that these producers are requesting new metadata much faster than our maintenance operations are terminating instances. While this does work, it seems like an unfortunate workaround for some very surprising behavior.

Thanks,
Luke
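The workaround described above amounts to a producer config along these lines (the 5000 ms value is a hypothetical illustration; the real value just has to outpace the maintenance cadence):

```java
import java.util.Properties;

public class LowMetadataAge {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker
        // Force a metadata refresh every 5 s instead of the 300000 ms
        // (5 minute) default, so a terminated broker's address ages out
        // before a low-volume producer next tries to connect to it.
        props.put("metadata.max.age.ms", "5000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build().getProperty("metadata.max.age.ms"));
    }
}
```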