Re: Producer connect timeouts

2016-12-19 Thread Ewen Cheslack-Postava
Yes, this is something that we could consider fixing in Kafka itself.
Pretty much all timeouts can be customized if the defaults for the
OS/network are larger than makes sense for the system. And given the large
default values for some of these timeouts, we probably don't want to rely
on the defaults.

-Ewen
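
For context, these are the producer-side knobs the thread refers to. Below is a
minimal sketch with illustrative values only (defaults and exact behavior vary by
client version, and at the time of this thread none of these bounded the TCP
connect itself):

import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ProducerTimeoutSettings {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder bootstrap server.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        // Maximum time to wait for a response to an in-flight request
        // before retrying or failing it.
        props.put(ProducerConfig.REQUEST_TIMEOUT_MS_CONFIG, "30000");
        // Maximum time send()/partitionsFor() may block waiting for
        // metadata or buffer space.
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");
        // Retries and the backoff between retry attempts.
        props.put(ProducerConfig.RETRIES_CONFIG, "3");
        props.put(ProducerConfig.RETRY_BACKOFF_MS_CONFIG, "100");
        System.out.println(props);
    }
}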


Re: Producer connect timeouts

2016-12-19 Thread Luke Steensen
Makes sense, thanks Ewen.

Is this something we could consider fixing in Kafka itself? I don't think
the producer is necessarily doing anything wrong, but the end result is
certainly very surprising behavior. It would also be nice not to have to
coordinate request timeouts, retries, and the max block configuration with
system-level configs.


Re: Producer connect timeouts

2016-12-17 Thread Ewen Cheslack-Postava
Without having dug back into the code to check, this sounds right.
Connection management just fires off a request to connect, and then
subsequent poll() calls will handle any successful or failed connections.
Timeouts for requests are handled somewhat differently: the connection attempt
isn't explicitly tied to the request that triggered it, so when the latter
times out, we don't follow up and time out the connection attempt either.

So yes, connection attempts are currently bounded only by the underlying
TCP connection timeout. This tends to be much more of a problem in public
clouds, where the handshake can be silently dropped by firewall rules.

The metadata.max.age.ms workaround is workable, but agreed that it's not
great. If possible, reducing the default TCP connection timeout isn't
unreasonable either -- the defaults are set for WAN connections (and
arguably for the WAN connections of long ago), so much more aggressive
timeouts are reasonable for Kafka clusters.

-Ewen
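
As a rough illustration of the pattern described above -- a hypothetical sketch in
plain Java NIO, not Kafka's actual NetworkClient code -- the connect is fired off
and then only observed through a selector, so if the SYN is silently dropped
nothing fires until the OS-level timeout, unless the application tracks its own
deadline:

import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;

public class ConnectSketch {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        // Initiate the connect; this returns immediately for a non-blocking channel.
        channel.connect(new InetSocketAddress("broker.example.com", 9092)); // placeholder host
        channel.register(selector, SelectionKey.OP_CONNECT);

        // Poll loop: nothing here bounds how long the connect may stay pending.
        // A silently dropped SYN means OP_CONNECT does not fire until the OS
        // gives up (often minutes), unless the application enforces a deadline itself.
        while (selector.select(1000) >= 0) {
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isConnectable() && channel.finishConnect()) {
                    System.out.println("connected");
                    return;
                }
            }
            selector.selectedKeys().clear();
        }
    }
}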


Producer connect timeouts

2016-12-16 Thread Luke Steensen
Hello,

Is it correct that producers do not fail a new connection attempt when it
exceeds the request timeout?

Running on AWS, we've encountered a problem where certain very low volume
producers end up with metadata that's sufficiently stale that they attempt
to establish a connection to a broker instance that has already been
terminated as part of a maintenance operation. I would expect this to fail
and be retried normally, but it appears to hang until the system-level TCP
connection timeout is reached (2-3 minutes), with the writes themselves
being expired before even a single attempt is made to send them.

We've worked around the issue by setting `metadata.max.age.ms` extremely
low, such that these producers are requesting new metadata much faster than
our maintenance operations are terminating instances. While this does work,
it seems like an unfortunate workaround for some very surprising behavior.

Thanks,
Luke
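
For concreteness, a hedged sketch of that workaround: constructing the producer
with a much lower metadata.max.age.ms so stale broker metadata is refreshed
quickly. The 30-second value is illustrative, not a recommendation, and the
trade-off is more frequent metadata requests against the cluster:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class LowMetadataAgeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092"); // placeholder
        // Refresh metadata every 30s (the default is on the order of minutes),
        // so terminated brokers drop out of the producer's view quickly.
        props.put(ProducerConfig.METADATA_MAX_AGE_CONFIG, "30000");
        try (KafkaProducer<String, String> producer =
                 new KafkaProducer<>(props, new StringSerializer(), new StringSerializer())) {
            // producer.send(...) as usual
        }
    }
}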