I'm not sure about the old producer behavior in this same failure scenario,
but creating a new producer instance would resolve the issue since it would
start with the list of bootstrap nodes and, assuming at least one of them
was up, it would be able to fetch up-to-date metadata.
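(Editor's note: the recovery pattern described above can be sketched as follows. This is not the Kafka API; `FakeProducer`, `refreshOrRecreate`, and the node names are stand-ins invented here to model the idea of discarding a client whose known nodes are all dead and rebuilding it from the original bootstrap list.)

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ReBootstrapSketch {
    /** Minimal stand-in for a producer's view of the cluster (hypothetical, not the Kafka API). */
    static class FakeProducer {
        final List<String> knownNodes;
        FakeProducer(List<String> bootstrap) {
            // A fresh producer starts out knowing only the bootstrap nodes.
            knownNodes = new ArrayList<>(bootstrap);
        }
        /** A metadata refresh succeeds only if some node the producer knows about is alive. */
        boolean refreshMetadata(Set<String> liveNodes) {
            for (String n : knownNodes) {
                if (liveNodes.contains(n)) return true;
            }
            return false;
        }
    }

    /** The workaround: if the refresh fails, throw the producer away and rebuild from bootstrap. */
    static FakeProducer refreshOrRecreate(FakeProducer p, List<String> bootstrap, Set<String> live) {
        if (p.refreshMetadata(live)) {
            return p; // current producer can still reach the cluster
        }
        return new FakeProducer(bootstrap); // start over from the bootstrap list
    }

    public static void main(String[] args) {
        List<String> bootstrap = List.of("k1:9092", "k2:9092");
        // Simulate the failure mode in this thread: the producer's metadata
        // has narrowed to a single node (k2), and that node is now down.
        FakeProducer stuck = new FakeProducer(List.of("k2:9092"));
        Set<String> live = Set.of("k1:9092");
        FakeProducer recovered = refreshOrRecreate(stuck, bootstrap, live);
        // The recreated producer can reach k1 via the bootstrap list.
        System.out.println(recovered.refreshMetadata(live)); // prints "true"
    }
}
```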

On Tue, May 5, 2015 at 5:32 PM, Jason Rosenberg <j...@squareup.com> wrote:

> Can you clarify, is this issue here specific to the "new" producer?  With
> the "old" producer, we routinely construct a new producer which makes a
> fresh metadata request (via a VIP connected to all nodes in the cluster).
> Would this approach work with the new producer?
>
> Jason
>
>
> On Tue, May 5, 2015 at 1:12 PM, Rahul Jain <rahul...@gmail.com> wrote:
>
> > Mayuresh,
> > I was testing this in a development environment and manually brought
> > down a node to simulate this. So the dead node never came back up.
> >
> > My colleague and I were able to consistently see this behaviour several
> > times during the testing.
> > On 5 May 2015 20:32, "Mayuresh Gharat" <gharatmayures...@gmail.com>
> > wrote:
> >
> > > I agree that to find the least loaded node the producer should fall
> > > back to the bootstrap nodes if it's not able to connect to any nodes
> > > in the current metadata. That should resolve this.
> > >
> > > Rahul, I suppose the problem went away because the dead node in your
> > > case might have come back up and allowed a metadata update. Can you
> > > confirm this?
> > >
> > > Thanks,
> > >
> > > Mayuresh
> > >
> > > On Tue, May 5, 2015 at 5:10 AM, Rahul Jain <rahul...@gmail.com> wrote:
> > >
> > > > We observed the exact same error. Not very clear about the root cause
> > > > although it appears to be related to leastLoadedNode implementation.
> > > > Interestingly, the problem went away by increasing the value of
> > > > reconnect.backoff.ms to 1000ms.
> > > > On 29 Apr 2015 00:32, "Ewen Cheslack-Postava" <e...@confluent.io>
> > > > wrote:
> > > >
> > > > > Ok, all of that makes sense. The only way to possibly recover from
> > > > > that state is either for K2 to come back up, allowing the metadata
> > > > > refresh to eventually succeed, or to eventually try some other node
> > > > > in the cluster. Reusing the bootstrap nodes is one possibility.
> > > > > Another would be for the client to get more metadata than is
> > > > > required for the topics it needs, in order to ensure it has more
> > > > > nodes to use as options when looking for a node to fetch metadata
> > > > > from. I added your description to KAFKA-1843, although it might
> > > > > also make sense as a separate bug since fixing it could be
> > > > > considered incremental progress towards resolving 1843.
> > > > >
> > > > > On Tue, Apr 28, 2015 at 9:18 AM, Manikumar Reddy <ku...@nmsworks.co.in>
> > > > > wrote:
> > > > >
> > > > > > Hi Ewen,
> > > > > >
> > > > > > Thanks for the response. I agree with you; in some cases we
> > > > > > should use bootstrap servers.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > If you have logs at debug level, are you seeing this message in
> > > > > > > between the connection attempts:
> > > > > > >
> > > > > > > Give up sending metadata request since no node is available
> > > > > >
> > > > > >
> > > > > > Yes, this log appeared a couple of times.
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Also, if you let it continue running, does it recover after the
> > > > > > > metadata.max.age.ms timeout?
> > > > > > >
> > > > > >
> > > > > > It does not reconnect. It continuously tries to connect to the
> > > > > > dead node.
> > > > > >
> > > > > >
> > > > > > -Manikumar
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks,
> > > > > Ewen
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -Regards,
> > > Mayuresh R. Gharat
> > > (862) 250-7125
> > >
> >
>
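(Editor's note: the mitigation Rahul reports above maps onto the new producer's configuration roughly as follows. This is a sketch only: the host names are hypothetical, and 1000 ms is simply the value reported in this thread, not a tuned recommendation.)

```properties
# Nodes used for initial discovery; hypothetical hosts.
bootstrap.servers=kafka1:9092,kafka2:9092
# Back off longer between reconnect attempts to a failed node; Rahul
# reported the symptom disappeared after raising this to 1000 ms.
reconnect.backoff.ms=1000
# Upper bound on metadata staleness; a refresh is forced after this long.
metadata.max.age.ms=300000
```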



-- 
Thanks,
Ewen
