Re: New producer: metadata update problem on 2 Node cluster.

2015-05-07 Thread Rahul Jain
Creating a new consumer instance *does not* solve this problem.

Attaching the producer/consumer code that I used for testing.


Re: New producer: metadata update problem on 2 Node cluster.

2015-05-07 Thread Rahul Jain
Sorry, I meant creating a new producer, not consumer.

Here's the code.

Producer - http://pastebin.com/Kqq1ymCX
Consumer - http://pastebin.com/i2Z8PTYB
Callback - http://pastebin.com/x253z7bG

As you'll notice, I am creating a new producer for each message, so the
metadata should be fetched afresh from the bootstrap nodes every time.

I have a single topic (receive.queue) replicated across 3 nodes, and I add
all 3 nodes to the bootstrap list. On bringing one of the nodes down, some
messages start failing with a metadata update timeout error.

As I mentioned earlier, the problem goes away simply by setting the
reconnect.backoff.ms property to 1000ms.
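
For reference, a minimal sketch of that setup with the new (Java) producer.
The broker addresses are placeholders; receive.queue and the config keys
match what is described above:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PerMessageProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            // All three brokers go into the bootstrap list (addresses are placeholders).
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            // The workaround discussed in this thread: a longer reconnect backoff.
            props.put("reconnect.backoff.ms", "1000");

            // A new producer per message, so metadata is fetched afresh starting from
            // the bootstrap list instead of being reused from a possibly stale node set.
            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            producer.send(new ProducerRecord<>("receive.queue", "test message"));
            producer.close();
        }
    }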





Re: New producer: metadata update problem on 2 Node cluster.

2015-05-07 Thread Ewen Cheslack-Postava
Rahul, the mailing list filters attachments, you'd have to post the code
somewhere else for people to be able to see it.

But I don't think anyone suggested that creating a new consumer would fix
anything. Creating a new producer *and discarding the old one* basically
just makes it start from scratch using the bootstrap nodes, which is why
that would allow recovery from that condition.

But that's just a workaround. The real issue is that the producer only
maintains metadata for the nodes that are replicas for the partitions of
the topics the producer sends data to. In some cases, this is a small set
of servers and can get the producer stuck if a node goes offline and it
doesn't have any other nodes that it can try to communicate with to get
updated metadata (since the topic partitions should have a new leader).
Falling back on the original bootstrap servers is one solution to this
problem. Another would be to maintain metadata for additional servers so
you always have extra bootstrap nodes in your current metadata set, even
if they aren't replicas for any of the topics you're working with.
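
A minimal sketch of the workaround described above, i.e. discarding a stuck
producer and creating a fresh one so it starts again from the bootstrap
list. This assumes the metadata update timeout surfaces as a
TimeoutException from send(), as reported earlier in the thread; broker
addresses and the topic name are placeholders:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.errors.TimeoutException;

    public class RecreateProducerOnTimeout {
        private static Properties config() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            return props;
        }

        public static void main(String[] args) {
            KafkaProducer<String, String> producer = new KafkaProducer<>(config());
            for (int i = 0; i < 100; i++) {
                try {
                    producer.send(new ProducerRecord<>("receive.queue", "message-" + i));
                } catch (TimeoutException e) {
                    // Metadata could not be refreshed: throw away the stuck producer and
                    // start over from the bootstrap list with a brand new instance.
                    producer.close();
                    producer = new KafkaProducer<>(config());
                }
            }
            producer.close();
        }
    }

This only works around the stuck state; it does not address the underlying
metadata update problem.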

-Ewen




Re: New producer: metadata update problem on 2 Node cluster.

2015-05-05 Thread Rahul Jain
We observed the exact same error. The root cause is not entirely clear,
although it appears to be related to the leastLoadedNode implementation.
Interestingly, the problem went away after increasing the value of
reconnect.backoff.ms to 1000 ms.



Re: New producer: metadata update problem on 2 Node cluster.

2015-05-05 Thread Ewen Cheslack-Postava
I'm not sure about the old producer behavior in this same failure scenario,
but creating a new producer instance would resolve the issue since it would
start with the list of bootstrap nodes and, assuming at least one of them
was up, it would be able to fetch up-to-date metadata.

On Tue, May 5, 2015 at 5:32 PM, Jason Rosenberg j...@squareup.com wrote:

 Can you clarify, is this issue here specific to the new producer?  With
 the old producer, we routinely construct a new producer which makes a
 fresh metadata request (via a VIP connected to all nodes in the cluster).
 Would this approach work with the new producer?

 Jason


-- 
Thanks,
Ewen


Re: New producer: metadata update problem on 2 Node cluster.

2015-05-05 Thread Mayuresh Gharat
I agree that, to find the least-loaded node, the producer should fall back
to the bootstrap nodes if it is not able to connect to any of the nodes in
its current metadata. That should resolve this.

Rahul, I suspect the problem went away because the dead node in your case
might have come back up and allowed a metadata update. Can you confirm
this?

Thanks,

Mayuresh





-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125


Re: New producer: metadata update problem on 2 Node cluster.

2015-05-05 Thread Rahul Jain
Mayuresh,
I was testing this in a development environment and manually brought down a
node to simulate this. So the dead node never came back up.

My colleague and I were able to consistently see this behaviour several
times during the testing.



Re: New producer: metadata update problem on 2 Node cluster.

2015-04-28 Thread Manikumar Reddy
Hi Ewen,

Thanks for the response. I agree with you; in some cases we should use the
bootstrap servers.

  If you have logs at debug level, are you seeing this message in between
  the connection attempts:

  Give up sending metadata request since no node is available

Yes, this log appeared a couple of times.

  Also, if you let it continue running, does it recover after the
  metadata.max.age.ms timeout?

No, it does not recover. It keeps trying to connect to the dead node.

-Manikumar


Re: New producer: metadata update problem on 2 Node cluster.

2015-04-28 Thread Ewen Cheslack-Postava
Ok, all of that makes sense. The only way to possibly recover from that
state is either for K2 to come back up allowing the metadata refresh to
eventually succeed or to eventually try some other node in the cluster.
Reusing the bootstrap nodes is one possibility. Another would be for the
client to get more metadata than is required for the topics it needs in
order to ensure it has more nodes to use as options when looking for a node
to fetch metadata from. I added your description to KAFKA-1843, although it
might also make sense as a separate bug since fixing it could be considered
incremental progress towards resolving 1843.





-- 
Thanks,
Ewen


Re: New producer: metadata update problem on 2 Node cluster.

2015-04-27 Thread Manikumar Reddy
Any comments on this issue?
On Apr 24, 2015 8:05 PM, Manikumar Reddy ku...@nmsworks.co.in wrote:

 We are testing the new producer on a 2-node cluster.
 Under some node failure scenarios, the producer is not able
 to update its metadata.

 Steps to reproduce:
 1. Form a 2-node cluster (K1, K2).
 2. Create a topic with a single partition and replication factor 2.
 3. Start producing data (producer metadata: K1, K2).
 4. Kill the leader node (say K1).
 5. K2 becomes the leader (producer metadata: K2).
 6. Bring K1 back up and kill K2 before the metadata.max.age.ms timeout.
 7. K1 becomes the leader (producer metadata still contains only K2).

 After this point, the producer is not able to update its metadata; it
 continuously tries to connect to the dead node (K2).

 This looks like a bug to me. Am I missing anything?
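
For the "start producing data" step above, a minimal producer loop might
look like the sketch below (broker addresses, topic name, and timing are
placeholders; the node kills and restarts are done by hand outside of this
code):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ReproProducer {
        public static void main(String[] args) throws InterruptedException {
            Properties props = new Properties();
            // The two brokers of the test cluster (addresses are placeholders).
            props.put("bootstrap.servers", "K1:9092,K2:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);
            // Keep sending while the brokers are killed and restarted by hand; errors
            // show up in the callback once metadata can no longer be refreshed.
            for (int i = 0; ; i++) {
                producer.send(new ProducerRecord<>("test-topic", "message-" + i),
                        (metadata, exception) -> {
                            if (exception != null) {
                                System.err.println("send failed: " + exception);
                            }
                        });
                Thread.sleep(1000);
            }
        }
    }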



Re: New producer: metadata update problem on 2 Node cluster.

2015-04-27 Thread Ewen Cheslack-Postava
Maybe add this to the description of
https://issues.apache.org/jira/browse/KAFKA-1843 ? I can't find it now, but
I think there was another bug where I described a similar problem -- in
some cases it makes sense to fall back to the list of bootstrap nodes
because you've gotten into a bad state and can't make any progress without
a metadata update but can't connect to any nodes. The leastLoadedNode
method only considers nodes in the current metadata, so in your example K1
is not considered an option after seeing the producer metadata update that
only includes K2. In KAFKA-1501 I also found another obscure edge case
where you can run into this problem if your broker hostnames/ports aren't
consistent across restarts. Yours is obviously much more likely to occur,
and may not even be that uncommon for producers that are only sending data
to one topic.

If you have logs at debug level, are you seeing this message in between the
connection attempts:

Give up sending metadata request since no node is available

Also, if you let it continue running, does it recover after the
metadata.max.age.ms timeout? If so, I think that would definitely confirm
the issue and might suggest a fix -- preserve the bootstrap metadata and
fall back to choosing a node from it when leastLoadedNode would otherwise
return null.
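
To illustrate the suggested fallback, here is a sketch of the idea only,
not the actual client code; the class and method names are made up. The
node used for metadata fetches would come from the preserved bootstrap
list whenever nothing in the current metadata is reachable:

    import java.util.List;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical sketch of the fallback idea; not the real NetworkClient code.
    class MetadataNodeSelector {
        private final List<String> bootstrapNodes; // preserved from the original config
        private final List<String> metadataNodes;  // nodes known from the current metadata

        MetadataNodeSelector(List<String> bootstrapNodes, List<String> metadataNodes) {
            this.bootstrapNodes = bootstrapNodes;
            this.metadataNodes = metadataNodes;
        }

        /** Pick a node for the next metadata request, falling back to bootstrap nodes. */
        String nodeForMetadataRequest() {
            for (String node : metadataNodes) {
                if (isConnectable(node)) {
                    return node; // the real client would prefer the least-loaded such node
                }
            }
            // Nothing in the current metadata is reachable (e.g. the only known broker
            // is down), so fall back to the original bootstrap list instead of giving
            // up with "no node is available".
            return bootstrapNodes.get(ThreadLocalRandom.current().nextInt(bootstrapNodes.size()));
        }

        private boolean isConnectable(String node) {
            return false; // placeholder for real connection-state tracking
        }
    }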

-Ewen





-- 
Thanks,
Ewen