subject:"\[jira\] \[Commented\] \(KAFKA\-1642\) \[Java New Producer Kafka Trunk\] CPU Usage Spike to 100% when network connection is lost"

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2015-01-05 Thread Ewen Cheslack-Postava (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14265602#comment-14265602
 ] 

Ewen Cheslack-Postava commented on KAFKA-1642:
--

Updated reviewboard https://reviews.apache.org/r/28582/diff/
 against branch origin/trunk

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch, KAFKA-1642_2015-01-05_18:56:55.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-29 Thread Neha Narkhede (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14260508#comment-14260508
 ] 

Neha Narkhede commented on KAFKA-1642:
--

Thanks for following through on this, [~ewencp]! [~junrao], since you probably 
have most context on this, would you mind reviewing the follow up patch 
(https://reviews.apache.org/r/28582/diff/) so we can close this out?

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-23 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14257686#comment-14257686
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

[~ewencp],

Patch indeed solve the high CPU Problem reported by this bug.  I have tested 
all brokers down, one broker down and two broker down:

Here are some interesting Observations from YourKit:

0)  Overall, patch has also brought down  overall consumption in Normal Healthy 
or Happy case where every thing is up and running.  In old code (without 
patch), I use to see about 10% of overall CPU used by process by io threads (4 
in my case), it has reduce to 5% or less now with path.   

1)  When two brokers are down, then occasionally I see IO thread blocked. ( 
I did not see this when one brokers is down) 

{code}
kafka-producer-network-thread | rawlog [BLOCKED] [DAEMON]
org.apache.kafka.clients.producer.internals.Metadata.fetch() Metadata.java:70
java.lang.Thread.run() Thread.java:744
{code}

2)  record-error-rate metric remain zero despite following firewall rule.  
In my opinion, it should have called  
org.apache.kafka.clients.producer.Callback  but I did not see that happening 
either in either one or two brokers down.  Should I file another issue for this 
? Please confirm.

{code}
00100 reject tcp from me to b1.ip dst-port 9092
00200 reject tcp from me to b2.ip dst-port 9092
{code}

{code}
class LoggingCallBaHandler implements Callback {

/**
 * A callback method the user can implement to provide 
asynchronous
 * handling of request completion. This method will be called 
when the
 * record sent to the server has been acknowledged. Exactly one 
of the
 * arguments will be non-null.
 * 
 * @param metadata
 *The metadata for the record that was sent (i.e. 
the
 *partition and offset). Null if an error occurred.
 * @param exception
 *The exception thrown during processing of this 
record.
 *Null if no error occurred.
 */
@Override
public void onCompletion(RecordMetadata metadata, Exception 
exception) {
if(exception != null){
exception.printStackTrace();
}
}
}
{code}

I do not see any exception at all on consolenot sure why ?

3)  Application does NOT gracefully shutdown when there one or more brokers 
are down. (io Thread never exits this is know issue ) 

{code}
SIGTERM handler daemon prio=5 tid=0x7f8bd79e4000 nid=0x17907 waiting for 
monitor entry [0x00011e906000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Shutdown.exit(Shutdown.java:212)
- waiting to lock 0x00070008f7c0 (a java.lang.Class for 
java.lang.Shutdown)
at java.lang.Terminator$1.handle(Terminator.java:52)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:744)

SIGTERM handler daemon prio=5 tid=0x7f8bd5159000 nid=0x1cb0b waiting for 
monitor entry [0x00011e803000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Shutdown.exit(Shutdown.java:212)
- waiting to lock 0x00070008f7c0 (a java.lang.Class for 
java.lang.Shutdown)
at java.lang.Terminator$1.handle(Terminator.java:52)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:744)

SIGTERM handler daemon prio=5 tid=0x7f8bdd147800 nid=0x15d0b waiting for 
monitor entry [0x00011e30a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Shutdown.exit(Shutdown.java:212)
- waiting to lock 0x00070008f7c0 (a java.lang.Class for 
java.lang.Shutdown)
at java.lang.Terminator$1.handle(Terminator.java:52)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:744)

SIGTERM handler daemon prio=5 tid=0x7f8bdf82 nid=0x770b waiting for 
monitor entry [0x00011e207000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Shutdown.exit(Shutdown.java:212)
- waiting to lock 0x00070008f7c0 (a java.lang.Class for 
java.lang.Shutdown)
at java.lang.Terminator$1.handle(Terminator.java:52)
at sun.misc.Signal$1.run(Signal.java:212)
at java.lang.Thread.run(Thread.java:744)

SIGTERM handler daemon prio=5 tid=0x7f8bdc393800 nid=0x1c30f waiting for 
monitor entry [0x00011e104000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Shutdown.exit(Shutdown.java:212)
- waiting to lock 0x00070008f7c0 (a java.lang.Class for 
java.lang.Shutdown)
at

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-23 Thread Ewen Cheslack-Postava (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14257788#comment-14257788
 ] 

Ewen Cheslack-Postava commented on KAFKA-1642:
--

1) Since this method just returns data, it probably just happens to have had to 
wait for the lock. Not a big deal, we would expect to see that occasionally.

2) Right now this is expected, as explained in KAFKA-1788. I saw there's patch 
for that one now, might help if you tested that out and see if gives the 
behavior you were expecting.

3) As you said, known issue so we can leave that for other JIRAs to resolve.

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-23 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14257805#comment-14257805
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

[~ewencp],

Thanks for patch.  You may close this issue.   The only thing,  I have not 
tested the rare case where a single broker is out of File Descriptor and under 
heavy load on producer will request more connections to same broker.  According 
to code, it will mark the Node State to disconnect and I am not sure if data 
will be sent via already live connection.  

Thanks for all your help !

Thanks,

Bhavesh  

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-08 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14239063#comment-14239063
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

[~stevenz3wu],

0.8.2 is very well tested and worked well under heavy load.  This bug is rare 
only happen when broker or network has issue.  We have been producing about 7 
to 10 TB per day using this new producer, so 0.8.2 is very safe to use in 
production.  It has survived  pick traffic of the year on large e-commerce 
site.  So I am fairly confident that  New Java API is indeed does true 
round-robin and much faster than Scala Based API.

[~ewencp],  I will verify the patch by end of this Friday, but do let me know 
your understanding based on my last comment.

Thanks,

Bhavesh

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-04 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14234297#comment-14234297
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

[~ewencp],

1) I will posted toward KAFKA-1788 and perhaps link the issue.
2) True , some sort of measure would be great 5,10...25 50, 95 and 99 
percentile would be great of execution time.  The point is just measure the 
duration report the rate of execution. 
3) Agree with what you are saying and I have observed same behavior.  But only 
recommendation is to add some intelligence to *timeouts* to detect if for long 
period and consecutive timeout is zero then there is problem. (Little more 
defensive) 
4) Again I agree with you point, but based in your previous comments you had 
mentioned that you may consider having back-off logic further up the chain. So 
I was just checking run() is best place to do that check.  Again, may be add 
intelligence here if you get consecutive “Exception” then likelihood of high 
CPU is high.  
 
5) Ok.  I agree what you are saying is data needs to be de-queue so more data 
can be en-queue even in event of network lost.  Is my understanding correct ?

6) All I am saying is network firewall rule (such as only 2 TCP connections per 
source host) or Brokers running out of File Descriptor so new connection to 
broker is not established but Client have live and active TCP connection to 
same broker.  But based on what I see in the method * initiateConnect* will 
mark the entire Broker or Node status as disconnected.  Is this expected 
behavior?  So question is: will client continue to send data ?

Thank you very much for entertaining my questions so far and I will test out 
the patch next week.

Thanks,

Bhavesh 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-02 Thread Steven Zhen Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231830#comment-14231830
 ] 

Steven Zhen Wu commented on KAFKA-1642:
---

this sounds like a serious blocker issue. once this fix is confirmed and 
merged, is there any chance for a patch release of 0.8.2.1? 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-02 Thread Ewen Cheslack-Postava (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231837#comment-14231837
 ] 

Ewen Cheslack-Postava commented on KAFKA-1642:
--

[~stevenz3wu] this is already a blocker on 0.8.2. 0.8.2 hasn't been released 
yet, only a beta.

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-02 Thread Steven Zhen Wu (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14231867#comment-14231867
 ] 

Steven Zhen Wu commented on KAFKA-1642:
---

[~ewencp] forgot 0.8.2 is still beta, because we are already using 0.8.2-beta 
now. thanks for the clarification.

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-02 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232061#comment-14232061
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

Hi  [~ewencp],

I will not have time to validate this patch till next week.  

Here is my comments:

1) You still have not address the Producer.close() method issue that in event 
of network connection lost or other events happens, IO thread will not be 
killed and close method hangs. In patch that I have provided, I had timeout for 
join method and interrupted IO thread.  I think we need similar for this.
2) Also, can we please add JMX monitoring for IO tread to know how quick it is 
running.  It will great to add this and run() method will report duration to 
metric.
{code}
try{
ThreadMXBean bean = ManagementFactory.getThreadMXBean( );
if(bean.isThreadCpuTimeSupported()  
bean.isThreadCpuTimeEnabled()){
this.ioTheadCPUTime = metrics.sensor(iothread-cpu);
this.ioTheadCPUTime.add(iothread-cpu-ms, The Rate Of CPU 
Cycle used by iothead in NANOSECONDS, new Rate(TimeUnit.NANOSECONDS) {
public double measure(MetricConfig config, long now) {
return (now - metadata.lastUpdate()) / 1000.0;
}
}); 
}
}catch(Throwable th){
log.warn(Not able to set the CPU time... etc);
}
{code}

3)  Please check the timeout final value in *pollTimeout* if it is zero for 
constantly then we need to slow IO thread down.
4)  Defensive check in for back off  in run() method when IO thread is 
aggressive:  

5)  When all nodes are disconnected, do you still want to spin the IO Thread ?

6)  When you have a firewall rule that says you can only have 2 concurrent TCP 
connections from Client to Brokers and client still have live TCP connection 
to same not (Broker), but new TCP connection is rejected. Node State will be 
marked as Disconnected in initiateConnect ?  Are you handling that gracefully  ?

By the way, thank you very much for quick reply and with new patch.  I 
appreciate your help.

Thanks,

Bhavesh 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-02 Thread Ewen Cheslack-Postava (JIRA)

[
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232407#comment-14232407
]

Ewen Cheslack-Postava commented on KAFKA-1642:
--

[~Bmis13], responses to each item:

1. I'm specifically trying to address the CPU usage here. I realize from your
perspective they are closely related since they're both can be triggered by a
loss of network connectivity, but internally they're really separate issues --
the CPU usage has to do with incorrect timeouts and the join() issues is due to
the lack of timeouts on produce operations. That's why I pointed you toward
KAFKA-1788. If a timeout is added for data in the producer, that would resolve
the close issue as well since any data waiting in the producer would eventually
timeout and the IO thread could exit. I think that's the cleanest solution
since it solves both problems with a single setting (the amount of time your
willing to wait before discarding data). If you think a separate timeout
specifically for Producer.close() is worthwhile I'd suggest filing a separate
JIRA for that.

2. I think the measure() implementation you have is incorrect, it looks like
the amount of time since the last metadata update. In any case, I think you'd
want a sensor there so it would take care of computing the rate for you. Is the
use case for this just detecting the CPU spike? I suppose providing the average
iteration rate can at least tell you the source of CPU usage compared to just
monitoring CPU at the system level. But (as described below) it's not
necessarily bad for the rate to be quite high. And an average also doesn't tell
you much except maybe in the extreme case that we're fixing with this patch.
Before just adding another metric, maybe we should think through exactly what
you're trying to measure/monitor and how best to reveal that information?

3. I probably should have given a better explanation earlier for why that
approach is problematic. In a lot of cases, you *want* run() to use small
timeouts. If you're pumping a lot of data through, saturating your network
connection, then you're likely to going to need to poll very fast and end up
with consistently small timeouts (e.g. if you produce to many topics and
consistently have data flowing through them, you'll end up using the linger
period and your timeout will generally be a fraction of that period). In other
words, if we've actually given the producer a lot of work to do, then we should
expect that it will use short timeouts and eat up CPU. You found that you could
work around your particular issue by backing off, but it breaks another very
important use case -- high throughput to the producer. I bet if you ran
performance tests with it enabled you would find diminished performance.
Fundamentally the bug wasn't that we were sometimes computing small timeouts
consistently, it was that we were computing *incorrect* timeouts. If we get the
timeouts computed correctly, then we'll properly support both cases.

4. See explanation in 3. The rate of this loop is controlled by correct
computation of timeouts. I think the only case this code could be an issue is
if we compute correct timeouts, but then consistently see exceptions that are
only caught by this block. That shouldn't be happening consistenly, and if you
find an example where it is then we should catch that exception deeper in the
call stack and handle it more gracefully.

5. The thread still has to be available so that when send() is called again it
can initiate a connection. What should happen now (after my patch) if there is
no work to do is that we'll still run the loop, but the timeout will be very
large so it basically won't be using any CPU. If we get a request to send data,
a wakeup() call wakes it back up before the timeout so it can start sending
data immediately.

6. It should be handled gracefully, and should have been fixed by the original
patch (although maybe the fix to the ordering in initiateConnect in the second
patch was also necessary). The logic in Sender.run() adjusts timeouts when it
finds nodes that aren't ready to send data (i.e. are disconnected) and makes
sure we backoff when connection attempts fail. This is a fixed backoff
(reconnect.backoff.ms, default 10ms), but is enough to avoid high CPU usage.

[Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network
connection is lost
---

Key: KAFKA-1642
URL: https://issues.apache.org/jira/browse/KAFKA-1642
Project: Kafka
Issue Type: Bug
Components: producer
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
Fix For: 0.8.2

Attachments:

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-01 Thread Ewen Cheslack-Postava (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230709#comment-14230709
 ] 

Ewen Cheslack-Postava commented on KAFKA-1642:
--

Created reviewboard https://reviews.apache.org/r/28582/diff/
 against branch origin/trunk

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-12-01 Thread Ewen Cheslack-Postava (JIRA)

[
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14230715#comment-14230715
]

Ewen Cheslack-Postava commented on KAFKA-1642:
--

Attached a new patch that fixes all the timeout issues I'm aware of. Here's how
it addresses each of the situations I listed earlier:

1. lastNoNodeAvailableMs is updated, which forces metadata timeout for each
poll to be use a backoff period

2. Added another backoff value based on metadataFetchInProgress. Since the
request actually made it out, this can be arbitrarily large -- we just need to
see some sort of response or failure for the request.

3. Requires some response to arrive to clear out space, so we can wait
arbitrarily long. Updating lastNoNodeAvailableMs works even though it may wake
up sooner than necessary. But making all cases that didn't send the data use a
single approach keeps the code simpler.

4a. This can happen if, e.g., the network interface has been taken down
entirely. After fixing the ordering of marking the node as connecting and
issuing the request, this cleans up after that error cleanly.
lastNoNodeAvailableMs is updated since there *weren't* any nodes available.
This triggers a backoff period where the connection won't be retried.

4b. This can be handled in the same way - we set lastNoNodeAvailableMs whether
or not we immediately saw an error from the connection request. This causes it
to sleep while waiting for the connection request. This may wake up before we
get connected. However, if the node is still in the connecting state, it'll be
ignored during the next round and we'll either start trying to connect to
another node or we'll end up in state 1 with no nodes available. Either way, we
still only wake up periodically based on the this timeout.

5. Looking more carefully at how leastLoadedNode works, this case isn't
actually possible.

One additional note -- apparently you can't use Long.MAX_VALUE as a timeout, it
throws an exception. That's why Integer.MAX_VALUE is there instead. We could
also detect the large value and convert it to a negative value instead, which
the underlying API treats as having no timeout.

[~Bmis13] can you test this out for the failure modes you found?

[Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network
connection is lost
---

Attachments:
0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch,
KAFKA-1642.patch, KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch,
KAFKA-1642_2014-10-23_16:19:41.patch

I see my CPU spike to 100% when network connection is lost for while. It
seems network IO thread are very busy logging following error message. Is
this expected behavior ?
2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR
org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka
producer I/O thread:
java.lang.IllegalStateException: No entry found for node -2
at
org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
at
org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
at
org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
at
org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
at java.lang.Thread.run(Thread.java:744)
Thanks,
Bhavesh

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-30 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14229233#comment-14229233
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

I just discovered yesterday that 0.8.1.1 release does not have new producer 
code base jar officially released jar (kafka-clients) although code is there in 
0.8.1.1 branch.   That created confusion about porting to  0.8.1.1.
  
Thanks,

Bhavesh 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-27 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14228040#comment-14228040
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

[~soumen.sarkar],

Time out is one thing, but also IO Thread needs to be safe guarded to see how 
aggressive it is based on network and data to be send.  So it does not consume 
so much CPU cycle.

Thanks,
Bhavesh 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.1.1, 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-26 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14226751#comment-14226751
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

[~ewencp],

Even setting long following parameter, states of system does get impacted does 
not matter what reconnect.backoff.ms and retry.backoff.ms is set to.  Once Node 
state is removed, the time out is set to 0.  Please see the following logs.  

# 15 minutes
reconnect.backoff.ms=90
retry.backoff.ms=90

{code}
2014-11-26 11:01:27.898 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:02:27.903 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:03:27.903 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:04:27.903 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:05:27.904 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:06:27.905 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:07:27.906 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:08:27.908 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:09:27.908 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:10:27.909 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:11:27.909 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:12:27.910 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:13:27.911 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:14:27.912 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:15:27.914 Kafka Drop message topic=.rawlog
org.apache.kafka.common.errors.TimeoutException: Failed to update metadata 
after 6 ms.
 2014-11-26 11:00:27.613 [kafka-producer-network-thread | heartbeat] ERROR 
org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
producer I/O thread: 
 2014-11-26 11:00:27.613 [kafka-producer-network-thread | rawlog] ERROR 
org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
producer I/O thread: 
java.lang.IllegalStateException: No entry found for node -1
at 
org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:131)
at 
org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:120)
at 
org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:407)
at 
org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:393)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:187)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
at java.lang.Thread.run(Thread.java:744)
java.lang.IllegalStateException: No entry found for node -3
at 
org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:131)
at 
org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:120)
at 
org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:407)
at 
org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:393)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:187)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:184)
at 
org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
at java.lang.Thread.run(Thread.java:744)
 2014-11-26 11:00:27.613 [kafka-producer-network-thread | heartbeat] ERROR 
org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
producer I/O thread: 
 2014-11-26 11:00:27.613 [kafka-producer-network-thread | error] ERROR 
org.apache.kafka.clients.producer.internals.Sender - Uncaught

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-26 Thread Soumen Sarkar (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14227286#comment-14227286
 ] 

Soumen Sarkar commented on KAFKA-1642:
--

Is it reasonable to expect that timeout should have a lower bound (say *100 
ms*) instead of being 0?

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.1.1, 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-24 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223161#comment-14223161
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

[~ewencp],

The way to reproduce this is to simulate network instability by turning on and 
off network service (or turn on/off physical cable).   The connect and see if 
recover and disconnect and connect again etc.. you will see the behavior again 
and again. 

The issue is also with connection state management :

{code}
private void initiateConnect(Node node, long now) {
try {
log.debug(Initiating connection to node {} at {}:{}., node.id(), 
node.host(), node.port());
// TODO FIX java.lang.IllegalStateException: No entry found for 
node -3 (We need put before remove it..)..
this.connectionStates.connecting(node.id(), now);  (This line has 
problem because it will loose previous last attempt made get above exception 
and it will try to connect to that node for ever and ever with exception )
selector.connect(node.id(), new InetSocketAddress(node.host(), 
node.port()), this.socketSendBuffer, this.socketReceiveBuffer);
} catch (IOException e) {
/* attempt failed, we'll try again after the backoff */
connectionStates.disconnectedWhenConnectting(node.id());
/* maybe the problem is our metadata, update it */
metadata.requestUpdate();
log.debug(Error connecting to node {} at {}:{}:, node.id(), 
node.host(), node.port(), e);
}
}
{code]

In my opinion, regardless of what node status is in run() method needs to be 
safe-guarded from still CPU Cycle when there is no state for Node.  (Hence I 
have added exponential sleep as temp solution to not to still CPU cycle , I 
think must protect it some how and check the execution time...)

Please let me know if you need more info  and i will be more than happy to 
reproduce bug and we can have conference call, and I can show you the problem.

Thanks,
Bhavesh 



 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-24 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223532#comment-14223532
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

Also Regarding KafkaProder.close() method hangs for ever because of following 
loop, and 

{code}
Sender.java

 // okay we stopped accepting requests but there may still be
// requests in the accumulator or waiting for acknowledgment,
// wait until these are completed.
while (this.accumulator.hasUnsent() || this.client.inFlightRequestCount()  0) {
try {
run(time.milliseconds());
} catch (Exception e) {
log.error(Uncaught error in kafka producer I/O thread: , e);
}
}

KafkaProducer.java

 /**
* Close this producer. This method blocks until all in-flight requests complete.
*/
@Override
public void close() {
log.trace(Closing the Kafka producer.);
this.sender.initiateClose();
try {
this.ioThread.join();  // THIS IS BLOCKED since ioThread does not give up.
} catch (InterruptedException e) {
throw new KafkaException(e);
}
this.metrics.close();
log.debug(The Kafka producer has closed.);
}
{code}

The issue describe in KAFKA-1788  is likelihood, but if you look the close call 
stack then calling thread that initiated the close() will hang till io thread 
dies (which it never dies when data is there and network is down).  

Thanks,

Bhavesh


 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-24 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223571#comment-14223571
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

Here is exact steps how to reproducer it bug: (Must have demon program 
continuously running).

1)  Start with happy Situation where all borkers are up everything is running 
fine.  And verify all top -pid JAVA_PID and your kit (kafka network threads  
are taking less than 4% CPU).
2)  Shutdown network (turn off network or pull the eth0 cable)  wait for while 
and you will see that CPU spike to 325% under top  (if you have 4 producer) and 
verify your kit is showing 25% CPU consumption for for each Kafka io thread.
3) Connect back the network ( Spike will still be there but CPU after while 
come down to 100% or so ) and remain connected for while.  
4) again simulate network failure (to simulate network instability) repeat 
steps again 1 to 4 but wait for 10 or so minutes in between and you will see 
the trends of CPU spike along with above exception. 
java.lang.IllegalStateException: No entry found for node -2

Also, I see that Kafka is logging excessively when network is down (your kit 
shows it is taking more CPU Cycle  as compare  to normal)

Thanks,
Bhavesh 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-24 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223626#comment-14223626
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

Also, there is issue in my last patch.  I did not update the 
lastConnectAttemptMs...in connecting.

{code}
 /**
 * Enter the connecting state for the given node.
 * @param node The id of the node we are connecting to
 * @param now The current time.
 */
public void connecting(int node, long now) {
NodeConnectionState nodeConn = nodeState.get(node); 
if(nodeConn == null){
nodeState.put(node, new 
NodeConnectionState(ConnectionState.CONNECTING, now));
}else{
nodeConn.state = ConnectionState.CONNECTING;
nodeConn.lastConnectAttemptMs = now;  (This will capture and 
update last connection attempt) 

}
}
{code}

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-24 Thread Ewen Cheslack-Postava (JIRA)

[
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223755#comment-14223755
]

Ewen Cheslack-Postava commented on KAFKA-1642:
--

Ok, so as I suspected, you need to wait awhile before the issue shows up. It
looks to me like this is due to a metadata refresh. This causes metadataTimeout
in Client.poll() to be 0, but then maybeUpdateMetadata is unable to make any
progress since it can't connect to any nodes. The previous patch fixed issues
that caused the timeout parameter to that method to be 0, so this is a similar
issue. However, under normal testing it won't always show up immediately -- you
need to wait until the next metadata refresh, which is currently every 5
minutes.

I need to think more about the details of the fix. That timeout shouldn't
consistently be 0 if we're just trying to refresh metadata, but we need to make
sure we select an appropriate timeout for each case. Looking through
maybeUpdateMetadata there are a few different possibilities:

1. leastLoadedNode returns null, leaving no nodes available and we don't even
try to refresh
2. The selected node is connected and we can send more data - we mark
metadataFetchInProgress to avoid resending requests, but should probably also
use that to determine the timeout on poll()
3. The selected node is connected but we can't send more data yet
4. The selected node is not connected, but we are allowed to try to initiate a
connection based on the reconnection backoff.
4a. Trying to initiate the connection may return an immediate error
4b. Or we'll need to wait for the connection event.
5. The selected node is not connected and we aren't allowed to initiate a new
connection yet.

Given that all these conditions are based on the code in maybeUpdateMetadata
(and initiateConnect, which it calls), it probably makes sense to have that
code return an appropriate timeout to be used in poll(). But we need to make
sure the selected values are also combined correctly with the timeout passed
into poll() and that any wakeups before that time also subsequently produce
correct values.

The logic in the Sender.run() and NetworkClient.poll() are complex and need to
handle a lot of different cases, but it should be possible to fix this problem
only by adjusting that code without adding retries/backoff further up the
stack. The core of this problem is just that that loop is selecting too small a
timeout.

[Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network
connection is lost
---

Attachments:
0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch,
KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch,
KAFKA-1642_2014-10-23_16:19:41.patch

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-24 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14223779#comment-14223779
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

[~ewencp],

Thanks for looking into this really appreciate your response. 

Also, do you think rapid connect and disconnect is also due to incorrect Node 
state management ?  connecting method and initiateConnection also ?

Thanks,

Bhavesh 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.1.1, 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-24 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224039#comment-14224039
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

Here are some more cases to reproduce this simulating network connection issue 
with one of brokers only and still problem persist:

Case 1:  brokers connection is down (note according to ZK leader for partition 
still with b1 ) 
Have tree brokers: b1, b2, b3
1)  Start your daemon program and keep sending data to all the brokers and 
continue sending some data 
2)  Observed that you have data  netstat -a | grep b1|b2|b3   (keep pumping 
data for 5 minutes and observed normal behavior using top -pid or top -p 
java_pid )
3) Simulate a network connection or problem establishing new TCP connection via 
following as java program still continues to pump data aggressively (please 
note TCP connection to B1 still active and connected)
a)  sudo vi /etc/hosts 2) add entry b1 127.0.0.1 
b) /etc/init.d/network restart  after while (5 to 7 minutes you will see the 
issue but keep pumping data, and also repeat this for b2 it will be more CPU 
consumption) 
 
4) Under a heavy dumping data, now producer will try to establish new TCP 
connection to B1 and it will get connection refused (Note that CPU spikes up 
again and remain in state) just because could not establish.

Case 2) Simulate Firewall rule such as you are only allowed (4 TCP connection 
to each brokers) 

Do step 1,2 and 3 above.
4) use Iptable rule to reject 
To start an enforcing fire wall:
iptables -A OUTPUT -p tcp -m tcp -d b1 --dport 9092 -j REJECT
5) Still pump data will while iptable rejects ( you will see CPU spike to to 
200% more depending on # of producer)
To recover :
iptables -D OUTPUT -p tcp -m tcp -d b1 --dport 9092 -j REJECT


 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.1.1, 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-24 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224041#comment-14224041
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

[~ewencp],

I hope above steps will give you comprehensive steps to reproduce problems with 
run() method.  It would be really great if we can make the client more 
resilient and  robust so network and brokers instability does not cause CPU 
spikes and degrade application performance. Hence, I would strongly at least 
detect the time run(time) is taking and do based on some configuration, we can 
do CPU Throttling just to be more defensive or at lest detect that io thread is 
taking CPU cycle.

By the way the experimental patch still works for steps describe above. 


Thanks,

Bhavesh  

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.1.1, 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-24 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14224046#comment-14224046
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

Also, Are you going to port back the back to 0.8.1.1 version as well ?  Please 
let me know also.

Thanks,
Bhavesh 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.1.1, 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
Priority: Blocker
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-23 Thread Ewen Cheslack-Postava (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14222684#comment-14222684
 ] 

Ewen Cheslack-Postava commented on KAFKA-1642:
--

[~Bmis13] I've only taken a quick look at the patch, but Sender.run() is 
intentionally a tight loop. The body of the loop calls Sender.run(time), which 
in turn calls Client.poll() with a timeout that *should* keep it from actually 
being a tight loop. The previous patch fixed some issues in those other methods 
that were causing the timeout to incorrectly be 0, leading to the high CPU 
usage. If you're still seeing this problem, the right fix will almost 
definitely involve tracking down how a 0 (or very small) timeout is 
consistently being computed.

The way I verified the previous patch was simple -- I ran the producer against 
a local cluster and then just disabled the network connection. Can you describe 
how you produce the error now? Things like whether there are already active 
connections to the brokers, if it's time sensitive (e.g. takes a certain amount 
of time to start using CPU), exactly how you simulate the network failure, and 
whether the issue is consistent or only happens intermittently would all be 
helpful details to know.

The issue with it not shutting down is probably because the producer doesn't 
timeout messages when the leader it needs to send them to isn't available, but 
it waits to send any outstanding messages before shutting down. KAFKA-1788 is 
probably really the same issue since it's also caused by messages that never 
get sent and don't time out, although since that issue is specifically about 
the RecordAccumulator/buffer pool accounting, a fix for that issue may or may 
not fix the shutdown issue you're describing here.


 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
 Fix For: 0.8.2

 Attachments: 
 0001-Initial-CPU-Hish-Usage-by-Kafka-FIX-and-Also-fix-CLO.patch, 
 KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-14 Thread Ewen Cheslack-Postava (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14212635#comment-14212635
 ] 

Ewen Cheslack-Postava commented on KAFKA-1642:
--

[~junrao] I think you reviewed most of this already since we discussed it 
offline, so I reassigned to you. I think this should be in good shape for 
committing.

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
 Attachments: KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-11-14 Thread Jun Rao (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14213013#comment-14213013
 ] 

Jun Rao commented on KAFKA-1642:


Since this is relatively critical and the changes are only in the new java 
producer, double committed to 0.8.2 as well.

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
 Fix For: 0.8.2

 Attachments: KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-10-23 Thread Ewen Cheslack-Postava (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14182165#comment-14182165
 ] 

Ewen Cheslack-Postava commented on KAFKA-1642:
--

Updated reviewboard https://reviews.apache.org/r/26885/diff/
 against branch origin/trunk

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
 Attachments: KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch, 
 KAFKA-1642_2014-10-23_16:19:41.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-10-20 Thread Ewen Cheslack-Postava (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14177723#comment-14177723
 ] 

Ewen Cheslack-Postava commented on KAFKA-1642:
--

To summarize the issues fixed now:
* Fix logic issue with expired in RecordAccumulator.ready
* Don't include nodes that can send data when computing the delay until the 
next check for ready data. Including these doesn't make sense since their 
delays will change when we send data.
* To correctly account for nodes with sendable data, use a timeout of 0 if we 
send any. This guarantees any necessary delay is computed immediately in the 
next round after some current data has been removed.
* Properly account for nodes with sendable data under connection retry backoff. 
Since they weren't included in computing the next check delay when looking up 
ready nodes, we need to account for it later, but only if we conclude the node 
isn't ready. We need to incorporate the amount of backoff time still required 
before a retry will be performed (nothing else would wakeup at the right time, 
unlike other conditions like a full buffer which only change if data is 
received).

It might be possible to break this into smaller commits for each one, but the 
ordering of applying them needs to be careful because some by themselves result 
in bad behavior -- the existing client worked because it often ended up with 
poll timeouts that were much more aggressive (i.e., often 0).

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Ewen Cheslack-Postava
 Attachments: KAFKA-1642.patch, KAFKA-1642_2014-10-20_17:33:57.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-10-17 Thread Ewen Cheslack-Postava (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14175360#comment-14175360
 ] 

Ewen Cheslack-Postava commented on KAFKA-1642:
--

Created reviewboard https://reviews.apache.org/r/26885/diff/
 against branch origin/trunk

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Jun Rao
 Attachments: KAFKA-1642.patch


 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-10-16 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14174462#comment-14174462
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

[~jkreps],

Did you get chance to re-produce the problem ?  Has someone else reported this 
issues or similar issue ?

Thanks,

Bhavesh 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Jun Rao

 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-10-13 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169415#comment-14169415
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

{code}

import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class TestNetworkDownProducer {

public static void main(String[] args) throws IOException {

Properties prop = new Properties();
InputStream propFile = 
Thread.currentThread().getContextClassLoader()

.getResourceAsStream(kafkaproducer.properties);

String topic = test;
prop.load(propFile);
System.out.println(Property:  + prop.toString());
StringBuilder builder = new StringBuilder(1024);
int msgLenth = 256;
for (int i = 0; i  msgLenth; i++)
builder.append(a);

int numberOfProducer = 4;
Producer[] producer = new Producer[numberOfProducer];

for (int i = 0; i  producer.length; i++) {
producer[i] = new KafkaProducer(prop);
}

Callback callback = new Callback() {
public void onCompletion(RecordMetadata metadata, 
Exception exception) {
if(exception != null){
System.err.println(Msg dropped..!);
exception.printStackTrace();
}

}
};

ProducerRecord record = new ProducerRecord(topic, 
builder.toString().getBytes());
while (true) {
try {
for (int i = 0; i  producer.length; i++) {
producer[i].send(record, callback);
}
Thread.sleep(10);
} catch (Throwable th) {
System.err.println(FATAL );
th.printStackTrace();
}

}

}

}

{code}

{code: name=kafkaproducer.properties }
# THIS IS FOR NEW PRODUCERS API TRUNK Please see the configuration at 
https://kafka.apache.org/documentation.html#newproducerconfigs
# Broker List
bootstrap.servers= BROKERS HERE...
#Data Acks
acks=1
# 64MB of Buffer for log lines (including all messages).
buffer.memory=134217728
compression.type=snappy
retries=3
# DEFAULT FROM THE KAFKA...
# batch size =  ((buffer.memory) / (number of partitions)) (so we can have in 
progress batch size created for each partition.).
batch.size=1048576
#2MiB
max.request.size=1048576
send.buffer.bytes=2097152
# We do not want to block the buffer Full so application thread will not be 
blocked but logs lines will be dropped...
block.on.buffer.full=false
#2MiB
send.buffer.bytes=2097152

#wait...
linger.ms=5000
{code}

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Jun Rao

 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-10-13 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14169453#comment-14169453
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

 [~jkreps]  Let me know if you need any other help !!

Thanks,
Bhavesh 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Jun Rao

 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-09-26 Thread Jay Kreps (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14149875#comment-14149875
 ] 

Jay Kreps commented on KAFKA-1642:
--

Yeah a sample program that reproduces the issue would be excellent. That will 
help us take a look.

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Jun Rao

 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-09-25 Thread Bhavesh Mistry (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14148121#comment-14148121
 ] 

Bhavesh Mistry commented on KAFKA-1642:
---

HI [~jkreps],

I will work on the sample program. We are not setting reconnect.backoff.ms and 
retry.backoff.ms configuration so it would be default configuration.  Only 
thing I can tell you is that I have 4 Producer instance per JVM.  So this might 
amplify issue. 

Thanks,

Bhavesh 

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Jun Rao

 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (KAFKA-1642) [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network connection is lost

2014-09-19 Thread Jay Kreps (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140868#comment-14140868
 ] 

Jay Kreps commented on KAFKA-1642:
--

The intended behavior is that the client will periodically attempt to reconnect 
and update metadata until either it can reconnect or it discovers that a new 
node has taken over leadership for the given partition.

There are two things that could be going on here: (1) our default backoffs 
could be too low or (2) the network selector could be busy waiting. The 
backoffs are controlled by reconnect.backoff.ms and retry.backoff.ms. 
reconnect.backoff.ms controls the amount of time to wait after the last 
connection attempt (whether successful or unsuccessful) before trying to make 
another connection attempt--this avoids trying to connect over and over again. 
This seems to default to only 10ms. The retry.backoff.ms controls the amount of 
time we wait before attempting to update the metadata. This defaults to 100ms.

Alternatively, [~guozhang] found and fixed a bug in the network selector that 
lead to busy waiting previously. Maybe there is another bug like that.

Would you be willing to try setting the two backoffs to something high and see 
if you can reproduce the problem. The ideal would be a short piece of code that 
reproduces this that we could use for testing.

 [Java New Producer Kafka Trunk] CPU Usage Spike to 100% when network 
 connection is lost
 ---

 Key: KAFKA-1642
 URL: https://issues.apache.org/jira/browse/KAFKA-1642
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 0.8.2
Reporter: Bhavesh Mistry
Assignee: Jun Rao

 I see my CPU spike to 100% when network connection is lost for while.  It 
 seems network  IO thread are very busy logging following error message.  Is 
 this expected behavior ?
 2014-09-17 14:06:16.830 [kafka-producer-network-thread] ERROR 
 org.apache.kafka.clients.producer.internals.Sender - Uncaught error in kafka 
 producer I/O thread: 
 java.lang.IllegalStateException: No entry found for node -2
 at 
 org.apache.kafka.clients.ClusterConnectionStates.nodeState(ClusterConnectionStates.java:110)
 at 
 org.apache.kafka.clients.ClusterConnectionStates.disconnected(ClusterConnectionStates.java:99)
 at 
 org.apache.kafka.clients.NetworkClient.initiateConnect(NetworkClient.java:394)
 at 
 org.apache.kafka.clients.NetworkClient.maybeUpdateMetadata(NetworkClient.java:380)
 at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:174)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:175)
 at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:115)
 at java.lang.Thread.run(Thread.java:744)
 Thanks,
 Bhavesh



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

39 matches

Mail list logo