Re: KafkaException or ExecutionStateChange failure on job startup

2018-10-26 Thread Mark Harris
Hi Dominik

Setting that bit of configuration seems to have done the trick for the
MXBean exception.

Many thanks for your help.

Best regards,

Mark

On Tue, 23 Oct 2018 at 14:41, Dominik Wosiński wrote:

> Hey Mark,
>
> Do you use more than one Kafka consumer in your jobs? I think this
> relates to a known issue in Kafka:
> https://issues.apache.org/jira/browse/KAFKA-3992.
> The problem is that if you don't provide a client ID for your
> *KafkaConsumer*, Kafka assigns one, but does so in an unsynchronized
> way, so it can end up assigning the same id to multiple different
> consumer instances. That is probably what happens when multiple jobs
> are resumed at the same time.
>
> What you could try is to assign the *consumer.id* via the properties
> passed to each consumer. This should help in solving this issue.
>
> Best Regards,
> Dom.

Re: KafkaException or ExecutionStateChange failure on job startup

2018-10-23 Thread Dominik Wosiński
Hey Mark,

Do you use more than one Kafka consumer in your jobs? I think this relates
to a known issue in Kafka:
https://issues.apache.org/jira/browse/KAFKA-3992.
The problem is that if you don't provide a client ID for your
*KafkaConsumer*, Kafka assigns one, but does so in an unsynchronized way,
so it can end up assigning the same id to multiple different consumer
instances. That is probably what happens when multiple jobs are resumed at
the same time.
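The race being described can be sketched roughly like this (illustrative Java, not Kafka's actual code): handing out ids from a shared counter without synchronization lets two threads observe the same counter value, so both can end up as "consumer-1", which then collides as a JMX MBean name:

```java
public class IdRace {
    // Shared counter, deliberately unsynchronized, mimicking the
    // client-id assignment described in KAFKA-3992.
    static int counter = 0;

    // Read-then-write is not atomic: two threads calling this
    // concurrently can both read the same counter value and return
    // the same id.
    static String nextId() {
        int next = counter + 1; // read
        counter = next;         // write, not atomic with the read
        return "consumer-" + next;
    }

    public static void main(String[] args) {
        System.out.println(nextId()); // single-threaded: "consumer-1"
        System.out.println(nextId()); // "consumer-2"
    }
}
```

Single-threaded the ids are sequential; the collision only shows up when two consumers are created at the same moment, which matches jobs being resumed together at cluster startup.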

What you could try is to assign the *consumer.id* via the properties
passed to each consumer. This should help in solving this issue.
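For reference, a minimal sketch of that suggestion, assuming the property to set is the client id that appears in the MBean name above (`client-id=consumer-1`); the broker address and job name here are placeholders:

```java
import java.util.Properties;
import java.util.UUID;

public class UniqueClientId {
    // Builds consumer properties with an explicit, unique client id so
    // two consumers started at the same moment cannot both register the
    // "consumer-1" MBean. The mail above names consumer.id; the MBean
    // name in the stack trace ("client-id=consumer-1") suggests client.id.
    static Properties consumerProps(String jobName) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker:9092"); // placeholder
        props.setProperty("group.id", jobName);
        props.setProperty("client.id", jobName + "-" + UUID.randomUUID());
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps("my-job").getProperty("client.id"));
    }
}
```

The same Properties object can then be handed to the FlinkKafkaConsumer constructor, so each job instance registers its metrics under its own name.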

Best Regards,
Dom.




On Tue, 23 Oct 2018 at 13:21, Mark Harris wrote:

> Hi,
> We regularly see the following two exceptions in a number of jobs shortly
> after they have been resumed during our flink cluster startup:
> [stack traces snipped; see the original message below]

KafkaException or ExecutionStateChange failure on job startup

2018-10-23 Thread Mark Harris
Hi,
We regularly see the following two exceptions in a number of jobs shortly
after they have been resumed during our Flink cluster startup:

org.apache.kafka.common.KafkaException: Error registering mbean
kafka.consumer:type=consumer-node-metrics,client-id=consumer-1,node-id=node--1
at
org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:159)
at
org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:77)
at
org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:436)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:249)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:234)
at
org.apache.kafka.common.network.Selector$SelectorMetrics.maybeRegisterConnectionMetrics(Selector.java:749)
at
org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:327)
at org.apache.kafka.common.network.Selector.poll(Selector.java:303)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349)
at
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226)
at
org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:188)
at
org.apache.kafka.clients.consumer.internals.Fetcher.getTopicMetadata(Fetcher.java:283)
at
org.apache.kafka.clients.consumer.KafkaConsumer.partitionsFor(KafkaConsumer.java:1344)
at
org.apache.flink.streaming.connectors.kafka.internal.Kafka09PartitionDiscoverer.getAllPartitionsForTopics(Kafka09PartitionDiscoverer.java:77)
at
org.apache.flink.streaming.connectors.kafka.internals.AbstractPartitionDiscoverer.discoverPartitions(AbstractPartitionDiscoverer.java:131)
at
org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.open(FlinkKafkaConsumerBase.java:473)
at
org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at
org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:424)
at
org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:290)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
at java.lang.Thread.run(Thread.java:748)
Caused by: javax.management.InstanceAlreadyExistsException:
kafka.consumer:type=consumer-node-metrics,client-id=consumer-1,node-id=node--1
at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
at
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
at
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
at
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
at
com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
at
com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
at
org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:157)
... 21 more
java.lang.Exception: Failed to send ExecutionStateChange notification to
JobManager
at
org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$org$apache$flink$runtime$taskmanager$TaskManager$$handleTaskMessage$3$$anonfun$apply$2.apply(TaskManager.scala:439)
at
org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$org$apache$flink$runtime$taskmanager$TaskManager$$handleTaskMessage$3$$anonfun$apply$2.apply(TaskManager.scala:423)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
at
akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
at
akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:91)
at
akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at
akka.dispatch.BatchingExecutor$BlockableBatch$$anonfun$run$1.apply(BatchingExecutor.scala:91)
at
scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
at
akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:90)
at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:39)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:415)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on
[Actor[akka.tcp://fl...@ip-10-150-24-22.eu-west-1.compute.internal:41775/user/jobmanager#163569829]]
after [3 ms]. Sender[null] sent