[jira] [Commented] (FLINK-3066) Kafka producer fails on leader change
[ https://issues.apache.org/jira/browse/FLINK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15138677#comment-15138677 ]

Gyula Fora commented on FLINK-3066:
---

Thank you Robert for the help, it is a good catch :)

> Kafka producer fails on leader change
> -------------------------------------
>
> Key: FLINK-3066
> URL: https://issues.apache.org/jira/browse/FLINK-3066
> Project: Flink
> Issue Type: Bug
> Components: Kafka Connector, Streaming Connectors
> Affects Versions: 0.10.0, 1.0.0
> Reporter: Gyula Fora
>
> I got the following exception during my streaming job:
> {code}
> 16:44:50,637 INFO org.apache.flink.runtime.jobmanager.JobManager - Status of job 4d3f9443df4822e875f1400244a6e8dd (deduplo!) changed to FAILING.
> java.lang.Exception: Failed to send data to Kafka: This server is not the leader for that topic-partition.
>   at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.checkErroneous(FlinkKafkaProducer.java:275)
>   at org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.invoke(FlinkKafkaProducer.java:246)
>   at org.apache.flink.streaming.api.operators.StreamSink.processElement(StreamSink.java:37)
>   at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:166)
>   at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:63)
>   at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:221)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:588)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
> {code}
> And then the job crashed and recovered. This should probably be something that we handle without crashing.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (FLINK-3066) Kafka producer fails on leader change
[ https://issues.apache.org/jira/browse/FLINK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15137610#comment-15137610 ]

Robert Metzger commented on FLINK-3066:
---

I have to correct my comment from November last year: it is the producer that is failing. In this case we need to find out why Kafka's producer implementation is not handling this properly.
[jira] [Commented] (FLINK-3066) Kafka producer fails on leader change
[ https://issues.apache.org/jira/browse/FLINK-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15137662#comment-15137662 ]

Robert Metzger commented on FLINK-3066:
---

I looked further into the issue. There is a fairly obvious way to fix it: set the number of retries to a value above 0. By default Kafka sets it to 0 to avoid duplicates.

Your job is called "deduplo", so I assume the objective here is to avoid duplicates. The fundamental issue is that Kafka's producers don't have any "exactly-once" guarantees: they don't support any way of committing data.

I would vote to close this issue since this is a limitation of the underlying API / system.
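For readers who can tolerate possible duplicates, the retry setting mentioned in the comment above could be configured roughly as follows. This is only a sketch: the class name `ProducerRetryConfig`, the method `withRetries`, the broker address, and the chosen values are hypothetical. `retries` and `retry.backoff.ms` are standard Kafka producer configuration keys. How the resulting `Properties` object reaches the producer depends on the connector version, so that step is only described in a comment.

```java
import java.util.Properties;

// Hypothetical helper, not part of Flink or Kafka.
public class ProducerRetryConfig {

    /**
     * Builds Kafka producer properties with an explicit retry count.
     * A value above 0 lets the producer re-send a record after a transient
     * error such as a leader change, at the cost of possible duplicates.
     */
    public static Properties withRetries(String bootstrapServers, int retries, long backoffMs) {
        if (retries < 0 || backoffMs < 0) {
            throw new IllegalArgumentException("retries and backoffMs must be non-negative");
        }
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Number of times the producer re-sends a failed record.
        props.setProperty("retries", Integer.toString(retries));
        // Wait between attempts, so a new partition leader has time to be elected.
        props.setProperty("retry.backoff.ms", Long.toString(backoffMs));
        return props;
    }

    public static void main(String[] args) {
        // Assumed broker address and values, for illustration only.
        Properties props = withRetries("localhost:9092", 3, 500L);
        // These properties would then be handed to the Flink Kafka producer
        // through a constructor that accepts producer properties.
        System.out.println("retries=" + props.getProperty("retries"));
    }
}
```

With retries enabled, a record whose acknowledgement was lost can be written twice, which is the trade-off described in the comment. Kafka can also reorder records on retry unless `max.in.flight.requests.per.connection` is set to 1.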