[ https://issues.apache.org/jira/browse/KAFKA-6129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Francesco vigotti updated KAFKA-6129:
-------------------------------------
    Description: 
I started writing in this issue:
https://issues.apache.org/jira/browse/KAFKA-2729
but I am opening this new issue because I have probably found the cause in my Kubernetes setup. In my opinion Kubernetes is doing nothing wrong in this setup (every other application works with the same NodePort redirection, e.g. ZooKeeper). Kafka brokers fail silently (randomly in a multi-broker setup) and the producer reports a misleading error, so I think Kafka should be improved with more robust pre-startup checks that identify and report this condition.

After further investigation following my reply in https://issues.apache.org/jira/browse/KAFKA-2729, using a minimum-size cluster (1 ZooKeeper + 1 Kafka broker), I found the problem. It is related to Kubernetes (I don't know why it appeared only now for me, whether something changed in recent kube-proxy versions, in Kafka 0.10+, or elsewhere). In any case, my old Kafka cluster became under-replicated and started returning various errors.

The problem occurs when pods are created in Kubernetes and exposed from the host through a NodePort service (over a static IP in my case). With hostNetwork (so no redirection) everything works. Strangely, ZooKeeper works fine with NodePort (which creates a redirection rule in iptables->nat->prerouting); Kafka is the only application I have found that has problems with this Kubernetes configuration. What is odd is that Kafka starts correctly without errors, but multi-broker clusters show random issues, and on a single-broker cluster the console-producer fails with an infinite loop of:

```
[2017-10-26 09:38:23,281] WARN Error while fetching metadata with correlation id 5 : {test6=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,383] WARN Error while fetching metadata with correlation id 6 : {test6=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
[2017-10-26 09:38:23,485] WARN Error while fetching metadata with correlation id 7 : {test6=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient)
```

Still, no errors are reported by the broker or ZooKeeper.

I have also come across this discussion:
https://stackoverflow.com/questions/35788697/leader-not-available-kafka-in-console-producer
but the solution proposed there for the host pod (to allow self-resolving of the advertised hostname) didn't work:

```
hostAliases:
- ip: "127.0.0.1"
  hostnames:
  - "---myhosthostname---"
```

> kafka issue when exposing through nodeport in kubernetes
> --------------------------------------------------------
>
>                 Key: KAFKA-6129
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6129
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.10.2.1
>         Environment: kubernetes
>            Reporter: Francesco vigotti
>            Priority: Critical

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
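
To illustrate the kind of pre-startup check asked for in the description, here is a minimal sketch in Python. It is not Kafka code and not an existing Kafka feature: the function names and the example listener value are hypothetical, and it only tests whether the advertised `host:port` accepts a TCP connection from wherever the script runs (for example, from inside the broker pod). Whether such a check would actually catch the NodePort problem reported here is an assumption, not something the report establishes.

```python
import socket
from typing import Tuple


def parse_listener(listener: str) -> Tuple[str, int]:
    """Split a listener string such as 'PLAINTEXT://host:port' into (host, port).

    Hypothetical helper for this sketch; bracketed IPv6 hosts are not unwrapped.
    """
    _, _, hostport = listener.rpartition("://")
    host, _, port = hostport.rpartition(":")
    return host, int(port)


def endpoint_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


if __name__ == "__main__":
    # Assumed example value; replace with the broker's real advertised listener.
    advertised = "PLAINTEXT://kafka.example.internal:30092"
    host, port = parse_listener(advertised)
    if endpoint_reachable(host, port):
        print(f"advertised endpoint {host}:{port} is reachable")
    else:
        print(f"WARNING: advertised endpoint {host}:{port} is NOT reachable from here")
```

Running this from inside the broker pod and again from a client host would show whether the advertised endpoint is reachable along both paths, which is the information the broker currently does not report at startup.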