Hi all, I have a 3-node cluster and I am running a Spark Streaming job. Consider the example below:
spark-submit --master yarn-cluster \
  --class com.huawei.bigdata.spark.examples.FemaleInfoCollectionPrint \
  --jars /opt/client/Spark/spark/lib/streamingClient/kafka-clients-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/kafka_2.10-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar \
  /opt/SparkStreamingExample-1.0.jar \
  /tmp/test 10 test 189.132.190.106:21005,189.132.190.145:21005,10.1.1.1:21005

Now suppose node 10.1.1.1 is down. For every window batch, Spark still sends a metadata request to all the brokers in the list. The code responsible is in the class org.apache.spark.streaming.kafka.KafkaCluster:

  Function: getPartitionMetadata()
  Line:     val resp: TopicMetadataResponse = consumer.send(req)

getPartitionMetadata() is called from getPartitions() and findLeaders(), which run for every batch. So when a broker is down, the connection attempt fails only after the socket timeout expires, and that wait is added to every batch's processing time.

Question: Is there any way to avoid this? In simple words, I do not want Spark to send a request to the node that is down on every batch. How can I achieve this?
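For context, here is a minimal sketch of the kind of direct-stream setup I mean. The object name, topic name, and the socket.timeout.ms value are illustrative, not my exact code; this assumes the Kafka 0.8 direct API that ships with spark-streaming-kafka_2.10-1.5.1:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // The broker list from the command line ends up here; KafkaCluster uses
    // these brokers as seeds when it fetches topic metadata for each batch.
    val kafkaParams = Map(
      "metadata.broker.list" ->
        "189.132.190.106:21005,189.132.190.145:21005,10.1.1.1:21005",
      // Illustrative mitigation, not a confirmed fix: lowering the Kafka 0.8
      // consumer socket timeout (default 30000 ms) shortens the per-batch
      // wait on a dead broker, but does not stop the connection attempts.
      "socket.timeout.ms" -> "3000"
    )

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("test"))

    stream.foreachRDD(rdd => println(s"Batch record count: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}

Even with a shorter timeout, the failed connection is still attempted on every batch, which is exactly what I would like to avoid.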