Hi all, I have a 3-node cluster and I am running a Spark Streaming job. Consider the example below:
spark-submit --master yarn-cluster \
  --class com.huawei.bigdata.spark.examples.FemaleInfoCollectionPrint \
  --jars /opt/client/Spark/spark/lib/streamingClient/kafka-clients-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/kafka_2.10-0.8.2.1.jar,/opt/client/Spark/spark/lib/streamingClient/spark-streaming-kafka_2.10-1.5.1.jar \
  /opt/SparkStreamingExample-1.0.jar \
  /tmp/test 10 test 189.132.190.106:21005,189.132.190.145:21005,10.1.1.1:21005

Now suppose node 10.1.1.1 is down. For every window batch, Spark still sends a metadata request to all the brokers in the list. The code responsible is in the class org.apache.spark.streaming.kafka.KafkaCluster:

  Function: getPartitionMetadata()
  Line:     val resp: TopicMetadataResponse = consumer.send(req)

getPartitionMetadata() is called from getPartitions() and findLeaders(), which run for every batch. So when a broker is down, the connection attempt fails only after the socket timeout expires, and that wait is added to every batch's processing time.

Question: Is there any way to avoid this? In simple words, I do not want Spark to send a request to the node that is down on every batch. How can I achieve this?
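For context, here is a minimal sketch of the kind of direct-stream setup I mean. The object name, topic name, and the socket.timeout.ms value are illustrative, not my exact code; this assumes the Kafka 0.8 direct API that ships with spark-streaming-kafka_2.10-1.5.1:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DirectStreamSketch")
    val ssc = new StreamingContext(conf, Seconds(10))

    // The broker list from the command line ends up here; KafkaCluster uses
    // these brokers as seeds when it fetches topic metadata for each batch.
    val kafkaParams = Map(
      "metadata.broker.list" ->
        "189.132.190.106:21005,189.132.190.145:21005,10.1.1.1:21005",
      // Illustrative mitigation, not a confirmed fix: lowering the Kafka 0.8
      // consumer socket timeout (default 30000 ms) shortens the per-batch
      // wait on a dead broker, but does not stop the connection attempts.
      "socket.timeout.ms" -> "3000"
    )

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](ssc, kafkaParams, Set("test"))

    stream.foreachRDD(rdd => println(s"Batch record count: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}

Even with a shorter timeout, the failed connection is still attempted on every batch, which is exactly what I would like to avoid.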