Hi,
I am using a Kafka and Spark cluster for a real-time aggregation analytics
use case in production.
*Cluster details*
*6 nodes*, each running both Spark and Kafka processes.
Node 1 -> 1 Master, 1 Worker, 1 Driver, 1 Kafka process
Nodes 2, 3, 4, 5, 6 -> 1 Worker process and 1 Kafka process each
Spark version 1.3.0
Kafka version 0.8.1
I am using the *Kafka* *Direct Stream* approach for Kafka-Spark integration.
The analytics code is written using the Spark Java API.
*Problem Statement:*
We are dealing with about *10 M records per hour*.
My Spark Streaming batch runs at a *1 hour interval* (at 11:30, 12:30,
1:30 and so on).
Since I am using the Direct Stream, each batch reads all the data for the
past hour in one go at those times.
As of now it takes *about 3 minutes* to read the data, with network
bandwidth utilization of *100-200 MBPS per node* (on each of the 6 nodes
in the Spark cluster).
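For scale, here is a rough back-of-the-envelope calculation from the figures above, as a small plain-Java sketch. The class and method names (`ReadRateEstimate`, `recordsPerSecond`) are only illustrative, and the inputs are the approximate numbers quoted in this mail, not measurements from a profiler.

```java
// Rough rates derived from the approximate figures in this mail.
final class ReadRateEstimate {

    /** Average records per second when `records` are handled in `seconds`. */
    static double recordsPerSecond(long records, long seconds) {
        if (seconds <= 0) {
            throw new IllegalArgumentException("seconds must be positive");
        }
        return (double) records / seconds;
    }

    public static void main(String[] args) {
        long recordsPerBatch = 10_000_000L; // about 10 M records per hour
        long readSeconds = 3 * 60;          // about 3 minutes to read one batch
        long batchSeconds = 60 * 60;        // 1 hour batch interval

        System.out.printf("read rate:    %.0f records/s%n",
                recordsPerSecond(recordsPerBatch, readSeconds));
        System.out.printf("arrival rate: %.0f records/s%n",
                recordsPerSecond(recordsPerBatch, batchSeconds));
    }
}
```

In other words, the hourly batch pulls data roughly 20 times faster than it arrives (3600 s / 180 s), which is why the network usage shows up as a short burst at the start of each batch.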
Since I am running both Spark and Kafka on the same machines,
*I WANT TO BIND MY SPARK EXECUTORS TO THE KAFKA PARTITION LEADERS*, so as
to eliminate the network bandwidth consumed by Spark.
I understand that the number of partitions Spark creates for a Direct
Stream equals the number of partitions in Kafka, which is why I am curious
whether Spark has a provision for this.
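To make the idea concrete, here is a minimal sketch in plain Java of the placement I have in mind. It does not use the Spark or Kafka API: the class `LeaderLocality`, the method `preferredHosts`, and the host names `node1` to `node6` are all hypothetical, and the partition-to-leader map is assumed to be known already.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustration only: which worker host should read which Kafka partition
// if every read is to stay on the partition leader's own machine.
final class LeaderLocality {

    /**
     * Maps each Kafka partition to its leader's host when a Spark worker
     * runs on that host. Partitions whose leader has no co-located worker
     * are left out of the result (they would need a network read).
     */
    static Map<Integer, String> preferredHosts(Map<Integer, String> leaderByPartition,
                                               Set<String> workerHosts) {
        Map<Integer, String> preferred = new TreeMap<>();
        for (Map.Entry<Integer, String> entry : leaderByPartition.entrySet()) {
            String leaderHost = entry.getValue();
            if (workerHosts.contains(leaderHost)) {
                preferred.put(entry.getKey(), leaderHost);
            }
        }
        return preferred;
    }

    public static void main(String[] args) {
        // Hypothetical layout: one partition led by each of the 6 nodes.
        Map<Integer, String> leaders = new TreeMap<>();
        Set<String> workers = new HashSet<>();
        for (int i = 0; i < 6; i++) {
            String host = "node" + (i + 1);
            leaders.put(i, host);
            workers.add(host);
        }
        preferredHosts(leaders, workers).forEach((partition, host) ->
                System.out.println("partition " + partition + " -> run task on " + host));
    }
}
```

Because a Direct Stream gives one Spark partition per Kafka partition, a mapping like this would be enough to decide where each task should run, if Spark lets me express it.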
Regards,
Gaurav