Hi,

I am using a Kafka and Spark cluster for a real-time aggregation analytics
use case in production.

*Cluster details*

*6 nodes*, each running Spark and Kafka processes:
Node 1              -> 1 Master, 1 Worker, 1 Driver, 1 Kafka process
Nodes 2, 3, 4, 5, 6 -> 1 Worker process each, 1 Kafka process each

Spark version 1.3.0
Kafka version 0.8.1

I am using the *Kafka* *direct stream* for the Kafka-Spark integration.
The analytics code is written using the Spark Java API.

*Problem Statement:*

      We are dealing with about *10 M records per hour*.
      My Spark Streaming batch runs at a *1-hour interval* (at 11:30, 12:30,
      1:30 and so on).

      Since I am using the direct stream, each batch reads all the data for
      the past hour at 11:30, 12:30, 1:30 and so on.
      As of now it takes *about 3 minutes* to read the data, with a network
      bandwidth utilization of *100-200 MBPS per node* (on each of the 6 nodes
      in the Spark cluster).

      Since I am running both Spark and Kafka on the same machines,
*      I WANT TO BIND EACH SPARK EXECUTOR TO THE KAFKA PARTITION LEADER*, so
      as to eliminate the network bandwidth consumed by Spark.

      I understand that the number of partitions created in Spark for a
      direct stream equals the number of partitions in Kafka, which is why I
      am curious whether Spark has a provision for this.
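
      To make the request concrete, here is a minimal plain-Java sketch of the
      behaviour I have in mind. It does not call any Spark or Kafka API. The
      class name, the method name, the host names and the "ANY" fallback are
      all made up for illustration, and I am assuming the leader host of each
      partition is already known.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Illustration only: not Spark or Kafka code, all names are hypothetical.
class LeaderLocalitySketch {

    /**
     * For each Kafka partition, pick the host its reading task should
     * prefer: the partition leader if an executor runs on that host,
     * otherwise the placeholder "ANY" (meaning a remote read is needed).
     */
    public static Map<Integer, String> preferredHosts(
            Map<Integer, String> leaderByPartition,
            Set<String> executorHosts) {
        Map<Integer, String> preferred = new TreeMap<>();
        for (Map.Entry<Integer, String> entry : leaderByPartition.entrySet()) {
            String leader = entry.getValue();
            // A local read is only possible where an executor actually runs.
            String host = executorHosts.contains(leader) ? leader : "ANY";
            preferred.put(entry.getKey(), host);
        }
        return preferred;
    }

    public static void main(String[] args) {
        // Assumed layout: partitions 0 and 1 are led by hosts that also run
        // executors; partition 2 is led by a host without an executor.
        Map<Integer, String> leaders = new HashMap<>();
        leaders.put(0, "node1");
        leaders.put(1, "node2");
        leaders.put(2, "node9");
        Set<String> executors = new HashSet<>(Arrays.asList("node1", "node2"));

        System.out.println(preferredHosts(leaders, executors));
    }
}
```

      In my setup every node runs both a worker and a Kafka process, so every
      partition leader should have an executor on the same host and the
      fallback case should not occur.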



Regards,
Gaurav
