Hello, I have 400K JSON messages pulled from Kafka into Spark Streaming using the DirectStream approach. The 400K messages total around 5 GB, and the Kafka topic has a single partition. Inside foreachRDD I am using sqlContext.read.json(rdd.map(_._2)) to convert each RDD into a DataFrame, and the conversion alone takes almost 2.3 minutes.
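For reference, the relevant code looks roughly like this (Spark 1.x APIs; the broker address, topic name, batch interval, and partition count below are placeholders, not my actual values):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaJsonToDataFrame")
val ssc = new StreamingContext(conf, Seconds(60))
val sqlContext = new SQLContext(ssc.sparkContext)

// Placeholder broker/topic names
val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
val topics = Set("my-topic")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // With a single Kafka partition, the direct stream produces a single RDD
  // partition, so the JSON parsing runs on one core. A repartition here
  // (count is a placeholder) would spread parsing across executor cores.
  val df = sqlContext.read.json(rdd.map(_._2).repartition(8))
  df.count() // stand-in for the real processing
}

ssc.start()
ssc.awaitTermination()
```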
I am running in YARN client mode with 15 GB executor memory and 2 executor cores. Caching the RDD before converting it to a DataFrame doesn't change the processing time. Would introducing hash partitioning (repartitioning the RDD) inside foreachRDD help? Or would partitioning the topic and having more than one DirectStream help? How should I approach this situation to reduce the DataFrame conversion time? Regards, Diwakar.