Hello, I have 400K JSON messages pulled from Kafka into Spark Streaming using the DirectStream approach. The 400K messages total around 5 GB, and the Kafka topic has a single partition. Inside foreachRDD I am using sqlContext.read.json(rdd.map(_._2)) to convert each RDD into a DataFrame, and the conversion alone takes almost 2.3 minutes.
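For reference, the relevant code looks roughly like this (Spark 1.x APIs; the broker address, topic name, batch interval, and partition count below are placeholders, not my actual values):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaJsonToDataFrame")
val ssc = new StreamingContext(conf, Seconds(60))
val sqlContext = new SQLContext(ssc.sparkContext)

// Placeholder broker/topic names
val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
val topics = Set("my-topic")

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // With a single Kafka partition, the direct stream produces a single RDD
  // partition, so the JSON parsing runs on one core. A repartition here
  // (count is a placeholder) would spread parsing across executor cores.
  val df = sqlContext.read.json(rdd.map(_._2).repartition(8))
  df.count() // stand-in for the real processing
}

ssc.start()
ssc.awaitTermination()
```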
I am running in YARN client mode with 15 GB executor memory and 2 executor cores. Caching the RDD before converting it to a DataFrame doesn't change the processing time. Would introducing hash partitioning (repartitioning the RDD) inside foreachRDD help? Or would partitioning the topic and having more than one DirectStream help? How should I approach this situation to reduce the DataFrame conversion time? Regards, Diwakar.