Hello,
I have a streaming use case where I plan to keep a dataset broadcast and
cached on each executor.
Every micro-batch in the stream creates a DataFrame from the incoming RDD
and joins it against that cached dataset.
The code below performs the broadcast for every RDD, i.e. once per
micro-batch. Is there a way to broadcast it just once?
Alternate approaches are also welcome.
val DF1 = sqlContext.read.format("json").schema(schema1).load(file1)
val metaDF = sqlContext.read.format("json").schema(schema1).load(file2)
  .join(DF1, "id")
metaDF.cache()

val lines = streamingcontext.textFileStream(path)
lines.foreachRDD { rdd =>
  val recordDF = rdd.flatMap(r => Record(r)).toDF()
  val joinedDF = recordDF.join(broadcast(metaDF), "id")
  joinedDF.rdd.foreachPartition { partition =>
    partition.foreach { row =>
      ...
      ...
    }
  }
}
streamingcontext.start()
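One alternate approach I have been considering (a sketch, not tested at scale): collect metaDF once on the driver into a Map, broadcast that with sc.broadcast before starting the stream, and do the lookup manually inside foreachRDD instead of calling broadcast(metaDF) on every batch. The lookup itself is just a hash join; below it is sketched with plain Scala collections so it runs without a cluster, with the Spark wiring shown in comments. The Meta/Record shapes and field names are assumptions standing in for schema1.

```scala
// Sketch: broadcast the metadata ONCE, then do a map-side hash join per batch.
// Plain Scala stand-ins are used so the snippet runs without Spark; the
// commented lines show where sc.broadcast / foreachRDD would fit.
object BroadcastOnceSketch {
  // Assumed shapes -- the real columns come from schema1 in the post.
  case class Meta(id: String, label: String)
  case class Record(id: String, payload: String)

  // The per-batch work: an inner hash join on id; records with no
  // matching metadata are dropped, mirroring join(..., "id").
  def hashJoin(batch: Seq[Record], meta: Map[String, Meta]): Seq[(Record, Meta)] =
    for {
      rec <- batch
      m   <- meta.get(rec.id)
    } yield (rec, m)

  def main(args: Array[String]): Unit = {
    // Done ONCE, before streamingcontext.start():
    //   val metaMap = metaDF.collect().map(r => r.getString(0) -> r).toMap
    //   val bcast   = sc.broadcast(metaMap)
    val metaMap: Map[String, Meta] =
      Map("a" -> Meta("a", "alpha"), "b" -> Meta("b", "beta"))

    // Inside lines.foreachRDD { rdd => rdd.mapPartitions { it => ... } }
    // each executor would read bcast.value; here we join a small batch directly.
    val batch  = Seq(Record("a", "p1"), Record("c", "p2"))
    val joined = hashJoin(batch, metaMap)
    joined.foreach { case (rec, meta) =>
      println(s"${rec.id}: ${rec.payload} / ${meta.label}")
    }
  }
}
```

This trades the DataFrame join for a hand-rolled one, so it only pays off when metaDF comfortably fits in a driver-side map.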
On a similar note, if metaDF is too big to broadcast, can I partition it
(df.repartition($"col")) and also partition each streaming RDD by the same
key, so that metaDF is not shuffled again on every micro-batch?
Let me know your thoughts.
Thanks