Hi,

I have a DataFrame of records with dates, and I'd like to write each
overlapping 12-month window of that data to its own output.
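
For concreteness, assume the windows slide by one month (the exact
slide doesn't matter for the question); a hypothetical helper to
generate them might look like:

import java.time.LocalDate

// Hypothetical: 12-month windows sliding by one month, so consecutive
// windows share 11 months of data.
def monthlyWindows(first: LocalDate, last: LocalDate): Seq[(LocalDate, LocalDate)] =
  Iterator.iterate(first)(_.plusMonths(1))
    .takeWhile(start => !start.plusMonths(12).isAfter(last))
    .map(start => (start, start.plusMonths(12)))
    .toSeq

val windows = monthlyWindows(LocalDate.parse("2014-01-01"),  // example
                             LocalDate.parse("2017-01-01"))  // date range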

Currently, I have a loop equivalent to:

for ((windowStart, windowEnd) <- windows) {
    val windowData = allData.filter(
        getFilterCriteria(windowStart, windowEnd))
    windowData.write.format(...).save(...)
}
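
where, for the sake of discussion, getFilterCriteria builds a simple
date-range predicate along these lines:

import java.time.LocalDate
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// Sketch: keep rows whose date falls in the half-open interval
// [windowStart, windowEnd); assumes an ISO-formatted "date" column.
def getFilterCriteria(windowStart: LocalDate, windowEnd: LocalDate): Column =
  col("date") >= lit(windowStart.toString) &&
  col("date") <  lit(windowEnd.toString)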

This works fine, but has the drawback that Spark runs the writes
sequentially, one job per window, so the total cost grows roughly
linearly with the number of windows.

Is there a way around this?

In MapReduce, I'd probably replicate each record in a Mapper, once per
window it falls into, tagged with that window's ID, and then use
something like MultipleOutputs
<https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html>.
But I'm a bit worried about trying this in Spark because of the data
explosion and RAM use. What's the best approach?
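
To make that concrete, the Spark analogue I'm picturing is roughly the
sketch below; window_id is a made-up column name, and partitionBy plays
the role of MultipleOutputs by writing one directory per window. With
12-month windows sliding monthly, each row is duplicated ~12x, which is
exactly the explosion I'm worried about:

import org.apache.spark.sql.functions.{array, col, explode, lit, when}

// One tag column per window: the window's ID where the row's date
// falls inside [start, end), null otherwise.
val windowTags = windows.zipWithIndex.map { case ((start, end), id) =>
  when(col("date") >= lit(start.toString) &&
       col("date") <  lit(end.toString), lit(id))
}

// Fan each row out to every window it belongs to, then write all
// windows in a single job, one directory per window_id.
allData
  .withColumn("window_id", explode(array(windowTags: _*)))
  .where(col("window_id").isNotNull)  // drop the non-matching tags
  .write
  .partitionBy("window_id")
  .format("parquet")                  // hypothetical format and path
  .save("/path/to/output")

My (unverified) hope is that the duplication happens row-by-row as the
write streams through, rather than being materialized up front, but
that's exactly the part I'm unsure about.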

Thanks!

- Everett
