Hi Spark users,
I often use Spark for ETL-type tasks, where the input is a large file on
disk and the output is another large file on disk. I've loaded everything
into HDFS, but I still need to produce plain files back out on the other side.
Right now I produce these processed files in a 2-step process:
1) in a single Spark job, read from HDFS location A, process, and write to
HDFS location B (a rough sketch of this step is below)
2) run hadoop fs -cat hdfs:///path/to/* > /path/to/myfile to get the result
onto the local disk.
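For reference, step 1 is roughly the following. This is only a sketch: the
paths, the app name, and the map function are placeholders, not my actual job.

    import org.apache.spark.{SparkConf, SparkContext}

    // Step 1, roughly: read from HDFS location A, process, write to HDFS location B.
    // The paths and the transform are stand-ins for the real ETL logic.
    val sc = new SparkContext(new SparkConf().setAppName("etl-job"))
    sc.textFile("hdfs:///path/A")
      .map(line => line.toUpperCase)   // stand-in for the real per-record processing
      .saveAsTextFile("hdfs:///path/B")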
It would be great to get this down to a 1-step process.
If I run .saveAsTextFile("...") on my RDD, the shards of the file end up
scattered across the workers' local disks. But if I .collect() to the
driver and then save to disk using normal Scala IO utilities, I'll almost
certainly OOM the driver on a large dataset.
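In other words, neither of these gives me a single local file. A sketch only,
assuming "processed" is an RDD[String] standing in for my actual RDD:

    // Writes part-xxxxx shards to each worker's local disk, not one file on the driver.
    processed.saveAsTextFile("file:///local/output")

    // Pulls the entire dataset into driver memory before writing; OOMs for large data.
    val everything: Array[String] = processed.collect()
    val writer = new java.io.PrintWriter("/local/output.txt")
    everything.foreach(line => writer.println(line))
    writer.close()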
*So the question*: is there a way to get an iterator over an RDD's contents
that I can scan through on the driver and flush to disk as I go?
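Conceptually I'm after something like the sketch below, where driverIterator
is purely hypothetical; it's the missing piece I'm looking for, something that
streams elements to the driver incrementally instead of materializing
everything at once:

    // Hypothetical: driverIterator would stream the RDD's elements back to the
    // driver (e.g. one partition at a time) without collecting the whole dataset.
    val out = new java.io.PrintWriter("/path/to/myfile")
    driverIterator(processed).foreach(line => out.println(line))
    out.close()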
I found the RDD.iterator() method, but it looks to be intended for use by
RDD subclasses rather than end users (it requires Partition and TaskContext
parameters). The .foreach() method executes on the workers rather than on
the driver, so it too would scatter files across the cluster if I saved
from there.
Any suggestions?
Thanks!
Andrew