Andrew, couldn't you do this in the Scala code:

  scala.sys.process.Process("hadoop fs -copyToLocal ...").!

or is that still considered a second step?

"hadoop fs" is almost certainly going to be better at copying these files
than some memory-to-disk-to-memory serdes within Spark.
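
Something along these lines keeps it in one program (untested sketch; the
paths, the transform, and however you build your SparkContext are
placeholders for whatever your job actually does):

  import java.io.File
  import scala.sys.process._

  // step 1: the usual Spark job, HDFS in -> HDFS out
  sc.textFile("hdfs:///path/to/input")
    .map(transform)                          // your processing here
    .saveAsTextFile("hdfs:///path/to/output")

  // step 2, from the same driver: stream the shards into one local file
  val exit = (Process(Seq("hadoop", "fs", "-cat",
    "hdfs:///path/to/output/part-*")) #> new File("/tmp/myfile")).!
  require(exit == 0, "hadoop fs -cat failed")

It's still two operations underneath, but only one thing to launch.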

--
Christopher T. Nguyen
Co-founder & CEO, Adatao <http://adatao.com>
linkedin.com/in/ctnguyen



On Thu, Jan 30, 2014 at 2:21 AM, Andrew Ash <[email protected]> wrote:

> Hi Spark users,
>
> I'm often using Spark for ETL-type tasks, where the input is a large file
> on disk and the output is another large file on disk.  I've loaded
> everything into HDFS, but I still need to produce files out on the other side.
>
> Right now I produce these processed files in a 2-step process:
>
> 1) in a single Spark job, read from HDFS location A, process, and write to
> HDFS location B (roughly sketched below)
> 2) run hadoop fs -cat hdfs:///path/to/* > /path/to/myfile to get it onto
> the local disk.
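>
> For concreteness, step 1 today is roughly this (sketch only; the paths and
> the process() transform are placeholders):
>
>   // read from HDFS location A, apply the ETL step, write to location B
>   sc.textFile("hdfs:///location/A")
>     .map(process)
>     .saveAsTextFile("hdfs:///location/B")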
>
> It would be great to get this down to a 1-step process.
>
> If I run .saveAsTextFile("...") on my RDD, the shards of the file end up
> scattered across the local disks of the cluster.  But if I .collect() on
> the driver and then save to disk using normal Scala disk I/O utilities, I'll
> certainly OOM the driver.
>
> *So the question*: is there a way to get an iterator over an RDD's
> contents that I can scan through on the driver and flush to disk?
>
> I found the RDD.iterator() method, but it looks to be intended for use by
> RDD subclasses rather than end users (it requires Partition and TaskContext
> parameters).  The .foreach() method also executes on the workers rather
> than on the driver, so saving from there would likewise scatter files
> across the cluster.
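>
> The closest workaround I can sketch is pulling one partition at a time back
> to the driver, something like the below (untested; assumes an RDD[String]
> named rdd, and it rescans the RDD once per partition unless it's cached,
> which is part of why I'm asking if there's a better way):
>
>   val writer = new java.io.PrintWriter("/path/to/myfile")
>   try {
>     for (i <- 0 until rdd.partitions.length) {
>       // keep only partition i, so just that partition hits driver memory
>       val chunk = rdd
>         .mapPartitionsWithIndex((idx, it) =>
>           if (idx == i) it else Iterator.empty)
>         .collect()
>       chunk.foreach(line => writer.println(line))
>     }
>   } finally {
>     writer.close()
>   }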
>
> Any suggestions?
>
> Thanks!
> Andrew
>
