Hi Rafal,
Thanks for the explanation and solution! I need to write maybe 100 GB to
S3. I will try your approach and see whether it works for me.
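For reference, here is my rough plan for wiring it in. This is only a sketch: it assumes an existing SparkContext `sc`, an RDD[String] `rdd`, your NullOutputCommitter class from below, and the old mapred API via saveAsHadoopFile (which saveAsTextFile wraps anyway); the bucket and path are just placeholders:

```scala
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}
import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions

// Build a JobConf that uses the no-op committer, so the
// temporary-file "move" (copy + delete on S3) is skipped.
val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.setOutputCommitter(classOf[NullOutputCommitter])

// saveAsTextFile doesn't accept a JobConf, so go through
// saveAsHadoopFile directly with the same key/value/format types.
rdd
  .map(line => (NullWritable.get(), new Text(line)))
  .saveAsHadoopFile(
    "s3n://my-bucket/output",  // placeholder bucket/path
    classOf[NullWritable],
    classOf[Text],
    classOf[TextOutputFormat[NullWritable, Text]],
    jobConf)
```

Does that look right to you?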
Thanks again!
On Wed, Oct 15, 2014 at 1:44 AM, Rafal Kwasny wrote:
> Hi,
> How large is the dataset you're saving into S3?
>
> Actually saving to S3 is done in two steps:
> 1) writing temporary files
> 2) committing them to the proper directory
> Step 2) can be slow because S3 does not have a quick atomic "move"
> operation; you have to copy (server side, but it still takes time) and then
> delete the original.
>
> I've overcome this by using a jobconf with NullOutputCommitter
> jobConf.setOutputCommitter(classOf[NullOutputCommitter])
>
> Where NullOutputCommitter is a class that doesn't do anything:
>
> import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}
>
> class NullOutputCommitter extends OutputCommitter {
>   // Every commit/abort step is a no-op, so Hadoop skips the slow
>   // copy-and-delete "move" of temporary files on S3.
>   def abortTask(taskContext: TaskAttemptContext): Unit = { }
>   override def cleanupJob(jobContext: JobContext): Unit = { }
>   def commitTask(taskContext: TaskAttemptContext): Unit = { }
>   def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
>   def setupJob(jobContext: JobContext): Unit = { }
>   def setupTask(taskContext: TaskAttemptContext): Unit = { }
> }
>
> This works, but maybe someone has a better solution.
>
> /Raf
>
> anny9699 wrote:
> > Hi,
> >
> > I found that writing output back to S3 using rdd.saveAsTextFile() is extremely
> > slow, much slower than reading from S3. Is there a way to make it faster?
> > The RDD has 150 partitions, so parallelism should be sufficient, I assume.
> >
> > Thanks a lot!
> > Anny
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-output-to-s3-extremely-slow-tp16447.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> >
>
>