Re: Spark output to s3 extremely slow

2014-10-16 Thread Anny Chen
Hi Rafal,

Thanks for the explanation and solution! I need to write maybe 100 GB to
S3. I will try your approach and see whether it works for me.

Thanks again!

On Wed, Oct 15, 2014 at 1:44 AM, Rafal Kwasny wrote:

> Hi,
> How large is the dataset you're saving into S3?
>
> Saving to S3 actually happens in two steps:
> 1) writing temporary files
> 2) committing them to the proper directory
> Step 2) can be slow because S3 does not have a quick atomic "move"
> operation: you have to copy (server side, but it still takes time) and
> then delete the original.
>
> I've overcome this by using a JobConf with a NullOutputCommitter:
>   jobConf.setOutputCommitter(classOf[NullOutputCommitter])
>
> where NullOutputCommitter is a class that does nothing:
>
>   import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}
>
>   // A no-op committer: no temporary-directory setup and no commit step,
>   // so the slow S3 copy-and-delete is skipped entirely.
>   class NullOutputCommitter extends OutputCommitter {
>     override def setupJob(jobContext: JobContext): Unit = { }
>     override def setupTask(taskContext: TaskAttemptContext): Unit = { }
>     override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
>     override def commitTask(taskContext: TaskAttemptContext): Unit = { }
>     override def abortTask(taskContext: TaskAttemptContext): Unit = { }
>     override def cleanupJob(jobContext: JobContext): Unit = { }
>   }
>
> This works, but maybe someone has a better solution.
>
> /Raf
>
> anny9699 wrote:
> > Hi,
> >
> > I found that writing output back to S3 using rdd.saveAsTextFile() is
> > extremely slow, much slower than reading from S3. Is there a way to make
> > it faster? The RDD has 150 partitions, so I assume there is enough
> > parallelism.
> >
> > Thanks a lot!
> > Anny
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-output-to-s3-extremely-slow-tp16447.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >


Re: Spark output to s3 extremely slow

2014-10-15 Thread Rafal Kwasny
Hi,
How large is the dataset you're saving into S3?

Saving to S3 actually happens in two steps:
1) writing temporary files
2) committing them to the proper directory
Step 2) can be slow because S3 does not have a quick atomic "move"
operation: you have to copy (server side, but it still takes time) and
then delete the original.
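
For intuition, here is a rough sketch of what that "move" amounts to on S3
(AWS SDK for Java; the bucket and key names are hypothetical, just to
illustrate the cost):

  // Hedged illustration only: S3 has no atomic rename, so "moving" an
  // object means a server-side copy followed by a delete of the original.
  import com.amazonaws.services.s3.AmazonS3Client

  val s3 = new AmazonS3Client() // credentials from the default provider chain
  s3.copyObject("my-bucket", "out/_temporary/attempt_0/part-00000",
                "my-bucket", "out/part-00000")
  s3.deleteObject("my-bucket", "out/_temporary/attempt_0/part-00000")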

I've overcome this by using a JobConf with a NullOutputCommitter:
  jobConf.setOutputCommitter(classOf[NullOutputCommitter])

where NullOutputCommitter is a class that does nothing:

  import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

  // A no-op committer: no temporary-directory setup and no commit step,
  // so the slow S3 copy-and-delete is skipped entirely.
  class NullOutputCommitter extends OutputCommitter {
    override def setupJob(jobContext: JobContext): Unit = { }
    override def setupTask(taskContext: TaskAttemptContext): Unit = { }
    override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
    override def commitTask(taskContext: TaskAttemptContext): Unit = { }
    override def abortTask(taskContext: TaskAttemptContext): Unit = { }
    override def cleanupJob(jobContext: JobContext): Unit = { }
  }

This works, but maybe someone has a better solution.
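
For completeness, a minimal sketch of wiring the committer into a save
(assuming Spark 1.x, a SparkContext named sc, an RDD[String] named rdd,
and a hypothetical bucket path; saveAsTextFile does not take a JobConf,
so this goes through saveAsHadoopFile on a pair RDD instead):

  import org.apache.spark.SparkContext._ // pair-RDD save functions (Spark 1.x)
  import org.apache.hadoop.io.{NullWritable, Text}
  import org.apache.hadoop.mapred.{JobConf, TextOutputFormat}

  // Start from the cluster's Hadoop settings and disable the commit step.
  val jobConf = new JobConf(sc.hadoopConfiguration)
  jobConf.setOutputCommitter(classOf[NullOutputCommitter])

  // saveAsHadoopFile wants a pair RDD; mirror saveAsTextFile by using
  // NullWritable keys (TextOutputFormat writes only the Text values).
  rdd.map(line => (NullWritable.get(), new Text(line)))
    .saveAsHadoopFile("s3n://my-bucket/output", classOf[NullWritable],
      classOf[Text], classOf[TextOutputFormat[NullWritable, Text]], jobConf)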

/Raf

anny9699 wrote:
> Hi,
>
> I found that writing output back to S3 using rdd.saveAsTextFile() is
> extremely slow, much slower than reading from S3. Is there a way to make
> it faster? The RDD has 150 partitions, so I assume there is enough
> parallelism.
>
> Thanks a lot!
> Anny
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-output-to-s3-extremely-slow-tp16447.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>

