The driver must first compute the partitions and their preferred
locations for each part of the file, which results in a serial
getFileBlockLocations() call per part. However, I would expect that to
take several seconds, not minutes, for 1000 parts. Is your driver
running inside or outside of AWS? There is an order-of-magnitude
difference in S3 request latency if you're running outside of AWS.
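A rough back-of-the-envelope sketch of why the serial lookups dominate startup (the per-request latencies below are illustrative assumptions, not measurements):

```scala
object S3StartupEstimate {
  // Illustrative S3 metadata-request latencies (assumptions, not measured):
  val insideAwsMs  = 30.0   // request issued from an EC2 driver
  val outsideAwsMs = 300.0  // same request over the public internet

  // 1000 serial getFileBlockLocations() calls add up linearly.
  def serialStartupSeconds(parts: Int, perRequestMs: Double): Double =
    parts * perRequestMs / 1000.0

  def main(args: Array[String]): Unit = {
    println(f"inside AWS:  ~${serialStartupSeconds(1000, insideAwsMs)}%.0f s")
    println(f"outside AWS: ~${serialStartupSeconds(1000, outsideAwsMs)}%.0f s")
  }
}
```

With these assumed numbers, 1000 serial lookups come to roughly half a minute inside AWS but several minutes outside, which is consistent with the slow startup described below.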

We have also seen a substantial slowdown in these metadata lookups on
Hadoop 2 versus Hadoop 1, likely due to the different jets3t library
versions they bundle. If you're on Hadoop 2, you might try downgrading
to Hadoop 1.2.1 and seeing whether the startup time decreases.
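If you want to try that, one way is to pin the Hadoop client in your build. A minimal build.sbt sketch (the Spark version string here is an assumption; match it to the release you're actually running):

```scala
// build.sbt sketch -- pins Hadoop 1.2.1 alongside Spark.
// "1.0.2" is an assumed Spark version; substitute your own.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"    % "1.0.2",
  "org.apache.hadoop" % "hadoop-client" % "1.2.1"
)
```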


On Sat, Aug 16, 2014 at 6:46 PM, kmatzen <kmat...@gmail.com> wrote:

> I have some RDD's stored as s3://-backed sequence files sharded into 1000
> parts.  The startup time is pretty long (~10's of minutes).  It's
> communicating with S3, but I don't know what it's doing.  Is it just
> fetching the metadata from S3 for each part?  Is there a way to pipeline
> this with the computation?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/s3-sequence-file-startup-time-tp12242.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
