The driver must first compute the partitions and their preferred locations for each part of the file, which results in a serial getFileBlockLocations() call for each part. However, I would expect this to take seconds, not minutes, for 1000 parts. Is your driver running inside or outside of AWS? There is an order-of-magnitude difference in S3 request latency if you're running outside of AWS.
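A quick back-of-envelope sketch shows why serial metadata lookups can turn 1000 parts into minutes of startup time. The latency figures below are illustrative assumptions, not measurements:

```python
# Serial metadata lookups scale linearly with the number of parts.
# Per-request latencies here are rough assumptions for illustration.

def startup_seconds(per_request_latency_s: float, n_parts: int) -> float:
    """Total time for n_parts metadata requests issued one at a time."""
    return per_request_latency_s * n_parts

parts = 1000
inside_aws = startup_seconds(0.05, parts)   # assume ~50 ms per request inside AWS
outside_aws = startup_seconds(0.5, parts)   # assume ~500 ms per request over the WAN

print(inside_aws)   # tens of seconds
print(outside_aws)  # hundreds of seconds, i.e. minutes
```

Under these assumed latencies, running the driver outside AWS alone is enough to push startup from under a minute into the ~10-minute range the original post describes.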
We have also experienced an excessive slowdown in the metadata lookups with Hadoop 2 versus Hadoop 1, likely due to the differing jets3t library versions. If you're using Hadoop 2, you might try downgrading to Hadoop 1.2.1 and seeing whether the startup time decreases.

On Sat, Aug 16, 2014 at 6:46 PM, kmatzen <kmat...@gmail.com> wrote:

> I have some RDDs stored as s3://-backed sequence files sharded into 1000
> parts. The startup time is pretty long (~10s of minutes). It's
> communicating with S3, but I don't know what it's doing. Is it just
> fetching the metadata from S3 for each part? Is there a way to pipeline
> this with the computation?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/s3-sequence-file-startup-time-tp12242.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.