Maybe irrelevant, but this closely resembles the S3 Parquet file issue we've run into before: reading the metadata takes a dozen minutes because ParquetInputFormat calls getFileStatus for every part-file sequentially.
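One way to hide most of that latency is to issue the status calls from a thread pool instead of one at a time. A rough, untested sketch of the idea (the bucket name, part-file layout, and pool size below are all made up):

    import java.util.concurrent.Executors

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileStatus, Path}

    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}

    object ParallelGetFileStatus {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Made-up layout: 1000 part files under one prefix.
        val parts = (0 until 1000).map(i => new Path(f"s3n://some-bucket/some-rdd/part-$i%05d"))
        val fs = parts.head.getFileSystem(conf)

        // Fetch all statuses through a fixed-size pool instead of serially,
        // so the S3 round trip is paid ~32 at a time rather than 1000 times in a row.
        implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))
        val statuses: Seq[FileStatus] =
          Await.result(Future.traverse(parts)(p => Future(fs.getFileStatus(p))), Duration.Inf)
        statuses.foreach(s => println(s"${s.getPath} ${s.getLen}"))
      }
    }

This doesn't plug into ParquetInputFormat itself, of course; it just illustrates that the per-part metadata fetches parallelize cleanly.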
Just checked SequenceFileInputFormat, and found that MapFile may share a similar issue.

On Mon, Aug 18, 2014 at 5:26 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> The driver must initially compute the partitions and their preferred
> locations for each part of the file, which results in a serial
> getFileBlockLocations() on each part. However, I would expect this to take
> several seconds, not minutes, to perform on 1000 parts. Is your driver
> inside or outside of AWS? There is an order of magnitude difference in the
> latency of S3 requests if you're running outside of AWS.
>
> We have also experienced an excessive slowdown in the metadata lookups
> using Hadoop 2 versus Hadoop 1, likely due to the differing jets3t library
> versions. If you're using Hadoop 2, you might try downgrading to Hadoop
> 1.2.1 and seeing if the startup time decreases.
>
> On Sat, Aug 16, 2014 at 6:46 PM, kmatzen <kmat...@gmail.com> wrote:
>
>> I have some RDDs stored as s3://-backed sequence files sharded into 1000
>> parts. The startup time is pretty long (~10s of minutes). It's
>> communicating with S3, but I don't know what it's doing. Is it just
>> fetching the metadata from S3 for each part? Is there a way to pipeline
>> this with the computation?
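Following up on the serial getFileBlockLocations() point above: until that's pipelined with the computation, the same thread-pool trick applies to the location lookups. Another untested sketch (directory name and pool size are, again, invented; note that S3 filesystems return synthetic block locations anyway, but the input format still asks for them one part at a time):

    import java.util.concurrent.Executors

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{BlockLocation, FileStatus, Path}

    import scala.concurrent.duration.Duration
    import scala.concurrent.{Await, ExecutionContext, Future}

    object ParallelBlockLocations {
      def main(args: Array[String]): Unit = {
        val conf = new Configuration()
        // Made-up location of the sharded sequence file.
        val dir = new Path("s3n://some-bucket/some-rdd")
        val fs = dir.getFileSystem(conf)
        val statuses: Seq[FileStatus] = fs.listStatus(dir).toSeq // one LIST, not 1000 lookups

        // One getFileBlockLocations() per part, ~32 in flight at a time.
        implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))
        val locations: Seq[Array[BlockLocation]] = Await.result(
          Future.traverse(statuses)(s => Future(fs.getFileBlockLocations(s, 0, s.getLen))),
          Duration.Inf)
        println(s"fetched locations for ${locations.size} parts")
      }
    }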