Maybe irrelevant, but this closely resembles the S3 Parquet file issue we've
run into before: it took a dozen minutes to read the metadata because
ParquetInputFormat calls getFileStatus for every part-file sequentially.
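
For what it's worth, here is a rough sketch of the kind of workaround I'd try
for the serial-metadata problem: issue the per-file lookups from a thread pool
instead of one at a time, since each call is just a latency-bound S3 round
trip. The bucket path, part-file prefix, and pool size below are hypothetical,
and it assumes the FileSystem implementation tolerates concurrent calls:

    import java.util.concurrent.Executors

    import scala.concurrent.duration._
    import scala.concurrent.{Await, ExecutionContext, Future}

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

    object ParallelPartStatus {
      def main(args: Array[String]): Unit = {
        // Hypothetical location; substitute the real bucket/prefix.
        val dir = new Path("s3n://my-bucket/my-dataset/")
        val fs  = FileSystem.get(dir.toUri, new Configuration())

        // One listing call returns all the part files under the prefix.
        val parts = fs.listStatus(dir).filter(_.getPath.getName.startsWith("part-"))

        // Issue the per-file lookups (what the input format does serially)
        // from a thread pool so the S3 round trips overlap.
        implicit val ec =
          ExecutionContext.fromExecutor(Executors.newFixedThreadPool(32))
        val lookups = parts.toSeq.map(p => Future(fs.getFileStatus(p.getPath)))
        val statuses: Seq[FileStatus] =
          Await.result(Future.sequence(lookups), 30.minutes)
        println(s"Fetched metadata for ${statuses.size} part files")
      }
    }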

Just checked SequenceFileInputFormat, and it looks like MapFile inputs may
have a similar issue.
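
To check whether that is what's happening here, something like the sketch
below (hypothetical paths, untested) would show whether the parts are MapFile
directories rather than flat sequence files; each MapFile directory costs
extra S3 metadata calls for its data/index files on top of the listing itself:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Hypothetical location; substitute the real bucket/prefix.
    val dir = new Path("s3n://my-bucket/my-dataset/")
    val fs  = FileSystem.get(dir.toUri, new Configuration())

    // MapFiles are directories (containing "data" and "index" files), so
    // each one adds metadata round trips beyond the directory listing.
    val (mapFileDirs, flatFiles) = fs.listStatus(dir).partition(_.isDir)
    println(s"${mapFileDirs.length} MapFile-style directories, ${flatFiles.length} flat files")
    mapFileDirs.take(1).foreach { d =>
      println(fs.listStatus(d.getPath).map(_.getPath.getName).mkString(", "))
    }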


On Mon, Aug 18, 2014 at 5:26 AM, Aaron Davidson <ilike...@gmail.com> wrote:

> The driver must initially compute the partitions and their preferred
> locations for each part of the file, which results in a serial
> getFileBlockLocations() on each part. However, I would expect this to take
> several seconds, not minutes, to perform on 1000 parts. Is your driver
> inside or outside of AWS? There is an order of magnitude difference in the
> latency of S3 requests if you're running outside of AWS.
>
> We have also experienced an excessive slowdown in the metadata lookups
> using Hadoop 2 versus Hadoop 1, likely due to the differing jets3t library
> versions. If you're using Hadoop 2, you might try downgrading to Hadoop
> 1.2.1 and seeing if the startup time decreases.
>
>
> On Sat, Aug 16, 2014 at 6:46 PM, kmatzen <kmat...@gmail.com> wrote:
>
>> I have some RDDs stored as s3://-backed sequence files sharded into 1000
>> parts.  The startup time is pretty long (tens of minutes).  It's
>> communicating with S3, but I don't know what it's doing.  Is it just
>> fetching the metadata from S3 for each part?  Is there a way to pipeline
>> this with the computation?
>>
>>
>>
>
