I have some RDDs stored as s3://-backed sequence files sharded into 1000
parts.  The startup time is pretty long (tens of minutes).  The job is
communicating with S3 during startup, but I don't know what it's doing.  Is
it just fetching the metadata from S3 for each part?  Is there a way to
pipeline this with the computation?



