I stumbled upon this thread, and I suspect the same issue may affect restoring a checkpointed RDD as well:
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-checkpointed-RDD-in-Apache-Spark-has-long-gt-10-hour-between-stage-latency-td22925.html#a22928

In my case I have 1600+ fragmented checkpoint files, and reading all of their metadata takes a staggering 11 hours. If this is really the cause, then it is an obvious handicap: a checkpointed RDD already has all of its file partition information available, so it shouldn't need to read that information from S3 into the driver again (which creates a single point of failure and a bottleneck).

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/S3NativeFileSystem-inefficient-implementation-when-calling-sc-textFile-tp19841p22984.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.