Hi Akshat,

Is there a particular reason you don't use s3a? In my experience, s3a performs much better than the rest. I believe the inefficiency comes from the implementation of the s3n interface.
Best Regards,
Jerry

Sent from my iPhone

> On 9 Aug, 2015, at 5:48 am, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
> It depends on which operation you are doing. If you do a .count() on a
> Parquet file, it might not download the entire file, I think, but if you do
> a .count() on a normal text file it might pull the entire file.
>
> Thanks
> Best Regards
>
>> On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya <aara...@gmail.com> wrote:
>> Hi,
>>
>> I've been trying to track down some problems with Spark reads being very
>> slow with s3n:// URIs (NativeS3FileSystem). After some digging around, I
>> realized that this file system implementation fetches the entire file,
>> which isn't really a Spark problem, but it really slows things down when
>> trying to just read headers from a Parquet file or create partitions in
>> the RDD. Is this something that others have observed before, or am I
>> doing something wrong?
>>
>> Thanks,
>> Akshat
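
For what it's worth, switching to s3a is mostly a matter of putting the hadoop-aws jar (and its matching AWS SDK jar) on the classpath, setting the standard Hadoop s3a properties, and using s3a:// URIs instead of s3n://. A minimal sketch of what that could look like in spark-defaults.conf, assuming Hadoop 2.6+ (the bucket name and credential placeholders below are illustrative, not from this thread):

```
# spark-defaults.conf (sketch -- adjust to your cluster and Hadoop version)
# spark.hadoop.* properties are passed through to the Hadoop configuration.
spark.hadoop.fs.s3a.impl          org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.access.key    YOUR_ACCESS_KEY
spark.hadoop.fs.s3a.secret.key    YOUR_SECRET_KEY
```

Then read with e.g. `sqlContext.read.parquet("s3a://my-bucket/path")`. Unlike s3n's NativeS3FileSystem, the s3a input stream supports seeking within an object, so reading a Parquet footer shouldn't require pulling the whole file.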