Re: Accessing S3 files with s3n://

2015-08-11 Thread Steve Loughran

On 10 Aug 2015, at 20:17, Akshat Aranya <aara...@gmail.com> wrote:

> Hi Jerry, Akhil,
>
> Thanks for your help. With s3n, the entire file is downloaded even while
> just creating the RDD with sqlContext.read.parquet().  It seems like even
> just opening and closing the InputStream causes the entire file to be
> fetched.
>
> As it turned out, I was able to use s3a and avoid this problem.  I was
> under the impression that s3a was only meant for use with EMRFS, where
> the metadata of the FS is kept separately.  This is not true; s3a maps
> object keys directly to file names and directories.

There's a bug with close() in the httpclient code which was fixed in s3a;
it sounds like the same issue has arisen in s3n.

S3a has had some bugs which surfaced after Hadoop 2.6 shipped; it's ready
for use in Hadoop 2.7.1.
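
For anyone making the switch, here's a minimal sketch of wiring up s3a from
Spark (assuming hadoop-aws and the AWS SDK are on the classpath; the app
name, bucket, path, and credentials are all placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("s3a-example"))

    // Credentials for the s3a connector (placeholders; IAM roles or
    // core-site.xml are the usual alternatives):
    sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

    val sqlContext = new SQLContext(sc)

    // Creating the DataFrame only needs the Parquet footers, which s3a
    // reads without pulling whole objects the way s3n does:
    val df = sqlContext.read.parquet("s3a://my-bucket/path/to/data")
    df.printSchema()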



> On Sun, Aug 9, 2015 at 6:01 AM, Jerry Lam <chiling...@gmail.com> wrote:
>> Hi Akshat,
>>
>> Is there a particular reason you don't use s3a? From my experience, s3a
>> performs much better than the rest. I believe the inefficiency is from
>> the implementation of the s3 interface.


It's from a client-side optimisation that, "for socket reuse", reads
through the entire incoming HTTP stream on close().
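
To illustrate the trade-off, a sketch against the AWS SDK for Java (bucket
and key are placeholders): close() drains the rest of the object so the
pooled HTTP connection can be reused, while abort() discards the
connection, which is far cheaper when most of the object is unread.

    import com.amazonaws.services.s3.AmazonS3Client

    val s3 = new AmazonS3Client()
    val obj = s3.getObject("my-bucket", "path/to/large-object.parquet")
    val in = obj.getObjectContent()

    val header = new Array[Byte](8)
    in.read(header)   // read just the first few bytes...

    in.abort()        // ...then abort: the connection is dropped unread
    // in.close() here would instead read through to EOF "for socket reuse"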


>> Best Regards,
>>
>> Jerry
>>
>> Sent from my iPhone
>>
>> On 9 Aug, 2015, at 5:48 am, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>
>> Depends on which operation you are doing. If you are doing a .count() on
>> a Parquet file, it might not download the entire file, I think, but if
>> you do a .count() on a normal text file it might pull the entire file.
>>
>> Thanks
>> Best Regards
>>
>> On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya <aara...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I've been trying to track down some problems with Spark reads being very
>>> slow with s3n:// URIs (NativeS3FileSystem).  After some digging around, I
>>> realized that this file system implementation fetches the entire file,
>>> which isn't really a Spark problem, but it really slows things down when
>>> trying to just read headers from a Parquet file or just creating
>>> partitions in the RDD.  Is this something that others have observed
>>> before, or am I doing something wrong?
>>>
>>> Thanks,
>>> Akshat





Re: Accessing S3 files with s3n://

2015-08-10 Thread Akshat Aranya
Hi Jerry, Akhil,

Thanks for your help. With s3n, the entire file is downloaded even while
just creating the RDD with sqlContext.read.parquet().  It seems like even
just opening and closing the InputStream causes the entire file to be
fetched.
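
Concretely, the kind of open/seek/close sequence that read.parquet() needs
looks roughly like this at the Hadoop FileSystem level (a sketch; the path
is a placeholder):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val path = new Path("s3a://my-bucket/data/part-00000.parquet")
    val fs = FileSystem.get(path.toUri, new Configuration())

    val status = fs.getFileStatus(path)
    val in = fs.open(path)

    // Parquet keeps the footer length and magic bytes at the tail:
    in.seek(status.getLen - 8)
    val tail = new Array[Byte](8)
    in.readFully(tail)

    in.close()   // s3a: cheap; s3n: this can drag in the rest of the object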

As it turned out, I was able to use s3a and avoid this problem.  I was
under the impression that s3a was only meant for use with EMRFS, where the
metadata of the FS is kept separately.  This is not true; s3a maps object
keys directly to file names and directories.

On Sun, Aug 9, 2015 at 6:01 AM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Akshat,
>
> Is there a particular reason you don't use s3a? From my experience, s3a
> performs much better than the rest. I believe the inefficiency is from the
> implementation of the s3 interface.
>
> Best Regards,
>
> Jerry
>
> Sent from my iPhone
>
> On 9 Aug, 2015, at 5:48 am, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
> Depends on which operation you are doing. If you are doing a .count() on
> a Parquet file, it might not download the entire file, I think, but if
> you do a .count() on a normal text file it might pull the entire file.
>
> Thanks
> Best Regards
>
> On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya <aara...@gmail.com> wrote:
>
>> Hi,
>>
>> I've been trying to track down some problems with Spark reads being very
>> slow with s3n:// URIs (NativeS3FileSystem).  After some digging around, I
>> realized that this file system implementation fetches the entire file,
>> which isn't really a Spark problem, but it really slows things down when
>> trying to just read headers from a Parquet file or just creating partitions
>> in the RDD.  Is this something that others have observed before, or am I
>> doing something wrong?
>>
>> Thanks,
>> Akshat
>>
>
>


Re: Accessing S3 files with s3n://

2015-08-09 Thread bo yang
Hi Akshat,

I found an open source library which implements an S3 InputFormat for
Hadoop. I then use Spark's newAPIHadoopRDD to load data via that S3
InputFormat.

The open source library is https://github.com/ATLANTBH/emr-s3-io. It is a
little old; I looked inside it and made some changes, and then it worked. I
have been using it for more than half a year with Spark, and it still works
great so far with the latest Spark 1.4.0.

You may need to modify it to avoid reading the whole file. Please feel free
to let me know if you hit any issues.
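
For reference, the newAPIHadoopRDD call shape looks like this (a sketch; I
have used the stock TextInputFormat as a stand-in for the emr-s3-io
InputFormat class, and the bucket path is a placeholder):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

    val hadoopConf = new Configuration(sc.hadoopConfiguration)
    hadoopConf.set("mapreduce.input.fileinputformat.inputdir",
      "s3n://my-bucket/logs/")

    // Swap TextInputFormat for the InputFormat provided by emr-s3-io:
    val rdd = sc.newAPIHadoopRDD(
      hadoopConf,
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text])

    println(rdd.count())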

Best,
Bo




On Sun, Aug 9, 2015 at 6:01 AM, Jerry Lam <chiling...@gmail.com> wrote:

> Hi Akshat,
>
> Is there a particular reason you don't use s3a? From my experience, s3a
> performs much better than the rest. I believe the inefficiency is from the
> implementation of the s3 interface.
>
> Best Regards,
>
> Jerry
>
> Sent from my iPhone
>
> On 9 Aug, 2015, at 5:48 am, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
> Depends on which operation you are doing. If you are doing a .count() on
> a Parquet file, it might not download the entire file, I think, but if
> you do a .count() on a normal text file it might pull the entire file.
>
> Thanks
> Best Regards
>
> On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya <aara...@gmail.com> wrote:
>
>> Hi,
>>
>> I've been trying to track down some problems with Spark reads being very
>> slow with s3n:// URIs (NativeS3FileSystem).  After some digging around, I
>> realized that this file system implementation fetches the entire file,
>> which isn't really a Spark problem, but it really slows things down when
>> trying to just read headers from a Parquet file or just creating partitions
>> in the RDD.  Is this something that others have observed before, or am I
>> doing something wrong?
>>
>> Thanks,
>> Akshat
>>
>
>


Re: Accessing S3 files with s3n://

2015-08-09 Thread Jerry Lam
Hi Akshat,

Is there a particular reason you don't use s3a? From my experience, s3a
performs much better than the rest. I believe the inefficiency is from the
implementation of the s3 interface.

Best Regards,

Jerry

Sent from my iPhone

> On 9 Aug, 2015, at 5:48 am, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> 
> Depends on which operation you are doing. If you are doing a .count() on
> a Parquet file, it might not download the entire file, I think, but if
> you do a .count() on a normal text file it might pull the entire file.
> 
> Thanks
> Best Regards
> 
>> On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya <aara...@gmail.com> wrote:
>> Hi,
>> 
>> I've been trying to track down some problems with Spark reads being very
>> slow with s3n:// URIs (NativeS3FileSystem).  After some digging around, I
>> realized that this file system implementation fetches the entire file, which
>> isn't really a Spark problem, but it really slows things down when trying to
>> just read headers from a Parquet file or just creating partitions in the
>> RDD.  Is this something that others have observed before, or am I doing
>> something wrong?
>> 
>> Thanks,
>> Akshat
> 


Re: Accessing S3 files with s3n://

2015-08-09 Thread Akhil Das
Depends on which operation you are doing. If you are doing a .count() on a
Parquet file, it might not download the entire file, I think, but if you do
a .count() on a normal text file it might pull the entire file.
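
Concretely, the two cases being contrasted (paths are placeholders; whether
the Parquet count avoids a full scan depends on the format and version):

    // Parquet is columnar with per-file metadata, so a count may get away
    // with reading far less than the full object:
    val parquetCount = sqlContext.read.parquet("s3n://my-bucket/data").count()

    // A plain text file has no such structure; counting lines means
    // reading every byte:
    val textCount = sc.textFile("s3n://my-bucket/data.txt").count()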

Thanks
Best Regards

On Sat, Aug 8, 2015 at 3:12 AM, Akshat Aranya <aara...@gmail.com> wrote:

> Hi,
>
> I've been trying to track down some problems with Spark reads being very
> slow with s3n:// URIs (NativeS3FileSystem).  After some digging around, I
> realized that this file system implementation fetches the entire file,
> which isn't really a Spark problem, but it really slows things down when
> trying to just read headers from a Parquet file or just creating partitions
> in the RDD.  Is this something that others have observed before, or am I
> doing something wrong?
>
> Thanks,
> Akshat
>


Accessing S3 files with s3n://

2015-08-07 Thread Akshat Aranya
Hi,

I've been trying to track down some problems with Spark reads being very
slow with s3n:// URIs (NativeS3FileSystem).  After some digging around, I
realized that this file system implementation fetches the entire file,
which isn't really a Spark problem, but it really slows things down when
trying to just read headers from a Parquet file or just creating partitions
in the RDD.  Is this something that others have observed before, or am I
doing something wrong?
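
For anyone wanting to reproduce this, a quick way to quantify the slowdown
(a sketch; the helper and path are placeholders). Creating the DataFrame
should only need the Parquet footers, so it ought to be near-instantaneous:

    // Hypothetical helper to time a block of code:
    def timed[A](label: String)(body: => A): A = {
      val start = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
      result
    }

    // With s3n this takes time proportional to the data size, not the footers:
    val df = timed("read.parquet over s3n") {
      sqlContext.read.parquet("s3n://my-bucket/path/to/parquet")
    }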

Thanks,
Akshat