Re: Tez / Orc / S3

Prasanth J Tue, 10 Nov 2015 11:11:38 -0800

Please read this similar thread for more context on why S3 is slow. You should 
be using the newer s3a implementation, which brought large performance gains 
over the old s3n.
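
As a rough illustration of Alan's point about seeks (quoted below), here is a 
back-of-envelope sketch in Python. It is only a toy cost model: every number in 
it is a made-up assumption, not a measurement of HDFS, S3, s3n or s3a, and the 
function names are hypothetical.

```python
def read_time_seconds(bytes_read, seeks, seek_latency_s, bandwidth_bytes_per_s):
    """Estimated wall time: a fixed cost per seek plus the transfer time."""
    return seeks * seek_latency_s + bytes_read / bandwidth_bytes_per_s


# All constants below are illustrative assumptions, not measurements.
FILE_BYTES = 1_000_000_000      # hypothetical 1 GB file
SELECTIVE_BYTES = 50_000_000    # hypothetical: a selective read touches 5% of it
SELECTIVE_SEEKS = 200           # hypothetical: one seek per region touched
BANDWIDTH = 100_000_000         # hypothetical 100 MB/s on either store


def compare(seek_latency_s):
    """Return (full_scan_seconds, selective_read_seconds) for one seek latency."""
    full_scan = read_time_seconds(FILE_BYTES, 1, seek_latency_s, BANDWIDTH)
    selective = read_time_seconds(
        SELECTIVE_BYTES, SELECTIVE_SEEKS, seek_latency_s, BANDWIDTH
    )
    return full_scan, selective


if __name__ == "__main__":
    # Two assumed per-seek latencies: a cheap-seek store and an expensive-seek store.
    for label, latency in [("cheap seeks", 0.001), ("expensive seeks", 0.1)]:
        full_scan, selective = compare(latency)
        print(f"{label}: full scan {full_scan:.1f}s, selective read {selective:.1f}s")
```

Under these assumed numbers the selective read wins easily when seeks are 
cheap and loses once each seek costs a large fraction of a second, which is 
the shape of the problem, not a prediction for any particular cluster.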


Thanks
Prasanth


> On Nov 10, 2015, at 1:05 PM, Alan Gates <[email protected]> wrote:
> 
> ORC does a lot of seeks inside its files in order to only load the data you 
> need.  S3 doesn't handle seeks well, so ORC does not give you the same 
> improvements that you would see using it on HDFS directly.
> 
> Alan.
> 
>> Mcbride, Neil <mailto:[email protected]> November 10, 2015 at 5:11
>> Hi there!
>> 
>> ORC sounds perfect for our use case (lots of Hive queries of the type 'WHERE 
>> subscriber = '12345678'). However, I can't seem to see any performance gains 
>> over the use of standard LZO compression when the files are stored on S3. I 
>> raised a case to AWS and was told they also weren't seeing any benefits. The 
>> engineer I spoke to used to work for Hortonworks and said he'd had great 
>> experiences with ORC.
>> 
>> Before I write it off completely for us, I wondered if you were aware of any 
>> particular set up that is required to make better use of ORC on S3?
>> 
>> A typical setup is:
>> 
>> CREATE EXTERNAL TABLE test (
>>      180_columns_wide
>> ) STORED AS ORC
>> LOCATION 's3://bucketname/orc/'
>> TBLPROPERTIES ('orc.compress'='SNAPPY')
>> ;
>> 
>> select * from test
>> where subscriber_id = '12345678'
>> 
>> We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).
>> 
>> -- 
>> Regards
>> Neil
