Re: Tez / Orc / S3

Prasanth J Tue, 10 Nov 2015 11:11:28 -0800

Oops.. missed the link
http://mail-archives.apache.org/mod_mbox/orc-user/201509.mbox/%[email protected]%3e


Thanks
Prasanth
> On Nov 10, 2015, at 1:10 PM, Prasanth J <[email protected]> wrote:
> 
> Please read this similar thread for more context on why S3 is slow. You 
> should be using the newer s3a implementation, which has huge performance 
> gains over the old s3n.
> 
> Thanks
> Prasanth
> 
> 
>> On Nov 10, 2015, at 1:05 PM, Alan Gates <[email protected]> wrote:
>> 
>> ORC does a lot of seeks inside its files in order to only load the data you 
>> need.  S3 doesn't handle seeks well, so ORC does not give you the same 
>> improvements that you would see using it on HDFS directly.
>> 
>> Alan.
>> 
>>> On November 10, 2015, at 5:11, Mcbride, Neil <[email protected]> wrote:
>>> Hi there!
>>> 
>>> ORC sounds perfect for our use case (lots of Hive queries of the type 
>>> WHERE subscriber = '12345678'). However, I can't seem to see any 
>>> performance gains over the use of standard LZO compression when the files 
>>> are stored on S3. I raised a case to AWS and was told they also weren't 
>>> seeing any benefits. The engineer I spoke to used to work for Hortonworks 
>>> and said he'd had great experiences with ORC.
>>> 
>>> Before I write it off completely for us, I wondered if you were aware of 
>>> any particular setup that is required to make better use of ORC on S3?
>>> 
>>> A typical setup is:
>>> 
>>> CREATE EXTERNAL TABLE test (
>>>      180_columns_wide
>>> ) STORED AS ORC
>>> LOCATION 's3://bucketname/orc/'
>>> TBLPROPERTIES ('orc.compress'='SNAPPY')
>>> ;
>>> 
>>> select * from test
>>> where subscriber_id = '12345678'
>>> 
>>> We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).
>>> 
>>> -- 
>>> Regards
>>> Neil
>
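
As a rough sketch of the two points raised in this thread (the s3a connector and ORC's selective reads), the table from the original question might be set up along the lines below. This assumes a Hadoop build that ships the s3a connector; on EMR the s3:// scheme is served by Amazon's own EMRFS, so the scheme change may not apply there. The table name test_s3a and the two columns are hypothetical stand-ins for the real 180-column table, and the SET line uses the standard Hive property that pushes filters down to the ORC reader.

```sql
-- Let Hive hand the WHERE predicate to the ORC reader so it can skip
-- stripes and row groups whose min/max statistics rule out a match.
SET hive.optimize.index.filter=true;

-- Hypothetical columns; the real table has about 180.
CREATE EXTERNAL TABLE test_s3a (
  subscriber_id STRING,
  event_time    STRING
)
STORED AS ORC
LOCATION 's3a://bucketname/orc/'
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Naming only the needed columns lets ORC skip the other column streams;
-- SELECT * forces every column to be read.
SELECT subscriber_id, event_time
FROM test_s3a
WHERE subscriber_id = '12345678';
```

Skipping by statistics is only likely to help if the files were written sorted, or at least clustered, by subscriber_id. Otherwise most stripes will span the value being searched for and will still be read.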
