Please read this similar thread for more context on why S3 is slow. You should
be using the newer s3a implementation, which brings large performance gains
over the old s3n.
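
As a rough sketch only: assuming the cluster's Hadoop build includes the s3a
connector and S3 credentials are already configured (neither is confirmed in
this thread), switching usually comes down to changing the URI scheme in the
table location. The table name test_s3a and the single column below are
placeholders, not the real 180-column schema.

```sql
-- Hypothetical table name; only one placeholder column is shown.
-- Same bucket path as the original table, but addressed through s3a://.
CREATE EXTERNAL TABLE test_s3a (
  subscriber_id STRING
)
STORED AS ORC
LOCATION 's3a://bucketname/orc/'
TBLPROPERTIES ('orc.compress'='SNAPPY');
```

Running the same WHERE subscriber_id = '12345678' query against both tables
should show whether the connector makes a difference for your data.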

Thanks
Prasanth


> On Nov 10, 2015, at 1:05 PM, Alan Gates <[email protected]> wrote:
> 
> ORC does a lot of seeks inside its files in order to only load the data you 
> need.  S3 doesn't handle seeks well, so ORC does not give you the same 
> improvements that you would see using it on HDFS directly.
> 
> Alan.
> 
>> Mcbride, Neil <mailto:[email protected]>, November 10, 2015 at 5:11
>> Hi there!
>> 
>> ORC sounds perfect for our use case (lots of Hive queries of the type 'WHERE 
>> subscriber = '12345678'). However, I can't seem to see any performance gains 
>> over the use of standard LZO compression when the files are stored on S3. I 
>> raised a case to AWS and was told they also weren't seeing any benefits. The 
>> engineer I spoke to used to work for Hortonworks and said he'd had great 
>> experiences with ORC.
>> 
>> Before I write it off completely for us, I wondered if you were aware of any 
>> particular set up that is required to make better use of ORC on S3?
>> 
>> A typical setup is:
>> 
>> CREATE EXTERNAL TABLE test (
>>      180_columns_wide
>> ) STORED AS ORC
>> LOCATION 's3://bucketname/orc/'
>> TBLPROPERTIES ('orc.compress'='SNAPPY')
>> ;
>> 
>> select * from test
>> where subscriber_id = '12345678'
>> 
>> We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).
>> 
>> -- 
>> Regards
>> Neil
