ORC does a lot of seeks inside its files so that it loads only the data you need. S3 doesn't handle seeks well, so ORC on S3 does not give you the same improvements that you would see using it on HDFS directly.

Alan.
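
For readers wondering which Hive settings govern the seek-based skipping described above, here is a minimal sketch. Neither setting is mentioned in the thread, and it is an assumption that they behave the same on EMR 4.1 with Hive 1.0. Nothing here is confirmed to close the gap on S3. The `test` table and `subscriber_id` column are the ones from the question quoted below.

```sql
-- Assumption: not taken from the thread. These are standard Hive session
-- settings that control whether a WHERE filter reaches the ORC reader.

-- Predicate pushdown in the query planner.
SET hive.optimize.ppd=true;

-- Pass the filter to the ORC reader so it can skip row groups
-- using the min/max statistics stored in the file.
SET hive.optimize.index.filter=true;

-- Same query shape as in the question. Selecting only the columns you need
-- (instead of *) lets ORC read fewer column streams.
SELECT subscriber_id
FROM test
WHERE subscriber_id = '12345678';
```

Row-group skipping tends to help most when the data was sorted or clustered on the filter column at write time. Whether the extra ranged reads pay off on S3 is exactly the concern raised in the reply above.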

Mcbride, Neil
November 10, 2015 at 5:11
Hi there!

ORC sounds perfect for our use case (lots of Hive queries of the type "WHERE subscriber = '12345678'"). However, I can't seem to see any performance gains over the use of standard LZO compression when the files are stored on S3. I raised a case with AWS and was told they also weren't seeing any benefits. The engineer I spoke to used to work for Hortonworks and said he'd had great experiences with ORC.

Before I write it off completely for us, I wondered if you were aware of any particular setup that is required to make better use of ORC on S3?

A typical setup is:

CREATE EXTERNAL TABLE test (
     180_columns_wide
) STORED AS ORC
LOCATION 's3://bucketname/orc/'
TBLPROPERTIES ('orc.compress'='SNAPPY')
;

select * from test
where subscriber_id = '12345678'

We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).

--
Regards
Neil
