Re: Tez / Orc / S3

Alan Gates Tue, 10 Nov 2015 11:06:18 -0800

ORC does a lot of seeks inside its files in order to only load the data you need. S3 doesn't handle seeks well, so ORC does not give you the same improvements that you would see using it on HDFS directly.
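
For illustration only (this is not part of the original message): the seek-based skipping described above happens when Hive pushes the WHERE filter down to the ORC reader, which then uses per-stripe and per-row-group min/max statistics to decide which byte ranges to read. A minimal sketch, assuming the `test` table and `subscriber_id` column from the quoted question and the standard Hive settings `hive.optimize.ppd` and `hive.optimize.index.filter`; whether this yields any gain on S3 is not established in this thread.

```sql
-- Assumption: these settings are not mentioned in the thread; they are
-- standard Hive options shown here only to illustrate the mechanism.

-- Predicate pushdown is normally enabled by default; shown for completeness.
SET hive.optimize.ppd=true;

-- Let the ORC reader use its row-index min/max statistics to skip
-- row groups that cannot match the filter (typically off by default).
SET hive.optimize.index.filter=true;

-- The filter below can then be evaluated against ORC statistics first,
-- so only stripes/row groups that may contain the value are read.
SELECT *
FROM test
WHERE subscriber_id = '12345678';
```

Skipping of this kind is only effective when the filtered values are clustered in the files (for example, data written sorted by `subscriber_id`); otherwise nearly every row group's min/max range contains the value and little can be skipped.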


Alan.

Mcbride, Neil <mailto:[email protected]>
November 10, 2015 at 5:11
Hi there!
ORC sounds perfect for our use case (lots of Hive queries of the type "WHERE subscriber = '12345678'"). However, I can't seem to see any performance gains over the use of standard LZO compression when the files are stored on S3. I raised a case with AWS and was told they also weren't seeing any benefits. The engineer I spoke to used to work for Hortonworks and said he'd had great experiences with ORC.
Before I write it off completely for us, I wondered if you were aware of any particular setup that is required to make better use of ORC on S3?
A typical setup is:

CREATE EXTERNAL TABLE test (
     180_columns_wide
) STORED AS ORC
LOCATION 's3://bucketname/orc/'
TBLPROPERTIES ('orc.compress'='SNAPPY')
;

select * from test
where subscriber_id = '12345678'

We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).

--
Regards
Neil
