ORC does a lot of seeks inside its files in order to only load the data
you need. S3 doesn't handle seeks well, so ORC does not give you the
same improvements that you would see using it on HDFS directly.
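
To make the access pattern concrete, here is a minimal Python sketch. It is a toy model, not the real ORC layout: the names CountingFile, build_toy_file, and read_one_column are made up for illustration, the stripe and column sizes are arbitrary, and an in-memory buffer stands in for the stored file. It only shows that reading one column out of many means one seek and one small read per stripe, which is cheap on a local or HDFS file but may be costly on a store where seeks are slow.

```python
import io


class CountingFile(io.BytesIO):
    """In-memory file that counts seek() calls (hypothetical stand-in for a stored file)."""

    def __init__(self, data):
        super().__init__(data)
        self.seeks = 0

    def seek(self, offset, whence=io.SEEK_SET):
        self.seeks += 1
        return super().seek(offset, whence)


def build_toy_file(stripes, columns, chunk):
    """Lay the data out stripe by stripe, one fixed-size chunk per column."""
    out = bytearray()
    for _ in range(stripes):
        for c in range(columns):
            out += bytes([c % 256]) * chunk
    return bytes(out)


def read_one_column(f, stripes, columns, col, chunk):
    """Read a single column: one seek plus one small read per stripe."""
    data = bytearray()
    for s in range(stripes):
        f.seek((s * columns + col) * chunk)
        data += f.read(chunk)
    return bytes(data)


# Arbitrary toy sizes; 180 columns mirrors the table width mentioned in the thread.
STRIPES, COLUMNS, CHUNK = 10, 180, 4
f = CountingFile(build_toy_file(STRIPES, COLUMNS, CHUNK))
column = read_one_column(f, STRIPES, COLUMNS, col=7, chunk=CHUNK)
bytes_total = STRIPES * COLUMNS * CHUNK

# seeks performed, bytes actually read, bytes in the whole file
print(f.seeks, len(column), bytes_total)  # prints: 10 40 7200
```

The reader touches only 40 of 7200 bytes, but pays for 10 seeks to do so. Whether that trade is a win depends on how expensive a seek is on the underlying store.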
Alan.
Mcbride, Neil
November 10, 2015 at 5:11
Hi there!
ORC sounds perfect for our use case (lots of Hive queries of the type
"WHERE subscriber = '12345678'"). However, I can't seem to see any
performance gains over the use of standard LZO compression when the
files are stored on S3. I raised a case with AWS and was told they also
weren't seeing any benefits. The engineer I spoke to used to work for
Hortonworks and said he'd had great experiences with ORC.
Before I write it off completely for us, I wondered if you were aware
of any particular setup that is required to make better use of ORC on S3?
A typical setup is:
CREATE EXTERNAL TABLE test (
180_columns_wide
) STORED AS ORC
LOCATION 's3://bucketname/orc/'
TBLPROPERTIES ('orc.compress'='SNAPPY')
;
select * from test
where subscriber_id = '12345678';
We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).
--
Regards
Neil