Hi there!

ORC sounds perfect for our use case (lots of Hive queries of the type
"WHERE subscriber = '12345678'"). However, I can't see any
performance gains over the use of standard LZO compression when the files
are stored on S3. I raised a case to AWS and was told they also weren't
seeing any benefits. The engineer I spoke to used to work for Hortonworks
and said he'd had great experiences with ORC.

Before I write it off completely for us, I wondered if you were aware of
any particular setup that is required to make better use of ORC on S3?

A typical setup is:

CREATE EXTERNAL TABLE test (
     180_columns_wide
) STORED AS ORC
LOCATION 's3://bucketname/orc/'
TBLPROPERTIES ('orc.compress'='SNAPPY')
;

select * from test
where subscriber_id = '12345678';

We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).

-- 
Regards
Neil
