Oops.. missed the link http://mail-archives.apache.org/mod_mbox/orc-user/201509.mbox/%[email protected]%3e
Thanks
Prasanth

> On Nov 10, 2015, at 1:10 PM, Prasanth J <[email protected]> wrote:
>
> Please read this similar thread for more context on why S3 is slow. You
> should be using the newer s3a implementation, which got huge performance
> gains over the old s3n.
>
> Thanks
> Prasanth
>
>> On Nov 10, 2015, at 1:05 PM, Alan Gates <[email protected]> wrote:
>>
>> ORC does a lot of seeks inside its files in order to only load the data you
>> need. S3 doesn't handle seeks well, so ORC does not give you the same
>> improvements that you would see using it on HDFS directly.
>>
>> Alan.
>>
>>> Mcbride, Neil <[email protected]> November 10, 2015 at 5:11
>>> Hi there!
>>>
>>> ORC sounds perfect for our use case (lots of Hive queries of the type
>>> 'WHERE subscriber = '12345678'). However, I can't seem to see any
>>> performance gains over the use of standard LZO compression when the files
>>> are stored on S3. I raised a case to AWS and was told they also weren't
>>> seeing any benefits. The engineer I spoke to used to work for Hortonworks
>>> and said he'd had great experiences with ORC.
>>>
>>> Before I write it off completely for us, I wondered if you were aware of
>>> any particular setup that is required to make better use of ORC on S3?
>>>
>>> A typical setup is:
>>>
>>> CREATE EXTERNAL TABLE test (
>>> 180_columns_wide
>>> ) STORED AS ORC
>>> LOCATION 's3://bucketname/orc/'
>>> TBLPROPERTIES ('orc.compress'='SNAPPY')
>>> ;
>>>
>>> select * from test
>>> where subscriber_id = '12345678'
>>>
>>> We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).
>>>
>>> --
>>> Regards
>>> Neil
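
For reference, a minimal sketch of how the setup quoted above might look with the s3a suggestion applied. Everything here is an assumption rather than a tested configuration: it presumes the cluster's Hadoop build ships the s3a connector with credentials already configured, the single column stands in for the real 180-column schema, and the two SET lines (standard Hive predicate-pushdown options) are not mentioned anywhere in this thread. EMR may handle s3:// through its own connector, so whether s3a is available or faster there would need checking.

```sql
-- Assumption: s3a is available on the cluster; bucket and path are the
-- placeholders from the original message.
CREATE EXTERNAL TABLE test (
  subscriber_id STRING  -- placeholder; the real table is about 180 columns wide
)
STORED AS ORC
LOCATION 's3a://bucketname/orc/'
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Ask Hive to push the WHERE predicate down to the ORC reader so that
-- file statistics can be used to skip data that cannot match.
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;

SELECT *
FROM test
WHERE subscriber_id = '12345678';
```

Since a selective lookup like this depends on seeking within the ORC files, any gain on S3 may still be smaller than on HDFS, for the reason Alan gives above.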
