Please read this similar thread for more context on why S3 is slow. You should be using the newer s3a implementation, which has seen large performance gains over the old s3n.
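
As a rough sketch of what switching to s3a could look like for the table in your original mail: this assumes the Hadoop build on your cluster ships the s3a connector and that credentials are already configured, which I have not checked for EMR. The table name test_s3a and the single column shown are placeholders, not your real schema.

```sql
-- Hypothetical variant of the table from the original question.
-- Only the LOCATION scheme changes (s3a:// instead of s3://).
-- test_s3a and the one-column schema are illustrative placeholders.
CREATE EXTERNAL TABLE test_s3a (
  subscriber_id STRING
)
STORED AS ORC
LOCATION 's3a://bucketname/orc/'
TBLPROPERTIES ('orc.compress'='SNAPPY');

-- Same point lookup as before, now read through the s3a connector.
SELECT * FROM test_s3a
WHERE subscriber_id = '12345678';
```

Running the same query against both locations should show whether the connector makes a difference for your data.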

Thanks
Prasanth

> On Nov 10, 2015, at 1:05 PM, Alan Gates <[email protected]> wrote:
>
> ORC does a lot of seeks inside its files in order to only load the data you
> need. S3 doesn't handle seeks well, so ORC does not give you the same
> improvements that you would see using it on HDFS directly.
>
> Alan.
>
>> Mcbride, Neil <mailto:[email protected]> November 10, 2015 at 5:11
>> Hi there!
>>
>> ORC sounds perfect for our use case (lots of Hive queries of the type 'WHERE
>> subscriber = '12345678'). However, I can't seem to see any performance gains
>> over the use of standard LZO compression when the files are stored on S3. I
>> raised a case to AWS and was told they also weren't seeing any benefits. The
>> engineer I spoke to used to work for Hortonworks and said he'd had great
>> experiences with ORC.
>>
>> Before I write it off completely for us, I wondered if you were aware of any
>> particular set up that is required to make better use of ORC on S3?
>>
>> A typical setup is:
>>
>> CREATE EXTERNAL TABLE test
>> 180_columns_wide
>> ) STORED AS ORC
>> LOCATION 's3://bucketname/orc/'
>> TBLPROPERTIES ('orc.compress'='SNAPPY')
>> ;
>>
>> select * from test
>> where subscriber_id = '12345678'
>>
>> We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).
>>
>> --
>> Regards
>> Neil
