Hi,

>
> http://mail-archives.apache.org/mod_mbox/orc-user/201509.mbox/%3c560AB8D2.
> [email protected]%3e
...
> ORC does a lot of seeks inside its files in order to only load the data
> you need. S3 doesn't handle seeks well, so ORC does not give you the
> same improvements that you would see using it on HDFS directly.

ORC changed the way it generates seeks recently in hive-2.0, to get
connection re-use working (HIVE-11945).

The S3A drivers still need to be fixed to handle seeks via HTTP range
requests (HADOOP-12444), but the EMR drivers are better at it, I think.

> select * from test where subscriber_id = '12345678'

Are the filter columns strings?

I think the version you're running doesn't have bloom filter indexes,
which are more or less necessary for strings (since a uniformly
distributed 1-byte prefix effectively ruins regular min/max index lookups).

You can work around that issue by laying out the data in sorted order to
get a tight grouping:

    insert overwrite table test
    select * from test sort by subscriber_id; -- sort by, not order by

Also, "select *" is a corner case: it doesn't get you any benefit from the
columnar layout, since all columns are being read.

> We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).

Wow, I did not know this. I will try this.

Cheers,
Gopal
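
P.S. To illustrate why the sort matters, here is a rough Python sketch. It is only a toy model of min/max index pruning, not ORC's actual reader code. The function names are made up for the example, the ids are random 8-digit strings, and the group size of 10,000 rows is an assumption chosen to match ORC's default row index stride. It counts how many row groups have a [min, max] range that could contain a given subscriber_id, before and after sorting.

```python
import random

ROW_GROUP_SIZE = 10_000  # assumption: same as ORC's default row index stride


def row_group_ranges(values, size=ROW_GROUP_SIZE):
    """Return the (min, max) pair for each consecutive group of `size` rows."""
    return [
        (min(values[i:i + size]), max(values[i:i + size]))
        for i in range(0, len(values), size)
    ]


def groups_to_read(ranges, key):
    """Count the row groups whose [min, max] range cannot rule out `key`."""
    return sum(1 for lo, hi in ranges if lo <= key <= hi)


def demo(n_rows=200_000, seed=0):
    """Compare pruning on randomly ordered ids against the same ids sorted."""
    rng = random.Random(seed)
    # Hypothetical data: zero-padded 8-digit ids, uniformly distributed.
    ids = [f"{rng.randrange(10**8):08d}" for _ in range(n_rows)]
    key = ids[0]  # look up an id that is known to exist

    unsorted_ranges = row_group_ranges(ids)
    sorted_ranges = row_group_ranges(sorted(ids))

    total = len(unsorted_ranges)
    return total, groups_to_read(unsorted_ranges, key), groups_to_read(sorted_ranges, key)


if __name__ == "__main__":
    total, unsorted_hits, sorted_hits = demo()
    print(f"row groups: {total}")
    print(f"groups read without sorting: {unsorted_hits}")
    print(f"groups read after sorting:   {sorted_hits}")
```

With uniformly distributed ids, almost every unsorted group spans nearly the whole key range, so the min/max index skips next to nothing. After sorting, the ranges are narrow and barely overlap, so the lookup touches only one group, or two if the key sits on a group boundary.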
