Re: Tez / Orc / S3

Gopal Vijayaraghavan Tue, 10 Nov 2015 11:25:34 -0800

Hi,

> 
>http://mail-archives.apache.org/mod_mbox/orc-user/201509.mbox/%3c560AB8D2.
>[email protected]%3e
...
> ORC does a lot of seeks inside its files in order to only load the data
>you need.  S3 doesn't handle seeks well, so ORC does not give you the
>same improvements that you would see using it on HDFS directly.


ORC changed the way it generates seeks recently in hive-2.0, to get
connection re-use working (HIVE-11945).

The S3A drivers still need to be fixed to handle seeks via HTTP range
requests (HADOOP-12444), but the EMR drivers are better at it, I think.

> select * from test where subscriber_id = '12345678'

Are the filter columns strings?


I think the version you're running doesn't have bloom filter indexes,
which are somewhat necessary for strings (since a uniformly distributed
1-byte prefix effectively ruins regular min/max index lookups).


You can work around that issue by laying out the data in sorted order, so
that each value gets a tight grouping:

insert overwrite table test select * from test sort by subscriber_id;
-- sort by, not order by
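
To illustrate why the sorted layout helps, here is a minimal Python sketch.
It is not ORC's actual reader, only a toy model of a min/max index over row
groups of 10,000 rows (ORC's default index stride) within a single file. The
ids are made-up, uniformly distributed 8-digit strings, and the helper name
groups_to_read is hypothetical.

```python
import random

ROWS_PER_GROUP = 10_000  # ORC's default row index stride


def groups_to_read(values, target, stride=ROWS_PER_GROUP):
    """Count row groups whose [min, max] range could contain target."""
    hits = 0
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        # A min/max index can only skip a group if target is out of range.
        if min(chunk) <= target <= max(chunk):
            hits += 1
    return hits


random.seed(0)
# Assumed data: 200,000 uniformly distributed 8-digit string ids.
ids = [f"{random.randrange(10**8):08d}" for _ in range(200_000)]
target = "12345678"
total_groups = len(ids) // ROWS_PER_GROUP

unsorted_hits = groups_to_read(ids, target)
sorted_hits = groups_to_read(sorted(ids), target)
print(f"unsorted: read {unsorted_hits} of {total_groups} row groups")
print(f"sorted:   read {sorted_hits} of {total_groups} row groups")
```

With unsorted ids, practically every row group's range spans the target, so
nothing can be skipped. Once the file is sorted, the ranges no longer
overlap and only the group or two around the target has to be read.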

Also, "select *" is a corner case: it gets no benefit from the columnar
layout, because all columns are being read.

> We are using Tez on EMR 4.1 (which uses Hive 1.0, I believe).


Wow, I did not know this. I will try this.

Cheers,
Gopal
