Thanks Jason.

Just a bit more background on my question. Modern MPPs such as Teradata allow for full data co-locality via hash distribution of keys. This ensures that join data of two large tables will always end up on the same node and data co-locality is always ensured (no network overhead), which makes large table joins ultra fast. In Hive we have bucketing, which somewhat alleviates the problem. However, it does not guarantee data co-locality as two corresponding bucket files may end up on two different nodes. While it reduces network overhead it does not eliminate it.


On 13/02/2015 16:25, Jason Altekruse wrote:
I don't think this actually answers your question. You can limit your
filters by directory to avoid reads from the filesystem, and some of the
storage plugins like Hbase and Hive implement scan level pushdown, but I do
not know if this is sophisticated enough that a join would be aware of the
partitioning. I'll keep watching the thread and reach out to our planning
experts if they don't chime in.

- Jason

On Fri, Feb 13, 2015 at 5:45 AM, Carol McDonald <[email protected]>
wrote:

yes you can read about it here

https://cwiki.apache.org/confluence/display/DRILL/Partition+Pruning

On Fri, Feb 13, 2015 at 6:42 AM, Uli Bethke <[email protected]> wrote:

I have two large tables in Hive (both partitioned and bucketed). Using
Map
side joins I can take advantage of data locality for the hash join table.

Using Drill does the optimizer take the partitioning and bucketing into
consideration?
thanks
uli



--
___________________________
Uli Bethke
Co-founder Sonra
p: +353 86 32 83 040
w: www.sonra.io
l: linkedin.com/in/ulibethke
t: twitter.com/ubethke

Reply via email to