Re: Large Table Joins

Uli Bethke Fri, 13 Feb 2015 08:39:13 -0800

Thanks Jason.

Just a bit more background on my question. Modern MPPs such as Teradata allow for full data co-locality via hash distribution of keys. This ensures that join data of two large tables will always end up on the same node and data co-locality is always ensured (no network overhead), which makes large table joins ultra fast. In Hive we have bucketing, which somewhat alleviates the problem. However, it does not guarantee data co-locality as two corresponding bucket files may end up on two different nodes. While it reduces network overhead it does not eliminate it.
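
To make the co-locality point concrete, here is a minimal sketch in Python. It is not Teradata or Hive code: the node count, the sample rows and the function names (node_for, distribute, local_joins) are all made up for illustration. It only shows the idea that when both tables are distributed with the same hash on the join key, every node can join its own slice without fetching rows from another node.

```python
import zlib
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size, chosen only for this example


def node_for(key, num_nodes=NUM_NODES):
    """Map a join key to a node using a deterministic hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_nodes


def distribute(rows, num_nodes=NUM_NODES):
    """Place each (key, value) row on the node chosen by its key's hash."""
    nodes = defaultdict(list)
    for key, value in rows:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes


def local_joins(left_nodes, right_nodes, num_nodes=NUM_NODES):
    """Join each node's slice of both tables; no row crosses nodes."""
    result = []
    for node in range(num_nodes):
        # Build a hash table from the right-hand slice held on this node.
        lookup = defaultdict(list)
        for key, value in right_nodes.get(node, []):
            lookup[key].append(value)
        # Probe it with the left-hand slice held on the same node.
        for key, value in left_nodes.get(node, []):
            for other in lookup.get(key, []):
                result.append((key, value, other))
    return result


# Made-up sample data: (join key, payload)
orders = [(1, "order-a"), (2, "order-b"), (3, "order-c"), (1, "order-d")]
customers = [(1, "alice"), (2, "bob"), (4, "dave")]

joined = local_joins(distribute(orders), distribute(customers))
```

The per-node results add up to the full inner join because equal keys always hash to the same node. Hive bucketing gives the same key-to-bucket mapping, but where each bucket file is stored is decided separately, so the two slices that local_joins treats as sitting together may in practice be on different machines.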



On 13/02/2015 16:25, Jason Altekruse wrote:

I don't think this actually answers your question. You can limit your
filters by directory to avoid reads from the filesystem, and some of the
storage plugins like HBase and Hive implement scan-level pushdown, but I do
not know if this is sophisticated enough that a join would be aware of the
partitioning. I'll keep watching the thread and reach out to our planning
experts if they don't chime in.

- Jason

On Fri, Feb 13, 2015 at 5:45 AM, Carol McDonald <[email protected]>
wrote:

yes you can read about it here

https://cwiki.apache.org/confluence/display/DRILL/Partition+Pruning

On Fri, Feb 13, 2015 at 6:42 AM, Uli Bethke <[email protected]> wrote:

I have two large tables in Hive (both partitioned and bucketed). Using
map-side joins I can take advantage of data locality for the hash join table.

Using Drill, does the optimizer take the partitioning and bucketing into
consideration?
thanks
uli


--
___________________________
Uli Bethke
Co-founder Sonra
p: +353 86 32 83 040
w: www.sonra.io
l: linkedin.com/in/ulibethke
t: twitter.com/ubethke
