Re: Large Table Joins

Aman Sinha Fri, 13 Feb 2015 08:47:32 -0800

Drill joins (either hash join or merge join) currently generate 3 types of
plans:  hash distribute both sides of the join, hash distribute left side
and broadcast right side,  broadcast right side and don't distribute left
side.  These are cost-based decisions based on cardinalities.  However, the
partitioning information of the table is not currently used because at
least for now the partitioning information of the table is not exposed to
Drill.


This does not mean it cannot be added in the future.  Currently, Drill does
not maintain a catalog or extensive metadata - it is designed for
schema-less data.  So in your use case it would end up re-distributing the
data.  But we have plans for exposing the distribution information to the
planner.

On Fri, Feb 13, 2015 at 8:37 AM, Uli Bethke <[email protected]> wrote:

> Thanks Jason.
>
> Just a bit more background on my question. Modern MPPs such as Teradata
> allow for full data co-locality via hash distribution of keys. This ensures
> that join data of two large tables will always end up on the same node and
> data co-locality is always ensured (no network overhead), which makes large
> table joins ultra fast. In Hive we have bucketing, which somewhat
> alleviates the problem. However, it does not guarantee data co-locality as
> two corresponding bucket files may end up on two different nodes. While it
> reduces network overhead it does not eliminate it.
>
>
>
> On 13/02/2015 16:25, Jason Altekruse wrote:
>
>> I don't think this actually answers your question. You can limit your
>> filters by directory to avoid reads from the filesystem, and some of the
>> storage plugins like Hbase and Hive implement scan level pushdown, but I
>> do
>> not know if this is sophisticated enough that a join would be aware of the
>> partitioning. I'll keep watching the thread and reach out to our planning
>> experts if they don't chime in.
>>
>> - Jason
>>
>> On Fri, Feb 13, 2015 at 5:45 AM, Carol McDonald <[email protected]>
>> wrote:
>>
>>  yes you can read about it here
>>>
>>> https://cwiki.apache.org/confluence/display/DRILL/Partition+Pruning
>>>
>>> On Fri, Feb 13, 2015 at 6:42 AM, Uli Bethke <[email protected]> wrote:
>>>
>>>  I have two large tables in Hive (both partitioned and bucketed). Using
>>>>
>>> Map
>>>
>>>> side joins I can take advantage of data locality for the hash join
>>>> table.
>>>>
>>>> Using Drill does the optimizer take the partitioning and bucketing into
>>>> consideration?
>>>> thanks
>>>> uli
>>>>
>>>>
>>>>
> --
> ___________________________
> Uli Bethke
> Co-founder Sonra
> p: +353 86 32 83 040
> w: www.sonra.io
> l: linkedin.com/in/ulibethke
> t: twitter.com/ubethke
>
>

Re: Large Table Joins

Reply via email to