Re: Tuning Triangle Joins on Hive

Lefty Leverenz Sat, 02 Aug 2014 12:48:06 -0700

>
> How do indexes work in Hive?
>

See the Indexes design doc
<https://cwiki.apache.org/confluence/display/Hive/IndexDev> in the Hive
wiki, although it hasn't been updated.


-- Lefty


On Sat, Aug 2, 2014 at 2:07 AM, chandra Reddy Bogala <
chandra.reddy2...@gmail.com> wrote:

> How do indexes work in Hive? I thought file formats like ORC have
> indexes in each block, but not a separate index that can help query
> performance.
> Thanks,
> Chandra
>
>
> On Fri, Aug 1, 2014 at 9:10 AM, Devopam Mittra <devo...@gmail.com> wrote:
>
>> Please try the following approach and let me know if you are not getting
>> better performance:
>>
>> 1. Ensure indexes are present on the dst and src columns in the
>> respective tables.
>> 2. Create a subset first by joining r2 and r3 (i.e., r3.src > r2.src)
>> into a physical table, and then create an index on its new src column
>> as well.
>> 3. Join this to r1
>>
>> If this approach works well, then try out the WITH SELECT ... using the
>> same approach; the only difference is that no physical intermediate
>> table will be created.
>>
>> Hope it helps..
>>
>> regards
>> Dev
>>
>>
>>
>>
>> On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fabuz...@stanford.edu>
>> wrote:
>>
>>> Hi,
>>>
>>> We're running various "triangle" join queries on Hive 0.9.0, and we're
>>> wondering if we can get any better performance. Here's the query we're
>>> running:
>>>
>>> SELECT count(*)
>>> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON
>>> (r2.dst = r3.src AND r3.dst = r1.src)
>>> WHERE r1.src < r2.src AND r2.src < r3.src;
>>>
>>> We're currently passing the following tuning parameters as well:
>>>
>>> set mapred.map.tasks=120;
>>> set mapred.reduce.tasks=120;
>>> set mapred.tasktracker.map.tasks.maximum=8;
>>> set mapred.tasktracker.reduce.tasks.maximum=8;
>>> set mapred.child.java.opts=-Xmx5120m;
>>>
>>> The dataset we're using has 5 million nodes and 70 million edges, and
>>> most of our time is spent on garbage collection. We have about 30 machines
>>> in our cluster, and each machine has 45GB of RAM. Any thoughts on how we
>>> can improve performance? Thanks in advance!
>>>
>>
>>
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
>>
>
>
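
The staged rewrite suggested in the quoted thread (materialize the r2/r3
pairs first, index them, then join to r1) can be checked for equivalence
against the original query on a tiny graph. The following is only a
minimal sketch under stated assumptions: it uses Python's built-in
sqlite3 module as a stand-in for Hive, and the table name `edges`, the
intermediate table `r2_r3`, the index names, and the sample data are all
hypothetical. Hive's own CREATE INDEX syntax differs, and nothing here
says anything about performance on a Hive cluster; it only confirms that
both forms return the same count.

```python
import sqlite3

# Hypothetical sample graph: two directed triangles (1->2->3->1 and
# 4->5->6->4) plus one extra edge (3->4) that closes no triangle.
EDGES = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4)]

# The query from the thread, with the placeholder `table` renamed to `edges`.
ORIGINAL = """
SELECT count(*)
FROM edges r1 JOIN edges r2 ON (r1.dst = r2.src)
     JOIN edges r3 ON (r2.dst = r3.src AND r3.dst = r1.src)
WHERE r1.src < r2.src AND r2.src < r3.src
"""

# Step 2: materialize the r2/r3 pairs that satisfy r3.src > r2.src.
STAGE = """
CREATE TABLE r2_r3 AS
SELECT r2.src AS src, r3.dst AS dst
FROM edges r2 JOIN edges r3 ON (r2.dst = r3.src)
WHERE r2.src < r3.src
"""

# Step 3: join the intermediate table back to r1.
FINAL = """
SELECT count(*)
FROM edges r1 JOIN r2_r3 p ON (r1.dst = p.src AND p.dst = r1.src)
WHERE r1.src < p.src
"""


def count_triangles(edges):
    """Return (count from original query, count from staged rewrite)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE edges (src INTEGER, dst INTEGER)")
    conn.executemany("INSERT INTO edges VALUES (?, ?)", edges)
    # Step 1: indexes on the join columns (SQLite syntax, not Hive's).
    conn.execute("CREATE INDEX edges_src_idx ON edges (src)")
    conn.execute("CREATE INDEX edges_dst_idx ON edges (dst)")
    original = conn.execute(ORIGINAL).fetchone()[0]
    conn.execute(STAGE)
    conn.execute("CREATE INDEX r2_r3_src_idx ON r2_r3 (src)")
    staged = conn.execute(FINAL).fetchone()[0]
    conn.close()
    return original, staged


if __name__ == "__main__":
    print(count_triangles(EDGES))
```

Both queries count only the rotation of each directed cycle in which
r1.src is the smallest node, so each triangle is counted once.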
