> How do indexes work in Hive?

See the Indexes design doc <https://cwiki.apache.org/confluence/display/Hive/IndexDev> in the Hive wiki, although it hasn't been updated.
-- Lefty

On Sat, Aug 2, 2014 at 2:07 AM, chandra Reddy Bogala <chandra.reddy2...@gmail.com> wrote:

> How do indexes work in Hive? I thought file formats like ORC have
> indexes in each block, but not a separate index that can help query
> performance.
>
> Thanks,
> Chandra
>
> On Fri, Aug 1, 2014 at 9:10 AM, Devopam Mittra <devo...@gmail.com> wrote:
>
>> Please try the following approach and let me know if you are not getting
>> better performance:
>>
>> 1. Ensure indexes are present on the dst and src columns in the respective
>>    tables.
>> 2. Create a subset first, taking r2 and r3 (i.e., r3.src > r2.src), in a
>>    physical table, and then create an index on its new src column as well.
>> 3. Join this to r1.
>>
>> If this approach works well, then try out the WITH SELECT ... using the
>> same approach, just without creating a physical intermediate table.
>>
>> Hope it helps.
>>
>> regards
>> Dev
>>
>> On Fri, Aug 1, 2014 at 12:58 AM, Firas Abuzaid <fabuz...@stanford.edu>
>> wrote:
>>
>>> Hi,
>>>
>>> We're running various "triangle" join queries on Hive 0.9.0, and we're
>>> wondering if we can get any better performance. Here's the query we're
>>> running:
>>>
>>> SELECT count(*)
>>> FROM table r1 JOIN table r2 ON (r1.dst = r2.src) JOIN table r3 ON
>>> (r2.dst = r3.src AND r3.dst = r1.src)
>>> WHERE r1.src < r2.src AND r2.src < r3.src;
>>>
>>> We're currently passing the following tuning parameters as well:
>>>
>>> set mapred.map.tasks=120;
>>> set mapred.reduce.tasks=120;
>>> set mapred.tasktracker.map.tasks.maximum=8;
>>> set mapred.tasktracker.reduce.tasks.maximum=8;
>>> set mapred.child.java.opts=-Xmx5120m;
>>>
>>> The dataset we're using has 5 million nodes and 70 million edges, and
>>> most of our time is spent on garbage collection. We have about 30 machines
>>> in our cluster, and each machine has 45GB of RAM. Any thoughts on how we
>>> can improve performance? Thanks in advance!
>>
>> --
>> Devopam Mittra
>> Life and Relations are not binary
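
For readers following the thread, the three steps in Devopam's reply might look roughly like the HiveQL below. This is only a sketch under assumptions: the table name `edges`, the intermediate table `r2_r3`, and the index names are hypothetical (the original post just calls the table "table"), and the thread does not establish whether Hive's planner actually uses these indexes for the joins. The range predicates are kept in WHERE rather than in the ON clauses, mirroring the original query.

```sql
-- Hypothetical names: edges(src, dst), r2_r3, and the *_idx index names.

-- Step 1: indexes on the join columns of the edge table.
CREATE INDEX edges_src_idx ON TABLE edges (src) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX edges_src_idx ON edges REBUILD;
CREATE INDEX edges_dst_idx ON TABLE edges (dst) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX edges_dst_idx ON edges REBUILD;

-- Step 2: materialize the r2-r3 pairs that satisfy r2.src < r3.src,
-- keeping only the two columns the final join needs.
CREATE TABLE r2_r3 AS
SELECT r2.src AS src, r3.dst AS dst
FROM edges r2 JOIN edges r3 ON (r2.dst = r3.src)
WHERE r2.src < r3.src;

-- Index the new src column of the intermediate table as well.
CREATE INDEX r2_r3_src_idx ON TABLE r2_r3 (src) AS 'COMPACT' WITH DEFERRED REBUILD;
ALTER INDEX r2_r3_src_idx ON r2_r3 REBUILD;

-- Step 3: close the triangle by joining the intermediate table to r1.
SELECT count(*)
FROM edges r1 JOIN r2_r3 p ON (r1.dst = p.src AND p.dst = r1.src)
WHERE r1.src < p.src;
```

Each row of `r2_r3` stands for one (r2, r3) pair, so the final count should match the original single-statement query; comparing the two on a small sample is a cheap sanity check before running on the full 70-million-edge dataset.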