Ok, very clear on the partitions, try to make them match the WHERE clauses, not 
so much about group clauses then ;)

The member_map contains 11.636.619 records atm, I think bucketing those would 
be good?
What's a good number to bucket them by then?

And is there any point in bucketing the visit_stats?

From: Tucker, Matt [mailto:matt.tuc...@disney.com]
Sent: Monday, April 23, 2012 5:30 PM
To: user@hive.apache.org
Subject: RE: When/how to use partitions and buckets usefully?

If you're only interested in a certain window of dates for analysis, a 
date-based partition scheme will be helpful, as it will trim partitions that 
aren't needed by the query before execution.

If the member_map table is small, you might consider testing the feasibility of 
map-side joins, as it will reduce the number of processing stages.  If 
member_map is large, bucketing on member_id will avoid having as many rows from 
visit_stats compared to each member_id for joins.

Matt Tucker

From: Ruben de Vries 
[mailto:ruben.devr...@hyves.nl]<mailto:[mailto:ruben.devr...@hyves.nl]>
Sent: Monday, April 23, 2012 11:19 AM
To: user@hive.apache.org<mailto:user@hive.apache.org>
Subject: When/how to use partitions and buckets usefully?

It seems there's enough information to be found on how to setup and use 
partitions and buckets.
But I'm more interested in how to figure out when and what columns you should 
be partitioning and bucketing to increase performance?!

In my case I got 2 tables, 1 visit_stats (member_id, date and some MAP cols 
which give me info about the visits) and 1 member_map (member_id, gender, age).

Usually I group by date and then one of the other col so I assume that 
partitioning on date is a good start?!

It seems the join of the member_map onto the visit_stats makes the queries a 
lot slower, can that be fixed by bucketing both tables? Or just one of them?

Maybe some ppl have written good blogs on this subject but I can't really seem 
to find them!?

Any help would be appreciated, thanks in advance :)

Reply via email to