Hi all, I have two tables I need to join and then summarize. They are both huge (about 1B rows each, in the relevant partitions) and the query runs for over 2 hours creating 5T intermediate data.
The current query looks like this: select t1.b,t1.c,t2.d,t2.e, count(*) from (select a,b,c from table baseTB1 where ... ) t1 -- filter by partition as well join (select a,d,e from baseTB2 where ...) t2 -- filter by partition as well on t1.a=t2.a group by t1.b,t1.c,t2.d,t2.e two questions: 1. would joining baseTB1 and baseTB2 directly (instead of subqueries) be better in any way? (I know subqueries cause a lot of writes of the intermediate data but we also understand it's best to filter down the data that is being joined, which is "more" correct?) 2. can I use 'distribute by ' and/or 'sort by' in some way that would help this? my understanding at the moment is that the problem lies in the fact that the reduces are on column a while the group by is on column b ... Any thoughts would be appreciated.