Hi all, I am using Hive (1.2.0) on Spark (1.6.0). Recently I hit a problem where a window-function query takes far too long. Table `A` has 3 columns and 5 million rows, and table `B` has 10 columns and 3 billion rows 🤡. My HQL looks like:

```
select *, …
from (
    select
        id,
        sampledate,
        max(b1) over wf7 as max_wf7_b1,
        avg(b2) over wfoy as avg_wfoy_b2,
        …
    from (
        select id, sampledate, NULL as b1, NULL as b2, …, 't1' as fromtable from A
        union all
        select id, sampledate, b1, b2, …, 't2' as fromtable from B
    ) tmp
    window wf7  as (partition by id order by sampledate
                    range between 3600 * 24 * 7 preceding and current row),
           wfoy as (partition by id order by sampledate
                    range between 3600 * 24 * 365 preceding and current row),
           …
) t_result
where fromtable = 't1'
```

This SQL takes 29+ hours on an 11-machine cluster with 600 GB of memory. In my opinion, the time goes into the `order by sampledate` inside the window definitions and into computing the window columns for table `B`'s records. Is there a setting so that table `B`'s records do not get the `avg_wfoy_b2` (and the other window) columns computed at all, given that `t_result` filters them out with `fromtable = 't1'` anyway?

I also have two workarounds in mind, but both lose some information from table `B`.

One is to remove the `order by` by adding flag columns: when `datediff(sampleDate_A, sampleDate_B)` is within 7 days I add a column `datediff_7` with value 7, when it is within 30 days I also add a column `datediff_30` with value 30, and so on; then for the `wf7` computation I filter on that flag and aggregate with `partition by` only (rough sketch below).

The other is to focus on table `B` and work with months instead of days: first compute the distance from the per-id minimum sampledate, `floor(months_between(sampledate_B, min(sampledate_B)))` as `month_b`, then aggregate `B` monthly, which shrinks the data to roughly 1/30 of its size, and define `window wfoy as (partition by id order by month_b range between 12 preceding and current row)` (rough sketch below).
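For the first idea, this is roughly what I mean; only a sketch, not tested. It replaces the `range` frames with a join on `id` plus conditional aggregation, so there is no `order by` at all. Table and column names follow the query above; the `between 0 and 7` / `between 0 and 365` bounds and the join-on-`id`-only shape are my assumptions.

```
-- Sketch only (idea 1): windows replaced by a join plus conditional
-- aggregation, so no order by / range frame is needed.
-- Assumes datediff(a.sampledate, b.sampledate) gives the day gap, as in
-- the description above; the 0..7 / 0..365 bounds mirror wf7 / wfoy.
select
    a.id,
    a.sampledate,
    max(case when datediff(a.sampledate, b.sampledate) between 0 and 7
             then b.b1 end) as max_wf7_b1,
    avg(case when datediff(a.sampledate, b.sampledate) between 0 and 365
             then b.b2 end) as avg_wfoy_b2
from A a
left join B b
  on a.id = b.id          -- joins every B row of the same id; may still be huge
group by a.id, a.sampledate
```

My worry is that the join on `id` alone can blow up when one id has many rows in both tables, so it probably still needs a date-range pre-filter on `B`.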
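And for the second idea, a sketch of the monthly pre-aggregation of `B` (again not tested; the CTE names `b_min` / `b_monthly` and the choice of aggregates are just for illustration, and `sampledate` is assumed to be a date/timestamp that `months_between` accepts):

```
-- Sketch only (idea 2): pre-aggregate B to one row per (id, month) before
-- the window, so the window input shrinks to roughly 1/30.
with b_min as (
    select id, min(sampledate) as min_sampledate
    from B
    group by id
),
b_monthly as (
    select
        b.id,
        floor(months_between(b.sampledate, m.min_sampledate)) as month_b,
        max(b.b1) as max_b1,
        avg(b.b2) as avg_b2    -- avg of monthly avgs is not the exact avg;
                               -- keep sum(b2) and count(b2) instead if that matters
    from B b
    join b_min m on b.id = m.id
    group by b.id, floor(months_between(b.sampledate, m.min_sampledate))
)
select
    id,
    month_b,
    max(max_b1) over wfoy as max_wfoy_b1,
    avg(avg_b2) over wfoy as avg_wfoy_b2
from b_monthly
window wfoy as (partition by id
                order by month_b
                range between 12 preceding and current row)
```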
Is there a better way to do this? Or can I somehow make it compute the window columns only for the main table `A`? 🤣
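One more partial idea I have been sketching (not tested, and it only shrinks the `B` side): since only `A`'s rows survive the final filter, `B` only matters as input to `A`'s frames, so `B` could be pruned before the `union all` down to rows that can ever fall inside some `A` row's widest frame. This assumes `sampledate` is a unix timestamp in seconds (to match the `3600 * 24 * 365` frame), that one year is the widest window, and `a_range` is a made-up name.

```
-- Sketch only: prune B before the union all so only rows that can fall
-- inside some A row's widest (one-year) frame reach the window stage.
with a_range as (
    select id,
           min(sampledate) - 3600 * 24 * 365 as low_ts,   -- earliest frame start needed
           max(sampledate)                   as high_ts   -- frames end at the current row
    from A
    group by id
)
select b.*                      -- use this in place of B in the union all
from B b
join a_range r on b.id = r.id
where b.sampledate between r.low_ts and r.high_ts
```

It does not remove the `order by`, but at least the 3 billion rows shrink to only the ids and date ranges that `A` actually needs.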