Hi all,
I am using Hive (1.2.0) on Spark (1.6.0). Recently I hit a problem where
window functions take far too much time:
I have table `A` with 3 columns and 5 million rows, and table `B` with 10
columns and 3 billion rows 🤡. My HQL looks like:
```
select *
from
(
  select id, sampledate,
         max(b1) over wf7 as max_wf7_b1,
         avg(b2) over wfoy as avg_wfoy_b2,
         ...
  from
  (
    select id, sampledate, NULL as b1, NULL as b2, ... 't1' as fromtable from A
    union all
    select id, sampledate, b1, b2, ... 't2' as fromtable from B
  ) tmp
  window wf7  as (partition by id order by sampledate
                  range between 3600 * 24 * 7 preceding and current row),
         wfoy as (partition by id order by sampledate
                  range between 3600 * 24 * 365 preceding and current row),
         ...
) t_result
where fromtable = 't1'
```
This SQL takes 29+ hours on an 11-node cluster with 600 GB of memory.
In my opinion, the time is wasted in the `order by sampledate` and in
computing the window values for table `B`'s records. Is there a setting so
that table `B`'s records skip computing the `avg_wfoy_b2` column, given that
`t_result` filters on `fromtable = 't1'` anyway?
Also, I can think of two workarounds, but both lose some information from
table `B`:
One is to remove the `order by` by adding flag columns: when
datediff(sampleDate_A, sampleDate_B) = 7, add a column named `datediff_7`
with value 7; when datediff(sampleDate_A, sampleDate_B) = 30, add two
columns, `datediff_7` with value 7 and `datediff_30` with value 30. Then,
when my window function is wf7, I filter on that flag so the calculation
only needs `partition by`.
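The first workaround might be sketched roughly as below. This is only my
interpretation (the join condition, the day-based `datediff` semantics, and
the column names are assumptions; if `sampledate` is a unix timestamp in
seconds it would first need `from_unixtime`):

```sql
-- Sketch: tag each B row by how far it lies before an A sample date,
-- then replace the ordered range frames with plain conditional aggregates.
select a.id,
       a.sampledate,
       max(case when datediff(a.sampledate, b.sampledate) between 0 and 7
                then b.b1 end) as max_wf7_b1,   -- stands in for wf7
       avg(case when datediff(a.sampledate, b.sampledate) between 0 and 365
                then b.b2 end) as avg_wfoy_b2   -- stands in for wfoy
from A a
join B b
  on a.id = b.id
 and b.sampledate between date_sub(a.sampledate, 365) and a.sampledate
group by a.id, a.sampledate;
```

This avoids sorting the full union by `sampledate`, at the cost of an
id-plus-range join against the 3-billion-row table.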
The other is to focus on table `B` and use months instead of days: first
compute the distance from min(sampledate), i.e.
floor(months_between(sampledate_B, min(sampledate_B))) as `month_b`, then do
the calculation monthly. The data shrinks to about 1/30, and the yearly
window becomes:
window wfoy as (partition by id order by month_b range between 12 preceding
and current row).
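The monthly pre-aggregation could look something like this sketch (names and
the choice of avg-per-month are assumptions; note that averaging monthly
averages is not the same as the overall average, which is part of the
information loss mentioned above):

```sql
-- Sketch: collapse B to one row per (id, month_b), then run the yearly
-- window over roughly 1/30 of the rows.
select id, month_b,
       avg(b2_month_avg) over wfoy as avg_wfoy_b2
from (
  select b.id,
         floor(months_between(b.sampledate, m.min_date)) as month_b,
         avg(b.b2) as b2_month_avg
  from B b
  join (select id, min(sampledate) as min_date from B group by id) m
    on b.id = m.id
  group by b.id, floor(months_between(b.sampledate, m.min_date))
) monthly
window wfoy as (partition by id order by month_b
                range between 12 preceding and current row);
```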


Is there a better way to do this? Or can I make it compute the window
functions only for the main table `A`'s rows? 🤣
