Re: hive window function can only calculate the main table?

2017-10-09 Thread Gopal Vijayaraghavan


> ) t_result where fromtable = 't1'
…
> This SQL takes 29+ hours on an 11-machine cluster with 600 GB of memory.
> In my opinion, the time is wasted in the `order by sampledate` and in
> calculating table B's records. Is there a setting to avoid computing the
> 'avg_wfoy_b2' column for table B's records, since `t_result` will filter on
> fromtable = 't1'?

> hive(1.2.0)

https://issues.apache.org/jira/browse/HIVE-12808
+
https://issues.apache.org/jira/browse/HIVE-16797

These are the relevant fixes for the issue you're looking into; both went into 
the Hive 2.x branches.

Since you have nested queries for OVER() clauses, you might also need this 
uncommitted patch:

https://issues.apache.org/jira/browse/HIVE-17668

Cheers,
Gopal




Re: does anyone care about list bucketing stored as directories?

2017-10-09 Thread Sergey Shelukhin
Ok, here’s a synopsis that is hopefully clearer.
1) LB, when stored as directories, adds a lot of low-level complexity to Hive 
tables that has to be accounted for in many places in the code where the files 
are written or modified - from FSOP to ACID/replication/export.
2) While working on some FSOP code I noticed that some of that logic is broken 
- e.g. the duplicate file removal from tasks, a pretty fundamental correctness 
feature in Hive, may be broken. LB also doesn’t appear to be compatible with 
e.g. regular bucketing.
3) The feature hasn’t seen development activity in a while; it also doesn’t 
appear to be used a lot.

Keeping with the theme of cleaning up “legacy” code for 3.0, I was proposing we 
remove it.

(2) also suggested that, if needed, it might be easier to implement similar 
functionality by adding some flexibility to partitions (which LB directories 
look like anyway); that would also keep the logic on a higher level of 
abstraction (split generation, partition pruning) as opposed to many low-level 
places like FSOP, etc.
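To make the "LB directories look like partitions" point concrete, here is a toy Python sketch of how list bucketing stored as directories routes rows on write. This is an illustration of the idea, not Hive's actual code; the default directory name below is a placeholder modeled on Hive's constant, and the `col=value` naming is assumed for illustration.

```python
# Toy sketch (not Hive's implementation) of list-bucketing directory routing:
# each value declared via SKEWED BY ... ON (...) gets its own subdirectory,
# and every other value goes to one shared default directory. The layout is
# what makes LB directories resemble partitions.

SKEWED_VALUES = {"key1", "key2"}  # values declared as skewed for this column
DEFAULT_DIR = "HIVE_DEFAULT_LIST_BUCKETING_DIR_NAME"  # placeholder name

def target_dir(skew_col: str, value: str) -> str:
    """Return the subdirectory a row lands in under the table/partition dir."""
    if value in SKEWED_VALUES:
        return f"{skew_col}={value}"
    return f"{skew_col}={DEFAULT_DIR}"
```

Because a skewed value's rows sit alone in their directory, a query filtering on that value can skip everything else; re-implementing that pruning at the partition/split-generation level is essentially what the proposal above suggests.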



From: Xuefu Zhang
Date: Sunday, October 8, 2017 at 20:56
To: "d...@hive.apache.org"
Cc: "user@hive.apache.org", Sergey Shelukhin
Subject: Re: does anyone care about list bucketing stored as directories?

Lack of a response doesn't necessarily mean "don't care". Maybe you can give a 
good description of the problem and the proposed solution. Frankly, I cannot 
make much sense of the previous email.

Thanks,
Xuefu

On Fri, Oct 6, 2017 at 5:05 PM, Sergey Shelukhin wrote:
Looks like nobody does… I’ll file a ticket to remove it shortly.

From: Sergey Shelukhin
Date: Tuesday, October 3, 2017 at 12:59
To: "user@hive.apache.org", "d...@hive.apache.org"
Subject: does anyone care about list bucketing stored as directories?

1) There seem to be some bugs and limitations in LB (e.g. incorrect cleanup - 
https://issues.apache.org/jira/browse/HIVE-14886) and nobody appears to as much 
as watch JIRAs ;) Does anyone actually use this stuff? Should we nuke it in 
3.0, and by 3.0 I mean I’ll remove it from master in a few weeks? :)

2) I actually wonder, on top of the same SQL syntax, wouldn’t it be much easier 
to add logic to partitioning to write skew values into partitions and non-skew 
values into a new type of default partition? It won’t affect nearly as many low 
level codepaths in obscure and unobvious ways, instead keeping all the logic in 
metastore and split generation, and would integrate with Hive features like PPD 
automatically.
Esp. if we are ok with the same limitations - e.g. if you add a new skew value 
right now, I’m not sure what happens to the rows with that value already 
sitting in the non-skew directories, but I don’t expect anything reasonable...




hive window function can only calculate the main table?

2017-10-09 Thread newbegin2020
Hi all,
I am using Hive (1.2.0) on Spark (1.6.0). These days I hit a problem where a 
window function wastes too much time:
I have table `A` with 3 columns and 5 million rows, and table `B` with 10 
columns and 3 billion rows. My HQL looks like:
```
select *, …
from
(
select id, sampledate,
max(b1) over wf7 as max_wf7_b1,
avg(b2) over wfoy as avg_wfoy_b2,
…
from
(
select id, sampledate, NULL as b1, NULL as b2, … 't1' as fromtable from A
union all
select id, sampledate, b1, b2, … 't2' as fromtable from B
) tmp
window wf7 as (partition by id order by sampledate
               range between 604800 preceding and current row),   -- 3600*24*7
       wfoy as (partition by id order by sampledate
               range between 31536000 preceding and current row)  -- 3600*24*365
…
) t_result where fromtable = 't1'
```
This SQL takes 29+ hours on an 11-machine cluster with 600 GB of memory.
In my opinion, the time is wasted in the `order by sampledate` and in 
calculating table `B`'s records. Is there a setting to avoid computing the 
`avg_wfoy_b2` column for table `B`'s records, since `t_result` will filter on 
`fromtable = 't1'`?
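For intuition, here is a pure-Python sketch of what the wfoy frame computes, and why `B`'s rows cannot simply be dropped before the window runs: `A`'s rows carry NULL for b2 (from the union), so their `avg_wfoy_b2` is made up entirely of `B`'s rows inside the same id/date frame. The data values below are made up for illustration.

```python
# Pure-Python model of: avg(b2) over (partition by id order by sampledate
# range between ONE_YEAR preceding and current row).
# Rows originating from A have b2 = None (the NULLs in the union), so their
# window average is computed entirely from B's rows in the frame -- which is
# why B must be scanned even though the final filter keeps only A's rows.

ONE_YEAR = 3600 * 24 * 365

def avg_wfoy_b2(rows, frame=ONE_YEAR):
    """rows: dicts with id, sampledate (epoch seconds), b2, fromtable."""
    out = []
    for r in rows:
        in_frame = [x["b2"] for x in rows
                    if x["id"] == r["id"]
                    and r["sampledate"] - frame <= x["sampledate"] <= r["sampledate"]
                    and x["b2"] is not None]        # avg() ignores NULLs
        avg = sum(in_frame) / len(in_frame) if in_frame else None
        out.append({**r, "avg_wfoy_b2": avg})
    return out

rows = [
    {"id": 1, "sampledate": 100, "b2": 10.0, "fromtable": "t2"},  # from B
    {"id": 1, "sampledate": 200, "b2": 30.0, "fromtable": "t2"},  # from B
    {"id": 1, "sampledate": 300, "b2": None, "fromtable": "t1"},  # from A
]
result = [r for r in avg_wfoy_b2(rows) if r["fromtable"] == "t1"]
# The surviving A row averages B's 10.0 and 30.0 -> 20.0
```

So a plain filter pushdown on fromtable would change `A`'s results; any speedup has to avoid computing the window *output* for `B`'s rows while still keeping them in the frames.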
Also, I have two workarounds to choose from, but both lose some information 
from table `B`:
One is to remove the `order by` by adding flag columns: when 
datediff(sampleDate_A, sampleDate_B) = 7, add a column named `datediff_7` with 
value 7; when datediff(sampleDate_A, sampleDate_B) = 30, add two columns, 
`datediff_7` with value 7 and `datediff_30` with value 30; finally, when the 
window is wf7, I filter the data so the calculation only uses the partition by.
The other is to focus on table `B` and use months instead of days: first 
compute the distance from the minimum sampledate, 
floor(months_between(sampledate_B, min(sampledate_B))) as `month_b`, then do 
the calculation monthly. The data shrinks to about 1/30, and the window 
becomes: window wfoy as (partition by id order by month_b range between 12 
preceding and current row).


Is there a better way to do this work? Or can I make it compute the window 
columns only for the main table `A`?