It seems you’ll have to go with JOIN.
Here are 2 options.
Dudu
select t0.id
as id_0
,min (named_struct ("dist",abs((t1.price - t0.price)/100) +
abs((t1.number - t0.number)/1000),"id",t1.id)).id as id_1
from t as t0
join t as t1
on t0.state =
t1.state
and t0.city =
t1.city
where t0.flag = 0
and t1.flag = 1
group by t0.id
;
select t.id_0
,t.id_1
from (select t0.id
as id_0
,t1.id
as id_1
,row_number () over (partition by t0.id order by
abs((t1.price - t0.price)/100) + abs((t1.number - t0.number)/1000)) as n
from t as t0
join t as t1
on t0.state =
t1.state
and t0.city =
t1.city
where t0.flag = 0
and t1.flag = 1
)
as t
where n = 1
;
From: Mobius ReX [mailto:[email protected]]
Sent: Tuesday, September 13, 2016 10:47 PM
To: [email protected]
Subject: What's the best way to find the nearest neighbor in Hive? Any
windowing function?
Given a table
> $cat data.csv
>
> ID,State,City,Price,Number,Flag
> 1,CA,A,100,1000,0
> 2,CA,A,96,1010,1
> 3,CA,A,195,1010,1
> 4,NY,B,124,2000,0
> 5,NY,B,128,2001,1
> 6,NY,C,24,30000,0
> 7,NY,C,27,30100,1
> 8,NY,C,29,30200,0
> 9,NY,C,39,33000,1
Expected Result:
ID0, ID1
1,2
4,5
6,7
8,7
for each ID with Flag=0 above, we want to find another ID from Flag=1, with the
same "State" and "City", and the nearest Price and Number normalized by the
corresponding values of that ID with Flag=0.
For example, ID = 1 and ID=2, has the same State and City, but different FLAG.
After normalized the Price and Number (Price divided by 100, Number divided by
1000), the distance between ID=1 and ID=2 is:
abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05
What's the best way to find such nearest neighbor in Hive? Can we use Lead/Lag
or Rank for this case? Any valuable tips will be greatly appreciated!