It seems you’ll have to go with JOIN. Here are 2 options. Dudu
select t0.id as id_0 ,min (named_struct ("dist",abs((t1.price - t0.price)/100) + abs((t1.number - t0.number)/1000),"id",t1.id)).id as id_1 from t as t0 join t as t1 on t0.state = t1.state and t0.city = t1.city where t0.flag = 0 and t1.flag = 1 group by t0.id ; select t.id_0 ,t.id_1 from (select t0.id as id_0 ,t1.id as id_1 ,row_number () over (partition by t0.id order by abs((t1.price - t0.price)/100) + abs((t1.number - t0.number)/1000)) as n from t as t0 join t as t1 on t0.state = t1.state and t0.city = t1.city where t0.flag = 0 and t1.flag = 1 ) as t where n = 1 ; From: Mobius ReX [mailto:aoi...@gmail.com] Sent: Tuesday, September 13, 2016 10:47 PM To: user@hive.apache.org Subject: What's the best way to find the nearest neighbor in Hive? Any windowing function? Given a table > $cat data.csv > > ID,State,City,Price,Number,Flag > 1,CA,A,100,1000,0 > 2,CA,A,96,1010,1 > 3,CA,A,195,1010,1 > 4,NY,B,124,2000,0 > 5,NY,B,128,2001,1 > 6,NY,C,24,30000,0 > 7,NY,C,27,30100,1 > 8,NY,C,29,30200,0 > 9,NY,C,39,33000,1 Expected Result: ID0, ID1 1,2 4,5 6,7 8,7 for each ID with Flag=0 above, we want to find another ID from Flag=1, with the same "State" and "City", and the nearest Price and Number normalized by the corresponding values of that ID with Flag=0. For example, ID = 1 and ID=2, has the same State and City, but different FLAG. After normalized the Price and Number (Price divided by 100, Number divided by 1000), the distance between ID=1 and ID=2 is: abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05 What's the best way to find such nearest neighbor in Hive? Can we use Lead/Lag or Rank for this case? Any valuable tips will be greatly appreciated!