It seems you’ll have to go with JOIN.
Here are 2 options.

Dudu


select      t0.id                                                               
                                            as id_0
           ,min (named_struct ("dist",abs((t1.price - t0.price)/100) + 
abs((t1.number - t0.number)/1000),"id",t1.id)).id    as id_1

from                t   as t0

            join    t   as t1

            on      t0.state    =
                    t1.state

                and t0.city    =
                    t1.city

where       t0.flag = 0
        and t1.flag = 1

group by    t0.id
;



select      t.id_0
           ,t.id_1

from       (select      t0.id                                                   
                                                            as id_0
                       ,t1.id                                                   
                                                            as id_1
                       ,row_number () over (partition by t0.id order by 
abs((t1.price - t0.price)/100) + abs((t1.number - t0.number)/1000)) as n

            from                t   as t0

                        join    t   as t1

                        on      t0.state    =
                                t1.state

                            and t0.city    =
                                t1.city

            where       t0.flag = 0
                    and t1.flag = 1
            )
            as t

where       n = 1
;




From: Mobius ReX [mailto:aoi...@gmail.com]
Sent: Tuesday, September 13, 2016 10:47 PM
To: user@hive.apache.org
Subject: What's the best way to find the nearest neighbor in Hive? Any 
windowing function?

Given a table

>     $cat data.csv
>
>     ID,State,City,Price,Number,Flag
>     1,CA,A,100,1000,0
>     2,CA,A,96,1010,1
>     3,CA,A,195,1010,1
>     4,NY,B,124,2000,0
>     5,NY,B,128,2001,1
>     6,NY,C,24,30000,0
>     7,NY,C,27,30100,1
>     8,NY,C,29,30200,0
>     9,NY,C,39,33000,1


Expected Result:

    ID0, ID1
    1,2
    4,5
    6,7
    8,7

for each ID with Flag=0 above, we want to find another ID from Flag=1, with the 
same "State" and "City", and the nearest Price and Number normalized by the 
corresponding values of that ID with Flag=0.

For example, ID = 1 and ID=2, has the same State and City, but different FLAG.
After normalized the Price and Number (Price divided by 100, Number divided by 
1000), the distance between ID=1 and ID=2 is:
abs(100/100 - 96/100) + abs(1000/1000 - 1010/1000) = 0.04 + 0.01 = 0.05


What's the best way to find such nearest neighbor in Hive? Can we use Lead/Lag 
or Rank for this case? Any valuable tips will be greatly appreciated!

Reply via email to