What's the best way to find the nearest neighbor in Hive? Any windowing function?

2016-09-13 Thread Mobius ReX
Given a table

> $ cat data.csv
>
> ID,State,City,Price,Number,Flag
> 1,CA,A,100,1000,0
> 2,CA,A,96,1010,1
> 3,CA,A,195,1010,1
> 4,NY,B,124,2000,0
> 5,NY,B,128,2001,1
> 6,NY,C,24,3,0
> 7,NY,C,27,30100,1
> 8,NY,C,29,30200,0
> 9,NY,C,39,33000,1

What's the best way to detect and remove outliers in a table?

2016-09-01 Thread Mobius ReX
Given a table with hundreds of columns, a mix of categorical and numerical attributes, and an unknown distribution of values, what's the best way to detect outliers? For example, given a table

Category Price
A 1
A 1.3
A 100
C