Re: Compute pairwise distance

2016-07-07 Thread Debasish Das
Hi Manoj, I have a Spark meetup talk that explains the issues with DIMSUM when you have to calculate row similarities. You can still use the PR since it has all the code you need, but I have not had time to refactor it for the merge. I believe a few kernels are supported as well. Thanks. Deb On

Re: Compute pairwise distance

2016-07-07 Thread Manoj Awasthi
Hi Debasish, All, I see the status of SPARK-4823 [0] is still "in-progress". I couldn't gather from the relevant pull request [1] whether part of it is already in 1.6.0 (the pull request is closed now). We are facing the same problem of computing pairwise distances between vectors where rows are > 5M and columns in

Re: Compute pairwise distance

2015-04-30 Thread Driesprong, Fokko
Thank you guys for the input. Ayan, I am not sure how this can be done using reduceByKey; as far as I can see (though I am not that advanced with Spark), this requires a groupByKey, which can be very costly. What would be nice is to transform the dataset which contains all the vectors like: val

Compute pairwise distance

2015-04-29 Thread Driesprong, Fokko
Dear Sparkers, I am working on an algorithm which requires the pairwise distance between all points (e.g. DBSCAN, LOF, etc.). Computing this for *n* points requires producing an n^2 matrix. If the distance measure is symmetric, this can be reduced to roughly (n^2)/2. What would be the optimal way of

Re: Compute pairwise distance

2015-04-29 Thread ayan guha
This is my first thought; please suggest any further improvements: 1. Create an RDD of your dataset 2. Do a cross join to generate pairs 3. Apply reduceByKey and compute the distance. You will get an RDD with key pairs and distances. Best Ayan On 30 Apr 2015 06:11, Driesprong, Fokko fo...@driesprong.frl
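
The steps above can be sketched in plain Python (not Spark) to show the shape of the computation. This is only an illustration under assumptions: the function name `cross_join_distances`, the sample points, and the choice of Euclidean distance are all hypothetical and not taken from the thread. In Spark, `RDD.cartesian` would play the role of the cross join; here `itertools.product` stands in for it.

```python
from itertools import product
import math


def cross_join_distances(points):
    """Return {(i, j): distance} for every pair of point indices with i < j.

    Hypothetical helper: Euclidean distance is an assumption, chosen only
    because it is symmetric.
    """
    indexed = list(enumerate(points))
    # Cross join: every point paired with every point (n^2 candidate pairs).
    pairs = product(indexed, indexed)
    # Keep i < j only, so each unordered pair is computed exactly once.
    return {
        (i, j): math.dist(p, q)
        for (i, p), (j, q) in pairs
        if i < j
    }


points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
distances = cross_join_distances(points)
```

Because each (i, j) key occurs exactly once after the filter, this sketch needs no reduce or group step; the distance is computed with a plain map over the pairs.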

Re: Compute pairwise distance

2015-04-29 Thread Debasish Das
The cross join shuffle space might not be needed, since most likely application-specific logic (topK etc.) lets you cut the shuffle space... Also, the brute-force approach will most likely serve as a benchmark to see how much better your clustering-based KNN solution is, since there are several ways you
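
As a rough illustration of the topK idea, the sketch below keeps only the k nearest neighbours of each point instead of the full distance matrix. It is plain Python rather than Spark, and everything in it is an assumption made for illustration: the helper name `top_k_neighbors`, the Euclidean distance, and the sample points. It still scans all pairs (brute force), but the retained output shrinks from n^2 entries to n*k.

```python
import heapq
import math


def top_k_neighbors(points, k):
    """Map each point index to its k nearest neighbours as (distance, index).

    Hypothetical helper: brute-force scan with Euclidean distance, keeping
    only the k smallest distances per point.
    """
    result = {}
    for i, p in enumerate(points):
        # Lazily generate distances to every other point.
        candidates = (
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        # nsmallest holds at most k candidates at a time.
        result[i] = heapq.nsmallest(k, candidates)
    return result


points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (6.0, 0.0)]
neighbors = top_k_neighbors(points, k=1)
```

A brute-force result like this can also serve as the reference against which an approximate or clustering-based KNN method is checked.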