Hi Manoj,
I have a Spark meetup talk that explains the issues with DIMSUM when you
have to calculate row similarities. You can still use the PR since it has
all the code you need, but I have not had time to refactor it for the merge.
I believe a few kernels are supported as well.
Thanks.
Deb
Hi Debasish, All,
I see the status of SPARK-4823 [0] is still "in progress". I couldn't
tell from the relevant pull request [1] whether part of it is already in 1.6.0
(it's closed now). We are facing the same problem of computing pairwise
distances between vectors where rows are > 5M and columns in
Thank you guys for the input.
Ayan, I am not sure how this can be done using reduceByKey; as far as I can
see (but I am not so advanced with Spark), this requires a groupByKey, which
can be very costly. What would be nice is to transform the dataset which
contains all the vectors like:
val
Dear Sparkers,
I am working on an algorithm which requires the pairwise distance between all
points (e.g. DBSCAN, LOF, etc.). Computing this for *n* points will
produce an n^2 matrix. If the distance measure is symmetric, this can be
reduced to (n^2)/2. What would be the most optimal way of
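For concreteness, the symmetric case gives exactly n*(n-1)/2 unordered pairs, which is roughly n^2/2. A minimal local sketch (plain Python, not Spark) confirming the count:

```python
from itertools import combinations

# Number of unordered point pairs for n points: n*(n-1)/2, roughly n^2/2.
n = 1000
pair_count = sum(1 for _ in combinations(range(n), 2))
print(pair_count)  # 499500 == 1000 * 999 // 2
```

At n = 5M this is ~1.25e13 pairs, which is why the full matrix is rarely materialized.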
This is my first thought, please suggest any further improvement:
1. Create an RDD of your dataset
2. Do a cross join to generate pairs
3. Apply reduceByKey and compute the distance. You will get an RDD with key
pairs and distances
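The steps above can be sketched locally in plain Python (a list standing in for the RDD, itertools standing in for the cross join); this is only an illustration of the pairing logic, not Spark code. Exploiting symmetry, each unordered pair is produced once:

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+

# Toy stand-in for the RDD: a list of (id, vector) records.
points = [(0, (0.0, 0.0)), (1, (3.0, 4.0)), (2, (6.0, 8.0))]

# Step 2: "cross join" restricted to i < j, exploiting symmetry
# so each unordered pair appears once (n*(n-1)/2 pairs, not n^2).
pairs = combinations(points, 2)

# Step 3: compute the distance per key pair.
distances = {(i, j): dist(u, v) for (i, u), (j, v) in pairs}

print(distances[(0, 1)])  # 5.0
```

In Spark the analogous operations would be rdd.cartesian(rdd) followed by a filter on the ids and a map to distances; combinations just makes the symmetry savings explicit.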
Best
Ayan
On 30 Apr 2015 06:11, Driesprong, Fokko fo...@driesprong.frl
The cross join shuffle space might not be needed, since most likely through
application-specific logic (topK etc.) you can cut the shuffle space... Also,
most likely the brute-force approach will be a benchmark tool to see how much
better your clustering-based KNN solution is, since there are several ways
you
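The topK pruning idea can be sketched locally as follows (plain Python as a stand-in for the distributed version; the function name and data layout are illustrative, not from the PR). Keeping only the k nearest neighbours per point shrinks the output from n^2 to n*k entries, which is the sense in which the shuffle space gets cut:

```python
import heapq
from math import dist  # Euclidean distance, Python 3.8+

def top_k_neighbors(points, k):
    """points: list of (id, vector). Returns id -> list of the k nearest
    (distance, id) pairs. Pruning to top-k per point reduces the output
    from n^2 to n*k entries."""
    out = {}
    for i, u in points:
        candidates = ((dist(u, v), j) for j, v in points if j != i)
        out[i] = heapq.nsmallest(k, candidates)
    return out

pts = [(0, (0.0, 0.0)), (1, (1.0, 0.0)), (2, (5.0, 0.0)), (3, (0.0, 2.0))]
nn = top_k_neighbors(pts, k=1)
print(nn[0])  # [(1.0, 1)]
```

In a distributed setting the same pruning would happen inside the reduce, so only k candidates per point survive the shuffle rather than the full cross join.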