Hi Debasish, All,

I see the status of SPARK-4823 [0] is still "in-progress". I couldn't tell
from the relevant pull request [1] (it's closed now) whether part of it has
already made it into 1.6.0. We are facing the same problem of computing
pairwise distances between vectors, where rows are > 5M and columns are in
the tens (20 to be specific). DIMSUM doesn't help for the obvious reasons
(transposing the matrix is infeasible) already discussed in the JIRA.

Is there an update on the JIRA ticket above, and is there something I can
use to compute row similarities in Spark 1.6.0 on my dataset? I would be
thankful for any other ideas on this as well.

- Manoj

[0] https://issues.apache.org/jira/browse/SPARK-4823
[1] https://github.com/apache/spark/pull/6213



On Thu, Apr 30, 2015 at 6:40 PM, Driesprong, Fokko <fo...@driesprong.frl>
wrote:

> Thank you guys for the input.
>
> Ayan, I am not sure how this can be done using reduceByKey; as far as I
> can see (but I am not that advanced with Spark), this requires a
> groupByKey, which can be very costly. What would be nice is to transform
> the dataset containing all the vectors like this:
>
>
> val localData = data.zipWithUniqueId().map(_.swap) // Provide some keys
> val cartesianProduct = localData.cartesian(localData) // Provide the pairs
> val groupedByKey = cartesianProduct.groupByKey()
>
> val neighbourhoods = groupedByKey.map {
>   case (point: (Long, VectorWithNormAndClass),
>         points: Iterable[(Long, VectorWithNormAndClass)]) =>
>     val distances = points.map {
>       case (idxB: Long, pointB: VectorWithNormAndClass) =>
>         (idxB, MLUtils.fastSquaredDistance(point._2.vector, point._2.norm,
>           pointB.vector, pointB.norm))
>     }.toSeq.sortBy(_._2) // Iterable has no sortBy, so materialise a Seq
>
>     val kthDistance = distances.take(K).last // the K-th smallest distance
>
>     (point, distances.filter(_._2 <= kthDistance._2))
> }
>
> This is part of my Local Outlier Factor implementation.
>
> Of course the distances can't be sorted directly because they come back as
> an Iterable (hence the toSeq), but it gives the idea. Is it possible to
> make this more efficient? I don't want to use probabilistic functions, and
> I will cache the matrix because many distances are looked up in it;
> computing them on demand would require far more computations.
>
> Kind regards,
> Fokko
>
>
>
> 2015-04-30 4:39 GMT+02:00 Debasish Das <debasish.da...@gmail.com>:
>
>> The cross join's shuffle space might not be needed, since most likely you
>> can cut it through application-specific logic (topK etc.). Also, the brute
>> force approach will most likely serve as a benchmark tool to see how much
>> better your clustering-based KNN solution is, since there are several ways
>> to find approximate nearest neighbors for your application
>> (KMeans/KDTree/LSH etc.).
>>
>> There is a variant that I will bring as a PR for this JIRA, and we will of
>> course look into how to improve it further. The idea is to think of a
>> distributed matrix multiply where both matrices A and B are distributed,
>> and the master coordinates pulling a partition of A and multiplying it
>> with B.
>>
>> The idea suffices for kernel matrix generation as well if the number of
>> rows is modest (~10M or so).
>>
>> https://issues.apache.org/jira/browse/SPARK-4823
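>>
>> A rough sketch of the idea (my own names, not the actual PR code; assumes
>> both A and B are RDD[(Long, Vector)], and any application-specific
>> pruning such as topK would go inside the flatMap):
>>
>> import org.apache.spark.SparkContext
>> import org.apache.spark.mllib.linalg.Vector
>> import org.apache.spark.rdd.RDD
>>
>> def blockwiseProducts(sc: SparkContext,
>>                       a: RDD[(Long, Vector)],
>>                       b: RDD[(Long, Vector)]): RDD[((Long, Long), Double)] = {
>>   val blocks = (0 until a.partitions.length).map { p =>
>>     // pull one partition of A to the master and broadcast it
>>     val block = a.mapPartitionsWithIndex(
>>       (idx, it) => if (idx == p) it else Iterator.empty).collect()
>>     val blockBc = sc.broadcast(block)
>>     // multiply the broadcast block of A against the distributed B
>>     b.flatMap { case (j, vb) =>
>>       blockBc.value.map { case (i, va) =>
>>         val dot = va.toArray.zip(vb.toArray).map { case (x, y) => x * y }.sum
>>         ((i, j), dot)
>>       }
>>     }
>>   }
>>   sc.union(blocks)
>> }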
>>
>>
>> On Wed, Apr 29, 2015 at 3:25 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> This is my first thought; please suggest any further improvements:
>>> 1. Create an RDD of your dataset
>>> 2. Do a cross join to generate pairs
>>> 3. Apply reduceByKey and compute the distance. You will get an RDD with
>>> key pairs and distances (rough sketch below)
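>>>
>>> Something along these lines (quick sketch, untested; with unique ids
>>> each pair key occurs only once after the cross join, so the reduceByKey
>>> step reduces to a plain map):
>>>
>>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>>> import org.apache.spark.rdd.RDD
>>>
>>> // 1. RDD of (id, vector)
>>> def pairDistances(vectors: RDD[Vector]): RDD[((Long, Long), Double)] = {
>>>   val data = vectors.zipWithUniqueId().map(_.swap)
>>>   // 2. cross join to generate the pairs (keep each unordered pair once)
>>>   val pairs = data.cartesian(data).filter { case ((i, _), (j, _)) => i < j }
>>>   // 3. compute the distance per pair
>>>   pairs.map { case ((i, va), (j, vb)) => ((i, j), Vectors.sqdist(va, vb)) }
>>> }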
>>>
>>> Best
>>> Ayan
>>> On 30 Apr 2015 06:11, "Driesprong, Fokko" <fo...@driesprong.frl> wrote:
>>>
>>>> Dear Sparkers,
>>>>
>>>> I am working on an algorithm which requires the pair distance between
>>>> all points (eg. DBScan, LOF, etc.). Computing this for *n* points will
>>>> require produce a n^2 matrix. If the distance measure is symmetrical, this
>>>> can be reduced to (n^2)/2. What would be the most optimal way of computing
>>>> this?
>>>>
>>>> The paper *Pairwise Element Computation with MapReduce
>>>> <https://www.cs.ucsb.edu/~ravenben/classes/290F/papers/kvl10.pdf>*
>>>> describes different approaches to optimizing this process within a
>>>> map-reduce model, although I don't believe they are directly applicable
>>>> to Spark. How would you guys approach this?
>>>>
>>>> I first thought about broadcasting the original points to all the
>>>> workers and then computing the distances across the different workers,
>>>> although this requires all the points to be distributed to every
>>>> machine. This feels rather brute-force; what do you guys think?
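>>>>
>>>> Roughly what I have in mind (just to illustrate, not tested; `points`
>>>> would be an RDD[(Long, Vector)] small enough to collect and broadcast):
>>>>
>>>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>>>> import org.apache.spark.rdd.RDD
>>>>
>>>> def broadcastDistances(points: RDD[(Long, Vector)])
>>>>     : RDD[((Long, Long), Double)] = {
>>>>   // ship every point to every worker once
>>>>   val pointsBc = points.sparkContext.broadcast(points.collect())
>>>>   points.flatMap { case (i, va) =>
>>>>     // the distance is symmetric, so keep each unordered pair only once
>>>>     pointsBc.value.collect {
>>>>       case (j, vb) if i < j => ((i, j), math.sqrt(Vectors.sqdist(va, vb)))
>>>>     }
>>>>   }
>>>> }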
>>>>
>>>> I don't expect full solutions, but some pointers would be great. I
>>>> think a good solution can be re-used for many algorithms.
>>>>
>>>> Kind regards,
>>>> Fokko Driesprong
>>>>
>>>
>>
>
