Thank you guys for the input.

Ayan, I am not sure how this can be done using reduceByKey; as far as I can
see (but I am not so advanced with Spark), this requires a groupByKey, which
can be very costly. What would be nice is to transform the dataset, which
contains all the vectors, like this:


val localData = data.zipWithUniqueId().map(_.swap)     // provide some keys
val cartesianProduct = localData.cartesian(localData)  // provide the pairs
val groupedByKey = cartesianProduct.groupByKey()       // each point with all points

val neighbourhoods = groupedByKey.map {
  case (point @ (idxA, pointA), points) =>
    val distances = points
      .filter { case (idxB, _) => idxB != idxA }  // leave out the point itself
      .map { case (idxB, pointB) =>
        (idxB, MLUtils.fastSquaredDistance(pointA.vector, pointA.norm,
          pointB.vector, pointB.norm))
      }
      .toSeq  // an Iterable cannot be sorted, so convert first

    // The distance to the K-th nearest neighbour: sort ascending,
    // keep the first K and take the last of those
    val kthDistance = distances.sortBy(_._2).take(K).last

    (point, distances.filter(_._2 <= kthDistance._2))
}

This is part of my Local Outlier Factor implementation.
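
Perhaps the groupByKey can even be avoided with aggregateByKey (just an
untested sketch on my end): by keeping only the K closest candidates per
key, the shuffle carries at most K pairs per point instead of a full row of
the distance matrix.

// Untested sketch: retain only the K smallest distances per key.
def keepK(acc: Seq[(Long, Double)], cand: (Long, Double)): Seq[(Long, Double)] =
  (acc :+ cand).sortBy(_._2).take(K)

val kNearest = cartesianProduct
  .filter { case ((idxA, _), (idxB, _)) => idxA != idxB }  // skip self-pairs
  .map { case ((idxA, pointA), (idxB, pointB)) =>
    (idxA, (idxB, MLUtils.fastSquaredDistance(pointA.vector, pointA.norm,
      pointB.vector, pointB.norm)))
  }
  .aggregateByKey(Seq.empty[(Long, Double)])(keepK,
    (a, b) => (a ++ b).sortBy(_._2).take(K))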

Of course, in the first snippet the distances cannot be sorted while points
is still an Iterable (hence the toSeq), but it gives the idea. Is it
possible to make this more efficient? I don't want to use probabilistic
functions, and I will cache the matrix because many distances are looked up
in it repeatedly; computing them on demand would require far more
computation.
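
For the caching, something like this is what I have in mind (the storage
level is just a suggestion):

import org.apache.spark.storage.StorageLevel

// Keep the computed neighbourhoods around for the repeated lookups,
// spilling to disk when the n^2 pairs do not fit in memory.
neighbourhoods.persist(StorageLevel.MEMORY_AND_DISK)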

Kind regards,
Fokko



2015-04-30 4:39 GMT+02:00 Debasish Das <debasish.da...@gmail.com>:

> The cross join's shuffle space might not be needed, since most likely you
> can cut it down through application-specific logic (topK etc.)...Also, the
> brute-force approach will most likely serve as a benchmark to see how much
> better your clustering-based KNN solution is, since there are several ways
> you can find approximate nearest neighbors for your application
> (KMeans/KDTree/LSH etc.)...
>
> There is a variant that I will bring as a PR for this JIRA, and we will of
> course look into how to improve it further...the idea is to think of a
> distributed matrix multiply where both matrices A and B are distributed and
> the master coordinates pulling a partition of A and multiplying it with B...
>
> The idea suffices for kernel matrix generation as well if the number of
> rows is modest (~10M or so)...
>
> https://issues.apache.org/jira/browse/SPARK-4823
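>
> Roughly, the idea in code (just a sketch to convey it, not the actual PR
> code; the row types are my own assumption): collect one partition of A at
> a time, broadcast it, and compute its products against the distributed B...
>
> val a: RDD[(Long, Array[Double])] = ??? // rows of A
> val b: RDD[(Long, Array[Double])] = ??? // rows of B
>
> val blocks = a.partitions.indices.map { i =>
>   // pull one partition of A to the driver and broadcast it
>   val blockA = a.mapPartitionsWithIndex(
>     (idx, it) => if (idx == i) it else Iterator.empty).collect()
>   val bcA = sc.broadcast(blockA)
>   // multiply the broadcast block against all of distributed B
>   b.flatMap { case (idxB, rowB) =>
>     bcA.value.map { case (idxA, rowA) =>
>       ((idxA, idxB), rowA.zip(rowB).map { case (x, y) => x * y }.sum)
>     }
>   }
> }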
>
>
> On Wed, Apr 29, 2015 at 3:25 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> This is my first thought, please suggest any further improvement:
>> 1. Create an RDD of your dataset
>> 2. Do a cross join to generate pairs
>> 3. Apply reduceByKey and compute the distance. You will get an RDD with
>> key pairs and distances
>>
>> Best
>> Ayan
>> On 30 Apr 2015 06:11, "Driesprong, Fokko" <fo...@driesprong.frl> wrote:
>>
>>> Dear Sparkers,
>>>
>>> I am working on an algorithm which requires the pairwise distance between
>>> all points (e.g. DBSCAN, LOF, etc.). Computing this for *n* points will
>>> produce an n^2 matrix. If the distance measure is symmetric, this can be
>>> reduced to (n^2)/2. What would be the optimal way of computing this?
>>>
>>> The paper *Pairwise Element Computation with MapReduce
>>> <https://www.cs.ucsb.edu/~ravenben/classes/290F/papers/kvl10.pdf>*
>>> describes different approaches to optimizing this process within a
>>> map-reduce model, although I don't believe these are directly applicable
>>> to Spark. How would you guys approach this?
>>>
>>> I first thought about broadcasting the original points to all the
>>> workers, and then computing the distances across the different workers.
>>> This does require all the points to be replicated to all the machines,
>>> though, and it feels rather brute-force. What do you guys think?
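>>>
>>> A minimal sketch of what I mean (everything named here is an assumption;
>>> the idxB > idxA filter uses the symmetry so each pair is computed once):
>>>
>>> import org.apache.spark.mllib.linalg.{Vector, Vectors}
>>>
>>> val keyed = data.zipWithUniqueId().map(_.swap) // data: RDD[Vector]
>>> val broadcastPoints = sc.broadcast(keyed.collect())
>>> val distances = keyed.flatMap { case (idxA, pointA) =>
>>>   broadcastPoints.value
>>>     .filter { case (idxB, _) => idxB > idxA } // symmetric: (n^2)/2 pairs
>>>     .map { case (idxB, pointB) =>
>>>       ((idxA, idxB), Vectors.sqdist(pointA, pointB))
>>>     }
>>> }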
>>>
>>> I don't expect full solutions, but some pointers would be great. I think
>>> a good solution can be re-used for many algorithms.
>>>
>>> Kind regards,
>>> Fokko Driesprong
>>>
>>
>
