We are trying to implement a univariate kernel density estimation in Spark. Our first step was to implement it in R from scratch; here is the relevant code:

## R CODE
## gaussian kernel
gau<-function(z)
{
    k<- exp(-(z^2)/2)*(1/sqrt(2*pi))
    k
}

## weighted kernel density estimate at each evaluation point x2[j],
## using sample points x, weights w, and bandwidth h
ker<-c()
for ( j in 1:length(x2))
{  z<-(x2[j]-x)/h
   ker[j]<-sum(w*gau(z))/(sum(w)*h)
}
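
For reference, the quantity this loop computes at each grid point x2[j] (with sample points x, weights w, and bandwidth h) is the weighted kernel density estimate:

    \hat{f}(x2_j) = \frac{\sum_i w_i \, K((x2_j - x_i)/h)}{h \sum_i w_i},
    \qquad K(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}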

Then we implemented similar ideas in the Spark shell:

// SCALA-SPARK CODE

import org.apache.spark.rdd.RDD

// Gaussian kernel
def gau(z: Double): Double =
  scala.math.exp(-scala.math.pow(z, 2) / 2) * (1 / scala.math.sqrt(2 * scala.math.Pi))

// weighted kernel density estimate at a single evaluation point x2j
def kernel(x2j: Double, id_x: RDD[(String, Double)],
           id_w: RDD[(String, Double)], h: Double): Double = {
  val z = id_x.mapValues(x => (x2j - x) / h)
  z.mapValues(gau(_)).join(id_w).map(x => x._2._1 * x._2._2).sum /
    (id_w.map(x => x._2).sum * h)
}

// x2 is a local collection of evaluation points
val ker = x2.map(kernel(_, id_x, id_w, h))
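
(For context, the inputs look roughly like this; the names match the snippet above, but the sample values here are placeholders rather than our real data.)

// HYPOTHETICAL setup, only to make the snippet above self-contained:
// id_x holds (id, value) pairs, id_w holds (id, weight) pairs with the
// same keys, x2 is a local grid of evaluation points, h is the bandwidth.
val id_x: RDD[(String, Double)] = sc.parallelize(Seq(("a", 1.2), ("b", 0.7), ("c", 2.5)))
val id_w: RDD[(String, Double)] = sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("c", 1.0)))
val x2 = (0 to 100).map(_ * 0.05)   // local evaluation grid
val h  = 0.5                        // bandwidth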

The problem is that R is much faster (3 seconds) than Spark (30 minutes) on the same quad-core machine. I'm sure it is because we are not using Spark as well as we could.
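
If the issue is that x2.map calls kernel once per grid point, and every call runs its own join plus two distributed sums, then a single pass over the data might avoid most of the overhead. Here is a minimal sketch of that idea (assuming x2 is a small local grid and reusing gau from above; this is only a guess at a faster layout, not something we have verified on our data):

// Sketch: join the values and weights once, then accumulate the weighted
// kernel sums for all evaluation points in a single aggregate
def kdeOnePass(x2: Seq[Double], id_x: RDD[(String, Double)],
               id_w: RDD[(String, Double)], h: Double): Array[Double] = {
  val data = id_x.join(id_w).values           // (x, w) pairs matched by id
  val grid = x2.toArray                       // index-friendly copy of the grid
  val n = grid.length
  val (num, wSum) = data.aggregate((new Array[Double](n), 0.0))(
    { case ((acc, wTot), (x, w)) =>           // per-partition accumulation
        var j = 0
        while (j < n) { acc(j) += w * gau((grid(j) - x) / h); j += 1 }
        (acc, wTot + w)
    },
    { case ((a1, w1), (a2, w2)) =>            // merge partition results
        var j = 0
        while (j < n) { a1(j) += a2(j); j += 1 }
        (a1, w1 + w2)
    })
  num.map(_ / (wSum * h))                     // normalise by h * total weight
}

val ker = kdeOnePass(x2, id_x, id_w, h)

This evaluates the whole grid with one join and one aggregate instead of one distributed job per evaluation point.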

Can anyone help us?

Abel Coronado
@abxda
