We are trying to implement univariate kernel density estimation in Spark.
Our first step was to implement it in R from scratch; here is the relevant code:
## R CODE
## Gaussian kernel
gau <- function(z) {
  exp(-(z^2) / 2) * (1 / sqrt(2 * pi))
}

## weighted kernel density estimate at each evaluation point x2[j]
ker <- c()
for (j in 1:length(x2)) {
  z <- (x2[j] - x) / h
  ker[j] <- sum(w * gau(z)) / (sum(w) * h)
}
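For reference, the loop above computes the weighted kernel density estimate at
each evaluation point x2[j], i.e. (with K the Gaussian kernel defined by gau):

  \hat{f}(x_{2,j}) = \frac{\sum_i w_i \, K\!\left((x_{2,j} - x_i)/h\right)}{h \sum_i w_i},
  \qquad K(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}.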
We then implemented the same idea in the Spark shell:
// SCALA-SPARK CODE
import org.apache.spark.rdd.RDD

// Gaussian kernel
def gau(z: Double): Double =
  math.exp(-math.pow(z, 2) / 2) * (1 / math.sqrt(2 * math.Pi))

// Weighted kernel density estimate at a single evaluation point x2j
def kernel(x2j: Double, id_x: RDD[(String, Double)],
           id_w: RDD[(String, Double)], h: Double): Double = {
  val z = id_x.mapValues(x => (x2j - x) / h)
  z.mapValues(gau).join(id_w).map(x => x._2._1 * x._2._2).sum /
    (id_w.map(_._2).sum * h)
}

// Evaluate the estimate at every point of the evaluation grid x2
val ker = x2.map(kernel(_, id_x, id_w, h))
The problem is that R is much faster (3 seconds) than Spark (30 minutes) on the
same machine (a quad core). I'm sure this is because we are not using Spark as
well as we could.
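Our current guess is that calling kernel once per evaluation point launches a
separate Spark job (a join plus two reductions over the RDDs) for every element
of x2, which would explain the overhead. Below is a minimal sketch of the
alternative we are considering, assuming x2 is a small local collection that can
be broadcast and that a SparkContext sc is available; the names xw, sumW, x2B
and ker2 are our own and just illustrative:

// Sketch: join x and w once, broadcast the evaluation points, and compute
// all estimates in a single pass instead of one Spark job per point.
val xw = id_x.join(id_w).values.cache()      // RDD[(x_i, w_i)]
val sumW = xw.map(_._2).sum                  // total weight, computed once
val x2B = sc.broadcast(x2.toArray)           // evaluation points on every executor

// For each (x_i, w_i), accumulate its contribution to every evaluation point,
// then sum the per-point contributions across partitions and normalise.
val ker2 = xw
  .map { case (xi, wi) => x2B.value.map(x2j => wi * gau((x2j - xi) / h)) }
  .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
  .map(_ / (sumW * h))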
Can anyone help us?
Abel Coronado
@abxda