Re: UDF in SparkR
I experienced very slow execution times
(http://stackoverflow.com/questions/38803546/spark-r-2-0-dapply-very-slow)
and am wondering why...

On Wed, Aug 17, 2016 at 1:12 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> This is supported in Spark 2.0.0 as dapply and gapply. Please see the API
> doc:
> https://spark.apache.org/docs/2.0.0/api/R/
>
> Feedback welcome and appreciated!
>
> _____________________________
> From: Yogesh Vyas <informy...@gmail.com>
> Sent: Tuesday, August 16, 2016 11:39 PM
> Subject: UDF in SparkR
> To: user <user@spark.apache.org>
>
> Hi,
>
> Is there any way of using a UDF in SparkR?
>
> Regards,
> Yogesh

--
Yann-Aël Le Borgne
Machine Learning Group
Université Libre de Bruxelles
http://mlg.ulb.ac.be
http://www.ulb.ac.be/di/map/yleborgn
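For reference, a minimal sketch of what the dapply/gapply API mentioned
above looks like in practice (untested; the mtcars columns, helper names,
and exact schema declarations are my own illustration, not from the thread):

library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# dapply: apply an R function to each partition of the SparkDataFrame;
# the output schema must be declared up front
doubled <- dapply(df,
                  function(p) { p$mpg <- p$mpg * 2; p },
                  schema(df))
head(collect(doubled))

# gapply: apply an R function to each group of rows sharing a key
avg_by_cyl <- gapply(df, "cyl",
                     function(key, g) {
                       data.frame(cyl = key[[1]], avg_mpg = mean(g$mpg))
                     },
                     structType(structField("cyl", "double"),
                                structField("avg_mpg", "double")))
head(collect(avg_by_cyl))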
Re: Avoid Cartesian product in calculating a distance matrix?
Hi,

I also experienced very slow computation times for the cartesian product,
and could not find an efficient way to do this apart from writing my own
implementation. I used the 'balanced' cluster algorithm described here:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4246436/

I'd be interested to know if there is a more direct way to do this.

Cheers,
Yann

On Fri, Aug 5, 2016 at 7:20 PM, Paschalis Veskos <ves...@gmail.com> wrote:
> Hello everyone,
>
> I am interested in running an application on Spark that at some point
> needs to compare all elements of an RDD against all others to create a
> distance matrix. The RDD is of type <String, Double>, and the Pearson
> correlation is applied to each element against all others, generating
> a matrix with the distance between all possible combinations of
> elements.
>
> I have implemented this by taking the cartesian product of the RDD
> with itself, filtering half the matrix away since it is symmetric,
> then doing a combineByKey to gather all the other elements each one
> needs to be compared with. I map the output of this over the comparison
> function implementing the Pearson correlation.
>
> You can probably guess this is dead slow. I use Spark 1.6.2, and the
> code is written in Java 8. At the rate it is processing on a cluster
> with 4 nodes with 16 cores and 56 GB RAM each, for a list with ~15000
> elements split into 512 partitions, the cartesian operation alone is
> estimated to take about 3000 hours (all cores are maxed out on all
> machines)!
>
> Is there a way to avoid the cartesian product to calculate what I
> want? Would a DataFrame join be faster? Or is this an operation that
> just requires a much larger cluster?
>
> Thank you,
>
> Paschalis

--
Yann-Aël Le Borgne
Machine Learning Group
Université Libre de Bruxelles
http://mlg.ulb.ac.be
http://www.ulb.ac.be/di/map/yleborgn
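For what it's worth, there is an algebraic way to sidestep the all-pairs
enumeration entirely: if each series is standardized (zero mean, unit
variance), the full Pearson correlation matrix is a single matrix product,
which distributes far better than a cartesian join. A minimal local R
sketch of the idea (toy sizes; not the Java/Spark implementation from the
thread):

set.seed(1)
n <- 100  # number of series (~15000 in the original post)
m <- 50   # observations per series
X <- matrix(rnorm(n * m), nrow = n, ncol = m)

# Standardize each row to zero mean and unit variance
Xs <- t(scale(t(X)))

# All pairwise Pearson correlations in one matrix product:
# cor(i, j) = sum_k Xs[i, k] * Xs[j, k] / (m - 1)
C <- tcrossprod(Xs) / (m - 1)

# Sanity check against R's built-in pairwise computation
max(abs(C - cor(t(X))))  # ~1e-15

On a cluster, the same product can be expressed as a block-matrix
multiply, so the shuffle is over matrix blocks rather than over all
element pairs.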
Spark R 2.0 dapply very slow
Hello all,

I just started testing Spark R 2.0 and find the execution of dapply very
slow. For example, in plain R, the following code

set.seed(2)
random_DF <- data.frame(matrix(rnorm(100), 10, 10))
system.time(dummy_res <- random_DF[random_DF[, 1] > 1, ])
   user  system elapsed
  0.005   0.000   0.006

is executed in 6 ms.

Now, if I create a Spark DataFrame on 4 partitions and run on 4 cores,
I get:

sparkR.session(master = "local[4]")
random_DF_Spark <- repartition(createDataFrame(random_DF), 4)
subset_DF_Spark <- dapply(
  random_DF_Spark,
  function(x) {
    y <- x[x[1] > 1, ]
    y
  },
  schema(random_DF_Spark))
system.time(dummy_res_Spark <- collect(subset_DF_Spark))
   user  system elapsed
  2.003   0.119  62.919

That is about 1 minute, which is abnormally slow. Am I missing something?

I also get a warning:

16/07/31 15:07:02 WARN TaskSetManager: Stage 64 contains a task of very
large size (16411 KB). The maximum recommended task size is 100 KB.

Why is this 100 KB limit so low?

I am using R 3.3.0 on Mac OS 10.10.5.

Any insight welcome,

Best,
Yann-Aël

--
Yann-Aël Le Borgne
Machine Learning Group
Université Libre de Bruxelles
http://mlg.ulb.ac.be
http://www.ulb.ac.be/di/map/yleborgn
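One observation (mine, not confirmed in the thread): dapply serializes
every partition out of the JVM to a separate R worker process and back,
so for a filter this simple, much of the elapsed minute is likely that
round trip plus R worker startup rather than the computation itself. The
same subset can be expressed natively on the SparkDataFrame, which avoids
the R workers entirely; a sketch, assuming the default X1..X10 column
names that data.frame(matrix(...)) produces:

subset_native <- filter(random_DF_Spark, random_DF_Spark$X1 > 1)
system.time(dummy_res_native <- collect(subset_native))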