Re: UDF in SparkR
I experienced very slow execution times
(http://stackoverflow.com/questions/38803546/spark-r-2-0-dapply-very-slow)
and am wondering why...

On Wed, Aug 17, 2016 at 1:12 PM, Felix Cheung <felixcheun...@hotmail.com> wrote:
> This is supported in Spark 2.0.0 as dapply and gapply. Please see the API
> doc:
> https://spark.apache.org/docs/2.0.0/api/R/
>
> Feedback welcome and appreciated!
>
> _____________________________
> From: Yogesh Vyas <informy...@gmail.com>
> Sent: Tuesday, August 16, 2016 11:39 PM
> Subject: UDF in SparkR
> To: user <user@spark.apache.org>
>
> Hi,
>
> Is there any way of using a UDF in SparkR?
>
> Regards,
> Yogesh

--
Yann-Aël Le Borgne
Machine Learning Group
Université Libre de Bruxelles
http://mlg.ulb.ac.be
http://www.ulb.ac.be/di/map/yleborgn
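For reference, a minimal sketch of what the dapply/gapply API mentioned
above looks like in practice (untested; the mtcars columns, helper names,
and exact schema declarations are my own illustration, not from the thread):

library(SparkR)
sparkR.session()

df <- createDataFrame(mtcars)

# dapply: apply an R function to each partition of the SparkDataFrame;
# the output schema must be declared up front
doubled <- dapply(df,
                  function(p) { p$mpg <- p$mpg * 2; p },
                  schema(df))
head(collect(doubled))

# gapply: apply an R function to each group of rows sharing a key
avg_by_cyl <- gapply(df, "cyl",
                     function(key, g) {
                       data.frame(cyl = key[[1]], avg_mpg = mean(g$mpg))
                     },
                     structType(structField("cyl", "double"),
                                structField("avg_mpg", "double")))
head(collect(avg_by_cyl))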
Re: Avoid Cartesian product in calculating a distance matrix?
Hi,

I also experienced very slow computation times for the cartesian product,
and could not find an efficient way to do this apart from writing my own
implementation. I used the 'balanced' cluster algorithm described here:
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4246436/

I'd be interested to know if there is a more direct way to do this.

Cheers,
Yann

On Fri, Aug 5, 2016 at 7:20 PM, Paschalis Veskos <ves...@gmail.com> wrote:
> Hello everyone,
>
> I am interested in running an application on Spark that at some point
> needs to compare all elements of an RDD against all others to create a
> distance matrix. The RDD is of type <String, Double>, and the Pearson
> correlation is applied to each element against all others, generating
> a matrix with the distance between all possible combinations of
> elements.
>
> I have implemented this by taking the cartesian product of the RDD
> with itself, filtering half the matrix away since it is symmetric,
> then doing a combineByKey to gather all the other elements each one
> needs to be compared with. I map the output of this over the comparison
> function implementing the Pearson correlation.
>
> You can probably guess this is dead slow. I use Spark 1.6.2, and the
> code is written in Java 8. At the rate it is processing on a cluster
> with 4 nodes with 16 cores and 56 GB RAM each, for a list with ~15000
> elements split into 512 partitions, the cartesian operation alone is
> estimated to take about 3000 hours (all cores are maxed out on all
> machines)!
>
> Is there a way to avoid the cartesian product to calculate what I
> want? Would a DataFrame join be faster? Or is this an operation that
> just requires a much larger cluster?
>
> Thank you,
>
> Paschalis

--
Yann-Aël Le Borgne
Machine Learning Group
Université Libre de Bruxelles
http://mlg.ulb.ac.be
http://www.ulb.ac.be/di/map/yleborgn
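For what it's worth, there is an algebraic way to sidestep the all-pairs
enumeration entirely: if each series is standardized (zero mean, unit
variance), the full Pearson correlation matrix is a single matrix product,
which distributes far better than a cartesian join. A minimal local R
sketch of the idea (toy sizes; not the Java/Spark implementation from the
thread):

set.seed(1)
n <- 100  # number of series (~15000 in the original post)
m <- 50   # observations per series
X <- matrix(rnorm(n * m), nrow = n, ncol = m)

# Standardize each row to zero mean and unit variance
Xs <- t(scale(t(X)))

# All pairwise Pearson correlations in one matrix product:
# cor(i, j) = sum_k Xs[i, k] * Xs[j, k] / (m - 1)
C <- tcrossprod(Xs) / (m - 1)

# Sanity check against R's built-in pairwise computation
max(abs(C - cor(t(X))))  # ~1e-15

On a cluster, the same product can be expressed as a block-matrix
multiply, so the shuffle is over matrix blocks rather than over all
element pairs.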
Spark R 2.0 dapply very slow
Hello all,

I just started testing Spark R 2.0 and find the execution of dapply very
slow. For example, in plain R, the following code

set.seed(2)
random_DF <- data.frame(matrix(rnorm(100), 10, 10))
system.time(dummy_res <- random_DF[random_DF[, 1] > 1, ])
   user  system elapsed
  0.005   0.000   0.006

is executed in 6 ms.

Now, if I create a Spark DataFrame on 4 partitions and run on 4 cores,
I get:

sparkR.session(master = "local[4]")
random_DF_Spark <- repartition(createDataFrame(random_DF), 4)
subset_DF_Spark <- dapply(
  random_DF_Spark,
  function(x) {
    y <- x[x[1] > 1, ]
    y
  },
  schema(random_DF_Spark))
system.time(dummy_res_Spark <- collect(subset_DF_Spark))
   user  system elapsed
  2.003   0.119  62.919

That is about 1 minute, which is abnormally slow. Am I missing something?

I also get a warning:

16/07/31 15:07:02 WARN TaskSetManager: Stage 64 contains a task of very
large size (16411 KB). The maximum recommended task size is 100 KB.

Why is this 100 KB limit so low?

I am using R 3.3.0 on Mac OS 10.10.5.

Any insight welcome,

Best,
Yann-Aël

--
Yann-Aël Le Borgne
Machine Learning Group
Université Libre de Bruxelles
http://mlg.ulb.ac.be
http://www.ulb.ac.be/di/map/yleborgn
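One observation (mine, not confirmed in the thread): dapply serializes
every partition out of the JVM to a separate R worker process and back,
so for a filter this simple, much of the elapsed minute is likely that
round trip plus R worker startup rather than the computation itself. The
same subset can be expressed natively on the SparkDataFrame, which avoids
the R workers entirely; a sketch, assuming the default X1..X10 column
names that data.frame(matrix(...)) produces:

subset_native <- filter(random_DF_Spark, random_DF_Spark$X1 > 1)
system.time(dummy_res_native <- collect(subset_native))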