PowerIterationClustering Benchmark

2016-12-15 Thread Lydia Ickler
Hi all,

I have a question regarding the PowerIterationClusteringExample.
I have adjusted the code so that it reads a file via
sc.textFile("path/to/input"), which works fine.
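
For illustration, a minimal PySpark sketch of what such a change could look like. It assumes that every input line holds "srcId dstId similarity" separated by whitespace, which is not stated in this message. The names parse_edge and cluster_file and the min_partitions parameter are hypothetical and not part of the original example.

```python
def parse_edge(line):
    """Parse one 'srcId dstId similarity' line (assumed format) into a tuple."""
    src, dst, similarity = line.split()
    return int(src), int(dst), float(similarity)


def cluster_file(sc, path, k, max_iterations, min_partitions=None):
    """Run PowerIterationClustering on a similarity file read via sc.textFile."""
    # Imported lazily so that parse_edge can be used without a Spark installation.
    from pyspark.mllib.clustering import PowerIterationClustering

    # min_partitions is only a hint for how many input splits Spark creates.
    similarities = sc.textFile(path, minPartitions=min_partitions).map(parse_edge)
    model = PowerIterationClustering.train(similarities, k, max_iterations)
    # Each assignment carries a vertex id and the cluster it was assigned to.
    return [(a.id, a.cluster) for a in model.assignments().collect()]
```

With an existing SparkContext this would be called as cluster_file(sc, "path/to/input", k, max_iterations), where k and max_iterations are chosen by the caller.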

Now I want to benchmark the algorithm on different numbers of nodes to see
how well the implementation scales. As a testbed I have up to 32 nodes
available, each with 16 cores, running Spark 2.0.2 on YARN.
For my smallest input data set (16 MB) the runtime does not really change
whether I use 1, 2, 4, 8, 16 or 32 nodes (always about 1.5 minutes).
The same holds for my largest data set (2.3 GB): the runtime stays around 1 h
whether I use 16 or 32 nodes.
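
The measurements above can also be expressed as speedup and parallel efficiency. The sketch below is plain Python and takes the approximate runtimes quoted above as input. The function names are hypothetical and the runtimes are rounded, so the results are only rough.

```python
def speedup(baseline_runtime, runtime):
    """How many times faster a run is compared with the baseline run."""
    return baseline_runtime / runtime


def parallel_efficiency(baseline_runtime, runtime, baseline_nodes, nodes):
    """Speedup divided by the factor by which the node count grew (1.0 is ideal)."""
    return speedup(baseline_runtime, runtime) / (nodes / baseline_nodes)


# Largest data set: about 60 minutes on both 16 and 32 nodes.
large = parallel_efficiency(60.0, 60.0, baseline_nodes=16, nodes=32)
# Smallest data set: about 1.5 minutes on both 1 and 32 nodes.
small = parallel_efficiency(1.5, 1.5, baseline_nodes=1, nodes=32)
print(large, small)  # 0.5 0.03125
```

An efficiency of 0.5 for the step from 16 to 32 nodes means the additional 16 nodes did not contribute any measurable reduction in runtime.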

I expected the runtime to shrink when I, for example, double the number of
nodes.
To set up my cluster environment I tried different suggestions from
this paper: https://hal.inria.fr/hal-01347638v1/document


Has anyone experienced the same? Or does anyone have suggestions about what
might have gone wrong?

Thanks in advance!
Lydia




distribute work (files)

2016-09-06 Thread Lydia Ickler
Hi, 

maybe this is a stupid question:

I have a list of files. I want to use each file as input for an
ML algorithm. All files are independent of one another.
My question is: how do I distribute the work so that each worker takes a
block of files and runs the algorithm on them one by one?
I hope somebody can point me in the right direction! :)
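
One possible way to express this in PySpark is sketched below. It assumes that the algorithm is an ordinary single-machine function that takes a file path and returns a result, and that every worker can read the files under the same paths. All function names are hypothetical.

```python
def split_into_blocks(paths, num_blocks):
    """Split paths into at most num_blocks contiguous blocks of near-equal size."""
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    if not paths:
        return []
    num_blocks = min(num_blocks, len(paths))
    base, extra = divmod(len(paths), num_blocks)
    blocks, start = [], 0
    for i in range(num_blocks):
        # The first `extra` blocks get one additional file each.
        size = base + (1 if i < extra else 0)
        blocks.append(paths[start:start + size])
        start += size
    return blocks


def run_on_block(block, algorithm):
    """Run the algorithm on every file of one block, one after another."""
    return [(path, algorithm(path)) for path in block]


def run_distributed(sc, paths, num_blocks, algorithm):
    """Distribute the blocks over the cluster and collect (path, result) pairs."""
    blocks = split_into_blocks(paths, num_blocks)
    if not blocks:
        return []
    # One partition per block, so each task works through its block sequentially.
    rdd = sc.parallelize(blocks, len(blocks))
    return rdd.flatMap(lambda block: run_on_block(block, algorithm)).collect()
```

Because the algorithm runs inside a Spark task, it cannot use the SparkContext itself. This pattern therefore only fits algorithms that run locally on a single file.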

Best regards, 
Lydia
-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Eigenvalue solver

2016-01-12 Thread Lydia Ickler
Hi,

I would like to know whether there are any implementations yet, within the
Machine Learning Library or elsewhere, that can efficiently solve eigenvalue
problems. If not, do you have suggestions on how to approach a parallel
implementation, maybe with BLAS or Breeze?
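
To make the kind of computation concrete: one well-known approach is power iteration, which finds only the dominant eigenvalue and its eigenvector. The sketch below is plain single-machine Python, not an existing library implementation. It assumes a symmetric matrix whose largest eigenvalue is also the largest in absolute value, and all names are hypothetical. The matrix-vector product is the step that would have to be parallelized over rows.

```python
import math


def mat_vec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, num_iterations=1000, tol=1e-12):
    """Approximate the dominant eigenvalue and eigenvector of a symmetric matrix."""
    n = len(matrix)
    # Deterministic non-uniform start vector, normalized to unit length.
    vector = [float(i + 1) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in vector))
    vector = [x / norm for x in vector]
    eigenvalue = 0.0
    for _ in range(num_iterations):
        product = mat_vec(matrix, vector)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            raise ValueError("iterate collapsed to zero; try another start vector")
        next_vector = [x / norm for x in product]
        # Rayleigh quotient of the normalized iterate estimates the eigenvalue.
        next_eigenvalue = sum(
            x * y for x, y in zip(next_vector, mat_vec(matrix, next_vector))
        )
        converged = abs(next_eigenvalue - eigenvalue) < tol
        vector, eigenvalue = next_vector, next_eigenvalue
        if converged:
            break
    return eigenvalue, vector


value, vec = power_iteration([[2.0, 1.0], [1.0, 2.0]])
```

For this 2x2 matrix the eigenvalues are 1 and 3, so value should be close to 3 and both components of vec close to each other.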

Thanks in advance!
Lydia


Sent from my iPhone