Also, are you using the latest master for this experiment? A PR merged into master a couple of days ago speeds up k-means by about a factor of three. See
https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Dec 5, 2014 at 9:36 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> The code is really simple:
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.mllib.clustering.KMeans
> import org.apache.spark.mllib.linalg.Vectors
>
> object TestKMeans {
>
>   def main(args: Array[String]) {
>
>     val conf = new SparkConf()
>       .setAppName("Test KMeans")
>       .setMaster("local[8]")
>       .set("spark.executor.memory", "8g")
>
>     val sc = new SparkContext(conf)
>
>     val numClusters = 500
>     val numIterations = 2
>
>     // Load comma-delimited rows and parse each into a dense vector.
>     val data = sc.textFile("sample.csv")
>       .map(x => Vectors.dense(x.split(',').map(_.toDouble)))
>     data.cache()
>
>     val clusters = KMeans.train(data, numClusters, numIterations)
>
>     println(clusters.clusterCenters.size)
>
>     val wssse = clusters.computeCost(data)
>     println(s"error : $wssse")
>   }
> }
>
> For testing purposes, I generated random sample data with Julia and stored
> it in a comma-delimited CSV file. The dimensions are 248000 x 384.
>
> In the target application, I will have more than 248k data points to cluster.
>
>
> On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>> Could you post your script to reproduce the results (and how you
>> generate the dataset)? That will help us investigate it.
>>
>> On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa <jaon...@gmail.com>
>> wrote:
>> > Hmm, here I use Spark in local mode on my laptop with 8 cores. The data
>> > is on my local filesystem. Even though there is an overhead due to the
>> > distributed computation, I find the difference between the runtimes of
>> > the two implementations really, really huge. Is there a benchmark on how
>> > well the algorithm implemented in MLlib performs?
>> >
>> > On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> Spark has much more overhead, since it's set up to distribute the
>> >> computation. Julia isn't distributed, and so has no such overhead in a
>> >> completely in-core implementation. You generally use Spark when you
>> >> have a problem large enough to warrant distributing, or when your data
>> >> already lives in a distributed store like HDFS.
>> >>
>> >> But it's also possible you're not configuring the implementations the
>> >> same way, yes. There's not enough info here really to say.
>> >>
>> >> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com>
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > I'm trying to run clustering with the k-means algorithm. My data set
>> >> > is about 240k vectors of dimension 384.
>> >> >
>> >> > Solving the problem with the kmeans available in Julia (kmeans++),
>> >> >
>> >> > http://clusteringjl.readthedocs.org/en/latest/kmeans.html
>> >> >
>> >> > takes about 8 minutes on a single core.
>> >> >
>> >> > Solving the same problem with Spark's kmeans|| takes more than 1.5
>> >> > hours with 8 cores!
>> >> >
>> >> > Either they don't implement the same algorithm or I don't understand
>> >> > how the kmeans in Spark works. Is my data not big enough to take full
>> >> > advantage of Spark? At the least, I expected the same runtime.
>> >> >
>> >> >
>> >> > Cheers,
>> >> >
>> >> >
>> >> > Jao

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
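
[Editor's note] One variable worth isolating in a comparison like this is the initialization step: Clustering.jl uses k-means++ seeding, while MLlib's KMeans.train defaults to k-means||, which makes extra passes over the data before Lloyd's iterations begin. The sketch below (not from the original thread; the function name and parameter values are illustrative only) shows how the initialization mode can be set explicitly through MLlib's builder API so the two runs can be compared on more equal footing:

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// data is the same RDD[Vector] built from sample.csv in TestKMeans above.
def trainWithExplicitInit(data: RDD[Vector]): KMeansModel = {
  new KMeans()
    .setK(500)                            // numClusters from TestKMeans
    .setMaxIterations(2)                  // numIterations from TestKMeans
    .setInitializationMode(KMeans.RANDOM) // default is KMeans.K_MEANS_PARALLEL ("k-means||")
    .setInitializationSteps(5)            // only relevant when using k-means|| initialization
    .run(data)
}

If the k-means|| initialization passes dominate the runtime, switching the mode for a test run helps separate the cost of initialization from the cost of the Lloyd iterations themselves.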