Also, are you using the latest master for this experiment? A PR merged into master a couple of days ago speeds up k-means by about a factor of three. See
https://github.com/apache/spark/commit/7fc49ed91168999d24ae7b4cc46fbb4ec87febc1

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Dec 5, 2014 at 9:36 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> The code is really simple:
>
> import org.apache.spark.{SparkConf, SparkContext}
> import org.apache.spark.mllib.clustering.KMeans
> import org.apache.spark.mllib.linalg.Vectors
>
> object TestKMeans {
>
>   def main(args: Array[String]) {
>
>     val conf = new SparkConf()
>       .setAppName("Test KMeans")
>       .setMaster("local[8]")
>       .set("spark.executor.memory", "8g")
>
>     val sc = new SparkContext(conf)
>
>     val numClusters = 500
>     val numIterations = 2
>
>     // Load comma-delimited rows and parse each into a dense vector.
>     val data = sc.textFile("sample.csv")
>       .map(x => Vectors.dense(x.split(',').map(_.toDouble)))
>     data.cache()
>
>     val clusters = KMeans.train(data, numClusters, numIterations)
>
>     println(clusters.clusterCenters.size)
>
>     val wssse = clusters.computeCost(data)
>     println(s"error : $wssse")
>   }
> }
>
> For testing purposes, I generated random sample data with Julia and stored
> it in a comma-delimited CSV file. The dimensions are 248000 x 384.
>
> In the target application, I will have more than 248k data points to cluster.
>
>
> On Fri, Dec 5, 2014 at 6:03 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>> Could you post your script to reproduce the results (and how you
>> generate the dataset)? That will help us investigate it.
>>
>> On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa <jaon...@gmail.com>
>> wrote:
>> > Hmm, here I use Spark in local mode on my laptop with 8 cores. The data
>> > is on my local filesystem. Even though there is an overhead due to the
>> > distributed computation, I find the difference between the runtimes of
>> > the two implementations really, really huge. Is there a benchmark on how
>> > well the algorithm implemented in MLlib performs?
>> >
>> > On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> Spark has much more overhead, since it's set up to distribute the
>> >> computation. Julia isn't distributed, and so has no such overhead in a
>> >> completely in-core implementation. You generally use Spark when you
>> >> have a problem large enough to warrant distributing, or when your data
>> >> already lives in a distributed store like HDFS.
>> >>
>> >> But it's also possible you're not configuring the implementations the
>> >> same way, yes. There's not enough info here really to say.
>> >>
>> >> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com>
>> >> wrote:
>> >> > Hi all,
>> >> >
>> >> > I'm trying to run clustering with the k-means algorithm. My data set
>> >> > is about 240k vectors of dimension 384.
>> >> >
>> >> > Solving the problem with the kmeans available in Julia (kmeans++),
>> >> >
>> >> > http://clusteringjl.readthedocs.org/en/latest/kmeans.html
>> >> >
>> >> > takes about 8 minutes on a single core.
>> >> >
>> >> > Solving the same problem with Spark's kmeans|| takes more than 1.5
>> >> > hours with 8 cores!
>> >> >
>> >> > Either they don't implement the same algorithm or I don't understand
>> >> > how the kmeans in Spark works. Is my data not big enough to take full
>> >> > advantage of Spark? At the least, I expected the same runtime.
>> >> >
>> >> >
>> >> > Cheers,
>> >> >
>> >> >
>> >> > Jao

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
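
[Editor's note] One variable worth isolating in a comparison like this is the initialization step: Clustering.jl uses k-means++ seeding, while MLlib's KMeans.train defaults to k-means||, which makes extra passes over the data before Lloyd's iterations begin. The sketch below (not from the original thread; the function name and parameter values are illustrative only) shows how the initialization mode can be set explicitly through MLlib's builder API so the two runs can be compared on more equal footing:

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// data is the same RDD[Vector] built from sample.csv in TestKMeans above.
def trainWithExplicitInit(data: RDD[Vector]): KMeansModel = {
  new KMeans()
    .setK(500)                            // numClusters from TestKMeans
    .setMaxIterations(2)                  // numIterations from TestKMeans
    .setInitializationMode(KMeans.RANDOM) // default is KMeans.K_MEANS_PARALLEL ("k-means||")
    .setInitializationSteps(5)            // only relevant when using k-means|| initialization
    .run(data)
}

If the k-means|| initialization passes dominate the runtime, switching the mode for a test run helps separate the cost of initialization from the cost of the Lloyd iterations themselves.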