Could you post your script to reproduce the results (and how you
generate the dataset)? That will help us investigate it.
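
Something along these lines would be ideal. To be clear, this is only a
hypothetical sketch of such a script — the value of k, the iteration cap,
and the synthetic data generation are placeholders, since we don't know
your actual settings:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.clustering.KMeans
  import org.apache.spark.mllib.linalg.Vectors

  object KMeansRepro {
    def main(args: Array[String]): Unit = {
      // Local mode with 8 cores, matching the setup described below.
      val sc = new SparkContext(
        new SparkConf().setAppName("KMeansRepro").setMaster("local[8]"))

      // Synthetic stand-in for the real data: 240k vectors of dimension 384.
      val data = sc.parallelize(0 until 240000, 8).mapPartitionsWithIndex {
        (part, rows) =>
          val rng = new scala.util.Random(part)
          rows.map(_ => Vectors.dense(Array.fill(384)(rng.nextDouble())))
      }.cache() // k-means makes many passes over the data; cache the input

      val start = System.nanoTime()
      val model = KMeans.train(data, 100, 20) // k = 100, maxIterations = 20
      println(s"Trained in ${(System.nanoTime() - start) / 1e9} s, " +
        s"cost = ${model.computeCost(data)}")

      sc.stop()
    }
  }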

On Fri, Dec 5, 2014 at 8:40 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> Hmm, here I run Spark in local mode on my laptop with 8 cores. The data is
> on my local filesystem. Even though there is an overhead due to the
> distributed computation, I find the difference between the runtimes of the
> two implementations really, really huge. Is there a benchmark on how well
> the algorithm implemented in MLlib performs?
>
> On Fri, Dec 5, 2014 at 4:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> Spark has much more overhead, since it's set up to distribute the
>> computation. Julia isn't distributed, and so has no such overhead in a
>> completely in-core implementation. You generally use Spark when you
>> have a problem large enough to warrant distributing, or your data
>> already lives in a distributed store like HDFS.
>>
>> But it's also possible you're not configuring the two implementations the
>> same way, yes. There's really not enough info here to say.
>>
>> On Fri, Dec 5, 2014 at 9:50 AM, Jaonary Rabarisoa <jaon...@gmail.com>
>> wrote:
>> > Hi all,
>> >
>> > I'm trying to run clustering with the kmeans algorithm. My data set
>> > is about 240k vectors of dimension 384.
>> >
>> > Solving the problem with the kmeans available in Julia (kmeans++)
>> >
>> > http://clusteringjl.readthedocs.org/en/latest/kmeans.html
>> >
>> > takes about 8 minutes on a single core.
>> >
>> > Solving the same problem with Spark's kmeans|| takes more than 1.5 hours
>> > with 8 cores!
>> >
>> > Either they don't implement the same algorithm, or I don't understand
>> > how the kmeans in Spark works. Is my data not big enough to take full
>> > advantage of Spark? At the very least, I expected the same runtime.
>> >
>> >
>> > Cheers,
>> >
>> >
>> > Jao
>
>
