Is it v0.9? Did you run in local mode? Try setting --driver-memory 4g
and repartition your data to match the number of CPU cores so that the
data is evenly distributed. You need about 1m * 50 * 8 ~ 400MB to store
the data. Make sure there is enough memory for caching. -Xiangrui
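A minimal sketch of the suggested setup, assuming the same pre-1.0 MLlib API used below (where KMeans.train takes an RDD of Array[Double]); the HDFS path, partition count (here 8, one per core), and variable names are illustrative:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.mllib.clustering.KMeans

// Assumes an existing SparkContext `sc`.
val numPartitions = 8  // match the number of CPU cores available
val data = sc.textFile("hdfs://...inputData.txt")
val parsed = data
  .map(_.split('\t').map(_.toDouble))
  .repartition(numPartitions)            // spread rows evenly across cores

parsed.persist(StorageLevel.MEMORY_ONLY) // cache before iterative training
parsed.count()                           // force materialization so every
                                         // K-Means iteration hits the cache

val clusters = KMeans.train(parsed, 200, 10)
```

Persisting alone doesn't materialize the RDD (persist is lazy), so forcing an action such as count() before training keeps the first iteration from paying the parsing cost.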

On Thu, Jul 17, 2014 at 1:48 AM, Ravishankar Rajagopalan
<viora...@gmail.com> wrote:
> I am trying to use MLlib for K-Means clustering on a data set with 1 million
> rows and 50 columns (all columns have double values) which is on HDFS (raw
> txt file is 28 MB)
>
> I initially tried the following:
>
> val data3 = sc.textFile("hdfs://...inputData.txt")
> val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
> val numIterations = 10
> val numClusters = 200
> val clusters = KMeans.train(parsedData3, numClusters, numIterations)
>
> This took me nearly 850 seconds.
>
> I tried using persist with MEMORY_ONLY option hoping that this would
> significantly speed up the algorithm:
>
> val data3 = sc.textFile("hdfs://...inputData.txt")
> val parsedData3 = data3.map( _.split('\t').map(_.toDouble))
> import org.apache.spark.storage.StorageLevel
> parsedData3.persist(StorageLevel.MEMORY_ONLY)
> val numIterations = 10
> val numClusters = 200
> val clusters = KMeans.train(parsedData3, numClusters, numIterations)
>
> This resulted in only a marginal improvement and took around 720 seconds.
>
> Is there any other way to speed up the algorithm further?
>
> Thank you.
>
> Regards,
> Ravi
