How many dimensions does your data have? The size of the k-means model is k * d, where d is the dimension of the data.
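[Editor's note: a back-of-the-envelope sketch of the model-size arithmetic described here; the helper name and the k=5000/d=10,000 figures are illustrative, not from the thread.]

```scala
// Rough driver-side footprint of the k-means model: the k cluster centers
// are held as dense arrays of doubles (8 bytes per element), so the model
// occupies about k * d * 8 bytes on the driver.
def modelSizeBytes(k: Long, d: Long): Long = k * d * 8L

// e.g. k = 5000 clusters over d = 10000 features:
// 5000 * 10000 * 8 = 400,000,000 bytes (~400 MB) of centers that must fit
// in driver memory (and are shipped to executors each iteration).
val bytes = modelSizeBytes(5000L, 10000L)
```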
Since you're using k=1000, if your data has dimension higher than, say, 10,000, you will have trouble, because k*d doubles have to fit in the driver.

Reza

On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen <davidshe...@gmail.com> wrote:

> I have put more detail of my problem at
> http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed
>
> I would really appreciate it if you could take a look at this problem. I
> have tried various settings and ways to load/partition my data, but I just
> cannot get rid of that long pause.
>
> Thanks,
> David
>
> On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen <davidshe...@gmail.com> wrote:
>
>> Yes, I have done repartition.
>>
>> I tried to repartition to the number of cores in my cluster. Not
>> helping...
>> I tried to repartition to the number of centroids (k value). Not
>> helping...
>>
>> On Sat, Mar 28, 2015 at 7:27 AM, Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> Can you try specifying the number of partitions when you load the data
>>> to equal the number of executors? If your ETL changes the number of
>>> partitions, you can also repartition before calling KMeans.
>>>
>>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen <davidshe...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a large data set, and I expect to get 5000 clusters.
>>>>
>>>> I load the raw data and convert it into DenseVector; then I repartition
>>>> and cache; finally I give the RDD[Vector] to KMeans.train().
>>>>
>>>> Now the job is running and the data is loaded. But according to the
>>>> Spark UI, all data is loaded onto one executor. I checked that executor:
>>>> its CPU workload is very low, I think it is using only 1 of its 8 cores,
>>>> and the other 3 executors are idle.
>>>>
>>>> Did I miss something? Is it possible to distribute the workload to all
>>>> 4 executors?
>>>>
>>>> Thanks,
>>>> David
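[Editor's note: a minimal sketch of the advice in this thread, i.e. setting the partition count at load time and repartitioning before training. It assumes the Spark 1.x MLlib RDD API and whitespace-delimited numeric input; the path, partition count, and k value below are placeholders, not from the thread.]

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc: SparkContext = ??? // your existing SparkContext
val numPartitions = 32     // e.g. total cores across the 4 executors

val data = sc
  // Ask for enough partitions up front, at load time.
  .textFile("hdfs:///path/to/data", minPartitions = numPartitions)
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  // Only needed if upstream ETL collapsed the partitioning.
  .repartition(numPartitions)
  .cache()

val model = KMeans.train(data, k = 5000, maxIterations = 20)
```

Checking `data.partitions.length` and the per-executor task counts in the Spark UI after this change should show whether the work is actually being spread across all executors.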