Hey Xi,

Have you tried Spark 1.3.0? The initialization happens on the driver node, and we fixed an issue with the initialization in 1.3.0. Again, please start with a smaller k and increase it gradually. Let us know at what k the problem appears.
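For reference, a minimal sketch (e.g. in spark-shell) of probing increasing k values with MLlib's KMeans.train; the k schedule and maxIterations below are only placeholder values, and `data` stands for the cached RDD[Vector] already built in the original job:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Start with a small k and grow it, noting where the long pause appears.
    def probeK(data: RDD[Vector]): Unit = {
      for (k <- Seq(100, 500, 1000, 2000, 5000)) {
        val start = System.currentTimeMillis()
        val model = KMeans.train(data, k, 20)   // maxIterations = 20 (placeholder)
        val secs = (System.currentTimeMillis() - start) / 1000.0
        println(s"k=$k finished in $secs s, cost=${model.computeCost(data)}")
      }
    }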
Best,
Xiangrui

On Sat, Mar 28, 2015 at 3:11 AM, Xi Shen <davidshe...@gmail.com> wrote:

> My vector dimension is about 360. The data count is about 270k. My driver
> has 2.9G memory. I attached a screenshot of the current executor status.
> I submitted this job with "--master yarn-cluster". I have a total of 7
> worker nodes; one of them acts as the driver. In the screenshot, you can
> see that all worker nodes have loaded some data, but the driver has not
> loaded any data.
>
> The funny thing is, when I log on to the driver and check its CPU and
> memory status, I see one java process using about 18% of the CPU and
> about 1.6 GB of memory.
>
> [image: Inline image 1]
>
> On Sat, Mar 28, 2015 at 7:06 PM Reza Zadeh <r...@databricks.com> wrote:
>
>> How many dimensions does your data have? The size of the k-means model
>> is k * d, where d is the dimension of the data.
>>
>> Since you're using k=1000, if your data has dimension higher than, say,
>> 10,000, you will have trouble, because k*d doubles have to fit on the
>> driver.
>>
>> Reza
>>
>> On Sat, Mar 28, 2015 at 12:27 AM, Xi Shen <davidshe...@gmail.com> wrote:
>>
>>> I have put more detail of my problem at
>>> http://stackoverflow.com/questions/29295420/spark-kmeans-computation-cannot-be-distributed
>>>
>>> It would be really appreciated if you could help me take a look at this
>>> problem. I have tried various settings and ways to load/partition my
>>> data, but I just cannot get rid of that long pause.
>>>
>>> Thanks,
>>> David
>>>
>>> Xi Shen
>>> about.me/davidshen
>>>
>>> On Sat, Mar 28, 2015 at 2:38 PM, Xi Shen <davidshe...@gmail.com> wrote:
>>>
>>>> Yes, I have done repartition.
>>>>
>>>> I tried to repartition to the number of cores in my cluster. Not
>>>> helping...
>>>> I tried to repartition to the number of centroids (the k value). Not
>>>> helping...
>>>>
>>>> On Sat, Mar 28, 2015 at 7:27 AM Joseph Bradley <jos...@databricks.com>
>>>> wrote:
>>>>
>>>>> Can you try specifying the number of partitions when you load the
>>>>> data, to equal the number of executors? If your ETL changes the
>>>>> number of partitions, you can also repartition before calling KMeans.
>>>>>
>>>>> On Thu, Mar 26, 2015 at 8:04 PM, Xi Shen <davidshe...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a large data set, and I expect to get 5000 clusters.
>>>>>>
>>>>>> I load the raw data and convert it into DenseVectors; then I
>>>>>> repartition and cache; finally I give the RDD[Vector] to
>>>>>> KMeans.train().
>>>>>>
>>>>>> Now the job is running and the data are loaded. But according to the
>>>>>> Spark UI, all data are loaded onto one executor. I checked that
>>>>>> executor, and its CPU workload is very low. I think it is using only
>>>>>> 1 of its 8 cores, and the other 3 executors are idle.
>>>>>>
>>>>>> Did I miss something? Is it possible to distribute the workload to
>>>>>> all 4 executors?
>>>>>>
>>>>>> Thanks,
>>>>>> David
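For reference, a minimal sketch of the pipeline described in the original post, with the partitioning suggestion from the thread applied (repartition to the total number of executor cores, here 4 executors x 8 cores). The input path, delimiter, and k/maxIterations values are hypothetical placeholders, not taken from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    object KMeansJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("kmeans-example"))

        // Spread the input over all executor cores up front,
        // and repartition again in case the ETL changed the partitioning.
        val numPartitions = 4 * 8
        val data = sc
          .textFile("hdfs:///path/to/data.txt", numPartitions)   // hypothetical path
          .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
          .repartition(numPartitions)
          .cache()

        val model = KMeans.train(data, 5000, 20)   // k = 5000, maxIterations = 20
        println(s"cost = ${model.computeCost(data)}")

        sc.stop()
      }
    }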