We test large feature dimension but not very large k
(https://github.com/databricks/spark-perf/blob/master/config/config.py.template#L525).
Again, please create a JIRA and post your test code and a link to your test
dataset, so we can work on it. It is hard to track an issue across multiple
threads on the mailing list.

-Xiangrui
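As context for the timings reported in the quoted thread below (k=500 finishing in ~3 hrs, and the guess that k=5000 would take ~30 hrs): each Lloyd iteration of k-means computes one distance per (point, centroid) pair, so raising k tenfold raises per-iteration work roughly tenfold. A back-of-envelope sketch, with object and method names that are mine, not Spark's:

```scala
object KMeansCost {
  // Distance computations per Lloyd iteration: one per (point, centroid)
  // pair, each costing O(d) for d-dimensional vectors.
  def perIteration(n: Long, k: Long, d: Long): Long = n * k * d

  def main(args: Array[String]): Unit = {
    val n = 40000L // vectors in the thread's dataset
    val d = 350L   // features per vector
    val ratio = perIteration(n, 5000L, d) / perIteration(n, 500L, d)
    println(s"k=5000 does ${ratio}x the per-iteration work of k=500")
  }
}
```

This only accounts for the assignment step; kmeans|| initialization adds its own passes, which is consistent with kmeans-random being faster for the same k.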
On Mon, Mar 30, 2015 at 3:55 PM, Xi Shen <davidshe...@gmail.com> wrote:
> For the same amount of data, if I set k=500, the job finished in about 3
> hrs. I wonder whether, if I set k=5000, the job could finish in 30 hrs...
> the longest I waited was 12 hrs...
>
> If I use kmeans-random on the same amount of data with k=5000, the job
> finished in less than 2 hrs.
>
> I think the current kmeans|| implementation cannot handle large vector
> dimensions properly. In my case, my vectors have about 350 dimensions. I
> found another post complaining about k-means performance in Spark, and
> that person has vectors of 200 dimensions.
>
> It is possible nobody has tested the large-dimension case.
>
>
> Thanks,
> David
>
>
> On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Hi Xi,
>>
>> Please create a JIRA if it takes longer to locate the issue. Did you
>> try a smaller k?
>>
>> Best,
>> Xiangrui
>>
>> On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen <davidshe...@gmail.com> wrote:
>> > Hi Burak,
>> >
>> > After I added .repartition(sc.defaultParallelism), I can see from the
>> > log that the partition number is set to 32. But in the Spark UI, it
>> > seems all the data are loaded onto one executor. Previously they were
>> > loaded onto 4 executors.
>> >
>> > Any idea?
>> >
>> >
>> > Thanks,
>> > David
>> >
>> >
>> > On Fri, Mar 27, 2015 at 11:01 AM Xi Shen <davidshe...@gmail.com> wrote:
>> >>
>> >> How do I get the number of cores that I specified at the command
>> >> line? I want to use "spark.default.parallelism". I have 4 executors,
>> >> each with 8 cores. According to
>> >> https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
>> >> the "spark.default.parallelism" value will be 4 * 8 = 32...I think it
>> >> is too large, or inappropriate. Please give some suggestions.
>> >>
>> >> I have already used cache, and count to pre-cache.
>> >>
>> >> I can try a smaller k for testing, but eventually I will have to use
>> >> k = 5000 or even larger, because I estimate our data set would have
>> >> that many clusters.
>> >>
>> >>
>> >> Thanks,
>> >> David
>> >>
>> >>
>> >> On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz <brk...@gmail.com> wrote:
>> >>>
>> >>> Hi David,
>> >>> The number of centroids (k=5000) seems too large and is probably the
>> >>> cause of the code taking too long.
>> >>>
>> >>> Can you please try the following:
>> >>> 1) Repartition data to the number of available cores with
>> >>>    .repartition(numCores)
>> >>> 2) Cache data
>> >>> 3) Call .count() on data right before k-means
>> >>> 4) Try k=500 (even less if possible)
>> >>>
>> >>> Thanks,
>> >>> Burak
>> >>>
>> >>> On Mar 26, 2015 4:15 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
>> >>> >
>> >>> > The code is very simple:
>> >>> >
>> >>> > val data = sc.textFile("very/large/text/file") map { l =>
>> >>> >   // turn each line into a dense vector
>> >>> >   Vectors.dense(...)
>> >>> > }
>> >>> >
>> >>> > // the resulting data set is about 40k vectors
>> >>> >
>> >>> > KMeans.train(data, k=5000, maxIterations=500)
>> >>> >
>> >>> > I just killed my application. In the log I found this:
>> >>> >
>> >>> > 15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
>> >>> > block broadcast_26_piece0
>> >>> > 15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
>> >>> > connection from
>> >>> > workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
>> >>> > java.io.IOException: An existing connection was forcibly closed by
>> >>> > the remote host
>> >>> >
>> >>> > Notice the time gap. I think it means the worker nodes did not
>> >>> > generate any log at all for about 12 hrs...does it mean they are
>> >>> > not working at all?
>> >>> >
>> >>> > But when testing with a very small data set, my application works
>> >>> > and outputs the expected data.
>> >>> >
>> >>> >
>> >>> > Thanks,
>> >>> > David
>> >>> >
>> >>> >
>> >>> > On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz <brk...@gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Can you share the code snippet of how you call k-means? Do you
>> >>> >> cache the data before k-means? Did you repartition the data?
>> >>> >>
>> >>> >> On Mar 26, 2015 4:02 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
>> >>> >>>
>> >>> >>> Oh, the job I talked about has run for more than 11 hrs without
>> >>> >>> a result...it doesn't make sense.
>> >>> >>>
>> >>> >>>
>> >>> >>> On Fri, Mar 27, 2015 at 9:48 AM Xi Shen <davidshe...@gmail.com>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> Hi Burak,
>> >>> >>>>
>> >>> >>>> My maxIterations is set to 500. But I think it should also stop
>> >>> >>>> once the centroids converge, right?
>> >>> >>>>
>> >>> >>>> My Spark is 1.2.0, running on Windows 64 bit. My data set is
>> >>> >>>> about 40k vectors; each vector has about 300 features, all
>> >>> >>>> normalised. All worker nodes have sufficient memory and disk
>> >>> >>>> space.
>> >>> >>>>
>> >>> >>>> Thanks,
>> >>> >>>> David
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> On Fri, 27 Mar 2015 02:48 Burak Yavuz <brk...@gmail.com> wrote:
>> >>> >>>>>
>> >>> >>>>> Hi David,
>> >>> >>>>>
>> >>> >>>>> In my experience, when the number of runs is large and the
>> >>> >>>>> data is not properly partitioned, K-Means appears to hang.
>> >>> >>>>> In particular, setting the number of runs to something high
>> >>> >>>>> drastically increases the work in the executors. If that's not
>> >>> >>>>> the case, can you give more info on what Spark version you are
>> >>> >>>>> using, your setup, and your dataset?
>> >>> >>>>>
>> >>> >>>>> Thanks,
>> >>> >>>>> Burak
>> >>> >>>>>
>> >>> >>>>> On Mar 26, 2015 5:10 AM, "Xi Shen" <davidshe...@gmail.com>
>> >>> >>>>> wrote:
>> >>> >>>>>>
>> >>> >>>>>> Hi,
>> >>> >>>>>>
>> >>> >>>>>> When I run k-means clustering with Spark, I get these last
>> >>> >>>>>> two lines in the log:
>> >>> >>>>>>
>> >>> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
>> >>> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
>> >>> >>>>>>
>> >>> >>>>>> Then it hangs for a long time. There's no active job. The
>> >>> >>>>>> driver machine is idle. I cannot access the worker nodes, so
>> >>> >>>>>> I am not sure if they are busy.
>> >>> >>>>>>
>> >>> >>>>>> I understand k-means may take a long time to finish. But why
>> >>> >>>>>> is there no active job? No log?
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> Thanks,
>> >>> >>>>>> David
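To make the cost discussion in the thread concrete, here is a minimal single-machine sketch of the k-means assignment step. This is not MLlib's implementation (the names and structure are mine); it only shows the n * k distance computations, each O(d), that dominate every iteration and grow tenfold when k goes from 500 to 5000:

```scala
object Lloyd {
  type Vec = Array[Double]

  // Squared Euclidean distance between two equal-length vectors.
  def sqDist(a: Vec, b: Vec): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { val diff = a(i) - b(i); s += diff * diff; i += 1 }
    s
  }

  // Assignment step: for each point, the index of the nearest center.
  // Cost: n * k distance computations per iteration.
  def assign(points: Seq[Vec], centers: IndexedSeq[Vec]): Seq[Int] =
    points.map(p => centers.indices.minBy(j => sqDist(p, centers(j))))
}
```

With 40k points, 350 dimensions, and k=5000, a single assignment pass is already 200 million O(350) distance computations, before counting kmeans|| initialization or multiple runs.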
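David also asks whether training stops before maxIterations once the centroids converge. K-means implementations typically do stop early when no center moves farther than some tolerance between iterations; the sketch below is a simplified stand-in for that check (names and the tolerance value are mine, not MLlib's code):

```scala
object Convergence {
  type Vec = Array[Double]

  // Euclidean distance between a center's old and new position.
  def dist(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // True when no center moved farther than eps since the last iteration,
  // i.e. the centroids have numerically converged and iteration can stop.
  def converged(previous: Seq[Vec], current: Seq[Vec], eps: Double): Boolean =
    previous.zip(current).forall { case (p, c) => dist(p, c) < eps }
}
```

If the run in the thread genuinely hung rather than iterating, this check never gets a chance to fire, which is why the lack of any log output for 12 hrs points at a stalled job rather than slow convergence.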