How do I get the number of cores that I specified at the command line? I want to use "spark.default.parallelism". I have 4 executors, each with 8 cores. According to https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior, the "spark.default.parallelism" value will be 4 * 8 = 32...I think that is too large, or at least inappropriate for my job. Please give some suggestions.
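For reference, the arithmetic above can be sketched as follows (a minimal sketch; it assumes a cluster mode where the default parallelism equals the total cores across executors, and the 4 x 8 layout is taken from the question):

```scala
// A minimal sketch of the arithmetic above, assuming a cluster where
// spark.default.parallelism defaults to total cores across all executors.
// The 4 executors x 8 cores layout comes from the question.
val numExecutors = 4
val coresPerExecutor = 8
val defaultParallelism = numExecutors * coresPerExecutor
println(defaultParallelism) // 32

// To override it explicitly (set on the SparkConf before the SparkContext
// is created; the value 16 here is just a hypothetical choice):
//   val conf = new SparkConf().set("spark.default.parallelism", "16")
```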
I have already used cache, and count to pre-cache. I can try with a smaller k for testing, but eventually I will have to use k = 5000 or even larger, because I estimate our data set would have that many clusters.

Thanks,
David

On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz <brk...@gmail.com> wrote:

> Hi David,
>
> The number of centroids (k=5000) seems too large and is probably the cause
> of the code taking too long.
>
> Can you please try the following:
> 1) Repartition data to the number of available cores with
> .repartition(numCores)
> 2) cache data
> 3) call .count() on data right before k-means
> 4) try k=500 (even less if possible)
>
> Thanks,
> Burak
>
> On Mar 26, 2015 4:15 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
> >
> > The code is very simple:
> >
> >   val data = sc.textFile("very/large/text/file") map { l =>
> >     // turn each line into a dense vector
> >     Vectors.dense(...)
> >   }
> >
> >   // the resulting data set is about 40k vectors
> >
> >   KMeans.train(data, k=5000, maxIterations=500)
> >
> > I just killed my application. In the log I found this:
> >
> >   15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of block broadcast_26_piece0
> >   15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in connection from workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
> >   java.io.IOException: An existing connection was forcibly closed by the remote host
> >
> > Notice the time gap. I think it means the worker nodes did not generate
> > any log at all for about 12 hrs...does it mean they are not working at all?
> >
> > But when testing with a very small data set, my application works and
> > outputs the expected data.
> >
> > Thanks,
> > David
> >
> > On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz <brk...@gmail.com> wrote:
> >>
> >> Can you share the code snippet of how you call k-means? Do you cache
> >> the data before k-means? Did you repartition the data?
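Burak's four suggestions above, applied to the snippet in the thread, could look roughly like this (a sketch, not tested code; it assumes a live SparkContext `sc`, a comma-separated input format for the line parsing, and a hypothetical `numCores` matching the cores available to the job):

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Sketch of the four steps suggested above. Assumes a live SparkContext `sc`;
// the CSV parsing and numCores = 32 are assumptions for illustration.
val numCores = 32
val data = sc.textFile("very/large/text/file")
  .map(line => Vectors.dense(line.split(',').map(_.toDouble)))
  .repartition(numCores) // 1) repartition to the number of available cores
  .cache()               // 2) mark the RDD for caching

data.count()             // 3) force materialization right before k-means

// 4) start with a much smaller k for testing
val model = KMeans.train(data, 500, 100)
```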
> >>
> >> On Mar 26, 2015 4:02 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
> >>>
> >>> Oh, the job I talked about has run for more than 11 hrs without a
> >>> result...it doesn't make sense.
> >>>
> >>> On Fri, Mar 27, 2015 at 9:48 AM Xi Shen <davidshe...@gmail.com> wrote:
> >>>>
> >>>> Hi Burak,
> >>>>
> >>>> My maxIterations is set to 500, but I think it should also stop early
> >>>> if the centroids converge, right?
> >>>>
> >>>> My Spark is 1.2.0, running on Windows 64 bit. My data set is about
> >>>> 40k vectors; each vector has about 300 features, all normalised. All
> >>>> worker nodes have sufficient memory and disk space.
> >>>>
> >>>> Thanks,
> >>>> David
> >>>>
> >>>> On Fri, 27 Mar 2015 02:48 Burak Yavuz <brk...@gmail.com> wrote:
> >>>>>
> >>>>> Hi David,
> >>>>>
> >>>>> When the number of runs is large and the data is not properly
> >>>>> partitioned, K-Means can appear to hang, in my experience. In
> >>>>> particular, setting the number of runs to something high drastically
> >>>>> increases the work in the executors. If that's not the case, can you
> >>>>> give more info on what Spark version you are using, your setup, and
> >>>>> your dataset?
> >>>>>
> >>>>> Thanks,
> >>>>> Burak
> >>>>>
> >>>>> On Mar 26, 2015 5:10 AM, "Xi Shen" <davidshe...@gmail.com> wrote:
> >>>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> When I run k-means clustering with Spark, these are the last two
> >>>>>> lines in the log:
> >>>>>>
> >>>>>>   15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast 26
> >>>>>>   15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
> >>>>>>
> >>>>>> Then it hangs for a long time. There is no active job. The driver
> >>>>>> machine is idle. I cannot access the worker nodes, so I am not sure
> >>>>>> if they are busy.
> >>>>>>
> >>>>>> I understand k-means may take a long time to finish. But why is
> >>>>>> there no active job? No log?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> David
> >>>>>>
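On the runs/convergence points discussed above: in MLlib 1.2 these knobs can also be set through the KMeans builder rather than the static `train` call (a sketch; assumes `data` is an already cached RDD[Vector], and the specific values are illustrative, not recommendations):

```scala
import org.apache.spark.mllib.clustering.KMeans

// Sketch of the convergence-related settings mentioned in the thread
// (MLlib 1.2 builder API); assumes `data` is a cached RDD[Vector].
val model = new KMeans()
  .setK(500)
  .setMaxIterations(500) // upper bound only; iteration stops early once
  .setEpsilon(1e-4)      // centroid movement falls below epsilon
  .setRuns(1)            // keep runs low: high values multiply executor work
  .run(data)
```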