We test large feature dimensions but not very large k
(https://github.com/databricks/spark-perf/blob/master/config/config.py.template#L525).
Again, please create a JIRA and post your test code and a link to your
test dataset so we can work on it. It is hard to track the issue across
multiple threads on the mailing list. -Xiangrui

On Mon, Mar 30, 2015 at 3:55 PM, Xi Shen <davidshe...@gmail.com> wrote:
> For the same amount of data, if I set k=500, the job finished in about 3
> hrs. I wonder whether, if I set k=5000, the job could finish in 30 hrs...the
> longest I have waited so far is 12 hrs...
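As a rough sanity check on that extrapolation: each Lloyd's iteration does on the order of n * k * d distance computations, so runtime should scale roughly linearly in k. A back-of-envelope sketch (the 3-hour figure and the n/d values come from this thread; everything else is illustrative):

```scala
// Back-of-envelope cost model for Lloyd's k-means iterations.
// Per iteration, each of n points is compared against k centroids
// in d dimensions, so the work is proportional to n * k * d.
object KMeansCostSketch {
  def perIterationOps(n: Long, k: Long, d: Long): Long = n * k * d

  def main(args: Array[String]): Unit = {
    val n = 40000L // ~40k vectors, as described in the thread
    val d = 350L   // ~350 features per vector
    val opsK500  = perIterationOps(n, 500L, d)
    val opsK5000 = perIterationOps(n, 5000L, d)
    // k=5000 does 10x the per-iteration work of k=500, so a linear
    // extrapolation of the observed ~3 hrs gives ~30 hrs.
    println(opsK5000 / opsK500)       // 10
    println(3.0 * opsK5000 / opsK500) // 30.0 (hours, extrapolated)
  }
}
```

This ignores initialization cost, which is exactly where k-means|| and random initialization differ, so it is only a lower-bound intuition.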
>
> If I use kmeans-random on the same amount of data with k=5000, the job
> finished in less than 2 hrs.
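For reference, MLlib exposes the initialization mode as a parameter, so the two behaviors above can be compared directly. A sketch only, not runnable outside a Spark application: `data: RDD[Vector]` is assumed to be prepared as in the snippet later in the thread, and `runs = 1` is illustrative.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch: compare the two initialization modes on the same RDD.
def trainBoth(data: RDD[Vector]): Unit = {
  // Default initialization: k-means|| (extra passes over the data up front)
  val parallelModel =
    KMeans.train(data, 5000, 500, 1, KMeans.K_MEANS_PARALLEL)
  // Random initialization: cheap init, reported above as finishing faster
  val randomModel =
    KMeans.train(data, 5000, 500, 1, KMeans.RANDOM)
}
```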
>
> I think the current kmeans|| implementation cannot handle large vector
> dimensions properly. In my case, my vectors have about 350 dimensions. I
> found another post complaining about k-means performance in Spark, and that
> person has vectors of 200 dimensions.
>
> It is possible that the large-dimension case was never tested.
>
>
> Thanks,
> David
>
>
>
>
> On Tue, Mar 31, 2015 at 4:00 AM Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Hi Xi,
>>
>> Please create a JIRA if it takes longer to locate the issue. Did you
>> try a smaller k?
>>
>> Best,
>> Xiangrui
>>
>> On Thu, Mar 26, 2015 at 5:45 PM, Xi Shen <davidshe...@gmail.com> wrote:
>> > Hi Burak,
>> >
>> > After I added .repartition(sc.defaultParallelism), I can see from the
>> > log that the partition number is set to 32. But in the Spark UI, it seems
>> > all the data is loaded onto one executor. Previously it was loaded onto 4
>> > executors.
>> >
>> > Any idea?
>> >
>> >
>> > Thanks,
>> > David
>> >
>> >
>> > On Fri, Mar 27, 2015 at 11:01 AM Xi Shen <davidshe...@gmail.com> wrote:
>> >>
>> >> How do I get the number of cores that I specified on the command line? I
>> >> want to use "spark.default.parallelism". I have 4 executors, each with 8
>> >> cores. According to
>> >> https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
>> >> the "spark.default.parallelism" value will be 4 * 8 = 32...I think that
>> >> may be too large, or otherwise inappropriate. Please give some
>> >> suggestions.
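The arithmetic behind that guideline is simply total cores across the cluster, i.e. executors times cores per executor. A trivial sketch with the numbers from this thread:

```scala
// spark.default.parallelism guideline for cluster mode:
// the total number of cores across all executors.
object ParallelismSketch {
  def defaultParallelism(executors: Int, coresPerExecutor: Int): Int =
    executors * coresPerExecutor

  def main(args: Array[String]): Unit = {
    // 4 executors x 8 cores each, as described above
    println(defaultParallelism(4, 8)) // 32
  }
}
```

32 partitions for 32 cores is generally not too large; one or more partitions per core is the usual starting point.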
>> >>
>> >> I have already used cache, and called count to force the data to be
>> >> cached.
>> >>
>> >> I can try a smaller k for testing, but eventually I will have to use
>> >> k = 5000 or even larger, because I estimate our data set has that many
>> >> clusters.
>> >>
>> >>
>> >> Thanks,
>> >> David
>> >>
>> >>
>> >> On Fri, Mar 27, 2015 at 10:40 AM Burak Yavuz <brk...@gmail.com> wrote:
>> >>>
>> >>> Hi David,
>> >>> The number of centroids (k=5000) seems too large and is probably
>> >>> why the job is taking so long.
>> >>>
>> >>> Can you please try the following:
>> >>> 1) Repartition data to the number of available cores with
>> >>> .repartition(numCores)
>> >>> 2) cache data
>> >>> 3) call .count() on data right before k-means
>> >>> 4) try k=500 (even less if possible)
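The four steps above might look like the following in the driver program. A sketch only, not runnable outside a Spark application: `data: RDD[Vector]` and `numCores` are assumed to come from the surrounding code.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Sketch of the suggested steps, in order.
def runKMeans(data: RDD[Vector], numCores: Int) = {
  val partitioned = data.repartition(numCores) // 1) repartition to core count
  partitioned.cache()                          // 2) mark the RDD for caching
  partitioned.count()                          // 3) force materialization
  KMeans.train(partitioned, 500, 500)          // 4) start with a smaller k
}
```

The count() matters because cache() is lazy; without an action, k-means' first iteration pays the full load cost.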
>> >>>
>> >>> Thanks,
>> >>> Burak
>> >>>
>> >>> On Mar 26, 2015 4:15 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
>> >>> >
>> >>> > The code is very simple.
>> >>> >
>> >>> > import org.apache.spark.mllib.clustering.KMeans
>> >>> > import org.apache.spark.mllib.linalg.Vectors
>> >>> >
>> >>> > val data = sc.textFile("very/large/text/file") map { l =>
>> >>> >   // turn each line into a dense vector
>> >>> >   Vectors.dense(...)
>> >>> > }
>> >>> >
>> >>> > // the resulting data set is about 40k vectors
>> >>> >
>> >>> > KMeans.train(data, k = 5000, maxIterations = 500)
>> >>> >
>> >>> > I just killed my application. In the log I found this:
>> >>> >
>> >>> > 15/03/26 11:42:43 INFO storage.BlockManagerMaster: Updated info of
>> >>> > block broadcast_26_piece0
>> >>> > 15/03/26 23:02:57 WARN server.TransportChannelHandler: Exception in
>> >>> > connection from
>> >>> >
>> >>> > workernode0.xshe3539-hadoop-sydney.q10.internal.cloudapp.net/100.72.84.107:56277
>> >>> > java.io.IOException: An existing connection was forcibly closed by
>> >>> > the
>> >>> > remote host
>> >>> >
>> >>> > Notice the time gap. I think it means the worker nodes did not
>> >>> > generate any log at all for about 12 hrs...does it mean they were not
>> >>> > working at all?
>> >>> >
>> >>> > But when testing with a very small data set, my application works and
>> >>> > outputs the expected data.
>> >>> >
>> >>> >
>> >>> > Thanks,
>> >>> > David
>> >>> >
>> >>> >
>> >>> > On Fri, Mar 27, 2015 at 10:04 AM Burak Yavuz <brk...@gmail.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Can you share the code snippet of how you call k-means? Do you
>> >>> >> cache
>> >>> >> the data before k-means? Did you repartition the data?
>> >>> >>
>> >>> >> On Mar 26, 2015 4:02 PM, "Xi Shen" <davidshe...@gmail.com> wrote:
>> >>> >>>
>> >>> >>> Oh, the job I talked about has run for more than 11 hrs without a
>> >>> >>> result...it doesn't make sense.
>> >>> >>>
>> >>> >>>
>> >>> >>> On Fri, Mar 27, 2015 at 9:48 AM Xi Shen <davidshe...@gmail.com>
>> >>> >>> wrote:
>> >>> >>>>
>> >>> >>>> Hi Burak,
>> >>> >>>>
>> >>> >>>> My maxIterations is set to 500. But I think it should also stop
>> >>> >>>> once the centroids converge, right?
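That intuition is right: Lloyd's algorithm typically stops early once no centroid moves more than a small tolerance, even if maxIterations has not been reached (MLlib's KMeans exposes this as an epsilon parameter). A toy 1-D illustration of the stopping rule, not MLlib's actual implementation:

```scala
// Toy 1-D Lloyd's loop with convergence-based early stopping:
// iterate until no centroid moves more than `epsilon`, or until
// maxIterations is hit, whichever comes first.
object ConvergenceSketch {
  def kmeans1d(points: Seq[Double], init: Seq[Double],
               maxIterations: Int, epsilon: Double): (Seq[Double], Int) = {
    var centroids = init
    var iters = 0
    var moved = Double.MaxValue
    while (iters < maxIterations && moved > epsilon) {
      // Assign each point to its nearest centroid.
      val assigned = points.groupBy(p => centroids.minBy(c => math.abs(c - p)))
      // Move each centroid to the mean of its assigned points.
      val updated = centroids.map(c =>
        assigned.get(c).map(ps => ps.sum / ps.size).getOrElse(c))
      moved = centroids.zip(updated).map { case (a, b) => math.abs(a - b) }.max
      centroids = updated
      iters += 1
    }
    (centroids, iters)
  }

  def main(args: Array[String]): Unit = {
    val pts = Seq(0.0, 1.0, 10.0, 11.0)
    val (cs, iters) =
      kmeans1d(pts, Seq(0.0, 10.0), maxIterations = 500, epsilon = 1e-4)
    println(cs)    // centroids near 0.5 and 10.5
    println(iters) // stops after a few iterations, well before 500
  }
}
```

So a job that truly runs for 500 iterations usually means the centroids are still moving, or something upstream (partitioning, caching) is the real bottleneck.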
>> >>> >>>>
>> >>> >>>> My Spark is 1.2.0, running on 64-bit Windows. My data set is
>> >>> >>>> about 40k vectors; each vector has about 300 features, all
>> >>> >>>> normalised. All worker nodes have sufficient memory and disk
>> >>> >>>> space.
>> >>> >>>>
>> >>> >>>> Thanks,
>> >>> >>>> David
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> On Fri, 27 Mar 2015 02:48 Burak Yavuz <brk...@gmail.com> wrote:
>> >>> >>>>>
>> >>> >>>>> Hi David,
>> >>> >>>>>
>> >>> >>>>> When the number of runs is large and the data is not properly
>> >>> >>>>> partitioned, K-Means can appear to hang, in my experience.
>> >>> >>>>> Setting the number of runs to something high drastically
>> >>> >>>>> increases the work in the executors. If that's not the case, can
>> >>> >>>>> you give more info on what Spark version you are using, your
>> >>> >>>>> setup, and your dataset?
>> >>> >>>>>
>> >>> >>>>> Thanks,
>> >>> >>>>> Burak
>> >>> >>>>>
>> >>> >>>>> On Mar 26, 2015 5:10 AM, "Xi Shen" <davidshe...@gmail.com>
>> >>> >>>>> wrote:
>> >>> >>>>>>
>> >>> >>>>>> Hi,
>> >>> >>>>>>
>> >>> >>>>>> When I run k-means cluster with Spark, I got this in the last
>> >>> >>>>>> two
>> >>> >>>>>> lines in the log:
>> >>> >>>>>>
>> >>> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned broadcast
>> >>> >>>>>> 26
>> >>> >>>>>> 15/03/26 11:42:42 INFO spark.ContextCleaner: Cleaned shuffle 5
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> Then it hangs for a long time. There's no active job, and the
>> >>> >>>>>> driver machine is idle. I cannot access the worker nodes, so I am
>> >>> >>>>>> not sure whether they are busy.
>> >>> >>>>>>
>> >>> >>>>>> I understand k-means may take a long time to finish. But why is
>> >>> >>>>>> there no active job, and no log?
>> >>> >>>>>>
>> >>> >>>>>>
>> >>> >>>>>> Thanks,
>> >>> >>>>>> David
>> >>> >>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
