Also, the cluster centroids I get in streaming mode (some with negative values) do not make sense. If I use the same data and run in batch mode with

    KMeans.train(sc.parallelize(parsedData), numClusters, numIterations)

the cluster centers are what you would expect.

Krishna
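For reference, a minimal batch run on the same kind of data might look like the sketch below. The points are a few of the 3-dimensional rows quoted further down the thread, and the cluster count and iteration count are placeholders; `sc` is assumed to be an existing SparkContext.

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Placeholder data: a few of the 3-dimensional points quoted below.
    val parsedData = Seq(
      Vectors.dense(1.0, 1.0, 385.2241452777778),
      Vectors.dense(3.0, 1.0, 384.7529463888889),
      Vectors.dense(4.0, 1.0, 3083.2778025),
      Vectors.dense(2.0, 4.0, 6226.402321388889)
    )
    val numClusters = 2    // placeholder
    val numIterations = 20 // placeholder

    // Batch k-means on an RDD of the same points, for comparison with the streaming centers.
    val batchModel = KMeans.train(sc.parallelize(parsedData), numClusters, numIterations)
    batchModel.clusterCenters.foreach(println)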
On Fri, Feb 19, 2016 at 12:49 PM, krishna ramachandran <[email protected]> wrote:

> ok, I will share a simple example soon. Meantime you will be able to see
> this behavior using the example here:
>
> https://github.com/apache/spark/blob/branch-1.2/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala
>
> Slightly modify it to include
>
>     model.latestModel.clusterCenters.foreach(println)
>
> (after model.trainOn) and add new files to trainingDir periodically.
>
> I have 3 dimensions per data point - they look like these:
>
> [1, 1, 385.2241452777778]
> [3, 1, 384.7529463888889]
> [4, 1, 3083.2778025]
> [2, 4, 6226.402321388889]
> [1, 2, 785.8426655555555]
> [5, 1, 6706.054241388889]
> ...
>
> and monitor. Please let me know if I missed something.
>
> Krishna
>
>
> On Fri, Feb 19, 2016 at 10:59 AM, Bryan Cutler <[email protected]> wrote:
>
>> Can you share more of your code to reproduce this issue? The model
>> should be updated with each batch, but I can't tell what is happening
>> from what you posted so far.
>>
>> On Fri, Feb 19, 2016 at 10:40 AM, krishna ramachandran <[email protected]> wrote:
>>
>>> Hi Bryan,
>>> Agreed. It is a single statement to print the centers once for *every*
>>> streaming batch (4 secs) - remember this is in streaming mode and the
>>> receiver has fresh data every batch. That is, the model is trained
>>> continuously, so I expect the centroids to change with incoming streams
>>> (at least until convergence).
>>>
>>> But I am seeing the same centers always, for the entire duration - I ran
>>> the app for several hours with a custom receiver.
>>>
>>> Yes, I am using latestModel to predict using "labeled" test data. But I
>>> would also like to know where my centers are.
>>>
>>> regards
>>> Krishna
>>>
>>> On Fri, Feb 19, 2016 at 10:18 AM, Bryan Cutler <[email protected]> wrote:
>>>
>>>> Could you elaborate where the issue is? You say calling
>>>> model.latestModel.clusterCenters.foreach(println) doesn't show an
>>>> updated model, but that is just a single statement to print the
>>>> centers once.
>>>>
>>>> Also, is there any reason you don't predict on the test data like this?
>>>>
>>>>     model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
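On that first point, a single clusterCenters print placed in the driver code runs only once, at setup time. A minimal sketch of one way to print the latest centers on every batch - `model` and `trainingStream` are placeholder names for the StreamingKMeans instance and the DStream[Vector] passed to trainOn - is to register the print as an output operation:

    model.trainOn(trainingStream)

    // foreachRDD is an output operation, so this body runs once per batch,
    // printing the centers as they are updated.
    trainingStream.foreachRDD { (_, time) =>
      println(s"Centers at $time:")
      model.latestModel().clusterCenters.foreach(println)
    }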
>>>>
>>>> On Thu, Feb 18, 2016 at 5:59 PM, ramach1776 <[email protected]> wrote:
>>>>
>>>>> I have a streaming application wherein I train the model using a
>>>>> receiver input stream in 4 sec batches:
>>>>>
>>>>>     val stream = ssc.receiverStream(receiver) // receiver gets new data every batch
>>>>>     model.trainOn(stream.map(Vectors.parse))
>>>>>
>>>>> If I use
>>>>>
>>>>>     model.latestModel.clusterCenters.foreach(println)
>>>>>
>>>>> the cluster centers remain unchanged from the very initial values they
>>>>> got during the first iteration (when the streaming app started).
>>>>>
>>>>> When I use the model to predict cluster assignments with labeled input,
>>>>> the assignments change over time as expected:
>>>>>
>>>>>     testData.transform { rdd =>
>>>>>       rdd.map(lp => (lp.label, model.latestModel().predict(lp.features)))
>>>>>     }.print()
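For anyone trying to reproduce this, here is a self-contained sketch of the setup described in the thread. It follows the linked StreamingKMeansExample, using textFileStream in place of the custom receiver from the original post; the directory names, k, dimension, and decay factor are placeholders, and the predictOnValues call is the variant Bryan suggested.

    import org.apache.spark.SparkConf
    import org.apache.spark.mllib.clustering.StreamingKMeans
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("StreamingKMeansSketch")
    val ssc = new StreamingContext(conf, Seconds(4)) // 4 sec batches, as in the thread

    // Training vectors arrive as new files dropped into trainingDir;
    // the original post used a custom receiver here instead.
    val trainingData = ssc.textFileStream("trainingDir").map(Vectors.parse)
    // Labeled test points in "(label, [f1, f2, f3])" format.
    val testData = ssc.textFileStream("testDir").map(LabeledPoint.parse)

    val model = new StreamingKMeans()
      .setK(3)                  // placeholder cluster count
      .setDecayFactor(1.0)
      .setRandomCenters(3, 0.0) // 3 dimensions per point, zero initial weight

    model.trainOn(trainingData)
    model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

    ssc.start()
    ssc.awaitTermination()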
