StreamingKMeans does not update cluster centroid locations
I have a streaming application in which I train the model on a receiver input stream in 4-second batches:

```scala
val stream = ssc.receiverStream(receiver) // receiver gets new data every batch
model.trainOn(stream.map(Vectors.parse))
```

If I print the centers with model.latestModel.clusterCenters.foreach(println), the cluster centers remain unchanged from the initial values they got during the first iteration (when the streaming app started). Yet when I use the model to predict cluster assignments for labeled input, the assignments change over time as expected:

```scala
testData.transform { rdd =>
  rdd.map(lp => (lp.label, model.latestModel().predict(lp.features)))
}.print()
```

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/StreamingKMeans-does-not-update-cluster-centroid-locations-tp26275.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
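For context, StreamingKMeans updates each center per batch with a decay-weighted average of the old center and the new batch's points (the "forgetfulness" rule described in the MLlib clustering docs). A minimal plain-Scala sketch of that update, outside Spark, may help when checking whether centers should be moving; the object and parameter names here are illustrative, not Spark internals:

```scala
// Sketch of the decay-weighted centroid update that StreamingKMeans applies
// each batch. c: current center, n: effective point count seen so far,
// batch: points assigned to this center in the new batch,
// alpha: decay factor in [0, 1] (1.0 = remember all history).
object CentroidUpdateSketch {
  def update(c: Vector[Double], n: Double,
             batch: Seq[Vector[Double]], alpha: Double): (Vector[Double], Double) = {
    val m = batch.size.toDouble
    if (m == 0) (c, n * alpha) // no new points: center stays, old weight decays
    else {
      // element-wise sum of the batch's points
      val batchSum = batch.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
      val denom = n * alpha + m
      // weighted average of old center (decayed) and batch points
      val newC = c.zip(batchSum).map { case (ci, si) => (ci * n * alpha + si) / denom }
      (newC, denom)
    }
  }
}
```

With alpha = 1.0 the center drifts slowly as history accumulates; with alpha near 0 it tracks each new batch almost entirely, so a "frozen" latestModel usually means no (or identical) data is reaching trainOn rather than a property of the update rule.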
Re: adding a split and union to a streaming application cause big performance hit
bq. streamingContext.remember("duration") did not help

Can you give a bit more detail on the above? Did you mean the job encountered an OOME later on? Which Spark release are you using?

I tried these two global settings (and restarted the app) after enabling cache for stream1:

```scala
conf.set("spark.streaming.unpersist", "true")
streamingContext.remember(Seconds(batchDuration * 4))
```

The batch duration is 4 seconds. Using spark-1.4.1. The application runs for about 4-5 hours, then we see an out-of-memory error.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/adding-a-split-and-union-to-a-streaming-application-cause-big-performance-hit-tp26259p26269.html
adding a split and union to a streaming application cause big performance hit
We have a streaming application containing approximately 12 jobs every batch, running in streaming mode (4-second batches). Each job has several transformations and one action (output to Cassandra), which triggers execution of the job (DAG). For example, the first job:

job 1 ---> receive Stream A --> map --> filter --> (union with another stream B) --> map --> groupByKey --> transform --> reduceByKey --> map

Likewise we go through a few more transforms and save to the database (job 2, job 3, ...).

Recently we added a new transformation further downstream, in which we union the output DStream from job 1 (shown above) with the output of a new transformation (job 5). It appears the whole execution up to that point is repeated, which is redundant (I can see this in the execution graph and also in performance -> processing time). That is, with this additional transformation (a union with a stream already processed upstream), each batch runs as much as 2.5 times slower than runs without the union.

If I cache the DStream from job 1, performance improves substantially, but we hit out-of-memory errors within a few hours. What is the recommended way to cache/unpersist in such a scenario? There is no DStream-level "unpersist"; setting "spark.streaming.unpersist" to true and streamingContext.remember("duration") did not help.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/adding-a-split-and-union-to-a-streaming-application-cause-big-performance-hit-tp26259.html
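For what it's worth, DStream does expose persist()/cache() with a storage level, so one option is to cache the shared stream serialized (so it spills to disk instead of OOMing) and rely on Spark's automatic cleanup of old batch RDDs. A hedged configuration sketch, not a tested fix; the stream/variable names (job1Output, ssc, conf) are illustrative:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Seconds

// Auto-unpersist generated RDDs once they fall out of scope of the DAG
conf.set("spark.streaming.unpersist", "true")

// Serialized + disk spill instead of the default deserialized in-memory cache
val shared = job1Output.persist(StorageLevel.MEMORY_AND_DISK_SER)

// Keep only a few batches' worth of RDDs around (4 s batches here)
ssc.remember(Seconds(16))
```

The serialized storage level trades CPU for a much smaller and bounded memory footprint, which is often the difference between a slow leak and a stable long-running streaming job.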