Hello - I am trying to implement an outlier detection application on streaming data. I am new to Spark, so I would like some advice on a few points of confusion.
I am thinking of using StreamingKMeans - is this a good choice? I have a single stream of data and I need an online algorithm. Here are the questions that immediately come to mind:

1. I cannot do separate training, cross-validation, etc. Is it a good idea to do training and prediction online?

2. The data will be read from a Kafka stream in microbatches of (say) 3 seconds. I get a DStream on which I train and obtain the clusters. How can I decide on the number of clusters? With StreamingKMeans, is there any way to iterate over microbatches with different values of k to find the optimal one?

3. Even if I fix k, after training on every microbatch I get a DStream. How can I compute things like a clustering score on the DStream? StreamingKMeansModel has a computeCost function, but it takes an RDD. Maybe DStream.foreachRDD { // .. } can work, but I am not able to figure out how. How can we compute the cost of clustering for an unbounded stream of data? Is there an idiomatic way to handle this?

Or is StreamingKMeans not the right choice for anomaly detection in an online setting? Any suggestion will be welcome.

regards.

--
Debasish Ghosh
http://manning.com/ghosh2
http://manning.com/ghosh

Twttr: @debasishg
Blog: http://debasishg.blogspot.com
Code: http://github.com/debasishg
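P.S. To make question 3 concrete, here is a skeleton of what I have so far. The Kafka source and the feature parsing are placeholders (`???` stands in for the real DStream), and k and the vector dimension are hard-coded just for illustration - the foreachRDD part at the end is where I am stuck:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors

object StreamingOutliers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-outliers")
    val ssc  = new StreamingContext(conf, Seconds(3)) // 3-second microbatches

    // placeholder: in the real app this would come from the Kafka integration;
    // here I just assume a DStream[String] of comma-separated features
    val lines: DStream[String] = ???

    val dim = 3 // dimensionality of my feature vectors (hard-coded for illustration)
    val vectors = lines.map(line => Vectors.dense(line.split(",").map(_.toDouble)))

    val model = new StreamingKMeans()
      .setK(5)               // fixed k for now -- question 2 is how to choose this
      .setDecayFactor(1.0)
      .setRandomCenters(dim, 0.0)

    model.trainOn(vectors)   // online training: the model is updated on each microbatch

    // question 3: is this the idiomatic way to score each microbatch?
    vectors.foreachRDD { rdd =>
      if (!rdd.isEmpty) {
        val wssse = model.latestModel().computeCost(rdd)
        println(s"WSSSE for this batch: $wssse")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

If calling latestModel().computeCost inside foreachRDD like this is reasonable, my remaining question is only how to aggregate a per-batch cost into something meaningful for an unbounded stream.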