Curious, why do you want to train your models every 3 seconds?

On 20 Nov 2016 06:25, "Debasish Ghosh" <ghosh.debas...@gmail.com> wrote:
> Thanks a lot for the response.
>
> Regarding the sampling part - yeah, that's what I need to do if there's no
> way of titrating the number of clusters online.
>
> I am using something like
>
> dstream.foreachRDD { rdd =>
>   if (rdd.count() > 0) {
>     // .. logic
>   }
> }
>
> Feels a little odd, but if that's the idiom then I will stick to it.
>
> regards.
>
> On Sat, Nov 19, 2016 at 10:52 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> So I haven't played around with streaming k-means at all, but given
>> that no one responded to your message a couple of days ago, I'll say
>> what I can.
>>
>> 1. Can you not sample out some % of the stream for training?
>>
>> 2. Can you run multiple streams at the same time with different values
>> for k and compare their performance?
>>
>> 3. foreachRDD is fine in general; I can't speak to the specifics.
>>
>> 4. If you haven't done any transformations yet on a direct stream,
>> foreachRDD will give you a KafkaRDD. Checking whether a KafkaRDD is
>> empty is very cheap - it's done on the driver only, because the
>> beginning and ending offsets are known. So you should be able to skip
>> empty batches.
>>
>> On Sat, Nov 19, 2016 at 10:46 AM, debasishg <ghosh.debas...@gmail.com> wrote:
>>
>> > Hello -
>> >
>> > I am trying to implement an outlier detection application on streaming
>> > data. I am a newbie to Spark and would like some advice on the
>> > confusions that I have ..
>> >
>> > I am thinking of using StreamingKMeans - is this a good choice? I have
>> > one stream of data and I need an online algorithm. But here are some
>> > questions that immediately come to my mind ..
>> >
>> > 1. I cannot do separate training, cross validation etc. Is it a good
>> > idea to do training and prediction online?
>> >
>> > 2. The data will be read from the stream coming from Kafka in
>> > microbatches of (say) 3 seconds. I get a DStream on which I train and
>> > get the clusters.
>> > How can I decide on the number of clusters? Using StreamingKMeans, is
>> > there any way I can iterate on microbatches with different values of k
>> > to find the optimal one?
>> >
>> > 3. Even if I fix k, after training on every microbatch I get a DStream.
>> > How can I compute things like the clustering score on the DStream?
>> > StreamingKMeansModel has a computeCost function, but it takes an RDD. I
>> > can use dstream.foreachRDD { // process RDD for the microbatch here } -
>> > is this the idiomatic way?
>> >
>> > 4. If I use dstream.foreachRDD { .. } and functions like new
>> > StandardScaler().fit(rdd) to do feature normalization, then it works
>> > when I have data in the stream. But when the microbatch is empty (say I
>> > don't have data for some time), the fit method throws an exception as
>> > it gets an empty collection. Things start working ok when data starts
>> > coming back to the stream. But is this the way to go?
>> >
>> > Any suggestion will be welcome ..
>> >
>> > regards.
>> >
>> > --
>> > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/using-StreamingKMeans-tp28109.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
> --
> Debasish Ghosh
> http://manning.com/ghosh2
> http://manning.com/ghosh
>
> Twttr: @debasishg
> Blog: http://debasishg.blogspot.com
> Code: http://github.com/debasishg
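For the empty-batch problem in point 4, a minimal sketch of the guard discussed in the thread, assuming the stream has already been parsed into a DStream[Vector] (the name `parsed` and the function wrapper are illustrative). `RDD.isEmpty()` is preferable to `count() > 0`: it returns as soon as one element is found, and on an untransformed KafkaRDD it is answered on the driver from the known offsets, as Cody notes above.

```scala
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// `parsed` is assumed to be a DStream[Vector] built from the Kafka stream.
def processBatches(parsed: DStream[Vector]): Unit = {
  parsed.foreachRDD { rdd =>
    // Skip empty microbatches entirely, so StandardScaler.fit never
    // sees an empty collection and cannot throw.
    if (!rdd.isEmpty()) {
      val scaler = new StandardScaler(withMean = true, withStd = true).fit(rdd)
      val scaled = scaler.transform(rdd)
      // ... train on / score `scaled` for this microbatch here
    }
  }
}
```

Note that fitting the scaler per microbatch means the normalization statistics drift with each batch; fitting once on a sampled training stream (point 1 above) and reusing that scaler is an alternative.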
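Cody's point 2 (several models with different k) can be sketched without separate streams: StreamingKMeans allows several models to be trained on the same DStream, and computeCost (WSSSE) compared per microbatch. This is only a sketch under assumptions; the candidate k values, `dim` (feature vector size), and function names are illustrative, not from the thread.

```scala
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.streaming.dstream.DStream

// Train one StreamingKMeans per candidate k on the same stream and
// log per-batch cost, to pick k empirically.
def compareCandidateKs(training: DStream[Vector], dim: Int): Unit = {
  val candidateKs = Seq(2, 3, 5, 8)
  val models = candidateKs.map { k =>
    k -> new StreamingKMeans()
      .setK(k)
      .setDecayFactor(1.0)          // 1.0 = weight all history; < 1.0 forgets old batches
      .setRandomCenters(dim, 0.0)   // dim must match the feature vector size
  }

  models.foreach { case (_, model) => model.trainOn(training) }

  training.foreachRDD { rdd =>
    if (!rdd.isEmpty()) {
      models.foreach { case (k, model) =>
        // computeCost takes an RDD, so foreachRDD is the place to call it.
        val cost = model.latestModel().computeCost(rdd)
        println(s"k=$k cost=$cost")
      }
    }
  }
}
```

The per-batch cost always decreases as k grows, so rather than taking the minimum, one would typically look for the elbow where the improvement flattens out.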