Re: using StreamingKMeans

2016-11-21 Thread Julian Keppel
I am currently researching anomaly detection with machine learning methods, and I also use k-means clustering in an offline learning setting. In future work we want to compare the two paradigms of offline and online learning. I would like to share some thoughts on this discussion.

Re: using StreamingKMeans

2016-11-19 Thread Debasish Ghosh
I share both the concerns that you have expressed. As I mentioned in my earlier mail, offline (batch) training is an option if I get a dataset without outliers. In that case I can train and have a model. I find the model parameters, which will be the mean distance to the centroid. Note in trainin
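To make the "mean distance to the centroid" idea concrete, here is a minimal plain-Python sketch. It is not Spark's API and not necessarily what the poster implemented: the function names, the sample centroids, and the choice to flag a point as anomalous when its distance exceeds the mean training distance are all assumptions for illustration.

```python
import math

def nearest_distance(point, centroids):
    # Euclidean distance from a point to its closest centroid.
    return min(math.dist(point, c) for c in centroids)

def mean_distance_threshold(training_points, centroids):
    # Mean distance of the (outlier-free) training points to their nearest centroid.
    distances = [nearest_distance(p, centroids) for p in training_points]
    return sum(distances) / len(distances)

def is_anomaly(point, centroids, threshold):
    # Assumption: a point farther away than the mean training distance is suspicious.
    return nearest_distance(point, centroids) > threshold

# Hypothetical centroids, as if produced by an offline k-means run.
centroids = [(0.0, 0.0), (10.0, 10.0)]
training_points = [(0.0, 1.0), (1.0, 0.0), (10.0, 11.0), (9.0, 10.0)]

threshold = mean_distance_threshold(training_points, centroids)
far_point_flagged = is_anomaly((5.0, 5.0), centroids, threshold)
near_point_flagged = is_anomaly((0.0, 0.5), centroids, threshold)
```

In practice a looser cut-off (for example a multiple of the mean, or a high percentile of the training distances) may be more appropriate; the plain mean is used here only because it is the parameter named in the message.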

Re: using StreamingKMeans

2016-11-19 Thread ayan guha
Here are 2 concerns I would have with the design (this discussion is mostly to validate my own understanding): 1. If you have outliers "before" running k-means, won't your centroids get skewed? In other words, can't outliers by themselves bias the cluster evaluation? 2. Typically microbatch
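The first concern can be illustrated with a small plain-Python sketch. The points are made up, and the centroid is computed as a simple mean, which is how k-means updates a cluster centre; this only shows the size of the effect, not anything specific to Spark's implementation.

```python
def centroid(points):
    # Component-wise mean of a list of equally sized tuples.
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

# Hypothetical cluster of four well-behaved points plus one outlier.
cluster = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]
outlier = (52.0, 52.0)

clean_centroid = centroid(cluster)               # (2.0, 2.0)
skewed_centroid = centroid(cluster + [outlier])  # (12.0, 12.0)
```

A single distant point moves the centre from (2.0, 2.0) to (12.0, 12.0), so every distance later measured against that centroid is distorted as well.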

Re: using StreamingKMeans

2016-11-19 Thread Debasish Ghosh
Looking for alternative suggestions for the case where we have one continuous stream of data. Offline training and online prediction can be one option if we have a separate set of data to train on. But if it's one single stream, you don't have separate sets for training or cross-validation. So whatever

Re: using StreamingKMeans

2016-11-19 Thread ayan guha
Curious why do you want to train your models every 3 secs?

On 20 Nov 2016 06:25, "Debasish Ghosh" wrote:
> Thanks a lot for the response.
>
> Regarding the sampling part - yeah that's what I need to do if there's no
> way of titrating the number of clusters online.
>
> I am using something like
>

Re: using StreamingKMeans

2016-11-19 Thread Debasish Ghosh
Thanks a lot for the response.

Regarding the sampling part - yeah that's what I need to do if there's no way of titrating the number of clusters online.

I am using something like

    dstream.foreachRDD { rdd =>
      if (rdd.count() > 0) {
        //.. logic
      }
    }

Feels a little odd but if that's the idiom the

Re: using StreamingKMeans

2016-11-19 Thread Cody Koeninger
So I haven't played around with streaming k means at all, but given that no one responded to your message a couple of days ago, I'll say what I can. 1. Can you not sample out some % of the stream for training? 2. Can you run multiple streams at the same time with different values for k and compare
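The first suggestion, sampling some percentage of the stream for training, could look roughly like the plain-Python sketch below. It works on an ordinary iterable rather than a Spark DStream, and `split_stream`, `train_fraction`, and the "train"/"score" labels are hypothetical names chosen for illustration.

```python
import random

def split_stream(records, train_fraction, seed=None):
    # Route each record to training with probability train_fraction,
    # otherwise to scoring. Records are processed one at a time, so this
    # works on an unbounded iterator as well as a list.
    rng = random.Random(seed)
    for record in records:
        if rng.random() < train_fraction:
            yield ("train", record)
        else:
            yield ("score", record)

# Usage on a hypothetical finite stand-in for the stream.
routed = list(split_stream(range(1000), train_fraction=0.1, seed=42))
train = [r for label, r in routed if label == "train"]
score = [r for label, r in routed if label == "score"]
```

Because the split is random per record, the training share is only approximately 10%; the two subsets are disjoint and together cover the whole stream.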