I am currently doing research on anomaly detection with machine learning
methods, and I also use k-means clustering in an offline learning setting.
In further work we want to compare the two paradigms of offline and online
learning. I would like to share some thoughts on this discussion.
I share both the concerns that you have expressed. As I mentioned in my
earlier mail, offline (batch) training is an option if I get a dataset
without outliers. In that case I can train and have a model. I find the
model parameters, which will be the mean distance to the centroid. Note in
trainin
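To make the idea of using the mean distance to the centroid as a model parameter concrete, here is a minimal sketch in plain Scala (no Spark). The object name `DistanceThreshold`, its helper functions, and the idea of flagging a point when its distance exceeds a caller-supplied threshold are assumptions for illustration, not something stated in the thread.

```scala
// Hypothetical sketch: score points by their distance to the nearest centroid.
object DistanceThreshold {
  type Point = Array[Double]

  // Euclidean distance between two points of equal dimension.
  def distance(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Distance from a point to its closest centroid.
  def nearestDistance(p: Point, centroids: Seq[Point]): Double =
    centroids.map(c => distance(p, c)).min

  // Mean distance to the nearest centroid over an outlier-free training set.
  def meanDistance(points: Seq[Point], centroids: Seq[Point]): Double =
    points.map(p => nearestDistance(p, centroids)).sum / points.size

  // Assumption: a point is anomalous if it lies farther than `threshold`
  // from every centroid. How to derive the threshold from the mean
  // distance (for example, a multiple of it) is left to the caller.
  def isAnomaly(p: Point, centroids: Seq[Point], threshold: Double): Boolean =
    nearestDistance(p, centroids) > threshold
}
```

One possible use is to compute `meanDistance` once on the clean training data and then call `isAnomaly` on incoming points with a threshold set to some multiple of that mean; the multiple itself would need tuning.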
Here are 2 concerns I would have with the design (this discussion is mostly
to validate my own understanding):
1. If you have outliers "before" running k-means, won't your centroids get
skewed? In other words, couldn't the outliers by themselves bias the
cluster evaluation?
2. Typically microbatch
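The first concern can be illustrated with a small one-dimensional example. This is only a sketch with made-up numbers; the object name `CentroidSkew` and the sample values are hypothetical.

```scala
// Hypothetical example: a single outlier pulls a cluster's centroid away
// from the bulk of the data.
object CentroidSkew {
  // In one dimension, the centroid of a cluster is the mean of its members.
  def centroid(values: Seq[Double]): Double = values.sum / values.size

  val clean: Seq[Double] = Seq(1.0, 2.0, 3.0)
  val withOutlier: Seq[Double] = clean :+ 100.0

  def main(args: Array[String]): Unit = {
    println(s"centroid without outlier: ${centroid(clean)}")
    println(s"centroid with outlier:    ${centroid(withOutlier)}")
  }
}
```

With the outlier included, the centroid moves from 2.0 to 26.5, so distances measured against it no longer describe the normal points well.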
Looking for alternative suggestions for the case where we have one
continuous stream of data. Offline training and online prediction can be an
option if we have an alternate set of data to train on. But if it's one
single stream, you don't have separate sets for training or cross-validation.
So whatever
Curious: why do you want to train your models every 3 secs?
On 20 Nov 2016 06:25, "Debasish Ghosh" wrote:
Thanks a lot for the response.
Regarding the sampling part - yeah that's what I need to do if there's no
way of titrating the number of clusters online.
I am using something like
dstream.foreachRDD { rdd =>
  if (rdd.count() > 0) { // .. logic
  }
}
Feels a little odd but if that's the idiom the
So I haven't played around with streaming k-means at all, but given
that no one responded to your message a couple of days ago, I'll say
what I can.
1. Can you not sample out some % of the stream for training?
2. Can you run multiple streams at the same time with different values
for k and compare
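As a rough illustration of the first suggestion, sampling out some percentage of the stream for training, here is a minimal sketch in plain Scala that splits each incoming batch into a training sample and the remainder. The object name `StreamSampling`, the `split` function, and the use of a seeded `scala.util.Random` are assumptions for illustration; in Spark itself, something like `RDD.sample` could play a similar role.

```scala
import scala.util.Random

// Hypothetical sketch: route roughly `fraction` of each batch to training.
object StreamSampling {
  // Returns (trainingSample, rest). Each element is sent to the training
  // sample independently with probability `fraction`.
  def split[A](batch: Seq[A], fraction: Double, rng: Random): (Seq[A], Seq[A]) =
    batch.partition(_ => rng.nextDouble() < fraction)

  def main(args: Array[String]): Unit = {
    val rng = new Random(42L) // fixed seed so repeated runs match
    val batch = (1 to 20).map(_.toDouble)
    val (train, rest) = split(batch, 0.2, rng)
    println(s"training sample size: ${train.size}, remaining: ${rest.size}")
  }
}
```

The training sample could then feed a periodically retrained model while the remaining points are only scored, which avoids needing a separate dataset when there is a single stream.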