Re: [scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-16 Thread Joel Nothman
The contingency matrix (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cluster.contingency_matrix.html) counts how many times each pair of (true cluster, predicted cluster) occurs. It is a sufficient statistic for every "supervised" (i.e. ground truth-based) clustering
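A minimal sketch of what this implies for incremental evaluation, assuming ground-truth labels are available for each batch: the (true, predicted) pair counts can be accumulated batch by batch, and a pair-counting metric such as the adjusted Rand index can then be computed from the counts alone. The helper names `update_contingency` and `adjusted_rand_from_counts` are hypothetical, not scikit-learn API, and the toy batches are made up for illustration.

```python
from collections import Counter
from math import comb


def update_contingency(counts, true_labels, pred_labels):
    """Add one batch of (true, predicted) label pairs to the running counts."""
    counts.update(zip(true_labels, pred_labels))
    return counts


def adjusted_rand_from_counts(counts):
    """Adjusted Rand index computed only from accumulated contingency counts."""
    row_totals, col_totals = Counter(), Counter()
    for (true_label, pred_label), n in counts.items():
        row_totals[true_label] += n
        col_totals[pred_label] += n

    total_pairs = comb(sum(counts.values()), 2)
    if total_pairs == 0:
        return 1.0  # assumption: treat fewer than two samples as trivial agreement

    sum_cells = sum(comb(n, 2) for n in counts.values())
    sum_rows = sum(comb(n, 2) for n in row_totals.values())
    sum_cols = sum(comb(n, 2) for n in col_totals.values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # assumption: degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


# Toy batches of (true labels, predicted labels); only the counts are kept.
batches = [
    ([0, 0, 1], [1, 1, 0]),
    ([1, 2, 2], [0, 2, 2]),
]
counts = Counter()
for y_true, y_pred in batches:
    update_contingency(counts, y_true, y_pred)

ari = adjusted_rand_from_counts(counts)
```

Memory use depends on the number of distinct (true, predicted) pairs rather than on the number of samples, which is why the accumulation scales to large datasets.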

Re: [scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-14 Thread Joel Nothman
Evaluating on large datasets is easy if the sufficient statistics are just the contingency matrix.

On Tue., 14 May 2019, 11:19 pm Tom Augspurger wrote:
> If anyone is interested in implementing these, dask-ml would welcome
> additional metrics that work well with Dask arrays:

Re: [scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-14 Thread Tom Augspurger
If anyone is interested in implementing these, dask-ml would welcome additional metrics that work well with Dask arrays: https://github.com/dask/dask-ml/issues/213.

On Tue, May 14, 2019 at 2:09 AM Uri Goren wrote:
> Sounds like you need to use spark,
> this project looks promising:

Re: [scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-14 Thread Uri Goren
Sounds like you need to use Spark; this project looks promising: https://github.com/xiaocai00/SparkPinkMST

On Tue, May 14, 2019 at 5:12 AM lampahome wrote:
> On Fri, May 3, 2019 at 7:29 PM, Uri Goren wrote:
>> I usually use clustering to save costs on labelling.
>> I like to apply hierarchical

Re: [scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-13 Thread lampahome
On Fri, May 3, 2019 at 7:29 PM, Uri Goren wrote:
> I usually use clustering to save costs on labelling.
> I like to apply hierarchical clustering, and then label a small sample and
> fine-tune the clustering algorithm.
>
> That way, you can evaluate the effectiveness in terms of cluster purity
> (how many

Re: [scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-03 Thread Uri Goren
I usually use clustering to save costs on labelling. I like to apply hierarchical clustering, and then label a small sample and fine-tune the clustering algorithm. That way, you can evaluate the effectiveness in terms of cluster purity (how many clusters contain mixed labels). See example with
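A rough sketch of the purity check described above, assuming a small hand-labelled sample and the cluster id assigned to each sampled point. The function name `purity_report` and the toy data are hypothetical; purity is taken here as the fraction of labelled points that carry their cluster's majority label, and clusters holding more than one label are reported as mixed.

```python
from collections import Counter, defaultdict


def purity_report(cluster_ids, sample_labels):
    """Return (purity, sorted ids of clusters that contain more than one label)."""
    per_cluster = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, sample_labels):
        per_cluster[cluster_id][label] += 1

    n_labelled = sum(sum(c.values()) for c in per_cluster.values())
    if n_labelled == 0:
        raise ValueError("need at least one labelled sample")

    # Points that agree with the majority label of their own cluster.
    majority_total = sum(max(c.values()) for c in per_cluster.values())
    mixed = sorted(cid for cid, c in per_cluster.items() if len(c) > 1)
    return majority_total / n_labelled, mixed


# Toy labelled sample: cluster 0 mixes labels "a" and "b".
cluster_ids = [0, 0, 0, 1, 1, 2]
sample_labels = ["a", "a", "b", "b", "b", "c"]
purity, mixed_clusters = purity_report(cluster_ids, sample_labels)
```

Because only per-cluster label counts are kept, the same counters could be updated as more labelled points arrive.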

Re: [scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-03 Thread Guillaume Lemaître
Oh sorry, I see now that you mentioned evaluating.

On Fri, 3 May 2019 at 10:12, Guillaume Lemaître wrote:
> You can always predict incrementally by predicting on batches of samples.
>
> On Fri, 3 May 2019 at 10:05, lampahome wrote:
>> I see some algo can cluster incrementally if dataset

Re: [scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-03 Thread Guillaume Lemaître
You can always predict incrementally by predicting on batches of samples.

On Fri, 3 May 2019 at 10:05, lampahome wrote:
> I see some algo can cluster incrementally if dataset is too huge ex:
> minibatchkmeans and Birch.
>
> But is there any way to evaluate incrementally?
>
> I found
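The batch-wise prediction suggested in this reply might look like the sketch below, which assumes scikit-learn's MiniBatchKMeans fitted with partial_fit. The synthetic batches, the helper `make_batch`, and the choice of three clusters are illustrative assumptions standing in for data streamed from disk.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Hypothetical blob centers used only to generate toy data.
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])


def make_batch(n_samples=300):
    """Draw one synthetic batch around the three blob centers."""
    idx = rng.integers(0, len(centers), size=n_samples)
    return centers[idx] + rng.normal(scale=0.5, size=(n_samples, 2))


batches = [make_batch() for _ in range(5)]

# Fit incrementally: each call updates the centroids from one batch only.
model = MiniBatchKMeans(n_clusters=3, random_state=0)
for X_batch in batches:
    model.partial_fit(X_batch)

# Predict incrementally: only one batch needs to be in memory at a time.
labels = np.concatenate([model.predict(X_batch) for X_batch in batches])
```

This covers assignment only; scoring the resulting labels incrementally is the separate question raised in the quoted message.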

[scikit-learn] Can I evaluate clustering efficiency incrementally?

2019-05-03 Thread lampahome
I see that some algorithms, e.g. MiniBatchKMeans and Birch, can cluster incrementally when the dataset is too large. But is there any way to evaluate incrementally? Because I don't know the ground-truth labels, I found the silhouette coefficient and the Calinski-Harabasz index, but they can't be evaluated incrementally.