The contingency matrix
(https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cluster.contingency_matrix.html)
counts how many times each pair of (true cluster, predicted cluster)
occurs. It is a sufficient statistic for every "supervised" (i.e. ground
truth-based) clustering metric.
Evaluating on large datasets is easy when the only sufficient statistic is
the contingency matrix, because counts from separate batches simply add up.
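To make the batch-wise idea concrete, here is a minimal sketch in plain Python. It assumes the true and predicted labels arrive in batches, keeps the contingency counts in a `Counter`, and derives the adjusted Rand index from those counts alone. The helper names `update_contingency` and `adjusted_rand_from_contingency` are made up for this illustration and are not part of scikit-learn.

```python
from collections import Counter
from math import comb


def update_contingency(contingency, labels_true, labels_pred):
    """Add one batch of (true, predicted) label pairs to the running counts."""
    contingency.update(zip(labels_true, labels_pred))
    return contingency


def adjusted_rand_from_contingency(contingency):
    """Adjusted Rand index computed only from the contingency counts."""
    n = sum(contingency.values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as trivially perfect

    # Row and column marginals of the contingency matrix.
    row_totals, col_totals = Counter(), Counter()
    for (true_label, pred_label), count in contingency.items():
        row_totals[true_label] += count
        col_totals[pred_label] += count

    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in row_totals.values())
    sum_cols = sum(comb(c, 2) for c in col_totals.values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)


# Two batches of a (hypothetical) larger dataset.
batches = [
    ([0, 0], [0, 0]),
    ([1, 2], [1, 1]),
]
contingency = Counter()
for y_true, y_pred in batches:
    update_contingency(contingency, y_true, y_pred)

print(adjusted_rand_from_contingency(contingency))
```

Only the counts are kept between batches, so memory use depends on the number of distinct (true, predicted) pairs rather than on the number of samples.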
On Tue., 14 May 2019, 11:19 pm Tom Augspurger wrote:
If anyone is interested in implementing these, dask-ml would welcome
additional metrics that work well with Dask arrays:
https://github.com/dask/dask-ml/issues/213.
On Tue, May 14, 2019 at 2:09 AM Uri Goren wrote:
Sounds like you need to use Spark.
This project looks promising:
https://github.com/xiaocai00/SparkPinkMST
On Tue, May 14, 2019 at 5:12 AM lampahome wrote:
On Fri, 3 May 2019 at 7:29 PM, Uri Goren wrote:
I usually use clustering to save costs on labelling.
I like to apply hierarchical clustering, and then label a small sample and
fine-tune the clustering algorithm.
That way, you can evaluate the effectiveness in terms of cluster purity
(how many clusters contain mixed labels)
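As a rough illustration of that purity check, the sketch below takes the cluster assignments and the hand-made labels of the small labelled sample and reports which clusters contain mixed labels. It also reports the more common purity score (the fraction of samples that carry their cluster's majority label), which is an addition here rather than something stated above. The function name `purity_report` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def purity_report(cluster_ids, sample_labels):
    """Summarise label purity for a small hand-labelled sample."""
    # For each cluster, count how often each label occurs in it.
    labels_per_cluster = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, sample_labels):
        labels_per_cluster[cluster][label] += 1

    # Clusters holding more than one distinct label are "mixed".
    mixed = sorted(c for c, counts in labels_per_cluster.items() if len(counts) > 1)

    # Standard purity: samples matching their cluster's majority label.
    n_samples = sum(sum(counts.values()) for counts in labels_per_cluster.values())
    n_majority = sum(max(counts.values()) for counts in labels_per_cluster.values())

    return {
        "n_clusters": len(labels_per_cluster),
        "mixed_clusters": mixed,
        "purity": n_majority / n_samples,
    }


# Toy labelled sample: cluster 0 mixes labels "a" and "b".
clusters = [0, 0, 0, 1, 1, 2]
labels = ["a", "a", "b", "b", "b", "c"]
print(purity_report(clusters, labels))
```

Because the per-cluster label counts are just another contingency table, the same report can be built up over several labelled batches.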
See example with
Oh sorry, I see now that you were asking about evaluating.
On Fri, 3 May 2019 at 10:12, Guillaume Lemaître wrote:
You can always predict incrementally by predicting on batches of samples.
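A minimal sketch of predicting batch by batch, written against any callable that maps a batch of samples to labels. The helper `iter_batch_predictions` and the toy `nearest_center` function are hypothetical; with scikit-learn one would presumably pass the `predict` method of an already fitted estimator such as MiniBatchKMeans instead.

```python
def iter_batch_predictions(predict, samples, batch_size=1000):
    """Yield the predicted labels for consecutive slices of `samples`.

    `predict` is any callable taking a batch and returning one label per
    sample, for example the `predict` method of a fitted clustering model.
    """
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        yield list(predict(batch))


# Toy stand-in for a fitted model: assign each 1-D point to the nearer
# of two fixed centers (an assumption made only for this example).
CENTERS = [0.0, 10.0]


def nearest_center(batch):
    return [min(range(len(CENTERS)), key=lambda k: abs(x - CENTERS[k])) for x in batch]


data = [0.5, 9.0, 1.5, 11.0, 4.0]
for batch_labels in iter_batch_predictions(nearest_center, data, batch_size=2):
    print(batch_labels)
```

Yielding one batch at a time means the labels can be fed straight into a running statistic without holding all predictions in memory.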
On Fri, 3 May 2019 at 10:05, lampahome wrote:
I see that some algorithms can cluster incrementally if the dataset is too
large, e.g. MiniBatchKMeans and Birch.
But is there any way to evaluate incrementally?
I found the silhouette coefficient and the Calinski-Harabasz index, which I
can use because I don't know the ground truth labels.
But they can't be evaluated incrementally.
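For the Calinski-Harabasz index specifically, the value depends only on per-cluster counts, per-cluster coordinate sums, and per-cluster sums of squared norms, so in principle it can be accumulated over batches. The sketch below is a plain-Python illustration of that idea, not scikit-learn's implementation, and the class name `StreamingCalinskiHarabasz` is hypothetical. The silhouette coefficient relies on pairwise distances, so the same trick does not carry over to it directly.

```python
from collections import defaultdict


class StreamingCalinskiHarabasz:
    """Accumulate per-cluster sums so the index can be computed at the end."""

    def __init__(self):
        self.counts = defaultdict(int)      # samples per cluster
        self.sums = {}                      # per-cluster coordinate sums
        self.sq_norms = defaultdict(float)  # per-cluster sum of squared norms

    def update(self, batch, labels):
        """Fold one batch of points (sequences of floats) into the totals."""
        for point, label in zip(batch, labels):
            self.counts[label] += 1
            self.sq_norms[label] += sum(v * v for v in point)
            if label not in self.sums:
                self.sums[label] = [0.0] * len(point)
            for d, v in enumerate(point):
                self.sums[label][d] += v

    def score(self):
        n = sum(self.counts.values())
        k = len(self.counts)
        if k < 2 or n <= k:
            raise ValueError("need at least 2 clusters and more samples than clusters")
        dim = len(next(iter(self.sums.values())))
        overall_mean = [
            sum(s[d] for s in self.sums.values()) / n for d in range(dim)
        ]
        within = 0.0
        between = 0.0
        for label, n_k in self.counts.items():
            s = self.sums[label]
            # sum ||x - c_k||^2 = sum ||x||^2 - ||sum x||^2 / n_k
            # (this subtraction can lose precision for data far from the origin)
            within += self.sq_norms[label] - sum(v * v for v in s) / n_k
            between += n_k * sum(
                (s[d] / n_k - overall_mean[d]) ** 2 for d in range(dim)
            )
        if within == 0.0:
            return 1.0  # degenerate case: every point sits on its centroid
        return (between / (k - 1)) / (within / (n - k))


# Two batches of 1-D points with their predicted cluster labels.
ch = StreamingCalinskiHarabasz()
ch.update([[0.0], [10.0]], [0, 1])
ch.update([[2.0], [12.0]], [0, 1])
print(ch.score())
```

The accumulator stores a fixed amount of state per cluster, so its memory use does not grow with the number of samples seen.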