Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod

2020-03-31 Thread 樊 书华
Thank you very much for your information.

From: scikit-learn On Behalf Of Andreas Mueller
Sent: Tuesday, March 31, 2020 3:04 AM
To: scikit-learn@python.org
Subject: Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod

Sorry, I thought the paper also ran experiments on what they call "sta", but 
I guess they are not included.
The conclusion is the same, though: different algorithms show different 
performance on different datasets.

The Yinyang k-means paper has some Elkan vs. Lloyd figures:
http://proceedings.mlr.press/v37/ding15.pdf

In Table 2, in the Elkan row, a speedup <1 means Elkan is slower than Lloyd.
Elkan is also more memory intensive, so you can see some missing values in 
that row where the computation couldn't be performed with Elkan but could be 
with Lloyd.
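
The memory gap follows from the bookkeeping in Elkan's algorithm, which keeps 
one lower bound per (sample, center) pair in addition to what Lloyd needs. 
Below is a rough back-of-the-envelope sketch, assuming 8-byte floats and 
counting only the bound arrays. The function name and the example sizes are 
illustrative assumptions, not taken from the paper or from scikit-learn.

```python
def elkan_extra_bytes(n_samples, n_clusters, itemsize=8):
    """Rough extra memory for Elkan's bounds, in bytes (illustrative estimate)."""
    lower_bounds = n_samples * n_clusters   # one lower bound per (sample, center) pair
    upper_bounds = n_samples                # one upper bound per sample
    center_dists = n_clusters * n_clusters  # pairwise center-to-center distances
    return (lower_bounds + upper_bounds + center_dists) * itemsize


# Hypothetical sizes, chosen only to show how the cost scales with the number of clusters.
for k in (10, 100, 1000):
    gib = elkan_extra_bytes(1_000_000, k) / 2**30
    print(f"n_samples=1,000,000  n_clusters={k:>4}: ~{gib:.2f} GiB extra")
```

The dominant term grows with n_samples * n_clusters, so the overhead matters 
most in exactly the many-centroid regime where Elkan tends to be fastest.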


On 3/30/20 3:33 AM, 樊 书华 wrote:
Hi,

Thanks for suggesting the paper. However, the paper covers many more 
algorithms and finds that different algorithms perform differently on 
datasets of various dimensions, and the Lloyd algorithm is not included. What 
I want to know is whether we can remove the Lloyd algorithm from KMeans in 
scikit-learn, since Elkan is an optimized one with better performance.

Best regards,
George

From: scikit-learn On Behalf Of Andreas Mueller
Sent: Saturday, March 28, 2020 12:37 AM
To: scikit-learn@python.org
Subject: Re: [scikit-learn] A basic question about kmeans algorithms elkan and llyod

There's an interesting analysis in this paper:
Fast K-Means with Accurate Bounds

http://proceedings.mlr.press/v48/newling16.pdf


On 3/26/20 3:40 AM, Alexandre Gramfort wrote:
hi,

I suspect Elkan really wins when you have many centroids,
so the conclusion is not systematic.

my 2c
Alex


On Thu, Mar 26, 2020 at 3:18 AM mc_george...@hotmail.com wrote:
Hi admins,

My team is currently working on optimizing parts of scikit-learn. When it 
comes to KMeans, I find there are two algorithms: lloyd, and elkan, which is 
an optimized version of lloyd that uses the triangle inequality. In older 
versions of scikit-learn, elkan only supported dense datasets, not sparse 
ones. In the latest version, elkan supports both types of datasets. So my 
question is why both algorithms are kept in KMeans, since they do almost the 
same thing and elkan is an optimized version of lloyd. Is there any precision 
difference between the two algorithms, and how can I decide which algorithm 
to use?

Best regards,
George Fan
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn






[scikit-learn] Number of informative features vs total number of features

2020-03-31 Thread Benoît Presles

Dear sklearn users,

I did some supervised classification simulations with the 
make_classification function from sklearn increasing the number of 
informative features from 1 out of 40 to 40 out of 40 (100%). I did not 
generate any repeated or redundant features. I fixed the number of 
classes to two and the number of clusters per class to one.


I split the dataset 100 times using the StratifiedShuffleSplit function 
into two subsets: a training set and a test set (80% - 20%). I performed 
a logistic regression and calculated training and testing accuracies and 
averaged the results over the 100 splits leading to a mean training 
accuracy and a mean testing accuracy.


I was expecting the accuracy to increase with the number of informative 
features for both the training and the test sets. On the contrary, I got the 
best training and test scores with a single informative feature. Why do I 
get these results?


Thanks for your help,
Best regards,
Ben

Below is the simulation code I have written:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

RANDOM_SEED = 4
n_inf = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])

mean_training_score_array = np.array([])
mean_testing_score_array = np.array([])
for n_inf_value in n_inf:
    X, y = make_classification(n_samples=2500,
                               n_features=40,
                               n_informative=n_inf_value,
                               n_redundant=0,
                               n_repeated=0,
                               n_classes=2,
                               n_clusters_per_class=1,
                               random_state=RANDOM_SEED,
                               shuffle=False)
    print('Simulated data - number of informative features = ' + str(n_inf_value))

    # 100 stratified random 80%/20% train/test splits
    sss = StratifiedShuffleSplit(n_splits=100, test_size=0.2,
                                 random_state=RANDOM_SEED)

    training_score_array = np.array([])
    testing_score_array = np.array([])
    for train_index_split, test_index_split in sss.split(X, y):
        X_split_train, X_split_test = X[train_index_split], X[test_index_split]
        y_split_train, y_split_test = y[train_index_split], y[test_index_split]

        # Fit the scaler on the training split only, then apply it to the test split
        scaler = StandardScaler()
        X_split_train = scaler.fit_transform(X_split_train)
        X_split_test = scaler.transform(X_split_test)
        # max_iter should be an integer, so cast the 1e9 literal
        lr = LogisticRegression(fit_intercept=True, max_iter=int(1e9), verbose=0,
                                random_state=RANDOM_SEED,
                                solver='lbfgs', tol=1e-6, C=10)

        lr.fit(X_split_train, y_split_train)
        y_pred_train = lr.predict(X_split_train)
        y_pred_test = lr.predict(X_split_test)
        accuracy_train_score = accuracy_score(y_split_train, y_pred_train)
        accuracy_test_score = accuracy_score(y_split_test, y_pred_test)
        training_score_array = np.append(training_score_array,
                                         accuracy_train_score)
        testing_score_array = np.append(testing_score_array,
                                        accuracy_test_score)
    # Average over the 100 splits for this number of informative features
    mean_training_score_array = np.append(mean_training_score_array,
                                          np.average(training_score_array))
    mean_testing_score_array = np.append(mean_testing_score_array,
                                         np.average(testing_score_array))

#
print('mean_training_score_array=' + str(mean_training_score_array))
print('mean_testing_score_array=' + str(mean_testing_score_array))
#
plt.plot(n_inf, mean_training_score_array, 'r', label='mean training score')
plt.plot(n_inf, mean_testing_score_array, 'g', label='mean testing score')
plt.xlabel('number of informative features out of 40')
plt.ylabel('accuracy')
plt.legend()
plt.show()
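
For comparison, the same protocol (scale on the training split, fit logistic 
regression, average train and test accuracy over stratified shuffle splits) 
can be written more compactly with a pipeline and cross_validate. This is 
only a sketch of an equivalent setup, not a diagnosis of the results: the 
helper name mean_scores is hypothetical, max_iter is lowered to 10_000 as an 
assumption that lbfgs converges well before that, and the demo loop uses 10 
splits instead of 100 to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def mean_scores(n_informative, n_splits=100, seed=4):
    """Mean train/test accuracy over repeated stratified splits (hypothetical helper)."""
    X, y = make_classification(n_samples=2500, n_features=40,
                               n_informative=n_informative,
                               n_redundant=0, n_repeated=0,
                               n_classes=2, n_clusters_per_class=1,
                               random_state=seed, shuffle=False)
    # The pipeline refits the scaler on each training split, as in the loop above.
    model = make_pipeline(StandardScaler(),
                          LogisticRegression(C=10, tol=1e-6, max_iter=10_000))
    cv = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2,
                                random_state=seed)
    res = cross_validate(model, X, y, cv=cv, scoring="accuracy",
                         return_train_score=True)
    return res["train_score"].mean(), res["test_score"].mean()


for n_inf in (1, 20, 40):
    train_acc, test_acc = mean_scores(n_inf, n_splits=10)
    print(f"n_informative={n_inf:>2}  train={train_acc:.3f}  test={test_acc:.3f}")
```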
