Hello everyone,

I am performing clustering on a dataset using PySpark. To choose the number of clusters, I ran clustering over a range of k values (2 to 20) and computed the WSSE (within-cluster sum of squares) for each value of k. This is where I found something unusual. My understanding is that as you increase the number of clusters, the WSSE decreases monotonically, but the results I got say otherwise. I'm displaying the WSSE for the first few values of k only.
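To be explicit about the quantity I am computing: by WSSE I mean the usual within-cluster sum of squares,

    WSSE = \sum_{j=1}^{k} \sum_{x \in C_j} || x - \mu_j ||^2 ,

where \mu_j is the centroid of cluster C_j. As far as I understand, the optimal value of this objective can only decrease as k grows (splitting a cluster can only lower the best achievable cost), which is why I expected the curve to be monotone; k-means itself, of course, only converges to a local optimum of this objective.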
Results from Spark:

For k = 002 WSSE is 255318.793358
For k = 003 WSSE is 209788.479560
For k = 004 WSSE is 208498.351074
For k = 005 WSSE is 142573.272672
For k = 006 WSSE is 154419.027612
For k = 007 WSSE is 115092.404604
For k = 008 WSSE is 104753.205635
For k = 009 WSSE is 98000.985547
For k = 010 WSSE is 95134.137071

If you look at the WSSE values for k=5 and k=6, you'll see that the WSSE has increased. I turned to sklearn to see if I would get similar results. The code I used for Spark and sklearn is in the appendix at the end of this post; I have tried to use the same parameter values in the Spark and sklearn KMeans models. The results from sklearn are as I expected them to be - monotonically decreasing:

Results from sklearn:

For k = 002 WSSE is 245090.224247
For k = 003 WSSE is 201329.888159
For k = 004 WSSE is 166889.044195
For k = 005 WSSE is 142576.895154
For k = 006 WSSE is 123882.070776
For k = 007 WSSE is 112496.692455
For k = 008 WSSE is 102806.001664
For k = 009 WSSE is 95279.837212
For k = 010 WSSE is 89303.574467

I am not sure why the WSSE values increase in Spark. I tried different datasets and found similar behaviour there as well. Is there somewhere I am going wrong? Any clues would be great.

APPENDIX

The dataset is located here.

Read the data and declare the variables:

# get data
import pandas as pd

url = "https://raw.githubusercontent.com/vectosaurus/bb_lite/master/3.0%20data/adult_comp_cont.csv"
df_pandas = pd.read_csv(url)
df_spark = sqlContext.createDataFrame(df_pandas)

target_col = 'high_income'
numeric_cols = [i for i in df_pandas.columns if i != target_col]

k_min = 2      # 2 is inclusive
k_max = 21     # 21 is exclusive; will fit up to k = 20
max_iter = 1000
seed = 42

This is the code I am using to get the sklearn results:

from sklearn.cluster import KMeans as KMeans_SKL
from sklearn.preprocessing import StandardScaler as StandardScaler_SKL

# standardise the numeric columns
ss = StandardScaler_SKL(with_std=True, with_mean=True)
ss.fit(df_pandas.loc[:, numeric_cols])
df_pandas_scaled = pd.DataFrame(ss.transform(df_pandas.loc[:, numeric_cols]))

wsse_collect = []
for i in range(k_min, k_max):
    km = KMeans_SKL(random_state=seed, max_iter=max_iter, n_clusters=i)
    _ = km.fit(df_pandas_scaled)
    wsse = km.inertia_
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse))
    wsse_collect.append(wsse)

This is the code I am using to get the Spark results:

from pyspark.ml.feature import StandardScaler, VectorAssembler
from pyspark.ml.clustering import KMeans

standard_scaler_inpt_features = 'ss_features'
kmeans_input_features = 'features'
kmeans_prediction_features = 'prediction'

# assemble the numeric columns into a single vector column and standardise it
assembler = VectorAssembler(inputCols=numeric_cols, outputCol=standard_scaler_inpt_features)
assembled_df = assembler.transform(df_spark)
scaler = StandardScaler(inputCol=standard_scaler_inpt_features, outputCol=kmeans_input_features,
                        withStd=True, withMean=True)
scaler_model = scaler.fit(assembled_df)
scaled_data = scaler_model.transform(assembled_df)

wsse_collect_spark = []
for i in range(k_min, k_max):
    km = KMeans(featuresCol=kmeans_input_features, predictionCol=kmeans_prediction_features,
                k=i, maxIter=max_iter, seed=seed)
    km_fit = km.fit(scaled_data)
    wsse_spark = km_fit.computeCost(scaled_data)
    wsse_collect_spark.append(wsse_spark)
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse_spark))
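For completeness, this is a variation of the Spark loop I was going to try next, to check whether the jump between k=5 and k=6 is just an initialisation/convergence artefact. It is only a sketch: the tol, initSteps and initMode parameters are the ones documented for pyspark.ml.clustering.KMeans, and I have not verified that tightening them changes the results.

from pyspark.ml.clustering import KMeans

# Same loop as above, but with a tighter convergence tolerance and more
# k-means|| initialisation steps, to see whether the non-monotone WSSE
# values disappear when the fits are pushed closer to convergence.
wsse_collect_spark_tight = []
for i in range(k_min, k_max):
    km = KMeans(featuresCol=kmeans_input_features,
                predictionCol=kmeans_prediction_features,
                k=i, maxIter=max_iter, seed=seed,
                tol=1e-8,               # default is 1e-4
                initSteps=5,            # default is 2
                initMode='k-means||')   # default initialisation mode
    km_fit = km.fit(scaled_data)
    wsse_spark = km_fit.computeCost(scaled_data)
    wsse_collect_spark_tight.append(wsse_spark)
    print('For k = {i:03d} WSSE is {wsse:10f}'.format(i=i, wsse=wsse_spark))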