[ https://issues.apache.org/jira/browse/SPARK-23535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-23535.
-------------------------------
    Resolution: Won't Fix

> MinMaxScaler return 0.5 for an all zero column
> ----------------------------------------------
>
>                 Key: SPARK-23535
>                 URL: https://issues.apache.org/jira/browse/SPARK-23535
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.0.0
>            Reporter: Yigal Weinberger
>            Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When MinMaxScaler is applied to a column that contains only 0, the output is 0.5
> for every row of that column.
> This is inconsistent with the sklearn implementation.
>
> Steps to reproduce:
>
> {code:java}
> from pyspark.ml.feature import MinMaxScaler
> from pyspark.ml.linalg import Vectors
>
> # The third feature is 0.0 in every row, matching the output shown below.
> dataFrame = spark.createDataFrame([
>     (0, Vectors.dense([1.0, 0.1, 0.0]),),
>     (1, Vectors.dense([2.0, 1.1, 0.0]),),
>     (2, Vectors.dense([3.0, 10.1, 0.0]),)
> ], ["id", "features"])
> scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
> # Compute summary statistics and generate MinMaxScalerModel
> scalerModel = scaler.fit(dataFrame)
> # Rescale each feature to the range [min, max].
> scaledData = scalerModel.transform(dataFrame)
> print("Features scaled to range: [%f, %f]" % (scaler.getMin(), scaler.getMax()))
> scaledData.select("features", "scaledFeatures").show()
> {code}
> Features scaled to range: [0.000000, 1.000000]
> +--------------+---------------+
> |      features| scaledFeatures|
> +--------------+---------------+
> | [1.0,0.1,0.0]|[0.0,0.0,*0.5*]|
> | [2.0,1.1,0.0]|[0.5,0.1,*0.5*]|
> |[3.0,10.1,0.0]|[1.0,1.0,*0.5*]|
> +--------------+---------------+
>
> VS.
> {code:java}
> import numpy as np
> from sklearn.preprocessing import MinMaxScaler
>
> mms = MinMaxScaler(copy=False)
> test = np.array([[1.0, 0.1, 0], [2.0, 1.1, 0], [3.0, 10.1, 0]])
> print(mms.fit_transform(test))
> {code}
>
> Output:
> [[ 0.   0.   *0.* ]
>  [ 0.5  0.1  *0.* ]
>  [ 1.   1.   *0.* ]]
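
The two results in the report differ only in how a column with zero range (max equal to min) is handled. The sketch below is a plain-Python illustration of both conventions, not either library's actual code. The function names `spark_style_min_max` and `sklearn_style_min_max` are hypothetical. The first follows the rule stated in the Spark MinMaxScaler documentation (when E_max == E_min, the rescaled value is 0.5 * (max + min)). The second assumes sklearn treats a zero data range as 1, which is consistent with the output quoted in the report.

```python
def spark_style_min_max(column, lo=0.0, hi=1.0):
    """Rescale one column to [lo, hi]; a constant column maps to the midpoint.

    Follows the rule documented for Spark's MinMaxScaler:
    if E_max == E_min, the rescaled value is 0.5 * (max + min).
    """
    e_min, e_max = min(column), max(column)
    if e_max == e_min:
        # Constant column: every value becomes the midpoint of the target range.
        return [0.5 * (hi + lo)] * len(column)
    return [(x - e_min) / (e_max - e_min) * (hi - lo) + lo for x in column]


def sklearn_style_min_max(column, lo=0.0, hi=1.0):
    """Rescale one column to [lo, hi]; a constant column maps to lo.

    Assumption: a zero data range is replaced by 1 before dividing,
    which reproduces the sklearn output quoted in the report.
    """
    e_min, e_max = min(column), max(column)
    data_range = e_max - e_min
    if data_range == 0.0:
        # Avoid division by zero; (x - e_min) is 0, so the result is lo.
        data_range = 1.0
    scale = (hi - lo) / data_range
    return [(x - e_min) * scale + lo for x in column]


if __name__ == "__main__":
    zeros = [0.0, 0.0, 0.0]
    print(spark_style_min_max(zeros))    # midpoint convention
    print(sklearn_style_min_max(zeros))  # lower-bound convention
```

For non-constant columns the two functions agree; they diverge only on constant columns, where the midpoint convention yields 0.5 and the lower-bound convention yields 0.0 for the default range [0, 1].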